Read A Parquet File From The Filesystem

Problem

You need to read and work with data in Parquet format stored on your your filesystem.

Solution

Treat Parquet files as database tables.

Discussion

We’ll use some Parquet data from the NYC Taxi & Limousine Commission for the examples here in these Parquet sections.

We’ll start by grabbing yellow_tripdata_2024-01.parquet from the NYC Taxi & Limousine Commission (though, as we will see later, we do not necessarily need this data on a filesystem to work with it).

curl \
  --silent \
  --location \
  --output "./data/yellow_tripdata_2024-01.parquet" \
  --url "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet"

Let’s take a look at the format:

duckdb -c "
DESCRIBE FROM './data/yellow_tripdata_2024-01.parquet'
"
┌───────────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│      column_name      │ column_type │  null   │   key   │ default │  extra  │
│        varchar        │   varchar   │ varchar │ varchar │ varchar │ varchar │
├───────────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ VendorID              │ INTEGER     │ YES     │         │         │         │
│ tpep_pickup_datetime  │ TIMESTAMP   │ YES     │         │         │         │
│ tpep_dropoff_datetime │ TIMESTAMP   │ YES     │         │         │         │
│ passenger_count       │ BIGINT      │ YES     │         │         │         │
│ trip_distance         │ DOUBLE      │ YES     │         │         │         │
│ RatecodeID            │ BIGINT      │ YES     │         │         │         │
│ store_and_fwd_flag    │ VARCHAR     │ YES     │         │         │         │
│ PULocationID          │ INTEGER     │ YES     │         │         │         │
│ DOLocationID          │ INTEGER     │ YES     │         │         │         │
│ payment_type          │ BIGINT      │ YES     │         │         │         │
│ fare_amount           │ DOUBLE      │ YES     │         │         │         │
│ extra                 │ DOUBLE      │ YES     │         │         │         │
│ mta_tax               │ DOUBLE      │ YES     │         │         │         │
│ tip_amount            │ DOUBLE      │ YES     │         │         │         │
│ tolls_amount          │ DOUBLE      │ YES     │         │         │         │
│ improvement_surcharge │ DOUBLE      │ YES     │         │         │         │
│ total_amount          │ DOUBLE      │ YES     │         │         │         │
│ congestion_surcharge  │ DOUBLE      │ YES     │         │         │         │
│ Airport_fee           │ DOUBLE      │ YES     │         │         │         │
├───────────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤
│ 19 rows                                                           6 columns │
└─────────────────────────────────────────────────────────────────────────────┘

Yes, you’re eyes are not failing you.

DuckDB is “FROM first”, and needless SELECTs are optional. This is part of DuckDB’s friendly SQL initiative, which we’ll get into more later.

Note the use of the -c command line option. This tells DuckDB to run the query in the paramater value and exit. It’s one of many ways execute queries on the command line.

From the schema output, we see that DuckDB turned Subscription Date into an actual DATE and figured out the other types, too. You have full control over the schema, too (we’ll get into that later).

So, what does this data actually look like?

duckdb -markdown -c "
FROM './data/yellow_tripdata_2024-01.parquet' LIMIT 5
"
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge Airport_fee
2 2024-01-01 00:57:55 2024-01-01 01:17:43 1 1.72 1 N 186 79 2 17.7 1.0 0.5 0.0 0.0 1.0 22.7 2.5 0.0
1 2024-01-01 00:03:00 2024-01-01 00:09:36 1 1.8 1 N 140 236 1 10.0 3.5 0.5 3.75 0.0 1.0 18.75 2.5 0.0
1 2024-01-01 00:17:06 2024-01-01 00:35:01 1 4.7 1 N 236 79 1 23.3 3.5 0.5 3.0 0.0 1.0 31.3 2.5 0.0
1 2024-01-01 00:36:38 2024-01-01 00:44:56 1 1.4 1 N 79 211 1 10.0 3.5 0.5 2.0 0.0 1.0 17.0 2.5 0.0
1 2024-01-01 00:46:51 2024-01-01 00:52:57 1 0.8 1 N 211 148 1 7.9 3.5 0.5 3.2 0.0 1.0 16.1 2.5 0.0

Let’s see how many trips required an airport fee this year:

duckdb -c "
FROM 'data/yellow_tripdata_2024-01.parquet'
SELECT 
  COUNT(*) AS total_trips_so_far_in_2024,
  SUM(CASE WHEN Airport_fee = 0 THEN 1 ELSE 0 END) AS trips_with_airport_fees,
  total_trips_so_far_in_2024 - trips_with_airport_fees AS diff
"
┌────────────────────────────┬─────────────────────────┬────────┐
│ total_trips_so_far_in_2024 │ trips_with_airport_fees │  diff  │
│           int64            │         int128          │ int128 │
├────────────────────────────┼─────────────────────────┼────────┤
│                    2964624 │                 2586789 │ 377835 │
└────────────────────────────┴─────────────────────────┴────────┘

Make sure to give the official DuckDB Parquet documentation a read through as there are many more options for handling Parquet files optimally.