Read A Parquet File From The Filesystem

Problem

You need to read and work with data in Parquet format stored on your your filesystem.

Solution

Treat Parquet files as database tables.

Discussion

We’ll use some Parquet data from the NYC Taxi & Limousine Commission for the examples here in these Parquet sections.

We’ll start by grabbing yellow_tripdata_2024-01.parquet from the NYC Taxi & Limousine Commission (though, as we will see later, we do not necessarily need this data on a filesystem to work with it).

curl \
  --silent \
  --location \
  --output "./data/yellow_tripdata_2024-01.parquet" \
  --url "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet"

Let’s take a look at the format:

duckdb -c "
DESCRIBE FROM './data/yellow_tripdata_2024-01.parquet'
"

┌───────────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│      column_name      │ column_type │  null   │   key   │ default │  extra  │
│        varchar        │   varchar   │ varchar │ varchar │ varchar │ varchar │
├───────────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ VendorID              │ INTEGER     │ YES     │         │         │         │
│ tpep_pickup_datetime  │ TIMESTAMP   │ YES     │         │         │         │
│ tpep_dropoff_datetime │ TIMESTAMP   │ YES     │         │         │         │
│ passenger_count       │ BIGINT      │ YES     │         │         │         │
│ trip_distance         │ DOUBLE      │ YES     │         │         │         │
│ RatecodeID            │ BIGINT      │ YES     │         │         │         │
│ store_and_fwd_flag    │ VARCHAR     │ YES     │         │         │         │
│ PULocationID          │ INTEGER     │ YES     │         │         │         │
│ DOLocationID          │ INTEGER     │ YES     │         │         │         │
│ payment_type          │ BIGINT      │ YES     │         │         │         │
│ fare_amount           │ DOUBLE      │ YES     │         │         │         │
│ extra                 │ DOUBLE      │ YES     │         │         │         │
│ mta_tax               │ DOUBLE      │ YES     │         │         │         │
│ tip_amount            │ DOUBLE      │ YES     │         │         │         │
│ tolls_amount          │ DOUBLE      │ YES     │         │         │         │
│ improvement_surcharge │ DOUBLE      │ YES     │         │         │         │
│ total_amount          │ DOUBLE      │ YES     │         │         │         │
│ congestion_surcharge  │ DOUBLE      │ YES     │         │         │         │
│ Airport_fee           │ DOUBLE      │ YES     │         │         │         │
├───────────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤
│ 19 rows                                                           6 columns │
└─────────────────────────────────────────────────────────────────────────────┘

Yes, you’re eyes are not failing you.

DuckDB is “FROM first”, and needless SELECTs are optional. This is part of DuckDB’s friendly SQL initiative, which we’ll get into more later.

Note the use of the -c command line option. This tells DuckDB to run the query in the paramater value and exit. It’s one of many ways execute queries on the command line.

From the schema output, we see that DuckDB turned Subscription Date into an actual DATE and figured out the other types, too. You have full control over the schema, too (we’ll get into that later).

So, what does this data actually look like?

duckdb -markdown -c "
FROM './data/yellow_tripdata_2024-01.parquet' LIMIT 5
"

VendorID	tpep_pickup_datetime	tpep_dropoff_datetime	passenger_count	trip_distance	RatecodeID	store_and_fwd_flag	PULocationID	DOLocationID	payment_type	fare_amount	extra	mta_tax	tip_amount	improvement_surcharge	total_amount	congestion_surcharge
2	2024-01-01 00:57:55	2024-01-01 01:17:43	1	1.72	1	N	186	79	2	17.7	1.0	0.5	0.0	1.0	22.7	2.5
1	2024-01-01 00:03:00	2024-01-01 00:09:36	1	1.8	1	N	140	236	1	10.0	3.5	0.5	3.75	1.0	18.75	2.5
1	2024-01-01 00:17:06	2024-01-01 00:35:01	1	4.7	1	N	236	79	1	23.3	3.5	0.5	3.0	1.0	31.3	2.5
1	2024-01-01 00:36:38	2024-01-01 00:44:56	1	1.4	1	N	79	211	1	10.0	3.5	0.5	2.0	1.0	17.0	2.5
1	2024-01-01 00:46:51	2024-01-01 00:52:57	1	0.8	1	N	211	148	1	7.9	3.5	0.5	3.2	1.0	16.1	2.5

Let’s see how many trips required an airport fee this year:

duckdb -c "
FROM 'data/yellow_tripdata_2024-01.parquet'
SELECT 
  COUNT(*) AS total_trips_so_far_in_2024,
  SUM(CASE WHEN Airport_fee = 0 THEN 1 ELSE 0 END) AS trips_with_airport_fees,
  total_trips_so_far_in_2024 - trips_with_airport_fees AS diff
"

┌────────────────────────────┬─────────────────────────┬────────┐
│ total_trips_so_far_in_2024 │ trips_with_airport_fees │  diff  │
│           int64            │         int128          │ int128 │
├────────────────────────────┼─────────────────────────┼────────┤
│                    2964624 │                 2586789 │ 377835 │
└────────────────────────────┴─────────────────────────┴────────┘

Make sure to give the official DuckDB Parquet documentation a read through as there are many more options for handling Parquet files optimally.