curl \
--silent \
--location \
--output "./data/yellow_tripdata_2024-01.parquet" \
--url "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet"
Read A Parquet File From The Filesystem
Problem
You need to read and work with data in Parquet format stored on your your filesystem.
Solution
Treat Parquet files as database tables.
Discussion
We’ll use some Parquet data from the NYC Taxi & Limousine Commission for the examples here in these Parquet sections.
We’ll start by grabbing yellow_tripdata_2024-01.parquet
from the NYC Taxi & Limousine Commission (though, as we will see later, we do not necessarily need this data on a filesystem to work with it).
Let’s take a look at the format:
duckdb -c "
DESCRIBE FROM './data/yellow_tripdata_2024-01.parquet'
"
┌───────────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├───────────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ VendorID │ INTEGER │ YES │ │ │ │
│ tpep_pickup_datetime │ TIMESTAMP │ YES │ │ │ │
│ tpep_dropoff_datetime │ TIMESTAMP │ YES │ │ │ │
│ passenger_count │ BIGINT │ YES │ │ │ │
│ trip_distance │ DOUBLE │ YES │ │ │ │
│ RatecodeID │ BIGINT │ YES │ │ │ │
│ store_and_fwd_flag │ VARCHAR │ YES │ │ │ │
│ PULocationID │ INTEGER │ YES │ │ │ │
│ DOLocationID │ INTEGER │ YES │ │ │ │
│ payment_type │ BIGINT │ YES │ │ │ │
│ fare_amount │ DOUBLE │ YES │ │ │ │
│ extra │ DOUBLE │ YES │ │ │ │
│ mta_tax │ DOUBLE │ YES │ │ │ │
│ tip_amount │ DOUBLE │ YES │ │ │ │
│ tolls_amount │ DOUBLE │ YES │ │ │ │
│ improvement_surcharge │ DOUBLE │ YES │ │ │ │
│ total_amount │ DOUBLE │ YES │ │ │ │
│ congestion_surcharge │ DOUBLE │ YES │ │ │ │
│ Airport_fee │ DOUBLE │ YES │ │ │ │
├───────────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤
│ 19 rows 6 columns │
└─────────────────────────────────────────────────────────────────────────────┘
Yes, you’re eyes are not failing you.
DuckDB is “FROM
first”, and needless SELECT
s are optional. This is part of DuckDB’s friendly SQL initiative, which we’ll get into more later.
Note the use of the -c
command line option. This tells DuckDB to run the query in the paramater value and exit. It’s one of many ways execute queries on the command line.
From the schema output, we see that DuckDB turned Subscription Date
into an actual DATE
and figured out the other types, too. You have full control over the schema, too (we’ll get into that later).
So, what does this data actually look like?
duckdb -markdown -c "
FROM './data/yellow_tripdata_2024-01.parquet' LIMIT 5
"
VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | Airport_fee |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 2024-01-01 00:57:55 | 2024-01-01 01:17:43 | 1 | 1.72 | 1 | N | 186 | 79 | 2 | 17.7 | 1.0 | 0.5 | 0.0 | 0.0 | 1.0 | 22.7 | 2.5 | 0.0 |
1 | 2024-01-01 00:03:00 | 2024-01-01 00:09:36 | 1 | 1.8 | 1 | N | 140 | 236 | 1 | 10.0 | 3.5 | 0.5 | 3.75 | 0.0 | 1.0 | 18.75 | 2.5 | 0.0 |
1 | 2024-01-01 00:17:06 | 2024-01-01 00:35:01 | 1 | 4.7 | 1 | N | 236 | 79 | 1 | 23.3 | 3.5 | 0.5 | 3.0 | 0.0 | 1.0 | 31.3 | 2.5 | 0.0 |
1 | 2024-01-01 00:36:38 | 2024-01-01 00:44:56 | 1 | 1.4 | 1 | N | 79 | 211 | 1 | 10.0 | 3.5 | 0.5 | 2.0 | 0.0 | 1.0 | 17.0 | 2.5 | 0.0 |
1 | 2024-01-01 00:46:51 | 2024-01-01 00:52:57 | 1 | 0.8 | 1 | N | 211 | 148 | 1 | 7.9 | 3.5 | 0.5 | 3.2 | 0.0 | 1.0 | 16.1 | 2.5 | 0.0 |
Let’s see how many trips required an airport fee this year:
duckdb -c "
FROM 'data/yellow_tripdata_2024-01.parquet'
SELECT
COUNT(*) AS total_trips_so_far_in_2024,
SUM(CASE WHEN Airport_fee = 0 THEN 1 ELSE 0 END) AS trips_with_airport_fees,
total_trips_so_far_in_2024 - trips_with_airport_fees AS diff
"
┌────────────────────────────┬─────────────────────────┬────────┐
│ total_trips_so_far_in_2024 │ trips_with_airport_fees │ diff │
│ int64 │ int128 │ int128 │
├────────────────────────────┼─────────────────────────┼────────┤
│ 2964624 │ 2586789 │ 377835 │
└────────────────────────────┴─────────────────────────┴────────┘
Make sure to give the official DuckDB Parquet documentation a read through as there are many more options for handling Parquet files optimally.