Read Many Parquet Files From The Filesystem
Problem
You need to read and work with a directory of Parquet files that have similar schemas.
Solution
Treat the directory of Parquet files as a single database table.
Discussion
This is, perhaps, one of the best use cases for DuckDB, as it leverages all the strengths of the Parquet format to help turn a single-node system into a very real rival for “big data” clusters.
We’ll demonstrate this by using all 12 of the 2023 monthly Parquet files (we used the single 2024 file in the previous chapter). Let’s get them downloaded:
mkdir -p ./data/taxi

for month in {01..12}
do
  curl \
    --silent \
    --location \
    --output "./data/taxi/yellow_tripdata_2023-${month}.parquet" \
    --url "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-${month}.parquet"
done
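Before querying anything, it’s worth confirming that all 12 files actually landed. Here’s a minimal sanity check using DuckDB’s built-in glob() table function, which returns one row per matching file:
duckdb -c "
FROM glob('data/taxi/yellow_tripdata_*.parquet')
SELECT COUNT(*) AS num_files
"
With the files in place, let’s inspect the combined schema: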
duckdb -c "
DESCRIBE
FROM read_parquet('data/taxi/yellow_tripdata_*.parquet')
"
┌───────────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│      column_name      │ column_type │  null   │   key   │ default │  extra  │
│        varchar        │   varchar   │ varchar │ varchar │ varchar │ varchar │
├───────────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ VendorID              │ BIGINT      │ YES     │         │         │         │
│ tpep_pickup_datetime  │ TIMESTAMP   │ YES     │         │         │         │
│ tpep_dropoff_datetime │ TIMESTAMP   │ YES     │         │         │         │
│ passenger_count       │ DOUBLE      │ YES     │         │         │         │
│ trip_distance         │ DOUBLE      │ YES     │         │         │         │
│ RatecodeID            │ DOUBLE      │ YES     │         │         │         │
│ store_and_fwd_flag    │ VARCHAR     │ YES     │         │         │         │
│ PULocationID          │ BIGINT      │ YES     │         │         │         │
│ DOLocationID          │ BIGINT      │ YES     │         │         │         │
│ payment_type          │ BIGINT      │ YES     │         │         │         │
│ fare_amount           │ DOUBLE      │ YES     │         │         │         │
│ extra                 │ DOUBLE      │ YES     │         │         │         │
│ mta_tax               │ DOUBLE      │ YES     │         │         │         │
│ tip_amount            │ DOUBLE      │ YES     │         │         │         │
│ tolls_amount          │ DOUBLE      │ YES     │         │         │         │
│ improvement_surcharge │ DOUBLE      │ YES     │         │         │         │
│ total_amount          │ DOUBLE      │ YES     │         │         │         │
│ congestion_surcharge  │ DOUBLE      │ YES     │         │         │         │
│ airport_fee           │ DOUBLE      │ YES     │         │         │         │
├───────────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤
│ 19 rows                                                           6 columns │
└──────────────────────────────────────────────────────────────────────────────┘
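The Problem statement says “similar schemas” for a reason: schemas in long-running public datasets like this one can drift across files (renamed columns, changed types). If you hit a batch of files whose schemas don’t line up exactly, read_parquet() accepts a union_by_name parameter that unifies columns by name and fills any gaps with NULLs. A sketch of the same DESCRIBE with that option:
duckdb -c "
DESCRIBE
FROM read_parquet('data/taxi/yellow_tripdata_*.parquet', union_by_name = true)
"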
NOTE: DuckDB fully groks Hive partitions for all file formats, including Parquet.
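For example, if you had reorganized the downloads into a Hive-style layout such as data/taxi/year=2023/month=01/rides.parquet (a hypothetical layout, not what our download loop produced), the partition keys would become queryable columns:
duckdb -c "
FROM read_parquet('data/taxi/year=*/month=*/*.parquet', hive_partitioning = true)
SELECT year, month, COUNT(*) AS trips
GROUP BY ALL
ORDER BY ALL
"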
Let’s see how many records we have:
time duckdb -c "
FROM read_parquet('data/taxi/yellow_tripdata_*.parquet')
SELECT COUNT(*)
"
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│     38310226 │
└──────────────┘
real 0m0.018s
user 0m0.023s
sys 0m0.005s
I used time with that query to show just how fast and smart DuckDB is when it comes to Parquet files. Each Parquet file stores metadata in its footer, including row counts, so all DuckDB has to do to figure out the total number of rows is read that metadata; it never has to scan the data itself.
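You can peek at that footer metadata yourself. Assuming a reasonably recent DuckDB release, the parquet_file_metadata() function exposes it per file, including the row counts that answered the COUNT(*) above:
duckdb -c "
FROM parquet_file_metadata('data/taxi/yellow_tripdata_*.parquet')
SELECT file_name, num_rows
"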
Airport trips carry a separate charge, recorded in the airport_fee column, so we can use it to split the data. Let’s see what share of each month’s 2023 trips were non-airport trips (those with a zero airport fee):
time duckdb -c "
FROM read_parquet('data/taxi/yellow_tripdata_*.parquet')
SELECT
  date_part('month', tpep_pickup_datetime) AS month,
  COUNT(*) AS total_trips,
  SUM(CASE WHEN airport_fee = 0 THEN 1 ELSE 0 END) AS non_airport_trips,
  non_airport_trips / total_trips AS non_airport_ratio
GROUP BY 1
ORDER BY 1
"
┌───────┬─────────────┬───────────────────┬────────────────────┐
│ month │ total_trips │ non_airport_trips │ non_airport_ratio  │
│ int64 │    int64    │      int128       │       double       │
├───────┼─────────────┼───────────────────┼────────────────────┤
│     1 │     3066759 │           2730446 │ 0.8903360192307254 │
│     2 │     2914003 │           2611706 │ 0.8962605735134795 │
│     3 │     3403660 │           3032630 │ 0.8909908745291832 │
│     4 │     3288248 │           2922547 │ 0.8887854565714022 │
│     5 │     3513664 │           3129712 │ 0.8907260341341687 │
│     6 │     3307259 │           2918007 │ 0.8823037445812378 │
│     7 │     2907093 │           2547888 │ 0.8764384214746483 │
│     8 │     2824201 │           2456861 │ 0.8699313540360619 │
│     9 │     2846741 │           2442020 │ 0.8578300590043141 │
│    10 │     3522280 │           3060728 │ 0.8689621495167903 │
│    11 │     3339732 │           2924381 │ 0.8756334340599785 │
│    12 │     3376586 │           2936512 │ 0.8696689496432195 │
├───────┴─────────────┴───────────────────┴────────────────────┤
│ 12 rows                                            4 columns │
└───────────────────────────────────────────────────────────────┘
real 0m0.228s
user 0m1.576s
sys 0m0.054s
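As an aside: the SUM(CASE WHEN …) idiom works in any SQL engine, but DuckDB also supports the tidier FILTER clause on aggregates. This sketch computes the same monthly counts:
duckdb -c "
FROM read_parquet('data/taxi/yellow_tripdata_*.parquet')
SELECT
  date_part('month', tpep_pickup_datetime) AS month,
  COUNT(*) FILTER (WHERE airport_fee = 0) AS non_airport_trips
GROUP BY 1
ORDER BY 1
"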
Now, let’s make DuckDB work for a living and sum the fare amounts for the entire year:
time duckdb -c "
FROM read_parquet('data/taxi/yellow_tripdata_*.parquet')
SELECT
  printf('$%,d', CAST(SUM(fare_amount) AS INTEGER)) AS total_2023_fares
"
┌──────────────────┐
│ total_2023_fares │
│     varchar      │
├──────────────────┤
│ $747,901,834     │
└──────────────────┘
real 0m0.070s
user 0m0.447s
sys 0m0.017s
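Part of why this is so quick: the query references only fare_amount, and DuckDB’s projection pushdown means just that one column gets decoded from each file; the other 18 are never read. If you want to see the scan plan for yourself, prefix the query with EXPLAIN (output omitted here):
duckdb -c "
EXPLAIN
FROM read_parquet('data/taxi/yellow_tripdata_*.parquet')
SELECT SUM(fare_amount) AS total_fares
"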
Remember: along with Hive partitioning schemes, you can also go crazy with file/path glob patterns.
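For instance, bracket globs let you scan just a slice of the year, and read_parquet() also accepts an explicit list of paths. A couple of sketches:
duckdb -c "
FROM read_parquet('data/taxi/yellow_tripdata_2023-0[123].parquet')
SELECT COUNT(*) AS q1_trips
"

duckdb -c "
FROM read_parquet(['data/taxi/yellow_tripdata_2023-01.parquet',
                   'data/taxi/yellow_tripdata_2023-07.parquet'])
SELECT COUNT(*) AS jan_and_jul_trips
"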