Read Many Parquet Files From The Filesystem

Problem

You need to read and work with a directory of Parquet files with similar schemas.

Solution

Treat the directory of Parquet files as a single database table.

Discussion

This is, perhaps, one of the best use cases for DuckDB, as it leverages all the strengths of the Parquet format to help turn a single-node system into a very real rival for “big data” clusters.

We’ll demonstrate this by using all 12 of the 2023 rides Parquet files (we used the single 2024 file in the previous chapter). Let’s get them downloaded:

mkdir -p ./data/taxi

for month in {01..12}
do
  curl \
    --silent \
    --location \
    --output "./data/taxi/yellow_tripdata_2023-${month}.parquet" \
    --url "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-${month}.parquet"
done
With the files downloaded, we can have DuckDB describe the schema across the entire directory:

duckdb -c "
DESCRIBE
FROM read_parquet('data/taxi/yellow_tripdata_*.parquet')
"
┌───────────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│      column_name      │ column_type │  null   │   key   │ default │  extra  │
│        varchar        │   varchar   │ varchar │ varchar │ varchar │ varchar │
├───────────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ VendorID              │ BIGINT      │ YES     │         │         │         │
│ tpep_pickup_datetime  │ TIMESTAMP   │ YES     │         │         │         │
│ tpep_dropoff_datetime │ TIMESTAMP   │ YES     │         │         │         │
│ passenger_count       │ DOUBLE      │ YES     │         │         │         │
│ trip_distance         │ DOUBLE      │ YES     │         │         │         │
│ RatecodeID            │ DOUBLE      │ YES     │         │         │         │
│ store_and_fwd_flag    │ VARCHAR     │ YES     │         │         │         │
│ PULocationID          │ BIGINT      │ YES     │         │         │         │
│ DOLocationID          │ BIGINT      │ YES     │         │         │         │
│ payment_type          │ BIGINT      │ YES     │         │         │         │
│ fare_amount           │ DOUBLE      │ YES     │         │         │         │
│ extra                 │ DOUBLE      │ YES     │         │         │         │
│ mta_tax               │ DOUBLE      │ YES     │         │         │         │
│ tip_amount            │ DOUBLE      │ YES     │         │         │         │
│ tolls_amount          │ DOUBLE      │ YES     │         │         │         │
│ improvement_surcharge │ DOUBLE      │ YES     │         │         │         │
│ total_amount          │ DOUBLE      │ YES     │         │         │         │
│ congestion_surcharge  │ DOUBLE      │ YES     │         │         │         │
│ airport_fee           │ DOUBLE      │ YES     │         │         │         │
├───────────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤
│ 19 rows                                                           6 columns │
└─────────────────────────────────────────────────────────────────────────────┘
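
All twelve files share this schema, which is why DuckDB can treat them as one table with no extra ceremony. If the monthly files had drifted apart over time (say, a column added mid-year), you could have DuckDB line the schemas up by column name rather than by position; here’s a sketch using the union_by_name option:

duckdb -c "
DESCRIBE
FROM read_parquet(
  'data/taxi/yellow_tripdata_*.parquet',
  union_by_name = true
)
"

Columns missing from some of the files come back as NULLs instead of triggering an error.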

NOTE: DuckDB fully groks Hive partitions for all file formats, including Parquet.
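
To make that concrete, here’s a minimal sketch. It assumes a hypothetical layout such as data/taxi/year=2023/month=01/rides.parquet (our downloaded files are not arranged this way); with hive_partitioning enabled, the year and month path segments become queryable columns:

duckdb -c "
FROM read_parquet(
  'data/taxi/year=*/month=*/*.parquet',
  hive_partitioning = true
)
SELECT year, month, COUNT(*) AS trips
GROUP BY ALL
ORDER BY ALL
"

Better yet, a WHERE clause on year or month prunes whole directories before any Parquet data is read.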

Let’s see how many records we have:

time duckdb -c "
FROM read_parquet('data/taxi/yellow_tripdata_*.parquet')
SELECT COUNT(*)
"
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│     38310226 │
└──────────────┘

real    0m0.018s
user    0m0.023s
sys 0m0.005s

I used time with that query to show both how fast and how smart DuckDB is when it comes to Parquet files. Each individual Parquet file carries metadata in its footer, so all DuckDB has to do to figure out the total number of rows is read that metadata; it never touches the row data itself.
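
You can inspect that metadata yourself. Assuming a reasonably recent DuckDB, the parquet_file_metadata() function surfaces the per-file row counts that the COUNT(*) above was answered from:

duckdb -c "
SELECT file_name, num_rows
FROM parquet_file_metadata('data/taxi/yellow_tripdata_*.parquet')
ORDER BY file_name
"

Summing num_rows across the twelve files reproduces the total above.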

Let’s see what share of each month’s 2023 trips were non-airport rides (i.e., no airport fee was charged):

time duckdb -c "
FROM read_parquet('data/taxi/yellow_tripdata_*.parquet')
SELECT
  date_part('month', tpep_pickup_datetime) AS month,
  COUNT(*) AS total_trips,
  SUM(CASE WHEN airport_fee = 0 THEN 1 ELSE 0 END) AS non_airport_trips,
  non_airport_trips/total_trips AS non_airport_ratio
GROUP BY 1
ORDER BY 1
"
┌───────┬─────────────┬───────────────────┬────────────────────┐
│ month │ total_trips │ non_airport_trips │ non_airport_ratio  │
│ int64 │    int64    │      int128       │       double       │
├───────┼─────────────┼───────────────────┼────────────────────┤
│     1 │     3066759 │           2730446 │ 0.8903360192307254 │
│     2 │     2914003 │           2611706 │ 0.8962605735134795 │
│     3 │     3403660 │           3032630 │ 0.8909908745291832 │
│     4 │     3288248 │           2922547 │ 0.8887854565714022 │
│     5 │     3513664 │           3129712 │ 0.8907260341341687 │
│     6 │     3307259 │           2918007 │ 0.8823037445812378 │
│     7 │     2907093 │           2547888 │ 0.8764384214746483 │
│     8 │     2824201 │           2456861 │ 0.8699313540360619 │
│     9 │     2846741 │           2442020 │ 0.8578300590043141 │
│    10 │     3522280 │           3060728 │ 0.8689621495167903 │
│    11 │     3339732 │           2924381 │ 0.8756334340599785 │
│    12 │     3376586 │           2936512 │ 0.8696689496432195 │
├───────┴─────────────┴───────────────────┴────────────────────┤
│ 12 rows                                             4 columns │
└───────────────────────────────────────────────────────────────┘

real    0m0.228s
user    0m1.576s
sys 0m0.054s
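
All of those rows were pulled from twelve separate files. If you ever need to attribute rows back to the file they came from, the filename option to read_parquet() adds a virtual column holding the source path; a quick sketch:

duckdb -c "
FROM read_parquet(
  'data/taxi/yellow_tripdata_*.parquet',
  filename = true
)
SELECT filename, COUNT(*) AS trips
GROUP BY 1
ORDER BY 1
"

That’s handy for spotting rows whose pickup timestamp doesn’t match the month in their file’s name.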

Now, let’s make DuckDB work for a living and sum the fare amount for the entire year:

time duckdb -c "
FROM read_parquet('data/taxi/yellow_tripdata_*.parquet')
SELECT 
  printf('$%,d', CAST(SUM(fare_amount) AS INTEGER)) AS total_2023_fares
"
┌──────────────────┐
│ total_2023_fares │
│     varchar      │
├──────────────────┤
│ $747,901,834     │
└──────────────────┘

real    0m0.070s
user    0m0.447s
sys 0m0.017s
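
Since the Solution was to treat the directory as a single database table, you can make that literal with a view, so each query doesn’t have to repeat the glob. A minimal sketch, using a hypothetical persistent database file taxi.ddb and a view name of our own choosing:

duckdb taxi.ddb -c "
CREATE VIEW trips AS
  SELECT * FROM read_parquet('data/taxi/yellow_tripdata_*.parquet')
"

duckdb taxi.ddb -c "
FROM trips SELECT COUNT(*)
"

The view stores only the query, not the data, so it always reflects whatever Parquet files match the glob at query time.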

Remember, along with Hive partitioning schemes, you can also go crazy with file/path glob patterns.
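
For instance, the glob syntax supports ? and [] character classes in addition to *, so grabbing just the first quarter of 2023 is a one-liner:

duckdb -c "
FROM read_parquet('data/taxi/yellow_tripdata_2023-0[1-3].parquet')
SELECT COUNT(*) AS q1_trips
"

read_parquet() will also accept a list of individual files or globs if a single pattern can’t express the set you’re after.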