Read Parquet File(s) From The Web

Problem

You need to read and work with Parquet data stored on the internets.

Solution

Treat Parquet files on the web as database tables.

Discussion

NOTE: There are some new techniques in this chapter that apply to JSON and CSV files as well. However, it is a more likely scenario to have to treat multiple remote Parquet files as a single database table than it is JSON. And, using Parquet files in this manner is more efficient than JSON or CSV.

First, let’s demonstrate how we can use an array of HTTP URLs instead of a string when reading from a remote data source:

time duckdb -c "
FROM read_parquet([
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-03.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-04.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-05.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-06.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-07.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-08.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-09.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-10.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-11.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-12.parquet'
  ])
SELECT
  COUNT(*) AS n_rides_2024
"
┌──────────────┐
│ n_rides_2024 │
│    int64     │
├──────────────┤
│     38310226 │
└──────────────┘

real    0m1.555s
user    0m0.429s
sys 0m0.436s

This matches the 38,310,226 value we saw in the previous chapter.

As noted, Parquet files are super efficient, so DuckDB was able to use HTTP range requests to read the data in.

We’ll move the requests to a server I control so I can show you what’s going on:

time duckdb -c "
FROM read_parquet([
  'https://data.hrbrmstr.dev/taxi/yellow_tripdata_2023-01.parquet',
  'https://data.hrbrmstr.dev/taxi/yellow_tripdata_2023-02.parquet'
  ])
SELECT
  COUNT(*) AS ignored
"
┌─────────┐
│ ignored │
│  int64  │
├─────────┤
│ 5980721 │
└─────────┘

real    0m1.671s
user    0m0.052s
sys 0m0.013s

Let’s see what DuckDB sends and what my server replies:

HEAD /taxi/yellow_tripdata_2023-01.parquet HTTP/1.1
Host: data.hrbrmstr.dev

Webserver replies:

HTTP/1.1 200
Accept-Ranges:bytes
Content-Length:47673370

It then sends:

GET /taxi/yellow_tripdata_2023-01.parquet HTTP/1.1
Host: data.hrbrmstr.dev
Range:bytes=47673362-47673369

and, the server sends back this with data:

HTTP/1.1 206
Content-Range:bytes 47673362-47673369/47673370

HTTP status code 206 is “partial content”

In a Parquet file, the metadata is stored in a separate block within the file called the “footer.” This metadata includes comprehensive information about the file, such as the file format version, the schema, encoding used for the data, and locations of all the column metadata start locations. This design allows for efficient data processing and querying, as the metadata can be read first to understand the structure and contents of the file without needing to read the entire file. The footer’s structure facilitates single-pass writing, where data is written first, and the metadata, which records details about the data, is written at the end. This approach enables efficient reading and writing operations, as well as the flexibility to split the file into multiple parts if necessary, without losing the ability to understand the file’s structure.

We can use pqrs to see what DuckDB sees when it makes the request and parses the response:

pqrs schema --json data/taxi/yellow_tripdata_2023-01.parquet | jq
{
  "version": 1,
  "num_rows": 3066766,
  "created_by": "parquet-cpp-arrow version 8.0.0",
  "metadata": {
    "pandas": "{\"index_columns\": [], \"column_indexes\": [], \"columns\": [{\"name\": \"VendorID\", \"field_name\": \"VendorID\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"tpep_pickup_datetime\", \"field_name\": \"tpep_pickup_datetime\", \"pandas_type\": \"datetime\", \"numpy_type\": \"datetime64[ns]\", \"metadata\": null}, {\"name\": \"tpep_dropoff_datetime\", \"field_name\": \"tpep_dropoff_datetime\", \"pandas_type\": \"datetime\", \"numpy_type\": \"datetime64[ns]\", \"metadata\": null}, {\"name\": \"passenger_count\", \"field_name\": \"passenger_count\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"trip_distance\", \"field_name\": \"trip_distance\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"RatecodeID\", \"field_name\": \"RatecodeID\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"store_and_fwd_flag\", \"field_name\": \"store_and_fwd_flag\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"metadata\": null}, {\"name\": \"PULocationID\", \"field_name\": \"PULocationID\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"DOLocationID\", \"field_name\": \"DOLocationID\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"payment_type\", \"field_name\": \"payment_type\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\": \"fare_amount\", \"field_name\": \"fare_amount\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"extra\", \"field_name\": \"extra\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"mta_tax\", \"field_name\": \"mta_tax\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"tip_amount\", \"field_name\": \"tip_amount\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"tolls_amount\", \"field_name\": \"tolls_amount\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"improvement_surcharge\", \"field_name\": \"improvement_surcharge\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"total_amount\", \"field_name\": \"total_amount\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"congestion_surcharge\", \"field_name\": \"congestion_surcharge\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": \"airport_fee\", \"field_name\": \"airport_fee\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}], \"creator\": {\"library\": \"pyarrow\", \"version\": \"8.0.0\"}, \"pandas_version\": \"1.2.3\"}",
    "ARROW:schema": "/////7gOAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABBAAQAAAAAAAKAAwAAAAEAAgACgAAADQKAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYAAABwYW5kYXMAAPwJAAB7ImluZGV4X2NvbHVtbnMiOiBbXSwgImNvbHVtbl9pbmRleGVzIjogW10sICJjb2x1bW5zIjogW3sibmFtZSI6ICJWZW5kb3JJRCIsICJmaWVsZF9uYW1lIjogIlZlbmRvcklEIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogInRwZXBfcGlja3VwX2RhdGV0aW1lIiwgImZpZWxkX25hbWUiOiAidHBlcF9waWNrdXBfZGF0ZXRpbWUiLCAicGFuZGFzX3R5cGUiOiAiZGF0ZXRpbWUiLCAibnVtcHlfdHlwZSI6ICJkYXRldGltZTY0W25zXSIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAidHBlcF9kcm9wb2ZmX2RhdGV0aW1lIiwgImZpZWxkX25hbWUiOiAidHBlcF9kcm9wb2ZmX2RhdGV0aW1lIiwgInBhbmRhc190eXBlIjogImRhdGV0aW1lIiwgIm51bXB5X3R5cGUiOiAiZGF0ZXRpbWU2NFtuc10iLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogInBhc3Nlbmdlcl9jb3VudCIsICJmaWVsZF9uYW1lIjogInBhc3Nlbmdlcl9jb3VudCIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAidHJpcF9kaXN0YW5jZSIsICJmaWVsZF9uYW1lIjogInRyaXBfZGlzdGFuY2UiLCAicGFuZGFzX3R5cGUiOiAiZmxvYXQ2NCIsICJudW1weV90eXBlIjogImZsb2F0NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIlJhdGVjb2RlSUQiLCAiZmllbGRfbmFtZSI6ICJSYXRlY29kZUlEIiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJzdG9yZV9hbmRfZndkX2ZsYWciLCAiZmllbGRfbmFtZSI6ICJzdG9yZV9hbmRfZndkX2ZsYWciLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSIsICJudW1weV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiUFVMb2NhdGlvbklEIiwgImZpZWxkX25hbWUiOiAiUFVMb2NhdGlvbklEIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogIkRPTG9jYXRpb25JRCIsICJmaWVsZF9uYW1lIjogIkRPTG9jYXRpb25JRCIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJwYXltZW50X3R5cGUiLCAiZmllbGRfbmFtZSI6ICJwYXltZW50X3R5cGUiLCAicGFuZGFzX3R5cGUiOiAiaW50NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiZmFyZV9hbW91bnQiLCAiZmllbGRfbmFtZSI6ICJmYXJlX2Ftb3VudCIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiZXh0cmEiLCAiZmllbGRfbmFtZSI6ICJleHRyYSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAibXRhX3RheCIsICJmaWVsZF9uYW1lIjogIm10YV90YXgiLCAicGFuZGFzX3R5cGUiOiAiZmxvYXQ2NCIsICJudW1weV90eXBlIjogImZsb2F0NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogInRpcF9hbW91bnQiLCAiZmllbGRfbmFtZSI6ICJ0aXBfYW1vdW50IiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJ0b2xsc19hbW91bnQiLCAiZmllbGRfbmFtZSI6ICJ0b2xsc19hbW91bnQiLCAicGFuZGFzX3R5cGUiOiAiZmxvYXQ2NCIsICJudW1weV90eXBlIjogImZsb2F0NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogImltcHJvdmVtZW50X3N1cmNoYXJnZSIsICJmaWVsZF9uYW1lIjogImltcHJvdmVtZW50X3N1cmNoYXJnZSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDY0IiwgIm51bXB5X3R5cGUiOiAiZmxvYXQ2NCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAidG90YWxfYW1vdW50IiwgImZpZWxkX25hbWUiOiAidG90YWxfYW1vdW50IiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJjb25nZXN0aW9uX3N1cmNoYXJnZSIsICJmaWVsZF9uYW1lIjogImNvbmdlc3Rpb25fc3VyY2hhcmdlIiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJhaXJwb3J0X2ZlZSIsICJmaWVsZF9uYW1lIjogImFpcnBvcnRfZmVlIiwgInBhbmRhc190eXBlIjogImZsb2F0NjQiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDY0IiwgIm1ldGFkYXRhIjogbnVsbH1dLCAiY3JlYXRvciI6IHsibGlicmFyeSI6ICJweWFycm93IiwgInZlcnNpb24iOiAiOC4wLjAifSwgInBhbmRhc192ZXJzaW9uIjogIjEuMi4zIn0AAAAAEwAAABgEAADEAwAAhAMAAEwDAAAUAwAA4AIAAKQCAABoAgAALAIAAPABAAC8AQAAjAEAAFwBAAAoAQAA8AAAALAAAAB4AAAAOAAAAAQAAABE/P//AAABAxAAAAAcAAAABAAAAAAAAAALAAAAYWlycG9ydF9mZWUAevz//wAAAgB0/P//AAABAxAAAAAoAAAABAAAAAAAAAAUAAAAY29uZ2VzdGlvbl9zdXJjaGFyZ2UAAAAAtvz//wAAAgCw/P//AAABAxAAAAAgAAAABAAAAAAAAAAMAAAAdG90YWxfYW1vdW50AAAAAOr8//8AAAIA5Pz//wAAAQMQAAAAKAAAAAQAAAAAAAAAFQAAAGltcHJvdmVtZW50X3N1cmNoYXJnZQAAACb9//8AAAIAIP3//wAAAQMQAAAAIAAAAAQAAAAAAAAADAAAAHRvbGxzX2Ftb3VudAAAAABa/f//AAACAFT9//8AAAEDEAAAABwAAAAEAAAAAAAAAAoAAAB0aXBfYW1vdW50AACK/f//AAACAIT9//8AAAEDEAAAABgAAAAEAAAAAAAAAAcAAABtdGFfdGF4ALb9//8AAAIAsP3//wAAAQMQAAAAGAAAAAQAAAAAAAAABQAAAGV4dHJhAAAA4v3//wAAAgDc/f//AAABAxAAAAAcAAAABAAAAAAAAAALAAAAZmFyZV9hbW91bnQAEv7//wAAAgAM/v//AAABAhAAAAAgAAAABAAAAAAAAAAMAAAAcGF5bWVudF90eXBlAAAAAAD+//8AAAABQAAAAET+//8AAAECEAAAACAAAAAEAAAAAAAAAAwAAABET0xvY2F0aW9uSUQAAAAAOP7//wAAAAFAAAAAfP7//wAAAQIQAAAAIAAAAAQAAAAAAAAADAAAAFBVTG9jYXRpb25JRAAAAABw/v//AAAAAUAAAAC0/v//AAABBRAAAAAoAAAABAAAAAAAAAASAAAAc3RvcmVfYW5kX2Z3ZF9mbGFnAAAEAAQABAAAAOz+//8AAAEDEAAAABwAAAAEAAAAAAAAAAoAAABSYXRlY29kZUlEAAAi////AAACABz///8AAAEDEAAAACAAAAAEAAAAAAAAAA0AAAB0cmlwX2Rpc3RhbmNlAAAAVv///wAAAgBQ////AAABAxAAAAAgAAAABAAAAAAAAAAPAAAAcGFzc2VuZ2VyX2NvdW50AIr///8AAAIAhP///wAAAQoQAAAAKAAAAAQAAAAAAAAAFQAAAHRwZXBfZHJvcG9mZl9kYXRldGltZQAAAMb///8AAAMAwP///wAAAQoQAAAALAAAAAQAAAAAAAAAFAAAAHRwZXBfcGlja3VwX2RhdGV0aW1lAAAGAAgABgAGAAAAAAADABAAFAAIAAYABwAMAAAAEAAQAAAAAAABAhAAAAAkAAAABAAAAAAAAAAIAAAAVmVuZG9ySUQAAAAACAAMAAgABwAIAAAAAAAAAUAAAAA="
  },
  "columns": [
    {
      "path": "VendorID",
      "optional": "true",
      "name": "VendorID",
      "physical_type": "INT64",
      "converted_type": "NONE"
    },
    {
      "path": "tpep_pickup_datetime",
      "name": "tpep_pickup_datetime",
      "optional": "true",
      "physical_type": "INT64",
      "converted_type": "TIMESTAMP_MICROS"
    },
    {
      "path": "tpep_dropoff_datetime",
      "optional": "true",
      "physical_type": "INT64",
      "name": "tpep_dropoff_datetime",
      "converted_type": "TIMESTAMP_MICROS"
    },
    {
      "name": "passenger_count",
      "optional": "true",
      "physical_type": "DOUBLE",
      "converted_type": "NONE",
      "path": "passenger_count"
    },
    {
      "converted_type": "NONE",
      "name": "trip_distance",
      "path": "trip_distance",
      "optional": "true",
      "physical_type": "DOUBLE"
    },
    {
      "physical_type": "DOUBLE",
      "converted_type": "NONE",
      "optional": "true",
      "name": "RatecodeID",
      "path": "RatecodeID"
    },
    {
      "converted_type": "UTF8",
      "path": "store_and_fwd_flag",
      "name": "store_and_fwd_flag",
      "physical_type": "BYTE_ARRAY",
      "optional": "true"
    },
    {
      "name": "PULocationID",
      "converted_type": "NONE",
      "path": "PULocationID",
      "physical_type": "INT64",
      "optional": "true"
    },
    {
      "converted_type": "NONE",
      "path": "DOLocationID",
      "optional": "true",
      "name": "DOLocationID",
      "physical_type": "INT64"
    },
    {
      "converted_type": "NONE",
      "optional": "true",
      "path": "payment_type",
      "name": "payment_type",
      "physical_type": "INT64"
    },
    {
      "name": "fare_amount",
      "path": "fare_amount",
      "optional": "true",
      "physical_type": "DOUBLE",
      "converted_type": "NONE"
    },
    {
      "path": "extra",
      "name": "extra",
      "converted_type": "NONE",
      "physical_type": "DOUBLE",
      "optional": "true"
    },
    {
      "optional": "true",
      "name": "mta_tax",
      "path": "mta_tax",
      "physical_type": "DOUBLE",
      "converted_type": "NONE"
    },
    {
      "converted_type": "NONE",
      "physical_type": "DOUBLE",
      "path": "tip_amount",
      "name": "tip_amount",
      "optional": "true"
    },
    {
      "name": "tolls_amount",
      "physical_type": "DOUBLE",
      "optional": "true",
      "path": "tolls_amount",
      "converted_type": "NONE"
    },
    {
      "converted_type": "NONE",
      "optional": "true",
      "name": "improvement_surcharge",
      "physical_type": "DOUBLE",
      "path": "improvement_surcharge"
    },
    {
      "physical_type": "DOUBLE",
      "path": "total_amount",
      "converted_type": "NONE",
      "name": "total_amount",
      "optional": "true"
    },
    {
      "name": "congestion_surcharge",
      "path": "congestion_surcharge",
      "optional": "true",
      "converted_type": "NONE",
      "physical_type": "DOUBLE"
    },
    {
      "converted_type": "NONE",
      "optional": "true",
      "physical_type": "DOUBLE",
      "name": "airport_fee",
      "path": "airport_fee"
    }
  ],
  "message": "message schema {\n  OPTIONAL INT64 VendorID;\n  OPTIONAL INT64 tpep_pickup_datetime (TIMESTAMP(MICROS,false));\n  OPTIONAL INT64 tpep_dropoff_datetime (TIMESTAMP(MICROS,false));\n  OPTIONAL DOUBLE passenger_count;\n  OPTIONAL DOUBLE trip_distance;\n  OPTIONAL DOUBLE RatecodeID;\n  OPTIONAL BYTE_ARRAY store_and_fwd_flag (STRING);\n  OPTIONAL INT64 PULocationID;\n  OPTIONAL INT64 DOLocationID;\n  OPTIONAL INT64 payment_type;\n  OPTIONAL DOUBLE fare_amount;\n  OPTIONAL DOUBLE extra;\n  OPTIONAL DOUBLE mta_tax;\n  OPTIONAL DOUBLE tip_amount;\n  OPTIONAL DOUBLE tolls_amount;\n  OPTIONAL DOUBLE improvement_surcharge;\n  OPTIONAL DOUBLE total_amount;\n  OPTIONAL DOUBLE congestion_surcharge;\n  OPTIONAL DOUBLE airport_fee;\n}\n"
}

This is plenty of information to make an efficient “tell me how many rows are in the table” and other queries.

NOTE: data lives, breathes, and evolves, and it’s possible to have Parquet files generated by a given process to have different schemas over time. If that happens, you should use union_by_name to ensure these remote queries stay efficient.

Let’s see how long our “total fares” query takes in this remote context. Let’s also see how we can make query operations a bit more visually efficient by creating a VIEW from our remote list of parquet files first:

time duckdb -c "
CREATE VIEW rides AS 
  FROM read_parquet([
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-03.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-04.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-05.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-06.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-07.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-08.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-09.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-10.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-11.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-12.parquet'
  ])
;

FROM rides LIMIT 10;

FROM rides
SELECT 
  printf('$%,d', CAST(SUM(fare_amount) AS INTEGER)) AS total_2023_fares;
"
┌──────────┬──────────────────────┬──────────────────────┬───┬──────────────┬──────────────────────┬─────────────┐
│ VendorID │ tpep_pickup_datetime │ tpep_dropoff_datet…  │ … │ total_amount │ congestion_surcharge │ airport_fee │
│  int64   │      timestamp       │      timestamp       │   │    double    │        double        │   double    │
├──────────┼──────────────────────┼──────────────────────┼───┼──────────────┼──────────────────────┼─────────────┤
│        2 │ 2023-01-01 00:32:10  │ 2023-01-01 00:40:36  │ … │         14.3 │                  2.5 │         0.0 │
│        2 │ 2023-01-01 00:55:08  │ 2023-01-01 01:01:27  │ … │         16.9 │                  2.5 │         0.0 │
│        2 │ 2023-01-01 00:25:04  │ 2023-01-01 00:37:49  │ … │         34.9 │                  2.5 │         0.0 │
│        1 │ 2023-01-01 00:03:48  │ 2023-01-01 00:13:25  │ … │        20.85 │                  0.0 │        1.25 │
│        2 │ 2023-01-01 00:10:29  │ 2023-01-01 00:21:19  │ … │        19.68 │                  2.5 │         0.0 │
│        2 │ 2023-01-01 00:50:34  │ 2023-01-01 01:02:52  │ … │         27.8 │                  2.5 │         0.0 │
│        2 │ 2023-01-01 00:09:22  │ 2023-01-01 00:19:49  │ … │        20.52 │                  2.5 │         0.0 │
│        2 │ 2023-01-01 00:27:12  │ 2023-01-01 00:49:56  │ … │        64.44 │                  2.5 │         0.0 │
│        2 │ 2023-01-01 00:21:44  │ 2023-01-01 00:36:40  │ … │        28.38 │                  2.5 │         0.0 │
│        2 │ 2023-01-01 00:39:42  │ 2023-01-01 00:50:36  │ … │         19.9 │                  2.5 │         0.0 │
├──────────┴──────────────────────┴──────────────────────┴───┴──────────────┴──────────────────────┴─────────────┤
│ 10 rows                                                                                   19 columns (6 shown) │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
┌──────────────────┐
│ total_2023_fares │
│     varchar      │
├──────────────────┤
│ $747,901,834     │
└──────────────────┘

real    0m19.975s
user    0m2.634s
sys 0m2.315s

The “total fares” query did take orders of magnitude longer than our previous query, but it’s pretty amazing it works!