Quickly Summarize Tables
Problem
You need an overview of your data.
Solution
Use DuckDB’s SUMMARIZE
statement.
Discussion
The friendly SQL enhancements in DuckDB include SUMMARIZE
, a way to get – min
, max
, approx_unique
, avg
, std
, q25
, q50
, q75
, and count
– right within SQL.
time duckdb -markdown -c "
SUMMARIZE
FROM read_parquet('data/taxi/yellow_tripdata_*.parquet')
"
column_name | column_type | min | max | approx_unique | avg | std | q25 | q50 | q75 | count | null_percentage |
---|---|---|---|---|---|---|---|---|---|---|---|
VendorID | BIGINT | 1 | 6 | 3 | 1.738893683373207 | 0.4443590617384614 | 1 | 2 | 2 | 38310226 | 0.00 |
tpep_pickup_datetime | TIMESTAMP | 2001-01-01 00:06:49 | 2024-01-03 19:42:57 | 19288620 | 38310226 | 0.00 | |||||
tpep_dropoff_datetime | TIMESTAMP | 1970-01-20 10:16:32 | 2024-01-03 20:15:55 | 19759018 | 38310226 | 0.00 | |||||
passenger_count | DOUBLE | 0.0 | 9.0 | 10 | 1.3704258575541601 | 0.892592437213934 | 1.0 | 1.0 | 1.0 | 38310226 | 3.42 |
trip_distance | DOUBLE | 0.0 | 345729.44 | 8555 | 4.088946216082994 | 241.25089989009854 | 1.0408642024676638 | 1.7824496954781088 | 3.4191231293725886 | 38310226 | 0.00 |
RatecodeID | DOUBLE | 1.0 | 99.0 | 7 | 1.6419056362728768 | 7.427350694183615 | 1.0 | 1.0 | 1.0 | 38310226 | 3.42 |
store_and_fwd_flag | VARCHAR | N | Y | 2 | 38310226 | 3.42 | |||||
PULocationID | BIGINT | 1 | 265 | 264 | 165.175477403866 | 63.99765082928126 | 132 | 162 | 233 | 38310226 | 0.00 |
DOLocationID | BIGINT | 1 | 265 | 263 | 163.94996043615092 | 69.85543107793322 | 114 | 162 | 234 | 38310226 | 0.00 |
payment_type | BIGINT | 0 | 5 | 6 | 1.1845846589367548 | 0.5562739535077442 | 1 | 1 | 1 | 38310226 | 0.00 |
fare_amount | DOUBLE | -1087.3 | 386983.63 | 15055 | 19.522250636419308 | 75.72728960022003 | 9.294555491126967 | 13.50798134303114 | 22.133525367677475 | 38310226 | 0.00 |
extra | DOUBLE | -39.17 | 10002.5 | 208 | 1.5560566439884778 | 2.450398738631048 | 0.0 | 1.0 | 2.5 | 38310226 | 0.00 |
mta_tax | DOUBLE | -0.5 | 53.16 | 28 | 0.4856166729478452 | 0.10956288620186637 | 0.5 | 0.5 | 0.5 | 38310226 | 0.00 |
tip_amount | DOUBLE | -411.0 | 4174.0 | 7820 | 3.5222654029777845 | 4.147060116226941 | 0.9589593204277058 | 2.8121417656778673 | 4.419803150744926 | 38310226 | 0.00 |
tolls_amount | DOUBLE | -91.3 | 665.56 | 3088 | 0.5897361378132934 | 2.2009994548076457 | 0.0 | 0.0 | 0.0 | 38310226 | 0.00 |
improvement_surcharge | DOUBLE | -1.0 | 1.0 | 5 | 0.9794487273449718 | 0.1991462132862492 | 1.0 | 1.0 | 1.0 | 38310226 | 0.00 |
total_amount | DOUBLE | -1094.05 | 386987.63 | 34403 | 28.461941516269764 | 77.1282091691602 | 15.875794523177571 | 21.025118352259916 | 30.826469542053864 | 38310226 | 0.00 |
congestion_surcharge | DOUBLE | -2.5 | 2.75 | 7 | 2.264610487537185 | 0.7971480634152025 | 2.5 | 2.5 | 2.5 | 38310226 | 3.42 |
airport_fee | DOUBLE | -1.75 | 1.75 | 7 | 0.14006659032611937 | 0.46864290277197507 | 0.0 | 0.0 | 0.0 | 38310226 | 3.42 |
real 0m11.152s user 1m30.298s sys 0m0.348s
If you need that in long format vs. wide, we can take advantage of the friendly SQL UNPIVOT
statment.
NOTE: due to limitations with
UNPIVOT
we have to return thevalue
column as a string. Since this view is mainly for eyballing the output, it’s not a big deal. And, you can alwaysCAST
the values back conditionally if you need to.
duckdb -markdown<<EOF
WITH smry AS (
SELECT *
FROM (SUMMARIZE FROM read_parquet('data/taxi/yellow_tripdata_*.parquet'))
)
UNPIVOT smry
ON COLUMNS(* EXCLUDE (column_name, column_type))::VARCHAR
INTO
NAME stat
VALUE value
EOF
column_name | column_type | stat | value |
---|---|---|---|
VendorID | BIGINT | min | 1 |
VendorID | BIGINT | max | 6 |
VendorID | BIGINT | approx_unique | 3 |
VendorID | BIGINT | avg | 1.738893683373207 |
VendorID | BIGINT | std | 0.4443590617384622 |
VendorID | BIGINT | q25 | 1 |
VendorID | BIGINT | q50 | 2 |
VendorID | BIGINT | q75 | 2 |
VendorID | BIGINT | count | 38310226 |
VendorID | BIGINT | null_percentage | 0.00 |
tpep_pickup_datetime | TIMESTAMP | min | 2001-01-01 00:06:49 |
tpep_pickup_datetime | TIMESTAMP | max | 2024-01-03 19:42:57 |
tpep_pickup_datetime | TIMESTAMP | approx_unique | 19288620 |
tpep_pickup_datetime | TIMESTAMP | count | 38310226 |
tpep_pickup_datetime | TIMESTAMP | null_percentage | 0.00 |
tpep_dropoff_datetime | TIMESTAMP | min | 1970-01-20 10:16:32 |
tpep_dropoff_datetime | TIMESTAMP | max | 2024-01-03 20:15:55 |
tpep_dropoff_datetime | TIMESTAMP | approx_unique | 19759018 |
tpep_dropoff_datetime | TIMESTAMP | count | 38310226 |
tpep_dropoff_datetime | TIMESTAMP | null_percentage | 0.00 |
passenger_count | DOUBLE | min | 0.0 |
passenger_count | DOUBLE | max | 9.0 |
passenger_count | DOUBLE | approx_unique | 10 |
passenger_count | DOUBLE | avg | 1.3704258575541601 |
passenger_count | DOUBLE | std | 0.892592437213935 |
passenger_count | DOUBLE | q25 | 1.0 |
passenger_count | DOUBLE | q50 | 1.0 |
passenger_count | DOUBLE | q75 | 1.0 |
passenger_count | DOUBLE | count | 38310226 |
passenger_count | DOUBLE | null_percentage | 3.42 |
trip_distance | DOUBLE | min | 0.0 |
trip_distance | DOUBLE | max | 345729.44 |
trip_distance | DOUBLE | approx_unique | 8555 |
trip_distance | DOUBLE | avg | 4.088946216082925 |
trip_distance | DOUBLE | std | 241.25089989010124 |
trip_distance | DOUBLE | q25 | 1.0413981418912506 |
trip_distance | DOUBLE | q50 | 1.7835221915398023 |
trip_distance | DOUBLE | q75 | 3.4192728605631064 |
trip_distance | DOUBLE | count | 38310226 |
trip_distance | DOUBLE | null_percentage | 0.00 |
RatecodeID | DOUBLE | min | 1.0 |
RatecodeID | DOUBLE | max | 99.0 |
RatecodeID | DOUBLE | approx_unique | 7 |
RatecodeID | DOUBLE | avg | 1.6419056362728768 |
RatecodeID | DOUBLE | std | 7.427350694183523 |
RatecodeID | DOUBLE | q25 | 1.0 |
RatecodeID | DOUBLE | q50 | 1.0 |
RatecodeID | DOUBLE | q75 | 1.0 |
RatecodeID | DOUBLE | count | 38310226 |
RatecodeID | DOUBLE | null_percentage | 3.42 |
store_and_fwd_flag | VARCHAR | min | N |
store_and_fwd_flag | VARCHAR | max | Y |
store_and_fwd_flag | VARCHAR | approx_unique | 2 |
store_and_fwd_flag | VARCHAR | count | 38310226 |
store_and_fwd_flag | VARCHAR | null_percentage | 3.42 |
PULocationID | BIGINT | min | 1 |
PULocationID | BIGINT | max | 265 |
PULocationID | BIGINT | approx_unique | 264 |
PULocationID | BIGINT | avg | 165.175477403866 |
PULocationID | BIGINT | std | 63.99765082928099 |
PULocationID | BIGINT | q25 | 132 |
PULocationID | BIGINT | q50 | 162 |
PULocationID | BIGINT | q75 | 234 |
PULocationID | BIGINT | count | 38310226 |
PULocationID | BIGINT | null_percentage | 0.00 |
DOLocationID | BIGINT | min | 1 |
DOLocationID | BIGINT | max | 265 |
DOLocationID | BIGINT | approx_unique | 263 |
DOLocationID | BIGINT | avg | 163.94996043615092 |
DOLocationID | BIGINT | std | 69.85543107793315 |
DOLocationID | BIGINT | q25 | 114 |
DOLocationID | BIGINT | q50 | 162 |
DOLocationID | BIGINT | q75 | 234 |
DOLocationID | BIGINT | count | 38310226 |
DOLocationID | BIGINT | null_percentage | 0.00 |
payment_type | BIGINT | min | 0 |
payment_type | BIGINT | max | 5 |
payment_type | BIGINT | approx_unique | 6 |
payment_type | BIGINT | avg | 1.1845846589367548 |
payment_type | BIGINT | std | 0.5562739535077519 |
payment_type | BIGINT | q25 | 1 |
payment_type | BIGINT | q50 | 1 |
payment_type | BIGINT | q75 | 1 |
payment_type | BIGINT | count | 38310226 |
payment_type | BIGINT | null_percentage | 0.00 |
fare_amount | DOUBLE | min | -1087.3 |
fare_amount | DOUBLE | max | 386983.63 |
fare_amount | DOUBLE | approx_unique | 15055 |
fare_amount | DOUBLE | avg | 19.52225063641895 |
fare_amount | DOUBLE | std | 75.7272896002185 |
fare_amount | DOUBLE | q25 | 9.286774612476815 |
fare_amount | DOUBLE | q50 | 13.512731442356468 |
fare_amount | DOUBLE | q75 | 22.133083641529037 |
fare_amount | DOUBLE | count | 38310226 |
fare_amount | DOUBLE | null_percentage | 0.00 |
extra | DOUBLE | min | -39.17 |
extra | DOUBLE | max | 10002.5 |
extra | DOUBLE | approx_unique | 208 |
extra | DOUBLE | avg | 1.556056643988478 |
extra | DOUBLE | std | 2.450398738630944 |
extra | DOUBLE | q25 | 0.0 |
extra | DOUBLE | q50 | 1.0 |
extra | DOUBLE | q75 | 2.5 |
extra | DOUBLE | count | 38310226 |
extra | DOUBLE | null_percentage | 0.00 |
mta_tax | DOUBLE | min | -0.5 |
mta_tax | DOUBLE | max | 53.16 |
mta_tax | DOUBLE | approx_unique | 28 |
mta_tax | DOUBLE | avg | 0.4856166729478452 |
mta_tax | DOUBLE | std | 0.10956288620186624 |
mta_tax | DOUBLE | q25 | 0.5 |
mta_tax | DOUBLE | q50 | 0.5 |
mta_tax | DOUBLE | q75 | 0.5 |
mta_tax | DOUBLE | count | 38310226 |
mta_tax | DOUBLE | null_percentage | 0.00 |
tip_amount | DOUBLE | min | -411.0 |
tip_amount | DOUBLE | max | 4174.0 |
tip_amount | DOUBLE | approx_unique | 7820 |
tip_amount | DOUBLE | avg | 3.5222654029777845 |
tip_amount | DOUBLE | std | 4.147060116226972 |
tip_amount | DOUBLE | q25 | 0.9241944405668933 |
tip_amount | DOUBLE | q50 | 2.8109131511020298 |
tip_amount | DOUBLE | q75 | 4.417960419658873 |
tip_amount | DOUBLE | count | 38310226 |
tip_amount | DOUBLE | null_percentage | 0.00 |
tolls_amount | DOUBLE | min | -91.3 |
tolls_amount | DOUBLE | max | 665.56 |
tolls_amount | DOUBLE | approx_unique | 3088 |
tolls_amount | DOUBLE | avg | 0.5897361378132935 |
tolls_amount | DOUBLE | std | 2.2009994548076426 |
tolls_amount | DOUBLE | q25 | 0.0 |
tolls_amount | DOUBLE | q50 | 0.0 |
tolls_amount | DOUBLE | q75 | 0.0 |
tolls_amount | DOUBLE | count | 38310226 |
tolls_amount | DOUBLE | null_percentage | 0.00 |
improvement_surcharge | DOUBLE | min | -1.0 |
improvement_surcharge | DOUBLE | max | 1.0 |
improvement_surcharge | DOUBLE | approx_unique | 5 |
improvement_surcharge | DOUBLE | avg | 0.9794487273449718 |
improvement_surcharge | DOUBLE | std | 0.19914621328624976 |
improvement_surcharge | DOUBLE | q25 | 1.0 |
improvement_surcharge | DOUBLE | q50 | 1.0 |
improvement_surcharge | DOUBLE | q75 | 1.0 |
improvement_surcharge | DOUBLE | count | 38310226 |
improvement_surcharge | DOUBLE | null_percentage | 0.00 |
total_amount | DOUBLE | min | -1094.05 |
total_amount | DOUBLE | max | 386987.63 |
total_amount | DOUBLE | approx_unique | 34403 |
total_amount | DOUBLE | avg | 28.461941516269675 |
total_amount | DOUBLE | std | 77.12820916916066 |
total_amount | DOUBLE | q25 | 15.89257855603436 |
total_amount | DOUBLE | q50 | 21.04532936341972 |
total_amount | DOUBLE | q75 | 30.807221512535367 |
total_amount | DOUBLE | count | 38310226 |
total_amount | DOUBLE | null_percentage | 0.00 |
congestion_surcharge | DOUBLE | min | -2.5 |
congestion_surcharge | DOUBLE | max | 2.75 |
congestion_surcharge | DOUBLE | approx_unique | 7 |
congestion_surcharge | DOUBLE | avg | 2.264610487537185 |
congestion_surcharge | DOUBLE | std | 0.7971480634151977 |
congestion_surcharge | DOUBLE | q25 | 2.5 |
congestion_surcharge | DOUBLE | q50 | 2.5 |
congestion_surcharge | DOUBLE | q75 | 2.5 |
congestion_surcharge | DOUBLE | count | 38310226 |
congestion_surcharge | DOUBLE | null_percentage | 3.42 |
airport_fee | DOUBLE | min | -1.75 |
airport_fee | DOUBLE | max | 1.75 |
airport_fee | DOUBLE | approx_unique | 7 |
airport_fee | DOUBLE | avg | 0.14006659032611937 |
airport_fee | DOUBLE | std | 0.46864290277197884 |
airport_fee | DOUBLE | q25 | 0.0 |
airport_fee | DOUBLE | q50 | 0.0 |
airport_fee | DOUBLE | q75 | 0.0 |
airport_fee | DOUBLE | count | 38310226 |
airport_fee | DOUBLE | null_percentage | 3.42 |
This can really help you get a quick overview of your data right at the CLI.
Consider outputing it to JSON so you can perform some pipeline checks.