Quickly Summarize Tables

Problem

You need an overview of your data.

Solution

Use DuckDB’s SUMMARIZE statement.

Discussion

The friendly SQL enhancements in DuckDB include SUMMARIZE, a way to get – min, max, approx_unique, avg, std, q25, q50, q75, and count – right within SQL.

time duckdb -markdown -c "
SUMMARIZE 
  FROM read_parquet('data/taxi/yellow_tripdata_*.parquet')
"
column_name column_type min max approx_unique avg std q25 q50 q75 count null_percentage
VendorID BIGINT 1 6 3 1.738893683373207 0.4443590617384614 1 2 2 38310226 0.00
tpep_pickup_datetime TIMESTAMP 2001-01-01 00:06:49 2024-01-03 19:42:57 19288620 38310226 0.00
tpep_dropoff_datetime TIMESTAMP 1970-01-20 10:16:32 2024-01-03 20:15:55 19759018 38310226 0.00
passenger_count DOUBLE 0.0 9.0 10 1.3704258575541601 0.892592437213934 1.0 1.0 1.0 38310226 3.42
trip_distance DOUBLE 0.0 345729.44 8555 4.088946216082994 241.25089989009854 1.0408642024676638 1.7824496954781088 3.4191231293725886 38310226 0.00
RatecodeID DOUBLE 1.0 99.0 7 1.6419056362728768 7.427350694183615 1.0 1.0 1.0 38310226 3.42
store_and_fwd_flag VARCHAR N Y 2 38310226 3.42
PULocationID BIGINT 1 265 264 165.175477403866 63.99765082928126 132 162 233 38310226 0.00
DOLocationID BIGINT 1 265 263 163.94996043615092 69.85543107793322 114 162 234 38310226 0.00
payment_type BIGINT 0 5 6 1.1845846589367548 0.5562739535077442 1 1 1 38310226 0.00
fare_amount DOUBLE -1087.3 386983.63 15055 19.522250636419308 75.72728960022003 9.294555491126967 13.50798134303114 22.133525367677475 38310226 0.00
extra DOUBLE -39.17 10002.5 208 1.5560566439884778 2.450398738631048 0.0 1.0 2.5 38310226 0.00
mta_tax DOUBLE -0.5 53.16 28 0.4856166729478452 0.10956288620186637 0.5 0.5 0.5 38310226 0.00
tip_amount DOUBLE -411.0 4174.0 7820 3.5222654029777845 4.147060116226941 0.9589593204277058 2.8121417656778673 4.419803150744926 38310226 0.00
tolls_amount DOUBLE -91.3 665.56 3088 0.5897361378132934 2.2009994548076457 0.0 0.0 0.0 38310226 0.00
improvement_surcharge DOUBLE -1.0 1.0 5 0.9794487273449718 0.1991462132862492 1.0 1.0 1.0 38310226 0.00
total_amount DOUBLE -1094.05 386987.63 34403 28.461941516269764 77.1282091691602 15.875794523177571 21.025118352259916 30.826469542053864 38310226 0.00
congestion_surcharge DOUBLE -2.5 2.75 7 2.264610487537185 0.7971480634152025 2.5 2.5 2.5 38310226 3.42
airport_fee DOUBLE -1.75 1.75 7 0.14006659032611937 0.46864290277197507 0.0 0.0 0.0 38310226 3.42

real 0m11.152s user 1m30.298s sys 0m0.348s

If you need that in long format vs. wide, we can take advantage of the friendly SQL UNPIVOT statment.

NOTE: due to limitations with UNPIVOT we have to return the value column as a string. Since this view is mainly for eyballing the output, it’s not a big deal. And, you can always CAST the values back conditionally if you need to.

duckdb -markdown<<EOF
WITH smry AS (
  SELECT * 
  FROM (SUMMARIZE FROM read_parquet('data/taxi/yellow_tripdata_*.parquet'))
)
UNPIVOT smry
ON COLUMNS(* EXCLUDE (column_name, column_type))::VARCHAR 
INTO 
  NAME stat 
  VALUE value
EOF
column_name column_type stat value
VendorID BIGINT min 1
VendorID BIGINT max 6
VendorID BIGINT approx_unique 3
VendorID BIGINT avg 1.738893683373207
VendorID BIGINT std 0.4443590617384622
VendorID BIGINT q25 1
VendorID BIGINT q50 2
VendorID BIGINT q75 2
VendorID BIGINT count 38310226
VendorID BIGINT null_percentage 0.00
tpep_pickup_datetime TIMESTAMP min 2001-01-01 00:06:49
tpep_pickup_datetime TIMESTAMP max 2024-01-03 19:42:57
tpep_pickup_datetime TIMESTAMP approx_unique 19288620
tpep_pickup_datetime TIMESTAMP count 38310226
tpep_pickup_datetime TIMESTAMP null_percentage 0.00
tpep_dropoff_datetime TIMESTAMP min 1970-01-20 10:16:32
tpep_dropoff_datetime TIMESTAMP max 2024-01-03 20:15:55
tpep_dropoff_datetime TIMESTAMP approx_unique 19759018
tpep_dropoff_datetime TIMESTAMP count 38310226
tpep_dropoff_datetime TIMESTAMP null_percentage 0.00
passenger_count DOUBLE min 0.0
passenger_count DOUBLE max 9.0
passenger_count DOUBLE approx_unique 10
passenger_count DOUBLE avg 1.3704258575541601
passenger_count DOUBLE std 0.892592437213935
passenger_count DOUBLE q25 1.0
passenger_count DOUBLE q50 1.0
passenger_count DOUBLE q75 1.0
passenger_count DOUBLE count 38310226
passenger_count DOUBLE null_percentage 3.42
trip_distance DOUBLE min 0.0
trip_distance DOUBLE max 345729.44
trip_distance DOUBLE approx_unique 8555
trip_distance DOUBLE avg 4.088946216082925
trip_distance DOUBLE std 241.25089989010124
trip_distance DOUBLE q25 1.0413981418912506
trip_distance DOUBLE q50 1.7835221915398023
trip_distance DOUBLE q75 3.4192728605631064
trip_distance DOUBLE count 38310226
trip_distance DOUBLE null_percentage 0.00
RatecodeID DOUBLE min 1.0
RatecodeID DOUBLE max 99.0
RatecodeID DOUBLE approx_unique 7
RatecodeID DOUBLE avg 1.6419056362728768
RatecodeID DOUBLE std 7.427350694183523
RatecodeID DOUBLE q25 1.0
RatecodeID DOUBLE q50 1.0
RatecodeID DOUBLE q75 1.0
RatecodeID DOUBLE count 38310226
RatecodeID DOUBLE null_percentage 3.42
store_and_fwd_flag VARCHAR min N
store_and_fwd_flag VARCHAR max Y
store_and_fwd_flag VARCHAR approx_unique 2
store_and_fwd_flag VARCHAR count 38310226
store_and_fwd_flag VARCHAR null_percentage 3.42
PULocationID BIGINT min 1
PULocationID BIGINT max 265
PULocationID BIGINT approx_unique 264
PULocationID BIGINT avg 165.175477403866
PULocationID BIGINT std 63.99765082928099
PULocationID BIGINT q25 132
PULocationID BIGINT q50 162
PULocationID BIGINT q75 234
PULocationID BIGINT count 38310226
PULocationID BIGINT null_percentage 0.00
DOLocationID BIGINT min 1
DOLocationID BIGINT max 265
DOLocationID BIGINT approx_unique 263
DOLocationID BIGINT avg 163.94996043615092
DOLocationID BIGINT std 69.85543107793315
DOLocationID BIGINT q25 114
DOLocationID BIGINT q50 162
DOLocationID BIGINT q75 234
DOLocationID BIGINT count 38310226
DOLocationID BIGINT null_percentage 0.00
payment_type BIGINT min 0
payment_type BIGINT max 5
payment_type BIGINT approx_unique 6
payment_type BIGINT avg 1.1845846589367548
payment_type BIGINT std 0.5562739535077519
payment_type BIGINT q25 1
payment_type BIGINT q50 1
payment_type BIGINT q75 1
payment_type BIGINT count 38310226
payment_type BIGINT null_percentage 0.00
fare_amount DOUBLE min -1087.3
fare_amount DOUBLE max 386983.63
fare_amount DOUBLE approx_unique 15055
fare_amount DOUBLE avg 19.52225063641895
fare_amount DOUBLE std 75.7272896002185
fare_amount DOUBLE q25 9.286774612476815
fare_amount DOUBLE q50 13.512731442356468
fare_amount DOUBLE q75 22.133083641529037
fare_amount DOUBLE count 38310226
fare_amount DOUBLE null_percentage 0.00
extra DOUBLE min -39.17
extra DOUBLE max 10002.5
extra DOUBLE approx_unique 208
extra DOUBLE avg 1.556056643988478
extra DOUBLE std 2.450398738630944
extra DOUBLE q25 0.0
extra DOUBLE q50 1.0
extra DOUBLE q75 2.5
extra DOUBLE count 38310226
extra DOUBLE null_percentage 0.00
mta_tax DOUBLE min -0.5
mta_tax DOUBLE max 53.16
mta_tax DOUBLE approx_unique 28
mta_tax DOUBLE avg 0.4856166729478452
mta_tax DOUBLE std 0.10956288620186624
mta_tax DOUBLE q25 0.5
mta_tax DOUBLE q50 0.5
mta_tax DOUBLE q75 0.5
mta_tax DOUBLE count 38310226
mta_tax DOUBLE null_percentage 0.00
tip_amount DOUBLE min -411.0
tip_amount DOUBLE max 4174.0
tip_amount DOUBLE approx_unique 7820
tip_amount DOUBLE avg 3.5222654029777845
tip_amount DOUBLE std 4.147060116226972
tip_amount DOUBLE q25 0.9241944405668933
tip_amount DOUBLE q50 2.8109131511020298
tip_amount DOUBLE q75 4.417960419658873
tip_amount DOUBLE count 38310226
tip_amount DOUBLE null_percentage 0.00
tolls_amount DOUBLE min -91.3
tolls_amount DOUBLE max 665.56
tolls_amount DOUBLE approx_unique 3088
tolls_amount DOUBLE avg 0.5897361378132935
tolls_amount DOUBLE std 2.2009994548076426
tolls_amount DOUBLE q25 0.0
tolls_amount DOUBLE q50 0.0
tolls_amount DOUBLE q75 0.0
tolls_amount DOUBLE count 38310226
tolls_amount DOUBLE null_percentage 0.00
improvement_surcharge DOUBLE min -1.0
improvement_surcharge DOUBLE max 1.0
improvement_surcharge DOUBLE approx_unique 5
improvement_surcharge DOUBLE avg 0.9794487273449718
improvement_surcharge DOUBLE std 0.19914621328624976
improvement_surcharge DOUBLE q25 1.0
improvement_surcharge DOUBLE q50 1.0
improvement_surcharge DOUBLE q75 1.0
improvement_surcharge DOUBLE count 38310226
improvement_surcharge DOUBLE null_percentage 0.00
total_amount DOUBLE min -1094.05
total_amount DOUBLE max 386987.63
total_amount DOUBLE approx_unique 34403
total_amount DOUBLE avg 28.461941516269675
total_amount DOUBLE std 77.12820916916066
total_amount DOUBLE q25 15.89257855603436
total_amount DOUBLE q50 21.04532936341972
total_amount DOUBLE q75 30.807221512535367
total_amount DOUBLE count 38310226
total_amount DOUBLE null_percentage 0.00
congestion_surcharge DOUBLE min -2.5
congestion_surcharge DOUBLE max 2.75
congestion_surcharge DOUBLE approx_unique 7
congestion_surcharge DOUBLE avg 2.264610487537185
congestion_surcharge DOUBLE std 0.7971480634151977
congestion_surcharge DOUBLE q25 2.5
congestion_surcharge DOUBLE q50 2.5
congestion_surcharge DOUBLE q75 2.5
congestion_surcharge DOUBLE count 38310226
congestion_surcharge DOUBLE null_percentage 3.42
airport_fee DOUBLE min -1.75
airport_fee DOUBLE max 1.75
airport_fee DOUBLE approx_unique 7
airport_fee DOUBLE avg 0.14006659032611937
airport_fee DOUBLE std 0.46864290277197884
airport_fee DOUBLE q25 0.0
airport_fee DOUBLE q50 0.0
airport_fee DOUBLE q75 0.0
airport_fee DOUBLE count 38310226
airport_fee DOUBLE null_percentage 3.42

This can really help you get a quick overview of your data right at the CLI.

Consider outputing it to JSON so you can perform some pipeline checks.