Read Many CSV Files From The Filesystem

Problem

You need to read and work with a directory of similarly structured CSVs.

Solution

Treat the directory of CSV files as a single database table.

Discussion

This is one of DuckDB’s superpowers. The developers seem to know what we data folk have to put up with on a daily basis and have made wonderful affordances for many common idioms.

We’ll use some climate data from a FiveThirtyEight (538) project for a quick example. It has ten CSV files, one for each of the stations they chose to include in the study. We’ll read them all in and also keep the filename around, since the station ID does not seem to be in the schema itself (please be kind to your future self and others and include this information in the data file).

duckdb -c "
DESCRIBE
FROM read_csv('data/538-wx/*.csv', filename=true)
"
┌───────────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│      column_name      │ column_type │  null   │   key   │ default │  extra  │
│        varchar        │   varchar   │ varchar │ varchar │ varchar │ varchar │
├───────────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ date                  │ DATE        │ YES     │         │         │         │
│ actual_mean_temp      │ BIGINT      │ YES     │         │         │         │
│ actual_min_temp       │ BIGINT      │ YES     │         │         │         │
│ actual_max_temp       │ BIGINT      │ YES     │         │         │         │
│ average_min_temp      │ BIGINT      │ YES     │         │         │         │
│ average_max_temp      │ BIGINT      │ YES     │         │         │         │
│ record_min_temp       │ BIGINT      │ YES     │         │         │         │
│ record_max_temp       │ BIGINT      │ YES     │         │         │         │
│ record_min_temp_year  │ BIGINT      │ YES     │         │         │         │
│ record_max_temp_year  │ BIGINT      │ YES     │         │         │         │
│ actual_precipitation  │ DOUBLE      │ YES     │         │         │         │
│ average_precipitation │ DOUBLE      │ YES     │         │         │         │
│ record_precipitation  │ DOUBLE      │ YES     │         │         │         │
│ filename              │ VARCHAR     │ YES     │         │         │         │
├───────────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤
│ 14 rows                                                           6 columns │
└─────────────────────────────────────────────────────────────────────────────┘

Now, we can do some comparisons across the stations for a given day:

duckdb -markdown -c "
FROM read_csv('data/538-wx/*.csv', filename=true)
SELECT
  date,
  replace(parse_path(filename, '/')[-1], '.csv', '') AS station,
  actual_max_temp,
  record_max_temp,
  record_max_temp_year
WHERE
  date = '2015-01-26'
ORDER BY
  actual_max_temp DESC
"
| date       | station | actual_max_temp | record_max_temp | record_max_temp_year |
|------------|---------|-----------------|-----------------|----------------------|
| 2015-01-26 | KCQT    | 80              | 86              | 1987                 |
| 2015-01-26 | KPHX    | 71              | 81              | 1987                 |
| 2015-01-26 | KHOU    | 67              | 84              | 1975                 |
| 2015-01-26 | KSEA    | 61              | 61              | 2015                 |
| 2015-01-26 | KJAX    | 60              | 82              | 1950                 |
| 2015-01-26 | KCLT    | 53              | 75              | 1950                 |
| 2015-01-26 | KPHL    | 34              | 74              | 1950                 |
| 2015-01-26 | KNYC    | 31              | 72              | 1950                 |
| 2015-01-26 | KMDW    | 29              | 62              | 1944                 |
| 2015-01-26 | KIND    | 26              | 67              | 1944                 |

You can go crazy with the file/path glob pattern, too.