Read Many CSV Files From The Filesystem

Problem

You need to read and work with a directory of similarly structured CSVs.

Solution

Treat the directory of CSV files as a single database table.

Discussion

This is one of DuckDB’s superpowers. The developers seem to know what we data folk have to put up with on a daily basis and have made wonderful affordances for many common idioms.

We’ll use some climate data from a FiveThirtyEight (538) project for a quick example. It has ten CSV files, one for each of the stations they chose to include in the study. We’ll read them all in and also keep the filename around, since the station ID does not seem to be in the schema itself (please be kind to your future self and others and include this information in the data file).

duckdb -c "
DESCRIBE
FROM read_csv('data/538-wx/*.csv', filename=true)
"
┌───────────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│      column_name      │ column_type │  null   │   key   │ default │  extra  │
│        varchar        │   varchar   │ varchar │ varchar │ varchar │ varchar │
├───────────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ date                  │ DATE        │ YES     │         │         │         │
│ actual_mean_temp      │ BIGINT      │ YES     │         │         │         │
│ actual_min_temp       │ BIGINT      │ YES     │         │         │         │
│ actual_max_temp       │ BIGINT      │ YES     │         │         │         │
│ average_min_temp      │ BIGINT      │ YES     │         │         │         │
│ average_max_temp      │ BIGINT      │ YES     │         │         │         │
│ record_min_temp       │ BIGINT      │ YES     │         │         │         │
│ record_max_temp       │ BIGINT      │ YES     │         │         │         │
│ record_min_temp_year  │ BIGINT      │ YES     │         │         │         │
│ record_max_temp_year  │ BIGINT      │ YES     │         │         │         │
│ actual_precipitation  │ DOUBLE      │ YES     │         │         │         │
│ average_precipitation │ DOUBLE      │ YES     │         │         │         │
│ record_precipitation  │ DOUBLE      │ YES     │         │         │         │
│ filename              │ VARCHAR     │ YES     │         │         │         │
├───────────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤
│ 14 rows                                                           6 columns │
└─────────────────────────────────────────────────────────────────────────────┘

Now, we can do some comparisons across the stations for a given day:

duckdb -markdown -c "
FROM read_csv('data/538-wx/*.csv', filename=true)
SELECT
  date,
  replace(parse_path(filename, '/')[-1], '.csv', '') AS station,
  actual_max_temp,
  record_max_temp,
  record_max_temp_year
WHERE
  date = '2015-01-26'
ORDER BY
  actual_max_temp DESC
"
| date       | station | actual_max_temp | record_max_temp | record_max_temp_year |
|------------|---------|-----------------|-----------------|----------------------|
| 2015-01-26 | KCQT    | 80              | 86              | 1987                 |
| 2015-01-26 | KPHX    | 71              | 81              | 1987                 |
| 2015-01-26 | KHOU    | 67              | 84              | 1975                 |
| 2015-01-26 | KSEA    | 61              | 61              | 2015                 |
| 2015-01-26 | KJAX    | 60              | 82              | 1950                 |
| 2015-01-26 | KCLT    | 53              | 75              | 1950                 |
| 2015-01-26 | KPHL    | 34              | 74              | 1950                 |
| 2015-01-26 | KNYC    | 31              | 72              | 1950                 |
| 2015-01-26 | KMDW    | 29              | 62              | 1944                 |
| 2015-01-26 | KIND    | 26              | 67              | 1944                 |

You can go crazy with the file/path glob pattern, too.