Read Many JSON Files From The Filesystem

Problem

You need to read and work with a directory of similarly structured JSON files.

Solution

Treat the directory of JSON files as single database table.

Discussion

This is one of DuckDB’s superpowers. The developers seem to know what we data folk have to put up with on a daily basis and have made wonderful affordances for many common idioms.

We’ll use some “March Madness” bracket data from a 538 project that I converted to ndjson for a quick example. It has 61 ndjson files, one for each of the brackets. We’ll read them all in and also keep the filename around since the bracket number does not seem to be in the schema (please be kind to your future self and others and include this info in the data file).

duckdb -c "
DESCRIBE
FROM read_json('data/538-bracket/*.json', filename=true)
"
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │  null   │   key   │ default │  extra  │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ team_id     │ BIGINT      │ YES     │         │         │         │
│ team_name   │ VARCHAR     │ YES     │         │         │         │
│ team_seed   │ VARCHAR     │ YES     │         │         │         │
│ team_region │ VARCHAR     │ YES     │         │         │         │
│ playin_flag │ BIGINT      │ YES     │         │         │         │
│ team_alive  │ BIGINT      │ YES     │         │         │         │
│ rd1_win     │ DOUBLE      │ YES     │         │         │         │
│ rd2_win     │ DOUBLE      │ YES     │         │         │         │
│ rd3_win     │ DOUBLE      │ YES     │         │         │         │
│ rd4_win     │ DOUBLE      │ YES     │         │         │         │
│ rd5_win     │ DOUBLE      │ YES     │         │         │         │
│ rd6_win     │ DOUBLE      │ YES     │         │         │         │
│ rd7_win     │ DOUBLE      │ YES     │         │         │         │
│ win_odds    │ JSON        │ YES     │         │         │         │
│ timestamp   │ VARCHAR     │ YES     │         │         │         │
│ filename    │ VARCHAR     │ YES     │         │         │         │
├─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤
│ 16 rows                                                 6 columns │
└───────────────────────────────────────────────────────────────────┘

Take note that there’s a timestamp field in the schema, but that field only appears in 21 of the 61 JSON files. DuckDB handles this well, but you should always make a thorough inspection of your data before blindly letting these conversions take place.

That timestamp is in a gosh awful format: 3/20/2014 10:31 PM, but we can help DuckDB deal with it and let it know it should be in all the files:

duckdb -markdown -c "
FROM read_json('data/538-bracket/*.json', filename=true, timestampformat='%m/%d/%Y %I:%M %p',
  columns= {
    'team_id': 'BIGINT',
    'team_name': 'VARCHAR',
    'team_seed': 'VARCHAR',
    'team_region': 'VARCHAR',
    'playin_flag': 'BIGINT',
    'team_alive': 'BIGINT',
    'rd1_win': 'DOUBLE',
    'rd2_win': 'DOUBLE',
    'rd3_win': 'DOUBLE',
    'rd4_win': 'DOUBLE',
    'rd5_win': 'DOUBLE',
    'rd6_win': 'DOUBLE',
    'rd7_win': 'DOUBLE',
    'win_odds': 'JSON',
    'timestamp': 'TIMESTAMP'
})
SELECT DISTINCT
  replace(parse_path(filename, '/')[-1], '.json', '') AS bracket, 
  timestamp
WHERE timestamp IS NOT NULL
ORDER BY timestamp
"
bracket timestamp
bracket-06 2014-03-20 14:27:00
bracket-07 2014-03-20 15:33:00
bracket-08 2014-03-20 15:51:00
bracket-09 2014-03-20 16:31:00
bracket-10 2014-03-20 16:59:00
bracket-11 2014-03-20 17:48:00
bracket-12 2014-03-20 19:00:00
bracket-13 2014-03-20 20:10:00
bracket-14 2014-03-20 21:07:00
bracket-15 2014-03-20 21:29:00
bracket-16 2014-03-20 22:01:00
bracket-17 2014-03-20 22:31:00
bracket-18 2014-03-21 00:37:00
bracket-19 2014-03-21 00:38:00
bracket-20 2014-03-21 01:37:00
bracket-21 2014-03-21 01:39:00
bracket-41 2014-03-23 15:04:00
bracket-44 2014-03-23 21:28:00
bracket-46 2014-03-23 22:40:00
bracket-47 2014-03-23 23:09:00
bracket-48 2014-03-23 23:53:00

Notice how we did not put filename in the schema. That’s because DuckDB adds it later on in the process.

Remember, you can go crazy with the file/path glob pattern, too.