Customize JSON Read Behavior

Problem

DuckDB’s auto-configuration for JSON reading is not working the way I need it to for some datasets.

Solution

Roll up your sleeves and add some verbosity to the DuckDB JSON reader.

Discussion

DuckDB gives us as much or as little control over how the JSON reader works. Normally, the defaults work well, but we’ve all had one of those JSONs.

The DuckDB online documentation has a wealth of information on all the settings.

One common source of pain is that DuckDB cannot figure out the correct column type due to sampling too few rows. Change sample_size to -1 in a call to read_json() and DuckDB will scan the entire file/input before making assumptions. You can also change it to an arbitrarily large number.

Doing either of those value changes may lead to sub-optimal load performance. It may be better to tell DuckDB the column types up-front.

In the “tags” dataset, DuckDB pokes at all the fields and thinks id is a UUID. I mean, it is, but depending on how it gets stuck into Parquet, it can be a bear to use in other environments:

jq .metadata[] data/tag-metadata.json | \
  duckdb -c "
DESCRIBE
FROM read_json('/dev/stdin', sample_size=-1)
"

┌─────────────────┬────────────────────────────────────────────────────────────┬─────────┬─────────┬─────────┬─────────┐
│   column_name   │                        column_type                         │  null   │   key   │ default │  extra  │
│     varchar     │                          varchar                           │ varchar │ varchar │ varchar │ varchar │
├─────────────────┼────────────────────────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ id              │ UUID                                                       │ YES     │         │         │         │
│ label           │ VARCHAR                                                    │ YES     │         │         │         │
│ slug            │ VARCHAR                                                    │ YES     │         │         │         │
│ name            │ VARCHAR                                                    │ YES     │         │         │         │
│ category        │ VARCHAR                                                    │ YES     │         │         │         │
│ intention       │ VARCHAR                                                    │ YES     │         │         │         │
│ description     │ VARCHAR                                                    │ YES     │         │         │         │
│ references      │ VARCHAR[]                                                  │ YES     │         │         │         │
│ recommend_block │ BOOLEAN                                                    │ YES     │         │         │         │
│ cves            │ VARCHAR[]                                                  │ YES     │         │         │         │
│ created_at      │ DATE                                                       │ YES     │         │         │         │
│ related_tags    │ STRUCT(id UUID, "name" VARCHAR, intention VARCHAR, categ…  │ YES     │         │         │         │
├─────────────────┴────────────────────────────────────────────────────────────┴─────────┴─────────┴─────────┴─────────┤
│ 12 rows                                                                                                    6 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

To fix this, we can tell DuckDB to treat id as a VARCHAR instead of a UUID:

jq -c .metadata[] data/tag-metadata.json | \
  duckdb -c "
DESCRIBE
FROM read_json('/dev/stdin', sample_size=-1, columns = {
  'id':'VARCHAR',
  'label':'VARCHAR',
  'slug':'VARCHAR',
  'name':'VARCHAR',
  'category':'VARCHAR',
  'intention':'VARCHAR',
  'description':'VARCHAR',
  'references':'VARCHAR',
  'recommend_block':'BOOLEAN',
  'cves':'VARCHAR',
  'created_at':'DATE',
  'related_tags':'STRUCT(id VARCHAR, \"name\" VARCHAR, intention VARCHAR, category VARCHAR, slug VARCHAR)[]'
})
"

┌─────────────────┬────────────────────────────────────────────────────────────┬─────────┬─────────┬─────────┬─────────┐
│   column_name   │                        column_type                         │  null   │   key   │ default │  extra  │
│     varchar     │                          varchar                           │ varchar │ varchar │ varchar │ varchar │
├─────────────────┼────────────────────────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ id              │ VARCHAR                                                    │ YES     │         │         │         │
│ label           │ VARCHAR                                                    │ YES     │         │         │         │
│ slug            │ VARCHAR                                                    │ YES     │         │         │         │
│ name            │ VARCHAR                                                    │ YES     │         │         │         │
│ category        │ VARCHAR                                                    │ YES     │         │         │         │
│ intention       │ VARCHAR                                                    │ YES     │         │         │         │
│ description     │ VARCHAR                                                    │ YES     │         │         │         │
│ references      │ VARCHAR                                                    │ YES     │         │         │         │
│ recommend_block │ BOOLEAN                                                    │ YES     │         │         │         │
│ cves            │ VARCHAR                                                    │ YES     │         │         │         │
│ created_at      │ DATE                                                       │ YES     │         │         │         │
│ related_tags    │ STRUCT(id VARCHAR, "name" VARCHAR, intention VARCHAR, ca…  │ YES     │         │         │         │
├─────────────────┴────────────────────────────────────────────────────────────┴─────────┴─────────┴─────────┴─────────┤
│ 12 rows                                                                                                    6 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Note how we also remembered to change the schema type of the id field in the related_tags struct.

As noted in the section on reading CSVs with custom parameters, you can also use ignore_errors to get past any issues reading particular rows, but be sure to diagnose why that was (it’s most likely due to character encoding issues).