Read JSON Files From The Filesystem

Problem

You need to read and work with JSON data stored on your your filesystem.

Solution

Treat JSON files as database tables

Discussion

This is a bit more complicated than dealing with the nice, rectangular data stored in CSV files since JSON can come in many shapes and sizes, only two of which are “equivalent” to rectangular.

We’ll use some data from JSON Editor Online (that I will also mangle into other formats) in some of these examples. You can find users_100k.json there and other forms of it in the repo (I changed the _ to - as I am not a monster). I’ve also gzip’d them to make them more suitable for a Git repo and because DuckDB groks gzip’d JSON files, as that’s a very common way to store them.

The first kind of rectangular JSON data is an array of named objects. Let’s see what one record from our dataset looks like:

jq '[.[20]]' data/users-100k.json
[
  {
    "id": 20,
    "name": "Chloe",
    "city": "San Francisco",
    "age": 85,
    "friends": [
      {
        "name": "Mateo",
        "hobbies": [
          "Quilting",
          "Martial Arts"
        ]
      },
      {
        "name": "Ava",
        "hobbies": [
          "Socializing",
          "Painting",
          "Fishing"
        ]
      },
      {
        "name": "Jack",
        "hobbies": [
          "Playing Cards",
          "Tennis"
        ]
      }
    ]
  }
]

I deliberately left it in “array of objects form”. Let’s read it in with DuckDB:

duckdb -c "
FROM 'data/users-100k.json.gz' LIMIT 10
"
┌───────┬───────────┬─────────────┬───────┬────────────────────────────────────────────────────────────────────────────┐
│  id   │   name    │    city     │  age  │                                  friends                                   │
│ int64 │  varchar  │   varchar   │ int64 │                struct("name" varchar, hobbies varchar[])[]                 │
├───────┼───────────┼─────────────┼───────┼────────────────────────────────────────────────────────────────────────────┤
│     0 │ Daniel    │ San Diego   │    48 │ [{'name': Charlotte, 'hobbies': [Quilting, Writing, Music]}, {'name': Le…  │
│     1 │ Sarah     │ Lahaina     │    39 │ [{'name': Mia, 'hobbies': [Church Activities, Writing, Collecting]}, {'n…  │
│     2 │ Charlotte │ Honolulu    │    49 │ [{'name': Chris, 'hobbies': [Socializing, Writing]}, {'name': Victoria, …  │
│     3 │ Noah      │ St. Louis   │    89 │ [{'name': Sarah, 'hobbies': [Board Games, Woodworking, Skiing & Snowboar…  │
│     4 │ Daniel    │ San Diego   │    87 │ [{'name': Michael, 'hobbies': [Genealogy, Fishing]}, {'name': Noah, 'hob…  │
│     5 │ Amelia    │ Portland    │    57 │ [{'name': Noah, 'hobbies': [Watching Sports, Cooking, Golf]}, {'name': E…  │
│     6 │ Levi      │ Las Vegas   │    23 │ [{'name': Grace, 'hobbies': [Bicycling, Watching Sports]}, {'name': Char…  │
│     7 │ Noah      │ Los Angeles │    76 │ [{'name': Leo, 'hobbies': [Genealogy, Martial Arts]}, {'name': Emily, 'h…  │
│     8 │ Isabella  │ Honolulu    │    25 │ [{'name': Sarah, 'hobbies': [Quilting, Yoga]}, {'name': Victoria, 'hobbi…  │
│     9 │ Leo       │ Portland    │    56 │ [{'name': Grace, 'hobbies': [Golf, Eating Out]}, {'name': Oliver, 'hobbi…  │
├───────┴───────────┴─────────────┴───────┴────────────────────────────────────────────────────────────────────────────┤
│ 10 rows                                                                                                    5 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

The ability to read JSON data comes in the form of an extension. Normally we have to install and load extensions (which live in versioned directories in ~/.duckdb/extensions/), but JSON is built-in to DuckDB since it is a very common use-case.

The friends field is an interesting one since it contains contains a nested JSON array of objects. Let’s get a list of all the friends of anyone who is aged 44 (choses as it has the fewest records):

duckdb -c "
FROM 'data/users-100k.json.gz'
SELECT 
  UNNEST(friends) AS friend
WHERE
  age = 44
"
┌───────────────────────────────────────────────────────────────────────┐
│                                friend                                 │
│               struct("name" varchar, hobbies varchar[])               │
├───────────────────────────────────────────────────────────────────────┤
│ {'name': Daniel, 'hobbies': [Quilting, Podcasts, Martial Arts]}       │
│ {'name': Camila, 'hobbies': [Fishing, Traveling]}                     │
│ {'name': Liam, 'hobbies': [Reading, Volunteer Work, Television]}      │
│ {'name': Michelle, 'hobbies': [Video Games, Running, Tennis]}         │
│ {'name': Robert, 'hobbies': [Reading, Podcasts, Housework]}           │
│ {'name': Jack, 'hobbies': [Skiing & Snowboarding, Golf, Board Games]} │
│ {'name': Leo, 'hobbies': [Jewelry Making, Video Games, Calligraphy]}  │
│ {'name': Emily, 'hobbies': [Socializing, Television]}                 │
│ {'name': Emma, 'hobbies': [Reading, Jewelry Making, Podcasts]}        │
│ {'name': Sophie, 'hobbies': [Team Sports, Podcasts, Running]}         │
│ {'name': Grace, 'hobbies': [Calligraphy, Television, Music]}          │
│ {'name': Chris, 'hobbies': [Gardening, Church Activities, Shopping]}  │
│ {'name': Levi, 'hobbies': [Calligraphy, Jewelry Making]}              │
│ {'name': John, 'hobbies': [Walking, Volunteer Work]}                  │
│ {'name': Oliver, 'hobbies': [Martial Arts, Shopping]}                 │
│ {'name': Chloe, 'hobbies': [Quilting, Yoga, Cooking]}                 │
│ {'name': Joe, 'hobbies': [Television, Painting]}                      │
│ {'name': Joe, 'hobbies': [Woodworking, Movie Watching, Martial Arts]} │
│ {'name': Emily, 'hobbies': [Quilting, Traveling]}                     │
│ {'name': Emily, 'hobbies': [Music, Dancing, Socializing]}             │
│                           ·                                           │
│                           ·                                           │
│                           ·                                           │
│ {'name': Kevin, 'hobbies': [Martial Arts, Genealogy]}                 │
│ {'name': Isabella, 'hobbies': [Watching Sports, Tennis, Eating Out]}  │
│ {'name': Lucas, 'hobbies': [Fishing, Traveling]}                      │
│ {'name': Isabella, 'hobbies': [Team Sports, Bicycling]}               │
│ {'name': Leo, 'hobbies': [Running, Podcasts, Walking]}                │
│ {'name': Isabella, 'hobbies': [Music, Collecting]}                    │
│ {'name': Grace, 'hobbies': [Podcasts, Video Games, Reading]}          │
│ {'name': Mia, 'hobbies': [Eating Out, Music]}                         │
│ {'name': Michael, 'hobbies': [Bicycling, Reading]}                    │
│ {'name': Michael, 'hobbies': [Reading, Eating Out]}                   │
│ {'name': Sarah, 'hobbies': [Dancing, Writing]}                        │
│ {'name': Ava, 'hobbies': [Playing Cards, Video Games]}                │
│ {'name': Elijah, 'hobbies': [Gardening, Martial Arts]}                │
│ {'name': Amelia, 'hobbies': [Watching Sports, Team Sports]}           │
│ {'name': Noah, 'hobbies': [Playing Cards, Golf]}                      │
│ {'name': Robert, 'hobbies': [Fishing, Jewelry Making, Podcasts]}      │
│ {'name': Elijah, 'hobbies': [Traveling, Calligraphy]}                 │
│ {'name': Jack, 'hobbies': [Board Games, Collecting, Walking]}         │
│ {'name': Grace, 'hobbies': [Podcasts, Reading]}                       │
│ {'name': Charlotte, 'hobbies': [Housework, Yoga]}                     │
├───────────────────────────────────────────────────────────────────────┤
│                         4541 rows (40 shown)                          │
└───────────────────────────────────────────────────────────────────────┘

DuckDB only turns top-level objects into “normal” DuckDB database field strucures, but that doesn’t mean it’s hard to work with second++ level structures:

duckdb -c "
WITH friends44 AS (
  FROM 'data/users-100k.json.gz'
  SELECT 
    name,
    UNNEST(friends) AS friend
  WHERE
    age = 44
)
FROM friends44
SELECT
  name,
  friend.name AS friend,
  UNNEST(friend.hobbies) AS hobby
"
┌─────────┬───────────┬───────────────────────┐
│  name   │  friend   │         hobby         │
│ varchar │  varchar  │        varchar        │
├─────────┼───────────┼───────────────────────┤
│ Luke    │ Daniel    │ Quilting              │
│ Luke    │ Daniel    │ Podcasts              │
│ Luke    │ Daniel    │ Martial Arts          │
│ Luke    │ Camila    │ Fishing               │
│ Luke    │ Camila    │ Traveling             │
│ Mateo   │ Liam      │ Reading               │
│ Mateo   │ Liam      │ Volunteer Work        │
│ Mateo   │ Liam      │ Television            │
│ Mateo   │ Michelle  │ Video Games           │
│ Mateo   │ Michelle  │ Running               │
│ Mateo   │ Michelle  │ Tennis                │
│ Mateo   │ Robert    │ Reading               │
│ Mateo   │ Robert    │ Podcasts              │
│ Mateo   │ Robert    │ Housework             │
│ Mateo   │ Jack      │ Skiing & Snowboarding │
│ Mateo   │ Jack      │ Golf                  │
│ Mateo   │ Jack      │ Board Games           │
│ Mateo   │ Leo       │ Jewelry Making        │
│ Mateo   │ Leo       │ Video Games           │
│ Mateo   │ Leo       │ Calligraphy           │
│   ·     │  ·        │      ·                │
│   ·     │  ·        │      ·                │
│   ·     │  ·        │      ·                │
│ Grace   │ Ava       │ Playing Cards         │
│ Grace   │ Ava       │ Video Games           │
│ Grace   │ Elijah    │ Gardening             │
│ Grace   │ Elijah    │ Martial Arts          │
│ Grace   │ Amelia    │ Watching Sports       │
│ Grace   │ Amelia    │ Team Sports           │
│ Grace   │ Noah      │ Playing Cards         │
│ Grace   │ Noah      │ Golf                  │
│ Grace   │ Robert    │ Fishing               │
│ Grace   │ Robert    │ Jewelry Making        │
│ Grace   │ Robert    │ Podcasts              │
│ Grace   │ Elijah    │ Traveling             │
│ Grace   │ Elijah    │ Calligraphy           │
│ Luke    │ Jack      │ Board Games           │
│ Luke    │ Jack      │ Collecting            │
│ Luke    │ Jack      │ Walking               │
│ Luke    │ Grace     │ Podcasts              │
│ Luke    │ Grace     │ Reading               │
│ Luke    │ Charlotte │ Housework             │
│ Luke    │ Charlotte │ Yoga                  │
├─────────┴───────────┴───────────────────────┤
│ 11115 rows (40 shown)             3 columns │
└─────────────────────────────────────────────┘

There are some specialized situations where more complex extraction techniques may be necessary. Functions and syntax for those are covered very well in the official documentation.

I converted the file to JSON lines (jsonl) / newline-delimited JSON (ndjson) (ref). This is just the array of objects format without the surrounding brackets, and without the commas separating records, and with each record smushed onto one line. This is great for streaming JSON over a connection and removes some unneded cruft.

It works the same way:

duckdb -c "
FROM 'data/users-100k.ndjson.gz' LIMIT 10
"
┌───────┬───────────┬─────────────┬───────┬────────────────────────────────────────────────────────────────────────────┐
│  id   │   name    │    city     │  age  │                                  friends                                   │
│ int64 │  varchar  │   varchar   │ int64 │                struct("name" varchar, hobbies varchar[])[]                 │
├───────┼───────────┼─────────────┼───────┼────────────────────────────────────────────────────────────────────────────┤
│     0 │ Daniel    │ San Diego   │    48 │ [{'name': Charlotte, 'hobbies': [Quilting, Writing, Music]}, {'name': Le…  │
│     1 │ Sarah     │ Lahaina     │    39 │ [{'name': Mia, 'hobbies': [Church Activities, Writing, Collecting]}, {'n…  │
│     2 │ Charlotte │ Honolulu    │    49 │ [{'name': Chris, 'hobbies': [Socializing, Writing]}, {'name': Victoria, …  │
│     3 │ Noah      │ St. Louis   │    89 │ [{'name': Sarah, 'hobbies': [Board Games, Woodworking, Skiing & Snowboar…  │
│     4 │ Daniel    │ San Diego   │    87 │ [{'name': Michael, 'hobbies': [Genealogy, Fishing]}, {'name': Noah, 'hob…  │
│     5 │ Amelia    │ Portland    │    57 │ [{'name': Noah, 'hobbies': [Watching Sports, Cooking, Golf]}, {'name': E…  │
│     6 │ Levi      │ Las Vegas   │    23 │ [{'name': Grace, 'hobbies': [Bicycling, Watching Sports]}, {'name': Char…  │
│     7 │ Noah      │ Los Angeles │    76 │ [{'name': Leo, 'hobbies': [Genealogy, Martial Arts]}, {'name': Emily, 'h…  │
│     8 │ Isabella  │ Honolulu    │    25 │ [{'name': Sarah, 'hobbies': [Quilting, Yoga]}, {'name': Victoria, 'hobbi…  │
│     9 │ Leo       │ Portland    │    56 │ [{'name': Grace, 'hobbies': [Golf, Eating Out]}, {'name': Oliver, 'hobbi…  │
├───────┴───────────┴─────────────┴───────┴────────────────────────────────────────────────────────────────────────────┤
│ 10 rows                                                                                                    5 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

and, there are explicit readers for it as well.

DuckDB can also read “unstructured” JSON. The examples they show in the official documentation are far too “structured”/clean, though, so I grabbed a dataset I use everyday at my current employer (ref: tag-metadata.json in the repo). This is a sample:

jq '{ metadata: [.metadata[900]], vpn_services: [.vpn_services[10]] }' data/tag-metadata.json
{
  "metadata": [
    {
      "id": "73ec56cf-f2cb-4052-85dd-f82a6ec28bd3",
      "label": "ZIMBRA_COLLABORATION_SUITE_XXE_ATTEMPT",
      "slug": "zimbra-collaboration-suite-xxe-attempt",
      "name": "Zimbra Collaboration Suite XXE Attempt",
      "category": "activity",
      "intention": "malicious",
      "description": "IP addresses with this tag have been observed attempting to exploit CVE-2019-9670, an XXE vulnerability in Synacor Zimbra Collaboration Suite 8.7.x before 8.7.11p10.",
      "references": [
        "https://www.exploit-db.com/exploits/46693/",
        "https://wiki.zimbra.com/wiki/Zimbra_Security_Advisories",
        "https://bugzilla.zimbra.com/show_bug.cgi?id=109129",
        "http://www.rapid7.com/db/modules/exploit/linux/http/zimbra_xxe_rce",
        "http://packetstormsecurity.com/files/152487/Zimbra-Collaboration-Autodiscover-Servlet-XXE-ProxyServlet-SSRF.html",
        "https://isc.sans.edu/forums/diary/CVE20199670+Zimbra+Collaboration+Suite+XXE+vulnerability/27570/"
      ],
      "recommend_block": true,
      "cves": [
        "CVE-2019-9670"
      ],
      "created_at": "2022-01-18",
      "related_tags": []
    }
  ],
  "vpn_services": [
    "CLOUDVPN_VPN"
  ]
}

Truth be told it’s an API response, and many API responses have a field with summary cruft along with the actual data response.

We can get all the cves from this “unstructured” data pretty easily with DuckDb:

duckdb -c "
WITH metadata AS (
  FROM read_json_auto('data/tag-metadata.json', sample_size=-1)
  SELECT
    UNNEST(metadata) AS m
)
FROM metadata
SELECT DISTINCT
  UNNEST(m.cves) AS cve
ORDER BY 1 DESC
"
┌────────────────┐
│      cve       │
│    varchar     │
├────────────────┤
│ CVE-2024-27199 │
│ CVE-2024-27198 │
│ CVE-2024-25600 │
│ CVE-2024-22024 │
│ CVE-2024-21893 │
│ CVE-2024-21887 │
│ CVE-2024-21762 │
│ CVE-2024-21591 │
│ CVE-2024-1709  │
│ CVE-2024-1071  │
│ CVE-2024-0292  │
│ CVE-2024-0204  │
│ CVE-2023-7028  │
│ CVE-2023-6875  │
│ CVE-2023-6634  │
│ CVE-2023-6553  │
│ CVE-2023-6023  │
│ CVE-2023-51971 │
│ CVE-2023-51467 │
│ CVE-2023-50917 │
│       ·        │
│       ·        │
│       ·        │
│ CVE-2007-3010  │
│ CVE-2006-4602  │
│ CVE-2006-4000  │
│ CVE-2006-2529  │
│ CVE-2006-2237  │
│ CVE-2006-0656  │
│ CVE-2005-4593  │
│ CVE-2005-3299  │
│ CVE-2005-2847  │
│ CVE-2005-2773  │
│ CVE-2005-0116  │
│ CVE-2004-2687  │
│ CVE-2004-1083  │
│ CVE-2003-1157  │
│ CVE-2003-0050  │
│ CVE-2002-1042  │
│ CVE-2001-0924  │
│ CVE-2000-0126  │
│ CVE-1999-0800  │
│ CVE-1999-0526  │
├────────────────┤
│    771 rows    │
│   (40 shown)   │
└────────────────┘

While you can do some extraction tricks with oddly formatted JSON files, I usually try to get any JSON data I need to work with in either ndjson or array of objects format.