Read JSON File From The Web

Problem

You need to read and work with JSON data from the internets.

Solution

Treat JSON web content as database tables

Discussion

Most of the responses we get from APIs ends up being JSON data. This is great, since JSON is easy to work with both inside and outside of DuckDB. However, it’s not so great in that every REST API (friends don’t let friends jsut GraphQL) seems to pick a bespoke format for the responses.

NOTE: reading in remote JSON relies on the httpfs extension being installed. After you do that once, it will auto load when DuckDB sees https URLs.

We’ve already seen how this works in the chapter on installing DuckDB. Let’s re-visit that example:

duckdb -c "
DESCRIBE
FROM read_json_auto('https://api.github.com/repos/duckdb/duckdb/releases/latest') 
"
┌──────────────────┬───────────────────────────────────────────────────────────┬─────────┬─────────┬─────────┬─────────┐
│   column_name    │                        column_type                        │  null   │   key   │ default │  extra  │
│     varchar      │                          varchar                          │ varchar │ varchar │ varchar │ varchar │
├──────────────────┼───────────────────────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ url              │ VARCHAR                                                   │ YES     │         │         │         │
│ assets_url       │ VARCHAR                                                   │ YES     │         │         │         │
│ upload_url       │ VARCHAR                                                   │ YES     │         │         │         │
│ html_url         │ VARCHAR                                                   │ YES     │         │         │         │
│ id               │ BIGINT                                                    │ YES     │         │         │         │
│ author           │ STRUCT(login VARCHAR, id BIGINT, node_id VARCHAR, avata…  │ YES     │         │         │         │
│ node_id          │ VARCHAR                                                   │ YES     │         │         │         │
│ tag_name         │ VARCHAR                                                   │ YES     │         │         │         │
│ target_commitish │ VARCHAR                                                   │ YES     │         │         │         │
│ name             │ VARCHAR                                                   │ YES     │         │         │         │
│ draft            │ BOOLEAN                                                   │ YES     │         │         │         │
│ prerelease       │ BOOLEAN                                                   │ YES     │         │         │         │
│ created_at       │ TIMESTAMP                                                 │ YES     │         │         │         │
│ published_at     │ TIMESTAMP                                                 │ YES     │         │         │         │
│ assets           │ STRUCT(url VARCHAR, id BIGINT, node_id VARCHAR, "name" …  │ YES     │         │         │         │
│ tarball_url      │ VARCHAR                                                   │ YES     │         │         │         │
│ zipball_url      │ VARCHAR                                                   │ YES     │         │         │         │
│ body             │ VARCHAR                                                   │ YES     │         │         │         │
│ reactions        │ STRUCT(url VARCHAR, total_count BIGINT, "+1" BIGINT, "-…  │ YES     │         │         │         │
│ mentions_count   │ BIGINT                                                    │ YES     │         │         │         │
├──────────────────┴───────────────────────────────────────────────────────────┴─────────┴─────────┴─────────┴─────────┤
│ 20 rows                                                                                                    6 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

The response from this GitHub API endpoint is pretty interesting. DuckDB figured out the *_at fields are TIMESTAMPs, and the data was clean enough that we have accessible STRUCT sub-fields in some of the top-level fields.

Let’s try to get all releases:

duckdb -c "
FROM read_json_auto('https://api.github.com/repos/duckdb/duckdb/releases?per_page=100')
SELECT COUNT(*)
"
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│           38 │
└──────────────┘

We got lucky in that there have only been 38 releases. We’re also lucky in that these endpoints are not authenticated. By the time I’m done with this tome, this chapter will be re-written to take advantage of some new capabilities DuckDB will have in sending arbitrary headers along with HTTP requests.

Let’s see how excited folks were about the releases:

duckdb -markdown -c "
WITH ReleaseReactions AS (
  FROM read_json_auto('https://api.github.com/repos/duckdb/duckdb/releases?per_page=100')
  SELECT
    name,
    tag_name,
    UNNEST(reactions)
)
SELECT
  *
  EXCLUDE (url, total_count, confused)
FROM ReleaseReactions
"
name tag_name +1 -1 laugh hooray heart rocket eyes
v0.10.1 Bugfix Release v0.10.1 3 0 0 27 0 12 0
DuckDB 0.10.0 “Fusca” v0.10.0 3 0 0 69 11 33 0
0.9.2 Bugfix Release v0.9.2 14 0 0 0 0 2 0
0.9.1 Bugfix Release v0.9.1 14 0 0 9 0 8 0
0.9.0 Preview Release “Undulata” v0.9.0 9 0 0 30 10 17 0
0.8.1 Bugfix Release v0.8.1 0 0 0 29 0 0 0
0.8.0 Preview Release “Fulvigula” v0.8.0 18 0 0 34 3 2 0
0.7.1 Bugfix Release v0.7.1 6 0 0 14 0 8 0
0.7.0 Preview Release “Labradorius” v0.7.0 15 0 0 21 0 0 0
0.6.1 Bugfix Release v0.6.1 0 0 0 11 0 0 0
0.6.0 Preview Release “Oxyura” v0.6.0 0 0 0 18 0 0 0
0.5.1 Bugfix Release v0.5.1 6 0 0 12 0 2 0
0.5.0 Preview Release “Pulchellus” v0.5.0 11 0 0 8 0 9 0
0.4.0 Preview Release “Ferruginea” v0.4.0 10 0 0 12 0 0 0
0.3.4 Bugfix Release v0.3.4 9 0 0 0 0 0 0
0.3.3 Preview Release “Sansaniensis” v0.3.3 3 0 0 10 1 0 0
0.3.2 Preview Release “Gibberifrons” v0.3.2 2 0 0 13 0 0 0
0.3.1 Preview Release “Spectabilis” v0.3.1 10 0 0 5 0 2 0
0.3.0 Preview Release “Gracilis” v0.3.0 1 0 0 0 0 0 0
0.2.9 Preview Release “Platyrhynchos” v0.2.9 0 0 0 3 0 0 0
0.2.8 Preview Release “Ceruttii” v0.2.8 4 0 0 0 0 2 0
0.2.7 Preview Release “Mollissima” v0.2.7 3 0 0 7 0 0 0
0.2.6 Preview Release “Jamaicensis” v0.2.6
0.2.5 Preview Release “Falcata” v0.2.5
0.2.4 Preview Release “Jubata” v0.2.4
0.2.3 Preview Release “Serrator” v0.2.3
0.2.2 Preview Release “Clypeata” v0.2.2
0.2.1 Preview Release v0.2.1
0.2.0 Preview Release v0.2.0
0.1.9 Preview Release v0.1.9
0.1.8 Preview Release v0.1.8
0.1.7 Preview Release v0.1.7
0.1.6 Preview Release v0.1.6
0.1.5 Preview Release v0.1.5
0.1.3 Preview Release v0.1.3
0.1.2 Preview Release v0.1.2
0.1.1 Preview Release v0.1.1
0.1.0 Preview Release v0.1.0

Ah, there’s that friendly SQL dialect again, making our queries less ugly. This comes in especially handy for gnarly JSON responses.