Using Remote JSON, Parquet, etc. Data Sources With DuckDB WASM

Problem

You want to use DuckDB WASM in-browser to work with a diverse array of remote datasets as tables in a single database.

Solution

Use DuckDB’s httpfs abilities or Observable FileAttachments to wire up many external sources into one DuckDB WASM database instance.

Discussion


If you’ve used Observable notebooks then you are almost certainly familiar with FileAttachments. They are a way to embed the contents of files within a notebook, and any CSV/JSON/Parquet/etc. FileAttachment can be used directly when creating a new DuckDBClient. If you are used to that syntax, you can keep using it with DuckDB in your own web projects via a small helper function:

import { Library, FileAttachments } from 'https://cdn.jsdelivr.net/npm/@observablehq/stdlib@latest/+esm';

// FileAttachments() takes a resolver that maps a "name" to a URL;
// here the name is already a full URL, so we pass it straight through.
const FileAttachment = FileAttachments((url) => new URL(url));

// let's see what it creates
await FileAttachment("https://data.hrbrmstr.dev/chinook/json/album.json")
{
  "name": {
    "value": "https://data.hrbrmstr.dev/chinook/json/album.json",
    "writable": false,
    "enumerable": true,
    "configurable": false
  },
  "_url": {
    "value": "https://data.hrbrmstr.dev/chinook/json/album.json",
    "writable": false,
    "enumerable": false,
    "configurable": false
  }
}

We can now use FileAttachment to load up each table from the chinook example in the SQLite CLI chapter:

const dbFA = await DuckDBClient().of({
  album: await FileAttachment("https://data.hrbrmstr.dev/chinook/json/album.json").json(),
  artist: await FileAttachment("https://data.hrbrmstr.dev/chinook/parquet/artist.parquet"),
  customer: await FileAttachment("https://data.hrbrmstr.dev/chinook/csv/customer.csv").csv({typed: true}),
  employee: await FileAttachment("https://data.hrbrmstr.dev/chinook/json/employee.json").json(),
  genre: await FileAttachment("https://data.hrbrmstr.dev/chinook/json/genre.json").json(),
  invoice_items: await FileAttachment("https://data.hrbrmstr.dev/chinook/parquet/invoiceline.parquet"),
  invoice: await FileAttachment("https://data.hrbrmstr.dev/chinook/json/invoice.json").json(),
  mediatype: await FileAttachment("https://data.hrbrmstr.dev/chinook/json/mediatype.json").json(),
  playlist: await FileAttachment("https://data.hrbrmstr.dev/chinook/json/playlist.json").json(),
  playlisttrack: await FileAttachment("https://data.hrbrmstr.dev/chinook/parquet/playlisttrack.parquet"),
  tracks: await FileAttachment("https://data.hrbrmstr.dev/chinook/parquet/track.parquet"),
})

Note the mixing and matching of JSON, CSV, and Parquet files.
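The URLs above all follow one naming scheme (`<base>/<format>/<table>.<format>`), so with this many tables it can be less error-prone to derive each URL from the table name and format. A small sketch (the `attachmentURL` helper is mine, not part of the Observable stdlib):

```javascript
// Hypothetical helper: derive an attachment URL from a table name and
// format, following the data.hrbrmstr.dev naming scheme used above.
const base = "https://data.hrbrmstr.dev/chinook";

function attachmentURL(name, fmt) {
  return `${base}/${fmt}/${name}.${fmt}`;
}

// e.g. FileAttachment(attachmentURL("album", "json")).json()
```

The same helper works for all three formats, since the directory name and file extension match.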

We’ll use the same query from that chapter and run it against the database we just loaded into our browser:

const query = `
FROM 
  tracks t, 
  invoice_items i
SELECT 
  t.Name AS Track, 
  SUM(i.Quantity) AS "Total Sold"
WHERE 
  t.TrackId = i.TrackId
GROUP BY t.Name
ORDER BY "Total Sold" DESC
LIMIT 10
`

await dbFA.query(query)

It’s also possible to start with an empty in-memory DuckDB instance and use DuckDB’s httpfs to read the files:

const dbHTTP = await DuckDBClient().of({});
const tables = [ "album", "artist", "customer", "employee", "genre", 
                 "invoice_items", "invoice", "mediatype", "playlist", 
                 "playlisttrack", "tracks" ]

for (const tbl of tables) {
  await dbHTTP.query(
    `CREATE TABLE ${tbl} AS (FROM 'https://data.hrbrmstr.dev/chinook/parquet/${tbl}.parquet')`
  )
}
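Materializing each file with CREATE TABLE copies all of the remote data into the in-memory database. If you would rather defer (and minimize) the downloads, DuckDB can instead wrap each remote Parquet file in a view, so queries fetch only what they touch. A sketch along the same lines (`makeView` is a helper name I’m introducing here):

```javascript
// Hypothetical helper: build a CREATE VIEW statement that points at a
// remote Parquet file rather than copying its contents locally.
const parquetBase = "https://data.hrbrmstr.dev/chinook/parquet";

function makeView(tbl) {
  return `CREATE VIEW ${tbl} AS FROM '${parquetBase}/${tbl}.parquet'`;
}

// usage, with the same `tables` array and client as above:
// for (const tbl of tables) await dbHTTP.query(makeView(tbl));
```

The trade-off is that every query against a view re-reads from the remote file, so views suit exploratory work while CREATE TABLE suits repeated querying.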

I strongly suggest using the Parquet format for remote files unless they are small: Parquet is columnar, so DuckDB’s httpfs can use HTTP range requests to fetch only the columns and row groups a query touches, whereas CSV and JSON files must be downloaded in full before they can be queried.