Using Remote JSON, Parquet, etc. Data Sources With DuckDB WASM
Problem
You want to use DuckDB WASM in-browser to work with a diverse array of remote datasets as tables in a single database.
Solution
Use DuckDB’s httpfs abilities or Observable FileAttachments to wire up many external sources into one DuckDB WASM database instance.
Discussion
If you’ve used Observable notebooks then you are almost certainly familiar with FileAttachments. They are a way to embed the contents of files within a notebook, and any CSV/JSON/Parquet/etc. FileAttachment can be used directly when creating a new DuckDBClient. If you are used to that syntax, it’s possible to keep using it with DuckDB in your own web projects via a small helper function:
import { Library, FileAttachments } from 'https://cdn.jsdelivr.net/npm/@observablehq/stdlib@latest/+esm';

const FileAttachment = FileAttachments((url) => new URL(`${url}`));
// let's see what it creates
await FileAttachment("https://data.hrbrmstr.dev/chinook/json/album.json")
{
"name": {
"value": "https://data.hrbrmstr.dev/chinook/json/album.json",
"writable": false,
"enumerable": true,
"configurable": false
},
"_url": {
"value": "https://data.hrbrmstr.dev/chinook/json/album.json",
"writable": false,
"enumerable": false,
"configurable": false
}
}
We can now use FileAttachment to load up each table from the chinook example in the SQLite CLI chapter:
const dbFA = await DuckDBClient().of({
  album: await FileAttachment("https://data.hrbrmstr.dev/chinook/json/album.json").json(),
  artist: await FileAttachment("https://data.hrbrmstr.dev/chinook/parquet/artist.parquet"),
  customer: await FileAttachment("https://data.hrbrmstr.dev/chinook/csv/customer.csv").csv({typed: true}),
  employee: await FileAttachment("https://data.hrbrmstr.dev/chinook/json/employee.json").json(),
  genre: await FileAttachment("https://data.hrbrmstr.dev/chinook/json/genre.json").json(),
  invoice_items: await FileAttachment("https://data.hrbrmstr.dev/chinook/parquet/invoiceline.parquet"),
  invoice: await FileAttachment("https://data.hrbrmstr.dev/chinook/json/invoice.json").json(),
  mediatype: await FileAttachment("https://data.hrbrmstr.dev/chinook/json/mediatype.json").json(),
  playlist: await FileAttachment("https://data.hrbrmstr.dev/chinook/json/playlist.json").json(),
  playlisttrack: await FileAttachment("https://data.hrbrmstr.dev/chinook/parquet/playlisttrack.parquet"),
  tracks: await FileAttachment("https://data.hrbrmstr.dev/chinook/parquet/track.parquet"),
})
Note the mixing and matching of JSON, CSV, and Parquet files.
We’ll use the same query from that chapter and run it against the database we just loaded into our browser:
const query = `
FROM
  tracks t,
  invoice_items i
SELECT
  t.Name AS Track,
  SUM(i.Quantity) AS "Total Sold"
WHERE
  t.TrackId = i.TrackId
GROUP BY t.Name
ORDER BY "Total Sold" DESC
LIMIT 10
`
await dbFA.sql`${query}`
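If the FROM-first clause order looks unusual, it is DuckDB syntax sugar; the same query in conventional SQL clause order, which DuckDB also accepts unchanged, is:

```sql
SELECT
  t.Name AS Track,
  SUM(i.Quantity) AS "Total Sold"
FROM
  tracks t,
  invoice_items i
WHERE
  t.TrackId = i.TrackId
GROUP BY t.Name
ORDER BY "Total Sold" DESC
LIMIT 10
```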
It’s also possible to start with an empty in-memory DuckDB instance and use DuckDB’s httpfs support to read the files directly:
const dbHTTP = await DuckDBClient().of({});

const tables = [ "album", "artist", "customer", "employee", "genre",
                 "invoice_items", "invoice", "mediatype", "playlist",
                 "playlisttrack", "tracks" ]

for (const tbl of tables) {
  await dbHTTP.query(
    `CREATE TABLE ${tbl} AS (FROM 'https://data.hrbrmstr.dev/chinook/parquet/${tbl}.parquet')`
  )
}
I strongly suggest using the Parquet format for remote files unless they are small: DuckDB can satisfy queries against remote Parquet with HTTP range requests, pulling only the columns and row groups it needs, whereas JSON and CSV files must be downloaded and parsed in full.