A Survey Of Using DuckDB From R

Problem

You want to use DuckDB from R.

Solution

Review your options here, then take a deep dive in subsequent chapters.

Discussion

We have an abundance of DuckDB blessings in the R ecosystem. Even without any packages, it’s possible to use R’s system() or system2() (or some modern package-based equivalents) to shell out to the DuckDB CLI and either read in a JSON/CSV stream of query results, or from a saved file. However, most folks want to use some sort of tailor-made package to perform database wrangling in R.

There are at least four ways of using DuckDB directly in R with the help of some packages. There are separate chapters on each of them, but let’s, first, get an overview of them, with a short example of how to use them in case you just want to experiment on your own.

duckdb: DBI Package for the DuckDB Database Management System

The {duckdb} package comes straight from DuckDB HQ and ships with a full copy of DuckDB as as shared library and has some C++ glue to expose it as a DBI-compatible database. This means you can use it just like you would any other DBI-compatible database, and it’s also possible to use it with the {dplyr} package:

library(duckdb)
library(tidyverse)

duckdb(
  dbdir = ":memory:",
  read_only = FALSE,
  bigint = "numeric",
  config = list()
) |>
  dbConnect() -> duckdb_con

invisible(dbExecute(duckdb_con, "INSTALL httpfs; LOAD httpfs;"))

album_duckdb_df <- tbl(duckdb_con, sql("FROM read_parquet('https://data.hrbrmstr.dev/chinook/parquet/album.parquet')"))

glimpse(album_duckdb_df)
Rows: ??
Columns: 3
Database: DuckDB v0.9.2 [root@Darwin 23.4.0:R 4.2.1/:memory:]
$ AlbumId  <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ Title    <chr> "For Those About To Rock We Salute You", "Balls to the Wall",…
$ ArtistId <int> 1, 2, 2, 1, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 11, 12, 12, 13, 1…

duckdbfs: High Performance Remote File System, Database and ‘Geospatial’ Access Using ‘duckdb’

The {duckdbfs} package builds on {duckdb}, but adds a nice UX layer that’s geared towards working with tabular data in filesystems (local, remote, S3, etc.). It’s my preferred way of working with DuckDB in R.

library(duckdbfs)
library(tidyverse)

album_duckdbfs_df <- open_dataset("https://data.hrbrmstr.dev/chinook/parquet/album.parquet")

glimpse(album_duckdbfs_df)
Rows: ??
Columns: 3
Database: DuckDB v0.9.2 [root@Darwin 23.4.0:R 4.2.1/:memory:]
$ AlbumId  <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ Title    <chr> "For Those About To Rock We Salute You", "Balls to the Wall",…
$ ArtistId <int> 1, 2, 2, 1, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 11, 12, 12, 13, 1…

Since it does some work in the background, you also have access to the DBI connection it created via duckdbfs::cached_connection(). We’ll discuss this more in the chapter dedicated to {duckdbfs}.

duckplyr: A ‘DuckDB’-Backed Version of ‘dplyr’

The {duckplyr} package is from DuckDB Labs and is a drop-in replacement for {dplyr}, and uses the DuckDB C API with modern R’s ALTREP vector interface to provide a fast, native R interface to DuckDB when using local files.

library(duckplyr)
library(tidyverse)

taxi_duckplyr <- df_from_parquet("data/chinook/parquet/album.parquet")

glimpse(taxi_duckplyr)
materializing:
---------------------
--- Relation Tree ---
---------------------
read_parquet(data/chinook/parquet/album.parquet)

---------------------
-- Result Columns  --
---------------------
- AlbumId (INTEGER)
- Title (VARCHAR)
- ArtistId (INTEGER)

Rows: 347
Columns: 3
$ AlbumId  <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ Title    <chr> "For Those About To Rock We Salute You", "Balls to the Wall",…
$ ArtistId <int> 1, 2, 2, 1, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 11, 12, 12, 13, 1…

As you can see, there’s quite a bit going on behind the scenes, which we’ll cover in the dedicated chapter.

DuckDB + {odbc}: Connect to ODBC Compatible Databases (using the DBI Interface)

Yes, the DuckDB folks were thorough enough to provid ODBC drivers. Well, an ODBC driver wrapper around the entire DuckDB library (it’s a 90+ MB driver on macOS). This means we can use it with R’s {odbc} package.

library(odbc)
library(tidyverse)

dbConnect(
  drv = odbc::odbc(),
  Driver = file.path(Sys.getenv("HOME"), "lib/libduckdb_odbc.dylib")
) -> odbc_con

album_odbc_df <- tbl(odbc_con, sql("FROM read_parquet('https://data.hrbrmstr.dev/chinook/parquet/album.parquet')"))

glimpse(album_odbc_df)
Rows: ??
Columns: 3
Database: DuckDB v0.10.1[@localhost/]
$ AlbumId  <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ Title    <chr> "For Those About To Rock We Salute You", "Balls to the Wall",…
$ ArtistId <int> 1, 2, 2, 1, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 11, 12, 12, 13, 1…

Let’s take a more thorough look at each of these ways of using DuckDB from R.