Use R/{tidyverse}-Esque ‘Factors’ In DuckDB

Problem

You’re coming from a R/{tidyverse}-esque environment and really need to be able to use factors for sorting and other operations.

Solution

Use DuckDB’s enums like you would R/{tidyverse} factors.

Discussion

Factors are very handy for ensuring values are in a certain domain and also for turning (sorting) chaos into order.

DuckDB has the concept of an enum type which represents a dictionary data structure with all possible unique values of a column. They’re similar to R’s factor type in that the numeric value is stored in the column, but you have access to the string value. This can save a great deal of space, especially in larger datasets.

Let’s examine factors using this dataset from Pew Research.

duckdb<<EOF
FROM read_ndjson('https://data.hrbrmstr.dev/pew-secure-comms.ndjson')
EOF
┌────────────────────────────┬───────────────────┬───────┐
│          Activity          │      measure      │ value │
│          varchar           │      varchar      │ int64 │
├────────────────────────────┼───────────────────┼───────┤
│ Using a landline           │ Very secure       │    16 │
│ Calling on your cell phone │ Very secure       │     9 │
│ Sending text messages      │ Very secure       │     7 │
│ Sending email              │ Very secure       │     5 │
│ Using chat or IM           │ Very secure       │     4 │
│ Using social media sites   │ Very secure       │     2 │
│ Using a landline           │ Somewhat secure   │    51 │
│ Calling on your cell phone │ Somewhat secure   │    43 │
│ Sending text messages      │ Somewhat secure   │    32 │
│ Sending email              │ Somewhat secure   │    35 │
│ Using chat or IM           │ Somewhat secure   │    25 │
│ Using social media sites   │ Somewhat secure   │    14 │
│ Using a landline           │ Not very secure   │    19 │
│ Calling on your cell phone │ Not very secure   │    29 │
│ Sending text messages      │ Not very secure   │    37 │
│ Sending email              │ Not very secure   │    36 │
│ Using chat or IM           │ Not very secure   │    38 │
│ Using social media sites   │ Not very secure   │    28 │
│ Using a landline           │ Not at all secure │    12 │
│ Calling on your cell phone │ Not at all secure │    17 │
│ Sending text messages      │ Not at all secure │    22 │
│ Sending email              │ Not at all secure │    21 │
│ Using chat or IM           │ Not at all secure │    32 │
│ Using social media sites   │ Not at all secure │    53 │
├────────────────────────────┴───────────────────┴───────┤
│ 24 rows                                      3 columns │
└────────────────────────────────────────────────────────┘

If we were to sort the questions or answers, they’d be in “alphabetical” order. We can changer that by defining an enum. We’ll create one manually for the “feeling” folks had about various communications channels:

duckdb<<EOF
CREATE TYPE feeling AS 
  ENUM ('Activity', 'Very secure', 'Somewhat secure', 'Not very secure', 'Not at all secure');

FROM read_ndjson('https://data.hrbrmstr.dev/pew-secure-comms.ndjson')
SELECT
  activity,
  measure::feeling AS measure,
  value
ORDER BY 2 DESC
EOF
┌──────────────────────┬───────────────────────────────────────────────────────────────────────────────────────┬───────┐
│       Activity       │                                        measure                                        │ value │
│       varchar        │ enum('activity', 'very secure', 'somewhat secure', 'not very secure', 'not at all s…  │ int64 │
├──────────────────────┼───────────────────────────────────────────────────────────────────────────────────────┼───────┤
│ Using a landline     │ Not at all secure                                                                     │    12 │
│ Calling on your ce…  │ Not at all secure                                                                     │    17 │
│ Sending text messa…  │ Not at all secure                                                                     │    22 │
│ Sending email        │ Not at all secure                                                                     │    21 │
│ Using chat or IM     │ Not at all secure                                                                     │    32 │
│ Using social media…  │ Not at all secure                                                                     │    53 │
│ Using a landline     │ Not very secure                                                                       │    19 │
│ Calling on your ce…  │ Not very secure                                                                       │    29 │
│ Sending text messa…  │ Not very secure                                                                       │    37 │
│ Sending email        │ Not very secure                                                                       │    36 │
│ Using chat or IM     │ Not very secure                                                                       │    38 │
│ Using social media…  │ Not very secure                                                                       │    28 │
│ Using a landline     │ Somewhat secure                                                                       │    51 │
│ Calling on your ce…  │ Somewhat secure                                                                       │    43 │
│ Sending text messa…  │ Somewhat secure                                                                       │    32 │
│ Sending email        │ Somewhat secure                                                                       │    35 │
│ Using chat or IM     │ Somewhat secure                                                                       │    25 │
│ Using social media…  │ Somewhat secure                                                                       │    14 │
│ Using a landline     │ Very secure                                                                           │    16 │
│ Calling on your ce…  │ Very secure                                                                           │     9 │
│ Sending text messa…  │ Very secure                                                                           │     7 │
│ Sending email        │ Very secure                                                                           │     5 │
│ Using chat or IM     │ Very secure                                                                           │     4 │
│ Using social media…  │ Very secure                                                                           │     2 │
├──────────────────────┴───────────────────────────────────────────────────────────────────────────────────────┴───────┤
│ 24 rows                                                                                                    3 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

We aren’t limited to statically crafted enums. We can use the results of a SELECT statement to create one as well:

duckdb -markdown<<EOF
CREATE TYPE feeling AS 
  ENUM ('Activity', 'Very secure', 'Somewhat secure', 'Not very secure', 'Not at all secure');

CREATE TABLE responses AS (
  FROM read_ndjson('https://data.hrbrmstr.dev/pew-secure-comms.ndjson')
  SELECT
    activity,
    measure::feeling AS measure,
    value
  ORDER BY 2 DESC
);

CREATE TYPE activity AS ENUM (
  FROM responses
  SELECT 
    activity
  WHERE
    measure = 'Very secure'
  ORDER BY value DESC
);

PIVOT (
  FROM responses
  SELECT
    activity::activity AS activity,
    measure,
    value
  ORDER BY 1, 2
) 
ON measure 
USING first(value)
EOF
activity Not at all secure Not very secure Somewhat secure Very secure
Using a landline 12 19 51 16
Calling on your cell phone 17 29 43 9
Sending text messages 22 37 32 7
Sending email 21 36 35 5
Using chat or IM 32 38 25 4
Using social media sites 53 28 14 2