You have CSV data and need to transform it to JSON, Parquet, etc.
Solution
Use DuckDB’s CLI directly or COPY TO.
Discussion
While we now have an abundance of (non-Java-based) command-line tooling for converting between data formats, we can toss much of it aside thanks to DuckDB.
Converting to JSON
Let’s first go over a very simple idiom you may be tempted to reach for (deliberately limiting the input to 3 lines to avoid DoS’ing the book):
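(A minimal sketch of that idiom; data/customers-100000.csv is the file used throughout this recipe, and the -json flag simply switches the CLI’s output mode to JSON.)

duckdb -json -c "SELECT * FROM read_csv('data/customers-100000.csv') LIMIT 3"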
This worked! We now have JSON from CSV and can go about our merry way.
But this isn’t some trivial task we’re doing, like piping ls -al to less (or more, or bat). This is a data pipeline operation, which means we should be a bit more thorough.
Watch Your Schemas
While a raw, automatic conversion may be fine for a one-off operation, we should sit down with whoever needs this data in JSON format to have a discussion about the data contract. If that person is you, I will 100% not judge you for having a conversation with yourself.
The goal of this operation needs to be ensuring that whatever’s in the source data ends up in the destination data in the expected way.
First, make sure the schema DuckDB chose is what is expected on the other end:
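(A sketch of that check; DESCRIBE reports each column name along with the type DuckDB sniffed for it.)

duckdb -c "DESCRIBE FROM read_csv('data/customers-100000.csv')"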
If so, you should codify that schema and make sure DuckDB uses it every time. R folks who are {tidyverse} users will be familiar with this idiom. In fact, we’ll show what R’s equivalent operation readr::spec_csv() does:
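(For reference, the call itself is just the following; its cols(…) output lists each column alongside the type readr guessed, and that is what we want to replicate. This sketch assumes the {readr} package is installed.)

readr::spec_csv('data/customers-100000.csv')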
In R we can just copy the cols(…) output and use it in a call to readr::read_csv(). There is no equivalent operation in DuckDB (yet). But we’re savvy data engineers, so let’s fill this gap.
We can use the -json flag to have the output of DESCRIBE turned into JSON, then get a little help from jq to do the same thing as readr::spec_csv():
duckdb -json -c "
  SELECT column_name, column_type
    FROM (
      DESCRIBE FROM read_csv('data/customers-100000.csv')
    )
" |
  jq 'reduce .[] as $item ({}; .[$item.column_name] = $item.column_type)'
Perhaps we were a bit hasty in suggesting we toss jq to the wind?
While there is a way to do this same operation in DuckDB, the SQL is very ugly, and we have no way to use a stored procedure or user-defined function (UDF) in ways we can in other database ecosystems.
Using our example, we can define a Bash function for our jq operation that makes it look scarily close to R:
spec_csv() {
  duckdb -json -c "
    SELECT column_name, column_type
      FROM (
        DESCRIBE FROM read_csv('${1}')
      )
  " |
    jq 'reduce .[] as $item ({}; .[$item.column_name] = $item.column_type)'
}

spec_csv 'data/customers-100000.csv'
Now we can use that with future data pipeline calls to DuckDB’s read_csv().
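(A hedged sketch of what that looks like. The column names and types in the columns struct are illustrative guesses for this dataset, so paste in whatever your spec_csv call actually printed; supplying columns turns off DuckDB’s type sniffing entirely.)

duckdb -c "
  SELECT count(*) AS n_customers
    FROM read_csv(
      'data/customers-100000.csv',
      header = true,
      -- illustrative schema: replace with the spec your spec_csv call printed
      columns = {
        'Index': 'BIGINT',
        'Customer Id': 'VARCHAR',
        'First Name': 'VARCHAR',
        'Last Name': 'VARCHAR',
        'Company': 'VARCHAR',
        'City': 'VARCHAR',
        'Country': 'VARCHAR',
        'Phone 1': 'VARCHAR',
        'Phone 2': 'VARCHAR',
        'Email': 'VARCHAR',
        'Subscription Date': 'DATE',
        'Website': 'VARCHAR'
      }
    )"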
Strongly consider having some other data pipeline contracts in place, such as normalizing the column names the way R’s janitor::make_clean_names() does. We can make a Bash helper function (or two) for that as well:
colnameify() {
  local input="$1"
  local colname

  # Transliterate to plain ASCII
  colname=$(echo "$input" | iconv -t ascii//TRANSLIT)
  # Collapse runs of non-alphanumeric characters into a single underscore
  colname=$(echo "$colname" | sed -E 's/[^a-zA-Z0-9]+/_/g')
  # Trim leading/trailing underscores
  colname=$(echo "$colname" | sed -E 's/^_+|_+$//g')
  # Lowercase everything
  colname=$(echo "$colname" | tr '[:upper:]' '[:lower:]')

  echo "$colname"
}

make_clean_names() {
  local colnames
  colnames=$(duckdb -list -noheader -c "
    SELECT column_name
      FROM (
        DESCRIBE FROM read_csv('$1')
      )
  ")

  while read -r colname; do
    colnameify "${colname}"
  done <<< "${colnames}"
}

make_clean_names 'data/customers-100000.csv'
index
customer_id
first_name
last_name
company
city
country
phone_1
phone_2
email
subscription_date
website
Using COPY TO … To Convert To JSON And Other Formats
The CLI flags for various output “conversions” are great, but you get far more control over the output with the COPY statement.
There are some generic options you should take time to pore over and memorize.
One thing to note: you will almost certainly have to take any CSV pipeline work and stick it in a CREATE TABLE statement for COPY TO to work, as it needs materialized data to operate on. That means you may need to migrate from using an in-memory database (the default) to a persistent one. Don’t worry! All that means is ensuring you add a temporary-pipeline.ddb filename (or any name you want to use) to the duckdb CLI call.
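Here is a minimal sketch of that flow under those assumptions (the customers table name and output filenames are placeholders; the SQL is fed to the CLI via a heredoc, and the temporary-pipeline.ddb argument is what makes the database persistent):

duckdb temporary-pipeline.ddb <<'SQL'
-- Materialize the CSV pipeline work so COPY … TO has a table to operate on
CREATE TABLE customers AS
  SELECT * FROM read_csv('data/customers-100000.csv');

-- Parquet, with an explicit compression codec
COPY customers TO 'customers.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);

-- Newline-delimited JSON by default; add ARRAY true to emit one JSON array instead
COPY customers TO 'customers.json' (FORMAT JSON);
SQL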
I strongly suggest taking an hour to play around with the various output formats and options to see what they generate. Also, take note of the systems these data files will be used in post-conversion: you’ll want to tune both the JSON and Parquet outputs to the needs of environments such as Athena/Presto and other “big data” systems.
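For example, here is one hedged sketch of a Parquet export tuned for a Hive-partitioned data lake layout; it assumes the customers table from the previous sketch, a Country column (check your actual header names), and illustrative option values:

-- Run in the same temporary-pipeline.ddb session as before
COPY customers TO 'customers_by_country'
  (FORMAT PARQUET, PARTITION_BY (Country), COMPRESSION ZSTD, ROW_GROUP_SIZE 100000);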