You have CSV data and need to transform it to JSON, Parquet, etc.
Solution
Use DuckDB’s CLI directly or COPY TO.
Discussion
While we now have an abundance of (non-Java-based) command-line tooling for converting between data formats, we can toss much of it aside thanks to DuckDB.
Converting to JSON
Let’s first go over a very simple idiom you may be tempted to reach for (deliberately limiting the input to 3 lines to avoid DoS’ing the book):
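(A minimal sketch of that idiom; data/customers-100000.csv is the file used throughout this recipe, and the -json flag simply switches the CLI’s output mode to JSON.)

duckdb -json -c "SELECT * FROM read_csv('data/customers-100000.csv') LIMIT 3"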
This worked! We now have JSON from CSV and can go about our merry way.
But this isn’t some trivial task we’re doing, like piping ls -al to less (or more, or bat). This is a data pipeline operation, which means we should be a bit more thorough.
Watch Your Schemas
While a raw, automatic conversion may be fine for a one-off operation, we should sit down with whoever needs this data in JSON format to have a discussion about the data contract. If that person is you, I will 100% not judge you for having a conversation with yourself.
The goal of this operation needs to be ensuring that whatever’s in the source data ends up in the destination data in the expected way.
First, make sure the schema DuckDB chose is what is expected on the other end:
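(A sketch of that check; DESCRIBE reports each column name along with the type DuckDB sniffed for it.)

duckdb -c "DESCRIBE FROM read_csv('data/customers-100000.csv')"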
If so, you should codify that schema and make sure DuckDB uses it every time. R folks who are {tidyverse} users will be familiar with this idiom. In fact, we’ll show what R’s equivalent operation readr::spec_csv() does:
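(For reference, the call itself is just the following; its cols(…) output lists each column alongside the type readr guessed, and that is what we want to replicate. This sketch assumes the {readr} package is installed.)

readr::spec_csv('data/customers-100000.csv')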
In R we can just copy the cols(…) output and use it in a call to readr::read_csv(). There is no equivalent operation in DuckDB (yet). But we’re savvy data engineers, so let’s fill this gap.
We can use the -json flag to have the output of DESCRIBE turned into JSON, then get a little help from jq to do the same thing as readr::spec_csv():
duckdb -json -c "
  SELECT column_name, column_type
    FROM (
      DESCRIBE FROM read_csv('data/customers-100000.csv')
    )
" |
  jq 'reduce .[] as $item ({}; .[$item.column_name] = $item.column_type)'
Perhaps we were a bit hasty in suggesting we toss jq to the wind?
While there is a way to do this same operation in DuckDB, the SQL is very ugly, and we have no way to use a stored procedure or user-defined function (UDF) in ways we can in other database ecosystems.
Using our example, we can define a Bash function for our jq operation that makes it look scarily close to R:
spec_csv() {
  duckdb -json -c "
    SELECT column_name, column_type
      FROM (
        DESCRIBE FROM read_csv('${1}')
      )
  " |
    jq 'reduce .[] as $item ({}; .[$item.column_name] = $item.column_type)'
}

spec_csv 'data/customers-100000.csv'
Now we can use that with future data pipeline calls to DuckDB’s read_csv().
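(A hedged sketch of what that looks like. The column names and types in the columns struct are illustrative guesses for this dataset, so paste in whatever your spec_csv call actually printed; supplying columns turns off DuckDB’s type sniffing entirely.)

duckdb -c "
  SELECT count(*) AS n_customers
    FROM read_csv(
      'data/customers-100000.csv',
      header = true,
      -- illustrative schema: replace with the spec your spec_csv call printed
      columns = {
        'Index': 'BIGINT',
        'Customer Id': 'VARCHAR',
        'First Name': 'VARCHAR',
        'Last Name': 'VARCHAR',
        'Company': 'VARCHAR',
        'City': 'VARCHAR',
        'Country': 'VARCHAR',
        'Phone 1': 'VARCHAR',
        'Phone 2': 'VARCHAR',
        'Email': 'VARCHAR',
        'Subscription Date': 'DATE',
        'Website': 'VARCHAR'
      }
    )"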
Strongly consider having some other data pipeline contracts in place, such as normalizing the column names the way R’s janitor::make_clean_names() does. We can make a Bash helper function (or two) for that as well:
colnameify() {
  local input="$1"
  local colname

  # Transliterate to plain ASCII
  colname=$(echo "$input" | iconv -t ascii//TRANSLIT)
  # Collapse runs of non-alphanumeric characters into a single underscore
  colname=$(echo "$colname" | sed -E 's/[^a-zA-Z0-9]+/_/g')
  # Trim leading/trailing underscores
  colname=$(echo "$colname" | sed -E 's/^_+|_+$//g')
  # Lowercase everything
  colname=$(echo "$colname" | tr '[:upper:]' '[:lower:]')

  echo "$colname"
}

make_clean_names() {
  local colnames
  colnames=$(duckdb -list -noheader -c "
    SELECT column_name
      FROM (
        DESCRIBE FROM read_csv('$1')
      )
  ")

  while read -r colname; do
    colnameify "${colname}"
  done <<< "${colnames}"
}

make_clean_names 'data/customers-100000.csv'
index
customer_id
first_name
last_name
company
city
country
phone_1
phone_2
email
subscription_date
website
Using COPY TO … To Convert To JSON And Other Formats
The CLI flags for various output “conversions” are great, but you get far more control over the output with the COPY statement.
There are some generic options you should take time to pore over and memorize.
One thing to note: you will almost certainly have to take any CSV pipeline work and stick it in a CREATE TABLE statement for COPY TO to work, as it needs materialized data to operate on. That means you may need to migrate from using an in-memory database (the default) to a persistent one. Don’t worry! All that means is ensuring you add a temporary-pipeline.ddb filename (or any name you want to use) to the duckdb CLI call.
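Here is a minimal sketch of that flow under those assumptions (the customers table name and output filenames are placeholders; the SQL is fed to the CLI via a heredoc, and the temporary-pipeline.ddb argument is what makes the database persistent):

duckdb temporary-pipeline.ddb <<'SQL'
-- Materialize the CSV pipeline work so COPY … TO has a table to operate on
CREATE TABLE customers AS
  SELECT * FROM read_csv('data/customers-100000.csv');

-- Parquet, with an explicit compression codec
COPY customers TO 'customers.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);

-- Newline-delimited JSON by default; add ARRAY true to emit one JSON array instead
COPY customers TO 'customers.json' (FORMAT JSON);
SQL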
I strongly suggest taking an hour to play around with the various output formats and options to see what they generate. Also, take note of the systems these data files will be used in post-conversion: you’ll want to tune both the JSON and Parquet outputs to the needs of environments such as Athena/Presto and other “big data” systems.
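For example, here is one hedged sketch of a Parquet export tuned for a Hive-partitioned data lake layout; it assumes the customers table from the previous sketch, a Country column (check your actual header names), and illustrative option values:

-- Run in the same temporary-pipeline.ddb session as before
COPY customers TO 'customers_by_country'
  (FORMAT PARQUET, PARTITION_BY (Country), COMPRESSION ZSTD, ROW_GROUP_SIZE 100000);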