Use DuckDB To Convert JSON To Other Formats

Problem

You have JSON data and need to transform it to CSV, Parquet, etc.

Solution

Use DuckDB’s CLI directly or COPY TO.

Discussion

While we now have an abundance of (non-Java-based) tooling at the command line to convert data formats into other data formats, we can toss many aside thanks to DuckDB.

Converting to CSV

Let’s first go over a very simple idiom you may be tempted to do (deliberately limiting the input to 10 lines to avoid DoS’ing the book):

jq -c .metadata[] data/tag-metadata.json | head -10 | \
  duckdb -csv -c "
FROM read_json('/dev/stdin', sample_size=-1)
"
id,label,slug,name,category,intention,description,references,recommend_block,cves,created_at,related_tags
f9f50d90-0168-11ee-8493-2752010189ff,ORACLE_GLASSFISH_DIRECTORY_TRAVERSAL_ATTEMPT,oracle-glassfish-directory-traversal-attempt,"Oracle Glassfish Directory Traversal Attempt",activity,malicious,"IP addresses with this tag have been observed attempting to exploit a directory traversal vulnerability in Oracle Glassfish Open Source Edition under version 4.1",[https://www.trustwave.com/en-us/resources/security-resources/security-advisories/?fid=18822],false,[],2023-06-02,[]
27ff7ba3-8ba1-450f-9985-d42b169dc410,MOJEEK,mojeek,Mojeek,search_engine,benign,"IP addresses with this tag belong to the Mojeek search engine web crawler.",[https://www.mojeek.com/bot.html],false,[],2020-04-07,[]
075b5b8e-41ce-4794-91ff-3d15750e1b34,DATAGRID_SURFACE,datagrid-surface,"DataGrid Surface",actor,benign,"IP addresses with this tag belong to DataGrid Surface, a startup focused on improving vulnerability data.",[http://datagridsurface.com],false,[],2023-04-19,[]
71cd5767-1490-4c1d-a7be-df797f29a0cf,BABBAR,babbar,Babbar,actor,benign,"IP addresses with this tag belong to Babbar, a data analysis company specializing in large-scale text analysis.",[https://babbar.tech/crawler],false,[],2020-04-07,[]
88a1cfe5-94a9-46b1-a45a-e8c3cbe13135,INTRINSEC,intrinsec,Intrinsec,actor,benign,"IP addresses with this tag belong to Intrinsec, a legitimate security organization that frequently scans the Internet.",[],false,[],2020-04-07,[]
81efae2a-44c1-480d-a259-23261e08a005,QUANTLOADER_SCANNER,quantloader-scanner,QuantLoader,activity,malicious,"IP addresses with this tag exhibit behavior that indicates they are infected with malware that uses QuantLoader, a Trojan downloader.",[https://engineering.salesforce.com/an-in-depth-malware-analysis-of-quantloader-7393439309a5],true,[],2020-04-07,[]
887a3440-48ef-4eb2-ba09-95f57a7275e0,ZTE_ZXV10_H108L_RCE_ATTEMPT,zte-zxv10-h108l-rce-attempt,"ZTE ZXV10 H108L RCE Attempt",activity,malicious,"IP addresses with this tag have been observed attempting to exploit a remote code execution vulnerability in ZTE ZXV10 H108L Routers.","[https://www.fortiguard.com/encyclopedia/ips/47341, https://github.com/stasinopoulos/ZTExploit/]",true,[],2023-06-08,[]
846b7e85-760c-4871-a21c-850f02bd2f06,ANDROXGH0ST,androxgh0st,Androxgh0st,worm,malicious,"IP addresses with this tag have been observed attempting to establish a botnet by utilizing a default version of androxgh0st malware.",[https://www.cisa.gov/news-events/cybersecurity-advisories/aa24-016a],true,[],2024-02-03,[]
c972ef23-95b1-440e-b444-5e23d5708b4e,SOCRADAR,socradar,SOCRadar,actor,benign,"IP addresses with this tag belong to SOCRadar, an MSSP primarily concerned with vulnerability management and threat intelligence.",[https://socradar.io/],false,[],2023-05-05,[]
67f35e5c-b351-4df9-9365-01cb484a6f6f,JOOMLA_CONNECT_BACK_RCE_ATTEMPT,joomla-connect-back-rce-attempt,"Joomla Connect Back RCE Attempt",activity,malicious,"IP addresses with this tag have been observed attempting to exploit an unauthorized file upload vulnerability resulting in remote code execution in Joomla! 1.5.12.",[https://www.exploit-db.com/exploits/11262],true,[],2024-02-01,[]

This worked! We now have CSV from JSON and can go about our merry way.

But, this isn’t some trivial task were doing, like piping ls -al to less (or more, or bat). This is a data pipeline operation, which means we should be a bit more thorough.

Watch Your Schemas And Output Options

This chapter’s sibling on converting CSVs has some solid advice and tools on schema conversion we won’t repeat here.

We need to call out a special case consideration for CSV output, as other output options are far more robuse/resilient. DuckDB knows this, hence an entire section on CSV output options.

I strongly suggest using force_quote to make sure receiving systems can properly import each column. I’ve seen to many pipelines break without doing so. Also, c onsider using delim or sep to pick something different than a , if the data or receiving system warrants it. Sometimes a good ol \t or | just works better.

Using COPY TO … To Convert To CSV And Other Formats

This chapter’s sibling on converting CSVs has some solid advice on COPY TO… that applies to JSON and other formats, so please see that section.