Read A CSV From Standard Input (`stdin`)

Problem

You need to read and work with CSV data that is streaming into stdin.

Solution

Leverage DuckDB’s build-in methods for working with data from stdin.

Discussion

There are times when you don’t feel like taking the extra step make a temporary file associated with a running shell command process just to get access to the output from said command. One example might be analyzing various aspects of Git history.

We can take advantage of DuckDB’s ability to read from stdin to avoid creating unnecessary cruft.

Let’s see who are the top authors of Git commits to RStudio’s IDE (what, you don’t keep an up-to-date fork/clone of RStudio lying around?):

cd ~/Development/clones/rstudio
git pull > /dev/null # get the latest commits
git log --pretty=format:%as,%an | \
  duckdb -c "
FROM read_csv('/dev/stdin', columns = { 
  'day': 'DATE',
  'author': 'VARCHAR'
})
SELECT 
  author, 
  COUNT(*) AS n_commits
GROUP BY 1
ORDER BY 2 DESC
"

┌─────────────────────┬───────────┐
│       author        │ n_commits │
│       varchar       │   int64   │
├─────────────────────┼───────────┤
│ JJ Allaire          │     12302 │
│ Kevin Ushey         │      7196 │
│ Jonathan McPherson  │      5551 │
│ Gary Ritchie        │      2649 │
│ Gary                │      1863 │
│ Javier Luraschi     │      1848 │
│ Jonathan            │      1598 │
│ Joe Cheng           │      1568 │
│ Maria Semple        │       907 │
│ melissa-barca       │       698 │
│ Charles Teague      │       666 │
│ karl@rstudio.com    │       477 │
│ Matheus Tavares     │       299 │
│ Jackie Gutman       │       291 │
│ Tim Mok             │       288 │
│ Jeff Vroom          │       273 │
│ Karl Feinauer       │       270 │
│ Melissa Barca       │       240 │
│ Jacqueline Gutman   │       212 │
│ Romain François     │       186 │
│     ·               │         · │
│     ·               │         · │
│     ·               │         · │
│ Adam Majer          │         1 │
│ Jacob Munro         │         1 │
│ Tracy Teal          │         1 │
│ rhinopotamus        │         1 │
│ Christophe Dervieux │         1 │
│ XiangyunHuang       │         1 │
│ giuseppec           │         1 │
│ Jon Harker          │         1 │
│ Thomas Themel       │         1 │
│ Jari Karppinen      │         1 │
│ fishgal64           │         1 │
│ Fredric Johansson   │         1 │
│ AWS Admin           │         1 │
│ Gábor Csárdi        │         1 │
│ David               │         1 │
│ ej                  │         1 │
│ Carl A. B. Pearson  │         1 │
│ pjd                 │         1 │
│ Sainath Adapa       │         1 │
│ Christian Brueffer  │         1 │
├─────────────────────┴───────────┤
│ 195 rows (40 shown)   2 columns │
└─────────────────────────────────┘

By having this data in a SQL context, we can ask questions that require some wrangling or use of many tools in other contexts. Like, Who is the top author this month?

cd ~/Development/clones/rstudio
git pull > /dev/null # get the latest commits
git log --pretty=format:%as,%an | \
  duckdb -markdown -c "
FROM read_csv('/dev/stdin', 
  auto_detect = false, 
  columns = { 
    'day': 'DATE',
    'author': 'VARCHAR'
  }
)
SELECT
  strftime(date_trunc('month', day), '%Y %B') AS month,  
  author,
  COUNT(*) AS n_commits_this_month
WHERE
  date_trunc('month', day) = date_trunc('month', now()::DATE)
GROUP BY 1, 2
ORDER BY 3 DESC
"

month	author	n_commits_this_month
2024 March	Kevin Ushey	50
2024 March	Gary Ritchie	41
2024 March	olivroy	16
2024 March	Zach Hannum	4
2024 March	GCRev	4
2024 March	Ashley Henry	4
2024 March	Graham Held	3
2024 March	Tim Mok	1
2024 March	Melissa Barca	1
2024 March	petetronic	1
2024 March	Mervin Fansler	1
2024 March	Pete	1
2024 March	Adam Foryś	1

Remember, you can always use tee to store a copy of the data you’re piping into DuckDB if you think you’ll need to reprocess it, especially if generating the CSV is resource intensive.