Read A CSV From Standard Input (stdin)

Problem

You need to read and work with CSV data that is streaming into stdin.

Solution

Leverage DuckDB’s build-in methods for working with data from stdin.

Discussion

There are times when you don’t feel like taking the extra step make a temporary file associated with a running shell command process just to get access to the output from said command. One example might be analyzing various aspects of Git history.

We can take advantage of DuckDB’s ability to read from stdin to avoid creating unnecessary cruft.

Let’s see who are the top authors of Git commits to RStudio’s IDE (what, you don’t keep an up-to-date fork/clone of RStudio lying around?):

cd ~/Development/clones/rstudio
git pull > /dev/null # get the latest commits
git log --pretty=format:%as,%an | \
  duckdb -c "
FROM read_csv('/dev/stdin', columns = { 
  'day': 'DATE',
  'author': 'VARCHAR'
})
SELECT 
  author, 
  COUNT(*) AS n_commits
GROUP BY 1
ORDER BY 2 DESC
"
┌─────────────────────┬───────────┐
│       author        │ n_commits │
│       varchar       │   int64   │
├─────────────────────┼───────────┤
│ JJ Allaire          │     12302 │
│ Kevin Ushey         │      7196 │
│ Jonathan McPherson  │      5551 │
│ Gary Ritchie        │      2649 │
│ Gary                │      1863 │
│ Javier Luraschi     │      1848 │
│ Jonathan            │      1598 │
│ Joe Cheng           │      1568 │
│ Maria Semple        │       907 │
│ melissa-barca       │       698 │
│ Charles Teague      │       666 │
│ karl@rstudio.com    │       477 │
│ Matheus Tavares     │       299 │
│ Jackie Gutman       │       291 │
│ Tim Mok             │       288 │
│ Jeff Vroom          │       273 │
│ Karl Feinauer       │       270 │
│ Melissa Barca       │       240 │
│ Jacqueline Gutman   │       212 │
│ Romain François     │       186 │
│     ·               │         · │
│     ·               │         · │
│     ·               │         · │
│ Adam Majer          │         1 │
│ Jacob Munro         │         1 │
│ Tracy Teal          │         1 │
│ rhinopotamus        │         1 │
│ Christophe Dervieux │         1 │
│ XiangyunHuang       │         1 │
│ giuseppec           │         1 │
│ Jon Harker          │         1 │
│ Thomas Themel       │         1 │
│ Jari Karppinen      │         1 │
│ fishgal64           │         1 │
│ Fredric Johansson   │         1 │
│ AWS Admin           │         1 │
│ Gábor Csárdi        │         1 │
│ David               │         1 │
│ ej                  │         1 │
│ Carl A. B. Pearson  │         1 │
│ pjd                 │         1 │
│ Sainath Adapa       │         1 │
│ Christian Brueffer  │         1 │
├─────────────────────┴───────────┤
│ 195 rows (40 shown)   2 columns │
└─────────────────────────────────┘

By having this data in a SQL context, we can ask questions that require some wrangling or use of many tools in other contexts. Like, Who is the top author this month?

cd ~/Development/clones/rstudio
git pull > /dev/null # get the latest commits
git log --pretty=format:%as,%an | \
  duckdb -markdown -c "
FROM read_csv('/dev/stdin', 
  auto_detect = false, 
  columns = { 
    'day': 'DATE',
    'author': 'VARCHAR'
  }
)
SELECT
  strftime(date_trunc('month', day), '%Y %B') AS month,  
  author,
  COUNT(*) AS n_commits_this_month
WHERE
  date_trunc('month', day) = date_trunc('month', now()::DATE)
GROUP BY 1, 2
ORDER BY 3 DESC
"
month author n_commits_this_month
2024 March Kevin Ushey 50
2024 March Gary Ritchie 41
2024 March olivroy 16
2024 March Zach Hannum 4
2024 March GCRev 4
2024 March Ashley Henry 4
2024 March Graham Held 3
2024 March Tim Mok 1
2024 March Melissa Barca 1
2024 March petetronic 1
2024 March Mervin Fansler 1
2024 March Pete 1
2024 March Adam Foryś 1

Remember, you can always use tee to store a copy of the data you’re piping into DuckDB if you think you’ll need to reprocess it, especially if generating the CSV is resource intensive.