Read A CSV File From The Filesystem

Problem

You need to read and work with CSV data stored on your your filesystem.

Solution

Treat CSV files as database tables.

Discussion

We’ll use some CSV data from Datablist for the examples here, and elsewhere in this book.

We’ll start by grabbing customers-100000.csv (though, as we will see later, we do not necessarily need this data on a filesystem to work with it).

curl \
  --silent \
  --location \
  --output "./data/customers-100000.csv" \
  --url "https://drive.google.com/uc?id=1N1xoxgcw2K3d-49tlchXAWw4wuxLj7EV&export=download"

Let’s take a look at the format:

duckdb -c "
DESCRIBE FROM './data/customers-100000.csv'
"
┌───────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│    column_name    │ column_type │  null   │   key   │ default │  extra  │
│      varchar      │   varchar   │ varchar │ varchar │ varchar │ varchar │
├───────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ Index             │ BIGINT      │ YES     │         │         │         │
│ Customer Id       │ VARCHAR     │ YES     │         │         │         │
│ First Name        │ VARCHAR     │ YES     │         │         │         │
│ Last Name         │ VARCHAR     │ YES     │         │         │         │
│ Company           │ VARCHAR     │ YES     │         │         │         │
│ City              │ VARCHAR     │ YES     │         │         │         │
│ Country           │ VARCHAR     │ YES     │         │         │         │
│ Phone 1           │ VARCHAR     │ YES     │         │         │         │
│ Phone 2           │ VARCHAR     │ YES     │         │         │         │
│ Email             │ VARCHAR     │ YES     │         │         │         │
│ Subscription Date │ DATE        │ YES     │         │         │         │
│ Website           │ VARCHAR     │ YES     │         │         │         │
├───────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤
│ 12 rows                                                       6 columns │
└─────────────────────────────────────────────────────────────────────────┘

Yes, you’re eyes are not failing you.

DuckDB is “FROM first”, and needless SELECTs are optional. This is part of DuckDB’s friendly SQL initiative, which we’ll get into more later.

Note the use of the -c command line option. This tells DuckDB to run the query in the paramater value and exit. It’s one of many ways execute queries on the command line.

From the schema output, we see that DuckDB turned Subscription Date into an actual DATE and figured out the other types, too. You have full control over the schema, too (we’ll get into that later).

So, what does this data actually look like?

duckdb -c "
FROM './data/customers-100000.csv' LIMIT 5
"
┌───────┬─────────────────┬────────────┬───┬──────────────────────┬───────────────────┬──────────────────────┐
│ Index │   Customer Id   │ First Name │ … │        Email         │ Subscription Date │       Website        │
│ int64 │     varchar     │  varchar   │   │       varchar        │       date        │       varchar        │
├───────┼─────────────────┼────────────┼───┼──────────────────────┼───────────────────┼──────────────────────┤
│     1 │ ffeCAb7AbcB0f07 │ Jared      │ … │ gabriellehartman@b…  │ 2021-11-11        │ https://www.mccart…  │
│     2 │ b687FfC4F1600eC │ Marie      │ … │ kstafford@sexton.com │ 2021-05-14        │ http://www.reynold…  │
│     3 │ 9FF9ACbc69dcF9c │ Elijah     │ … │ jeanettecross@brow…  │ 2021-03-17        │ https://neal.com/    │
│     4 │ b49edDB1295FF6E │ Sheryl     │ … │ thomassierra@barre…  │ 2020-09-23        │ https://www.powell…  │
│     5 │ 3dcCbFEB17CCf2E │ Jeremy     │ … │ rubenwatkins@jacob…  │ 2020-09-18        │ https://www.carril…  │
├───────┴─────────────────┴────────────┴───┴──────────────────────┴───────────────────┴──────────────────────┤
│ 5 rows                                                                                12 columns (6 shown) │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Let’s see how many unique customers are in the database:

duckdb <<EOF
FROM 'data/customers-100000.csv'
SELECT 
  COUNT(DISTINCT columns("Customer Id")) AS num_customers;
EOF
┌───────────────┐
│ num_customers │
│     int64     │
├───────────────┤
│        100000 │
└───────────────┘

While the -c flag is helpful, it’s not necessary, and gets in the way when we need to use both " and ' in queries.

The bare quotes way of reading a CSV file is handy, but we had to use the COLUMNS expression to access the fields.

If we use the [read_csv()](https://duckdb.org/docs/data/csv/overview.html#csv-functions) function, we can avoid theCOLUMNS` expression:

duckdb <<EOF
FROM read_csv('data/customers-100000.csv')
SELECT 
  COUNT(DISTINCT "Customer Id") AS num_customers;
EOF
┌───────────────┐
│ num_customers │
│     int64     │
├───────────────┤
│        100000 │
└───────────────┘

You can do all the normal things SQL can do with CSV files, like filtering, ordering, and aggregation.

We’ll cover how to get data out from DuckDB beyond these basic ASCII tables in other chapters.