Read A CSV File From The Filesystem
Problem
You need to read and work with CSV data stored on your filesystem.
Solution
Treat CSV files as database tables.
Discussion
We’ll use some CSV data from Datablist for the examples here and elsewhere in this book.
We’ll start by grabbing customers-100000.csv (though, as we will see later, we do not necessarily need this data on a filesystem to work with it):
curl \
  --silent \
  --location \
  --output "./data/customers-100000.csv" \
  --url "https://drive.google.com/uc?id=1N1xoxgcw2K3d-49tlchXAWw4wuxLj7EV&export=download"
Let’s take a look at the format:
duckdb -c "
DESCRIBE FROM './data/customers-100000.csv'
"
┌───────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│    column_name    │ column_type │  null   │   key   │ default │  extra  │
│      varchar      │   varchar   │ varchar │ varchar │ varchar │ varchar │
├───────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ Index             │ BIGINT      │ YES     │         │         │         │
│ Customer Id       │ VARCHAR     │ YES     │         │         │         │
│ First Name        │ VARCHAR     │ YES     │         │         │         │
│ Last Name         │ VARCHAR     │ YES     │         │         │         │
│ Company           │ VARCHAR     │ YES     │         │         │         │
│ City              │ VARCHAR     │ YES     │         │         │         │
│ Country           │ VARCHAR     │ YES     │         │         │         │
│ Phone 1           │ VARCHAR     │ YES     │         │         │         │
│ Phone 2           │ VARCHAR     │ YES     │         │         │         │
│ Email             │ VARCHAR     │ YES     │         │         │         │
│ Subscription Date │ DATE        │ YES     │         │         │         │
│ Website           │ VARCHAR     │ YES     │         │         │         │
├───────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤
│ 12 rows                                                       6 columns │
└─────────────────────────────────────────────────────────────────────────┘
Yes, your eyes are not failing you. DuckDB is “FROM first”, and needless SELECTs are optional. This is part of DuckDB’s friendly SQL initiative, which we’ll get into more later.
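If the FROM-first spelling looks unfamiliar, the same statement can be written with an explicit SELECT *; both forms should produce the identical schema listing:
duckdb -c "
DESCRIBE SELECT * FROM './data/customers-100000.csv'
"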
Note the use of the -c command line option. This tells DuckDB to run the query in the parameter value and exit. It’s one of many ways to execute queries on the command line.
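For example, the duckdb CLI also reads SQL from standard input, so we can pipe a query in; the heredoc form we use below works the same way:
echo "DESCRIBE FROM './data/customers-100000.csv'" | duckdb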
From the schema output, we see that DuckDB turned Subscription Date into an actual DATE and figured out the other types, too. You also have full control over the schema (we’ll get into that later).
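As a small preview, one way to override the sniffer is the types argument of read_csv(), a function we introduce just below; this sketch, assuming that argument, forces Subscription Date to be read as plain text:
duckdb -c "
DESCRIBE FROM read_csv('./data/customers-100000.csv', types = {'Subscription Date': 'VARCHAR'})
"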
So, what does this data actually look like?
duckdb -c "
FROM './data/customers-100000.csv' LIMIT 5
"
┌───────┬─────────────────┬────────────┬───┬──────────────────────┬───────────────────┬──────────────────────┐
│ Index │   Customer Id   │ First Name │ … │        Email         │ Subscription Date │       Website        │
│ int64 │     varchar     │  varchar   │   │       varchar        │       date        │       varchar        │
├───────┼─────────────────┼────────────┼───┼──────────────────────┼───────────────────┼──────────────────────┤
│     1 │ ffeCAb7AbcB0f07 │ Jared      │ … │ gabriellehartman@b…  │ 2021-11-11        │ https://www.mccart…  │
│     2 │ b687FfC4F1600eC │ Marie      │ … │ kstafford@sexton.com │ 2021-05-14        │ http://www.reynold…  │
│     3 │ 9FF9ACbc69dcF9c │ Elijah     │ … │ jeanettecross@brow…  │ 2021-03-17        │ https://neal.com/    │
│     4 │ b49edDB1295FF6E │ Sheryl     │ … │ thomassierra@barre…  │ 2020-09-23        │ https://www.powell…  │
│     5 │ 3dcCbFEB17CCf2E │ Jeremy     │ … │ rubenwatkins@jacob…  │ 2020-09-18        │ https://www.carril…  │
├───────┴─────────────────┴────────────┴───┴──────────────────────┴───────────────────┴──────────────────────┤
│ 5 rows                                                                                12 columns (6 shown) │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Let’s see how many unique customers are in the database:
duckdb <<EOF
FROM 'data/customers-100000.csv'
SELECT
COUNT(DISTINCT columns("Customer Id")) AS num_customers;
EOF
┌───────────────┐
│ num_customers │
│     int64     │
├───────────────┤
│        100000 │
└───────────────┘
While the -c flag is helpful, it’s not necessary, and gets in the way when we need to use both " and ' in queries.
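For instance, reading the query from a heredoc lets us mix double-quoted identifiers and single-quoted string literals without any shell escaping (the Country filter below is purely illustrative):
duckdb <<EOF
SELECT "First Name", "Last Name", Country
FROM 'data/customers-100000.csv'
WHERE Country = 'Norway'
LIMIT 3;
EOF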
The bare quotes way of reading a CSV file is handy, but we had to use the COLUMNS expression to access the fields. If we use the read_csv() function (https://duckdb.org/docs/data/csv/overview.html#csv-functions), we can avoid the COLUMNS expression:
duckdb <<EOF
FROM read_csv('data/customers-100000.csv')
SELECT
COUNT(DISTINCT "Customer Id") AS num_customers;
EOF
┌───────────────┐
│ num_customers │
│     int64     │
├───────────────┤
│        100000 │
└───────────────┘
You can do all the normal things SQL can do with CSV files, like filtering, ordering, and aggregation.
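For example, here is a quick aggregation over the same file (the grouping column is chosen purely for illustration):
duckdb <<EOF
SELECT Country, COUNT(*) AS num_customers
FROM read_csv('data/customers-100000.csv')
GROUP BY Country
ORDER BY num_customers DESC
LIMIT 5;
EOF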
We’ll cover how to get data out from DuckDB beyond these basic ASCII tables in other chapters.