Customize CSV Read Behavior

Problem

DuckDB’s CSV auto-detection is not working the way I need it to for some datasets.

Solution

Roll up your sleeves and override DuckDB’s defaults with explicit options to the CSV reader.

Discussion

DuckDB gives us as much or as little control over how the CSV reader works as we want. Normally, the defaults work well, but we’ve all had one of those CSVs.

The DuckDB online documentation has a wealth of information on all the settings.
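Before overriding anything, it can help to see what the auto-detection actually decided. Recent DuckDB releases ship a sniff_csv() table function that reports the detected dialect and column types; a quick sketch against the example file used later in this recipe:

duckdb <<EOF
-- Report the delimiter, quoting rules, header flag, and column types
-- that DuckDB's sniffer detected for this file.
FROM sniff_csv('data/customers-100000.csv');
EOF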

One common source of pain is that DuckDB infers the wrong column type because it sampled too few rows. Set sample_size to -1 in a call to read_csv() and DuckDB will scan the entire file/input before committing to column types. You can also set it to an arbitrarily large number of rows.
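As a minimal sketch (reusing the example file from below), forcing a full scan and then inspecting the inferred schema might look like this:

duckdb <<EOF
-- sample_size = -1 makes the sniffer read the whole file before
-- deciding on column types, at the cost of a slower start.
DESCRIBE SELECT * FROM read_csv('data/customers-100000.csv', sample_size = -1);
EOF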

Either change may slow down the load, since DuckDB has to read more data before it can start. It may be better to tell DuckDB the column types up-front.

I, for one, do not like column names with mixed case and spaces. Let’s do something about that with our example file, and turn the subscription date into a VARCHAR just for fun:

duckdb <<EOF
FROM read_csv('data/customers-100000.csv',
  -- supplying columns turns off auto-detection, so declare the header row explicitly
  header = true,
  columns = {
  'index': 'BIGINT',
  'customer_id': 'VARCHAR',
  'first_name': 'VARCHAR',
  'last_name': 'VARCHAR',
  'company': 'VARCHAR',
  'city': 'VARCHAR',
  'country': 'VARCHAR',
  'phone_1': 'VARCHAR',
  'phone_2': 'VARCHAR',
  'email': 'VARCHAR',
  'subscription_date': 'VARCHAR',
  'website': 'VARCHAR',
})
SELECT 
  COUNT(DISTINCT customer_id) AS num_customers;
EOF
┌───────────────┐
│ num_customers │
│     int64     │
├───────────────┤
│        100000 │
└───────────────┘

You can also pass ignore_errors to skip rows that fail to parse, but be sure to diagnose why they failed (character encoding issues are a frequent culprit).
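A minimal sketch of that option (our example file parses cleanly, so ignore_errors is a no-op here; on a messy file, comparing the loaded row count against the file’s line count shows how many rows were dropped):

duckdb <<EOF
-- Rows that fail to parse or cast are silently skipped
-- rather than aborting the whole read.
FROM read_csv('data/customers-100000.csv', ignore_errors = true)
SELECT COUNT(*) AS num_rows;
EOF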