Problem
You need to boost the performance of the already bonkers-fast DuckDB.
Solution
Take advantage of the tuning parameters offered by DuckDB.
Discussion
DuckDB gives us some levers to pull when it comes to squeezing out the most performance possible. Some of those levers exist in the database engine itself, while others come down to the choices you make about data file location and organization and the system you're running on.
Two critical settings that DuckDB usually does a fantastic job of picking for you are memory_limit and threads. Here's how you can see what they're set to:
duckdb <<EOF
SELECT
  current_setting('memory_limit') AS memory_limit,
  current_setting('threads') AS threads;
EOF
Depending on where I rendered this, the output above may show 50 GiB of memory and 10 threads. If so, that's my ridiculously overpowered M1 MacBook Pro Max. DuckDB queries fly on this system.
You can tweak those settings for your system, especially if you're memory-constrained (that's when you should consider reducing the thread count). The official manual says:
As a rule of thumb, DuckDB requires a minimum of 125 MB of memory per thread. For example, if you use 8 threads, you need at least 1 GB of memory. For ideal performance, aggregation-heavy workloads require approx. 5 GB memory per thread and join-heavy workloads require approximately 10 GB memory per thread.
Aim for 5-10 GB memory per thread.
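On a memory-constrained machine you can lower both values explicitly in a session. A minimal sketch (the specific limits here are illustrative, not recommendations):
SET memory_limit = '2GB';  -- illustrative cap; pick a value that fits your machine
SET threads = 4;           -- fewer threads also means less memory pressure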
Another performance consideration that readers may not be aware of: when you just fire up DuckDB from the CLI, or use it via something like the R {duckdb} / {duckdbfs} packages, you're using an in-memory database. This is usually fine, but if your query operations are causing out-of-memory errors, you should consider either running DuckDB with a local, persistent database file or explicitly telling DuckDB where it can spill memory to disk. To do the latter, use the temp_directory setting like this:
SET temp_directory = '/path/to/temp_dir.tmp/';
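To do the former, pointing the CLI at a file path is all it takes; DuckDB creates the database file if it doesn't exist (the path below is hypothetical):
# hypothetical path; DuckDB creates the file if it doesn't already exist
duckdb /path/to/analysis.duckdb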
Another performance enhancement to consider is limiting the use of:
ORDER BY
GROUP BY
OVER ... (PARTITION BY ... ORDER BY ...)
and any JOINs
Of those, you can likely sacrifice ORDER BY more readily than the others.
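For example, if a result is going to be summarized or loaded into another tool anyway, dropping the final sort lets DuckDB skip an ordering pass entirely. A sketch with hypothetical table and column names:
-- hypothetical table/columns; no ORDER BY, so DuckDB can skip the sort
SELECT station_id, avg(temp_c) AS avg_temp
FROM readings
GROUP BY station_id;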
If the order of your data doesn’t matter, you can further turn off the setting that preserves insertion order:
SET preserve_insertion_order = false;
This gives DuckDB the freedom to be a bit lazier about how it recombines the results of threaded operations.
Finally, one performance enhancement that can also help, but is beyond the scope of this book, is optimizing the way you create Parquet files. This is not just a DuckDB consideration, since most environments that support operations on Parquet files benefit from well-crafted datasets. Check out the DuckDB manual for some tips on this, as well as this 2023 article by Airbyte.
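As a small taste of what's possible, DuckDB's COPY command exposes Parquet-writing knobs such as the compression codec and row-group size (the table name and values below are illustrative, not tuned recommendations):
-- illustrative options; see the DuckDB manual for guidance on choosing values
COPY my_table TO 'my_table.parquet'
  (FORMAT PARQUET, COMPRESSION ZSTD, ROW_GROUP_SIZE 100000);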