DuckDB and Polars: Transform Engines Compared¶
Both DuckDB and Polars are in-process, columnar engines that read and write Parquet natively. Neither requires a server, a cluster, or a running service; both run inside the Python process that calls them. At the hardware level, both perform vectorised columnar scans with predicate pushdown on Parquet row groups.
They are more alike than different. The divergences matter at the margins — and those margins determine which is the better fit for specific transform patterns.
Where they converge¶
| Property | DuckDB | Polars |
|---|---|---|
| Execution model | In-process, single machine | In-process, single machine |
| Memory format | Apache Arrow columnar | Apache Arrow columnar |
| Parquet support | Native read/write with predicate pushdown | Native read/write with predicate pushdown |
| Lazy evaluation | Yes (duckdb.sql() builds a plan) | Yes (pl.scan_parquet() returns a lazy plan) |
| Multi-threaded | Yes (automatic parallelism) | Yes (automatic parallelism) |
| Python integration | import duckdb — one line | import polars as pl — one line |
| Interop | DuckDB can read Polars DataFrames directly | Polars can read Arrow tables from DuckDB |
| 27M row transform | Seconds | Seconds |
| Infrastructure required | None | None |
| Cost | Free, open source | Free, open source |
At the datalake's current scale (largest silver input 27.9M rows), both engines handle every transform in seconds on a single worker. The "DuckDB vs Polars" question is about developer ergonomics and code maintainability, not performance.
Where they diverge¶
SQL vs DataFrame API¶
DuckDB's native interface is SQL. Polars' native interface is method-chained expressions.
A three-table dimensional join (the datalake's fact_invoice build):
DuckDB:

```sql
SELECT
    i.invoice_id,
    i.customer_id,
    i.invoice_date,
    i.total_amount,
    s.status_label,
    d.payment_method
FROM bronze_invoice i
LEFT JOIN bronze_invoice_status s ON i.status_id = s.status_id
LEFT JOIN bronze_invoice_data d ON i.invoice_id = d.invoice_id
WHERE i.invoice_date >= '2026-01-01'
```
Polars:

```python
(
    bronze_invoice
    .filter(pl.col("invoice_date") >= date(2026, 1, 1))
    .join(bronze_invoice_status, on="status_id", how="left")
    .join(bronze_invoice_data, on="invoice_id", how="left")
    .select("invoice_id", "customer_id", "invoice_date", "total_amount",
            "status_label", "payment_method")
)
```
Both work. The SQL version is more readable for relational joins; SQL was designed for exactly this operation. The Polars version is more verbose but offers IDE autocompletion and static checking on its expression API.
Window functions¶
DuckDB:

```sql
SELECT *,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY invoice_date DESC) AS recency_rank,
    SUM(total_amount) OVER (PARTITION BY customer_id) AS lifetime_value
FROM silver_fact_invoice
```
Polars:

```python
df.with_columns(
    pl.col("invoice_date").rank("ordinal", descending=True)
      .over("customer_id").alias("recency_rank"),
    pl.col("total_amount").sum()
      .over("customer_id").alias("lifetime_value"),
)
```
Comparable readability. DuckDB uses standard SQL window syntax; Polars uses .over(). Both are idiomatic in their respective paradigms.
Programmatic logic and ML integration¶
This is where Polars pulls ahead. Data science transforms — conditional columns, UDFs, ML model inference, statistical tests — are imperative operations that map poorly to SQL.
Polars (Rev-Sci ML drift detection):

```python
explainer = shap.Explainer(model)
shap_values = explainer(X_test)

df = df.with_columns(
    pl.Series("shap_importance", shap_values.values.mean(axis=0)),
    pl.when(pl.col("drift_score") > threshold)
    .then(pl.lit("DRIFT"))
    .otherwise(pl.lit("STABLE"))
    .alias("drift_status"),
)
```
A DuckDB equivalent is not practical: you'd extract to Arrow, compute in Python, then re-register the result. The SQL boundary becomes overhead, not abstraction.
Direct Parquet scanning (bypassing IO manager)¶
DuckDB can query Parquet files by path, applying predicates at the file-scan level:
```python
con.execute("""
    SELECT customer_id, SUM(amount)
    FROM 'bronze/billing_cdrs/cdr_bare/ingestion_date=2026-04-21/*.parquet'
    WHERE usage_type = 'data'
    GROUP BY customer_id
""")
```
This reads only the customer_id, amount, and usage_type columns, and uses the min/max statistics in the Parquet metadata to skip row groups that cannot contain usage_type = 'data'. For a 75 GB billing CDR table, this might scan 500 MB instead of the full file.
Polars via IO manager loads the full upstream DataFrame into memory first (the IO manager calls pl.read_parquet(path)), then the asset filters in-memory. For large bronze tables, this means loading gigabytes before discarding most of it.
This is the strongest technical argument for DuckDB in bronze-to-silver transforms. When the input is large and the output is a filtered/aggregated subset, DuckDB's direct Parquet scanning avoids loading data that will be immediately discarded.
The trade-off: direct Parquet scanning means the asset contains file paths, which ties it to a storage layout. IO managers decouple this. See IO Managers for how to get the best of both.
Recommendation: coexistence, not replacement¶
DuckDB and Polars coexist naturally in the same code location. A single asset can use both:
```python
@dg.asset(io_manager_key="polars_io_manager")
def silver_fact_invoice(
    bronze_invoice: pl.DataFrame,
    bronze_invoice_status: pl.DataFrame,
    bronze_invoice_data: pl.DataFrame,
) -> pl.DataFrame:
    # DuckDB for the relational join (reads Polars DataFrames directly)
    return duckdb.sql("""
        SELECT i.*, s.status_label, d.payment_method
        FROM bronze_invoice i
        LEFT JOIN bronze_invoice_status s ON i.status_id = s.status_id
        LEFT JOIN bronze_invoice_data d ON i.invoice_id = d.invoice_id
    """).pl()
```
The IO manager loads inputs as Polars DataFrames. DuckDB picks them up as virtual tables (no copy — it reads the Arrow buffers directly). The .pl() call returns a Polars DataFrame for the IO manager to write.
When to use which¶
| Transform type | Use | Why |
|---|---|---|
| Multi-table joins (silver facts) | DuckDB SQL | SQL is the natural syntax for relational algebra |
| Window functions, GROUP BY, aggregations | Either | Both handle these well |
| ML feature engineering, SHAP, statistical tests | Polars | Programmatic logic doesn't fit SQL |
| Large bronze-to-silver with heavy filtering | DuckDB (direct scan) | Predicate pushdown on Parquet avoids loading discarded data |
| Small transforms, column renames, type casts | Polars | Method chaining is concise for simple operations |
| Anything that returns to the IO manager | Polars (as output format) | PolarsParquetIOManager expects pl.DataFrame |