
DuckDB and Polars: Transform Engines Compared

Both DuckDB and Polars are single-process, columnar, in-memory engines that read and write Parquet natively. Neither requires a server, a cluster, or a running service. Both run inside the Python process that calls them. At the hardware level, both perform vectorised columnar scans with predicate pushdown on Parquet row groups.

They are more alike than different. The divergences matter at the margins — and those margins determine which is the better fit for specific transform patterns.


Where they converge

| Property | DuckDB | Polars |
| --- | --- | --- |
| Execution model | In-process, single machine | In-process, single machine |
| Memory format | Apache Arrow columnar | Apache Arrow columnar |
| Parquet support | Native read/write with predicate pushdown | Native read/write with predicate pushdown |
| Lazy evaluation | Yes (duckdb.sql() builds a plan) | Yes (pl.scan_parquet() returns a LazyFrame) |
| Multi-threaded | Yes (automatic parallelism) | Yes (automatic parallelism) |
| Python integration | import duckdb — one line | import polars as pl — one line |
| Interop | Can read Polars DataFrames directly | Can read Arrow tables from DuckDB |
| 27M row transform | Seconds | Seconds |
| Infrastructure required | None | None |
| Cost | Free, open source | Free, open source |

At the datalake's current scale (largest silver input 27.9M rows), both engines handle every transform in seconds on a single worker. The "DuckDB vs Polars" question is about developer ergonomics and code maintainability, not performance.


Where they diverge

SQL vs DataFrame API

DuckDB's native interface is SQL. Polars' native interface is method-chained expressions.

A three-table dimensional join (the datalake's fact_invoice build):

DuckDB:

SELECT
    i.invoice_id,
    i.customer_id,
    i.invoice_date,
    i.total_amount,
    s.status_label,
    d.payment_method
FROM bronze_invoice i
LEFT JOIN bronze_invoice_status s ON i.status_id = s.status_id
LEFT JOIN bronze_invoice_data d ON i.invoice_id = d.invoice_id
WHERE i.invoice_date >= '2026-01-01'

Polars:

(
    bronze_invoice
    .filter(pl.col("invoice_date") >= date(2026, 1, 1))
    .join(bronze_invoice_status, on="status_id", how="left")
    .join(bronze_invoice_data, on="invoice_id", how="left")
    .select("invoice_id", "customer_id", "invoice_date", "total_amount",
            "status_label", "payment_method")
)

Both work. The SQL version reads more naturally for relational joins — SQL was designed for exactly this operation. The Polars version is more verbose but gets IDE autocompletion on column names.

Window functions

DuckDB:

SELECT *,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY invoice_date DESC) AS recency_rank,
    SUM(total_amount) OVER (PARTITION BY customer_id) AS lifetime_value
FROM silver_fact_invoice

Polars:

df.with_columns(
    pl.col("invoice_date").rank("ordinal", descending=True)
        .over("customer_id").alias("recency_rank"),
    pl.col("total_amount").sum()
        .over("customer_id").alias("lifetime_value"),
)

Comparable readability. DuckDB uses standard SQL window syntax; Polars uses .over(). Both are idiomatic in their respective paradigms.

Programmatic logic and ML integration

This is where Polars pulls ahead. Data science transforms — conditional columns, UDFs, ML model inference, statistical tests — are imperative operations that map poorly to SQL.

Polars (Rev-Sci ML drift detection):

shap_values = explainer(X_test)  # explainer = shap.Explainer(model)
df = df.with_columns(
    pl.Series("shap_importance", shap_values.values.mean(axis=0)),
    pl.when(pl.col("drift_score") > threshold)
      .then(pl.lit("DRIFT"))
      .otherwise(pl.lit("STABLE"))
      .alias("drift_status"),
)

DuckDB equivalent — not practical. You'd extract to Arrow, compute in Python, then re-register. The SQL boundary becomes overhead, not abstraction.

Direct Parquet scanning (bypassing IO manager)

DuckDB can query Parquet files by path, applying predicates at the file-scan level:

con.execute("""
    SELECT customer_id, SUM(amount)
    FROM 'bronze/billing_cdrs/cdr_bare/ingestion_date=2026-04-21/*.parquet'
    WHERE usage_type = 'data'
    GROUP BY customer_id
""")

This reads only the customer_id, amount, and usage_type columns, and uses the row-group statistics in the Parquet footer to skip row groups whose min/max bounds rule out usage_type = 'data'. For a 75 GB billing CDR table, this might scan 500 MB instead of the full file.

Polars via the IO manager loads the full upstream DataFrame into memory first (the IO manager calls pl.read_parquet(path)), and the asset then filters it in memory. For large bronze tables, that means loading gigabytes before discarding most of them.

This is the strongest technical argument for DuckDB in bronze-to-silver transforms. When the input is large and the output is a filtered/aggregated subset, DuckDB's direct Parquet scanning avoids loading data that will be immediately discarded.

The trade-off: direct Parquet scanning means the asset contains file paths, which ties it to a storage layout. IO managers decouple this. See IO Managers for how to get the best of both.


Recommendation: coexistence, not replacement

DuckDB and Polars coexist naturally in the same code location. A single asset can use both:

@dg.asset(io_manager_key="polars_io_manager")
def silver_fact_invoice(
    bronze_invoice: pl.DataFrame,
    bronze_invoice_status: pl.DataFrame,
    bronze_invoice_data: pl.DataFrame,
) -> pl.DataFrame:
    # DuckDB for the relational join (reads Polars DataFrames directly)
    return duckdb.sql("""
        SELECT i.*, s.status_label, d.payment_method
        FROM bronze_invoice i
        LEFT JOIN bronze_invoice_status s ON i.status_id = s.status_id
        LEFT JOIN bronze_invoice_data d ON i.invoice_id = d.invoice_id
    """).pl()

The IO manager loads inputs as Polars DataFrames. DuckDB picks them up as virtual tables (no copy — it reads the Arrow buffers directly). The .pl() call returns a Polars DataFrame for the IO manager to write.

When to use which

| Transform type | Use | Why |
| --- | --- | --- |
| Multi-table joins (silver facts) | DuckDB SQL | SQL is the natural syntax for relational algebra |
| Window functions, GROUP BY, aggregations | Either | Both handle these well |
| ML feature engineering, SHAP, statistical tests | Polars | Programmatic logic doesn't fit SQL |
| Large bronze-to-silver with heavy filtering | DuckDB (direct scan) | Predicate pushdown on Parquet avoids loading discarded data |
| Small transforms, column renames, type casts | Polars | Method chaining is concise for simple operations |
| Anything that returns to the IO manager | Polars (as output format) | PolarsParquetIOManager expects pl.DataFrame |