
DuckDB and Polars: Transform Engines Compared

Both DuckDB and Polars are single-process, columnar, in-memory engines that read and write Parquet natively. Neither requires a server, a cluster, or a running service. Both run inside the Python process that calls them. At the hardware level, both perform vectorised columnar scans with predicate pushdown on Parquet row groups.

They are more alike than different. The divergences matter at the margins — and those margins determine which is the better fit for specific transform patterns.


Where they converge

| Property | DuckDB | Polars |
| --- | --- | --- |
| Execution model | In-process, single machine | In-process, single machine |
| Memory format | Apache Arrow columnar | Apache Arrow columnar |
| Parquet support | Native read/write with predicate pushdown | Native read/write with predicate pushdown |
| Lazy evaluation | Yes (duckdb.sql() builds a plan) | Yes (pl.scan_parquet() returns a LazyFrame) |
| Multi-threaded | Yes (automatic parallelism) | Yes (automatic parallelism) |
| Python integration | import duckdb — one line | import polars as pl — one line |
| Interop | Can read Polars DataFrames directly | Can read Arrow tables from DuckDB |
| 27M row transform | Seconds | Seconds |
| Infrastructure required | None | None |
| Cost | Free, open source | Free, open source |

At the datalake's current scale (largest silver input 27.9M rows), both engines handle every transform in seconds on a single worker. The "DuckDB vs Polars" question is about developer ergonomics and code maintainability, not performance.


Where they diverge

SQL vs DataFrame API

DuckDB's native interface is SQL. Polars' native interface is method-chained expressions.

A three-table dimensional join (the datalake's fact_invoice build):

DuckDB:

SELECT
    i.invoice_id,
    i.customer_id,
    i.invoice_date,
    i.total_amount,
    s.status_label,
    d.payment_method
FROM bronze_invoice i
LEFT JOIN bronze_invoice_status s ON i.status_id = s.status_id
LEFT JOIN bronze_invoice_data d ON i.invoice_id = d.invoice_id
WHERE i.invoice_date >= '2026-01-01'

Polars:

(
    bronze_invoice
    .filter(pl.col("invoice_date") >= date(2026, 1, 1))
    .join(bronze_invoice_status, on="status_id", how="left")
    .join(bronze_invoice_data, on="invoice_id", how="left")
    .select("invoice_id", "customer_id", "invoice_date", "total_amount",
            "status_label", "payment_method")
)

Both work. The SQL version reads more naturally for relational joins — SQL was designed for exactly this operation. The Polars version is more verbose but gets IDE autocompletion on column names.

Window functions

DuckDB:

SELECT *,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY invoice_date DESC) AS recency_rank,
    SUM(total_amount) OVER (PARTITION BY customer_id) AS lifetime_value
FROM silver_fact_invoice

Polars:

df.with_columns(
    pl.col("invoice_date").rank("ordinal", descending=True)
        .over("customer_id").alias("recency_rank"),
    pl.col("total_amount").sum()
        .over("customer_id").alias("lifetime_value"),
)

Comparable readability. DuckDB uses standard SQL window syntax; Polars uses .over(). Both are idiomatic in their respective paradigms.

Programmatic logic and ML integration

This is where Polars pulls ahead. Data science transforms — conditional columns, UDFs, ML model inference, statistical tests — are imperative operations that map poorly to SQL.

Polars (Rev-Sci ML drift detection):

shap_values = explainer(X_test)  # explainer = shap.Explainer(model)
df = df.with_columns(
    pl.Series("shap_importance", shap_values.values.mean(axis=0)),
    pl.when(pl.col("drift_score") > threshold)
      .then(pl.lit("DRIFT"))
      .otherwise(pl.lit("STABLE"))
      .alias("drift_status"),
)

DuckDB equivalent — not practical. You'd extract to Arrow, compute in Python, then re-register. The SQL boundary becomes overhead, not abstraction.

Direct Parquet scanning (bypassing IO manager)

DuckDB can query Parquet files by path, applying predicates at the file-scan level:

con.execute("""
    SELECT customer_id, SUM(amount)
    FROM 'bronze/billing_cdrs/cdr_bare/ingestion_date=2026-04-21/*.parquet'
    WHERE usage_type = 'data'
    GROUP BY customer_id
""")

This reads only the customer_id, amount, and usage_type columns, and uses the row-group statistics in the Parquet footer to skip row groups whose min/max bounds rule out usage_type = 'data'. For a 75 GB billing CDR table, this might scan 500 MB instead of the full file.

Polars via the IO manager loads the full upstream DataFrame into memory first (the IO manager calls pl.read_parquet(path)), and the asset then filters it in memory. For large bronze tables, that means loading gigabytes before discarding most of them.

This is the strongest technical argument for DuckDB in bronze-to-silver transforms. When the input is large and the output is a filtered/aggregated subset, DuckDB's direct Parquet scanning avoids loading data that will be immediately discarded.

The trade-off: direct Parquet scanning means the asset contains file paths, which ties it to a storage layout. IO managers decouple this. See IO Managers for how to get the best of both.


Recommendation: coexistence, not replacement

DuckDB and Polars coexist naturally in the same code location. A single asset can use both:

@dg.asset(io_manager_key="polars_io_manager")
def silver_fact_invoice(
    bronze_invoice: pl.DataFrame,
    bronze_invoice_status: pl.DataFrame,
    bronze_invoice_data: pl.DataFrame,
) -> pl.DataFrame:
    # DuckDB for the relational join (reads Polars DataFrames directly)
    return duckdb.sql("""
        SELECT i.*, s.status_label, d.payment_method
        FROM bronze_invoice i
        LEFT JOIN bronze_invoice_status s ON i.status_id = s.status_id
        LEFT JOIN bronze_invoice_data d ON i.invoice_id = d.invoice_id
    """).pl()

The IO manager loads inputs as Polars DataFrames. DuckDB picks them up as virtual tables (no copy — it reads the Arrow buffers directly). The .pl() call returns a Polars DataFrame for the IO manager to write.

When to use which

| Transform type | Use | Why |
| --- | --- | --- |
| Multi-table joins (silver facts) | DuckDB SQL | SQL is the natural syntax for relational algebra |
| Window functions, GROUP BY, aggregations | Either | Both handle these well |
| ML feature engineering, SHAP, statistical tests | Polars | Programmatic logic doesn't fit SQL |
| Large bronze-to-silver with heavy filtering | DuckDB (direct scan) | Predicate pushdown on Parquet avoids loading discarded data |
| Small transforms, column renames, type casts | Polars | Method chaining is concise for simple operations |
| Anything that returns to the IO manager | Polars (as output format) | PolarsParquetIOManager expects pl.DataFrame |