
DAG-3: ClickHouse OLAP Lakehouse for Vanguard Wholesale

Summary

Adds a ClickHouse data lakehouse layer that provides SQL-queryable VIEWs over all Dagster-managed Parquet assets. ClickHouse reads Parquet files directly from disk via its file() table function — zero data duplication, zero changes to the existing Dagster pipeline.

VIEWs are created automatically, covering v3-derived assets, billing CDR layers, wholesale liability, customer filters, factory-generated report assets (raw, masked, backfilled), UUID mappings, and more.

  • Natively queryable via HTTP (container port 8123) — browser, curl, API integrations
  • Queryable via TCP (port 9000) — TablePlus, DBeaver, BI tools
  • Queryable via the built-in Play UI at http://localhost:8123/play

JIRA

Epic: DAG-3 — OLAP Lakehouse Federation for Reports Data

Task Summary
DAG-9 ClickHouse config and Polars-to-ClickHouse schema mapper
DAG-10 Asset registry and VIEW generator
DAG-11 Bootstrap script and CLI entry point
DAG-12 Docker Compose — ClickHouse for local dev
DAG-13 Docker Swarm stack — ClickHouse for production
DAG-14 Analytical summary VIEWs
DAG-15 Verification and end-to-end testing
DAG-16 Auto-discover parquet assets for ClickHouse VIEWs

Architecture

Dagster Pipeline ──writes──> Parquet files (vanguard-lake/)
                                    │
                              (shared volume mount)
                                    │
                          ClickHouse ──reads via file()──> VIEWs
                                    │
                         ┌──────────┼──────────┐
                     TCP:9000    HTTP:8123    Traefik
                    (TablePlus)  (browser/    (production
                                  curl/API)   routing)

ClickHouse VIEWs use the file() table function to read Parquet directly. VIEWs are lazy — no data is stored in ClickHouse. New asset materializations are immediately visible on the next query (glob patterns auto-discover new files).

How VIEWs work

Registered assets (17) have explicit Polars-to-ClickHouse schema mappings for type safety:

CREATE OR REPLACE VIEW vanguard.v3_service_products AS
SELECT * FROM file('/var/lib/clickhouse/user_files/vanguard-lake/v3_service_products/*.parquet',
  Parquet, 'service_product_id Int64, service_id Int64, ...')

Auto-discovered assets (~35) use ClickHouse's Parquet metadata inference:

CREATE OR REPLACE VIEW vanguard.active_services_raw AS
SELECT * FROM file('/var/lib/clickhouse/user_files/vanguard-lake/active_services_raw/*.parquet', Parquet)

Summary VIEWs (4) are hand-crafted aggregations that reference the raw VIEWs:

  • service_summary_by_partner — active/suspended services, overspend per partner
  • usage_summary_by_partner — domestic/roaming data/voice/SMS/cost per partner per month
  • product_distribution — service counts by product, network, plan level
  • liability_by_partner — customer count, service count, monthly liability per partner

What changed

New files

File Purpose
src/vanguard_wholesale/clickhouse/__init__.py Package init
src/vanguard_wholesale/clickhouse/config.py ClickHouseConfig dataclass with env var defaults
src/vanguard_wholesale/clickhouse/schema_mapper.py polars_schema_to_clickhouse() type mapping
src/vanguard_wholesale/clickhouse/asset_registry.py 17 registered assets with explicit schemas
src/vanguard_wholesale/clickhouse/view_generator.py DDL generation + discover_unregistered_assets()
src/vanguard_wholesale/clickhouse/summary_views.py 4 analytical aggregate VIEWs
src/vanguard_wholesale/clickhouse/bootstrap.py 3-phase bootstrap CLI (bootstrap-clickhouse)
compose.clickhouse.yml Standalone ClickHouse for local dev
clickhouse-config/parquet_settings.xml ClickHouse settings for Parquet robustness

Modified files

File Change
pyproject.toml Added clickhouse-connect>=0.8.0 dep + bootstrap-clickhouse entry point
docker-stack.yml Added ClickHouse service with Traefik labels, Swarm config, clickhouse_data volume
.env.template Added CLICKHOUSE_* env vars (localhost defaults)
environments/vanguard_wholesale.env.template Added CLICKHOUSE_* env vars (Docker defaults)
uv.lock Updated lockfile

Research findings and design decisions

Why ClickHouse?

  • Already proven in the fraud-scoring model (underwriting-fraud-model repo) with the same Parquet-over-VIEWs pattern
  • Native file() function reads Parquet directly — no ETL, no data duplication
  • HTTP interface enables browser-based queries, curl, and API integration out of the box
  • The NFS volume path already contains CLICK_HOUSE in the name (DAG-7), confirming this was anticipated
  • Lightweight: VIEWs are schema-only metadata (~KB each), zero cost when idle

Why auto-discovery?

The manual registry only covers assets with module-level pl.Schema constants (17 of ~50 assets). The ReportAssetFactory generates 3 assets per report config ({name}_raw, {name}_masked, {name}_backfilled) at runtime. Customer filter assets are similarly dynamic. Auto-discovery scans the vanguard-lake directory and creates VIEWs for everything not already registered — future assets are covered automatically.
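
The scan itself is simple; a sketch of what discover_unregistered_assets() could look like (the signature and directory layout are assumptions — one subdirectory per asset, holding Parquet files):

```python
# Hypothetical sketch: find lake subdirectories that contain Parquet files
# and are not already covered by the explicit registry.
from pathlib import Path

def discover_unregistered_assets(lake_root: Path, registered: set[str]) -> list[str]:
    discovered = []
    for child in sorted(lake_root.iterdir()):
        # An asset is a directory holding at least one .parquet file.
        if child.is_dir() and any(child.glob("*.parquet")) and child.name not in registered:
            discovered.append(child.name)
    return discovered
```

Anything the scan returns gets a schema-inferred VIEW, so factory-generated and future assets are covered without touching the registry.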

ClickHouse settings for robustness

Two settings in clickhouse-config/parquet_settings.xml:

Setting Default Our value Why
engine_file_skip_empty_files 0 1 Dagster creates 0-byte .parquet files for partners with no data. Without this, ClickHouse throws ParquetInvalidOrCorruptedFileException.
engine_file_empty_if_not_exists 0 1 VIEWs for not-yet-materialized assets return empty results instead of erroring.
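
A sketch of what clickhouse-config/parquet_settings.xml could contain, assuming the settings are applied as a default-profile override (the exact element nesting is an assumption):

```xml
<clickhouse>
    <profiles>
        <default>
            <!-- Skip Dagster's 0-byte .parquet files instead of erroring -->
            <engine_file_skip_empty_files>1</engine_file_skip_empty_files>
            <!-- Return empty results for not-yet-materialized assets -->
            <engine_file_empty_if_not_exists>1</engine_file_empty_if_not_exists>
        </default>
    </profiles>
</clickhouse>
```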

Known limitations

Limitation Impact Mitigation
Auto-infer VIEWs fail at creation if no Parquet files exist VIEW not created until asset is materialized Re-run bootstrap-clickhouse after materializing; registered assets with explicit schemas are unaffected
Schema drift (type changes) on explicit-schema VIEWs Query-time error if Parquet type doesn't match VIEW schema Re-run bootstrap after schema changes; adding/removing columns is safe
No query caching Every query re-reads from Parquet Acceptable for analytical workloads; consider Refreshable MVs for hot summary views if needed
Concurrent reads during writes Potential partial-file reads Polars uses atomic write-then-rename; ClickHouse sees complete files or nothing
VIEWs are not indexed No predicate pushdown beyond Parquet row groups Acceptable for current data volumes; MergeTree ingestion is an option if performance degrades
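
The write-then-rename mitigation in the table above is the standard atomic-publish pattern; a minimal sketch (not the pipeline's actual IO-manager code):

```python
# Sketch of atomic write-then-rename: readers (ClickHouse's *.parquet glob)
# see either the complete file or no file at all, never a partial write.
import os
from pathlib import Path

def atomic_write(path: Path, data: bytes) -> None:
    # Write to a temp name that the *.parquet glob will not match...
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_bytes(data)
    # ...then atomically move it into place. os.replace is atomic within a
    # single POSIX filesystem, which holds for the shared volume mount.
    os.replace(tmp, path)
```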

What this does NOT change

  • Zero modifications to the Dagster pipeline, assets, IO manager, sensors, or jobs
  • Zero changes to existing MySQL reporting databases
  • ClickHouse is a purely additive, read-only layer

Local dev usage

# Start ClickHouse
docker compose -f compose.clickhouse.yml up -d

# Run Dagster to materialize assets
dg dev

# Create/refresh all VIEWs
CLICKHOUSE_PASSWORD=clickhouse uv run bootstrap-clickhouse

# Query via HTTP
curl 'http://localhost:8123/?user=default&password=clickhouse&query=SELECT+*+FROM+vanguard.product_distribution'

GUI clients

ClickHouse Play UI (built-in, zero install)

Available immediately at http://localhost:8123/play once ClickHouse is running.

  • Lightweight SQL editor with syntax highlighting in the browser
  • Tabular result display with copy/export
  • Enter credentials in the top bar: user default, password clickhouse
  • Supports all ClickHouse SQL including SHOW TABLES, DESCRIBE, and full queries
  • Good for quick ad-hoc queries; no schema tree or saved queries

Tabix (web-based, ClickHouse-specific)

Open https://tabix.io in your browser — no install required. It connects directly to your local ClickHouse instance via the HTTP interface.

To connect:

  1. Go to https://tabix.io
  2. Enter connection details:
     • URL: http://localhost:8123
     • User: default
     • Password: clickhouse
     • Database: vanguard
  3. Click Connect

Features:

  • Schema/table browser tree in the left sidebar — browse all VIEWs and their columns
  • SQL editor with ClickHouse-aware autocomplete
  • Result visualisation: tables, line charts, bar charts, area charts
  • Saved queries within the browser session
  • Purpose-built for ClickHouse (understands system tables, engines, etc.)

TablePlus (tested)

Connect using:

  • Connection type: ClickHouse
  • Host: localhost
  • Port: 8123
  • User: default
  • Password: clickhouse
  • Database: vanguard

All VIEWs appear in the sidebar and are fully queryable. Tested and confirmed working with this implementation.

DBeaver Community (not tested)

Should work via the ClickHouse JDBC driver. Connect using:

  • JDBC URL: jdbc:clickhouse://localhost:8123/vanguard
  • User: default
  • Password: clickhouse

DBeaver provides a full schema browser, ER diagrams, and data export; it may prompt to download the ClickHouse JDBC driver on first connection. Not tested with this implementation.

Excel — Windows ODBC Driver (direct connection)

On Windows machines, Excel can connect directly to ClickHouse using the ClickHouse ODBC driver:

  1. Install the ClickHouse ODBC driver from GitHub releases
  2. Configure a User DSN in ODBC Data Source Administrator with the ClickHouse server address, port, database (vanguard), and credentials
  3. In Excel: Data > Get Data > From ODBC, select the DSN
  4. Browse VIEWs directly or write custom SQL

This gives Excel users direct, refreshable access to all Vanguard data VIEWs without any intermediate exports. The Altinity blog has a detailed walkthrough of the connection process.

On macOS, Excel's sandbox blocks ODBC driver loading. Use Power Query with the HTTP interface instead (POST SQL to http://<host>:8123 and parse the CSV response). See the Confluence docs for a working Power Query M script.

HTTP output formats

ClickHouse supports multiple output formats via the FORMAT clause, useful for integrations:

Format Use case Example
Pretty Human-readable in browser FORMAT Pretty
CSVWithNames CSV with headers (Excel, scripts) FORMAT CSVWithNames
TabSeparatedWithNames TSV with headers FORMAT TabSeparatedWithNames
JSON JSON array (API consumers) FORMAT JSON
JSONEachRow NDJSON / .jsonl (streaming) FORMAT JSONEachRow
Parquet Binary Parquet download FORMAT Parquet
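
As an illustration, a small helper that builds the HTTP query URL with a FORMAT clause and parses a JSONEachRow response. The host and credentials are this PR's local-dev defaults; the column names and sample rows are hypothetical:

```python
# Sketch: ClickHouse HTTP query URL construction plus JSONEachRow (NDJSON)
# parsing, for scripted integrations against the local-dev instance.
import json
from urllib.parse import urlencode

def query_url(sql: str, host: str = "localhost", port: int = 8123) -> str:
    params = {"user": "default", "password": "clickhouse", "query": sql}
    return f"http://{host}:{port}/?{urlencode(params)}"

def parse_json_each_row(body: str) -> list[dict]:
    # JSONEachRow emits one JSON object per line.
    return [json.loads(line) for line in body.splitlines() if line.strip()]

url = query_url(
    "SELECT partner_id, services FROM vanguard.service_summary_by_partner FORMAT JSONEachRow"
)
sample = '{"partner_id": 1, "services": 42}\n{"partner_id": 2, "services": 7}\n'
rows = parse_json_each_row(sample)
```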

Existing documentation

The fraud-scoring model has a working ClickHouse deployment with detailed Confluence documentation covering browser access, output formats, Power BI/Excel integration, and example queries.

The same patterns apply to this Vanguard Wholesale deployment — substitute vanguard for predictions as the database name and use the appropriate host/credentials.

Deployment steps (production)

  1. Create Docker Swarm config for ClickHouse settings:

    docker config create clickhouse_parquet_settings clickhouse-config/parquet_settings.xml
    

  2. Add ClickHouse env vars to the NFS env file at /opt/dagster/shared-data/environments/:

    CLICKHOUSE_HOST=clickhouse
    CLICKHOUSE_PORT=8123
    CLICKHOUSE_USER=default
    CLICKHOUSE_PASSWORD=<secure-password>
    CLICKHOUSE_DB=vanguard
    CLICKHOUSE_TRAEFIK_HOST=ch.dagster-prod.plan.com   # or appropriate hostname
    

  3. Deploy the stack — the ClickHouse service is included in docker-stack.yml:

    docker stack deploy -c docker-stack.yml dagster
    

  4. Run bootstrap from any container with access to ClickHouse:

    bootstrap-clickhouse
    

  5. Verify via Traefik-routed HTTP:

    https://ch.dagster-prod.plan.com/?query=SHOW+TABLES+FROM+vanguard
    

Test plan

  • [x] Schema mapper produces correct ClickHouse types for all Polars dtypes used in the project
  • [x] Registry imports all 17 explicit schemas without error
  • [x] VIEW generator produces valid DDL for partitioned and unpartitioned assets
  • [x] Auto-discovery finds all factory-generated report assets and customer filters
  • [x] Bootstrap is idempotent (re-run produces no errors)
  • [x] Bootstrap gracefully skips assets without Parquet files
  • [x] 0-byte Parquet files are skipped (verified with/without engine_file_skip_empty_files)
  • [x] Queries return real data from materialized assets
  • [x] Summary VIEWs return correct aggregations
  • [x] HTTP interface accessible at localhost:8123
  • [x] compose.clickhouse.yml starts and passes healthcheck
  • [ ] Production deployment via Docker Swarm (pending)
  • [ ] Traefik routing for production HTTP access (pending)