
DAG-3: ClickHouse OLAP Lakehouse for Vanguard Wholesale

Summary

Adds a ClickHouse data lakehouse layer that provides SQL-queryable VIEWs over all Dagster-managed Parquet assets. ClickHouse reads Parquet files directly from disk via its file() table function — zero data duplication, zero changes to the existing Dagster pipeline.

VIEWs are created automatically, covering v3-derived assets, billing CDR layers, wholesale liability, customer filters, factory-generated report assets (raw, masked, backfilled), UUID mappings, and more.

  • Natively queryable via HTTP (container port 8123) — browser, curl, API integrations
  • Queryable via TCP (port 9000) — TablePlus, DBeaver, BI tools
  • Queryable via the built-in Play UI at http://localhost:8123/play

JIRA

Epic: DAG-3 — OLAP Lakehouse Federation for Reports Data

Task Summary
DAG-9 ClickHouse config and Polars-to-ClickHouse schema mapper
DAG-10 Asset registry and VIEW generator
DAG-11 Bootstrap script and CLI entry point
DAG-12 Docker Compose — ClickHouse for local dev
DAG-13 Docker Swarm stack — ClickHouse for production
DAG-14 Analytical summary VIEWs
DAG-15 Verification and end-to-end testing
DAG-16 Auto-discover parquet assets for ClickHouse VIEWs

Architecture

Dagster Pipeline ──writes──> Parquet files (vanguard-lake/)
                                    │
                              (shared volume mount)
                                    │
                          ClickHouse ──reads via file()──> VIEWs
                                    │
                         ┌──────────┼──────────┐
                     TCP:9000    HTTP:8123    Traefik
                    (TablePlus)  (browser/    (production
                                  curl/API)   routing)

ClickHouse VIEWs use the file() table function to read Parquet directly. VIEWs are lazy — no data is stored in ClickHouse. New asset materializations are immediately visible on the next query (glob patterns auto-discover new files).

How VIEWs work

Registered assets (17) have explicit Polars-to-ClickHouse schema mappings for type safety:

CREATE OR REPLACE VIEW vanguard.v3_service_products AS
SELECT * FROM file('/var/lib/clickhouse/user_files/vanguard-lake/v3_service_products/*.parquet',
  Parquet, 'service_product_id Int64, service_id Int64, ...')

Auto-discovered assets (~35) use ClickHouse's Parquet metadata inference:

CREATE OR REPLACE VIEW vanguard.active_services_raw AS
SELECT * FROM file('/var/lib/clickhouse/user_files/vanguard-lake/active_services_raw/*.parquet', Parquet)

Summary VIEWs (4) are hand-crafted aggregations that reference the raw VIEWs:

  • service_summary_by_partner — active/suspended services, overspend per partner
  • usage_summary_by_partner — domestic/roaming data/voice/SMS/cost per partner per month
  • product_distribution — service counts by product, network, plan level
  • liability_by_partner — customer count, service count, monthly liability per partner

What changed

New files

File Purpose
src/vanguard_wholesale/clickhouse/__init__.py Package init
src/vanguard_wholesale/clickhouse/config.py ClickHouseConfig dataclass with env var defaults
src/vanguard_wholesale/clickhouse/schema_mapper.py polars_schema_to_clickhouse() type mapping
src/vanguard_wholesale/clickhouse/asset_registry.py 17 registered assets with explicit schemas
src/vanguard_wholesale/clickhouse/view_generator.py DDL generation + discover_unregistered_assets()
src/vanguard_wholesale/clickhouse/summary_views.py 4 analytical aggregate VIEWs
src/vanguard_wholesale/clickhouse/bootstrap.py 3-phase bootstrap CLI (bootstrap-clickhouse)
compose.clickhouse.yml Standalone ClickHouse for local dev
clickhouse-config/parquet_settings.xml ClickHouse settings for Parquet robustness

Modified files

File Change
pyproject.toml Added clickhouse-connect>=0.8.0 dep + bootstrap-clickhouse entry point
docker-stack.yml Added ClickHouse service with Traefik labels, Swarm config, clickhouse_data volume
.env.template Added CLICKHOUSE_* env vars (localhost defaults)
environments/vanguard_wholesale.env.template Added CLICKHOUSE_* env vars (Docker defaults)
uv.lock Updated lockfile

Research findings and design decisions

Why ClickHouse?

  • Already proven in the fraud-scoring model (underwriting-fraud-model repo) with the same Parquet-over-VIEWs pattern
  • Native file() function reads Parquet directly — no ETL, no data duplication
  • HTTP interface enables browser-based queries, curl, and API integration out of the box
  • The NFS volume path already contains CLICK_HOUSE in the name (DAG-7), confirming this was anticipated
  • Lightweight: VIEWs are schema-only metadata (~KB each), zero cost when idle

Why auto-discovery?

The manual registry only covers assets with module-level pl.Schema constants (17 of ~50 assets). The ReportAssetFactory generates 3 assets per report config ({name}_raw, {name}_masked, {name}_backfilled) at runtime. Customer filter assets are similarly dynamic. Auto-discovery scans the vanguard-lake directory and creates VIEWs for everything not already registered — future assets are covered automatically.
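
The scan itself is simple; a sketch of what discover_unregistered_assets() could look like (the signature and directory layout are assumptions — one subdirectory per asset, holding Parquet files):

```python
# Hypothetical sketch: find lake subdirectories that contain Parquet files
# and are not already covered by the explicit registry.
from pathlib import Path

def discover_unregistered_assets(lake_root: Path, registered: set[str]) -> list[str]:
    discovered = []
    for child in sorted(lake_root.iterdir()):
        # An asset is a directory holding at least one .parquet file.
        if child.is_dir() and any(child.glob("*.parquet")) and child.name not in registered:
            discovered.append(child.name)
    return discovered
```

Anything the scan returns gets a schema-inferred VIEW, so factory-generated and future assets are covered without touching the registry.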

ClickHouse settings for robustness

Two settings in clickhouse-config/parquet_settings.xml:

Setting Default Our value Why
engine_file_skip_empty_files 0 1 Dagster creates 0-byte .parquet files for partners with no data. Without this, ClickHouse throws ParquetInvalidOrCorruptedFileException.
engine_file_empty_if_not_exists 0 1 VIEWs for not-yet-materialized assets return empty results instead of erroring.
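
A sketch of what clickhouse-config/parquet_settings.xml could contain, assuming the settings are applied as a default-profile override (the exact element nesting is an assumption):

```xml
<clickhouse>
    <profiles>
        <default>
            <!-- Skip Dagster's 0-byte .parquet files instead of erroring -->
            <engine_file_skip_empty_files>1</engine_file_skip_empty_files>
            <!-- Return empty results for not-yet-materialized assets -->
            <engine_file_empty_if_not_exists>1</engine_file_empty_if_not_exists>
        </default>
    </profiles>
</clickhouse>
```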

Known limitations

Limitation Impact Mitigation
Auto-infer VIEWs fail at creation if no Parquet files exist VIEW not created until asset is materialized Re-run bootstrap-clickhouse after materializing; registered assets with explicit schemas are unaffected
Schema drift (type changes) on explicit-schema VIEWs Query-time error if Parquet type doesn't match VIEW schema Re-run bootstrap after schema changes; adding/removing columns is safe
No query caching Every query re-reads from Parquet Acceptable for analytical workloads; consider Refreshable MVs for hot summary views if needed
Concurrent reads during writes Potential partial-file reads Polars uses atomic write-then-rename; ClickHouse sees complete files or nothing
VIEWs are not indexed No predicate pushdown beyond Parquet row groups Acceptable for current data volumes; MergeTree ingestion is an option if performance degrades
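
The write-then-rename mitigation in the table above is the standard atomic-publish pattern; a minimal sketch (not the pipeline's actual IO-manager code):

```python
# Sketch of atomic write-then-rename: readers (ClickHouse's *.parquet glob)
# see either the complete file or no file at all, never a partial write.
import os
from pathlib import Path

def atomic_write(path: Path, data: bytes) -> None:
    # Write to a temp name that the *.parquet glob will not match...
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_bytes(data)
    # ...then atomically move it into place. os.replace is atomic within a
    # single POSIX filesystem, which holds for the shared volume mount.
    os.replace(tmp, path)
```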

What this does NOT change

  • Zero modifications to the Dagster pipeline, assets, IO manager, sensors, or jobs
  • Zero changes to existing MySQL reporting databases
  • ClickHouse is a purely additive, read-only layer

Local dev usage

# Start ClickHouse
docker compose -f compose.clickhouse.yml up -d

# Run Dagster to materialize assets
dg dev

# Create/refresh all VIEWs
CLICKHOUSE_PASSWORD=clickhouse uv run bootstrap-clickhouse

# Query via HTTP
curl 'http://localhost:8123/?user=default&password=clickhouse&query=SELECT+*+FROM+vanguard.product_distribution'

GUI clients

ClickHouse Play UI (built-in, zero install)

Available immediately at http://localhost:8123/play once ClickHouse is running.

  • Lightweight SQL editor with syntax highlighting in the browser
  • Tabular result display with copy/export
  • Enter credentials in the top bar: user default, password clickhouse
  • Supports all ClickHouse SQL including SHOW TABLES, DESCRIBE, and full queries
  • Good for quick ad-hoc queries; no schema tree or saved queries

Tabix (web-based, ClickHouse-specific)

Open https://tabix.io in your browser — no install required. It connects directly to your local ClickHouse instance via the HTTP interface.

To connect:

  1. Go to https://tabix.io
  2. Enter connection details:
     • URL: http://localhost:8123
     • User: default
     • Password: clickhouse
     • Database: vanguard
  3. Click Connect

Features:

  • Schema/table browser tree in the left sidebar — browse all VIEWs and their columns
  • SQL editor with ClickHouse-aware autocomplete
  • Result visualisation: tables, line charts, bar charts, area charts
  • Saved queries within the browser session
  • Purpose-built for ClickHouse (understands system tables, engines, etc.)

TablePlus (tested)

Connect using:

  • Connection type: ClickHouse
  • Host: localhost
  • Port: 8123
  • User: default
  • Password: clickhouse
  • Database: vanguard

All VIEWs appear in the sidebar and are fully queryable. Tested and confirmed working with this implementation.

DBeaver Community (not tested)

Should work via the ClickHouse JDBC driver. Connect using:

  • JDBC URL: jdbc:clickhouse://localhost:8123/vanguard
  • User: default
  • Password: clickhouse

DBeaver provides a full schema browser, ER diagrams, and data export; it may prompt to download the ClickHouse JDBC driver on first connection. Not tested with this implementation.

Excel — Windows ODBC Driver (direct connection)

On Windows machines, Excel can connect directly to ClickHouse using the ClickHouse ODBC driver:

  1. Install the ClickHouse ODBC driver from GitHub releases
  2. Configure a User DSN in ODBC Data Source Administrator with the ClickHouse server address, port, database (vanguard), and credentials
  3. In Excel: Data > Get Data > From ODBC, select the DSN
  4. Browse VIEWs directly or write custom SQL

This gives Excel users direct, refreshable access to all Vanguard data VIEWs without any intermediate exports. The Altinity blog has a detailed walkthrough of the connection process.

On macOS, Excel's sandbox blocks ODBC driver loading. Use Power Query with the HTTP interface instead (POST SQL to http://<host>:8123 and parse the CSV response). See the Confluence docs for a working Power Query M script.

HTTP output formats

ClickHouse supports multiple output formats via the FORMAT clause, useful for integrations:

Format Use case Example
Pretty Human-readable in browser FORMAT Pretty
CSVWithNames CSV with headers (Excel, scripts) FORMAT CSVWithNames
TabSeparatedWithNames TSV with headers FORMAT TabSeparatedWithNames
JSON JSON array (API consumers) FORMAT JSON
JSONEachRow NDJSON / .jsonl (streaming) FORMAT JSONEachRow
Parquet Binary Parquet download FORMAT Parquet
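
As an illustration, a small helper that builds the HTTP query URL with a FORMAT clause and parses a JSONEachRow response. The host and credentials are this PR's local-dev defaults; the column names and sample rows are hypothetical:

```python
# Sketch: ClickHouse HTTP query URL construction plus JSONEachRow (NDJSON)
# parsing, for scripted integrations against the local-dev instance.
import json
from urllib.parse import urlencode

def query_url(sql: str, host: str = "localhost", port: int = 8123) -> str:
    params = {"user": "default", "password": "clickhouse", "query": sql}
    return f"http://{host}:{port}/?{urlencode(params)}"

def parse_json_each_row(body: str) -> list[dict]:
    # JSONEachRow emits one JSON object per line.
    return [json.loads(line) for line in body.splitlines() if line.strip()]

url = query_url(
    "SELECT partner_id, services FROM vanguard.service_summary_by_partner FORMAT JSONEachRow"
)
sample = '{"partner_id": 1, "services": 42}\n{"partner_id": 2, "services": 7}\n'
rows = parse_json_each_row(sample)
```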

Existing documentation

The fraud-scoring model has a working ClickHouse deployment with detailed Confluence documentation covering browser access, output formats, Power BI/Excel integration, and example queries.

The same patterns apply to this Vanguard Wholesale deployment — substitute vanguard for predictions as the database name and use the appropriate host/credentials.

Deployment steps (production)

  1. Create Docker Swarm config for ClickHouse settings:

    docker config create clickhouse_parquet_settings clickhouse-config/parquet_settings.xml
    

  2. Add ClickHouse env vars to the NFS env file at /opt/dagster/shared-data/environments/:

    CLICKHOUSE_HOST=clickhouse
    CLICKHOUSE_PORT=8123
    CLICKHOUSE_USER=default
    CLICKHOUSE_PASSWORD=<secure-password>
    CLICKHOUSE_DB=vanguard
    CLICKHOUSE_TRAEFIK_HOST=ch.dagster-prod.plan.com   # or appropriate hostname
    

  3. Deploy the stack — the ClickHouse service is included in docker-stack.yml:

    docker stack deploy -c docker-stack.yml dagster
    

  4. Run bootstrap from any container with access to ClickHouse:

    bootstrap-clickhouse
    

  5. Verify via Traefik-routed HTTP:

    https://ch.dagster-prod.plan.com/?query=SHOW+TABLES+FROM+vanguard
    

Test plan

  • [x] Schema mapper produces correct ClickHouse types for all Polars dtypes used in the project
  • [x] Registry imports all 17 explicit schemas without error
  • [x] VIEW generator produces valid DDL for partitioned and unpartitioned assets
  • [x] Auto-discovery finds all factory-generated report assets and customer filters
  • [x] Bootstrap is idempotent (re-run produces no errors)
  • [x] Bootstrap gracefully skips assets without Parquet files
  • [x] 0-byte Parquet files are skipped (verified with/without engine_file_skip_empty_files)
  • [x] Queries return real data from materialized assets
  • [x] Summary VIEWs return correct aggregations
  • [x] HTTP interface accessible at localhost:8123
  • [x] compose.clickhouse.yml starts and passes healthcheck
  • [ ] Production deployment via Docker Swarm (pending)
  • [ ] Traefik routing for production HTTP access (pending)