DAG-3: ClickHouse OLAP Lakehouse for Vanguard Wholesale¶
Summary¶
Adds a ClickHouse data lakehouse layer that provides SQL-queryable VIEWs over all Dagster-managed Parquet assets. ClickHouse reads Parquet files directly from disk via its file() table function — zero data duplication, zero changes to the existing Dagster pipeline.
VIEWs are created automatically, covering all v3-derived assets, billing CDR layers, wholesale liability, customer filters, factory-generated report assets (raw, masked, backfilled), and UUID mappings.
- Natively queryable via HTTP (container port 8123) — browser, curl, API integrations
- Queryable via TCP (port 9000) — TablePlus, DBeaver, BI tools
- Queryable via the built-in Play UI — http://localhost:8123/play
JIRA¶
Epic: DAG-3 — OLAP Lakehouse Federation for Reports Data
| Task | Summary |
|---|---|
| DAG-9 | ClickHouse config and Polars-to-ClickHouse schema mapper |
| DAG-10 | Asset registry and VIEW generator |
| DAG-11 | Bootstrap script and CLI entry point |
| DAG-12 | Docker Compose — ClickHouse for local dev |
| DAG-13 | Docker Swarm stack — ClickHouse for production |
| DAG-14 | Analytical summary VIEWs |
| DAG-15 | Verification and end-to-end testing |
| DAG-16 | Auto-discover parquet assets for ClickHouse VIEWs |
Architecture¶
```
Dagster Pipeline ──writes──> Parquet files (vanguard-lake/)
                                   │
                         (shared volume mount)
                                   │
ClickHouse ──reads via file()──> VIEWs
                                   │
                    ┌──────────────┼──────────────┐
                TCP:9000       HTTP:8123       Traefik
              (TablePlus)      (browser/     (production
                               curl/API)      routing)
```
ClickHouse VIEWs use the file() table function to read Parquet directly. VIEWs are lazy — no data is stored in ClickHouse. New asset materializations are immediately visible on the next query (glob patterns auto-discover new files).
How VIEWs work¶
Registered assets (17) have explicit Polars-to-ClickHouse schema mappings for type safety:
```sql
CREATE OR REPLACE VIEW vanguard.v3_service_products AS
SELECT * FROM file('/var/lib/clickhouse/user_files/vanguard-lake/v3_service_products/*.parquet',
    Parquet, 'service_product_id Int64, service_id Int64, ...')
```
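The explicit column list in the DDL above comes from the Polars-to-ClickHouse type mapper. A minimal sketch of what `polars_schema_to_clickhouse()` could look like — the type table and signature here are illustrative assumptions, not the actual contents of `schema_mapper.py`:

```python
# Hypothetical Polars-dtype-name -> ClickHouse-type table; the real mapper
# may differ in coverage, nullability handling, and datetime precision.
_TYPE_MAP = {
    "Int64": "Int64",
    "Int32": "Int32",
    "Float64": "Float64",
    "Utf8": "String",
    "String": "String",
    "Boolean": "Bool",
    "Date": "Date32",
    "Datetime": "DateTime64(6)",
}


def polars_schema_to_clickhouse(schema: dict[str, str]) -> str:
    """Render a {column: polars_dtype_name} mapping as the column-type
    structure string passed to file(..., Parquet, '<structure>')."""
    return ", ".join(f"{name} {_TYPE_MAP[dtype]}" for name, dtype in schema.items())


print(polars_schema_to_clickhouse(
    {"service_product_id": "Int64", "service_id": "Int64", "name": "Utf8"}
))
# service_product_id Int64, service_id Int64, name String
```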
Auto-discovered assets (~35) use ClickHouse's Parquet metadata inference:
```sql
CREATE OR REPLACE VIEW vanguard.active_services_raw AS
SELECT * FROM file('/var/lib/clickhouse/user_files/vanguard-lake/active_services_raw/*.parquet', Parquet)
```
Summary VIEWs (4) are hand-crafted aggregations that reference the raw VIEWs:
- `service_summary_by_partner` — active/suspended services, overspend per partner
- `usage_summary_by_partner` — domestic/roaming data/voice/SMS/cost per partner per month
- `product_distribution` — service counts by product, network, plan level
- `liability_by_partner` — customer count, service count, monthly liability per partner
What changed¶
New files¶
| File | Purpose |
|---|---|
| `src/vanguard_wholesale/clickhouse/__init__.py` | Package init |
| `src/vanguard_wholesale/clickhouse/config.py` | `ClickHouseConfig` dataclass with env var defaults |
| `src/vanguard_wholesale/clickhouse/schema_mapper.py` | `polars_schema_to_clickhouse()` type mapping |
| `src/vanguard_wholesale/clickhouse/asset_registry.py` | 17 registered assets with explicit schemas |
| `src/vanguard_wholesale/clickhouse/view_generator.py` | DDL generation + `discover_unregistered_assets()` |
| `src/vanguard_wholesale/clickhouse/summary_views.py` | 4 analytical aggregate VIEWs |
| `src/vanguard_wholesale/clickhouse/bootstrap.py` | 3-phase bootstrap CLI (`bootstrap-clickhouse`) |
| `compose.clickhouse.yml` | Standalone ClickHouse for local dev |
| `clickhouse-config/parquet_settings.xml` | ClickHouse settings for Parquet robustness |
Modified files¶
| File | Change |
|---|---|
| `pyproject.toml` | Added `clickhouse-connect>=0.8.0` dep + `bootstrap-clickhouse` entry point |
| `docker-stack.yml` | Added ClickHouse service with Traefik labels, Swarm config, `clickhouse_data` volume |
| `.env.template` | Added `CLICKHOUSE_*` env vars (localhost defaults) |
| `environments/vanguard_wholesale.env.template` | Added `CLICKHOUSE_*` env vars (Docker defaults) |
| `uv.lock` | Updated lockfile |
Research findings and design decisions¶
Why ClickHouse?¶
- Already proven in the fraud-scoring model (`underwriting-fraud-model` repo) with the same Parquet-over-VIEWs pattern
- Native `file()` function reads Parquet directly — no ETL, no data duplication
- HTTP interface enables browser-based queries, curl, and API integration out of the box
- The NFS volume path already contains `CLICK_HOUSE` in the name (DAG-7), confirming this was anticipated
- Lightweight: VIEWs are schema-only metadata (~KB each), zero cost when idle
Why auto-discovery?¶
The manual registry only covers assets with module-level `pl.Schema` constants (17 of ~50 assets). The `ReportAssetFactory` generates 3 assets per report config (`{name}_raw`, `{name}_masked`, `{name}_backfilled`) at runtime. Customer filter assets are similarly dynamic. Auto-discovery scans the `vanguard-lake` directory and creates VIEWs for everything not already registered — future assets are covered automatically.
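The scan-and-generate step can be sketched as follows. This is a simplified illustration, not the actual `view_generator.py` code — the directory layout, function signatures, and DDL shape are assumptions based on the examples above:

```python
from pathlib import Path


def discover_unregistered_assets(lake_dir: str, registered: set[str]) -> list[str]:
    """Scan the lake directory for asset subdirectories that contain at
    least one Parquet file and are not in the explicit registry."""
    found = []
    for sub in sorted(Path(lake_dir).iterdir()):
        if sub.is_dir() and sub.name not in registered and any(sub.glob("*.parquet")):
            found.append(sub.name)
    return found


def auto_view_ddl(db: str, asset: str, files_root: str) -> str:
    """Schema-inferred VIEW DDL for an unregistered asset; the glob means
    newly materialized files are picked up on the next query."""
    return (
        f"CREATE OR REPLACE VIEW {db}.{asset} AS "
        f"SELECT * FROM file('{files_root}/{asset}/*.parquet', Parquet)"
    )
```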
ClickHouse settings for robustness¶
Two settings in clickhouse-config/parquet_settings.xml:
| Setting | Default | Our value | Why |
|---|---|---|---|
| `engine_file_skip_empty_files` | 0 | 1 | Dagster creates 0-byte `.parquet` files for partners with no data. Without this, ClickHouse throws `ParquetInvalidOrCorruptedFileException`. |
| `engine_file_empty_if_not_exists` | 0 | 1 | VIEWs for not-yet-materialized assets return empty results instead of erroring. |
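For reference, a plausible shape for `clickhouse-config/parquet_settings.xml`, applying both settings to the default profile (the actual file in this PR may be structured differently):

```xml
<clickhouse>
    <profiles>
        <default>
            <!-- Skip 0-byte .parquet files instead of throwing -->
            <engine_file_skip_empty_files>1</engine_file_skip_empty_files>
            <!-- Missing files yield empty results instead of errors -->
            <engine_file_empty_if_not_exists>1</engine_file_empty_if_not_exists>
        </default>
    </profiles>
</clickhouse>
```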
Known limitations¶
| Limitation | Impact | Mitigation |
|---|---|---|
| Auto-infer VIEWs fail at creation if no Parquet files exist | VIEW not created until asset is materialized | Re-run bootstrap-clickhouse after materializing; registered assets with explicit schemas are unaffected |
| Schema drift (type changes) on explicit-schema VIEWs | Query-time error if Parquet type doesn't match VIEW schema | Re-run bootstrap after schema changes; adding/removing columns is safe |
| No query caching | Every query re-reads from Parquet | Acceptable for analytical workloads; consider Refreshable MVs for hot summary views if needed |
| Concurrent reads during writes | Potential partial-file reads | Polars uses atomic write-then-rename; ClickHouse sees complete files or nothing |
| VIEWs are not indexed | No predicate pushdown beyond Parquet row groups | Acceptable for current data volumes; MergeTree ingestion is an option if performance degrades |
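The write-then-rename mitigation in the table above relies on `rename` being atomic within a filesystem, so readers see either the old file or the complete new one. A generic sketch of the pattern (not the project's actual IO manager code):

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Write to a temp file in the same directory, then atomically rename
    over the target so concurrent readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes hit disk before the rename
        os.replace(tmp, path)  # atomic on POSIX within one filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
```

The temp file must live in the same directory as the target: `os.replace` across filesystems is not atomic (and may fail outright).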
What this does NOT change¶
- Zero modifications to the Dagster pipeline, assets, IO manager, sensors, or jobs
- Zero changes to existing MySQL reporting databases
- ClickHouse is a purely additive, read-only layer
Local dev usage¶
```bash
# Start ClickHouse
docker compose -f compose.clickhouse.yml up -d

# Run Dagster to materialize assets
dg dev

# Create/refresh all VIEWs
CLICKHOUSE_PASSWORD=clickhouse uv run bootstrap-clickhouse

# Query via HTTP
curl 'http://localhost:8123/?user=default&password=clickhouse&query=SELECT+*+FROM+vanguard.product_distribution'
```
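The same HTTP query can be issued from Python with only the standard library — useful for quick scripts without adding `clickhouse-connect`. A minimal sketch, assuming the local-dev defaults above and a running server:

```python
import urllib.parse
import urllib.request


def clickhouse_query_url(sql: str, host: str = "localhost", port: int = 8123,
                         user: str = "default", password: str = "clickhouse") -> str:
    """Build a ClickHouse HTTP-interface query URL with URL-encoded SQL."""
    params = urllib.parse.urlencode({"user": user, "password": password, "query": sql})
    return f"http://{host}:{port}/?{params}"


def run_query(sql: str) -> str:
    """Execute a query over HTTP and return the raw response body."""
    with urllib.request.urlopen(clickhouse_query_url(sql)) as resp:
        return resp.read().decode()


# e.g. run_query("SELECT * FROM vanguard.product_distribution FORMAT CSVWithNames")
```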
GUI clients¶
ClickHouse Play UI (built-in, zero install)¶
Available immediately at http://localhost:8123/play once ClickHouse is running.
- Lightweight SQL editor with syntax highlighting in the browser
- Tabular result display with copy/export
- Enter credentials in the top bar: user `default`, password `clickhouse`
- Supports all ClickHouse SQL including `SHOW TABLES`, `DESCRIBE`, and full queries
- Good for quick ad-hoc queries; no schema tree or saved queries
Tabix (web-based, ClickHouse-specific)¶
Open https://tabix.io in your browser — no install required. It connects directly to your local ClickHouse instance via the HTTP interface.
To connect:
1. Go to https://tabix.io
2. Enter connection details:
- URL: http://localhost:8123
- User: default
- Password: clickhouse
- Database: vanguard
3. Click Connect
Features:
- Schema/table browser tree in the left sidebar — browse all VIEWs and their columns
- SQL editor with ClickHouse-aware autocomplete
- Result visualisation: tables, line charts, bar charts, area charts
- Saved queries within the browser session
- Purpose-built for ClickHouse (understands system tables, engines, etc.)
TablePlus (tested)¶
Connect using:
- Connection type: ClickHouse
- Host: localhost
- Port: 8123
- User: default
- Password: clickhouse
- Database: vanguard
All VIEWs appear in the sidebar and are fully queryable. Tested and confirmed working with this implementation.
DBeaver Community (not tested)¶
Should work via the ClickHouse JDBC driver. Connect using:
- JDBC URL: jdbc:clickhouse://localhost:8123/vanguard
- User: default
- Password: clickhouse
DBeaver provides a full schema browser, ER diagrams, and data export. The ClickHouse JDBC driver may need to be downloaded on first connection. Note: not tested with this implementation.
Excel — Windows ODBC Driver (direct connection)¶
On Windows machines, Excel can connect directly to ClickHouse using the ClickHouse ODBC driver:
- Install the ClickHouse ODBC driver from GitHub releases
- Configure a User DSN in ODBC Data Source Administrator with the ClickHouse server address, port, database (`vanguard`), and credentials
- In Excel: Data > Get Data > From ODBC, select the DSN
- Browse VIEWs directly or write custom SQL
This gives Excel users direct, refreshable access to all Vanguard data VIEWs without any intermediate exports. The Altinity blog has a detailed walkthrough of the connection process.
On macOS, Excel's sandbox blocks ODBC driver loading. Use Power Query with the HTTP interface instead (POST SQL to http://<host>:8123 and parse the CSV response). See the Confluence docs for a working Power Query M script.
HTTP output formats¶
ClickHouse supports multiple output formats via the FORMAT clause, useful for integrations:
| Format | Use case | Example |
|---|---|---|
| `Pretty` | Human-readable in browser | `FORMAT Pretty` |
| `CSVWithNames` | CSV with headers (Excel, scripts) | `FORMAT CSVWithNames` |
| `TabSeparatedWithNames` | TSV with headers | `FORMAT TabSeparatedWithNames` |
| `JSON` | JSON array (API consumers) | `FORMAT JSON` |
| `JSONEachRow` | NDJSON / `.jsonl` (streaming) | `FORMAT JSONEachRow` |
| `Parquet` | Binary Parquet download | `FORMAT Parquet` |
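As an example, a `JSONEachRow` response (one JSON object per line) can be consumed with the standard library alone. The sample payload below is illustrative, not real query output:

```python
import json


def parse_json_each_row(body: str) -> list[dict]:
    """Parse a ClickHouse FORMAT JSONEachRow (NDJSON) response body."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


sample = '{"partner":"acme","services":12}\n{"partner":"globex","services":7}\n'
rows = parse_json_each_row(sample)
print(rows[0]["partner"])  # acme
```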
Existing documentation¶
The fraud-scoring model has a working ClickHouse deployment with detailed Confluence documentation covering browser access, output formats, Power BI/Excel integration, and example queries:
- Confluence: Observability (ClickHouse "LakeHouse")
The same patterns apply to this Vanguard Wholesale deployment — use `vanguard` in place of `predictions` as the database name, with the appropriate host and credentials.
Deployment steps (production)¶
1. Create the Docker Swarm config for ClickHouse settings:

   ```bash
   docker config create clickhouse_parquet_settings clickhouse-config/parquet_settings.xml
   ```

2. Add ClickHouse env vars to the NFS env file at `/opt/dagster/shared-data/environments/`:

   ```
   CLICKHOUSE_HOST=clickhouse
   CLICKHOUSE_PORT=8123
   CLICKHOUSE_USER=default
   CLICKHOUSE_PASSWORD=<secure-password>
   CLICKHOUSE_DB=vanguard
   CLICKHOUSE_TRAEFIK_HOST=ch.dagster-prod.plan.com  # or appropriate hostname
   ```

3. Deploy the stack — the ClickHouse service is included in `docker-stack.yml`:

   ```bash
   docker stack deploy -c docker-stack.yml dagster
   ```

4. Run bootstrap from any container with access to ClickHouse:

   ```bash
   bootstrap-clickhouse
   ```

5. Verify via Traefik-routed HTTP:

   ```
   https://ch.dagster-prod.plan.com/?query=SHOW+TABLES+FROM+vanguard
   ```
Test plan¶
- [x] Schema mapper produces correct ClickHouse types for all Polars dtypes used in the project
- [x] Registry imports all 17 explicit schemas without error
- [x] VIEW generator produces valid DDL for partitioned and unpartitioned assets
- [x] Auto-discovery finds all factory-generated report assets and customer filters
- [x] Bootstrap is idempotent (re-run produces no errors)
- [x] Bootstrap gracefully skips assets without Parquet files
- [x] 0-byte Parquet files are skipped (verified with/without `engine_file_skip_empty_files`)
- [x] Queries return real data from materialized assets
- [x] Summary VIEWs return correct aggregations
- [x] HTTP interface accessible at `localhost:8123`
- [x] `compose.clickhouse.yml` starts and passes healthcheck
- [ ] Production deployment via Docker Swarm (pending)
- [ ] Traefik routing for production HTTP access (pending)