ClickHouse Unification¶
ClickHouse is already deployed in the Rev-Sci stack as a SQL layer over Dagster-produced Parquet files. The proposed datalake has no SQL endpoint — it's the most-cited gap in the briefing (s.9, s.12). This section describes how a unified ClickHouse deployment serves both systems and what end-user SQL access looks like.
What ClickHouse does today (Rev-Sci)¶
ClickHouse runs as a Docker Swarm service with:
- HTTP interface on port 8123 (Traefik-routed, TLS)
- Native TCP on port 9000 (BI tools, CLI clients)
- NFS volume mount at `/var/lib/clickhouse/user_files` — same volume as Dagster's data directory
- Lazy VIEWs — `CREATE OR REPLACE VIEW db.asset AS SELECT * FROM file('path/*.parquet', Parquet, 'schema')`
- Per-code-location databases — `vanguard_wholesale`, `underwriting_ml`, each with their own VIEWs
- Auto-discovery — `bootstrap.py` scans the data directory, finds Parquet files not in the registry, and creates VIEWs with auto-inferred schemas
- Explicit schemas — registered assets have Polars-to-ClickHouse type mappings for type safety
- Empty-file tolerance — `parquet_settings.xml` skips 0-byte files and returns empty results for missing assets
No data is stored in ClickHouse. Every query reads Parquet on-the-fly via file(). ClickHouse is a query router, not a data store.
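The auto-discovery step can be sketched as a pure function. This is a simplified illustration, not the real `bootstrap.py`: the function name, the `registry` argument, and the one-directory-per-asset layout are all assumptions made here for clarity.

```python
from pathlib import Path


def discover_unregistered(data_dir: str, registry: set[str]) -> list[str]:
    """Find Parquet-backed assets on disk that have no registry entry.

    Assumes one directory per asset, i.e. <data_dir>/<asset>/*.parquet,
    mirroring the layout the VIEWs read from. The real bootstrap.py may
    use a different convention.
    """
    found = set()
    for parquet in Path(data_dir).rglob("*.parquet"):
        # Skip 0-byte files, matching the empty-file tolerance above.
        if parquet.stat().st_size == 0:
            continue
        # Asset name = first path component under data_dir.
        asset = parquet.relative_to(data_dir).parts[0]
        found.add(asset)
    return sorted(found - registry)
```

Assets returned by this scan would get VIEWs with auto-inferred schemas; assets already in the registry keep their explicit Polars-to-ClickHouse mappings.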
What ClickHouse solves for the datalake¶
The briefing identifies these access gaps:
| Gap (from briefing) | ClickHouse solution |
|---|---|
| "No JDBC/REST endpoint" (s.9) | HTTP :8123 (REST-like SQL), Native TCP :9000 (JDBC/ODBC drivers) |
| "Tools that expect a warehouse endpoint (Tableau, Looker, Mode)" (s.9) | ClickHouse has native Tableau connector, ODBC driver for Looker, JDBC for Mode |
| "Evaluating Unity vs Synapse Serverless vs DuckDB-over-HTTP" (s.13) | ClickHouse is already deployed, open-source, no licensing cost |
| "No formal data catalogue" (s.12 item 3) | ClickHouse's SHOW DATABASES / SHOW TABLES / DESCRIBE TABLE provide a basic catalogue. Not Unity-grade, but functional. |
| "Implicit lineage only" (s.12 item 5) | ClickHouse VIEWs are named after Dagster asset keys — VIEW name = asset name = lineage entry point |
Unified ClickHouse: per-code-location databases¶
The existing bootstrap pattern creates a database per code location. Adding the datalake gives:
ClickHouse instance
├── vanguard_wholesale (existing)
│ ├── v3_products
│ ├── billing_usage_aggregated
│ ├── connected_service_liability
│ └── ... (~50 VIEWs)
│
├── underwriting_ml (existing)
│ ├── ml_predictions_enriched
│ ├── drift_reports
│ └── ...
│
└── datalake (new)
├── bronze_v3_customer
├── bronze_v3_service
├── bronze_billing_cdrs
├── silver_fact_invoice
├── silver_fact_invoice_item
├── silver_dim_customer
└── ... (~80+ VIEWs across bronze/silver/gold)
VIEW generation for the datalake¶
The existing view_generator.py generates DDL for both registered (explicit schema) and auto-discovered (inferred schema) assets. For the datalake, the bronze/silver/gold directory structure maps directly:
-- Bronze: partitioned by ingestion_date, multiple parquet files per partition
CREATE OR REPLACE VIEW datalake.bronze_v3_customer AS
SELECT * FROM file(
'/var/lib/clickhouse/user_files/datalake/bronze/v3_customer/**/*.parquet',
Parquet
)
-- Silver: single latest.parquet per table
CREATE OR REPLACE VIEW datalake.silver_fact_invoice AS
SELECT * FROM file(
'/var/lib/clickhouse/user_files/datalake/silver/facts/fact_invoice/latest.parquet',
Parquet,
'invoice_id UInt64, customer_id UInt64, ...'
)
-- Gold (future): date-partitioned marts
CREATE OR REPLACE VIEW datalake.gold_revenue_monthly AS
SELECT * FROM file(
'/var/lib/clickhouse/user_files/datalake/gold/revenue_monthly/**/*.parquet',
Parquet
)
Bronze VIEWs use **/*.parquet globs to read across all ingestion_date= partitions. ClickHouse handles this natively — file() supports recursive globs.
End-user SQL access: can users create their own views and tables?¶
This is an important governance question. The answer should be yes, with guardrails.
What users should be able to do¶
- Query any VIEW — `SELECT * FROM datalake.silver_fact_invoice WHERE customer_id = 123`
- Create personal views — `CREATE VIEW my_analysis.monthly_churn AS SELECT ...` for recurring queries
- Create temporary tables — `CREATE TEMPORARY TABLE tmp AS SELECT ...` for multi-step analysis sessions
- Join across code locations — `SELECT v.*, d.customer_name FROM vanguard_wholesale.connected_service_liability v JOIN datalake.silver_dim_customer d ON ...`
What users should NOT be able to do¶
- Write to the data lake — ClickHouse VIEWs are read-only by construction (they read `file()`, there's no write path). Bronze immutability is enforced at the storage level.
- Modify managed VIEWs — The bootstrap owns VIEWs in the `vanguard_wholesale`, `underwriting_ml`, and `datalake` databases. User modifications would be overwritten on the next bootstrap run.
- Create MergeTree tables — These store data inside ClickHouse, breaking the "ClickHouse stores nothing" contract and creating an unmanaged data silo.
Implementation: user-writable scratch databases¶
Create a scratch database (or per-user databases) where analysts can create their own VIEWs and tables:
CREATE DATABASE scratch;
CREATE ROLE analyst_role;
GRANT CREATE TABLE, CREATE VIEW, SELECT, INSERT ON scratch.* TO analyst_role;
GRANT SELECT ON vanguard_wholesale.* TO analyst_role;
GRANT SELECT ON underwriting_ml.* TO analyst_role;
GRANT SELECT ON datalake.* TO analyst_role;
This gives analysts full SQL power within a sandboxed namespace while keeping the managed databases read-only.
Acceptable trade-offs:
- Analyst-created VIEWs in `scratch` are not backed up, not version-controlled, and not monitored. This is the intent — scratch is ephemeral.
- If an analyst view becomes load-bearing (other people depend on it), it should be promoted to a Dagster asset and managed through the normal pipeline. The ClickHouse VIEW becomes a read-only window; the Dagster asset becomes the source of truth.
- Tables created in `scratch` consume ClickHouse storage (TEMPORARY tables are session-scoped and vanish on disconnect, but plain tables persist). Set a TTL or periodic cleanup to prevent unbounded growth.
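If scratch provisioning is folded into the bootstrap rather than run by hand, the grant DDL could be generated from the list of managed databases. A sketch under the assumptions of the example above (role name `analyst_role`, database list as shown); the function builds the statements, it does not execute them:

```python
def scratch_grants(role: str, managed_dbs: list[str]) -> list[str]:
    """Build DDL to provision a sandboxed scratch namespace:
    full rights on scratch.*, read-only SELECT on every managed database."""
    stmts = [
        "CREATE DATABASE IF NOT EXISTS scratch",
        f"CREATE ROLE IF NOT EXISTS {role}",
        f"GRANT CREATE TABLE, CREATE VIEW, SELECT, INSERT ON scratch.* TO {role}",
    ]
    stmts += [f"GRANT SELECT ON {db}.* TO {role}" for db in managed_dbs]
    return stmts
```

Running this on every bootstrap pass keeps the grants idempotent: adding a new code location to the managed list automatically extends read-only access to its database.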
Cross-code-location joins¶
One of ClickHouse's unique advantages in this architecture: it sees all code locations' data through a single SQL namespace. An analyst can join wholesale reporting data with datalake bronze in a single query:
SELECT
dl.customer_name,
dl.dealer_name,
vw.connected_services,
vw.monthly_liability
FROM datalake.silver_dim_customer dl
JOIN vanguard_wholesale.connected_service_liability vw
ON dl.customer_id = vw.customer_id
WHERE dl.dealer_uuid = '2708fccc-6c5c-41d8-8fe3-0f6081f25894'
Neither Dagster nor the IO managers enable this — it's a ClickHouse-only capability. This is the strongest argument for a unified ClickHouse instance serving all code locations.
Where should ClickHouse run?¶
Option 1: On-prem (recommended for Strategy A/B)¶
Already deployed. NFS mount gives zero-latency access to Parquet files. Traefik provides TLS-terminated HTTP access. The only cost is the container's memory (ClickHouse is efficient — 1–2 GB RAM for query-only workloads at this scale).
Option 2: Azure Container App (only if compute moves to Azure)¶
If and when Dagster moves to Azure, ClickHouse should move with it. Use the azureBlobStorage() table function instead of file() — same VIEW pattern, different storage backend. The bootstrap would need a mode switch: file() for local/NFS, azureBlobStorage() for ADLS.
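That mode switch could be a single branch where the bootstrap builds the FROM clause. A sketch only: `azure_conn` stands in for a ClickHouse named collection holding the connection details, the container name `datalake` is a placeholder, and the azureBlobStorage() argument order should be checked against the ClickHouse documentation before use:

```python
def table_source(path: str, backend: str = "file") -> str:
    """Return the table-function expression for a given storage backend.

    backend="file"  -> local/NFS Parquet via file()
    backend="azure" -> ADLS via azureBlobStorage(); azure_conn and the
    'datalake' container below are placeholders, not real settings.
    """
    if backend == "file":
        return f"file('/var/lib/clickhouse/user_files/{path}', Parquet)"
    if backend == "azure":
        return f"azureBlobStorage(azure_conn, 'datalake', '{path}', 'Parquet')"
    raise ValueError(f"unknown backend: {backend}")
```

Everything downstream of this function (VIEW names, databases, the glob layout) stays identical, which is what makes the relocation low-risk.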
Option 3: ClickHouse Cloud (not recommended)¶
Managed ClickHouse is priced for persistent storage workloads (MergeTree tables). Our use case — stateless VIEWs over external Parquet — doesn't benefit from managed features (replication, sharding, backups). We'd be paying for capabilities we don't use.
Migration path¶
- Day 1: Datalake code location writes Parquet to NFS. Bootstrap creates the `datalake` database with VIEWs.
- Week 1: Register datalake bronze and silver assets in `asset_registry.py` with explicit schemas. Auto-discovery handles the rest.
- Week 2: Create the `scratch` database with analyst permissions. Share ClickHouse connection details.
- Ongoing: As gold marts are built, they get VIEWs automatically. Analysts build ad-hoc analyses in `scratch`, promoting recurring ones to Dagster assets.