ClickHouse Unification

ClickHouse is already deployed in the Rev-Sci stack as a SQL layer over Dagster-produced Parquet files. The proposed datalake has no SQL endpoint — it's the most-cited gap in the briefing (s.9, s.12). This section describes how a unified ClickHouse deployment serves both systems and what end-user SQL access looks like.


What ClickHouse does today (Rev-Sci)

ClickHouse runs as a Docker Swarm service with:

  • HTTP interface on port 8123 (Traefik-routed, TLS)
  • Native TCP on port 9000 (BI tools, CLI clients)
  • NFS volume mount at /var/lib/clickhouse/user_files — same volume as Dagster's data directory
  • Lazy VIEWs — CREATE OR REPLACE VIEW db.asset AS SELECT * FROM file('path/*.parquet', Parquet, 'schema')
  • Per-code-location databases — vanguard_wholesale, underwriting_ml, each with their own VIEWs
  • Auto-discovery — bootstrap.py scans the data directory, finds Parquet files not in the registry, and creates VIEWs with auto-inferred schemas
  • Explicit schemas — registered assets have Polars-to-ClickHouse type mappings for type safety
  • Empty-file tolerance — parquet_settings.xml skips 0-byte files and returns empty results for missing assets

No data is stored in ClickHouse. Every query reads Parquet on-the-fly via file(). ClickHouse is a query router, not a data store.
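The auto-discovery step can be sketched as a directory walk that emits DDL for anything not in the registry. This is a minimal illustration under stated assumptions, not the actual bootstrap.py internals; the function name, the `db` database, and the directory-name-equals-asset-key convention are placeholders:

```python
import tempfile
from pathlib import Path

def discover_parquet_assets(data_dir, registry):
    """Scan data_dir for Parquet files whose parent directory (taken here as
    the asset key) is not in the registry, and emit lazy-VIEW DDL with
    schema inference delegated to ClickHouse."""
    ddl = {}
    for parquet in Path(data_dir).rglob("*.parquet"):
        asset = parquet.parent.name  # directory name doubles as the asset key
        if asset in registry:
            continue  # registered assets get explicit schemas instead
        ddl[asset] = (
            f"CREATE OR REPLACE VIEW db.{asset} AS "
            f"SELECT * FROM file('{parquet.parent}/*.parquet', Parquet)"
        )
    return ddl

# Throwaway directory standing in for the NFS data volume:
demo = Path(tempfile.mkdtemp())
(demo / "my_asset").mkdir()
(demo / "my_asset" / "part0.parquet").touch()
ddl = discover_parquet_assets(demo, registry=set())
```

Registered assets skip this path entirely and get their explicit Polars-to-ClickHouse schema instead.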


What ClickHouse solves for the datalake

The briefing identifies these access gaps:

| Gap (from briefing) | ClickHouse solution |
| --- | --- |
| "No JDBC/REST endpoint" (s.9) | HTTP :8123 (REST-like SQL); Native TCP :9000 (JDBC/ODBC drivers) |
| "Tools that expect a warehouse endpoint (Tableau, Looker, Mode)" (s.9) | ClickHouse has a native Tableau connector, an ODBC driver for Looker, JDBC for Mode |
| "Evaluating Unity vs Synapse Serverless vs DuckDB-over-HTTP" (s.13) | ClickHouse is already deployed, open-source, no licensing cost |
| "No formal data catalogue" (s.12 item 3) | SHOW DATABASES / SHOW TABLES / DESCRIBE TABLE gives a basic catalogue. Not Unity-grade, but functional. |
| "Implicit lineage only" (s.12 item 5) | ClickHouse VIEWs are named after Dagster asset keys — VIEW name = asset name = lineage entry point |

Unified ClickHouse: per-code-location databases

The existing bootstrap pattern creates a database per code location. Adding the datalake gives:

ClickHouse instance
├── vanguard_wholesale    (existing)
│   ├── v3_products
│   ├── billing_usage_aggregated
│   ├── connected_service_liability
│   └── ... (~50 VIEWs)
│
├── underwriting_ml       (existing)
│   ├── ml_predictions_enriched
│   ├── drift_reports
│   └── ...
│
└── datalake              (new)
    ├── bronze_v3_customer
    ├── bronze_v3_service
    ├── bronze_billing_cdrs
    ├── silver_fact_invoice
    ├── silver_fact_invoice_item
    ├── silver_dim_customer
    └── ... (~80+ VIEWs across bronze/silver/gold)

VIEW generation for the datalake

The existing view_generator.py generates DDL for both registered (explicit schema) and auto-discovered (inferred schema) assets. For the datalake, the bronze/silver/gold directory structure maps directly:

-- Bronze: partitioned by ingestion_date, multiple parquet files per partition
CREATE OR REPLACE VIEW datalake.bronze_v3_customer AS
SELECT * FROM file(
    '/var/lib/clickhouse/user_files/datalake/bronze/v3_customer/**/*.parquet',
    Parquet
)

-- Silver: single latest.parquet per table
CREATE OR REPLACE VIEW datalake.silver_fact_invoice AS
SELECT * FROM file(
    '/var/lib/clickhouse/user_files/datalake/silver/facts/fact_invoice/latest.parquet',
    Parquet,
    'invoice_id UInt64, customer_id UInt64, ...'
)

-- Gold (future): date-partitioned marts
CREATE OR REPLACE VIEW datalake.gold_revenue_monthly AS
SELECT * FROM file(
    '/var/lib/clickhouse/user_files/datalake/gold/revenue_monthly/**/*.parquet',
    Parquet
)

Bronze VIEWs use **/*.parquet globs to read across all ingestion_date= partitions. ClickHouse handles this natively — file() supports recursive globs.
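The layer-to-path mapping above can be captured in a small rule table. This is a sketch of what the datalake extension to view_generator.py might look like; the DATA_ROOT constant matches the paths above, but the function and dictionary names are assumptions:

```python
# Assumed layout rules mirroring the DDL above: bronze and gold read recursive
# globs across partitions; silver reads a single latest.parquet per table.
DATA_ROOT = "/var/lib/clickhouse/user_files/datalake"

LAYER_PATTERNS = {
    "bronze": "{root}/bronze/{asset}/**/*.parquet",
    "silver": "{root}/silver/{group}/{asset}/latest.parquet",
    "gold": "{root}/gold/{asset}/**/*.parquet",
}

def datalake_view_ddl(layer, asset, group="", schema=""):
    """Render CREATE OR REPLACE VIEW DDL for one datalake asset.
    An explicit schema (registered assets) becomes file()'s third argument."""
    path = LAYER_PATTERNS[layer].format(root=DATA_ROOT, asset=asset, group=group)
    args = f"'{path}', Parquet"
    if schema:
        args += f", '{schema}'"
    return (
        f"CREATE OR REPLACE VIEW datalake.{layer}_{asset} AS\n"
        f"SELECT * FROM file({args})"
    )
```

Keeping the patterns in one dictionary means a new layer (gold) is one line of config, not a new code path.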


End-user SQL access: can users create their own views and tables?

This is an important governance question. The answer should be yes, with guardrails.

What users should be able to do

  1. Query any VIEW — SELECT * FROM datalake.silver_fact_invoice WHERE customer_id = 123
  2. Create personal views — CREATE VIEW my_analysis.monthly_churn AS SELECT ... for recurring queries
  3. Create temporary tables — CREATE TEMPORARY TABLE tmp AS SELECT ... for multi-step analysis sessions
  4. Join across code locations — SELECT v.*, d.customer_name FROM vanguard_wholesale.connected_service_liability v JOIN datalake.silver_dim_customer d ON ...
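For item 1, no driver is required: any HTTP client works against port 8123. A minimal standard-library sketch follows; the hostname and user are placeholders, and password handling and TLS verification are omitted:

```python
import urllib.parse
import urllib.request

def clickhouse_http_request(sql, host="clickhouse.example.internal", user="analyst"):
    """Build a POST against the ClickHouse HTTP interface: the request body
    is the SQL text, and default_format controls the response encoding."""
    query = urllib.parse.urlencode({"user": user, "default_format": "JSONEachRow"})
    url = f"https://{host}:8123/?{query}"
    return urllib.request.Request(url, data=sql.encode(), method="POST")

# Sending it is one line once credentials are configured:
# rows = urllib.request.urlopen(clickhouse_http_request(
#     "SELECT * FROM datalake.silver_fact_invoice WHERE customer_id = 123"))
```

BI tools would instead use the native TCP port 9000 via the JDBC/ODBC drivers mentioned above.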

What users should NOT be able to do

  1. Write to the data lake — ClickHouse VIEWs are read-only by construction (they read file(), there's no write path). Bronze immutability is enforced at the storage level.
  2. Modify managed VIEWs — The bootstrap owns VIEWs in vanguard_wholesale, underwriting_ml, and datalake databases. User modifications would be overwritten on next bootstrap run.
  3. Create MergeTree tables — These store data inside ClickHouse, breaking the "ClickHouse stores nothing" contract and creating an unmanaged data silo.

Implementation: user-writable scratch databases

Create a scratch database (or per-user databases) where analysts can create their own VIEWs and tables:

CREATE DATABASE scratch;
GRANT CREATE TABLE, CREATE VIEW, SELECT, INSERT ON scratch.* TO analyst_role;
GRANT SELECT ON vanguard_wholesale.* TO analyst_role;
GRANT SELECT ON underwriting_ml.* TO analyst_role;
GRANT SELECT ON datalake.* TO analyst_role;

This gives analysts full SQL power within a sandboxed namespace while keeping the managed databases read-only.

Acceptable trade-offs:

  • Analyst-created VIEWs in scratch are not backed up, not version-controlled, and not monitored. This is the intent — scratch is ephemeral.
  • If an analyst view becomes load-bearing (other people depend on it), it should be promoted to a Dagster asset and managed through the normal pipeline. The ClickHouse VIEW becomes a read-only window; the Dagster asset becomes the source of truth.
  • Temporary tables in scratch consume ClickHouse storage. Set a TTL or periodic cleanup to prevent unbounded growth.
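The cleanup in the last bullet can be a pure function over system.tables metadata, which keeps it easy to test. The columns referenced (name, metadata_modification_time) exist in ClickHouse's system.tables; the job itself is a hypothetical sketch:

```python
from datetime import datetime, timedelta

def stale_scratch_drops(tables, max_age_days=30, now=None):
    """Given rows from
      SELECT name, metadata_modification_time AS modified
      FROM system.tables WHERE database = 'scratch'
    emit DROP statements for tables untouched for max_age_days.
    A cron job would feed these back through clickhouse-client."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=max_age_days)
    return [
        f"DROP TABLE IF EXISTS scratch.{row['name']}"
        for row in tables
        if row["modified"] < cutoff
    ]
```

Generating DDL rather than executing it inline also leaves an audit trail of what the job intends to drop.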

Cross-code-location joins

One of ClickHouse's unique advantages in this architecture: it sees all code locations' data through a single SQL namespace. An analyst can join wholesale reporting data with datalake bronze in a single query:

SELECT
    dl.customer_name,
    dl.dealer_name,
    vw.connected_services,
    vw.monthly_liability
FROM datalake.silver_dim_customer dl
JOIN vanguard_wholesale.connected_service_liability vw
    ON dl.customer_id = vw.customer_id
WHERE dl.dealer_uuid = '2708fccc-6c5c-41d8-8fe3-0f6081f25894'

Neither Dagster nor the IO managers enable this — it's a ClickHouse-only capability. This is the strongest argument for a unified ClickHouse instance serving all code locations.


Where should ClickHouse run?

Option 1: stay on Docker Swarm (recommended)

Already deployed. The NFS mount gives the container direct local-filesystem access to the Parquet files. Traefik provides TLS-terminated HTTP access. The only cost is the container's memory (ClickHouse is efficient: 1–2 GB RAM for query-only workloads at this scale).

Option 2: Azure Container App (only if compute moves to Azure)

If and when Dagster moves to Azure, ClickHouse should move with it. Use azureBlobStorage() table function instead of file() — same VIEW pattern, different storage backend. The bootstrap would need a mode switch: file() for local/NFS, azureBlobStorage() for ADLS.
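The mode switch could be as small as one function that renders the source clause of each VIEW. A sketch under stated assumptions: the azureBlobStorage() arguments shown (connection string, container, path, format) follow its simple connection-string form, and the container name and connection string are obvious placeholders:

```python
def table_function(path, backend="nfs", container="datalake",
                   connection="DefaultEndpointsProtocol=https;..."):
    """Render the FROM-clause table function for one VIEW: file() while
    Parquet lives on NFS, azureBlobStorage() once it moves to ADLS.
    The connection string default is a placeholder, not a working value."""
    if backend == "nfs":
        return f"file('/var/lib/clickhouse/user_files/{path}', Parquet)"
    return f"azureBlobStorage('{connection}', '{container}', '{path}', Parquet)"
```

Everything downstream of this function (VIEW names, databases, grants) would be untouched by the move.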

Option 3: managed ClickHouse (not recommended)

Managed ClickHouse is priced for persistent storage workloads (MergeTree tables). Our use case — stateless VIEWs over external Parquet — doesn't benefit from managed features (replication, sharding, backups). We'd be paying for capabilities we don't use.


Migration path

  1. Day 1: Datalake code location writes Parquet to NFS. Bootstrap creates datalake database with VIEWs.
  2. Week 1: Register datalake bronze and silver assets in asset_registry.py with explicit schemas. Auto-discovery handles the rest.
  3. Week 2: Create scratch database with analyst permissions. Share ClickHouse connection details.
  4. Ongoing: As gold marts are built, they get VIEWs automatically. Analysts build ad-hoc analyses in scratch, promote recurring ones to Dagster assets.