## Executive Summary
### The core question
Both stacks work. Both use Dagster, Parquet, and a medallion architecture. The question is not "which is better" but where does the data live, and where does the compute run? Everything else — DuckDB vs Polars, ClickHouse vs ad-hoc DuckDB, watchdog vs sensors — follows from that.
We propose two strategies. They are not compromises; each is a coherent architecture with different trade-offs.
### Strategy A: Consolidate on-prem
**One sentence:** Bring the datalake in as a new code location on the existing Rev-Sci Swarm infrastructure, store all Parquet on NFS, serve SQL via ClickHouse.
#### What this looks like
```
┌─────────────────────────────────────────────────────────┐
│  Docker Swarm (on-prem)                                 │
│                                                         │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │ vanguard_    │  │ underwriting │  │ datalake     │   │
│  │ wholesale    │  │ _ml          │  │ (new)        │   │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘   │
│         │                 │                 │           │
│         ▼                 ▼                 ▼           │
│  ┌─────────────────────────────────────────────────┐    │
│  │ NFS: /opt/dagster/shared-data                   │    │
│  │ ├── vanguard_wholesale/                         │    │
│  │ ├── underwriting_ml/                            │    │
│  │ └── datalake/                                   │    │
│  │     ├── bronze/{source}/{table}/                │    │
│  │     ├── silver/facts/                           │    │
│  │     └── silver/dims/                            │    │
│  └─────────────────────────────────────────────────┘    │
│                    │                                    │
│                    ▼                                    │
│         ┌─────────────────────┐                         │
│         │ ClickHouse          │──── HTTP :8123 (Traefik)
│         │ VIEWs over file()   │──── TCP  :9000 (BI tools)
│         └─────────────────────┘                         │
└─────────────────────────────────────────────────────────┘
```
#### Pros
| Advantage | Detail |
|---|---|
| Zero egress cost | All data stays on-prem. No Azure data transfer fees. At 330 GB growing 1–2 GB/day, this saves ~£70+/month vs ADLS round-trips, growing with volume. |
| Existing infrastructure | `SwarmRunLauncher`, Traefik routing, NFS volumes, ClickHouse bootstrap, per-location database isolation — all already built and tested. |
| Compute distribution | `SwarmRunLauncher` gives each run its own container. The datalake proposal runs everything in-process with `max_concurrent_runs=8`. Swarm gives isolation and scales to 50 concurrent runs today. |
| ClickHouse SQL endpoint | Solves the datalake's "no JDBC endpoint" weakness (Briefing s.9, s.12 item 7). HTTP + TCP + JDBC/ODBC access out of the box. |
| IO manager consistency | All code locations use `PolarsParquetIOManager`. Storage paths, partitioning, and serialisation are externalised from asset code. |
| Single operational surface | One Dagster instance, one webserver, one daemon, one set of sensors, one monitoring story. |
| Faster iteration | No cloud deploy cycle. Push an image to `registry.plan.com` and Swarm rolls it out. |
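The "VIEWs over `file()`" pattern above can be sketched as a small helper that emits the DDL. This is a hedged sketch: the database, view name, and NFS-relative glob are illustrative, and it assumes ClickHouse's `user_files_path` is bind-mounted onto the NFS share.

```python
def clickhouse_view_sql(database: str, view: str, relative_glob: str) -> str:
    """Emit DDL exposing NFS Parquet through ClickHouse's file() table
    function. The glob resolves against ClickHouse's user_files_path,
    assumed here to be bind-mounted on the NFS share."""
    return (
        f"CREATE OR REPLACE VIEW {database}.{view} AS "
        f"SELECT * FROM file('{relative_glob}', 'Parquet')"
    )

# Illustrative names; the real layout follows bronze/{source}/{table} etc.
ddl = clickhouse_view_sql(
    "datalake", "silver_facts", "datalake/silver/facts/*.parquet"
)
```

Registering one such view per silver/gold table gives BI tools a stable SQL surface over :8123/:9000 without copying any data.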
#### Cons
| Disadvantage | Detail |
|---|---|
| Single node, no HA | Both systems share this weakness. A hardware failure takes everything down. |
| NFS performance ceiling | NFS over local network is fast for current volumes. At multi-TB scale with many concurrent reads, it becomes the bottleneck. |
| No native ADLS access for analysts | Analysts can't `SET azure_storage_connection_string` and query Parquet directly; they go through ClickHouse or mount NFS. The Power BI ADLS connector doesn't work. |
| Migration effort for datalake | Datalake code needs refactoring: adopt IO managers, replace manual ADLS paths, integrate with Swarm. Gregg's existing bronze extraction logic needs to write to NFS instead of ADLS. |
| Capacity planning is our problem | We own the hardware. Scaling means buying/provisioning, not clicking a slider. |
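The IO-manager refactor in the migration row can be sketched as a code-location skeleton. This is a configuration sketch, not the actual datalake code: it assumes `dagster` and `dagster-polars` are installed, and the asset name, column names, and `base_dir` are illustrative.

```python
import polars as pl
from dagster import Definitions, asset
from dagster_polars import PolarsParquetIOManager

@asset
def silver_facts(bronze_orders: pl.DataFrame) -> pl.DataFrame:
    # Pure transform logic; no hand-built ADLS or NFS paths in the asset.
    return bronze_orders.filter(pl.col("status") == "complete")

defs = Definitions(
    assets=[silver_facts],
    resources={
        # All reads/writes land under the shared NFS mount, matching the
        # layout the other code locations already use.
        "io_manager": PolarsParquetIOManager(
            base_dir="/opt/dagster/shared-data/datalake"
        ),
    },
)
```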
#### Cost
| Item | Monthly |
|---|---|
| ADLS storage | £0 (not used) |
| Compute | £0 marginal (existing hardware) |
| Egress | £0 |
| NFS storage | Already provisioned |
| Total | ~£0 incremental |
### Strategy B: Hybrid — NFS primary, ADLS as distribution layer
**One sentence:** Run all compute on-prem with NFS as the primary store, but sync finished silver/gold outputs to ADLS for analyst self-service and Power BI access.
#### What this looks like
```
┌─────────────────────────────────────────────────┐
│ Docker Swarm (on-prem)                          │
│                                                 │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐    │
│ │ vanguard_  │ │ underwrite │ │ datalake   │    │
│ │ wholesale  │ │ _ml        │ │ (new)      │    │
│ └─────┬──────┘ └─────┬──────┘ └─────┬──────┘    │
│       ▼              ▼              ▼           │
│ ┌─────────────────────────────────────────┐     │
│ │ NFS (primary)                           │     │
│ │ All bronze, silver, gold Parquet        │     │
│ └──────────────┬──────────────────────────┘     │
│                │                                │
│       ┌────────┴────────┐                       │
│       ▼                 ▼                       │
│ ┌───────────┐     ┌────────────┐                │
│ │ ClickHouse│     │ azcopy     │                │
│ │ (on-prem) │     │ sync job   │                │
│ └───────────┘     └─────┬──────┘                │
│                         │                       │
└─────────────────────────┼───────────────────────┘
                          ▼
                   ┌──────────────┐
                   │ ADLS Gen2    │
                   │ (read-only   │
                   │ copy)        │
                   └──────┬───────┘
                          │
              ┌───────────┼───────────┐
              ▼           ▼           ▼
          Power BI      DuckDB      Notebooks
          (ADLS        (analysts)   (PyArrow)
          connector)
```
#### Pros
All of Strategy A's pros, plus:
| Advantage | Detail |
|---|---|
| ADLS analyst access preserved | Analysts who already use DuckDB with `azure_storage_connection_string` keep working. The Power BI ADLS connector works. |
| Gradual migration | Datalake bronze extracts can continue writing to ADLS during transition. The sync job bridges the gap. |
| Cloud-ready foundation | When/if compute moves to Azure, the ADLS copy becomes the primary and the sync reverses direction. |
| Separation of concerns | NFS is the operational store (fast, local). ADLS is the distribution store (accessible, durable). |
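The preserved analyst workflow is worth making concrete. A sketch only: the container name and path are illustrative, the connection string is a placeholder, and the syntax follows DuckDB's azure extension.

```python
# The DuckDB session an analyst runs against the ADLS copy. The
# container/path and connection string below are placeholders.
ANALYST_SESSION = """
INSTALL azure;
LOAD azure;
SET azure_storage_connection_string = '<connection-string>';
SELECT *
FROM read_parquet('az://datalake/silver/facts/*.parquet');
"""
```

Because the ADLS copy is eventually consistent with NFS, this session is for self-service exploration; ClickHouse remains the real-time surface.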
#### Cons
| Disadvantage | Detail |
|---|---|
| Data duplication | Silver/gold exists in two places. Stale-data risk if sync fails. |
| Sync complexity | Another moving part to monitor. Needs its own alerting. |
| Egress cost (small) | `azcopy sync` pushes data from on-prem to Azure. Only silver/gold outputs are synced, not full bronze. Likely <£20/month. |
| Two query surfaces | ClickHouse (on-prem, real-time) and DuckDB/Power BI (ADLS, eventually consistent). Analysts need to know which is canonical. |
#### Cost
| Item | Monthly |
|---|---|
| ADLS storage (silver/gold copy, ~50 GB) | ~£1 |
| Compute | £0 marginal |
| Egress (sync only) | ~£5–20 |
| Total | ~£6–21 incremental |
### Recommendation
Start with Strategy A. It requires zero new infrastructure, eliminates egress, and gives the datalake immediate access to SwarmRunLauncher, ClickHouse, and IO managers. The datalake becomes a code location on existing infra within weeks, not months.
Graduate to Strategy B once analyst demand for ADLS self-service is concrete. The sync job is a single Dagster asset — low effort to add later, and it doesn't change the core architecture.
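That single-asset sync job can be sketched as the command it wraps. The flags follow `azcopy sync`'s documented options; the paths and SAS token are placeholders.

```python
def azcopy_sync_cmd(nfs_dir: str, adls_url_with_sas: str) -> list[str]:
    """Mirror finished silver/gold output from NFS to the ADLS copy.
    --delete-destination=false means a bad local state can never wipe
    the analyst-facing copy."""
    return [
        "azcopy", "sync",
        nfs_dir,
        adls_url_with_sas,
        "--recursive",
        "--delete-destination=false",
    ]

# A Dagster asset would wrap this with subprocess.run(cmd, check=True)
# and alert on a non-zero exit, covering the sync-monitoring con above.
cmd = azcopy_sync_cmd(
    "/opt/dagster/shared-data/datalake/silver",
    "https://<account>.blob.core.windows.net/<container>/silver?<SAS>",
)
```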
Do not start with ADLS as the primary store unless compute is also moving to Azure. With on-prem compute reading from ADLS, every transform pays egress on its reads: building silver reads bronze out of Azure, building gold reads silver out again, and backfills re-read everything (writes back into Azure are free ingress, though they incur small per-operation charges). At current scale this is tolerable; at the roadmap's projected growth it becomes the dominant cost.
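The repeated-egress arithmetic can be made concrete with back-of-envelope numbers. The per-GB rate and daily read volumes below are illustrative assumptions, not quoted Azure prices, but they reproduce the order of magnitude of the ~£70/month figure cited earlier.

```python
# Illustrative assumptions only: actual Azure egress pricing varies by
# region and tier, and daily read volumes depend on the pipelines.
EGRESS_GBP_PER_GB = 0.07      # assumed rate, not a quoted price

daily_bronze_read_gb = 25     # bronze scanned out of ADLS to build silver
daily_silver_read_gb = 8      # silver scanned out again to build gold

# Only reads out of Azure pay egress; writes back in are free ingress
# (plus small per-operation charges, ignored here).
monthly_egress_gb = (daily_bronze_read_gb + daily_silver_read_gb) * 30
monthly_cost_gbp = monthly_egress_gb * EGRESS_GBP_PER_GB

print(monthly_egress_gb, round(monthly_cost_gbp, 2))  # 990 GB, ~£69.30
```

Under Strategy A both reads stay on NFS and this line item is £0; backfills and re-reads only make the gap wider.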