
Executive Summary

The core question

Both stacks work. Both use Dagster, Parquet, and a medallion architecture. The question is not "which is better" but where does the data live, and where does the compute run? Everything else — DuckDB vs Polars, ClickHouse vs ad-hoc DuckDB, watchdog vs sensors — follows from that.

We propose two strategies. They are not compromises; each is a coherent architecture with different trade-offs.


Strategy A: Consolidate on-prem

One sentence: Bring the datalake in as a new code location on the existing Rev-Sci Swarm infrastructure, store all Parquet on NFS, serve SQL via ClickHouse.

What this looks like

┌─────────────────────────────────────────────────────────┐
│  Docker Swarm (on-prem)                                 │
│                                                         │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐    │
│  │ vanguard_    │ │ underwriting │ │  datalake    │    │
│  │ wholesale    │ │ _ml          │ │  (new)       │    │
│  └──────┬───────┘ └──────┬───────┘ └──────┬───────┘    │
│         │                │                │             │
│         ▼                ▼                ▼             │
│  ┌─────────────────────────────────────────────────┐    │
│  │  NFS: /opt/dagster/shared-data                  │    │
│  │  ├── vanguard_wholesale/                        │    │
│  │  ├── underwriting_ml/                           │    │
│  │  └── datalake/                                  │    │
│  │      ├── bronze/{source}/{table}/               │    │
│  │      ├── silver/facts/                          │    │
│  │      └── silver/dims/                           │    │
│  └─────────────────────────────────────────────────┘    │
│         │                                               │
│         ▼                                               │
│  ┌─────────────────────┐                                │
│  │  ClickHouse         │──── HTTP :8123 (Traefik)       │
│  │  VIEWs over file()  │──── TCP  :9000 (BI tools)      │
│  └─────────────────────┘                                │
└─────────────────────────────────────────────────────────┘
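The layout in the diagram implies one deterministic path convention shared by every code location. As a minimal sketch of that convention (the helper name, root constant, and the `crm`/`accounts` source/table values are illustrative assumptions, not existing code):

```python
from pathlib import PurePosixPath

# Assumed NFS mount point, taken from the diagram above.
DATALAKE_ROOT = PurePosixPath("/opt/dagster/shared-data/datalake")

def datalake_path(layer: str, *parts: str) -> PurePosixPath:
    """Build a path under the datalake layout, e.g. bronze/{source}/{table}/."""
    if layer not in {"bronze", "silver"}:
        raise ValueError(f"unknown layer: {layer!r}")
    return DATALAKE_ROOT / layer / PurePosixPath(*parts)

# Hypothetical examples matching the diagram's layout:
bronze = datalake_path("bronze", "crm", "accounts")  # bronze/{source}/{table}/
facts = datalake_path("silver", "facts")
```

Keeping this convention in one place is what lets the IO manager, ClickHouse views, and any sync job agree on where a table lives without per-asset path logic.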

Pros

| Advantage | Detail |
| --- | --- |
| Zero egress cost | All data stays on-prem. No Azure data transfer fees. At 330 GB growing 1–2 GB/day, this saves ~£70+/month vs ADLS round-trips, growing with volume. |
| Existing infrastructure | SwarmRunLauncher, Traefik routing, NFS volumes, ClickHouse bootstrap, per-location database isolation — all already built and tested. |
| Compute distribution | SwarmRunLauncher gives each run its own container. The datalake proposal runs everything in-process with max_concurrent_runs=8. Swarm gives isolation and scales to 50 concurrent runs today. |
| ClickHouse SQL endpoint | Solves the datalake's "no JDBC endpoint" weakness (Briefing s.9, s.12 item 7). HTTP + TCP + JDBC/ODBC access out of the box. |
| IO manager consistency | All code locations use PolarsParquetIOManager. Storage paths, partitioning, and serialisation are externalised from asset code. |
| Single operational surface | One Dagster instance, one webserver, one daemon, one set of sensors, one monitoring story. |
| Faster iteration | No cloud deploy cycle. Push an image to registry.plan.com and Swarm rolls it out. |
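The "VIEWs over file()" pattern from the diagram reduces to DDL generated once per silver table. A hedged sketch, assuming the view name and Parquet glob are hypothetical and that the DDL would be posted to ClickHouse's HTTP endpoint on :8123 (ClickHouse resolves `file()` globs relative to its configured user-files directory, so the NFS share would be mounted there):

```python
import textwrap

def parquet_view_ddl(view: str, parquet_glob: str) -> str:
    """DDL for a ClickHouse VIEW reading Parquet via the file() table function."""
    return textwrap.dedent(f"""\
        CREATE OR REPLACE VIEW {view} AS
        SELECT * FROM file('{parquet_glob}', 'Parquet')""")

# Hypothetical silver fact table exposed to BI tools:
ddl = parquet_view_ddl(
    "datalake.fact_policies",
    "datalake/silver/facts/policies/*.parquet",
)
```

Because the view is just a query over files, no data is copied into ClickHouse; new Parquet files appear in the next query automatically.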

Cons

| Disadvantage | Detail |
| --- | --- |
| Single node, no HA | Both systems share this weakness. A hardware failure takes everything down. |
| NFS performance ceiling | NFS over the local network is fast for current volumes. At multi-TB scale with many concurrent reads, it becomes the bottleneck. |
| No native ADLS access for analysts | Analysts can't SET azure_storage_connection_string and query Parquet directly. They go through ClickHouse or mount NFS. The Power BI ADLS connector doesn't work. |
| Migration effort for datalake | Datalake code needs refactoring: adopt IO managers, replace manual ADLS paths, integrate with Swarm. Gregg's existing bronze extraction logic needs to write to NFS instead of ADLS. |
| Capacity planning is our problem | We own the hardware. Scaling means buying/provisioning, not clicking a slider. |

Cost

| Item | Monthly |
| --- | --- |
| ADLS storage | £0 (not used) |
| Compute | £0 marginal (existing hardware) |
| Egress | £0 |
| NFS storage | Already provisioned |
| Total | ~£0 incremental |

Strategy B: Hybrid — NFS primary, ADLS as distribution layer

One sentence: Run all compute on-prem with NFS as the primary store, but sync finished silver/gold outputs to ADLS for analyst self-service and Power BI access.

What this looks like

┌─────────────────────────────────────────────────┐
│  Docker Swarm (on-prem)                         │
│                                                 │
│  ┌────────────┐ ┌────────────┐ ┌────────────┐  │
│  │ vanguard_  │ │ underwrite │ │ datalake   │  │
│  │ wholesale  │ │ _ml        │ │ (new)      │  │
│  └─────┬──────┘ └─────┬──────┘ └─────┬──────┘  │
│        ▼              ▼              ▼          │
│  ┌─────────────────────────────────────────┐    │
│  │  NFS (primary)                          │    │
│  │  All bronze, silver, gold Parquet       │    │
│  └──────────────┬──────────────────────────┘    │
│                 │                                │
│        ┌────────┴────────┐                       │
│        ▼                 ▼                       │
│  ┌───────────┐    ┌────────────┐                │
│  │ ClickHouse│    │ azcopy     │                │
│  │ (on-prem) │    │ sync job   │                │
│  └───────────┘    └─────┬──────┘                │
│                         │                        │
└─────────────────────────┼────────────────────────┘
                          ▼
                   ┌──────────────┐
                   │  ADLS Gen2   │
                   │  (read-only  │
                   │   copy)      │
                   └──────┬───────┘
                          │
              ┌───────────┼───────────┐
              ▼           ▼           ▼
         Power BI    DuckDB       Notebooks
         (ADLS       (analysts)   (PyArrow)
          connector)

Pros

All of Strategy A's pros, plus:

| Advantage | Detail |
| --- | --- |
| ADLS analyst access preserved | Analysts who already use DuckDB + azure_storage_connection_string keep working. The Power BI ADLS connector works. |
| Gradual migration | Datalake bronze extracts can continue writing to ADLS during transition. The sync job bridges the gap. |
| Cloud-ready foundation | When/if compute moves to Azure, the ADLS copy becomes the primary and the sync reverses direction. |
| Separation of concerns | NFS is the operational store (fast, local). ADLS is the distribution store (accessible, durable). |

Cons

| Disadvantage | Detail |
| --- | --- |
| Data duplication | Silver/gold exists in two places. Stale-data risk if the sync fails. |
| Sync complexity | Another moving part to monitor. Needs its own alerting. |
| Egress cost (small) | azcopy sync pushes data on-prem → Azure. Only silver/gold outputs, not full bronze. Likely <£20/month. |
| Two query surfaces | ClickHouse (on-prem, real-time) and DuckDB/Power BI (ADLS, eventually consistent). Analysts need to know which is canonical. |
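The sync job itself is small: a one-way `azcopy sync` wrapped in something schedulable. A sketch under stated assumptions (the local path, container URL, and account placeholder are illustrative; auth is left to azcopy's environment configuration, and in practice this would run as a Dagster asset so it gets the same alerting as everything else):

```python
import shlex

def azcopy_sync_cmd(local_dir: str, adls_url: str) -> list:
    """Build the azcopy command for a one-way NFS -> ADLS mirror.

    --delete-destination keeps the ADLS copy an exact mirror of NFS, so
    analysts never see files that were compacted away on-prem.
    """
    return [
        "azcopy", "sync", local_dir, adls_url,
        "--recursive",
        "--delete-destination", "true",
    ]

# Hypothetical silver sync (the <account> placeholder is not a real value):
cmd = azcopy_sync_cmd(
    "/opt/dagster/shared-data/datalake/silver",
    "https://<account>.dfs.core.windows.net/datalake/silver",
)
# A Dagster asset would hand this to subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```

Because `sync` is incremental, the daily cost scales with changed silver/gold files, not total volume, which is what keeps the egress line small.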

Cost

| Item | Monthly |
| --- | --- |
| ADLS storage (silver/gold copy, ~50 GB) | ~£1 |
| Compute | £0 marginal |
| Egress (sync only) | ~£5–20 |
| Total | ~£6–21 incremental |

Recommendation

Start with Strategy A. It requires zero new infrastructure, eliminates egress, and gives the datalake immediate access to SwarmRunLauncher, ClickHouse, and IO managers. The datalake becomes a code location on existing infra within weeks, not months.

Graduate to Strategy B once analyst demand for ADLS self-service is concrete. The sync job is a single Dagster asset — low effort to add later, and it doesn't change the core architecture.

Do not start with ADLS as the primary store unless compute is also moving to Azure. Running on-prem compute against ADLS storage means every transform pays data transfer in both directions (reading bronze down, writing silver back up), and every silver-to-gold step pays the read again. At current scale this is tolerable; at the roadmap's projected growth it becomes the dominant cost.