## Executive Summary
### The core question
Both stacks work. Both use Dagster, Parquet, and a medallion architecture. The question is not "which is better" but where does the data live, and where does the compute run? Everything else — DuckDB vs Polars, ClickHouse vs ad-hoc DuckDB, watchdog vs sensors — follows from that.
We propose two strategies. They are not compromises; each is a coherent architecture with different trade-offs.
### Strategy A: Consolidate on-prem
**One sentence:** Bring the datalake in as a new code location on the existing Rev-Sci Swarm infrastructure, store all Parquet on NFS, serve SQL via ClickHouse.
#### What this looks like
```
┌─────────────────────────────────────────────────────────┐
│  Docker Swarm (on-prem)                                 │
│                                                         │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │ vanguard_    │  │ underwriting │  │ datalake     │   │
│  │ wholesale    │  │ _ml          │  │ (new)        │   │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘   │
│         │                 │                 │           │
│         ▼                 ▼                 ▼           │
│  ┌─────────────────────────────────────────────────┐    │
│  │ NFS: /opt/dagster/shared-data                   │    │
│  │ ├── vanguard_wholesale/                         │    │
│  │ ├── underwriting_ml/                            │    │
│  │ └── datalake/                                   │    │
│  │     ├── bronze/{source}/{table}/                │    │
│  │     ├── silver/facts/                           │    │
│  │     └── silver/dims/                            │    │
│  └─────────────────────────────────────────────────┘    │
│                    │                                    │
│                    ▼                                    │
│         ┌─────────────────────┐                         │
│         │ ClickHouse          │──── HTTP :8123 (Traefik)
│         │ VIEWs over file()   │──── TCP  :9000 (BI tools)
│         └─────────────────────┘                         │
└─────────────────────────────────────────────────────────┘
```
#### Pros
| Advantage | Detail |
|---|---|
| Zero egress cost | All data stays on-prem. No Azure data transfer fees. At 330 GB growing 1–2 GB/day, this saves ~£70+/month vs ADLS round-trips, growing with volume. |
| Existing infrastructure | `SwarmRunLauncher`, Traefik routing, NFS volumes, ClickHouse bootstrap, per-location database isolation — all already built and tested. |
| Compute distribution | `SwarmRunLauncher` gives each run its own container. The datalake proposal runs everything in-process with `max_concurrent_runs=8`. Swarm gives isolation and scales to 50 concurrent runs today. |
| ClickHouse SQL endpoint | Solves the datalake's "no JDBC endpoint" weakness (Briefing s.9, s.12 item 7). HTTP + TCP + JDBC/ODBC access out of the box. |
| IO manager consistency | All code locations use `PolarsParquetIOManager`. Storage paths, partitioning, and serialisation are externalised from asset code. |
| Single operational surface | One Dagster instance, one webserver, one daemon, one set of sensors, one monitoring story. |
| Faster iteration | No cloud deploy cycle. Push an image to `registry.plan.com` and Swarm rolls it out. |
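The "VIEWs over `file()`" pattern above can be sketched as a small helper that emits the DDL. This is a hedged sketch: the database, view name, and NFS-relative glob are illustrative, and it assumes ClickHouse's `user_files_path` is bind-mounted onto the NFS share.

```python
def clickhouse_view_sql(database: str, view: str, relative_glob: str) -> str:
    """Emit DDL exposing NFS Parquet through ClickHouse's file() table
    function. The glob resolves against ClickHouse's user_files_path,
    assumed here to be bind-mounted on the NFS share."""
    return (
        f"CREATE OR REPLACE VIEW {database}.{view} AS "
        f"SELECT * FROM file('{relative_glob}', 'Parquet')"
    )

# Illustrative names; the real layout follows bronze/{source}/{table} etc.
ddl = clickhouse_view_sql(
    "datalake", "silver_facts", "datalake/silver/facts/*.parquet"
)
```

Registering one such view per silver/gold table gives BI tools a stable SQL surface over :8123/:9000 without copying any data.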
#### Cons
| Disadvantage | Detail |
|---|---|
| Single node, no HA | Both systems share this weakness. A hardware failure takes everything down. |
| NFS performance ceiling | NFS over local network is fast for current volumes. At multi-TB scale with many concurrent reads, it becomes the bottleneck. |
| No native ADLS access for analysts | Analysts can't `SET azure_storage_connection_string` and query Parquet directly; they go through ClickHouse or mount NFS. The Power BI ADLS connector doesn't work. |
| Migration effort for datalake | Datalake code needs refactoring: adopt IO managers, replace manual ADLS paths, integrate with Swarm. Gregg's existing bronze extraction logic needs to write to NFS instead of ADLS. |
| Capacity planning is our problem | We own the hardware. Scaling means buying/provisioning, not clicking a slider. |
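The IO-manager refactor in the migration row can be sketched as a code-location skeleton. This is a configuration sketch, not the actual datalake code: it assumes `dagster` and `dagster-polars` are installed, and the asset name, column names, and `base_dir` are illustrative.

```python
import polars as pl
from dagster import Definitions, asset
from dagster_polars import PolarsParquetIOManager

@asset
def silver_facts(bronze_orders: pl.DataFrame) -> pl.DataFrame:
    # Pure transform logic; no hand-built ADLS or NFS paths in the asset.
    return bronze_orders.filter(pl.col("status") == "complete")

defs = Definitions(
    assets=[silver_facts],
    resources={
        # All reads/writes land under the shared NFS mount, matching the
        # layout the other code locations already use.
        "io_manager": PolarsParquetIOManager(
            base_dir="/opt/dagster/shared-data/datalake"
        ),
    },
)
```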
#### Cost
| Item | Monthly |
|---|---|
| ADLS storage | £0 (not used) |
| Compute | £0 marginal (existing hardware) |
| Egress | £0 |
| NFS storage | Already provisioned |
| Total | ~£0 incremental |
### Strategy B: Hybrid — NFS primary, ADLS as distribution layer
**One sentence:** Run all compute on-prem with NFS as the primary store, but sync finished silver/gold outputs to ADLS for analyst self-service and Power BI access.
#### What this looks like
```
┌─────────────────────────────────────────────────┐
│ Docker Swarm (on-prem)                          │
│                                                 │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐    │
│ │ vanguard_  │ │ underwrite │ │ datalake   │    │
│ │ wholesale  │ │ _ml        │ │ (new)      │    │
│ └─────┬──────┘ └─────┬──────┘ └─────┬──────┘    │
│       ▼              ▼              ▼           │
│ ┌─────────────────────────────────────────┐     │
│ │ NFS (primary)                           │     │
│ │ All bronze, silver, gold Parquet        │     │
│ └──────────────┬──────────────────────────┘     │
│                │                                │
│       ┌────────┴────────┐                       │
│       ▼                 ▼                       │
│ ┌───────────┐     ┌────────────┐                │
│ │ ClickHouse│     │ azcopy     │                │
│ │ (on-prem) │     │ sync job   │                │
│ └───────────┘     └─────┬──────┘                │
│                         │                       │
└─────────────────────────┼───────────────────────┘
                          ▼
                   ┌──────────────┐
                   │ ADLS Gen2    │
                   │ (read-only   │
                   │ copy)        │
                   └──────┬───────┘
                          │
              ┌───────────┼───────────┐
              ▼           ▼           ▼
          Power BI      DuckDB      Notebooks
          (ADLS        (analysts)   (PyArrow)
          connector)
```
#### Pros
All of Strategy A's pros, plus:
| Advantage | Detail |
|---|---|
| ADLS analyst access preserved | Analysts who already use DuckDB with `azure_storage_connection_string` keep working. The Power BI ADLS connector works. |
| Gradual migration | Datalake bronze extracts can continue writing to ADLS during transition. The sync job bridges the gap. |
| Cloud-ready foundation | When/if compute moves to Azure, the ADLS copy becomes the primary and the sync reverses direction. |
| Separation of concerns | NFS is the operational store (fast, local). ADLS is the distribution store (accessible, durable). |
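The preserved analyst workflow is worth making concrete. A sketch only: the container name and path are illustrative, the connection string is a placeholder, and the syntax follows DuckDB's azure extension.

```python
# The DuckDB session an analyst runs against the ADLS copy. The
# container/path and connection string below are placeholders.
ANALYST_SESSION = """
INSTALL azure;
LOAD azure;
SET azure_storage_connection_string = '<connection-string>';
SELECT *
FROM read_parquet('az://datalake/silver/facts/*.parquet');
"""
```

Because the ADLS copy is eventually consistent with NFS, this session is for self-service exploration; ClickHouse remains the real-time surface.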
#### Cons
| Disadvantage | Detail |
|---|---|
| Data duplication | Silver/gold exists in two places. Stale-data risk if sync fails. |
| Sync complexity | Another moving part to monitor. Needs its own alerting. |
| Egress cost (small) | `azcopy sync` pushes data from on-prem to Azure. Only silver/gold outputs are synced, not full bronze. Likely <£20/month. |
| Two query surfaces | ClickHouse (on-prem, real-time) and DuckDB/Power BI (ADLS, eventually consistent). Analysts need to know which is canonical. |
#### Cost
| Item | Monthly |
|---|---|
| ADLS storage (silver/gold copy, ~50 GB) | ~£1 |
| Compute | £0 marginal |
| Egress (sync only) | ~£5–20 |
| Total | ~£6–21 incremental |
### Recommendation
Start with Strategy A. It requires zero new infrastructure, eliminates egress, and gives the datalake immediate access to SwarmRunLauncher, ClickHouse, and IO managers. The datalake becomes a code location on existing infra within weeks, not months.
Graduate to Strategy B once analyst demand for ADLS self-service is concrete. The sync job is a single Dagster asset — low effort to add later, and it doesn't change the core architecture.
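That single-asset sync job can be sketched as the command it wraps. The flags follow `azcopy sync`'s documented options; the paths and SAS token are placeholders.

```python
def azcopy_sync_cmd(nfs_dir: str, adls_url_with_sas: str) -> list[str]:
    """Mirror finished silver/gold output from NFS to the ADLS copy.
    --delete-destination=false means a bad local state can never wipe
    the analyst-facing copy."""
    return [
        "azcopy", "sync",
        nfs_dir,
        adls_url_with_sas,
        "--recursive",
        "--delete-destination=false",
    ]

# A Dagster asset would wrap this with subprocess.run(cmd, check=True)
# and alert on a non-zero exit, covering the sync-monitoring con above.
cmd = azcopy_sync_cmd(
    "/opt/dagster/shared-data/datalake/silver",
    "https://<account>.blob.core.windows.net/<container>/silver?<SAS>",
)
```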
Do not start with ADLS as the primary store unless compute is also moving to Azure. With on-prem compute reading from ADLS, every transform pays egress on its reads: building silver reads bronze out of Azure, building gold reads silver out again, and backfills re-read everything (writes back into Azure are free ingress, though they incur small per-operation charges). At current scale this is tolerable; at the roadmap's projected growth it becomes the dominant cost.
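The repeated-egress arithmetic can be made concrete with back-of-envelope numbers. The per-GB rate and daily read volumes below are illustrative assumptions, not quoted Azure prices, but they reproduce the order of magnitude of the ~£70/month figure cited earlier.

```python
# Illustrative assumptions only: actual Azure egress pricing varies by
# region and tier, and daily read volumes depend on the pipelines.
EGRESS_GBP_PER_GB = 0.07      # assumed rate, not a quoted price

daily_bronze_read_gb = 25     # bronze scanned out of ADLS to build silver
daily_silver_read_gb = 8      # silver scanned out again to build gold

# Only reads out of Azure pay egress; writes back in are free ingress
# (plus small per-operation charges, ignored here).
monthly_egress_gb = (daily_bronze_read_gb + daily_silver_read_gb) * 30
monthly_cost_gbp = monthly_egress_gb * EGRESS_GBP_PER_GB

print(monthly_egress_gb, round(monthly_cost_gbp, 2))  # 990 GB, ~£69.30
```

Under Strategy A both reads stay on NFS and this line item is £0; backfills and re-reads only make the gap wider.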