# Cross-Comparison: Rev-Sci Stack vs Proposed Datalake
This section compares each architectural dimension of the two systems, identifies what is more idiomatic Dagster, and notes what each system does better.
## Side-by-side overview

| Dimension | Rev-Sci Stack | Proposed Datalake |
|---|---|---|
| Dagster version | Current (`dagster-dg-cli`, `@definitions` decorator) | Unknown — uses `Definitions()` constructor |
| Code locations | 2 (`vanguard_wholesale`, `underwriting_ml`) | 2 (`definitions_mysql`, `definitions_apis`) |
| Run launcher | `SwarmRunLauncher` — container per run | Default (in-process subprocesses) |
| Run coordinator | `QueuedRunCoordinator`, 50 concurrent, tag limits | `QueuedRunCoordinator`, 8 concurrent |
| Metadata DB | MySQL | PostgreSQL |
| Transform engine | Polars (DataFrame API) | DuckDB (SQL) |
| IO management | `PolarsParquetIOManager` | Manual file I/O to ADLS paths |
| Storage | NFS-backed Parquet (local) | ADLS Gen2 Parquet (cloud) |
| Query layer | ClickHouse VIEWs over `file()` | None (DuckDB ad hoc, Power BI direct) |
| Partitioning | Dagster `DynamicPartitionsDefinition` (by partner UUID) | Filesystem partitioning (`ingestion_date=YYYY-MM-DD/`) |
| Scheduling | Sensors + `AutomationCondition.eager()` | Cron schedules (nightly waves) |
| Monitoring | Dagster sensors + automation sensor | External watchdog (`/loop` every 15 min) |
| Alerting | Via Dagster hooks (implicit) | Pushover notifications from watchdog |
| Schema management | Polars schemas in `clickhouse/schemas.py` | Parquet self-describing + `_concat_tolerant` |
| PII handling | Masking assets in report factory pipeline | Not implemented (flagged as gap) |
| Data quality | Implicit (asset dependencies, schema validation) | Reconciliation module (row counts + watermarks) |
## What to keep from each system

### Keep from Rev-Sci
`SwarmRunLauncher` — This is the single biggest infrastructure advantage. Each Dagster run gets its own Docker container with isolated memory, CPU, and failure domain. The datalake currently runs all transforms in the daemon's process tree, capped at 8 concurrent runs to avoid resource contention. With Swarm, the datalake's 35 bronze extractions can run in parallel containers without competing for memory.
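For reference, a custom run launcher is wired in at the instance level. A hypothetical `dagster.yaml` sketch — the module path and config keys here are invented placeholders; only the `run_launcher` block shape (`module`/`class`/`config`) follows Dagster's convention:

```yaml
# dagster.yaml — instance-level run launcher (illustrative names throughout)
run_launcher:
  module: rev_sci_infra.launchers   # hypothetical internal package
  class: SwarmRunLauncher
  config:
    network: dagster_net            # overlay network shared with the daemon/webserver
    container_kwargs:
      mem_limit: 4g                 # per-run memory isolation, one container per run
```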
ClickHouse SQL layer — The datalake explicitly flags "no JDBC endpoint" as a known weakness. ClickHouse provides HTTP, TCP, JDBC, and ODBC access over lazy Parquet VIEWs with zero data duplication. The existing bootstrap pipeline auto-discovers assets and creates per-location databases. This extends to the datalake trivially.
`PolarsParquetIOManager` — Externalises all storage concerns (paths, partitioning, serialisation) from asset code. Assets declare typed dependencies and receive DataFrames. Storage location is a single config value, not embedded in 50 asset files.
Tag concurrency limits — Rev-Sci's `dagster.yaml` uses per-tag limits (`warehouse_polling: 10`, `report_stage=load: 15`, `usage_bronze_job: 1`) to prevent specific bottlenecks without restricting overall concurrency. More surgical than a global `max_concurrent_runs=8`.
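In `dagster.yaml`, these per-tag limits sit under the run coordinator config, roughly (the keys and values below mirror the numbers quoted above):

```yaml
# dagster.yaml — queued coordinator with per-tag concurrency limits
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 50
    tag_concurrency_limits:
      - key: "warehouse_polling"
        limit: 10
      - key: "report_stage"
        value: "load"       # limit applies only to runs tagged report_stage=load
        limit: 15
      - key: "usage_bronze_job"
        limit: 1            # serialise this job entirely
```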
Automation conditions — `AutomationCondition.eager()` on downstream assets means sensor-triggered materialisation cascades automatically. The datalake uses explicit cron scheduling for each wave. Automation conditions are more idiomatic Dagster and self-maintaining — add a new downstream asset and it materialises automatically when its inputs update.
Dynamic partitions — Rev-Sci partitions by partner UUID using `DynamicPartitionsDefinition`. New partners are detected by sensors and added to the partition set at runtime. The datalake partitions by `ingestion_date` at the filesystem level, which Dagster doesn't manage. Dagster-managed partitions give you backfill UI, per-partition status, and partition-aware sensors.
### Keep from the datalake
Bronze immutability contract — "Bronze is immutable once written. A historical partition is never rewritten." Rev-Sci's bronze layer doesn't enforce this — assets overwrite on each materialisation. The datalake's append-only bronze with `ingestion_date` partitioning gives point-in-time replay. This is the superior pattern for a company-wide data lake.
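A minimal sketch of the contract, assuming the Hive-style `ingestion_date=` layout described above (function names are invented; the Parquet write is stubbed with raw bytes):

```python
from datetime import date, datetime, timezone
from pathlib import Path
from typing import Optional

def bronze_partition_path(root: Path, table: str, ingestion_date: date) -> Path:
    # Hive-style layout: bronze/<table>/ingestion_date=YYYY-MM-DD/
    return root / table / f"ingestion_date={ingestion_date.isoformat()}"

def write_bronze(root: Path, table: str, payload: bytes,
                 ingestion_date: Optional[date] = None) -> Path:
    """Append-only write: a partition that already exists is never rewritten."""
    ingestion_date = ingestion_date or datetime.now(timezone.utc).date()
    partition = bronze_partition_path(root, table, ingestion_date)
    if partition.exists():
        raise FileExistsError(f"bronze partition is immutable: {partition}")
    partition.mkdir(parents=True)
    # Stand-in for a real Parquet writer (Polars or DuckDB would write here).
    (partition / "data.parquet").write_bytes(payload)
    return partition
```

Refusing to touch an existing partition is what buys point-in-time replay: rerunning an extraction for a past date fails loudly instead of silently rewriting history.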
Reconciliation module — Per-table source-vs-lake comparison with explicit tolerances (1% row count for full snapshots, 5-min watermark for incrementals). Rev-Sci has no equivalent — data quality is implicit (if the asset runs, the data is assumed correct). The reconciliation module should be adopted as asset checks.
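The two tolerance rules can be expressed as a pair of small predicates. A sketch using the thresholds quoted above (function names are invented here):

```python
from datetime import datetime, timedelta

ROW_COUNT_TOLERANCE = 0.01                   # 1% for full snapshots
WATERMARK_TOLERANCE = timedelta(minutes=5)   # for incrementals

def reconcile_snapshot(source_rows: int, lake_rows: int) -> bool:
    """Full-snapshot check: lake row count within 1% of the source count."""
    if source_rows == 0:
        return lake_rows == 0
    return abs(source_rows - lake_rows) / source_rows <= ROW_COUNT_TOLERANCE

def reconcile_incremental(source_watermark: datetime,
                          lake_watermark: datetime) -> bool:
    """Incremental check: lake MAX(updated_at) within 5 minutes of the source."""
    return source_watermark - lake_watermark <= WATERMARK_TOLERANCE
```

Wrapped in `@asset_check` decorators, each predicate becomes a pass/fail result attached to the asset it validates.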
Watermark-based incremental extraction — `MAX(updated_at)` per table, persisted as JSON. Rev-Sci does full extractions for most assets. For the datalake's 35 sources (several multi-GB), incremental extraction is essential.
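A sketch of the cursor mechanics, assuming watermarks are kept in one JSON file keyed by table name (names are illustrative, and the naive SQL string interpolation is for demonstration only):

```python
import json
from pathlib import Path
from typing import Optional

def load_watermark(state_file: Path, table: str) -> Optional[str]:
    """Read the persisted MAX(updated_at) cursor for a table, if any."""
    if not state_file.exists():
        return None
    return json.loads(state_file.read_text()).get(table)

def save_watermark(state_file: Path, table: str, watermark: str) -> None:
    """Persist the new cursor after a successful extraction."""
    state = json.loads(state_file.read_text()) if state_file.exists() else {}
    state[table] = watermark
    state_file.write_text(json.dumps(state, indent=2))

def incremental_query(table: str, watermark: Optional[str]) -> str:
    # No cursor yet means first run: fall back to a full extraction.
    base = f"SELECT * FROM {table}"
    return base if watermark is None else f"{base} WHERE updated_at > '{watermark}'"
```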
Schema drift handling — `_concat_tolerant` handles decimal widening, int promotion, and null unification. Rev-Sci relies on Polars schema enforcement, which fails loudly on drift. The datalake's approach is better for bronze (accept additive drift, fail on breaking changes).
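The datalake's `_concat_tolerant` operates on DataFrames; as an illustration of the null-unification part only (decimal widening and int promotion are omitted), a plain-Python analogue over dict rows might look like:

```python
def concat_tolerant(batches):
    """Union-of-columns concat: additive drift is accepted, missing fields become None.

    Each batch is a list of dict rows; column order is first-seen across batches.
    """
    columns = []
    for batch in batches:
        for row in batch:
            for col in row:
                if col not in columns:
                    columns.append(col)
    # Rows from older batches get None for columns added later (null unification).
    return [{col: row.get(col) for col in columns}
            for batch in batches for row in batch]
```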
Source coverage — 35 sources covering core business, external APIs, billing, and CRM. Rev-Sci covers a narrower slice (V3, billing CDRs, Vanguard API). The datalake's breadth is the whole point.
Explicit SLOs — "V3 core domains ready by 07:30 UTC", "Silver ready by 07:00 UTC". Rev-Sci has no stated SLOs. Even if informal, they set expectations.
## What is more idiomatic Dagster

| Pattern | More idiomatic | Why |
|---|---|---|
| IO managers for storage | Rev-Sci | Dagster's docs recommend IO managers as the primary storage abstraction. Manual file I/O bypasses asset materialisation metadata. |
| `@definitions` decorator | Rev-Sci | Newer Dagster API; the datalake likely uses the `Definitions()` constructor directly. |
| Sensors for change detection | Rev-Sci | Dagster sensors are the built-in mechanism for event-driven materialisation. |
| Cron scheduling for extraction waves | Datalake | Dagster schedules are the correct tool for time-based extraction. Both systems use this; the datalake has a more explicit wave structure. |
| `AutomationCondition.eager()` for downstream cascades | Rev-Sci | Replaces explicit schedule dependencies with declarative propagation. |
| Dynamic partitions via `DynamicPartitionsDefinition` | Rev-Sci | Dagster-managed partitions give backfill UI, status tracking, and sensor integration. |
| Filesystem partitioning (`ingestion_date=YYYY-MM-DD/`) | Neither | Not a Dagster pattern — it's manual file layout. Should be `TimeWindowPartitionsDefinition` or `DailyPartitionsDefinition`. |
| Reconciliation as asset checks | Datalake (concept), Rev-Sci (mechanism) | Dagster's `@asset_check` is the right way to implement the datalake's reconciliation module. |
| Watchdog for operational health | Datalake (the need), neither (the implementation) | Dagster doesn't have a built-in orchestrator-health monitor. The watchdog fills a real gap, but running it as a `/loop` on a laptop is fragile. It should be a scheduled Dagster job or cloud-hosted monitor. |
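If the watchdog is rehosted as a scheduled job, its core check reduces to a staleness test. A minimal sketch, assuming the job can read the timestamp of the last completed run (the 15-minute window mirrors the watchdog's stated cadence; names are invented):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

HEARTBEAT_MAX_AGE = timedelta(minutes=15)  # the watchdog's polling cadence

def orchestrator_is_healthy(last_run_completed: datetime,
                            now: Optional[datetime] = None) -> bool:
    """True if the orchestrator completed a run within the expected window.

    A scheduled job would call this and fire an alert (e.g. Pushover) on False.
    """
    now = now or datetime.now(timezone.utc)
    return now - last_run_completed <= HEARTBEAT_MAX_AGE
```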
## Strengths and weaknesses

### Rev-Sci strengths
- Production-hardened infrastructure — SwarmRunLauncher, NFS volumes, Traefik routing, ClickHouse bootstrap — all tested in production with real workloads.
- IO manager discipline — No asset contains a file path. Storage is a resource config, not code.
- Compute isolation — Container-per-run prevents memory contention and cascading failures.
- Query access — ClickHouse HTTP/TCP gives BI tools, analysts, and APIs a SQL endpoint without additional infrastructure.
- Automation cascade — Sensor detects change → materialises parent → `AutomationCondition.eager()` cascades to children. No explicit scheduling of downstream assets.
### Rev-Sci weaknesses
- No bronze immutability — Assets overwrite previous materialisations. No point-in-time replay from bronze.
- No data quality checks — No reconciliation, no row-count validation, no watermark comparison. If the source returns 0 rows, the asset happily writes an empty Parquet file.
- Narrow source coverage — Two code locations serving wholesale reporting and ML. Not a company-wide data platform.
- No incremental extraction — Most assets do full table scans on each run. Works at current scale; won't scale to multi-GB tables.
- Single node — Same as the datalake.
### Datalake strengths
- Architectural rigour — Immutable bronze, replayable silver, explicit grain statements, SCD strategy documented. This is a well-designed data lake.
- Source breadth — 35 sources across core business, CRM, billing, and external APIs.
- Reconciliation — Quantitative correctness checks with explicit tolerances.
- Incremental extraction — Watermark-based with persisted cursors.
- Cost discipline — £5/month storage, explicit comparison to commercial alternatives.
### Datalake weaknesses
- No run launcher — All transforms run in-process. `max_concurrent_runs=8` is a resource-contention mitigation, not an architecture.
- No IO managers — Storage paths are embedded in asset code. Changing the storage location means editing every asset.
- No SQL endpoint — Analysts need DuckDB installed locally. No JDBC/ODBC for BI tools. Explicitly flagged as a gap.
- Watchdog on a laptop — The operational monitoring system runs in a `/loop` session on the data engineer's laptop. Session close = monitoring blackout.
- Queue constraints — 8 concurrent runs for 35 sources means sequential waves are required. A 5-hour nightly window is tight.
- No PII classification — Flagged as a gap. Rev-Sci has PII masking built into the report factory pipeline.
## How these differences shape priorities

### For the datalake team (Gregg)
The highest-value wins from consolidation are infrastructure, not data:
- `SwarmRunLauncher` eliminates the `max_concurrent_runs=8` bottleneck immediately. Bronze extractions can run in parallel containers.
- ClickHouse solves the "no JDBC endpoint" gap without Synapse Serverless licensing or DuckDB-over-HTTP custom builds.
- IO managers decouple storage decisions from asset logic. If we later move to ADLS, it's a config change, not a rewrite.
The datalake's data architecture (immutable bronze, reconciliation, incremental extraction, SLOs) is strong and should be preserved as-is.
### For the Rev-Sci team
The highest-value wins are data architecture patterns:
- Bronze immutability should be adopted for existing assets. Write append-only partitions instead of overwriting.
- Reconciliation should be implemented as `@asset_check` decorators on critical assets.
- Cross-code-location dependencies — once the datalake's bronze layer is available, Rev-Sci assets can declare dependencies on datalake bronze instead of connecting to source databases directly. This removes source-system coupling from the ML and reporting pipelines.