# Cross-Comparison: Rev-Sci Stack vs Proposed Datalake
This section compares each architectural dimension of the two systems, identifies what is more idiomatic Dagster, and notes what each system does better.
## Side-by-side overview

| Dimension | Rev-Sci Stack | Proposed Datalake |
|---|---|---|
| Dagster version | Current (`dagster-dg-cli`, `@definitions` decorator) | Unknown — uses `Definitions()` constructor |
| Code locations | 2 (`vanguard_wholesale`, `underwriting_ml`) | 2 (`definitions_mysql`, `definitions_apis`) |
| Run launcher | `SwarmRunLauncher` — container per run | Default (in-process subprocesses) |
| Run coordinator | `QueuedRunCoordinator`, 50 concurrent, tag limits | `QueuedRunCoordinator`, 8 concurrent |
| Metadata DB | MySQL | PostgreSQL |
| Transform engine | Polars (DataFrame API) | DuckDB (SQL) |
| IO management | `PolarsParquetIOManager` | Manual file I/O to ADLS paths |
| Storage | NFS-backed Parquet (local) | ADLS Gen2 Parquet (cloud) |
| Query layer | ClickHouse VIEWs over `file()` | None (DuckDB ad hoc, Power BI direct) |
| Partitioning | Dagster `DynamicPartitionsDefinition` (by partner UUID) | Filesystem partitioning (`ingestion_date=YYYY-MM-DD/`) |
| Scheduling | Sensors + `AutomationCondition.eager()` | Cron schedules (nightly waves) |
| Monitoring | Dagster sensors + automation sensor | External watchdog (`/loop` every 15 min) |
| Alerting | Via Dagster hooks (implicit) | Pushover notifications from watchdog |
| Schema management | Polars schemas in `clickhouse/schemas.py` | Parquet self-describing + `_concat_tolerant` |
| PII handling | Masking assets in report factory pipeline | Not implemented (flagged as gap) |
| Data quality | Implicit (asset dependencies, schema validation) | Reconciliation module (row counts + watermarks) |
## What to keep from each system

### Keep from Rev-Sci
`SwarmRunLauncher` — This is the single biggest infrastructure advantage. Each Dagster run gets its own Docker container with isolated memory, CPU, and failure domain. The datalake currently runs all transforms in the daemon's process tree, capped at 8 concurrent runs to avoid resource contention. With Swarm, the datalake's 35 bronze extractions can run in parallel containers without competing for memory.
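For reference, a custom run launcher is wired in at the instance level. A hypothetical `dagster.yaml` sketch — the module path and config keys here are invented placeholders; only the `run_launcher` block shape (`module`/`class`/`config`) follows Dagster's convention:

```yaml
# dagster.yaml — instance-level run launcher (illustrative names throughout)
run_launcher:
  module: rev_sci_infra.launchers   # hypothetical internal package
  class: SwarmRunLauncher
  config:
    network: dagster_net            # overlay network shared with the daemon/webserver
    container_kwargs:
      mem_limit: 4g                 # per-run memory isolation, one container per run
```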
ClickHouse SQL layer — The datalake explicitly flags "no JDBC endpoint" as a known weakness. ClickHouse provides HTTP, TCP, JDBC, and ODBC access over lazy Parquet VIEWs with zero data duplication. The existing bootstrap pipeline auto-discovers assets and creates per-location databases. This extends to the datalake trivially.
`PolarsParquetIOManager` — Externalises all storage concerns (paths, partitioning, serialisation) from asset code. Assets declare typed dependencies and receive DataFrames. Storage location is a single config value, not embedded in 50 asset files.
Tag concurrency limits — Rev-Sci's `dagster.yaml` uses per-tag limits (`warehouse_polling: 10`, `report_stage=load: 15`, `usage_bronze_job: 1`) to prevent specific bottlenecks without restricting overall concurrency. More surgical than a global `max_concurrent_runs=8`.
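In `dagster.yaml`, these per-tag limits sit under the run coordinator config, roughly (the keys and values below mirror the numbers quoted above):

```yaml
# dagster.yaml — queued coordinator with per-tag concurrency limits
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 50
    tag_concurrency_limits:
      - key: "warehouse_polling"
        limit: 10
      - key: "report_stage"
        value: "load"       # limit applies only to runs tagged report_stage=load
        limit: 15
      - key: "usage_bronze_job"
        limit: 1            # serialise this job entirely
```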
Automation conditions — `AutomationCondition.eager()` on downstream assets means sensor-triggered materialisation cascades automatically. The datalake uses explicit cron scheduling for each wave. Automation conditions are more idiomatic Dagster and self-maintaining — add a new downstream asset and it materialises automatically when its inputs update.
Dynamic partitions — Rev-Sci partitions by partner UUID using `DynamicPartitionsDefinition`. New partners are detected by sensors and added to the partition set at runtime. The datalake partitions by `ingestion_date` at the filesystem level, which Dagster doesn't manage. Dagster-managed partitions give you backfill UI, per-partition status, and partition-aware sensors.
### Keep from the datalake
Bronze immutability contract — "Bronze is immutable once written. A historical partition is never rewritten." Rev-Sci's bronze layer doesn't enforce this — assets overwrite on each materialisation. The datalake's append-only bronze with `ingestion_date` partitioning gives point-in-time replay. This is the superior pattern for a company-wide data lake.
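A minimal sketch of the contract, assuming the Hive-style `ingestion_date=` layout described above (function names are invented; the Parquet write is stubbed with raw bytes):

```python
from datetime import date, datetime, timezone
from pathlib import Path
from typing import Optional

def bronze_partition_path(root: Path, table: str, ingestion_date: date) -> Path:
    # Hive-style layout: bronze/<table>/ingestion_date=YYYY-MM-DD/
    return root / table / f"ingestion_date={ingestion_date.isoformat()}"

def write_bronze(root: Path, table: str, payload: bytes,
                 ingestion_date: Optional[date] = None) -> Path:
    """Append-only write: a partition that already exists is never rewritten."""
    ingestion_date = ingestion_date or datetime.now(timezone.utc).date()
    partition = bronze_partition_path(root, table, ingestion_date)
    if partition.exists():
        raise FileExistsError(f"bronze partition is immutable: {partition}")
    partition.mkdir(parents=True)
    # Stand-in for a real Parquet writer (Polars or DuckDB would write here).
    (partition / "data.parquet").write_bytes(payload)
    return partition
```

Refusing to touch an existing partition is what buys point-in-time replay: rerunning an extraction for a past date fails loudly instead of silently rewriting history.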
Reconciliation module — Per-table source-vs-lake comparison with explicit tolerances (1% row count for full snapshots, 5-min watermark for incrementals). Rev-Sci has no equivalent — data quality is implicit (if the asset runs, the data is assumed correct). The reconciliation module should be adopted as asset checks.
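The two tolerance rules can be expressed as a pair of small predicates. A sketch using the thresholds quoted above (function names are invented here):

```python
from datetime import datetime, timedelta

ROW_COUNT_TOLERANCE = 0.01                   # 1% for full snapshots
WATERMARK_TOLERANCE = timedelta(minutes=5)   # for incrementals

def reconcile_snapshot(source_rows: int, lake_rows: int) -> bool:
    """Full-snapshot check: lake row count within 1% of the source count."""
    if source_rows == 0:
        return lake_rows == 0
    return abs(source_rows - lake_rows) / source_rows <= ROW_COUNT_TOLERANCE

def reconcile_incremental(source_watermark: datetime,
                          lake_watermark: datetime) -> bool:
    """Incremental check: lake MAX(updated_at) within 5 minutes of the source."""
    return source_watermark - lake_watermark <= WATERMARK_TOLERANCE
```

Wrapped in `@asset_check` decorators, each predicate becomes a pass/fail result attached to the asset it validates.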
Watermark-based incremental extraction — `MAX(updated_at)` per table, persisted as JSON. Rev-Sci does full extractions for most assets. For the datalake's 35 sources (several multi-GB), incremental extraction is essential.
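A sketch of the cursor mechanics, assuming watermarks are kept in one JSON file keyed by table name (names are illustrative, and the naive SQL string interpolation is for demonstration only):

```python
import json
from pathlib import Path
from typing import Optional

def load_watermark(state_file: Path, table: str) -> Optional[str]:
    """Read the persisted MAX(updated_at) cursor for a table, if any."""
    if not state_file.exists():
        return None
    return json.loads(state_file.read_text()).get(table)

def save_watermark(state_file: Path, table: str, watermark: str) -> None:
    """Persist the new cursor after a successful extraction."""
    state = json.loads(state_file.read_text()) if state_file.exists() else {}
    state[table] = watermark
    state_file.write_text(json.dumps(state, indent=2))

def incremental_query(table: str, watermark: Optional[str]) -> str:
    # No cursor yet means first run: fall back to a full extraction.
    base = f"SELECT * FROM {table}"
    return base if watermark is None else f"{base} WHERE updated_at > '{watermark}'"
```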
Schema drift handling — `_concat_tolerant` handles decimal widening, int promotion, and null unification. Rev-Sci relies on Polars schema enforcement, which fails loudly on drift. The datalake's approach is better for bronze (accept additive drift, fail on breaking changes).
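The datalake's `_concat_tolerant` operates on DataFrames; as an illustration of the null-unification part only (decimal widening and int promotion are omitted), a plain-Python analogue over dict rows might look like:

```python
def concat_tolerant(batches):
    """Union-of-columns concat: additive drift is accepted, missing fields become None.

    Each batch is a list of dict rows; column order is first-seen across batches.
    """
    columns = []
    for batch in batches:
        for row in batch:
            for col in row:
                if col not in columns:
                    columns.append(col)
    # Rows from older batches get None for columns added later (null unification).
    return [{col: row.get(col) for col in columns}
            for batch in batches for row in batch]
```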
Source coverage — 35 sources covering core business, external APIs, billing, and CRM. Rev-Sci covers a narrower slice (V3, billing CDRs, Vanguard API). The datalake's breadth is the whole point.
Explicit SLOs — "V3 core domains ready by 07:30 UTC", "Silver ready by 07:00 UTC". Rev-Sci has no stated SLOs. Even if informal, they set expectations.
## What is more idiomatic Dagster

| Pattern | More idiomatic | Why |
|---|---|---|
| IO managers for storage | Rev-Sci | Dagster's docs recommend IO managers as the primary storage abstraction. Manual file I/O bypasses asset materialisation metadata. |
| `@definitions` decorator | Rev-Sci | Newer Dagster API; the datalake likely uses the `Definitions()` constructor directly. |
| Sensors for change detection | Rev-Sci | Dagster sensors are the built-in mechanism for event-driven materialisation. |
| Cron scheduling for extraction waves | Datalake | Dagster schedules are the correct tool for time-based extraction. Both systems use this; the datalake has a more explicit wave structure. |
| `AutomationCondition.eager()` for downstream cascades | Rev-Sci | Replaces explicit schedule dependencies with declarative propagation. |
| Dynamic partitions via `DynamicPartitionsDefinition` | Rev-Sci | Dagster-managed partitions give backfill UI, status tracking, and sensor integration. |
| Filesystem partitioning (`ingestion_date=YYYY-MM-DD/`) | Neither | Not a Dagster pattern — it's manual file layout. Should be `TimeWindowPartitionsDefinition` or `DailyPartitionsDefinition`. |
| Reconciliation as asset checks | Datalake (concept), Rev-Sci (mechanism) | Dagster's `@asset_check` is the right way to implement the datalake's reconciliation module. |
| Watchdog for operational health | Datalake (the need), neither (the implementation) | Dagster doesn't have a built-in orchestrator-health monitor. The watchdog fills a real gap, but running it as a `/loop` on a laptop is fragile. It should be a scheduled Dagster job or cloud-hosted monitor. |
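If the watchdog is rehosted as a scheduled job, its core check reduces to a staleness test. A minimal sketch, assuming the job can read the timestamp of the last completed run (the 15-minute window mirrors the watchdog's stated cadence; names are invented):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

HEARTBEAT_MAX_AGE = timedelta(minutes=15)  # the watchdog's polling cadence

def orchestrator_is_healthy(last_run_completed: datetime,
                            now: Optional[datetime] = None) -> bool:
    """True if the orchestrator completed a run within the expected window.

    A scheduled job would call this and fire an alert (e.g. Pushover) on False.
    """
    now = now or datetime.now(timezone.utc)
    return now - last_run_completed <= HEARTBEAT_MAX_AGE
```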
## Strengths and weaknesses

### Rev-Sci strengths
- Production-hardened infrastructure — SwarmRunLauncher, NFS volumes, Traefik routing, ClickHouse bootstrap — all tested in production with real workloads.
- IO manager discipline — No asset contains a file path. Storage is a resource config, not code.
- Compute isolation — Container-per-run prevents memory contention and cascading failures.
- Query access — ClickHouse HTTP/TCP gives BI tools, analysts, and APIs a SQL endpoint without additional infrastructure.
- Automation cascade — Sensor detects change → materialises parent → `AutomationCondition.eager()` cascades to children. No explicit scheduling of downstream assets.
### Rev-Sci weaknesses
- No bronze immutability — Assets overwrite previous materialisations. No point-in-time replay from bronze.
- No data quality checks — No reconciliation, no row-count validation, no watermark comparison. If the source returns 0 rows, the asset happily writes an empty Parquet file.
- Narrow source coverage — Two code locations serving wholesale reporting and ML. Not a company-wide data platform.
- No incremental extraction — Most assets do full table scans on each run. Works at current scale; won't scale to multi-GB tables.
- Single node — Same as the datalake.
### Datalake strengths
- Architectural rigour — Immutable bronze, replayable silver, explicit grain statements, SCD strategy documented. This is a well-designed data lake.
- Source breadth — 35 sources across core business, CRM, billing, and external APIs.
- Reconciliation — Quantitative correctness checks with explicit tolerances.
- Incremental extraction — Watermark-based with persisted cursors.
- Cost discipline — £5/month storage, explicit comparison to commercial alternatives.
### Datalake weaknesses
- No run launcher — All transforms run in-process. `max_concurrent_runs=8` is a resource-contention mitigation, not an architecture.
- No IO managers — Storage paths are embedded in asset code. Changing the storage location means editing every asset.
- No SQL endpoint — Analysts need DuckDB installed locally. No JDBC/ODBC for BI tools. Explicitly flagged as a gap.
- Watchdog on a laptop — The operational monitoring system runs in a `/loop` session on the data engineer's laptop. Session close = monitoring blackout.
- Queue constraints — 8 concurrent runs for 35 sources means sequential waves are required. A 5-hour nightly window is tight.
- No PII classification — Flagged as a gap. Rev-Sci has PII masking built into the report factory pipeline.
## How these differences shape priorities

### For the datalake team (Gregg)
The highest-value wins from consolidation are infrastructure, not data:
- `SwarmRunLauncher` eliminates the `max_concurrent_runs=8` bottleneck immediately. Bronze extractions can run in parallel containers.
- ClickHouse solves the "no JDBC endpoint" gap without Synapse Serverless licensing or DuckDB-over-HTTP custom builds.
- IO managers decouple storage decisions from asset logic. If we later move to ADLS, it's a config change, not a rewrite.
The datalake's data architecture (immutable bronze, reconciliation, incremental extraction, SLOs) is strong and should be preserved as-is.
### For the Rev-Sci team
The highest-value wins are data architecture patterns:
- Bronze immutability should be adopted for existing assets. Write append-only partitions instead of overwriting.
- Reconciliation should be implemented as `@asset_check` decorators on critical assets.
- Cross-code-location dependencies — once the datalake's bronze layer is available, Rev-Sci assets can declare dependencies on datalake bronze instead of connecting to source databases directly. This removes source-system coupling from the ML and reporting pipelines.