
Gaps and Risks

Shortcomings identified across both systems, and open questions for the merged architecture.


Shared weaknesses (both systems)

Single-node compute, no HA

Both systems run on a single on-prem Docker Swarm node. A hardware failure, kernel panic, or network partition takes down all orchestration, all running transforms, and (for ClickHouse) all query access.

Mitigation path: Multi-node Swarm (short term) or migration to managed K8s / Azure Container Apps (medium term). The briefing lists this as a roadmap item. For the merged stack, this becomes higher priority — one failure now takes down two data platforms, not one.

No automated backup of Parquet data

NFS volumes are the single copy. ADLS provides Azure-managed redundancy, but NFS does not. A volume corruption or accidental deletion loses data.

Mitigation: NFS snapshot schedule (if the NAS supports it), or the Strategy B azcopy sync to ADLS as an offsite copy.

Dagster daemon as single point of failure

If the Dagster daemon crashes, no sensors evaluate, no schedules fire, no runs launch. Neither system monitors the daemon's health from outside (the datalake's watchdog queries GraphQL but runs on a laptop).

Mitigation: Swarm's restart_policy: on-failure handles transient crashes. A lightweight external health check (HTTP ping to the webserver's /health endpoint, alert on failure) would catch sustained outages.


Rev-Sci gaps

No bronze immutability

Assets overwrite their Parquet files on each materialisation. There are no historical partitions, no ingestion_date dimension, and no point-in-time replay. If a source system changes data retroactively, the previous state is lost.

Impact: Silver/gold layers can't be rebuilt from a historical bronze snapshot. If an asset produces bad data, the previous good data is already overwritten.

Fix: Adopt the datalake's append-only bronze pattern. Use DailyPartitionsDefinition so each day's extraction is a separate file. The IO manager handles partition-to-path mapping.
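The core of the pattern is the partition-key-to-path mapping. A minimal sketch, assuming the datalake's ingestion_date=YYYY-MM-DD directory convention (the base path and function name are illustrative; in Dagster this logic belongs in the IO manager, keyed by DailyPartitionsDefinition partition keys):

```python
from pathlib import Path

# Placeholder base path; the real tree lives on the NFS share.
BRONZE_ROOT = Path("/opt/dagster/shared-data/datalake")

def bronze_partition_path(asset_name: str, partition_key: str) -> Path:
    """Map a daily partition key (YYYY-MM-DD) to an append-only bronze
    file path: one file per day, never overwritten by later runs."""
    return BRONZE_ROOT / asset_name / f"ingestion_date={partition_key}" / "data.parquet"
```

Because each day lands in its own directory, re-running a partition overwrites only that day's file, and silver/gold rebuilds can target any historical snapshot.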

No data quality checks

No reconciliation module, no row-count validation, no watermark comparison, no @asset_check decorators. If a source returns 0 rows, the asset writes an empty Parquet file and reports success.

Impact: Silent data loss. An empty Parquet file propagates through silver/gold layers, producing empty reports without alerting anyone.

Fix: Adopt the datalake's reconciliation pattern, implemented as @asset_check decorators. Check row count against source, check watermark freshness, check for unexpected nulls.
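The check logic is simple enough to sketch as a plain function (names and the tolerance value are assumptions; in Dagster this would be wrapped in an @asset_check that returns an AssetCheckResult):

```python
def reconcile_extraction(extracted_rows: int, source_rows: int,
                         tolerance: float = 0.01) -> tuple[bool, str]:
    """Fail on empty extractions and on row counts drifting more than
    `tolerance` from the source's own count."""
    if extracted_rows == 0 and source_rows > 0:
        return False, "extracted 0 rows but source reports data"
    if source_rows == 0:
        return True, "source is empty; an empty extraction is expected"
    drift = abs(extracted_rows - source_rows) / source_rows
    if drift > tolerance:
        return False, f"row-count drift {drift:.1%} exceeds {tolerance:.0%}"
    return True, "row counts reconcile"
```

This directly closes the "0 rows reports success" hole: the empty-file case becomes a failed check instead of a silent pass.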

No incremental extraction

Most Rev-Sci assets do full table scans. The V3 database queries select all active records on each run. For tables with millions of rows that change infrequently, this is wasteful.

Impact: Extraction time scales with table size, not change rate. As source tables grow, extraction windows widen.

Fix: Adopt the datalake's watermark-based incremental pattern for tables with a reliable updated_at column. Keep full-snapshot as fallback for tables without monotonic columns.
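The extraction-side decision can be sketched as query construction (a simplified illustration: table and column names are interpolated for brevity here, where real code should use parameterised queries, and the function name is hypothetical):

```python
from typing import Optional

def extraction_query(table: str, watermark_column: Optional[str],
                     last_watermark: Optional[str]) -> str:
    """Incremental extraction when a monotonic column and a previous
    watermark both exist; full snapshot otherwise."""
    if watermark_column and last_watermark:
        return (f"SELECT * FROM {table} "
                f"WHERE {watermark_column} > '{last_watermark}'")
    return f"SELECT * FROM {table}"
```

Tables without a reliable updated_at column simply pass None and keep their full-snapshot behaviour, so the migration can proceed table by table.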


Datalake gaps

No run launcher

All transforms run in the daemon's process tree. max_concurrent_runs=8 is a resource-contention cap, not an architectural choice. Eight DuckDB transforms at 2–4 GB each need 16–32 GB RAM, shared with the daemon and sensors.

Impact: A memory-hungry transform (e.g., the 27.9M-row fact_invoice_item silver build) can OOM the daemon, taking down all scheduling and sensor evaluation.

Fix: SwarmRunLauncher. Each run gets its own container with isolated memory. The daemon stays lean.
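To make the isolation concrete, a launched run's Swarm service might look like the fragment below. This is an illustrative sketch only; the image name, command, and memory limit are placeholders, not the briefing's actual configuration:

```yaml
# Illustrative per-run Swarm service (all names and limits are placeholders).
services:
  dagster-run-<run_id>:
    image: revsci/dagster-user-code:latest
    command: ["dagster", "api", "execute_run"]
    deploy:
      restart_policy:
        condition: none        # a run is one-shot; the launcher resubmits failures
      resources:
        limits:
          memory: 4G           # an OOM kills this run only, not the daemon
```

The key property is the memory limit: the 27.9M-row fact_invoice_item build can be killed by the kernel without touching scheduling or sensor evaluation.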

No IO managers

Storage paths, partition layouts, and idempotent-write logic are embedded in asset code. See IO Managers for the full analysis.

Impact: Storage migration (ADLS → NFS, or any other change) requires editing every asset. Partition logic is duplicated and inconsistent across assets.

Fix: Adopt PolarsParquetIOManager (or a DuckDB-aware equivalent). Externalise all storage concerns.

Watchdog reliability

The operational watchdog — which detects zombie runs, stale data, and infrastructure failures — runs as a /loop on the data engineer's laptop. If the session closes, monitoring stops.

Impact: Zombie runs eat coordinator slots undetected. Stale data goes unnoticed until an analyst complains. The briefing acknowledges this as Priority 1 on the roadmap.

Fix: Move the watchdog to a scheduled Dagster job (runs inside the daemon, always on) or a cloud-scheduled function (Azure Function on a timer trigger). The watchdog's GraphQL queries and corrective actions are portable — the logic doesn't depend on running in a /loop.
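The portable core of zombie detection is a pure function over run records, so it can be hosted anywhere. A sketch under assumed field names (run_id, status, last_heartbeat; the real GraphQL payload will differ):

```python
from datetime import datetime, timedelta

def find_zombie_runs(runs: list, now: datetime,
                     heartbeat_timeout: timedelta = timedelta(minutes=30)) -> list:
    """Return run IDs that claim to be STARTED but have not heartbeated
    within the timeout; these are the runs the watchdog would terminate."""
    return [
        r["run_id"]
        for r in runs
        if r["status"] == "STARTED"
        and now - r["last_heartbeat"] > heartbeat_timeout
    ]
```

Whether this runs in a scheduled Dagster job or an Azure Function, the input is the same run list and the output is the same termination set.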

No PII classification or masking

The briefing flags this: "sensitive columns not tagged. Would need to happen before we widen lake access beyond internal analysts."

Impact: Widening ClickHouse access (Strategy A's scratch database, or any analyst self-service) exposes PII in bronze/silver data without controls.

Fix: Rev-Sci already has PII masking built into the report factory pipeline (masking assets hash sensitive columns). This pattern should extend to the datalake's silver layer. Bronze should remain unmasked (it's the raw source mirror) but access-controlled.
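The masking pattern, sketched over plain dicts (the function name and salt handling are illustrative, not Rev-Sci's actual implementation; the salt must come from secret configuration, never source control):

```python
import hashlib

def mask_columns(rows: list, pii_columns: set, salt: str) -> list:
    """Replace PII column values with a salted SHA-256 digest. Equal inputs
    map to equal digests, so joins on masked columns still work."""
    def mask(value) -> str:
        return hashlib.sha256(f"{salt}{value}".encode()).hexdigest()
    return [
        {k: mask(v) if k in pii_columns and v is not None else v
         for k, v in row.items()}
        for row in rows
    ]
```

Deterministic hashing preserves referential integrity across tables, which is what makes it usable at the silver layer rather than only at report time.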

Filesystem partitioning not Dagster-managed

Bronze partitions use ingestion_date=YYYY-MM-DD/ directory conventions. Dagster doesn't know about these partitions — there's no PartitionsDefinition, no backfill UI, no per-partition status in the asset graph.

Impact: Backfilling a specific date range requires manual file operations or custom scripts, not Dagster's built-in backfill mechanism.

Fix: Use DailyPartitionsDefinition with PolarsParquetIOManager. The IO manager maps Dagster partition keys to file paths. Dagster tracks partition status and provides backfill UI.


Open questions for the merged architecture

1. Who owns the bronze layer?

If both Rev-Sci and the datalake produce bronze-layer data (Rev-Sci from V3/billing, datalake from 35 sources), do we end up with two bronze layers? Or does the datalake's bronze become the single source and Rev-Sci reads from it?

Recommendation: The datalake's bronze should be the canonical raw layer for all shared sources. Rev-Sci assets that currently extract from V3 directly should declare cross-code-location dependencies on datalake bronze instead. This removes source-system coupling from downstream code locations.

For sources that are Rev-Sci-specific (e.g., Vanguard API, RevSci API), those extractions stay in the vanguard_wholesale code location.

2. How do cross-code-location dependencies work?

Dagster supports cross-code-location asset dependencies via AssetKey references, but the IO manager must know where the upstream asset's data lives. If vanguard_wholesale depends on datalake/bronze_v3_customer, the IO manager in vanguard_wholesale needs to resolve the path /opt/dagster/shared-data/datalake/bronze_v3_customer/.

Options:
- Configure PolarsParquetIOManager with a base_dir that covers the whole shared-data tree (breaks the per-location namespace)
- Use a SourceAsset with an explicit path override
- Use ClickHouse as the cross-location read layer (query datalake.bronze_v3_customer from a vanguard_wholesale asset via SQL)

This needs a design decision before implementation.
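For scale, the explicit-path-override option amounts to a small resolver from AssetKey path components to a directory under the shared-data tree. A sketch (the function name is hypothetical; the root path follows the example above):

```python
from pathlib import Path

SHARED_DATA_ROOT = Path("/opt/dagster/shared-data")

def resolve_cross_location_path(asset_key_path: list) -> Path:
    """Map an AssetKey's path components, e.g. ["datalake",
    "bronze_v3_customer"], to its directory under the shared-data tree."""
    return SHARED_DATA_ROOT.joinpath(*asset_key_path)
```

The trade-off versus the ClickHouse option is coupling: a path resolver binds consumers to the filesystem layout, while SQL reads bind them to table names only.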

3. How do we handle the datalake's watermark state?

The datalake persists watermarks as JSON files in ADLS (config/reconciliation/{source}/{table}.json). In the merged stack, these would live on NFS. But Dagster has built-in cursor persistence for sensors (via context.cursor). Should watermarks be sensor cursors, asset metadata, or standalone files?

Recommendation: Use Dagster sensor cursors for extraction watermarks. This keeps watermark state in Dagster's metadata DB (MySQL), co-located with run history and event logs. The reconciliation JSON files can remain as a human-readable audit trail.
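The cursor round-trip can be sketched as pure string logic (function and key names are assumptions; in a real sensor the string comes from context.cursor and goes back via context.update_cursor):

```python
import json

def advance_watermark(cursor: str, new_max_updated_at: str) -> str:
    """Merge the latest high-water mark into the sensor's JSON cursor,
    never moving it backwards."""
    state = json.loads(cursor) if cursor else {}
    current = state.get("watermark")
    if current is None or new_max_updated_at > current:
        state["watermark"] = new_max_updated_at
    return json.dumps(state)
```

The monotonic guard matters: a late-arriving batch with older timestamps must not rewind the watermark and trigger a re-extraction loop.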

4. Metadata DB: MySQL or PostgreSQL?

Rev-Sci uses MySQL for Dagster metadata. The datalake uses PostgreSQL. The merged stack needs one.

Recommendation: Keep MySQL. It's already deployed, configured, and backed up in the Swarm stack. Dagster's dagster-mysql support is production-grade. PostgreSQL offers no compelling advantage for Dagster metadata at this scale.
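For reference, the unified instance would point its dagster.yaml at the existing MySQL. A sketch with placeholder hostname and credentials (verify the exact schema against the dagster-mysql docs for the deployed Dagster version):

```yaml
# dagster.yaml: metadata storage on the existing MySQL instance.
# Hostname and credentials are placeholders.
storage:
  mysql:
    mysql_db:
      username: dagster
      password:
        env: DAGSTER_MYSQL_PASSWORD
      hostname: mysql.swarm.internal
      db_name: dagster
      port: 3306
```

This one block covers run storage, event logs, and schedule state, which is what makes the single-DB decision low-risk to execute.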

5. What happens to the datalake's existing ADLS data?

If we consolidate on NFS, the existing 330 GB of bronze Parquet on ADLS needs to migrate. Options:
- azcopy one-time copy to NFS
- Re-extract from source (bronze is replayable, but backfills for billing CDRs and Zendesk are multi-month efforts)
- Keep ADLS as a read-only archive and start fresh on NFS

Recommendation: One-time azcopy to NFS, then decommission ADLS writes. Re-extraction is wasteful for data that's already correct.
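The migration itself is a single command plus a verification step. A sketch with placeholder account, container, and target path (authentication via SAS token or azcopy login is assumed):

```shell
# One-time copy of the bronze tree from ADLS to the NFS mount.
# <storage-account>, <container>, <sas-token>, and the target path are placeholders.
azcopy copy \
  "https://<storage-account>.blob.core.windows.net/<container>/bronze/*?<sas-token>" \
  "/mnt/nfs/shared-data/datalake/bronze/" \
  --recursive

# Sanity-check total size against ADLS before cutting over writes.
du -sh /mnt/nfs/shared-data/datalake/bronze/
```

Leave ADLS writes enabled until counts and sizes reconcile, then flip the extraction targets and decommission.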