Gregg's Datalake Briefing — Restructured Summary

This is a condensed reference of the datalake technical briefing (2026-04-26, 14 pages), restructured for comparison with the Rev-Sci stack. Page/section references point back to the original PDF.


Architecture at a glance

| Layer | Choice | Rationale |
| --- | --- | --- |
| Storage | ADLS Gen2 + Parquet (Snappy) | Cheap, columnar, open format. No Delta/Iceberg; every bronze write is an idempotent partition overwrite. |
| Orchestration | Dagster on Docker Swarm (single on-prem node) | Asset-first model. Lineage, typed partitions, retries. |
| Transform | DuckDB in-process | Columnar SQL, reads Parquet natively, zero infra. Largest silver input 27M rows → seconds. |
| Compute | Docker Swarm, single node | Simple, owned. No HA; K8s/Container Apps on roadmap. |
| Extraction | Polling with watermarks | MAX(updated_at) per table, persisted as JSON in ADLS. |
| Catalogue | None yet | Analysts query ADLS paths directly. Unity Catalog / Synapse Serverless under evaluation. |
| Metadata DB | PostgreSQL | Dagster run storage, event logs, schedule state. |
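The extraction row above can be sketched as follows. This is a minimal illustration of the watermark pattern, assuming a per-table `MAX(updated_at)` stored as JSON; the function names are hypothetical, and in production the JSON payload lives in ADLS rather than in memory.

```python
import json

def build_extraction_query(table: str, watermarks: dict) -> str:
    """Full snapshot if no watermark exists; otherwise incremental from the watermark."""
    wm = watermarks.get(table)
    if wm is None:
        return f"SELECT * FROM {table}"
    return f"SELECT * FROM {table} WHERE updated_at > '{wm}'"

def advance_watermark(watermarks: dict, table: str, rows: list) -> str:
    """After a successful load, record the new MAX(updated_at); return the JSON payload
    that would be persisted to ADLS."""
    if rows:
        watermarks[table] = max(r["updated_at"] for r in rows)
    return json.dumps(watermarks)
```

The watermark advances only after a successful load, so a failed run simply re-reads the same window on retry.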

Medallion architecture

Bronze (s.2.1)

  • One Parquet dataset per source table. Raw, preserving legacy column names.
  • Minimal mutation: 0000-00-00 → NULL, type coercion for Parquet compatibility.
  • Partition layout: bronze/{source}/{table}/ingestion_date=YYYY-MM-DD/{table}_{run_id}.parquet
  • Immutable once written. Historical partitions never rewritten.
  • Idempotent writes: clear_today_partition at start of each full-snapshot, with except_run_id to protect retries.
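The idempotent-write bullet can be sketched as below. This is an illustrative local-filesystem stand-in for `clear_today_partition` (the real implementation targets ADLS); it deletes today's partition files except those written by the current run, so a retry never destroys its own output.

```python
from pathlib import Path

def clear_today_partition(root: Path, source: str, table: str,
                          ingestion_date: str, except_run_id: str) -> list[str]:
    """Remove stale files from today's bronze partition; return the deleted names.
    Layout: {root}/{source}/{table}/ingestion_date=YYYY-MM-DD/{table}_{run_id}.parquet"""
    partition = root / source / table / f"ingestion_date={ingestion_date}"
    deleted = []
    if not partition.exists():
        return deleted
    for f in partition.glob("*.parquet"):
        # Protect files from the current run so retries are safe.
        if except_run_id not in f.name:
            f.unlink()
            deleted.append(f.name)
    return deleted
```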

Silver (s.2.2)

  • Full daily refresh from bronze (not incremental). DuckDB handles the volume.
  • Cleaned types: DECIMAL(18,4) for money, DATE for dates, BOOLEAN for flags.
  • Foreign-key integers joined to their lookup tables to expose human-readable labels.
  • Denormalising joins (e.g., fact_invoice = invoice + invoice_status + invoice_data).
  • Single latest.parquet per silver table.
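The silver bullets above can be combined into a single denormalising DuckDB statement. This is a hypothetical sketch following the fact_invoice example: the join key (`status_id`) and column names are assumptions for illustration, not from the briefing.

```python
def fact_invoice_sql(bronze_root: str, silver_root: str) -> str:
    """Build the DuckDB statement that rewrites silver_fact_invoice/latest.parquet
    as a full refresh from bronze Parquet."""
    return f"""
COPY (
    SELECT
        i.id                           AS invoice_id,
        CAST(i.total AS DECIMAL(18,4)) AS total_amount,  -- money as DECIMAL(18,4)
        CAST(i.invoice_date AS DATE)   AS invoice_date,  -- dates as DATE
        s.label                        AS status_label   -- FK joined to a label
    FROM read_parquet('{bronze_root}/v3/invoice/*/*.parquet') i
    LEFT JOIN read_parquet('{bronze_root}/v3/invoice_status/*/*.parquet') s
        ON i.status_id = s.id
) TO '{silver_root}/silver_fact_invoice/latest.parquet' (FORMAT PARQUET)
"""
```

Because the output is a single `latest.parquet`, each nightly run fully replaces the previous silver table.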

Gold (s.2.3, roadmap)

  • Pre-aggregated marts on silver. Date-partitioned for time-series.
  • Revenue / churn / ARPU / cohort marts. Blocked on analyst requirements.

Replayability (s.2.4)

Because bronze is immutable and silver is a pure function of bronze, silver can be rebuilt at any time without re-extracting from the source systems. The source of truth shifts from the operational DBs to bronze.
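A toy illustration of the property: a silver transform that is a pure function of its bronze input is deterministic, so replaying it over the same bronze partition yields identical output. The `0000-00-00 → NULL` rule is from the bronze section; everything else here is illustrative.

```python
def silver_transform(bronze_rows: list[dict]) -> list[dict]:
    """Pure, deterministic transform: map sentinel dates to None, sort by id.
    No side effects, no clock, no randomness -- rebuilds are exact replays."""
    cleaned = [
        {**row, "date": None if row["date"] == "0000-00-00" else row["date"]}
        for row in bronze_rows
    ]
    return sorted(cleaned, key=lambda row: row["id"])
```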


Data sources (s.4)

  • 35 sources across V3 MySQL (21 domains, ~50 tables), BaRE, billing CDRs (75 GB), APN (180 GB, 4.7B rows), Zendesk (23 GB), HubSpot, Trustpilot, Google Ads, Aircall, Twilio, ElevenLabs, cybersecurity, profitability.
  • Explicitly excluded: v4_warehouse (rebuilt in silver), V4 API (double-sourcing), database views (reconstructed in silver).
  • Total volume: ~330 GB, growing 1–2 GB/day.

Nightly schedule (s.5)

  • 02:00–03:00: Lightweight V3 dimensions
  • 03:00–04:00: Core business tables (order, sales, BaRE, billing)
  • 05:00–07:00: Silver dimensions then facts
  • 08:00–09:00: Housekeeping + validation
  • Long-runners: APN (4.7B rows daily chunk backfill), billing CDRs (11-month backfill), Zendesk (3.5-year backfill)

Silver model (s.6)

Facts (all daily full refresh)

| Fact | Grain | Rows |
| --- | --- | --- |
| silver_fact_invoice | 1 row per invoice header | 5.1M |
| silver_fact_invoice_item | 1 row per invoice line item | 27.9M |
| silver_fact_invoice_service | 1 row per invoice × service | 17.4M |

Dimensions

Built: dim_customer, dim_service, dim_partner (rebuilt 05:00 UTC). In-flight: dim_charge_type, dim_usage_type, dim_usage_type_group, dim_usage_unit, dim_module_type (~30–800 rows each).


Data correctness (s.7)

  • Reconciliation module: COUNT(*) from MySQL (not information_schema.tables.table_rows — that's an InnoDB approximation).
  • Tolerances: 1% row count for full snapshots, 5-min watermark for incrementals.
  • Results persisted as JSON in ADLS.
  • Evidence: 7/8 v3_customer tables at 0.00% drift.
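The reconciliation check can be sketched as below: compare an exact `COUNT(*)` from MySQL against the lake-side row count, within the 1% full-snapshot tolerance from s.7. The function name and record shape are assumptions; the output dict mirrors the JSON persisted to ADLS.

```python
def reconcile(table: str, source_count: int, lake_count: int,
              tolerance: float = 0.01) -> dict:
    """Compare source vs lake row counts; return a record suitable for JSON persistence."""
    drift = abs(source_count - lake_count) / source_count if source_count else 0.0
    return {
        "table": table,
        "source_count": source_count,
        "lake_count": lake_count,
        "drift_pct": round(drift * 100, 2),
        "within_tolerance": drift <= tolerance,
    }
```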

Operational watchdog (s.8)

  • Runs as /loop every 15 minutes.
  • Scans ADLS for adls_last_modified per source.
  • Queries Dagster GraphQL for run status.
  • Classifies: healthy, running_ok, zombie, hard_failure, stale_no_run, backfilling.
  • Bounded action: kill zombies, relaunch transient failures, escalate fatals.
  • Safeguards: max 3 relaunches/cycle, 60-min cooldown/source, 3-strike escalation.
  • Root-cause categories: transient_api, auth_expired, oom, bad_query.
  • Pushover heartbeat per cycle.
  • Runs on the data engineer's laptop. Silent-failure window if session closes.

Analyst access (s.9)

| Pattern | Stack |
| --- | --- |
| Ad-hoc SQL | DuckDB with Azure connection string |
| BI | Power BI ADLS Gen2 connector |
| Notebooks | Python / PyArrow with AzureFileSystem |
| Operational | Dagster UI (asset lineage, run status) |

Missing: No JDBC/REST endpoint for tools that expect a warehouse (Tableau, Looker, Mode).
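The ad-hoc DuckDB row can be made concrete with the statements an analyst would run, using DuckDB's azure extension to read silver Parquet straight from ADLS. The container name and path are illustrative, and the connection string is left as a placeholder.

```python
def adhoc_duckdb_statements(container: str, table: str) -> list[str]:
    """Statements for an ad-hoc DuckDB session against silver Parquet in ADLS."""
    return [
        "INSTALL azure; LOAD azure;",
        "SET azure_storage_connection_string = '<connection-string>';",
        f"SELECT * FROM read_parquet('azure://{container}/silver/{table}/latest.parquet') LIMIT 10;",
    ]
```

This works for anything that can host DuckDB; the gap noted above is tools that expect a server-side JDBC/REST endpoint instead.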


Cost model (s.11)

| Item | Monthly |
| --- | --- |
| ADLS storage (330 GB hot) | £5 |
| Compute | £0 (existing on-prem) |
| Egress | £0 (in-region consumption) |
| Total | ~£5/month |

Comparison: Fivetran + Snowflake for this footprint = low-to-mid five figures/year.


Known weaknesses (s.12)

  1. Single-node compute, no HA
  2. Watchdog runs on laptop
  3. No data catalogue
  4. No data-quality framework beyond reconciliation
  5. Implicit lineage only (Dagster UI)
  6. Recurring coordinator-zombie pattern
  7. No JDBC endpoint
  8. No PII classification / masking

Roadmap (s.13)

| Priority | Item | Status |
| --- | --- | --- |
| 1 | Billing dimensions (5 tables) | Ready to build |
| 1 | Move watchdog off laptop | Planned |
| 2 | End-to-end silver join: customer → service → invoice | Design phase |
| 2 | Zendesk historical backfill | In progress |
| 2 | billing_cdrs/cdr_bare backfill | In progress |
| 3 | Gold layer (revenue / churn / ARPU) | Blocked on requirements |
| 3 | Data catalogue / query endpoint | Evaluating options |
| 3 | Coordinator-zombie root cause | Investigation open |
| 3 | Silver Type-2 dimensions | Deferred |