Gregg's Datalake Briefing — Restructured Summary

This is a condensed reference of the datalake technical briefing (2026-04-26, 14 pages), restructured for comparison with the Rev-Sci stack. Page/section references point back to the original PDF.


Architecture at a glance

| Layer | Choice | Rationale |
| --- | --- | --- |
| Storage | ADLS Gen2 + Parquet (Snappy) | Cheap, columnar, open format. No Delta/Iceberg; every bronze write is an idempotent partition overwrite. |
| Orchestration | Dagster on Docker Swarm (single on-prem node) | Asset-first model. Lineage, typed partitions, retries. |
| Transform | DuckDB in-process | Columnar SQL, reads Parquet natively, zero infra. Largest silver input 27M rows → seconds. |
| Compute | Docker Swarm, single node | Simple, owned. No HA; K8s/Container Apps on roadmap. |
| Extraction | Polling with watermarks | MAX(updated_at) per table, persisted as JSON in ADLS. |
| Catalogue | None yet | Analysts query ADLS paths directly. Unity Catalog / Synapse Serverless under evaluation. |
| Metadata DB | PostgreSQL | Dagster run storage, event logs, schedule state. |
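The extraction row above can be sketched as follows. This is a minimal illustration of the watermark pattern, assuming a per-table `MAX(updated_at)` stored as JSON; the function names are hypothetical, and in production the JSON payload lives in ADLS rather than in memory.

```python
import json

def build_extraction_query(table: str, watermarks: dict) -> str:
    """Full snapshot if no watermark exists; otherwise incremental from the watermark."""
    wm = watermarks.get(table)
    if wm is None:
        return f"SELECT * FROM {table}"
    return f"SELECT * FROM {table} WHERE updated_at > '{wm}'"

def advance_watermark(watermarks: dict, table: str, rows: list) -> str:
    """After a successful load, record the new MAX(updated_at); return the JSON payload
    that would be persisted to ADLS."""
    if rows:
        watermarks[table] = max(r["updated_at"] for r in rows)
    return json.dumps(watermarks)
```

The watermark advances only after a successful load, so a failed run simply re-reads the same window on retry.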

Medallion architecture

Bronze (s.2.1)

  • One Parquet dataset per source table. Raw, preserving legacy column names.
  • Minimal mutation: 0000-00-00 → NULL, type coercion for Parquet compatibility.
  • Partition layout: bronze/{source}/{table}/ingestion_date=YYYY-MM-DD/{table}_{run_id}.parquet
  • Immutable once written. Historical partitions never rewritten.
  • Idempotent writes: clear_today_partition at start of each full-snapshot, with except_run_id to protect retries.
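The idempotent-write bullet can be sketched as below. This is an illustrative local-filesystem stand-in for `clear_today_partition` (the real implementation targets ADLS); it deletes today's partition files except those written by the current run, so a retry never destroys its own output.

```python
from pathlib import Path

def clear_today_partition(root: Path, source: str, table: str,
                          ingestion_date: str, except_run_id: str) -> list[str]:
    """Remove stale files from today's bronze partition; return the deleted names.
    Layout: {root}/{source}/{table}/ingestion_date=YYYY-MM-DD/{table}_{run_id}.parquet"""
    partition = root / source / table / f"ingestion_date={ingestion_date}"
    deleted = []
    if not partition.exists():
        return deleted
    for f in partition.glob("*.parquet"):
        # Protect files from the current run so retries are safe.
        if except_run_id not in f.name:
            f.unlink()
            deleted.append(f.name)
    return deleted
```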

Silver (s.2.2)

  • Full daily refresh from bronze (not incremental). DuckDB handles the volume.
  • Cleaned types: DECIMAL(18,4) for money, DATE for dates, BOOLEAN for flags.
  • Foreign-key integers joined to their lookup tables to expose human-readable labels.
  • Denormalising joins (e.g., fact_invoice = invoice + invoice_status + invoice_data).
  • Single latest.parquet per silver table.
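The silver bullets above can be combined into a single denormalising DuckDB statement. This is a hypothetical sketch following the fact_invoice example: the join key (`status_id`) and column names are assumptions for illustration, not from the briefing.

```python
def fact_invoice_sql(bronze_root: str, silver_root: str) -> str:
    """Build the DuckDB statement that rewrites silver_fact_invoice/latest.parquet
    as a full refresh from bronze Parquet."""
    return f"""
COPY (
    SELECT
        i.id                           AS invoice_id,
        CAST(i.total AS DECIMAL(18,4)) AS total_amount,  -- money as DECIMAL(18,4)
        CAST(i.invoice_date AS DATE)   AS invoice_date,  -- dates as DATE
        s.label                        AS status_label   -- FK joined to a label
    FROM read_parquet('{bronze_root}/v3/invoice/*/*.parquet') i
    LEFT JOIN read_parquet('{bronze_root}/v3/invoice_status/*/*.parquet') s
        ON i.status_id = s.id
) TO '{silver_root}/silver_fact_invoice/latest.parquet' (FORMAT PARQUET)
"""
```

Because the output is a single `latest.parquet`, each nightly run fully replaces the previous silver table.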

Gold (s.2.3, roadmap)

  • Pre-aggregated marts on silver. Date-partitioned for time-series.
  • Revenue / churn / ARPU / cohort marts. Blocked on analyst requirements.

Replayability (s.2.4)

Because bronze is immutable and silver is a pure function of bronze, silver can be rebuilt at any time without re-extracting from the source systems. The source of truth shifts from the operational DBs to bronze.
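A toy illustration of the property: a silver transform that is a pure function of its bronze input is deterministic, so replaying it over the same bronze partition yields identical output. The `0000-00-00 → NULL` rule is from the bronze section; everything else here is illustrative.

```python
def silver_transform(bronze_rows: list[dict]) -> list[dict]:
    """Pure, deterministic transform: map sentinel dates to None, sort by id.
    No side effects, no clock, no randomness -- rebuilds are exact replays."""
    cleaned = [
        {**row, "date": None if row["date"] == "0000-00-00" else row["date"]}
        for row in bronze_rows
    ]
    return sorted(cleaned, key=lambda row: row["id"])
```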


Data sources (s.4)

  • 35 sources across V3 MySQL (21 domains, ~50 tables), BaRE, billing CDRs (75 GB), APN (180 GB, 4.7B rows), Zendesk (23 GB), HubSpot, Trustpilot, Google Ads, Aircall, Twilio, ElevenLabs, cybersecurity, profitability.
  • Explicitly excluded: v4_warehouse (rebuilt in silver), V4 API (double-sourcing), database views (reconstructed in silver).
  • Total volume: ~330 GB, growing 1–2 GB/day.

Nightly schedule (s.5)

  • 02:00–03:00: Lightweight V3 dimensions
  • 03:00–04:00: Core business tables (order, sales, BaRE, billing)
  • 05:00–07:00: Silver dimensions then facts
  • 08:00–09:00: Housekeeping + validation
  • Long-runners: APN (4.7B rows daily chunk backfill), billing CDRs (11-month backfill), Zendesk (3.5-year backfill)

Silver model (s.6)

Facts (all daily full refresh)

| Fact | Grain | Rows |
| --- | --- | --- |
| silver_fact_invoice | 1 row per invoice header | 5.1M |
| silver_fact_invoice_item | 1 row per invoice line item | 27.9M |
| silver_fact_invoice_service | 1 row per invoice × service | 17.4M |

Dimensions

Built: dim_customer, dim_service, dim_partner (rebuilt 05:00 UTC). In-flight: dim_charge_type, dim_usage_type, dim_usage_type_group, dim_usage_unit, dim_module_type (~30–800 rows each).


Data correctness (s.7)

  • Reconciliation module: COUNT(*) from MySQL (not information_schema.tables.table_rows — that's an InnoDB approximation).
  • Tolerances: 1% row count for full snapshots, 5-min watermark for incrementals.
  • Results persisted as JSON in ADLS.
  • Evidence: 7/8 v3_customer tables at 0.00% drift.
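The reconciliation check can be sketched as below: compare an exact `COUNT(*)` from MySQL against the lake-side row count, within the 1% full-snapshot tolerance from s.7. The function name and record shape are assumptions; the output dict mirrors the JSON persisted to ADLS.

```python
def reconcile(table: str, source_count: int, lake_count: int,
              tolerance: float = 0.01) -> dict:
    """Compare source vs lake row counts; return a record suitable for JSON persistence."""
    drift = abs(source_count - lake_count) / source_count if source_count else 0.0
    return {
        "table": table,
        "source_count": source_count,
        "lake_count": lake_count,
        "drift_pct": round(drift * 100, 2),
        "within_tolerance": drift <= tolerance,
    }
```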

Operational watchdog (s.8)

  • Runs as /loop every 15 minutes.
  • Scans ADLS for adls_last_modified per source.
  • Queries Dagster GraphQL for run status.
  • Classifies: healthy, running_ok, zombie, hard_failure, stale_no_run, backfilling.
  • Bounded action: kill zombies, relaunch transient failures, escalate fatals.
  • Safeguards: max 3 relaunches/cycle, 60-min cooldown/source, 3-strike escalation.
  • Root-cause categories: transient_api, auth_expired, oom, bad_query.
  • Pushover heartbeat per cycle.
  • Runs on the data engineer's laptop. Silent-failure window if session closes.

Analyst access (s.9)

| Pattern | Stack |
| --- | --- |
| Ad-hoc SQL | DuckDB with Azure connection string |
| BI | Power BI ADLS Gen2 connector |
| Notebooks | Python / PyArrow with AzureFileSystem |
| Operational | Dagster UI (asset lineage, run status) |

Missing: No JDBC/REST endpoint for tools that expect a warehouse (Tableau, Looker, Mode).
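The ad-hoc DuckDB row can be made concrete with the statements an analyst would run, using DuckDB's azure extension to read silver Parquet straight from ADLS. The container name and path are illustrative, and the connection string is left as a placeholder.

```python
def adhoc_duckdb_statements(container: str, table: str) -> list[str]:
    """Statements for an ad-hoc DuckDB session against silver Parquet in ADLS."""
    return [
        "INSTALL azure; LOAD azure;",
        "SET azure_storage_connection_string = '<connection-string>';",
        f"SELECT * FROM read_parquet('azure://{container}/silver/{table}/latest.parquet') LIMIT 10;",
    ]
```

This works for anything that can host DuckDB; the gap noted above is tools that expect a server-side JDBC/REST endpoint instead.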


Cost model (s.11)

| Item | Monthly |
| --- | --- |
| ADLS storage (330 GB hot) | £5 |
| Compute | £0 (existing on-prem) |
| Egress | £0 (in-region consumption) |
| Total | ~£5/month |

Comparison: Fivetran + Snowflake for this footprint = low-to-mid five figures/year.


Known weaknesses (s.12)

  1. Single-node compute, no HA
  2. Watchdog runs on laptop
  3. No data catalogue
  4. No data-quality framework beyond reconciliation
  5. Implicit lineage only (Dagster UI)
  6. Recurring coordinator-zombie pattern
  7. No JDBC endpoint
  8. No PII classification / masking

Roadmap (s.13)

| Priority | Item | Status |
| --- | --- | --- |
| 1 | Billing dimensions (5 tables) | Ready to build |
| 1 | Move watchdog off laptop | Planned |
| 2 | End-to-end silver join: customer → service → invoice | Design phase |
| 2 | Zendesk historical backfill | In progress |
| 2 | billing_cdrs/cdr_bare backfill | In progress |
| 3 | Gold layer (revenue / churn / ARPU) | Blocked on requirements |
| 3 | Data catalogue / query endpoint | Evaluating options |
| 3 | Coordinator-zombie root cause | Investigation open |
| 3 | Silver Type-2 dimensions | Deferred |