# Gregg's Datalake Briefing — Restructured Summary
This is a condensed reference of the datalake technical briefing (2026-04-26, 14 pages), restructured for comparison with the Rev-Sci stack. Page/section references point back to the original PDF.
## Architecture at a glance
| Layer | Choice | Rationale |
|---|---|---|
| Storage | ADLS Gen2 + Parquet (Snappy) | Cheap, columnar, open format. No Delta/Iceberg — every bronze write is an idempotent partition overwrite. |
| Orchestration | Dagster on Docker Swarm (single on-prem node) | Asset-first model. Lineage, typed partitions, retries. |
| Transform | DuckDB in-process | Columnar SQL, reads Parquet natively, zero infra. Largest silver input 27M rows → seconds. |
| Compute | Docker Swarm, single node | Simple, owned. No HA — K8s/Container Apps on roadmap. |
| Extraction | Polling with watermarks | MAX(updated_at) per table, persisted as JSON in ADLS. |
| Catalogue | None yet | Analysts query ADLS paths directly. Unity Catalog / Synapse Serverless under evaluation. |
| Metadata DB | PostgreSQL | Dagster run storage, event logs, schedule state. |
## Medallion architecture
### Bronze (s.2.1)
- One Parquet dataset per source table. Raw, preserving legacy column names.
- Minimal mutation: `0000-00-00` → NULL, type coercion for Parquet compatibility.
- Partition layout: `bronze/{source}/{table}/ingestion_date=YYYY-MM-DD/{table}_{run_id}.parquet`
- Immutable once written. Historical partitions never rewritten.
- Idempotent writes: `clear_today_partition` at the start of each full snapshot, with `except_run_id` to protect retries.
### Silver (s.2.2)
- Full daily refresh from bronze (not incremental). DuckDB handles the volume.
- Cleaned types: `DECIMAL(18,4)` for money, `DATE` for dates, `BOOLEAN` for flags.
- FK integers joined to expose human-readable labels.
- Denormalising joins (e.g., `fact_invoice = invoice + invoice_status + invoice_data`).
- Single `latest.parquet` per silver table.
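The `fact_invoice` denormalising join can be sketched as DuckDB SQL generated over the bronze Parquet paths. This is illustrative only: the column names, FK names, and output path are assumptions, not the briefing's actual schema.

```python
def build_fact_invoice_sql(bronze_root: str) -> str:
    # Denormalising join sketch: invoice header + status label + extra data,
    # read straight from bronze Parquet via DuckDB's read_parquet, written
    # as a single latest.parquet (full daily refresh, not incremental).
    return f"""
    COPY (
        SELECT
            i.id                            AS invoice_id,
            CAST(i.amount AS DECIMAL(18,4)) AS amount,          -- cleaned money type
            CAST(i.invoice_date AS DATE)    AS invoice_date,    -- cleaned date type
            s.label                         AS invoice_status,  -- FK int -> label
            d.payload                       AS invoice_data
        FROM read_parquet('{bronze_root}/v3/invoice/*/*.parquet') i
        LEFT JOIN read_parquet('{bronze_root}/v3/invoice_status/*/*.parquet') s
               ON i.status_id = s.id
        LEFT JOIN read_parquet('{bronze_root}/v3/invoice_data/*/*.parquet') d
               ON i.id = d.invoice_id
    ) TO 'silver/fact_invoice/latest.parquet' (FORMAT PARQUET)
    """
```

Running the returned statement in an in-process DuckDB connection is the whole transform step: no cluster, no staging tables, one output file.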
### Gold (s.2.3, roadmap)
- Pre-aggregated marts on silver. Date-partitioned for time-series.
- Revenue / churn / ARPU / cohort marts. Blocked on analyst requirements.
### Replayability (s.2.4)
Bronze is immutable and silver is a pure function of bronze, so silver can be rebuilt at any time without re-extracting from source systems. The source of truth shifts from the operational DBs to bronze.
## Data sources (s.4)
- 35 sources across V3 MySQL (21 domains, ~50 tables), BaRE, billing CDRs (75 GB), APN (180 GB, 4.7B rows), Zendesk (23 GB), HubSpot, Trustpilot, Google Ads, Aircall, Twilio, ElevenLabs, cybersecurity, profitability.
- Explicitly excluded: v4_warehouse (rebuilt in silver), V4 API (double-sourcing), database views (reconstructed in silver).
- Total volume: ~330 GB, growing 1–2 GB/day.
## Nightly schedule (s.5)
- 02:00–03:00: Lightweight V3 dimensions
- 03:00–04:00: Core business tables (order, sales, BaRE, billing)
- 05:00–07:00: Silver dimensions then facts
- 08:00–09:00: Housekeeping + validation
- Long-runners: APN (4.7B rows daily chunk backfill), billing CDRs (11-month backfill), Zendesk (3.5-year backfill)
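The long-runners above land one bounded window per run rather than attempting the backfill in a single pass. A minimal sketch of the date-chunking, with an illustrative chunk size (the briefing does not specify one):

```python
from datetime import date, timedelta


def backfill_chunks(start: date, end: date, days_per_chunk: int = 1):
    # Yield (chunk_start, chunk_end) windows so a multi-billion-row backfill
    # (e.g. the 4.7B-row APN table) is ingested one bounded partition at a
    # time, and a failed run only has to retry its own window.
    cur = start
    while cur <= end:
        nxt = min(cur + timedelta(days=days_per_chunk - 1), end)
        yield cur, nxt
        cur = nxt + timedelta(days=1)
```

Each yielded window maps naturally onto one `ingestion_date` partition, so a crashed backfill resumes from the last completed chunk.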
## Silver model (s.6)
### Facts (all daily full refresh)
| Fact | Grain | Rows |
|---|---|---|
| `silver_fact_invoice` | 1 row per invoice header | 5.1M |
| `silver_fact_invoice_item` | 1 row per invoice line item | 27.9M |
| `silver_fact_invoice_service` | 1 row per invoice × service | 17.4M |
### Dimensions
Built: `dim_customer`, `dim_service`, `dim_partner` (rebuilt 05:00 UTC).
In-flight: `dim_charge_type`, `dim_usage_type`, `dim_usage_type_group`, `dim_usage_unit`, `dim_module_type` (~30–800 rows each).
## Data correctness (s.7)
- Reconciliation module: exact `COUNT(*)` from MySQL (not `information_schema.tables.table_rows`, which is an InnoDB approximation).
- Tolerances: 1% row count for full snapshots, 5-min watermark for incrementals.
- Results persisted as JSON in ADLS.
- Evidence: 7/8 `v3_customer` tables at 0.00% drift.
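The tolerance check itself is simple. A sketch, assuming the reconciliation module reduces to a source-vs-bronze row-count comparison (the result-dict field names are illustrative):

```python
def reconcile_counts(source_count: int, bronze_count: int,
                     tolerance: float = 0.01) -> dict:
    # Compare an exact COUNT(*) from MySQL against the bronze partition's
    # row count. information_schema.tables.table_rows is deliberately not
    # used as the source figure: for InnoDB it is an estimate.
    drift = abs(source_count - bronze_count) / source_count if source_count else 0.0
    return {
        "source_rows": source_count,
        "bronze_rows": bronze_count,
        "drift_pct": round(drift * 100, 2),
        "within_tolerance": drift <= tolerance,  # 1% default for full snapshots
    }
```

The resulting dict is what would be persisted as JSON in ADLS per table per run.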
## Operational watchdog (s.8)
- Runs as `/loop` every 15 minutes.
- Scans ADLS for `adls_last_modified` per source.
- Queries Dagster GraphQL for run status.
- Classifies: `healthy`, `running_ok`, `zombie`, `hard_failure`, `stale_no_run`, `backfilling`.
- Bounded action: kill zombies, relaunch transient failures, escalate fatals.
- Safeguards: max 3 relaunches/cycle, 60-min cooldown/source, 3-strike escalation.
- Root-cause categories: `transient_api`, `auth_expired`, `oom`, `bad_query`.
- Pushover heartbeat per cycle.
- Runs on the data engineer's laptop. Silent-failure window if session closes.
## Analyst access (s.9)
| Pattern | Stack |
|---|---|
| Ad-hoc SQL | DuckDB with Azure connection string |
| BI | Power BI ADLS Gen2 connector |
| Notebooks | Python / PyArrow with AzureFileSystem |
| Operational | Dagster UI (asset lineage, run status) |
Missing: No JDBC/REST endpoint for tools that expect a warehouse (Tableau, Looker, Mode).
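The ad-hoc SQL path can be sketched as the statements an analyst would run in a DuckDB session. This assumes the `duckdb` package with its Azure extension and an `AZURE_STORAGE_CONNECTION_STRING` environment variable; the container name and silver path are illustrative.

```python
import os


def adhoc_setup_sql(container: str = "datalake") -> list[str]:
    # Statements to run, in order, in an in-process DuckDB session before
    # querying silver Parquet directly over ADLS. Nothing is executed here;
    # this just assembles the session script.
    conn = os.environ.get("AZURE_STORAGE_CONNECTION_STRING", "<connection string>")
    return [
        "INSTALL azure",
        "LOAD azure",
        f"SET azure_storage_connection_string = '{conn}'",
        f"SELECT * FROM read_parquet('azure://{container}/silver/fact_invoice/latest.parquet') LIMIT 10",
    ]
```

This is the gap the "Missing" note above refers to: every consumer must carry DuckDB (or equivalent) client-side, because there is no server-side JDBC/REST endpoint to point Tableau or Looker at.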
## Cost model (s.11)
| Item | Monthly |
|---|---|
| ADLS storage (330 GB hot) | £5 |
| Compute | £0 (existing on-prem) |
| Egress | £0 (in-region consumption) |
| Total | ~£5/month |
Comparison: Fivetran + Snowflake for the same footprint would run to low-to-mid five figures per year.
## Known weaknesses (s.12)
- Single-node compute, no HA
- Watchdog runs on laptop
- No data catalogue
- No data-quality framework beyond reconciliation
- Implicit lineage only (Dagster UI)
- Recurring coordinator-zombie pattern
- No JDBC endpoint
- No PII classification / masking
## Roadmap (s.13)
| Priority | Item | Status |
|---|---|---|
| 1 | Billing dimensions (5 tables) | Ready to build |
| 1 | Move watchdog off laptop | Planned |
| 2 | End-to-end silver join: customer → service → invoice | Design phase |
| 2 | Zendesk historical backfill | In progress |
| 2 | billing_cdrs/cdr_bare backfill | In progress |
| 3 | Gold layer (revenue / churn / ARPU) | Blocked on requirements |
| 3 | Data catalogue / query endpoint | Evaluating options |
| 3 | Coordinator-zombie root cause | Investigation open |
| 3 | Silver Type-2 dimensions | Deferred |