Why every consumer who says they only need the totals will eventually ask for the detail.
Notes on data engineering, distributed systems, and the occasional detour through music theory and career.
One row per table per run, raw measurements only. Everything else is a query on top.
Hot, warm, and cold tiers — partition the pipeline so the tables that matter most get attention first.
_extracted_at, _batch_id, _source_hash — what each one costs and what it buys.
Load into staging, validate, swap atomically. Zero downtime, trivial rollback.
The row was there yesterday and gone today. A cursor never sees it, so something else has to.
Why full replace is the default, and when to deviate from it.
A field guide to the assumptions that will eventually break your pipeline.
Pure EL doesn't exist. Every pipeline conforms whether it admits it or not.
Multiple examples of incremental loading rules for different cases
Go for growth, not just technical skills
How helping with Stata homework led to a career in D.Eng.