The schema-change checklist I use before a source breaks downstream models
I use a schema-change checklist to classify additive fields, renames, type changes, dropped fields, and semantic changes before an upstream source breaks downstream models or dashboards.
Schema changes get expensive when they look harmless at ingest time.
The load can finish, the row count can land, and the first visible cost can be downstream: a join starts dropping customers, a timestamp watermark skips records, or a dashboard keeps using a field whose meaning moved. I do not want the first serious schema-change review to happen after the model is wrong.
Before promotion, I classify the change, name the downstream surface, and choose the response on purpose: absorb it, warn consumers, quarantine records, block promotion, coordinate a migration, or run compatibility in parallel.
Problem
Schema changes are easy to underreact to when the pipeline stays green.
An additive field looks harmless until an analyst exposes it before the business meaning is agreed. A rename looks simple until it breaks a watermark or dashboard filter. A type change looks like an implementation detail until a silent cast in staging changes the key that downstream joins depend on.
The cost is not only the broken run. It is the uncertainty after the run: which models used the field, which dashboard consumed the model, who owns the response, and why the team let that shape move forward without a decision.
This is different from the initial source agreement I want in a six-part data contract for a source table. That contract says what the source is supposed to mean. This checklist is what I use when the source moves anyway.
Default approach
- Capture the observed change: source, field, old shape, new shape, and when the change appeared.
- Classify the change before reacting: additive field, rename, type change, dropped field, or semantic change.
- Check downstream use across ingestion, staging models, marts, dashboards, metric definitions, extracts, and known consumers.
- Name the owner for the response. The upstream owner, transformation owner, and business-facing consumer owner may not be the same person.
- Choose the promotion decision explicitly: absorb, warn, quarantine, block promotion, coordinate migration, or run compatibility in parallel.
- Record the evidence next to the pull request, incident note, release comment, validation run, or source-contract update.
The classification matters because not every change deserves the same gate. Unused additive fields can often move into raw or staging with a warning. Key type changes, watermark renames, dropped business fields, and semantic shifts usually need a stronger stop.
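A minimal sketch of the structural part of that classification, assuming each schema snapshot is available as a plain mapping of field name to type label. The function name, the type labels, and the `renames` argument are hypothetical. A diff alone cannot tell a rename from a drop plus an add, so confirmed renames have to be supplied by a person, and semantic changes are invisible to this comparison altogether.

```python
from typing import Dict, List, Optional, Tuple

Schema = Dict[str, str]  # field name -> type label (labels are illustrative)


def classify_changes(
    old: Schema, new: Schema, renames: Optional[Dict[str, str]] = None
) -> List[Tuple[str, str]]:
    """Return (field, classification) pairs for structural changes only."""
    renames = renames or {}
    # Accept a rename only if the old name is gone and the new name is new.
    confirmed = {
        o: n
        for o, n in renames.items()
        if o in old and o not in new and n in new and n not in old
    }
    changes = [(f"{o} -> {n}", "rename") for o, n in confirmed.items()]
    for name in old:
        if name in confirmed:
            continue
        if name not in new:
            changes.append((name, "dropped field"))
        elif old[name] != new[name]:
            changes.append((name, "type change"))
    for name in new:
        if name not in old and name not in confirmed.values():
            changes.append((name, "additive field"))
    return changes


# Illustrative snapshots, not a real API response.
old_schema = {"customer_id": "integer", "created_at": "timestamp"}
new_schema = {
    "customer_id": "string",
    "customer_segment": "string",
    "created_timestamp": "timestamp",
}
changes = classify_changes(
    old_schema, new_schema, renames={"created_at": "created_timestamp"}
)
for field_name, classification in changes:
    print(f"{field_name}: {classification}")
```

Without the `renames` hint, the same diff reports `created_at` as a dropped field and `created_timestamp` as an additive field, which is exactly the ambiguity a human has to resolve before choosing a gate.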
That is also why row counts are not enough. A schema change can keep the same number of rows while changing join behavior, null behavior, or metric meaning. I still want the output checks I add before I trust a pipeline, but those checks should not be the first place the structural change is understood.
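A small illustration of the row-count point, using made-up data: the mart still holds numeric keys, the new load sends strings, and the counts agree while every exact-match lookup fails. The helper name and the sample values are hypothetical.

```python
# Hypothetical data: the mart holds numeric keys, the new load sends strings.
mart_keys = {101, 102, 103}
loaded_rows = [
    {"customer_id": "101"},
    {"customer_id": "102"},
    {"customer_id": "C-103"},
]


def unmatched_keys(rows, known_keys):
    """Return loaded keys that have no exact match among the mart keys."""
    return [r["customer_id"] for r in rows if r["customer_id"] not in known_keys]


row_count_matches = len(loaded_rows) == len(mart_keys)
misses = unmatched_keys(loaded_rows, mart_keys)

print("row count matches:", row_count_matches)
print("unmatched keys:", misses)
```

In this toy case the row count check passes, yet no loaded key matches the mart, because the string "101" is not equal to the integer 101. A real warehouse might cast implicitly instead of missing, which is a different failure but equally invisible to a count.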
Example
Imagine the customer API changes during a normal validation run.
Three things happen at once: customer_id changes from a numeric identifier to a string identifier, customer_segment appears as a new nullable field, and created_at is renamed to created_timestamp. Ingestion still lands records, but the change touches the raw load, the staging model, the customer mart, and an executive dashboard filter.
This is the checklist I want before promotion.
Schema-change classification checklist
Source: customer API / customers endpoint
Observed on: 2025-02-11 validation run
Owner recording decision: analytics engineering
Change 1
field: customer_id
old shape: integer-like numeric identifier
new shape: string identifier
classification: type change
critical downstream use: staging joins, customer mart key, dashboard drill-through URL
first risk: silent cast or broken join if downstream expects numeric
owner: data engineering owns ingest; analytics engineering owns staging model
promotion decision: block promotion until raw string is preserved, staging key cast is explicit, and join tests pass
follow-up evidence: source contract note, staging PR, validation run, join/null test output
Change 2
field: customer_segment
old shape: not present
new shape: nullable string
classification: additive field
critical downstream use: none yet; requested by lifecycle reporting later
first risk: low for current dashboards, medium if analysts use it before the definition is certified
owner: lifecycle analytics owner confirms definition before mart exposure
promotion decision: absorb at raw/staging layer, warn that it is not business-certified yet
follow-up evidence: release comment says the field is staged but not curated
Change 3
field: created_at -> created_timestamp
old shape: created_at timestamp string
new shape: created_timestamp timestamp string
classification: rename
critical downstream use: incremental load watermark, staging model, cohort dashboard
first risk: latest records stop loading or cohorts shift if the old field name is assumed
owner: data engineering confirms source change; analytics engineering updates staging compatibility
promotion decision: coordinate migration with temporary compatibility alias and alert on old-field disappearance
follow-up evidence: PR includes alias removal date, dashboard validation slice, and owner sign-off
Checklist fields to preserve for every schema change
- source and observed date
- field name or semantic rule
- old shape and new shape
- classification: additive field, rename, type change, dropped field, semantic change
- downstream impact: models, dashboards, metrics, exports, or consumers
- owner: upstream, transformation, and business-facing owner where relevant
- promotion decision: absorb, warn, quarantine, block promotion, coordinate migration, or run parallel compatibility
- follow-up evidence: PR, source contract update, validation run, release note, or incident note
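If the record lives in code rather than in a PR description, one possible shape is a small dataclass that rejects classifications and decisions outside the checklist vocabulary. This is a sketch under that assumption; the class name, attribute names, and validation rules are illustrative, not a required format.

```python
from dataclasses import dataclass
from typing import List

CLASSIFICATIONS = {
    "additive field", "rename", "type change", "dropped field", "semantic change",
}
DECISIONS = {
    "absorb", "warn", "quarantine", "block promotion",
    "coordinate migration", "run parallel compatibility",
}


@dataclass
class SchemaChangeRecord:
    source: str
    observed_on: str
    field_name: str
    old_shape: str
    new_shape: str
    classification: str
    downstream_impact: List[str]
    owner: str
    promotion_decision: str
    follow_up_evidence: List[str]

    def __post_init__(self) -> None:
        # Reject values outside the checklist vocabulary and unowned records.
        if self.classification not in CLASSIFICATIONS:
            raise ValueError(f"unknown classification: {self.classification!r}")
        if self.promotion_decision not in DECISIONS:
            raise ValueError(f"unknown decision: {self.promotion_decision!r}")
        if not self.owner.strip():
            raise ValueError("a schema change needs a named owner")


record = SchemaChangeRecord(
    source="customer API / customers endpoint",
    observed_on="2025-02-11",
    field_name="customer_id",
    old_shape="integer-like numeric identifier",
    new_shape="string identifier",
    classification="type change",
    downstream_impact=["staging joins", "customer mart key", "dashboard drill-through URL"],
    owner="analytics engineering",
    promotion_decision="block promotion",
    follow_up_evidence=["source contract note", "staging PR", "validation run"],
)
```

The validation is deliberately narrow: it only forces the record to use the checklist's own terms and to name an owner, which are the two things most likely to go missing under time pressure.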
The three changes should not get one blanket decision.
For customer_id, I block promotion until the raw string is preserved and the staging model makes the cast explicit. A key is not a cosmetic field. If the downstream mart expects numeric IDs, a quiet cast can create join misses that look like customer churn or dashboard drill-through defects.
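What "explicit cast" can look like in practice, as a hedged sketch: keep the raw string untouched and derive a numeric key only when the conversion round-trips exactly. The function name is hypothetical, and the rule that leading zeros or non-digit characters disqualify a value is an assumption about what counts as lossless for this key.

```python
from typing import Optional, Tuple


def stage_customer_id(raw_value: str) -> Tuple[str, Optional[int]]:
    """Return (raw string, numeric key or None).

    The numeric key is set only when converting back to text reproduces the
    raw value exactly, so "007" and "C-103" are preserved but not cast.
    """
    if raw_value.isascii() and raw_value.isdigit() and str(int(raw_value)) == raw_value:
        return raw_value, int(raw_value)
    return raw_value, None


for value in ["101", "007", "C-103"]:
    raw, numeric = stage_customer_id(value)
    status = "castable" if numeric is not None else "needs review"
    print(f"{raw!r}: {status}")
```

Rows that come back with `None` are the ones to count in the join and null tests before promotion, rather than letting them disappear from an inner join.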
For customer_segment, I can absorb the field earlier because no current dashboard depends on it. But I still warn consumers that it is not business-certified. The field should not appear in the curated mart until someone owns allowed values, null behavior, and the difference between source-system labels and reporting labels.
For created_at, I prefer a compatibility window. The staging model can expose the old alias briefly while the source owner confirms the rename and the analytics owner updates the watermark, cohort model, and dashboard slice. The important part is that the alias has an owner and removal date. Otherwise the compatibility layer becomes a permanent hiding place for unfinished migration work.
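One way to keep the alias honest is to store its owner and removal date next to the mapping and fail loudly once the date passes. The sketch below assumes records arrive as dictionaries; the registry name, the function name, and the 2025-03-11 removal date are hypothetical.

```python
from datetime import date
from typing import List, Tuple

# Hypothetical alias registry: old field name -> where the value lives now.
COMPAT_ALIASES = {
    "created_at": {
        "source_field": "created_timestamp",
        "owner": "analytics engineering",
        "remove_after": date(2025, 3, 11),  # illustrative removal date
    },
}


def apply_aliases(
    record: dict, today: date, aliases: dict = COMPAT_ALIASES
) -> Tuple[dict, List[str]]:
    """Expose old field names during the compatibility window only."""
    staged = dict(record)
    notes = []
    for old_name, rule in aliases.items():
        if old_name in staged or rule["source_field"] not in staged:
            continue  # old field still present, or nothing to alias from
        if today > rule["remove_after"]:
            raise RuntimeError(
                f"alias {old_name!r} expired on {rule['remove_after'].isoformat()}; "
                f"owner: {rule['owner']}"
            )
        staged[old_name] = staged[rule["source_field"]]
        notes.append(
            f"{old_name} served from {rule['source_field']} "
            f"until {rule['remove_after'].isoformat()}"
        )
    return staged, notes


sample = {"customer_id": "101", "created_timestamp": "2025-02-10T08:00:00Z"}
staged, notes = apply_aliases(sample, today=date(2025, 2, 12))
print(staged["created_at"], notes)
```

Raising after the removal date is a design choice, not the only option. A softer variant could warn instead, but then the alias needs some other mechanism that forces the migration to finish.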
Dropped fields and semantic changes fit the same checklist even though this API example does not include them. If sales_region disappears, I want to know which model, metric, export, or owner loses required meaning before anyone fills a fake default. If customer_segment keeps the same name but changes from lifecycle segment to marketing segment, I treat it as a semantic change and require the same promotion decision I would use for a visible schema break.
Automated lineage and runtime observability can speed up the investigation. I still want a human-readable decision record, especially when lineage misses spreadsheets, dashboard extracts, or owner-maintained consumer lists. If the data is already late or stale, or a run has already failed, pipeline observability signals help the first responder. This checklist belongs one step earlier: before the team decides that the changed source is safe to promote.
Tradeoffs
- Breaks when: every additive field is treated as a release blocker → Mitigation: allow unused additive fields into raw and staging with a warning, but hold curated exposure until an owner confirms meaning, allowed values, and null behavior.
- Breaks when: a harmless widening is grouped with destructive type changes → Mitigation: distinguish widening from incompatible casts, preserve raw values, and run join/null checks before promotion.
- Breaks when: lineage misses spreadsheet exports, dashboard extracts, or manually maintained reports → Mitigation: pair automated lineage with a known-consumer list for business-critical models.
- Breaks when: compatibility aliases for renames become permanent → Mitigation: record the alias removal date, owner, and validation slice before the compatibility layer ships.
- Breaks when: a dropped field removes required business meaning and the upstream owner cannot restore it quickly → Mitigation: record degraded mode, consumer warning, and migration owner instead of silently filling fake defaults.
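The widening tradeoff above can be sketched as a small lookup, assuming type changes are described with simple labels. The label names, the set of pairs treated as lossless, and the default gates are all assumptions for illustration; each warehouse has its own promotion rules, and a key column may deserve a block even for a widening.

```python
# Hypothetical (old_type, new_type) pairs treated as lossless widening.
# int64 -> float64 is deliberately absent: large integers lose precision.
WIDENING = {
    ("int32", "int64"),
    ("float32", "float64"),
    ("int32", "float64"),
}


def type_change_gate(old_type: str, new_type: str) -> str:
    """Suggest a default gate for a type change; a person can still override."""
    if old_type == new_type:
        return "no change"
    if (old_type, new_type) in WIDENING:
        return "warn"
    return "block promotion"


print(type_change_gate("int32", "int64"))
print(type_change_gate("int64", "string"))
```

The point of the lookup is only to stop harmless widenings from inheriting the same gate as an integer-to-string key change. It does not replace the join and null checks.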
Close
Next step: Pick one business-critical source that changed recently and classify the change before the next promotion: old shape, new shape, downstream use, owner, promotion decision, and follow-up evidence.
If you want to compare notes on schema-change triage, I am most interested in the field that looks low-risk and the model that would prove otherwise.