The 6-part data contract I want before I trust a source table

A new source table can look useful long before it is safe to trust. If nobody agrees on row meaning, key rule, landing cadence, and valid-record handling, the first downstream model bakes in assumptions nobody approved.

Before I add the checks from Row counts are not enough: the checks I add before I trust a pipeline, I start with a short six-part source contract:

row meaning
key rule
landing cadence
valid-record handling
correction window
owner

It is small enough to agree on quickly and specific enough to stop downstream guessing.

Even when the source is exploratory or changing fast, I still write the short version and label the trust boundary up front. If the row meaning is still fuzzy after that contract pass, I usually need an explicit grain note before the downstream model is safe to reuse.

Problem

Imagine a warehouse integration starts sending a shipment_events table for on-time-in-full (OTIF) and fill-rate reporting. Monday’s load looks normal. By Tuesday, planners see duplicate shipment corrections in one lane and a small set of rows with no warehouse_id.

At that point, the analytics team can keep patching transforms around ambiguous source behavior. The better move is to stop and write the contract that says what a row means, which records are valid, and how long corrections can restate history. I have learned that this is cheaper than repairing the metric layer later.

Default approach

Define the row meaning first: one row should mean one shipment event, one shipment line, or one snapshot record, not a mix of all three.
Define the key rule next: which column or column set should be unique, and when can retries or duplicates appear?
Define landing cadence and lateness: how often should the table land, and when is it officially late?
Define valid-record rules for critical fields and the response when they fail: quarantine, block publish, or allow with a flag.
Define the correction window: can rows be updated or deleted, how are corrections flagged, and when does history stop moving?
Define the owner: who confirms the rule and who answers first when a check fails?

Example

Here is how I would turn those six parts into a lightweight contract for a new shipment_events table before I use it in executive reporting. I keep the contract in the same six-part order so the field-level rules line up with the actual checks.

Field	Value
Table	shipment_events
Use case	OTIF and fill-rate reporting

Contract field	Expectation	Why it matters
row meaning	one row = one shipment event	stops grain drift downstream
shipment_event_id	unique per source event	retry duplicates do not inflate counts
landing cadence	daily load by 06:00 UTC	tells me when freshness is late
event_ts	always present in UTC	defines lateness and daily bucketing
warehouse_id	non-null, valid mapped warehouse	joins do not silently drop records
sku	non-null product identifier	shipment and inventory logic can reconcile
quantity	signed numeric value, never null	corrections and reversals stay explainable
event_type	allowed set: ship, cancel, correct	event flow is explicit
invalid record action	quarantine rows missing warehouse_id or sku	bad records do not leak into KPI tables
change behavior	corrections may arrive within 72 hours	downstream models know history can move
owner	WMS integration team	someone can confirm or fix the rule

The questions worth settling are simple: can retries duplicate raw events, when is the daily load officially late, how long can corrections restate history, and do rows without warehouse_id get quarantined or allowed through with a flag? Those answers belong before the first KPI review, not during it.

Once that contract exists, I can turn it into boring controls: a freshness check on landing time, a uniqueness check on shipment_event_id, null checks on warehouse_id and sku, accepted values on event_type, and a publish rule that keeps quarantined rows out of executive reporting.

That contract belongs next to the ingestion checks or source tests, not in a wiki that drifts out of date.

Without that contract, the downstream model is not really tested against the failure modes that matter. It is turning upstream ambiguity into downstream logic, which is usually when I need an explicit grain note just to make the row meaning visible again.

Tradeoffs

Breaks when: the source is exploratory and the schema changes every week → Mitigation: start with row meaning, landing cadence, and owner, then label the table non-authoritative until the field-level rules stabilize.
Breaks when: the upstream team cannot guarantee uniqueness yet → Mitigation: land the raw table separately, quarantine duplicates, and keep business-facing models off it until the key behavior stabilizes.
Breaks when: late corrections are normal for the business → Mitigation: publish a restatement window and a freeze rule so downstream users know which dates can still move and when the numbers become final.
Breaks when: the contract document drifts away from actual source behavior → Mitigation: keep the contract beside ingestion checks or source tests and review it when the source logic changes.

Close

Next step: Pick one critical source table and write the six contract lines that stop downstream guessing: row meaning, key rule, landing cadence, valid-record rules, correction window, and owner.

If a source table keeps turning into downstream detective work, use those six lines to name the one missing boundary before another model inherits the guess.

The 6-part data contract I want before I trust a source table

Problem

Default approach

Example

Tradeoffs

Close

Continue reading

Row counts are not enough: the checks I add before I trust a pipeline

Every important model needs an explicit grain