Go home
Menu

The 6-part data contract I want before I trust a source table

My six-part source-table contract covers row meaning, key rule, landing cadence, valid-record handling, correction window, and owner so downstream analytics work does not start with guesses.

· 5 min read by Berhan Turkkaynagi

A new source table can look useful long before it is safe to trust. If nobody agrees on row meaning, key rule, landing cadence, and valid-record handling, the first downstream model bakes in assumptions nobody approved.

Before I add the checks from Row counts are not enough: the checks I add before I trust a pipeline, I want a short six-part source contract:

  • row meaning
  • key rule
  • landing cadence
  • valid-record handling
  • correction window
  • owner

It is small enough to agree on quickly and specific enough to stop downstream guessing.

Even when the source is exploratory or changing fast, I still write the short version and label the trust boundary up front. If the row meaning is still fuzzy after that contract pass, I usually need an explicit grain note before the downstream model is safe to reuse.

Problem

Imagine a warehouse integration starts sending a shipment_events table for on-time-in-full (OTIF) and fill-rate reporting. Monday’s load looks normal. By Tuesday, planners see duplicate shipment corrections in one lane and a small set of rows with no warehouse_id.

At that point, the analytics team can keep patching transforms around ambiguous source behavior. The better move is to stop and write the contract that says what a row means, which records are valid, and how long corrections can restate history. I have learned that this is cheaper than repairing the metric layer later.

Default approach

  • Define the row meaning first: one row should mean one shipment event, one shipment line, or one snapshot record, not a mix of all three.
  • Define the key rule next: which column or column set should be unique, and when can retries or duplicates appear?
  • Define landing cadence and lateness: how often should the table land, and when is it officially late?
  • Define valid-record rules for critical fields and the response when they fail: quarantine, block publish, or allow with a flag.
  • Define the correction window: can rows be updated or deleted, how are corrections flagged, and when does history stop moving?
  • Define the owner: who confirms the rule and who answers first when a check fails?

Example

Here is how I would turn those six parts into a lightweight contract for a new shipment_events table before I use it in executive reporting. I keep the contract in the same six-part order so the field-level rules line up with the actual checks.

FieldValue
Tableshipment_events
Use caseOTIF and fill-rate reporting
Contract fieldExpectationWhy it matters
row meaningone row = one shipment eventstops grain drift downstream
shipment_event_idunique per source eventretry duplicates do not inflate counts
landing cadencedaily load by 06:00 UTCtells me when freshness is late
event_tsalways present in UTCdefines lateness and daily bucketing
warehouse_idnon-null, valid mapped warehousejoins do not silently drop records
skunon-null product identifiershipment and inventory logic can reconcile
quantitysigned numeric value, never nullcorrections and reversals stay explainable
event_typeallowed set: ship, cancel, correctevent flow is explicit
invalid record actionquarantine rows missing warehouse_id or skubad records do not leak into KPI tables
change behaviorcorrections may arrive within 72 hoursdownstream models know history can move
ownerWMS integration teamsomeone can confirm or fix the rule

The questions I want settled are simple: can retries duplicate raw events, when is the daily load officially late, how long can corrections restate history, and do rows without warehouse_id get quarantined or allowed through with a flag? Those answers belong before the first KPI review, not during it.

Once that contract exists, I can turn it into boring controls: a freshness check on landing time, a uniqueness check on shipment_event_id, null checks on warehouse_id and sku, accepted values on event_type, and a publish rule that keeps quarantined rows out of executive reporting.

I want that contract next to the ingestion checks or source tests, not in a wiki that drifts out of date.

Without that contract, the downstream model is not really tested against the failure modes that matter. It is turning upstream ambiguity into downstream logic, which is usually when I need an explicit grain note just to make the row meaning visible again.

Tradeoffs

  • Breaks when: the source is exploratory and the schema changes every week → Mitigation: start with row meaning, landing cadence, and owner, then label the table non-authoritative until the field-level rules stabilize.
  • Breaks when: the upstream team cannot guarantee uniqueness yet → Mitigation: land the raw table separately, quarantine duplicates, and keep business-facing models off it until the key behavior stabilizes.
  • Breaks when: late corrections are normal for the business → Mitigation: publish a restatement window and a freeze rule so downstream users know which dates can still move and when the numbers become final.
  • Breaks when: the contract document drifts away from actual source behavior → Mitigation: keep the contract beside ingestion checks or source tests and review it when the source logic changes.

Close

Next step: Pick one critical source table and write the six contract lines that stop downstream guessing: row meaning, key rule, landing cadence, valid-record rules, correction window, and owner.

If a source table keeps turning into downstream detective work, use those six lines to name the one missing boundary before another model inherits the guess.