Berhan Turkkaynagi

The metric ownership review I run before quarter close

Sat, 16 May 2026 00:00:00 GMT

Quarter close is when metric ownership gaps stop being abstract.

A KPI definition can change quietly during the quarter. An exclusion can move from one dashboard to another. A finance owner can approve a label in one surface while operations still reads the old rule in another. When the close package is about to lock, those small gaps become reporting risk.

I do not want to relitigate every metric in the close meeting. I want a short review for the numbers that changed, affect a decision, or still carry an open question.

Problem

Quarter-close reporting puts a deadline on metric ambiguity.

The issue is rarely that nobody cares about the number. The issue is that the current owner, latest definition, affected surface, and unresolved question are scattered across release notes, dashboard subtitles, pull requests, chats, and meeting memory.

That scatter matters when the report is close-facing. If active_customer changed its refund exclusion this quarter, finance needs to know which definition is in the close dashboard. If fill_rate excludes manual holds in the operations dashboard, the service narrative needs to say that before the deck freezes. If a revenue-recognition reporting label was split for visibility, the analytics surface needs a finance-owned reporting-boundary answer before anyone treats it as close evidence.

The failure mode is predictable: the review becomes a debate about every metric in the catalog, or it happens after the reporting surface has already locked.

Default approach

I keep the review narrow.

A metric enters the quarter-close ownership review only when at least one of these is true:

the definition changed this quarter;
an exclusion or status bucket changed this quarter;
the definition owner changed;
the metric appears in a close-facing dashboard, packet, export, or executive narrative;
an open question could change whether the surface is safe to lock.
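The entry rule above can be sketched as a simple predicate. This is a minimal illustration, not a prescribed implementation: the class name, field names, and the second catalog entry are hypothetical, and the flags are assumed to be filled in by whoever prepares the review.

```python
from dataclasses import dataclass


@dataclass
class MetricChangeFlags:
    """Hypothetical per-metric flags for the quarter; field names are illustrative."""
    name: str
    definition_changed: bool = False
    exclusion_or_bucket_changed: bool = False
    owner_changed: bool = False
    on_close_facing_surface: bool = False
    open_question_affects_lock: bool = False


def enters_review(m: MetricChangeFlags) -> bool:
    """A metric enters the review when at least one entry condition is true."""
    return any([
        m.definition_changed,
        m.exclusion_or_bucket_changed,
        m.owner_changed,
        m.on_close_facing_surface,
        m.open_question_affects_lock,
    ])


catalog = [
    MetricChangeFlags("active_customer", exclusion_or_bucket_changed=True,
                      on_close_facing_surface=True),
    MetricChangeFlags("internal_debug_count"),  # hypothetical metric, no flags set
]
in_review = [m.name for m in catalog if enters_review(m)]
print(in_review)  # ['active_customer']
```

The point of writing it this way is that a metric with no flag set never reaches the close meeting, which keeps the review from drifting toward the whole catalog.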

For each included metric, I want the same fields in one place: reporting definition, decision it supports, owner, changed definition or exclusion, last-change check, open question, reporting surface affected, safe-to-lock judgment, and next owner/action.

The important field is the judgment. Each metric gets one of three: safe to lock, safe with owner note, or hold. That prevents two bad defaults: blocking every surface because one edge case exists, or locking a close-facing number while the owner question is still floating in chat.
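One way to keep the judgment honest is to treat the card as a small record and check it before the surface locks. The sketch below is an assumption about how that could look: the class names, the reduced field set, and the three consistency rules are illustrative, and the sample values are borrowed from the fill_rate entry in the example card.

```python
from dataclasses import dataclass
from enum import Enum


class LockJudgment(Enum):
    SAFE_TO_LOCK = "safe to lock"
    SAFE_WITH_OWNER_NOTE = "safe with owner note"
    HOLD = "hold"


@dataclass
class OwnershipCard:
    metric: str
    owner: str
    open_question: str  # empty string when nothing is open
    surface_affected: str
    judgment: LockJudgment
    next_action: str


def card_problems(card):
    """Return reasons the card is not ready; an empty list means it is usable."""
    problems = []
    if not card.owner:
        problems.append("no owner named")
    # Assumed rule: an open question cannot coexist with a plain "safe to lock".
    if card.open_question and card.judgment is LockJudgment.SAFE_TO_LOCK:
        problems.append("open question recorded but judged safe to lock")
    # Assumed rule: a note or a hold always names the next owner/action.
    if card.judgment is not LockJudgment.SAFE_TO_LOCK and not card.next_action:
        problems.append("note or hold without a next owner/action")
    return problems


fill_rate = OwnershipCard(
    metric="fill_rate",
    owner="warehouse operations analytics",
    open_question="split-shipment handling",
    surface_affected="operations close dashboard",
    judgment=LockJudgment.HOLD,
    next_action="record split-shipment handling and republish the filter note",
)
print(card_problems(fill_rate))  # []
```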

Example: the ownership card I want before reporting locks

Here is the card I would rather write before the close packet freezes:

Metric ownership review card
Close period: 2025 Q3 close package
Review rule: only decision-critical metrics with changed definitions, exclusions, owners, or reporting surfaces

1) active_customer
- reporting definition: billed account with at least one paid invoice in the close period
- decision it supports: revenue retention and customer-count narrative for the close package
- owner: finance analytics
- definition/exclusion changed this quarter: refunded invoices are excluded; trial-only accounts remain excluded
- last-change check: semantic definition PR and dashboard subtitle both show the refund exclusion
- open question: confirm whether migrated accounts with invoice credits stay in the billed-account population
- reporting surface affected: finance close dashboard and executive close narrative
- safe-to-lock judgment: safe with owner note until migrated-account question is resolved
- next owner/action: finance analytics confirms migrated-account handling before the close deck freezes

2) fill_rate
- reporting definition: complete customer orders filled from available stock on the first warehouse execution attempt
- decision it supports: quarter-end service and fulfillment narrative
- owner: warehouse operations analytics
- definition/exclusion changed this quarter: customer-requested future ship dates and manual holds are excluded
- last-change check: metric card and operations dashboard filter note both show the new exclusions
- open question: decide whether split shipments stay out of the close-facing metric or become a separate exception note
- reporting surface affected: operations close dashboard and supply-chain service summary
- safe-to-lock judgment: hold until split-shipment handling is named in the surface note
- next owner/action: warehouse operations records split-shipment handling and republishes the filter note

3) revenue_recognition_reporting_definition
- reporting definition: reporting surface uses the finance-approved recognition status label for close review
- decision it supports: whether the analytics surface is safe to reference in the quarter-close package
- owner: finance analytics with finance policy owner review
- definition/exclusion changed this quarter: analytics field was renamed and one deferred-status bucket was split for reporting visibility
- last-change check: close dashboard, semantic definition, and release note use the same status labels
- open question: finance policy owner confirms whether the split bucket is explanatory only or blocks the close-facing surface
- reporting surface affected: finance close dashboard and analytics release note
- safe-to-lock judgment: hold the surface until the policy owner answers; this card does not define accounting treatment
- next owner/action: finance analytics records the policy-owner answer or removes the surface from the close packet

The first row is not perfectly clean, but it may be safe with a note. The refund exclusion is visible in both the semantic definition and the dashboard subtitle. The open question is narrow: migrated accounts with invoice credits. I would not hold the entire close package for that by default, but I would name the owner and put the note beside the affected surface.

The fill-rate row is different. The metric affects the quarter-end service narrative, and split shipments can change the story. If the surface does not say whether split shipments are excluded, included, or called out separately, I would hold that close-facing surface until the owner records the handling.

The revenue-recognition row has the strongest boundary. The analytics artifact is not deciding accounting treatment. It is deciding whether the reporting surface is safe to reference. If the policy owner has not answered whether the split bucket is explanatory or blocking, the correct analytics decision is to hold the surface or remove it from the close packet.

Tradeoffs

Breaks when: the review tries to cover every metric in the catalog → Mitigation: include only changed, decision-critical metrics or surfaces with unresolved close-facing questions.
Breaks when: open questions stay in meeting notes or chat threads → Mitigation: record the affected surface, owner, and next action on the card before the report locks.
Breaks when: every uncertainty becomes a blocker → Mitigation: separate safe with owner note from hold, and reserve holds for questions that can change the reported number, interpretation, or surface eligibility.
Breaks when: revenue-recognition language starts sounding like policy guidance → Mitigation: keep the card at the reporting-definition boundary and let finance or policy owners decide treatment.
Breaks when: the card becomes a stale governance artifact → Mitigation: use it for the close window, store it beside the release note or close checklist, and replace it when the next definition change happens.

Close

Next step: Pick one close-facing metric that changed this quarter and write the ownership card before the reporting surface freezes: definition, owner, last-change check, open question, safe-to-lock judgment, and next action.

The goal is not perfect metric governance. It is a calmer close meeting because the changed numbers already have owners, surface notes, and explicit lock or hold decisions.

The source-trust scorecard I use before promoting a table to production

Mon, 11 May 2026 00:00:00 GMT

The source-trust scorecard is the promotion record between a valid contract and a production dependency.

The dangerous moment is the promotion step. The columns land, row volume looks normal, and the first schema checks pass. Then a downstream model inherits key churn, late corrections, or an owner-response gap nobody reviewed before the table entered the daily dashboard path.

My default is to separate the source contract from the promotion decision. The contract says what the source is supposed to do. The source-trust scorecard records what the source has actually done since ingestion started and whether that behavior is safe enough to depend on.

Problem

I do not want the first real test of a source table to happen after a critical model depends on it.

That is the trap with contract-compliant sources. The agreement may define row meaning, cadence, key rules, accepted values, and an owner, but the observation window can still show behavior that would make production noisy. A table can arrive late twice in two weeks. A vendor can merge catalogs and rekey items. A correction can restate data outside the expected window. A new status can appear before downstream logic knows what to do with it.

Those facts do not always make the source unusable. They do mean I need a promotion record that says which uses can move forward, which ones need guardrails, and which ones stay blocked.

Default approach

I use the scorecard after the initial source contract exists and before the table becomes a dependency for a critical model, dashboard, or recurring review.

The review stays small on purpose:

Check cadence reliability against the promised landing pattern.
Check key stability against the joins, deduplication, and history the downstream model needs.
Check whether corrections are predictable, bounded, and explainable.
Check null and domain risk on fields the downstream logic treats as required.
Check whether late arrivals are normal, visible, and compatible with the reporting window.
Check whether the source owner responds fast enough when evidence shows a defect.
Write the promotion decision as promote, promote with guardrails, or hold.

Every score needs a reference: a check result, run log, source-owner note, issue, or review record. If I cannot point to the evidence, I treat the score as an opinion and keep it out of the promotion decision.
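The evidence rule can be made mechanical. The sketch below assumes each dimension carries a score of "promote", "guardrail", or "hold" and an optional evidence reference; the class, the roll-up to the most severe supported score, and the sample evidence strings are all illustrative assumptions. A score with no evidence is reported separately and left out of the decision.

```python
from dataclasses import dataclass
from typing import Optional

SEVERITY = {"promote": 0, "guardrail": 1, "hold": 2}
DECISION = {0: "promote", 1: "promote with guardrails", 2: "hold"}


@dataclass
class DimensionScore:
    dimension: str
    score: str               # "promote", "guardrail", or "hold"
    evidence: Optional[str]  # check result, run log, owner note, issue, or review record


def promotion_decision(scores):
    """Return (decision, unsupported); unsupported lists dimensions lacking evidence."""
    supported = [s for s in scores if s.evidence]
    unsupported = [s.dimension for s in scores if not s.evidence]
    if not supported:
        # Assumption: with no evidence-backed score at all, the safe default is hold.
        return "hold", unsupported
    worst = max(SEVERITY[s.score] for s in supported)
    return DECISION[worst], unsupported


scores = [
    DimensionScore("Cadence reliability", "guardrail", "run log: 12 of 14 loads on time"),
    DimensionScore("Key stability", "guardrail", "duplicate-key check and rekey count"),
    # Hypothetical: a score entered without a reference, so it is only an opinion.
    DimensionScore("Owner response", "promote", None),
]
decision, unsupported = promotion_decision(scores)
print(decision, unsupported)  # promote with guardrails ['Owner response']
```

The roll-up only suggests a starting point. The written decision can still split by use, for example promoting one downstream path while holding another.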

Example: a vendor inventory source that needs guardrails

Here is the kind of source that makes the scorecard useful: a vendor inventory feed that already has a lightweight source contract.

The table lands with expected columns. Row volume stays near the normal range. Basic schema checks pass. A shallow review would promote it into both the daily replenishment model and the executive inventory snapshot.

The observation window says something narrower:

Source trust scorecard
Source table: vendor_inventory_snapshot
Candidate use: daily available-to-promise and replenishment models
Observation window: 14 daily loads after initial source contract
Promotion decision: promote with guardrails
Decision owner: analytics engineering, reviewed with inventory source owner

1) Cadence reliability
- Expected: daily landing by 05:30 ET
- Observed evidence: 12 of 14 loads landed on time; 2 arrived 45-70 minutes late with source-owner notice
- Risk note: acceptable for noon planning refresh, not safe for early-morning executive snapshot
- Score: guardrail

2) Key stability
- Expected: one row per source_item_id, location_id, snapshot_date
- Observed evidence: 0 duplicate rows on stable keys, but 3% of items were rekeyed after vendor catalog merge
- Risk note: downstream joins need a rekey bridge before production promotion
- Score: guardrail

3) Correction pattern
- Expected: corrections can restate the last 72 hours only
- Observed evidence: two corrections landed inside 72 hours; one correction restated a five-day-old partition
- Risk note: publish recent days as preliminary until correction window is proven or exception is explained
- Score: guardrail

4) Null/domain risk
- Expected: source_item_id, location_id, on_hand_qty, and status are populated with accepted values
- Observed evidence: location_id null rate stayed below 0.2%; status introduced HOLD without prior notice
- Risk note: HOLD needs an accepted-value rule and owner-confirmed downstream handling
- Score: guardrail

5) Late-arrival pattern
- Expected: late rows are rare and visible through load metadata
- Observed evidence: late rows cluster around vendor weekend reconciliation and affect Monday snapshots
- Risk note: Monday planning slice needs a freshness and late-row annotation before release
- Score: guardrail

6) Owner response
- Expected: source owner acknowledges production-blocking defects same business day
- Observed evidence: owner answered cadence and status questions same day; rekey explanation took two days
- Risk note: key churn is the only unresolved promotion blocker
- Score: guardrail

Decision
- Promote with guardrails after adding the rekey bridge, accepted-value rule for HOLD, and preliminary label for recent correction windows.
- Hold executive-facing snapshots until the next observation window shows cadence and key behavior are stable.

That decision is not a clean promote, and it is not a full hold.

The source is useful for the noon replenishment path after the rekey bridge and HOLD handling are in place. Recent partitions need a preliminary label until correction behavior is clearer. The executive snapshot waits because early-morning cadence and key stability are still the two ways this table can put the wrong number in front of the wrong room.

The scorecard keeps that judgment visible. It shows why one production use can move forward while another stays blocked, and it gives the next reviewer the evidence behind the boundary.
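The correction-pattern row is one of the easier checks to automate. The sketch below flags corrections that restate a partition older than the 72-hour window from the scorecard. The timestamps are made up for illustration, and measuring age from the start of the partition is an assumption about how the window is defined.

```python
from datetime import datetime, timedelta

CORRECTION_WINDOW = timedelta(hours=72)  # expected restatement window


def outside_window(partition_start, corrected_at, window=CORRECTION_WINDOW):
    """True when a correction restates a partition older than the allowed window."""
    return corrected_at - partition_start > window


# Hypothetical correction log: (partition start, time the correction landed).
corrections = [
    (datetime(2025, 3, 1), datetime(2025, 3, 2, 9, 0)),   # 33 hours old
    (datetime(2025, 3, 3), datetime(2025, 3, 5, 8, 0)),   # 56 hours old
    (datetime(2025, 3, 1), datetime(2025, 3, 6, 10, 0)),  # more than five days old
]
late = [(p, c) for p, c in corrections if outside_window(p, c)]
print(len(late))  # 1
```

A non-empty result is the evidence behind a guardrail such as the preliminary label on recent partitions.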

Tradeoffs

Breaks when: teams use the scorecard to block exploratory analysis or every low-risk staging table → Mitigation: reserve it for sources that will feed critical models, dashboards, recurring reviews, or other downstream commitments.
Breaks when: scoring becomes subjective gatekeeping or a permanent grade → Mitigation: record the exact check, run log, source-owner note, issue, or review evidence behind each score, then revisit the score when source behavior changes.
Breaks when: teams wait for a perfect source and delay useful delivery → Mitigation: allow promote with guardrails when the limits are explicit, such as preliminary labels, blocked dashboard surfaces, rekey bridges, or owner-confirmed correction windows.
Breaks when: the scorecard duplicates the source contract and creates two places for the same rule → Mitigation: let the contract define expectations, and let the scorecard record observed behavior against those expectations during the promotion window.

Close

Next step: Before one critical model or dashboard depends on a new source, score the source on observed cadence, key stability, corrections, null risk, late arrivals, and owner response, then record the decision as promote, promote with guardrails, or hold.

The scorecard is complete only when the downstream owner can see why the table moves forward, moves with guardrails, or stays held.

The change log I publish when a backfill moves reported numbers

Mon, 11 May 2026 00:00:00 GMT

Republication pressure starts when a corrected number replaces one people have already used.

The dashboard is safer to use after the rebuild, but the number has already appeared in a deck, export, or business review. If I republish without a short change log, finance and operations readers are left comparing screenshots and asking whether the metric changed, the business changed, or the team quietly fixed an error.

My default is to publish the communication record after validation passes. The validation evidence tells me the movement is expected. The change log tells readers what moved and how to use the new number.

Problem

Backfills usually get treated as technical work: rebuild the range, compare before and after, confirm the expected movement, and republish the dashboard.

That is necessary, but it is not enough when the number was already circulating inside the company. A finance lead may have last week’s export. An operations manager may have copied the old dashboard value into a review deck. Someone may notice that April and May changed and assume the definition changed too.

The trust break is not always the backfill. Sometimes the trust break is the missing explanation after the backfill.

For decision-critical metrics, I want one compact change log before the republished number becomes the new argument.

Default approach

I publish a change-log entry when a validated backfill moves a number that already appeared in a dashboard, deck, export, or recurring review.

The entry has to answer a narrow set of questions:

Which metric or published surface changed?
Which date range, cohort, dashboard, or export is affected?
What direction of movement should readers expect?
Which validation evidence says the movement is expected?
Who approved the business interpretation?
How should readers treat older decks, exports, or screenshots?
What remains open or under follow-up?

I keep that record separate from the validation workbook. The workbook proves the rebuild is safe enough to publish. The change log helps people interpret the number they can now see.

The line I care about most is the interpretation boundary. If the backfill corrects shipped-status timing, I want the next review to treat the movement as a timing correction, not as a new demand signal.
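The seven questions can double as a completeness check before republication. The following is a minimal sketch, assuming the entry is kept as a dictionary of strings; the key names are hypothetical labels for the questions, not an established schema.

```python
# Hypothetical keys, one per question the entry has to answer.
REQUIRED_FIELDS = [
    "surface_changed",
    "affected_range",
    "expected_movement",
    "validation_evidence",
    "interpretation_approver",
    "older_artifacts_guidance",
    "open_follow_up",
]


def missing_fields(entry):
    """Return the questions the draft entry does not answer yet."""
    missing = []
    for key in REQUIRED_FIELDS:
        value = entry.get(key)
        if not isinstance(value, str) or not value.strip():
            missing.append(key)
    return missing


draft = {
    "surface_changed": "shipped volume card on the weekly operations dashboard",
    "affected_range": "2025-02-01 through 2025-04-30",
    "expected_movement": "February and March increase slightly; April stays close",
    "validation_evidence": "",
    "open_follow_up": "none",  # an explicit "none" still counts as an answer
}
print(missing_fields(draft))
# ['validation_evidence', 'interpretation_approver', 'older_artifacts_guidance']
```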

Example: the change log I want before republishing

Imagine a 90-day order-status backfill.

A source system corrected late carrier acknowledgements. Orders that were stuck in pending now settle as shipped. February and March shipped-volume totals increase. April stays close to the prior publish because most corrections were already inside the reporting cutoff.

The rebuild passes validation. The row-count and status-transition checks match the expected correction window. The before/after comparison shows the movement in the right months.

Now the communication problem starts.

The weekly operations dashboard, monthly KPI export, and finance-facing reconciliation tab all show the republished values. Someone can reasonably compare them against an older deck. This is the change log I want available before that happens:

Published-number change log
Metric: shipped volume
Reason for backfill: source corrected order status for late carrier acknowledgements
Affected range: 2025-02-01 through 2025-04-30
Published surfaces touched:
- weekly operations dashboard / shipped volume card
- monthly KPI export used in the supply review
- finance-facing volume reconciliation tab

Expected movement:
- February and March shipped volume should increase slightly
- April shipped volume should stay close to the prior publish because most corrections landed before month end
- on-time delivery percentage may move in the same rows, but fill-rate definition is unchanged

Validation evidence:
- row-count and status-transition comparison completed for February-April partitions
- before/after metric comparison reviewed by analytics engineering
- unexpected movement threshold: any month moving outside the signed validation slice gets held from republication

Owner sign-off:
- analytics engineering approved the rebuild evidence
- operations owner approved the interpretation note for the review deck

Interpretation boundary:
- use the new dashboard for February-April comparisons after 2025-05-27
- do not treat the movement as a new demand signal; it is a correction to shipped-status timing
- older exports remain historical snapshots and should not be mixed with the republished dashboard without this note

Open follow-up:
- source owner will confirm whether the carrier acknowledgement correction needs a permanent freshness check

That note does not explain every row. It gives the reader enough context to stop guessing.

The affected range names where comparison risk lives. The surfaces list tells dashboard owners and review owners where to expect questions. The expected movement separates a controlled correction from a surprise. The sign-off separates engineering validation from business interpretation.

The open follow-up matters too. If the source owner still needs to decide whether this correction should become a permanent freshness check, I would rather say that plainly than let the change log sound more final than the system actually is.
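The expected-movement lines can be checked against the before/after comparison so that a surprise month is held rather than republished. This sketch is illustrative only: the monthly totals are invented, and the tolerances for "slightly" and "close to the prior publish" are assumptions that an owner would need to set.

```python
def movement_ok(before, after, expectation, up_max_pct=3.0, flat_max_pct=0.5):
    """Check one month's movement against its expected direction."""
    pct = (after - before) / before * 100.0
    if expectation == "up":
        return 0.0 < pct <= up_max_pct      # small increase only
    if expectation == "flat":
        return abs(pct) <= flat_max_pct     # stays close to the prior publish
    raise ValueError(f"unknown expectation: {expectation}")


expected = {"2025-02": "up", "2025-03": "up", "2025-04": "flat"}

# Hypothetical shipped-volume totals before and after the backfill.
before = {"2025-02": 10000, "2025-03": 12000, "2025-04": 11000}
after = {"2025-02": 10150, "2025-03": 12240, "2025-04": 11450}

held = [
    month for month, expectation in expected.items()
    if not movement_ok(before[month], after[month], expectation)
]
print(held)  # ['2025-04']
```

In this made-up run, April moves about 4 percent when it was expected to stay flat, so it would be held from republication until someone explains the movement.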

Tradeoffs

Breaks when: every low-risk correction gets the same ceremony → Mitigation: reserve the full change log for published dashboards, recurring reviews, finance-facing exports, or metrics with named owners.
Breaks when: the change log becomes a duplicate validation workbook → Mitigation: summarize expected movement and reference the validation evidence separately instead of listing every row-level difference.
Breaks when: the technical result is valid but the business meaning is still uncertain → Mitigation: publish a temporary interpretation boundary, name the owner, and record the follow-up instead of pretending the uncertainty is gone.
Breaks when: teams mix old exports with republished dashboards without context → Mitigation: declare which source is authoritative after republication and label older artifacts as historical snapshots.

Close

Next step: Before republishing one backfilled metric, write the change-log entry a reader would need when comparing the new dashboard to last week’s deck.

If that note cannot name the affected range, expected movement, validation evidence, owner sign-off, and interpretation boundary, the backfill may be technically done while the trust work is still open.

When an analytics incident needs a postmortem, not just a note

Sat, 09 May 2026 00:00:00 GMT

After the dashboard is fixed, the next decision is whether the incident is actually closed.

The fix alone does not tell me whether the short incident note is enough. Some incidents need one prevention item and a clean close. Others need a full postmortem because the same failure will come back if nobody names the pattern, owner, and follow-up path.

My default is to choose the review level from triggers, not from the temperature of the room. A trigger table keeps the response proportional.

Problem

Analytics incidents usually fail in one of two directions: too much ceremony for a small miss, or too little learning for a repeated trust break.

If every late dashboard, stale extract, or corrected number gets a full postmortem, people stop reading them. The process becomes ceremony, and the useful reviews get buried.

If every incident gets closed as a short note, repeated failures stay invisible. A published number can be corrected twice, each time with a plausible local explanation, while the real prevention work never gets accepted.

I want the decision rule in writing before the next incident. After the fix, everyone is tired, the meeting clock is loud, and the team is already biased toward either moving on or making the review bigger than it needs to be.

Default approach

I start with the incident note, then decide whether the incident needs note-only closure, a lightweight review, or a full postmortem.

The trigger table has to answer the question a lead is actually facing: can we close the note, do we need a short review, or did this incident expose a system weakness that needs a full postmortem?

Analytics incident escalation table

Field                         | Note only                              | Lightweight review                         | Full postmortem
------------------------------|----------------------------------------|--------------------------------------------|----------------
Impact                        | no decision, meeting, export, or KPI trust was affected | one team, dashboard, or scheduled review was delayed or temporarily unsafe | a decision, finance process, customer-facing surface, or executive review was affected
Recurrence                    | first isolated occurrence with understood cause | repeated pattern in one asset or recent near-miss | repeated cross-surface failure or unresolved previous prevention item
Exposure                      | contained inside the analytics team    | visible to one business owner or operating team | visible outside the immediate team or tied to a formal reporting path
Customer/executive visibility  | none                                   | possible if the issue is not handled before the next review | confirmed customer, executive, finance-close, board, or external-reporting visibility
Detection path                | expected alert or check caught it before use | human report found it, or alert lacked enough responder context | users found it, monitoring failed, or detection happened after a bad decision
Unresolved prevention work    | one clear fix and owner                | prevention needs coordination across analytics and one partner | root cause or prevention path is unclear, cross-team, risky, or under-owned
Review level                  | close the short incident note          | schedule a short review with timeline and action list | write a blameless postmortem with timeline, contributing causes, impact, owners, and follow-up links
Owner                         | incident owner closes the note         | analytics lead owns review and action follow-through | incident owner plus accountable analytics, engineering, and business owners
Next action                   | record fix and prevention in the note  | add one or two follow-up tasks with due dates | track postmortem actions until accepted, rejected, or replaced

This is not a scoring system. I do not add up points and pretend the number made the decision.

I use the table to make the judgment visible. If impact was low, detection worked, and prevention is obvious, I keep the incident small. If recurrence, executive exposure, failed detection, or unclear prevention shows up, I escalate before the incident becomes folklore.
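A sketch of how the table can surface a suggestion without turning into a point total: each trigger field is assessed at one of the three levels, and the suggestion is the highest level any single field reaches. The enum, the field keys, and the "highest level wins" rule are assumptions for illustration; the lead still makes the call.

```python
from enum import IntEnum


class ReviewLevel(IntEnum):
    NOTE_ONLY = 1
    LIGHTWEIGHT_REVIEW = 2
    FULL_POSTMORTEM = 3


# Hypothetical keys for the six trigger rows of the table.
TRIGGER_FIELDS = (
    "impact", "recurrence", "exposure",
    "visibility", "detection_path", "unresolved_prevention",
)


def suggested_level(assessment):
    """Suggest the highest level reached by any trigger; nothing is summed."""
    missing = [f for f in TRIGGER_FIELDS if f not in assessment]
    if missing:
        raise ValueError(f"unassessed trigger fields: {missing}")
    return max(assessment[f] for f in TRIGGER_FIELDS)


late_dashboard = {f: ReviewLevel.NOTE_ONLY for f in TRIGGER_FIELDS}
repeated_finance_issue = dict(
    late_dashboard,
    recurrence=ReviewLevel.FULL_POSTMORTEM,
    detection_path=ReviewLevel.FULL_POSTMORTEM,
)
print(suggested_level(late_dashboard).name)          # NOTE_ONLY
print(suggested_level(repeated_finance_issue).name)  # FULL_POSTMORTEM
```

Requiring every field to be assessed is deliberate: an unexamined recurrence or detection row is how a repeated failure keeps looking like a first occurrence.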

Example

Here is the note-only case.

A daily operations dashboard publishes 11 minutes late because an upstream extract lands after its usual window. The dashboard is not used until the afternoon standup. The delay is caught before the meeting. The data publishes cleanly. The prevention item is clear: update the source cutoff note and widen the warning window so the team sees the risk earlier.

That incident deserves a short note, not a postmortem.

Incident: operations dashboard published 11 minutes late
Impact: no meeting or decision used stale data
Recurrence: first isolated delay from this source handoff
Exposure: contained inside analytics
Detection: expected freshness check caught it before use
Prevention: update source cutoff note and warning threshold
Review level: note only
Owner: analytics engineer closes note
Next action: record prevention item and monitor next scheduled run

A full review would not add much. It would spend more attention than the incident earned, and it would teach the wrong habit: every small delay becomes a meeting instead of a clean note with an owner.

Here is the case that should not stay note-only.

A finance-facing revenue number is republished twice in one month after a join change drops a subset of settled refunds. The first incident had a short note and one prevention item. The second reaches the finance close review before the discrepancy is caught, and the corrected number has to be explained to executives. Nobody can say whether the failure was release review, metric ownership, detection, or the prevention item from the first note not being done.

That incident needs a full postmortem.

The trigger is not embarrassment. The trigger is the combination of recurrence, executive exposure, a formal finance process, failed detection, and unresolved prevention work across ownership boundaries.

Incident: finance revenue number republished after settled-refund join issue
Impact: the wrong number reached the finance close review, and the correction had to be explained to executives
Recurrence: second related incident in one month
Exposure: finance-facing dashboard and review packet
Detection: discrepancy found by a person, not by the expected check
Unresolved prevention: unclear whether release review, metric ownership, or join monitoring failed
Review level: full postmortem
Owner: analytics incident owner plus finance analytics owner
Next action: write timeline, name contributing causes, and track accepted prevention work

The document is not the point. The point is deciding whether the incident revealed a system weakness that a short note cannot close and a vague action item will not prevent.

Tradeoffs

Breaks when: every small analytics alert gets the same heavy review → Mitigation: keep note-only and lightweight-review paths legitimate so full postmortems stay worth reading.
Breaks when: the postmortem becomes a search for who caused the wrong number → Mitigation: keep the review blameless and frame the work around missing signals, unclear ownership, weak release checks, and follow-up the team can actually change.
Breaks when: repeated small incidents keep looking harmless in isolation → Mitigation: include recurrence and unresolved prior action items in the trigger table so patterns can roll up before trust erodes.
Breaks when: the team writes a good review but nobody accepts the prevention work → Mitigation: attach every review level to an owner, a next action, and a backlog path where the action can be accepted, rejected, or replaced.

Close

The smallest useful version is a trigger table the team agrees to before the next incident.

It does not need to be perfect. It needs to answer one question clearly: when is a short note enough, and when would moving on leave the same trust break waiting for the next review?

Next step: Pick one recent analytics incident and classify it as one of three levels: note-only, lightweight review, or full postmortem.

If the answer is hard to defend, the table is the work to do before the next incident forces that decision under pressure.

The first-response runbook I want behind every analytics SLA alert

Tue, 05 May 2026 00:00:00 GMT

At 07:34 ET, a vague SLA alert is already late.

If the 08:00 ET operations review depends on the dashboard, I do not want the alert to say only that a job missed its schedule. I want the first responder to know whether the output is safe, where the run last succeeded, which boundary failed, who owns the next check, and what message should go to the people waiting on the number.

The runbook is what makes the alert usable.

Problem

Analytics SLA alerts often create urgency without giving the responder a useful first move.

A scheduler can say the daily KPI refresh missed the 07:30 ET cutoff. That matters, but it does not answer the questions that decide the first response: Did the source land? Did yesterday’s dashboard stay published? Did the failure happen in the transform or the publish step? Is the 08:00 ET review unsafe, or can the team use the last settled snapshot?

When those answers are missing, the first ten minutes become tool-hopping. One person opens the orchestrator. Another checks the dashboard. Someone asks whether stakeholders should wait. The alert technically worked, but the response started from scratch.

For decision-critical analytics, I want the runbook card attached before the alert fires.

Default approach

My default is to keep one short runbook card behind each SLA family that can interrupt a real decision.

The card has to answer a narrow set of questions:

What business cutoff makes this alert matter?
Which dashboard, export, model, or review is affected?
What was the last successful run or publish?
Which step failed or became late?
Is the upstream source fresh enough to trust?
Does the published output still have a normal shape if a partial publish exists?
Who is the first responder, who is the backup, and when does escalation start?
What should stakeholders hear before the cause is fully known?

I do not need a long wiki page at this moment. I need a card that changes the first action.

If the runbook cannot say what to check first, the alert is not ready to page someone. It may still be a warning, a dashboard marker, or a backlog item. Paging should be reserved for failures that can affect a business cutoff and have a response path attached.
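A minimal sketch of that paging rule, assuming the card is stored as a small record. The field names and the alert_severity helper are hypothetical and do not come from any alerting tool; the only point is that a blank first check keeps the alert at warning level.

```python
from dataclasses import dataclass, fields


@dataclass
class RunbookCard:
    # Hypothetical field names that mirror the questions the card must answer.
    business_cutoff: str = ""
    affected_output: str = ""
    first_check: str = ""
    first_responder: str = ""
    backup_responder: str = ""
    escalation_rule: str = ""
    stakeholder_template: str = ""


def missing_fields(card: RunbookCard) -> list[str]:
    """Return the names of card fields that are still blank."""
    return [f.name for f in fields(card) if not getattr(card, f.name).strip()]


def alert_severity(card: RunbookCard) -> str:
    """Page only when every card field is filled; otherwise stay a warning."""
    return "warning" if missing_fields(card) else "page"


card = RunbookCard(
    business_cutoff="08:00 ET operations review",
    affected_output="Daily KPI dashboard",
    first_responder="analytics engineering on-call",
)
print(alert_severity(card))  # warning
print(missing_fields(card))
```

The card above names a cutoff and a responder but no first check, so it would not page yet.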

Example

For this case, the daily KPI dashboard needs to publish by 07:30 ET for an 08:00 ET operations review.

At 07:34 ET, the alert fires. This is the card I want behind it:

SLA alert runbook card
Alert: daily KPI dashboard refresh missed 07:30 ET publish cutoff
Business cutoff: 08:00 ET operations review
Affected output: Daily KPI dashboard / executive summary tiles
Decision exposure: review should not use today's dashboard until publish is confirmed

Last successful run: 2025-03-03 07:18 ET
Current run: 2025-03-04 started 06:42 ET; failed at 07:24 ET
Failing step: transform_daily_kpi / publish_mart_daily_kpi
First responder: analytics engineering on-call
Backup / escalation: data platform owner if the source is still late at 07:40 ET; business owner if the dashboard is still unsafe at 07:50 ET

First five checks
1. Source freshness: finance_extract landed by 06:10 ET? latest observed timestamp?
2. Failed step: inspect the error on the failing step (transform_daily_kpi or publish_mart_daily_kpi) and any recent code/config change.
3. Last publish: confirm latest successful published partition and dashboard cache timestamp.
4. Output shape: compare row count, null-rate, and one KPI total against recent same-weekday band if a partial publish exists.
5. Recovery choice: rerun, hold yesterday's snapshot, or mark dashboard unsafe for the 08:00 ET review.

Stakeholder message template
Status: daily KPI dashboard is <safe/unsafe/under review> for the 08:00 ET operations review.
Evidence: last successful publish is <timestamp>; current run failed at <step>; source freshness is <ok/late/unknown>.
Next action: <rerun/hold yesterday's snapshot/investigate failed step>.
Next update: <time> from <responder>.

The first useful line is the business cutoff. Without it, the responder cannot tell whether this is a page, a warning, or a note for later cleanup.

The next useful line is the last successful run. If yesterday’s dashboard is still published and clearly labeled, the team may have a safe fallback for the review. If the latest publish is partial, stale, or cached in a confusing state, the responder should say that early instead of letting people assume the dashboard is current.

The failing step matters because it prevents the first responder from starting at the wrong layer. If the source extract is late, the next check is upstream freshness and escalation. If the source landed and the transform failed, the next check is the model error, recent change, and rerun path. If the transform finished but the dashboard cache did not refresh, the response is different again.
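The layer-by-layer reasoning above can be sketched as a small function. This is an illustration under my own assumptions: the three signals are hypothetical inputs, None means the signal is unknown, and the wording for the cache case is mine because that response depends on the BI tool.

```python
from typing import Optional


def next_check(source_landed: Optional[bool],
               transform_succeeded: Optional[bool],
               cache_refreshed: Optional[bool]) -> str:
    """Return the first layer to inspect. None means the signal is unknown."""
    # An unknown signal is treated like a failed one: check that layer first.
    if source_landed is not True:
        return "upstream freshness and escalation"
    if transform_succeeded is not True:
        return "model error, recent change, and rerun path"
    if cache_refreshed is not True:
        return "dashboard cache refresh and published timestamp"
    return "no failing layer found: confirm last publish and output shape"


print(next_check(True, False, None))  # model error, recent change, and rerun path
```

Walking the layers in order means the responder never debugs the transform while the source is still missing.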

The stakeholder template is part of the runbook because communication is not separate from recovery. At 07:45 ET, a calm update is more useful than a perfect root cause that arrives after the review starts.

The first update can stay this small:

Status: daily KPI dashboard is under review for the 08:00 ET operations review.
Evidence: last successful publish is 2025-03-03 07:18 ET; current run failed at publish_mart_daily_kpi; source freshness is ok.
Next action: analytics engineering is rerunning the failed publish step and checking output shape before clearing the dashboard.
Next update: 07:50 ET from analytics engineering on-call.

That message does not pretend to know the cause. It tells people what is safe, what evidence exists, what happens next, and when they will hear again.
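If the template lives in code, a small renderer can enforce the allowed states. This is a sketch, not a real notification integration: render_update and its parameters are hypothetical names, and the allowed values are copied from the template on the card.

```python
ALLOWED_STATUS = ("safe", "unsafe", "under review")
ALLOWED_FRESHNESS = ("ok", "late", "unknown")


def render_update(status, last_publish, failed_step, freshness,
                  next_action, next_update, responder):
    """Fill the stakeholder template; reject states the card does not define."""
    if status not in ALLOWED_STATUS:
        raise ValueError(f"status must be one of {ALLOWED_STATUS}")
    if freshness not in ALLOWED_FRESHNESS:
        raise ValueError(f"freshness must be one of {ALLOWED_FRESHNESS}")
    if not next_update or not responder:
        raise ValueError("every update needs a next update time and a responder")
    return "\n".join([
        f"Status: daily KPI dashboard is {status} for the 08:00 ET operations review.",
        f"Evidence: last successful publish is {last_publish}; "
        f"current run failed at {failed_step}; source freshness is {freshness}.",
        f"Next action: {next_action}.",
        f"Next update: {next_update} from {responder}.",
    ])


message = render_update(
    status="under review",
    last_publish="2025-03-03 07:18 ET",
    failed_step="publish_mart_daily_kpi",
    freshness="ok",
    next_action="rerun the failed publish step and check output shape",
    next_update="07:50 ET",
    responder="analytics engineering on-call",
)
print(message)
```

The renderer accepts unknown for freshness on purpose, and it refuses to send an update that has no next update time.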

Tradeoffs

Breaks when: one generic runbook tries to cover every pipeline, dashboard, and alert → Mitigation: keep one card per critical output or SLA family so the cutoff, affected surface, owner, and first checks are specific.
Breaks when: every warning uses paging severity → Mitigation: page only when the failure can affect a decision, review, export, or dashboard promise; keep lower-severity warnings visible without waking the same responder.
Breaks when: the stakeholder template hides uncertainty → Mitigation: allow unknown as a real state and require the next check plus next update time instead of forcing a fake root cause.
Breaks when: the card names source freshness or output-shape checks that nobody maintains → Mitigation: mark missing signals honestly and assign the owner before treating the alert as production-ready.

Close

Next step: Pick one SLA alert that could interrupt a real review and write the card behind it: cutoff, affected output, last success, failed step, source freshness, responder, escalation, and stakeholder message.

A page that still sends the responder to three tools before they can say whether the dashboard is safe is not finished yet.

The schema-change checklist I use before a source breaks downstream models

Sat, 02 May 2026 00:00:00 GMT

Schema changes get expensive when they look harmless at ingest time.

The load can finish, the row count can land, and the first visible cost can be downstream: a join starts dropping customers, a timestamp watermark skips records, or a dashboard keeps using a field whose meaning moved. I do not want the first serious schema-change review to happen after the model is wrong.

Before promotion, I classify the change, name the downstream surface, and choose the response on purpose: absorb it, warn consumers, quarantine records, block promotion, coordinate a migration, or run compatibility in parallel.

Problem

Schema changes are easy to underreact to when the pipeline stays green.

An additive field looks harmless until an analyst exposes it before the business meaning is agreed. A rename looks simple until it breaks a watermark or dashboard filter. A type change looks like an implementation detail until a staging join silently casts away the value downstream models depend on.

The cost is not only the broken run. It is the uncertainty after the run: which models used the field, which dashboard consumed the model, who owns the response, and why the team let that shape move forward without a decision.

This is different from the initial source agreement I want in a six-part data contract for a source table. That contract says what the source is supposed to mean. This checklist is what I use when the source moves anyway.

Default approach

Capture the observed change: source, field, old shape, new shape, and when the change appeared.
Classify the change before reacting: additive field, rename, type change, dropped field, or semantic change.
Check downstream use across ingestion, staging models, marts, dashboards, metric definitions, extracts, and known consumers.
Name the owner for the response. The upstream owner, transformation owner, and business-facing consumer owner may not be the same person.
Choose the promotion decision explicitly: absorb, warn, quarantine, block promotion, coordinate migration, or run compatibility in parallel.
Record the evidence next to the pull request, incident note, release comment, validation run, or source-contract update.

The classification matters because not every change deserves the same gate. Unused additive fields can often move into raw or staging with a warning. Key type changes, watermark renames, dropped business fields, and semantic shifts usually need a stronger stop.

That is also why row counts are not enough. A schema change can keep the same number of rows while changing join behavior, null behavior, or metric meaning. I still want the output checks described in Row counts are not enough: the checks I add before I trust a pipeline, but I do not want those checks to be the first place the structural change is understood.
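The classification step can be sketched as a comparison of two field-to-type snapshots. Everything here is an assumption for illustration: the classify_changes helper, the snapshot format, the hypothetical order fields, and the starting gate per class, which only restates the rule of thumb above and still has to be adjusted after the downstream-use check.

```python
from typing import Optional

# Starting gate per class. These defaults are an assumption; the
# downstream-use check can still raise or lower the gate.
DEFAULT_GATE = {
    "additive field": "absorb with warning",
    "rename": "coordinate migration",
    "type change": "block promotion",
    "dropped field": "block promotion",
}


def classify_changes(old: dict, new: dict, renames: Optional[dict] = None) -> list:
    """Compare two {field: type} snapshots and classify each difference.

    `renames` maps old names to new names and must come from a person or the
    source owner: by shape alone a rename looks like a drop plus an add.
    Semantic changes are invisible here and still need an owner's review.
    """
    renames = renames or {}
    changes = []
    renamed_old, renamed_new = set(), set()
    for old_name, new_name in renames.items():
        if old_name in old and old_name not in new and new_name in new:
            changes.append((f"{old_name} -> {new_name}", "rename"))
            renamed_old.add(old_name)
            renamed_new.add(new_name)
    for name in sorted(old.keys() - new.keys() - renamed_old):
        changes.append((name, "dropped field"))
    for name in sorted(new.keys() - old.keys() - renamed_new):
        changes.append((name, "additive field"))
    for name in sorted(old.keys() & new.keys()):
        if old[name] != new[name]:
            changes.append((name, "type change"))
    return [
        {"field": f, "classification": c, "starting_gate": DEFAULT_GATE[c]}
        for f, c in changes
    ]


# Hypothetical snapshots for an orders source.
old_shape = {"order_id": "integer", "placed_at": "timestamp"}
new_shape = {"order_id": "string", "placed_ts": "timestamp", "channel": "string"}
for change in classify_changes(old_shape, new_shape, {"placed_at": "placed_ts"}):
    print(change)
```

The sketch deliberately has no semantic-change branch. A field that keeps its name and type while its meaning moves will pass this comparison untouched, which is why the owner step stays on the checklist.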

Example

Imagine the customer API changes during a normal validation run.

Three things happen at once: customer_id changes from a numeric identifier to a string identifier, customer_segment appears as a new nullable field, and created_at is renamed to created_timestamp. Ingestion still lands records, but the change touches the raw load, the staging model, the customer mart, and an executive dashboard filter.

This is the checklist I want before promotion.

Schema-change classification checklist
Source: customer API / customers endpoint
Observed on: 2025-02-11 validation run
Owner recording decision: analytics engineering

Change 1
field: customer_id
old shape: integer-like numeric identifier
new shape: string identifier
classification: type change
critical downstream use: staging joins, customer mart key, dashboard drill-through URL
first risk: silent cast or broken join if downstream expects numeric
owner: data engineering owns ingest; analytics engineering owns staging model
promotion decision: block promotion until raw string is preserved, staging key cast is explicit, and join tests pass
follow-up evidence: source contract note, staging PR, validation run, join/null test output

Change 2
field: customer_segment
old shape: not present
new shape: nullable string
classification: additive field
critical downstream use: none yet; requested by lifecycle reporting later
first risk: low for current dashboards, medium if analysts use it before the definition is certified
owner: lifecycle analytics owner confirms definition before mart exposure
promotion decision: absorb at raw/staging layer, warn that it is not business-certified yet
follow-up evidence: release comment says the field is staged but not curated

Change 3
field: created_at -> created_timestamp
old shape: created_at timestamp string
new shape: created_timestamp timestamp string
classification: rename
critical downstream use: incremental load watermark, staging model, cohort dashboard
first risk: latest records stop loading or cohorts shift if the old field name is assumed
owner: data engineering confirms source change; analytics engineering updates staging compatibility
promotion decision: coordinate migration with temporary compatibility alias and alert on old-field disappearance
follow-up evidence: PR includes alias removal date, dashboard validation slice, and owner sign-off

Checklist fields to preserve for every schema change
- source and observed date
- field name or semantic rule
- old shape and new shape
- classification: additive field, rename, type change, dropped field, semantic change
- downstream impact: models, dashboards, metrics, exports, or consumers
- owner: upstream, transformation, and business-facing owner where relevant
- promotion decision: absorb, warn, quarantine, block promotion, coordinate migration, or run parallel compatibility
- follow-up evidence: PR, source contract update, validation run, release note, or incident note

The three changes should not get one blanket decision.

For customer_id, I block promotion until the raw string is preserved and the staging model makes the cast explicit. A key is not a cosmetic field. If the downstream mart expects numeric IDs, a quiet cast can create join misses that look like customer churn or dashboard drill-through defects.
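A small illustration of why the key change is dangerous even when row counts hold. The rows are hypothetical, and plain Python stands in for the warehouse: Python treats "101" and 101 as unequal, while a real engine may coerce, error, or miss, so read this as the shape of the failure and not as any engine's behavior.

```python
def left_join_match_flags(orders, customers, key="customer_id"):
    """Left join orders to customers; keep every order row and flag matches."""
    customer_keys = {row[key] for row in customers}
    return [{**order, "matched": order[key] in customer_keys} for order in orders]


# Hypothetical rows: the mart still keys customers by integer,
# while the changed source now sends string identifiers.
customers = [{"customer_id": 101}, {"customer_id": 102}]
orders = [
    {"customer_id": "101", "amount": 40},
    {"customer_id": "102", "amount": 60},
]

implicit = left_join_match_flags(orders, customers)

# Explicit staging cast: compare both sides as strings.
customers_cast = [{**c, "customer_id": str(c["customer_id"])} for c in customers]
explicit = left_join_match_flags(orders, customers_cast)

print(len(implicit), sum(r["matched"] for r in implicit))  # 2 0
print(len(explicit), sum(r["matched"] for r in explicit))  # 2 2
```

Both joins return two rows. Only the match count shows that the first one lost every customer, which is the join/null test output I want attached before promotion.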

For customer_segment, I can absorb the field earlier because no current dashboard depends on it. But I still warn consumers that it is not business-certified. The field should not appear in the curated mart until someone owns allowed values, null behavior, and the difference between source-system labels and reporting labels.

For created_at, I prefer a compatibility window. The staging model can expose the old alias briefly while the source owner confirms the rename and the analytics owner updates the watermark, cohort model, and dashboard slice. The important part is that the alias has an owner and removal date. Otherwise the compatibility layer becomes a permanent hiding place for unfinished migration work.
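One way to keep the alias honest is a check that fails when the owner or removal date is missing, or when the date has passed. This is a sketch with hypothetical names (CompatibilityAlias, alias_problems) and a hypothetical removal date; where it runs, in CI or a scheduled job, is left open.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class CompatibilityAlias:
    old_name: str
    new_name: str
    owner: str
    removal_date: Optional[date]


def alias_problems(alias: CompatibilityAlias, today: date) -> list[str]:
    """List the reasons this alias should not ship or should already be gone."""
    problems = []
    if not alias.owner.strip():
        problems.append("alias has no owner")
    if alias.removal_date is None:
        problems.append("alias has no removal date")
    elif today > alias.removal_date:
        problems.append("alias is past its removal date")
    return problems


alias = CompatibilityAlias(
    old_name="created_at",
    new_name="created_timestamp",
    owner="analytics engineering",
    removal_date=date(2025, 3, 11),  # hypothetical date for illustration
)
print(alias_problems(alias, today=date(2025, 2, 20)))  # []
print(alias_problems(alias, today=date(2025, 4, 1)))   # ['alias is past its removal date']
```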

Dropped fields and semantic changes fit the same checklist even though this API example does not include them. If sales_region disappears, I want to know which model, metric, export, or owner loses required meaning before anyone fills a fake default. If customer_segment keeps the same name but changes from lifecycle segment to marketing segment, I treat it as a semantic change and require the same promotion decision I would use for a visible schema break.

Automated lineage and runtime observability can speed up the investigation. I still want a human-readable decision record, especially when lineage misses spreadsheets, dashboard extracts, or owner-maintained consumer lists. If the issue is already late, stale, or failed at runtime, pipeline observability signals help the first responder. This checklist belongs one step earlier: before the team decides that the changed source is safe to promote.

Tradeoffs

Breaks when: every additive field is treated as a release blocker → Mitigation: allow unused additive fields into raw and staging with a warning, but hold curated exposure until an owner confirms meaning, allowed values, and null behavior.
Breaks when: a harmless widening is grouped with destructive type changes → Mitigation: distinguish widening from incompatible casts, preserve raw values, and run join/null checks before promotion.
Breaks when: lineage misses spreadsheet exports, dashboard extracts, or manually maintained reports → Mitigation: pair automated lineage with a known-consumer list for business-critical models.
Breaks when: compatibility aliases for renames become permanent → Mitigation: record the alias removal date, owner, and validation slice before the compatibility layer ships.
Breaks when: a dropped field removes required business meaning and the upstream owner cannot restore it quickly → Mitigation: record degraded mode, consumer warning, and migration owner instead of silently filling fake defaults.

Close

Next step: Pick one business-critical source that changed recently and classify the change before the next promotion: old shape, new shape, downstream use, owner, promotion decision, and follow-up evidence.

If you want to compare notes on schema-change triage, I am most interested in the field that looks low-risk and the model that would prove otherwise.

The evidence packet I want before an analytics release is approved

Tue, 28 Apr 2026 00:00:00 GMT

A release can pass every automated check and still leave the approval hard to defend.

That is the analytics release failure I care about here: the code merged, the pipeline stayed green, the dashboard changed, and three weeks later nobody can reconstruct why the release was considered safe. The evidence exists somewhere, but it is split across pull request comments, CI logs, screenshots, owner messages, and a rollback note that never made it into the same record.

For finance-facing or operations-facing analytics releases, I want one small packet before promotion. Not a ceremony. A packet.

Problem

The approval moment is often thinner than the change deserves.

A pull request says tests passed. A dashboard screenshot sits in a thread. A finance owner writes “looks good” after checking one slice. The rollback path is known by the engineer who shipped it. Each piece may be reasonable on its own, but the release decision is still fragile because the pieces are not tied together.

The cost shows up later. When a finance export, executive dashboard, or operations metric is questioned, the team has to do archaeology before it can answer the actual question: did the live number behave the way the release note said it would?

That is why my default is to approve decision-critical analytics changes from an evidence packet, not from a green badge alone.

Default approach

Name the release boundary: pull request, changed models, semantic definitions, dashboard cards, exports, and downstream report surfaces.
Attach validation evidence that explains the business-facing change, not just the automation status.
Separate engineering approval from owner approval. The delivery path and the visible metric interpretation are related, but they are not the same decision.
Write the stakeholder-facing release comment before promotion: what should move, what should not move, and where readers should look if they compare against an older deck or export.
Keep the rollback or recovery path boring: revert reference, validation rerun, first responder, and owner for stakeholder clarification.
Link to durable evidence instead of pasting every log line into the packet.

This is the cross-layer version of release control. A dashboard-specific checklist still matters, and so does a dbt deployment record or Azure DevOps check record. The packet is where I make those pieces answer one approval question.

Example: the packet I want in the release comment

Imagine a release that changes a finance-facing revenue dashboard. The code diff is small, but the release touches one dbt model, one semantic definition, and one dashboard card.

That is enough surface area for the approval reason to get lost.

Analytics release evidence packet
Release: revenue dashboard net revenue definition update
PR link: repository pull request #418
Changed assets:
- dbt model: mart_finance_revenue
- semantic definition: net_revenue excludes refunded order lines after settlement
- dashboard card: executive revenue scorecard / Net revenue by week

Validation evidence:
- CI: lint, unit tests, dbt compile, and changed-model build passed
- dbt comparison: mart_finance_revenue row count unchanged for validation slice
- metric comparison: net revenue moved -0.42% for 2025-01-06 to 2025-01-19, expected from settled refunds
- dashboard screenshot diff: only Net revenue by week and Revenue mix cards changed
- owner review: finance analytics lead approved the validation slice and release note

Release comment:
- visible change: net revenue may decrease slightly for weeks with settled refunds
- not changing: booked gross revenue, customer count, and order volume cards
- stakeholder note: finance review should use the release comment if January revenue is compared to last week's deck

Rollback / recovery:
- rollback reference: revert PR #418 and restore previous semantic definition artifact
- recovery check: rerun dashboard validation slice after rollback
- first responder: analytics engineering owns rollback; finance analytics lead owns stakeholder clarification

The packet stays short because it has one job: tie the approval to durable evidence.

The pull request and CI run still hold the delivery evidence. The dashboard screenshot still proves the visible output. The owner approval still belongs to the person accountable for the finance interpretation. The packet ties those records together so the approval can be reconstructed without searching five tools.

The line I look for first is not the test result. It is the expected visible movement.

If net revenue moves by about 0.4 percent for the validation slice and the owner agrees that the movement comes from settled refunds, I have a release decision I can defend. If the same release has green tests but nobody wrote down what movement to expect, I still do not have enough evidence.
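That approval rule can be sketched as a comparison between the observed movement and the movement the release note predicted. The function names, the expected value of -0.40, and the 0.10-point tolerance are assumptions for illustration; only the -0.42% observation comes from the packet.

```python
from typing import Optional


def movement_pct(before: float, after: float) -> float:
    """Percent change of a metric total between two builds of the same slice."""
    if before == 0:
        raise ValueError("percent movement is undefined for a zero baseline")
    return (after - before) / before * 100.0


def release_decision(observed_pct: float,
                     expected_pct: Optional[float],
                     tolerance_pct: float,
                     owner_confirmed_cause: bool) -> str:
    """Green tests are not an input here; expected movement and the owner are."""
    if expected_pct is None:
        return "hold: no expected movement recorded"
    if abs(observed_pct - expected_pct) > tolerance_pct:
        return "hold: movement outside the expected band"
    if not owner_confirmed_cause:
        return "hold: owner has not confirmed the cause"
    return "approve"


print(release_decision(-0.42, -0.40, 0.10, True))  # approve
print(release_decision(-0.42, None, 0.10, True))   # hold: no expected movement recorded
```

The order of the checks is the design choice: a missing expectation stops the release before anyone argues about whether the observed number looks reasonable.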

This is where the packet differs from a BI-only checklist. For dashboard-specific output review, I still want a dashboard release checklist before a BI change goes live. For model promotion, I still want a boring dbt deployment record. For delivery-system enforcement, I still want Azure DevOps checks before analytics code reaches production. The release packet does not replace those artifacts. It collects the approval evidence that crosses them.

Tradeoffs

Breaks when: every low-risk copy edit, label change, or exploratory dashboard tweak gets the full packet → Mitigation: reserve the strict packet for metric definitions, executive dashboards, finance-facing marts, and decision-critical semantic changes.
Breaks when: the packet turns into a stale template nobody reads → Mitigation: keep only evidence someone would need during rollback, dispute, or stakeholder explanation.
Breaks when: links point to expiring CI output, private chat threads, or screenshots nobody can find later → Mitigation: keep the approval record in the pull request, release note, or repository-backed artifact, and link only evidence that will still be available when the number is questioned.
Breaks when: the change cannot be cleanly rolled back because it includes a source correction or historical backfill → Mitigation: record the recovery path instead: affected surfaces, restatement window, validation rerun, communication owner, and first responder.

Close

Next step: Pick one decision-critical analytics release and write the packet you would want to read three weeks later: changed assets, validation evidence, owner approval, stakeholder note, and rollback or recovery path.

The approval is safer when the evidence can outlive the Slack thread around the release.

The metadata-driven pipeline decisions I would revisit before moving ADF patterns into Fabric Data Factory

Mon, 27 Apr 2026 00:00:00 GMT

The expensive version of an ADF-to-Fabric migration is the one where the control table keeps working just long enough to hide what changed.

A metadata-driven ADF framework can reduce duplicated pipelines. It can also turn source selection, connection behavior, dataset shape, schedules, validation, and ownership into a second programming language. When that happens, the migration plan starts preserving the abstraction before the team has decided whether the abstraction is still honest.

My thesis is simple: keep metadata for durable operating decisions, and delete or make explicit the ADF-specific indirection before rebuilding the pattern in Fabric Data Factory.

Problem

The migration trap is treating a metadata-driven ADF framework as one reusable asset.

In ADF, a team may have used one control table to drive linked services, parameterized datasets, target paths, table names, watermarks, schedules, and validation rules through a reusable copy pipeline. That pattern can be useful. It keeps repeated pipeline objects under control.

It also makes some decisions too easy to miss. If server_name, database_name, dataset_name, and schedule_name are just fields in a row, the team may forget that Fabric Data Factory does not model those pieces the same way.

As of my 2026-04-27 source check, Microsoft’s migration planning guidance says Fabric Data Factory defines dataset properties inline within activities, replaces linked services with Fabric connections, uses variable libraries instead of ADF global parameters, and handles scheduling differently from ADF. The same guidance says manual migration is necessary for complex environments and low-parity patterns (Microsoft Learn, migration planning, last updated 2026-04-11).

That does not make metadata-driven pipelines bad. It means I would stop asking, “Can we migrate the framework?” and start asking, “Which fields still describe how we operate this data path?”

Default approach

My default is to turn one old ADF control-table row into a migration decision record before rebuilding anything in Fabric.

Inventory each field by job: source identity, connection behavior, dataset or path behavior, load pattern, target shape, schedule, validation, ownership, and migration status.
Keep metadata that the team still needs to review deliberately: source identity, load pattern, watermark policy, validation evidence, owner, escalation path, and migration status.
Move platform-specific choices into Fabric-owned artifacts or documented conventions: connections, inline activity settings, variable libraries, workspace rules, and deployment mappings.
Choose the target shape before preserving the orchestration pattern. The path might land in a Lakehouse table, feed a Warehouse table, use Copy job, call a notebook for merge behavior, or stay in ADF for now.
Rebuild schedules and validation as production controls, not as leftovers from the old trigger framework.
Give each pipeline a migration status such as migrate, needs-review, redesign-required, keep-mounted-adf-for-now, or delete-abstraction.

The source notes matter because this is platform behavior, not a timeless design law. As of 2026-04-27, Microsoft’s ADF upgrade page categorizes pipelines as Ready, Needs review, Coming soon, or Not compatible. It also says dynamic linked services, highly dynamic linked-service patterns, and dataset-driven metadata patterns cannot migrate as-is through the UX-based path (Microsoft Learn, upgrade ADF pipelines to Fabric, last updated 2026-04-21). The comparison page frames Fabric as connection-based and dataset-free, with data properties defined inline in activities (Microsoft Learn, ADF/Fabric differences, last updated 2026-03-31).

So the decision is not “metadata or no metadata.” The decision is where the metadata earns its keep.

Example

Here is the kind of ADF control-table row I would not migrate blindly.

ADF control-table row before migration
source_system: erp_sql
linked_service_name: ls_sql_dynamic
server_name: erp-prod-sql-01
database_name: operations
schema_name: dbo
table_name: PurchaseOrders
dataset_name: ds_sql_table_dynamic
target_path: raw/erp/purchase_orders/
watermark_column: LastModifiedUtc
load_type: incremental
schedule_name: daily_0500_eastern
validation_rule: row_count_plus_watermark
owner_team: analytics_platform

That row mixes four different jobs.

First, it names the data path: source_system, schema_name, table_name, and watermark_column tell me what source is moving and how change is detected. Those fields still belong in a migration discussion.

Second, it hides connection behavior: linked_service_name, server_name, and database_name may have been useful ADF indirection, but I would not carry them forward as runtime-switchable strings unless the team can explain the operational reason. In Fabric, I want the governed connection reference to be visible.

Third, it blurs target shape: target_path says where raw data landed in the old pattern, but it does not answer whether the Fabric target is a Lakehouse table, a Warehouse table, both, or neither.

Fourth, it under-specifies production ownership. validation_rule and owner_team are a start, but I also want the failed-run owner, source-contract owner, reporting sign-off owner, rerun note, and escalation path. Migration is where I would make those fields boring and explicit.
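The four jobs can be made visible by grouping the row's fields. This is a sketch: the FIELD_JOBS mapping only encodes the assignments described above, and any field the paragraphs do not place lands in an unassigned bucket instead of being guessed.

```python
# Assignments taken from the four paragraphs above; nothing else is inferred.
FIELD_JOBS = {
    "source_system": "data path",
    "schema_name": "data path",
    "table_name": "data path",
    "watermark_column": "data path",
    "linked_service_name": "connection behavior",
    "server_name": "connection behavior",
    "database_name": "connection behavior",
    "target_path": "target shape",
    "validation_rule": "production ownership",
    "owner_team": "production ownership",
}

control_row = {
    "source_system": "erp_sql",
    "linked_service_name": "ls_sql_dynamic",
    "server_name": "erp-prod-sql-01",
    "database_name": "operations",
    "schema_name": "dbo",
    "table_name": "PurchaseOrders",
    "dataset_name": "ds_sql_table_dynamic",
    "target_path": "raw/erp/purchase_orders/",
    "watermark_column": "LastModifiedUtc",
    "load_type": "incremental",
    "schedule_name": "daily_0500_eastern",
    "validation_rule": "row_count_plus_watermark",
    "owner_team": "analytics_platform",
}


def group_fields_by_job(row: dict) -> dict:
    """Group field names by job, keeping the row's original field order."""
    groups: dict = {}
    for field in row:
        groups.setdefault(FIELD_JOBS.get(field, "unassigned"), []).append(field)
    return groups


for job, names in group_fields_by_job(control_row).items():
    print(f"{job}: {', '.join(names)}")
```

The unassigned bucket holds dataset_name, load_type, and schedule_name. Those are the fields I would walk through with the team first, because nobody has yet said which job they do.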

As of 2026-04-27, Microsoft’s global-parameter migration guide says ADF global parameters need manual steps, expression references must be updated, and workspace variables should not be overloaded with run-time values (Microsoft Learn, global parameters to variable libraries, last updated 2026-04-11). That is the kind of boundary I want the record to expose, not hide.

The Fabric-ready record is less clever and more operational.

Fabric migration decision record
source_identity: erp_sql / PurchaseOrders
fabric_connection_reference: ops-sql-prod-connection
connection_strategy: explicit connection per governed source, not dynamic server/database strings
load_pattern: incremental copy to Lakehouse staging, with notebook or Warehouse merge only where needed
target_shape: Bronze Lakehouse table -> curated Warehouse table if reporting needs SQL serving
watermark_policy: LastModifiedUtc plus replay/backfill note
schedule_policy: pipeline-local schedule; no reusable central trigger assumption
validation_evidence: row count, max watermark, schema drift check, failed-run owner, rerun note
owner: analytics_platform owns orchestration; source owner owns source contract; reporting owner signs off curated output
escalation_path: source connectivity -> platform; source data change -> source owner; reporting mismatch -> analytics owner
migration_status: redesign-required because dynamic linked-service and dataset-driven pattern cannot migrate as-is

The new record does not pretend every old field deserves a new home.

It keeps source_identity, load_pattern, watermark_policy, validation_evidence, owner, and escalation_path because those decisions survive the platform move. It moves connection choice into an explicit governed Fabric connection reference. It makes target shape a first-class decision instead of burying the answer in a path string. It records schedule policy and migration status so no one confuses a successful assessment with production readiness.

That last point matters. As of my 2026-04-27 source check, Microsoft’s upgrade guidance says post-migration work includes validating connections, re-enabling and configuring triggers, running end-to-end tests, and validating in nonproduction before production cutover (Microsoft Learn, upgrade ADF pipelines to Fabric, last updated 2026-04-21). Microsoft’s global-parameter guide separately says ADF global parameters are not automatically migrated and need deliberate re-authoring into variable libraries (Microsoft Learn, global parameters to variable libraries, last updated 2026-04-11). I would rather keep those checks on the decision record than treat migration as proof that the pipeline is safe.
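A sketch of how those checks could sit on the decision record as a cutover gate. The check labels are my own shorthand for the post-migration steps listed above, not names from Microsoft's documentation, and the required record fields are the ones I insist on before trusting a migration plan.

```python
REQUIRED_RECORD_FIELDS = (
    "owner",
    "validation_evidence",
    "target_shape",
    "migration_status",
)

# Hypothetical labels for the post-migration steps named above.
POST_MIGRATION_CHECKS = (
    "connections_validated",
    "triggers_reenabled_and_configured",
    "end_to_end_tests_run",
    "validated_in_nonproduction",
)


def cutover_blockers(record: dict, checks_done: set) -> list:
    """Return everything still missing before production cutover."""
    blockers = [
        f"record missing {name}"
        for name in REQUIRED_RECORD_FIELDS
        if not record.get(name)
    ]
    blockers += [
        f"check not done: {name}"
        for name in POST_MIGRATION_CHECKS
        if name not in checks_done
    ]
    return blockers


record = {
    "owner": "analytics_platform",
    "validation_evidence": "row count, max watermark, schema drift check",
    "target_shape": "Bronze Lakehouse table",
    "migration_status": "redesign-required",
}
print(cutover_blockers(record, {"connections_validated"}))
```

A successful assessment changes none of these answers, which is the point: the gate reads the record and the completed checks, not the migration tool's status.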

Tradeoffs

Default: Keep a control table for source identity, load pattern, watermark policy, validation evidence, owner, escalation path, and migration status. Breaks when: the table starts storing every platform-specific knob. Mitigation: move connection, schedule, deployment, and target-shape conventions into Fabric-owned artifacts or documented workspace rules.
Default: Use explicit Fabric connections for governed sources. Breaks when: the old ADF pattern relied on parameterized linked services to swap server, database, or authentication context at runtime. Mitigation: create separate governed connection references and keep the source-selection decision visible in metadata instead of hiding it inside dynamic connection strings.
Default: Define file, table, schema, and copy settings close to the Fabric activity that uses them. Breaks when: the team tries to recreate ADF reusable dataset objects as a parallel metadata language. Mitigation: keep reusable naming conventions, but let Fabric inline properties and activities own the concrete shape.
Default: Use variable libraries for environment constants and pipeline parameters for run-specific values. Breaks when: old global parameters are treated as an automatic migration. Mitigation: record each global parameter, decide whether it is an environment constant or a runtime decision, and rewrite references deliberately.
Default: Keep scheduling boring and pipeline-local unless there is a strong reason to centralize orchestration elsewhere. Breaks when: the ADF framework depended on reusable triggers, tumbling windows, dependency triggers, or backfill semantics. Mitigation: rebuild schedules, backfills, and trigger metadata explicitly after assessment rather than assuming parity.
Default: Validate after migration with row counts, watermarks, schema checks, owner review, and a rerun note. Breaks when: the migration assessment says a pipeline is supported and the team treats that as production proof. Mitigation: run nonproduction end-to-end validation and keep the result in the migration decision record.

Close

Next step: Take one metadata-driven ADF control-table row and mark every field as keep, make explicit in Fabric, or delete; only then decide whether the pipeline should migrate, be redesigned, stay mounted for now, or be retired.

If the row cannot name the owner, validation evidence, target shape, and migration status, I would fix the row before I trusted the migration plan.

The checks I add to supply chain data before planners trust it

Sun, 26 Apr 2026 00:00:00 GMT

My first artifact for item_location_week_supply_plan is a compact planner trust check card. Buy, expedite, and reallocate decisions stay blocked until every line on that card is green.

I have watched a planning feed arrive on time with expected row counts and still be unsafe to use. The file can look healthy at the table level while business-rule defects point planners toward the wrong move.

Problem

A planning feed can be fresh, complete by row count, and still unsafe for action.

I keep this boundary separate from two other supply-chain posts on purpose:

KPI-definition boundary: in “The supply chain numbers I define before they reach a business review,” I decide metric meaning, timing, exclusions, grain, and owner before a review.
Restatement-window boundary: in “Handling late-arriving supply chain data without rewriting history by hand,” I decide how late events restate history inside a declared correction window.

In this post, I decide whether the planning slice is safe enough for planners to act on right now.

Default approach

I run a check ladder in a fixed order.

First gate: freshness and row counts. I use them as entry gates only. They are necessary, but they are explicitly insufficient for planner trust. This is the same premise I use in “Row counts are not enough: the checks I add before I trust a pipeline.”
Then I run seven business-rule categories before I clear planning actions:
1. Duplicate purchase orders
2. UOM conversion integrity
3. Missing location mappings
4. Negative or impossible inventory states
5. Order-status lifecycle validity
6. Forecast vs actual demand grain alignment
7. Raw material vs finished-goods classification integrity
For every failed category, I attach one blocked action and one first-response owner so planners are not left guessing whether the slice can be used.
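
The ladder above can be sketched as a small evaluator. This is an illustration under assumptions, not the production check: the class, the function name, and the pairing of a blocked action with each category are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    category: str
    passed: bool
    blocked_action: str        # action named on the card for this category
    first_response_owner: str  # who picks up the failure first


def evaluate_ladder(entry_gates, results):
    """Run entry gates first, then business-rule categories, in fixed order."""
    failed_gates = [name for name, ok in entry_gates.items() if not ok]
    if failed_gates:
        # Entry gates failed: business rules are not evaluated yet.
        return {"released": False, "stage": "entry_gates", "failures": failed_gates}

    failures = [r for r in results if not r.passed]
    return {
        "released": not failures,  # passing the gates alone never releases the slice
        "stage": "business_rules",
        "failures": [
            (r.category, r.blocked_action, r.first_response_owner) for r in failures
        ],
    }


# Illustrative inputs; the blocked-action pairing is an assumption.
gates = {"freshness": True, "row_count": True}
results = [
    CheckResult("duplicate_purchase_orders", True, "expedite", "procurement data owner"),
    CheckResult("uom_conversion_integrity", False, "buy", "master-data/UOM owner"),
]
outcome = evaluate_ladder(gates, results)
print(outcome["released"], outcome["failures"])
```

The point of the shape is that green entry gates only move the slice to the next stage; release depends on the business-rule results.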

Example

In one weekly slice of item_location_week_supply_plan, FG-104 at WEST-03 looked healthy at the table level (freshness=PASS, row_count=PASS) and still failed planner trust.

Planner trust check card
Feed: item_location_week_supply_plan
Slice: FG-104 @ WEST-03
Entry gates: freshness=PASS, row_count=PASS (insufficient alone)

1) Duplicate purchase orders
- Trigger condition: two active lines share (po_number, line_id, item_id, location_id, required_date) after an EDI 850 resend lands with a new message_id
- Failure description: the same inbound case appears twice and inflates expected receipts
- Planner decision error it causes: planner skips a needed expedite because inbound looks bigger than reality
- First response: procurement data owner inactivates the duplicate line and republishes the slice

2) UOM conversion integrity
- Trigger condition: supplier EDI 856 arrives in CASE, ERP posts receipts in EA, and case-pack mapping is stale for FG-104
- Failure description: each receipt is multiplied by the old case-pack value and creates phantom on-hand overnight
- Planner decision error it causes: planner delays replenishment because phantom safety stock hides a true shortage
- First response: master-data/UOM owner corrects CASE→EA mapping and reruns the planning model

3) Missing location mappings
- Trigger condition: WMS emits alias location `W03-RCV`, but canonical mapping to `WEST-03` is missing in the hierarchy bridge
- Failure description: receipts and on-hand fall into an unmapped bucket and disappear from the WEST-03 planning view
- Planner decision error it causes: planner launches an unnecessary inter-DC reallocation for a shortage that is only a mapping miss
- First response: location-master owner backfills the alias mapping and revalidates affected keys

4) Negative or impossible inventory states
- Trigger condition: FG available quantity goes negative without an approved reason code after issue transactions post before in-transit settlement closes
- Failure description: timing-gap negatives are mixed with true data defects, so FG-104 shows an impossible shortage state
- Planner decision error it causes: planner triggers the wrong expedite instead of waiting for the settlement window to close
- First response: inventory-control owner verifies transaction chain and reason codes before release

5) Order-status lifecycle validity
- Trigger condition: order cancelled in OMS, but the cancellation event is dropped in transit and the line stays OPEN in planning facts
- Failure description: the cancelled line sits permanently in the open bucket and inflates expected backlog
- Planner decision error it causes: planner over-allocates capacity and commits an unnecessary expedite to chase phantom backlog
- First response: order-management owner replays cancellation events, reconciles status snapshots, and re-emits order facts

6) Forecast vs actual demand grain alignment
- Trigger condition: forecast is uploaded at SKU-region grain when replenishment consumes SKU-DC grain
- Failure description: missing DC rows are treated as zero demand, triggering unnecessary safety-stock recommendations
- Planner decision error it causes: planner launches wrong replenishment transfers into DCs that are not actually under-demand
- First response: planning analytics owner disaggregates SKU-region forecast to SKU-DC with the approved split key before recomputing joins

7) Raw material vs finished-goods classification integrity
- Trigger condition: RM-881 is misclassified as FG during a product-master merge and enters FG availability logic without a transformation rule
- Failure description: raw material is counted as sellable finished stock for FG-104
- Planner decision error it causes: planner misses a real finished-goods shortage and delays the buy signal
- First response: product-classification owner restores RM/FG boundary and reruns publish checks

Blocked planning actions while any line is FAIL:
- Buy: blocked
- Expedite: blocked
- Reallocate: blocked

This card keeps me out of false confidence: the feed can look healthy at the table level and still be unsafe for planner decisions.
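
As one concrete instance, the duplicate-PO trigger on the card can be sketched like this. The key fields come from the card; the status values and sample rows are assumptions for illustration.

```python
from collections import Counter

# Key fields come from the card; the status values are assumptions.
DUP_KEY = ("po_number", "line_id", "item_id", "location_id", "required_date")
ACTIVE_STATUSES = {"OPEN", "RELEASED"}


def duplicate_po_keys(lines):
    """Return business keys that appear on more than one active PO line."""
    counts = Counter(
        tuple(line[field] for field in DUP_KEY)
        for line in lines
        if line["status"] in ACTIVE_STATUSES
    )
    return [key for key, n in counts.items() if n > 1]


# Illustrative rows: the resend differs only by message_id.
po_lines = [
    {"po_number": "PO-1", "line_id": 1, "item_id": "FG-104", "location_id": "WEST-03",
     "required_date": "2026-05-04", "status": "OPEN", "message_id": "m-001"},
    {"po_number": "PO-1", "line_id": 1, "item_id": "FG-104", "location_id": "WEST-03",
     "required_date": "2026-05-04", "status": "OPEN", "message_id": "m-002"},
    {"po_number": "PO-2", "line_id": 1, "item_id": "FG-104", "location_id": "WEST-03",
     "required_date": "2026-05-11", "status": "CANCELLED", "message_id": "m-003"},
]

print(duplicate_po_keys(po_lines))
```

Leaving message_id out of the key is deliberate: a resend carries a new message_id, so a key that includes it would never see the duplicate.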

Tradeoffs

Breaks when: I assert qty_on_hand >= 0 before the goods-in-transit settlement window closes, so legitimate shipment-before-receipt timing gaps look like defects → Mitigation: I apply strict non-negative assertions only after settlement close, and I require reason codes inside the open window.
Breaks when: I run duplicate-PO checks without lifecycle context during ERP reopen/reclose cycles, so legitimately reopened lines are flagged as duplicates → Mitigation: I scope duplicate checks to active statuses plus lifecycle transition rules from the order-policy contract.
Breaks when: I compare forecast and actual demand before SKU-region forecasts are disaggregated to SKU-DC, so missing DC rows are interpreted as zero demand → Mitigation: I enforce disaggregation and completeness checks at SKU-DC grain before replenishment logic runs.
Breaks when: I trust status snapshots only and ignore event-stream loss during OMS→planning transport, so dropped cancellation events create phantom open backlog → Mitigation: I run event-vs-snapshot reconciliation and block planner release on unresolved cancellation gaps.
Breaks when: I treat RM/FG product_type as static during new-item onboarding, so temporary classification drift enters FG ATP logic → Mitigation: I gate FG availability on explicit transformation rules and quarantine UNKNOWN/RM classes from sellable stock.
Breaks when: I leave first-response ownership generic across plants and shifts, so failed checks sit unclaimed through weekend planning cycles → Mitigation: I assign a named owner + response SLA per category and route alerts by calendar coverage.
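
The first mitigation in that list can be sketched as a time-aware assertion. The function name, the reason-code value, and the timestamps are hypothetical; only the rule (strict after settlement close, reason code required inside the open window) comes from the text.

```python
from datetime import datetime


def inventory_state_ok(qty_on_hand, as_of, settlement_close, reason_code=None):
    """Strict non-negative check after settlement close; reason code required before it."""
    if qty_on_hand >= 0:
        return True
    if as_of < settlement_close:
        # Window still open: a negative is tolerated only with a reason code.
        return bool(reason_code)
    # Window closed: any negative quantity is treated as a defect.
    return False


close = datetime(2026, 4, 27, 6, 0)  # hypothetical settlement close
before_close = datetime(2026, 4, 26, 22, 0)
after_close = datetime(2026, 4, 27, 9, 0)

print(inventory_state_ok(-12, before_close, close, "IN_TRANSIT_TIMING"))
print(inventory_state_ok(-12, before_close, close))
print(inventory_state_ok(-12, after_close, close, "IN_TRANSIT_TIMING"))
```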

Close

Next step: Publish one planner trust check card for your highest-risk planning feed this week, and keep buy/expedite/reallocate blocked until every line on the card is green.

For a supply-chain anomaly already under debate, I’m happy to compare notes on the category, blocked-action, and owner boundary that keeps a bad planning slice from becoming a bad planner decision.

Handling late-arriving supply chain data without rewriting history by hand

Sat, 25 Apr 2026 00:00:00 GMT

On Monday morning, I opened the weekly supply-chain review deck and watched last week’s OTIF move before the meeting started.

Supplier North sent a shipment confirmation two days late. Friday’s OTIF was published from the data available at close. Monday’s refresh pulled in the new event and the result moved. That is when a team either trusts a declared restatement contract or starts rewriting numbers by hand.

Problem

In supply-chain systems, late ASN and shipment events are normal, but many review workflows still treat the first weekly value as final.

When that assumption wins, every late confirmation becomes manual repair: patch the dashboard, explain the change in chat, then repeat the patch next week. The metric still moves, but the movement is hidden in ad hoc edits instead of a declared restatement policy.

I already define KPI boundaries before review in “The supply chain numbers I define before they reach a business review.” The next boundary I set is restatement: when a late event is allowed to move OTIF, and when the period becomes settled.

Default approach

I use one platform-agnostic restatement rule: rerun a fixed lookback window on every refresh and label the values inside that window as preliminary.

Keep OTIF tied to one event contract: late ASN, late shipment confirmation, or late POD/receipt update can restate recent periods.
Recompute a fixed 14-day window on each run using loaded_at as the restatement watermark.
Mark every OTIF value inside the window as preliminary_within_window; mark values outside the window as settled_after_window.
Publish the same labels to consumers so planners know exactly when a number can still move.
Keep inventory snapshot settlement as a separate policy surface from transactional late events; do not blend them into one rule.
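
A minimal sketch of the labeling half of that rule, assuming periods are identified by a date and the window is inclusive of the day exactly 14 days before the run. The function name is hypothetical, and the loaded_at watermark that selects which events to reload is not modeled here.

```python
from datetime import date, timedelta

WINDOW_DAYS = 14  # window duration from the rule above


def restatement_label(period_date, run_date, window_days=WINDOW_DAYS):
    """Label a period as still movable or settled, relative to the run date."""
    window_start = run_date - timedelta(days=window_days)
    if period_date >= window_start:
        return "preliminary_within_window"
    return "settled_after_window"


run_date = date(2026, 4, 20)  # illustrative Monday refresh
for period in (date(2026, 4, 17), date(2026, 4, 6), date(2026, 4, 5)):
    print(period.isoformat(), restatement_label(period, run_date))
```

Publishing the returned label next to the value is what lets a planner see whether the number can still move.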

This is not a historical backfill-validation workflow like “How I validate a metric after a backfill.” It is the day-to-day restatement contract for routine late-arriving corrections.

Example: the restatement window policy card I run

Last week, Supplier North OTIF first landed at 91% on Friday close. A shipment confirmation arrived Sunday night with loaded_at = 2026-04-19 22:14:00, tied to orders shipped in the prior week. Monday’s run recomputed the 14-day window and that week’s OTIF moved to 88%.

Restatement window policy card
Metric: supplier_otif
Window duration: 14 days
Trigger event: late ASN, late shipment confirmation, late POD/receipt update
Preliminary label: preliminary_within_window
Settled label: settled_after_window

Before
- Weekly OTIF was frozen at first publish.
- Late confirmations triggered manual dashboard edits and one-off SQL updates.

After
- Every refresh reruns the latest 14-day window using loaded_at watermark.
- Supplier North late confirmations restate OTIF automatically inside the window.
- Periods older than 14 days stay settled unless a deliberate backfill is approved.

The behavior stays visible and predictable: last week’s OTIF can move when the supplier event arrives late, and the movement happens through the declared window, not analyst cleanup.

Tradeoffs

Breaks when: the lookback window is shorter than real supplier latency patterns → Mitigation: set the window from observed late-arrival distribution and review it quarterly.
Breaks when: downstream users read preliminary values as final → Mitigation: surface preliminary_within_window and settled_after_window in the same table used in the review.
Breaks when: teams use this policy to hide structural source issues → Mitigation: keep planner-trust quality assertions separate and escalate recurring source defects as their own workstream.
Breaks when: transactional events and inventory snapshots share one restatement rule → Mitigation: maintain independent settlement policies so each surface reflects its own timing behavior.
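
For the first mitigation, one way to set the window from observed latency is a nearest-rank percentile over days-late values. This is a sketch under assumptions: the coverage target, the function name, and the sample latencies are hypothetical, and a real review would use the team's own late-arrival history.

```python
import math


def window_days_for_coverage(latency_days, coverage=0.95):
    """Smallest observed latency that covers `coverage` of late arrivals (nearest rank)."""
    if not latency_days:
        raise ValueError("need at least one observed latency")
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    ordered = sorted(latency_days)
    rank = max(1, math.ceil(coverage * len(ordered)))
    return ordered[rank - 1]


# Illustrative days between event time and loaded_at for late events.
observed = [1, 2, 2, 3, 3, 4, 5, 6, 9, 12]
print(window_days_for_coverage(observed, coverage=1.0))
print(window_days_for_coverage(observed, coverage=0.5))
```

If the result drifts above the declared window at the quarterly review, the window is too short for real supplier behavior.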

Close

Next step: Pick one supplier-facing KPI this week, publish its restatement window policy card, and mark which periods are still preliminary before the next business review.

If you want to compare latency patterns and choose a window that your planners will trust, bring one recent late-arrival example and we can pressure-test the policy together.

The supply chain numbers I define before they reach a business review

Fri, 24 Apr 2026 00:00:00 GMT

A weekly supply-chain business review can burn an hour arguing over which version of fill rate is right and still leave without deciding whether service, fulfillment, or working capital needs action this week.

Fill rate, on-time delivery, and inventory turns sound like shared language until planning, warehouse operations, and finance each carry their own clock, grain, and exclusions into the room under the same labels. Once the deck shows one number per label, the meeting becomes a dashboard debate instead of a decision.

I would rather define each KPI’s boundary before the review than untangle it in the meeting. If one supply-chain label cannot hold the same owner, clock, grain, exclusions, and intended decision for planning, warehouse, and finance at once, I split it and name the variants before anyone presents a number.

Problem

Supply-chain KPI labels often look stable until teams ask what they include.

Fill rate can mean complete orders filled from available stock, order lines filled, units shipped within a short window, or demand covered after backorder recovery. On-time delivery can be measured against requested date, original promise date, latest customer-accepted promise date, planned ship date, actual ship timestamp, or actual delivery timestamp. Inventory turns can be a finance-owned value calculation for a fiscal period or an operational shortcut for SKU movement through a warehouse.

These are not cosmetic wording gaps. Each one changes who owns the number, which clock it runs on, and which decision the review is allowed to make with it.

The pattern I watch for is a review deck with three headline KPIs and no boundary card behind them. Planning reads fill rate as demand coverage. Warehouse operations reads the same label as complete customer orders filled from stock on the first execution attempt. Finance reads it as a service signal that should support or contradict the working-capital story. The label is shared; the definitions underneath are not, and the decisions they imply do not overlap.

That is the same failure shape I watch for in metric drift across dashboards. The supply-chain version is harder to catch because the disagreement usually starts before any dashboard exists, inside the definitions teams carry into the meeting.

Default approach

Before the number enters the review deck, I write a short KPI boundary card.

Start with the review decision, not the formula. The card should state, in one line, the question this metric is allowed to answer in this meeting and who gets to act on the answer.
Lock the clock. Order date, requested date, original promise date, latest customer-accepted promise date, planned ship date, actual ship timestamp, actual delivery timestamp, data-complete cutoff, and fiscal period are different clocks, and the card picks one on purpose.
Write exclusions in business language. Cancelled orders, customer reschedules, substitutions, backorders, partial shipments, carrier exceptions, returns, obsolete inventory, and consignment stock each need an explicit keep-or-drop decision, not a buried SQL filter.
Name the grain before aggregation. Order, order line, unit, shipment, SKU-location-day, and fiscal-period inventory value each answer a different question, and averaging them silently is how a supposedly shared metric drifts.
Assign one named owner who can approve changes to timing, exclusions, or grain without having to ask another team first.
Decide up front whether the review gets one shared definition or separately named variants, and record the decision on the card. Do not defer that call to the meeting.

The last bullet carries the most weight. Splitting a KPI is not a naming exercise; it is an admission that one shared label would otherwise make planning, warehouse, or finance wrong in order to make another team precise.

Example: the boundary card I want before the review

Here is a weekly supply-chain business review I would rather catch before it ships than defend after.

The service-and-inventory slide currently reads:

Weekly supply-chain business review
- fill rate: 92%
- on-time delivery: 89%
- inventory turns: 5.1

The disagreement starts before the slide is up.

Planning puts fill rate at 96%, counting units eventually shipped against the demand created that week. Warehouse operations puts it at 88%: complete customer orders filled from available stock on the first execution attempt, with cancellations and backordered lines excluded. Finance wants neither version anywhere near the close, because partial shipments and backorder recovery push revenue across period boundaries.

On-time delivery repeats the shape. Transportation reports 91% against the latest customer-accepted promise date and the actual delivery timestamp. Warehouse operations prefers 95%, measured against planned ship date and actual ship timestamp, and stops the clock there. Sales asks why an order that missed the original promise date still counts as on time after a customer reschedule quietly rewrote the target.
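
The clock problem is easy to reproduce. In this sketch the same delivered orders are scored against two promise-date fields; the field names and the four sample orders are hypothetical.

```python
from datetime import date


def on_time_rate(orders, target_field):
    """Share of delivered orders whose delivery date is on or before the chosen target."""
    delivered = [o for o in orders if o["delivered_on"] is not None]
    if not delivered:
        return None
    on_time = sum(1 for o in delivered if o["delivered_on"] <= o[target_field])
    return on_time / len(delivered)


# Illustrative orders; order B was rescheduled by the customer.
orders = [
    {"order_id": "A", "original_promise": date(2026, 4, 13),
     "latest_promise": date(2026, 4, 13), "delivered_on": date(2026, 4, 13)},
    {"order_id": "B", "original_promise": date(2026, 4, 14),
     "latest_promise": date(2026, 4, 16), "delivered_on": date(2026, 4, 16)},
    {"order_id": "C", "original_promise": date(2026, 4, 15),
     "latest_promise": date(2026, 4, 15), "delivered_on": date(2026, 4, 17)},
    {"order_id": "D", "original_promise": date(2026, 4, 16),
     "latest_promise": date(2026, 4, 16), "delivered_on": date(2026, 4, 15)},
]

print(on_time_rate(orders, "original_promise"))
print(on_time_rate(orders, "latest_promise"))
```

Both results are valid answers to different questions, which is why the card has to name the promise-date source before the review.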

Inventory turns behaves like one number until someone asks what the denominator is. Finance is using finished-goods inventory value averaged across the completed fiscal period with COGS for the same period. Planning is watching SKU velocity and days of supply. Warehouse operations is watching movement and capacity pressure against a physical footprint. All three views answer real questions, but no two of them are the same metric, and none of them should inherit the single label inventory turns on an executive slide.
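
For reference, the finance version described above is COGS for the period divided by average inventory value over the same period. A minimal sketch, with hypothetical figures and a simple mean of period-end values standing in for whatever averaging policy finance actually uses:

```python
def inventory_turns(cogs, period_inventory_values):
    """COGS for the period divided by the average inventory value over the same period."""
    if not period_inventory_values:
        raise ValueError("need at least one inventory value for the period")
    average_inventory = sum(period_inventory_values) / len(period_inventory_values)
    if average_inventory <= 0:
        raise ValueError("average inventory value must be positive")
    return cogs / average_inventory


# Illustrative quarter: three month-end finished-goods values and quarterly COGS.
month_end_values = [280.0, 300.0, 320.0]
quarter_cogs = 1200.0
print(inventory_turns(quarter_cogs, month_end_values))
```

Nothing in this calculation mentions units, SKUs, or warehouse movement, which is the reason the planning and warehouse views need their own names.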

Here is the boundary card I would write before the next deck goes out, with one owner and one intended decision per approved metric:

Supply-chain KPI boundary card
Review: weekly supply-chain business review
Decision: which numbers are safe as shared review KPIs, and which need named variants

1. fill rate
- review label: fill rate
- approved review metric: warehouse_order_fill_rate
- intended decision: did fulfillment fill customer orders completely from available stock on the first execution attempt?
- owner: warehouse operations / fulfillment analytics
- timing: first ship attempt or agreed ship-from-stock window
- grain: order-level complete fill unless the card explicitly says order-line or unit fill
- exclusions to state: cancellations, customer future-dated orders, substitutions, split shipments, manual holds, test orders, backorders
- split decision: planning gets a separate supply-availability metric; finance does not use this number for close or working-capital review

2. on-time delivery
- review label: on-time delivery
- approved review metric: customer_on_time_delivery
- intended decision: did delivered orders meet the agreed customer promise window?
- owner: transportation / logistics, with customer-service ownership for promise-date policy
- timing: promise date source and actual delivery timestamp are locked before review
- grain: complete order unless the card explicitly says shipment or order line
- exclusions to state: customer reschedules, carrier exceptions, missing proof of delivery, cancelled orders, pickup orders, early-but-incomplete deliveries
- split decision: warehouse ship timeliness and customer delivery performance stay separate if they use different clocks

3. inventory turns
- review label: inventory turns
- approved review metric: finance_finished_goods_inventory_turns
- intended decision: is finished-goods working capital moving in the right direction for the completed fiscal period?
- owner: finance / supply-chain finance
- timing: completed fiscal month, quarter, or year; numerator and denominator use the same period
- grain: financial value over time, not unit movement
- exclusions to state: obsolete/excess inventory, consignment stock, intercompany inventory, raw/WIP inventory, returns, write-downs, cost policy
- split decision: planning gets days of supply or SKU velocity; warehouse gets movement or capacity metrics; finance owns official turns

Business-review rule
- If owner, clock, grain, exclusions, and decision do not match across planning, warehouse, and finance, the deck shows named variants instead of one shared KPI label.
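
The business-review rule can be sketched as a comparison across team definitions. The structure and the sample values are hypothetical paraphrases; the five compared fields come from the rule itself.

```python
BOUNDARY_FIELDS = ("owner", "clock", "grain", "exclusions", "decision")


def review_label_outcome(definitions):
    """Report whether one shared label is safe, and which boundary fields disagree."""
    mismatched = [
        field
        for field in BOUNDARY_FIELDS
        if len({team_def[field] for team_def in definitions.values()}) > 1
    ]
    return {"shared_label": not mismatched, "mismatched_fields": mismatched}


# Illustrative definitions of "fill rate"; values must be hashable (frozenset for exclusions).
fill_rate = {
    "planning": {
        "owner": "planning", "clock": "demand week", "grain": "unit",
        "exclusions": frozenset(), "decision": "demand coverage",
    },
    "warehouse": {
        "owner": "warehouse operations", "clock": "first ship attempt", "grain": "order",
        "exclusions": frozenset({"cancellations", "backorders"}),
        "decision": "first-attempt complete fill",
    },
}

print(review_label_outcome(fill_rate))
```

Any non-empty mismatched_fields list means the deck shows named variants rather than one shared label.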

The resolution is not to average the versions or pick the dashboard backed by the most confident owner.

The resolution is to split the label and assign each variant to the audience and decision it serves. Planning keeps a named supply-availability or service metric for demand coverage. Warehouse operations owns warehouse_order_fill_rate for complete first-attempt fulfillment. Transportation and customer service together own customer_on_time_delivery for the promise-window question, with customer service holding the right to change promise-date policy. Finance owns finance_finished_goods_inventory_turns for finished-goods working capital across the completed fiscal period.

Once those names exist, the review can still show more than one number. The difference is that each number states which decision it supports and which owner can change the definition without a cross-team negotiation in the meeting.

This is also why I do not treat the problem as a self-service dashboard issue. A dashboard can only travel safely after the metric boundary is honest. Before a supply-chain number reaches the review, I apply the same caution I use before I label a dashboard self-service: write the allowed question, the intended audience, and the explicit not-for cases, then decide whether the single shared label still fits at all.

Tradeoffs

Breaks when: one KPI label is forced to serve planning, warehouse, and finance decisions → Mitigation: split the metric into separately named definitions with the owner and intended decision on the card.
Breaks when: the formula is written but the clock is not → Mitigation: define the timing source before review and block after-the-fact swaps between order date, ship date, delivery date, promise date, and fiscal period.
Breaks when: partial shipments make the numerator look better for one audience and worse for another → Mitigation: choose order, order-line, unit, or shipment grain deliberately and keep partial-shipment handling visible.
Breaks when: inventory turns are treated like a warehouse velocity metric in one slide and a finance working-capital metric in the next → Mitigation: reserve official turns for finance-owned value over a fiscal period, then create separate planning or warehouse operational metrics when needed.
Breaks when: the boundary card turns into a governance ceremony for every low-risk operational number → Mitigation: reserve the full card for numbers that reach cross-functional reviews or executive summaries.

Close

Next step: For one supply-chain number already in a business review, write the boundary card before the next meeting: owner, clock, grain, exclusions, and the first decision it supports.

Named variants usually create calmer reviews than one overloaded KPI label carrying planning, warehouse, and finance decisions at once.

Azure DevOps checks before analytics code reaches production

Fri, 24 Apr 2026 00:00:00 GMT

The Azure DevOps setup I care about for analytics repos is not the cleverest YAML file.

It is the record that lets me say whether this change can move closer to production.

Before a metric definition, model, or dashboard-facing transform gets promoted, I want the delivery decision visible in one place: the protected-branch PR gate, the short validation chain, the evidence from failed checks or risky changes, and the approval boundary for the next environment.

If I need a Slack thread to reconstruct those facts, the pipeline is still hiding the decision.

Problem

Analytics code can fail with the same shape as application code. A small pull request changes one definition, the code review looks harmless, and the number that moves later is the one finance or operations cares about.

The weak spot is usually not that Azure DevOps cannot run enough checks. It is that the delivery record is too thin. A PR has a green badge, a reviewer remembers that tests usually run, and a promotion stage succeeds because the previous stage looked healthy. Later, when net revenue changes, nobody can tell whether the protected-branch policy guarded the merge, whether dbt compile ran on the changed scope, or whether a red run would have published enough evidence for the first responder.

That is the gap I want Azure DevOps to close: not a platform tour, not a YAML showcase, just one inspectable chain from pull request to promotion boundary.

Default approach

This post uses the Azure Repos Git branch-policy and Azure Pipelines behavior I checked in Microsoft Learn on 2026-03-24.

Protect the production branch with Azure Repos Git branch policies and required build validation. For Azure Repos Git, that branch-policy build validation is the PR gate I trust; push-trigger CI is supporting evidence, not the merge boundary.
Keep PR validation short enough to debug: lint, repo tests, dbt compile, and one changed-model or changed-path check tied to the repo’s real risk.
Publish test results into the pipeline summary so a failed run names the broken check before anyone opens raw logs.
Publish the smallest useful investigation artifacts from validation: compile output, changed-node or changed-path evidence, failed SQL snippets when present, and a release note that names the rollback reference.
Use stage conditions deliberately. Validation, evidence publication, and promotion should be separate decisions, and a promotion stage that depends on validation should preserve that success requirement.
Put production-adjacent promotion behind an environment approval or check controlled outside ordinary pipeline edits.

That is the job I want Azure DevOps doing here. Enforce the delivery decision. Preserve the evidence. Do not become a second analytics system.

Example: the pre-production check record I want attached to the PR

Imagine a pull request that changes the net revenue definition used by a finance-facing mart. The diff is small: one calculation changes how refunds are excluded from revenue after a source-system fix.

That is exactly the sort of analytics change I do not want waved through because the pull request is short.

Azure DevOps pre-production check record
Change: metric definition update for net revenue
PR: #418 -> main
Protected branch policy: required build validation passed
Validation trigger model: Azure Repos Git branch-policy build validation, not only a generic CI push trigger

Validation stage
- branch validation: required build validation passed on the protected branch
- lint: passed
- repo tests: passed
- dbt compile: passed
- changed-scope check: changed models and changed paths reviewed

Published evidence
- test results: visible in the pipeline summary
- investigation artifacts: compile output, changed-node evidence, failed-query snippets if present
- release note: metric definition changed; rollback path points to previous definition commit and artifact

Promotion boundary
- preproduction stage condition: only after validation succeeds
- environment approval/check: approved before production-adjacent promotion
- decision: merge and promotion allowed because checks, evidence, and approval are visible in one record
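
To make the decision line concrete, here is a sketch that treats the record as data and reports what is missing. It is not Azure DevOps configuration and calls no Azure DevOps API; the key names and the "passed" status value are assumptions chosen to mirror the record above.

```python
# Hypothetical keys mirroring the record; not an Azure DevOps schema.
REQUIRED_CHECKS = (
    "branch_policy_build_validation", "lint", "repo_tests",
    "dbt_compile", "changed_scope_check",
)
REQUIRED_EVIDENCE = ("test_results", "investigation_artifacts", "release_note")


def promotion_decision(record):
    """Allow promotion only when checks, evidence, and approval are all present."""
    checks = record.get("checks", {})
    evidence = record.get("evidence", {})
    gaps = [name for name in REQUIRED_CHECKS if checks.get(name) != "passed"]
    gaps += [name for name in REQUIRED_EVIDENCE if not evidence.get(name)]
    if not record.get("environment_approval"):
        gaps.append("environment_approval")
    return {"allowed": not gaps, "gaps": gaps}


# Illustrative record with one missing piece of evidence and no approval yet.
record = {
    "checks": {name: "passed" for name in REQUIRED_CHECKS},
    "evidence": {"test_results": True, "investigation_artifacts": True, "release_note": False},
    "environment_approval": False,
}
print(promotion_decision(record))
```

The useful property is that a green validation stage alone never produces allowed=True; evidence and approval are separate inputs.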

The first line I check is the branch boundary. A pipeline that happens to run on pushes is not the same thing as protected PR validation. I want the protected branch to require build validation before completion, so the merge decision is tied to the policy that guards the branch.

The validation stage stays intentionally boring. lint catches formatting and static mistakes. Repo tests catch local behavior. dbt compile proves the project can still parse, resolve references, and render the changed graph. The changed-model or changed-path check keeps a one-line metric edit from looking risk-free.

I am not trying to turn this post into a dbt deployment guide. The deployment mechanics belong in “How I keep dbt Core deployments boring in production.” Here, dbt compile and changed-scope evidence earn their slot because they are gates in the Azure DevOps record.

Published evidence is where a lot of otherwise decent pipelines get thin. A red run should not leave the next person guessing which test failed, where the compile output went, or whether the changed-scope check found a wider path than expected. I want that evidence attached to the run while the context is still fresh.

The promotion boundary is separate on purpose. Validation steps live in the pipeline. The production-adjacent approval or check should be owned as a protected resource decision, not hidden inside the same editable YAML path as ordinary validation. That boundary keeps “the run was green” from silently turning into “production can move automatically.”

The same logic applies when the risky part of the change is the model-level test coverage. “The dbt tests I write first for business-critical models” explains how I choose those tests. This record explains where Azure DevOps should enforce those tests, publish their results, and stop promotion until the evidence is visible.

Tradeoffs

Breaks when: the team relies on a generic push trigger and calls it PR protection → Mitigation: anchor Azure Repos Git PR validation to branch-policy build validation on the protected branch, then treat CI triggers as supporting context.
Breaks when: the YAML becomes more impressive than the release decision → Mitigation: keep validation stages short and tie every step to merge safety, failure investigation, or promotion control.
Breaks when: test failures only exist in raw logs → Mitigation: publish test results and the smallest useful investigation artifacts so the first responder can see the failed check and next action quickly.
Breaks when: stage conditions skip too much or run after failed prerequisites → Mitigation: keep success conditions explicit and preserve succeeded() where the prior stage must pass.
Breaks when: environment approvals live inside the same editable pipeline logic as ordinary validation steps → Mitigation: use approvals and checks on protected resources so resource owners control the promotion boundary outside YAML edits.
Breaks when: dbt compile or changed-model checks pull the conversation back into dbt deployment mechanics → Mitigation: keep them framed as gates inside the Azure DevOps record, then link out to the dbt deployment and dbt test-ordering posts for mechanics.

Close

Next step: Pick one analytics repo and write the check record you would want attached to the next metric definition change before it merges: branch-policy validation, lint, tests, compile, changed scope, published evidence, and environment approval.

If a release still needs Slack archaeology after a green Azure DevOps run, use that evidence record to make the next merge explain itself from policy to approval.

What I check before I label a dashboard self-service

Thu, 23 Apr 2026 00:00:00 GMT

A dashboard is not self-service just because a lot of people can open it.

I use that label only when the promise survives casual reuse. If a forwarded link can pull a new reader past the intended question, audience, grain, filters, or timing boundary, the label has outrun the dashboard.

I do not blame the finance reviewer for opening a familiar operations view. I blame the label when it travels farther than the logic underneath it.

Problem

Some dashboards are reliable for the team that built them and still unsafe as self-service assets.

I see this when an operations dashboard gets reused in finance close because the chart names look familiar and the link is easy to share. One tile mixes booked orders and shipped orders. Cancelled-order handling and sandbox exclusions stay buried in SQL. The refresh is still incomplete before 09:00 ET. Operations can still use it to spot same-day flow problems. Finance cannot use it to decide whether revenue is safe to close.

That is not a careless-reader problem. It is a boundary problem. A dashboard earns the self-service label only when a casual reader can tell what question it answers, who it serves, what each number counts, which filters shape the answer, and when the data is complete enough to trust. If those boundaries are still implied, the honest label is team-scoped, split into another view, or simply not self-service yet.

Default approach

I treat self-service as a label decision, not a compliment. I want one short self-service review card beside the dashboard that tells a new reader what I allow, what I block, and where the label ends.

What I allow before I use the label

One allowed question, written the way the operator would ask it. In this case: Where is same-day order flow falling behind plan by ship-promise date?
One named audience. If the default reader is operations managers and fulfillment leads, I write that down instead of implying anyone with access.
Plain grain on every important tile: booked, shipped, or settled; one row, one order, or one shipment.
Visible filter boundaries, including exclusions that would change the answer if a reader assumed the wrong scope.
A timing note that says when the number is directional, when it is complete enough for the intended audience, and when it is unsafe for wider reuse.
One explicit label outcome at the end: self-service now, split the view first, or keep it team-scoped.

What I block or relabel

I block the label when one dashboard is asked to answer both an operations follow-up and a finance close question. That is not broader reuse. That is two decision paths hiding in one view.
I relabel when the audience boundary is effectively whoever got the link.
I stop the label when a tile mixes booked, shipped, and settled logic without naming the difference.
I stop the label when the important filters only exist in SQL comments, dbt models, or someone’s explanation on a call.
I keep the dashboard out of finance or executive reuse when the morning refresh is still incomplete but the page only says updated daily.

If question, audience, grain, filters, and timing do not line up for the next reader, I do not widen the label. If the dashboard only fits the original team, I would rather say that plainly than pretend the word self-service will teach the next reader where the boundary is.

Example: the self-service review card I want beside the dashboard

Here is the compact review card I would attach to a dashboard before I let anyone call it self-service:

Dashboard self-service review

Field	Value
Dashboard	daily order flow monitor
Label decision	FAIL — not self-service yet

Review step	What I confirm	Status
Question	Allowed question: Where is same-day order flow falling behind plan by ship-promise date? Current failure: finance is also using the dashboard to ask whether settled daily revenue is safe to close.	—
Audience	Intended audience: operations managers and fulfillment leads. Explicit not for: finance close, revenue reporting, external reporting.	—
Grain	Current failure: one trend mixes booked orders and shipped orders in the same daily view. Pass condition: each tile names whether it is booked, shipped, or settled, and what one row / point represents.	—
Filter boundaries	Must state location scope, cancelled-order handling, sandbox and test exclusions, and return handling. Current failure: those exclusions live in SQL but are invisible to the reader.	—
Timing assumptions	Current data is intra-day and incomplete before 09:00 ET. Pass condition: the dashboard states when the number is directional, when it is complete enough for operations, and when it is not safe for finance.	—
Outcome	Keep this dashboard team-scoped for operations, or split finance into a separate settled view before using the self-service label	—

The question line does most of the work. Where is same-day order flow falling behind plan by ship-promise date? is an operations question. Can finance use this for settled daily revenue close? is a different question, with different timing needs and different counting rules.

The audience line keeps the dashboard from pretending it is safe for every reader who can open it. Once I write operations managers and fulfillment leads and add not for finance close, I stop treating access as proof that the view is broadly reusable.

The grain and filter lines keep a new reader from reverse-engineering the dashboard from memory or tribal knowledge. If a tile mixes booked and shipped logic, or if cancelled-order handling and test exclusions only live in SQL, the dashboard is still depending on insider context.

The timing line decides whether the label is honest. If the number is incomplete before 09:00 ET, I want the card to say whether that is acceptable for operations and unsafe for finance. If two audiences need different timing and counting rules, I split the view before I stretch the promise.

That is also where I keep this post separate from broader dashboard-trust topics. Once finance gets its own settled view and that view matters to leadership, I still want it to carry the explicit operating boundary from The operating spec I want before I trust a business-critical dashboard.

If the team still wants one shared headline number across both views, I treat that as a definition problem before I treat it as a training problem. That is when I pull up The definition card I use to stop metric drift across dashboards, because a self-service label cannot rescue a metric that still changes meaning by audience.

Tradeoffs

Breaks when: teams use self-service as access language instead of decision-safety language → Mitigation: tie the label to one named question and one named audience, not to the size of the permission group.
Breaks when: one dashboard keeps absorbing booked, shipped, and settled logic under one title because separate views feel inconvenient → Mitigation: split the views or rename the tiles before widening the audience promise.
Breaks when: the review card turns into ceremony for low-risk team dashboards that never travel beyond the original group → Mitigation: keep the artifact short and reserve the strict pass / fail gate for views likely to be reused casually.
Breaks when: the card is written once and then ignored while filters, logic, or refresh expectations change → Mitigation: update the card in the same release path as dashboard logic or definition changes.

Close

Next step: For one dashboard that already travels outside its original audience, write the review card before the next forwarded link turns a useful team view into an unofficial company metric.

The card earns its space when audience, question, and not-for boundaries are visible before people answer from the same chart in different meetings.

Where Microsoft Fabric fits and where I would keep it out of the critical path

Fri, 17 Apr 2026 00:00:00 GMT

Microsoft Fabric gets more useful to me when I stop asking whether it can do everything.

The better question is narrower: which parts of a finance and operations reporting path should Fabric simplify, and which parts should stay explicit because trust breaks there first?

That is where Fabric is strongest for a Microsoft-heavy team. OneLake, Lakehouse, Warehouse, shortcuts, and Power BI can cut real handoffs without forcing another stack debate.

I still do not want that convenience to blur semantic ownership, transform checks, permission boundaries, or the moment a fast path becomes a slower one.

If row meaning is still fuzzy, I start with Every important model needs an explicit grain.

If metric definition still drifts, I fix that next with The definition card I use to stop metric drift across dashboards.

Fabric can host those decisions. It does not make them for you.

Problem

A Microsoft-heavy team has a reasonable instinct here.

Power BI is already in place. Azure is already in place. Finance and operations want one path from raw ERP data to trusted reporting.

The trouble starts when “one path” turns into one black box.

Lakehouse, Warehouse, SQL analytics endpoint, shortcuts, security mode, and the semantic model can sit so close together that the path feels trustworthy just because the parts share a platform.

That is not the same as knowing where the reporting definition lives, where a shortcut can fail, or when Direct Lake stops behaving the way the team assumed.

Where Fabric earns a place

Use Fabric where it removes real handoffs: OneLake for shared storage, Lakehouse for ingestion and transform work, Warehouse for refined SQL serving, and Power BI for the reporting surface.
Pick the store by workload. Eventhouse fits high-volume event analysis. SQL database fits transactional work. Lakehouse plus Warehouse cover most reporting-path questions.
Prefer Direct Lake on OneLake when OneLake security, broader modeling features, and in-memory behavior matter most. In a 2026-02-21 Microsoft-docs check, Microsoft documents that Direct Lake on OneLake does not use SQL endpoints or DirectQuery fallback, while Direct Lake on SQL uses the SQL analytics endpoint for discovery and permission checks and can fall back to DirectQuery for SQL views or SQL-based granular access control (Direct Lake overview).
Use shortcuts only when the source owner, permission path, and first failure surface are explainable. Microsoft notes that the calling user must have permission on the shortcut target, and that Direct Lake on SQL or T-SQL in delegated identity mode can pass the calling item owner’s identity instead of the user’s identity (OneLake shortcuts). Zero-copy is useful. Hidden dependency chains are not.
Keep semantic ownership and model checks outside the platform promise. I still want named owners, trusted-table tests, and one release check before publish. That is the same discipline behind The dbt tests I write first for business-critical models.
Review region limits, security mode, and deployment mapping before a report is called trusted. In the same 2026-02-21 Microsoft-docs check, Microsoft says Direct Lake semantic models must be created in the same region as the data source workspace, and lakehouse deployment pipelines create a new empty lakehouse in the target workspace unless dependency mapping is configured (Direct Lake overview, lakehouse deployment pipelines). A unified platform still has seams.

My fit test is simple: Fabric belongs where it removes handoffs; I pull it out of the critical path when storage mode, shortcut identity, or deployment behavior would otherwise stay implicit.

Example

This is the checklist I want before I trust a Fabric-backed report that combines purchase orders, inventory exposure, and shipment status:

Field	Value
Critical report	weekly cash + inventory exposure
Metric the CFO will challenge first	open purchase-order cash plus on-hand inventory value

Review step	What I confirm	Status
Land raw ERP and warehouse feeds in a Lakehouse.	Fabric fit: pipelines, Spark, Delta tables, OneLake storage. Boundary: raw tables are replayable inputs, not finance-ready outputs.	—
Normalize the trusted reporting tables before they reach the semantic model.	Fabric fit: stage and curate in Lakehouse and/or Warehouse. Explicit checks: grain uniqueness, null checks on cost and quantity, relationship checks to item and supplier dimensions.	—
Serve finance-facing tables from a Warehouse.	Fabric fit: T-SQL, structured analytics, Power BI-friendly serving. Boundary: the same region and storage-mode rules above still apply.	—
Use a shortcut for supplier master only if the dependency is documented.	Fabric fit: zero-copy access to a shared domain dataset. Boundary: if the target moves or permissions diverge, the failure appears upstream of the report.	—
Publish the semantic model with an intentional Direct Lake choice.	Preferred path: Direct Lake on OneLake when OneLake security and in-memory behavior matter. Warning: Direct Lake on SQL is the deliberate path when SQL endpoint checks or DirectQuery fallback belong in the design.	—

That is the boundary I care about.

Fabric does useful work in this path. OneLake removes duplicate storage conversations. Lakehouse plus Warehouse narrow the handoff between engineering and reporting surfaces.

I still do not want the critical path to depend on “we assume Fabric handles that.”

If the team needs SQL views in the semantic-model path, or depends on SQL-based row security, I want that named because the Direct Lake behavior changes.

If a shortcut points to another workspace, I want the source owner and permission path written down before the report is called trusted.

What I keep explicit

This is the part I do not hand to the platform story.

Finance analytics owns metric definitions and semantic-model signoff.
Data platform owns storage mode, region, security mode, and deployment mapping.
Trusted tables still need grain, null, and relationship checks before they reach finance-facing measures.
Shortcut targets, target permissions, and the first expected failure surface should be written down before the dataset joins a critical report.
If deployment pipelines create a new empty lakehouse or remap dependencies, that validation should happen before a finance-facing promotion is called routine.
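
The grain, null, and relationship checks in the list above can be sketched as one query that returns a failing-row count per check. This is a minimal sketch, not a Fabric feature: the schema and table names (trusted.fct_po_lines, trusted.dim_supplier) and the columns (po_line_id, supplier_id, unit_cost, quantity) are hypothetical stand-ins for whatever the trusted tables are actually called.

```sql
-- Hypothetical names throughout: trusted.fct_po_lines, trusted.dim_supplier.
-- Each branch returns one row: the check name and how many rows fail it.

-- Grain check: po_line_id should identify exactly one row.
select 'grain_duplicates' as check_name, count(*) as failing_rows
from (
  select po_line_id
  from trusted.fct_po_lines
  group by po_line_id
  having count(*) > 1
) as duplicated_keys

union all

-- Null check: cost and quantity feed the exposure measure, so neither may be missing.
select 'null_cost_or_quantity', count(*)
from trusted.fct_po_lines
where unit_cost is null
   or quantity is null

union all

-- Relationship check: every non-null supplier_id must exist in the dimension.
select 'orphan_supplier', count(*)
from trusted.fct_po_lines as f
left join trusted.dim_supplier as s
  on s.supplier_id = f.supplier_id
where f.supplier_id is not null
  and s.supplier_id is null;
```

A release check built on this would treat any non-zero failing_rows as a reason to hold the publish, which keeps the trust decision in a query someone owns rather than in the platform story.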

Tradeoffs

Breaks when: the team assumes Fabric replaces semantic ownership, data contracts, and model tests because the stack is unified → Mitigation: keep platform fit and trust ownership as separate decisions, and name the owner of each.
Breaks when: Direct Lake is treated like one stable mode regardless of views, SQL endpoint checks, or security choices → Mitigation: choose Direct Lake on OneLake or Direct Lake on SQL deliberately, and document where fallback can occur.
Breaks when: shortcuts are sold as transparent zero-copy access with no downside → Mitigation: document the shortcut owner, target, permission path, and first failure surface before the dataset joins a critical report.
Breaks when: deployment or region assumptions are treated as routine plumbing → Mitigation: review same-region limits, metadata-only deployment behavior, and target validation before promotion.
Breaks when: the team forces the wrong Fabric store into the critical path because Fabric feels unified → Mitigation: keep Eventhouse for event workloads, SQL database for transactions, and Lakehouse plus Warehouse for most reporting.

Close

Next step: For one finance or operations report your team treats as critical, write which part Fabric can simplify, which part still needs an explicit owner, and which failure surface must stay visible.

Can the team explain the Direct Lake choice, shortcut dependency, and warehouse boundary before the report becomes trusted?

The Snowflake design choices that make downstream models easier to trust

Fri, 17 Apr 2026 00:00:00 GMT

A Snowflake model gets hard to trust when one source change has no obvious place to stop.

A feed sends quantity as text, a timestamp arrives in a new format, a nested location field appears, and suddenly every downstream join has to decide what orders means again.

I do not use raw, stage, and curated layers as warehouse ceremony. I use them to isolate change: raw preserves source fidelity, stage normalizes names and types, and curated locks row meaning plus safe joins.

Problem

A broken query is only the visible symptom.

The deeper problem is a warehouse where nobody can say which layer owns source fidelity, which layer owns type cleanup, or which table a planner should actually trust.

That blur turns one source change into extra joins, wider backfills, and longer incident notes.

If raw, stage, and curated objects all carry half-cleaned business logic, every upstream drift leaks further than it should.

The source contract still matters, so I still want the source table expectations written down first.

I also want the curated model grain stated explicitly.

Both disciplines get easier to sustain when Snowflake itself has clear layer jobs.

Default approach

Keep raw landing in an explicit raw schema and keep it close to source shape so replay, audit, and file-level evolution stay possible.
Use staged tables to normalize names, cast strings into typed timestamps and numbers, and flatten the predictable parts of semi-structured payloads.
Use curated models to freeze business grain, safe joins, and reader-facing measures so downstream users do not reason about source drift directly.
Make the layer obvious from the fully qualified name. If orders exists in three places, I want analytics.raw.order_events, analytics.stage.order_lines_typed, and analytics.curated.fct_order_lines.
Declare keys and relationships as metadata on trusted tables when they help humans and BI tools understand the join path, while staying honest that Snowflake treats primary-key and foreign-key constraints on standard tables as informational metadata rather than enforced integrity (table design guidance).
Treat transient storage, schema evolution, and clustering as bounded tools, not the default trust pattern.

Example

Imagine an order-and-inventory feed lands in Snowflake every hour.

One morning the upstream export changes in three ways:

Source change
- quantity: "12" instead of 12
- event_timestamp: "02/17/2026 05:14:08 -0400" instead of ISO 8601
- location.attributes.zone: new nested attribute

If that change leaks straight into the trusted model, three downstream problems show up at once.

Quantity math becomes less safe.
Timestamp filters and backfill windows get ambiguous.
Location joins start depending on a nested payload shape instead of a stable typed column.

This is the smallest boundary note I want instead:

Field	Value
RAW	analytics.raw.order_events: keep landed payload in VARIANT; preserve source naming for replay and audit; allow schema evolution here only for controlled file loads, with `ENABLE_SCHEMA_EVOLUTION`, `MATCH_BY_COLUMN_NAME`, and the table boundary documented
STAGE	analytics.stage.order_lines_typed: expose order_id, order_line_id, sku_id, location_id; quantity NUMBER(38,0); event_ts TIMESTAMP_TZ; location_zone VARCHAR
CURATED	analytics.curated.fct_order_lines: grain is one row per order_line_id; safe joins are dim_dates on order_date and dim_locations on location_id; lifecycle is permanent unless recovery boundaries are documented otherwise

The staged model is where I want the ugly conversion work:

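-- Stage model: type and flatten the raw VARIANT payload in one place.
select
  payload:order_id::varchar as order_id,
  payload:line_id::varchar as order_line_id,
  -- Cast to VARCHAR first: try_to_number is documented for string input, not VARIANT.
  try_to_number(payload:quantity::varchar) as quantity,
  -- Explicit format so parsing does not depend on session timestamp settings.
  try_to_timestamp_tz(
    payload:event_timestamp::varchar,
    'MM/DD/YYYY HH24:MI:SS TZHTZM'
  ) as event_ts,
  payload:location:id::varchar as location_id,
  payload:location:attributes:zone::varchar as location_zone
from analytics.raw.order_events;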

That is the stage boundary doing its job.

Raw stays faithful to the source. Stage holds the typing and flattening work. Curated facts do not need to reinterpret the payload every time the feed drifts.

If try_to_number(payload:quantity) starts returning NULL, I want that failure to surface in stage, not inside a curated fact with string-shaped quantity logic.

I want the same boundary for timestamps. If the feed is not ISO 8601, stage should parse it with an explicit format instead of relying on session settings.
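
One way to make those stage failures visible is a small parse-failure count over the raw table. This is a sketch under the same assumptions as the example (payload is a VARIANT column, and the feed uses the non-ISO timestamp format shown earlier); the output column names are illustrative only.

```sql
-- Counts rows where the source sent a value but the typed conversion came back NULL.
-- Casting to VARCHAR first turns a JSON null into a SQL NULL, so missing values
-- are not counted as parse failures.
select
  count(*) as rows_checked,
  count_if(
    payload:quantity::varchar is not null
    and try_to_number(payload:quantity::varchar) is null
  ) as quantity_parse_failures,
  count_if(
    payload:event_timestamp::varchar is not null
    and try_to_timestamp_tz(
          payload:event_timestamp::varchar,
          'MM/DD/YYYY HH24:MI:SS TZHTZM'
        ) is null
  ) as event_ts_parse_failures
from analytics.raw.order_events;
```

If either failure count moves off zero after a feed change, the incident note can point at stage instead of at a curated fact.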

On the curated side, I still want row meaning frozen. fct_order_lines stays one row per order_line_id. fct_inventory_snapshots stays at its own declared grain.

A backfill or incident note should point to the staged normalization boundary, not force both facts to reinterpret raw payloads on the fly.

I also declare the trusted join path there. On standard Snowflake tables, that metadata does not enforce integrity, but it still makes joins to dates, locations, and inventory snapshots easier to review.
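
A minimal sketch of that declaration follows. The constraint names are hypothetical, and it assumes dim_locations lives in the same analytics.curated schema, which the example does not state. Because Snowflake does not enforce these constraints on standard tables, the sketch ends with the query that actually checks the grain.

```sql
-- Informational metadata on standard tables: documents the join path, does not enforce it.
alter table analytics.curated.dim_locations
  add constraint pk_dim_locations primary key (location_id);

alter table analytics.curated.fct_order_lines
  add constraint pk_fct_order_lines primary key (order_line_id);

alter table analytics.curated.fct_order_lines
  add constraint fk_fct_order_lines_location
  foreign key (location_id)
  references analytics.curated.dim_locations (location_id);

-- Nothing above is enforced, so the real grain check is still a query:
-- any row returned here is a duplicated order_line_id.
select order_line_id, count(*) as row_count
from analytics.curated.fct_order_lines
group by order_line_id
having count(*) > 1;
```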

I keep the storage boundary explicit too.

If the raw landing table is reconstructable from external files, I might accept transient storage there. I do not make the same default for curated facts or dimensions.

Snowflake documents that transient tables have no Fail-safe, so I only use them where loss is acceptable or reconstruction is already documented (temporary and transient tables).

I keep the same boundary on schema evolution. If I enable ENABLE_SCHEMA_EVOLUTION, I want it on the raw file-load table that is supposed to absorb column additions. Snowflake limits automatic schema evolution to file loads and Snowpipe with MATCH_BY_COLUMN_NAME; it can add columns and drop NOT NULL when new files omit a field, which is exactly why I keep it out of curated facts (schema evolution docs).
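
As a sketch of that narrow scope, the setting below is applied to a single raw file-load table. The table name analytics.raw.order_events_files, the stage name, and the Parquet file format are all hypothetical; the example above does not say how files are landed. Snowflake also requires the loading role to hold EVOLVE SCHEMA or OWNERSHIP on the table.

```sql
-- Hypothetical raw file-load table and stage; only this table is allowed to evolve.
alter table analytics.raw.order_events_files
  set enable_schema_evolution = true;

-- Schema evolution only applies when columns are matched by name during the load.
copy into analytics.raw.order_events_files
from @analytics.raw.order_files_stage
file_format = (type = parquet)
match_by_column_name = case_insensitive;
```

Stage and curated tables keep the default (no schema evolution), so a new source column shows up as a reviewable change instead of a silent one.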

I keep clustering just as narrow.

Snowflake already micro-partitions automatically, and its table design guidance says clustering is unnecessary for most tables and usually only worth revisiting when large tables spend real time scanning on a query path that differs from load order (table design guidance).

Until then, clear layer ownership earns more trust than premature storage tuning.

Tradeoffs

Breaks when: raw starts carrying business logic because the team is afraid to create another table → Mitigation: keep raw faithful to source shape and move typed cleanup plus reusable keys into stage.
Breaks when: staged models leave predictable timestamps, numbers, or join attributes trapped inside VARIANT → Mitigation: flatten and type the fields the team actually filters, joins, or backfills against.
Breaks when: I pretend Snowflake constraints are enforcing integrity on standard tables → Mitigation: use primary and foreign keys as metadata for legibility and tooling, then keep the actual trust checks in model logic and tests.
Breaks when: automatic schema evolution in file loads becomes permission to let curated models drift silently → Mitigation: let raw absorb evolving files, but require staged and curated changes to stay deliberate and reviewable.
Breaks when: transient tables become the default for long-lived trusted models → Mitigation: reserve transient storage for scratch or reconstructable layers and keep trusted curated models on the safer lifecycle.
Breaks when: I treat clustering as part of the base design pattern → Mitigation: keep it as a late optimization for large scan-heavy tables instead of mixing it into the minimum trust boundary.

Close

Next step: For one source change your team has already seen in Snowflake, write where the change should stop: raw replay, staged typing, curated grain, or recovery boundary.

The boundary review is useful when it explains why raw, stage, and curated layers disagree before trust erodes downstream.

What I watch in Snowflake before compute cost becomes a surprise

Fri, 17 Apr 2026 00:00:00 GMT

Snowflake compute gets expensive fast when one warehouse's bill jumps and nobody can say whether the extra credits came from idle minutes, warehouse pressure, or one recurring query family.

That is when bad triage starts. Someone reaches for a bigger warehouse. Someone else opens one loud query profile. The basic question is still unsettled.

I do not start with resizing or rewriting. I start with one warehouse and one question at a time.

My first pass stays fixed: metering, idle gap, load, query history, lagged attribution, then pruning if scans still look wrong.

Problem

The expensive mistake is not only a higher warehouse bill.

It is asking the wrong question first.

I see teams jump from “this warehouse cost more” to “rewrite that query” before anyone checks whether the warehouse was mostly busy, mostly idle, or stuck resuming.

That is how one noisy morning turns into a week of unfocused tuning.

Per-query attribution creates a second trap. As of 2026-02-10, Snowflake documents that QUERY_ATTRIBUTION_HISTORY can lag by up to eight hours, excludes warehouse idle time, and omits very short queries (<= ~100ms), so I treat it as a later confirmation layer instead of the first read (Snowflake docs).

If I wait for it before same-day triage, I lose the fast path. If I treat it like the whole bill, I blame one query family for cost it did not fully own.

This is narrower than Five pipeline observability signals before more orchestration.

Here I am not asking whether the pipeline is healthy. I am asking why one warehouse got expensive and which layer earns the next ten minutes.

Default approach

Start with WAREHOUSE_METERING_HISTORY. Confirm which warehouse changed and whether the jump is new, recurring, or already normalizing.
Compare credits_used_compute with credits_attributed_compute_queries before blaming one query family. The gap is where warehouse idle time starts to show up.
Read WAREHOUSE_LOAD_HISTORY next. I want to know whether the warehouse was busy, overloaded, queued during provisioning, or blocked.
Use QUERY_HISTORY for the same-day drill-down. That is where I inspect queue time, spill, scan volume, cache use, and partitions scanned.
Return to QUERY_ATTRIBUTION_HISTORY later. I use it to confirm which query family actually consumed the compute credits once the lagged data lands.
Use TABLE_QUERY_PRUNING_HISTORY only if the expensive pattern still points to unnecessary scans. That answers a different question from metering or attribution.

All five views live in Account Usage. As of 2026-02-10, Snowflake lists WAREHOUSE_METERING_HISTORY, WAREHOUSE_LOAD_HISTORY, and TABLE_QUERY_PRUNING_HISTORY under the USAGE_VIEWER database role, QUERY_HISTORY under GOVERNANCE_VIEWER, and QUERY_ATTRIBUTION_HISTORY under either role. The order still matters more than memorizing every column.

If the question shifts from “why did this warehouse get expensive?” to “what cloud services actually billed?”, I step out to METERING_DAILY_HISTORY.

The warehouse metering views tell me consumed credits first. That is the right triage start, but it is not the whole billed-cloud-services answer.

Example

Imagine a daily transform warehouse named TRANSFORM_DAILY_WH.

A model fan-out ships in the same morning release. The warehouse that usually burns a steady amount of compute between 05:45 ET and 07:00 ET suddenly costs more than twice its normal run.

I do not want a dashboard first.

I want one investigation note I can scan in order:

Field	Value
Warehouse	TRANSFORM_DAILY_WH
Window	2026-02-10 05:45-07:00 ET
1. WAREHOUSE_METERING_HISTORY	credits_used_compute: 18.4; credits_attributed_compute_queries: 10.9; idle compute gap: 7.5
2. WAREHOUSE_LOAD_HISTORY	avg_running elevated during 06:10-06:40; avg_queued_load near zero; avg_queued_provisioning spikes at warehouse resume; avg_blocked negligible
3. QUERY_HISTORY	one query_parameterized_hash dominates bytes_scanned; queued_overload_time low; bytes_spilled_to_remote_storage high; percentage_scanned_from_cache low; partitions_scanned jumped versus the prior run
4. QUERY_ATTRIBUTION_HISTORY (later check)	same query family accounts for 62% of attributed compute; attribution confirms the suspect pattern, not the full warehouse bill
5. TABLE_QUERY_PRUNING_HISTORY	affected fact table pruning_ratio: 0.18; partitions_scanned_per_query far above the recent baseline
Decision	do not resize first; fix the scan-heavy query pattern; shorten idle time on this task warehouse instead of paying for empty minutes

That note tells me where the next ten minutes should go.

The metering lines tell me the warehouse did use more compute, but not all of it was query-attributed. That is my cue to keep both workload and idle time in view.
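
A minimal sketch of the metering read, assuming the standard SNOWFLAKE.ACCOUNT_USAGE location and the example warehouse name. The view reports hourly rows, so the window is widened to full hours, and the -05:00 offset assumes ET in February. The idle_compute_gap alias is my own label for the difference, not a Snowflake column.

```sql
-- Hourly metering rows for the example window; the difference between compute
-- credits used and credits attributed to queries approximates idle time.
select
  warehouse_name,
  sum(credits_used_compute) as credits_used_compute,
  sum(credits_attributed_compute_queries) as credits_attributed_compute_queries,
  sum(credits_used_compute)
    - sum(credits_attributed_compute_queries) as idle_compute_gap
from snowflake.account_usage.warehouse_metering_history
where warehouse_name = 'TRANSFORM_DAILY_WH'
  and start_time >= '2026-02-10 05:00:00 -05:00'::timestamp_tz
  and start_time <  '2026-02-10 07:00:00 -05:00'::timestamp_tz
group by warehouse_name;
```

With the figures in the note, this would return 18.4, 10.9, and a gap of 7.5.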

The load lines tell me the warehouse was not overloaded for the morning. avg_queued_load stays low. avg_queued_provisioning spikes around resume, but it does not explain the whole bill.

That keeps me from resizing first.
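
The load read can be sketched the same way, again assuming the ACCOUNT_USAGE location, the example warehouse name, and a -05:00 offset for ET in February:

```sql
-- Interval-level load for the example window, in time order, so a provisioning
-- spike at resume is visible next to the running and queued averages.
select
  start_time,
  avg_running,
  avg_queued_load,
  avg_queued_provisioning,
  avg_blocked
from snowflake.account_usage.warehouse_load_history
where warehouse_name = 'TRANSFORM_DAILY_WH'
  and start_time >= '2026-02-10 05:45:00 -05:00'::timestamp_tz
  and start_time <  '2026-02-10 07:00:00 -05:00'::timestamp_tz
order by start_time;
```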

Then I use QUERY_HISTORY as the fast path. I group the repeat pattern with query_parameterized_hash, then inspect queue time, spill, cache use, and partitions scanned.

In this case one query family is scanning more, spilling more, and using less cache than the earlier baseline.
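
A sketch of that grouping, under the same assumptions about the warehouse name and window. The output aliases are illustrative, and the top-10 cutoff is an arbitrary choice for a first read.

```sql
-- Group repeat queries by their parameterized hash, then rank families by bytes scanned.
select
  query_parameterized_hash,
  count(*) as runs,
  sum(bytes_scanned) as total_bytes_scanned,
  sum(bytes_spilled_to_remote_storage) as total_bytes_spilled_remote,
  sum(queued_overload_time) as total_queued_overload_ms,
  avg(percentage_scanned_from_cache) as avg_share_from_cache,
  sum(partitions_scanned) as total_partitions_scanned,
  sum(partitions_total) as total_partitions
from snowflake.account_usage.query_history
where warehouse_name = 'TRANSFORM_DAILY_WH'
  and start_time >= '2026-02-10 05:45:00 -05:00'::timestamp_tz
  and start_time <  '2026-02-10 07:00:00 -05:00'::timestamp_tz
group by query_parameterized_hash
order by total_bytes_scanned desc
limit 10;
```

Running the same query over the prior day's window gives the baseline the comparison depends on.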

Later, once QUERY_ATTRIBUTION_HISTORY catches up, I can confirm that the same family consumed most of the attributed compute credits.

That confirmation matters, but it is the later layer. It still does not explain the warehouse idle gap.
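
Once the lagged data has landed, the confirmation step might look like the sketch below. It assumes the same warehouse and window; share_of_attributed is a share of query-attributed credits only, so it never accounts for idle time.

```sql
-- Attributed compute credits per query family, plus each family's share of
-- all attributed credits in the window (idle time is excluded by the view).
select
  query_parameterized_hash,
  sum(credits_attributed_compute) as attributed_credits,
  sum(credits_attributed_compute)
    / nullif(sum(sum(credits_attributed_compute)) over (), 0) as share_of_attributed
from snowflake.account_usage.query_attribution_history
where warehouse_name = 'TRANSFORM_DAILY_WH'
  and start_time >= '2026-02-10 05:45:00 -05:00'::timestamp_tz
  and start_time <  '2026-02-10 07:00:00 -05:00'::timestamp_tz
group by query_parameterized_hash
order by attributed_credits desc
limit 10;
```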

Finally, TABLE_QUERY_PRUNING_HISTORY gives me the scan-efficiency answer. A low pruning ratio and high partitions scanned per query tell me this is not just a big query. It is a wasteful scan pattern.

My default decision here is boring on purpose.

I would not resize first because queue overload stayed low.

I would tighten the scan-heavy pattern, keep the warehouse idle gap visible, and check whether the task warehouse is sitting around longer than the workload justifies.

That boundary matters. As of 2026-02-10, Snowflake says suspending a warehouse drops its cache, recommends immediate suspension for tasks, and suggests at least 10 minutes for BI or SELECT-heavy warehouses that benefit from cache warmth (warehouse cache guidance).

In this example, the workload is a task warehouse. I care more about stopping empty minutes than preserving cache warmth between runs.
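
As a sketch of that choice, the statement below shortens auto-suspend on the example warehouse. The value of 60 seconds is my assumption for "as short as is practical", not a number from the Snowflake guidance quoted above; the right value depends on how the tasks are scheduled.

```sql
-- AUTO_SUSPEND is expressed in seconds of inactivity before the warehouse suspends.
alter warehouse TRANSFORM_DAILY_WH set auto_suspend = 60;

-- Confirm the setting after the change.
show warehouses like 'TRANSFORM_DAILY_WH';
```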

If the spike followed a dbt release, I also want the release path to stay legible.

That is why I keep the deployment record explicit in How I keep dbt Core deployments boring in production.

Tradeoffs

Breaks when: I start with QUERY_ATTRIBUTION_HISTORY and treat it as the whole bill → Mitigation: keep warehouse metering and the idle gap ahead of per-query attribution.
Breaks when: I rewrite SQL before checking whether the warehouse was overloaded, provisioning, or blocked → Mitigation: read WAREHOUSE_LOAD_HISTORY before choosing sizing, scheduling, or query fixes.
Breaks when: I use QUERY_HISTORY as if it already contains attributed compute cost → Mitigation: use it for immediate query symptoms, then come back later for lagged credit attribution.
Breaks when: I copy BI-oriented cache advice onto a task warehouse that should suspend quickly → Mitigation: match auto-suspend choices to the actual workload instead of one universal rule.
Breaks when: I turn one warehouse investigation into a full Snowflake billing explainer → Mitigation: keep the post centered on one warehouse-spike path and use METERING_DAILY_HISTORY only for the billed-cost boundary.

Close

Next step: For one Snowflake warehouse that feels expensive right now, write a six-line note: metering change, idle-cost split, queue state, top query pattern, workload class, and first fix.

That note makes the first move less like generic tuning and more like a choice between sizing, scheduling, or workload separation.

The dbt tests I write first for business-critical models

Thu, 16 Apr 2026 00:00:00 GMT

I start here only after two earlier questions are settled: the source contract exists, and the source freshness check is green.

Then the question narrows: which model-level dbt tests would change my first response if the model started lying?

Freshness is a separate source check, not one of dbt’s four built-in generic data tests.

Once that gate is green, I want the smallest model-level test set that blocks the failures most likely to break joins, counts, statuses, or planner-facing quantities.

On a business-critical model, the first ladder should catch duplicate rows, broken parent joins, invalid states, and one model-specific rule before the dashboard conversation starts.

Problem

Imagine fct_purchase_order_lines feeds an operations dashboard that planners use to chase late supplier deliveries. The source lands on time. The model still builds. The dashboard still renders.

A source-system fix quietly changes three things at once.

A retry path duplicates some purchase_order_line_id values, some rows keep a purchase_order_id missing from the header model, and one status mapping starts writing reopened.

None of that requires a broken DAG to create a business problem.

The failure mode I see is treating tests like a generic checklist instead of ordering them around the next decision.

On a business-critical model, the first tests should tell me quickly whether counts, joins, and decision states are still safe.

Default approach

Start after the source freshness check is green.
Add the uniqueness check that enforces the declared grain first. If the grain is composite, expose a stable surrogate key and test that, or write a singular data test against the full key.
Add not_null only on fields that would break a real decision path if they disappeared, such as the parent key, quantity, effective date, or business status.
Add relationships where an orphaned record would create a business-facing mismatch between the model and the parent entity.
Add one accepted_values check or one custom singular data test for the business-state or range rule most likely to drift without breaking the SQL.
Add one custom business-rule test for the highest-risk scenario the built-ins still miss, then stop before the suite turns into noise.
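For the composite-grain case, the singular test against the full key can be sketched as follows. This is a minimal sketch under assumptions, not the model's real schema: line_number is a hypothetical second key column, the rows are invented, and an in-memory SQLite table stands in for fct_purchase_order_lines so the query runs outside dbt. In a dbt project, the same select would sit in a file under tests/ and read from ref('fct_purchase_order_lines').

```python
import sqlite3

# In-memory stand-in for fct_purchase_order_lines (hypothetical columns and rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table fct_purchase_order_lines ("
    "purchase_order_id text, line_number integer, open_quantity integer)"
)
conn.executemany(
    "insert into fct_purchase_order_lines values (?, ?, ?)",
    [
        ("PO-1", 1, 10),
        ("PO-1", 2, 5),
        ("PO-2", 1, 7),
        ("PO-2", 1, 7),  # a retry path wrote this line twice
    ],
)

# Singular-test shape: return one row per key that breaks the declared grain.
# Zero rows returned means the test passes.
GRAIN_CHECK = """
select purchase_order_id, line_number, count(*) as row_count
from fct_purchase_order_lines
group by purchase_order_id, line_number
having count(*) > 1
"""

duplicate_keys = conn.execute(GRAIN_CHECK).fetchall()
print(duplicate_keys)  # [('PO-2', 1, 2)]
```

The query names the offending key and the row count, which is usually enough to start on the retry or merge logic.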

Example

This is the compact test ladder I would want on a planner-facing purchase-order-line model after the source freshness gate is already green:

models:
  - name: fct_purchase_order_lines
    columns:
      - name: po_line_grain_key
        data_tests:
          - unique
          - not_null

      - name: purchase_order_id
        data_tests:
          - not_null
          - relationships:
              arguments:
                to: ref('fct_purchase_orders')
                field: purchase_order_id

      - name: line_status
        data_tests:
          - not_null
          - accepted_values:
              arguments:
                values: ['open', 'partial', 'closed', 'cancelled']

      - name: open_quantity
        data_tests:
          - not_null

That order is deliberate.

unique on po_line_grain_key goes first because one duplicated line can inflate open quantity, duplicate joins, and make planners think more material is still outstanding than it is.
not_null on purchase_order_id, line_status, and open_quantity comes next because those fields decide whether the row can be joined, interpreted, or acted on.
relationships on purchase_order_id earns a slot because orphaned lines create a mismatch between the line model and the header view the business also reads.
relationships excludes NULL values by design, so I only trust it after I have decided whether nulls should fail separately.
accepted_values on line_status comes before a softer shape check because one invalid state can drive the wrong operational response even when the row count still looks normal.
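The null behavior of relationships is easier to see with three rows. The sketch below approximates the shape of the check (child keys that are not null and have no parent) in an in-memory SQLite database. The table contents are invented for illustration, and the SQL is my approximation rather than dbt's exact compiled query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table fct_purchase_orders (purchase_order_id text);
    create table fct_purchase_order_lines (
        purchase_order_line_id text,
        purchase_order_id text
    );
    insert into fct_purchase_orders values ('PO-1');
    insert into fct_purchase_order_lines values
        ('L-1', 'PO-1'),
        ('L-2', 'PO-9'),
        ('L-3', null);
    """
)

# Relationships-style check: non-null child keys with no matching parent.
ORPHAN_CHECK = """
select child.purchase_order_line_id
from fct_purchase_order_lines as child
left join fct_purchase_orders as parent
  on child.purchase_order_id = parent.purchase_order_id
where child.purchase_order_id is not null
  and parent.purchase_order_id is null
"""

# not_null-style check: rows where the parent key is missing entirely.
NULL_CHECK = """
select purchase_order_line_id
from fct_purchase_order_lines
where purchase_order_id is null
"""

orphans = [row[0] for row in conn.execute(ORPHAN_CHECK)]
null_parents = [row[0] for row in conn.execute(NULL_CHECK)]
print(orphans)       # ['L-2']
print(null_parents)  # ['L-3']
```

L-3 never shows up as an orphan, so only the not_null check reports it.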

Then I add one model-specific rule the built-ins will not catch:

-- tests/open_quantity_never_negative_for_active_lines.sql
select *
from {{ ref('fct_purchase_order_lines') }}
where line_status != 'cancelled'
  and open_quantity < 0

I keep this first set small on purpose.

If unique fails, I inspect retries, merge logic, or a bad intermediate join. If relationships fails, I inspect parent load timing or the ref boundary. If accepted_values fails, I inspect the latest status-mapping change.

Each early test should narrow the first investigation step.

Tradeoffs

Breaks when the model has no declared grain or stable key yet → Mitigation: go back to the explicit grain note first, then let the first unique test enforce that row meaning.
Breaks when the real grain is composite and the suite checks one convenient column → Mitigation: expose a stable surrogate key or add a model-level assertion that matches the declared grain.
Breaks when teams copy the same null and relationships tests onto every field → Mitigation: keep only the tests that would change the first response on this model.
Breaks when relationships sits on optional or noisy foreign keys and creates alert churn → Mitigation: reserve it for joins where orphaned records create a real business mismatch, and pair it with not_null only when nulls should fail.
Breaks when the built-ins all pass but the model still violates a business rule → Mitigation: add one custom test for the highest-risk scenario, such as negative open quantity or an impossible state transition.
Breaks when the suite grows into dozens of low-value checks because the model is important → Mitigation: rank tests by failure cost and response path, then add depth only where the business risk justifies it.

Close

Next step: Pick one business-critical model, confirm the source freshness gate is green, and write the first four or five dbt tests that would change your first response if the model started lying tomorrow morning.

I’d compare notes on the business-critical model where the test suite keeps growing but the first investigation step still isn’t clear.

The operating spec I want before I trust a business-critical dashboard

Fri, 13 Mar 2026 00:00:00 GMT

A dashboard can appear in every executive review and still be hard to trust when it matters.

If nobody can answer who owns it, when the number becomes safe to use, which slice to validate first, and what the fallback is when the refresh lands late, I do not treat it as production-ready.

Before I trust a business-critical dashboard, I want one short operating spec beside it. I am not trying to add more documentation. I am trying to make the next decision safer when the number is late, disputed, or clearly wrong.

Problem

A business-critical dashboard earns trust one review at a time, but it can lose that trust in one morning.

I have seen teams hand off a dashboard with useful charts and no operating boundary anyone can point to. The finance lead assumes the number is safe by 07:30 ET. Analytics assumes everyone knows draft orders are excluded. Nobody can point to the owner, the trusted validation slice, or the fallback plan when the refresh misses its cutoff.

That is how a dashboard can look finished while still behaving like an undocumented handoff. When the headline KPI is questioned, the room starts with interpretation instead of the first check.

Default approach

Name the dashboard owner and the business owner separately. I want one person accountable for the data path and one person accountable for definition decisions.
Write the primary decision the dashboard supports. If I cannot say what meeting or review it is for, the rest of the spec gets vague fast.
List the source models that feed the headline KPI, not every upstream table in the warehouse.
Add one short metric note with the key exclusions, boundary conditions, and definition-change owner.
Write the refresh expectation as a real operating window: source landing time, dashboard refresh time, and the point where I call the dashboard stale.
Record one trusted validation slice I can check quickly when the number is challenged.
Write the failure path: first check, second check, safe fallback, and who posts the update.
Keep the spec beside the dashboard or release path, and update it when the dashboard changes. If it lives in a stale wiki page, it is already failing.

Example: the one-page operating spec I want beside the dashboard

Here is the kind of spec I want attached to an executive KPI scorecard before the weekly review depends on it:

Dashboard operating spec
Dashboard: executive KPI scorecard
Dashboard owner: analytics engineering
Business owner: finance director
Primary decision: weekly executive review of revenue, margin, and order health

Source models
- mart_finance_daily_kpis
- fct_orders
- dim_customers

Headline metric note
- Settled net revenue
- Excludes sandbox, QA, and fully refunded orders
- Finance is the decision owner for definition changes

Refresh expectation
- Source landing complete by 06:30 ET
- Dashboard refresh complete by 07:15 ET
- Treat the dashboard as stale after 07:30 ET unless the owner posts an update

Trusted validation slice
- US enterprise orders
- Last complete business day
- Compare to the settled finance snapshot when the headline KPI is questioned

Known boundaries
- Intra-day order activity is incomplete before the morning cutoff
- Draft orders are intentionally excluded
- Margin is directional until freight adjustments settle

Failure path
- First check: source freshness and latest successful publish
- Second check: validation slice against the settled snapshot
- Safe fallback: use yesterday's settled KPI snapshot in the review until the issue is resolved
- Communication path: owner posts a short incident note with next update time

Each line in that spec changes the next decision.

I do not need a big wiki page. I need enough operating context to answer three questions fast: is the dashboard safe to use, who makes the next call, and what do we do if it is not?

The owners tell me who approves definition changes and who runs the first technical check. The source model list narrows the search space before anyone starts opening warehouse tables one at a time.

The metric note keeps the number aligned with how the business already uses it. The trusted validation slice gives me one fast comparison when the KPI is questioned. The refresh window tells me when the dashboard is safe, not just when a scheduler says something ran. The fallback line tells me whether we use yesterday’s settled snapshot or hold the review until the number is safe.
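The refresh window and fallback line can be read as a small decision rule. The sketch below is one possible reading, not a tool I am claiming exists: the function name, the state labels, and the idea that a completed refresh alone means "safe" are simplifying assumptions. Only the 07:30 ET cutoff comes from the spec.

```python
from datetime import time

# Cutoff copied from the spec; times are treated as ET wall-clock values.
STALE_AFTER = time(7, 30)


def dashboard_state(now, refreshed_today, owner_posted_update=False):
    """Classify the dashboard for the morning review (hypothetical helper).

    now: datetime.time of the check, in ET.
    refreshed_today: True once today's dashboard refresh has completed.
    owner_posted_update: True if the owner posted an update after a missed cutoff.
    """
    if refreshed_today:
        return "safe"
    if now <= STALE_AFTER:
        return "pending"  # still inside the refresh window
    if owner_posted_update:
        return "follow owner update"
    return "stale"  # fall back to yesterday's settled snapshot


print(dashboard_state(time(7, 10), refreshed_today=False))  # pending
print(dashboard_state(time(7, 45), refreshed_today=False))  # stale
```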

The release checklist governs a change. The operating spec governs the dashboard on an ordinary Tuesday, when nothing is supposed to be changing.

When the logic, filters, or trusted slice change, I update this spec in the same release path as “A dashboard release checklist before a BI change goes live.” If the number is still wrong after publish, I move into “When a dashboard number changes, I check these four things first” instead of debating the chart live.

Tradeoffs

Breaks when: the spec turns into a long wiki page nobody updates → Mitigation: keep it to one page, store it beside the dashboard or release artifact, and update it in the same change that affects the dashboard.
Breaks when: the dashboard spec pretends an unresolved metric definition is settled → Mitigation: mark the disputed field clearly and hold the dashboard out of the critical review path until the definition owner decides.
Breaks when: teams force the full operating spec onto low-risk dashboards and create more process than value → Mitigation: reserve the full version for executive and business-critical dashboards, and use a lighter note for lower-risk BI work.
Breaks when: the fallback path is vague because nobody wants to say “do not use this number” before a live review → Mitigation: agree on the safe fallback and escalation path before the next meeting depends on the dashboard.

Close

Next step: write the owner, source models, metric note, refresh cutoff, trusted slice, and failure path for one business-critical dashboard before the next review depends on it.

If an executive dashboard on your team still cannot answer who owns it, when the number becomes safe to use, and what the fallback is when the refresh lands late, that missing spec is usually a shorter fix than another layer of review process.

How I keep dbt Core deployments boring in production

Tue, 10 Mar 2026 00:00:00 GMT

A dbt deploy is easiest to trust when another engineer can tell me exactly what will run before production moves.

For me, this is the upstream counterpart to my BI release checklist: one compact artifact that makes scope, approval, and rollback legible before a business-critical number changes.

The failure mode I care about is not only a red job. It is a green promotion nobody can explain after a business-critical model changes.

That is why I keep the state reference, selector, target, changed nodes, and promotion result visible in one deployment note. If those details disappear behind wrappers or tribal memory, the deploy can still go green and still be hard to trust.

Problem

A dbt Core deploy can look fine in a pull request and still be hard to trust on its way to production.

SQL review passes. Tests are green. The pipeline UI says a job ran. I still need to know which production manifest I compared against, which selector will run, which target proved the change, and whether production promotion reuses that same logic or disappears behind a different wrapper.

If those answers are fuzzy, the failure mode is not just a red job. It is a production promotion nobody can explain when a business-critical model changes under time pressure.

Default approach

Keep one production manifest or state artifact available to CI so changed-node selection is based on a known production reference, not a guess.
Use one explicit selector for the change set and its downstream blast radius, then carry that selector into the deployment summary.
Keep environment targets intentional and easy to inspect so CI, staging, and production do not quietly diverge.
Prove the changed selection in a non-production target first, then promote with the same selector and state logic instead of inventing a second deployment path.
Attach one short deployment summary that shows the state reference, selector, target, changed nodes, and approval or promotion result.

If another engineer cannot explain what will run before the deploy starts, the workflow is still too opaque. I care more about an inspectable deploy than one more clever wrapper.

Example

Imagine a pull request that changes fct_revenue and one downstream finance mart.

The CI step I want visible is short:

dbt build --select state:modified+ --state <prod-artifacts> --target ci

The command is short, but the deployment summary is what earns trust:

Field	Value
PR	#284
Changed model	fct_revenue
Downstream impact	mart_finance_revenue
State reference	manifest.json from the last successful production deploy
Selector	state:modified+
Target	ci
Changed nodes	fct_revenue, mart_finance_revenue
Approval	changed-node build reviewed after ci run passed
Production promotion	reused the same selector and state logic
Result	promoted without widening scope in production

That gives me enough context to approve production promotion.

I can see the production reference, selector, target, changed nodes, and approval path in one place. Most important, production promotion reused the same selection logic instead of inventing a second path at the promotion boundary.

If the same pull request had a green CI badge but no visible state reference, no selector, and no clear note about production promotion, I would still treat it as fragile. A green run is not the same as an explainable deploy.
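One way to make "reused the same selector and state logic" checkable is to compare the two invocations directly. The sketch below is a hypothetical helper, not part of dbt: it reads only the single-valued long forms of --select and --state from a command string, and the prod target name is an assumption. The <prod-artifacts> placeholder is kept as written.

```python
import shlex


def selection_args(command):
    """Return the --select and --state values from a dbt command string."""
    tokens = shlex.split(command)
    picked = {}
    # Walk adjacent token pairs so each flag is matched with the value after it.
    for flag, value in zip(tokens, tokens[1:]):
        if flag in ("--select", "--state"):
            picked[flag] = value
    return picked


ci_command = "dbt build --select state:modified+ --state <prod-artifacts> --target ci"
# Assumed promotion command: same selector and state, only the target differs.
prod_command = "dbt build --select state:modified+ --state <prod-artifacts> --target prod"
# A promotion that silently widens scope by dropping the selector.
widened_command = "dbt build --target prod"

same_scope = selection_args(ci_command) == selection_args(prod_command)
scope_changed = selection_args(ci_command) != selection_args(widened_command)
print(same_scope, scope_changed)  # True True
```

A check like this belongs in the deployment summary step, where a mismatch can block promotion before production moves.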

If the deployment note is explicit and a run still goes wrong, I move to the observability signals that name the failed boundary and check whether the missed cutoff is recoverable before I open “The incident note template I wish every analytics team used.”

Tradeoffs

Breaks when: the production manifest or state artifact is missing, stale, or hard to retrieve → Mitigation: publish the last-known production artifacts from CI and make the state reference part of the default deployment record.
Breaks when: dev, CI, and production targets drift in ways the deployment note does not surface → Mitigation: keep target configuration explicit, review the target assumptions in the same place as the dbt invocation, and avoid hidden environment-specific behavior.
Breaks when: state:modified+ hides a wider blast radius on a business-critical model → Mitigation: widen the selector deliberately for critical paths and make the extra scope an explicit deployment choice, not a surprise after promotion.
Breaks when: CI wrappers make dbt feel automated but nobody can tell which command, selector, or artifact actually ran → Mitigation: expose the exact dbt invocation and artifact references in the deployment summary so the run stays inspectable.

Close

Next step: For one business-critical dbt model, write the deployment summary you want to see before its next production promotion: state reference, selector, target, changed nodes, and approval.

The promotion is easier to trust when the green run can be tied back to the manifest, selector, target, changed nodes, and approval without archaeology.

A dashboard release checklist before a BI change goes live

Sun, 08 Mar 2026 00:00:00 GMT

A business-critical dashboard change is not ready just because the pull request looks tidy.

Before I approve a BI change, I want one small release artifact that says which slice was checked, what number should move, who signed off, and how I roll it back if the live result lands wrong.

It plays the same role for BI that the deployment summary I want for dbt Core promotions plays upstream: one compact artifact that makes scope and approval visible before promotion, so the leadership review is not the first serious review of the release.

Problem

A dashboard change can look small in code review and still change the number leadership sees in production.

One filter tweak, one rewritten chart calculation, or one quiet definition note can change the story leaders see at 08:00 ET. If nobody compares outputs on a trusted slice, records the expected delta, and names the rollback path, the team ends up explaining the number during the review instead of shipping it with confidence.

Default approach

Freeze the release scope first: which tiles, metrics, filters, and date windows are actually changing.
Compare before-and-after outputs on one validation slice the business owner already trusts, using the same date window and filters on both sides.
Read the filter logic and metric-definition changes line by line in the pull request or release note, not just in chart screenshots.
Write down the expected visible difference before the dashboard is republished.
Get explicit owner sign-off on the changed slice and the definition note.
Keep one rollback path ready before I call the release safe, including the revert step and the prior comparison snapshot.

This is release evidence, not polish review. A dashboard can look cleaner and still be less trustworthy if the release note never says which filter changed or why the number moved.

Example: the dashboard release checklist in the pull request

Here is the kind of release artifact I want attached to a revenue dashboard pull request before it goes live:

Dashboard release checklist

Field	Value
Dashboard	executive revenue scorecard
PR	#418
Change owner	analytics engineering
Business owner	finance analytics lead
Validation slice	2025-11-24 to 2025-12-02, US enterprise orders only
Change summary	Exclude sandbox and QA orders from the revenue filter; update the revenue mix chart to use settled net revenue; update the seven-day trend chart denominator to match the settled view

Review step	What I confirm	Status
Before/after output comparison	Headline revenue: $4,218,330 -> $4,201,980 (-0.39%), expected from sandbox-order removal; revenue mix by segment: enterprise share 61.2% -> 61.5%, expected; seven-day trend: two days move because the denominator now matches settled revenue	pass
Filter logic review	New exclusion: order_source in ('sandbox', 'qa'); no change to refunded-order handling; no date-window change outside the stated validation slice	pass
Definition note	Dashboard note added: revenue mix and seven-day trend now use settled net revenue instead of booked gross revenue; release note linked in the PR summary for downstream readers	pass
Owner sign-off	Finance analytics lead reviewed the changed slice at 16:10 ET; expected visible differences match the release note	approved
Rollback path	Revert PR #418 and republish the previous dashboard definition; keep the prior validation-slice snapshot in the release note for comparison	ready

That checklist earns its space because each line changes the next decision.

The before-and-after slice tells me whether the delta is understood. The filter review catches quiet scope changes. The definition note keeps the dashboard aligned with how the business talks about the metric. Owner sign-off makes the change explicit. The rollback path keeps the release safe if the live result still surprises us.
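The headline delta in the checklist is simple arithmetic, and it is worth recomputing rather than trusting a typed percentage. A minimal sketch, using the two headline figures from the checklist; the tolerance is a hypothetical value that an owner would set.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


headline_before = 4_218_330
headline_after = 4_201_980

observed_pct = round(percent_change(headline_before, headline_after), 2)
expected_pct = -0.39   # the delta written down before republishing
tolerance_pct = 0.05   # hypothetical tolerance agreed with the business owner

delta_is_expected = abs(observed_pct - expected_pct) <= tolerance_pct
print(observed_pct, delta_is_expected)  # -0.39 True
```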

If the release still lands wrong after that, I move into incident mode and keep “The incident note template I wish every analytics team used” open beside the debugging work. The point of the checklist is to make that handoff rare.

Tradeoffs

Breaks when: teams force the full checklist onto low-risk dashboards and make routine edits slower than they need to be → Mitigation: keep the strict version for executive and business-critical dashboards, and use a lighter release note for low-risk BI work.
Breaks when: there is no stable validation slice or baseline report to compare against → Mitigation: create one trusted slice before the release review starts, even if the first version is manual and narrow.
Breaks when: the filter or metric definition is still disputed when the release review starts → Mitigation: hold the go-live and resolve the definition in writing before anyone debates chart polish.
Breaks when: an urgent hotfix needs to land during an active incident → Mitigation: collapse the checklist to the trust boundary, owner approval, and rollback steps first, then backfill the release note after the number is safe again.

Close

Next step: For one business-critical dashboard, write the release checklist you want attached to its next logic or filter change before it reaches leadership.

The release conversation feels calmer when the expected delta and rollback path are visible before anyone is defending a surprising card.

The incident note template I wish every analytics team used

Sat, 07 Mar 2026 00:00:00 GMT

At 08:12 ET, I want one page open before I want another Slack thread.

Slack fills up, someone asks whether the leadership review is still safe, and half the useful evidence stays trapped in query tabs or terminal history.

At the first real check, I open a one-page incident note and start writing timestamps, evidence, and the current decision. I am not trying to document everything. I am trying to keep the incident legible while the KPI is still moving.

The goal is one visible source for the timeline, evidence, and current decision before the meeting clock takes over.

Problem

Imagine an executive revenue KPI is +11.8% versus the settled baseline at 08:05 ET. The leadership review starts at 08:30 ET.

At that point, I need more than the fix. I need one page that says whether the dashboard is safe, which checks already passed, what I think is broken, and when the next update goes out.

Without that note, the same questions get asked twice. Someone reruns a query I already checked. The review gets delayed for the wrong reason. By the time the issue is fixed, the cause and prevention item have already started to fade.

Default approach

Open the note as soon as the incident affects a real meeting, KPI, or decision.
Put impact, owner, next update time, and the current dashboard-safe decision at the top so nobody has to search for the current state.
Log checks in the order I run them, with one timestamped evidence line per check.
Keep hypothesis separate from confirmed cause so the note stays honest while the investigation is still moving.
Record the operating decision explicitly: hold the number, use yesterday’s settled snapshot, delay the review, or republish after the fix.
Close the note with one prevention item and one owner before I call the incident done.

If the run itself is late, I work from the five-signal observability panel for late analytics runs.

If the data landed but the number changed, I use “When a dashboard number changes, I check these four things first” to walk through freshness, row counts, joins, and metric logic.

Example

This is the kind of live note I want open during that revenue incident:

Field	Value
Incident	executive revenue KPI is +11.8% vs settled baseline
Started	2025-12-09 08:05 ET
Owner	analytics engineering
Impact	08:30 ET leadership review is not safe to run from the live dashboard
Current decision	use yesterday's settled snapshot until the dashboard is republished
Next update	08:20 ET
Timeline	08:05 ET alert from revenue dashboard and finance Slack thread; 08:12 ET freshness check passes, finance extract landed at 07:11 ET; 08:18 ET row-count check passes, fact_sales volume is within 0.8% of recent Tuesdays; 08:27 ET join check fails, unmatched promotion keys jump from 0.4% to 14.2%; 08:41 ET mapping fix deployed and model rebuilt; 08:50 ET dashboard republished
Checks run	Freshness: ok (latest finance snapshot landed on time); row counts: ok (fact_sales volume is stable); joins: off (promotion mapping dropped a material set of rows); metric logic: unchanged (no dashboard filter or definition edit since yesterday)
Hypothesis	New promotion override codes were not included in the morning dimension sync.
Confirmed cause	The morning dim_promotions sync excluded new override codes from the ERP feed.
Fix	Backfill the missing promotion codes, rebuild the model, and republish the dashboard.
Prevention	Add an unmatched-key alert on the promotion join before the next finance review.
Prevention owner	analytics engineering

That note is enough for the incident I am running. I can answer status questions, keep the operating decision visible, and avoid rebuilding the same timeline later.
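The prevention item in that note, an unmatched-key alert on the promotion join, can be sketched as a single rate query. This is a minimal sketch under assumptions: promotion_key is a hypothetical column name, the rows are invented, the threshold is a placeholder, and in-memory SQLite stands in for the warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table dim_promotions (promotion_key text);
    create table fact_sales (sale_id integer, promotion_key text);
    insert into dim_promotions values ('P1'), ('P2');
    insert into fact_sales values
        (1, 'P1'), (2, 'P1'), (3, 'P2'), (4, 'P2'), (5, 'OVR-1'),
        (6, 'P1'), (7, 'P2'), (8, 'P1'), (9, 'P2'), (10, 'P1');
    """
)

# Share of fact rows whose promotion key has no match in the dimension.
UNMATCHED_RATE = """
select 100.0 * sum(case when d.promotion_key is null then 1 else 0 end) / count(*)
from fact_sales as f
left join dim_promotions as d
  on f.promotion_key = d.promotion_key
"""

unmatched_pct = conn.execute(UNMATCHED_RATE).fetchone()[0]
ALERT_THRESHOLD_PCT = 2.0  # hypothetical threshold; set it from the normal rate
should_alert = unmatched_pct > ALERT_THRESHOLD_PCT
print(unmatched_pct, should_alert)  # 10.0 True
```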

I care less about the document itself than whether impact, evidence, decision, and prevention sit in one place while the KPI is still moving. If I already have the pipeline checks in place, this note becomes the record of what I actually saw, ruled out, and decided.

The headings stay simple on purpose: impact, owner, next update, timeline, checks run, hypothesis, confirmed cause, fix, and prevention. If a field does not change the next action or preserve evidence, I leave it out.

Tradeoffs

Breaks when: the note gets opened after the fix and turns into reconstructed memory → Mitigation: start the note on the first real check, even if the first version is only four lines.
Breaks when: the team writes paragraphs instead of evidence lines → Mitigation: keep one bullet per timestamp and one line per check.
Breaks when: a hypothesis hardens into a fake root cause because it was written too early → Mitigation: keep separate headings for hypothesis and confirmed cause.
Breaks when: the incident closes with no prevention owner → Mitigation: require one named follow-up item before marking the incident complete.

Close

Next step: Pick one business-critical KPI and write the headers for your live incident note before the next morning review turns noisy.

If the next analytics incident would still be reconstructed from chat, use one responder artifact—the timestamp, check, decision, and owner line—to make the first review easier.

Five pipeline observability signals before more orchestration

Sat, 07 Mar 2026 00:00:00 GMT

At 07:34 ET, I do not need another orchestration knob. I need signals that tell me whether the 08:00 ET review is still recoverable.

Before I add more orchestration, I want a small operating view: Did the source land? What was the latest successful publish? Did runtime drift? Does the output still look normal? Who owns the first check?

I keep that view separate from “Row counts are not enough: the checks I add before I trust a pipeline” because those checks live inside the data path. This panel is for the moment the publish is late, stale, or failed and I need the next investigation step fast.

If the panel says the publish completed and the output still looks wrong downstream, I move to “When a dashboard number changes, I check these four things first.”

Problem

Imagine a daily inventory_availability pipeline that needs to publish by 07:30 ET for an 08:00 ET operations review.

At 07:34 ET, the orchestrator shows a failed run. That matters, but it is not enough. I still need to know whether the source extract was late, whether the curated model slowed down, whether the latest publish is still yesterday, and whether there is a clear owner for the first investigation step.

When those answers are missing, the team starts shopping for another retry rule or dependency feature. I would rather make the missed run legible first.

Default approach

Start with one business-critical pipeline and one real business cutoff time.
Show one upstream handoff signal, usually source freshness with the latest landed timestamp.
Show one publish-completion signal, usually the latest successful publish or latest published partition timestamp.
Track run duration against a normal band, not just success versus failure.
Add one output-shape signal on the published model, such as row-count delta or unmatched-key rate.
Put the owner and the first runbook question in the same view so the handoff starts immediately.

I do not want a wall of task states. I want a boundary view. Each signal should change either the next investigation step or my confidence in the published output.

Example

Here is the five-signal panel I want for that inventory_availability run. The companion table view of the five-signal panel shows the same review surface:

Source extract freshness: expected landed by 06:10 ET; observed 06:08 ET
Latest successful publish: expected today by 07:30 ET; observed 2025-12-01 07:14 ET
Curated model duration: expected 12–15 minutes; observed 41 minutes
Output row-count delta: expected within +/-2%; observed +0.4%
Owner + first check: owner is platform analytics; first check is to inspect recent model changes and step runtime

That pattern tells me where to start.

The source landed on time, so I do not start with ingestion. The latest successful publish is still yesterday, so today’s data did not make it across the finish line. The output row-count delta is stable, so this does not look like a missing partition or obvious shape break. The inflated duration points me at the transform or publish path before I waste time debating retries, dependencies, or dashboard logic.

The audit query behind that panel is small on purpose:

select
  run_date,
  source_landed_at,
  published_at,
  duration_minutes,
  output_row_count
from pipeline_run_audit
where pipeline_name = 'inventory_availability'
order by run_date desc
limit 7;

I am not trying to build a perfect monitoring product. I want enough history to compare today’s run to the normal band, assign the first check cleanly, and keep the incident moving.
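Reading the panel comes down to comparing each observed value with its expected band. A minimal sketch of that comparison, using the panel's numbers; the run date is an assumption inferred from the latest publish being "yesterday", and the dictionary keys are hypothetical names.

```python
from datetime import date, time

# Assumed run date: one day after the latest successful publish on the panel.
run_date = date(2025, 12, 2)

observed = {
    "source_landed_at": time(6, 8),
    "latest_publish_date": date(2025, 12, 1),
    "duration_minutes": 41,
    "row_count_delta_pct": 0.4,
}

# One boolean per signal: True means the observed value is inside its band.
checks = {
    "source freshness": observed["source_landed_at"] <= time(6, 10),
    "publish completion": observed["latest_publish_date"] == run_date,
    "run duration": 12 <= observed["duration_minutes"] <= 15,
    "output shape": abs(observed["row_count_delta_pct"]) <= 2.0,
}

off_signals = [name for name, ok in checks.items() if not ok]
print(off_signals)  # ['publish completion', 'run duration']
```

The two signals that come back off are the ones that point at the transform or publish path.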

That is why I keep this operating view separate from the pipeline checks themselves. The checks live inside the data path. This panel is for the moment a run is late or fails and I need the next investigation step fast.

Tradeoffs

Breaks when: teams instrument every task in the DAG and bury the useful signals under noise → Mitigation: start with one pipeline tied to one business deadline and keep only the signals that change the first investigation step.
Breaks when: the orchestration UI shows task state but nothing about the published dataset → Mitigation: pair workflow state with one publish-completion signal and one business-output signal.
Breaks when: one platform-wide SLA hides different lateness patterns across feeds → Mitigation: define freshness windows per dataset and per business expectation, not as one generic cutoff.
Breaks when: alerts fire but nobody knows who owns the fix or what to check first → Mitigation: put the owner and the first runbook question in the same panel.

Close

Next step: Pick one business-critical pipeline and write down the five signals you need before the next missed SLA turns into a meeting.

If a cutoff was missed recently, use one pipeline’s five-signal panel to decide whether the next fix is ownership, freshness, publication evidence, or orchestration work.

How I validate a metric after a backfill

Sat, 07 Mar 2026 00:00:00 GMT

Before I rerun a backfill, I write down the slices that should move and the ones that should stay flat. That short note tells me whether the rebuild is correcting history or merely moving numbers around.

When I validate a metric after a backfill, I start with the blast radius, not the total: which months, cohorts, or plans should move, which ones should stay flat, and what I need to explain in the BI release checklist before the dashboard returns to production.

If metric meaning or row meaning is still fuzzy, a backfill exposes it fast.

Problem

Imagine I fix a billing bug that misclassifies invoice reversals in monthly subscription revenue. The code fix is small. The risky part comes next: I need to backfill the last 90 days and republish finance dashboards that leadership already used in prior reviews.

A 1.8% move in February is not automatically good or bad. The real question is whether the movement lands where I expected. If I cannot say which months, plans, or customer cohorts should change before I start the rebuild, I am not validating a metric. I am rerolling history and hoping the new total looks more credible.

Default approach

Freeze the metric definition, owner, and grain before I rerun anything.
Write a short backfill note that names the bug, the date window, the affected tables and dashboards, the expected movers, and the expected non-movers.
Snapshot the pre-backfill result so I can compare before and after instead of relying on memory, screenshots, or dashboard cache.
Compare the rebuild by stable slices such as billing month, plan type, or customer cohort, not just the top-line total.
Investigate two kinds of surprises: slices that moved when they should have stayed flat, and slices that stayed flat when they should have moved.
Publish the restatement boundary in one short note so downstream users know the rebuilt window, the expected movers, the expected non-movers, and when the rebuilt numbers become final.

That sequence gives me an explanation before I republish anything. It also helps me separate a real metric fix from unrelated model drift.

Example: a 90-day revenue backfill

Before I start the rebuild, I want a short note like this:

Field	Value
Metric	subscription_revenue_monthly
Owner	finance analytics
Grain	one billed_account_id per billing_month
Reason for backfill	invoice reversals were excluded from the monthly revenue adjustment logic
Backfill window	2025-08-01 through 2025-10-31
Expected movers	monthly plans with reversal activity, customer cohorts billed inside the 90-day window
Expected non-movers	annual prepaid plans, free plans, months before 2025-08
Dashboards affected	finance MRR review, monthly retention pack

Then I compare before and after by a stable slice:

billing_month	plan_type	revenue_before	revenue_after	delta	expected to move
2025-08	monthly	182400	190900	+8500	yes
2025-08	annual	264000	264000	0	no
2025-09	monthly	188100	195700	+7600	yes
2025-09	annual	271500	274900	+3400	no
2025-10	monthly	191800	199600	+7800	yes
2025-10	annual	279200	279200	0	no

The restatement boundary I want published beside the dashboard is short:

Restatement notice
- rebuilt window: 2025-08-01 through 2025-10-31
- expected movers: monthly plans with reversal activity inside that window
- expected non-movers: annual prepaid plans, free plans, months before 2025-08
- numbers outside this boundary should remain unchanged
- dashboard stays unpublished until unexpected movement is explained

The top-line move is directionally right. The rebuild is still not validated. September annual-plan revenue changed even though annual plans were outside the bug. That one mismatch is enough for me to stop and investigate.

I keep the comparison query simple on purpose:

with before_backfill as (
  select
    billing_month,
    plan_type,
    sum(revenue_usd) as revenue_before
  from finance.subscription_revenue_monthly_before
  group by 1, 2
),
after_backfill as (
  select
    billing_month,
    plan_type,
    sum(revenue_usd) as revenue_after
  from finance.subscription_revenue_monthly
  group by 1, 2
)
select
  coalesce(a.billing_month, b.billing_month) as billing_month,
  coalesce(a.plan_type, b.plan_type) as plan_type,
  coalesce(b.revenue_before, 0) as revenue_before,
  coalesce(a.revenue_after, 0) as revenue_after,
  coalesce(a.revenue_after, 0) - coalesce(b.revenue_before, 0) as delta
from before_backfill b
full outer join after_backfill a
  on a.billing_month = b.billing_month
 and a.plan_type = b.plan_type
order by 1, 2;

If I see that unexpected annual delta, I do not republish the dashboard yet. My next check is whether a dimension change or a plan-mapping fix got bundled into the same release, then I validate that unexpected slice on its own before I reopen the dashboard. That is how a clean metric correction gets confused with a wider model change.
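
To make both kinds of surprise explicit, here is a minimal Python sketch that replays the slices from the comparison table against the backfill note. The names `expected_to_move` and `find_surprises` are hypothetical, and the rule "monthly plans inside the rebuilt window" is my simplification of the note. A monthly slice with no reversal activity could legitimately stay flat, so treat a "flat" flag as a prompt to look, not a verdict.

```python
# Slices copied from the comparison table: (billing_month, plan_type, before, after).
SLICES = [
    ("2025-08", "monthly", 182400, 190900),
    ("2025-08", "annual", 264000, 264000),
    ("2025-09", "monthly", 188100, 195700),
    ("2025-09", "annual", 271500, 274900),
    ("2025-10", "monthly", 191800, 199600),
    ("2025-10", "annual", 279200, 279200),
]

# Assumption: the backfill note reduces to "monthly plans inside the rebuilt window".
WINDOW_MONTHS = {"2025-08", "2025-09", "2025-10"}
EXPECTED_MOVER_PLANS = {"monthly"}


def expected_to_move(billing_month, plan_type):
    """True when the backfill note says this slice should change."""
    return billing_month in WINDOW_MONTHS and plan_type in EXPECTED_MOVER_PLANS


def find_surprises(slices):
    """Return slices whose movement contradicts the backfill note."""
    surprises = []
    for billing_month, plan_type, before, after in slices:
        delta = after - before
        expected = expected_to_move(billing_month, plan_type)
        if delta != 0 and not expected:
            surprises.append((billing_month, plan_type, delta, "moved but should be flat"))
        elif delta == 0 and expected:
            surprises.append((billing_month, plan_type, delta, "flat but should have moved"))
    return surprises


surprises = find_surprises(SLICES)
for row in surprises:
    print(row)
```

With the table above, the only flagged slice is September annual, which is the one I would investigate before republishing.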

This is why I do not validate a metric after a backfill by comparing totals alone. A metric can land close to the expected overall number and still move the wrong cohort, the wrong plan, or the wrong time period.

If the backfill still produces unexplained movement after this comparison, I treat it like a fresh reliability problem. At that point I go back to the checks I add before I trust a pipeline or the incident note template I use when a dashboard number changes.

Tradeoffs

Breaks when: source history is incomplete, overwritten, or missing key fields needed to reconstruct the old logic → Mitigation: make the reconstruction limit explicit and say which periods or slices are no longer fully trustworthy.
Breaks when: a slowly changing dimension or mapping rule changed in the same deployment as the metric fix → Mitigation: isolate the logic repair from the dimensional change, or validate each one with its own before-and-after comparison.
Breaks when: the team only checks the total delta and never checks the shape of the change → Mitigation: compare by stable slices such as month, cohort, plan, or region before republishing dashboards.
Breaks when: downstream users assume historical numbers never move after month-end → Mitigation: publish a restatement window and label which outputs are settled versus still allowed to change.

Close

Next step: For one metric that can be restated, write the dates, cohorts, and dashboards you expect to move if you backfill the last 90 days.

That expected-movers list gives the rerun a boundary before the total delta becomes the whole argument.

Every important model needs an explicit grain

Sat, 07 Mar 2026 00:00:00 GMT

A model can look clean in SQL and still be unsafe to use. If nobody wrote down what one row means, the first many-to-many join turns a reasonable table into an argument.

I treat grain as a trust boundary, not a documentation chore. Before I trust a revenue sum or a dashboard card, I want one line that says what the row represents, which key should be unique, and which joins preserve that meaning.

If the row meaning is fuzzy here, the definition work that stops metric drift across dashboards starts from the wrong model. If the source rules are fuzzy too, I start earlier with a minimum data contract: expected keys, lateness window, and the rule for bad records.

Problem

Imagine fct_customer_orders is meant to hold one row per completed order_id. A dashboard request comes in for revenue by marketing channel, so someone joins that model directly to fct_sessions on customer_id.

The query runs. The chart even looks plausible. But one customer can have several sessions before one order, so booked revenue gets repeated across joined rows. The SQL is only part of the problem. The deeper issue is that the team cannot say, in one sentence, what the model is supposed to preserve.

Default approach

Write the grain in one line: one row equals which business entity, at which time boundary, and what state or event qualifies it.
Name the key or key combination that should be unique at that grain, and what should happen when duplicates appear.
List the joins that preserve the grain and the ones that require reshaping or pre-aggregation first.
Mark which measures are safe to sum, count, or average from the model.
Add one test that fails when the declared grain starts duplicating.

That is enough context to block most accidental many-to-many joins before they ship.
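
The last step, a test that fails when the declared grain starts duplicating, can be sketched in a few lines. This is a minimal example using Python's built-in sqlite3 module, not a replacement for whatever test framework the team already runs. The helper name `duplicate_keys` and the sample rows are made up for illustration, and one order is duplicated on purpose so the check has something to find.

```python
import sqlite3


def duplicate_keys(conn, table, key):
    """Return (key_value, row_count) pairs that break a one-row-per-key grain."""
    # Identifiers are interpolated into the SQL, so pass trusted names only.
    query = (
        f"select {key}, count(*) from {table} "
        f"group by {key} having count(*) > 1 order by {key}"
    )
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "create table fct_customer_orders (order_id integer, customer_id text, revenue_usd integer)"
)
conn.executemany(
    "insert into fct_customer_orders values (?, ?, ?)",
    # Illustrative rows; order 1002 is duplicated on purpose.
    [(1001, "C42", 120), (1002, "C77", 80), (1002, "C77", 80)],
)

dupes = duplicate_keys(conn, "fct_customer_orders", "order_id")
if dupes:
    print(f"grain test failed: duplicate order_id values {dupes}")
```

Listing the offending keys instead of only comparing two counts makes the first investigation step obvious when the test fails.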

Example

This is the lightweight grain note I would want next to a business-critical order model:

Field	Value
Model	fct_customer_orders
Declared grain	one row = one completed order_id
Expected unique key	order_id
Safe joins	dim_customers on customer_id (many orders to one customer), dim_dates on order_date (many orders to one date)
Unsafe without reshaping first	fct_sessions on customer_id, ad_clicks on customer_id
Safe measures from this model	sum(revenue_usd), count(order_id), avg(order_value_usd)
First grain test	fail if count(*) != count(distinct order_id)

Now imagine the data looks like this:

fct_customer_orders

order_id	customer_id	revenue_usd
1001	C42	120
1002	C77	80

fct_sessions

session_id	customer_id	channel
s1	C42	paid
s2	C42	email
s3	C42	organic
s4	C77	paid

A direct join can look innocent:

select
  s.channel,
  sum(o.revenue_usd) as revenue_usd
from fct_customer_orders o
join fct_sessions s
  on o.customer_id = s.customer_id
group by 1

If I run that, order 1001 shows up three times because customer C42 had three sessions. Revenue becomes 440 instead of 200.

The safer move is to reshape the session side to order grain before I bring it onto the order model. I pick one attribution rule, build a helper model with one row per order_id, and join that helper back to orders. Once both sides share grain, the revenue sum is safe again.
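
Here is a small sketch of that fan-out and the reshaped alternative, run against the sample rows above with Python's built-in sqlite3 module. The helper table name `order_channel` and the attribution rule (each order goes to the customer's lowest session_id, standing in for "first session") are assumptions for illustration only. The real rule is a business decision.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table fct_customer_orders (order_id integer, customer_id text, revenue_usd integer);
insert into fct_customer_orders values (1001, 'C42', 120), (1002, 'C77', 80);

create table fct_sessions (session_id text, customer_id text, channel text);
insert into fct_sessions values
  ('s1', 'C42', 'paid'), ('s2', 'C42', 'email'), ('s3', 'C42', 'organic'), ('s4', 'C77', 'paid');
""")

# Direct join: every session row repeats the order's revenue.
fanned_out = conn.execute("""
  select sum(o.revenue_usd)
  from fct_customer_orders o
  join fct_sessions s on o.customer_id = s.customer_id
""").fetchone()[0]

# Helper at order grain: one row per order_id.
# Hypothetical rule: attribute each order to the customer's lowest session_id.
conn.executescript("""
create table order_channel as
select o.order_id, s.channel
from fct_customer_orders o
join (
  select customer_id, min(session_id) as session_id
  from fct_sessions
  group by customer_id
) f on f.customer_id = o.customer_id
join fct_sessions s on s.session_id = f.session_id;
""")

# Both sides now share order grain, so the sum is safe.
by_channel = conn.execute("""
  select c.channel, sum(o.revenue_usd)
  from fct_customer_orders o
  join order_channel c on c.order_id = o.order_id
  group by 1
  order by 1
""").fetchall()

print(fanned_out)  # 440
print(by_channel)  # [('paid', 200)]
```

The channel split depends entirely on the attribution rule, but the total stays at 200 under any rule that keeps one row per order_id.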

If row counts stay roughly stable while duplicates sneak into the declared key, I fall back to the check set I use before I trust a pipeline. Stable volume does not protect me from grain drift.

Tradeoffs

Breaks when: a legacy model already mixes order, order-line, and session logic in one table → Mitigation: write the current grain note first, mark unsafe measures clearly, and split the highest-risk use case into a cleaner model before attempting a full rewrite.
Breaks when: source keys are unstable and duplicates are normal upstream behavior → Mitigation: quarantine raw duplicates, state that the curated model is not yet authoritative, and tighten the source rule before downstream teams start summing it.
Breaks when: one business question genuinely needs two grains, such as order conversion and session behavior in the same analysis → Mitigation: keep separate models for each grain and join only after pre-aggregating both sides to the decision grain.
Breaks when: code review checks SQL syntax and row counts but never checks key duplication → Mitigation: add one uniqueness test on the declared grain and treat failures there as trust failures, not minor cleanup.

Close

Next step: Pick one business-critical model and write down the grain, expected key, and one join you would block before the next pull request touches it.

Which model on your team is one quiet change away from a mixed-grain table that downstream joins still trust?

The definition card I use to stop metric drift across dashboards

Sat, 07 Mar 2026 00:00:00 GMT

When the same metric starts arguing with itself across dashboards, I do not start with the chart. I start with the definition.

One number should mean one thing for one decision. If finance, product, and operations need different answers, I would rather name those differences early than let one label drift across three dashboards.

Before I compare dashboards, I want one owned definition in writing and one canonical card that every dashboard can point back to.

If freshness, row loss, or join behavior are still in doubt, I run the pipeline checks and the dashboard triage sequence first, before I call it metric drift.

Problem

A metric can drift without any pipeline break. The data lands on time, row counts look normal, and the dashboard still tells two different stories.

Imagine a monthly review where finance shows 18,240 active customers and product shows 24,910. If both charts are labeled active_customer, the meeting turns into a debate about whose dashboard is wrong. Most of the time, the real problem is older than the chart: nobody wrote down the owner, grain, exclusions, and change history in one place.

Default approach

Tie the metric to one decision and one owner before I call it reusable.
Write the grain in one line: what one counted entity represents and over what time window. If the underlying model still has ambiguous row meaning, I fix that first in Every important model needs an explicit grain.
Write the inclusion and exclusion rules in plain language, not just SQL, so a reviewer can tell what the metric keeps out.
Publish one canonical definition location, even if the dashboards live in different tools. Keep it next to the owned metric definition, semantic layer, or dashboard review path where people already verify the number, then point each dashboard subtitle, review doc, or release note back to that same card.
Keep a short change log for metric logic changes before they ship, including the date, owner, what changed, why it matters for the decision, and which dashboards need the update.
Split the metric into differently named versions when two teams need different logic.

Example

Here is the definition card I want every dashboard to point back to once a metric shows up in more than one dashboard:

Field	Value
Dashboard label in use today	active_customer
Canonical definition card	metrics/active_customer

Record	decision	owner	grain	includes	excludes	source	last change
finance_active_customer_monthly	monthly revenue retention review	finance analytics	one billed_account_id per calendar month	account has at least one paid invoice in the month	trial-only accounts, fully refunded invoices	fct_invoice_monthly	2025-10-12 excluded fully refunded invoices
product_active_customer_28d	product engagement planning	product analytics	one account_id with activity in the trailing 28 days	account has at least one core feature event	internal test accounts	fct_product_activity_daily	2025-10-28 switched from session_start to core_feature_event

I do not leave that card floating in a wiki page nobody opens. I keep one canonical copy next to the owned metric definition or dashboard review path, then make the dashboard point back to that card in the subtitle, release note, or review doc. Each change entry records the date, owner, what changed, why it matters for the decision, and which dashboards still need to move.

Finance and product are not disagreeing about arithmetic. They are answering different questions under the same label.
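
If the cards live somewhere machine-readable, spotting a label that carries two definitions can be automated. The sketch below is one possible shape, not a prescribed schema: the `MetricDefinition` class and `overloaded_labels` function are hypothetical names, and the fields are a subset of the card above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDefinition:
    record: str           # uniquely named definition
    dashboard_label: str  # label the dashboards show today
    decision: str
    owner: str
    grain: str


# Values copied from the definition card.
CARDS = [
    MetricDefinition(
        record="finance_active_customer_monthly",
        dashboard_label="active_customer",
        decision="monthly revenue retention review",
        owner="finance analytics",
        grain="one billed_account_id per calendar month",
    ),
    MetricDefinition(
        record="product_active_customer_28d",
        dashboard_label="active_customer",
        decision="product engagement planning",
        owner="product analytics",
        grain="one account_id with activity in the trailing 28 days",
    ),
]


def overloaded_labels(cards):
    """Return labels shared by more than one definition, with the records behind each."""
    by_label = defaultdict(list)
    for card in cards:
        by_label[card.dashboard_label].append(card.record)
    return {label: records for label, records in by_label.items() if len(records) > 1}


print(overloaded_labels(CARDS))
```

For the card above, this reports `active_customer` as carrying both the finance and product records, which is the cue to split the label or pick one owned definition.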

Once I see that, I stop asking which dashboard is right. The first decision is whether we need two metrics or one owned definition. If leadership truly needs one shared active_customer, I pick the decision first, write one owned definition, and retire the competing label. I do not average two dashboard calculations into a compromise.

Once that shared definition changes, I still want to control the rollout and any historical rebuild separately. If the new definition is ready to ship across live dashboards, I want a short release check—labels, filters, and cutover note—before I trust the rollout. If the change also restates history, How I validate a metric after a backfill is the next check I use before I trust the rebuilt number.

Tradeoffs

Breaks when: one label is forced to serve incompatible decisions across teams → Mitigation: split the metric into separately named definitions and assign an owner to each one.
Breaks when: the definition exists in a doc but never gets updated when the logic changes → Mitigation: keep the change log next to the model or dashboard review path and require it in release checks.
Breaks when: a legacy metric already appears in too many dashboards to rename in one pass → Mitigation: declare one canonical definition, note the cutoff date, and migrate the highest-risk dashboards first.
Breaks when: a team buys semantic-layer tooling before it agrees on owner, grain, and exclusions → Mitigation: start with a small definition card for one business-critical metric and expand only after the rules hold.

Close

Next step: Pick one metric that appears in more than one dashboard and write down the owner, grain, time window, inclusion rules, exclusions, and last logic change before the next review.

If your team is untangling metric drift right now, which metric label is still carrying two different definitions into the same review?

When a dashboard number changes, I check these four things first

Fri, 06 Mar 2026 00:00:00 GMT

When a dashboard number moves without warning, I need the first answer fast: which layer changed—freshness, row counts, joins, or metric logic?

I do not start by defending the number. I start by isolating the layer that changed.

That first pass stays in this order—freshness, row counts, joins, and metric logic—because each check rules out a whole class of failure quickly.

It breaks when there are no baselines or no clear owner. In that case I say that early and create those baselines before I promise a final answer.

Problem

Imagine available inventory is down 18% on Monday morning. The dashboard may be reacting to a late source, lost model rows, a join that dropped valid records, or a metric-logic change.

Those are different failure modes. If I treat them as one problem, the incident gets slower and the conversation gets noisy.

Default approach

Check freshness first: did the source land on time, did the job finish cleanly, and is the latest successful publish recent enough to trust?
Check row counts next: did the main model lose or gain more volume than expected versus recent baselines?
Check join behavior after that: did a dimension change create duplicate, unmatched, or mis-mapped rows?
Check metric logic last: did someone change filters, time windows, exclusions, or semantic rules on purpose, and did that change go through a dashboard release check before it hit production?
Send a short status update once I know which layer is failing and what I am checking next.

That update keeps the conversation calm and scoped. This is the copy/paste template I use:

Status: investigating <metric> <up/down X%> vs <baseline> (as of <time>)

Freshness: <ok/late> — <one evidence line>
Row counts: <ok/off> — <one evidence line>
Joins: <ok/off> — <one evidence line>
Metric logic: <ok/changed> — <one evidence line>

Hypothesis: <one sentence>
Next check: <one sentence>
Next update: <time>

When the issue affects a meeting or decision, I keep The incident note template I wish every analytics team used open beside these checks so the timeline, evidence, and next update stay in one place.

If freshness, row counts, joins, and metric logic all come back clean and the disagreement remains, I stop treating it like a pipeline incident. My next check is the owned metric definition and any recent BI release evidence.

Example

Here is a simple version of the sequence for that Monday inventory drop:

Field	Value
Symptom	available inventory is down 18%

Review step	What I confirm	Status
Freshness	inventory snapshot extract completed at 07:12 ET; latest snapshot date matches expectation	—
Row counts	stg_inventory_snapshot row count is down 1.4% vs recent Mondays; not enough to explain the full drop	—
Join behavior	unmatched location_id values jump from 0.3% to 14.8%; available units disappear after the join to location attributes	—
Metric logic	no intentional dashboard filter or definition change	—

Conclusion: the issue is missing location mappings, not a true inventory decline

The query I want ready for this step is usually simple:

select
  count(*) as total_rows,
  sum(case when l.location_id is null then 1 else 0 end) as unmatched_rows
from fct_inventory_snapshot i
left join dim_locations l
  on i.location_id = l.location_id;

That sequence usually tells me which layer is failing in one pass. I like checks like this because I can run them at 08:05 and explain the result in one calm Slack update.

The controls I want in place are boring on purpose: load completion time, row-count thresholds, and one unmatched-key check on the model that feeds the dashboard. Boring is good during an incident.
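
As a rough sketch of how those controls feed the four-check order, the snippet below walks the checks in sequence and returns the first layer that fails. The function name `first_failing_layer` and both thresholds are hypothetical. In practice the thresholds should come from recent baselines, not from constants like these.

```python
# Triage order from the post: freshness, row counts, joins, metric logic.
CHECK_ORDER = ("freshness", "row counts", "joins", "metric logic")

# Hypothetical thresholds; real ones should come from recent baselines.
ROW_COUNT_TOLERANCE_PCT = 5.0
UNMATCHED_KEY_MAX_PCT = 1.0


def first_failing_layer(results):
    """Return (layer, evidence) for the first check that is not ok, or None."""
    for layer in CHECK_ORDER:
        ok, evidence = results[layer]
        if not ok:
            return layer, evidence
    return None


# Evidence lines taken from the Monday inventory example.
results = {
    "freshness": (True, "snapshot extract completed at 07:12 ET"),
    "row counts": (
        abs(-1.4) <= ROW_COUNT_TOLERANCE_PCT,
        "stg_inventory_snapshot down 1.4% vs recent Mondays",
    ),
    "joins": (
        14.8 <= UNMATCHED_KEY_MAX_PCT,
        "unmatched location_id at 14.8%, up from 0.3%",
    ),
    "metric logic": (True, "no intentional filter or definition change"),
}

failing = first_failing_layer(results)
print(failing)
```

For the inventory example this points at the join layer, and the evidence string can go straight into the status update.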

If freshness is bad because the run missed its cutoff, I move to the five-signal observability panel for missed analytics cutoffs.

If the four checks are clean and the disagreement remains, I move to The definition card I use to stop metric drift across dashboards.

Tradeoffs

Breaks when: there is no baseline for freshness or row-count shifts → Mitigation: start logging daily load times and row counts for critical models, even if the first version is manual.
Breaks when: the metric definition lives in multiple places across SQL, dbt, and BI → Mitigation: choose one owned definition and point dashboards back to it.
Breaks when: late-arriving data is normal for the business → Mitigation: compare against the same latency window instead of the final settled number.
Breaks when: nobody owns the upstream model or dashboard metric → Mitigation: assign one owner for the next incident before the memory of this one fades.

Close

Next step: Before the next incident, pick one important metric and write down the freshness, row-count, join, and logic checks you expect before anyone asks questions about it.

If the next KPI move would still send everyone to Slack first, use that four-check order as the first triage note before rewriting SQL.

Row counts are not enough: the checks I add before I trust a pipeline

Fri, 06 Mar 2026 00:00:00 GMT

Row counts are a smoke test, not a pass. I have seen models land at the usual volume while a join key turns null, duplicates creep in, or one business category disappears without tripping the row-count alert.

Before I trust a published model, I want five checks inside the publish path: freshness, row counts, uniqueness, null rate, and one shape check tied to business risk.

These five checks live inside the publish path, so they are a different job from the operating view I watch when a run is already late or stale. If these five checks pass and a number still moves where the business can see it, I switch to When a dashboard number changes, I check these four things first.

If a check cannot change a response, it usually does not earn a slot.

Problem

Imagine fct_shipments still lands about 1.2 million rows after an upstream schema change. At first glance the pipeline looks healthy.

But the source team renamed a warehouse mapping field, the transform still runs, and 22% of rows now carry a null warehouse_id. The row count did its job; it just did not protect the downstream metric.

That kind of miss shows up later as a regional fill-rate problem, a missing warehouse view, or a dashboard nobody trusts. I would rather catch it in the pipeline than explain it in a meeting.

Default approach

Check freshness first: did the extract land on time, did the model finish when I expected, and is the latest successful publish recent enough to trust?
Check row counts next to rule out a big volume shift. If the count looks normal, keep going. Stable volume is not a pass when keys can go null, duplicates can creep in, or one category can quietly disappear.
Check uniqueness on the keys that drive downstream joins, mappings, or counts.
Check null rate on fields that would break joins, mappings, or business logic if they go missing.
Check one accepted-values or distribution rule on a high-risk column so I catch shape changes, not just missing rows.

After those checks, each one needs an owner and a first response step.

Example

Here is the kind of failure I want the pipeline to catch before a dashboard does:

Field	Value
Model	fct_shipments

Record	row count	shipment_id	warehouse_id null rate	status values	regional split
Expected daily shape:	~1.2M	unique	< 0.5%	shipped, cancelled, corrected	east 34%, central 29%, west 37%
After upstream schema change:	1.19M	still unique	22.4%	unchanged	west volume collapses because null warehouse_id cannot map to region

I keep the first check query simple on purpose:

select
  count(*) as row_count,
  count(distinct shipment_id) as distinct_shipment_id,
  round(100.0 * avg(case when warehouse_id is null then 1 else 0 end), 2) as warehouse_id_null_pct
from fct_shipments;

On very large tables, I run the distinct check on the newest partition or a rolling window, then schedule a deeper scan less often.

Then I add one quick shape check next to it:

select
  region,
  count(*) as shipments
from fct_shipments
group by 1
order by 2 desc;

If the row count is stable but one region disappears, I do not need a long debate about whether the pipeline is healthy. I already know which mapping layer to inspect.
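
The shape query only helps if something compares its output to an expectation. Here is a minimal sketch of that comparison, using the expected regional split from the table above. The function name `region_shape_failures`, the 10-point tolerance, and the per-region counts are all made up for illustration. The counts are chosen only to match a 1.19M-row load with 22.4% of rows unmapped.

```python
# Expected regional split from the "expected daily shape" row.
EXPECTED_SHARE = {"east": 0.34, "central": 0.29, "west": 0.37}

# Hypothetical tolerance: flag a region that drifts more than 10 percentage points.
MAX_SHARE_DRIFT = 0.10


def region_shape_failures(counts, expected=EXPECTED_SHARE, max_drift=MAX_SHARE_DRIFT):
    """Return {region: actual_share} for regions outside the allowed drift."""
    total = sum(counts.values())
    failures = {}
    for region, expected_share in expected.items():
        actual = counts.get(region, 0) / total if total else 0.0
        if abs(actual - expected_share) > max_drift:
            failures[region] = round(actual, 3)
    return failures


# Illustrative output of the region query after the schema change.
# None stands for rows whose null warehouse_id could not map to a region.
counts_after_change = {"east": 404_600, "central": 345_100, "west": 173_700, None: 266_600}

failures = region_shape_failures(counts_after_change)
print(failures)
```

With these numbers, only west is flagged, even though the total row count is within 1% of normal.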

If these checks pass and two dashboards still disagree, I stop debugging the pipeline and move to definition ownership. The definition card I use to stop metric drift across dashboards is the handoff I make next.

I want a small number of business-aware checks, not a long catalog of generic ones. A null-rate check on warehouse_id earns its place. Ten low-risk checks on columns nobody uses usually do not earn a slot.

Tradeoffs

Breaks when: every column gets the same battery of checks → Mitigation: start with fields tied to joins, filters, revenue, inventory, or service-level reporting.
Breaks when: thresholds are copied from another model with different behavior → Mitigation: set baselines from recent history and review them when the source changes.
Breaks when: alerts fire but nobody knows the first investigation step → Mitigation: attach each check to an owner and a short runbook question such as “freshness, join, or mapping?”
Breaks when: teams assume schema tests cover everything important → Mitigation: add one or two data-shape checks that reflect real business failure modes, not just table structure.
Breaks when: full-table uniqueness checks are too expensive to run on every build → Mitigation: check uniqueness on the newest partition or a rolling window, and run a deeper scan on a slower cadence.

Close

Next step: Pick one business-critical model and add one freshness check, one volume check, one uniqueness check, one null-rate check, and one business-shape check before the next source change lands.

If a row-count check already passes while the business still distrusts the output, use one failed shape check from that model to decide which test deserves ownership first.

The 6-part data contract I want before I trust a source table

Fri, 06 Mar 2026 00:00:00 GMT

A new source table can look useful long before it is safe to trust. If nobody agrees on row meaning, key rule, landing cadence, and valid-record handling, the first downstream model bakes in assumptions nobody approved.

Before I add the checks from Row counts are not enough: the checks I add before I trust a pipeline, I want a short six-part source contract:

row meaning
key rule
landing cadence
valid-record handling
correction window
owner

It is small enough to agree on quickly and specific enough to stop downstream guessing.

Even when the source is exploratory or changing fast, I still write the short version and label the trust boundary up front. If the row meaning is still fuzzy after that contract pass, I usually need an explicit grain note before the downstream model is safe to reuse.

Problem

Imagine a warehouse integration starts sending a shipment_events table for on-time-in-full (OTIF) and fill-rate reporting. Monday’s load looks normal. By Tuesday, planners see duplicate shipment corrections in one lane and a small set of rows with no warehouse_id.

At that point, the analytics team can keep patching transforms around ambiguous source behavior. The better move is to stop and write the contract that says what a row means, which records are valid, and how long corrections can restate history. I have learned that this is cheaper than repairing the metric layer later.

Default approach

Define the row meaning first: one row should mean one shipment event, one shipment line, or one snapshot record, not a mix of all three.
Define the key rule next: which column or column set should be unique, and when can retries or duplicates appear?
Define landing cadence and lateness: how often should the table land, and when is it officially late?
Define valid-record rules for critical fields and the response when they fail: quarantine, block publish, or allow with a flag.
Define the correction window: can rows be updated or deleted, how are corrections flagged, and when does history stop moving?
Define the owner: who confirms the rule and who answers first when a check fails?

Example

Here is how I would turn those six parts into a lightweight contract for a new shipment_events table before I use it in executive reporting. I keep the contract in the same six-part order so the field-level rules line up with the actual checks.

Field	Value
Table	shipment_events
Use case	OTIF and fill-rate reporting

Contract field	Expectation	Why it matters
row meaning	one row = one shipment event	stops grain drift downstream
shipment_event_id	unique per source event	retry duplicates do not inflate counts
landing cadence	daily load by 06:00 UTC	tells me when freshness is late
event_ts	always present in UTC	defines lateness and daily bucketing
warehouse_id	non-null, valid mapped warehouse	joins do not silently drop records
sku	non-null product identifier	shipment and inventory logic can reconcile
quantity	signed numeric value, never null	corrections and reversals stay explainable
event_type	allowed set: ship, cancel, correct	event flow is explicit
invalid record action	quarantine rows missing warehouse_id or sku	bad records do not leak into KPI tables
change behavior	corrections may arrive within 72 hours	downstream models know history can move
owner	WMS integration team	someone can confirm or fix the rule

The questions I want settled are simple: can retries duplicate raw events, when is the daily load officially late, how long can corrections restate history, and do rows without warehouse_id get quarantined or allowed through with a flag? Those answers belong before the first KPI review, not during it.

Once that contract exists, I can turn it into boring controls: a freshness check on landing time, a uniqueness check on shipment_event_id, null checks on warehouse_id and sku, accepted values on event_type, and a publish rule that keeps quarantined rows out of executive reporting.
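
A minimal sketch of those controls as a single pass over incoming rows is below. The contract only says that rows missing warehouse_id or sku are quarantined. Quarantining retry duplicates and unexpected event_type values the same way is my assumption here, and the function name `apply_contract` and the sample rows are hypothetical.

```python
ALLOWED_EVENT_TYPES = {"ship", "cancel", "correct"}


def apply_contract(rows):
    """Split rows into (publishable, quarantined) using the contract rules."""
    publishable, quarantined, seen_ids = [], [], set()
    for row in rows:
        event_id = row.get("shipment_event_id")
        reasons = []
        if not row.get("warehouse_id"):
            reasons.append("missing warehouse_id")
        if not row.get("sku"):
            reasons.append("missing sku")
        # Assumption: unexpected event types and retry duplicates are also quarantined.
        if row.get("event_type") not in ALLOWED_EVENT_TYPES:
            reasons.append("unexpected event_type")
        if event_id in seen_ids:
            reasons.append("duplicate shipment_event_id")
        seen_ids.add(event_id)
        if reasons:
            quarantined.append((event_id, reasons))
        else:
            publishable.append(row)
    return publishable, quarantined


# Made-up sample rows covering each rule.
rows = [
    {"shipment_event_id": "e1", "warehouse_id": "W01", "sku": "SKU-1", "event_type": "ship", "quantity": 10},
    {"shipment_event_id": "e2", "warehouse_id": None, "sku": "SKU-1", "event_type": "ship", "quantity": 4},
    {"shipment_event_id": "e1", "warehouse_id": "W01", "sku": "SKU-1", "event_type": "ship", "quantity": 10},
    {"shipment_event_id": "e3", "warehouse_id": "W02", "sku": "SKU-2", "event_type": "return", "quantity": -2},
]

publishable, quarantined = apply_contract(rows)
print([row["shipment_event_id"] for row in publishable])
print(quarantined)
```

Keeping the reason next to each quarantined row gives the owner a first question to answer instead of a bare failure count.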

I want that contract next to the ingestion checks or source tests, not in a wiki that drifts out of date.

Without that contract, the downstream model is not really tested against the failure modes that matter. It is turning upstream ambiguity into downstream logic, which is usually when I need an explicit grain note just to make the row meaning visible again.

Tradeoffs

Breaks when: the source is exploratory and the schema changes every week → Mitigation: start with row meaning, landing cadence, and owner, then label the table non-authoritative until the field-level rules stabilize.
Breaks when: the upstream team cannot guarantee uniqueness yet → Mitigation: land the raw table separately, quarantine duplicates, and keep business-facing models off it until the key behavior stabilizes.
Breaks when: late corrections are normal for the business → Mitigation: publish a restatement window and a freeze rule so downstream users know which dates can still move and when the numbers become final.
Breaks when: the contract document drifts away from actual source behavior → Mitigation: keep the contract beside ingestion checks or source tests and review it when the source logic changes.

Close

Next step: Pick one critical source table and write the six contract lines that stop downstream guessing: row meaning, key rule, landing cadence, valid-record rules, correction window, and owner.

If a source table keeps turning into downstream detective work, use those six lines to name the one missing boundary before another model inherits the guess.

How I choose an analytics stack I can debug

Thu, 05 Mar 2026 00:00:00 GMT

I choose an analytics stack I can debug under pressure, not the biggest stack on the market.

When a number moves, I want to trace it to the last landed extract, the logic change, the metric definition, and the owner of the next check within minutes.

My default shape is simple: land data with checks, transform in version control, test decision-changing models, publish metric definitions with an owner, and keep a small operating view for late or failed runs.

Inventory, orders, and finance have different grains and failure modes, but I want the same handholds in every incident. That operating model matters more than the vendor names.

A concrete example

If an in-stock dashboard drops across a regional warehouse network on Monday morning, I want three answers before I touch the dashboard:

Did the latest inventory snapshot land on time?
Did the transform keep the expected SKU-location row counts?
Did the in-stock logic or location mapping change, or did the inventory position really change?

The stack earns its place when those answers are visible under pressure. It fails when custom glue, silent logic changes, or ownership gaps only show up once the room is already reacting.

If the snapshot is late, I stop at ingestion. If it landed and row counts still look normal, I move to logic and mapping before I argue with the dashboard.
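
That stop order can be written down as a small decision function. The sketch below is hypothetical: the `SnapshotStatus` fields, the 5% row-count tolerance, and the returned labels are assumptions. Routing an abnormal row count to the transform is one reading of the second question above, not a rule stated in this post.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class SnapshotStatus:
    landed_at: Optional[datetime]  # None means nothing has landed yet
    expected_by: datetime
    row_count: int                 # SKU-location rows after the transform
    typical_row_count: int


def first_stop(status: SnapshotStatus, tolerance: float = 0.05) -> str:
    """Name the layer to inspect first, before touching the dashboard."""
    # Late or missing snapshot: stop at ingestion.
    if status.landed_at is None or status.landed_at > status.expected_by:
        return "ingestion"
    # Landed on time, but the row count moved more than the tolerance.
    drift = abs(status.row_count - status.typical_row_count) / max(
        status.typical_row_count, 1
    )
    if drift > tolerance:
        return "transform"
    # Landed on time with normal row counts: look at logic and mapping.
    return "logic_and_mapping"


deadline = datetime(2026, 3, 2, 6, 0)
on_time = SnapshotStatus(
    landed_at=deadline - timedelta(minutes=20),
    expected_by=deadline,
    row_count=48_100,
    typical_row_count=48_000,
)
print(first_stop(on_time))  # logic_and_mapping
```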

I wrote up that first debugging pass in "When a dashboard number changes, I check these four things first."

What I optimize for

I optimize for clear boundaries and fast evidence. Each layer should tell me where to look next when a KPI changes or a run misses SLA.

Clear boundaries between ingestion, transformation, semantic logic, and reporting
Versioned SQL, model tests, and diffable logic on decision-changing models
Checks that separate freshness, row volume, logic, and ownership problems
Enough observability to show the latest publish, run duration, and owner
Tooling a team can run without routing every small change through a specialist

How I evaluate each layer

When I compare tools, I ask what evidence each layer gives me when a number moves or a run misses SLA.

That filter matters more than feature count.

Ingestion: I want the last landed batch, a freshness signal, and the source contract or quarantine rule for bad records.
Transformation: I want versioned SQL, tests on critical models, and a fast way to diff logic changes.
Orchestration and operating view: I want the latest successful publish, run duration versus normal, and an owner for the next check.
BI or semantic layer: I want explicit metric definitions, visible filter logic, and fewer places for one KPI to fork into three meanings.

If a product makes automation easier but hides those artifacts, I usually skip it. I want the shortest path from symptom to cause and a clear owner at each boundary.

When transformation is visible but promotion still feels opaque, I want a boring deployment record I can read quickly: state, selector, and target.

Before I add workflow complexity, I ask whether a late run is already explainable from a small operating view: source freshness, latest successful publish, run duration, output shape, and a clear owner for the first check.
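
A minimal sketch of such an operating view follows, assuming one record per pipeline run. The `RunView` fields, the 24-hour age limits, the 1.5x duration factor, and the 5% shape tolerance are illustrative assumptions, not recommended thresholds.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class RunView:
    source_landed_at: datetime
    last_successful_publish: datetime
    run_minutes: float
    typical_run_minutes: float
    output_rows: int
    typical_output_rows: int
    first_check_owner: str


def explain_late_run(
    view: RunView,
    now: datetime,
    max_source_age: timedelta = timedelta(hours=24),
    max_publish_age: timedelta = timedelta(hours=24),
    slow_factor: float = 1.5,
    shape_tolerance: float = 0.05,
) -> List[str]:
    """List which of the five signals looks abnormal, in a fixed order."""
    findings = []
    if now - view.source_landed_at > max_source_age:
        findings.append("source is stale")
    if now - view.last_successful_publish > max_publish_age:
        findings.append("latest successful publish is older than expected")
    if view.run_minutes > slow_factor * view.typical_run_minutes:
        findings.append("run is slower than normal")
    drift = abs(view.output_rows - view.typical_output_rows) / max(
        view.typical_output_rows, 1
    )
    if drift > shape_tolerance:
        findings.append("output shape drifted")
    if not view.first_check_owner.strip():
        findings.append("no owner for the first check")
    return findings


now = datetime(2026, 3, 2, 9, 0)
view = RunView(
    source_landed_at=datetime(2026, 3, 2, 5, 0),
    last_successful_publish=datetime(2026, 2, 28, 7, 0),
    run_minutes=95,
    typical_run_minutes=40,
    output_rows=10_000,
    typical_output_rows=10_100,
    first_check_owner="analytics on-call",
)
for finding in explain_late_run(view, now):
    print(finding)
```

If a view like this already names the abnormal signal and the owner, extra workflow tooling has less to add.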

When dashboards still disagree on a KPI, the problem is usually definition drift, not infrastructure. The handoff goes to the owner of the metric definition, not the platform team.

Tradeoffs

Breaks when ingestion arrives without stable keys, lateness rules, or contract expectations. Mitigation: add source-facing checks, quarantine bad records, and delay downstream metrics instead of patching reports by hand.
Breaks when one platform claims to own ingestion, transformation, semantic logic, and alerting with no clear handoffs. Mitigation: keep boundaries explicit so failures stay visible and each layer has an owner.
Breaks when a platform decision is used to hide missing metric ownership or unclear model grain. Mitigation: fix the definition and modeling gaps first, because no tool choice will remove that disagreement.
Breaks when the business really needs heavy real-time behavior or event-driven fan-out. Mitigation: keep the analytical path simple and add streaming components only where latency changes the decision.

Close

Next step: Take one important metric and map it from source to dashboard, including the checks you expect at each boundary.

The useful test is whether a late run can still be traced from symptom to artifact to owner without the stack hiding responsibility.

What I write about: reliable analytics systems that stay explainable

Thu, 05 Mar 2026 00:00:00 GMT

I write about the checks, decisions, and handoffs I use when a KPI moves, a pipeline slips, or two dashboards disagree.

If I cannot point to the query, release step, owner, or failed check behind a claim, I leave it out.

What I cover

Pipeline checks that catch lateness, row loss, join failures, and broken publishes before the dashboard takes the hit
Modeling patterns that keep row meaning, metric definitions, and dashboard numbers legible as the warehouse and team grow
Delivery habits like release checklists, incident notes, ownership boundaries, and debugging order that hold up under pressure

A concrete example

Say Monday revenue drops in an executive dashboard and nobody expected it.

I do not start with the chart. I start with the smallest checks that narrow the failure.

Confirm the source landed on time.
Compare row counts and key joins in the model behind the chart.
Check whether the metric logic or filter changed on purpose.
Write down the current hypothesis before Slack fills with partial explanations.

That sequence is the first-pass order I use in "When a dashboard number changes, I check these four things first."

That order tells me whether I am dealing with lateness, shape, joins, or definition drift before the room turns the issue into a tooling argument.
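
The shape and join part of that pass can be sketched without a warehouse. The example below is hypothetical: `join_report` and the sample keys are invented for illustration, and a real check would normally run as SQL against the model behind the chart. It separates two join failures that move a number in opposite directions: orphan keys that drop rows, and duplicate dimension keys that fan rows out.

```python
from collections import Counter
from typing import Dict, Iterable, List, Union


def join_report(
    fact_keys: Iterable[str], dim_keys: Iterable[str]
) -> Dict[str, Union[int, List[str]]]:
    """Summarize how a fact table's keys line up with a dimension table."""
    fact = list(fact_keys)
    dim_counts = Counter(dim_keys)
    # Fact keys with no dimension match are lost in an inner join.
    orphans = sorted({k for k in fact if k not in dim_counts})
    # Dimension keys that repeat multiply fact rows in any join.
    duplicates = sorted(k for k, n in dim_counts.items() if n > 1)
    matched = sum(1 for k in fact if k in dim_counts)
    return {
        "fact_rows": len(fact),
        "matched_rows": matched,
        "orphan_keys": orphans,
        "duplicate_dim_keys": duplicates,
    }


report = join_report(
    fact_keys=["A", "B", "C", "C"],
    dim_keys=["A", "B", "B"],
)
print(report)
```

Here two of four fact rows match, key C is orphaned, and key B is duplicated in the dimension, so the same chart could be both undercounting and double counting.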

When those checks pass and two dashboards still disagree, I treat it as a definition problem, not a chart problem. I move to metric definition work before I touch the chart again.

When the incident is active and other people need updates, I keep a short incident note beside the debugging work.

I want each post to leave behind something I would run again: a check, a note template, a definition card, or a release habit that keeps the next incident explainable.

Tradeoffs

Breaks when I start writing about tools before I can name the check, contract, or decision rule that matters. Mitigation: anchor the post to the operating artifact first and mention tooling only when it changes the workflow.
Breaks when the useful answer depends on team size, latency expectations, or ownership structure. Mitigation: say that boundary directly and explain what I would change under those constraints.
Breaks when the real example is too client-specific to publish cleanly. Mitigation: generalize names and numbers, but keep the failure mode, decision, and mitigation honest.

Close

Next step: Start with my default stack and workflow if you want the operating model behind these notes.

If the live pressure point is a late run, a KPI that moved, or a dashboard disagreement, use that situation to choose the first field note instead of reading the corpus front to back.