When a dashboard number changes, I check these four things first

When a dashboard number moves without warning, I need the first answer fast. Which layer changed: freshness, row counts, joins, or metric logic?

I do not start by defending the number. I start by isolating the layer that changed.

That first pass stays in this order (freshness, row counts, joins, and metric logic) because each check rules out a whole class of failure quickly.

It breaks when there are no baselines or no clear owner. In that case I say that early and create those baselines before I promise a final answer.

Problem

Imagine available inventory is down 18% on Monday morning. The dashboard may be reacting to a late source, lost model rows, a join that dropped valid records, or a metric-logic change.

Those are different failure modes. If I treat them as one problem, the incident gets slower and the conversation gets noisy.

Default approach

Check freshness first: did the source land on time, did the job finish cleanly, and is the latest successful publish recent enough to trust?
Check row counts next: did the main model lose or gain more volume than expected versus recent baselines?
Check join behavior after that: did a dimension change create duplicate, unmatched, or mis-mapped rows?
Check metric logic last: did someone change filters, time windows, exclusions, or semantic rules intentionally, and did that change go through a dashboard release check before it hit production?
Send a short status update once I know which layer is failing and what I am checking next.

After the first pass, I send a short update so the conversation stays calm and scoped. This is the copy/paste template I use:

Status: investigating <metric> <up/down X%> vs <baseline> (as of <time>)

Freshness: <ok/late> — <one evidence line>
Row counts: <ok/off> — <one evidence line>
Joins: <ok/off> — <one evidence line>
Metric logic: <ok/changed> — <one evidence line>

Hypothesis: <one sentence>
Next check: <one sentence>
Next update: <time>

When the issue affects a meeting or decision, I keep The incident note template I wish every analytics team used open beside these checks so the timeline, evidence, and next update stay in one place.

If freshness, row counts, joins, and metric logic all come back clean and the disagreement remains, I stop treating it like a pipeline incident. My next check is the owned metric definition and any recent BI release evidence.

Example

Here is a simple version of the sequence for that Monday inventory drop:

Field	Value
Symptom	available inventory is down 18%

Review step	What I confirm	Status
Freshness	inventory snapshot extract completed at 07:12 ET latest snapshot date matches expectation	—
Row counts	stg_inventory_snapshot row count is down 1.4% vs recent Mondays not enough to explain the full drop	—
Join behavior	unmatched location_id values jump from 0.3% to 14.8% available units disappear after the join to location attributes	—
Metric logic	no intentional dashboard filter or definition change	—

Conclusion: the issue is missing location mappings, not a true inventory decline

The query I keep ready for this step is usually simple:

select
  count(*) as total_rows,
  sum(case when l.location_id is null then 1 else 0 end) as unmatched_rows
from fct_inventory_snapshot i
left join dim_locations l
  on i.location_id = l.location_id;

That sequence usually tells me which layer is failing in one pass. I like checks like this because I can run them at 08:05 and explain the result in one calm Slack update.

The controls behind these checks are boring on purpose: load completion time, row-count thresholds, and one unmatched-key check on the model that feeds the dashboard. Boring is good during an incident.

If freshness is bad because the run missed its cutoff, I move to the five-signal observability panel for missed analytics cutoffs.

If the four checks are clean and the disagreement remains, I move to The definition card I use to stop metric drift across dashboards.

Tradeoffs

Breaks when: there is no baseline for freshness or row-count shifts → Mitigation: start logging daily load times and row counts for critical models, even if the first version is manual.
Breaks when: the metric definition lives in multiple places across SQL, dbt, and BI → Mitigation: choose one owned definition and point dashboards back to it.
Breaks when: late-arriving data is normal for the business → Mitigation: compare against the same latency window instead of the final settled number.
Breaks when: nobody owns the upstream model or dashboard metric → Mitigation: assign one owner for the next incident before the memory of this one fades.

Close

Next step: Before the next incident, pick one important metric and write down the freshness, row-count, join, and logic checks you expect before anyone asks questions about it.

If the next KPI move would still send everyone to Slack first, use that four-check order as the first triage note before rewriting SQL.

When a dashboard number changes, I check these four things first

Problem

Default approach

Example

Tradeoffs

Close

Continue reading

The incident note template I wish every analytics team used

Five pipeline observability signals before more orchestration