The incident note template I wish every analytics team used

At 08:12 ET, I open the incident note before I open another Slack thread.

Without it, Slack fills up, someone asks whether the leadership review is still safe, and half the useful evidence stays trapped in query tabs or terminal history.

At the first real check, I open a one-page incident note and start writing timestamps, evidence, and the current decision. I am not trying to document everything. I am trying to keep the incident legible while the KPI is still moving.

The goal is one visible source for the timeline, evidence, and current decision before the meeting clock takes over.

Problem

Imagine an executive revenue KPI is +11.8% versus the settled baseline at 08:05 ET. The leadership review starts at 08:30 ET.

At that point, I need more than the fix. I need one page that says whether the dashboard is safe, which checks already passed, what I think is broken, and when the next update goes out.

Without that note, the same questions get asked twice. Someone reruns a query I already checked. The review gets delayed for the wrong reason. By the time the issue is fixed, the cause and prevention item have already started to fade.

Default approach

Open the note as soon as the incident affects a real meeting, KPI, or decision.
Put impact, owner, next update time, and the current dashboard-safe decision at the top so nobody has to search for the current state.
Log checks in the order I run them, with one timestamped evidence line per check.
Keep hypothesis separate from confirmed cause so the note stays honest while the investigation is still moving.
Record the operating decision explicitly: hold the number, use yesterday’s settled snapshot, delay the review, or republish after the fix.
Close the note with one prevention item and one owner before I call the incident done.
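
The steps above can be sketched as a small data structure. This is a minimal illustration, not a prescribed tool: the class names `IncidentNote` and `Check`, the field names, and the rendered layout are all assumptions made for the example. The point is only that the current state sits at the top, checks are appended in the order they are run, and hypothesis and confirmed cause are separate fields.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Check:
    time: str      # e.g. "08:12 ET"
    name: str      # e.g. "Freshness"
    status: str    # "ok", "off", or "unchanged"
    evidence: str  # one line of evidence, not a paragraph


@dataclass
class IncidentNote:
    # Hypothetical structure; field names are illustrative only.
    incident: str
    owner: str
    impact: str
    decision: str
    next_update: str
    checks: List[Check] = field(default_factory=list)
    hypothesis: Optional[str] = None
    confirmed_cause: Optional[str] = None  # kept separate from hypothesis

    def log_check(self, time: str, name: str, status: str, evidence: str) -> None:
        """Append one timestamped evidence line, in the order checks are run."""
        self.checks.append(Check(time, name, status, evidence))

    def render(self) -> str:
        """Render the note with the current state at the top."""
        lines = [
            f"Incident: {self.incident}",
            f"Impact: {self.impact}",
            f"Owner: {self.owner}",
            f"Current decision: {self.decision}",
            f"Next update: {self.next_update}",
            "Checks run:",
        ]
        lines += [f"  {c.time} {c.name}: {c.status} - {c.evidence}" for c in self.checks]
        lines.append(f"Hypothesis: {self.hypothesis or 'none yet'}")
        lines.append(f"Confirmed cause: {self.confirmed_cause or 'not confirmed'}")
        return "\n".join(lines)


note = IncidentNote(
    incident="executive revenue KPI is +11.8% vs settled baseline",
    owner="analytics engineering",
    impact="08:30 ET leadership review is not safe to run from the live dashboard",
    decision="use yesterday's settled snapshot until the dashboard is republished",
    next_update="08:20 ET",
)
note.log_check("08:12 ET", "Freshness", "ok", "finance extract landed at 07:11 ET")
note.log_check("08:18 ET", "Row counts", "ok", "fact_sales volume is within 0.8% of recent Mondays")
print(note.render())
```

A shared document works just as well. The structure matters more than the tooling.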

If the run itself is late, I work from the five-signal observability panel for late analytics runs.

If the data landed but the number changed, I use "When a dashboard number changes, I check these four things first" to walk through freshness, row counts, joins, and metric logic.

Example

This is the incident note I keep open during that revenue incident:

Field	Value
Incident	executive revenue KPI is +11.8% vs settled baseline
Started	2025-12-09 08:05 ET
Owner	analytics engineering
Impact	08:30 ET leadership review is not safe to run from the live dashboard
Current decision	use yesterday's settled snapshot until the dashboard is republished
Next update	08:20 ET
Timeline	08:05 ET alert from revenue dashboard and finance Slack thread
	08:12 ET freshness check passes; finance extract landed at 07:11 ET
	08:18 ET row-count check passes; fact_sales volume is within 0.8% of recent Mondays
	08:27 ET join check fails; unmatched promotion keys jump from 0.4% to 14.2%
	08:41 ET mapping fix deployed and model rebuilt
	08:50 ET dashboard republished
Checks run	Freshness: ok - latest finance snapshot landed on time
	Row counts: ok - fact_sales volume is stable
	Joins: off - promotion mapping dropped a material set of rows
	Metric logic: unchanged - no dashboard filter or definition edit since yesterday
Hypothesis	New promotion override codes were not included in the morning dimension sync.
Confirmed cause	The morning dim_promotions sync excluded new override codes from the ERP feed.
Fix	Backfill the missing promotion codes, rebuild the model, and republish the dashboard.
Prevention	Add an unmatched-key alert on the promotion join before the next finance review.
Prevention owner	analytics engineering

That note is enough for the incident I am running. I can answer status questions, keep the operating decision visible, and avoid rebuilding the same timeline later.

I care less about the document itself than whether impact, evidence, decision, and prevention sit in one place while the KPI is still moving. If I already have the pipeline checks in place, this note becomes the record of what I actually saw, ruled out, and decided.

The headings stay simple on purpose: impact, owner, current decision, next update, timeline, checks run, hypothesis, confirmed cause, fix, and prevention. If a field does not change the next action or preserve evidence, I leave it out.
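
The prevention item in the example note, an unmatched-key alert on the promotion join, could look something like the sketch below. It is an illustration under stated assumptions: the function names, the 2% threshold, and the sample keys are hypothetical, and in practice the check would more likely run as a query against the warehouse than over in-memory lists.

```python
def unmatched_key_rate(fact_keys, dimension_keys):
    """Share of fact rows whose join key has no match in the dimension."""
    fact_keys = list(fact_keys)
    if not fact_keys:
        return 0.0
    known = set(dimension_keys)
    unmatched = sum(1 for key in fact_keys if key not in known)
    return unmatched / len(fact_keys)


def join_check(fact_keys, dimension_keys, threshold=0.02):
    """Return (status, evidence line) ready to paste into the incident note.

    The 2% default threshold is an assumption for illustration; pick one
    from the join's normal unmatched rate.
    """
    rate = unmatched_key_rate(fact_keys, dimension_keys)
    status = "off" if rate > threshold else "ok"
    return status, f"Joins: {status} - unmatched promotion keys at {rate:.1%}"


# Hypothetical sample: two of ten fact rows carry an override code
# that is missing from the promotions dimension.
dimension_keys = ["P1", "P2"]
fact_keys = ["P1"] * 6 + ["P2"] * 2 + ["OVR-1"] * 2

status, line = join_check(fact_keys, dimension_keys)
print(line)
```

Returning the evidence line alongside the status means the alert output can go straight into the "Checks run" field without rewording.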

Tradeoffs

Breaks when: the note gets opened after the fix and turns into reconstructed memory → Mitigation: start the note on the first real check, even if the first version is only four lines.
Breaks when: the team writes paragraphs instead of evidence lines → Mitigation: keep one bullet per timestamp and one line per check.
Breaks when: a hypothesis hardens into a fake root cause because it was written too early → Mitigation: keep separate headings for hypothesis and confirmed cause.
Breaks when: the incident closes with no prevention owner → Mitigation: require one named follow-up item before marking the incident complete.
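
The last two mitigations can be enforced with a small gate before the incident is marked complete. The sketch below assumes the note is held as a plain dictionary. The field names and the choice of which fields block closing are assumptions for illustration, not a fixed rule.

```python
# Fields that must be filled before closing (an assumed policy for this sketch).
REQUIRED_TO_CLOSE = ("confirmed_cause", "prevention", "prevention_owner")


def missing_before_close(note):
    """Return the required fields that are still empty or whitespace-only."""
    return [
        name
        for name in REQUIRED_TO_CLOSE
        if not str(note.get(name) or "").strip()
    ]


note = {
    "hypothesis": "New promotion override codes were not included in the morning dimension sync.",
    "confirmed_cause": "The morning dim_promotions sync excluded new override codes from the ERP feed.",
    "prevention": "Add an unmatched-key alert on the promotion join before the next finance review.",
    "prevention_owner": "",
}

# The prevention owner is blank, so the incident should stay open.
print(missing_before_close(note))

note["prevention_owner"] = "analytics engineering"
# Nothing is missing now, so the note can be closed.
print(missing_before_close(note))
```

Because `hypothesis` is not in the required list, a filled-in hypothesis cannot stand in for a confirmed cause.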

Close

Next step: Pick one business-critical KPI and write the headers for your live incident note before the next morning review turns noisy.

When the next incident lands, a single responder artifact that holds the timestamp, check, decision, and owner lines makes the first review easier than reconstructing events from chat.

