The incident note template I wish every analytics team used

I keep a one-page incident note during analytics incidents so impact, checks, decisions, cause, and prevention stay clear while the KPI is still broken.

· 5 min read by Berhan Turkkaynagi

At 08:12 ET, I want one page open before I want another Slack thread.

Slack fills up, someone asks whether the leadership review is still safe, and half the useful evidence stays trapped in query tabs or terminal history.

At the first real check, I open a one-page incident note and start writing timestamps, evidence, and the current decision. I am not trying to document everything. I am trying to keep the incident legible while the KPI is still moving.

The goal is one visible source for the timeline, evidence, and current decision before the meeting clock takes over.

Problem

Imagine an executive revenue KPI is +11.8% versus the settled baseline at 08:05 ET. The leadership review starts at 08:30 ET.

At that point, I need more than the fix. I need one page that says whether the dashboard is safe, which checks already passed, what I think is broken, and when the next update goes out.

Without that note, the same questions get asked twice. Someone reruns a query I already checked. The review gets delayed for the wrong reason. By the time the issue is fixed, the cause and prevention item have already started to fade.
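The +11.8% figure in the scenario is a percent deviation of the live KPI from the settled baseline. A minimal sketch of that arithmetic follows. The revenue amounts and the 2% tolerance are illustrative assumptions, not values from the incident.

```python
def pct_vs_baseline(live: float, baseline: float) -> float:
    """Percent deviation of the live KPI from the settled baseline."""
    if baseline == 0:
        raise ValueError("settled baseline is zero; deviation is undefined")
    return (live - baseline) / baseline * 100


def dashboard_is_safe(deviation_pct: float, tolerance_pct: float = 2.0) -> bool:
    """Hypothetical rule: safe only if the deviation is inside the tolerance."""
    return abs(deviation_pct) <= tolerance_pct


# Hypothetical amounts chosen to reproduce the +11.8% in the scenario.
deviation = pct_vs_baseline(live=1_118_000, baseline=1_000_000)
headline = f"executive revenue KPI is {deviation:+.1f}% vs settled baseline"
safe = dashboard_is_safe(deviation)
print(headline, "| dashboard safe:", safe)
```

Writing the deviation and the tolerance into the note's first line gives the "is the dashboard safe" question a number to argue with instead of a feeling.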

Default approach

  • Open the note as soon as the incident affects a real meeting, KPI, or decision.
  • Put impact, owner, next update time, and the current dashboard-safe decision at the top so nobody has to search for the current state.
  • Log checks in the order I run them, with one timestamped evidence line per check.
  • Keep hypothesis separate from confirmed cause so the note stays honest while the investigation is still moving.
  • Record the operating decision explicitly: hold the number, use yesterday’s settled snapshot, delay the review, or republish after the fix.
  • Close the note with one prevention item and one owner before I call the incident done.
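If the note lives in code or a notebook rather than a doc, the list above can be sketched as a small data structure. This is one possible shape, not a standard: the class name `IncidentNote`, its field names, and the `log_check` helper are all hypothetical, and timestamps are assumed to be naive datetimes already in ET.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class IncidentNote:
    # Top-of-note fields: the current state nobody should have to search for.
    incident: str
    owner: str
    impact: str
    current_decision: str
    next_update: str
    timeline: list[str] = field(default_factory=list)
    hypothesis: str = ""
    confirmed_cause: str = ""  # stays empty until evidence confirms the cause

    def log_check(self, at: datetime, name: str, status: str, evidence: str) -> None:
        """Append one timestamped evidence line, in the order checks are run."""
        self.timeline.append(f"{at:%H:%M} ET {name} check {status}; {evidence}")

    def render(self) -> str:
        """Render the note as plain text that can be pasted into chat."""
        lines = [
            f"Incident: {self.incident}",
            f"Owner: {self.owner}",
            f"Impact: {self.impact}",
            f"Current decision: {self.current_decision}",
            f"Next update: {self.next_update}",
            "Timeline:",
            *[f"  {entry}" for entry in self.timeline],
            f"Hypothesis: {self.hypothesis or '(none yet)'}",
            f"Confirmed cause: {self.confirmed_cause or '(not confirmed)'}",
        ]
        return "\n".join(lines)


note = IncidentNote(
    incident="executive revenue KPI is +11.8% vs settled baseline",
    owner="analytics engineering",
    impact="08:30 ET leadership review is not safe to run from the live dashboard",
    current_decision="use yesterday's settled snapshot",
    next_update="08:20 ET",
)
note.log_check(
    datetime(2025, 12, 9, 8, 12), "freshness", "passes",
    "finance extract landed at 07:11 ET",
)
print(note.render())
```

Keeping `hypothesis` and `confirmed_cause` as separate fields means the rendered note shows "(not confirmed)" until someone deliberately fills in the cause.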

If the run itself is late, I work from the five-signal observability panel for late analytics runs.

If the data landed but the number changed, I use "When a dashboard number changes, I check these four things first" to walk through freshness, row counts, joins, and metric logic.

Example

This is the kind of live note I want open during that revenue incident:

Incident: executive revenue KPI is +11.8% vs settled baseline
Started: 2025-12-09 08:05 ET
Owner: analytics engineering
Impact: 08:30 ET leadership review is not safe to run from the live dashboard
Current decision: use yesterday's settled snapshot until the dashboard is republished
Next update: 08:20 ET

Timeline:
  • 08:05 ET alert from revenue dashboard and finance Slack thread
  • 08:12 ET freshness check passes; finance extract landed at 07:11 ET
  • 08:18 ET row-count check passes; fact_sales volume is within 0.8% of recent Mondays
  • 08:27 ET join check fails; unmatched promotion keys jump from 0.4% to 14.2%
  • 08:41 ET mapping fix deployed and model rebuilt
  • 08:50 ET dashboard republished

Checks run:
  • Freshness: ok (latest finance snapshot landed on time)
  • Row counts: ok (fact_sales volume is stable)
  • Joins: off (promotion mapping dropped a material set of rows)
  • Metric logic: unchanged (no dashboard filter or definition edit since yesterday)

Hypothesis: New promotion override codes were not included in the morning dimension sync.
Confirmed cause: The morning dim_promotions sync excluded new override codes from the ERP feed.
Fix: Backfill the missing promotion codes, rebuild the model, and republish the dashboard.
Prevention: Add an unmatched-key alert on the promotion join before the next finance review.
Prevention owner: analytics engineering

That note is enough for the incident I am running. I can answer status questions, keep the operating decision visible, and avoid rebuilding the same timeline later.

I care less about the document itself than whether impact, evidence, decision, and prevention sit in one place while the KPI is still moving. If I already have the pipeline checks in place, this note becomes the record of what I actually saw, ruled out, and decided.
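The join check and the unmatched-key prevention item from the example could look roughly like this. It is a sketch under assumptions: the rate is taken as the share of fact rows whose promotion key is missing from the dimension (the example does not say whether its percentages count rows or distinct keys), and the 2% alert threshold, the function names, and the sample keys are all hypothetical.

```python
from typing import Hashable, Iterable


def unmatched_key_rate(fact_keys: Iterable[Hashable], dim_keys: Iterable[Hashable]) -> float:
    """Share of fact rows whose join key has no match in the dimension."""
    fact_keys = list(fact_keys)
    if not fact_keys:
        return 0.0  # no rows means nothing can be unmatched
    known = set(dim_keys)
    unmatched = sum(1 for key in fact_keys if key not in known)
    return unmatched / len(fact_keys)


def should_alert(rate: float, threshold: float = 0.02) -> bool:
    """Hypothetical alert rule: fire when the unmatched share exceeds the threshold."""
    return rate > threshold


# Made-up keys: two of five fact rows carry override codes missing from the dimension.
fact_promotion_keys = ["P1", "P2", "OVR-9", "P1", "OVR-7"]
dim_promotion_keys = ["P1", "P2"]

rate = unmatched_key_rate(fact_promotion_keys, dim_promotion_keys)
evidence_line = f"join check: unmatched promotion keys at {rate:.1%}"
print(evidence_line, "| alert:", should_alert(rate))
```

The formatted `evidence_line` is meant to be pasted straight into the note's timeline, so the check and the record of the check use the same number.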

The headings stay simple on purpose: impact, owner, current decision, next update, timeline, checks run, hypothesis, confirmed cause, fix, and prevention. If a field does not change the next action or preserve evidence, I leave it out.

Tradeoffs

  • Breaks when: the note gets opened after the fix and turns into reconstructed memory → Mitigation: start the note on the first real check, even if the first version is only four lines.
  • Breaks when: the team writes paragraphs instead of evidence lines → Mitigation: keep one bullet per timestamp and one line per check.
  • Breaks when: a hypothesis hardens into a fake root cause because it was written too early → Mitigation: keep separate headings for hypothesis and confirmed cause.
  • Breaks when: the incident closes with no prevention owner → Mitigation: require one named follow-up item before marking the incident complete.
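The last two mitigations can be enforced with a small guard before an incident is marked complete. A minimal sketch, assuming the note is held as a plain dictionary; the key names and the `missing_before_close` helper are hypothetical.

```python
# Fields that must be filled in before the incident can be closed.
# "hypothesis" is deliberately absent: a hypothesis alone never closes an incident.
REQUIRED_TO_CLOSE = ("confirmed_cause", "fix", "prevention", "prevention_owner")


def missing_before_close(note: dict) -> list[str]:
    """Return the required fields that are absent, None, or blank."""
    return [
        name for name in REQUIRED_TO_CLOSE
        if not (note.get(name) or "").strip()
    ]


draft = {
    "hypothesis": "New promotion override codes were not included in the morning dimension sync.",
    "fix": "Backfill the missing promotion codes, rebuild the model, and republish the dashboard.",
    "prevention": "Add an unmatched-key alert on the promotion join.",
    "prevention_owner": "",
}

gaps = missing_before_close(draft)
print("cannot close yet, missing:", gaps)
```

An empty list means the note is ready to close. Anything else is the follow-up to chase before calling the incident done.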

Close

Next step: Pick one business-critical KPI and write the headers for your live incident note before the next morning review turns noisy.

If the next analytics incident would still be reconstructed from chat, start with one responder artifact: a single line carrying the timestamp, check, decision, and owner. That one line makes the first review easier.