When an analytics incident needs a postmortem, not just a note
I use a simple trigger table to decide whether an analytics incident needs a note, a lightweight review, or a full postmortem, based on impact, recurrence, exposure, detection, and unresolved prevention work.
After the dashboard is fixed, the next decision is whether the incident is actually closed.
Sometimes the short incident note is enough, and sometimes it is not. Some incidents need one prevention item and a clean close. Others need a full postmortem because the same failure will come back if nobody names the pattern, owner, and follow-up path.
My default is to choose the review level from triggers, not from the temperature of the room. A trigger table keeps the response proportional.
Problem
Analytics incident reviews usually fail in one of two directions: too much ceremony for a small miss, or too little learning for a repeated trust break.
If every late dashboard, stale extract, or corrected number gets a full postmortem, people stop reading them. The process becomes ceremony, and the useful reviews get buried.
If every incident gets closed as a short note, repeated failures stay invisible. A published number can be corrected twice, each time with a plausible local explanation, while the real prevention work never gets accepted.
I want the decision rule in writing before the next incident. After the fix, everyone is tired, the meeting clock is loud, and the team is already biased toward either moving on or making the review bigger than it needs to be.
Default approach
I start with the incident note, then decide whether the incident needs note-only closure, a lightweight review, or a full postmortem.
The table has to answer the question a lead is actually facing: can we close the note, do we need a short review, or did this incident expose a system weakness that needs a full postmortem?
Analytics incident escalation table
Field | Note only | Lightweight review | Full postmortem
------------------------------|----------------------------------------|--------------------------------------------|----------------
Impact | no decision, meeting, export, or KPI trust was affected | one team, dashboard, or scheduled review was delayed or temporarily unsafe | a decision, finance process, customer-facing surface, or executive review was affected
Recurrence | first isolated occurrence with understood cause | repeated pattern in one asset or recent near-miss | repeated cross-surface failure or unresolved previous prevention item
Exposure | contained inside the analytics team | visible to one business owner or operating team | visible outside the immediate team or tied to a formal reporting path
Customer/executive visibility | none | possible if the issue is not handled before the next review | confirmed customer, executive, finance-close, board, or external-reporting visibility
Detection path | expected alert or check caught it before use | human report found it, or alert lacked enough responder context | users found it, monitoring failed, or detection happened after a bad decision
Unresolved prevention work | one clear fix and owner | prevention needs coordination across analytics and one partner | root cause or prevention path is unclear, cross-team, risky, or under-owned
Review level | close the short incident note | schedule a short review with timeline and action list | write a blameless postmortem with timeline, contributing causes, impact, owners, and follow-up links
Owner | incident owner closes the note | analytics lead owns review and action follow-through | incident owner plus accountable analytics, engineering, and business owners
Next action | record fix and prevention in the note | add one or two follow-up tasks with due dates | track postmortem actions until accepted, rejected, or replaced
This is not a scoring system. I do not add up points and pretend the number made the decision.
I use the table to make the judgment visible. If impact was low, detection worked, and prevention is obvious, I keep the incident small. If recurrence, executive exposure, failed detection, or unclear prevention shows up, I escalate before the incident becomes folklore.
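One way to keep that judgment visible is to record which column each field landed in and surface the highest one, together with the fields that put it there. The Python below is a minimal sketch under that assumption, not a definitive implementation. The names `Level`, `Triage`, and `suggest_review` are hypothetical, and the column assignments for the sample incident are illustrative. Nothing is summed: the function only reports the highest column any field reached, and the lead still makes the call.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Columns of the trigger table, ordered from lightest to heaviest."""
    NOTE_ONLY = 1
    LIGHTWEIGHT_REVIEW = 2
    FULL_POSTMORTEM = 3


@dataclass
class Triage:
    """One incident: each field holds the table column that best matches it."""
    impact: Level
    recurrence: Level
    exposure: Level
    visibility: Level
    detection: Level
    prevention: Level


def suggest_review(triage: Triage) -> tuple[Level, list[str]]:
    """Return the highest column reached and the fields that reached it.

    This is a suggestion for the reviewer to confirm or override,
    not a score: no points are added up.
    """
    fields = vars(triage)  # field name -> Level, in declaration order
    level = max(fields.values())
    reasons = [name for name, value in fields.items() if value == level]
    return level, reasons


# Hypothetical incident: low impact, but a repeated pattern in one asset.
repeat_delay = Triage(
    impact=Level.NOTE_ONLY,
    recurrence=Level.LIGHTWEIGHT_REVIEW,
    exposure=Level.NOTE_ONLY,
    visibility=Level.NOTE_ONLY,
    detection=Level.NOTE_ONLY,
    prevention=Level.NOTE_ONLY,
)
level, reasons = suggest_review(repeat_delay)
print(level.name, reasons)  # LIGHTWEIGHT_REVIEW ['recurrence']
```

Returning the reasons next to the level matters more than the level itself: it names the trigger that justified the escalation, so the decision can be challenged on the specific field rather than on the mood of the meeting.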
Example
Here is the note-only case.
A daily operations dashboard publishes 11 minutes late because an upstream extract lands after its usual window. The dashboard is not used until the afternoon standup. The delay is caught before the meeting. The data publishes cleanly. The prevention item is clear: update the source cutoff note and widen the warning window so the team sees the risk earlier.
That incident deserves a short note, not a postmortem.
Incident: operations dashboard published 11 minutes late
Impact: no meeting or decision used stale data
Recurrence: first isolated delay from this source handoff
Exposure: contained inside analytics
Detection: expected freshness check caught it before use
Prevention: update source cutoff note and warning threshold
Review level: note only
Owner: analytics engineer closes note
Next action: record prevention item and monitor next scheduled run
A full review would not add much. It would spend more attention than the incident earned, and it would teach the wrong habit: every small delay becomes a meeting instead of a clean note with an owner.
Here is the case that should not stay note-only.
A finance-facing revenue number is republished twice in one month after a join change drops a subset of settled refunds. The first incident had a short note and one prevention item. The second reaches the finance close review before the discrepancy is caught, and the corrected number has to be explained to executives. Nobody can say whether the failure was release review, metric ownership, detection, or the prevention item from the first note not being done.
That incident needs a full postmortem.
The trigger is not embarrassment. The trigger is the combination of recurrence, executive exposure, a formal finance process, failed detection, and unresolved prevention work across ownership boundaries.
Incident: finance revenue number republished after settled-refund join issue
Impact: the wrong number reached the finance close review, and the correction had to be explained to executives
Recurrence: second related incident in one month
Exposure: finance-facing dashboard and review packet
Detection: discrepancy found by a person, not by the expected check
Unresolved prevention: unclear whether release review, metric ownership, or join monitoring failed
Review level: full postmortem
Owner: analytics incident owner plus finance analytics owner
Next action: write timeline, name contributing causes, and track accepted prevention work
The document is not the point. The point is deciding whether the incident revealed a system weakness that a short note cannot close and a vague action item will not prevent.
Tradeoffs
- Breaks when: every small analytics alert gets the same heavy review. → Mitigation: keep note-only and lightweight-review paths legitimate so full postmortems stay worth reading.
- Breaks when: the postmortem becomes a search for who caused the wrong number. → Mitigation: keep the review blameless and frame the work around missing signals, unclear ownership, weak release checks, and follow-up the team can actually change.
- Breaks when: repeated small incidents keep looking harmless in isolation. → Mitigation: include recurrence and unresolved prior action items in the trigger table so patterns can roll up before trust erodes.
- Breaks when: the team writes a good review but nobody accepts the prevention work. → Mitigation: attach every review level to an owner, a next action, and a backlog path where the action can be accepted, rejected, or replaced.
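The recurrence mitigation above depends on someone actually seeing that small notes are piling up on the same asset. A rough sketch of that roll-up in Python follows, assuming each closed note records the affected asset, the date it was opened, and whether its prevention item was completed. `IncidentNote`, `recurring_assets`, the 30-day window, the threshold of two, and the sample dates are all hypothetical choices for illustration, not values from any real incident log.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class IncidentNote:
    asset: str             # dashboard, metric, or pipeline the note is about
    opened: date           # when the incident note was opened
    prevention_done: bool  # whether the note's prevention item was completed


def recurring_assets(notes, today, window_days=30, threshold=2):
    """Flag assets with at least `threshold` notes in the last `window_days`.

    For each flagged asset, also report how many of those notes still
    have an open prevention item.
    """
    cutoff = today - timedelta(days=window_days)
    by_asset = defaultdict(list)
    for note in notes:
        if note.opened >= cutoff:
            by_asset[note.asset].append(note)

    flagged = {}
    for asset, recent in by_asset.items():
        if len(recent) >= threshold:
            open_items = sum(1 for n in recent if not n.prevention_done)
            flagged[asset] = {
                "incidents": len(recent),
                "open_prevention_items": open_items,
            }
    return flagged


# Invented sample data: two notes on one asset, one note on another.
notes = [
    IncidentNote("finance_revenue", date(2024, 5, 3), prevention_done=False),
    IncidentNote("finance_revenue", date(2024, 5, 24), prevention_done=False),
    IncidentNote("ops_dashboard", date(2024, 5, 20), prevention_done=True),
]
print(recurring_assets(notes, today=date(2024, 5, 31)))
# {'finance_revenue': {'incidents': 2, 'open_prevention_items': 2}}
```

A flagged asset is only a prompt to reread the recurrence and unresolved-prevention rows of the table for the newest incident. It does not decide the review level.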
Close
The smallest useful version is a trigger table the team agrees to before the next incident.
It does not need to be perfect. It needs to answer one question clearly: when is a short note enough, and when would moving on leave the same trust break waiting for the next review?
Next step: Pick one recent analytics incident and classify it as one of three levels: note-only, lightweight review, or full postmortem.
If the answer is hard to defend, the table is the work to do before the next incident forces that decision under pressure.