Go home
Menu

The first-response runbook I want behind every analytics SLA alert

I keep a short runbook behind analytics SLA alerts so the first responder can see impact, last success, failed step, source freshness, owner, escalation path, and the stakeholder message before the cutoff is missed.

· 6 min read by Berhan Turkkaynagi

At 07:34 ET, a vague SLA alert is already late.

If the 08:00 ET operations review depends on the dashboard, I do not want the alert to say only that a job missed its schedule. I want the first responder to know whether the output is safe, where the run last succeeded, which boundary failed, who owns the next check, and what message should go to the people waiting on the number.

The runbook is what makes the alert usable.

Problem

Analytics SLA alerts often create urgency without giving the responder a useful first move.

A scheduler can say the daily KPI refresh missed the 07:30 ET cutoff. That matters, but it does not answer the questions that decide the first response: Did the source land? Did yesterday’s dashboard stay published? Did the failure happen in the transform or the publish step? Is the 08:00 ET review unsafe, or can the team use the last settled snapshot?

When those answers are missing, the first ten minutes become tool-hopping. One person opens the orchestrator. Another checks the dashboard. Someone asks whether stakeholders should wait. The alert technically worked, but the response started from scratch.

For decision-critical analytics, I want the runbook card attached before the alert fires.

Default approach

My default is to keep one short runbook card behind each SLA family that can interrupt a real decision.

The card has to answer a narrow set of questions:

  • What business cutoff makes this alert matter?
  • Which dashboard, export, model, or review is affected?
  • What was the last successful run or publish?
  • Which step failed or became late?
  • Is the upstream source fresh enough to trust?
  • Does the published output still have a normal shape if a partial publish exists?
  • Who is the first responder, who is the backup, and when does escalation start?
  • What should stakeholders hear before the cause is fully known?

I do not need a long wiki page at this moment. I need a card that changes the first action.

If the runbook cannot say what to check first, the alert is not ready to page someone. It may still be a warning, a dashboard marker, or a backlog item. Paging should be reserved for failures that can affect a business cutoff and have a response path attached.

Example

For this case, the daily KPI dashboard needs to publish by 07:30 ET for an 08:00 ET operations review.

At 07:34 ET, the alert fires. This is the card I want behind it:

SLA alert runbook card
Alert: daily KPI dashboard refresh missed 07:30 ET publish cutoff
Business cutoff: 08:00 ET operations review
Affected output: Daily KPI dashboard / executive summary tiles
Decision exposure: review should not use today's dashboard until publish is confirmed

Last successful run: 2025-03-03 07:18 ET
Current run: 2025-03-04 started 06:42 ET; failed at 07:24 ET
Failing step: transform_daily_kpi / publish_mart_daily_kpi
First responder: analytics engineering on-call
Backup / escalation: data platform owner if source freshness is late by 07:40 ET; business owner if dashboard remains unsafe by 07:50 ET

First five checks
1. Source freshness: finance_extract landed by 06:10 ET? latest observed timestamp?
2. Failed step: inspect transform_daily_kpi error and recent code/config change.
3. Last publish: confirm latest successful published partition and dashboard cache timestamp.
4. Output shape: compare row count, null-rate, and one KPI total against recent same-weekday band if a partial publish exists.
5. Recovery choice: rerun, hold yesterday's snapshot, or mark dashboard unsafe for the 08:00 ET review.

Stakeholder message template
Status: daily KPI dashboard is <safe/unsafe/under review> for the 08:00 ET operations review.
Evidence: last successful publish is <timestamp>; current run failed at <step>; source freshness is <ok/late/unknown>.
Next action: <rerun/hold yesterday's snapshot/investigate failed step>.
Next update: <time> from <responder>.

The first useful line is the business cutoff. Without it, the responder cannot tell whether this is a page, a warning, or a note for later cleanup.

The next useful line is the last successful run. If yesterday’s dashboard is still published and clearly labeled, the team may have a safe fallback for the review. If the latest publish is partial, stale, or cached in a confusing state, the responder should say that early instead of letting people assume the dashboard is current.

The failing step matters because it prevents the first responder from starting at the wrong layer. If the source extract is late, the next check is upstream freshness and escalation. If the source landed and the transform failed, the next check is the model error, recent change, and rerun path. If the transform finished but the dashboard cache did not refresh, the response is different again.

The stakeholder template is part of the runbook because communication is not separate from recovery. At 07:45 ET, a calm update is more useful than a perfect root cause that arrives after the review starts.

The first update can stay this small:

Status: daily KPI dashboard is under review for the 08:00 ET operations review.
Evidence: last successful publish is 2025-03-03 07:18 ET; current run failed at publish_mart_daily_kpi; source freshness is ok.
Next action: analytics engineering is rerunning the failed publish step and checking output shape before clearing the dashboard.
Next update: 07:50 ET from analytics engineering on-call.

That message does not pretend to know the cause. It tells people what is safe, what evidence exists, what happens next, and when they will hear again.

Tradeoffs

  • Breaks when: one generic runbook tries to cover every pipeline, dashboard, and alert → Mitigation: keep one card per critical output or SLA family so the cutoff, affected surface, owner, and first checks are specific.
  • Breaks when: every warning uses paging severity → Mitigation: page only when the failure can affect a decision, review, export, or dashboard promise; keep lower-severity warnings visible without waking the same responder.
  • Breaks when: the stakeholder template hides uncertainty → Mitigation: allow unknown as a real state and require the next check plus next update time instead of forcing a fake root cause.
  • Breaks when: the card names source freshness or output-shape checks that nobody maintains → Mitigation: mark missing signals honestly and assign the owner before treating the alert as production-ready.

Close

Next step: Pick one SLA alert that could interrupt a real review and write the card behind it: cutoff, affected output, last success, failed step, source freshness, responder, escalation, and stakeholder message.

A page that still sends the responder to three tools before they can say whether the dashboard is safe is not finished yet.