Five pipeline observability signals before more orchestration

I use five signals—source freshness, latest publish, runtime, output shape, and owner—to make a missed analytics pipeline cutoff explainable before I add more orchestration.

4 min read · by Berhan Turkkaynagi

At 07:34, I do not need another orchestration knob. I need signals that tell me whether the 08:00 ET review is still recoverable.

Before I add more orchestration, I want a small operating view: Did the source land? What was the latest successful publish? Did runtime drift? Does the output still look normal? Who owns the first check?

I keep that view separate from "Row counts are not enough: the checks I add before I trust a pipeline" because those checks live inside the data path. This panel is for the moment the publish is late, stale, or failed and I need the next investigation step fast.

If the panel says the publish completed and the output still looks wrong downstream, I move to "When a dashboard number changes, I check these four things first."

Problem

Imagine a daily inventory_availability pipeline that needs to publish by 07:30 ET for an 08:00 ET operations review.

At 07:34, the orchestrator shows a failed run. That matters, but it is not enough. I still need to know whether the source extract was late, whether the curated model slowed down, whether the latest publish is still yesterday, and whether there is a clear owner for the first investigation step.

When those answers are missing, the team starts shopping for another retry rule or dependency feature. I would rather make the missed run legible first.

Default approach

  • Start with one business-critical pipeline and one real business cutoff time.
  • Show one upstream handoff signal, usually source freshness with the latest landed timestamp.
  • Show one publish-completion signal, usually the latest successful publish or latest published partition timestamp.
  • Track run duration against a normal band, not just success versus failure.
  • Add one output-shape signal on the published model, such as row-count delta or unmatched-key rate.
  • Put the owner and the first runbook question in the same view so the handoff starts immediately.

I do not want a wall of task states. I want a boundary view. Each signal should change either the next investigation step or my confidence in the published output.
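One way to make that boundary view concrete is to reduce each signal to a yes/no answer. The sketch below is an illustration under assumptions, not a reference implementation: `PanelInputs`, its field names, and `evaluate` are hypothetical, timestamps are assumed to be naive datetimes already in ET, and the duration band and row-count tolerance are whatever the team has agreed counts as normal.

```python
from dataclasses import dataclass
from datetime import date, datetime, time


@dataclass(frozen=True)
class PanelInputs:
    """Raw values behind the five signals (all field names are illustrative)."""
    run_date: date
    source_landed_at: datetime
    source_deadline: time          # latest acceptable landing time, in ET
    last_published_at: datetime
    duration_minutes: float
    duration_band: tuple           # (low, high) in minutes
    row_count_delta_pct: float
    row_count_tolerance_pct: float
    owner: str
    first_check: str


def evaluate(p: PanelInputs) -> dict:
    """Reduce each of the five signals to a boolean."""
    low, high = p.duration_band
    return {
        # Source landed today and before its deadline.
        "source_fresh": (
            p.source_landed_at.date() == p.run_date
            and p.source_landed_at.time() <= p.source_deadline
        ),
        # The latest successful publish belongs to today's run.
        "published_today": p.last_published_at.date() == p.run_date,
        # Runtime is inside the normal band, not just "succeeded".
        "duration_in_band": low <= p.duration_minutes <= high,
        # Output shape moved less than the agreed tolerance.
        "shape_ok": abs(p.row_count_delta_pct) <= p.row_count_tolerance_pct,
        # Someone owns the first check and the check is written down.
        "owner_assigned": bool(p.owner.strip()) and bool(p.first_check.strip()),
    }
```

Five booleans are deliberately coarse. The point is that each one either redirects the first investigation step or changes how much I trust the published output.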

Example

Here is the five-signal panel I want for that inventory_availability run. The companion table view of the five-signal panel shows the same review surface:

  • Source extract freshness — expected landed by 06:10 ET; observed 06:08 ET
  • Latest successful publish — expected today by 07:30 ET; observed 2025-12-01 07:14 ET
  • Curated model duration — expected 12–15 minutes; observed 41 minutes
  • Output row-count delta — expected within +/-2%; observed +0.4%
  • Owner + first check — owner: platform analytics; first check: inspect recent model changes and step runtime

That pattern tells me where to start.

The source landed on time, so I do not start with ingestion. The latest successful publish is still yesterday, so today’s data did not make it across the finish line. The output row-count delta is stable, so this does not look like a missing partition or obvious shape break. The inflated duration points me at the transform or publish path before I waste time debating retries, dependencies, or dashboard logic.
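That reasoning can be written down as a small decision order. This is a sketch of how I read the panel, not a rule set from any tool: the function name, the argument names, and the exact ordering of the branches are assumptions, and a team could reasonably order them differently.

```python
def first_investigation_step(
    source_fresh: bool,
    published_today: bool,
    duration_in_band: bool,
    shape_ok: bool,
) -> str:
    """Map the panel's pass/fail signals to where the first check should start."""
    if not source_fresh:
        # Upstream handoff is late, so nothing downstream is worth debugging yet.
        return "ingestion: check the upstream extract and its landing time"
    if not published_today and not duration_in_band:
        # Source is fine, publish is missing, runtime is inflated.
        return "transform/publish path: inspect recent model changes and step runtime"
    if not published_today:
        return "publish step: find out why today's run did not complete its publish"
    if not shape_ok:
        return "published output: look for a missing partition or shape break"
    if not duration_in_band:
        return "runtime drift: publish landed, but review the slow step before it misses"
    return "panel is clean: move on to the downstream dashboard checks"
```

For the run above (source on time, no publish today, duration out of band, row-count delta stable), the second branch fires and points at the transform or publish path.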

The audit query behind that panel is small on purpose:

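-- Last seven runs of one pipeline, newest first.
select
  run_date,
  source_landed_at,   -- source freshness
  published_at,       -- latest publish
  duration_minutes,   -- runtime
  output_row_count    -- output shape
from pipeline_run_audit
where pipeline_name = 'inventory_availability'
order by run_date desc
limit 7;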

I am not trying to build a perfect monitoring product. I want enough history to compare today’s run to the normal band, assign the first check cleanly, and keep the incident moving.
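A minimal sketch of that comparison, assuming the prior runs' `duration_minutes` values have already been fetched into a list. The median-plus-slack band is my own simplification, and the function names, the 25% slack, and the sample history in the usage note are hypothetical.

```python
from statistics import median


def duration_band(prior_minutes, slack=0.25):
    """Return a (low, high) band around the median of prior run durations."""
    mid = median(prior_minutes)  # raises StatisticsError if there is no history
    return (mid * (1 - slack), mid * (1 + slack))


def is_drifting(today_minutes, prior_minutes, slack=0.25):
    """True when today's duration falls outside the band built from prior runs."""
    low, high = duration_band(prior_minutes, slack)
    return not (low <= today_minutes <= high)
```

With a made-up history such as `[13, 12, 14, 15, 13, 12]` minutes, a 41-minute run is flagged as drift and a 14-minute run is not. The median keeps one earlier slow run from widening the band the way a mean would.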

That is why I keep this operating view separate from the pipeline checks themselves. The checks live inside the data path. This panel is for the moment a run is late or fails and I need the next investigation step fast.

Tradeoffs

  • Breaks when: teams instrument every task in the DAG and bury the useful signals under noise → Mitigation: start with one pipeline tied to one business deadline and keep only the signals that change the first investigation step.
  • Breaks when: the orchestration UI shows task state but nothing about the published dataset → Mitigation: pair workflow state with one publish-completion signal and one business-output signal.
  • Breaks when: one platform-wide SLA hides different lateness patterns across feeds → Mitigation: define freshness windows per dataset and per business expectation, not as one generic cutoff.
  • Breaks when: alerts fire but nobody knows who owns the fix or what to check first → Mitigation: put the owner and the first runbook question in the same panel.
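The per-dataset freshness windows in the third mitigation can be as plain as a lookup table. This is a sketch under assumptions: the dictionary layout and `is_late` are hypothetical, times are assumed to be in ET, and only the `inventory_availability` windows come from the example in this post. The second dataset is invented to show that windows differ per feed.

```python
from datetime import time

# Freshness windows per dataset, in ET. The supplier_scorecard entry is a
# made-up feed used only to illustrate a different business expectation.
FRESHNESS_WINDOWS = {
    "inventory_availability": {"source_by": time(6, 10), "publish_by": time(7, 30)},
    "supplier_scorecard": {"source_by": time(9, 0), "publish_by": time(11, 0)},
}


def is_late(dataset: str, signal: str, observed: time) -> bool:
    """True when the observed time is past this dataset's own window."""
    # KeyError on an unknown dataset is intentional: there is no generic fallback.
    window = FRESHNESS_WINDOWS[dataset]
    return observed > window[signal]
```

A 07:34 publish is late for `inventory_availability` and comfortably early for the hypothetical scorecard feed, which is the difference a single platform-wide cutoff would hide.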

Close

Next step: Pick one business-critical pipeline and write down the five signals you need before the next missed SLA turns into a meeting.

If a cutoff was missed recently, use one pipeline’s five-signal panel to decide whether the next fix is ownership, freshness, publication evidence, or orchestration work.