Five pipeline observability signals before more orchestration

At 07:34, I do not need another orchestration knob. I need signals that tell me whether the 08:00 ET review is still recoverable.

Before I add more orchestration, I want a small operating view: Did the source land? What was the latest successful publish? Did runtime drift? Does the output still look normal? Who owns the first check?

I keep that view separate from "Row counts are not enough: the checks I add before I trust a pipeline" because those checks live inside the data path. This panel is for the moment the publish is late, stale, or failed and I need the next investigation step fast.

If the panel says the publish completed and the output still looks wrong downstream, I move to "When a dashboard number changes, I check these four things first."

Problem

Imagine a daily inventory_availability pipeline that needs to publish by 07:30 ET for an 08:00 ET operations review.

At 07:34, the orchestrator shows a failed run. That matters, but it is not enough. I still need to know whether the source extract was late, whether the curated model slowed down, whether the latest publish is still yesterday’s, and whether there is a clear owner for the first investigation step.

When those answers are missing, the team starts shopping for another retry rule or dependency feature. I would rather make the missed run legible first.

Default approach

Start with one business-critical pipeline and one real business cutoff time.
Show one upstream handoff signal, usually source freshness with the latest landed timestamp.
Show one publish-completion signal, usually the latest successful publish or latest published partition timestamp.
Track run duration against a normal band, not just success versus failure.
Add one output-shape signal on the published model, such as row-count delta or unmatched-key rate.
Put the owner and the first runbook question in the same view so the handoff starts immediately.
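
As a minimal sketch of the first two signals, assuming Python and naive timestamps that are already in ET, the checks can stay this small. The names Signal, freshness_signal, and publish_signal are hypothetical, and the run date and the 06:10 landing cutoff are illustrative values, not part of any real tool.

```python
from dataclasses import dataclass
from datetime import date, datetime, time


@dataclass
class Signal:
    name: str
    expected: str
    observed: str
    ok: bool


def freshness_signal(landed_at: datetime, run_date: date, cutoff: time) -> Signal:
    # Fresh: the extract landed on the run date, at or before the cutoff.
    ok = landed_at.date() == run_date and landed_at.time() <= cutoff
    return Signal(
        name="Source extract freshness",
        expected=f"landed by {cutoff:%H:%M} ET",
        observed=f"{landed_at:%Y-%m-%d %H:%M} ET",
        ok=ok,
    )


def publish_signal(published_at: datetime, run_date: date, cutoff: time) -> Signal:
    # Complete: the latest successful publish is dated today and met the cutoff.
    ok = published_at.date() == run_date and published_at.time() <= cutoff
    return Signal(
        name="Latest successful publish",
        expected=f"today by {cutoff:%H:%M} ET",
        observed=f"{published_at:%Y-%m-%d %H:%M} ET",
        ok=ok,
    )


# Assumed run date for illustration; all times are naive and treated as ET.
run_date = date(2025, 12, 2)
source = freshness_signal(datetime(2025, 12, 2, 6, 8), run_date, time(6, 10))
publish = publish_signal(datetime(2025, 12, 1, 7, 14), run_date, time(7, 30))
for signal in (source, publish):
    print(f"{signal.name}: expected {signal.expected}; observed {signal.observed}; ok={signal.ok}")
```

Keeping expected and observed as display strings next to a single boolean is a deliberate simplification: the panel only has to say whether each boundary was crossed and what was seen.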

I do not want a wall of task states. I want a boundary view. Each signal should change either the next investigation step or my confidence in the published output.

Example

Here is the five-signal panel I want for that inventory_availability run, with the expected and observed value for each signal:

Source extract freshness: expected landed by 06:10 ET; observed 06:08 ET
Latest successful publish: expected today by 07:30 ET; observed 2025-12-01 07:14 ET (yesterday)
Curated model duration: expected 12–15 minutes; observed 41 minutes
Output row-count delta: expected within +/-2%; observed +0.4%
Owner + first check: owner is platform analytics; first check is to inspect recent model changes and step runtime

That pattern tells me where to start.

The source landed on time, so I do not start with ingestion. The latest successful publish is still yesterday’s, so today’s data did not make it across the finish line. The output row-count delta is stable, so this does not look like a missing partition or an obvious shape break. The inflated duration, 41 minutes against a normal 12–15, points me at the transform or publish path before I waste time debating retries, dependencies, or dashboard logic.
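
That reading order can be written down as a small decision function. This is only a sketch of the ordering described here, not a standard: the function name first_check, the boolean inputs, and the final fallback to orchestrator logs are assumptions for illustration.

```python
def first_check(source_fresh: bool, published_today: bool,
                shape_normal: bool, duration_in_band: bool) -> str:
    """Map the panel's signals to the first investigation step."""
    if not source_fresh:
        return "ingestion: find out why the source extract landed late"
    if published_today:
        return "downstream: the publish completed, so check the dashboard layer"
    if not shape_normal:
        return "output shape: look for a missing partition or unmatched keys"
    if not duration_in_band:
        return "transform or publish path: inspect recent model changes and step runtime"
    # Assumed fallback when no signal stands out.
    return "orchestrator: read the failed task's logs"


# The 07:34 pattern: source on time, publish stale, shape stable, duration inflated.
step = first_check(source_fresh=True, published_today=False,
                   shape_normal=True, duration_in_band=False)
print(step)
```

The order of the branches is the design choice: upstream handoff first, then publish evidence, then shape, then runtime, so each answer rules out a whole area before the next question is asked.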

The audit query behind that panel is small on purpose:

select
  run_date,
  source_landed_at,
  published_at,
  duration_minutes,
  output_row_count
from pipeline_run_audit
where pipeline_name = 'inventory_availability'
order by run_date desc
limit 7;

I am not trying to build a perfect monitoring product. I want enough history to compare today’s run to the normal band, assign the first check cleanly, and keep the incident moving.
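
To make the normal band concrete, here is a minimal sketch that runs the same audit query against an in-memory SQLite table and compares today's run to the earlier ones. The table layout, column types, row values, and helper names are hypothetical; only the query shape and the 12 to 15 minute band come from the example.

```python
import sqlite3

AUDIT_QUERY = """
select run_date, source_landed_at, published_at, duration_minutes, output_row_count
from pipeline_run_audit
where pipeline_name = 'inventory_availability'
order by run_date desc
limit 7
"""


def duration_band(history):
    """(min, max) duration_minutes across earlier runs: a crude normal band."""
    durations = [row[3] for row in history]
    return min(durations), max(durations)


def row_count_delta_pct(current, previous):
    """Percent change in output_row_count between two runs."""
    return 100.0 * (current - previous) / previous


conn = sqlite3.connect(":memory:")
conn.execute(
    "create table pipeline_run_audit (pipeline_name text, run_date text, "
    "source_landed_at text, published_at text, duration_minutes integer, "
    "output_row_count integer)"
)
# Hypothetical history: six normal runs, then a slow run with no publish yet.
rows = [
    ("2025-11-26", "06:05", "07:12", 13, 99100),
    ("2025-11-27", "06:07", "07:15", 14, 99400),
    ("2025-11-28", "06:06", "07:11", 12, 99300),
    ("2025-11-29", "06:09", "07:16", 15, 99800),
    ("2025-11-30", "06:04", "07:13", 13, 99900),
    ("2025-12-01", "06:07", "07:14", 14, 100000),
    ("2025-12-02", "06:08", None, 41, 100400),
]
conn.executemany(
    "insert into pipeline_run_audit values ('inventory_availability', ?, ?, ?, ?, ?)",
    rows,
)

# Newest row comes first because the query orders by run_date desc.
today, *history = conn.execute(AUDIT_QUERY).fetchall()
low, high = duration_band(history)
in_band = low <= today[3] <= high
delta = row_count_delta_pct(today[4], history[0][4])
print(f"duration {today[3]} min, band {low}-{high} min, in band: {in_band}")
print(f"row-count delta vs previous run: {delta:+.1f}%")
```

A min/max band over six runs is deliberately crude. A median with a tolerance would resist one odd day better, but the point is only to have some recorded baseline to compare against.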

That is why I keep this operating view separate from the pipeline checks themselves. The checks live inside the data path. This panel is for the moment a run is late or fails and I need the next investigation step fast.

Tradeoffs

Breaks when: teams instrument every task in the DAG and bury the useful signals under noise → Mitigation: start with one pipeline tied to one business deadline and keep only the signals that change the first investigation step.
Breaks when: the orchestration UI shows task state but nothing about the published dataset → Mitigation: pair workflow state with one publish-completion signal and one business-output signal.
Breaks when: one platform-wide SLA hides different lateness patterns across feeds → Mitigation: define freshness windows per dataset and per business expectation, not as one generic cutoff.
Breaks when: alerts fire but nobody knows who owns the fix or what to check first → Mitigation: put the owner and the first runbook question in the same panel.
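
One way to keep per-dataset windows and ownership together is a small lookup table. This is a sketch under assumptions: FreshnessWindow, WINDOWS, and late_alert are hypothetical names, and the supplier_lead_times feed, with its cutoff, owner, and first check, is invented purely for contrast.

```python
from dataclasses import dataclass
from datetime import time
from typing import Optional


@dataclass(frozen=True)
class FreshnessWindow:
    publish_by: time   # business cutoff for this dataset, in ET
    owner: str
    first_check: str


WINDOWS = {
    "inventory_availability": FreshnessWindow(
        publish_by=time(7, 30),
        owner="platform analytics",
        first_check="inspect recent model changes and step runtime",
    ),
    # Hypothetical second feed with a later cutoff.
    "supplier_lead_times": FreshnessWindow(
        publish_by=time(11, 0),
        owner="supply data team",
        first_check="confirm the vendor file arrived",
    ),
}


def late_alert(dataset: str, now: time, published: bool) -> Optional[str]:
    """Return an alert naming the owner and first check, or None if not late."""
    window = WINDOWS[dataset]
    if published or now <= window.publish_by:
        return None
    return (
        f"{dataset} missed {window.publish_by:%H:%M} ET; "
        f"owner: {window.owner}; first check: {window.first_check}"
    )


alert = late_alert("inventory_availability", time(7, 34), published=False)
quiet = late_alert("supplier_lead_times", time(7, 34), published=False)
print(alert)
```

At 07:34 the same clock time is late for one feed and fine for the other, which is the case a single platform-wide cutoff would hide.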

Close

Next step: Pick one business-critical pipeline and write down the five signals you need before the next missed SLA turns into a meeting.

If a cutoff was missed recently, use one pipeline’s five-signal panel to decide whether the next fix is ownership, freshness, publication evidence, or orchestration work.

Continue reading

Row counts are not enough: the checks I add before I trust a pipeline

When a dashboard number changes, I check these four things first