Five pipeline observability signals before more orchestration
I use five signals—source freshness, latest publish, runtime, output shape, and owner—to make a missed analytics pipeline cutoff explainable before I add more orchestration.
At 07:34, I do not need another orchestration knob. I need signals that tell me whether the 08:00 ET review is still recoverable.
Before I add more orchestration, I want a small operating view: Did the source land? What was the latest successful publish? Did runtime drift? Does the output still look normal? Who owns the first check?
I keep that view separate from "Row counts are not enough: the checks I add before I trust a pipeline" because those checks live inside the data path. This panel is for the moment the publish is late, stale, or failed and I need the next investigation step fast.
If the panel says the publish completed and the output still looks wrong downstream, I move to "When a dashboard number changes, I check these four things first."
Problem
Imagine a daily inventory_availability pipeline that needs to publish by 07:30 ET for an 08:00 ET operations review.
At 07:34, the orchestrator shows a failed run. That matters, but it is not enough. I still need to know whether the source extract was late, whether the curated model slowed down, whether the latest publish is still yesterday's, and whether there is a clear owner for the first investigation step.
When those answers are missing, the team starts shopping for another retry rule or dependency feature. I would rather make the missed run legible first.
Default approach
- Start with one business-critical pipeline and one real business cutoff time.
- Show one upstream handoff signal, usually source freshness with the latest landed timestamp.
- Show one publish-completion signal, usually the latest successful publish or latest published partition timestamp.
- Track run duration against a normal band, not just success versus failure.
- Add one output-shape signal on the published model, such as row-count delta or unmatched-key rate.
- Put the owner and the first runbook question in the same view so the handoff starts immediately.
I do not want a wall of task states. I want a boundary view. Each signal should change either the next investigation step or my confidence in the published output.
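A minimal sketch of that boundary view as a data structure, assuming Python. The `Signal` and `Panel` names and the `ok` flag are my own illustration, not part of any orchestrator's API. The values are the ones from the inventory_availability example in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    name: str      # what the signal measures
    expected: str  # the deadline or normal band shown on the panel
    observed: str  # what the latest run reported
    ok: bool       # whether observed falls inside expected


@dataclass(frozen=True)
class Panel:
    pipeline: str
    cutoff: str       # the business deadline
    owner: str
    first_check: str  # the first runbook question
    signals: tuple[Signal, ...]

    def failing(self) -> list[str]:
        """Names of the signals that are outside their expected band."""
        return [s.name for s in self.signals if not s.ok]


# Values taken from the inventory_availability example; the field layout is hypothetical.
panel = Panel(
    pipeline="inventory_availability",
    cutoff="07:30 ET",
    owner="platform analytics",
    first_check="inspect recent model changes and step runtime",
    signals=(
        Signal("source extract freshness", "landed by 06:10 ET", "06:08 ET", True),
        Signal("latest successful publish", "today by 07:30 ET", "2025-12-01 07:14 ET", False),
        Signal("curated model duration", "12-15 minutes", "41 minutes", False),
        Signal("output row-count delta", "within +/-2%", "+0.4%", True),
    ),
)

print(panel.failing())
```

Keeping the owner and first check as fields on the panel, rather than as a fifth measured signal, is a design choice in this sketch: they never go "out of band", but they belong on the same screen.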
Example
Here is the five-signal panel I want for that inventory_availability run. The companion table view of the five-signal panel shows the same review surface:
- Source extract freshness: expected landed by 06:10 ET; observed 06:08 ET
- Latest successful publish: expected today by 07:30 ET; observed 2025-12-01 07:14 ET
- Curated model duration: expected 12–15 minutes; observed 41 minutes
- Output row-count delta: expected within +/-2%; observed +0.4%
- Owner + first check: platform analytics; inspect recent model changes and step runtime
That pattern tells me where to start.
The source landed on time, so I do not start with ingestion. The latest successful publish is still yesterday's, so today’s data did not make it across the finish line. The output row-count delta is stable, so this does not look like a missing partition or obvious shape break. The inflated duration points me at the transform or publish path before I waste time debating retries, dependencies, or dashboard logic.
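The reasoning above can be sketched as a small decision function. This is an assumption about ordering, not a rule from any tool: the function name, the boolean inputs, and the priority of the checks are all hypothetical, chosen to mirror how I read the panel.

```python
def first_investigation_step(
    source_on_time: bool,
    published_today: bool,
    duration_in_band: bool,
    shape_in_band: bool,
) -> str:
    """Map the four measured signals to a place to start looking."""
    if published_today:
        if shape_in_band:
            return "none: publish completed and output shape looks normal"
        return "output shape: check the published model for missing or duplicated data"
    # From here on, today's publish did not complete.
    if not source_on_time:
        return "ingestion: find out why the source extract landed late"
    if not duration_in_band:
        return "transform or publish path: inspect recent model changes and step runtime"
    if not shape_in_band:
        return "output shape: check for a missing partition or unmatched keys"
    return "orchestration: check scheduling, dependencies, and retries"


# The inventory_availability run: source on time, no publish today,
# duration far outside the band, row-count delta stable.
print(first_investigation_step(True, False, False, True))
```

Orchestration is deliberately the last branch: it is where I look only after the cheaper explanations have been ruled out.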
The audit query behind that panel is small on purpose:
select
    run_date,
    source_landed_at,
    published_at,
    duration_minutes,
    output_row_count
from pipeline_run_audit
where pipeline_name = 'inventory_availability'
order by run_date desc
limit 7;
I am not trying to build a perfect monitoring product. I want enough history to compare today’s run to the normal band, assign the first check cleanly, and keep the incident moving.
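As a rough sketch of what "compare today's run to the normal band" can mean, assuming the band is simply the minimum and maximum of recent healthy runs. The helper names are hypothetical, and the history and row counts below are invented for illustration; only the 41-minute duration, the 12 to 15 minute band, the +0.4% delta, and the +/-2% tolerance come from the example panel.

```python
def normal_band(history: list[float]) -> tuple[float, float]:
    """Return (low, high) from recent healthy runs. Deliberately naive."""
    if not history:
        raise ValueError("need at least one prior run")
    return min(history), max(history)


def in_band(value: float, band: tuple[float, float]) -> bool:
    low, high = band
    return low <= value <= high


def row_count_delta_pct(today: int, previous: int) -> float:
    """Percentage change in row count versus the previous publish."""
    return 100 * (today - previous) / previous


# Invented duration history (minutes) for prior healthy runs.
prior_durations = [13, 14, 12, 15, 13, 14]
band = normal_band(prior_durations)
duration_ok = in_band(41, band)  # 41 minutes is the observed duration

# Invented row counts chosen to reproduce the +0.4% delta on the panel.
delta = row_count_delta_pct(100_400, 100_000)
shape_ok = abs(delta) <= 2.0  # +/-2% tolerance from the panel

print(band, duration_ok, round(delta, 2), shape_ok)
```

A min/max band widens permanently after one slow run, so in practice I would only feed it runs that were already judged healthy, or swap in a median-based band.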
That is why I keep this operating view separate from the pipeline checks themselves. The checks guard the data path; the panel is for triage once a run is already late or failed and I need the next investigation step fast.
Tradeoffs
- Breaks when: teams instrument every task in the DAG and bury the useful signals under noise → Mitigation: start with one pipeline tied to one business deadline and keep only the signals that change the first investigation step.
- Breaks when: the orchestration UI shows task state but nothing about the published dataset → Mitigation: pair workflow state with one publish-completion signal and one business-output signal.
- Breaks when: one platform-wide SLA hides different lateness patterns across feeds → Mitigation: define freshness windows per dataset and per business expectation, not as one generic cutoff.
- Breaks when: alerts fire but nobody knows who owns the fix or what to check first → Mitigation: put the owner and the first runbook question in the same panel.
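The per-dataset freshness windows mentioned in the mitigations above could be as simple as a lookup table. This is a sketch under assumptions: the `FRESHNESS_DEADLINES` name and the second feed are hypothetical, and all times are assumed to already be in ET, so no timezone handling is shown.

```python
from datetime import time

# Landing deadline per dataset, in ET. Only the first entry comes from the
# example panel; the second is an invented feed to show a different window.
FRESHNESS_DEADLINES = {
    "inventory_availability": time(6, 10),
    "hypothetical_orders_feed": time(5, 30),
}


def landed_late(dataset: str, landed_at: time) -> bool:
    """True when the source landed after that dataset's own deadline."""
    return landed_at > FRESHNESS_DEADLINES[dataset]


# The same 06:08 landing is on time for one feed and late for the other.
print(landed_late("inventory_availability", time(6, 8)))
print(landed_late("hypothetical_orders_feed", time(6, 8)))
```

The point of the table is that lateness is defined per dataset and per business expectation, so one platform-wide cutoff never has to stand in for all of them.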
Close
Next step: Pick one business-critical pipeline and write down the five signals you need before the next missed SLA turns into a meeting.
If a cutoff was missed recently, use one pipeline’s five-signal panel to decide whether the next fix is ownership, freshness, publication evidence, or orchestration work.