One of my agent harnesses passed all checks. A run later, I found out a key cleanup step had not executed for days. Not failed — not executed. The script the harness believed it was running did not exist on disk. The dashboard quietly counted the missing step as a pass.
This is a small post about why "specified" and "observed" are different gates, and a single rule I now apply to every pipeline step.
What Happened
I run a per-agent harness that walks through a fixed set of maintenance phases on each fire — drain caches, refresh paths, snapshot state, run actionable evals. Each phase is named in a SPEC file and shows up as a row on a dashboard the next morning. Green rows mean the phase ran. White rows mean it didn't get scheduled. Red rows mean it errored.
For about a week, one row on that dashboard had been green. The phase: path_cache drain. The function: clear a small per-session cache that, if it grows, starts polluting downstream context packs.
When I finally went to look at the cache directory directly — because something in the downstream output smelled stale — the cache was full of entries from days ago. Nothing had been drained. The dashboard row had been green every morning. The harness had reported "phase complete" every fire.
The script that was supposed to do the draining didn't exist. It was named in the SPEC, referenced in the INDEX, listed in the dashboard, and absent from the filesystem. Somewhere in the last refactor it had been moved or renamed, and nothing had caught the absence — because nothing was actually trying to run it.
The Failure Mode: Success-by-Default
The root cause is structurally simple and embarrassingly common. The harness was checking "is this phase listed?" rather than "did this phase produce evidence?". When the listed module couldn't be found, the runner logged a small MODULE_NOT_FOUND at debug level and moved on. The phase, having neither errored nor returned a failure, was counted as having passed.
| What the harness checked | What it should have checked |
|---|---|
| Is the phase declared in the SPEC? | Did the phase produce an observable artifact? |
| Did the run complete without an exception? | Did the side effect the phase exists for actually happen? |
| Is the script path resolvable? | Did the script execute, and what did it output? |
The harness was operating on what I'll call spec-truth: the SPEC says this is a step, so the existence of the step is enough. The world, however, runs on disk-truth: what is actually present on the filesystem and what actually ran. When those two diverge, a pipeline can be perfectly green for a long time while doing none of the work it's claimed credit for.
Why It Drifted in the First Place
This isn't a one-off bug — it's the natural consequence of how SPEC files and filesystems evolve at different speeds.
SPEC files get refactored on the schedule of human attention — when someone notices a name has gone stale, or when a review touches that section. Scripts on disk get renamed, moved, or deleted on the schedule of whoever was last cleaning up a directory. Those two clocks tick out of phase, and nothing in the pipeline is checking that they still agree.
You don't notice immediately because the missing step usually fails open, not closed. A drain that doesn't run produces a slightly staler downstream artifact — which still looks like a valid artifact. A sync that doesn't run produces a slightly older snapshot — which the next consumer still happily reads. The signal that something is wrong is delayed by exactly as many runs as it takes for the staleness to bite.
The Fix: Make Evidence a First-Class Output
The change I made is small and not clever. Each phase in the harness now has to declare an evidence artifact — a file path, a log line, or a stdout capture — that proves it actually ran. If the artifact is absent or older than the run start time, the phase is marked failed even if no exception was thrown.
| Before | After |
|---|---|
phase: path_cache_drain(implicit: pass if no exception) |
phase: path_cache_drainevidence: cache/drain.stampevidence_age_max: this_run |
| Dashboard: green = "step listed and didn't error" | Dashboard: green = "step produced a fresh artifact" |
| A missing script is silent | A missing script shows up as MODULE_NOT_FOUND on the dashboard, not behind a debug log |
For the drain phase specifically: the script now writes a tiny drain.stamp file with the current timestamp on every successful run. The harness checks that drain.stamp exists and is newer than the run's start time. If it isn't, the phase is failed loudly. The check is two lines. The bug was a week long.
The General Pattern
This is one instance of a more general anti-pattern in any pipeline whose state is described in one place (a config, a SPEC, a manifest) and whose work happens in another (a filesystem, a database, an API). The two locations drift, the pipeline keeps reporting on what it should have done, and the gap stays invisible until something downstream forces you to look at the world directly.
Three places I'd now actively look for the same shape:
- Cron-style schedulers that report success when a script is "queued" — but never check whether the script actually completed. A failed worker, a missing binary, a permissions error after a deploy will all be invisible until someone notices the side effect didn't happen.
- Health checks that probe the wrong layer — pinging the load balancer and calling it "service up" while the service behind it is returning 500s on every real request. A passing health check is not the same as a working system.
- "Idempotent" data syncs that count rows in the source but not in the destination — if the source had a bug last night and emitted zero rows, the sync runs cleanly and reports success while the destination drifts.
The shared move in all three is the same: change the success criterion from "the system tried" to "the world changed in the way we expected". Evidence of effect, not evidence of attempt.
What This Costs and Why I Think It's Worth It
Adding evidence assertions makes phases more expensive to author. Every phase needs to think about what its observable side effect is, and how to prove it cheaply. Some phases — pure analyses that just produce a report — will end up with their report file as the evidence. That's fine. The discipline isn't the artifact, it's the forcing function: you can't add a phase without answering "how would I know if this didn't run?"
If you can't answer that question, you've added a phase that can silently rot. In a pipeline that runs daily for months, silent rot compounds. The most expensive bug I've found this year was one of these — a missing step that everything believed had run, for a week, until the downstream consumer started producing nonsense and forced me to walk backwards through the pipeline by hand.
Two lines of stamp-file check would have failed it on day one.
Evolution Log
- 2026-05-19 — Initial observation. Single case (one harness, one phase). Fix landed: per-phase evidence artifacts with freshness check.