When Tests Pass But Nothing Works: The Verification Gap

A bug in the tutorial system was "fixed" five times across five consecutive sprints. Each fix looked like this: the AI agent found the relevant code, made a change, the code compiled, the sprint closed as complete. The bug was still there in sprint six.

Separately, a UI overlap — the tutorial overlay rendering on top of the minimap — went unnoticed for seven sprints. A visual check ran every sprint. Nobody flagged it.

These aren't random failures. They're the same failure in two different forms. I'm calling it the verification gap: the distance between what your agents are told to check and what actually needs to be true for the feature to work.

Why This Happens

AI agents are extremely literal about verification. They check what you define, in the way you defined it. Nothing more.

When I wrote the acceptance criteria for the tutorial bug fix, I wrote something like: "The TUTORIAL_ADVANCE event should fire when the player completes the first action." Mars (the agent) made code changes that were plausibly connected to that criterion. The code compiled. No runtime error. Sprint closed.

What I didn't write: "Verify this by running the game in Play Mode and completing the first action." That step — which seems obvious in hindsight — wasn't in the criteria. So it didn't happen.

The UI overlap was similar. The visual check criteria said: "Verify that the HUD elements are positioned correctly." Minimap: positioned correctly. Tutorial overlay: positioned correctly. The overlap between them wasn't a criteria violation. Neither element was wrong. Together they were.

Three Manifestations

I've now seen this pattern in three forms in my own project. They all have the same root cause but present differently:

1. Compile-pass-as-verification. The agent makes a code change. The code compiles without errors. The sprint closes. Nobody runs the code. This is the most common form and the most insidious because compilation success is a real signal — just not a sufficient one.

2. Criteria-bounded visual check. The agent takes a screenshot, compares it against the listed criteria, reports pass. Issues that exist outside the criteria boundary are invisible. The agent isn't failing at its job — it's doing exactly its job. The criteria were the problem.

3. Report-as-artifact. A build script finishes, reports success, and moves on. The actual output file doesn't exist, or is corrupt, or is the wrong version. The report was real; the artifact wasn't. (This one is rarer, but I've hit it.)

Pattern	False signal	What was actually missing
Compile-pass-as-verification	No build error	Runtime execution check
Criteria-bounded visual check	All listed elements OK	Interaction / overlap check
Report-as-artifact	Script exit 0	Output file existence check
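For the report-as-artifact row, the missing check is small enough to sketch. The Python below is a minimal illustration, not the project's actual build tooling: the function name `verify_build`, the `min_bytes` threshold, and the idea of passing the expected output path explicitly are all assumptions made for the example. It treats exit code 0 as necessary but not sufficient, and then looks at the file itself.

```python
import subprocess
import sys
from pathlib import Path


def verify_build(cmd, artifact, min_bytes=1):
    """Run a build command, then confirm the artifact really exists.

    Returns (ok, reason). `cmd`, `artifact`, and `min_bytes` are
    hypothetical parameters chosen for this sketch.
    """
    result = subprocess.run(cmd)
    if result.returncode != 0:
        return False, f"command exited with code {result.returncode}"

    path = Path(artifact)
    if not path.is_file():
        # The script reported success but produced nothing.
        return False, f"missing artifact: {path}"

    size = path.stat().st_size
    if size < min_bytes:
        # An empty or truncated file is a common sign of a bad build.
        return False, f"artifact too small: {size} bytes"

    return True, "ok"


if __name__ == "__main__":
    # Example: a command that exits 0 without writing any output file.
    ok, reason = verify_build([sys.executable, "-c", "pass"], "build/output.bin")
    print(ok, reason)
```

This only covers "doesn't exist" and "is empty". Detecting a wrong version or a corrupt file would need something project-specific, such as a checksum or a version stamp inside the artifact.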

What Closed the Gap

Three concrete changes. None of them are complicated. All of them require discipline to maintain.

Mandatory runtime verification in the acceptance criteria (AC). For any behavioral feature, the acceptance criteria now explicitly include a runtime check: "Verify by running in Play Mode and [specific action]." If the criteria don't specify a runtime check, the agent isn't required to do one. So the criteria have to specify it.

Out-of-criteria anomaly detection. The visual check now has a second pass: "Regardless of the criteria, flag any of the following: unexpected overlapping elements, layout that differs significantly from the previous screenshot, anything that looks visually wrong even if it's not in the criteria." This is a weak filter — the agent can't know what "looks wrong" means in the abstract — but it catches the obvious cases.
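The "unexpected overlapping elements" part of that second pass can be made mechanical if the UI layout is available as rectangles. The sketch below assumes the game can export each HUD element's screen-space bounding box; that export step, the `Rect` type, and the `allowed` whitelist are hypothetical and not part of the setup described here. The coordinates in the usage example are invented for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Rect:
    """Axis-aligned screen rectangle for one UI element (hypothetical type)."""
    name: str
    x: float
    y: float
    width: float
    height: float


def overlaps(a, b):
    """True if the two rectangles share any interior area."""
    return (
        a.x < b.x + b.width
        and b.x < a.x + a.width
        and a.y < b.y + b.height
        and b.y < a.y + a.height
    )


def find_overlaps(rects, allowed=frozenset()):
    """Return name pairs that overlap and are not explicitly allowed.

    `allowed` holds frozensets of two names for overlaps that are intentional.
    """
    found = []
    for a, b in combinations(rects, 2):
        if frozenset((a.name, b.name)) in allowed:
            continue
        if overlaps(a, b):
            found.append((a.name, b.name))
    return found


if __name__ == "__main__":
    hud = [
        Rect("minimap", 1000, 20, 200, 200),
        Rect("tutorial_overlay", 900, 100, 400, 300),
        Rect("health_bar", 20, 20, 300, 40),
    ]
    print(find_overlaps(hud))  # [('minimap', 'tutorial_overlay')]
```

The point of checking pairs rather than individual elements is that it catches exactly the case the per-element criteria missed: two elements that are each "positioned correctly" but collide with each other.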

Failure memory. When a bug fix attempt fails, that attempt gets logged: what was tried, why it didn't work. The next sprint's AC gets a line: "Previous attempts: [list]. Verify those approaches don't re-introduce the same issue." This prevents the cycle of identical "fixes" that don't fix anything.
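A failure log like this doesn't need much machinery. Here is one way it could be sketched in Python, assuming an append-only JSON Lines file; the file format, the field names (`bug_id`, `sprint`, `tried`, `why_failed`), and both function names are assumptions for illustration rather than the actual implementation.

```python
import json
from pathlib import Path


def log_attempt(log_path, bug_id, sprint, tried, why_failed):
    """Append one failed fix attempt to a JSON Lines log (hypothetical format)."""
    entry = {
        "bug_id": bug_id,
        "sprint": sprint,
        "tried": tried,
        "why_failed": why_failed,
    }
    with Path(log_path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")


def previous_attempts_line(log_path, bug_id):
    """Build the 'Previous attempts' line for the next sprint's AC.

    Returns an empty string when nothing has been logged for this bug.
    """
    path = Path(log_path)
    if not path.exists():
        return ""

    attempts = []
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines in the log
            entry = json.loads(line)
            if entry["bug_id"] == bug_id:
                attempts.append(
                    f"sprint {entry['sprint']}: {entry['tried']} "
                    f"(failed: {entry['why_failed']})"
                )

    if not attempts:
        return ""
    return (
        "Previous attempts: " + "; ".join(attempts)
        + ". Verify those approaches don't re-introduce the same issue."
    )
```

Keeping the "why it didn't work" field mandatory matters more than the storage format: without it, the next attempt only learns what was tried, not what to avoid.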

The Underlying Problem

The verification gap isn't really a bug in how agents work. It's a consequence of how verification works in general: you can only verify what you specify. Human reviewers have the same problem — they just bring implicit knowledge about what "obviously matters" that agents don't have yet.

When you're running a solo project with agents doing most of the implementation, the implicit knowledge is yours and it stays in your head. It doesn't automatically transfer to the verification criteria. Closing the verification gap means externalizing that knowledge — writing down the checks that you'd obviously do manually — so the agents can do them automatically.

That's the work. Not the implementation. The specification of what "verified" actually means.

I still don't have a complete solution to this. The failure memory approach helps with known failure modes. The out-of-criteria anomaly detection is weak. I think the missing piece is something like a "verification coverage" concept: a way to flag when a feature has no runtime check defined, the way a coverage tool flags unexecuted branches. I haven't built that yet.
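To make the idea concrete, one possible shape for such a check is sketched below in Python. It is an illustration under assumptions, not something that exists in the project: it presumes each acceptance criterion is tagged with a `kind` ("static", "runtime", or "visual") and each feature is marked as behavioral or not. Those labels and the `Criterion` and `Feature` types are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Criterion:
    text: str
    kind: str  # hypothetical labels: "static", "runtime", or "visual"


@dataclass
class Feature:
    name: str
    behavioral: bool
    criteria: list = field(default_factory=list)


def uncovered_features(features):
    """Names of behavioral features with no runtime criterion defined."""
    return [
        f.name
        for f in features
        if f.behavioral and not any(c.kind == "runtime" for c in f.criteria)
    ]


if __name__ == "__main__":
    features = [
        Feature(
            "tutorial advance",
            behavioral=True,
            criteria=[
                Criterion("TUTORIAL_ADVANCE fires on first action", "static"),
            ],
        ),
        Feature(
            "HUD layout",
            behavioral=False,
            criteria=[Criterion("HUD elements positioned correctly", "visual")],
        ),
    ]
    print(uncovered_features(features))  # ['tutorial advance']
```

The hard part this sketch skips is the tagging itself: someone still has to decide which features are behavioral and which criteria genuinely exercise the running game.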

Evolution Log

2026-04-14 — Initial observation. 5-sprint compile-pass bug + 7-sprint visual overlap. Three patterns identified (compile-pass, criteria-bounded, report-as-artifact). Failure memory and runtime-required AC as partial fixes. Verification coverage concept identified as open problem.