Agent Observability: When Every Object Can Report Its Own State

For most of this project, my AI agents could see the game in one of two ways: they could read the code, or they could look at screenshots. Code tells you what should happen. Screenshots tell you what the screen looks like at one moment in time. Neither tells you what's actually happening at runtime — what events fire, in what order, with what values.

This gap is what I've been calling the verification gap. The agent closes a sprint. The code looks correct. But nobody ran it, so we don't know whether the mechanic actually works. (I wrote about this in a previous post.)

The architecture I'll describe here is an attempt to close that gap structurally — by making every game object capable of reporting its own state, automatically, as part of normal gameplay.

The Architecture: IClaudeStream

The core idea is an interface — IClaudeStream — that any Unity MonoBehaviour can implement. Implementing the interface commits that object to one contract: it will emit structured log entries through PlaytestLogger whenever significant state changes occur.

// Any game object that implements this interface
// becomes observable by the agent team.
// StreamPriority is a type defined elsewhere in the project.
public interface IClaudeStream {
    string GetStreamId();      // unique identifier for this object's stream
    StreamPriority Priority(); // how urgent this object's signals are
    void OnClaudeConnect();    // called when an agent session starts
}

The PlaytestLogger on the other end is simple: it writes structured JSON entries to a file the agents can read. Each entry has a timestamp, an event type, the emitting object's ID, and a payload of whatever state is relevant.
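To make the entry shape concrete, here is a minimal sketch of what such a logger could look like. It is not the project's actual PlaytestLogger: the class names, property names, the one-JSON-object-per-line layout, and the use of System.Text.Json (instead of whatever serializer the Unity project uses) are all assumptions made for illustration. Only the four fields (timestamp, event type, emitting object's ID, payload) come from the description above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

// Hypothetical entry type: the four fields match the description above,
// but the property names and types are assumptions.
public sealed class PlaytestEntry
{
    public string Timestamp { get; }
    public string EventType { get; }
    public string StreamId { get; }
    public Dictionary<string, object> Payload { get; }

    public PlaytestEntry(string timestamp, string eventType, string streamId,
                         Dictionary<string, object> payload)
    {
        Timestamp = timestamp;
        EventType = eventType;
        StreamId = streamId;
        Payload = payload;
    }
}

// Hypothetical stand-in for PlaytestLogger.
public static class PlaytestLoggerSketch
{
    // Serializes one entry as a single line of JSON
    // (one object per line is an assumed layout).
    public static string Format(string eventType, string streamId,
                                Dictionary<string, object> payload, DateTime utcNow)
    {
        var entry = new PlaytestEntry(utcNow.ToString("o"), eventType, streamId, payload);
        return JsonSerializer.Serialize(entry);
    }

    // Appends the entry to a log file that the agents read after the run.
    public static void Log(string path, string eventType, string streamId,
                           Dictionary<string, object> payload)
    {
        string line = Format(eventType, streamId, payload, DateTime.UtcNow);
        File.AppendAllText(path, line + Environment.NewLine);
    }
}
```

Keeping formatting separate from file I/O means the entry shape can be checked without touching the disk, which is convenient when the log format itself is something the agents depend on.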

What Changed

Before this architecture, when Mars (the QA agent) needed to verify a mechanic, the workflow was: read the relevant code, identify the acceptance criteria, try to infer whether the criteria would be met. If something was broken, Mars would have to find it by reading code — which only tells you what's supposed to happen.

After: Mars reads the playtest log. The log shows exactly what happened, at what time, in what order. If TUTORIAL_ADVANCE never fired in a run where it should have, that's in the log. If RESOURCE_DAMAGE_STAGE went straight from 0 to 3 without passing through 1 or 2, that's in the log. The verification step shifted from inference to observation.

Before IClaudeStream	After IClaudeStream
Mars reads code to infer behavior	Mars reads playtest log to observe behavior
Verification = "code looks right"	Verification = "event sequence was correct"
Bugs hidden until manual playtest	Bugs visible in log immediately after AutoPlayer run
Evidence is static (screenshots)	Evidence is dynamic (event stream)
AutoPlayer can run but can't report	AutoPlayer run generates full observable trace
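The two checks mentioned above (an event that never fired, a damage stage that skipped steps) can be sketched as plain functions over parsed log entries. This is an illustration, not the checks Mars actually runs: the LoggedEvent type, the integer Stage field, and the assumption that every resource starts at stage 0 are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical parsed log entry; Stage is only meaningful
// for RESOURCE_DAMAGE_STAGE events.
public sealed class LoggedEvent
{
    public string EventType { get; }
    public string StreamId { get; }
    public int Stage { get; }

    public LoggedEvent(string eventType, string streamId, int stage = 0)
    {
        EventType = eventType;
        StreamId = streamId;
        Stage = stage;
    }
}

public static class LogChecks
{
    // True if at least one event of the given type appears in the log.
    public static bool Fired(IEnumerable<LoggedEvent> log, string eventType)
    {
        return log.Any(entry => entry.EventType == eventType);
    }

    // Reports every damage-stage transition that jumped by more than one
    // step for the same object. Assumes each object starts at stage 0.
    public static List<string> SkippedStages(IEnumerable<LoggedEvent> log)
    {
        var lastStage = new Dictionary<string, int>();
        var problems = new List<string>();

        foreach (var ev in log.Where(entry => entry.EventType == "RESOURCE_DAMAGE_STAGE"))
        {
            int previous = lastStage.TryGetValue(ev.StreamId, out var seen) ? seen : 0;
            if (ev.Stage - previous > 1)
            {
                problems.Add($"{ev.StreamId}: stage {previous} -> {ev.Stage}");
            }
            lastStage[ev.StreamId] = ev.Stage;
        }
        return problems;
    }
}
```

A run where `Fired(log, "TUTORIAL_ADVANCE")` is false, or where `SkippedStages(log)` is non-empty, is a failed verification that no amount of code reading would have surfaced.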

What's Currently Instrumented

At the time of writing, these events are streaming through PlaytestLogger in the game:

HIT_DIRECTION_FLASH — 8-directional hit detection events
RESOURCE_DAMAGE_STAGE / RESOURCE_VANISHED — resource destruction stages
BIOME_FLOOR_APPLIED — biome floor sprite application
DEATH_CAUSE — player death classified by 5 cause categories
Quality Gate 3 evidence data (automatically collected on gate evaluation)
NPC weekly events (4 indicators, generalized)
First-5-minute experience markers (4 indicators)

The pattern for adding a new event is roughly 10 lines of code: implement IClaudeStream, add a PlaytestLogger.Log() call at the relevant state transition, register the type. The low addition cost matters — if instrumentation is expensive, you only add it when something breaks. The goal is to add it before things break.
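As a rough picture of those three steps, here is a sketch that compiles outside Unity. Everything beyond the IClaudeStream method signatures is assumed: the StreamPriority values, the in-memory FakePlaytestLogger (including its Register and Log signatures), the DOOR_STATE event, and the choice to register the type in OnClaudeConnect are all hypothetical. In the real project the class would presumably derive from MonoBehaviour.

```csharp
using System;
using System.Collections.Generic;

// Stand-ins so the sketch is self-contained; the enum values are assumed.
public enum StreamPriority { Low, Normal, High }

public interface IClaudeStream
{
    string GetStreamId();
    StreamPriority Priority();
    void OnClaudeConnect();
}

// Hypothetical in-memory stand-in for PlaytestLogger.
public static class FakePlaytestLogger
{
    public static readonly List<string> Entries = new List<string>();
    public static readonly HashSet<string> RegisteredTypes = new HashSet<string>();

    public static void Register(string eventType)
    {
        RegisteredTypes.Add(eventType);
    }

    public static void Log(string eventType, string streamId, string payload)
    {
        if (!RegisteredTypes.Contains(eventType))
        {
            throw new InvalidOperationException($"Unregistered event type: {eventType}");
        }
        Entries.Add($"{eventType}|{streamId}|{payload}");
    }
}

// Hypothetical game object; MonoBehaviour is omitted to stay self-contained.
public sealed class Door : IClaudeStream
{
    private readonly string id;
    private bool isOpen;

    public Door(string id) { this.id = id; }

    // Step 1: implement IClaudeStream.
    public string GetStreamId() { return id; }
    public StreamPriority Priority() { return StreamPriority.Normal; }

    // Step 3: register the event type (doing it here is an assumption).
    public void OnClaudeConnect() { FakePlaytestLogger.Register("DOOR_STATE"); }

    public void SetOpen(bool open)
    {
        if (open == isOpen) return; // log only real state transitions
        isOpen = open;
        // Step 2: one Log() call at the state transition.
        FakePlaytestLogger.Log("DOOR_STATE", id, open ? "open" : "closed");
    }
}
```

The guard in SetOpen matters more than it looks: logging on every call rather than on every transition would fill the log with entries that say nothing changed.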

The Interesting Second-Order Effect

I didn't anticipate this when designing the architecture: observable state changes how the AutoPlayer gets used. When the AutoPlayer couldn't report what it was observing, it was mainly useful as a stress test — run it for a while, see if the game crashes. Now it's a measurement tool. You can tell the AutoPlayer to run a specific scenario, then read what the object stream produced, and draw conclusions about mechanic behavior.

This is still early. I'm describing an architecture that's been validated in one project with one configuration of agents. How well this generalizes to other agent setups, other game types, or more complex state relationships — I genuinely don't know yet. What I can say is that the verification gap shrinks noticeably when objects report themselves, and that felt like the important problem to solve first.

Evolution Log

2026-04-21 — Initial documentation. IClaudeStream + PlaytestLogger architecture validated with 9+ event types, all passing ARCH-CS2 9/9 checks. AutoPlayer integration confirmed. Pattern sourced from PP Sprint 87 observability push.