Last week I ran 8 complete game development sprints in 7 days. I want to write about what that actually looked like — not the polished version, but the one with the bugs and the things that broke.
I'm a solo developer working on a mobile game in my off-hours. I have a team of AI agents handling different roles: one writes and validates code (Mars), one generates and verifies art assets (Iris), one runs automated gameplay tests (AutoPlayer), and one monitors codebase health (Sheldon). I direct strategy and make decisions. They execute.
The sample is small: roughly eight sprints over one week. The pattern held for that week. I don't know yet whether it holds over a month.
What "8 Sprints" Actually Means
Each sprint is roughly: a set of acceptance criteria → implementation → code review → automated tests → playtest → merge. In a solo project without agents, a sprint takes me a day if things go well, longer if I hit a hard bug.
That week (Sprints 41–48), we shipped:
- A full fist overlay animation rework — pivot correction, world-anchored positioning
- 3-stage combo timing differentiation
- NPC quest system foundation
- Portal summon HUD with kill-gate unlock
- Portal-to-portal camera transition with occlusion
- Chunk pop-in optimization (eliminating visible terrain loading)
- Capture probability display (real-time HP-based percentages)
- Sheldon's first golden snapshot — a verified "good state" baseline
~38 commits across 7 days. Most of them happened while I was at work.
Figure: Sprint throughput before vs. after agent orchestration. Solo dev capacity went from 2 sprints/week unaided to 8 with agent orchestration (N = 1 week, sprint scope roughly equivalent).
How the Agent Roles Actually Divided
The division that emerged isn't what I originally designed. It shifted based on what each agent turned out to be reliable for:
- I set the sprint goal and write the acceptance criteria.
- Mars implements and does an initial code review.
- Iris generates art assets and verifies them against a palette/SSIM threshold.
- AutoPlayer runs gameplay scenarios and checks that the mechanics behave as specified.
- Sheldon gates the merge: the code has to pass a health check before it goes in.
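To make the palette/SSIM check concrete, here is a minimal sketch of what such an asset gate could look like. It is not Iris's actual implementation: the threshold value, the pixel layout (flat lists of RGB tuples), and every function name are assumptions for illustration. It also uses a simplified single-window SSIM over the whole image, whereas standard SSIM averages the score over local windows.

```python
# Sketch only: the threshold and data layout are assumptions,
# not the real configuration used in the project.
SSIM_THRESHOLD = 0.90  # hypothetical pass mark


def to_gray(pixels):
    """Convert (r, g, b) tuples to luma values using Rec. 601 weights."""
    return [0.299 * r + 0.587 * g + 0.114 * b for r, g, b in pixels]


def global_ssim(a, b, dynamic_range=255.0):
    """Single-window SSIM over two equal-length lists of grayscale values."""
    if len(a) != len(b) or not a:
        raise ValueError("images must be non-empty and the same size")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    var_a = sum((x - mean_a) * (x - mean_a) for x in a) / n
    var_b = sum((y - mean_b) * (y - mean_b) for y in b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    # Stabilising constants from the usual SSIM definition.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    numerator = (2 * mean_a * mean_b + c1) * (2 * cov + c2)
    denominator = (mean_a ** 2 + mean_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


def off_palette_colors(pixels, palette):
    """Return every colour used by the asset that is not in the palette."""
    return set(pixels) - set(palette)


def check_asset(candidate, reference, palette, threshold=SSIM_THRESHOLD):
    """Pass only if the asset stays on-palette and is close to the reference."""
    stray = off_palette_colors(candidate, palette)
    score = global_ssim(to_gray(candidate), to_gray(reference))
    return {
        "passed": (not stray) and score >= threshold,
        "ssim": score,
        "off_palette": stray,
    }
```

The useful property is that the result carries the reason for a failure (the score and the stray colours), so a blocked asset can be reported rather than silently rejected.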
The key thing is that each agent has an exit condition. They can block a sprint from progressing. This is what makes it possible to run multiple sprints without me checking every output — the agents check each other's outputs.
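As a rough illustration of the exit-condition idea, the sketch below runs a sequence of gates and stops at the first one that blocks. The gate order, the artifact keys (`compiles`, `assets_ok`, `scenarios_failed`, `health_ok`), and the function names are all hypothetical; the real agents exchange richer reports than a dictionary.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class GateResult:
    gate: str
    passed: bool
    reason: str = ""


# A gate inspects the sprint's artifacts and returns (passed, reason).
Gate = Callable[[dict], Tuple[bool, str]]


def run_gates(artifacts: dict, gates: List[Tuple[str, Gate]]) -> List[GateResult]:
    """Run gates in order and stop at the first one that blocks."""
    results = []
    for name, check in gates:
        passed, reason = check(artifacts)
        results.append(GateResult(name, passed, reason))
        if not passed:
            break  # exit condition hit: later gates never run
    return results


def sprint_status(results: List[GateResult]) -> str:
    """Summarise a run as either a merge or a blocked build with a reason."""
    if results and not results[-1].passed:
        blocked = results[-1]
        return f"BLOCKED at {blocked.gate}: {blocked.reason}"
    return "MERGED"


# Hypothetical gates, one per agent role.
def mars_review(a):
    ok = a.get("compiles", False)
    return ok, ("" if ok else "build failed")


def iris_assets(a):
    ok = a.get("assets_ok", False)
    return ok, ("" if ok else "asset check failed")


def autoplayer(a):
    failed = a.get("scenarios_failed", 0)
    return failed == 0, ("" if failed == 0 else f"{failed} scenario(s) failed")


def sheldon_health(a):
    ok = a.get("health_ok", False)
    return ok, ("" if ok else "health check failed")


GATES = [
    ("mars_review", mars_review),
    ("iris_assets", iris_assets),
    ("autoplayer", autoplayer),
    ("sheldon_health", sheldon_health),
]
```

Calling `sprint_status(run_gates(artifacts, GATES))` yields either `MERGED` or a one-line explanation of which gate blocked and why, which is the kind of report that can be reviewed in a few minutes.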
What Actually Went Wrong
This is the part I want to be honest about, because most writeups of "AI did X in Y time" skip it.
First: a bug in the tutorial system was flagged as "fixed" in three consecutive sprints. It wasn't. Each sprint, Mars made a code change, the code compiled, and the sprint closed as complete. The bug persisted because no one ran the game in Play Mode to verify it. Compilation success got treated as functional verification. It took us until Sprint 48 to catch it — and catching it required me to actually run a playtest session, not just read the logs.
Second: there was a UI overlap issue — the tutorial overlay and the minimap were rendering on top of each other in a way that was clearly broken. It went unnoticed for seven sprints. It wasn't in the acceptance criteria, so the visual check passed. An agent checked the screen, saw nothing in the criteria was violated, and moved on.
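One way to catch this class of bug is a standing invariant that runs every sprint whether or not the acceptance criteria mention it. The sketch below checks screen-space rectangles for overlap. It is an assumption-heavy illustration: the `Rect` type, the element names, and the coordinates are made up, and how the rectangles would be extracted from the running game is not shown.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Rect:
    name: str
    x: float
    y: float
    width: float
    height: float


def overlaps(a: Rect, b: Rect) -> bool:
    """True if two axis-aligned rectangles share any interior area."""
    return (
        a.x < b.x + b.width
        and b.x < a.x + a.width
        and a.y < b.y + b.height
        and b.y < a.y + a.height
    )


def find_overlaps(elements, allowed=frozenset()):
    """Return every overlapping pair that is not explicitly allowed.

    `allowed` holds frozensets of element names for intentional overlaps,
    such as a tooltip drawn on top of a button.
    """
    clashes = []
    for a, b in combinations(elements, 2):
        if overlaps(a, b) and frozenset((a.name, b.name)) not in allowed:
            clashes.append((a.name, b.name))
    return clashes


# Hypothetical layout resembling the bug described above.
hud = [
    Rect("tutorial_overlay", 800, 0, 400, 300),
    Rect("minimap", 1000, 20, 200, 200),
]
print(find_overlaps(hud))
```

The design choice that matters is the default: any overlap fails unless someone has listed it as intentional, so the check does not depend on a sprint's criteria happening to mention the screen region.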
Both of these failures have the same shape: the verification gate was specified too narrowly. The agents verified exactly what they were told to verify. Nothing more.
I'll write more about this pattern separately — it's worth its own post — but the short version is: the productivity gain from agent orchestration is real, and so is this new failure mode. You're trading "the human misses things because they're busy" for "the agent misses things because it wasn't told to look."
What Made the Difference
The biggest actual productivity driver wasn't any individual agent — it was the asynchronous feedback loop. I'd write the acceptance criteria in the evening. By the next morning, there'd be a merged commit or a blocked build with a report explaining why it failed. I could review that in ten minutes and write the next sprint's criteria before I went to work.
The bottleneck shifted from "hours to implement" to "minutes to review and redirect." That's the leverage. It isn't implementation speed: in a focused session I can write code faster than Mars. It's the parallelism: while I'm at work, 6–8 hours of implementation get done that I'd otherwise have to do myself.
The Honest Assessment
8 sprints in a week is a real number. The features shipped are real. The bugs that slipped through are also real.
The productivity gain is genuine, but it comes with a specific new cost: you have to invest time upfront defining what "verified" means for every feature. If you skip that work, the agents will work fast and verify nothing meaningful. You'll ship code that compiles, art that renders, and mechanics that technically fire — without any of it being right.
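A minimal sketch of what defining "verified" upfront could look like: each acceptance criterion names the kinds of evidence it requires, and a sprint can only close when every criterion has all of them. The evidence categories and names below are assumptions chosen to mirror the failures in this post, not the project's real schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Evidence(Enum):
    COMPILES = "compiles"
    AUTOMATED_TEST = "automated test"
    PLAY_MODE_RUN = "play mode run"
    VISUAL_CHECK = "visual check"


@dataclass
class Criterion:
    description: str
    required: frozenset                       # evidence that must exist to count as verified
    observed: set = field(default_factory=set)  # evidence actually collected this sprint

    def missing(self):
        """Evidence that was required but never collected."""
        return self.required - self.observed


def can_close(criteria) -> bool:
    """A sprint closes only when no criterion is missing evidence."""
    return all(not c.missing() for c in criteria)


def closing_report(criteria) -> dict:
    """Map each unverified criterion to the evidence it still lacks."""
    return {
        c.description: sorted(e.value for e in c.missing())
        for c in criteria
        if c.missing()
    }


# Hypothetical example: the fix compiled, but nobody ran it in Play Mode.
tutorial_fix = Criterion(
    description="Tutorial bug fixed",
    required=frozenset({Evidence.COMPILES, Evidence.PLAY_MODE_RUN}),
    observed={Evidence.COMPILES},
)
print(can_close([tutorial_fix]))
print(closing_report([tutorial_fix]))
```

Under this rule, a change that only compiles cannot be marked done when the criterion demands a Play Mode run, and the report says which evidence is still outstanding.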
I don't know yet whether this holds at 16 sprints a week. The bottleneck has moved from implementation to review, and I haven't found the new ceiling yet. That's the next thing I'm trying to understand.
Evolution Log
- 2026-04-14 — Initial observation. Sprint 41–48 data. Quality-review silent failure discovered same week — see also "When Tests Pass But Nothing Works."