How I stopped reading the code.
Code review was supposed to be the non-negotiable. When I open-sourced wood-fired-tasks, I had pre-read 3.3% of the AI-written files in it — and none of the security-critical ones. The replacement for line-by-line review isn't trust. It's overlapping systems that grade each other — tests, hooks, CI, verifier agents, second-vendor audits — improved with the same telemetry discipline game developers use to tune their mechanics.
There is a question that lands on anyone shipping AI-written code, usually delivered with a raised eyebrow: do you actually read what it writes? For most of my career the question would not even have parsed. I came up reading diffs. Reviewing code before it landed was not a virtue, it was the job. If you had told me in 2020 that I would publish a repository and ask strangers to install it without having read most of it, I would have called that malpractice, and I would have been right to.
In late May I open-sourced wood-fired-tasks, the coordination layer for my fleet of coding agents — one shared backlog that any number of agents can drain in parallel without stepping on each other, where every closed task is graded by an independent agent before the close sticks. It has been the center of my workflow for nearly four months, and because I was about to ask other people to run it, the eyebrow question stopped being rhetorical. I instrumented my own usage and measured.
Of the 487 source files in the released tree, 16 were read by my primary session between the moment an AI wrote them and the moment they entered the commit that introduced them. That is 3.29%. I am not telling you this as a flex — a serious engineer does not brag about unread code. The number measures where my attention actually went, and the interesting part is which 16: the smoke tests, the lint and dependency configs, the architecture doc, a couple of small helpers. More than half their combined line count is four CLI test harnesses. The auth surface was not on the list — the 439-line plugin holding the entire session and token plumbing, the device-code route, the credential store, the secret-scan workflow. None pre-read. The security-critical code shipped unread.
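The metric itself is simple to state in code. What follows is a minimal sketch, not the instrumentation behind this post: the `FileEvents` shape and its field names are assumptions made up for illustration.

```typescript
// Hypothetical event shape; the field names are assumptions for illustration.
interface FileEvents {
  path: string;
  writtenAt: number;   // ms timestamp when the AI wrote the file
  committedAt: number; // ms timestamp of the commit that introduced it
  readAt: number[];    // ms timestamps of reads by the primary session
}

// A file counts as pre-read if any read falls between the write and the commit.
function isPreRead(f: FileEvents): boolean {
  return f.readAt.some((t) => t >= f.writtenAt && t <= f.committedAt);
}

// Percentage of files that were pre-read, rounded to two decimals.
function preReadPercent(files: FileEvents[]): number {
  if (files.length === 0) return 0;
  const read = files.filter(isPreRead).length;
  return Math.round((read / files.length) * 10000) / 100;
}
```

Under this definition, 16 pre-read files out of 487 gives 3.29. A read that happens after the introducing commit does not count, which is the point of the measurement.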
What I read was the harness that proves the code works. The tests, not the implementation. The proof, not the code. And that is the actual thesis: the replacement for line-by-line review is not trust. It is overlapping systems that each verify what another produced — tests, hooks, CI, independent grader agents, second-vendor audits — arranged so that no single layer, and no single person, has to be right. The rest of this post is what those layers look like and how I know they work.
The loop I already knew
I have been a working game developer since September 2000, and my model of review was the one every senior engineer learns: read the diff, reason about the edges, ask the questions the author didn’t. When I started using AI agents to generate code in late 2025 I applied it unchanged. It lasted a few weeks. A single morning of agent output runs to tens of thousands of lines; you can read a junior’s daily output, but you cannot read an agent’s. If you try, you are the bottleneck, and you are looking at the wrong things — surface grammar instead of structural integrity.
The recognition that reorganized my work came from player telemetry. In games, telemetry lets you observe the behavior of thousands of players at once, see the patterns in how they actually play, and improve the mechanics until players experience the game the way you envisioned it. I have run that loop since the industry adopted telemetry at scale in the early 2010s. It took me longer than I am proud of to see that the same discipline applies to agents: observe their behavior at scale, see the patterns, and improve the workflows until the agents reliably follow the guardrails you have built to ensure quality. Agents are not players — what transfers is the method, not the metaphor. You cannot improve what you cannot measure. So I built the telemetry: every agent session, every tool call, every model call, every subagent spawn, landing in one analytics database. That stack is also what wrote this post. The numbers here come from the same instrumentation the post is about.
What the telemetry shows
The agent telemetry begins on April 27th, the day the capture layer went in — six weeks of data as this post goes live. It shows the destination, not the journey; the longer evolution predates the instrumentation and lives in git history.
The clearest signal is delegation. Subagent spawns per prompt rose from 0.15 to above 1.0 — I now spawn more subagents than I type prompts. In the wood-fired-tasks repo, 86% of the files an AI touched were written entirely by subagents I never directly drove. One subagent does the work, another grades it against the acceptance criteria, a third checks integration. I read the verdicts, not the diffs.
On release day my session fired 22 independent tasks-verifier passes and one integration-auditor pass — the cleanest single corroboration that shipping was gated by automated grading rather than by reading. Both agents are mine; they ship inside wood-fired-tasks, and they are read-only by design — the verifier cannot write code, only judge it. What looked like a ship-day burst became the baseline: 47 to 65 verifier passes every week since. Every grader that fires now is one I built.
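The gating rule can be sketched in a few lines. This is an illustration under stated assumptions: the `Task`, `Verdict`, and `Verifier` types and the status names are hypothetical, and the real wood-fired-tasks schema may look nothing like them.

```typescript
// Hypothetical types; the real wood-fired-tasks schema is not shown in this post.
type Verdict = { pass: boolean; reason: string };
type Task = {
  id: string;
  acceptance: string[];
  status: "open" | "done" | "verified" | "reopened";
};

// A verifier receives a read-only view of the task and returns a verdict.
// It has no write path: it can judge the work, not change it.
type Verifier = (task: Readonly<Task>, diff: string) => Verdict;

// Closing a task only sticks when the independent verifier passes it.
function closeTask(task: Task, diff: string, verify: Verifier): Task {
  const verdict = verify(task, diff);
  return { ...task, status: verdict.pass ? "verified" : "reopened" };
}
```

The design choice worth noticing is that the worker never sets `verified` itself. Only the verdict of a separate grader can move a task into that state.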
That last sentence marks a shift that was measured, not felt. Until late May my big batches ran on get-shit-done, a third-party orchestration framework I used daily and modified heavily but did not write. I had a theory that my own task system could do the same job, so I measured both. Per token generated they cost about the same; structurally they were not close. A GSD milestone is one long-lived thread re-reading an enormous accumulated context across days of phase-gated execution. A task-DAG run fans the same work into short-lived parallel subagents, each graded independently — a scoped batch lands five to twelve times cheaper and an order of magnitude faster, ten-plus verified tasks an hour instead of a phase a day. I switched completely and retired the framework. That is the same telemetry loop pointed at the workflow itself — instrument, compare, adopt the better mechanic, measure again — and it is why speed and reliability keep rising together: the verification is built into the unit of work.
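The fan-out in a task DAG comes from one small rule: anything whose dependencies are already verified can start now, in parallel. A rough sketch of that rule follows, with a hypothetical `DagTask` shape that is an assumption for illustration rather than the project's actual data model.

```typescript
// Hypothetical task shape for illustration only.
interface DagTask {
  id: string;
  dependsOn: string[];
  state: "pending" | "running" | "verified";
}

// Tasks that can start now: pending, with every dependency already verified.
function readyTasks(tasks: DagTask[]): DagTask[] {
  const verified = new Set(
    tasks.filter((t) => t.state === "verified").map((t) => t.id),
  );
  return tasks.filter(
    (t) => t.state === "pending" && t.dependsOn.every((d) => verified.has(d)),
  );
}
```

Each task returned here could be handed to its own short-lived subagent. Because readiness depends on `verified` rather than merely finished, downstream work never builds on ungraded output.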
Layers that grade each other
Here is the piece I will state plainly because it is the uncomfortable one: I do not know how to manually configure a CI/CD pipeline. Every workflow on every repo I own was written by Claude — the acceptance tests, the mutation tests, the secret scan that rejects PRs containing API tokens, the weekly CodeQL job. I asked an agent to set them up and verified by whether the green check came back when I pushed. The repo now holds 88.1% line coverage, CI-enforced at 85% with a hard fail gate, and 2,640 tests — scaffolding that went up in a 48-hour burst the weekend it shipped.
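A hard-fail coverage gate amounts to a single comparison. The sketch below is an assumption about its shape, not the workflow Claude wrote for the repo: `coverageGate` is a hypothetical helper, and the percentage would come from whatever summary the coverage tool emits.

```typescript
// Hypothetical gate; `linePct` would come from the coverage tool's summary.
const COVERAGE_FLOOR = 85; // the CI-enforced threshold named in the post

function coverageGate(
  linePct: number,
  floor: number = COVERAGE_FLOOR,
): { ok: boolean; message: string } {
  const ok = linePct >= floor;
  const message = ok
    ? `coverage ${linePct.toFixed(1)}% meets the ${floor}% floor`
    : `coverage ${linePct.toFixed(1)}% is below the ${floor}% floor`;
  return { ok, message };
}
```

In a CI step, a result with `ok` set to false would translate into a non-zero exit code, which is what turns the threshold from a report into a gate.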
So the validation infrastructure that grades the AI-written code is itself AI-written code I never typed. That recursion is why “trust” is the wrong frame. If I were trusting, I would have to trust both layers at once. What makes it tractable is that the layers grade each other: the code has to pass the tests, and a test failure is a fact about the system regardless of who wrote either side. I am not trusting. I am checking, and the checking is automatic.
Some of the checking does not wait for a test run. The hard rules — never amend a commit, never let a secret into a push, never touch a protected branch — do not live in my context files. They live in hooks: small deterministic scripts that intercept an agent’s action before it executes and refuse anything policy forbids. A rule written in prose is a request, and a request competes with everything in the model’s training pulling the other way — twenty thousand tokens into a session it can simply lose the thread. A hook cannot be talked out of its job. Last week I made this argument about compilers; hooks are the same move one layer up the stack. When the rule is hard, stop describing it and start enforcing it.
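To make the idea concrete, here is a minimal sketch of such a hook. Everything in it is an assumption for illustration: the `checkCommand` function, the branch names, and the patterns are hypothetical, and it covers only two of the three rules named above.

```typescript
// Hypothetical hook: takes the shell command an agent wants to run and
// returns a decision. The patterns are illustrative, not the author's rules.
type Decision = { allow: true } | { allow: false; reason: string };

const PROTECTED_BRANCHES = ["main", "release"]; // assumed branch names

function checkCommand(command: string): Decision {
  // Rule: never amend a commit.
  if (/\bgit\s+commit\b.*--amend\b/.test(command)) {
    return { allow: false, reason: "amending commits is forbidden" };
  }
  // Rule: never push to a protected branch.
  for (const branch of PROTECTED_BRANCHES) {
    const push = new RegExp(`\\bgit\\s+push\\b.*\\b${branch}\\b`);
    if (push.test(command)) {
      return { allow: false, reason: `pushing to ${branch} is forbidden` };
    }
  }
  return { allow: true };
}
```

The matching here is deliberately crude; a production hook would parse arguments rather than pattern-match the raw string. What matters is the shape: a deterministic function that runs before the action and returns a refusal the model cannot negotiate with.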
For most of my career I did not work this way. Between 2000 and 2023, less than ten percent of what I shipped had automated coverage worth speaking of, because the build-out never justified its cost. That changed when the cost dropped an order of magnitude. The discipline is not new. What is new is that the infrastructure costs an afternoon instead of a quarter. The gate is no longer effort. It is whether you have the reflex.
Two vendors, one artifact
The numbers in this post came from research agents given one instruction: test the thesis, do not confirm it. The first pass confidently reported that my cross-vendor audit pattern — Codex and Opus grading the same repo as independent reviewers — was not supported: within any single session, the two never co-occur. The finding was wrong. The audits run in separate CLI sessions, with the artifact passing between them through the task system; the query looked inside one window when the pattern crossed three. A second pass against the wider evidence — the Codex session logs on disk, the Claude prompt history, the production task database — surfaced the whole chain and corrected the draft before you read it. Overlapping evidence sources caught the error a single source produced.
The chain itself is the cleanest demonstration of the practice I have. A Codex session in late May ran a security audit of the repo and filed its remediation work straight into the production task tracker. Another ran an open-source-readiness audit. A third produced a seven-phase code-quality roadmap, and the prompt I gave it says the quiet part out loud:
“This codebase is something that I have never read and I did not personally write a single line of it. I’d like you to put together a plan to evaluate the code quality against the highest standards of the typescript community for similar systems and produce a comprehensive guide to significantly raising the code quality across the repo with a targeted roadmap that you enter as tasks in a new project targeting the production db of the wood-fired-bugs system.”
(wood-fired-bugs was the repo’s name before that week’s rename — same project.) A follow-on session turned the plan into nine production tasks, each stamped created_by: codex. All nine were implemented by Claude Opus over the next two days. Codex planned. Opus executed. I did neither.
On release day I ran the pattern in reverse. An hour after the version bump I directed Opus to run the final pre-launch audit — security, documentation accuracy, installer safety, license and CI hygiene. Twenty minutes after its findings came back I pasted in a fresh Codex audit of the same repo and had Opus merge the two views into twenty hardening tasks — security headers, error-handler tightening, CI permission scoping, connection caps — then run the autonomous loop over them. All twenty closed by the next morning. Two frontier models from two different vendors graded the same artifact independently, twice. I read the audits as text; I never read the diffs.
This week, that division of labor became a feature. Fable 5 debuted yesterday; within a day it was the majority model in my telemetry, and its cost made one thing obvious — you do not want your most expensive model doing bulk execution. wood-fired-tasks now ships configurable task models: a different model for each role in the pipeline — planning, execution, validation — the frontier model grading while cheaper models do the volume. The first backlog I drained that way made the case in dollars: execution on a mid-tier model with frontier-model verification on top cost about a third of frontier-everything per task, and every task still passed its verifier on the first try. The savings are structural — workers burn the overwhelming majority of the tokens, verification is a sliver — and the harness is what makes the cheap worker safe. The grading does not just protect quality; it is what unlocks the economics. The chain above, productized.
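The economics follow from a per-role cost sum. The sketch below assumes a hypothetical configuration shape; the type names and the `usdPerMTok` field are invented for illustration, and any model names or prices plugged into it are placeholders rather than measured data.

```typescript
// Hypothetical config shape; model names and prices are placeholders.
type Role = "planning" | "execution" | "validation";
type ModelConfig = Record<Role, { model: string; usdPerMTok: number }>;
type TokenUsage = Record<Role, number>; // tokens consumed per role for one task

// Cost of one task: each role's tokens priced at that role's model rate.
function taskCostUsd(config: ModelConfig, usage: TokenUsage): number {
  const roles: Role[] = ["planning", "execution", "validation"];
  return roles.reduce(
    (sum, role) => sum + (usage[role] / 1e6) * config[role].usdPerMTok,
    0,
  );
}
```

Because execution dominates the token count, swapping only the execution model moves most of the total, while the frontier model keeps the small validation slice.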
My industry does not trust this, and the distrust is rational
This year’s State of the Game Industry survey (2,300-plus respondents) found 36% of game industry professionals using generative AI, just 30% inside studios proper, while 52% now say the technology is having a negative impact on the industry. That negative share was 30% a year ago and 18% the year before. Programmers are among the most negative groups, at 59%.
I do not think those colleagues are wrong. Distrust is the correct response to unvalidated output, and most AI output in the wild is unvalidated. The two standard responses are both losing positions: trust less, which leaves the velocity on the table, or trust more, which is hope wearing a process costume. The third option is the one this post has been describing — overlapping systems that do not require trust. Write the test that grades the behavior. Wire the hook that refuses the forbidden action before it executes. Run the verifier that grades the acceptance criterion. Configure the CI that grades the integration, and a second vendor’s audit to grade the first’s. Instrument every layer, including the one that produces the post about the loop, so that when something breaks — or merely over-claims — you find out before it ships.
Trust is a feeling. Validation engineering is a system. The first scales with attention, the second with infrastructure, and the 3.3% pre-read figure is just the measurement of where my attention lives now that the infrastructure carries the rest. I read the proof.
Then I shipped.
The open question I am instrumenting next: are my graders catching the failures I would have caught by reading? I will find out the way I always have — add the telemetry, watch it for a few weeks, adjust. The loop has not changed in fifteen years. Only the thing under observation has.
Companion post: the launch announcement for wood-fired-tasks, the repo this post measures.