Learning to trust the output without trusting the model: a story in three acts.

My trust in the models hasn't changed much; what changed is how I learned to trust their output, one verified result at a time. Over two years that played out in three acts — advisor, collaborator, workforce — each a new level of responsibility I handed to AI only after the results earned it.

Advisor: an expert I consulted but kept away from the code. Collaborator: handed entire features, but I reviewed every line. Workforce: directed at scale, executing under systems I built to keep it honest. The chart below shows when each shift happened; what follows is how I made each one without compromising quality.

/ 01
the arc

When each act happened.

January 2024 to May 2026 — human-authored in warm gray, AI-coauthored in amber. The data shows lines of source code changed per month. The arc, not the volume, is the point.

A word on the numbers up front. The measure I’d actually want is tasks shipped and verified, but I don’t have that data for the full two years. Lines of code is the most honest fallback I can show, and it’s a vanity metric; every senior engineer knows it. So read the bars as a rough shape, not a scoreboard. What the shape shows reliably isn’t how much I typed; it’s when the way I work changed, and that each shift produced more real output than the last.

/ 02
advisor

Act I

AI as an Advisor. January 2024 → July 2025

The first AI subscription that mattered for me was JetBrains AI Pro, running inside Rider. It could read whatever I pasted in and answer whatever I asked. That was the entire surface, and that was the point: I never found much value in letting it write code directly. It started with a build error: explain this, give me the most likely fix. I would try the answer and watch it work or fail. I did that well over a thousand times, correcting its hallucinations as I went: no, that is not a property of that class; no, that method does not exist. Its job was to compress my learning curve, not to take work off my plate, and it never wrote a line of what shipped. The verifying never stopped being necessary, and it never will. What changed across those thousand questions was how I asked them: I learned to ask for answers I could check in seconds. The work got faster because I got better at checking, not because I started to trust.

~1,400 messages I put to the IDE’s AI assistant over 20 months

0 lines it was allowed to write

/ 03
collaborator

Act II

AI as a Collaborator. August 2025 → December 2025

Claude Code changed what I’d hand over, not because it could finally access my file system, but because it could systematically ingest the context needed to execute complex tasks correctly, using files I built up on disk. Right before my first experiments, I cheekily wrote one of my last manually entered commit messages: “just making a checkpoint before I go all vibe coder!” Honestly, the first 48 hours were rough. Right out of the gate, a code-generation bug quietly stripped twelve of sixteen components down to stubs in my engine. But each failure taught me the method: every mistake it made became an entry in a context document. You would never expect a new hire to be productive without first showing them how the work needs to be done, and AI proved to be no different. Once the provided context was tight, the output became consistent and I could move it on to larger-scoped tasks. I still reviewed every line, but reviewing generally good code is faster than writing great code. The persistent-entity system for my MECS engine, built and verified end-to-end in a day, was the first complete feature that didn’t require significant changes from me. Suddenly, I wasn’t rate-limited by what I could write. I was rate-limited by what I could review.

The share of code I let it author went from zero to the majority in a single month — and stayed, because the review discipline held.

/ 04
workforce

Act III

AI as a Workforce. January 2026 → Present

Once I adopted tools that introduced a framework for formally validating work, the throughput became something I couldn’t read fast enough. In February the engine ran hot: a playable game client on screen in under an hour, more than a thousand platform commits in a month. Then March nearly broke it: more code than I could review, regressions slipping through, drift at the edges. I’d raised the ceiling without building the floor. Code velocity kept increasing while the business value it created started to fall. Rather than slow everything down so I could review each commit, I started building systems that could validate as much of the code as possible for me: compile-time diagnostics that make the compiler refuse malformed work, aggressive linting, warnings treated as errors, a test-and-CI harness with as much code coverage as practical, cross-vendor audits where one model checks another’s work, and two in-house agents that gate every task before it reaches me. Most of these systems have been essential for teams operating at scale for years. Thanks to AI, they are now practical for a solo operator to build and maintain. Today I review the test outcomes, not the diffs, which frees me to focus on the work at a higher level without sacrificing code quality. When I open-sourced wood-fired-tasks, I’d personally read about 3% of it. What I did review was the tooling that guaranteed I no longer needed to read every line. I invite you to see for yourself.
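For readers who think better in code, here is a rough sketch of the gating idea: a task passes through an ordered chain of automated checks and is rejected at the first failure, so only clean work reaches a human. This is an illustration under stated assumptions, not the actual tooling. The `Task` fields, the `Check` type, and the `gate` and `triage` functions are all hypothetical names invented for this example, and the three checks stand in for the fuller set of layers listed above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence, Tuple


@dataclass
class Task:
    """Hypothetical unit of agent work; every field is illustrative only."""
    name: str
    compiler_warnings: int
    lint_errors: int
    tests_failed: int


@dataclass
class Check:
    """One validation layer: a label plus a predicate that must hold."""
    label: str
    passes: Callable[[Task], bool]


# Ordered cheapest-first. These stand in for the layers named in the text;
# real checks would invoke a compiler, a linter, and a test runner.
CHECKS: Sequence[Check] = (
    Check("compile (warnings as errors)", lambda t: t.compiler_warnings == 0),
    Check("lint", lambda t: t.lint_errors == 0),
    Check("tests", lambda t: t.tests_failed == 0),
)


def gate(task: Task, checks: Sequence[Check] = CHECKS) -> Optional[str]:
    """Return the label of the first failing check, or None if all pass."""
    for check in checks:
        if not check.passes(task):
            return check.label
    return None


def triage(tasks: Sequence[Task]) -> Tuple[List[str], Dict[str, str]]:
    """Split tasks into accepted names and a map of rejected name -> failed check."""
    accepted: List[str] = []
    rejected: Dict[str, str] = {}
    for task in tasks:
        failed = gate(task)
        if failed is None:
            accepted.append(task.name)
        else:
            rejected[task.name] = failed
    return accepted, rejected
```

The design choice worth noticing is that the human reviews the outcome of `triage`, not the diffs behind each task: a rejection names the layer that refused the work, and an acceptance means every layer agreed.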

Same trust, far more work — because the checking became automatic.

With AI, the bottleneck moves from writing, to reviewing, to governing.

Where you are on that arc is the first thing I’d want to know about your team, so that I can help you get to the next stage.

Next steps →

/ 05
the asymmetry

AI is not equally good in every domain of software development.

AI is only as good as the code the world has already written for it to learn from, and that corpus is wildly uneven. Two independent records from my own work draw the exact same line. In the advisor years, the engine core was nearly 40% of everything I built — and the domain I asked the AI about least; I leaned on it hardest for the commodity backend and build work that was a tiny fraction of the actual code. Two years later, when AI could finally author code, the same split held: it co-authored 95% of my tooling and 59% of my backend, but only 44% of the engine internals. Where the work is proprietary game-tech, the judgment stays mine — both because I don’t need the help and because the help isn’t there. That’s the opposite of what most AI-adoption advice assumes, and it’s exactly where a studio gets burned letting the tool run unsupervised in its core domain.

In the advisor years: I built the most where I asked the least.

Two years later, when AI could author: the same line, from the other side.

/ 06
where to start

Bring me in early.

The studios that get the most from me pull me in before the AI mess compounds, not after.

The tracks

Three named problems studios bring me — trustworthy agent output, AI telemetry & governance, and agent orchestration. Find the one that sounds like your week; we scope the depth on a call.

/tracks →

The writing

I document the practice as I build it — what works, what breaks, and what I’d tell your team to do differently.

/writing →

The tools

The three layers that keep the practice honest — open-source orchestration, vendor-neutral telemetry, and the modification layer wired to my repos.

/tools →