Automated Verification Is More Important Than Ever

I am afraid of accepting AI-generated code without reviewing it. Why? I suspect this fear stems from an inability to automatically verify the correctness of the code’s behaviour. Style and obvious errors can be tested with linters and unit tests, but there is a host of advanced failure modes that are much harder to capture: race conditions¹, memory leaks and subtle performance regressions. These types of issues are why I am compelled to manually review code.

I picture a spectrum with eyes-closed vibe coding at one extreme and carefully curated handwritten code at the other. Similarly, I think of a testing spectrum with no automated tests on one end and elaborate multi-stage automated tests on the other. I am starting to understand that these spectra are complementary and that my willingness to loosen my grip on the proverbial reins is strongly tied to my faith in the test suite. If I can automatically test for every desired behaviour of the code, then I have no more reason to review the implementation produced by my agent.

Such a setup is utopian. There are simply too many ways for a system to behave to test for all of them. Additionally, the effort required to test error scenarios increases steeply as they get more exotic; automatically testing for memory leaks is much more difficult than testing a bad input to a function. Even so, the investment in better tests has always been worth it, in terms of both reliability and productivity, and the productivity aspect is now even more pronounced. Code can be produced near-instantaneously and, as many teams are experiencing at the moment, manually verifying its correctness is becoming the primary bottleneck.
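To make that difference in effort concrete, here is a minimal Python sketch. Everything in it is hypothetical and invented for illustration: the Session class, the two handlers and the module-level cache. The bad-input check is a single call with a single expected exception. The leak check has to inspect the whole process, and this particular approach (counting live objects through the garbage collector) only catches one narrow kind of leak.

```python
import gc


class Session:
    """Hypothetical per-request resource that should be freed after use."""

    def __init__(self, user):
        if not user:
            raise ValueError("user must not be empty")
        self.user = user


_cache = []  # hypothetical module-level cache; this is where the leak hides


def handle_request(user):
    """Well-behaved handler: the session is dropped when the call returns."""
    session = Session(user)
    return f"hello {session.user}"


def handle_request_leaky(user):
    """Same visible behaviour, but every session is kept alive forever."""
    session = Session(user)
    _cache.append(session)  # bug: sessions are stored and never evicted
    return f"hello {session.user}"


def rejects_bad_input(handler):
    """The easy kind of test: one call, one expected exception."""
    try:
        handler("")
    except ValueError:
        return True
    return False


def live_sessions():
    """Count the Session objects that are still reachable in this process."""
    gc.collect()
    return sum(1 for obj in gc.get_objects() if type(obj) is Session)


def session_growth(handler, calls=100):
    """The harder kind of test: how many sessions outlive their requests?"""
    before = live_sessions()
    for i in range(calls):
        handler(f"user-{i}")
    return live_sessions() - before


if __name__ == "__main__":
    print(rejects_bad_input(handle_request))        # True
    print(rejects_bad_input(handle_request_leaky))  # True
    print(session_growth(handle_request))           # 0
    print(session_growth(handle_request_leaky))     # 100
```

Both handlers pass the bad-input test, so a suite that stops there cannot tell them apart. Only the growth check exposes the leaky one, and it would still miss leaks in native memory, file handles or anything else that is not a Session.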

Agentic engineering has also opened the door for something that was previously not possible: crafting code in a language that I don’t understand and thus cannot review. This is a development I particularly enjoy, and I now frequently contribute to foreign codebases. In these cases, I can rely only on the automatic verification in the repository, unless I want to request a review from a colleague².

Code quality is only one aspect of a review. Even with the utopian test suite, there may still be valid reasons to review code. Maintaining one’s mental model of the codebase is one, though I think the argument for this is becoming weaker as agents can construct a literal visual model of most codebases in minutes. Stopping undesirable features early is another reason. Agents are great at many things, but they still rarely tell the operator, “Hey, I don’t think we actually need this at all”. Decision record discipline mostly solves this: writing down intent allows team members to inspect and acknowledge the plan before implementation begins. Finally, reviews are an excellent place to share knowledge and teach one another, but this applies only when a human is at the other end. I am always happy to take a look at a fellow human’s work³.

In summary, I think teams that can automatically verify the complete correctness of their code will have an even larger velocity advantage than they already did, now that code can be produced at a pace humans cannot keep up with. From now on, when I review code produced by my agent, I will try to ask myself, “What are the reasons I must review this, and can I remove any of them?”

¹ Go’s race detector is a wonderful tool that can catch many race conditions as tests run.
² Requesting a colleague’s review remains an uncontroversial thing to do, but doesn’t aid in the quest to rid ourselves of human bottlenecks.
³ And I will certainly not have an agent generate a review and then pass its comments off as my own. This is a massive faux pas, and it appals me whenever I see it.