Verification loops for AI agents
An AI agent's claim is useful only after it is tied to a check the real system can pass.
The AI SDLC needs a tighter default loop:
- state the intended behavior
- change the smallest relevant surface
- observe the actual result
- run the strongest affordable check
- report what passed, what failed, and what remains unknown
This page owns the broad agentic SDLC pattern: how an agent moves from task to evidence across code, runtime, deployment, and operator workflows. The narrower code-review version of the same idea lives in “AI code review needs verification loops.”
That loop sounds obvious. It is still the place where many agentic workflows break.
An agent can produce a patch, explain it well, and still be wrong because the environment, data shape, dependency version, runtime process, or deployment path was different from its assumption. The failure is not always in the code. Sometimes the agent simply stops at the point where a human operator would know the work is only half finished.
“I found the problem” is not done. “I changed the file” is not done. “The fix should work” is not done.
Done starts when the system has been checked against the claim.
The Basic Loop
The useful pattern is:
Plan -> Execute -> Observe -> Verify -> Report.
Plan means the agent states the intended behavior and the surface it expects to touch. Execute means it makes the smallest change that can plausibly move the system toward that behavior. Observe means it reads back the actual state instead of assuming the write worked. Verify means it runs the strongest affordable check for the risk. Report means it names the evidence and the boundary of what that evidence does and does not cover.
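The five steps can be sketched as a small function. This is a minimal illustration, not a prescribed implementation: `LoopReport`, `run_loop`, and the in-memory `system` dictionary are hypothetical names invented here, and a real agent would observe a running service rather than a Python dict.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoopReport:
    intent: str        # Plan: the behavior the agent set out to produce
    observed: object   # Observe: the state read back after the change
    verified: bool     # Verify: whether the check passed
    boundary: str      # Report: what the check does not cover


def run_loop(intent: str,
             execute: Callable[[], None],
             observe: Callable[[], object],
             verify: Callable[[object], bool],
             boundary: str) -> LoopReport:
    execute()                    # Execute: make the smallest change
    observed = observe()         # Observe: read back, do not assume the write worked
    passed = verify(observed)    # Verify: check the observation against the intent
    return LoopReport(intent, observed, passed, boundary)


# Hypothetical in-memory "system" standing in for a real service.
system = {"timeout_s": 30}

report = run_loop(
    intent="timeout_s is 10",
    execute=lambda: system.update(timeout_s=10),
    observe=lambda: system["timeout_s"],
    verify=lambda seen: seen == 10,
    boundary="in-memory value only; no running service was checked",
)
print(report)
```

The `boundary` field is required rather than optional so that a report cannot be built without saying what was left unchecked.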
This turns an agent from a confident generator into an operator.
For example:
- not “I updated the config,” but “the config changed, the service reloaded it, and the endpoint now responds with the expected behavior”
- not “I pushed the commit,” but “the remote contains the commit, the target checkout fast-forwarded, and the runtime uses that checkout”
- not “tests should pass,” but “this command passed, this command was not run, and this risk remains”
- not “the task is closed,” but “the acceptance criteria were checked against these artifacts”
The wording matters because it changes what the human has to do next. A vague final answer gives the human a new investigation. A verification loop gives the human a decision surface.
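As a rough sketch of the first example, the snippet below writes a config file, reads it back from disk, and records which parts of the claim are still unproven. The file name, the `retries` key, and the `write_and_read_back` helper are assumptions made for illustration only.

```python
import json
import tempfile
from pathlib import Path


def write_and_read_back(path: Path, intended: dict) -> dict:
    """Write a JSON config, then read it back from disk and compare."""
    path.write_text(json.dumps(intended))
    on_disk = json.loads(path.read_text())  # readback, not an assumption
    return {
        "matches_intent": on_disk == intended,
        # The readback proves only that the file changed. These remain open:
        "not_verified": ["service reloaded the file", "endpoint behavior"],
    }


with tempfile.TemporaryDirectory() as tmp:
    evidence = write_and_read_back(Path(tmp) / "config.json", {"retries": 3})

print(evidence)
```

The readback covers only the first third of the claim. Reload and endpoint behavior stay listed as unverified until something actually checks them.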
Verification Levels
Not every change deserves the same verification budget. The loop should match the risk.
Static checks are useful for syntax, formatting, schema, and type boundaries. Focused tests are better when behavior changed in a bounded module. Integration checks matter when the change crosses a service, database, external API, queue, runtime config, or deployment rail. Operator or browser smoke tests matter when a user-visible workflow changed.
A copy edit does not need a distributed test harness. A trading, billing, auth, or deployment change should not be closed with a text summary alone.
The agent should be able to explain why a chosen check is enough for this slice and what it does not prove. That last part is important. Verification is not pretending uncertainty is gone. It is reducing uncertainty in a way another engineer can inspect.
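One way to make the budget explicit is to derive the checks from a few properties of the change. The sketch below is an assumption about how such a mapping might look: the `Change` flags and the rule that static checks always run are choices made here, not something the levels above prescribe.

```python
from dataclasses import dataclass
from enum import Enum


class Check(Enum):
    STATIC = "static checks"
    FOCUSED = "focused tests"
    INTEGRATION = "integration checks"
    SMOKE = "operator or browser smoke test"


@dataclass
class Change:
    behavior_changed: bool = False   # behavior changed in a bounded module
    crosses_boundary: bool = False   # service, database, queue, deploy rail, ...
    user_visible: bool = False       # a user-visible workflow changed


def choose_checks(change: Change) -> list[Check]:
    checks = [Check.STATIC]          # assumed baseline for every change
    if change.behavior_changed:
        checks.append(Check.FOCUSED)
    if change.crosses_boundary:
        checks.append(Check.INTEGRATION)
    if change.user_visible:
        checks.append(Check.SMOKE)
    return checks


copy_edit = choose_checks(Change())
billing_fix = choose_checks(Change(behavior_changed=True, crosses_boundary=True))
print([c.value for c in copy_edit])
print([c.value for c in billing_fix])
```

Whatever the mapping returns, the levels it leaves out are the part of the change that remains unproven and belong in the report.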
Why Agents Skip It
Agents skip verification for predictable reasons.
First, language completion rewards plausible closure. “Done” is cheap to say. Second, tool use can create a false sense of progress: editing files, running one command, or reading one log may feel like enough. Third, many prompts ask for the patch but do not ask for the readback. Fourth, humans often accept polished summaries because they are tired and the answer looks coherent.
That is how the human becomes the missing runtime.
The fix is to make verification part of the workflow contract, not a personality trait of a careful agent. The task should ask for evidence. Repo instructions should define expected checks. The final answer should include what was verified and what was not. If a check fails twice, the loop should escalate instead of blindly patching the same assumption.
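The escalation rule can be sketched as a bounded retry. The helper name, the return shape, and the default threshold of two failures mirror the sentence above, but all of them are illustrative assumptions rather than a fixed contract.

```python
from typing import Callable


def verify_with_escalation(apply_fix: Callable[[int], None],
                           check: Callable[[], tuple[bool, str]],
                           max_failures: int = 2) -> dict:
    """Apply a fix and check it; stop and escalate after max_failures failed checks."""
    failures: list[str] = []
    while len(failures) < max_failures:
        apply_fix(len(failures))          # attempt number, starting at 0
        passed, detail = check()
        if passed:
            return {"status": "verified", "failures": failures}
        failures.append(detail)           # keep the evidence of each failed check
    return {"status": "escalate", "failures": failures}


# Hypothetical check that never passes, so the loop must hand off.
stuck = verify_with_escalation(
    apply_fix=lambda attempt: None,
    check=lambda: (False, "endpoint still returns 500"),
)
print(stuck["status"], stuck["failures"])
```

Returning the failure details alongside the status means the human receiving the escalation sees what was already tried.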
A Practical Closeout Shape
A useful agent closeout can be short:
- changed: the files or behavior touched
- verified: the commands, tests, smoke checks, or readbacks that passed
- failed: anything that did not pass and what changed after that
- not verified: the boundary that still needs human or later system proof
- next: the smallest remaining action, if any
This format is intentionally unromantic. It is not there to showcase the agent’s reasoning. It is there to make the delivery state inspectable.
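A closeout like this is easy to represent as structured data. The sketch below is one possible shape: the `Closeout` class, its field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Closeout:
    changed: list[str] = field(default_factory=list)
    verified: list[str] = field(default_factory=list)
    failed: list[str] = field(default_factory=list)
    not_verified: list[str] = field(default_factory=list)
    next_action: str = ""

    def render(self) -> str:
        rows = [
            ("changed", self.changed),
            ("verified", self.verified),
            ("failed", self.failed),
            ("not verified", self.not_verified),
        ]
        # Empty sections are printed as "none" so that nothing is silently omitted.
        lines = [f"{label}: {'; '.join(items) if items else 'none'}"
                 for label, items in rows]
        lines.append(f"next: {self.next_action or 'none'}")
        return "\n".join(lines)


closeout = Closeout(
    changed=["retry.py"],
    verified=["unit tests for retry.py passed"],
    not_verified=["behavior under production load"],
    next_action="run the staging smoke test",
)
text = closeout.render()
print(text)
```

Printing "none" for an empty section is deliberate: an absent "failed" line could mean nothing failed or nothing was recorded, and the explicit value removes that ambiguity.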
Strong agents are not the ones that sound most certain. They are the ones that make uncertainty smaller and leave a proof trail that another engineer can reopen without replaying the whole conversation.
That is the difference between chat-based coding and an agentic SDLC.