Verification loops for AI agents

The AI software development lifecycle (SDLC) needs a tighter default loop:

state the intended behavior
change the smallest relevant surface
observe the actual result
run the strongest affordable check
report what passed, what failed, and what remains unknown

This page covers the broad agentic SDLC pattern: how an agent moves from task to evidence across code, runtime, deployment, and operator workflows. The narrower code-review version of the same idea lives in “AI code review needs verification loops.”

That loop sounds obvious. It is still the place where many agentic workflows break.

An agent can produce a patch, explain it well, and still be wrong because the environment, data shape, dependency version, runtime process, or deployment path differed from what it assumed. The failure is not always in the code. Sometimes the agent stopped at the point where a human operator would know the work was only half finished.

“I found the problem” is not done. “I changed the file” is not done. “The fix should work” is not done.

Done starts when the system has been checked against the claim.

The Basic Loop

The useful pattern is:

Plan -> Execute -> Observe -> Verify -> Report.

Plan means the agent states the intended behavior and the surface it expects to touch. Execute means it makes the smallest change that can plausibly move the system toward that behavior. Observe means it reads back the actual state instead of assuming the write worked. Verify means it runs the strongest affordable check for the risk. Report means it names the evidence and the boundary of what that evidence covers.
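The five steps can be sketched as one small function. This is a minimal illustration, not a reference implementation: the names run_loop and LoopReport are hypothetical, and an in-memory dict stands in for whatever system the agent is actually changing.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LoopReport:
    intended: str
    observed: str = ""
    verified: bool = False
    evidence: list[str] = field(default_factory=list)


def run_loop(
    intended: str,
    execute: Callable[[], None],
    observe: Callable[[], str],
    verify: Callable[[str], bool],
) -> LoopReport:
    """Run one Plan -> Execute -> Observe -> Verify -> Report pass."""
    report = LoopReport(intended=intended)     # Plan: state the intended behavior
    execute()                                  # Execute: make the smallest change
    report.observed = observe()                # Observe: read back the actual state
    report.verified = verify(report.observed)  # Verify: check the state against the claim
    report.evidence.append(
        f"observed={report.observed!r}, verified={report.verified}"
    )
    return report                              # Report: evidence, not a bare claim


# Toy usage: a dict stands in for a real config store (an assumption for the example).
config = {"timeout": 10}
report = run_loop(
    intended="timeout is 30",
    execute=lambda: config.update(timeout=30),
    observe=lambda: str(config["timeout"]),
    verify=lambda seen: seen == "30",
)
print(report)
```

The point of the shape is that verify only ever sees what observe returned, so the check cannot pass on the strength of the write call alone.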

This turns an agent from a confident generator into an operator.

For example:

not “I updated the config,” but “the config changed, the service reloaded it, and the endpoint now responds with the expected behavior”
not “I pushed the commit,” but “the remote contains the commit, the target checkout fast-forwarded, and the runtime uses that checkout”
not “tests should pass,” but “this command passed, this command was not run, and this risk remains”
not “the task is closed,” but “the acceptance criteria were checked against these artifacts”
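The first contrast above, a config write versus a confirmed config state, might look like this in practice. The sketch assumes a JSON config file and a hypothetical helper named write_and_read_back; it proves only what is on disk and says so.

```python
import json
import tempfile
from pathlib import Path


def write_and_read_back(path: Path, settings: dict) -> dict:
    """Write settings as JSON, then re-read the file and report what is actually there."""
    path.write_text(json.dumps(settings, sort_keys=True))
    on_disk = json.loads(path.read_text())  # readback: trust the file, not the write call
    return {
        "claim": f"{path.name} contains the requested settings",
        "matches": on_disk == settings,
        # The boundary is stated explicitly: a file check says nothing about the runtime.
        "not_checked": "whether the running service reloaded the file",
    }


# Toy usage with a temporary directory; the file name is illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    result = write_and_read_back(Path(tmp) / "service.json", {"timeout": 30})
print(result)
```

A fuller version would follow the readback with a reload check and an endpoint probe, which are left out here because they depend on the service in question.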

The wording matters because it changes what the human has to do next. A vague final answer gives the human a new investigation. A verification loop gives the human a decision surface.

Verification Levels

Not every change deserves the same verification budget. The loop should match the risk.

Static checks are useful for syntax, formatting, schema, and type boundaries. Focused tests are better when behavior changed in a bounded module. Integration checks matter when the change crosses a service, database, external API, queue, runtime config, or deployment rail. Operator or browser smoke tests matter when a user-visible workflow changed.

A copy edit does not need a distributed test harness. A trading, billing, auth, or deployment change should not be closed with a text summary alone.

The agent should be able to explain why a chosen check is enough for this slice and what it does not prove. That last part is important. Verification is not pretending uncertainty is gone. It is reducing uncertainty in a way another engineer can inspect.
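One way to make the budget explicit is a lookup from touched surface to minimum check level. The sketch below is an assumption-heavy illustration: the surface labels, the strict ordering of levels, and the choice to treat unknown surfaces as integration-level are all choices made for the example, not rules from any particular toolchain.

```python
from enum import IntEnum


class Check(IntEnum):
    STATIC = 1       # syntax, formatting, schema, type boundaries
    FOCUSED = 2      # behavior changed in a bounded module
    INTEGRATION = 3  # crosses a service, database, API, queue, config, or deploy path
    SMOKE = 4        # a user-visible workflow changed


# Hypothetical surface labels; a real repo would define its own.
SURFACE_TO_CHECK = {
    "copy": Check.STATIC,
    "module": Check.FOCUSED,
    "database": Check.INTEGRATION,
    "billing": Check.INTEGRATION,
    "auth": Check.INTEGRATION,
    "deployment": Check.INTEGRATION,
    "ui_workflow": Check.SMOKE,
}


def strongest_required(surfaces: list[str]) -> Check:
    """Return the highest check level demanded by any touched surface."""
    # Unknown surfaces default to INTEGRATION: an unclassified change is treated as risky.
    return max(SURFACE_TO_CHECK.get(s, Check.INTEGRATION) for s in surfaces)


print(strongest_required(["copy"]).name)             # STATIC
print(strongest_required(["copy", "billing"]).name)  # INTEGRATION
```

The function expects at least one surface and raises ValueError on an empty list, which is arguably the right behavior: a change that touches nothing has nothing to verify.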

Why Agents Skip It

Agents skip verification for predictable reasons.

First, language completion rewards plausible closure. “Done” is cheap to say. Second, tool use can create a false sense of progress: editing files, running one command, or reading one log may feel like enough. Third, many prompts ask for the patch but do not ask for the readback. Fourth, humans often accept polished summaries because they are tired and the answer looks coherent.

That is how the human becomes the missing runtime.

The fix is to make verification part of the workflow contract, not a personality trait of a careful agent. The task should ask for evidence. Repo instructions should define expected checks. The final answer should include what was verified and what was not. If a check fails twice, the loop should escalate instead of blindly patching the same assumption.
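The escalation rule can be written down as a small retry wrapper. This is a sketch under stated assumptions: verify_with_escalation and Escalation are hypothetical names, and the threshold of two failures simply mirrors the rule in the paragraph above.

```python
from typing import Callable


class Escalation(Exception):
    """Raised when the same check keeps failing and a human should look."""


def verify_with_escalation(
    check: Callable[[], bool],
    patch: Callable[[int], None],
    max_failures: int = 2,
) -> int:
    """Run check; on failure apply patch and retry. Escalate after max_failures failures.

    Returns the number of failed attempts before the check passed.
    """
    failures = 0
    while True:
        if check():
            return failures
        failures += 1
        if failures >= max_failures:
            # Stop patching: the underlying assumption is probably wrong.
            raise Escalation(f"check failed {failures} times; assumption needs review")
        patch(failures)


# Toy usage: the check passes only after one patch has been applied.
state = {"fixed": False}
failed = verify_with_escalation(
    check=lambda: state["fixed"],
    patch=lambda attempt: state.update(fixed=True),
)
print(failed)  # 1
```

Raising an exception rather than returning False keeps the escalation from being silently folded into a summary.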

A Practical Closeout Shape

A useful agent closeout can be short:

changed: the files or behavior touched
verified: the commands, tests, smoke checks, or readbacks that passed
failed: anything that did not pass and what changed after that
not verified: the boundary that still needs human or later system proof
next: the smallest remaining action, if any

This format is intentionally unromantic. It is not there to showcase the agent’s reasoning. It is there to make the delivery state inspectable.
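The same five fields can be held in a small record so that a closeout is produced the same way every time. The class name Closeout, the field next_step, and the sample values are hypothetical, chosen only to illustrate the shape.

```python
from dataclasses import dataclass, field


@dataclass
class Closeout:
    changed: list[str]
    verified: list[str] = field(default_factory=list)
    failed: list[str] = field(default_factory=list)
    not_verified: list[str] = field(default_factory=list)
    next_step: str = ""

    def render(self) -> str:
        """Render the five fields as plain text, writing 'none' for empty ones."""
        rows = [
            ("changed", self.changed),
            ("verified", self.verified),
            ("failed", self.failed),
            ("not verified", self.not_verified),
            ("next", [self.next_step] if self.next_step else []),
        ]
        return "\n".join(
            f"{label}: {'; '.join(items) or 'none'}" for label, items in rows
        )


# Illustrative values only; nothing here describes a real change.
closeout = Closeout(
    changed=["retry timeout in client config"],
    verified=["unit tests for the client module passed"],
    not_verified=["behavior under production load"],
    next_step="run the staging smoke test",
)
print(closeout.render())
```

Empty fields render as "none" rather than being dropped, so a reader can tell the difference between "nothing failed" and "failures were not reported."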

Strong agents are not the ones that sound most certain. They are the ones that make uncertainty smaller and leave a proof trail that another engineer can reopen without replaying the whole conversation.

That is the difference between chat-based coding and an agentic SDLC.