The GAN Pattern for Agents: Why Anthropic Stopped Letting Models Grade Their Own Work

Self-evaluation is a trap. I’ve written about harness engineering as the fix for agent failures, covering how the environment around the model matters more than the model itself. But one specific failure mode kept showing up that the five-layer framework doesn’t fully address: models grading their own work.

I’ve watched my own agents declare features “done” when the button didn’t do anything. I’ve seen research agents mark sources as “verified” without clicking a single link. Every time, the model reviewed its own output with the same brain that produced it, found nothing wrong, and moved on.

At AI Engineer, Anthropic’s Ash Prabaker and Andrew Wilson shared the internal harness they use for multi-hour agent runs. The core insight borrows from GANs (generative adversarial networks): split the builder and the critic into separate agents with separate context windows, and make them argue before any work begins.

The results are stark. Same prompt (“build a retro game maker”), same model, same cost ceiling. Without the harness: the app looked complete but arrow keys did nothing and play mode was broken. With the harness: 6 hours, $200, a fully playable game with physics, collision detection, a 54-color sprite editor, and an AI assistant the planner invented on its own.

The difference was entirely scaffolding.

Why Self-Evaluation Fails (Even With Good Models)

The intuition is simple. Tuning a standalone critic to be harsh is tractable. Tuning a builder to be self-critical is not.

Ash used an analogy that stuck: it’s easy for anyone to critique a fine meal. It’s much harder to cook one. LLMs have the same asymmetry. The same model that can’t reliably judge its own output becomes a ruthless quality gate when you strip away the generator context, give it a fresh window, and tell it to break things.

Anthropic found that out of the box, Claude is a “really, really bad QA agent.” The sycophancy bias everyone hits with LLM-as-judge systems shows up here too. Early evaluator runs would find a bug, note “fix it later, might take 2 weeks,” and move on. Making the evaluator useful required the same muscle as tuning any eval: reading traces, finding where the model’s judgment diverged from yours, and updating the prompt.

The key constraint: the evaluator never sees the generator’s reasoning traces. It only judges the output. They tried sharing context between the two and found it “muddied thoughts.” When the evaluator knows the generator’s rationale, it becomes easier for both to convince themselves something works. Keep them separate. Let the evaluator say “this is broken” and force the generator to figure out why on its own.
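A minimal sketch of that constraint, assuming a Python harness. The class and field names (`GeneratorRun`, `EvaluatorInput`, `reasoning_trace`) and the localhost URL are hypothetical, not Anthropic's. The point is that the handoff type has no slot for the generator's reasoning, so it cannot leak across by accident.

```python
from dataclasses import dataclass, field


@dataclass
class GeneratorRun:
    """Everything the generator produced, including its private reasoning."""
    artifact_url: str                                   # where the built app is served
    reasoning_trace: list[str] = field(default_factory=list)


@dataclass(frozen=True)
class EvaluatorInput:
    """The only data the evaluator is allowed to see."""
    artifact_url: str
    rubric: str


def to_evaluator_input(run: GeneratorRun, rubric: str) -> EvaluatorInput:
    # Deliberately drop reasoning_trace: the critic judges output, not intent.
    return EvaluatorInput(artifact_url=run.artifact_url, rubric=rubric)


run = GeneratorRun(
    artifact_url="http://localhost:3000",               # assumed local dev server
    reasoning_trace=["Arrow keys are wired up, so movement should work."],
)
handoff = to_evaluator_input(run, rubric="Break the app. Report every defect.")
print(handoff)
```

Making the boundary a type rather than a prompt instruction means a later refactor has to change the type on purpose to share context.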

This Is Not Self-Reflection

If you’ve studied agentic design patterns, you’ll recognize reflection as one of the core primitives: a model critiques its own output, identifies flaws, and iterates. It works for straightforward tasks. But for multi-hour builds, reflection has a structural ceiling that this pattern breaks through.

The difference is context separation:

Dimension	Self-reflection	Adversarial evaluator
Context	Same window, same model invocation	Separate window, fresh invocation
Access to reasoning	Sees its own chain-of-thought	Sees only the output (via Playwright)
Bias	Rationalizes its own decisions	No ownership stake, no sunk cost
Failure mode	“This looks fine because I know why I did it”	“This is broken, I don’t care why”
Can recommend restart	Almost never (ownership bias)	Regularly does when scores plateau

Self-reflection is a student grading their own exam. They remember their reasoning, give partial credit generously, and convince themselves the answer is “close enough.” The adversarial evaluator is a different student grading a paper they’ve never seen, with a strict rubric and instructions to find every flaw.

The practical threshold: if your agent task completes in under 30 minutes, reflection is usually fine. The errors are small and recoverable. Once you cross into multi-hour territory, the compounding effect of “close enough” judgments produces apps that look done on the surface but crumble when you actually use them. That’s where the context split becomes non-negotiable.

The Three-Role Architecture

The full harness has three roles, each with its own context window:

%%{init: {"layout": "dagre"}}%%
flowchart TD
    P[Planner] --> |"high-level spec"| G[Generator]
    G --> |"builds feature"| E[Evaluator]
    E --> |"critique + scores"| G
    E --> |"all criteria pass"| Done[Ship]
    G --> |"stuck after N rounds"| Restart[Throw away and restart]

Planner. Takes a one-line prompt and produces a deliberately high-level spec broken into sprints. It does not plan granular technical details. The reasoning: technical errors in plans cascade through every subsequent sprint, magnifying over multi-hour horizons. Keep the plan at the product level.

Generator. The builder. Implements one feature at a time within a continuous session.

Evaluator. Uses Playwright to actually open the app, click around, test interactions, and score against a rubric. This isn’t reading diffs. It’s using the product like a human would.
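The control flow between the three roles can be sketched roughly as below. This is an illustration under assumptions, not the harness from the talk: `plan`, `build`, and `evaluate` are hypothetical callables standing in for three separate agent invocations, the critique shape is invented, and the evaluator runs after every sprint.

```python
from typing import Callable

Critique = dict  # assumed shape: {"passed": bool, "notes": str}


def run_harness(
    prompt: str,
    plan: Callable[[str], list[str]],
    build: Callable[[str, str], str],
    evaluate: Callable[[str], Critique],
    max_rounds: int = 10,
) -> list[str]:
    """Plan once, then loop build -> evaluate per sprint until it passes."""
    shipped = []
    for sprint in plan(prompt):
        feedback = ""
        for _ in range(max_rounds):
            artifact = build(sprint, feedback)   # generator context
            critique = evaluate(artifact)        # separate evaluator context
            if critique["passed"]:
                shipped.append(artifact)
                break
            feedback = critique["notes"]         # only the critique crosses over
        else:
            raise RuntimeError(f"stuck on {sprint!r}: throw away and restart")
    return shipped


# Stub agents so the loop can be exercised without any model calls.
def plan(prompt: str) -> list[str]:
    return ["sprite editor"]

def build(sprint: str, feedback: str) -> str:
    return f"{sprint}|{feedback}"

def evaluate(artifact: str) -> Critique:
    fixed = artifact.endswith("fix colors")
    return {"passed": fixed, "notes": "" if fixed else "fix colors"}

shipped = run_harness("build a retro game maker", plan, build, evaluate)
print(shipped)
```

The only string that travels from evaluator to generator is the critique, which keeps the two contexts separate.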

The evaluator scores on four weighted criteria:

Criterion	Weight	What it measures
Design	High	Visual quality, no “AI slop” aesthetics
Originality	High	Novel solutions vs. generic purple gradients
Craft	Medium	Attention to detail, polish
Functionality	Lower	Does it work? (Models already decent here)

The weighting toward design and originality is intentional. Opus 4.6 is already good at making things functional. The problem Anthropic is solving with the evaluator is taste: preventing the generic, gradient-heavy, soulless aesthetic that marks AI-generated interfaces.
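To make the effect of that weighting concrete, here is a small sketch. The numeric weights are assumptions chosen only to respect the High / Medium / Lower ordering in the table; the talk did not give numbers, and the 1-to-10 scale is likewise hypothetical.

```python
# Assumed numeric weights; only their ordering comes from the table above.
WEIGHTS = {"design": 3.0, "originality": 3.0, "craft": 2.0, "functionality": 1.0}


def weighted_score(scores: dict[str, float]) -> float:
    """Weighted mean of per-criterion scores (assumed 1-10 scale)."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / total_weight


# An app that works perfectly but looks generic.
generic_but_working = {"design": 4, "originality": 3, "craft": 8, "functionality": 10}
plain_mean = sum(generic_but_working.values()) / len(generic_but_working)

print(round(plain_mean, 2), round(weighted_score(generic_but_working), 2))
```

With these assumed weights, a working but generic app scores lower than its plain average would suggest, which is the pressure the rubric is meant to apply.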

Contract Negotiation: The Innovation Ralph Never Had

Before the generator writes a single line, the two agents negotiate what “done” means. This is the part that surprised me most.

The generator proposes: “I’ll build feature X, and you should verify it by testing Y.”

The evaluator pushes back: “Scope is too big. Those tests are too weak. You’ve missed edge cases A, B, and C.”

They iterate via files on disk until both agree on a contract. One writes markdown, the other reads and responds. Only after consensus does implementation begin. The evaluator then grades against the contract they co-authored, not the original high-level spec.

For the retro game maker, this process produced 27 contract criteria. That granularity is the point. Vague criteria produce vague critiques. The generator shrugs and makes cosmetic changes. Granular criteria produce actionable findings: “this exact interaction is broken on this exact screen.”

This is what the Ralph loop never had. Ralph had a fixed plan.md. Nobody on the other side argued with it. No adversarial pressure. No negotiation over what “done” means. The plan was accepted as truth and executed linearly. When the plan was wrong (and plans are always partially wrong), the loop couldn’t self-correct at the specification level.
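A rough sketch of negotiation through files on disk, under assumptions: the file names (`proposal_N.md`, `review_N.md`, `contract.md`), the `ACCEPT` marker, and the `propose` / `review` callables are all hypothetical stand-ins for the two agents. The talk only states that one agent writes markdown and the other reads and responds.

```python
import tempfile
from pathlib import Path
from typing import Callable


def negotiate_contract(
    workdir: Path,
    propose: Callable[[str], str],
    review: Callable[[str], str],
    max_rounds: int = 5,
) -> str:
    """Alternate proposal and review files until the reviewer accepts."""
    objections = ""
    for round_no in range(1, max_rounds + 1):
        proposal_path = workdir / f"proposal_{round_no}.md"
        proposal_path.write_text(propose(objections))
        # The reviewer reads the file, never the proposer's context.
        objections = review(proposal_path.read_text())
        (workdir / f"review_{round_no}.md").write_text(objections)
        if objections.strip() == "ACCEPT":          # assumed acceptance marker
            contract = proposal_path.read_text()
            (workdir / "contract.md").write_text(contract)
            return contract
    raise RuntimeError("no agreement reached")


# Stub agents: the reviewer insists on one extra edge case.
def propose(objections: str) -> str:
    criteria = ["- Arrow keys move the player in play mode"]
    if objections:
        criteria.append(f"- Covers: {objections}")
    return "\n".join(criteria)

def review(proposal: str) -> str:
    if "delete key" in proposal:
        return "ACCEPT"
    return "delete key behavior with nothing selected"

with tempfile.TemporaryDirectory() as tmp:
    contract = negotiate_contract(Path(tmp), propose, review)
    rounds = len(list(Path(tmp).glob("proposal_*.md")))

print(rounds)
print(contract)
```

Keeping every round on disk also leaves an audit trail of why each criterion ended up in the contract.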

The Evaluator Uses the App

This is worth emphasizing. The evaluator doesn’t read code. It doesn’t review diffs. It launches the app with Playwright, navigates pages, clicks buttons, types inputs, and checks what actually happens.

For the retro game maker, the evaluator:

Launched play mode and pressed arrow keys (caught: movement didn’t work)
Tested the sprite editor’s color picker (caught: only black swatches rendering)
Checked keyboard shortcuts (caught: delete key had a Boolean logic bug)
Verified API route ordering (caught: route conflicts passing unit tests but breaking in production)

These are bugs that would sail through CI. Unit tests pass. The code looks correct. But actually using the app reveals that core interactions don’t work. The evaluator’s superpower is that it operates at the integration level, like a human QA tester who doesn’t care about your test coverage metrics, only whether the button does the thing.

This is a fundamentally different verification approach from the circuit breakers and timeout patterns that protect agents in production. Those patterns handle infrastructure failures. This pattern catches functional failures that only manifest when you use the product.
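As an illustration of checking behavior rather than code, here is a minimal sketch using Playwright's Python sync API. The screenshot comparison is a crude heuristic of my own (an idle animation would produce a false pass), the 200 ms wait is arbitrary, and the function names are hypothetical. It is not how Anthropic's evaluator is implemented.

```python
def key_press_changes_screen(url: str, key: str = "ArrowRight") -> bool:
    """Return True if pressing `key` changes what is rendered at `url`."""
    # Imported lazily so this module loads even where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        before = page.screenshot()       # PNG bytes of the current viewport
        page.keyboard.press(key)
        page.wait_for_timeout(200)       # arbitrary pause for a re-render
        after = page.screenshot()
        browser.close()
    return before != after


def to_finding(changed: bool, key: str) -> str:
    """Turn the raw observation into a critique line for the generator."""
    return "pass" if changed else f"FAIL: pressing {key} had no visible effect"


print(to_finding(False, "ArrowRight"))
```

A call such as `to_finding(key_press_changes_screen("http://localhost:3000"), "ArrowRight")` would need a running app and an installed browser; the URL is an assumption.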

When to Throw Everything Away

One behavior that emerged with Opus 4.6 surprised even the Anthropic team. The generator became “extremely willing to throw away everything” when it couldn’t hill-climb against the evaluator’s rubric.

After 10 passes at the same problem without the scores improving, the system deletes everything and restarts from scratch. The evaluator sometimes initiates this explicitly: “This approach obviously isn’t working. Delete everything and restart.”

This never happens with self-evaluation. A model reviewing its own work has ownership bias. It spent 10 rounds building this thing. Of course it thinks the current approach is salvageable. The separated evaluator has no attachment. It only sees outputs and scores. If the scores plateau, the answer is clear: start over.

This mirrors what experienced developers do intuitively. Research on professional developers shows they succeed precisely because they constrain scope and verify every step. Sometimes you’ve been patching a broken approach for hours and the right move is git reset --hard and a fresh start. Models needed the adversarial pressure of an external critic to learn the same lesson.
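The plateau rule described above can be sketched as a simple check over the evaluator's score history. The `patience` default of 10 echoes the "10 passes" figure from the talk; the `min_gain` parameter and the exact comparison are my assumptions.

```python
def should_restart(scores: list[float], patience: int = 10, min_gain: float = 0.0) -> bool:
    """True when the last `patience` rounds failed to beat the earlier best score."""
    if len(scores) <= patience:
        return False                      # not enough history to call it a plateau
    best_before = max(scores[:-patience])
    best_recent = max(scores[-patience:])
    return best_recent <= best_before + min_gain


plateaued = [3.0, 5.0] + [5.0] * 10       # ten rounds with no improvement
climbing = [float(i) for i in range(12)]  # still improving every round

print(should_restart(plateaued), should_restart(climbing))
```

Because the check looks only at scores, it carries none of the attachment that a generator has to its own ten rounds of work.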

Adapting the Harness as Models Improve

Anthropic walked through how this harness evolved across model generations. The pattern stays. The specifics simplify.

Harness element	Opus 4.5	Opus 4.6
Context reset between sessions	Required (context anxiety)	Dropped (single continuous session)
Sprint decomposition	Critical (one feature per session)	Optional (holds 2-hour builds coherently)
Evaluator cadence	Every sprint	End of full generation only
File system for state	Required	Still required

The lesson: the harness wasn’t wrong for 4.5. The frontier moved. What stays constant is the generator/evaluator split. What changes is how much hand-holding the generator needs between evaluation rounds.

Server-side compaction and 1M context GA mean a single continuous session works now, so the generator no longer needs a fresh context window between features. I covered how Claude Code’s agentic loop works previously, including the gather-act-verify cycle; the structural separation of roles builds on top of that loop. That separation still matters because the problem it solves (self-evaluation bias) is independent of context length.

Building This With Claude Code Today

You don’t need Anthropic’s internal tooling to try this pattern. The primitives already exist:

Custom sub-agents for the evaluator role. Give it a harsh system prompt and a detailed rubric.
Playwright MCP or Claude for Chrome MCP for the evaluator to actually use your app.
Skills to package grading rubrics into reusable evaluation flows.
Auto mode for the generator to run without permission interrupts.
File system state for contract documents and progress tracking between the two agents.

The evaluation rubric is the piece worth investing in. If you have a strong opinion on what “good” looks like in your domain, write it down with specific criteria and few-shot examples. Anthropic found this made “a really massive difference” to output quality. The rubric doesn’t need to be perfect. It just needs to be specific enough that the evaluator can produce actionable, granular critiques instead of “looks good to me.”
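One way to write a rubric down with specific criteria and few-shot examples is as data that renders into the evaluator's prompt. This is a sketch under assumptions: the `Criterion` fields, the 1-to-10 scale, and the example wording are mine, not Anthropic's rubric.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    name: str
    description: str
    bad_example: str    # few-shot: what a failing result looks like
    good_example: str   # few-shot: what a passing result looks like


def render_rubric(criteria: list[Criterion]) -> str:
    """Render criteria into prompt text for a hypothetical evaluator sub-agent."""
    lines = [
        "Score each criterion from 1 to 10.",
        "Cite the exact screen and interaction for every defect.",
    ]
    for c in criteria:
        lines += [
            f"## {c.name}",
            c.description,
            f"Fails: {c.bad_example}",
            f"Passes: {c.good_example}",
        ]
    return "\n".join(lines)


rubric_prompt = render_rubric([
    Criterion(
        name="Design",
        description="Visual quality with a deliberate, consistent style.",
        bad_example="Default purple gradient hero with centered card layout.",
        good_example="Palette and typography that match the retro game theme.",
    ),
])
print(rubric_prompt)
```

Keeping the rubric as data makes it easy to add a criterion or swap an example after reading evaluator traces.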

The Bottom Line

The pattern is simple. The insight is counterintuitive.

Models are better critics than self-evaluators. The same intelligence that can’t reliably judge its own work becomes an effective quality gate when you give it a separate context, a rubric, and no knowledge of the generator’s reasoning. Split the roles. Let them argue about what “done” means before any work begins. Make the evaluator actually use the output instead of reading the code. And when scores plateau, let it recommend starting over.

I’ve been running a version of this for my own projects since watching this talk. The quality difference on UI work is immediately noticeable. Not because the model writes better code, but because the evaluator catches the same “looks done but isn’t” failure mode that I’d otherwise only notice three days later.
