This is a deep-dive companion to *Why All Your Fears About AI Are Right*. That post covers the broader measurement problem; this one gets technical about evaluating AI agents specifically.
Picture an AI booking agent that successfully reserves your flight—but only after it hallucinated three phantom error messages, retried the same API call six times, and charged your credit card twice before the duplicate charge was reversed.
The final output looks fine. The process was a disaster. Traditional evaluation would give it a passing grade.
This is the Agent Problem: AI systems are evolving from simple generators into agents—they don’t just produce a response, they take multiple steps, use tools, make decisions, and call other systems. Measuring just the final output misses where things actually went wrong.
Why LLM-as-Judge Breaks for Agents
Traditional LLM-as-Judge evaluation has a straightforward flow:
Input → Judge LLM → Score + Explanation
(one call)
This works when you’re evaluating a single response. But agents don’t produce single responses—they produce trajectories: sequences of decisions, tool calls, and intermediate states.
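To make that concrete, here is a minimal sketch of what a trajectory might look like as a data structure. The type and field names are illustrative assumptions, not any particular framework's API:

// Illustrative types only: the field names are assumptions, not a specific framework's API.
interface ToolCall {
  tool: string;                       // e.g. "search_flights"
  arguments: Record<string, unknown>;
  result: unknown;                    // what the agent claims the call returned
}

interface TrajectoryStep {
  thought?: string;                   // the agent's stated reasoning, if logged
  action?: ToolCall;                  // tool invocation made at this step
  observation?: unknown;              // what actually came back
}

interface Trajectory {
  input: string;                      // the original task
  steps: TrajectoryStep[];            // every intermediate decision and tool call
  finalOutput: string;                // the answer traditional evaluation looks at
}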
Recent research identifies the core limitations:
| Limitation | Description | Impact |
|---|---|---|
| Single-pass reasoning | Cannot iterate or verify | Misses complex errors |
| No tool access | Cannot execute code/verify facts | Hallucinated correctness |
| Final outcome focus | Ignores intermediate steps | Poor trajectory evaluation |
| Cognitive overload | Multi-criteria in one call | Coarse-grained scores |
The key insight: “Contemporary evaluation techniques are inadequate for agentic systems. These approaches either focus exclusively on final outcomes—ignoring the step-by-step nature of agentic systems—or require excessive manual labor.”
Agent-as-Judge: The Architecture
Agent-as-Judge adds the capabilities that LLM-as-Judge lacks:
Input → ┌──────────────────────────────────────────────────┐
│ Planning: Decompose evaluation into subtasks │
├──────────────────────────────────────────────────┤
│ Tool Use: Execute code, search facts, verify │
├──────────────────────────────────────────────────┤
│ Memory: Track intermediate state, prior context │
├──────────────────────────────────────────────────┤
│ Multi-Agent: Debate, specialist roles, consensus │
└──────────────────────────────────────────────────┘
│
▼
Granular Scores + Step-by-Step Feedback
Instead of one LLM making one judgment, you have several cooperating pieces (a rough composition sketch follows this list):
- Specialist evaluators for different aspects (reasoning, tool use, safety)
- Tool verification that actually re-executes code to check results
- Trajectory analysis that scores each step, not just the final output
- Multi-agent debate where evaluators argue for different interpretations
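As an orienting sketch, reusing the Trajectory type from earlier: the SpecialistJudge interface and agentAsJudge function below are illustrative assumptions, not a specific library's API.

// Hypothetical composition sketch: each specialist judge scores one aspect of the trajectory.
interface StepVerdict { step: number; score: number; explanation: string; }

interface SpecialistJudge {
  aspect: 'reasoning' | 'tool_use' | 'safety';
  evaluate(trajectory: Trajectory): Promise<StepVerdict[]>;
}

async function agentAsJudge(
  trajectory: Trajectory,
  specialists: SpecialistJudge[]
): Promise<{ aspectScores: Record<string, number>; stepFeedback: StepVerdict[] }> {
  const stepFeedback: StepVerdict[] = [];
  const aspectScores: Record<string, number> = {};

  for (const judge of specialists) {
    // Each specialist scores every step for its own aspect.
    const verdicts = await judge.evaluate(trajectory);
    stepFeedback.push(...verdicts);
    aspectScores[judge.aspect] =
      verdicts.reduce((sum, v) => sum + v.score, 0) / Math.max(verdicts.length, 1);
  }

  return { aspectScores, stepFeedback };
}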
Practical Patterns
Tool Verification
Don’t trust an agent’s claim that its tool call succeeded—verify it:
async function verifyToolCall(action: Action): Promise<VerificationResult> {
  // Re-execute the tool call
  const actualResult = await executeToolCall(action.tool, action.arguments);

  // Compare with claimed result
  const match = await compareResults(action.result, actualResult);

  return {
    score: match.similarity,
    evidence: { expected: action.result, actual: actualResult }
  };
}
This catches the booking agent problem: even if the flight ultimately gets booked, verification reveals the duplicate charge and the unnecessary retries.
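For example, running the verifier over every tool call in a trajectory surfaces repeated calls to the same endpoint. This sketch assumes the Action and ToolCall shapes from earlier are compatible; auditTrajectory is a hypothetical helper, not part of any library:

// Illustrative usage: walk the trajectory, verify each call, and flag repeated identical calls.
async function auditTrajectory(trajectory: Trajectory) {
  const seen = new Map<string, number>();
  const verifications: VerificationResult[] = [];

  for (const step of trajectory.steps) {
    if (!step.action) continue;
    // Count repeated calls with identical arguments (e.g. the duplicate charge call).
    const key = `${step.action.tool}:${JSON.stringify(step.action.arguments)}`;
    seen.set(key, (seen.get(key) ?? 0) + 1);
    verifications.push(await verifyToolCall(step.action));
  }

  const duplicates = [...seen.entries()].filter(([, count]) => count > 1);
  return { verifications, duplicates };
}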
Trajectory Analysis
Score each step independently, then aggregate:
interface StepScore {
  step: string | number;
  score: number;          // 0-1
  evidence?: any;
  explanation?: string;
}

function aggregateStepScores(
  scores: StepScore[],
  method: 'average' | 'weighted' | 'min'
): number {
  switch (method) {
    case 'min':
      // Fail-fast: worst step determines overall score
      return Math.min(...scores.map(s => s.score));
    case 'weighted': {
      // Early steps weighted higher (upstream errors compound)
      const weights = scores.map((_, i) => 1 / (i + 1));
      const totalWeight = weights.reduce((a, b) => a + b, 0);
      return scores.reduce((sum, s, i) => sum + s.score * weights[i], 0) / totalWeight;
    }
    default:
      return scores.reduce((sum, s) => sum + s.score, 0) / scores.length;
  }
}
The min method is particularly useful for catching the “one bad step ruins everything” scenarios.
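A quick illustration of why the aggregation method matters (the numbers are made up):

// One bad step (0.1) buried among otherwise good ones.
const steps: StepScore[] = [
  { step: 1, score: 0.9 },
  { step: 2, score: 0.1 },    // e.g. the duplicate charge call
  { step: 3, score: 0.95 },
  { step: 4, score: 0.9 }
];

aggregateStepScores(steps, 'average');  // ~0.71: looks acceptable
aggregateStepScores(steps, 'min');      // 0.1: flags the trajectory for review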
Multi-Agent Consensus
Have multiple evaluators debate the score:
async function collectiveConsensus(
  evaluand: Evaluand,
  judges: JudgeAgent[],
  maxRounds: number = 3,
  convergenceThreshold: number = 0.1
): Promise<ConsensusResult> {
  const scores: Map<string, number[]> = new Map();
  let convergenceRound = maxRounds;  // maxRounds means the judges never converged

  for (let round = 0; round < maxRounds; round++) {
    // Each judge evaluates, seeing others' prior scores
    const roundScores = await Promise.all(
      judges.map(j => j.evaluate(evaluand, scores))
    );

    // Update history
    roundScores.forEach((s, i) => {
      const key = judges[i].id;
      scores.set(key, [...(scores.get(key) || []), s.score]);
    });

    // Stop if judges converge
    const variance = calculateVariance(roundScores.map(s => s.score));
    if (variance < convergenceThreshold) {
      convergenceRound = round;
      break;
    }
  }

  return {
    finalScore: median(Array.from(scores.values()).map(s => s.at(-1)!)),
    convergenceRound,
    judgeScores: scores
  };
}
This dampens bias: if one judge is systematically too harsh or too lenient, the median filters it out.
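The consensus sketch above leans on two helpers it doesn't define. Minimal versions might look like this:

// Minimal helper implementations assumed by collectiveConsensus above.
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

function calculateVariance(values: number[]): number {
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  return values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
}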
The Adversarial Pattern
For high-stakes evaluations, some teams implement adversarial evaluation:
- Prosecutors argue for low scores, looking for every possible flaw
- Defenders argue for high scores, steelmanning the response
- Arbiter weighs both arguments and makes the final call
This forces explicit consideration of both failure modes (false positives and false negatives).
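One way this could be wired up, assuming each role is just a differently prompted judge agent (the interfaces below are illustrative, not a specific framework's API):

// Hypothetical roles: each is a judge with a different stance baked into its prompt.
interface Argument { score: number; reasoning: string; }

interface RoleJudge {
  argue(evaluand: Evaluand): Promise<Argument>;
}

async function adversarialEvaluation(
  evaluand: Evaluand,
  prosecutor: RoleJudge,   // instructed to find every flaw
  defender: RoleJudge,     // instructed to steelman the response
  arbiter: { decide(evaluand: Evaluand, args: Argument[]): Promise<Argument> }
): Promise<Argument> {
  const [caseAgainst, caseFor] = await Promise.all([
    prosecutor.argue(evaluand),
    defender.argue(evaluand)
  ]);
  // The arbiter sees both briefs and issues the final score.
  return arbiter.decide(evaluand, [caseAgainst, caseFor]);
}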
Production Considerations
Cost-Latency Tradeoffs
| Approach | Latency | Cost | Quality |
|---|---|---|---|
| Single LLM-as-Judge | Low (200-500ms) | Low | Medium |
| Multi-agent debate | High (5-15s) | High | High |
| Tool-augmented | Medium (1-5s) | Medium | High |
| Hierarchical tiered | Variable | Medium | High |
Tiered Evaluation
Not every evaluation needs the full treatment:
async function tieredEvaluation(evaluand: Evaluand): Promise<EvalResult> {
  // Tier 1: Fast, cheap initial check
  const tier1 = await quickEvaluator.evaluate(evaluand);
  if (tier1.confidence > 0.95) return tier1;

  // Tier 2: More thorough
  const tier2 = await thoroughEvaluator.evaluate(evaluand, tier1);
  if (tier2.confidence > 0.90) return tier2;

  // Tier 3: Full agent-as-judge with tools
  return await fullAgentJudge.evaluate(evaluand, [tier1, tier2]);
}
High-confidence cases get fast evaluation. Edge cases get deep analysis. Costs stay manageable.
Circuit Breakers
Your evaluation system is itself an AI system—it can fail:
const breaker = new JudgeCircuitBreaker({
  threshold: 5,          // failures before opening
  resetTimeout: 30000,   // ms before retry
  fallbackModel: 'gpt-4o-mini'
});

// Ignores transient rate limit errors
// Switches to fallback when primary fails
// Prevents cascade failures
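JudgeCircuitBreaker isn't spelled out above; a bare-bones version of the open/fallback/retry behavior those comments describe might look like this (an illustrative sketch, not a published library API):

// Minimal sketch: count consecutive failures, open the circuit after `threshold`,
// route to the fallback model, and retry the primary after `resetTimeout`.
class JudgeCircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(private opts: { threshold: number; resetTimeout: number; fallbackModel: string }) {}

  async evaluate(
    evaluand: Evaluand,
    judge: (e: Evaluand, model?: string) => Promise<EvalResult>
  ): Promise<EvalResult> {
    const open = this.openedAt !== null && Date.now() - this.openedAt < this.opts.resetTimeout;
    if (open) {
      // Circuit is open: skip the primary entirely and use the fallback model.
      return judge(evaluand, this.opts.fallbackModel);
    }
    try {
      const result = await judge(evaluand);
      this.failures = 0;           // success closes the circuit again
      this.openedAt = null;
      return result;
    } catch {
      this.failures += 1;
      if (this.failures >= this.opts.threshold) this.openedAt = Date.now();
      // Fall back rather than cascading the failure to callers.
      return judge(evaluand, this.opts.fallbackModel);
    }
  }
}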
The Measurement of Measurement
Here’s the uncomfortable recursion: we’re building AI to evaluate AI.
This creates an arms race. Build better AI, need better measurement, measurement becomes AI, now measure the measurement AI, repeat.
The practical solution: canary evaluations.
const report = await runCanaryEvaluations();
// Built-in test cases:
// - Perfect answer (should score high)
// - Obvious hallucination (should score low)
// - Off-topic response (should score very low)
If your judge starts giving passing grades to known-bad outputs, you know something’s broken—before it corrupts your production data.
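A sketch of what a canary runner could look like; the parameterized signature and the CanaryCase shape here are assumptions, not the exact runCanaryEvaluations above:

// Hypothetical canary suite: known inputs with expected score ranges.
interface CanaryCase {
  name: string;
  evaluand: Evaluand;
  expectedRange: [number, number];   // [min, max] acceptable judge score
}

async function runCanarySuite(
  judge: (e: Evaluand) => Promise<{ score: number }>,
  canaries: CanaryCase[]
): Promise<{ passed: boolean; failures: string[] }> {
  const failures: string[] = [];
  for (const canary of canaries) {
    const { score } = await judge(canary.evaluand);
    const [min, max] = canary.expectedRange;
    // A judge that scores a known-bad output above `max` (or a perfect answer
    // below `min`) has drifted and should be pulled from production.
    if (score < min || score > max) {
      failures.push(`${canary.name}: got ${score.toFixed(2)}, expected ${min}-${max}`);
    }
  }
  return { passed: failures.length === 0, failures };
}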
Implementation Resources
For teams building Agent-as-Judge systems, these patterns are available in production-ready form:
Libraries:
Research:
- Agent-as-a-Judge (Zhuge et al., 2024) - Original framework
- Survey on Agent-as-a-Judge (You et al., 2026) - Comprehensive taxonomy
The Bottom Line
Evaluating AI agents requires evaluating the process, not just the product.
The booking agent that succeeded after six retries and a duplicate charge isn’t the same quality as one that succeeded on the first try. Traditional evaluation can’t tell the difference. Agent-as-Judge can.
The measurement moat isn’t just about catching bad outputs—it’s about understanding why they’re bad, where the failures happen, and building systems that can verify themselves.
Building agent evaluation systems? Integrity Studio helps teams implement production-grade LLM and agent measurement. Let’s talk.