This is a deep-dive companion to Why All Your Fears About AI Are Right. That post covers the broader measurement problem; this one gets technical about evaluating AI agents specifically.


Picture an AI booking agent that successfully reserves your flight—but only after it hallucinated three phantom error messages, retried the same API call six times, and accidentally charged your credit card twice before the duplicate was reversed.

The final output looks fine. The process was a disaster. Traditional evaluation would give it a passing grade.

This is the Agent Problem: AI systems are evolving from simple generators into agents—they don’t just produce a response; they take multiple steps, use tools, make decisions, and call other systems. Measuring just the final output misses where things actually went wrong.

Why LLM-as-Judge Breaks for Agents

Traditional LLM-as-Judge evaluation has a straightforward flow:

Input → Judge LLM → Score + Explanation
         (one call)

This works when you’re evaluating a single response. But agents don’t produce single responses—they produce trajectories: sequences of decisions, tool calls, and intermediate states.
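
To make “trajectory” concrete, here is a minimal shape an evaluator might consume. This is a sketch only; the type and field names are illustrative, not a standard schema:

// Illustrative trajectory types -- names and fields are assumptions for this post
interface ToolCall {
  tool: string;
  arguments: Record<string, unknown>;
  result?: unknown;            // what the agent claims the tool returned
}

interface TrajectoryStep {
  index: number;
  thought?: string;            // the agent's stated reasoning, if logged
  action?: ToolCall;           // tool call made at this step, if any
  observation?: unknown;       // what actually came back from the environment
}

interface Trajectory {
  input: string;
  steps: TrajectoryStep[];
  finalOutput: string;
}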

Recent research identifies the core limitations:

Limitation            | Description                      | Impact
----------------------|----------------------------------|---------------------------
Single-pass reasoning | Cannot iterate or verify         | Misses complex errors
No tool access        | Cannot execute code/verify facts | Hallucinated correctness
Final outcome focus   | Ignores intermediate steps       | Poor trajectory evaluation
Cognitive overload    | Multi-criteria in one call       | Coarse-grained scores

The key insight: “Contemporary evaluation techniques are inadequate for agentic systems. These approaches either focus exclusively on final outcomes—ignoring the step-by-step nature of agentic systems—or require excessive manual labor.”

Agent-as-Judge: The Architecture

Agent-as-Judge adds the capabilities that LLM-as-Judge lacks:

Input → ┌──────────────────────────────────────────────────┐
        │ Planning: Decompose evaluation into subtasks      │
        ├──────────────────────────────────────────────────┤
        │ Tool Use: Execute code, search facts, verify      │
        ├──────────────────────────────────────────────────┤
        │ Memory: Track intermediate state, prior context   │
        ├──────────────────────────────────────────────────┤
        │ Multi-Agent: Debate, specialist roles, consensus  │
        └──────────────────────────────────────────────────┘
                              │
                              ▼
         Granular Scores + Step-by-Step Feedback

Instead of one LLM making one judgment, you have (see the sketch after this list):

  • Specialist evaluators for different aspects (reasoning, tool use, safety)
  • Tool verification that actually re-executes code to check results
  • Trajectory analysis that scores each step, not just the final output
  • Multi-agent debate where evaluators argue for different interpretations
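
As a rough illustration of how these pieces compose, here is a minimal Agent-as-Judge skeleton, reusing the Trajectory shape sketched earlier. The interfaces and the naive averaging are assumptions for illustration, not any particular framework’s API:

// Hypothetical composition of specialist evaluators -- not a real library API
interface SpecialistVerdict {
  aspect: 'reasoning' | 'tool_use' | 'safety';
  score: number;               // 0-1
  stepFeedback: string[];      // one note per trajectory step
}

interface SpecialistEvaluator {
  aspect: SpecialistVerdict['aspect'];
  evaluate(trajectory: Trajectory): Promise<SpecialistVerdict>;
}

async function agentAsJudge(
  trajectory: Trajectory,
  specialists: SpecialistEvaluator[]
): Promise<{ overall: number; verdicts: SpecialistVerdict[] }> {
  // Each specialist examines the full trajectory from its own angle
  const verdicts = await Promise.all(specialists.map(s => s.evaluate(trajectory)));

  // Naive aggregation for the sketch: average the specialist scores
  const overall = verdicts.reduce((sum, v) => sum + v.score, 0) / verdicts.length;
  return { overall, verdicts };
}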

Practical Patterns

Tool Verification

Don’t trust an agent’s claim that its tool call succeeded—verify it:

async function verifyToolCall(action: Action): Promise<VerificationResult> {
  // Re-execute the tool call
  const actualResult = await executeToolCall(action.tool, action.arguments);

  // Compare with claimed result
  const match = await compareResults(action.result, actualResult);

  return {
    score: match.similarity,
    evidence: { expected: action.result, actual: actualResult }
  };
}

This catches the booking agent problem: even if the final flight is booked, verification reveals the duplicate payment call and the unnecessary retries.

Trajectory Analysis

Score each step independently, then aggregate:

interface StepScore {
  step: string | number;
  score: number;           // 0-1
  evidence?: any;
  explanation?: string;
}

function aggregateStepScores(
  scores: StepScore[],
  method: 'average' | 'weighted' | 'min'
): number {
  switch (method) {
    case 'min':
      // Fail-fast: worst step determines overall score
      return Math.min(...scores.map(s => s.score));
    case 'weighted': {
      // Early steps weighted higher (upstream errors compound)
      const weights = scores.map((_, i) => 1 / (i + 1));
      const totalWeight = weights.reduce((a, b) => a + b, 0);
      return scores.reduce((sum, s, i) => sum + s.score * weights[i], 0) / totalWeight;
    }
    default:
      return scores.reduce((sum, s) => sum + s.score, 0) / scores.length;
  }
}

The min method is particularly useful for catching “one bad step ruins everything” scenarios.
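
As a quick usage example with made-up step scores for the booking-agent trajectory, the average still looks passable while min flags the disaster:

// Hypothetical step scores -- the values are invented for illustration
const bookingSteps: StepScore[] = [
  { step: 'search_flights', score: 0.95 },
  { step: 'retry_api_call', score: 0.4, explanation: 'same call retried six times' },
  { step: 'charge_card', score: 0.1, explanation: 'duplicate payment request' },
  { step: 'confirm_booking', score: 1.0 }
];

const average = aggregateStepScores(bookingSteps, 'average');  // ~0.61 -- mediocre but passing
const worst = aggregateStepScores(bookingSteps, 'min');        // 0.1 -- flags the duplicate charge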

Multi-Agent Consensus

Have multiple evaluators debate the score:

async function collectiveConsensus(
  evaluand: Evaluand,
  judges: JudgeAgent[],
  maxRounds: number = 3,
  convergenceThreshold: number = 0.1
): Promise<ConsensusResult> {
  const scores: Map<string, number[]> = new Map();

  // Keep the round counter outside the loop so it is still in scope for the return value
  let round = 0;
  for (; round < maxRounds; round++) {
    // Each judge evaluates, seeing others' prior scores
    const roundScores = await Promise.all(
      judges.map(j => j.evaluate(evaluand, scores))
    );

    // Update history
    roundScores.forEach((s, i) => {
      const key = judges[i].id;
      scores.set(key, [...(scores.get(key) || []), s.score]);
    });

    // Stop if judges converge
    const variance = calculateVariance(roundScores.map(s => s.score));
    if (variance < convergenceThreshold) break;
  }

  return {
    finalScore: median(Array.from(scores.values()).map(s => s.at(-1)!)),
    convergenceRound: round,
    judgeScores: scores
  };
}

This catches bias: if one judge is systematically too harsh or lenient, the median filters it out.
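
The consensus sketch above leans on two small helpers, calculateVariance and median, which are not shown; minimal versions could look like this:

// Minimal helper implementations assumed by the consensus sketch
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

function calculateVariance(values: number[]): number {
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  return values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
}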

The Adversarial Pattern

For high-stakes evaluations, some teams implement adversarial evaluation:

  • Prosecutors argue for low scores, looking for every possible flaw
  • Defenders argue for high scores, steelmanning the response
  • Arbiter weighs both arguments and makes the final call

This forces explicit consideration of both failure modes (false positives and false negatives).
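
A minimal orchestration sketch, assuming the prosecutor, defender, and arbiter are themselves agents with simple argue/decide interfaces; the role names and types are illustrative:

// Hypothetical adversarial evaluation -- roles and interfaces are assumptions
interface Argument {
  proposedScore: number;       // 0-1
  reasoning: string;
}

interface AdvocateAgent {
  argue(evaluand: Evaluand, stance: 'prosecute' | 'defend'): Promise<Argument>;
}

interface ArbiterAgent {
  decide(evaluand: Evaluand, prosecution: Argument, defense: Argument): Promise<{ score: number; rationale: string }>;
}

async function adversarialEvaluation(
  evaluand: Evaluand,
  prosecutor: AdvocateAgent,
  defender: AdvocateAgent,
  arbiter: ArbiterAgent
) {
  // Both advocates see the same evaluand but argue opposite stances
  const [prosecution, defense] = await Promise.all([
    prosecutor.argue(evaluand, 'prosecute'),
    defender.argue(evaluand, 'defend')
  ]);

  // The arbiter weighs both arguments and issues the final score
  return arbiter.decide(evaluand, prosecution, defense);
}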

Production Considerations

Cost-Latency Tradeoffs

Approach            | Latency         | Cost   | Quality
--------------------|-----------------|--------|--------
Single LLM-as-Judge | Low (200-500ms) | Low    | Medium
Multi-agent debate  | High (5-15s)    | High   | High
Tool-augmented      | Medium (1-5s)   | Medium | High
Hierarchical tiered | Variable        | Medium | High

Tiered Evaluation

Not every evaluation needs the full treatment:

async function tieredEvaluation(evaluand: Evaluand): Promise<EvalResult> {
  // Tier 1: Fast, cheap initial check
  const tier1 = await quickEvaluator.evaluate(evaluand);
  if (tier1.confidence > 0.95) return tier1;

  // Tier 2: More thorough
  const tier2 = await thoroughEvaluator.evaluate(evaluand, tier1);
  if (tier2.confidence > 0.90) return tier2;

  // Tier 3: Full agent-as-judge with tools
  return await fullAgentJudge.evaluate(evaluand, [tier1, tier2]);
}

High-confidence cases get fast evaluation. Edge cases get deep analysis. Costs stay manageable.

Circuit Breakers

Your evaluation system is itself an AI system—it can fail:

const breaker = new JudgeCircuitBreaker({
  threshold: 5,           // failures before opening
  resetTimeout: 30000,    // ms before retry
  fallbackModel: 'gpt-4o-mini'
});

// Ignores transient rate limit errors
// Switches to fallback when primary fails
// Prevents cascade failures
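
JudgeCircuitBreaker here stands in for whatever wrapper you use. As a rough sketch of the underlying state machine (not the actual implementation), the core logic is just failure counting plus a cool-down before retrying the primary judge:

// Stripped-down circuit-breaker sketch -- a simplified stand-in, not the real class
class SimpleCircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(
    private threshold: number,
    private resetTimeout: number,
    private fallback: () => Promise<EvalResult>
  ) {}

  async run(primary: () => Promise<EvalResult>): Promise<EvalResult> {
    // While open and still inside the cool-down window, go straight to the fallback judge
    if (this.openedAt !== null && Date.now() - this.openedAt < this.resetTimeout) {
      return this.fallback();
    }
    try {
      const result = await primary();
      this.failures = 0;         // a success closes the breaker again
      this.openedAt = null;
      return result;
    } catch {
      this.failures += 1;
      if (this.failures >= this.threshold) this.openedAt = Date.now();
      return this.fallback();    // degrade gracefully instead of cascading
    }
  }
}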

The Measurement of Measurement

Here’s the uncomfortable recursion: we’re building AI to evaluate AI.

This creates an arms race. Build better AI, need better measurement, measurement becomes AI, now measure the measurement AI, repeat.

The practical solution: canary evaluations.

const report = await runCanaryEvaluations();
// Built-in test cases:
// - Perfect answer (should score high)
// - Obvious hallucination (should score low)
// - Off-topic response (should score very low)

If your judge starts giving passing grades to known-bad outputs, you know something’s broken—before it corrupts your production data.
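
A sketch of what those canaries might look like, assuming a judge that exposes an evaluate method returning a 0-1 score. The thresholds and case names are assumptions, and the evaluands themselves are supplied by you:

// Illustrative canary harness -- thresholds and names are assumptions
interface CanaryCase {
  name: string;
  evaluand: Evaluand;
  expect: (score: number) => boolean;   // which side of the line this case must land on
}

function buildCanaries(cases: { perfect: Evaluand; hallucination: Evaluand; offTopic: Evaluand }): CanaryCase[] {
  return [
    { name: 'perfect answer', evaluand: cases.perfect, expect: s => s > 0.9 },
    { name: 'obvious hallucination', evaluand: cases.hallucination, expect: s => s < 0.3 },
    { name: 'off-topic response', evaluand: cases.offTopic, expect: s => s < 0.2 }
  ];
}

async function runCanaryChecks(
  judge: { evaluate(e: Evaluand): Promise<{ score: number }> },
  canaries: CanaryCase[]
): Promise<{ healthy: boolean; failures: string[] }> {
  const failures: string[] = [];
  for (const canary of canaries) {
    const { score } = await judge.evaluate(canary.evaluand);
    if (!canary.expect(score)) failures.push(`${canary.name} scored ${score.toFixed(2)}`);
  }
  // Healthy only if every known-good and known-bad case lands where expected
  return { healthy: failures.length === 0, failures };
}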

Implementation Resources

For teams building Agent-as-Judge systems, these patterns are available in production-ready form:

Libraries:

  • DeepEval - Agent metrics and evaluation
  • Ragas - Retrieval and agent evaluation

The Bottom Line

Evaluating AI agents requires evaluating the process, not just the product.

The booking agent that succeeded after six retries and a duplicate charge isn’t the same quality as one that succeeded on the first try. Traditional evaluation can’t tell the difference. Agent-as-Judge can.

The measurement moat isn’t just about catching bad outputs—it’s about understanding why they’re bad, where the failures happen, and building systems that can verify themselves.


Building agent evaluation systems? Integrity Studio helps teams implement production-grade LLM and agent measurement. Let’s talk.