LLM UX Interface Explainability for OTel-Native Observability

Version: 1.0 Date: 2026-02-06 Status: Research Scope: observability-toolkit v2.0.1


Executive Summary

Research into how leading observability platforms surface LLM evaluation explainability, aligned with OpenTelemetry GenAI semantic conventions. Key findings:

  • OTel GenAI semantic conventions now include a dedicated gen_ai.evaluation.result event with gen_ai.evaluation.score.value, gen_ai.evaluation.explanation, gen_ai.evaluation.name, and gen_ai.evaluation.score.label attributes, providing an interoperable wire format for evaluation explainability. This maps directly to the toolkit’s existing EvaluationResult interface.
  • Langfuse leads in evaluation traceability with full execution tracing of LLM-as-Judge evaluations (October 2025), where every judge invocation produces an inspectable trace showing the exact prompt, response, and reasoning used to produce a score.
  • Arize Phoenix mandates explanation generation via a provide_explanation parameter on all evaluations, and instruments evaluators with OTel traces, creating a unified observability + evaluation pipeline.
  • Dashboard UX patterns converge on three-tier alerting (healthy/warning/critical) with percentile-based thresholds, but the industry gap is in making those thresholds actionable – linking metric breaches to specific traces and evaluation explanations.
  • Regulatory pressure is accelerating: EU AI Act transparency obligations (Article 50) become fully applicable August 2026, and NIST AI RMF 1.0 emphasizes explainability scoring as a measurable trustworthiness characteristic.

1. Platform Analysis

Comparison Matrix

Capability Langfuse Arize Phoenix Datadog LLM Obs LangSmith W&B Weave Confident AI
Eval explanations stored Yes (via execution trace) Yes (provide_explanation param) Yes (managed evals) Yes (linked to traces) Yes (scorer output) Yes (all metrics include reasons)
Judge execution tracing Full OTel traces per judge call OTel-instrumented evaluators Managed eval traces Linked to run traces Trace integration Via DeepEval framework
CoT reasoning visible Via score tooltip -> trace link Explanation column in eval df Quality check detail view Step-by-step chain inspection Custom dashboard panels CLI output + web reports
Score breakdown UI Score badge tooltip on traces Eval columns on span dataframe Built-in quality checks panel Evaluation results tab Leaderboard aggregation Conversational turn display
Confidence indicators Not built-in Not built-in Not built-in Not built-in Not built-in Not built-in
OTel GenAI support OTLP export OpenInference (OTel-compatible) Native v1.37+ mapping Proprietary + OTel bridge OTel trace support OTel via DeepEval
Multi-agent eval Agent evaluation guide Deep multi-step agent traces AI Agent Monitoring (2025) Agent chain visualization Agent workflow tracking Agent metrics (2025)
Regulatory features Audit trail via traces Trace provenance Governance via OTel Collector Run history Artifacts + Registry Test reports
Open source Yes (self-host) Yes (OSS core) No (SaaS) No (SaaS) Partial (Weave OSS) Partial (DeepEval OSS)

Platform Detail: Langfuse

Langfuse introduced LLM-as-a-Judge Execution Tracing in October 2025, which is the most complete implementation of evaluation explainability observed in this research.

Key UX patterns:

  • Score tooltip on trace view: Hovering over any score badge reveals a “View execution trace” link, providing zero-click access to the judge’s reasoning
  • Four navigation paths to evaluation traces: Score tooltip, tracing table filter (langfuse-llm-as-a-judge environment), scores table “Execution Trace” column, evaluator logs table
  • Full judge trace capture: Every LLM-as-Judge execution creates a trace recording the prompt sent to the judge LLM, the complete response including score and reasoning, and token usage/cost
┌──────────────────────────────────────────────────────────────┐
│  Langfuse Trace View                                          │
│                                                                │
│  Trace: user-query-abc123                                      │
│  ├── [span] LLM Call: gpt-4o                                  │
│  │   ├── Input: "What is the capital of France?"              │
│  │   └── Output: "The capital of France is Paris."            │
│  │                                                             │
│  │   Score Badges:                                             │
│  │   ┌─────────────┐  ┌──────────────────┐                   │
│  │   │ relevance   │  │ faithfulness     │                   │
│  │   │ 0.92 [i]    │  │ 0.88 [i]        │                   │
│  │   └──────┬──────┘  └──────────────────┘                   │
│  │          │                                                  │
│  │          ▼  (hover tooltip)                                 │
│  │   ┌──────────────────────────────────┐                     │
│  │   │ Score: 0.92                       │                     │
│  │   │ Evaluator: gpt-4o-mini            │                     │
│  │   │ [View execution trace ->]         │                     │
│  │   └──────────────────────────────────┘                     │
│  │                                                             │
│  └── [span] Tool Call: search_api                              │
│                                                                │
└──────────────────────────────────────────────────────────────┘

Relevance to observability-toolkit: The toolkit’s obs_export_langfuse already supports OTLP + basic auth export. Evaluation results with explanation fields could be surfaced in Langfuse’s trace view after export.

Platform Detail: Arize Phoenix

Phoenix takes an “explanation-by-default” approach to evaluation explainability.

Key UX patterns:

  • provide_explanation parameter: When set to True on run_evals(), the evaluator LLM is prompted to explain its reasoning, and the explanation is stored alongside the score in the output dataframe
  • OTel-instrumented evaluators: Evaluators are natively instrumented via OpenTelemetry tracing, creating Evaluator Traces that show the full evaluation pipeline
  • Prompt management integration: Prompt templates used by evaluators are versioned and stored, so the exact evaluation prompt can be inspected retroactively
  • Evaluation columns on trace dataframe: Scores and explanations appear as columns directly on the span/trace dataframe, enabling filtering and sorting

Relevance to observability-toolkit: The toolkit’s obs_export_phoenix (Bearer auth + project org) maps EvaluationResult.explanation to Phoenix’s explanation column.

Platform Detail: Datadog LLM Observability

Datadog takes an enterprise-integrated approach with native OTel GenAI Semantic Conventions support (v1.37+).

Key UX patterns:

  • LLM Overview dashboard: Collates trace/span-level error and latency metrics, token consumption, model usage statistics, and triggered monitors in a single view
  • Built-in quality checks: Out-of-the-box evaluations for “Failure to answer”, “Topic relevancy”, “Toxicity”, and “Negative sentiment” with automatic scoring
  • Managed + custom evaluations: Automatic detection of hallucinations, prompt injections, unsafe responses, and PII leaks on both live and historical traces; plus custom LLM-as-a-judge evaluations
  • OTel Collector governance: Data policies (redaction, sampling, enrichment, routing) enforced before telemetry leaves the network
  • LLM Experiments: Test prompt changes against production data before deployment (June 2025)

Relevance to observability-toolkit: The toolkit’s obs_export_datadog uses two-phase spans + eval metrics. Datadog’s native OTel GenAI mapping means the toolkit’s OTel-aligned EvaluationResult schema is directly consumable.

Platform Detail: LangSmith

LangSmith provides deep trace inspection with evaluation integration.

Key UX patterns:

  • Hierarchical trace visualization: Step-by-step inspection of chains, agents, and LLM calls with tree-style rendering
  • Evaluation-trace linking: Failing evaluation grades link back to the exact prompt, tool output, and memory state that caused the failure
  • Online + offline evals: Offline evals run on datasets (benchmarking/regression); online evals run on production traffic in near real time
  • Customizable dashboard widgets: High-level statistics, recent runs, and summaries with configurable views

Platform Detail: W&B Weave

Weave emphasizes side-by-side comparison and leaderboard views.

Key UX patterns:

  • Evaluation leaderboards: Aggregate evaluations into leaderboards featuring best performers, shareable across the organization
  • Side-by-side comparison: Visualizations for objective, precise comparisons between evaluation runs
  • Custom metric dashboards: Design dashboards focused on metrics most relevant to specific LLM tasks
  • Human + quantitative integration: Combine human evaluation results alongside quantitative metrics for holistic performance view

Platform Detail: Confident AI / DeepEval

DeepEval provides the most explicit explanation-per-metric approach.

Key UX patterns:

  • All metrics include explanations: Every DeepEval metric provides a comprehensive reason for the computed score, not just the numeric value
  • Conversational turn display: Multi-turn evaluations show role, truncated content, and tools used per turn
  • Agent-specific metrics: PlanQualityMetric (logical, complete, efficient plans), ToolCorrectnessMetric (proper tool selection), with explanations for each
  • Shareable reports: Generate stakeholder-facing reports from evaluation results

Relevance to observability-toolkit: The toolkit’s obs_export_confident (API key + env tagging) exports to the Confident AI platform where DeepEval results are visualized.


2. OTel GenAI Semantic Conventions for Explainability

Evaluation Event Specification

The OpenTelemetry GenAI semantic conventions define a gen_ai.evaluation.result event for capturing evaluation outcomes. This is the primary interoperability standard for evaluation explainability.

Event name: gen_ai.evaluation.result

Event attributes (per OTel GenAI semantic conventions):

Attribute Type Required Description
gen_ai.evaluation.name string Recommended Name of the evaluation metric used
gen_ai.evaluation.score.value number No The evaluation score returned by the evaluator
gen_ai.evaluation.score.label string Recommended Human-readable interpretation (e.g., “relevant”, “pass”, “fail”)
gen_ai.evaluation.explanation string No Free-form explanation for the assigned score
gen_ai.response.id string Conditional Unique ID of the completion being evaluated; used for correlation when span ID unavailable

Note: The OTel GenAI evaluation event conventions are currently experimental (as of February 2026). Attribute names may change before stabilization. The toolkit’s GENAI_EVALUATION_ATTRIBUTES constant in src/backends/index.ts tracks the canonical names.

Parenting rules:

  • The event SHOULD be parented to the GenAI operation span being evaluated when possible
  • When span ID is not available, set gen_ai.response.id for correlation

Score label semantics:

  • The label SHOULD have low cardinality
  • Possible values depend on the evaluation metric and evaluator used
  • Implementations SHOULD document the possible values
  • Example: a score_value of 1 could mean “relevant” in one system and “not_relevant” in another

Mapping to observability-toolkit EvaluationResult

The toolkit’s EvaluationResult interface already aligns with OTel GenAI semantic conventions:

// src/backends/index.ts — actual interface with OTel attribute mapping
interface EvaluationResult {
  timestamp: string;
  evaluationName: string;       // -> gen_ai.evaluation.name
  scoreValue?: number;          // -> gen_ai.evaluation.score.value
  scoreLabel?: string;          // -> gen_ai.evaluation.score.label
  scoreUnit?: string;           // -> gen_ai.evaluation.score.unit (custom extension)
  explanation?: string;         // -> gen_ai.evaluation.explanation
  evaluator?: string;           // -> gen_ai.evaluation.evaluator (custom extension)
  evaluatorType?: EvaluatorType; // -> gen_ai.evaluation.evaluator.type (custom extension)
  responseId?: string;          // -> gen_ai.response.id (correlation)
  traceId?: string;             // -> OTel trace context
  spanId?: string;              // -> OTel span context (parent)
  sessionId?: string;           // -> session correlation

  // Agent-as-Judge fields (Section 10.7)
  agentId?: string;             // -> gen_ai.agent.id
  agentName?: string;           // -> gen_ai.agent.name
  stepScores?: StepScore[];     // -> gen_ai.evaluation.step_scores (custom)
  toolVerifications?: ToolVerification[]; // -> gen_ai.evaluation.tool_verifications (custom)
  trajectoryLength?: number;    // -> gen_ai.evaluation.trajectory_length (custom)
}

The toolkit distinguishes between official OTel GenAI attributes (e.g., gen_ai.evaluation.name) and custom extensions (e.g., gen_ai.evaluation.evaluator.type). Custom extensions are defined in GENAI_EVALUATION_ATTRIBUTES and AGENT_JUDGE_ATTRIBUTES in src/backends/index.ts.

Agent Span Conventions

The GenAI agent span conventions extend the base GenAI spans with agent-specific operations:

Operation gen_ai.operation.name Span Name
Agent creation create_agent create_agent {gen_ai.agent.name}
Agent invocation invoke_agent invoke_agent {gen_ai.agent.name}
Tool call (defined in base GenAI spans) Tool-specific

GenAI Metrics

Standard OTel GenAI metrics relevant to evaluation explainability:

Metric Type Unit Description
gen_ai.client.token.usage Histogram {token} Token consumption per request
gen_ai.client.operation.duration Histogram s Duration of GenAI operations
gen_ai.server.request.duration Histogram s Server-side request processing time
gen_ai.server.time_per_output_token Histogram s Time per output token (time-to-first-token proxy)

Toolkit Query Capabilities

The toolkit’s obs_query_evaluations supports groupBy on: evaluationName, scoreLabel, evaluator. Note: evaluatorType is not currently a valid groupBy field. Grouping by evaluator type would require a future schema extension.

Gap Analysis: What OTel Does Not Yet Cover

Missing Convention Impact Workaround
gen_ai.evaluation.confidence No standard for judge confidence/uncertainty Use event body extension attributes
gen_ai.evaluation.criteria No standard for what criteria the evaluator used Encode in explanation or custom attributes
gen_ai.evaluation.metadata No standard for evaluation configuration (temperature, model, prompt version) Use resource/span attributes
Agent handoff evaluation events No standard for scoring handoff quality between agents Emit custom events parented to handoff spans
Turn-level relevancy events No standard for per-turn evaluation in multi-turn conversations Emit gen_ai.evaluation.result per turn span

3. LLM-as-Judge UX Patterns

Score Presentation

Research across all six platforms reveals three dominant patterns for presenting evaluation scores:

Pattern 1: Score Badge with Tooltip (Langfuse)

Compact inline display with progressive disclosure. The score appears as a colored badge on the trace/span. Hovering reveals the evaluator identity, score value, and a link to the full execution trace.

┌──────────────────────────────────────────────────────────┐
│  Score Badge Anatomy                                      │
│                                                           │
│  ┌────────────┐                                          │
│  │  0.92      │  <- Color-coded: green (>0.8),           │
│  │  relevance │     yellow (0.5-0.8), red (<0.5)         │
│  └─────┬──────┘                                          │
│        │ hover                                            │
│        ▼                                                  │
│  ┌────────────────────────────┐                          │
│  │ Score: 0.92                │                          │
│  │ Label: relevant            │                          │
│  │ Evaluator: gpt-4o-mini     │                          │
│  │ Type: llm                  │                          │
│  │ ─────────────────────      │                          │
│  │ [View explanation ->]      │                          │
│  │ [View judge trace ->]      │                          │
│  └────────────────────────────┘                          │
│                                                           │
└──────────────────────────────────────────────────────────┘

Pattern 2: Evaluation Column on Trace Table (Phoenix, LangSmith)

Scores appear as columns alongside trace data, enabling sorting and filtering. Explanations appear in a detail panel when a row is selected.

Pattern 3: Dedicated Evaluation Tab (Datadog, W&B Weave)

Separate view for evaluation results with aggregation controls, comparison tools, and drill-down to individual evaluations.

Chain-of-Thought Explanation Display

Best practices for presenting CoT explanations from LLM-as-Judge evaluations:

1. Produce reasoning before the score

The judge prompt should require the LLM to explain its reasoning step-by-step before emitting the final score/label. This improves alignment with human judgments and produces more auditable outputs.

// Recommended judge prompt structure (aligns with toolkit's G-Eval pattern)
const judgePrompt = `
Evaluate the following response for relevance to the user query.

User Query: ${input}
Response: ${output}

Think step by step:
1. Identify the key information requested in the query
2. Assess whether the response addresses each key point
3. Check for any irrelevant or off-topic content
4. Consider completeness of the answer

Provide your evaluation in JSON format:
{
  "reasoning": "<step-by-step analysis>",
  "score": <0.0 to 1.0>,
  "label": "relevant" | "partially_relevant" | "not_relevant"
}
`;

2. Display reasoning in collapsible sections

┌──────────────────────────────────────────────────────────┐
│  Evaluation Detail: relevance                             │
│  Score: 0.92  Label: relevant  Evaluator: gpt-4o-mini    │
│                                                           │
│  [v] Reasoning (click to expand)                          │
│  ┌────────────────────────────────────────────────────┐  │
│  │ 1. The query asks for the capital of France.       │  │
│  │ 2. The response directly states "Paris" which is   │  │
│  │    correct and addresses the core question.        │  │
│  │ 3. No irrelevant content detected.                 │  │
│  │ 4. The answer is complete for the given query.     │  │
│  └────────────────────────────────────────────────────┘  │
│                                                           │
│  [>] Judge Trace (click to expand)                        │
│  [>] Evaluation Config (click to expand)                  │
│                                                           │
└──────────────────────────────────────────────────────────┘

3. Binary labels outperform granular scales for reliability

Industry consensus (Evidently AI, Confident AI, Arize) is that binary evaluations (“pass”/”fail”, “relevant”/”not_relevant”) are more reliable and consistent for both LLM and human evaluators than numeric scales. When numeric scores are needed, combine binary labels with continuous scores:

Approach Reliability Use Case
Binary label High Pass/fail gating, regression testing
Binary + continuous score High Gating + trend analysis
1-5 Likert scale Medium Subjective quality assessment
0-100 continuous Low Avoid – high variance between evaluators

Confidence Indicators

No platform currently provides built-in confidence indicators for evaluation scores. This is a significant gap. Potential approaches:

Method Implementation Pros Cons
Logprob-based confidence Use token logprobs from the judge LLM (toolkit’s normalizeWithLogprobs()) Grounded in model uncertainty Requires logprob API support
Multi-judge agreement Run multiple judges and compute inter-rater agreement Robust confidence signal 2-5x cost increase
Score variance tracking Track historical variance for each metric Low-cost retrospective confidence No per-evaluation confidence
Self-reported confidence Ask the judge to report confidence alongside score Simple to implement Judge may not calibrate well

The toolkit’s existing panelEvaluation() (multi-judge panels) provides the infrastructure for multi-judge agreement as a confidence proxy.


4. Dashboard Metrics Explainability

Making Percentile Metrics Actionable

The toolkit’s computeDashboardSummary() already computes p50, p95, and p99 aggregations. The research identifies patterns for making these actionable in downstream UIs.

Percentile interpretation for LLM quality metrics:

Percentile Interpretation for Quality Scores Interpretation for Latency
p50 (median) Typical evaluation quality – what most responses score Typical evaluation speed
p95 Quality floor for 95% of responses – early warning Performance ceiling for most traffic
p99 Worst-case quality – outlier detection Tail latency – architectural bottlenecks

Three-tier alert strategy (aligns with toolkit’s existing thresholds):

┌──────────────────────────────────────────────────────────────┐
│  Alert Strategy for Quality Metrics                           │
│                                                               │
│  ┌─────────────┐                                             │
│  │   Primary    │  p50 warning/critical thresholds            │
│  │   SLO        │  (the toolkit's current approach)           │
│  └──────┬──────┘                                             │
│         │                                                     │
│  ┌──────▼──────┐                                             │
│  │  Divergence  │  "p99 < 0.5 * p50 for 15min" = alert      │
│  │  Detection   │  Detects bimodal score distributions        │
│  └──────┬──────┘                                             │
│         │                                                     │
│  ┌──────▼──────┐                                             │
│  │  Drill-down  │  Link alert -> specific low-scoring traces │
│  │  Linkage     │  -> evaluation explanations                │
│  └─────────────┘                                             │
│                                                               │
└──────────────────────────────────────────────────────────────┘

Key UX pattern: Alert-to-trace linkage

The industry gap is between “metric X is degraded” and “here is why.” Best-in-class dashboards link:

  1. Triggered alert -> metric detail view showing score distribution
  2. Metric detail -> filtered trace list (low-scoring evaluations)
  3. Individual trace -> evaluation explanation (judge’s reasoning)

This three-level drill-down transforms a p50 threshold breach from “relevance dropped” to “relevance dropped because the retriever returned stale context for queries about product pricing changes.”

Dashboard design principles (from CHI 2025 research):

The CHI 2025 paper “Design Principles and Guidelines for LLM Observability: Insights from Developers” (CHI ‘25 Extended Abstracts, ACM, DOI: 10.1145/3706599.3719914) identifies four developer-centric design principles:

Principle Description Application to Quality Dashboards
Design for Awareness Surface changes and anomalies proactively Highlight metric regressions, new alert triggers
Design for Monitoring Enable continuous tracking of key signals Time-series views of p50/p95/p99 quality scores
Design for Intervention Support direct action from observed data Link alerts to traces, provide remediation context
Design for Operability Make system behavior understandable in production Show evaluation pipeline health, judge uptime

Recommended panel layout (fewer than 12 panels per view):

┌────────────────────────────────────────────────────────────┐
│  Quality Dashboard - Top Level                              │
│                                                             │
│  Row 1: Status Overview                                     │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐     │
│  │ Overall  │ │ Active   │ │ Judge    │ │ Eval     │     │
│  │ Health   │ │ Alerts   │ │ Uptime   │ │ Volume   │     │
│  │ [green]  │ │ 2 warn   │ │ 99.8%    │ │ 1.2k/hr  │     │
│  └──────────┘ └──────────┘ └──────────┘ └──────────┘     │
│                                                             │
│  Row 2: Quality Score Time Series                           │
│  ┌────────────────────────────────────────────────────┐    │
│  │  relevance (p50)  ──────────────────────           │    │
│  │  faithfulness (p50) ──  ──  ──  ──  ──             │    │
│  │  coherence (p50)   ─ ─ ─ ─ ─ ─ ─ ─ ─              │    │
│  │  [warning threshold] ........................       │    │
│  │  [critical threshold] ........................      │    │
│  └────────────────────────────────────────────────────┘    │
│                                                             │
│  Row 3: Metric Cards (expandable)                           │
│  ┌────────────────┐ ┌────────────────┐ ┌────────────────┐  │
│  │ Relevance      │ │ Faithfulness   │ │ Hallucination  │  │
│  │ p50: 0.85      │ │ p50: 0.91      │ │ avg: 0.04      │  │
│  │ p95: 0.72      │ │ p95: 0.78      │ │ p95: 0.12      │  │
│  │ [healthy]      │ │ [healthy]      │ │ [warning]      │  │
│  │ [drill ->]     │ │ [drill ->]     │ │ [drill ->]     │  │
│  └────────────────┘ └────────────────┘ └────────────────┘  │
│                                                             │
└────────────────────────────────────────────────────────────┘

5. Multi-Agent Evaluation Explainability

Evaluation Dimensions

Multi-agent systems introduce evaluation complexities not present in single-LLM applications. Microsoft’s Multi-Agent Reference Architecture and Confident AI’s agent evaluation framework identify these key dimensions:

Dimension What It Measures Explainability Requirement
Trajectory correctness Did the agent take the right steps? Show expected vs. actual step sequence
Handoff quality Did agents transfer to the correct sub-agent? Show handoff decision reasoning
Tool selection accuracy Did agents choose appropriate tools? Show tool selection context
Argument correctness Were tool call arguments valid and safe? Show expected vs. actual arguments, safety checks
Task completion Was the overall goal achieved? Show completion criteria evaluation
Per-turn relevancy Was each turn relevant to the conversation goal? Show turn-level scores with explanations
Conversation completeness Were all aspects of the user’s request addressed? Show coverage analysis
Error propagation Did errors in one agent cascade? Show error chain across agent spans

Per-Turn Evaluation Pattern

For multi-turn agent conversations, evaluation should happen at both the turn level and the conversation level:

┌──────────────────────────────────────────────────────────────┐
│  Multi-Turn Agent Evaluation                                  │
│                                                               │
│  Session: session-abc123                                      │
│  Overall Score: 0.82 (task_completion)                        │
│                                                               │
│  Turn 1: User -> Agent-A (Router)                             │
│  ├── Input: "Help me debug my deployment"                     │
│  ├── Action: Route to Agent-B (DevOps)                        │
│  ├── Handoff Score: 0.95 [correct routing]                    │
│  └── Explanation: "Query contains deployment keywords,        │
│       correctly routed to DevOps specialist"                  │
│                                                               │
│  Turn 2: Agent-B (DevOps) -> Tool Call                        │
│  ├── Tool: kubectl_get_pods                                   │
│  ├── Tool Selection Score: 1.0 [correct]                      │
│  └── Explanation: "Appropriate first diagnostic step"         │
│                                                               │
│  Turn 3: Agent-B -> User                                      │
│  ├── Output: "Your pod is in CrashLoopBackOff..."             │
│  ├── Relevancy Score: 0.88                                    │
│  └── Explanation: "Addresses the deployment issue directly,   │
│       but missing suggested remediation steps"                │
│                                                               │
│  Turn 4: Agent-B -> Agent-C (Handoff)                         │
│  ├── Action: Escalate to Agent-C (Senior DevOps)              │
│  ├── Handoff Score: 0.70 [warning]                            │
│  └── Explanation: "Escalation premature -- Agent-B had        │
│       sufficient context to suggest restart/log check"        │
│                                                               │
│  Conversation Completeness: 0.75                              │
│  └── Explanation: "User's issue was identified but not        │
│       resolved. Missing: remediation steps, verification"     │
│                                                               │
└──────────────────────────────────────────────────────────────┘

Mapping to OTel Spans

Each turn in a multi-agent evaluation maps to OTel span hierarchy:

┌──────────────────────────────────────────────────────────┐
│  OTel Span Hierarchy for Multi-Agent Evaluation           │
│                                                           │
│  [root] invoke_agent Router (session-abc123)              │
│  ├── [child] invoke_agent DevOps                          │
│  │   ├── [child] tool_call kubectl_get_pods               │
│  │   │   └── [event] gen_ai.evaluation.result             │
│  │   │       { gen_ai.evaluation.name: "tool_correctness",  │
│  │   │         gen_ai.evaluation.score.value: 1.0,        │
│  │   │         gen_ai.evaluation.explanation: "..." }     │
│  │   │                                                    │
│  │   ├── [event] gen_ai.evaluation.result                 │
│  │   │   { gen_ai.evaluation.name: "relevancy",             │
│  │   │     gen_ai.evaluation.score.value: 0.88,           │
│  │   │     gen_ai.evaluation.explanation: "..." }         │
│  │   │                                                    │
│  │   └── [child] handoff -> SeniorDevOps                  │
│  │       └── [event] gen_ai.evaluation.result             │
│  │           { gen_ai.evaluation.name: "handoff_correctness",│
│  │             gen_ai.evaluation.score.value: 0.70,       │
│  │             gen_ai.evaluation.score.label:              │
│  │               "premature_escalation",                  │
│  │             gen_ai.evaluation.explanation: "..." }     │
│  │                                                        │
│  ├── [child] invoke_agent SeniorDevOps                    │
│  │   └── ...                                              │
│  │                                                        │
│  └── [event] gen_ai.evaluation.result                     │
│      { gen_ai.evaluation.name: "task_completion",           │
│        gen_ai.evaluation.score.value: 0.82,               │
│        gen_ai.evaluation.explanation:                      │
│          "Issue identified, not resolved" }               │
│                                                           │
└──────────────────────────────────────────────────────────┘

Agent Evaluation Metrics from DeepEval / Confident AI

Confident AI introduced agent-specific metrics in 2025:

Metric What It Evaluates Explanation Output
PlanQualityMetric Whether the plan is logical, complete, and efficient Step-by-step plan analysis
ToolCorrectnessMetric Whether the agent selected and used tools correctly Per-tool selection reasoning
Task Completion Whether the overall task was accomplished Criteria-based completion check
Agent Handoff Quality Whether handoffs went to the right sub-agent Routing decision analysis
MCPUseMetric Whether MCP tools were used correctly in single-turn Per-tool selection and argument reasoning
MCPTaskCompletionMetric Whether MCP tool usage achieved the task goal End-to-end MCP task analysis
MultiTurnMCPUseMetric MCP tool correctness across multi-turn conversations Turn-level MCP tool evaluation

Note: The MCP-specific metrics (MCPUseMetric, MCPTaskCompletionMetric, MultiTurnMCPUseMetric) are particularly relevant to observability-toolkit as it is itself an MCP server. These metrics evaluate whether MCP tools are invoked with correct arguments and in the right sequence.


6. Regulatory Frameworks

EU AI Act: Transparency and Explainability

The EU AI Act (Regulation 2024/1689) establishes binding transparency and explainability requirements with a phased implementation timeline.

Key timeline:

Date Requirement Relevance to Evaluation Explainability
Feb 2025 Prohibited AI practices apply Subliminal manipulation, social scoring banned
Aug 2025 GPAI obligations (Articles 53, 55) Model evaluation, adversarial testing, documentation
Aug 2026 High-risk AI transparency (Articles 13, 50) Full transparency obligations including explainability

Article 13: Transparency and Provision of Information to Deployers

High-risk AI systems must be designed to be transparent enough that deployers can:

  • Understand and use them correctly
  • Access information about the provider, capabilities, and limitations
  • Interpret the system’s output correctly
  • Be aware of potential risks

Article 50: Transparency Obligations

  • AI systems interacting with humans must inform users they are interacting with AI
  • AI-generated content must be identifiable (especially deep fakes and public information content)
  • Clear and visible labeling of AI-generated outputs

Penalties: Up to 35 million EUR or 7% of global annual turnover.

Implications for observability-toolkit:

The toolkit’s existing obs_query_verifications (EU AI Act human verification tracking) addresses the human oversight requirement. Evaluation explainability strengthens compliance by providing:

  1. Audit trail of evaluation decisions (what was evaluated, by what judge, what score, why)
  2. Traceability from output to evaluation to explanation
  3. Documentation of evaluation methodology (criteria, prompts, models used)

NIST AI Risk Management Framework (AI RMF 1.0)

NIST AI 100-1 provides voluntary, risk-based guidance structured around four core functions.

Explainability within the four functions:

Function Explainability Role Observability-Toolkit Alignment
GOVERN Establish explainability policies and accountability structures Configuration: evaluation criteria, judge selection, threshold policies
MAP Identify where explainability is needed based on risk context Map evaluation coverage across metrics and agent types
MEASURE Continuously evaluate how explainable the AI system is obs_query_evaluations with explanation analysis
MANAGE Monitor explainability in production and improve over time Dashboard alerts, explanation quality tracking

NIST distinction between transparency, explainability, and interpretability:

Concept Answers Example in Evaluation Context
Transparency “What happened?” The evaluation was run, score was 0.72
Explainability “How was the decision made?” The judge compared the response to criteria X, Y, Z
Interpretability “Why was this decision made?” The score is low because the response omitted pricing context

NIST 2025 updates:

  • Explainability scoring is now emphasized as a measurable trustworthiness characteristic
  • Risk assessments must include AI-specific vulnerabilities including bias, explainability, and model vulnerabilities
  • NIST AI 600-1 (GenAI Profile) requires model evaluation using standardized protocols, adversarial testing, and systemic risk tracking

Regulatory Mapping to OTel Evaluation Events

┌───────────────────────────────────────────────────────────┐
│  Regulatory Requirement -> OTel Evaluation Event Mapping   │
│                                                            │
│  EU AI Act Article 13 (Transparency)                       │
│  ├── "Understand AI output" -> explanation field           │
│  ├── "Capabilities and limitations" ->                     │
│  │    gen_ai.evaluation.name + score.label                 │
│  └── "Potential risks" -> alert thresholds + status        │
│                                                            │
│  NIST MEASURE Function                                     │
│  ├── "Evaluate explainability" -> explanation quality       │
│  │    scoring (meta-evaluation)                            │
│  ├── "Metrics and benchmarks" -> p50/p95/p99 aggregations  │
│  └── "Impact assessments" -> evaluation trend analysis     │
│                                                            │
│  Both Frameworks: Audit Trail                              │
│  ├── gen_ai.evaluation.result events with timestamps       │
│  ├── Trace/span context for full request lineage           │
│  ├── evaluator identity and type                           │
│  └── Historical evaluation data via JSONL backend          │
│                                                            │
└───────────────────────────────────────────────────────────┘

7. Recommendations for observability-toolkit

Prioritized recommendations based on research findings, ordered by impact and feasibility.

Priority 1: Explanation Quality and Display (High Impact, Low Effort)

R1.1: Add explanation to QualityMetricResult

Currently, QualityMetricResult captures aggregated scores but not representative explanations. Add a field for the lowest-scoring evaluation’s explanation to support alert-to-explanation drill-down.

interface QualityMetricResult {
  // ... existing fields ...

  // NEW: Representative explanation from lowest-scoring evaluation
  worstExplanation?: {
    scoreValue: number;
    explanation: string;
    traceId?: string;
    timestamp: string;
  };
}

R1.2: Standardize explanation format in G-Eval and QAG prompts

Ensure all judge prompts produce structured explanations with:

  • Step-by-step reasoning (before the score)
  • Criteria-specific findings
  • Concrete evidence from the evaluated output

This is already partially implemented in the toolkit’s G-Eval pattern (buildEvalPrompt()) but should be formalized across all evaluation paths.

Priority 2: Evaluation Traceability (High Impact, Medium Effort)

R2.1: Emit OTel evaluation events for all judge executions

When the toolkit runs LLM-as-Judge evaluations via gEval(), qagEvaluate(), or panelEvaluation(), emit gen_ai.evaluation.result OTel events parented to the span being evaluated. This enables downstream platforms (Langfuse, Datadog, Phoenix) to display evaluation results natively within their trace views.

// Conceptual: emit evaluation result as OTel event
// OTel JS SDK Span.addEvent(name, attributes?, startTime?) — all fields go in attributes
function emitEvaluationEvent(
  span: Span,
  result: EvaluationResult
): void {
  span.addEvent('gen_ai.evaluation.result', {
    'gen_ai.evaluation.name': result.evaluationName,
    'gen_ai.evaluation.score.value': result.scoreValue,
    'gen_ai.evaluation.score.label': result.scoreLabel,
    'gen_ai.evaluation.explanation': result.explanation,
    'gen_ai.response.id': result.responseId,
  });
}

R2.2: Surface judge execution metadata from EvaluationEvent into EvaluationResult

The toolkit’s EvaluationEvent type in src/lib/llm-as-judge.ts (lines 247-282) already captures operational metadata: durationMs, inputTokens, outputTokens, retryCount, judgeModel, judgeTemperature, and samplingReason. However, this type is separate from EvaluationResult in src/backends/index.ts.

The recommended approach is to bridge these types rather than duplicating fields:

// Option A: Add optional judgeConfig to EvaluationResult (new fields)
interface EvaluationResult {
  // ... existing fields ...

  // NEW: Surfaced from EvaluationEvent for audit/debugging
  judgeConfig?: {
    model: string;            // from EvaluationEvent.judgeModel
    temperature: number;      // from EvaluationEvent.judgeTemperature
    promptVersion?: string;   // NEW: "relevance-v2.3"
    durationMs?: number;      // from EvaluationEvent.durationMs
    tokenUsage?: {
      input: number;          // from EvaluationEvent.inputTokens
      output: number;         // from EvaluationEvent.outputTokens
    };
  };
}

// Option B: Unify EvaluationEvent and EvaluationResult
// type EvaluationResult = EvaluationEvent & AgentJudgeFields;

Priority 3: Confidence Indicators (Medium Impact, Medium Effort)

R3.1: Expose logprob-based confidence from G-Eval

The toolkit’s normalizeWithLogprobs() already computes probability-weighted scores. Surface the confidence distribution as part of the evaluation result:

interface EvaluationResult {
  // ... existing fields ...

  // NEW: Confidence from logprob distribution
  confidence?: number;        // 0.0-1.0, derived from logprob entropy
  confidenceMethod?: 'logprobs' | 'multi_judge' | 'historical';
}

R3.2: Compute multi-judge agreement as confidence

When panelEvaluation() runs multiple judges, compute inter-rater agreement (e.g., Krippendorff’s alpha or simple percent agreement) and surface it as a confidence indicator.

Priority 4: Dashboard Explainability Enhancements (Medium Impact, Low Effort)

R4.1: Add divergence detection to alert system

Complement existing threshold alerts with ratio alerts that detect bimodal score distributions:

This would require extending the existing AlertThreshold interface (which has aggregation, value, direction, severity, message) with a new DivergenceAlertThreshold type:

// Requires new interface — not part of current AlertThreshold
interface DivergenceAlertThreshold {
  type: 'divergence';
  aggregationA: EvaluationAggregation;  // e.g., 'p99'
  aggregationB: EvaluationAggregation;  // e.g., 'p50'
  ratio: number;                         // 0.5 = "A < 50% of B"
  severity: AlertSeverity;
  message: string;                       // template with {p99} and {p50}
}

// Example usage:
const divergenceAlert: DivergenceAlertThreshold = {
  type: 'divergence',
  aggregationA: 'p99',
  aggregationB: 'p50',
  ratio: 0.5,                           // Fire when p99 < 0.5 * p50
  severity: 'warning',
  message: 'Score distribution diverging: p99 ({p99}) < 50% of p50 ({p50})',
};

R4.2: Include sample count context in alerts

Alert messages should include the sample count to help users assess statistical significance:

// Current: "Relevance p50 (0.4500) critically low"
// Improved: "Relevance p50 (0.4500) critically low (n=47 evaluations, last 1h)"

Priority 5: Multi-Agent Explainability (High Impact, High Effort)

R5.1: Add turn-level evaluation support to agent-as-judge

The toolkit’s agent-as-judge module should support emitting gen_ai.evaluation.result events at each turn/step of a multi-agent conversation, not just at the conversation level.

R5.2: Add handoff correctness metric to built-in QUALITY_METRICS

The toolkit’s agent-eval-metrics.ts already emits evaluations with evaluationName: 'handoff_correctness'. Add a matching built-in dashboard metric to surface these in computeDashboardSummary():

const handoffCorrectness = createMetricConfig('handoff_correctness')
  .displayName('Agent Handoff Correctness')
  .description('Measures whether agents correctly transfer to appropriate sub-agents')
  .aggregations('avg', 'p50', 'p95', 'count')
  .range(0, 1)
  .unit('score')
  .alertBelow('avg', 0.85, 'warning', 'Handoff correctness ({value}) below 85% target')
  .alertBelow('avg', 0.70, 'critical', 'Handoff correctness ({value}) critically low')
  .build();

Priority 6: Regulatory Compliance Enhancements (Medium Impact, Low Effort)

R6.1: Add evaluation provenance fields

For EU AI Act Article 13 compliance (August 2026 deadline), ensure all evaluation results include sufficient provenance for audit:

  • Evaluator identity (model name, version)
  • Evaluation criteria used
  • Timestamp of evaluation
  • Link to the evaluated output (trace/span ID)

The toolkit’s existing EvaluationResult covers most of these. The gap is evaluation criteria documentation.

R6.2: Add explanation quality meta-evaluation

For NIST MEASURE function compliance, add the capability to evaluate the quality of explanations themselves:

// Meta-evaluation: score the explanation quality
const explanationQuality = createMetricConfig('explanation_quality')
  .displayName('Explanation Quality')
  .description('Meta-evaluation of judge explanation clarity and completeness')
  .aggregations('avg', 'p50', 'count')
  .range(0, 1)
  .unit('score')
  .alertBelow('avg', 0.7, 'warning', 'Explanation quality ({value}) below target')
  .build();

Implementation Roadmap

Priority Recommendation Effort Target
P1 Explanation in QualityMetricResult Low v2.1
P1 Standardize explanation format Low v2.1
P2 OTel evaluation event emission Medium v2.1
P2 Judge execution metadata Medium v2.1
P3 Logprob confidence exposure Medium v2.2
P3 Multi-judge agreement confidence Medium v2.2
P4 Divergence detection alerts Low v2.1
P4 Sample count in alert messages Low v2.1
P5 Turn-level agent evaluation High v2.2
P5 Handoff correctness metric High v2.2
P6 Evaluation provenance fields Low v2.1
P6 Explanation quality meta-eval Low v2.2

8. Sources

OpenTelemetry

Platform Documentation

Multi-Agent Evaluation

LLM-as-Judge Best Practices

Dashboard and Metrics

Regulatory and Research

Industry Overviews

Note: Vendor blog posts may move or be removed. Access dates provided for reference.