Technical White Paper v1.0

Abstract

As Large Language Model (LLM) applications transition from experimental deployments to production-critical systems, the need for standardized observability practices has become paramount. This paper examines the current state of LLM observability standards, with particular focus on OpenTelemetry’s emerging GenAI semantic conventions, agent tracking methodologies, and quality measurement frameworks. We evaluate the observability-toolkit MCP server against these industry standards, identifying alignment areas and gaps. This document serves as an index to deeper technical analyses across five key domains: semantic conventions, agent observability, quality metrics, performance optimization, and tooling ecosystem.

Keywords: LLM observability, OpenTelemetry, GenAI semantic conventions, agent tracking, AI quality metrics, distributed tracing


1. Introduction

1.1 Problem Statement

The rapid adoption of LLM-based applications has outpaced the development of observability tooling, creating a fragmented landscape where teams rely on vendor-specific instrumentation, proprietary formats, and ad-hoc monitoring solutions. This fragmentation leads to:

  • Vendor lock-in through non-standard telemetry formats
  • Incomplete visibility into multi-step agent workflows
  • Inability to compare performance across providers and models
  • Quality blind spots where systems appear operational but produce low-quality outputs

1.2 Scope

This paper focuses on three primary areas:

  1. Standardization: OpenTelemetry GenAI semantic conventions (v1.39.0)
  2. Agent Tracking: Multi-turn, tool-use, and reasoning chain observability
  3. Quality Measurement: Production evaluation metrics beyond latency and throughput

1.3 Methodology

Research was conducted through:

  • Analysis of OpenTelemetry specification documents and GitHub discussions
  • Review of industry tooling (Langfuse, Arize Phoenix, Datadog LLM Observability)
  • Examination of academic literature on hallucination detection and LLM evaluation
  • Comparative analysis against the observability-toolkit MCP server implementation

2. Background: The Evolution of LLM Observability

2.1 Traditional ML Observability vs. LLM Observability

Traditional machine learning observability focused on:

  • Model accuracy metrics (precision, recall, F1)
  • Feature drift detection
  • Inference latency and throughput
  • Resource utilization

LLM applications introduce fundamentally different observability challenges:

Dimension | Traditional ML | LLM Applications
Input Nature | Structured features | Unstructured natural language
Output Nature | Discrete classes/values | Free-form generated text
Evaluation | Ground truth comparison | Subjective quality assessment
Cost Model | Compute-based | Token-based pricing
Failure Modes | Classification errors | Hallucinations, toxicity, irrelevance
Execution Pattern | Single inference | Multi-turn, tool-augmented chains

2.2 The Three Pillars Extended

The traditional observability pillars (metrics, traces, logs) require extension for LLM systems:

┌─────────────────────────────────────────────────────────────────┐
│                    LLM Observability Pillars                     │
├─────────────────┬─────────────────┬─────────────────────────────┤
│     TRACES      │     METRICS     │           LOGS              │
├─────────────────┼─────────────────┼─────────────────────────────┤
│ • Prompt chains │ • Token usage   │ • Prompt/completion content │
│ • Tool calls    │ • Latency (TTFT)│ • Error details             │
│ • Agent loops   │ • Cost per req  │ • Reasoning chains          │
│ • Retrieval     │ • Quality scores│ • Human feedback            │
└─────────────────┴─────────────────┴─────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                    EVALUATION LAYER (NEW)                        │
├─────────────────────────────────────────────────────────────────┤
│ • Hallucination detection    • Answer relevancy                 │
│ • Factual accuracy           • Task completion                  │
│ • Tool correctness           • Safety/toxicity                  │
└─────────────────────────────────────────────────────────────────┘

2.3 Key Industry Developments (2024-2026)

Date | Development | Impact
Apr 2024 | OTel GenAI SIG formation | Standardization effort begins
Jun 2024 | GenAI semantic conventions draft | Initial attribute definitions
Oct 2024 | Langfuse OTel support | Open-source adoption
Dec 2024 | Datadog native OTel GenAI support | Enterprise validation
Jan 2025 | OTel v1.37+ GenAI conventions | Production-ready standards
Mar 2025 | Agent framework conventions proposed | Multi-agent standardization
Dec 2025 | OTel v1.39 GenAI conventions | Agent/tool span semantics

3. OpenTelemetry GenAI Semantic Conventions

Deep Dive Reference: See Appendix A: OTel GenAI Attribute Reference

3.1 Overview

The OpenTelemetry GenAI semantic conventions (currently in Development status) establish a standardized schema for:

  • Spans: LLM inference calls, tool executions, agent invocations
  • Metrics: Token usage histograms, operation duration, latency breakdowns
  • Events: Input/output messages, system instructions, tool definitions
  • Attributes: Model parameters, provider metadata, conversation context

3.2 Core Span Attributes

3.2.1 Required Attributes

Attribute | Type | Description | Example
gen_ai.operation.name | string | Operation type | chat, invoke_agent, execute_tool
gen_ai.provider.name | string | Provider identifier | anthropic, openai, aws.bedrock

3.2.2 Conditionally Required Attributes

Attribute | Condition | Type | Example
gen_ai.request.model | If available | string | claude-3-opus-20240229
gen_ai.conversation.id | When available | string | conv_5j66UpCpwteGg4YSxUnt7lPY
error.type | If error occurred | string | timeout, rate_limit

3.2.3 Recommended Attributes

Attribute | Type | Description
gen_ai.request.temperature | double | Sampling temperature
gen_ai.request.max_tokens | int | Maximum output tokens
gen_ai.request.top_p | double | Nucleus sampling parameter
gen_ai.response.model | string | Actual model that responded
gen_ai.response.finish_reasons | string[] | Why generation stopped
gen_ai.usage.input_tokens | int | Prompt token count
gen_ai.usage.output_tokens | int | Completion token count
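
To make these attributes concrete, the following minimal sketch instruments a chat completion with the OpenTelemetry Python API and the Anthropic Python SDK. The traced_chat wrapper is an illustrative assumption, not part of any official instrumentation package; the attribute names follow the tables above.

from opentelemetry import trace

tracer = trace.get_tracer("llm-observability-example")

def traced_chat(client, model, messages, temperature=0.2, max_tokens=1024):
    # Span name convention: "{operation} {model}"
    with tracer.start_as_current_span(f"chat {model}") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.provider.name", "anthropic")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.request.temperature", temperature)
        span.set_attribute("gen_ai.request.max_tokens", max_tokens)
        response = client.messages.create(
            model=model, max_tokens=max_tokens,
            temperature=temperature, messages=messages,
        )
        span.set_attribute("gen_ai.response.model", response.model)
        span.set_attribute("gen_ai.response.finish_reasons", [response.stop_reason])
        span.set_attribute("gen_ai.usage.input_tokens", response.usage.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", response.usage.output_tokens)
        return response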

3.3 Operation Types

The specification defines seven standard operation names:

gen_ai.operation.name:
├── chat                 # Chat completion (most common)
├── text_completion      # Legacy completion API
├── generate_content     # Multimodal generation
├── embeddings           # Vector embeddings
├── create_agent         # Agent instantiation
├── invoke_agent         # Agent execution
└── execute_tool         # Tool/function execution

3.4 Provider Identifiers

Standardized gen_ai.provider.name values:

Provider | Value | Notes
Anthropic | anthropic | Claude models
OpenAI | openai | GPT models
AWS Bedrock | aws.bedrock | Multi-model
Azure OpenAI | azure.ai.openai | Azure-hosted
Google Gemini | gcp.gemini | AI Studio API
Google Vertex AI | gcp.vertex_ai | Enterprise API
Cohere | cohere |
Mistral AI | mistral_ai |

3.5 Standard Metrics

3.5.1 Client Metrics

Metric | Type | Unit | Buckets
gen_ai.client.token.usage | Histogram | {token} | [1, 4, 16, 64, 256, 1024, 4096, 16384, 65536, …]
gen_ai.client.operation.duration | Histogram | s | [0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, …]
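
A minimal sketch of emitting the token usage histogram with the OpenTelemetry Python metrics API is shown below. The bucket boundaries above are applied through an SDK View at pipeline configuration time rather than at the call site, and the attribute values here are illustrative.

from opentelemetry import metrics

meter = metrics.get_meter("llm-observability-example")

token_usage = meter.create_histogram(
    name="gen_ai.client.token.usage",
    unit="{token}",
    description="Number of input and output tokens used per operation",
)

# One data point per token type, dimensioned by operation and provider.
token_usage.record(812, attributes={
    "gen_ai.operation.name": "chat",
    "gen_ai.provider.name": "anthropic",
    "gen_ai.token.type": "input",
})
token_usage.record(214, attributes={
    "gen_ai.operation.name": "chat",
    "gen_ai.provider.name": "anthropic",
    "gen_ai.token.type": "output",
})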

3.5.2 Server Metrics (for model hosting)

Metric | Type | Unit | Purpose
gen_ai.server.request.duration | Histogram | s | Total request time
gen_ai.server.time_to_first_token | Histogram | s | Prefill + queue latency
gen_ai.server.time_per_output_token | Histogram | s | Decode phase performance
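
Server-side time-to-first-token is normally measured inside the model host, but a client-observed approximation can be captured by timing the first streamed chunk. The sketch below assumes the Anthropic Python SDK's streaming helper and a histogram created as in the Section 3.5.1 sketch; it is illustrative only.

import time

def stream_with_ttft(client, ttft_histogram, model, messages, max_tokens=1024):
    start = time.monotonic()
    first_token_at = None
    chunks = []
    with client.messages.stream(model=model, max_tokens=max_tokens,
                                messages=messages) as stream:
        for text in stream.text_stream:
            if first_token_at is None:
                first_token_at = time.monotonic()
                # Client-observed time to first token, in seconds.
                ttft_histogram.record(first_token_at - start,
                                      attributes={"gen_ai.request.model": model})
            chunks.append(text)
    return "".join(chunks)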

3.6 Content Handling

The specification addresses sensitive content through three approaches:

  1. Default: Do not capture prompts/completions
  2. Opt-in attributes: Record on spans (gen_ai.input.messages, gen_ai.output.messages)
  3. External storage: Upload to secure storage, record references

Recommended for production:
┌─────────────────────────────────────────────────────────┐
│  Span: gen_ai.operation.name = "chat"                   │
│  ├── gen_ai.input.messages.uri = "s3://bucket/msg/123"  │
│  └── gen_ai.output.messages.uri = "s3://bucket/msg/124" │
└─────────────────────────────────────────────────────────┘
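
A rough sketch of the third approach, assuming S3 via boto3 and the reference-style URI attributes shown in the diagram above; the bucket name and key layout are illustrative.

import json
import uuid
import boto3

s3 = boto3.client("s3")
BUCKET = "llm-telemetry-content"  # illustrative bucket name

def store_messages(messages):
    # Upload the full message payload to secure storage and return a reference URI.
    key = f"msg/{uuid.uuid4()}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(messages).encode("utf-8"))
    return f"s3://{BUCKET}/{key}"

def record_content_refs(span, input_messages, output_messages):
    # The span carries only references; the content itself never enters the telemetry pipeline.
    span.set_attribute("gen_ai.input.messages.uri", store_messages(input_messages))
    span.set_attribute("gen_ai.output.messages.uri", store_messages(output_messages))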

4. Agent Observability Standards

Deep Dive Reference: See Appendix B: Agent Span Hierarchies

4.1 The Agent Observability Challenge

AI agents introduce observability complexity through:

  • Non-deterministic execution: Same input may produce different tool call sequences
  • Multi-turn reasoning: Extended context across many LLM calls
  • Tool orchestration: External system interactions within agent loops
  • Framework diversity: LangGraph, CrewAI, AutoGen, etc. have different patterns

4.2 Agent Application vs. Framework Distinction

The OpenTelemetry specification distinguishes:

Concept | Definition | Examples
Agent Application | Specific AI-driven entity | Customer support bot, coding assistant
Agent Framework | Infrastructure for building agents | LangGraph, CrewAI, Claude Code

4.3 Agent Span Semantics

4.3.1 Agent Creation Span

Span: create_agent {agent_name}
├── gen_ai.operation.name: "create_agent"
├── gen_ai.agent.id: "agent_abc123"
├── gen_ai.agent.name: "CustomerSupportAgent"
└── gen_ai.agent.description: "Handles tier-1 support queries"

4.3.2 Agent Invocation Span

Span: invoke_agent {agent_name}
├── gen_ai.operation.name: "invoke_agent"
├── gen_ai.agent.id: "agent_abc123"
├── gen_ai.agent.name: "CustomerSupportAgent"
└── gen_ai.conversation.id: "conv_xyz789"
    │
    ├── Child Span: chat claude-3-opus
    │   └── gen_ai.operation.name: "chat"
    │
    ├── Child Span: execute_tool get_customer_info
    │   ├── gen_ai.tool.name: "get_customer_info"
    │   ├── gen_ai.tool.type: "function"
    │   └── gen_ai.tool.call.id: "call_abc"
    │
    └── Child Span: chat claude-3-opus
        └── gen_ai.operation.name: "chat"
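
The hierarchy above maps directly onto nested spans. A minimal sketch with the OpenTelemetry Python API, reusing the traced_chat helper sketched in Section 3.2; the agent, tool, and helper names are illustrative.

def handle_request(client, tracer, conversation_id, user_message):
    with tracer.start_as_current_span("invoke_agent CustomerSupportAgent") as agent_span:
        agent_span.set_attribute("gen_ai.operation.name", "invoke_agent")
        agent_span.set_attribute("gen_ai.agent.id", "agent_abc123")
        agent_span.set_attribute("gen_ai.agent.name", "CustomerSupportAgent")
        agent_span.set_attribute("gen_ai.conversation.id", conversation_id)

        # First LLM call decides whether a tool is needed (child "chat" span).
        first_response = traced_chat(client, "claude-3-opus-20240229",
                                     [{"role": "user", "content": user_message}])

        for tool_call in extract_tool_calls(first_response):  # hypothetical helper
            with tracer.start_as_current_span(f"execute_tool {tool_call.name}") as tool_span:
                tool_span.set_attribute("gen_ai.operation.name", "execute_tool")
                tool_span.set_attribute("gen_ai.tool.name", tool_call.name)
                tool_span.set_attribute("gen_ai.tool.type", "function")
                tool_span.set_attribute("gen_ai.tool.call.id", tool_call.id)
                run_tool(tool_call)  # hypothetical helper

        # Final LLM call produces the answer, still inside the agent span.
        return traced_chat(client, "claude-3-opus-20240229",
                           [{"role": "user", "content": user_message}])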

4.4 Tool Execution Attributes

Attribute | Type | Description
gen_ai.tool.name | string | Tool identifier
gen_ai.tool.type | string | function, extension, datastore
gen_ai.tool.description | string | Human-readable description
gen_ai.tool.call.id | string | Unique call identifier
gen_ai.tool.call.arguments | any | Input parameters (opt-in, sensitive)
gen_ai.tool.call.result | any | Output (opt-in, sensitive)

4.5 Framework Instrumentation Approaches

Approach | Pros | Cons | Examples
Baked-in | Zero config, consistent | Bloat, version lag | CrewAI
External OTel | Decoupled, community-maintained | Integration complexity | OpenLLMetry
OTel Contrib | Official support, best practices | Review queue delays | instrumentation-genai

4.6 Claude Code as Agent System

Claude Code exhibits agent characteristics:

  • Multi-turn conversation management
  • Tool execution (Bash, Read, Write, Edit, etc.)
  • Reasoning chains across tool calls
  • Session-based context

Current gap: Claude Code telemetry doesn’t emit standardized agent spans.


5. Quality and Evaluation Metrics

Deep Dive Reference: See Appendix C: LLM Evaluation Frameworks

5.1 The Quality Visibility Problem

Traditional observability answers: “Is the system up and performing?”

LLM observability must also answer: “Is the system producing good outputs?”

System Status Matrix:
                    │ Quality: Good    │ Quality: Bad
────────────────────┼──────────────────┼──────────────────
Performance: Good   │ Healthy          │ INVISIBLE FAILURE
Performance: Bad    │ Investigate      │ Obvious failure

The “invisible failure” quadrant is uniquely dangerous for LLM systems.

5.2 Core Quality Metrics

Metric | Description | Measurement Method
Answer Relevancy | Output addresses input intent | LLM-as-judge, embedding similarity
Faithfulness | Output grounded in provided context | LLM-as-judge, NLI models
Hallucination | Fabricated or false information | LLM-as-judge, fact verification
Task Completion | Agent accomplished stated goal | Rule-based + LLM assessment
Tool Correctness | Correct tools called with valid args | Deterministic validation
Toxicity/Safety | Output meets safety guidelines | Classifier models, guardrails

5.3 LLM-as-Judge Pattern

The dominant approach for quality evaluation:

┌─────────────────────────────────────────────────────────────┐
│                    LLM-as-Judge Pipeline                     │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Production LLM Call                                         │
│  ┌──────────┐    ┌───────────┐    ┌──────────┐             │
│  │  Input   │───▶│  Model A  │───▶│  Output  │             │
│  └──────────┘    └───────────┘    └──────────┘             │
│       │                                │                     │
│       │         Evaluation LLM         │                     │
│       │    ┌───────────────────────┐   │                     │
│       └───▶│       Model B         │◀──┘                     │
│            │  (Judge: GPT-4, etc.) │                         │
│            └───────────┬───────────┘                         │
│                        │                                     │
│                        ▼                                     │
│            ┌───────────────────────┐                         │
│            │   Quality Scores      │                         │
│            │ • Relevancy: 0.85     │                         │
│            │ • Faithfulness: 0.92  │                         │
│            │ • Hallucination: 0.08 │                         │
│            └───────────────────────┘                         │
└─────────────────────────────────────────────────────────────┘
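
A bare-bones sketch of the judge call, assuming the Anthropic Python SDK as Model B; the prompt wording, score schema, and judge model choice are illustrative, not a prescribed evaluation setup.

import json

JUDGE_PROMPT = """Rate the following answer.
Question: {question}
Answer: {answer}
Context: {context}

Respond with JSON only: {{"relevancy": 0-1, "faithfulness": 0-1, "hallucination": 0-1}}"""

def judge(judge_client, question, answer, context):
    response = judge_client.messages.create(
        model="claude-3-opus-20240229",  # illustrative judge model
        max_tokens=200,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer,
                                                  context=context)}],
    )
    # Scores can be attached to the original trace as evaluation metadata.
    return json.loads(response.content[0].text)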

5.4 Evaluation Tool Landscape (2026)

Tool | Type | Key Features
Langfuse | Open Source | Tracing, prompt management, evals, 19k+ stars
Arize Phoenix | Open Source | OTel-native, OTLP ingestion, 7.8k stars
DeepEval | Open Source | 14+ metrics, CI/CD integration
MLflow 3.0 | Open Source | GenAI evals, experiment tracking
Datadog LLM Obs | Commercial | Hallucination detection, production monitoring
Braintrust | Commercial | Eval datasets, prompt playground

5.5 Production Evaluation Architecture

┌─────────────────────────────────────────────────────────────────┐
│                  Production Evaluation Flow                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  1. CAPTURE                2. EVALUATE              3. ITERATE   │
│  ┌─────────────┐          ┌─────────────┐         ┌───────────┐ │
│  │ Production  │          │ Async Eval  │         │ Feedback  │ │
│  │   Traces    │─────────▶│   Workers   │────────▶│   Loop    │ │
│  └─────────────┘          └─────────────┘         └───────────┘ │
│        │                        │                       │        │
│        │                        │                       │        │
│        ▼                        ▼                       ▼        │
│  ┌─────────────┐          ┌─────────────┐         ┌───────────┐ │
│  │   Span +    │          │   Quality   │         │  Prompt   │ │
│  │  Metadata   │          │   Scores    │         │ Iteration │ │
│  └─────────────┘          └─────────────┘         └───────────┘ │
│                                                                  │
│  Promote interesting traces to evaluation datasets               │
└─────────────────────────────────────────────────────────────────┘
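
One way to realize the capture/evaluate split is an asynchronous worker that reads completed spans and attaches quality scores off the hot path. The sketch below assumes spans are available as JSONL records and reuses the judge helper from the Section 5.3 sketch; the field names are illustrative.

import json

def evaluate_trace_file(path, judge_client):
    scored = []
    with open(path) as f:
        for line in f:
            span = json.loads(line)
            if span.get("gen_ai.operation.name") != "chat":
                continue
            # Evaluation runs asynchronously; production latency is unaffected.
            scores = judge(judge_client,
                           question=span.get("gen_ai.input.messages", ""),
                           answer=span.get("gen_ai.output.messages", ""),
                           context=span.get("retrieval.context", ""))
            scored.append({"trace_id": span.get("trace_id"), **scores})
    return scored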

5.6 Hallucination Detection Challenges

Recent research (arXiv:2504.18114) reveals limitations:

  • Metrics often fail to align with human judgments
  • Inconsistent gains with model parameter scaling
  • Myopic view focusing on surface-level patterns
  • GPT-4 as judge yields best overall results but has cost implications

6. Comparative Analysis: observability-toolkit MCP

6.1 Architecture Overview

The observability-toolkit MCP server provides:

┌─────────────────────────────────────────────────────────────┐
│                  observability-toolkit v1.6.0               │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Data Sources:                                               │
│  ├── Local JSONL files (~/.claude/telemetry/)               │
│  └── SigNoz Cloud API (optional)                            │
│                                                              │
│  Tools:                                                      │
│  ├── obs_query_traces      - Distributed trace queries      │
│  ├── obs_query_metrics     - Metric aggregation             │
│  ├── obs_query_logs        - Log search with boolean ops    │
│  ├── obs_query_llm_events  - LLM-specific event queries     │
│  ├── obs_health_check      - System health + cache stats    │
│  ├── obs_context_stats     - Context window utilization     │
│  └── obs_get_trace_url     - SigNoz trace viewer links      │
│                                                              │
│  Performance Features:                                       │
│  ├── LRU query caching                                       │
│  ├── File indexing (.idx sidecars)                          │
│  ├── Gzip compression support                               │
│  ├── Streaming with early termination                       │
│  └── Circuit breaker for SigNoz                             │
└─────────────────────────────────────────────────────────────┘

6.2 OTel GenAI Compliance Matrix

Requirement | Spec | Implementation | Status
gen_ai.operation.name | Required | Not captured | Gap
gen_ai.provider.name | Required | Uses gen_ai.system | Partial
gen_ai.request.model | Cond. Required | Captured | Compliant
gen_ai.conversation.id | Cond. Required | Not captured | Gap
gen_ai.usage.input_tokens | Recommended | Captured | Compliant
gen_ai.usage.output_tokens | Recommended | Captured | Compliant
gen_ai.response.model | Recommended | Not captured | Gap
gen_ai.response.finish_reasons | Recommended | Not captured | Gap
gen_ai.request.temperature | Recommended | Not captured | Gap
gen_ai.request.max_tokens | Recommended | Not captured | Gap

Compliance score: 4 of 10 attributes captured (3 compliant, 1 partial)

6.3 Agent Tracking Analysis

Capability | Spec Requirement | Implementation | Status
Agent spans (create_agent, invoke_agent) | Defined | Not implemented | Gap
Tool execution spans (execute_tool) | Defined | Not implemented | Gap
gen_ai.agent.id | Recommended | Not captured | Gap
gen_ai.agent.name | Recommended | Not captured | Gap
gen_ai.tool.name | Recommended | Not captured | Gap
gen_ai.tool.call.id | Recommended | Not captured | Gap
Session correlation | Custom | Uses session.id | Partial

Agent Compliance: Minimal - relies on generic trace attributes

6.4 Metrics Compliance

Metric | Spec | Implementation | Status
gen_ai.client.token.usage | Histogram w/ buckets | Flat storage | Partial
gen_ai.client.operation.duration | Histogram w/ buckets | Available via traces | Partial
gen_ai.server.time_to_first_token | Histogram | Not implemented | Gap
gen_ai.server.time_per_output_token | Histogram | Not implemented | Gap
Aggregation support | sum, avg, p50, p95, p99 | sum, avg, min, max, count | Partial

6.5 Quality/Eval Capabilities

Capability | Industry Standard | Implementation | Status
Hallucination detection | LLM-as-judge | Not implemented | Gap
Answer relevancy scoring | Automated eval | Not implemented | Gap
Human feedback collection | Annotation API | Not implemented | Gap
Eval dataset management | Trace promotion | Not implemented | Gap
Cost tracking | Price × tokens | Token counts only | Partial

6.6 Strengths Relative to Industry

Strength | Description | Competitive Position
Multi-directory scanning | Aggregates telemetry across locations | Unique
Gzip support | Transparent compression handling | Standard
Index files | Fast lookups via .idx sidecars | Above average
Query caching | LRU with TTL and stats | Standard
OTLP export | Standard format output | Compliant
Local-first | No cloud dependency required | Differentiator
Claude Code integration | Purpose-built for CC sessions | Unique

7. Recommendations and Roadmap

7.1 Priority Matrix

                        Impact
                    Low         High
                ┌───────────┬───────────┐
           High │ P3: Nice  │ P1: Do    │
    Effort      │  to have  │   First   │
                ├───────────┼───────────┤
            Low │ P4: Maybe │ P2: Quick │
                │   later   │    Wins   │
                └───────────┴───────────┘

7.2 Phase 1: OTel GenAI Compliance (P1/P2)

Goal: Achieve 80%+ compliance with GenAI semantic conventions

Task | Priority | Effort | Impact
Add gen_ai.operation.name to LLM events | P1 | Low | High
Rename gen_ai.system → gen_ai.provider.name | P2 | Low | Medium
Capture gen_ai.conversation.id | P1 | Medium | High
Add gen_ai.response.model | P2 | Low | Medium
Add gen_ai.response.finish_reasons | P2 | Low | Medium

Estimated effort: 1-2 development cycles
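
The gen_ai.system rename can be absorbed with a transitional mapping so that records written under either name remain queryable during migration. A minimal sketch; the normalize_attributes helper is illustrative and not part of the toolkit.

ATTRIBUTE_RENAMES = {
    "gen_ai.system": "gen_ai.provider.name",
}

def normalize_attributes(attributes: dict) -> dict:
    # Emit the new names while tolerating records written under the old schema.
    normalized = dict(attributes)
    for old_name, new_name in ATTRIBUTE_RENAMES.items():
        if old_name in normalized and new_name not in normalized:
            normalized[new_name] = normalized.pop(old_name)
    return normalized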

7.3 Phase 2: Agent Observability (P1)

Goal: First-class support for agent/tool span semantics

Task | Priority | Effort | Impact
Define agent span schema | P1 | Medium | High
Tool execution span tracking | P1 | Medium | High
Agent invocation correlation | P1 | High | High
Multi-agent workflow visualization | P3 | High | Medium

Estimated effort: 2-3 development cycles

7.4 Phase 3: Metrics Enhancement (P2)

Goal: Standard histogram metrics with OTel bucket boundaries

Task | Priority | Effort | Impact
Implement histogram aggregation | P2 | Medium | Medium
Add p50/p95/p99 percentiles | P2 | Low | Medium
Time-to-first-token metric | P2 | Medium | Medium
Cost estimation layer | P3 | Low | Low

Estimated effort: 1-2 development cycles
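
Once the OTel bucket boundaries from Section 3.5.1 are in place, percentiles can be estimated directly from histogram bucket counts. A simple linear-interpolation sketch (the function and example counts are illustrative):

def percentile_from_buckets(boundaries, counts, q):
    # boundaries: upper bounds for the first len(boundaries) buckets;
    # counts: one count per bucket, including the overflow bucket at the end.
    total = sum(counts)
    if total == 0:
        return None
    target = q * total
    cumulative = 0
    lower = 0.0
    for i, count in enumerate(counts):
        upper = boundaries[i] if i < len(boundaries) else float("inf")
        if cumulative + count >= target and count > 0:
            # Linear interpolation inside the bucket that crosses the target rank.
            fraction = (target - cumulative) / count
            return upper if upper == float("inf") else lower + fraction * (upper - lower)
        cumulative += count
        lower = upper
    return lower

# Example: operation duration buckets (seconds) and observed counts.
p95 = percentile_from_buckets([0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28],
                              [0, 2, 10, 25, 40, 15, 5, 2, 1], 0.95)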

7.5 Phase 4: Quality Layer (P3)

Goal: Optional integration with evaluation frameworks

Task | Priority | Effort | Impact
Langfuse integration research | P3 | Low | Medium
Eval score storage schema | P3 | Medium | Medium
LLM-as-judge hook support | P4 | High | High
Human feedback API | P4 | High | Medium

Estimated effort: 3+ development cycles (optional)

7.6 Implementation Roadmap

2026 Q1                    Q2                    Q3
───────────────────────────────────────────────────────────
│ Phase 1: OTel Compliance │ Phase 2: Agents    │ Phase 3  │
│ ├── gen_ai.operation.name│ ├── Agent spans    │ Metrics  │
│ ├── gen_ai.provider.name │ ├── Tool spans     │ ├── Hist │
│ ├── gen_ai.conversation  │ └── Correlation    │ └── TTFT │
│ └── finish_reasons       │                    │          │
───────────────────────────────────────────────────────────
                                                     │
                                                     ▼
                                              Phase 4: Quality
                                              (Future/Optional)

8. Future Research Directions

8.1 Emerging Standards

  1. MCP Observability Conventions: As Model Context Protocol gains adoption, standardized telemetry for MCP tool calls may emerge
  2. Agentic System Semantics: OTel Issue #2664 proposes comprehensive agent conventions including tasks, artifacts, and memory
  3. Multi-Agent Coordination: Observability for agent-to-agent communication patterns

8.2 Quality Measurement Evolution

  1. Real-time hallucination detection: Moving from batch eval to streaming assessment
  2. Automated regression detection: Identifying quality degradation across model updates
  3. Domain-specific evaluators: Specialized judges for code, medical, legal domains

8.3 Cost Optimization

  1. Token budget observability: Real-time context window utilization tracking
  2. Model routing telemetry: Visibility into cost/quality tradeoff decisions
  3. Caching effectiveness: Measuring semantic cache hit rates
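
As a concrete example of the token-to-cost calculation that underlies model routing and budget tracking, the sketch below multiplies token counts by per-million-token prices. The prices and model identifiers are placeholders, not current provider rates.

# Placeholder per-million-token prices; look these up from the provider's price list.
PRICE_PER_MTOK = {
    "claude-3-opus-20240229": {"input": 15.00, "output": 75.00},
    "claude-3-haiku-20240307": {"input": 0.25, "output": 1.25},
}

def estimate_cost_usd(model, input_tokens, output_tokens):
    price = PRICE_PER_MTOK[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

# 812 input + 214 output tokens on the first model ≈ $0.028 at the placeholder rates.
print(estimate_cost_usd("claude-3-opus-20240229", 812, 214))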

8.4 Privacy and Compliance

  1. Content redaction pipelines: OTel Collector processors for PII removal
  2. Audit trail requirements: Regulatory compliance for AI decisions
  3. Differential privacy: Aggregated telemetry without individual exposure

9. References

9.1 OpenTelemetry Specifications

  1. OpenTelemetry. “Semantic conventions for generative AI systems.” https://opentelemetry.io/docs/specs/semconv/gen-ai/

  2. OpenTelemetry. “Semantic conventions for generative client AI spans.” https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/

  3. OpenTelemetry. “Semantic conventions for generative AI metrics.” https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-metrics/

  4. OpenTelemetry. “Gen AI Registry Attributes.” https://opentelemetry.io/docs/specs/semconv/registry/attributes/gen-ai/

9.2 Industry Publications

  1. Liu, G. & Solomon, S. “AI Agent Observability - Evolving Standards and Best Practices.” OpenTelemetry Blog, March 2025. https://opentelemetry.io/blog/2025/ai-agent-observability/

  2. Jain, I. “An Introduction to Observability for LLM-based applications using OpenTelemetry.” OpenTelemetry Blog, June 2024. https://opentelemetry.io/blog/2024/llm-observability/

  3. Datadog. “Datadog LLM Observability natively supports OpenTelemetry GenAI Semantic Conventions.” December 2025. https://www.datadoghq.com/blog/llm-otel-semantic-convention/

  4. Horovits, D. “OpenTelemetry for GenAI and the OpenLLMetry project.” Medium, November 2025. https://horovits.medium.com/opentelemetry-for-genai-and-the-openllmetry-project-81b9cea6a771

9.3 Evaluation and Quality

  1. Confident AI. “LLM Evaluation Metrics: The Ultimate LLM Evaluation Guide.” https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation

  2. DeepEval. “Hallucination Metric Documentation.” https://deepeval.com/docs/metrics-hallucination

  3. “Evaluating Evaluation Metrics – The Mirage of Hallucination Detection.” arXiv:2504.18114, 2025.

9.4 Tools and Frameworks

  1. Langfuse. “OpenTelemetry (OTel) for LLM Observability.” https://langfuse.com/blog/2024-10-opentelemetry-for-llm-observability

  2. Traceloop. “OpenLLMetry: Open-source observability for GenAI.” https://github.com/traceloop/openllmetry

  3. Anthropic. “Building effective agents.” https://www.anthropic.com/research/building-effective-agents


10. Appendices

Appendix A: OTel GenAI Attribute Reference

Status: Index entry for future deep dive

Complete reference of all gen_ai.* attributes with:

  • Full attribute list with types and examples
  • Requirement levels by operation type
  • Provider-specific extensions
  • Migration guide from pre-1.37 conventions

Appendix B: Agent Span Hierarchies

Status: Index entry for future deep dive

Detailed span hierarchy patterns for:

  • Single-agent workflows
  • Multi-agent orchestration
  • Tool execution chains
  • Error propagation patterns
  • Correlation strategies

Appendix C: LLM Evaluation Frameworks

Status: Index entry for future deep dive

Comparative analysis of:

  • Langfuse evaluation capabilities
  • Arize Phoenix integration patterns
  • DeepEval metric implementations
  • Custom evaluator development
  • Production deployment patterns

Appendix D: observability-toolkit Schema Migration

Status: Index entry for future deep dive

Migration guide covering:

  • Current schema documentation
  • Target OTel-compliant schema
  • Backward compatibility strategy
  • Data migration procedures
  • Validation test suites

Appendix E: Cost Tracking Implementation

Status: Index entry for future deep dive

Cost observability implementation covering:

  • Provider pricing models
  • Token-to-cost calculation
  • Budget alerting patterns
  • Cost attribution by session/user
  • Optimization recommendations

Document History

Version | Date | Author | Changes
1.0 | 2026-01-29 | Research Analysis | Initial publication

This document was produced through systematic web research and comparative analysis. It represents the state of LLM observability standards as of January 2026 and should be reviewed periodically as the field evolves rapidly.
