Pattern 20: Evaluation and Monitoring

"Tracking agent performance, logging metrics, and evaluating outcomes for continuous improvement"

📖 Overview

Evaluation and Monitoring involves systematically tracking agent performance, collecting metrics, logging events, and analyzing outcomes to ensure reliable operation and continuous improvement. This pattern is essential for production systems where you need visibility into agent behavior, performance bottlenecks, and failure modes.

🎯 How Codex Implements Evaluation and Monitoring

Codex implements comprehensive monitoring through multiple layers: OpenTelemetry for metrics, rollout logging for session replay, and real-time performance tracking in the TUI.

Key Implementation: OpenTelemetry Integration

File: codex-rs/otel/src/lib.rs

use opentelemetry::{
    global::BoxedTracer,
    metrics::{Counter, Histogram, Meter},
    trace::{Span, Tracer},
    KeyValue,
};

pub struct CodexTelemetry {
    tracer: BoxedTracer,
    meter: Meter,

    // Metrics
    pub tool_calls_total: Counter<u64>,
    pub tool_call_duration: Histogram<f64>,
    pub tokens_used_total: Counter<u64>,
    pub errors_total: Counter<u64>,
}

impl CodexTelemetry {
    pub fn record_tool_call(&self, tool_name: &str, duration: f64, success: bool) {
        // Record metrics
        self.tool_calls_total.add(1, &[
            KeyValue::new("tool", tool_name.to_string()),
            KeyValue::new("success", success.to_string()),
        ]);

        self.tool_call_duration.record(duration, &[
            KeyValue::new("tool", tool_name.to_string()),
        ]);

        if !success {
            self.errors_total.add(1, &[
                KeyValue::new("error_type", "tool_execution"),
                KeyValue::new("tool", tool_name.to_string()),
            ]);
        }
    }

    pub fn record_token_usage(&self, prompt_tokens: u64, completion_tokens: u64) {
        self.tokens_used_total.add(prompt_tokens, &[
            KeyValue::new("token_type", "prompt"),
        ]);

        self.tokens_used_total.add(completion_tokens, &[
            KeyValue::new("token_type", "completion"),
        ]);
    }
}
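
The snippet above assumes the counters and histograms already exist. Below is a minimal sketch of how they might be created from the globally installed tracer and meter providers. The instrument names and the constructor are illustrative rather than the actual codex-rs initialization, and it assumes a recent opentelemetry crate where instrument builders are finalized with build() (older releases used init()).

impl CodexTelemetry {
    /// Hypothetical constructor: build the instruments from the global
    /// tracer/meter providers (instrument names are illustrative).
    pub fn new() -> Self {
        let meter = opentelemetry::global::meter("codex");
        Self {
            tracer: opentelemetry::global::tracer("codex"),
            tool_calls_total: meter.u64_counter("codex.tool_calls").build(),
            tool_call_duration: meter.f64_histogram("codex.tool_call.duration").build(),
            tokens_used_total: meter.u64_counter("codex.tokens_used").build(),
            errors_total: meter.u64_counter("codex.errors").build(),
            meter,
        }
    }
}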

Session Recording and Replay

File: codex-rs/core/src/rollout/recorder.rs

pub struct RolloutRecorder {
    session_id: String,
    file: Option<File>,
    sequence_number: AtomicU64,
}

impl RolloutRecorder {
    pub async fn record_turn_start(&mut self, turn_context: &TurnContext) -> Result<()> {
        let item = RolloutItem::TurnStart {
            turn_id: turn_context.turn_id,
            timestamp: Utc::now(),
            user_message: turn_context.user_message.clone(),
        };

        self.write_item(&item).await?;

        // Also emit telemetry
        telemetry().record_turn_start(turn_context.turn_id);
        Ok(())
    }

    pub async fn record_tool_execution(
        &mut self,
        tool_name: &str,
        args: &Value,
        result: &ToolResult,
        duration: Duration,
    ) -> Result<()> {
        let item = RolloutItem::ToolExecution {
            tool_name: tool_name.to_string(),
            arguments: args.clone(),
            result: result.clone(),
            duration_ms: duration.as_millis() as u64,
            timestamp: Utc::now(),
        };

        self.write_item(&item).await?;

        // Record metrics
        telemetry().record_tool_call(
            tool_name,
            duration.as_secs_f64(),
            result.is_success(),
        );

        Ok(())
    }
}
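
write_item is referenced above but not shown. What follows is a minimal sketch of what it could look like, assuming a JSON Lines on-disk format, that RolloutItem implements serde::Serialize, that file is a tokio::fs::File, and that Result is an anyhow-style alias; the actual codex-rs format may differ.

use std::sync::atomic::Ordering;
use tokio::io::AsyncWriteExt;

impl RolloutRecorder {
    /// Hypothetical: append one item as a single JSON line, tagged with a
    /// monotonically increasing sequence number so replay can restore ordering.
    async fn write_item(&mut self, item: &RolloutItem) -> Result<()> {
        let seq = self.sequence_number.fetch_add(1, Ordering::SeqCst);
        let record = serde_json::json!({
            "seq": seq,
            "session_id": &self.session_id,
            "item": item,
        });

        if let Some(file) = self.file.as_mut() {
            let mut line = serde_json::to_string(&record)?;
            line.push('\n');
            file.write_all(line.as_bytes()).await?;
        }
        Ok(())
    }
}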

Real-Time Performance Monitoring

File: codex-rs/tui/src/app.rs

pub struct AppState {
    pub session_stats: SessionStats,
    pub rate_limits: RateLimits,
    pub performance_metrics: PerformanceMetrics,
}

#[derive(Debug, Clone)]
pub struct SessionStats {
    pub turns_completed: u32,
    pub tools_executed: u32,
    pub tokens_used: TokenUsage,
    pub session_duration: Duration,
    pub error_count: u32,
}

#[derive(Debug, Clone)]
pub struct PerformanceMetrics {
    pub avg_response_time: Duration,
    pub tool_success_rate: f64,
    pub tokens_per_minute: f64,
    pub memory_usage: u64,
}

impl AppState {
    pub fn update_metrics(&mut self, event: &ResponseEvent) {
        match event {
            ResponseEvent::Completed { token_usage, .. } => {
                self.session_stats.tokens_used.add(token_usage);
                self.update_performance_metrics();
            }
            ResponseEvent::ToolCallResult { success, duration, .. } => {
                self.session_stats.tools_executed += 1;
                if !*success {
                    self.session_stats.error_count += 1;
                }
                self.performance_metrics.update_tool_stats(*duration, *success);
            }
            _ => {}
        }
    }
}
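
The update_tool_stats and update_performance_metrics helpers are referenced but not shown. One simple way to keep these figures current is an exponential moving average, sketched below; this is an illustrative approach, not necessarily how codex-rs computes them.

use std::time::Duration;

impl PerformanceMetrics {
    /// Hypothetical: fold one tool result into the rolling figures using an
    /// exponential moving average where the newest sample carries 10% weight.
    fn update_tool_stats(&mut self, duration: Duration, success: bool) {
        const ALPHA: f64 = 0.1;

        // Smooth the observed latency into the running average.
        let sample_secs = duration.as_secs_f64();
        let avg_secs = self.avg_response_time.as_secs_f64();
        self.avg_response_time =
            Duration::from_secs_f64(avg_secs * (1.0 - ALPHA) + sample_secs * ALPHA);

        // Treat success as 1.0 and failure as 0.0 so the rate stays in [0, 1].
        let success_sample = if success { 1.0 } else { 0.0 };
        self.tool_success_rate =
            self.tool_success_rate * (1.0 - ALPHA) + success_sample * ALPHA;
    }
}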

Error Tracking and Analysis

File: codex-rs/core/src/error.rs

#[derive(Debug, thiserror::Error)]
pub enum CodexError {
    #[error("Tool execution failed: {tool_name}")]
    ToolExecutionFailed {
        tool_name: String,
        exit_code: Option<i32>,
        stderr: String,
    },

    #[error("Context window exceeded: {tokens_used}/{max_tokens}")]
    ContextWindowExceeded {
        tokens_used: u32,
        max_tokens: u32,
    },

    #[error("Rate limit exceeded: {retry_after:?}")]
    RateLimitExceeded {
        retry_after: Option<Duration>,
    },
}

impl CodexError {
    pub fn record_error_metrics(&self) {
        let error_type = match self {
            Self::ToolExecutionFailed { .. } => "tool_execution",
            Self::ContextWindowExceeded { .. } => "context_window",
            Self::RateLimitExceeded { .. } => "rate_limit",
        };

        telemetry().errors_total.add(1, &[
            KeyValue::new("error_type", error_type),
        ]);

        // Also log for analysis
        tracing::error!(
            error_type = error_type,
            error = %self,
            "Codex error occurred"
        );
    }
}
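
A hedged example of how a caller might combine this with recovery logic: every error is counted and logged, and only rate-limit errors produce a backoff delay. The helper name is illustrative.

use std::time::Duration;

/// Hypothetical call-site helper: record the error, then tell the caller
/// whether (and how long) to back off before retrying.
fn classify_failure(err: &CodexError) -> Option<Duration> {
    // Metrics and structured logging happen for every error, regardless of
    // how the caller chooses to recover.
    err.record_error_metrics();

    match err {
        CodexError::RateLimitExceeded { retry_after } => {
            Some((*retry_after).unwrap_or(Duration::from_secs(1)))
        }
        _ => None,
    }
}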

🔑 Key Monitoring Patterns in Codex

1. Multi-Layer Metrics Collection

// Application metrics
telemetry.record_tool_call("shell", 1.2, true);
telemetry.record_token_usage(150, 80);

// System metrics
telemetry.record_memory_usage(process_memory());
telemetry.record_cpu_usage(cpu_percent());

// Business metrics
telemetry.record_task_completion(task_id, success, duration);

2. Event-Driven Monitoring

// Events trigger automatic metric collection
match event {
    ResponseEvent::ToolCallBegin { tool_name, .. } => {
        span.record("tool.name", tool_name);
        timer.start();
    }
    ResponseEvent::ToolCallEnd { success, .. } => {
        let duration = timer.elapsed();
        record_tool_metrics(tool_name, duration, success);
    }
}
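
The timer above is left undefined in the fragment; a minimal stopwatch backed by std::time::Instant could look like this (the type and its methods are illustrative):

use std::time::{Duration, Instant};

/// Hypothetical stopwatch for measuring tool-call latency.
struct Timer {
    started_at: Option<Instant>,
}

impl Timer {
    fn start(&mut self) {
        self.started_at = Some(Instant::now());
    }

    /// Time since the last start(), or zero if the timer was never started.
    fn elapsed(&self) -> Duration {
        self.started_at.map(|t| t.elapsed()).unwrap_or(Duration::ZERO)
    }
}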

3. Session Replay for Debugging

// Every session is recorded for replay
let replayer = RolloutReplayer::new(session_path)?;
let session = replayer.replay_session().await?;

// Analyze session for issues
for turn in session.turns {
    if turn.had_errors() {
        analyze_error_patterns(&turn);
    }
}
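
A sketch of how a replayer might read the recorded file back, assuming the JSON Lines layout from the recorder sketch above and that RolloutItem implements serde::Deserialize; the real codex-rs replay path may differ.

use tokio::io::{AsyncBufReadExt, BufReader};

/// Hypothetical: stream a rollout file line by line and deserialize each item.
async fn read_rollout_items(path: &std::path::Path) -> anyhow::Result<Vec<RolloutItem>> {
    let file = tokio::fs::File::open(path).await?;
    let mut lines = BufReader::new(file).lines();

    let mut items = Vec::new();
    while let Some(line) = lines.next_line().await? {
        // Each non-empty line is one JSON object with "seq" and "item" fields.
        if line.trim().is_empty() {
            continue;
        }
        let record: serde_json::Value = serde_json::from_str(&line)?;
        items.push(serde_json::from_value(record["item"].clone())?);
    }
    Ok(items)
}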

4. Real-Time Performance Dashboard

// TUI displays live metrics
fn render_metrics(frame: &mut Frame, area: Rect, metrics: &PerformanceMetrics) {
    let metrics_text = vec![
        Line::from(format!("Response Time: {:.2}s", metrics.avg_response_time.as_secs_f64())),
        Line::from(format!("Success Rate: {:.1}%", metrics.tool_success_rate * 100.0)),
        Line::from(format!("Tokens/min: {:.0}", metrics.tokens_per_minute)),
        Line::from(format!("Memory: {:.1}MB", metrics.memory_usage as f64 / 1024.0 / 1024.0)),
    ];

    let paragraph = Paragraph::new(metrics_text)
        .block(Block::default().title("Performance").borders(Borders::ALL));

    frame.render_widget(paragraph, area);
}

🎯 Key Takeaways

Production Insights

  1. Multi-Dimensional Metrics: Codex tracks performance (latency, throughput), business (task success), and system (memory, CPU) metrics.

  2. Event-Driven Collection: Metrics are collected automatically as events occur, reducing overhead and ensuring completeness.

  3. Session Replay: Complete session recording enables post-mortem analysis and debugging of complex issues.

  4. Real-Time Visibility: The TUI provides immediate feedback on system performance and health.

🏗️ Architecture Benefits

  • Observability: Full visibility into agent behavior and performance
  • Debugging: Session replay enables root cause analysis
  • Optimization: Metrics identify performance bottlenecks
  • Reliability: Error tracking helps improve system robustness

📊 Metrics Categories

Performance Metrics

  • Response latency (P50, P95, P99)
  • Tool execution duration
  • Token processing rate
  • Memory and CPU usage

Business Metrics

  • Task completion rate
  • Tool success rate
  • User satisfaction scores
  • Session duration

Error Metrics

  • Error rate by type
  • Failed tool executions
  • Context window overflows
  • Rate limit hits

System Metrics

  • Resource utilization
  • Network latency
  • Disk I/O
  • Thread pool usage

🔗 Related Patterns

  • Pattern 18: Rollout System - Provides data for evaluation
  • Pattern 12: Exception Handling - Error metrics and tracking
  • Pattern 1: Prompt Chaining - Turn-level performance metrics
  • Pattern 5: Tool Use - Tool execution metrics

Next: Complete Agent Example →