Pattern 20: Evaluation and Monitoring¶
"Tracking agent performance, logging metrics, and evaluating outcomes for continuous improvement"
📖 Overview¶
Evaluation and Monitoring involves systematically tracking agent performance, collecting metrics, logging events, and analyzing outcomes to ensure reliable operation and continuous improvement. This pattern is essential for production systems where you need visibility into agent behavior, performance bottlenecks, and failure modes.
🎯 How Codex Implements Evaluation and Monitoring¶
Codex implements comprehensive monitoring through multiple layers: OpenTelemetry for metrics, rollout logging for session replay, and real-time performance tracking in the TUI.
Key Implementation: OpenTelemetry Integration¶
File: codex-rs/otel/src/lib.rs
use opentelemetry::{
    global::BoxedTracer,
    metrics::{Counter, Histogram, Meter},
    KeyValue,
};

pub struct CodexTelemetry {
    tracer: BoxedTracer,
    meter: Meter,
    // Metric instruments
    pub tool_calls_total: Counter<u64>,
    pub tool_call_duration: Histogram<f64>,
    pub tokens_used_total: Counter<u64>,
    pub errors_total: Counter<u64>,
}

impl CodexTelemetry {
    pub fn record_tool_call(&self, tool_name: &str, duration: f64, success: bool) {
        // Count the call, labeled by tool and outcome
        self.tool_calls_total.add(1, &[
            KeyValue::new("tool", tool_name.to_string()),
            KeyValue::new("success", success.to_string()),
        ]);
        self.tool_call_duration.record(duration, &[
            KeyValue::new("tool", tool_name.to_string()),
        ]);
        if !success {
            self.errors_total.add(1, &[
                KeyValue::new("error_type", "tool_execution"),
                KeyValue::new("tool", tool_name.to_string()),
            ]);
        }
    }

    pub fn record_token_usage(&self, prompt_tokens: u64, completion_tokens: u64) {
        self.tokens_used_total.add(prompt_tokens, &[
            KeyValue::new("token_type", "prompt"),
        ]);
        self.tokens_used_total.add(completion_tokens, &[
            KeyValue::new("token_type", "completion"),
        ]);
    }
}
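How these instruments get created is not shown above. A minimal initialization sketch, assuming the instruments hang off the globally registered providers (the constructor and instrument names are hypothetical, and depending on the opentelemetry crate version the builder finalizer is .init() or .build()):

use opentelemetry::global;

impl CodexTelemetry {
    // Hypothetical constructor: obtain a named tracer/meter from the global
    // providers and create each instrument once, up front.
    pub fn new() -> Self {
        let meter = global::meter("codex");
        Self {
            tracer: global::tracer("codex"),
            tool_calls_total: meter.u64_counter("codex.tool_calls.total").init(),
            tool_call_duration: meter.f64_histogram("codex.tool_call.duration_seconds").init(),
            tokens_used_total: meter.u64_counter("codex.tokens_used.total").init(),
            errors_total: meter.u64_counter("codex.errors.total").init(),
            meter,
        }
    }
}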
Session Recording and Replay¶
File: codex-rs/core/src/rollout/recorder.rs
pub struct RolloutRecorder {
    session_id: String,
    file: Option<File>,
    sequence_number: AtomicU64,
}

impl RolloutRecorder {
    pub async fn record_turn_start(&mut self, turn_context: &TurnContext) -> Result<()> {
        let item = RolloutItem::TurnStart {
            turn_id: turn_context.turn_id,
            timestamp: Utc::now(),
            user_message: turn_context.user_message.clone(),
        };
        self.write_item(&item).await?;

        // Also emit telemetry
        telemetry().record_turn_start(turn_context.turn_id);
        Ok(())
    }

    pub async fn record_tool_execution(
        &mut self,
        tool_name: &str,
        args: &Value,
        result: &ToolResult,
        duration: Duration,
    ) -> Result<()> {
        let item = RolloutItem::ToolExecution {
            tool_name: tool_name.to_string(),
            arguments: args.clone(),
            result: result.clone(),
            duration_ms: duration.as_millis() as u64,
            timestamp: Utc::now(),
        };
        self.write_item(&item).await?;

        // Record metrics
        telemetry().record_tool_call(
            tool_name,
            duration.as_secs_f64(),
            result.is_success(),
        );
        Ok(())
    }
}
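Both calls funnel into write_item, which is not shown here. A minimal sketch of that append-only JSONL write path, assuming file is a tokio::fs::File, RolloutItem implements serde::Serialize, and Result is anyhow-style (the real recorder may buffer and flush differently):

use std::sync::atomic::Ordering;
use tokio::io::AsyncWriteExt;

impl RolloutRecorder {
    // Hypothetical helper: append one item per line, tagged with a
    // monotonic sequence number so replays preserve ordering.
    async fn write_item(&mut self, item: &RolloutItem) -> Result<()> {
        let seq = self.sequence_number.fetch_add(1, Ordering::SeqCst);
        if let Some(file) = self.file.as_mut() {
            let mut line = serde_json::to_vec(&serde_json::json!({
                "seq": seq,
                "item": item,
            }))?;
            line.push(b'\n');
            file.write_all(&line).await?;
            file.flush().await?;
        }
        Ok(())
    }
}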
Real-Time Performance Monitoring¶
File: codex-rs/tui/src/app.rs
pub struct AppState {
    pub session_stats: SessionStats,
    pub rate_limits: RateLimits,
    pub performance_metrics: PerformanceMetrics,
}

#[derive(Debug, Clone)]
pub struct SessionStats {
    pub turns_completed: u32,
    pub tools_executed: u32,
    pub tokens_used: TokenUsage,
    pub session_duration: Duration,
    pub error_count: u32,
}

#[derive(Debug, Clone)]
pub struct PerformanceMetrics {
    pub avg_response_time: Duration,
    pub tool_success_rate: f64,
    pub tokens_per_minute: f64,
    pub memory_usage: u64,
}

impl AppState {
    pub fn update_metrics(&mut self, event: &ResponseEvent) {
        match event {
            ResponseEvent::Completed { token_usage, .. } => {
                self.session_stats.tokens_used.add(token_usage);
                self.update_performance_metrics();
            }
            ResponseEvent::ToolCallResult { success, duration, .. } => {
                self.session_stats.tools_executed += 1;
                if !success {
                    self.session_stats.error_count += 1;
                }
                self.performance_metrics.update_tool_stats(*duration, *success);
            }
            _ => {}
        }
    }
}
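update_tool_stats is referenced but not shown; one plausible implementation keeps exponentially weighted running aggregates. A sketch, where the helper body and its smoothing factor are assumptions rather than Codex's actual logic:

use std::time::Duration;

impl PerformanceMetrics {
    // Hypothetical helper: exponentially weighted moving averages, so the
    // displayed numbers are dominated by recent tool calls.
    pub fn update_tool_stats(&mut self, duration: Duration, success: bool) {
        const ALPHA: f64 = 0.2; // assumed smoothing factor

        let secs = duration.as_secs_f64();
        let avg = self.avg_response_time.as_secs_f64();
        self.avg_response_time = Duration::from_secs_f64(avg * (1.0 - ALPHA) + secs * ALPHA);

        let outcome = if success { 1.0 } else { 0.0 };
        self.tool_success_rate = self.tool_success_rate * (1.0 - ALPHA) + outcome * ALPHA;
    }
}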
Error Tracking and Analysis¶
File: codex-rs/core/src/error.rs
#[derive(Debug, thiserror::Error)]
pub enum CodexError {
    #[error("Tool execution failed: {tool_name}")]
    ToolExecutionFailed {
        tool_name: String,
        exit_code: Option<i32>,
        stderr: String,
    },

    #[error("Context window exceeded: {tokens_used}/{max_tokens}")]
    ContextWindowExceeded {
        tokens_used: u32,
        max_tokens: u32,
    },

    #[error("Rate limit exceeded: {retry_after:?}")]
    RateLimitExceeded {
        retry_after: Option<Duration>,
    },
}

impl CodexError {
    pub fn record_error_metrics(&self) {
        let error_type = match self {
            Self::ToolExecutionFailed { .. } => "tool_execution",
            Self::ContextWindowExceeded { .. } => "context_window",
            Self::RateLimitExceeded { .. } => "rate_limit",
        };

        telemetry().errors_total.add(1, &[
            KeyValue::new("error_type", error_type),
        ]);

        // Also log for analysis
        tracing::error!(
            error_type = error_type,
            error = %self,
            "Codex error occurred"
        );
    }
}
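Because each variant carries structured context, callers can branch on specific failure modes rather than on error strings. A sketch of rate-limit backoff, where run_turn and the one-second fallback are illustrative assumptions, not Codex's actual retry policy:

use std::time::Duration;

async fn handle_turn_error() {
    if let Err(err) = run_turn().await {
        // Record metrics and logs before inspecting the variant.
        err.record_error_metrics();
        if let CodexError::RateLimitExceeded { retry_after } = err {
            // Honor the server-suggested delay, falling back to one second.
            tokio::time::sleep(retry_after.unwrap_or(Duration::from_secs(1))).await;
            // ...retry the turn
        }
    }
}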
🔑 Key Monitoring Patterns in Codex¶
1. Multi-Layer Metrics Collection¶
// Application metrics
telemetry.record_tool_call("shell", 1.2, true);
telemetry.record_token_usage(150, 80);
// System metrics
telemetry.record_memory_usage(process_memory());
telemetry.record_cpu_usage(cpu_percent());
// Business metrics
telemetry.record_task_completion(task_id, success, duration);
2. Event-Driven Monitoring¶
// Events trigger automatic metric collection
match event {
    ResponseEvent::ToolCallBegin { tool_name, .. } => {
        span.record("tool.name", tool_name);
        timer.start();
    }
    ResponseEvent::ToolCallEnd { tool_name, success, .. } => {
        let duration = timer.elapsed();
        record_tool_metrics(tool_name, duration, success);
    }
    _ => {}
}
3. Session Replay for Debugging¶
// Every session is recorded for replay
let replayer = RolloutReplayer::new(session_path)?;
let session = replayer.replay_session().await?;
// Analyze session for issues
for turn in session.turns {
    if turn.had_errors() {
        analyze_error_patterns(&turn);
    }
}
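A plausible shape for the read side, mirroring the JSONL write path sketched earlier (hypothetical: assumes a session_path field and RolloutItem: serde::de::DeserializeOwned):

use tokio::io::{AsyncBufReadExt, BufReader};

impl RolloutReplayer {
    // Hypothetical helper: stream the JSONL log back into RolloutItems.
    pub async fn read_items(&self) -> Result<Vec<RolloutItem>> {
        let file = tokio::fs::File::open(&self.session_path).await?;
        let mut lines = BufReader::new(file).lines();
        let mut items = Vec::new();
        while let Some(line) = lines.next_line().await? {
            let record: serde_json::Value = serde_json::from_str(&line)?;
            items.push(serde_json::from_value(record["item"].clone())?);
        }
        Ok(items)
    }
}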
4. Real-Time Performance Dashboard¶
// TUI displays live metrics
fn render_metrics(frame: &mut Frame, area: Rect, metrics: &PerformanceMetrics) {
    let metrics_text = vec![
        Line::from(format!("Response Time: {:.2}s", metrics.avg_response_time.as_secs_f64())),
        Line::from(format!("Success Rate: {:.1}%", metrics.tool_success_rate * 100.0)),
        Line::from(format!("Tokens/min: {:.0}", metrics.tokens_per_minute)),
        Line::from(format!("Memory: {:.1}MB", metrics.memory_usage as f64 / 1024.0 / 1024.0)),
    ];

    let paragraph = Paragraph::new(metrics_text)
        .block(Block::default().title("Performance").borders(Borders::ALL));
    frame.render_widget(paragraph, area);
}
🎯 Key Takeaways¶
✅ Production Insights¶
- Multi-Dimensional Metrics: Codex tracks performance (latency, throughput), business (task success), and system (memory, CPU) metrics.
- Event-Driven Collection: Metrics are collected automatically as events occur, reducing overhead and ensuring completeness.
- Session Replay: Complete session recording enables post-mortem analysis and debugging of complex issues.
- Real-Time Visibility: The TUI provides immediate feedback on system performance and health.
🏗️ Architecture Benefits¶
- Observability: Full visibility into agent behavior and performance
- Debugging: Session replay enables root cause analysis
- Optimization: Metrics identify performance bottlenecks
- Reliability: Error tracking helps improve system robustness
📊 Metrics Categories¶
Performance Metrics¶
- Response latency (P50, P95, P99; see the sketch after this list)
- Tool execution duration
- Token processing rate
- Memory and CPU usage
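Latency percentiles can be derived from raw duration samples with the nearest-rank method; a minimal illustrative helper (the code above records durations into an OpenTelemetry histogram instead, which computes this server-side):

use std::time::Duration;

/// Nearest-rank percentile over a set of recorded latencies (illustrative).
fn percentile(samples: &mut [Duration], p: f64) -> Option<Duration> {
    if samples.is_empty() {
        return None;
    }
    samples.sort_unstable();
    // Nearest-rank: rank = ceil(p/100 * n), converted to a 0-based index.
    let rank = ((p / 100.0) * samples.len() as f64).ceil() as usize;
    let idx = rank.saturating_sub(1).min(samples.len() - 1);
    Some(samples[idx])
}

// percentile(&mut latencies, 95.0) yields P95; 50.0 and 99.0 give P50/P99.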
Business Metrics¶
- Task completion rate
- Tool success rate
- User satisfaction scores
- Session duration
Error Metrics¶
- Error rate by type
- Failed tool executions
- Context window overflows
- Rate limit hits
System Metrics¶
- Resource utilization
- Network latency
- Disk I/O
- Thread pool usage
🔗 Related Patterns¶
- Pattern 18: Rollout System - Provides data for evaluation
- Pattern 12: Exception Handling - Error metrics and tracking
- Pattern 1: Prompt Chaining - Turn-level performance metrics
- Pattern 5: Tool Use - Tool execution metrics
📚 Further Reading¶
Next: Complete Agent Example →