How to evaluate AI agents in production: metrics, evals, and observability
Technical guide to measuring the real reliability of AI agents in production: task completion rate, deterministic evals, LLM-as-judge, and silent degradation detection.
Contributors: Carlos Hernandez Prieto
I’ve been seeing the same pattern for months in teams deploying agents to production: the system passes all tests, model calls have acceptable latencies, there are no exceptions in the logs, and yet the agent fails silently on a significant fraction of tasks. Nobody knows until a user reports it, or until someone manually reviews a batch of outputs and discovers something is wrong. The problem isn’t that agents fail — it’s that they fail in ways conventional monitoring infrastructure doesn’t detect.
In this post I explain how to build an evaluation system for AI agents in production: what metrics actually matter, how to structure evals that reflect real conditions, how to detect degradation before the incident, and what you minimally need for visibility without complex infrastructure.
Why evaluating agents isn’t the same as evaluating models
Evaluating a model means measuring the quality of an output given an input: accuracy, coherence, relevance, BLEU score. It’s a direct relationship between input and output. Evaluating an agent is different in nature: it means measuring whether a sequence of chained decisions, each dependent on the state generated by the previous ones, reaches an objective with measurable reliability under real conditions.
The difference isn’t a matter of degree. It’s qualitative.
When the agent fails at step 3 of 7, the model may have responded perfectly at every step. Each individual call was syntactically correct. The problem was the decision about which call to make, with what parameters, at what point in the flow. That doesn’t show up in the model’s output quality metrics.
Most teams directly transplant the model evaluation mindset to agents because it’s what they know. They configure alerts on latency, HTTP error rate, and cost per token. Those metrics measure the underlying model, not the agent. The agent can work perfectly at the model level and fail completely at the objective level.
The metrics that matter in production
There are five agent metrics that, in my experience, provide real signal in production. None are visible from model call logs alone.
Task completion rate is the primary metric: the percentage of tasks where the agent reached the defined objective, not just the percentage of tasks completed without technical error. An agent can finish without exceptions while taking an alternate path that wasn’t correct.
Step efficiency measures whether the agent used more steps than necessary. If a task that should be solved in 4 steps regularly takes 9, the agent is lost, iterating without convergence, or its system prompt isn’t prioritizing the direct path. This metric also detects hidden loops before they consume budget.
Tool call accuracy measures how many tool calls used the correct tool with correct parameters. An agent that hallucinates tool names or constructs incorrect parameters can appear active without generating any useful results.
Error recovery rate is the percentage of times the agent successfully recovered when an intermediate step failed. Every agent fails at intermediate steps — what distinguishes robust ones is that they recover.
Cost per task is the tokens consumed per completed task, not per call. An unusual spike in this metric is often the first signal of a loop or degradation, sometimes hours before any user reports a problem.
| Dimension | Model metric | Agent metric | How it’s measured |
|---|---|---|---|
| Output quality | Accuracy, coherence, relevance | Task completion rate | % of tasks with objective reached, verified |
| Efficiency | Latency per call | Step efficiency | Actual steps ÷ minimum necessary steps |
| Tool usage | N/A | Tool call accuracy | % of tool calls with correct tool + parameters |
| Resilience | N/A | Error recovery rate | % of successful recoveries after intermediate failure |
| Cost | Tokens per call | Cost per task | Total tokens ÷ completed tasks |
| Temporal reliability | Point-in-time evaluation | Silent degradation | Metric trend over time |
What doesn’t appear in this table also matters: the semantic quality of the final result — whether the agent completed the objective technically but the output is useless to the user. For that you need evals.
How to build evals for agents
An agent eval isn’t a generic benchmark downloaded from the internet. It’s a set of representative tasks from real production, with success criteria defined objectively and measurably. If the criteria aren’t measurable, the eval can’t automatically detect regressions.
The first distinction to make is between deterministic evals and non-deterministic evals.
Deterministic evals have a verifiable correct result: the SQL query must return exactly these records, the generated file must have this JSON structure, the API call must have been made with these parameters. They’re verified programmatically. They’re the most valuable kind because they introduce no subjectivity and can run in CI at no additional cost.
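As a minimal sketch of what "verified programmatically" can look like: a structural check on the agent's final output. The function name and required-keys contract here are illustrative, not from any framework.

```python
import json


def check_json_structure(agent_output: str, required_keys: set[str]) -> bool:
    """Deterministic assertion: output must be valid JSON containing these keys."""
    try:
        data = json.loads(agent_output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys.issubset(data.keys())


# Runs in CI with no model calls: pass/fail is fully reproducible.
assert check_json_structure('{"id": 1, "status": "done"}', {"id", "status"})
assert not check_json_structure('not json', {"id"})
```

Because there is no model in the loop, this check costs nothing to run on every commit.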
Non-deterministic evals have assessable quality criteria but no single correct result: the summary must contain these key points, the response must follow the given context constraints, the generated plan must be executable. For these, the most common practice is using an LLM as judge with an explicit rubric.
```python
import anthropic
from dataclasses import dataclass, field


@dataclass
class AgentTrace:
    task_id: str
    steps: list[dict] = field(default_factory=list)
    total_tokens: int = 0

    @property
    def step_count(self) -> int:
        return len(self.steps)

    @property
    def tool_calls(self) -> list[dict]:
        return [s for s in self.steps if s["type"] == "tool_use"]


def run_eval_task(task: dict, tools: list, client: anthropic.Anthropic) -> AgentTrace:
    """Run an eval task capturing the complete TAO trace."""
    trace = AgentTrace(task_id=task["id"])
    messages = [{"role": "user", "content": task["input"]}]
    for _ in range(task.get("max_steps", 15)):
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=1024,
            tools=tools,
            messages=messages,
        )
        trace.total_tokens += response.usage.input_tokens + response.usage.output_tokens
        for block in response.content:
            trace.steps.append({"type": block.type, "data": block})
        if response.stop_reason != "tool_use":
            break
        tool_results = execute_tools(response.content)  # domain-specific implementation
        messages += [
            {"role": "assistant", "content": response.content},
            {"role": "user", "content": tool_results},
        ]
    return trace
```
The trace captures each step of the TAO cycle: what the agent thought, what tool it called, what it observed. With that trace you can calculate all five metrics above without additional instrumentation.
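As an illustration, the five metrics can be aggregated from per-task trace summaries. The field names below (`completed`, `min_steps`, `failures`, `recoveries`, and so on) are assumptions about how you summarize each trace after running its eval checks, not part of any API:

```python
def agent_metrics(traces: list[dict]) -> dict:
    """Aggregate the five agent metrics from per-task trace summaries.

    Each summary dict is assumed to carry: completed (bool), steps,
    min_steps, tool_calls, correct_tool_calls, failures, recoveries,
    tokens. Assumes a non-empty list of traces.
    """
    completed = [t for t in traces if t["completed"]]
    tool_calls = sum(t["tool_calls"] for t in traces)
    failures = sum(t["failures"] for t in traces)
    return {
        # % of tasks where the objective was actually reached
        "task_completion_rate": len(completed) / len(traces),
        # actual steps / minimum necessary steps, over completed tasks
        "step_efficiency": sum(t["steps"] for t in completed)
                           / max(sum(t["min_steps"] for t in completed), 1),
        # % of tool calls with correct tool + parameters
        "tool_call_accuracy": sum(t["correct_tool_calls"] for t in traces)
                              / max(tool_calls, 1),
        # % of successful recoveries after an intermediate failure
        "error_recovery_rate": sum(t["recoveries"] for t in traces)
                               / max(failures, 1),
        # total tokens / completed tasks (not per call)
        "cost_per_task": sum(t["tokens"] for t in traces)
                         / max(len(completed), 1),
    }
```

The `max(..., 1)` guards are a pragmatic choice to avoid division by zero when a batch has no tool calls, failures, or completions.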
For non-deterministic evals, LLM-as-judge works, but you need to understand its limitations before trusting it.
```python
import json

import anthropic

RUBRIC_PROMPT = """\
You are an evaluator of agent tasks. Evaluate whether the agent completed the objective.
OBJECTIVE: {objective}
AGENT RESULT:
{agent_output}
SUCCESS CRITERIA:
{criteria}
Respond ONLY with valid JSON using this exact schema:
{{
  "success": true/false,
  "completeness": 0-10,
  "accuracy": 0-10,
  "reasoning": "justification in one sentence"
}}
IMPORTANT: Evaluate the final result, not the process. Don't penalize an
efficient path if the result is correct. Don't approve a plausible result
that doesn't meet the explicit criteria."""


def llm_judge(
    objective: str,
    agent_output: str,
    criteria: list[str],
    client: anthropic.Anthropic,
) -> dict:
    prompt = RUBRIC_PROMPT.format(
        objective=objective,
        agent_output=agent_output,
        criteria="\n".join(f"- {c}" for c in criteria),
    )
    # Prefer a different model than the agent's here: using the same one
    # as judge introduces self-affirmation bias
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.content[0].text)
```
The most important bias in LLM-as-judge is self-affirmation: if you use the same model as the agent as the judge, it tends to approve outputs it would have generated itself. You need to use a different model, or at least a different version with zero temperature. The second relevant bias is length: LLMs tend to prefer longer outputs, which can inflate scores for lengthy but incomplete responses. The explicit rubric with binary criteria mitigated this in the projects where I tested it.
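A cheap way to keep the judge honest is to compare its verdicts against your human-reviewed sample. This sketch assumes you have paired boolean verdicts for the same tasks; the idea of flagging low agreement is mine here, and any threshold you pick is a rule of thumb to calibrate per project:

```python
def judge_agreement(judge_verdicts: list[bool], human_verdicts: list[bool]) -> float:
    """Fraction of sampled tasks where the LLM judge and a human reviewer agree.

    Assumes both lists cover the same tasks in the same order. A low value
    means the rubric is measuring something other than your success criteria.
    """
    if len(judge_verdicts) != len(human_verdicts) or not judge_verdicts:
        raise ValueError("need two non-empty verdict lists of equal length")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)
```

Running this over each human-sampling batch turns "do we trust the judge?" from a feeling into a tracked number.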
Silent degradation: three causes, three detection methods
An agent that works well at launch can degrade without anyone noticing. I’ve seen three main causes, and each requires a different detection method.
Distributional shift: real inputs start differing from the inputs you built evals with. The agent was never designed for that type of task, but users try it anyway. The detection signal is a drop in task completion rate combined with an increase in step count — the agent iterates more because it doesn’t know how to reach the objective with those inputs.
Model updates: the underlying model changes and agent behavior changes with it. This is especially silent because the provider doesn’t always notify behavior changes, just version changes. Detection requires running your complete eval suite against each new model version before updating, not after. In two separate projects I’ve seen tool call accuracy regressions of 15-20% after model updates that improved other metrics.
Tool drift: external APIs or tools the agent uses change their behavior. An API that previously responded in 200ms now takes 2 seconds, which changes the agent’s timeout decisions. An endpoint that returned a field in the response object now moves it elsewhere. The signature of this failure is an increase in error recovery rate with stable tool call accuracy — the agent calls tools correctly, but the tools don’t behave as expected.
The most common failure patterns share a characteristic: they don’t produce exceptions. The hidden infinite loop — the agent keeps iterating without advancing — is detected by a spike in cost per task before reaching any result. Tool call hallucination — calling tools that don’t exist or with invented parameters — is detected by an increase in tool errors in the trace. Silent error cascade — the failure at step 3 propagates incorrect state to step 4, which propagates it to step 5, and the agent delivers a plausible but incorrect result — is the hardest to detect because the final output doesn’t appear wrong without verification. For this you need periodic human sampling, not just automated metrics.
Minimum viable observability
You don’t need complex infrastructure to have real visibility into how your agent behaves in production. These four elements are enough to start.
Tracing each TAO cycle. Each iteration should save: the step input, the model output, the tools called with their parameters, each tool result, and the cost in tokens. Without this trace, you can’t calculate any of the metrics that matter or debug failures in production.
Alerts on cost per task. Configure an alert when cost per task rises more than 50% compared to the 7-day average. This single alert detects hidden loops, severe distributional shift, and efficiency degradation before impact reaches users.
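The 50%-over-7-day-average rule can be sketched in a few lines. The function and field layout are illustrative; in practice this would sit in whatever metrics pipeline you already have:

```python
from statistics import mean


def cost_alert(daily_cost_per_task: list[float], threshold: float = 1.5) -> bool:
    """Fire when today's cost per task exceeds the 7-day average by the
    threshold factor (1.5 == the 50% rise described above).

    Expects at least 8 daily data points, oldest first, today last.
    """
    if len(daily_cost_per_task) < 8:
        raise ValueError("need 7 days of baseline plus today")
    today = daily_cost_per_task[-1]
    baseline = mean(daily_cost_per_task[-8:-1])  # the previous 7 days
    return today > baseline * threshold
```

One boolean, checked daily, is enough to catch hidden loops before anyone reads a dashboard.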
Random sampling for human review. 5-10% of completed tasks should be reviewed manually at a regular cadence. No automated metric substitutes for someone checking whether the result is actually correct. It’s the only mechanism that detects silent error cascades and overconfidence issues.
Dashboard of task completion rate by task type over time. Not a single global number — segmented by task type. A degradation affecting only a subset of inputs gets masked in the global average. Seeing the trend by category lets you identify where the problem is before the impact becomes widespread.
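The segmented view is a small aggregation. The `type` and `completed` field names below are assumptions about your task records:

```python
from collections import defaultdict


def completion_by_type(tasks: list[dict]) -> dict[str, float]:
    """Task completion rate segmented by task type.

    Each task dict is assumed to carry a 'type' label and a
    'completed' boolean (objective reached, not merely finished).
    """
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for t in tasks:
        totals[t["type"]][0] += t["completed"]  # bool counts as 0/1
        totals[t["type"]][1] += 1
    return {k: done / n for k, (done, n) in totals.items()}
```

A global rate of, say, 85% can hide a category sitting at 50%; this breakdown is what surfaces it.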
For more context on how TAO cycles structure agent behavior — and why instrumenting each cycle individually is more useful than instrumenting the complete flow — see the post on agentic loops.
Common mistakes when evaluating agents
Measuring model metrics instead of agent metrics
The most frequent and most costly mistake. The team configures dashboards for latency, tokens per call, and HTTP error rate — real metrics, but ones that measure the underlying model — and concludes the system works well. Meanwhile, task completion rate can be dropping with no visible signal. The telltale sign of this mistake: the team only discovers agent failures through user reports, never through its own alerts.
Eval set that doesn’t represent production
You build evals with the cleanest examples, the easiest cases, the tasks the agent already handles well. The eval passes at 98%. In production, the agent encounters messy variations that were never in the eval set. The solution is building evals from real production logs starting the first day you have real traffic, not before.
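One way to avoid cherry-picking when building that set is to sample production logs stratified by task type, so no category is silently dropped. Everything here — function name, field names, the stratification choice — is an illustrative sketch, not a prescribed method:

```python
import random


def sample_eval_candidates(logs: list[dict], n: int = 20, seed: int = 0) -> list[dict]:
    """Sample logged production tasks into eval candidates, stratified by
    task type so every category is represented.

    Assumes each log entry carries a 'type' field. Seeds the global RNG
    for reproducibility (acceptable for an offline script).
    """
    random.seed(seed)
    by_type: dict[str, list[dict]] = {}
    for entry in logs:
        by_type.setdefault(entry["type"], []).append(entry)
    per_type = max(n // len(by_type), 1)
    return [
        e
        for entries in by_type.values()
        for e in random.sample(entries, min(per_type, len(entries)))
    ]
```

Sampling rather than hand-picking is the point: it forces the messy variations into the eval set.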
Judge that approves style, not content
LLM-as-judge without explicit rubric tends to evaluate writing quality instead of success criteria compliance. A well-written output that doesn’t meet defined success criteria can get high scores. The rubric should list verifiable binary criteria, not request a holistic quality assessment. If the judge can’t justify its score with evidence from the output, the rubric is poorly written.
Ignoring tool drift until the incident
External APIs change. If you don’t run your eval suite periodically against real tools in production, you’ll discover tool drift when it’s already affecting users. A weekly run of the deterministic eval subset, with alerts on tool call accuracy changes, is enough to detect it in time.
Minimum evaluation checklist before production
- Agent metrics (task completion rate, step efficiency, cost per task) are instrumented and visible
- Eval set includes at least 20 representative tasks extracted from real use cases
- There’s at least one deterministic eval per critical tool the agent uses
- LLM-as-judge (if used) has a rubric with explicit binary criteria and uses a different model than the evaluated one
- Tracing covers each TAO cycle: input, output, tool calls, tool results, and token cost
- An alert is configured on cost per task that detects anomalies over 50%
- The periodic human sampling process is defined: who reviews, what percentage, how often
- Eval suite runs against each new model version before updating in production
Frequently Asked Questions
What’s the difference between an eval and a benchmark?
A benchmark is a standardized set of tasks designed to compare models with each other under generic conditions. An eval is a set of tasks specific to your use case, designed to measure whether your agent meets your success criteria under real production conditions. Public benchmarks are useful for choosing a base model, but they don’t predict your agent’s behavior with your tools, your context, and your users. An agent can score high on benchmarks and have low task completion rate in production, and vice versa.
When to use LLM-as-judge and when not?
LLM-as-judge is useful when the success criterion requires semantic judgment that isn’t programmatically verifiable: does the summary capture key points? Does the response follow the given context constraints? It’s not appropriate when the result has a verifiable correct answer — use deterministic assertions for that: they’re faster, cheaper, and introduce no bias. Don’t use LLM-as-judge as your only evaluation mechanism either. Periodic human sampling is still necessary to calibrate that the judge is evaluating what you think it’s evaluating.
What to do when you don’t have production data to build evals?
If you don’t have real traffic yet, build evals based on the use cases that motivated the agent: the real tasks users will ask for. Interview potential users or the team that will use the agent. Generate synthetic variations of those tasks — not to replace real data, but as a starting point. The important thing is that evals reflect realistic complexity and variety, not just happy path cases. From the moment you have your first 50 production outputs, start migrating the eval set toward real data. Synthetic evals have a short lifespan.
How to prioritize which metrics to monitor first if resources are limited?
If you have to choose one metric: task completion rate segmented by task type. It most directly measures whether the agent does its job. If you can add a second: cost per task with anomaly alerts — detects loops and degradation before reaching users. With those two metrics and a biweekly human sampling process you have the minimum to not operate blind. Step efficiency and tool call accuracy metrics are valuable for debugging when you already know there’s a problem, but their value as early signal is lower than the two above.