Model as Judge: Evaluate Your AI Agent's Responses
Learn how to use an AI model to automatically evaluate another agent's responses. Rubrics, evaluation types, and the most common pitfalls.
Contributors: Manu Rubio
To follow this post, you’ll need to know what an AI agent is (a program that uses a language model to answer questions or execute tasks) and have a basic understanding of how language models work. If those concepts are new, I recommend reading Architecture of an Enterprise Agent first.
You built your first AI agent. It answers questions, returns results, seems to work. But there’s a problem you didn’t expect: sometimes the response is excellent, sometimes mediocre, and sometimes just plain wrong. How do you know if it’s improving or getting worse when you change something in the code?
The obvious answer is to review it manually. You ask 20 questions, read the responses, decide if they’re good. It works when you have 20 responses. It doesn’t work when you have 2,000.
The “model as judge” pattern solves this in a way that surprised me the first time: you ask another AI model to evaluate your agent’s responses.
The idea, with an analogy first
Think of it like a teacher grading assignments. The student (your agent) does the exercise. The teacher (the judge model) reads it with a rubric in hand and assigns a grade.
A rubric is simply a list of criteria that defines what a good response is. For example: “1 point if the response is made up, 5 points if it’s based on actual data and is direct”. Without a rubric, the judge has no reference point to evaluate anything, just like a teacher grading without knowing what grade each answer deserves.
The difference from a human teacher is that the judge model can review thousands of responses per minute and always applies exactly the same criteria. No fatigue. No bad days.
How to implement it step by step
The following diagram shows the complete flow:
Step 1: Define the rubric before writing code
The rubric is the most important part and the one most people skip. If you tell the judge “evaluate if this response is good” without concrete criteria, you’ll get inconsistent evaluations.
The rubric must answer: what exactly is a good response in your case? A concrete rubric could be:
- 1: The response contains made-up or incorrect information
- 2: The response is vague and doesn’t directly answer the question
- 4: The response is correct but has an inappropriate tone
- 5: The response is correct, direct, and has the appropriate tone
Notice that 3 doesn’t appear. Not all rubrics have to be continuous. What matters is that each score is tied to a specific criterion.
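To keep the rubric reusable across prompts and approaches, it helps to define it as data rather than hardcoding it into a string. A minimal sketch (the type and function names are illustrative, not from any library):

```typescript
// Hypothetical rubric shape: each score is tied to one concrete criterion.
// Note the scores don't have to be contiguous (3 is intentionally absent).
type RubricEntry = { score: number; criterion: string };

const rubric: RubricEntry[] = [
  { score: 1, criterion: "The response contains made-up or incorrect information" },
  { score: 2, criterion: "The response is vague and doesn't directly answer the question" },
  { score: 4, criterion: "The response is correct but has an inappropriate tone" },
  { score: 5, criterion: "The response is correct, direct, and has the appropriate tone" },
];

// Render the rubric as the text block the judge prompt will embed.
function renderRubric(entries: RubricEntry[]): string {
  return entries.map((e) => `${e.score} = ${e.criterion}`).join("\n");
}
```

This way the same rubric object feeds the evaluation prompt and, later, any validation of the judge itself.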
Step 2: Choose the judge model
The judge has to be more capable than the model you’re evaluating. But “more capable” doesn’t automatically mean “the most expensive”: it means capable of reasoning about the criterion you’re evaluating. Detecting hallucinations requires a model with good context understanding; so does evaluating tone or logical coherence. Checking whether the response follows an exact format needs no model at all: that’s just normal code.
If your agent uses a small, fast model to save costs, the judge should be a state-of-the-art one for that type of task. The logic is straightforward: if the judge were as limited as your agent, it couldn’t detect its errors.
Step 3: Design the evaluation prompt
A prompt is the text you send to the model to ask it to do something. If this term is new, the post on Prompt Engineering for Developers covers the most useful patterns for structuring them.
For the judge model, the prompt should include the original question, the response to evaluate, and the complete rubric. We also ask it to return JSON: a structured text format (with braces and commas) that programs can easily read and parse.
```typescript
// Minimal sanitization: normalizes straight and typographic quotes so they
// don't clash with the quoted sections of the prompt below.
// In production, adapt this depending on the input source (API, form, webhook, etc.)
function sanitizeInput(value: string): string {
  return value.replace(/["“”‘’]/g, "'").trim();
}

// We call the judge model with all the context it needs.
// In this tutorial the hypothetical agent would use a smaller model (e.g. claude-haiku-4-5-20251001);
// the judge uses claude-opus-4-6 to have superior judgment to the model being evaluated.
const response = await anthropic.messages.create({
  model: "claude-opus-4-6",
  max_tokens: 300,
  messages: [
    {
      role: "user",
      // We give it: the original question + agent response + rubric
      content: `User question: "${sanitizeInput(userQuestion)}"
Agent response: "${sanitizeInput(agentResponse)}"

Evaluation rubric:
1 = made-up or incorrect information
2 = vague or incomplete response
4 = correct but inappropriate tone
5 = correct, direct, and appropriate tone

Return ONLY this JSON, no additional text:
{"score": number, "reason": "one sentence explaining the score"}`,
    },
  ],
});
```
Step 4: Extract and store the results
The judge returns text. You extract that text from the response and parse it to get the score:
```typescript
// Extract the text from the response. The SDK returns a list of content
// blocks; here we assume the first one is a text block.
const text = response.content[0].text;

// The model may return the JSON preceded by explanatory text. If direct
// parsing fails, try extracting the JSON block with a regex before giving up.
function parseEvaluation(raw: string): { score: number; reason: string } | null {
  try {
    return JSON.parse(raw);
  } catch {
    const match = raw.match(/\{[\s\S]*\}/)?.[0];
    if (match) {
      try {
        return JSON.parse(match);
      } catch {
        /* fall through */
      }
    }
    return null;
  }
}

const evaluation = parseEvaluation(text);
if (evaluation) {
  console.log(evaluation.score);  // 1-5
  console.log(evaluation.reason); // judge's explanation
} else {
  // Retry with a revised prompt that reinforces strict JSON format.
  // If the error persists, flag the evaluation for manual inspection:
  // silently losing an evaluation is worse than knowing it failed.
  console.error("Judge did not return valid JSON:", text);
}
```
Then you save the result somewhere: a database, a CSV file, whatever. The real value isn’t in the single evaluation, but in the history: does the average score go up or down after changing your agent’s prompt?
Without history, you’re evaluating without learning anything.
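One simple way to keep that history is to append each evaluation to a JSONL file (one JSON object per line). A minimal sketch, assuming our own record convention with the fields mentioned above; nothing here is part of any SDK:

```typescript
import { appendFileSync, readFileSync } from "node:fs";

// Our own record shape: score, judge's reason, prompt version, model, timestamp.
type EvaluationRecord = {
  score: number;
  reason: string;
  promptVersion: string;
  model: string;
  timestamp: string;
};

// Append one evaluation to a JSONL file so the history survives restarts.
function saveEvaluation(path: string, record: EvaluationRecord): void {
  appendFileSync(path, JSON.stringify(record) + "\n");
}

// Read the whole history back and compute the average score.
function averageScore(path: string): number {
  const records = readFileSync(path, "utf8")
    .trim()
    .split("\n")
    .map((line) => JSON.parse(line) as EvaluationRecord);
  return records.reduce((sum, r) => sum + r.score, 0) / records.length;
}
```

A database does the same job at scale; the point is that every evaluation ends up somewhere queryable, not in a console log.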
Evaluation approaches
You don’t always use the judge in the same way. There are four ways to approach it depending on what you need to measure. The following decision tree helps you choose the right approach:
Direct evaluation: The judge reads a response and assigns it an absolute score according to your rubric. It’s the simplest approach and the one you should use first. Ideal for measuring if the agent is hallucinating (making up information that doesn’t exist), if it has the right tone, or if it directly answers the question.
Pairwise evaluation: You give the judge two responses and ask which is better. You change your agent’s prompt, generate responses with the old and new versions, and the judge decides. This approach detects small improvements that would be hard to see with an absolute score.
Evaluation with reference: You have an “ideal response” written by a human. The judge compares the agent’s response against that reference. It requires more initial work, because someone has to write the ideal responses, but it’s the most accurate approach when it matters that the agent reproduces a specific human criterion.
Intermediate step evaluation: If your agent executes several actions before giving the final response, the judge can evaluate each step separately. Did it choose the right tool? Did it formulate the intermediate query well? This is useful when the error isn’t in the final response but in the reasoning that led to it.
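The pairwise approach can be sketched as follows. To reduce position bias, each pair is judged twice with the order swapped, and a winner is declared only when both passes agree. `askJudge` is a placeholder for a real model call (like the `anthropic.messages.create` example above) that we assume returns “A” or “B”:

```typescript
type Verdict = "A" | "B";

// Build the pairwise prompt; the labels A/B refer to display order, not versions.
function buildPairwisePrompt(question: string, first: string, second: string): string {
  return [
    `User question: "${question}"`,
    `Response A: "${first}"`,
    `Response B: "${second}"`,
    `Which response is better? Answer with exactly one letter: A or B.`,
  ].join("\n");
}

async function pairwiseWinner(
  question: string,
  oldResponse: string,
  newResponse: string,
  askJudge: (prompt: string) => Promise<Verdict>,
): Promise<"old" | "new" | "tie"> {
  // Pass 1: old response shown first.
  const v1 = await askJudge(buildPairwisePrompt(question, oldResponse, newResponse));
  // Pass 2: same pair, order swapped.
  const v2 = await askJudge(buildPairwisePrompt(question, newResponse, oldResponse));
  const winner1 = v1 === "A" ? "old" : "new";
  const winner2 = v2 === "A" ? "new" : "old";
  // Only count a win when the judge prefers the same response in both orders.
  return winner1 === winner2 ? winner1 : "tie";
}
```

A position-biased judge that always picks the first response produces a tie here instead of a false win, which is exactly the behavior we want.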
What metrics make sense to evaluate
Not everything can be evaluated with programming rules. The judge model shines in abstract cases, where there’s no objectively verifiable “correct” answer:
Fidelity and hallucinations: If your agent consults a database before answering (using the RAG pattern, which means the agent searches for relevant information before generating the response), is the response based on that data or did the model make up information? Automatically detecting hallucinations is one of the most common use cases for this pattern.
Relevance: Does the response directly answer what the user asked, or did it drift off topic?
Tool usage: If your agent can call external tools (tool calling, when the model can use functions defined by you like querying an API or searching a database, more on this here), the judge can evaluate if it chose the right tool and with appropriate parameters.
Policy compliance: Did the agent follow the instructions you gave it? Did it avoid topics it shouldn’t touch?
Real advantages and limitations
Advantages:
| Aspect | Detail |
|---|---|
| Scalability | Evaluates thousands of responses per minute. A human team can’t do that. |
| Consistency | Applies the same rubric every time, without variation |
| Cost | Much cheaper than dedicating people’s time to review outputs |
Limitations:
| Aspect | Detail |
|---|---|
| Position bias | In pairwise evaluation, the judge may systematically prefer the first response just because it appears first. Mitigation: alternate the order when doing evaluations |
| LLM narcissism | Models tend to give higher scores to responses that match their own style. If the judge and your agent are the same model, results aren’t reliable |
| Judge hallucinations | The judge can also make mistakes when evaluating. An incorrect evaluation that looks objective is worse than having no evaluation |
Common mistakes
Using the same model as judge and agent
If your agent uses claude-opus-4-6 and the judge is also claude-opus-4-6, the judge will tend to rate highly the kind of answer it would have produced itself. That’s not objective evaluation. The judge must be a different model or, at least, a significantly more capable one.
Ambiguous rubric
“Evaluate if the response is useful” isn’t a rubric, it’s a vague request. The judge will interpret “useful” differently in each call. A functional rubric has concrete scores tied to specific, measurable criteria. Without that, what you get is inconsistency dressed up as evaluation.
Not manually reviewing a sample
The judge model also makes mistakes. It’s worth reviewing 20 or 30 evaluations manually every so often to check that the judge is being consistent with what you’d expect. If the judge gives 5 points to responses you’d rate 2, something’s wrong with the rubric or the evaluator prompt. This manual review is what tells you if you can trust the judge.
Evaluating without saving history
A single evaluation doesn’t say much. If you change your agent’s prompt today and don’t have the previous data, you can’t know if it improved or got worse. Always save the score, date, prompt version, and model used. With those four fields you can already build an evolution chart.
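With those fields stored, the comparison across prompt versions is a small aggregation. A sketch, assuming the record convention from Step 4 (our own field names, not from any library):

```typescript
// Minimal record shape for this aggregation.
type StoredEvaluation = { score: number; promptVersion: string };

// Average score per prompt version: the raw material for an evolution chart.
function averageByVersion(records: StoredEvaluation[]): Map<string, number> {
  const sums = new Map<string, { total: number; count: number }>();
  for (const r of records) {
    const entry = sums.get(r.promptVersion) ?? { total: 0, count: 0 };
    entry.total += r.score;
    entry.count += 1;
    sums.set(r.promptVersion, entry);
  }
  return new Map(
    [...sums].map(([version, { total, count }]) => [version, total / count]),
  );
}
```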
Implementation checklist
- The rubric has concrete scores tied to specific, measurable criteria
- The judge model is different and more capable than the agent model
- The evaluation prompt includes original question, response, and complete rubric
- The judge returns structured output (JSON) for easy automatic parsing
- Results are stored with timestamp and prompt version to track evolution
- A sample is manually reviewed periodically to validate that the judge is consistent
Frequently Asked Questions
How much does it cost to use a judge model?
It depends on the model and evaluation volume. For small projects, the cost is almost irrelevant: evaluating a short response with a state-of-the-art model costs fractions of a cent. For very high volumes it’s worth doing the math before scaling, but in most initial projects it’s not the first concern.
Do I always need a more expensive model for the judge?
Not necessarily more expensive, but more capable for the type of evaluation you’re doing. If you only need to detect if a response contains certain words or follows a specific format, that’s done with normal code without any model. The judge model is worth it when the criterion is abstract: tone, reasoning, fidelity to a source.
How do I know if my rubrics are good?
Take 10 responses you’d clearly rate well and 10 you’d clearly rate poorly. Pass them through the judge with your rubric. If the judge and you agree on most of them, the rubric works. If not, review the criteria where you disagree most; that’s where the problem is.
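That agreement check can be reduced to one number. A sketch, assuming the example rubric above where 4 and 5 are the passing scores (that threshold, and the field names, are illustrative):

```typescript
// One manually labeled case: your own good/bad verdict plus the judge's score.
type LabeledCase = { humanLabel: "good" | "bad"; judgeScore: number };

// Treat a judge score of 4 or 5 as "good" and count how often the judge
// agrees with your labels. A low rate points at the rubric or judge prompt.
function agreementRate(cases: LabeledCase[]): number {
  const agreed = cases.filter(
    (c) => (c.judgeScore >= 4) === (c.humanLabel === "good"),
  ).length;
  return agreed / cases.length;
}
```

If the rate is high on your 20 labeled cases, the rubric works; the cases where you disagree are where to tighten the criteria.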
Does this pattern replace traditional programming tests?
No. Programmatic tests (ones that check if the output meets an exact condition) are still faster, cheaper, and more reliable for what they can cover. The judge model covers the space that programmatic tests can’t touch: semantic quality, coherence, tone. They’re complementary. If you want to see how it fits with other evaluation approaches, the post How to Evaluate AI Agents in Production has the complete picture.