Reflection in AI Agents: Self-Reflection and Cross-Reflection

How an AI agent evaluates and improves its own output using self-reflection and cross-reflection. Step-by-step code guide for developers getting started.

Contributors: Esther Aznar, Ivan Garcia Villar

Imagine you write an important email, send it, and a few hours later you reread it and realize you missed something important or made an error that should have been easy to catch. We all know that moment of “how did I miss that?” Now imagine that, before you sent it, someone had read it aloud to you. You almost certainly would have caught the mistake.

Reflection is exactly that, but applied to AI agents. Instead of generating a response and returning it as-is, the agent delegates a review task, usually to another instance of the same model or a different one with specific evaluation criteria. It’s the “wait, let me re-read this” before hitting send.

If you come from the development world, the idea will sound familiar: it’s the same as a code review (someone else reads it before the merge). An evaluation layer before giving the output the green light.

Prerequisite: To follow this post, you need to understand what AI agents are and how to chain them together. If you don’t, start with the prompt chaining post: it explains how to break a complex task into chained steps, which is the foundation for all of this.

What Is Reflection in an AI Agent?

Reflection is an agent design pattern that adds an explicit evaluation and correction phase before returning the final result.

An LLM doesn’t usually ask itself whether the process it’s following is correct, complete, or even sensible once it has committed to a certain path. Because models are non-deterministic, small errors can amplify across long tasks and complex workflows and produce unexpected results. Reflection adds that missing step: before returning the result, the agent evaluates it against specific criteria. If the result doesn’t meet them, the agent iterates. If it does, the agent returns it. This lets the agent review its result a second or third time, without being influenced by the previous reasoning process and against concrete quality criteria that you define.

There are three variants, depending on who does the evaluating:

Variant          | Who evaluates         | When to use it
-----------------|-----------------------|-----------------------------------------------------------
Self-reflection  | The same agent        | Improve style, structure, and completeness
Cross-reflection | A second agent critic | Reduce blind spots of the generator using a separate eval
Human reflection | A person              | Decisions requiring genuine human judgment

Self-Reflection: The Agent Evaluates Its Own Output

Self-reflection is the most straightforward variant: the same model that generates the output also evaluates it, but it plays a different role in each phase.

The simplest version defines separate phases in a single initial prompt, so the model generates first and then turns its attention to reviewing its own work. This approach is sometimes useful, but it’s very limited because the model remains influenced by the reasoning that led to the previous result. If it didn’t understand the task well when generating, it probably won’t detect the failure when reviewing either.

To get around this limitation, we split generation and review into separate inference calls: the output of the first call goes into the prompt of the second, along with precise instructions on how to approach the review. This pushes the model to behave differently, and critically, toward its first result.

Imagine our task is to generate articles for a blog, and our evaluator specializes in detecting deviations from our communication style guide:

You are a style evaluator.

Your task is to review whether the generated response complies with the company's editorial style requirements.
You should not directly improve the response. You should only evaluate it and return structured feedback.

Evaluate the response against these criteria:

[... Detailed description of evaluation criteria ...]

Rules:

- Use "approved": true only if the response meets quality standards.
- Include in "blockingIssues" only problems that must be fixed before accepting the response because they go against our style guide.
- Include in "suggestions" minor issues that don't directly violate any rule but that you think don't quite match our tone and vocabulary.
- approved should be true only if blockingIssues is empty.
- If approved is false, you must include at least one blockingIssue or actionable suggestion.

Original task:

{{task}}

Generated response:

{{output}}

The model is the same, but the new context completely changes how it responds, and that lets it detect problems it introduced in another thread. We’re changing its approach and objective (generating an article vs. correcting editorial style), which gives us a second “point of view”.
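As a minimal sketch of what “same model, different context” can look like in code, the two roles can be thin wrappers over a single LLM call. Everything named here is a hypothetical placeholder: `callLLM` stands in for whatever client your provider offers, and `createModel` and both system prompts are illustrative, not a specific library's API.

```typescript
type Critique = {
  approved: boolean;
  blockingIssues: string[];
  suggestions: string[];
};

// Hypothetical transport: receives a system prompt and a user message and
// returns the model's text. Swap in your provider's client here.
type CallLLM = (systemPrompt: string, userMessage: string) => Promise<string>;

// Illustrative prompts; the real ones would carry your full criteria.
const GENERATOR_PROMPT = "You write blog articles for the company.";
const EVALUATOR_PROMPT =
  "You are a style evaluator. Return only JSON with the fields " +
  "approved, blockingIssues and suggestions.";

function createModel(callLLM: CallLLM) {
  return {
    // Generator role. On a revision, the previous output and the critique
    // travel inside the user message.
    async generate(
      task: string,
      revision?: { previousOutput: string; critique: string },
    ): Promise<string> {
      const message = revision
        ? `${task}\n\nPrevious output:\n${revision.previousOutput}` +
          `\n\nFix these issues:\n${revision.critique}`
        : task;
      return callLLM(GENERATOR_PROMPT, message);
    },

    // Evaluator role: same LLM, different system prompt, fresh context.
    // Assumes the model returns valid JSON; validate it in real code.
    async evaluate(output: string, task: string): Promise<Critique> {
      const raw = await callLLM(
        EVALUATOR_PROMPT,
        `Original task:\n${task}\n\nGenerated response:\n${output}`,
      );
      return JSON.parse(raw) as Critique;
    },
  };
}
```

The evaluator never sees the generator's system prompt or reasoning, only the task and the finished output, which is what keeps the two roles independent.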

In the code below, generate() and evaluate() are simply two calls to the same LLM, each with its own system prompt.

type Critique = {
  approved: boolean;
  blockingIssues: string[];
  suggestions: string[];
};

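// Minimal contract the loop needs (an assumed shape). Both methods call
// the same LLM, each with its own system prompt.
interface ReflectiveModel {
  generate(
    task: string,
    revision?: { previousOutput: string; critique: string },
  ): Promise<string>;
  evaluate(output: string, task: string): Promise<Critique>;
}

// Self-reflection loop: the same model generates and evaluates,
// but in separate calls and with different prompts.
async function selfReflect(
  model: ReflectiveModel,
  task: string,
  maxRounds = 3,
): Promise<string> {
  let output = await model.generate(task);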

  for (let round = 0; round < maxRounds; round++) {
    const critique: Critique = await model.evaluate(output, task);

    if (critique.approved) break;

    output = await model.generate(task, {
      previousOutput: output,
      critique: [
        ...critique.blockingIssues,
        ...critique.suggestions,
      ].join("\n"),
    });
  }

  return output;
}

The stopping criterion depends on how critical the problems found are: as soon as the evaluator approves the output, the loop ends, which avoids unnecessary rounds.

This method has a clear limit. If the model has an incorrect understanding of something, it will use that same misunderstanding to evaluate itself. It’s like asking someone to grade an exam without knowing the right answers. If the problem is a bias inherent to the model or a lack of capability, the model probably won’t detect it and will run into its own limitations again.

That’s what cross-reflection with different models is for.

Cross-Reflection: A Second Model as Critic

Cross-reflection separates the roles into two distinct agents, usually run by different models: one generates the output and another evaluates it.

If you’ve read the model as judge post, this will sound familiar. Normally, when we talk about a model as judge, we use it to evaluate, score, or compare results that already exist. In this pattern, the judge is embedded in an operational feedback loop and runs before the final response is produced.

If the generator uses a fast model, the critic can use a more conservative and thorough one. Two models from different families can have different error patterns. That doesn’t guarantee the critic is right, but it can increase diversity of judgment and help detect flaws the generator missed. Depending on the type of task, this pattern can also cut costs: a cheaper model generates and a more expensive one only reviews, which reduces the expensive model’s token consumption on simple tasks.
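A cross-reflection loop could be sketched as follows. The `DraftGenerator` and `DraftCritic` interfaces are hypothetical: each would wrap a call to a different model, and nothing here is tied to a particular provider. Unlike a plain string return, this version also reports whether the final output was actually approved.

```typescript
type Critique = {
  approved: boolean;
  blockingIssues: string[];
  suggestions: string[];
};

// Hypothetical interfaces: each one would wrap a different model.
interface DraftGenerator {
  generate(task: string, feedback?: string): Promise<string>;
}
interface DraftCritic {
  evaluate(output: string, task: string): Promise<Critique>;
}

async function crossReflect(
  generator: DraftGenerator,
  critic: DraftCritic,
  task: string,
  maxRounds = 3,
): Promise<{ output: string; approved: boolean; rounds: number }> {
  let output = await generator.generate(task);

  for (let round = 1; round <= maxRounds; round++) {
    const critique = await critic.evaluate(output, task);
    if (critique.approved) return { output, approved: true, rounds: round };

    // Budget spent: stop here so we never return a draft nobody reviewed.
    if (round === maxRounds) break;

    const feedback = [
      ...critique.blockingIssues,
      ...critique.suggestions,
    ].join("\n");
    output = await generator.generate(task, feedback);
  }

  // The last draft was reviewed and rejected; the caller decides what to do.
  return { output, approved: false, rounds: maxRounds };
}
```

Returning the `approved` flag lets the caller escalate a rejected result, for example to a human reviewer, instead of shipping it silently.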

Human Reflection: When a Human Enters the Loop

There are decisions no agent should make. AI models are not infallible, and when a failure in the process you’re automating would be critical, human review is mandatory.

Human reflection adds a checkpoint in the loop: the agent pauses until it receives validation from a real person. The structure is the same as in cross-reflection, but the evaluation step waits for a person instead of a model, and nothing continues until someone responds.

// Loop with human checkpoint: the agent waits before continuing.
// generate() and waitForHumanApproval() are assumed to be defined elsewhere.
async function humanReflect(task: string): Promise<string> {
  const output = await generate(task);

  // The agent pauses here until a person responds
  const humanFeedback = await waitForHumanApproval({
    output,
    task,
    // Guide so the reviewer knows exactly what to evaluate
    reviewGuide: "Does the output meet business requirements and is it factually correct?",
  });

  if (humanFeedback.approved) return output;

  // If the human rejects, regenerate incorporating their feedback.
  // Note: this second output is returned without another review.
  return generate(task, { feedback: humanFeedback.comments });
}
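One possible shape for `waitForHumanApproval` is a promise that stays pending until a reviewer submits a decision. The sketch below keeps pending reviews in memory; `pendingReviews`, `submitReview`, and the id format are assumptions made for illustration, and a real system would persist the requests and notify the reviewer through its own channel.

```typescript
type HumanFeedback = { approved: boolean; comments: string };
type ReviewRequest = { output: string; task: string; reviewGuide: string };

// In-memory registry of reviews waiting for a person (illustrative only).
const pendingReviews = new Map<string, (feedback: HumanFeedback) => void>();
let nextReviewId = 1;

// Returns a promise that stays pending until submitReview() is called
// with the matching id.
function waitForHumanApproval(request: ReviewRequest): Promise<HumanFeedback> {
  const id = `review-${nextReviewId++}`;
  // Stand-in for a real notification (email, chat message, ticket...).
  console.log(`[${id}] ${request.reviewGuide}\n${request.output}`);
  return new Promise<HumanFeedback>((resolve) => {
    pendingReviews.set(id, resolve);
  });
}

// Called by the reviewer's UI or webhook. Returns false for unknown ids.
function submitReview(id: string, feedback: HumanFeedback): boolean {
  const resolve = pendingReviews.get(id);
  if (!resolve) return false;
  pendingReviews.delete(id);
  resolve(feedback);
  return true;
}
```

Because the agent simply awaits the promise, the pause can last seconds or days without changing the calling code, although a long-running process would need the pending state stored somewhere more durable than memory.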

The rule for knowing when to use it is simple: if the cost of an error clearly exceeds the cost of the time it takes to review the output, human review is not optional, it’s the right choice. Think of a contract, a response to an important customer, or code that goes straight to production.

Using cross-reflection can greatly improve the quality of your agents’ responses, but always remember that AI models are not deterministic and that the ultimate responsibility for a failure is yours, not the model’s.

The Real Cost of Each Iteration

Each call to the model consumes tokens, so more iterations translate directly into more money.

The growth pattern is worse than it appears. Each iteration carries not only the new output but also the previous output and the feedback from the previous round. The second call is more expensive than the first, and the third more expensive than the second. Even if you start with a base prompt of only 1,000 tokens, keep that growth in mind when estimating costs. Some providers offer discounts for prompt caching when part of the context repeats between calls. If your provider supports it and you reuse identical prefixes, caching can offset some of the cost.
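To make that growth concrete, here is a small estimate under assumed numbers: a 1,000-token base prompt, outputs of about 500 tokens, and critiques of about 200 tokens, with every round appended to the context. The output and critique sizes are illustrative assumptions, and the function counts only the generator's input tokens, not evaluator calls or output tokens.

```typescript
// Input tokens of each generator call when every round appends the
// previous output and its critique to the context.
// All sizes are assumed averages, not measurements.
function generatorInputTokens(
  basePromptTokens: number,
  outputTokens: number,
  critiqueTokens: number,
  calls: number,
): number[] {
  const inputs: number[] = [];
  for (let call = 0; call < calls; call++) {
    inputs.push(basePromptTokens + call * (outputTokens + critiqueTokens));
  }
  return inputs;
}

const inputsPerCall = generatorInputTokens(1000, 500, 200, 4);
const totalInputTokens = inputsPerCall.reduce((sum, tokens) => sum + tokens, 0);

console.log(inputsPerCall);    // [1000, 1700, 2400, 3100]
console.log(totalInputTokens); // 8200
```

Under these assumptions, four generator calls read 8,200 input tokens, roughly double the 4,000 you would estimate by multiplying the base prompt by the number of calls.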

The question you need to ask before adding more rounds is economic: does the error this iteration avoids cost more than the iteration itself? In most cases, two or three rounds is a reasonable ceiling. If the third iteration barely changes the output compared to the second, that’s your real limit.

Sometimes more rounds mean more deviation from the objective. Reviewing models aren’t focused on the original task objective but on finding errors, and their feedback can in turn pull the generating model away from that objective. Overusing cross-reflection rounds usually doesn’t improve results. If a single reviewer can’t find all the errors, consider running several reviewers, each with a different objective and focused on one type of problem. Then consolidate everything they find and either return it to the generator in one pass or fix the problems one by one.
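Running several focused reviewers and consolidating what they find might look like the sketch below. The `SpecializedCritic` shape and both helper names are hypothetical; each `review` function would wrap a model call with its own narrow prompt.

```typescript
type Finding = { critic: string; issue: string };

// Hypothetical shape: each critic wraps a model call with a narrow prompt.
type SpecializedCritic = {
  name: string;
  review: (output: string, task: string) => Promise<string[]>;
};

// Runs all critics in parallel and merges their findings, dropping
// duplicates (compared case-insensitively after trimming).
async function collectFindings(
  critics: SpecializedCritic[],
  output: string,
  task: string,
): Promise<Finding[]> {
  const perCritic = await Promise.all(
    critics.map(async (critic) => {
      const issues = await critic.review(output, task);
      return issues.map((issue) => ({ critic: critic.name, issue }));
    }),
  );

  const seen = new Set<string>();
  const merged: Finding[] = [];
  for (const findings of perCritic) {
    for (const finding of findings) {
      const key = finding.issue.trim().toLowerCase();
      if (seen.has(key)) continue;
      seen.add(key);
      merged.push(finding);
    }
  }
  return merged;
}

// One consolidated message for the generator instead of several critiques.
function formatFindings(findings: Finding[]): string {
  return findings.map((f) => `- [${f.critic}] ${f.issue}`).join("\n");
}
```

Sending the generator one merged list, rather than one critique per reviewer, keeps the revision prompt short and avoids asking it to fix the same problem twice.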

There’s also a technical limit you can’t ignore: models have a context ceiling. If you accumulate output and feedback from several rounds without controlling size, the call will typically fail with an API error once you exceed that window.
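One way to stay under the window is to keep only the most recent rounds that fit a token budget. In this sketch, `estimateTokens` uses a rough rule of thumb of about four characters per token for English text, which is an assumption and not a real tokenizer; use your provider's tokenizer when accuracy matters.

```typescript
type Round = { output: string; critique: string };

// Rough heuristic (assumption): about 4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keeps the newest rounds that fit in maxTokens, preserving their order.
// The latest round is always kept, even if it exceeds the budget alone.
function trimRounds(rounds: Round[], maxTokens: number): Round[] {
  const kept: Round[] = [];
  let used = 0;

  // Walk from newest to oldest and stop at the first round that no longer fits.
  for (const round of [...rounds].reverse()) {
    const cost = estimateTokens(round.output) + estimateTokens(round.critique);
    if (kept.length > 0 && used + cost > maxTokens) break;
    kept.unshift(round);
    used += cost;
  }
  return kept;
}
```

Dropping the oldest rounds first is a deliberate choice: the latest output and critique are what the generator needs to act on, while early drafts rarely add information.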

Common Mistakes

Loop without Early Stopping Criterion

The most typical error in a first implementation: the loop always executes all rounds, even when the first output is already perfectly valid. The result is paying more and getting nothing in return.

The solution is simple: add an explicit approval criterion. A minimum score, an empty suggestions list, whatever fits your case. If the critic has nothing specific to point out, stop.
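A stopping check that combines those criteria could be sketched like this. The 0-to-1 score scale, the 0.8 default threshold, and the `ScoredCritique` shape are assumptions chosen for illustration; the third rule also stops when a round produced no change at all.

```typescript
// Assumed critique shape: a score between 0 and 1 plus a list of suggestions.
type ScoredCritique = { score: number; suggestions: string[] };
type StopReason = "score" | "no-suggestions" | "no-change";

// Returns why the loop should stop, or null if another round is justified.
function stopReason(
  critique: ScoredCritique,
  previousOutput: string | null,
  output: string,
  minScore = 0.8, // illustrative threshold, tune it for your case
): StopReason | null {
  if (critique.score >= minScore) return "score";
  if (critique.suggestions.length === 0) return "no-suggestions";
  // The last round did not change anything: more rounds will not help.
  if (previousOutput !== null && previousOutput.trim() === output.trim()) {
    return "no-change";
  }
  return null;
}
```

Returning the reason instead of a plain boolean makes it easy to log why each run ended and to spot loops that always stop for the same reason.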

Critic Sycophancy

Sycophancy is when the critic approves the output even though it has obvious problems. The model prefers to give positive feedback rather than create friction.

The symptom is easy to detect: the critic almost always approves on the first round, even with mediocre outputs. The fix is to add something like this to the critic’s system prompt: “Before approving the output, identify the weakest point in the response. If you still decide to approve it, explain why that point doesn’t block delivery.” Forcing the critic to justify its approvals usually makes the errors it reports more detailed, at the cost of consuming more tokens.
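Both the symptom and the fix can be checked in code. The sketch below assumes the critic is asked to return a `weakestPoint` and an `approvalReason` (hypothetical field names), rejects approvals that skip them, and computes the first-round approval rate from a log of past runs.

```typescript
// Assumed critique shape for a critic that must justify its approvals.
type JustifiedCritique = {
  approved: boolean;
  weakestPoint: string;   // required even when the output is approved
  approvalReason: string; // why the weakest point does not block delivery
};

// An approval only counts if the critic named a weak point and explained
// why it is acceptable.
function isCredibleApproval(critique: JustifiedCritique): boolean {
  return (
    critique.approved &&
    critique.weakestPoint.trim().length > 0 &&
    critique.approvalReason.trim().length > 0
  );
}

// Share of tasks approved in the very first round. A value close to 1
// across many tasks is the symptom described above.
function firstRoundApprovalRate(firstRoundApprovals: boolean[]): number {
  if (firstRoundApprovals.length === 0) return 0;
  const approved = firstRoundApprovals.filter(Boolean).length;
  return approved / firstRoundApprovals.length;
}
```

What counts as a suspicious rate depends on how good your generator already is, so treat the number as a signal to inspect samples by hand rather than as a hard rule.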

Using the Same Model Without Changing the Prompt

Self-reflection with the same system prompt as the generator does almost nothing. The model reproduces the same biases it had when generating, so it evaluates with the same blind spots.

The critic’s system prompt must actively lead the model to question the output, not confirm it.

Vague Feedback That’s Useless

If the critic returns “could improve” or “add more detail”, the generator doesn’t know what to do with that. Vague feedback produces another vague output. Feedback needs to be actionable: “the third paragraph repeats the idea from the first one, remove it” or “missing a concrete code example in the costs section”. The more specific the feedback, the more useful the next iteration.
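One way to push the critic toward actionable feedback is to require a structured issue with a location, a problem, and a fix, and to discard anything that doesn't fill all three. The `Issue` shape and the list of vague phrases are illustrative assumptions, not a complete filter.

```typescript
// Assumed structure the critic is asked to return for each issue.
type Issue = { location: string; problem: string; fix: string };

// Illustrative examples of fixes that give the generator nothing to act on.
const VAGUE_PHRASES = ["could improve", "add more detail", "needs work"];

function isActionable(issue: Issue): boolean {
  const hasAllFields = [issue.location, issue.problem, issue.fix].every(
    (field) => field.trim().length > 0,
  );
  const fix = issue.fix.trim().toLowerCase();
  const isVague = VAGUE_PHRASES.some((phrase) => fix === phrase);
  return hasAllFields && !isVague;
}

// Builds the feedback passed to the generator, keeping only usable issues.
function toFeedback(issues: Issue[]): string {
  return issues
    .filter(isActionable)
    .map((issue) => `${issue.location}: ${issue.problem}. Fix: ${issue.fix}`)
    .join("\n");
}
```

If the filter removes most of what the critic returns, that is a sign the critic's prompt needs work, not that the output is fine.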

Implementation Checklist

  • The loop has a defined maximum number of iterations

  • There’s an early stopping criterion: the loop stops before the max if the output is already satisfactory

  • The critic’s system prompt forces pointing out problems, not passive validation

  • The critic’s feedback is actionable: specifies what to change and where

  • Model calls have error handling for rate limits, timeouts, and unexpected responses
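For the last item on the checklist, a small retry wrapper with exponential backoff is a common starting point. This is a minimal sketch: `withRetry` and `sleep` are hypothetical helpers, the delays are arbitrary defaults, and it retries on any error, whereas real code would usually retry only transient failures such as rate limits and timeouts.

```typescript
function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => {
    setTimeout(resolve, ms);
  });
}

// Retries a model call with exponential backoff: baseDelayMs, then
// 2x, 4x... between attempts. Rethrows the last error when all fail.
async function withRetry<T>(
  call: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await call();
    } catch (error) {
      lastError = error;
      // No point in waiting after the final attempt.
      if (attempt < maxAttempts) {
        await sleep(baseDelayMs * Math.pow(2, attempt - 1));
      }
    }
  }
  throw lastError;
}
```

Each generate or evaluate call in the loop can then be wrapped, for example `withRetry(() => model.evaluate(output, task))`, so a single transient failure doesn't discard the rounds already paid for.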

Frequently Asked Questions

Does Reflection Replace Tests and Programmatic Validations?

No. They’re separate layers. Reflection improves subjective output quality: clarity, completeness, coherence. Tests and guardrails verify objective conditions: Is the JSON properly formatted? Does the code compile? Are mandatory fields present? The two are complementary.
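The two layers also meet inside the loop itself: the critic's reply is model output, so it deserves a programmatic check before the loop trusts it. Here is a minimal sketch of such a guardrail for the critique format used earlier in this post; `parseCritique` is a hypothetical helper name.

```typescript
type Critique = {
  approved: boolean;
  blockingIssues: string[];
  suggestions: string[];
};

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Returns a validated Critique, or null if the critic's reply is not
// usable: invalid JSON, missing fields, wrong types, or an approval
// that still lists blocking issues.
function parseCritique(raw: string): Critique | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch (_error) {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;

  const { approved, blockingIssues, suggestions } = data as Record<string, unknown>;
  if (typeof approved !== "boolean") return null;
  if (!isStringArray(blockingIssues) || !isStringArray(suggestions)) return null;

  // Rule from the evaluator prompt: approved only if blockingIssues is empty.
  if (approved && blockingIssues.length > 0) return null;

  return { approved, blockingIssues, suggestions };
}
```

When the function returns null, the loop can re-ask the critic or treat the round as a rejection, rather than crashing or approving by accident.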