How to Make Your Agents Improve Themselves With This Skill

The methodology I use to make Claude Code improve itself through measuring, proposing hypotheses, iterating, and validating results

Contributors: Ivan Garcia Villar

The problem with agents isn’t building them—it’s optimizing them to work exactly how you want them to.

You write an idea, tweak the prompt, run a few tests, and hope it always works.

What happens when you can’t get it working exactly as expected?

You make some changes, run more tests… Does that sound like a reliable way to build your agents? To me, it sounds like a trial-and-error process where you don’t really understand why things work when they work, or why they fail when they fail.

At that pace, your agent stays as it is. Not because there’s no room for improvement, but because the cost of exploring is too high.

Today I’m bringing you a methodology and a skill that help you not only make this process more rigorous but also partially automate it, letting the AI improve itself. Sounds good, right?


LLM as Judge: The System Foundation

This methodology rests on the observation that AI models are much better at evaluating results than at generating them. You set one AI to evaluate what another AI produces, then optimize the generator to improve the result.

Obviously this has limits: the judging model’s criteria are only as good as the model itself. But because models are better at evaluating than at generating, the approach works even when the judge is no more powerful than the generator.

The agent acting as judge (Claude Code) can modify both the prompt and the architecture of the agents it evaluates (if there’s more than one step), propose using other models, research different techniques online, etc.
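To make the idea concrete, here is a minimal LLM-as-judge sketch. Everything in it is illustrative rather than part of the skill itself: `JUDGE_PROMPT`, `judge`, and `ask_model` are hypothetical names, and `ask_model` stands in for whatever client you actually use (the Anthropic SDK, a local model, etc.), as long as it takes a prompt string and returns the model’s text reply.

```python
# Hypothetical sketch: pairwise comparison with an LLM acting as judge.
# `ask_model` is a placeholder for your actual model client.

JUDGE_PROMPT = """You are evaluating two candidate outputs for the same task.

Task: {task}

Candidate A:
{a}

Candidate B:
{b}

Which candidate better satisfies the task? Answer with exactly one letter: A or B."""

def judge(task: str, a: str, b: str, ask_model) -> str:
    """Return 'A' or 'B' according to the judging model."""
    reply = ask_model(JUDGE_PROMPT.format(task=task, a=a, b=b))
    verdict = reply.strip().upper()
    if verdict not in ("A", "B"):
        raise ValueError(f"Unexpected judge reply: {reply!r}")
    return verdict
```

Forcing a single-letter verdict keeps the judge’s output trivially parseable, which matters once this runs unattended inside a loop.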

If You Don’t Measure, You Don’t Improve

Imagine you start going to the gym. On the first day they weigh you: 75 kilos. That number is your baseline, the starting point against which you’ll compare everything that comes after.

Now you need to define your goal—let’s say it’s to lose a few kilos.

If after a month you weigh 74 kilos, you know the routine worked. If you weigh 76, something’s wrong. Without that initial number, you’d have no idea if a month of training actually helped.

Agents work the same way. A metric is a number that tells you how well your system is working: how many correct responses it gives, how many errors it has, how long it takes. Without it, you can’t know if a change made things better or worse.

The mistake almost everyone makes at first: adjust the prompt, test a couple of cases by hand, and conclude “seems to work better.” That’s not systematic improvement. That’s intuition disguised as iteration.

What you need is a measurable starting point: a program that automatically tests your agent with a fixed set of cases and returns a comparable metric. Always the same cases, always the same judge, always the same process. That way numbers are comparable across experiments.

If that comparison can be done without an LLM, even better. For example, if the agent needs to retrieve an exact list of results you already know, no LLM is required: you already have the list, so you just count how many it found.
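That kind of LLM-free metric can be a few lines of code. A sketch, assuming the expected and retrieved results can be represented as sets of identifiers:

```python
def retrieval_recall(expected: set[str], found: set[str]) -> float:
    """Fraction of the known-correct results the agent actually retrieved."""
    if not expected:
        return 1.0  # nothing to find, trivially perfect
    return len(expected & found) / len(expected)

# If 2 of the 4 expected results were found, recall is 0.5.
retrieval_recall({"a", "b", "c", "d"}, {"a", "b", "x"})  # → 0.5
```

A deterministic metric like this is cheaper, faster, and perfectly reproducible, which is exactly what you want from a baseline.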

In cases where it’s impossible to define what the ideal result is, Claude Code acts as judge.

Step 0: Laying the Foundation

Define What You Want to Improve

The first and most important step: correctly define a measurable starting point and what “better” or “worse” means before you change anything.


Before you start, you need to answer four questions:

What’s the primary metric you want to improve? For example: “I want the agent to correctly classify 95% of the texts it receives.”

What other metrics can’t get worse? If you make the agent more accurate but it now takes twice as long, maybe it’s not an improvement. Define what you can’t sacrifice.

What makes a result better or worse? You need to give your judge enough context so it knows how to evaluate whether one experiment’s result is better than another.

What set of inputs will you use to measure? You need representative examples of the real problem: typical cases and difficult cases that usually fail. If you change the test examples between experiments, the numbers become incomparable and the entire process loses meaning. This condition is non-negotiable. If you need to change the dataset, you’re establishing a new baseline—don’t compare it with previous results.

With those questions answered, the skill builds the evaluation script, runs it against the system without changing anything, and records those numbers. That’s your zero point. From here on, everything is measured against that number. Not against your impression, not against “the last time I tested it.” Against that record.
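The shape of such an evaluation script might look like the following sketch. The function names and the JSON fields (`run_baseline`, `primary_metric`, `baseline.json`) are assumptions for illustration, not what the skill actually generates; `run_agent` and `score` are placeholders for your system and your metric.

```python
# Hypothetical baseline runner: fixed cases in, comparable numbers out.
import json
import statistics
import time
from pathlib import Path

def run_baseline(cases, run_agent, score, out_path="baseline.json"):
    """Run a fixed set of cases through the agent, score each result,
    and record the aggregate numbers as the zero point."""
    scores, latencies = [], []
    for case in cases:
        start = time.perf_counter()
        output = run_agent(case["input"])
        latencies.append(time.perf_counter() - start)
        scores.append(score(case, output))
    record = {
        "n_cases": len(cases),
        "primary_metric": statistics.mean(scores),
        "mean_latency_s": statistics.mean(latencies),
    }
    Path(out_path).write_text(json.dumps(record, indent=2))
    return record
```

Writing the record to disk is the point: every later experiment is compared against that file, not against memory or impressions.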

Design the Experiment

The agent designs an executable experiment that takes the system you want to optimize, makes the measurements you’ve proposed, and returns you a result.

For example, if you’re optimizing how an agent searches the internet to find customers for your business and the metric you care about is the relevance of the contacts, the agent will design a script that runs your search algorithm on a series of predefined searches and evaluates the result against the metrics you’ve told it to consider. Those measurements are what drive every decision in the cycle that follows.

The Six-Step Cycle

With the baseline established, the next step is for our agent to enter an iterative self-improvement cycle.


Step 1: Read the current state. Claude Code reads the baseline metrics and the history of previous experiments.

Step 2: Propose a hypothesis. Claude Code reasons, researches online, proposes a new hypothesis, and documents it: what it wants to change, why it thinks it will work, and what result it expects to see in the numbers.

Step 3: Accept the hypothesis. At this point you can step in: read the hypothesis and decide whether it makes sense to test it or whether you’d rather iterate on it first. You can also tell it to skip this step initially so it can try the most obvious things on its own.

Step 4: Implement and measure. Claude Code makes the change in the code, runs the evaluation script, and records the new numbers.

Step 5: Decide. Did the primary metric improve without worsening the secondary ones? Adopt it. Did it not improve, or did something get worse? Discard it. You can make the decision or delegate it to Claude, depending on whether your model is good enough at judging the result or not.

Step 6: Document and commit. If adopted, make a commit with the experiment description. If discarded, make a revert and document why it didn’t work. Then go back to step 1.
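The decision logic of the cycle can be sketched in a few lines. This is a simplification under assumed names: `propose`, `apply_change`, `evaluate`, and `revert` are placeholders for what Claude Code actually does at each step, and the metric keys `primary`/`secondary` are hypothetical.

```python
# Hypothetical sketch of one pass through the six-step cycle.
def run_iteration(baseline, history, propose, apply_change, evaluate, revert):
    hypothesis = propose(baseline, history)      # steps 1-2: read state, propose
    apply_change(hypothesis)                     # step 4: implement
    result = evaluate()                          # step 4: measure
    improved = (result["primary"] > baseline["primary"]
                and result["secondary"] >= baseline["secondary"])  # step 5
    history.append({"hypothesis": hypothesis,    # step 6: document
                    "result": result,
                    "adopted": improved})
    if improved:
        return result, history                   # adopted: result is the new baseline
    revert()                                     # discarded: undo the change
    return baseline, history
```

Note that a discarded experiment still lands in `history`; as the article argues below, that negative record is half the value of the system.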

What matters isn’t the steps themselves, but that the agent follows them in order without skipping any. Without explicit methodology, models tend to combine several changes at once because “they all seem good,” fail to revert when something breaks, or forget the baseline between iterations. With explicit instructions in the skill, that doesn’t happen.

What You Learn After Several Cycles

Negative knowledge is as valuable as positive knowledge. A discarded experiment with its numbers and reasoning is real information. If in six months someone proposes the same idea that was already tested, you’ll be able to check if it was tested before and what the result was.

The system is capable of proposing hypotheses and improving itself. The agent can spend hours in this cycle while you do something else. When you come back, you don’t have a system that’s the same with a couple of tweaks: you have a log of experiments, an updated baseline, and many variations already evaluated. What used to be days of manual iteration is compressed into a few hours of autonomous execution.

How Do I Use This Skill?

Download the file and add it to your Claude Code project. From there, whenever you want to start an optimization session, you simply say something like:

“I want to optimize the classification agent. Follow the /experiment-driven-optimization methodology”

Claude Code reads the skill, knows exactly what to do, and asks you the questions from Step 0: what metric you want to improve, what can’t get worse, what test cases you’ll use. If you already have a baseline established from a previous session, it reads it from the README and goes straight into the cycle.

What changes when you have the skill is that you don’t have to remember the process or supervise each step. Claude Code guides you through each step so you don’t have to worry about the details.

A typical session works like this: you tell it to start and explain what you want to achieve; Claude creates the evaluation script, runs it, documents the results, generates hypotheses, and gets to work. You can tell it not to ask for validation before executing hypotheses, so it builds up a solid base of tested experiments before you start interacting with it.

What Systems This Works For

The methodology isn’t specific to any type of agent. I’ve used this cycle to optimize information retrieval pipelines, online data search strategies, automatic content generation… You can practically apply it to any algorithm that incorporates AI or that you’re unsure how to optimize.

You only need to meet two conditions.

  1. You’re able to design a fixed set of inputs that represents the real problem well.
  2. You can automatically calculate whether the output is better or worse, either algorithmically or through an agent acting as Judge.
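Condition 2 can be expressed as a single scoring function that prefers the cheap path. A sketch with assumed names: `score_output` and `judge_fn` are illustrative, and the convention that a case carries an `"expected"` key when the ideal output is known is an assumption of this example.

```python
def score_output(case, output, judge_fn=None):
    """Score algorithmically when the ideal output is known;
    otherwise fall back to an agent acting as judge.
    `judge_fn` is a placeholder returning a score in [0, 1]."""
    if "expected" in case:
        return 1.0 if output == case["expected"] else 0.0
    if judge_fn is None:
        raise ValueError("No expected output and no judge available")
    return judge_fn(case, output)
```

Keeping both paths behind one interface means the rest of the evaluation harness doesn’t care which kind of scoring a given case uses.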

This test base will also be very useful if in the future you want to change the model in your agentic system—you just change the model, measure, and compare. The only thing you can’t change is the Judge; if you do, you need to establish a new baseline.


How Many Test Cases Do I Need in the Test Runner?

It depends on the problem. The most important thing isn’t the exact number, but that the cases are varied and representative enough: include typical cases and difficult cases that usually fail.

You Can Download the Skill at the Top of This Article 🔝