Programmatic Tool Calling: Step-by-Step Implementation

There’s a technique that lets you run dozens of tools in parallel without the model seeing a single intermediate result. Nobody talks about it because it’s buried in the spec. Once you discover it, you’ll rethink how you design any agent pipeline.

Contributors: Ivan Garcia Villar

Imagine you have an agent that checks budget compliance for 20 employees. With traditional tool calling, every query is a full round trip to the Anthropic API: Claude requests a tool call, you run it and return the result, Claude processes it and requests the next one. That’s 20+ model calls, the context accumulates hundreds of kilobytes of intermediate data Claude barely needs, and latency stacks up fast. Programmatic Tool Calling (PTC) eliminates that ping-pong. In this post you’ll see the full loop in TypeScript, exactly what happens inside Claude’s context, and the mistakes I see most often when implementing it.

The hidden cost of one-at-a-time tool calls

The traditional pattern is intuitive: Claude decides which tool to call, you run it, return the result, and the cycle repeats. The problem is that each step forces the model to receive the result, process it, and decide the next action. Fine for one tool. For twenty, the seams start to show.

With 20 employees, the traditional flow accumulates like this:

  • 20 model calls, each requiring a full inference
  • Context grows with every response: if each result is ~500 tokens, by message 20 you’re carrying 10,000 tokens of intermediate data that Claude doesn’t need to generate the final summary
  • Total latency is the sum of ~22 inferences, each taking hundreds of milliseconds

The model doesn’t actually need that intermediate data. It only needs the final summary — who exceeded the limit and what the total overage was. Everything else is noise that takes up context space without contributing to the answer.
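To make the bullet math concrete, here is the same arithmetic as a runnable sketch. The 500-token and 100-token figures are illustrative assumptions from the scenario above, not measurements:

```typescript
// Rough context math for 20 tool calls, assuming ~500 tokens per result.
const resultTokens = 500;
const calls = 20;

// Traditional: every result stays in context, so by the last call you're
// carrying calls * resultTokens of intermediate data.
const traditionalIntermediate = resultTokens * calls;

// PTC: intermediate results live in the sandbox; only the final summary
// (assume ~100 tokens) ever reaches the model's context.
const ptcIntermediate = 100;

console.log({ traditionalIntermediate, ptcIntermediate });
// → { traditionalIntermediate: 10000, ptcIntermediate: 100 }
```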

PTC changes the equation entirely: instead of Claude processing each result, Claude writes code that processes all the results inside a sandbox. Only the final output reaches the context.

How Programmatic Tool Calling works

The core idea: instead of Claude calling each tool directly and waiting for the result in context, Claude writes Python code that orchestrates all the calls inside an execution sandbox. Intermediate results are processed in the sandbox without touching Claude’s context. Only the final output — the summary, the filtered list, the aggregated number — reaches the model.

The flow has five steps:

  1. You send the message with tools configured as programmatic (the allowed_callers field).
  2. Claude writes Python code that calls your tools in a loop, with conditionals, with filtering logic.
  3. The sandbox executes that code. When it needs a result from your tool, it pauses and returns a tool_use block.
  4. You execute the tool and return the result. The sandbox continues without going through a new model inference.
  5. When the code finishes, Claude receives only the final output and generates the response.

Step 4 is where the economics change: the 20 intermediate results from the employee queries don’t pass through Claude’s context. The model only sees the summary that the Python code generates at the end. In Anthropic’s multi-step research benchmarks, this context reduction brought token consumption down from 43,588 to 27,297, a 37% reduction [1].
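For intuition, here is the kind of orchestration logic Claude writes in step 2, sketched in TypeScript rather than the Python the real sandbox runs. The check_budget function is a hypothetical stand-in for the sandbox bridge: in the real flow, each call pauses execution until you return a tool_result (step 4). The simulated data matches the checkBudget handler in the implementation below.

```typescript
// TypeScript sketch of the orchestration code Claude generates (the real
// sandbox runs Python). check_budget stands in for the sandbox bridge that
// pauses execution until your tool_result arrives.
async function check_budget(employee_id: string): Promise<{
  employee_id: string;
  spend: number;
  limit: number;
  exceeded: boolean;
}> {
  // Same simulated data as the checkBudget handler used in this post
  const spend = 1500 + parseInt(employee_id.replace("E", ""), 10) * 173;
  return { employee_id, spend, limit: 3000, exceeded: spend > 3000 };
}

async function orchestrate(): Promise<string> {
  const ids = Array.from({ length: 20 }, (_, i) =>
    `E${String(i + 1).padStart(2, "0")}`
  );
  // All 20 lookups happen here; their raw results stay in the sandbox
  const results = await Promise.all(ids.map(check_budget));
  const over = results.filter((r) => r.exceeded);
  const totalOverage = over.reduce((sum, r) => sum + (r.spend - r.limit), 0);
  // Only this summary string ever reaches Claude's context
  return `${over.length} employees exceeded the limit; total overage: ${totalOverage}`;
}
```

The point is structural: loops, filtering, and aggregation run in the sandbox, so Claude never spends inference or context on the 20 raw results.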

To make it work, you need two things: the code_execution tool enabled, and your tools marked with allowed_callers.

Step-by-step TypeScript implementation

Here’s a complete script you can run with npx tsx ptc-employees.ts. You just need ANTHROPIC_API_KEY in your environment and @anthropic-ai/sdk installed.

Step 1: define tools with allowed_callers

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Real tool — this version simulates a database query
async function checkBudget(employee_id: string): Promise<string> {
  const spend = 1500 + parseInt(employee_id.replace("E", "")) * 173;
  return JSON.stringify({
    employee_id,
    spend,
    limit: 3000,
    exceeded: spend > 3000,
  });
}

const tools = [
  // The sandbox where Claude writes and executes Python code
  { type: "code_execution_20260120", name: "code_execution" },
  {
    name: "check_budget",
    // Output format description is critical: Claude will write code that processes this JSON
    description:
      "Checks an employee's budget. " +
      "Returns JSON: { employee_id: string, spend: number, limit: number, exceeded: boolean }",
    input_schema: {
      type: "object",
      properties: {
        employee_id: { type: "string", description: "Employee ID, e.g. E01" },
      },
      required: ["employee_id"],
    },
    // This line makes the tool callable from the sandbox
    allowed_callers: ["code_execution_20260120"],
  },
] as any[];

The allowed_callers field is what enables PTC for that tool. The possible values are ["direct"] (Claude calls it directly only), ["code_execution_20260120"] (sandbox only), or both. Anthropic’s docs recommend picking one: mixing both confuses the model about when and how to use the tool.

Step 2: the full loop with a tool handler

async function main() {
  const messages: Anthropic.MessageParam[] = [
    {
      role: "user",
      content:
        "Check the budget for employees E01 through E20. " +
        "Identify who exceeded the limit and calculate the total overage.",
    },
  ];

  let containerId: string | undefined;

  while (true) {
    const response = await (client.messages.create as any)({
      model: "claude-opus-4-6",
      max_tokens: 4096,
      // Reusing the container keeps sandbox state between iterations
      ...(containerId && { container: containerId }),
      messages,
      tools,
    });

    // Capture the container ID to pass in the next request
    if (response.container?.id) containerId = response.container.id;

    messages.push({ role: "assistant", content: response.content });

    if (response.stop_reason === "end_turn") {
      const text = response.content
        .filter((b: any) => b.type === "text")
        .map((b: any) => b.text)
        .join("\n");
      console.log(text);
      break;
    }

    const toolUses = response.content.filter((b: any) => b.type === "tool_use");

    if (toolUses.length === 0) break;

    const toolResults = await Promise.all(
      toolUses.map(async (toolUse: any) => {
        if (toolUse.name === "check_budget") {
          try {
            const result = await checkBudget(toolUse.input.employee_id);
            return {
              type: "tool_result" as const,
              tool_use_id: toolUse.id,
              content: result,
            };
          } catch (err: any) {
            return {
              type: "tool_result" as const,
              tool_use_id: toolUse.id,
              is_error: true,
              content: err.message,
            };
          }
        } else {
          return {
            type: "tool_result" as const,
            tool_use_id: toolUse.id,
            is_error: true,
            content: `Unknown tool: ${toolUse.name}`,
          };
        }
      })
    );

    // CRITICAL: only tool_results here. Additional text triggers an API error.
    messages.push({ role: "user", content: toolResults });
  }
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});

The container is the sandbox where Claude’s code runs. Passing it in every request keeps state between loop iterations — without it, each call starts a fresh sandbox and Claude’s code loses execution context. Dispatching by toolUse.name ensures each call invokes the right function; the try/catch inside the handler communicates errors back to the sandbox instead of letting the timeout expire.

A note on types: PTC is relatively new and the TypeScript SDK doesn’t yet expose full types for allowed_callers, container, or the caller field. The as any casts are temporary — check the SDK changelog for when official types land.

The caller field: programmatic vs direct

Every tool_use block in the response includes a caller field that indicates its origin:

  • { "type": "direct" } — Claude called the tool directly
  • { "type": "code_execution_20260120", "tool_id": "srvtoolu_..." } — the call came from the sandbox

In the example above you don’t need to distinguish them because all tools are programmatic. If you mixed direct and programmatic tools, you’d need to read caller.type to know how to respond in each case (the “tool_results only” constraint only applies when programmatic calls are pending).
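A minimal routing sketch for that mixed case, using loose any types since the SDK doesn’t expose the caller field yet:

```typescript
// Sketch: split a response's tool_use blocks by origin so the reply can
// follow the right rules. Types are loose (any) because the SDK does not
// yet expose the `caller` field.
function splitByCaller(content: any[]) {
  const programmatic: any[] = [];
  const direct: any[] = [];
  for (const block of content) {
    if (block.type !== "tool_use") continue;
    // caller.type is "direct" or "code_execution_20260120"
    if (block.caller?.type === "code_execution_20260120") {
      programmatic.push(block);
    } else {
      direct.push(block);
    }
  }
  // If `programmatic` is non-empty, the reply message must contain
  // ONLY tool_result blocks; otherwise trailing text is allowed.
  return { programmatic, direct };
}
```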

Container lifecycle

A container lasts approximately 4.5 minutes of inactivity [2]. The response includes container.expires_at with the exact timestamp. If your tool takes a long time to respond and the container expires, the sandbox receives a TimeoutError that Claude sees in stderr — it usually retries, but not always gracefully. For slow operations, implement timeouts on your side and communicate the error clearly in the tool_result.
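One way to act on that timestamp, sketched with hypothetical helpers; the 30-second safety margin is an arbitrary assumption, not a documented value:

```typescript
// Sketch: check remaining sandbox time before starting a slow tool call.
// `expiresAt` is the ISO timestamp from response.container.expires_at.
function msUntilExpiry(expiresAt: string, now: Date = new Date()): number {
  return new Date(expiresAt).getTime() - now.getTime();
}

// Run the slow tool only if it should finish with ~30s to spare
// (the margin is an assumption; tune it to your own latency profile).
function shouldRunSlowTool(
  expiresAt: string,
  estimatedMs: number,
  now: Date = new Date()
): boolean {
  return msUntilExpiry(expiresAt, now) > estimatedMs + 30_000;
}
```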

Before and after: impact on tokens and latency

The most important difference isn’t the number of API calls. With 20 employees, you’re still making 20 requests to return 20 results. The difference is what those calls cost.

Metric                   | Traditional tool calling              | Programmatic Tool Calling
Model inferences         | ~22 (one per tool call + start + end) | 2 (start + end)
Data in Claude’s context | All intermediate results              | Final output only
Context growth           | Linear: each result accumulates       | Constant: just the summary
Tokens (real benchmark)* | 43,588 tokens                         | 27,297 tokens

*Data from Anthropic on multi-step research tasks [1]. Actual reduction depends on how large your intermediate results are.

The savings are larger the bigger the intermediate data you don’t need to pass to the model. If each tool result is 2KB and you only need a single number at the end, PTC eliminates almost all of that weight from the context.

Current availability (March 2026)

PTC is available through the Anthropic API directly and through Azure AI Foundry [2]. Compatible models are Claude Opus 4.6, Claude Sonnet 4.6, Claude Sonnet 4.5, and Claude Opus 4.5 — all using the code_execution_20260120 type. MCP tools and tools with strict: true in the schema are not compatible with PTC for now.

When to use PTC (and when not to)

The practical question: when does the sandbox overhead pay off?

Cases where PTC clearly delivers value:

  • You need 3+ sequential tool calls and intermediate data is large
  • You have iteration or filtering logic that translates naturally to Python (loops, aggregates, sorting)
  • Tool results are large but you only need a subset or summary
  • The processing order doesn’t require Claude to reason about the data at each step

Cases where traditional tool calling is better:

  • A single tool with a small response — sandbox overhead doesn’t amortize
  • The flow requires Claude to evaluate a result and dynamically decide the next step based on its content
  • You need human confirmation between intermediate steps
  • Your tools have strict: true in the schema (incompatible with PTC)

The dividing line: if you can write the Python code that orchestrates the calls without Claude needing to reason between them, PTC is the right pattern. If the “which tool to call next” logic depends on evaluating the previous result with the model, you need direct tool calling or a hybrid approach.

Common mistakes when implementing PTC

1. Mixing text with tool_result in the response

This is the most frequent mistake and the most confusing because the error message isn’t always clear. When programmatic tool calls are pending, the API requires your response message to contain only tool_result blocks. No text before, no text after.

// ❌ Triggers API error
messages.push({
  role: "user",
  content: [
    { type: "tool_result", tool_use_id: "toolu_01", content: "..." },
    { type: "text", text: "Should I continue?" }, // invalid here
  ],
});

// ✓ Only tool_results when programmatic calls are pending
messages.push({
  role: "user",
  content: [
    { type: "tool_result", tool_use_id: "toolu_01", content: "..." },
  ],
});

This restriction only applies when responding to programmatic calls. For direct tool calling, you can include text after the tool_result blocks without issue.

2. Vague output format description

Claude writes Python code that deserializes and processes your tool results. If your description says only “returns employee data,” Claude doesn’t know whether to expect JSON, a plain string, or a number. The more precise your output format description — types, fields, JSON structure — the better Claude can write the processing code.

Bad description: "Returns employee information"

Good description: "Returns JSON with fields: employee_id (string), spend (number), limit (number), exceeded (boolean)"

3. Enabling allowed_callers: ["direct", "code_execution_20260120"] without a reason

Anthropic’s docs are clear: pick one or the other for each tool. Enabling both confuses the model about how to use the tool — it doesn’t know whether to call it directly or through the sandbox. If your tool is for PTC, use only ["code_execution_20260120"] and Claude will know what to do.

4. Not validating the data your tools return

The sandbox executes the code Claude writes, and that code processes what your tools return. If tool results come from external sources or contain user input, there’s an injection risk: data containing code fragments could be interpreted by the execution environment [2]. Validate and sanitize results before returning them.
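A minimal validation sketch for the check_budget result from the example above. The shape check and field whitelist are one possible approach, not a complete sanitizer:

```typescript
// Sketch: validate a tool result's shape before returning it to the
// sandbox, so malformed or adversarial data fails fast instead of being
// interpreted by Claude-written code. Fields match check_budget above.
function validateBudgetResult(raw: string): string {
  const data = JSON.parse(raw); // throws on non-JSON input
  if (
    typeof data.employee_id !== "string" ||
    typeof data.spend !== "number" ||
    typeof data.limit !== "number" ||
    typeof data.exceeded !== "boolean"
  ) {
    throw new Error("check_budget returned an unexpected shape");
  }
  // Re-serialize only the expected fields: anything extra is dropped
  return JSON.stringify({
    employee_id: data.employee_id,
    spend: data.spend,
    limit: data.limit,
    exceeded: data.exceeded,
  });
}
```

Whitelisting on the way out means an injected field in the upstream data never reaches the sandbox at all.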

5. Forgetting that containers expire

A container lasts ~4.5 minutes of inactivity. If you implement a flow with human steps between loop calls, the container may expire before you return the result. The sandbox receives a TimeoutError. Monitor container.expires_at in the response and design your flow to respond within the available window.

Implementation checklist

  • code_execution_20260120 included in the tools array
  • All PTC tools have allowed_callers: ["code_execution_20260120"]
  • Tool descriptions include the exact output format (types, JSON structure)
  • The loop responds with ONLY tool_result blocks when programmatic calls are pending
  • container.id is captured and passed in every request to maintain sandbox state
  • Tool results coming from external sources are validated
  • There is expires_at handling or timeouts for tools with variable latency

Sources

  1. Advanced Tool Use — Anthropic Engineering — token reduction data (43,588 → 27,297) from multi-step research benchmarks.
  2. Programmatic Tool Calling — Anthropic Docs — full reference: allowed_callers, caller field, container lifecycle, restrictions, and platform compatibility.

Frequently Asked Questions

Does PTC work with any Claude model?

Not all of them. As of this post, Programmatic Tool Calling is available on Claude Opus 4.6, Claude Sonnet 4.6, Claude Sonnet 4.5, and Claude Opus 4.5 — all using the code_execution_20260120 tool type. Older models don’t support this feature.

Do tool results from programmatic calls count as input tokens?

Not in terms of model tokens. The API protocol requires sending tool_result blocks back in the messages array — that’s part of the HTTP payload — but Anthropic doesn’t count them as model input tokens. Only the final output that the Python code generates counts. The larger your intermediate results and the less Claude needs to see them directly, the greater the savings.

Can I mix programmatic and direct tools in the same request?

Yes, but it complicates the loop. You’d need to read the caller field of each tool_use to know whether to respond with the PTC format (only tool_result) or the traditional tool calling format (you can add text). For simplicity, start with everything programmatic or everything direct, and mix only when you have a concrete reason to.

What if I need Claude to reason about a result before making the next call?

In that case PTC is not the right choice. PTC works when the Python code can make all the orchestration decisions without model inference between steps. If the flow requires Claude to evaluate a result and dynamically decide which tool to call next based on its semantic content, you need traditional tool calling. PTC and direct tool calling aren’t mutually exclusive — you can use them in different parts of the same system depending on which pattern fits best.

Are there features that complement PTC in the Anthropic ecosystem?

Two in particular. Tool Search Tool lets you load tools on demand rather than defining all of them upfront in the request, which significantly reduces input tokens when you have many tools available. Tool Use Examples teaches Claude with real input/output examples, improving accuracy when your tool schema is complex. Both are documented in Anthropic Engineering’s article on advanced tool use [1].