Programmatic Tool Calling: Ending AI's Ping-Pong Problem

An advanced Anthropic pattern that cuts token usage by up to 37% by having the model execute tools from Python code in a sandbox, eliminating the traditional question-response loop.

If you’ve worked with AI tools that need to make multiple queries — fetch data, process it, filter it — you know the problem isn’t individual capability but the constant ping-pong between the model and your services. Every query is a complete round-trip that consumes tokens, adds latency, and fills the context with irrelevant noise. In this post, I’ll explain how Anthropic’s Programmatic Tool Calling turns that chatty assistant into an autonomous programmer that orchestrates complex tasks without constant supervision.

Why is traditional tool calling unsustainable?

Conventional tool calling works like this: the model requests a tool, you wait for the response, process it, return it, and repeat. It’s like hiring someone to cook who calls you for every ingredient: “Should I open the fridge?”, “Should I take out the eggs?”, “Should I crack them?”.

[Figure: traditional tool flow. User → Model → pause and call tool → receive data → process, repeated multiple times, illustrating the inefficient ping-pong cycle.]
The traditional pattern generates multiple round-trips: each query requires a model pause, data serialization, and context contamination with raw results.

Technically, each interaction requires:

  1. Context pause: The model stops processing
  2. Serialization/deserialization: Data travels back and forth via API
  3. Context contamination: Raw results fill the token window
  4. Complete round-trip: Network latency at every step

When you need to process 100 database records, this means 100 separate calls. The result: sluggish responses, costs that scale with every additional round-trip, and a context saturated with data the model doesn’t need for reasoning.
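To make the scaling concrete, here is a back-of-envelope cost model for the traditional loop; the latency and token figures are illustrative assumptions, not Anthropic benchmarks:

```python
# Back-of-envelope model: cost of the traditional round-trip loop.
# latency_s and tokens_per_result are illustrative guesses, not benchmarks.
def round_trip_cost(n_calls: int, latency_s: float = 0.8,
                    tokens_per_result: int = 400) -> dict:
    """Both latency and context usage grow linearly with the call count."""
    return {
        "latency_s": n_calls * latency_s,
        "context_tokens": n_calls * tokens_per_result,
    }

# 100 records processed one call at a time
print(round_trip_cost(100))
```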

How does Programmatic Tool Calling work?

Anthropic completely changed the rules. Instead of traditional ping-pong, the model writes Python code that orchestrates multiple tools inside a secure sandbox.

[Figure: Programmatic Tool Calling flow. User → Model generates Python code → sandbox (loops, conditionals, multiple tool calls without round-trips) → internal processing and filtering → only the final result returns to the Model.]
Programmatic Tool Calling executes all orchestration in the sandbox: the model generates code once, tools are called internally without interruptions, and only the final clean result returns to the model.

The flow works like this:

  1. Model generates code: Writes a script defining all the logic (loops, conditions, filters)
  2. Sandbox execution: Code runs in an isolated container with access to your tools
  3. Internal processing: Tools execute, data gets filtered and aggregated within the code
  4. Clean result: Only the final output reaches the model’s context

# Code that the model generates automatically
async def analyze_sales_by_region():
    regions = ["West", "East", "Central", "North", "South"]
    results = {}

    for region in regions:
        # Each call executes without round-trip to model
        # Claude should generate parameterized queries for security
        data = await query_database("SELECT revenue FROM sales WHERE region = ?", [region])
        results[region] = sum(row["revenue"] for row in data)

    # Only this result reaches the model's context
    top_region = max(results.items(), key=lambda x: x[1])
    return f"Top region: {top_region[0]} with ${top_region[1]:,}"

# The model sees: "Top region: West with $125,000"
# Doesn't see: 500 rows of raw database data

Which tools can be called programmatically?

The allowed_callers field in your tool definition controls this behavior:

// Direct calls only (traditional behavior)
{
  name: "search_emails",
  description: "Search user emails by keyword",
  input_schema: { /* ... */ },
  allowed_callers: ["direct"]  // Default if omitted
}

// Programmatic calls only (from code)
{
  name: "process_expense_report",
  description: "Process expense data and return JSON objects",
  input_schema: { /* ... */ },
  allowed_callers: ["code_execution_20260120"]
}

// Both modes available
{
  name: "query_database",
  description: "Execute SQL query and return results",
  input_schema: { /* ... */ },
  allowed_callers: ["direct", "code_execution_20260120"]
}

Rule of thumb: Use ["code_execution_20260120"] for tools that return large structured data or when you anticipate multiple sequential calls.
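A small helper can encode this rule of thumb when you build tool definitions; the helper itself is a sketch and not part of any official SDK:

```python
# Hypothetical helper that applies the rule of thumb above.
def tool_definition(name: str, description: str, input_schema: dict,
                    programmatic: bool = False, direct: bool = True) -> dict:
    """Build a tool definition with allowed_callers set explicitly."""
    callers = []
    if direct:
        callers.append("direct")
    if programmatic:
        callers.append("code_execution_20260120")
    return {
        "name": name,
        "description": description,
        "input_schema": input_schema,
        "allowed_callers": callers,
    }

# Bulk-data tool: programmatic only, so raw rows never reach the context
bulk = tool_definition("query_database", "Execute SQL query and return results",
                       {"type": "object"}, programmatic=True, direct=False)
```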

When to choose programmatic vs direct vs hybrid?

| Scenario | Approach | Justification |
| --- | --- | --- |
| Single simple query | Direct | No sandbox overhead |
| Large dataset processing | Programmatic | Filters data before context |
| Multiple loops/iterations | Programmatic | Avoids N round-trips |
| Tool requiring UI/confirmation | Direct | Human oversight before irreversible actions |
| Complex conditional analysis | Programmatic | Control logic in code |
| Critical tool that must always work | Hybrid | Two separate tools: one ["direct"], the other ["code_execution_20260120"] |
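This decision table can be collapsed into a simple mode picker; the thresholds below (3 calls, 10 KB) are heuristics from experience, not official limits:

```python
# Heuristic mode picker based on the decision table above.
# The thresholds (3 calls, 10 KB) are rules of thumb, not official limits.
def choose_mode(expected_calls: int, result_size_kb: float,
                needs_confirmation: bool = False) -> str:
    if needs_confirmation:
        return "direct"        # human oversight before irreversible actions
    if expected_calls >= 3 or result_size_kb > 10:
        return "programmatic"  # filter data in the sandbox, avoid N round-trips
    return "direct"            # single small query: sandbox overhead not worth it

print(choose_mode(expected_calls=100, result_size_kb=500))
```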

The three complementary “superpowers”

Anthropic didn’t stop at Programmatic Tool Calling. They introduced three improvements that enhance the pattern:

Dynamic Filtering in web searches

Previously, when Claude read a web page, it would swallow ads, menus, and HTML garbage. Now it generates code that “cleans” content before processing it [1].

Activation: Use the web_search_20260209 or web_fetch_20260209 tools with the beta header code-execution-web-tools-2026-02-09.

Result: 24% fewer input tokens and 11% better accuracy [1]. In BrowseComp, Sonnet 4.6 jumped from 33.3% to 46.6% accuracy [1].

Tool Search

You no longer need to load all your tool manuals “just in case.” Claude searches for the tool it needs when it needs it, learns its interface on the fly, and executes it. The pattern: when you have a large catalog and don’t want to preload all definitions, Claude can query a tool registry on demand.

Result: 85% fewer tokens at startup [2]. From ~77K tokens to ~8.7K tokens in initial configuration.
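The on-demand pattern amounts to a keyword lookup over a registry, so full definitions are loaded only when needed; the registry contents below are invented for illustration:

```python
# Hypothetical tool registry: short descriptions only, full schemas on demand
TOOL_REGISTRY = {
    "query_sales_db": "Execute SQL query and return rows as JSON",
    "send_invoice": "Send an invoice PDF to a customer email",
}

def search_tools(keyword: str) -> list:
    """Return names of tools matching the keyword, instead of preloading all."""
    keyword = keyword.lower()
    return [name for name, desc in TOOL_REGISTRY.items()
            if keyword in name.lower() or keyword in desc.lower()]

print(search_tools("sql"))
```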

Tool Use Examples

Instead of robotic instructions about complex forms, you provide real usage examples. Claude learns through pattern matching.

Result: Parameter handling precision from 72% to 90% [2].
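A tool definition with worked examples might look like the sketch below; the field name input_examples is my assumption here, so check the current API docs before relying on it:

```python
# Sketch of a tool definition with worked examples. The exact field name
# ("input_examples") is an assumption -- verify against the API reference.
expense_tool = {
    "name": "process_expense_report",
    "description": "Process expense data and return JSON objects",
    "input_schema": {
        "type": "object",
        "properties": {
            "amount": {"type": "number"},
            "currency": {"type": "string"},
            "category": {"type": "string"},
        },
        "required": ["amount", "currency"],
    },
    "input_examples": [
        {"amount": 42.50, "currency": "USD", "category": "meals"},
        {"amount": 1200, "currency": "EUR", "category": "travel"},
    ],
}
```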

Practical implementation with TypeScript

Step 1: Initial request

import { Anthropic } from "@anthropic-ai/sdk";

const anthropic = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

async function analyzeCustomerData() {
  const response = await anthropic.messages.create({
    model: "claude-opus-4-6",
    max_tokens: 4096,
    messages: [{
      role: "user",
      content: "Analyze revenue by region for last quarter and identify the top 3 customers"
    }],
    tools: [
      {
        type: "code_execution_20260120",
        name: "code_execution"
      },
      {
        name: "query_sales_db",
        description: "Execute SQL query. Returns JSON array of rows with columns: customer_id, region, revenue, date",
        input_schema: {
          type: "object",
          properties: {
            sql: { type: "string", description: "SQL query to execute" }
          },
          required: ["sql"]
        },
        allowed_callers: ["code_execution_20260120"]
      }
    ]
  });

  return response;
}

Step 2: Tool call response loop

async function handleToolResults(response, conversationHistory) {
  // If there are pending tool_use blocks, respond with tool_result
  while (response.stop_reason === "tool_use") {
    const toolUseBlocks = response.content.filter(block => block.type === "tool_use");

    // Execute all tools in parallel
    const results = await Promise.all(
      toolUseBlocks.map(toolUse => executeYourTool(toolUse.name, toolUse.input))
    );

    // Respond ONLY with tool_result blocks (no additional text)
    const toolResponse = await anthropic.messages.create({
      model: "claude-opus-4-6",
      max_tokens: 4096,
      container: response.container?.id, // Reuse container if exists
      messages: [
        ...conversationHistory,
        { role: "assistant", content: response.content },
        {
          role: "user",
          content: toolUseBlocks.map((toolUse, index) => ({
            type: "tool_result",
            tool_use_id: toolUse.id,
            content: results[index]
          }))
        }
      ],
      tools: [/* same tools */]
    });

    response = toolResponse;
  }

  return response; // stop_reason: "end_turn"
}

Key point: The detailed output format description (Returns JSON array of rows with columns...) is critical. Claude uses this info to write code that correctly processes the results.

Common errors

Mixed content responses in programmatic tool calls

When there are pending programmatic tool calls, your response must contain only tool_result blocks; including any text block causes an API error.

// ❌ Invalid: Mixing text with tool_result
{
  "role": "user",
  "content": [
    {"type": "tool_result", "tool_use_id": "toolu_01", "content": "[{\"customer_id\": \"C1\"}]"},
    {"type": "text", "text": "What's next?"}  // This causes error
  ]
}

// ✅ Valid: Only tool_result for programmatic calls
{
  "role": "user",
  "content": [
    {"type": "tool_result", "tool_use_id": "toolu_01", "content": "[{\"customer_id\": \"C1\"}]"}
  ]
}

Orchestrator that also makes direct queries

If you define a tool with ["direct", "code_execution_20260120"], Claude can use both modes in the same conversation. This breaks flow predictability.

Symptom: Sometimes you see raw data in context, sometimes you don’t.

Solution: Choose one mode per tool. If you need both, create two distinct tools.
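Splitting one dual-mode tool into two single-mode tools can be sketched like this; the tool names are hypothetical:

```python
# Hypothetical split of one dual-mode tool into two single-mode tools
base = {
    "description": "Execute SQL query and return results",
    "input_schema": {
        "type": "object",
        "properties": {"sql": {"type": "string"}},
        "required": ["sql"],
    },
}

# Direct mode: for one-off queries with human oversight
interactive_query = {**base, "name": "query_database",
                     "allowed_callers": ["direct"]}

# Programmatic mode: for bulk processing inside the sandbox
batch_query = {**base, "name": "query_database_batch",
               "allowed_callers": ["code_execution_20260120"]}
```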

Tools without detailed output schema

Claude needs to know exactly what format your tool returns to write code that processes it.

// ❌ Bad: Vague description
description: "Get user data"

// ✅ Good: Specific format
description: "Returns user object with fields: id (string), name (string), email (string), created_at (ISO date)"

Ignoring container expiration

Containers expire after ~4.5 minutes of inactivity. If your tool takes longer to respond, the code receives a TimeoutError.

Solution: Monitor the expires_at field in responses and implement timeouts in your tools.
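A minimal expiry check, assuming expires_at arrives as an ISO 8601 timestamp with a UTC offset, could look like this:

```python
from datetime import datetime, timedelta, timezone

def container_expiring_soon(expires_at_iso: str, margin_s: int = 30) -> bool:
    """True when the container expires within margin_s seconds;
    the caller should then start a fresh container instead of reusing it."""
    expires_at = datetime.fromisoformat(expires_at_iso)
    return expires_at - datetime.now(timezone.utc) < timedelta(seconds=margin_s)
```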

Tool results with unvalidated executable content

Tool results get processed in the Python sandbox, exposing specific threats: using eval() on unvalidated results, dynamic SQL construction from external data, shell injection when passing outputs to subprocesses, and prompt injection from malicious web content.

Solution: Use parameterized queries for SQL (as in the earlier Python example), validate data structures before processing them, and sanitize web content. Avoid eval(), exec(), or subprocess with tool result data without prior validation.
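A structural validation step, sketched below, replaces any temptation to eval() a tool result; the required keys are illustrative:

```python
import json

REQUIRED_KEYS = frozenset({"customer_id", "revenue"})  # illustrative schema

def parse_rows(tool_result: str) -> list:
    """Parse a tool result strictly as JSON and validate its shape.
    Never eval() or exec() content that came back from a tool."""
    rows = json.loads(tool_result)  # raises ValueError on non-JSON input
    if not isinstance(rows, list):
        raise ValueError("expected a JSON array of rows")
    for row in rows:
        if not isinstance(row, dict) or not REQUIRED_KEYS <= row.keys():
            raise ValueError("row is missing required keys")
    return rows
```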

Tools with side effects lacking idempotency

Tool calling includes automatic retries that can duplicate effects. Tools that write to databases, send emails, or process payments may execute multiple times without the developer’s knowledge.

Solution: Design critical tools as idempotent or implement deduplication via unique request_id to prevent duplicate effects.
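An in-memory deduplication sketch shows the idea; in production the seen-ID store would live in a database or cache, and the function name is hypothetical:

```python
# Idempotency via request_id deduplication. The in-memory set is a sketch;
# production code would persist seen IDs in a database or cache.
_processed_ids = set()

def send_payment(request_id: str, amount: float) -> dict:
    """A side-effecting tool that turns duplicate retries into no-ops."""
    if request_id in _processed_ids:
        return {"status": "duplicate_ignored", "request_id": request_id}
    _processed_ids.add(request_id)
    # ... the actual charge would happen here ...
    return {"status": "processed", "amount": amount}
```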

Known incompatibilities

| Feature | Status | Description |
| --- | --- | --- |
| strict: true | Not supported | Tools with strict structured outputs |
| tool_choice | Not supported | Cannot force programmatic mode for a specific tool |
| disable_parallel_tool_use: true | Not supported | Conflicts with parallel programmatic execution |
| MCP tools | Not supported | MCP connectors cannot be called programmatically |

Implementation checklist

  • Mass data tools configured with allowed_callers: ["code_execution_20260120"]
  • Detailed output format description in each tool (types, fields, structure)
  • Timeouts implemented in tools that can take >30 seconds
  • Tool result validation to prevent code injection
  • Monitor expires_at field to avoid container timeouts
  • Container reuse via container field for related sessions
  • Tests that verify only final output reaches context (no intermediate data)

Sources

  1. Improved Web Search with Dynamic Filtering — Claude Blog — data on accuracy improvements and token reduction in web searches.
  2. Advanced Tool Use Performance — Anthropic Engineering — metrics on token reduction and accuracy improvements in Tool Search and Tool Use Examples.
  3. Programmatic Tool Calling — Claude API Docs — official technical documentation on implementation and use cases.

Frequently Asked Questions

What happens if my tool fails during programmatic execution?

Python code receives the error as a string and Claude can handle it programmatically — retry, logging, fallbacks. It’s more resilient than direct mode because the error doesn’t interrupt the entire conversation.

Can I mix programmatic and direct tools in the same request?

Yes, but it’s not recommended. The hybrid pattern confuses the model about when to use each mode. Better to separate clearly: mass data tools → programmatic, UI/confirmation tools → direct.

How do I debug the code Claude generates internally?

The stdout field in code_execution_tool_result shows prints from the code. Use print() statements in your tool logic for debugging. You can also inspect the code in response.content[n].input.code.

Is sandbox overhead worth it for simple tasks?

No. For a single query with small response, container creation overhead exceeds the benefit. As an empirical heuristic: use programmatic mode when you anticipate 3+ calls or datasets >10KB. Measure p95 latency and token costs for your specific case.

Do containers maintain state between requests?

Yes, if you reuse the container ID. Variables, imports, and temporary files persist for ~4.5 minutes. This is useful for multi-step analysis where you need to keep datasets in memory.

Critical security warning: Only reuse containers within the same user or session. In multi-tenant applications, reusing a container between different users exposes variables, temporary files, and data from one user to another. Always invalidate the container ID when changing security context.
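One way to enforce that boundary is a per-user container map that is dropped whenever the security context changes; the registry below is illustrative, not an SDK feature:

```python
# Illustrative per-user container registry for multi-tenant isolation.
containers: dict = {}

def remember_container(user_id: str, container_id: str) -> None:
    """Associate a sandbox container with exactly one user."""
    containers[user_id] = container_id

def container_for(user_id: str):
    """Return the user's own container ID, never another user's."""
    return containers.get(user_id)

def invalidate(user_id: str) -> None:
    """Drop the mapping when the session or security context changes."""
    containers.pop(user_id, None)
```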

How do I implement the complete conversation loop?

The pattern is simple: while stop_reason === "tool_use", respond with tool_result blocks and continue until stop_reason === "end_turn". See the complete TypeScript example in the “Practical implementation” section showing the while loop for handling multiple programmatic tool calls.