Programmatic Tool Calling: Ending AI's Ping-Pong Problem
An advanced Anthropic pattern that cuts token usage by up to 37% by executing tools from Python code in a sandbox, eliminating the traditional question-response loop.
If you’ve worked with AI tools that need to make multiple queries — fetch data, process it, filter it — you know the problem isn’t individual capability, but the constant ping-pong between the model and your services. Every query is a complete round-trip that consumes tokens, adds latency, and fills the context with irrelevant garbage. In this post, I’ll explain how Anthropic’s Programmatic Tool Calling turns the model from a chatty intern into an autonomous programmer that orchestrates complex tasks without bothering you.
Why is traditional tool calling unsustainable?
Conventional tool calling works like this: the model requests a tool, you wait for the response, process it, return it, and repeat. It’s like hiring someone to cook who calls you for every ingredient: “Should I open the fridge?”, “Should I take out the eggs?”, “Should I crack them?”.
Technically, each interaction requires:
- Context pause: The model stops processing
- Serialization/deserialization: Data travels back and forth via API
- Context contamination: Raw results fill the token window
- Complete round-trip: Network latency at every step
When you need to process 100 database records, this means 100 separate calls. The result: slow responses, costs that scale with every additional round-trip, and a context saturated with data the model doesn’t need for reasoning.
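To make the cost concrete, here’s a back-of-the-envelope sketch. All token counts below are illustrative assumptions, not measurements:

```python
# Illustrative only: rough context growth for N direct round-trips
# versus one programmatic call. Token counts are hypothetical.
TOKENS_PER_TOOL_RESULT = 500   # assumed size of the raw rows each call returns
TOKENS_FINAL_SUMMARY = 50      # assumed size of the aggregated final answer

def context_tokens_direct(n_calls: int) -> int:
    # Every raw result lands in the model's context window
    return n_calls * TOKENS_PER_TOOL_RESULT

def context_tokens_programmatic(n_calls: int) -> int:
    # Raw results stay in the sandbox; only the summary reaches context
    return TOKENS_FINAL_SUMMARY

print(context_tokens_direct(100))        # 50000 tokens of raw data
print(context_tokens_programmatic(100))  # 50 tokens, regardless of call count
```

The exact numbers will vary, but the shape doesn’t: direct mode grows linearly with the number of calls, programmatic mode stays flat.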
How does Programmatic Tool Calling work?
Anthropic completely changed the rules. Instead of traditional ping-pong, the model writes Python code that orchestrates multiple tools inside a secure sandbox.
The flow works like this:
- Model generates code: Writes a script defining all the logic (loops, conditions, filters)
- Sandbox execution: Code runs in an isolated container with access to your tools
- Internal processing: Tools execute, data gets filtered and aggregated within the code
- Clean result: Only the final output reaches the model’s context
```python
# Code that the model generates automatically
async def analyze_sales_by_region():
    regions = ["West", "East", "Central", "North", "South"]
    results = {}
    for region in regions:
        # Each call executes without a round-trip to the model.
        # Claude should generate parameterized queries for security.
        data = await query_database("SELECT revenue FROM sales WHERE region = ?", [region])
        results[region] = sum(row["revenue"] for row in data)
    # Only this result reaches the model's context
    top_region = max(results.items(), key=lambda x: x[1])
    return f"Top region: {top_region[0]} with ${top_region[1]:,}"

# The model sees: "Top region: West with $125,000"
# Doesn't see: 500 rows of raw database data
```
Which tools can be called programmatically?
The allowed_callers field in your tool definition controls this behavior:
```typescript
// Direct calls only (traditional behavior)
{
  name: "search_emails",
  description: "Search user emails by keyword",
  input_schema: { /* ... */ },
  allowed_callers: ["direct"] // Default if omitted
}

// Programmatic calls only (from code)
{
  name: "process_expense_report",
  description: "Process expense data and return JSON objects",
  input_schema: { /* ... */ },
  allowed_callers: ["code_execution_20260120"]
}

// Both modes available
{
  name: "query_database",
  description: "Execute SQL query and return results",
  input_schema: { /* ... */ },
  allowed_callers: ["direct", "code_execution_20260120"]
}
```
Rule of thumb: Use ["code_execution_20260120"] for tools that return large structured data or when you anticipate multiple sequential calls.
When to choose programmatic vs direct vs hybrid?
| Scenario | Approach | Justification |
|---|---|---|
| Single simple query | Direct | No sandbox overhead |
| Large dataset processing | Programmatic | Filters data before context |
| Multiple loops/iterations | Programmatic | Avoids N round-trips |
| Tool requiring UI/confirmation | Direct | Human oversight before irreversible actions |
| Complex conditional analysis | Programmatic | Control logic in code |
| Critical tool that must always work | Hybrid | Two separate tools: one ["direct"], other ["code_execution_20260120"] |
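The table condenses into a small decision helper. This is a sketch only: the thresholds (3+ calls, ~10 KB) are empirical heuristics, not API rules, and the function name is hypothetical:

```python
# Hypothetical helper encoding the decision table above.
# Thresholds are assumed heuristics; tune them against your own latency/cost data.
def choose_mode(expected_calls: int, payload_bytes: int,
                needs_human_confirmation: bool) -> str:
    if needs_human_confirmation:
        return "direct"          # human oversight before irreversible actions
    if expected_calls >= 3 or payload_bytes > 10_000:
        return "programmatic"    # filter data in the sandbox, avoid N round-trips
    return "direct"              # single small query: skip sandbox overhead

print(choose_mode(1, 2_000, False))   # direct
print(choose_mode(5, 2_000, False))   # programmatic
```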
The three complementary “superpowers”
Anthropic didn’t stop at Programmatic Tool Calling. They introduced three improvements that enhance the pattern:
Dynamic Filtering in web searches
Previously, when Claude read a web page, it would swallow ads, menus, and HTML garbage. Now it generates code that “cleans” content before processing it [1].
Activation: Use the web_search_20260209 or web_fetch_20260209 tools with the beta header code-execution-web-tools-2026-02-09.
Result: 24% fewer input tokens and 11% better accuracy [1]. In BrowseComp, Sonnet 4.6 jumped from 33.3% to 46.6% accuracy [1].
Internal Tool Search
You no longer need to load all your tool manuals “just in case.” Claude searches for the tool it needs when it needs it, learns its interface on the fly, and executes it. The pattern: when you have a large catalog and don’t want to preload all definitions, Claude can query a tool registry on demand.
Result: 85% fewer tokens at startup [2]. From ~77K tokens to ~8.7K tokens in initial configuration.
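The registry side of this pattern can be sketched in a few lines. Everything here — the registry contents and the `search_tools` name — is a hypothetical illustration of exposing one small search tool instead of preloading every definition:

```python
# Sketch of an on-demand tool registry (names and entries are hypothetical).
# Instead of preloading tens of thousands of tokens of tool definitions,
# expose a single small search tool that returns matching definitions on demand.
REGISTRY = [
    {"name": "query_sales_db", "description": "Execute SQL against the sales DB"},
    {"name": "send_invoice_email", "description": "Email an invoice PDF to a customer"},
    {"name": "fetch_exchange_rates", "description": "Fetch currency exchange rates"},
]

def search_tools(keyword: str) -> list[dict]:
    kw = keyword.lower()
    return [t for t in REGISTRY
            if kw in t["name"].lower() or kw in t["description"].lower()]

print([t["name"] for t in search_tools("sql")])  # ['query_sales_db']
```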
Tool Use Examples
Instead of robotic instructions about complex forms, you provide real usage examples. Claude learns through pattern matching.
Result: Parameter handling precision from 72% to 90% [2].
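A tool definition with examples might look like the sketch below. Note the hedge: the exact field name (here `input_examples`) and its placement are assumptions — check the current API docs before relying on them:

```python
# Hypothetical tool definition carrying usage examples; the field name
# `input_examples` is an assumption about the API shape, not confirmed here.
expense_tool = {
    "name": "process_expense_report",
    "description": "Process expense data and return JSON objects",
    "input_schema": {
        "type": "object",
        "properties": {
            "amount": {"type": "number"},
            "currency": {"type": "string"},
            "category": {"type": "string"},
        },
        "required": ["amount", "currency"],
    },
    # Concrete examples teach parameter conventions by pattern matching,
    # instead of spelling out every formatting rule in prose
    "input_examples": [
        {"amount": 42.50, "currency": "EUR", "category": "travel"},
        {"amount": 1200.00, "currency": "USD", "category": "hardware"},
    ],
}
```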
Practical implementation with TypeScript
Step 1: Initial request
```typescript
import { Anthropic } from "@anthropic-ai/sdk";

const anthropic = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

async function analyzeCustomerData() {
  const response = await anthropic.messages.create({
    model: "claude-opus-4-6",
    max_tokens: 4096,
    messages: [{
      role: "user",
      content: "Analyze revenue by region for last quarter and identify the top 3 customers"
    }],
    tools: [
      {
        type: "code_execution_20260120",
        name: "code_execution"
      },
      {
        name: "query_sales_db",
        description: "Execute SQL query. Returns JSON array of rows with columns: customer_id, region, revenue, date",
        input_schema: {
          type: "object",
          properties: {
            sql: { type: "string", description: "SQL query to execute" }
          },
          required: ["sql"]
        },
        allowed_callers: ["code_execution_20260120"]
      }
    ]
  });
  return response;
}
```
Step 2: Tool call response loop
```typescript
async function handleToolResults(response, conversationHistory) {
  // While there are pending tool_use blocks, respond with tool_result
  while (response.stop_reason === "tool_use") {
    const toolUseBlocks = response.content.filter(block => block.type === "tool_use");

    // Execute all tools in parallel
    const results = await Promise.all(
      toolUseBlocks.map(toolUse => executeYourTool(toolUse.name, toolUse.input))
    );

    // Respond ONLY with tool_result blocks (no additional text)
    const toolResponse = await anthropic.messages.create({
      model: "claude-opus-4-6",
      max_tokens: 4096,
      container: response.container?.id, // Reuse the container if it exists
      messages: [
        ...conversationHistory,
        { role: "assistant", content: response.content },
        {
          role: "user",
          content: toolUseBlocks.map((toolUse, index) => ({
            type: "tool_result",
            tool_use_id: toolUse.id,
            content: results[index]
          }))
        }
      ],
      tools: [/* same tools */]
    });

    response = toolResponse;
  }

  return response; // stop_reason: "end_turn"
}
```
Key point: The detailed output format description (“Returns JSON array of rows with columns…”) is critical. Claude uses this information to write code that correctly processes the results.
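The loop above leaves `executeYourTool` to you. The dispatch logic is language-agnostic; here is a minimal sketch in Python, with a stubbed handler — the handler body and names are hypothetical, and real errors should come back as strings the sandbox code can inspect:

```python
# Minimal tool dispatcher, equivalent to executeYourTool in the loop above.
# The handler is a stub; a real one would run the SQL and serialize rows as JSON.
import json

def query_sales_db(sql: str) -> str:
    return json.dumps([{"customer_id": "C1", "revenue": 125000}])

HANDLERS = {
    "query_sales_db": lambda args: query_sales_db(args["sql"]),
}

def execute_tool(name: str, args: dict) -> str:
    if name not in HANDLERS:
        # Return errors as strings: the sandbox code can catch and handle them
        return json.dumps({"error": f"unknown tool: {name}"})
    return HANDLERS[name](args)

print(execute_tool("query_sales_db", {"sql": "SELECT 1"}))
```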
Common errors
Mixed content responses in programmatic tool calls
When there are pending programmatic tool calls, your response should contain only tool_result blocks. Including text alongside them causes an API error.
```typescript
// ❌ Invalid: mixing text with tool_result
{
  "role": "user",
  "content": [
    {"type": "tool_result", "tool_use_id": "toolu_01", "content": "[{\"customer_id\": \"C1\"}]"},
    {"type": "text", "text": "What's next?"} // This causes an error
  ]
}

// ✅ Valid: only tool_result for programmatic calls
{
  "role": "user",
  "content": [
    {"type": "tool_result", "tool_use_id": "toolu_01", "content": "[{\"customer_id\": \"C1\"}]"}
  ]
}
```
Orchestrator that also makes direct queries
If you define a tool with ["direct", "code_execution_20260120"], Claude can use both modes in the same conversation. This breaks flow predictability.
Symptom: Sometimes you see raw data in context, sometimes you don’t. Solution: Choose one mode per tool. If you need both, create two distinct tools.
Tools without detailed output schema
Claude needs to know exactly what format your tool returns to write code that processes it.
```typescript
// ❌ Bad: vague description
description: "Get user data"

// ✅ Good: specific format
description: "Returns user object with fields: id (string), name (string), email (string), created_at (ISO date)"
```
Ignoring container expiration
Containers expire after ~4.5 minutes of inactivity. If your tool takes longer to respond, the code receives a TimeoutError.
Solution: Monitor the expires_at field in responses and implement timeouts in your tools.
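A minimal expiry check, sketched in Python — the `expires_at` field name comes from the API responses described above, but the ISO parsing and the safety margin are assumptions:

```python
# Sketch: decide whether a container is still safe to reuse based on its
# expires_at timestamp. The 30-second margin is an assumed safety buffer
# so a long-running tool call doesn't race the expiry.
from datetime import datetime, timedelta, timezone

def container_usable(expires_at_iso: str, margin_seconds: int = 30) -> bool:
    expires = datetime.fromisoformat(expires_at_iso.replace("Z", "+00:00"))
    return datetime.now(timezone.utc) + timedelta(seconds=margin_seconds) < expires

print(container_usable("2020-01-01T00:00:00Z"))  # False: long expired
```

If this returns False, drop the container ID and let the API create a fresh one.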
Tool results with unvalidated executable content
Tool results get processed in the Python sandbox, exposing specific threats: using eval() on unvalidated results, dynamic SQL construction from external data, shell injection when passing outputs to subprocesses, and prompt injection from malicious web content.
Solution: Use parameterized queries for SQL (as in the Python example above), validate data structures before processing them, and sanitize web content. Avoid eval(), exec(), or subprocess with tool result data without prior validation.
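The validation step can be as simple as strict JSON parsing plus a structural check before any processing. A sketch, assuming the row schema from the sales example (`region`, `revenue`):

```python
# Sketch: validate a tool result's structure before consuming it.
# Strict JSON parsing, never eval(); the expected schema is an assumption.
import json

def parse_rows(raw: str) -> list[dict]:
    data = json.loads(raw)  # json.loads only parses data, it cannot execute code
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of rows")
    for row in data:
        if not isinstance(row, dict) or not isinstance(row.get("revenue"), (int, float)):
            raise ValueError(f"malformed row: {row!r}")
    return data

rows = parse_rows('[{"region": "West", "revenue": 125000}]')
print(sum(r["revenue"] for r in rows))  # 125000
```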
Tools with side effects lacking idempotency
Tool calling includes automatic retries that can duplicate effects. Tools that write to databases, send emails, or process payments may execute multiple times without developer knowledge.
Solution: Design critical tools as idempotent or implement deduplication via unique request_id to prevent duplicate effects.
Known incompatibilities
| Feature | Status | Description |
|---|---|---|
| `strict: true` | Not supported | Tools with strict structured outputs |
| `tool_choice` | Not supported | Cannot force programmatic mode for a specific tool |
| `disable_parallel_tool_use: true` | Not supported | Conflicts with parallel programmatic execution |
| MCP tools | Not supported | MCP connectors cannot be called programmatically |
Implementation checklist
- Mass-data tools configured with `allowed_callers: ["code_execution_20260120"]`
- Detailed output format description in each tool (types, fields, structure)
- Timeouts implemented in tools that can take >30 seconds
- Tool result validation to prevent code injection
- Monitor the `expires_at` field to avoid container timeouts
- Container reuse via the `container` field for related sessions
- Tests that verify only the final output reaches context (no intermediate data)
Sources
- Improved Web Search with Dynamic Filtering — Claude Blog — data on accuracy improvements and token reduction in web searches.
- Advanced Tool Use Performance — Anthropic Engineering — metrics on token reduction and accuracy improvements in Tool Search and Tool Use Examples.
- Programmatic Tool Calling — Claude API Docs — official technical documentation on implementation and use cases.
Frequently Asked Questions
What happens if my tool fails during programmatic execution?
Python code receives the error as a string and Claude can handle it programmatically — retry, logging, fallbacks. It’s more resilient than direct mode because the error doesn’t interrupt the entire conversation.
Can I mix programmatic and direct tools in the same request?
Yes, but it’s not recommended. The hybrid pattern confuses the model about when to use each mode. Better to separate clearly: mass data tools → programmatic, UI/confirmation tools → direct.
How do I debug the code Claude generates internally?
The stdout field in code_execution_tool_result shows prints from the code. Use print() statements in your tool logic for debugging. You can also inspect the code in response.content[n].input.code.
Is sandbox overhead worth it for simple tasks?
No. For a single query with small response, container creation overhead exceeds the benefit. As an empirical heuristic: use programmatic mode when you anticipate 3+ calls or datasets >10KB. Measure p95 latency and token costs for your specific case.
Do containers maintain state between requests?
Yes, if you reuse the container ID. Variables, imports, temporary files persist for ~4.5 minutes. Useful for multistep analysis where you need to keep datasets in memory.
Critical security warning: Only reuse containers within the same user or session. In multi-tenant applications, reusing a container between different users exposes variables, temporary files, and data from one user to another. Always invalidate the container ID when changing security context.
How do I implement the complete conversation loop?
The pattern is simple: while stop_reason === "tool_use", respond with tool_result blocks and continue until stop_reason === "end_turn". See the complete TypeScript example in the “Practical implementation” section showing the while loop for handling multiple programmatic tool calls.