# LLM Tool Loops Are Slow - Here's What to Actually Do

The standard LLM tool-calling pattern is an anti-pattern for production. Every tool call costs 200-500ms of LLM latency plus tokens for the entire conversation history. Let me show you what actually works.

## The Problem With Standard Tool Calling

Here's what happens in the naive implementation:

```python
# What the tutorials show you
while not complete:
    response = llm.complete(entire_conversation_history + tool_schemas)
    if response.has_tool_call:
        result = execute_tool(response.tool_call)
        conversation_history.append(result)  # History grows every call
```

**Why this sucks:**

- Each round-trip: 200-500ms LLM latency
- Token cost grows linearly with conversation length
- Tool schemas sent every single time (often 1000+ tokens)
- Sequential blocking - can't parallelize

Five tool calls = 2.5 seconds minimum. That's before any actual execution time.

## Pattern 1: Single-Shot Execution Planning

Don't loop. Make the LLM output all tool calls upfront:

```python
def get_execution_plan(task):
    prompt = f"""
    Task: {task}

    Output a complete execution plan as JSON.
    Include all API calls needed, with dependencies marked.
    """
    plan = llm.complete(prompt, response_format={"type": "json"})
    return json.loads(plan)

# Example output:
{
    "parallel_groups": [
        {
            "group": 1,
            "calls": [
                {"tool": "get_weather", "args": {"city": "Boston", "date": "2024-01-20"}},
                {"tool": "get_weather", "args": {"city": "NYC", "date": "2024-01-20"}}
            ]
        },
        {
            "group": 2,  # Depends on group 1
            "calls": [
                {"tool": "compare_temps", "args": {"results": "$group1.results"}}
            ]
        }
    ]
}
```

Now execute the entire plan locally. One LLM call instead of five.
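Here's a minimal sketch of that local execution step. It assumes a `tools` registry mapping tool names to async callables (not part of the original gist) and the `$groupN.results` placeholder convention from the example plan above:

```python
import asyncio

async def execute_plan(plan, tools):
    """Run the plan without further LLM calls: groups in dependency order,
    calls inside a group concurrently. `tools` maps tool names to async
    callables (an assumed registry, not shown above)."""
    results = {}  # "group1" -> [result, ...]

    def resolve(value):
        # Swap "$group1.results"-style references for earlier group results
        if isinstance(value, str) and value.startswith("$group"):
            group_key, _, _field = value[1:].partition(".")
            return results[group_key]
        return value

    for group in sorted(plan["parallel_groups"], key=lambda g: g["group"]):
        coros = [
            tools[call["tool"]](**{k: resolve(v) for k, v in call["args"].items()})
            for call in group["calls"]
        ]
        # Calls within a group have no dependencies on each other,
        # so run the whole group concurrently
        results[f"group{group['group']}"] = await asyncio.gather(*coros)

    return results
```

The only LLM round-trip is producing the plan; everything after that is plain asyncio.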
## Pattern 2: Tool Chain Compilation

Common sequences should never hit the LLM:

```python
COMPILED_CHAINS = {
    "user_context": [
        ("get_user", lambda: {"id": "$current_user"}),
        ("get_preferences", lambda prev: {"user_id": prev["id"]}),
        ("get_recent_orders", lambda prev: {"user_id": prev["user_id"]}),
        ("aggregate_context", lambda prev: prev)
    ]
}

def execute_request(query):
    # Try to match against compiled patterns first
    if pattern := detect_pattern(query):
        return execute_compiled_chain(COMPILED_CHAINS[pattern])

    # Only use LLM for novel requests
    return llm_tool_loop(query)
```

80% of your tool calls are repetitive. Compile them.
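`execute_compiled_chain` isn't defined in the gist; here's one way it could look, assuming a `tools` dict mapping names to plain callables and arg builders shaped like the chain above (the first builder takes no arguments, so both arities are handled). Unlike the one-argument call above, this version takes the tool registry explicitly rather than reading a global:

```python
import inspect

def execute_compiled_chain(chain, tools, seed=None):
    """Walk a compiled (tool_name, arg_builder) chain with zero LLM calls,
    feeding each step's result into the next builder."""
    prev = seed
    for tool_name, arg_builder in chain:
        # Builders take either no arguments or the previous step's result
        if inspect.signature(arg_builder).parameters:
            args = arg_builder(prev)
        else:
            args = arg_builder()
        prev = tools[tool_name](**args)  # Local dispatch, no LLM round-trip
    return prev
```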
## Pattern 3: Streaming Partial Execution

Start executing before the LLM finishes responding:

```python
async def stream_execute(prompt):
    results = {}
    pending = set()

    async for chunk in llm.stream(prompt):
        # Try to parse partial JSON for tool calls
        if tool_call := try_parse_streaming_json(chunk):
            if tool_call not in pending:
                pending.add(tool_call)
                # Execute immediately, don't wait for full response
                task = asyncio.create_task(execute_tool(tool_call))
                results[tool_call.id] = task

    # Gather all results
    return await asyncio.gather(*results.values())
```

Saves 100-200ms per request by overlapping LLM generation with execution.

## Pattern 4: Context Compression

Never send full conversation history. Send deltas:

```python
class CompressedContext:
    def __init__(self):
        self.task_summary = None
        self.last_result = None
        self.completed_tools = set()

    def get_prompt(self):
        # Instead of full history, send only:
        return {
            "task": self.task_summary,  # 50 tokens vs 500
            "last_result": self.compress(self.last_result),  # Key fields only
            "completed": list(self.completed_tools)  # Tool names, not results
        }

    def compress(self, result):
        # Extract only fields needed for reasoning
        if result["type"] == "weather":
            return {"temp": result["temp"], "summary": result["condition"]}
        # Full result stored locally, LLM never sees it
        return {"id": result["id"], "success": True}
```

Reduces token usage by 85% after 5+ tool calls.

## Pattern 5: Tool Batching

Design your tools to accept multiple operations:

```python
# Instead of:
get_weather(city="Boston", date="2024-01-20")
get_weather(city="NYC", date="2024-01-20")

# Design tools that batch:
get_weather_batch(requests=[
    {"city": "Boston", "date": "2024-01-20"},
    {"city": "NYC", "date": "2024-01-20"}
])
```

One tool call, parallel execution internally.

## Pattern 6: Predictive Execution

Execute likely tools before the LLM asks:

```python
def predictive_execute(query):
    # Start executing probable tools immediately
    futures = {}
    if "weather" in query.lower():
        cities = extract_cities(query)  # Simple NER, not LLM
        for city in cities:
            futures[city] = executor.submit(get_weather, city)

    # LLM runs in parallel with predictions
    llm_response = llm.complete(query)

    # If LLM wanted weather, we already have it
    if llm_response.tool == "get_weather":
        city = llm_response.args["city"]
        if city in futures:
            return futures[city].result()  # Already done!
```

## The Full Optimized Architecture

```python
class OptimizedToolExecutor:
    def __init__(self):
        self.compiled_chains = load_common_patterns()
        self.predictor = ToolPredictor()
        self.context = CompressedContext()

    async def execute(self, query):
        # Fast path: compiled chains (0 LLM calls)
        if chain := self.match_compiled(query):
            return await self.execute_chain(chain)

        # Start predictive execution
        predictions = self.predictor.start_predictions(query)

        # Get execution plan (1 LLM call)
        plan = await self.get_execution_plan(query)

        # Execute plan with batching and parallelization
        results = await self.execute_plan(plan, predictions)

        # Only return to the LLM if the plan failed
        if results.needs_reasoning:
            # Send compressed context, not full history
            return await self.llm_complete(self.context.compress(results))

        return results
```

## Benchmarks From Production

Standard tool loop (5 sequential weather checks):

- Latency: 2,847ms
- Tokens: 4,832
- Cost: $0.07

Optimized approach:

- Latency: 312ms (single LLM call + parallel execution)
- Tokens: 234 (just the execution plan)
- Cost: $0.003

## Implementation Checklist

1. **Profile your tool patterns** - Log every tool sequence for a week
2. **Compile the top 80%** - Turn repeated sequences into templates
3. **Batch similar operations** - Redesign tools to accept arrays
4. **Compress context aggressively** - The LLM only needs deltas
5. **Parallelize everything** - No sequential tool calls, ever
6. **Cache tool schemas** - Send once per session, not per call

## The Key Insight

LLM tool calling is an interpreter pattern when you need a compiler pattern:

- **Interpreter** (slow): Each step returns to the LLM for the next instruction
- **Compiler** (fast): The LLM generates a program; the runtime executes it

Stop using the LLM as a for-loop controller. Use it as a query planner.

## Quick Wins You Can Ship Today

```python
# 1. Parallel execution (easiest win)
async def execute_parallel(tool_calls):
    return await asyncio.gather(*[
        execute_tool(call) for call in tool_calls
    ])

# 2. Context caching (huge token savings)
def get_context(full_history):
    if len(full_history) > 5:
        return summarize(full_history[:-2]) + full_history[-2:]
    return full_history

# 3. Tool result compression
def compress_for_llm(tool_result):
    # Only fields that affect reasoning
    return {k: v for k, v in tool_result.items()
            if k in REASONING_FIELDS[tool_result["type"]]}
```

The standard tool loop is a teaching example, not a production pattern. Ship something faster.