# LLM Tool Loops Are Slow - Here's What to Actually Do

The standard LLM tool-calling pattern is an anti-pattern for production. Every tool call costs 200-500ms of LLM latency plus tokens for the entire conversation history. Let me show you what actually works.

## The Problem With Standard Tool Calling

Here's what happens in the naive implementation:

```python
# What the tutorials show you
while not complete:
    response = llm.complete(entire_conversation_history + tool_schemas)
    if response.has_tool_call:
        result = execute_tool(response.tool_call)
        conversation_history.append(result)  # History grows every call
```

**Why this sucks:**

- Each round-trip: 200-500ms LLM latency
- Token cost grows linearly with conversation length
- Tool schemas sent every single time (often 1000+ tokens)
- Sequential blocking - can't parallelize

Five tool calls = 2.5 seconds minimum. That's before any actual execution time.

## Pattern 1: Single-Shot Execution Planning

Don't loop. Make the LLM output all tool calls upfront:

```python
def get_execution_plan(task):
    prompt = f"""
    Task: {task}

    Output a complete execution plan as JSON.
    Include all API calls needed, with dependencies marked.
    """
    plan = llm.complete(prompt, response_format={"type": "json"})
    return json.loads(plan)

# Example output:
{
    "parallel_groups": [
        {
            "group": 1,
            "calls": [
                {"tool": "get_weather", "args": {"city": "Boston", "date": "2024-01-20"}},
                {"tool": "get_weather", "args": {"city": "NYC", "date": "2024-01-20"}}
            ]
        },
        {
            "group": 2,  # Depends on group 1
            "calls": [
                {"tool": "compare_temps", "args": {"results": "$group1.results"}}
            ]
        }
    ]
}
```

Now execute the entire plan locally. One LLM call instead of five.
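Here's a minimal sketch of that local execution step. It assumes a `tools` registry mapping tool names to async callables (not part of the original gist) and the `$groupN.results` placeholder convention from the example plan above:

```python
import asyncio

async def execute_plan(plan, tools):
    """Run the plan without further LLM calls: groups in dependency order,
    calls inside a group concurrently. `tools` maps tool names to async
    callables (an assumed registry, not shown above)."""
    results = {}  # "group1" -> [result, ...]

    def resolve(value):
        # Swap "$group1.results"-style references for earlier group results
        if isinstance(value, str) and value.startswith("$group"):
            group_key, _, _field = value[1:].partition(".")
            return results[group_key]
        return value

    for group in sorted(plan["parallel_groups"], key=lambda g: g["group"]):
        coros = [
            tools[call["tool"]](**{k: resolve(v) for k, v in call["args"].items()})
            for call in group["calls"]
        ]
        # Calls within a group have no dependencies on each other,
        # so run the whole group concurrently
        results[f"group{group['group']}"] = await asyncio.gather(*coros)

    return results
```

The only LLM round-trip is producing the plan; everything after that is plain asyncio.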
## Pattern 2: Tool Chain Compilation

Common sequences should never hit the LLM:

```python
COMPILED_CHAINS = {
    "user_context": [
        ("get_user", lambda: {"id": "$current_user"}),
        ("get_preferences", lambda prev: {"user_id": prev["id"]}),
        ("get_recent_orders", lambda prev: {"user_id": prev["user_id"]}),
        ("aggregate_context", lambda prev: prev)
    ]
}

def execute_request(query):
    # Try to match against compiled patterns first
    if pattern := detect_pattern(query):
        return execute_compiled_chain(COMPILED_CHAINS[pattern])

    # Only use LLM for novel requests
    return llm_tool_loop(query)
```

80% of your tool calls are repetitive. Compile them.
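`execute_compiled_chain` isn't defined in the gist; here's one way it could look, assuming a `tools` dict mapping names to plain callables and arg builders shaped like the chain above (the first builder takes no arguments, so both arities are handled). Unlike the one-argument call above, this version takes the tool registry explicitly rather than reading a global:

```python
import inspect

def execute_compiled_chain(chain, tools, seed=None):
    """Walk a compiled (tool_name, arg_builder) chain with zero LLM calls,
    feeding each step's result into the next builder."""
    prev = seed
    for tool_name, arg_builder in chain:
        # Builders take either no arguments or the previous step's result
        if inspect.signature(arg_builder).parameters:
            args = arg_builder(prev)
        else:
            args = arg_builder()
        prev = tools[tool_name](**args)  # Local dispatch, no LLM round-trip
    return prev
```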
## Pattern 3: Streaming Partial Execution

Start executing before the LLM finishes responding:

```python
async def stream_execute(prompt):
    results = {}
    pending = set()

    async for chunk in llm.stream(prompt):
        # Try to parse partial JSON for tool calls
        if tool_call := try_parse_streaming_json(chunk):
            if tool_call not in pending:
                pending.add(tool_call)
                # Execute immediately, don't wait for full response
                task = asyncio.create_task(execute_tool(tool_call))
                results[tool_call.id] = task

    # Gather all results
    return await asyncio.gather(*results.values())
```

Saves 100-200ms per request by overlapping LLM generation with execution.

## Pattern 4: Context Compression

Never send full conversation history. Send deltas:

```python
class CompressedContext:
    def __init__(self):
        self.task_summary = None
        self.last_result = None
        self.completed_tools = set()

    def get_prompt(self):
        # Instead of full history, send only:
        return {
            "task": self.task_summary,  # 50 tokens vs 500
            "last_result": self.compress(self.last_result),  # Key fields only
            "completed": list(self.completed_tools)  # Tool names, not results
        }

    def compress(self, result):
        # Extract only fields needed for reasoning
        if result["type"] == "weather":
            return {"temp": result["temp"], "summary": result["condition"]}
        # Full result stored locally, LLM never sees it
        return {"id": result["id"], "success": True}
```

Reduces token usage by 85% after 5+ tool calls.

## Pattern 5: Tool Batching

Design your tools to accept multiple operations:

```python
# Instead of:
get_weather(city="Boston", date="2024-01-20")
get_weather(city="NYC", date="2024-01-20")

# Design tools that batch:
get_weather_batch(requests=[
    {"city": "Boston", "date": "2024-01-20"},
    {"city": "NYC", "date": "2024-01-20"}
])
```

One tool call, parallel execution internally.

## Pattern 6: Predictive Execution

Execute likely tools before the LLM asks:

```python
def predictive_execute(query):
    # Start executing probable tools immediately
    futures = {}
    if "weather" in query.lower():
        cities = extract_cities(query)  # Simple NER, not LLM
        for city in cities:
            futures[city] = executor.submit(get_weather, city)

    # LLM runs in parallel with predictions
    llm_response = llm.complete(query)

    # If LLM wanted weather, we already have it
    if llm_response.tool == "get_weather":
        city = llm_response.args["city"]
        if city in futures:
            return futures[city].result()  # Already done!
```

## The Full Optimized Architecture

```python
class OptimizedToolExecutor:
    def __init__(self):
        self.compiled_chains = load_common_patterns()
        self.predictor = ToolPredictor()
        self.context = CompressedContext()

    async def execute(self, query):
        # Fast path: compiled chains (0 LLM calls)
        if chain := self.match_compiled(query):
            return await self.execute_chain(chain)

        # Start predictive execution
        predictions = self.predictor.start_predictions(query)

        # Get execution plan (1 LLM call)
        plan = await self.get_execution_plan(query)

        # Execute plan with batching and parallelization
        results = await self.execute_plan(plan, predictions)

        # Only return to the LLM if the plan failed
        if results.needs_reasoning:
            # Send compressed context, not full history
            return await self.llm_complete(self.context.compress(results))

        return results
```

## Benchmarks From Production

Standard tool loop (5 sequential weather checks):

- Latency: 2,847ms
- Tokens: 4,832
- Cost: $0.07

Optimized approach:

- Latency: 312ms (single LLM call + parallel execution)
- Tokens: 234 (just the execution plan)
- Cost: $0.003

## Implementation Checklist

1. **Profile your tool patterns** - Log every tool sequence for a week
2. **Compile the top 80%** - Turn repeated sequences into templates
3. **Batch similar operations** - Redesign tools to accept arrays
4. **Compress context aggressively** - The LLM only needs deltas
5. **Parallelize everything** - No sequential tool calls, ever
6. **Cache tool schemas** - Send once per session, not per call

## The Key Insight

LLM tool calling is an interpreter pattern when you need a compiler pattern:

- **Interpreter** (slow): Each step returns to the LLM for the next instruction
- **Compiler** (fast): The LLM generates a program; the runtime executes it

Stop using the LLM as a for-loop controller. Use it as a query planner.

## Quick Wins You Can Ship Today

```python
# 1. Parallel execution (easiest win)
async def execute_parallel(tool_calls):
    return await asyncio.gather(*[
        execute_tool(call) for call in tool_calls
    ])

# 2. Context caching (huge token savings)
def get_context(full_history):
    if len(full_history) > 5:
        return summarize(full_history[:-2]) + full_history[-2:]
    return full_history

# 3. Tool result compression
def compress_for_llm(tool_result):
    # Only fields that affect reasoning
    return {k: v for k, v in tool_result.items()
            if k in REASONING_FIELDS[tool_result["type"]]}
```

The standard tool loop is a teaching example, not a production pattern. Ship something faster.