# LLM Tool Loops Are Slow - Here's What to Actually Do

The standard LLM tool-calling pattern is an anti-pattern for production. Every tool call costs 200-500ms of LLM latency plus tokens for the entire conversation history. Let me show you what actually works.

## The Problem With Standard Tool Calling

Here's what happens in the naive implementation:

```python
# What the tutorials show you
while not complete:
    response = llm.complete(entire_conversation_history + tool_schemas)
    if response.has_tool_call:
        result = execute_tool(response.tool_call)
        conversation_history.append(result)  # History grows every call
```

**Why this sucks:**
- Each round-trip: 200-500ms LLM latency
- Token cost grows linearly with conversation length
- Tool schemas sent every single time (often 1000+ tokens)
- Sequential blocking - can't parallelize

Five tool calls = 1-2.5 seconds of LLM overhead. That's before any actual execution time.

## Pattern 1: Single-Shot Execution Planning

Don't loop. Make the LLM output all tool calls upfront:

```python
import json

def get_execution_plan(task):
    prompt = f"""
    Task: {task}
    Output a complete execution plan as JSON.
    Include all API calls needed, with dependencies marked.
    """

    plan = llm.complete(prompt, response_format={"type": "json"})
    return json.loads(plan)

# Example output:
{
    "parallel_groups": [
        {
            "group": 1,
            "calls": [
                {"tool": "get_weather", "args": {"city": "Boston", "date": "2024-01-20"}},
                {"tool": "get_weather", "args": {"city": "NYC", "date": "2024-01-20"}}
            ]
        },
        {
            "group": 2,  # Depends on group 1
            "calls": [
                {"tool": "compare_temps", "args": {"results": "$group1.results"}}
            ]
        }
    ]
}
```

Now execute the entire plan locally. One LLM call instead of five.
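A minimal local executor for that plan shape might look like the sketch below. It assumes an async `execute_tool(name, args)` callable and the `$groupN.results` placeholder convention from the example above; both are illustrative, not a fixed API.

```python
import asyncio

async def execute_plan(plan, execute_tool):
    """Run each parallel group in order, feeding earlier results forward."""
    group_results = {}  # "group1" -> list of results, for "$group1.results" refs

    def resolve(value):
        # Swap "$group1.results" style placeholders for the actual results
        if isinstance(value, str) and value.startswith("$group"):
            return group_results[value[1:].split(".")[0]]
        return value

    for group in plan["parallel_groups"]:
        calls = [
            (c["tool"], {k: resolve(v) for k, v in c["args"].items()})
            for c in group["calls"]
        ]
        # Everything inside a group runs concurrently
        results = await asyncio.gather(
            *(execute_tool(tool, args) for tool, args in calls)
        )
        group_results[f"group{group['group']}"] = list(results)

    return group_results
```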

## Pattern 2: Tool Chain Compilation

Common sequences should never hit the LLM:

```python
COMPILED_CHAINS = {
    "user_context": [
        ("get_user", lambda prev: {"id": "$current_user"}),
        ("get_preferences", lambda prev: {"user_id": prev["id"]}),
        ("get_recent_orders", lambda prev: {"user_id": prev["user_id"]}),
        ("aggregate_context", lambda prev: prev)
    ]
}

def execute_request(query):
    # Try to match against compiled patterns first
    if pattern := detect_pattern(query):
        return execute_compiled_chain(COMPILED_CHAINS[pattern])

    # Only use LLM for novel requests
    return llm_tool_loop(query)

80% of your tool calls are repetitive. Compile them.
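One way to fill in `detect_pattern` and `execute_compiled_chain`, assuming each tool is a plain Python callable that takes a single args dict and that keyword routing is good enough for a first pass (the `TOOLS` registry and keyword table below are hypothetical):

```python
# Hypothetical registry: tool name -> callable that takes a single args dict.
# The callables are your application's own tool functions.
TOOLS = {
    "get_user": get_user,
    "get_preferences": get_preferences,
    "get_recent_orders": get_recent_orders,
    "aggregate_context": aggregate_context,
}

# Crude keyword routing; a regex table or a small classifier trained on
# logged queries works just as well
PATTERN_KEYWORDS = {
    "user_context": ("my account", "my orders", "my preferences"),
}

def detect_pattern(query):
    q = query.lower()
    for pattern, keywords in PATTERN_KEYWORDS.items():
        if any(kw in q for kw in keywords):
            return pattern
    return None

def execute_compiled_chain(chain):
    prev = None
    for tool_name, build_args in chain:
        # Each step derives its arguments from the previous step's result
        prev = TOOLS[tool_name](build_args(prev))
    return prev
```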

## Pattern 3: Streaming Partial Execution

Start executing before the LLM finishes responding:

```python
import asyncio

async def stream_execute(prompt):
    results = {}
    pending = set()

    async for chunk in llm.stream(prompt):
        # Try to parse partial JSON for tool calls
        if tool_call := try_parse_streaming_json(chunk):
            if tool_call not in pending:
                pending.add(tool_call)
                # Execute immediately, don't wait for full response
                task = asyncio.create_task(execute_tool(tool_call))
                results[tool_call.id] = task

    # Gather all results
    return await asyncio.gather(*results.values())
```

Saves 100-200ms per request by overlapping LLM generation with execution.
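The fiddly part is `try_parse_streaming_json`. A minimal sketch is a stateful buffer that counts braces and emits each complete top-level JSON object as soon as it closes. This assumes the model streams tool calls as standalone JSON objects; a production version would also track string state so braces inside string values don't confuse the counter.

```python
import json

class StreamingToolCallParser:
    """Accumulates streamed text and emits each completed top-level JSON object once."""

    def __init__(self):
        self.buffer = ""

    def feed(self, chunk):
        self.buffer += chunk
        calls, depth, start, consumed = [], 0, None, 0

        for i, ch in enumerate(self.buffer):
            if ch == "{":
                if depth == 0:
                    start = i
                depth += 1
            elif ch == "}" and depth > 0:
                depth -= 1
                if depth == 0:
                    try:
                        calls.append(json.loads(self.buffer[start:i + 1]))
                    except json.JSONDecodeError:
                        pass  # malformed fragment; ignore it
                    consumed = i + 1

        # Drop everything already emitted; keep only the unfinished tail
        self.buffer = self.buffer[consumed:]
        return calls
```

Wired into `stream_execute`, each chunk goes through `parser.feed(chunk)` and every returned object is dispatched immediately.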

## Pattern 4: Context Compression

Never send full conversation history. Send deltas:

```python
class CompressedContext:
    def __init__(self):
        self.task_summary = None
        self.last_result = None
        self.completed_tools = set()

    def get_prompt(self):
        # Instead of full history, send only:
        return {
            "task": self.task_summary,                       # 50 tokens vs 500
            "last_result": self.compress(self.last_result),  # Key fields only
            "completed": list(self.completed_tools)          # Tool names, not results
        }

    def compress(self, result):
        # Extract only fields needed for reasoning
        if result["type"] == "weather":
            return {"temp": result["temp"], "summary": result["condition"]}
        # Full result stored locally, LLM never sees it
        return {"id": result["id"], "success": True}
```

Reduces token usage by 85% after 5+ tool calls.
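Usage is just a small update loop; a hypothetical driver (the `steps` iterable and `execute_tool` helper are placeholders) looks like:

```python
ctx = CompressedContext()
ctx.task_summary = "Compare weekend weather in Boston and NYC"  # ~50 tokens

for tool_name, args in steps:            # steps come from your planner
    result = execute_tool(tool_name, args)
    ctx.last_result = result             # full result stays local
    ctx.completed_tools.add(tool_name)

    # Only the compressed view ever reaches the model
    compact_prompt = ctx.get_prompt()
```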

## Pattern 5: Tool Batching

Design your tools to accept multiple operations:

```python
# Instead of:
get_weather(city="Boston", date="2024-01-20")
get_weather(city="NYC", date="2024-01-20")

# Design tools that batch:
get_weather_batch(requests=[
    {"city": "Boston", "date": "2024-01-20"},
    {"city": "NYC", "date": "2024-01-20"}
])
```

One tool call, parallel execution internally.
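On the tool side, the batch variant fans out internally. A minimal async sketch, assuming an async single-city `get_weather` already exists:

```python
import asyncio

async def get_weather_batch(requests):
    """One tool call from the LLM's perspective; parallel fan-out internally."""
    results = await asyncio.gather(
        *(get_weather(**req) for req in requests),
        return_exceptions=True,  # one bad city shouldn't sink the whole batch
    )
    # Pair each result with the request that produced it
    return [
        {
            "request": req,
            "result": None if isinstance(res, Exception) else res,
            "error": str(res) if isinstance(res, Exception) else None,
        }
        for req, res in zip(requests, results)
    ]
```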

## Pattern 6: Predictive Execution

Execute likely tools before the LLM asks:

```python
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor()  # shared pool for speculative tool calls

def predictive_execute(query):
    # Start executing probable tools immediately
    futures = {}

    if "weather" in query.lower():
        cities = extract_cities(query)  # Simple NER, not LLM
        for city in cities:
            futures[city] = executor.submit(get_weather, city)

    # LLM runs in parallel with predictions
    llm_response = llm.complete(query)

    # If LLM wanted weather, we already have it
    if llm_response.tool == "get_weather":
        city = llm_response.args["city"]
        if city in futures:
            return futures[city].result()  # Already done!
```

## The Full Optimized Architecture

```python
class OptimizedToolExecutor:
    def __init__(self):
        self.compiled_chains = load_common_patterns()
        self.predictor = ToolPredictor()
        self.context = CompressedContext()

    async def execute(self, query):
        # Fast path: Compiled chains (0 LLM calls)
        if chain := self.match_compiled(query):
            return await self.execute_chain(chain)

        # Start predictive execution
        predictions = self.predictor.start_predictions(query)

        # Get execution plan (1 LLM call)
        plan = await self.get_execution_plan(query)

        # Execute plan with batching and parallelization
        results = await self.execute_plan(plan, predictions)

        # Only return to LLM if plan failed
        if results.needs_reasoning:
            # Send compressed context, not full history
            return await self.llm_complete(self.context.compress(results))

        return results
```

## Benchmarks From Production

Standard tool loop (5 sequential weather checks):
- Latency: 2,847ms
- Tokens: 4,832
- Cost: $0.07

Optimized approach:
- Latency: 312ms (single LLM call + parallel execution)
- Tokens: 234 (just the execution plan)
- Cost: $0.003

## Implementation Checklist

1. **Profile your tool patterns** - Log every tool sequence for a week
2. **Compile the top 80%** - Turn repeated sequences into templates
3. **Batch similar operations** - Redesign tools to accept arrays
4. **Compress context aggressively** - LLM only needs deltas
5. **Parallelize everything** - No sequential tool calls, ever
6. **Cache tool schemas** - Send once per session, not per call

## The Key Insight

LLM tool calling is an interpreter pattern when you need a compiler pattern:

- **Interpreter** (slow): Each step returns to the LLM for the next instruction
- **Compiler** (fast): The LLM generates a program, the runtime executes it

Stop using the LLM as a for-loop controller. Use it as a query planner.

## Quick Wins You Can Ship Today

```python
import asyncio

# 1. Parallel execution (easiest win)
async def execute_parallel(tool_calls):
    return await asyncio.gather(*[
        execute_tool(call) for call in tool_calls
    ])

# 2. Context caching (huge token savings)
def get_context(full_history):
    if len(full_history) > 5:
        return summarize(full_history[:-2]) + full_history[-2:]
    return full_history

# 3. Tool result compression
def compress_for_llm(tool_result):
    # Only fields that affect reasoning
    return {k: v for k, v in tool_result.items()
            if k in REASONING_FIELDS[tool_result["type"]]}
```

The standard tool loop is a teaching example, not a production pattern. Ship something faster.