Iterative planning, code generation, and code review. This is my observation as to what works when developing code using a large language model (mostly Claude).

It is interesting to independently rediscover that the best approach to writing code is to work in rapid iterations on small, well-reviewed improvements. Large, detailed, and complicated prompts do not work well, or at least not reliably.

The academic literature refers to this as “prompt chaining,” and there are numerous documented cases where iterative prompting yields better results than so-called one-shot prompting with a single extensive, detailed prompt. There are also many examples where chaining prompts across different models delivers strong results.
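As a rough sketch of the difference, here is what prompt chaining looks like in code: the same task either sent as one large prompt, or broken into a plan / implement / review chain. The `call_model` helper is a hypothetical placeholder, not any particular vendor's API.

```python
# Sketch only: call_model stands in for a real chat-completion call
# (OpenAI, Anthropic, etc.); swap it for your provider's SDK.

def call_model(model: str, prompt: str) -> str:
    """Hypothetical placeholder that just echoes what it would send."""
    return f"[{model} response to: {prompt[:60]}...]"

def one_shot(task: str) -> str:
    # One large, detailed prompt that asks for everything at once.
    return call_model("some-model", f"Plan, implement, test, and review code for: {task}")

def chained(task: str) -> str:
    # The same task divided into small, individually reviewable prompts.
    plan = call_model("some-model", f"Write a short implementation plan for: {task}")
    code = call_model("some-model", f"Implement the first step of this plan:\n{plan}")
    review = call_model("some-model", f"Review this code and list concrete problems:\n{code}")
    return call_model("some-model", f"Revise the code to address this review:\n{review}\n\n{code}")

print(chained("parse a CSV of build times and report the slowest stage"))
```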

This aligns with what my team has found as we increasingly use models to write code. We began by trying to keep our work at the forefront of the model's context window. Our initial observation was that even a model with a very large context window shows recency bias: the most recent context has outsized influence on the output. So even though an LLM's context window can hold hundreds of thousands of tokens, the model's effective attention span is much smaller. You can see this happening when GPT or Claude starts to leave code out of the solution you are working on: that code is less recent in the context, and the model has stopped paying attention to it. As a workaround we made a habit of starting a fresh session every hour or so to reset the conversation history and reclaim the most responsive part of the model's context window.
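In practice the reset is nothing more elaborate than the sketch below: keep a rolling message list, and after a fixed interval summarize it and start over with only the summary. The `call_model` helper and the one-hour threshold are assumptions for illustration, not a prescription.

```python
import time

def call_model(model: str, messages: list) -> str:
    """Hypothetical placeholder for a real chat-completion call."""
    return f"[{model} reply to a {len(messages)}-message conversation]"

class Session:
    """Rolling conversation that resets itself after reset_after seconds,
    carrying forward only a short summary of the work so far."""

    def __init__(self, model: str, reset_after: float = 3600.0):
        self.model = model
        self.reset_after = reset_after
        self.started = time.monotonic()
        self.messages: list = []

    def ask(self, prompt: str) -> str:
        if time.monotonic() - self.started > self.reset_after:
            self._reset()
        self.messages.append({"role": "user", "content": prompt})
        reply = call_model(self.model, self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        return reply

    def _reset(self) -> None:
        # Ask for a summary of the old conversation, then discard it so the
        # summary (and whatever comes next) sits in the high-attention region.
        ask_summary = self.messages + [
            {"role": "user", "content": "Summarize the current state of this task in a few sentences."}
        ]
        summary = call_model(self.model, ask_summary)
        self.messages = [{"role": "user", "content": f"Context so far: {summary}"}]
        self.started = time.monotonic()
```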

At the same time, people in our network were telling us that the best results are derived from taking the output of one model, such as GPT, and feeding it into a different model family (Claude) for evaluation.

Because we were trying to keep context small, and because we were trying to derive results from multiple models, we eventually settled into a working style where we have many sessions open at once, across several different model families, all devoted to carrying out a single task.

It is almost a multi-agent approach, except that there is no direct interaction between any of the open sessions. Each session holds a series of prompts that are themselves part of a larger chain of prompts extending across all of the open sessions and model families. It is like the way a software engineer typically works, with many files and terminals open, moving between them in an iterative sequence of plan-execute-review-execute, the details of which are constantly modified according to feedback from the code.
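A compressed sketch of that working style: one model family drafts, a different family reviews, and the human operator inspects each round and adjusts the prompts before the next one. The model names, the `call_model` helper, and the round count are placeholders, not a recipe.

```python
def call_model(model: str, prompt: str) -> str:
    """Hypothetical placeholder for whichever SDK each model family uses."""
    return f"[{model} response]"

def plan_execute_review(task: str, draft_model: str = "model-family-a",
                        review_model: str = "model-family-b", rounds: int = 3) -> str:
    """Plan -> execute -> review loop spanning two model families.
    In the workflow described above, a human reads every round's output
    and edits the prompts before the next round begins."""
    plan = call_model(draft_model, f"Outline a small, incremental plan for: {task}")
    code = ""
    for _ in range(rounds):
        code = call_model(draft_model,
                          f"Implement the next step of this plan:\n{plan}\n\nCurrent code:\n{code}")
        review = call_model(review_model,
                            f"Review this code from a fresh perspective and list issues:\n{code}")
        plan = call_model(draft_model, f"Revise the plan to address this review:\n{review}")
    return code
```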

Evidence Base for Observations on Iterative LLM Code Development

This document lists and sources the empirical and academic evidence supporting the claims made in the post "Iterative planning, code generation, and code review" regarding large language model behavior, context management, and multi-model workflows.


💡 Claims and Supporting Research

  • Iterative prompting (“prompt chaining”) outperforms one-shot prompting.
    Supported by Prompt Chaining vs Stepwise (Findings ACL 2024), which finds that dividing complex tasks into multiple prompts yields more accurate and interpretable results than one large prompt.

    “Dividing a task into multiple chained prompts yields higher accuracy than a single long prompt of equivalent total length.”
    Findings ACL 2024 – Prompt Chaining vs Stepwise

  • Prompt chaining is a recognized research concept.
    The AI Chains paper (CHI 2022) introduced the concept of chaining multiple LLM calls for complex workflows and demonstrated measurable performance improvements.

    “We show how prompt chaining enables modularity, reusability, and debugging in human-AI programming workflows.”
    CHI 2022 – AI Chains

  • LLMs show recency bias: the most recent tokens have outsized influence.
    Lost in the Middle (Liu et al., TACL 2023) documents that model accuracy is highest when relevant information is at the beginning or end of a prompt and lowest when it’s in the middle.

    “Performance significantly degrades when models must access and use information in the middle of their input context.”
    TACL 2023 – Lost in the Middle

  • Attention is distributed unevenly within the context window (U-shaped bias).
    Found in the Middle (Hsieh et al., 2024) links the “lost in the middle” behavior to an intrinsic U-shaped attention bias, where start and end tokens receive more attention than middle tokens.

    “LLMs exhibit a U-shaped attention bias where tokens at the beginning and at the end receive higher attention, regardless of their relevance.”
    Hsieh et al. 2024 – Found in the Middle

  • Effective attention is limited despite large architectural context windows.
    Modern LLMs can hold hundreds of thousands of tokens, but empirical studies show that usable or “effective” context remains much smaller due to positional decay and attention bias.
    Insights into LLM Long-Context Failures (2024) confirms this degradation.

    “Comprehension declines for information in the center of a long context.”
    Insights into LLM Long-Context Failures 2024

  • Resetting the conversation history restores full access to the high-attention region of the context window.
    While not explicitly studied as “session resets,” research on attention decay implies that deleting earlier context re-centers the prompt in the model’s most responsive token range.

    Based on findings in Lost in the Middle and Found in the Middle showing attention decay with positional distance.

  • Cross-model prompt chaining (multi-model collaboration) improves performance.
    Exchange of Thought (EMNLP 2023) and Mixture of Agents (ICLR 2025) show that ensembles or cascades of heterogeneous models outperform individual ones through complementary reasoning.

    “Heterogeneous agent mixtures achieve improved reliability by combining models from different families with complementary strengths.”
    EMNLP 2023 – Exchange of Thought
    ICLR 2025 – Mixture of Agents

  • Human-orchestrated multi-session workflows resemble multi-agent systems.
    AutoGen (Microsoft Research, 2023) and CAMEL (2023) formalize such setups, showing that role-specialized agents coordinating via chained prompts can outperform single-model baselines.

    “Multi-agent systems perform role-specialized prompting and peer review to improve generation quality.”
    AutoGen 2023 – Microsoft Research
    CAMEL 2023 – Role-Playing Agents

