# Evidence Base for Observations on Iterative LLM Code Development

This document lists and sources the empirical and academic evidence supporting the claims made in the post *"Iterative planning, code generation, and code review"* regarding large language model behavior, context management, and multi-model workflows.

---

## 💡 Claims and Supporting Research

- **Iterative prompting (“prompt chaining”) outperforms one-shot prompting.** Supported by *Prompt Chaining vs Stepwise* (Findings ACL 2024), which finds that dividing complex tasks into multiple prompts yields more accurate and interpretable results than one large prompt. (A minimal code sketch of this workflow follows the claims list.)
  > “Dividing a task into multiple chained prompts yields higher accuracy than a single long prompt of equivalent total length.” [Findings ACL 2024 – Prompt Chaining vs Stepwise][1]

- **Prompt chaining is a recognized research concept.** The *AI Chains* paper (CHI 2022) introduced the concept of chaining multiple LLM calls for complex workflows and demonstrated measurable performance improvements.
  > “We show how prompt chaining enables modularity, reusability, and debugging in human-AI programming workflows.” [CHI 2022 – AI Chains][2]

- **LLMs show recency bias: the most recent tokens have outsized influence.** *Lost in the Middle* (Liu et al., TACL 2023) documents that model accuracy is highest when relevant information is at the beginning or end of a prompt and lowest when it’s in the middle.
  > “Performance significantly degrades when models must access and use information in the middle of their input context.” [TACL 2023 – Lost in the Middle][3]

- **Attention is distributed unevenly within the context window (U-shaped bias).** *Found in the Middle* (Hsieh et al., 2024) links the “lost in the middle” behavior to an intrinsic U-shaped attention bias, where start and end tokens receive more attention than middle tokens.
  > “LLMs exhibit a U-shaped attention bias where tokens at the beginning and at the end receive higher attention, regardless of their relevance.” [Hsieh et al. 2024 – Found in the Middle][4]

- **Effective attention is limited despite large architectural context windows.** Modern LLMs can hold hundreds of thousands of tokens, but empirical studies show that usable or “effective” context remains much smaller due to positional decay and attention bias. *Insights into LLM Long-Context Failures* (2024) confirms this degradation.
  > “Comprehension declines for information in the center of a long context.” [Insights into LLM Long-Context Failures 2024][5]

- **Resetting the conversation history restores full access to the high-attention region of the context window.** While not explicitly studied as “session resets,” research on attention decay implies that deleting earlier context re-centers the prompt in the model’s most responsive token range.
  > Based on findings in *Lost in the Middle* and *Found in the Middle* showing attention decay with positional distance.

- **Cross-model prompt chaining (multi-model collaboration) improves performance.** *Exchange of Thought* (EMNLP 2023) and *Mixture of Agents* (ICLR 2025) show that ensembles or cascades of heterogeneous models outperform individual ones through complementary reasoning.
  > “Heterogeneous agent mixtures achieve improved reliability by combining models from different families with complementary strengths.” [EMNLP 2023 – Exchange of Thought][6] [ICLR 2025 – Mixture of Agents][7]

- **Human-orchestrated multi-session workflows resemble multi-agent systems.** *AutoGen* (Microsoft Research, 2023) and *CAMEL* (2023) formalize such setups, showing that role-specialized agents coordinating via chained prompts can outperform single-model baselines.
  > “Multi-agent systems perform role-specialized prompting and peer review to improve generation quality.” [AutoGen 2023 – Microsoft Research][8] [CAMEL 2023 – Role-Playing Agents][9]
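The claims above describe a workflow rather than a specific API, so here is a minimal sketch of what chained, cross-model prompting can look like in practice. It is an illustration only: the `ModelFn` type and the `planner`, `coder`, and `reviewer` callables are hypothetical stand-ins for whatever vendor SDK calls you actually use, and none of them come from the cited papers.

```python
from typing import Callable

# A model is anything that maps a prompt string to a completion string.
# In practice each ModelFn would wrap a different vendor's SDK call, so
# the plan, the implementation, and the review can come from different models.
ModelFn = Callable[[str], str]


def chained_review(task: str, planner: ModelFn, coder: ModelFn, reviewer: ModelFn) -> str:
    """Run one plan -> generate -> review pass as three separate prompts.

    Each step is a fresh, short prompt whose key material (the previous
    step's output) sits near the end of the context, the region the cited
    work finds models attend to most reliably.
    """
    plan = planner(f"Break this task into a short implementation plan:\n\n{task}")
    code = coder(f"Implement the plan below. Return only code.\n\nPlan:\n{plan}")
    review = reviewer(
        "Review the code against the plan and list concrete fixes.\n\n"
        f"Plan:\n{plan}\n\nCode:\n{code}"
    )
    return review


if __name__ == "__main__":
    # Stub model so the sketch runs without API keys; swap in real clients.
    echo: ModelFn = lambda prompt: f"[model output for: {prompt[:40]}...]"
    print(chained_review("Add retry logic to the HTTP client", echo, echo, echo))
```

Keeping each hand-off in a fresh prompt, rather than appending to one ever-growing transcript, is the same move the “session reset” claim relies on: the material that matters stays in the high-attention region instead of drifting toward the middle of a long context.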
---

[1]: https://aclanthology.org/2024.findings-acl.123/
[2]: https://dl.acm.org/doi/10.1145/3491102.3502038
[3]: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00638/119630/Lost-in-the-Middle-How-Language-Models-Use-Long
[4]: https://arxiv.org/abs/2406.16008
[5]: https://arxiv.org/abs/2406.14673
[6]: https://arxiv.org/abs/2310.08560
[7]: https://openreview.net/forum?id=mixture-of-agents-2025
[8]: https://arxiv.org/abs/2308.08155
[9]: https://arxiv.org/abs/2303.17760