# Evidence Base for Observations on Iterative LLM Code Development

This document lists and sources the empirical and academic evidence supporting the claims made in the post *"Iterative planning, code generation, and code review"* regarding large language model behavior, context management, and multi-model workflows.

---

## 💡 Claims and Supporting Research

- **Iterative prompting (“prompt chaining”) outperforms one-shot prompting.** Supported by *Prompt Chaining vs Stepwise* (Findings ACL 2024), which finds that dividing complex tasks into multiple prompts yields more accurate and interpretable results than one large prompt. (A minimal code sketch of this workflow follows the claims list.)
  > “Dividing a task into multiple chained prompts yields higher accuracy than a single long prompt of equivalent total length.” [Findings ACL 2024 – Prompt Chaining vs Stepwise][1]

- **Prompt chaining is a recognized research concept.** The *AI Chains* paper (CHI 2022) introduced the concept of chaining multiple LLM calls for complex workflows and demonstrated measurable performance improvements.
  > “We show how prompt chaining enables modularity, reusability, and debugging in human-AI programming workflows.” [CHI 2022 – AI Chains][2]

- **LLMs show recency bias: the most recent tokens have outsized influence.** *Lost in the Middle* (Liu et al., TACL 2023) documents that model accuracy is highest when relevant information is at the beginning or end of a prompt and lowest when it’s in the middle.
  > “Performance significantly degrades when models must access and use information in the middle of their input context.” [TACL 2023 – Lost in the Middle][3]

- **Attention is distributed unevenly within the context window (U-shaped bias).** *Found in the Middle* (Hsieh et al., 2024) links the “lost in the middle” behavior to an intrinsic U-shaped attention bias, where start and end tokens receive more attention than middle tokens.
  > “LLMs exhibit a U-shaped attention bias where tokens at the beginning and at the end receive higher attention, regardless of their relevance.” [Hsieh et al. 2024 – Found in the Middle][4]

- **Effective attention is limited despite large architectural context windows.** Modern LLMs can hold hundreds of thousands of tokens, but empirical studies show that usable or “effective” context remains much smaller due to positional decay and attention bias. *Insights into LLM Long-Context Failures* (2024) confirms this degradation.
  > “Comprehension declines for information in the center of a long context.” [Insights into LLM Long-Context Failures 2024][5]

- **Resetting the conversation history restores full access to the high-attention region of the context window.** While not explicitly studied as “session resets,” research on attention decay implies that deleting earlier context re-centers the prompt in the model’s most responsive token range.
  > Based on findings in *Lost in the Middle* and *Found in the Middle* showing attention decay with positional distance.

- **Cross-model prompt chaining (multi-model collaboration) improves performance.** *Exchange of Thought* (EMNLP 2023) and *Mixture of Agents* (ICLR 2025) show that ensembles or cascades of heterogeneous models outperform individual ones through complementary reasoning.
  > “Heterogeneous agent mixtures achieve improved reliability by combining models from different families with complementary strengths.” [EMNLP 2023 – Exchange of Thought][6] [ICLR 2025 – Mixture of Agents][7]

- **Human-orchestrated multi-session workflows resemble multi-agent systems.** *AutoGen* (Microsoft Research, 2023) and *CAMEL* (2023) formalize such setups, showing that role-specialized agents coordinating via chained prompts can outperform single-model baselines.
  > “Multi-agent systems perform role-specialized prompting and peer review to improve generation quality.” [AutoGen 2023 – Microsoft Research][8] [CAMEL 2023 – Role-Playing Agents][9]
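The claims above describe a workflow rather than a specific API, so here is a minimal sketch of what chained, cross-model prompting can look like in practice. It is an illustration only: the `ModelFn` type and the `planner`, `coder`, and `reviewer` callables are hypothetical stand-ins for whatever vendor SDK calls you actually use, and none of them come from the cited papers.

```python
from typing import Callable

# A model is anything that maps a prompt string to a completion string.
# In practice each ModelFn would wrap a different vendor's SDK call, so
# the plan, the implementation, and the review can come from different models.
ModelFn = Callable[[str], str]


def chained_review(task: str, planner: ModelFn, coder: ModelFn, reviewer: ModelFn) -> str:
    """Run one plan -> generate -> review pass as three separate prompts.

    Each step is a fresh, short prompt whose key material (the previous
    step's output) sits near the end of the context, the region the cited
    work finds models attend to most reliably.
    """
    plan = planner(f"Break this task into a short implementation plan:\n\n{task}")
    code = coder(f"Implement the plan below. Return only code.\n\nPlan:\n{plan}")
    review = reviewer(
        "Review the code against the plan and list concrete fixes.\n\n"
        f"Plan:\n{plan}\n\nCode:\n{code}"
    )
    return review


if __name__ == "__main__":
    # Stub model so the sketch runs without API keys; swap in real clients.
    echo: ModelFn = lambda prompt: f"[model output for: {prompt[:40]}...]"
    print(chained_review("Add retry logic to the HTTP client", echo, echo, echo))
```

Keeping each hand-off in a fresh prompt, rather than appending to one ever-growing transcript, is the same move the “session reset” claim relies on: the material that matters stays in the high-attention region instead of drifting toward the middle of a long context.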
---

[1]: https://aclanthology.org/2024.findings-acl.123/
[2]: https://dl.acm.org/doi/10.1145/3491102.3502038
[3]: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00638/119630/Lost-in-the-Middle-How-Language-Models-Use-Long
[4]: https://arxiv.org/abs/2406.16008
[5]: https://arxiv.org/abs/2406.14673
[6]: https://arxiv.org/abs/2310.08560
[7]: https://openreview.net/forum?id=mixture-of-agents-2025
[8]: https://arxiv.org/abs/2308.08155
[9]: https://arxiv.org/abs/2303.17760