This document lists and sources the empirical and academic evidence supporting the claims made in the post "Iterative planning, code generation, and code review" regarding large language model behavior, context management, and multi-model workflows.
- Iterative prompting (“prompt chaining”) outperforms one-shot prompting.
  Supported by Prompt Chaining vs Stepwise (Findings ACL 2024), which finds that dividing complex tasks into multiple prompts yields more accurate and interpretable results than one large prompt: “Dividing a task into multiple chained prompts yields higher accuracy than a single long prompt of equivalent total length.”
  Findings ACL 2024 – Prompt Chaining vs Stepwise
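  As a rough illustration (not from the cited papers), here is a minimal sketch of the chaining pattern; `call_model` is a hypothetical placeholder for a real LLM client, and the summarization task is only an example:

  ```python
  def call_model(prompt: str) -> str:
      """Hypothetical stand-in for one LLM call; replace with your provider's SDK."""
      return f"<model output for: {prompt[:40]}...>"

  def summarize_one_shot(document: str) -> str:
      # One long prompt: the model must extract claims and summarize in a single pass.
      return call_model(
          "Read the document, extract its key claims, then write a summary:\n\n" + document
      )

  def summarize_chained(document: str) -> str:
      # Step 1: extraction only.
      claims = call_model("List the key claims made in this document:\n\n" + document)
      # Step 2: the intermediate output becomes the next prompt's input.
      return call_model("Write a concise summary based on these claims:\n\n" + claims)
  ```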
- Prompt chaining is a recognized research concept.
  The AI Chains paper (CHI 2022) introduced the concept of chaining multiple LLM calls for complex workflows and demonstrated measurable performance improvements: “We show how prompt chaining enables modularity, reusability, and debugging in human-AI programming workflows.”
  CHI 2022 – AI Chains
- LLMs show recency bias: the most recent tokens have outsized influence.
  Lost in the Middle (Liu et al., TACL 2024) documents that model accuracy is highest when relevant information is at the beginning or end of a prompt and lowest when it is in the middle: “Performance significantly degrades when models must access and use information in the middle of their input context.”
  TACL 2024 – Lost in the Middle
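  A hypothetical mitigation sketch (not from the paper): when assembling a prompt from ranked snippets, place the most relevant material at the edges of the context rather than in the middle. All names here are illustrative:

  ```python
  def order_for_edges(ranked_snippets: list[str]) -> list[str]:
      """Alternate the best-ranked snippets to the front and back of the prompt,
      so the weakest material ends up in the low-attention middle."""
      front, back = [], []
      for i, snippet in enumerate(ranked_snippets):
          (front if i % 2 == 0 else back).append(snippet)
      return front + back[::-1]

  snippets = ["most relevant", "second", "third", "fourth", "least relevant"]
  print("\n---\n".join(order_for_edges(snippets)))
  # Output order: most relevant, third, least relevant, fourth, second
  ```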
- Attention is distributed unevenly within the context window (U-shaped bias).
  Found in the Middle (Hsieh et al., 2024) links the “lost in the middle” behavior to an intrinsic U-shaped attention bias, where start and end tokens receive more attention than middle tokens: “LLMs exhibit a U-shaped attention bias where tokens at the beginning and at the end receive higher attention, regardless of their relevance.”
  Hsieh et al. 2024 – Found in the Middle
- Effective attention is limited despite large architectural context windows.
  Modern LLMs can hold hundreds of thousands of tokens, but empirical studies show that usable or “effective” context remains much smaller due to positional decay and attention bias.
  Insights into LLM Long-Context Failures (2024) confirms this degradation: “Comprehension declines for information in the center of a long context.”
  Insights into LLM Long-Context Failures 2024
- Resetting the conversation history restores access to the high-attention region of the context window.
  While not explicitly studied as “session resets,” research on attention decay implies that deleting earlier context re-centers the prompt in the model’s most responsive token range. Based on the findings in Lost in the Middle and Found in the Middle showing attention decay with positional distance.
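  A minimal sketch of such a reset, assuming a chat-style message list and a placeholder `call_model` function; the threshold and prompts are illustrative only:

  ```python
  def call_model(messages: list[dict]) -> str:
      return "<model output>"  # placeholder; replace with a real chat-completion call

  def reset_session(history: list[dict], max_messages: int = 40) -> list[dict]:
      """Compress a long history into a short brief and start a fresh conversation,
      so the brief sits near the start of the new context rather than mid-window."""
      if len(history) <= max_messages:
          return history
      brief = call_model(history + [{
          "role": "user",
          "content": "Summarize the decisions made and the open tasks so far in under 200 words.",
      }])
      return [
          {"role": "system", "content": "You are helping with an ongoing coding task."},
          {"role": "user", "content": "Context carried over from the previous session:\n" + brief},
      ]
  ```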
- Cross-model prompt chaining (multi-model collaboration) improves performance.
  Exchange of Thought (EMNLP 2023) and Mixture of Agents (ICLR 2025) show that ensembles or cascades of heterogeneous models outperform individual ones through complementary reasoning: “Heterogeneous agent mixtures achieve improved reliability by combining models from different families with complementary strengths.”
  EMNLP 2023 – Exchange of Thought
  ICLR 2025 – Mixture of Agents
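  As a sketch of what such a cascade can look like in practice (not the papers’ own method), one model drafts and a model from a different family reviews; `call_model` and the model names are placeholders:

  ```python
  def call_model(model: str, prompt: str) -> str:
      return f"<{model} output>"  # placeholder; replace with the matching provider SDK

  def cross_model_chain(task: str, drafter: str = "model-a", reviewer: str = "model-b") -> str:
      draft = call_model(drafter, "Write code for this task:\n" + task)
      critique = call_model(reviewer, "Review this code for bugs and missing cases:\n" + draft)
      # The original drafter revises against the second model's critique.
      return call_model(
          drafter,
          "Revise the code to address this review:\n" + critique + "\n\nOriginal code:\n" + draft,
      )
  ```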
- Human-orchestrated multi-session workflows resemble multi-agent systems.
  AutoGen (Microsoft Research, 2023) and CAMEL (2023) formalize such setups, showing that role-specialized agents coordinating via chained prompts can outperform single-model baselines: “Multi-agent systems perform role-specialized prompting and peer review to improve generation quality.”
  AutoGen 2023 – Microsoft Research
  CAMEL 2023 – Role-Playing Agents
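  A rough sketch of the role-specialized pattern these frameworks formalize, with a script (or a person) carrying artifacts between roles; the role prompts and `call_model` are illustrative placeholders:

  ```python
  ROLES = {
      "planner": "You are a planner. Produce a numbered implementation plan.",
      "coder": "You are a coder. Implement the plan exactly; output code only.",
      "reviewer": "You are a reviewer. List concrete defects in the code.",
  }

  def call_model(system: str, prompt: str) -> str:
      return f"<output for role: {system.split('.')[0]}>"  # placeholder LLM call

  def run_pipeline(task: str) -> dict:
      plan = call_model(ROLES["planner"], task)
      code = call_model(ROLES["coder"], plan)
      review = call_model(ROLES["reviewer"], code)
      return {"plan": plan, "code": code, "review": review}
  ```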