This document consolidates highly cited foundational papers and their citing works relevant to cross-model prompt chaining across different LLM families (e.g., GPT, Claude, Qwen). Each entry includes a link to its source.
- AI Chains (CHI’22) — formalizes prompt chaining; tooling makes swapping steps/models straightforward.
- Prompt Chaining vs Stepwise (Findings ACL’24) — chaining empirically outperforms single long prompts; supports staged flows that can be mapped onto different models.
- Mixture-of-Agents (ICLR’25) — layered ensembles of different LLMs; strong heterogeneous results.
- Exchange-of-Thought (EMNLP’23) — explicit cross-model communication (Memory / Report / Relay / Debate) to pass reasoning between models.
- FrugalGPT (TMLR / ICLR’24) — routing / cascades that select among multiple LLM APIs per query (router + answer scorer + stopping rule deciding whether to escalate).
- Rethinking Mixture-of-Agents (2025) — evaluates when heterogeneous mixing (different families) helps vs a “Self-MoA” using only the single best model; decision insights are directly useful for GPT→Claude→Qwen handoffs. (Cites MoA.)
- Deep Research Agents (2025) — surveys agent systems that blend multiple model families in pipelines (e.g., GPT-4.x, Claude-Sonnet, Gemini, DeepSeek); practical cross-model orchestration patterns. (Cites MoA and related multi-agent work.)
- When Two LLMs Debate (2025) — analyzes inter-model debate dynamics and confidence revision; applicable as a handoff stage where models critique each other’s code/patches. (Cites / extends debate-style cross-model interaction lines that also reference EoT.)
- From Standalone LLMs to Integrated Intelligence (CAIS Survey 2025) — taxonomy of orchestration strategies (components / roles / routing) for multi-model systems; design references for chained, cross-family pipelines. (Surveys and cites routing / ensemble literature incl. MoA / FrugalGPT-style methods.)
- Knowledge-Empowered, Collaborative, and Co-Evolving LLMs (2024) — focuses on model collaboration and co-evolution, covering mechanisms to combine different LLMs / tools; relevant for deciding what role each family plays in a chain.
- Human Intervention in LLM Multi-Agent Debate (2024) — studies human-in-the-loop control in multi-agent (often cross-model) debate pipelines; helpful guardrails for cross-model code handoffs. (Cites multi-agent debate lines related to EoT-style setups.)
- ChainBuddy (2024) — assistant that generates evaluative LLM pipelines in ChainForge; supports planning / evaluating multi-step chains where models can be swapped — useful for implementing cross-family stage assignments. (Builds atop prompt-chaining HCI work such as AI Chains.)
- Advances & Open Problems for LLMs (2025 Survey) — synthesizes evidence around MoA and heterogeneous teaming; extracts conditions where mixing different models is beneficial, informing when to escalate across families.
- Design the chain with AI Chains / ChainBuddy patterns; assign roles per family (e.g., GPT for drafting / spec-aware scaffolds, Claude for safety / compliance critique, Qwen for refactor / optimization); see the role-assigned chain sketch below.
- Add routing / cascades to escalate to stronger / more expensive families only when a cheap pass (e.g., Qwen-small) or an automated scorer flags low quality / uncertainty; see the cascade sketch below.
- Enable cross-model reasoning transfer (EoT): pass not just code but rationales / diffs / tests between models; optionally add a short debate round before merging (see the handoff sketch below).
- Sanity-check mixing with MoA + Rethinking-MoA insights: in some contexts, a single strong model with self-aggregation can beat mixing; measure before committing to heavy cross-family ensembles (see the comparison sketch below).
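
To make the role-assignment takeaway concrete, here is a minimal sketch of an AI Chains / ChainBuddy-style pipeline where each stage is a (role, model, prompt template) triple, so any stage's model can be swapped without touching the rest of the chain. The `ModelFn` wrapper, the `gpt` / `claude` / `qwen` placeholder functions, and the prompt templates are assumptions for illustration, not APIs from the cited papers.

```python
from dataclasses import dataclass
from typing import Callable, List

# A model is just "prompt in, text out"; wrap whichever provider SDK you use behind this.
ModelFn = Callable[[str], str]

@dataclass
class Stage:
    role: str        # e.g. "draft", "critique", "refactor"
    model: ModelFn   # GPT / Claude / Qwen wrapper, swappable per stage
    template: str    # prompt template with an {input} placeholder

def run_chain(task: str, stages: List[Stage]) -> str:
    """Feed each stage's output into the next stage's prompt."""
    current = task
    for stage in stages:
        current = stage.model(stage.template.format(input=current))
    return current

# Placeholder wrappers so the sketch runs end to end; replace with real SDK calls.
def gpt(prompt: str) -> str:
    return f"[gpt draft for: {prompt[:40]}...]"

def claude(prompt: str) -> str:
    return f"[claude review of: {prompt[:40]}...]"

def qwen(prompt: str) -> str:
    return f"[qwen refactor of: {prompt[:40]}...]"

pipeline = [
    Stage("draft",    gpt,    "Write code for this spec:\n{input}"),
    Stage("critique", claude, "Review this draft; return revised code plus notes:\n{input}"),
    Stage("refactor", qwen,   "Optimize the reviewed code without changing behavior:\n{input}"),
]

print(run_chain("parse a CSV file into dataclasses", pipeline))
```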
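
For the routing takeaway, a minimal FrugalGPT-style cascade sketch: query tiers from cheapest to most expensive and stop at the first answer the scorer accepts. Tier names, thresholds, and the toy scorer are illustrative assumptions; in practice the scorer would be a trained quality model, a test signal, or an uncertainty estimate.

```python
from typing import Callable, List, Tuple

ModelFn = Callable[[str], str]
ScoreFn = Callable[[str, str], float]   # (prompt, answer) -> estimated quality in [0, 1]

def cascade(prompt: str,
            tiers: List[Tuple[str, ModelFn, float]],   # (name, model, accept_threshold)
            scorer: ScoreFn) -> Tuple[str, str]:
    """Query tiers from cheapest to most expensive; stop at the first accepted answer."""
    name, answer = "", ""
    for name, model, threshold in tiers:
        answer = model(prompt)
        if scorer(prompt, answer) >= threshold:
            break   # scorer accepts this tier's answer, so no escalation
    return name, answer

# Illustrative wiring: a small Qwen pass first, then GPT, then Claude as last resort.
# A final-tier threshold of 0.0 means its answer is always accepted.
def qwen_small(p: str) -> str: return "cheap first-pass answer"
def gpt(p: str) -> str: return "stronger answer"
def claude(p: str) -> str: return "strongest answer"
def toy_scorer(prompt: str, answer: str) -> float: return 0.9 if "answer" in answer else 0.0

tier, result = cascade("Explain the bug in this diff: ...",
                       [("qwen-small", qwen_small, 0.8),
                        ("gpt", gpt, 0.8),
                        ("claude", claude, 0.0)],
                       toy_scorer)
print(tier, "->", result)
```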
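
For the EoT-style handoff takeaway, a sketch of a payload that carries rationale, diff, and test output alongside the code, plus one critique round in which several models review the full payload before a merge decision. The `Handoff` fields, the prompt wording, and the 'LGTM' convention are assumptions for illustration, not a protocol defined by the EoT paper.

```python
from dataclasses import dataclass
from typing import Callable, List

ModelFn = Callable[[str], str]

@dataclass
class Handoff:
    code: str
    rationale: str          # why the producing model made these choices
    diff: str = ""          # change relative to the previous hop
    test_report: str = ""   # test-suite output, if tests were run

def debate_round(handoff: Handoff, critics: List[ModelFn]) -> List[str]:
    """Each critic model reviews the full handoff (code + rationale + diff + tests)."""
    prompt = (
        f"Code:\n{handoff.code}\n\n"
        f"Author rationale:\n{handoff.rationale}\n\n"
        f"Diff since last hop:\n{handoff.diff}\n\n"
        f"Test report:\n{handoff.test_report}\n\n"
        "List concrete objections, or reply exactly 'LGTM'."
    )
    return [critic(prompt) for critic in critics]

def ready_to_merge(critiques: List[str]) -> bool:
    """Merge only when every critic signs off (or loop back for one revision pass)."""
    return all(c.strip() == "LGTM" for c in critiques)
```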
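
For the final takeaway, a sketch of how one might measure MoA-style heterogeneous aggregation against Self-MoA (several samples from the single best model, aggregated the same way) on your own tasks before committing to a cross-family ensemble. The aggregation prompt, the `n_self` default, and the scoring interface are assumptions; Self-MoA sampling presumes nonzero temperature so repeated calls differ.

```python
import statistics
from typing import Callable, List

ModelFn = Callable[[str], str]
TaskScore = Callable[[str, str], float]   # (task, final_answer) -> task-level score

def aggregate(task: str, proposals: List[str], aggregator: ModelFn) -> str:
    """One MoA-style layer: an aggregator model synthesizes the candidate answers."""
    joined = "\n\n---\n\n".join(proposals)
    return aggregator(f"Task:\n{task}\n\nCandidate answers:\n{joined}\n\n"
                      "Synthesize the single best answer.")

def compare_moa_vs_self_moa(tasks: List[str],
                            families: List[ModelFn],   # one model per family (GPT, Claude, Qwen, ...)
                            best: ModelFn,             # single strongest model, sampled with temperature > 0
                            aggregator: ModelFn,
                            score: TaskScore,
                            n_self: int = 3) -> None:
    """Score heterogeneous mixing vs. repeated sampling of one strong model."""
    moa_scores, self_scores = [], []
    for task in tasks:
        mixed = aggregate(task, [m(task) for m in families], aggregator)
        self_mixed = aggregate(task, [best(task) for _ in range(n_self)], aggregator)
        moa_scores.append(score(task, mixed))
        self_scores.append(score(task, self_mixed))
    print("MoA mean score:     ", statistics.mean(moa_scores))
    print("Self-MoA mean score:", statistics.mean(self_scores))
```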