
@textarcana
Last active October 10, 2025 13:33
prompt chaining across model families

Cross-Model Prompt Chaining: Expanded and Filtered Literature Review

This document consolidates highly cited foundational papers and their citing works relevant to cross-model prompt chaining across different LLM families (e.g., GPT, Claude, Qwen). Each entry includes a link to its source.


Highly-cited seed papers (≥ 5 citations)

  1. AI Chains (CHI’22) — formalizes prompt chaining; tooling makes swapping steps/models straightforward.
  2. Prompt Chaining vs Stepwise (Findings ACL’24) — chaining empirically outperforms single long prompts; supports staged flows that can be mapped onto different models.
  3. Mixture-of-Agents (ICLR’25) — layered ensembles of different LLMs; strong heterogeneous results.
  4. Exchange-of-Thought (EMNLP’23) — explicit cross-model communication (Memory / Report / Relay / Debate) to pass reasoning between models.
  5. FrugalGPT (TMLR / ICLR’24) — routing / cascades that select among multiple LLM APIs per query (a router plus a scorer that decides whether to accept an answer or escalate to the next model).

New citing papers relevant to cross-model chaining

  • Rethinking Mixture-of-Agents (2025) — evaluates when heterogeneous mixing (different families) helps vs a “Self-MoA” using only the single best model; decision insights are directly useful for GPT→Claude→Qwen handoffs. (Cites MoA.)

  • Deep Research Agents (2025) — surveys agent systems that blend multiple model families in pipelines (e.g., GPT-4.x, Claude-Sonnet, Gemini, DeepSeek); practical cross-model orchestration patterns. (Cites MoA and related multi-agent work.)

  • When Two LLMs Debate (2025) — analyzes inter-model debate dynamics and confidence revision; applicable as a handoff stage where models critique each other’s code/patches. (Cites / extends debate-style cross-model interaction lines that also reference EoT.)

  • From Standalone LLMs to Integrated Intelligence (CAIS Survey 2025) — taxonomy of orchestration strategies (components / roles / routing) for multi-model systems; design references for chained, cross-family pipelines. (Surveys and cites routing / ensemble literature incl. MoA / FrugalGPT-style methods.)

  • Knowledge-Empowered, Collaborative, and Co-Evolving LLMs (2024) — focuses on model collaboration and co-evolution, covering mechanisms to combine different LLMs / tools; relevant for deciding what role each family plays in a chain.

  • Human Intervention in LLM Multi-Agent Debate (2024) — studies human-in-the-loop control in multi-agent (often cross-model) debate pipelines; helpful guardrails for cross-model code handoffs. (Cites multi-agent debate lines related to EoT-style setups.)

  • ChainBuddy (2024) — assistant that generates evaluative LLM pipelines in ChainForge; supports planning / evaluating multi-step chains where models can be swapped — useful for implementing cross-family stage assignments. (Builds atop prompt-chaining HCI work such as AI Chains.)

  • Advances & Open Problems for LLMs (2025 Survey) — synthesizes evidence around MoA and heterogeneous teaming; extracts conditions where mixing different models is beneficial, informing when to escalate across families.


How to use these for GPT→Claude→Qwen handoffs

  • Design the chain with AI Chains / ChainBuddy patterns; assign roles per family (e.g., GPT for drafting / spec-aware scaffolds, Claude for safety / compliance critique, Qwen for refactor / optimization).
  • Add routing / cascades to escalate to stronger / more expensive families only if a cheap pass (e.g., Qwen-small) or an automated scorer flags low quality / uncertainty.
  • Enable cross-model reasoning transfer (EoT): pass not just code but rationales / diffs / tests between models; optionally add a short debate round before merging.
  • Sanity-check mixing with MoA + Rethinking-MoA insights: in some contexts, a single strong model with self-aggregation can beat mixing; measure before committing to heavy cross-family ensembles.
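The handoff pattern above can be sketched as a small pipeline: a cheap drafting stage, a FrugalGPT-style scorer that gates escalation, and an EoT-style payload that carries rationale and critique alongside the code. This is a minimal illustration, not a real integration — `gpt_draft`, `claude_critique`, `qwen_refactor`, and `cheap_score` are hypothetical stubs standing in for actual vendor SDK calls and an actual quality scorer.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for real model APIs (GPT, Claude, Qwen).
# In practice each would wrap a vendor SDK call; here they are stubs
# so the control flow can be exercised end to end.
def gpt_draft(task: str) -> str:
    return f"def solution():  # drafted for: {task}\n    pass"

def claude_critique(code: str, rationale: str) -> str:
    return "OK" if "pass" not in code else "Body is a stub; needs a real implementation."

def qwen_refactor(code: str, critique: str) -> str:
    return code.replace("pass", "return 42  # addressed critique")

def cheap_score(code: str) -> float:
    """Stand-in for a FrugalGPT-style scorer (e.g., a small model or a test suite)."""
    return 0.2 if "pass" in code else 0.9

@dataclass
class Handoff:
    """EoT-style payload: pass rationale and critique between models, not just code."""
    code: str
    rationale: str = ""
    critique: str = ""

def run_chain(task: str, escalate_below: float = 0.5) -> Handoff:
    # Stage 1: cheap draft (GPT role).
    h = Handoff(code=gpt_draft(task), rationale=f"Draft for task: {task}")
    # Gate: only escalate to the more expensive stages if quality looks low.
    if cheap_score(h.code) >= escalate_below:
        return h
    # Stage 2: critique (Claude role), consuming the rationale, not just the code.
    h.critique = claude_critique(h.code, h.rationale)
    # Stage 3: refactor (Qwen role), consuming the critique.
    h.code = qwen_refactor(h.code, h.critique)
    return h

result = run_chain("sum of first n integers")
print(result.critique)
print(result.code)
```

The escalation threshold and the scorer are the levers the Rethinking-MoA point applies to: before adopting this cross-family layout, the same harness can be run with a single strong model in all three roles to measure whether mixing actually helps.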
