Unguided LLM Debate: 3× Token Cost, Lower Accuracy
Three independently authored arXiv papers, posted within weeks of one another, deliver a unified verdict against unguided multi-agent LLM debate: it consistently matches or underperforms isolated self-correction on accuracy while consuming 2.1–3.4× more tokens. The findings directly challenge the widespread "council of agents" assumption, the belief that peer review among LLMs filters errors rather than amplifying them.
What the Sources Actually Say
"The Cost of Consensus" (Bertalanič & Fortuna, arXiv:2605.00914) ran teams of 10 homogeneous agents (Qwen2.5-7B, Llama-3.1-8B, and Ministral-3-8B) through three debate rounds on GSM-Hard and MMLU-Hard. The study isolates three measurable failure pathways. Sycophantic conformity: agents uncritically adopt majority answers, with modal adoption reaching 85.5%. Contextual fragility: peer rationales destabilize previously correct reasoning, with vulnerability rates up to 70%. Consensus collapse: plurality voting discards answers already present in the pool, with an oracle gap of up to 32.3 percentage points, meaning the correct answer existed in the pool but was voted out. A critical, counterintuitive finding: conformity peaks at minimal peer exposure (K=2) and intensifies with greater initial diversity, so the cheapest debate topology is the most damaging. Token cost reaches up to 28,631 tokens per problem, against a self-correction baseline that is 2.1–3.4× cheaper.
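To make these failure pathways concrete, here is a minimal Python sketch of the debate-then-plurality-vote protocol and the oracle-gap metric described above. It is illustrative only: agents are assumed to be plain callables wrapping an LLM, and the function names, prompt wording, and full-peer-exposure setting are assumptions, not taken from the paper's code.

```python
import collections
from typing import Callable, List

def debate(agents: List[Callable[[str], str]], problem: str, rounds: int = 3) -> str:
    """Sketch of homogeneous-agent debate with plurality voting.
    Every agent answers, then revises after seeing all peer answers
    (full exposure; the paper also varies the peer-exposure level K)."""
    answers = [agent(problem) for agent in agents]
    for _ in range(rounds):
        answers = [
            agent(problem
                  + "\nPeer answers:\n"
                  + "\n".join(a for j, a in enumerate(answers) if j != i)
                  + "\nRevise your answer if needed.")
            for i, agent in enumerate(agents)
        ]
    # Plurality vote: consensus collapse lives here, because a correct
    # minority answer still present in `answers` can be discarded.
    return collections.Counter(answers).most_common(1)[0][0]

def oracle_gap_pp(pools: List[List[str]], voted: List[str], gold: List[str]) -> float:
    """Oracle gap in percentage points: how often the gold answer sat in
    the final answer pool, minus how often the vote actually picked it."""
    in_pool = sum(g in pool for pool, g in zip(pools, gold)) / len(gold)
    picked = sum(v == g for v, g in zip(voted, gold)) / len(gold)
    return 100.0 * (in_pool - picked)
```

The vote in the last line of `debate` is where an answer that is correct but held by a minority gets thrown away, which is exactly what `oracle_gap_pp` quantifies.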
Two companion papers reveal the structural depth of the problem. Yao et al. (arXiv:2605.01750) show that LLM dyads consistently fail to reach Pareto-optimal resource allocations in multi-turn negotiation, even when each agent identifies those outcomes in isolation. Full-transparency interventions prove that information access is not the bottleneck: the failure lies in dynamic grounding (joint plan formation, commitment tracking, and turn-by-turn execution). Ko et al. (arXiv:2604.06091) demonstrate that representative LLM agents are adversarially manipulable through purely social levers: larger adversarial groups, more capable-seeming peers, and longer arguments all degrade accuracy, mirroring known human group-psychology biases.
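The Pareto criterion in Yao et al. is easy to state precisely, which is what makes the dyadic result striking: the check below is one each agent reportedly passes alone. This is a generic dominance test, a sketch under assumed utility tables; the allocation encoding and data structures are illustrative, not the paper's setup.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: one integer per agent, e.g. units of a
# resource each agent receives under a given split.
Allocation = Tuple[int, ...]

def is_pareto_optimal(outcome: Allocation,
                      candidates: List[Allocation],
                      utilities: List[Dict[Allocation, float]]) -> bool:
    """True if no candidate allocation makes some agent strictly better
    off without making another agent strictly worse off."""
    base = [u[outcome] for u in utilities]
    for alt in candidates:
        alt_u = [u[alt] for u in utilities]
        better = any(a > b for a, b in zip(alt_u, base))
        no_worse = all(a >= b for a, b in zip(alt_u, base))
        if better and no_worse:
            return False  # `alt` Pareto-dominates `outcome`
    return True
```

A dyad that negotiates its way to an outcome failing this test has left value on the table, and that is the gap the full-transparency interventions could not close.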
Strategic Take
For teams running debate scaffolding with small (7–8B) homogeneous models, structured role differentiation or model heterogeneity is a prerequisite, not an optional upgrade. Where neither condition is met, isolated self-correction is the default on both cost and accuracy. Majority-vote aggregation among homogeneous agents does not correct errors; it propagates and entrenches them.
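For contrast with the debate loop sketched earlier, the isolated self-correction baseline is a single-agent revise loop with no peer answers in context. The sketch below is a minimal version; the prompt wording and the two-pass default are assumptions, not drawn from any of the three papers.

```python
from typing import Callable

def self_correct(agent: Callable[[str], str], problem: str, passes: int = 2) -> str:
    """Isolated self-correction: the agent critiques and revises its own
    draft, never seeing peer answers, so conformity has nothing to latch onto."""
    answer = agent(problem)
    for _ in range(passes):
        answer = agent(
            problem
            + "\nYour previous answer:\n" + answer
            + "\nCheck the reasoning step by step and give a corrected final answer."
        )
    return answer
```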

