MILKYWAY Paper: Prediction Harness Scores 61% vs 44% for Raw GPT-5.4 + Web Search

A new arXiv preprint from City University of Hong Kong, Tsinghua, and USTC introduces MILKYWAY, an external text harness that outperforms GPT-5.4 + web search on temporal prediction benchmarks (61% vs 44%) by externalizing learning into editable skill.md-style instruction files, which a second "harness editor" agent keeps updated.

Researchers at City University of Hong Kong, Tsinghua University, and USTC published MILKYWAY (arXiv preprint, April 17), a system that freezes GPT-5.4's weights and moves all learning into an external, editable text harness built around three components: Factors (F), Evidence (E), and Uncertainty (T). A second "harness editor" agent rewrites this instruction file as new evidence arrives, before the event being predicted resolves. On the Future-X and Future-World benchmarks, MILKYWAY scores 61% vs 44% for GPT-5.4 + web search; the lead holds at longer prediction horizons (70% vs 57% at T-5 days).
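To make the architecture concrete, here is a minimal Python sketch of the idea as described above: a frozen model, a text harness with Factors, Evidence, and Uncertainty sections, and an editor step that updates the harness rather than the weights. It is not the paper's implementation. Every name in it (`Harness`, `edit_harness`, `predict`, the section headings) is a hypothetical chosen for illustration, and the editor here is a plain function standing in for what the preprint describes as a second agent.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Harness:
    """Hypothetical editable text harness; model weights are never touched."""
    factors: List[str] = field(default_factory=list)      # F: drivers of the outcome
    evidence: List[str] = field(default_factory=list)     # E: observations gathered so far
    uncertainty: List[str] = field(default_factory=list)  # T: open questions

    def render(self) -> str:
        """Serialize the harness into a skill.md-style instruction file."""
        sections = [
            ("Factors (F)", self.factors),
            ("Evidence (E)", self.evidence),
            ("Uncertainty (T)", self.uncertainty),
        ]
        lines: List[str] = []
        for title, items in sections:
            lines.append(f"## {title}")
            lines.extend(f"- {item}" for item in items)
            lines.append("")
        return "\n".join(lines).strip() + "\n"


def edit_harness(harness: Harness, new_evidence: str) -> Harness:
    """Stand-in for the harness editor agent (assumption: a real editor
    would be a model call that rewrites the text). This stub only appends
    the new evidence and returns an updated copy."""
    return Harness(
        factors=list(harness.factors),
        evidence=harness.evidence + [new_evidence],
        uncertainty=list(harness.uncertainty),
    )


def predict(frozen_model: Callable[[str], str], harness: Harness, question: str) -> str:
    """Query the frozen model with the current harness text as instructions."""
    prompt = harness.render() + "\nQuestion: " + question
    return frozen_model(prompt)
```

In this sketch all "learning" lives in the `Harness` object: calling `edit_harness` changes what the next `predict` call sees, while `frozen_model` stays the same callable throughout.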

Why It Matters

MILKYWAY is a concrete demonstration that agent scaffolding alone, with zero changes to model weights, can beat the same model with web search on a structured reasoning task. The skill.md pattern (editable text instruction files updated by a meta-agent) maps directly to conventions in Claude Code and, increasingly, other agent platforms. The key limitation: delete the harness and the model reverts to its baseline performance, because nothing it learned is stored in the weights. Analysis and walkthrough: Discover AI.