Anthropic RSI Report: Claude Writes 80% of Its Own Code, 52× Training Speedup
Anthropic published its first recursive-self-improvement (RSI) report on June 3, revealing that Claude now authors more than 80% of the code merged into its own codebase and delivered a 52× training speedup in April. Four X accounts, three YouTube channels, and the NLP Newsletter all corroborated the same figures within 48 hours — five independent batches converging on one story. The report frames human "taste" — the judgment to choose the right problem — as the last frontier, and explicitly says it has a clock on it.
What the Source Actually Says
The headline data: as of May 2026, >80% of Anthropic's merged code is Claude-authored, up from low single digits when Claude Code shipped in early 2025. Typical engineers now merge 8× as much code per day as in 2024, though Anthropic flags lines-of-code as a poor proxy and estimates true gains are lower. The harder benchmark is the training-speed test: asked to optimize a training pipeline, Mythos Preview (April 2026) achieved a 52× speedup — a task where a strong human researcher needs 4–8 hours just to reach 4×. Claude Opus 4 managed 3× in May 2025. Separately, Claude Code shipped 100+ fixes in one month that cut an API error class 1,000-fold — work Anthropic estimates would take a human engineer around four years.
The more consequential result is the "taste" experiment. At real session midpoints, Mythos Preview proposed a better next step than the human researcher 64% of the time, up from 51% for Opus 4.5 in November. The footnote matters: against moments where the human made a strong, deliberate choice, the model won only 20% of the time — it rescues weak moves but does not yet outthink good ones. Reinforcing the same trajectory, Claude agents closed an open AI safety research gap from 25% (two humans, one week) to 97%, over ~800 cumulative compute hours at ~$18,000 — but the human still set the objective and wrote the scoring rubric. METR's task-horizon benchmark tracks it all: Mythos Preview clocks ~17 autonomous hours at 50% success, doubling roughly every four months.
Strategic Take
The practical shift is already in motion: leverage is moving from doing the work to specifying it precisely and checking it. Anthropic's own words — "the story that humans stay in the loop on judgment has a clock on it" — are not a launch date, but they are a design signal worth building toward in any orchestration-heavy or agentic stack now, not later.



