HeavySkill RLVR Lifts GPT-OSS-20B from 69.7% to 85.5% on LiveCodeBench

HeavySkill is a two-stage training approach that uses RLVR (Reinforcement Learning with Verifiable Rewards) to internalize parallel reasoning and deliberation as learned model skills rather than runtime harness behaviors. Applied to GPT-OSS-20B, it improves LiveCodeBench performance from 69.7% to 85.5%, a 15.8-point gain. On IFEval, R1-Distill-Qwen-32B improves from 35.7% to 69.3%, a 33.6-point gain. The underlying claim: capabilities previously implemented as external scaffolding (parallel query processing, deliberation loops) can be trained into a model's weights, making the improvement permanent and cheap at inference time.
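The "verifiable" part of RLVR means the reward comes from a mechanical check rather than a learned reward model. For code tasks, that typically means running the candidate program against unit tests. The sketch below is a minimal illustration of that idea; the function name and 1.0/0.0 reward scheme are assumptions for illustration, not HeavySkill's actual (unpublished) reward implementation.

```python
import subprocess
import sys
import tempfile


def verifiable_reward(candidate_code: str, test_code: str, timeout: float = 10.0) -> float:
    """Hypothetical RLVR-style reward: 1.0 if the candidate passes the tests, else 0.0.

    The candidate and its tests are written to a temp file and executed in a
    fresh interpreter; a zero exit code means every assertion passed.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        # Infinite loops and hangs score zero, like any other failure.
        return 0.0


# Toy rollout: a correct and an incorrect completion for the same prompt.
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

print(verifiable_reward(good, tests))  # 1.0
print(verifiable_reward(bad, tests))   # 0.0
```

In a full RLVR loop, this scalar would feed a policy-gradient update (e.g. PPO or GRPO) over many sampled completions; the binary pass/fail signal is what makes the reward cheap to compute and hard to game relative to a learned reward model.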

Why It Matters

If agentic harness wins can routinely be converted into model weights via RLVR, that threatens the business model of orchestration framework vendors (LangChain, LlamaIndex, CrewAI) and suggests that the "best practice" harness of today becomes the default model behavior of next year.