HeavySkill RLVR Lifts GPT-OSS-20B from 69.7% to 85.5% on LiveCodeBench

HeavySkill is a two-stage training approach that uses RLVR (Reinforcement Learning with Verifiable Rewards) to internalize parallel reasoning and deliberation as learned model skills rather than runtime harness behaviors. Applied to GPT-OSS-20B, it improves LiveCodeBench performance from 69.7% to 85.5%, a 15.8-point gain. On IFEval, R1-Distill-Qwen-32B improves from 35.7% to 69.3%, a 33.6-point gain. The underlying claim: capabilities previously implemented as external scaffolding (parallel query processing, deliberation loops) can be trained into a model's weights, making the improvement permanent and cheap at inference time.
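The "verifiable" part of RLVR means the reward comes from a mechanical check rather than a learned reward model. For code tasks, that typically means running the candidate program against unit tests. The sketch below is a minimal illustration of that idea; the function name and 1.0/0.0 reward scheme are assumptions for illustration, not HeavySkill's actual (unpublished) reward implementation.

```python
import subprocess
import sys
import tempfile


def verifiable_reward(candidate_code: str, test_code: str, timeout: float = 10.0) -> float:
    """Hypothetical RLVR-style reward: 1.0 if the candidate passes the tests, else 0.0.

    The candidate and its tests are written to a temp file and executed in a
    fresh interpreter; a zero exit code means every assertion passed.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        # Infinite loops and hangs score zero, like any other failure.
        return 0.0


# Toy rollout: a correct and an incorrect completion for the same prompt.
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

print(verifiable_reward(good, tests))  # 1.0
print(verifiable_reward(bad, tests))   # 0.0
```

In a full RLVR loop, this scalar would feed a policy-gradient update (e.g. PPO or GRPO) over many sampled completions; the binary pass/fail signal is what makes the reward cheap to compute and hard to game relative to a learned reward model.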

Why It Matters

If agentic harness wins can routinely be converted into model weights via RLVR, that threatens the business model of orchestration framework vendors (LangChain, LlamaIndex, CrewAI) and suggests that the "best practice" harness of today becomes the default model behavior of next year.