OpenAI Discloses Accidental CoT Grading Found in Prior RL Training Runs

OpenAI has built and deployed an internal scanning system to detect accidental chain-of-thought (CoT) grading — reward signals that inadvertently score the model's reasoning trace rather than only its final output — across all of its reinforcement learning training runs. The audit found instances of accidental CoT grading in the training of previously deployed models. OpenAI states there is no clear evidence that these instances degraded chain-of-thought monitorability in deployed systems, but the disclosure marks a notable step in proactive safety transparency.
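To make the failure mode concrete, here is a minimal, purely illustrative sketch of what such a scan could look like. OpenAI has not published its implementation; the config shape, field names (`reward_inputs`, `chain_of_thought`, etc.), and helper functions below are all invented for this example. The idea is simply that a grader is "CoT-safe" only if its reward inputs exclude the model's private reasoning trace:

```python
# Hypothetical sketch of a scan for accidental CoT grading.
# All config fields and names are invented for illustration.

# Input fields that would expose the reasoning trace to the grader
# (assumed names, not OpenAI's actual schema).
COT_FIELDS = {"cot", "chain_of_thought", "reasoning_trace"}

def grader_reads_cot(grader_config: dict) -> bool:
    """Return True if any reward input field exposes CoT tokens."""
    inputs = set(grader_config.get("reward_inputs", []))
    return bool(inputs & COT_FIELDS)

def scan_runs(run_configs: list[dict]) -> list[str]:
    """Flag training runs whose grader accidentally sees the CoT."""
    return [
        run["run_id"]
        for run in run_configs
        if grader_reads_cot(run.get("grader", {}))
    ]

runs = [
    {"run_id": "run-a",
     "grader": {"reward_inputs": ["final_answer"]}},
    {"run_id": "run-b",
     "grader": {"reward_inputs": ["final_answer", "chain_of_thought"]}},
]
print(scan_runs(runs))  # flags only run-b
```

The safety concern such a scan targets: if the reward ever depends on the CoT itself, optimization pressure lands on the reasoning trace, which can make it less faithful and therefore less useful for monitoring.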

Why It Matters

This is a rare public disclosure of an unintended training artifact discovered via internal audit rather than external reporting. It signals that major labs, Anthropic (with its NLA interpretability work) and OpenAI alike, are investing in tooling to understand what their models are actually learning, and are willing to disclose imperfect findings rather than wait for post-deployment incidents.