OpenAI Discloses Accidental CoT Grading Found in Prior RL Training Runs

OpenAI has built and deployed an internal scanning system to detect accidental chain-of-thought (CoT) grading — reward signals that inadvertently score the model's reasoning trace rather than only its final output — across all of its reinforcement learning training runs. The audit found instances of accidental CoT grading in the training of previously deployed models. OpenAI states there is no clear evidence that these instances degraded chain-of-thought monitorability in deployed systems, but the disclosure marks a notable step in proactive safety transparency.
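To make the failure mode concrete, here is a minimal, purely illustrative sketch of what such a scan could look like. OpenAI has not published its implementation; the config shape, field names (`reward_inputs`, `chain_of_thought`, etc.), and helper functions below are all invented for this example. The idea is simply that a grader is "CoT-safe" only if its reward inputs exclude the model's private reasoning trace:

```python
# Hypothetical sketch of a scan for accidental CoT grading.
# All config fields and names are invented for illustration.

# Input fields that would expose the reasoning trace to the grader
# (assumed names, not OpenAI's actual schema).
COT_FIELDS = {"cot", "chain_of_thought", "reasoning_trace"}

def grader_reads_cot(grader_config: dict) -> bool:
    """Return True if any reward input field exposes CoT tokens."""
    inputs = set(grader_config.get("reward_inputs", []))
    return bool(inputs & COT_FIELDS)

def scan_runs(run_configs: list[dict]) -> list[str]:
    """Flag training runs whose grader accidentally sees the CoT."""
    return [
        run["run_id"]
        for run in run_configs
        if grader_reads_cot(run.get("grader", {}))
    ]

runs = [
    {"run_id": "run-a",
     "grader": {"reward_inputs": ["final_answer"]}},
    {"run_id": "run-b",
     "grader": {"reward_inputs": ["final_answer", "chain_of_thought"]}},
]
print(scan_runs(runs))  # flags only run-b
```

The safety concern such a scan targets: if the reward ever depends on the CoT itself, optimization pressure lands on the reasoning trace, which can make it less faithful and therefore less useful for monitoring.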

Why It Matters

This is a rare public disclosure of an unintended training artifact discovered via internal audit rather than external reporting. It signals that major labs, Anthropic (with its NLA interpretability work) and OpenAI alike, are investing in tooling to understand what their models are actually learning, and are willing to disclose imperfect findings rather than wait for post-deployment incidents.