DeepSeek v4 Flash Thinking Decisively Beats Gemini Flash on Scientific Reasoning

Discover AI has published a reproducible benchmark comparing DeepSeek v4 Flash Thinking with Gemini 3.1 Flash Lite Preview on a multi-step constraint-satisfaction problem. DeepSeek won all three evaluation rounds. It first solved the task in 10 button presses and then optimized its solution to 8, while Gemini regressed from 14 to 18 presses when asked to verify its own solution. Even with the thinking level set to "high" on the out-of-preview Gemini model, the initial solution of 20 presses was invalid, and optimization brought it down only to 12. A key observation: Gemini Flash's "thinking output" is a synthetic post-hoc summary rather than a transparent reasoning chain, whereas DeepSeek exposes the actual reasoning trace, which the user can verify.

Why It Matters

Self-verification regression, where asking a model to check its answer makes the answer worse, is a meaningful reliability signal for scientific and engineering tasks. The open reasoning trace in DeepSeek v4 Flash gives users an audit trail that models exposing only a summary of their reasoning cannot, which is a structural advantage in high-stakes domains.
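
As an illustration only, the regression check can be written as a comparison of press counts before and after the verification round. The sketch below is not the benchmark's harness. The `Run` class, its field names, and the short label for the third run are hypothetical, and it assumes that fewer button presses is better; only the numbers come from the results reported above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One model's result before and after it was asked to revise its answer.

    Hypothetical structure for illustration; not part of the benchmark.
    """
    model: str
    initial_presses: int
    revised_presses: int
    initial_valid: bool = True  # the third run's first answer was reported invalid

    def delta(self) -> int:
        # Negative means the revision used fewer presses (assumed to be better).
        return self.revised_presses - self.initial_presses

    def regressed(self) -> bool:
        # Regression: the revised answer needs more presses than the initial one.
        return self.delta() > 0


# Press counts as reported in the summary above; model labels are shortened.
RUNS = [
    Run("DeepSeek v4 Flash Thinking", 10, 8),
    Run("Gemini 3.1 Flash Lite Preview", 14, 18),
    Run("Gemini (out-of-preview, thinking=high)", 20, 12, initial_valid=False),
]


def regressions(runs):
    """Return the names of models whose revised answer got worse."""
    return [r.model for r in runs if r.regressed()]


if __name__ == "__main__":
    for r in RUNS:
        status = "regressed" if r.regressed() else "improved or unchanged"
        note = "" if r.initial_valid else " (initial answer invalid)"
        print(f"{r.model}: {r.initial_presses} -> {r.revised_presses} "
              f"({r.delta():+d}, {status}){note}")
```

A check like this compares press counts only. The `initial_valid` flag is carried along because an invalid starting answer, as in the third run, makes its press count a weak baseline for judging improvement.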