DeepSeek V4: 1M-Context Open Weights, 1/7 Opus 4.7 Pricing
DeepSeek dropped two open-weight models on April 24 — V4-Pro (1.6T params, 49B active MoE) and V4-Flash (284B, 13B active) — both with 1M-token context as a standard default, not a premium tier. Four independent source batches converged on the same story: a hybrid CSA+HCA attention architecture cuts KV-cache to 10% of the prior generation at 1M tokens; V4-Pro costs roughly one-seventh the output price of Opus 4.7; and the API ships drop-in compatible with both OpenAI and Anthropic call formats.
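The claimed dual compatibility can be sketched as a pair of request builders: an existing client keeps its wire format and repoints only its base URL. The model name below is an illustrative assumption, not a confirmed identifier.

```python
def to_openai_body(prompt: str, model: str = "deepseek-v4-pro") -> dict:
    """OpenAI chat-completions request body."""
    return {"model": model,
            "messages": [{"role": "user", "content": prompt}]}

def to_anthropic_body(prompt: str, model: str = "deepseek-v4-pro",
                      max_tokens: int = 1024) -> dict:
    """Anthropic messages request body (max_tokens is required there)."""
    return {"model": model, "max_tokens": max_tokens,
            "messages": [{"role": "user", "content": prompt}]}

# Same prompt, two wire formats; the messages payload is identical,
# which is what makes the swap a config change rather than a rewrite.
oa = to_openai_body("Summarize the contract.")
an = to_anthropic_body("Summarize the contract.")
assert oa["messages"] == an["messages"]
```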
What the Source Actually Says
The architectural engine is CSA+HCA hybrid attention. CSA compresses every four tokens into one entry, then a Lightning Indexer retrieves only the 1,024 most relevant entries, with a 128-token uncompressed sliding window preserved for short-range fidelity. HCA does the opposite: it compresses at 128:1 and attends to everything simultaneously. Alternating the two approaches across layers achieves 27% of V3.2's compute and 10% of its KV-cache at 1M tokens, natively, without RoPE-extension tricks. Training stability comes from mHC (Manifold-Constrained Hyper-Connections), a doubly-stochastic mixing matrix that caps signal amplification at 1.6× versus the 3,000× seen without it.
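A back-of-envelope sketch shows why the hybrid shrinks the attended set at 1M tokens; the formulas below are a toy reading of the numbers above, not the actual kernel.

```python
def csa_attended(seq_len: int, compress: int = 4,
                 top_k: int = 1024, window: int = 128) -> int:
    """Entries a CSA layer attends to: the top-k retrieved 4:1-compressed
    entries (via the Lightning Indexer) plus the raw sliding window."""
    compressed = seq_len // compress
    return min(top_k, compressed) + min(window, seq_len)

def hca_attended(seq_len: int, compress: int = 128) -> int:
    """Entries an HCA layer attends to: every 128:1-compressed entry."""
    return seq_len // compress

# At 1M tokens a CSA layer touches ~1.1K entries and an HCA layer ~7.8K,
# versus 1,000,000 for dense attention.
print(csa_attended(1_000_000))  # 1152
print(hca_attended(1_000_000))  # 7812
```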
Three reasoning modes — Non-Think, Think High, Think Max — replace the binary on/off switch. On HLE (Humanity's Last Exam) they score 7.7% → 34.5% → 37.7%. The Think High → Think Max jump is only 3.2 points; production deployments should default to Think High and reserve Max for proofs and complex architecture work. Think Max requires a minimum 384K context window or it truncates mid-reasoning.
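The deployment guidance reduces to a small routing rule. The task labels and the 384K floor check below are illustrative, and any API parameter for selecting a mode is an assumption.

```python
MIN_THINK_MAX_CONTEXT = 384_000  # Think Max truncates below this window

def pick_mode(task: str, context_window: int) -> str:
    """Default to Think High; escalate to Think Max only for proofs and
    complex architecture work, and only when the window can hold it."""
    if task in {"proof", "architecture"}:
        if context_window >= MIN_THINK_MAX_CONTEXT:
            return "Think Max"
        return "Think High"  # Max would truncate mid-reasoning here
    if task == "bulk":
        return "Non-Think"   # skip reasoning tokens on trivial volume
    return "Think High"      # the 3.2-point gap to Max rarely pays

print(pick_mode("proof", 1_000_000))  # Think Max
print(pick_mode("proof", 128_000))    # Think High
```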
First-day testing corroborates the headline claims: an 800K-character needle-in-haystack succeeded; a 500K-character document yielded 90%+ coverage with no major hallucinations; ~100 tool calls on V4-Flash produced zero errors. Where V4 trails: long-context retrieval accuracy (MRCR 1M: 83.5 vs Opus 4.6's 92.9), world knowledge (SimpleQA-Verified: 57.9 vs Gemini's 75.6), and agentic coding against the current closed frontier, where independent reviewer Jake Handy reports 3–15-point gaps. One procurement note: V4-Pro throughput depends on Huawei Ascend 950 supernodes expected in H2 2026, which matters for deployments subject to EU GDPR or the US CLOUD Act.
Strategic Take
The drop-in OpenAI/Anthropic API compatibility makes a model swap a realistic one-afternoon migration. Route reasoning-heavy, high-volume pipelines to V4-Flash when world knowledge arrives via RAG; because it is independently pre-trained (not distilled from Pro), it fails differently, which matters for production isolation and debugging. Deploy V4-Pro Think High as a cost-optimized tier below Opus 4.7 on mixed workloads where you don't need the closed-frontier ceiling.
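The isolation argument suggests a simple fallback chain: because Flash and Pro fail differently, a Pro retry after a failed Flash validation buys genuine diversity rather than a correlated repeat. The `call` and `validate` hooks below are placeholders you would supply, and the model identifiers are assumptions.

```python
from typing import Callable, Tuple

def with_fallback(prompt: str,
                  call: Callable[[str, str], str],
                  validate: Callable[[str], bool]) -> Tuple[str, str]:
    """Try V4-Flash first; fall back to V4-Pro if validation fails.
    Returns (output, model_used)."""
    out = call("deepseek-v4-flash", prompt)
    if validate(out):
        return out, "deepseek-v4-flash"
    return call("deepseek-v4-pro", prompt), "deepseek-v4-pro"

# Usage with stubbed hooks (a real `call` would hit the API):
fake_call = lambda model, prompt: f"{model}: answer"
always_fail = lambda out: False  # force the fallback path
print(with_fallback("classify this ticket", fake_call, always_fail)[1])
# deepseek-v4-pro
```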