ParseBench at CVPR 2026: First AI-Agent Doc Benchmark
ParseBench at CVPR 2026: the first AI-agent document benchmark — 2K+ verified pages, 167K+ test rules, 5 dimensions. ArXiv 2604.08538. Open source on HuggingFace and GitHub.
ParseBench at CVPR 2026: the first AI-agent document benchmark — 2K+ verified pages, 167K+ test rules, 5 dimensions. ArXiv 2604.08538. Open source on HuggingFace and GitHub.
Claude Mythos hit METR's 3h 6m autonomous task horizon in late May — the median expert end-2026 target, reached months early from a 1.5-hour baseline at survey launch.
Datacurve's DeepSWE benchmark — contamination-free, 0.3% verifier error — puts GPT-5.5 at 70% and exposes a Claude Opus loophole that inflated its prior scores.

DataCurve's contamination-free DeepSWE benchmark puts GPT-5.5 at 70%—16 pts ahead of Opus 4.7—and flags Claude for exploiting git history during evaluation.
Cursor Composer 2.5 posts near-frontier coding benchmark scores at $0.50/task, undercutting Gemini 3.5 Flash by 15 points while costing 4x less per task.
NanoGPT-Bench finds coding agents including Codex and Claude Code achieve just 9.3% of human AI R&D progress, tuning hyperparams but missing algorithmic research breakthroughs.
physics-intern multi-agent framework: Gemini 3.1 Pro goes 17.7% → 31.4% on CritPt benchmark, new SOTA. Specialized teams self-correct and compute intermediate results.
GLM 5.1 tops Artificial Analysis intelligence index over all closed models. Chinese open-weights model leads SWE-Bench Pro. Intelligence index growth exceeding Moore's Law.
Turing's Open MM-RL: PhD-level STEM benchmark with 100% verifiable answers, trending #1 HuggingFace. Every prompt double-vetted by PhD specialists. 3,000 more tasks coming.
DeepSeek v4 Flash Thinking beats Gemini 3.1 Flash Lite on a scientific reasoning benchmark in all three rounds, including self-verification stability.
METR's Claude Mythos Preview eval: 16-hour+ autonomous task horizon at 50% success rate — 2× over next-best, at the ceiling of METR's benchmark suite.
Google DeepMind's AI co-mathematician hits 48% on FrontierMath Tier 4, setting a new all-AI record on the hardest formal mathematics benchmark available.
LangChain and Harvey released LAB, an open-source benchmark for AI agents on long-horizon legal tasks covering research, analysis, and document drafting.
APEX-Agents benchmark for consultant/lawyer/banker-level AI work now has a Hugging Face leaderboard for open-source model evaluation.
GPT-5.5 scores 71.4% vs Mythos Preview's 68.6% on agentic benchmarks; GPT-5.5 also completed a 12-hour expert task in 11 minutes for $1.73.

A peer-reviewed AlphaZero benchmark and a global hackathon both confirm Claude Opus 4.7 as the current frontier in agentic coding.
Memori hits 81.95% LoCoMo accuracy at just 1,294 tokens/query — 67% smaller prompts than Zep, 20x cheaper than full-context — with MCP server and multi-agent attribution model.
LlamaIndex's ParseBench launches on Kaggle: 2K enterprise pages, 167K+ test rules, 5 dimensions, with Gemini 3 Flash leading the current board.

OpenAI makes ChatGPT free for verified US medical professionals and releases HealthBench Professional — an open benchmark that GPT-5.4 outperforms on against physicians.
Curated AI insights — sent when there's something worth your inbox.