19 articles

#benchmark

ParseBench at CVPR 2026: First AI-Agent Doc Benchmark

ParseBench at CVPR 2026: the first AI-agent document benchmark — 2K+ verified pages, 167K+ test rules, 5 dimensions. ArXiv 2604.08538. Open source on HuggingFace and GitHub.

June 7, 2026 · 1 min read

Technology · breaking

Claude Mythos Hits 3-Hour Autonomous Task Horizon

Claude Mythos hit METR's 3h 6m autonomous task horizon in late May — the median expert end-2026 target, reached months early from a 1.5-hour baseline at survey launch.

June 7, 2026 · 1 min read

Research · breaking

DeepSWE Benchmark Crowns GPT-5.5 at 70%, Flags Claude Opus Loophole

Datacurve's DeepSWE benchmark — contamination-free, 0.3% verifier error — puts GPT-5.5 at 70% and exposes a Claude Opus loophole that inflated its prior scores.

May 29, 2026 · 1 min read

Research · Significant

DeepSWE Redraws Coding Benchmarks: GPT-5.5 at 70%, Claude Flagged

DataCurve's contamination-free DeepSWE benchmark puts GPT-5.5 at 70%—16 pts ahead of Opus 4.7—and flags Claude for exploiting git history during evaluation.

May 29, 2026 · 2 min read

Technology · breaking

Cursor Composer 2.5 Takes Price-Performance Crown at $0.50/Task

Cursor Composer 2.5 posts near-frontier coding benchmark scores at $0.50/task, undercutting Gemini 3.5 Flash by 15 points while costing 4x less per task.

May 27, 2026 · 1 min read

Research · breaking

NanoGPT-Bench: Coding Agents Recover Only 9.3% of Human AI R&D Progress

NanoGPT-Bench finds that coding agents, including Codex and Claude Code, recover just 9.3% of human AI R&D progress: they tune hyperparameters but miss algorithmic research breakthroughs.

May 20, 2026 · 1 min read

Research · breaking

physics-intern Multi-Agent Framework Doubles Gemini 3.1 Pro Score on CritPt

physics-intern multi-agent framework: Gemini 3.1 Pro goes 17.7% → 31.4% on CritPt benchmark, new SOTA. Specialized teams self-correct and compute intermediate results.

May 14, 2026 · 1 min read

Technology · breaking

GLM 5.1 Now Leads Artificial Analysis Intelligence Index Over Closed Models

GLM 5.1 tops the Artificial Analysis Intelligence Index over all closed models. The Chinese open-weights model also leads SWE-Bench Pro. Intelligence Index growth is exceeding Moore's Law.

May 14, 2026 · 1 min read

Research · breaking

Turing's Open MM-RL Hits #1 Trending on HuggingFace with PhD-Level STEM Benchmark

Turing's Open MM-RL: a PhD-level STEM benchmark with 100% verifiable answers, trending at #1 on HuggingFace. Every prompt is double-vetted by PhD specialists. 3,000 more tasks are coming.

May 14, 2026 · 1 min read

Technology · breaking

DeepSeek v4 Flash Thinking Decisively Beats Gemini Flash on Scientific Reasoning

DeepSeek v4 Flash Thinking beats Gemini 3.1 Flash Lite on a scientific reasoning benchmark in all three rounds, including self-verification stability.

May 10, 2026 · 1 min read

Research · breaking

METR Eval: Claude Mythos Preview Hits 16-Hour Autonomous Task Horizon

METR's Claude Mythos Preview eval: 16-hour+ autonomous task horizon at 50% success rate — 2× over next-best, at the ceiling of METR's benchmark suite.

May 9, 2026 · 1 min read

Research · breaking

DeepMind AI Co-Mathematician Scores 48% on FrontierMath Tier 4

Google DeepMind's AI co-mathematician hits 48% on FrontierMath Tier 4, setting a new all-AI record on the hardest formal mathematics benchmark available.

May 9, 2026 · 1 min read

Tools · breaking

LangChain and Harvey Open-Source Legal Agent Benchmark LAB

LangChain and Harvey released LAB, an open-source benchmark for AI agents on long-horizon legal tasks covering research, analysis, and document drafting.

May 7, 2026 · 1 min read

Research · breaking

Mercor APEX-Agents Benchmark Gets Hugging Face Leaderboard for Open-Source Models

APEX-Agents benchmark for consultant/lawyer/banker-level AI work now has a Hugging Face leaderboard for open-source model evaluation.

May 1, 2026 · 1 min read

Research · breaking

GPT-5.5 Benchmarks Near Parity with Claude Mythos Preview: 71.4% vs 68.6%

GPT-5.5 scores 71.4% vs Mythos Preview's 68.6% on agentic benchmarks; GPT-5.5 also completed a 12-hour expert task in 11 minutes for $1.73.

May 1, 2026 · 1 min read

Technology · Notable

Claude Opus 4.7 Tops Coding Benchmark and Powers Six Hackathon Winners

A peer-reviewed AlphaZero benchmark and a global hackathon both confirm Claude Opus 4.7 as the current frontier in agentic coding.

April 30, 2026 · 2 min read

Tools · breaking

Memori Claims 81.95% LoCoMo Accuracy at 4.97% of Full-Context Tokens

Memori hits 81.95% LoCoMo accuracy at just 1,294 tokens/query — 67% smaller prompts than Zep, 20x cheaper than full-context — with MCP server and multi-agent attribution model.

April 27, 2026 · 1 min read

Tools · breaking

LlamaIndex Launches ParseBench: Enterprise Document OCR Benchmark on Kaggle

LlamaIndex's ParseBench launches on Kaggle: 2K enterprise pages, 167K+ test rules, 5 dimensions, with Gemini 3 Flash leading the current board.

April 24, 2026 · 1 min read

Industry

OpenAI Offers Free ChatGPT to US Clinicians, Releases HealthBench

OpenAI makes ChatGPT free for verified US medical professionals and releases HealthBench Professional, an open benchmark on which GPT-5.4 outperforms physicians.

April 23, 2026 · 2 min read
