MM-ToolBench: Claude Opus 4.6 Achieves Only 32% Task Success vs 94% Human Baseline

MM-ToolBench is a new omni-modal tool-use benchmark with 100 executable tasks, 27 MCP servers, and 324 tools across two domains, Customer Service and Intelligent Creation. It scores agents through closed-loop multimodal verification, inspecting the artifacts they actually produce rather than relying on self-reporting. Claude Opus 4.6 achieves 32.0% task success against a 94.0% human baseline.

1 min read | agenticonsult Intelligence

MM-ToolBench introduces closed-loop multimodal verification as the evaluation standard: agents execute tasks, inspect the resulting artifacts, and self-correct before being scored, so results reflect actual artifact outcomes rather than self-reporting. Across 100 tasks spanning Customer Service and Intelligent Creation, 27 MCP servers, and 324 tools, Claude Opus 4.6 achieves 32.0% task success versus a 94.0% human baseline.
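The execute, inspect, self-correct, score cycle described above can be sketched in a few lines. This is only an illustration of the general pattern, not the benchmark's actual harness: the function name `closed_loop_episode`, the `execute`/`inspect`/`score` callables, the `max_rounds` limit, and the toy caption task are all assumptions made for the example.

```python
from typing import Any, Callable, Optional

def closed_loop_episode(
    execute: Callable[[Optional[str]], Any],   # hypothetical: runs the task, optionally using feedback
    inspect: Callable[[Any], Optional[str]],   # hypothetical: returns None if the artifact looks right, else feedback
    score: Callable[[Any], bool],              # hypothetical: final check on the artifact itself
    max_rounds: int = 3,                       # assumed retry budget
) -> bool:
    """Run one task: execute, inspect the artifact, retry on feedback, then score."""
    feedback: Optional[str] = None
    artifact: Any = None
    for _ in range(max_rounds):
        artifact = execute(feedback)
        feedback = inspect(artifact)
        if feedback is None:  # the agent sees no remaining problem
            break
    # The score depends on the final artifact, not on what the agent claims.
    return score(artifact)

# Toy task: the first attempt omits a caption; the retry fixes it.
def execute(feedback: Optional[str]) -> dict:
    return {"caption": "Q3 summary"} if feedback else {"caption": ""}

def inspect(artifact: dict) -> Optional[str]:
    return None if artifact["caption"] else "caption is empty"

def score(artifact: dict) -> bool:
    return artifact["caption"] == "Q3 summary"

with_retry = closed_loop_episode(execute, inspect, score)
single_shot = closed_loop_episode(execute, inspect, score, max_rounds=1)
```

In this toy setup the task passes only when the loop is allowed a second round, which is the point of scoring after inspection and self-correction rather than on the first output.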

Why It Matters

The 62-percentage-point gap between the human baseline (94.0%) and Claude Opus 4.6 (32.0%), measured under closed-loop verification, is a more honest measure of current agent capability than leaderboard self-reports. The gap quantifies the headroom remaining for tool-using agents, and the benchmark gives MCP toolchain evaluation a concrete external reference point.
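As a quick check on the headline numbers, the gap can be expressed both in absolute percentage points and relative to the human rate. The two input figures come from the article; the variable names are illustrative.

```python
# Figures as reported in the article; variable names are illustrative.
human_baseline = 94.0   # percent task success, human baseline
model_success = 32.0    # percent task success, Claude Opus 4.6

gap_points = human_baseline - model_success          # absolute gap in percentage points
relative_to_human = model_success / human_baseline   # model success as a fraction of the human rate

print(f"gap: {gap_points:.1f} percentage points")
print(f"model reaches {relative_to_human:.0%} of the human success rate")
```

The first line prints a gap of 62.0 percentage points; the second shows the model at roughly a third of the human success rate.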

Primary source

arXiv / Liu et al.

#claude #benchmarks

This breaking-news item was assembled from the cited primary source with AI assistance. It is intended for rapid situational awareness — refer to the original publication for the definitive statement.
