MM-ToolBench: Claude Opus 4.6 Achieves Only 32% Task Success vs 94% Human Baseline
MM-ToolBench introduces closed-loop multimodal verification as the evaluation standard: agents execute tasks, inspect the resulting artifacts, and self-correct before being scored, so the score reflects the artifacts actually produced rather than the agent's own report of success. Across 100 tasks in Customer Service and Intelligent Creation, backed by 27 MCP servers and 324 tools, Claude Opus 4.6 achieves 32.0% task success versus a 94.0% human baseline.
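The execute, inspect, self-correct, score loop can be sketched in a few lines of Python. This is only an illustration of the general idea, not MM-ToolBench's actual harness: the `Task` fields, `run_closed_loop`, `success_rate`, and the retry limit are all hypothetical names and choices made for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    """Hypothetical task record; the real benchmark's schema may differ."""
    name: str
    execute: Callable[[int], str]   # produces an artifact for a given attempt number
    inspect: Callable[[str], bool]  # the agent's own check of its artifact
    score: Callable[[str], bool]    # the benchmark's check of the final artifact


def run_closed_loop(task: Task, max_attempts: int = 3) -> bool:
    """Execute, inspect the artifact, retry if needed, then score the artifact."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    artifact = ""
    for attempt in range(max_attempts):
        artifact = task.execute(attempt)
        if task.inspect(artifact):
            break  # the agent believes the artifact is correct and stops retrying
    # The score depends on the final artifact, not on what the agent believes.
    return task.score(artifact)


def success_rate(tasks: List[Task], max_attempts: int = 3) -> float:
    """Percentage of tasks whose final artifact passes the benchmark's check."""
    if not tasks:
        raise ValueError("tasks must not be empty")
    passed = sum(run_closed_loop(t, max_attempts) for t in tasks)
    return 100.0 * passed / len(tasks)
```

Keeping `inspect` and `score` separate is the point of the sketch: an agent whose inspection wrongly passes a bad artifact still fails the task, which is how artifact-based scoring differs from self-reporting.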
Why It Matters
The 62-percentage-point gap (94.0% human versus 32.0% agent) under closed-loop verification is a more honest measure of current agent capability than leaderboards built on self-reported success. It quantifies the headroom remaining for tool-using agents and gives MCP toolchain evaluation a concrete external benchmark.