The neutral benchmark for AI agents using tools.
SWE-bench tested whether models could write code. MCPBench tests whether agents can correctly drive an MCP server to get a real outcome — the unsolved capability gating production agents.
Why MCPBench
Public benchmarks saturate, get gamed, and end up in training corpora within months. MCPBench is engineered against that.
Live leaderboard
Public dev set, sorted by success rate. Filterable by MCP server and scoring axis on the full page.
| # | Agent | Vendor | Success | 7d | Efficiency | Hallucinated | Recovery |
|---|---|---|---|---|---|---|---|
| 1 | claude-opus-4-7 | Anthropic | 0.0% | | 40.2% | 1.8% | 82.0% |
| 2 | o3 | OpenAI | 0.0% | | 38.1% | 1.4% | 80.0% |
| 3 | claude-sonnet-4-6 | Anthropic | 0.0% | | 47.1% | 2.4% | 73.0% |
| 4 | gpt-4o (2025-05) | OpenAI | 0.0% | | 50.3% | 3.5% | 68.0% |
| 5 | o4-mini | OpenAI | 0.0% | | 45.1% | 2.8% | 70.0% |
| 6 | gemini-2.0-pro | Google DeepMind | 0.0% | | 54.2% | 4.2% | 60.0% |
| 7 | claude-haiku-4-5 | Anthropic | 0.0% | | 56.4% | 5.8% | 54.0% |
| 8 | gemini-2.0-flash | Google DeepMind | 0.0% | | 64.1% | 8.9% | 45.0% |
| 9 | gpt-4o-mini | OpenAI | 0.0% | | 66.8% | 10.2% | 42.0% |
| 10 | llama-3.3-70b | Meta | 0.0% | | 72.1% | 11.8% | 36.0% |
Built for teams that need evidence, not demos
Independent tool-use evaluations used across the AI development stack.
Compare tool-use reliability across model versions. Get a neutral, contamination-resistant signal before release or procurement.
Catch regressions before release. Run MCPBench in CI to detect tool-use failures before they reach production.
Generate signed third-party evidence for approval workflows, compliance reviews, and vendor assessments.
Evaluate autonomous tool-use capability as an independent capability axis. Tasks are auditable, predicates are declarative.
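To make "predicates are declarative" concrete, here is a minimal sketch of what a declarative success check could look like. It is an illustration under stated assumptions, not MCPBench's actual task schema: the names `Predicate`, `lookup`, `check`, and `task_succeeded`, the three operators, and the idea of modeling the final server state as a nested dictionary are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


# Hypothetical predicate format: each predicate is plain data, not code,
# so a reviewer can audit what "success" means without running anything.
@dataclass(frozen=True)
class Predicate:
    path: str          # dotted path into the final state, e.g. "issue.status"
    op: str            # one of "eq", "contains", "exists"
    value: Any = None  # expected value (unused for "exists")


def lookup(state: dict, path: str) -> tuple[bool, Any]:
    """Walk a dotted path through nested dicts; return (found, value)."""
    current: Any = state
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return False, None
        current = current[key]
    return True, current


def check(pred: Predicate, state: dict) -> bool:
    """Evaluate one predicate against the final state."""
    found, actual = lookup(state, pred.path)
    if pred.op == "exists":
        return found
    if not found:
        return False
    if pred.op == "eq":
        return actual == pred.value
    if pred.op == "contains":
        return pred.value in actual
    raise ValueError(f"unknown op: {pred.op}")


def task_succeeded(predicates: list[Predicate], state: dict) -> bool:
    """A task counts as solved only if every predicate holds."""
    return all(check(p, state) for p in predicates)


# Example: the assumed final state after an agent drives an issue tracker.
final_state = {"issue": {"status": "closed", "labels": ["bug", "triaged"]}}
predicates = [
    Predicate("issue.status", "eq", "closed"),
    Predicate("issue.labels", "contains", "bug"),
]
solved = task_succeeded(predicates, final_state)
```

Because the predicates are data rather than scoring code, the same list can be stored alongside a task, diffed in review, and evaluated identically for every agent.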