Compatool
MCPBench v0.1 · 200 tasks · monthly rotation

The neutral benchmark for AI agents using tools.

SWE-bench tested whether models could write code. MCPBench tests whether agents can correctly drive an MCP server to get a real outcome — the unsolved capability gating production agents.

Why MCPBench

Public benchmarks saturate, get gamed, and end up in training corpora within months. MCPBench is engineered to resist all three.

Contamination-resistant

The public dev set is for development only. The private test set rotates monthly under cryptographic attestation. Runs are sandboxed so that task content cannot be phoned home.

10 MCP servers

Filesystem, GitHub, Postgres, Slack, Gmail, Browser, Calendar, Linear, Stripe, Notion. Real protocols agents will use in production, not toy environments.

4 scoring axes

Success rate, tool-call efficiency, hallucinated-tool rate, recovery-from-error rate. The axes are reported separately because high success with high hallucination is its own signal.
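
As a rough illustration of how four axes like these could be computed from run logs, here is a minimal Python sketch. MCPBench's exact formulas are not given on this page, so everything in it is an assumption: the `ToolCall` and `TaskRun` record shapes, the `reference_calls` field, and in particular the efficiency definition (reference calls divided by actual calls on solved tasks) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str      # name of the tool the agent tried to call
    errored: bool  # True if the server answered with an error


@dataclass
class TaskRun:
    succeeded: bool            # did the task's success predicate hold?
    calls: list[ToolCall]      # tool calls in the order they were made
    available_tools: set[str]  # tools the MCP server actually advertised
    reference_calls: int       # hypothetical: calls used by a reference solution


def ratio(numerator: float, denominator: float) -> float:
    """Safe division: an empty denominator yields 0.0 instead of raising."""
    return numerator / denominator if denominator else 0.0


def score(runs: list[TaskRun]) -> dict[str, float]:
    """Compute four illustrative axes. All definitions here are assumptions."""
    all_calls = [(r, c) for r in runs for c in r.calls]
    solved = [r for r in runs if r.succeeded]
    with_errors = [r for r in runs if any(c.errored for c in r.calls)]
    return {
        # Fraction of tasks whose success predicate held.
        "success": ratio(len(solved), len(runs)),
        # Assumed: reference calls / actual calls, over solved tasks only.
        "efficiency": ratio(
            sum(r.reference_calls for r in solved),
            sum(len(r.calls) for r in solved),
        ),
        # Fraction of calls naming a tool the server never advertised.
        "hallucinated": ratio(
            sum(1 for r, c in all_calls if c.tool not in r.available_tools),
            len(all_calls),
        ),
        # Of the tasks that hit at least one error, the fraction still solved.
        "recovery": ratio(
            sum(1 for r in with_errors if r.succeeded),
            len(with_errors),
        ),
    }
```

Keeping the axes as separate keys rather than folding them into one number is what lets a high success rate sit next to a high hallucinated-tool rate without one masking the other.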

Live leaderboard

Public dev set, sorted by success rate. Filterable by MCP server and scoring axis on the full page.

#   Agent              Provider          Success 7d  Efficiency  Hallucinated  Recovery
1   claude-opus-4-7    Anthropic         0.0%        40.2%       1.8%          82.0%
2   o3                 OpenAI            0.0%        38.1%       1.4%          80.0%
3   claude-sonnet-4-6  Anthropic         0.0%        47.1%       2.4%          73.0%
4   gpt-4o (2025-05)   OpenAI            0.0%        50.3%       3.5%          68.0%
5   o4-mini            OpenAI            0.0%        45.1%       2.8%          70.0%
6   gemini-2.0-pro     Google DeepMind   0.0%        54.2%       4.2%          60.0%
7   claude-haiku-4-5   Anthropic         0.0%        56.4%       5.8%          54.0%
8   gemini-2.0-flash   Google DeepMind   0.0%        64.1%       8.9%          45.0%
9   gpt-4o-mini        OpenAI            0.0%        66.8%       10.2%         42.0%
10  llama-3.3-70b      Meta              0.0%        72.1%       11.8%         36.0%

Built for teams that need evidence, not demos

Independent tool-use evaluations used across the AI development stack.

Frontier AI labs

Compare tool-use reliability across model versions. Get a neutral, contamination-resistant signal before release or procurement.

Agent product teams

Catch regressions before release. Run MCPBench in CI to detect tool-use failures before they reach production.
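
One way a CI regression gate could look, as a sketch only: this page does not show MCPBench's CLI or output format, so the score dictionaries, the axis names, and the `regressions` helper below are all hypothetical. Efficiency is left out because its direction (higher or lower is better) is not stated here.

```python
# Assumed direction of each axis: True means a higher score is better.
HIGHER_IS_BETTER = {"success": True, "recovery": True, "hallucinated": False}


def regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.01,
) -> list[str]:
    """Return the axes on which `current` is worse than `baseline` by more than `tolerance`."""
    failed = []
    for axis, higher_is_better in HIGHER_IS_BETTER.items():
        delta = current[axis] - baseline[axis]
        # Normalise so that a positive value always means "got worse".
        worse_by = -delta if higher_is_better else delta
        if worse_by > tolerance:
            failed.append(axis)
    return failed
```

In a pipeline, the build step would exit non-zero whenever `regressions(...)` returns a non-empty list, so a drop in success or a rise in hallucinated tool calls blocks the release.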

Enterprise AI governance

Generate signed third-party evidence for approval workflows, compliance reviews, and vendor assessments.

AI safety institutes

Evaluate autonomous tool use as an independent capability axis. Tasks are auditable and predicates are declarative.
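
To make "declarative" concrete, here is a minimal sketch of a success predicate expressed as plain data plus a tiny evaluator. The schema (`all`, `any`, `path`, `equals`, `contains`) and the example task state are hypothetical and are not MCPBench's actual predicate format.

```python
from typing import Any

# Hypothetical predicate: plain data, so a reviewer can audit it without running code.
PREDICATE = {
    "all": [
        {"path": "issue.state", "equals": "closed"},
        {"path": "issue.labels", "contains": "bug"},
    ]
}


def lookup(state: dict[str, Any], path: str) -> Any:
    """Follow a dotted path such as 'issue.state' through nested dicts."""
    value: Any = state
    for key in path.split("."):
        value = value[key]
    return value


def holds(predicate: dict[str, Any], state: dict[str, Any]) -> bool:
    """Evaluate a predicate against the final environment state."""
    if "all" in predicate:
        return all(holds(p, state) for p in predicate["all"])
    if "any" in predicate:
        return any(holds(p, state) for p in predicate["any"])
    value = lookup(state, predicate["path"])
    if "equals" in predicate:
        return value == predicate["equals"]
    if "contains" in predicate:
        return predicate["contains"] in value
    raise ValueError(f"unknown predicate: {predicate!r}")
```

Because the predicate is data, not code, an auditor can check what "success" means for a task by reading it, and the same evaluator applies to every task.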
