MCPBench v0.1 · 200 tasks · monthly rotation

The neutral benchmark for AI agents using tools.

SWE-bench tested whether models could write code. MCPBench tests whether agents can correctly drive an MCP server to get a real outcome, the unsolved capability that gates production agents.

See leaderboard · Read the spec · Submit an agent

Why MCPBench

Public benchmarks saturate, get gamed, and end up in training corpora within months. MCPBench is engineered against that.

Contamination-resistant

The public dev set is for development only. The private test set rotates monthly under cryptographic attestation. Runs are sandboxed so that task content cannot be sent out of the evaluation environment.

10 MCP servers

Filesystem, GitHub, Postgres, Slack, Gmail, Browser, Calendar, Linear, Stripe, Notion. These are real protocols agents will use in production, not toy environments.

4 scoring axes

Success rate, tool-call efficiency, hallucinated-tool rate, recovery-from-error rate. High success with high hallucination is its own signal.
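
As a rough illustration of how the four axes could be computed from run logs, here is a minimal Python sketch. The record types (ToolCall, TaskRun), the reference_calls field, and every formula below are assumptions made for this example; the page does not define the axes precisely, so the spec remains the authority on the actual definitions.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    tool: str              # name of the tool the agent tried to call
    errored: bool = False  # True if the server returned an error


@dataclass
class TaskRun:
    succeeded: bool             # did the task's success check pass?
    reference_calls: int        # hypothetical: fewest calls known to solve the task
    available_tools: frozenset  # tools the MCP server actually advertised
    calls: list = field(default_factory=list)


def score(runs):
    """Aggregate four illustrative metrics over a list of TaskRun records."""
    if not runs:
        raise ValueError("need at least one run")
    total_calls = sum(len(r.calls) for r in runs)
    # A call is counted as hallucinated if it names a tool the server never advertised.
    hallucinated = sum(
        1 for r in runs for c in r.calls if c.tool not in r.available_tools
    )
    # Recovery is measured only over runs that hit at least one tool error.
    with_errors = [r for r in runs if any(c.errored for c in r.calls)]
    return {
        "success_rate": sum(r.succeeded for r in runs) / len(runs),
        # Assumed definition: reference calls divided by calls actually made, capped at 1.
        "tool_call_efficiency": (
            min(1.0, sum(r.reference_calls for r in runs) / total_calls)
            if total_calls else 0.0
        ),
        "hallucinated_tool_rate": hallucinated / total_calls if total_calls else 0.0,
        "recovery_rate": (
            sum(r.succeeded for r in with_errors) / len(with_errors)
            if with_errors else 0.0
        ),
    }


example_runs = [
    TaskRun(
        succeeded=True,
        reference_calls=2,
        available_tools=frozenset({"read_file", "write_file"}),
        calls=[
            ToolCall("read_file"),
            ToolCall("write_file", errored=True),
            ToolCall("write_file"),
        ],
    ),
    TaskRun(
        succeeded=False,
        reference_calls=1,
        available_tools=frozenset({"query"}),
        calls=[ToolCall("run_sql")],  # not advertised by the server
    ),
]
example_scores = score(example_runs)
```

Reporting the axes separately is what lets a result such as high success with high hallucination show up at all; a single blended score would hide it.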

Live leaderboard

Public dev set, sorted by success rate. Filterable by MCP server and scoring axis on the full page.

View all

#	Agent	Vendor	Success	Efficiency	Hallucinated	Recovery
1	claude-opus-4-7	Anthropic	0.0%	40.2%	1.8%	82.0%
2	o3	OpenAI	0.0%	38.1%	1.4%	80.0%
3	claude-sonnet-4-6	Anthropic	0.0%	47.1%	2.4%	73.0%
4	gpt-4o (2025-05)	OpenAI	0.0%	50.3%	3.5%	68.0%
5	o4-mini	OpenAI	0.0%	45.1%	2.8%	70.0%
6	gemini-2.0-pro	Google DeepMind	0.0%	54.2%	4.2%	60.0%
7	claude-haiku-4-5	Anthropic	0.0%	56.4%	5.8%	54.0%
8	gemini-2.0-flash	Google DeepMind	0.0%	64.1%	8.9%	45.0%
9	gpt-4o-mini	OpenAI	0.0%	66.8%	10.2%	42.0%
10	llama-3.3-70b	Meta	0.0%	72.1%	11.8%	36.0%

Built for teams that need evidence, not demos

Independent tool-use evaluations used across the AI development stack.

Frontier AI labs

Compare tool-use reliability across model versions. Get a neutral, contamination-resistant signal before release or procurement.

See pricing

Agent product teams

Catch regressions before release. Run MCPBench in CI to detect tool-use failures before they reach production.

Read the docs
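
One way a CI job might act on benchmark scores is to compare them against a stored baseline and fail the build when a metric gets worse. The sketch below assumes the scores are already available as plain dictionaries; the metric names, the tolerance value, and the find_regressions helper are hypothetical and are not part of any documented MCPBench interface. Tool-call efficiency is left out because the page does not say which direction counts as better.

```python
# Direction of improvement for each metric (an assumption for this sketch).
HIGHER_IS_BETTER = {
    "success_rate": True,
    "recovery_rate": True,
    "hallucinated_tool_rate": False,
}


def find_regressions(baseline, current, tolerance=0.01):
    """Return the metrics that got worse than the baseline by more than `tolerance`."""
    regressions = []
    for metric, higher_is_better in HIGHER_IS_BETTER.items():
        delta = current[metric] - baseline[metric]
        # Normalize so that a positive value always means "got worse".
        worsened = -delta if higher_is_better else delta
        if worsened > tolerance:
            regressions.append(metric)
    return regressions


baseline_scores = {
    "success_rate": 0.60,
    "recovery_rate": 0.70,
    "hallucinated_tool_rate": 0.02,
}
current_scores = {
    "success_rate": 0.55,
    "recovery_rate": 0.70,
    "hallucinated_tool_rate": 0.05,
}
regressed = find_regressions(baseline_scores, current_scores)
```

In a pipeline, a non-empty result would typically be turned into a non-zero exit code so the release is blocked until the regression is understood.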

Enterprise AI governance

Generate signed third-party evidence for approval workflows, compliance reviews, and vendor assessments.

Enterprise plans

AI safety institutes

Evaluate autonomous tool use as an independent capability axis. Tasks are auditable, and predicates are declarative.

View methodology
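
To make "declarative predicates" concrete, here is a minimal Python sketch in which a task's success condition is plain data and a small evaluator checks it against the final state of the environment. The predicate vocabulary (all, any, not, equals, contains, exists), the dotted-path convention, and the example state are all invented for illustration; the actual schema is whatever the methodology specifies.

```python
_MISSING = object()  # sentinel for paths that do not exist in the state


def lookup(state, path):
    """Follow a dotted path such as 'issue.state' through nested dicts."""
    node = state
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return _MISSING
        node = node[key]
    return node


def check(predicate, state):
    """Evaluate a predicate (a nested dict) against a state dict."""
    op = predicate["op"]
    if op == "all":
        return all(check(p, state) for p in predicate["of"])
    if op == "any":
        return any(check(p, state) for p in predicate["of"])
    if op == "not":
        return not check(predicate["of"], state)
    if op == "exists":
        return lookup(state, predicate["path"]) is not _MISSING
    if op == "equals":
        return lookup(state, predicate["path"]) == predicate["value"]
    if op == "contains":
        value = lookup(state, predicate["path"])
        return isinstance(value, (list, tuple, set, str)) and predicate["value"] in value
    raise ValueError(f"unknown predicate op: {op!r}")


# Hypothetical task: "close the issue and keep its 'bug' label".
issue_closed = {
    "op": "all",
    "of": [
        {"op": "equals", "path": "issue.state", "value": "closed"},
        {"op": "contains", "path": "issue.labels", "value": "bug"},
    ],
}
final_state = {"issue": {"state": "closed", "labels": ["bug", "p1"]}}
passed = check(issue_closed, final_state)
```

Because the predicate is data rather than code, an auditor can read exactly what counts as success without running or trusting a custom grading script.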