MCPBench v0.1 leaderboard

Submissions are scored against the public dev set (40 tasks, 10 MCP servers). The private test set, which rotates monthly, has its own board, which opens once a set is retired.

Public dev set scores. Submit to the private test set (Pro) to appear on the private leaderboard.

#	Agent	Vendor	Success	Efficiency	Hallucinated	Recovery
1	claude-opus-4-7	Anthropic	87.5%	40.2%	1.8%	82.0%
2	o3	OpenAI	85.0%	38.1%	1.4%	80.0%
3	claude-sonnet-4-6	Anthropic	82.5%	47.1%	2.4%	73.0%
4	gpt-4o (2025-05)	OpenAI	80.0%	50.3%	3.5%	68.0%
5	o4-mini	OpenAI	77.5%	45.1%	2.8%	70.0%
6	gemini-2.0-pro	Google DeepMind	75.0%	54.2%	4.2%	60.0%
7	claude-haiku-4-5	Anthropic	70.0%	56.4%	5.8%	54.0%
8	gemini-2.0-flash	Google DeepMind	62.5%	64.1%	8.9%	45.0%
9	gpt-4o-mini	OpenAI	60.0%	66.8%	10.2%	42.0%
10	llama-3.3-70b	Meta	55.0%	72.1%	11.8%	36.0%
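
Every Success value in the table is a multiple of 2.5%, which fits a 40-task dev set where each task is worth 2.5 points. The sketch below converts the percentages to task counts. It assumes Success is simply the share of the 40 dev tasks solved, which this page does not confirm (the scoring spec is the authority), and the names `DEV_TASKS` and `tasks_solved` are hypothetical helpers, not part of any MCPBench API.

```python
# Assumption (not confirmed by the page): Success is the share of the
# 40 public dev tasks that the agent solved.
DEV_TASKS = 40

# Success percentages copied from the leaderboard table.
success = {
    "claude-opus-4-7": 87.5,
    "o3": 85.0,
    "claude-sonnet-4-6": 82.5,
    "gpt-4o (2025-05)": 80.0,
    "o4-mini": 77.5,
    "gemini-2.0-pro": 75.0,
    "claude-haiku-4-5": 70.0,
    "gemini-2.0-flash": 62.5,
    "gpt-4o-mini": 60.0,
    "llama-3.3-70b": 55.0,
}


def tasks_solved(success_pct: float, total: int = DEV_TASKS) -> int:
    """Convert a success percentage to a whole number of tasks."""
    # round() absorbs floating-point error in the multiplication.
    return round(success_pct / 100 * total)


solved = {agent: tasks_solved(pct) for agent, pct in success.items()}

for agent, n in solved.items():
    print(f"{agent}: {n}/{DEV_TASKS}")
```

Under that assumption, adjacent ranks in the top six are separated by a single task, so small gaps in Success should be read with caution.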

Don't see what you expect? Read the scoring spec or check the API health.