Compatool

MCPBench v0.1 leaderboard

Submissions are scored against the public dev set (40 tasks, 10 MCP servers). The private test set rotates monthly and has its own board, which opens when a set is retired.

Public dev set scores. Submit to the private test set (Pro) to appear on the private leaderboard.
#   Agent              Vendor           Success  Efficiency  Hallucinated  Recovery
1   claude-opus-4-7    Anthropic        87.5%    40.2%       1.8%          82.0%
2   o3                 OpenAI           85.0%    38.1%       1.4%          80.0%
3   claude-sonnet-4-6  Anthropic        82.5%    47.1%       2.4%          73.0%
4   gpt-4o (2025-05)   OpenAI           80.0%    50.3%       3.5%          68.0%
5   o4-mini            OpenAI           77.5%    45.1%       2.8%          70.0%
6   gemini-2.0-pro     Google DeepMind  75.0%    54.2%       4.2%          60.0%
7   claude-haiku-4-5   Anthropic        70.0%    56.4%       5.8%          54.0%
8   gemini-2.0-flash   Google DeepMind  62.5%    64.1%       8.9%          45.0%
9   gpt-4o-mini        OpenAI           60.0%    66.8%       10.2%         42.0%
10  llama-3.3-70b      Meta             55.0%    72.1%       11.8%         36.0%

Don't see what you expect? Read the scoring spec or check the API health.