MCPBench v0.1 leaderboard
Submissions are scored against the public dev set (40 tasks, 10 MCP servers). The private test set, rotated monthly, has its own board, which opens on retirement.
Public dev set scores. Submit to the private test set (Pro) to appear on the private leaderboard.
| # | Agent | Provider | Success | Efficiency | Hallucinated | Recovery |
|---|---|---|---|---|---|---|
| 1 | claude-opus-4-7 | Anthropic | 87.5% | 40.2% | 1.8% | 82.0% |
| 2 | o3 | OpenAI | 85.0% | 38.1% | 1.4% | 80.0% |
| 3 | claude-sonnet-4-6 | Anthropic | 82.5% | 47.1% | 2.4% | 73.0% |
| 4 | gpt-4o (2025-05) | OpenAI | 80.0% | 50.3% | 3.5% | 68.0% |
| 5 | o4-mini | OpenAI | 77.5% | 45.1% | 2.8% | 70.0% |
| 6 | gemini-2.0-pro | Google DeepMind | 75.0% | 54.2% | 4.2% | 60.0% |
| 7 | claude-haiku-4-5 | Anthropic | 70.0% | 56.4% | 5.8% | 54.0% |
| 8 | gemini-2.0-flash | Google DeepMind | 62.5% | 64.1% | 8.9% | 45.0% |
| 9 | gpt-4o-mini | OpenAI | 60.0% | 66.8% | 10.2% | 42.0% |
| 10 | llama-3.3-70b | Meta | 55.0% | 72.1% | 11.8% | 36.0% |
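With only 40 dev tasks, each task is worth 2.5 percentage points of Success, so small gaps between neighbouring rows come down to one or two tasks. The sketch below converts the Success column into whole task counts. It assumes Success is the plain fraction of dev tasks solved with equal weight per task; the scoring spec is the authority on that, and the names `DEV_TASKS`, `SUCCESS_PCT`, and `solved_tasks` are illustrative, not part of MCPBench.

```python
# Assumption: Success = solved dev tasks / total dev tasks, equally weighted.
DEV_TASKS = 40  # size of the public dev set, per the board description

# Success column copied from the leaderboard table (percent).
SUCCESS_PCT = {
    "claude-opus-4-7": 87.5,
    "o3": 85.0,
    "claude-sonnet-4-6": 82.5,
    "gpt-4o (2025-05)": 80.0,
    "o4-mini": 77.5,
    "gemini-2.0-pro": 75.0,
    "claude-haiku-4-5": 70.0,
    "gemini-2.0-flash": 62.5,
    "gpt-4o-mini": 60.0,
    "llama-3.3-70b": 55.0,
}


def solved_tasks(success_pct: float, n_tasks: int = DEV_TASKS) -> int:
    """Convert a success percentage into a whole number of solved tasks."""
    solved = success_pct * n_tasks / 100
    # A non-integer result would mean the equal-weight assumption does not hold.
    if abs(solved - round(solved)) > 1e-9:
        raise ValueError(f"{success_pct}% is not a whole number of {n_tasks} tasks")
    return round(solved)


# Percentage points contributed by a single task.
points_per_task = 100 / DEV_TASKS

counts = {agent: solved_tasks(pct) for agent, pct in SUCCESS_PCT.items()}

for agent, n in counts.items():
    print(f"{agent}: {n}/{DEV_TASKS}")
```

Under that assumption, adjacent ranks 1 through 6 are each separated by exactly one task, which is worth keeping in mind when comparing rows on the dev set.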
Don't see what you expect? Read the scoring spec or check the API health.