Example evaluation
This page walks through a single MCPBench task end-to-end. You will see the task specification, a correct agent trace, two failed traces, and the final scored output.
1. Task specification
Every MCPBench task is a JSON document. The agent receives the goal, the MCP server URL, the tools it may call, and its budget.
{
  "id": "github-003",
  "server": "github",
  "category": "composition",
  "difficulty": "medium",
  "max_steps": 8,
  "goal": "Close the open issue titled 'Fix typo in README' in repo sandbox/demo-repo, then add a comment 'Fixed in main.'",
  "initial_state": {
    "repo": "sandbox/demo-repo",
    "issues": [
      { "number": 7, "title": "Fix typo in README", "state": "open" }
    ]
  },
  "available_tools": [
    "github_list_issues",
    "github_get_issue",
    "github_close_issue",
    "github_create_issue_comment"
  ],
  "success_predicate": {
    "all": [
      { "github.issueClosed": { "repo": "sandbox/demo-repo", "number": 7 } },
      { "github.commentPosted": { "repo": "sandbox/demo-repo", "issue": 7, "body_contains": "Fixed in main" } }
    ]
  }
}
- The available_tools list contains the only tools the agent may call. Calling anything outside this list counts as a hallucination.
- The success_predicate uses all([…]): both conditions must be true for the task to pass.
- max_steps: 8 means the agent has 8 tool calls before the run is aborted.
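The all semantics described above can be sketched in Python. This is an illustration only, not MCPBench's actual evaluator: the evaluate function, the two checker functions, and the layout of the final_state dictionary are all hypothetical names chosen for this example.

```python
# Hypothetical sketch of "all" predicate evaluation; not MCPBench's real evaluator.

def issue_closed(state, repo, number):
    """True if the issue exists in the final state and is closed."""
    issue = state.get(repo, {}).get(number)
    return issue is not None and issue["state"] == "closed"


def comment_posted(state, repo, issue, body_contains):
    """True if any comment on the issue contains the expected substring."""
    comments = state.get(repo, {}).get(issue, {}).get("comments", [])
    return any(body_contains in comment for comment in comments)


# Assumed mapping from predicate names in the task JSON to checker functions.
CHECKERS = {
    "github.issueClosed": issue_closed,
    "github.commentPosted": comment_posted,
}


def evaluate(predicate, state):
    """Return True only if every condition listed under "all" holds."""
    results = []
    for condition in predicate["all"]:
        # Each condition is a single-key dict: {predicate_name: arguments}.
        name, args = next(iter(condition.items()))
        results.append(CHECKERS[name](state, **args))
    return all(results)


success_predicate = {
    "all": [
        {"github.issueClosed": {"repo": "sandbox/demo-repo", "number": 7}},
        {"github.commentPosted": {"repo": "sandbox/demo-repo", "issue": 7,
                                  "body_contains": "Fixed in main"}},
    ]
}

# Assumed shape of the sandbox state after the run: issue 7 closed, one comment.
final_state = {
    "sandbox/demo-repo": {7: {"state": "closed", "comments": ["Fixed in main."]}}
}
passed = evaluate(success_predicate, final_state)
```

Under these assumptions, a run that closes the issue but posts no comment fails, because one false condition is enough to make all false.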
2. A correct trace (3 steps)
A well-behaved agent uses the fewest tool calls needed.
Step 1
→ github_list_issues({ "repo": "sandbox/demo-repo", "state": "open" })
← [{ "number": 7, "title": "Fix typo in README", "state": "open" }]
Step 2
→ github_close_issue({ "repo": "sandbox/demo-repo", "issue_number": 7 })
← { "number": 7, "state": "closed" }
Step 3
→ github_create_issue_comment({ "repo": "sandbox/demo-repo", "issue_number": 7, "body": "Fixed in main." })
← { "id": 1041, "body": "Fixed in main." }
Predicate evaluation: github.issueClosed ✓ · github.commentPosted ✓ → PASS
3. Failure — hallucinated tool
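This agent calls a tool that is not in available_tools. The call is rejected by the sandbox. The agent then recovers and closes the issue, but it stops before posting the comment, so the second predicate condition is never met.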
Step 1
→ github_search_issues({ "query": "typo README" })
← Error: tool not available in this task
Step 2
→ github_list_issues({ "repo": "sandbox/demo-repo" })
← [{ "number": 7, ... }]
Step 3
→ github_close_issue({ "repo": "sandbox/demo-repo", "issue_number": 7 })
← { "number": 7, "state": "closed" }
Step 4
Agent stops without posting a comment.
Predicate evaluation: github.issueClosed ✓ · github.commentPosted ✗ → FAIL
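The rejection in step 1 amounts to an allow-list check against available_tools. The following is a minimal sketch under that assumption; count_hallucinated_calls is a hypothetical helper, and the trace is reduced to a list of tool names for brevity.

```python
# Hypothetical allow-list check; MCPBench's own sandbox logic may differ.

AVAILABLE_TOOLS = {
    "github_list_issues",
    "github_get_issue",
    "github_close_issue",
    "github_create_issue_comment",
}


def count_hallucinated_calls(called_tools, available_tools):
    """Count calls whose tool name is not in the task's allow-list."""
    return sum(1 for tool in called_tools if tool not in available_tools)


# Tool names from the failed trace above, in call order.
called_tools = ["github_search_issues", "github_list_issues", "github_close_issue"]
hallucinated = count_hallucinated_calls(called_tools, AVAILABLE_TOOLS)
```

Here only github_search_issues falls outside the list, so the sketch counts one hallucinated call out of three.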
4. Failure — exceeds step budget
This agent loops on a planning step and runs out of budget before completing the task.
Step 1
→ github_list_issues({ "repo": "sandbox/demo-repo" })
← [{ "number": 7, "title": "Fix typo in README", "state": "open" }]
Step 2
→ github_list_issues({ "repo": "sandbox/demo-repo", "state": "open" })
← [{ "number": 7, "title": "Fix typo in README", "state": "open" }]
Step 3
→ github_get_issue({ "repo": "sandbox/demo-repo", "issue_number": 7 })
← { "number": 7, "title": "Fix typo in README", "state": "open", "body": "..." }
Step 4
→ github_list_issues({ "repo": "sandbox/demo-repo" })
← [{ "number": 7, "title": "Fix typo in README", "state": "open" }]
Step 5
→ github_get_issue({ "repo": "sandbox/demo-repo", "issue_number": 7 })
← { "number": 7, "state": "open" }
Step 6
→ github_list_issues({ "repo": "sandbox/demo-repo", "state": "open" })
← [{ "number": 7, "title": "Fix typo in README", "state": "open" }]
Step 7
→ github_get_issue({ "repo": "sandbox/demo-repo", "issue_number": 7 })
← { "number": 7, "state": "open" }
Step 8
→ github_list_issues({ "repo": "sandbox/demo-repo" })
← [{ "number": 7, "title": "Fix typo in README", "state": "open" }]
Run aborted: max_steps exceeded.
Predicate evaluation: task budget exceeded → FAIL (infra_error: false, budget_exceeded: true)
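Budget enforcement of this kind can be sketched as a counter around the agent loop. The sketch assumes the run is aborted when the agent requests another call after max_steps calls have already been made; run_with_budget and the result dictionary are hypothetical, with field names borrowed from the output above.

```python
from itertools import cycle


def run_with_budget(requested_calls, max_steps):
    """Consume tool-call requests until the agent stops or the budget is spent."""
    steps_used = 0
    for _call in requested_calls:
        if steps_used == max_steps:
            # The agent wants another call, but the budget is already spent.
            return {"steps_used": steps_used, "budget_exceeded": True,
                    "infra_error": False}
        steps_used += 1  # a real runner would execute the tool call here
    return {"steps_used": steps_used, "budget_exceeded": False,
            "infra_error": False}


# An agent that never stops alternating between two read-only calls.
looping_agent = cycle(["github_list_issues", "github_get_issue"])
result = run_with_budget(looping_agent, max_steps=8)
```

With this assumed rule, the 3-step trace from section 2 finishes well inside the budget, while the looping agent is cut off after its eighth call.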
5. What the report shows
After all tasks in a submission complete, MCPBench generates a scorecard.
| Server | Success | Efficiency | Hallucination | Recovery |
|---|---|---|---|---|
| github | 1/1 (100%) | 37.5% | 0% | — |
| Aggregate | — | — | — | — |
Enterprise signed reports include per-server breakdowns, per-difficulty breakdowns, a failure taxonomy, confidence intervals, methodology version, test-set rotation ID, and a cryptographic attestation reference.
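This page does not define how the Efficiency column is computed. One reading that reproduces the 37.5% in the github row is steps used divided by max_steps, taking the 3-step correct trace against the budget of 8. The sketch below assumes that definition; both the efficiency function and the formula are guesses, not the documented metric.

```python
# Assumed definition: efficiency = steps_used / max_steps, as a percentage.
# This is inferred from the example numbers, not taken from MCPBench itself.

def efficiency(steps_used, max_steps):
    """Return the share of the step budget that a run consumed, in percent."""
    return 100.0 * steps_used / max_steps


value = efficiency(steps_used=3, max_steps=8)
```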