Compatool

Example evaluation

This page walks through a single MCPBench task end-to-end. You will see the task specification, a correct agent trace, two failed traces, and the final scored output.

1. Task specification

Every MCPBench task is a JSON document. The agent receives the goal, the MCP server URL, the tools it may call, and its budget.

{
  "id": "github-003",
  "server": "github",
  "category": "composition",
  "difficulty": "medium",
  "max_steps": 8,
  "goal": "Close the open issue titled 'Fix typo in README' in repo sandbox/demo-repo, then add a comment 'Fixed in main.'",
  "initial_state": {
    "repo": "sandbox/demo-repo",
    "issues": [
      { "number": 7, "title": "Fix typo in README", "state": "open" }
    ]
  },
  "available_tools": [
    "github_list_issues",
    "github_get_issue",
    "github_close_issue",
    "github_create_issue_comment"
  ],
  "success_predicate": {
    "all": [
      { "github.issueClosed": { "repo": "sandbox/demo-repo", "number": 7 } },
      { "github.commentPosted": { "repo": "sandbox/demo-repo", "issue": 7, "body_contains": "Fixed in main" } }
    ]
  }
}
  • The available_tools list names the only tools the agent may call. Calling anything outside this list counts as a hallucination.
  • The success_predicate wraps its conditions in "all", so both conditions must be true for the task to pass.
  • max_steps: 8 means the agent has 8 tool calls before the run is aborted.
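To make the "all" semantics concrete, here is a minimal sketch of how such a predicate could be checked against the sandbox's final state. The predicate names and arguments come from the task above; the state layout, the CHECKS registry, and the function names are assumptions made for illustration, not MCPBench's actual evaluator.

```python
# Sketch only: the state layout and helper names below are assumptions,
# not MCPBench's actual evaluator.

def issue_closed(state, repo, number):
    """True if the given issue exists in the repo and is closed."""
    issue = state.get(repo, {}).get(number)
    return issue is not None and issue["state"] == "closed"


def comment_posted(state, repo, issue, body_contains):
    """True if any comment on the issue contains the expected text."""
    record = state.get(repo, {}).get(issue)
    if record is None:
        return False
    return any(body_contains in body for body in record["comments"])


# Maps predicate names from the task JSON to (hypothetical) check functions.
CHECKS = {
    "github.issueClosed": issue_closed,
    "github.commentPosted": comment_posted,
}


def evaluate(success_predicate, state):
    """Pass only if every condition listed under "all" holds."""
    for condition in success_predicate["all"]:
        # Each condition is a single-key object: {check_name: arguments}.
        name, args = next(iter(condition.items()))
        if not CHECKS[name](state, **args):
            return False
    return True


success_predicate = {
    "all": [
        {"github.issueClosed": {"repo": "sandbox/demo-repo", "number": 7}},
        {"github.commentPosted": {"repo": "sandbox/demo-repo", "issue": 7,
                                  "body_contains": "Fixed in main"}},
    ]
}

# Assumed final state: repo -> issue number -> issue record.
closed_and_commented = {
    "sandbox/demo-repo": {7: {"state": "closed", "comments": ["Fixed in main."]}}
}
closed_only = {
    "sandbox/demo-repo": {7: {"state": "closed", "comments": []}}
}

print(evaluate(success_predicate, closed_and_commented))  # True
print(evaluate(success_predicate, closed_only))           # False
```

The second state closes the issue but has no comment, so only one of the two conditions holds and the task would not pass.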

2. A correct trace (3 steps)

A well-behaved agent uses the fewest tool calls needed to satisfy the goal.

Step 1

→ github_list_issues({ "repo": "sandbox/demo-repo", "state": "open" })
← [{ "number": 7, "title": "Fix typo in README", "state": "open" }]

Step 2

→ github_close_issue({ "repo": "sandbox/demo-repo", "issue_number": 7 })
← { "number": 7, "state": "closed" }

Step 3

→ github_create_issue_comment({ "repo": "sandbox/demo-repo", "issue_number": 7, "body": "Fixed in main." })
← { "id": 1041, "body": "Fixed in main." }

Predicate evaluation: github.issueClosed ✓ · github.commentPosted ✓ → PASS

Success: ✓ (1/1)
Tool-call efficiency: 3/8 = 37.5% (lower is better)
Hallucinated-tool rate: 0/3 = 0%
Recovery-from-error rate: N/A (no error encountered)
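The arithmetic behind the two ratios can be reproduced in a few lines. This is a sketch under the assumption that efficiency is tool calls divided by max_steps and that the hallucinated-tool rate is disallowed calls divided by total calls, which is what the figures above suggest; score_trace is a hypothetical helper, not part of MCPBench.

```python
def score_trace(tool_calls, available_tools, max_steps):
    """Compute the two ratio metrics for one run (hypothetical helper)."""
    total = len(tool_calls)
    # A call is counted as hallucinated when its tool is not in available_tools.
    hallucinated = sum(1 for name in tool_calls if name not in available_tools)
    return {
        "efficiency": total / max_steps,
        "hallucination_rate": hallucinated / total if total else 0.0,
    }


available_tools = {
    "github_list_issues",
    "github_get_issue",
    "github_close_issue",
    "github_create_issue_comment",
}

# Tool names from the three-step trace above.
correct_trace = [
    "github_list_issues",
    "github_close_issue",
    "github_create_issue_comment",
]

metrics = score_trace(correct_trace, available_tools, max_steps=8)
print(f"{metrics['efficiency']:.1%}")          # 37.5%
print(f"{metrics['hallucination_rate']:.0%}")  # 0%
```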

3. Failure — hallucinated tool

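This agent calls a tool that is not in available_tools, and the sandbox rejects the call. The agent recovers and closes the issue, but the run still fails because it stops before posting the required comment.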

Step 1

→ github_search_issues({ "query": "typo README" })
← Error: tool not available in this task

Step 2

→ github_list_issues({ "repo": "sandbox/demo-repo" })
← [{ "number": 7, ... }]

Step 3

→ github_close_issue({ "repo": "sandbox/demo-repo", "issue_number": 7 })
← { "number": 7, "state": "closed" }

Step 4

Agent stops without posting a comment.

Predicate evaluation: github.issueClosed ✓ · github.commentPosted ✗ → FAIL

Success: ✗ (0/1)
Hallucinated-tool rate: 1/3 = 33%
Recovery-from-error rate: N/A
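A sandbox-side guard that produces this behaviour might look like the following sketch. The dispatch function, the call log format, and the stub handlers are all hypothetical; only the tool names and the error message come from the trace above.

```python
AVAILABLE_TOOLS = {
    "github_list_issues",
    "github_get_issue",
    "github_close_issue",
    "github_create_issue_comment",
}


def dispatch(tool_name, arguments, handlers, call_log):
    """Run a tool call, rejecting any tool outside AVAILABLE_TOOLS."""
    allowed = tool_name in AVAILABLE_TOOLS
    # Every attempt is logged, including rejected ones.
    call_log.append({"tool": tool_name, "hallucinated": not allowed})
    if not allowed:
        return {"error": "tool not available in this task"}
    return handlers[tool_name](arguments)


# Stub handlers standing in for the real MCP server (assumption).
handlers = {
    "github_list_issues": lambda args: [{"number": 7, "state": "open"}],
    "github_close_issue": lambda args: {"number": args["issue_number"],
                                        "state": "closed"},
}

call_log = []
first = dispatch("github_search_issues", {"query": "typo README"},
                 handlers, call_log)
dispatch("github_list_issues", {"repo": "sandbox/demo-repo"}, handlers, call_log)
dispatch("github_close_issue",
         {"repo": "sandbox/demo-repo", "issue_number": 7}, handlers, call_log)

hallucinated = sum(entry["hallucinated"] for entry in call_log)
print(first["error"])                        # tool not available in this task
print(f"{hallucinated}/{len(call_log)}")     # 1/3
```

Logging the rejected call alongside the successful ones is what lets the rate be computed over all three attempts rather than only the two that reached the server.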

4. Failure — exceeds step budget

This agent loops on read-only calls (github_list_issues and github_get_issue) without ever closing the issue, and runs out of budget before completing the task.

Step 1

→ github_list_issues({ "repo": "sandbox/demo-repo" })
← [{ "number": 7, "title": "Fix typo in README", "state": "open" }]

Step 2

→ github_list_issues({ "repo": "sandbox/demo-repo", "state": "open" })
← [{ "number": 7, "title": "Fix typo in README", "state": "open" }]

Step 3

→ github_get_issue({ "repo": "sandbox/demo-repo", "issue_number": 7 })
← { "number": 7, "title": "Fix typo in README", "state": "open", "body": "..." }

Step 4

→ github_list_issues({ "repo": "sandbox/demo-repo" })
← [{ "number": 7, "title": "Fix typo in README", "state": "open" }]

Step 5

→ github_get_issue({ "repo": "sandbox/demo-repo", "issue_number": 7 })
← { "number": 7, "state": "open" }

Step 6

→ github_list_issues({ "repo": "sandbox/demo-repo", "state": "open" })
← [{ "number": 7, "title": "Fix typo in README", "state": "open" }]

Step 7

→ github_get_issue({ "repo": "sandbox/demo-repo", "issue_number": 7 })
← { "number": 7, "state": "open" }

Step 8

→ github_list_issues({ "repo": "sandbox/demo-repo" })
← [{ "number": 7, "title": "Fix typo in README", "state": "open" }]
Run aborted: max_steps exceeded.

Predicate evaluation: task budget exceeded → FAIL (infra_error: false, budget_exceeded: true)
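Budget enforcement of this kind can be sketched as a bounded loop around the agent. The run_with_budget function and the two toy agents are hypothetical; the budget_exceeded and infra_error field names are taken from the result line above, and everything else is an assumption about how a harness might be structured.

```python
def run_with_budget(agent_step, max_steps):
    """Call agent_step until it reports completion or the budget runs out.

    agent_step(step) performs one tool call and returns True once the agent
    considers the task finished (assumed interface).
    """
    for step in range(1, max_steps + 1):
        finished = agent_step(step)
        if finished:
            return {"steps_used": step, "budget_exceeded": False,
                    "infra_error": False}
    # The loop ended without the agent finishing: abort the run.
    return {"steps_used": max_steps, "budget_exceeded": True,
            "infra_error": False}


def looping_agent(step):
    # Keeps re-reading the issue and never declares the task done.
    return False


def efficient_agent(step):
    # Finishes on its third tool call, like the correct trace.
    return step == 3


looped = run_with_budget(looping_agent, max_steps=8)
direct = run_with_budget(efficient_agent, max_steps=8)

print(looped["steps_used"], looped["budget_exceeded"])  # 8 True
print(direct["steps_used"], direct["budget_exceeded"])  # 3 False
```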

5. What the report shows

After all tasks in a submission complete, MCPBench generates a scorecard.

Server    | Success    | Efficiency | Hallucination | Recovery
github    | 1/1 (100%) | 37.5%      | 0%            |
Aggregate |            |            |               |

Enterprise signed reports include per-server breakdowns, per-difficulty breakdowns, a failure taxonomy, confidence intervals, methodology version, test-set rotation ID, and a cryptographic attestation reference.