MCPBench v0.1 — Documentation
Overview
MCPBench evaluates an agent's ability to correctly use a provided MCP server to complete a task. The benchmark comprises 200 tasks across 10 servers, scored on 4 axes, with a public dev set and a private test set that rotates monthly.
The full spec lives in docs/SPEC.md in the repo. The pages below summarize the parts you'll need to submit and to interpret leaderboard scores.
Task schema
Each task is a JSON document with: id, server, category (single-tool, composition, or recovery), difficulty, max_steps, goal (the natural-language ask shown to the agent), initial_state for the MCP server, available_tools, and a declarative success_predicate.
Predicates are server-namespaced (filesystem.fileExists, github.prMerged, postgres.rowExists, etc.) and compose with all / any / not so a task can specify multiple sufficient solution paths.
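To make the composition concrete, here is a minimal sketch of a task document and a recursive evaluator for all / any / not. The JSON encoding of the predicate tree (the "predicate" and "args" keys), the toy verifier, the tool names, and every field value are assumptions made for illustration; the actual encoding is defined in docs/SPEC.md.

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]

# Hypothetical leaf verifiers keyed by server-namespaced predicate name.
# The "server state" here is just a dict holding a path -> contents mapping.
VERIFIERS: Dict[str, Callable[[State, Dict[str, Any]], bool]] = {
    "filesystem.fileExists": lambda state, args: args["path"] in state["files"],
}

def evaluate(pred: Dict[str, Any], state: State) -> bool:
    """Recursively evaluate an all / any / not predicate tree."""
    if "all" in pred:
        return all(evaluate(p, state) for p in pred["all"])
    if "any" in pred:
        return any(evaluate(p, state) for p in pred["any"])
    if "not" in pred:
        return not evaluate(pred["not"], state)
    # Leaf: dispatch to the verifier registered under the predicate name.
    return VERIFIERS[pred["predicate"]](state, pred.get("args", {}))

# Illustrative task; every value below is made up for this example.
task = {
    "id": "example-001",
    "server": "filesystem",
    "category": "single-tool",
    "difficulty": 1,
    "max_steps": 5,
    "goal": "Move notes.txt out of the root and into an archive folder.",
    "initial_state": {"files": {"/notes.txt": "hello"}},
    "available_tools": ["move_file", "copy_file", "delete_file"],
    "success_predicate": {
        "all": [
            # Two sufficient destinations, expressed with `any`.
            {"any": [
                {"predicate": "filesystem.fileExists",
                 "args": {"path": "/archive/notes.txt"}},
                {"predicate": "filesystem.fileExists",
                 "args": {"path": "/old/notes.txt"}},
            ]},
            # The original must be gone, expressed with `not`.
            {"not": {"predicate": "filesystem.fileExists",
                     "args": {"path": "/notes.txt"}}},
        ]
    },
}

final_state = {"files": {"/archive/notes.txt": "hello"}}
print(evaluate(task["success_predicate"], final_state))
```

In this sketch the any branch is what lets a single task accept more than one solution path, while the not branch rules out an agent that copies the file without removing the original.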
Scoring
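As a rough sketch, the four axes defined in this section could be computed from per-task results as shown here. The TaskResult record, the result key names (pass_rate, step_ratio, unlisted_rate, recovery_rate), and the choice to average tool_calls / max_steps per passed task are assumptions for illustration, not the benchmark's verifier code. The sketch also assumes tool_calls counts every call, including calls to unlisted tools.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class TaskResult:
    category: str        # "single-tool", "composition", or "recovery"
    passed: bool         # predicate true and no budget exceeded
    tool_calls: int      # all tool calls made on this task
    max_steps: int       # step budget from the task document (assumed > 0)
    unlisted_calls: int  # calls to tools not in available_tools
    saw_error: bool      # agent observed at least one error

def ratio(num: float, den: float) -> Optional[float]:
    """Return num / den, or None when the denominator is zero."""
    return num / den if den else None

def score(results: List[TaskResult]) -> Dict[str, Optional[float]]:
    passed = [r for r in results if r.passed]
    # Only recovery tasks where an error was actually observed count here.
    recovery = [r for r in results if r.category == "recovery" and r.saw_error]
    return {
        "pass_rate": ratio(len(passed), len(results)),
        "step_ratio": ratio(
            sum(r.tool_calls / r.max_steps for r in passed), len(passed)
        ),
        "unlisted_rate": ratio(
            sum(r.unlisted_calls for r in results),
            sum(r.tool_calls for r in results),
        ),
        "recovery_rate": ratio(sum(r.passed for r in recovery), len(recovery)),
    }

# Made-up results for four attempted tasks.
results = [
    TaskResult("single-tool", True, 2, 10, 0, False),
    TaskResult("composition", False, 8, 8, 1, False),
    TaskResult("recovery", True, 5, 10, 0, True),
    TaskResult("recovery", False, 5, 10, 1, True),
]
print(score(results))
```

Returning None for an empty denominator keeps a submission with no qualifying recovery tasks distinguishable from one that failed all of them.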
- tasks_passed / tasks_attempted. Pass = predicate true and no budget exceeded.
- tool_calls / max_steps over passed tasks. Lower is better.
- calls_to_unlisted_tool / total_calls. Reported across the whole submission.
- Recovery tasks where the agent observed at least one error: did it still pass?
Submission
Two formats: hosted endpoint (we POST tasks to your URL) or Docker image (we run it in an ephemeral CF Container). Full step-by-step in /submit.
Anti-contamination
- Private test set never plaintext on the open web.
- Sandboxed runs — agents cannot phone task content home.
- Cryptographic publication on retirement: (task_id, hash) attestations let anyone verify the set after the fact.
- Monthly rotation; ~25% of tasks replaced per month.
- Per-rotation anomaly detection on score jumps and per-task patterns.
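The (task_id, hash) attestations can be checked along the following lines once a set is retired. This is a sketch under stated assumptions: the hash function (SHA-256), the canonical JSON encoding, and the attestation shape (a mapping from task id to hex digest) are not specified on this page and are chosen here only for illustration.

```python
import hashlib
import json
from typing import Dict, List

def task_hash(task: dict) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, compact separators).

    Assumption: the real attestation scheme may hash a different encoding.
    """
    canonical = json.dumps(task, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def verify_retired_set(tasks: List[dict], attestations: Dict[str, str]) -> List[str]:
    """Return ids of tasks whose hash is missing from or differs from the attestations."""
    return [t["id"] for t in tasks if attestations.get(t["id"]) != task_hash(t)]

# Made-up example: an attestation recorded while the task was private...
original = {"id": "example-001", "goal": "Move notes.txt into /archive."}
attestations = {original["id"]: task_hash(original)}

# ...and the task as later published. An empty list means everything matches.
print(verify_retired_set([original], attestations))
```

Because only digests are published while a set is live, the attestations commit to the task content without revealing it; anyone can recompute the digests after retirement and compare.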
Methodology
Tasks are sampled from per-server template grammars with bounded parameter ranges, then human-reviewed for predicate correctness and multiple-solution-path coverage. Verifier implementations are in packages/tasks/src/verifiers. Full details in the methodology page.
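Sampling from a template with bounded parameter ranges might look roughly like the sketch below. The template structure, parameter names, and the use of a seeded generator are all hypothetical; the real grammars are not described on this page, and the human review step is not modeled.

```python
import random

# Hypothetical template: a goal pattern plus bounded integer parameter ranges.
TEMPLATE = {
    "server": "filesystem",
    "goal": "Create {n_files} files inside {depth} nested directories.",
    "params": {"n_files": (1, 5), "depth": (1, 3)},  # inclusive (low, high) bounds
}

def sample_task(template: dict, seed: int) -> dict:
    """Draw one task instance; the seed makes the draw reproducible."""
    rng = random.Random(seed)
    params = {
        name: rng.randint(low, high)
        for name, (low, high) in template["params"].items()
    }
    return {
        "server": template["server"],
        "goal": template["goal"].format(**params),
        "params": params,
    }

print(sample_task(TEMPLATE, seed=0)["goal"])
```

Bounding each range keeps sampled tasks within a known difficulty envelope, and a fixed seed lets a reviewer regenerate exactly the instance they are checking.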
API
OpenAPI 3.1 contract: apps/api/openapi.yaml. Endpoints:
- POST /submissions: submit (auth)
- GET /submissions/{id}: status (auth)
- GET /leaderboard: public
- GET /tasks, GET /tasks/{id}: public dev set only
- GET /runs/{id}: public if owner published trace
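For the public endpoints, requests can be built with the Python standard library as sketched here. The base URL is a placeholder and the Accept header is an assumption; the authoritative paths, parameters, and response shapes are in apps/api/openapi.yaml. The authenticated endpoints are left out because the auth scheme is not described on this page.

```python
from urllib.parse import quote, urljoin
from urllib.request import Request

# Placeholder host for illustration only; substitute the real API base URL.
BASE_URL = "https://mcpbench.example/"

def public_get(path: str) -> Request:
    """Build (but do not send) a GET request for a public endpoint."""
    url = urljoin(BASE_URL, path.lstrip("/"))
    return Request(url, method="GET", headers={"Accept": "application/json"})

def task_request(task_id: str) -> Request:
    """GET /tasks/{id}, with the id percent-encoded as a single path segment."""
    return public_get(f"/tasks/{quote(task_id, safe='')}")

leaderboard = public_get("/leaderboard")
print(leaderboard.get_method(), leaderboard.full_url)
```

Passing one of these Request objects to urllib.request.urlopen would send it; the sketch stops short of that so it runs without network access.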
Versioning
Spec follows semver. Set rotation is on its own monthly cadence and does not bump the spec version. Each retired month's set gets a Historical-NNNN-MM tag and is opened to the public.