Compatool
Spec v0.1.0

MCPBench v0.1 — Documentation

Overview

MCPBench evaluates an agent's ability to correctly use a provided MCP server to complete a task. The benchmark comprises 200 tasks across 10 servers, scored on 4 axes, with a public dev set and a private test set that rotates monthly.

The full spec lives in docs/SPEC.md in the repo. The pages below summarize the parts you'll need to submit and to interpret leaderboard scores.

Task schema

Each task is a JSON document with the following fields: id, server, category (single-tool, composition, or recovery), difficulty, max_steps, goal (the natural-language ask shown to the agent), initial_state for the MCP server, available_tools, and a declarative success_predicate.
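As an illustration, a task document with the fields listed above might look like the following. Every value here is invented (the id format, the difficulty scale, the tool names, and the predicate argument shape are assumptions, not taken from docs/SPEC.md), and the small check only confirms that the named fields are present and the category is one of the three listed.

```python
import json

# Hypothetical task document. Field names follow the list above; every
# value (id format, difficulty scale, tool names, predicate argument
# shape) is an assumption made for illustration.
EXAMPLE_TASK = json.loads("""
{
  "id": "fs-0001",
  "server": "filesystem",
  "category": "single-tool",
  "difficulty": 1,
  "max_steps": 5,
  "goal": "Create an empty file named notes.txt in the workspace root.",
  "initial_state": {"files": {}},
  "available_tools": ["write_file", "list_directory"],
  "success_predicate": {"filesystem.fileExists": {"path": "notes.txt"}}
}
""")

REQUIRED_FIELDS = {
    "id", "server", "category", "difficulty", "max_steps", "goal",
    "initial_state", "available_tools", "success_predicate",
}
CATEGORIES = {"single-tool", "composition", "recovery"}


def shape_problems(task: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    problems = [f"missing field: {name}" for name in sorted(REQUIRED_FIELDS - set(task))]
    if task.get("category") not in CATEGORIES:
        problems.append(f"unknown category: {task.get('category')!r}")
    return problems


print(shape_problems(EXAMPLE_TASK))
```

The authoritative schema is the one in docs/SPEC.md; this sketch is only meant to make the field list concrete.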

Predicates are server-namespaced (filesystem.fileExists, github.prMerged, postgres.rowExists, etc.) and compose with all / any / not so a task can specify multiple sufficient solution paths.
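To make the composition concrete, here is a minimal sketch of how an all / any / not tree over namespaced leaf predicates could be evaluated. The one-key-per-node JSON shape, the leaf argument format, and the `state` layout are assumptions for this example; the actual verifiers may be structured differently.

```python
from typing import Any, Callable

# Hypothetical leaf verifiers keyed by server-namespaced predicate name.
# The predicate name comes from the text above; its argument shape and
# the layout of `state` are assumptions made for this sketch.
LEAF_VERIFIERS: dict[str, Callable[[dict, dict], bool]] = {
    "filesystem.fileExists": lambda state, args: args["path"] in state.get("files", {}),
}


def evaluate(predicate: dict[str, Any], state: dict) -> bool:
    """Evaluate a predicate tree in which each node has exactly one key."""
    [(name, body)] = predicate.items()
    if name == "all":
        return all(evaluate(child, state) for child in body)
    if name == "any":
        return any(evaluate(child, state) for child in body)
    if name == "not":
        return not evaluate(body, state)
    return LEAF_VERIFIERS[name](state, body)


# Two sufficient solution paths (either report file), plus a cleanup condition.
predicate = {"all": [
    {"any": [
        {"filesystem.fileExists": {"path": "report.md"}},
        {"filesystem.fileExists": {"path": "report.txt"}},
    ]},
    {"not": {"filesystem.fileExists": {"path": "scratch.tmp"}}},
]}
final_state = {"files": {"report.txt": "done"}}
print(evaluate(predicate, final_state))  # True
```

The `any` node is what lets a task accept more than one solution path: the agent passes whether it wrote report.md or report.txt, as long as the scratch file is gone.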

Scoring

Success rate (primary)
tasks_passed / tasks_attempted. A task passes when its success predicate is true and no budget was exceeded.
Tool-call efficiency
Mean of tool_calls / max_steps over passed tasks. Lower is better.
Hallucinated-tool rate
calls_to_unlisted_tool / total_calls. Reported across the whole submission.
Recovery-from-error rate
Among recovery tasks in which the agent observed at least one error, the fraction it still passed.
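The four axes can be computed from per-task records roughly as follows. This is a sketch under assumptions: the `TaskResult` record and its field names are hypothetical, and returning NaN when an axis has no qualifying tasks is a choice made here, not something the spec states.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    # Hypothetical per-task record; field names are assumptions for this sketch.
    category: str         # "single-tool", "composition", or "recovery"
    passed: bool          # predicate true and no budget exceeded
    tool_calls: int       # total tool calls made on this task
    max_steps: int        # step budget from the task document
    unlisted_calls: int   # calls to tools not in available_tools
    errors_observed: int  # tool errors the agent saw during the run


def score(results: list[TaskResult]) -> dict[str, float]:
    if not results:
        raise ValueError("at least one attempted task is required")
    nan = float("nan")
    passed = [r for r in results if r.passed]
    total_calls = sum(r.tool_calls for r in results)
    # Recovery axis only counts recovery tasks where an error was actually seen.
    recovery = [r for r in results if r.category == "recovery" and r.errors_observed > 0]
    return {
        "success_rate": len(passed) / len(results),
        "tool_call_efficiency": mean(r.tool_calls / r.max_steps for r in passed) if passed else nan,
        "hallucinated_tool_rate": sum(r.unlisted_calls for r in results) / total_calls if total_calls else nan,
        "recovery_rate": sum(r.passed for r in recovery) / len(recovery) if recovery else nan,
    }


results = [
    TaskResult("single-tool", True, 2, 4, 0, 0),
    TaskResult("composition", False, 8, 8, 2, 1),
    TaskResult("recovery", True, 6, 8, 0, 1),
    TaskResult("recovery", False, 4, 8, 0, 2),
]
print(score(results))
```

For these four made-up results, the success rate is 2/4 = 0.5, efficiency is the mean of 2/4 and 6/8 = 0.625, the hallucinated-tool rate is 2/20 = 0.1, and the recovery rate is 1/2 = 0.5. Note that efficiency ignores failed tasks, while the hallucinated-tool rate counts calls from every task.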

Submission

Two formats are accepted: a hosted endpoint (we POST tasks to your URL) or a Docker image (we run it in an ephemeral CF Container). Full step-by-step instructions are at /submit.

Anti-contamination

  • The private test set is never published in plaintext on the open web.
  • Sandboxed runs — agents cannot phone task content home.
  • Cryptographic publication on retirement: (task_id, hash) attestations let anyone verify the set after the fact.
  • Monthly rotation; ~25% of tasks replaced per month.
  • Per-rotation anomaly detection on score jumps and per-task patterns.
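The (task_id, hash) attestation idea can be sketched as below. The hash function and the canonical encoding are not stated on this page, so SHA-256 over key-sorted compact JSON is an assumption here, as are the function names; the real scheme may differ (for example, by salting).

```python
import hashlib
import json


def task_digest(task: dict) -> str:
    """SHA-256 over a canonical JSON encoding (assumed: sorted keys, compact)."""
    canonical = json.dumps(task, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def attest(tasks: list[dict]) -> list[tuple[str, str]]:
    """Build the (task_id, hash) pairs that can be published without the task text."""
    return [(task["id"], task_digest(task)) for task in tasks]


def verify(retired_tasks: list[dict], attestations: list[tuple[str, str]]) -> bool:
    """Check that a retired set matches previously published attestations exactly."""
    return dict(attest(retired_tasks)) == dict(attestations)


# Hypothetical two-task set, attested first and checked after retirement.
private_set = [
    {"id": "fs-0001", "goal": "Create notes.txt"},
    {"id": "pg-0007", "goal": "Insert one row into audit_log"},
]
published = attest(private_set)
print(verify(private_set, published))  # True
```

Canonicalizing before hashing matters because two JSON documents with the same content but different key order would otherwise produce different digests.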

Methodology

Tasks are sampled from per-server template grammars with bounded parameter ranges, then human-reviewed for predicate correctness and multiple-solution-path coverage. Verifier implementations are in packages/tasks/src/verifiers. Full details are on the methodology page.
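A toy version of sampling from a template with bounded parameters is shown below. The template structure, parameter names, and seeding approach are all hypothetical; the real grammars are described on the methodology page and are likely richer than a single format string.

```python
import random

# Hypothetical template: a goal string with named slots, plus a bounded
# range or choice list for each parameter. None of this is from the spec.
TEMPLATE = {
    "server": "filesystem",
    "goal": "Create {count} empty files in the directory {directory}.",
    "params": {
        "count": range(1, 6),               # bounded integer range: 1 to 5
        "directory": ["inbox", "archive"],  # bounded choice list
    },
}


def sample_task(template: dict, seed: int) -> dict:
    """Draw one concrete task from a template, reproducibly for a given seed."""
    rng = random.Random(seed)
    values = {name: rng.choice(list(domain)) for name, domain in template["params"].items()}
    return {
        "server": template["server"],
        "goal": template["goal"].format(**values),
        "params": values,
    }


print(sample_task(TEMPLATE, seed=7)["goal"])
```

Seeding the generator per task keeps a sampled task reproducible, which helps when a human reviewer later needs to regenerate exactly the instance they approved.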

API

OpenAPI 3.1 contract: apps/api/openapi.yaml. Endpoints:

  • POST /submissions — submit (auth)
  • GET /submissions/{id} — status (auth)
  • GET /leaderboard — public
  • GET /tasks, GET /tasks/{id} — public dev set only
  • GET /runs/{id} — public if owner published trace
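A minimal client sketch for the read endpoints above, using only the standard library. The base URL is a placeholder, and bearer-token authentication is an assumption; apps/api/openapi.yaml is the source of truth for the host, the auth scheme, and response shapes.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "https://mcpbench.example/api"  # placeholder host, not the real one


def build_request(method: str, path: str, token: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a request for one of the listed endpoints."""
    headers = {"Accept": "application/json"}
    if token is not None:
        # Assumption: bearer-token auth. Check openapi.yaml for the real scheme.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(BASE_URL + path, headers=headers, method=method)


def fetch_json(request: urllib.request.Request):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


# Public endpoint: no token needed.
leaderboard_request = build_request("GET", "/leaderboard")

# Authenticated endpoint: status of a submission with a hypothetical id.
status_request = build_request("GET", "/submissions/sub_123", token="YOUR_TOKEN")

print(leaderboard_request.get_method(), leaderboard_request.full_url)
```

Calling `fetch_json(leaderboard_request)` would perform the actual HTTP call once `BASE_URL` points at a real deployment.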

Versioning

The spec follows semver. Set rotation runs on its own monthly cadence and does not bump the spec version. Each retired monthly set gets a Historical-NNNN-MM tag and is then opened to the public.