Compatool
Enterprise

Independent evidence for production AI agents

MCPBench generates cryptographically-attested evaluation reports for procurement due diligence, governance filings, vendor assessments, and pre-release validation.

Internal evals are necessary. They are not sufficient.

Internal benchmarks tell you how your agent performs against your own test suite, but they cannot serve as independent evidence for procurement, vendor selection, or governance review. MCPBench provides a neutral, contamination-resistant signal that you can cite, share, and verify.

Procurement
Vendor teams need third-party evidence. A signed MCPBench report satisfies requests for independent evaluation data.
Governance
AI governance teams need auditable, versioned records of capability. MCPBench reports include methodology version and test-set attestation.
Competitive selection
Compare agent frameworks or model versions on the same neutral task suite, under identical conditions.

What you receive

  • Overall success rate (primary), tool-call efficiency, hallucinated-tool rate, recovery-from-error rate
  • Per-server breakdown (10 MCP servers: Filesystem, GitHub, Postgres, Slack, Gmail, Browser, Calendar, Linear, Stripe, Notion)
  • Per-difficulty breakdown (single-tool, composition, recovery categories)
  • Failure taxonomy with representative failing trace summaries
  • Confidence intervals on all metrics
  • Comparison against published baselines (optional)
  • Methodology version and test-set rotation ID
  • Cryptographic attestation reference (SHA-256 hash chain, verifiable once the test set retires)
  • Signed PDF, suitable for attaching to procurement filings
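To make "confidence intervals on all metrics" concrete, here is a minimal sketch of one common way to put an interval around a rate such as overall success rate. It uses the Wilson score interval, which is our assumption for illustration; this page does not state which interval method MCPBench reports use, and the function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Illustrative only: the interval method used in actual reports is
    not specified on this page. z = 1.96 corresponds to roughly 95%.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")

    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    centre = (p + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom

    # Clamp to [0, 1] to absorb floating-point error at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    # Hypothetical run: 80 of 100 tasks succeeded.
    low, high = wilson_interval(80, 100)
    print(f"success rate 0.80, interval [{low:.3f}, {high:.3f}]")
```

The same function applies to any rate-style metric in the list above (for example hallucinated-tool rate), given a count of events and a count of opportunities.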

Four steps

  1. Submit your agent

    Provide a hosted endpoint or Docker image. We run it in our sandboxed infrastructure against the private test set.

  2. We evaluate

    Each task runs in an isolated Cloudflare Container with egress restricted to the MCP server. No data leaves the sandbox.

  3. Report generated

    Within 4 hours (SLA), your signed report is ready. It includes all metrics, attestation, and methodology reference.

  4. Verify and file

    The report is a self-contained PDF. Your team, auditors, or procurement reviewers can verify the attestation independently when the test set retires.
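    As an illustration of what independent verification of a SHA-256 hash chain can look like, here is a minimal sketch. The chain layout is an assumption made for this example: each link is SHA-256 of the previous link's hash concatenated with the entry payload, starting from an all-zero genesis value. The real attestation format, field names, and publication process are defined in the methodology, not here, and every name below (`GENESIS`, `link_hash`, `build_chain`, `verify_chain`) is hypothetical.

```python
import hashlib
import hmac

# Assumed starting value for the chain (hypothetical, not the real format).
GENESIS = "0" * 64


def link_hash(prev_hash: str, payload: bytes) -> str:
    """Hash one link: SHA-256 over the previous hash bytes plus the payload."""
    return hashlib.sha256(bytes.fromhex(prev_hash) + payload).hexdigest()


def build_chain(payloads: list[bytes]) -> list[dict]:
    """Build a chain over payloads (e.g. serialized test-set entries)."""
    chain = []
    prev = GENESIS
    for payload in payloads:
        prev = link_hash(prev, payload)
        chain.append({"payload": payload, "hash": prev})
    return chain


def verify_chain(chain: list[dict], expected_head: str) -> bool:
    """Recompute every link and compare the final hash with the value
    cited in the report (expected_head)."""
    prev = GENESIS
    for entry in chain:
        if link_hash(prev, entry["payload"]) != entry["hash"]:
            return False  # a payload or an intermediate hash was altered
        prev = entry["hash"]
    return hmac.compare_digest(prev, expected_head)


if __name__ == "__main__":
    chain = build_chain([b"task-001", b"task-002", b"task-003"])
    head = chain[-1]["hash"]
    print(verify_chain(chain, head))
```

    Under these assumptions, a report would only need to cite the head hash: once the test set is retired and published, anyone can recompute the chain and confirm that the tasks match what was attested at evaluation time.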

Built for enterprise security requirements

Submitted Docker images are not redistributed. MCP credentials are scoped per-run and destroyed after completion. Evaluation results are private by default. A DPA is available for all Enterprise customers.

Enterprise plan

From £5,000/mo
  • Signed PDF reports for procurement and vendor due diligence
  • Custom MCP servers added to the benchmark
  • SLA on eval turnaround (≤ 4h on staged submissions)
  • Dedicated test set rotation cadence
  • Named technical contact
  • Quarterly methodology review with your eval team
  • Data Processing Agreement (DPA)
Talk to us

Annual contracts available. Net-30 invoicing. Procurement-friendly order form and DPA available on request.

Ready to evaluate?

Request a sample report

See exactly what enterprise customers receive before committing.

Get sample
Book a walkthrough

30-minute call with our eval team. We'll walk through the methodology, your use case, and the submission process.

Book call
Read the methodology

Full citable reference for task design, scoring formulas, and contamination defences.

Read methodology