Enterprise

Independent evidence for production AI agents

MCPBench generates cryptographically-attested evaluation reports for procurement due diligence, governance filings, vendor assessments, and pre-release validation.

Request signed report

Download procurement pack

Internal evals are necessary. They are not sufficient.

Internal benchmarks tell you how your agent performs against your own test suite, but they cannot serve as independent evidence for procurement, vendor selection, or governance review. MCPBench provides a neutral, contamination-resistant signal that you can cite, share, and verify.

Procurement

Vendor teams need third-party evidence. A signed MCPBench report satisfies requests for independent evaluation data.

Governance

AI governance teams need auditable, versioned records of capability. MCPBench reports include methodology version and test-set attestation.

Competitive selection

Compare agent frameworks or model versions on the same neutral task suite, under identical conditions.

What you receive

Overall success rate (primary), tool-call efficiency, hallucinated-tool rate, recovery-from-error rate
Per-server breakdown (10 MCP servers: Filesystem, GitHub, Postgres, Slack, Gmail, Browser, Calendar, Linear, Stripe, Notion)
Per-difficulty breakdown (single-tool, composition, recovery categories)
Failure taxonomy with representative failing trace summaries
Confidence intervals on all metrics
Comparison against published baselines (optional)
Methodology version and test-set rotation ID
Cryptographic attestation reference (SHA-256 hash chain, verifiable once the test set retires)
Signed PDF, suitable for attaching to procurement filings
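
As an illustration of what a confidence interval on a success rate means in practice, here is a minimal sketch in Python. It uses the Wilson score interval, which is an assumption made for this example; the interval method MCPBench actually applies is defined in the methodology, not here. The function name `wilson_interval` and the sample counts are hypothetical.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Assumption: z = 1.96 corresponds to roughly 95% coverage. The method
    used in a real MCPBench report may differ.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")

    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom

    # Clamp to [0, 1] to guard against floating-point drift at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical example: 80 successful tasks out of 100.
low, high = wilson_interval(80, 100)
print(f"success rate 0.80, interval [{low:.3f}, {high:.3f}]")
```

A narrower interval indicates more tasks behind the estimate, which is worth checking when comparing two agents whose headline success rates are close.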

Four steps

1. Submit your agent
Provide a hosted endpoint or Docker image. We run it in our sandboxed infrastructure against the private test set.

2. We evaluate
Each task runs in an isolated Cloudflare Container with egress restricted to the MCP server. No data leaves the sandbox.

3. Report generated
Within 4 hours (SLA), your signed report is ready. It includes all metrics, attestation, and methodology reference.

4. Verify and file
The report is a self-contained PDF. Your team, auditors, or procurement reviewers can verify the attestation independently when the test set retires.
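
For reviewers who want a feel for what independent verification of a SHA-256 hash chain involves, the following Python sketch shows the general technique. The chain layout is an assumption made for illustration (each link hashes the previous digest concatenated with the next item); the actual MCPBench attestation format is specified in the methodology. The names `chain_digest` and `verify_attestation` and the sample task bytes are hypothetical.

```python
import hashlib
import hmac


def chain_digest(items: list[bytes], seed: bytes = b"") -> str:
    """Fold a sequence of byte strings into one SHA-256 chain digest.

    Assumed layout: d_0 = SHA256(seed), d_i = SHA256(d_{i-1} || item_i).
    """
    digest = hashlib.sha256(seed).digest()
    for item in items:
        digest = hashlib.sha256(digest + item).digest()
    return digest.hex()


def verify_attestation(items: list[bytes], attested_hex: str, seed: bytes = b"") -> bool:
    """Recompute the chain and compare it to the reference from the report."""
    recomputed = chain_digest(items, seed)
    # Constant-time comparison; both arguments are ASCII hex strings.
    return hmac.compare_digest(recomputed, attested_hex.lower())


# Hypothetical example: two retired tasks and the reference printed in a report.
retired_tasks = [b"task-001 definition", b"task-002 definition"]
reference = chain_digest(retired_tasks)
print(verify_attestation(retired_tasks, reference))        # True
print(verify_attestation(retired_tasks[::-1], reference))  # False
```

Because each link depends on the previous digest, changing, removing, or reordering any item changes the final value, so a match indicates the retired test set is the one that was attested.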

Built for enterprise security requirements

Submitted Docker images are not redistributed. MCP credentials are scoped per-run and destroyed after completion. Evaluation results are private by default. A DPA is available for all Enterprise customers.

Security architecture

Trust Center

Enterprise plan

From £5,000/mo

Signed PDF reports for procurement and vendor due diligence
Custom MCP servers added to the benchmark
SLA on eval turnaround (≤ 4h on staged submissions)
Dedicated test set rotation cadence
Named technical contact
Quarterly methodology review with your eval team
Data Processing Agreement (DPA)

Talk to us

Annual contracts available. Net-30 invoicing. Procurement-friendly order form and DPA available on request.

Ready to evaluate?

Request a sample report

See exactly what enterprise customers receive before committing.

Get sample

Book a walkthrough

30-minute call with our eval team. We'll walk through the methodology, your use case, and the submission process.

Book call

Read the methodology

Full citable reference for task design, scoring formulas, and contamination defences.

Read methodology