Independent evidence for production AI agents
MCPBench generates cryptographically attested evaluation reports for procurement due diligence, governance filings, vendor assessments, and pre-release validation.
Internal evals are necessary. They are not sufficient.
Internal benchmarks tell you how your agent performs against your own test suite — but they cannot serve as independent evidence for procurement, vendor selection, or governance review. MCPBench provides a neutral, contamination-resistant signal that you can cite, share, and verify.
What you receive
- Overall success rate (primary), tool-call efficiency, hallucinated-tool rate, recovery-from-error rate
- Per-server breakdown (10 MCP servers: Filesystem, GitHub, Postgres, Slack, Gmail, Browser, Calendar, Linear, Stripe, Notion)
- Per-difficulty breakdown (single-tool, composition, recovery categories)
- Failure taxonomy with representative failing trace summaries
- Confidence intervals on all metrics
- Comparison against published baselines (optional)
- Methodology version and test-set rotation ID
- Cryptographic attestation reference (SHA-256 hash chain, verifiable once the test set is retired)
- Signed PDF, suitable for attaching to procurement filings
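To make the "confidence intervals on all metrics" item concrete, here is a minimal sketch of how an interval on a success rate can be computed. It assumes a Wilson score interval at roughly 95% confidence; that choice is an illustration only, since this page does not say which interval method MCPBench reports use, and the function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Assumption: z = 1.96 approximates a 95% interval. The interval method
    used in actual reports is not stated in the source.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")

    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    centre = (p + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom

    # Clamp to [0, 1] to absorb floating-point error at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    low, high = wilson_interval(successes=80, trials=100)
    print(f"success rate 0.80, interval [{low:.3f}, {high:.3f}]")
```

The same function applies to any rate-style metric in the list above (for example a recovery-from-error rate), provided each task outcome is treated as an independent pass/fail trial.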
Four steps
- 1. Submit your agent
Provide a hosted endpoint or Docker image. We run it in our sandboxed infrastructure against the private test set.
- 2. We evaluate
Each task runs in an isolated Cloudflare Container with egress restricted to the MCP server. No data leaves the sandbox.
- 3. Report generated
Within 4 hours (SLA), your signed report is ready. It includes all metrics, attestation, and methodology reference.
- 4. Verify and file
The report is a self-contained PDF. Your team, auditors, or procurement reviewers can verify the attestation independently when the test set retires.
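As a rough illustration of what independent verification of a SHA-256 hash chain can look like, the sketch below recomputes a chain head from retired test-set records and compares it with the value cited in a report. The chain construction (each link hashes the previous digest concatenated with the next record, starting from 32 zero bytes) and the names `chain_head` and `verify_attestation` are assumptions for illustration; the actual MCPBench attestation format is defined in the methodology, not here.

```python
import hashlib
import hmac

# Assumed starting value for the chain; the real scheme may differ.
GENESIS = b"\x00" * 32


def chain_head(records: list[bytes]) -> str:
    """Fold records into one digest: head = SHA-256(previous_head + record)."""
    head = GENESIS
    for record in records:
        head = hashlib.sha256(head + record).digest()
    return head.hex()


def verify_attestation(records: list[bytes], attested_head_hex: str) -> bool:
    """Return True if the recomputed chain head matches the attested hex digest."""
    recomputed = chain_head(records)
    # Constant-time comparison of the two ASCII hex strings.
    return hmac.compare_digest(recomputed, attested_head_hex.lower())


if __name__ == "__main__":
    retired_tasks = [b"task-001", b"task-002", b"task-003"]  # placeholder records
    attested = chain_head(retired_tasks)  # stands in for the value in the report
    print(verify_attestation(retired_tasks, attested))
```

Because every link depends on the one before it, reordering, dropping, or altering any record changes the final digest, which is what lets a reviewer confirm that the retired test set is the one the report was scored against.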
Built for enterprise security requirements
Submitted Docker images are not redistributed. MCP credentials are scoped per-run and destroyed after completion. Evaluation results are private by default. A DPA is available for all Enterprise customers.
Enterprise plan
- Signed PDF reports for procurement and vendor due diligence
- Custom MCP servers added to the benchmark
- SLA on eval turnaround (≤ 4h on staged submissions)
- Dedicated test set rotation cadence
- Named technical contact
- Quarterly methodology review with your eval team
- Data Processing Agreement (DPA)
Annual contracts available. Net-30 invoicing. Procurement-friendly order form and DPA available on request.
Ready to evaluate?
See exactly what enterprise customers receive before committing.
Get sample
30-minute call with our eval team. We'll walk through the methodology, your use case, and the submission process.
Book call
Full citable reference for task design, scoring formulas, and contamination defences.
Read methodology