Set: dev (public)
Tasks
The public dev set is a 40-task subset intended for development and CI. The private 160-task test set determines leaderboard scores and rotates monthly.
Private set: 160 tasks · monthly rotation
MCP server: