OPUS 4.8: $5 / $25 per Mtok
GPT-5.5: $5 / $30 per Mtok
GEMINI 2.5 PRO: $1.25 / $10 per Mtok
SONNET 4.6: $3 / $15 per Mtok
SWE-BENCH leader: GPT-5.5, 68.7%
MMLU-PRO leader: GPT-5.5, 94.2
GPQA leader: GPT-5.5, 78.3
AFTA v1.0 whitepaper live at /whitepaper

Benchmark Registry

A meta-catalog of AI evaluation benchmarks covering knowledge, math, code, multimodal, agents, and long-context. Each entry lists size, score range, current frontier, status (active vs saturated), contamination risk, and links to the paper, repo, and leaderboard.
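As a rough illustration of what one entry carries, here is a minimal Python sketch. The class name, field names, and types are all assumptions made for illustration; the page lists which attributes exist but not how they are named or encoded, and the values below are placeholders rather than real benchmark data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BenchmarkEntry:
    """Hypothetical shape of one registry entry (field names are assumed)."""
    name: str
    category: str                   # e.g. "knowledge", "math", "code"
    size: Optional[int]             # number of items, if known
    score_range: tuple              # (min, max) of the scoring scale
    current_frontier: Optional[float]
    status: str                     # "active" or "saturated"
    contamination_risk: str         # encoding is not specified by the page
    paper_url: Optional[str] = None
    repo_url: Optional[str] = None
    leaderboard_url: Optional[str] = None

    def is_saturated(self) -> bool:
        return self.status == "saturated"


# Placeholder values only; not taken from the registry.
example = BenchmarkEntry(
    name="example-benchmark",
    category="code",
    size=None,
    score_range=(0.0, 100.0),
    current_frontier=None,
    status="active",
    contamination_risk="unknown",
)
```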

This registry differs from /benchmarks, which holds model × score data for the 5 benchmarks we ingest, and from /harnesses, which holds harness × score data for 4 agentic benchmarks. It is the broader registry of which benchmarks exist, what they test, and where to find current numbers.


For agents: same data at /api/benchmark-registry. Filter with ?category=knowledge|math|code|... or ?status=active|saturated. Free, no auth, cached 10 min.
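A minimal Python sketch of querying that endpoint, using only the standard library. The host in `BASE_URL` is a placeholder, the function names are hypothetical, and two things are assumed rather than documented above: that the response body is JSON, and that `category` and `status` can be combined in one request.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Placeholder host (assumption); replace with the site's real origin.
BASE_URL = "https://example.com"


def registry_url(category: Optional[str] = None,
                 status: Optional[str] = None,
                 base_url: str = BASE_URL) -> str:
    """Build the /api/benchmark-registry URL with optional filters."""
    params = {}
    if category is not None:
        params["category"] = category   # e.g. "knowledge", "math", "code"
    if status is not None:
        params["status"] = status       # "active" or "saturated"
    url = f"{base_url}/api/benchmark-registry"
    query = urlencode(params)
    return f"{url}?{query}" if query else url


def fetch_registry(category: Optional[str] = None,
                   status: Optional[str] = None,
                   base_url: str = BASE_URL):
    """Fetch and decode the registry. Assumes a JSON response; no auth header is sent."""
    with urlopen(registry_url(category, status, base_url), timeout=10) as resp:
        return json.load(resp)
```

For example, `registry_url(category="code", status="active")` builds the path `/api/benchmark-registry?category=code&status=active` on the placeholder host. Since responses are cached for 10 minutes, polling more often than that is unlikely to return fresher data.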