Benchmark Dashboard
Scores across major evaluation datasets. Verified entries are cross-referenced against official papers or reproducible runs.
GPQA Diamond — Model comparison
How we source scores: Verified (✓) entries are cross-referenced against official technical reports, peer-reviewed papers, or reproducible open runs. Unverified entries may come from community submissions that are pending review. Scores shown are pass@1 unless noted. We track raw capability metrics, not cherry-picked numbers selected by vendors.
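For readers unfamiliar with the metric, here is a minimal sketch of how a pass@1 score can be computed from per-question results. It assumes pass@1 means the average, over questions, of the fraction of sampled answers that are correct (with one sample per question this is plain accuracy). The function name `pass_at_1` and the input layout are hypothetical, chosen for illustration, and do not describe this dashboard's actual pipeline.

```python
from typing import Sequence, Tuple


def pass_at_1(counts: Sequence[Tuple[int, int]]) -> float:
    """Return pass@1 as the mean per-question fraction of correct samples.

    Each item is (num_correct, num_samples) for one question.
    Hypothetical helper for illustration only.
    """
    if not counts:
        raise ValueError("need at least one question")
    total = 0.0
    for num_correct, num_samples in counts:
        if num_samples <= 0 or not 0 <= num_correct <= num_samples:
            raise ValueError("invalid counts for a question")
        # With a single sample this term is either 0.0 or 1.0.
        total += num_correct / num_samples
    return total / len(counts)


# Four questions, one sample each, two answered correctly.
single_sample = [(1, 1), (0, 1), (1, 1), (0, 1)]
print(f"pass@1 = {pass_at_1(single_sample):.1%}")  # pass@1 = 50.0%
```

Averaging per-question fractions, rather than pooling all samples, keeps a question that was sampled more often from carrying extra weight in the final score.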