Benchmark Dashboard

Scores across major evaluation datasets. Verified entries are cross-referenced against official papers or reproducible runs.

GPQA Diamond — Model comparison

How we source scores: Verified (✓) entries are cross-referenced against official technical reports, peer-reviewed papers, or reproducible open runs. Unverified entries may come from community submissions pending review. Scores shown are pass@1 unless noted. We track raw capability metrics rather than vendor-selected numbers.
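
For readers unfamiliar with pass@1: it is the fraction of questions answered correctly on a single attempt. The dashboard does not say how its scores are aggregated, so the sketch below is only an assumption. It uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), which reduces to c/n for k = 1. The function names `pass_at_k` and `benchmark_pass_at_1` are hypothetical and not part of any dashboard API.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one question.

    n: samples drawn, c: samples that were correct, k: attempt budget.
    Assumption: this is the standard estimator, not necessarily the
    dashboard's own aggregation method.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k wrong samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: list[tuple[int, int]]) -> float:
    """Mean pass@1 over questions; results holds (n_samples, n_correct) pairs."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, 1) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical data: three questions, the third sampled four times.
    print(benchmark_pass_at_1([(1, 1), (1, 0), (4, 2)]))
```

With one sample per question, this is simply plain accuracy. Drawing several samples per question and averaging c/n gives a lower-variance estimate of the same quantity.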