Methodology

How this site defines benchmark saturation, where the data comes from, and the academic work that informs our approach.

Defining Saturation

We follow the framework established by Akhtar et al. (2026), which defines benchmark saturation as the “loss of reliable discriminative power among state-of-the-art models.” A benchmark is considered saturated when top-performing models cluster so tightly that their performance differences fall within the expected margin of evaluation uncertainty, rendering them statistically indistinguishable.

This means saturation does not require a model to score 100%. A benchmark can saturate well below its theoretical maximum if the remaining gap reflects noise, ambiguous questions, or data errors rather than meaningful capability differences. For example, PubMedQA plateaued around 81% and WMDP-Bio around 87% — both below 100% but with top models no longer meaningfully differentiable.
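
As a rough illustration of that criterion (a sketch, not the exact statistical test used in the cited analyses), two models can be treated as indistinguishable on an n-question accuracy benchmark when their score gap falls within the combined binomial standard error of their scores:

```typescript
// Illustrative only: treats each benchmark score as a binomial proportion over n questions
// and asks whether the gap between two models exceeds the combined sampling noise.
function standardError(acc: number, n: number): number {
  return Math.sqrt((acc * (1 - acc)) / n);
}

function indistinguishable(accA: number, accB: number, n: number, z = 1.96): boolean {
  const combinedSE = Math.hypot(standardError(accA, n), standardError(accB, n));
  return Math.abs(accA - accB) < z * combinedSE;
}

// Example: on a 500-question benchmark, 86% vs. 87% accuracy is within noise.
console.log(indistinguishable(0.86, 0.87, 500)); // true
```

On a benchmark of a few hundred items, resolvable differences are on the order of a few percentage points, which is why tight clustering near the top erases discriminative power even when scores sit well below 100%.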

For each benchmark, we compute activeLifespanMonths as the number of months from the benchmark's introduction to its approximate saturation date. For benchmarks classified as “nearing” saturation, lifespan is computed as time from introduction to April 2026 (a lower bound that will increase if the benchmark resists full saturation).
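
A minimal sketch of this computation in TypeScript (the Benchmark shape, field names such as introduced and saturationDate, and the cutoff constant are assumptions for illustration, not the site's actual data model):

```typescript
// Minimal sketch of the lifespan computation; field names are illustrative assumptions.
interface Benchmark {
  introduced: string;       // ISO date of introduction, e.g. "2023-11-20"
  saturationDate?: string;  // ISO date of approximate saturation; absent if only "nearing"
  status: "saturated" | "nearing";
}

// Lower-bound cutoff used for benchmarks still classified as "nearing" saturation.
const NEARING_CUTOFF = new Date("2026-04-01");

// Calendar-month difference between two dates.
function monthsBetween(start: Date, end: Date): number {
  return (end.getUTCFullYear() - start.getUTCFullYear()) * 12
    + (end.getUTCMonth() - start.getUTCMonth());
}

function activeLifespanMonths(b: Benchmark): number {
  const start = new Date(b.introduced);
  const end = b.status === "saturated" && b.saturationDate
    ? new Date(b.saturationDate)
    : NEARING_CUTOFF;
  return monthsBetween(start, end);
}
```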

Limitation: Saturation dates are approximate, based on when published analyses identified performance plateaus rather than exact mathematical inflection points. Different researchers may identify slightly different saturation dates for the same benchmark.

Data Sources

Benchmark scores, saturation dates, and metadata are compiled from the following publicly available sources.

Source | Used for | URL
Epoch AI Benchmarking Hub | Historical SOTA scores for GPQA Diamond, MMLU, MedQA, HLE | epoch.ai/benchmarks
Justen (2025) biology-benchmarks | 27-model evaluation data across 8 bio benchmarks (Nov 2022 - Apr 2025) | github.com/lennijusten/biology-benchmarks
CASP Prediction Center | GDT-TS and lDDT scores across 16 rounds (1994-2024) | predictioncenter.org
MedHELM / HELM | Clinical benchmark scores across 121 tasks | crfm.stanford.edu/helm/medhelm
HuggingFace | WMDP, MMLU, and various benchmark datasets | huggingface.co
Therapeutics Data Commons | Drug discovery leaderboards across 29 tasks | tdcommons.ai
ProteinGym | Protein fitness prediction baselines (217 DMS assays, 90+ models) | proteingym.org
Papers With Code archive | Historical SOTA data across 9,327 benchmarks (frozen July 2025) | github.com/paperswithcode/paperswithcode-data
Vals.ai | MedQA saturation tracking (archived as saturated) | vals.ai
BioASQ | Annual biomedical semantic QA competition results since 2013 | bioasq.org
BLURB | Biomedical NLP leaderboard (13 datasets, 6 task types) | microsoft.github.io/BLURB
Nucleotide Transformer Benchmark | Genomic prediction leaderboard (18 tasks) | HuggingFace Space (InstaDeepAI)
Scale AI / HLE Leaderboard | Humanity's Last Exam scores | labs.scale.com/leaderboard
SecureBio / VCT | Virology Capabilities Test results | virologytest.ai
Open Problems in Single-Cell Biology | Single-cell benchmark results | openproblems.bio
Open Graph Benchmark | Protein and molecular graph benchmark scores | ogb.stanford.edu

References

  1. Akhtar, M. et al. (2026). "When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation." arXiv:2602.16763
  2. Ott, S. et al. (2022). "Mapping global dynamics of benchmark creation and saturation in artificial intelligence." Nature Communications. doi:10.1038/s41467-022-34591-0
  3. Justen, L. (2025). "LLMs Outperform Experts on Challenging Biology Benchmarks." arXiv:2505.06108
  4. Dev, S. et al. (2025). "Toward Comprehensive Benchmarking of the Biological Knowledge of Frontier LLMs." RAND Corporation. RAND RRA3797-1
  5. Ho, A. & Berg, A. (2025). "Do the biorisk evaluations of AI labs actually measure the risk of developing bioweapons?" Epoch AI Gradient Updates.
  6. Götting, J. et al. (2025). "Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark." arXiv:2504.16137
  7. Li, N. et al. (2024). "The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning." ICML 2024. arXiv:2403.03218
  8. Rein, D. et al. (2023). "GPQA: A Graduate-Level Google-Proof Q&A Benchmark." ICLR 2024. arXiv:2311.12022
  9. Kiela, D. et al. (2021). "Dynabench: Rethinking Benchmarking in NLP." NAACL 2021. ACL Anthology
  10. Jumper, J. et al. (2021). "Highly Accurate Protein Structure Prediction with AlphaFold." Nature. doi:10.1038/s41586-021-03819-2
  11. Notin, P. et al. (2023). "ProteinGym: Large-Scale Benchmarks for Protein Fitness Prediction and Design." NeurIPS 2023 Datasets & Benchmarks. proteingym.org
  12. Hendrycks, D. et al. (2025). "Humanity's Last Exam." Nature. doi:10.1038/s41586-025-09962-4
  13. Liu, A.B. et al. (2025). "ABC-Bench: Agentic Bio-Capabilities Benchmark." NeurIPS 2025 BioSafe GenAI Workshop.