Methodology

How this site defines benchmark saturation, where the data comes from, and the academic work that informs our approach.

Defining Saturation

We follow the framework established by Akhtar et al. (2026), which defines benchmark saturation as the “loss of reliable discriminative power among state-of-the-art models.” A benchmark is considered saturated when top-performing models cluster so tightly that their performance differences fall within the expected margin of evaluation uncertainty, rendering them statistically indistinguishable.

This means saturation does not require a model to score 100%. A benchmark can saturate well below its theoretical maximum if the remaining gap reflects noise, ambiguous questions, or data errors rather than meaningful capability differences. For example, WMDP-Bio saturated around 87% and MedQA around 96% — in both cases top models became statistically indistinguishable despite remaining distance to 100%.
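
One way to make "statistically indistinguishable" concrete is to compare the spread of the top scores with the sampling margin of an accuracy difference. The sketch below is only an illustration of that idea, not the procedure used by Akhtar et al. or by this site: the function name, the independent-binomial assumption, and the 1.96 multiplier are all assumptions introduced here.

```python
import math

def top_models_indistinguishable(top_scores, n_items, z=1.96):
    """Hypothetical check: is the spread among top accuracies within noise?

    top_scores: accuracies in [0, 1] for the leading models.
    n_items:    number of benchmark questions (assumed independent trials).
    """
    best, worst = max(top_scores), min(top_scores)
    # Standard error of the difference between two independent accuracies.
    se_diff = math.sqrt(
        best * (1 - best) / n_items + worst * (1 - worst) / n_items
    )
    return (best - worst) <= z * se_diff
```

Under these assumptions, a one-point spread near 87% on a 1,000-question benchmark falls inside the margin, while a ten-point spread does not.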

For each benchmark, we compute activeLifespanMonths as the number of months from the benchmark's introduction to its approximate saturation date. For benchmarks classified as “nearing” saturation, lifespan is computed as time from introduction to the current date (a lower bound that will increase if the benchmark resists full saturation).
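
The lifespan calculation can be sketched as follows. Only the name activeLifespanMonths comes from the text above; the function signature and the whole-month counting convention are assumptions made for illustration.

```python
from datetime import date

def active_lifespan_months(introduced, saturated=None, today=None):
    """Months from introduction to saturation.

    For "nearing" benchmarks (saturated is None) the end point is the
    current date, which yields a lower bound on the eventual lifespan.
    """
    end = saturated or today or date.today()
    # Assumption: count whole calendar months and ignore the day of month.
    return (end.year - introduced.year) * 12 + (end.month - introduced.month)
```

For example, a benchmark introduced in 2021-06 and saturated in 2023-10 would have a lifespan of 28 months under this convention.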

Limitation: Saturation dates are approximate, based on when published analyses identified performance plateaus rather than exact mathematical inflection points. Different researchers may identify slightly different saturation dates for the same benchmark.

Gap Closure & Halving Time

Inclusion criteria. A benchmark is featured in the rate-of-improvement analysis if it comes from an approved aggregator (Papers With Code archive, Epoch AI, TDC, or manual curation), is identified as biology/biomedical, and has enough SOTA envelope points with distinct dates: ≥2 for Papers With Code (PWC) and Epoch entries, ≥3 for Therapeutics Data Commons (TDC) entries. The TDC threshold is stricter because TDC time series are reconstructed from Wayback Machine snapshots of a rolling leaderboard, which includes more spurious churn than paper-attested PWC entries. Benchmarks are not hand-picked: any benchmark that meets the criteria for its source is included. The only exceptions are three manual additions that fall below their source's threshold: NCBI-disease and TS115 (below the PWC threshold) and tdc-cyp2d6s (below the TDC threshold). These manual additions are documented individually in the Plateau Classification section and in src/data/featured.json.
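
The inclusion rule can be sketched as a small predicate. The thresholds come from the paragraph above and the source labels reuse the category names from the Evidence Provenance section. The function name, argument names, and the manual_override flag are hypothetical and do not describe the actual pipeline code.

```python
# Minimum number of distinct-date SOTA envelope points per source.
MIN_ENVELOPE_POINTS = {"papers-with-code": 2, "epoch": 2, "tdc": 3}

def is_featured(source, is_biology, envelope_dates, manual_override=False):
    """Hypothetical inclusion check for the rate-of-improvement analysis."""
    if manual_override:
        # Documented manual additions bypass the threshold.
        return True
    threshold = MIN_ENVELOPE_POINTS.get(source)
    if threshold is None or not is_biology:
        return False
    # Only distinct dates count toward the threshold.
    return len(set(envelope_dates)) >= threshold
```

A TDC benchmark with two envelope points, such as tdc-cyp2d6s (2016-09 and 2019-05), fails this check unless it is flagged as a manual addition.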

Dataset evolution. The featured-benchmark pool has grown across site versions. An early non-systematic sample gave a median halving time of roughly 37 months; a systematic Stage-E expansion (including TDC ADMET and molecular-property benchmarks) shifted this to roughly 82 months; the current post-2022 headline of approximately 29 months reflects the same systematic pool filtered to post-2022 intervals. These shifts are driven by pool composition and window selection, not by methodology changes. The current headline is reported because we judge the current pool to be the most representative; users should treat the point estimate with appropriate uncertainty and prefer the per-domain breakdown for specific capability claims.

Chance baseline required. Gap closure is defined against a random-performance reference, so benchmarks where the chance baseline is undefined for the metric are excluded from the pooled halving-time analysis. This affects eleven TDC ADMET benchmarks: five regression benchmarks (MAE on tdc-caco2, tdc-lipo, tdc-aqsol, tdc-ppbr, tdc-ld50) where a “random” MAE depends on the target distribution and has no universal convention, and six AUPRC classification benchmarks (CYP inhibition/substrate assays) where the random-classifier AUPRC equals the class prior, which varies per dataset. Rather than silently defaulting chance to zero — which would overstate gap closure for the AUPRC entries and flip the sign for the MAE entries — these eleven benchmarks contribute zero halving intervals to the pooled median. Their trajectories still appear in the gap-closure plot for reference (using a zero-chance visualization convenience), but they do not influence rate estimates.

To quantify the rate of improvement, we compute gap closure for each featured benchmark. For higher-is-better metrics: gc = (score − chance) / (ceiling − chance). For lower-is-better metrics (e.g., RMSE): gc = (chance − score) / (chance − ceiling), where ceiling is the best attainable value. This normalizes benchmarks with different scales (accuracy %, F1, ROC-AUC, Spearman ρ, MCC, MAE, RMSE) to a common 0–1 scale where 0 = chance performance and 1 = ceiling performance. The normalization follows the Human Normalized Score (HNS) lineage from Mnih et al. (2015) through BIG-bench, HELM, and the HuggingFace Leaderboard v2.
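
A minimal sketch of the two gap-closure formulas, written as one function. The function and parameter names are illustrative assumptions, and raising an error for a missing chance baseline is one possible way to express the exclusion rule rather than the site's actual behavior.

```python
def gap_closure(score, chance, ceiling, higher_is_better=True):
    """Normalize a score to a scale where 0 = chance and 1 = ceiling."""
    if chance is None:
        # No defined chance baseline: excluded from rate estimates.
        raise ValueError("gap closure is undefined without a chance baseline")
    if higher_is_better:
        return (score - chance) / (ceiling - chance)
    # Lower-is-better metrics (e.g. RMSE): ceiling is the best attainable value.
    return (chance - score) / (chance - ceiling)
```

With a four-option multiple-choice benchmark (chance 0.25, ceiling 1.0), an accuracy of 0.625 corresponds to a gap closure of 0.5, so half of the distance from chance to ceiling has been closed.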

Only the SOTA envelope (best score seen to date) is used, ensuring monotonic gap closure. Between consecutive improvements, the halving time is computed as: months × ln(2) / ln(gap_before / gap_after) — the time for the remaining gap to halve, assuming exponential decay between two data points.
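
The envelope and halving-time steps can be sketched as below. The formula is the one stated above. The function names and the input layout (time in months paired with gap closure) are assumptions, as is the choice to apply the 0.1% exclusion from the limitations note to the gap after the improvement. The sketch also assumes each point has a distinct date.

```python
import math

def sota_envelope(points):
    """Keep only points that improve on the best gap closure seen so far.

    points: iterable of (months_since_start, gap_closure), in any order.
    """
    envelope, best = [], float("-inf")
    for t, gc in sorted(points):
        if gc > best:
            envelope.append((t, gc))
            best = gc
    return envelope

def halving_times(envelope, min_gap=0.001):
    """Halving time in months for each pair of consecutive improvements."""
    out = []
    for (t0, gc0), (t1, gc1) in zip(envelope, envelope[1:]):
        gap_before, gap_after = 1 - gc0, 1 - gc1
        if gap_after < min_gap:
            # Near-saturated interval: skipped to avoid extreme values.
            continue
        out.append((t1 - t0) * math.log(2) / math.log(gap_before / gap_after))
    return out
```

If the remaining gap falls from 0.5 to 0.25 over 12 months, the gap has halved exactly once, so the halving time for that interval is 12 months.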

The headline number is the pooled median across all halving intervals from all featured benchmarks, with 25th and 75th percentile (IQR) as the descriptive range.

Uncertainty. We report a 95% confidence interval on the pooled median via a non-parametric bootstrap (2,000 draws) clustered at the benchmark level. Each draw samples benchmarks with replacement (not individual intervals, which would overstate precision because intervals within a benchmark share ceiling, chance baseline, and metric). The 2.5 and 97.5 percentiles of the bootstrap distribution define the CI. The CI is typically wider than the IQR because it reflects sampling uncertainty in which benchmarks are in the pool, not just the spread of observed halving times.
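
A sketch of the pooled median and the benchmark-clustered bootstrap, using only the standard library. The 2,000 draws and the resample-benchmarks-not-intervals design come from the text. The function names, the dictionary layout, the fixed seed, and the simple index-based percentile are assumptions made for illustration.

```python
import random
import statistics

def pooled_median(intervals_by_benchmark):
    """Median over all halving intervals from all benchmarks."""
    pooled = [h for hs in intervals_by_benchmark.values() for h in hs]
    return statistics.median(pooled)

def clustered_bootstrap_ci(intervals_by_benchmark, draws=2000, seed=0):
    """95% CI on the pooled median, resampling whole benchmarks."""
    rng = random.Random(seed)
    names = list(intervals_by_benchmark)
    medians = []
    for _ in range(draws):
        # Sample benchmarks with replacement; each sampled benchmark
        # brings all of its intervals along.
        sample = rng.choices(names, k=len(names))
        pooled = [h for name in sample for h in intervals_by_benchmark[name]]
        if pooled:
            medians.append(statistics.median(pooled))
    medians.sort()
    # Approximate 2.5th and 97.5th percentiles by index.
    lo = medians[int(0.025 * (len(medians) - 1))]
    hi = medians[int(0.975 * (len(medians) - 1))]
    return lo, hi
```

Because whole benchmarks are resampled, a benchmark that contributes many intervals moves in or out of a draw as a unit, which is what keeps the interval from overstating precision.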

Ceiling modes. The primary analysis uses theoretical ceilings (100% or 1.0). A sensitivity analysis substitutes human-expert baselines where available. Benchmarks where expert baselines are below chance (VCT, ABC-Bench) are excluded in human-ceiling mode.
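
The two ceiling modes can be sketched as a small selection function. Everything here is an illustrative assumption: the mode labels, the function name, the fallback to the theoretical ceiling when no expert baseline exists, and the treatment of an expert baseline equal to chance as excluded (it would make the denominator zero). The sketch assumes a higher-is-better metric.

```python
def resolve_ceiling(mode, theoretical, human_expert, chance):
    """Return the ceiling to use, or None if the benchmark is excluded."""
    if mode == "theoretical":
        return theoretical
    # Human-ceiling mode.
    if human_expert is None:
        # Assumption: with no expert baseline, keep the theoretical ceiling.
        return theoretical
    if human_expert <= chance:
        # Expert baseline at or below chance: excluded in this mode.
        return None
    return human_expert
```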

Limitation: Halving time assumes locally exponential decay, which may not hold near saturation. Intervals where the remaining gap is below 0.1% are excluded to avoid extreme values from near-saturated benchmarks. Lower-is-better regression benchmarks with no meaningful chance baseline contribute zero halving intervals and appear on charts only.

Plateau Classification

Independently of the halving-time analysis, we classify TDC drug discovery benchmarks as plateau or active based on their post-2022 SOTA activity. A benchmark is on the plateau list if it has received at most one new SOTA in the window from January 2022 through the current date. Benchmarks with two or more post-2022 SOTAs are active.

This is stricter than a simple time-since-last-SOTA rule. Two benchmarks with the same last-SOTA date can land on opposite sides: tdc-cyp2c9i had a single 2023-10 improvement and is classified as plateau, while tdc-clmicro had both a 2023-02 and a 2023-10 improvement and is classified as active. The rule captures ongoing multi-year progress vs. isolated single updates.
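
The classification rule can be sketched as a count of SOTA improvements inside the window. The function name and signature are hypothetical, and the sketch assumes the input lists only the dates of new SOTA results, not a benchmark's initial baseline entry.

```python
from datetime import date

def classify_plateau(sota_dates, window_start=date(2022, 1, 1), today=None):
    """Return "plateau" for at most one in-window SOTA, otherwise "active"."""
    today = today or date.today()
    recent = [d for d in sota_dates if window_start <= d <= today]
    return "plateau" if len(recent) <= 1 else "active"

# The two examples from the text, using only their stated post-2022 dates.
cyp2c9i = classify_plateau([date(2023, 10, 1)], today=date(2026, 3, 1))
clmicro = classify_plateau(
    [date(2023, 2, 1), date(2023, 10, 1)], today=date(2026, 3, 1)
)
```

Here cyp2c9i evaluates to "plateau" and clmicro to "active", even though both share the same last-SOTA date.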

Applied across the 23 TDC ADMET benchmarks in src/data/featured.json, this rule identifies 7 as plateau and 16 as active. Supporting time series come from historical Wayback Machine snapshots of the TDC leaderboard (2021-06 through 2026-03), processed via scripts/06_acquire_tdc.py and scripts/07_normalize_tdc.py.

Date semantics. For TDC benchmarks, each SOTA score is dated to the publication date of the paper that reported the model, resolved via arXiv ID or CrossRef DOI query, not to the date the score was added to the leaderboard. For Papers With Code and Epoch AI entries, dates come from the respective aggregators' own date fields (typically paper publication date). In all cases, the date attached to a score is a lower bound on “capability existed by” rather than an exact evaluation-run timestamp.

Scope. The plateau list surfaced on the analysis page is filtered to TDC drug discovery benchmarks only. An identical rule applied to non-TDC PWC-sourced drug discovery benchmarks would surface five additional candidates (hiv-dataset, hiv-dti-77, ogbg-molhiv, ogbg-molpcba, lit-pcba-esr1ant). These are omitted from the headline finding to keep the analytical scope tight; a broader “drug discovery plateau” framing across both sources would include twelve benchmarks.

Note: tdc-cyp2d6s has only two SOTA envelope points (2016-09 and 2019-05) and is normally filtered out by the len(envelope) >= 3 threshold in scripts/07_normalize_tdc.py. It is included in the plateau set via a manual entry in src/data/featured.json, following the NCBI-disease and TS115 precedent from Stage E.

Evidence Provenance

Each featured benchmark can carry benchmark-level provenance and score-level provenance. Benchmark-level provenance answers a narrow question: what source establishes the benchmark identity, task framing, metric, or leaderboard context? Score-level provenance answers a stricter question: what source establishes the exact model/date/value row used in the timeline?

The current source categories are deliberately conservative. papers-with-code and tdc identify rows generated from the Papers With Code archive or Therapeutics Data Commons pipeline. epoch marks Epoch AI benchmark data. official-benchmark means an official benchmark page, paper, dataset page, or maintained leaderboard establishes the benchmark context. manual means the row or series is retained from repo-local curation rather than regenerated from a durable result table. mixed is reserved for inferred summaries where score rows come from more than one source family.

Confidence is about the evidence behind the displayed timeline, not about whether the benchmark is important. High means the row-level score and date are generated from a structured source such as an aggregator or leaderboard. Medium means the row is reconstructed from durable but imperfect evidence, such as TDC paper dates combined with Wayback leaderboard snapshots and live-score updates. Low means the benchmark context may be source-backed, but at least some retained score rows are still hand-curated.

This distinction is important: official benchmark context can coexist with low-confidence manual score rows. For example, an official benchmark page may establish the task, metric, human baseline, and leaderboard family while the exact historical model/date/value points in src/data/featured.json remain legacy curation. Those rows are annotated as manual entries until they are independently reconstructed from source tables.

The analysis page surfaces this contract as an evidence-confidence summary: high/medium/low benchmark counts plus score metadata coverage of annotated score rows over total score rows. That coverage number measures how much of the displayed timeline has explicit score-level provenance; it does not claim that every annotated row has been regenerated from an official table.
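
The summary described above can be sketched as a count of confidence levels plus a coverage ratio. The record layout is entirely hypothetical: the field names "confidence", "scores", and "provenance" are assumptions introduced for this example and may not match the structure of src/data/featured.json.

```python
from collections import Counter

def evidence_summary(benchmarks):
    """Count benchmarks per confidence level and compute score coverage.

    benchmarks: list of dicts with a "confidence" label and a "scores"
    list, where each score row may carry a "provenance" annotation.
    """
    counts = Counter(b["confidence"] for b in benchmarks)
    rows = [row for b in benchmarks for row in b["scores"]]
    annotated = sum(1 for row in rows if row.get("provenance"))
    return {
        "high": counts["high"],
        "medium": counts["medium"],
        "low": counts["low"],
        # Annotated score rows over total score rows.
        "coverage": annotated / len(rows) if rows else 0.0,
    }
```

The coverage value only says how many rows carry an annotation; a row annotated as manual still counts toward it.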

Data Sources

Benchmark scores, saturation dates, and metadata are compiled from the following publicly available sources.

| Source | Used for | URL |
| --- | --- | --- |
| Epoch AI Benchmarking Hub | Historical SOTA scores for GPQA Diamond, MMLU, MedQA, HLE | epoch.ai/benchmarks |
| Justen (2025) biology-benchmarks | 27-model evaluation data across 8 bio benchmarks (Nov 2022 - Apr 2025) | github.com/lennijusten/biology-benchmarks |
| CASP Prediction Center | GDT-TS and lDDT scores across 16 rounds (1994-2024) | predictioncenter.org |
| MedHELM / HELM | Clinical benchmark scores across 121 tasks | crfm.stanford.edu/helm/medhelm |
| HuggingFace | WMDP, MMLU, and various benchmark datasets | huggingface.co |
| Therapeutics Data Commons | Drug discovery leaderboards across 22 ADMET tasks. Historical time series (2021-2026) reconstructed from Wayback Machine snapshots; paper dates resolved via CrossRef/arXiv. | tdcommons.ai |
| ProteinGym | Protein fitness prediction baselines, 217 DMS assays, 90+ models | proteingym.org |
| Papers With Code archive | Historical SOTA data across 9,327 benchmarks (frozen July 2025) | github.com/paperswithcode/paperswithcode-data |
| HuggingFace Open LLM Leaderboard v1 (CoreyMorris archive) | Model tier analysis: 5,198 open-source models across 9 MMLU biology subtasks (mid-2023 to mid-2024). Data pulled from CoreyMorris/MMLU-by-task-Leaderboard, an archive of the original HF Open LLM Leaderboard v1. | huggingface.co (CoreyMorris archive) |
| Vals.ai | MedQA saturation tracking (archived as saturated) | vals.ai |
| BioASQ | Annual biomedical semantic QA competition results since 2013 | bioasq.org |
| BLURB | Biomedical NLP leaderboard (13 datasets, 6 task types) | microsoft.github.io/BLURB |
| Nucleotide Transformer Benchmark | Genomic prediction leaderboard (18 tasks) | HuggingFace Space (InstaDeepAI) |
| Scale AI / HLE Leaderboard | Humanity's Last Exam scores | labs.scale.com/leaderboard |
| SecureBio / VCT | Virology Capabilities Test results | virologytest.ai |
| Open Problems in Single-Cell Biology | Single-cell benchmark results | openproblems.bio |
| Open Graph Benchmark | Protein and molecular graph benchmark scores | ogb.stanford.edu |

References

  1. Mnih, V. et al. (2015). "Human-level control through deep reinforcement learning." Nature. doi:10.1038/nature14236
  2. Akhtar, M. et al. (2026). "When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation." arXiv:2602.16763
  3. Ott, S. et al. (2022). "Mapping global dynamics of benchmark creation and saturation in artificial intelligence." Nature Communications. doi:10.1038/s41467-022-34591-0
  4. Justen, L. (2025). "LLMs Outperform Experts on Challenging Biology Benchmarks." arXiv:2505.06108
  5. Dev, S. et al. (2025). "Toward Comprehensive Benchmarking of the Biological Knowledge of Frontier LLMs." RAND Corporation. RAND RRA3797-1
  6. Ho, A. & Berg, A. (2025). "Do the biorisk evaluations of AI labs actually measure the risk of developing bioweapons?" Epoch AI Gradient Updates. Epoch AI
  7. Götting, J. et al. (2025). "Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark." arXiv:2504.16137
  8. Li, N. et al. (2024). "The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning." ICML 2024. arXiv:2403.03218
  9. Rein, D. et al. (2023). "GPQA: A Graduate-Level Google-Proof Q&A Benchmark." ICLR 2024. arXiv:2311.12022
  10. Kiela, D. et al. (2021). "Dynabench: Rethinking Benchmarking in NLP." NAACL 2021. ACL Anthology
  11. Jumper, J. et al. (2021). "Highly Accurate Protein Structure Prediction with AlphaFold." Nature. doi:10.1038/s41586-021-03819-2
  12. Notin, P. et al. (2023). "ProteinGym: Large-Scale Benchmarks for Protein Fitness Prediction and Design." NeurIPS 2023 Datasets & Benchmarks. proteingym.org
  13. Hendrycks, D. et al. (2025). "Humanity's Last Exam." Nature. doi:10.1038/s41586-025-09962-4
  14. Liu, A.B. et al. (2025). "ABC-Bench: Agentic Bio-Capabilities Benchmark." NeurIPS 2025 BioSafe GenAI Workshop.