Meta-Analysis

Rate of Biology Benchmark Saturation

How fast are AI biology benchmarks becoming obsolete? Analysis across 73 benchmarks and 11 domains.

Since 2022, models halve the remaining performance gap on biology benchmarks every

~27 months
95% CI: 14–49 · IQR: 9–100 · n=96

All-time: ~71 months (n=206)

~94 months with human-expert ceilings

…and the rate is accelerating

3.4–9.1× faster
post-breakpoint vs pre-breakpoint

Tested at three candidate breakpoints: Dec 2020, Mar 2023, and Apr 2024

Based on 73 benchmarks across 11 domains.

Evidence confidence: High 33 · Medium 24 · Low 16 · Score metadata: 342/342

TDC series may mix paper dates, Wayback first-seen dates, and live leaderboard score updates.

Median halving by domain

Composition caveat: This acceleration may partly reflect composition — newer benchmarks tend to be LLM knowledge tests (MedQA, GPQA) that saturate faster by design, while older benchmarks test specialized prediction tasks (CASP, ProteinGym) with longer horizons.

The period around December 2020 includes the most dramatic single-benchmark improvement (CASP/AlphaFold2), but it would be a mistake to attribute the broader pattern to any one breakthrough.

Gap Closure Trajectories

How fast are benchmarks saturating?

Across 73 benchmarks, the median halving time is 26.5 months.

Domain Breakdown

Halving time by domain

Domains with 2+ benchmarks only

Median halving time in months, overall and before vs. after Jan 2022:

Science QA · Overall: 11.9 mo (IQR: 7.6–15.5) · After Jan 2022: 11.9 mo · 2 benchmarks
Biosecurity · Overall: 12.5 mo (IQR: 6.5–19.5) · After Jan 2022: 12.5 mo · 3 benchmarks
Genomics · Overall: 14.1 mo (IQR: 10.7–26.3) · After Jan 2022: 14.1 mo · 2 benchmarks
Medical Imaging · Overall: 30.4 mo (IQR: 25.1–296.4) · Before Jan 2022: 83.9 mo · After Jan 2022: 22.1 mo · 3 benchmarks
Clinical · Overall: 42.6 mo (IQR: 2.9–49.8) · Before Jan 2022: 42.6 mo · After Jan 2022: 25.4 mo · 3 benchmarks
Bio NLP · Overall: 81.7 mo (IQR: 12.7–151.2) · Before Jan 2022: 119.2 mo · After Jan 2022: 36.7 mo · 14 benchmarks
Drug Discovery · Overall: 102.4 mo (IQR: 40.0–248.4) · Before Jan 2022: 121.7 mo · After Jan 2022: 54.0 mo · 40 benchmarks
Protein · Overall: 365.9 mo (IQR: 303.1–544.8) · Before Jan 2022: 365.9 mo · 3 benchmarks

Eras with fewer than 2 halving intervals are omitted.

Science QA benchmarks saturate fastest with a median halving time of 11.9 months.
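The per-domain figures can be reproduced in outline as a grouped median. The following is a minimal sketch, not the analysis pipeline itself: the function name, the `(domain, era, halving_months)` tuple layout, and the sample values are hypothetical, and how an interval is assigned to an era is left to the caller. The only rule taken from the text is that eras with fewer than 2 halving intervals are omitted.

```python
from collections import defaultdict
from statistics import median


def domain_era_medians(intervals, min_intervals=2):
    """Median halving time per (domain, era).

    `intervals` is an iterable of (domain, era, halving_months) tuples.
    Groups with fewer than `min_intervals` entries are dropped, mirroring
    the "fewer than 2 halving intervals are omitted" rule.
    """
    groups = defaultdict(list)
    for domain, era, halving_months in intervals:
        groups[(domain, era)].append(halving_months)
    return {
        key: median(values)
        for key, values in groups.items()
        if len(values) >= min_intervals
    }


# Hypothetical halving intervals, for illustration only.
sample = [
    ("Science QA", "after", 8.0),
    ("Science QA", "after", 16.0),
    ("Protein", "before", 300.0),
    ("Protein", "before", 400.0),
    ("Protein", "after", 50.0),  # single interval: this era is omitted
]
medians = domain_era_medians(sample)
print(medians)
```

In this sketch the lone post-2022 Protein interval is dropped, which is why a domain can appear with only one era in the breakdown.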

Drug Discovery Plateau

7 of 23 TDC drug discovery benchmarks have plateaued since 2022

tdc-cyp2d6s
last 2019-05 · 84 mo stalled
tdc-cyp3a4s
last 2022-04 · 49 mo stalled
tdc-bindingdb
last 2023-06 · 35 mo stalled
tdc-cyp2c9i
last 2023-10 · 31 mo stalled
tdc-cyp2d6i
last 2023-10 · 31 mo stalled
tdc-cyp3a4i
last 2023-10 · 31 mo stalled
tdc-lipo
last 2024-01 · 28 mo stalled
Timeline axis: 2016 · 2020 · 2024 · 2026
Legend: active SOTA span · stalled (no new SOTA) · last SOTA

These benchmarks have received at most one new SOTA since January 2022, while the TDC leaderboard remained continuously active throughout. The other 16 TDC benchmarks all saw two or more post-2022 SOTA improvements. Drug Discovery has the slowest post-2022 halving time of any domain shown above (54.0 months); these 7 benchmarks are the extreme tail of that slowdown.

Scope note: This chart is restricted to TDC drug discovery benchmarks. The same rule applied to non-TDC PWC-sourced drug discovery benchmarks would add five candidates (hiv-dataset, hiv-dti-77, ogbg-molhiv, ogbg-molpcba, lit-pcba-esr1ant) for a broader total of twelve plateau benchmarks. See Methodology.
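The plateau rule described above (at most one new SOTA since January 2022) and the "months stalled" figures can be sketched as follows. This is an illustrative reading, not the page's actual code: the function names are hypothetical, the as-of date of May 2026 is inferred from the listed stall durations rather than stated in the text, and the earlier SOTA date in the example is made up.

```python
from datetime import date


def months_between(start, end):
    """Whole calendar months from `start` to `end` (day of month ignored)."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def is_plateaued(sota_dates, cutoff=date(2022, 1, 1), max_new=1):
    """True if at most `max_new` SOTA records fall on or after `cutoff`."""
    return sum(d >= cutoff for d in sota_dates) <= max_new


def stalled_months(sota_dates, as_of):
    """Months elapsed since the most recent SOTA record."""
    return months_between(max(sota_dates), as_of)


# Assumed as-of date, inferred from the chart (e.g. 2019-05 plus 84 months).
AS_OF = date(2026, 5, 1)

# Last SOTA for tdc-cyp2d6s is from the chart; the 2018 entry is hypothetical.
cyp2d6s = [date(2018, 3, 1), date(2019, 5, 1)]
print(is_plateaued(cyp2d6s), stalled_months(cyp2d6s, AS_OF))
```

Under the assumed as-of date, the same arithmetic gives 28 months for a last SOTA in 2024-01, matching the tdc-lipo row.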

What this implies

Policy and evaluation strategy implications

Evaluation cadence. At current gap-halving rates, a benchmark introduced today has an effective discriminative life of roughly four years before frontier models cluster within evaluation noise. For dual-use-bio evaluations that serve as capability gates (WMDP-Bio, VCT, and their successors), this implies that evaluation programs need to commission replacement benchmarks 18–24 months before current ones saturate, not after. Current practice is closer to 0–6 months of lead time. The gap is not a matter of urgency; it is a matter of procurement.
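The arithmetic behind a statement like this can be sketched under a strong assumption: that the remaining gap keeps shrinking exponentially at a constant halving time. The helper names, the starting gap, and the noise threshold below are hypothetical; only the ~27-month halving time comes from the headline figure.

```python
import math


def remaining_gap_fraction(elapsed_months, halving_months):
    """Fraction of the initial gap left after `elapsed_months`,
    assuming constant exponential closure."""
    return 0.5 ** (elapsed_months / halving_months)


def months_to_reach(gap_now, gap_target, halving_months):
    """Months until the remaining gap shrinks from `gap_now` to `gap_target`."""
    return halving_months * math.log2(gap_now / gap_target)


HALVING = 27.0  # months, from the post-2022 headline estimate

# After four years, roughly 29% of the initial gap would remain.
print(round(remaining_gap_fraction(48, HALVING), 2))

# Hypothetical benchmark: 20% gap today, treated as saturated at 5%.
print(months_to_reach(0.20, 0.05, HALVING))
```

A planner could subtract the desired 18–24 months of lead time from the second figure to get a commissioning date, though the wide IQR on halving times means any single projection is rough.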

Ceiling Sensitivity

Does the ceiling assumption matter?

With human-expert ceilings, the median halving time shifts from ~71 to ~94 months (32% increase). Benchmarks where models have already exceeded expert performance (GPQA Diamond, WMDP-Bio, MedQA) show faster gap closure under theoretical ceilings.

Model Accessibility

Open-source model tiers on biology tasks

Among open-source models, larger models (≥30B parameters) score 8.7 percentage points higher than locally-runnable models (<30B) on biology knowledge tasks. The gap is widest on College Biology (15.3 pp) and narrowest on Medical Genetics (5.0 pp).

Based on 5,198 open-source models from HuggingFace Open LLM Leaderboard v1 (mid-2023 to mid-2024). Proprietary models (GPT-4, Claude, Gemini) are not included.

Methodology

How we compute this

Gap closure

gc = (score − chance) / (ceiling − chance), using the SOTA envelope (maximum score seen so far) for monotonicity. Halving time = months × ln(2) / ln(gap_before / gap_after) per interval, pooled via median.
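As a rough illustration of the formulas above, here is a minimal sketch. Several choices are assumptions rather than the documented method: the remaining gap is taken as 1 − gc, intervals are measured between successive SOTA improvements (so flat stretches of the envelope extend an interval instead of being counted separately), and the sample series is invented.

```python
import math
from statistics import median


def gap_closure(score, chance, ceiling):
    """Fraction of the chance-to-ceiling range that `score` has closed."""
    return (score - chance) / (ceiling - chance)


def halving_times(months, scores, chance, ceiling):
    """Halving time (months) for each interval between SOTA improvements."""
    points = []  # (month, remaining gap) at each new best score
    best = float("-inf")
    for month, score in zip(months, scores):
        if score > best:  # SOTA envelope: keep only new maxima
            best = score
            points.append((month, 1.0 - gap_closure(score, chance, ceiling)))
    times = []
    for (m0, g0), (m1, g1) in zip(points, points[1:]):
        if g1 <= 0:
            break  # gap fully closed; the log ratio is undefined
        times.append((m1 - m0) * math.log(2) / math.log(g0 / g1))
    return times


# Hypothetical benchmark: 4-way multiple choice (chance = 0.25), ceiling 1.0.
months = [0, 12, 24, 36]
scores = [0.40, 0.70, 0.65, 0.85]  # the dip at month 24 is not a new SOTA
times = halving_times(months, scores, chance=0.25, ceiling=1.0)
pooled = median(times)
print(times, pooled)
```

Here the remaining gap goes 0.8 → 0.4 over 12 months and 0.4 → 0.2 over the next 24, giving interval halving times of about 12 and 24 months and a pooled median of about 18.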

Ceiling approach

Theoretical maximum (100% or 1.0) as the primary ceiling. Human-expert baselines as a sensitivity check. Scores exceeding the ceiling are capped at 100% gap closure.
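The effect of the ceiling choice, including the cap at 100% gap closure, can be sketched as below. The function name, the score, and the human-expert baseline of 0.85 are hypothetical values chosen for illustration.

```python
import math


def capped_gap_closure(score, chance, ceiling):
    """Gap closure relative to `ceiling`, capped at 1.0 (100%)."""
    gc = (score - chance) / (ceiling - chance)
    return min(gc, 1.0)


score, chance = 0.90, 0.25

# Primary analysis: theoretical maximum as the ceiling.
theoretical = capped_gap_closure(score, chance, ceiling=1.0)

# Sensitivity check: hypothetical human-expert baseline as the ceiling.
expert = capped_gap_closure(score, chance, ceiling=0.85)

print(round(theoretical, 3), expert)
```

In this example the model is still about 13% short of the theoretical ceiling but already counts as fully closed against the expert ceiling, so it would stop contributing further gap reduction under the expert-ceiling mode.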

Key caveats

  • SOTA-only: We track the best model at each time point, not the average model.
  • Composition: 73 benchmarks span 1994–2026. Drug Discovery is the largest domain (~55%), reflecting data availability from TDC and MoleculeNet.
  • Heterogeneity: The IQR (9.0–99.6 months) is wide; the median is a summary, not a universal rate.
  • Metric diversity: Accuracy, F1, ROC-AUC, Spearman ρ, and MCC are normalized but measure fundamentally different things.
  • Task diversity: The 73 benchmarks include knowledge QA (MedQA, GPQA), molecular property prediction (BBBP, Tox21), and specialized prediction tasks (ProteinGym, NT-18). These measure fundamentally different capabilities and saturate at different rates.
  • Pool composition: 34 of 73 featured benchmarks contribute zero halving intervals to the post-2022 pool. Most are regression benchmarks with no meaningful chance baseline, or benchmarks with fewer than two post-window SOTA updates. The headline median is pooled over the remaining 39.
  • Data quality: Some PWC-sourced scores (notably MUV at 99.8% ROC-AUC) may reflect evaluation protocol differences rather than genuine capability. All data is sourced from aggregators as-is.

References

Ott et al. 2022 (Nature Comms) · Epoch AI / Ho et al. 2025 · Akhtar et al. 2026 · HNS lineage (Mnih et al. 2015)