Meta-Analysis

Rate of Biology Benchmark Saturation

How fast are AI biology benchmarks becoming obsolete? Analysis across 73 benchmarks and 11 domains.

Since 2022, models halve the remaining performance gap on biology benchmarks every

~27 months

95% CI: 14–49 · IQR: 9–100 · n=96

All-time: ~71 months (n=206)

~94 months with human-expert ceilings

…and the rate is accelerating

3.4–9.1× faster

post-breakpoint vs pre-breakpoint

Tested with breakpoints at Dec 2020, Mar 2023, and Apr 2024

Based on 73 benchmarks across 11 domains.

Evidence confidence: High 33 · Medium 24 · Low 16 · Score metadata: 342/342

TDC series may mix paper dates, Wayback first-seen dates, and live leaderboard score updates.

Median halving by domain

Composition caveat: This acceleration may partly reflect composition — newer benchmarks tend to be LLM knowledge tests (MedQA, GPQA) that saturate faster by design, while older benchmarks test specialized prediction tasks (CASP, ProteinGym) with longer horizons.

The period around December 2020 includes the most dramatic single-benchmark improvement (CASP/AlphaFold2), but it would be a mistake to attribute the broader pattern to any one breakthrough.

Ceiling assumption

Gap Closure Trajectories

How fast are benchmarks saturating?

2010 to 2026

Across 73 benchmarks, the median halving time is 26.5 months.

Domain Breakdown

Halving time by domain

Domains with 2+ benchmarks only

← Before Jan 2022 (months)

After Jan 2022 (months) →

Science QA

Overall: 11.9 mo (IQR: 7.6–15.5)

After Jan 2022: 11.9 mo

2 benchmarks

Biosecurity

Overall: 12.5 mo (IQR: 6.5–19.5)

After Jan 2022: 12.5 mo

3 benchmarks

Genomics

Overall: 14.1 mo (IQR: 10.7–26.3)

After Jan 2022: 14.1 mo

2 benchmarks

Medical Imaging

Overall: 30.4 mo (IQR: 25.1–296.4)

Before Jan 2022: 83.9 mo

After Jan 2022: 22.1 mo

3 benchmarks

Clinical

Overall: 42.6 mo (IQR: 2.9–49.8)

Before Jan 2022: 42.6 mo

After Jan 2022: 25.4 mo

3 benchmarks

Bio NLP

Overall: 81.7 mo (IQR: 12.7–151.2)

Before Jan 2022: 119.2 mo

After Jan 2022: 36.7 mo

14 benchmarks

Drug Discovery

Overall: 102.4 mo (IQR: 40.0–248.4)

Before Jan 2022: 121.7 mo

After Jan 2022: 54.0 mo

40 benchmarks

Protein

Overall: 365.9 mo (IQR: 303.1–544.8)

Before Jan 2022: 365.9 mo

3 benchmarks

Eras with fewer than 2 halving intervals are omitted.
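
One way the before/after split could be computed is sketched below in Python. It assumes each halving interval is assigned to an era by its end date, which this page does not specify; the function name and the sample intervals are hypothetical. Eras with fewer than two intervals return None, mirroring the omission rule above.

```python
from datetime import date
from statistics import median

CUTOFF = date(2022, 1, 1)  # era boundary used in the chart

def era_medians(intervals, cutoff=CUTOFF, min_intervals=2):
    """Median halving time per era; intervals is a list of (end_date, halving_months)."""
    before = [h for end, h in intervals if end < cutoff]
    after = [h for end, h in intervals if end >= cutoff]

    def summarize(values):
        # Omit (None) any era with too few intervals to support a median.
        return median(values) if len(values) >= min_intervals else None

    return {"before": summarize(before), "after": summarize(after)}

# Hypothetical intervals for a single domain (not real data).
sample = [
    (date(2019, 6, 1), 80.0),
    (date(2021, 2, 1), 90.0),
    (date(2023, 4, 1), 20.0),
]
print(era_medians(sample))
```

In this sample the post-2022 era has a single interval, so only the pre-2022 median is reported.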

Science QA benchmarks saturate fastest with a median halving time of 11.9 months.

Drug Discovery Plateau

7 of 23 TDC drug discovery benchmarks have plateaued since 2022

tdc-cyp2d6s

last 2019-05 · 84 mo stalled

tdc-cyp3a4s

last 2022-04 · 49 mo stalled

tdc-bindingdb

last 2023-06 · 35 mo stalled

tdc-cyp2c9i

last 2023-10 · 31 mo stalled

tdc-cyp2d6i

last 2023-10 · 31 mo stalled

tdc-cyp3a4i

last 2023-10 · 31 mo stalled

tdc-lipo

last 2024-01 · 28 mo stalled

Timeline chart, 2016 to 2026. Legend: active SOTA span · stalled (no new SOTA) · last SOTA

These benchmarks have received at most one new SOTA since January 2022, while the TDC leaderboard remained continuously active throughout. The other 16 TDC benchmarks all saw two or more post-2022 SOTA improvements. Drug discovery is the slowest-improving domain overall (shown above), and these 7 benchmarks are the extreme tail of that slowdown.
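
The plateau rule (at most one new SOTA since January 2022) can be sketched in Python as follows. The record format, the function names, and the second example benchmark are hypothetical. The reference month of 2026-05 is an assumption inferred from the stalled durations shown in the chart, and only the last-SOTA month for tdc-cyp2d6s (2019-05) is taken from it.

```python
from datetime import date

CUTOFF = date(2022, 1, 1)

def months_between(start, end):
    """Whole calendar months from start to end."""
    return (end.year - start.year) * 12 + (end.month - start.month)

def plateau_status(sota_dates, as_of, cutoff=CUTOFF, max_new=1):
    """Return (is_plateaued, months_since_last_sota) for one benchmark."""
    post_cutoff = [d for d in sota_dates if d >= cutoff]
    last = max(sota_dates)
    return len(post_cutoff) <= max_new, months_between(last, as_of)

as_of = date(2026, 5, 1)  # assumed reference month, not stated on the page
benchmarks = {
    # Only the last SOTA month is known from the chart; earlier records omitted.
    "tdc-cyp2d6s": [date(2019, 5, 1)],
    # Hypothetical benchmark with two post-2022 SOTA updates.
    "hypothetical-active": [date(2021, 3, 1), date(2023, 2, 1), date(2025, 7, 1)],
}
for name, dates in benchmarks.items():
    print(name, plateau_status(dates, as_of))
```

Under these assumptions tdc-cyp2d6s is flagged as plateaued with 84 months stalled, matching the chart, while the hypothetical benchmark with two post-2022 updates is not flagged.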

Scope note: This chart is restricted to TDC drug discovery benchmarks. The same rule applied to non-TDC PWC-sourced drug discovery benchmarks would add five candidates (hiv-dataset, hiv-dti-77, ogbg-molhiv, ogbg-molpcba, lit-pcba-esr1ant) for a broader total of twelve plateau benchmarks. See Methodology.

What this implies

Policy and evaluation strategy implications

Evaluation cadence. At current gap-halving rates, a benchmark introduced today has an effective discriminative life of roughly four years before frontier models cluster within evaluation noise. For dual-use-bio evaluations that serve as capability gates (WMDP-Bio, VCT, and their successors), this implies that evaluation programs need to commission replacement benchmarks 18–24 months before current ones saturate, not after. Current practice is closer to 0–6 months of lead time. The gap is not a matter of urgency; it is a matter of procurement.

Ceiling Sensitivity

Does the ceiling assumption matter?

With human-expert ceilings, the median halving time shifts from ~71 to ~94 months (32% increase). Benchmarks where models have already exceeded expert performance (GPQA Diamond, WMDP-Bio, MedQA) show faster gap closure under theoretical ceilings. Use the toggle above to compare both modes across all charts.

Model Accessibility

Open-source model tiers on biology tasks

Among open-source models, larger models (≥30B parameters) score 8.7 percentage points higher than locally-runnable models (<30B) on biology knowledge tasks. The gap is widest on College Biology (15.3 pp) and narrowest on Medical Genetics (5.0 pp).

Based on 5,198 open-source models from the Hugging Face Open LLM Leaderboard v1 (mid-2023 to mid-2024). Proprietary models (GPT-4, Claude, Gemini) are not included.

Methodology

How we compute this

Gap closure

gc = (score − chance) / (ceiling − chance), using the SOTA envelope (maximum score seen so far) for monotonicity. The remaining gap is 1 − gc. Halving time = months × ln(2) / ln(gap_before / gap_after) per interval, pooled via median.
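
As a minimal sketch of the two formulas above, the following Python builds the SOTA envelope, computes per-interval halving times, and pools them with a median. It takes the remaining gap as 1 − gc, measures each interval in months between consecutive envelope records, and skips intervals where the gap has already reached zero; these choices, the function names, and the example scores are assumptions for illustration, not the site's actual pipeline.

```python
import math
from statistics import median

def gap_closure(score, chance, ceiling):
    """Fraction of the chance-to-ceiling range that the score has closed."""
    return (score - chance) / (ceiling - chance)

def sota_envelope(points):
    """Keep only (month, score) records that beat every earlier score."""
    best, envelope = -math.inf, []
    for month, score in sorted(points):
        if score > best:
            best = score
            envelope.append((month, score))
    return envelope

def halving_times(points, chance, ceiling):
    """Halving time in months for each interval between consecutive SOTA records."""
    env = sota_envelope(points)
    times = []
    for (m0, s0), (m1, s1) in zip(env, env[1:]):
        gap_before = 1.0 - gap_closure(s0, chance, ceiling)
        gap_after = 1.0 - gap_closure(s1, chance, ceiling)
        if gap_after <= 0.0 or m1 == m0:
            continue  # gap already closed, or zero-length interval: undefined
        times.append((m1 - m0) * math.log(2) / math.log(gap_before / gap_after))
    return times

# Hypothetical series: (months since first result, score). The 0.70 entry is
# below the running maximum, so the envelope drops it.
points = [(0, 0.5), (12, 0.75), (30, 0.70), (36, 0.875)]
times = halving_times(points, chance=0.0, ceiling=1.0)
print(times, median(times))
```

In this example the gap halves once over the first 12 months and once over the next 24, so the per-interval halving times are about 12 and 24 months and the pooled median is about 18.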

Ceiling approach

Theoretical maximum (100% or 1.0) as the primary ceiling. Human-expert baselines as a sensitivity check. Scores exceeding the ceiling are capped at 100% gap closure.
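
The effect of the ceiling choice and the cap can be illustrated with a short Python sketch. The score, chance level, and human-expert baseline below are hypothetical values chosen only to show a case where a model exceeds the expert ceiling.

```python
def capped_gap_closure(score, chance, ceiling):
    """Gap closure, capped at 1.0 when the score exceeds the ceiling."""
    gc = (score - chance) / (ceiling - chance)
    return min(gc, 1.0)

score, chance = 0.90, 0.25   # hypothetical 4-option multiple-choice benchmark
expert_baseline = 0.85       # hypothetical human-expert score

theoretical = capped_gap_closure(score, chance, ceiling=1.0)
human_expert = capped_gap_closure(score, chance, ceiling=expert_baseline)
print(f"theoretical ceiling: {theoretical:.3f}")
print(f"human-expert ceiling: {human_expert:.3f}")
```

With the theoretical ceiling the model still has a gap left to close, while under the human-expert ceiling the same score is treated as fully saturated.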

Key caveats

SOTA-only: We track the best model at each time point, not the average model.
Composition: 73 benchmarks span 1994–2026. Drug Discovery is the largest domain (~55%), reflecting data availability from TDC and MoleculeNet.
Heterogeneity: The IQR (9.0–99.6 months) is wide; the median is a summary, not a universal rate.
Metric diversity: Accuracy, F1, ROC-AUC, Spearman ρ, and MCC are normalized but measure fundamentally different things.
Task diversity: The 73 benchmarks include knowledge QA (MedQA, GPQA), molecular property prediction (BBBP, Tox21), and specialized prediction tasks (ProteinGym, NT-18). These measure fundamentally different capabilities and saturate at different rates.
Pool composition: 34 of 73 featured benchmarks contribute zero halving intervals to the post-2022 pool. Most are regression benchmarks with no meaningful chance baseline, or benchmarks with fewer than two post-window SOTA updates. The headline median is pooled over the remaining 39.
Data quality: Some PWC-sourced scores (notably MUV at 99.8% ROC-AUC) may reflect evaluation protocol differences rather than genuine capability. All data is sourced from aggregators as-is.

References

Ott et al. 2022 (Nature Comms) · Epoch AI / Ho et al. 2025 · Akhtar et al. 2026 · HNS lineage (Mnih et al. 2015)