Meta-Analysis
How fast are AI biology benchmarks becoming obsolete? Analysis across 73 benchmarks and 11 domains.
Since 2022, models halve the remaining performance gap on biology benchmarks every
All-time: ~71 months (n=206)
~94 months with human-expert ceilings
…and the rate is accelerating
Tested across Dec 2020, Mar 2023, Apr 2024
Based on 73 benchmarks across 11 domains.
TDC series may mix paper dates, Wayback first-seen dates, and live leaderboard score updates.
Median halving by domain
Composition caveat: This acceleration may partly reflect composition — newer benchmarks tend to be LLM knowledge tests (MedQA, GPQA) that saturate faster by design, while older benchmarks test specialized prediction tasks (CASP, ProteinGym) with longer horizons.
The period around December 2020 includes the most dramatic single-benchmark improvement (CASP/AlphaFold2), but it would be a mistake to attribute the broader pattern to any one breakthrough.
Ceiling assumption
Gap Closure Trajectories
Across 73 benchmarks, the median halving time is 26.5 months.
Domain Breakdown
Domains with 2+ benchmarks only
Science QA
Overall: 11.9 mo (IQR: 7.6–15.5)
After Jan 2022: 11.9 mo
2 benchmarks
Biosecurity
Overall: 12.5 mo (IQR: 6.5–19.5)
After Jan 2022: 12.5 mo
3 benchmarks
Genomics
Overall: 14.1 mo (IQR: 10.7–26.3)
After Jan 2022: 14.1 mo
2 benchmarks
Medical Imaging
Overall: 30.4 mo (IQR: 25.1–296.4)
Before Jan 2022: 83.9 mo
After Jan 2022: 22.1 mo
3 benchmarks
Clinical
Overall: 42.6 mo (IQR: 2.9–49.8)
Before Jan 2022: 42.6 mo
After Jan 2022: 25.4 mo
3 benchmarks
Bio NLP
Overall: 81.7 mo (IQR: 12.7–151.2)
Before Jan 2022: 119.2 mo
After Jan 2022: 36.7 mo
14 benchmarks
Drug Discovery
Overall: 102.4 mo (IQR: 40.0–248.4)
Before Jan 2022: 121.7 mo
After Jan 2022: 54.0 mo
40 benchmarks
Protein
Overall: 365.9 mo (IQR: 303.1–544.8)
Before Jan 2022: 365.9 mo
3 benchmarks
Eras with fewer than 2 halving intervals are omitted.
Science QA benchmarks saturate fastest with a median halving time of 11.9 months.
Drug Discovery Plateau
These benchmarks have received at most one new SOTA since January 2022, while the TDC leaderboard remained continuously active throughout. The other 16 TDC benchmarks all saw two or more post-2022 SOTA improvements. Among the domains above with post-2022 data, drug discovery is the slowest-improving, and these 7 benchmarks are the extreme tail of that slowdown.
Scope note: This chart is restricted to TDC drug discovery benchmarks. The same rule applied to non-TDC PWC-sourced drug discovery benchmarks would add five candidates (hiv-dataset, hiv-dti-77, ogbg-molhiv, ogbg-molpcba, lit-pcba-esr1ant) for a broader total of twelve plateau benchmarks. See Methodology.
What this implies
Evaluation cadence. At current gap-halving rates, a benchmark introduced today has an effective discriminative life of roughly four years before frontier models cluster within evaluation noise. For dual-use-bio evaluations that serve as capability gates (WMDP-Bio, VCT, and their successors), this implies that evaluation programs need to commission replacement benchmarks 18–24 months before current ones saturate, not after. Current practice is closer to 0–6 months of lead time. The gap is not a matter of urgency; it is a matter of procurement.
Ceiling Sensitivity
With human-expert ceilings, the median halving time shifts from ~71 to ~94 months (a 32% increase). Benchmarks where models have already exceeded expert performance (GPQA Diamond, WMDP-Bio, MedQA) show faster gap closure under theoretical ceilings.
Model Accessibility
Among open-source models, larger models (≥30B parameters) score 8.7 percentage points higher than locally-runnable models (<30B) on biology knowledge tasks. The gap is widest on College Biology (15.3 pp) and narrowest on Medical Genetics (5.0 pp).
Based on 5,198 open-source models from HuggingFace Open LLM Leaderboard v1 (mid-2023 to mid-2024). Proprietary models (GPT-4, Claude, Gemini) are not included.
Methodology
Gap closure: gc = (score − chance) / (ceiling − chance), computed on the SOTA envelope (the maximum score seen so far) so that the series is monotonic. The remaining gap is 1 − gc. Halving time = months × ln(2) / ln(gap_before / gap_after) per interval, pooled via the median.
Ceiling: the theoretical maximum (100% or 1.0) is the primary ceiling; human-expert baselines serve as a sensitivity check. Scores exceeding the ceiling are capped at 100% gap closure.
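A minimal Python sketch of the calculation described above, for illustration only. The function names (gap_closure, sota_envelope, halving_times) are hypothetical, and two details are assumptions not stated in the source: an interval is taken to run between successive SOTA improvements, and an interval that ends at full saturation (zero remaining gap) is dropped because its halving time is undefined.

```python
import math
from statistics import median


def gap_closure(score, chance, ceiling):
    """Fraction of the chance-to-ceiling gap closed, capped at 1.0 (100%)."""
    gc = (score - chance) / (ceiling - chance)
    return min(gc, 1.0)


def sota_envelope(scores):
    """Running maximum of the scores, so the series never decreases."""
    best = float("-inf")
    envelope = []
    for s in scores:
        best = max(best, s)
        envelope.append(best)
    return envelope


def halving_times(months, scores, chance, ceiling):
    """Halving time in months for each interval between SOTA improvements.

    Assumption: intervals span successive improvements of the envelope;
    entries that set no new SOTA are skipped rather than counted.
    """
    envelope = sota_envelope(scores)
    gaps = [1.0 - gap_closure(s, chance, ceiling) for s in envelope]
    times = []
    prev_month, prev_gap = months[0], gaps[0]
    for month, gap in zip(months[1:], gaps[1:]):
        if gap >= prev_gap:
            continue  # no new SOTA at this date
        if gap <= 0.0:
            break  # saturated: gap_before / gap_after is undefined
        elapsed = month - prev_month
        times.append(elapsed * math.log(2) / math.log(prev_gap / gap))
        prev_month, prev_gap = month, gap
    return times


# Made-up series: the remaining gap goes 0.5 -> 0.25 -> 0.125 over 12-month steps.
example = halving_times([0, 12, 24], [0.5, 0.75, 0.875], chance=0.0, ceiling=1.0)
pooled = median(example)  # pooled via median, as in the text
```

In this made-up series the gap halves once per 12-month interval, so each interval's halving time and the pooled median are 12 months. Swapping the ceiling argument for a human-expert baseline gives the sensitivity-check variant.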
Ott et al. 2022 (Nature Comms) · Epoch AI / Ho et al. 2025 · Akhtar et al. 2026 · HNS lineage (Mnih et al. 2015)