Full Catalog

Benchmarks

Every AI biology and biomedical benchmark we track, with saturation status, scores, and successor information.

222 of 222 benchmarks
| Name | Domain | Year | Type | Status | Time to Saturation | Best Score | Human Baseline | Successor |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MoleculeNet | Molecular/Drug Discovery | 2018 | Classification/Regression | Nearing | 96 mo | AUROC 0.85-0.95 |  | TDC, WelQrate |
| TDC ADMET (22 leaderboards) | Molecular/Drug Discovery | 2021 | Classification/Regression | Active |  | Varies by task |  | TDC-2 |
| TDC DrugCombo | Molecular/Drug Discovery | 2021 | Regression | Active |  | Varies |  |  |
| TDC Docking | Molecular/Drug Discovery | 2021 | Generation | Active |  | Varies |  |  |
| TDC DTI DG Group | Molecular/Drug Discovery | 2021 | Regression | Active |  | Varies |  |  |
| TDC Clinical Trial Outcome | Molecular/Drug Discovery | 2023 | Classification | Active |  | Varies |  |  |
| TDC-2 (Single-cell DTI) | Molecular/Drug Discovery | 2024 | Classification | Active |  | Varies |  |  |
| TDC-2 Protein-Peptide Binding | Molecular/Drug Discovery | 2024 | Regression | Active |  | Varies |  |  |
| GuacaMol | Molecular/Drug Discovery | 2019 | Generation | Saturated | 24 mo | Near-perfect on simple goals |  | PMO, TARTARUS, DrugGym |
| MOSES | Molecular/Drug Discovery | 2019 | Generation | Saturated | 48 mo | High validity/uniqueness |  | MolScore, TARTARUS |
| PMO | Molecular/Drug Discovery | 2022 | Optimization | Active |  | Varies |  |  |
| TARTARUS | Molecular/Drug Discovery | 2023 | Generation/RL | Active |  | Low success on hard objectives |  |  |
| DrugGym | Molecular/Drug Discovery | 2024 | Agent Workflow | Active |  | N/A (simulator) |  |  |
| DOCKSTRING | Molecular/Drug Discovery | 2022 | Regression/Generation | Active |  | Varies |  |  |
| MolScore | Molecular/Drug Discovery | 2024 | Meta-framework | Unknown |  | N/A |  |  |
| MolGenBench | Molecular/Drug Discovery | 2025 | Structure-based Generation | Active |  | Varies |  |  |
| ToxiMol | Molecular/Drug Discovery | 2025 | Generation | Active |  | Low success rates |  |  |
| Tox21 | Molecular/Drug Discovery | 2014 | Classification | Unknown |  | AUC ~0.85 |  |  |
| Polaris | Molecular/Drug Discovery | 2024 | Classification/Regression | Active |  | Varies (blind test) |  |  |
| WelQrate | Molecular/Drug Discovery | 2024 | Classification | Active |  | Varies |  |  |
| PharmaBench | Molecular/Drug Discovery | 2024 | Classification/Regression | Active |  | Varies |  |  |
| TOXRIC | Molecular/Drug Discovery | 2023 | Classification | Active |  | Varies |  |  |
| ZINC | Molecular/Drug Discovery | 2015 | Regression/Generation | Active |  | Varies |  | ZINC20, ZINC-22 |
| PDBbind | Molecular/Drug Discovery | 2004 | Regression | Nearing | 264 mo | Pearson R >0.86 (standard); <0.60 (CleanSplit) |  | PDBbind CleanSplit, PLINDER |
| PDBbind CleanSplit | Molecular/Drug Discovery | 2025 | Regression | Active |  | Varies (leak-free) |  |  |
| PLINDER | Molecular/Drug Discovery | 2024 | Docking/Regression | Active |  | DiffDock drops 38%->15% with leak control |  |  |
| CASF-2016 | Molecular/Drug Discovery | 2016 | Scoring/Ranking/Docking | Nearing | 120 mo | Pearson R ~0.86 |  | PDBbind CleanSplit |
| DAVIS | Molecular/Drug Discovery | 2011 | Regression | Nearing | 180 mo | CI ~0.903 |  |  |
| KIBA | Molecular/Drug Discovery | 2014 | Regression | Active |  | CI ~0.898 |  |  |
| CPI2M | Molecular/Drug Discovery | 2024 | Regression | Active |  | Baseline |  |  |
| PoseBusters | Molecular/Drug Discovery | 2023 | Pose Validation | Active |  | ~75% PB-Valid |  |  |
| PoseBench | Molecular/Drug Discovery | 2024 | Pose Prediction | Active |  | ~68% (PB); ~33% (DockGen-E) |  |  |
| DockGen | Molecular/Drug Discovery | 2024 | Pose Prediction | Active |  | ~24-33% top-1 RMSD<2Å |  |  |
| PoseX | Molecular/Drug Discovery | 2025 | Pose Prediction | Active |  | AI > physics-based (post-relaxation) |  |  |
| LIT-PCBA | Molecular/Drug Discovery | 2020 | Virtual Screening | Active |  | Varies |  | MF-PCBA |
| DUD-E | Molecular/Drug Discovery | 2012 | Virtual Screening | Saturated | 96 mo | Biased high AUC |  | LIT-PCBA, MF-PCBA, WelQrate |
| MF-PCBA | Molecular/Drug Discovery | 2024 | Virtual Screening | Active |  | Baselines established |  |  |
| CrossDocked2020 | Molecular/Drug Discovery | 2020 | Scoring/Pose | Active |  | R 0.612 |  |  |
| USPTO-50K | Molecular/Drug Discovery | 2016 | Retrosynthesis | Nearing | 120 mo | 65% top-1 | ~48.2% avg (forward) |  |
| USPTO-MIT | Molecular/Drug Discovery | 2017 | Reaction Prediction | Nearing | 108 mo | >90% top-1 (forward) |  |  |
| Schneider 50K | Molecular/Drug Discovery | 2015 | Reaction Classification | Saturated | 72 mo | >99% (RXNFP) |  | USPTO 1K TPL |
| TOMG-Bench | Molecular/Drug Discovery | 2024 | LLM Generation | Active |  | Fine-tuned Llama > GPT-3.5 by 46.5% |  |  |
| MolQuest | Molecular/Drug Discovery | 2026 | Agent Workflow | Active |  | SOTA ~50%; most <30% |  |  |
| ChemBench | Molecular/Drug Discovery | 2024 | MCQ/Free-form | Nearing | 24 mo | Claude 3.5 Sonnet > best human chemists | Expert chemists |  |
| QM9 | Molecular/Drug Discovery | 2014 | Regression | Nearing | 144 mo | Near chemical accuracy |  | PCQM4Mv2 |
| QM7/QM7b | Molecular/Drug Discovery | 2012 | Regression | Saturated | 96 mo | Chemical accuracy achieved |  | QM9 |
| QM8 | Molecular/Drug Discovery | 2015 | Regression | Nearing | 132 mo | Near saturation |  | QM9 |
| PCQM4Mv2 | Molecular/Drug Discovery | 2021 | Regression | Active |  | MAE 0.0719 eV |  |  |
| CASP (Single-Chain) | Protein | 1994 | Structural Prediction | Saturated | 312 mo | GDT-TS 92.4 | ~90 (experimental) | CASP complexes/RNA tracks |
| CASP16 (Complexes) | Protein | 2024 | Structural Prediction | Active |  | Varies; AB-Ag major failure area |  |  |
| CASP16 (RNA) | Genomics | 2024 | Structural Prediction | Active |  | No TM-score >0.8 for novel RNAs |  |  |
| CASP16 (Protein-Ligand) | Molecular/Drug Discovery | 2024 | Binding Affinity | Active |  | Max Kendall tau 0.42 |  |  |
| CAMEO | Protein | 2013 | Structure Evaluation | Saturated | 144 mo | AlphaFold dominant |  | CAMEO complexes/ligands/peptides |
| CATH | Protein | 1997 | Classification | Nearing | 348 mo | High accuracy (known folds) |  |  |
| TAPE | Protein | 2019 | Multi-task | Saturated | 48 mo | Near-ceiling on most tasks |  | ProteinGym, PEER, FLIP |
| PEER | Protein | 2022 | Multi-task | Active |  | Varies across 17 tasks |  |  |
| ProteinGym | Protein | 2023 | Regression | Active |  | Spearman ~0.52 |  |  |
| ProteinBench | Protein | 2024 | Multi-task | Active |  | Varies; drops >20% at 500+ residues |  |  |
| FLIP | Protein | 2021 | Regression | Nearing | 60 mo | Varies |  | FLIP2 |
| FLIP2 | Protein | 2026 | Regression | Active |  | Varies |  |  |
| FoldBench | Protein | 2025 | Structural Prediction | Active |  | Varies by task |  |  |
| CAPRI | Protein | 2001 | Docking/Interaction | Active |  | Varies |  |  |
| ProteinInvBench | Protein | 2023 | Generation | Active |  | Recovery ~66% |  |  |
| CAFA | Protein | 2010 | Function Prediction | Active |  | Fmax 0.4-0.6 |  |  |
| DB5.5 (Docking Benchmark) | Protein | 2003 | Docking | Active |  | Top-10 success ~38% |  |  |
| SKEMPI 2.0 | Protein | 2019 | Regression | Active |  | Pearson R ~0.7-0.8 |  |  |
| PPB-Affinity | Protein | 2024 | Regression | Active |  | Varies |  |  |
| SAbDab (CDR design) | Protein | 2014 | Generation | Active |  | CDR-H3 AAR ~40-50%; RMSD ~2.5-3.5Å |  |  |
| AbBiBench | Protein | 2025 | Antibody Design | Active |  | Varies |  |  |
| Mega-Scale Stability Dataset | Protein | 2023 | Regression | Active |  | PCC 0.72 (ThermoMPNN) |  |  |
| S669 (Stability blind test) | Protein | 2020 | Regression | Active |  | PCC ~0.43-0.67 |  |  |
| EC-Bench | Protein | 2025 | Classification | Active |  | Varies |  |  |
| MotifBench | Protein | 2025 | Generation | Active |  | RFdiffusion solves ~16/30 |  |  |
| SHAPES | Protein | 2025 | Generation | Active |  | Most models fail on loops/mixed |  |  |
| PFMBench | Protein | 2025 | Multi-task | Active |  | Varies across 38 tasks |  |  |
| LiveProteinBench | Protein | 2025 | Multi-task | Active |  | Varies |  |  |
| ATOM3D | Protein | 2021 | Multi-task (3D) | Active |  | Varies across 8 tasks |  |  |
| OGB Protein Tasks | Protein | 2020 | Classification/Link Prediction | Active |  | Varies |  |  |
| LRGB Peptides-func | Protein | 2022 | Classification | Active |  | Varies (AP) |  |  |
| LRGB Peptides-struct | Protein | 2022 | Regression | Active |  | Varies (MAE) |  |  |
| RNA-Puzzles | Protein | 2012 | Structural Prediction | Active |  | Best RMSD 3-7Å | Human experts outperform servers |  |
| BEACON (RNA Tasks) | Genomics | 2024 | Multi-task (13 RNA tasks) | Active |  | PLMs surpass SOTA on 8/13 |  |  |
| rnaglib | Protein | 2025 | Multi-task (RNA 3D) | Active |  | Varies |  |  |
| ATLAS MD Ensemble Dataset | Protein | 2024 | Conformational Dynamics | Active |  | Varies |  |  |
| PeptoneBench | Protein | 2025 | Conformational Dynamics | Active |  | BioEmu and PepTron lead |  |  |
| PDFBench | Protein | 2025 | Protein Design from Function | Active |  | Varies across 16 metrics |  |  |
| SafeProtein | Biosecurity | 2025 | Red-teaming | Active |  | Varies |  |  |
| GenomicBenchmarks | Genomics | 2022 | Classification | Saturated | 18 mo | 95%+ accuracy |  | GUE, BEND |
| GUE | Genomics | 2023 | Classification | Active |  | F1 0.3-0.95 |  | DART-Eval, GenBench |
| BEND | Genomics | 2023 | Classification/Regression | Active |  | Varies |  |  |
| NT-18 Benchmark | Genomics | 2023 | Classification | Active |  | MCC 0.974 (promoter); 0.983 (splice) |  | NTv3 |
| Gene-MTEB | Genomics | 2025 | Classification/Clustering | Active |  | 0.59 average |  |  |
| Nullsettes | Genomics | 2025 | Zero-shot Prediction | Active |  | Most models fail |  |  |
| Evo / Evo 2 | Genomics | 2024 | Multi-task | Active |  | BRCA1 >90% AUROC (Evo 2) |  |  |
| METAGENE-1 | Genomics | 2025 | Classification/Clustering | Active |  | MCC 92.96 (pathogen) |  |  |
| DART-Eval | Genomics | 2024 | Multi-task | Active |  | DNALMs inconsistent |  |  |
| OmniGenBench | Genomics | 2025 | Multi-task | Active |  | Varies across 123+ datasets |  |  |
| DNALongBench | Genomics | 2025 | Classification/Regression | Active |  | Expert models lead |  |  |
| GenBench | Genomics | 2024 | Multi-task | Active |  | Varies across 43 datasets |  |  |
| NABench | Genomics | 2025 | Fitness Prediction | Active |  | Varies |  |  |
| Genomics Long-Range Benchmark | Genomics | 2024 | Classification/Regression | Active |  | Large gaps in VEP |  |  |
| DeepSEA | Genomics | 2015 | Classification | Nearing | 132 mo | AUC 0.958 |  | Sei, AlphaGenome |
| Sei | Genomics | 2022 | Classification | Active |  | AUROC 0.972; AUPRC 0.409 |  | AlphaGenome |
| AlphaGenome | Genomics | 2025 | Multi-task | Active |  | 25/26 VEP evaluations SOTA |  |  |
| Enformer | Genomics | 2021 | Gene Expression Prediction | Active |  | Cross-gene CAGE Pearson 0.85 | ~0.94 (experimental replicate ceiling) | Borzoi, AlphaGenome |
| Borzoi | Genomics | 2024 | Gene Expression Prediction | Active |  | 524kb input, 32bp resolution |  | AlphaGenome |
| SpliceAI | Genomics | 2019 | Classification | Nearing | 84 mo | AUPRC 0.98; ~95% top-k |  | OpenSpliceAI, AlphaGenome |
| CAGI | Genomics | 2011 | Variant Interpretation | Active |  | Varies |  |  |
| TraitGym | Genomics | 2025 | Variant Prediction | Active |  | CADD/GPN-MSA best for disease traits |  |  |
| ClinVar (coding variants) | Genomics | 2013 | Classification | Nearing | 156 mo | AUC ~0.95 |  |  |
| CAMI (Metagenomic) | Genomics | 2017 | Classification/Assembly | Active |  | Good at genus, poor at strain |  | CAMI III |
| DNABERT evaluation tasks | Genomics | 2021 | Classification | Nearing | 60 mo | F1 0.940 (promoter); MCC 0.871 (splice) |  | DNABERT-2, GUE |
| CRISPR on-target (CRISPRon) | Genomics | 2021 | Regression | Nearing | 60 mo | Spearman ~0.80 | Ceiling ~0.85-0.90 (biological noise) |  |
| CRISPR off-target (CIRCLE-seq) | Genomics | 2017 | Classification | Active |  | AUROC 0.977 (CCLMoff) |  |  |
| PRIDICT 2.0 (Prime Editing) | Genomics | 2024 | Regression | Active |  | Spearman R=0.85 |  |  |
| DeepPrime | Genomics | 2023 | Regression | Active |  | Approaching strong performance for specific edits |  |  |
| scIB | Virtual Cell | 2020 | Integration | Nearing | 72 mo | scANVI ~0.8 |  | CZI cz-benchmarks |
| CZI cz-benchmarks | Virtual Cell | 2024 | Classification/Prediction | Active |  | Varies |  |  |
| Open Problems (single-cell) | Virtual Cell | 2021 | Multi-task | Active |  | Varies across 12 tasks |  |  |
| Arc Institute Virtual Cell Challenge | Virtual Cell | 2025 | Perturbation Prediction | Active |  | Statistical baselines competitive with AI |  |  |
| scPerturb | Virtual Cell | 2024 | Data Resource/QC | Active |  | E-distance metric |  | scPerturBench |
| scPerturBench | Virtual Cell | 2025 | Perturbation Prediction | Active |  | No method consistently wins; linear baselines competitive |  |  |
| GEARS | Virtual Cell | 2023 | Perturbation Prediction | Active |  | 40% higher precision than baselines (claimed); linear models outperform (2025) |  |  |
| PerturBench | Virtual Cell | 2024 | Perturbation Prediction | Active |  | Simple models outperform complex |  |  |
| BEELINE (GRN inference) | Virtual Cell | 2020 | GRN Inference | Active |  | Close to random predictor in many cases |  |  |
| dynverse (Trajectory inference) | Virtual Cell | 2019 | Trajectory Inference | Active |  | Varies by topology |  |  |
| PubMedQA | Bio NLP | 2019 | MCQ | Saturated | 72 mo | 81.6% | 78% | MedHELM, BigBIO |
| MedQA (USMLE) | Clinical | 2021 | MCQ | Saturated | 42 mo | 96.5% | 87% | MedXpertQA, HealthBench |
| MedMCQA | Clinical | 2022 | MCQ | Active |  | ~75-80% |  |  |
| MMLU-Bio | Bio NLP | 2020 | MCQ | Saturated | 48 mo | 93%+ | 89.8% | MMLU-Pro |
| MMLU-Pro Medical | Bio NLP | 2024 | MCQ | Saturated | 18 mo | 90.1% |  |  |
| BLURB | Bio NLP | 2020 | Multi-task NLP | Nearing | 72 mo | 82.91 BLURB score |  | BigBIO |
| BigBIO | Bio NLP | 2022 | Framework | Active |  | N/A (126+ datasets) |  |  |
| BioASQ | Bio NLP | 2013 | Semantic QA | Active |  | F1 ~0.58; yes/no >80% |  |  |
| BC5CDR-Chemical NER | Bio NLP | 2015 | NER | Saturated | 84 mo | F1 94.2% |  | BioRED |
| BC5CDR-Disease NER | Bio NLP | 2015 | NER | Nearing | 132 mo | F1 ~90% |  |  |
| NCBI-Disease NER | Bio NLP | 2014 | NER | Saturated | 96 mo | F1 ~91% | IAA ~87% |  |
| JNLPBA NER | Bio NLP | 2004 | NER | Saturated | 204 mo | F1 ~79.6% |  |  |
| BC2GM NER | Bio NLP | 2007 | NER | Nearing | 228 mo | F1 ~90.9% |  |  |
| ChemProt RE | Bio NLP | 2017 | Relation Extraction | Nearing | 108 mo | F1 90.8% |  |  |
| DDI RE | Bio NLP | 2013 | Relation Extraction | Saturated | 120 mo | F1 83.3% |  | MultiADE |
| GAD RE | Bio NLP | 2004 | Relation Extraction | Saturated | 204 mo | F1 ~85% |  |  |
| HoC (Hallmarks of Cancer) | Bio NLP | 2016 | Classification | Nearing | 120 mo | F1 ~90.3% |  |  |
| BioRED | Bio NLP | 2022 | Relation Extraction | Active |  | Varies |  |  |
| MedNLI | Bio NLP | 2018 | NLI | Nearing | 96 mo | ~82% accuracy |  | BioNLI |
| LINNAEUS NER | Bio NLP | 2010 | NER | Saturated | 132 mo | Mid-to-high 90s% |  |  |
| GPQA Diamond (Bio) | Science QA | 2023 | MCQ | Saturated | 36 mo | 94.1% | 67% | HLE, ATLAS |
| HLE (Bio/Medicine) | Science QA | 2025 | Short Answer/MCQ | Active |  | 44.7% overall; ~22% bio/med | 98% |  |
| ATLAS (Bio) | Science QA | 2025 | Open-ended | Active |  | Varies |  |  |
| MedHELM | Clinical | 2025 | Multi-task Clinical | Active |  | 66% win-rate (DeepSeek R1) |  |  |
| MedAgentBench | Clinical | 2025 | Agent Workflow | Active |  | 69.67% |  |  |
| HealthBench | Clinical | 2025 | Conversational | Active |  | 60% (o3); Hard: 32% |  |  |
| MedCalc-Bench | Clinical | 2024 | Calculation | Active |  | 50.9% |  |  |
| MIMIC-III Benchmarks | Clinical | 2017 | Prediction | Nearing | 108 mo | AUROC ~0.94 (mortality) |  | MIMIC-IV, YAIB |
| MIMIC-IV Benchmarks | Clinical | 2020 | Prediction | Active |  | Hospitalization AUROC ~0.87 |  |  |
| EHRSHOT | Clinical | 2023 | Few-shot Prediction | Active |  | Varies across 15 tasks |  |  |
| FHIR-AgentBench | Clinical | 2025 | Agent Workflow | Active |  | Varies |  |  |
| LiveMedBench | Clinical | 2025 | Dynamic MCQ | Active |  | Varies (refreshes weekly) |  |  |
| MedXpertQA | Clinical | 2025 | MCQ | Active |  | Varies (4,460 Qs, 17 specialties) |  |  |
| DiagnosisArena | Clinical | 2025 | Diagnostic Reasoning | Active |  | 45.82% |  |  |
| MedS-Bench | Clinical | 2024 | Multi-task | Active |  | Varies across 39 datasets |  |  |
| Script Concordance Testing | Clinical | 2025 | Probabilistic Reasoning | Active |  | Varies |  |  |
| Doctorina MedBench | Clinical | 2026 | Agent Simulation | Active |  | Varies (>1,000 cases) |  |  |
| AgentClinic | Clinical | 2025 | Diagnostic Agent | Active |  | Varies (120 NEJM cases) |  |  |
| ACI-BENCH | Clinical | 2023 | Note Generation | Active |  | MEDCON 57.78 |  |  |
| PhysioNet 2019 (Sepsis) | Clinical | 2019 | Prediction | Active |  | Varies (104 teams) |  |  |
| JAMA Clinical Challenge | Clinical | 2024 | MCQ | Active |  | 88.6% (o1-preview on 70 cases) |  |  |
| MedBench v4 (China) | Clinical | 2025 | Multi-task | Active |  | Agent 85.3/100 (Claude Sonnet 4.5) |  |  |
| AfriMed-QA | Clinical | 2024 | MCQ | Active |  | Varies (15K+ Qs, 32 specialties) |  |  |
| MMedBench | Clinical | 2024 | Multilingual MCQ | Active |  | Varies across 6 languages |  |  |
| TAC ADR | Bio NLP | 2017 | Extraction | Nearing | 108 mo | F1 ~85.2% |  | MultiADE |
| MultiADE | Bio NLP | 2024 | Extraction | Active |  | Varies across 6 ADE datasets |  |  |
| SMM4H | Bio NLP | 2017 | Social Media Mining | Active |  | ADR detection F1 ~0.65-0.70 |  |  |
| WMDP-Bio | Biosecurity | 2024 | MCQ | Saturated | 7 mo | 87% | 60.5% | VCT, ABC-Bench |
| VCT | Biosecurity | 2025 | Multimodal QA | Active |  | 43.8% (o3) | 22.1% (expert virologists) |  |
| ABC-Bench | Biosecurity | 2025 | Agent Workflow | Active |  | 53% (Grok 3) | 24% (PhD experts) |  |
| ABLE | Biosecurity | 2025 | Agent Workflow | Active |  | 7/8 low-level tasks |  |  |
| BBG Framework / B3 | Biosecurity | 2025 | Open-ended | Unknown |  | Pilot phase |  |  |
| BioLP-Bench | Biosecurity | 2024 | Protocol Error Detection | Active |  | 34% (o4-mini) | ~38% (expert avg) |  |
| Scale AI FORTRESS (CBRNE) | Biosecurity | 2025 | Adversarial Safety | Active |  | Varies |  |  |
| Scale AI PropensityBench | Biosecurity | 2025 | Agent Safety | Active |  | Varies (5,874 tasks) |  |  |
| TroubleshootingBench | Biosecurity | 2025 | Protocol QA | Active |  | No model exceeds 80th-pct expert (36.4%) | 80th-pct expert: 36.4% |  |
| LAB-Bench | Agentic Bio | 2024 | Research Workflow | Saturated | 18 mo | 89%; several subtasks at ceiling | ~79% (expert on ProtocolQA) | LABBench2 |
| LABBench2 | Agentic Bio | 2026 | Research Workflow | Active |  | 26-46% drop vs LAB-Bench |  |  |
| BixBench | Agentic Bio | 2025 | Bioinformatics | Active |  | 17% (Claude 3.5 Sonnet); 9% (GPT-4o) |  |  |
| ScienceAgentBench | Agentic Bio | 2024 | Code Generation | Active |  | 42.2% (o1-preview + self-debug) |  |  |
| BioCoder | Agentic Bio | 2024 | Code Generation | Nearing | 24 mo | ~50% Pass@1 (GPT-4) |  | BixBench |
| MedAgentGym | Agentic Bio | 2025 | Training Environment | Active |  | N/A |  |  |
| LitQA2 / PaperQA2 | Agentic Bio | 2024 | Literature QA | Nearing | 24 mo | ~90% (PaperQA2) | ~67% (human experts) |  |
| BioProBench | Agentic Bio | 2025 | Protocol Understanding | Active |  | Varies |  |  |
| CheXpert | Medical Imaging | 2019 | Classification | Nearing | 84 mo | AUC ~0.94 | 2.6/3 radiologists | CheXpert Plus |
| MIMIC-CXR | Medical Imaging | 2019 | Report Generation | Active |  | RadCliQ-v1 0.92 |  |  |
| NIH ChestX-ray14 | Medical Imaging | 2017 | Classification | Nearing | 108 mo | AUC ~0.85-0.88 |  | CheXpert, MIMIC-CXR |
| PathVQA | Medical Imaging | 2020 | Visual QA | Active |  | 50-65% |  | GEMeX |
| VQA-RAD | Medical Imaging | 2018 | Visual QA | Unknown |  | 79.2% |  |  |
| SLAKE | Medical Imaging | 2021 | Visual QA | Active |  | ~78.7% |  |  |
| OmniMedVQA | Medical Imaging | 2024 | Visual QA | Active |  | LVLMs struggle |  |  |
| ISIC Challenges | Medical Imaging | 2016 | Classification/Segmentation | Nearing | 120 mo | Exceeds clinicians | Clinician level | 3D TBP (2024) |
| HAM10000 | Medical Imaging | 2018 | Classification | Nearing | 96 mo | ~96%+ accuracy |  | DermaMNIST-E |
| Camelyon16 | Medical Imaging | 2016 | Classification | Saturated | 72 mo | AUC 0.994 | Pathologist AUC | Camelyon17, Camelyon+ |
| Camelyon17 | Medical Imaging | 2017 | Classification | Active |  | Kappa ~0.89 |  | Camelyon+ |
| PANDA (Prostate Gleason) | Medical Imaging | 2020 | Classification | Active |  | QWK ~0.93+ |  |  |
| PCam | Medical Imaging | 2018 | Classification | Nearing | 96 mo | ~97%+ accuracy |  |  |
| BraTS | Medical Imaging | 2012 | Segmentation | Active |  | Dice ~0.88 whole tumor |  | BraTS 2025 Lighthouse |
| Medical Segmentation Decathlon | Medical Imaging | 2018 | Segmentation | Active |  | nnU-Net variants lead |  |  |
| TotalSegmentator | Medical Imaging | 2023 | Segmentation | Nearing | 36 mo | Avg Dice >0.90 for major organs |  |  |
| AMOS | Medical Imaging | 2022 | Segmentation | Active |  | Varies (15 organs) |  | AMOS-MM (2024) |
| CT-Bench | Medical Imaging | 2025 | Multimodal QA | Active |  | 61.8% |  |  |
| APTOS Diabetic Retinopathy | Medical Imaging | 2019 | Classification | Nearing | 84 mo | QWK 0.967 |  |  |
| EyePACS DR | Medical Imaging | 2015 | Classification | Saturated | 48 mo | AUC ~0.99 |  |  |
| RSNA Pneumonia Detection | Medical Imaging | 2018 | Object Detection | Active |  | Varies (1,400+ Kaggle teams) |  |  |
| ReXrank | Medical Imaging | 2024 | Report Generation | Active |  | 1/RadCliQ-v1 0.98 (ReXGradient) |  |  |
| RadGraph | Medical Imaging | 2021 | Entity/Relation Extraction | Active |  | Micro F1 0.82 |  |  |
| ReXVQA | Medical Imaging | 2025 | Visual QA | Active |  | 83.24% (MedGemma) |  |  |
| MatBench Discovery | Molecular/Drug Discovery | 2025 | Crystal Stability | Nearing | 12 mo | F1=0.924 |  |  |
| OC20 | Molecular/Drug Discovery | 2020 | Catalyst Prediction | Active |  | EquiformerV2 leads |  | OC22, OC25 |
| MedPerf | Clinical | 2023 | Federated Benchmarking | Active |  | N/A (platform) |  |  |
| FairMedQA | Clinical | 2025 | Fairness Evaluation | Active |  | 3-19 pct point accuracy disparity |  |  |
| MEDEC | Clinical | 2025 | Error Detection/Correction | Active |  | Varies |  |  |
| MedThink-Bench | Clinical | 2025 | Reasoning Evaluation | Active |  | Varies (500 QA pairs) |  |  |
| CLINIC | Clinical | 2025 | Trustworthiness | Active |  | DeepSeek-R1-LLaMA best robustness |  |  |
| MedAgentsBench | Clinical | 2025 | Multi-agent Reasoning | Active |  | 92.6% (TeamMedAgents) |  |  |