Full Catalog
Every AI biology and biomedical benchmark we track, with saturation status, scores, and successor information.
| Name | Domain | Year | Type | Status | Time to Saturation | Best Score | Human Baseline | Successor |
|---|---|---|---|---|---|---|---|---|
| MoleculeNet | Molecular/Drug Discovery | 2018 | Classification/Regression | Nearing | 96 mo | AUROC 0.85-0.95 | — | TDC, WelQrate |
| TDC ADMET (22 leaderboards) | Molecular/Drug Discovery | 2021 | Classification/Regression | Active | — | Varies by task | — | TDC-2 |
| TDC DrugCombo | Molecular/Drug Discovery | 2021 | Regression | Active | — | Varies | — | — |
| TDC Docking | Molecular/Drug Discovery | 2021 | Generation | Active | — | Varies | — | — |
| TDC DTI DG Group | Molecular/Drug Discovery | 2021 | Regression | Active | — | Varies | — | — |
| TDC Clinical Trial Outcome | Molecular/Drug Discovery | 2023 | Classification | Active | — | Varies | — | — |
| TDC-2 (Single-cell DTI) | Molecular/Drug Discovery | 2024 | Classification | Active | — | Varies | — | — |
| TDC-2 Protein-Peptide Binding | Molecular/Drug Discovery | 2024 | Regression | Active | — | Varies | — | — |
| GuacaMol | Molecular/Drug Discovery | 2019 | Generation | Saturated | 24 mo | Near-perfect on simple goals | — | PMO, TARTARUS, DrugGym |
| MOSES | Molecular/Drug Discovery | 2019 | Generation | Saturated | 48 mo | High validity/uniqueness | — | MolScore, TARTARUS |
| PMO | Molecular/Drug Discovery | 2022 | Optimization | Active | — | Varies | — | — |
| TARTARUS | Molecular/Drug Discovery | 2023 | Generation/RL | Active | — | Low success on hard objectives | — | — |
| DrugGym | Molecular/Drug Discovery | 2024 | Agent Workflow | Active | — | N/A (simulator) | — | — |
| DOCKSTRING | Molecular/Drug Discovery | 2022 | Regression/Generation | Active | — | Varies | — | — |
| MolScore | Molecular/Drug Discovery | 2024 | Meta-framework | Unknown | — | N/A | — | — |
| MolGenBench | Molecular/Drug Discovery | 2025 | Structure-based Generation | Active | — | Varies | — | — |
| ToxiMol | Molecular/Drug Discovery | 2025 | Generation | Active | — | Low success rates | — | — |
| Tox21 | Molecular/Drug Discovery | 2014 | Classification | Unknown | — | AUC ~0.85 | — | — |
| Polaris | Molecular/Drug Discovery | 2024 | Classification/Regression | Active | — | Varies (blind test) | — | — |
| WelQrate | Molecular/Drug Discovery | 2024 | Classification | Active | — | Varies | — | — |
| PharmaBench | Molecular/Drug Discovery | 2024 | Classification/Regression | Active | — | Varies | — | — |
| TOXRIC | Molecular/Drug Discovery | 2023 | Classification | Active | — | Varies | — | — |
| ZINC | Molecular/Drug Discovery | 2015 | Regression/Generation | Active | — | Varies | — | ZINC20, ZINC-22 |
| PDBbind | Molecular/Drug Discovery | 2004 | Regression | Nearing | 264 mo | Pearson R >0.86 (standard); <0.60 (CleanSplit) | — | PDBbind CleanSplit, PLINDER |
| PDBbind CleanSplit | Molecular/Drug Discovery | 2025 | Regression | Active | — | Varies (leak-free) | — | — |
| PLINDER | Molecular/Drug Discovery | 2024 | Docking/Regression | Active | — | DiffDock drops 38% → 15% with leak control | — | — |
| CASF-2016 | Molecular/Drug Discovery | 2016 | Scoring/Ranking/Docking | Nearing | 120 mo | Pearson R ~0.86 | — | PDBbind CleanSplit |
| DAVIS | Molecular/Drug Discovery | 2011 | Regression | Nearing | 180 mo | CI ~0.903 | — | — |
| KIBA | Molecular/Drug Discovery | 2014 | Regression | Active | — | CI ~0.898 | — | — |
| CPI2M | Molecular/Drug Discovery | 2024 | Regression | Active | — | Baseline | — | — |
| PoseBusters | Molecular/Drug Discovery | 2023 | Pose Validation | Active | — | ~75% PB-Valid | — | — |
| PoseBench | Molecular/Drug Discovery | 2024 | Pose Prediction | Active | — | ~68% (PB); ~33% (DockGen-E) | — | — |
| DockGen | Molecular/Drug Discovery | 2024 | Pose Prediction | Active | — | ~24-33% top-1 RMSD < 2 Å | — | — |
| PoseX | Molecular/Drug Discovery | 2025 | Pose Prediction | Active | — | AI > physics-based (post-relaxation) | — | — |
| LIT-PCBA | Molecular/Drug Discovery | 2020 | Virtual Screening | Active | — | Varies | — | MF-PCBA |
| DUD-E | Molecular/Drug Discovery | 2012 | Virtual Screening | Saturated | 96 mo | Biased high AUC | — | LIT-PCBA, MF-PCBA, WelQrate |
| MF-PCBA | Molecular/Drug Discovery | 2024 | Virtual Screening | Active | — | Baselines established | — | — |
| CrossDocked2020 | Molecular/Drug Discovery | 2020 | Scoring/Pose | Active | — | R 0.612 | — | — |
| USPTO-50K | Molecular/Drug Discovery | 2016 | Retrosynthesis | Nearing | 120 mo | 65% top-1 | ~48.2% avg (forward) | — |
| USPTO-MIT | Molecular/Drug Discovery | 2017 | Reaction Prediction | Nearing | 108 mo | >90% top-1 (forward) | — | — |
| Schneider 50K | Molecular/Drug Discovery | 2015 | Reaction Classification | Saturated | 72 mo | >99% (RXNFP) | — | USPTO 1K TPL |
| TOMG-Bench | Molecular/Drug Discovery | 2024 | LLM Generation | Active | — | Fine-tuned Llama > GPT-3.5 by 46.5% | — | — |
| MolQuest | Molecular/Drug Discovery | 2026 | Agent Workflow | Active | — | SOTA ~50%; most <30% | — | — |
| ChemBench | Molecular/Drug Discovery | 2024 | MCQ/Free-form | Nearing | 24 mo | Claude 3.5 Sonnet > best human chemists | Expert chemists | — |
| QM9 | Molecular/Drug Discovery | 2014 | Regression | Nearing | 144 mo | Near chemical accuracy | — | PCQM4Mv2 |
| QM7/QM7b | Molecular/Drug Discovery | 2012 | Regression | Saturated | 96 mo | Chemical accuracy achieved | — | QM9 |
| QM8 | Molecular/Drug Discovery | 2015 | Regression | Nearing | 132 mo | Near saturation | — | QM9 |
| PCQM4Mv2 | Molecular/Drug Discovery | 2021 | Regression | Active | — | MAE 0.0719 eV | — | — |
| CASP (Single-Chain) | Protein | 1994 | Structural Prediction | Saturated | 312 mo | GDT-TS 92.4 | ~90 (experimental) | CASP complexes/RNA tracks |
| CASP16 (Complexes) | Protein | 2024 | Structural Prediction | Active | — | Varies; AB-Ag major failure area | — | — |
| CASP16 (RNA) | Genomics | 2024 | Structural Prediction | Active | — | No TM-score >0.8 for novel RNAs | — | — |
| CASP16 (Protein-Ligand) | Molecular/Drug Discovery | 2024 | Binding Affinity | Active | — | Max Kendall tau 0.42 | — | — |
| CAMEO | Protein | 2013 | Structure Evaluation | Saturated | 144 mo | AlphaFold dominant | — | CAMEO complexes/ligands/peptides |
| CATH | Protein | 1997 | Classification | Nearing | 348 mo | High accuracy (known folds) | — | — |
| TAPE | Protein | 2019 | Multi-task | Saturated | 48 mo | Near-ceiling on most tasks | — | ProteinGym, PEER, FLIP |
| PEER | Protein | 2022 | Multi-task | Active | — | Varies across 17 tasks | — | — |
| ProteinGym | Protein | 2023 | Regression | Active | — | Spearman ~0.52 | — | — |
| ProteinBench | Protein | 2024 | Multi-task | Active | — | Varies; drops >20% at 500+ residues | — | — |
| FLIP | Protein | 2021 | Regression | Nearing | 60 mo | Varies | — | FLIP2 |
| FLIP2 | Protein | 2026 | Regression | Active | — | Varies | — | — |
| FoldBench | Protein | 2025 | Structural Prediction | Active | — | Varies by task | — | — |
| CAPRI | Protein | 2001 | Docking/Interaction | Active | — | Varies | — | — |
| ProteinInvBench | Protein | 2023 | Generation | Active | — | Recovery ~66% | — | — |
| CAFA | Protein | 2010 | Function Prediction | Active | — | Fmax 0.4-0.6 | — | — |
| DB5.5 (Docking Benchmark) | Protein | 2003 | Docking | Active | — | Top-10 success ~38% | — | — |
| SKEMPI 2.0 | Protein | 2019 | Regression | Active | — | Pearson R ~0.7-0.8 | — | — |
| PPB-Affinity | Protein | 2024 | Regression | Active | — | Varies | — | — |
| SAbDab (CDR design) | Protein | 2014 | Generation | Active | — | CDR-H3 AAR ~40-50%; RMSD ~2.5-3.5 Å | — | — |
| AbBiBench | Protein | 2025 | Antibody Design | Active | — | Varies | — | — |
| Mega-Scale Stability Dataset | Protein | 2023 | Regression | Active | — | PCC 0.72 (ThermoMPNN) | — | — |
| S669 (Stability blind test) | Protein | 2020 | Regression | Active | — | PCC ~0.43-0.67 | — | — |
| EC-Bench | Protein | 2025 | Classification | Active | — | Varies | — | — |
| MotifBench | Protein | 2025 | Generation | Active | — | RFdiffusion solves ~16/30 | — | — |
| SHAPES | Protein | 2025 | Generation | Active | — | Most models fail on loops/mixed | — | — |
| PFMBench | Protein | 2025 | Multi-task | Active | — | Varies across 38 tasks | — | — |
| LiveProteinBench | Protein | 2025 | Multi-task | Active | — | Varies | — | — |
| ATOM3D | Protein | 2021 | Multi-task (3D) | Active | — | Varies across 8 tasks | — | — |
| OGB Protein Tasks | Protein | 2020 | Classification/Link Prediction | Active | — | Varies | — | — |
| LRGB Peptides-func | Protein | 2022 | Classification | Active | — | Varies (AP) | — | — |
| LRGB Peptides-struct | Protein | 2022 | Regression | Active | — | Varies (MAE) | — | — |
| RNA-Puzzles | Genomics | 2012 | Structural Prediction | Active | — | Best RMSD 3-7 Å | Human experts outperform servers | — |
| BEACON (RNA Tasks) | Genomics | 2024 | Multi-task (13 RNA tasks) | Active | — | PLMs surpass SOTA on 8/13 | — | — |
| rnaglib | Genomics | 2025 | Multi-task (RNA 3D) | Active | — | Varies | — | — |
| ATLAS MD Ensemble Dataset | Protein | 2024 | Conformational Dynamics | Active | — | Varies | — | — |
| PeptoneBench | Protein | 2025 | Conformational Dynamics | Active | — | BioEmu and PepTron lead | — | — |
| PDFBench | Protein | 2025 | Protein Design from Function | Active | — | Varies across 16 metrics | — | — |
| SafeProtein | Biosecurity | 2025 | Red-teaming | Active | — | Varies | — | — |
| GenomicBenchmarks | Genomics | 2022 | Classification | Saturated | 18 mo | 95%+ accuracy | — | GUE, BEND |
| GUE | Genomics | 2023 | Classification | Active | — | F1 0.3-0.95 | — | DART-Eval, GenBench |
| BEND | Genomics | 2023 | Classification/Regression | Active | — | Varies | — | — |
| NT-18 Benchmark | Genomics | 2023 | Classification | Active | — | MCC 0.974 (promoter); 0.983 (splice) | — | NTv3 |
| Gene-MTEB | Genomics | 2025 | Classification/Clustering | Active | — | 0.59 average | — | — |
| Nullsettes | Genomics | 2025 | Zero-shot Prediction | Active | — | Most models fail | — | — |
| Evo / Evo 2 | Genomics | 2024 | Multi-task | Active | — | BRCA1 >90% AUROC (Evo 2) | — | — |
| METAGENE-1 | Genomics | 2025 | Classification/Clustering | Active | — | MCC 92.96 (pathogen) | — | — |
| DART-Eval | Genomics | 2024 | Multi-task | Active | — | DNALMs inconsistent | — | — |
| OmniGenBench | Genomics | 2025 | Multi-task | Active | — | Varies across 123+ datasets | — | — |
| DNALongBench | Genomics | 2025 | Classification/Regression | Active | — | Expert models lead | — | — |
| GenBench | Genomics | 2024 | Multi-task | Active | — | Varies across 43 datasets | — | — |
| NABench | Genomics | 2025 | Fitness Prediction | Active | — | Varies | — | — |
| Genomics Long-Range Benchmark | Genomics | 2024 | Classification/Regression | Active | — | Large gaps in VEP | — | — |
| DeepSEA | Genomics | 2015 | Classification | Nearing | 132 mo | AUC 0.958 | — | Sei, AlphaGenome |
| Sei | Genomics | 2022 | Classification | Active | — | AUROC 0.972; AUPRC 0.409 | — | AlphaGenome |
| AlphaGenome | Genomics | 2025 | Multi-task | Active | — | 25/26 VEP evaluations SOTA | — | — |
| Enformer | Genomics | 2021 | Gene Expression Prediction | Active | — | Cross-gene CAGE Pearson 0.85 | ~0.94 (experimental replicate ceiling) | Borzoi, AlphaGenome |
| Borzoi | Genomics | 2024 | Gene Expression Prediction | Active | — | 524kb input, 32bp resolution | — | AlphaGenome |
| SpliceAI | Genomics | 2019 | Classification | Nearing | 84 mo | AUPRC 0.98; ~95% top-k | — | OpenSpliceAI, AlphaGenome |
| CAGI | Genomics | 2011 | Variant Interpretation | Active | — | Varies | — | — |
| TraitGym | Genomics | 2025 | Variant Prediction | Active | — | CADD/GPN-MSA best for disease traits | — | — |
| ClinVar (coding variants) | Genomics | 2013 | Classification | Nearing | 156 mo | AUC ~0.95 | — | — |
| CAMI (Metagenomic) | Genomics | 2017 | Classification/Assembly | Active | — | Good at genus, poor at strain | — | CAMI III |
| DNABERT evaluation tasks | Genomics | 2021 | Classification | Nearing | 60 mo | F1 0.940 (promoter); MCC 0.871 (splice) | — | DNABERT-2, GUE |
| CRISPR on-target (CRISPRon) | Genomics | 2021 | Regression | Nearing | 60 mo | Spearman ~0.80 | Ceiling ~0.85-0.90 (biological noise) | — |
| CRISPR off-target (CIRCLE-seq) | Genomics | 2017 | Classification | Active | — | AUROC 0.977 (CCLMoff) | — | — |
| PRIDICT 2.0 (Prime Editing) | Genomics | 2024 | Regression | Active | — | Spearman R=0.85 | — | — |
| DeepPrime | Genomics | 2023 | Regression | Active | — | Approaching strong performance for specific edits | — | — |
| scIB | Virtual Cell | 2020 | Integration | Nearing | 72 mo | scANVI ~0.8 | — | CZI cz-benchmarks |
| CZI cz-benchmarks | Virtual Cell | 2024 | Classification/Prediction | Active | — | Varies | — | — |
| Open Problems (single-cell) | Virtual Cell | 2021 | Multi-task | Active | — | Varies across 12 tasks | — | — |
| Arc Institute Virtual Cell Challenge | Virtual Cell | 2025 | Perturbation Prediction | Active | — | Statistical baselines competitive with AI | — | — |
| scPerturb | Virtual Cell | 2024 | Data Resource/QC | Active | — | E-distance metric | — | scPerturBench |
| scPerturBench | Virtual Cell | 2025 | Perturbation Prediction | Active | — | No method consistently wins; linear baselines competitive | — | — |
| GEARS | Virtual Cell | 2023 | Perturbation Prediction | Active | — | 40% higher precision than baselines (claimed); linear models outperform (2025) | — | — |
| PerturBench | Virtual Cell | 2024 | Perturbation Prediction | Active | — | Simple models outperform complex | — | — |
| BEELINE (GRN inference) | Virtual Cell | 2020 | GRN Inference | Active | — | Close to random predictor in many cases | — | — |
| dynverse (Trajectory inference) | Virtual Cell | 2019 | Trajectory Inference | Active | — | Varies by topology | — | — |
| PubMedQA | Bio NLP | 2019 | MCQ | Saturated | 72 mo | 81.6% | 78% | MedHELM, BigBIO |
| MedQA (USMLE) | Clinical | 2021 | MCQ | Saturated | 42 mo | 96.5% | 87% | MedXpertQA, HealthBench |
| MedMCQA | Clinical | 2022 | MCQ | Active | — | ~75-80% | — | — |
| MMLU-Bio | Bio NLP | 2020 | MCQ | Saturated | 48 mo | 93%+ | 89.8% | MMLU-Pro |
| MMLU-Pro Medical | Bio NLP | 2024 | MCQ | Saturated | 18 mo | 90.1% | — | — |
| BLURB | Bio NLP | 2020 | Multi-task NLP | Nearing | 72 mo | 82.91 BLURB score | — | BigBIO |
| BigBIO | Bio NLP | 2022 | Framework | Active | — | N/A (126+ datasets) | — | — |
| BioASQ | Bio NLP | 2013 | Semantic QA | Active | — | F1 ~0.58; yes/no >80% | — | — |
| BC5CDR-Chemical NER | Bio NLP | 2015 | NER | Saturated | 84 mo | F1 94.2% | — | BioRED |
| BC5CDR-Disease NER | Bio NLP | 2015 | NER | Nearing | 132 mo | F1 ~90% | — | — |
| NCBI-Disease NER | Bio NLP | 2014 | NER | Saturated | 96 mo | F1 ~91% | IAA ~87% | — |
| JNLPBA NER | Bio NLP | 2004 | NER | Saturated | 204 mo | F1 ~79.6% | — | — |
| BC2GM NER | Bio NLP | 2007 | NER | Nearing | 228 mo | F1 ~90.9% | — | — |
| ChemProt RE | Bio NLP | 2017 | Relation Extraction | Nearing | 108 mo | F1 90.8% | — | — |
| DDI RE | Bio NLP | 2013 | Relation Extraction | Saturated | 120 mo | F1 83.3% | — | MultiADE |
| GAD RE | Bio NLP | 2004 | Relation Extraction | Saturated | 204 mo | F1 ~85% | — | — |
| HoC (Hallmarks of Cancer) | Bio NLP | 2016 | Classification | Nearing | 120 mo | F1 ~90.3% | — | — |
| BioRED | Bio NLP | 2022 | Relation Extraction | Active | — | Varies | — | — |
| MedNLI | Bio NLP | 2018 | NLI | Nearing | 96 mo | ~82% accuracy | — | BioNLI |
| LINNAEUS NER | Bio NLP | 2010 | NER | Saturated | 132 mo | Mid-to-high 90s% | — | — |
| GPQA Diamond (Bio) | Science QA | 2023 | MCQ | Saturated | 36 mo | 94.1% | 67% | HLE, ATLAS |
| HLE (Bio/Medicine) | Science QA | 2025 | Short Answer/MCQ | Active | — | 44.7% overall; ~22% bio/med | 98% | — |
| ATLAS (Bio) | Science QA | 2025 | Open-ended | Active | — | Varies | — | — |
| MedHELM | Clinical | 2025 | Multi-task Clinical | Active | — | 66% win-rate (DeepSeek R1) | — | — |
| MedAgentBench | Clinical | 2025 | Agent Workflow | Active | — | 69.67% | — | — |
| HealthBench | Clinical | 2025 | Conversational | Active | — | 60% (o3); Hard: 32% | — | — |
| MedCalc-Bench | Clinical | 2024 | Calculation | Active | — | 50.9% | — | — |
| MIMIC-III Benchmarks | Clinical | 2017 | Prediction | Nearing | 108 mo | AUROC ~0.94 (mortality) | — | MIMIC-IV, YAIB |
| MIMIC-IV Benchmarks | Clinical | 2020 | Prediction | Active | — | Hospitalization AUROC ~0.87 | — | — |
| EHRSHOT | Clinical | 2023 | Few-shot Prediction | Active | — | Varies across 15 tasks | — | — |
| FHIR-AgentBench | Clinical | 2025 | Agent Workflow | Active | — | Varies | — | — |
| LiveMedBench | Clinical | 2025 | Dynamic MCQ | Active | — | Varies (refreshes weekly) | — | — |
| MedXpertQA | Clinical | 2025 | MCQ | Active | — | Varies (4,460 Qs, 17 specialties) | — | — |
| DiagnosisArena | Clinical | 2025 | Diagnostic Reasoning | Active | — | 45.82% | — | — |
| MedS-Bench | Clinical | 2024 | Multi-task | Active | — | Varies across 39 datasets | — | — |
| Script Concordance Testing | Clinical | 2025 | Probabilistic Reasoning | Active | — | Varies | — | — |
| Doctorina MedBench | Clinical | 2026 | Agent Simulation | Active | — | Varies (>1,000 cases) | — | — |
| AgentClinic | Clinical | 2025 | Diagnostic Agent | Active | — | Varies (120 NEJM cases) | — | — |
| ACI-BENCH | Clinical | 2023 | Note Generation | Active | — | MEDCON 57.78 | — | — |
| PhysioNet 2019 (Sepsis) | Clinical | 2019 | Prediction | Active | — | Varies (104 teams) | — | — |
| JAMA Clinical Challenge | Clinical | 2024 | MCQ | Active | — | 88.6% (o1-preview on 70 cases) | — | — |
| MedBench v4 (China) | Clinical | 2025 | Multi-task | Active | — | Agent 85.3/100 (Claude Sonnet 4.5) | — | — |
| AfriMed-QA | Clinical | 2024 | MCQ | Active | — | Varies (15K+ Qs, 32 specialties) | — | — |
| MMedBench | Clinical | 2024 | Multilingual MCQ | Active | — | Varies across 6 languages | — | — |
| TAC ADR | Bio NLP | 2017 | Extraction | Nearing | 108 mo | F1 ~85.2% | — | MultiADE |
| MultiADE | Bio NLP | 2024 | Extraction | Active | — | Varies across 6 ADE datasets | — | — |
| SMM4H | Bio NLP | 2017 | Social Media Mining | Active | — | ADR detection F1 ~0.65-0.70 | — | — |
| WMDP-Bio | Biosecurity | 2024 | MCQ | Saturated | 7 mo | 87% | 60.5% | VCT, ABC-Bench |
| VCT | Biosecurity | 2025 | Multimodal QA | Active | — | 43.8% (o3) | 22.1% (expert virologists) | — |
| ABC-Bench | Biosecurity | 2025 | Agent Workflow | Active | — | 53% (Grok 3) | 24% (PhD experts) | — |
| ABLE | Biosecurity | 2025 | Agent Workflow | Active | — | 7/8 low-level tasks | — | — |
| BBG Framework / B3 | Biosecurity | 2025 | Open-ended | Unknown | — | Pilot phase | — | — |
| BioLP-Bench | Biosecurity | 2024 | Protocol Error Detection | Active | — | 34% (o4-mini) | ~38% (expert avg) | — |
| Scale AI FORTRESS (CBRNE) | Biosecurity | 2025 | Adversarial Safety | Active | — | Varies | — | — |
| Scale AI PropensityBench | Biosecurity | 2025 | Agent Safety | Active | — | Varies (5,874 tasks) | — | — |
| TroubleshootingBench | Biosecurity | 2025 | Protocol QA | Active | — | No model exceeds 80th-pct expert (36.4%) | 80th-pct expert: 36.4% | — |
| LAB-Bench | Agentic Bio | 2024 | Research Workflow | Saturated | 18 mo | 89%; several subtasks at ceiling | ~79% (expert on ProtocolQA) | LABBench2 |
| LABBench2 | Agentic Bio | 2026 | Research Workflow | Active | — | 26-46% drop vs LAB-Bench | — | — |
| BixBench | Agentic Bio | 2025 | Bioinformatics | Active | — | 17% (Claude 3.5 Sonnet); 9% (GPT-4o) | — | — |
| ScienceAgentBench | Agentic Bio | 2024 | Code Generation | Active | — | 42.2% (o1-preview + self-debug) | — | — |
| BioCoder | Agentic Bio | 2024 | Code Generation | Nearing | 24 mo | ~50% Pass@1 (GPT-4) | — | BixBench |
| MedAgentGym | Agentic Bio | 2025 | Training Environment | Active | — | N/A | — | — |
| LitQA2 / PaperQA2 | Agentic Bio | 2024 | Literature QA | Nearing | 24 mo | ~90% (PaperQA2) | ~67% (human experts) | — |
| BioProBench | Agentic Bio | 2025 | Protocol Understanding | Active | — | Varies | — | — |
| CheXpert | Medical Imaging | 2019 | Classification | Nearing | 84 mo | AUC ~0.94 | 2.6/3 radiologists | CheXpert Plus |
| MIMIC-CXR | Medical Imaging | 2019 | Report Generation | Active | — | RadCliQ-v1 0.92 | — | — |
| NIH ChestX-ray14 | Medical Imaging | 2017 | Classification | Nearing | 108 mo | AUC ~0.85-0.88 | — | CheXpert, MIMIC-CXR |
| PathVQA | Medical Imaging | 2020 | Visual QA | Active | — | 50-65% | — | GEMeX |
| VQA-RAD | Medical Imaging | 2018 | Visual QA | Unknown | — | 79.2% | — | — |
| SLAKE | Medical Imaging | 2021 | Visual QA | Active | — | ~78.7% | — | — |
| OmniMedVQA | Medical Imaging | 2024 | Visual QA | Active | — | LVLMs struggle | — | — |
| ISIC Challenges | Medical Imaging | 2016 | Classification/Segmentation | Nearing | 120 mo | Exceeds clinicians | Clinician level | 3D TBP (2024) |
| HAM10000 | Medical Imaging | 2018 | Classification | Nearing | 96 mo | ~96%+ accuracy | — | DermaMNIST-E |
| Camelyon16 | Medical Imaging | 2016 | Classification | Saturated | 72 mo | AUC 0.994 | Pathologist AUC | Camelyon17, Camelyon+ |
| Camelyon17 | Medical Imaging | 2017 | Classification | Active | — | Kappa ~0.89 | — | Camelyon+ |
| PANDA (Prostate Gleason) | Medical Imaging | 2020 | Classification | Active | — | QWK ~0.93+ | — | — |
| PCam | Medical Imaging | 2018 | Classification | Nearing | 96 mo | ~97%+ accuracy | — | — |
| BraTS | Medical Imaging | 2012 | Segmentation | Active | — | Dice ~0.88 whole tumor | — | BraTS 2025 Lighthouse |
| Medical Segmentation Decathlon | Medical Imaging | 2018 | Segmentation | Active | — | nnU-Net variants lead | — | — |
| TotalSegmentator | Medical Imaging | 2023 | Segmentation | Nearing | 36 mo | Avg Dice >0.90 for major organs | — | — |
| AMOS | Medical Imaging | 2022 | Segmentation | Active | — | Varies (15 organs) | — | AMOS-MM (2024) |
| CT-Bench | Medical Imaging | 2025 | Multimodal QA | Active | — | 61.8% | — | — |
| APTOS Diabetic Retinopathy | Medical Imaging | 2019 | Classification | Nearing | 84 mo | QWK 0.967 | — | — |
| EyePACS DR | Medical Imaging | 2015 | Classification | Saturated | 48 mo | AUC ~0.99 | — | — |
| RSNA Pneumonia Detection | Medical Imaging | 2018 | Object Detection | Active | — | Varies (1,400+ Kaggle teams) | — | — |
| ReXrank | Medical Imaging | 2024 | Report Generation | Active | — | 1/RadCliQ-v1 0.98 (ReXGradient) | — | — |
| RadGraph | Medical Imaging | 2021 | Entity/Relation Extraction | Active | — | Micro F1 0.82 | — | — |
| ReXVQA | Medical Imaging | 2025 | Visual QA | Active | — | 83.24% (MedGemma) | — | — |
| MatBench Discovery | Molecular/Drug Discovery | 2025 | Crystal Stability | Nearing | 12 mo | F1 0.924 | — | — |
| OC20 | Molecular/Drug Discovery | 2020 | Catalyst Prediction | Active | — | EquiformerV2 leads | — | OC22, OC25 |
| MedPerf | Clinical | 2023 | Federated Benchmarking | Active | — | N/A (platform) | — | — |
| FairMedQA | Clinical | 2025 | Fairness Evaluation | Active | — | 3-19 pct point accuracy disparity | — | — |
| MEDEC | Clinical | 2025 | Error Detection/Correction | Active | — | Varies | — | — |
| MedThink-Bench | Clinical | 2025 | Reasoning Evaluation | Active | — | Varies (500 QA pairs) | — | — |
| CLINIC | Clinical | 2025 | Trustworthiness | Active | — | DeepSeek-R1-LLaMA best robustness | — | — |
| MedAgentsBench | Clinical | 2025 | Multi-agent Reasoning | Active | — | 92.6% (TeamMedAgents) | — | — |
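A catalog this size is easier to query programmatically. Below is a minimal sketch of parsing the table's markdown source to tally benchmarks by saturation status; the abbreviated `sample` rows are illustrative stand-ins for the full table, and the parsing assumes the standard pipe-delimited layout used above.

```python
# Tally catalog rows by saturation status from a markdown table.
# The sample below reproduces only a few columns/rows for illustration.
from collections import Counter

def parse_rows(markdown: str):
    """Yield one dict per data row of a pipe-delimited markdown table."""
    lines = [l for l in markdown.strip().splitlines() if l.startswith("|")]
    header = [c.strip() for c in lines[0].strip("|").split("|")]
    for line in lines[2:]:  # skip the header row and the |---| separator
        cells = [c.strip() for c in line.strip("|").split("|")]
        yield dict(zip(header, cells))

sample = """
| Name | Domain | Status |
|---|---|---|
| MoleculeNet | Molecular/Drug Discovery | Nearing |
| GuacaMol | Molecular/Drug Discovery | Saturated |
| PMO | Molecular/Drug Discovery | Active |
"""

status_counts = Counter(row["Status"] for row in parse_rows(sample))
print(status_counts)  # Counter({'Nearing': 1, 'Saturated': 1, 'Active': 1})
```

The same loop extends naturally to other columns, e.g. grouping by `Domain` or filtering to rows whose `Successor` cell is non-empty.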