HealthBench Hard
1 model tested · Updated 2026-04-08 · Verified sources only
Muse Spark leads at 42.8%
1
Meta · Blog/Meta · 2026-04-08
New state of the art on health benchmarks. Trained with 1000+ physician collaborators. Beats GPT-5.4 (40.1%) and Gemini 3.1 Pro (20.6%).
42.8%