HumanEval Leaderboard 2026 — Results Across 19 Real AI Models

19 models tested · Updated 2026-01-27 · Verified sources only

Kimi K2.5 leads at 99.0%

Kimi K2.5

Moonshot AI · NVIDIA NIM / Moonshot AI · 2026-01-27

Reasoning variant. Near-perfect score. Top of the HumanEval leaderboard as of March 2026.

99.0%

Claude Opus 4.7

Anthropic · Blog/Anthropic · 2026-04-16

Near-perfect HumanEval score. Consistent with Opus 4.6-era performance.

96.3%

ReflexiCoder-8B

Research · arxiv/2603.05863 · 2026-03-06

8B model matching or beating GPT-5.1 on coding benchmarks. RL framework internalizes self-reflection into model weights.

95.73%

JoyAI-LLM Flash

arxiv · arxiv/2604.03044 · 2026-04-03

New 48B MoE model (2.7B active) trained on 20T tokens with novel FiberPO RL algorithm. Open-sourced on HuggingFace. Competitive with much larger models on SWE-bench Verified (62.6%), LiveCodeBench v6

94.5%

InCoder-32B-Thinking

Kuaishou · arxiv/2604.03144 · 2026-04-03

Strong code generation. Industrial code world model with ECoT synthesis framework.

88.9%

JoyAI-LLM Flash

JD.com · Paper/JD.com (arXiv) · 2026-04-03

48B MoE, only 2.7B active params. Competitive scores at a fraction of the compute.

87.5%

DMax-Coder

arxiv · arxiv/2604.08302 · 2026-04-10

Aggressive parallel decoding for diffusion LLMs that preserves accuracy while achieving 2x+ throughput gains over LLaDA-2.0-mini. At low parallelism, DMax-Math improves MATH500 from 75.8% to 78.0% and

87.2%

GPT-5

arxiv · arxiv/2604.06753 · 2026-04-08

Independent evaluation of 4 frontier LLMs across 6 reasoning paradigms and 10 benchmarks. No single paradigm dominates; learned router recovers 37% of oracle gap.

85.0%

Qwen 3 Max

arxiv · arxiv/2604.06753 · 2026-04-08

Independent evaluation of 4 frontier LLMs across 6 reasoning paradigms and 10 benchmarks. No single paradigm dominates; learned router recovers 37% of oracle gap.

85.0%

DeepSeek-R1-Distill-Qwen-32B (W16A16)

arxiv · arxiv/2603.25284 · 2026-03-26

SliderQuant preserves accuracy in 4-bit quantized LLMs. DeepSeek-R1-Distill-Qwen-14B at W4A16 retains 94.6% MATH and 91.35% GSM8K vs full-precision. Even W2A16 keeps 29.4% MATH vs 0% for OmniQuant. Qw

81.71%

MARS-7B

arxiv · arxiv/2604.07023 · 2026-04-09

Lightweight fine-tuning that teaches AR models to predict multiple tokens per forward pass. At 7B on Qwen2.5, MARS improves GSM8K by 4.5 points and HumanEval by 3.0 points over AR SFT while enabling 1

81.7%

DeepSeek-R1-Distill-Qwen-32B SliderQuant W4A16

arxiv · arxiv/2603.25284 · 2026-03-26

80.49%

Nemotron 3 Nano 30B

NVIDIA · arxiv/2512.20848 · 2025-12-23

3B active hybrid Mamba-Transformer MoE. 0-shot HumanEval.

78.05%

DeepSeek V4 Pro Base

DeepSeek · DeepSeek/HuggingFace · 2026-04-24

Base model score (pre-instruct), not the instruct model.

76.8%

DeepSeek-R1-Distill-Qwen-14B (W16A16)

arxiv · arxiv/2603.25284 · 2026-03-26

73.17%

Gemini Flash

arxiv · arxiv/2604.06753 · 2026-04-08

Independent evaluation of 4 frontier LLMs across 6 reasoning paradigms and 10 benchmarks. No single paradigm dominates; learned router recovers 37% of oracle gap.

73.0%

DeepSeek-R1-Distill-Qwen-14B SliderQuant W4A16

arxiv · arxiv/2603.25284 · 2026-03-26

72.56%
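
The four DeepSeek-R1-Distill-Qwen rows on this page (W16A16 and SliderQuant W4A16, at 32B and 14B) allow a quick check of how much HumanEval accuracy survives 4-bit weight quantization. The sketch below uses only the scores listed here; the `retention` helper and the `SCORES` table are illustrative names, not part of any cited paper or tool.

```python
# HumanEval scores (percent) copied from the leaderboard rows on this page.
SCORES = {
    "32B": {"W16A16": 81.71, "W4A16": 80.49},
    "14B": {"W16A16": 73.17, "W4A16": 72.56},
}


def retention(full: float, quantized: float) -> float:
    """Return the quantized score as a percentage of the full-precision score."""
    if full <= 0:
        raise ValueError("full-precision score must be positive")
    return 100.0 * quantized / full


for size, s in SCORES.items():
    drop = s["W16A16"] - s["W4A16"]            # absolute gap in percentage points
    kept = retention(s["W16A16"], s["W4A16"])  # relative retention in percent
    print(f"{size}: drop {drop:.2f} points, retains {kept:.1f}% of W16A16")
```

Run as written, this reports a 1.22-point drop (98.5% retained) for the 32B model and a 0.61-point drop (99.2% retained) for the 14B model. If the scores are pass@1 over the 164 HumanEval problems (an assumption, since the page does not state the protocol), those gaps correspond to two problems and one problem respectively.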

MARS-0.5B

arxiv · arxiv/2604.07023 · 2026-04-09

40.2%

Qwen 3 30B A3B

arxiv · arxiv/2604.06753 · 2026-04-08

Independent evaluation of 4 frontier LLMs across 6 reasoning paradigms and 10 benchmarks. No single paradigm dominates; learned router recovers 37% of oracle gap.

18.0%