MRCR v2 Leaderboard 2026 — Results Across 18 Real AI Models

18 models tested · Updated 2025-12-11 · Verified sources only

GPT-5.2 leads at 98.0%

GPT-5.2

OpenAI · Vellum — GPT-5.2 Benchmarks · 2025-12-11

4-needle variant at 128k context. The 8-needle variant at 128k scores 85%.

98.0%

Claude Opus 4.6

Anthropic · Anthropic — Introducing Claude Opus 4.6 · 2026-02-05

Highest published match ratio at 256k context. Drops to 76% at 1M tokens.

93.0%

Gemini 3.1 Pro

Google · Blog/Google DeepMind · 2026-02-19

Long-context retrieval. Strong at 128k but drops to 26.3% at 1M tokens.

84.9%

DeepSeek V4 Pro

DeepSeek · DeepSeek/HuggingFace · 2026-04-24

1M context. Below Opus 4.6 (92.9) but competitive.

83.5%

DeepSeek V4 Flash

DeepSeek · HuggingFace/deepseek-ai · 2026-04-24

Solid long-context retrieval at 1M tokens for a model with 13B activated parameters.

78.7%

Gemini 3.5 Flash

Google DeepMind · Blog/Google DeepMind · 2026-06-10

128k 8-needle average. Below Gemini 3.1 Pro (84.9).

77.3%

Hunyuan Hy3

Tencent · HuggingFace/tencent-Hy3-modelcard · 2026-07-06

Up from 42.9% in Hy3 Preview. Major long-context improvement.

75.1%

GPT-5.5

OpenAI · OpenAI Blog · 2026-04-23

512k-1M context range. Leads GPT-5.4 (36.6) and Opus 4.6 (32.2) by a wide margin.

74.0%

Gemini 3 Flash

Google DeepMind · Blog/Google DeepMind · 2026-06-10

128k 8-needle. Below Gemini 3.5 Flash (77.3).

67.2%

Gemma 4 31B

Google · Model Card/Google · 2026-04-02

8-needle variant at 128k context. Solid long-context retrieval for a 31B dense model.

66.4%

Claude Opus 4.7

Anthropic · Blog/Google DeepMind · 2026-06-10

128k 8-needle. Below Gemini 3.1 Pro (84.9).

59.3%

Claude Opus 4.7

Anthropic · Blog/OpenAI · 2026-04-16

Tested by OpenAI at 128k-256k context. Much weaker long-context retrieval than GPT-5.5.

59.2%

MiMo-V2-Flash

Xiaomi · HuggingFace/XiaomiMiMo-MiMo-V2-Flash · 2026-01-06

Long-context retrieval. Lower than frontier models (89.7).

45.7%

Gemma 4 26B A4B

Google · HuggingFace/Google DeepMind · 2026-04-02

Multi-round co-reference resolution, 128k context.

44.1%

GPT-5.4

OpenAI · OpenAI Blog · 2026-04-23

512k-1M context range. Trails GPT-5.5 (74.0) by a wide margin.

36.6%

GLM-5

Zhipu · DocsBot AI benchmark comparison · 2026-02-11

Sharp degradation at extreme context length. Drops from 77% at 128k.

26.3%

Gemma 4 E4B

Google · HuggingFace/Google DeepMind · 2026-04-02

Multi-round co-reference resolution, 128k context.

25.4%

Gemma 4 E2B

Google · HuggingFace/Google DeepMind · 2026-04-02

Multi-round co-reference resolution, 128k context.

19.1%
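
Several notes above quote a score at a shorter context next to a score at a longer one (Claude Opus 4.6, Gemini 3.1 Pro, GLM-5). As a rough illustration of that arithmetic, the sketch below computes the absolute and relative drop for those pairs. The function name `degradation` and the `PAIRS` layout are hypothetical, and the GLM-5 pairing assumes its listed 26.3% is the long-context score that its note contrasts with 77% at 128k.

```python
# Score pairs (percent) copied from the leaderboard notes above:
# (score at the shorter context, score at the longer context).
# Assumption: GLM-5's listed 26.3% is the "extreme context length" score
# its note contrasts with 77% at 128k; the exact length is not stated.
PAIRS = {
    "Claude Opus 4.6": (93.0, 76.0),   # 256k -> 1M
    "Gemini 3.1 Pro": (84.9, 26.3),    # 128k -> 1M
    "GLM-5": (77.0, 26.3),             # 128k -> unspecified longer context
}


def degradation(short_score: float, long_score: float) -> tuple[float, float]:
    """Return (drop in percentage points, drop as a fraction of the short score)."""
    drop = short_score - long_score
    return round(drop, 1), round(drop / short_score, 3)


if __name__ == "__main__":
    for model, (short_score, long_score) in PAIRS.items():
        points, fraction = degradation(short_score, long_score)
        print(f"{model}: -{points} points ({fraction:.1%} relative)")
```

The pairs are not like-for-like across models, since the shorter context is 256k for Claude Opus 4.6 and 128k for the other two, so the output is best read per model rather than as a ranking.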