DAM LLM · Independent research · AI × DAM

Statistic · Benchmark · Third-party verified

94% (top score)

GPT-5.4 Pro leads the MMMU-Pro multimodal benchmark.

As of May 25, 2026, GPT-5.4 Pro holds the top position on MMMU-Pro at 94%, followed by Claude Mythos Preview at 92.7% and Gemini 3.1 Pro at 83.9%. The top tier has pulled away from the field: four models score above 83%, and the rest cluster between 60% and roughly 81%. The benchmark covers 30 subjects across six academic disciplines and was redesigned in 2024 specifically to remove the text-only shortcuts that inflated earlier multimodal scores.

As of: May 25, 2026
Sample: 27 models scored
Source: BenchLM.ai (third party)
Subjects: 30
Question type: Multimodal · vision-only inputs
Topic: AI Tagging

MMMU-Pro top 10 · May 25, 2026

Source: BenchLM.ai · 27 of 49 publicly-evaluated models shown

Rank  Model                   Provider     Score   License
1     GPT-5.4 Pro             OpenAI       94.0%   Closed
2     Claude Mythos Preview   Anthropic    92.7%   Closed
3     Gemini 3.1 Pro          Google       83.9%   Closed
4     Gemini 3.5 Flash        Google       83.6%   Closed
5     GPT-5.5 (medium)        OpenAI       81.0%   Closed
6     Gemini 3 Flash          Google       81.2%   Closed
7     Grok 4.1                xAI          79.5%   Closed
8     Kimi K2 Thinking        Moonshot AI  76.9%   Open
9     Qwen3-VL Instruct       Alibaba      75.8%   Open
10    Claude Haiku 4.5        Anthropic    73.8%   Closed
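
One way to sanity-check a leaderboard like the one above is to confirm that rank order agrees with score order. The snippet below is a minimal sketch using the rows as transcribed from this table; the function name is illustrative and is not part of any BenchLM.ai tooling.

```python
# Rows transcribed from the top-10 table: (rank, model, score in percent).
ROWS = [
    (1, "GPT-5.4 Pro", 94.0),
    (2, "Claude Mythos Preview", 92.7),
    (3, "Gemini 3.1 Pro", 83.9),
    (4, "Gemini 3.5 Flash", 83.6),
    (5, "GPT-5.5 (medium)", 81.0),
    (6, "Gemini 3 Flash", 81.2),
    (7, "Grok 4.1", 79.5),
    (8, "Kimi K2 Thinking", 76.9),
    (9, "Qwen3-VL Instruct", 75.8),
    (10, "Claude Haiku 4.5", 73.8),
]


def rank_score_inversions(rows):
    """Return adjacent (upper, lower) row pairs where the lower-ranked
    model has a strictly higher score than the one ranked above it."""
    ordered = sorted(rows, key=lambda row: row[0])
    return [
        (upper, lower)
        for upper, lower in zip(ordered, ordered[1:])
        if lower[2] > upper[2]
    ]


inversions = rank_score_inversions(ROWS)
for upper, lower in inversions:
    print(
        f"rank {upper[0]} {upper[1]} ({upper[2]}%) scores below "
        f"rank {lower[0]} {lower[1]} ({lower[2]}%)"
    )
```

Run against the table as published, this flags ranks 5 and 6 (81.0% listed above 81.2%). The page does not indicate whether the ordering or one of the two scores is the misprint.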

MMMU-Pro is a stricter variant of MMMU that filters out questions answerable by text alone, augments the answer set, and tests vision-only inputs. Scores on MMMU-Pro typically drop 17-27 percentage points relative to the original MMMU benchmark for the same model — see the related statistic. Source: BenchLM.ai, accessed 2026-05-26.

Why this benchmark matters more than the original MMMU

The original MMMU benchmark, released in 2023, became the industry default for multimodal evaluation — and within a year, every major lab had a model scoring 60-70% on it. But researchers noticed an awkward fact: many of the highest-scoring "multimodal" answers could be reached by a text-only model that just guessed cleverly from the question wording. The benchmark was leaking signal through its language.

MMMU-Pro, introduced in late 2024, fixes three failure modes:

  1. Text shortcut filtering. Questions answerable by a strong text-only model (without seeing the image) are removed.
  2. Augmented options. The original four-option multiple choice is expanded to as many as ten options, which makes elimination and lucky guessing far less effective.
  3. Vision-only setting. The hardest variant embeds the question text inside the image — forcing the model to genuinely read and integrate visual and textual signal.
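
The first step in the list above, text shortcut filtering, can be illustrated with a small sketch. This is not the benchmark authors' pipeline: `Question`, `TextOnlyModel`, `filter_text_shortcuts`, the `max_solvers` threshold, and the toy `pick_longest` heuristic are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Question:
    qid: str
    text: str
    options: tuple
    answer: str  # the correct option, stored as its text


# Hypothetical interface: a text-only model sees the question text and the
# options, never the image, and returns the option it picks.
TextOnlyModel = Callable[[str, Sequence[str]], str]


def filter_text_shortcuts(questions, models, max_solvers=0):
    """Keep questions that at most `max_solvers` text-only models answer
    correctly. The threshold is an assumption made for this sketch."""
    kept = []
    for q in questions:
        solvers = sum(1 for model in models if model(q.text, q.options) == q.answer)
        if solvers <= max_solvers:
            kept.append(q)
    return kept


def pick_longest(text, options):
    """Toy stand-in for a text-only model: always choose the longest option."""
    return max(options, key=len)


questions = [
    Question("q1", "Which structure is labelled X?",
             ("A cell", "The mitochondrion, which produces ATP"),
             "The mitochondrion, which produces ATP"),
    Question("q2", "What does the arrow in the figure point to?",
             ("Aorta", "Vena cava", "Left atrium"),
             "Aorta"),
]

kept = filter_text_shortcuts(questions, [pick_longest])
print([q.qid for q in kept])
```

In this toy run the longest-option heuristic solves `q1` without any image, so only `q2` survives the filter. A real filter would substitute strong text-only language models for the heuristic.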

The result is the clearest publicly available test of whether a model can actually see rather than just answer from priors. On the original MMMU, Claude 3.5 Sonnet scored 68.3%; on MMMU-Pro the same model scored 48.0%. That gap of 20.3 percentage points is roughly the share of the original score that the stricter benchmark strips away as not reflecting real visual reasoning.
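
The arithmetic behind that comparison is simple enough to check directly. The sketch below uses only the two Claude 3.5 Sonnet scores quoted above and the 17-27 point range this page gives for typical MMMU-to-MMMU-Pro drops; the helper name is illustrative.

```python
def benchmark_drop(mmmu_score, mmmu_pro_score):
    """Percentage-point drop from MMMU to MMMU-Pro, rounded to one decimal."""
    return round(mmmu_score - mmmu_pro_score, 1)


# Scores quoted on this page for Claude 3.5 Sonnet.
drop = benchmark_drop(68.3, 48.0)

# Typical range stated on this page for the same-model drop.
TYPICAL_DROP_RANGE = (17.0, 27.0)
in_typical_range = TYPICAL_DROP_RANGE[0] <= drop <= TYPICAL_DROP_RANGE[1]

print(drop, in_typical_range)
```

The drop comes out at 20.3 points, which sits inside the stated 17-27 point range.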

What's actually changed since 2024

Comparing the late-2024 results with the May 2026 leaderboard, every model in the top 10 is at least one generation newer than anything tested when the benchmark launched, and the top score has risen from 56% (GPT-4o 0513) to 94% (GPT-5.4 Pro), a gain of 38 percentage points in roughly 18 months. The top-tier models now score above the documented human-expert "medium" rater band on the original MMMU (82.1%), although the human-expert ceiling has not been re-measured on MMMU-Pro.

Cite this statistic

DAM LLM Research. "MMMU-Pro multimodal benchmark leaderboard, May 2026." damllm.ai, 2026. https://damllm.ai/statistics/mmmu-pro-leaderboard-may-2026/