GPT-5.4 Pro leads the MMMU-Pro multimodal benchmark.

As of May 25, 2026, GPT-5.4 Pro holds the top position on MMMU-Pro at 94%, followed by Claude Mythos Preview at 92.7% and Gemini 3.1 Pro at 83.9%. The top tier has pulled away from the field: four models score above 83%, and the rest cluster between 60% and roughly 81%. The benchmark covers 30 academic disciplines and was redesigned in 2024 specifically to remove the text-only shortcuts that inflated earlier multimodal scores.

As of: May 25, 2026
Sample: 27 models scored
Source: BenchLM.ai (third party)
Disciplines: 30
Question type: Multimodal · vision-only inputs
Topic: AI Tagging

MMMU-Pro top 10 · May 25, 2026

Source: BenchLM.ai · 27 of 49 publicly-evaluated models shown

Rank	Model	Provider	Score	License
1	GPT-5.4 Pro	OpenAI	94.0%	Closed
2	Claude Mythos Preview	Anthropic	92.7%	Closed
3	Gemini 3.1 Pro	Google	83.9%	Closed
4	Gemini 3.5 Flash	Google	83.6%	Closed
5	Gemini 3 Flash	Google	81.2%	Closed
6	GPT-5.5 (medium)	OpenAI	81.0%	Closed
7	Grok 4.1	xAI	79.5%	Closed
8	Kimi K2 Thinking	Moonshot AI	76.9%	Open
9	Qwen3-VL Instruct	Alibaba	75.8%	Open
10	Claude Haiku 4.5	Anthropic	73.8%	Closed
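
The claim that the leaders have pulled away can be checked directly against the table. The following is a minimal Python sketch, with the scores copied from the rows above; the helper names largest_gap and best_open are chosen here for illustration and do not come from BenchLM.ai.

```python
# Top-10 rows copied from the table above: (model, score in %, license).
ROWS = [
    ("GPT-5.4 Pro", 94.0, "Closed"),
    ("Claude Mythos Preview", 92.7, "Closed"),
    ("Gemini 3.1 Pro", 83.9, "Closed"),
    ("Gemini 3.5 Flash", 83.6, "Closed"),
    ("GPT-5.5 (medium)", 81.0, "Closed"),
    ("Gemini 3 Flash", 81.2, "Closed"),
    ("Grok 4.1", 79.5, "Closed"),
    ("Kimi K2 Thinking", 76.9, "Open"),
    ("Qwen3-VL Instruct", 75.8, "Open"),
    ("Claude Haiku 4.5", 73.8, "Closed"),
]


def largest_gap(rows):
    """Return (higher_model, lower_model, gap) for the widest gap between
    adjacent scores once the rows are sorted from highest to lowest."""
    ranked = sorted(rows, key=lambda r: r[1], reverse=True)
    pairs = zip(ranked, ranked[1:])
    hi, lo = max(pairs, key=lambda p: p[0][1] - p[1][1])
    return hi[0], lo[0], round(hi[1] - lo[1], 1)


def best_open(rows):
    """Return the highest-scoring row whose license is 'Open'."""
    return max((r for r in rows if r[2] == "Open"), key=lambda r: r[1])


if __name__ == "__main__":
    print(largest_gap(ROWS))
    print(best_open(ROWS)[0])
```

On these ten rows the widest break is the 8.8 points between Claude Mythos Preview and Gemini 3.1 Pro, and the best open-weight entry is Kimi K2 Thinking.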

MMMU-Pro is a stricter variant of MMMU that filters out questions answerable by text alone, augments the answer options, and adds a vision-only input setting. For the same model, scores on MMMU-Pro typically fall 17-27 percentage points below those on the original MMMU benchmark (see the related statistic). Source: BenchLM.ai, accessed 2026-05-26.

Why this benchmark matters more than the original MMMU

The original MMMU benchmark, released in 2023, became the industry default for multimodal evaluation — and within a year, every major lab had a model scoring 60-70% on it. But researchers noticed an awkward fact: many of the highest-scoring "multimodal" answers could be reached by a text-only model that just guessed cleverly from the question wording. The benchmark was leaking signal through its language.

MMMU-Pro, introduced in late 2024, fixes three failure modes:

Text shortcut filtering. Questions answerable by a strong text-only model (without seeing the image) are removed.
Augmented options. The original four-option multiple-choice format is expanded to up to ten options, which makes guessing and elimination much less effective.
Vision-only setting. The hardest variant embeds the question text inside the image — forcing the model to genuinely read and integrate visual and textual signal.
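
The first two steps can be sketched in Python. This is an illustration only, not the authors' pipeline: the Question record, its text_only_correct field, the min_models threshold, and the assumption of ten answer options after augmentation are all placeholders chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Question:
    qid: str
    # Hypothetical field: for each text-only model, whether it answered
    # correctly without ever seeing the image.
    text_only_correct: Dict[str, bool]


def filter_text_answerable(questions: List[Question], min_models: int = 3) -> List[Question]:
    """Keep only questions that fewer than `min_models` text-only models solved.

    The threshold is an assumption for this sketch, not a documented setting.
    """
    return [
        q for q in questions
        if sum(q.text_only_correct.values()) < min_models
    ]


def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing, in percent."""
    return 100.0 / num_options


if __name__ == "__main__":
    pool = [
        Question("q1", {"m1": True, "m2": True, "m3": True, "m4": False}),
        Question("q2", {"m1": False, "m2": True, "m3": False, "m4": False}),
    ]
    kept = filter_text_answerable(pool)
    print([q.qid for q in kept])   # q1 is dropped: three text-only models solved it
    print(chance_accuracy(4), chance_accuracy(10))
```

Under these assumptions, moving from four options to ten lowers the random-guess floor from 25% to 10%, so a model gains far less from elimination alone.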

The result is the clearest publicly available test of whether a model can actually see rather than just answer from priors. On the original MMMU, Claude 3.5 Sonnet scored 68.3%; on MMMU-Pro the same model scored 48.0%. That gap of 20.3 percentage points is roughly the portion of the original score that the stricter benchmark attributes to shortcuts rather than to visual reasoning.
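
The arithmetic behind that gap is simple enough to reproduce. The sketch below uses only the two Claude 3.5 Sonnet scores quoted above and the 17-27 point range cited earlier on this page; the function name score_drop is an illustrative choice.

```python
def score_drop(mmmu: float, mmmu_pro: float):
    """Return (absolute drop in percentage points,
    relative drop as a percentage of the original MMMU score)."""
    absolute = round(mmmu - mmmu_pro, 1)
    relative = round(100.0 * (mmmu - mmmu_pro) / mmmu, 1)
    return absolute, relative


# Claude 3.5 Sonnet scores as quoted in the text.
absolute, relative = score_drop(68.3, 48.0)

# Check the drop against the 17-27 point range cited for typical models.
in_typical_band = 17.0 <= absolute <= 27.0

if __name__ == "__main__":
    print(absolute, relative, in_typical_band)
```

The absolute drop is 20.3 points, about 29.7% of the original score, which sits inside the cited 17-27 point range.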

What's actually changed since 2024

Comparing the late-2024 results with the May 2026 leaderboard, every model in the top 10 is at least one generation newer than anything tested when the benchmark launched, and the top score has risen from 56% (GPT-4o 0513) to 94% (GPT-5.4 Pro), an absolute gain of 38 percentage points in about 18 months. The top four models now score above the documented human-expert "medium" rater band on the original MMMU (82.1%), although the human-expert ceiling has not been re-measured on MMMU-Pro.
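
The size and pace of that change can be worked out from the figures in this section. This is a rough sketch: it treats the 18-month span as exact, takes the 82.1% band at face value, and uses the top-10 scores from the leaderboard above; the helper names are illustrative.

```python
def improvement(old: float, new: float, months: int):
    """Return (gain in percentage points, average gain per month)."""
    gain = round(new - old, 1)
    return gain, round(gain / months, 2)


def count_above(scores, band: float) -> int:
    """Count scores strictly above a reference band."""
    return sum(1 for s in scores if s > band)


# Top-10 scores from the May 2026 leaderboard on this page.
TOP10 = [94.0, 92.7, 83.9, 83.6, 81.0, 81.2, 79.5, 76.9, 75.8, 73.8]

if __name__ == "__main__":
    print(improvement(56.0, 94.0, 18))
    print(count_above(TOP10, 82.1))
```

That works out to 38 points, or about 2.11 points per month on average, with four of the ten listed models above the 82.1% band. The band comes from the original MMMU, so the comparison is indicative only.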

Sources

BenchLM.ai MMMU-Pro leaderboard — benchlm.ai/benchmarks/mmmuPro (accessed 2026-05-26)
Artificial Analysis MMMU-Pro evaluation — artificialanalysis.ai/evaluations/mmmu-pro
LLM-Stats MMMU-Pro leaderboard — llm-stats.com/benchmarks/mmmu-pro
Original MMMU-Pro paper — arxiv.org/html/2409.02813v3 (Yue et al., 2024)

Cite this statistic

DAM LLM Research. "MMMU-Pro multimodal benchmark leaderboard, May 2026." damllm.ai, 2026. https://damllm.ai/statistics/mmmu-pro-leaderboard-may-2026/
