Multimodal LLM scores drop 17 to 27 points when text shortcuts are removed.

When the MMMU benchmark is reformulated to filter out questions answerable by text alone and to embed the question inside the image (the MMMU-Pro variant), top multimodal LLMs lose 17 to 27 percentage points of accuracy. This is the most direct public estimate to date of how much of a "multimodal" model's score reflects genuine visual reasoning rather than pattern-matching on the question text.

As of: 2024 (paper) · still current benchmark
Source: Yue et al., arXiv 2409.02813
Models tested: 6 proprietary + 14 open
Δ range: 16.8 to 26.9 points
Question setting: Vision-only (question embedded in image)
Topic: AI Tagging

Score on original MMMU vs vision-only MMMU-Pro · same model

Source: Yue et al. (2024), Table 1 · selected models

Model	MMMU (Val)	MMMU-Pro (vision-only)	Δ (drop)
GPT-4o (May 2024)	69.1%	49.7%	−19.4 pts
Claude 3.5 Sonnet	68.3%	48.0%	−20.3 pts
Gemini 1.5 Pro (Aug 2024)	65.8%	44.4%	−21.4 pts
Gemini 1.5 Pro (May 2024)	62.2%	40.5%	−21.7 pts
GPT-4o-mini	59.4%	35.2%	−24.2 pts
VILA-1.5-40B (open)	62.0%	20.0%	−42.0 pts

"MMMU (Val)" is the validation set of the original 2023 MMMU benchmark. "MMMU-Pro (vision-only)" is the strictest 2024 variant: the question text is embedded inside the image, the answer options are augmented, and questions answerable by a text-only model are removed. Source: Yue et al., 2024 (arXiv:2409.02813v3), Table 1.

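The Δ column is plain subtraction, so it can be rechecked from the two score columns. A minimal sketch, with the figures copied from the table above; the names `SCORES`, `drop_points`, and `DROPS` are illustrative only.

```python
# Scores copied from the table above: (MMMU Val %, MMMU-Pro vision-only %).
SCORES = {
    "GPT-4o (May 2024)": (69.1, 49.7),
    "Claude 3.5 Sonnet": (68.3, 48.0),
    "Gemini 1.5 Pro (Aug 2024)": (65.8, 44.4),
    "Gemini 1.5 Pro (May 2024)": (62.2, 40.5),
    "GPT-4o-mini": (59.4, 35.2),
    "VILA-1.5-40B (open)": (62.0, 20.0),
}


def drop_points(mmmu: float, mmmu_pro: float) -> float:
    """Absolute drop in percentage points, rounded to one decimal."""
    return round(mmmu - mmmu_pro, 1)


# Recompute the drop for every model in the table.
DROPS = {model: drop_points(*pair) for model, pair in SCORES.items()}

for model, drop in DROPS.items():
    print(f"{model}: -{drop:.1f} pts")
```
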
Why the gap exists

The MMMU-Pro authors documented three distinct ways a "multimodal" model can score high on MMMU without doing much visual reasoning:

Text leakage. Roughly 21% of MMMU questions can be answered by a strong text-only model that never sees the image, because the question phrasing leaks the answer.
Option elimination. With four answer options, a model that's only slightly better than random on the actual content can score well by ruling out two obvious wrong answers using language priors.
Caption shortcuts. Many MMMU images include text labels (chemistry diagrams, math notation, chart legends). A model with good OCR can ignore the visual structure and still answer correctly from the text inside the image.
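
The option-elimination effect is easy to quantify. The sketch below computes the expected accuracy of a model that knows nothing about the content, discards some wrong options, and then guesses uniformly among the rest. The function name is illustrative, and the ten-option case stands in for an augmented option set such as the one MMMU-Pro uses.

```python
from fractions import Fraction


def guess_accuracy(n_options: int, n_eliminated: int = 0) -> Fraction:
    """Expected accuracy when guessing uniformly among the options left
    after discarding `n_eliminated` wrong ones (exactly one option is correct)."""
    remaining = n_options - n_eliminated
    if remaining < 1:
        raise ValueError("cannot eliminate every option")
    return Fraction(1, remaining)


four_blind = guess_accuracy(4)        # pure guessing, four options
four_pruned = guess_accuracy(4, 2)    # two wrong options ruled out
ten_pruned = guess_accuracy(10, 2)    # same trick with ten options

print(float(four_blind), float(four_pruned), float(ten_pruned))
```

Ruling out two options doubles the guessing baseline with four options, but leaves it at one in eight with ten.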

MMMU-Pro neutralizes all three by filtering text-leakable questions, augmenting options, and rendering the question itself as part of the image so OCR becomes a prerequisite rather than a shortcut.
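
The first of these steps, filtering text-leakable questions, can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the `Question` record, its `text_only_votes` field, and the majority threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Question:
    qid: str
    # Hypothetical field: one boolean per text-only model, True if that
    # model answered correctly without seeing the image.
    text_only_votes: list[bool]


def is_text_leakable(q: Question, min_share: float = 0.5) -> bool:
    """Flag a question when more than `min_share` of the text-only models
    answered it correctly. The threshold is an assumption for this sketch."""
    if not q.text_only_votes:
        return False
    return sum(q.text_only_votes) / len(q.text_only_votes) > min_share


def filter_questions(questions: list[Question]) -> list[Question]:
    """Keep only the questions that are not flagged as text-leakable."""
    return [q for q in questions if not is_text_leakable(q)]


sample = [
    Question("q1", [True, True, True, False]),    # dropped: 3 of 4 correct
    Question("q2", [False, False, True, False]),  # kept: 1 of 4 correct
    Question("q3", [True, True, False, False]),   # kept: exactly half, no majority
]
kept_ids = [q.qid for q in filter_questions(sample)]
print(kept_ids)
```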

How to read the gap

The size of the gap indicates how heavily a model leans on text shortcuts. In the table above, the proprietary frontier models lose roughly 19 to 24 percentage points: a meaningful drop, but each retains more than half of its MMMU score. VILA-1.5-40B, the one open model shown, loses 42 points. That is consistent with a weaker visual encoder and heavier reliance on text priors, and it suggests the model's public MMMU score significantly overstated its genuine visual capability.

For an operator deciding which model to put in front of a real digital asset management (DAM) tagging pipeline, the MMMU-Pro score is a better proxy for production quality than the headline MMMU number. A 20-point absolute difference in MMMU-Pro is likely to show up as visibly different tag quality on an in-the-wild image set.
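
One way to see why the choice of benchmark matters is to rank the same models under each score. A minimal sketch using the figures from the table above; the helper `ranking` and the variable names are illustrative.

```python
# Scores copied from the table above: (MMMU Val %, MMMU-Pro vision-only %).
SCORES = {
    "GPT-4o (May 2024)": (69.1, 49.7),
    "Claude 3.5 Sonnet": (68.3, 48.0),
    "Gemini 1.5 Pro (Aug 2024)": (65.8, 44.4),
    "Gemini 1.5 Pro (May 2024)": (62.2, 40.5),
    "GPT-4o-mini": (59.4, 35.2),
    "VILA-1.5-40B (open)": (62.0, 20.0),
}


def ranking(column: int) -> list[str]:
    """Model names sorted best-first by the given score column
    (0 = MMMU Val, 1 = MMMU-Pro vision-only)."""
    ordered = sorted(SCORES.items(), key=lambda kv: kv[1][column], reverse=True)
    return [model for model, _ in ordered]


by_mmmu = ranking(0)
by_mmmu_pro = ranking(1)

# Models whose position differs between the two rankings.
moved = [m for m in SCORES if by_mmmu.index(m) != by_mmmu_pro.index(m)]
print(moved)
```

With these figures only the last two places change: VILA-1.5-40B ranks above GPT-4o-mini on MMMU but below it on MMMU-Pro.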

Sources

Primary source — Yue et al. "MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark." arXiv 2409.02813v3, 2024. arxiv.org/html/2409.02813v3
Original MMMU paper — Yue et al. "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark." CVPR 2024.
Live leaderboard — BenchLM.ai · Artificial Analysis

Cite this statistic

DAM LLM Research. "Multimodal LLM scores drop 17-27 points when text shortcuts are removed." damllm.ai, 2026. https://damllm.ai/statistics/vision-text-shortcut-gap/
