DAM LLM · Independent research · AI × DAM

Statistic · Vision API economics · Cited from vendor docs

70× spread

A single image costs anywhere from $0.0002 (1 MP on Gemini 2.0 Flash) to $0.014 (3 MP on Claude Opus 4.7) to analyze.

Across the seven leading frontier multimodal LLM products available in May 2026, analyzing an image costs up to 70× more on Claude Opus 4.7 (at its 3 MP setting) than a one-megapixel image costs on Gemini 2.0 Flash, before a single output token is billed. With both at 1 MP, the gap is about 35×. The cost surface is not what the per-token pricing page suggests: the "mini" models, which appear cheap by token rate, can cost more per image than the flagship models because of how their image tokenizers expand pixels into tokens.

As of: May 26, 2026
Image tested: 1 MP (1000×1000 px)
Sample: n=7 products
Source: Vendor docs + math
Updated: Monthly
Topic: AI Tagging

Cost to analyze a single 1 MP image · May 2026

Input tokens only · output cost depends on response length

| Provider | Model | Source | Image tokens (1 MP) | Input rate (USD / 1M tokens) | Cost / image | Cost / 1k images |
| --- | --- | --- | --- | --- | --- | --- |
| Google | Gemini 2.0 Flash | ai.google.dev | ~258 (1 tile) | ~$0.075 | $0.00019 | $0.19 |
| OpenAI | GPT-4o (high detail) | platform.openai.com | 765 (85 + 170 × 4 tiles) | $2.50 | $0.0019 | $1.91 |
| Google | Gemini 1.5 Pro | cloud.google.com | ~1,000 | $3.00 | $0.0030 | $3.00 |
| Anthropic | Claude Sonnet 4.6 | platform.claude.com | ~1,334 | $3.00 | $0.0040 | $4.00 |
| Anthropic | Claude Opus 4.7 (1 MP) | platform.claude.com | ~1,334 | $5.00 | $0.0067 | $6.70 |
| OpenAI | GPT-4o-mini (high detail, max) | platform.openai.com | ~48,000 (effective) | $0.15 | $0.0072 | $7.20 |
| Anthropic | Claude Opus 4.7 (3 MP) | platform.claude.com | ~2,800 | $5.00 | $0.0140 | $14.00 |

Token counts use each vendor's published image-tokenization formula. OpenAI tile formula (high detail): tokens = 85 + 170 × ceil(w/512) × ceil(h/512). Anthropic megapixel formula: tokens ≈ (width × height) / 750. Gemini tile formula: 258 tokens per 768×768 tile. See "How we computed this" below.
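The three formulas in the note above can be written out as a minimal sketch. The function names are hypothetical, and the sketch makes two simplifying assumptions: it applies the OpenAI tile formula to the pixel dimensions as given (ignoring any resizing the vendor performs before tiling), and it takes the Gemini tile count as an input, since the table treats the 1 MP test image as a single tile.

```python
import math


def openai_high_detail_tokens(width: int, height: int) -> int:
    """OpenAI high-detail formula as quoted: 85 base + 170 per 512x512 tile."""
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles


def anthropic_tokens(width: int, height: int) -> float:
    """Anthropic approximation as quoted: (width x height) / 750."""
    return (width * height) / 750


def gemini_tokens(tiles: int = 1) -> int:
    """Gemini formula as quoted: 258 tokens per tile (tile count supplied by caller)."""
    return 258 * tiles


# The 1 MP (1000x1000 px) test image used in the table.
print(openai_high_detail_tokens(1000, 1000))     # 765
print(math.ceil(anthropic_tokens(1000, 1000)))   # 1334
print(gemini_tokens())                           # 258
```

These reproduce the 765, ~1,334, and ~258 token counts listed in the table.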

The "mini" trap

GPT-4o-mini lists at $0.15 per million input tokens, roughly sixteen times cheaper than full GPT-4o at $2.50. Operators routinely assume that means they'll spend sixteen times less on image inference. They do not. OpenAI's tokenizer expands a maximum-size image (768 × 2048 px) into roughly 48,000 effective tokens on the mini model. By the high-detail tile formula, the same image is about 1,445 tokens (85 + 170 × 8 tiles) on full GPT-4o, and the 1 MP test image is 765 tokens. At the listed rates that is about $0.0072 on mini against about $0.0036 on GPT-4o, so processing a single large image can cost roughly twice as much on mini as on the flagship.

This behavior is discussed in OpenAI's developer community (linked in the methodology) but is not surfaced on the headline pricing page, so it commonly catches operators on first deployment.
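The comparison can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the 48,000 effective-token figure for GPT-4o-mini is taken directly from the text rather than derived, the GPT-4o count comes from applying the high-detail tile formula to a 768 × 2048 px image, and the rates are the ones listed in the table.

```python
import math

GPT4O_RATE = 2.50               # USD per 1M input tokens (from the table)
MINI_RATE = 0.15                # USD per 1M input tokens (from the table)
MINI_EFFECTIVE_TOKENS = 48_000  # effective tokens for a max-size image (from the text)


def gpt4o_high_detail_tokens(width: int, height: int) -> int:
    """85 base tokens + 170 per 512x512 tile, as quoted in the table note."""
    return 85 + 170 * math.ceil(width / 512) * math.ceil(height / 512)


def cost_usd(tokens: float, rate_per_million: float) -> float:
    """Input cost in USD for a given token count and per-1M-token rate."""
    return tokens * rate_per_million / 1_000_000


flagship_tokens = gpt4o_high_detail_tokens(768, 2048)  # 2 x 4 = 8 tiles
flagship_cost = cost_usd(flagship_tokens, GPT4O_RATE)
mini_cost = cost_usd(MINI_EFFECTIVE_TOKENS, MINI_RATE)

print(flagship_tokens)                      # 1445
print(f"{flagship_cost:.4f}")               # 0.0036
print(f"{mini_cost:.4f}")                   # 0.0072
print(f"{mini_cost / flagship_cost:.1f}x")  # 2.0x
```

Under these assumptions the "cheap" model costs about twice as much for the same large image, even though its token rate is far lower.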

What this means for product choices

  • Gemini 2.0 Flash is the cost-leader by a wide margin — roughly 10× cheaper per image than GPT-4o, ~20× cheaper than Claude Sonnet 4.6, ~70× cheaper than Claude Opus 4.7 at 3 MP. For high-volume taxonomic tagging where reasoning quality is sufficient, the math is decisive.
  • The Sonnet/Opus gap on Claude is meaningful. For the same 1 MP image and the same ~1,334 tokens, Opus 4.7 costs about 1.7× as much as Sonnet 4.6 ($5.00 vs $3.00 per 1M input tokens). The 3 MP Opus configuration costs 3.5× Sonnet's 1 MP cost, because the higher-resolution input produces about 2.1× the image tokens (~2,800 vs ~1,334) on top of the higher rate.
  • GPT-4o is mid-pack on cost and near the top on availability. It is not the cheapest, but it is among the best-documented and most operator-friendly options for production deployment. The cost-quality tradeoff is real.
  • Avoid optimizing for token rate. The headline rate is misleading. Compute per-image cost using each vendor's image-token formula before architecting a pipeline; the rankings will surprise you.
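To see how far the two rankings diverge, the sketch below sorts the seven table rows once by headline input rate and once by listed per-image cost. The rates and costs are copied from the table as published, not recomputed, and the variable names are illustrative.

```python
# (model, input rate in USD per 1M tokens, listed cost per image in USD),
# copied from the table above.
rows = [
    ("Gemini 2.0 Flash", 0.075, 0.00019),
    ("GPT-4o (high detail)", 2.50, 0.0019),
    ("Gemini 1.5 Pro", 3.00, 0.0030),
    ("Claude Sonnet 4.6", 3.00, 0.0040),
    ("Claude Opus 4.7 (1 MP)", 5.00, 0.0067),
    ("GPT-4o-mini (high detail, max)", 0.15, 0.0072),
    ("Claude Opus 4.7 (3 MP)", 5.00, 0.0140),
]

# sorted() is stable, so rows with equal rates keep their table order.
by_rate = [name for name, rate, _ in sorted(rows, key=lambda r: r[1])]
by_cost = [name for name, _, cost in sorted(rows, key=lambda r: r[2])]

for name in by_rate:
    rate_rank = by_rate.index(name) + 1
    cost_rank = by_cost.index(name) + 1
    print(f"{name}: rank {rate_rank} by token rate, rank {cost_rank} by cost per image")
```

GPT-4o-mini is second-cheapest by token rate but sixth of seven by per-image cost, which is the reordering the last bullet warns about.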

How we computed this

For each provider we used the official published image-tokenization formula (or, where unpublished, vendor-confirmed numbers from developer documentation) multiplied by the current input rate per 1M tokens. Output tokens are not included — they vary by response length and dwarf input cost in the long tail. For Anthropic, the documented formula is tokens ≈ (w × h) / 750 (capped at 1,568 tokens for ≥1.19 MP on Sonnet, with Opus expanding up to ~3× that ratio on its highest resolutions). For OpenAI, the tile formula is published on the platform docs and confirmed via the OpenAI developer community. For Google, tile sizes vary by model (768×768 for Gemini 2.0 Flash; up to 2304×2304 for max-quality requests on Gemini 1.5 Pro).
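The computation described above is tokens multiplied by the input rate, divided by one million. A minimal sketch follows, with the Sonnet cap modeled as a simple `min()` against the 1,568-token ceiling mentioned in the paragraph. That modeling choice and the function names are assumptions for illustration; the Opus high-resolution behavior is not modeled because the paragraph gives no formula for it.

```python
SONNET_TOKEN_CAP = 1568  # ceiling quoted in the methodology paragraph


def sonnet_image_tokens(width: int, height: int) -> float:
    """(w x h) / 750, clamped to the quoted Sonnet cap (assumed to be a hard ceiling)."""
    return min((width * height) / 750, SONNET_TOKEN_CAP)


def cost_per_image(tokens: float, rate_per_million: float) -> float:
    """Input-only cost in USD: tokens x rate / 1,000,000."""
    return tokens * rate_per_million / 1_000_000


# GPT-4o row: 765 tokens at $2.50 per 1M tokens, scaled to 1,000 images.
print(f"{cost_per_image(765, 2.50) * 1000:.2f}")                        # 1.91

# Sonnet at $3.00 per 1M tokens: 1 MP image, then a 2 MP image that hits the cap.
print(f"{cost_per_image(sonnet_image_tokens(1000, 1000), 3.00):.4f}")   # 0.0040
print(f"{cost_per_image(sonnet_image_tokens(2000, 1000), 3.00):.4f}")   # 0.0047
```

The first two results match the GPT-4o and Claude Sonnet 4.6 rows of the table.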

What this statistic does not capture

  • Output tokens. A tagging prompt that returns about 200 tokens of JSON adds an output charge that can be comparable to the image's input cost; a long-form description with reasoning can return thousands of tokens. Output rate differences (often 4-10× input rate) frequently dominate total cost.
  • Batch discounts and prompt caching. Most providers offer 50% off for asynchronous batch processing. Anthropic and OpenAI both offer prompt-cache reads at meaningfully lower rates if the same system prompt is reused across image calls.
  • Quality. A more expensive image inference may produce a more accurate tag, more useful description, or richer structured output. The economics of quality vs cost are the subject of Report 03.
  • Egress / orchestration. If images live in another cloud or behind a CDN, the cost of moving them to the inference endpoint can rival the inference itself, especially at scale.
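The first two omissions can be folded into a rough total-cost estimate. The sketch below is illustrative only: the function name is hypothetical, the $10.00 output rate is an assumed value (4× the GPT-4o input rate, at the low end of the 4-10× range mentioned above), and the 50% batch discount is applied to the whole request as a simplification. Prompt caching, egress, and quality are not modeled.

```python
def total_cost_per_image(
    image_tokens: float,
    input_rate: float,
    output_tokens: float,
    output_rate: float,
    batch_discount: float = 0.0,
) -> float:
    """Input plus output cost in USD; rates are per 1M tokens, discount is a 0-1 fraction."""
    input_cost = image_tokens * input_rate / 1_000_000
    output_cost = output_tokens * output_rate / 1_000_000
    return (input_cost + output_cost) * (1 - batch_discount)


ASSUMED_OUTPUT_RATE = 10.00  # USD per 1M output tokens; assumption, not from the table

# 765 image tokens at $2.50 per 1M tokens (the GPT-4o row) with a 200-token JSON reply.
short_reply = total_cost_per_image(765, 2.50, 200, ASSUMED_OUTPUT_RATE)
# Same request at a 50% batch discount.
batched = total_cost_per_image(765, 2.50, 200, ASSUMED_OUTPUT_RATE, batch_discount=0.5)
# Same image with a 2,000-token long-form description.
long_reply = total_cost_per_image(765, 2.50, 2000, ASSUMED_OUTPUT_RATE)

print(f"{short_reply:.4f}")  # 0.0039
print(f"{batched:.4f}")      # 0.0020
print(f"{long_reply:.4f}")   # 0.0219
```

Under these assumptions a 200-token reply roughly doubles the input-only figure, and a 2,000-token reply costs about ten times the image itself.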

Cite this statistic

DAM LLM Research. "Per-image inference cost across frontier multimodal LLMs, May 2026." damllm.ai, 2026. https://damllm.ai/statistics/per-image-cost-multimodal-llms/