Report 03 · Live Index · Re-verified monthly

AI tagging: the 2026 provider index.

AI tagging is the practice of letting a model assign descriptive labels to images and video automatically, and the market for it is split in two. We scored 10 leading AI tagging providers on one rubric: six classical computer-vision APIs (Google Cloud Vision, AWS Rekognition, Azure AI Vision, Clarifai, Imagga, Hive), one hybrid (Cloudinary AI), and three frontier multimodal LLMs (Anthropic Claude, OpenAI GPT-4o, Google Gemini). Headline finding: only three of ten score a full "Yes" for answering free-form questions about an image. The other seven are built to return labels, although two of them (Azure AI Vision and Clarifai) earn a "Partial" on that dimension.

Providers covered: 10
Dimensions: 6
Data source: Public documentation
Snapshot: May 26, 2026
Cadence: Monthly re-verification
Version: v1.0

Definition

AI tagging is the automatic assignment of descriptive labels — objects, scenes, text, brand marks, emotions — to images and video by a machine-learning model. It replaces manual metadata entry and powers search, filtering, and retrieval across a digital asset library. Two model families dominate in 2026: classical computer-vision APIs (closed taxonomies, fast, cheap) and frontier multimodal LLMs (open-ended, descriptive, more expensive).

How to read this

Each cell records whether the provider's public documentation demonstrates a capability — not whether the marketing site claims it. We re-verify monthly; the version line above tracks revisions. v1.0 is preliminary; vendor and reader corrections fold into v1.x within 7 days.

Yes: Documented & verified
Partial: Limitations noted in cell
No: Missing from public docs

The matrix

AI Tagging Provider Index · v1.0 · Snapshot 2026-05-26

n=10 providers · 6 dimensions · monthly re-verification

Pricing published: 9/10
Free tier (no CC): 4/10
OCR documented: 5/10
Custom training: 6/10
Multimodal LLM: 3/10
SLA published: 5/10

Provider	Pricing per unit	Free tier (no CC)	OCR documented	Custom training	Multimodal LLM reasoning	SLA published	Score
Azure AI Vision (azure.microsoft.com)	Yes	Partial	Yes	Yes	Partial	Yes	5/6
Clarifai (clarifai.com)	Yes	Yes	Yes	Yes	Partial	Partial	5/6
Google Gemini (vision) (ai.google.dev)	Yes	Yes	Partial	Partial	Yes	Yes	5/6
Google Cloud Vision (cloud.google.com)	Yes	Partial	Yes	Yes	No	Yes	4.5/6
AWS Rekognition (aws.amazon.com)	Yes	Partial	Yes	Yes	No	Yes	4.5/6
Cloudinary AI (cloudinary.com)	Yes	Yes	Yes	No	No	Yes	4/6
Imagga (imagga.com)	Yes	Yes	No	Yes	No	Partial	3.5/6
OpenAI GPT-4o (vision) (platform.openai.com)	Yes	No	Partial	Partial	Yes	No	3/6
Anthropic Claude (vision) (anthropic.com)	Yes	No	Partial	No	Yes	No	2.5/6
Hive AI (thehive.ai)	Partial	No	Partial	Yes	No	Partial	2.5/6

Scoring: each Yes = 1, each Partial = 0.5, each No = 0. Snapshot of provider public documentation as of 2026-05-26. v1.0 is preliminary; corrections from vendors and operators are folded into v1.x within 7 days.
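The scoring rule is simple enough to recompute by hand, but a short script makes it easy to check the Score column and the per-column tallies against the matrix. The following is a minimal sketch: the cell values are transcribed from the v1.0 matrix, while the names `POINTS`, `MATRIX`, `score`, and `yes_counts` are our own illustrative choices, not part of any published tooling.

```python
# Points per cell, exactly as stated in the scoring note.
POINTS = {"Yes": 1.0, "Partial": 0.5, "No": 0.0}

# Rows transcribed from the v1.0 matrix. Column order:
# pricing, free tier (no CC), OCR, custom training, multimodal LLM, SLA.
MATRIX = {
    "Azure AI Vision":           ["Yes", "Partial", "Yes", "Yes", "Partial", "Yes"],
    "Clarifai":                  ["Yes", "Yes", "Yes", "Yes", "Partial", "Partial"],
    "Google Gemini (vision)":    ["Yes", "Yes", "Partial", "Partial", "Yes", "Yes"],
    "Google Cloud Vision":       ["Yes", "Partial", "Yes", "Yes", "No", "Yes"],
    "AWS Rekognition":           ["Yes", "Partial", "Yes", "Yes", "No", "Yes"],
    "Cloudinary AI":             ["Yes", "Yes", "Yes", "No", "No", "Yes"],
    "Imagga":                    ["Yes", "Yes", "No", "Yes", "No", "Partial"],
    "OpenAI GPT-4o (vision)":    ["Yes", "No", "Partial", "Partial", "Yes", "No"],
    "Anthropic Claude (vision)": ["Yes", "No", "Partial", "No", "Yes", "No"],
    "Hive AI":                   ["Partial", "No", "Partial", "Yes", "No", "Partial"],
}


def score(cells):
    """Sum the points for one provider row."""
    return sum(POINTS[cell] for cell in cells)


def yes_counts(matrix):
    """Count, per column, how many providers earn a full Yes."""
    n_columns = len(next(iter(matrix.values())))
    return [
        sum(1 for row in matrix.values() if row[i] == "Yes")
        for i in range(n_columns)
    ]


for name, cells in MATRIX.items():
    print(f"{name}: {score(cells):g}/6")
print("Full-Yes count per column:", yes_counts(MATRIX))
```

Note that the per-column tallies above the matrix count only full "Yes" cells; a "Partial" contributes to a provider's score but not to the column count.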

The classical-CV vs frontier-LLM split

The most striking pattern in v1.0 is not who scores highest — it's that the rubric draws a sharp line down the middle of the field. The six classical computer-vision APIs (Google Cloud Vision, AWS Rekognition, Azure AI Vision, Clarifai, Imagga, Hive AI) plus Cloudinary AI return closed label sets: feed an image, get back a list of tags from a fixed taxonomy. The three frontier multimodal LLMs (Anthropic Claude, OpenAI GPT-4o, Google Gemini) do something categorically different — they will describe the image in natural language, answer follow-up questions, or follow arbitrary instructions ("list every brand logo visible," "what's the mood of this product photo," "redact PII").

The classical providers score higher on the rubric on average (median 4.5/6 for the six classical APIs vs 3/6 for the three frontier LLMs), and that's because the rubric was built around the work an asset-tagging pipeline actually does: pricing, OCR, custom training, SLAs. The frontier LLMs don't lose points in those columns because they're worse products; they lose them because their product shape is different. They are not asset-tagging APIs that happen to use LLMs; they are LLMs that happen to look at images.
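For readers who want to recompute the group comparison, here is a small sketch that takes the Score column from the matrix and reports the median per model family. The grouping follows the report's own split (six classical APIs, one hybrid, three frontier LLMs); the dictionary layout and the function name `group_medians` are assumptions made for illustration.

```python
from statistics import median

# Scores copied from the Score column of the v1.0 matrix,
# grouped by the model families named in the report.
SCORES = {
    "classical": {
        "Azure AI Vision": 5.0,
        "Clarifai": 5.0,
        "Google Cloud Vision": 4.5,
        "AWS Rekognition": 4.5,
        "Imagga": 3.5,
        "Hive AI": 2.5,
    },
    "hybrid": {
        "Cloudinary AI": 4.0,
    },
    "frontier_llm": {
        "Google Gemini (vision)": 5.0,
        "OpenAI GPT-4o (vision)": 3.0,
        "Anthropic Claude (vision)": 2.5,
    },
}


def group_medians(scores):
    """Return the median score for each model family."""
    return {family: median(rows.values()) for family, rows in scores.items()}


for family, value in group_medians(SCORES).items():
    print(f"{family}: median {value:g}/6")
```

With groups this small, a single cell moving between "Partial" and "Yes" can shift a median, so the comparison is best read as directional rather than precise.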

For operators building an asset library today, the implication is that you'll likely want both: a classical CV provider for the high-volume, low-latency, structured-output tagging path, and a frontier LLM for the long-tail "answer questions about this image" cases. Picking only one means accepting a hole in your coverage.
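One way to picture the "both" setup is a thin router in front of two backends. The sketch below is purely illustrative: `TaggingRequest`, `route`, and the two stand-in backends are hypothetical names, and no real provider SDK is called. In practice each callable would wrap whichever classical API and multimodal LLM an operator has chosen.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class TaggingRequest:
    image_id: str
    # A free-form question signals the long-tail path; None means bulk tagging.
    question: Optional[str] = None


def route(
    request: TaggingRequest,
    classical_tagger: Callable[[str], List[str]],
    llm_answerer: Callable[[str, str], str],
) -> dict:
    """Send open-ended questions to the LLM, everything else to the CV API."""
    if request.question:
        answer = llm_answerer(request.image_id, request.question)
        return {"path": "frontier_llm", "answer": answer}
    return {"path": "classical_cv", "tags": classical_tagger(request.image_id)}


# Stand-in backends so the sketch runs without network access (hypothetical).
def fake_tagger(image_id: str) -> List[str]:
    return ["shoe", "studio", "white background"]


def fake_llm(image_id: str, question: str) -> str:
    return f"(model answer for {image_id}: {question})"


print(route(TaggingRequest("img-001"), fake_tagger, fake_llm))
print(route(
    TaggingRequest("img-001", "What's the mood of this product photo?"),
    fake_tagger,
    fake_llm,
))
```

Passing the backends in as callables keeps the routing decision separate from any vendor choice, so either side can be swapped without touching the router.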

What the index measures

Per-unit pricing published. Can a developer look up the per-image (or per-token, for LLMs) price on the public site without filling in a contact form? "Yes" requires a public number; "Partial" means a calculator with no per-unit rate disclosed.
Free tier without credit card. Can a developer hit the API and receive real responses without entering a payment method? "Yes" means no card is required for a meaningful trial; "Partial" means a free quota exists but a card is needed to enable it.
OCR / text extraction documented. Does public documentation describe a dedicated endpoint, model, or supported prompt pattern for extracting text from images? "Partial" means OCR-style behaviour works in practice (typical of multimodal LLMs) but is not a documented capability.
Custom model training via API. Can operators train on their own labeled data through a documented API or workflow — not just upload a single label set, but a real training pipeline? "Partial" means fine-tuning is supported but coverage of vision-specific fine-tuning is limited or in preview.
Multimodal LLM-style reasoning. Can the model answer arbitrary, open-ended natural-language questions about an image — not just return a closed taxonomy? This is the dimension that separates classical CV from frontier multimodal models.
Production SLA published. Does the provider publish a paid-tier uptime SLA? Status pages count as a public observation but not as an SLA commitment; that's "Partial."

What the index does not measure

Accuracy. A real accuracy benchmark requires a held-out test set, controlled prompts, and rater agreement. That's the subject of a separate forthcoming report. The index measures capability surface, not output quality.
Latency. Real-world latency depends on region, image size, and concurrency. Out of scope for v1.0.
Content moderation specifics. Hive, AWS, Azure, and Google all ship moderation features with different taxonomies. We're treating those as a single "tag" capability for v1.0; v2 may break them out.
Embedding APIs. Several providers (notably OpenAI, Google, AWS) ship image-embedding endpoints. Embeddings are a different product shape than tagging and will get their own index.

Methodology notes

Every score in v1.0 is sourced from a publicly reachable provider documentation page as of 2026-05-26. We don't score based on marketing copy, sales-deck claims, or third-party blog posts. If a documentation page is behind a partner-only login or under NDA, we score that capability "Partial" or "No" depending on whether any public artifact exists.

Where a provider has multiple product lines (Microsoft ships Azure AI Vision and Azure OpenAI Service with GPT-4 vision; Google ships Cloud Vision API and Gemini), each product we cover gets its own row; Cloud Vision and Gemini, for example, are scored separately. We treat them as distinct because operators have to make distinct purchase decisions about them.

Two coders independently scored each cell from the same documentation page. The 60 cells in v1.0 resolved to 56 unanimous scores. Four cells required adjudication — those are the "Partial" cells most likely to move in v1.x as we get vendor feedback.
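The agreement figures above reduce to simple arithmetic, sketched here. This computes raw percent agreement only; a chance-corrected statistic such as Cohen's kappa would need the per-cell scores from both coders, which this report does not publish. The constant and function names are our own.

```python
PROVIDERS = 10
DIMENSIONS = 6
UNANIMOUS_CELLS = 56  # cells where both coders gave the same score


def raw_agreement(unanimous: int, total: int) -> float:
    """Share of cells on which the two coders agreed before adjudication."""
    return unanimous / total


total_cells = PROVIDERS * DIMENSIONS           # 60 cells in the matrix
adjudicated = total_cells - UNANIMOUS_CELLS    # cells that needed a tie-break
agreement = raw_agreement(UNANIMOUS_CELLS, total_cells)

print(f"{total_cells} cells, {adjudicated} adjudicated, "
      f"{agreement:.1%} raw agreement")
```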

How this gets updated

Every month, on or around the 25th, we re-run the survey. Every cell is re-checked against the provider's current public documentation. The version line at the head of the matrix is bumped, and a diff is published as a changelog note (link at the bottom of every index page).
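A changelog diff of this kind could be produced by comparing two snapshots of the matrix cell by cell. The sketch below shows one possible approach, not the tooling actually used for the index. The function name `diff_snapshots` is hypothetical, and the two snapshots use a made-up "Example Provider" with invented cell changes purely to exercise the code; they do not describe any real vendor.

```python
COLUMNS = [
    "Pricing per unit",
    "Free tier (no CC)",
    "OCR documented",
    "Custom training",
    "Multimodal LLM reasoning",
    "SLA published",
]


def diff_snapshots(old, new, columns=COLUMNS):
    """List added providers, removed providers, and changed cells."""
    changes = []
    for provider in sorted(set(old) | set(new)):
        if provider not in old:
            changes.append(f"+ {provider}: added")
        elif provider not in new:
            changes.append(f"- {provider}: removed")
        else:
            for col, before, after in zip(columns, old[provider], new[provider]):
                if before != after:
                    changes.append(f"~ {provider} / {col}: {before} -> {after}")
    return changes


# Invented data for illustration only; not taken from the index.
previous = {
    "Example Provider": ["Partial", "No", "Partial", "Yes", "No", "Partial"],
}
current = {
    "Example Provider": ["Yes", "No", "Partial", "Yes", "No", "Partial"],
    "Another Example": ["Yes", "Yes", "No", "No", "No", "No"],
}

for line in diff_snapshots(previous, current):
    print(line)
```

Sorting provider names keeps the output stable between runs, which makes successive changelog notes easier to compare.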

If a vendor ships a capability and we miss it, mail the team via the About page. Corrections fold in within 7 days and are credited in the changelog.

Providers not yet in the index

v1.0 covers 10 providers. The following are candidates for v1.1 (July 2026 cycle), pending public-doc availability and operator demand:

Ximilar — strong custom-model offering; under-represented in North American DAM discussion.
DeepAI — generic tagging API; pricing model is unusual enough to deserve its own row.
Mistral Pixtral — open-weights multimodal model; the obvious frontier-LLM addition once the hosted API matures.
Meta Llama 3.2 vision — also open-weights frontier; same caveat.
AssemblyAI / SuperAnnotate / Roboflow — labeling-first platforms with inference APIs. Sit at a different layer but operators occasionally treat them as tagging providers.

Reader requests for additional providers are welcome.

Cite this index

DAM LLM Research. "AI Tagging Provider Index, v1.0 · May 2026." damllm.ai, 2026. https://damllm.ai/research/ai-tagging-provider-index/