Best for: Vision + Text

Best Multimodal LLM (Vision + Text)

Ranked on vision benchmark accuracy, context window, and combined per-query cost for image + text workloads.

Updated June 2026. Top 3 this month: Qwen3.5 Plus 2026-02-15, Qwen3.5 397B A17B, and GPT-4o (2024-11-20).

Podium

This month’s top three.

Rank	Model	Provider	Input $/1M	Output $/1M	Context
1	Qwen3.5 Plus 2026-02-15	Qwen	$0.26	$1.56	1,000,000
2	Qwen3.5 397B A17B	Qwen	$0.39	$2.34	262,144
3	GPT-4o (2024-11-20)	OpenAI	$2.50	$10.00	128,000

How we score

Weights tuned for (vision + text).

Multimodal workloads (document parsing, screenshot understanding, chart reading) demand a model that can handle dense visual information without hallucinating fields that are not present. We weight MMMU for vision reasoning and DocVQA for document tasks, and we include price because each provider prices vision tokens differently.

Our full methodology is published on the methodology page.

Pillars and weights:

MMMU: 40%
DocVQA: 30%
Price: 30%
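
To make the weighting concrete, here is a minimal sketch of how the three pillars could be combined into one score. Only the 40/30/30 weights come from this page. The assumption that each pillar is scaled to 0..1 and the `price_to_score` min-max normalization are hypothetical; the actual normalization is described on the methodology page, not here.

```python
# Weights as published in the pillar list above.
WEIGHTS = {"mmmu": 0.40, "docvqa": 0.30, "price": 0.30}


def price_to_score(cost: float, cheapest: float, priciest: float) -> float:
    """Hypothetical min-max normalization: cheapest scores 1.0, priciest 0.0."""
    if priciest == cheapest:
        return 1.0
    return (priciest - cost) / (priciest - cheapest)


def weighted_score(mmmu: float, docvqa: float, price: float) -> float:
    """Combine three pillar scores, each assumed to be scaled to 0..1."""
    pillars = {"mmmu": mmmu, "docvqa": docvqa, "price": price}
    return sum(WEIGHTS[name] * value for name, value in pillars.items())


# Placeholder pillar values for illustration only, not real benchmark results.
example = weighted_score(mmmu=0.70, docvqa=0.90, price=0.50)
```

With these placeholder inputs the result is about 0.70: a strong DocVQA score partly offsets a middling price score, which is the trade-off the weights are meant to express.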

Full ranking

Top models

Rank	Model	Provider	Input $/1M	Output $/1M	Context
1	Qwen3.5 Plus 2026-02-15	Qwen	$0.26	$1.56	1,000,000
2	Qwen3.5 397B A17B	Qwen	$0.39	$2.34	262,144
3	GPT-4o (2024-11-20)	OpenAI	$2.50	$10.00	128,000
4	MiniMax-01	MiniMax	$0.20	$1.10	1,000,192
5	Claude Sonnet 4.5	Anthropic	$3.00	$15.00	1,000,000
6	Qwen3.5-122B-A10B	Qwen	$0.26	$2.08	262,144
7	Qwen3.5-27B	Qwen	$0.20	$1.56	262,144
8	Llama 4 Maverick	Meta	$0.15	$0.60	1,048,576
9	Gemma 4 31B	Google	$0.00	$0.00	262,144
10	Gemma 4 31B	Google	$0.13	$0.38	262,144

Field notes

Tips for (vision + text)

01 Verify the provider's image-size and token-per-image policy before committing. Pricing varies dramatically between providers.
02 For structured extraction, combine vision with JSON mode.
03 Chart and graph understanding is still fragile. Add a text-only sanity check when a wrong reading would have serious downstream consequences.
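
Tips 02 and 03 can be sketched together: parse the model's JSON reply strictly, then cross-check each extracted value against a text-only source such as OCR output. This is a rough illustration, not a provider API. The function names, the field names, and the invoice values are all hypothetical, and the verbatim substring check is a deliberately simple stand-in for a real comparison.

```python
import json


def parse_extraction(raw_reply: str, required_fields: set) -> dict:
    """Parse a JSON reply and reject it if fields are missing or unexpected."""
    data = json.loads(raw_reply)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = required_fields - set(data)
    extra = set(data) - required_fields
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    return data


def unsupported_values(data: dict, reference_text: str) -> list:
    """Return fields whose value does not appear verbatim in the reference text."""
    return [
        field
        for field, value in data.items()
        if value is not None and str(value) not in reference_text
    ]


# Hypothetical reply from a vision model running in JSON mode.
reply = '{"invoice_number": "INV-1042", "total": "118.40"}'
fields = parse_extraction(reply, {"invoice_number", "total"})

# Hypothetical OCR text for the same image; the total differs, so it is flagged.
suspicious = unsupported_values(fields, "Invoice INV-1042 total due 181.40 EUR")
```

Rejecting unexpected keys catches hallucinated fields early, and any field returned by `unsupported_values` is a candidate for human review rather than proof of an error.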

FAQ

Frequently asked questions

The questions teams ask before picking a model for (vision + text).


As of June 2026, our weighted top 3 are Qwen3.5 Plus 2026-02-15, Qwen3.5 397B A17B, GPT-4o (2024-11-20).

Vision input is usually cheaper per image than people assume, but high-resolution detail modes can raise the cost tenfold. Benchmark on your actual images.
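
A back-of-the-envelope estimate shows how image tokens feed into per-query cost. The sketch below assumes image tokens are billed at the input rate, which is not true for every provider. The prices are the GPT-4o (2024-11-20) rates listed in the ranking; the token counts per image are hypothetical placeholders, so check the provider's documentation for real figures.

```python
def query_cost_usd(
    input_price_per_m: float,
    output_price_per_m: float,
    text_tokens: int,
    image_tokens: int,
    output_tokens: int,
) -> float:
    """Estimate one query's cost, assuming image tokens bill at the input rate."""
    input_cost = (text_tokens + image_tokens) * input_price_per_m / 1_000_000
    output_cost = output_tokens * output_price_per_m / 1_000_000
    return input_cost + output_cost


# Hypothetical token counts: 100 tokens per image in a low-detail mode
# versus 1,000 tokens in a high-detail mode.
low_detail = query_cost_usd(2.50, 10.00, text_tokens=500, image_tokens=100, output_tokens=300)
high_detail = query_cost_usd(2.50, 10.00, text_tokens=500, image_tokens=1000, output_tokens=300)
```

Under these assumptions the image portion grows tenfold while the whole query cost grows far less, because text and output tokens are unchanged. Running the same calculation on your own token counts shows whether the detail mode dominates your bill.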

Gemini and a few research-tier OpenAI SKUs handle video. The ecosystem is still early, so expect to preprocess video into frames for most use cases.
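
Preprocessing to frames mostly comes down to deciding which frames to keep. The sketch below only computes frame indices and leaves decoding to whatever video library you already use. The function name, the sampling interval, and the frame cap are hypothetical choices, not a recommendation from any provider.

```python
def sample_frame_indices(
    total_frames: int, fps: float, every_seconds: float, max_frames: int
) -> list:
    """Pick one frame every `every_seconds`, capped at `max_frames` frames."""
    if total_frames <= 0 or fps <= 0 or every_seconds <= 0 or max_frames <= 0:
        return []
    step = max(1, round(fps * every_seconds))
    indices = list(range(0, total_frames, step))
    if len(indices) > max_frames:
        # Thin evenly so the kept frames still span the whole clip.
        stride = len(indices) / max_frames
        indices = [indices[int(i * stride)] for i in range(max_frames)]
    return indices


# A 10-second clip at 30 fps, sampled once per second, capped at 5 frames.
picked = sample_frame_indices(total_frames=300, fps=30, every_seconds=1, max_frames=5)
```

Capping the frame count keeps the per-query image cost bounded, since each kept frame is billed like any other image.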
