Best LLM for Reasoning

Ranked on MMLU-Pro, GPQA, and AIME. Price acts as a tiebreaker at 20% of the score; reasoning quality dominates for reasoning-heavy work.

Updated July 2026. Top 3 this month: R1 0528, Qwen3.5 Plus 2026-02-15, Qwen3.5 397B A17B.

Podium

This month’s top three.

1. R1 0528 (DeepSeek)
Input / 1M: $0.50
Output / 1M: $2.15
Context: 163,840

2. Qwen3.5 Plus 2026-02-15 (Qwen)
Input / 1M: $0.26
Output / 1M: $1.56
Context: 1,000,000

3. Qwen3.5 397B A17B (Qwen)
Input / 1M: $0.39
Output / 1M: $2.34
Context: 262,144

How we rank

Weights tuned for reasoning.

Reasoning workloads (math, logic, science, multi-step planning) reward top-tier frontier models disproportionately. The gap between the best and second-best model can be as much as 20 accuracy points. We weight reasoning benchmarks heavily, at 80% of the score combined, and give price the remaining 20%, so it acts mainly as a tiebreaker.

Our full methodology is published on the methodology page.

Pillars and weights:

MMLU-Pro: 35%
GPQA: 25%
AIME: 20%
Price: 20%
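
To make the weighting concrete, here is a minimal sketch of how a weighted composite could be computed from the four pillars. The weights come from the list above. Everything else is an assumption for illustration rather than the published methodology: the function name `composite_score`, the normalization of every pillar to a 0 to 1 scale (with a higher price score meaning cheaper), and the example inputs.

```python
# Pillar weights taken from the list above (they sum to 1.0).
WEIGHTS = {"mmlu_pro": 0.35, "gpqa": 0.25, "aime": 0.20, "price": 0.20}


def composite_score(pillars: dict[str, float]) -> float:
    """Weighted sum of pillar scores, each assumed pre-normalized to [0, 1].

    Assumption: for the price pillar, a higher value means a cheaper model.
    """
    missing = WEIGHTS.keys() - pillars.keys()
    if missing:
        raise ValueError(f"missing pillars: {sorted(missing)}")
    return sum(WEIGHTS[name] * pillars[name] for name in WEIGHTS)


# Hypothetical normalized inputs, not real benchmark results.
example = {"mmlu_pro": 0.80, "gpqa": 0.70, "aime": 0.90, "price": 0.50}
score = composite_score(example)
```

With these weights, price can move the composite by at most 0.20, so it only changes the order of models whose benchmark scores are already close.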

Full ranking

Top ranked models

Rank	Model	Provider	Input $/1M tokens	Output $/1M tokens	Context (tokens)
1	R1 0528	DeepSeek	$0.50	$2.15	163,840
2	Qwen3.5 Plus 2026-02-15	Qwen	$0.26	$1.56	1,000,000
3	Qwen3.5 397B A17B	Qwen	$0.39	$2.34	262,144
4	MiniMax M2.1	MiniMax	$0.29	$0.95	196,608
5	Claude Sonnet 4.5	Anthropic	$3.00	$15.00	1,000,000
6	MiMo-V2-Flash	Xiaomi	$0.09	$0.29	262,144
7	Qwen3.5-122B-A10B	Qwen	$0.26	$2.08	262,144
8	Qwen3.5-27B	Qwen	$0.20	$1.56	262,144
9	Olmo 3 32B Think	Allen AI	$0.15	$0.50	65,536
10	Qwen3.5-35B-A3B	Qwen	$0.16	$1.30	262,144

Field notes

Tips for reasoning

01. Turn on native reasoning mode if the model offers it; the accuracy gains are real.
02. Reasoning mode costs more tokens. Budget accordingly.
03. Put a cheap model and a reasoning model behind a router to control cost.
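
The router idea in the last tip can be sketched as follows. This is a minimal illustration, not a recommended production design: the keyword list, the `looks_hard` heuristic, the 200-word threshold, and the two stand-in model functions are all hypothetical, and a real router would more likely use a trained classifier or the cheap model's own confidence.

```python
from typing import Callable

# Hypothetical markers that hint at multi-step work; tune for your own traffic.
HARD_MARKERS = ("prove", "derive", "step by step", "plan", "optimize")


def looks_hard(prompt: str, max_easy_words: int = 200) -> bool:
    """Crude difficulty heuristic (an assumption, not a tested classifier)."""
    text = prompt.lower()
    too_long = len(text.split()) > max_easy_words
    return too_long or any(marker in text for marker in HARD_MARKERS)


def route(prompt: str,
          cheap: Callable[[str], str],
          reasoning: Callable[[str], str]) -> str:
    """Send hard-looking prompts to the reasoning model, the rest to the cheap one."""
    handler = reasoning if looks_hard(prompt) else cheap
    return handler(prompt)


# Stand-in clients for illustration; replace with real API calls.
def cheap_model(prompt: str) -> str:
    return "cheap"


def reasoning_model(prompt: str) -> str:
    return "reasoning"
```

Calling `route("Prove that the sum of two even numbers is even.", cheap_model, reasoning_model)` dispatches to the reasoning model, while a short factual question goes to the cheap one.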

FAQ

Frequently asked questions

The questions teams ask before picking a model for reasoning.


As of July 2026, our weighted top 3 for reasoning are R1 0528, Qwen3.5 Plus 2026-02-15, Qwen3.5 397B A17B.

Yes, reasoning mode costs more: typically 2–5x in output tokens, occasionally more. Check your billing.
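
As a rough budgeting aid, the 2–5x output-token range can be turned into a per-request estimate. The sketch below uses the R1 0528 list prices from the ranking table. The function name `request_cost`, the token counts, and the assumption that the multiplier applies to output tokens only are illustrative choices, not billing rules from any provider.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float,
                 reasoning_multiplier: float = 1.0) -> float:
    """Dollar cost of one request; the multiplier scales output tokens only."""
    billed_output = output_tokens * reasoning_multiplier
    total = input_tokens * input_price_per_m + billed_output * output_price_per_m
    return total / 1_000_000


# R1 0528 list prices from the ranking table: $0.50 in, $2.15 out per 1M tokens.
# The 2,000 input / 500 output token counts are hypothetical.
base = request_cost(2_000, 500, 0.50, 2.15)
low = request_cost(2_000, 500, 0.50, 2.15, reasoning_multiplier=2)
high = request_cost(2_000, 500, 0.50, 2.15, reasoning_multiplier=5)
```

Under these assumptions the request costs about $0.0021 without reasoning mode and between roughly $0.0032 and $0.0064 with it, so the increase in total cost is smaller than the raw output-token multiplier when input tokens dominate the prompt.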

Not well on frontier benchmarks. For simple chains of thought they can be OK, but multi-step reasoning clearly separates the top tier.
