Best for: Reasoning

Best LLM for Reasoning

Ranked on MMLU-Pro, GPQA, and AIME. Reasoning quality dominates for reasoning-heavy work: the three benchmarks carry 80% of the score, and price carries the remaining 20%.

Updated April 2026. Top 3 this month: R1 0528, Qwen3.5 Plus 2026-02-15, and Qwen3.5 397B A17B.

Podium

This month’s top three.

  • 1
    R1 0528
    DeepSeek
    Input / 1M: $0.50
    Output / 1M: $2.15
    Context: 163,840 tokens
  • 2
    Qwen3.5 Plus 2026-02-15
    Qwen
    Input / 1M: $0.26
    Output / 1M: $1.56
    Context: 1,000,000 tokens
  • 3
    Qwen3.5 397B A17B
    Qwen
    Input / 1M: $0.39
    Output / 1M: $2.34
    Context: 262,144 tokens

How we score

Weights tuned for reasoning.

Reasoning workloads (math, logic, science, multi-step planning) reward the top-tier frontier models disproportionately. The gap between the best and second-best can be a 20-point accuracy swing. We therefore give the three reasoning benchmarks 80% of the total weight and price the remaining 20%.

Our full methodology is published on the methodology page.

Pillars and weights:

  • MMLU-Pro: 35%
  • GPQA: 25%
  • AIME: 20%
  • Price: 20%
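
As a rough illustration of how these weights could combine into a single score, here is a minimal sketch. Only the four weights come from this page. The function name `weighted_score`, the assumption that every pillar is already normalized to a 0..1 scale (with a higher price score meaning a cheaper model), and the example inputs are all hypothetical; the actual normalization is described on the methodology page, not here.

```python
# Weights taken from the pillar list on this page; everything else is illustrative.
WEIGHTS = {"mmlu_pro": 0.35, "gpqa": 0.25, "aime": 0.20, "price": 0.20}


def weighted_score(pillars: dict[str, float]) -> float:
    """Combine per-pillar scores into one number.

    Assumption: each pillar score is already normalized to 0..1,
    and a higher price score means a cheaper model.
    """
    missing = WEIGHTS.keys() - pillars.keys()
    if missing:
        raise ValueError(f"missing pillar scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * pillars[name] for name in WEIGHTS)


# Hypothetical inputs, not real benchmark results.
example = {"mmlu_pro": 0.80, "gpqa": 0.70, "aime": 0.60, "price": 0.90}
score = weighted_score(example)
print(f"weighted score: {score:.3f}")
```

With these weights the three benchmarks contribute at most 0.80 of the score and price at most 0.20, so a cheap model cannot outrank a clearly stronger one on price alone.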

Full ranking

Top models

Rank  Model                    Provider   Input $/1M  Output $/1M  Context
1     R1 0528                  DeepSeek   $0.50       $2.15        163,840
2     Qwen3.5 Plus 2026-02-15  Qwen       $0.26       $1.56        1,000,000
3     Qwen3.5 397B A17B        Qwen       $0.39       $2.34        262,144
4     MiniMax M2.1             MiniMax    $0.29       $0.95        196,608
5     Claude Sonnet 4.5        Anthropic  $3.00       $15.00       1,000,000
6     MiMo-V2-Flash            Xiaomi     $0.09       $0.29        262,144
7     Qwen3.5-122B-A10B        Qwen       $0.26       $2.08        262,144
8     Qwen3.5-27B              Qwen       $0.20       $1.56        262,144
9     Olmo 3 32B Think         Allen AI   $0.15       $0.50        65,536
10    Qwen3.5-35B-A3B          Qwen       $0.16       $1.30        262,144

Field notes

Tips for reasoning

  • 01

    Turn on native reasoning mode if the model offers it — the accuracy gains are real.

  • 02

    Reasoning mode costs more tokens. Budget accordingly.

  • 03

    Put a cheap model and a reasoning model behind a router to control cost.
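
The router idea in tip 03 can be sketched as follows. This is a minimal illustration under stated assumptions, not a recommended implementation: the `Route` type, the `looks_hard` keyword heuristic, and the stub model calls are all hypothetical. A real router would call actual model APIs and would likely use a better difficulty signal than keywords and prompt length.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Route:
    name: str
    call: Callable[[str], str]  # takes a prompt, returns a completion


def looks_hard(prompt: str) -> bool:
    """Crude, hypothetical heuristic: long prompts or reasoning keywords."""
    keywords = ("prove", "derive", "step by step", "optimize")
    text = prompt.lower()
    return len(text.split()) > 200 or any(k in text for k in keywords)


def pick_route(prompt: str, cheap: Route, reasoning: Route,
               is_hard: Callable[[str], bool] = looks_hard) -> Route:
    """Send hard-looking prompts to the reasoning model, the rest to the cheap one."""
    return reasoning if is_hard(prompt) else cheap


# Stub model calls so the sketch runs without any API access.
cheap = Route("cheap-model", lambda p: f"[cheap] {p}")
reasoning = Route("reasoning-model", lambda p: f"[reasoning] {p}")

for prompt in ("What is the capital of France?",
               "Prove that the square root of 2 is irrational."):
    chosen = pick_route(prompt, cheap, reasoning)
    print(chosen.name, "->", chosen.call(prompt))
```

Keeping the difficulty check as a swappable `is_hard` callable makes it easy to replace the keyword heuristic later, for example with a small classifier, without touching the routing logic.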

FAQ

Frequently asked questions

The questions teams ask before picking a model for reasoning.

As of April 2026, our weighted top 3 for reasoning are R1 0528, Qwen3.5 Plus 2026-02-15, and Qwen3.5 397B A17B.
Yes, reasoning mode costs more: it typically uses 2–5x the output tokens, occasionally more. Check your billing.
Not well on frontier benchmarks. For simple chains of thought they can be OK, but multi-step reasoning clearly separates the top tier.
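
To make the 2–5x output-token figure above concrete, here is a small cost estimate. It is a sketch under assumptions: the prices are the R1 0528 list prices from the ranking table and are used purely for illustration, while the token counts and the helper name `request_cost` are hypothetical.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Return the USD cost of one request, given prices per 1M tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Prices per 1M tokens from the ranking table (R1 0528); token counts are hypothetical.
INPUT_PRICE, OUTPUT_PRICE = 0.50, 2.15
input_tokens, base_output_tokens = 2_000, 500

baseline = request_cost(input_tokens, base_output_tokens, INPUT_PRICE, OUTPUT_PRICE)
low = request_cost(input_tokens, base_output_tokens * 2, INPUT_PRICE, OUTPUT_PRICE)
high = request_cost(input_tokens, base_output_tokens * 5, INPUT_PRICE, OUTPUT_PRICE)

print(f"baseline: ${baseline:.6f}  2x output: ${low:.6f}  5x output: ${high:.6f}")
```

Under these assumed token counts, the per-request cost rises from about $0.0021 to between roughly $0.0032 and $0.0064. Because input tokens are unchanged, the total grows by less than the output multiplier itself.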