Ideal for: Coding

Best LLM for Coding

Ranked on SWE-Bench, HumanEval, and dollars-per-1M output tokens. Balanced for autonomous and assistive coding workflows.

Updated April 2026. This month's top 3: GPT-4o (2024-11-20), Claude Sonnet 4.5, and GPT-5 Codex.

Podium

This month’s top three.

  • 1. GPT-4o (2024-11-20), OpenAI. Input $2.50 / 1M · Output $10.00 / 1M · Context 128,000
  • 2. Claude Sonnet 4.5, Anthropic. Input $3.00 / 1M · Output $15.00 / 1M · Context 1,000,000
  • 3. GPT-5 Codex, OpenAI. Input $1.25 / 1M · Output $10.00 / 1M · Context 400,000

How we score

Weights tuned for coding.

Choosing an LLM for coding comes down to three things: how well it turns specifications into working code, how well it reasons about large repositories, and how much it will cost once you wire it into CI or an agent loop. We weight SWE-Bench heaviest because it best predicts real-world coding-agent success, followed by HumanEval for short-form correctness, and a price pillar so the recommendation survives contact with a finance review.

Our full methodology is published on the methodology page.

Pillars and weights:

  • SWE-Bench: 50%
  • HumanEval: 30%
  • Price: 20%
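The weighting above can be sketched as a simple linear combination. This is a minimal illustration of the scoring scheme, not our actual pipeline; the pillar scores below are made-up example values, and we assume benchmarks are normalized to 0-1 with price inverted so cheaper models score higher.

```python
# Sketch of the weighted scoring described above. Benchmark scores are
# assumed normalized to 0-1; the price pillar is assumed inverted so
# cheaper models score higher. Example inputs are illustrative only.

WEIGHTS = {"swe_bench": 0.50, "humaneval": 0.30, "price": 0.20}

def weighted_score(swe_bench: float, humaneval: float, price_score: float) -> float:
    """Combine the three pillars into a single 0-1 ranking score."""
    return (WEIGHTS["swe_bench"] * swe_bench
            + WEIGHTS["humaneval"] * humaneval
            + WEIGHTS["price"] * price_score)

# Hypothetical pillar scores for two models:
print(round(weighted_score(0.72, 0.90, 0.60), 3))  # 0.75
print(round(weighted_score(0.65, 0.95, 0.80), 3))  # 0.77
```

Note how the second model wins despite a lower SWE-Bench score: the price pillar can flip close calls, which is exactly why it is capped at 20%.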

Full ranking

Top models

| Rank | Model | Provider | Input $/1M | Output $/1M | Context |
|---|---|---|---|---|---|
| 1 | GPT-4o (2024-11-20) | OpenAI | $2.50 | $10.00 | 128,000 |
| 2 | Claude Sonnet 4.5 | Anthropic | $3.00 | $15.00 | 1,000,000 |
| 3 | GPT-5 Codex | OpenAI | $1.25 | $10.00 | 400,000 |
| 4 | Gemini 2.5 Pro | Google | $1.25 | $10.00 | 1,048,576 |
| 5 | Gemini 2.5 Pro Preview 06-05 | Google | $1.25 | $10.00 | 1,048,576 |
| 6 | GPT-5.1-Codex | OpenAI | $1.25 | $10.00 | 400,000 |
| 7 | o3 | OpenAI | $2.00 | $8.00 | 200,000 |
| 8 | Claude 3.7 Sonnet | Anthropic | $3.00 | $15.00 | 200,000 |
| 9 | Claude 3.7 Sonnet (thinking) | Anthropic | $3.00 | $15.00 | 200,000 |
| 10 | GPT-5 Mini | OpenAI | $0.25 | $2.00 | 400,000 |
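To turn the per-1M-token prices in the table into a budget number, multiply by your monthly token volume. A minimal sketch, using pricing from the table above; the 50M-input / 5M-output volumes are assumptions you should replace with your own usage metrics.

```python
# Rough monthly cost estimate from the $-per-1M-token prices in the
# ranking table. Token volumes below are illustrative assumptions.

PRICING = {  # model: (input $/1M, output $/1M), taken from the table above
    "GPT-4o (2024-11-20)": (2.50, 10.00),
    "Claude Sonnet 4.5": (3.00, 15.00),
    "GPT-5 Codex": (1.25, 10.00),
}

def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost for a given monthly token volume."""
    inp, out = PRICING[model]
    return inp * input_tokens / 1e6 + out * output_tokens / 1e6

# Example: 50M input / 5M output tokens per month.
for model in PRICING:
    print(f"{model}: ${monthly_cost(model, 50_000_000, 5_000_000):,.2f}")
```

Coding agents typically read far more tokens than they write, so input price dominates at scale: at this volume GPT-5 Codex comes in well under Claude Sonnet 4.5 despite the identical output rate to GPT-4o.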

Field notes

Tips for coding

  • 01

    Prefer a model with a large context window if your repo is bigger than ~200 files.

  • 02

    Use batch pricing for CI / nightly refactor jobs; interactive IDE work stays on the standard price.

  • 03

    Check function-calling reliability before committing to an agentic flow.
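Tip 03 can be checked with a small smoke test before you commit to a provider: run the same tool-use prompt repeatedly and measure how often the model returns well-formed arguments. A minimal sketch; `write_file` and its schema are hypothetical, and feeding in real model outputs (via your provider's SDK) is left to you.

```python
# Sketch of a function-calling reliability smoke test (tip 03).
# The tool schema and the sample outputs below are hypothetical;
# in practice, collect `samples` from repeated calls to your provider.
import json

EXPECTED_KEYS = {"path", "content"}  # schema of a hypothetical write_file tool

def tool_call_ok(raw_arguments: str) -> bool:
    """True if the model emitted parseable JSON with the required keys."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return False
    return isinstance(args, dict) and EXPECTED_KEYS <= args.keys()

def reliability(samples: list) -> float:
    """Fraction of tool calls that were well-formed across repeated runs."""
    return sum(tool_call_ok(s) for s in samples) / len(samples)

# Hypothetical raw outputs from 4 repeated runs of the same prompt:
runs = [
    '{"path": "a.py", "content": "print(1)"}',
    '{"path": "a.py", "content": "print(1)"}',
    '{"path": "a.py"}',    # missing required key
    'write_file(a.py)',    # not JSON at all
]
print(reliability(runs))  # 0.5
```

Anything below roughly 95% well-formed calls tends to surface as retries and stalled agent loops, so measure this before wiring the model into CI.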

FAQ

Frequently asked questions

The questions teams ask before picking a model for coding.


  • Which model is currently best for coding?
    As of April 2026, our weighted top 3 are GPT-4o (2024-11-20), Claude Sonnet 4.5, and GPT-5 Codex.

  • Should we pick Claude or GPT?
    Claude wins long-horizon refactoring; GPT wins short-burst correctness. The right answer depends on your workload mix, so check the scoring pillars above.

  • Is fine-tuning a coding model worth it?
    Rarely. Fine-tuning on proprietary code still helps, but for 90% of shops a strong frontier model with RAG over the repo gets you most of the way.

  • What about open-weight models?
    DeepSeek and Meta Llama variants are competitive on price. We list their hosted pricing here; self-host economics live in our Shadow AI audit tool.