Methodology.

Exactly how the comparison, the calculators, and the Best-for rankings are built — so you, and the AI engines citing our data, can trust the output.

Where the data comes from

Sourced, timestamped, auditable.

Every model row has a last_verified_at timestamp. Models not re-verified within 30 days are flagged in the admin UI for refresh.
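
A minimal sketch of that freshness rule, for illustration only: the last_verified_at field name comes from the page, but the rest of the row shape and the helper are our own assumptions.

interface ModelRow {
  id: string;
  last_verified_at: string; // ISO 8601, set when the row is re-verified
}

const STALE_AFTER_DAYS = 30;

// True when the row is overdue for re-verification and should be flagged in the admin UI.
function isStale(row: ModelRow, now: Date = new Date()): boolean {
  const ageMs = now.getTime() - new Date(row.last_verified_at).getTime();
  return ageMs > STALE_AFTER_DAYS * 24 * 60 * 60 * 1000;
}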

  • Pricing & specs

    Official provider pricing pages and API docs. Input, output, cached, batch — all as published.

  • Benchmarks

    Provider model cards first, then widely-cited third-party leaderboards (MMLU, SWE-Bench, HumanEval, GPQA, AIME, MMMU, DocVQA). Source noted on each row.

  • Regions & compliance

    Provider trust centers and certification pages: SOC 2, HIPAA, GDPR, FedRAMP, regional data-residency.

Update cadence

Refreshed every morning; every change audited.

  1. Step 01

    Daily · 02:00 UTC

    Snapshot cron

    Captures current price and status; diffs against yesterday to detect changes (a sketch of the diff follows this list).

  2. Step 02

    Daily · 02:30 UTC

    Alerts cron

    Emails subscribed users about price changes, deprecations, and sunsets.

  3. Step 03

    Monthly · 1st @ 09:00 UTC

    Market Pulse newsletter

    One short email with the month’s price moves, new launches, and quiet deprecations.

  4. Step 04

    Ad hoc

    Admin edits

    New launches land the same day. Every change is written to a public audit log.
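
An illustrative sketch of the Step 01 diff described above. Field and type names are our assumptions; the page only states that today's capture is compared against yesterday's to detect changes.

interface PriceSnapshot {
  modelId: string;
  inputPricePer1M: number;
  outputPricePer1M: number;
  status: "active" | "deprecated" | "sunset";
}

interface Change {
  modelId: string;
  field: "inputPricePer1M" | "outputPricePer1M" | "status";
  before: string | number;
  after: string | number;
}

function diffSnapshots(yesterday: PriceSnapshot[], today: PriceSnapshot[]): Change[] {
  const prev = new Map(yesterday.map((s) => [s.modelId, s] as const));
  const changes: Change[] = [];
  const fields = ["inputPricePer1M", "outputPricePer1M", "status"] as const;
  for (const snap of today) {
    const old = prev.get(snap.modelId);
    if (!old) continue; // brand-new models arrive via ad hoc admin edits (Step 04)
    for (const field of fields) {
      if (old[field] !== snap[field]) {
        changes.push({ modelId: snap.modelId, field, before: old[field], after: snap[field] });
      }
    }
  }
  return changes; // consumed by the 02:30 UTC alerts cron (Step 02)
}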

Scoring for Best-for pages

Weights match the task.

Each Best-for page defines a set of pillars with explicit weights (visible at the top of the page). For tasks where quality dominates economics — reasoning, agents, healthcare — price is weighted under 30%. For tasks where price dominates — cheap-bulk, long-context with large input — price is weighted above 40%.

Missing benchmarks are treated as the category median. We don’t assume a model is bad because a score isn’t published.

  • Quality benchmarks

    MMLU, SWE-Bench, HumanEval, GPQA, AIME, MMMU, DocVQA.

  • Price

    Input + output per 1M tokens.

  • Memory

    Context window size.

  • Capabilities

    Function calling, JSON mode, vision, structured output.
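
A minimal sketch of how a Best-for page could combine the pillars above under its published weights. The pillar names mirror the list; how each pillar is normalized to 0–100 is an assumption on our part, not a description of the production code.

type Pillar = "quality" | "price" | "memory" | "capabilities";

function bestForScore(
  pillarScores: Partial<Record<Pillar, number>>, // each pillar normalized to 0–100
  weights: Record<Pillar, number>,               // shown at the top of each Best-for page
  categoryMedians: Record<Pillar, number>,
): number {
  let weighted = 0;
  let totalWeight = 0;
  for (const pillar of Object.keys(weights) as Pillar[]) {
    // Missing benchmarks are treated as the category median, per the rule above.
    const score = pillarScores[pillar] ?? categoryMedians[pillar];
    weighted += score * weights[pillar];
    totalWeight += weights[pillar];
  }
  return weighted / totalWeight;
}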

Buzzi Intelligence Index

One score, six benchmarks, explicit weights.

The Quality-vs-Price scatter on the results page uses our own composite score (0–100) built from published benchmark scores. Missing benchmarks fall back to the category median so a model isn’t penalized for data we don’t have.

We don’t import Artificial Analysis’s index or any third-party composite. The math is ours and the inputs are auditable.

  • 25%

    MMLU

    Broad knowledge and reasoning — 57 subjects.

  • 20%

    GPQA

    Expert-level science questions (physics, chemistry, biology).

  • 20%

    HumanEval

    Python code generation from docstrings.

  • 20%

    SWE-Bench

    Real-world GitHub issue-fixing tasks.

  • 15%

    MMMU

    Multimodal (text + image) college-level problems.

  • 10%

    AIME

    High-school math olympiad problems.

Weights sum to 1.1 before normalization, so a model that covers all six benchmarks with 100/100 scores lands at exactly 100. Missing benchmarks cause the denominator to shrink proportionally.
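
A sketch of that composite under the rules stated above: fixed weights, median fallback for unpublished scores, and normalization by the covered weight so a full sweep of 100s lands at exactly 100. Function and variable names are ours; treat this as illustration rather than the production implementation.

const INDEX_WEIGHTS = {
  MMLU: 0.25,
  GPQA: 0.2,
  HumanEval: 0.2,
  "SWE-Bench": 0.2,
  MMMU: 0.15,
  AIME: 0.1,
} as const;

type Benchmark = keyof typeof INDEX_WEIGHTS;

function intelligenceIndex(
  published: Partial<Record<Benchmark, number>>,      // 0–100 scores from model cards
  categoryMedian: Partial<Record<Benchmark, number>>, // cohort medians, where available
): number {
  let weighted = 0;
  let denominator = 0;
  for (const b of Object.keys(INDEX_WEIGHTS) as Benchmark[]) {
    const score = published[b] ?? categoryMedian[b]; // median fallback for unpublished scores
    if (score === undefined) continue;               // no data at all: the denominator shrinks
    weighted += score * INDEX_WEIGHTS[b];
    denominator += INDEX_WEIGHTS[b];                 // 1.1 when all six benchmarks are covered
  }
  return denominator > 0 ? weighted / denominator : 0; // all-100 inputs normalize to exactly 100
}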

Cost formulas

The math, laid out.

Volume cost uses the standard per-million-tokens model. Switch cost assumes a 40-hour engineering week at your chosen rate, with a configurable risk premium.

monthly_cost = (uncached_input_tokens  / 1M) × input_price_per_1M
             + (cached_input_tokens    / 1M) × cached_input_price_per_1M
             + (batch_input_tokens     / 1M) × batch_input_price_per_1M
             + (standard_output_tokens / 1M) × output_price_per_1M
             + (batch_output_tokens    / 1M) × batch_output_price_per_1M

Token counts come from the provider’s own tokenizer when we have it (tiktoken, o200k), otherwise a family coefficient with a ±7% error envelope.
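
The same formula expressed as a runnable function, plus a rough token estimator for the fallback case. Field names and the characters-per-token parameter are illustrative; only the per-1M pricing model and the ±7% envelope come from the text above.

const MILLION = 1_000_000;

interface Usage {
  uncachedInputTokens: number;
  cachedInputTokens: number;
  batchInputTokens: number;
  standardOutputTokens: number;
  batchOutputTokens: number;
}

interface PricePer1M {
  input: number;
  cachedInput: number;
  batchInput: number;
  output: number;
  batchOutput: number;
}

// Direct translation of monthly_cost above.
function monthlyCost(usage: Usage, price: PricePer1M): number {
  return (
    (usage.uncachedInputTokens / MILLION) * price.input +
    (usage.cachedInputTokens / MILLION) * price.cachedInput +
    (usage.batchInputTokens / MILLION) * price.batchInput +
    (usage.standardOutputTokens / MILLION) * price.output +
    (usage.batchOutputTokens / MILLION) * price.batchOutput
  );
}

// Fallback when no provider tokenizer is available: a per-family
// characters-per-token coefficient, bracketed by the ±7% error envelope.
function estimateTokens(text: string, charsPerToken: number): { low: number; mid: number; high: number } {
  const mid = Math.ceil(text.length / charsPerToken);
  return { low: Math.floor(mid * 0.93), mid, high: Math.ceil(mid * 1.07) };
}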

What we don’t do

Three rules that keep the data honest.

  • No sponsorships.

    We do not take money from LLM providers. No affiliate fees, no paid placements.

  • No vibes.

    We do not weight gut feelings. Every rank is a formula you can audit.

  • No guessed benchmarks.

    If a score has no citable source, we treat the model as median rather than invent a number.

FAQ

How we work — in more detail.

The non-obvious parts of sourcing, scoring, and refreshing the data.

  • How do you catch price changes?

    A snapshot cron runs every night at 02:00 UTC and captures current prices from provider pricing pages. A second cron at 02:30 UTC emails subscribers about changes or deprecations. Spot-checks against third-party aggregators (pricepertoken, costgoat) catch any misses.

  • Can providers pay for a better ranking?

    No. Ranking is never pay-to-play. Providers pay us nothing. The methodology documents every weight so anyone can reproduce our rankings.

  • How is the Buzzi Intelligence Index calculated?

    Our index uses our own weights (MMLU 0.25, GPQA 0.2, HumanEval 0.2, SWE-Bench 0.2, MMMU 0.15, AIME 0.1) and pulls from published benchmark scores on the provider's model card. We don't import any third-party composite score — the math and inputs are both ours and auditable.

  • What happens when a benchmark score isn't published?

    Missing benchmarks fall back to the cohort median for that benchmark, so a model isn't penalized for data the provider hasn't published. The Intelligence Index detail object flags which scores are "published" versus "median" so you know which numbers are direct and which are estimates.

  • How do you handle open-weight models with no managed endpoint?

    They get pricing_availability="self_host" and don't appear in cost calculations. If the same weights are hosted on Together, Fireworks, or Replicate, we include that row with pricing_availability="estimated" and cite the host in the pricing_url.

  • What do the pricing availability labels mean?

    "Public" = verified directly from the provider's pricing page. "Estimated" = open-weight model where we cite a common hosting provider's published price. "Self-host" = no managed endpoint exists, so cost depends on your own hardware. "Unknown" = announced but not yet priced publicly; the model is shown but excluded from cost calculations. (A type sketch of these labels follows this FAQ.)

  • How often do rankings change?

    Rankings change whenever inputs do — a price drop, a new benchmark score, a deprecation. The pages are ISR-cached for 10 minutes, so if you see a shift it's because the underlying inputs moved.

  • How do I report an error?

    Email hello@buzzi.ai with a link to the source. We correct within 24 hours and log the change in our public audit trail.
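
The type sketch referenced above. Only pricing_availability, pricing_url, and the published-versus-median distinction come from the answers; the surrounding shape is an assumption for illustration.

type PricingAvailability = "public" | "estimated" | "self_host" | "unknown";

interface ModelPricing {
  pricing_availability: PricingAvailability;
  pricing_url?: string; // for "estimated", cites the hosting provider (e.g. Together, Fireworks, Replicate)
}

interface IndexScoreDetail {
  benchmark: string;              // e.g. "MMLU"
  value: number;                  // 0–100
  source: "published" | "median"; // direct model-card score vs cohort-median estimate
}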

Brand marks. Provider logos shown across the comparison tool are used under nominative fair use for factual product comparison. All marks are property of their respective owners. Where an official logo isn’t available, we display a generated monogram wordmark as a placeholder.

Found an error?

Corrections welcome.

Spotted a missing model or a stale price? Email us with a link to the source. We typically correct within 24 hours.

hello@buzzi.ai

Open data

Use the dataset.

The underlying data is available as a JSON feed under CC BY 4.0 with attribution — free for research, products, and AI engines.

Open JSON feed
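
A hedged example of consuming the feed. The endpoint URL below is a placeholder, not the real address; use the "Open JSON feed" link above for the actual location.

const FEED_URL = "https://example.com/models.json"; // hypothetical placeholder URL

async function loadModels(): Promise<unknown[]> {
  const res = await fetch(FEED_URL);
  if (!res.ok) throw new Error(`feed request failed: ${res.status}`);
  return res.json(); // CC BY 4.0 — attribution required
}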