Methodology.

Exactly how the comparison, the calculators, and the Best-for rankings are built — so you, and the AI engines citing our data, can trust the output.

Where the data comes from

Sourced, timestamped, auditable.

Every model row has a last_verified_at timestamp. Models not re-verified within 30 days are flagged in the admin UI for refresh.
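
A minimal sketch of that freshness rule, for illustration only: the last_verified_at field name comes from the page, but the rest of the row shape and the helper are our own assumptions.

interface ModelRow {
  id: string;
  last_verified_at: string; // ISO 8601, set when the row is re-verified
}

const STALE_AFTER_DAYS = 30;

// True when the row is overdue for re-verification and should be flagged in the admin UI.
function isStale(row: ModelRow, now: Date = new Date()): boolean {
  const ageMs = now.getTime() - new Date(row.last_verified_at).getTime();
  return ageMs > STALE_AFTER_DAYS * 24 * 60 * 60 * 1000;
}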

  • Pricing & specs

    Official provider pricing pages and API docs. Input, output, cached, batch — all as published.

  • Benchmarks

    Provider model cards first, then widely-cited third-party leaderboards (MMLU, SWE-Bench, HumanEval, GPQA, AIME, MMMU, DocVQA). Source noted on each row.

  • Regions & compliance

    Provider trust centers and certification pages: SOC 2, HIPAA, GDPR, FedRAMP, regional data-residency.

Update cadence

Refreshed every morning; every change audited.

  1. Step 01

    Daily · 02:00 UTC

    Snapshot cron

    Captures current price and status; diffs against yesterday to detect changes (a sketch of the diff follows this list).

  2. Step 02

    Daily · 02:30 UTC

    Alerts cron

    Emails subscribed users about price changes, deprecations, and sunsets.

  3. Step 03

    Monthly · 1st @ 09:00 UTC

    Market Pulse newsletter

    One short email with the month’s price moves, new launches, and quiet deprecations.

  4. Step 04

    Ad hoc

    Admin edits

    New launches land the same day. Every change is written to a public audit log.
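
An illustrative sketch of the Step 01 diff described above. Field and type names are our assumptions; the page only states that today's capture is compared against yesterday's to detect changes.

interface PriceSnapshot {
  modelId: string;
  inputPricePer1M: number;
  outputPricePer1M: number;
  status: "active" | "deprecated" | "sunset";
}

interface Change {
  modelId: string;
  field: "inputPricePer1M" | "outputPricePer1M" | "status";
  before: string | number;
  after: string | number;
}

function diffSnapshots(yesterday: PriceSnapshot[], today: PriceSnapshot[]): Change[] {
  const prev = new Map(yesterday.map((s) => [s.modelId, s] as const));
  const changes: Change[] = [];
  const fields = ["inputPricePer1M", "outputPricePer1M", "status"] as const;
  for (const snap of today) {
    const old = prev.get(snap.modelId);
    if (!old) continue; // brand-new models arrive via ad hoc admin edits (Step 04)
    for (const field of fields) {
      if (old[field] !== snap[field]) {
        changes.push({ modelId: snap.modelId, field, before: old[field], after: snap[field] });
      }
    }
  }
  return changes; // consumed by the 02:30 UTC alerts cron (Step 02)
}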

Scoring for Best-for pages

Weights match the task.

Each Best-for page defines a set of pillars with explicit weights (visible at the top of the page). For tasks where quality dominates economics — reasoning, agents, healthcare — price is weighted under 30%. For tasks where price dominates — cheap-bulk, long-context with large input — price is weighted above 40%.

Missing benchmarks are treated as the category median. We don’t assume a model is bad because a score isn’t published.

  • Quality benchmarks

    MMLU, SWE-Bench, HumanEval, GPQA, AIME, MMMU, DocVQA.

  • Price

    Input + output per 1M tokens.

  • Memory

    Context window size.

  • Capabilities

    Function calling, JSON mode, vision, structured output.
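
A minimal sketch of how a Best-for page could combine the pillars above under its published weights. The pillar names mirror the list; how each pillar is normalized to 0–100 is an assumption on our part, not a description of the production code.

type Pillar = "quality" | "price" | "memory" | "capabilities";

function bestForScore(
  pillarScores: Partial<Record<Pillar, number>>, // each pillar normalized to 0–100
  weights: Record<Pillar, number>,               // shown at the top of each Best-for page
  categoryMedians: Record<Pillar, number>,
): number {
  let weighted = 0;
  let totalWeight = 0;
  for (const pillar of Object.keys(weights) as Pillar[]) {
    // Missing benchmarks are treated as the category median, per the rule above.
    const score = pillarScores[pillar] ?? categoryMedians[pillar];
    weighted += score * weights[pillar];
    totalWeight += weights[pillar];
  }
  return weighted / totalWeight;
}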

Buzzi Intelligence Index

One score, six benchmarks, explicit weights.

The Quality-vs-Price scatter on the results page uses our own composite score (0–100) built from published benchmark scores. Missing benchmarks fall back to the category median so a model isn’t penalized for data we don’t have.

We don’t import Artificial Analysis’s index or any third-party composite. The math is ours and the inputs are auditable.

  • 25%

    MMLU

    Broad knowledge and reasoning — 57 subjects.

  • 20%

    GPQA

    Expert-level science questions (physics, chemistry, biology).

  • 20%

    HumanEval

    Python code generation from docstrings.

  • 20%

    SWE-Bench

    Real-world GitHub issue-fixing tasks.

  • 15%

    MMMU

    Multimodal (text + image) college-level problems.

  • 10%

    AIME

    High-school math olympiad problems.

Weights sum to 1.1 before normalization, so a model that covers all six benchmarks with 100/100 scores lands at exactly 100. Missing benchmarks cause the denominator to shrink proportionally.
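
A sketch of that composite under the rules stated above: fixed weights, median fallback for unpublished scores, and normalization by the covered weight so a full sweep of 100s lands at exactly 100. Function and variable names are ours; treat this as illustration rather than the production implementation.

const INDEX_WEIGHTS = {
  MMLU: 0.25,
  GPQA: 0.2,
  HumanEval: 0.2,
  "SWE-Bench": 0.2,
  MMMU: 0.15,
  AIME: 0.1,
} as const;

type Benchmark = keyof typeof INDEX_WEIGHTS;

function intelligenceIndex(
  published: Partial<Record<Benchmark, number>>,      // 0–100 scores from model cards
  categoryMedian: Partial<Record<Benchmark, number>>, // cohort medians, where available
): number {
  let weighted = 0;
  let denominator = 0;
  for (const b of Object.keys(INDEX_WEIGHTS) as Benchmark[]) {
    const score = published[b] ?? categoryMedian[b]; // median fallback for unpublished scores
    if (score === undefined) continue;               // no data at all: the denominator shrinks
    weighted += score * INDEX_WEIGHTS[b];
    denominator += INDEX_WEIGHTS[b];                 // 1.1 when all six benchmarks are covered
  }
  return denominator > 0 ? weighted / denominator : 0; // all-100 inputs normalize to exactly 100
}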

Cost formulas

The math, laid out.

Volume cost uses the standard per-million-tokens model. Switch cost assumes a 40-hour engineering week at your chosen rate, with a configurable risk premium.

monthly_cost = (uncached_input_tokens  / 1M) × input_price_per_1M
             + (cached_input_tokens    / 1M) × cached_input_price_per_1M
             + (batch_input_tokens     / 1M) × batch_input_price_per_1M
             + (standard_output_tokens / 1M) × output_price_per_1M
             + (batch_output_tokens    / 1M) × batch_output_price_per_1M

Token counts come from the provider’s own tokenizer when we have it (tiktoken, o200k), otherwise a family coefficient with a ±7% error envelope.
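
The same formula expressed as a runnable function, plus a rough token estimator for the fallback case. Field names and the characters-per-token parameter are illustrative; only the per-1M pricing model and the ±7% envelope come from the text above.

const MILLION = 1_000_000;

interface Usage {
  uncachedInputTokens: number;
  cachedInputTokens: number;
  batchInputTokens: number;
  standardOutputTokens: number;
  batchOutputTokens: number;
}

interface PricePer1M {
  input: number;
  cachedInput: number;
  batchInput: number;
  output: number;
  batchOutput: number;
}

// Direct translation of monthly_cost above.
function monthlyCost(usage: Usage, price: PricePer1M): number {
  return (
    (usage.uncachedInputTokens / MILLION) * price.input +
    (usage.cachedInputTokens / MILLION) * price.cachedInput +
    (usage.batchInputTokens / MILLION) * price.batchInput +
    (usage.standardOutputTokens / MILLION) * price.output +
    (usage.batchOutputTokens / MILLION) * price.batchOutput
  );
}

// Fallback when no provider tokenizer is available: a per-family
// characters-per-token coefficient, bracketed by the ±7% error envelope.
function estimateTokens(text: string, charsPerToken: number): { low: number; mid: number; high: number } {
  const mid = Math.ceil(text.length / charsPerToken);
  return { low: Math.floor(mid * 0.93), mid, high: Math.ceil(mid * 1.07) };
}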

What we don’t do

Three rules that keep the data honest.

  • No sponsorships.

    We do not take money from LLM providers. No affiliate fees, no paid placements.

  • No vibes.

    We do not weight gut feelings. Every rank is a formula you can audit.

  • No guessed benchmarks.

    If a score has no citable source, we treat the model as median rather than invent a number.

FAQ

How we work — in more detail.

The non-obvious parts of sourcing, scoring, and refreshing the data.

  • How do you catch price changes?

    A snapshot cron runs every night at 02:00 UTC and captures current prices from provider pricing pages. A second cron at 02:30 UTC emails subscribers about changes or deprecations. Spot-checks against third-party aggregators (pricepertoken, costgoat) catch any misses.

  • Can providers pay for a better ranking?

    No. Ranking is never pay-to-play. Providers pay us nothing. The methodology documents every weight so anyone can reproduce our rankings.

  • How is the Buzzi Intelligence Index calculated?

    Our index uses our own weights (MMLU 0.25, GPQA 0.2, HumanEval 0.2, SWE-Bench 0.2, MMMU 0.15, AIME 0.1) and pulls from published benchmark scores on the provider's model card. We don't import any third-party composite score — the math and inputs are both ours and auditable.

  • What happens when a benchmark score isn't published?

    Missing benchmarks fall back to the cohort median for that benchmark, so a model isn't penalized for data the provider hasn't published. The Intelligence Index detail object flags which scores are "published" versus "median" so you know which numbers are direct and which are estimates.

  • How do you handle open-weight models with no managed endpoint?

    They get pricing_availability="self_host" and don't appear in cost calculations. If the same weights are hosted on Together, Fireworks, or Replicate, we include that row with pricing_availability="estimated" and cite the host in the pricing_url.

  • What do the pricing availability labels mean?

    "Public" = verified directly from the provider's pricing page. "Estimated" = open-weight model where we cite a common hosting provider's published price. "Self-host" = no managed endpoint exists, so cost depends on your own hardware. "Unknown" = announced but not yet priced publicly; the model is shown but excluded from cost calculations. (A type sketch of these labels follows this FAQ.)

  • How often do rankings change?

    Rankings change whenever inputs do — a price drop, a new benchmark score, a deprecation. The pages are ISR-cached for 10 minutes, so if you see a shift it's because the underlying inputs moved.

  • How do I report an error?

    Email hello@buzzi.ai with a link to the source. We correct within 24 hours and log the change in our public audit trail.
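
The type sketch referenced above. Only pricing_availability, pricing_url, and the published-versus-median distinction come from the answers; the surrounding shape is an assumption for illustration.

type PricingAvailability = "public" | "estimated" | "self_host" | "unknown";

interface ModelPricing {
  pricing_availability: PricingAvailability;
  pricing_url?: string; // for "estimated", cites the hosting provider (e.g. Together, Fireworks, Replicate)
}

interface IndexScoreDetail {
  benchmark: string;              // e.g. "MMLU"
  value: number;                  // 0–100
  source: "published" | "median"; // direct model-card score vs cohort-median estimate
}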

Brand marks. Provider logos shown across the comparison tool are used under nominative fair use for factual product comparison. All marks are property of their respective owners. Where an official logo isn’t available, we display a generated monogram wordmark as a placeholder.

Found an error?

Corrections welcome.

Spotted a missing model or a stale price? Email us with a link to the source. We typically correct within 24 hours.

hello@buzzi.ai

Open data

Use the dataset.

The underlying data is available as a JSON feed under CC BY 4.0 with attribution — free for research, products, and AI engines.

Open JSON feed
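
A hedged example of consuming the feed. The endpoint URL below is a placeholder, not the real address; use the "Open JSON feed" link above for the actual location.

const FEED_URL = "https://example.com/models.json"; // hypothetical placeholder URL

async function loadModels(): Promise<unknown[]> {
  const res = await fetch(FEED_URL);
  if (!res.ok) throw new Error(`feed request failed: ${res.status}`);
  return res.json(); // CC BY 4.0 — attribution required
}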