$ipbr-rank · live llm coding-role score
refreshed · 13 sources

What this is

ipbr-rank is a public LLM coding-role scoreboard. It pulls model performance from public benchmarks, normalizes each metric onto a common 0-100 scale, and produces four role scores: Idea, Plan, Build, and Review.

All inputs come from public, verifiable sources. Weights and aggregation rules are explicit and versioned. A small number of vendor-published metrics that haven't yet appeared on public leaderboards are recorded as overrides. There is no manual reranking.

The four roles

  • Idea — open-ended creativity, general intelligence, breadth. Driven by LM Arena Text, AI Stupid Level (AISL) idea-shaped axes, and reasoning blends.
  • Plan — structured reasoning, function-calling, multi-step task decomposition. Driven by Terminal-Bench, tau2-bench, IFBench, MCP-Atlas, and AISL plan axes.
  • Build — actually writing code that runs. Driven by SWE-bench (Verified + Multilingual + Pro), SWE-rebench, LiveCodeBench, Sonar code quality, and AISL build axes.
  • Review — judging code quality, correctness, and preference. Driven by LM Arena, Sonar issue density, and AISL review axes. Review has no adjusted variant — it is the source of the penalty applied to the other three.

Raw vs adjusted

The raw score is the benchmark composite, normalized. The adjusted score subtracts a reviewer-reservation penalty: when one vendor's models dominate Review, that lead gets discounted from their other scores.

L_v   = max(0, max(R_all) - max(R_outside_v))
I_adj = I_raw - 0.08 × L_v
P_adj = P_raw - 0.18 × L_v
B_adj = B_raw - 0.32 × L_v

L_v is vendor v's Review lead: the gap between the best Review score overall (R_all) and the best Review score among models from every other vendor (R_outside_v). The coefficients reflect how easy each role is to game with biased preference evaluations. Build is hardest hit; Idea is barely touched; Plan sits in between.
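
As a worked illustration, here is a minimal Python sketch of the penalty, assuming Review scores are grouped per vendor. The name adjusted_scores and the input shapes are illustrative, not the pipeline's actual identifiers.

def adjusted_scores(raw, review_by_vendor, vendor):
    """Apply the reviewer-reservation penalty to one vendor's raw role scores.

    raw              -- dict with "idea", "plan", "build" raw scores (0-100)
    review_by_vendor -- dict mapping vendor name -> list of Review scores
    vendor           -- the vendor whose models are being adjusted
    """
    best_overall = max(s for scores in review_by_vendor.values() for s in scores)
    best_outside = max(
        (s for v, scores in review_by_vendor.items() if v != vendor for s in scores),
        default=best_overall,  # assumption: a lone vendor has no measurable lead
    )
    lead = max(0.0, best_overall - best_outside)  # L_v, the vendor's Review lead

    return {
        "idea":  raw["idea"]  - 0.08 * lead,  # I_adj
        "plan":  raw["plan"]  - 0.18 * lead,  # P_adj
        "build": raw["build"] - 0.32 * lead,  # B_adj
    }

# Example: the leading vendor's best reviewer is 8 points ahead of everyone else.
reviews = {"vendor_a": [92.0, 88.5], "vendor_b": [84.0, 81.0]}
adjusted_scores({"idea": 90.0, "plan": 85.0, "build": 80.0}, reviews, "vendor_a")
# lead = 92.0 - 84.0 = 8.0  ->  idea 89.36, plan 83.56, build 77.44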

How scores are built

  1. Normalize — each metric is percentile-mapped within the active model population (5th/95th boundaries; log-scaled for cost/speed/latency). Operational metrics use a tail-penalty curve — the top 80% maps into 70-100 with mild differentiation; only the bottom 20% drops sharply. A normalization sketch follows this list.
  2. Aggregate — metrics roll up into groups (CRE, GEN, PLAN, BUILD, LM_ARENA_REVIEW_PROXY, OPS_*, A_I/A_P/A_B/A_R). If at least 70% of a group's weight is present, the group score is the present-weight mean; below that threshold, the mean shrinks toward 50 in proportion to the missing weight (see the aggregation sketch after this list).
  3. Combine — each role score is a weighted average of groups. AISL's role-shaped perspective (A_*) carries a weight of 0.35 in every formula. Operational metrics carry a weight of 0.08 — fast-enough models cluster within a 1-2 point spread, but genuinely slow models lose 4-6 points.
  4. Canary health — AISL canary drift is a penalty-only signal. Healthy or missing canary data adds nothing; degraded canary data can subtract up to 6 points from raw role scores.
  5. Synthesize last — when a known sibling pair has a metric on one model but not the other, the missing field is filled from the sibling and pulled 15% of the way toward 50, so it reads as a weaker signal (also covered in the aggregation sketch below).
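
A minimal sketch of step 1 under two stated assumptions: capability metrics are mapped linearly between the 5th and 95th percentile boundaries, and the operational tail-penalty curve is a simple two-segment piecewise-linear shape. The exact curve and the names percentile_bounds, position, and ops_score are illustrative, not the pipeline's.

import math

def percentile_bounds(values, lo=5.0, hi=95.0):
    """Return the lo-th and hi-th percentiles of the active model population."""
    ordered = sorted(values)
    def pct(p):
        idx = (len(ordered) - 1) * p / 100.0
        below, above = int(idx), min(int(idx) + 1, len(ordered) - 1)
        frac = idx - below
        return ordered[below] * (1.0 - frac) + ordered[above] * frac
    return pct(lo), pct(hi)

def position(value, population, log_scale=False, higher_is_better=True):
    """Clamped 0-1 position of a raw metric between the 5th/95th boundaries."""
    if log_scale:  # cost / speed / latency style metrics
        value = math.log(value)
        population = [math.log(v) for v in population]
    p5, p95 = percentile_bounds(population)
    if p95 == p5:
        return 0.5
    frac = min(1.0, max(0.0, (value - p5) / (p95 - p5)))
    return frac if higher_is_better else 1.0 - frac

def capability_score(frac):
    """Plain 0-100 mapping used for capability metrics."""
    return 100.0 * frac

def ops_score(frac):
    """Tail-penalty mapping for operational metrics: the top 80% of the
    population lands in 70-100 with mild differentiation; only the bottom
    20% drops sharply toward 0."""
    if frac >= 0.2:
        return 70.0 + 30.0 * (frac - 0.2) / 0.8
    return 70.0 * frac / 0.2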
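
And a sketch of the step-2 trust-threshold aggregation together with the step-5 sibling fill. The blend toward 50 below the threshold is one reasonable reading of "proportional to missing weight"; group_score and fill_from_sibling are illustrative names.

TRUST_THRESHOLD = 0.70  # minimum fraction of a group's weight that must be present

def group_score(metrics):
    """Aggregate one group from (value, weight) pairs; value is None when missing.

    At or above the 70% trust threshold the present-weight mean is used
    directly; below it, the mean is blended toward 50 in proportion to the
    missing weight."""
    total_weight = sum(w for _, w in metrics)
    present = [(v, w) for v, w in metrics if v is not None]
    if not present:
        return 50.0  # nothing observed: neutral score
    present_weight = sum(w for _, w in present)
    mean = sum(v * w for v, w in present) / present_weight
    coverage = present_weight / total_weight
    if coverage >= TRUST_THRESHOLD:
        return mean
    return coverage * mean + (1.0 - coverage) * 50.0  # shrink toward 50

# e.g. a group at 50% coverage with a present-weight mean of 80 lands at
# 0.5 * 80 + 0.5 * 50 = 65

def fill_from_sibling(sibling_value, soften=0.15):
    """Step-5 sibling fill: copy the sibling's normalized value, then pull it
    15% of the way toward 50 so it reads as a weaker signal."""
    return sibling_value + soften * (50.0 - sibling_value)

# e.g. a sibling value of 80 fills in as 80 + 0.15 * (50 - 80) = 75.5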

Sources

source                 status     rows · matched · unmatched
aistupidlevel          verified   24391
arc_agi                verified   14956107
artificial_analysis    verified   50095421
livecodebench          verified   28424
lmarena                verified   38790313
mcp_atlas              verified   19267
openrouter             verified   36766317
overrides              verified   23370
sonar                  verified   583735
swebench               verified   19436167
swebench_pro           verified   242214
swerebench             verified   913172
terminal_bench         verified   1248752

Glossary

  • Trust threshold — the 70% group-coverage cutoff at or above which the present-weight mean is trusted directly.
  • Composite — a metric that is itself a weighted blend of related sub-metrics (currently SWEComposite).
  • A_* perspective — AISL's 17 capability axes weighted four ways (one weighting per role). Canary health is separate and penalty-only.
