$ipbr-rank · live llm coding-role score
refreshed · 13 sources

What this is

ipbr-rank is a public LLM coding-role scoreboard. It pulls model performance from public benchmarks, normalizes each metric onto a common 0-100 scale, and produces four role scores: Idea, Plan, Build, and Review.

All inputs come from public, verifiable sources. Weights and aggregation rules are explicit and versioned. A small number of vendor-published metrics that haven't yet appeared on public leaderboards are recorded as overrides. There is no manual reranking.

The four roles

  • Idea — open-ended creativity, general intelligence, breadth. Driven by LM Arena Text, AI Stupid Level (AISL) idea-shaped axes, and reasoning blends.
  • Plan — structured reasoning, function-calling, multi-step task decomposition. Driven by Terminal-Bench, tau2-bench, IFBench, MCP-Atlas, and AISL plan axes.
  • Build — actually writing code that runs. Driven by SWE-bench (Verified + Multilingual + Pro), SWE-rebench, LiveCodeBench, Sonar code quality, and AISL build axes.
  • Review — judging code quality, correctness, and preference. Driven by LM Arena, Sonar issue density, and AISL review axes. Review has no adjusted variant — it is the source of the penalty applied to the other three.

Raw vs adjusted

The raw score is the benchmark composite, normalized. The adjusted score subtracts a reviewer-reservation penalty: when one vendor's models dominate Review, that lead gets discounted from their other scores.

L_v   = max(0, max(R_all) - max(R_outside_v))
I_adj = I_raw - 0.08 × L_v
P_adj = P_raw - 0.18 × L_v
B_adj = B_raw - 0.32 × L_v

L_v is vendor v's Review lead: the gap between the best Review score overall (R_all) and the best Review score among models from every other vendor (R_outside_v). The coefficients reflect how easy each role is to game with biased preference evaluations. Build is hardest hit; Idea is barely touched; Plan sits in between.
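
As a worked illustration, here is a minimal Python sketch of the penalty, assuming Review scores are grouped per vendor. The name adjusted_scores and the input shapes are illustrative, not the pipeline's actual identifiers.

def adjusted_scores(raw, review_by_vendor, vendor):
    """Apply the reviewer-reservation penalty to one vendor's raw role scores.

    raw              -- dict with "idea", "plan", "build" raw scores (0-100)
    review_by_vendor -- dict mapping vendor name -> list of Review scores
    vendor           -- the vendor whose models are being adjusted
    """
    best_overall = max(s for scores in review_by_vendor.values() for s in scores)
    best_outside = max(
        (s for v, scores in review_by_vendor.items() if v != vendor for s in scores),
        default=best_overall,  # assumption: a lone vendor has no measurable lead
    )
    lead = max(0.0, best_overall - best_outside)  # L_v, the vendor's Review lead

    return {
        "idea":  raw["idea"]  - 0.08 * lead,  # I_adj
        "plan":  raw["plan"]  - 0.18 * lead,  # P_adj
        "build": raw["build"] - 0.32 * lead,  # B_adj
    }

# Example: the leading vendor's best reviewer is 8 points ahead of everyone else.
reviews = {"vendor_a": [92.0, 88.5], "vendor_b": [84.0, 81.0]}
adjusted_scores({"idea": 90.0, "plan": 85.0, "build": 80.0}, reviews, "vendor_a")
# lead = 92.0 - 84.0 = 8.0  ->  idea 89.36, plan 83.56, build 77.44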

How scores are built

  1. Normalize — each metric is percentile-mapped within the active model population (5th/95th boundaries; log-scaled for cost/speed/latency). Operational metrics use a tail-penalty curve — the top 80% maps into 70-100 with mild differentiation; only the bottom 20% drops sharply. A normalization sketch follows this list.
  2. Aggregate — metrics roll up into groups (CRE, GEN, PLAN, BUILD, LM_ARENA_REVIEW_PROXY, OPS_*, A_I/A_P/A_B/A_R). If at least 70% of a group's weight is present, the group score is the present-weight mean; below that threshold, the mean shrinks toward 50 in proportion to the missing weight (see the aggregation sketch after this list).
  3. Combine — each role score is a weighted average of groups. AISL's role-shaped perspective (A_*) carries a weight of 0.35 in every formula. Operational metrics carry a weight of 0.08 — fast-enough models cluster within a 1-2 point spread, but genuinely slow models lose 4-6 points.
  4. Canary health — AISL canary drift is a penalty-only signal. Healthy or missing canary data adds nothing; degraded canary data can subtract up to 6 points from raw role scores.
  5. Synthesize last — when a known sibling pair has a metric on one model but not the other, the missing field is filled from the sibling and pulled 15% of the way toward 50, so it reads as a weaker signal (also covered in the aggregation sketch below).
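
A minimal sketch of step 1 under two stated assumptions: capability metrics are mapped linearly between the 5th and 95th percentile boundaries, and the operational tail-penalty curve is a simple two-segment piecewise-linear shape. The exact curve and the names percentile_bounds, position, and ops_score are illustrative, not the pipeline's.

import math

def percentile_bounds(values, lo=5.0, hi=95.0):
    """Return the lo-th and hi-th percentiles of the active model population."""
    ordered = sorted(values)
    def pct(p):
        idx = (len(ordered) - 1) * p / 100.0
        below, above = int(idx), min(int(idx) + 1, len(ordered) - 1)
        frac = idx - below
        return ordered[below] * (1.0 - frac) + ordered[above] * frac
    return pct(lo), pct(hi)

def position(value, population, log_scale=False, higher_is_better=True):
    """Clamped 0-1 position of a raw metric between the 5th/95th boundaries."""
    if log_scale:  # cost / speed / latency style metrics
        value = math.log(value)
        population = [math.log(v) for v in population]
    p5, p95 = percentile_bounds(population)
    if p95 == p5:
        return 0.5
    frac = min(1.0, max(0.0, (value - p5) / (p95 - p5)))
    return frac if higher_is_better else 1.0 - frac

def capability_score(frac):
    """Plain 0-100 mapping used for capability metrics."""
    return 100.0 * frac

def ops_score(frac):
    """Tail-penalty mapping for operational metrics: the top 80% of the
    population lands in 70-100 with mild differentiation; only the bottom
    20% drops sharply toward 0."""
    if frac >= 0.2:
        return 70.0 + 30.0 * (frac - 0.2) / 0.8
    return 70.0 * frac / 0.2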
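
And a sketch of the step-2 trust-threshold aggregation together with the step-5 sibling fill. The blend toward 50 below the threshold is one reasonable reading of "proportional to missing weight"; group_score and fill_from_sibling are illustrative names.

TRUST_THRESHOLD = 0.70  # minimum fraction of a group's weight that must be present

def group_score(metrics):
    """Aggregate one group from (value, weight) pairs; value is None when missing.

    At or above the 70% trust threshold the present-weight mean is used
    directly; below it, the mean is blended toward 50 in proportion to the
    missing weight."""
    total_weight = sum(w for _, w in metrics)
    present = [(v, w) for v, w in metrics if v is not None]
    if not present:
        return 50.0  # nothing observed: neutral score
    present_weight = sum(w for _, w in present)
    mean = sum(v * w for v, w in present) / present_weight
    coverage = present_weight / total_weight
    if coverage >= TRUST_THRESHOLD:
        return mean
    return coverage * mean + (1.0 - coverage) * 50.0  # shrink toward 50

# e.g. a group at 50% coverage with a present-weight mean of 80 lands at
# 0.5 * 80 + 0.5 * 50 = 65

def fill_from_sibling(sibling_value, soften=0.15):
    """Step-5 sibling fill: copy the sibling's normalized value, then pull it
    15% of the way toward 50 so it reads as a weaker signal."""
    return sibling_value + soften * (50.0 - sibling_value)

# e.g. a sibling value of 80 fills in as 80 + 0.15 * (50 - 80) = 75.5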

Sources

source                 status     rows · matched · unmatched
aistupidlevel          verified   24391
arc_agi                verified   14956107
artificial_analysis    verified   50095421
livecodebench          verified   28424
lmarena                verified   38790313
mcp_atlas              verified   19267
openrouter             verified   36766317
overrides              verified   23370
sonar                  verified   583735
swebench               verified   19436167
swebench_pro           verified   242214
swerebench             verified   913172
terminal_bench         verified   1248752

Glossary

  • Trust threshold — the 70% group-coverage cutoff at or above which the present-weight mean is trusted directly.
  • Composite — a metric that is itself a weighted blend of related sub-metrics (currently SWEComposite).
  • A_* perspective — AISL's 17 capability axes weighted four ways (one weighting per role). Canary health is separate and penalty-only.
