$ipbr-rank · live llm coding-role score
refreshed · 13 sources
[ idea ]
rank  model                   raw   adjusted
1     claude-opus-4.7         87.0  87.0
2     claude-opus-4.6         86.8  86.8
3     gpt-5.5                 82.1  82.1
[ plan ]
rank  model                   raw   adjusted
1     gpt-5.5                 82.5  82.5
2     gemini-3.1-pro-preview  78.7  78.7
3     claude-opus-4.7         77.3  77.2
[ build ]
rank  model                   raw   adjusted
1     claude-opus-4.7         73.5  73.2
2     gpt-5.5                 69.1  69.1
3     claude-opus-4.6         68.5  68.2
[ review ]
rank  model                   raw   adjusted
1     claude-opus-4.7         78.1  78.1
2     kimi-k2.6               77.3  77.3
3     gpt-5.5                 75.3  75.3
how scoring works

Each model gets four role scores from public benchmarks. Idea measures open-ended creativity. Plan measures structured reasoning, function calling, and multi-step decomposition. Build measures implementation skill on SWE-bench, LiveCodeBench, and terminal tasks. Review measures preference judgment.

raw vs adjusted

The raw score is the benchmark composite, normalized to 0-100. The adjusted score subtracts a reviewer-reservation penalty: when a vendor's models dominate Review, that lead gets discounted from their Idea, Plan, and Build scores so vendors can't game their own preference evaluations.

Penalty coefficients differ by role: Build is penalized hardest (0.32), Plan moderately (0.18), Idea lightly (0.08). Review is never adjusted — it is the source of the penalty.
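
The adjustment can be sketched in a few lines of Python. This is a minimal sketch, not the site's implementation: the coefficients come from the text above, but the definition of the "lead" (a vendor's best Review score minus the best Review score from any other vendor, floored at zero) is an assumption, and the names `ROLE_PENALTY`, `vendor_review_lead`, and `adjusted_score` are hypothetical. The inputs are the rounded top-three Review scores shown on this page; the real computation presumably uses unrounded scores for all models.

```python
# Penalty coefficients per role, as stated above. Review is never adjusted.
ROLE_PENALTY = {"idea": 0.08, "plan": 0.18, "build": 0.32}


def vendor_review_lead(review_scores, vendor):
    """ASSUMED definition: how far the vendor's best Review score sits
    above the best Review score of any other vendor (never negative)."""
    own = [score for v, score in review_scores if v == vendor]
    others = [score for v, score in review_scores if v != vendor]
    if not own or not others:
        return 0.0
    return max(0.0, max(own) - max(others))


def adjusted_score(raw, role, lead):
    """Subtract the role's share of the Review lead from the raw score."""
    return raw - ROLE_PENALTY.get(role, 0.0) * lead


# (vendor, Review score) pairs taken from the displayed Review top three.
reviews = [("anthropic", 78.1), ("moonshot", 77.3), ("openai", 75.3)]
lead = vendor_review_lead(reviews, "anthropic")

print(round(adjusted_score(73.5, "build", lead), 1))  # 73.2
print(round(adjusted_score(77.3, "plan", lead), 1))   # 77.2
```

With the displayed scores, this gives 73.2 for Build and 77.2 for Plan, which agrees with the adjusted values shown for claude-opus-4.7. That agreement is suggestive but does not confirm how the lead is actually defined.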

missing data

When a model is missing some metrics within a group, the group score is computed from the metrics that are present, provided they cover at least 70% of the group weight. Below that threshold, the score shrinks toward 50 in proportion to the missing weight.
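
One way to read the missing-data rule is sketched below. The 70% threshold and the shrink target of 50 come from the text above; the weighted-mean form of the group score, the linear blend `coverage * mean + (1 - coverage) * 50`, and the names `group_score`, `metrics`, and `weights` are assumptions, since the exact formula is not given here.

```python
def group_score(metrics, weights, min_coverage=0.70, prior=50.0):
    """Weighted mean of the present metrics; below the coverage threshold
    the result is pulled toward `prior` in proportion to the missing weight.

    metrics: dict of metric name -> score (0-100); missing metrics are
             absent or None
    weights: dict of metric name -> weight within the group
    """
    total = sum(weights.values())
    present = {m: w for m, w in weights.items() if metrics.get(m) is not None}
    covered = sum(present.values())
    if total <= 0 or covered <= 0:
        return prior  # nothing to score: fall back to the neutral value
    mean = sum(metrics[m] * w for m, w in present.items()) / covered
    coverage = covered / total
    if coverage >= min_coverage:
        return mean
    # ASSUMED linear shrink: the missing share of weight counts as `prior`.
    return coverage * mean + (1.0 - coverage) * prior


# Hypothetical group with three metrics; names and weights are made up.
weights = {"m1": 5, "m2": 3, "m3": 2}
print(group_score({"m1": 90.0, "m2": 80.0}, weights))  # 80% covered: plain mean
print(group_score({"m1": 90.0}, weights))              # 50% covered: shrunk toward 50
```

In this sketch a group with 80% of its weight covered keeps the plain weighted mean, while one with only 50% covered is blended halfway toward 50.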


claude-opus-4.7 (anthropic) · raw / adjusted: idea 87.0 / 87.0 · plan 77.3 / 77.2 · build 73.5 / 73.2 · review 78.1

group breakdown

A_B 54.7 (13/24)
A_I 78.7 (4/24)
A_P 66.5 (4/24)
A_R 68.9 (15/24)
BUILD 87.6 (1/24)
CRE 95.0 (4/24)
GEN 97.0 (2/24)
LM_ARENA_REVIEW_PROXY 100.0 (1/24)
OPS_long 74.3 (17/24)
OPS_precision 69.0 (17/24)
OPS_review 65.6 (17/24)
PLAN 79.9 (4/24)

metrics

AI_canary_health 88.2 (3/7)
AI_code 0.0 (15/22)
AI_complexity 2.9 (11/22)
AI_context_awareness 15.2 (5/24)
AI_correctness 94.0 (7/22)
AI_edge_cases 85.9 (11/22)
AI_efficiency 100.0 (1/22)
AI_hallucination_resistance 0.0 (23/24)
AI_memory_retention 30.5 (10/24)
AI_parameter_accuracy 35.2 (20/24)
AI_plan_coherence 21.9 (13/24)
AI_recovery 98.6 (8/22)
AI_refusal 100.0 (4/22)
AI_safety_compliance 100.0 (2/24)
AI_spec 100.0 (4/22)
AI_stability 90.4 (5/22)
AI_task_completion 100.0 (1/24)
AI_tool_selection 57.6 (13/24)
ARC_AGI_2 94.0 (3/17)
ArtificialAnalysisCoding 90.3 (3/21)
ArtificialAnalysisIntelligence 100.0 (1/21)
ArtificialAnalysisReasoning 95.6 (3/21)
ContextWindow 99.3 (8/24)
CopilotArenaOrLMArenaCode 100.0 (2/22)
GDPval 95.0 (1/11)
GPQA_HLE_Reasoning 95.6 (3/21)
IFBench 46.6 (10/21)
InverseCost 61.9 (22/24)
InverseTTFT 49.1 (16/19)
LMArenaCreativeOrOpenEnded 95.0 (4/24)
LMArenaSearchDocument 100.0 (1/19)
LMArenaText 95.0 (4/24)
LongContextRecall 88.2 (6/21)
OutputSpeed 78.8 (12/19)
SWEBenchPro 95.0 (1/14)
SWEBenchVerified 94.6 (2/18)
SWEComposite 92.7 (1/24)
SWERebench 85.3 (6/20)
SciCode 100.0 (1/21)
SonarFunctionalSkill 98.4 (2/17)
SonarIssueDensity 2.4 (17/17)
Tau2Bench 83.1 (9/21)
TerminalBench 78.6 (4/22)
sources: aistupidlevel, arc_agi, artificial_analysis, lmarena, mcp_atlas, openrouter, overrides, sonar
missing: BUILD/LiveCodeBench, BUILD/MCPAtlas, PLAN/MCPAtlas, SWEComposite/SWEBenchMultilingual
gpt-5.5 (openai) · raw / adjusted: idea 82.1 / 82.1 · plan 82.5 / 82.5 · build 69.1 / 69.1 · review 75.3

group breakdown

A_B 56.8 (7/24)
A_I 76.2 (10/24)
A_P 62.8 (10/24)
A_R 79.1 (7/24)
BUILD 74.4 (4/24)
CRE 82.0 (7/24)
GEN 94.4 (3/24)
LM_ARENA_REVIEW_PROXY 27.4 (14/24)
OPS_long 81.7 (10/24)
OPS_precision 80.0 (9/24)
OPS_review 77.5 (11/24)
PLAN 95.4 (1/24)

metrics

AI_code 0.0 (22/22)
AI_complexity 0.0 (22/22)
AI_context_awareness 0.0 (20/24)
AI_correctness 94.0 (15/22)
AI_edge_cases 85.9 (19/22)
AI_efficiency 85.8 (9/22)
AI_hallucination_resistance 60.0 (14/24)
AI_memory_retention 0.0 (24/24)
AI_parameter_accuracy 95.8 (3/24)
AI_plan_coherence 7.9 (19/24)
AI_recovery 98.6 (16/22)
AI_refusal 100.0 (15/22)
AI_safety_compliance 100.0 (10/24)
AI_spec 100.0 (15/22)
AI_stability 90.4 (8/22)
AI_task_completion 87.0 (9/24)
AI_tool_selection 84.9 (6/24)
ARC_AGI_2 98.1 (2/17)
ArtificialAnalysisCoding 100.0 (2/21)
ArtificialAnalysisIntelligence 98.1 (3/21)
ArtificialAnalysisReasoning 100.0 (2/21)
ContextWindow 100.0 (2/24)
CopilotArenaOrLMArenaCode 71.4 (8/22)
GPQA_HLE_Reasoning 100.0 (2/21)
IFBench 80.7 (6/21)
InverseCost 50.6 (23/24)
InverseTTFT 81.7 (8/19)
LMArenaCreativeOrOpenEnded 82.0 (7/24)
LMArenaSearchDocument 27.4 (9/19)
LMArenaText 82.0 (7/24)
LongContextRecall 98.0 (3/21)
OutputSpeed 82.4 (9/19)
SWEBenchVerified 95.0 (1/18)
SWEComposite 63.5 (8/24)
SciCode 94.5 (4/21)
SonarFunctionalSkill 70.7 (13/17)
SonarIssueDensity 46.0 (4/17)
Tau2Bench 90.5 (7/21)
TerminalBench 100.0 (2/22)
sources: aistupidlevel, arc_agi, artificial_analysis, lmarena, openrouter, overrides, sonar, terminal_bench
missing: BUILD/GDPval, BUILD/LiveCodeBench, BUILD/MCPAtlas, PLAN/MCPAtlas, SWEComposite/SWEBenchMultilingual, SWEComposite/SWEBenchPro, SWEComposite/SWERebench
claude-opus-4.6 (anthropic) · raw / adjusted: idea 86.8 / 86.8 · plan 72.3 / 72.1 · build 68.5 / 68.2 · review 67.0

group breakdown

A_B 54.8 (12/24)
A_I 76.2 (9/24)
A_P 63.2 (9/24)
A_R 70.1 (13/24)
BUILD 78.3 (2/24)
CRE 100.0 (1/24)
GEN 89.6 (4/24)
LM_ARENA_REVIEW_PROXY 32.2 (13/24)
OPS_long 78.0 (13/24)
OPS_precision 76.8 (13/24)
OPS_review 74.7 (13/24)
PLAN 72.8 (5/24)

metrics

AI_canary_health 83.3 (4/7)
AI_code 12.2 (5/22)
AI_complexity 2.9 (10/22)
AI_context_awareness 9.8 (9/24)
AI_correctness 94.0 (6/22)
AI_edge_cases 85.9 (10/22)
AI_efficiency 82.7 (11/22)
AI_hallucination_resistance 0.0 (22/24)
AI_memory_retention 0.0 (14/24)
AI_parameter_accuracy 71.8 (15/24)
AI_plan_coherence 2.4 (20/24)
AI_recovery 98.6 (7/22)
AI_refusal 100.0 (3/22)
AI_safety_compliance 83.7 (16/24)
AI_spec 100.0 (3/22)
AI_stability 90.4 (4/22)
AI_task_completion 93.2 (4/24)
AI_tool_selection 100.0 (2/24)
ARC_AGI_2 92.2 (4/17)
ArtificialAnalysisCoding 76.1 (5/21)
ArtificialAnalysisIntelligence 84.0 (5/21)
ArtificialAnalysisReasoning 86.3 (5/21)
ContextWindow 99.3 (7/24)
CopilotArenaOrLMArenaCode 100.0 (1/22)
GDPval 78.0 (5/11)
GPQA_HLE_Reasoning 86.3 (5/21)
IFBench 31.4 (16/21)
InverseCost 61.9 (21/24)
InverseTTFT 73.6 (15/19)
LMArenaCreativeOrOpenEnded 100.0 (1/24)
LMArenaSearchDocument 32.2 (8/19)
LMArenaText 100.0 (1/24)
LongContextRecall 90.2 (4/21)
OutputSpeed 76.6 (15/19)
SWEBenchMultilingual 90.9 (2/6)
SWEBenchPro 76.3 (3/14)
SWEBenchVerified 67.9 (8/18)
SWEComposite 78.3 (3/24)
SWERebench 91.6 (4/20)
SciCode 85.8 (5/21)
SonarFunctionalSkill 97.4 (3/17)
SonarIssueDensity 41.9 (6/17)
Tau2Bench 91.2 (6/21)
TerminalBench 64.5 (7/22)
sources: aistupidlevel, arc_agi, artificial_analysis, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swebench_pro, swerebench, terminal_bench
missing: BUILD/LiveCodeBench, BUILD/MCPAtlas, PLAN/MCPAtlas
kimi-k2.6 (moonshot) · raw / adjusted: idea 74.9 / 74.9 · plan 71.6 / 71.6 · build 66.1 / 66.1 · review 77.3

group breakdown

A_B 53.6 (17/24)
A_I 76.1 (11/24)
A_P 58.1 (14/24)
A_R 72.3 (11/24)
BUILD 74.2 (5/24)
CRE 77.6 (8/24)
GEN 73.7 (5/24)
LM_ARENA_REVIEW_PROXY 92.2 (2/24)
OPS_long 58.0 (20/24)
OPS_precision 59.8 (18/24)
OPS_review 60.3 (18/24)
PLAN 86.7 (3/24)

metrics

AI_code 0.0 (18/22)
AI_complexity 2.9 (17/22)
AI_context_awareness 0.0 (16/24)
AI_correctness 94.0 (11/22)
AI_edge_cases 85.9 (15/22)
AI_efficiency 81.8 (12/22)
AI_hallucination_resistance 20.0 (21/24)
AI_memory_retention 0.0 (20/24)
AI_parameter_accuracy 66.7 (17/24)
AI_plan_coherence 7.9 (18/24)
AI_recovery 98.6 (12/22)
AI_refusal 100.0 (11/22)
AI_safety_compliance 66.7 (19/24)
AI_spec 100.0 (11/22)
AI_stability 90.4 (6/22)
AI_task_completion 72.5 (11/24)
AI_tool_selection 52.9 (14/24)
ARC_AGI_2 11.9 (9/17)
ArtificialAnalysisCoding 72.8 (7/21)
ArtificialAnalysisIntelligence 87.5 (4/21)
ArtificialAnalysisReasoning 87.6 (4/21)
ContextWindow 78.4 (14/24)
CopilotArenaOrLMArenaCode 94.7 (4/22)
GDPval 54.7 (8/11)
GPQA_HLE_Reasoning 87.6 (4/21)
IFBench 94.5 (4/21)
InverseCost 87.1 (9/24)
LMArenaCreativeOrOpenEnded 77.6 (8/24)
LMArenaSearchDocument 92.2 (2/19)
LMArenaText 77.6 (8/24)
LongContextRecall 85.3 (7/21)
MCPAtlas 92.5 (2/13)
SWEBenchVerified 77.0 (4/18)
SWEComposite 62.7 (10/24)
SWERebench 72.9 (11/20)
SciCode 94.5 (3/21)
SonarFunctionalSkill 79.2 (12/17)
SonarIssueDensity 92.5 (2/17)
Tau2Bench 100.0 (1/21)
TerminalBench 74.9 (5/22)
sources: aistupidlevel, artificial_analysis, lmarena, openrouter, overrides
missing: BUILD/LiveCodeBench, OPS_long/InverseTTFT, OPS_long/OutputSpeed, OPS_precision/InverseTTFT, OPS_precision/OutputSpeed, OPS_review/InverseTTFT, OPS_review/OutputSpeed, SWEComposite/SWEBenchMultilingual, SWEComposite/SWEBenchPro
claude-opus-4.5 (anthropic) · raw / adjusted: idea 73.9 / 73.8 · plan 67.6 / 67.5 · build 65.8 / 65.5 · review 62.9

group breakdown

A_B 59.0 (4/24)
A_I 78.0 (6/24)
A_P 68.6 (2/24)
A_R 74.1 (10/24)
BUILD 70.6 (6/24)
CRE 73.6 (11/24)
GEN 70.5 (6/24)
LM_ARENA_REVIEW_PROXY 10.8 (21/24)
OPS_long 76.4 (16/24)
OPS_precision 74.6 (14/24)
OPS_review 73.7 (14/24)
PLAN 65.8 (8/24)

metrics

AI_canary_health 88.5 (2/7)
AI_code 18.3 (4/22)
AI_complexity 2.9 (9/22)
AI_context_awareness 55.0 (3/24)
AI_correctness 94.0 (5/22)
AI_edge_cases 85.9 (9/22)
AI_efficiency 93.3 (4/22)
AI_hallucination_resistance 20.0 (19/24)
AI_memory_retention 10.2 (11/24)
AI_parameter_accuracy 100.0 (1/24)
AI_plan_coherence 10.7 (17/24)
AI_recovery 98.6 (6/22)
AI_refusal 100.0 (2/22)
AI_safety_compliance 72.9 (18/24)
AI_spec 100.0 (2/22)
AI_stability 90.4 (3/22)
AI_task_completion 65.3 (14/24)
AI_tool_selection 100.0 (1/24)
ArtificialAnalysisCoding 75.1 (6/21)
ArtificialAnalysisIntelligence 71.5 (7/21)
ArtificialAnalysisReasoning 63.7 (9/21)
ContextWindow 74.7 (21/24)
CopilotArenaOrLMArenaCode 76.5 (6/22)
GDPval 73.8 (6/11)
GPQA_HLE_Reasoning 63.7 (9/21)
IFBench 44.9 (11/21)
InverseCost 61.9 (20/24)
InverseTTFT 74.5 (14/19)
LMArenaCreativeOrOpenEnded 73.6 (11/24)
LMArenaSearchDocument 10.8 (16/19)
LMArenaText 73.6 (11/24)
LongContextRecall 100.0 (1/21)
OutputSpeed 80.3 (11/19)
SWEBenchPro 60.5 (5/14)
SWEBenchVerified 65.2 (9/18)
SWEComposite 65.6 (6/24)
SWERebench 76.3 (8/20)
SciCode 72.7 (7/21)
SonarFunctionalSkill 100.0 (1/17)
SonarIssueDensity 63.6 (3/17)
Tau2Bench 85.2 (8/21)
TerminalBench 54.9 (11/22)
sources: aistupidlevel, artificial_analysis, lmarena, mcp_atlas, openrouter, sonar, swebench_pro, swerebench, terminal_bench
missing: BUILD/LiveCodeBench, BUILD/MCPAtlas, GEN/ARC_AGI_2, PLAN/MCPAtlas, SWEComposite/SWEBenchMultilingual
gpt-5.4 (openai) · raw / adjusted: idea 72.6 / 72.6 · plan 56.4 / 56.4 · build 65.8 / 65.8 · review 63.8

group breakdown

A_B 56.8 (8/24)
A_I 75.7 (12/24)
A_P 61.4 (12/24)
A_R 79.1 (6/24)
BUILD 68.4 (7/24)
CRE 77.5 (9/24)
GEN 45.2 (16/24)
LM_ARENA_REVIEW_PROXY 17.1 (20/24)
OPS_long 92.8 (3/24)
OPS_precision 91.2 (2/24)
OPS_review 89.7 (5/24)
PLAN 50.7 (15/24)

metrics

AI_code 0.0 (21/22)
AI_complexity 2.9 (20/22)
AI_context_awareness 0.0 (19/24)
AI_correctness 94.0 (14/22)
AI_edge_cases 85.9 (18/22)
AI_efficiency 81.4 (13/22)
AI_hallucination_resistance 60.0 (13/24)
AI_memory_retention 0.0 (23/24)
AI_parameter_accuracy 91.0 (10/24)
AI_plan_coherence 2.4 (22/24)
AI_recovery 98.6 (15/22)
AI_refusal 100.0 (14/22)
AI_safety_compliance 100.0 (9/24)
AI_spec 100.0 (14/22)
AI_stability 90.4 (7/22)
AI_task_completion 87.0 (8/24)
AI_tool_selection 83.7 (7/24)
ARC_AGI_2 76.9 (5/17)
ArtificialAnalysisCoding 33.7 (15/21)
ArtificialAnalysisIntelligence 27.4 (16/21)
ArtificialAnalysisReasoning 15.5 (18/21)
ContextWindow 100.0 (1/24)
CopilotArenaOrLMArenaCode 67.4 (11/22)
GPQA_HLE_Reasoning 15.5 (18/21)
IFBench 62.5 (9/21)
InverseCost 75.0 (15/24)
InverseTTFT 90.4 (6/19)
LMArenaCreativeOrOpenEnded 77.5 (9/24)
LMArenaSearchDocument 17.1 (15/19)
LMArenaText 77.5 (9/24)
LongContextRecall 24.5 (18/21)
MCPAtlas 68.2 (4/13)
OutputSpeed 95.0 (3/19)
SWEBenchPro 88.4 (2/14)
SWEBenchVerified 72.3 (5/18)
SWEComposite 82.0 (2/24)
SWERebench 83.5 (7/20)
SciCode 12.0 (18/21)
SonarFunctionalSkill 82.6 (11/17)
SonarIssueDensity 13.2 (14/17)
Tau2Bench 0.0 (21/21)
TerminalBench 100.0 (1/22)
sources: aistupidlevel, arc_agi, artificial_analysis, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench_pro, terminal_bench
missing: BUILD/GDPval, BUILD/LiveCodeBench, SWEComposite/SWEBenchMultilingual
grok-4-latest (xai) · raw / adjusted: idea 77.3 / 77.3 · plan 53.3 / 53.3 · build 65.0 / 65.0 · review 62.7

group breakdown

A_B 91.4 (2/24)
A_I 92.0 (2/24)
A_P 65.1 (8/24)
A_R 100.0 (1/24)
BUILD 47.4 (15/24)
CRE 76.3 (10/24)
GEN 49.4 (15/24)
LM_ARENA_REVIEW_PROXY 18.6 (19/24)
OPS_long 77.5 (14/24)
OPS_precision 77.5 (12/24)
OPS_review 77.3 (12/24)
PLAN 38.0 (18/24)

metrics

AI_code 100.0 (2/22)
AI_complexity 100.0 (2/22)
AI_context_awareness 0.0 (21/24)
AI_correctness 100.0 (2/22)
AI_edge_cases 100.0 (3/22)
AI_efficiency 0.0 (22/22)
AI_hallucination_resistance 100.0 (2/24)
AI_memory_retention 85.5 (5/24)
AI_parameter_accuracy 0.0 (21/24)
AI_plan_coherence 100.0 (1/24)
AI_recovery 100.0 (3/22)
AI_refusal 100.0 (16/22)
AI_safety_compliance 0.0 (21/24)
AI_spec 100.0 (16/22)
AI_stability 100.0 (2/22)
AI_task_completion 0.0 (21/24)
AI_tool_selection 0.0 (21/24)
ARC_AGI_2 21.0 (8/17)
ArtificialAnalysisCoding 51.5 (10/21)
ArtificialAnalysisIntelligence 40.3 (14/21)
ArtificialAnalysisReasoning 57.0 (10/21)
ContextWindow 78.4 (15/24)
CopilotArenaOrLMArenaCode 57.0 (15/22)
GPQA_HLE_Reasoning 57.0 (10/21)
IFBench 33.1 (15/21)
InverseCost 74.4 (19/24)
InverseTTFT 78.5 (10/19)
LMArenaCreativeOrOpenEnded 76.3 (10/24)
LMArenaSearchDocument 18.6 (14/19)
LMArenaText 76.3 (10/24)
LongContextRecall 77.0 (8/21)
OutputSpeed 77.4 (14/19)
SWEComposite 47.7 (19/24)
SWERebench 38.3 (16/20)
SciCode 51.9 (10/21)
Tau2Bench 51.5 (14/21)
TerminalBench 11.6 (19/22)
sources: aistupidlevel, arc_agi, artificial_analysis, lmarena, openrouter, swerebench, terminal_bench
missing: BUILD/GDPval, BUILD/LiveCodeBench, BUILD/MCPAtlas, BUILD/SonarFunctionalSkill, BUILD/SonarIssueDensity, PLAN/MCPAtlas, SWEComposite/SWEBenchMultilingual, SWEComposite/SWEBenchPro, SWEComposite/SWEBenchVerified
gemini-3.1-pro-preview (google) · raw / adjusted: idea 79.5 / 79.5 · plan 78.7 / 78.7 · build 65.0 / 65.0 · review 74.1

group breakdown

A_B 45.7 (21/24)
A_I 45.6 (21/24)
A_P 50.8 (20/24)
A_R 56.5 (22/24)
BUILD 74.9 (3/24)
CRE 100.0 (2/24)
GEN 100.0 (1/24)
LM_ARENA_REVIEW_PROXY 89.8 (4/24)
OPS_long 82.0 (8/24)
OPS_precision 73.5 (16/24)
OPS_review 69.7 (16/24)
PLAN 94.1 (2/24)

metrics

AI_code 11.4 (8/22)
AI_complexity 7.5 (7/22)
AI_context_awareness 14.6 (8/24)
AI_correctness 37.9 (19/22)
AI_edge_cases 92.5 (7/22)
AI_efficiency 81.3 (16/22)
AI_hallucination_resistance 92.5 (8/24)
AI_memory_retention 92.5 (4/24)
AI_parameter_accuracy 92.5 (7/24)
AI_plan_coherence 79.0 (8/24)
AI_recovery 92.5 (19/22)
AI_refusal 7.5 (21/22)
AI_safety_compliance 92.5 (13/24)
AI_spec 7.5 (21/22)
AI_stability 53.0 (20/22)
AI_task_completion 61.3 (18/24)
AI_tool_selection 10.3 (19/24)
ARC_AGI_2 100.0 (1/17)
ArtificialAnalysisCoding 100.0 (1/21)
ArtificialAnalysisIntelligence 100.0 (2/21)
ArtificialAnalysisReasoning 100.0 (1/21)
ContextWindow 100.0 (6/24)
CopilotArenaOrLMArenaCode 73.0 (7/22)
GDPval 23.8 (9/11)
GPQA_HLE_Reasoning 100.0 (1/21)
IFBench 97.5 (3/21)
InverseCost 77.3 (13/24)
InverseTTFT 40.9 (18/19)
LMArenaCreativeOrOpenEnded 100.0 (2/24)
LMArenaSearchDocument 89.8 (4/19)
LMArenaText 100.0 (2/24)
LongContextRecall 100.0 (2/21)
MCPAtlas 66.8 (6/13)
OutputSpeed 93.0 (5/19)
SWEBenchPro 61.1 (4/14)
SWEBenchVerified 78.0 (3/18)
SWEComposite 75.4 (4/24)
SWERebench 100.0 (2/20)
SciCode 100.0 (2/21)
SonarFunctionalSkill 86.3 (8/17)
SonarIssueDensity 18.7 (13/17)
Tau2Bench 99.3 (4/21)
TerminalBench 89.9 (3/22)
sources: arc_agi, artificial_analysis, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench_pro, swerebench, terminal_bench
missing: BUILD/LiveCodeBench, SWEComposite/SWEBenchMultilingual
glm-5.1 (zai) · raw / adjusted: idea 74.3 / 74.3 · plan 63.1 / 63.1 · build 63.8 / 63.8 · review 70.7

group breakdown

A_B 52.3 (18/24)
A_I 66.7 (16/24)
A_P 55.2 (16/24)
A_R 64.2 (16/24)
BUILD 67.7 (9/24)
CRE 86.7 (5/24)
GEN 57.6 (11/24)
LM_ARENA_REVIEW_PROXY 85.9 (5/24)
OPS_long 81.9 (9/24)
OPS_precision 86.5 (7/24)
OPS_review 88.7 (7/24)
PLAN 69.5 (7/24)

metrics

AI_code 7.5 (9/22)
AI_complexity 9.9 (4/22)
AI_context_awareness 7.5 (11/24)
AI_correctness 87.4 (16/22)
AI_edge_cases 80.5 (20/22)
AI_efficiency 77.1 (18/22)
AI_hallucination_resistance 24.5 (18/24)
AI_memory_retention 7.5 (12/24)
AI_parameter_accuracy 64.2 (18/24)
AI_plan_coherence 14.3 (16/24)
AI_recovery 91.3 (20/22)
AI_refusal 92.5 (18/22)
AI_safety_compliance 64.2 (20/24)
AI_spec 92.5 (18/22)
AI_stability 84.4 (12/22)
AI_task_completion 69.1 (12/24)
AI_tool_selection 52.5 (15/24)
ARC_AGI_2 5.2 (11/17)
ArtificialAnalysisCoding 39.5 (13/21)
ArtificialAnalysisIntelligence 60.5 (8/21)
ArtificialAnalysisReasoning 54.0 (13/21)
ContextWindow 74.9 (19/24)
CopilotArenaOrLMArenaCode 96.2 (3/22)
GDPval 63.0 (7/11)
GPQA_HLE_Reasoning 54.0 (13/21)
IFBench 86.8 (5/21)
InverseCost 93.0 (6/24)
InverseTTFT 100.0 (1/19)
LMArenaCreativeOrOpenEnded 86.7 (5/24)
LMArenaSearchDocument 85.9 (5/19)
LMArenaText 86.7 (5/24)
LongContextRecall 41.2 (17/21)
MCPAtlas 100.0 (1/13)
OutputSpeed 75.2 (17/19)
SWEBenchMultilingual 50.9 (3/6)
SWEBenchVerified 60.5 (11/18)
SWEComposite 63.2 (9/24)
SWERebench 100.0 (1/20)
SciCode 40.4 (14/21)
SonarFunctionalSkill 84.3 (9/17)
SonarIssueDensity 100.0 (1/17)
Tau2Bench 100.0 (3/21)
TerminalBench 56.0 (10/22)
sources: arc_agi, artificial_analysis, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swerebench, terminal_bench
missing: BUILD/LiveCodeBench, SWEComposite/SWEBenchPro
claude-sonnet-4.6 (anthropic) · raw / adjusted: idea 73.0 / 72.9 · plan 61.7 / 61.6 · build 61.6 / 61.4 · review 58.9

group breakdown

A_B 53.8 (15/24)
A_I 78.0 (7/24)
A_P 65.9 (7/24)
A_R 69.1 (14/24)
BUILD 67.8 (8/24)
CRE 73.1 (13/24)
GEN 65.5 (7/24)
LM_ARENA_REVIEW_PROXY 22.6 (15/24)
OPS_long 66.7 (18/24)
OPS_precision 54.3 (21/24)
OPS_review 49.0 (23/24)
PLAN 56.6 (12/24)

metrics

AI_code 6.1 (10/22)
AI_complexity 2.9 (14/22)
AI_context_awareness 15.7 (4/24)
AI_correctness 94.0 (10/22)
AI_edge_cases 85.9 (14/22)
AI_efficiency 91.3 (5/22)
AI_hallucination_resistance 0.0 (24/24)
AI_memory_retention 0.0 (17/24)
AI_parameter_accuracy 94.4 (4/24)
AI_plan_coherence 24.7 (11/24)
AI_recovery 98.6 (11/22)
AI_refusal 100.0 (7/22)
AI_safety_compliance 100.0 (4/24)
AI_spec 100.0 (7/22)
AI_stability 86.8 (11/22)
AI_task_completion 67.9 (13/24)
AI_tool_selection 94.9 (3/24)
ARC_AGI_2 10.6 (10/17)
ArtificialAnalysisCoding 85.1 (4/21)
ArtificialAnalysisIntelligence 79.1 (6/21)
ArtificialAnalysisReasoning 68.7 (8/21)
ContextWindow 99.3 (11/24)
CopilotArenaOrLMArenaCode 93.4 (5/22)
GDPval 82.5 (4/11)
GPQA_HLE_Reasoning 68.7 (8/21)
IFBench 41.0 (13/21)
InverseCost 74.4 (18/24)
InverseTTFT 0.0 (19/19)
LMArenaCreativeOrOpenEnded 73.1 (13/24)
LMArenaSearchDocument 22.6 (10/19)
LMArenaText 73.1 (13/24)
LongContextRecall 90.2 (5/21)
MCPAtlas 65.1 (7/13)
OutputSpeed 80.7 (10/19)
SWEBenchPro 53.8 (7/14)
SWEBenchVerified 63.4 (10/18)
SWEComposite 66.4 (5/24)
SWERebench 95.8 (3/20)
SciCode 57.9 (8/21)
SonarFunctionalSkill 92.9 (4/17)
SonarIssueDensity 24.3 (10/17)
Tau2Bench 53.3 (12/21)
TerminalBench 47.5 (14/22)
sources: aistupidlevel, arc_agi, artificial_analysis, lmarena, mcp_atlas, openrouter, sonar, swerebench
missing: BUILD/LiveCodeBench, SWEComposite/SWEBenchMultilingual
gpt-5.3-codex (openai) · raw / adjusted: idea 69.3 / 69.3 · plan 57.1 / 57.1 · build 57.8 / 57.8 · review 70.4

group breakdown

A_B 57.2 (6/24)
A_I 73.9 (15/24)
A_P 56.3 (15/24)
A_R 81.5 (5/24)
BUILD 58.0 (11/24)
CRE 73.4 (12/24)
GEN 55.8 (12/24)
LM_ARENA_REVIEW_PROXY 91.6 (3/24)
OPS_long 58.0 (21/24)
OPS_precision 59.3 (20/24)
OPS_review 58.8 (21/24)
PLAN 58.4 (10/24)

metrics

AI_code 0.0 (20/22)
AI_complexity 2.9 (19/22)
AI_context_awareness 0.0 (18/24)
AI_correctness 94.0 (13/22)
AI_edge_cases 85.9 (17/22)
AI_efficiency 76.1 (19/22)
AI_hallucination_resistance 80.0 (11/24)
AI_memory_retention 0.0 (22/24)
AI_parameter_accuracy 85.9 (11/24)
AI_plan_coherence 2.4 (21/24)
AI_recovery 98.6 (14/22)
AI_refusal 100.0 (13/22)
AI_safety_compliance 88.9 (15/24)
AI_spec 100.0 (13/22)
AI_stability 81.1 (15/22)
AI_task_completion 58.0 (19/24)
AI_tool_selection 66.4 (12/24)
ContextWindow 85.3 (13/24)
CopilotArenaOrLMArenaCode 58.4 (14/22)
InverseCost 76.6 (14/24)
LMArenaCreativeOrOpenEnded 73.4 (12/24)
LMArenaSearchDocument 91.6 (3/19)
LMArenaText 73.4 (12/24)
SWEBenchVerified 68.9 (6/18)
SWEComposite 63.6 (7/24)
SWERebench 89.4 (5/20)
TerminalBench 74.6 (6/22)
sources: aistupidlevel, artificial_analysis, lmarena, openrouter, overrides, sonar, swerebench, terminal_bench
missing: BUILD/ArtificialAnalysisCoding, BUILD/GDPval, BUILD/LiveCodeBench, BUILD/LongContextRecall, BUILD/MCPAtlas, BUILD/SciCode, BUILD/SonarFunctionalSkill, BUILD/SonarIssueDensity, GEN/ARC_AGI_2, GEN/ArtificialAnalysisIntelligence, GEN/GPQA_HLE_Reasoning, OPS_long/InverseTTFT, OPS_long/OutputSpeed, OPS_precision/InverseTTFT, OPS_precision/OutputSpeed, OPS_review/InverseTTFT, OPS_review/OutputSpeed, PLAN/ArtificialAnalysisReasoning, PLAN/IFBench, PLAN/LongContextRecall, PLAN/MCPAtlas, PLAN/Tau2Bench, SWEComposite/SWEBenchMultilingual, SWEComposite/SWEBenchPro
gemini-2.5-flash (google) · raw / adjusted: idea 60.5 / 60.5 · plan 38.7 / 38.7 · build 54.3 / 54.3 · review 58.2

group breakdown

A_B 94.9 (1/24)
A_I 92.0 (1/24)
A_P 66.9 (3/24)
A_R 94.9 (2/24)
BUILD 24.7 (23/24)
CRE 45.5 (19/24)
GEN 15.0 (20/24)
LM_ARENA_REVIEW_PROXY 76.9 (7/24)
OPS_long 94.6 (2/24)
OPS_precision 90.7 (3/24)
OPS_review 89.2 (6/24)
PLAN 13.4 (23/24)

metrics

AI_code 100.0 (1/22)
AI_complexity 100.0 (1/22)
AI_context_awareness 100.0 (2/24)
AI_correctness 100.0 (1/22)
AI_edge_cases 100.0 (1/22)
AI_efficiency 100.0 (2/22)
AI_hallucination_resistance 69.9 (12/24)
AI_memory_retention 31.9 (9/24)
AI_parameter_accuracy 73.3 (14/24)
AI_plan_coherence 0.0 (23/24)
AI_recovery 100.0 (1/22)
AI_refusal 100.0 (9/22)
AI_safety_compliance 100.0 (6/24)
AI_spec 100.0 (9/22)
AI_stability 100.0 (1/22)
AI_task_completion 44.3 (20/24)
AI_tool_selection 27.5 (16/24)
ARC_AGI_2 0.7 (15/17)
ArtificialAnalysisCoding 0.0 (20/21)
ArtificialAnalysisIntelligence 0.8 (19/21)
ArtificialAnalysisReasoning 17.9 (16/21)
ContextWindow 100.0 (3/24)
CopilotArenaOrLMArenaCode 64.8 (13/22)
GDPval 11.8 (10/11)
GPQA_HLE_Reasoning 17.9 (16/21)
IFBench 29.2 (17/21)
InverseCost 94.4 (5/24)
InverseTTFT 75.8 (12/19)
LMArenaCreativeOrOpenEnded 45.5 (19/24)
LMArenaSearchDocument 76.9 (7/19)
LMArenaText 45.5 (19/24)
LiveCodeBench 100.0 (1/2)
LongContextRecall 58.8 (13/21)
MCPAtlas 26.4 (8/13)
OutputSpeed 100.0 (1/19)
SWEBenchPro 33.8 (11/14)
SWEBenchVerified 0.0 (18/18)
SWEComposite 15.0 (24/24)
SWERebench 0.0 (20/20)
SciCode 23.5 (16/21)
Tau2Bench 0.0 (20/21)
TerminalBench 0.0 (21/22)
sources: aistupidlevel, arc_agi, artificial_analysis, livecodebench, lmarena, openrouter, swebench, swerebench, terminal_bench
missing: BUILD/SonarFunctionalSkill, BUILD/SonarIssueDensity, SWEComposite/SWEBenchMultilingual
gemini-3-flash (google) · raw / adjusted: idea 68.5 / 68.5 · plan 61.3 / 61.3 · build 53.0 / 53.0 · review 55.5

group breakdown

A_B 45.7 (20/24)
A_I 45.6 (20/24)
A_P 50.8 (19/24)
A_R 56.5 (21/24)
BUILD 51.7 (12/24)
CRE 86.2 (6/24)
GEN 61.6 (9/24)
LM_ARENA_REVIEW_PROXY 19.5 (17/24)
OPS_long 94.9 (1/24)
OPS_precision 91.6 (1/24)
OPS_review 90.2 (3/24)
PLAN 64.7 (9/24)

metrics

AI_code 11.4 (7/22)
AI_complexity 7.5 (6/22)
AI_context_awareness 14.6 (7/24)
AI_correctness 37.9 (18/22)
AI_edge_cases 92.5 (6/22)
AI_efficiency 81.3 (15/22)
AI_hallucination_resistance 92.5 (7/24)
AI_memory_retention 92.5 (3/24)
AI_parameter_accuracy 92.5 (6/24)
AI_plan_coherence 79.0 (7/24)
AI_recovery 92.5 (18/22)
AI_refusal 7.5 (20/22)
AI_safety_compliance 92.5 (12/24)
AI_spec 7.5 (20/22)
AI_stability 53.0 (19/22)
AI_task_completion 61.3 (17/24)
AI_tool_selection 10.3 (18/24)
ARC_AGI_2 3.0 (14/17)
ArtificialAnalysisCoding 58.3 (9/21)
ArtificialAnalysisIntelligence 58.9 (11/21)
ArtificialAnalysisReasoning 82.7 (6/21)
ContextWindow 100.0 (5/24)
CopilotArenaOrLMArenaCode 67.4 (12/22)
GDPval 5.0 (11/11)
GPQA_HLE_Reasoning 82.7 (6/21)
IFBench 100.0 (2/21)
InverseCost 91.5 (8/24)
InverseTTFT 80.2 (9/19)
LMArenaCreativeOrOpenEnded 86.2 (6/24)
LMArenaSearchDocument 19.5 (12/19)
LMArenaText 86.2 (6/24)
LongContextRecall 68.6 (9/21)
MCPAtlas 22.3 (9/13)
OutputSpeed 99.4 (2/19)
SWEBenchMultilingual 100.0 (1/6)
SWEBenchPro 31.0 (12/14)
SWEBenchVerified 68.4 (7/18)
SWEComposite 58.1 (11/24)
SWERebench 76.1 (9/20)
SciCode 78.7 (6/21)
SonarFunctionalSkill 86.3 (7/17)
SonarIssueDensity 18.7 (12/17)
Tau2Bench 64.2 (10/21)
TerminalBench 48.4 (12/22)
sources: arc_agi, artificial_analysis, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swebench_pro, swerebench, terminal_bench
missing: BUILD/LiveCodeBench
gemini-3-pro (google) · raw / adjusted: idea 66.9 / 66.9 · plan 54.2 / 54.2 · build 52.4 / 52.4 · review 53.1

group breakdown

A_B 43.3 (22/24)
A_I 43.1 (22/24)
A_P 51.3 (17/24)
A_R 60.3 (17/24)
BUILD 58.9 (10/24)
CRE 95.0 (3/24)
GEN 60.1 (10/24)
LM_ARENA_REVIEW_PROXY 19.3 (18/24)
OPS_long 45.2 (23/24)
OPS_precision 46.6 (23/24)
OPS_review 50.5 (22/24)
PLAN 55.1 (13/24)

metrics

AI_code 4.6 (13/22)
AI_complexity 0.0 (21/22)
AI_context_awareness 8.3 (10/24)
AI_correctness 35.8 (20/22)
AI_edge_cases 100.0 (2/22)
AI_efficiency 86.8 (7/22)
AI_hallucination_resistance 100.0 (1/24)
AI_memory_retention 100.0 (1/24)
AI_parameter_accuracy 100.0 (2/24)
AI_plan_coherence 84.1 (5/24)
AI_recovery 100.0 (2/22)
AI_refusal 0.0 (22/22)
AI_safety_compliance 100.0 (7/24)
AI_spec 0.0 (22/22)
AI_stability 53.6 (17/22)
AI_task_completion 63.3 (15/24)
AI_tool_selection 3.3 (20/24)
ARC_AGI_2 42.4 (6/17)
ContextWindow 0.0 (24/24)
CopilotArenaOrLMArenaCode 67.8 (10/22)
InverseCost 77.3 (12/24)
LMArenaCreativeOrOpenEnded 95.0 (3/24)
LMArenaSearchDocument 19.3 (13/19)
LMArenaText 95.0 (3/24)
MCPAtlas 69.7 (3/13)
SWEBenchMultilingual 33.5 (4/6)
SWEBenchPro 53.7 (8/14)
SWEBenchVerified 52.1 (13/18)
SWEComposite 54.5 (12/24)
SWERebench 70.3 (13/20)
SonarFunctionalSkill 92.7 (5/17)
SonarIssueDensity 13.1 (15/17)
TerminalBench 61.4 (8/22)
sources: aistupidlevel, arc_agi, artificial_analysis, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swebench_pro, swerebench, terminal_bench
missing: BUILD/ArtificialAnalysisCoding, BUILD/GDPval, BUILD/LiveCodeBench, BUILD/LongContextRecall, BUILD/SciCode, GEN/ArtificialAnalysisIntelligence, GEN/GPQA_HLE_Reasoning, OPS_long/InverseTTFT, OPS_long/OutputSpeed, OPS_precision/InverseTTFT, OPS_precision/OutputSpeed, OPS_review/InverseTTFT, OPS_review/OutputSpeed, PLAN/ArtificialAnalysisReasoning, PLAN/IFBench, PLAN/LongContextRecall, PLAN/Tau2Bench
grok-code-fast-1 (xai) · raw / adjusted: idea 59.6 / 59.6 · plan 37.8 / 37.8 · build 52.2 / 52.2 · review 55.1

group breakdown

A_B 74.4 (3/24)
A_I 87.3 (3/24)
A_P 66.2 (5/24)
A_R 90.2 (3/24)
BUILD 34.2 (21/24)
CRE 47.8 (18/24)
GEN 15.8 (19/24)
LM_ARENA_REVIEW_PROXY 50.0 (10/24)
OPS_long 90.2 (4/24)
OPS_precision 89.1 (6/24)
OPS_review 89.7 (4/24)
PLAN 11.4 (24/24)

metrics

AI_code 30.5 (3/22)
AI_complexity 46.7 (3/22)
AI_context_awareness 0.0 (22/24)
AI_correctness 100.0 (3/22)
AI_edge_cases 95.6 (4/22)
AI_efficiency 72.3 (20/22)
AI_hallucination_resistance 100.0 (3/24)
AI_memory_retention 85.5 (6/24)
AI_parameter_accuracy 0.0 (22/24)
AI_plan_coherence 100.0 (2/24)
AI_recovery 100.0 (4/22)
AI_refusal 100.0 (17/22)
AI_safety_compliance 0.0 (22/24)
AI_spec 100.0 (17/22)
AI_stability 77.3 (16/22)
AI_task_completion 0.0 (22/24)
AI_tool_selection 0.0 (22/24)
ARC_AGI_2 25.3 (7/17)
ArtificialAnalysisCoding 0.0 (21/21)
ArtificialAnalysisIntelligence 0.0 (21/21)
ArtificialAnalysisReasoning 0.0 (21/21)
ContextWindow 78.4 (16/24)
CopilotArenaOrLMArenaCode 0.0 (22/22)
GPQA_HLE_Reasoning 0.0 (21/21)
IFBench 0.0 (21/21)
InverseCost 99.3 (2/24)
InverseTTFT 84.8 (7/19)
LMArenaCreativeOrOpenEnded 47.8 (18/24)
LMArenaText 47.8 (18/24)
LongContextRecall 0.0 (21/21)
OutputSpeed 93.8 (4/19)
SWEComposite 45.4 (20/24)
SWERebench 27.0 (18/20)
SciCode 0.0 (21/21)
Tau2Bench 53.3 (13/21)
TerminalBench 0.0 (22/22)
sources: aistupidlevel, artificial_analysis, lmarena, openrouter, swerebench, terminal_bench
missing: BUILD/GDPval, BUILD/LiveCodeBench, BUILD/MCPAtlas, BUILD/SonarFunctionalSkill, BUILD/SonarIssueDensity, LM_ARENA_REVIEW_PROXY/LMArenaSearchDocument, PLAN/MCPAtlas, SWEComposite/SWEBenchMultilingual, SWEComposite/SWEBenchPro, SWEComposite/SWEBenchVerified
claude-sonnet-4.5 (anthropic) · raw / adjusted: idea 64.7 / 64.7 · plan 55.5 / 55.3 · build 51.0 / 50.8 · review 51.4

group breakdown

A_B 56.7 (9/24)
A_I 78.1 (5/24)
A_P 76.4 (1/24)
A_R 75.9 (8/24)
BUILD 46.8 (16/24)
CRE 64.0 (15/24)
GEN 42.2 (17/24)
LM_ARENA_REVIEW_PROXY 1.7 (22/24)
OPS_long 79.6 (12/24)
OPS_precision 79.5 (11/24)
OPS_review 78.3 (9/24)
PLAN 42.4 (17/24)

metrics

AI_canary_health 78.1 (7/7)
AI_code 6.1 (11/22)
AI_complexity 2.9 (13/22)
AI_context_awareness 100.0 (1/24)
AI_correctness 94.0 (9/22)
AI_edge_cases 85.9 (13/22)
AI_efficiency 87.6 (6/22)
AI_hallucination_resistance 40.0 (16/24)
AI_memory_retention 0.0 (16/24)
AI_parameter_accuracy 61.3 (19/24)
AI_plan_coherence 30.3 (9/24)
AI_recovery 98.6 (10/22)
AI_refusal 100.0 (6/22)
AI_safety_compliance 73.6 (17/24)
AI_spec 100.0 (6/22)
AI_stability 86.8 (10/22)
AI_task_completion 100.0 (2/24)
AI_tool_selection 83.1 (8/24)
ARC_AGI_2 3.6 (12/17)
ArtificialAnalysisCoding 45.3 (12/21)
ArtificialAnalysisIntelligence 46.0 (12/21)
ArtificialAnalysisReasoning 35.3 (15/21)
ContextWindow 99.3 (10/24)
CopilotArenaOrLMArenaCode 52.3 (16/22)
GDPval 88.3 (2/11)
GPQA_HLE_Reasoning 35.3 (15/21)
IFBench 43.0 (12/21)
InverseCost 74.4 (17/24)
InverseTTFT 76.3 (11/19)
LMArenaCreativeOrOpenEnded 64.0 (15/24)
LMArenaSearchDocument 1.7 (17/19)
LMArenaText 64.0 (15/24)
LongContextRecall 65.7 (11/21)
MCPAtlas 8.0 (11/13)
OutputSpeed 76.5 (16/19)
SWEBenchMultilingual 3.9 (5/6)
SWEBenchPro 54.5 (6/14)
SWEBenchVerified 54.7 (12/18)
SWEComposite 53.5 (14/24)
SWERebench 74.7 (10/20)
SciCode 46.4 (13/21)
SonarFunctionalSkill 53.6 (15/17)
SonarIssueDensity 29.8 (9/17)
Tau2Bench 58.9 (11/21)
TerminalBench 37.3 (15/22)
sources: aistupidlevel, arc_agi, artificial_analysis, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swebench_pro, swerebench, terminal_bench
missing: BUILD/LiveCodeBench
glm-4.7 (zai) · raw / adjusted: idea 35.6 / 35.6 · plan 49.8 / 49.8 · build 50.8 / 50.8 · review 55.3

group breakdown

A_B 55.4 (11/24)
A_I 54.0 (18/24)
A_P 46.1 (22/24)
A_R 58.5 (19/24)
BUILD 42.0 (18/24)
CRE 9.2 (22/24)
GEN 35.6 (18/24)
LM_ARENA_REVIEW_PROXY 50.0 (12/24)
OPS_long 87.6 (5/24)
OPS_precision 90.1 (4/24)
OPS_review 91.9 (1/24)
PLAN 53.6 (14/24)

metrics

AI_context_awareness 0.0 (24/24)
AI_hallucination_resistance 100.0 (5/24)
AI_memory_retention 85.5 (8/24)
AI_parameter_accuracy 0.0 (24/24)
AI_plan_coherence 100.0 (4/24)
AI_safety_compliance 0.0 (24/24)
AI_task_completion 0.0 (24/24)
AI_tool_selection 0.0 (24/24)
ArtificialAnalysisCoding 37.9 (14/21)
ArtificialAnalysisIntelligence 42.6 (13/21)
ArtificialAnalysisReasoning 55.8 (12/21)
ContextWindow 74.9 (18/24)
CopilotArenaOrLMArenaCode 68.2 (9/22)
GPQA_HLE_Reasoning 55.8 (12/21)
IFBench 72.2 (7/21)
InverseCost 96.1 (3/24)
InverseTTFT 99.0 (2/19)
LMArenaCreativeOrOpenEnded 9.2 (22/24)
LMArenaText 9.2 (22/24)
LongContextRecall 57.4 (14/21)
MCPAtlas 0.0 (13/13)
OutputSpeed 85.3 (7/19)
SWEComposite 54.1 (13/24)
SWERebench 70.6 (12/20)
SciCode 48.6 (11/21)
SonarFunctionalSkill 31.3 (16/17)
SonarIssueDensity 44.7 (5/17)
Tau2Bench 100.0 (2/21)
TerminalBench 27.0 (17/22)
sources: aistupidlevel, artificial_analysis, lmarena, mcp_atlas, openrouter, sonar, swerebench, terminal_bench
missing: A_B/AI_code, A_B/AI_complexity, A_B/AI_correctness, A_B/AI_edge_cases, A_B/AI_efficiency, A_B/AI_recovery, A_B/AI_spec, A_B/AI_stability, A_I/AI_code, A_I/AI_complexity, A_I/AI_correctness, A_I/AI_edge_cases, A_I/AI_efficiency, A_I/AI_recovery, A_I/AI_refusal, A_I/AI_spec, A_I/AI_stability, A_P/AI_correctness, A_P/AI_efficiency, A_P/AI_recovery, A_P/AI_spec, A_P/AI_stability, A_R/AI_code, A_R/AI_correctness, A_R/AI_edge_cases, A_R/AI_recovery, A_R/AI_spec, A_R/AI_stability, BUILD/GDPval, BUILD/LiveCodeBench, GEN/ARC_AGI_2, LM_ARENA_REVIEW_PROXY/LMArenaSearchDocument, SWEComposite/SWEBenchMultilingual, SWEComposite/SWEBenchPro, SWEComposite/SWEBenchVerified
gpt-5.2 (openai) · raw / adjusted: idea 66.8 / 66.8 · plan 57.0 / 57.0 · build 49.1 / 49.1 · review 57.9

group breakdown

A_B 57.9 (5/24)
A_I 74.6 (14/24)
A_P 59.7 (13/24)
A_R 81.7 (4/24)
BUILD 41.7 (19/24)
CRE 67.9 (14/24)
GEN 52.2 (13/24)
LM_ARENA_REVIEW_PROXY 20.6 (16/24)
OPS_long 58.3 (19/24)
OPS_precision 59.8 (19/24)
OPS_review 59.5 (20/24)
PLAN 56.6 (11/24)

metrics

AI_code 0.0 (19/22)
AI_complexity 2.9 (18/22)
AI_context_awareness 0.0 (17/24)
AI_correctness 94.0 (12/22)
AI_edge_cases 85.9 (16/22)
AI_efficiency 83.2 (10/22)
AI_hallucination_resistance 80.0 (10/24)
AI_memory_retention 0.0 (21/24)
AI_parameter_accuracy 85.2 (12/24)
AI_plan_coherence 0.0 (24/24)
AI_recovery 98.6 (13/22)
AI_refusal 100.0 (12/22)
AI_safety_compliance 100.0 (8/24)
AI_spec 100.0 (12/22)
AI_stability 83.0 (13/22)
AI_task_completion 87.0 (7/24)
AI_tool_selection 76.3 (9/24)
ARC_AGI_2 0.0 (17/17)
ArtificialAnalysisCoding 63.4 (8/21)
ArtificialAnalysisIntelligence 59.7 (9/21)
ArtificialAnalysisReasoning 56.4 (11/21)
ContextWindow 85.3 (12/24)
CopilotArenaOrLMArenaCode 37.2 (20/22)
GPQA_HLE_Reasoning 56.4 (11/21)
IFBench 64.7 (8/21)
InverseCost 80.1 (11/24)
LMArenaCreativeOrOpenEnded 67.9 (14/24)
LMArenaSearchDocument 20.6 (11/19)
LMArenaText 67.9 (14/24)
LongContextRecall 53.9 (15/21)
SWEBenchMultilingual 0.0 (6/6)
SWEBenchPro 18.6 (13/14)
SWEBenchVerified 50.5 (14/18)
SWEComposite 28.2 (21/24)
SciCode 54.6 (9/21)
SonarFunctionalSkill 82.8 (10/17)
SonarIssueDensity 33.9 (8/17)
Tau2Bench 50.1 (15/21)
TerminalBench 58.4 (9/22)
sources: aistupidlevel, arc_agi, artificial_analysis, lmarena, mcp_atlas, openrouter, sonar, swebench, swebench_pro, terminal_bench
missing: BUILD/GDPval, BUILD/LiveCodeBench, BUILD/MCPAtlas, OPS_long/InverseTTFT, OPS_long/OutputSpeed, OPS_precision/InverseTTFT, OPS_precision/OutputSpeed, OPS_review/InverseTTFT, OPS_review/OutputSpeed, PLAN/MCPAtlas, SWEComposite/SWERebench
claude-sonnet-4 (anthropic) · raw / adjusted: idea 35.7 / 35.6 · plan 44.0 / 43.9 · build 48.7 / 48.4 · review 58.4

group breakdown

A_B 53.7 (16/24)
A_I 76.9 (8/24)
A_P 65.9 (6/24)
A_R 71.9 (12/24)
BUILD 41.5 (20/24)
CRE 0.0 (23/24)
GEN 14.0 (22/24)
LM_ARENA_REVIEW_PROXY 84.1 (6/24)
OPS_long 80.0 (11/24)
OPS_precision 79.6 (10/24)
OPS_review 78.2 (10/24)
PLAN 33.4 (19/24)

metrics

AI_code 0.0 (16/22)
AI_complexity 2.9 (12/22)
AI_context_awareness 0.0 (13/24)
AI_correctness 94.0 (8/22)
AI_edge_cases 85.9 (12/22)
AI_efficiency 86.2 (8/22)
AI_hallucination_resistance 20.0 (20/24)
AI_memory_retention 0.0 (15/24)
AI_parameter_accuracy 91.9 (8/24)
AI_plan_coherence 19.1 (15/24)
AI_recovery 98.6 (9/22)
AI_refusal 100.0 (5/22)
AI_safety_compliance 100.0 (3/24)
AI_spec 100.0 (5/22)
AI_stability 86.8 (9/22)
AI_task_completion 99.7 (3/24)
AI_tool_selection 90.0 (4/24)
ARC_AGI_2 0.1 (16/17)
ArtificialAnalysisCoding 30.7 (16/21)
ArtificialAnalysisIntelligence 29.7 (15/21)
ArtificialAnalysisReasoning 8.6 (19/21)
ContextWindow 99.3 (9/24)
CopilotArenaOrLMArenaCode 52.0 (18/22)
GDPval 82.5 (3/11)
GPQA_HLE_Reasoning 8.6 (19/21)
IFBench 35.8 (14/21)
InverseCost 74.4 (16/24)
InverseTTFT 75.4 (13/19)
LMArenaCreativeOrOpenEnded 0.0 (23/24)
LMArenaSearchDocument 84.1 (6/19)
LMArenaText 0.0 (23/24)
LiveCodeBench 0.0 (2/2)
LongContextRecall 60.8 (12/21)
MCPAtlas 14.3 (10/13)
OutputSpeed 77.5 (13/19)
SWEBenchPro 52.1 (10/14)
SWEBenchVerified 39.7 (16/18)
SWEComposite 48.5 (18/24)
SWERebench 54.5 (14/20)
SciCode 20.8 (17/21)
SonarFunctionalSkill 59.0 (14/17)
SonarIssueDensity 34.0 (7/17)
Tau2Bench 27.7 (18/21)
TerminalBench 47.5 (13/22)
sources: aistupidlevel, arc_agi, artificial_analysis, livecodebench, lmarena, openrouter, sonar, swebench, swebench_pro, swerebench
missing: SWEComposite/SWEBenchMultilingual
claude-opus-4.1 (anthropic) · raw / adjusted: idea 58.4 / 58.3 · plan 49.9 / 49.7 · build 48.5 / 48.2 · review 48.5

group breakdown

A_B54.314 / 24A_I75.613 / 24A_P61.711 / 24A_R74.79 / 24BUILD48.414 / 24CRE52.717 / 24GEN50.714 / 24LM_ARENA_REVIEW_PROXY0.023 / 24OPS_long48.722 / 24OPS_precision46.224 / 24OPS_review42.524 / 24PLAN43.016 / 24

metrics

AI_canary_health79.25 / 7AI_code0.014 / 22AI_complexity2.98 / 22AI_context_awareness0.012 / 24AI_correctness94.04 / 22AI_edge_cases85.98 / 22AI_efficiency80.117 / 22AI_hallucination_resistance40.015 / 24AI_memory_retention0.013 / 24AI_parameter_accuracy71.516 / 24AI_plan_coherence19.114 / 24AI_recovery98.65 / 22AI_refusal100.01 / 22AI_safety_compliance100.01 / 24AI_spec100.01 / 22AI_stability81.114 / 22AI_task_completion83.410 / 24AI_tool_selection72.010 / 24ContextWindow74.720 / 24CopilotArenaOrLMArenaCode52.117 / 22InverseCost0.024 / 24LMArenaCreativeOrOpenEnded52.717 / 24LMArenaSearchDocument0.018 / 19LMArenaText52.717 / 24SWEComposite50.315 / 24SWERebench51.715 / 20TerminalBench29.316 / 22
sources: aistupidlevel, lmarena, openrouter, swerebench, terminal_bench
missing: BUILD/ArtificialAnalysisCoding, BUILD/GDPval, BUILD/LiveCodeBench, BUILD/LongContextRecall, BUILD/MCPAtlas, BUILD/SciCode, BUILD/SonarFunctionalSkill, BUILD/SonarIssueDensity, GEN/ARC_AGI_2, GEN/ArtificialAnalysisIntelligence, GEN/GPQA_HLE_Reasoning, OPS_long/InverseTTFT, OPS_long/OutputSpeed, OPS_precision/InverseTTFT, OPS_precision/OutputSpeed, OPS_review/InverseTTFT, OPS_review/OutputSpeed, PLAN/ArtificialAnalysisReasoning, PLAN/IFBench, PLAN/LongContextRecall, PLAN/MCPAtlas, PLAN/Tau2Bench, SWEComposite/SWEBenchMultilingual, SWEComposite/SWEBenchPro, SWEComposite/SWEBenchVerified
deepseek-v4-flash · deepseek · idea 52.1 raw / 52.1 adjusted · plan 58.1 raw / 58.1 adjusted · build 42.9 raw / 42.9 adjusted · review 46.2

group breakdown

A_B · 24.8 · 23 / 24
A_I · 36.4 · 23 / 24
A_P · 40.0 · 23 / 24
A_R · 21.4 · 24 / 24
BUILD · 49.8 · 13 / 24
CRE · 58.7 · 16 / 24
GEN · 62.8 · 8 / 24
LM_ARENA_REVIEW_PROXY · 50.0 · 8 / 24
OPS_long · 86.5 · 6 / 24
OPS_precision · 89.6 · 5 / 24
OPS_review · 91.8 · 2 / 24
PLAN · 71.1 · 6 / 24

metrics

AI_canary_health · 78.4 · 6 / 7
AI_code · 6.1 · 12 / 22
AI_complexity · 2.9 · 15 / 22
AI_context_awareness · 0.0 · 14 / 24
AI_correctness · 0.0 · 21 / 22
AI_edge_cases · 0.0 · 21 / 22
AI_efficiency · 99.3 · 3 / 22
AI_hallucination_resistance · 40.0 · 17 / 24
AI_memory_retention · 0.0 · 18 / 24
AI_parameter_accuracy · 91.6 · 9 / 24
AI_plan_coherence · 24.7 · 12 / 24
AI_recovery · 0.0 · 21 / 22
AI_refusal · 100.0 · 8 / 22
AI_safety_compliance · 100.0 · 5 / 24
AI_spec · 100.0 · 8 / 22
AI_stability · 0.0 · 21 / 22
AI_task_completion · 87.0 · 5 / 24
AI_tool_selection · 87.4 · 5 / 24
ArtificialAnalysisCoding · 45.6 · 11 / 21
ArtificialAnalysisIntelligence · 59.3 · 10 / 21
ArtificialAnalysisReasoning · 76.7 · 7 / 21
ContextWindow · 71.6 · 22 / 24
GPQA_HLE_Reasoning · 76.7 · 7 / 21
IFBench · 100.0 · 1 / 21
InverseCost · 100.0 · 1 / 24
InverseTTFT · 98.9 · 3 / 19
LMArenaCreativeOrOpenEnded · 58.7 · 16 / 24
LMArenaText · 58.7 · 16 / 24
LongContextRecall · 52.5 · 16 / 21
OutputSpeed · 83.6 · 8 / 19
SWEComposite · 50.0 · 16 / 24
SciCode · 47.5 · 12 / 21
Tau2Bench · 97.9 · 5 / 21
sources: aistupidlevel, artificial_analysis, lmarena, openrouter
missing: BUILD/CopilotArenaOrLMArenaCode, BUILD/GDPval, BUILD/LiveCodeBench, BUILD/MCPAtlas, BUILD/SonarFunctionalSkill, BUILD/SonarIssueDensity, BUILD/TerminalBench, GEN/ARC_AGI_2, LM_ARENA_REVIEW_PROXY/LMArenaSearchDocument, PLAN/MCPAtlas, PLAN/TerminalBench, SWEComposite/SWEBenchMultilingual, SWEComposite/SWEBenchPro, SWEComposite/SWEBenchVerified, SWEComposite/SWERebench
gemini-2.5-pro · google · idea 25.0 raw / 25.0 adjusted · plan 34.5 raw / 34.5 adjusted · build 41.1 raw / 41.1 adjusted · review 38.1

group breakdown

A_B · 45.7 · 19 / 24
A_I · 45.6 · 19 / 24
A_P · 50.8 · 18 / 24
A_R · 56.5 · 20 / 24
BUILD · 34.1 · 22 / 24
CRE · 0.0 · 24 / 24
GEN · 14.5 · 21 / 24
LM_ARENA_REVIEW_PROXY · 0.0 · 24 / 24
OPS_long · 82.1 · 7 / 24
OPS_precision · 74.4 · 15 / 24
OPS_review · 71.0 · 15 / 24
PLAN · 21.8 · 21 / 24

metrics

AI_code · 11.4 · 6 / 22
AI_complexity · 7.5 · 5 / 22
AI_context_awareness · 14.6 · 6 / 24
AI_correctness · 37.9 · 17 / 22
AI_edge_cases · 92.5 · 5 / 22
AI_efficiency · 81.3 · 14 / 22
AI_hallucination_resistance · 92.5 · 6 / 24
AI_memory_retention · 92.5 · 2 / 24
AI_parameter_accuracy · 92.5 · 5 / 24
AI_plan_coherence · 79.0 · 6 / 24
AI_recovery · 92.5 · 17 / 22
AI_refusal · 7.5 · 19 / 22
AI_safety_compliance · 92.5 · 11 / 24
AI_spec · 7.5 · 19 / 22
AI_stability · 53.0 · 18 / 22
AI_task_completion · 61.3 · 16 / 24
AI_tool_selection · 10.3 · 17 / 24
ARC_AGI_2 · 3.6 · 13 / 17
ArtificialAnalysisCoding · 23.6 · 17 / 21
ArtificialAnalysisIntelligence · 14.1 · 17 / 21
ArtificialAnalysisReasoning · 44.8 · 14 / 21
ContextWindow · 100.0 · 4 / 24
CopilotArenaOrLMArenaCode · 0.0 · 21 / 22
GPQA_HLE_Reasoning · 44.8 · 14 / 21
IFBench · 19.3 · 18 / 21
InverseCost · 80.1 · 10 / 24
InverseTTFT · 43.9 · 17 / 19
LMArenaCreativeOrOpenEnded · 0.0 · 24 / 24
LMArenaSearchDocument · 0.0 · 19 / 19
LMArenaText · 0.0 · 24 / 24
LongContextRecall · 67.2 · 10 / 21
MCPAtlas · 66.8 · 5 / 13
OutputSpeed · 91.4 · 6 / 19
SWEBenchPro · 53.2 · 9 / 14
SWEBenchVerified · 9.8 · 17 / 18
SWEComposite · 27.0 · 22 / 24
SWERebench · 0.5 · 19 / 20
SciCode · 36.1 · 15 / 21
SonarFunctionalSkill · 86.3 · 6 / 17
SonarIssueDensity · 18.7 · 11 / 17
Tau2Bench · 3.5 · 19 / 21
TerminalBench · 1.4 · 20 / 22
sources: arc_agi, artificial_analysis, lmarena, openrouter, swebench, swerebench, terminal_bench
missing: BUILD/GDPval, BUILD/LiveCodeBench, SWEComposite/SWEBenchMultilingual
glm-4.6 · zai · idea 36.8 raw / 36.8 adjusted · plan 32.0 raw / 32.0 adjusted · build 36.4 raw / 36.4 adjusted · review 41.5

group breakdown

A_B · 55.4 · 10 / 24
A_I · 54.0 · 17 / 24
A_P · 46.1 · 21 / 24
A_R · 58.5 · 18 / 24
BUILD · 18.2 · 24 / 24
CRE · 23.6 · 21 / 24
GEN · 13.4 · 23 / 24
LM_ARENA_REVIEW_PROXY · 50.0 · 11 / 24
OPS_long · 77.0 · 15 / 24
OPS_precision · 83.3 · 8 / 24
OPS_review · 86.0 · 8 / 24
PLAN · 18.0 · 22 / 24

metrics

AI_context_awareness · 0.0 · 23 / 24
AI_hallucination_resistance · 100.0 · 4 / 24
AI_memory_retention · 85.5 · 7 / 24
AI_parameter_accuracy · 0.0 · 23 / 24
AI_plan_coherence · 100.0 · 3 / 24
AI_safety_compliance · 0.0 · 23 / 24
AI_task_completion · 0.0 · 23 / 24
AI_tool_selection · 0.0 · 23 / 24
ArtificialAnalysisCoding · 15.9 · 18 / 21
ArtificialAnalysisIntelligence · 6.1 · 18 / 21
ArtificialAnalysisReasoning · 16.5 · 17 / 21
ContextWindow · 75.0 · 17 / 24
CopilotArenaOrLMArenaCode · 43.0 · 19 / 22
GPQA_HLE_Reasoning · 16.5 · 17 / 21
IFBench · 4.7 · 19 / 21
InverseCost · 95.4 · 4 / 24
InverseTTFT · 98.7 · 4 / 19
LMArenaCreativeOrOpenEnded · 23.6 · 21 / 24
LMArenaText · 23.6 · 21 / 24
LongContextRecall · 9.8 · 19 / 21
MCPAtlas · 7.5 · 12 / 13
OutputSpeed · 66.3 · 18 / 19
SWEBenchPro · 0.0 · 14 / 14
SWEBenchVerified · 48.4 · 15 / 18
SWEComposite · 24.5 · 23 / 24
SWERebench · 37.6 · 17 / 20
SciCode · 12.0 · 19 / 21
SonarFunctionalSkill · 7.5 · 17 / 17
SonarIssueDensity · 7.5 · 16 / 17
Tau2Bench · 41.3 · 17 / 21
TerminalBench · 13.7 · 18 / 22
sources: aistupidlevel, artificial_analysis, lmarena, openrouter, swebench, swebench_pro, swerebench, terminal_bench
missing: A_B/AI_code, A_B/AI_complexity, A_B/AI_correctness, A_B/AI_edge_cases, A_B/AI_efficiency, A_B/AI_recovery, A_B/AI_spec, A_B/AI_stability, A_I/AI_code, A_I/AI_complexity, A_I/AI_correctness, A_I/AI_edge_cases, A_I/AI_efficiency, A_I/AI_recovery, A_I/AI_refusal, A_I/AI_spec, A_I/AI_stability, A_P/AI_correctness, A_P/AI_efficiency, A_P/AI_recovery, A_P/AI_spec, A_P/AI_stability, A_R/AI_code, A_R/AI_correctness, A_R/AI_edge_cases, A_R/AI_recovery, A_R/AI_spec, A_R/AI_stability, BUILD/GDPval, BUILD/LiveCodeBench, GEN/ARC_AGI_2, LM_ARENA_REVIEW_PROXY/LMArenaSearchDocument, SWEComposite/SWEBenchMultilingual
kimi-k2-0905 · moonshot · idea 24.1 raw / 24.1 adjusted · plan 27.8 raw / 27.8 adjusted · build 34.0 raw / 34.0 adjusted · review 35.9

group breakdown

A_B · 18.7 · 24 / 24
A_I · 28.7 · 24 / 24
A_P · 35.7 · 24 / 24
A_R · 27.6 · 23 / 24
BUILD · 42.7 · 17 / 24
CRE · 26.9 · 20 / 24
GEN · 7.9 · 24 / 24
LM_ARENA_REVIEW_PROXY · 50.0 · 9 / 24
OPS_long · 35.4 · 24 / 24
OPS_precision · 53.7 · 22 / 24
OPS_review · 60.2 · 19 / 24
PLAN · 29.6 · 20 / 24

metrics

AI_canary_health · 88.9 · 1 / 7
AI_code · 0.0 · 17 / 22
AI_complexity · 2.9 · 16 / 22
AI_context_awareness · 0.0 · 15 / 24
AI_correctness · 0.0 · 22 / 22
AI_edge_cases · 0.0 · 22 / 22
AI_efficiency · 0.0 · 21 / 22
AI_hallucination_resistance · 80.0 · 9 / 24
AI_memory_retention · 0.0 · 19 / 24
AI_parameter_accuracy · 84.7 · 13 / 24
AI_plan_coherence · 30.3 · 10 / 24
AI_recovery · 0.0 · 22 / 22
AI_refusal · 100.0 · 10 / 22
AI_safety_compliance · 88.9 · 14 / 24
AI_spec · 100.0 · 10 / 22
AI_stability · 0.0 · 22 / 22
AI_task_completion · 87.0 · 6 / 24
AI_tool_selection · 68.9 · 11 / 24
ArtificialAnalysisCoding · 4.2 · 19 / 21
ArtificialAnalysisIntelligence · 0.0 · 20 / 21
ArtificialAnalysisReasoning · 0.0 · 20 / 21
ContextWindow · 53.4 · 23 / 24
GPQA_HLE_Reasoning · 0.0 · 20 / 21
IFBench · 0.0 · 20 / 21
InverseCost · 92.7 · 7 / 24
InverseTTFT · 90.7 · 5 / 19
LMArenaCreativeOrOpenEnded · 26.9 · 20 / 24
LMArenaText · 26.9 · 20 / 24
LongContextRecall · 0.0 · 20 / 21
OutputSpeed · 0.0 · 19 / 19
SWEComposite · 50.0 · 17 / 24
SciCode · 0.0 · 20 / 21
Tau2Bench · 48.0 · 16 / 21
sources: aistupidlevel, artificial_analysis, lmarena, openrouter
missing: BUILD/CopilotArenaOrLMArenaCode, BUILD/GDPval, BUILD/LiveCodeBench, BUILD/MCPAtlas, BUILD/SonarFunctionalSkill, BUILD/SonarIssueDensity, BUILD/TerminalBench, GEN/ARC_AGI_2, LM_ARENA_REVIEW_PROXY/LMArenaSearchDocument, PLAN/MCPAtlas, PLAN/TerminalBench, SWEComposite/SWEBenchMultilingual, SWEComposite/SWEBenchPro, SWEComposite/SWEBenchVerified, SWEComposite/SWERebench