$ipbr-rank · live llm coding-role score
refreshed · 13 sources
[ idea ]
rank · model · raw · adjusted
1 · gemini-3.1-pro-preview · 81.6 · 81.4
2 · claude-opus-4.7 · 77.4 · 77.4
3 · claude-opus-4.6 · 75.5 · 75.5
[ plan ]
rank · model · raw · adjusted
1 · gemini-3.1-pro-preview · 80.3 · 79.9
2 · gpt-5.5 · 75.2 · 75.2
3 · claude-opus-4.7 · 70.0 · 70.0
[ build ]
rank · model · raw · adjusted
1 · claude-opus-4.7 · 67.9 · 67.9
2 · gemini-3.1-pro-preview · 67.1 · 66.4
3 · gpt-5.5 · 63.7 · 63.7
[ review ]
rank · model · raw · adjusted
1 · gemini-3.1-pro-preview · 71.3 · 71.3
2 · claude-opus-4.7 · 69.0 · 69.0
3 · kimi-k2.6 · 68.2 · 68.2
how scoring works

Each model gets four role scores, all derived from public benchmarks. Idea measures open-ended creativity. Plan measures structured reasoning, function calling, and multi-step decomposition. Build measures implementation skill (SWE-bench, LiveCodeBench, terminal tasks). Review measures preference judgment.

raw vs adjusted

The raw score is the benchmark composite, normalized to 0-100. The adjusted score subtracts a reviewer-reservation penalty: when a vendor's models dominate Review, that lead gets discounted from their Idea, Plan, and Build scores so vendors can't game their own preference evaluations.

Penalty coefficients differ by role: Build is penalized hardest (0.32), Plan moderately (0.18), Idea lightly (0.08). Review is never adjusted, because it is the source of the penalty.
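
A minimal sketch of the adjustment, assuming the penalty is the role coefficient multiplied by the vendor's Review lead. The page does not spell out how the lead is measured. Here it is assumed to be the gap between the vendor's best Review score and the best Review score from any other vendor, floored at zero. The function name `adjust_scores` is hypothetical. With the listed values for gemini-3.1-pro-preview (Review 71.3 against 69.0 for claude-opus-4.7), this assumption reproduces the adjusted scores shown on the leaderboard, which is a consistency check rather than confirmation of the actual formula.

```python
# Coefficients quoted above; Review has none because it is never adjusted.
PENALTY_COEFFICIENTS = {"idea": 0.08, "plan": 0.18, "build": 0.32}


def adjust_scores(raw_scores, review_lead):
    """Subtract coefficient * lead from each penalized role (assumed formula)."""
    lead = max(review_lead, 0.0)  # assumption: no bonus for trailing in Review
    return {
        role: score - PENALTY_COEFFICIENTS.get(role, 0.0) * lead
        for role, score in raw_scores.items()
    }


# Raw scores for gemini-3.1-pro-preview as listed on the leaderboard.
raw = {"idea": 81.6, "plan": 80.3, "build": 67.1, "review": 71.3}

# Assumed lead: top Google Review score minus the best non-Google Review score.
lead = 71.3 - 69.0

adjusted = adjust_scores(raw, lead)
for role, score in adjusted.items():
    print(f"{role}: {score:.1f}")
# idea: 81.4
# plan: 79.9
# build: 66.4
# review: 71.3
```

A vendor with no Review lead passes a lead of zero and keeps its raw scores, which matches the models whose raw and adjusted values are identical.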

missing data

If a model is missing some metrics within a group, the group score is computed from the metrics that are present, provided they cover at least 70% of the group weight. Below that threshold, the score shrinks toward 50 in proportion to the missing weight.
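
The rule above can be sketched as follows. This is an illustration under stated assumptions, not the site's implementation: the function name `group_score`, the metric names and weights, and the exact shrinkage form (a linear blend of the renormalized mean with 50, weighted by the missing share) are all hypothetical.

```python
def group_score(values, weights, threshold=0.70, neutral=50.0):
    """Weighted mean of present metrics, shrunk toward `neutral` if coverage is low.

    values:  metric name -> score, or None when the metric is missing
    weights: metric name -> weight within the group
    """
    total_weight = sum(weights.values())
    present = {m: w for m, w in weights.items() if values.get(m) is not None}
    covered_weight = sum(present.values())
    if covered_weight == 0:
        return neutral  # assumption: nothing observed, fall back to neutral

    # Renormalize over the metrics that are actually present.
    mean = sum(values[m] * w for m, w in present.items()) / covered_weight
    coverage = covered_weight / total_weight
    if coverage >= threshold:
        return mean

    # Assumed shrinkage: move toward neutral by the missing share of the weight.
    missing = 1.0 - coverage
    return mean + missing * (neutral - mean)


# Hypothetical group with three metrics weighted 5 : 3 : 2.
weights = {"metric_a": 5, "metric_b": 3, "metric_c": 2}

# 80% of the weight is present, so the score is the renormalized mean.
print(group_score({"metric_a": 80.0, "metric_b": 60.0, "metric_c": None}, weights))  # 72.5

# Only 50% of the weight is present, so the score is pulled halfway toward 50.
print(group_score({"metric_a": 80.0, "metric_b": None, "metric_c": None}, weights))  # 65.0
```

Shrinking toward 50 keeps a model with sparse data near the middle of the scale instead of rewarding or punishing it for one or two observed metrics.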

Full math, role definitions, and source list →

claude-opus-4.7 · anthropic · idea 77.4 / 77.4 · plan 70.0 / 70.0 · build 67.9 / 67.9 · review 69.0 (raw / adjusted; review is never adjusted)

group breakdown

A_B 38.7 (17/24) · A_I 51.2 (10/24) · A_P 45.6 (12/24) · A_R 42.8 (22/24) · BUILD 87.6 (1/24) · CRE 95.0 (4/24) · GEN 97.0 (2/24) · LM_ARENA_REVIEW_PROXY 100.0 (1/24) · OPS_long 74.3 (17/24) · OPS_precision 69.0 (17/24) · OPS_review 65.6 (17/24) · PLAN 79.9 (4/24)

metrics

AI_canary_health 88.2 (3/7) · AI_code 15.0 (12/22) · AI_complexity 44.6 (14/22) · AI_context_awareness 15.2 (5/24) · AI_correctness 0.0 (12/22) · AI_edge_cases 56.6 (7/22) · AI_efficiency 68.7 (6/22) · AI_hallucination_resistance 0.0 (23/24) · AI_memory_retention 30.5 (10/24) · AI_parameter_accuracy 35.2 (20/24) · AI_plan_coherence 21.9 (12/24) · AI_recovery 94.3 (7/22) · AI_refusal 50.0 (4/22) · AI_safety_compliance 100.0 (2/24) · AI_spec 50.0 (4/22) · AI_stability 100.0 (2/22) · AI_task_completion 100.0 (1/24) · AI_tool_selection 57.6 (13/24) · ARC_AGI_2 94.0 (3/17) · ArtificialAnalysisCoding 90.3 (3/21) · ArtificialAnalysisIntelligence 100.0 (1/21) · ArtificialAnalysisReasoning 95.6 (3/21) · ContextWindow 99.3 (8/24) · CopilotArenaOrLMArenaCode 100.0 (2/22) · GDPval 95.0 (1/11) · GPQA_HLE_Reasoning 95.6 (3/21) · IFBench 46.6 (10/21) · InverseCost 61.9 (22/24) · InverseTTFT 49.1 (16/19) · LMArenaCreativeOrOpenEnded 95.0 (4/24) · LMArenaSearchDocument 100.0 (1/19) · LMArenaText 95.0 (4/24) · LongContextRecall 88.2 (6/21) · OutputSpeed 78.8 (12/19) · SWEBenchPro 95.0 (1/14) · SWEBenchVerified 94.6 (2/18) · SWEComposite 92.7 (1/24) · SWERebench 85.3 (6/20) · SciCode 100.0 (1/21) · SonarFunctionalSkill 98.4 (2/17) · SonarIssueDensity 2.4 (17/17) · Tau2Bench 83.1 (9/21) · TerminalBench 78.6 (4/22)
sources aistupidlevel · arc_agi · artificial_analysis · lmarena · mcp_atlas · openrouter · overrides · sonar
missing BUILD/LiveCodeBench · BUILD/MCPAtlas · PLAN/MCPAtlas · SWEComposite/SWEBenchMultilingual
gemini-3.1-pro-preview · google · idea 81.6 / 81.4 · plan 80.3 / 79.9 · build 67.1 / 66.4 · review 71.3 (raw / adjusted; review is never adjusted)

group breakdown

A_B 51.9 (9/24) · A_I 51.4 (9/24) · A_P 55.5 (7/24) · A_R 48.5 (12/24) · BUILD 74.9 (3/24) · CRE 100.0 (2/24) · GEN 100.0 (1/24) · LM_ARENA_REVIEW_PROXY 89.8 (4/24) · OPS_long 82.0 (8/24) · OPS_precision 73.5 (16/24) · OPS_review 69.7 (16/24) · PLAN 94.1 (2/24)

metrics

AI_code 30.3 (7/22) · AI_complexity 67.8 (7/22) · AI_context_awareness 14.6 (8/24) · AI_correctness 92.5 (7/22) · AI_edge_cases 7.5 (20/22) · AI_efficiency 66.8 (9/22) · AI_hallucination_resistance 92.5 (8/24) · AI_memory_retention 92.5 (4/24) · AI_parameter_accuracy 92.5 (7/24) · AI_plan_coherence 79.0 (8/24) · AI_recovery 7.5 (20/22) · AI_refusal 50.0 (13/22) · AI_safety_compliance 92.5 (13/24) · AI_spec 50.0 (13/22) · AI_stability 32.6 (19/22) · AI_task_completion 61.3 (18/24) · AI_tool_selection 10.3 (19/24) · ARC_AGI_2 100.0 (1/17) · ArtificialAnalysisCoding 100.0 (1/21) · ArtificialAnalysisIntelligence 100.0 (2/21) · ArtificialAnalysisReasoning 100.0 (1/21) · ContextWindow 100.0 (6/24) · CopilotArenaOrLMArenaCode 73.0 (7/22) · GDPval 23.8 (9/11) · GPQA_HLE_Reasoning 100.0 (1/21) · IFBench 97.5 (3/21) · InverseCost 77.3 (13/24) · InverseTTFT 40.9 (18/19) · LMArenaCreativeOrOpenEnded 100.0 (2/24) · LMArenaSearchDocument 89.8 (4/19) · LMArenaText 100.0 (2/24) · LongContextRecall 100.0 (2/21) · MCPAtlas 66.8 (6/13) · OutputSpeed 93.0 (5/19) · SWEBenchPro 61.1 (4/14) · SWEBenchVerified 78.0 (3/18) · SWEComposite 75.4 (4/24) · SWERebench 100.0 (2/20) · SciCode 100.0 (2/21) · SonarFunctionalSkill 86.3 (8/17) · SonarIssueDensity 18.7 (13/17) · Tau2Bench 99.3 (4/21) · TerminalBench 89.9 (3/22)
sources arc_agi · artificial_analysis · lmarena · mcp_atlas · openrouter · overrides · sonar · swebench_pro · swerebench · terminal_bench
missing BUILD/LiveCodeBench · SWEComposite/SWEBenchMultilingual
gpt-5.5 · openai · idea 72.6 / 72.6 · plan 75.2 / 75.2 · build 63.7 / 63.7 · review 66.1 (raw / adjusted; review is never adjusted)

group breakdown

A_B 41.3 (11/24) · A_I 49.1 (18/24) · A_P 42.0 (16/24) · A_R 53.0 (9/24) · BUILD 74.4 (4/24) · CRE 82.0 (7/24) · GEN 94.4 (3/24) · LM_ARENA_REVIEW_PROXY 27.4 (14/24) · OPS_long 81.7 (10/24) · OPS_precision 80.0 (9/24) · OPS_review 77.5 (11/24) · PLAN 95.4 (1/24)

metrics

AI_code 15.0 (18/22) · AI_complexity 44.6 (20/22) · AI_context_awareness 0.0 (20/24) · AI_correctness 0.0 (22/22) · AI_edge_cases 56.6 (16/22) · AI_efficiency 56.3 (10/22) · AI_hallucination_resistance 60.0 (14/24) · AI_memory_retention 0.0 (24/24) · AI_parameter_accuracy 95.8 (3/24) · AI_plan_coherence 7.9 (19/24) · AI_recovery 94.3 (16/22) · AI_refusal 50.0 (19/22) · AI_safety_compliance 100.0 (10/24) · AI_spec 50.0 (19/22) · AI_stability 100.0 (7/22) · AI_task_completion 87.0 (9/24) · AI_tool_selection 84.9 (6/24) · ARC_AGI_2 98.1 (2/17) · ArtificialAnalysisCoding 100.0 (2/21) · ArtificialAnalysisIntelligence 98.1 (3/21) · ArtificialAnalysisReasoning 100.0 (2/21) · ContextWindow 100.0 (2/24) · CopilotArenaOrLMArenaCode 71.4 (8/22) · GPQA_HLE_Reasoning 100.0 (2/21) · IFBench 80.7 (6/21) · InverseCost 50.6 (23/24) · InverseTTFT 81.7 (8/19) · LMArenaCreativeOrOpenEnded 82.0 (7/24) · LMArenaSearchDocument 27.4 (9/19) · LMArenaText 82.0 (7/24) · LongContextRecall 98.0 (3/21) · OutputSpeed 82.4 (9/19) · SWEBenchVerified 95.0 (1/18) · SWEComposite 63.5 (8/24) · SciCode 94.5 (4/21) · SonarFunctionalSkill 70.7 (13/17) · SonarIssueDensity 46.0 (4/17) · Tau2Bench 90.5 (7/21) · TerminalBench 100.0 (2/22)
sources aistupidlevel · arc_agi · artificial_analysis · lmarena · openrouter · overrides · sonar · terminal_bench
missing BUILD/GDPval · BUILD/LiveCodeBench · BUILD/MCPAtlas · PLAN/MCPAtlas · SWEComposite/SWEBenchMultilingual · SWEComposite/SWEBenchPro · SWEComposite/SWERebench
grok-4-latest · xai · idea 72.8 / 72.8 · plan 50.8 / 50.8 · build 62.9 / 62.9 · review 60.2 (raw / adjusted; review is never adjusted)

group breakdown

A_B 85.4 (1/24) · A_I 79.0 (1/24) · A_P 58.1 (2/24) · A_R 93.0 (1/24) · BUILD 47.4 (15/24) · CRE 76.3 (10/24) · GEN 49.4 (15/24) · LM_ARENA_REVIEW_PROXY 18.6 (19/24) · OPS_long 77.5 (14/24) · OPS_precision 77.5 (12/24) · OPS_review 77.3 (12/24) · PLAN 38.0 (18/24)

metrics

AI_code 100.0 (2/22) · AI_complexity 100.0 (2/22) · AI_context_awareness 0.0 (21/24) · AI_correctness 100.0 (3/22) · AI_edge_cases 100.0 (2/22) · AI_efficiency 0.0 (22/22) · AI_hallucination_resistance 100.0 (2/24) · AI_memory_retention 85.5 (5/24) · AI_parameter_accuracy 0.0 (21/24) · AI_plan_coherence 100.0 (1/24) · AI_recovery 100.0 (2/22) · AI_refusal 50.0 (20/22) · AI_safety_compliance 0.0 (21/24) · AI_spec 50.0 (20/22) · AI_stability 100.0 (8/22) · AI_task_completion 0.0 (21/24) · AI_tool_selection 0.0 (21/24) · ARC_AGI_2 21.0 (8/17) · ArtificialAnalysisCoding 51.5 (10/21) · ArtificialAnalysisIntelligence 40.3 (14/21) · ArtificialAnalysisReasoning 57.0 (10/21) · ContextWindow 78.4 (15/24) · CopilotArenaOrLMArenaCode 57.0 (15/22) · GPQA_HLE_Reasoning 57.0 (10/21) · IFBench 33.1 (15/21) · InverseCost 74.4 (19/24) · InverseTTFT 78.5 (10/19) · LMArenaCreativeOrOpenEnded 76.3 (10/24) · LMArenaSearchDocument 18.6 (14/19) · LMArenaText 76.3 (10/24) · LongContextRecall 77.0 (8/21) · OutputSpeed 77.4 (14/19) · SWEComposite 47.7 (19/24) · SWERebench 38.3 (16/20) · SciCode 51.9 (10/21) · Tau2Bench 51.5 (14/21) · TerminalBench 11.6 (19/22)
sources aistupidlevel · arc_agi · artificial_analysis · lmarena · openrouter · swerebench · terminal_bench
missing BUILD/GDPval · BUILD/LiveCodeBench · BUILD/MCPAtlas · BUILD/SonarFunctionalSkill · BUILD/SonarIssueDensity · PLAN/MCPAtlas · SWEComposite/SWEBenchMultilingual · SWEComposite/SWEBenchPro · SWEComposite/SWEBenchVerified
claude-opus-4.6 · anthropic · idea 75.5 / 75.5 · plan 64.7 / 64.7 · build 61.0 / 61.0 · review 57.5 (raw / adjusted; review is never adjusted)

group breakdown

A_B 33.3 (22/24) · A_I 44.0 (22/24) · A_P 41.6 (18/24) · A_R 42.9 (21/24) · BUILD 78.3 (2/24) · CRE 100.0 (1/24) · GEN 89.6 (4/24) · LM_ARENA_REVIEW_PROXY 32.2 (13/24) · OPS_long 78.0 (13/24) · OPS_precision 76.8 (13/24) · OPS_review 74.7 (13/24) · PLAN 72.8 (5/24)

metrics

AI_canary_health 83.3 (4/7) · AI_code 25.4 (10/22) · AI_complexity 0.0 (21/22) · AI_context_awareness 9.8 (9/24) · AI_correctness 0.0 (11/22) · AI_edge_cases 56.6 (6/22) · AI_efficiency 54.8 (11/22) · AI_hallucination_resistance 0.0 (22/24) · AI_memory_retention 0.0 (14/24) · AI_parameter_accuracy 71.8 (15/24) · AI_plan_coherence 2.4 (20/24) · AI_recovery 94.3 (6/22) · AI_refusal 50.0 (3/22) · AI_safety_compliance 83.7 (16/24) · AI_spec 50.0 (3/22) · AI_stability 89.8 (10/22) · AI_task_completion 93.2 (4/24) · AI_tool_selection 100.0 (2/24) · ARC_AGI_2 92.2 (4/17) · ArtificialAnalysisCoding 76.1 (5/21) · ArtificialAnalysisIntelligence 84.0 (5/21) · ArtificialAnalysisReasoning 86.3 (5/21) · ContextWindow 99.3 (7/24) · CopilotArenaOrLMArenaCode 100.0 (1/22) · GDPval 78.0 (5/11) · GPQA_HLE_Reasoning 86.3 (5/21) · IFBench 31.4 (16/21) · InverseCost 61.9 (21/24) · InverseTTFT 73.6 (15/19) · LMArenaCreativeOrOpenEnded 100.0 (1/24) · LMArenaSearchDocument 32.2 (8/19) · LMArenaText 100.0 (1/24) · LongContextRecall 90.2 (4/21) · OutputSpeed 76.6 (15/19) · SWEBenchMultilingual 90.9 (2/6) · SWEBenchPro 76.3 (3/14) · SWEBenchVerified 67.9 (8/18) · SWEComposite 78.3 (3/24) · SWERebench 91.6 (4/20) · SciCode 85.8 (5/21) · SonarFunctionalSkill 97.4 (3/17) · SonarIssueDensity 41.9 (6/17) · Tau2Bench 91.2 (6/21) · TerminalBench 64.5 (7/22)
sources aistupidlevel · arc_agi · artificial_analysis · lmarena · mcp_atlas · openrouter · overrides · sonar · swebench · swebench_pro · swerebench · terminal_bench
missing BUILD/LiveCodeBench · BUILD/MCPAtlas · PLAN/MCPAtlas
kimi-k2.6 · moonshot · idea 65.6 / 65.6 · plan 64.2 / 64.2 · build 60.9 / 60.9 · review 68.2 (raw / adjusted; review is never adjusted)

group breakdown

A_B 38.8 (16/24) · A_I 49.6 (15/24) · A_P 37.0 (21/24) · A_R 46.2 (19/24) · BUILD 74.2 (5/24) · CRE 77.6 (8/24) · GEN 73.7 (5/24) · LM_ARENA_REVIEW_PROXY 92.2 (2/24) · OPS_long 58.0 (20/24) · OPS_precision 59.8 (18/24) · OPS_review 60.3 (18/24) · PLAN 86.7 (3/24)

metrics

AI_code 15.0 (15/22) · AI_complexity 60.2 (10/22) · AI_context_awareness 0.0 (16/24) · AI_correctness 0.0 (18/22) · AI_edge_cases 56.6 (12/22) · AI_efficiency 46.2 (15/22) · AI_hallucination_resistance 20.0 (21/24) · AI_memory_retention 0.0 (20/24) · AI_parameter_accuracy 66.7 (17/24) · AI_plan_coherence 7.9 (18/24) · AI_recovery 94.3 (12/22) · AI_refusal 50.0 (15/22) · AI_safety_compliance 66.7 (19/24) · AI_spec 50.0 (15/22) · AI_stability 100.0 (5/22) · AI_task_completion 72.5 (11/24) · AI_tool_selection 52.9 (14/24) · ARC_AGI_2 11.9 (9/17) · ArtificialAnalysisCoding 72.8 (7/21) · ArtificialAnalysisIntelligence 87.5 (4/21) · ArtificialAnalysisReasoning 87.6 (4/21) · ContextWindow 78.4 (14/24) · CopilotArenaOrLMArenaCode 94.7 (4/22) · GDPval 54.7 (8/11) · GPQA_HLE_Reasoning 87.6 (4/21) · IFBench 94.5 (4/21) · InverseCost 87.1 (9/24) · LMArenaCreativeOrOpenEnded 77.6 (8/24) · LMArenaSearchDocument 92.2 (2/19) · LMArenaText 77.6 (8/24) · LongContextRecall 85.3 (7/21) · MCPAtlas 92.5 (2/13) · SWEBenchVerified 77.0 (4/18) · SWEComposite 62.7 (10/24) · SWERebench 72.9 (11/20) · SciCode 94.5 (3/21) · SonarFunctionalSkill 79.2 (12/17) · SonarIssueDensity 92.5 (2/17) · Tau2Bench 100.0 (1/21) · TerminalBench 74.9 (5/22)
sources aistupidlevel · artificial_analysis · lmarena · openrouter · overrides
missing BUILD/LiveCodeBench · OPS_long/InverseTTFT · OPS_long/OutputSpeed · OPS_precision/InverseTTFT · OPS_precision/OutputSpeed · OPS_review/InverseTTFT · OPS_review/OutputSpeed · SWEComposite/SWEBenchMultilingual · SWEComposite/SWEBenchPro
glm-5.1 · zai · idea 68.4 / 68.4 · plan 58.4 / 58.4 · build 60.5 / 60.5 · review 64.9 (raw / adjusted; review is never adjusted)

group breakdown

A_B 42.9 (10/24) · A_I 49.7 (14/24) · A_P 41.7 (17/24) · A_R 47.6 (15/24) · BUILD 67.7 (9/24) · CRE 86.7 (5/24) · GEN 57.6 (11/24) · LM_ARENA_REVIEW_PROXY 85.9 (5/24) · OPS_long 81.9 (9/24) · OPS_precision 86.5 (7/24) · OPS_review 88.7 (7/24) · PLAN 69.5 (7/24)

metrics

AI_code 20.3 (11/22) · AI_complexity 58.7 (12/22) · AI_context_awareness 7.5 (11/24) · AI_correctness 7.5 (8/22) · AI_edge_cases 55.6 (17/22) · AI_efficiency 46.8 (14/22) · AI_hallucination_resistance 24.5 (18/24) · AI_memory_retention 7.5 (12/24) · AI_parameter_accuracy 64.2 (18/24) · AI_plan_coherence 14.3 (16/24) · AI_recovery 87.7 (17/22) · AI_refusal 50.0 (22/22) · AI_safety_compliance 64.2 (20/24) · AI_spec 50.0 (22/22) · AI_stability 92.5 (9/22) · AI_task_completion 69.1 (12/24) · AI_tool_selection 52.5 (15/24) · ARC_AGI_2 5.2 (11/17) · ArtificialAnalysisCoding 39.5 (13/21) · ArtificialAnalysisIntelligence 60.5 (8/21) · ArtificialAnalysisReasoning 54.0 (13/21) · ContextWindow 74.9 (19/24) · CopilotArenaOrLMArenaCode 96.2 (3/22) · GDPval 63.0 (7/11) · GPQA_HLE_Reasoning 54.0 (13/21) · IFBench 86.8 (5/21) · InverseCost 93.0 (6/24) · InverseTTFT 100.0 (1/19) · LMArenaCreativeOrOpenEnded 86.7 (5/24) · LMArenaSearchDocument 85.9 (5/19) · LMArenaText 86.7 (5/24) · LongContextRecall 41.2 (17/21) · MCPAtlas 100.0 (1/13) · OutputSpeed 75.2 (17/19) · SWEBenchMultilingual 50.9 (3/6) · SWEBenchVerified 60.5 (11/18) · SWEComposite 63.2 (9/24) · SWERebench 100.0 (1/20) · SciCode 40.4 (14/21) · SonarFunctionalSkill 84.3 (9/17) · SonarIssueDensity 100.0 (1/17) · Tau2Bench 100.0 (3/21) · TerminalBench 56.0 (10/22)
sources arc_agi · artificial_analysis · lmarena · mcp_atlas · openrouter · overrides · sonar · swebench · swerebench · terminal_bench
missing BUILD/LiveCodeBench · SWEComposite/SWEBenchPro
gpt-5.4 · openai · idea 62.4 / 62.4 · plan 48.8 / 48.8 · build 59.6 / 59.6 · review 54.7 (raw / adjusted; review is never adjusted)

group breakdown

A_B 39.1 (15/24) · A_I 46.5 (19/24) · A_P 39.6 (19/24) · A_R 53.0 (8/24) · BUILD 68.4 (7/24) · CRE 77.5 (9/24) · GEN 45.2 (16/24) · LM_ARENA_REVIEW_PROXY 17.1 (20/24) · OPS_long 92.8 (3/24) · OPS_precision 91.2 (2/24) · OPS_review 89.7 (5/24) · PLAN 50.7 (15/24)

metrics

AI_code 15.0 (17/22) · AI_complexity 44.6 (19/22) · AI_context_awareness 0.0 (19/24) · AI_correctness 0.0 (21/22) · AI_edge_cases 56.6 (15/22) · AI_efficiency 29.2 (19/22) · AI_hallucination_resistance 60.0 (13/24) · AI_memory_retention 0.0 (23/24) · AI_parameter_accuracy 91.0 (10/24) · AI_plan_coherence 2.4 (22/24) · AI_recovery 94.3 (15/22) · AI_refusal 50.0 (18/22) · AI_safety_compliance 100.0 (9/24) · AI_spec 50.0 (18/22) · AI_stability 100.0 (6/22) · AI_task_completion 87.0 (8/24) · AI_tool_selection 83.7 (7/24) · ARC_AGI_2 76.9 (5/17) · ArtificialAnalysisCoding 33.7 (15/21) · ArtificialAnalysisIntelligence 27.4 (16/21) · ArtificialAnalysisReasoning 15.5 (18/21) · ContextWindow 100.0 (1/24) · CopilotArenaOrLMArenaCode 67.4 (11/22) · GPQA_HLE_Reasoning 15.5 (18/21) · IFBench 62.5 (9/21) · InverseCost 75.0 (15/24) · InverseTTFT 90.4 (6/19) · LMArenaCreativeOrOpenEnded 77.5 (9/24) · LMArenaSearchDocument 17.1 (15/19) · LMArenaText 77.5 (9/24) · LongContextRecall 24.5 (18/21) · MCPAtlas 68.2 (4/13) · OutputSpeed 95.0 (3/19) · SWEBenchPro 88.4 (2/14) · SWEBenchVerified 72.3 (5/18) · SWEComposite 82.0 (2/24) · SWERebench 83.5 (7/20) · SciCode 12.0 (18/21) · SonarFunctionalSkill 82.6 (11/17) · SonarIssueDensity 13.2 (14/17) · Tau2Bench 0.0 (21/21) · TerminalBench 100.0 (1/22)
sources aistupidlevel · arc_agi · artificial_analysis · lmarena · mcp_atlas · openrouter · overrides · sonar · swebench_pro · terminal_bench
missing BUILD/GDPval · BUILD/LiveCodeBench · SWEComposite/SWEBenchMultilingual
claude-opus-4.5 · anthropic · idea 63.9 / 63.9 · plan 60.2 / 60.2 · build 59.3 / 59.3 · review 53.5 (raw / adjusted; review is never adjusted)

group breakdown

A_B 40.5 (12/24) · A_I 49.5 (16/24) · A_P 47.3 (9/24) · A_R 47.3 (16/24) · BUILD 70.6 (6/24) · CRE 73.6 (11/24) · GEN 70.5 (6/24) · LM_ARENA_REVIEW_PROXY 10.8 (21/24) · OPS_long 76.4 (16/24) · OPS_precision 74.6 (14/24) · OPS_review 73.7 (14/24) · PLAN 65.8 (8/24)

metrics

AI_canary_health 88.5 (2/7) · AI_code 25.4 (9/22) · AI_complexity 44.6 (13/22) · AI_context_awareness 55.0 (3/24) · AI_correctness 0.0 (10/22) · AI_edge_cases 56.6 (5/22) · AI_efficiency 52.7 (13/22) · AI_hallucination_resistance 20.0 (19/24) · AI_memory_retention 10.2 (11/24) · AI_parameter_accuracy 100.0 (1/24) · AI_plan_coherence 10.7 (17/24) · AI_recovery 94.3 (5/22) · AI_refusal 50.0 (2/22) · AI_safety_compliance 72.9 (18/24) · AI_spec 50.0 (2/22) · AI_stability 100.0 (1/22) · AI_task_completion 65.3 (14/24) · AI_tool_selection 100.0 (1/24) · ArtificialAnalysisCoding 75.1 (6/21) · ArtificialAnalysisIntelligence 71.5 (7/21) · ArtificialAnalysisReasoning 63.7 (9/21) · ContextWindow 74.7 (21/24) · CopilotArenaOrLMArenaCode 76.5 (6/22) · GDPval 73.8 (6/11) · GPQA_HLE_Reasoning 63.7 (9/21) · IFBench 44.9 (11/21) · InverseCost 61.9 (20/24) · InverseTTFT 74.5 (14/19) · LMArenaCreativeOrOpenEnded 73.6 (11/24) · LMArenaSearchDocument 10.8 (16/19) · LMArenaText 73.6 (11/24) · LongContextRecall 100.0 (1/21) · OutputSpeed 80.3 (11/19) · SWEBenchPro 60.5 (5/14) · SWEBenchVerified 65.2 (9/18) · SWEComposite 65.6 (6/24) · SWERebench 76.3 (8/20) · SciCode 72.7 (7/21) · SonarFunctionalSkill 100.0 (1/17) · SonarIssueDensity 63.6 (3/17) · Tau2Bench 85.2 (8/21) · TerminalBench 54.9 (11/22)
sources aistupidlevel · artificial_analysis · lmarena · mcp_atlas · openrouter · sonar · swebench_pro · swerebench · terminal_bench
missing BUILD/LiveCodeBench · BUILD/MCPAtlas · GEN/ARC_AGI_2 · PLAN/MCPAtlas · SWEComposite/SWEBenchMultilingual
gemini-3-pro · google · idea 70.1 / 69.9 · plan 56.8 / 56.4 · build 55.7 / 55.0 · review 48.7 (raw / adjusted; review is never adjusted)

group breakdown

A_B 52.9 (6/24) · A_I 52.2 (6/24) · A_P 58.7 (1/24) · A_R 47.6 (14/24) · BUILD 58.9 (10/24) · CRE 95.0 (3/24) · GEN 60.1 (10/24) · LM_ARENA_REVIEW_PROXY 19.3 (18/24) · OPS_long 45.2 (23/24) · OPS_precision 46.6 (23/24) · OPS_review 50.5 (22/24) · PLAN 55.1 (13/24)

metrics

AI_code 26.8 (8/22) · AI_complexity 70.9 (4/22) · AI_context_awareness 8.3 (10/24) · AI_correctness 100.0 (2/22) · AI_edge_cases 0.0 (21/22) · AI_efficiency 69.7 (5/22) · AI_hallucination_resistance 100.0 (1/24) · AI_memory_retention 100.0 (1/24) · AI_parameter_accuracy 100.0 (2/24) · AI_plan_coherence 84.1 (5/24) · AI_recovery 0.0 (21/22) · AI_refusal 50.0 (12/22) · AI_safety_compliance 100.0 (7/24) · AI_spec 50.0 (12/22) · AI_stability 29.6 (20/22) · AI_task_completion 63.3 (15/24) · AI_tool_selection 3.3 (20/24) · ARC_AGI_2 42.4 (6/17) · ContextWindow 0.0 (24/24) · CopilotArenaOrLMArenaCode 67.8 (10/22) · InverseCost 77.3 (12/24) · LMArenaCreativeOrOpenEnded 95.0 (3/24) · LMArenaSearchDocument 19.3 (13/19) · LMArenaText 95.0 (3/24) · MCPAtlas 69.7 (3/13) · SWEBenchMultilingual 33.5 (4/6) · SWEBenchPro 53.7 (8/14) · SWEBenchVerified 52.1 (13/18) · SWEComposite 54.5 (12/24) · SWERebench 70.3 (13/20) · SonarFunctionalSkill 92.7 (5/17) · SonarIssueDensity 13.1 (15/17) · TerminalBench 61.4 (8/22)
sources aistupidlevel · arc_agi · artificial_analysis · lmarena · mcp_atlas · openrouter · overrides · sonar · swebench · swebench_pro · swerebench · terminal_bench
missing BUILD/ArtificialAnalysisCoding · BUILD/GDPval · BUILD/LiveCodeBench · BUILD/LongContextRecall · BUILD/SciCode · GEN/ArtificialAnalysisIntelligence · GEN/GPQA_HLE_Reasoning · OPS_long/InverseTTFT · OPS_long/OutputSpeed · OPS_precision/InverseTTFT · OPS_precision/OutputSpeed · OPS_review/InverseTTFT · OPS_review/OutputSpeed · PLAN/ArtificialAnalysisReasoning · PLAN/IFBench · PLAN/LongContextRecall · PLAN/Tau2Bench
gemini-3-flash · google · idea 70.5 / 70.3 · plan 62.9 / 62.5 · build 55.2 / 54.5 · review 52.6 (raw / adjusted; review is never adjusted)

group breakdown

A_B 51.9 (8/24) · A_I 51.4 (8/24) · A_P 55.5 (6/24) · A_R 48.5 (11/24) · BUILD 51.7 (12/24) · CRE 86.2 (6/24) · GEN 61.6 (9/24) · LM_ARENA_REVIEW_PROXY 19.5 (17/24) · OPS_long 94.9 (1/24) · OPS_precision 91.6 (1/24) · OPS_review 90.2 (3/24) · PLAN 64.7 (9/24)

metrics

AI_code 30.3 (6/22) · AI_complexity 67.8 (6/22) · AI_context_awareness 14.6 (7/24) · AI_correctness 92.5 (6/22) · AI_edge_cases 7.5 (19/22) · AI_efficiency 66.8 (8/22) · AI_hallucination_resistance 92.5 (7/24) · AI_memory_retention 92.5 (3/24) · AI_parameter_accuracy 92.5 (6/24) · AI_plan_coherence 79.0 (7/24) · AI_recovery 7.5 (19/22) · AI_refusal 50.0 (11/22) · AI_safety_compliance 92.5 (12/24) · AI_spec 50.0 (11/22) · AI_stability 32.6 (18/22) · AI_task_completion 61.3 (17/24) · AI_tool_selection 10.3 (18/24) · ARC_AGI_2 3.0 (14/17) · ArtificialAnalysisCoding 58.3 (9/21) · ArtificialAnalysisIntelligence 58.9 (11/21) · ArtificialAnalysisReasoning 82.7 (6/21) · ContextWindow 100.0 (5/24) · CopilotArenaOrLMArenaCode 67.4 (12/22) · GDPval 5.0 (11/11) · GPQA_HLE_Reasoning 82.7 (6/21) · IFBench 100.0 (2/21) · InverseCost 91.5 (8/24) · InverseTTFT 80.2 (9/19) · LMArenaCreativeOrOpenEnded 86.2 (6/24) · LMArenaSearchDocument 19.5 (12/19) · LMArenaText 86.2 (6/24) · LongContextRecall 68.6 (9/21) · MCPAtlas 22.3 (9/13) · OutputSpeed 99.4 (2/19) · SWEBenchMultilingual 100.0 (1/6) · SWEBenchPro 31.0 (12/14) · SWEBenchVerified 68.4 (7/18) · SWEComposite 58.1 (11/24) · SWERebench 76.1 (9/20) · SciCode 78.7 (6/21) · SonarFunctionalSkill 86.3 (7/17) · SonarIssueDensity 18.7 (12/17) · Tau2Bench 64.2 (10/21) · TerminalBench 48.4 (12/22)
sources arc_agi · artificial_analysis · lmarena · mcp_atlas · openrouter · overrides · sonar · swebench · swebench_pro · swerebench · terminal_bench
missing BUILD/LiveCodeBench
claude-sonnet-4.6 · anthropic · idea 63.0 / 63.0 · plan 54.3 / 54.3 · build 55.2 / 55.2 · review 49.7 (raw / adjusted; review is never adjusted)

group breakdown

A_B 35.4 (21/24) · A_I 49.4 (17/24) · A_P 44.7 (14/24) · A_R 42.8 (23/24) · BUILD 67.8 (8/24) · CRE 73.1 (13/24) · GEN 65.5 (7/24) · LM_ARENA_REVIEW_PROXY 22.6 (15/24) · OPS_long 66.7 (18/24) · OPS_precision 54.3 (21/24) · OPS_review 49.0 (23/24) · PLAN 56.6 (12/24)

metrics

AI_code 15.0 (14/22) · AI_complexity 44.6 (17/22) · AI_context_awareness 15.7 (4/24) · AI_correctness 0.0 (15/22) · AI_edge_cases 56.6 (10/22) · AI_efficiency 43.1 (16/22) · AI_hallucination_resistance 0.0 (24/24) · AI_memory_retention 0.0 (17/24) · AI_parameter_accuracy 94.4 (4/24) · AI_plan_coherence 24.7 (10/24) · AI_recovery 94.3 (10/22) · AI_refusal 50.0 (7/22) · AI_safety_compliance 100.0 (4/24) · AI_spec 50.0 (7/22) · AI_stability 100.0 (4/22) · AI_task_completion 67.9 (13/24) · AI_tool_selection 94.9 (3/24) · ARC_AGI_2 10.6 (10/17) · ArtificialAnalysisCoding 85.1 (4/21) · ArtificialAnalysisIntelligence 79.1 (6/21) · ArtificialAnalysisReasoning 68.7 (8/21) · ContextWindow 99.3 (11/24) · CopilotArenaOrLMArenaCode 93.4 (5/22) · GDPval 82.5 (4/11) · GPQA_HLE_Reasoning 68.7 (8/21) · IFBench 41.0 (13/21) · InverseCost 74.4 (18/24) · InverseTTFT 0.0 (19/19) · LMArenaCreativeOrOpenEnded 73.1 (13/24) · LMArenaSearchDocument 22.6 (10/19) · LMArenaText 73.1 (13/24) · LongContextRecall 90.2 (5/21) · MCPAtlas 65.1 (7/13) · OutputSpeed 80.7 (10/19) · SWEBenchPro 53.8 (7/14) · SWEBenchVerified 63.4 (10/18) · SWEComposite 66.4 (5/24) · SWERebench 95.8 (3/20) · SciCode 57.9 (8/21) · SonarFunctionalSkill 92.9 (4/17) · SonarIssueDensity 24.3 (10/17) · Tau2Bench 53.3 (12/21) · TerminalBench 47.5 (14/22)
sources aistupidlevel · arc_agi · artificial_analysis · lmarena · mcp_atlas · openrouter · sonar · swerebench
missing BUILD/LiveCodeBench · SWEComposite/SWEBenchMultilingual
gpt-5.3-codex · openai · idea 59.0 / 59.0 · plan 49.2 / 49.2 · build 51.9 / 51.9 · review 60.7 (raw / adjusted; review is never adjusted)

group breakdown

A_B 40.4 (13/24) · A_I 44.3 (21/24) · A_P 33.5 (23/24) · A_R 53.8 (7/24) · BUILD 58.0 (11/24) · CRE 73.4 (12/24) · GEN 55.8 (12/24) · LM_ARENA_REVIEW_PROXY 91.6 (3/24) · OPS_long 58.0 (21/24) · OPS_precision 59.3 (20/24) · OPS_review 58.8 (21/24) · PLAN 58.4 (10/24)

metrics

AI_code 15.0 (16/22) · AI_complexity 60.2 (11/22) · AI_context_awareness 0.0 (18/24) · AI_correctness 0.0 (20/22) · AI_edge_cases 56.6 (14/22) · AI_efficiency 31.4 (18/22) · AI_hallucination_resistance 80.0 (10/24) · AI_memory_retention 0.0 (22/24) · AI_parameter_accuracy 85.9 (11/24) · AI_plan_coherence 2.4 (21/24) · AI_recovery 94.3 (14/22) · AI_refusal 50.0 (17/22) · AI_safety_compliance 88.9 (15/24) · AI_spec 50.0 (17/22) · AI_stability 74.1 (15/22) · AI_task_completion 58.0 (19/24) · AI_tool_selection 66.4 (12/24) · ContextWindow 85.3 (13/24) · CopilotArenaOrLMArenaCode 58.4 (14/22) · InverseCost 76.6 (14/24) · LMArenaCreativeOrOpenEnded 73.4 (12/24) · LMArenaSearchDocument 91.6 (3/19) · LMArenaText 73.4 (12/24) · SWEBenchVerified 68.9 (6/18) · SWEComposite 63.6 (7/24) · SWERebench 89.4 (5/20) · TerminalBench 74.6 (6/22)
sources aistupidlevel · artificial_analysis · lmarena · openrouter · overrides · sonar · swerebench · terminal_bench
missing BUILD/ArtificialAnalysisCoding · BUILD/GDPval · BUILD/LiveCodeBench · BUILD/LongContextRecall · BUILD/MCPAtlas · BUILD/SciCode · BUILD/SonarFunctionalSkill · BUILD/SonarIssueDensity · GEN/ARC_AGI_2 · GEN/ArtificialAnalysisIntelligence · GEN/GPQA_HLE_Reasoning · OPS_long/InverseTTFT · OPS_long/OutputSpeed · OPS_precision/InverseTTFT · OPS_precision/OutputSpeed · OPS_review/InverseTTFT · OPS_review/OutputSpeed · PLAN/ArtificialAnalysisReasoning · PLAN/IFBench · PLAN/LongContextRecall · PLAN/MCPAtlas · PLAN/Tau2Bench · SWEComposite/SWEBenchMultilingual · SWEComposite/SWEBenchPro
glm-4.7 · zai · idea 35.6 / 35.6 · plan 49.8 / 49.8 · build 50.8 / 50.8 · review 55.3 (raw / adjusted; review is never adjusted)

group breakdown

A_B 55.4 (5/24) · A_I 54.0 (5/24) · A_P 46.1 (11/24) · A_R 58.5 (5/24) · BUILD 42.0 (18/24) · CRE 9.2 (22/24) · GEN 35.6 (18/24) · LM_ARENA_REVIEW_PROXY 50.0 (12/24) · OPS_long 87.6 (5/24) · OPS_precision 90.1 (4/24) · OPS_review 91.9 (1/24) · PLAN 53.6 (14/24)

metrics

AI_context_awareness 0.0 (24/24) · AI_hallucination_resistance 100.0 (5/24) · AI_memory_retention 85.5 (8/24) · AI_parameter_accuracy 0.0 (24/24) · AI_plan_coherence 100.0 (4/24) · AI_safety_compliance 0.0 (24/24) · AI_task_completion 0.0 (24/24) · AI_tool_selection 0.0 (24/24) · ArtificialAnalysisCoding 37.9 (14/21) · ArtificialAnalysisIntelligence 42.6 (13/21) · ArtificialAnalysisReasoning 55.8 (12/21) · ContextWindow 74.9 (18/24) · CopilotArenaOrLMArenaCode 68.2 (9/22) · GPQA_HLE_Reasoning 55.8 (12/21) · IFBench 72.2 (7/21) · InverseCost 96.1 (3/24) · InverseTTFT 99.0 (2/19) · LMArenaCreativeOrOpenEnded 9.2 (22/24) · LMArenaText 9.2 (22/24) · LongContextRecall 57.4 (14/21) · MCPAtlas 0.0 (13/13) · OutputSpeed 85.3 (7/19) · SWEComposite 54.1 (13/24) · SWERebench 70.6 (12/20) · SciCode 48.6 (11/21) · SonarFunctionalSkill 31.3 (16/17) · SonarIssueDensity 44.7 (5/17) · Tau2Bench 100.0 (2/21) · TerminalBench 27.0 (17/22)
sources aistupidlevel · artificial_analysis · lmarena · mcp_atlas · openrouter · sonar · swerebench · terminal_bench
missing A_B/AI_code · A_B/AI_complexity · A_B/AI_correctness · A_B/AI_edge_cases · A_B/AI_efficiency · A_B/AI_recovery · A_B/AI_spec · A_B/AI_stability · A_I/AI_code · A_I/AI_complexity · A_I/AI_correctness · A_I/AI_edge_cases · A_I/AI_efficiency · A_I/AI_recovery · A_I/AI_refusal · A_I/AI_spec · A_I/AI_stability · A_P/AI_correctness · A_P/AI_efficiency · A_P/AI_recovery · A_P/AI_spec · A_P/AI_stability · A_R/AI_code · A_R/AI_correctness · A_R/AI_edge_cases · A_R/AI_recovery · A_R/AI_spec · A_R/AI_stability · BUILD/GDPval · BUILD/LiveCodeBench · GEN/ARC_AGI_2 · LM_ARENA_REVIEW_PROXY/LMArenaSearchDocument · SWEComposite/SWEBenchMultilingual · SWEComposite/SWEBenchPro · SWEComposite/SWEBenchVerified
grok-code-fast-1 · xai · idea 53.8 / 53.8 · plan 34.2 / 34.2 · build 50.0 / 50.0 · review 52.1 (raw / adjusted; review is never adjusted)

group breakdown

A_B 68.2 (3/24) · A_I 70.9 (2/24) · A_P 56.1 (3/24) · A_R 81.7 (2/24) · BUILD 34.2 (21/24) · CRE 47.8 (18/24) · GEN 15.8 (19/24) · LM_ARENA_REVIEW_PROXY 50.0 (10/24) · OPS_long 90.2 (4/24) · OPS_precision 89.1 (6/24) · OPS_review 89.7 (4/24) · PLAN 11.4 (24/24)

metrics

AI_code 40.9 (3/22) · AI_complexity 80.5 (3/22) · AI_context_awareness 0.0 (22/24) · AI_correctness 100.0 (4/22) · AI_edge_cases 87.0 (3/22) · AI_efficiency 21.6 (20/22) · AI_hallucination_resistance 100.0 (3/24) · AI_memory_retention 85.5 (6/24) · AI_parameter_accuracy 0.0 (22/24) · AI_plan_coherence 100.0 (2/24) · AI_recovery 100.0 (3/22) · AI_refusal 50.0 (21/22) · AI_safety_compliance 0.0 (22/24) · AI_spec 50.0 (21/22) · AI_stability 63.5 (16/22) · AI_task_completion 0.0 (22/24) · AI_tool_selection 0.0 (22/24) · ARC_AGI_2 25.3 (7/17) · ArtificialAnalysisCoding 0.0 (21/21) · ArtificialAnalysisIntelligence 0.0 (21/21) · ArtificialAnalysisReasoning 0.0 (21/21) · ContextWindow 78.4 (16/24) · CopilotArenaOrLMArenaCode 0.0 (22/22) · GPQA_HLE_Reasoning 0.0 (21/21) · IFBench 0.0 (21/21) · InverseCost 99.3 (2/24) · InverseTTFT 84.8 (7/19) · LMArenaCreativeOrOpenEnded 47.8 (18/24) · LMArenaText 47.8 (18/24) · LongContextRecall 0.0 (21/21) · OutputSpeed 93.8 (4/19) · SWEComposite 45.4 (20/24) · SWERebench 27.0 (18/20) · SciCode 0.0 (21/21) · Tau2Bench 53.3 (13/21) · TerminalBench 0.0 (22/22)
sources aistupidlevel · artificial_analysis · lmarena · openrouter · swerebench · terminal_bench
missing BUILD/GDPval · BUILD/LiveCodeBench · BUILD/MCPAtlas · BUILD/SonarFunctionalSkill · BUILD/SonarIssueDensity · LM_ARENA_REVIEW_PROXY/LMArenaSearchDocument · PLAN/MCPAtlas · SWEComposite/SWEBenchMultilingual · SWEComposite/SWEBenchPro · SWEComposite/SWEBenchVerified
gemini-2.5-flash · google · idea 51.1 / 50.9 · plan 33.4 / 33.0 · build 49.4 / 48.7 · review 52.2 (raw / adjusted; review is never adjusted)

group breakdown

A_B 80.9 (2/24) · A_I 65.0 (3/24) · A_P 51.9 (8/24) · A_R 77.9 (3/24) · BUILD 24.7 (23/24) · CRE 45.5 (19/24) · GEN 15.0 (20/24) · LM_ARENA_REVIEW_PROXY 76.9 (7/24) · OPS_long 94.6 (2/24) · OPS_precision 90.7 (3/24) · OPS_review 89.2 (6/24) · PLAN 13.4 (23/24)

metrics

AI_code 100.0 (1/22) · AI_complexity 100.0 (1/22) · AI_context_awareness 100.0 (2/24) · AI_correctness 100.0 (1/22) · AI_edge_cases 100.0 (1/22) · AI_efficiency 100.0 (2/22) · AI_hallucination_resistance 69.9 (11/24) · AI_memory_retention 31.9 (9/24) · AI_parameter_accuracy 73.3 (14/24) · AI_plan_coherence 0.0 (23/24) · AI_recovery 100.0 (1/22) · AI_refusal 50.0 (9/22) · AI_safety_compliance 100.0 (6/24) · AI_spec 50.0 (9/22) · AI_stability 0.0 (21/22) · AI_task_completion 44.3 (20/24) · AI_tool_selection 27.5 (16/24) · ARC_AGI_2 0.7 (15/17) · ArtificialAnalysisCoding 0.0 (20/21) · ArtificialAnalysisIntelligence 0.8 (19/21) · ArtificialAnalysisReasoning 17.9 (16/21) · ContextWindow 100.0 (3/24) · CopilotArenaOrLMArenaCode 64.8 (13/22) · GDPval 11.8 (10/11) · GPQA_HLE_Reasoning 17.9 (16/21) · IFBench 29.2 (17/21) · InverseCost 94.4 (5/24) · InverseTTFT 75.8 (12/19) · LMArenaCreativeOrOpenEnded 45.5 (19/24) · LMArenaSearchDocument 76.9 (7/19) · LMArenaText 45.5 (19/24) · LiveCodeBench 100.0 (1/2) · LongContextRecall 58.8 (13/21) · MCPAtlas 26.4 (8/13) · OutputSpeed 100.0 (1/19) · SWEBenchPro 33.8 (11/14) · SWEBenchVerified 0.0 (18/18) · SWEComposite 15.0 (24/24) · SWERebench 0.0 (20/20) · SciCode 23.5 (16/21) · Tau2Bench 0.0 (20/21) · TerminalBench 0.0 (21/22)
sources aistupidlevel · arc_agi · artificial_analysis · livecodebench · lmarena · openrouter · swebench · swerebench · terminal_bench
missing BUILD/SonarFunctionalSkill · BUILD/SonarIssueDensity · SWEComposite/SWEBenchMultilingual
deepseek-v4-flash · deepseek · idea 56.8 / 56.8 · plan 59.7 / 59.7 · build 47.5 / 47.5 · review 54.7 (raw / adjusted; review is never adjusted)

group breakdown

A_B 37.8 (19/24) · A_I 49.7 (13/24) · A_P 44.6 (15/24) · A_R 45.5 (20/24) · BUILD 49.8 (13/24) · CRE 58.7 (16/24) · GEN 62.8 (8/24) · LM_ARENA_REVIEW_PROXY 50.0 (8/24) · OPS_long 86.5 (6/24) · OPS_precision 89.6 (5/24) · OPS_review 91.8 (2/24) · PLAN 71.1 (6/24)

metrics

AI_canary_health 78.4 (6/7) · AI_code 0.0 (21/22) · AI_complexity 44.6 (18/22) · AI_context_awareness 0.0 (14/24) · AI_correctness 0.0 (16/22) · AI_edge_cases 56.6 (11/22) · AI_efficiency 100.0 (1/22) · AI_hallucination_resistance 40.0 (17/24) · AI_memory_retention 0.0 (18/24) · AI_parameter_accuracy 91.6 (9/24) · AI_plan_coherence 24.7 (11/24) · AI_recovery 94.3 (11/22) · AI_refusal 50.0 (8/22) · AI_safety_compliance 100.0 (5/24) · AI_spec 50.0 (8/22) · AI_stability 74.1 (14/22) · AI_task_completion 87.0 (5/24) · AI_tool_selection 87.4 (5/24) · ArtificialAnalysisCoding 45.6 (11/21) · ArtificialAnalysisIntelligence 59.3 (10/21) · ArtificialAnalysisReasoning 76.7 (7/21) · ContextWindow 71.6 (22/24) · GPQA_HLE_Reasoning 76.7 (7/21) · IFBench 100.0 (1/21) · InverseCost 100.0 (1/24) · InverseTTFT 98.9 (3/19) · LMArenaCreativeOrOpenEnded 58.7 (16/24) · LMArenaText 58.7 (16/24) · LongContextRecall 52.5 (16/21) · OutputSpeed 83.6 (8/19) · SWEComposite 50.0 (16/24) · SciCode 47.5 (12/21) · Tau2Bench 97.9 (5/21)
sources aistupidlevel · artificial_analysis · lmarena · openrouter
missing BUILD/CopilotArenaOrLMArenaCode · BUILD/GDPval · BUILD/LiveCodeBench · BUILD/MCPAtlas · BUILD/SonarFunctionalSkill · BUILD/SonarIssueDensity · BUILD/TerminalBench · GEN/ARC_AGI_2 · LM_ARENA_REVIEW_PROXY/LMArenaSearchDocument · PLAN/MCPAtlas · PLAN/TerminalBench · SWEComposite/SWEBenchMultilingual · SWEComposite/SWEBenchPro · SWEComposite/SWEBenchVerified · SWEComposite/SWERebench
claude-sonnet-4.5 · anthropic · idea 55.2 / 55.2 · plan 48.3 / 48.3 · build 45.0 / 45.0 · review 41.6 (raw / adjusted; review is never adjusted)

group breakdown

A_B 39.4 (14/24) · A_I 51.0 (11/24) · A_P 55.8 (4/24) · A_R 48.1 (13/24) · BUILD 46.8 (16/24) · CRE 64.0 (15/24) · GEN 42.2 (17/24) · LM_ARENA_REVIEW_PROXY 1.7 (22/24) · OPS_long 79.6 (12/24) · OPS_precision 79.5 (11/24) · OPS_review 78.3 (9/24) · PLAN 42.4 (17/24)

metrics

AI_canary_health 78.1 (7/7) · AI_code 9.8 (20/22) · AI_complexity 44.6 (16/22) · AI_context_awareness 100.0 (1/24) · AI_correctness 0.0 (14/22) · AI_edge_cases 56.6 (9/22) · AI_efficiency 77.7 (3/22) · AI_hallucination_resistance 40.0 (16/24) · AI_memory_retention 0.0 (16/24) · AI_parameter_accuracy 61.3 (19/24) · AI_plan_coherence 30.3 (9/24) · AI_recovery 94.3 (9/22) · AI_refusal 50.0 (6/22) · AI_safety_compliance 73.6 (17/24) · AI_spec 50.0 (6/22) · AI_stability 89.8 (11/22) · AI_task_completion 100.0 (2/24) · AI_tool_selection 83.1 (8/24) · ARC_AGI_2 3.6 (12/17) · ArtificialAnalysisCoding 45.3 (12/21) · ArtificialAnalysisIntelligence 46.0 (12/21) · ArtificialAnalysisReasoning 35.3 (15/21) · ContextWindow 99.3 (10/24) · CopilotArenaOrLMArenaCode 52.3 (16/22) · GDPval 88.3 (2/11) · GPQA_HLE_Reasoning 35.3 (15/21) · IFBench 43.0 (12/21) · InverseCost 74.4 (17/24) · InverseTTFT 76.3 (11/19) · LMArenaCreativeOrOpenEnded 64.0 (15/24) · LMArenaSearchDocument 1.7 (17/19) · LMArenaText 64.0 (15/24) · LongContextRecall 65.7 (11/21) · MCPAtlas 8.0 (11/13) · OutputSpeed 76.5 (16/19) · SWEBenchMultilingual 3.9 (5/6) · SWEBenchPro 54.5 (6/14) · SWEBenchVerified 54.7 (12/18) · SWEComposite 53.5 (14/24) · SWERebench 74.7 (10/20) · SciCode 46.4 (13/21) · SonarFunctionalSkill 53.6 (15/17) · SonarIssueDensity 29.8 (9/17) · Tau2Bench 58.9 (11/21) · TerminalBench 37.3 (15/22)
sources aistupidlevel · arc_agi · artificial_analysis · lmarena · mcp_atlas · openrouter · overrides · sonar · swebench · swebench_pro · swerebench · terminal_bench
missing BUILD/LiveCodeBench
gemini-2.5-pro · google · idea 27.0 / 26.8 · plan 36.1 / 35.7 · build 43.3 / 42.6 · review 35.3 (raw / adjusted; review is never adjusted)

group breakdown

A_B 51.9 (7/24) · A_I 51.4 (7/24) · A_P 55.5 (5/24) · A_R 48.5 (10/24) · BUILD 34.1 (22/24) · CRE 0.0 (24/24) · GEN 14.5 (21/24) · LM_ARENA_REVIEW_PROXY 0.0 (24/24) · OPS_long 82.1 (7/24) · OPS_precision 74.4 (15/24) · OPS_review 71.0 (15/24) · PLAN 21.8 (21/24)

metrics

AI_code 30.3 (5/22) · AI_complexity 67.8 (5/22) · AI_context_awareness 14.6 (6/24) · AI_correctness 92.5 (5/22) · AI_edge_cases 7.5 (18/22) · AI_efficiency 66.8 (7/22) · AI_hallucination_resistance 92.5 (6/24) · AI_memory_retention 92.5 (2/24) · AI_parameter_accuracy 92.5 (5/24) · AI_plan_coherence 79.0 (6/24) · AI_recovery 7.5 (18/22) · AI_refusal 50.0 (10/22) · AI_safety_compliance 92.5 (11/24) · AI_spec 50.0 (10/22) · AI_stability 32.6 (17/22) · AI_task_completion 61.3 (16/24) · AI_tool_selection 10.3 (17/24) · ARC_AGI_2 3.6 (13/17) · ArtificialAnalysisCoding 23.6 (17/21) · ArtificialAnalysisIntelligence 14.1 (17/21) · ArtificialAnalysisReasoning 44.8 (14/21) · ContextWindow 100.0 (4/24) · CopilotArenaOrLMArenaCode 0.0 (21/22) · GPQA_HLE_Reasoning 44.8 (14/21) · IFBench 19.3 (18/21) · InverseCost 80.1 (10/24) · InverseTTFT 43.9 (17/19) · LMArenaCreativeOrOpenEnded 0.0 (24/24) · LMArenaSearchDocument 0.0 (19/19) · LMArenaText 0.0 (24/24) · LongContextRecall 67.2 (10/21) · MCPAtlas 66.8 (5/13) · OutputSpeed 91.4 (6/19) · SWEBenchPro 53.2 (9/14) · SWEBenchVerified 9.8 (17/18) · SWEComposite 27.0 (22/24) · SWERebench 0.5 (19/20) · SciCode 36.1 (15/21) · SonarFunctionalSkill 86.3 (6/17) · SonarIssueDensity 18.7 (11/17) · Tau2Bench 3.5 (19/21) · TerminalBench 1.4 (20/22)
sources arc_agi · artificial_analysis · lmarena · openrouter · swebench · swerebench · terminal_bench
missing BUILD/GDPval · BUILD/LiveCodeBench · SWEComposite/SWEBenchMultilingual
claude-sonnet-4 · anthropic · idea 26.2 / 26.2 · plan 36.8 / 36.8 · build 43.1 / 43.1 · review 49.4 (raw / adjusted; review is never adjusted)

group breakdown

A_B 37.8 (18/24) · A_I 49.8 (12/24) · A_P 45.3 (13/24) · A_R 46.2 (18/24) · BUILD 41.5 (20/24) · CRE 0.0 (23/24) · GEN 14.0 (22/24) · LM_ARENA_REVIEW_PROXY 84.1 (6/24) · OPS_long 80.0 (11/24) · OPS_precision 79.6 (10/24) · OPS_review 78.2 (10/24) · PLAN 33.4 (19/24)

metrics

AI_code15.013 / 22AI_complexity44.615 / 22AI_context_awareness0.013 / 24AI_correctness0.013 / 22AI_edge_cases56.68 / 22AI_efficiency53.212 / 22AI_hallucination_resistance20.020 / 24AI_memory_retention0.015 / 24AI_parameter_accuracy91.98 / 24AI_plan_coherence19.115 / 24AI_recovery94.38 / 22AI_refusal50.05 / 22AI_safety_compliance100.03 / 24AI_spec50.05 / 22AI_stability100.03 / 22AI_task_completion99.73 / 24AI_tool_selection90.04 / 24ARC_AGI_20.116 / 17ArtificialAnalysisCoding30.716 / 21ArtificialAnalysisIntelligence29.715 / 21ArtificialAnalysisReasoning8.619 / 21ContextWindow99.39 / 24CopilotArenaOrLMArenaCode52.018 / 22GDPval82.53 / 11GPQA_HLE_Reasoning8.619 / 21IFBench35.814 / 21InverseCost74.416 / 24InverseTTFT75.413 / 19LMArenaCreativeOrOpenEnded0.023 / 24LMArenaSearchDocument84.16 / 19LMArenaText0.023 / 24LiveCodeBench0.02 / 2LongContextRecall60.812 / 21MCPAtlas14.310 / 13OutputSpeed77.513 / 19SWEBenchPro52.110 / 14SWEBenchVerified39.716 / 18SWEComposite48.518 / 24SWERebench54.514 / 20SciCode20.817 / 21SonarFunctionalSkill59.014 / 17SonarIssueDensity34.07 / 17Tau2Bench27.718 / 21TerminalBench47.513 / 22
sources aistupidlevel · arc_agi · artificial_analysis · livecodebench · lmarena · openrouter · sonar · swebench · swebench_pro · swerebench
missing SWEComposite/SWEBenchMultilingual
claude-opus-4.1 · anthropic · idea 48.1 / 48.1 · plan 42.0 / 42.0 · build 42.4 / 42.4 · review 38.6

group breakdown

A_B  36.8  20 / 24
A_I  46.3  20 / 24
A_P  39.2  20 / 24
A_R  46.5  17 / 24
BUILD  48.4  14 / 24
CRE  52.7  17 / 24
GEN  50.7  14 / 24
LM_ARENA_REVIEW_PROXY  0.0  23 / 24
OPS_long  48.7  22 / 24
OPS_precision  46.2  24 / 24
OPS_review  42.5  24 / 24
PLAN  43.0  16 / 24

metrics

AI_canary_health  79.2  5 / 7
AI_code  9.8  19 / 22
AI_complexity  60.2  8 / 22
AI_context_awareness  0.0  12 / 24
AI_correctness  0.0  9 / 22
AI_edge_cases  56.6  4 / 22
AI_efficiency  41.7  17 / 22
AI_hallucination_resistance  40.0  15 / 24
AI_memory_retention  0.0  13 / 24
AI_parameter_accuracy  71.5  16 / 24
AI_plan_coherence  19.1  14 / 24
AI_recovery  94.3  4 / 22
AI_refusal  50.0  1 / 22
AI_safety_compliance  100.0  1 / 24
AI_spec  50.0  1 / 22
AI_stability  74.1  13 / 22
AI_task_completion  83.4  10 / 24
AI_tool_selection  72.0  11 / 24
ContextWindow  74.7  20 / 24
CopilotArenaOrLMArenaCode  52.1  17 / 22
InverseCost  0.0  24 / 24
LMArenaCreativeOrOpenEnded  52.7  17 / 24
LMArenaSearchDocument  0.0  18 / 19
LMArenaText  52.7  17 / 24
SWEComposite  50.3  15 / 24
SWERebench  51.7  15 / 20
TerminalBench  29.3  16 / 22
sources aistupidlevel · lmarena · openrouter · swerebench · terminal_bench
missing BUILD/ArtificialAnalysisCoding · BUILD/GDPval · BUILD/LiveCodeBench · BUILD/LongContextRecall · BUILD/MCPAtlas · BUILD/SciCode · BUILD/SonarFunctionalSkill · BUILD/SonarIssueDensity · GEN/ARC_AGI_2 · GEN/ArtificialAnalysisIntelligence · GEN/GPQA_HLE_Reasoning · OPS_long/InverseTTFT · OPS_long/OutputSpeed · OPS_precision/InverseTTFT · OPS_precision/OutputSpeed · OPS_review/InverseTTFT · OPS_review/OutputSpeed · PLAN/ArtificialAnalysisReasoning · PLAN/IFBench · PLAN/LongContextRecall · PLAN/MCPAtlas · PLAN/Tau2Bench · SWEComposite/SWEBenchMultilingual · SWEComposite/SWEBenchPro · SWEComposite/SWEBenchVerified
gpt-5.2 · openai · idea 54.1 / 54.1 · plan 48.8 / 48.8 · build 39.3 / 39.3 · review 48.2

group breakdown

A_B  29.8  23 / 24
A_I  38.4  23 / 24
A_P  36.5  22 / 24
A_R  53.9  6 / 24
BUILD  41.7  19 / 24
CRE  67.9  14 / 24
GEN  52.2  13 / 24
LM_ARENA_REVIEW_PROXY  20.6  16 / 24
OPS_long  58.3  19 / 24
OPS_precision  59.8  19 / 24
OPS_review  59.5  20 / 24
PLAN  56.6  11 / 24

metrics

AI_code  0.0  22 / 22
AI_complexity  0.0  22 / 22
AI_context_awareness  0.0  17 / 24
AI_correctness  0.0  19 / 22
AI_edge_cases  56.6  13 / 22
AI_efficiency  0.0  21 / 22
AI_hallucination_resistance  80.0  9 / 24
AI_memory_retention  0.0  21 / 24
AI_parameter_accuracy  85.2  12 / 24
AI_plan_coherence  0.0  24 / 24
AI_recovery  94.3  13 / 22
AI_refusal  50.0  16 / 22
AI_safety_compliance  100.0  8 / 24
AI_spec  50.0  16 / 22
AI_stability  89.8  12 / 22
AI_task_completion  87.0  7 / 24
AI_tool_selection  76.3  9 / 24
ARC_AGI_2  0.0  17 / 17
ArtificialAnalysisCoding  63.4  8 / 21
ArtificialAnalysisIntelligence  59.7  9 / 21
ArtificialAnalysisReasoning  56.4  11 / 21
ContextWindow  85.3  12 / 24
CopilotArenaOrLMArenaCode  37.2  20 / 22
GPQA_HLE_Reasoning  56.4  11 / 21
IFBench  64.7  8 / 21
InverseCost  80.1  11 / 24
LMArenaCreativeOrOpenEnded  67.9  14 / 24
LMArenaSearchDocument  20.6  11 / 19
LMArenaText  67.9  14 / 24
LongContextRecall  53.9  15 / 21
SWEBenchMultilingual  0.0  6 / 6
SWEBenchPro  18.6  13 / 14
SWEBenchVerified  50.5  14 / 18
SWEComposite  28.2  21 / 24
SciCode  54.6  9 / 21
SonarFunctionalSkill  82.8  10 / 17
SonarIssueDensity  33.9  8 / 17
Tau2Bench  50.1  15 / 21
TerminalBench  58.4  9 / 22
sources aistupidlevel · arc_agi · artificial_analysis · lmarena · mcp_atlas · openrouter · sonar · swebench · swebench_pro · terminal_bench
missing BUILD/GDPval · BUILD/LiveCodeBench · BUILD/MCPAtlas · OPS_long/InverseTTFT · OPS_long/OutputSpeed · OPS_precision/InverseTTFT · OPS_precision/OutputSpeed · OPS_review/InverseTTFT · OPS_review/OutputSpeed · PLAN/MCPAtlas · SWEComposite/SWERebench
kimi-k2-0905 · moonshot · idea 23.4 / 23.4 · plan 25.9 / 25.9 · build 37.8 / 37.8 · review 33.3

group breakdown

A_B  29.5  24 / 24
A_I  26.8  24 / 24
A_P  30.3  24 / 24
A_R  20.3  24 / 24
BUILD  42.7  17 / 24
CRE  26.9  20 / 24
GEN  7.9  24 / 24
LM_ARENA_REVIEW_PROXY  50.0  9 / 24
OPS_long  35.4  24 / 24
OPS_precision  53.7  22 / 24
OPS_review  60.2  19 / 24
PLAN  29.6  20 / 24

metrics

AI_canary_health  88.9  1 / 7
AI_code  30.5  4 / 22
AI_complexity  60.2  9 / 22
AI_context_awareness  0.0  15 / 24
AI_correctness  0.0  17 / 22
AI_edge_cases  0.0  22 / 22
AI_efficiency  74.6  4 / 22
AI_hallucination_resistance  60.0  12 / 24
AI_memory_retention  0.0  19 / 24
AI_parameter_accuracy  81.1  13 / 24
AI_plan_coherence  21.9  13 / 24
AI_recovery  0.0  22 / 22
AI_refusal  50.0  14 / 22
AI_safety_compliance  88.9  14 / 24
AI_spec  50.0  14 / 22
AI_stability  0.0  22 / 22
AI_task_completion  87.0  6 / 24
AI_tool_selection  73.8  10 / 24
ArtificialAnalysisCoding  4.2  19 / 21
ArtificialAnalysisIntelligence  0.0  20 / 21
ArtificialAnalysisReasoning  0.0  20 / 21
ContextWindow  53.4  23 / 24
GPQA_HLE_Reasoning  0.0  20 / 21
IFBench  0.0  20 / 21
InverseCost  92.7  7 / 24
InverseTTFT  90.7  5 / 19
LMArenaCreativeOrOpenEnded  26.9  20 / 24
LMArenaText  26.9  20 / 24
LongContextRecall  0.0  20 / 21
OutputSpeed  0.0  19 / 19
SWEComposite  50.0  17 / 24
SciCode  0.0  20 / 21
Tau2Bench  48.0  16 / 21
sources aistupidlevel · artificial_analysis · lmarena · openrouter
missing BUILD/CopilotArenaOrLMArenaCode · BUILD/GDPval · BUILD/LiveCodeBench · BUILD/MCPAtlas · BUILD/SonarFunctionalSkill · BUILD/SonarIssueDensity · BUILD/TerminalBench · GEN/ARC_AGI_2 · LM_ARENA_REVIEW_PROXY/LMArenaSearchDocument · PLAN/MCPAtlas · PLAN/TerminalBench · SWEComposite/SWEBenchMultilingual · SWEComposite/SWEBenchPro · SWEComposite/SWEBenchVerified · SWEComposite/SWERebench
glm-4.6 · zai · idea 36.8 / 36.8 · plan 32.0 / 32.0 · build 36.4 / 36.4 · review 41.5

group breakdown

A_B  55.4  4 / 24
A_I  54.0  4 / 24
A_P  46.1  10 / 24
A_R  58.5  4 / 24
BUILD  18.2  24 / 24
CRE  23.6  21 / 24
GEN  13.4  23 / 24
LM_ARENA_REVIEW_PROXY  50.0  11 / 24
OPS_long  77.0  15 / 24
OPS_precision  83.3  8 / 24
OPS_review  86.0  8 / 24
PLAN  18.0  22 / 24

metrics

AI_context_awareness  0.0  23 / 24
AI_hallucination_resistance  100.0  4 / 24
AI_memory_retention  85.5  7 / 24
AI_parameter_accuracy  0.0  23 / 24
AI_plan_coherence  100.0  3 / 24
AI_safety_compliance  0.0  23 / 24
AI_task_completion  0.0  23 / 24
AI_tool_selection  0.0  23 / 24
ArtificialAnalysisCoding  15.9  18 / 21
ArtificialAnalysisIntelligence  6.1  18 / 21
ArtificialAnalysisReasoning  16.5  17 / 21
ContextWindow  75.0  17 / 24
CopilotArenaOrLMArenaCode  43.0  19 / 22
GPQA_HLE_Reasoning  16.5  17 / 21
IFBench  4.7  19 / 21
InverseCost  95.4  4 / 24
InverseTTFT  98.7  4 / 19
LMArenaCreativeOrOpenEnded  23.6  21 / 24
LMArenaText  23.6  21 / 24
LongContextRecall  9.8  19 / 21
MCPAtlas  7.5  12 / 13
OutputSpeed  66.3  18 / 19
SWEBenchPro  0.0  14 / 14
SWEBenchVerified  48.4  15 / 18
SWEComposite  24.5  23 / 24
SWERebench  37.6  17 / 20
SciCode  12.0  19 / 21
SonarFunctionalSkill  7.5  17 / 17
SonarIssueDensity  7.5  16 / 17
Tau2Bench  41.3  17 / 21
TerminalBench  13.7  18 / 22
sources aistupidlevel · artificial_analysis · lmarena · openrouter · swebench · swebench_pro · swerebench · terminal_bench
missing A_B/AI_code · A_B/AI_complexity · A_B/AI_correctness · A_B/AI_edge_cases · A_B/AI_efficiency · A_B/AI_recovery · A_B/AI_spec · A_B/AI_stability · A_I/AI_code · A_I/AI_complexity · A_I/AI_correctness · A_I/AI_edge_cases · A_I/AI_efficiency · A_I/AI_recovery · A_I/AI_refusal · A_I/AI_spec · A_I/AI_stability · A_P/AI_correctness · A_P/AI_efficiency · A_P/AI_recovery · A_P/AI_spec · A_P/AI_stability · A_R/AI_code · A_R/AI_correctness · A_R/AI_edge_cases · A_R/AI_recovery · A_R/AI_spec · A_R/AI_stability · BUILD/GDPval · BUILD/LiveCodeBench · GEN/ARC_AGI_2 · LM_ARENA_REVIEW_PROXY/LMArenaSearchDocument · SWEComposite/SWEBenchMultilingual