$ipbr-rank · live llm coding-role score
refreshed · 13 sources · updated frequently — models drift and degrade
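[ idea ]
1 · gemini-3.1-pro-preview · raw 90.6 · adjusted 89.9
2 · claude-opus-4.6 · raw 83.1 · adjusted 83.1
3 · claude-opus-4.7 · raw 83.1 · adjusted 83.1
[ plan ]
1 · gemini-3.1-pro-preview · raw 85.4 · adjusted 83.8
2 · gpt-5.5 · raw 82.4 · adjusted 82.4
3 · claude-opus-4.7 · raw 76.5 · adjusted 76.5
[ build ]
1 · gemini-3.1-pro-preview · raw 81.0 · adjusted 78.1
2 · gemini-3-pro · raw 74.5 · adjusted 71.7
3 · claude-opus-4.7 · raw 72.4 · adjusted 72.4
[ review ]
1 · gemini-3.1-pro-preview · raw 84.4 · adjusted 84.4
2 · claude-opus-4.7 · raw 75.5 · adjusted 75.5
3 · kimi-k2.6 · raw 74.7 · adjusted 74.7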
how scoring works

Each model gets four role scores from public benchmarks. Idea measures open-ended creativity. Plan measures structured reasoning, function-calling, and multi-step decomposition. Build measures implementation skill — SWE-bench, LiveCodeBench, terminal tasks. Review measures preference judgment.

raw vs adjusted

The raw score is the benchmark composite, normalized to 0-100. The adjusted score subtracts a reviewer-reservation penalty: when a vendor's models dominate Review, that lead gets discounted from their Idea, Plan, and Build scores so vendors can't game their own preference evaluations.

Penalty coefficients differ by role: Build is penalized hardest (0.32), Plan moderately (0.18), Idea lightly (0.08). Review is never adjusted — it is the source of the penalty.
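A minimal sketch of how the adjustment could be applied, using the coefficients quoted above. The exact formula is not given here, so the sketch assumes `adjusted = raw - coefficient * review_lead`; the function name `adjusted_scores`, the `review_lead` parameter, and the lead value of 10.0 are hypothetical and only for illustration.

```python
# Penalty coefficients quoted in the text. Review has no entry: it is never adjusted.
PENALTY = {"idea": 0.08, "plan": 0.18, "build": 0.32}


def adjusted_scores(raw: dict[str, float], review_lead: float) -> dict[str, float]:
    """Return adjusted role scores for one model.

    ASSUMPTION: the penalty is the role coefficient times the vendor's Review
    lead in points. How the lead itself is measured is not specified in the
    text, so it is passed in as a plain number.
    """
    lead = max(review_lead, 0.0)  # a vendor that does not lead Review is not penalized
    adjusted = {}
    for role, score in raw.items():
        coefficient = PENALTY.get(role, 0.0)  # 0.0 for "review"
        adjusted[role] = round(max(score - coefficient * lead, 0.0), 1)
    return adjusted


# Raw role scores taken from the gemini-3.1-pro-preview row; the lead is made up.
raw = {"idea": 90.6, "plan": 85.4, "build": 81.0, "review": 84.4}
print(adjusted_scores(raw, review_lead=10.0))
```

Under this assumption the ordering of the coefficients does what the text describes: the same Review lead costs Build four times as much as Idea, and Review passes through unchanged.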

missing data

If a model is missing some metrics within a group, the group score depends on coverage, the share of the group's metrics that are present. Between 60% and 80% coverage, the score blends gradually from a shrink-to-50 estimate toward the mean of the present metrics. At 80% coverage and above, the present-weight mean is trusted directly.
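The coverage rule can be sketched roughly as follows. Several details are assumptions, not taken from the source: coverage is measured as the share of metric weight that is present, "shrink-to-50" pulls the mean toward 50 in proportion to the missing coverage, and the blend is a linear ramp between 60% and 80%. The function name `group_score` and the metric names in the example are hypothetical.

```python
def group_score(present: dict[str, float], weights: dict[str, float]) -> float:
    """Blend a group's score according to how much of the group is covered."""
    total_weight = sum(weights.values())
    present_weight = sum(w for name, w in weights.items() if name in present)
    if total_weight <= 0 or present_weight <= 0:
        return 50.0  # nothing to go on: fall back to the neutral midpoint

    coverage = present_weight / total_weight
    # Weighted mean over the metrics that are actually present.
    mean = sum(present[name] * w for name, w in weights.items() if name in present) / present_weight

    # ASSUMPTION: shrink-to-50 moves the mean toward 50 as coverage drops.
    shrunk = 50.0 + coverage * (mean - 50.0)
    # ASSUMPTION: trust ramps linearly from 0 at 60% coverage to 1 at 80%.
    trust = min(max((coverage - 0.60) / 0.20, 0.0), 1.0)
    return trust * mean + (1.0 - trust) * shrunk


weights = {"m1": 1.0, "m2": 1.0, "m3": 1.0, "m4": 1.0, "m5": 1.0}
print(group_score({"m1": 90.0, "m2": 80.0, "m3": 70.0, "m4": 60.0}, weights))  # 80% coverage
print(group_score({"m1": 90.0, "m2": 80.0, "m3": 70.0}, weights))              # 60% coverage
```

With four of five equally weighted metrics present, the sketch returns the plain mean of those four; with three of five, it returns the fully shrunk estimate.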

Full math, role definitions, and source list →

gemini-3.1-pro-preview · google · raw / adjusted: idea 90.6 / 89.9 · plan 85.4 / 83.8 · build 81.0 / 78.1 · review 84.4

group breakdown

A_B 79.0 (7/24)
A_I 73.4 (7/24)
A_P 65.6 (6/24)
A_R 78.9 (7/24)
BUILD 82.5 (3/24)
CRE 100.0 (2/24)
GEN 100.0 (1/24)
LM_ARENA_REVIEW_PROXY 92.3 (4/24)
OPS_long 82.0 (8/24)
OPS_precision 73.5 (16/24)
OPS_review 69.7 (16/24)
PLAN 94.2 (2/24)

metrics

AI_code 91.6 (5/22)
AI_complexity 91.1 (6/22)
AI_context_awareness 14.1 (8/24)
AI_correctness 92.5 (7/22)
AI_edge_cases 92.5 (6/22)
AI_efficiency 87.6 (5/22)
AI_hallucination_resistance 92.5 (8/24)
AI_memory_retention 92.5 (8/24)
AI_parameter_accuracy 92.5 (7/24)
AI_plan_coherence 79.1 (8/24)
AI_recovery 92.5 (7/22)
AI_refusal 50.0 (13/22)
AI_spec 92.5 (20/22)
AI_stability 53.4 (19/22)
AI_task_completion 61.2 (18/24)
AI_tool_selection 10.3 (19/24)
ARC_AGI_2 100.0 (1/17)
ArtificialAnalysisCoding 100.0 (1/21)
ArtificialAnalysisIntelligence 100.0 (2/21)
ArtificialAnalysisReasoning 100.0 (1/21)
ContextWindow 100.0 (6/24)
CopilotArenaOrLMArenaCode 73.4 (7/22)
GDPval 23.8 (9/11)
GPQA_HLE_Reasoning 100.0 (1/21)
IFBench 97.5 (3/21)
InverseCost 77.3 (13/24)
InverseTTFT 40.9 (18/19)
LMArenaCreativeOrOpenEnded 100.0 (2/24)
LMArenaSearchDocument 92.3 (4/19)
LMArenaText 100.0 (2/24)
LongContextRecall 100.0 (2/21)
MCPAtlas 71.1 (6/13)
OutputSpeed 93.0 (5/19)
SWEBenchPro 89.1 (4/14)
SWEBenchVerified 95.0 (4/18)
SWEComposite 93.4 (2/24)
SWERebench 99.8 (2/20)
SciCode 100.0 (2/21)
SonarFunctionalSkill 78.9 (8/17)
SonarIssueDensity 13.2 (13/17)
Tau2Bench 99.3 (4/21)
TerminalBench 89.4 (3/22)
sources: arc_agi, artificial_analysis, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench_pro, swerebench, terminal_bench
missing: BUILD/LiveCodeBench, SWEComposite/SWEBenchMultilingual
gemini-3-pro · google · raw / adjusted: idea 81.6 / 80.9 · plan 61.5 / 59.9 · build 74.5 / 71.7 · review 66.0

group breakdown

A_B 95.4 (1/24)
A_I 86.7 (3/24)
A_P 74.5 (1/24)
A_R 95.3 (3/24)
BUILD 68.5 (10/24)
CRE 94.5 (3/24)
GEN 59.9 (10/24)
LM_ARENA_REVIEW_PROXY 19.9 (18/24)
OPS_long 45.2 (23/24)
OPS_precision 46.6 (23/24)
OPS_review 50.5 (22/24)
PLAN 55.3 (13/24)

metrics

AI_code 98.9 (2/22)
AI_complexity 98.4 (3/22)
AI_context_awareness 7.7 (10/24)
AI_correctness 100.0 (2/22)
AI_edge_cases 100.0 (2/22)
AI_efficiency 94.2 (2/22)
AI_hallucination_resistance 100.0 (1/24)
AI_memory_retention 100.0 (1/24)
AI_parameter_accuracy 100.0 (2/24)
AI_plan_coherence 84.2 (5/24)
AI_recovery 100.0 (2/22)
AI_refusal 50.0 (12/22)
AI_spec 100.0 (9/22)
AI_stability 54.0 (16/22)
AI_task_completion 63.2 (15/24)
AI_tool_selection 3.2 (20/24)
ARC_AGI_2 41.9 (6/17)
ContextWindow 0.0 (24/24)
CopilotArenaOrLMArenaCode 68.4 (10/22)
InverseCost 77.3 (12/24)
LMArenaCreativeOrOpenEnded 94.5 (3/24)
LMArenaSearchDocument 19.9 (13/19)
LMArenaText 94.5 (3/24)
MCPAtlas 74.9 (3/13)
SWEBenchMultilingual 33.5 (4/6)
SWEBenchPro 80.3 (7/14)
SWEBenchVerified 82.9 (13/18)
SWEComposite 74.4 (8/24)
SWERebench 70.6 (13/20)
SonarFunctionalSkill 84.1 (5/17)
SonarIssueDensity 6.7 (16/17)
TerminalBench 61.2 (8/22)
sources: aistupidlevel, arc_agi, artificial_analysis, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swebench_pro, swerebench, terminal_bench
missing: BUILD/ArtificialAnalysisCoding, BUILD/GDPval, BUILD/LiveCodeBench, BUILD/LongContextRecall, BUILD/SciCode, GEN/ArtificialAnalysisIntelligence, GEN/GPQA_HLE_Reasoning, OPS_long/InverseTTFT, OPS_long/OutputSpeed, OPS_precision/InverseTTFT, OPS_precision/OutputSpeed, OPS_review/InverseTTFT, OPS_review/OutputSpeed, PLAN/ArtificialAnalysisReasoning, PLAN/IFBench, PLAN/LongContextRecall, PLAN/Tau2Bench
claude-opus-4.7 · anthropic · raw / adjusted: idea 83.1 / 83.1 · plan 76.5 / 76.5 · build 72.4 / 72.4 · review 75.5

group breakdown

A_B 46.2 (19/24)
A_I 63.5 (9/24)
A_P 60.9 (8/24)
A_R 58.1 (20/24)
BUILD 87.3 (2/24)
CRE 94.5 (4/24)
GEN 96.7 (2/24)
LM_ARENA_REVIEW_PROXY 100.0 (1/24)
OPS_long 74.3 (17/24)
OPS_precision 69.0 (17/24)
OPS_review 65.6 (17/24)
PLAN 79.8 (4/24)

metrics

AI_canary_health 88.2 (3/7)
AI_code 4.7 (15/22)
AI_complexity 6.0 (13/22)
AI_context_awareness 14.1 (5/24)
AI_correctness 70.6 (11/22)
AI_edge_cases 49.0 (12/22)
AI_efficiency 68.2 (11/22)
AI_hallucination_resistance 0.0 (23/24)
AI_memory_retention 35.4 (10/24)
AI_parameter_accuracy 35.2 (20/24)
AI_plan_coherence 22.5 (13/24)
AI_recovery 91.9 (11/22)
AI_refusal 50.0 (4/22)
AI_spec 100.0 (4/22)
AI_stability 79.7 (5/22)
AI_task_completion 100.0 (1/24)
AI_tool_selection 56.9 (13/24)
ARC_AGI_2 92.7 (3/17)
ArtificialAnalysisCoding 90.3 (3/21)
ArtificialAnalysisIntelligence 100.0 (1/21)
ArtificialAnalysisReasoning 95.6 (3/21)
ContextWindow 99.3 (8/24)
CopilotArenaOrLMArenaCode 100.0 (1/22)
GDPval 95.0 (1/11)
GPQA_HLE_Reasoning 95.6 (3/21)
IFBench 46.6 (10/21)
InverseCost 61.9 (22/24)
InverseTTFT 49.1 (16/19)
LMArenaCreativeOrOpenEnded 94.5 (4/24)
LMArenaSearchDocument 100.0 (1/19)
LMArenaText 94.5 (4/24)
LongContextRecall 88.2 (6/21)
OutputSpeed 78.8 (12/19)
SWEBenchPro 95.0 (2/14)
SWEBenchVerified 95.0 (3/18)
SWEComposite 92.9 (3/24)
SWERebench 85.3 (6/20)
SciCode 100.0 (1/21)
SonarFunctionalSkill 93.9 (2/17)
SonarIssueDensity 0.0 (17/17)
Tau2Bench 83.1 (9/21)
TerminalBench 78.2 (4/22)
sources: aistupidlevel, arc_agi, artificial_analysis, lmarena, mcp_atlas, openrouter, overrides, sonar
missing: BUILD/LiveCodeBench, BUILD/MCPAtlas, PLAN/MCPAtlas, SWEComposite/SWEBenchMultilingual
claude-opus-4.6 · anthropic · raw / adjusted: idea 83.1 / 83.1 · plan 71.3 / 71.3 · build 71.5 / 71.5 · review 66.0

group breakdown

A_B 42.7 (22/24)
A_I 60.8 (14/24)
A_P 57.6 (12/24)
A_R 57.6 (21/24)
BUILD 87.4 (1/24)
CRE 100.0 (1/24)
GEN 89.5 (4/24)
LM_ARENA_REVIEW_PROXY 33.1 (13/24)
OPS_long 78.0 (13/24)
OPS_precision 76.8 (13/24)
OPS_review 74.7 (13/24)
PLAN 72.7 (5/24)

metrics

AI_canary_health 83.3 (4/7)
AI_code 0.0 (22/22)
AI_complexity 6.0 (12/22)
AI_context_awareness 9.0 (9/24)
AI_correctness 70.6 (10/22)
AI_edge_cases 49.0 (11/22)
AI_efficiency 55.4 (20/22)
AI_hallucination_resistance 0.0 (22/24)
AI_memory_retention 0.0 (14/24)
AI_parameter_accuracy 71.8 (15/24)
AI_plan_coherence 3.2 (20/24)
AI_recovery 91.9 (10/22)
AI_refusal 50.0 (3/22)
AI_spec 100.0 (3/22)
AI_stability 79.7 (4/22)
AI_task_completion 92.9 (4/24)
AI_tool_selection 100.0 (1/24)
ARC_AGI_2 90.9 (4/17)
ArtificialAnalysisCoding 76.1 (5/21)
ArtificialAnalysisIntelligence 84.0 (5/21)
ArtificialAnalysisReasoning 86.3 (5/21)
ContextWindow 99.3 (7/24)
CopilotArenaOrLMArenaCode 99.8 (2/22)
GDPval 78.0 (5/11)
GPQA_HLE_Reasoning 86.3 (5/21)
IFBench 31.4 (16/21)
InverseCost 61.9 (21/24)
InverseTTFT 73.6 (15/19)
LMArenaCreativeOrOpenEnded 100.0 (1/24)
LMArenaSearchDocument 33.1 (8/19)
LMArenaText 100.0 (1/24)
LongContextRecall 90.2 (4/21)
OutputSpeed 76.6 (15/19)
SWEBenchMultilingual 90.9 (2/6)
SWEBenchPro 100.0 (1/14)
SWEBenchVerified 99.7 (2/18)
SWEComposite 97.3 (1/24)
SWERebench 91.6 (4/20)
SciCode 85.8 (5/21)
SonarFunctionalSkill 92.2 (3/17)
SonarIssueDensity 46.8 (6/17)
Tau2Bench 91.2 (6/21)
TerminalBench 64.2 (7/22)
sources: aistupidlevel, arc_agi, artificial_analysis, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swebench_pro, swerebench, terminal_bench
missing: BUILD/LiveCodeBench, BUILD/MCPAtlas, PLAN/MCPAtlas
claude-opus-4.5 · anthropic · raw / adjusted: idea 67.8 / 67.8 · plan 65.1 / 65.1 · build 69.7 / 69.7 · review 61.2

group breakdown

A_B 47.6 (15/24)
A_I 59.0 (18/24)
A_P 60.4 (11/24)
A_R 59.9 (16/24)
BUILD 81.7 (4/24)
CRE 73.4 (11/24)
GEN 70.4 (6/24)
LM_ARENA_REVIEW_PROXY 11.2 (21/24)
OPS_long 76.4 (16/24)
OPS_precision 74.6 (14/24)
OPS_review 73.7 (14/24)
PLAN 65.7 (8/24)

metrics

AI_canary_health 88.5 (2/7)
AI_code 19.0 (8/22)
AI_complexity 6.0 (11/22)
AI_context_awareness 50.8 (3/24)
AI_correctness 70.6 (9/22)
AI_edge_cases 49.0 (10/22)
AI_efficiency 68.5 (10/22)
AI_hallucination_resistance 20.0 (19/24)
AI_memory_retention 11.8 (11/24)
AI_parameter_accuracy 100.0 (1/24)
AI_plan_coherence 11.5 (17/24)
AI_recovery 91.9 (9/22)
AI_refusal 50.0 (2/22)
AI_spec 100.0 (2/22)
AI_stability 49.4 (20/22)
AI_task_completion 65.1 (14/24)
AI_tool_selection 99.6 (2/24)
ArtificialAnalysisCoding 75.1 (6/21)
ArtificialAnalysisIntelligence 71.5 (7/21)
ArtificialAnalysisReasoning 63.7 (9/21)
ContextWindow 74.7 (21/24)
CopilotArenaOrLMArenaCode 76.8 (6/22)
GDPval 73.8 (6/11)
GPQA_HLE_Reasoning 63.7 (9/21)
IFBench 44.9 (11/21)
InverseCost 61.9 (20/24)
InverseTTFT 74.5 (14/19)
LMArenaCreativeOrOpenEnded 73.4 (11/24)
LMArenaSearchDocument 11.2 (16/19)
LMArenaText 73.4 (11/24)
LongContextRecall 100.0 (1/21)
OutputSpeed 80.3 (11/19)
SWEBenchPro 88.4 (5/14)
SWEBenchVerified 92.2 (8/18)
SWEComposite 87.0 (5/24)
SWERebench 76.5 (8/20)
SciCode 72.7 (7/21)
SonarFunctionalSkill 100.0 (1/17)
SonarIssueDensity 77.2 (3/17)
Tau2Bench 85.2 (8/21)
TerminalBench 54.8 (11/22)
sources: aistupidlevel, artificial_analysis, lmarena, mcp_atlas, openrouter, sonar, swebench_pro, swerebench, terminal_bench
missing: BUILD/LiveCodeBench, BUILD/MCPAtlas, GEN/ARC_AGI_2, PLAN/MCPAtlas, SWEComposite/SWEBenchMultilingual
gemini-3-flash · google · raw / adjusted: idea 78.2 / 77.5 · plan 66.3 / 64.7 · build 67.6 / 64.8 · review 64.1

group breakdown

A_B 79.0 (6/24)
A_I 73.4 (6/24)
A_P 65.6 (5/24)
A_R 78.9 (6/24)
BUILD 58.9 (12/24)
CRE 85.8 (6/24)
GEN 61.6 (9/24)
LM_ARENA_REVIEW_PROXY 20.0 (17/24)
OPS_long 94.9 (1/24)
OPS_precision 91.6 (1/24)
OPS_review 90.2 (3/24)
PLAN 64.6 (9/24)

metrics

AI_code 91.6 (4/22)
AI_complexity 91.1 (5/22)
AI_context_awareness 14.1 (7/24)
AI_correctness 92.5 (6/22)
AI_edge_cases 92.5 (5/22)
AI_efficiency 87.6 (4/22)
AI_hallucination_resistance 92.5 (7/24)
AI_memory_retention 92.5 (7/24)
AI_parameter_accuracy 92.5 (6/24)
AI_plan_coherence 79.1 (7/24)
AI_recovery 92.5 (6/22)
AI_refusal 50.0 (11/22)
AI_spec 92.5 (19/22)
AI_stability 53.4 (18/22)
AI_task_completion 61.2 (17/24)
AI_tool_selection 10.3 (18/24)
ARC_AGI_2 3.1 (14/17)
ArtificialAnalysisCoding 58.3 (9/21)
ArtificialAnalysisIntelligence 58.9 (11/21)
ArtificialAnalysisReasoning 82.7 (6/21)
ContextWindow 100.0 (5/24)
CopilotArenaOrLMArenaCode 68.0 (12/22)
GDPval 5.0 (11/11)
GPQA_HLE_Reasoning 82.7 (6/21)
IFBench 100.0 (2/21)
InverseCost 91.5 (8/24)
InverseTTFT 80.2 (9/19)
LMArenaCreativeOrOpenEnded 85.8 (6/24)
LMArenaSearchDocument 20.0 (12/19)
LMArenaText 85.8 (6/24)
LongContextRecall 68.6 (9/21)
MCPAtlas 22.4 (9/13)
OutputSpeed 99.4 (2/19)
SWEBenchMultilingual 100.0 (1/6)
SWEBenchPro 53.0 (11/14)
SWEBenchVerified 100.0 (1/18)
SWEComposite 76.5 (7/24)
SWERebench 76.3 (9/20)
SciCode 78.7 (6/21)
SonarFunctionalSkill 78.9 (7/17)
SonarIssueDensity 13.2 (12/17)
Tau2Bench 64.2 (10/21)
TerminalBench 48.3 (12/22)
sources: arc_agi, artificial_analysis, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swebench_pro, swerebench, terminal_bench
missing: BUILD/LiveCodeBench
gpt-5.5 · openai · raw / adjusted: idea 78.1 / 78.1 · plan 82.4 / 82.4 · build 66.8 / 66.8 · review 72.0

group breakdown

A_B 49.1 (12/24)
A_I 61.9 (13/24)
A_P 57.4 (14/24)
A_R 68.3 (11/24)
BUILD 72.8 (7/24)
CRE 81.7 (7/24)
GEN 94.1 (3/24)
LM_ARENA_REVIEW_PROXY 28.2 (14/24)
OPS_long 81.7 (10/24)
OPS_precision 80.0 (9/24)
OPS_review 77.5 (11/24)
PLAN 95.4 (1/24)

metrics

AI_code 4.7 (20/22)
AI_complexity 6.0 (9/22)
AI_context_awareness 0.0 (20/24)
AI_correctness 70.6 (20/22)
AI_edge_cases 49.0 (21/22)
AI_efficiency 62.5 (16/22)
AI_hallucination_resistance 60.0 (14/24)
AI_memory_retention 0.0 (24/24)
AI_parameter_accuracy 95.8 (3/24)
AI_plan_coherence 8.7 (19/24)
AI_recovery 91.9 (20/22)
AI_refusal 50.0 (19/22)
AI_spec 100.0 (15/22)
AI_stability 79.7 (10/22)
AI_task_completion 86.7 (9/24)
AI_tool_selection 83.8 (6/24)
ARC_AGI_2 96.7 (2/17)
ArtificialAnalysisCoding 100.0 (2/21)
ArtificialAnalysisIntelligence 98.1 (3/21)
ArtificialAnalysisReasoning 100.0 (2/21)
ContextWindow 100.0 (2/24)
CopilotArenaOrLMArenaCode 71.9 (8/22)
GPQA_HLE_Reasoning 100.0 (2/21)
IFBench 80.7 (6/21)
InverseCost 50.6 (23/24)
InverseTTFT 81.7 (8/19)
LMArenaCreativeOrOpenEnded 81.7 (7/24)
LMArenaSearchDocument 28.2 (9/19)
LMArenaText 81.7 (7/24)
LongContextRecall 98.0 (3/21)
OutputSpeed 82.4 (9/19)
SWEBenchVerified 95.0 (7/18)
SWEComposite 63.5 (14/24)
SciCode 94.5 (4/21)
SonarFunctionalSkill 46.5 (13/17)
SonarIssueDensity 52.7 (4/17)
Tau2Bench 90.5 (7/21)
TerminalBench 100.0 (1/22)
sources: aistupidlevel, arc_agi, artificial_analysis, lmarena, openrouter, overrides, sonar, terminal_bench
missing: BUILD/GDPval, BUILD/LiveCodeBench, BUILD/MCPAtlas, PLAN/MCPAtlas, SWEComposite/SWEBenchMultilingual, SWEComposite/SWEBenchPro, SWEComposite/SWERebench
gpt-5.4 · openai · raw / adjusted: idea 67.4 / 67.4 · plan 54.1 / 54.1 · build 66.1 / 66.1 · review 60.4

group breakdown

A_B 48.7 (13/24)
A_I 60.6 (15/24)
A_P 55.6 (17/24)
A_R 67.5 (12/24)
BUILD 72.0 (8/24)
CRE 77.3 (9/24)
GEN 45.0 (16/24)
LM_ARENA_REVIEW_PROXY 17.6 (20/24)
OPS_long 92.8 (3/24)
OPS_precision 91.2 (2/24)
OPS_review 89.7 (5/24)
PLAN 50.9 (15/24)

metrics

AI_code 4.7 (19/22)
AI_complexity 6.0 (21/22)
AI_context_awareness 0.0 (19/24)
AI_correctness 70.6 (19/22)
AI_edge_cases 49.0 (20/22)
AI_efficiency 64.0 (15/22)
AI_hallucination_resistance 60.0 (13/24)
AI_memory_retention 0.0 (23/24)
AI_parameter_accuracy 91.0 (10/24)
AI_plan_coherence 3.2 (22/24)
AI_recovery 91.9 (19/22)
AI_refusal 50.0 (18/22)
AI_spec 100.0 (14/22)
AI_stability 72.3 (12/22)
AI_task_completion 86.7 (8/24)
AI_tool_selection 82.6 (7/24)
ARC_AGI_2 75.8 (5/17)
ArtificialAnalysisCoding 33.7 (15/21)
ArtificialAnalysisIntelligence 27.4 (16/21)
ArtificialAnalysisReasoning 15.5 (18/21)
ContextWindow 100.0 (1/24)
CopilotArenaOrLMArenaCode 68.0 (11/22)
GPQA_HLE_Reasoning 15.5 (18/21)
IFBench 62.5 (9/21)
InverseCost 75.0 (15/24)
InverseTTFT 90.4 (6/19)
LMArenaCreativeOrOpenEnded 77.3 (9/24)
LMArenaSearchDocument 17.6 (15/19)
LMArenaText 77.3 (9/24)
LongContextRecall 24.5 (18/21)
MCPAtlas 72.8 (4/13)
OutputSpeed 95.0 (3/19)
SWEBenchPro 92.5 (3/14)
SWEBenchVerified 95.0 (6/18)
SWEComposite 91.3 (4/24)
SWERebench 83.5 (7/20)
SciCode 12.0 (18/21)
SonarFunctionalSkill 66.8 (11/17)
SonarIssueDensity 6.8 (15/17)
Tau2Bench 0.0 (21/21)
TerminalBench 100.0 (2/22)
sources: aistupidlevel, arc_agi, artificial_analysis, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench_pro, terminal_bench
missing: BUILD/GDPval, BUILD/LiveCodeBench, SWEComposite/SWEBenchMultilingual
kimi-k2.6 · moonshot · raw / adjusted: idea 69.7 / 69.7 · plan 70.6 / 70.6 · build 65.6 / 65.6 · review 74.7

group breakdown

A_B 45.2 (21/24)
A_I 59.3 (17/24)
A_P 51.2 (20/24)
A_R 60.1 (15/24)
BUILD 75.8 (6/24)
CRE 77.4 (8/24)
GEN 73.7 (5/24)
LM_ARENA_REVIEW_PROXY 94.8 (2/24)
OPS_long 58.0 (20/24)
OPS_precision 59.8 (18/24)
OPS_review 60.3 (18/24)
PLAN 86.6 (3/24)

metrics

AI_code 9.5 (13/22)
AI_complexity 6.0 (18/22)
AI_context_awareness 0.0 (16/24)
AI_correctness 70.6 (16/22)
AI_edge_cases 49.0 (17/22)
AI_efficiency 59.7 (17/22)
AI_hallucination_resistance 20.0 (21/24)
AI_memory_retention 0.0 (20/24)
AI_parameter_accuracy 66.7 (17/24)
AI_plan_coherence 8.7 (18/24)
AI_recovery 91.9 (16/22)
AI_refusal 50.0 (15/22)
AI_spec 100.0 (11/22)
AI_stability 60.9 (13/22)
AI_task_completion 72.3 (11/24)
AI_tool_selection 52.2 (14/24)
ARC_AGI_2 11.9 (9/17)
ArtificialAnalysisCoding 72.8 (7/21)
ArtificialAnalysisIntelligence 87.5 (4/21)
ArtificialAnalysisReasoning 87.6 (4/21)
ContextWindow 78.4 (14/24)
CopilotArenaOrLMArenaCode 94.4 (4/22)
GDPval 54.7 (8/11)
GPQA_HLE_Reasoning 87.6 (4/21)
IFBench 94.5 (4/21)
InverseCost 87.1 (9/24)
LMArenaCreativeOrOpenEnded 77.4 (8/24)
LMArenaSearchDocument 94.8 (2/19)
LMArenaText 77.4 (8/24)
LongContextRecall 85.3 (7/21)
MCPAtlas 92.5 (2/13)
SWEBenchVerified 95.0 (5/18)
SWEComposite 68.1 (13/24)
SWERebench 73.1 (11/20)
SciCode 94.5 (3/21)
SonarFunctionalSkill 66.8 (12/17)
SonarIssueDensity 92.5 (2/17)
Tau2Bench 100.0 (1/21)
TerminalBench 74.6 (5/22)
sources: aistupidlevel, artificial_analysis, lmarena, openrouter, overrides
missing: BUILD/LiveCodeBench, OPS_long/InverseTTFT, OPS_long/OutputSpeed, OPS_precision/InverseTTFT, OPS_precision/OutputSpeed, OPS_review/InverseTTFT, OPS_review/OutputSpeed, SWEComposite/SWEBenchMultilingual, SWEComposite/SWEBenchPro
claude-sonnet-4.6 · anthropic · raw / adjusted: idea 68.4 / 68.4 · plan 59.9 / 59.9 · build 65.4 / 65.4 · review 57.9

group breakdown

A_B 47.7 (14/24)
A_I 64.1 (8/24)
A_P 60.6 (10/24)
A_R 59.5 (17/24)
BUILD 76.0 (5/24)
CRE 73.0 (13/24)
GEN 65.4 (7/24)
LM_ARENA_REVIEW_PROXY 23.3 (15/24)
OPS_long 66.7 (18/24)
OPS_precision 54.3 (21/24)
OPS_review 49.0 (23/24)
PLAN 56.9 (11/24)

metrics

AI_code 19.0 (9/22)
AI_complexity 6.0 (16/22)
AI_context_awareness 14.5 (4/24)
AI_correctness 70.6 (14/22)
AI_edge_cases 49.0 (15/22)
AI_efficiency 65.4 (14/22)
AI_hallucination_resistance 0.0 (24/24)
AI_memory_retention 0.0 (17/24)
AI_parameter_accuracy 94.4 (4/24)
AI_plan_coherence 25.3 (11/24)
AI_recovery 91.9 (14/22)
AI_refusal 50.0 (7/22)
AI_spec 100.0 (6/22)
AI_stability 79.7 (8/22)
AI_task_completion 67.7 (13/24)
AI_tool_selection 93.6 (3/24)
ARC_AGI_2 10.6 (10/17)
ArtificialAnalysisCoding 85.1 (4/21)
ArtificialAnalysisIntelligence 79.1 (6/21)
ArtificialAnalysisReasoning 68.7 (8/21)
ContextWindow 99.3 (11/24)
CopilotArenaOrLMArenaCode 93.2 (5/22)
GDPval 82.5 (4/11)
GPQA_HLE_Reasoning 68.7 (8/21)
IFBench 41.0 (13/21)
InverseCost 74.4 (18/24)
InverseTTFT 0.0 (19/19)
LMArenaCreativeOrOpenEnded 73.0 (13/24)
LMArenaSearchDocument 23.3 (10/19)
LMArenaText 73.0 (13/24)
LongContextRecall 90.2 (5/21)
MCPAtlas 69.8 (7/13)
OutputSpeed 80.7 (10/19)
SWEBenchPro 76.5 (9/14)
SWEBenchVerified 90.3 (10/18)
SWEComposite 85.4 (6/24)
SWERebench 95.7 (3/20)
SciCode 57.9 (8/21)
SonarFunctionalSkill 84.5 (4/17)
SonarIssueDensity 22.3 (10/17)
Tau2Bench 53.3 (12/21)
TerminalBench 47.4 (14/22)
sources: aistupidlevel, arc_agi, artificial_analysis, lmarena, mcp_atlas, openrouter, sonar, swerebench
missing: BUILD/LiveCodeBench, SWEComposite/SWEBenchMultilingual
glm-5.1 · zai · raw / adjusted: idea 71.4 / 71.4 · plan 62.2 / 62.2 · build 64.9 / 64.9 · review 69.7

group breakdown

A_B 47.0 (16/24)
A_I 55.9 (20/24)
A_P 50.8 (21/24)
A_R 56.4 (22/24)
BUILD 70.9 (9/24)
CRE 86.3 (5/24)
GEN 57.5 (11/24)
LM_ARENA_REVIEW_PROXY 88.0 (5/24)
OPS_long 81.9 (9/24)
OPS_precision 86.5 (7/24)
OPS_review 88.7 (7/24)
PLAN 69.5 (7/24)

metrics

AI_code 15.6 (10/22)
AI_complexity 12.6 (8/22)
AI_context_awareness 7.5 (11/24)
AI_correctness 67.5 (21/22)
AI_edge_cases 49.1 (8/22)
AI_efficiency 58.2 (18/22)
AI_hallucination_resistance 24.5 (18/24)
AI_memory_retention 7.5 (12/24)
AI_parameter_accuracy 64.2 (18/24)
AI_plan_coherence 14.9 (16/24)
AI_recovery 85.6 (21/22)
AI_refusal 50.0 (22/22)
AI_spec 92.5 (21/22)
AI_stability 59.2 (15/22)
AI_task_completion 68.9 (12/24)
AI_tool_selection 51.9 (15/24)
ARC_AGI_2 5.2 (11/17)
ArtificialAnalysisCoding 39.5 (13/21)
ArtificialAnalysisIntelligence 60.5 (8/21)
ArtificialAnalysisReasoning 54.0 (13/21)
ContextWindow 74.9 (19/24)
CopilotArenaOrLMArenaCode 95.9 (3/22)
GDPval 63.0 (7/11)
GPQA_HLE_Reasoning 54.0 (13/21)
IFBench 86.8 (5/21)
InverseCost 93.0 (6/24)
InverseTTFT 100.0 (1/19)
LMArenaCreativeOrOpenEnded 86.3 (5/24)
LMArenaSearchDocument 88.0 (5/19)
LMArenaText 86.3 (5/24)
LongContextRecall 41.2 (17/21)
MCPAtlas 100.0 (1/13)
OutputSpeed 75.2 (17/19)
SWEBenchMultilingual 50.9 (3/6)
SWEBenchVerified 91.9 (9/18)
SWEComposite 72.6 (10/24)
SWERebench 100.0 (1/20)
SciCode 40.4 (14/21)
SonarFunctionalSkill 69.8 (9/17)
SonarIssueDensity 100.0 (1/17)
Tau2Bench 100.0 (3/21)
TerminalBench 55.8 (10/22)
sources: arc_agi, artificial_analysis, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swerebench, terminal_bench
missing: BUILD/LiveCodeBench, SWEComposite/SWEBenchPro
grok-4-latest · xai · raw / adjusted: idea 74.4 / 74.4 · plan 52.4 / 52.4 · build 62.8 / 62.8 · review 59.8

group breakdown

A_B 92.1 (3/24)
A_I 87.1 (2/24)
A_P 66.0 (3/24)
A_R 100.0 (1/24)
BUILD 47.1 (18/24)
CRE 76.1 (10/24)
GEN 49.3 (15/24)
LM_ARENA_REVIEW_PROXY 19.2 (19/24)
OPS_long 77.5 (14/24)
OPS_precision 77.5 (12/24)
OPS_review 77.3 (12/24)
PLAN 38.1 (18/24)

metrics

AI_code 100.0 (1/22)
AI_complexity 100.0 (1/22)
AI_context_awareness 0.0 (21/24)
AI_correctness 100.0 (3/22)
AI_edge_cases 100.0 (3/22)
AI_efficiency 1.1 (21/22)
AI_hallucination_resistance 100.0 (2/24)
AI_memory_retention 99.2 (2/24)
AI_parameter_accuracy 0.0 (21/24)
AI_plan_coherence 100.0 (1/24)
AI_recovery 100.0 (3/22)
AI_refusal 50.0 (20/22)
AI_spec 100.0 (16/22)
AI_stability 100.0 (1/22)
AI_task_completion 0.0 (21/24)
AI_tool_selection 0.0 (21/24)
ARC_AGI_2 20.7 (8/17)
ArtificialAnalysisCoding 51.5 (10/21)
ArtificialAnalysisIntelligence 40.3 (14/21)
ArtificialAnalysisReasoning 57.0 (10/21)
ContextWindow 78.4 (15/24)
CopilotArenaOrLMArenaCode 58.0 (15/22)
GPQA_HLE_Reasoning 57.0 (10/21)
IFBench 33.1 (15/21)
InverseCost 74.4 (19/24)
InverseTTFT 78.5 (10/19)
LMArenaCreativeOrOpenEnded 76.1 (10/24)
LMArenaSearchDocument 19.2 (14/19)
LMArenaText 76.1 (10/24)
LongContextRecall 77.0 (8/21)
OutputSpeed 77.4 (14/19)
SWEComposite 47.8 (20/24)
SWERebench 39.1 (16/20)
SciCode 51.9 (10/21)
Tau2Bench 51.5 (14/21)
TerminalBench 11.8 (19/22)
sources: aistupidlevel, arc_agi, artificial_analysis, lmarena, openrouter, swerebench, terminal_bench
missing: BUILD/GDPval, BUILD/LiveCodeBench, BUILD/MCPAtlas, BUILD/SonarFunctionalSkill, BUILD/SonarIssueDensity, PLAN/MCPAtlas, SWEComposite/SWEBenchMultilingual, SWEComposite/SWEBenchPro, SWEComposite/SWEBenchVerified
gpt-5.3-codex · openai · raw / adjusted: idea 65.3 / 65.3 · plan 55.9 / 55.9 · build 57.4 / 57.4 · review 67.0

group breakdown

A_B 51.2 (10/24)
A_I 62.0 (12/24)
A_P 52.2 (19/24)
A_R 71.7 (9/24)
BUILD 60.3 (11/24)
CRE 73.2 (12/24)
GEN 55.8 (12/24)
LM_ARENA_REVIEW_PROXY 92.5 (3/24)
OPS_long 58.0 (21/24)
OPS_precision 59.3 (20/24)
OPS_review 58.8 (21/24)
PLAN 58.3 (10/24)

metrics

AI_code 4.7 (18/22)
AI_complexity 6.0 (20/22)
AI_context_awareness 0.0 (18/24)
AI_correctness 70.6 (18/22)
AI_edge_cases 49.0 (19/22)
AI_efficiency 68.2 (12/22)
AI_hallucination_resistance 80.0 (11/24)
AI_memory_retention 0.0 (22/24)
AI_parameter_accuracy 85.9 (11/24)
AI_plan_coherence 3.2 (21/24)
AI_recovery 91.9 (18/22)
AI_refusal 50.0 (17/22)
AI_spec 100.0 (13/22)
AI_stability 79.7 (9/22)
AI_task_completion 57.8 (19/24)
AI_tool_selection 65.5 (12/24)
ContextWindow 85.3 (13/24)
CopilotArenaOrLMArenaCode 59.3 (14/22)
InverseCost 76.6 (14/24)
LMArenaCreativeOrOpenEnded 73.2 (12/24)
LMArenaSearchDocument 92.5 (3/19)
LMArenaText 73.2 (12/24)
SWEBenchVerified 88.2 (11/18)
SWEComposite 69.4 (12/24)
SWERebench 89.5 (5/20)
TerminalBench 74.3 (6/22)
sources: aistupidlevel, artificial_analysis, lmarena, openrouter, overrides, sonar, swerebench, terminal_bench
missing: BUILD/ArtificialAnalysisCoding, BUILD/GDPval, BUILD/LiveCodeBench, BUILD/LongContextRecall, BUILD/MCPAtlas, BUILD/SciCode, BUILD/SonarFunctionalSkill, BUILD/SonarIssueDensity, GEN/ARC_AGI_2, GEN/ArtificialAnalysisIntelligence, GEN/GPQA_HLE_Reasoning, OPS_long/InverseTTFT, OPS_long/OutputSpeed, OPS_precision/InverseTTFT, OPS_precision/OutputSpeed, OPS_review/InverseTTFT, OPS_review/OutputSpeed, PLAN/ArtificialAnalysisReasoning, PLAN/IFBench, PLAN/LongContextRecall, PLAN/MCPAtlas, PLAN/Tau2Bench, SWEComposite/SWEBenchMultilingual, SWEComposite/SWEBenchPro
gemini-2.5-pro · google · raw / adjusted: idea 31.3 / 30.6 · plan 37.5 / 35.9 · build 55.9 / 53.0 · review 45.7

group breakdown

A_B 79.0 (5/24)
A_I 73.4 (5/24)
A_P 65.6 (4/24)
A_R 78.9 (5/24)
BUILD 43.0 (19/24)
CRE 0.0 (24/24)
GEN 14.5 (21/24)
LM_ARENA_REVIEW_PROXY 0.0 (24/24)
OPS_long 82.1 (7/24)
OPS_precision 74.4 (15/24)
OPS_review 71.0 (15/24)
PLAN 22.2 (21/24)

metrics

AI_code 91.6 (3/22)
AI_complexity 91.1 (4/22)
AI_context_awareness 14.1 (6/24)
AI_correctness 92.5 (5/22)
AI_edge_cases 92.5 (4/22)
AI_efficiency 87.6 (3/22)
AI_hallucination_resistance 92.5 (6/24)
AI_memory_retention 92.5 (6/24)
AI_parameter_accuracy 92.5 (5/24)
AI_plan_coherence 79.1 (6/24)
AI_recovery 92.5 (5/22)
AI_refusal 50.0 (10/22)
AI_spec 92.5 (18/22)
AI_stability 53.4 (17/22)
AI_task_completion 61.2 (16/24)
AI_tool_selection 10.3 (17/24)
ARC_AGI_2 3.7 (13/17)
ArtificialAnalysisCoding 23.6 (17/21)
ArtificialAnalysisIntelligence 14.1 (17/21)
ArtificialAnalysisReasoning 44.8 (14/21)
ContextWindow 100.0 (4/24)
CopilotArenaOrLMArenaCode 0.9 (21/22)
GPQA_HLE_Reasoning 44.8 (14/21)
IFBench 19.3 (18/21)
InverseCost 80.1 (10/24)
InverseTTFT 43.9 (17/19)
LMArenaCreativeOrOpenEnded 0.0 (24/24)
LMArenaSearchDocument 0.0 (19/19)
LMArenaText 0.0 (24/24)
LongContextRecall 67.2 (10/21)
MCPAtlas 71.1 (5/13)
OutputSpeed 91.4 (6/19)
SWEBenchPro 75.7 (10/14)
SWEBenchVerified 38.2 (17/18)
SWEComposite 46.8 (21/24)
SWERebench 1.8 (19/20)
SciCode 36.1 (15/21)
SonarFunctionalSkill 78.9 (6/17)
SonarIssueDensity 13.2 (11/17)
Tau2Bench 3.5 (19/21)
TerminalBench 1.8 (20/22)
sources: arc_agi, artificial_analysis, lmarena, openrouter, swebench, swerebench, terminal_bench
missing: BUILD/GDPval, BUILD/LiveCodeBench, SWEComposite/SWEBenchMultilingual
grok-code-fast-1 · xai · raw / adjusted: idea 58.5 / 58.5 · plan 36.0 / 36.0 · build 54.9 / 54.9 · review 53.4

group breakdown

A_B 95.2 (2/24)
A_I 92.0 (1/24)
A_P 69.0 (2/24)
A_R 97.8 (2/24)
BUILD 31.6 (22/24)
CRE 48.1 (18/24)
GEN 15.8 (19/24)
LM_ARENA_REVIEW_PROXY 50.0 (10/24)
OPS_long 90.2 (4/24)
OPS_precision 89.1 (6/24)
OPS_review 89.7 (4/24)
PLAN 11.4 (24/24)

metrics

AI_code 91.5 (6/22)
AI_complexity 98.7 (2/22)
AI_context_awareness 0.0 (22/24)
AI_correctness 100.0 (4/22)
AI_edge_cases 89.9 (7/22)
AI_efficiency 75.2 (7/22)
AI_hallucination_resistance 100.0 (3/24)
AI_memory_retention 99.2 (3/24)
AI_parameter_accuracy 0.0 (22/24)
AI_plan_coherence 100.0 (2/24)
AI_recovery 100.0 (4/22)
AI_refusal 50.0 (21/22)
AI_spec 100.0 (17/22)
AI_stability 100.0 (2/22)
AI_task_completion 0.0 (22/24)
AI_tool_selection 0.0 (22/24)
ARC_AGI_2 25.1 (7/17)
ArtificialAnalysisCoding 0.0 (21/21)
ArtificialAnalysisIntelligence 0.0 (21/21)
ArtificialAnalysisReasoning 0.0 (21/21)
ContextWindow 78.4 (16/24)
CopilotArenaOrLMArenaCode 0.0 (22/22)
GPQA_HLE_Reasoning 0.0 (21/21)
IFBench 0.0 (21/21)
InverseCost 99.3 (2/24)
InverseTTFT 84.8 (7/19)
LMArenaCreativeOrOpenEnded 48.1 (18/24)
LMArenaText 48.1 (18/24)
LongContextRecall 0.0 (21/21)
OutputSpeed 93.8 (4/19)
SWEComposite 45.6 (22/24)
SWERebench 27.9 (18/20)
SciCode 0.0 (21/21)
Tau2Bench 53.3 (13/21)
TerminalBench 0.0 (22/22)
sources: aistupidlevel, artificial_analysis, lmarena, openrouter, swerebench, terminal_bench
missing: BUILD/GDPval, BUILD/LiveCodeBench, BUILD/MCPAtlas, BUILD/SonarFunctionalSkill, BUILD/SonarIssueDensity, LM_ARENA_REVIEW_PROXY/LMArenaSearchDocument, PLAN/MCPAtlas, SWEComposite/SWEBenchMultilingual, SWEComposite/SWEBenchPro, SWEComposite/SWEBenchVerified
gpt-5.2 · openai · raw / adjusted: idea 61.6 / 61.6 · plan 54.9 / 54.9 · build 52.3 / 52.3 · review 55.5

group breakdown

A_B 50.8 (11/24)
A_I 59.4 (16/24)
A_P 53.7 (18/24)
A_R 70.3 (10/24)
BUILD 51.9 (14/24)
CRE 67.8 (14/24)
GEN 52.2 (13/24)
LM_ARENA_REVIEW_PROXY 21.2 (16/24)
OPS_long 58.3 (19/24)
OPS_precision 59.8 (19/24)
OPS_review 59.5 (20/24)
PLAN 56.6 (12/24)

metrics

AI_code 9.5 (14/22)
AI_complexity 6.0 (19/22)
AI_context_awareness 0.0 (17/24)
AI_correctness 70.6 (17/22)
AI_edge_cases 49.0 (18/22)
AI_efficiency 69.0 (8/22)
AI_hallucination_resistance 80.0 (10/24)
AI_memory_retention 0.0 (21/24)
AI_parameter_accuracy 85.2 (12/24)
AI_plan_coherence 0.4 (23/24)
AI_recovery 91.9 (17/22)
AI_refusal 50.0 (16/22)
AI_spec 100.0 (12/22)
AI_stability 60.9 (14/22)
AI_task_completion 86.7 (7/24)
AI_tool_selection 75.3 (9/24)
ARC_AGI_2 0.0 (17/17)
ArtificialAnalysisCoding 63.4 (8/21)
ArtificialAnalysisIntelligence 59.7 (9/21)
ArtificialAnalysisReasoning 56.4 (11/21)
ContextWindow 85.3 (12/24)
CopilotArenaOrLMArenaCode 38.7 (20/22)
GPQA_HLE_Reasoning 56.4 (11/21)
IFBench 64.7 (8/21)
InverseCost 80.1 (11/24)
LMArenaCreativeOrOpenEnded 67.8 (14/24)
LMArenaSearchDocument 21.2 (11/19)
LMArenaText 67.8 (14/24)
LongContextRecall 53.9 (15/21)
SWEBenchMultilingual 0.0 (6/6)
SWEBenchPro 38.2 (13/14)
SWEBenchVerified 81.3 (14/18)
SWEComposite 49.6 (19/24)
SciCode 54.6 (9/21)
SonarFunctionalSkill 67.2 (10/17)
SonarIssueDensity 35.7 (8/17)
Tau2Bench 50.1 (15/21)
TerminalBench 58.2 (9/22)
sources: aistupidlevel, arc_agi, artificial_analysis, lmarena, mcp_atlas, openrouter, sonar, swebench, swebench_pro, terminal_bench
missing: BUILD/GDPval, BUILD/LiveCodeBench, BUILD/MCPAtlas, OPS_long/InverseTTFT, OPS_long/OutputSpeed, OPS_precision/InverseTTFT, OPS_precision/OutputSpeed, OPS_review/InverseTTFT, OPS_review/OutputSpeed, PLAN/MCPAtlas, SWEComposite/SWERebench
claude-sonnet-4 · anthropic · raw / adjusted: idea 28.0 / 28.0 · plan 40.5 / 40.5 · build 50.1 / 50.1 · review 55.6

group breakdown

A_B 46.2 (18/24)
A_I 63.1 (10/24)
A_P 60.9 (7/24)
A_R 61.5 (14/24)
BUILD 48.6 (16/24)
CRE 0.0 (23/24)
GEN 14.0 (22/24)
LM_ARENA_REVIEW_PROXY 86.2 (6/24)
OPS_long 80.0 (11/24)
OPS_precision 79.6 (10/24)
OPS_review 78.2 (10/24)
PLAN 33.3 (19/24)

metrics

AI_code 4.7 (16/22)
AI_complexity 6.0 (14/22)
AI_context_awareness 0.0 (13/24)
AI_correctness 70.6 (12/22)
AI_edge_cases 49.0 (13/22)
AI_efficiency 66.3 (13/22)
AI_hallucination_resistance 20.0 (20/24)
AI_memory_retention 0.0 (15/24)
AI_parameter_accuracy 91.9 (8/24)
AI_plan_coherence 19.8 (15/24)
AI_recovery 91.9 (12/22)
AI_refusal 50.0 (5/22)
AI_spec 100.0 (5/22)
AI_stability 79.7 (6/22)
AI_task_completion 99.4 (3/24)
AI_tool_selection 88.8 (4/24)
ARC_AGI_2 0.2 (16/17)
ArtificialAnalysisCoding 30.7 (16/21)
ArtificialAnalysisIntelligence 29.7 (15/21)
ArtificialAnalysisReasoning 8.6 (19/21)
ContextWindow 99.3 (9/24)
CopilotArenaOrLMArenaCode 52.9 (18/22)
GDPval 82.5 (3/11)
GPQA_HLE_Reasoning 8.6 (19/21)
IFBench 35.8 (14/21)
InverseCost 74.4 (16/24)
InverseTTFT 75.4 (13/19)
LMArenaCreativeOrOpenEnded 0.0 (23/24)
LMArenaSearchDocument 86.2 (6/19)
LMArenaText 0.0 (23/24)
LiveCodeBench 0.0 (2/2)
LongContextRecall 60.8 (12/21)
MCPAtlas 13.1 (10/13)
OutputSpeed 77.5 (13/19)
SWEBenchPro 78.4 (8/14)
SWEBenchVerified 69.9 (16/18)
SWEComposite 70.4 (11/24)
SWERebench 55.1 (14/20)
SciCode 20.8 (17/21)
SonarFunctionalSkill 26.4 (14/17)
SonarIssueDensity 35.8 (7/17)
Tau2Bench 27.7 (18/21)
TerminalBench 47.4 (13/22)
sources: aistupidlevel, arc_agi, artificial_analysis, livecodebench, lmarena, openrouter, sonar, swebench, swebench_pro, swerebench
missing: SWEComposite/SWEBenchMultilingual
gemini-2.5-flash · google · raw / adjusted: idea 52.4 / 51.7 · plan 34.2 / 32.6 · build 50.0 / 47.1 · review 52.8

group breakdown

A_B 84.7 (4/24)
A_I 74.3 (4/24)
A_P 60.7 (9/24)
A_R 85.5 (4/24)
BUILD 28.4 (23/24)
CRE 45.8 (19/24)
GEN 15.1 (20/24)
LM_ARENA_REVIEW_PROXY 78.8 (7/24)
OPS_long 94.6 (2/24)
OPS_precision 90.7 (3/24)
OPS_review 89.2 (6/24)
PLAN 13.5 (23/24)

metrics

AI_code 87.2 (7/22)
AI_complexity 88.8 (7/22)
AI_context_awareness 100.0 (1/24)
AI_correctness 100.0 (1/22)
AI_edge_cases 100.0 (1/22)
AI_efficiency 100.0 (1/22)
AI_hallucination_resistance 69.9 (12/24)
AI_memory_retention 37.0 (9/24)
AI_parameter_accuracy 73.3 (14/24)
AI_plan_coherence 0.0 (24/24)
AI_recovery 100.0 (1/22)
AI_refusal 50.0 (9/22)
AI_spec 100.0 (8/22)
AI_stability 19.3 (21/22)
AI_task_completion 44.2 (20/24)
AI_tool_selection 27.2 (16/24)
ARC_AGI_2 0.8 (15/17)
ArtificialAnalysisCoding 0.0 (20/21)
ArtificialAnalysisIntelligence 0.8 (19/21)
ArtificialAnalysisReasoning 17.9 (16/21)
ContextWindow 100.0 (3/24)
CopilotArenaOrLMArenaCode 65.3 (13/22)
GDPval 11.8 (10/11)
GPQA_HLE_Reasoning 17.9 (16/21)
IFBench 29.2 (17/21)
InverseCost 94.4 (5/24)
InverseTTFT 75.8 (12/19)
LMArenaCreativeOrOpenEnded 45.8 (19/24)
LMArenaSearchDocument 78.8 (7/19)
LMArenaText 45.8 (19/24)
LiveCodeBench 100.0 (1/2)
LongContextRecall 58.8 (13/21)
MCPAtlas 26.6 (8/13)
OutputSpeed 100.0 (1/19)
SWEBenchPro 52.5 (12/14)
SWEBenchVerified 0.0 (18/18)
SWEComposite 23.4 (24/24)
SWERebench 0.0 (20/20)
SciCode 23.5 (16/21)
Tau2Bench 0.0 (20/21)
TerminalBench 0.3 (21/22)
sources: aistupidlevel, arc_agi, artificial_analysis, livecodebench, lmarena, openrouter, swebench, swerebench, terminal_bench
missing: BUILD/SonarFunctionalSkill, BUILD/SonarIssueDensity, SWEComposite/SWEBenchMultilingual
glm-4.7 · zai · raw / adjusted: idea 34.4 / 34.4 · plan 50.1 / 50.1 · build 49.4 / 49.4 · review 54.4

group breakdown

A_B 56.0 (9/24)
A_I 54.0 (22/24)
A_P 46.9 (23/24)
A_R 58.5 (19/24)
BUILD 40.6 (21/24)
CRE 10.1 (22/24)
GEN 35.8 (18/24)
LM_ARENA_REVIEW_PROXY 50.0 (12/24)
OPS_long 87.6 (5/24)
OPS_precision 90.1 (4/24)
OPS_review 91.9 (1/24)
PLAN 53.6 (14/24)

metrics

AI_context_awareness 0.0 (24/24)
AI_hallucination_resistance 100.0 (5/24)
AI_memory_retention 99.2 (5/24)
AI_parameter_accuracy 0.0 (24/24)
AI_plan_coherence 100.0 (4/24)
AI_task_completion 0.0 (24/24)
AI_tool_selection 0.0 (24/24)
ArtificialAnalysisCoding 37.9 (14/21)
ArtificialAnalysisIntelligence 42.6 (13/21)
ArtificialAnalysisReasoning 55.8 (12/21)
ContextWindow 74.9 (18/24)
CopilotArenaOrLMArenaCode 68.8 (9/22)
GPQA_HLE_Reasoning 55.8 (12/21)
IFBench 72.2 (7/21)
InverseCost 96.1 (3/24)
InverseTTFT 99.0 (2/19)
LMArenaCreativeOrOpenEnded 10.1 (22/24)
LMArenaText 10.1 (22/24)
LongContextRecall 57.4 (14/21)
MCPAtlas 0.0 (13/13)
OutputSpeed 85.3 (7/19)
SWEComposite 54.2 (15/24)
SWERebench 70.9 (12/20)
SciCode 48.6 (11/21)
SonarFunctionalSkill 0.0 (17/17)
SonarIssueDensity 50.8 (5/17)
Tau2Bench 100.0 (2/21)
TerminalBench 27.1 (17/22)
sources: aistupidlevel, artificial_analysis, lmarena, mcp_atlas, openrouter, sonar, swerebench, terminal_bench
missing: A_B/AI_code, A_B/AI_complexity, A_B/AI_correctness, A_B/AI_edge_cases, A_B/AI_efficiency, A_B/AI_recovery, A_B/AI_spec, A_B/AI_stability, A_I/AI_code, A_I/AI_complexity, A_I/AI_correctness, A_I/AI_edge_cases, A_I/AI_efficiency, A_I/AI_recovery, A_I/AI_refusal, A_I/AI_spec, A_I/AI_stability, A_P/AI_correctness, A_P/AI_efficiency, A_P/AI_recovery, A_P/AI_spec, A_P/AI_stability, A_R/AI_code, A_R/AI_correctness, A_R/AI_edge_cases, A_R/AI_recovery, A_R/AI_spec, A_R/AI_stability, BUILD/GDPval, BUILD/LiveCodeBench, GEN/ARC_AGI_2, LM_ARENA_REVIEW_PROXY/LMArenaSearchDocument, SWEComposite/SWEBenchMultilingual, SWEComposite/SWEBenchPro, SWEComposite/SWEBenchVerified
claude-sonnet-4.5 · anthropic · raw / adjusted: idea 54.6 / 54.6 · plan 48.1 / 48.1 · build 48.2 / 48.2 · review 43.9

group breakdown

A_B 36.1 (23/24)
A_I 48.2 (23/24)
A_P 57.4 (13/24)
A_R 50.9 (23/24)
BUILD 53.3 (13/24)
CRE 64.0 (15/24)
GEN 42.2 (17/24)
LM_ARENA_REVIEW_PROXY 1.8 (22/24)
OPS_long 79.6 (12/24)
OPS_precision 79.5 (11/24)
OPS_review 78.3 (9/24)
PLAN 42.3 (17/24)

metrics

AI_canary_health78.17 / 7AI_code4.717 / 22AI_complexity6.015 / 22AI_context_awareness99.82 / 24AI_correctness70.613 / 22AI_edge_cases49.014 / 22AI_efficiency68.99 / 22AI_hallucination_resistance40.016 / 24AI_memory_retention0.016 / 24AI_parameter_accuracy61.319 / 24AI_plan_coherence30.89 / 24AI_recovery91.913 / 22AI_refusal50.06 / 22AI_spec0.022 / 22AI_stability79.77 / 22AI_task_completion99.82 / 24AI_tool_selection82.08 / 24ARC_AGI_23.712 / 17ArtificialAnalysisCoding45.312 / 21ArtificialAnalysisIntelligence46.012 / 21ArtificialAnalysisReasoning35.315 / 21ContextWindow99.310 / 24CopilotArenaOrLMArenaCode53.416 / 22GDPval88.32 / 11GPQA_HLE_Reasoning35.315 / 21IFBench43.012 / 21InverseCost74.417 / 24InverseTTFT76.311 / 19LMArenaCreativeOrOpenEnded64.015 / 24LMArenaSearchDocument1.817 / 19LMArenaText64.015 / 24LongContextRecall65.711 / 21MCPAtlas6.612 / 13OutputSpeed76.516 / 19SWEBenchMultilingual3.95 / 6SWEBenchPro81.26 / 14SWEBenchVerified85.712 / 18SWEComposite73.69 / 24SWERebench74.910 / 20SciCode46.413 / 21SonarFunctionalSkill17.215 / 17SonarIssueDensity30.09 / 17Tau2Bench58.911 / 21TerminalBench37.415 / 22
sources aistupidlevel · arc_agi · artificial_analysis · lmarena · mcp_atlas · openrouter · overrides · sonar · swebench · swebench_pro · swerebench · terminal_bench
missing BUILD/LiveCodeBench
claude-opus-4.1 · anthropic · idea 53.3 53.3 · plan 47.7 47.7 · build 45.8 45.8 · review 43.9

group breakdown

A_B 46.0 (20 / 24)
A_I 62.2 (11 / 24)
A_P 57.0 (15 / 24)
A_R 64.4 (13 / 24)
BUILD 48.5 (17 / 24)
CRE 52.9 (17 / 24)
GEN 50.7 (14 / 24)
LM_ARENA_REVIEW_PROXY 0.0 (23 / 24)
OPS_long 48.7 (22 / 24)
OPS_precision 46.2 (24 / 24)
OPS_review 42.5 (24 / 24)
PLAN 43.0 (16 / 24)

metrics

AI_canary_health 79.2 (5 / 7)
AI_code 0.0 (21 / 22)
AI_complexity 6.0 (10 / 22)
AI_context_awareness 0.0 (12 / 24)
AI_correctness 70.6 (8 / 22)
AI_edge_cases 49.0 (9 / 22)
AI_efficiency 56.8 (19 / 22)
AI_hallucination_resistance 40.0 (15 / 24)
AI_memory_retention 0.0 (13 / 24)
AI_parameter_accuracy 71.5 (16 / 24)
AI_plan_coherence 19.8 (14 / 24)
AI_recovery 91.9 (8 / 22)
AI_refusal 50.0 (1 / 22)
AI_spec 100.0 (1 / 22)
AI_stability 79.7 (3 / 22)
AI_task_completion 83.1 (10 / 24)
AI_tool_selection 71.0 (10 / 24)
ContextWindow 74.7 (20 / 24)
CopilotArenaOrLMArenaCode 53.2 (17 / 22)
InverseCost 0.0 (24 / 24)
LMArenaCreativeOrOpenEnded 52.9 (17 / 24)
LMArenaSearchDocument 0.0 (18 / 19)
LMArenaText 52.9 (17 / 24)
SWEComposite 50.5 (16 / 24)
SWERebench 52.3 (15 / 20)
TerminalBench 29.4 (16 / 22)
sources aistupidlevel · lmarena · openrouter · swerebench · terminal_bench
missing BUILD/ArtificialAnalysisCoding · BUILD/GDPval · BUILD/LiveCodeBench · BUILD/LongContextRecall · BUILD/MCPAtlas · BUILD/SciCode · BUILD/SonarFunctionalSkill · BUILD/SonarIssueDensity · GEN/ARC_AGI_2 · GEN/ArtificialAnalysisIntelligence · GEN/GPQA_HLE_Reasoning · OPS_long/InverseTTFT · OPS_long/OutputSpeed · OPS_precision/InverseTTFT · OPS_precision/OutputSpeed · OPS_review/InverseTTFT · OPS_review/OutputSpeed · PLAN/ArtificialAnalysisReasoning · PLAN/IFBench · PLAN/LongContextRecall · PLAN/MCPAtlas · PLAN/Tau2Bench · SWEComposite/SWEBenchMultilingual · SWEComposite/SWEBenchPro · SWEComposite/SWEBenchVerified
deepseek-v4-flash · deepseek · idea 51.5 51.5 · plan 59.3 59.3 · build 44.0 44.0 · review 48.3

group breakdown

A_B 24.3 (24 / 24)
A_I 30.3 (24 / 24)
A_P 39.3 (24 / 24)
A_R 21.7 (24 / 24)
BUILD 49.8 (15 / 24)
CRE 58.8 (16 / 24)
GEN 62.8 (8 / 24)
LM_ARENA_REVIEW_PROXY 50.0 (8 / 24)
OPS_long 86.5 (6 / 24)
OPS_precision 89.6 (5 / 24)
OPS_review 91.8 (2 / 24)
PLAN 71.1 (6 / 24)

metrics

AI_canary_health 78.4 (6 / 7)
AI_code 9.5 (12 / 22)
AI_complexity 6.0 (17 / 22)
AI_context_awareness 0.0 (14 / 24)
AI_correctness 0.0 (22 / 22)
AI_edge_cases 0.0 (22 / 22)
AI_efficiency 80.4 (6 / 22)
AI_hallucination_resistance 40.0 (17 / 24)
AI_memory_retention 0.0 (18 / 24)
AI_parameter_accuracy 91.6 (9 / 24)
AI_plan_coherence 25.3 (12 / 24)
AI_recovery 0.0 (22 / 22)
AI_refusal 50.0 (8 / 22)
AI_spec 100.0 (7 / 22)
AI_stability 0.0 (22 / 22)
AI_task_completion 86.7 (5 / 24)
AI_tool_selection 86.2 (5 / 24)
ArtificialAnalysisCoding 45.6 (11 / 21)
ArtificialAnalysisIntelligence 59.3 (10 / 21)
ArtificialAnalysisReasoning 76.7 (7 / 21)
ContextWindow 71.6 (22 / 24)
GPQA_HLE_Reasoning 76.7 (7 / 21)
IFBench 100.0 (1 / 21)
InverseCost 100.0 (1 / 24)
InverseTTFT 98.9 (3 / 19)
LMArenaCreativeOrOpenEnded 58.8 (16 / 24)
LMArenaText 58.8 (16 / 24)
LongContextRecall 52.5 (16 / 21)
OutputSpeed 83.6 (8 / 19)
SWEComposite 50.0 (17 / 24)
SciCode 47.5 (12 / 21)
Tau2Bench 97.9 (5 / 21)
sources aistupidlevel · artificial_analysis · lmarena · openrouter
missing BUILD/CopilotArenaOrLMArenaCode · BUILD/GDPval · BUILD/LiveCodeBench · BUILD/MCPAtlas · BUILD/SonarFunctionalSkill · BUILD/SonarIssueDensity · BUILD/TerminalBench · GEN/ARC_AGI_2 · LM_ARENA_REVIEW_PROXY/LMArenaSearchDocument · PLAN/MCPAtlas · PLAN/TerminalBench · SWEComposite/SWEBenchMultilingual · SWEComposite/SWEBenchPro · SWEComposite/SWEBenchVerified · SWEComposite/SWERebench
kimi-k2-0905 · moonshot · idea 32.6 32.6 · plan 33.4 33.4 · build 43.6 43.6 · review 49.6

group breakdown

A_B 46.6 (17 / 24)
A_I 57.6 (19 / 24)
A_P 56.9 (16 / 24)
A_R 71.9 (8 / 24)
BUILD 42.7 (20 / 24)
CRE 27.6 (20 / 24)
GEN 8.1 (24 / 24)
LM_ARENA_REVIEW_PROXY 50.0 (9 / 24)
OPS_long 35.4 (24 / 24)
OPS_precision 53.7 (22 / 24)
OPS_review 60.2 (19 / 24)
PLAN 29.6 (20 / 24)

metrics

AI_canary_health 88.9 (1 / 7)
AI_code 14.2 (11 / 22)
AI_complexity 0.0 (22 / 22)
AI_context_awareness 0.0 (15 / 24)
AI_correctness 70.6 (15 / 22)
AI_edge_cases 49.0 (16 / 22)
AI_efficiency 0.0 (22 / 22)
AI_hallucination_resistance 80.0 (9 / 24)
AI_memory_retention 0.0 (19 / 24)
AI_parameter_accuracy 84.7 (13 / 24)
AI_plan_coherence 30.8 (10 / 24)
AI_recovery 91.9 (15 / 22)
AI_refusal 50.0 (14 / 22)
AI_spec 100.0 (10 / 22)
AI_stability 72.3 (11 / 22)
AI_task_completion 86.7 (6 / 24)
AI_tool_selection 68.0 (11 / 24)
ArtificialAnalysisCoding 4.2 (19 / 21)
ArtificialAnalysisIntelligence 0.0 (20 / 21)
ArtificialAnalysisReasoning 0.0 (20 / 21)
ContextWindow 53.4 (23 / 24)
GPQA_HLE_Reasoning 0.0 (20 / 21)
IFBench 0.0 (20 / 21)
InverseCost 92.7 (7 / 24)
InverseTTFT 90.7 (5 / 19)
LMArenaCreativeOrOpenEnded 27.6 (20 / 24)
LMArenaText 27.6 (20 / 24)
LongContextRecall 0.0 (20 / 21)
OutputSpeed 0.0 (19 / 19)
SWEComposite 50.0 (18 / 24)
SciCode 0.0 (20 / 21)
Tau2Bench 48.0 (16 / 21)
sources aistupidlevel · artificial_analysis · lmarena · openrouter
missing BUILD/CopilotArenaOrLMArenaCode · BUILD/GDPval · BUILD/LiveCodeBench · BUILD/MCPAtlas · BUILD/SonarFunctionalSkill · BUILD/SonarIssueDensity · BUILD/TerminalBench · GEN/ARC_AGI_2 · LM_ARENA_REVIEW_PROXY/LMArenaSearchDocument · PLAN/MCPAtlas · PLAN/TerminalBench · SWEComposite/SWEBenchMultilingual · SWEComposite/SWEBenchPro · SWEComposite/SWEBenchVerified · SWEComposite/SWERebench
glm-4.6 · zai · idea 35.4 35.4 · plan 30.9 30.9 · build 37.6 37.6 · review 40.7

group breakdown

A_B 56.0 (8 / 24)
A_I 54.0 (21 / 24)
A_P 46.9 (22 / 24)
A_R 58.5 (18 / 24)
BUILD 23.0 (24 / 24)
CRE 24.3 (21 / 24)
GEN 13.6 (23 / 24)
LM_ARENA_REVIEW_PROXY 50.0 (11 / 24)
OPS_long 77.0 (15 / 24)
OPS_precision 83.3 (8 / 24)
OPS_review 86.0 (8 / 24)
PLAN 18.1 (22 / 24)

metrics

AI_context_awareness 0.0 (23 / 24)
AI_hallucination_resistance 100.0 (4 / 24)
AI_memory_retention 99.2 (4 / 24)
AI_parameter_accuracy 0.0 (23 / 24)
AI_plan_coherence 100.0 (3 / 24)
AI_task_completion 0.0 (23 / 24)
AI_tool_selection 0.0 (23 / 24)
ArtificialAnalysisCoding 15.9 (18 / 21)
ArtificialAnalysisIntelligence 6.1 (18 / 21)
ArtificialAnalysisReasoning 16.5 (17 / 21)
ContextWindow 75.0 (17 / 24)
CopilotArenaOrLMArenaCode 44.4 (19 / 22)
GPQA_HLE_Reasoning 16.5 (17 / 21)
IFBench 4.7 (19 / 21)
InverseCost 95.4 (4 / 24)
InverseTTFT 98.7 (4 / 19)
LMArenaCreativeOrOpenEnded 24.3 (21 / 24)
LMArenaText 24.3 (21 / 24)
LongContextRecall 9.8 (19 / 21)
MCPAtlas 7.5 (11 / 13)
OutputSpeed 66.3 (18 / 19)
SWEBenchPro 0.0 (14 / 14)
SWEBenchVerified 79.0 (15 / 18)
SWEComposite 34.9 (23 / 24)
SWERebench 38.4 (17 / 20)
SciCode 12.0 (19 / 21)
SonarFunctionalSkill 7.5 (16 / 17)
SonarIssueDensity 7.5 (14 / 17)
Tau2Bench 41.3 (17 / 21)
TerminalBench 13.9 (18 / 22)
sources aistupidlevel · artificial_analysis · lmarena · openrouter · swebench · swebench_pro · swerebench · terminal_bench
missing A_B/AI_code · A_B/AI_complexity · A_B/AI_correctness · A_B/AI_edge_cases · A_B/AI_efficiency · A_B/AI_recovery · A_B/AI_spec · A_B/AI_stability · A_I/AI_code · A_I/AI_complexity · A_I/AI_correctness · A_I/AI_edge_cases · A_I/AI_efficiency · A_I/AI_recovery · A_I/AI_refusal · A_I/AI_spec · A_I/AI_stability · A_P/AI_correctness · A_P/AI_efficiency · A_P/AI_recovery · A_P/AI_spec · A_P/AI_stability · A_R/AI_code · A_R/AI_correctness · A_R/AI_edge_cases · A_R/AI_recovery · A_R/AI_spec · A_R/AI_stability · BUILD/GDPval · BUILD/LiveCodeBench · GEN/ARC_AGI_2 · LM_ARENA_REVIEW_PROXY/LMArenaSearchDocument · SWEComposite/SWEBenchMultilingual