$ipbr-rank · live llm coding-role score
refreshed · 14 sources · updated frequently — models drift and degrade
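[ idea ]
1 · gemini-3.1-pro-preview · 90.6 raw / 90.6 adjusted
2 · claude-opus-4.6 · 90.4 raw / 90.4 adjusted
3 · claude-opus-4.7 · 88.7 raw / 88.3 adjusted
[ plan ]
1 · gpt-5.5 · 84.9 raw / 84.9 adjusted
2 · gemini-3.1-pro-preview · 84.3 raw / 84.3 adjusted
3 · claude-opus-4.7 · 78.7 raw / 77.8 adjusted
[ build ]
1 · gpt-5.5 · 82.0 raw / 82.0 adjusted
2 · claude-opus-4.6 · 79.5 raw / 79.5 adjusted
3 · claude-opus-4.7 · 77.9 raw / 76.2 adjusted
[ review ]
1 · gemini-3.1-pro-preview · 83.2 / 83.2
2 · claude-opus-4.7 · 80.3 / 80.3
3 · gpt-5.5 · 79.1 / 79.1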
how scoring works

Each model gets four role scores from public benchmarks. Idea measures open-ended creativity. Plan measures structured reasoning, function-calling, and multi-step decomposition. Build measures implementation skill — SWE-bench, LiveCodeBench, terminal tasks. Review measures preference judgment.

raw vs adjusted

The raw score is the benchmark composite, normalized to 0-100. The adjusted score subtracts a reviewer-reservation penalty: when a vendor's models lead the direct LM Arena search/document review proxy, that proxy lead gets discounted from their Idea, Plan, and Build scores so vendors can't game their own preference evaluations.

Penalty coefficients differ by role: Build is penalized hardest (0.32), Plan moderately (0.18), Idea lightly (0.08). Review is never adjusted.
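The adjustment described above can be sketched in a few lines. This is a minimal illustration, not the site's actual code: the coefficients come from the text, while the function name, the `proxy_lead` input (how many points the vendor leads the review proxy by), and the clamping at zero are assumptions made for the example.

```python
# Penalty coefficients per role, as stated in the text. Review is never adjusted.
PENALTY = {"idea": 0.08, "plan": 0.18, "build": 0.32, "review": 0.0}


def adjusted_score(raw: float, role: str, proxy_lead: float) -> float:
    """Subtract the role's share of a vendor's review-proxy lead from the raw score.

    proxy_lead is a hypothetical input: the size of the vendor's lead on the
    review proxy, in score points. How the real system measures that lead is
    not specified in the text. Vendors that do not lead pass 0 (or a negative
    number) and keep their raw score.
    """
    lead = max(proxy_lead, 0.0)          # only a positive lead is discounted
    penalty = PENALTY[role] * lead       # heavier for build, lighter for idea
    return max(0.0, raw - penalty)       # assumption: scores never go below 0


if __name__ == "__main__":
    # Illustrative numbers only: a raw score of 80 and a 5-point proxy lead.
    for role in ("idea", "plan", "build", "review"):
        print(role, round(adjusted_score(80.0, role, 5.0), 2))
```

With the same lead, Build loses four times as much as Idea, which matches the ordering of the coefficients in the text.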

missing data

If a model is missing some metrics within a group, the group score depends on how much of the group is covered. Between 60% and 80% group coverage, the score blends from a shrink-to-50 estimate toward the present-weight mean of the available metrics. At 80% coverage and above, the present-weight mean is trusted directly.
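One way to picture the coverage rule is the sketch below. The 60% and 80% thresholds and the target of 50 come from the text; everything else is an assumption made for illustration: the function name, measuring coverage by weight, the linear blend between the two thresholds, and the specific shrinkage formula (coverage-weighted pull toward 50).

```python
from typing import Optional, Sequence


def group_score(
    values: Sequence[Optional[float]],
    weights: Sequence[float],
    lo: float = 0.60,
    hi: float = 0.80,
    prior: float = 50.0,
) -> float:
    """Blend a shrink-to-50 estimate with the present-weight mean by coverage.

    values holds one score per metric, with None where the metric is missing.
    The shrinkage formula and the linear blend are assumptions, not the
    published math.
    """
    total_w = sum(weights)
    present = [(v, w) for v, w in zip(values, weights) if v is not None]
    present_w = sum(w for _, w in present)
    if present_w == 0:
        return prior                      # nothing observed: fall back to 50

    mean = sum(v * w for v, w in present) / present_w   # present-weight mean
    coverage = present_w / total_w
    if coverage >= hi:
        return mean                       # 80%+ coverage: trust the mean

    # Assumed shrinkage: pull the mean toward 50 in proportion to missing weight.
    shrunk = coverage * mean + (1.0 - coverage) * prior
    # Assumed linear ramp: 0 at 60% coverage (and below), 1 at 80%.
    t = max(0.0, (coverage - lo) / (hi - lo))
    return (1.0 - t) * shrunk + t * mean


if __name__ == "__main__":
    full = [100.0] * 10
    partial = [100.0] * 7 + [None] * 3    # 70% coverage with equal weights
    w = [1.0] * 10
    print(group_score(full, w), group_score(partial, w))
```

Under these assumptions, a group with seven of ten equally weighted metrics present lands halfway between the shrunk estimate and the plain mean.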

Full math, role definitions, and source list →

gpt-5.5 · openai · idea 83.8 / 83.8 · plan 84.9 / 84.9 · build 82.0 / 82.0 · review 79.1 (raw / adjusted)

group breakdown

A_B 67.2 (3 / 25)
A_I 76.8 (7 / 25)
A_P 66.5 (5 / 25)
A_R 83.7 (4 / 25)
BUILD 87.4 (1 / 25)
CRE 82.8 (7 / 25)
GEN 94.4 (3 / 25)
LM_ARENA_REVIEW_PROXY 27.5 (14 / 25)
OPS_long 81.6 (12 / 25)
OPS_precision 78.5 (15 / 25)
OPS_review 80.2 (15 / 25)
PLAN 90.6 (2 / 25)

metrics

AI_code 13.5 (21 / 23)
AI_complexity 32.3 (21 / 23)
AI_context_awareness 0.0 (21 / 25)
AI_correctness 93.2 (15 / 23)
AI_edge_cases 67.6 (19 / 23)
AI_efficiency 78.5 (10 / 23)
AI_hallucination_resistance 100.0 (10 / 25)
AI_memory_retention 100.0 (1 / 25)
AI_parameter_accuracy 32.1 (21 / 25)
AI_plan_coherence 25.3 (7 / 25)
AI_recovery 99.6 (14 / 23)
AI_refusal 100.0 (16 / 23)
AI_spec 100.0 (16 / 23)
AI_stability 78.9 (9 / 23)
AI_task_completion 71.9 (13 / 25)
AI_tool_selection 84.3 (7 / 25)
ARC_AGI_2 96.7 (2 / 23)
ArtificialAnalysisCoding 100.0 (2 / 24)
ArtificialAnalysisIntelligence 98.2 (3 / 24)
ArtificialAnalysisReasoning 100.0 (2 / 24)
BlendedCost 49.0 (24 / 25)
ContextWindow 100.0 (2 / 25)
CopilotArenaOrLMArenaCode 69.7 (11 / 25)
GDPval 95.0 (2 / 25)
GPQA_HLE_Reasoning 100.0 (2 / 24)
GSO 94.0 (2 / 16)
IFBench 78.2 (8 / 24)
LMArenaCreativeOrOpenEnded 82.8 (7 / 25)
LMArenaSearchDocument 27.5 (12 / 23)
LMArenaText 82.8 (7 / 25)
LongContextRecall 98.1 (3 / 24)
MCPAtlas 72.8 (8 / 17)
OutputSpeed 81.9 (12 / 24)
SWEBenchPro 95.0 (7 / 22)
SWEBenchVerified 95.0 (9 / 24)
SWEComposite 89.9 (5 / 25)
SWERebench 83.5 (8 / 24)
SciCode 94.8 (4 / 24)
SonarBugDensity 94.5 (2 / 23)
SonarComposite 65.5 (9 / 25)
SonarFunctionalSkill 46.5 (19 / 23)
SonarIssueDensity 52.7 (8 / 23)
SonarVulnerabilityDensity 99.2 (2 / 23)
TTFT 83.3 (9 / 24)
Tau2Bench 87.0 (9 / 24)
TerminalBench 100.0 (1 / 25)
sources: aistupidlevel, arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, terminal_bench
missing: SWEComposite/SWEBenchMultilingual
claude-opus-4.6 · anthropic · idea 90.4 / 90.4 · plan 74.2 / 74.2 · build 79.5 / 79.5 · review 74.6 (raw / adjusted)

group breakdown

A_B 63.5 (10 / 25)
A_I 76.8 (6 / 25)
A_P 58.0 (13 / 25)
A_R 83.2 (6 / 25)
BUILD 86.8 (2 / 25)
CRE 100.0 (1 / 25)
GEN 89.5 (4 / 25)
LM_ARENA_REVIEW_PROXY 33.6 (13 / 25)
OPS_long 78.6 (16 / 25)
OPS_precision 76.1 (16 / 25)
OPS_review 78.7 (17 / 25)
PLAN 73.2 (7 / 25)

metrics

AI_canary_health 84.2 (5 / 7)
AI_code 8.7 (22 / 23)
AI_complexity 52.7 (7 / 23)
AI_context_awareness 0.0 (16 / 25)
AI_correctness 93.2 (5 / 23)
AI_edge_cases 67.6 (9 / 23)
AI_efficiency 69.2 (18 / 23)
AI_hallucination_resistance 100.0 (2 / 25)
AI_memory_retention 0.0 (16 / 25)
AI_parameter_accuracy 66.1 (13 / 25)
AI_plan_coherence 12.8 (16 / 25)
AI_recovery 99.6 (4 / 23)
AI_refusal 100.0 (3 / 23)
AI_spec 100.0 (3 / 23)
AI_stability 78.9 (7 / 23)
AI_task_completion 51.0 (20 / 25)
AI_tool_selection 96.8 (4 / 25)
ARC_AGI_2 90.9 (4 / 23)
ArtificialAnalysisCoding 76.4 (5 / 24)
ArtificialAnalysisIntelligence 84.1 (6 / 24)
ArtificialAnalysisReasoning 86.4 (5 / 24)
BlendedCost 60.4 (22 / 25)
ContextWindow 99.3 (8 / 25)
CopilotArenaOrLMArenaCode 100.0 (1 / 25)
GDPval 82.4 (7 / 25)
GPQA_HLE_Reasoning 86.4 (5 / 24)
GSO 75.3 (3 / 16)
IFBench 30.3 (19 / 24)
LMArenaCreativeOrOpenEnded 100.0 (1 / 25)
LMArenaSearchDocument 33.6 (11 / 23)
LMArenaText 100.0 (1 / 25)
LongContextRecall 90.2 (5 / 24)
OutputSpeed 78.3 (16 / 24)
SWEBenchMultilingual 90.9 (9 / 20)
SWEBenchPro 100.0 (1 / 22)
SWEBenchVerified 99.7 (2 / 24)
SWEComposite 95.7 (1 / 25)
SWERebench 91.6 (4 / 24)
SciCode 85.9 (5 / 24)
SonarBugDensity 59.5 (13 / 23)
SonarComposite 70.5 (8 / 25)
SonarFunctionalSkill 92.2 (4 / 23)
SonarIssueDensity 46.8 (10 / 23)
SonarVulnerabilityDensity 66.6 (12 / 23)
TTFT 73.3 (18 / 24)
Tau2Bench 87.7 (8 / 24)
TerminalBench 64.2 (8 / 25)
sources: aistupidlevel, arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swebench_pro, swerebench, terminal_bench
missing: BUILD/MCPAtlas, PLAN/MCPAtlas
claude-opus-4.7 · anthropic · idea 88.7 / 88.3 · plan 78.7 / 77.8 · build 77.9 / 76.2 · review 80.3 (raw / adjusted)

group breakdown

A_B 57.3 (18 / 25)
A_I 75.7 (9 / 25)
A_P 60.5 (9 / 25)
A_R 66.0 (17 / 25)
BUILD 86.5 (3 / 25)
CRE 93.8 (4 / 25)
GEN 96.6 (2 / 25)
LM_ARENA_REVIEW_PROXY 100.0 (1 / 25)
OPS_long 76.7 (18 / 25)
OPS_precision 71.7 (20 / 25)
OPS_review 75.7 (18 / 25)
PLAN 78.9 (6 / 25)

metrics

AI_code 13.5 (16 / 23)
AI_complexity 52.7 (8 / 23)
AI_context_awareness 5.9 (15 / 25)
AI_correctness 93.2 (6 / 23)
AI_edge_cases 67.6 (10 / 23)
AI_efficiency 86.8 (3 / 23)
AI_hallucination_resistance 0.0 (23 / 25)
AI_memory_retention 0.0 (17 / 25)
AI_parameter_accuracy 50.5 (20 / 25)
AI_plan_coherence 0.0 (25 / 25)
AI_recovery 99.6 (5 / 23)
AI_refusal 100.0 (4 / 23)
AI_spec 100.0 (4 / 23)
AI_stability 71.1 (10 / 23)
AI_task_completion 97.6 (3 / 25)
AI_tool_selection 100.0 (1 / 25)
ARC_AGI_2 92.7 (3 / 23)
ArtificialAnalysisCoding 90.7 (3 / 24)
ArtificialAnalysisIntelligence 100.0 (1 / 24)
ArtificialAnalysisReasoning 95.8 (3 / 24)
BlendedCost 60.4 (23 / 25)
ContextWindow 99.3 (9 / 25)
CopilotArenaOrLMArenaCode 100.0 (2 / 25)
GDPval 95.0 (1 / 25)
GPQA_HLE_Reasoning 95.8 (3 / 24)
GSO 100.0 (1 / 16)
IFBench 45.0 (13 / 24)
LMArenaCreativeOrOpenEnded 93.8 (4 / 25)
LMArenaSearchDocument 100.0 (1 / 23)
LMArenaText 93.8 (4 / 25)
LongContextRecall 88.3 (7 / 24)
OutputSpeed 79.8 (15 / 24)
SWEBenchMultilingual 95.0 (3 / 20)
SWEBenchPro 95.0 (2 / 22)
SWEBenchVerified 95.0 (3 / 24)
SWEComposite 91.1 (4 / 25)
SWERebench 85.3 (6 / 24)
SciCode 100.0 (1 / 24)
SonarBugDensity 50.1 (20 / 23)
SonarComposite 51.4 (19 / 25)
SonarFunctionalSkill 93.9 (2 / 23)
SonarIssueDensity 0.0 (23 / 23)
SonarVulnerabilityDensity 25.3 (20 / 23)
TTFT 59.5 (21 / 24)
Tau2Bench 79.9 (11 / 24)
TerminalBench 78.2 (4 / 25)
sources: aistupidlevel, arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar
missing: BUILD/MCPAtlas, PLAN/MCPAtlas
gemini-3.1-pro-preview · google · idea 90.6 / 90.6 · plan 84.3 / 84.3 · build 76.9 / 76.9 · review 83.2 (raw / adjusted)

group breakdown

A_B 60.4 (16 / 25)
A_I 66.4 (21 / 25)
A_P 55.7 (19 / 25)
A_R 69.7 (16 / 25)
BUILD 82.1 (5 / 25)
CRE 100.0 (2 / 25)
GEN 100.0 (1 / 25)
LM_ARENA_REVIEW_PROXY 92.3 (4 / 25)
OPS_long 82.6 (11 / 25)
OPS_precision 73.9 (17 / 25)
OPS_review 79.4 (16 / 25)
PLAN 92.7 (1 / 25)

metrics

AI_code 24.3 (6 / 23)
AI_complexity 37.7 (19 / 23)
AI_context_awareness 21.1 (8 / 25)
AI_correctness 83.3 (21 / 23)
AI_edge_cases 78.0 (6 / 23)
AI_efficiency 71.8 (15 / 23)
AI_hallucination_resistance 92.5 (17 / 25)
AI_memory_retention 7.5 (14 / 25)
AI_parameter_accuracy 58.7 (18 / 25)
AI_plan_coherence 14.5 (15 / 25)
AI_recovery 23.2 (21 / 23)
AI_refusal 92.5 (21 / 23)
AI_spec 92.5 (21 / 23)
AI_stability 92.5 (5 / 23)
AI_task_completion 83.7 (8 / 25)
AI_tool_selection 59.4 (19 / 25)
ARC_AGI_2 100.0 (1 / 23)
ArtificialAnalysisCoding 100.0 (1 / 24)
ArtificialAnalysisIntelligence 100.0 (2 / 24)
ArtificialAnalysisReasoning 100.0 (1 / 24)
BlendedCost 77.2 (14 / 25)
ContextWindow 100.0 (7 / 25)
CopilotArenaOrLMArenaCode 73.8 (9 / 25)
GDPval 50.1 (16 / 25)
GPQA_HLE_Reasoning 100.0 (1 / 24)
GSO 51.3 (9 / 16)
IFBench 94.5 (4 / 24)
LMArenaCreativeOrOpenEnded 100.0 (2 / 25)
LMArenaSearchDocument 92.3 (4 / 23)
LMArenaText 100.0 (2 / 25)
LongContextRecall 100.0 (2 / 24)
MCPAtlas 71.1 (10 / 17)
OutputSpeed 92.6 (4 / 24)
SWEBenchMultilingual 36.0 (13 / 20)
SWEBenchPro 89.1 (11 / 22)
SWEBenchVerified 95.0 (6 / 24)
SWEComposite 88.9 (6 / 25)
SWERebench 99.8 (2 / 24)
SciCode 100.0 (2 / 24)
SonarBugDensity 52.7 (18 / 23)
SonarComposite 54.2 (18 / 25)
SonarFunctionalSkill 78.9 (10 / 23)
SonarIssueDensity 13.2 (18 / 23)
SonarVulnerabilityDensity 58.2 (17 / 23)
TTFT 44.9 (22 / 24)
Tau2Bench 95.5 (6 / 24)
TerminalBench 89.4 (3 / 25)
sources: arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench_pro, swerebench, terminal_bench
missing: none
claude-opus-4.5 · anthropic · idea 72.4 / 72.4 · plan 69.8 / 69.8 · build 75.5 / 75.5 · review 67.7 (raw / adjusted)

group breakdown

A_B 65.1 (7 / 25)
A_I 68.6 (17 / 25)
A_P 70.1 (3 / 25)
A_R 82.7 (8 / 25)
BUILD 80.6 (6 / 25)
CRE 73.6 (11 / 25)
GEN 72.6 (6 / 25)
LM_ARENA_REVIEW_PROXY 10.8 (22 / 25)
OPS_long 76.5 (19 / 25)
OPS_precision 73.7 (18 / 25)
OPS_review 73.6 (19 / 25)
PLAN 67.1 (10 / 25)

metrics

AI_code 23.3 (7 / 23)
AI_complexity 0.0 (23 / 23)
AI_context_awareness 28.1 (5 / 25)
AI_correctness 93.2 (4 / 23)
AI_edge_cases 67.6 (8 / 23)
AI_efficiency 85.2 (5 / 23)
AI_hallucination_resistance 100.0 (1 / 25)
AI_memory_retention 99.7 (3 / 25)
AI_parameter_accuracy 100.0 (1 / 25)
AI_plan_coherence 1.3 (22 / 25)
AI_recovery 99.6 (3 / 23)
AI_refusal 100.0 (2 / 23)
AI_spec 100.0 (2 / 23)
AI_stability 59.3 (14 / 23)
AI_task_completion 100.0 (2 / 25)
AI_tool_selection 97.6 (3 / 25)
ARC_AGI_2 84.8 (5 / 23)
ArtificialAnalysisCoding 75.4 (6 / 24)
ArtificialAnalysisIntelligence 71.5 (9 / 24)
ArtificialAnalysisReasoning 63.7 (11 / 24)
BlendedCost 60.4 (21 / 25)
ContextWindow 74.5 (22 / 25)
CopilotArenaOrLMArenaCode 77.4 (8 / 25)
GDPval 80.6 (9 / 25)
GPQA_HLE_Reasoning 63.7 (11 / 24)
GSO 59.3 (5 / 16)
IFBench 43.4 (15 / 24)
LMArenaCreativeOrOpenEnded 73.6 (11 / 25)
LMArenaSearchDocument 10.8 (20 / 23)
LMArenaText 73.6 (11 / 25)
LongContextRecall 100.0 (1 / 24)
OutputSpeed 80.4 (13 / 24)
SWEBenchMultilingual 95.0 (2 / 20)
SWEBenchPro 88.4 (12 / 22)
SWEBenchVerified 92.2 (11 / 24)
SWEComposite 84.9 (11 / 25)
SWERebench 76.5 (9 / 24)
SciCode 72.7 (8 / 24)
SonarBugDensity 73.7 (9 / 23)
SonarComposite 87.1 (1 / 25)
SonarFunctionalSkill 100.0 (1 / 23)
SonarIssueDensity 77.2 (6 / 23)
SonarVulnerabilityDensity 87.2 (4 / 23)
TTFT 75.3 (17 / 24)
Tau2Bench 81.9 (10 / 24)
TerminalBench 54.8 (13 / 25)
sources: aistupidlevel, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench_pro, swerebench, terminal_bench
missing: BUILD/MCPAtlas, PLAN/MCPAtlas
deepseek-v4-pro · deepseek · idea 71.5 / 71.5 · plan 73.1 / 73.1 · build 75.3 / 75.3 · review 78.3 (raw / adjusted)

group breakdown

A_B 57.3 (19 / 25)
A_I 70.7 (15 / 25)
A_P 56.6 (16 / 25)
A_R 62.4 (20 / 25)
BUILD 80.0 (8 / 25)
CRE 72.7 (12 / 25)
GEN 67.9 (8 / 25)
LM_ARENA_REVIEW_PROXY 88.0 (6 / 25)
OPS_long 76.9 (17 / 25)
OPS_precision 86.8 (8 / 25)
OPS_review 87.0 (6 / 25)
PLAN 83.5 (4 / 25)

metrics

AI_code 27.3 (3 / 23)
AI_complexity 69.6 (6 / 23)
AI_context_awareness 7.5 (13 / 25)
AI_correctness 86.7 (17 / 23)
AI_edge_cases 65.0 (20 / 23)
AI_efficiency 70.0 (17 / 23)
AI_hallucination_resistance 7.5 (21 / 25)
AI_memory_retention 7.5 (11 / 25)
AI_parameter_accuracy 76.0 (11 / 25)
AI_plan_coherence 20.3 (9 / 25)
AI_recovery 92.2 (15 / 23)
AI_refusal 92.5 (18 / 23)
AI_spec 92.5 (18 / 23)
AI_stability 47.8 (16 / 23)
AI_task_completion 71.4 (14 / 25)
AI_tool_selection 72.4 (12 / 25)
ARC_AGI_2 11.9 (13 / 23)
ArtificialAnalysisCoding 74.4 (7 / 24)
ArtificialAnalysisIntelligence 78.4 (8 / 24)
ArtificialAnalysisReasoning 83.2 (7 / 24)
BlendedCost 99.0 (3 / 25)
ContextWindow 100.0 (3 / 25)
CopilotArenaOrLMArenaCode 73.7 (10 / 25)
GDPval 67.5 (14 / 25)
GPQA_HLE_Reasoning 83.2 (7 / 24)
IFBench 92.9 (5 / 24)
LMArenaCreativeOrOpenEnded 72.7 (12 / 25)
LMArenaSearchDocument 88.0 (6 / 23)
LMArenaText 72.7 (12 / 25)
LongContextRecall 68.5 (9 / 24)
OutputSpeed 59.0 (22 / 24)
SWEBenchMultilingual 95.0 (5 / 20)
SWEBenchPro 95.0 (4 / 22)
SWEBenchVerified 95.0 (5 / 24)
SWEComposite 86.2 (9 / 25)
SWERebench 73.1 (13 / 24)
SciCode 75.5 (7 / 24)
SonarBugDensity 92.5 (4 / 23)
SonarComposite 80.6 (5 / 25)
SonarFunctionalSkill 66.8 (16 / 23)
SonarIssueDensity 92.5 (3 / 23)
SonarVulnerabilityDensity 81.6 (8 / 23)
TTFT 97.9 (5 / 24)
Tau2Bench 96.8 (3 / 24)
TerminalBench 70.0 (7 / 25)
sources: artificial_analysis, lmarena, openrouter, overrides
missing: BUILD/GSO, BUILD/MCPAtlas, PLAN/MCPAtlas
glm-5.1 · zai · idea 75.7 / 75.7 · plan 65.3 / 65.3 · build 74.1 / 74.1 · review 75.2 (raw / adjusted)

group breakdown

A_B 53.9 (23 / 25)
A_I 67.9 (18 / 25)
A_P 53.4 (21 / 25)
A_R 62.0 (21 / 25)
BUILD 80.2 (7 / 25)
CRE 87.2 (5 / 25)
GEN 57.7 (13 / 25)
LM_ARENA_REVIEW_PROXY 88.0 (8 / 25)
OPS_long 82.7 (10 / 25)
OPS_precision 87.6 (6 / 25)
OPS_review 85.2 (8 / 25)
PLAN 72.9 (8 / 25)

metrics

AI_code 23.2 (10 / 23)
AI_complexity 52.3 (16 / 23)
AI_context_awareness 7.5 (14 / 25)
AI_correctness 86.7 (18 / 23)
AI_edge_cases 65.0 (21 / 23)
AI_efficiency 61.2 (21 / 23)
AI_hallucination_resistance 7.5 (22 / 25)
AI_memory_retention 7.5 (15 / 25)
AI_parameter_accuracy 73.6 (12 / 25)
AI_plan_coherence 16.4 (11 / 25)
AI_recovery 92.2 (16 / 23)
AI_refusal 92.5 (22 / 23)
AI_spec 92.5 (22 / 23)
AI_stability 47.8 (17 / 23)
AI_task_completion 60.7 (18 / 25)
AI_tool_selection 55.9 (21 / 25)
ARC_AGI_2 5.2 (17 / 23)
ArtificialAnalysisCoding 39.5 (16 / 24)
ArtificialAnalysisIntelligence 60.5 (11 / 24)
ArtificialAnalysisReasoning 53.9 (15 / 24)
BlendedCost 93.4 (7 / 25)
ContextWindow 74.7 (20 / 25)
CopilotArenaOrLMArenaCode 96.3 (3 / 25)
GDPval 73.5 (10 / 25)
GPQA_HLE_Reasoning 53.9 (15 / 24)
IFBench 84.1 (7 / 24)
LMArenaCreativeOrOpenEnded 87.2 (5 / 25)
LMArenaSearchDocument 88.0 (8 / 23)
LMArenaText 87.2 (5 / 25)
LongContextRecall 40.9 (20 / 24)
MCPAtlas 100.0 (1 / 17)
OutputSpeed 77.1 (18 / 24)
SWEBenchMultilingual 50.9 (11 / 20)
SWEBenchPro 95.0 (8 / 22)
SWEBenchVerified 91.9 (13 / 24)
SWEComposite 92.1 (3 / 25)
SWERebench 100.0 (1 / 24)
SciCode 40.2 (17 / 24)
SonarBugDensity 100.0 (1 / 23)
SonarComposite 86.0 (2 / 25)
SonarFunctionalSkill 69.8 (12 / 23)
SonarIssueDensity 100.0 (1 / 23)
SonarVulnerabilityDensity 87.2 (5 / 23)
TTFT 98.9 (4 / 24)
Tau2Bench 100.0 (2 / 24)
TerminalBench 55.8 (12 / 25)
sources: arc_agi, artificial_analysis, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swerebench, terminal_bench
missing: BUILD/GSO
deepseek-v4-flash · deepseek · idea 63.9 / 63.9 · plan 68.8 / 68.8 · build 73.0 / 73.0 · review 76.8 (raw / adjusted)

group breakdown

A_B 58.5 (17 / 25)
A_I 74.4 (12 / 25)
A_P 57.8 (14 / 25)
A_R 64.6 (18 / 25)
BUILD 75.6 (10 / 25)
CRE 58.8 (17 / 25)
GEN 55.2 (14 / 25)
LM_ARENA_REVIEW_PROXY 88.0 (5 / 25)
OPS_long 85.7 (8 / 25)
OPS_precision 90.2 (4 / 25)
OPS_review 87.3 (5 / 25)
PLAN 80.1 (5 / 25)

metrics

AI_canary_health 89.5 (1 / 7)
AI_code 23.3 (8 / 23)
AI_complexity 73.0 (4 / 23)
AI_context_awareness 0.0 (18 / 25)
AI_correctness 93.2 (10 / 23)
AI_edge_cases 67.6 (14 / 23)
AI_efficiency 73.5 (12 / 23)
AI_hallucination_resistance 0.0 (24 / 25)
AI_memory_retention 0.0 (21 / 25)
AI_parameter_accuracy 80.6 (8 / 25)
AI_plan_coherence 15.1 (12 / 25)
AI_recovery 99.6 (9 / 23)
AI_refusal 100.0 (8 / 23)
AI_spec 100.0 (8 / 23)
AI_stability 47.4 (20 / 23)
AI_task_completion 75.2 (10 / 25)
AI_tool_selection 76.3 (8 / 25)
ARC_AGI_2 11.9 (12 / 23)
ArtificialAnalysisCoding 45.7 (13 / 24)
ArtificialAnalysisIntelligence 59.4 (13 / 24)
ArtificialAnalysisReasoning 76.8 (9 / 24)
BlendedCost 100.0 (1 / 25)
ContextWindow 71.5 (23 / 25)
CopilotArenaOrLMArenaCode 87.6 (6 / 25)
GDPval 67.5 (13 / 25)
GPQA_HLE_Reasoning 76.8 (9 / 24)
IFBench 100.0 (1 / 24)
LMArenaCreativeOrOpenEnded 58.8 (17 / 25)
LMArenaSearchDocument 88.0 (5 / 23)
LMArenaText 58.8 (17 / 25)
LongContextRecall 52.3 (18 / 24)
MCPAtlas 92.5 (3 / 17)
OutputSpeed 81.9 (11 / 24)
SWEBenchMultilingual 58.6 (10 / 20)
SWEBenchPro 95.0 (3 / 22)
SWEBenchVerified 95.0 (4 / 24)
SWEComposite 82.6 (12 / 25)
SWERebench 73.1 (12 / 24)
SciCode 47.4 (14 / 24)
SonarBugDensity 92.5 (3 / 23)
SonarComposite 80.6 (4 / 25)
SonarFunctionalSkill 66.8 (15 / 23)
SonarIssueDensity 92.5 (2 / 23)
SonarVulnerabilityDensity 81.6 (7 / 23)
TTFT 99.7 (2 / 24)
Tau2Bench 94.2 (7 / 24)
TerminalBench 60.9 (10 / 25)
sources: aistupidlevel, artificial_analysis, lmarena, openrouter, overrides
missing: BUILD/GSO
gpt-5.4 · openai · idea 71.1 / 71.1 · plan 51.9 / 51.9 · build 71.6 / 71.6 · review 61.6 (raw / adjusted)

group breakdown

A_B 66.4 (4 / 25)
A_I 77.5 (3 / 25)
A_P 62.6 (8 / 25)
A_R 83.7 (3 / 25)
BUILD 74.0 (12 / 25)
CRE 76.7 (9 / 25)
GEN 44.8 (17 / 25)
LM_ARENA_REVIEW_PROXY 17.1 (20 / 25)
OPS_long 92.0 (3 / 25)
OPS_precision 89.4 (5 / 25)
OPS_review 90.5 (3 / 25)
PLAN 43.0 (17 / 25)

metrics

AI_code 13.5 (20 / 23)
AI_complexity 52.7 (15 / 23)
AI_context_awareness 7.7 (12 / 25)
AI_correctness 93.2 (14 / 23)
AI_edge_cases 67.6 (18 / 23)
AI_efficiency 83.1 (8 / 23)
AI_hallucination_resistance 100.0 (9 / 25)
AI_memory_retention 18.8 (8 / 25)
AI_parameter_accuracy 88.7 (7 / 25)
AI_plan_coherence 8.8 (19 / 25)
AI_recovery 99.6 (13 / 23)
AI_refusal 100.0 (15 / 23)
AI_spec 100.0 (15 / 23)
AI_stability 78.9 (8 / 23)
AI_task_completion 77.7 (9 / 25)
AI_tool_selection 85.0 (6 / 25)
ARC_AGI_2 75.8 (7 / 23)
ArtificialAnalysisCoding 33.7 (18 / 24)
ArtificialAnalysisIntelligence 27.4 (19 / 24)
ArtificialAnalysisReasoning 15.2 (21 / 24)
BlendedCost 74.9 (16 / 25)
ContextWindow 100.0 (1 / 25)
CopilotArenaOrLMArenaCode 68.6 (14 / 25)
GDPval 88.2 (4 / 25)
GPQA_HLE_Reasoning 15.2 (21 / 24)
GSO 54.0 (7 / 16)
IFBench 60.5 (11 / 24)
LMArenaCreativeOrOpenEnded 76.7 (9 / 25)
LMArenaSearchDocument 17.1 (18 / 23)
LMArenaText 76.7 (9 / 25)
LongContextRecall 24.2 (21 / 24)
MCPAtlas 72.8 (7 / 17)
OutputSpeed 94.1 (3 / 24)
SWEBenchPro 92.5 (10 / 22)
SWEBenchVerified 95.0 (8 / 24)
SWEComposite 88.9 (7 / 25)
SWERebench 83.5 (7 / 24)
SciCode 11.5 (21 / 24)
SonarBugDensity 84.7 (7 / 23)
SonarComposite 60.4 (11 / 25)
SonarFunctionalSkill 66.8 (14 / 23)
SonarIssueDensity 6.8 (21 / 23)
SonarVulnerabilityDensity 100.0 (1 / 23)
TTFT 89.0 (8 / 24)
Tau2Bench 0.0 (24 / 24)
TerminalBench 100.0 (2 / 25)
sources: aistupidlevel, arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench_pro, terminal_bench
missing: SWEComposite/SWEBenchMultilingual
gpt-5.3-codex · openai · idea 68.8 / 68.8 · plan 50.9 / 50.9 · build 71.5 / 71.5 · review 70.8 (raw / adjusted)

group breakdown

A_B 65.3 (6 / 25)
A_I 75.1 (11 / 25)
A_P 58.7 (11 / 25)
A_R 83.0 (7 / 25)
BUILD 75.2 (11 / 25)
CRE 72.7 (13 / 25)
GEN 47.7 (16 / 25)
LM_ARENA_REVIEW_PROXY 92.5 (3 / 25)
OPS_long 86.2 (7 / 25)
OPS_precision 83.5 (11 / 25)
OPS_review 83.9 (12 / 25)
PLAN 42.1 (18 / 25)

metrics

AI_code 13.5 (19 / 23)
AI_complexity 52.7 (14 / 23)
AI_context_awareness 15.3 (10 / 25)
AI_correctness 93.2 (13 / 23)
AI_edge_cases 67.6 (17 / 23)
AI_efficiency 79.0 (9 / 23)
AI_hallucination_resistance 100.0 (8 / 25)
AI_memory_retention 15.6 (9 / 25)
AI_parameter_accuracy 93.5 (6 / 25)
AI_plan_coherence 0.2 (24 / 25)
AI_recovery 99.6 (12 / 23)
AI_refusal 100.0 (14 / 23)
AI_spec 100.0 (14 / 23)
AI_stability 71.1 (12 / 23)
AI_task_completion 63.0 (16 / 25)
AI_tool_selection 70.7 (14 / 25)
ARC_AGI_2 71.9 (8 / 23)
ArtificialAnalysisCoding 43.1 (15 / 24)
ArtificialAnalysisIntelligence 29.5 (18 / 24)
ArtificialAnalysisReasoning 35.1 (17 / 24)
BlendedCost 76.5 (15 / 25)
ContextWindow 85.2 (14 / 25)
CopilotArenaOrLMArenaCode 59.6 (17 / 25)
GDPval 68.1 (12 / 25)
GPQA_HLE_Reasoning 35.1 (17 / 24)
GSO 53.4 (8 / 16)
IFBench 59.9 (12 / 24)
LMArenaCreativeOrOpenEnded 72.7 (13 / 25)
LMArenaSearchDocument 92.5 (3 / 23)
LMArenaText 72.7 (13 / 25)
LongContextRecall 44.8 (19 / 24)
OutputSpeed 90.0 (7 / 24)
SWEBenchPro 95.0 (6 / 22)
SWEBenchVerified 92.5 (10 / 24)
SWEComposite 92.1 (2 / 25)
SWERebench 89.5 (5 / 24)
SciCode 44.5 (16 / 24)
SonarBugDensity 80.8 (8 / 23)
SonarComposite 60.9 (10 / 25)
SonarFunctionalSkill 72.3 (11 / 23)
SonarIssueDensity 7.5 (19 / 23)
SonarVulnerabilityDensity 92.5 (3 / 23)
TTFT 81.2 (13 / 24)
Tau2Bench 7.5 (21 / 24)
TerminalBench 74.3 (6 / 25)
sources: aistupidlevel, artificial_analysis, lmarena, openrouter, overrides, sonar, swerebench, terminal_bench
missing: BUILD/MCPAtlas, PLAN/MCPAtlas, SWEComposite/SWEBenchMultilingual
claude-sonnet-4.6 · anthropic · idea 71.8 / 71.8 · plan 60.1 / 60.1 · build 71.4 / 71.4 · review 65.4 (raw / adjusted)

group breakdown

A_B 66.1 (5 / 25)
A_I 75.6 (10 / 25)
A_P 58.1 (12 / 25)
A_R 83.4 (5 / 25)
BUILD 76.9 (9 / 25)
CRE 73.8 (10 / 25)
GEN 65.7 (9 / 25)
LM_ARENA_REVIEW_PROXY 23.2 (15 / 25)
OPS_long 66.3 (22 / 25)
OPS_precision 53.7 (24 / 25)
OPS_review 63.6 (22 / 25)
PLAN 58.8 (13 / 25)

metrics

AI_canary_health 88.2 (4 / 7)
AI_code 18.4 (12 / 23)
AI_complexity 52.7 (11 / 23)
AI_context_awareness 0.0 (17 / 25)
AI_correctness 93.2 (9 / 23)
AI_edge_cases 67.6 (13 / 23)
AI_efficiency 83.2 (7 / 23)
AI_hallucination_resistance 100.0 (5 / 25)
AI_memory_retention 0.0 (20 / 25)
AI_parameter_accuracy 100.0 (2 / 25)
AI_plan_coherence 1.3 (23 / 25)
AI_recovery 99.6 (8 / 23)
AI_refusal 100.0 (7 / 23)
AI_spec 100.0 (7 / 23)
AI_stability 71.1 (11 / 23)
AI_task_completion 75.1 (12 / 25)
AI_tool_selection 71.3 (13 / 25)
ARC_AGI_2 10.6 (16 / 23)
ArtificialAnalysisCoding 85.5 (4 / 24)
ArtificialAnalysisIntelligence 79.2 (7 / 24)
ArtificialAnalysisReasoning 68.7 (10 / 24)
BlendedCost 74.2 (19 / 25)
ContextWindow 99.3 (12 / 25)
CopilotArenaOrLMArenaCode 94.7 (4 / 25)
GDPval 86.4 (6 / 25)
GPQA_HLE_Reasoning 68.7 (10 / 24)
GSO 30.7 (11 / 16)
IFBench 39.7 (17 / 24)
LMArenaCreativeOrOpenEnded 73.8 (10 / 25)
LMArenaSearchDocument 23.2 (13 / 23)
LMArenaText 73.8 (10 / 25)
LongContextRecall 90.2 (6 / 24)
MCPAtlas 69.8 (11 / 17)
OutputSpeed 79.9 (14 / 24)
SWEBenchMultilingual 95.0 (4 / 20)
SWEBenchPro 76.5 (17 / 22)
SWEBenchVerified 90.3 (14 / 24)
SWEComposite 88.1 (8 / 25)
SWERebench 95.7 (3 / 24)
SciCode 57.8 (11 / 24)
SonarBugDensity 65.8 (11 / 23)
SonarComposite 55.8 (13 / 25)
SonarFunctionalSkill 84.5 (5 / 23)
SonarIssueDensity 22.3 (14 / 23)
SonarVulnerabilityDensity 21.8 (21 / 23)
TTFT 0.0 (24 / 24)
Tau2Bench 51.2 (15 / 24)
TerminalBench 47.4 (16 / 25)
sources: aistupidlevel, arc_agi, artificial_analysis, lmarena, mcp_atlas, openrouter, overrides, sonar, swerebench
missing: none
kimi-k2.6 · moonshot · idea 67.9 / 67.9 · plan 68.7 / 68.7 · build 70.5 / 70.5 · review 74.7 (raw / adjusted)

group breakdown

A_B 54.6 (22 / 25)
A_I 71.1 (14 / 25)
A_P 54.0 (20 / 25)
A_R 64.1 (19 / 25)
BUILD 84.4 (4 / 25)
CRE 78.3 (8 / 25)
GEN 73.9 (5 / 25)
LM_ARENA_REVIEW_PROXY 94.7 (2 / 25)
OPS_long 56.8 (23 / 25)
OPS_precision 72.3 (19 / 25)
OPS_review 70.6 (20 / 25)
PLAN 87.8 (3 / 25)

metrics

AI_canary_health 0.0 (7 / 7)
AI_code 18.4 (13 / 23)
AI_complexity 52.7 (13 / 23)
AI_context_awareness 0.0 (20 / 25)
AI_correctness 93.2 (11 / 23)
AI_edge_cases 67.6 (15 / 23)
AI_efficiency 63.2 (20 / 23)
AI_hallucination_resistance 0.0 (25 / 25)
AI_memory_retention 0.0 (24 / 25)
AI_parameter_accuracy 77.8 (10 / 25)
AI_plan_coherence 10.5 (17 / 25)
AI_recovery 99.6 (10 / 23)
AI_refusal 100.0 (12 / 23)
AI_spec 100.0 (12 / 23)
AI_stability 47.4 (21 / 23)
AI_task_completion 62.6 (17 / 25)
AI_tool_selection 56.9 (20 / 25)
ARC_AGI_2 11.9 (15 / 23)
ArtificialAnalysisCoding 73.1 (8 / 24)
ArtificialAnalysisIntelligence 87.5 (4 / 24)
ArtificialAnalysisReasoning 87.8 (4 / 24)
BlendedCost 88.9 (10 / 25)
ContextWindow 78.7 (15 / 25)
CopilotArenaOrLMArenaCode 94.2 (5 / 25)
GDPval 68.6 (11 / 25)
GPQA_HLE_Reasoning 87.8 (4 / 24)
IFBench 91.6 (6 / 24)
LMArenaCreativeOrOpenEnded 78.3 (8 / 25)
LMArenaSearchDocument 94.7 (2 / 23)
LMArenaText 78.3 (8 / 25)
LongContextRecall 85.3 (8 / 24)
MCPAtlas 92.5 (5 / 17)
OutputSpeed 30.9 (23 / 24)
SWEBenchMultilingual 95.0 (6 / 20)
SWEBenchPro 95.0 (5 / 22)
SWEBenchVerified 95.0 (7 / 24)
SWEComposite 86.2 (10 / 25)
SWERebench 73.1 (15 / 24)
SciCode 94.8 (3 / 24)
SonarBugDensity 92.5 (6 / 23)
SonarComposite 80.6 (7 / 25)
SonarFunctionalSkill 66.8 (18 / 23)
SonarIssueDensity 92.5 (5 / 23)
SonarVulnerabilityDensity 81.6 (10 / 23)
TTFT 95.6 (6 / 24)
Tau2Bench 96.2 (4 / 24)
TerminalBench 74.6 (5 / 25)
sources: aistupidlevel, artificial_analysis, lmarena, openrouter, overrides
missing: BUILD/GSO
claude-opus-4.1 · anthropic · idea 62.3 / 62.3 · plan 65.2 / 65.2 · build 68.6 / 68.6 · review 59.8 (raw / adjusted)

group breakdown

A_B 65.0 (8 / 25)
A_I 75.8 (8 / 25)
A_P 70.8 (2 / 25)
A_R 74.9 (12 / 25)
BUILD 71.8 (13 / 25)
CRE 53.0 (18 / 25)
GEN 65.5 (10 / 25)
LM_ARENA_REVIEW_PROXY 0.1 (24 / 25)
OPS_long 67.2 (21 / 25)
OPS_precision 58.9 (23 / 25)
OPS_review 59.2 (23 / 25)
PLAN 63.0 (12 / 25)

metrics

AI_canary_health 68.1 (6 / 7)
AI_code 13.5 (15 / 23)
AI_complexity 73.0 (3 / 23)
AI_context_awareness 56.9 (3 / 25)
AI_correctness 93.2 (3 / 23)
AI_edge_cases 67.6 (7 / 23)
AI_efficiency 64.7 (19 / 23)
AI_hallucination_resistance 66.7 (18 / 25)
AI_memory_retention 99.7 (2 / 25)
AI_parameter_accuracy 64.1 (14 / 25)
AI_plan_coherence 35.7 (6 / 25)
AI_recovery 99.6 (2 / 23)
AI_refusal 100.0 (1 / 23)
AI_spec 100.0 (1 / 23)
AI_stability 47.4 (18 / 23)
AI_task_completion 63.8 (15 / 25)
AI_tool_selection 73.1 (10 / 25)
ARC_AGI_2 82.8 (6 / 23)
ArtificialAnalysisCoding 71.6 (9 / 24)
ArtificialAnalysisIntelligence 68.3 (10 / 24)
ArtificialAnalysisReasoning 61.7 (12 / 24)
BlendedCost 0.0 (25 / 25)
ContextWindow 74.5 (21 / 25)
CopilotArenaOrLMArenaCode 53.4 (20 / 25)
GDPval 80.6 (8 / 25)
GPQA_HLE_Reasoning 61.7 (12 / 24)
GSO 57.9 (6 / 16)
IFBench 44.4 (14 / 24)
LMArenaCreativeOrOpenEnded 53.0 (18 / 25)
LMArenaSearchDocument 0.1 (22 / 23)
LMArenaText 53.0 (18 / 25)
LongContextRecall 92.5 (4 / 24)
MCPAtlas 92.5 (2 / 17)
OutputSpeed 75.8 (20 / 24)
SWEBenchMultilingual 92.5 (7 / 20)
SWEBenchPro 82.6 (13 / 22)
SWEBenchVerified 92.0 (12 / 24)
SWEComposite 72.9 (15 / 25)
SWERebench 52.3 (19 / 24)
SciCode 69.3 (9 / 24)
SonarBugDensity 70.1 (10 / 23)
SonarComposite 81.5 (3 / 25)
SonarFunctionalSkill 92.5 (3 / 23)
SonarIssueDensity 73.1 (7 / 23)
SonarVulnerabilityDensity 81.6 (6 / 23)
TTFT 71.5 (19 / 24)
Tau2Bench 77.1 (12 / 24)
TerminalBench 29.4 (19 / 25)
sources: aistupidlevel, lmarena, openrouter, overrides, swerebench, terminal_bench
missing: none
gemini-3-flash · google · idea 76.9 / 76.9 · plan 63.9 / 63.9 · build 63.3 / 63.3 · review 61.2 (raw / adjusted)

group breakdown

A_B 60.4 (15 / 25)
A_I 66.4 (20 / 25)
A_P 55.7 (18 / 25)
A_R 69.7 (15 / 25)
BUILD 60.6 (15 / 25)
CRE 86.4 (6 / 25)
GEN 61.7 (11 / 25)
LM_ARENA_REVIEW_PROXY 19.2 (18 / 25)
OPS_long 95.5 (1 / 25)
OPS_precision 92.0 (1 / 25)
OPS_review 93.8 (1 / 25)
PLAN 64.6 (11 / 25)

metrics

AI_code 24.3 (5 / 23)
AI_complexity 37.7 (18 / 23)
AI_context_awareness 21.1 (7 / 25)
AI_correctness 83.3 (20 / 23)
AI_edge_cases 78.0 (5 / 23)
AI_efficiency 71.8 (14 / 23)
AI_hallucination_resistance 92.5 (16 / 25)
AI_memory_retention 7.5 (13 / 25)
AI_parameter_accuracy 58.7 (17 / 25)
AI_plan_coherence 14.5 (14 / 25)
AI_recovery 23.2 (20 / 23)
AI_refusal 92.5 (20 / 23)
AI_spec 92.5 (20 / 23)
AI_stability 92.5 (4 / 23)
AI_task_completion 83.7 (7 / 25)
AI_tool_selection 59.4 (18 / 25)
ARC_AGI_2 3.1 (20 / 23)
ArtificialAnalysisCoding 58.4 (11 / 24)
ArtificialAnalysisIntelligence 59.0 (14 / 24)
ArtificialAnalysisReasoning 82.8 (8 / 24)
BlendedCost 91.5 (9 / 25)
ContextWindow 100.0 (6 / 25)
CopilotArenaOrLMArenaCode 68.6 (15 / 25)
GDPval 38.9 (18 / 25)
GPQA_HLE_Reasoning 82.8 (8 / 24)
GSO 14.0 (14 / 16)
IFBench 96.9 (3 / 24)
LMArenaCreativeOrOpenEnded 86.4 (6 / 25)
LMArenaSearchDocument 19.2 (16 / 23)
LMArenaText 86.4 (6 / 25)
LongContextRecall 68.5 (10 / 24)
MCPAtlas 22.4 (13 / 17)
OutputSpeed 100.0 (1 / 24)
SWEBenchMultilingual 100.0 (1 / 20)
SWEBenchPro 53.0 (19 / 22)
SWEBenchVerified 100.0 (1 / 24)
SWEComposite 74.1 (13 / 25)
SWERebench 76.3 (10 / 24)
SciCode 78.8 (6 / 24)
SonarBugDensity 52.7 (17 / 23)
SonarComposite 54.2 (17 / 25)
SonarFunctionalSkill 78.9 (9 / 23)
SonarIssueDensity 13.2 (17 / 23)
SonarVulnerabilityDensity 58.2 (16 / 23)
TTFT 81.9 (11 / 24)
Tau2Bench 61.7 (13 / 24)
TerminalBench 48.3 (14 / 25)
sources: arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swebench_pro, swerebench, terminal_bench
missing: none
gemini-3-pro · google · idea 77.2 / 77.2 · plan 56.3 / 56.3 · build 63.2 / 63.2 · review 56.9 (raw / adjusted)

group breakdown

A_B 62.2 (13 / 25)
A_I 69.3 (16 / 25)
A_P 56.7 (15 / 25)
A_R 73.2 (13 / 25)
BUILD 66.3 (14 / 25)
CRE 95.0 (3 / 25)
GEN 60.0 (12 / 25)
LM_ARENA_REVIEW_PROXY 19.9 (17 / 25)
OPS_long 45.2 (24 / 25)
OPS_precision 47.9 (25 / 25)
OPS_review 42.9 (25 / 25)
PLAN 55.2 (15 / 25)

metrics

AI_code 19.8 (11 / 23)
AI_complexity 35.5 (20 / 23)
AI_context_awareness 16.0 (9 / 25)
AI_correctness 89.2 (16 / 23)
AI_edge_cases 83.0 (3 / 23)
AI_efficiency 75.6 (11 / 23)
AI_hallucination_resistance 100.0 (7 / 25)
AI_memory_retention 0.0 (22 / 25)
AI_parameter_accuracy 60.3 (15 / 25)
AI_plan_coherence 8.2 (20 / 25)
AI_recovery 18.5 (22 / 23)
AI_refusal 100.0 (10 / 23)
AI_spec 100.0 (10 / 23)
AI_stability 100.0 (2 / 23)
AI_task_completion 89.7 (4 / 25)
AI_tool_selection 61.0 (16 / 25)
ARC_AGI_2 41.9 (9 / 23)
BlendedCost 77.2 (13 / 25)
ContextWindow 0.0 (25 / 25)
CopilotArenaOrLMArenaCode 68.9 (13 / 25)
GDPval 36.9 (20 / 25)
GSO 40.7 (10 / 16)
LMArenaCreativeOrOpenEnded 95.0 (3 / 25)
LMArenaSearchDocument 19.9 (15 / 23)
LMArenaText 95.0 (3 / 25)
MCPAtlas 74.9 (6 / 17)
SWEBenchMultilingual 33.5 (14 / 20)
SWEBenchPro 80.3 (15 / 22)
SWEBenchVerified 82.9 (17 / 24)
SWEComposite 72.1 (16 / 25)
SWERebench 70.6 (17 / 24)
SonarBugDensity 53.2 (14 / 23)
SonarComposite 54.9 (14 / 25)
SonarFunctionalSkill 84.1 (6 / 23)
SonarIssueDensity 6.7 (22 / 23)
SonarVulnerabilityDensity 59.7 (13 / 23)
TerminalBench 61.2 (9 / 25)
sources: aistupidlevel, arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swebench_pro, swerebench, terminal_bench
missing: BUILD/ArtificialAnalysisCoding, BUILD/LongContextRecall, BUILD/SciCode, GEN/ArtificialAnalysisIntelligence, GEN/GPQA_HLE_Reasoning, OPS_long/OutputSpeed, OPS_long/TTFT, OPS_precision/OutputSpeed, OPS_precision/TTFT, OPS_review/OutputSpeed, OPS_review/TTFT, PLAN/ArtificialAnalysisReasoning, PLAN/IFBench, PLAN/LongContextRecall, PLAN/Tau2Bench
gpt-5.2 · openai · idea 68.2 / 68.2 · plan 58.8 / 58.8 · build 57.2 / 57.2 · review 57.0 (raw / adjusted)

group breakdown

A_B 63.1 (12 / 25)
A_I 77.5 (4 / 25)
A_P 63.6 (7 / 25)
A_R 75.2 (11 / 25)
BUILD 51.6 (18 / 25)
CRE 67.8 (14 / 25)
GEN 52.2 (15 / 25)
LM_ARENA_REVIEW_PROXY 20.8 (16 / 25)
OPS_long 86.5 (6 / 25)
OPS_precision 84.2 (10 / 25)
OPS_review 84.6 (10 / 25)
PLAN 55.4 (14 / 25)

metrics

AI_code 18.4 (14 / 23)
AI_complexity 73.0 (5 / 23)
AI_context_awareness 13.9 (11 / 25)
AI_correctness 93.2 (12 / 23)
AI_edge_cases 67.6 (16 / 23)
AI_efficiency 70.5 (16 / 23)
AI_hallucination_resistance 54.7 (19 / 25)
AI_memory_retention 0.0 (25 / 25)
AI_parameter_accuracy 94.0 (5 / 25)
AI_plan_coherence 18.3 (10 / 25)
AI_recovery 99.6 (11 / 23)
AI_refusal 100.0 (13 / 23)
AI_spec 100.0 (13 / 23)
AI_stability 66.2 (13 / 23)
AI_task_completion 100.0 (1 / 25)
AI_tool_selection 66.0 (15 / 25)
ARC_AGI_2 0.0 (23 / 23)
ArtificialAnalysisCoding 63.7 (10 / 24)
ArtificialAnalysisIntelligence 59.7 (12 / 24)
ArtificialAnalysisReasoning 56.3 (13 / 24)
BlendedCost 80.0 (12 / 25)
ContextWindow 85.2 (13 / 25)
CopilotArenaOrLMArenaCode 38.6 (23 / 25)
GDPval 66.3 (15 / 25)
GPQA_HLE_Reasoning 56.3 (13 / 24)
GSO 64.7 (4 / 16)
IFBench 62.7 (10 / 24)
LMArenaCreativeOrOpenEnded 67.8 (14 / 25)
LMArenaSearchDocument 20.8 (14 / 23)
LMArenaText 67.8 (14 / 25)
LongContextRecall 53.7 (17 / 24)
OutputSpeed 90.0 (6 / 24)
SWEBenchMultilingual 0.0 (20 / 20)
SWEBenchPro 38.2 (21 / 22)
SWEBenchVerified 81.3 (19 / 24)
SWEComposite 45.6 (22 / 25)
SciCode 54.5 (12 / 24)
SonarBugDensity 64.2 (12 / 23)
SonarComposite 59.7 (12 / 25)
SonarFunctionalSkill 67.2 (13 / 23)
SonarIssueDensity 35.7 (12 / 23)
SonarVulnerabilityDensity 73.4 (11 / 23)
TTFT 81.2 (12 / 24)
Tau2Bench 48.1 (17 / 24)
TerminalBench 58.2 (11 / 25)
sources: aistupidlevel, arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swebench_pro, terminal_bench
missing: BUILD/MCPAtlas, PLAN/MCPAtlas, SWEComposite/SWERebench
claude-sonnet-4.5 · anthropic · idea 63.9 / 63.9 · plan 51.6 / 51.6 · build 57.1 / 57.1 · review 52.2 (raw / adjusted)

group breakdown

A_B 64.3 (9 / 25)
A_I 77.3 (5 / 25)
A_P 70.9 (1 / 25)
A_R 81.8 (9 / 25)
BUILD 52.8 (17 / 25)
CRE 64.3 (15 / 25)
GEN 42.3 (18 / 25)
LM_ARENA_REVIEW_PROXY 2.3 (23 / 25)
OPS_long 80.7 (13 / 25)
OPS_precision 80.7 (13 / 25)
OPS_review 82.7 (13 / 25)
PLAN 40.8 (19 / 25)

metrics

AI_canary_health 89.1 (3 / 7)
AI_code 13.5 (18 / 23)
AI_complexity 52.7 (10 / 23)
AI_context_awareness 52.0 (4 / 25)
AI_correctness 93.2 (8 / 23)
AI_edge_cases 67.6 (12 / 23)
AI_efficiency 85.4 (4 / 23)
AI_hallucination_resistance 100.0 (4 / 25)
AI_memory_retention 0.0 (19 / 25)
AI_parameter_accuracy 95.5 (4 / 25)
AI_plan_coherence 35.7 (5 / 25)
AI_recovery 99.6 (7 / 23)
AI_refusal 100.0 (6 / 23)
AI_spec 100.0 (6 / 23)
AI_stability 59.3 (15 / 23)
AI_task_completion 89.3 (5 / 25)
AI_tool_selection 89.3 (5 / 25)
ARC_AGI_2 3.7 (18 / 23)
ArtificialAnalysisCoding 45.4 (14 / 24)
ArtificialAnalysisIntelligence 46.0 (15 / 24)
ArtificialAnalysisReasoning 35.1 (18 / 24)
BlendedCost 74.2 (18 / 25)
ContextWindow 99.3 (11 / 25)
CopilotArenaOrLMArenaCode 53.6 (19 / 25)
GDPval 88.6 (3 / 25)
GPQA_HLE_Reasoning 35.1 (18 / 24)
GSO 27.3 (12 / 16)
IFBench 41.5 (16 / 24)
LMArenaCreativeOrOpenEnded 64.3 (15 / 25)
LMArenaSearchDocument 2.3 (21 / 23)
LMArenaText 64.3 (15 / 25)
LongContextRecall 65.6 (12 / 24)
MCPAtlas 6.6 (16 / 17)
OutputSpeed 77.3 (17 / 24)
SWEBenchMultilingual 3.9 (19 / 20)
SWEBenchPro 81.2 (14 / 22)
SWEBenchVerified 85.7 (16 / 24)
SWEComposite 71.6 (17 / 25)
SWERebench 74.9 (11 / 24)
SciCode 46.3 (15 / 24)
SonarBugDensity 2.8 (22 / 23)
SonarComposite 15.6 (24 / 25)
SonarFunctionalSkill 17.2 (21 / 23)
SonarIssueDensity 30.0 (13 / 23)
SonarVulnerabilityDensity 4.6 (22 / 23)
TTFT 79.4 (14 / 24)
Tau2Bench 56.6 (14 / 24)
TerminalBench 37.4 (18 / 25)
sources: aistupidlevel, arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swebench_pro, swerebench, terminal_bench
missing: none
grok-4-latest · xai · idea 66.8 / 66.8 · plan 66.8 / 66.8 · build 56.5 / 56.5 · review 61.4 (raw / adjusted)

group breakdown

A_B 85.0 (2 / 25)
A_I 78.9 (2 / 25)
A_P 59.7 (10 / 25)
A_R 92.0 (2 / 25)
BUILD 43.6 (21 / 25)
CRE 58.9 (16 / 25)
GEN 68.6 (7 / 25)
LM_ARENA_REVIEW_PROXY 18.4 (19 / 25)
OPS_long 71.0 (20 / 25)
OPS_precision 59.7 (21 / 25)
OPS_review 65.5 (21 / 25)
PLAN 71.3 (9 / 25)

metrics

AI_code 99.0 (2 / 23)
AI_complexity 95.3 (2 / 23)
AI_context_awareness 0.0 (22 / 25)
AI_correctness 100.0 (2 / 23)
AI_edge_cases 100.0 (1 / 23)
AI_efficiency 0.3 (22 / 23)
AI_hallucination_resistance 100.0 (11 / 25)
AI_memory_retention 99.7 (4 / 25)
AI_parameter_accuracy 0.0 (22 / 25)
AI_plan_coherence 100.0 (1 / 25)
AI_recovery 100.0 (1 / 23)
AI_refusal 100.0 (17 / 23)
AI_spec 100.0 (17 / 23)
AI_stability 20.8 (22 / 23)
AI_task_completion 0.0 (22 / 25)
AI_tool_selection 0.0 (22 / 25)
ARC_AGI_2 20.7 (11 / 23)
ArtificialAnalysisCoding 53.2 (12 / 24)
ArtificialAnalysisIntelligence 84.9 (5 / 24)
ArtificialAnalysisReasoning 84.0 (6 / 24)
BlendedCost 74.2 (20 / 25)
ContextWindow 78.3 (16 / 25)
CopilotArenaOrLMArenaCode 59.2 (18 / 25)
GDPval 15.1 (23 / 25)
GPQA_HLE_Reasoning 84.0 (6 / 24)
IFBench 100.0 (2 / 24)
LMArenaCreativeOrOpenEnded 58.9 (16 / 25)
LMArenaSearchDocument 18.4 (17 / 23)
LMArenaText 58.9 (16 / 25)
LongContextRecall 58.7 (15 / 24)
OutputSpeed 86.8 (8 / 24)
SWEComposite 45.6 (21 / 25)
SWERebench 39.1 (20 / 24)
SciCode 60.6 (10 / 24)
SonarComposite 50.0 (20 / 25)
TTFT 20.3 (23 / 24)
Tau2Bench 100.0 (1 / 24)
TerminalBench 11.8 (22 / 25)
sources: aistupidlevel, arc_agi, artificial_analysis, lmarena, openrouter, overrides, swerebench, terminal_bench
missing: BUILD/GSO, BUILD/MCPAtlas, PLAN/MCPAtlas, SWEComposite/SWEBenchMultilingual, SWEComposite/SWEBenchPro, SWEComposite/SWEBenchVerified, SonarComposite/SonarBugDensity, SonarComposite/SonarFunctionalSkill, SonarComposite/SonarIssueDensity, SonarComposite/SonarVulnerabilityDensity
claude-sonnet-4anthropic27.227.238.838.852.652.658.2

group breakdown

A_B63.211 / 25A_I73.913 / 25A_P68.54 / 25A_R80.610 / 25BUILD47.119 / 25CRE0.024 / 25GEN14.023 / 25LM_ARENA_REVIEW_PROXY86.29 / 25OPS_long80.315 / 25OPS_precision80.214 / 25OPS_review82.314 / 25PLAN29.621 / 25

metrics

AI_code13.517 / 23AI_complexity52.79 / 23AI_context_awareness90.62 / 25AI_correctness93.27 / 23AI_edge_cases67.611 / 23AI_efficiency83.76 / 23AI_hallucination_resistance100.03 / 25AI_memory_retention0.018 / 25AI_parameter_accuracy96.73 / 25AI_plan_coherence21.98 / 25AI_recovery99.66 / 23AI_refusal100.05 / 23AI_spec100.05 / 23AI_stability47.419 / 23AI_task_completion54.719 / 25AI_tool_selection99.72 / 25ARC_AGI_20.222 / 23ArtificialAnalysisCoding30.719 / 24ArtificialAnalysisIntelligence29.717 / 24ArtificialAnalysisReasoning8.222 / 24BlendedCost74.217 / 25ContextWindow99.310 / 25CopilotArenaOrLMArenaCode53.121 / 25GDPval86.45 / 25GPQA_HLE_Reasoning8.222 / 24GSO6.015 / 16IFBench34.618 / 24LMArenaCreativeOrOpenEnded0.024 / 25LMArenaSearchDocument86.29 / 23LMArenaText0.024 / 25LiveCodeBench0.02 / 2LongContextRecall60.713 / 24MCPAtlas13.114 / 17OutputSpeed77.019 / 24SWEBenchMultilingual10.815 / 20SWEBenchPro78.416 / 22SWEBenchVerified69.922 / 24SWEComposite61.018 / 25SWERebench55.118 / 24SciCode20.320 / 24SonarBugDensity0.023 / 23SonarComposite19.523 / 25SonarFunctionalSkill26.420 / 23SonarIssueDensity35.811 / 23SonarVulnerabilityDensity0.023 / 23TTFT78.215 / 24Tau2Bench26.520 / 24TerminalBench47.415 / 25
sources: aistupidlevel, arc_agi, artificial_analysis, gso, livecodebench, lmarena, openrouter, sonar, swebench, swebench_pro, swerebench
missing: none
glm-4.7 · zai · idea 32.4 / 32.4 · plan 50.4 / 50.4 · build 52.0 / 52.0 · review 55.0

group breakdown

A_B 56.0 (21 / 25) · A_I 55.0 (23 / 25) · A_P 47.0 (23 / 25) · A_R 58.5 (23 / 25) · BUILD 45.2 (20 / 25) · CRE 9.5 (23 / 25) · GEN 35.7 (19 / 25) · LM_ARENA_REVIEW_PROXY 50.0 (12 / 25) · OPS_long 87.2 (5 / 25) · OPS_precision 90.6 (2 / 25) · OPS_review 88.1 (4 / 25) · PLAN 54.3 (16 / 25)

metrics

AI_context_awareness 0.0 (25 / 25) · AI_hallucination_resistance 100.0 (14 / 25) · AI_memory_retention 99.7 (7 / 25) · AI_parameter_accuracy 0.0 (25 / 25) · AI_plan_coherence 100.0 (4 / 25) · AI_task_completion 0.0 (25 / 25) · AI_tool_selection 0.0 (25 / 25) · ArtificialAnalysisCoding 37.9 (17 / 24) · ArtificialAnalysisIntelligence 42.6 (16 / 24) · ArtificialAnalysisReasoning 55.7 (14 / 24) · BlendedCost 96.1 (4 / 25) · ContextWindow 74.7 (19 / 25) · CopilotArenaOrLMArenaCode 69.4 (12 / 25) · GDPval 36.6 (21 / 25) · GPQA_HLE_Reasoning 55.7 (14 / 24) · IFBench 69.9 (9 / 24) · LMArenaCreativeOrOpenEnded 9.5 (23 / 25) · LMArenaText 9.5 (23 / 25) · LongContextRecall 57.2 (16 / 24) · MCPAtlas 0.0 (17 / 17) · OutputSpeed 84.4 (9 / 24) · SWEBenchMultilingual 5.0 (18 / 20) · SWEBenchVerified 90.2 (15 / 24) · SWEComposite 60.7 (19 / 25) · SWERebench 70.9 (16 / 24) · SciCode 48.5 (13 / 24) · SonarBugDensity 51.6 (19 / 23) · SonarComposite 27.3 (22 / 25) · SonarFunctionalSkill 0.0 (23 / 23) · SonarIssueDensity 50.8 (9 / 23) · SonarVulnerabilityDensity 28.7 (19 / 23) · TTFT 99.6 (3 / 24) · Tau2Bench 96.2 (5 / 24) · TerminalBench 27.1 (20 / 25)
sources: aistupidlevel, artificial_analysis, lmarena, mcp_atlas, openrouter, overrides, sonar, swerebench, terminal_bench
missing: A_B/AI_code, A_B/AI_complexity, A_B/AI_correctness, A_B/AI_edge_cases, A_B/AI_efficiency, A_B/AI_recovery, A_B/AI_spec, A_B/AI_stability, A_I/AI_complexity, A_I/AI_correctness, A_I/AI_edge_cases, A_I/AI_efficiency, A_I/AI_recovery, A_I/AI_spec, A_I/AI_stability, A_P/AI_correctness, A_P/AI_efficiency, A_P/AI_recovery, A_P/AI_spec, A_P/AI_stability, A_R/AI_code, A_R/AI_correctness, A_R/AI_edge_cases, A_R/AI_recovery, A_R/AI_spec, A_R/AI_stability, BUILD/GSO, GEN/ARC_AGI_2, LM_ARENA_REVIEW_PROXY/LMArenaSearchDocument, SWEComposite/SWEBenchPro
kimi-k2-0905 · moonshot · idea 24.8 / 24.8 · plan 27.5 / 27.5 · build 51.3 / 51.3 · review 45.7

group breakdown

A_B 32.3 (24 / 25) · A_I 31.1 (25 / 25) · A_P 33.7 (24 / 25) · A_R 22.0 (25 / 25) · BUILD 59.7 (16 / 25) · CRE 27.3 (21 / 25) · GEN 8.6 (25 / 25) · LM_ARENA_REVIEW_PROXY 88.0 (7 / 25) · OPS_long 35.8 (25 / 25) · OPS_precision 59.2 (22 / 25) · OPS_review 55.0 (24 / 25) · PLAN 30.2 (20 / 25)

metrics

AI_canary_health 89.3 (2 / 7) · AI_code 23.3 (9 / 23) · AI_complexity 52.7 (12 / 23) · AI_context_awareness 0.0 (19 / 25) · AI_correctness 0.0 (23 / 23) · AI_edge_cases 0.0 (23 / 23) · AI_efficiency 90.8 (2 / 23) · AI_hallucination_resistance 33.3 (20 / 25) · AI_memory_retention 0.0 (23 / 25) · AI_parameter_accuracy 79.6 (9 / 25) · AI_plan_coherence 5.9 (21 / 25) · AI_recovery 0.0 (23 / 23) · AI_refusal 100.0 (11 / 23) · AI_spec 100.0 (11 / 23) · AI_stability 0.0 (23 / 23) · AI_task_completion 75.2 (11 / 25) · AI_tool_selection 72.7 (11 / 25) · ARC_AGI_2 11.9 (14 / 23) · ArtificialAnalysisCoding 4.0 (22 / 24) · ArtificialAnalysisIntelligence 0.0 (23 / 24) · ArtificialAnalysisReasoning 0.0 (23 / 24) · BlendedCost 92.7 (8 / 25) · ContextWindow 51.7 (24 / 25) · CopilotArenaOrLMArenaCode 87.6 (7 / 25) · GDPval 5.0 (24 / 25) · GPQA_HLE_Reasoning 0.0 (23 / 24) · IFBench 0.0 (23 / 24) · LMArenaCreativeOrOpenEnded 27.3 (21 / 25) · LMArenaSearchDocument 88.0 (7 / 23) · LMArenaText 27.3 (21 / 25) · LongContextRecall 0.0 (23 / 24) · MCPAtlas 92.5 (4 / 17) · OutputSpeed 0.0 (24 / 24) · SWEBenchMultilingual 5.0 (16 / 20) · SWEBenchPro 92.5 (9 / 22) · SWEBenchVerified 78.6 (21 / 24) · SWEComposite 73.9 (14 / 25) · SWERebench 73.1 (14 / 24) · SciCode 0.0 (23 / 24) · SonarBugDensity 92.5 (5 / 23) · SonarComposite 80.6 (6 / 25) · SonarFunctionalSkill 66.8 (17 / 23) · SonarIssueDensity 92.5 (4 / 23) · SonarVulnerabilityDensity 81.6 (9 / 23) · TTFT 94.1 (7 / 24) · Tau2Bench 46.1 (18 / 24) · TerminalBench 44.6 (17 / 25)
sources: aistupidlevel, artificial_analysis, lmarena, openrouter, overrides
missing: BUILD/GSO
gemini-2.5-flash · google · idea 53.0 / 53.0 · plan 33.8 / 33.8 · build 49.4 / 49.4 · review 53.2

group breakdown

A_B 94.9 (1 / 25) · A_I 87.7 (1 / 25) · A_P 64.7 (6 / 25) · A_R 95.1 (1 / 25) · BUILD 29.6 (23 / 25) · CRE 45.9 (20 / 25) · GEN 15.1 (21 / 25) · LM_ARENA_REVIEW_PROXY 78.8 (10 / 25) · OPS_long 94.4 (2 / 25) · OPS_precision 90.3 (3 / 25) · OPS_review 92.7 (2 / 25) · PLAN 16.9 (24 / 25)

metrics

AI_code 100.0 (1 / 23) · AI_complexity 100.0 (1 / 23) · AI_context_awareness 100.0 (1 / 25) · AI_correctness 100.0 (1 / 23) · AI_edge_cases 98.8 (2 / 23) · AI_efficiency 100.0 (1 / 23) · AI_hallucination_resistance 100.0 (6 / 25) · AI_memory_retention 14.9 (10 / 25) · AI_parameter_accuracy 51.6 (19 / 25) · AI_plan_coherence 9.5 (18 / 25) · AI_recovery 73.4 (17 / 23) · AI_refusal 100.0 (9 / 23) · AI_spec 100.0 (9 / 23) · AI_stability 100.0 (1 / 23) · AI_task_completion 11.3 (21 / 25) · AI_tool_selection 74.8 (9 / 25) · ARC_AGI_2 0.8 (21 / 23) · ArtificialAnalysisCoding 0.0 (23 / 24) · ArtificialAnalysisIntelligence 0.7 (22 / 24) · ArtificialAnalysisReasoning 17.6 (19 / 24) · BlendedCost 94.4 (6 / 25) · ContextWindow 100.0 (4 / 25) · CopilotArenaOrLMArenaCode 65.8 (16 / 25) · GDPval 39.5 (17 / 25) · GPQA_HLE_Reasoning 17.6 (19 / 24) · GSO 19.4 (13 / 16) · IFBench 28.1 (20 / 24) · LMArenaCreativeOrOpenEnded 45.9 (20 / 25) · LMArenaSearchDocument 78.8 (10 / 23) · LMArenaText 45.9 (20 / 25) · LiveCodeBench 100.0 (1 / 2) · LongContextRecall 58.7 (14 / 24) · MCPAtlas 26.6 (12 / 17) · OutputSpeed 99.5 (2 / 24) · SWEBenchMultilingual 92.5 (8 / 20) · SWEBenchPro 52.5 (20 / 22) · SWEBenchVerified 0.0 (24 / 24) · SWEComposite 27.6 (25 / 25) · SWERebench 0.0 (24 / 24) · SciCode 23.1 (19 / 24) · SonarBugDensity 52.7 (15 / 23) · SonarComposite 54.2 (15 / 25) · SonarFunctionalSkill 78.9 (7 / 23) · SonarIssueDensity 13.2 (15 / 23) · SonarVulnerabilityDensity 58.2 (14 / 23) · TTFT 75.9 (16 / 24) · Tau2Bench 0.0 (23 / 24) · TerminalBench 0.3 (24 / 25)
sources: aistupidlevel, arc_agi, artificial_analysis, livecodebench, lmarena, openrouter, swebench, swerebench, terminal_bench
missing: none
gemini-2.5-pro · google · idea 26.1 / 26.1 · plan 35.7 / 35.7 · build 46.0 / 46.0 · review 41.7

group breakdown

A_B 60.4 (14 / 25) · A_I 66.4 (19 / 25) · A_P 55.7 (17 / 25) · A_R 69.7 (14 / 25) · BUILD 37.3 (22 / 25) · CRE 0.0 (25 / 25) · GEN 14.5 (22 / 25) · LM_ARENA_REVIEW_PROXY 0.0 (25 / 25) · OPS_long 87.3 (4 / 25) · OPS_precision 83.1 (12 / 25) · OPS_review 86.0 (7 / 25) · PLAN 28.7 (22 / 25)

metrics

AI_code 24.3 (4 / 23) · AI_complexity 37.7 (17 / 23) · AI_context_awareness 21.1 (6 / 25) · AI_correctness 83.3 (19 / 23) · AI_edge_cases 78.0 (4 / 23) · AI_efficiency 71.8 (13 / 23) · AI_hallucination_resistance 92.5 (15 / 25) · AI_memory_retention 7.5 (12 / 25) · AI_parameter_accuracy 58.7 (16 / 25) · AI_plan_coherence 14.5 (13 / 25) · AI_recovery 23.2 (19 / 23) · AI_refusal 92.5 (19 / 23) · AI_spec 92.5 (19 / 23) · AI_stability 92.5 (3 / 23) · AI_task_completion 83.7 (6 / 25) · AI_tool_selection 59.4 (17 / 25) · ARC_AGI_2 3.7 (19 / 23) · ArtificialAnalysisCoding 23.6 (20 / 24) · ArtificialAnalysisIntelligence 14.0 (20 / 24) · ArtificialAnalysisReasoning 44.7 (16 / 24) · BlendedCost 80.0 (11 / 25) · ContextWindow 100.0 (5 / 25) · CopilotArenaOrLMArenaCode 0.0 (24 / 25) · GDPval 37.6 (19 / 25) · GPQA_HLE_Reasoning 44.7 (16 / 24) · GSO 0.0 (16 / 16) · IFBench 18.5 (21 / 24) · LMArenaCreativeOrOpenEnded 0.0 (25 / 25) · LMArenaSearchDocument 0.0 (23 / 23) · LMArenaText 0.0 (25 / 25) · LongContextRecall 67.1 (11 / 24) · MCPAtlas 71.1 (9 / 17) · OutputSpeed 91.3 (5 / 24) · SWEBenchMultilingual 36.0 (12 / 20) · SWEBenchPro 75.7 (18 / 22) · SWEBenchVerified 38.2 (23 / 24) · SWEComposite 36.5 (23 / 25) · SWERebench 1.8 (23 / 24) · SciCode 35.8 (18 / 24) · SonarBugDensity 52.7 (16 / 23) · SonarComposite 54.2 (16 / 25) · SonarFunctionalSkill 78.9 (8 / 23) · SonarIssueDensity 13.2 (16 / 23) · SonarVulnerabilityDensity 58.2 (15 / 23) · TTFT 70.7 (20 / 24) · Tau2Bench 3.2 (22 / 24) · TerminalBench 1.8 (23 / 25)
sources: arc_agi, artificial_analysis, gso, lmarena, openrouter, swebench, swerebench, terminal_bench
missing: none
glm-4.6 · zai · idea 33.7 / 33.7 · plan 29.1 / 29.1 · build 34.2 / 34.2 · review 37.7

group breakdown

A_B 56.0 (20 / 25) · A_I 55.0 (22 / 25) · A_P 47.0 (22 / 25) · A_R 58.5 (22 / 25) · BUILD 20.6 (25 / 25) · CRE 24.2 (22 / 25) · GEN 13.5 (24 / 25) · LM_ARENA_REVIEW_PROXY 50.0 (11 / 25) · OPS_long 80.3 (14 / 25) · OPS_precision 86.9 (7 / 25) · OPS_review 84.4 (11 / 25) · PLAN 17.5 (23 / 25)

metrics

AI_context_awareness 0.0 (24 / 25) · AI_hallucination_resistance 100.0 (13 / 25) · AI_memory_retention 99.7 (6 / 25) · AI_parameter_accuracy 0.0 (24 / 25) · AI_plan_coherence 100.0 (3 / 25) · AI_task_completion 0.0 (24 / 25) · AI_tool_selection 0.0 (24 / 25) · ArtificialAnalysisCoding 15.8 (21 / 24) · ArtificialAnalysisIntelligence 6.1 (21 / 24) · ArtificialAnalysisReasoning 16.2 (20 / 24) · BlendedCost 95.4 (5 / 25) · ContextWindow 74.9 (18 / 25) · CopilotArenaOrLMArenaCode 44.5 (22 / 25) · GDPval 20.1 (22 / 25) · GPQA_HLE_Reasoning 16.2 (20 / 24) · IFBench 4.3 (22 / 24) · LMArenaCreativeOrOpenEnded 24.2 (22 / 25) · LMArenaText 24.2 (22 / 25) · LongContextRecall 9.4 (22 / 24) · MCPAtlas 7.5 (15 / 17) · OutputSpeed 71.9 (21 / 24) · SWEBenchMultilingual 5.0 (17 / 20) · SWEBenchPro 0.0 (22 / 22) · SWEBenchVerified 79.0 (20 / 24) · SWEComposite 27.7 (24 / 25) · SWERebench 38.4 (21 / 24) · SciCode 11.5 (22 / 24) · SonarBugDensity 7.5 (21 / 23) · SonarComposite 10.7 (25 / 25) · SonarFunctionalSkill 7.5 (22 / 23) · SonarIssueDensity 7.5 (20 / 23) · SonarVulnerabilityDensity 29.0 (18 / 23) · TTFT 100.0 (1 / 24) · Tau2Bench 39.7 (19 / 24) · TerminalBench 13.9 (21 / 25)
sources: aistupidlevel, artificial_analysis, lmarena, openrouter, overrides, swebench, swebench_pro, swerebench, terminal_bench
missing: A_B/AI_code, A_B/AI_complexity, A_B/AI_correctness, A_B/AI_edge_cases, A_B/AI_efficiency, A_B/AI_recovery, A_B/AI_spec, A_B/AI_stability, A_I/AI_complexity, A_I/AI_correctness, A_I/AI_edge_cases, A_I/AI_efficiency, A_I/AI_recovery, A_I/AI_spec, A_I/AI_stability, A_P/AI_correctness, A_P/AI_efficiency, A_P/AI_recovery, A_P/AI_spec, A_P/AI_stability, A_R/AI_code, A_R/AI_correctness, A_R/AI_edge_cases, A_R/AI_recovery, A_R/AI_spec, A_R/AI_stability, BUILD/GSO, GEN/ARC_AGI_2, LM_ARENA_REVIEW_PROXY/LMArenaSearchDocument
grok-code-fast-1 · xai · idea 39.8 / 39.8 · plan 24.4 / 24.4 · build 31.7 / 31.7 · review 29.0

group breakdown

A_B 24.1 (25 / 25) · A_I 31.3 (24 / 25) · A_P 33.4 (25 / 25) · A_R 35.9 (24 / 25) · BUILD 29.5 (24 / 25) · CRE 48.1 (19 / 25) · GEN 15.8 (20 / 25) · LM_ARENA_REVIEW_PROXY 15.7 (21 / 25) · OPS_long 84.0 (9 / 25) · OPS_precision 85.5 (9 / 25) · OPS_review 85.0 (9 / 25) · PLAN 12.8 (25 / 25)

metrics

AI_code 0.0 (23 / 23) · AI_complexity 1.5 (22 / 23) · AI_context_awareness 0.0 (23 / 25) · AI_correctness 2.8 (22 / 23) · AI_edge_cases 32.4 (22 / 23) · AI_efficiency 0.0 (23 / 23) · AI_hallucination_resistance 100.0 (12 / 25) · AI_memory_retention 99.7 (5 / 25) · AI_parameter_accuracy 0.0 (23 / 25) · AI_plan_coherence 100.0 (2 / 25) · AI_recovery 27.8 (18 / 23) · AI_refusal 0.0 (23 / 23) · AI_spec 0.0 (23 / 23) · AI_stability 91.9 (6 / 23) · AI_task_completion 0.0 (23 / 25) · AI_tool_selection 0.0 (23 / 25) · ARC_AGI_2 25.1 (10 / 23) · ArtificialAnalysisCoding 0.0 (24 / 24) · ArtificialAnalysisIntelligence 0.0 (24 / 24) · ArtificialAnalysisReasoning 0.0 (24 / 24) · BlendedCost 99.3 (2 / 25) · ContextWindow 78.3 (17 / 25) · CopilotArenaOrLMArenaCode 0.0 (25 / 25) · GDPval 5.0 (25 / 25) · GPQA_HLE_Reasoning 0.0 (24 / 24) · IFBench 0.0 (24 / 24) · LMArenaCreativeOrOpenEnded 48.1 (19 / 25) · LMArenaSearchDocument 15.7 (19 / 23) · LMArenaText 48.1 (19 / 25) · LongContextRecall 0.0 (24 / 24) · OutputSpeed 83.2 (10 / 24) · SWEBenchVerified 82.7 (18 / 24) · SWEComposite 46.1 (20 / 25) · SWERebench 27.9 (22 / 24) · SciCode 0.0 (24 / 24) · SonarComposite 50.0 (21 / 25) · TTFT 82.6 (10 / 24) · Tau2Bench 51.2 (16 / 24) · TerminalBench 0.0 (25 / 25)
sources: aistupidlevel, artificial_analysis, lmarena, openrouter, overrides, swerebench, terminal_bench
missing: BUILD/GSO, BUILD/MCPAtlas, PLAN/MCPAtlas, SWEComposite/SWEBenchMultilingual, SWEComposite/SWEBenchPro, SonarComposite/SonarBugDensity, SonarComposite/SonarFunctionalSkill, SonarComposite/SonarIssueDensity, SonarComposite/SonarVulnerabilityDensity