how scoring works
Each model receives four role scores, each computed from public benchmarks. Idea measures open-ended creativity. Plan measures structured reasoning, function calling, and multi-step decomposition. Build measures implementation skill (SWE-bench, LiveCodeBench, terminal tasks). Review measures preference judgment.
raw vs adjusted
The raw score is the benchmark composite, normalized to 0-100. The adjusted score subtracts a reviewer-reservation penalty: when a vendor's models dominate Review, that lead gets discounted from their Idea, Plan, and Build scores so vendors can't game their own preference evaluations.
Penalty coefficients differ by role: Build is penalized hardest (0.32), Plan moderately (0.18), Idea lightly (0.08). Review is never adjusted — it is the source of the penalty.
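The adjustment described above can be sketched as follows. This is a minimal illustration, not the site's actual implementation: the text gives the coefficients but not how the vendor's Review lead is measured, so `review_lead` is a hypothetical input here, and the clamp at zero is likewise an assumption.

```python
# Penalty coefficients quoted in the text above.
PENALTY = {"idea": 0.08, "plan": 0.18, "build": 0.32}


def adjusted_score(role: str, raw: float, review_lead: float) -> float:
    """Subtract the reviewer-reservation penalty from a raw role score.

    `review_lead` is an assumed input: how far the vendor's models lead
    on Review, in score points. The source does not define how it is
    computed, so it is passed in rather than derived.
    """
    if role == "review":
        # Review is the source of the penalty and is never adjusted.
        return raw
    # Assumption: a vendor with no lead (or a deficit) pays no penalty.
    penalty = PENALTY[role] * max(review_lead, 0.0)
    # Assumption: scores stay on the 0-100 scale.
    return max(raw - penalty, 0.0)


# Illustrative lead of 2.5 points (not a figure stated in the text).
print(round(adjusted_score("build", 72.8, 2.5), 1))
print(round(adjusted_score("idea", 87.5, 2.5), 1))
```

With that illustrative lead, a raw Build score of 72.8 drops by 0.32 × 2.5 = 0.8 points while a raw Idea score of 87.5 drops by only 0.2, which shows why Build is the role most sensitive to the penalty.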
missing data
If a model is missing some metrics within a group, the group score is computed from the present metrics if at least 70% of the group weight is covered. Below that threshold, the score shrinks toward 50 proportional to the missing weight.
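One way to read the missing-data rule is sketched below. The 70% threshold and the neutral value of 50 come from the text; the exact shrinkage formula is an assumption (a linear pull toward 50 weighted by the missing share), and the function and parameter names are hypothetical.

```python
from typing import Mapping, Optional

NEUTRAL = 50.0        # value the score shrinks toward (from the text)
MIN_COVERAGE = 0.70   # minimum covered share of group weight (from the text)


def group_score(weights: Mapping[str, float],
                values: Mapping[str, Optional[float]]) -> float:
    """Weighted group score with shrinkage when too much weight is missing.

    `weights` maps metric name to its weight in the group; `values` maps
    metric name to a 0-100 score, or None / absent when missing.
    """
    total = sum(weights.values())
    present = {m: w for m, w in weights.items() if values.get(m) is not None}
    covered = sum(present.values())
    if total <= 0 or covered == 0:
        # Nothing to average: fall back to the neutral value.
        return NEUTRAL
    mean = sum(values[m] * w for m, w in present.items()) / covered
    coverage = covered / total
    if coverage >= MIN_COVERAGE:
        return mean
    # Assumed formula: move toward 50 in proportion to the missing weight.
    missing = 1.0 - coverage
    return mean + (NEUTRAL - mean) * missing


weights = {"a": 5, "b": 3, "c": 2}
print(group_score(weights, {"a": 80.0, "b": 60.0}))  # 80% covered: plain weighted mean
print(group_score(weights, {"a": 80.0}))             # 50% covered: pulled toward 50
```

Under this reading, a group with no metrics present at all scores exactly 50.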
| model | vendor | idea raw | idea adj | plan raw | plan adj | build raw | build adj | review |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| claude-opus-4.5 | anthropic | 78.2 | 78.2 | 68.2 | 68.2 | 75.2 | 75.2 | 66.7 |
| group breakdown: A_B 85.4 (4/24); A_I 90.0 (2/24); A_P 70.0 (3/24); A_R 84.7 (5/24); BUILD 70.6 (6/24); CRE 73.6 (11/24); GEN 70.5 (6/24); LM_ARENA_REVIEW_PROXY 10.8 (21/24); OPS_long 76.4 (16/24); OPS_precision 74.6 (14/24); OPS_review 73.7 (14/24); PLAN 65.8 (8/24). metrics: AI_canary_health 89.5 (1/7); AI_code 90.1 (4/22); AI_complexity 81.7 (4/22); AI_context_awareness 55.0 (3/24); AI_correctness 100.0 (2/22); AI_edge_cases 100.0 (2/22); AI_efficiency 100.0 (1/22); AI_hallucination_resistance 20.0 (19/24); AI_memory_retention 10.2 (11/24); AI_parameter_accuracy 100.0 (1/24); AI_plan_coherence 10.7 (17/24); AI_recovery 100.0 (2/22); AI_refusal 100.0 (2/22); AI_safety_compliance 72.9 (18/24); AI_spec 100.0 (2/22); AI_stability 92.6 (4/22); AI_task_completion 65.3 (14/24); AI_tool_selection 100.0 (1/24); ArtificialAnalysisCoding 75.1 (6/21); ArtificialAnalysisIntelligence 71.5 (7/21); ArtificialAnalysisReasoning 63.7 (9/21); ContextWindow 74.7 (21/24); CopilotArenaOrLMArenaCode 76.5 (6/22); GDPval 73.8 (6/11); GPQA_HLE_Reasoning 63.7 (9/21); IFBench 44.9 (11/21); InverseCost 61.9 (20/24); InverseTTFT 74.5 (14/19); LMArenaCreativeOrOpenEnded 73.6 (11/24); LMArenaSearchDocument 10.8 (16/19); LMArenaText 73.6 (11/24); LongContextRecall 100.0 (1/21); OutputSpeed 80.3 (11/19); SWEBenchPro 60.5 (5/14); SWEBenchVerified 65.2 (9/18); SWEComposite 65.6 (6/24); SWERebench 76.3 (8/20); SciCode 72.7 (7/21); SonarFunctionalSkill 100.0 (1/17); SonarIssueDensity 63.6 (3/17); Tau2Bench 85.2 (8/21); TerminalBench 54.9 (11/22) | | | | | | | | |
| gemini-3.1-pro-preview | | 87.5 | 87.3 | 82.9 | 82.5 | 72.8 | 72.0 | 78.9 |
| group breakdown: A_B 68.0 (10/24); A_I 68.3 (16/24); A_P 63.0 (9/24); A_R 70.4 (16/24); BUILD 74.9 (3/24); CRE 100.0 (2/24); GEN 100.0 (1/24); LM_ARENA_REVIEW_PROXY 89.8 (4/24); OPS_long 82.0 (8/24); OPS_precision 73.5 (16/24); OPS_review 69.7 (16/24); PLAN 94.1 (2/24). metrics: AI_code 62.3 (10/22); AI_complexity 68.4 (8/22); AI_context_awareness 14.6 (8/24); AI_correctness 92.5 (16/22); AI_edge_cases 63.5 (11/22); AI_efficiency 88.7 (6/22); AI_hallucination_resistance 92.5 (8/24); AI_memory_retention 92.5 (4/24); AI_parameter_accuracy 92.5 (7/24); AI_plan_coherence 79.0 (8/24); AI_recovery 92.5 (10/22); AI_refusal 92.5 (17/22); AI_safety_compliance 92.5 (13/24); AI_spec 92.5 (17/22); AI_stability 7.5 (20/22); AI_task_completion 61.3 (18/24); AI_tool_selection 10.3 (19/24); ARC_AGI_2 100.0 (1/17); ArtificialAnalysisCoding 100.0 (1/21); ArtificialAnalysisIntelligence 100.0 (2/21); ArtificialAnalysisReasoning 100.0 (1/21); ContextWindow 100.0 (6/24); CopilotArenaOrLMArenaCode 73.0 (7/22); GDPval 23.8 (9/11); GPQA_HLE_Reasoning 100.0 (1/21); IFBench 97.5 (3/21); InverseCost 77.3 (13/24); InverseTTFT 40.9 (18/19); LMArenaCreativeOrOpenEnded 100.0 (2/24); LMArenaSearchDocument 89.8 (4/19); LMArenaText 100.0 (2/24); LongContextRecall 100.0 (2/21); MCPAtlas 66.8 (6/13); OutputSpeed 93.0 (5/19); SWEBenchPro 61.1 (4/14); SWEBenchVerified 78.0 (3/18); SWEComposite 75.4 (4/24); SWERebench 100.0 (2/20); SciCode 100.0 (2/21); SonarFunctionalSkill 86.3 (8/17); SonarIssueDensity 18.7 (13/17); Tau2Bench 99.3 (4/21); TerminalBench 89.9 (3/22) | | | | | | | | |
| gpt-5.5 | openai | 81.1 | 81.1 | 81.5 | 81.5 | 70.8 | 70.8 | 74.6 |
| group breakdown: A_B 61.5 (12/24); A_I 73.6 (9/24); A_P 59.8 (12/24); A_R 77.1 (9/24); BUILD 74.4 (4/24); CRE 82.0 (7/24); GEN 94.4 (3/24); LM_ARENA_REVIEW_PROXY 27.4 (14/24); OPS_long 81.7 (10/24); OPS_precision 80.0 (9/24); OPS_review 77.5 (11/24); PLAN 95.4 (1/24). metrics: AI_code 34.0 (17/22); AI_complexity 36.8 (18/22); AI_context_awareness 0.0 (20/24); AI_correctness 94.1 (13/22); AI_edge_cases 62.6 (17/22); AI_efficiency 44.8 (12/22); AI_hallucination_resistance 60.0 (14/24); AI_memory_retention 0.0 (24/24); AI_parameter_accuracy 95.8 (3/24); AI_plan_coherence 7.9 (19/24); AI_recovery 91.9 (16/22); AI_refusal 100.0 (13/22); AI_safety_compliance 100.0 (10/24); AI_spec 100.0 (13/22); AI_stability 78.9 (8/22); AI_task_completion 87.0 (9/24); AI_tool_selection 84.9 (5/24); ARC_AGI_2 98.1 (2/17); ArtificialAnalysisCoding 100.0 (2/21); ArtificialAnalysisIntelligence 98.1 (3/21); ArtificialAnalysisReasoning 100.0 (2/21); ContextWindow 100.0 (2/24); CopilotArenaOrLMArenaCode 71.4 (8/22); GPQA_HLE_Reasoning 100.0 (2/21); IFBench 80.7 (6/21); InverseCost 50.6 (23/24); InverseTTFT 81.7 (8/19); LMArenaCreativeOrOpenEnded 82.0 (7/24); LMArenaSearchDocument 27.4 (9/19); LMArenaText 82.0 (7/24); LongContextRecall 98.0 (3/21); OutputSpeed 82.4 (9/19); SWEBenchVerified 95.0 (1/18); SWEComposite 63.5 (8/24); SciCode 94.5 (4/21); SonarFunctionalSkill 70.7 (13/17); SonarIssueDensity 46.0 (4/17); Tau2Bench 90.5 (7/21); TerminalBench 100.0 (2/22) | | | | | | | | |
| kimi-k2.6 | moonshot | 73.0 | 73.0 | 70.1 | 70.1 | 67.1 | 67.1 | 76.5 |
| group breakdown: A_B 56.6 (17/24); A_I 70.9 (11/24); A_P 53.8 (16/24); A_R 69.8 (17/24); BUILD 74.2 (5/24); CRE 77.6 (8/24); GEN 73.7 (5/24); LM_ARENA_REVIEW_PROXY 92.2 (2/24); OPS_long 58.0 (20/24); OPS_precision 59.8 (18/24); OPS_review 60.3 (18/24); PLAN 86.7 (3/24). metrics: AI_code 37.0 (13/22); AI_complexity 36.8 (15/22); AI_context_awareness 0.0 (16/24); AI_correctness 94.1 (9/22); AI_edge_cases 62.6 (13/22); AI_efficiency 22.9 (19/22); AI_hallucination_resistance 20.0 (21/24); AI_memory_retention 0.0 (20/24); AI_parameter_accuracy 66.7 (17/24); AI_plan_coherence 7.9 (18/24); AI_recovery 91.9 (12/22); AI_refusal 100.0 (9/22); AI_safety_compliance 66.7 (19/24); AI_spec 100.0 (9/22); AI_stability 71.2 (10/22); AI_task_completion 72.5 (11/24); AI_tool_selection 52.9 (14/24); ARC_AGI_2 11.9 (9/17); ArtificialAnalysisCoding 72.8 (7/21); ArtificialAnalysisIntelligence 87.5 (4/21); ArtificialAnalysisReasoning 87.6 (4/21); ContextWindow 78.4 (14/24); CopilotArenaOrLMArenaCode 94.7 (4/22); GDPval 54.7 (8/11); GPQA_HLE_Reasoning 87.6 (4/21); IFBench 94.5 (4/21); InverseCost 87.1 (9/24); LMArenaCreativeOrOpenEnded 77.6 (8/24); LMArenaSearchDocument 92.2 (2/19); LMArenaText 77.6 (8/24); LongContextRecall 85.3 (7/21); MCPAtlas 92.5 (2/13); SWEBenchVerified 77.0 (4/18); SWEComposite 62.7 (10/24); SWERebench 72.9 (11/20); SciCode 94.5 (3/21); SonarFunctionalSkill 79.2 (12/17); SonarIssueDensity 92.5 (2/17); Tau2Bench 100.0 (1/21); TerminalBench 74.9 (5/22) | | | | | | | | |
| claude-sonnet-4.6 | anthropic | 74.4 | 74.4 | 61.7 | 61.7 | 66.9 | 66.9 | 60.2 |
| group breakdown: A_B 69.0 (7/24); A_I 82.1 (6/24); A_P 65.9 (5/24); A_R 72.8 (13/24); BUILD 67.8 (8/24); CRE 73.1 (13/24); GEN 65.5 (7/24); LM_ARENA_REVIEW_PROXY 22.6 (15/24); OPS_long 66.7 (18/24); OPS_precision 54.3 (21/24); OPS_review 49.0 (23/24); PLAN 56.6 (12/24). metrics: AI_code 66.7 (6/22); AI_complexity 60.0 (9/22); AI_context_awareness 15.7 (4/24); AI_correctness 100.0 (4/22); AI_edge_cases 54.7 (19/22); AI_efficiency 59.1 (9/22); AI_hallucination_resistance 0.0 (24/24); AI_memory_retention 0.0 (17/24); AI_parameter_accuracy 94.4 (4/24); AI_plan_coherence 24.7 (11/24); AI_recovery 100.0 (4/22); AI_refusal 100.0 (4/22); AI_safety_compliance 100.0 (4/24); AI_spec 100.0 (4/22); AI_stability 90.2 (5/22); AI_task_completion 67.9 (13/24); AI_tool_selection 94.9 (3/24); ARC_AGI_2 10.6 (10/17); ArtificialAnalysisCoding 85.1 (4/21); ArtificialAnalysisIntelligence 79.1 (6/21); ArtificialAnalysisReasoning 68.7 (8/21); ContextWindow 99.3 (11/24); CopilotArenaOrLMArenaCode 93.4 (5/22); GDPval 82.5 (4/11); GPQA_HLE_Reasoning 68.7 (8/21); IFBench 41.0 (13/21); InverseCost 74.4 (18/24); InverseTTFT 0.0 (19/19); LMArenaCreativeOrOpenEnded 73.1 (13/24); LMArenaSearchDocument 22.6 (10/19); LMArenaText 73.1 (13/24); LongContextRecall 90.2 (5/21); MCPAtlas 65.1 (7/13); OutputSpeed 80.7 (10/19); SWEBenchPro 53.8 (7/14); SWEBenchVerified 63.4 (10/18); SWEComposite 66.4 (5/24); SWERebench 95.8 (3/20); SciCode 57.9 (8/21); SonarFunctionalSkill 92.9 (4/17); SonarIssueDensity 24.3 (10/17); Tau2Bench 53.3 (12/21); TerminalBench 47.5 (14/22) | | | | | | | | |
| gpt-5.4 | openai | 70.4 | 70.4 | 55.2 | 55.2 | 66.0 | 66.0 | 62.8 |
| group breakdown: A_B 57.5 (16/24); A_I 69.4 (12/24); A_P 58.1 (13/24); A_R 76.3 (10/24); BUILD 68.4 (7/24); CRE 77.5 (9/24); GEN 45.2 (16/24); LM_ARENA_REVIEW_PROXY 17.1 (20/24); OPS_long 92.8 (3/24); OPS_precision 91.2 (2/24); OPS_review 89.7 (5/24); PLAN 50.7 (15/24). metrics: AI_code 34.0 (16/22); AI_complexity 0.0 (22/22); AI_context_awareness 0.0 (19/24); AI_correctness 94.1 (12/22); AI_edge_cases 62.6 (16/22); AI_efficiency 48.3 (11/22); AI_hallucination_resistance 60.0 (13/24); AI_memory_retention 0.0 (23/24); AI_parameter_accuracy 91.0 (9/24); AI_plan_coherence 2.4 (22/24); AI_recovery 91.9 (15/22); AI_refusal 100.0 (12/22); AI_safety_compliance 100.0 (9/24); AI_spec 100.0 (12/22); AI_stability 71.2 (11/22); AI_task_completion 87.0 (8/24); AI_tool_selection 83.7 (6/24); ARC_AGI_2 76.9 (5/17); ArtificialAnalysisCoding 33.7 (15/21); ArtificialAnalysisIntelligence 27.4 (16/21); ArtificialAnalysisReasoning 15.5 (18/21); ContextWindow 100.0 (1/24); CopilotArenaOrLMArenaCode 67.4 (11/22); GPQA_HLE_Reasoning 15.5 (18/21); IFBench 62.5 (9/21); InverseCost 75.0 (15/24); InverseTTFT 90.4 (6/19); LMArenaCreativeOrOpenEnded 77.5 (9/24); LMArenaSearchDocument 17.1 (15/19); LMArenaText 77.5 (9/24); LongContextRecall 24.5 (18/21); MCPAtlas 68.2 (4/13); OutputSpeed 95.0 (3/19); SWEBenchPro 88.4 (2/14); SWEBenchVerified 72.3 (5/18); SWEComposite 82.0 (2/24); SWERebench 83.5 (7/20); SciCode 12.0 (18/21); SonarFunctionalSkill 82.6 (11/17); SonarIssueDensity 13.2 (14/17); Tau2Bench 0.0 (21/21); TerminalBench 100.0 (1/22) | | | | | | | | |
| gemini-3-pro | | 79.3 | 79.1 | 60.9 | 60.4 | 64.6 | 63.8 | 60.7 |
| group breakdown: A_B 78.3 (5/24); A_I 78.6 (7/24); A_P 70.3 (2/24); A_R 82.0 (6/24); BUILD 58.9 (10/24); CRE 95.0 (3/24); GEN 60.1 (10/24); LM_ARENA_REVIEW_PROXY 19.3 (18/24); OPS_long 45.2 (23/24); OPS_precision 46.6 (23/24); OPS_review 50.5 (22/24); PLAN 55.1 (13/24). metrics: AI_code 64.5 (7/22); AI_complexity 71.7 (5/22); AI_context_awareness 8.3 (10/24); AI_correctness 100.0 (6/22); AI_edge_cases 65.9 (8/22); AI_efficiency 95.6 (3/22); AI_hallucination_resistance 100.0 (1/24); AI_memory_retention 100.0 (1/24); AI_parameter_accuracy 100.0 (2/24); AI_plan_coherence 84.1 (5/24); AI_recovery 100.0 (6/22); AI_refusal 100.0 (7/22); AI_safety_compliance 100.0 (7/24); AI_spec 100.0 (7/22); AI_stability 0.0 (21/22); AI_task_completion 63.3 (15/24); AI_tool_selection 3.3 (20/24); ARC_AGI_2 42.4 (6/17); ContextWindow 0.0 (24/24); CopilotArenaOrLMArenaCode 67.8 (10/22); InverseCost 77.3 (12/24); LMArenaCreativeOrOpenEnded 95.0 (3/24); LMArenaSearchDocument 19.3 (13/19); LMArenaText 95.0 (3/24); MCPAtlas 69.7 (3/13); SWEBenchMultilingual 33.5 (4/6); SWEBenchPro 53.7 (8/14); SWEBenchVerified 52.1 (13/18); SWEComposite 54.5 (12/24); SWERebench 70.3 (13/20); SonarFunctionalSkill 92.7 (5/17); SonarIssueDensity 13.1 (15/17); TerminalBench 61.4 (8/22) | | | | | | | | |
| glm-5.1 | zai | 73.2 | 73.2 | 62.2 | 62.2 | 64.5 | 64.5 | 70.2 |
| group breakdown: A_B 54.2 (20/24); A_I 63.3 (18/24); A_P 52.4 (18/24); A_R 62.6 (18/24); BUILD 67.7 (9/24); CRE 86.7 (5/24); GEN 57.6 (11/24); LM_ARENA_REVIEW_PROXY 85.9 (5/24); OPS_long 81.9 (9/24); OPS_precision 86.5 (7/24); OPS_review 88.7 (7/24); PLAN 69.5 (7/24). metrics: AI_code 38.9 (12/22); AI_complexity 38.8 (12/22); AI_context_awareness 7.5 (11/24); AI_correctness 87.4 (17/22); AI_edge_cases 60.7 (18/22); AI_efficiency 27.0 (17/22); AI_hallucination_resistance 24.5 (18/24); AI_memory_retention 7.5 (12/24); AI_parameter_accuracy 64.2 (18/24); AI_plan_coherence 14.3 (16/24); AI_recovery 85.7 (17/22); AI_refusal 92.5 (18/22); AI_safety_compliance 64.2 (20/24); AI_spec 92.5 (18/22); AI_stability 68.0 (12/22); AI_task_completion 69.1 (12/24); AI_tool_selection 52.5 (15/24); ARC_AGI_2 5.2 (11/17); ArtificialAnalysisCoding 39.5 (13/21); ArtificialAnalysisIntelligence 60.5 (8/21); ArtificialAnalysisReasoning 54.0 (13/21); ContextWindow 74.9 (19/24); CopilotArenaOrLMArenaCode 96.2 (3/22); GDPval 63.0 (7/11); GPQA_HLE_Reasoning 54.0 (13/21); IFBench 86.8 (5/21); InverseCost 93.0 (6/24); InverseTTFT 100.0 (1/19); LMArenaCreativeOrOpenEnded 86.7 (5/24); LMArenaSearchDocument 85.9 (5/19); LMArenaText 86.7 (5/24); LongContextRecall 41.2 (17/21); MCPAtlas 100.0 (1/13); OutputSpeed 75.2 (17/19); SWEBenchMultilingual 50.9 (3/6); SWEBenchVerified 60.5 (11/18); SWEComposite 63.2 (9/24); SWERebench 100.0 (1/20); SciCode 40.4 (14/21); SonarFunctionalSkill 84.3 (9/17); SonarIssueDensity 100.0 (1/17); Tau2Bench 100.0 (3/21); TerminalBench 56.0 (10/22) | | | | | | | | |
| grok-4-latest | xai | 75.2 | 75.2 | 52.0 | 52.0 | 63.8 | 63.8 | 58.9 |
| group breakdown: A_B 87.8 (2/24); A_I 85.9 (5/24); A_P 61.5 (10/24); A_R 89.1 (2/24); BUILD 47.4 (15/24); CRE 76.3 (10/24); GEN 49.4 (15/24); LM_ARENA_REVIEW_PROXY 18.6 (19/24); OPS_long 77.5 (14/24); OPS_precision 77.5 (12/24); OPS_review 77.3 (12/24); PLAN 38.0 (18/24). metrics: AI_code 100.0 (2/22); AI_complexity 100.0 (2/22); AI_context_awareness 0.0 (21/24); AI_correctness 100.0 (7/22); AI_edge_cases 100.0 (5/22); AI_efficiency 0.0 (22/22); AI_hallucination_resistance 100.0 (2/24); AI_memory_retention 85.5 (5/24); AI_parameter_accuracy 0.0 (21/24); AI_plan_coherence 100.0 (1/24); AI_recovery 39.3 (19/22); AI_refusal 100.0 (14/22); AI_safety_compliance 0.0 (21/24); AI_spec 100.0 (14/22); AI_stability 100.0 (3/22); AI_task_completion 0.0 (21/24); AI_tool_selection 0.0 (21/24); ARC_AGI_2 21.0 (8/17); ArtificialAnalysisCoding 51.5 (10/21); ArtificialAnalysisIntelligence 40.3 (14/21); ArtificialAnalysisReasoning 57.0 (10/21); ContextWindow 78.4 (15/24); CopilotArenaOrLMArenaCode 57.0 (15/22); GPQA_HLE_Reasoning 57.0 (10/21); IFBench 33.1 (15/21); InverseCost 74.4 (19/24); InverseTTFT 78.5 (10/19); LMArenaCreativeOrOpenEnded 76.3 (10/24); LMArenaSearchDocument 18.6 (14/19); LMArenaText 76.3 (10/24); LongContextRecall 77.0 (8/21); OutputSpeed 77.4 (14/19); SWEComposite 47.7 (19/24); SWERebench 38.3 (16/20); SciCode 51.9 (10/21); Tau2Bench 51.5 (14/21); TerminalBench 11.6 (19/22) | | | | | | | | |
| gemini-3-flash | | 76.4 | 76.2 | 65.5 | 65.1 | 60.8 | 60.1 | 60.3 |
| group breakdown: A_B 68.0 (9/24); A_I 68.3 (15/24); A_P 63.0 (8/24); A_R 70.4 (15/24); BUILD 51.7 (12/24); CRE 86.2 (6/24); GEN 61.6 (9/24); LM_ARENA_REVIEW_PROXY 19.5 (17/24); OPS_long 94.9 (1/24); OPS_precision 91.6 (1/24); OPS_review 90.2 (3/24); PLAN 64.7 (9/24). metrics: AI_code 62.3 (9/22); AI_complexity 68.4 (7/22); AI_context_awareness 14.6 (7/24); AI_correctness 92.5 (15/22); AI_edge_cases 63.5 (10/22); AI_efficiency 88.7 (5/22); AI_hallucination_resistance 92.5 (7/24); AI_memory_retention 92.5 (3/24); AI_parameter_accuracy 92.5 (6/24); AI_plan_coherence 79.0 (7/24); AI_recovery 92.5 (9/22); AI_refusal 92.5 (16/22); AI_safety_compliance 92.5 (12/24); AI_spec 92.5 (16/22); AI_stability 7.5 (19/22); AI_task_completion 61.3 (17/24); AI_tool_selection 10.3 (18/24); ARC_AGI_2 3.0 (14/17); ArtificialAnalysisCoding 58.3 (9/21); ArtificialAnalysisIntelligence 58.9 (11/21); ArtificialAnalysisReasoning 82.7 (6/21); ContextWindow 100.0 (5/24); CopilotArenaOrLMArenaCode 67.4 (12/22); GDPval 5.0 (11/11); GPQA_HLE_Reasoning 82.7 (6/21); IFBench 100.0 (2/21); InverseCost 91.5 (8/24); InverseTTFT 80.2 (9/19); LMArenaCreativeOrOpenEnded 86.2 (6/24); LMArenaSearchDocument 19.5 (12/19); LMArenaText 86.2 (6/24); LongContextRecall 68.6 (9/21); MCPAtlas 22.3 (9/13); OutputSpeed 99.4 (2/19); SWEBenchMultilingual 100.0 (1/6); SWEBenchPro 31.0 (12/14); SWEBenchVerified 68.4 (7/18); SWEComposite 58.1 (11/24); SWERebench 76.1 (9/20); SciCode 78.7 (6/21); SonarFunctionalSkill 86.3 (7/17); SonarIssueDensity 18.7 (12/17); Tau2Bench 64.2 (10/21); TerminalBench 48.4 (12/22) | | | | | | | | |
| claude-opus-4.7 | anthropic | 65.7 | 65.7 | 64.9 | 64.9 | 59.2 | 59.2 | 57.3 |
| group breakdown: A_B 13.8 (23/24); A_I 17.8 (23/24); A_P 31.1 (23/24); A_R 9.4 (24/24); BUILD 87.6 (1/24); CRE 95.0 (4/24); GEN 97.0 (2/24); LM_ARENA_REVIEW_PROXY 100.0 (1/24); OPS_long 74.3 (17/24); OPS_precision 69.0 (17/24); OPS_review 65.6 (17/24); PLAN 79.9 (4/24). metrics: AI_canary_health 88.2 (3/7); AI_code 0.0 (22/22); AI_complexity 22.2 (20/22); AI_context_awareness 15.2 (5/24); AI_correctness 20.1 (20/22); AI_edge_cases 0.0 (21/22); AI_efficiency 36.0 (13/22); AI_hallucination_resistance 0.0 (23/24); AI_memory_retention 30.5 (10/24); AI_parameter_accuracy 35.2 (20/24); AI_plan_coherence 21.9 (12/24); AI_recovery 0.0 (21/22); AI_refusal 0.0 (22/22); AI_safety_compliance 100.0 (2/24); AI_spec 0.0 (22/22); AI_stability 58.1 (15/22); AI_task_completion 100.0 (1/24); AI_tool_selection 57.6 (13/24); ARC_AGI_2 94.0 (3/17); ArtificialAnalysisCoding 90.3 (3/21); ArtificialAnalysisIntelligence 100.0 (1/21); ArtificialAnalysisReasoning 95.6 (3/21); ContextWindow 99.3 (8/24); CopilotArenaOrLMArenaCode 100.0 (2/22); GDPval 95.0 (1/11); GPQA_HLE_Reasoning 95.6 (3/21); IFBench 46.6 (10/21); InverseCost 61.9 (22/24); InverseTTFT 49.1 (16/19); LMArenaCreativeOrOpenEnded 95.0 (4/24); LMArenaSearchDocument 100.0 (1/19); LMArenaText 95.0 (4/24); LongContextRecall 88.2 (6/21); OutputSpeed 78.8 (12/19); SWEBenchPro 95.0 (1/14); SWEBenchVerified 94.6 (2/18); SWEComposite 92.7 (1/24); SWERebench 85.3 (6/20); SciCode 100.0 (1/21); SonarFunctionalSkill 98.4 (2/17); SonarIssueDensity 2.4 (17/17); Tau2Bench 83.1 (9/21); TerminalBench 78.6 (4/22) | | | | | | | | |
| claude-opus-4.1 | anthropic | 61.5 | 61.5 | 49.0 | 49.0 | 58.7 | 58.7 | 51.4 |
| group breakdown: A_B 85.7 (3/24); A_I 86.9 (4/24); A_P 61.4 (11/24); A_R 85.3 (4/24); BUILD 48.4 (14/24); CRE 52.7 (17/24); GEN 50.7 (14/24); LM_ARENA_REVIEW_PROXY 0.0 (23/24); OPS_long 48.7 (22/24); OPS_precision 46.2 (24/24); OPS_review 42.5 (24/24); PLAN 43.0 (16/24). metrics: AI_canary_health 69.1 (7/7); AI_code 93.3 (3/22); AI_complexity 100.0 (1/22); AI_context_awareness 0.0 (12/24); AI_correctness 100.0 (1/22); AI_edge_cases 100.0 (1/22); AI_efficiency 87.3 (7/22); AI_hallucination_resistance 40.0 (15/24); AI_memory_retention 0.0 (13/24); AI_parameter_accuracy 71.5 (16/24); AI_plan_coherence 19.1 (14/24); AI_recovery 100.0 (1/22); AI_refusal 100.0 (1/22); AI_safety_compliance 100.0 (1/24); AI_spec 100.0 (1/22); AI_stability 61.9 (13/22); AI_task_completion 83.4 (10/24); AI_tool_selection 72.0 (11/24); ContextWindow 74.7 (20/24); CopilotArenaOrLMArenaCode 52.1 (17/22); InverseCost 0.0 (24/24); LMArenaCreativeOrOpenEnded 52.7 (17/24); LMArenaSearchDocument 0.0 (18/19); LMArenaText 52.7 (17/24); SWEComposite 50.3 (15/24); SWERebench 51.7 (15/20); TerminalBench 29.3 (16/22) | | | | | | | | |
| claude-sonnet-4.5 | anthropic | 68.2 | 68.2 | 56.1 | 56.1 | 58.5 | 58.5 | 55.1 |
| group breakdown: A_B 78.1 (6/24); A_I 88.1 (3/24); A_P 78.0 (1/24); A_R 86.5 (3/24); BUILD 46.8 (16/24); CRE 64.0 (15/24); GEN 42.2 (17/24); LM_ARENA_REVIEW_PROXY 1.7 (22/24); OPS_long 79.6 (12/24); OPS_precision 79.5 (11/24); OPS_review 78.3 (9/24); PLAN 42.4 (17/24). metrics: AI_canary_health 78.1 (6/7); AI_code 66.8 (5/22); AI_complexity 59.1 (10/22); AI_context_awareness 100.0 (1/24); AI_correctness 100.0 (3/22); AI_edge_cases 100.0 (3/22); AI_efficiency 78.5 (8/22); AI_hallucination_resistance 40.0 (16/24); AI_memory_retention 0.0 (16/24); AI_parameter_accuracy 61.3 (19/24); AI_plan_coherence 30.3 (10/24); AI_recovery 100.0 (3/22); AI_refusal 100.0 (3/22); AI_safety_compliance 73.6 (17/24); AI_spec 100.0 (3/22); AI_stability 100.0 (1/22); AI_task_completion 100.0 (2/24); AI_tool_selection 83.1 (7/24); ARC_AGI_2 3.6 (12/17); ArtificialAnalysisCoding 45.3 (12/21); ArtificialAnalysisIntelligence 46.0 (12/21); ArtificialAnalysisReasoning 35.3 (15/21); ContextWindow 99.3 (10/24); CopilotArenaOrLMArenaCode 52.3 (16/22); GDPval 88.3 (2/11); GPQA_HLE_Reasoning 35.3 (15/21); IFBench 43.0 (12/21); InverseCost 74.4 (17/24); InverseTTFT 76.3 (11/19); LMArenaCreativeOrOpenEnded 64.0 (15/24); LMArenaSearchDocument 1.7 (17/19); LMArenaText 64.0 (15/24); LongContextRecall 65.7 (11/21); MCPAtlas 8.0 (11/13); OutputSpeed 76.5 (16/19); SWEBenchMultilingual 3.9 (5/6); SWEBenchPro 54.5 (6/14); SWEBenchVerified 54.7 (12/18); SWEComposite 53.5 (14/24); SWERebench 74.7 (10/20); SciCode 46.4 (13/21); SonarFunctionalSkill 53.6 (15/17); SonarIssueDensity 29.8 (9/17); Tau2Bench 58.9 (11/21); TerminalBench 37.3 (15/22) | | | | | | | | |
| gpt-5.3-codex | openai | 67.7 | 67.7 | 55.8 | 55.8 | 58.4 | 58.4 | 69.1 |
| group breakdown: A_B 58.8 (14/24); A_I 69.2 (13/24); A_P 52.5 (17/24); A_R 77.6 (8/24); BUILD 58.0 (11/24); CRE 73.4 (12/24); GEN 55.8 (12/24); LM_ARENA_REVIEW_PROXY 91.6 (3/24); OPS_long 58.0 (21/24); OPS_precision 59.3 (20/24); OPS_review 58.8 (21/24); PLAN 58.4 (10/24). metrics: AI_code 24.9 (19/22); AI_complexity 36.8 (17/22); AI_context_awareness 0.0 (18/24); AI_correctness 94.1 (11/22); AI_edge_cases 62.6 (15/22); AI_efficiency 35.1 (15/22); AI_hallucination_resistance 80.0 (10/24); AI_memory_retention 0.0 (22/24); AI_parameter_accuracy 85.9 (10/24); AI_plan_coherence 2.4 (21/24); AI_recovery 91.9 (14/22); AI_refusal 100.0 (11/22); AI_safety_compliance 88.9 (15/24); AI_spec 100.0 (11/22); AI_stability 59.3 (14/22); AI_task_completion 58.0 (19/24); AI_tool_selection 66.4 (12/24); ContextWindow 85.3 (13/24); CopilotArenaOrLMArenaCode 58.4 (14/22); InverseCost 76.6 (14/24); LMArenaCreativeOrOpenEnded 73.4 (12/24); LMArenaSearchDocument 91.6 (3/19); LMArenaText 73.4 (12/24); SWEBenchVerified 68.9 (6/18); SWEComposite 63.6 (7/24); SWERebench 89.4 (5/20); TerminalBench 74.6 (6/22) | | | | | | | | |
| deepseek-v4-flash | deepseek | 66.2 | 66.2 | 67.0 | 67.0 | 55.6 | 55.6 | 64.6 |
| group breakdown: A_B 60.0 (13/24); A_I 75.8 (8/24); A_P 64.5 (6/24); A_R 72.9 (12/24); BUILD 49.8 (13/24); CRE 58.7 (16/24); GEN 62.8 (8/24); LM_ARENA_REVIEW_PROXY 50.0 (8/24); OPS_long 86.5 (6/24); OPS_precision 89.6 (5/24); OPS_review 91.8 (2/24); PLAN 71.1 (6/24). metrics: AI_canary_health 82.4 (5/7); AI_code 34.0 (15/22); AI_complexity 36.8 (13/22); AI_context_awareness 0.0 (14/24); AI_correctness 94.1 (8/22); AI_edge_cases 62.6 (12/22); AI_efficiency 53.3 (10/22); AI_hallucination_resistance 40.0 (17/24); AI_memory_retention 0.0 (18/24); AI_parameter_accuracy 84.2 (12/24); AI_plan_coherence 41.4 (9/24); AI_recovery 91.9 (11/22); AI_refusal 100.0 (5/22); AI_safety_compliance 100.0 (5/24); AI_spec 100.0 (5/22); AI_stability 71.2 (9/22); AI_task_completion 87.0 (5/24); AI_tool_selection 73.8 (9/24); ArtificialAnalysisCoding 45.6 (11/21); ArtificialAnalysisIntelligence 59.3 (10/21); ArtificialAnalysisReasoning 76.7 (7/21); ContextWindow 71.6 (22/24); GPQA_HLE_Reasoning 76.7 (7/21); IFBench 100.0 (1/21); InverseCost 100.0 (1/24); InverseTTFT 98.9 (3/19); LMArenaCreativeOrOpenEnded 58.7 (16/24); LMArenaText 58.7 (16/24); LongContextRecall 52.5 (16/21); OutputSpeed 83.6 (8/19); SWEComposite 50.0 (16/24); SciCode 47.5 (12/21); Tau2Bench 97.9 (5/21) | | | | | | | | |
| gemini-2.5-flash | | 60.0 | 59.8 | 38.7 | 38.2 | 53.7 | 52.9 | 58.2 |
| group breakdown: A_B 93.1 (1/24); A_I 90.6 (1/24); A_P 66.9 (4/24); A_R 94.9 (1/24); BUILD 24.7 (23/24); CRE 45.5 (19/24); GEN 15.0 (20/24); LM_ARENA_REVIEW_PROXY 76.9 (7/24); OPS_long 94.6 (2/24); OPS_precision 90.7 (3/24); OPS_review 89.2 (6/24); PLAN 13.4 (23/24). metrics: AI_code 100.0 (1/22); AI_complexity 82.7 (3/22); AI_context_awareness 100.0 (2/24); AI_correctness 100.0 (5/22); AI_edge_cases 100.0 (4/22); AI_efficiency 100.0 (2/22); AI_hallucination_resistance 69.9 (11/24); AI_memory_retention 31.9 (9/24); AI_parameter_accuracy 73.3 (14/24); AI_plan_coherence 0.0 (23/24); AI_recovery 100.0 (5/22); AI_refusal 100.0 (6/22); AI_safety_compliance 100.0 (6/24); AI_spec 100.0 (6/22); AI_stability 100.0 (2/22); AI_task_completion 44.3 (20/24); AI_tool_selection 27.5 (16/24); ARC_AGI_2 0.7 (15/17); ArtificialAnalysisCoding 0.0 (20/21); ArtificialAnalysisIntelligence 0.8 (19/21); ArtificialAnalysisReasoning 17.9 (16/21); ContextWindow 100.0 (3/24); CopilotArenaOrLMArenaCode 64.8 (13/22); GDPval 11.8 (10/11); GPQA_HLE_Reasoning 17.9 (16/21); IFBench 29.2 (17/21); InverseCost 94.4 (5/24); InverseTTFT 75.8 (12/19); LMArenaCreativeOrOpenEnded 45.5 (19/24); LMArenaSearchDocument 76.9 (7/19); LMArenaText 45.5 (19/24); LiveCodeBench 100.0 (1/2); LongContextRecall 58.8 (13/21); MCPAtlas 26.4 (8/13); OutputSpeed 100.0 (1/19); SWEBenchPro 33.8 (11/14); SWEBenchVerified 0.0 (18/18); SWEComposite 15.0 (24/24); SWERebench 0.0 (20/20); SciCode 23.5 (16/21); Tau2Bench 0.0 (20/21); TerminalBench 0.0 (21/22) | | | | | | | | |
| claude-opus-4.6 | anthropic | 64.0 | 64.0 | 58.6 | 58.6 | 52.4 | 52.4 | 47.1 |
| group breakdown: A_B 8.8 (24/24); A_I 11.2 (24/24); A_P 24.0 (24/24); A_R 13.2 (23/24); BUILD 78.3 (2/24); CRE 100.0 (1/24); GEN 89.6 (4/24); LM_ARENA_REVIEW_PROXY 32.2 (13/24); OPS_long 78.0 (13/24); OPS_precision 76.8 (13/24); OPS_review 74.7 (13/24); PLAN 72.8 (5/24). metrics: AI_canary_health 83.3 (4/7); AI_code 0.0 (21/22); AI_complexity 0.0 (21/22); AI_context_awareness 9.8 (9/24); AI_correctness 0.0 (21/22); AI_edge_cases 54.5 (20/22); AI_efficiency 0.0 (21/22); AI_hallucination_resistance 0.0 (22/24); AI_memory_retention 0.0 (14/24); AI_parameter_accuracy 71.8 (15/24); AI_plan_coherence 2.4 (20/24); AI_recovery 5.8 (20/22); AI_refusal 0.0 (21/22); AI_safety_compliance 83.7 (16/24); AI_spec 0.0 (21/22); AI_stability 50.8 (16/22); AI_task_completion 93.2 (4/24); AI_tool_selection 100.0 (2/24); ARC_AGI_2 92.2 (4/17); ArtificialAnalysisCoding 76.1 (5/21); ArtificialAnalysisIntelligence 84.0 (5/21); ArtificialAnalysisReasoning 86.3 (5/21); ContextWindow 99.3 (7/24); CopilotArenaOrLMArenaCode 100.0 (1/22); GDPval 78.0 (5/11); GPQA_HLE_Reasoning 86.3 (5/21); IFBench 31.4 (16/21); InverseCost 61.9 (21/24); InverseTTFT 73.6 (15/19); LMArenaCreativeOrOpenEnded 100.0 (1/24); LMArenaSearchDocument 32.2 (8/19); LMArenaText 100.0 (1/24); LongContextRecall 90.2 (4/21); OutputSpeed 76.6 (15/19); SWEBenchMultilingual 90.9 (2/6); SWEBenchPro 76.3 (3/14); SWEBenchVerified 67.9 (8/18); SWEComposite 78.3 (3/24); SWERebench 91.6 (4/20); SciCode 85.8 (5/21); SonarFunctionalSkill 97.4 (3/17); SonarIssueDensity 41.9 (6/17); Tau2Bench 91.2 (6/21); TerminalBench 64.5 (7/22) | | | | | | | | |
| gpt-5.2 | openai | 65.9 | 65.9 | 56.0 | 56.0 | 50.9 | 50.9 | 57.6 |
| group breakdown: A_B 62.8 (11/24); A_I 72.1 (10/24); A_P 57.0 (14/24); A_R 80.8 (7/24); BUILD 41.7 (19/24); CRE 67.9 (14/24); GEN 52.2 (13/24); LM_ARENA_REVIEW_PROXY 20.6 (16/24); OPS_long 58.3 (19/24); OPS_precision 59.8 (19/24); OPS_review 59.5 (20/24); PLAN 56.6 (11/24). metrics: AI_code 37.0 (14/22); AI_complexity 36.8 (16/22); AI_context_awareness 0.0 (17/24); AI_correctness 94.1 (10/22); AI_edge_cases 62.6 (14/22); AI_efficiency 32.5 (16/22); AI_hallucination_resistance 80.0 (9/24); AI_memory_retention 0.0 (21/24); AI_parameter_accuracy 85.2 (11/24); AI_plan_coherence 0.0 (24/24); AI_recovery 91.9 (13/22); AI_refusal 100.0 (10/22); AI_safety_compliance 100.0 (8/24); AI_spec 100.0 (10/22); AI_stability 78.9 (7/22); AI_task_completion 87.0 (7/24); AI_tool_selection 76.3 (8/24); ARC_AGI_2 0.0 (17/17); ArtificialAnalysisCoding 63.4 (8/21); ArtificialAnalysisIntelligence 59.7 (9/21); ArtificialAnalysisReasoning 56.4 (11/21); ContextWindow 85.3 (12/24); CopilotArenaOrLMArenaCode 37.2 (20/22); GPQA_HLE_Reasoning 56.4 (11/21); IFBench 64.7 (8/21); InverseCost 80.1 (11/24); LMArenaCreativeOrOpenEnded 67.9 (14/24); LMArenaSearchDocument 20.6 (11/19); LMArenaText 67.9 (14/24); LongContextRecall 53.9 (15/21); SWEBenchMultilingual 0.0 (6/6); SWEBenchPro 18.6 (13/14); SWEBenchVerified 50.5 (14/18); SWEComposite 28.2 (21/24); SciCode 54.6 (9/21); SonarFunctionalSkill 82.8 (10/17); SonarIssueDensity 33.9 (8/17); Tau2Bench 50.1 (15/21); TerminalBench 58.4 (9/22) | | | | | | | | |
| glm-4.7 | zai | 35.6 | 35.6 | 49.8 | 49.8 | 50.8 | 50.8 | 55.3 |
| group breakdown: A_B 55.4 (19/24); A_I 54.0 (20/24); A_P 46.1 (20/24); A_R 58.5 (20/24); BUILD 42.0 (18/24); CRE 9.2 (22/24); GEN 35.6 (18/24); LM_ARENA_REVIEW_PROXY 50.0 (12/24); OPS_long 87.6 (5/24); OPS_precision 90.1 (4/24); OPS_review 91.9 (1/24); PLAN 53.6 (14/24). metrics: AI_context_awareness 0.0 (24/24); AI_hallucination_resistance 100.0 (5/24); AI_memory_retention 85.5 (8/24); AI_parameter_accuracy 0.0 (24/24); AI_plan_coherence 100.0 (4/24); AI_safety_compliance 0.0 (24/24); AI_task_completion 0.0 (24/24); AI_tool_selection 0.0 (24/24); ArtificialAnalysisCoding 37.9 (14/21); ArtificialAnalysisIntelligence 42.6 (13/21); ArtificialAnalysisReasoning 55.8 (12/21); ContextWindow 74.9 (18/24); CopilotArenaOrLMArenaCode 68.2 (9/22); GPQA_HLE_Reasoning 55.8 (12/21); IFBench 72.2 (7/21); InverseCost 96.1 (3/24); InverseTTFT 99.0 (2/19); LMArenaCreativeOrOpenEnded 9.2 (22/24); LMArenaText 9.2 (22/24); LongContextRecall 57.4 (14/21); MCPAtlas 0.0 (13/13); OutputSpeed 85.3 (7/19); SWEComposite 54.1 (13/24); SWERebench 70.6 (12/20); SciCode 48.6 (11/21); SonarFunctionalSkill 31.3 (16/17); SonarIssueDensity 44.7 (5/17); Tau2Bench 100.0 (2/21); TerminalBench 27.0 (17/22) | | | | | | | | |
| gemini-2.5-pro | | 32.9 | 32.7 | 38.8 | 38.3 | 48.9 | 48.2 | 43.0 |
| group breakdown: A_B 68.0 (8/24); A_I 68.3 (14/24); A_P 63.0 (7/24); A_R 70.4 (14/24); BUILD 34.1 (22/24); CRE 0.0 (24/24); GEN 14.5 (21/24); LM_ARENA_REVIEW_PROXY 0.0 (24/24); OPS_long 82.1 (7/24); OPS_precision 74.4 (15/24); OPS_review 71.0 (15/24); PLAN 21.8 (21/24). metrics: AI_code 62.3 (8/22); AI_complexity 68.4 (6/22); AI_context_awareness 14.6 (6/24); AI_correctness 92.5 (14/22); AI_edge_cases 63.5 (9/22); AI_efficiency 88.7 (4/22); AI_hallucination_resistance 92.5 (6/24); AI_memory_retention 92.5 (2/24); AI_parameter_accuracy 92.5 (5/24); AI_plan_coherence 79.0 (6/24); AI_recovery 92.5 (8/22); AI_refusal 92.5 (15/22); AI_safety_compliance 92.5 (11/24); AI_spec 92.5 (15/22); AI_stability 7.5 (18/22); AI_task_completion 61.3 (16/24); AI_tool_selection 10.3 (17/24); ARC_AGI_2 3.6 (13/17); ArtificialAnalysisCoding 23.6 (17/21); ArtificialAnalysisIntelligence 14.1 (17/21); ArtificialAnalysisReasoning 44.8 (14/21); ContextWindow 100.0 (4/24); CopilotArenaOrLMArenaCode 0.0 (21/22); GPQA_HLE_Reasoning 44.8 (14/21); IFBench 19.3 (18/21); InverseCost 80.1 (10/24); InverseTTFT 43.9 (17/19); LMArenaCreativeOrOpenEnded 0.0 (24/24); LMArenaSearchDocument 0.0 (19/19); LMArenaText 0.0 (24/24); LongContextRecall 67.2 (10/21); MCPAtlas 66.8 (5/13); OutputSpeed 91.4 (6/19); SWEBenchPro 53.2 (9/14); SWEBenchVerified 9.8 (17/18); SWEComposite 27.0 (22/24); SWERebench 0.5 (19/20); SciCode 36.1 (15/21); SonarFunctionalSkill 86.3 (6/17); SonarIssueDensity 18.7 (11/17); Tau2Bench 3.5 (19/21); TerminalBench 1.4 (20/22) | | | | | | | | |
| grok-code-fast-1 | xai | 52.1 | 52.1 | 33.8 | 33.8 | 46.5 | 46.5 | 50.0 |
| group breakdown: A_B 58.3 (15/24); A_I 65.9 (17/24); A_P 54.8 (15/24); A_R 75.6 (11/24); BUILD 34.2 (21/24); CRE 47.8 (18/24); GEN 15.8 (19/24); LM_ARENA_REVIEW_PROXY 50.0 (10/24); OPS_long 90.2 (4/24); OPS_precision 89.1 (6/24); OPS_review 89.7 (4/24); PLAN 11.4 (24/24). metrics: AI_code 26.8 (18/22); AI_complexity 47.0 (11/22); AI_context_awareness 0.0 (22/24); AI_correctness 84.7 (18/22); AI_edge_cases 68.7 (6/22); AI_efficiency 16.2 (20/22); AI_hallucination_resistance 100.0 (3/24); AI_memory_retention 85.5 (6/24); AI_parameter_accuracy 0.0 (22/24); AI_plan_coherence 100.0 (2/24); AI_recovery 100.0 (7/22); AI_refusal 76.1 (19/22); AI_safety_compliance 0.0 (22/24); AI_spec 76.1 (19/22); AI_stability 30.6 (17/22); AI_task_completion 0.0 (22/24); AI_tool_selection 0.0 (22/24); ARC_AGI_2 25.3 (7/17); ArtificialAnalysisCoding 0.0 (21/21); ArtificialAnalysisIntelligence 0.0 (21/21); ArtificialAnalysisReasoning 0.0 (21/21); ContextWindow 78.4 (16/24); CopilotArenaOrLMArenaCode 0.0 (22/22); GPQA_HLE_Reasoning 0.0 (21/21); IFBench 0.0 (21/21); InverseCost 99.3 (2/24); InverseTTFT 84.8 (7/19); LMArenaCreativeOrOpenEnded 47.8 (18/24); LMArenaText 47.8 (18/24); LongContextRecall 0.0 (21/21); OutputSpeed 93.8 (4/19); SWEComposite 45.4 (20/24); SWERebench 27.0 (18/20); SciCode 0.0 (21/21); Tau2Bench 53.3 (13/21); TerminalBench 0.0 (22/22) | | | | | | | | |
| claude-sonnet-4 | anthropic | 20.5 | 20.5 | 34.7 | 34.7 | 39.5 | 39.5 | 46.5 |
| group breakdown: A_B 27.4 (22/24); A_I 33.3 (22/24); A_P 39.4 (21/24); A_R 37.8 (21/24); BUILD 41.5 (20/24); CRE 0.0 (23/24); GEN 14.0 (22/24); LM_ARENA_REVIEW_PROXY 84.1 (6/24); OPS_long 80.0 (11/24); OPS_precision 79.6 (10/24); OPS_review 78.2 (10/24); PLAN 33.4 (19/24). metrics: AI_code 0.6 (20/22); AI_complexity 22.6 (19/22); AI_context_awareness 0.0 (13/24); AI_correctness 43.7 (19/22); AI_edge_cases 67.6 (7/22); AI_efficiency 26.2 (18/22); AI_hallucination_resistance 20.0 (20/24); AI_memory_retention 0.0 (15/24); AI_parameter_accuracy 91.9 (8/24); AI_plan_coherence 19.1 (15/24); AI_recovery 52.1 (18/22); AI_refusal 0.2 (20/22); AI_safety_compliance 100.0 (3/24); AI_spec 0.2 (20/22); AI_stability 82.5 (6/22); AI_task_completion 99.7 (3/24); AI_tool_selection 90.0 (4/24); ARC_AGI_2 0.1 (16/17); ArtificialAnalysisCoding 30.7 (16/21); ArtificialAnalysisIntelligence 29.7 (15/21); ArtificialAnalysisReasoning 8.6 (19/21); ContextWindow 99.3 (9/24); CopilotArenaOrLMArenaCode 52.0 (18/22); GDPval 82.5 (3/11); GPQA_HLE_Reasoning 8.6 (19/21); IFBench 35.8 (14/21); InverseCost 74.4 (16/24); InverseTTFT 75.4 (13/19); LMArenaCreativeOrOpenEnded 0.0 (23/24); LMArenaSearchDocument 84.1 (6/19); LMArenaText 0.0 (23/24); LiveCodeBench 0.0 (2/2); LongContextRecall 60.8 (12/21); MCPAtlas 14.3 (10/13); OutputSpeed 77.5 (13/19); SWEBenchPro 52.1 (10/14); SWEBenchVerified 39.7 (16/18); SWEComposite 48.5 (18/24); SWERebench 54.5 (14/20); SciCode 20.8 (17/21); SonarFunctionalSkill 59.0 (14/17); SonarIssueDensity 34.0 (7/17); Tau2Bench 27.7 (18/21); TerminalBench 47.5 (13/22) | | | | | | | | |
| kimi-k2-0905 | moonshot | 26.4 | 26.4 | 27.8 | 27.8 | 38.7 | 38.7 | 36.1 |
| group breakdown: A_B 32.1 (21/24); A_I 35.2 (21/24); A_P 35.7 (22/24); A_R 28.2 (22/24); BUILD 42.7 (17/24); CRE 26.9 (20/24); GEN 7.9 (24/24); LM_ARENA_REVIEW_PROXY 50.0 (9/24); OPS_long 35.4 (24/24); OPS_precision 53.7 (22/24); OPS_review 60.2 (19/24); PLAN 29.6 (20/24). metrics: AI_canary_health 88.9 (2/7); AI_code 40.0 (11/22); AI_complexity 36.8 (14/22); AI_context_awareness 0.0 (15/24); AI_correctness 0.0 (22/22); AI_edge_cases 0.0 (22/22); AI_efficiency 35.7 (14/22); AI_hallucination_resistance 60.0 (12/24); AI_memory_retention 0.0 (19/24); AI_parameter_accuracy 81.1 (13/24); AI_plan_coherence 21.9 (13/24); AI_recovery 0.0 (22/22); AI_refusal 100.0 (8/22); AI_safety_compliance 88.9 (14/24); AI_spec 100.0 (8/22); AI_stability 0.0 (22/22); AI_task_completion 87.0 (6/24); AI_tool_selection 73.8 (10/24); ArtificialAnalysisCoding 4.2 (19/21); ArtificialAnalysisIntelligence 0.0 (20/21); ArtificialAnalysisReasoning 0.0 (20/21); ContextWindow 53.4 (23/24); GPQA_HLE_Reasoning 0.0 (20/21); IFBench 0.0 (20/21); InverseCost 92.7 (7/24); InverseTTFT 90.7 (5/19); LMArenaCreativeOrOpenEnded 26.9 (20/24); LMArenaText 26.9 (20/24); LongContextRecall 0.0 (20/21); OutputSpeed 0.0 (19/19); SWEComposite 50.0 (17/24); SciCode 0.0 (20/21); Tau2Bench 48.0 (16/21) | | | | | | | | |
| glm-4.6 | zai | 36.8 | 36.8 | 32.0 | 32.0 | 36.4 | 36.4 | 41.5 |
| group breakdown: A_B 55.4 (18/24); A_I 54.0 (19/24); A_P 46.1 (19/24); A_R 58.5 (19/24); BUILD 18.2 (24/24); CRE 23.6 (21/24); GEN 13.4 (23/24); LM_ARENA_REVIEW_PROXY 50.0 (11/24); OPS_long 77.0 (15/24); OPS_precision 83.3 (8/24); OPS_review 86.0 (8/24); PLAN 18.0 (22/24). metrics: AI_context_awareness 0.0 (23/24); AI_hallucination_resistance 100.0 (4/24); AI_memory_retention 85.5 (7/24); AI_parameter_accuracy 0.0 (23/24); AI_plan_coherence 100.0 (3/24); AI_safety_compliance 0.0 (23/24); AI_task_completion 0.0 (23/24); AI_tool_selection 0.0 (23/24); ArtificialAnalysisCoding 15.9 (18/21); ArtificialAnalysisIntelligence 6.1 (18/21); ArtificialAnalysisReasoning 16.5 (17/21); ContextWindow 75.0 (17/24); CopilotArenaOrLMArenaCode 43.0 (19/22); GPQA_HLE_Reasoning 16.5 (17/21); IFBench 4.7 (19/21); InverseCost 95.4 (4/24); InverseTTFT 98.7 (4/19); LMArenaCreativeOrOpenEnded 23.6 (21/24); LMArenaText 23.6 (21/24); LongContextRecall 9.8 (19/21); MCPAtlas 7.5 (12/13); OutputSpeed 66.3 (18/19); SWEBenchPro 0.0 (14/14); SWEBenchVerified 48.4 (15/18); SWEComposite 24.5 (23/24); SWERebench 37.6 (17/20); SciCode 12.0 (19/21); SonarFunctionalSkill 7.5 (17/17); SonarIssueDensity 7.5 (16/17); Tau2Bench 41.3 (17/21); TerminalBench 13.7 (18/22) | | | | | | | | |