Models drift. Agents battle. Math decides.
live · refreshed · 17 sources · 39 models
- gpt-5.5 86.1
- claude-opus-4.7 85.6
- gemini-3.1-pro-preview 83.9
leaders now
idea
- 1. gemini-3.1-pro-preview 98.5 (▲+0.3 since last refresh)
- 2. claude-opus-4.7 95.9 (▲+0.4 since last refresh)
- 3. gemini-3.5-flash 93.9 (▼-0.4 since last refresh)
plan
- 1. gemini-3.1-pro-preview 91.6
- 2. claude-opus-4.8 86.9 (▼-1.5 since last refresh)
- 3. claude-fable-5 86.6 (▼-1.9 since last refresh)
build
- 1. gpt-5.5 87.7
- 2. claude-opus-4.8 83.4 (▲+0.1 since last refresh)
- 3. claude-fable-5 82.4
review
- 1. gpt-5.5 86.5
- 2. claude-fable-5 83.2
- 3. claude-opus-4.8 83.0 (▲+0.1 since last refresh)
how scoring works
Each model gets four role scores, all derived from public benchmarks. Idea measures open-ended creativity. Plan measures structured reasoning, function calling, and multi-step decomposition. Build measures implementation skill, drawing on SWE-bench, LiveCodeBench, and terminal tasks. Review measures preference judgment.
scoring
Each role score is the benchmark composite for that role: scores are normalized to 0-100, and the group scores are then combined by weighted average. See the about page for the full math.
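A minimal sketch of the weighted-average step, under stated assumptions: the page does not publish the group weights or the normalization method here, so the min-max rescaling in `normalize`, the function names, and the example weights are all hypothetical. The two group scores are taken from the gpt-5.5 row; which groups actually feed which role is not stated on this page.

```python
def normalize(value, lo, hi):
    """Rescale a raw benchmark value onto 0-100 and clamp it.

    Min-max scaling is an assumption; the page only says "normalized to 0-100".
    """
    if hi == lo:
        raise ValueError("lo and hi must differ")
    scaled = 100.0 * (value - lo) / (hi - lo)
    return max(0.0, min(100.0, scaled))


def role_score(group_scores, group_weights):
    """Weighted average of the group scores that feed one role."""
    total_weight = sum(group_weights[g] for g in group_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(score * group_weights[g] for g, score in group_scores.items())
    return weighted_sum / total_weight


# Group scores from the gpt-5.5 row; the 0.8 / 0.2 weights are made up.
example = role_score(
    {"BUILD": 89.9, "OPS_precision": 67.1},
    {"BUILD": 0.8, "OPS_precision": 0.2},
)
print(round(example, 2))
```

Because the weights are invented, the printed value is not expected to reproduce any score shown on the page; the about page remains the reference for the real formula.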
missing data
If a model is missing some metrics within a group, the group score blends between two estimates as group coverage rises from 60% to 80%: it starts at the shrink-to-50 estimate and ends at the weighted mean of the metrics that are present. At 80% coverage and above, that present-weight mean is trusted directly.
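The coverage rule can be sketched as follows. This is one possible reading, not the site's implementation: it assumes coverage is the share of metric weight that is present, that "shrink-to-50" means filling the missing weight with 50, and that the blend between 60% and 80% is linear. The function and parameter names are hypothetical.

```python
def group_score(metrics, weights, prior=50.0, lo=0.60, hi=0.80):
    """Group score with missing-metric handling.

    metrics: metric name -> normalized score, or None when missing.
    weights: metric name -> weight within the group.
    """
    total_weight = sum(weights.values())
    present = {m: s for m, s in metrics.items() if s is not None}
    if not present or total_weight <= 0:
        return prior  # nothing to go on: fall back to the neutral value

    present_weight = sum(weights[m] for m in present)
    coverage = present_weight / total_weight

    # Weighted mean over the metrics that are actually present.
    present_mean = sum(weights[m] * s for m, s in present.items()) / present_weight

    # Assumed shrink-to-50: missing weight counts as the neutral value 50.
    shrunk = coverage * present_mean + (1.0 - coverage) * prior

    # Assumed linear ramp: 0 at 60% coverage, 1 at 80% coverage and above.
    trust = min(1.0, max(0.0, (coverage - lo) / (hi - lo)))
    return trust * present_mean + (1.0 - trust) * shrunk


weights = {f"m{i}": 1.0 for i in range(10)}
seven_of_ten = {f"m{i}": (80.0 if i < 7 else None) for i in range(10)}
print(round(group_score(seven_of_ten, weights), 2))
```

With seven of ten equally weighted metrics present, the result sits halfway between the shrunk estimate and the present-weight mean, which is what the 60-80% band implies under the linear-ramp assumption.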
| model | provider | idea | plan | build | review |
| --- | --- | --- | --- | --- | --- |
| gpt-5.5 | openai | 84.3 (▼-0.7) | 86.0 (▼-0.1) | 87.7 | 86.5 |
| group breakdown: BUILD 89.9 (1/39) · CRE 82.7 (10/39) · GEN 91.1 (5/39) · LM_ARENA_REVIEW_PROXY 90.1 (3/39) · OPS_long 71.4 (22/39) · OPS_precision 67.1 (27/39) · OPS_review 69.0 (23/39) · PLAN 85.3 (4/39); metrics: ARC_AGI_2 96.4 (3/31) · ArtificialAnalysisCoding 99.7 (3/37) · ArtificialAnalysisIntelligence 94.9 (5/37) · ArtificialAnalysisReasoning 90.8 (4/37) · BlendedCost 12.7 (38/39) · ContextWindow 100.0 (2/36) · CopilotArenaOrLMArenaCode 63.3 (19/38) · GDPval 93.2 (3/38) · GPQA_HLE_Reasoning 90.8 (4/37) · GSO 94.0 (2/19) · IFBench 73.5 (18/37) · LMArenaCreativeOrOpenEnded 82.7 (10/39) · LMArenaDocument 84.6 (3/33) · LMArenaSearch 95.7 (2/22) · LMArenaText 82.7 (10/39) · LongContextRecall 94.2 (5/37) · MCPAtlas 50.2 (15/34) · OutputSpeed 70.8 (26/33) · SWEAtlasComposite 96.0 (1/39) · SWEAtlasQnA 100.0 (1/21) · SWEAtlasRefactoring 93.2 (2/19) · SWEAtlasTestWriting 95.8 (2/21) · SWEBenchPro 95.0 (13/35) · SWEBenchVerified 95.0 (16/37) · SWEComposite 96.8 (1/39) · SWERebench 99.0 (3/36) · SciCode 83.1 (7/37) · SonarBugDensity 94.4 (2/24) · SonarComposite 68.8 (6/39) · SonarFunctionalSkill 52.3 (15/24) · SonarIssueDensity 58.1 (13/24) · SonarVulnerabilityDensity 96.7 (2/24) · TTFT 80.9 (19/33) · Tau2Bench 83.1 (18/37) · TerminalBench 100.0 (2/39) · TerminalBenchHard 99.6 (3/37) |
| claude-opus-4.8 | anthropic | 80.8 (▼-11.0) | 86.9 (▼-1.5) | 83.4 (▲+0.1) | 83.0 (▲+0.1) |
| group breakdown: BUILD 84.7 (3/39) · CRE 75.9 (13/39) · GEN 93.3 (3/39) · LM_ARENA_REVIEW_PROXY 80.0 (4/39) · OPS_long 72.5 (18/39) · OPS_precision 66.9 (28/39) · OPS_review 69.6 (22/39) · PLAN 85.5 (3/39); metrics: ARC_AGI_2 98.1 (2/31) · ArtificialAnalysisCoding 100.0 (2/37) · ArtificialAnalysisIntelligence 100.0 (2/37) · ArtificialAnalysisReasoning 99.6 (3/37) · BlendedCost 22.1 (35/39) · ContextWindow 98.9 (16/36) · CopilotArenaOrLMArenaCode 100.0 (2/38) · GDPval 95.0 (2/38) · GPQA_HLE_Reasoning 99.6 (3/37) · GSO 92.5 (4/19) · IFBench 50.6 (25/37) · LMArenaCreativeOrOpenEnded 75.9 (13/39) · LMArenaDocument 72.2 (5/33) · LMArenaSearch 87.8 (5/22) · LMArenaText 75.9 (13/39) · LongContextRecall 70.3 (18/37) · MCPAtlas 100.0 (1/34) · OutputSpeed 74.7 (18/33) · SWEAtlasComposite 76.9 (7/39) · SWEAtlasQnA 67.5 (9/21) · SWEAtlasRefactoring 92.5 (4/19) · SWEAtlasTestWriting 65.3 (9/21) · SWEBenchMultilingual 95.0 (4/33) · SWEBenchPro 95.0 (5/35) · SWEBenchVerified 95.0 (8/37) · SWEComposite 88.3 (9/39) · SWERebench 78.4 (15/36) · SciCode 83.1 (5/37) · SonarBugDensity 34.6 (21/24) · SonarComposite 48.9 (31/39) · SonarFunctionalSkill 87.9 (4/24) · SonarIssueDensity 14.2 (20/24) · SonarVulnerabilityDensity 21.7 (19/24) · TTFT 72.2 (31/33) · Tau2Bench 90.5 (14/37) · TerminalBench 79.9 (8/39) · TerminalBenchHard 100.0 (2/37) |
| claude-fable-5 | anthropic | 76.5 (▼-13.8) | 86.6 (▼-1.9) | 82.4 | 83.2 |
| group breakdown: BUILD 85.9 (2/39) · CRE 72.0 (17/39) · GEN 91.3 (4/39) · LM_ARENA_REVIEW_PROXY 78.3 (6/39) · OPS_long 56.2 (35/39) · OPS_precision 37.4 (39/39) · OPS_review 47.3 (36/39) · PLAN 90.6 (1/39); metrics: ARC_AGI_2 90.9 (5/31) · ArtificialAnalysisCoding 100.0 (1/37) · ArtificialAnalysisIntelligence 100.0 (1/37) · ArtificialAnalysisReasoning 100.0 (1/37) · BlendedCost 0.0 (39/39) · ContextWindow 98.9 (13/36) · CopilotArenaOrLMArenaCode 92.5 (8/38) · GDPval 95.0 (1/38) · GPQA_HLE_Reasoning 100.0 (1/37) · GSO 92.5 (3/19) · IFBench 53.8 (24/37) · LMArenaCreativeOrOpenEnded 72.0 (17/39) · LMArenaDocument 68.8 (6/33) · LMArenaSearch 87.8 (4/22) · LMArenaText 72.0 (17/39) · LongContextRecall 82.3 (11/37) · MCPAtlas 92.5 (3/34) · OutputSpeed 75.2 (17/33) · SWEAtlasComposite 76.9 (6/39) · SWEAtlasQnA 67.5 (6/21) · SWEAtlasRefactoring 92.5 (3/19) · SWEAtlasTestWriting 65.3 (6/21) · SWEBenchMultilingual 92.5 (11/33) · SWEBenchPro 95.0 (3/35) · SWEBenchVerified 95.0 (5/37) · SWEComposite 87.9 (11/39) · SWERebench 77.9 (16/36) · SciCode 100.0 (1/37) · SonarBugDensity 34.6 (20/24) · SonarComposite 48.9 (30/39) · SonarFunctionalSkill 87.9 (3/24) · SonarIssueDensity 14.2 (19/24) · SonarVulnerabilityDensity 21.7 (18/24) · TTFT 0.0 (33/33) · Tau2Bench 100.0 (1/37) · TerminalBench 94.4 (3/39) · TerminalBenchHard 100.0 (1/37) |
| claude-opus-4.7 | anthropic | 95.9 (▲+0.4) | 84.5 (▲+0.1) | 79.6 | 82.3 |
| group breakdown: BUILD 80.7 (5/39) · CRE 99.6 (3/39) · GEN 95.1 (2/39) · LM_ARENA_REVIEW_PROXY 96.2 (2/39) · OPS_long 70.4 (26/39) · OPS_precision 66.7 (29/39) · OPS_review 69.0 (24/39) · PLAN 79.9 (6/39); metrics: ARC_AGI_2 92.3 (4/31) · ArtificialAnalysisCoding 87.7 (5/37) · ArtificialAnalysisIntelligence 97.0 (3/37) · ArtificialAnalysisReasoning 86.6 (6/37) · BlendedCost 22.1 (34/39) · ContextWindow 98.9 (15/36) · CopilotArenaOrLMArenaCode 100.0 (1/38) · GDPval 90.2 (4/38) · GPQA_HLE_Reasoning 86.6 (6/37) · GSO 100.0 (1/19) · IFBench 41.1 (26/37) · LMArenaCreativeOrOpenEnded 99.6 (3/39) · LMArenaDocument 98.0 (2/33) · LMArenaSearch 94.4 (3/22) · LMArenaText 99.6 (3/39) · LongContextRecall 84.0 (10/37) · MCPAtlas 86.7 (4/34) · OutputSpeed 69.3 (29/33) · SWEAtlasComposite 79.9 (5/39) · SWEAtlasQnA 67.5 (8/21) · SWEAtlasRefactoring 100.0 (1/19) · SWEAtlasTestWriting 65.3 (8/21) · SWEBenchMultilingual 95.0 (3/33) · SWEBenchPro 95.0 (4/35) · SWEBenchVerified 95.0 (7/37) · SWEComposite 77.0 (23/39) · SciCode 88.4 (4/37) · SonarBugDensity 31.9 (23/24) · SonarComposite 48.7 (32/39) · SonarFunctionalSkill 94.5 (2/24) · SonarIssueDensity 7.9 (21/24) · SonarVulnerabilityDensity 16.7 (22/24) · TTFT 76.1 (27/33) · Tau2Bench 74.0 (24/37) · TerminalBench 100.0 (1/39) · TerminalBenchHard 82.5 (6/37) |
| gpt-5.4 | openai | 67.9 (▼-2.1) | 56.9 (▼-0.3) | 78.1 (▼-0.1) | 66.9 |
| group breakdown: BUILD 81.9 (4/39) · CRE 73.6 (16/39) · GEN 58.3 (22/39) · LM_ARENA_REVIEW_PROXY 61.4 (7/39) · OPS_long 59.3 (31/39) · OPS_precision 61.0 (30/39) · OPS_review 66.0 (27/39) · PLAN 55.4 (27/39); metrics: ARC_AGI_2 75.5 (6/31) · BlendedCost 67.6 (28/39) · ContextWindow 100.0 (1/36) · CopilotArenaOrLMArenaCode 40.9 (30/38) · GDPval 81.7 (6/38) · GSO 54.0 (8/19) · LMArenaCreativeOrOpenEnded 73.6 (16/39) · LMArenaDocument 67.3 (7/33) · LMArenaSearch 55.6 (14/22) · LMArenaText 73.6 (16/39) · MCPAtlas 50.2 (14/34) · SWEAtlasComposite 94.4 (2/39) · SWEAtlasQnA 92.6 (2/21) · SWEAtlasRefactoring 91.7 (5/19) · SWEAtlasTestWriting 100.0 (1/21) · SWEBenchPro 92.5 (21/35) · SWEBenchVerified 95.0 (15/37) · SWEComposite 91.8 (4/39) · SWERebench 90.0 (5/36) · SonarBugDensity 79.8 (6/24) · SonarComposite 62.5 (7/39) · SonarFunctionalSkill 78.9 (6/24) · SonarIssueDensity 0.0 (24/24) · SonarVulnerabilityDensity 100.0 (1/24) · TerminalBench 83.6 (7/39) |
| qwen3.7-max | alibaba | 89.5 | 83.4 | 77.6 | 75.0 |
| group breakdown: BUILD 75.7 (7/39) · CRE 93.4 (6/39) · GEN 80.7 (8/39) · LM_ARENA_REVIEW_PROXY 47.3 (17/39) · OPS_long 92.2 (3/39) · OPS_precision 91.4 (2/39) · OPS_review 91.8 (4/39) · PLAN 84.0 (5/39); metrics: ARC_AGI_2 11.9 (22/31) · ArtificialAnalysisCoding 79.9 (7/37) · ArtificialAnalysisIntelligence 94.5 (6/37) · ArtificialAnalysisReasoning 85.4 (7/37) · BFCL 89.7 (2/15) · BlendedCost 79.1 (20/39) · ContextWindow 98.9 (21/36) · CopilotArenaOrLMArenaCode 98.2 (4/38) · GDPval 65.7 (15/38) · GPQA_HLE_Reasoning 85.4 (7/37) · IFBench 98.8 (3/37) · LMArenaCreativeOrOpenEnded 93.4 (6/39) · LMArenaDocument 44.5 (13/33) · LMArenaText 93.4 (6/39) · LongContextRecall 77.1 (15/37) · MCPAtlas 72.6 (6/34) · MMLUPro 81.6 (8/28) · OutputSpeed 91.9 (6/33) · SWEAtlasComposite 50.0 (21/39) · SWEBenchMultilingual 95.0 (9/33) · SWEBenchPro 95.0 (15/35) · SWEBenchVerified 95.0 (18/37) · SWEComposite 88.4 (7/39) · SWERebench 78.4 (13/36) · SciCode 58.0 (16/37) · SonarBugDensity 92.5 (4/24) · SonarComposite 81.1 (4/39) · SonarFunctionalSkill 69.6 (13/24) · SonarIssueDensity 92.5 (3/24) · SonarVulnerabilityDensity 77.4 (7/24) · TTFT 94.9 (12/33) · Tau2Bench 91.4 (13/37) · TerminalBench 72.6 (11/39) · TerminalBenchHard 80.3 (7/37) |
| gpt-5.3-codex | openai | 65.1 (▼-1.8) | 56.8 (▼-0.2) | 75.0 (▼-0.1) | 65.3 |
| group breakdown: BUILD 78.4 (6/39) · CRE 70.1 (20/39) · GEN 57.0 (23/39) · LM_ARENA_REVIEW_PROXY 59.7 (8/39) · OPS_long 56.3 (34/39) · OPS_precision 58.4 (33/39) · OPS_review 61.2 (33/39) · PLAN 56.3 (24/39); metrics: ARC_AGI_2 71.7 (7/31) · BlendedCost 71.1 (26/39) · ContextWindow 78.1 (26/36) · CopilotArenaOrLMArenaCode 33.6 (34/38) · GDPval 61.1 (22/38) · GSO 53.4 (9/19) · LMArenaCreativeOrOpenEnded 70.1 (20/39) · LMArenaDocument 64.7 (8/33) · LMArenaSearch 54.7 (15/22) · LMArenaText 70.1 (20/39) · SWEAtlasComposite 83.7 (4/39) · SWEAtlasQnA 86.2 (4/21) · SWEAtlasRefactoring 85.4 (7/19) · SWEAtlasTestWriting 78.9 (4/21) · SWEBenchPro 95.0 (12/35) · SWEBenchVerified 92.5 (24/37) · SWEComposite 95.5 (3/39) · SWERebench 97.1 (4/36) · SonarBugDensity 75.3 (8/24) · SonarComposite 60.6 (9/39) · SonarFunctionalSkill 74.6 (10/24) · SonarIssueDensity 7.5 (23/24) · SonarVulnerabilityDensity 92.5 (4/24) · TerminalBench 89.5 (5/39) |
| muse-spark | meta | 87.7 (▲+0.1) | 78.9 | 75.0 | 72.0 |
| group breakdown: BUILD 73.9 (9/39) · CRE 93.1 (7/39) · GEN 77.2 (10/39) · LM_ARENA_REVIEW_PROXY 46.9 (22/39) · OPS_long 85.8 (10/39) · OPS_precision 81.2 (14/39) · OPS_review 83.0 (13/39) · PLAN 79.7 (7/39); metrics: AALiveCodeBench 91.4 (6/17) · ARC_AGI_2 21.2 (13/31) · ArtificialAnalysisCoding 71.5 (11/37) · ArtificialAnalysisIntelligence 78.6 (13/37) · ArtificialAnalysisMath 92.5 (6/17) · ArtificialAnalysisReasoning 81.3 (9/37) · BlendedCost 70.5 (27/39) · ContextWindow 92.5 (24/36) · CopilotArenaOrLMArenaCode 86.8 (10/38) · GDPval 78.1 (10/38) · GPQA_HLE_Reasoning 81.3 (9/37) · GSO 19.4 (16/19) · IFBench 86.6 (12/37) · LMArenaCreativeOrOpenEnded 93.1 (7/39) · LMArenaDocument 32.9 (20/33) · LMArenaSearch 60.9 (13/22) · LMArenaText 93.1 (7/39) · LongContextRecall 80.5 (12/37) · MCPAtlas 100.0 (2/34) · MMLUPro 86.5 (5/28) · OutputSpeed 91.0 (7/33) · SWEAtlasComposite 45.8 (29/39) · SWEAtlasQnA 44.0 (11/21) · SWEAtlasTestWriting 41.9 (11/21) · SWEBenchMultilingual 92.5 (14/33) · SWEBenchPro 100.0 (1/35) · SWEBenchVerified 88.4 (29/37) · SWEComposite 88.6 (5/39) · SWERebench 77.7 (18/36) · SciCode 72.4 (10/37) · SonarBugDensity 55.8 (12/24) · SonarComposite 58.4 (10/39) · SonarFunctionalSkill 74.7 (8/24) · SonarIssueDensity 39.5 (17/24) · SonarVulnerabilityDensity 49.5 (10/24) · TTFT 74.1 (29/33) · Tau2Bench 82.3 (19/37) · TerminalBench 63.7 (20/39) · TerminalBenchHard 65.4 (12/37) |
| claude-opus-4.6 | anthropic | 92.2 | 76.0 | 73.2 | 78.0 |
| group breakdown: BUILD 73.7 (10/39) · CRE 100.0 (1/39) · GEN 81.7 (7/39) · LM_ARENA_REVIEW_PROXY 100.0 (1/39) · OPS_long 71.2 (25/39) · OPS_precision 67.5 (26/39) · OPS_review 69.6 (21/39) · PLAN 73.4 (13/39); metrics: ArtificialAnalysisCoding 73.4 (8/37) · ArtificialAnalysisIntelligence 81.1 (12/37) · ArtificialAnalysisReasoning 77.4 (12/37) · BlendedCost 22.1 (33/39) · ContextWindow 98.9 (14/36) · CopilotArenaOrLMArenaCode 98.7 (3/38) · GDPval 75.7 (11/38) · GPQA_HLE_Reasoning 77.4 (12/37) · GSO 75.3 (5/19) · IFBench 26.5 (32/37) · LMArenaCreativeOrOpenEnded 100.0 (1/39) · LMArenaDocument 100.0 (1/33) · LMArenaSearch 100.0 (1/22) · LMArenaText 100.0 (1/39) · LongContextRecall 85.7 (7/37) · MCPAtlas 76.9 (5/34) · OutputSpeed 70.1 (28/33) · SWEAtlasComposite 67.8 (8/39) · SWEAtlasQnA 70.6 (5/21) · SWEAtlasRefactoring 65.5 (8/19) · SWEAtlasTestWriting 68.0 (5/21) · SWEBenchMultilingual 91.9 (20/33) · SWEBenchPro 95.1 (2/35) · SWEBenchVerified 94.5 (21/37) · SWEComposite 76.7 (25/39) · SciCode 74.5 (9/37) · SonarComposite 50.0 (15/39) · TTFT 77.7 (25/33) · Tau2Bench 83.9 (17/37) · TerminalBench 86.2 (6/39) · TerminalBenchHard 67.5 (10/37) |
| qwen3.7-plus | alibaba | 62.1 (▼-0.3) | 72.7 | 73.1 | 69.9 |
| group breakdown: BUILD 71.4 (11/39) · CRE 57.8 (27/39) · GEN 65.6 (18/39) · LM_ARENA_REVIEW_PROXY 47.3 (18/39) · OPS_long 82.7 (12/39) · OPS_precision 88.0 (7/39) · OPS_review 88.2 (9/39) · PLAN 75.3 (11/39); metrics: ARC_AGI_2 11.9 (23/31) · ArtificialAnalysisCoding 68.2 (14/37) · ArtificialAnalysisIntelligence 82.6 (11/37) · ArtificialAnalysisReasoning 71.8 (15/37) · BFCL 80.3 (6/15) · BlendedCost 88.9 (8/39) · ContextWindow 98.9 (22/36) · CopilotArenaOrLMArenaCode 66.5 (18/38) · GDPval 65.7 (16/38) · GPQA_HLE_Reasoning 71.8 (15/37) · IFBench 92.0 (7/37) · LMArenaCreativeOrOpenEnded 57.8 (27/39) · LMArenaDocument 44.5 (14/33) · LMArenaText 57.8 (27/39) · LongContextRecall 56.7 (25/37) · MCPAtlas 60.3 (13/34) · MMLUPro 83.5 (7/28) · OutputSpeed 72.1 (22/33) · SWEAtlasComposite 50.0 (22/39) · SWEBenchMultilingual 95.0 (10/33) · SWEBenchPro 95.0 (16/35) · SWEBenchVerified 95.0 (19/37) · SWEComposite 88.4 (8/39) · SWERebench 78.4 (14/36) · SciCode 40.3 (22/37) · SonarBugDensity 92.5 (5/24) · SonarComposite 81.1 (5/39) · SonarFunctionalSkill 69.6 (14/24) · SonarIssueDensity 92.5 (4/24) · SonarVulnerabilityDensity 77.4 (8/24) · TTFT 96.5 (8/33) · Tau2Bench 86.4 (16/37) · TerminalBench 73.5 (10/39) · TerminalBenchHard 69.7 (9/37) |
| claude-opus-4.5 | anthropic | 69.4 (▲+0.3) | 66.3 | 72.5 | 64.2 |
| group breakdown: BUILD 74.8 (8/39) · CRE 70.8 (18/39) · GEN 69.4 (15/39) · LM_ARENA_REVIEW_PROXY 47.1 (21/39) · OPS_long 58.4 (32/39) · OPS_precision 54.0 (35/39) · OPS_review 46.1 (37/39) · PLAN 66.1 (18/39); metrics: AALiveCodeBench 85.2 (7/17) · ARC_AGI_2 50.6 (8/31) · ArtificialAnalysisCoding 72.4 (9/37) · ArtificialAnalysisIntelligence 69.6 (19/37) · ArtificialAnalysisMath 87.8 (8/17) · ArtificialAnalysisReasoning 55.5 (22/37) · BFCL 100.0 (1/15) · BlendedCost 22.1 (32/39) · ContextWindow 0.0 (36/36) · CopilotArenaOrLMArenaCode 70.8 (14/38) · GDPval 74.3 (12/38) · GPQA_HLE_Reasoning 55.5 (22/37) · GSO 59.3 (7/19) · IFBench 39.3 (27/37) · LMArenaCreativeOrOpenEnded 70.8 (18/39) · LMArenaDocument 55.2 (9/33) · LMArenaSearch 39.1 (17/22) · LMArenaText 70.8 (18/39) · LongContextRecall 100.0 (1/37) · MCPAtlas 46.8 (19/34) · MMLUPro 98.6 (2/28) · OutputSpeed 73.7 (19/33) · SWEAtlasComposite 65.1 (9/39) · SWEAtlasQnA 67.5 (7/21) · SWEAtlasRefactoring 63.2 (9/19) · SWEAtlasTestWriting 65.3 (7/21) · SWEBenchMultilingual 95.0 (2/33) · SWEBenchPro 77.8 (26/35) · SWEBenchVerified 95.0 (6/37) · SWEComposite 84.1 (14/39) · SWERebench 82.8 (8/36) · SciCode 61.7 (14/37) · SonarBugDensity 56.9 (11/24) · SonarComposite 82.7 (2/39) · SonarFunctionalSkill 100.0 (1/24) · SonarIssueDensity 80.5 (5/24) · SonarVulnerabilityDensity 74.9 (9/24) · TTFT 78.5 (22/33) · Tau2Bench 76.5 (22/37) · TerminalBench 64.1 (18/39) · TerminalBenchHard 69.7 (8/37) |
| kimi-k2.6 | moonshot | 76.0 (▼-0.7) | 78.1 (▼-0.1) | 71.2 | 69.4 |
| group breakdown: BUILD 69.6 (13/39) · CRE 75.7 (15/39) · GEN 76.9 (11/39) · LM_ARENA_REVIEW_PROXY 46.8 (23/39) · OPS_long 74.8 (17/39) · OPS_precision 80.2 (15/39) · OPS_review 75.9 (18/39) · PLAN 78.6 (8/39); metrics: ArtificialAnalysisCoding 70.2 (12/37) · ArtificialAnalysisIntelligence 84.7 (9/37) · ArtificialAnalysisReasoning 78.8 (11/37) · BlendedCost 81.8 (17/39) · ContextWindow 55.5 (29/36) · CopilotArenaOrLMArenaCode 90.2 (9/38) · GDPval 61.5 (21/38) · GPQA_HLE_Reasoning 78.8 (11/37) · IFBench 86.8 (11/37) · LMArenaCreativeOrOpenEnded 75.7 (15/39) · LMArenaDocument 43.5 (18/33) · LMArenaText 75.7 (15/39) · LongContextRecall 80.5 (13/37) · MCPAtlas 68.5 (9/34) · OutputSpeed 70.2 (27/33) · SWEAtlasComposite 50.0 (19/39) · SWEBenchMultilingual 95.0 (8/33) · SWEBenchPro 95.0 (11/35) · SWEBenchVerified 95.0 (14/37) · SWEComposite 82.8 (18/39) · SWERebench 64.5 (26/36) · SciCode 83.1 (6/37) · SonarComposite 50.0 (25/39) · TTFT 98.5 (5/33) · Tau2Bench 94.7 (7/37) · TerminalBench 68.1 (13/39) · TerminalBenchHard 61.1 (13/37) |
| gemini-3.1-pro-preview | | 98.5 (▲+0.3) | 91.6 | 71.1 | 74.4 |
| group breakdown: BUILD 68.5 (15/39) · CRE 100.0 (2/39) · GEN 98.6 (1/39) · LM_ARENA_REVIEW_PROXY 52.4 (9/39) · OPS_long 86.0 (9/39) · OPS_precision 81.8 (12/39) · OPS_review 84.4 (12/39) · PLAN 88.3 (2/39); metrics: ARC_AGI_2 100.0 (1/31) · ArtificialAnalysisCoding 97.4 (4/37) · ArtificialAnalysisIntelligence 96.7 (4/37) · ArtificialAnalysisReasoning 100.0 (2/37) · BFCL 77.0 (10/15) · BlendedCost 71.6 (24/39) · ContextWindow 100.0 (3/36) · CopilotArenaOrLMArenaCode 63.1 (20/38) · GDPval 42.7 (29/38) · GPQA_HLE_Reasoning 100.0 (2/37) · GSO 51.3 (10/19) · IFBench 89.8 (8/37) · LMArenaCreativeOrOpenEnded 100.0 (2/39) · LMArenaDocument 32.0 (21/33) · LMArenaSearch 72.8 (7/22) · LMArenaText 100.0 (2/39) · LongContextRecall 95.9 (4/37) · MCPAtlas 49.1 (17/34) · OutputSpeed 89.4 (9/33) · SWEAtlasComposite 34.6 (30/39) · SWEAtlasQnA 12.7 (15/21) · SWEAtlasTestWriting 36.0 (14/21) · SWEBenchMultilingual 36.0 (24/33) · SWEBenchPro 78.4 (25/35) · SWEBenchVerified 95.0 (12/37) · SWEComposite 80.5 (19/39) · SWERebench 87.9 (6/36) · SciCode 100.0 (2/37) · SonarComposite 50.0 (22/39) · TTFT 73.2 (30/33) · Tau2Bench 93.9 (9/37) · TerminalBench 92.5 (4/39) · TerminalBenchHard 88.9 (4/37) |
| glm-5.1 | zai | 82.9 (▲+0.6) | 71.6 (▲+0.1) | 70.9 | 65.9 |
| group breakdown: BUILD 70.6 (12/39) · CRE 89.5 (8/39) · GEN 72.5 (13/39) · LM_ARENA_REVIEW_PROXY 47.3 (20/39) · OPS_long 71.3 (23/39) · OPS_precision 74.8 (23/39) · OPS_review 64.9 (29/39) · PLAN 70.6 (17/39); metrics: ArtificialAnalysisCoding 58.2 (20/37) · ArtificialAnalysisIntelligence 75.7 (16/37) · ArtificialAnalysisReasoning 55.1 (23/37) · BlendedCost 80.9 (18/39) · ContextWindow 0.9 (35/36) · CopilotArenaOrLMArenaCode 96.5 (5/38) · GDPval 66.6 (13/38) · GPQA_HLE_Reasoning 55.1 (23/37) · IFBench 87.5 (10/37) · LMArenaCreativeOrOpenEnded 89.5 (8/39) · LMArenaDocument 44.5 (16/33) · LMArenaText 89.5 (8/39) · LongContextRecall 43.0 (33/37) · MCPAtlas 71.7 (7/34) · OutputSpeed 78.3 (15/33) · SWEAtlasComposite 50.0 (28/39) · SWEBenchMultilingual 92.5 (19/33) · SWEBenchPro 95.0 (19/35) · SWEBenchVerified 92.5 (26/37) · SWEComposite 96.4 (2/39) · SWERebench 100.0 (2/36) · SciCode 31.2 (28/37) · SonarComposite 50.0 (29/39) · TTFT 99.9 (2/33) · Tau2Bench 99.7 (4/37) · TerminalBench 63.3 (21/39) · TerminalBenchHard 59.0 (18/37) |
| qwen3.6-plus | alibaba | 60.8 (▼-0.3) | 68.1 | 70.8 | 67.7 |
| group breakdown: BUILD 69.2 (14/39) · CRE 59.2 (25/39) · GEN 58.4 (21/39) · LM_ARENA_REVIEW_PROXY 47.3 (16/39) · OPS_long 82.2 (13/39) · OPS_precision 87.0 (9/39) · OPS_review 87.4 (10/39) · PLAN 71.9 (16/39); metrics: ARC_AGI_2 11.9 (21/31) · ArtificialAnalysisCoding 56.5 (21/37) · ArtificialAnalysisIntelligence 70.6 (17/37) · ArtificialAnalysisReasoning 53.3 (24/37) · BlendedCost 87.0 (9/39) · ContextWindow 98.9 (20/36) · CopilotArenaOrLMArenaCode 69.4 (16/38) · GDPval 65.7 (14/38) · GPQA_HLE_Reasoning 53.3 (24/37) · IFBench 84.6 (14/37) · LMArenaCreativeOrOpenEnded 59.2 (25/39) · LMArenaDocument 44.5 (12/33) · LMArenaText 59.2 (25/39) · LongContextRecall 80.5 (14/37) · MCPAtlas 63.8 (12/34) · MMLUPro 83.5 (6/28) · OutputSpeed 72.1 (21/33) · SWEAtlasComposite 50.0 (20/39) · SWEBenchMultilingual 92.5 (16/33) · SWEBenchPro 95.0 (14/35) · SWEBenchVerified 95.0 (17/37) · SWEComposite 88.1 (10/39) · SWERebench 78.4 (12/36) · SciCode 14.7 (32/37) · SonarBugDensity 92.5 (3/24) · SonarComposite 81.1 (3/39) · SonarFunctionalSkill 69.6 (12/24) · SonarIssueDensity 92.5 (2/24) · SonarVulnerabilityDensity 77.4 (6/24) · TTFT 94.7 (13/33) · Tau2Bench 99.7 (3/37) · TerminalBench 60.5 (23/39) · TerminalBenchHard 61.1 (14/37) |
| gemini-3.5-flash | | 93.9 (▼-0.4) | 77.9 | 68.9 | 65.4 |
| group breakdown: BUILD 66.8 (16/39) · CRE 99.5 (4/39) · GEN 82.9 (6/39) · LM_ARENA_REVIEW_PROXY 35.3 (32/39) · OPS_long 92.1 (4/39) · OPS_precision 86.7 (10/39) · OPS_review 88.9 (8/39) · PLAN 73.4 (14/39); metrics: AALiveCodeBench 91.4 (5/17) · ARC_AGI_2 21.2 (12/31) · ArtificialAnalysisCoding 59.8 (18/37) · ArtificialAnalysisIntelligence 88.0 (7/37) · ArtificialAnalysisMath 92.5 (5/17) · ArtificialAnalysisReasoning 88.5 (5/37) · BlendedCost 74.1 (21/39) · ContextWindow 100.0 (9/36) · CopilotArenaOrLMArenaCode 86.1 (11/38) · GDPval 79.7 (9/38) · GPQA_HLE_Reasoning 88.5 (5/37) · GSO 19.4 (15/19) · IFBench 83.0 (15/37) · LMArenaCreativeOrOpenEnded 99.5 (4/39) · LMArenaDocument 9.7 (31/33) · LMArenaSearch 60.9 (12/22) · LMArenaText 99.5 (4/39) · LongContextRecall 87.4 (6/37) · MCPAtlas 18.9 (29/34) · MMLUPro 86.5 (4/28) · OutputSpeed 98.2 (4/33) · SWEAtlasComposite 50.0 (17/39) · SWEBenchMultilingual 92.5 (13/33) · SWEBenchPro 95.0 (7/35) · SWEBenchVerified 88.4 (28/37) · SWEComposite 86.8 (12/39) · SWERebench 77.7 (17/36) · SciCode 80.4 (8/37) · SonarComposite 50.0 (23/39) · TTFT 78.4 (23/33) · Tau2Bench 93.9 (10/37) · TerminalBench 63.7 (19/39) · TerminalBenchHard 48.3 (22/37) |
| deepseek-v4-pro | deepseek | 72.6 (▲+2.3) | 75.9 (▲+0.3) | 68.4 (▼-0.1) | 68.6 |
| group breakdown: BUILD 65.7 (18/39) · CRE 70.7 (19/39) · GEN 73.5 (12/39) · LM_ARENA_REVIEW_PROXY 50.0 (10/39) · OPS_long 84.0 (11/39) · OPS_precision 89.3 (5/39) · OPS_review 89.5 (6/39) · PLAN 75.6 (10/39); metrics: ArtificialAnalysisCoding 71.5 (10/37) · ArtificialAnalysisIntelligence 76.1 (15/37) · ArtificialAnalysisReasoning 74.3 (13/37) · BlendedCost 89.5 (6/39) · ContextWindow 100.0 (5/36) · CopilotArenaOrLMArenaCode 68.6 (17/38) · GPQA_HLE_Reasoning 74.3 (13/37) · IFBench 88.0 (9/37) · LMArenaCreativeOrOpenEnded 70.7 (19/39) · LMArenaText 70.7 (19/39) · LongContextRecall 63.5 (19/37) · MCPAtlas 64.1 (10/34) · MMLUPro 69.1 (11/28) · OutputSpeed 73.6 (20/33) · SWEAtlasComposite 50.0 (14/39) · SWEBenchMultilingual 95.0 (6/33) · SWEBenchPro 95.0 (6/35) · SWEBenchVerified 95.0 (11/37) · SWEComposite 77.0 (24/39) · SciCode 64.4 (13/37) · SonarComposite 50.0 (17/39) · TTFT 97.9 (6/33) · Tau2Bench 95.5 (5/37) · TerminalBench 63.0 (22/39) · TerminalBenchHard 67.5 (11/37) |
| minimax-m3 | minimax | 67.9 (▲+26.3) | 75.3 (▲+3.4) | 64.9 (▲+0.9) | 65.6 (▲+0.4) |
| group breakdown: BUILD 62.5 (21/39) · CRE 67.3 (22/39) · GEN 70.5 (14/39) · LM_ARENA_REVIEW_PROXY 40.0 (25/39) · OPS_long 62.2 (30/39) · OPS_precision 76.1 (19/39) · OPS_review 76.8 (17/39) · PLAN 78.3 (9/39); metrics: ARC_AGI_2 11.9 (20/31) · ArtificialAnalysisCoding 58.2 (19/37) · ArtificialAnalysisIntelligence 87.6 (8/37) · ArtificialAnalysisReasoning 84.6 (8/37) · BlendedCost 89.8 (5/39) · ContextWindow 100.0 (10/36) · CopilotArenaOrLMArenaCode 94.7 (6/38) · GDPval 61.6 (20/38) · GPQA_HLE_Reasoning 84.6 (8/37) · IFBench 100.0 (1/37) · LMArenaCreativeOrOpenEnded 67.3 (22/39) · LMArenaDocument 30.0 (22/33) · LMArenaText 67.3 (22/39) · LongContextRecall 100.0 (2/37) · MCPAtlas 64.1 (11/34) · MMLUPro 68.0 (15/28) · OutputSpeed 36.0 (32/33) · SWEAtlasComposite 14.2 (37/39) · SWEAtlasQnA 10.4 (18/21) · SWEAtlasRefactoring 22.1 (15/19) · SWEAtlasTestWriting 7.5 (20/21) · SWEBenchMultilingual 92.5 (15/33) · SWEBenchPro 95.0 (10/35) · SWEBenchVerified 95.0 (13/37) · SWEComposite 86.8 (13/39) · SWERebench 75.2 (21/36) · SciCode 39.8 (23/37) · SonarBugDensity 62.9 (10/24) · SonarComposite 51.7 (14/39) · SonarFunctionalSkill 40.3 (17/24) · SonarIssueDensity 64.1 (11/24) · SonarVulnerabilityDensity 46.4 (11/24) · TTFT 92.3 (14/33) · Tau2Bench 74.8 (23/37) · TerminalBench 67.0 (15/39) · TerminalBenchHard 56.8 (19/37) |
| gemini-3-pro | | 88.7 (▲+0.6) | 72.7 (▲+0.1) | 64.5 | 63.1 |
| group breakdown: BUILD 64.7 (20/39) · CRE 98.7 (5/39) · GEN 77.7 (9/39) · LM_ARENA_REVIEW_PROXY 45.2 (24/39) · OPS_long 52.2 (36/39) · OPS_precision 54.3 (34/39) · OPS_review 54.3 (35/39) · PLAN 72.0 (15/39); metrics: AALiveCodeBench 100.0 (1/17) · ARC_AGI_2 41.7 (9/31) · ArtificialAnalysisCoding 68.2 (13/37) · ArtificialAnalysisIntelligence 64.9 (23/37) · ArtificialAnalysisMath 97.5 (3/17) · ArtificialAnalysisReasoning 80.7 (10/37) · BFCL 81.7 (4/15) · BlendedCost 71.6 (23/39) · CopilotArenaOrLMArenaCode 59.7 (22/38) · GDPval 29.2 (33/38) · GPQA_HLE_Reasoning 80.7 (10/37) · GSO 40.7 (11/19) · IFBench 72.1 (19/37) · LMArenaCreativeOrOpenEnded 98.7 (5/39) · LMArenaDocument 24.9 (27/33) · LMArenaSearch 65.4 (9/22) · LMArenaText 98.7 (5/39) · LongContextRecall 85.7 (9/37) · MCPAtlas 49.0 (18/34) · MMLUPro 100.0 (1/28) · SWEAtlasComposite 50.0 (16/39) · SWEBenchMultilingual 33.5 (25/33) · SWEBenchPro 70.4 (28/35) · SWEBenchVerified 100.0 (1/37) · SWEComposite 73.5 (26/39) · SWERebench 76.2 (20/36) · SciCode 97.0 (3/37) · SonarComposite 50.0 (21/39) · Tau2Bench 69.8 (25/37) · TerminalBench 74.6 (9/39) · TerminalBenchHard 54.7 (20/37) |
| claude-sonnet-4.6 | anthropic | 71.9 (▲+0.5) | 59.0 (▲+0.1) | 64.0 | 64.1 |
| group breakdown: BUILD 66.0 (17/39) · CRE 75.9 (14/39) · GEN 65.7 (17/39) · LM_ARENA_REVIEW_PROXY 79.8 (5/39) · OPS_long 64.7 (28/39) · OPS_precision 51.7 (36/39) · OPS_review 61.3 (31/39) · PLAN 55.5 (26/39); metrics: AALiveCodeBench 31.3 (11/17) · ARC_AGI_2 10.6 (24/31) · ArtificialAnalysisCoding 82.5 (6/37) · ArtificialAnalysisIntelligence 76.8 (14/37) · ArtificialAnalysisMath 75.9 (13/17) · ArtificialAnalysisReasoning 60.3 (18/37) · BFCL 80.1 (8/15) · BlendedCost 62.5 (31/39) · ContextWindow 98.9 (19/36) · CopilotArenaOrLMArenaCode 92.6 (7/38) · GDPval 80.2 (8/38) · GPQA_HLE_Reasoning 60.3 (18/37) · GSO 30.7 (12/19) · IFBench 35.7 (29/37) · LMArenaCreativeOrOpenEnded 75.9 (14/39) · LMArenaDocument 83.9 (4/33) · LMArenaSearch 75.7 (6/22) · LMArenaText 75.9 (14/39) · LongContextRecall 85.7 (8/37) · MCPAtlas 45.6 (20/34) · MMLUPro 71.9 (10/28) · OutputSpeed 78.4 (14/33) · SWEAtlasComposite 55.0 (10/39) · SWEAtlasQnA 64.5 (10/21) · SWEAtlasRefactoring 55.3 (10/19) · SWEAtlasTestWriting 45.0 (10/21) · SWEBenchMultilingual 95.0 (5/33) · SWEBenchPro 68.1 (30/35) · SWEBenchVerified 95.0 (9/37) · SWEComposite 78.1 (20/39) · SWERebench 76.3 (19/36) · SciCode 47.3 (18/37) · SonarBugDensity 52.7 (14/24) · SonarComposite 55.5 (12/39) · SonarFunctionalSkill 86.2 (5/24) · SonarIssueDensity 33.9 (18/24) · SonarVulnerabilityDensity 13.1 (23/24) · TTFT 2.5 (32/33) · Tau2Bench 37.5 (29/37) · TerminalBench 48.0 (25/39) · TerminalBenchHard 86.8 (5/37) |
| glm-5 | zai | 64.2 (▲+0.4) | 60.0 (▲+0.1) | 63.0 (▼-0.1) | 59.2 |
| group breakdown: BUILD 62.0 (22/39) · CRE 67.5 (21/39) · GEN 55.3 (24/39) · LM_ARENA_REVIEW_PROXY 47.3 (19/39) · OPS_long 71.9 (21/39) · OPS_precision 75.6 (21/39) · OPS_review 65.8 (28/39) · PLAN 60.9 (20/39); metrics: ARC_AGI_2 5.1 (26/31) · ArtificialAnalysisCoding 60.8 (17/37) · ArtificialAnalysisIntelligence 69.9 (18/37) · ArtificialAnalysisReasoning 44.2 (29/37) · BlendedCost 85.0 (14/39) · ContextWindow 0.9 (34/36) · CopilotArenaOrLMArenaCode 58.3 (25/38) · GDPval 65.7 (17/38) · GPQA_HLE_Reasoning 44.2 (29/37) · IFBench 77.1 (16/37) · LMArenaCreativeOrOpenEnded 67.5 (21/39) · LMArenaDocument 44.5 (15/33) · LMArenaText 67.5 (21/39) · LongContextRecall 48.1 (30/37) · MCPAtlas 39.4 (21/34) · OutputSpeed 78.7 (13/33) · SWEAtlasComposite 31.7 (31/39) · SWEAtlasQnA 33.2 (12/21) · SWEAtlasRefactoring 31.4 (11/19) · SWEAtlasTestWriting 30.8 (15/21) · SWEBenchMultilingual 51.2 (22/33) · SWEBenchPro 92.5 (22/35) · SWEBenchVerified 84.0 (31/37) · SWEComposite 83.5 (17/39) · SWERebench 83.5 (7/36) · SciCode 44.1 (20/37) · SonarBugDensity 100.0 (1/24) · SonarComposite 86.6 (1/39) · SonarFunctionalSkill 73.1 (11/24) · SonarIssueDensity 100.0 (1/24) · SonarVulnerabilityDensity 82.2 (5/24) · TTFT 99.7 (3/33) · Tau2Bench 100.0 (2/37) · TerminalBench 46.3 (26/39) · TerminalBenchHard 59.0 (17/37) |
| mimo-v2.5-pro | xiaomi | 75.7 (▲+0.3) | 71.1 | 63.0 (▼-0.1) | 63.0 |
| group breakdown: BUILD 60.2 (23/39) · CRE 81.2 (11/39) · GEN 65.1 (19/39) · LM_ARENA_REVIEW_PROXY 37.6 (29/39) · OPS_long 72.5 (19/39) · OPS_precision 81.5 (13/39) · OPS_review 82.3 (14/39) · PLAN 73.6 (12/39); metrics: ARC_AGI_2 20.1 (16/31) · ArtificialAnalysisCoding 65.0 (15/37) · ArtificialAnalysisIntelligence 84.4 (10/37) · ArtificialAnalysisReasoning 66.0 (17/37) · BlendedCost 89.5 (7/39) · ContextWindow 100.0 (12/36) · CopilotArenaOrLMArenaCode 70.4 (15/38) · GDPval 60.9 (27/38) · GPQA_HLE_Reasoning 66.0 (17/37) · IFBench 97.0 (4/37) · LMArenaCreativeOrOpenEnded 81.2 (11/39) · LMArenaDocument 25.2 (26/33) · LMArenaText 81.2 (11/39) · LongContextRecall 99.3 (3/37) · MCPAtlas 27.6 (26/34) · MMLUPro 5.0 (27/28) · OutputSpeed 54.9 (30/33) · SWEAtlasComposite 22.0 (33/39) · SWEAtlasQnA 17.3 (14/21) · SWEAtlasRefactoring 25.8 (13/19) · SWEAtlasTestWriting 21.8 (17/21) · SWEBenchMultilingual 92.5 (18/33) · SWEBenchPro 95.0 (18/35) · SWEBenchVerified 95.0 (20/37) · SWEComposite 84.1 (15/39) · SWERebench 68.3 (24/36) · SciCode 65.5 (12/37) · SonarBugDensity 45.2 (17/24) · SonarComposite 37.0 (35/39) · SonarFunctionalSkill 12.7 (21/24) · SonarIssueDensity 65.6 (10/24) · SonarVulnerabilityDensity 43.4 (14/24) · TTFT 91.9 (15/33) · Tau2Bench 89.7 (15/37) · TerminalBench 70.6 (12/39) · TerminalBenchHard 59.0 (16/37) |
| gpt-5.2 | openai | 41.3 (▼-0.2) | 48.3 | 62.8 (▼-0.1) | 52.0 |
| group breakdown: BUILD 64.8 (19/39) · CRE 35.8 (33/39) · GEN 48.5 (29/39) · LM_ARENA_REVIEW_PROXY 33.6 (33/39) · OPS_long 56.3 (33/39) · OPS_precision 58.4 (32/39) · OPS_review 61.2 (32/39) · PLAN 46.6 (31/39); metrics: AALiveCodeBench 93.6 (3/17) · ARC_AGI_2 0.0 (31/31) · ArtificialAnalysisCoding 60.8 (16/37) · ArtificialAnalysisIntelligence 58.4 (25/37) · ArtificialAnalysisMath 99.7 (2/17) · ArtificialAnalysisReasoning 48.3 (26/37) · BFCL 0.0 (15/15) · BlendedCost 71.1 (25/39) · ContextWindow 78.1 (25/36) · CopilotArenaOrLMArenaCode 46.4 (28/38) · GDPval 59.3 (28/38) · GPQA_HLE_Reasoning 48.3 (26/37) · GSO 64.7 (6/19) · IFBench 58.4 (23/37) · LMArenaCreativeOrOpenEnded 35.8 (33/39) · LMArenaDocument 0.0 (33/33) · LMArenaSearch 67.1 (8/22) · LMArenaText 35.8 (33/39) · LongContextRecall 48.1 (29/37) · MMLUPro 57.5 (20/28) · SWEAtlasComposite 87.8 (3/39) · SWEAtlasQnA 86.2 (3/21) · SWEAtlasRefactoring 85.4 (6/19) · SWEAtlasTestWriting 92.5 (3/21) · SWEBenchMultilingual 0.0 (33/33) · SWEBenchPro 32.0 (34/35) · SWEBenchVerified 84.0 (30/37) · SWEComposite 63.8 (31/39) · SWERebench 100.0 (1/36) · SciCode 44.1 (19/37) · SonarBugDensity 75.3 (7/24) · SonarComposite 60.6 (8/39) · SonarFunctionalSkill 74.6 (9/24) · SonarIssueDensity 7.5 (22/24) · SonarVulnerabilityDensity 92.5 (3/24) · Tau2Bench 33.3 (32/37) · TerminalBench 67.1 (14/39) · TerminalBenchHard 59.0 (15/37) |
| deepseek-v4-flash | deepseek | 57.2 (▲+0.6) | 65.5 (▲+0.1) | 62.0 | 62.2 |
| group breakdown: BUILD 58.5 (24/39) · CRE 52.1 (28/39) · GEN 58.6 (20/39) · LM_ARENA_REVIEW_PROXY 47.3 (14/39) · OPS_long 91.4 (5/39) · OPS_precision 95.0 (1/39) · OPS_review 95.1 (1/39) · PLAN 65.9 (19/39); metrics: ArtificialAnalysisCoding 42.9 (27/37) · ArtificialAnalysisIntelligence 58.0 (26/37) · ArtificialAnalysisReasoning 68.1 (16/37) · BlendedCost 100.0 (1/39) · ContextWindow 100.0 (4/36) · CopilotArenaOrLMArenaCode 84.2 (12/38) · GDPval 60.9 (23/38) · GPQA_HLE_Reasoning 68.1 (16/37) · IFBench 95.2 (5/37) · LMArenaCreativeOrOpenEnded 52.1 (28/39) · LMArenaDocument 44.5 (10/33) · LMArenaText 52.1 (28/39) · LongContextRecall 46.4 (31/37) · MCPAtlas 37.9 (22/34) · MMLUPro 61.9 (18/28) · OutputSpeed 84.8 (10/33) · SWEAtlasComposite 50.0 (13/39) · SWEBenchMultilingual 59.1 (21/33) · SWEBenchPro 91.6 (23/35) · SWEBenchVerified 95.0 (10/37) · SWEComposite 77.2 (22/39) · SWERebench 62.3 (27/36) · SciCode 37.1 (25/37) · SonarComposite 50.0 (16/39) · TTFT 98.9 (4/33) · Tau2Bench 92.2 (12/37) · TerminalBench 53.0 (24/39) · TerminalBenchHard 37.6 (27/37) |
| mimo-v2.5 | xiaomi | 51.2 (▼-1.2) | 56.2 (▼-0.2) | 58.4 (▼-0.1) | 55.8 |
| group breakdown: BUILD 55.4 (26/39) · CRE 48.3 (29/39) · GEN 47.4 (30/39) · LM_ARENA_REVIEW_PROXY 37.6 (28/39) · OPS_long 87.4 (7/39) · OPS_precision 91.0 (3/39) · OPS_review 91.9 (3/39) · PLAN 57.1 (22/39); metrics: ARC_AGI_2 20.1 (15/31) · ArtificialAnalysisCoding 54.0 (23/37) · ArtificialAnalysisIntelligence 67.0 (21/37) · ArtificialAnalysisReasoning 46.0 (28/37) · BlendedCost 99.2 (2/39) · ContextWindow 100.0 (11/36) · CopilotArenaOrLMArenaCode 58.8 (24/38) · GDPval 60.9 (26/38) · GPQA_HLE_Reasoning 46.0 (28/37) · IFBench 63.5 (22/37) · LMArenaCreativeOrOpenEnded 48.3 (29/39) · LMArenaDocument 25.2 (25/33) · LMArenaText 48.3 (29/39) · LongContextRecall 44.7 (32/37) · MCPAtlas 27.6 (25/34) · MMLUPro 5.0 (26/28) · OutputSpeed 80.3 (11/33) · SWEAtlasComposite 22.0 (32/39) · SWEAtlasQnA 17.3 (13/21) · SWEAtlasRefactoring 25.8 (12/19) · SWEAtlasTestWriting 21.8 (16/21) · SWEBenchMultilingual 92.5 (17/33) · SWEBenchPro 95.0 (17/35) · SWEBenchVerified 92.5 (25/37) · SWEComposite 83.7 (16/39) · SWERebench 68.3 (23/36) · SciCode 27.5 (29/37) · SonarBugDensity 45.2 (16/24) · SonarComposite 37.0 (34/39) · SonarFunctionalSkill 12.7 (20/24) · SonarIssueDensity 65.6 (9/24) · SonarVulnerabilityDensity 43.4 (13/24) · TTFT 91.8 (16/33) · Tau2Bench 79.8 (21/37) · TerminalBench 66.7 (16/39) · TerminalBenchHard 54.7 (21/37) |
| minimax-m2.7 | minimax | 37.1 (▲+0.7) | 54.8 (▲+0.1) | 57.5 (▼-0.1) | 53.3 |
| group breakdown: BUILD 56.0 (25/39) · CRE 26.8 (36/39) · GEN 49.2 (28/39) · LM_ARENA_REVIEW_PROXY 37.6 (27/39) · OPS_long 71.2 (24/39) · OPS_precision 75.5 (22/39) · OPS_review 66.2 (26/39) · PLAN 55.6 (25/39); metrics: ARC_AGI_2 11.9 (19/31) · ArtificialAnalysisCoding 53.3 (24/37) · ArtificialAnalysisIntelligence 69.2 (20/37) · ArtificialAnalysisReasoning 56.4 (21/37) · BlendedCost 90.7 (4/39) · ContextWindow 3.1 (31/36) · CopilotArenaOrLMArenaCode 42.5 (29/38) · GDPval 62.3 (19/38) · GPQA_HLE_Reasoning 56.4 (21/37) · IFBench 86.1 (13/37) · LMArenaCreativeOrOpenEnded 26.8 (36/39) · LMArenaDocument 25.2 (24/33) · LMArenaText 26.8 (36/39) · LongContextRecall 75.4 (16/37) · MCPAtlas 27.6 (24/34) · MMLUPro 68.0 (14/28) · OutputSpeed 77.1 (16/33) · SWEAtlasComposite 14.2 (36/39) · SWEAtlasQnA 10.4 (17/21) · SWEAtlasRefactoring 22.1 (14/19) · SWEAtlasTestWriting 7.5 (19/21) · SWEBenchMultilingual 95.0 (7/33) · SWEBenchPro 95.0 (9/35) · SWEBenchVerified 92.5 (23/37) · SWEComposite 88.5 (6/39) · SWERebench 79.6 (11/36) · SciCode 48.3 (17/37) · SonarBugDensity 65.2 (9/24) · SonarComposite 52.0 (13/39) · SonarFunctionalSkill 38.5 (18/24) · SonarIssueDensity 66.6 (8/24) · SonarVulnerabilityDensity 45.8 (12/24) · TTFT 96.4 (9/33) · Tau2Bench 63.2 (26/37) · TerminalBench 34.2 (30/39) · TerminalBenchHard 48.3 (23/37) |
| gemini-3-flash | | 81.4 (▲+0.4) | 65.7 | 55.2 | 54.7 |
| group breakdown: BUILD 51.3 (28/39) · CRE 86.3 (9/39) · GEN 68.0 (16/39) · LM_ARENA_REVIEW_PROXY 32.7 (34/39) · OPS_long 94.3 (1/39) · OPS_precision 90.6 (4/39) · OPS_review 92.2 (2/39) · PLAN 60.5 (21/39); metrics: AALiveCodeBench 98.7 (2/17) · ARC_AGI_2 16.2 (17/31) · ArtificialAnalysisCoding 55.6 (22/37) · ArtificialAnalysisIntelligence 57.6 (27/37) · ArtificialAnalysisMath 100.0 (1/17) · ArtificialAnalysisReasoning 73.9 (14/37) · BlendedCost 83.4 (16/39) · ContextWindow 100.0 (8/36) · CopilotArenaOrLMArenaCode 59.0 (23/38) · GDPval 31.3 (31/38) · GPQA_HLE_Reasoning 73.9 (14/37) · GSO 14.0 (17/19) · IFBench 92.0 (6/37) · LMArenaCreativeOrOpenEnded 86.3 (9/39) · LMArenaDocument 2.6 (32/33) · LMArenaSearch 62.9 (10/22) · LMArenaText 86.3 (9/39) · LongContextRecall 63.5 (20/37) · MCPAtlas 13.4 (30/34) · MMLUPro 92.9 (3/28) · OutputSpeed 98.4 (3/33) · SWEAtlasComposite 11.4 (38/39) · SWEAtlasQnA 0.0 (21/21) · SWEAtlasRefactoring 0.0 (19/19) · SWEAtlasTestWriting 38.1 (13/21) · SWEBenchMultilingual 100.0 (1/33) · SWEBenchPro 45.5 (33/35) · SWEBenchVerified 95.2 (3/37) · SWEComposite 73.2 (27/39) · SWERebench 82.6 (9/36) · SciCode 67.6 (11/37) · SonarComposite 50.0 (20/39) · TTFT 84.1 (17/33) · Tau2Bench 50.7 (27/37) · TerminalBench 66.1 (17/39) · TerminalBenchHard 46.2 (24/37) |
| claude-sonnet-4.5 | anthropic | 57.0 | 44.6 | 53.9 ▼-0.1 | 44.8 | | |||
group breakdown: BUILD 53.2 (27/39); CRE 59.7 (24/39); GEN 46.4 (31/39); LM_ARENA_REVIEW_PROXY 26.0 (37/39); OPS_long 75.9 (16/39); OPS_precision 76.0 (20/39); OPS_review 78.1 (16/39); PLAN 38.7 (32/39). metrics: AALiveCodeBench 28.0 (12/17); ARC_AGI_2 3.6 (27/31); ArtificialAnalysisCoding 42.6 (28/37); ArtificialAnalysisIntelligence 45.3 (28/37); ArtificialAnalysisMath 80.5 (10/17); ArtificialAnalysisReasoning 27.7 (32/37); BFCL 85.4 (3/15); BlendedCost 62.5 (30/39); ContextWindow 98.9 (18/36); CopilotArenaOrLMArenaCode 38.9 (32/38); GDPval 82.0 (5/38); GPQA_HLE_Reasoning 27.7 (32/37); GSO 27.3 (13/19); IFBench 37.5 (28/37); LMArenaCreativeOrOpenEnded 59.7 (24/39); LMArenaDocument 43.3 (19/33); LMArenaSearch 8.7 (19/22); LMArenaText 59.7 (24/39); LongContextRecall 60.1 (23/37); MCPAtlas 2.7 (33/34); MMLUPro 75.8 (9/28); OutputSpeed 71.2 (24/33); SWEAtlasComposite 50.0 (12/39); SWEBenchMultilingual 3.5 (32/33); SWEBenchPro 71.3 (27/35); SWEBenchVerified 91.5 (27/37); SWEComposite 71.4 (29/39); SWERebench 81.0 (10/36); SciCode 36.0 (26/37); SonarBugDensity 52.7 (13/24); SonarComposite 57.7 (11/39); SonarFunctionalSkill 76.9 (7/24); SonarIssueDensity 53.5 (15/24); SonarVulnerabilityDensity 20.4 (20/24); TTFT 78.1 (24/33); Tau2Bench 44.1 (28/37); TerminalBench 36.5 (28/39); TerminalBenchHard 37.6 (26/37) | |||||||||
| kimi-k2.5 | moonshot | 47.4 ▼-0.5 | 56.0 ▼-0.1 | 50.5 ▼-0.1 | 50.8 | | |||
group breakdown: BUILD 47.6 (29/39); CRE 43.8 (30/39); GEN 50.6 (27/39); LM_ARENA_REVIEW_PROXY 35.4 (30/39); OPS_long 64.1 (29/39); OPS_precision 74.7 (24/39); OPS_review 70.5 (19/39); PLAN 57.0 (23/39). metrics: ARC_AGI_2 14.8 (18/31); ArtificialAnalysisCoding 45.8 (26/37); ArtificialAnalysisIntelligence 59.1 (24/37); ArtificialAnalysisReasoning 59.9 (19/37); BlendedCost 86.5 (12/39); ContextWindow 55.5 (28/36); CopilotArenaOrLMArenaCode 47.6 (27/38); GDPval 60.9 (25/38); GPQA_HLE_Reasoning 59.9 (19/37); IFBench 71.5 (20/37); LMArenaCreativeOrOpenEnded 43.8 (30/39); LMArenaDocument 20.9 (28/33); LMArenaText 43.8 (30/39); LongContextRecall 58.4 (24/37); MCPAtlas 23.7 (27/34); MMLUPro 69.1 (12/28); OutputSpeed 50.5 (31/33); SWEAtlasComposite 17.1 (35/39); SWEAtlasQnA 11.6 (16/21); SWEAtlasRefactoring 21.5 (16/19); SWEAtlasTestWriting 16.8 (18/21); SWEBenchMultilingual 8.8 (28/33); SWEBenchPro 87.5 (24/35); SWEBenchVerified 76.5 (33/37); SWEComposite 71.6 (28/39); SWERebench 71.5 (22/36); SciCode 59.0 (15/37); SonarBugDensity 44.4 (18/24); SonarComposite 34.8 (36/39); SonarFunctionalSkill 6.2 (23/24); SonarIssueDensity 68.4 (6/24); SonarVulnerabilityDensity 42.2 (15/24); TTFT 96.8 (7/33); Tau2Bench 94.7 (6/37); TerminalBench 31.1 (31/39); TerminalBenchHard 35.5 (29/37) | |||||||||
| minimax-m2.5 | minimax | 14.8 | 45.2 | 50.1 ▼-0.1 | 49.2 | | |||
group breakdown: BUILD 47.3 (30/39); CRE 0.0 (39/39); GEN 27.7 (35/39); LM_ARENA_REVIEW_PROXY 37.6 (26/39); OPS_long 81.3 (14/39); OPS_precision 78.0 (16/39); OPS_review 70.1 (20/39); PLAN 52.1 (29/39). metrics: ARC_AGI_2 5.1 (25/31); ArtificialAnalysisCoding 38.7 (29/37); ArtificialAnalysisIntelligence 41.4 (30/37); ArtificialAnalysisReasoning 33.9 (31/37); BlendedCost 93.6 (3/39); ContextWindow 3.1 (30/36); CopilotArenaOrLMArenaCode 37.3 (33/38); GDPval 60.9 (24/38); GPQA_HLE_Reasoning 33.9 (31/37); IFBench 75.3 (17/37); LMArenaCreativeOrOpenEnded 0.0 (39/39); LMArenaDocument 25.2 (23/33); LMArenaText 0.0 (39/39); LongContextRecall 61.8 (22/37); MCPAtlas 27.6 (23/34); MMLUPro 68.0 (13/28); OutputSpeed 100.0 (1/33); SWEAtlasComposite 7.9 (39/39); SWEAtlasQnA 3.4 (20/21); SWEAtlasRefactoring 17.2 (17/19); SWEAtlasTestWriting 0.0 (21/21); SWEBenchMultilingual 26.5 (26/33); SWEBenchPro 95.0 (8/35); SWEBenchVerified 95.2 (4/37); SWEComposite 77.3 (21/39); SWERebench 67.8 (25/36); SciCode 24.8 (31/37); SonarBugDensity 52.7 (15/24); SonarComposite 47.4 (33/39); SonarFunctionalSkill 40.8 (16/24); SonarIssueDensity 67.8 (7/24); SonarVulnerabilityDensity 24.0 (17/24); TTFT 82.5 (18/33); Tau2Bench 93.0 (11/37); TerminalBench 30.2 (32/39); TerminalBenchHard 35.5 (28/37) | |||||||||
| grok-4.3 | xai | 42.7 ▼-1.8 | 57.3 ▼-0.2 | 48.5 ▼-0.1 | 52.7 | | |||
group breakdown: BUILD 44.2 (33/39); CRE 31.0 (35/39); GEN 54.0 (25/39); LM_ARENA_REVIEW_PROXY 48.1 (13/39); OPS_long 90.9 (6/39); OPS_precision 87.2 (8/39); OPS_review 89.2 (7/39); PLAN 55.2 (28/39). metrics: AALiveCodeBench 63.8 (9/17); ARC_AGI_2 25.0 (10/31); ArtificialAnalysisCoding 31.3 (31/37); ArtificialAnalysisIntelligence 66.3 (22/37); ArtificialAnalysisMath 84.7 (9/17); ArtificialAnalysisReasoning 59.6 (20/37); BFCL 36.9 (11/15); BlendedCost 80.6 (19/39); ContextWindow 98.9 (23/36); CopilotArenaOrLMArenaCode 33.1 (35/38); GDPval 62.8 (18/38); GPQA_HLE_Reasoning 59.6 (20/37); IFBench 100.0 (2/37); LMArenaCreativeOrOpenEnded 31.0 (35/39); LMArenaSearch 46.2 (16/22); LMArenaText 31.0 (35/39); LongContextRecall 56.7 (26/37); MMLUPro 63.2 (17/28); OutputSpeed 94.8 (5/33); SWEAtlasComposite 50.0 (24/39); SWEComposite 47.1 (34/39); SWERebench 42.6 (31/36); SciCode 35.5 (27/37); SonarComposite 50.0 (27/39); TTFT 79.5 (20/33); Tau2Bench 81.4 (20/37); TerminalBench 11.3 (35/39); TerminalBenchHard 22.6 (32/37) | |||||||||
| claude-sonnet-4 | anthropic | 11.5 | 26.5 | 46.0 ▼-0.1 | 37.4 | | |||
group breakdown: BUILD 45.1 (31/39); CRE 0.0 (38/39); GEN 18.0 (36/39); LM_ARENA_REVIEW_PROXY 29.6 (35/39); OPS_long 76.1 (15/39); OPS_precision 76.3 (18/39); OPS_review 78.3 (15/39); PLAN 25.0 (35/39). metrics: AALiveCodeBench 6.6 (16/17); ARC_AGI_2 0.2 (30/31); ArtificialAnalysisCoding 28.0 (32/37); ArtificialAnalysisIntelligence 29.8 (32/37); ArtificialAnalysisMath 50.1 (14/17); ArtificialAnalysisReasoning 1.7 (35/37); BFCL 80.1 (7/15); BlendedCost 62.5 (29/39); ContextWindow 98.9 (17/36); CopilotArenaOrLMArenaCode 40.6 (31/38); GDPval 80.2 (7/38); GPQA_HLE_Reasoning 1.7 (35/37); GSO 6.0 (18/19); IFBench 30.7 (30/37); LMArenaCreativeOrOpenEnded 0.0 (38/39); LMArenaDocument 44.3 (17/33); LMArenaSearch 14.9 (18/22); LMArenaText 0.0 (38/39); LiveCodeBench 50.0 (1/1); LongContextRecall 54.9 (27/37); MCPAtlas 9.8 (31/34); MMLUPro 38.1 (22/28); OutputSpeed 71.4 (23/33); SWEAtlasComposite 50.0 (11/39); SWEBenchMultilingual 10.4 (27/33); SWEBenchPro 68.7 (29/35); SWEBenchVerified 99.0 (2/37); SWEComposite 63.5 (32/39); SWERebench 59.0 (30/36); SciCode 10.9 (33/37); SonarBugDensity 0.0 (24/24); SonarComposite 25.2 (38/39); SonarFunctionalSkill 34.6 (19/24); SonarIssueDensity 45.4 (16/24); SonarVulnerabilityDensity 0.0 (24/24); TTFT 78.7 (21/33); Tau2Bench 6.0 (35/37); TerminalBench 38.6 (27/39); TerminalBenchHard 24.8 (31/37) | |||||||||
| glm-4.7 | zai | 58.0 ▼-0.1 | 51.7 | 45.8 | 47.5 | | |||
group breakdown: BUILD 42.8 (35/39); CRE 58.2 (26/39); GEN 53.7 (26/39); LM_ARENA_REVIEW_PROXY 50.0 (12/39); OPS_long 72.4 (20/39); OPS_precision 76.3 (17/39); OPS_review 66.4 (25/39); PLAN 46.8 (30/39). metrics: AALiveCodeBench 93.6 (4/17); ArtificialAnalysisCoding 35.1 (30/37); ArtificialAnalysisIntelligence 42.1 (29/37); ArtificialAnalysisMath 96.0 (4/17); ArtificialAnalysisReasoning 47.7 (27/37); BlendedCost 87.0 (10/39); ContextWindow 0.9 (33/36); CopilotArenaOrLMArenaCode 60.3 (21/38); GDPval 28.9 (34/38); GPQA_HLE_Reasoning 47.7 (27/37); IFBench 65.4 (21/37); LMArenaCreativeOrOpenEnded 58.2 (26/39); LMArenaText 58.2 (26/39); LongContextRecall 51.5 (28/37); MCPAtlas 0.0 (34/34); MMLUPro 54.1 (21/28); OutputSpeed 79.3 (12/33); SWEAtlasComposite 50.0 (27/39); SWEBenchMultilingual 5.0 (31/33); SWEBenchVerified 83.9 (32/37); SWEComposite 55.7 (33/39); SWERebench 61.7 (29/36); SciCode 38.2 (24/37); SonarBugDensity 34.0 (22/24); SonarComposite 24.4 (39/39); SonarFunctionalSkill 0.0 (24/24); SonarIssueDensity 58.2 (12/24); SonarVulnerabilityDensity 20.4 (21/24); TTFT 100.0 (1/33); Tau2Bench 94.7 (8/37); TerminalBench 14.8 (33/39); TerminalBenchHard 26.9 (30/37) | |||||||||
| kimi-k2-0905 | moonshot | 7.9 ▼-0.9 | 17.2 ▼-0.1 | 43.7 | 36.4 | | |||
group breakdown: BUILD 44.6 (32/39); CRE 5.9 (37/39); GEN 4.4 (39/39); LM_ARENA_REVIEW_PROXY 47.3 (15/39); OPS_long 35.8 (39/39); OPS_precision 58.6 (31/39); OPS_review 54.5 (34/39); PLAN 19.8 (36/39). metrics: AALiveCodeBench 0.0 (17/17); ArtificialAnalysisCoding 1.4 (35/37); ArtificialAnalysisIntelligence 1.6 (35/37); ArtificialAnalysisMath 12.4 (16/17); ArtificialAnalysisReasoning 0.0 (36/37); BlendedCost 83.8 (15/39); ContextWindow 55.5 (27/36); CopilotArenaOrLMArenaCode 84.2 (13/38); GDPval 5.0 (37/38); GPQA_HLE_Reasoning 0.0 (36/37); IFBench 0.0 (36/37); LMArenaCreativeOrOpenEnded 5.9 (37/39); LMArenaDocument 44.5 (11/33); LMArenaText 5.9 (37/39); LongContextRecall 0.0 (36/37); MCPAtlas 68.5 (8/34); MMLUPro 11.9 (25/28); OutputSpeed 0.0 (33/33); SWEAtlasComposite 50.0 (18/39); SWEBenchMultilingual 5.0 (29/33); SWEBenchPro 92.5 (20/35); SWEBenchVerified 68.4 (35/37); SWEComposite 68.1 (30/39); SWERebench 62.3 (28/36); SciCode 0.0 (36/37); SonarComposite 50.0 (24/39); TTFT 95.7 (10/33); Tau2Bench 30.8 (33/37); TerminalBench 34.9 (29/39); TerminalBenchHard 3.4 (35/37) | |||||||||
| grok-4-latest | xai | 38.1 ▼-0.4 | 39.5 ▼-0.1 | 42.8 | 37.2 | | |||
group breakdown: BUILD 43.4 (34/39); CRE 34.0 (34/39); GEN 44.3 (32/39); LM_ARENA_REVIEW_PROXY 25.0 (38/39); OPS_long 46.5 (38/39); OPS_precision 42.9 (38/39); OPS_review 42.9 (39/39); PLAN 35.8 (33/39). metrics: AALiveCodeBench 66.3 (8/17); ARC_AGI_2 20.6 (14/31); ArtificialAnalysisCoding 48.8 (25/37); ArtificialAnalysisIntelligence 39.9 (31/37); ArtificialAnalysisMath 90.9 (7/17); ArtificialAnalysisReasoning 48.9 (25/37); BFCL 34.6 (13/15); BlendedCost 14.5 (37/39); GDPval 6.9 (36/38); GPQA_HLE_Reasoning 48.9 (25/37); IFBench 28.0 (31/37); LMArenaCreativeOrOpenEnded 34.0 (34/39); LMArenaSearch 0.0 (22/22); LMArenaText 34.0 (34/39); LongContextRecall 72.0 (17/37); MMLUPro 65.5 (16/28); SWEAtlasComposite 50.0 (23/39); SWEComposite 46.5 (35/39); SWERebench 41.3 (32/36); SciCode 41.4 (21/37); SonarComposite 50.0 (26/39); Tau2Bench 35.0 (31/37); TerminalBench 4.5 (36/39); TerminalBenchHard 44.0 (25/37) | |||||||||
| gemini-2.5-pro | google | 67.4 ▲+0.1 | 37.1 | 42.0 | 32.9 | | |||
group breakdown: BUILD 39.5 (36/39); CRE 78.0 (12/39); GEN 40.3 (33/39); LM_ARENA_REVIEW_PROXY 6.0 (39/39); OPS_long 86.9 (8/39); OPS_precision 82.8 (11/39); OPS_review 85.4 (11/39); PLAN 28.3 (34/39). metrics: AALiveCodeBench 59.7 (10/17); ARC_AGI_2 3.6 (28/31); ArtificialAnalysisCoding 21.2 (33/37); ArtificialAnalysisIntelligence 15.0 (33/37); ArtificialAnalysisMath 79.8 (11/17); ArtificialAnalysisReasoning 37.0 (30/37); BFCL 77.0 (9/15); BlendedCost 73.9 (22/39); ContextWindow 100.0 (7/36); CopilotArenaOrLMArenaCode 0.0 (37/38); GDPval 30.4 (32/38); GPQA_HLE_Reasoning 37.0 (30/37); GSO 0.0 (19/19); IFBench 14.9 (34/37); LMArenaCreativeOrOpenEnded 78.0 (12/39); LMArenaDocument 11.9 (29/33); LMArenaSearch 0.1 (21/22); LMArenaText 78.0 (12/39); LongContextRecall 61.8 (21/37); MCPAtlas 49.1 (16/34); MMLUPro 61.0 (19/28); OutputSpeed 90.4 (8/33); SWEAtlasComposite 50.0 (15/39); SWEBenchMultilingual 36.0 (23/33); SWEBenchPro 67.3 (31/35); SWEBenchVerified 93.0 (22/37); SWEComposite 41.1 (37/39); SWERebench 0.0 (36/36); SciCode 25.9 (30/37); SonarComposite 50.0 (19/39); TTFT 74.1 (28/33); Tau2Bench 0.0 (37/37); TerminalBench 13.5 (34/39); TerminalBenchHard 12.0 (33/37) | |||||||||
| grok-code-fast-1 | xai | 29.7 ▼-0.4 | 13.7 | 30.0 | 23.9 | | |||
group breakdown: BUILD 30.5 (37/39); CRE 36.4 (32/39); GEN 11.0 (38/39); LM_ARENA_REVIEW_PROXY 28.8 (36/39); OPS_long 47.0 (37/39); OPS_precision 44.0 (37/39); OPS_review 44.0 (38/39); PLAN 11.2 (38/39). metrics: AALiveCodeBench 7.3 (15/17); ARC_AGI_2 25.0 (11/31); ArtificialAnalysisCoding 0.0 (37/37); ArtificialAnalysisIntelligence 0.0 (37/37); ArtificialAnalysisMath 0.0 (17/17); ArtificialAnalysisReasoning 0.0 (37/37); BFCL 36.9 (12/15); BlendedCost 19.8 (36/39); CopilotArenaOrLMArenaCode 0.0 (38/38); GDPval 5.0 (38/38); GPQA_HLE_Reasoning 0.0 (37/37); IFBench 0.0 (37/37); LMArenaCreativeOrOpenEnded 36.4 (32/39); LMArenaSearch 7.5 (20/22); LMArenaText 36.4 (32/39); LongContextRecall 0.0 (37/37); MMLUPro 0.0 (28/28); SWEAtlasComposite 50.0 (25/39); SWEBenchVerified 73.8 (34/37); SWEComposite 45.2 (36/39); SWERebench 29.0 (34/36); SciCode 0.0 (37/37); SonarComposite 50.0 (28/39); Tau2Bench 37.5 (30/37); TerminalBench 2.2 (37/39); TerminalBenchHard 0.0 (37/37) | |||||||||
| glm-4.6 | zai | 53.1 | 24.6 | 27.9 ▼-0.1 | 28.4 | | |||
group breakdown: BUILD 25.0 (38/39); CRE 62.5 (23/39); GEN 30.0 (34/39); LM_ARENA_REVIEW_PROXY 50.0 (11/39); OPS_long 67.0 (27/39); OPS_precision 72.2 (25/39); OPS_review 62.8 (30/39); PLAN 14.1 (37/39). metrics: AALiveCodeBench 21.1 (14/17); ArtificialAnalysisCoding 13.1 (34/37); ArtificialAnalysisIntelligence 7.4 (34/37); ArtificialAnalysisMath 76.0 (12/17); ArtificialAnalysisReasoning 9.4 (33/37); BFCL 81.1 (5/15); BlendedCost 86.7 (11/39); ContextWindow 0.9 (32/36); CopilotArenaOrLMArenaCode 26.9 (36/38); GDPval 12.0 (35/38); GPQA_HLE_Reasoning 9.4 (33/37); IFBench 0.9 (35/37); LMArenaCreativeOrOpenEnded 62.5 (23/39); LMArenaText 62.5 (23/39); LongContextRecall 2.0 (35/37); MCPAtlas 7.5 (32/34); MMLUPro 23.3 (24/28); OutputSpeed 71.1 (25/33); SWEAtlasComposite 50.0 (26/39); SWEBenchMultilingual 5.0 (30/33); SWEBenchPro 0.0 (35/35); SWEBenchVerified 66.7 (36/37); SWEComposite 26.7 (38/39); SWERebench 40.5 (33/36); SciCode 2.4 (35/37); SonarBugDensity 36.4 (19/24); SonarComposite 28.2 (37/39); SonarFunctionalSkill 7.5 (22/24); SonarIssueDensity 57.0 (14/24); SonarVulnerabilityDensity 24.8 (16/24); TTFT 95.5 (11/33); Tau2Bench 22.6 (34/37); TerminalBench 0.0 (39/39); TerminalBenchHard 7.7 (34/37) | |||||||||
| gemini-2.5-flash | google | 38.0 ▼-0.4 | 18.2 | 25.7 | 24.7 | | |||
group breakdown: BUILD 21.2 (39/39); CRE 41.1 (31/39); GEN 16.6 (37/39); LM_ARENA_REVIEW_PROXY 35.3 (31/39); OPS_long 93.9 (2/39); OPS_precision 89.2 (6/39); OPS_review 91.4 (5/39); PLAN 8.9 (39/39). metrics: AALiveCodeBench 21.1 (13/17); ARC_AGI_2 0.7 (29/31); ArtificialAnalysisCoding 0.0 (36/37); ArtificialAnalysisIntelligence 0.0 (36/37); ArtificialAnalysisMath 47.9 (15/17); ArtificialAnalysisReasoning 7.1 (34/37); BFCL 1.3 (14/15); BlendedCost 85.8 (13/39); ContextWindow 100.0 (6/36); CopilotArenaOrLMArenaCode 57.7 (26/38); GDPval 32.3 (30/38); GPQA_HLE_Reasoning 7.1 (34/37); GSO 19.4 (14/19); IFBench 19.0 (33/37); LMArenaCreativeOrOpenEnded 41.1 (31/39); LMArenaDocument 9.7 (30/33); LMArenaSearch 60.9 (11/22); LMArenaText 41.1 (31/39); LongContextRecall 39.6 (34/37); MCPAtlas 18.9 (28/34); MMLUPro 26.7 (23/28); OutputSpeed 99.6 (2/33); SWEAtlasComposite 17.2 (34/39); SWEAtlasQnA 7.5 (19/21); SWEAtlasRefactoring 7.5 (18/19); SWEAtlasTestWriting 39.9 (12/21); SWEBenchMultilingual 92.5 (12/33); SWEBenchPro 46.2 (32/35); SWEBenchVerified 0.0 (37/37); SWEComposite 25.4 (39/39); SWERebench 0.0 (35/36); SciCode 7.7 (34/37); SonarComposite 50.0 (18/39); TTFT 77.6 (26/33); Tau2Bench 0.0 (36/37); TerminalBench 0.0 (38/39); TerminalBenchHard 0.0 (36/37) | |||||||||