$ipbr
Live LLM coding scoreboard.

Models drift. Agents battle. Math decides.

live · refreshed · 17 sources · 35 models

IPBR
  • claude-opus-4.7 88.6
  • gpt-5.5 87.4
  • gemini-3.1-pro-preview 85.1

leaders now

[ idea ]
  1. gemini-3.5-flash 98.1
  2. gemini-3.1-pro-preview 98.0
  3. claude-opus-4.6 94.7
[ plan ]
  1. gemini-3.1-pro-preview 92.8
  2. gemini-3.5-flash 91.8
  3. gpt-5.5 89.2
[ build ]
  1. gpt-5.5 86.6
  2. claude-opus-4.7 86.1
  3. claude-opus-4.6 81.6
[ review ]
  1. gpt-5.5 86.7
  2. claude-opus-4.7 86.6
  3. claude-opus-4.6 82.7
how scoring works

Each model gets four role scores from public benchmarks. Idea measures open-ended creativity. Plan measures structured reasoning, function-calling, and multi-step decomposition. Build measures implementation skill — SWE-bench, LiveCodeBench, terminal tasks. Review measures preference judgment.

scoring

Each role score is the benchmark composite for that role, normalized to 0-100 and combined via weighted average of group scores. See the about page for the full math.
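A minimal sketch of the two steps named above, normalizing to 0-100 and taking a weighted average of group scores. The min-max normalization, the function names, the weights, and the pairing of BUILD with OPS_precision are all assumptions made for illustration; the real weights and normalization are on the about page.

```python
def normalize_0_100(value: float, lo: float, hi: float) -> float:
    """Min-max scale a raw value onto 0-100 (assumed normalization scheme)."""
    if hi == lo:
        return 50.0  # degenerate range: fall back to the midpoint (assumption)
    return 100.0 * (value - lo) / (hi - lo)


def role_score(group_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of group scores that are already on a 0-100 scale."""
    total_weight = sum(weights[g] for g in group_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(group_scores[g] * weights[g] for g in group_scores) / total_weight


# Hypothetical weights and group pairing, for illustration only.
example = role_score(
    {"BUILD": 88.3, "OPS_precision": 66.9},
    {"BUILD": 3.0, "OPS_precision": 1.0},
)
print(round(example, 2))
```

Dividing by the sum of the weights actually used keeps the result on the 0-100 scale even when the weights do not add up to 1.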

missing data

If a model is missing some metrics within a group, the group score depends on the group's coverage. Across 60-80% coverage, it blends from a shrink-to-50 estimate to the weighted mean of the present metrics. At 80% coverage and above, that present-weight mean is trusted directly.
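The coverage rule can be sketched as below. Several details are assumptions, not taken from the page: coverage is measured as the weight share of present metrics, "shrink-to-50" pulls the mean toward 50 in proportion to the missing share, the blend between 60% and 80% is linear, and below 60% the shrunk value is used as-is. The function name and parameters are hypothetical.

```python
def group_score(present: dict[str, float], weights: dict[str, float],
                neutral: float = 50.0, lo: float = 0.60, hi: float = 0.80) -> float:
    """Blend a shrunk estimate with the present-weight mean, based on coverage.

    present: metric -> 0-100 score, only for metrics that have data.
    weights: metric -> weight, for every metric in the group.
    """
    total_weight = sum(weights.values())
    present_weight = sum(weights[m] for m in present)
    if present_weight == 0:
        return neutral  # nothing observed: fall back to the neutral score

    coverage = present_weight / total_weight
    mean = sum(present[m] * weights[m] for m in present) / present_weight

    # Assumed form of shrink-to-50: move toward 50 by the missing share.
    shrunk = neutral + coverage * (mean - neutral)

    # Assumed linear ramp: 0 at 60% coverage, 1 at 80% coverage and above.
    trust = min(1.0, max(0.0, (coverage - lo) / (hi - lo)))
    return (1.0 - trust) * shrunk + trust * mean
```

With five equally weighted metrics, four present metrics give 80% coverage and return the plain mean, while two present metrics give 40% coverage and return the fully shrunk value.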

Full math, role definitions, and source list →

gpt-5.5 · openai · idea 87.0 · plan 89.2 · build 86.6 · review 86.7

group breakdown

BUILD 88.3 (2 / 35) · CRE 84.7 (10 / 35) · GEN 95.3 (4 / 35) · LM_ARENA_REVIEW_PROXY 88.0 (3 / 35) · OPS_long 74.2 (23 / 35) · OPS_precision 66.9 (24 / 35) · OPS_review 68.8 (24 / 35) · PLAN 88.4 (4 / 35)

metrics

ARC_AGI_2 97.7 (2 / 28) · ArtificialAnalysisCoding 100.0 (2 / 33) · ArtificialAnalysisIntelligence 99.2 (3 / 33) · ArtificialAnalysisReasoning 100.0 (3 / 33) · BlendedCost 0.0 (33 / 33) · ContextWindow 100.0 (2 / 33) · CopilotArenaOrLMArenaCode 63.8 (16 / 35) · GDPval 95.0 (2 / 35) · GPQA_HLE_Reasoning 100.0 (3 / 33) · GSO 94.0 (2 / 17) · IFBench 75.5 (16 / 33) · LMArenaCreativeOrOpenEnded 84.7 (10 / 35) · LMArenaDocument 80.6 (4 / 33) · LMArenaSearch 95.4 (2 / 20) · LMArenaText 84.7 (10 / 35) · LongContextRecall 96.9 (4 / 33) · MCPAtlas 53.4 (12 / 30) · OutputSpeed 78.2 (17 / 32) · SWEAtlasComposite 97.3 (1 / 35) · SWEAtlasQnA 100.0 (1 / 21) · SWEAtlasRefactoring 93.2 (2 / 19) · SWEAtlasTestWriting 100.0 (1 / 21) · SWEBenchPro 95.0 (10 / 31) · SWEBenchVerified 95.0 (12 / 33) · SWEComposite 89.9 (7 / 35) · SWERebench 83.5 (12 / 34) · SciCode 90.7 (5 / 33) · SonarBugDensity 93.9 (2 / 19) · SonarComposite 65.4 (6 / 35) · SonarFunctionalSkill 47.1 (15 / 19) · SonarIssueDensity 51.7 (5 / 19) · SonarVulnerabilityDensity 99.1 (2 / 19) · TTFT 81.3 (16 / 32) · Tau2Bench 84.3 (16 / 33) · TerminalBench 100.0 (2 / 35) · TerminalBenchHard 100.0 (2 / 33)

sources: arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, sweatlas_qna, sweatlas_refactoring, sweatlas_test_writing, terminal_bench
missing: BUILD/AALiveCodeBench, BUILD/BFCL, GEN/ArtificialAnalysisMath, GEN/MMLUPro, PLAN/BFCL, SWEComposite/SWEBenchMultilingual

claude-opus-4.7 · anthropic · idea 94.4 · plan 87.5 · build 86.1 · review 86.6

group breakdown

BUILD 88.3 (1 / 35) · CRE 96.0 (5 / 35) · GEN 97.2 (2 / 35) · LM_ARENA_REVIEW_PROXY 96.5 (2 / 35) · OPS_long 71.7 (24 / 35) · OPS_precision 64.8 (26 / 35) · OPS_review 67.5 (26 / 35) · PLAN 84.3 (7 / 35)

metrics

ARC_AGI_2 93.5 (4 / 28) · ArtificialAnalysisCoding 95.8 (3 / 33) · ArtificialAnalysisIntelligence 100.0 (1 / 33) · ArtificialAnalysisReasoning 95.4 (4 / 33) · BlendedCost 11.2 (31 / 33) · ContextWindow 99.2 (12 / 33) · CopilotArenaOrLMArenaCode 100.0 (2 / 35) · GDPval 95.0 (1 / 35) · GPQA_HLE_Reasoning 95.4 (4 / 33) · GSO 100.0 (1 / 17) · IFBench 42.5 (22 / 33) · LMArenaCreativeOrOpenEnded 96.0 (5 / 35) · LMArenaDocument 99.0 (2 / 33) · LMArenaSearch 94.1 (3 / 20) · LMArenaText 96.0 (5 / 35) · LongContextRecall 86.6 (8 / 33) · MCPAtlas 92.9 (3 / 30) · OutputSpeed 75.1 (26 / 32) · SWEAtlasComposite 81.8 (5 / 35) · SWEAtlasQnA 67.5 (7 / 21) · SWEAtlasRefactoring 100.0 (1 / 19) · SWEAtlasTestWriting 71.9 (7 / 21) · SWEBenchMultilingual 95.0 (3 / 29) · SWEBenchPro 95.0 (3 / 31) · SWEBenchVerified 95.0 (5 / 33) · SWEComposite 91.1 (6 / 35) · SWERebench 85.3 (10 / 34) · SciCode 96.4 (3 / 33) · SonarBugDensity 49.9 (16 / 19) · SonarComposite 51.2 (15 / 35) · SonarFunctionalSkill 93.4 (2 / 19) · SonarIssueDensity 0.0 (19 / 19) · SonarVulnerabilityDensity 25.6 (16 / 19) · TTFT 71.9 (25 / 32) · Tau2Bench 75.7 (20 / 33) · TerminalBench 100.0 (1 / 35) · TerminalBenchHard 94.4 (4 / 33)

sources: arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, sweatlas_refactoring, terminal_bench
missing: BUILD/AALiveCodeBench, BUILD/BFCL, GEN/ArtificialAnalysisMath, GEN/MMLUPro, PLAN/BFCL

claude-opus-4.6 · anthropic · idea 94.7 · plan 80.7 · build 81.6 · review 82.7

group breakdown

BUILD 83.7 (3 / 35) · CRE 100.0 (1 / 35) · GEN 89.9 (5 / 35) · LM_ARENA_REVIEW_PROXY 100.0 (1 / 35) · OPS_long 71.6 (25 / 35) · OPS_precision 64.9 (25 / 35) · OPS_review 67.6 (25 / 35) · PLAN 76.9 (13 / 35)

metrics

ARC_AGI_2 91.8 (5 / 28) · ArtificialAnalysisCoding 80.2 (6 / 33) · ArtificialAnalysisIntelligence 84.5 (9 / 33) · ArtificialAnalysisReasoning 85.5 (9 / 33) · BlendedCost 11.2 (30 / 33) · ContextWindow 99.2 (11 / 33) · CopilotArenaOrLMArenaCode 100.0 (1 / 35) · GDPval 85.3 (9 / 35) · GPQA_HLE_Reasoning 85.5 (9 / 33) · GSO 75.3 (3 / 17) · IFBench 27.8 (28 / 33) · LMArenaCreativeOrOpenEnded 100.0 (1 / 35) · LMArenaDocument 100.0 (1 / 33) · LMArenaSearch 100.0 (1 / 20) · LMArenaText 100.0 (1 / 35) · LongContextRecall 88.3 (5 / 33) · MCPAtlas 82.4 (4 / 30) · OutputSpeed 74.7 (27 / 32) · SWEAtlasComposite 70.1 (6 / 35) · SWEAtlasQnA 70.6 (5 / 21) · SWEAtlasRefactoring 65.5 (6 / 19) · SWEAtlasTestWriting 75.8 (5 / 21) · SWEBenchMultilingual 91.9 (16 / 29) · SWEBenchPro 95.1 (2 / 31) · SWEBenchVerified 99.4 (3 / 33) · SWEComposite 94.0 (2 / 35) · SWERebench 91.6 (8 / 34) · SciCode 81.6 (7 / 33) · SonarBugDensity 59.3 (11 / 19) · SonarComposite 70.1 (5 / 35) · SonarFunctionalSkill 91.7 (3 / 19) · SonarIssueDensity 46.1 (7 / 19) · SonarVulnerabilityDensity 66.7 (9 / 19) · TTFT 72.6 (24 / 32) · Tau2Bench 85.1 (15 / 33) · TerminalBench 83.9 (15 / 35) · TerminalBenchHard 77.9 (7 / 33)

sources: arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, sweatlas_qna, sweatlas_refactoring, sweatlas_test_writing, swebench, swebench_pro, swerebench, terminal_bench
missing: BUILD/AALiveCodeBench, BUILD/BFCL, GEN/ArtificialAnalysisMath, GEN/MMLUPro, PLAN/BFCL

kimi-k2.6 · moonshot · idea 79.1 · plan 84.5 · build 78.8 · review 75.7

group breakdown

BUILD 77.4 (6 / 35) · CRE 78.1 (13 / 35) · GEN 80.6 (7 / 35) · LM_ARENA_REVIEW_PROXY 45.2 (18 / 35) · OPS_long 81.9 (15 / 35) · OPS_precision 85.4 (13 / 35) · OPS_review 83.5 (14 / 35) · PLAN 87.0 (6 / 35)

metrics

ArtificialAnalysisCoding 76.7 (10 / 33) · ArtificialAnalysisIntelligence 88.4 (6 / 33) · ArtificialAnalysisReasoning 87.0 (8 / 33) · BlendedCost 86.4 (14 / 33) · ContextWindow 76.4 (24 / 33) · CopilotArenaOrLMArenaCode 93.4 (5 / 35) · GDPval 69.7 (17 / 35) · GPQA_HLE_Reasoning 87.0 (8 / 33) · IFBench 88.9 (10 / 33) · LMArenaCreativeOrOpenEnded 78.1 (13 / 35) · LMArenaDocument 40.3 (15 / 33) · LMArenaText 78.1 (13 / 35) · LongContextRecall 83.2 (10 / 33) · MCPAtlas 72.9 (8 / 30) · OutputSpeed 77.7 (19 / 32) · SWEAtlasComposite 50.0 (16 / 35) · SWEBenchMultilingual 95.0 (7 / 29) · SWEBenchPro 95.0 (8 / 31) · SWEBenchVerified 95.0 (10 / 33) · SWEComposite 94.0 (4 / 35) · SWERebench 92.5 (7 / 34) · SciCode 90.7 (4 / 33) · SonarComposite 50.0 (25 / 35) · TTFT 95.3 (7 / 32) · Tau2Bench 95.3 (7 / 33) · TerminalBench 95.0 (4 / 35) · TerminalBenchHard 70.9 (10 / 33)

sources: artificial_analysis, lmarena, openrouter, overrides
missing: BUILD/AALiveCodeBench, BUILD/BFCL, BUILD/GSO, GEN/ARC_AGI_2, GEN/ArtificialAnalysisMath, GEN/MMLUPro, LM_ARENA_REVIEW_PROXY/LMArenaSearch, PLAN/BFCL, SWEAtlasComposite/SWEAtlasQnA, SWEAtlasComposite/SWEAtlasRefactoring, SWEAtlasComposite/SWEAtlasTestWriting, SonarComposite/SonarBugDensity, SonarComposite/SonarFunctionalSkill, SonarComposite/SonarIssueDensity, SonarComposite/SonarVulnerabilityDensity

muse-spark · meta · idea 87.9 · plan 83.7 · build 78.4 · review 76.7

group breakdown

BUILD 76.9 (7 / 35) · CRE 92.7 (7 / 35) · GEN 77.8 (9 / 35) · LM_ARENA_REVIEW_PROXY 48.8 (10 / 35) · OPS_long 87.5 (5 / 35) · OPS_precision 84.3 (14 / 35) · OPS_review 86.4 (9 / 35) · PLAN 87.6 (5 / 35)

metrics

AALiveCodeBench 91.4 (6 / 17) · ARC_AGI_2 7.6 (21 / 28) · ArtificialAnalysisCoding 78.1 (9 / 33) · ArtificialAnalysisIntelligence 81.8 (10 / 33) · ArtificialAnalysisMath 92.5 (6 / 17) · ArtificialAnalysisReasoning 89.7 (6 / 33) · ContextWindow 92.5 (19 / 33) · CopilotArenaOrLMArenaCode 89.6 (6 / 35) · GDPval 87.6 (8 / 35) · GPQA_HLE_Reasoning 89.7 (6 / 33) · GSO 19.4 (14 / 17) · IFBench 88.7 (11 / 33) · LMArenaCreativeOrOpenEnded 92.7 (7 / 35) · LMArenaDocument 38.6 (16 / 33) · LMArenaSearch 59.0 (12 / 20) · LMArenaText 92.7 (7 / 35) · LongContextRecall 83.2 (9 / 33) · MCPAtlas 100.0 (1 / 30) · MMLUPro 86.5 (5 / 26) · OutputSpeed 90.5 (7 / 32) · SWEAtlasComposite 47.2 (23 / 35) · SWEAtlasQnA 44.0 (9 / 21) · SWEAtlasTestWriting 46.7 (9 / 21) · SWEBenchMultilingual 92.5 (11 / 29) · SWEBenchPro 100.0 (1 / 31) · SWEBenchVerified 92.5 (17 / 33) · SWEComposite 87.0 (10 / 35) · SWERebench 72.2 (21 / 34) · SciCode 79.3 (8 / 33) · SonarBugDensity 65.8 (9 / 19) · SonarComposite 58.0 (10 / 35) · SonarFunctionalSkill 72.5 (8 / 19) · SonarIssueDensity 30.1 (9 / 19) · SonarVulnerabilityDensity 55.4 (13 / 19) · TTFT 75.5 (21 / 32) · Tau2Bench 83.5 (17 / 33) · TerminalBench 89.3 (12 / 35) · TerminalBenchHard 75.6 (9 / 33)

sources: artificial_analysis, lmarena, mcp_atlas, sweatlas_qna, sweatlas_test_writing, swebench_pro
missing: BUILD/BFCL, OPS_long/BlendedCost, OPS_precision/BlendedCost, OPS_review/BlendedCost, PLAN/BFCL, SWEAtlasComposite/SWEAtlasRefactoring

gpt-5.4 · openai · idea 69.7 · plan 58.2 · build 78.2 · review 66.7

group breakdown

BUILD 81.8 (4 / 35) · CRE 76.3 (14 / 35) · GEN 59.0 (19 / 35) · LM_ARENA_REVIEW_PROXY 57.0 (5 / 35) · OPS_long 59.5 (32 / 35) · OPS_precision 61.6 (28 / 35) · OPS_review 66.6 (27 / 35) · PLAN 57.1 (25 / 35)

metrics

ARC_AGI_2 76.5 (6 / 28) · BlendedCost 70.4 (25 / 33) · ContextWindow 100.0 (1 / 33) · CopilotArenaOrLMArenaCode 43.2 (28 / 35) · GDPval 91.9 (4 / 35) · GSO 54.0 (6 / 17) · LMArenaCreativeOrOpenEnded 76.3 (14 / 35) · LMArenaDocument 61.2 (5 / 33) · LMArenaSearch 52.9 (13 / 20) · LMArenaText 76.3 (14 / 35) · MCPAtlas 53.4 (11 / 30) · SWEAtlasComposite 92.4 (2 / 35) · SWEAtlasQnA 92.6 (2 / 21) · SWEAtlasRefactoring 91.7 (3 / 19) · SWEAtlasTestWriting 93.2 (2 / 21) · SWEBenchPro 92.5 (17 / 31) · SWEBenchVerified 95.0 (11 / 33) · SWEComposite 88.9 (9 / 35) · SWERebench 83.5 (11 / 34) · SonarBugDensity 84.2 (5 / 19) · SonarComposite 60.5 (9 / 35) · SonarFunctionalSkill 67.0 (12 / 19) · SonarIssueDensity 7.3 (17 / 19) · SonarVulnerabilityDensity 100.0 (1 / 19) · TerminalBench 92.5 (8 / 35)

sources: arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, sweatlas_qna, sweatlas_refactoring, sweatlas_test_writing, swebench_pro
missing: BUILD/AALiveCodeBench, BUILD/ArtificialAnalysisCoding, BUILD/BFCL, BUILD/LongContextRecall, BUILD/SciCode, BUILD/TerminalBenchHard, GEN/ArtificialAnalysisIntelligence, GEN/ArtificialAnalysisMath, GEN/GPQA_HLE_Reasoning, GEN/MMLUPro, OPS_long/OutputSpeed, OPS_long/TTFT, OPS_precision/OutputSpeed, OPS_precision/TTFT, OPS_review/OutputSpeed, OPS_review/TTFT, PLAN/ArtificialAnalysisReasoning, PLAN/BFCL, PLAN/IFBench, PLAN/LongContextRecall, PLAN/Tau2Bench, PLAN/TerminalBenchHard, SWEComposite/SWEBenchMultilingual

qwen3.7-max · alibaba · idea 89.1 · plan 87.9 · build 77.3 · review 76.9

group breakdown

BUILD 75.5 (8 / 35) · CRE 92.9 (6 / 35) · GEN 83.7 (6 / 35) · LM_ARENA_REVIEW_PROXY 45.9 (15 / 35) · OPS_long 79.4 (18 / 35) · OPS_precision 81.7 (18 / 35) · OPS_review 83.0 (16 / 35) · PLAN 91.7 (1 / 35)

metrics

ARC_AGI_2 11.9 (19 / 28) · ArtificialAnalysisCoding 87.3 (5 / 33) · ArtificialAnalysisIntelligence 98.8 (4 / 33) · ArtificialAnalysisReasoning 94.1 (5 / 33) · BFCL 95.0 (2 / 14) · BlendedCost 75.1 (20 / 33) · ContextWindow 99.2 (17 / 33) · CopilotArenaOrLMArenaCode 68.4 (14 / 35) · GDPval 73.9 (13 / 35) · GPQA_HLE_Reasoning 94.1 (5 / 33) · IFBench 100.0 (1 / 33) · LMArenaCreativeOrOpenEnded 92.9 (6 / 35) · LMArenaDocument 41.8 (12 / 33) · LMArenaText 92.9 (6 / 35) · LongContextRecall 79.7 (13 / 33) · MCPAtlas 77.5 (5 / 30) · MMLUPro 81.6 (7 / 26) · OutputSpeed 72.4 (28 / 32) · SWEAtlasComposite 35.4 (26 / 35) · SWEAtlasQnA 35.7 (11 / 21) · SWEAtlasRefactoring 34.2 (10 / 19) · SWEAtlasTestWriting 36.7 (15 / 21) · SWEBenchMultilingual 95.0 (8 / 29) · SWEBenchPro 95.0 (12 / 31) · SWEBenchVerified 95.0 (14 / 33) · SWEComposite 86.2 (11 / 35) · SWERebench 72.9 (19 / 34) · SciCode 64.0 (14 / 33) · SonarBugDensity 92.5 (4 / 19) · SonarComposite 80.6 (4 / 35) · SonarFunctionalSkill 66.9 (14 / 19) · SonarIssueDensity 92.5 (3 / 19) · SonarVulnerabilityDensity 81.6 (8 / 19) · TTFT 85.9 (15 / 32) · Tau2Bench 92.2 (13 / 33) · TerminalBench 95.0 (5 / 35) · TerminalBenchHard 92.0 (5 / 33)

sources: artificial_analysis, lmarena, openrouter, overrides
missing: BUILD/AALiveCodeBench, BUILD/GSO, GEN/ArtificialAnalysisMath, LM_ARENA_REVIEW_PROXY/LMArenaSearch

deepseek-v4-pro · deepseek · idea 73.4 · plan 81.1 · build 76.4 · review 73.8

group breakdown

BUILD 75.2 (9 / 35) · CRE 72.6 (17 / 35) · GEN 76.8 (10 / 35) · LM_ARENA_REVIEW_PROXY 45.9 (12 / 35) · OPS_long 67.6 (27 / 35) · OPS_precision 81.0 (19 / 35) · OPS_review 81.4 (18 / 35) · PLAN 84.0 (8 / 35)

metrics

ArtificialAnalysisCoding 78.1 (8 / 33) · ArtificialAnalysisIntelligence 79.1 (12 / 33) · ArtificialAnalysisReasoning 82.2 (11 / 33) · BlendedCost 97.5 (4 / 33) · ContextWindow 100.0 (3 / 33) · CopilotArenaOrLMArenaCode 70.9 (13 / 35) · GDPval 68.6 (20 / 35) · GPQA_HLE_Reasoning 82.2 (11 / 33) · IFBench 90.2 (7 / 33) · LMArenaCreativeOrOpenEnded 72.6 (17 / 35) · LMArenaDocument 41.8 (9 / 33) · LMArenaText 72.6 (17 / 35) · LongContextRecall 66.0 (16 / 33) · MCPAtlas 68.4 (9 / 30) · MMLUPro 69.1 (10 / 26) · OutputSpeed 43.1 (31 / 32) · SWEAtlasComposite 50.0 (12 / 35) · SWEBenchMultilingual 95.0 (5 / 29) · SWEBenchPro 95.0 (4 / 31) · SWEBenchVerified 95.0 (8 / 33) · SWEComposite 94.0 (3 / 35) · SWERebench 92.5 (5 / 34) · SciCode 70.8 (11 / 33) · SonarComposite 50.0 (17 / 35) · TTFT 96.0 (6 / 32) · Tau2Bench 96.1 (5 / 33) · TerminalBench 89.9 (10 / 35) · TerminalBenchHard 77.9 (8 / 33)

sources: artificial_analysis, lmarena, openrouter, overrides
missing: BUILD/AALiveCodeBench, BUILD/BFCL, BUILD/GSO, GEN/ARC_AGI_2, GEN/ArtificialAnalysisMath, LM_ARENA_REVIEW_PROXY/LMArenaSearch, PLAN/BFCL, SWEAtlasComposite/SWEAtlasQnA, SWEAtlasComposite/SWEAtlasRefactoring, SWEAtlasComposite/SWEAtlasTestWriting, SonarComposite/SonarBugDensity, SonarComposite/SonarFunctionalSkill, SonarComposite/SonarIssueDensity, SonarComposite/SonarVulnerabilityDensity

glm-5.1 · zai · idea 83.3 · plan 77.6 · build 75.6 · review 71.5

group breakdown

BUILD 74.3 (10 / 35) · CRE 87.8 (8 / 35) · GEN 74.5 (11 / 35) · LM_ARENA_REVIEW_PROXY 45.9 (17 / 35) · OPS_long 81.4 (16 / 35) · OPS_precision 85.8 (12 / 35) · OPS_review 83.1 (15 / 35) · PLAN 78.5 (11 / 35)

metrics

ArtificialAnalysisCoding 63.6 (15 / 33) · ArtificialAnalysisIntelligence 78.7 (13 / 33) · ArtificialAnalysisReasoning 61.4 (19 / 33) · BlendedCost 85.7 (16 / 33) · ContextWindow 72.0 (30 / 33) · CopilotArenaOrLMArenaCode 98.4 (3 / 35) · GDPval 75.3 (11 / 35) · GPQA_HLE_Reasoning 61.4 (19 / 33) · IFBench 89.6 (9 / 33) · LMArenaCreativeOrOpenEnded 87.8 (8 / 35) · LMArenaDocument 41.8 (14 / 33) · LMArenaText 87.8 (8 / 35) · LongContextRecall 45.4 (29 / 33) · MCPAtlas 76.9 (6 / 30) · OutputSpeed 76.6 (21 / 32) · SWEAtlasComposite 50.0 (22 / 35) · SWEBenchMultilingual 92.5 (15 / 29) · SWEBenchPro 95.0 (15 / 31) · SWEBenchVerified 92.5 (21 / 33) · SWEComposite 96.4 (1 / 35) · SWERebench 100.0 (1 / 34) · SciCode 35.6 (23 / 33) · SonarComposite 50.0 (31 / 35) · TTFT 99.8 (3 / 32) · Tau2Bench 100.0 (4 / 33) · TerminalBench 90.2 (9 / 35) · TerminalBenchHard 68.5 (14 / 33)

sources: artificial_analysis, lmarena, mcp_atlas, openrouter, overrides, swerebench
missing: BUILD/AALiveCodeBench, BUILD/BFCL, BUILD/GSO, GEN/ARC_AGI_2, GEN/ArtificialAnalysisMath, GEN/MMLUPro, LM_ARENA_REVIEW_PROXY/LMArenaSearch, PLAN/BFCL, SWEAtlasComposite/SWEAtlasQnA, SWEAtlasComposite/SWEAtlasRefactoring, SWEAtlasComposite/SWEAtlasTestWriting, SonarComposite/SonarBugDensity, SonarComposite/SonarFunctionalSkill, SonarComposite/SonarIssueDensity, SonarComposite/SonarVulnerabilityDensity

gpt-5.3-codex · openai · idea 66.7 · plan 57.7 · build 75.5 · review 65.3

group breakdown

BUILD 78.7 (5 / 35) · CRE 72.3 (18 / 35) · GEN 57.6 (20 / 35) · LM_ARENA_REVIEW_PROXY 56.0 (6 / 35) · OPS_long 57.3 (34 / 35) · OPS_precision 59.5 (31 / 35) · OPS_review 62.8 (29 / 35) · PLAN 57.5 (24 / 35)

metrics

ARC_AGI_2 72.5 (7 / 28) · BlendedCost 72.2 (24 / 33) · ContextWindow 83.6 (21 / 33) · CopilotArenaOrLMArenaCode 38.1 (32 / 35) · GDPval 69.2 (18 / 35) · GSO 53.4 (7 / 17) · LMArenaCreativeOrOpenEnded 72.3 (18 / 35) · LMArenaDocument 59.5 (6 / 33) · LMArenaSearch 52.4 (14 / 20) · LMArenaText 72.3 (18 / 35) · SWEAtlasComposite 86.4 (4 / 35) · SWEAtlasQnA 86.2 (4 / 21) · SWEAtlasRefactoring 85.4 (5 / 19) · SWEAtlasTestWriting 87.9 (4 / 21) · SWEBenchPro 95.0 (9 / 31) · SWEBenchVerified 92.5 (19 / 33) · SWEComposite 92.1 (5 / 35) · SWERebench 89.4 (9 / 34) · SonarBugDensity 80.4 (7 / 19) · SonarComposite 60.8 (8 / 35) · SonarFunctionalSkill 72.3 (10 / 19) · SonarIssueDensity 7.5 (15 / 19) · SonarVulnerabilityDensity 92.5 (4 / 19) · TerminalBench 97.0 (3 / 35)

sources: artificial_analysis, lmarena, openrouter, overrides, sonar, sweatlas_test_writing, swerebench, terminal_bench
missing: BUILD/AALiveCodeBench, BUILD/ArtificialAnalysisCoding, BUILD/BFCL, BUILD/LongContextRecall, BUILD/MCPAtlas, BUILD/SciCode, BUILD/TerminalBenchHard, GEN/ArtificialAnalysisIntelligence, GEN/ArtificialAnalysisMath, GEN/GPQA_HLE_Reasoning, GEN/MMLUPro, OPS_long/OutputSpeed, OPS_long/TTFT, OPS_precision/OutputSpeed, OPS_precision/TTFT, OPS_review/OutputSpeed, OPS_review/TTFT, PLAN/ArtificialAnalysisReasoning, PLAN/BFCL, PLAN/IFBench, PLAN/LongContextRecall, PLAN/MCPAtlas, PLAN/Tau2Bench, PLAN/TerminalBenchHard, SWEComposite/SWEBenchMultilingual

gemini-3.1-pro-preview · google · idea 98.0 · plan 92.8 · build 73.9 · review 75.9

group breakdown

BUILD 72.2 (13 / 35) · CRE 99.0 (3 / 35) · GEN 99.7 (1 / 35) · LM_ARENA_REVIEW_PROXY 50.0 (7 / 35) · OPS_long 83.4 (14 / 35) · OPS_precision 75.5 (23 / 35) · OPS_review 80.3 (23 / 35) · PLAN 90.6 (2 / 35)

metrics

ARC_AGI_2 100.0 (1 / 28) · ArtificialAnalysisCoding 100.0 (1 / 33) · ArtificialAnalysisIntelligence 100.0 (2 / 33) · ArtificialAnalysisReasoning 100.0 (1 / 33) · BFCL 92.5 (7 / 14) · BlendedCost 73.0 (22 / 33) · ContextWindow 100.0 (7 / 33) · CopilotArenaOrLMArenaCode 67.3 (15 / 35) · GDPval 48.9 (26 / 35) · GPQA_HLE_Reasoning 100.0 (1 / 33) · GSO 51.3 (8 / 17) · IFBench 92.0 (6 / 33) · LMArenaCreativeOrOpenEnded 99.0 (3 / 35) · LMArenaDocument 28.9 (19 / 33) · LMArenaSearch 71.1 (5 / 20) · LMArenaText 99.0 (3 / 35) · LongContextRecall 98.6 (3 / 33) · MCPAtlas 52.2 (15 / 30) · OutputSpeed 92.1 (4 / 32) · SWEAtlasComposite 35.8 (24 / 35) · SWEAtlasQnA 12.7 (15 / 21) · SWEAtlasTestWriting 40.1 (13 / 21) · SWEBenchMultilingual 36.0 (20 / 29) · SWEBenchPro 78.4 (21 / 31) · SWEBenchVerified 95.0 (9 / 33) · SWEComposite 85.3 (15 / 35) · SWERebench 100.0 (2 / 34) · SciCode 100.0 (2 / 33) · SonarBugDensity 52.6 (14 / 19) · SonarComposite 54.2 (14 / 35) · SonarFunctionalSkill 78.7 (7 / 19) · SonarIssueDensity 13.7 (13 / 19) · SonarVulnerabilityDensity 58.4 (12 / 19) · TTFT 52.3 (28 / 32) · Tau2Bench 94.5 (9 / 33) · TerminalBench 88.0 (13 / 35) · TerminalBenchHard 100.0 (1 / 33)

sources: arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, sweatlas_qna, sweatlas_test_writing, swebench_pro, swerebench, terminal_bench
missing: BUILD/AALiveCodeBench, GEN/ArtificialAnalysisMath, GEN/MMLUPro, SWEAtlasComposite/SWEAtlasRefactoring

gemini-3.5-flash · google · idea 98.1 · plan 91.8 · build 72.9 · review 72.7

group breakdown

BUILD 70.0 (14 / 35) · CRE 100.0 (2 / 35) · GEN 95.5 (3 / 35) · LM_ARENA_REVIEW_PROXY 34.4 (28 / 35) · OPS_long 92.7 (3 / 35) · OPS_precision 86.7 (9 / 35) · OPS_review 89.1 (5 / 35) · PLAN 90.1 (3 / 35)

metrics

AALiveCodeBench 91.4 (5 / 17) · ARC_AGI_2 95.0 (3 / 28) · ArtificialAnalysisCoding 69.3 (13 / 33) · ArtificialAnalysisIntelligence 93.8 (5 / 33) · ArtificialAnalysisMath 92.5 (5 / 17) · ArtificialAnalysisReasoning 100.0 (2 / 33) · BlendedCost 76.3 (18 / 33) · ContextWindow 100.0 (8 / 33) · CopilotArenaOrLMArenaCode 89.0 (7 / 35) · GDPval 89.8 (7 / 35) · GPQA_HLE_Reasoning 100.0 (2 / 33) · GSO 19.4 (13 / 17) · IFBench 89.8 (8 / 33) · LMArenaCreativeOrOpenEnded 100.0 (2 / 35) · LMArenaDocument 9.8 (30 / 33) · LMArenaSearch 59.0 (11 / 20) · LMArenaText 100.0 (2 / 35) · LongContextRecall 81.4 (12 / 33) · MCPAtlas 95.0 (2 / 30) · MMLUPro 86.5 (4 / 26) · OutputSpeed 99.8 (2 / 32) · SWEAtlasComposite 18.3 (31 / 35) · SWEAtlasQnA 7.5 (19 / 21) · SWEAtlasRefactoring 7.5 (18 / 19) · SWEAtlasTestWriting 43.6 (11 / 21) · SWEBenchMultilingual 92.5 (10 / 29) · SWEBenchPro 95.0 (5 / 31) · SWEBenchVerified 92.5 (16 / 33) · SWEComposite 85.3 (16 / 35) · SWERebench 72.2 (20 / 34) · SciCode 88.4 (6 / 33) · SonarComposite 50.0 (20 / 35) · TTFT 75.8 (20 / 32) · Tau2Bench 93.7 (10 / 33) · TerminalBench 89.3 (11 / 35) · TerminalBenchHard 61.5 (17 / 33)

sources: artificial_analysis, lmarena, openrouter, overrides
missing: BUILD/BFCL, PLAN/BFCL, SonarComposite/SonarBugDensity, SonarComposite/SonarFunctionalSkill, SonarComposite/SonarIssueDensity, SonarComposite/SonarVulnerabilityDensity

claude-opus-4.5 · anthropic · idea 71.3 · plan 66.3 · build 72.5 · review 63.7

group breakdown

BUILD 74.2 (11 / 35) · CRE 71.4 (19 / 35) · GEN 71.5 (12 / 35) · LM_ARENA_REVIEW_PROXY 44.2 (19 / 35) · OPS_long 69.9 (26 / 35) · OPS_precision 62.8 (27 / 35) · OPS_review 62.4 (31 / 35) · PLAN 63.3 (21 / 35)

metrics

AALiveCodeBench 85.2 (7 / 17) · ARC_AGI_2 51.2 (8 / 28) · ArtificialAnalysisCoding 79.2 (7 / 33) · ArtificialAnalysisIntelligence 72.2 (15 / 33) · ArtificialAnalysisMath 87.8 (8 / 17) · ArtificialAnalysisReasoning 61.8 (18 / 33) · BFCL 13.7 (13 / 14) · BlendedCost 11.2 (29 / 33) · ContextWindow 71.8 (31 / 33) · CopilotArenaOrLMArenaCode 73.8 (11 / 35) · GDPval 83.4 (10 / 35) · GPQA_HLE_Reasoning 61.8 (18 / 33) · GSO 59.3 (5 / 17) · IFBench 40.7 (23 / 33) · LMArenaCreativeOrOpenEnded 71.4 (19 / 35) · LMArenaDocument 53.1 (7 / 33) · LMArenaSearch 35.4 (16 / 20) · LMArenaText 71.4 (19 / 35) · LongContextRecall 100.0 (1 / 33) · MCPAtlas 50.3 (16 / 30) · MMLUPro 98.6 (2 / 26) · OutputSpeed 78.0 (18 / 32) · SWEAtlasComposite 67.1 (7 / 35) · SWEAtlasQnA 67.5 (6 / 21) · SWEAtlasRefactoring 63.2 (7 / 19) · SWEAtlasTestWriting 71.9 (6 / 21) · SWEBenchMultilingual 95.0 (2 / 29) · SWEBenchPro 77.8 (22 / 31) · SWEBenchVerified 95.0 (4 / 33) · SWEComposite 81.5 (20 / 35) · SWERebench 76.3 (14 / 34) · SciCode 68.0 (12 / 33) · SonarBugDensity 73.3 (8 / 19) · SonarComposite 86.6 (1 / 35) · SonarFunctionalSkill 100.0 (1 / 19) · SonarIssueDensity 75.4 (4 / 19) · SonarVulnerabilityDensity 87.2 (5 / 19) · TTFT 75.4 (22 / 32) · Tau2Bench 78.0 (19 / 33) · TerminalBench 71.5 (21 / 35) · TerminalBenchHard 80.3 (6 / 33)

sources: artificial_analysis, bfcl, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench_pro, swerebench, terminal_bench
missing: none

qwen3.6-plus · alibaba · idea 64.2 · plan 73.6 · build 71.8 · review 70.7

group breakdown

BUILD 69.4 (15 / 35) · CRE 63.0 (23 / 35) · GEN 61.3 (18 / 35) · LM_ARENA_REVIEW_PROXY 45.9 (14 / 35) · OPS_long 84.7 (10 / 35) · OPS_precision 88.9 (6 / 35) · OPS_review 89.6 (4 / 35) · PLAN 79.7 (10 / 35)

metrics

ARC_AGI_2 11.9 (18 / 28) · ArtificialAnalysisCoding 61.9 (16 / 33) · ArtificialAnalysisIntelligence 73.3 (14 / 33) · ArtificialAnalysisReasoning 59.5 (20 / 33) · BlendedCost 94.1 (5 / 33) · ContextWindow 99.2 (16 / 33) · CopilotArenaOrLMArenaCode 71.6 (12 / 35) · GDPval 73.9 (12 / 35) · GPQA_HLE_Reasoning 59.5 (20 / 33) · IFBench 86.7 (13 / 33) · LMArenaCreativeOrOpenEnded 63.0 (23 / 35) · LMArenaDocument 41.8 (11 / 33) · LMArenaText 63.0 (23 / 35) · LongContextRecall 83.2 (11 / 33) · MCPAtlas 68.0 (10 / 30) · MMLUPro 83.5 (6 / 26) · OutputSpeed 76.3 (23 / 32) · SWEAtlasComposite 35.4 (25 / 35) · SWEAtlasQnA 35.7 (10 / 21) · SWEAtlasRefactoring 34.2 (9 / 19) · SWEAtlasTestWriting 36.7 (14 / 21) · SWEBenchMultilingual 92.5 (12 / 29) · SWEBenchPro 95.0 (11 / 31) · SWEBenchVerified 95.0 (13 / 33) · SWEComposite 85.9 (13 / 35) · SWERebench 72.9 (18 / 34) · SciCode 18.0 (28 / 33) · SonarBugDensity 92.5 (3 / 19) · SonarComposite 80.6 (3 / 35) · SonarFunctionalSkill 66.9 (13 / 19) · SonarIssueDensity 92.5 (2 / 19) · SonarVulnerabilityDensity 81.6 (7 / 19) · TTFT 92.2 (11 / 32) · Tau2Bench 100.0 (1 / 33) · TerminalBench 86.9 (14 / 35) · TerminalBenchHard 70.9 (11 / 33)

sources: artificial_analysis, lmarena, openrouter, overrides
missing: BUILD/AALiveCodeBench, BUILD/BFCL, BUILD/GSO, GEN/ArtificialAnalysisMath, LM_ARENA_REVIEW_PROXY/LMArenaSearch, PLAN/BFCL

claude-sonnet-4.6 · anthropic · idea 72.4 · plan 64.7 · build 70.5 · review 69.9

group breakdown

BUILD 72.7 (12 / 35) · CRE 75.4 (15 / 35) · GEN 67.7 (15 / 35) · LM_ARENA_REVIEW_PROXY 78.8 (4 / 35) · OPS_long 66.7 (28 / 35) · OPS_precision 52.8 (32 / 35) · OPS_review 62.7 (30 / 35) · PLAN 64.5 (18 / 35)

metrics

AALiveCodeBench 31.3 (11 / 17) · ARC_AGI_2 10.6 (20 / 28) · ArtificialAnalysisCoding 90.1 (4 / 33) · ArtificialAnalysisIntelligence 79.9 (11 / 33) · ArtificialAnalysisMath 75.9 (13 / 17) · ArtificialAnalysisReasoning 67.0 (15 / 33) · BFCL 92.5 (5 / 14) · BlendedCost 66.6 (28 / 33) · ContextWindow 99.2 (15 / 33) · CopilotArenaOrLMArenaCode 95.2 (4 / 35) · GDPval 89.9 (6 / 35) · GPQA_HLE_Reasoning 67.0 (15 / 33) · GSO 30.7 (10 / 17) · IFBench 37.1 (25 / 33) · LMArenaCreativeOrOpenEnded 75.4 (15 / 35) · LMArenaDocument 83.4 (3 / 33) · LMArenaSearch 74.2 (4 / 20) · LMArenaText 75.4 (15 / 35) · LongContextRecall 88.3 (6 / 33) · MCPAtlas 49.0 (17 / 30) · MMLUPro 71.9 (9 / 26) · OutputSpeed 82.1 (14 / 32) · SWEAtlasComposite 56.5 (8 / 35) · SWEAtlasQnA 64.5 (8 / 21) · SWEAtlasRefactoring 55.3 (8 / 19) · SWEAtlasTestWriting 50.1 (8 / 21) · SWEBenchMultilingual 95.0 (4 / 29) · SWEBenchPro 68.1 (26 / 31) · SWEBenchVerified 95.0 (6 / 33) · SWEComposite 85.9 (14 / 35) · SWERebench 95.8 (3 / 34) · SciCode 52.7 (17 / 33) · SonarBugDensity 65.5 (10 / 19) · SonarComposite 55.7 (11 / 35) · SonarFunctionalSkill 84.3 (4 / 19) · SonarIssueDensity 22.3 (11 / 19) · SonarVulnerabilityDensity 22.2 (17 / 19) · TTFT 0.0 (32 / 32) · Tau2Bench 41.1 (25 / 33) · TerminalBench 74.8 (19 / 35) · TerminalBenchHard 99.1 (3 / 33)

sources: arc_agi, artificial_analysis, lmarena, mcp_atlas, openrouter, overrides, sonar, sweatlas_qna, sweatlas_refactoring, sweatlas_test_writing, swerebench, terminal_bench
missing: none

deepseek-v4-flash · deepseek · idea 60.0 · plan 70.4 · build 68.4 · review 66.7

group breakdown

BUILD 65.9 (16 / 35) · CRE 55.7 (25 / 35) · GEN 61.8 (17 / 35) · LM_ARENA_REVIEW_PROXY 45.9 (11 / 35) · OPS_long 86.5 (8 / 35) · OPS_precision 89.5 (5 / 35) · OPS_review 85.0 (12 / 35) · PLAN 73.4 (15 / 35)

metrics

ArtificialAnalysisCoding 47.0 (23 / 33) · ArtificialAnalysisIntelligence 59.8 (22 / 33) · ArtificialAnalysisReasoning 75.4 (13 / 33) · BlendedCost 99.4 (2 / 33) · ContextWindow 55.2 (32 / 33) · CopilotArenaOrLMArenaCode 86.9 (8 / 35) · GDPval 68.6 (19 / 35) · GPQA_HLE_Reasoning 75.4 (13 / 33) · IFBench 97.5 (4 / 33) · LMArenaCreativeOrOpenEnded 55.7 (25 / 35) · LMArenaDocument 41.8 (8 / 33) · LMArenaText 55.7 (25 / 35) · LongContextRecall 48.8 (27 / 33) · MCPAtlas 40.4 (19 / 30) · MMLUPro 61.9 (15 / 26) · OutputSpeed 87.8 (8 / 32) · SWEAtlasComposite 50.0 (11 / 35) · SWEBenchMultilingual 59.1 (17 / 29) · SWEBenchPro 91.6 (19 / 31) · SWEBenchVerified 95.0 (7 / 33) · SWEComposite 89.2 (8 / 35) · SWERebench 92.5 (4 / 34) · SciCode 41.9 (21 / 33) · SonarComposite 50.0 (16 / 35) · TTFT 100.0 (1 / 32) · Tau2Bench 92.9 (12 / 33) · TerminalBench 78.0 (17 / 35) · TerminalBenchHard 45.1 (23 / 33)

sources: artificial_analysis, lmarena, openrouter, overrides
missing: BUILD/AALiveCodeBench, BUILD/BFCL, BUILD/GSO, GEN/ARC_AGI_2, GEN/ArtificialAnalysisMath, LM_ARENA_REVIEW_PROXY/LMArenaSearch, PLAN/BFCL, SWEAtlasComposite/SWEAtlasQnA, SWEAtlasComposite/SWEAtlasRefactoring, SWEAtlasComposite/SWEAtlasTestWriting, SonarComposite/SonarBugDensity, SonarComposite/SonarFunctionalSkill, SonarComposite/SonarIssueDensity, SonarComposite/SonarVulnerabilityDensity

mimo-v2.5-pro · xiaomi · idea 77.5 · plan 76.5 · build 67.2 · review 67.7

group breakdown

BUILD 64.0 (18 / 35) · CRE 81.5 (11 / 35) · GEN 67.8 (14 / 35) · LM_ARENA_REVIEW_PROXY 38.3 (24 / 35) · OPS_long 83.7 (13 / 35) · OPS_precision 86.5 (10 / 35) · OPS_review 87.6 (6 / 35) · PLAN 80.9 (9 / 35)

metrics

ARC_AGI_2 20.3 (14 / 28) · ArtificialAnalysisCoding 71.0 (12 / 33) · ArtificialAnalysisIntelligence 88.0 (7 / 33) · ArtificialAnalysisReasoning 73.1 (14 / 33) · BlendedCost 85.7 (15 / 33) · ContextWindow 100.0 (10 / 33) · CopilotArenaOrLMArenaCode 75.3 (10 / 35) · GDPval 68.6 (24 / 35) · GPQA_HLE_Reasoning 73.1 (14 / 33) · IFBench 99.3 (3 / 33) · LMArenaCreativeOrOpenEnded 81.5 (11 / 35) · LMArenaDocument 26.6 (23 / 33) · LMArenaText 81.5 (11 / 35) · LongContextRecall 100.0 (2 / 33) · MCPAtlas 29.3 (23 / 30) · MMLUPro 5.0 (25 / 26) · OutputSpeed 76.6 (20 / 32) · SWEAtlasComposite 22.5 (29 / 35) · SWEAtlasQnA 17.3 (14 / 21) · SWEAtlasRefactoring 25.8 (13 / 19) · SWEAtlasTestWriting 23.4 (18 / 21) · SWEBenchMultilingual 92.5 (14 / 29) · SWEBenchPro 95.0 (14 / 31) · SWEBenchVerified 95.0 (15 / 33) · SWEComposite 82.2 (17 / 35) · SWERebench 63.6 (26 / 34) · SciCode 72.0 (10 / 33) · SonarComposite 50.0 (30 / 35) · TTFT 89.7 (13 / 32) · Tau2Bench 90.6 (14 / 33) · TerminalBench 95.0 (6 / 35) · TerminalBenchHard 68.5 (13 / 33)

sources: artificial_analysis, lmarena, openrouter, overrides
missing: BUILD/AALiveCodeBench, BUILD/BFCL, BUILD/GSO, GEN/ArtificialAnalysisMath, LM_ARENA_REVIEW_PROXY/LMArenaSearch, PLAN/BFCL, SonarComposite/SonarBugDensity, SonarComposite/SonarFunctionalSkill, SonarComposite/SonarIssueDensity, SonarComposite/SonarVulnerabilityDensity

gemini-3-pro · google · idea 88.6 · plan 76.0 · build 65.5 · review 64.9

group breakdown

BUILD 65.6 (17 / 35) · CRE 96.4 (4 / 35) · GEN 79.2 (8 / 35) · LM_ARENA_REVIEW_PROXY 44.0 (20 / 35) · OPS_long 63.7 (29 / 35) · OPS_precision 52.3 (33 / 35) · OPS_review 49.4 (34 / 35) · PLAN 77.3 (12 / 35)

metrics

AALiveCodeBench 100.0 (1 / 17) · ARC_AGI_2 42.2 (9 / 28) · ArtificialAnalysisCoding 74.6 (11 / 33) · ArtificialAnalysisIntelligence 67.2 (18 / 33) · ArtificialAnalysisMath 97.5 (3 / 17) · ArtificialAnalysisReasoning 89.1 (7 / 33) · BFCL 93.2 (3 / 14) · BlendedCost 73.0 (21 / 33) · ContextWindow 0.0 (33 / 33) · CopilotArenaOrLMArenaCode 63.0 (18 / 35) · GDPval 33.9 (30 / 35) · GPQA_HLE_Reasoning 89.1 (7 / 33) · GSO 40.7 (9 / 17) · IFBench 74.0 (17 / 33) · LMArenaCreativeOrOpenEnded 96.4 (4 / 35) · LMArenaDocument 24.7 (24 / 33) · LMArenaSearch 63.3 (8 / 20) · LMArenaText 96.4 (4 / 35) · LongContextRecall 88.3 (7 / 33) · MCPAtlas 52.6 (13 / 30) · MMLUPro 100.0 (1 / 26) · OutputSpeed 92.0 (5 / 32) · SWEAtlasComposite 50.0 (14 / 35) · SWEBenchMultilingual 33.5 (21 / 29) · SWEBenchPro 70.4 (24 / 31) · SWEBenchVerified 81.4 (27 / 33) · SWEComposite 68.3 (25 / 35) · SWERebench 70.3 (23 / 34) · SciCode 100.0 (1 / 33) · SonarBugDensity 53.0 (12 / 19) · SonarComposite 54.9 (12 / 35) · SonarFunctionalSkill 83.8 (5 / 19) · SonarIssueDensity 7.3 (18 / 19) · SonarVulnerabilityDensity 59.8 (10 / 19) · TTFT 29.0 (29 / 32) · Tau2Bench 71.7 (21 / 33) · TerminalBench 79.9 (16 / 35) · TerminalBenchHard 63.8 (15 / 33)

sources: arc_agi, artificial_analysis, bfcl, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swebench_pro, swerebench, terminal_bench
missing: SWEAtlasComposite/SWEAtlasQnA, SWEAtlasComposite/SWEAtlasRefactoring, SWEAtlasComposite/SWEAtlasTestWriting

glm-5 · zai · idea 65.4 · plan 62.6 · build 63.7 · review 61.9

group breakdown

BUILD 61.2 (19 / 35) · CRE 68.8 (20 / 35) · GEN 53.3 (24 / 35) · LM_ARENA_REVIEW_PROXY 45.9 (16 / 35) · OPS_long 84.1 (11 / 35) · OPS_precision 88.1 (8 / 35) · OPS_review 85.4 (11 / 35) · PLAN 65.1 (16 / 35)

metrics

ARC_AGI_2 5.2 (23 / 28) · ArtificialAnalysisCoding 40.3 (26 / 33) · ArtificialAnalysisIntelligence 61.0 (20 / 33) · ArtificialAnalysisReasoning 51.5 (25 / 33) · BlendedCost 91.3 (11 / 33) · ContextWindow 72.0 (29 / 33) · CopilotArenaOrLMArenaCode 61.8 (21 / 35) · GDPval 73.9 (14 / 35) · GPQA_HLE_Reasoning 51.5 (25 / 33) · IFBench 81.5 (14 / 33) · LMArenaCreativeOrOpenEnded 68.8 (20 / 35) · LMArenaDocument 41.8 (13 / 33) · LMArenaText 68.8 (20 / 35) · LongContextRecall 36.8 (30 / 33) · MCPAtlas 42.1 (18 / 30) · OutputSpeed 80.3 (16 / 32) · SWEAtlasComposite 32.8 (27 / 35) · SWEAtlasQnA 33.2 (12 / 21) · SWEAtlasRefactoring 31.4 (11 / 19) · SWEAtlasTestWriting 34.3 (16 / 21) · SWEBenchMultilingual 51.2 (18 / 29) · SWEBenchPro 92.5 (18 / 31) · SWEBenchVerified 91.0 (22 / 33) · SWEComposite 81.9 (18 / 35) · SWERebench 76.9 (13 / 34) · SciCode 34.5 (24 / 33) · SonarBugDensity 100.0 (1 / 19) · SonarComposite 86.0 (2 / 35) · SonarFunctionalSkill 69.9 (11 / 19) · SonarIssueDensity 100.0 (1 / 19) · SonarVulnerabilityDensity 87.2 (6 / 19) · TTFT 100.0 (2 / 32) · Tau2Bench 100.0 (3 / 33) · TerminalBench 72.8 (20 / 35) · TerminalBenchHard 38.0 (26 / 33)

sources: arc_agi, artificial_analysis, lmarena, openrouter, overrides, sonar, sweatlas_qna, sweatlas_refactoring, sweatlas_test_writing, swebench, swerebench, terminal_bench
missing: BUILD/AALiveCodeBench, BUILD/BFCL, BUILD/GSO, GEN/ArtificialAnalysisMath, GEN/MMLUPro, LM_ARENA_REVIEW_PROXY/LMArenaSearch, PLAN/BFCL

mimo-v2.5 · xiaomi · idea 53.2 · plan 61.3 · build 62.2 · review 60.2

group breakdown

BUILD 59.2 (21 / 35) · CRE 50.2 (26 / 35) · GEN 49.7 (26 / 35) · LM_ARENA_REVIEW_PROXY 38.3 (23 / 35) · OPS_long 88.9 (4 / 35) · OPS_precision 90.2 (1 / 35) · OPS_review 91.3 (3 / 35) · PLAN 64.8 (17 / 35)

metrics

ARC_AGI_2 20.3 (13 / 28) · ArtificialAnalysisCoding 59.0 (18 / 33) · ArtificialAnalysisIntelligence 69.5 (17 / 33) · ArtificialAnalysisReasoning 51.5 (24 / 33) · BlendedCost 93.0 (9 / 33) · ContextWindow 100.0 (9 / 33) · CopilotArenaOrLMArenaCode 62.8 (19 / 35) · GDPval 68.6 (23 / 35) · GPQA_HLE_Reasoning 51.5 (24 / 33) · IFBench 65.3 (20 / 33) · LMArenaCreativeOrOpenEnded 50.2 (26 / 35) · LMArenaDocument 26.6 (22 / 33) · LMArenaText 50.2 (26 / 35) · LongContextRecall 47.1 (28 / 33) · MCPAtlas 29.3 (22 / 30) · MMLUPro 5.0 (24 / 26) · OutputSpeed 85.2 (11 / 32) · SWEAtlasComposite 22.5 (28 / 35) · SWEAtlasQnA 17.3 (13 / 21) · SWEAtlasRefactoring 25.8 (12 / 19) · SWEAtlasTestWriting 23.4 (17 / 21) · SWEBenchMultilingual 92.5 (13 / 29) · SWEBenchPro 95.0 (13 / 31) · SWEBenchVerified 92.5 (20 / 33) · SWEComposite 81.8 (19 / 35) · SWERebench 63.6 (25 / 34) · SciCode 31.7 (25 / 33) · SonarComposite 50.0 (29 / 35) · TTFT 88.6 (14 / 32) · Tau2Bench 81.2 (18 / 33) · TerminalBench 94.3 (7 / 35) · TerminalBenchHard 63.8 (16 / 33)

sources: artificial_analysis, lmarena, openrouter, overrides
missing: BUILD/AALiveCodeBench, BUILD/BFCL, BUILD/GSO, GEN/ArtificialAnalysisMath, LM_ARENA_REVIEW_PROXY/LMArenaSearch, PLAN/BFCL, SonarComposite/SonarBugDensity, SonarComposite/SonarFunctionalSkill, SonarComposite/SonarIssueDensity, SonarComposite/SonarVulnerabilityDensity

minimax-m2.7 · minimax · idea 40.1 · plan 60.7 · build 60.9 · review 58.6

group breakdown

BUILD 58.3 (22 / 35) · CRE 29.2 (31 / 35) · GEN 51.8 (25 / 35) · LM_ARENA_REVIEW_PROXY 38.3 (22 / 35) · OPS_long 81.2 (17 / 35) · OPS_precision 86.3 (11 / 35) · OPS_review 84.1 (13 / 35) · PLAN 63.0 (22 / 35)

metrics

ARC_AGI_2 11.9 (17 / 28) · ArtificialAnalysisCoding 58.3 (19 / 33) · ArtificialAnalysisIntelligence 71.8 (16 / 33) · ArtificialAnalysisReasoning 62.8 (17 / 33) · BlendedCost 98.3 (3 / 33) · ContextWindow 72.2 (26 / 33) · CopilotArenaOrLMArenaCode 50.3 (24 / 35) · GDPval 70.5 (16 / 35) · GPQA_HLE_Reasoning 62.8 (17 / 33) · IFBench 88.2 (12 / 33) · LMArenaCreativeOrOpenEnded 29.2 (31 / 35) · LMArenaDocument 26.6 (21 / 33) · LMArenaText 29.2 (31 / 35) · LongContextRecall 78.0 (14 / 33) · MCPAtlas 29.3 (21 / 30) · MMLUPro 68.0 (13 / 26) · OutputSpeed 75.6 (25 / 32) · SWEAtlasComposite 14.2 (33 / 35) · SWEAtlasQnA 10.4 (17 / 21) · SWEAtlasRefactoring 22.1 (14 / 19) · SWEAtlasTestWriting 7.5 (20 / 21) · SWEBenchMultilingual 95.0 (6 / 29) · SWEBenchPro 95.0 (7 / 31) · SWEBenchVerified 92.5 (18 / 33) · SWEComposite 86.0 (12 / 35) · SWERebench 73.4 (17 / 34) · SciCode 53.8 (16 / 33) · SonarComposite 50.0 (22 / 35) · TTFT 94.8 (8 / 32) · Tau2Bench 65.5 (22 / 33) · TerminalBench 58.5 (24 / 35) · TerminalBenchHard 56.8 (18 / 33)

sources: artificial_analysis, lmarena, openrouter, overrides, swerebench, terminal_bench
missing: BUILD/AALiveCodeBench, BUILD/BFCL, BUILD/GSO, GEN/ArtificialAnalysisMath, LM_ARENA_REVIEW_PROXY/LMArenaSearch, PLAN/BFCL, SonarComposite/SonarBugDensity, SonarComposite/SonarFunctionalSkill, SonarComposite/SonarIssueDensity, SonarComposite/SonarVulnerabilityDensity

gpt-5.2 · openai · idea 61.5 · plan 56.0 · build 60.5 · review 53.9

group breakdown

BUILD 61.1 (20 / 35) · CRE 64.6 (22 / 35) · GEN 56.4 (21 / 35) · LM_ARENA_REVIEW_PROXY 32.6 (29 / 35) · OPS_long 57.3 (33 / 35) · OPS_precision 59.5 (30 / 35) · OPS_review 62.8 (28 / 35) · PLAN 55.2 (26 / 35)

metrics

AALiveCodeBench 93.6 (3 / 17) · ARC_AGI_2 0.0 (28 / 28) · ArtificialAnalysisCoding 66.5 (14 / 33) · ArtificialAnalysisIntelligence 60.2 (21 / 33) · ArtificialAnalysisMath 99.7 (2 / 17) · ArtificialAnalysisReasoning 54.0 (22 / 33) · BFCL 40.8 (12 / 14) · BlendedCost 72.2 (23 / 33) · ContextWindow 83.6 (20 / 33) · CopilotArenaOrLMArenaCode 50.1 (25 / 35) · GDPval 67.2 (25 / 35) · GPQA_HLE_Reasoning 54.0 (22 / 33) · GSO 64.7 (4 / 17) · IFBench 60.2 (21 / 33) · LMArenaCreativeOrOpenEnded 64.6 (22 / 35) · LMArenaDocument 0.0 (33 / 33) · LMArenaSearch 65.1 (6 / 20) · LMArenaText 64.6 (22 / 35) · LongContextRecall 50.5 (26 / 33) · MMLUPro 57.5 (17 / 26) · SWEAtlasComposite 87.8 (3 / 35) · SWEAtlasQnA 86.2 (3 / 21) · SWEAtlasRefactoring 85.4 (4 / 19) · SWEAtlasTestWriting 92.5 (3 / 21) · SWEBenchMultilingual 0.0 (29 / 29) · SWEBenchPro 32.0 (30 / 31) · SWEBenchVerified 79.6 (28 / 33) · SWEComposite 43.2 (32 / 35) · SciCode 49.3 (18 / 33) · SonarBugDensity 80.4 (6 / 19) · SonarComposite 60.8 (7 / 35) · SonarFunctionalSkill 72.3 (9 / 19) · SonarIssueDensity 7.5 (14 / 19) · SonarVulnerabilityDensity 92.5 (3 / 19) · Tau2Bench 37.2 (28 / 33) · TerminalBench 76.0 (18 / 35) · TerminalBenchHard 68.5 (12 / 33)

sources: arc_agi, artificial_analysis, bfcl, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swebench_pro, terminal_bench
missing: BUILD/MCPAtlas, OPS_long/OutputSpeed, OPS_long/TTFT, OPS_precision/OutputSpeed, OPS_precision/TTFT, OPS_review/OutputSpeed, OPS_review/TTFT, PLAN/MCPAtlas, SWEComposite/SWERebench

gemini-3-flashgoogle81.668.056.156.0
gemini-3-flash

group breakdown

BUILD 52.1 (23 / 35) · CRE 85.8 (9 / 35) · GEN 69.7 (13 / 35) · LM_ARENA_REVIEW_PROXY 31.7 (30 / 35) · OPS_long 93.6 (2 / 35) · OPS_precision 90.1 (2 / 35) · OPS_review 92.1 (1 / 35) · PLAN 63.6 (20 / 35)

metrics

AALiveCodeBench 98.7 (2 / 17) · ARC_AGI_2 16.4 (15 / 28) · ArtificialAnalysisCoding 60.8 (17 / 33) · ArtificialAnalysisIntelligence 59.4 (23 / 33) · ArtificialAnalysisMath 100.0 (1 / 17) · ArtificialAnalysisReasoning 81.7 (12 / 33) · BlendedCost 89.1 (13 / 33) · ContextWindow 100.0 (6 / 33) · CopilotArenaOrLMArenaCode 62.5 (20 / 35) · GDPval 36.2 (28 / 35) · GPQA_HLE_Reasoning 81.7 (12 / 33) · GSO 14.0 (15 / 17) · IFBench 94.2 (5 / 33) · LMArenaCreativeOrOpenEnded 85.8 (9 / 35) · LMArenaDocument 2.7 (32 / 33) · LMArenaSearch 60.6 (9 / 20) · LMArenaText 85.8 (9 / 35) · LongContextRecall 66.0 (17 / 33) · MCPAtlas 14.6 (26 / 30) · MMLUPro 92.9 (3 / 26) · OutputSpeed 97.6 (3 / 32) · SWEAtlasComposite 12.7 (34 / 35) · SWEAtlasQnA 0.0 (21 / 21) · SWEAtlasRefactoring 0.0 (19 / 19) · SWEAtlasTestWriting 42.5 (12 / 21) · SWEBenchMultilingual 100.0 (1 / 29) · SWEBenchPro 45.5 (29 / 31) · SWEBenchVerified 100.0 (1 / 33) · SWEComposite 71.4 (23 / 35) · SWERebench 76.1 (15 / 34) · SciCode 74.2 (9 / 33) · SonarComposite 50.0 (19 / 35) · TTFT 80.0 (17 / 32) · Tau2Bench 53.7 (23 / 33) · TerminalBench 63.0 (22 / 35) · TerminalBenchHard 54.5 (19 / 33)
sources: arc_agi · artificial_analysis · gso · lmarena · mcp_atlas · openrouter · overrides · sonar · sweatlas_qna · sweatlas_refactoring · sweatlas_test_writing · swebench · swebench_pro · swerebench · terminal_bench
missing: BUILD/BFCL · PLAN/BFCL · SonarComposite/SonarBugDensity · SonarComposite/SonarFunctionalSkill · SonarComposite/SonarIssueDensity · SonarComposite/SonarVulnerabilityDensity
grok-4.3 · xai · idea 54.9 · plan 71.7 · build 55.4 · review 58.3
grok-4.3

group breakdown

BUILD 51.0 (26 / 35) · CRE 45.8 (28 / 35) · GEN 65.6 (16 / 35) · LM_ARENA_REVIEW_PROXY 27.5 (31 / 35) · OPS_long 84.7 (9 / 35) · OPS_precision 83.0 (17 / 35) · OPS_review 85.6 (10 / 35) · PLAN 74.2 (14 / 35)

metrics

AALiveCodeBench 67.9 (8 / 17) · ARC_AGI_2 25.3 (10 / 28) · ArtificialAnalysisCoding 55.2 (20 / 33) · ArtificialAnalysisIntelligence 85.7 (8 / 33) · ArtificialAnalysisMath 79.1 (11 / 17) · ArtificialAnalysisReasoning 83.0 (10 / 33) · BFCL 89.5 (8 / 14) · BlendedCost 85.3 (17 / 33) · ContextWindow 99.2 (18 / 33) · CopilotArenaOrLMArenaCode 42.8 (30 / 35) · GDPval 71.1 (15 / 35) · GPQA_HLE_Reasoning 83.0 (10 / 33) · IFBench 100.0 (2 / 33) · LMArenaCreativeOrOpenEnded 45.8 (28 / 35) · LMArenaDocument 12.1 (27 / 33) · LMArenaSearch 42.9 (15 / 20) · LMArenaText 45.8 (28 / 35) · LongContextRecall 55.7 (24 / 33) · MMLUPro 47.7 (19 / 26) · OutputSpeed 84.8 (12 / 32) · SWEAtlasComposite 50.0 (18 / 35) · SWEComposite 46.0 (29 / 35) · SWERebench 40.1 (29 / 34) · SciCode 55.5 (15 / 33) · SonarComposite 50.0 (27 / 35) · TTFT 73.4 (23 / 32) · Tau2Bench 100.0 (2 / 33) · TerminalBench 27.2 (30 / 35) · TerminalBenchHard 52.1 (21 / 33)
sources: artificial_analysis · lmarena · openrouter · overrides
missing: BUILD/GSO · BUILD/MCPAtlas · PLAN/MCPAtlas · SWEAtlasComposite/SWEAtlasQnA · SWEAtlasComposite/SWEAtlasRefactoring · SWEAtlasComposite/SWEAtlasTestWriting · SWEComposite/SWEBenchMultilingual · SWEComposite/SWEBenchPro · SWEComposite/SWEBenchVerified · SonarComposite/SonarBugDensity · SonarComposite/SonarFunctionalSkill · SonarComposite/SonarIssueDensity · SonarComposite/SonarVulnerabilityDensity
kimi-k2.5 · moonshot · idea 52.8 · plan 61.7 · build 55.2 · review 56.0
kimi-k2.5

group breakdown

BUILD 51.7 (24 / 35) · CRE 49.3 (27 / 35) · GEN 53.7 (23 / 35) · LM_ARENA_REVIEW_PROXY 36.2 (25 / 35) · OPS_long 76.2 (22 / 35) · OPS_precision 83.2 (16 / 35) · OPS_review 81.3 (20 / 35) · PLAN 64.0 (19 / 35)

metrics

ARC_AGI_2 15.0 (16 / 28) · ArtificialAnalysisCoding 50.2 (22 / 33) · ArtificialAnalysisIntelligence 61.0 (19 / 33) · ArtificialAnalysisReasoning 66.6 (16 / 33) · BlendedCost 93.4 (8 / 33) · ContextWindow 76.4 (23 / 33) · CopilotArenaOrLMArenaCode 51.4 (23 / 35) · GDPval 68.6 (22 / 35) · GPQA_HLE_Reasoning 66.6 (16 / 33) · IFBench 73.5 (18 / 33) · LMArenaCreativeOrOpenEnded 49.3 (27 / 35) · LMArenaDocument 22.5 (25 / 33) · LMArenaText 49.3 (27 / 35) · LongContextRecall 60.8 (21 / 33) · MCPAtlas 25.6 (24 / 30) · MMLUPro 69.1 (11 / 26) · OutputSpeed 66.3 (30 / 32) · SWEAtlasComposite 17.7 (32 / 35) · SWEAtlasQnA 11.6 (16 / 21) · SWEAtlasRefactoring 21.5 (15 / 19) · SWEAtlasTestWriting 18.8 (19 / 21) · SWEBenchMultilingual 8.8 (24 / 29) · SWEBenchPro 87.5 (20 / 31) · SWEBenchVerified 85.0 (24 / 33) · SWEComposite 70.6 (24 / 35) · SWERebench 66.0 (24 / 34) · SciCode 65.2 (13 / 33) · SonarComposite 50.0 (24 / 35) · TTFT 94.7 (9 / 32) · Tau2Bench 95.3 (6 / 33) · TerminalBench 54.7 (26 / 35) · TerminalBenchHard 42.7 (25 / 33)
sources: arc_agi · artificial_analysis · lmarena · mcp_atlas · openrouter · overrides · sweatlas_qna · sweatlas_refactoring · sweatlas_test_writing · swebench · swerebench · terminal_bench
missing: BUILD/AALiveCodeBench · BUILD/BFCL · BUILD/GSO · GEN/ArtificialAnalysisMath · LM_ARENA_REVIEW_PROXY/LMArenaSearch · PLAN/BFCL · SonarComposite/SonarBugDensity · SonarComposite/SonarFunctionalSkill · SonarComposite/SonarIssueDensity · SonarComposite/SonarVulnerabilityDensity
minimax-m2.5 · minimax · idea 17.7 · plan 50.2 · build 53.8 · review 54.1
minimax-m2.5

group breakdown

BUILD 49.9 (27 / 35) · CRE 3.1 (33 / 35) · GEN 29.5 (31 / 35) · LM_ARENA_REVIEW_PROXY 38.3 (21 / 35) · OPS_long 86.5 (7 / 35) · OPS_precision 89.8 (4 / 35) · OPS_review 87.4 (7 / 35) · PLAN 58.4 (23 / 35)

metrics

ARC_AGI_2 5.2 (22 / 28) · ArtificialAnalysisCoding 42.4 (25 / 33) · ArtificialAnalysisIntelligence 42.0 (26 / 33) · ArtificialAnalysisReasoning 38.5 (27 / 33) · BlendedCost 100.0 (1 / 33) · ContextWindow 72.2 (25 / 33) · CopilotArenaOrLMArenaCode 41.7 (31 / 35) · GDPval 68.6 (21 / 35) · GPQA_HLE_Reasoning 38.5 (27 / 33) · IFBench 77.3 (15 / 33) · LMArenaCreativeOrOpenEnded 3.1 (33 / 35) · LMArenaDocument 26.6 (20 / 33) · LMArenaText 3.1 (33 / 35) · LongContextRecall 64.3 (19 / 33) · MCPAtlas 29.3 (20 / 30) · MMLUPro 68.0 (12 / 26) · OutputSpeed 84.6 (13 / 32) · SWEAtlasComposite 7.9 (35 / 35) · SWEAtlasQnA 3.4 (20 / 21) · SWEAtlasRefactoring 17.2 (16 / 19) · SWEAtlasTestWriting 0.0 (21 / 21) · SWEBenchMultilingual 26.5 (22 / 29) · SWEBenchPro 95.0 (6 / 31) · SWEBenchVerified 100.0 (2 / 33) · SWEComposite 75.9 (22 / 35) · SWERebench 62.6 (27 / 34) · SciCode 28.8 (27 / 33) · SonarComposite 50.0 (21 / 35) · TTFT 96.0 (5 / 32) · Tau2Bench 93.7 (11 / 33) · TerminalBench 52.8 (27 / 35) · TerminalBenchHard 42.7 (24 / 33)
sources: arc_agi · artificial_analysis · lmarena · openrouter · overrides · sweatlas_qna · sweatlas_refactoring · sweatlas_test_writing · swebench · swerebench · terminal_bench
missing: BUILD/AALiveCodeBench · BUILD/BFCL · BUILD/GSO · GEN/ArtificialAnalysisMath · LM_ARENA_REVIEW_PROXY/LMArenaSearch · PLAN/BFCL · SonarComposite/SonarBugDensity · SonarComposite/SonarFunctionalSkill · SonarComposite/SonarIssueDensity · SonarComposite/SonarVulnerabilityDensity
glm-4.7 · zai · idea 19.1 · plan 51.2 · build 50.9 · review 52.9
glm-4.7

group breakdown

BUILD 47.0 (29 / 35) · CRE 0.0 (35 / 35) · GEN 40.4 (29 / 35) · LM_ARENA_REVIEW_PROXY 50.0 (9 / 35) · OPS_long 87.5 (6 / 35) · OPS_precision 90.1 (3 / 35) · OPS_review 87.4 (8 / 35) · PLAN 52.7 (27 / 35)

metrics

AALiveCodeBench 93.6 (4 / 17) · ArtificialAnalysisCoding 38.6 (27 / 33) · ArtificialAnalysisIntelligence 42.8 (25 / 33) · ArtificialAnalysisMath 96.0 (4 / 17) · ArtificialAnalysisReasoning 53.4 (23 / 33) · BlendedCost 94.0 (6 / 33) · ContextWindow 72.0 (28 / 33) · CopilotArenaOrLMArenaCode 63.6 (17 / 35) · GDPval 33.5 (31 / 35) · GPQA_HLE_Reasoning 53.4 (23 / 33) · IFBench 67.3 (19 / 33) · LMArenaCreativeOrOpenEnded 0.0 (35 / 35) · LMArenaText 0.0 (35 / 35) · LongContextRecall 54.0 (25 / 33) · MCPAtlas 0.0 (30 / 30) · MMLUPro 54.1 (18 / 26) · OutputSpeed 86.4 (10 / 32) · SWEAtlasComposite 50.0 (21 / 35) · SWEBenchMultilingual 5.0 (27 / 29) · SWEBenchVerified 89.6 (23 / 33) · SWEComposite 60.5 (27 / 35) · SWERebench 70.6 (22 / 34) · SciCode 43.0 (20 / 33) · SonarBugDensity 51.5 (15 / 19) · SonarComposite 27.1 (32 / 35) · SonarFunctionalSkill 0.0 (19 / 19) · SonarIssueDensity 49.9 (6 / 19) · SonarVulnerabilityDensity 29.1 (15 / 19) · TTFT 98.7 (4 / 32) · Tau2Bench 95.3 (8 / 33) · TerminalBench 35.2 (29 / 35) · TerminalBenchHard 33.3 (27 / 33)
sources: artificial_analysis · lmarena · mcp_atlas · openrouter · overrides · sonar · swerebench · terminal_bench
missing: BUILD/BFCL · BUILD/GSO · GEN/ARC_AGI_2 · LM_ARENA_REVIEW_PROXY/LMArenaDocument · LM_ARENA_REVIEW_PROXY/LMArenaSearch · PLAN/BFCL · SWEAtlasComposite/SWEAtlasQnA · SWEAtlasComposite/SWEAtlasRefactoring · SWEAtlasComposite/SWEAtlasTestWriting · SWEComposite/SWEBenchPro
claude-sonnet-4.5 · anthropic · idea 58.7 · plan 43.9 · build 50.1 · review 41.3
claude-sonnet-4.5

group breakdown

BUILD 48.8 (28 / 35) · CRE 61.3 (24 / 35) · GEN 47.7 (27 / 35) · LM_ARENA_REVIEW_PROXY 19.7 (34 / 35) · OPS_long 78.9 (20 / 35) · OPS_precision 78.3 (20 / 35) · OPS_review 80.4 (21 / 35) · PLAN 36.3 (29 / 35)

metrics

AALiveCodeBench 28.0 (12 / 17) · ARC_AGI_2 3.7 (24 / 28) · ArtificialAnalysisCoding 46.7 (24 / 33) · ArtificialAnalysisIntelligence 46.3 (24 / 33) · ArtificialAnalysisMath 80.5 (9 / 17) · ArtificialAnalysisReasoning 31.8 (28 / 33) · BFCL 0.0 (14 / 14) · BlendedCost 66.6 (27 / 33) · ContextWindow 99.2 (14 / 33) · CopilotArenaOrLMArenaCode 43.2 (29 / 35) · GDPval 92.3 (3 / 35) · GPQA_HLE_Reasoning 31.8 (28 / 33) · GSO 27.3 (11 / 17) · IFBench 38.9 (24 / 33) · LMArenaCreativeOrOpenEnded 61.3 (24 / 35) · LMArenaDocument 36.2 (18 / 33) · LMArenaSearch 3.2 (19 / 20) · LMArenaText 61.3 (24 / 35) · LongContextRecall 62.5 (20 / 33) · MCPAtlas 3.2 (29 / 30) · MMLUPro 75.8 (8 / 26) · OutputSpeed 76.0 (24 / 32) · SWEAtlasComposite 50.0 (10 / 35) · SWEBenchMultilingual 3.5 (28 / 29) · SWEBenchPro 71.3 (23 / 31) · SWEBenchVerified 84.4 (25 / 33) · SWEComposite 67.8 (26 / 35) · SWERebench 74.7 (16 / 34) · SciCode 40.7 (22 / 33) · SonarBugDensity 3.1 (18 / 19) · SonarComposite 16.3 (34 / 35) · SonarFunctionalSkill 18.6 (17 / 19) · SonarIssueDensity 29.8 (10 / 19) · SonarVulnerabilityDensity 5.1 (18 / 19) · TTFT 77.9 (18 / 32) · Tau2Bench 47.4 (24 / 33) · TerminalBench 48.6 (28 / 35) · TerminalBenchHard 45.1 (22 / 33)
sources: arc_agi · artificial_analysis · bfcl · gso · lmarena · mcp_atlas · openrouter · overrides · sonar · swebench · swebench_pro · swerebench · terminal_bench
missing: SWEAtlasComposite/SWEAtlasQnA · SWEAtlasComposite/SWEAtlasRefactoring · SWEAtlasComposite/SWEAtlasTestWriting
kimi-k2-0905 · moonshot · idea 15.0 · plan 21.1 · build 49.8 · review 41.0
kimi-k2-0905

group breakdown

BUILD 51.0 (25 / 35) · CRE 16.1 (32 / 35) · GEN 6.4 (35 / 35) · LM_ARENA_REVIEW_PROXY 45.9 (13 / 35) · OPS_long 38.7 (35 / 35) · OPS_precision 61.3 (29 / 35) · OPS_review 59.8 (32 / 35) · PLAN 25.2 (32 / 35)

metrics

AALiveCodeBench 0.0 (17 / 17) · ArtificialAnalysisCoding 1.8 (31 / 33) · ArtificialAnalysisIntelligence 0.0 (32 / 33) · ArtificialAnalysisMath 12.4 (16 / 17) · ArtificialAnalysisReasoning 0.0 (32 / 33) · BlendedCost 89.6 (12 / 33) · ContextWindow 76.4 (22 / 33) · CopilotArenaOrLMArenaCode 86.9 (9 / 35) · GDPval 5.0 (34 / 35) · GPQA_HLE_Reasoning 0.0 (32 / 33) · IFBench 0.0 (32 / 33) · LMArenaCreativeOrOpenEnded 16.1 (32 / 35) · LMArenaDocument 41.8 (10 / 33) · LMArenaText 16.1 (32 / 35) · LongContextRecall 0.0 (32 / 33) · MCPAtlas 72.9 (7 / 30) · MMLUPro 11.9 (23 / 26) · OutputSpeed 0.0 (32 / 32) · SWEAtlasComposite 50.0 (15 / 35) · SWEBenchMultilingual 5.0 (25 / 29) · SWEBenchPro 92.5 (16 / 31) · SWEBenchVerified 77.2 (29 / 33) · SWEComposite 81.5 (21 / 35) · SWERebench 92.5 (6 / 34) · SciCode 0.0 (32 / 33) · SonarComposite 50.0 (23 / 35) · TTFT 91.2 (12 / 32) · Tau2Bench 34.9 (29 / 33) · TerminalBench 56.6 (25 / 35) · TerminalBenchHard 7.5 (31 / 33)
sources: artificial_analysis · lmarena · openrouter · overrides
missing: BUILD/BFCL · BUILD/GSO · GEN/ARC_AGI_2 · LM_ARENA_REVIEW_PROXY/LMArenaSearch · PLAN/BFCL · SWEAtlasComposite/SWEAtlasQnA · SWEAtlasComposite/SWEAtlasRefactoring · SWEAtlasComposite/SWEAtlasTestWriting · SonarComposite/SonarBugDensity · SonarComposite/SonarFunctionalSkill · SonarComposite/SonarIssueDensity · SonarComposite/SonarVulnerabilityDensity
claude-sonnet-4 · anthropic · idea 11.8 · plan 30.4 · build 47.0 · review 39.2
claude-sonnet-4

group breakdown

BUILD 45.5 (31 / 35) · CRE 0.0 (34 / 35) · GEN 18.2 (33 / 35) · LM_ARENA_REVIEW_PROXY 24.3 (32 / 35) · OPS_long 79.0 (19 / 35) · OPS_precision 78.2 (21 / 35) · OPS_review 80.4 (22 / 35) · PLAN 31.6 (30 / 35)

metrics

AALiveCodeBench 6.6 (16 / 17) · ARC_AGI_2 0.2 (27 / 28) · ArtificialAnalysisCoding 30.8 (28 / 33) · ArtificialAnalysisIntelligence 29.7 (28 / 33) · ArtificialAnalysisMath 50.1 (15 / 17) · ArtificialAnalysisReasoning 3.6 (31 / 33) · BFCL 92.5 (4 / 14) · BlendedCost 66.6 (26 / 33) · ContextWindow 99.2 (13 / 33) · CopilotArenaOrLMArenaCode 44.2 (27 / 35) · GDPval 89.9 (5 / 35) · GPQA_HLE_Reasoning 3.6 (31 / 33) · GSO 6.0 (16 / 17) · IFBench 32.0 (26 / 33) · LMArenaCreativeOrOpenEnded 0.0 (34 / 35) · LMArenaDocument 38.3 (17 / 33) · LMArenaSearch 10.2 (18 / 20) · LMArenaText 0.0 (34 / 35) · LiveCodeBench 0.0 (2 / 2) · LongContextRecall 57.4 (22 / 33) · MCPAtlas 10.2 (27 / 30) · MMLUPro 38.1 (20 / 26) · OutputSpeed 76.3 (22 / 32) · SWEAtlasComposite 50.0 (9 / 35) · SWEBenchMultilingual 10.4 (23 / 29) · SWEBenchPro 68.7 (25 / 31) · SWEBenchVerified 67.4 (30 / 33) · SWEComposite 57.0 (28 / 35) · SWERebench 54.5 (28 / 34) · SciCode 14.1 (30 / 33) · SonarBugDensity 0.0 (19 / 19) · SonarComposite 19.9 (33 / 35) · SonarFunctionalSkill 27.6 (16 / 19) · SonarIssueDensity 35.4 (8 / 19) · SonarVulnerabilityDensity 0.0 (19 / 19) · TTFT 77.5 (19 / 32) · Tau2Bench 11.3 (31 / 33) · TerminalBench 59.5 (23 / 35) · TerminalBenchHard 31.0 (28 / 33)
sources: arc_agi · artificial_analysis · gso · livecodebench · lmarena · openrouter · sonar · swebench · swebench_pro · swerebench
missing: SWEAtlasComposite/SWEAtlasQnA · SWEAtlasComposite/SWEAtlasRefactoring · SWEAtlasComposite/SWEAtlasTestWriting
grok-4-latest · xai · idea 67.2 · plan 47.8 · build 45.7 · review 43.3
grok-4-latest

group breakdown

BUILD 46.3 (30 / 35) · CRE 74.2 (16 / 35) · GEN 54.2 (22 / 35) · LM_ARENA_REVIEW_PROXY 34.6 (26 / 35) · OPS_long 61.8 (31 / 35) · OPS_precision 39.9 (35 / 35) · OPS_review 42.6 (35 / 35) · PLAN 44.7 (28 / 35)

metrics

AALiveCodeBench 66.3 (9 / 17) · ARC_AGI_2 20.9 (12 / 28) · ArtificialAnalysisCoding 53.4 (21 / 33) · ArtificialAnalysisIntelligence 40.5 (27 / 33) · ArtificialAnalysisMath 90.9 (7 / 17) · ArtificialAnalysisReasoning 54.6 (21 / 33) · BFCL 77.7 (10 / 14) · BlendedCost 0.9 (32 / 33) · CopilotArenaOrLMArenaCode 46.8 (26 / 35) · GDPval 9.3 (33 / 35) · GPQA_HLE_Reasoning 54.6 (21 / 33) · IFBench 29.3 (27 / 33) · LMArenaCreativeOrOpenEnded 74.2 (16 / 35) · LMArenaDocument 5.4 (31 / 33) · LMArenaSearch 63.8 (7 / 20) · LMArenaText 74.2 (16 / 35) · LongContextRecall 74.6 (15 / 33) · MMLUPro 65.5 (14 / 26) · OutputSpeed 87.6 (9 / 32) · SWEAtlasComposite 50.0 (17 / 35) · SWEComposite 45.3 (31 / 35) · SWERebench 38.3 (30 / 34) · SciCode 46.4 (19 / 33) · SonarComposite 50.0 (26 / 35) · TTFT 21.2 (30 / 32) · Tau2Bench 38.8 (27 / 33) · TerminalBench 15.2 (32 / 35) · TerminalBenchHard 52.1 (20 / 33)
sources: arc_agi · artificial_analysis · bfcl · lmarena · overrides · swerebench · terminal_bench
missing: BUILD/GSO · BUILD/MCPAtlas · OPS_long/ContextWindow · OPS_precision/ContextWindow · OPS_review/ContextWindow · PLAN/MCPAtlas · SWEAtlasComposite/SWEAtlasQnA · SWEAtlasComposite/SWEAtlasRefactoring · SWEAtlasComposite/SWEAtlasTestWriting · SWEComposite/SWEBenchMultilingual · SWEComposite/SWEBenchPro · SWEComposite/SWEBenchVerified · SonarComposite/SonarBugDensity · SonarComposite/SonarFunctionalSkill · SonarComposite/SonarIssueDensity · SonarComposite/SonarVulnerabilityDensity
gemini-2.5-pro · google · idea 68.2 · plan 37.6 · build 40.2 · review 32.5
gemini-2.5-pro

group breakdown

BUILD 37.7 (32 / 35) · CRE 79.4 (12 / 35) · GEN 40.9 (28 / 35) · LM_ARENA_REVIEW_PROXY 6.2 (35 / 35) · OPS_long 83.7 (12 / 35) · OPS_precision 76.9 (22 / 35) · OPS_review 81.4 (19 / 35) · PLAN 29.7 (31 / 35)

metrics

AALiveCodeBench 59.7 (10 / 17) · ARC_AGI_2 3.7 (25 / 28) · ArtificialAnalysisCoding 23.4 (29 / 33) · ArtificialAnalysisIntelligence 13.8 (29 / 33) · ArtificialAnalysisMath 79.8 (10 / 17) · ArtificialAnalysisReasoning 41.8 (26 / 33) · BFCL 92.5 (6 / 14) · BlendedCost 76.1 (19 / 33) · ContextWindow 100.0 (5 / 33) · CopilotArenaOrLMArenaCode 0.0 (34 / 35) · GDPval 34.8 (29 / 35) · GPQA_HLE_Reasoning 41.8 (26 / 33) · GSO 0.0 (17 / 17) · IFBench 16.0 (30 / 33) · LMArenaCreativeOrOpenEnded 79.4 (12 / 35) · LMArenaDocument 12.5 (26 / 33) · LMArenaSearch 0.0 (20 / 20) · LMArenaText 79.4 (12 / 35) · LongContextRecall 64.3 (18 / 33) · MCPAtlas 52.2 (14 / 30) · MMLUPro 61.0 (16 / 26) · OutputSpeed 90.9 (6 / 32) · SWEAtlasComposite 50.0 (13 / 35) · SWEBenchMultilingual 36.0 (19 / 29) · SWEBenchPro 67.3 (27 / 31) · SWEBenchVerified 33.5 (32 / 33) · SWEComposite 32.4 (33 / 35) · SWERebench 0.5 (33 / 34) · SciCode 30.0 (26 / 33) · SonarBugDensity 52.6 (13 / 19) · SonarComposite 54.2 (13 / 35) · SonarFunctionalSkill 78.7 (6 / 19) · SonarIssueDensity 13.7 (12 / 19) · SonarVulnerabilityDensity 58.4 (11 / 19) · TTFT 55.5 (27 / 32) · Tau2Bench 0.0 (33 / 33) · TerminalBench 2.0 (33 / 35) · TerminalBenchHard 16.9 (29 / 33)
sources: arc_agi · artificial_analysis · gso · lmarena · openrouter · swebench · swerebench · terminal_bench
missing: SWEAtlasComposite/SWEAtlasQnA · SWEAtlasComposite/SWEAtlasRefactoring · SWEAtlasComposite/SWEAtlasTestWriting
grok-code-fast-1 · xai · idea 34.7 · plan 17.3 · build 32.2 · review 25.8
grok-code-fast-1

group breakdown

BUILD 32.2 (33 / 35) · CRE 41.9 (29 / 35) · GEN 12.3 (34 / 35) · LM_ARENA_REVIEW_PROXY 21.7 (33 / 35) · OPS_long 63.6 (30 / 35) · OPS_precision 47.9 (34 / 35) · OPS_review 51.3 (33 / 35) · PLAN 16.2 (35 / 35)

metrics

AALiveCodeBench 7.3 (15 / 17) · ARC_AGI_2 25.3 (11 / 28) · ArtificialAnalysisCoding 0.0 (33 / 33) · ArtificialAnalysisIntelligence 0.0 (33 / 33) · ArtificialAnalysisMath 0.0 (17 / 17) · ArtificialAnalysisReasoning 0.0 (33 / 33) · BFCL 89.5 (9 / 14) · CopilotArenaOrLMArenaCode 0.0 (35 / 35) · GDPval 5.0 (35 / 35) · GPQA_HLE_Reasoning 0.0 (33 / 33) · IFBench 0.0 (33 / 33) · LMArenaCreativeOrOpenEnded 41.9 (29 / 35) · LMArenaDocument 12.1 (28 / 33) · LMArenaSearch 31.3 (17 / 20) · LMArenaText 41.9 (29 / 35) · LongContextRecall 0.0 (33 / 33) · MMLUPro 0.0 (26 / 26) · OutputSpeed 80.9 (15 / 32) · SWEAtlasComposite 50.0 (19 / 35) · SWEBenchVerified 81.5 (26 / 33) · SWEComposite 45.5 (30 / 35) · SWERebench 27.0 (32 / 34) · SciCode 0.0 (33 / 33) · SonarComposite 50.0 (28 / 35) · TTFT 18.1 (31 / 32) · Tau2Bench 41.1 (26 / 33) · TerminalBench 0.0 (35 / 35) · TerminalBenchHard 0.0 (33 / 33)
sources: artificial_analysis · lmarena · overrides · swerebench · terminal_bench
missing: BUILD/GSO · BUILD/MCPAtlas · OPS_long/BlendedCost · OPS_long/ContextWindow · OPS_precision/BlendedCost · OPS_precision/ContextWindow · OPS_review/BlendedCost · OPS_review/ContextWindow · PLAN/MCPAtlas · SWEAtlasComposite/SWEAtlasQnA · SWEAtlasComposite/SWEAtlasRefactoring · SWEAtlasComposite/SWEAtlasTestWriting · SWEComposite/SWEBenchMultilingual · SWEComposite/SWEBenchPro · SonarComposite/SonarBugDensity · SonarComposite/SonarFunctionalSkill · SonarComposite/SonarIssueDensity · SonarComposite/SonarVulnerabilityDensity
gemini-2.5-flash · google · idea 36.9 · plan 23.2 · build 29.0 · review 28.6
gemini-2.5-flash

group breakdown

BUILD 24.5 (34 / 35) · CRE 38.2 (30 / 35) · GEN 19.0 (32 / 35) · LM_ARENA_REVIEW_PROXY 34.4 (27 / 35) · OPS_long 93.6 (1 / 35) · OPS_precision 88.6 (7 / 35) · OPS_review 91.4 (2 / 35) · PLAN 16.5 (34 / 35)

metrics

AALiveCodeBench 27.7 (13 / 17) · ARC_AGI_2 0.7 (26 / 28) · ArtificialAnalysisCoding 0.0 (32 / 33) · ArtificialAnalysisIntelligence 0.3 (31 / 33) · ArtificialAnalysisMath 59.0 (14 / 17) · ArtificialAnalysisReasoning 13.5 (29 / 33) · BFCL 53.7 (11 / 14) · BlendedCost 92.3 (10 / 33) · ContextWindow 100.0 (4 / 33) · CopilotArenaOrLMArenaCode 60.6 (22 / 35) · GDPval 36.9 (27 / 35) · GPQA_HLE_Reasoning 13.5 (29 / 33) · GSO 19.4 (12 / 17) · IFBench 25.6 (29 / 33) · LMArenaCreativeOrOpenEnded 38.2 (30 / 35) · LMArenaDocument 9.8 (29 / 33) · LMArenaSearch 59.0 (10 / 20) · LMArenaText 38.2 (30 / 35) · LiveCodeBench 100.0 (1 / 2) · LongContextRecall 55.7 (23 / 33) · MCPAtlas 19.9 (25 / 30) · MMLUPro 38.1 (21 / 26) · OutputSpeed 100.0 (1 / 32) · SWEAtlasComposite 18.3 (30 / 35) · SWEAtlasQnA 7.5 (18 / 21) · SWEAtlasRefactoring 7.5 (17 / 19) · SWEAtlasTestWriting 43.6 (10 / 21) · SWEBenchMultilingual 92.5 (9 / 29) · SWEBenchPro 46.2 (28 / 31) · SWEBenchVerified 0.0 (33 / 33) · SWEComposite 25.4 (34 / 35) · SWERebench 0.0 (34 / 34) · SciCode 16.9 (29 / 33) · SonarComposite 50.0 (18 / 35) · TTFT 71.9 (26 / 32) · Tau2Bench 0.0 (32 / 33) · TerminalBench 0.0 (34 / 35) · TerminalBenchHard 0.0 (32 / 33)
sources: arc_agi · artificial_analysis · bfcl · livecodebench · lmarena · openrouter · swebench · swerebench · terminal_bench
missing: SonarComposite/SonarBugDensity · SonarComposite/SonarFunctionalSkill · SonarComposite/SonarIssueDensity · SonarComposite/SonarVulnerabilityDensity
glm-4.6 · zai · idea 55.7 · plan 29.2 · build 28.9 · review 32.1
glm-4.6

group breakdown

BUILD 24.4 (35 / 35) · CRE 65.0 (21 / 35) · GEN 30.4 (30 / 35) · LM_ARENA_REVIEW_PROXY 50.0 (8 / 35) · OPS_long 78.7 (21 / 35) · OPS_precision 84.0 (15 / 35) · OPS_review 81.9 (17 / 35) · PLAN 20.4 (33 / 35)

metrics

AALiveCodeBench 21.1 (14 / 17) · ArtificialAnalysisCoding 14.5 (30 / 33) · ArtificialAnalysisIntelligence 5.7 (30 / 33) · ArtificialAnalysisMath 76.0 (12 / 17) · ArtificialAnalysisReasoning 12.0 (30 / 33) · BFCL 100.0 (1 / 14) · BlendedCost 93.7 (7 / 33) · ContextWindow 72.0 (27 / 33) · CopilotArenaOrLMArenaCode 31.5 (33 / 35) · GDPval 14.9 (32 / 35) · GPQA_HLE_Reasoning 12.0 (30 / 33) · IFBench 1.8 (31 / 33) · LMArenaCreativeOrOpenEnded 65.0 (21 / 35) · LMArenaText 65.0 (21 / 35) · LongContextRecall 4.1 (31 / 33) · MCPAtlas 7.5 (28 / 30) · MMLUPro 23.3 (22 / 26) · OutputSpeed 72.3 (29 / 32) · SWEAtlasComposite 50.0 (20 / 35) · SWEBenchMultilingual 5.0 (26 / 29) · SWEBenchPro 0.0 (31 / 31) · SWEBenchVerified 38.9 (31 / 33) · SWEComposite 21.4 (35 / 35) · SWERebench 37.6 (31 / 34) · SciCode 5.0 (31 / 33) · SonarBugDensity 7.5 (17 / 19) · SonarComposite 10.8 (35 / 35) · SonarFunctionalSkill 7.5 (18 / 19) · SonarIssueDensity 7.5 (16 / 19) · SonarVulnerabilityDensity 29.3 (14 / 19) · TTFT 93.7 (10 / 32) · Tau2Bench 27.0 (30 / 33) · TerminalBench 17.9 (31 / 35) · TerminalBenchHard 12.2 (30 / 33)
sources: artificial_analysis · bfcl · lmarena · openrouter · overrides · swebench · swebench_pro · swerebench · terminal_bench
missing: BUILD/GSO · GEN/ARC_AGI_2 · LM_ARENA_REVIEW_PROXY/LMArenaDocument · LM_ARENA_REVIEW_PROXY/LMArenaSearch · SWEAtlasComposite/SWEAtlasQnA · SWEAtlasComposite/SWEAtlasRefactoring · SWEAtlasComposite/SWEAtlasTestWriting