$ipbr

Live LLM coding scoreboard.

Models drift. Agents battle. Math decides.

live · refreshed · 14 sources · 32 models

IPBR
  • gemini-3.1-pro-preview 89.9
  • claude-opus-4.7 89.8
  • claude-opus-4.6 85.3

leaders now

[ idea ]
  1. gemini-3.1-pro-preview 98.2 (up 0.3 since last refresh)
  2. claude-opus-4.7 94.9 (down 1.2 since last refresh)
  3. claude-opus-4.6 93.8 (down 0.1 since last refresh)
[ plan ]
  1. gemini-3.1-pro-preview 93.2
  2. claude-opus-4.7 87.8 (down 0.1 since last refresh)
  3. kimi-k2.6 85.9 (up 2.1 since last refresh)
[ build ]
  1. claude-opus-4.6 87.8 (up 0.6 since last refresh)
  2. claude-opus-4.7 87.0 (down 2.5 since last refresh)
  3. gpt-5.5 85.3 (up 6.9 since last refresh)
[ review ]
  1. claude-opus-4.7 89.5 (down 0.9 since last refresh)
  2. kimi-k2.6 89.3 (up 2.7 since last refresh)
  3. gemini-3.1-pro-preview 86.3 (up 0.6 since last refresh)

how scoring works

Each model gets four role scores derived from public benchmarks. Idea measures open-ended creativity. Plan measures structured reasoning, function calling, and multi-step decomposition. Build measures implementation skill on SWE-bench, LiveCodeBench, and terminal tasks. Review measures preference judgment.

scoring

Each role score is the benchmark composite for that role, normalized to 0-100 and combined via weighted average of group scores. See the about page for the full math.
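As a rough illustration of the composite described above, the sketch below min-max normalizes raw benchmark values to 0-100 and then takes a weighted average of group scores. The normalization method, the function names, and the weights are assumptions made for the example; this page only states that scores are normalized to 0-100 and combined by weighted average, so the output is not expected to reproduce any published number.

```python
from typing import Dict, Mapping


def min_max_normalize(raw: Mapping[str, float]) -> Dict[str, float]:
    """Rescale raw per-model values to 0-100 (assumed method, not confirmed by the page)."""
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:
        # No spread across models: fall back to the neutral midpoint.
        return {model: 50.0 for model in raw}
    return {model: 100.0 * (value - lo) / (hi - lo) for model, value in raw.items()}


def role_score(group_scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted average over the groups that are present and carry a positive weight."""
    used = {g: w for g, w in weights.items() if g in group_scores and w > 0}
    total = sum(used.values())
    if total == 0:
        raise ValueError("no weighted groups available")
    return sum(group_scores[g] * w for g, w in used.items()) / total


# Group scores are claude-opus-4.6's A_B and BUILD values from this page;
# the equal weights are hypothetical.
example = role_score({"A_B": 92.3, "BUILD": 89.6}, {"A_B": 1.0, "BUILD": 1.0})
```

Dividing by the sum of the weights actually used keeps the result on the 0-100 scale even when the weights do not sum to one.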

missing data

If a model is missing some metrics within a group, the group score depends on group coverage. Below 60% coverage, the score is shrunk toward 50. Between 60% and 80% coverage, it blends from the shrink-to-50 value to the mean of the present metrics. At 80% coverage and above, the present-weight mean is trusted directly.
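One way to read the coverage rule is sketched below. It assumes that coverage is the share of metric weight that is present, that shrink-to-50 means pulling the present-weight mean toward 50 in proportion to the missing share, and that the blend between 60% and 80% is linear. None of these details are stated on this page, and `group_score` is a hypothetical name; the about page has the actual math.

```python
from typing import Mapping, Optional

NEUTRAL = 50.0            # value the score shrinks toward
LOW, HIGH = 0.60, 0.80    # coverage band over which the blend happens


def group_score(values: Mapping[str, Optional[float]],
                weights: Mapping[str, float]) -> float:
    """Blend a shrunk score and the present-weight mean by coverage (assumed form)."""
    total_weight = sum(weights.values())
    present = {k: v for k, v in values.items() if v is not None and k in weights}
    present_weight = sum(weights[k] for k in present)
    if total_weight <= 0 or present_weight <= 0:
        return NEUTRAL  # nothing to trust: fully shrunk

    coverage = present_weight / total_weight
    mean = sum(weights[k] * v for k, v in present.items()) / present_weight

    # Assumed shrinkage: move the mean toward 50 by the missing share.
    shrunk = NEUTRAL + coverage * (mean - NEUTRAL)

    # Linear ramp: 0 at or below 60% coverage, 1 at or above 80%.
    t = min(1.0, max(0.0, (coverage - LOW) / (HIGH - LOW)))
    return (1.0 - t) * shrunk + t * mean
```

Under these assumptions, a group whose present metrics average 90 would score about 70 at 50% coverage, about 84 at 70% coverage, and 90 at 80% coverage or more. A group with no present metrics scores 50, which is consistent with the 50.0 group scores shown for models whose metrics in a group are all listed as missing.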

Full math, role definitions, and source list →

claude-opus-4.6 · anthropic · idea 93.8 (down 0.1 since last refresh) · plan 82.1 (down 0.2 since last refresh) · build 87.8 (up 0.6 since last refresh) · review 77.7
claude-opus-4.6

group breakdown

A_B 92.3 (4/32) · A_I 89.5 (6/32) · A_P 80.1 (3/32) · A_R 97.7 (4/32) · BUILD 89.6 (2/32) · CRE 100.0 (1/32) · GEN 89.9 (3/32) · LM_ARENA_REVIEW_PROXY 32.5 (20/32) · OPS_long 75.6 (23/32) · OPS_precision 72.5 (23/32) · OPS_review 75.3 (24/32) · PLAN 79.4 (11/32)

metrics

AI_canary_health 59.1 (3/3) · AI_code 92.5 (4/28) · AI_complexity 89.5 (5/28) · AI_context_awareness 88.6 (3/28) · AI_correctness 100.0 (1/28) · AI_edge_cases 100.0 (1/28) · AI_efficiency 81.9 (5/28) · AI_hallucination_resistance 100.0 (1/28) · AI_memory_retention 42.0 (11/28) · AI_parameter_accuracy 88.5 (13/28) · AI_plan_coherence 45.3 (12/28) · AI_recovery 100.0 (1/28) · AI_refusal 100.0 (1/28) · AI_spec 100.0 (1/28) · AI_stability 84.1 (11/28) · AI_task_completion 70.3 (21/28) · AI_tool_selection 100.0 (1/28) · ARC_AGI_2 91.8 (4/25) · ArtificialAnalysisCoding 79.1 (5/32) · ArtificialAnalysisIntelligence 84.3 (6/32) · ArtificialAnalysisReasoning 87.5 (6/32) · BlendedCost 49.6 (27/31) · ContextWindow 99.2 (10/30) · CopilotArenaOrLMArenaCode 100.0 (1/32) · GDPval 84.4 (7/32) · GPQA_HLE_Reasoning 87.5 (6/32) · GSO 75.3 (3/16) · IFBench 29.5 (27/32) · LMArenaCreativeOrOpenEnded 100.0 (1/32) · LMArenaSearchDocument 32.5 (18/30) · LMArenaText 100.0 (1/32) · LongContextRecall 88.0 (6/32) · MCPAtlas 93.5 (2/28) · OutputSpeed 75.3 (25/32) · SWEBenchMultilingual 91.9 (14/27) · SWEBenchPro 100.0 (1/29) · SWEBenchVerified 99.4 (3/31) · SWEComposite 95.7 (2/32) · SWERebench 91.6 (8/31) · SciCode 80.9 (6/32) · SonarBugDensity 72.0 (9/20) · SonarComposite 74.5 (5/32) · SonarFunctionalSkill 92.2 (4/20) · SonarIssueDensity 54.8 (7/20) · SonarVulnerabilityDensity 63.6 (9/20) · TTFT 71.8 (24/32) · Tau2Bench 87.5 (12/32) · TerminalBench 84.0 (12/32)
sources: aistupidlevel, arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swebench_pro, swerebench, terminal_bench
missing: none
claude-opus-4.7 · anthropic · idea 94.9 (down 1.2 since last refresh) · plan 87.8 (down 0.1 since last refresh) · build 87.0 (down 2.5 since last refresh) · review 89.5 (down 0.9 since last refresh)
claude-opus-4.7

group breakdown

A_B 81.6 (8/32) · A_I 90.7 (5/32) · A_P 82.0 (2/32) · A_R 93.6 (6/32) · BUILD 90.2 (1/32) · CRE 97.7 (4/32) · GEN 98.0 (2/32) · LM_ARENA_REVIEW_PROXY 100.0 (1/32) · OPS_long 74.7 (24/32) · OPS_precision 70.6 (24/32) · OPS_review 74.0 (25/32) · PLAN 85.8 (5/32)

metrics

AI_code 59.8 (8/28) · AI_complexity 50.2 (8/28) · AI_context_awareness 76.2 (4/28) · AI_correctness 100.0 (2/28) · AI_edge_cases 82.7 (8/28) · AI_efficiency 77.2 (7/28) · AI_hallucination_resistance 100.0 (2/28) · AI_memory_retention 67.3 (7/28) · AI_parameter_accuracy 87.9 (14/28) · AI_plan_coherence 90.8 (6/28) · AI_recovery 100.0 (2/28) · AI_refusal 100.0 (2/28) · AI_spec 100.0 (2/28) · AI_stability 99.1 (6/28) · AI_task_completion 4.3 (24/28) · AI_tool_selection 83.0 (6/28) · ARC_AGI_2 93.5 (3/25) · ArtificialAnalysisCoding 94.3 (3/32) · ArtificialAnalysisIntelligence 100.0 (1/32) · ArtificialAnalysisReasoning 97.4 (3/32) · BlendedCost 49.6 (28/31) · ContextWindow 99.2 (11/30) · CopilotArenaOrLMArenaCode 100.0 (2/32) · GDPval 95.0 (1/32) · GPQA_HLE_Reasoning 97.4 (3/32) · GSO 100.0 (1/16) · IFBench 44.7 (20/32) · LMArenaCreativeOrOpenEnded 97.7 (4/32) · LMArenaSearchDocument 100.0 (1/30) · LMArenaText 97.7 (4/32) · LongContextRecall 86.4 (9/32) · MCPAtlas 100.0 (1/28) · OutputSpeed 75.7 (24/32) · SWEBenchMultilingual 95.0 (3/27) · SWEBenchPro 95.0 (2/29) · SWEBenchVerified 95.0 (4/31) · SWEComposite 91.1 (6/32) · SWERebench 85.3 (10/31) · SciCode 95.2 (3/32) · SonarBugDensity 65.5 (12/20) · SonarComposite 56.3 (15/32) · SonarFunctionalSkill 93.9 (2/20) · SonarIssueDensity 8.1 (17/20) · SonarVulnerabilityDensity 24.2 (17/20) · TTFT 66.1 (26/32) · Tau2Bench 79.6 (16/32) · TerminalBench 100.0 (1/32)
sources: aistupidlevel, arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, terminal_bench
missing: none
gpt-5.5 · openai · idea 68.6 (up 6.0 since last refresh) · plan 83.2 (up 3.3 since last refresh) · build 85.3 (up 6.9 since last refresh) · review 79.2 (up 5.8 since last refresh)
gpt-5.5

group breakdown

A_B 87.3 (7/32) · A_I 83.4 (8/32) · A_P 61.1 (13/32) · A_R 97.1 (5/32) · BUILD 85.7 (3/32) · CRE 53.6 (22/32) · GEN 87.6 (4/32) · LM_ARENA_REVIEW_PROXY 28.6 (21/32) · OPS_long 78.4 (21/32) · OPS_precision 74.7 (22/32) · OPS_review 76.6 (23/32) · PLAN 88.9 (3/32)

metrics

AI_code 72.8 (7/28) · AI_complexity 54.3 (7/28) · AI_context_awareness 0.0 (24/28) · AI_correctness 100.0 (5/28) · AI_edge_cases 100.0 (4/28) · AI_efficiency 78.9 (6/28) · AI_hallucination_resistance 100.0 (9/28) · AI_memory_retention 93.7 (6/28) · AI_parameter_accuracy 17.8 (24/28) · AI_plan_coherence 0.0 (28/28) · AI_recovery 100.0 (6/28) · AI_refusal 100.0 (9/28) · AI_spec 100.0 (9/28) · AI_stability 97.7 (7/28) · AI_task_completion 79.7 (7/28) · AI_tool_selection 30.6 (24/28) · ARC_AGI_2 97.7 (2/25) · ArtificialAnalysisCoding 100.0 (2/32) · ArtificialAnalysisIntelligence 98.9 (3/32) · ArtificialAnalysisReasoning 100.0 (2/32) · BlendedCost 38.1 (30/31) · ContextWindow 100.0 (2/30) · CopilotArenaOrLMArenaCode 66.1 (14/32) · GDPval 95.0 (2/32) · GPQA_HLE_Reasoning 100.0 (2/32) · GSO 94.0 (2/16) · IFBench 78.8 (12/32) · LMArenaCreativeOrOpenEnded 53.6 (22/32) · LMArenaSearchDocument 28.6 (19/30) · LMArenaText 53.6 (22/32) · LongContextRecall 96.5 (4/32) · MCPAtlas 59.7 (12/28) · OutputSpeed 78.8 (18/32) · SWEBenchPro 95.0 (10/29) · SWEBenchVerified 95.0 (10/31) · SWEComposite 89.9 (8/32) · SWERebench 83.5 (12/31) · SciCode 89.7 (5/32) · SonarBugDensity 96.2 (2/20) · SonarComposite 67.0 (6/32) · SonarFunctionalSkill 46.5 (15/20) · SonarIssueDensity 59.8 (5/20) · SonarVulnerabilityDensity 94.8 (2/20) · TTFT 81.2 (16/32) · Tau2Bench 86.8 (13/32) · TerminalBench 100.0 (2/32)
sources: aistupidlevel, arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, terminal_bench
missing: SWEComposite/SWEBenchMultilingual
kimi-k2.6 · moonshot · idea 82.4 (up 3.2 since last refresh) · plan 85.9 (up 2.1 since last refresh) · build 83.3 (up 2.7 since last refresh) · review 89.3 (up 2.7 since last refresh)
kimi-k2.6

group breakdown

A_B 68.0 (11/32) · A_I 80.1 (9/32) · A_P 69.5 (6/32) · A_R 89.4 (8/32) · BUILD 85.3 (4/32) · CRE 80.8 (8/32) · GEN 86.2 (5/32) · LM_ARENA_REVIEW_PROXY 95.6 (2/32) · OPS_long 85.8 (8/32) · OPS_precision 87.9 (9/32) · OPS_review 86.2 (11/32) · PLAN 90.8 (2/32)

metrics

AI_code 45.0 (15/28) · AI_complexity 44.7 (20/28) · AI_context_awareness 0.0 (22/28) · AI_correctness 94.1 (11/28) · AI_edge_cases 82.6 (11/28) · AI_efficiency 14.2 (27/28) · AI_hallucination_resistance 100.0 (6/28) · AI_memory_retention 0.0 (28/28) · AI_parameter_accuracy 89.2 (8/28) · AI_plan_coherence 75.0 (7/28) · AI_recovery 97.8 (11/28) · AI_refusal 100.0 (8/28) · AI_spec 100.0 (8/28) · AI_stability 85.6 (10/28) · AI_task_completion 75.5 (11/28) · AI_tool_selection 74.2 (11/28) · ArtificialAnalysisCoding 75.7 (8/32) · ArtificialAnalysisIntelligence 88.2 (4/32) · ArtificialAnalysisReasoning 89.0 (5/32) · BlendedCost 88.1 (14/31) · ContextWindow 78.1 (19/30) · CopilotArenaOrLMArenaCode 93.3 (5/32) · GDPval 69.3 (13/32) · GPQA_HLE_Reasoning 89.0 (5/32) · IFBench 92.7 (7/32) · LMArenaCreativeOrOpenEnded 80.8 (8/32) · LMArenaSearchDocument 95.6 (2/30) · LMArenaText 80.8 (8/32) · LongContextRecall 83.0 (10/32) · MCPAtlas 81.7 (8/28) · OutputSpeed 83.8 (14/32) · SWEBenchMultilingual 95.0 (7/27) · SWEBenchPro 95.0 (8/29) · SWEBenchVerified 95.0 (8/31) · SWEComposite 94.0 (4/32) · SWERebench 92.5 (7/31) · SciCode 89.7 (4/32) · SonarComposite 50.0 (22/32) · TTFT 95.6 (7/32) · Tau2Bench 96.0 (6/32) · TerminalBench 95.0 (4/32)
sources: aistupidlevel, artificial_analysis, lmarena, openrouter, overrides
missing: BUILD/GSO, GEN/ARC_AGI_2, SonarComposite/SonarBugDensity, SonarComposite/SonarFunctionalSkill, SonarComposite/SonarIssueDensity, SonarComposite/SonarVulnerabilityDensity
gemini-3.1-pro-preview · google · idea 98.2 (up 0.3 since last refresh) · plan 93.2 · build 81.8 · review 86.3 (up 0.6 since last refresh)
gemini-3.1-pro-preview

group breakdown

A_B 87.8 (6/32) · A_I 97.9 (2/32) · A_P 98.6 (1/32) · A_R 86.1 (11/32) · BUILD 80.9 (7/32) · CRE 100.0 (2/32) · GEN 100.0 (1/32) · LM_ARENA_REVIEW_PROXY 92.2 (4/32) · OPS_long 81.0 (17/32) · OPS_precision 70.6 (25/32) · OPS_review 77.1 (22/32) · PLAN 90.8 (1/32)

metrics

AI_code 80.0 (6/28) · AI_complexity 81.4 (6/28) · AI_context_awareness 99.8 (2/28) · AI_correctness 100.0 (4/28) · AI_edge_cases 100.0 (3/28) · AI_efficiency 96.9 (3/28) · AI_hallucination_resistance 30.2 (25/28) · AI_memory_retention 98.6 (5/28) · AI_parameter_accuracy 76.1 (20/28) · AI_plan_coherence 100.0 (1/28) · AI_recovery 100.0 (4/28) · AI_refusal 100.0 (6/28) · AI_spec 100.0 (6/28) · AI_stability 100.0 (2/28) · AI_task_completion 99.8 (2/28) · AI_tool_selection 97.0 (2/28) · ARC_AGI_2 100.0 (1/25) · ArtificialAnalysisCoding 100.0 (1/32) · ArtificialAnalysisIntelligence 100.0 (2/32) · ArtificialAnalysisReasoning 100.0 (1/32) · BlendedCost 76.1 (20/31) · ContextWindow 100.0 (7/30) · CopilotArenaOrLMArenaCode 69.4 (12/32) · GDPval 49.3 (23/32) · GPQA_HLE_Reasoning 100.0 (1/32) · GSO 51.3 (9/16) · IFBench 95.9 (4/32) · LMArenaCreativeOrOpenEnded 100.0 (2/32) · LMArenaSearchDocument 92.2 (4/30) · LMArenaText 100.0 (2/32) · LongContextRecall 98.1 (3/32) · MCPAtlas 58.4 (14/28) · OutputSpeed 93.2 (5/32) · SWEBenchMultilingual 36.0 (18/27) · SWEBenchPro 89.1 (18/29) · SWEBenchVerified 95.0 (7/31) · SWEComposite 89.0 (9/32) · SWERebench 100.0 (1/31) · SciCode 100.0 (2/32) · SonarBugDensity 65.0 (16/20) · SonarComposite 59.3 (14/32) · SonarFunctionalSkill 78.9 (10/20) · SonarIssueDensity 25.2 (15/20) · SonarVulnerabilityDensity 56.0 (14/20) · TTFT 35.5 (28/32) · Tau2Bench 95.4 (8/32) · TerminalBench 88.1 (10/32)
sources: aistupidlevel, arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench_pro, swerebench, terminal_bench
missing: none
deepseek-v4-pro · deepseek · idea 76.3 (up 7.5 since last refresh) · plan 79.9 (up 3.7 since last refresh) · build 80.6 (up 5.6 since last refresh) · review 85.8 (up 8.1 since last refresh)
deepseek-v4-pro

group breakdown

A_B 68.7 (10/32) · A_I 72.2 (15/32) · A_P 56.6 (16/32) · A_R 88.0 (10/32) · BUILD 82.2 (5/32) · CRE 77.0 (11/32) · GEN 79.5 (6/32) · LM_ARENA_REVIEW_PROXY 88.8 (7/32) · OPS_long 69.8 (26/32) · OPS_precision 82.3 (16/32) · OPS_review 82.7 (15/32) · PLAN 87.4 (4/32)

metrics

AI_code 45.0 (14/28) · AI_complexity 44.7 (18/28) · AI_context_awareness 0.0 (20/28) · AI_correctness 94.1 (10/28) · AI_edge_cases 82.6 (10/28) · AI_efficiency 36.3 (22/28) · AI_hallucination_resistance 100.0 (5/28) · AI_memory_retention 0.0 (26/28) · AI_parameter_accuracy 91.0 (7/28) · AI_plan_coherence 0.0 (27/28) · AI_recovery 97.8 (10/28) · AI_refusal 100.0 (5/28) · AI_spec 100.0 (5/28) · AI_stability 72.3 (18/28) · AI_task_completion 75.5 (9/28) · AI_tool_selection 84.8 (5/28) · ArtificialAnalysisCoding 77.0 (7/32) · ArtificialAnalysisIntelligence 78.9 (8/32) · ArtificialAnalysisReasoning 84.1 (7/32) · BlendedCost 98.1 (4/31) · ContextWindow 100.0 (3/30) · CopilotArenaOrLMArenaCode 72.5 (11/32) · GDPval 68.2 (16/32) · GPQA_HLE_Reasoning 84.1 (7/32) · IFBench 94.0 (5/32) · LMArenaCreativeOrOpenEnded 77.0 (11/32) · LMArenaSearchDocument 88.8 (7/30) · LMArenaText 77.0 (11/32) · LongContextRecall 66.2 (14/32) · MCPAtlas 81.7 (6/28) · OutputSpeed 46.7 (31/32) · SWEBenchMultilingual 95.0 (5/27) · SWEBenchPro 95.0 (4/29) · SWEBenchVerified 95.0 (6/31) · SWEComposite 94.0 (3/32) · SWERebench 92.5 (5/31) · SciCode 70.4 (9/32) · SonarComposite 50.0 (17/32) · TTFT 96.2 (6/32) · Tau2Bench 96.7 (4/32) · TerminalBench 90.0 (9/32)
sources: aistupidlevel, artificial_analysis, lmarena, openrouter, overrides
missing: BUILD/GSO, GEN/ARC_AGI_2, SonarComposite/SonarBugDensity, SonarComposite/SonarFunctionalSkill, SonarComposite/SonarIssueDensity, SonarComposite/SonarVulnerabilityDensity
glm-5.1 · zai · idea 84.9 (up 2.8 since last refresh) · plan 79.3 (up 1.8 since last refresh) · build 79.8 (up 2.3 since last refresh) · review 83.5 (up 2.3 since last refresh)
glm-5.1

group breakdown

A_B 65.3 (15/32) · A_I 75.6 (14/32) · A_P 66.6 (12/32) · A_R 83.5 (15/32) · BUILD 82.0 (6/32) · CRE 90.9 (5/32) · GEN 78.9 (7/32) · LM_ARENA_REVIEW_PROXY 88.8 (11/32) · OPS_long 81.9 (15/32) · OPS_precision 85.9 (13/32) · OPS_review 83.4 (14/32) · PLAN 82.5 (8/32)

metrics

AI_code 45.8 (13/28) · AI_complexity 45.5 (16/28) · AI_context_awareness 7.5 (17/28) · AI_correctness 87.5 (16/28) · AI_edge_cases 77.7 (16/28) · AI_efficiency 19.6 (26/28) · AI_hallucination_resistance 92.5 (19/28) · AI_memory_retention 7.5 (23/28) · AI_parameter_accuracy 83.3 (19/28) · AI_plan_coherence 71.2 (11/28) · AI_recovery 90.7 (17/28) · AI_refusal 92.5 (20/28) · AI_spec 92.5 (20/28) · AI_stability 80.3 (16/28) · AI_task_completion 71.7 (20/28) · AI_tool_selection 70.5 (16/28) · ArtificialAnalysisCoding 62.9 (13/32) · ArtificialAnalysisIntelligence 78.5 (9/32) · ArtificialAnalysisReasoning 63.2 (15/32) · BlendedCost 83.8 (16/31) · ContextWindow 74.0 (22/30) · CopilotArenaOrLMArenaCode 98.0 (3/32) · GDPval 74.7 (10/32) · GPQA_HLE_Reasoning 63.2 (15/32) · IFBench 93.4 (6/32) · LMArenaCreativeOrOpenEnded 90.9 (5/32) · LMArenaSearchDocument 88.8 (11/30) · LMArenaText 90.9 (5/32) · LongContextRecall 46.0 (26/32) · MCPAtlas 87.3 (3/28) · OutputSpeed 77.2 (21/32) · SWEBenchMultilingual 92.5 (13/27) · SWEBenchPro 95.0 (14/29) · SWEBenchVerified 92.5 (16/31) · SWEComposite 96.4 (1/32) · SWERebench 100.0 (2/31) · SciCode 36.3 (21/32) · SonarComposite 50.0 (27/32) · TTFT 99.8 (3/32) · Tau2Bench 100.0 (3/32) · TerminalBench 92.5 (8/32)
sources: artificial_analysis, lmarena, mcp_atlas, openrouter, overrides, swerebench
missing: BUILD/GSO, GEN/ARC_AGI_2, SonarComposite/SonarBugDensity, SonarComposite/SonarFunctionalSkill, SonarComposite/SonarIssueDensity, SonarComposite/SonarVulnerabilityDensity
qwen3.6-plus · alibaba · idea 69.4 (up 2.8 since last refresh) · plan 74.2 (up 1.8 since last refresh) · build 77.7 (up 2.3 since last refresh) · review 83.1 (up 2.3 since last refresh)
qwen3.6-plus

group breakdown

A_B 65.3 (13/32) · A_I 75.6 (12/32) · A_P 66.6 (10/32) · A_R 83.5 (13/32) · BUILD 78.5 (9/32) · CRE 69.4 (15/32) · GEN 60.9 (15/32) · LM_ARENA_REVIEW_PROXY 88.8 (9/32) · OPS_long 85.2 (9/32) · OPS_precision 89.3 (6/32) · OPS_review 90.0 (4/32) · PLAN 83.0 (7/32)

metrics

AI_code 45.8 (11/28) · AI_complexity 45.5 (12/28) · AI_context_awareness 7.5 (13/28) · AI_correctness 87.5 (14/28) · AI_edge_cases 77.7 (14/28) · AI_efficiency 19.6 (24/28) · AI_hallucination_resistance 92.5 (17/28) · AI_memory_retention 7.5 (19/28) · AI_parameter_accuracy 83.3 (17/28) · AI_plan_coherence 71.2 (9/28) · AI_recovery 90.7 (15/28) · AI_refusal 92.5 (16/28) · AI_spec 92.5 (16/28) · AI_stability 80.3 (14/28) · AI_task_completion 71.7 (16/28) · AI_tool_selection 70.5 (14/28) · ARC_AGI_2 11.9 (16/25) · ArtificialAnalysisCoding 61.2 (14/32) · ArtificialAnalysisIntelligence 73.2 (10/32) · ArtificialAnalysisReasoning 61.3 (17/32) · BlendedCost 95.0 (5/31) · ContextWindow 99.2 (15/30) · CopilotArenaOrLMArenaCode 73.2 (10/32) · GDPval 73.3 (11/32) · GPQA_HLE_Reasoning 61.3 (17/32) · IFBench 90.4 (9/32) · LMArenaCreativeOrOpenEnded 69.4 (15/32) · LMArenaSearchDocument 88.8 (9/30) · LMArenaText 69.4 (15/32) · LongContextRecall 83.0 (11/32) · MCPAtlas 76.5 (9/28) · OutputSpeed 77.1 (22/32) · SWEBenchMultilingual 92.5 (10/27) · SWEBenchPro 95.0 (11/29) · SWEBenchVerified 95.0 (11/31) · SWEComposite 85.9 (13/32) · SWERebench 72.8 (18/31) · SciCode 19.3 (26/32) · SonarBugDensity 92.5 (3/20) · SonarComposite 80.1 (4/32) · SonarFunctionalSkill 66.8 (14/20) · SonarIssueDensity 92.5 (2/20) · SonarVulnerabilityDensity 78.3 (7/20) · TTFT 92.1 (11/32) · Tau2Bench 100.0 (1/32) · TerminalBench 87.0 (11/32)
sources: artificial_analysis, lmarena, openrouter, overrides
missing: BUILD/GSO
deepseek-v4-flash · deepseek · idea 66.9 (up 0.8 since last refresh) · plan 73.8 (up 0.1 since last refresh) · build 77.4 (up 1.1 since last refresh) · review 82.9 (up 0.7 since last refresh)
deepseek-v4-flash

group breakdown

A_B 73.0 (9/32) · A_I 76.5 (10/32) · A_P 59.0 (15/32) · A_R 89.1 (9/32) · BUILD 76.2 (12/32) · CRE 62.0 (18/32) · GEN 64.1 (13/32) · LM_ARENA_REVIEW_PROXY 88.8 (6/32) · OPS_long 89.1 (5/32) · OPS_precision 92.0 (1/32) · OPS_review 89.1 (5/32) · PLAN 82.1 (9/32)

metrics

AI_code 47.9 (9/28) · AI_complexity 44.7 (17/28) · AI_context_awareness 0.0 (19/28) · AI_correctness 94.1 (9/28) · AI_edge_cases 82.6 (9/28) · AI_efficiency 73.3 (8/28) · AI_hallucination_resistance 100.0 (4/28) · AI_memory_retention 0.0 (25/28) · AI_parameter_accuracy 91.6 (5/28) · AI_plan_coherence 0.0 (26/28) · AI_recovery 97.8 (9/28) · AI_refusal 100.0 (4/28) · AI_spec 100.0 (4/28) · AI_stability 80.4 (12/28) · AI_task_completion 75.5 (8/28) · AI_tool_selection 89.5 (4/28) · ArtificialAnalysisCoding 46.7 (20/32) · ArtificialAnalysisIntelligence 59.7 (19/32) · ArtificialAnalysisReasoning 77.3 (9/32) · BlendedCost 99.8 (2/31) · ContextWindow 70.6 (28/30) · CopilotArenaOrLMArenaCode 86.8 (6/32) · GDPval 68.2 (15/32) · GPQA_HLE_Reasoning 77.3 (9/32) · IFBench 100.0 (1/32) · LMArenaCreativeOrOpenEnded 62.0 (18/32) · LMArenaSearchDocument 88.8 (6/30) · LMArenaText 62.0 (18/32) · LongContextRecall 49.3 (24/32) · MCPAtlas 81.7 (5/28) · OutputSpeed 88.2 (7/32) · SWEBenchMultilingual 59.1 (15/27) · SWEBenchPro 95.0 (3/29) · SWEBenchVerified 95.0 (5/31) · SWEComposite 90.4 (7/32) · SWERebench 92.5 (4/31) · SciCode 42.4 (18/32) · SonarComposite 50.0 (16/32) · TTFT 100.0 (2/32) · Tau2Bench 94.1 (10/32) · TerminalBench 78.1 (14/32)
sources: aistupidlevel, artificial_analysis, lmarena, openrouter, overrides
missing: BUILD/GSO, GEN/ARC_AGI_2, SonarComposite/SonarBugDensity, SonarComposite/SonarFunctionalSkill, SonarComposite/SonarIssueDensity, SonarComposite/SonarVulnerabilityDensity
gpt-5.3-codex · openai · idea 68.2 (down 1.5 since last refresh) · plan 51.7 (down 0.4 since last refresh) · build 72.8 (down 1.9 since last refresh) · review 70.3 (down 1.4 since last refresh)
gpt-5.3-codex

group breakdown

A_B 63.2 (16/32) · A_I 68.6 (16/32) · A_P 59.5 (14/32) · A_R 80.5 (16/32) · BUILD 76.5 (11/32) · CRE 75.3 (12/32) · GEN 48.1 (21/32) · LM_ARENA_REVIEW_PROXY 92.5 (3/32) · OPS_long 83.5 (13/32) · OPS_precision 80.4 (18/32) · OPS_review 81.4 (19/32) · PLAN 46.7 (23/32)

metrics

AI_code 30.7 (21/28) · AI_complexity 17.0 (22/28) · AI_context_awareness 56.2 (5/28) · AI_correctness 98.1 (8/28) · AI_edge_cases 78.9 (12/28) · AI_efficiency 57.3 (9/28) · AI_hallucination_resistance 100.0 (7/28) · AI_memory_retention 14.0 (13/28) · AI_parameter_accuracy 98.1 (2/28) · AI_plan_coherence 3.3 (25/28) · AI_recovery 60.4 (20/28) · AI_refusal 85.6 (21/28) · AI_spec 85.6 (21/28) · AI_stability 96.1 (8/28) · AI_task_completion 55.3 (23/28) · AI_tool_selection 61.0 (23/28) · ARC_AGI_2 72.5 (8/25) · ArtificialAnalysisCoding 43.6 (22/32) · ArtificialAnalysisIntelligence 29.4 (26/32) · ArtificialAnalysisReasoning 33.5 (25/32) · BlendedCost 75.4 (21/31) · ContextWindow 84.7 (17/30) · CopilotArenaOrLMArenaCode 42.7 (28/32) · GDPval 68.8 (14/32) · GPQA_HLE_Reasoning 33.5 (25/32) · GSO 53.4 (8/16) · IFBench 59.9 (19/32) · LMArenaCreativeOrOpenEnded 75.3 (12/32) · LMArenaSearchDocument 92.5 (3/30) · LMArenaText 75.3 (12/32) · LongContextRecall 42.3 (27/32) · OutputSpeed 87.8 (9/32) · SWEBenchPro 95.0 (9/29) · SWEBenchVerified 92.5 (14/31) · SWEComposite 92.1 (5/32) · SWERebench 89.4 (9/31) · SciCode 40.3 (20/32) · SonarBugDensity 84.4 (4/20) · SonarComposite 61.6 (8/32) · SonarFunctionalSkill 72.3 (11/20) · SonarIssueDensity 7.5 (18/20) · SonarVulnerabilityDensity 92.5 (3/20) · TTFT 75.0 (21/32) · Tau2Bench 7.5 (29/32) · TerminalBench 97.2 (3/32)
sources: aistupidlevel, artificial_analysis, lmarena, openrouter, overrides, sonar, swerebench, terminal_bench
missing: BUILD/MCPAtlas, PLAN/MCPAtlas, SWEComposite/SWEBenchMultilingual
glm-5 · zai · idea 69.7 (up 2.8 since last refresh) · plan 65.9 (up 1.8 since last refresh) · build 70.8 (up 2.3 since last refresh) · review 76.0 (up 2.3 since last refresh)
glm-5

group breakdown

A_B 65.3 (14/32) · A_I 75.6 (13/32) · A_P 66.6 (11/32) · A_R 83.5 (14/32) · BUILD 70.0 (15/32) · CRE 73.2 (14/32) · GEN 54.2 (17/32) · LM_ARENA_REVIEW_PROXY 88.8 (10/32) · OPS_long 85.0 (10/32) · OPS_precision 88.9 (7/32) · OPS_review 86.3 (10/32) · PLAN 69.6 (12/32)

metrics

AI_code 45.8 (12/28) · AI_complexity 45.5 (15/28) · AI_context_awareness 7.5 (16/28) · AI_correctness 87.5 (15/28) · AI_edge_cases 77.7 (15/28) · AI_efficiency 19.6 (25/28) · AI_hallucination_resistance 92.5 (18/28) · AI_memory_retention 7.5 (22/28) · AI_parameter_accuracy 83.3 (18/28) · AI_plan_coherence 71.2 (10/28) · AI_recovery 90.7 (16/28) · AI_refusal 92.5 (19/28) · AI_spec 92.5 (19/28) · AI_stability 80.3 (15/28) · AI_task_completion 71.7 (19/28) · AI_tool_selection 70.5 (15/28) · ARC_AGI_2 5.2 (19/25) · ArtificialAnalysisCoding 40.1 (24/32) · ArtificialAnalysisIntelligence 60.8 (17/32) · ArtificialAnalysisReasoning 53.2 (22/32) · BlendedCost 92.5 (11/31) · ContextWindow 74.0 (25/30) · CopilotArenaOrLMArenaCode 64.5 (18/32) · GDPval 73.3 (12/32) · GPQA_HLE_Reasoning 53.2 (22/32) · IFBench 85.0 (10/32) · LMArenaCreativeOrOpenEnded 73.2 (14/32) · LMArenaSearchDocument 88.8 (10/30) · LMArenaText 73.2 (14/32) · LongContextRecall 37.5 (28/32) · MCPAtlas 47.2 (17/28) · OutputSpeed 81.1 (16/32) · SWEBenchMultilingual 51.2 (16/27) · SWEBenchPro 92.5 (17/29) · SWEBenchVerified 91.0 (19/31) · SWEComposite 81.9 (16/32) · SWERebench 76.9 (13/31) · SciCode 35.2 (22/32) · SonarBugDensity 100.0 (1/20) · SonarComposite 85.4 (2/32) · SonarFunctionalSkill 69.8 (12/20) · SonarIssueDensity 100.0 (1/20) · SonarVulnerabilityDensity 83.3 (5/20) · TTFT 100.0 (1/32) · Tau2Bench 100.0 (2/32) · TerminalBench 72.9 (17/32)
sources: arc_agi, artificial_analysis, lmarena, openrouter, overrides, sonar, swebench, swerebench, terminal_bench
missing: BUILD/GSO
claude-opus-4.5 · anthropic · idea 69.1 (down 7.7 since last refresh) · plan 67.1 (down 4.1 since last refresh) · build 70.6 (down 8.4 since last refresh) · review 59.3 (down 6.6 since last refresh)
claude-opus-4.5

group breakdown

A_B 28.8 (30/32) · A_I 39.9 (25/32) · A_P 46.4 (23/32) · A_R 40.1 (25/32) · BUILD 79.8 (8/32) · CRE 74.9 (13/32) · GEN 73.2 (10/32) · LM_ARENA_REVIEW_PROXY 11.0 (29/32) · OPS_long 73.6 (25/32) · OPS_precision 70.2 (26/32) · OPS_review 70.2 (26/32) · PLAN 69.1 (13/32)

metrics

AI_canary_health 88.5 (1/3) · AI_code 10.6 (24/28) · AI_complexity 3.8 (26/28) · AI_context_awareness 50.2 (6/28) · AI_correctness 37.2 (21/28) · AI_edge_cases 66.4 (20/28) · AI_efficiency 42.1 (18/28) · AI_hallucination_resistance 12.9 (27/28) · AI_memory_retention 0.0 (24/28) · AI_parameter_accuracy 100.0 (1/28) · AI_plan_coherence 28.9 (17/28) · AI_recovery 74.2 (18/28) · AI_refusal 17.8 (26/28) · AI_spec 17.8 (26/28) · AI_stability 57.2 (19/28) · AI_task_completion 98.9 (3/28) · AI_tool_selection 64.5 (21/28) · ARC_AGI_2 85.5 (5/25) · ArtificialAnalysisCoding 78.1 (6/32) · ArtificialAnalysisIntelligence 72.0 (11/32) · ArtificialAnalysisReasoning 63.6 (14/32) · BlendedCost 49.6 (26/31) · ContextWindow 73.8 (27/30) · CopilotArenaOrLMArenaCode 75.5 (9/32) · GDPval 82.5 (9/32) · GPQA_HLE_Reasoning 63.6 (14/32) · GSO 59.3 (5/16) · IFBench 42.8 (22/32) · LMArenaCreativeOrOpenEnded 74.9 (13/32) · LMArenaSearchDocument 11.0 (27/30) · LMArenaText 74.9 (13/32) · LongContextRecall 100.0 (1/32) · MCPAtlas 57.3 (15/28) · OutputSpeed 77.8 (19/32) · SWEBenchMultilingual 95.0 (2/27) · SWEBenchPro 88.4 (19/29) · SWEBenchVerified 92.0 (17/31) · SWEComposite 84.7 (14/32) · SWERebench 76.3 (14/31) · SciCode 67.7 (10/32) · SonarBugDensity 81.8 (5/20) · SonarComposite 89.0 (1/32) · SonarFunctionalSkill 100.0 (1/20) · SonarIssueDensity 80.6 (3/20) · SonarVulnerabilityDensity 83.3 (4/20) · TTFT 74.0 (22/32) · Tau2Bench 81.5 (15/32) · TerminalBench 71.5 (18/32)
sources: aistupidlevel, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench_pro, swerebench, terminal_bench
missing: none
mimo-v2.5-pro · xiaomi · idea 73.3 (down 5.5 since last refresh) · plan 74.1 (down 3.2 since last refresh) · build 69.7 (down 3.1 since last refresh) · review 72.9 (down 6.3 since last refresh)
mimo-v2.5-pro

group breakdown

A_B 34.7 (27/32) · A_I 31.1 (29/32) · A_P 37.3 (30/32) · A_R 28.4 (30/32) · BUILD 73.8 (13/32) · CRE 83.3 (7/32) · GEN 74.2 (9/32) · LM_ARENA_REVIEW_PROXY 84.3 (16/32) · OPS_long 84.4 (11/32) · OPS_precision 87.4 (10/32) · OPS_review 88.3 (7/32) · PLAN 83.7 (6/32)

metrics

AI_code 43.3 (19/28) · AI_complexity 45.5 (14/28) · AI_context_awareness 7.5 (15/28) · AI_correctness 7.6 (26/28) · AI_edge_cases 7.5 (26/28) · AI_efficiency 42.2 (17/28) · AI_hallucination_resistance 39.1 (23/28) · AI_memory_retention 7.5 (21/28) · AI_parameter_accuracy 88.8 (12/28) · AI_plan_coherence 24.5 (21/28) · AI_recovery 7.5 (27/28) · AI_refusal 92.5 (18/28) · AI_spec 92.5 (18/28) · AI_stability 7.5 (27/28) · AI_task_completion 71.7 (18/28) · AI_tool_selection 69.5 (20/28) · ARC_AGI_2 20.3 (13/25) · ArtificialAnalysisCoding 70.1 (11/32) · ArtificialAnalysisIntelligence 87.8 (5/32) · ArtificialAnalysisReasoning 75.0 (10/32) · BlendedCost 87.6 (15/31) · ContextWindow 100.0 (9/30) · CopilotArenaOrLMArenaCode 77.4 (8/32) · GDPval 68.2 (21/32) · GPQA_HLE_Reasoning 75.0 (10/32) · IFBench 100.0 (2/32) · LMArenaCreativeOrOpenEnded 83.3 (7/32) · LMArenaSearchDocument 84.3 (16/30) · LMArenaText 83.3 (7/32) · LongContextRecall 100.0 (2/32) · MCPAtlas 32.4 (21/28) · OutputSpeed 77.5 (20/32) · SWEBenchMultilingual 92.5 (12/27) · SWEBenchPro 95.0 (13/29) · SWEBenchVerified 95.0 (12/31) · SWEComposite 82.1 (15/32) · SWERebench 63.5 (23/31) · SciCode 71.5 (8/32) · SonarComposite 50.0 (26/32) · TTFT 90.4 (12/32) · Tau2Bench 92.1 (11/32) · TerminalBench 95.0 (5/32)
sources: artificial_analysis, lmarena, openrouter, overrides
missing: BUILD/GSO, SonarComposite/SonarBugDensity, SonarComposite/SonarFunctionalSkill, SonarComposite/SonarIssueDensity, SonarComposite/SonarVulnerabilityDensity
gemini-3-pro · google · idea 82.7 · plan 71.7 · build 66.1 · review 60.8
gemini-3-pro

group breakdown

A_B 50.0 (20/32) · A_I 50.0 (21/32) · A_P 50.0 (20/32) · A_R 50.0 (23/32) · BUILD 69.8 (16/32) · CRE 98.4 (3/32) · GEN 75.5 (8/32) · LM_ARENA_REVIEW_PROXY 19.2 (24/32) · OPS_long 64.5 (31/32) · OPS_precision 52.4 (30/32) · OPS_review 49.8 (32/32) · PLAN 79.6 (10/32)

metrics

ARC_AGI_2 42.2 (9/25) · ArtificialAnalysisCoding 73.6 (10/32) · ArtificialAnalysisIntelligence 67.0 (15/32) · ArtificialAnalysisReasoning 91.1 (4/32) · BlendedCost 76.1 (19/31) · ContextWindow 0.0 (30/30) · CopilotArenaOrLMArenaCode 65.6 (16/32) · GDPval 34.9 (27/32) · GPQA_HLE_Reasoning 91.1 (4/32) · GSO 40.7 (10/16) · IFBench 77.3 (13/32) · LMArenaCreativeOrOpenEnded 98.4 (3/32) · LMArenaSearchDocument 19.2 (22/30) · LMArenaText 98.4 (3/32) · LongContextRecall 88.0 (8/32) · MCPAtlas 59.8 (10/28) · OutputSpeed 94.1 (3/32) · SWEBenchMultilingual 33.5 (19/27) · SWEBenchPro 80.3 (22/29) · SWEBenchVerified 81.4 (25/31) · SWEComposite 71.7 (23/32) · SWERebench 70.2 (20/31) · SciCode 100.0 (1/32) · SonarBugDensity 67.6 (10/20) · SonarComposite 60.9 (9/32) · SonarFunctionalSkill 84.1 (6/20) · SonarIssueDensity 20.8 (16/20) · SonarVulnerabilityDensity 57.0 (10/20) · TTFT 25.5 (30/32) · Tau2Bench 76.3 (18/32) · TerminalBench 80.0 (13/32)
sources: arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swebench_pro, swerebench, terminal_bench
missing: A_B/AI_code, A_B/AI_complexity, A_B/AI_correctness, A_B/AI_edge_cases, A_B/AI_efficiency, A_B/AI_hallucination_resistance, A_B/AI_memory_retention, A_B/AI_recovery, A_B/AI_spec, A_B/AI_stability, A_I/AI_complexity, A_I/AI_correctness, A_I/AI_edge_cases, A_I/AI_efficiency, A_I/AI_plan_coherence, A_I/AI_recovery, A_I/AI_spec, A_I/AI_stability, A_P/AI_context_awareness, A_P/AI_correctness, A_P/AI_efficiency, A_P/AI_memory_retention, A_P/AI_parameter_accuracy, A_P/AI_plan_coherence, A_P/AI_recovery, A_P/AI_spec, A_P/AI_stability, A_P/AI_task_completion, A_P/AI_tool_selection, A_R/AI_code, A_R/AI_correctness, A_R/AI_edge_cases, A_R/AI_hallucination_resistance, A_R/AI_recovery, A_R/AI_spec, A_R/AI_stability
claude-sonnet-4.6 · anthropic · idea 67.1 (down 9.4 since last refresh) · plan 60.7 (down 5.2 since last refresh) · build 66.1 (down 10.5 since last refresh) · review 54.6 (down 9.4 since last refresh)
claude-sonnet-4.6

group breakdown

A_B 17.0 (32/32) · A_I 26.0 (31/32) · A_P 46.5 (22/32) · A_R 18.0 (32/32) · BUILD 78.4 (10/32) · CRE 79.0 (10/32) · GEN 67.2 (11/32) · LM_ARENA_REVIEW_PROXY 22.4 (22/32) · OPS_long 66.7 (27/32) · OPS_precision 53.8 (29/32) · OPS_review 63.7 (27/32) · PLAN 62.2 (19/32)

metrics

AI_code 0.7 (27/28) · AI_complexity 0.5 (27/28) · AI_context_awareness 100.0 (1/28) · AI_correctness 0.0 (28/28) · AI_edge_cases 43.1 (22/28) · AI_efficiency 47.5 (11/28) · AI_hallucination_resistance 0.0 (28/28) · AI_memory_retention 47.3 (10/28) · AI_parameter_accuracy 56.9 (23/28) · AI_plan_coherence 34.9 (14/28) · AI_recovery 26.6 (22/28) · AI_refusal 0.0 (28/28) · AI_spec 0.0 (28/28) · AI_stability 75.1 (17/28) · AI_task_completion 100.0 (1/28) · AI_tool_selection 92.7 (3/28) · ARC_AGI_2 10.6 (17/25) · ArtificialAnalysisCoding 88.8 (4/32) · ArtificialAnalysisIntelligence 79.7 (7/32) · ArtificialAnalysisReasoning 68.9 (11/32) · BlendedCost 73.1 (25/31) · ContextWindow 99.2 (14/30) · CopilotArenaOrLMArenaCode 95.1 (4/32) · GDPval 88.8 (6/32) · GPQA_HLE_Reasoning 68.9 (11/32) · GSO 30.7 (11/16) · IFBench 39.1 (24/32) · LMArenaCreativeOrOpenEnded 79.0 (10/32) · LMArenaSearchDocument 22.4 (20/30) · LMArenaText 79.0 (10/32) · LongContextRecall 88.0 (7/32) · MCPAtlas 55.7 (16/28) · OutputSpeed 80.9 (17/32) · SWEBenchMultilingual 95.0 (4/27) · SWEBenchPro 76.5 (24/29) · SWEBenchVerified 90.0 (20/31) · SWEComposite 88.1 (11/32) · SWERebench 95.8 (3/31) · SciCode 52.8 (14/32) · SonarBugDensity 76.4 (7/20) · SonarComposite 60.7 (10/32) · SonarFunctionalSkill 84.5 (5/20) · SonarIssueDensity 34.0 (11/20) · SonarVulnerabilityDensity 20.9 (18/20) · TTFT 0.0 (32/32) · Tau2Bench 50.6 (22/32) · TerminalBench 74.9 (16/32)
sources: aistupidlevel, arc_agi, artificial_analysis, lmarena, mcp_atlas, openrouter, overrides, sonar, swerebench, terminal_bench
missing: none
minimax-m2.7 · minimax · idea 43.4 (down 5.5 since last refresh) · plan 59.3 (down 3.2 since last refresh) · build 64.8 (down 3.1 since last refresh) · review 65.4 (down 6.3 since last refresh)
minimax-m2.7

group breakdown

A_B 34.7 (25/32) · A_I 31.1 (27/32) · A_P 37.3 (28/32) · A_R 28.4 (28/32) · BUILD 68.6 (17/32) · CRE 36.7 (26/32) · GEN 52.7 (19/32) · LM_ARENA_REVIEW_PROXY 84.3 (14/32) · OPS_long 81.3 (16/32) · OPS_precision 86.7 (12/32) · OPS_review 84.6 (12/32) · PLAN 66.1 (17/32)

metrics

AI_code 43.3 (17/28) · AI_complexity 45.5 (10/28) · AI_context_awareness 7.5 (10/28) · AI_correctness 7.6 (24/28) · AI_edge_cases 7.5 (24/28) · AI_efficiency 42.2 (15/28) · AI_hallucination_resistance 39.1 (21/28) · AI_memory_retention 7.5 (17/28) · AI_parameter_accuracy 88.8 (10/28) · AI_plan_coherence 24.5 (19/28) · AI_recovery 7.5 (25/28) · AI_refusal 92.5 (14/28) · AI_spec 92.5 (14/28) · AI_stability 7.5 (25/28) · AI_task_completion 71.7 (14/28) · AI_tool_selection 69.5 (18/28) · ARC_AGI_2 11.9 (15/25) · ArtificialAnalysisCoding 57.7 (17/32) · ArtificialAnalysisIntelligence 71.6 (12/32) · ArtificialAnalysisReasoning 64.7 (13/32) · BlendedCost 98.8 (3/31) · ContextWindow 74.2 (21/30) · CopilotArenaOrLMArenaCode 54.8 (21/32) · GDPval 68.2 (18/32) · GPQA_HLE_Reasoning 64.7 (13/32) · IFBench 91.9 (8/32) · LMArenaCreativeOrOpenEnded 36.7 (26/32) · LMArenaSearchDocument 84.3 (14/30) · LMArenaText 36.7 (26/32) · LongContextRecall 77.9 (12/32) · MCPAtlas 32.4 (19/28) · OutputSpeed 75.2 (26/32) · SWEBenchMultilingual 95.0 (6/27) · SWEBenchPro 95.0 (6/29) · SWEBenchVerified 92.5 (13/31) · SWEComposite 86.0 (12/32) · SWERebench 73.3 (17/31) · SciCode 53.9 (13/32) · SonarComposite 50.0 (19/32) · TTFT 94.9 (9/32) · Tau2Bench 71.0 (19/32) · TerminalBench 58.5 (21/32)
sources: artificial_analysis, lmarena, openrouter, overrides, swerebench, terminal_bench
missing: BUILD/GSO, SonarComposite/SonarBugDensity, SonarComposite/SonarFunctionalSkill, SonarComposite/SonarIssueDensity, SonarComposite/SonarVulnerabilityDensity
mimo-v2.5 · xiaomi · idea 54.0 (down 5.5 since last refresh) · plan 61.0 (down 3.2 since last refresh) · build 64.7 (down 3.1 since last refresh) · review 66.2 (down 6.3 since last refresh)
mimo-v2.5

group breakdown

A_B 34.7 (26/32) · A_I 31.1 (28/32) · A_P 37.3 (29/32) · A_R 28.4 (29/32) · BUILD 67.9 (19/32) · CRE 54.5 (21/32) · GEN 55.4 (16/32) · LM_ARENA_REVIEW_PROXY 84.3 (15/32) · OPS_long 89.5 (4/32) · OPS_precision 90.9 (3/32) · OPS_review 91.9 (2/32) · PLAN 67.3 (15/32)

metrics

AI_code 43.3 (18/28) · AI_complexity 45.5 (13/28) · AI_context_awareness 7.5 (14/28) · AI_correctness 7.6 (25/28) · AI_edge_cases 7.5 (25/28) · AI_efficiency 42.2 (16/28) · AI_hallucination_resistance 39.1 (22/28) · AI_memory_retention 7.5 (20/28) · AI_parameter_accuracy 88.8 (11/28) · AI_plan_coherence 24.5 (20/28) · AI_recovery 7.5 (26/28) · AI_refusal 92.5 (17/28) · AI_spec 92.5 (17/28) · AI_stability 7.5 (26/28) · AI_task_completion 71.7 (17/28) · AI_tool_selection 69.5 (19/28) · ARC_AGI_2 20.3 (12/25) · ArtificialAnalysisCoding 58.4 (16/32) · ArtificialAnalysisIntelligence 69.3 (13/32) · ArtificialAnalysisReasoning 53.2 (21/32) · BlendedCost 94.1 (9/31) · ContextWindow 100.0 (8/30) · CopilotArenaOrLMArenaCode 65.9 (15/32) · GDPval 68.2 (20/32) · GPQA_HLE_Reasoning 53.2 (21/32) · IFBench 68.2 (16/32) · LMArenaCreativeOrOpenEnded 54.5 (21/32) · LMArenaSearchDocument 84.3 (15/30) · LMArenaText 54.5 (21/32) · LongContextRecall 47.6 (25/32) · MCPAtlas 32.4 (20/28) · OutputSpeed 85.8 (11/32) · SWEBenchMultilingual 92.5 (11/27) · SWEBenchPro 95.0 (12/29) · SWEBenchVerified 92.5 (15/31) · SWEComposite 81.8 (17/32) · SWERebench 63.5 (22/31) · SciCode 32.5 (23/32) · SonarComposite 50.0 (25/32) · TTFT 89.5 (14/32) · Tau2Bench 84.2 (14/32) · TerminalBench 94.4 (6/32)
sources: artificial_analysis, lmarena, openrouter, overrides
missing: BUILD/GSO, SonarComposite/SonarBugDensity, SonarComposite/SonarFunctionalSkill, SonarComposite/SonarIssueDensity, SonarComposite/SonarVulnerabilityDensity
claude-opus-4.1 · anthropic · idea 55.5 (down 6.6 since last refresh) · plan 61.4 (down 3.5 since last refresh) · build 64.2 (down 7.2 since last refresh) · review 52.9 (down 5.6 since last refresh)
claude-opus-4.1

group breakdown

A_B 32.0 (29/32) · A_I 41.4 (24/32) · A_P 46.9 (21/32) · A_R 41.6 (24/32) · BUILD 71.8 (14/32) · CRE 53.0 (23/32) · GEN 65.7 (12/32) · LM_ARENA_REVIEW_PROXY 0.4 (31/32) · OPS_long 65.7 (30/32) · OPS_precision 57.8 (27/32) · OPS_review 58.1 (28/32) · PLAN 63.9 (18/32)

metrics

AI_code 16.5 (22/28) · AI_complexity 10.7 (24/28) · AI_context_awareness 50.2 (7/28) · AI_correctness 39.1 (20/28) · AI_edge_cases 63.9 (21/28) · AI_efficiency 43.2 (13/28) · AI_hallucination_resistance 18.5 (26/28) · AI_memory_retention 7.5 (15/28) · AI_parameter_accuracy 92.5 (4/28) · AI_plan_coherence 32.1 (16/28) · AI_recovery 70.5 (19/28) · AI_refusal 22.7 (25/28) · AI_spec 22.7 (25/28) · AI_stability 56.1 (20/28) · AI_task_completion 91.6 (4/28) · AI_tool_selection 62.3 (22/28) · ARC_AGI_2 83.5 (6/25) · ArtificialAnalysisCoding 73.9 (9/32) · ArtificialAnalysisIntelligence 68.7 (14/32) · ArtificialAnalysisReasoning 61.6 (16/32) · BlendedCost 0.0 (31/31) · ContextWindow 73.8 (26/30) · CopilotArenaOrLMArenaCode 47.2 (26/32) · GDPval 82.5 (8/32) · GPQA_HLE_Reasoning 61.6 (16/32) · GSO 57.9 (6/16) · IFBench 43.9 (21/32) · LMArenaCreativeOrOpenEnded 53.0 (23/32) · LMArenaSearchDocument 0.4 (29/30) · LMArenaText 53.0 (23/32) · LongContextRecall 92.5 (5/32) · MCPAtlas 86.9 (4/28) · OutputSpeed 73.6 (28/32) · SWEBenchMultilingual 92.5 (8/27) · SWEBenchPro 82.6 (20/29) · SWEBenchVerified 91.5 (18/31) · SWEComposite 72.5 (22/32) · SWERebench 51.5 (26/31) · SciCode 65.0 (11/32) · SonarBugDensity 77.0 (6/20) · SonarComposite 83.2 (3/32) · SonarFunctionalSkill 92.5 (3/20) · SonarIssueDensity 76.0 (4/20) · SonarVulnerabilityDensity 78.3 (6/20) · TTFT 70.4 (25/32) · Tau2Bench 76.8 (17/32) · TerminalBench 38.2 (26/32)
sources: lmarena, openrouter, overrides, swerebench, terminal_bench
missing: none
gpt-5.4 · openai · idea 66.5 (down 6.8 since last refresh) · plan 44.9 (down 3.6 since last refresh) · build 63.3 (down 7.0 since last refresh) · review 51.6 (down 5.3 since last refresh)
gpt-5.4

group breakdown

A_B 40.3 (23 / 32)
A_I 43.0 (23 / 32)
A_P 40.2 (26 / 32)
A_R 61.7 (17 / 32)
BUILD 67.9 (18 / 32)
CRE 79.8 (9 / 32)
GEN 45.1 (22 / 32)
LM_ARENA_REVIEW_PROXY 16.1 (26 / 32)
OPS_long 90.9 (3 / 32)
OPS_precision 87.3 (11 / 32)
OPS_review 88.9 (6 / 32)
PLAN 39.0 (26 / 32)

metrics

AI_code 7.1 (26 / 28)
AI_complexity 15.9 (23 / 28)
AI_context_awareness 0.0 (23 / 28)
AI_correctness 54.4 (17 / 28)
AI_edge_cases 76.7 (17 / 28)
AI_efficiency 47.2 (12 / 28)
AI_hallucination_resistance 100.0 (8 / 28)
AI_memory_retention 9.1 (14 / 28)
AI_parameter_accuracy 91.3 (6 / 28)
AI_plan_coherence 13.8 (24 / 28)
AI_recovery 100.0 (5 / 28)
AI_refusal 36.3 (23 / 28)
AI_spec 36.3 (24 / 28)
AI_stability 11.0 (23 / 28)
AI_task_completion 72.2 (12 / 28)
AI_tool_selection 79.8 (8 / 28)
ARC_AGI_2 76.5 (7 / 25)
ArtificialAnalysisCoding 33.9 (26 / 32)
ArtificialAnalysisIntelligence 27.4 (27 / 32)
ArtificialAnalysisReasoning 12.4 (29 / 32)
BlendedCost 73.8 (22 / 31)
ContextWindow 100.0 (1 / 30)
CopilotArenaOrLMArenaCode 48.2 (23 / 32)
GDPval 90.7 (4 / 32)
GPQA_HLE_Reasoning 12.4 (29 / 32)
GSO 54.0 (7 / 16)
IFBench 60.7 (18 / 32)
LMArenaCreativeOrOpenEnded 79.8 (9 / 32)
LMArenaSearchDocument 16.1 (24 / 30)
LMArenaText 79.8 (9 / 32)
LongContextRecall 20.7 (29 / 32)
MCPAtlas 59.7 (11 / 28)
OutputSpeed 94.1 (4 / 32)
SWEBenchPro 92.5 (16 / 29)
SWEBenchVerified 95.0 (9 / 31)
SWEComposite 88.9 (10 / 32)
SWERebench 83.5 (11 / 31)
SciCode 6.7 (29 / 32)
SonarBugDensity 0.0 (20 / 20)
SonarComposite 30.0 (29 / 32)
SonarFunctionalSkill 37.5 (16 / 20)
SonarIssueDensity 0.0 (20 / 20)
SonarVulnerabilityDensity 100.0 (1 / 20)
TTFT 83.8 (15 / 32)
Tau2Bench 0.0 (32 / 32)
TerminalBench 92.5 (7 / 32)
sources: aistupidlevel, arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench_pro
missing: SWEComposite/SWEBenchMultilingual
gemini-3-flash · google · idea 77.0 · plan 64.9 · build 62.4 · review 57.5
gemini-3-flash

group breakdown

A_B 50.0 (19 / 32)
A_I 50.0 (20 / 32)
A_P 50.0 (19 / 32)
A_R 50.0 (22 / 32)
BUILD 61.3 (21 / 32)
CRE 88.9 (6 / 32)
GEN 62.7 (14 / 32)
LM_ARENA_REVIEW_PROXY 18.4 (25 / 32)
OPS_long 94.7 (1 / 32)
OPS_precision 90.9 (2 / 32)
OPS_review 92.9 (1 / 32)
PLAN 66.9 (16 / 32)

metrics

ARC_AGI_2 3.1 (22 / 25)
ArtificialAnalysisCoding 60.1 (15 / 32)
ArtificialAnalysisIntelligence 59.3 (20 / 32)
ArtificialAnalysisReasoning 83.7 (8 / 32)
BlendedCost 90.6 (13 / 31)
ContextWindow 100.0 (6 / 30)
CopilotArenaOrLMArenaCode 65.1 (17 / 32)
GDPval 37.1 (25 / 32)
GPQA_HLE_Reasoning 83.7 (8 / 32)
GSO 14.0 (14 / 16)
IFBench 98.1 (3 / 32)
LMArenaCreativeOrOpenEnded 88.9 (6 / 32)
LMArenaSearchDocument 18.4 (23 / 30)
LMArenaText 88.9 (6 / 32)
LongContextRecall 66.2 (15 / 32)
MCPAtlas 16.9 (24 / 28)
OutputSpeed 99.4 (2 / 32)
SWEBenchMultilingual 100.0 (1 / 27)
SWEBenchPro 53.0 (26 / 29)
SWEBenchVerified 100.0 (1 / 31)
SWEComposite 74.0 (20 / 32)
SWERebench 76.0 (15 / 31)
SciCode 73.7 (7 / 32)
SonarBugDensity 65.0 (15 / 20)
SonarComposite 59.3 (13 / 32)
SonarFunctionalSkill 78.9 (9 / 20)
SonarIssueDensity 25.2 (14 / 20)
SonarVulnerabilityDensity 56.0 (13 / 20)
TTFT 80.0 (17 / 32)
Tau2Bench 61.1 (20 / 32)
TerminalBench 63.0 (19 / 32)
sources: arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swebench_pro, swerebench, terminal_bench
missing: A_B/AI_code, A_B/AI_complexity, A_B/AI_correctness, A_B/AI_edge_cases, A_B/AI_efficiency, A_B/AI_hallucination_resistance, A_B/AI_memory_retention, A_B/AI_recovery, A_B/AI_spec, A_B/AI_stability, A_I/AI_complexity, A_I/AI_correctness, A_I/AI_edge_cases, A_I/AI_efficiency, A_I/AI_plan_coherence, A_I/AI_recovery, A_I/AI_spec, A_I/AI_stability, A_P/AI_context_awareness, A_P/AI_correctness, A_P/AI_efficiency, A_P/AI_memory_retention, A_P/AI_parameter_accuracy, A_P/AI_plan_coherence, A_P/AI_recovery, A_P/AI_spec, A_P/AI_stability, A_P/AI_task_completion, A_P/AI_tool_selection, A_R/AI_code, A_R/AI_correctness, A_R/AI_edge_cases, A_R/AI_hallucination_resistance, A_R/AI_recovery, A_R/AI_spec, A_R/AI_stability
claude-sonnet-4.5 · anthropic · idea 65.2 (up 9.5 since last refresh) · plan 49.5 (up 5.1 since last refresh) · build 61.6 (up 10.5 since last refresh) · review 51.3 (up 8.6 since last refresh)
claude-sonnet-4.5

group breakdown

A_B 97.9 (1 / 32)
A_I 93.5 (4 / 32)
A_P 71.5 (4 / 32)
A_R 100.0 (1 / 32)
BUILD 53.8 (24 / 32)
CRE 66.0 (17 / 32)
GEN 42.5 (24 / 32)
LM_ARENA_REVIEW_PROXY 1.5 (30 / 32)
OPS_long 79.0 (20 / 32)
OPS_precision 79.2 (20 / 32)
OPS_review 81.3 (20 / 32)
PLAN 41.8 (25 / 32)

metrics

AI_canary_health 81.8 (2 / 3)
AI_code 100.0 (1 / 28)
AI_complexity 100.0 (1 / 28)
AI_context_awareness 0.0 (18 / 28)
AI_correctness 100.0 (3 / 28)
AI_edge_cases 100.0 (2 / 28)
AI_efficiency 99.7 (2 / 28)
AI_hallucination_resistance 100.0 (3 / 28)
AI_memory_retention 49.2 (9 / 28)
AI_parameter_accuracy 68.6 (21 / 28)
AI_plan_coherence 34.9 (15 / 28)
AI_recovery 100.0 (3 / 28)
AI_refusal 100.0 (3 / 28)
AI_spec 100.0 (3 / 28)
AI_stability 100.0 (1 / 28)
AI_task_completion 86.7 (5 / 28)
AI_tool_selection 80.9 (7 / 28)
ARC_AGI_2 3.7 (20 / 25)
ArtificialAnalysisCoding 46.3 (21 / 32)
ArtificialAnalysisIntelligence 46.2 (21 / 32)
ArtificialAnalysisReasoning 33.3 (26 / 32)
BlendedCost 73.1 (24 / 31)
ContextWindow 99.2 (13 / 30)
CopilotArenaOrLMArenaCode 47.4 (25 / 32)
GDPval 91.1 (3 / 32)
GPQA_HLE_Reasoning 33.3 (26 / 32)
GSO 27.3 (12 / 16)
IFBench 41.0 (23 / 32)
LMArenaCreativeOrOpenEnded 66.0 (17 / 32)
LMArenaSearchDocument 1.5 (28 / 30)
LMArenaText 66.0 (17 / 32)
LongContextRecall 62.8 (18 / 32)
MCPAtlas 4.0 (27 / 28)
OutputSpeed 75.1 (27 / 32)
SWEBenchMultilingual 3.5 (26 / 27)
SWEBenchPro 81.2 (21 / 29)
SWEBenchVerified 84.4 (23 / 31)
SWEComposite 71.3 (24 / 32)
SWERebench 74.6 (16 / 31)
SciCode 41.3 (19 / 32)
SonarBugDensity 32.8 (17 / 20)
SonarComposite 24.2 (31 / 32)
SonarFunctionalSkill 17.2 (18 / 20)
SonarIssueDensity 40.6 (10 / 20)
SonarVulnerabilityDensity 4.4 (19 / 20)
TTFT 77.5 (18 / 32)
Tau2Bench 55.8 (21 / 32)
TerminalBench 48.6 (25 / 32)
sources: aistupidlevel, arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swebench_pro, swerebench, terminal_bench
missing: none
kimi-k2.5 · moonshot · idea 52.9 (down 6.5 since last refresh) · plan 59.9 (down 3.7 since last refresh) · build 59.3 (down 3.7 since last refresh) · review 63.9 (down 7.5 since last refresh)
kimi-k2.5

group breakdown

A_B 32.0 (28 / 32)
A_I 27.8 (30 / 32)
A_P 35.0 (31 / 32)
A_R 24.6 (31 / 32)
BUILD 61.4 (20 / 32)
CRE 55.5 (19 / 32)
GEN 54.0 (18 / 32)
LM_ARENA_REVIEW_PROXY 90.3 (5 / 32)
OPS_long 79.0 (19 / 32)
OPS_precision 85.1 (14 / 32)
OPS_review 83.4 (13 / 32)
PLAN 67.5 (14 / 32)

metrics

AI_code 42.1 (20 / 28)
AI_complexity 44.7 (19 / 28)
AI_context_awareness 0.0 (21 / 28)
AI_correctness 0.1 (27 / 28)
AI_edge_cases 0.0 (28 / 28)
AI_efficiency 40.9 (20 / 28)
AI_hallucination_resistance 37.2 (24 / 28)
AI_memory_retention 0.0 (27 / 28)
AI_parameter_accuracy 95.6 (3 / 28)
AI_plan_coherence 20.0 (22 / 28)
AI_recovery 0.0 (28 / 28)
AI_refusal 100.0 (7 / 28)
AI_spec 100.0 (7 / 28)
AI_stability 0.0 (28 / 28)
AI_task_completion 75.5 (10 / 28)
AI_tool_selection 73.0 (12 / 28)
ARC_AGI_2 15.0 (14 / 25)
ArtificialAnalysisCoding 49.8 (19 / 32)
ArtificialAnalysisIntelligence 60.8 (16 / 32)
ArtificialAnalysisReasoning 68.5 (12 / 32)
BlendedCost 94.4 (8 / 31)
ContextWindow 78.1 (18 / 30)
CopilotArenaOrLMArenaCode 55.0 (20 / 32)
GDPval 68.2 (19 / 32)
GPQA_HLE_Reasoning 68.5 (12 / 32)
IFBench 76.7 (14 / 32)
LMArenaCreativeOrOpenEnded 55.5 (19 / 32)
LMArenaSearchDocument 90.3 (5 / 30)
LMArenaText 55.5 (19 / 32)
LongContextRecall 61.1 (19 / 32)
MCPAtlas 29.3 (22 / 28)
OutputSpeed 70.6 (30 / 32)
SWEBenchMultilingual 8.8 (22 / 27)
SWEBenchPro 95.0 (7 / 29)
SWEBenchVerified 85.0 (22 / 31)
SWEComposite 73.2 (21 / 32)
SWERebench 65.8 (21 / 31)
SciCode 64.9 (12 / 32)
SonarComposite 50.0 (21 / 32)
TTFT 95.1 (8 / 32)
Tau2Bench 96.0 (5 / 32)
TerminalBench 54.8 (23 / 32)
sources: aistupidlevel, arc_agi, artificial_analysis, lmarena, mcp_atlas, openrouter, overrides, swebench, swerebench, terminal_bench
missing: BUILD/GSO, SonarComposite/SonarBugDensity, SonarComposite/SonarFunctionalSkill, SonarComposite/SonarIssueDensity, SonarComposite/SonarVulnerabilityDensity
minimax-m2.5 · minimax · idea 25.9 (down 5.5 since last refresh) · plan 50.3 (down 3.2 since last refresh) · build 59.1 (down 3.1 since last refresh) · review 61.9 (down 6.3 since last refresh)
minimax-m2.5

group breakdown

A_B 34.7 (24 / 32)
A_I 31.1 (26 / 32)
A_P 37.3 (27 / 32)
A_R 28.4 (27 / 32)
BUILD 60.4 (22 / 32)
CRE 13.5 (29 / 32)
GEN 29.0 (26 / 32)
LM_ARENA_REVIEW_PROXY 84.3 (13 / 32)
OPS_long 87.3 (7 / 32)
OPS_precision 90.4 (5 / 32)
OPS_review 88.2 (8 / 32)
PLAN 61.9 (20 / 32)

metrics

AI_code 43.3 (16 / 28)
AI_complexity 45.5 (9 / 28)
AI_context_awareness 7.5 (9 / 28)
AI_correctness 7.6 (23 / 28)
AI_edge_cases 7.5 (23 / 28)
AI_efficiency 42.2 (14 / 28)
AI_hallucination_resistance 39.1 (20 / 28)
AI_memory_retention 7.5 (16 / 28)
AI_parameter_accuracy 88.8 (9 / 28)
AI_plan_coherence 24.5 (18 / 28)
AI_recovery 7.5 (24 / 28)
AI_refusal 92.5 (13 / 28)
AI_spec 92.5 (13 / 28)
AI_stability 7.5 (24 / 28)
AI_task_completion 71.7 (13 / 28)
AI_tool_selection 69.5 (17 / 28)
ARC_AGI_2 5.2 (18 / 25)
ArtificialAnalysisCoding 42.2 (23 / 32)
ArtificialAnalysisIntelligence 42.0 (23 / 32)
ArtificialAnalysisReasoning 40.1 (24 / 32)
BlendedCost 100.0 (1 / 31)
ContextWindow 74.2 (20 / 30)
CopilotArenaOrLMArenaCode 46.1 (27 / 32)
GDPval 68.2 (17 / 32)
GPQA_HLE_Reasoning 40.1 (24 / 32)
IFBench 80.6 (11 / 32)
LMArenaCreativeOrOpenEnded 13.5 (29 / 32)
LMArenaSearchDocument 84.3 (13 / 30)
LMArenaText 13.5 (29 / 32)
LongContextRecall 64.5 (17 / 32)
MCPAtlas 32.4 (18 / 28)
OutputSpeed 85.3 (13 / 32)
SWEBenchMultilingual 26.5 (20 / 27)
SWEBenchPro 95.0 (5 / 29)
SWEBenchVerified 100.0 (2 / 31)
SWEComposite 75.9 (19 / 32)
SWERebench 62.4 (24 / 31)
SciCode 29.7 (25 / 32)
SonarComposite 50.0 (18 / 32)
TTFT 96.4 (5 / 32)
Tau2Bench 94.7 (9 / 32)
TerminalBench 52.8 (24 / 32)
sources: arc_agi, artificial_analysis, lmarena, openrouter, overrides, swebench, swerebench, terminal_bench
missing: BUILD/GSO, SonarComposite/SonarBugDensity, SonarComposite/SonarFunctionalSkill, SonarComposite/SonarIssueDensity, SonarComposite/SonarVulnerabilityDensity
kimi-k2-0905 · moonshot · idea 29.0 (up 2.8 since last refresh) · plan 31.1 (up 1.8 since last refresh) · build 58.5 (up 2.3 since last refresh) · review 57.9 (up 2.3 since last refresh)
kimi-k2-0905

group breakdown

A_B 65.3 (12 / 32)
A_I 75.6 (11 / 32)
A_P 66.6 (9 / 32)
A_R 83.5 (12 / 32)
BUILD 60.0 (23 / 32)
CRE 25.2 (27 / 32)
GEN 7.4 (32 / 32)
LM_ARENA_REVIEW_PROXY 88.8 (8 / 32)
OPS_long 33.6 (32 / 32)
OPS_precision 56.3 (28 / 32)
OPS_review 51.6 (31 / 32)
PLAN 31.1 (27 / 32)

metrics

AI_code 45.8 (10 / 28)
AI_complexity 45.5 (11 / 28)
AI_context_awareness 7.5 (11 / 28)
AI_correctness 87.5 (13 / 28)
AI_edge_cases 77.7 (13 / 28)
AI_efficiency 19.6 (23 / 28)
AI_hallucination_resistance 92.5 (15 / 28)
AI_memory_retention 7.5 (18 / 28)
AI_parameter_accuracy 83.3 (16 / 28)
AI_plan_coherence 71.2 (8 / 28)
AI_recovery 90.7 (14 / 28)
AI_refusal 92.5 (15 / 28)
AI_spec 92.5 (15 / 28)
AI_stability 80.3 (13 / 28)
AI_task_completion 71.7 (15 / 28)
AI_tool_selection 70.5 (13 / 28)
ArtificialAnalysisCoding 2.5 (30 / 32)
ArtificialAnalysisIntelligence 0.0 (31 / 32)
ArtificialAnalysisReasoning 0.0 (31 / 32)
BlendedCost 91.8 (12 / 31)
ContextWindow 43.0 (29 / 30)
CopilotArenaOrLMArenaCode 86.8 (7 / 32)
GDPval 5.0 (31 / 32)
GPQA_HLE_Reasoning 0.0 (31 / 32)
IFBench 0.0 (31 / 32)
LMArenaCreativeOrOpenEnded 25.2 (27 / 32)
LMArenaSearchDocument 88.8 (8 / 30)
LMArenaText 25.2 (27 / 32)
LongContextRecall 0.0 (31 / 32)
MCPAtlas 81.7 (7 / 28)
OutputSpeed 0.0 (32 / 32)
SWEBenchMultilingual 5.0 (23 / 27)
SWEBenchPro 92.5 (15 / 29)
SWEBenchVerified 77.2 (28 / 31)
SWEComposite 81.5 (18 / 32)
SWERebench 92.5 (6 / 31)
SciCode 0.0 (31 / 32)
SonarComposite 50.0 (20 / 32)
TTFT 90.0 (13 / 32)
Tau2Bench 45.3 (26 / 32)
TerminalBench 56.6 (22 / 32)
sources: artificial_analysis, lmarena, openrouter, overrides
missing: BUILD/GSO, GEN/ARC_AGI_2, SonarComposite/SonarBugDensity, SonarComposite/SonarFunctionalSkill, SonarComposite/SonarIssueDensity, SonarComposite/SonarVulnerabilityDensity
glm-4.7 · zai · idea 34.0 (up 3.1 since last refresh) · plan 54.1 (up 0.9 since last refresh) · build 57.7 (up 4.0 since last refresh) · review 60.9 (up 4.1 since last refresh)
glm-4.7

group breakdown

A_B 97.6 (2 / 32)
A_I 99.7 (1 / 32)
A_P 70.0 (5 / 32)
A_R 99.0 (3 / 32)
BUILD 45.7 (27 / 32)
CRE 6.4 (30 / 32)
GEN 34.7 (25 / 32)
LM_ARENA_REVIEW_PROXY 50.0 (19 / 32)
OPS_long 87.6 (6 / 32)
OPS_precision 90.6 (4 / 32)
OPS_review 88.1 (9 / 32)
PLAN 55.6 (22 / 32)

metrics

AI_code 90.2 (5 / 28)
AI_complexity 97.5 (2 / 28)
AI_context_awareness 0.0 (28 / 28)
AI_correctness 100.0 (7 / 28)
AI_edge_cases 100.0 (6 / 28)
AI_efficiency 100.0 (1 / 28)
AI_hallucination_resistance 100.0 (13 / 28)
AI_memory_retention 100.0 (4 / 28)
AI_parameter_accuracy 0.0 (28 / 28)
AI_plan_coherence 100.0 (5 / 28)
AI_recovery 100.0 (8 / 28)
AI_refusal 100.0 (11 / 28)
AI_spec 100.0 (11 / 28)
AI_stability 100.0 (5 / 28)
AI_task_completion 0.0 (28 / 28)
AI_tool_selection 0.0 (28 / 28)
ArtificialAnalysisCoding 38.4 (25 / 32)
ArtificialAnalysisIntelligence 42.8 (22 / 32)
ArtificialAnalysisReasoning 55.1 (20 / 32)
BlendedCost 94.9 (6 / 31)
ContextWindow 74.0 (24 / 30)
CopilotArenaOrLMArenaCode 66.2 (13 / 32)
GDPval 34.5 (28 / 32)
GPQA_HLE_Reasoning 55.1 (20 / 32)
IFBench 70.3 (15 / 32)
LMArenaCreativeOrOpenEnded 6.4 (30 / 32)
LMArenaText 6.4 (30 / 32)
LongContextRecall 54.4 (22 / 32)
MCPAtlas 0.0 (28 / 28)
OutputSpeed 85.5 (12 / 32)
SWEBenchMultilingual 5.0 (25 / 27)
SWEBenchVerified 89.6 (21 / 31)
SWEComposite 60.5 (25 / 32)
SWERebench 70.5 (19 / 31)
SciCode 43.5 (17 / 32)
SonarBugDensity 66.6 (11 / 20)
SonarComposite 32.0 (28 / 32)
SonarFunctionalSkill 0.0 (20 / 20)
SonarIssueDensity 58.2 (6 / 20)
SonarVulnerabilityDensity 27.4 (16 / 20)
TTFT 99.7 (4 / 32)
Tau2Bench 96.0 (7 / 32)
TerminalBench 35.2 (27 / 32)
sources: aistupidlevel, artificial_analysis, lmarena, mcp_atlas, openrouter, overrides, sonar, swerebench, terminal_bench
missing: BUILD/GSO, GEN/ARC_AGI_2, LM_ARENA_REVIEW_PROXY/LMArenaSearchDocument, SWEComposite/SWEBenchPro
claude-sonnet-4 · anthropic · idea 22.7 (up 8.1 since last refresh) · plan 34.7 (up 4.3 since last refresh) · build 55.7 (up 9.0 since last refresh) · review 57.7 (up 7.3 since last refresh)
claude-sonnet-4

group breakdown

A_B 90.8 (5 / 32)
A_I 86.9 (7 / 32)
A_P 68.3 (7 / 32)
A_R 92.5 (7 / 32)
BUILD 48.0 (26 / 32)
CRE 0.0 (31 / 32)
GEN 13.4 (30 / 32)
LM_ARENA_REVIEW_PROXY 87.8 (12 / 32)
OPS_long 79.5 (18 / 32)
OPS_precision 79.3 (19 / 32)
OPS_review 81.5 (18 / 32)
PLAN 30.4 (28 / 32)

metrics

AI_code 92.5 (3 / 28)
AI_complexity 92.5 (4 / 28)
AI_context_awareness 7.5 (8 / 28)
AI_correctness 92.5 (12 / 28)
AI_edge_cases 92.5 (7 / 28)
AI_efficiency 92.2 (4 / 28)
AI_hallucination_resistance 92.5 (14 / 28)
AI_memory_retention 49.3 (8 / 28)
AI_parameter_accuracy 65.8 (22 / 28)
AI_plan_coherence 37.1 (13 / 28)
AI_recovery 92.5 (12 / 28)
AI_refusal 92.5 (12 / 28)
AI_spec 92.5 (12 / 28)
AI_stability 92.5 (9 / 28)
AI_task_completion 81.2 (6 / 28)
AI_tool_selection 76.3 (9 / 28)
ARC_AGI_2 0.2 (24 / 25)
ArtificialAnalysisCoding 30.8 (27 / 32)
ArtificialAnalysisIntelligence 29.7 (25 / 32)
ArtificialAnalysisReasoning 5.0 (30 / 32)
BlendedCost 73.1 (23 / 31)
ContextWindow 99.2 (12 / 30)
CopilotArenaOrLMArenaCode 47.8 (24 / 32)
GDPval 88.8 (5 / 32)
GPQA_HLE_Reasoning 5.0 (30 / 32)
GSO 6.0 (15 / 16)
IFBench 33.8 (25 / 32)
LMArenaCreativeOrOpenEnded 0.0 (31 / 32)
LMArenaSearchDocument 87.8 (12 / 30)
LMArenaText 0.0 (31 / 32)
LiveCodeBench 0.0 (2 / 2)
LongContextRecall 57.7 (20 / 32)
MCPAtlas 10.9 (25 / 28)
OutputSpeed 76.2 (23 / 32)
SWEBenchMultilingual 10.4 (21 / 27)
SWEBenchPro 78.4 (23 / 29)
SWEBenchVerified 67.4 (29 / 31)
SWEComposite 60.3 (26 / 32)
SWERebench 54.4 (25 / 31)
SciCode 15.4 (28 / 32)
SonarBugDensity 28.4 (18 / 20)
SonarComposite 27.6 (30 / 32)
SonarFunctionalSkill 26.4 (17 / 20)
SonarIssueDensity 45.5 (8 / 20)
SonarVulnerabilityDensity 0.0 (20 / 20)
TTFT 76.8 (19 / 32)
Tau2Bench 25.5 (28 / 32)
TerminalBench 59.6 (20 / 32)
sources: arc_agi, artificial_analysis, gso, livecodebench, lmarena, openrouter, sonar, swebench, swebench_pro, swerebench
missing: none
gpt-5.2 · openai · idea 62.5 (down 5.8 since last refresh) · plan 56.1 (down 3.1 since last refresh) · build 54.0 (down 5.9 since last refresh) · review 53.3 (down 4.5 since last refresh)
gpt-5.2

group breakdown

A_B 41.7 (22 / 32)
A_I 44.0 (22 / 32)
A_P 41.6 (25 / 32)
A_R 59.9 (18 / 32)
BUILD 53.1 (25 / 32)
CRE 69.3 (16 / 32)
GEN 52.6 (20 / 32)
LM_ARENA_REVIEW_PROXY 19.7 (23 / 32)
OPS_long 83.9 (12 / 32)
OPS_precision 81.1 (17 / 32)
OPS_review 82.1 (17 / 32)
PLAN 58.9 (21 / 32)

metrics

AI_code 13.5 (23 / 28)
AI_complexity 21.0 (21 / 28)
AI_context_awareness 7.5 (12 / 28)
AI_correctness 53.8 (18 / 28)
AI_edge_cases 72.7 (19 / 28)
AI_efficiency 47.6 (10 / 28)
AI_hallucination_resistance 92.5 (16 / 28)
AI_memory_retention 15.2 (12 / 28)
AI_parameter_accuracy 85.1 (15 / 28)
AI_plan_coherence 19.2 (23 / 28)
AI_recovery 92.5 (13 / 28)
AI_refusal 38.4 (22 / 28)
AI_spec 38.4 (23 / 28)
AI_stability 16.9 (21 / 28)
AI_task_completion 68.9 (22 / 28)
AI_tool_selection 75.3 (10 / 28)
ARC_AGI_2 0.0 (25 / 25)
ArtificialAnalysisCoding 65.6 (12 / 32)
ArtificialAnalysisIntelligence 60.1 (18 / 32)
ArtificialAnalysisReasoning 55.8 (19 / 32)
BlendedCost 78.9 (18 / 31)
ContextWindow 84.7 (16 / 30)
CopilotArenaOrLMArenaCode 29.5 (30 / 32)
GDPval 66.9 (22 / 32)
GPQA_HLE_Reasoning 55.8 (19 / 32)
GSO 64.7 (4 / 16)
IFBench 63.0 (17 / 32)
LMArenaCreativeOrOpenEnded 69.3 (16 / 32)
LMArenaSearchDocument 19.7 (21 / 30)
LMArenaText 69.3 (16 / 32)
LongContextRecall 51.0 (23 / 32)
OutputSpeed 87.8 (8 / 32)
SWEBenchMultilingual 0.0 (27 / 27)
SWEBenchPro 38.2 (28 / 29)
SWEBenchVerified 79.6 (26 / 31)
SWEComposite 45.3 (28 / 32)
SciCode 49.5 (15 / 32)
SonarBugDensity 75.3 (8 / 20)
SonarComposite 63.8 (7 / 32)
SonarFunctionalSkill 67.2 (13 / 20)
SonarIssueDensity 45.4 (9 / 20)
SonarVulnerabilityDensity 70.2 (8 / 20)
TTFT 75.0 (20 / 32)
Tau2Bench 47.3 (25 / 32)
TerminalBench 76.1 (15 / 32)
sources: arc_agi, artificial_analysis, gso, lmarena, mcp_atlas, openrouter, overrides, sonar, swebench, swebench_pro, terminal_bench
missing: BUILD/MCPAtlas, PLAN/MCPAtlas, SWEComposite/SWERebench
gemini-2.5-pro · google · idea 17.7 · plan 30.0 · build 40.2 · review 33.0
gemini-2.5-pro

group breakdown

A_B 50.0 (18 / 32)
A_I 50.0 (19 / 32)
A_P 50.0 (18 / 32)
A_R 50.0 (21 / 32)
BUILD 35.5 (29 / 32)
CRE 0.0 (32 / 32)
GEN 14.2 (29 / 32)
LM_ARENA_REVIEW_PROXY 0.0 (32 / 32)
OPS_long 83.5 (14 / 32)
OPS_precision 75.9 (21 / 32)
OPS_review 80.9 (21 / 32)
PLAN 26.2 (29 / 32)

metrics

ARC_AGI_2 3.7 (21 / 25)
ArtificialAnalysisCoding 23.5 (28 / 32)
ArtificialAnalysisIntelligence 13.9 (28 / 32)
ArtificialAnalysisReasoning 43.5 (23 / 32)
BlendedCost 78.9 (17 / 31)
ContextWindow 100.0 (5 / 30)
CopilotArenaOrLMArenaCode 0.0 (31 / 32)
GDPval 35.7 (26 / 32)
GPQA_HLE_Reasoning 43.5 (23 / 32)
GSO 0.0 (16 / 16)
IFBench 17.3 (29 / 32)
LMArenaCreativeOrOpenEnded 0.0 (32 / 32)
LMArenaSearchDocument 0.0 (30 / 30)
LMArenaText 0.0 (32 / 32)
LongContextRecall 64.5 (16 / 32)
MCPAtlas 58.4 (13 / 28)
OutputSpeed 91.9 (6 / 32)
SWEBenchMultilingual 36.0 (17 / 27)
SWEBenchPro 75.7 (25 / 29)
SWEBenchVerified 33.5 (30 / 31)
SWEComposite 35.1 (30 / 32)
SWERebench 0.0 (31 / 31)
SciCode 30.8 (24 / 32)
SonarBugDensity 65.0 (14 / 20)
SonarComposite 59.3 (12 / 32)
SonarFunctionalSkill 78.9 (8 / 20)
SonarIssueDensity 25.2 (13 / 20)
SonarVulnerabilityDensity 56.0 (12 / 20)
TTFT 50.3 (27 / 32)
Tau2Bench 1.8 (30 / 32)
TerminalBench 1.9 (30 / 32)
sources: arc_agi, artificial_analysis, gso, lmarena, openrouter, swebench, swerebench, terminal_bench
missing: A_B/AI_code, A_B/AI_complexity, A_B/AI_correctness, A_B/AI_edge_cases, A_B/AI_efficiency, A_B/AI_hallucination_resistance, A_B/AI_memory_retention, A_B/AI_recovery, A_B/AI_spec, A_B/AI_stability, A_I/AI_complexity, A_I/AI_correctness, A_I/AI_edge_cases, A_I/AI_efficiency, A_I/AI_plan_coherence, A_I/AI_recovery, A_I/AI_spec, A_I/AI_stability, A_P/AI_context_awareness, A_P/AI_correctness, A_P/AI_efficiency, A_P/AI_memory_retention, A_P/AI_parameter_accuracy, A_P/AI_plan_coherence, A_P/AI_recovery, A_P/AI_spec, A_P/AI_stability, A_P/AI_task_completion, A_P/AI_tool_selection, A_R/AI_code, A_R/AI_correctness, A_R/AI_edge_cases, A_R/AI_hallucination_resistance, A_R/AI_recovery, A_R/AI_spec, A_R/AI_stability
grok-4-latest · xai · idea 48.2 (down 8.8 since last refresh) · plan 41.9 (down 4.6 since last refresh) · build 39.9 (down 9.9 since last refresh) · review 37.3 (down 9.5 since last refresh)
grok-4-latest

group breakdown

A_B 17.9 (31 / 32)
A_I 20.0 (32 / 32)
A_P 29.4 (32 / 32)
A_R 29.0 (26 / 32)
BUILD 43.1 (28 / 32)
CRE 55.4 (20 / 32)
GEN 44.1 (23 / 32)
LM_ARENA_REVIEW_PROXY 13.2 (28 / 32)
OPS_long 66.5 (28 / 32)
OPS_precision 50.3 (32 / 32)
OPS_review 53.5 (30 / 32)
PLAN 43.1 (24 / 32)

metrics

AI_code 0.0 (28 / 28)
AI_complexity 0.0 (28 / 28)
AI_context_awareness 0.0 (25 / 28)
AI_correctness 12.6 (22 / 28)
AI_edge_cases 7.2 (27 / 28)
AI_efficiency 0.0 (28 / 28)
AI_hallucination_resistance 100.0 (10 / 28)
AI_memory_retention 100.0 (1 / 28)
AI_parameter_accuracy 0.0 (25 / 28)
AI_plan_coherence 100.0 (2 / 28)
AI_recovery 40.7 (21 / 28)
AI_refusal 0.2 (27 / 28)
AI_spec 0.2 (27 / 28)
AI_stability 13.9 (22 / 28)
AI_task_completion 0.0 (25 / 28)
AI_tool_selection 0.0 (25 / 28)
ARC_AGI_2 20.9 (11 / 25)
ArtificialAnalysisCoding 52.9 (18 / 32)
ArtificialAnalysisIntelligence 40.4 (24 / 32)
ArtificialAnalysisReasoning 56.4 (18 / 32)
BlendedCost 40.3 (29 / 31)
CopilotArenaOrLMArenaCode 48.3 (22 / 32)
GDPval 11.2 (30 / 32)
GPQA_HLE_Reasoning 56.4 (18 / 32)
IFBench 31.0 (26 / 32)
LMArenaCreativeOrOpenEnded 55.4 (20 / 32)
LMArenaSearchDocument 13.2 (26 / 30)
LMArenaText 55.4 (20 / 32)
LongContextRecall 74.6 (13 / 32)
OutputSpeed 86.4 (10 / 32)
SWEComposite 45.2 (29 / 32)
SWERebench 38.1 (27 / 31)
SciCode 46.8 (16 / 32)
SonarComposite 50.0 (23 / 32)
TTFT 25.0 (31 / 32)
Tau2Bench 48.6 (24 / 32)
TerminalBench 15.1 (29 / 32)
sources: aistupidlevel, arc_agi, artificial_analysis, lmarena, overrides, swerebench, terminal_bench
missing: BUILD/GSO, BUILD/MCPAtlas, OPS_long/ContextWindow, OPS_precision/ContextWindow, OPS_review/ContextWindow, PLAN/MCPAtlas, SWEComposite/SWEBenchMultilingual, SWEComposite/SWEBenchPro, SWEComposite/SWEBenchVerified, SonarComposite/SonarBugDensity, SonarComposite/SonarFunctionalSkill, SonarComposite/SonarIssueDensity, SonarComposite/SonarVulnerabilityDensity
grok-code-fast-1 · xai · idea 48.3 (up 4.9 since last refresh) · plan 25.0 (up 2.3 since last refresh) · build 39.6 (up 6.9 since last refresh) · review 34.3 (up 4.7 since last refresh)
grok-code-fast-1

group breakdown

A_B 94.4 (3 / 32)
A_I 94.6 (3 / 32)
A_P 67.7 (8 / 32)
A_R 99.9 (2 / 32)
BUILD 29.2 (30 / 32)
CRE 47.8 (24 / 32)
GEN 15.7 (27 / 32)
LM_ARENA_REVIEW_PROXY 15.0 (27 / 32)
OPS_long 66.4 (29 / 32)
OPS_precision 51.5 (31 / 32)
OPS_review 53.7 (29 / 32)
PLAN 12.6 (32 / 32)

metrics

AI_code 99.3 (2 / 28)
AI_complexity 92.6 (3 / 28)
AI_context_awareness 0.0 (26 / 28)
AI_correctness 100.0 (6 / 28)
AI_edge_cases 100.0 (5 / 28)
AI_efficiency 41.4 (19 / 28)
AI_hallucination_resistance 100.0 (11 / 28)
AI_memory_retention 100.0 (2 / 28)
AI_parameter_accuracy 0.0 (26 / 28)
AI_plan_coherence 100.0 (3 / 28)
AI_recovery 100.0 (7 / 28)
AI_refusal 100.0 (10 / 28)
AI_spec 100.0 (10 / 28)
AI_stability 100.0 (3 / 28)
AI_task_completion 0.0 (26 / 28)
AI_tool_selection 0.0 (26 / 28)
ARC_AGI_2 25.3 (10 / 25)
ArtificialAnalysisCoding 0.0 (32 / 32)
ArtificialAnalysisIntelligence 0.0 (32 / 32)
ArtificialAnalysisReasoning 0.0 (32 / 32)
CopilotArenaOrLMArenaCode 0.0 (32 / 32)
GDPval 5.0 (32 / 32)
GPQA_HLE_Reasoning 0.0 (32 / 32)
IFBench 0.0 (32 / 32)
LMArenaCreativeOrOpenEnded 47.8 (24 / 32)
LMArenaSearchDocument 15.0 (25 / 30)
LMArenaText 47.8 (24 / 32)
LongContextRecall 0.0 (32 / 32)
OutputSpeed 81.8 (15 / 32)
SWEBenchVerified 81.5 (24 / 31)
SWEComposite 45.4 (27 / 32)
SWERebench 26.7 (29 / 31)
SciCode 0.0 (32 / 32)
SonarComposite 50.0 (24 / 32)
TTFT 26.6 (29 / 32)
Tau2Bench 50.6 (23 / 32)
TerminalBench 0.0 (32 / 32)
sources: aistupidlevel, artificial_analysis, lmarena, overrides, swerebench, terminal_bench
missing: BUILD/GSO, BUILD/MCPAtlas, OPS_long/BlendedCost, OPS_long/ContextWindow, OPS_precision/BlendedCost, OPS_precision/ContextWindow, OPS_review/BlendedCost, OPS_review/ContextWindow, PLAN/MCPAtlas, SWEComposite/SWEBenchMultilingual, SWEComposite/SWEBenchPro, SonarComposite/SonarBugDensity, SonarComposite/SonarFunctionalSkill, SonarComposite/SonarIssueDensity, SonarComposite/SonarVulnerabilityDensity
gemini-2.5-flash · google · idea 42.1 · plan 26.1 · build 36.0 · review 40.4
gemini-2.5-flash

group breakdown

A_B 50.0 (17 / 32)
A_I 50.0 (18 / 32)
A_P 50.0 (17 / 32)
A_R 50.0 (20 / 32)
BUILD 29.1 (31 / 32)
CRE 45.3 (25 / 32)
GEN 14.3 (28 / 32)
LM_ARENA_REVIEW_PROXY 79.3 (17 / 32)
OPS_long 93.7 (2 / 32)
OPS_precision 88.9 (8 / 32)
OPS_review 91.7 (3 / 32)
PLAN 15.3 (31 / 32)

metrics

ARC_AGI_2 0.7 (23 / 25)
ArtificialAnalysisCoding 0.0 (31 / 32)
ArtificialAnalysisIntelligence 0.4 (30 / 32)
ArtificialAnalysisReasoning 14.9 (27 / 32)
BlendedCost 93.5 (10 / 31)
ContextWindow 100.0 (4 / 30)
CopilotArenaOrLMArenaCode 62.8 (19 / 32)
GDPval 37.8 (24 / 32)
GPQA_HLE_Reasoning 14.9 (27 / 32)
GSO 19.4 (13 / 16)
IFBench 27.2 (28 / 32)
LMArenaCreativeOrOpenEnded 45.3 (25 / 32)
LMArenaSearchDocument 79.3 (17 / 30)
LMArenaText 45.3 (25 / 32)
LiveCodeBench 100.0 (1 / 2)
LongContextRecall 56.1 (21 / 32)
MCPAtlas 21.9 (23 / 28)
OutputSpeed 100.0 (1 / 32)
SWEBenchMultilingual 92.5 (9 / 27)
SWEBenchPro 52.5 (27 / 29)
SWEBenchVerified 0.0 (31 / 31)
SWEComposite 27.6 (31 / 32)
SWERebench 0.0 (30 / 31)
SciCode 18.2 (27 / 32)
SonarBugDensity 65.0 (13 / 20)
SonarComposite 59.3 (11 / 32)
SonarFunctionalSkill 78.9 (7 / 20)
SonarIssueDensity 25.2 (12 / 20)
SonarVulnerabilityDensity 56.0 (11 / 20)
TTFT 72.0 (23 / 32)
Tau2Bench 0.0 (31 / 32)
TerminalBench 0.0 (31 / 32)
sources: arc_agi, artificial_analysis, livecodebench, lmarena, openrouter, swebench, swerebench, terminal_bench
missing: A_B/AI_code, A_B/AI_complexity, A_B/AI_correctness, A_B/AI_edge_cases, A_B/AI_efficiency, A_B/AI_hallucination_resistance, A_B/AI_memory_retention, A_B/AI_recovery, A_B/AI_spec, A_B/AI_stability, A_I/AI_complexity, A_I/AI_correctness, A_I/AI_edge_cases, A_I/AI_efficiency, A_I/AI_plan_coherence, A_I/AI_recovery, A_I/AI_spec, A_I/AI_stability, A_P/AI_context_awareness, A_P/AI_correctness, A_P/AI_efficiency, A_P/AI_memory_retention, A_P/AI_parameter_accuracy, A_P/AI_plan_coherence, A_P/AI_recovery, A_P/AI_spec, A_P/AI_stability, A_P/AI_task_completion, A_P/AI_tool_selection, A_R/AI_code, A_R/AI_correctness, A_R/AI_edge_cases, A_R/AI_hallucination_resistance, A_R/AI_recovery, A_R/AI_spec, A_R/AI_stability
glm-4.6 · zai · idea 28.6 (down 5.2 since last refresh) · plan 25.1 (down 3.3 since last refresh) · build 28.3 (down 5.8 since last refresh) · review 33.5 (down 4.5 since last refresh)
glm-4.6

group breakdown

A_B 43.2 (21 / 32)
A_I 52.2 (17 / 32)
A_P 45.8 (24 / 32)
A_R 53.5 (19 / 32)
BUILD 19.9 (32 / 32)
CRE 22.0 (28 / 32)
GEN 12.2 (31 / 32)
LM_ARENA_REVIEW_PROXY 50.0 (18 / 32)
OPS_long 78.3 (22 / 32)
OPS_precision 84.1 (15 / 32)
OPS_review 82.2 (16 / 32)
PLAN 16.9 (30 / 32)

metrics

AI_code 9.2 (25 / 28)
AI_complexity 8.5 (25 / 28)
AI_context_awareness 0.0 (27 / 28)
AI_correctness 42.6 (19 / 28)
AI_edge_cases 72.9 (18 / 28)
AI_efficiency 36.3 (21 / 28)
AI_hallucination_resistance 100.0 (12 / 28)
AI_memory_retention 100.0 (3 / 28)
AI_parameter_accuracy 0.0 (27 / 28)
AI_plan_coherence 100.0 (4 / 28)
AI_recovery 16.6 (23 / 28)
AI_refusal 28.6 (24 / 28)
AI_spec 38.8 (22 / 28)
AI_stability 100.0 (4 / 28)
AI_task_completion 0.0 (27 / 28)
AI_tool_selection 0.0 (27 / 28)
ArtificialAnalysisCoding 14.9 (29 / 32)
ArtificialAnalysisIntelligence 5.8 (29 / 32)
ArtificialAnalysisReasoning 13.5 (28 / 32)
BlendedCost 94.7 (7 / 31)
ContextWindow 74.0 (23 / 30)
CopilotArenaOrLMArenaCode 36.7 (29 / 32)
GDPval 16.6 (29 / 32)
GPQA_HLE_Reasoning 13.5 (28 / 32)
IFBench 2.6 (30 / 32)
LMArenaCreativeOrOpenEnded 22.0 (28 / 32)
LMArenaText 22.0 (28 / 32)
LongContextRecall 5.6 (30 / 32)
MCPAtlas 7.5 (26 / 28)
OutputSpeed 71.0 (29 / 32)
SWEBenchMultilingual 5.0 (24 / 27)
SWEBenchPro 0.0 (29 / 29)
SWEBenchVerified 77.2 (27 / 31)
SWEComposite 27.0 (32 / 32)
SWERebench 37.3 (28 / 31)
SciCode 6.7 (30 / 32)
SonarBugDensity 19.6 (19 / 20)
SonarComposite 13.0 (32 / 32)
SonarFunctionalSkill 7.5 (19 / 20)
SonarIssueDensity 7.5 (19 / 20)
SonarVulnerabilityDensity 28.0 (15 / 20)
TTFT 93.7 (10 / 32)
Tau2Bench 38.7 (27 / 32)
TerminalBench 17.9 (28 / 32)
sources: aistupidlevel, artificial_analysis, lmarena, openrouter, overrides, swebench, swebench_pro, swerebench, terminal_bench
missing: BUILD/GSO, GEN/ARC_AGI_2, LM_ARENA_REVIEW_PROXY/LMArenaSearchDocument