$ipbr

Live LLM coding scoreboard.

Models drift. Agents battle. Math decides.

live · refreshed · 14 sources · 32 models

[ chart: gpt-5.5, claude-opus-4.7, gemini-3.1-pro-preview across I, P, B, R ]
  • gemini-3.1-pro-preview 86.9
  • claude-opus-4.7 84.5
  • gpt-5.5 83.5

leaders now

[ idea ]
  1. gemini-3.1-pro-preview 94.7 (down 0.5 since last refresh)
  2. claude-opus-4.7 91.6 (up 1.9 since last refresh)
  3. gemini-3-pro 86.3 (down 0.6 since last refresh)
[ plan ]
  1. gpt-5.5 88.9
  2. gemini-3.1-pro-preview 88.6 (down 0.3 since last refresh)
  3. claude-opus-4.7 81.6 (up 1.1 since last refresh)
[ build ]
  1. gpt-5.5 82.2 (down 0.7 since last refresh)
  2. claude-opus-4.7 81.4 (up 1.6 since last refresh)
  3. gemini-3.1-pro-preview 78.8 (down 0.6 since last refresh)
[ review ]
  1. gemini-3.1-pro-preview 85.3 (up 0.5 since last refresh)
  2. claude-opus-4.7 83.5 (up 1.4 since last refresh)
  3. deepseek-v4-pro 81.0 (down 0.1 since last refresh)

how scoring works

Each model gets four role scores from public benchmarks. Idea measures open-ended creativity. Plan measures structured reasoning, function-calling, and multi-step decomposition. Build measures implementation skill — SWE-bench, LiveCodeBench, terminal tasks. Review measures preference judgment.

scoring

Each role score is the benchmark composite for that role, normalized to 0-100 and combined via weighted average of group scores. See the about page for the full math.
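
As a rough illustration of the weighted average described above, here is a minimal Python sketch. The group names, the weights, and the min-max normalization are assumptions made for the example; the actual groups per role, their weights, and the normalization method are defined on the about page, not here.

```python
def min_max_normalize(raw, lo, hi):
    """Map a raw benchmark value onto 0-100, clamping outside [lo, hi].

    Min-max scaling is an assumption; the page only says "normalized to 0-100".
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    scaled = 100.0 * (raw - lo) / (hi - lo)
    return max(0.0, min(100.0, scaled))


def role_score(group_scores, weights):
    """Weighted average of 0-100 group scores; weights need not sum to 1."""
    total_weight = sum(weights[g] for g in group_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(score * weights[g] for g, score in group_scores.items())
    return weighted_sum / total_weight


# Hypothetical groups and weights, for illustration only.
example = role_score(
    {"group_a": 80.0, "group_b": 60.0},
    {"group_a": 3.0, "group_b": 1.0},
)
print(example)  # 75.0
```

Dividing by the total weight keeps the result on the same 0-100 scale as the inputs regardless of how the weights are scaled.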

missing data

If a model is missing some metrics within a group, the group score depends on group coverage. As coverage rises from 60% to 80%, the score blends from a shrink-to-50 estimate to the present-weight mean. At 80% coverage and above, the present-weight mean is trusted directly.
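
One way to read the missing-data rule is sketched below in Python. Several details are assumptions, since the page does not spell them out: coverage is taken as the share of group weight that is present, "shrink-to-50" is taken as pulling the present-weight mean toward 50 in proportion to the missing coverage, the blend between 60% and 80% is taken as linear, and below 60% the shrunk value is used as-is. The function name and metric names are hypothetical.

```python
def group_score(metrics, weights, lo=0.60, hi=0.80, prior=50.0):
    """Group score with a coverage-based blend (assumed interpretation).

    metrics: metric name -> 0-100 score, or None / absent when missing.
    weights: metric name -> weight for every metric in the group.
    """
    present = {m: metrics[m] for m in weights if metrics.get(m) is not None}
    total_weight = sum(weights.values())
    present_weight = sum(weights[m] for m in present)
    if present_weight == 0:
        return prior  # nothing observed: fall back to the neutral value

    coverage = present_weight / total_weight
    mean = sum(present[m] * weights[m] for m in present) / present_weight

    # Assumed form of "shrink-to-50": move toward 50 as coverage drops.
    shrunk = prior + coverage * (mean - prior)

    if coverage >= hi:
        return mean
    if coverage <= lo:
        return shrunk
    t = (coverage - lo) / (hi - lo)  # 0 at 60% coverage, 1 at 80%
    return (1 - t) * shrunk + t * mean


weights = {"m1": 1.0, "m2": 1.0, "m3": 1.0, "m4": 1.0}
print(group_score({"m1": 90, "m2": 90, "m3": 90, "m4": 90}, weights))  # full coverage
print(group_score({"m1": 90, "m2": 90, "m3": 90}, weights))            # 75% coverage
print(group_score({"m1": 90, "m2": 90}, weights))                      # 50% coverage
```

With these assumptions a model cannot gain from a sparse group: the fewer metrics are present, the closer its group score sits to 50.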

Full math, role definitions, and source list →

gpt-5.5 · openai · idea 85.8 (down 0.1 since last refresh) · plan 88.9 · build 82.2 (down 0.7 since last refresh) · review 77.2 (down 0.1 since last refresh)
gpt-5.5

group breakdown

A_B 67.9 (15 / 32) · A_I 81.6 (13 / 32) · A_P 85.1 (1 / 32) · A_R 86.6 (10 / 32) · BUILD 85.5 (3 / 32) · CRE 83.8 (7 / 32) · GEN 95.2 (3 / 32) · LM_ARENA_REVIEW_PROXY 27.8 (21 / 32) · OPS_long 77.8 (20 / 32) · OPS_precision 73.8 (23 / 32) · OPS_review 75.8 (24 / 32) · PLAN 88.6 (2 / 32)

metrics

AI_code 29.6 (28 / 32) · AI_complexity 27.3 (29 / 32) · AI_context_awareness 12.9 (7 / 32) · AI_correctness 92.9 (14 / 32) · AI_edge_cases 76.9 (18 / 32) · AI_efficiency 36.1 (17 / 32) · AI_hallucination_resistance 100.0 (3 / 32) · AI_memory_retention 100.0 (1 / 32) · AI_parameter_accuracy 99.4 (3 / 32) · AI_plan_coherence 100.0 (1 / 32) · AI_recovery 97.0 (13 / 32) · AI_refusal 100.0 (13 / 32) · AI_spec 100.0 (13 / 32) · AI_stability 84.3 (12 / 32) · AI_task_completion 100.0 (1 / 32) · AI_tool_selection 80.1 (7 / 32) · ARC_AGI_2 97.7 (2 / 25) · ArtificialAnalysisCoding 100.0 (2 / 32) · ArtificialAnalysisIntelligence 98.9 (3 / 32) · ArtificialAnalysisReasoning 100.0 (2 / 32) · BlendedCost 36.4 (31 / 32) · ContextWindow 100.0 (2 / 32) · CopilotArenaOrLMArenaCode 66.0 (14 / 32) · GDPval 95.0 (2 / 32) · GPQA_HLE_Reasoning 100.0 (2 / 32) · GSO 94.0 (2 / 16) · IFBench 76.9 (13 / 32) · LMArenaCreativeOrOpenEnded 83.8 (7 / 32) · LMArenaSearchDocument 27.8 (19 / 30) · LMArenaText 83.8 (7 / 32) · LongContextRecall 96.3 (4 / 32) · MCPAtlas 59.7 (12 / 28) · OutputSpeed 78.4 (17 / 32) · SWEBenchPro 95.0 (10 / 29) · SWEBenchVerified 95.0 (10 / 31) · SWEComposite 89.9 (8 / 32) · SWERebench 83.5 (12 / 31) · SciCode 89.7 (5 / 32) · SonarBugDensity 94.5 (2 / 20) · SonarComposite 65.5 (6 / 32) · SonarFunctionalSkill 46.5 (16 / 20) · SonarIssueDensity 52.7 (5 / 20) · SonarVulnerabilityDensity 99.2 (2 / 20) · TTFT 79.9 (16 / 32) · Tau2Bench 86.7 (14 / 32) · TerminalBench 100.0 (1 / 32)
sources: aistupidlevel · arc_agi · artificial_analysis · gso · lmarena · mcp_atlas · openrouter · overrides · sonar · terminal_bench
missing: SWEComposite/SWEBenchMultilingual
claude-opus-4.7 · anthropic · idea 91.6 (up 1.9 since last refresh) · plan 81.6 (up 1.1 since last refresh) · build 81.4 (up 1.6 since last refresh) · review 83.5 (up 1.4 since last refresh)
claude-opus-4.7

group breakdown

A_B 66.8 (16 / 32) · A_I 82.2 (12 / 32) · A_P 64.7 (12 / 32) · A_R 74.8 (17 / 32) · BUILD 87.7 (1 / 32) · CRE 95.5 (4 / 32) · GEN 97.5 (2 / 32) · LM_ARENA_REVIEW_PROXY 100.0 (1 / 32) · OPS_long 65.4 (29 / 32) · OPS_precision 53.2 (31 / 32) · OPS_review 61.6 (29 / 32) · PLAN 81.3 (5 / 32)

metrics

AI_code 43.7 (14 / 32) · AI_complexity 49.5 (14 / 32) · AI_context_awareness 12.7 (9 / 32) · AI_correctness 100.0 (3 / 32) · AI_edge_cases 94.2 (12 / 32) · AI_efficiency 66.9 (14 / 32) · AI_hallucination_resistance 0.0 (31 / 32) · AI_memory_retention 0.0 (26 / 32) · AI_parameter_accuracy 100.0 (1 / 32) · AI_plan_coherence 15.4 (13 / 32) · AI_recovery 89.8 (27 / 32) · AI_refusal 100.0 (3 / 32) · AI_spec 100.0 (3 / 32) · AI_stability 100.0 (3 / 32) · AI_task_completion 100.0 (2 / 32) · AI_tool_selection 43.4 (27 / 32) · ARC_AGI_2 93.5 (3 / 25) · ArtificialAnalysisCoding 94.3 (3 / 32) · ArtificialAnalysisIntelligence 100.0 (1 / 32) · ArtificialAnalysisReasoning 97.4 (3 / 32) · BlendedCost 47.9 (30 / 32) · ContextWindow 99.2 (11 / 32) · CopilotArenaOrLMArenaCode 100.0 (2 / 32) · GDPval 95.0 (1 / 32) · GPQA_HLE_Reasoning 97.4 (3 / 32) · GSO 100.0 (1 / 16) · IFBench 43.5 (21 / 32) · LMArenaCreativeOrOpenEnded 95.5 (4 / 32) · LMArenaSearchDocument 100.0 (1 / 30) · LMArenaText 95.5 (4 / 32) · LongContextRecall 86.2 (9 / 32) · MCPAtlas 100.0 (1 / 28) · OutputSpeed 77.3 (20 / 32) · SWEBenchMultilingual 95.0 (3 / 27) · SWEBenchPro 95.0 (2 / 29) · SWEBenchVerified 95.0 (4 / 31) · SWEComposite 91.1 (6 / 32) · SWERebench 85.3 (10 / 31) · SciCode 95.2 (3 / 32) · SonarBugDensity 50.1 (17 / 20) · SonarComposite 51.4 (16 / 32) · SonarFunctionalSkill 93.9 (2 / 20) · SonarIssueDensity 0.0 (20 / 20) · SonarVulnerabilityDensity 25.3 (17 / 20) · TTFT 15.9 (30 / 32) · Tau2Bench 79.5 (17 / 32) · TerminalBench 78.2 (4 / 32)
sources: aistupidlevel · arc_agi · artificial_analysis · gso · lmarena · mcp_atlas · openrouter · overrides · sonar
missing: none
gemini-3.1-pro-preview · google · idea 94.7 (down 0.5 since last refresh) · plan 88.6 (down 0.3 since last refresh) · build 78.8 (down 0.6 since last refresh) · review 85.3 (up 0.5 since last refresh)
gemini-3.1-pro-preview

group breakdown

A_B 73.2 (13 / 32) · A_I 77.8 (16 / 32) · A_P 71.9 (6 / 32) · A_R 82.5 (14 / 32) · BUILD 80.6 (5 / 32) · CRE 100.0 (2 / 32) · GEN 100.0 (1 / 32) · LM_ARENA_REVIEW_PROXY 93.1 (3 / 32) · OPS_long 75.9 (23 / 32) · OPS_precision 63.6 (27 / 32) · OPS_review 71.8 (26 / 32) · PLAN 90.8 (1 / 32)

metrics

AI_code 60.3 (13 / 32) · AI_complexity 73.9 (12 / 32) · AI_context_awareness 18.3 (6 / 32) · AI_correctness 92.5 (17 / 32) · AI_edge_cases 92.5 (15 / 32) · AI_efficiency 67.6 (13 / 32) · AI_hallucination_resistance 92.5 (11 / 32) · AI_memory_retention 18.0 (10 / 32) · AI_parameter_accuracy 83.7 (12 / 32) · AI_plan_coherence 91.9 (9 / 32) · AI_recovery 92.5 (16 / 32) · AI_refusal 92.5 (20 / 32) · AI_spec 92.5 (20 / 32) · AI_stability 24.9 (29 / 32) · AI_task_completion 89.6 (7 / 32) · AI_tool_selection 66.4 (14 / 32) · ARC_AGI_2 100.0 (1 / 25) · ArtificialAnalysisCoding 100.0 (1 / 32) · ArtificialAnalysisIntelligence 100.0 (2 / 32) · ArtificialAnalysisReasoning 100.0 (1 / 32) · BlendedCost 76.0 (21 / 32) · ContextWindow 100.0 (7 / 32) · CopilotArenaOrLMArenaCode 70.7 (11 / 32) · GDPval 49.3 (23 / 32) · GPQA_HLE_Reasoning 100.0 (1 / 32) · GSO 51.3 (9 / 16) · IFBench 93.3 (5 / 32) · LMArenaCreativeOrOpenEnded 100.0 (2 / 32) · LMArenaSearchDocument 93.1 (3 / 30) · LMArenaText 100.0 (2 / 32) · LongContextRecall 98.3 (3 / 32) · MCPAtlas 58.4 (14 / 28) · OutputSpeed 90.4 (6 / 32) · SWEBenchMultilingual 36.0 (18 / 27) · SWEBenchPro 89.1 (18 / 29) · SWEBenchVerified 95.0 (7 / 31) · SWEComposite 89.0 (9 / 32) · SWERebench 100.0 (1 / 31) · SciCode 100.0 (2 / 32) · SonarBugDensity 52.7 (15 / 20) · SonarComposite 54.2 (15 / 32) · SonarFunctionalSkill 78.9 (10 / 20) · SonarIssueDensity 13.2 (15 / 20) · SonarVulnerabilityDensity 58.2 (14 / 20) · TTFT 17.9 (29 / 32) · Tau2Bench 95.3 (9 / 32) · TerminalBench 89.5 (3 / 32)
sources: arc_agi · artificial_analysis · gso · lmarena · mcp_atlas · openrouter · overrides · sonar · swebench_pro · swerebench · terminal_bench
missing: none
claude-opus-4.5 · anthropic · idea 75.7 (up 10.9 since last refresh) · plan 67.7 (up 6.0 since last refresh) · build 77.8 (up 11.4 since last refresh) · review 64.4 (up 10.2 since last refresh)
claude-opus-4.5

group breakdown

A_B 87.8 (5 / 32) · A_I 88.2 (5 / 32) · A_P 63.7 (14 / 32) · A_R 86.4 (11 / 32) · BUILD 78.1 (8 / 32) · CRE 74.0 (12 / 32) · GEN 73.0 (10 / 32) · LM_ARENA_REVIEW_PROXY 10.8 (29 / 32) · OPS_long 71.8 (25 / 32) · OPS_precision 66.9 (26 / 32) · OPS_review 67.7 (27 / 32) · PLAN 65.6 (15 / 32)

metrics

AI_code 100.0 (1 / 32) · AI_complexity 81.6 (4 / 32) · AI_context_awareness 21.5 (3 / 32) · AI_correctness 100.0 (2 / 32) · AI_edge_cases 100.0 (2 / 32) · AI_efficiency 100.0 (1 / 32) · AI_hallucination_resistance 20.0 (27 / 32) · AI_memory_retention 0.0 (25 / 32) · AI_parameter_accuracy 77.3 (16 / 32) · AI_plan_coherence 0.3 (31 / 32) · AI_recovery 100.0 (2 / 32) · AI_refusal 100.0 (2 / 32) · AI_spec 100.0 (2 / 32) · AI_stability 100.0 (2 / 32) · AI_task_completion 82.5 (11 / 32) · AI_tool_selection 69.0 (10 / 32) · ARC_AGI_2 85.5 (5 / 25) · ArtificialAnalysisCoding 78.1 (6 / 32) · ArtificialAnalysisIntelligence 72.0 (12 / 32) · ArtificialAnalysisReasoning 63.6 (15 / 32) · BlendedCost 47.9 (28 / 32) · ContextWindow 73.5 (27 / 32) · CopilotArenaOrLMArenaCode 75.0 (9 / 32) · GDPval 82.5 (9 / 32) · GPQA_HLE_Reasoning 63.6 (15 / 32) · GSO 59.3 (5 / 16) · IFBench 41.9 (23 / 32) · LMArenaCreativeOrOpenEnded 74.0 (12 / 32) · LMArenaSearchDocument 10.8 (27 / 30) · LMArenaText 74.0 (12 / 32) · LongContextRecall 100.0 (1 / 32) · MCPAtlas 57.3 (15 / 28) · OutputSpeed 78.0 (19 / 32) · SWEBenchMultilingual 95.0 (2 / 27) · SWEBenchPro 88.4 (19 / 29) · SWEBenchVerified 92.0 (17 / 31) · SWEComposite 84.7 (14 / 32) · SWERebench 76.3 (14 / 31) · SciCode 67.7 (10 / 32) · SonarBugDensity 73.7 (6 / 20) · SonarComposite 87.1 (1 / 32) · SonarFunctionalSkill 100.0 (1 / 20) · SonarIssueDensity 77.2 (3 / 20) · SonarVulnerabilityDensity 87.2 (4 / 20) · TTFT 65.4 (25 / 32) · Tau2Bench 81.5 (16 / 32) · TerminalBench 54.7 (18 / 32)
sources: aistupidlevel · artificial_analysis · gso · lmarena · mcp_atlas · openrouter · overrides · sonar · swebench_pro · swerebench · terminal_bench
missing: none
deepseek-v4-pro · deepseek · idea 73.8 · plan 78.1 · build 77.1 (down 0.5 since last refresh) · review 81.0 (down 0.1 since last refresh)
deepseek-v4-pro

group breakdown

A_B 57.0 (27 / 32) · A_I 70.5 (19 / 32) · A_P 60.4 (18 / 32) · A_R 69.6 (27 / 32) · BUILD 80.3 (6 / 32) · CRE 73.1 (14 / 32) · GEN 78.3 (6 / 32) · LM_ARENA_REVIEW_PROXY 88.7 (7 / 32) · OPS_long 69.7 (27 / 32) · OPS_precision 81.9 (16 / 32) · OPS_review 82.5 (16 / 32) · PLAN 83.0 (4 / 32)

metrics

AI_code 38.2 (16 / 32) · AI_complexity 30.7 (16 / 32) · AI_context_awareness 7.5 (12 / 32) · AI_correctness 86.4 (19 / 32) · AI_edge_cases 72.9 (20 / 32) · AI_efficiency 36.1 (16 / 32) · AI_hallucination_resistance 24.5 (26 / 32) · AI_memory_retention 7.5 (15 / 32) · AI_parameter_accuracy 81.3 (13 / 32) · AI_plan_coherence 38.6 (10 / 32) · AI_recovery 89.9 (18 / 32) · AI_refusal 92.5 (17 / 32) · AI_spec 92.5 (17 / 32) · AI_stability 74.3 (22 / 32) · AI_task_completion 71.6 (16 / 32) · AI_tool_selection 67.2 (11 / 32) · ArtificialAnalysisCoding 77.0 (7 / 32) · ArtificialAnalysisIntelligence 78.9 (9 / 32) · ArtificialAnalysisReasoning 84.1 (8 / 32) · BlendedCost 98.1 (5 / 32) · ContextWindow 100.0 (3 / 32) · CopilotArenaOrLMArenaCode 70.7 (12 / 32) · GDPval 68.2 (16 / 32) · GPQA_HLE_Reasoning 84.1 (8 / 32) · IFBench 91.7 (6 / 32) · LMArenaCreativeOrOpenEnded 73.1 (14 / 32) · LMArenaSearchDocument 88.7 (7 / 30) · LMArenaText 73.1 (14 / 32) · LongContextRecall 66.1 (13 / 32) · MCPAtlas 81.7 (6 / 28) · OutputSpeed 47.0 (31 / 32) · SWEBenchMultilingual 95.0 (5 / 27) · SWEBenchPro 95.0 (4 / 29) · SWEBenchVerified 95.0 (6 / 31) · SWEComposite 94.0 (3 / 32) · SWERebench 92.5 (5 / 31) · SciCode 70.4 (9 / 32) · SonarComposite 50.0 (18 / 32) · TTFT 94.9 (7 / 32) · Tau2Bench 96.6 (5 / 32) · TerminalBench 69.9 (10 / 32)
sources: artificial_analysis · lmarena · openrouter · overrides
missing: BUILD/GSO · GEN/ARC_AGI_2 · SonarComposite/SonarBugDensity · SonarComposite/SonarFunctionalSkill · SonarComposite/SonarIssueDensity · SonarComposite/SonarVulnerabilityDensity
glm-5.1 · zai · idea 82.3 (down 0.1 since last refresh) · plan 75.1 · build 77.0 (down 0.6 since last refresh) · review 79.9 (down 0.1 since last refresh)
glm-5.1

group breakdown

A_B 57.5 (26 / 32) · A_I 68.7 (27 / 32) · A_P 53.2 (27 / 32) · A_R 72.4 (26 / 32) · BUILD 80.2 (7 / 32) · CRE 88.5 (5 / 32) · GEN 78.2 (7 / 32) · LM_ARENA_REVIEW_PROXY 88.7 (11 / 32) · OPS_long 80.7 (14 / 32) · OPS_precision 84.9 (14 / 32) · OPS_review 82.6 (15 / 32) · PLAN 78.4 (9 / 32)

metrics

AI_code 32.7 (25 / 32) · AI_complexity 30.7 (24 / 32) · AI_context_awareness 7.5 (20 / 32) · AI_correctness 86.4 (27 / 32) · AI_edge_cases 72.9 (28 / 32) · AI_efficiency 35.4 (25 / 32) · AI_hallucination_resistance 41.5 (23 / 32) · AI_memory_retention 7.5 (23 / 32) · AI_parameter_accuracy 64.2 (28 / 32) · AI_plan_coherence 12.9 (23 / 32) · AI_recovery 89.9 (26 / 32) · AI_refusal 92.5 (28 / 32) · AI_spec 92.5 (28 / 32) · AI_stability 79.1 (20 / 32) · AI_task_completion 60.9 (27 / 32) · AI_tool_selection 46.4 (25 / 32) · ArtificialAnalysisCoding 62.9 (13 / 32) · ArtificialAnalysisIntelligence 78.5 (10 / 32) · ArtificialAnalysisReasoning 63.2 (16 / 32) · BlendedCost 86.4 (17 / 32) · ContextWindow 73.7 (25 / 32) · CopilotArenaOrLMArenaCode 97.2 (3 / 32) · GDPval 74.7 (10 / 32) · GPQA_HLE_Reasoning 63.2 (16 / 32) · IFBench 91.2 (7 / 32) · LMArenaCreativeOrOpenEnded 88.5 (5 / 32) · LMArenaSearchDocument 88.7 (11 / 30) · LMArenaText 88.5 (5 / 32) · LongContextRecall 45.9 (26 / 32) · MCPAtlas 87.3 (3 / 28) · OutputSpeed 75.8 (21 / 32) · SWEBenchMultilingual 92.5 (13 / 27) · SWEBenchPro 95.0 (14 / 29) · SWEBenchVerified 92.5 (16 / 31) · SWEComposite 96.4 (1 / 32) · SWERebench 100.0 (2 / 31) · SciCode 36.3 (21 / 32) · SonarComposite 50.0 (28 / 32) · TTFT 96.7 (5 / 32) · Tau2Bench 100.0 (4 / 32) · TerminalBench 73.2 (9 / 32)
sources: artificial_analysis · lmarena · mcp_atlas · openrouter · overrides · swerebench
missing: BUILD/GSO · GEN/ARC_AGI_2 · SonarComposite/SonarBugDensity · SonarComposite/SonarFunctionalSkill · SonarComposite/SonarIssueDensity · SonarComposite/SonarVulnerabilityDensity
gpt-5.3-codex · openai · idea 70.3 (up 8.1 since last refresh) · plan 49.4 (up 4.2 since last refresh) · build 76.1 (up 8.5 since last refresh) · review 71.0 (up 6.8 since last refresh)
gpt-5.3-codex

group breakdown

A_B 96.1 (2 / 32) · A_I 90.2 (3 / 32) · A_P 61.2 (17 / 32) · A_R 99.7 (1 / 32) · BUILD 74.7 (12 / 32) · CRE 73.2 (13 / 32) · GEN 47.6 (22 / 32) · LM_ARENA_REVIEW_PROXY 92.5 (4 / 32) · OPS_long 84.6 (11 / 32) · OPS_precision 81.0 (19 / 32) · OPS_review 81.9 (17 / 32) · PLAN 41.3 (24 / 32)

metrics

AI_code 97.2 (4 / 32) · AI_complexity 96.5 (3 / 32) · AI_context_awareness 0.0 (28 / 32) · AI_correctness 100.0 (8 / 32) · AI_edge_cases 100.0 (7 / 32) · AI_efficiency 99.8 (2 / 32) · AI_hallucination_resistance 99.9 (8 / 32) · AI_memory_retention 28.1 (7 / 32) · AI_parameter_accuracy 87.8 (8 / 32) · AI_plan_coherence 6.2 (29 / 32) · AI_recovery 100.0 (7 / 32) · AI_refusal 100.0 (11 / 32) · AI_spec 100.0 (11 / 32) · AI_stability 100.0 (7 / 32) · AI_task_completion 53.2 (28 / 32) · AI_tool_selection 58.8 (17 / 32) · ARC_AGI_2 72.5 (8 / 25) · ArtificialAnalysisCoding 43.6 (22 / 32) · ArtificialAnalysisIntelligence 29.4 (26 / 32) · ArtificialAnalysisReasoning 33.5 (25 / 32) · BlendedCost 75.3 (22 / 32) · ContextWindow 84.6 (17 / 32) · CopilotArenaOrLMArenaCode 54.1 (23 / 32) · GDPval 68.8 (14 / 32) · GPQA_HLE_Reasoning 33.5 (25 / 32) · GSO 53.4 (8 / 16) · IFBench 58.6 (20 / 32) · LMArenaCreativeOrOpenEnded 73.2 (13 / 32) · LMArenaSearchDocument 92.5 (4 / 30) · LMArenaText 73.2 (13 / 32) · LongContextRecall 42.2 (27 / 32) · OutputSpeed 89.7 (8 / 32) · SWEBenchPro 95.0 (9 / 29) · SWEBenchVerified 92.5 (14 / 31) · SWEComposite 92.1 (5 / 32) · SWERebench 89.4 (9 / 31) · SciCode 40.3 (20 / 32) · SonarBugDensity 80.8 (5 / 20) · SonarComposite 60.9 (7 / 32) · SonarFunctionalSkill 72.3 (11 / 20) · SonarIssueDensity 7.5 (16 / 20) · SonarVulnerabilityDensity 92.5 (3 / 20) · TTFT 75.3 (20 / 32) · Tau2Bench 7.5 (29 / 32) · TerminalBench 74.3 (7 / 32)
sources: aistupidlevel · artificial_analysis · lmarena · openrouter · overrides · sonar · swerebench · terminal_bench
missing: BUILD/MCPAtlas · PLAN/MCPAtlas · SWEComposite/SWEBenchMultilingual
claude-opus-4.6 · anthropic · idea 85.1 (down 7.9 since last refresh) · plan 73.3 (down 4.4 since last refresh) · build 75.3 (down 8.2 since last refresh) · review 66.7 (down 6.7 since last refresh)
claude-opus-4.6

group breakdown

A_B 20.7 (31 / 32) · A_I 30.3 (30 / 32) · A_P 33.0 (30 / 32) · A_R 35.3 (30 / 32) · BUILD 87.4 (2 / 32) · CRE 100.0 (1 / 32) · GEN 90.1 (4 / 32) · LM_ARENA_REVIEW_PROXY 33.8 (20 / 32) · OPS_long 75.3 (24 / 32) · OPS_precision 71.9 (24 / 32) · OPS_review 74.7 (25 / 32) · PLAN 75.4 (11 / 32)

metrics

AI_canary_health 84.0 (5 / 7) · AI_code 1.0 (30 / 32) · AI_complexity 0.7 (31 / 32) · AI_context_awareness 12.3 (10 / 32) · AI_correctness 48.3 (29 / 32) · AI_edge_cases 54.9 (30 / 32) · AI_efficiency 6.4 (31 / 32) · AI_hallucination_resistance 0.0 (30 / 32) · AI_memory_retention 8.9 (14 / 32) · AI_parameter_accuracy 95.8 (5 / 32) · AI_plan_coherence 0.0 (32 / 32) · AI_recovery 84.2 (28 / 32) · AI_refusal 2.5 (31 / 32) · AI_spec 2.5 (31 / 32) · AI_stability 38.1 (26 / 32) · AI_task_completion 63.3 (17 / 32) · AI_tool_selection 93.9 (3 / 32) · ARC_AGI_2 91.8 (4 / 25) · ArtificialAnalysisCoding 79.1 (5 / 32) · ArtificialAnalysisIntelligence 84.7 (7 / 32) · ArtificialAnalysisReasoning 87.5 (6 / 32) · BlendedCost 47.9 (29 / 32) · ContextWindow 99.2 (10 / 32) · CopilotArenaOrLMArenaCode 100.0 (1 / 32) · GDPval 84.4 (7 / 32) · GPQA_HLE_Reasoning 87.5 (6 / 32) · GSO 75.3 (3 / 16) · IFBench 28.7 (27 / 32) · LMArenaCreativeOrOpenEnded 100.0 (1 / 32) · LMArenaSearchDocument 33.8 (18 / 30) · LMArenaText 100.0 (1 / 32) · LongContextRecall 88.3 (6 / 32) · MCPAtlas 93.5 (2 / 28) · OutputSpeed 75.3 (23 / 32) · SWEBenchMultilingual 91.9 (14 / 27) · SWEBenchPro 100.0 (1 / 29) · SWEBenchVerified 99.4 (3 / 31) · SWEComposite 95.7 (2 / 32) · SWERebench 91.6 (8 / 31) · SciCode 80.9 (6 / 32) · SonarBugDensity 59.5 (10 / 20) · SonarComposite 70.5 (5 / 32) · SonarFunctionalSkill 92.2 (4 / 20) · SonarIssueDensity 46.8 (7 / 20) · SonarVulnerabilityDensity 66.6 (9 / 20) · TTFT 71.1 (24 / 32) · Tau2Bench 87.4 (13 / 32) · TerminalBench 64.2 (12 / 32)
sources: aistupidlevel · arc_agi · artificial_analysis · gso · lmarena · mcp_atlas · openrouter · overrides · sonar · swebench · swebench_pro · swerebench · terminal_bench
missing: none
qwen3.6-plus · alibaba · idea 68.4 (down 0.1 since last refresh) · plan 70.2 · build 75.0 (down 0.6 since last refresh) · review 79.6 (down 0.1 since last refresh)
qwen3.6-plus

group breakdown

A_B 57.5 (22 / 32) · A_I 68.7 (23 / 32) · A_P 53.2 (23 / 32) · A_R 72.4 (22 / 32) · BUILD 76.8 (9 / 32) · CRE 69.5 (16 / 32) · GEN 60.9 (16 / 32) · LM_ARENA_REVIEW_PROXY 88.7 (9 / 32) · OPS_long 84.1 (13 / 32) · OPS_precision 88.2 (8 / 32) · OPS_review 89.1 (4 / 32) · PLAN 78.9 (7 / 32)

metrics

AI_code 32.7 (21 / 32) · AI_complexity 30.7 (20 / 32) · AI_context_awareness 7.5 (16 / 32) · AI_correctness 86.4 (23 / 32) · AI_edge_cases 72.9 (24 / 32) · AI_efficiency 35.4 (21 / 32) · AI_hallucination_resistance 41.5 (19 / 32) · AI_memory_retention 7.5 (19 / 32) · AI_parameter_accuracy 64.2 (24 / 32) · AI_plan_coherence 12.9 (19 / 32) · AI_recovery 89.9 (22 / 32) · AI_refusal 92.5 (24 / 32) · AI_spec 92.5 (24 / 32) · AI_stability 79.1 (16 / 32) · AI_task_completion 60.9 (23 / 32) · AI_tool_selection 46.4 (21 / 32) · ARC_AGI_2 11.9 (16 / 25) · ArtificialAnalysisCoding 61.2 (14 / 32) · ArtificialAnalysisIntelligence 73.2 (11 / 32) · ArtificialAnalysisReasoning 61.3 (18 / 32) · BlendedCost 95.0 (6 / 32) · ContextWindow 99.2 (15 / 32) · CopilotArenaOrLMArenaCode 73.9 (10 / 32) · GDPval 73.3 (11 / 32) · GPQA_HLE_Reasoning 61.3 (18 / 32) · IFBench 88.2 (10 / 32) · LMArenaCreativeOrOpenEnded 69.5 (16 / 32) · LMArenaSearchDocument 88.7 (9 / 30) · LMArenaText 69.5 (16 / 32) · LongContextRecall 83.2 (11 / 32) · MCPAtlas 76.5 (9 / 28) · OutputSpeed 75.7 (22 / 32) · SWEBenchMultilingual 92.5 (10 / 27) · SWEBenchPro 95.0 (11 / 29) · SWEBenchVerified 95.0 (11 / 31) · SWEComposite 85.9 (13 / 32) · SWERebench 72.8 (18 / 31) · SciCode 19.3 (26 / 32) · SonarBugDensity 92.5 (3 / 20) · SonarComposite 80.6 (4 / 32) · SonarFunctionalSkill 66.8 (15 / 20) · SonarIssueDensity 92.5 (2 / 20) · SonarVulnerabilityDensity 81.6 (7 / 20) · TTFT 90.3 (12 / 32) · Tau2Bench 100.0 (1 / 32) · TerminalBench 67.6 (11 / 32)
sources: artificial_analysis · lmarena · openrouter · overrides
missing: BUILD/GSO
gpt-5.4 · openai · idea 71.3 (up 8.6 since last refresh) · plan 49.2 (up 5.0 since last refresh) · build 74.2 (up 10.0 since last refresh) · review 58.6 (up 8.4 since last refresh)
gpt-5.4

group breakdown

A_B 91.1 (3 / 32) · A_I 83.8 (9 / 32) · A_P 65.3 (11 / 32) · A_R 95.0 (5 / 32) · BUILD 72.3 (13 / 32) · CRE 77.3 (10 / 32) · GEN 44.5 (23 / 32) · LM_ARENA_REVIEW_PROXY 17.2 (27 / 32) · OPS_long 92.3 (2 / 32) · OPS_precision 88.7 (6 / 32) · OPS_review 90.1 (3 / 32) · PLAN 40.2 (25 / 32)

metrics

AI_code 99.0 (3 / 32) · AI_complexity 78.0 (7 / 32) · AI_context_awareness 5.5 (21 / 32) · AI_correctness 100.0 (9 / 32) · AI_edge_cases 97.0 (11 / 32) · AI_efficiency 94.7 (3 / 32) · AI_hallucination_resistance 91.5 (12 / 32) · AI_memory_retention 32.8 (6 / 32) · AI_parameter_accuracy 90.7 (6 / 32) · AI_plan_coherence 16.3 (12 / 32) · AI_recovery 100.0 (8 / 32) · AI_refusal 100.0 (12 / 32) · AI_spec 100.0 (12 / 32) · AI_stability 69.3 (25 / 32) · AI_task_completion 80.1 (12 / 32) · AI_tool_selection 81.5 (5 / 32) · ARC_AGI_2 76.5 (7 / 25) · ArtificialAnalysisCoding 33.9 (26 / 32) · ArtificialAnalysisIntelligence 27.4 (27 / 32) · ArtificialAnalysisReasoning 12.4 (29 / 32) · BlendedCost 73.7 (23 / 32) · ContextWindow 100.0 (1 / 32) · CopilotArenaOrLMArenaCode 64.7 (17 / 32) · GDPval 90.7 (4 / 32) · GPQA_HLE_Reasoning 12.4 (29 / 32) · GSO 54.0 (7 / 16) · IFBench 59.1 (19 / 32) · LMArenaCreativeOrOpenEnded 77.3 (10 / 32) · LMArenaSearchDocument 17.2 (25 / 30) · LMArenaText 77.3 (10 / 32) · LongContextRecall 20.7 (29 / 32) · MCPAtlas 59.7 (11 / 28) · OutputSpeed 95.6 (3 / 32) · SWEBenchPro 92.5 (16 / 29) · SWEBenchVerified 95.0 (9 / 31) · SWEComposite 88.9 (10 / 32) · SWERebench 83.5 (11 / 31) · SciCode 6.7 (29 / 32) · SonarBugDensity 84.7 (4 / 20) · SonarComposite 60.4 (8 / 32) · SonarFunctionalSkill 66.8 (14 / 20) · SonarIssueDensity 6.8 (18 / 20) · SonarVulnerabilityDensity 100.0 (1 / 20) · TTFT 86.5 (15 / 32) · Tau2Bench 0.0 (32 / 32) · TerminalBench 100.0 (2 / 32)
sources: aistupidlevel · arc_agi · artificial_analysis · gso · lmarena · mcp_atlas · openrouter · overrides · sonar · swebench_pro · terminal_bench
missing: SWEComposite/SWEBenchMultilingual
kimi-k2.6 · moonshot · idea 73.6 (down 0.1 since last refresh) · plan 75.0 · build 74.0 (down 0.7 since last refresh) · review 79.0 (down 0.1 since last refresh)
kimi-k2.6

group breakdown

A_B 58.8 (17 / 32) · A_I 72.0 (18 / 32) · A_P 53.7 (19 / 32) · A_R 76.4 (16 / 32) · BUILD 83.5 (4 / 32) · CRE 79.1 (9 / 32) · GEN 85.7 (5 / 32) · LM_ARENA_REVIEW_PROXY 95.6 (2 / 32) · OPS_long 77.9 (19 / 32) · OPS_precision 83.0 (15 / 32) · OPS_review 81.5 (18 / 32) · PLAN 86.4 (3 / 32)

metrics

AI_canary_health 0.0 (7 / 7) · AI_code 29.6 (27 / 32) · AI_complexity 27.3 (28 / 32) · AI_context_awareness 0.0 (27 / 32) · AI_correctness 92.9 (13 / 32) · AI_edge_cases 76.9 (17 / 32) · AI_efficiency 32.8 (27 / 32) · AI_hallucination_resistance 40.0 (25 / 32) · AI_memory_retention 0.0 (31 / 32) · AI_parameter_accuracy 66.7 (20 / 32) · AI_plan_coherence 6.3 (28 / 32) · AI_recovery 97.0 (12 / 32) · AI_refusal 100.0 (9 / 32) · AI_spec 100.0 (9 / 32) · AI_stability 84.3 (11 / 32) · AI_task_completion 62.9 (19 / 32) · AI_tool_selection 45.8 (26 / 32) · ArtificialAnalysisCoding 75.7 (8 / 32) · ArtificialAnalysisIntelligence 88.2 (4 / 32) · ArtificialAnalysisReasoning 89.0 (5 / 32) · BlendedCost 87.9 (15 / 32) · ContextWindow 77.8 (19 / 32) · CopilotArenaOrLMArenaCode 94.7 (5 / 32) · GDPval 69.3 (13 / 32) · GPQA_HLE_Reasoning 89.0 (5 / 32) · IFBench 90.3 (8 / 32) · LMArenaCreativeOrOpenEnded 79.1 (9 / 32) · LMArenaSearchDocument 95.6 (2 / 30) · LMArenaText 79.1 (9 / 32) · LongContextRecall 83.2 (10 / 32) · MCPAtlas 81.7 (8 / 28) · OutputSpeed 70.5 (29 / 32) · SWEBenchMultilingual 95.0 (7 / 27) · SWEBenchPro 95.0 (8 / 29) · SWEBenchVerified 95.0 (8 / 31) · SWEComposite 94.0 (4 / 32) · SWERebench 92.5 (7 / 31) · SciCode 89.7 (4 / 32) · SonarComposite 50.0 (23 / 32) · TTFT 93.2 (9 / 32) · Tau2Bench 95.9 (7 / 32) · TerminalBench 74.5 (6 / 32)
sources: aistupidlevel · artificial_analysis · lmarena · openrouter · overrides
missing: BUILD/GSO · GEN/ARC_AGI_2 · SonarComposite/SonarBugDensity · SonarComposite/SonarFunctionalSkill · SonarComposite/SonarIssueDensity · SonarComposite/SonarVulnerabilityDensity
deepseek-v4-flash · deepseek · idea 64.0 · plan 72.1 · build 73.6 (down 0.6 since last refresh) · review 78.7 (down 0.1 since last refresh)
deepseek-v4-flash

group breakdown

A_B 58.2 (18 / 32) · A_I 74.2 (17 / 32) · A_P 62.3 (15 / 32) · A_R 73.0 (18 / 32) · BUILD 74.7 (11 / 32) · CRE 58.4 (20 / 32) · GEN 63.0 (14 / 32) · LM_ARENA_REVIEW_PROXY 88.7 (6 / 32) · OPS_long 84.5 (12 / 32) · OPS_precision 89.3 (4 / 32) · OPS_review 86.4 (9 / 32) · PLAN 78.4 (8 / 32)

metrics

AI_canary_health 88.8 (2 / 7) · AI_code 36.1 (17 / 32) · AI_complexity 27.3 (26 / 32) · AI_context_awareness 0.0 (25 / 32) · AI_correctness 92.9 (12 / 32) · AI_edge_cases 76.9 (16 / 32) · AI_efficiency 33.6 (26 / 32) · AI_hallucination_resistance 20.0 (29 / 32) · AI_memory_retention 0.0 (29 / 32) · AI_parameter_accuracy 86.8 (9 / 32) · AI_plan_coherence 36.6 (11 / 32) · AI_recovery 97.0 (11 / 32) · AI_refusal 100.0 (6 / 32) · AI_spec 100.0 (6 / 32) · AI_stability 78.5 (21 / 32) · AI_task_completion 75.4 (14 / 32) · AI_tool_selection 70.3 (8 / 32) · ArtificialAnalysisCoding 46.7 (20 / 32) · ArtificialAnalysisIntelligence 59.7 (20 / 32) · ArtificialAnalysisReasoning 77.3 (10 / 32) · BlendedCost 99.8 (2 / 32) · ContextWindow 70.3 (30 / 32) · CopilotArenaOrLMArenaCode 88.0 (6 / 32) · GDPval 68.2 (15 / 32) · GPQA_HLE_Reasoning 77.3 (10 / 32) · IFBench 99.0 (3 / 32) · LMArenaCreativeOrOpenEnded 58.4 (20 / 32) · LMArenaSearchDocument 88.7 (6 / 30) · LMArenaText 58.4 (20 / 32) · LongContextRecall 49.4 (24 / 32) · MCPAtlas 81.7 (5 / 28) · OutputSpeed 80.1 (16 / 32) · SWEBenchMultilingual 59.1 (15 / 27) · SWEBenchPro 95.0 (3 / 29) · SWEBenchVerified 95.0 (5 / 31) · SWEComposite 90.4 (7 / 32) · SWERebench 92.5 (4 / 31) · SciCode 42.4 (18 / 32) · SonarComposite 50.0 (17 / 32) · TTFT 99.4 (3 / 32) · Tau2Bench 93.9 (11 / 32) · TerminalBench 60.9 (15 / 32)
sources: aistupidlevel · artificial_analysis · lmarena · openrouter · overrides
missing: BUILD/GSO · GEN/ARC_AGI_2 · SonarComposite/SonarBugDensity · SonarComposite/SonarFunctionalSkill · SonarComposite/SonarIssueDensity · SonarComposite/SonarVulnerabilityDensity
claude-sonnet-4.6 · anthropic · idea 73.6 (up 4.7 since last refresh) · plan 60.7 (up 2.2 since last refresh) · build 73.4 (up 5.5 since last refresh) · review 61.7 (up 3.5 since last refresh)
claude-sonnet-4.6

group breakdown

A_B 82.3 (8 / 32) · A_I 87.5 (6 / 32) · A_P 66.0 (9 / 32) · A_R 82.2 (15 / 32) · BUILD 75.4 (10 / 32) · CRE 74.3 (11 / 32) · GEN 66.0 (12 / 32) · LM_ARENA_REVIEW_PROXY 23.4 (22 / 32) · OPS_long 66.4 (28 / 32) · OPS_precision 53.6 (30 / 32) · OPS_review 63.5 (28 / 32) · PLAN 56.6 (21 / 32)

metrics

AI_canary_health 84.7 (4 / 7) · AI_code 92.3 (5 / 32) · AI_complexity 75.2 (8 / 32) · AI_context_awareness 9.2 (11 / 32) · AI_correctness 100.0 (5 / 32) · AI_edge_cases 100.0 (4 / 32) · AI_efficiency 80.6 (4 / 32) · AI_hallucination_resistance 0.0 (32 / 32) · AI_memory_retention 0.0 (28 / 32) · AI_parameter_accuracy 70.4 (19 / 32) · AI_plan_coherence 15.4 (14 / 32) · AI_recovery 100.0 (4 / 32) · AI_refusal 100.0 (5 / 32) · AI_spec 100.0 (5 / 32) · AI_stability 100.0 (5 / 32) · AI_task_completion 82.5 (10 / 32) · AI_tool_selection 100.0 (1 / 32) · ARC_AGI_2 10.6 (17 / 25) · ArtificialAnalysisCoding 88.8 (4 / 32) · ArtificialAnalysisIntelligence 79.7 (8 / 32) · ArtificialAnalysisReasoning 68.9 (12 / 32) · BlendedCost 73.0 (26 / 32) · ContextWindow 99.2 (14 / 32) · CopilotArenaOrLMArenaCode 95.3 (4 / 32) · GDPval 88.8 (6 / 32) · GPQA_HLE_Reasoning 68.9 (12 / 32) · GSO 30.7 (11 / 16) · IFBench 38.1 (25 / 32) · LMArenaCreativeOrOpenEnded 74.3 (11 / 32) · LMArenaSearchDocument 23.4 (20 / 30) · LMArenaText 74.3 (11 / 32) · LongContextRecall 88.3 (7 / 32) · MCPAtlas 55.7 (16 / 28) · OutputSpeed 80.5 (15 / 32) · SWEBenchMultilingual 95.0 (4 / 27) · SWEBenchPro 76.5 (24 / 29) · SWEBenchVerified 90.0 (20 / 31) · SWEComposite 88.1 (11 / 32) · SWERebench 95.8 (3 / 31) · SciCode 52.8 (15 / 32) · SonarBugDensity 65.8 (8 / 20) · SonarComposite 55.8 (10 / 32) · SonarFunctionalSkill 84.5 (5 / 20) · SonarIssueDensity 22.3 (11 / 20) · SonarVulnerabilityDensity 21.8 (18 / 20) · TTFT 0.0 (32 / 32) · Tau2Bench 50.5 (23 / 32) · TerminalBench 47.3 (21 / 32)
sources: aistupidlevel · arc_agi · artificial_analysis · lmarena · mcp_atlas · openrouter · overrides · sonar · swerebench
missing: none
mimo-v2.5-pro · xiaomi · idea 77.8 (down 0.1 since last refresh) · plan 74.6 · build 71.7 (down 0.6 since last refresh) · review 78.0 (down 0.1 since last refresh)
mimo-v2.5-pro

group breakdown

A_B 57.5 (24 / 32) · A_I 68.7 (25 / 32) · A_P 53.2 (25 / 32) · A_R 72.4 (24 / 32) · BUILD 72.2 (14 / 32) · CRE 81.4 (8 / 32) · GEN 73.8 (9 / 32) · LM_ARENA_REVIEW_PROXY 85.3 (16 / 32) · OPS_long 84.8 (9 / 32) · OPS_precision 87.4 (9 / 32) · OPS_review 88.4 (6 / 32) · PLAN 80.0 (6 / 32)

metrics

AI_code 32.7 (23 / 32) · AI_complexity 30.7 (22 / 32) · AI_context_awareness 7.5 (18 / 32) · AI_correctness 86.4 (25 / 32) · AI_edge_cases 72.9 (26 / 32) · AI_efficiency 35.4 (23 / 32) · AI_hallucination_resistance 41.5 (21 / 32) · AI_memory_retention 7.5 (21 / 32) · AI_parameter_accuracy 64.2 (26 / 32) · AI_plan_coherence 12.9 (21 / 32) · AI_recovery 89.9 (24 / 32) · AI_refusal 92.5 (26 / 32) · AI_spec 92.5 (26 / 32) · AI_stability 79.1 (18 / 32) · AI_task_completion 60.9 (25 / 32) · AI_tool_selection 46.4 (23 / 32) · ARC_AGI_2 20.3 (13 / 25) · ArtificialAnalysisCoding 70.1 (11 / 32) · ArtificialAnalysisIntelligence 87.8 (5 / 32) · ArtificialAnalysisReasoning 75.0 (11 / 32) · BlendedCost 87.5 (16 / 32) · ContextWindow 100.0 (9 / 32) · CopilotArenaOrLMArenaCode 78.2 (8 / 32) · GDPval 68.2 (21 / 32) · GPQA_HLE_Reasoning 75.0 (11 / 32) · IFBench 100.0 (2 / 32) · LMArenaCreativeOrOpenEnded 81.4 (8 / 32) · LMArenaSearchDocument 85.3 (16 / 30) · LMArenaText 81.4 (8 / 32) · LongContextRecall 100.0 (2 / 32) · MCPAtlas 32.4 (21 / 28) · OutputSpeed 78.4 (18 / 32) · SWEBenchMultilingual 92.5 (12 / 27) · SWEBenchPro 95.0 (13 / 29) · SWEBenchVerified 95.0 (12 / 31) · SWEComposite 82.1 (15 / 32) · SWERebench 63.5 (23 / 31) · SciCode 71.5 (8 / 32) · SonarComposite 50.0 (27 / 32) · TTFT 89.7 (14 / 32) · Tau2Bench 92.1 (12 / 32) · TerminalBench 76.8 (5 / 32)
sources: artificial_analysis · lmarena · openrouter · overrides
missing: BUILD/GSO · SonarComposite/SonarBugDensity · SonarComposite/SonarFunctionalSkill · SonarComposite/SonarIssueDensity · SonarComposite/SonarVulnerabilityDensity
claude-opus-4.1 · anthropic · idea 62.1 (up 6.6 since last refresh) · plan 62.6 (up 3.4 since last refresh) · build 71.4 (up 7.6 since last refresh) · review 58.9 (up 5.4 since last refresh)
claude-opus-4.1

group breakdown

A_B 87.2 (6 / 32) · A_I 89.2 (4 / 32) · A_P 62.1 (16 / 32) · A_R 88.9 (8 / 32) · BUILD 70.8 (15 / 32) · CRE 52.2 (22 / 32) · GEN 65.5 (13 / 32) · LM_ARENA_REVIEW_PROXY 0.1 (31 / 32) · OPS_long 64.2 (30 / 32) · OPS_precision 55.2 (29 / 32) · OPS_review 56.3 (30 / 32) · PLAN 62.0 (19 / 32)

metrics

AI_canary_health 65.7 (6 / 7) · AI_code 91.2 (6 / 32) · AI_complexity 99.6 (2 / 32) · AI_context_awareness 0.0 (23 / 32) · AI_correctness 100.0 (1 / 32) · AI_edge_cases 100.0 (1 / 32) · AI_efficiency 75.2 (7 / 32) · AI_hallucination_resistance 40.0 (24 / 32) · AI_memory_retention 0.0 (24 / 32) · AI_parameter_accuracy 71.4 (18 / 32) · AI_plan_coherence 12.4 (25 / 32) · AI_recovery 100.0 (1 / 32) · AI_refusal 100.0 (1 / 32) · AI_spec 100.0 (1 / 32) · AI_stability 100.0 (1 / 32) · AI_task_completion 84.4 (9 / 32) · AI_tool_selection 58.9 (16 / 32) · ARC_AGI_2 83.5 (6 / 25) · ArtificialAnalysisCoding 73.9 (9 / 32) · ArtificialAnalysisIntelligence 68.7 (15 / 32) · ArtificialAnalysisReasoning 61.6 (17 / 32) · BlendedCost 0.0 (32 / 32) · ContextWindow 73.5 (26 / 32) · CopilotArenaOrLMArenaCode 46.8 (27 / 32) · GDPval 82.5 (8 / 32) · GPQA_HLE_Reasoning 61.6 (17 / 32) · GSO 57.9 (6 / 16) · IFBench 43.1 (22 / 32) · LMArenaCreativeOrOpenEnded 52.2 (22 / 32) · LMArenaSearchDocument 0.1 (29 / 30) · LMArenaText 52.2 (22 / 32) · LongContextRecall 92.5 (5 / 32) · MCPAtlas 86.9 (4 / 28) · OutputSpeed 73.8 (25 / 32) · SWEBenchMultilingual 92.5 (8 / 27) · SWEBenchPro 82.6 (20 / 29) · SWEBenchVerified 91.5 (18 / 31) · SWEComposite 72.5 (22 / 32) · SWERebench 51.5 (26 / 31) · SciCode 65.0 (11 / 32) · SonarBugDensity 70.1 (7 / 20) · SonarComposite 81.5 (3 / 32) · SonarFunctionalSkill 92.5 (3 / 20) · SonarIssueDensity 73.1 (4 / 20) · SonarVulnerabilityDensity 81.6 (6 / 20) · TTFT 63.1 (26 / 32) · Tau2Bench 76.8 (18 / 32) · TerminalBench 29.3 (26 / 32)
sources: aistupidlevel · lmarena · openrouter · overrides · swerebench · terminal_bench
missing: none
minimax-m2.7 · minimax · idea 47.5 (down 0.1 since last refresh) · plan 61.5 · build 68.3 (down 0.6 since last refresh) · review 72.2 (down 0.1 since last refresh)
minimax-m2.7

group breakdown

A_B 57.5 (20 / 32) · A_I 68.7 (21 / 32) · A_P 53.2 (21 / 32) · A_R 72.4 (20 / 32) · BUILD 68.9 (16 / 32) · CRE 34.2 (26 / 32) · GEN 52.1 (21 / 32) · LM_ARENA_REVIEW_PROXY 85.3 (14 / 32) · OPS_long 79.6 (18 / 32) · OPS_precision 85.3 (11 / 32) · OPS_review 83.3 (13 / 32) · PLAN 66.4 (13 / 32)

metrics

AI_code 32.7 (19 / 32) · AI_complexity 30.7 (18 / 32) · AI_context_awareness 7.5 (14 / 32) · AI_correctness 86.4 (21 / 32) · AI_edge_cases 72.9 (22 / 32) · AI_efficiency 35.4 (19 / 32) · AI_hallucination_resistance 41.5 (17 / 32) · AI_memory_retention 7.5 (17 / 32) · AI_parameter_accuracy 64.2 (22 / 32) · AI_plan_coherence 12.9 (17 / 32) · AI_recovery 89.9 (20 / 32) · AI_refusal 92.5 (22 / 32) · AI_spec 92.5 (22 / 32) · AI_stability 79.1 (14 / 32) · AI_task_completion 60.9 (21 / 32) · AI_tool_selection 46.4 (19 / 32) · ARC_AGI_2 11.9 (15 / 25) · ArtificialAnalysisCoding 57.7 (17 / 32) · ArtificialAnalysisIntelligence 71.6 (13 / 32) · ArtificialAnalysisReasoning 64.7 (14 / 32) · BlendedCost 98.5 (3 / 32) · ContextWindow 73.2 (29 / 32) · CopilotArenaOrLMArenaCode 54.9 (21 / 32) · GDPval 68.2 (18 / 32) · GPQA_HLE_Reasoning 64.7 (14 / 32) · IFBench 89.5 (9 / 32) · LMArenaCreativeOrOpenEnded 34.2 (26 / 32) · LMArenaSearchDocument 85.3 (14 / 30) · LMArenaText 34.2 (26 / 32) · LongContextRecall 78.2 (12 / 32) · MCPAtlas 32.4 (19 / 28) · OutputSpeed 72.8 (28 / 32) · SWEBenchMultilingual 95.0 (6 / 27) · SWEBenchPro 95.0 (6 / 29) · SWEBenchVerified 92.5 (13 / 31) · SWEComposite 86.0 (12 / 32) · SWERebench 73.3 (17 / 31) · SciCode 53.9 (14 / 32) · SonarComposite 50.0 (20 / 32) · TTFT 93.8 (8 / 32) · Tau2Bench 71.0 (20 / 32) · TerminalBench 61.4 (13 / 32)
sources: artificial_analysis · lmarena · openrouter · overrides · swerebench
missing: BUILD/GSO · SonarComposite/SonarBugDensity · SonarComposite/SonarFunctionalSkill · SonarComposite/SonarIssueDensity · SonarComposite/SonarVulnerabilityDensity
glm-5 · zai · idea 68.1 (down 0.1 since last refresh) · plan 62.1 · build 68.3 (down 0.6 since last refresh) · review 72.7 (down 0.1 since last refresh)
glm-5

group breakdown

A_B 57.5 (25 / 32) · A_I 68.7 (26 / 32) · A_P 53.2 (26 / 32) · A_R 72.4 (25 / 32) · BUILD 68.5 (17 / 32) · CRE 72.1 (15 / 32) · GEN 54.0 (18 / 32) · LM_ARENA_REVIEW_PROXY 88.7 (10 / 32) · OPS_long 84.6 (10 / 32) · OPS_precision 88.7 (5 / 32) · OPS_review 86.1 (10 / 32) · PLAN 65.9 (14 / 32)

metrics

AI_code 32.7 (24 / 32) · AI_complexity 30.7 (23 / 32) · AI_context_awareness 7.5 (19 / 32) · AI_correctness 86.4 (26 / 32) · AI_edge_cases 72.9 (27 / 32) · AI_efficiency 35.4 (24 / 32) · AI_hallucination_resistance 41.5 (22 / 32) · AI_memory_retention 7.5 (22 / 32) · AI_parameter_accuracy 64.2 (27 / 32) · AI_plan_coherence 12.9 (22 / 32) · AI_recovery 89.9 (25 / 32) · AI_refusal 92.5 (27 / 32) · AI_spec 92.5 (27 / 32) · AI_stability 79.1 (19 / 32) · AI_task_completion 60.9 (26 / 32) · AI_tool_selection 46.4 (24 / 32) · ARC_AGI_2 5.2 (19 / 25) · ArtificialAnalysisCoding 40.1 (24 / 32) · ArtificialAnalysisIntelligence 60.8 (18 / 32) · ArtificialAnalysisReasoning 53.2 (22 / 32) · BlendedCost 92.5 (12 / 32) · ContextWindow 73.7 (24 / 32) · CopilotArenaOrLMArenaCode 64.3 (19 / 32) · GDPval 73.3 (12 / 32) · GPQA_HLE_Reasoning 53.2 (22 / 32) · IFBench 82.8 (11 / 32) · LMArenaCreativeOrOpenEnded 72.1 (15 / 32) · LMArenaSearchDocument 88.7 (10 / 30) · LMArenaText 72.1 (15 / 32) · LongContextRecall 37.8 (28 / 32) · MCPAtlas 47.2 (17 / 28) · OutputSpeed 80.6 (14 / 32) · SWEBenchMultilingual 51.2 (16 / 27) · SWEBenchPro 92.5 (17 / 29) · SWEBenchVerified 91.0 (19 / 31) · SWEComposite 81.9 (16 / 32) · SWERebench 76.9 (13 / 31) · SciCode 35.2 (22 / 32) · SonarBugDensity 100.0 (1 / 20) · SonarComposite 86.0 (2 / 32) · SonarFunctionalSkill 69.8 (12 / 20) · SonarIssueDensity 100.0 (1 / 20) · SonarVulnerabilityDensity 87.2 (5 / 20) · TTFT 100.0 (1 / 32) · Tau2Bench 100.0 (3 / 32) · TerminalBench 55.8 (17 / 32)
sources: arc_agi · artificial_analysis · lmarena · openrouter · overrides · sonar · swebench · swerebench · terminal_bench
missing: BUILD/GSO
gemini-3-pro · google · idea 86.3 (down 0.6 since last refresh) · plan 73.2 (down 0.3 since last refresh) · build 67.9 (down 0.7 since last refresh) · review 64.3 (up 0.6 since last refresh)
gemini-3-pro

group breakdown

A_B 77.3 (10 / 32) · A_I 82.7 (11 / 32) · A_P 75.8 (2 / 32) · A_R 88.3 (9 / 32) · BUILD 67.5 (18 / 32) · CRE 96.8 (3 / 32) · GEN 75.1 (8 / 32) · LM_ARENA_REVIEW_PROXY 20.0 (24 / 32) · OPS_long 60.0 (31 / 32) · OPS_precision 47.0 (32 / 32) · OPS_review 45.7 (32 / 32) · PLAN 75.6 (10 / 32)

metrics

AI_code 62.1 (10 / 32) · AI_complexity 78.2 (6 / 32) · AI_context_awareness 12.7 (8 / 32) · AI_correctness 100.0 (6 / 32) · AI_edge_cases 100.0 (5 / 32) · AI_efficiency 70.8 (10 / 32) · AI_hallucination_resistance 100.0 (1 / 32) · AI_memory_retention 12.3 (12 / 32) · AI_parameter_accuracy 89.7 (7 / 32) · AI_plan_coherence 99.3 (6 / 32) · AI_recovery 100.0 (5 / 32) · AI_refusal 100.0 (7 / 32) · AI_spec 100.0 (7 / 32) · AI_stability 20.4 (30 / 32) · AI_task_completion 96.6 (4 / 32) · AI_tool_selection 69.3 (9 / 32) · ARC_AGI_2 42.2 (9 / 25) · ArtificialAnalysisCoding 73.6 (10 / 32) · ArtificialAnalysisIntelligence 67.0 (16 / 32) · ArtificialAnalysisReasoning 91.1 (4 / 32) · BlendedCost 76.0 (20 / 32) · ContextWindow 0.0 (32 / 32) · CopilotArenaOrLMArenaCode 65.0 (16 / 32) · GDPval 34.9 (27 / 32) · GPQA_HLE_Reasoning 91.1 (4 / 32) · GSO 40.7 (10 / 16) · IFBench 75.3 (14 / 32) · LMArenaCreativeOrOpenEnded 96.8 (3 / 32) · LMArenaSearchDocument 20.0 (22 / 30) · LMArenaText 96.8 (3 / 32) · LongContextRecall 88.3 (8 / 32) · MCPAtlas 59.8 (10 / 28) · OutputSpeed 90.5 (5 / 32) · SWEBenchMultilingual 33.5 (19 / 27) · SWEBenchPro 80.3 (22 / 29) · SWEBenchVerified 81.4 (25 / 31) · SWEComposite 71.7 (23 / 32) · SWERebench 70.2 (20 / 31) · SciCode 100.0 (1 / 32) · SonarBugDensity 53.2 (11 / 20) · SonarComposite 54.9 (11 / 32) · SonarFunctionalSkill 84.1 (6 / 20) · SonarIssueDensity 6.7 (19 / 20) · SonarVulnerabilityDensity 59.7 (10 / 20) · TTFT 13.3 (31 / 32) · Tau2Bench 76.1 (19 / 32) · TerminalBench 61.2 (14 / 32)
sources: aistupidlevel · arc_agi · artificial_analysis · gso · lmarena · mcp_atlas · openrouter · overrides · sonar · swebench · swebench_pro · swerebench · terminal_bench
missing: none
mimo-v2.5 · xiaomi · idea 57.7 (down 0.1 since last refresh) · plan 61.1 · build 66.6 (down 0.6 since last refresh) · review 71.0 (down 0.1 since last refresh)
mimo-v2.5

group breakdown

A_B 57.5 (23 / 32) · A_I 68.7 (24 / 32) · A_P 53.2 (24 / 32) · A_R 72.4 (23 / 32) · BUILD 66.0 (19 / 32) · CRE 51.1 (23 / 32) · GEN 54.5 (17 / 32) · LM_ARENA_REVIEW_PROXY 85.3 (15 / 32) · OPS_long 90.1 (4 / 32) · OPS_precision 91.6 (1 / 32) · OPS_review 92.5 (2 / 32) · PLAN 62.8 (18 / 32)

metrics

AI_code 32.7 (22 / 32) · AI_complexity 30.7 (21 / 32) · AI_context_awareness 7.5 (17 / 32) · AI_correctness 86.4 (24 / 32) · AI_edge_cases 72.9 (25 / 32) · AI_efficiency 35.4 (22 / 32) · AI_hallucination_resistance 41.5 (20 / 32) · AI_memory_retention 7.5 (20 / 32) · AI_parameter_accuracy 64.2 (25 / 32) · AI_plan_coherence 12.9 (20 / 32) · AI_recovery 89.9 (23 / 32) · AI_refusal 92.5 (25 / 32) · AI_spec 92.5 (25 / 32) · AI_stability 79.1 (17 / 32) · AI_task_completion 60.9 (24 / 32) · AI_tool_selection 46.4 (22 / 32) · ARC_AGI_2 20.3 (12 / 25) · ArtificialAnalysisCoding 58.4 (16 / 32) · ArtificialAnalysisIntelligence 69.3 (14 / 32) · ArtificialAnalysisReasoning 53.2 (21 / 32) · BlendedCost 94.1 (9 / 32) · ContextWindow 100.0 (8 / 32) · CopilotArenaOrLMArenaCode 67.6 (13 / 32) · GDPval 68.2 (20 / 32) · GPQA_HLE_Reasoning 53.2 (21 / 32) · IFBench 66.4 (17 / 32) · LMArenaCreativeOrOpenEnded 51.1 (23 / 32) · LMArenaSearchDocument 85.3 (15 / 30) · LMArenaText 51.1 (23 / 32) · LongContextRecall 47.9 (25 / 32) · MCPAtlas 32.4 (20 / 28) · OutputSpeed 86.2 (11 / 32) · SWEBenchMultilingual 92.5 (11 / 27) · SWEBenchPro 95.0 (12 / 29) · SWEBenchVerified 92.5 (15 / 31) · SWEComposite 81.8 (17 / 32) · SWERebench 63.5 (22 / 31) · SciCode 32.5 (23 / 32) · SonarComposite 50.0 (26 / 32) · TTFT 91.3 (11 / 32) · Tau2Bench 84.0 (15 / 32) · TerminalBench 73.3 (8 / 32)
sources: artificial_analysis · lmarena · openrouter · overrides
missing: BUILD/GSO · SonarComposite/SonarBugDensity · SonarComposite/SonarFunctionalSkill · SonarComposite/SonarIssueDensity · SonarComposite/SonarVulnerabilityDensity
gemini-3-flash · google · idea 80.4 (down 0.5 since last refresh) · plan 66.6 (down 0.3 since last refresh) · build 64.3 (down 0.6 since last refresh) · review 60.9 (up 0.5 since last refresh)
gemini-3-flash

group breakdown

A_B 73.2 (12 / 32) · A_I 77.8 (15 / 32) · A_P 71.9 (5 / 32) · A_R 82.5 (13 / 32) · BUILD 59.4 (21 / 32) · CRE 87.7 (6 / 32) · GEN 62.3 (15 / 32) · LM_ARENA_REVIEW_PROXY 19.4 (25 / 32) · OPS_long 94.5 (1 / 32) · OPS_precision 90.5 (3 / 32) · OPS_review 92.6 (1 / 32) · PLAN 63.5 (17 / 32)

metrics

AI_code 60.3 (12 / 32) · AI_complexity 73.9 (11 / 32) · AI_context_awareness 18.3 (5 / 32) · AI_correctness 92.5 (16 / 32) · AI_edge_cases 92.5 (14 / 32) · AI_efficiency 67.6 (12 / 32) · AI_hallucination_resistance 92.5 (10 / 32) · AI_memory_retention 18.0 (9 / 32) · AI_parameter_accuracy 83.7 (11 / 32) · AI_plan_coherence 91.9 (8 / 32) · AI_recovery 92.5 (15 / 32) · AI_refusal 92.5 (19 / 32) · AI_spec 92.5 (19 / 32) · AI_stability 24.9 (28 / 32) · AI_task_completion 89.6 (6 / 32) · AI_tool_selection 66.4 (13 / 32) · ARC_AGI_2 3.1 (22 / 25) · ArtificialAnalysisCoding 60.1 (15 / 32) · ArtificialAnalysisIntelligence 59.3 (21 / 32) · ArtificialAnalysisReasoning 83.7 (9 / 32) · BlendedCost 90.5 (14 / 32) · ContextWindow 100.0 (6 / 32) · CopilotArenaOrLMArenaCode 64.6 (18 / 32) · GDPval 37.1 (25 / 32) · GPQA_HLE_Reasoning 83.7 (9 / 32) · GSO 14.0 (14 / 16) · IFBench 95.7 (4 / 32) · LMArenaCreativeOrOpenEnded 87.7 (6 / 32) · LMArenaSearchDocument 19.4 (23 / 30) · LMArenaText 87.7 (6 / 32) · LongContextRecall 66.1 (14 / 32) · MCPAtlas 16.9 (24 / 28) · OutputSpeed 99.4 (2 / 32) · SWEBenchMultilingual 100.0 (1 / 27) · SWEBenchPro 53.0 (26 / 29) · SWEBenchVerified 100.0 (1 / 31) · SWEComposite 74.0 (20 / 32) · SWERebench 76.0 (15 / 31) · SciCode 73.7 (7 / 32) · SonarBugDensity 52.7 (14 / 20) · SonarComposite 54.2 (14 / 32) · SonarFunctionalSkill 78.9 (9 / 20) · SonarIssueDensity 13.2 (14 / 20) · SonarVulnerabilityDensity 58.2 (13 / 20) · TTFT 78.9 (17 / 32) · Tau2Bench 61.1 (21 / 32) · TerminalBench 48.2 (19 / 32)
sources: arc_agi · artificial_analysis · gso · lmarena · mcp_atlas · openrouter · overrides · sonar · swebench · swebench_pro · swerebench · terminal_bench
missing: none
kimi-k2.5 · moonshot · idea 58.6 (down 0.1 since last refresh) · plan 61.2 · build 62.1 (down 0.6 since last refresh) · review 70.1 (down 0.1 since last refresh)
kimi-k2.5

group breakdown

A_B 57.5 (21 / 32) · A_I 68.7 (22 / 32) · A_P 53.2 (22 / 32) · A_R 72.4 (21 / 32) · BUILD 60.2 (20 / 32) · CRE 54.6 (21 / 32) · GEN 53.8 (19 / 32) · LM_ARENA_REVIEW_PROXY 91.6 (5 / 32) · OPS_long 80.2 (16 / 32) · OPS_precision 85.6 (10 / 32) · OPS_review 83.9 (12 / 32) · PLAN 64.6 (16 / 32)

metrics

AI_code 32.7 (20 / 32) · AI_complexity 30.7 (19 / 32) · AI_context_awareness 7.5 (15 / 32) · AI_correctness 86.4 (22 / 32) · AI_edge_cases 72.9 (23 / 32) · AI_efficiency 35.4 (20 / 32) · AI_hallucination_resistance 41.5 (18 / 32) · AI_memory_retention 7.5 (18 / 32) · AI_parameter_accuracy 64.2 (23 / 32) · AI_plan_coherence 12.9 (18 / 32) · AI_recovery 89.9 (21 / 32) · AI_refusal 92.5 (23 / 32) · AI_spec 92.5 (23 / 32) · AI_stability 79.1 (15 / 32) · AI_task_completion 60.9 (22 / 32) · AI_tool_selection 46.4 (20 / 32) · ARC_AGI_2 15.0 (14 / 25) · ArtificialAnalysisCoding 49.4 (19 / 32) · ArtificialAnalysisIntelligence 60.8 (17 / 32) · ArtificialAnalysisReasoning 68.5 (13 / 32) · BlendedCost 93.7 (10 / 32) · ContextWindow 77.8 (18 / 32) · CopilotArenaOrLMArenaCode 54.6 (22 / 32) · GDPval 68.2 (19 / 32) · GPQA_HLE_Reasoning 68.5 (13 / 32) · IFBench 74.7 (15 / 32) · LMArenaCreativeOrOpenEnded 54.6 (21 / 32) · LMArenaSearchDocument 91.6 (5 / 30) · LMArenaText 54.6 (21 / 32) · LongContextRecall 61.0 (18 / 32) · MCPAtlas 29.3 (22 / 28) · OutputSpeed 72.8 (27 / 32) · SWEBenchMultilingual 8.8 (22 / 27) · SWEBenchPro 95.0 (7 / 29) · SWEBenchVerified 85.0 (22 / 31) · SWEComposite 73.2 (21 / 32) · SWERebench 65.8 (21 / 31) · SciCode 64.9 (12 / 32) · SonarComposite 50.0 (22 / 32) · TTFT 95.3 (6 / 32) · Tau2Bench 95.9 (6 / 32) · TerminalBench 41.9 (23 / 32)
sources: arc_agi · artificial_analysis · lmarena · mcp_atlas · openrouter · overrides · swebench · swerebench · terminal_bench
missing: BUILD/GSO · SonarComposite/SonarBugDensity · SonarComposite/SonarFunctionalSkill · SonarComposite/SonarIssueDensity · SonarComposite/SonarVulnerabilityDensity
minimax-m2.5 · minimax · idea 33.5 (down 0.1 since last refresh) · plan 51.5 · build 61.3 (down 0.6 since last refresh) · review 67.3 (down 0.1 since last refresh)
minimax-m2.5

group breakdown

A_B 57.5 (19 / 32) · A_I 68.7 (20 / 32) · A_P 53.2 (20 / 32) · A_R 72.4 (19 / 32) · BUILD 59.3 (22 / 32) · CRE 17.1 (29 / 32) · GEN 29.9 (26 / 32) · LM_ARENA_REVIEW_PROXY 85.3 (13 / 32) · OPS_long 85.3 (7 / 32) · OPS_precision 88.5 (7 / 32) · OPS_review 86.5 (8 / 32) · PLAN 59.1 (20 / 32)

metrics

AI_code 32.7 (18 / 32) · AI_complexity 30.7 (17 / 32) · AI_context_awareness 7.5 (13 / 32) · AI_correctness 86.4 (20 / 32) · AI_edge_cases 72.9 (21 / 32) · AI_efficiency 35.4 (18 / 32) · AI_hallucination_resistance 41.5 (16 / 32) · AI_memory_retention 7.5 (16 / 32) · AI_parameter_accuracy 64.2 (21 / 32) · AI_plan_coherence 12.9 (16 / 32) · AI_recovery 89.9 (19 / 32) · AI_refusal 92.5 (21 / 32) · AI_spec 92.5 (21 / 32) · AI_stability 79.1 (13 / 32) · AI_task_completion 60.9 (20 / 32) · AI_tool_selection 46.4 (18 / 32) · ARC_AGI_2 5.2 (18 / 25) · ArtificialAnalysisCoding 42.2 (23 / 32) · ArtificialAnalysisIntelligence 42.0 (24 / 32) · ArtificialAnalysisReasoning 40.1 (24 / 32) · BlendedCost 100.0 (1 / 32) · ContextWindow 73.2 (28 / 32) · CopilotArenaOrLMArenaCode 45.9 (28 / 32) · GDPval 68.2 (17 / 32) · GPQA_HLE_Reasoning 40.1 (24 / 32) · IFBench 78.5 (12 / 32) · LMArenaCreativeOrOpenEnded 17.1 (29 / 32) · LMArenaSearchDocument 85.3 (13 / 30) · LMArenaText 17.1 (29 / 32) · LongContextRecall 64.5 (16 / 32) · MCPAtlas 32.4 (18 / 28) · OutputSpeed 83.1 (13 / 32) · SWEBenchMultilingual 26.5 (20 / 27) · SWEBenchPro 95.0 (5 / 29) · SWEBenchVerified 100.0 (2 / 31) · SWEComposite 75.9 (19 / 32) · SWERebench 62.4 (24 / 31) · SciCode 29.7 (25 / 32) · SonarComposite 50.0 (19 / 32) · TTFT 93.1 (10 / 32) · Tau2Bench 94.6 (10 / 32) · TerminalBench 40.4 (24 / 32)
sources: arc_agi · artificial_analysis · lmarena · openrouter · overrides · swebench · swerebench · terminal_bench
missing: BUILD/GSO · SonarComposite/SonarBugDensity · SonarComposite/SonarFunctionalSkill · SonarComposite/SonarIssueDensity · SonarComposite/SonarVulnerabilityDensity
gpt-5.2 · openai · idea 68.1 (up 1.5 since last refresh) · plan 57.6 (up 0.7 since last refresh) · build 58.7 (up 1.5 since last refresh) · review 57.0 (up 1.0 since last refresh)
gpt-5.2

group breakdown

A_B 85.0 (7 / 32) · A_I 86.1 (7 / 32) · A_P 65.4 (10 / 32) · A_R 96.8 (4 / 32) · BUILD 50.8 (25 / 32) · CRE 67.9 (17 / 32) · GEN 52.3 (20 / 32) · LM_ARENA_REVIEW_PROXY 20.9 (23 / 32) · OPS_long 85.0 (8 / 32) · OPS_precision 81.7 (17 / 32) · OPS_review 82.6 (14 / 32) · PLAN 54.5 (22 / 32)

metrics

AI_code 68.5 (9 / 32) · AI_complexity 79.1 (5 / 32) · AI_context_awareness 4.6 (22 / 32) · AI_correctness 100.0 (7 / 32) · AI_edge_cases 100.0 (6 / 32) · AI_efficiency 74.7 (8 / 32) · AI_hallucination_resistance 100.0 (2 / 32) · AI_memory_retention 0.0 (32 / 32) · AI_parameter_accuracy 97.6 (4 / 32) · AI_plan_coherence 1.7 (30 / 32) · AI_recovery 100.0 (6 / 32) · AI_refusal 100.0 (10 / 32) · AI_spec 100.0 (10 / 32) · AI_stability 100.0 (6 / 32) · AI_task_completion 97.2 (3 / 32) · AI_tool_selection 100.0 (2 / 32) · ARC_AGI_2 0.0 (25 / 25) · ArtificialAnalysisCoding 65.6 (12 / 32) · ArtificialAnalysisIntelligence 60.1 (19 / 32) · ArtificialAnalysisReasoning 55.8 (19 / 32) · BlendedCost 78.8 (19 / 32) · ContextWindow 84.6 (16 / 32) · CopilotArenaOrLMArenaCode 29.5 (30 / 32) · GDPval 66.9 (22 / 32) · GPQA_HLE_Reasoning 55.8 (19 / 32) · GSO 64.7 (4 / 16) · IFBench 61.2 (18 / 32) · LMArenaCreativeOrOpenEnded 67.9 (17 / 32) · LMArenaSearchDocument 20.9 (21 / 30) · LMArenaText 67.9 (17 / 32) · LongContextRecall 50.9 (23 / 32) · OutputSpeed 89.7 (7 / 32) · SWEBenchMultilingual 0.0 (27 / 27) · SWEBenchPro 38.2 (28 / 29) · SWEBenchVerified 79.6 (26 / 31) · SWEComposite 45.3 (28 / 32) · SciCode 49.5 (16 / 32) · SonarBugDensity 64.2 (9 / 20) · SonarComposite 59.7 (9 / 32) · SonarFunctionalSkill 67.2 (13 / 20) · SonarIssueDensity 35.7 (9 / 20) · SonarVulnerabilityDensity 73.4 (8 / 20) · TTFT 75.3 (19 / 32) · Tau2Bench 47.3 (25 / 32) · TerminalBench 58.2 (16 / 32)
sources: aistupidlevel · arc_agi · artificial_analysis · gso · lmarena · mcp_atlas · openrouter · overrides · sonar · swebench · swebench_pro · terminal_bench
missing: BUILD/MCPAtlas · PLAN/MCPAtlas · SWEComposite/SWERebench
claude-sonnet-4.5 · anthropic · idea 62.9 (down 0.1 since last refresh) · plan 48.5 (up 0.1 since last refresh) · build 57.4 (down 1.1 since last refresh) · review 48.4 (down 0.1 since last refresh)
claude-sonnet-4.5

group breakdown

A_B 80.9 (9 / 32) · A_I 85.4 (8 / 32) · A_P 74.5 (3 / 32) · A_R 90.1 (7 / 32) · BUILD 51.9 (24 / 32) · CRE 64.2 (18 / 32) · GEN 42.0 (24 / 32) · LM_ARENA_REVIEW_PROXY 2.3 (30 / 32) · OPS_long 77.3 (22 / 32) · OPS_precision 77.2 (20 / 32) · OPS_review 79.8 (19 / 32) · PLAN 39.5 (26 / 32)

metrics

AI_canary_health 85.6 (3 / 7) · AI_code 69.2 (8 / 32) · AI_complexity 60.0 (13 / 32) · AI_context_awareness 92.4 (2 / 32) · AI_correctness 100.0 (4 / 32) · AI_edge_cases 100.0 (3 / 32) · AI_efficiency 77.2 (6 / 32) · AI_hallucination_resistance 60.0 (14 / 32) · AI_memory_retention 17.8 (11 / 32) · AI_parameter_accuracy 79.6 (14 / 32) · AI_plan_coherence 12.4 (24 / 32) · AI_recovery 100.0 (3 / 32) · AI_refusal 100.0 (4 / 32) · AI_spec 100.0 (4 / 32) · AI_stability 100.0 (4 / 32) · AI_task_completion 89.2 (8 / 32) · AI_tool_selection 80.3 (6 / 32) · ARC_AGI_2 3.7 (20 / 25) · ArtificialAnalysisCoding 46.3 (21 / 32) · ArtificialAnalysisIntelligence 46.2 (22 / 32) · ArtificialAnalysisReasoning 33.3 (26 / 32) · BlendedCost 73.0 (25 / 32) · ContextWindow 99.2 (13 / 32) · CopilotArenaOrLMArenaCode 47.1 (26 / 32) · GDPval 91.1 (3 / 32) · GPQA_HLE_Reasoning 33.3 (26 / 32) · GSO 27.3 (12 / 16) · IFBench 40.0 (24 / 32) · LMArenaCreativeOrOpenEnded 64.2 (18 / 32) · LMArenaSearchDocument 2.3 (28 / 30) · LMArenaText 64.2 (18 / 32) · LongContextRecall 63.0 (17 / 32) · MCPAtlas 4.0 (27 / 28) · OutputSpeed 73.4 (26 / 32) · SWEBenchMultilingual 3.5 (26 / 27) · SWEBenchPro 81.2 (21 / 29) · SWEBenchVerified 84.4 (23 / 31) · SWEComposite 71.3 (24 / 32) · SWERebench 74.6 (16 / 31) · SciCode 41.3 (19 / 32) · SonarBugDensity 2.8 (19 / 20) · SonarComposite 15.6 (31 / 32) · SonarFunctionalSkill 17.2 (18 / 20) · SonarIssueDensity 30.0 (10 / 20) · SonarVulnerabilityDensity 4.6 (19 / 20) · TTFT 73.6 (21 / 32) · Tau2Bench 55.9 (22 / 32) · TerminalBench 37.2 (25 / 32)
sources: aistupidlevel · arc_agi · artificial_analysis · gso · lmarena · mcp_atlas · openrouter · overrides · sonar · swebench · swebench_pro · swerebench · terminal_bench
missing: none
glm-4.7 · zai · idea 33.4 (up 0.1 since last refresh) · plan 52.9 · build 56.7 · review 59.9 (up 0.1 since last refresh)
glm-4.7

group breakdown

A_B 97.2 (1 / 32) · A_I 96.5 (1 / 32) · A_P 68.2 (7 / 32) · A_R 98.9 (2 / 32) · BUILD 44.4 (27 / 32) · CRE 6.2 (30 / 32) · GEN 34.6 (25 / 32) · LM_ARENA_REVIEW_PROXY 50.0 (19 / 32) · OPS_long 88.0 (5 / 32) · OPS_precision 90.8 (2 / 32) · OPS_review 88.2 (7 / 32) · PLAN 53.7 (23 / 32)

metrics

AI_code 99.4 (2 / 32) · AI_complexity 100.0 (1 / 32) · AI_context_awareness 0.0 (32 / 32) · AI_correctness 100.0 (11 / 32) · AI_edge_cases 100.0 (10 / 32) · AI_efficiency 77.7 (5 / 32) · AI_hallucination_resistance 100.0 (7 / 32) · AI_memory_retention 99.8 (5 / 32) · AI_parameter_accuracy 0.0 (32 / 32) · AI_plan_coherence 100.0 (5 / 32) · AI_recovery 100.0 (10 / 32) · AI_refusal 100.0 (16 / 32) · AI_spec 100.0 (16 / 32) · AI_stability 89.1 (10 / 32) · AI_task_completion 0.0 (32 / 32) · AI_tool_selection 0.0 (32 / 32) · ArtificialAnalysisCoding 38.4 (25 / 32) · ArtificialAnalysisIntelligence 42.8 (23 / 32) · ArtificialAnalysisReasoning 55.1 (20 / 32) · BlendedCost 94.9 (7 / 32) · ContextWindow 73.7 (23 / 32) · CopilotArenaOrLMArenaCode 65.6 (15 / 32) · GDPval 34.5 (28 / 32) · GPQA_HLE_Reasoning 55.1 (20 / 32) · IFBench 68.5 (16 / 32) · LMArenaCreativeOrOpenEnded 6.2 (30 / 32) · LMArenaText 6.2 (30 / 32) · LongContextRecall 54.5 (22 / 32) · MCPAtlas 0.0 (28 / 28) · OutputSpeed 86.4 (10 / 32) · SWEBenchMultilingual 5.0 (25 / 27) · SWEBenchVerified 89.6 (21 / 31) · SWEComposite 60.5 (25 / 32) · SWERebench 70.5 (19 / 31) · SciCode 43.5 (17 / 32) · SonarBugDensity 51.6 (16 / 20) · SonarComposite 27.3 (29 / 32) · SonarFunctionalSkill 0.0 (20 / 20) · SonarIssueDensity 50.8 (6 / 20) · SonarVulnerabilityDensity 28.7 (16 / 20) · TTFT 99.5 (2 / 32) · Tau2Bench 95.9 (8 / 32) · TerminalBench 27.0 (27 / 32)
sources: aistupidlevel · artificial_analysis · lmarena · mcp_atlas · openrouter · overrides · sonar · swerebench · terminal_bench
missing: BUILD/GSO · GEN/ARC_AGI_2 · LM_ARENA_REVIEW_PROXY/LMArenaSearchDocument · SWEComposite/SWEBenchPro
kimi-k2-0905 · moonshot · idea 21.4 (down 0.6 since last refresh) · plan 24.7 (down 0.3 since last refresh) · build 52.2 (down 1.0 since last refresh) · review 48.3 (down 0.6 since last refresh)
kimi-k2-0905

group breakdown

A_B 29.8 (30 / 32) · A_I 26.0 (31 / 32) · A_P 31.6 (31 / 32) · A_R 27.2 (31 / 32) · BUILD 58.9 (23 / 32) · CRE 25.0 (27 / 32) · GEN 7.4 (32 / 32) · LM_ARENA_REVIEW_PROXY 88.7 (8 / 32) · OPS_long 33.0 (32 / 32) · OPS_precision 55.7 (28 / 32) · OPS_review 50.6 (31 / 32) · PLAN 28.7 (27 / 32)

metrics

AI_canary_health 89.3 (1 / 7) · AI_code 29.6 (26 / 32) · AI_complexity 27.3 (27 / 32) · AI_context_awareness 0.0 (26 / 32) · AI_correctness 0.0 (32 / 32) · AI_edge_cases 0.0 (32 / 32) · AI_efficiency 46.4 (15 / 32) · AI_hallucination_resistance 60.0 (15 / 32) · AI_memory_retention 0.0 (30 / 32) · AI_parameter_accuracy 76.6 (17 / 32) · AI_plan_coherence 15.4 (15 / 32) · AI_recovery 0.0 (32 / 32) · AI_refusal 100.0 (8 / 32) · AI_spec 100.0 (8 / 32) · AI_stability 0.0 (32 / 32) · AI_task_completion 62.9 (18 / 32) · AI_tool_selection 60.7 (15 / 32) · ArtificialAnalysisCoding 2.5 (30 / 32) · ArtificialAnalysisIntelligence 0.0 (31 / 32) · ArtificialAnalysisReasoning 0.0 (31 / 32) · BlendedCost 91.7 (13 / 32) · ContextWindow 39.3 (31 / 32) · CopilotArenaOrLMArenaCode 88.0 (7 / 32) · GDPval 5.0 (31 / 32) · GPQA_HLE_Reasoning 0.0 (31 / 32) · IFBench 0.0 (31 / 32) · LMArenaCreativeOrOpenEnded 25.0 (27 / 32) · LMArenaSearchDocument 88.7 (8 / 30) · LMArenaText 25.0 (27 / 32) · LongContextRecall 0.0 (31 / 32) · MCPAtlas 81.7 (7 / 28) · OutputSpeed 0.0 (32 / 32) · SWEBenchMultilingual 5.0 (23 / 27) · SWEBenchPro 92.5 (15 / 29) · SWEBenchVerified 77.2 (28 / 31) · SWEComposite 81.5 (18 / 32) · SWERebench 92.5 (6 / 31) · SciCode 0.0 (31 / 32) · SonarComposite 50.0 (21 / 32) · TTFT 89.8 (13 / 32) · Tau2Bench 45.3 (26 / 32) · TerminalBench 44.5 (22 / 32)
sources: aistupidlevel · artificial_analysis · lmarena · openrouter · overrides
missing: BUILD/GSO · GEN/ARC_AGI_2 · SonarComposite/SonarBugDensity · SonarComposite/SonarFunctionalSkill · SonarComposite/SonarIssueDensity · SonarComposite/SonarVulnerabilityDensity
grok-4-latest · xai · idea 62.0 (down 0.3 since last refresh) · plan 67.9 (down 0.3 since last refresh) · build 47.9 (down 0.9 since last refresh) · review 52.3 (down 0.5 since last refresh)
grok-4-latest

group breakdown

A_B 46.8 (28 / 32) · A_I 53.2 (28 / 32) · A_P 50.7 (28 / 32) · A_R 51.8 (28 / 32) · BUILD 42.5 (28 / 32) · CRE 58.4 (19 / 32) · GEN 68.9 (11 / 32) · LM_ARENA_REVIEW_PROXY 18.5 (26 / 32) · OPS_long 79.9 (17 / 32) · OPS_precision 77.1 (21 / 32) · OPS_review 77.5 (22 / 32) · PLAN 71.2 (12 / 32)

metrics

AI_code 22.9 (29 / 32) · AI_complexity 42.1 (15 / 32) · AI_context_awareness 0.0 (29 / 32) · AI_correctness 77.9 (28 / 32) · AI_edge_cases 3.7 (31 / 32) · AI_efficiency 0.0 (32 / 32) · AI_hallucination_resistance 100.0 (4 / 32) · AI_memory_retention 99.8 (2 / 32) · AI_parameter_accuracy 0.0 (29 / 32) · AI_plan_coherence 100.0 (2 / 32) · AI_recovery 6.9 (31 / 32) · AI_refusal 68.9 (29 / 32) · AI_spec 68.9 (29 / 32) · AI_stability 71.6 (24 / 32) · AI_task_completion 0.0 (29 / 32) · AI_tool_selection 0.0 (29 / 32) · ARC_AGI_2 20.9 (11 / 25) · ArtificialAnalysisCoding 54.6 (18 / 32) · ArtificialAnalysisIntelligence 85.5 (6 / 32) · ArtificialAnalysisReasoning 85.0 (7 / 32) · BlendedCost 73.0 (27 / 32) · ContextWindow 77.4 (20 / 32) · CopilotArenaOrLMArenaCode 53.6 (24 / 32) · GDPval 11.2 (30 / 32) · GPQA_HLE_Reasoning 85.0 (7 / 32) · IFBench 100.0 (1 / 32) · LMArenaCreativeOrOpenEnded 58.4 (19 / 32) · LMArenaSearchDocument 18.5 (24 / 30) · LMArenaText 58.4 (19 / 32) · LongContextRecall 56.0 (21 / 32) · OutputSpeed 84.3 (12 / 32) · SWEComposite 45.2 (29 / 32) · SWERebench 38.1 (27 / 31) · SciCode 55.6 (13 / 32) · SonarComposite 50.0 (24 / 32) · TTFT 73.0 (22 / 32) · Tau2Bench 100.0 (2 / 32) · TerminalBench 11.7 (29 / 32)
sources: aistupidlevel · arc_agi · artificial_analysis · lmarena · openrouter · overrides · swerebench · terminal_bench
missing: BUILD/GSO · BUILD/MCPAtlas · PLAN/MCPAtlas · SWEComposite/SWEBenchMultilingual · SWEComposite/SWEBenchPro · SWEComposite/SWEBenchVerified · SonarComposite/SonarBugDensity · SonarComposite/SonarFunctionalSkill · SonarComposite/SonarIssueDensity · SonarComposite/SonarVulnerabilityDensity
gemini-2.5-pro · google · idea 21.6 (down 0.5 since last refresh) · plan 32.9 (down 0.3 since last refresh) · build 42.9 (down 0.6 since last refresh) · review 37.4 (up 0.5 since last refresh)
gemini-2.5-pro

group breakdown

A_B 73.2 (11 / 32) · A_I 77.8 (14 / 32) · A_P 71.9 (4 / 32) · A_R 82.5 (12 / 32) · BUILD 34.9 (29 / 32) · CRE 0.0 (32 / 32) · GEN 14.2 (28 / 32) · LM_ARENA_REVIEW_PROXY 0.0 (32 / 32) · OPS_long 80.2 (15 / 32) · OPS_precision 70.9 (25 / 32) · OPS_review 77.2 (23 / 32) · PLAN 26.1 (29 / 32)

metrics

AI_code 60.3 (11 / 32) · AI_complexity 73.9 (10 / 32) · AI_context_awareness 18.3 (4 / 32) · AI_correctness 92.5 (15 / 32) · AI_edge_cases 92.5 (13 / 32) · AI_efficiency 67.6 (11 / 32) · AI_hallucination_resistance 92.5 (9 / 32) · AI_memory_retention 18.0 (8 / 32) · AI_parameter_accuracy 83.7 (10 / 32) · AI_plan_coherence 91.9 (7 / 32) · AI_recovery 92.5 (14 / 32) · AI_refusal 92.5 (18 / 32) · AI_spec 92.5 (18 / 32) · AI_stability 24.9 (27 / 32) · AI_task_completion 89.6 (5 / 32) · AI_tool_selection 66.4 (12 / 32) · ARC_AGI_2 3.7 (21 / 25) · ArtificialAnalysisCoding 23.2 (28 / 32) · ArtificialAnalysisIntelligence 13.9 (28 / 32) · ArtificialAnalysisReasoning 43.5 (23 / 32) · BlendedCost 78.8 (18 / 32) · ContextWindow 100.0 (5 / 32) · CopilotArenaOrLMArenaCode 0.0 (31 / 32) · GDPval 35.7 (26 / 32) · GPQA_HLE_Reasoning 43.5 (23 / 32) · GSO 0.0 (16 / 16) · IFBench 16.8 (29 / 32) · LMArenaCreativeOrOpenEnded 0.0 (32 / 32) · LMArenaSearchDocument 0.0 (30 / 30) · LMArenaText 0.0 (32 / 32) · LongContextRecall 64.5 (15 / 32) · MCPAtlas 58.4 (13 / 28) · OutputSpeed 90.9 (4 / 32) · SWEBenchMultilingual 36.0 (17 / 27) · SWEBenchPro 75.7 (25 / 29) · SWEBenchVerified 33.5 (30 / 31) · SWEComposite 35.1 (30 / 32) · SWERebench 0.0 (31 / 31) · SciCode 30.8 (24 / 32) · SonarBugDensity 52.7 (13 / 20) · SonarComposite 54.2 (13 / 32) · SonarFunctionalSkill 78.9 (8 / 20) · SonarIssueDensity 13.2 (13 / 20) · SonarVulnerabilityDensity 58.2 (12 / 20) · TTFT 36.7 (28 / 32) · Tau2Bench 1.9 (30 / 32) · TerminalBench 1.6 (30 / 32)
sources: arc_agi · artificial_analysis · gso · lmarena · openrouter · swebench · swerebench · terminal_bench
missing: none
claude-sonnet-4 · anthropic · idea 12.7 (down 10.3 since last refresh) · plan 26.8 (down 5.9 since last refresh) · build 42.9 (down 10.6 since last refresh) · review 46.2 (down 8.9 since last refresh)
claude-sonnet-4

group breakdown

A_B 16.4 (32 / 32) · A_I 20.9 (32 / 32) · A_P 24.7 (32 / 32) · A_R 26.7 (32 / 32) · BUILD 46.1 (26 / 32) · CRE 0.0 (31 / 32) · GEN 13.4 (30 / 32) · LM_ARENA_REVIEW_PROXY 86.9 (12 / 32) · OPS_long 77.6 (21 / 32) · OPS_precision 77.0 (22 / 32) · OPS_review 79.8 (20 / 32) · PLAN 27.9 (28 / 32)

metrics

AI_code 0.9 (31 / 32) · AI_complexity 21.4 (30 / 32) · AI_context_awareness 0.0 (24 / 32) · AI_correctness 1.1 (31 / 32) · AI_edge_cases 67.5 (29 / 32) · AI_efficiency 19.0 (28 / 32) · AI_hallucination_resistance 20.0 (28 / 32) · AI_memory_retention 0.0 (27 / 32) · AI_parameter_accuracy 77.8 (15 / 32) · AI_plan_coherence 12.4 (26 / 32) · AI_recovery 71.7 (29 / 32) · AI_refusal 7.6 (30 / 32) · AI_spec 7.6 (30 / 32) · AI_stability 2.5 (31 / 32) · AI_task_completion 75.5 (13 / 32) · AI_tool_selection 87.9 (4 / 32) · ARC_AGI_2 0.2 (24 / 25) · ArtificialAnalysisCoding 30.8 (27 / 32) · ArtificialAnalysisIntelligence 29.7 (25 / 32) · ArtificialAnalysisReasoning 5.0 (30 / 32) · BlendedCost 73.0 (24 / 32) · ContextWindow 99.2 (12 / 32) · CopilotArenaOrLMArenaCode 47.5 (25 / 32) · GDPval 88.8 (5 / 32) · GPQA_HLE_Reasoning 5.0 (30 / 32) · GSO 6.0 (15 / 16) · IFBench 33.0 (26 / 32) · LMArenaCreativeOrOpenEnded 0.0 (31 / 32) · LMArenaSearchDocument 86.9 (12 / 30) · LMArenaText 0.0 (31 / 32) · LiveCodeBench 0.0 (2 / 2) · LongContextRecall 58.0 (19 / 32) · MCPAtlas 10.9 (25 / 28) · OutputSpeed 74.7 (24 / 32) · SWEBenchMultilingual 10.4 (21 / 27) · SWEBenchPro 78.4 (23 / 29) · SWEBenchVerified 67.4 (29 / 31) · SWEComposite 60.3 (26 / 32) · SWERebench 54.4 (25 / 31) · SciCode 15.4 (28 / 32) · SonarBugDensity 0.0 (20 / 20) · SonarComposite 19.5 (30 / 32) · SonarFunctionalSkill 26.4 (17 / 20) · SonarIssueDensity 35.8 (8 / 20) · SonarVulnerabilityDensity 0.0 (20 / 20) · TTFT 71.8 (23 / 32) · Tau2Bench 25.5 (28 / 32) · TerminalBench 47.3 (20 / 32)
sources: aistupidlevel · arc_agi · artificial_analysis · gso · livecodebench · lmarena · openrouter · sonar · swebench · swebench_pro · swerebench
missing: none
grok-code-fast-1 · xai · idea 48.9 (down 0.2 since last refresh) · plan 27.4 (down 0.1 since last refresh) · build 41.3 (down 0.3 since last refresh) · review 36.8
grok-code-fast-1

group breakdown

A_B 88.0 (4 / 32) · A_I 90.8 (2 / 32) · A_P 66.7 (8 / 32) · A_R 98.7 (3 / 32) · BUILD 29.2 (30 / 32) · CRE 47.2 (24 / 32) · GEN 15.6 (27 / 32) · LM_ARENA_REVIEW_PROXY 15.7 (28 / 32) · OPS_long 86.0 (6 / 32) · OPS_precision 85.1 (13 / 32) · OPS_review 85.1 (11 / 32) · PLAN 12.6 (32 / 32)

metrics

AI_code 87.1 (7 / 32) · AI_complexity 74.4 (9 / 32) · AI_context_awareness 0.0 (30 / 32) · AI_correctness 100.0 (10 / 32) · AI_edge_cases 100.0 (8 / 32) · AI_efficiency 17.1 (29 / 32) · AI_hallucination_resistance 100.0 (5 / 32) · AI_memory_retention 99.8 (3 / 32) · AI_parameter_accuracy 0.0 (30 / 32) · AI_plan_coherence 100.0 (3 / 32) · AI_recovery 100.0 (9 / 32) · AI_refusal 100.0 (14 / 32) · AI_spec 100.0 (14 / 32) · AI_stability 100.0 (8 / 32) · AI_task_completion 0.0 (30 / 32) · AI_tool_selection 0.0 (30 / 32) · ARC_AGI_2 25.3 (10 / 25) · ArtificialAnalysisCoding 0.0 (32 / 32) · ArtificialAnalysisIntelligence 0.0 (32 / 32) · ArtificialAnalysisReasoning 0.0 (32 / 32) · BlendedCost 98.5 (4 / 32) · ContextWindow 77.4 (21 / 32) · CopilotArenaOrLMArenaCode 0.0 (32 / 32) · GDPval 5.0 (32 / 32) · GPQA_HLE_Reasoning 0.0 (32 / 32) · IFBench 0.0 (32 / 32) · LMArenaCreativeOrOpenEnded 47.2 (24 / 32) · LMArenaSearchDocument 15.7 (26 / 30) · LMArenaText 47.2 (24 / 32) · LongContextRecall 0.0 (32 / 32) · OutputSpeed 89.4 (9 / 32) · SWEBenchVerified 81.5 (24 / 31) · SWEComposite 45.4 (27 / 32) · SWERebench 26.7 (29 / 31) · SciCode 0.0 (32 / 32) · SonarComposite 50.0 (25 / 32) · TTFT 77.0 (18 / 32) · Tau2Bench 50.5 (24 / 32) · TerminalBench 0.0 (32 / 32)
sources: aistupidlevel · artificial_analysis · lmarena · openrouter · overrides · swerebench · terminal_bench
missing: BUILD/GSO · BUILD/MCPAtlas · PLAN/MCPAtlas · SWEComposite/SWEBenchMultilingual · SWEComposite/SWEBenchPro · SonarComposite/SonarBugDensity · SonarComposite/SonarFunctionalSkill · SonarComposite/SonarIssueDensity · SonarComposite/SonarVulnerabilityDensity
gemini-2.5-flash · google · idea 39.4 (down 8.3 since last refresh) · plan 24.5 (down 4.6 since last refresh) · build 32.4 (down 9.5 since last refresh) · review 38.9 (down 7.7 since last refresh)
gemini-2.5-flash

group breakdown

A_B 30.5 (29 / 32) · A_I 35.0 (29 / 32) · A_P 41.9 (29 / 32) · A_R 43.1 (29 / 32) · BUILD 28.5 (31 / 32) · CRE 44.7 (25 / 32) · GEN 14.2 (29 / 32) · LM_ARENA_REVIEW_PROXY 79.4 (17 / 32) · OPS_long 91.6 (3 / 32) · OPS_precision 85.1 (12 / 32) · OPS_review 89.0 (5 / 32) · PLAN 15.2 (31 / 32)

metrics

AI_code 0.0 (32 / 32) · AI_complexity 0.0 (32 / 32) · AI_context_awareness 100.0 (1 / 32) · AI_correctness 25.8 (30 / 32) · AI_edge_cases 74.7 (19 / 32) · AI_efficiency 71.7 (9 / 32) · AI_hallucination_resistance 75.5 (13 / 32) · AI_memory_retention 10.8 (13 / 32) · AI_parameter_accuracy 100.0 (2 / 32) · AI_plan_coherence 11.7 (27 / 32) · AI_recovery 47.6 (30 / 32) · AI_refusal 0.0 (32 / 32) · AI_spec 0.0 (32 / 32) · AI_stability 73.4 (23 / 32) · AI_task_completion 73.9 (15 / 32) · AI_tool_selection 41.6 (28 / 32) · ARC_AGI_2 0.7 (23 / 25) · ArtificialAnalysisCoding 0.0 (31 / 32) · ArtificialAnalysisIntelligence 0.4 (30 / 32) · ArtificialAnalysisReasoning 14.9 (27 / 32) · BlendedCost 93.4 (11 / 32) · ContextWindow 100.0 (4 / 32) · CopilotArenaOrLMArenaCode 62.4 (20 / 32) · GDPval 37.8 (24 / 32) · GPQA_HLE_Reasoning 14.9 (27 / 32) · GSO 19.4 (13 / 16) · IFBench 26.5 (28 / 32) · LMArenaCreativeOrOpenEnded 44.7 (25 / 32) · LMArenaSearchDocument 79.4 (17 / 30) · LMArenaText 44.7 (25 / 32) · LiveCodeBench 100.0 (1 / 2) · LongContextRecall 56.0 (20 / 32) · MCPAtlas 21.9 (23 / 28) · OutputSpeed 100.0 (1 / 32) · SWEBenchMultilingual 92.5 (9 / 27) · SWEBenchPro 52.5 (27 / 29) · SWEBenchVerified 0.0 (31 / 31) · SWEComposite 27.6 (31 / 32) · SWERebench 0.0 (30 / 31) · SciCode 18.2 (27 / 32) · SonarBugDensity 52.7 (12 / 20) · SonarComposite 54.2 (12 / 32) · SonarFunctionalSkill 78.9 (7 / 20) · SonarIssueDensity 13.2 (12 / 20) · SonarVulnerabilityDensity 58.2 (11 / 20) · TTFT 61.3 (27 / 32) · Tau2Bench 0.0 (31 / 32) · TerminalBench 0.1 (31 / 32)
sources: aistupidlevel · arc_agi · artificial_analysis · livecodebench · lmarena · openrouter · swebench · swerebench · terminal_bench
missing: none
glm-4.6 · zai · idea 32.4 (down 0.4 since last refresh) · plan 27.3 (down 0.1 since last refresh) · build 31.7 (down 0.8 since last refresh) · review 38.3 (down 0.3 since last refresh)
glm-4.6

group breakdown

A_B 70.8 (14 / 32) · A_I 82.9 (10 / 32) · A_P 64.4 (13 / 32) · A_R 90.6 (6 / 32) · BUILD 19.2 (32 / 32) · CRE 21.7 (28 / 32) · GEN 12.1 (31 / 32) · LM_ARENA_REVIEW_PROXY 50.0 (18 / 32) · OPS_long 70.9 (26 / 32) · OPS_precision 81.2 (18 / 32) · OPS_review 78.7 (21 / 32) · PLAN 16.1 (30 / 32)

metrics

AI_code 40.7 (15 / 32) · AI_complexity 29.7 (25 / 32) · AI_context_awareness 0.0 (31 / 32) · AI_correctness 89.6 (18 / 32) · AI_edge_cases 100.0 (9 / 32) · AI_efficiency 10.4 (30 / 32) · AI_hallucination_resistance 100.0 (6 / 32) · AI_memory_retention 99.8 (4 / 32) · AI_parameter_accuracy 0.0 (31 / 32) · AI_plan_coherence 100.0 (4 / 32) · AI_recovery 91.2 (17 / 32) · AI_refusal 100.0 (15 / 32) · AI_spec 100.0 (15 / 32) · AI_stability 100.0 (9 / 32) · AI_task_completion 0.0 (31 / 32) · AI_tool_selection 0.0 (31 / 32) · ArtificialAnalysisCoding 14.9 (29 / 32) · ArtificialAnalysisIntelligence 5.8 (29 / 32) · ArtificialAnalysisReasoning 13.5 (28 / 32) · BlendedCost 94.5 (8 / 32) · ContextWindow 73.9 (22 / 32) · CopilotArenaOrLMArenaCode 36.4 (29 / 32) · GDPval 16.6 (29 / 32) · GPQA_HLE_Reasoning 13.5 (28 / 32) · IFBench 2.5 (30 / 32) · LMArenaCreativeOrOpenEnded 21.7 (28 / 32) · LMArenaText 21.7 (28 / 32) · LongContextRecall 5.5 (30 / 32) · MCPAtlas 7.5 (26 / 28) · OutputSpeed 55.9 (30 / 32) · SWEBenchMultilingual 5.0 (24 / 27) · SWEBenchPro 0.0 (29 / 29) · SWEBenchVerified 77.2 (27 / 31) · SWEComposite 27.0 (32 / 32) · SWERebench 37.3 (28 / 31) · SciCode 6.7 (30 / 32) · SonarBugDensity 7.5 (18 / 20) · SonarComposite 10.7 (32 / 32) · SonarFunctionalSkill 7.5 (19 / 20) · SonarIssueDensity 7.5 (17 / 20) · SonarVulnerabilityDensity 29.0 (15 / 20) · TTFT 98.3 (4 / 32) · Tau2Bench 38.8 (27 / 32) · TerminalBench 13.8 (28 / 32)
sources: aistupidlevel · artificial_analysis · lmarena · openrouter · overrides · swebench · swebench_pro · swerebench · terminal_bench
missing: BUILD/GSO · GEN/ARC_AGI_2 · LM_ARENA_REVIEW_PROXY/LMArenaSearchDocument