Models drift. Agents battle. Math decides.
live · refreshed · 14 sources · 32 models
- claude-opus-4.7 88.6
- claude-opus-4.6 83.6
- kimi-k2.6 81.7
leaders now
idea
1. claude-opus-4.7 96.2 ▲+1.6
2. claude-opus-4.6 93.4 ▲+6.8
3. gemini-3.1-pro-preview 86.5 ▼-9.6

plan
1. claude-opus-4.7 85.6 ▲+0.4
2. gemini-3.1-pro-preview 85.0 ▼-5.2
3. gpt-5.5 82.8 ▼-0.1

build
1. claude-opus-4.7 86.7 ▲+2.9
2. claude-opus-4.6 86.2 ▲+8.2
3. gpt-5.5 82.1 ▲+0.1

review
1. claude-opus-4.7 86.1 ▲+1.3
2. kimi-k2.6 84.2 ▼-0.5
3. deepseek-v4-pro 81.4 ▼-0.4
how scoring works
Each model gets four role scores from public benchmarks. Idea measures open-ended creativity. Plan measures structured reasoning, function calling, and multi-step decomposition. Build measures implementation skill on SWE-bench, LiveCodeBench, and terminal tasks. Review measures preference judgment.
scoring
Each role score is the benchmark composite for that role, normalized to 0-100 and combined via weighted average of group scores. See the about page for the full math.
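As a rough illustration of the weighted average described above, here is a minimal Python sketch. The min-max normalization, the group names, and the weights are all assumptions made for the example; the about page is the authority on the actual math.

```python
from typing import Dict


def normalize_0_100(raw: Dict[str, float]) -> Dict[str, float]:
    """Scale raw values across models to 0-100.

    Min-max scaling is an assumption; the page only says scores are
    "normalized to 0-100".
    """
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:
        # All models tie: place everyone at the midpoint.
        return {model: 50.0 for model in raw}
    return {model: 100.0 * (value - lo) / (hi - lo) for model, value in raw.items()}


def role_score(group_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of group scores; weights need not sum to 1."""
    total_weight = sum(weights[group] for group in group_scores)
    weighted_sum = sum(score * weights[group] for group, score in group_scores.items())
    return weighted_sum / total_weight


# Hypothetical groups and weights, for illustration only.
example = role_score({"group_a": 80.0, "group_b": 90.0},
                     {"group_a": 3.0, "group_b": 1.0})
print(round(example, 1))
```

Dividing by the summed weights keeps the result on the same 0-100 scale as the group scores, whatever scale the weights themselves use.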
missing data
If a model is missing some metrics within a group, the group score depends on how much of the group is covered. From 60% to 80% group coverage, the score blends from the shrink-to-50 estimate to the mean of the present metrics. At 80% coverage and above, the present-weight mean is trusted directly.
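One way to read the coverage rule is sketched below in Python. The linear ramp between 60% and 80% and the exact form of the shrink-to-50 estimate (missing metrics treated as if they scored 50) are assumptions, and `blend_group_score` is a hypothetical name, not the site's implementation.

```python
def blend_group_score(present_mean: float, coverage: float,
                      lo: float = 0.60, hi: float = 0.80,
                      prior: float = 50.0) -> float:
    """Blend a shrunk estimate with the present-metric mean by coverage.

    present_mean: weighted mean of the metrics the model actually has (0-100).
    coverage: fraction of the group that is present, between 0 and 1.
    """
    # Trust is 0 at or below `lo`, 1 at or above `hi`, linear in between (assumed).
    trust = min(1.0, max(0.0, (coverage - lo) / (hi - lo)))
    # Assumed shrink-to-50 form: missing metrics count as if they scored 50.
    shrunk = prior + coverage * (present_mean - prior)
    return (1.0 - trust) * shrunk + trust * present_mean


for cov in (0.5, 0.7, 0.8, 1.0):
    print(cov, round(blend_group_score(90.0, cov), 1))
```

Under these assumptions a strong score on sparse data is pulled toward 50, and the pull fades out smoothly as coverage approaches 80% rather than switching off at a hard threshold.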
| model | provider | idea | plan | build | review |
| --- | --- | --- | --- | --- | --- |
| claude-opus-4.7 | anthropic | 96.2 ▲+1.6 | 85.6 ▲+0.4 | 86.7 ▲+2.9 | 86.1 ▲+1.3 |

group breakdown: A_B 88.5 (5/32), A_I 97.4 (2/32), A_P 78.5 (2/32), A_R 82.8 (12/32), BUILD 88.2 (1/32), CRE 97.7 (4/32), GEN 98.0 (2/32), LM_ARENA_REVIEW_PROXY 100.0 (1/32), OPS_long 78.0 (24/32), OPS_precision 74.2 (25/32), OPS_review 76.7 (25/32), PLAN 81.5 (5/32)
metrics: AI_code 97.5 (3/32), AI_complexity 99.9 (2/32), AI_context_awareness 11.9 (9/32), AI_correctness 100.0 (4/32), AI_edge_cases 100.0 (4/32), AI_efficiency 91.9 (4/32), AI_hallucination_resistance 0.0 (32/32), AI_memory_retention 43.0 (13/32), AI_parameter_accuracy 52.9 (26/32), AI_plan_coherence 80.4 (5/32), AI_recovery 100.0 (4/32), AI_refusal 100.0 (4/32), AI_spec 100.0 (4/32), AI_stability 100.0 (1/32), AI_task_completion 75.7 (8/32), AI_tool_selection 81.2 (6/32), ARC_AGI_2 93.5 (3/25), ArtificialAnalysisCoding 94.3 (3/32), ArtificialAnalysisIntelligence 100.0 (1/32), ArtificialAnalysisReasoning 97.4 (3/32), BlendedCost 47.9 (30/32), ContextWindow 99.2 (11/32), CopilotArenaOrLMArenaCode 100.0 (2/32), GDPval 95.0 (1/32), GPQA_HLE_Reasoning 97.4 (3/32), GSO 100.0 (1/16), IFBench 44.7 (20/32), LMArenaCreativeOrOpenEnded 97.7 (4/32), LMArenaSearchDocument 100.0 (1/30), LMArenaText 97.7 (4/32), LongContextRecall 86.4 (9/32), MCPAtlas 100.0 (1/28), OutputSpeed 79.0 (16/32), SWEBenchMultilingual 95.0 (3/27), SWEBenchPro 95.0 (2/29), SWEBenchVerified 95.0 (4/31), SWEComposite 91.1 (6/32), SWERebench 85.3 (10/31), SciCode 95.2 (3/32), SonarBugDensity 65.5 (12/20), SonarComposite 56.3 (15/32), SonarFunctionalSkill 93.9 (2/20), SonarIssueDensity 8.1 (17/20), SonarVulnerabilityDensity 24.2 (17/20), TTFT 74.4 (26/32), Tau2Bench 79.6 (16/32), TerminalBench 78.2 (4/32)
| claude-opus-4.6 | anthropic | 93.4 ▲+6.8 | 79.0 ▲+4.0 | 86.2 ▲+8.2 | 75.7 ▲+6.3 |

group breakdown: A_B 91.5 (4/32), A_I 86.0 (8/32), A_P 71.1 (4/32), A_R 96.2 (5/32), BUILD 87.8 (2/32), CRE 100.0 (1/32), GEN 89.9 (3/32), LM_ARENA_REVIEW_PROXY 32.5 (20/32), OPS_long 76.2 (25/32), OPS_precision 73.4 (26/32), OPS_review 75.8 (26/32), PLAN 75.5 (11/32)
metrics: AI_canary_health 59.1 (5/6), AI_code 87.0 (5/32), AI_complexity 81.1 (6/32), AI_context_awareness 3.6 (20/32), AI_correctness 100.0 (3/32), AI_edge_cases 100.0 (3/32), AI_efficiency 78.2 (8/32), AI_hallucination_resistance 100.0 (2/32), AI_memory_retention 99.5 (3/32), AI_parameter_accuracy 63.0 (25/32), AI_plan_coherence 36.9 (13/32), AI_recovery 100.0 (3/32), AI_refusal 100.0 (3/32), AI_spec 100.0 (3/32), AI_stability 74.9 (15/32), AI_task_completion 65.0 (13/32), AI_tool_selection 100.0 (1/32), ARC_AGI_2 91.8 (4/25), ArtificialAnalysisCoding 79.1 (5/32), ArtificialAnalysisIntelligence 84.3 (6/32), ArtificialAnalysisReasoning 87.5 (6/32), BlendedCost 47.9 (29/32), ContextWindow 99.2 (10/32), CopilotArenaOrLMArenaCode 100.0 (1/32), GDPval 84.4 (7/32), GPQA_HLE_Reasoning 87.5 (6/32), GSO 75.3 (3/16), IFBench 29.5 (27/32), LMArenaCreativeOrOpenEnded 100.0 (1/32), LMArenaSearchDocument 32.5 (18/30), LMArenaText 100.0 (1/32), LongContextRecall 88.0 (6/32), MCPAtlas 93.5 (2/28), OutputSpeed 75.6 (21/32), SWEBenchMultilingual 91.9 (14/27), SWEBenchPro 100.0 (1/29), SWEBenchVerified 99.4 (3/31), SWEComposite 95.7 (2/32), SWERebench 91.6 (8/31), SciCode 80.9 (6/32), SonarBugDensity 72.0 (9/20), SonarComposite 74.5 (5/32), SonarFunctionalSkill 92.2 (4/20), SonarIssueDensity 54.8 (7/20), SonarVulnerabilityDensity 63.6 (9/20), TTFT 75.0 (25/32), Tau2Bench 87.5 (12/32), TerminalBench 64.2 (12/32)
| gpt-5.5 | openai | 67.1 ▼-0.3 | 82.8 ▼-0.1 | 82.1 ▲+0.1 | 76.7 ▼-0.4 |

group breakdown: A_B 65.9 (14/32), A_I 73.6 (12/32), A_P 57.8 (16/32), A_R 80.7 (13/32), BUILD 85.7 (3/32), CRE 53.6 (22/32), GEN 87.6 (4/32), LM_ARENA_REVIEW_PROXY 28.6 (21/32), OPS_long 78.7 (23/32), OPS_precision 75.4 (24/32), OPS_review 76.9 (24/32), PLAN 88.9 (2/32)
metrics: AI_code 43.9 (27/32), AI_complexity 30.8 (14/32), AI_context_awareness 0.0 (28/32), AI_correctness 92.9 (17/32), AI_edge_cases 58.0 (16/32), AI_efficiency 58.9 (14/32), AI_hallucination_resistance 74.9 (16/32), AI_memory_retention 16.5 (14/32), AI_parameter_accuracy 91.5 (5/32), AI_plan_coherence 22.0 (15/32), AI_recovery 96.7 (17/32), AI_refusal 100.0 (13/32), AI_spec 100.0 (13/32), AI_stability 78.9 (12/32), AI_task_completion 56.3 (26/32), AI_tool_selection 34.1 (27/32), ARC_AGI_2 97.7 (2/25), ArtificialAnalysisCoding 100.0 (2/32), ArtificialAnalysisIntelligence 98.9 (3/32), ArtificialAnalysisReasoning 100.0 (2/32), BlendedCost 36.4 (31/32), ContextWindow 100.0 (2/32), CopilotArenaOrLMArenaCode 66.1 (14/32), GDPval 95.0 (2/32), GPQA_HLE_Reasoning 100.0 (2/32), GSO 94.0 (2/16), IFBench 78.8 (12/32), LMArenaCreativeOrOpenEnded 53.6 (22/32), LMArenaSearchDocument 28.6 (19/30), LMArenaText 53.6 (22/32), LongContextRecall 96.5 (4/32), MCPAtlas 59.7 (12/28), OutputSpeed 78.5 (17/32), SWEBenchPro 95.0 (10/29), SWEBenchVerified 95.0 (10/31), SWEComposite 89.9 (8/32), SWERebench 83.5 (12/31), SciCode 89.7 (5/32), SonarBugDensity 96.2 (2/20), SonarComposite 67.0 (6/32), SonarFunctionalSkill 46.5 (15/20), SonarIssueDensity 59.8 (5/20), SonarVulnerabilityDensity 94.8 (2/20), TTFT 84.4 (16/32), Tau2Bench 86.8 (13/32), TerminalBench 100.0 (1/32)
| kimi-k2.6 | moonshot | 80.8 ▼-0.8 | 81.9 ▼-0.1 | 80.0 ▼-0.6 | 84.2 ▼-0.5 |

group breakdown: A_B 58.1 (19/32), A_I 71.4 (15/32), A_P 56.5 (17/32), A_R 68.9 (18/32), BUILD 83.5 (4/32), CRE 80.8 (8/32), GEN 86.2 (5/32), LM_ARENA_REVIEW_PROXY 95.6 (2/32), OPS_long 81.1 (17/32), OPS_precision 85.4 (15/32), OPS_review 83.6 (16/32), PLAN 86.7 (3/32)
metrics: AI_code 43.9 (26/32), AI_complexity 16.7 (27/32), AI_context_awareness 0.0 (25/32), AI_correctness 92.9 (16/32), AI_edge_cases 58.0 (15/32), AI_efficiency 56.5 (16/32), AI_hallucination_resistance 5.7 (30/32), AI_memory_retention 0.0 (31/32), AI_parameter_accuracy 77.8 (12/32), AI_plan_coherence 15.8 (28/32), AI_recovery 96.7 (16/32), AI_refusal 100.0 (9/32), AI_spec 100.0 (9/32), AI_stability 78.9 (11/32), AI_task_completion 62.5 (15/32), AI_tool_selection 48.1 (26/32), ArtificialAnalysisCoding 75.7 (8/32), ArtificialAnalysisIntelligence 88.2 (4/32), ArtificialAnalysisReasoning 89.0 (5/32), BlendedCost 88.1 (15/32), ContextWindow 77.8 (19/32), CopilotArenaOrLMArenaCode 93.3 (5/32), GDPval 69.3 (13/32), GPQA_HLE_Reasoning 89.0 (5/32), IFBench 92.7 (7/32), LMArenaCreativeOrOpenEnded 80.8 (8/32), LMArenaSearchDocument 95.6 (2/30), LMArenaText 80.8 (8/32), LongContextRecall 83.0 (10/32), MCPAtlas 81.7 (8/28), OutputSpeed 75.4 (22/32), SWEBenchMultilingual 95.0 (7/27), SWEBenchPro 95.0 (8/29), SWEBenchVerified 95.0 (8/31), SWEComposite 94.0 (4/32), SWERebench 92.5 (7/31), SciCode 89.7 (4/32), SonarComposite 50.0 (22/32), TTFT 95.8 (8/32), Tau2Bench 96.0 (6/32), TerminalBench 74.5 (6/32)
| claude-opus-4.5 | anthropic | 76.1 ▼-1.1 | 70.5 ▼-0.4 | 78.2 ▼-1.5 | 66.0 ▼-0.6 |

group breakdown: A_B 87.7 (6/32), A_I 86.0 (9/32), A_P 79.4 (1/32), A_R 94.8 (7/32), BUILD 78.3 (8/32), CRE 74.9 (13/32), GEN 73.2 (10/32), LM_ARENA_REVIEW_PROXY 11.0 (29/32), OPS_long 73.7 (26/32), OPS_precision 70.5 (27/32), OPS_review 70.2 (27/32), PLAN 65.7 (14/32)
metrics: AI_canary_health 88.5 (2/6), AI_code 75.8 (8/32), AI_complexity 70.4 (8/32), AI_context_awareness 13.0 (8/32), AI_correctness 100.0 (2/32), AI_edge_cases 100.0 (2/32), AI_efficiency 78.8 (7/32), AI_hallucination_resistance 100.0 (1/32), AI_memory_retention 99.5 (2/32), AI_parameter_accuracy 100.0 (1/32), AI_plan_coherence 51.6 (7/32), AI_recovery 100.0 (2/32), AI_refusal 100.0 (2/32), AI_spec 100.0 (2/32), AI_stability 71.8 (25/32), AI_task_completion 100.0 (1/32), AI_tool_selection 99.8 (2/32), ARC_AGI_2 85.5 (5/25), ArtificialAnalysisCoding 78.1 (6/32), ArtificialAnalysisIntelligence 72.0 (11/32), ArtificialAnalysisReasoning 63.6 (14/32), BlendedCost 47.9 (28/32), ContextWindow 73.5 (27/32), CopilotArenaOrLMArenaCode 75.5 (9/32), GDPval 82.5 (9/32), GPQA_HLE_Reasoning 63.6 (14/32), GSO 59.3 (5/16), IFBench 42.8 (22/32), LMArenaCreativeOrOpenEnded 74.9 (13/32), LMArenaSearchDocument 11.0 (27/30), LMArenaText 74.9 (13/32), LongContextRecall 100.0 (1/32), MCPAtlas 57.3 (15/28), OutputSpeed 77.6 (18/32), SWEBenchMultilingual 95.0 (2/27), SWEBenchPro 88.4 (19/29), SWEBenchVerified 92.0 (17/31), SWEComposite 84.7 (14/32), SWERebench 76.3 (14/31), SciCode 67.7 (10/32), SonarBugDensity 81.8 (5/20), SonarComposite 89.0 (1/32), SonarFunctionalSkill 100.0 (1/20), SonarIssueDensity 80.6 (3/20), SonarVulnerabilityDensity 83.3 (4/20), TTFT 76.0 (22/32), Tau2Bench 81.5 (15/32), TerminalBench 54.7 (18/32)
| deepseek-v4-pro | deepseek | 76.0 ▼-0.5 | 78.4 | 77.9 ▼-0.5 | 81.4 ▼-0.4 |

group breakdown: A_B 61.5 (18/32), A_I 69.2 (16/32), A_P 58.5 (14/32), A_R 71.3 (16/32), BUILD 80.4 (6/32), CRE 77.0 (11/32), GEN 79.5 (6/32), LM_ARENA_REVIEW_PROXY 88.8 (7/32), OPS_long 70.9 (27/32), OPS_precision 82.8 (17/32), OPS_review 83.2 (17/32), PLAN 83.3 (4/32)
metrics: AI_code 51.4 (15/32), AI_complexity 21.7 (16/32), AI_context_awareness 7.5 (11/32), AI_correctness 86.4 (18/32), AI_edge_cases 56.8 (17/32), AI_efficiency 68.3 (11/32), AI_hallucination_resistance 39.0 (19/32), AI_memory_retention 7.5 (17/32), AI_parameter_accuracy 79.9 (10/32), AI_plan_coherence 21.0 (16/32), AI_recovery 89.7 (18/32), AI_refusal 92.5 (18/32), AI_spec 92.5 (18/32), AI_stability 74.5 (16/32), AI_task_completion 71.3 (12/32), AI_tool_selection 67.4 (10/32), ArtificialAnalysisCoding 77.0 (7/32), ArtificialAnalysisIntelligence 78.9 (8/32), ArtificialAnalysisReasoning 84.1 (7/32), BlendedCost 98.1 (5/32), ContextWindow 100.0 (3/32), CopilotArenaOrLMArenaCode 72.5 (11/32), GDPval 68.2 (16/32), GPQA_HLE_Reasoning 84.1 (7/32), IFBench 94.0 (5/32), LMArenaCreativeOrOpenEnded 77.0 (11/32), LMArenaSearchDocument 88.8 (7/30), LMArenaText 77.0 (11/32), LongContextRecall 66.2 (14/32), MCPAtlas 81.7 (6/28), OutputSpeed 49.1 (31/32), SWEBenchMultilingual 95.0 (5/27), SWEBenchPro 95.0 (4/29), SWEBenchVerified 95.0 (6/31), SWEComposite 94.0 (3/32), SWERebench 92.5 (5/31), SciCode 70.4 (9/32), SonarComposite 50.0 (17/32), TTFT 95.5 (9/32), Tau2Bench 96.7 (4/32), TerminalBench 69.9 (10/32)
| glm-5.1 | zai | 83.7 ▼-0.7 | 75.8 ▼-0.1 | 77.0 ▼-0.6 | 79.1 ▼-0.4 |

group breakdown: A_B 56.9 (27/32), A_I 68.2 (26/32), A_P 55.5 (25/32), A_R 66.1 (27/32), BUILD 80.2 (7/32), CRE 90.9 (5/32), GEN 78.9 (7/32), LM_ARENA_REVIEW_PROXY 88.8 (11/32), OPS_long 80.6 (19/32), OPS_precision 85.7 (14/32), OPS_review 83.1 (18/32), PLAN 78.7 (8/32)
metrics: AI_code 44.8 (24/32), AI_complexity 21.7 (24/32), AI_context_awareness 7.5 (19/32), AI_correctness 86.4 (26/32), AI_edge_cases 56.8 (25/32), AI_efficiency 55.6 (24/32), AI_hallucination_resistance 12.3 (29/32), AI_memory_retention 7.5 (25/32), AI_parameter_accuracy 73.6 (20/32), AI_plan_coherence 21.0 (24/32), AI_recovery 89.7 (26/32), AI_refusal 92.5 (26/32), AI_spec 92.5 (26/32), AI_stability 74.5 (24/32), AI_task_completion 60.6 (23/32), AI_tool_selection 48.4 (25/32), ArtificialAnalysisCoding 62.9 (13/32), ArtificialAnalysisIntelligence 78.5 (9/32), ArtificialAnalysisReasoning 63.2 (15/32), BlendedCost 87.5 (17/32), ContextWindow 73.7 (25/32), CopilotArenaOrLMArenaCode 98.0 (3/32), GDPval 74.7 (10/32), GPQA_HLE_Reasoning 63.2 (15/32), IFBench 93.4 (6/32), LMArenaCreativeOrOpenEnded 90.9 (5/32), LMArenaSearchDocument 88.8 (11/30), LMArenaText 90.9 (5/32), LongContextRecall 46.0 (26/32), MCPAtlas 87.3 (3/28), OutputSpeed 74.5 (27/32), SWEBenchMultilingual 92.5 (13/27), SWEBenchPro 95.0 (14/29), SWEBenchVerified 92.5 (16/31), SWEComposite 96.4 (1/32), SWERebench 100.0 (2/31), SciCode 36.3 (21/32), SonarComposite 50.0 (27/32), TTFT 99.3 (3/32), Tau2Bench 100.0 (3/32), TerminalBench 73.2 (9/32)
| claude-sonnet-4.6 | anthropic | 76.9 ▲+1.7 | 61.4 ▲+0.5 | 75.9 ▲+2.7 | 64.7 ▲+1.3 |

group breakdown: A_B 93.9 (3/32), A_I 90.0 (6/32), A_P 65.3 (9/32), A_R 100.0 (2/32), BUILD 75.9 (10/32), CRE 79.0 (10/32), GEN 67.2 (11/32), LM_ARENA_REVIEW_PROXY 22.4 (22/32), OPS_long 69.6 (28/32), OPS_precision 59.0 (28/32), OPS_review 67.4 (28/32), PLAN 56.7 (20/32)
metrics: AI_canary_health 89.5 (1/6), AI_code 99.9 (2/32), AI_complexity 85.8 (4/32), AI_context_awareness 0.0 (22/32), AI_correctness 100.0 (5/32), AI_edge_cases 100.0 (5/32), AI_efficiency 92.1 (3/32), AI_hallucination_resistance 100.0 (4/32), AI_memory_retention 0.0 (28/32), AI_parameter_accuracy 100.0 (2/32), AI_plan_coherence 20.0 (25/32), AI_recovery 100.0 (6/32), AI_refusal 100.0 (5/32), AI_spec 100.0 (5/32), AI_stability 100.0 (2/32), AI_task_completion 71.5 (11/32), AI_tool_selection 80.7 (7/32), ARC_AGI_2 10.6 (17/25), ArtificialAnalysisCoding 88.8 (4/32), ArtificialAnalysisIntelligence 79.7 (7/32), ArtificialAnalysisReasoning 68.9 (11/32), BlendedCost 73.0 (26/32), ContextWindow 99.2 (14/32), CopilotArenaOrLMArenaCode 95.1 (4/32), GDPval 88.8 (6/32), GPQA_HLE_Reasoning 68.9 (11/32), GSO 30.7 (11/16), IFBench 39.1 (24/32), LMArenaCreativeOrOpenEnded 79.0 (10/32), LMArenaSearchDocument 22.4 (20/30), LMArenaText 79.0 (10/32), LongContextRecall 88.0 (7/32), MCPAtlas 55.7 (16/28), OutputSpeed 80.7 (14/32), SWEBenchMultilingual 95.0 (4/27), SWEBenchPro 76.5 (24/29), SWEBenchVerified 90.0 (20/31), SWEComposite 88.1 (11/32), SWERebench 95.8 (3/31), SciCode 52.8 (14/32), SonarBugDensity 76.4 (7/20), SonarComposite 60.7 (10/32), SonarFunctionalSkill 84.5 (5/20), SonarIssueDensity 34.0 (11/20), SonarVulnerabilityDensity 20.9 (18/20), TTFT 15.2 (30/32), Tau2Bench 50.6 (22/32), TerminalBench 47.3 (21/32)
| qwen3.6-plus | alibaba | 68.3 ▼-0.7 | 70.7 ▼-0.1 | 74.8 ▼-0.6 | 78.7 ▼-0.4 |

group breakdown: A_B 56.9 (23/32), A_I 68.2 (22/32), A_P 55.5 (21/32), A_R 66.1 (23/32), BUILD 76.7 (9/32), CRE 69.4 (15/32), GEN 60.9 (15/32), LM_ARENA_REVIEW_PROXY 88.8 (9/32), OPS_long 84.5 (13/32), OPS_precision 88.6 (8/32), OPS_review 89.4 (5/32), PLAN 79.2 (7/32)
metrics: AI_code 44.8 (20/32), AI_complexity 21.7 (20/32), AI_context_awareness 7.5 (15/32), AI_correctness 86.4 (22/32), AI_edge_cases 56.8 (21/32), AI_efficiency 55.6 (20/32), AI_hallucination_resistance 12.3 (25/32), AI_memory_retention 7.5 (21/32), AI_parameter_accuracy 73.6 (16/32), AI_plan_coherence 21.0 (20/32), AI_recovery 89.7 (22/32), AI_refusal 92.5 (22/32), AI_spec 92.5 (22/32), AI_stability 74.5 (20/32), AI_task_completion 60.6 (19/32), AI_tool_selection 48.4 (21/32), ARC_AGI_2 11.9 (16/25), ArtificialAnalysisCoding 61.2 (14/32), ArtificialAnalysisIntelligence 73.2 (10/32), ArtificialAnalysisReasoning 61.3 (17/32), BlendedCost 95.0 (6/32), ContextWindow 99.2 (15/32), CopilotArenaOrLMArenaCode 73.2 (10/32), GDPval 73.3 (11/32), GPQA_HLE_Reasoning 61.3 (17/32), IFBench 90.4 (9/32), LMArenaCreativeOrOpenEnded 69.4 (15/32), LMArenaSearchDocument 88.8 (9/30), LMArenaText 69.4 (15/32), LongContextRecall 83.0 (11/32), MCPAtlas 76.5 (9/28), OutputSpeed 76.1 (20/32), SWEBenchMultilingual 92.5 (10/27), SWEBenchPro 95.0 (11/29), SWEBenchVerified 95.0 (11/31), SWEComposite 85.9 (13/32), SWERebench 72.8 (18/31), SciCode 19.3 (26/32), SonarBugDensity 92.5 (3/20), SonarComposite 80.1 (4/32), SonarFunctionalSkill 66.8 (14/20), SonarIssueDensity 92.5 (2/20), SonarVulnerabilityDensity 78.3 (7/20), TTFT 91.1 (12/32), Tau2Bench 100.0 (1/32), TerminalBench 67.6 (11/32)
| deepseek-v4-flash | deepseek | 66.2 ▼-0.6 | 72.3 | 74.6 ▼-0.6 | 79.2 ▼-0.4 |

group breakdown: A_B 63.5 (15/32), A_I 72.6 (14/32), A_P 60.0 (13/32), A_R 75.0 (15/32), BUILD 74.6 (11/32), CRE 62.0 (18/32), GEN 64.1 (13/32), LM_ARENA_REVIEW_PROXY 88.8 (6/32), OPS_long 87.9 (6/32), OPS_precision 91.2 (4/32), OPS_review 88.3 (7/32), PLAN 78.6 (9/32)
metrics: AI_canary_health 84.3 (3/6), AI_code 51.6 (13/32), AI_complexity 16.7 (25/32), AI_context_awareness 0.0 (23/32), AI_correctness 92.9 (14/32), AI_edge_cases 58.0 (13/32), AI_efficiency 71.6 (10/32), AI_hallucination_resistance 37.1 (21/32), AI_memory_retention 0.0 (29/32), AI_parameter_accuracy 85.2 (8/32), AI_plan_coherence 15.8 (27/32), AI_recovery 96.7 (14/32), AI_refusal 100.0 (6/32), AI_spec 100.0 (6/32), AI_stability 78.9 (10/32), AI_task_completion 75.0 (9/32), AI_tool_selection 70.5 (8/32), ArtificialAnalysisCoding 46.7 (20/32), ArtificialAnalysisIntelligence 59.7 (19/32), ArtificialAnalysisReasoning 77.3 (9/32), BlendedCost 99.8 (2/32), ContextWindow 70.3 (30/32), CopilotArenaOrLMArenaCode 86.8 (6/32), GDPval 68.2 (15/32), GPQA_HLE_Reasoning 77.3 (9/32), IFBench 100.0 (1/32), LMArenaCreativeOrOpenEnded 62.0 (18/32), LMArenaSearchDocument 88.8 (6/30), LMArenaText 62.0 (18/32), LongContextRecall 49.3 (24/32), MCPAtlas 81.7 (5/28), OutputSpeed 86.5 (10/32), SWEBenchMultilingual 59.1 (15/27), SWEBenchPro 95.0 (3/29), SWEBenchVerified 95.0 (5/31), SWEComposite 90.4 (7/32), SWERebench 92.5 (4/31), SciCode 42.4 (18/32), SonarComposite 50.0 (16/32), TTFT 99.2 (4/32), Tau2Bench 94.1 (10/32), TerminalBench 60.9 (15/32)
| gpt-5.3-codex | openai | 71.0 ▼-0.5 | 49.2 ▼-0.1 | 74.2 ▼-1.0 | 69.5 ▼-0.3 |

group breakdown: A_B 86.0 (7/32), A_I 87.0 (7/32), A_P 58.3 (15/32), A_R 90.4 (8/32), BUILD 74.2 (12/32), CRE 75.3 (12/32), GEN 48.1 (21/32), LM_ARENA_REVIEW_PROXY 92.5 (3/32), OPS_long 84.5 (12/32), OPS_precision 80.9 (21/32), OPS_review 81.9 (21/32), PLAN 41.5 (24/32)
metrics: AI_code 82.4 (6/32), AI_complexity 84.6 (5/32), AI_context_awareness 0.0 (27/32), AI_correctness 100.0 (8/32), AI_edge_cases 100.0 (8/32), AI_efficiency 81.3 (6/32), AI_hallucination_resistance 53.6 (18/32), AI_memory_retention 14.5 (16/32), AI_parameter_accuracy 87.3 (7/32), AI_plan_coherence 0.0 (32/32), AI_recovery 100.0 (9/32), AI_refusal 100.0 (11/32), AI_spec 100.0 (11/32), AI_stability 100.0 (5/32), AI_task_completion 53.8 (27/32), AI_tool_selection 54.9 (16/32), ARC_AGI_2 72.5 (8/25), ArtificialAnalysisCoding 43.6 (22/32), ArtificialAnalysisIntelligence 29.4 (26/32), ArtificialAnalysisReasoning 33.5 (25/32), BlendedCost 75.3 (22/32), ContextWindow 84.6 (17/32), CopilotArenaOrLMArenaCode 42.7 (28/32), GDPval 68.8 (14/32), GPQA_HLE_Reasoning 33.5 (25/32), GSO 53.4 (8/16), IFBench 59.9 (19/32), LMArenaCreativeOrOpenEnded 75.3 (12/32), LMArenaSearchDocument 92.5 (3/30), LMArenaText 75.3 (12/32), LongContextRecall 42.3 (27/32), OutputSpeed 89.5 (8/32), SWEBenchPro 95.0 (9/29), SWEBenchVerified 92.5 (14/31), SWEComposite 92.1 (5/32), SWERebench 89.4 (9/31), SciCode 40.3 (20/32), SonarBugDensity 84.4 (4/20), SonarComposite 61.6 (8/32), SonarFunctionalSkill 72.3 (11/20), SonarIssueDensity 7.5 (18/20), SonarVulnerabilityDensity 92.5 (3/20), TTFT 75.3 (24/32), Tau2Bench 7.5 (29/32), TerminalBench 74.3 (7/32)
| gemini-3.1-pro-preview | | 86.5 ▼-9.6 | 85.0 ▼-5.2 | 73.0 ▼-8.8 | 78.4 ▼-8.8 |

group breakdown: A_B 22.1 (31/32), A_I 17.1 (31/32), A_P 36.7 (31/32), A_R 28.6 (30/32), BUILD 81.0 (5/32), CRE 100.0 (2/32), GEN 100.0 (1/32), LM_ARENA_REVIEW_PROXY 92.2 (4/32), OPS_long 87.1 (9/32), OPS_precision 82.4 (19/32), OPS_review 85.3 (12/32), PLAN 91.1 (1/32)
metrics: AI_code 7.5 (30/32), AI_complexity 7.5 (30/32), AI_context_awareness 15.3 (7/32), AI_correctness 7.5 (30/32), AI_edge_cases 22.6 (29/32), AI_efficiency 27.6 (29/32), AI_hallucination_resistance 92.5 (15/32), AI_memory_retention 92.5 (11/32), AI_parameter_accuracy 65.9 (24/32), AI_plan_coherence 43.8 (10/32), AI_recovery 33.7 (29/32), AI_refusal 7.5 (30/32), AI_spec 7.5 (30/32), AI_stability 7.5 (30/32), AI_task_completion 92.5 (5/32), AI_tool_selection 67.1 (13/32), ARC_AGI_2 100.0 (1/25), ArtificialAnalysisCoding 100.0 (1/32), ArtificialAnalysisIntelligence 100.0 (2/32), ArtificialAnalysisReasoning 100.0 (1/32), BlendedCost 76.0 (21/32), ContextWindow 100.0 (7/32), CopilotArenaOrLMArenaCode 69.4 (12/32), GDPval 49.3 (23/32), GPQA_HLE_Reasoning 100.0 (1/32), GSO 51.3 (9/16), IFBench 95.9 (4/32), LMArenaCreativeOrOpenEnded 100.0 (2/32), LMArenaSearchDocument 92.2 (4/30), LMArenaText 100.0 (2/32), LongContextRecall 98.1 (3/32), MCPAtlas 58.4 (14/28), OutputSpeed 91.5 (5/32), SWEBenchMultilingual 36.0 (18/27), SWEBenchPro 89.1 (18/29), SWEBenchVerified 95.0 (7/31), SWEComposite 89.0 (9/32), SWERebench 100.0 (1/31), SciCode 100.0 (2/32), SonarBugDensity 65.0 (16/20), SonarComposite 59.3 (14/32), SonarFunctionalSkill 78.9 (10/20), SonarIssueDensity 25.2 (15/20), SonarVulnerabilityDensity 56.0 (14/20), TTFT 70.7 (29/32), Tau2Bench 95.4 (8/32), TerminalBench 89.5 (3/32)
| mimo-v2.5-pro | xiaomi | 78.8 ▼-0.7 | 75.1 ▼-0.1 | 71.6 ▼-0.6 | 76.8 ▼-0.4 |

group breakdown: A_B 56.9 (25/32), A_I 68.2 (24/32), A_P 55.5 (23/32), A_R 66.1 (25/32), BUILD 72.2 (13/32), CRE 83.3 (7/32), GEN 74.2 (9/32), LM_ARENA_REVIEW_PROXY 84.3 (16/32), OPS_long 83.8 (14/32), OPS_precision 86.7 (11/32), OPS_review 87.8 (9/32), PLAN 80.0 (6/32)
metrics: AI_code 44.8 (22/32), AI_complexity 21.7 (22/32), AI_context_awareness 7.5 (17/32), AI_correctness 86.4 (24/32), AI_edge_cases 56.8 (23/32), AI_efficiency 55.6 (22/32), AI_hallucination_resistance 12.3 (27/32), AI_memory_retention 7.5 (23/32), AI_parameter_accuracy 73.6 (18/32), AI_plan_coherence 21.0 (22/32), AI_recovery 89.7 (24/32), AI_refusal 92.5 (24/32), AI_spec 92.5 (24/32), AI_stability 74.5 (22/32), AI_task_completion 60.6 (21/32), AI_tool_selection 48.4 (23/32), ARC_AGI_2 20.3 (13/25), ArtificialAnalysisCoding 70.1 (11/32), ArtificialAnalysisIntelligence 87.8 (5/32), ArtificialAnalysisReasoning 75.0 (10/32), BlendedCost 87.5 (16/32), ContextWindow 100.0 (9/32), CopilotArenaOrLMArenaCode 77.4 (8/32), GDPval 68.2 (21/32), GPQA_HLE_Reasoning 75.0 (10/32), IFBench 100.0 (2/32), LMArenaCreativeOrOpenEnded 83.3 (7/32), LMArenaSearchDocument 84.3 (16/30), LMArenaText 83.3 (7/32), LongContextRecall 100.0 (2/32), MCPAtlas 32.4 (21/28), OutputSpeed 76.6 (19/32), SWEBenchMultilingual 92.5 (12/27), SWEBenchPro 95.0 (13/29), SWEBenchVerified 95.0 (12/31), SWEComposite 82.1 (15/32), SWERebench 63.5 (23/31), SciCode 71.5 (8/32), SonarComposite 50.0 (26/32), TTFT 89.3 (14/32), Tau2Bench 92.1 (11/32), TerminalBench 76.8 (5/32)
| claude-opus-4.1 | anthropic | 61.1 ▼-0.6 | 63.2 ▼-0.1 | 69.7 ▼-1.1 | 57.7 ▼-0.3 |

group breakdown: A_B 79.7 (10/32), A_I 84.9 (10/32), A_P 70.1 (5/32), A_R 85.0 (9/32), BUILD 71.0 (14/32), CRE 53.0 (23/32), GEN 65.7 (12/32), LM_ARENA_REVIEW_PROXY 0.4 (31/32), OPS_long 65.9 (29/32), OPS_precision 58.3 (29/32), OPS_review 58.4 (29/32), PLAN 62.1 (18/32)
metrics: AI_canary_health 53.6 (6/6), AI_code 80.3 (7/32), AI_complexity 76.7 (7/32), AI_context_awareness 72.9 (3/32), AI_correctness 100.0 (1/32), AI_edge_cases 100.0 (1/32), AI_efficiency 65.6 (13/32), AI_hallucination_resistance 37.1 (20/32), AI_memory_retention 0.0 (26/32), AI_parameter_accuracy 52.1 (27/32), AI_plan_coherence 36.9 (12/32), AI_recovery 100.0 (1/32), AI_refusal 100.0 (1/32), AI_spec 100.0 (1/32), AI_stability 77.1 (13/32), AI_task_completion 59.6 (25/32), AI_tool_selection 88.3 (4/32), ARC_AGI_2 83.5 (6/25), ArtificialAnalysisCoding 73.9 (9/32), ArtificialAnalysisIntelligence 68.7 (14/32), ArtificialAnalysisReasoning 61.6 (16/32), BlendedCost 0.0 (32/32), ContextWindow 73.5 (26/32), CopilotArenaOrLMArenaCode 47.2 (26/32), GDPval 82.5 (8/32), GPQA_HLE_Reasoning 61.6 (16/32), GSO 57.9 (6/16), IFBench 43.9 (21/32), LMArenaCreativeOrOpenEnded 53.0 (23/32), LMArenaSearchDocument 0.4 (29/30), LMArenaText 53.0 (23/32), LongContextRecall 92.5 (5/32), MCPAtlas 86.9 (4/28), OutputSpeed 73.5 (28/32), SWEBenchMultilingual 92.5 (8/27), SWEBenchPro 82.6 (20/29), SWEBenchVerified 91.5 (18/31), SWEComposite 72.5 (22/32), SWERebench 51.5 (26/31), SciCode 65.0 (11/32), SonarBugDensity 77.0 (6/20), SonarComposite 83.2 (3/32), SonarFunctionalSkill 92.5 (3/20), SonarIssueDensity 76.0 (4/20), SonarVulnerabilityDensity 78.3 (6/20), TTFT 72.1 (28/32), Tau2Bench 76.8 (17/32), TerminalBench 29.3 (26/32)
| minimax-m2.7 | minimax | 49.0 ▼-0.7 | 62.3 ▼-0.1 | 68.4 ▼-0.6 | 71.3 ▼-0.4 |

group breakdown: A_B 56.9 (21/32), A_I 68.2 (20/32), A_P 55.5 (19/32), A_R 66.1 (21/32), BUILD 68.9 (15/32), CRE 36.7 (26/32), GEN 52.7 (19/32), LM_ARENA_REVIEW_PROXY 84.3 (14/32), OPS_long 81.5 (15/32), OPS_precision 87.0 (10/32), OPS_review 84.7 (13/32), PLAN 66.7 (12/32)
metrics: AI_code 44.8 (18/32), AI_complexity 21.7 (18/32), AI_context_awareness 7.5 (13/32), AI_correctness 86.4 (20/32), AI_edge_cases 56.8 (19/32), AI_efficiency 55.6 (18/32), AI_hallucination_resistance 12.3 (23/32), AI_memory_retention 7.5 (19/32), AI_parameter_accuracy 73.6 (14/32), AI_plan_coherence 21.0 (18/32), AI_recovery 89.7 (20/32), AI_refusal 92.5 (20/32), AI_spec 92.5 (20/32), AI_stability 74.5 (18/32), AI_task_completion 60.6 (17/32), AI_tool_selection 48.4 (19/32), ARC_AGI_2 11.9 (15/25), ArtificialAnalysisCoding 57.7 (17/32), ArtificialAnalysisIntelligence 71.6 (12/32), ArtificialAnalysisReasoning 64.7 (13/32), BlendedCost 99.1 (3/32), ContextWindow 73.2 (29/32), CopilotArenaOrLMArenaCode 54.8 (21/32), GDPval 68.2 (18/32), GPQA_HLE_Reasoning 64.7 (13/32), IFBench 91.9 (8/32), LMArenaCreativeOrOpenEnded 36.7 (26/32), LMArenaSearchDocument 84.3 (14/30), LMArenaText 36.7 (26/32), LongContextRecall 77.9 (12/32), MCPAtlas 32.4 (19/28), OutputSpeed 75.2 (23/32), SWEBenchMultilingual 95.0 (6/27), SWEBenchPro 95.0 (6/29), SWEBenchVerified 92.5 (13/31), SWEComposite 86.0 (12/32), SWERebench 73.3 (17/31), SciCode 53.9 (13/32), SonarComposite 50.0 (19/32), TTFT 96.2 (7/32), Tau2Bench 71.0 (19/32), TerminalBench 61.4 (13/32)
| glm-5 | zai | 68.8 ▼-0.7 | 62.8 ▼-0.1 | 68.2 ▼-0.6 | 71.9 ▼-0.4 |

group breakdown: A_B 56.9 (26/32), A_I 68.2 (25/32), A_P 55.5 (24/32), A_R 66.1 (26/32), BUILD 68.4 (17/32), CRE 73.2 (14/32), GEN 54.2 (17/32), LM_ARENA_REVIEW_PROXY 88.8 (10/32), OPS_long 87.0 (10/32), OPS_precision 90.0 (7/32), OPS_review 87.4 (10/32), PLAN 66.2 (13/32)
metrics: AI_code 44.8 (23/32), AI_complexity 21.7 (23/32), AI_context_awareness 7.5 (18/32), AI_correctness 86.4 (25/32), AI_edge_cases 56.8 (24/32), AI_efficiency 55.6 (23/32), AI_hallucination_resistance 12.3 (28/32), AI_memory_retention 7.5 (24/32), AI_parameter_accuracy 73.6 (19/32), AI_plan_coherence 21.0 (23/32), AI_recovery 89.7 (25/32), AI_refusal 92.5 (25/32), AI_spec 92.5 (25/32), AI_stability 74.5 (23/32), AI_task_completion 60.6 (22/32), AI_tool_selection 48.4 (24/32), ARC_AGI_2 5.2 (19/25), ArtificialAnalysisCoding 40.1 (24/32), ArtificialAnalysisIntelligence 60.8 (17/32), ArtificialAnalysisReasoning 53.2 (22/32), BlendedCost 92.5 (12/32), ContextWindow 73.7 (24/32), CopilotArenaOrLMArenaCode 64.5 (18/32), GDPval 73.3 (12/32), GPQA_HLE_Reasoning 53.2 (22/32), IFBench 85.0 (10/32), LMArenaCreativeOrOpenEnded 73.2 (14/32), LMArenaSearchDocument 88.8 (10/30), LMArenaText 73.2 (14/32), LongContextRecall 37.5 (28/32), MCPAtlas 47.2 (17/28), OutputSpeed 84.8 (13/32), SWEBenchMultilingual 51.2 (16/27), SWEBenchPro 92.5 (17/29), SWEBenchVerified 91.0 (19/31), SWEComposite 81.9 (16/32), SWERebench 76.9 (13/31), SciCode 35.2 (22/32), SonarBugDensity 100.0 (1/20), SonarComposite 85.4 (2/32), SonarFunctionalSkill 69.8 (12/20), SonarIssueDensity 100.0 (1/20), SonarVulnerabilityDensity 83.3 (5/20), TTFT 100.0 (1/32), Tau2Bench 100.0 (2/32), TerminalBench 55.8 (17/32)
| gpt-5.4 | openai | 70.4 ▼-2.9 | 47.6 ▼-1.2 | 67.8 ▼-2.7 | 53.5 ▼-1.7 |

group breakdown: A_B 66.0 (13/32), A_I 68.6 (18/32), A_P 53.1 (27/32), A_R 69.6 (17/32), BUILD 68.6 (16/32), CRE 79.8 (9/32), GEN 45.1 (22/32), LM_ARENA_REVIEW_PROXY 16.1 (26/32), OPS_long 91.2 (3/32), OPS_precision 88.0 (9/32), OPS_review 89.4 (4/32), PLAN 40.5 (25/32)
metrics: AI_code 61.4 (11/32), AI_complexity 64.3 (10/32), AI_context_awareness 1.9 (21/32), AI_correctness 100.0 (9/32), AI_edge_cases 100.0 (9/32), AI_efficiency 72.4 (9/32), AI_hallucination_resistance 1.6 (31/32), AI_memory_retention 0.0 (32/32), AI_parameter_accuracy 85.2 (9/32), AI_plan_coherence 0.2 (31/32), AI_recovery 100.0 (10/32), AI_refusal 100.0 (12/32), AI_spec 100.0 (12/32), AI_stability 2.2 (31/32), AI_task_completion 73.6 (10/32), AI_tool_selection 83.6 (5/32), ARC_AGI_2 76.5 (7/25), ArtificialAnalysisCoding 33.9 (26/32), ArtificialAnalysisIntelligence 27.4 (27/32), ArtificialAnalysisReasoning 12.4 (29/32), BlendedCost 73.7 (23/32), ContextWindow 100.0 (1/32), CopilotArenaOrLMArenaCode 48.2 (23/32), GDPval 90.7 (4/32), GPQA_HLE_Reasoning 12.4 (29/32), GSO 54.0 (7/16), IFBench 60.7 (18/32), LMArenaCreativeOrOpenEnded 79.8 (9/32), LMArenaSearchDocument 16.1 (24/30), LMArenaText 79.8 (9/32), LongContextRecall 20.7 (29/32), MCPAtlas 59.7 (11/28), OutputSpeed 93.8 (3/32), SWEBenchPro 92.5 (16/29), SWEBenchVerified 95.0 (9/31), SWEComposite 88.9 (10/32), SWERebench 83.5 (11/31), SciCode 6.7 (29/32), SonarBugDensity 0.0 (20/20), SonarComposite 30.0 (29/32), SonarFunctionalSkill 37.5 (16/20), SonarIssueDensity 0.0 (20/20), SonarVulnerabilityDensity 100.0 (1/20), TTFT 86.2 (15/32), Tau2Bench 0.0 (32/32), TerminalBench 100.0 (2/32)
| mimo-v2.5 | xiaomi | 59.6 ▼-0.7 | 61.8 ▼-0.1 | 66.4 ▼-0.6 | 69.9 ▼-0.4 |

group breakdown: A_B 56.9 (24/32), A_I 68.2 (23/32), A_P 55.5 (22/32), A_R 66.1 (24/32), BUILD 65.9 (19/32), CRE 54.5 (21/32), GEN 55.4 (16/32), LM_ARENA_REVIEW_PROXY 84.3 (15/32), OPS_long 89.6 (4/32), OPS_precision 91.4 (3/32), OPS_review 92.3 (3/32), PLAN 63.1 (17/32)
metrics: AI_code 44.8 (21/32), AI_complexity 21.7 (21/32), AI_context_awareness 7.5 (16/32), AI_correctness 86.4 (23/32), AI_edge_cases 56.8 (22/32), AI_efficiency 55.6 (21/32), AI_hallucination_resistance 12.3 (26/32), AI_memory_retention 7.5 (22/32), AI_parameter_accuracy 73.6 (17/32), AI_plan_coherence 21.0 (21/32), AI_recovery 89.7 (23/32), AI_refusal 92.5 (23/32), AI_spec 92.5 (23/32), AI_stability 74.5 (21/32), AI_task_completion 60.6 (20/32), AI_tool_selection 48.4 (22/32), ARC_AGI_2 20.3 (12/25), ArtificialAnalysisCoding 58.4 (16/32), ArtificialAnalysisIntelligence 69.3 (13/32), ArtificialAnalysisReasoning 53.2 (21/32), BlendedCost 94.1 (10/32), ContextWindow 100.0 (8/32), CopilotArenaOrLMArenaCode 65.9 (15/32), GDPval 68.2 (20/32), GPQA_HLE_Reasoning 53.2 (21/32), IFBench 68.2 (16/32), LMArenaCreativeOrOpenEnded 54.5 (21/32), LMArenaSearchDocument 84.3 (15/30), LMArenaText 54.5 (21/32), LongContextRecall 47.6 (25/32), MCPAtlas 32.4 (20/28), OutputSpeed 85.3 (11/32), SWEBenchMultilingual 92.5 (11/27), SWEBenchPro 95.0 (12/29), SWEBenchVerified 92.5 (15/31), SWEComposite 81.8 (17/32), SWERebench 63.5 (22/31), SciCode 32.5 (23/32), SonarComposite 50.0 (25/32), TTFT 91.4 (11/32), Tau2Bench 84.2 (14/32), TerminalBench 73.3 (8/32)
| kimi-k2.5 | moonshot | 59.1 ▼-0.7 | 61.8 ▼-0.1 | 62.2 ▼-0.6 | 69.1 ▼-0.4 |

group breakdown: A_B 56.9 (22/32), A_I 68.2 (21/32), A_P 55.5 (20/32), A_R 66.1 (22/32), BUILD 60.3 (20/32), CRE 55.5 (19/32), GEN 54.0 (18/32), LM_ARENA_REVIEW_PROXY 90.3 (5/32), OPS_long 81.3 (16/32), OPS_precision 86.3 (12/32), OPS_review 84.5 (14/32), PLAN 64.9 (15/32)
metrics: AI_code 44.8 (19/32), AI_complexity 21.7 (19/32), AI_context_awareness 7.5 (14/32), AI_correctness 86.4 (21/32), AI_edge_cases 56.8 (20/32), AI_efficiency 55.6 (19/32), AI_hallucination_resistance 12.3 (24/32), AI_memory_retention 7.5 (20/32), AI_parameter_accuracy 73.6 (15/32), AI_plan_coherence 21.0 (19/32), AI_recovery 89.7 (21/32), AI_refusal 92.5 (21/32), AI_spec 92.5 (21/32), AI_stability 74.5 (19/32), AI_task_completion 60.6 (18/32), AI_tool_selection 48.4 (20/32), ARC_AGI_2 15.0 (14/25), ArtificialAnalysisCoding 49.8 (19/32), ArtificialAnalysisIntelligence 60.8 (16/32), ArtificialAnalysisReasoning 68.5 (12/32), BlendedCost 94.4 (9/32), ContextWindow 77.8 (18/32), CopilotArenaOrLMArenaCode 55.0 (20/32), GDPval 68.2 (19/32), GPQA_HLE_Reasoning 68.5 (12/32), IFBench 76.7 (14/32), LMArenaCreativeOrOpenEnded 55.5 (19/32), LMArenaSearchDocument 90.3 (5/30), LMArenaText 55.5 (19/32), LongContextRecall 61.1 (19/32), MCPAtlas 29.3 (22/28), OutputSpeed 74.8 (25/32), SWEBenchMultilingual 8.8 (22/27), SWEBenchPro 95.0 (7/29), SWEBenchVerified 85.0 (22/31), SWEComposite 73.2 (21/32), SWERebench 65.8 (21/31), SciCode 64.9 (12/32), SonarComposite 50.0 (21/32), TTFT 95.1 (10/32), Tau2Bench 96.0 (5/32), TerminalBench 41.9 (23/32)
| minimax-m2.5 | minimax | 31.4 ▼-0.7 | 51.9 ▼-0.1 | 61.4 ▼-0.6 | 66.4 ▼-0.4 |

group breakdown: A_B 56.9 (20/32), A_I 68.2 (19/32), A_P 55.5 (18/32), A_R 66.1 (20/32), BUILD 59.3 (22/32), CRE 13.5 (29/32), GEN 29.0 (26/32), LM_ARENA_REVIEW_PROXY 84.3 (13/32), OPS_long 87.2 (8/32), OPS_precision 90.4 (6/32), OPS_review 88.0 (8/32), PLAN 59.4 (19/32)
metrics: AI_code 44.8 (17/32), AI_complexity 21.7 (17/32), AI_context_awareness 7.5 (12/32), AI_correctness 86.4 (19/32), AI_edge_cases 56.8 (18/32), AI_efficiency 55.6 (17/32), AI_hallucination_resistance 12.3 (22/32), AI_memory_retention 7.5 (18/32), AI_parameter_accuracy 73.6 (13/32), AI_plan_coherence 21.0 (17/32), AI_recovery 89.7 (19/32), AI_refusal 92.5 (19/32), AI_spec 92.5 (19/32), AI_stability 74.5 (17/32), AI_task_completion 60.6 (16/32), AI_tool_selection 48.4 (18/32), ARC_AGI_2 5.2 (18/25), ArtificialAnalysisCoding 42.2 (23/32), ArtificialAnalysisIntelligence 42.0 (23/32), ArtificialAnalysisReasoning 40.1 (24/32), BlendedCost 100.0 (1/32), ContextWindow 73.2 (28/32), CopilotArenaOrLMArenaCode 46.1 (27/32), GDPval 68.2 (17/32), GPQA_HLE_Reasoning 40.1 (24/32), IFBench 80.6 (11/32), LMArenaCreativeOrOpenEnded 13.5 (29/32), LMArenaSearchDocument 84.3 (13/30), LMArenaText 13.5 (29/32), LongContextRecall 64.5 (17/32), MCPAtlas 32.4 (18/28), OutputSpeed 85.2 (12/32), SWEBenchMultilingual 26.5 (20/27), SWEBenchPro 95.0 (5/29), SWEBenchVerified 100.0 (2/31), SWEComposite 75.9 (19/32), SWERebench 62.4 (24/31), SciCode 29.7 (25/32), SonarComposite 50.0 (18/32), TTFT 96.7 (6/32), Tau2Bench 94.7 (9/32), TerminalBench 40.4 (24/32)
| gpt-5.2 | openai | 69.6 ▲+0.7 | 57.1 ▲+0.3 | 60.6 ▲+1.3 | 57.5 ▲+0.7 |
group breakdown: A_B 95.4 (2/32); A_I 90.7 (5/32); A_P 60.4 (12/32); A_R 100.0 (1/32); BUILD 51.3 (25/32); CRE 69.3 (16/32); GEN 52.6 (20/32); LM_ARENA_REVIEW_PROXY 19.7 (23/32); OPS_long 84.9 (11/32); OPS_precision 81.6 (20/32); OPS_review 82.6 (19/32); PLAN 54.8 (21/32)
metrics: AI_code 100.0 (1/32); AI_complexity 100.0 (1/32); AI_context_awareness 0.0 (26/32); AI_correctness 100.0 (7/32); AI_edge_cases 100.0 (7/32); AI_efficiency 84.4 (5/32); AI_hallucination_resistance 100.0 (8/32); AI_memory_retention 15.4 (15/32); AI_parameter_accuracy 22.5 (28/32); AI_plan_coherence 19.9 (26/32); AI_recovery 100.0 (8/32); AI_refusal 100.0 (10/32); AI_spec 100.0 (10/32); AI_stability 100.0 (4/32); AI_task_completion 60.4 (24/32); AI_tool_selection 59.5 (15/32); ARC_AGI_2 0.0 (25/25); ArtificialAnalysisCoding 65.6 (12/32); ArtificialAnalysisIntelligence 60.1 (18/32); ArtificialAnalysisReasoning 55.8 (19/32); BlendedCost 78.8 (19/32); ContextWindow 84.6 (16/32); CopilotArenaOrLMArenaCode 29.5 (30/32); GDPval 66.9 (22/32); GPQA_HLE_Reasoning 55.8 (19/32); GSO 64.7 (4/16); IFBench 63.0 (17/32); LMArenaCreativeOrOpenEnded 69.3 (16/32); LMArenaSearchDocument 19.7 (21/30); LMArenaText 69.3 (16/32); LongContextRecall 51.0 (23/32); OutputSpeed 89.5 (7/32); SWEBenchMultilingual 0.0 (27/27); SWEBenchPro 38.2 (28/29); SWEBenchVerified 79.6 (26/31); SWEComposite 45.3 (28/32); SciCode 49.5 (15/32); SonarBugDensity 75.3 (8/20); SonarComposite 63.8 (7/32); SonarFunctionalSkill 67.2 (13/20); SonarIssueDensity 45.4 (9/20); SonarVulnerabilityDensity 70.2 (8/20); TTFT 75.3 (23/32); Tau2Bench 47.3 (25/32); TerminalBench 58.2 (16/32)
| gemini-3-pro | | 76.4 ▼-11.3 | 66.9 ▼-6.1 | 59.0 ▼-10.3 | 54.8 ▼-10.3 |
group breakdown: A_B 17.2 (32/32); A_I 11.3 (32/32); A_P 34.4 (32/32); A_R 24.9 (32/32); BUILD 68.1 (18/32); CRE 98.4 (3/32); GEN 75.5 (8/32); LM_ARENA_REVIEW_PROXY 19.2 (24/32); OPS_long 58.4 (31/32); OPS_precision 42.9 (32/32); OPS_review 42.9 (32/32); PLAN 75.9 (10/32)
metrics: AI_code 0.0 (32/32); AI_complexity 0.0 (32/32); AI_context_awareness 9.2 (10/32); AI_correctness 0.0 (32/32); AI_edge_cases 17.8 (30/32); AI_efficiency 23.6 (30/32); AI_hallucination_resistance 100.0 (6/32); AI_memory_retention 100.0 (1/32); AI_parameter_accuracy 68.8 (21/32); AI_plan_coherence 42.7 (11/32); AI_recovery 30.9 (30/32); AI_refusal 0.0 (32/32); AI_spec 0.0 (31/32); AI_stability 0.0 (32/32); AI_task_completion 100.0 (2/32); AI_tool_selection 70.1 (9/32); ARC_AGI_2 42.2 (9/25); ArtificialAnalysisCoding 73.6 (10/32); ArtificialAnalysisIntelligence 67.0 (15/32); ArtificialAnalysisReasoning 91.1 (4/32); BlendedCost 76.0 (20/32); ContextWindow 0.0 (32/32); CopilotArenaOrLMArenaCode 65.6 (16/32); GDPval 34.9 (27/32); GPQA_HLE_Reasoning 91.1 (4/32); GSO 40.7 (10/16); IFBench 77.3 (13/32); LMArenaCreativeOrOpenEnded 98.4 (3/32); LMArenaSearchDocument 19.2 (22/30); LMArenaText 98.4 (3/32); LongContextRecall 88.0 (8/32); MCPAtlas 59.8 (10/28); OutputSpeed 92.4 (4/32); SWEBenchMultilingual 33.5 (19/27); SWEBenchPro 80.3 (22/29); SWEBenchVerified 81.4 (25/31); SWEComposite 71.7 (23/32); SWERebench 70.2 (20/31); SciCode 100.0 (1/32); SonarBugDensity 67.6 (10/20); SonarComposite 60.9 (9/32); SonarFunctionalSkill 84.1 (6/20); SonarIssueDensity 20.8 (16/20); SonarVulnerabilityDensity 57.0 (10/20); TTFT 0.0 (32/32); Tau2Bench 76.3 (18/32); TerminalBench 61.2 (14/32)
| glm-4.7 | zai | 34.0 ▲+0.8 | 53.4 ▲+0.3 | 57.3 ▲+0.8 | 60.3 ▲+1.1 |
group breakdown: A_B 98.2 (1/32); A_I 98.9 (1/32); A_P 70.0 (6/32); A_R 99.7 (3/32); BUILD 44.9 (27/32); CRE 6.4 (30/32); GEN 34.7 (25/32); LM_ARENA_REVIEW_PROXY 50.0 (19/32); OPS_long 89.2 (5/32); OPS_precision 91.4 (2/32); OPS_review 88.8 (6/32); PLAN 53.9 (22/32)
metrics: AI_code 96.8 (4/32); AI_complexity 88.9 (3/32); AI_context_awareness 0.0 (32/32); AI_correctness 100.0 (13/32); AI_edge_cases 100.0 (11/32); AI_efficiency 100.0 (1/32); AI_hallucination_resistance 100.0 (12/32); AI_memory_retention 99.5 (8/32); AI_parameter_accuracy 0.0 (32/32); AI_plan_coherence 100.0 (4/32); AI_recovery 100.0 (13/32); AI_refusal 100.0 (16/32); AI_spec 100.0 (16/32); AI_stability 100.0 (9/32); AI_task_completion 0.0 (32/32); AI_tool_selection 0.0 (32/32); ArtificialAnalysisCoding 38.4 (25/32); ArtificialAnalysisIntelligence 42.8 (22/32); ArtificialAnalysisReasoning 55.1 (20/32); BlendedCost 94.9 (7/32); ContextWindow 73.7 (23/32); CopilotArenaOrLMArenaCode 66.2 (13/32); GDPval 34.5 (28/32); GPQA_HLE_Reasoning 55.1 (20/32); IFBench 70.3 (15/32); LMArenaCreativeOrOpenEnded 6.4 (30/32); LMArenaText 6.4 (30/32); LongContextRecall 54.4 (22/32); MCPAtlas 0.0 (28/28); OutputSpeed 88.7 (9/32); SWEBenchMultilingual 5.0 (25/27); SWEBenchVerified 89.6 (21/31); SWEComposite 60.5 (25/32); SWERebench 70.5 (19/31); SciCode 43.5 (17/32); SonarBugDensity 66.6 (11/20); SonarComposite 32.0 (28/32); SonarFunctionalSkill 0.0 (20/20); SonarIssueDensity 58.2 (6/20); SonarVulnerabilityDensity 27.4 (16/20); TTFT 99.4 (2/32); Tau2Bench 96.0 (7/32); TerminalBench 27.0 (27/32)
| gemini-3-flash | | 72.1 ▼-9.6 | 61.7 ▼-5.2 | 57.2 ▼-8.8 | 53.0 ▼-8.8 |
group breakdown: A_B 22.1 (30/32); A_I 17.1 (30/32); A_P 36.7 (30/32); A_R 28.6 (29/32); BUILD 60.0 (21/32); CRE 88.9 (6/32); GEN 62.7 (14/32); LM_ARENA_REVIEW_PROXY 18.4 (25/32); OPS_long 95.6 (1/32); OPS_precision 92.2 (1/32); OPS_review 93.9 (1/32); PLAN 63.9 (16/32)
metrics: AI_code 7.5 (29/32); AI_complexity 7.5 (29/32); AI_context_awareness 15.3 (6/32); AI_correctness 7.5 (29/32); AI_edge_cases 22.6 (28/32); AI_efficiency 27.6 (28/32); AI_hallucination_resistance 92.5 (14/32); AI_memory_retention 92.5 (10/32); AI_parameter_accuracy 65.9 (23/32); AI_plan_coherence 43.8 (9/32); AI_recovery 33.7 (28/32); AI_refusal 7.5 (29/32); AI_spec 7.5 (29/32); AI_stability 7.5 (29/32); AI_task_completion 92.5 (4/32); AI_tool_selection 67.1 (12/32); ARC_AGI_2 3.1 (22/25); ArtificialAnalysisCoding 60.1 (15/32); ArtificialAnalysisIntelligence 59.3 (20/32); ArtificialAnalysisReasoning 83.7 (8/32); BlendedCost 90.5 (14/32); ContextWindow 100.0 (6/32); CopilotArenaOrLMArenaCode 65.1 (17/32); GDPval 37.1 (25/32); GPQA_HLE_Reasoning 83.7 (8/32); GSO 14.0 (14/16); IFBench 98.1 (3/32); LMArenaCreativeOrOpenEnded 88.9 (6/32); LMArenaSearchDocument 18.4 (23/30); LMArenaText 88.9 (6/32); LongContextRecall 66.2 (15/32); MCPAtlas 16.9 (24/28); OutputSpeed 99.8 (2/32); SWEBenchMultilingual 100.0 (1/27); SWEBenchPro 53.0 (26/29); SWEBenchVerified 100.0 (1/31); SWEComposite 74.0 (20/32); SWERebench 76.0 (15/31); SciCode 73.7 (7/32); SonarBugDensity 65.0 (15/20); SonarComposite 59.3 (13/32); SonarFunctionalSkill 78.9 (9/20); SonarIssueDensity 25.2 (14/20); SonarVulnerabilityDensity 56.0 (13/20); TTFT 83.4 (17/32); Tau2Bench 61.1 (20/32); TerminalBench 48.2 (19/32)
| kimi-k2-0905 | moonshot | 27.4 ▲+6.3 | 28.0 ▲+3.7 | 57.0 ▲+4.1 | 57.0 ▲+7.4 |
group breakdown: A_B 62.2 (16/32); A_I 65.0 (27/32); A_P 53.7 (26/32); A_R 84.9 (10/32); BUILD 58.8 (23/32); CRE 25.2 (27/32); GEN 7.4 (32/32); LM_ARENA_REVIEW_PROXY 88.8 (8/32); OPS_long 33.3 (32/32); OPS_precision 56.1 (30/32); OPS_review 50.9 (31/32); PLAN 28.7 (27/32)
metrics: AI_code 51.6 (14/32); AI_complexity 16.7 (26/32); AI_context_awareness 0.0 (24/32); AI_correctness 92.9 (15/32); AI_edge_cases 58.0 (14/32); AI_efficiency 0.1 (31/32); AI_hallucination_resistance 100.0 (7/32); AI_memory_retention 0.0 (30/32); AI_parameter_accuracy 79.8 (11/32); AI_plan_coherence 9.5 (29/32); AI_recovery 96.7 (15/32); AI_refusal 100.0 (8/32); AI_spec 100.0 (8/32); AI_stability 71.1 (26/32); AI_task_completion 62.5 (14/32); AI_tool_selection 67.1 (14/32); ArtificialAnalysisCoding 2.5 (30/32); ArtificialAnalysisIntelligence 0.0 (31/32); ArtificialAnalysisReasoning 0.0 (31/32); BlendedCost 91.7 (13/32); ContextWindow 39.3 (31/32); CopilotArenaOrLMArenaCode 86.8 (7/32); GDPval 5.0 (31/32); GPQA_HLE_Reasoning 0.0 (31/32); IFBench 0.0 (31/32); LMArenaCreativeOrOpenEnded 25.2 (27/32); LMArenaSearchDocument 88.8 (8/30); LMArenaText 25.2 (27/32); LongContextRecall 0.0 (31/32); MCPAtlas 81.7 (7/28); OutputSpeed 0.0 (32/32); SWEBenchMultilingual 5.0 (23/27); SWEBenchPro 92.5 (15/29); SWEBenchVerified 77.2 (28/31); SWEComposite 81.5 (18/32); SWERebench 92.5 (6/31); SciCode 0.0 (31/32); SonarComposite 50.0 (20/32); TTFT 90.9 (13/32); Tau2Bench 45.3 (26/32); TerminalBench 44.5 (22/32)
| claude-sonnet-4 | anthropic | 20.0 ▲+3.6 | 32.7 ▲+2.3 | 50.4 ▲+3.9 | 54.4 ▲+3.4 |
group breakdown: A_B 61.7 (17/32); A_I 69.1 (17/32); A_P 62.4 (10/32); A_R 77.9 (14/32); BUILD 46.9 (26/32); CRE 0.0 (31/32); GEN 13.4 (30/32); LM_ARENA_REVIEW_PROXY 87.8 (12/32); OPS_long 79.0 (22/32); OPS_precision 79.3 (23/32); OPS_review 81.4 (23/32); PLAN 28.0 (28/32)
metrics: AI_code 44.0 (25/32); AI_complexity 27.6 (15/32); AI_context_awareness 35.8 (4/32); AI_correctness 57.8 (27/32); AI_edge_cases 99.2 (12/32); AI_efficiency 57.0 (15/32); AI_hallucination_resistance 68.6 (17/32); AI_memory_retention 0.0 (27/32); AI_parameter_accuracy 99.1 (4/32); AI_plan_coherence 28.5 (14/32); AI_recovery 100.0 (5/32); AI_refusal 97.7 (17/32); AI_spec 98.0 (17/32); AI_stability 68.3 (27/32); AI_task_completion 85.8 (6/32); AI_tool_selection 93.7 (3/32); ARC_AGI_2 0.2 (24/25); ArtificialAnalysisCoding 30.8 (27/32); ArtificialAnalysisIntelligence 29.7 (25/32); ArtificialAnalysisReasoning 5.0 (30/32); BlendedCost 73.0 (24/32); ContextWindow 99.2 (12/32); CopilotArenaOrLMArenaCode 47.8 (24/32); GDPval 88.8 (5/32); GPQA_HLE_Reasoning 5.0 (30/32); GSO 6.0 (15/16); IFBench 33.8 (25/32); LMArenaCreativeOrOpenEnded 0.0 (31/32); LMArenaSearchDocument 87.8 (12/30); LMArenaText 0.0 (31/32); LiveCodeBench 0.0 (2/2); LongContextRecall 57.7 (20/32); MCPAtlas 10.9 (25/28); OutputSpeed 74.8 (26/32); SWEBenchMultilingual 10.4 (21/27); SWEBenchPro 78.4 (23/29); SWEBenchVerified 67.4 (29/31); SWEComposite 60.3 (26/32); SWERebench 54.4 (25/31); SciCode 15.4 (28/32); SonarBugDensity 28.4 (18/20); SonarComposite 27.6 (30/32); SonarFunctionalSkill 26.4 (17/20); SonarIssueDensity 45.5 (8/20); SonarVulnerabilityDensity 0.0 (20/20); TTFT 78.4 (18/32); Tau2Bench 25.5 (28/32); TerminalBench 47.3 (20/32)
| claude-sonnet-4.5 | anthropic | 53.8 ▼-10.5 | 43.5 ▼-5.6 | 49.4 ▼-10.4 | 39.0 ▼-10.7 |
group breakdown: A_B 22.6 (28/32); A_I 17.1 (28/32); A_P 38.8 (28/32); A_R 25.5 (31/32); BUILD 52.8 (24/32); CRE 66.0 (17/32); GEN 42.5 (24/32); LM_ARENA_REVIEW_PROXY 1.5 (30/32); OPS_long 79.2 (21/32); OPS_precision 79.4 (22/32); OPS_review 81.5 (22/32); PLAN 39.6 (26/32)
metrics: AI_canary_health 81.8 (4/6); AI_code 0.6 (31/32); AI_complexity 0.7 (31/32); AI_context_awareness 94.1 (2/32); AI_correctness 2.1 (31/32); AI_edge_cases 0.0 (32/32); AI_efficiency 48.4 (25/32); AI_hallucination_resistance 100.0 (3/32); AI_memory_retention 99.5 (4/32); AI_parameter_accuracy 91.0 (6/32); AI_plan_coherence 3.2 (30/32); AI_recovery 2.3 (31/32); AI_refusal 1.2 (31/32); AI_spec 0.0 (32/32); AI_stability 76.3 (14/32); AI_task_completion 76.2 (7/32); AI_tool_selection 52.1 (17/32); ARC_AGI_2 3.7 (20/25); ArtificialAnalysisCoding 46.3 (21/32); ArtificialAnalysisIntelligence 46.2 (21/32); ArtificialAnalysisReasoning 33.3 (26/32); BlendedCost 73.0 (25/32); ContextWindow 99.2 (13/32); CopilotArenaOrLMArenaCode 47.4 (25/32); GDPval 91.1 (3/32); GPQA_HLE_Reasoning 33.3 (26/32); GSO 27.3 (12/16); IFBench 41.0 (23/32); LMArenaCreativeOrOpenEnded 66.0 (17/32); LMArenaSearchDocument 1.5 (28/30); LMArenaText 66.0 (17/32); LongContextRecall 62.8 (18/32); MCPAtlas 4.0 (27/28); OutputSpeed 75.2 (24/32); SWEBenchMultilingual 3.5 (26/27); SWEBenchPro 81.2 (21/29); SWEBenchVerified 84.4 (23/31); SWEComposite 71.3 (24/32); SWERebench 74.6 (16/31); SciCode 41.3 (19/32); SonarBugDensity 32.8 (17/20); SonarComposite 24.2 (31/32); SonarFunctionalSkill 17.2 (18/20); SonarIssueDensity 40.6 (10/20); SonarVulnerabilityDensity 4.4 (19/20); TTFT 78.1 (20/32); Tau2Bench 55.8 (21/32); TerminalBench 37.2 (25/32)
| grok-4-latest | xai | 56.6 ▲+0.6 | 47.0 ▲+1.1 | 48.0 ▼-1.4 | 45.5 ▼-0.9 |
group breakdown: A_B 74.2 (11/32); A_I 80.0 (11/32); A_P 66.0 (8/32); A_R 84.4 (11/32); BUILD 42.7 (28/32); CRE 55.4 (20/32); GEN 44.1 (23/32); LM_ARENA_REVIEW_PROXY 13.2 (28/32); OPS_long 59.5 (30/32); OPS_precision 49.6 (31/32); OPS_review 56.8 (30/32); PLAN 42.3 (23/32)
metrics: AI_code 73.5 (9/32); AI_complexity 60.1 (11/32); AI_context_awareness 0.0 (29/32); AI_correctness 100.0 (10/32); AI_edge_cases 0.4 (31/32); AI_efficiency 0.0 (32/32); AI_hallucination_resistance 100.0 (9/32); AI_memory_retention 99.5 (5/32); AI_parameter_accuracy 0.0 (29/32); AI_plan_coherence 100.0 (1/32); AI_recovery 100.0 (11/32); AI_refusal 100.0 (14/32); AI_spec 100.0 (14/32); AI_stability 100.0 (6/32); AI_task_completion 0.0 (29/32); AI_tool_selection 0.0 (29/32); ARC_AGI_2 20.9 (11/25); ArtificialAnalysisCoding 52.9 (18/32); ArtificialAnalysisIntelligence 40.4 (24/32); ArtificialAnalysisReasoning 56.4 (18/32); BlendedCost 73.0 (27/32); ContextWindow 77.4 (20/32); CopilotArenaOrLMArenaCode 48.3 (22/32); GDPval 11.2 (30/32); GPQA_HLE_Reasoning 56.4 (18/32); IFBench 31.0 (26/32); LMArenaCreativeOrOpenEnded 55.4 (20/32); LMArenaSearchDocument 13.2 (26/30); LMArenaText 55.4 (20/32); LongContextRecall 74.6 (13/32); OutputSpeed 72.0 (30/32); SWEComposite 45.2 (29/32); SWERebench 38.1 (27/31); SciCode 46.8 (16/32); SonarComposite 50.0 (23/32); TTFT 5.1 (31/32); Tau2Bench 48.6 (24/32); TerminalBench 11.7 (29/32)
| gemini-2.5-flash | | 48.4 ▲+8.9 | 30.5 ▲+5.1 | 41.3 ▲+8.3 | 47.3 ▲+9.5 |
group breakdown: A_B 84.3 (9/32); A_I 91.5 (4/32); A_P 78.3 (3/32); A_R 95.5 (6/32); BUILD 29.1 (31/32); CRE 45.3 (25/32); GEN 14.3 (28/32); LM_ARENA_REVIEW_PROXY 79.3 (17/32); OPS_long 94.8 (2/32); OPS_precision 90.8 (5/32); OPS_review 93.0 (2/32); PLAN 15.3 (31/32)
metrics: AI_code 54.8 (12/32); AI_complexity 59.5 (12/32); AI_context_awareness 100.0 (1/32); AI_correctness 100.0 (6/32); AI_edge_cases 100.0 (6/32); AI_efficiency 99.9 (2/32); AI_hallucination_resistance 100.0 (5/32); AI_memory_retention 56.1 (12/32); AI_parameter_accuracy 100.0 (3/32); AI_plan_coherence 55.4 (6/32); AI_recovery 100.0 (7/32); AI_refusal 100.0 (7/32); AI_spec 100.0 (7/32); AI_stability 100.0 (3/32); AI_task_completion 33.5 (28/32); AI_tool_selection 27.2 (28/32); ARC_AGI_2 0.7 (23/25); ArtificialAnalysisCoding 0.0 (31/32); ArtificialAnalysisIntelligence 0.4 (30/32); ArtificialAnalysisReasoning 14.9 (27/32); BlendedCost 93.4 (11/32); ContextWindow 100.0 (4/32); CopilotArenaOrLMArenaCode 62.8 (19/32); GDPval 37.8 (24/32); GPQA_HLE_Reasoning 14.9 (27/32); GSO 19.4 (13/16); IFBench 27.2 (28/32); LMArenaCreativeOrOpenEnded 45.3 (25/32); LMArenaSearchDocument 79.3 (17/30); LMArenaText 45.3 (25/32); LiveCodeBench 100.0 (1/2); LongContextRecall 56.1 (21/32); MCPAtlas 21.9 (23/28); OutputSpeed 100.0 (1/32); SWEBenchMultilingual 92.5 (9/27); SWEBenchPro 52.5 (27/29); SWEBenchVerified 0.0 (31/31); SWEComposite 27.6 (31/32); SWERebench 0.0 (30/31); SciCode 18.2 (27/32); SonarBugDensity 65.0 (13/20); SonarComposite 59.3 (11/32); SonarFunctionalSkill 78.9 (7/20); SonarIssueDensity 25.2 (12/20); SonarVulnerabilityDensity 56.0 (11/20); TTFT 77.3 (21/32); Tau2Bench 0.0 (31/32); TerminalBench 0.1 (31/32)
| grok-code-fast-1 | xai | 49.1 ▲+5.8 | 27.5 ▲+3.3 | 40.7 ▲+6.4 | 36.1 ▲+5.7 |
group breakdown: A_B 85.2 (8/32); A_I 92.6 (3/32); A_P 67.9 (7/32); A_R 96.7 (4/32); BUILD 29.2 (30/32); CRE 47.8 (24/32); GEN 15.7 (27/32); LM_ARENA_REVIEW_PROXY 15.0 (27/32); OPS_long 80.7 (18/32); OPS_precision 82.5 (18/32); OPS_review 82.4 (20/32); PLAN 12.6 (32/32)
metrics: AI_code 66.5 (10/32); AI_complexity 67.9 (9/32); AI_context_awareness 0.0 (30/32); AI_correctness 100.0 (11/32); AI_edge_cases 100.0 (10/32); AI_efficiency 47.7 (26/32); AI_hallucination_resistance 100.0 (10/32); AI_memory_retention 99.5 (6/32); AI_parameter_accuracy 0.0 (30/32); AI_plan_coherence 100.0 (2/32); AI_recovery 100.0 (12/32); AI_refusal 100.0 (15/32); AI_spec 100.0 (15/32); AI_stability 100.0 (7/32); AI_task_completion 0.0 (30/32); AI_tool_selection 0.0 (30/32); ARC_AGI_2 25.3 (10/25); ArtificialAnalysisCoding 0.0 (32/32); ArtificialAnalysisIntelligence 0.0 (32/32); ArtificialAnalysisReasoning 0.0 (32/32); BlendedCost 98.5 (4/32); ContextWindow 77.4 (21/32); CopilotArenaOrLMArenaCode 0.0 (32/32); GDPval 5.0 (32/32); GPQA_HLE_Reasoning 0.0 (32/32); IFBench 0.0 (32/32); LMArenaCreativeOrOpenEnded 47.8 (24/32); LMArenaSearchDocument 15.0 (25/30); LMArenaText 47.8 (24/32); LongContextRecall 0.0 (32/32); OutputSpeed 79.3 (15/32); SWEBenchVerified 81.5 (24/31); SWEComposite 45.4 (27/32); SWERebench 26.7 (29/31); SciCode 0.0 (32/32); SonarComposite 50.0 (24/32); TTFT 78.3 (19/32); Tau2Bench 50.6 (23/32); TerminalBench 0.0 (32/32)
| gemini-2.5-pro | | 13.1 ▼-9.6 | 28.6 ▼-5.2 | 36.6 ▼-8.8 | 30.2 ▼-8.8 |
group breakdown: A_B 22.1 (29/32); A_I 17.1 (29/32); A_P 36.7 (29/32); A_R 28.6 (28/32); BUILD 35.4 (29/32); CRE 0.0 (32/32); GEN 14.2 (29/32); LM_ARENA_REVIEW_PROXY 0.0 (32/32); OPS_long 87.6 (7/32); OPS_precision 83.4 (16/32); OPS_review 86.2 (11/32); PLAN 26.2 (29/32)
metrics: AI_code 7.5 (28/32); AI_complexity 7.5 (28/32); AI_context_awareness 15.3 (5/32); AI_correctness 7.5 (28/32); AI_edge_cases 22.6 (27/32); AI_efficiency 27.6 (27/32); AI_hallucination_resistance 92.5 (13/32); AI_memory_retention 92.5 (9/32); AI_parameter_accuracy 65.9 (22/32); AI_plan_coherence 43.8 (8/32); AI_recovery 33.7 (27/32); AI_refusal 7.5 (28/32); AI_spec 7.5 (28/32); AI_stability 7.5 (28/32); AI_task_completion 92.5 (3/32); AI_tool_selection 67.1 (11/32); ARC_AGI_2 3.7 (21/25); ArtificialAnalysisCoding 23.5 (28/32); ArtificialAnalysisIntelligence 13.9 (28/32); ArtificialAnalysisReasoning 43.5 (23/32); BlendedCost 78.8 (18/32); ContextWindow 100.0 (5/32); CopilotArenaOrLMArenaCode 0.0 (31/32); GDPval 35.7 (26/32); GPQA_HLE_Reasoning 43.5 (23/32); GSO 0.0 (16/16); IFBench 17.3 (29/32); LMArenaCreativeOrOpenEnded 0.0 (32/32); LMArenaSearchDocument 0.0 (30/30); LMArenaText 0.0 (32/32); LongContextRecall 64.5 (16/32); MCPAtlas 58.4 (13/28); OutputSpeed 91.4 (6/32); SWEBenchMultilingual 36.0 (17/27); SWEBenchPro 75.7 (25/29); SWEBenchVerified 33.5 (30/31); SWEComposite 35.1 (30/32); SWERebench 0.0 (31/31); SciCode 30.8 (24/32); SonarBugDensity 65.0 (14/20); SonarComposite 59.3 (12/32); SonarFunctionalSkill 78.9 (8/20); SonarIssueDensity 25.2 (13/20); SonarVulnerabilityDensity 56.0 (12/20); TTFT 72.1 (27/32); Tau2Bench 1.8 (30/32); TerminalBench 1.6 (30/32)
| glm-4.6 | zai | 31.8 ▼-2.0 | 27.2 ▼-1.0 | 31.8 ▼-2.4 | 35.3 ▼-1.8 |
group breakdown: A_B 67.7 (12/32); A_I 72.6 (13/32); A_P 60.9 (11/32); A_R 67.4 (19/32); BUILD 19.5 (32/32); CRE 22.0 (28/32); GEN 12.2 (31/32); LM_ARENA_REVIEW_PROXY 50.0 (18/32); OPS_long 80.1 (20/32); OPS_precision 86.3 (13/32); OPS_review 83.8 (15/32); PLAN 16.1 (30/32)
metrics: AI_code 47.2 (16/32); AI_complexity 41.3 (13/32); AI_context_awareness 0.0 (31/32); AI_correctness 100.0 (12/32); AI_edge_cases 41.6 (26/32); AI_efficiency 67.0 (12/32); AI_hallucination_resistance 100.0 (11/32); AI_memory_retention 99.5 (7/32); AI_parameter_accuracy 0.0 (31/32); AI_plan_coherence 100.0 (3/32); AI_recovery 0.0 (32/32); AI_refusal 91.2 (27/32); AI_spec 87.5 (27/32); AI_stability 100.0 (8/32); AI_task_completion 0.0 (31/32); AI_tool_selection 0.0 (31/32); ArtificialAnalysisCoding 14.9 (29/32); ArtificialAnalysisIntelligence 5.8 (29/32); ArtificialAnalysisReasoning 13.5 (28/32); BlendedCost 94.6 (8/32); ContextWindow 73.7 (22/32); CopilotArenaOrLMArenaCode 36.7 (29/32); GDPval 16.6 (29/32); GPQA_HLE_Reasoning 13.5 (28/32); IFBench 2.6 (30/32); LMArenaCreativeOrOpenEnded 22.0 (28/32); LMArenaText 22.0 (28/32); LongContextRecall 5.6 (30/32); MCPAtlas 7.5 (26/28); OutputSpeed 72.5 (29/32); SWEBenchMultilingual 5.0 (24/27); SWEBenchPro 0.0 (29/29); SWEBenchVerified 77.2 (27/31); SWEComposite 27.0 (32/32); SWERebench 37.3 (28/31); SciCode 6.7 (30/32); SonarBugDensity 19.6 (19/20); SonarComposite 13.0 (32/32); SonarFunctionalSkill 7.5 (19/20); SonarIssueDensity 7.5 (19/20); SonarVulnerabilityDensity 28.0 (15/20); TTFT 98.7 (5/32); Tau2Bench 38.7 (27/32); TerminalBench 13.8 (28/32)