
Data updated 36 minutes ago · Sources: Code Arena · LiveCodeBench · Aider Polyglot


Coding Benchmarks

Real-world code generation, repo-level fixes, and competitive programming.

| Rank | Model | Provider | Score |
|---|---|---|---|
| 1 | gpt-5 (high) | OpenAI | 88.0% correct |
| 2 | gpt-5 (medium) | OpenAI | 86.7% correct |
| 3 | o3-pro (high) | OpenAI | 84.9% correct |
| 4 | gemini-2.5-pro-preview-06-05 (32k think) | Google DeepMind | 83.1% correct |
| 5 | gpt-5 (low) | OpenAI | 81.3% correct |
| 6 | o3 (high) | OpenAI | 81.3% correct |
| 7 | grok-4 (high) | xAI | 79.6% correct |
| 8 | gemini-2.5-pro-preview-06-05 (default think) | Google DeepMind | 79.1% correct |
| 9 | o3 (high) + gpt-4.1 | OpenAI | 78.2% correct |
| 10 | O3 | OpenAI | 76.9% correct |
| 11 | Gemini 2.5 Pro Preview 05-06 | Google DeepMind | 76.9% correct |
| 12 | DeepSeek-V3.2-Exp (Reasoner) | DeepSeek | 74.2% correct |
| 13 | Gemini 2.5 Pro Preview 03-25 | Google DeepMind | 72.9% correct |
| 14 | claude-opus-4-20250514 (32k thinking) | Anthropic | 72.0% correct |
| 15 | o4-mini (high) | OpenAI | 72.0% correct |
| 16 | DeepSeek R1 (0528) | DeepSeek | 71.4% correct |
| 17 | claude-opus-4-20250514 (no think) | Anthropic | 70.7% correct |
| 18 | DeepSeek-V3.2-Exp (Chat) | DeepSeek | 70.2% correct |
| 19 | claude-3-7-sonnet-20250219 (32k thinking tokens) | Anthropic | 64.9% correct |
| 20 | DeepSeek R1 + claude-3-5-sonnet-20241022 | DeepSeek | 64.0% correct |
| 21 | o1-2024-12-17 (high) | OpenAI | 61.7% correct |
| 22 | claude-sonnet-4-20250514 (32k thinking) | Anthropic | 61.3% correct |
| 23 | claude-3-7-sonnet-20250219 (no thinking) | Anthropic | 60.4% correct |
| 24 | o3-mini (high) | OpenAI | 60.4% correct |
| 25 | Qwen3 235B A22B diff, no think, Alibaba API | Alibaba | 59.6% correct |
| 26 | Kimi K2 | Moonshot AI | 59.1% correct |
| 27 | DeepSeek R1 | DeepSeek | 56.9% correct |
| 28 | claude-sonnet-4-20250514 (no thinking) | Anthropic | 56.4% correct |
| 29 | gemini-2.5-flash-preview-05-20 (24k think) | Google DeepMind | 55.1% correct |
| 30 | DeepSeek V3 (0324) | DeepSeek | 55.1% correct |
| 31 | Quasar Alpha | Unknown | 54.7% correct |
| 32 | o3-mini (medium) | OpenAI | 53.8% correct |
| 33 | Grok 3 Beta | xAI | 53.3% correct |
| 34 | Optimus Alpha | Unknown | 52.9% correct |
| 35 | Gpt 4.1 | OpenAI | 52.4% correct |
| 36 | Claude 3 5 Sonnet 20241022 | Anthropic | 51.6% correct |
| 37 | Grok 3 Mini Beta (high) | xAI | 49.3% correct |
| 38 | DeepSeek Chat V3 (prev) | DeepSeek | 48.4% correct |
| 39 | gemini-2.5-flash-preview-04-17 (default) | Google DeepMind | 47.1% correct |
| 40 | chatgpt-4o-latest (2025-03-29) | Unknown | 45.3% correct |
| 41 | Gpt 4.5 Preview | OpenAI | 44.9% correct |
| 42 | gemini-2.5-flash-preview-05-20 (no think) | Google DeepMind | 44.0% correct |
| 43 | gpt-oss-120b (high) | OpenAI | 41.8% correct |
| 44 | Qwen3 32B | Alibaba | 40.0% correct |
| 45 | Gemini Exp 1206 | Google DeepMind | 38.2% correct |
| 46 | Gemini 2.0 Pro exp-02-05 | Google DeepMind | 35.6% correct |
| 47 | Grok 3 Mini Beta (low) | xAI | 34.7% correct |
| 48 | O1 Mini 2024 09 12 | OpenAI | 32.9% correct |
| 49 | Gpt 4.1 Mini | OpenAI | 32.4% correct |
| 50 | Claude 3 5 Haiku 20241022 | Anthropic | 28.0% correct |
| 51 | chatgpt-4o-latest (2025-02-15) | Unknown | 27.1% correct |
| 52 | QwQ-32B + Qwen 2.5 Coder Instruct | Unknown | 26.2% correct |
| 53 | Gpt 4o 2024 08 06 | OpenAI | 23.1% correct |
| 54 | Gemini 2.0 Flash Exp | Google DeepMind | 22.2% correct |
| 55 | Qwen Max 2025 01 25 | Alibaba | 21.8% correct |
| 56 | QwQ 32B | Unknown | 20.9% correct |
| 57 | Gemini 2.0 Flash Thinking Exp 01 21 | Google DeepMind | 18.2% correct |
| 58 | Gpt 4o 2024 11 20 | OpenAI | 18.2% correct |
| 59 | DeepSeek Chat V2.5 | DeepSeek | 17.8% correct |
| 60 | Qwen2.5 Coder 32B Instruct | Alibaba | 16.4% correct |
| 61 | Llama 4 Maverick | Meta | 15.6% correct |
| 62 | Yi Lightning | Unknown | 12.9% correct |
| 63 | Command A 03 2025 Quality | Unknown | 12.0% correct |
| 64 | Codestral 25.01 | Mistral | 11.1% correct |
| 65 | Openhands Lm 32b v0.1 | Unknown | 10.2% correct |
| 66 | Gpt 4.1 Nano | OpenAI | 8.9% correct |
| 67 | Qwen2.5 Coder 32B Instruct | Alibaba | 8.0% correct |
| 68 | Gemma 3 27b It | Google DeepMind | 4.9% correct |
| 69 | Gpt 4o Mini 2024 07 18 | OpenAI | 3.6% correct |

Related discussion

Community pulse

r/MachineLearning · Discussion · 5 days ago

[D] How to break free from LLM's chains as a PhD student?

I didn't realize but over a period of one year i have become overreliant on ChatGPT to write code, I am a second year PhD student and don't want to end up as someone with fake "coding skills" after I graduate. I hear people talk about it al

20997 u/etoipi1

Need help choosing the right AI model?

Benchmarks are a starting point, not an answer. The right model depends on your workload, budget, and integration requirements. Let's figure it out together.