
Data updated 36 minutes ago. Source: LiveBench


Reasoning Benchmarks

Multi-step reasoning, mathematics, and contamination-free language tasks.

| Rank | Model | Provider | Score |
|---|---|---|---|
| 1 | GPT-5.4 Thinking xHigh Effort | OpenAI | 80.3% |
| 2 | Gemini 3.1 Pro Preview High | Google | 79.9% |
| 3 | Claude 4.6 Opus Thinking High Effort | Anthropic | 76.3% |
| 4 | Claude 4.5 Opus Thinking High Effort | Anthropic | 76.0% |
| 5 | Claude 4.6 Sonnet Thinking Medium Effort | Anthropic | 75.5% |
| 6 | GPT-5.2 High | OpenAI | 74.8% |
| 7 | GPT-5.2 Codex | OpenAI | 74.3% |
| 8 | GPT-5.1 Codex Max High | OpenAI | 74.0% |
| 9 | Gemini 3 Pro Preview High | Google | 73.4% |
| 10 | GPT-5.3 Codex High | OpenAI | 72.8% |
| 11 | Gemini 3 Flash Preview High | Google | 72.4% |
| 12 | GPT-5.1 High | OpenAI | 72.0% |
| 13 | Qwen 3.6 Plus | Alibaba | 70.8% |
| 14 | GPT-5 Pro | OpenAI | 70.5% |
| 15 | GLM 5.1 | Z.AI | 70.2% |
| 16 | GPT-5.4 Nano xHigh | OpenAI | 70.1% |
| 17 | Kimi K2.5 Thinking | Moonshot AI | 69.1% |
| 18 | GLM 5 | Z.AI | 68.8% |
| 19 | GPT-5.1 Codex | OpenAI | 68.6% |
| 20 | Claude Sonnet 4.5 Thinking | Anthropic | 68.2% |
| 21 | Grok 4.20 Beta | xAI | 68.0% |
| 22 | GPT-5.4 Mini xHigh | OpenAI | 67.5% |
| 23 | GPT-5 Mini High | OpenAI | 65.9% |
| 24 | Minimax M2.7 | Minimax | 63.5% |
| 25 | DeepSeek V3.2 Thinking | DeepSeek | 62.2% |
| 26 | Grok 4 | xAI | 62.0% |
| 27 | Claude 4.1 Opus Thinking | Anthropic | 61.8% |
| 28 | Gemini 3.1 Flash Lite Preview High | Google | 61.7% |
| 29 | Gemma 4 31B | Google | 61.6% |
| 30 | Kimi K2 Thinking | Moonshot AI | 61.6% |
| 31 | Claude Haiku 4.5 Thinking | Anthropic | 61.3% |
| 32 | Claude 4 Sonnet Thinking | Anthropic | 61.3% |
| 33 | GPT-5.1 Codex Mini | OpenAI | 60.4% |
| 34 | Minimax M2.5 | Minimax | 60.1% |
| 35 | GPT-5.3 Instant | OpenAI | 60.0% |
| 36 | Grok 4.1 Fast | xAI | 60.0% |
| 37 | Claude 4.5 Opus Medium Effort | Anthropic | 59.1% |
| 38 | DeepSeek V3.2 Exp Thinking | DeepSeek | 58.9% |
| 39 | Gemini 2.5 Pro (Max Thinking) | Google | 58.3% |
| 40 | MiMo V2 Pro | Xiaomi | 58.1% |
| 41 | GLM 4.7 | Z.AI | 58.1% |
| 42 | GLM 4.6 | Z.AI | 55.2% |
| 43 | Claude 4.1 Opus | Anthropic | 54.5% |
| 44 | Claude Sonnet 4.5 | Anthropic | 53.7% |
| 45 | Gemini 2.5 Flash (Max Thinking) (2025-09-25) | Google | 53.1% |
| 46 | Qwen 3 235B A22B Thinking 2507 | Alibaba | 53.0% |
| 47 | DeepSeek V3.2 | DeepSeek | 51.8% |
| 48 | Claude 4 Sonnet | Anthropic | 51.0% |
| 49 | Qwen 3 Next 80B A3B Thinking | Alibaba | 50.4% |
| 50 | DeepSeek V3.2 Exp | DeepSeek | 49.9% |
| 51 | GLM 5V Turbo | Z.AI | 49.6% |
| 52 | GPT-5.2 No Thinking | OpenAI | 48.9% |
| 53 | Qwen 3 235B A22B Instruct 2507 | Alibaba | 48.8% |
| 54 | GPT-5 Nano High | OpenAI | 48.6% |
| 55 | Qwen 3 Next 80B A3B Instruct | Alibaba | 48.4% |
| 56 | Kimi K2 Instruct | Moonshot AI | 48.1% |
| 57 | Gemini 2.5 Flash (Max Thinking) (2025-06-05) | Google | 47.7% |
| 58 | GPT OSS 120b | OpenAI | 46.1% |
| 59 | Claude Haiku 4.5 | Anthropic | 45.3% |
| 60 | Grok Code Fast | xAI | 45.1% |
| 61 | Qwen 3 32B | Alibaba | 43.6% |
| 62 | GPT-5.1 No Thinking | OpenAI | 42.6% |
| 63 | Gemini 2.5 Flash Lite (Max Thinking) (2025-06-17) | Google | 42.6% |
| 64 | Gemini 2.5 Flash Lite (Max Thinking) (2025-09-25) | Google | 42.4% |
| 65 | Devstral 2 | Mistral | 41.2% |
| 66 | GLM 4.6V | Z.AI | 40.1% |
| 67 | Grok 4.20 Beta (Non-Reasoning) | xAI | 39.7% |
| 68 | Qwen 3 30B A3B | Alibaba | 39.0% |
| 69 | Grok 4.1 Fast (Non-Reasoning) | xAI | 33.5% |
| 70 | Trinity Large Preview | Arcee | 32.7% |
| 71 | Nemotron 3 Super 120B A12B | NVIDIA | 32.5% |
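Several model families appear in the leaderboard in both a reasoning ("Thinking" or "High") and a non-reasoning configuration. The minimal Python sketch below computes the score gap in percentage points for a handful of these pairs, using the scores exactly as listed in the leaderboard. The pairings are our own assumption: we take, for example, "GPT-5.2 High" and "GPT-5.2 No Thinking" to be the same base model in two modes, a grouping that LiveBench does not itself define.

```python
# Scores (percent) copied from the leaderboard.
# ASSUMPTION: each pair is (reasoning variant, non-reasoning variant) of the
# same model family; this grouping is ours, not LiveBench's.
PAIRS = {
    "GPT-5.2": (74.8, 48.9),            # "GPT-5.2 High" vs "GPT-5.2 No Thinking"
    "GPT-5.1": (72.0, 42.6),            # "GPT-5.1 High" vs "GPT-5.1 No Thinking"
    "Claude Sonnet 4.5": (68.2, 53.7),  # "... Thinking" vs plain
    "Claude Haiku 4.5": (61.3, 45.3),   # "... Thinking" vs plain
    "DeepSeek V3.2": (62.2, 51.8),      # "... Thinking" vs plain
    "Grok 4.1 Fast": (60.0, 33.5),      # default vs "(Non-Reasoning)"
    "Grok 4.20 Beta": (68.0, 39.7),     # default vs "(Non-Reasoning)"
}


def reasoning_uplift(pairs):
    """Return (family, gain in percentage points), largest gain first."""
    # Round to one decimal to hide floating-point noise from the subtraction.
    gains = [(name, round(hi - lo, 1)) for name, (hi, lo) in pairs.items()]
    return sorted(gains, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, gain in reasoning_uplift(PAIRS):
        print(f"{name:<20} +{gain:.1f} pts")
```

With these values the largest gap is GPT-5.1 (+29.4 points) and the smallest is DeepSeek V3.2 (+10.4 points). The absolute difference in points is used rather than a ratio because all scores share the same 0 to 100% scale.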


Need help choosing the right AI model?

Benchmarks are a starting point, not an answer. The right model depends on your workload, budget, and integration requirements. Let us work it out together.