Data updated 30 minutes ago. Sources: LiveBench Instruction Following

Instruction following benchmarks

Scores measure adherence to formatting constraints and complex instructions, as reported by LiveBench.

LiveBench Instruction Following

| Rank | Model | Provider | Score |
|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview High | Google | 79.1% |
| 2 | Gemini 3 Flash Preview High | Google | 74.9% |
| 3 | GPT-5.1 Codex Max High | OpenAI | 70.4% |
| 4 | GPT-5.4 Thinking xHigh Effort | OpenAI | 70.2% |
| 5 | Gemini 3.1 Flash Lite Preview High | Google | 68.6% |
| 6 | GLM 5.1 | Z.AI | 68.5% |
| 7 | Gemma 4 31B | Google | 67.6% |
| 8 | GPT-5.4 Nano xHigh | OpenAI | 67.2% |
| 9 | GPT-5.2 Codex | OpenAI | 66.5% |
| 10 | Gemini 3 Pro Preview High | Google | 65.8% |
| 11 | GPT-5.3 Codex High | OpenAI | 65.4% |
| 12 | GPT-5 Mini High | OpenAI | 65.3% |
| 13 | GPT-5 Pro | OpenAI | 64.0% |
| 14 | GPT-5.1 High | OpenAI | 63.9% |
| 15 | GPT-5.1 Codex | OpenAI | 63.4% |
| 16 | Grok 4.20 Beta | xAI | 63.4% |
| 17 | Claude 4.6 Opus Thinking High Effort | Anthropic | 63.3% |
| 18 | Claude 4.6 Sonnet Thinking Medium Effort | Anthropic | 63.2% |
| 19 | Claude 4.5 Opus Thinking High Effort | Anthropic | 62.5% |
| 20 | Kimi K2 Thinking | Moonshot AI | 62.0% |
| 21 | GPT-5.2 High | OpenAI | 61.8% |
| 22 | Minimax M2.7 | Minimax | 61.1% |
| 23 | GPT-5.4 Mini xHigh | OpenAI | 60.3% |
| 24 | GPT-5.3 Instant | OpenAI | 59.4% |
| 25 | GPT-5.1 Codex Mini | OpenAI | 59.0% |
| 26 | Qwen 3.6 Plus | Alibaba | 58.3% |
| 27 | Kimi K2.5 Thinking | Moonshot AI | 57.4% |
| 28 | Minimax M2.5 | Minimax | 57.2% |
| 29 | GPT-5 Nano High | OpenAI | 55.7% |
| 30 | GLM 5 | Z.AI | 55.3% |
| 31 | Claude Sonnet 4.5 Thinking | Anthropic | 53.4% |
| 32 | GPT OSS 120b | OpenAI | 50.3% |
| 33 | Claude Haiku 4.5 Thinking | Anthropic | 49.8% |
| 34 | DeepSeek V3.2 Thinking | DeepSeek | 48.2% |
| 35 | Claude 4 Sonnet Thinking | Anthropic | 44.3% |
| 36 | MiMo V2 Pro | Xiaomi | 43.2% |
| 37 | Claude 4.1 Opus Thinking | Anthropic | 42.4% |
| 38 | Qwen 3 Next 80B A3B Thinking | Alibaba | 41.5% |
| 39 | DeepSeek V3.2 Exp Thinking | DeepSeek | 41.3% |
| 40 | Qwen 3 235B A22B Thinking 2507 | Alibaba | 40.6% |
| 41 | GLM 4.7 | Z.AI | 35.7% |
| 42 | Gemini 2.5 Pro (Max Thinking) | Google | 33.1% |
| 43 | Grok 4 | xAI | 29.1% |
| 44 | Gemini 2.5 Flash (Max Thinking) (2025-06-05) | Google | 28.5% |
| 45 | Nemotron 3 Super 120B A12B | NVIDIA | 28.4% |
| 46 | Grok 4.1 Fast | xAI | 28.2% |
| 47 | Claude 4.5 Opus Medium Effort | Anthropic | 28.1% |
| 48 | Gemini 2.5 Flash Lite (Max Thinking) (2025-09-25) | Google | 28.1% |
| 49 | Gemini 2.5 Flash (Max Thinking) (2025-09-25) | Google | 27.7% |
| 50 | GLM 5V Turbo | Z.AI | 27.2% |
| 51 | GPT-5.2 No Thinking | OpenAI | 27.2% |
| 52 | GLM 4.6 | Z.AI | 26.2% |
| 53 | Claude 4.1 Opus | Anthropic | 25.9% |
| 54 | Grok 4.20 Beta (Non-Reasoning) | xAI | 24.4% |
| 55 | Claude Sonnet 4.5 | Anthropic | 23.5% |
| 56 | GPT-5.1 No Thinking | OpenAI | 23.5% |
| 57 | Gemini 2.5 Flash Lite (Max Thinking) (2025-06-17) | Google | 23.1% |
| 58 | DeepSeek V3.2 | DeepSeek | 23.1% |
| 59 | Claude 4 Sonnet | Anthropic | 22.7% |
| 60 | Grok Code Fast | xAI | 22.3% |
| 61 | Qwen 3 235B A22B Instruct 2507 | Alibaba | 21.7% |
| 62 | Qwen 3 30B A3B | Alibaba | 21.1% |
| 63 | Kimi K2 Instruct | Moonshot AI | 20.4% |
| 64 | DeepSeek V3.2 Exp | DeepSeek | 19.3% |
| 65 | Qwen 3 Next 80B A3B Instruct | Alibaba | 19.2% |
| 66 | Qwen 3 32B | Alibaba | 17.8% |
| 67 | Claude Haiku 4.5 | Anthropic | 17.8% |
| 68 | GLM 4.6V | Z.AI | 17.1% |
| 69 | Grok 4.1 Fast (Non-Reasoning) | xAI | 17.0% |
| 70 | Devstral 2 | Mistral | 13.5% |
| 71 | Trinity Large Preview | Arcee | 12.2% |
