Updated 3 hours ago. Sources: LiveBench Instruction Following
Instruction following benchmarks
Adherence to formatting constraints and complex instructions, as measured by LiveBench.
LiveBench Instruction Following
| # | Model | Provider | Score | Input $/M | Output $/M | Context | CI |
|---|---|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview High | Google | 79.1% | — | — | — | — |
| 2 | Gemini 3.5 Flash High | Google | 75.6% | — | — | — | — |
| 3 | Gemini 3 Flash Preview High | Google | 74.9% | — | — | — | — |
| 4 | Qwen 3.7 Max | Alibaba | 74.0% | — | — | — | — |
| 5 | GPT-5.5 Thinking xHigh Effort | OpenAI | 73.0% | — | — | — | — |
| 6 | GPT-5.1 Codex Max High | OpenAI | 70.4% | — | — | — | — |
| 7 | GPT-5.4 Thinking xHigh Effort | OpenAI | 70.2% | — | — | — | — |
| 8 | Gemini 3.1 Flash Lite Preview High | Google | 68.6% | — | — | — | — |
| 9 | GLM 5.1 | Z.AI | 68.5% | — | — | — | — |
| 10 | Gemma 4 31B | Google | 67.6% | — | — | — | — |
| 11 | GPT-5.4 Nano xHigh | OpenAI | 67.2% | — | — | — | — |
| 12 | GPT-5.2 Codex | OpenAI | 66.5% | — | — | — | — |
| 13 | Gemini 3 Pro Preview High | Google | 65.8% | — | — | — | — |
| 14 | GPT-5.3 Codex High | OpenAI | 65.4% | — | — | — | — |
| 15 | GPT-5 Mini High | OpenAI | 65.3% | — | — | — | — |
| 16 | Kimi K2.6 Thinking | Moonshot AI | 64.4% | — | — | — | — |
| 17 | GPT-5 Pro | OpenAI | 64.0% | — | — | — | — |
| 18 | GPT-5.1 High | OpenAI | 63.9% | — | — | — | — |
| 19 | GPT-5.1 Codex | OpenAI | 63.4% | — | — | — | — |
| 20 | Grok 4.20 Beta | xAI | 63.4% | — | — | — | — |
| 21 | Claude 4.6 Opus Thinking High Effort | Anthropic | 63.3% | — | — | — | — |
| 22 | Claude 4.6 Sonnet Thinking Medium Effort | Anthropic | 63.2% | — | — | — | — |
| 23 | DeepSeek V4 Flash | DeepSeek | 63.1% | — | — | — | — |
| 24 | Grok 4.3 | xAI | 62.8% | — | — | — | — |
| 25 | Claude 4.5 Opus Thinking High Effort | Anthropic | 62.5% | — | — | — | — |
| 26 | DeepSeek V4 Pro | DeepSeek | 62.4% | — | — | — | — |
| 27 | Kimi K2 Thinking | Moonshot AI | 62.0% | — | — | — | — |
| 28 | GPT-5.2 High | OpenAI | 61.8% | — | — | — | — |
| 29 | Minimax M2.7 | Minimax | 61.1% | — | — | — | — |
| 30 | GPT-5.4 Mini xHigh | OpenAI | 60.3% | — | — | — | — |
| 31 | GPT-5.3 Instant | OpenAI | 59.4% | — | — | — | — |
| 32 | Claude 4.7 Opus Thinking xHigh Effort | Anthropic | 59.3% | — | — | — | — |
| 33 | GPT-5.1 Codex Mini | OpenAI | 59.0% | — | — | — | — |
| 34 | Qwen 3.6 Plus | Alibaba | 58.3% | — | — | — | — |
| 35 | Kimi K2.5 Thinking | Moonshot AI | 57.4% | — | — | — | — |
| 36 | Minimax M2.5 | Minimax | 57.2% | — | — | — | — |
| 37 | GPT-5 Nano High | OpenAI | 55.7% | — | — | — | — |
| 38 | GLM 5 | Z.AI | 55.3% | — | — | — | — |
| 39 | Claude Sonnet 4.5 Thinking | Anthropic | 53.4% | — | — | — | — |
| 40 | Qwen 3.6 27B | Alibaba | 53.2% | — | — | — | — |
| 41 | GPT OSS 120b | OpenAI | 50.3% | — | — | — | — |
| 42 | Claude Haiku 4.5 Thinking | Anthropic | 49.8% | — | — | — | — |
| 43 | DeepSeek V3.2 Thinking | DeepSeek | 48.2% | — | — | — | — |
| 44 | Qwen 3.6 Flash | Alibaba | 47.2% | — | — | — | — |
| 45 | Claude 4 Sonnet Thinking | Anthropic | 44.3% | — | — | — | — |
| 46 | MiMo V2 Pro | Xiaomi | 43.2% | — | — | — | — |
| 47 | Claude 4.1 Opus Thinking | Anthropic | 42.4% | — | — | — | — |
| 48 | Qwen 3 Next 80B A3B Thinking | Alibaba | 41.5% | — | — | — | — |
| 49 | DeepSeek V3.2 Exp Thinking | DeepSeek | 41.3% | — | — | — | — |
| 50 | Qwen 3 235B A22B Thinking 2507 | Alibaba | 40.6% | — | — | — | — |
| 51 | GLM 4.7 | Z.AI | 35.7% | — | — | — | — |
| 52 | Gemini 2.5 Pro (Max Thinking) | Google | 33.1% | — | — | — | — |
| 53 | Elephant Alpha | OpenRouter | 29.6% | — | — | — | — |
| 54 | Grok 4 | xAI | 29.1% | — | — | — | — |
| 55 | Gemini 2.5 Flash (Max Thinking) (2025-06-05) | Google | 28.5% | — | — | — | — |
| 56 | Nemotron 3 Super 120B A12B | NVIDIA | 28.4% | — | — | — | — |
| 57 | Grok 4.1 Fast | xAI | 28.2% | — | — | — | — |
| 58 | Claude 4.5 Opus Medium Effort | Anthropic | 28.1% | — | — | — | — |
| 59 | Gemini 2.5 Flash Lite (Max Thinking) (2025-09-25) | Google | 28.1% | — | — | — | — |
| 60 | Gemini 2.5 Flash (Max Thinking) (2025-09-25) | Google | 27.7% | — | — | — | — |
| 61 | GLM 5V Turbo | Z.AI | 27.2% | — | — | — | — |
| 62 | GPT-5.2 No Thinking | OpenAI | 27.2% | — | — | — | — |
| 63 | GLM 4.6 | Z.AI | 26.2% | — | — | — | — |
| 64 | Claude 4.1 Opus | Anthropic | 25.9% | — | — | — | — |
| 65 | Grok 4.20 Beta (Non-Reasoning) | xAI | 24.4% | — | — | — | — |
| 66 | Claude Sonnet 4.5 | Anthropic | 23.5% | — | — | — | — |
| 67 | GPT-5.1 No Thinking | OpenAI | 23.5% | — | — | — | — |
| 68 | Gemini 2.5 Flash Lite (Max Thinking) (2025-06-17) | Google | 23.1% | — | — | — | — |
| 69 | DeepSeek V3.2 | DeepSeek | 23.1% | — | — | — | — |
| 70 | Claude 4 Sonnet | Anthropic | 22.7% | — | — | — | — |
| 71 | Grok Code Fast | xAI | 22.3% | — | — | — | — |
| 72 | Qwen 3 235B A22B Instruct 2507 | Alibaba | 21.7% | — | — | — | — |
| 73 | Qwen 3 30B A3B | Alibaba | 21.1% | — | — | — | — |
| 74 | Kimi K2 Instruct | Moonshot AI | 20.4% | — | — | — | — |
| 75 | DeepSeek V3.2 Exp | DeepSeek | 19.3% | — | — | — | — |
| 76 | Qwen 3 Next 80B A3B Instruct | Alibaba | 19.2% | — | — | — | — |
| 77 | Qwen 3 32B | Alibaba | 17.8% | — | — | — | — |
| 78 | Claude Haiku 4.5 | Anthropic | 17.8% | — | — | — | — |
| 79 | GLM 4.6V | Z.AI | 17.1% | — | — | — | — |
| 80 | Grok 4.1 Fast (Non-Reasoning) | xAI | 17.0% | — | — | — | — |
| 81 | Devstral 2 | Mistral | 13.5% | — | — | — | — |
| 82 | Trinity Large Preview | Arcee | 12.2% | — | — | — | — |
Benchmarks are a starting point, not an answer. The right model depends on your workload, budget, and integration constraints.