
Data updated 30 minutes ago. Source: LiveBench Agentic Coding.


Agentic coding benchmarks

Multi-step code editing and tool use: agentic workflows from LiveBench.

| Rank | Model | Provider | Score |
|-----:|-------|----------|------:|
| 1 | GPT-5.4 Thinking xHigh Effort | OpenAI | 70.0% |
| 2 | Gemini 3.1 Pro Preview High | Google | 65.0% |
| 3 | Claude 4.5 Opus Thinking High Effort | Anthropic | 63.3% |
| 4 | Claude 4.5 Opus Medium Effort | Anthropic | 63.3% |
| 5 | Claude 4.6 Opus Thinking High Effort | Anthropic | 61.7% |
| 6 | Claude 4.6 Sonnet Thinking Medium Effort | Anthropic | 60.0% |
| 7 | Gemini 3 Pro Preview High | Google | 55.0% |
| 8 | GPT-5.3 Codex High | OpenAI | 55.0% |
| 9 | Qwen 3.6 Plus | Alibaba | 55.0% |
| 10 | GLM 5.1 | Z.AI | 55.0% |
| 11 | GLM 5 | Z.AI | 55.0% |
| 12 | GPT-5.1 Codex Max High | OpenAI | 53.3% |
| 13 | GPT-5.1 High | OpenAI | 53.3% |
| 14 | GPT-5.1 Codex | OpenAI | 53.3% |
| 15 | Claude Sonnet 4.5 Thinking | Anthropic | 53.3% |
| 16 | Claude 4.1 Opus | Anthropic | 53.3% |
| 17 | GPT-5.2 High | OpenAI | 51.7% |
| 18 | GPT-5.2 Codex | OpenAI | 51.7% |
| 19 | GPT-5 Pro | OpenAI | 51.7% |
| 20 | Minimax M2.5 | Minimax | 51.7% |
| 21 | Minimax M2.7 | Minimax | 50.0% |
| 22 | GPT-5.4 Nano xHigh | OpenAI | 49.1% |
| 23 | Kimi K2.5 Thinking | Moonshot AI | 48.3% |
| 24 | Claude 4.1 Opus Thinking | Anthropic | 48.3% |
| 25 | Claude Sonnet 4.5 | Anthropic | 48.3% |
| 26 | GPT-5.4 Mini xHigh | OpenAI | 47.5% |
| 27 | GPT-5 Mini High | OpenAI | 46.7% |
| 28 | DeepSeek V3.2 | DeepSeek | 46.7% |
| 29 | Grok 4.20 Beta | xAI | 43.3% |
| 30 | Devstral 2 | Mistral | 43.3% |
| 31 | Claude Haiku 4.5 Thinking | Anthropic | 41.7% |
| 32 | GLM 4.7 | Z.AI | 41.7% |
| 33 | Gemini 3 Flash Preview High | Google | 40.0% |
| 34 | DeepSeek V3.2 Thinking | DeepSeek | 40.0% |
| 35 | Gemma 4 31B | Google | 40.0% |
| 36 | Claude 4 Sonnet Thinking | Anthropic | 40.0% |
| 37 | GPT-5.1 Codex Mini | OpenAI | 40.0% |
| 38 | GPT-5.2 No Thinking | OpenAI | 40.0% |
| 39 | Kimi K2 Thinking | Moonshot AI | 38.3% |
| 40 | Claude 4 Sonnet | Anthropic | 38.3% |
| 41 | Grok 4.20 Beta (Non-Reasoning) | xAI | 38.3% |
| 42 | DeepSeek V3.2 Exp | DeepSeek | 36.7% |
| 43 | GLM 4.6 | Z.AI | 35.0% |
| 44 | Gemini 3.1 Flash Lite Preview High | Google | 33.3% |
| 45 | Gemini 2.5 Pro (Max Thinking) | Google | 33.3% |
| 46 | Claude Haiku 4.5 | Anthropic | 33.3% |
| 47 | Grok Code Fast | xAI | 33.3% |
| 48 | Grok 4.1 Fast | xAI | 31.7% |
| 49 | DeepSeek V3.2 Exp Thinking | DeepSeek | 31.7% |
| 50 | Kimi K2 Instruct | Moonshot AI | 31.7% |
| 51 | Grok 4 | xAI | 30.0% |
| 52 | MiMo V2 Pro | Xiaomi | 30.0% |
| 53 | GPT-5.3 Instant | OpenAI | 28.3% |
| 54 | GPT-5.1 No Thinking | OpenAI | 28.3% |
| 55 | Gemini 2.5 Flash (Max Thinking) (2025-09-25) | Google | 23.3% |
| 56 | GPT-5 Nano High | OpenAI | 23.3% |
| 57 | Nemotron 3 Super 120B A12B | NVIDIA | 23.0% |
| 58 | Gemini 2.5 Flash (Max Thinking) (2025-06-05) | Google | 16.7% |
| 59 | GPT OSS 120b | OpenAI | 16.7% |
| 60 | Qwen 3 235B A22B Instruct 2507 | Alibaba | 13.3% |
| 61 | Qwen 3 Next 80B A3B Instruct | Alibaba | 10.0% |
| 62 | Grok 4.1 Fast (Non-Reasoning) | xAI | 10.0% |
| 63 | Qwen 3 Next 80B A3B Thinking | Alibaba | 8.3% |
| 64 | Qwen 3 235B A22B Thinking 2507 | Alibaba | 6.7% |
| 65 | Gemini 2.5 Flash Lite (Max Thinking) (2025-06-17) | Google | 5.0% |
| 66 | GLM 5V Turbo | Z.AI | 3.3% |
| 67 | Qwen 3 32B | Alibaba | 3.3% |
| 68 | GLM 4.6V | Z.AI | 3.3% |
| 69 | Trinity Large Preview | Arcee | 3.3% |
| 70 | Gemini 2.5 Flash Lite (Max Thinking) (2025-09-25) | Google | 1.7% |
| 71 | Qwen 3 30B A3B | Alibaba | 1.7% |

