Reasoning benchmarks

Logic, deduction, and inference tasks from LiveBench.

LiveBench

#	Model	Provider	Score	Input $/M	Output $/M	Context	CI
1	GPT-5.5 Thinking xHigh Effort	OpenAI	80.7%	—	—	—	—
2	GPT-5.4 Thinking xHigh Effort	OpenAI	80.3%	—	—	—	—
3	Gemini 3.1 Pro Preview High	Google	79.9%	—	—	—	—
4	Claude 4.8 Opus Thinking xHigh Effort	Anthropic	78.8%	—	—	—	—
5	Claude Fable 5 Thinking xHigh Effort*	Anthropic	78.3%	—	—	—	—
6	Claude 4.7 Opus Thinking xHigh Effort	Anthropic	76.9%	—	—	—	—
7	Claude 4.6 Opus Thinking High Effort	Anthropic	76.3%	—	—	—	—
8	GLM 5.2	Z.AI	76.2%	—	—	—	—
9	Claude 4.5 Opus Thinking High Effort	Anthropic	76.0%	—	—	—	—
10	Claude Sonnet 5 xHigh Effort	Anthropic	75.8%	—	—	—	—
11	Claude 4.6 Sonnet Thinking Medium Effort	Anthropic	75.5%	—	—	—	—
12	Gemini 3.5 Flash High	Google	75.0%	—	—	—	—
13	GPT-5.2 High	OpenAI	74.8%	—	—	—	—
14	GPT-5.2 Codex	OpenAI	74.3%	—	—	—	—
15	Qwen 3.7 Max	Alibaba	74.3%	—	—	—	—
16	GPT-5.1 Codex Max High	OpenAI	74.0%	—	—	—	—
17	DeepSeek V4 Pro	DeepSeek	73.6%	—	—	—	—
18	Gemini 3 Pro Preview High	Google	73.4%	—	—	—	—
19	GPT-5.3 Codex High	OpenAI	72.8%	—	—	—	—
20	Gemini 3 Flash Preview High	Google	72.4%	—	—	—	—
21	Kimi K2.6 Thinking	Moonshot AI	72.2%	—	—	—	—
22	GPT-5.1 High	OpenAI	72.0%	—	—	—	—
23	Kimi K2.7 Code	Moonshot AI	71.9%	—	—	—	—
24	Qwen 3.6 Plus	Alibaba	70.8%	—	—	—	—
25	GPT-5 Pro	OpenAI	70.5%	—	—	—	—
26	GLM 5.1	Z.AI	70.2%	—	—	—	—
27	GPT-5.4 Nano xHigh	OpenAI	70.1%	—	—	—	—
28	Minimax M3	MiniMax	70.0%	—	—	—	—
29	Kimi K2.5 Thinking	Moonshot AI	69.1%	—	—	—	—
30	Grok Build 0.1	xAI	68.9%	—	—	—	—
31	GLM 5	Z.AI	68.8%	—	—	—	—
32	GPT-5.1 Codex	OpenAI	68.6%	—	—	—	—
33	Claude Sonnet 4.5 Thinking	Anthropic	68.2%	—	—	—	—
34	Grok 4.20 Beta	xAI	68.0%	—	—	—	—
35	GPT-5.4 Mini xHigh	OpenAI	67.5%	—	—	—	—
36	DeepSeek V4 Flash	DeepSeek	67.3%	—	—	—	—
37	Grok 4.3	xAI	66.7%	—	—	—	—
38	GPT-5 Mini High	OpenAI	65.9%	—	—	—	—
39	Qwen 3.6 27B	Alibaba	65.6%	—	—	—	—
40	Minimax M2.7	MiniMax	63.5%	—	—	—	—
41	DeepSeek V3.2 Thinking	DeepSeek	62.2%	—	—	—	—
42	Grok 4	xAI	62.0%	—	—	—	—
43	Claude 4.1 Opus Thinking	Anthropic	61.8%	—	—	—	—
44	Gemini 3.1 Flash Lite Preview High	Google	61.7%	—	—	—	—
45	Gemma 4 31B	Google	61.6%	—	—	—	—
46	Kimi K2 Thinking	Moonshot AI	61.6%	—	—	—	—
47	Claude Haiku 4.5 Thinking	Anthropic	61.3%	—	—	—	—
48	Claude 4 Sonnet Thinking	Anthropic	61.3%	—	—	—	—
49	GPT-5.1 Codex Mini	OpenAI	60.4%	—	—	—	—
50	Qwen 3.6 Flash	Alibaba	60.4%	—	—	—	—
51	Minimax M2.5	MiniMax	60.1%	—	—	—	—
52	GPT-5.3 Instant	OpenAI	60.0%	—	—	—	—
53	Grok 4.1 Fast	xAI	60.0%	—	—	—	—
54	Claude 4.5 Opus Medium Effort	Anthropic	59.1%	—	—	—	—
55	DeepSeek V3.2 Exp Thinking	DeepSeek	58.9%	—	—	—	—
56	Gemini 2.5 Pro (Max Thinking)	Google	58.3%	—	—	—	—
57	MiMo V2 Pro	Xiaomi	58.1%	—	—	—	—
58	GLM 4.7	Z.AI	58.1%	—	—	—	—
59	GLM 4.6	Z.AI	55.2%	—	—	—	—
60	Claude 4.1 Opus	Anthropic	54.5%	—	—	—	—
61	Claude Sonnet 4.5	Anthropic	53.7%	—	—	—	—
62	Gemini 2.5 Flash (Max Thinking) (2025-09-25)	Google	53.1%	—	—	—	—
63	Qwen 3 235B A22B Thinking 2507	Alibaba	53.0%	—	—	—	—
64	DeepSeek V3.2	DeepSeek	51.8%	—	—	—	—
65	Nemotron 3 Ultra 550B A55B	NVIDIA	51.8%	—	—	—	—
66	Claude 4 Sonnet	Anthropic	51.0%	—	—	—	—
67	Qwen 3 Next 80B A3B Thinking	Alibaba	50.4%	—	—	—	—
68	DeepSeek V3.2 Exp	DeepSeek	49.9%	—	—	—	—
69	GLM 5V Turbo	Z.AI	49.6%	—	—	—	—
70	GPT-5.2 No Thinking	OpenAI	48.9%	—	—	—	—
71	Qwen 3 235B A22B Instruct 2507	Alibaba	48.8%	—	—	—	—
72	GPT-5 Nano High	OpenAI	48.6%	—	—	—	—
73	Qwen 3 Next 80B A3B Instruct	Alibaba	48.4%	—	—	—	—
74	Kimi K2 Instruct	Moonshot AI	48.1%	—	—	—	—
75	Gemini 2.5 Flash (Max Thinking) (2025-06-05)	Google	47.7%	—	—	—	—
76	GPT OSS 120b	OpenAI	46.1%	—	—	—	—
77	Claude Haiku 4.5	Anthropic	45.3%	—	—	—	—
78	Grok Code Fast	xAI	45.1%	—	—	—	—
79	Qwen 3 32B	Alibaba	43.6%	—	—	—	—
80	GPT-5.1 No Thinking	OpenAI	42.6%	—	—	—	—
81	Gemini 2.5 Flash Lite (Max Thinking) (2025-06-17)	Google	42.6%	—	—	—	—
82	Gemini 2.5 Flash Lite (Max Thinking) (2025-09-25)	Google	42.4%	—	—	—	—
83	Devstral 2	Mistral	41.2%	—	—	—	—
84	GLM 4.6V	Z.AI	40.1%	—	—	—	—
85	Grok 4.20 Beta (Non-Reasoning)	xAI	39.7%	—	—	—	—
86	Qwen 3 30B A3B	Alibaba	39.0%	—	—	—	—
87	Elephant Alpha	OpenRouter	36.0%	—	—	—	—
88	Grok 4.1 Fast (Non-Reasoning)	xAI	33.5%	—	—	—	—
89	Trinity Large Preview	Arcee AI	32.7%	—	—	—	—
90	Nemotron 3 Super 120B A12B	NVIDIA	32.5%	—	—	—	—

* Losing out due to stricter content moderation.

LiveBench Reasoning

#	Model	Provider	Score	Input $/M	Output $/M	Context	CI
1	Claude 4.8 Opus Thinking xHigh Effort	Anthropic	89.7%	—	—	—	—
2	Claude 4.6 Opus Thinking High Effort	Anthropic	88.7%	—	—	—	—
3	GPT-5.4 Thinking xHigh Effort	OpenAI	88.1%	—	—	—	—
4	GPT-5.5 Thinking xHigh Effort	OpenAI	87.7%	—	—	—	—
5	Claude 4.7 Opus Thinking xHigh Effort	Anthropic	87.7%	—	—	—	—
6	Claude Fable 5 Thinking xHigh Effort*	Anthropic	87.3%	—	—	—	—
7	Claude Sonnet 5 xHigh Effort	Anthropic	86.9%	—	—	—	—
8	Claude 4.6 Sonnet Thinking Medium Effort	Anthropic	84.8%	—	—	—	—
9	Gemini 3.1 Pro Preview High	Google	84.0%	—	—	—	—
10	GPT-5.1 Codex Max High	OpenAI	83.7%	—	—	—	—
11	Qwen 3.7 Max	Alibaba	83.3%	—	—	—	—
12	GPT-5.2 High	OpenAI	83.2%	—	—	—	—
13	Kimi K2.7 Code	Moonshot AI	82.8%	—	—	—	—
14	DeepSeek V4 Pro	DeepSeek	82.7%	—	—	—	—
15	Gemini 3.5 Flash High	Google	82.0%	—	—	—	—
16	GPT-5.1 Codex	OpenAI	82.0%	—	—	—	—
17	GPT-5 Pro	OpenAI	81.7%	—	—	—	—
18	GPT-5.4 Nano xHigh	OpenAI	81.0%	—	—	—	—
19	Grok 4.1 Fast	xAI	80.2%	—	—	—	—
20	GPT-5.3 Codex High	OpenAI	80.2%	—	—	—	—
21	Claude 4.5 Opus Thinking High Effort	Anthropic	80.1%	—	—	—	—
22	Kimi K2.6 Thinking	Moonshot AI	79.4%	—	—	—	—
23	Grok 4	xAI	79.1%	—	—	—	—
24	GPT-5.1 High	OpenAI	78.8%	—	—	—	—
25	GLM 5.2	Z.AI	78.6%	—	—	—	—
26	GPT-5.2 Codex	OpenAI	77.7%	—	—	—	—
27	Claude Sonnet 4.5 Thinking	Anthropic	77.6%	—	—	—	—
28	Gemini 3 Pro Preview High	Google	77.4%	—	—	—	—
29	DeepSeek V3.2 Thinking	DeepSeek	77.2%	—	—	—	—
30	Grok Build 0.1	xAI	76.4%	—	—	—	—
31	Kimi K2.5 Thinking	Moonshot AI	76.0%	—	—	—	—
32	Qwen 3.6 Plus	Alibaba	75.8%	—	—	—	—
33	Grok 4.20 Beta	xAI	75.3%	—	—	—	—
34	Minimax M2.7	MiniMax	74.8%	—	—	—	—
35	Gemini 3 Flash Preview High	Google	74.5%	—	—	—	—
36	Minimax M3	MiniMax	74.5%	—	—	—	—
37	GLM 5.1	Z.AI	72.5%	—	—	—	—
38	GPT-5.4 Mini xHigh	OpenAI	72.5%	—	—	—	—
39	Claude 4.1 Opus Thinking	Anthropic	72.3%	—	—	—	—
40	Grok 4.3	xAI	70.8%	—	—	—	—
41	Gemini 2.5 Pro (Max Thinking)	Google	70.8%	—	—	—	—
42	DeepSeek V4 Flash	DeepSeek	70.6%	—	—	—	—
43	Qwen 3.6 27B	Alibaba	70.3%	—	—	—	—
44	MiMo V2 Pro	Xiaomi	69.7%	—	—	—	—
45	GLM 5	Z.AI	69.1%	—	—	—	—
46	Claude 4 Sonnet Thinking	Anthropic	69.0%	—	—	—	—
47	GPT-5 Mini High	OpenAI	68.3%	—	—	—	—
48	GPT-5.1 Codex Mini	OpenAI	64.7%	—	—	—	—
49	DeepSeek V3.2 Exp Thinking	DeepSeek	64.4%	—	—	—	—
50	Kimi K2 Thinking	Moonshot AI	63.5%	—	—	—	—
51	GPT-5.3 Instant	OpenAI	63.1%	—	—	—	—
52	Qwen 3.6 Flash	Alibaba	62.9%	—	—	—	—
53	GLM 4.6	Z.AI	62.1%	—	—	—	—
54	Claude Haiku 4.5 Thinking	Anthropic	61.7%	—	—	—	—
55	GLM 4.7	Z.AI	59.7%	—	—	—	—
56	Gemini 3.1 Flash Lite Preview High	Google	59.7%	—	—	—	—
57	Gemma 4 31B	Google	59.4%	—	—	—	—
58	Qwen 3 235B A22B Thinking 2507	Alibaba	59.4%	—	—	—	—
59	Minimax M2.5	MiniMax	59.3%	—	—	—	—
60	Qwen 3 235B A22B Instruct 2507	Alibaba	58.4%	—	—	—	—
61	Qwen 3 Next 80B A3B Thinking	Alibaba	58.2%	—	—	—	—
62	GLM 5V Turbo	Z.AI	56.1%	—	—	—	—
63	Qwen 3 Next 80B A3B Instruct	Alibaba	54.8%	—	—	—	—
64	Claude 4.5 Opus Medium Effort	Anthropic	53.2%	—	—	—	—
65	Gemini 2.5 Flash (Max Thinking) (2025-09-25)	Google	51.5%	—	—	—	—
66	Qwen 3 32B	Alibaba	48.3%	—	—	—	—
67	DeepSeek V3.2 Exp	DeepSeek	45.5%	—	—	—	—
68	Gemini 2.5 Flash (Max Thinking) (2025-06-05)	Google	44.6%	—	—	—	—
69	DeepSeek V3.2	DeepSeek	44.3%	—	—	—	—
70	Gemini 2.5 Flash Lite (Max Thinking) (2025-06-17)	Google	43.3%	—	—	—	—
71	GPT-5.2 No Thinking	OpenAI	42.8%	—	—	—	—
72	Grok Code Fast	xAI	42.3%	—	—	—	—
73	Claude Sonnet 4.5	Anthropic	42.3%	—	—	—	—
74	Kimi K2 Instruct	Moonshot AI	42.2%	—	—	—	—
75	Claude 4.1 Opus	Anthropic	40.9%	—	—	—	—
76	GPT-5 Nano High	OpenAI	40.3%	—	—	—	—
77	Elephant Alpha	OpenRouter	40.0%	—	—	—	—
78	Claude 4 Sonnet	Anthropic	39.7%	—	—	—	—
79	GPT OSS 120b	OpenAI	39.2%	—	—	—	—
80	Nemotron 3 Ultra 550B A55B	NVIDIA	37.5%	—	—	—	—
81	GLM 4.6V	Z.AI	37.2%	—	—	—	—
82	Qwen 3 30B A3B	Alibaba	36.7%	—	—	—	—
83	Gemini 2.5 Flash Lite (Max Thinking) (2025-09-25)	Google	36.2%	—	—	—	—
84	Nemotron 3 Super 120B A12B	NVIDIA	34.4%	—	—	—	—
85	Claude Haiku 4.5	Anthropic	33.9%	—	—	—	—
86	Devstral 2	Mistral	27.7%	—	—	—	—
87	GPT-5.1 No Thinking	OpenAI	26.8%	—	—	—	—
88	Grok 4.20 Beta (Non-Reasoning)	xAI	25.6%	—	—	—	—
89	Grok 4.1 Fast (Non-Reasoning)	xAI	23.4%	—	—	—	—
90	Trinity Large Preview	Arcee AI	20.6%	—	—	—	—

* Losing out due to stricter content moderation.
