Math benchmarks

Scores for numerical reasoning and mathematical problem solving, taken from LiveBench.

LiveBench Math

#	Model	Provider	Score	Input $/M	Output $/M	Context	CI
1	GPT-5.5 Thinking xHigh Effort	OpenAI	96.3%	—	—	—	—
2	Claude 4.8 Opus Thinking xHigh Effort	Anthropic	95.3%	—	—	—	—
3	GPT-5.4 Thinking xHigh Effort	OpenAI	94.2%	—	—	—	—
4	Claude Fable 5 Thinking xHigh Effort*	Anthropic	93.9%	—	—	—	—
5	GPT-5.2 High	OpenAI	93.2%	—	—	—	—
6	Claude 4.7 Opus Thinking xHigh Effort	Anthropic	93.1%	—	—	—	—
7	GPT-5.4 Nano xHigh	OpenAI	91.3%	—	—	—	—
8	Gemini 3.1 Pro Preview High	Google	91.0%	—	—	—	—
9	DeepSeek V4 Pro	DeepSeek	90.7%	—	—	—	—
10	Claude 4.5 Opus Thinking High Effort	Anthropic	90.4%	—	—	—	—
11	GLM 5.2	Z.AI	89.8%	—	—	—	—
12	Claude Sonnet 5 xHigh Effort	Anthropic	89.6%	—	—	—	—
13	Claude 4.6 Opus Thinking High Effort	Anthropic	89.3%	—	—	—	—
14	GPT-5.2 Codex	OpenAI	88.8%	—	—	—	—
15	Gemini 3.5 Flash High	Google	88.2%	—	—	—	—
16	GPT-5.3 Codex High	OpenAI	87.8%	—	—	—	—
17	Grok 4.20 Beta	xAI	87.1%	—	—	—	—
18	Claude 4.6 Sonnet Thinking Medium Effort	Anthropic	87.0%	—	—	—	—
19	GPT-5.1 High	OpenAI	86.9%	—	—	—	—
20	GPT-5 Pro	OpenAI	86.2%	—	—	—	—
21	Qwen 3.7 Max	Alibaba	85.3%	—	—	—	—
22	DeepSeek V3.2 Thinking	DeepSeek	85.0%	—	—	—	—
23	GLM 5.1	Z.AI	84.9%	—	—	—	—
24	Kimi K2.5 Thinking	Moonshot AI	84.9%	—	—	—	—
25	Grok 4.3	xAI	84.3%	—	—	—	—
26	Kimi K2.6 Thinking	Moonshot AI	84.3%	—	—	—	—
27	Gemini 3 Flash Preview High	Google	84.2%	—	—	—	—
28	Qwen 3.6 Plus	Alibaba	83.7%	—	—	—	—
29	Grok 4.1 Fast	xAI	83.7%	—	—	—	—
30	GLM 5	Z.AI	83.5%	—	—	—	—
31	GPT-5.1 Codex Max High	OpenAI	83.2%	—	—	—	—
32	Grok 4	xAI	83.0%	—	—	—	—
33	DeepSeek V3.2 Exp Thinking	DeepSeek	82.4%	—	—	—	—
34	GPT-5 Mini High	OpenAI	82.2%	—	—	—	—
35	Gemini 3 Pro Preview High	Google	81.8%	—	—	—	—
36	GLM 4.6	Z.AI	81.1%	—	—	—	—
37	Kimi K2 Thinking	Moonshot AI	81.1%	—	—	—	—
38	Minimax M2.7	MiniMax	80.5%	—	—	—	—
39	Qwen 3.6 27B	Alibaba	79.9%	—	—	—	—
40	DeepSeek V4 Flash	DeepSeek	79.7%	—	—	—	—
41	Kimi K2.7 Code	Moonshot AI	79.6%	—	—	—	—
42	GPT-5.1 Codex	OpenAI	79.6%	—	—	—	—
43	Claude Sonnet 4.5 Thinking	Anthropic	79.3%	—	—	—	—
44	Qwen 3.6 Flash	Alibaba	78.9%	—	—	—	—
45	GPT-5.4 Mini xHigh	OpenAI	78.6%	—	—	—	—
46	Grok Build 0.1	xAI	78.4%	—	—	—	—
47	Claude Haiku 4.5 Thinking	Anthropic	77.5%	—	—	—	—
48	Minimax M2.5	MiniMax	77.4%	—	—	—	—
49	MiMo V2 Pro	Xiaomi	77.0%	—	—	—	—
50	Minimax M3	MiniMax	77.0%	—	—	—	—
51	GPT-5.1 Codex Mini	OpenAI	76.3%	—	—	—	—
52	GLM 4.7	Z.AI	76.0%	—	—	—	—
53	Gemini 2.5 Flash (Max Thinking) (2025-09-25)	Google	75.3%	—	—	—	—
54	Qwen 3 Next 80B A3B Thinking	Alibaba	74.3%	—	—	—	—
55	Gemma 4 31B	Google	73.9%	—	—	—	—
56	Gemini 3.1 Flash Lite Preview High	Google	73.6%	—	—	—	—
57	Qwen 3 235B A22B Thinking 2507	Alibaba	73.4%	—	—	—	—
58	Claude 4.1 Opus Thinking	Anthropic	73.2%	—	—	—	—
59	GPT-5.3 Instant	OpenAI	72.4%	—	—	—	—
60	Claude 4 Sonnet Thinking	Anthropic	70.5%	—	—	—	—
61	GLM 5V Turbo	Z.AI	70.4%	—	—	—	—
62	Qwen 3 Next 80B A3B Instruct	Alibaba	70.2%	—	—	—	—
63	GPT OSS 120b	OpenAI	68.9%	—	—	—	—
64	Gemini 2.5 Flash (Max Thinking) (2025-06-05)	Google	68.8%	—	—	—	—
65	GPT-5 Nano High	OpenAI	68.4%	—	—	—	—
66	Gemini 2.5 Pro (Max Thinking)	Google	68.3%	—	—	—	—
67	Qwen 3 235B A22B Instruct 2507	Alibaba	68.0%	—	—	—	—
68	Qwen 3 32B	Alibaba	67.4%	—	—	—	—
69	Claude 4.5 Opus Medium Effort	Anthropic	66.3%	—	—	—	—
70	Qwen 3 30B A3B	Alibaba	65.3%	—	—	—	—
71	Gemini 2.5 Flash Lite (Max Thinking) (2025-09-25)	Google	64.9%	—	—	—	—
72	DeepSeek V3.2 Exp	DeepSeek	64.4%	—	—	—	—
73	DeepSeek V3.2	DeepSeek	64.0%	—	—	—	—
74	Claude 4.1 Opus	Anthropic	62.8%	—	—	—	—
75	Claude Sonnet 4.5	Anthropic	62.6%	—	—	—	—
76	GLM 4.6V	Z.AI	62.5%	—	—	—	—
77	Gemini 2.5 Flash Lite (Max Thinking) (2025-06-17)	Google	61.0%	—	—	—	—
78	Claude 4 Sonnet	Anthropic	60.4%	—	—	—	—
79	GPT-5.2 No Thinking	OpenAI	58.3%	—	—	—	—
80	Kimi K2 Instruct	Moonshot AI	58.1%	—	—	—	—
81	Claude Haiku 4.5	Anthropic	58.0%	—	—	—	—
82	Elephant Alpha	OpenRouter	57.5%	—	—	—	—
83	Grok Code Fast	xAI	56.0%	—	—	—	—
84	Nemotron 3 Ultra 550B A55B	NVIDIA	54.5%	—	—	—	—
85	Devstral 2	Mistral	52.5%	—	—	—	—
86	Grok 4.20 Beta (Non-Reasoning)	xAI	45.5%	—	—	—	—
87	Trinity Large Preview	Arcee AI	44.9%	—	—	—	—
88	GPT-5.1 No Thinking	OpenAI	44.5%	—	—	—	—
89	Grok 4.1 Fast (Non-Reasoning)	xAI	38.9%	—	—	—	—
90	Nemotron 3 Super 120B A12B	NVIDIA	36.4%	—	—	—	—

* Losing out due to stricter content moderation.
