| Rank | Model | Provider | Score (0-100) | Samples | Context | Price / 1M tokens (input / output) |
|---|---|---|---|---|---|---|
| 1 | claude-opus-4-6-thinking | Anthropic | 100.0 | 11.9K | 1M | ¥36 / ¥180 |
| 2 | claude-opus-4-6 | Anthropic | 95.7 | 20.4K | 1M | ¥36 / ¥180 |
| 3 | claude-opus-4-7 | Anthropic | 91.3 | 6.7K | 1M | ¥36 / ¥180 |
| 4 | claude-opus-4-7-thinking | Anthropic | 87.0 | 6.4K | 1M | ¥36 / ¥180 |
| 5 | gpt-5.5-high | OpenAI | 82.6 | 4.6K | 1.05M | ¥36 / ¥216 |
| 6 | claude-sonnet-4-6 | Anthropic | 78.3 | 31.9K | 1M | ¥21.6 / ¥108 |
| 7 | gpt-5.5 | OpenAI | 73.9 | 4.7K | 1.05M | ¥36 / ¥216 |
| 8 | gpt-5.4 | OpenAI | 69.6 | 14.4K | 1.05M | ¥18 / ¥108 |
| 9 | claude-opus-4-5-20251101 | Anthropic | 65.2 | 8K | 200K | ¥36 / ¥180 |
| 10 | kimi-k2.6 | Moonshot | 60.9 | 3.8K | 262K | ¥6.84 / ¥28.8 |
| 11 | muse-spark | Meta | 56.5 | 868 | - | - |
| 12 | claude-sonnet-4-5-20250929 | Anthropic | 52.2 | 16.7K | 200K | ¥21.6 / ¥108 |
| 13 | gemini-3.1-pro-preview | Google | 47.8 | 24.9K | 1.05M | ¥14.4 / ¥86.4 |
| 14 | gemini-3-pro | Google | 43.5 | 10.8K | 1.05M | ¥14.4 / ¥86.4 |
| 15 | kimi-k2.5-thinking | Moonshot | 39.1 | 10.5K | 262K | ¥4.32 / ¥21.6 |
| 16 | gemini-2.5-pro | Google | 34.8 | 20K | 1.05M | ¥9 / ¥72 |
| 17 | gemma-4-31b | Google | 30.4 | 4.4K | 262K | ¥3.24 / ¥7.2 |
| 18 | claude-haiku-4-5-20251001 | Anthropic | 26.1 | 17.9K | 200K | ¥7.2 / ¥36 |
| 19 | grok-4.20-beta-0309-reasoning | xAI | 21.7 | 6.8K | 2M | ¥14.4 / ¥43.2 |
| 20 | gemini-3-flash | Google | 17.4 | 7.2K | 1.05M | ¥3.6 / ¥21.6 |
| 21 | gpt-5.2-high | OpenAI | 13.0 | 7.1K | 400K | ¥12.6 / ¥101 |
| 22 | gpt-5.2 | OpenAI | 8.7 | 22.4K | 400K | ¥12.6 / ¥101 |
| 23 | gpt-5.5-instant | OpenAI | 4.3 | 3.5K | 400K | ¥9 / ¥72 |
| 24 | gpt-5.1 | OpenAI | 0.0 | 8.3K | 400K | ¥9 / ¥72 |

Top model analysis: why claude-opus-4-6-thinking ranks first
claude-opus-4-6-thinking ranks first with a score of 100.0 on the leaderboard's 0-100 scale, based on 11.9K samples. Treat it as the default first choice for this leaderboard, then weigh its price, context length, and availability against the models ranked below it.
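
The page does not say how the 0-100 score is computed. The published values do, however, match a simple linear rescaling of rank across the 24 listed models, so a score of 100.0 may only mean "ranked first" rather than "perfect". The sketch below checks that reading against a few rows; the function name `rank_to_score` and the formula itself are assumptions inferred from the table, not a documented method.

```python
def rank_to_score(rank: int, n_models: int) -> float:
    """Map rank 1..n_models linearly onto 100..0, rounded to one decimal.

    Assumed formula (inferred from the table, not stated by the source):
        score = 100 * (n_models - rank) / (n_models - 1)
    """
    if n_models < 2 or not 1 <= rank <= n_models:
        raise ValueError("rank must lie in 1..n_models and n_models must be >= 2")
    return round(100.0 * (n_models - rank) / (n_models - 1), 1)


N_MODELS = 24  # number of rows in the leaderboard

# A few (rank, published score) pairs copied from the table.
published = {1: 100.0, 2: 95.7, 6: 78.3, 13: 47.8, 23: 4.3, 24: 0.0}

for rank, score in published.items():
    computed = rank_to_score(rank, N_MODELS)
    status = "match" if computed == score else "differs"
    print(f"rank {rank:>2}: published {score:5.1f}, computed {computed:5.1f} ({status})")
```

If this reading holds, the gap between two adjacent scores is always about 4.3 points and says nothing about how close the underlying results were, which is one more reason to look at sample counts alongside rank.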
How to choose: do not look only at rank #1
Start with the leaderboard closest to your task. Compare the top models by score and sample size, then check price, context length, open or closed access, and provider availability.
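
The selection steps above can be sketched as a small filter-then-sort routine. This is only an illustration: the `Model` class, the `shortlist` function, and the thresholds are hypothetical, the rows are a subset copied from the table, and context sizes use the rounded figures shown on the page (for example, 1M is taken as 1,000,000 tokens).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    score: float          # leaderboard score, 0-100
    samples: int          # number of samples behind the score
    context_tokens: int   # context length, rounded as displayed on the page
    input_price: float    # ¥ per 1M input tokens
    output_price: float   # ¥ per 1M output tokens


# Subset of rows copied from the leaderboard table.
MODELS = [
    Model("claude-opus-4-6-thinking", 100.0, 11_900, 1_000_000, 36.0, 180.0),
    Model("claude-sonnet-4-6", 78.3, 31_900, 1_000_000, 21.6, 108.0),
    Model("gpt-5.4", 69.6, 14_400, 1_050_000, 18.0, 108.0),
    Model("kimi-k2.6", 60.9, 3_800, 262_000, 6.84, 28.8),
    Model("gemini-3.1-pro-preview", 47.8, 24_900, 1_050_000, 14.4, 86.4),
]


def shortlist(models, max_output_price, min_context, min_samples):
    """Keep models that meet the constraints, best score first."""
    keep = [
        m for m in models
        if m.output_price <= max_output_price
        and m.context_tokens >= min_context
        and m.samples >= min_samples
    ]
    return sorted(keep, key=lambda m: m.score, reverse=True)


# Hypothetical requirements: a capped output price, a long context,
# and enough samples to trust the ranking.
picks = shortlist(MODELS, max_output_price=110.0,
                  min_context=500_000, min_samples=5_000)
for m in picks:
    print(f"{m.name}: score {m.score}, ¥{m.input_price} / ¥{m.output_price}")
```

With these example thresholds the top-ranked model drops out on price and kimi-k2.6 on context length, which shows why rank alone is not enough. Open or closed access and provider availability are not in the table, so they would need to be checked separately.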