The World's #1 Vibe Coding Benchmark

See how leading AI coding models stack up across UI generation, algorithms, debugging, refactoring, reasoning, security, and speed. Each card provides a snapshot of the top performers in that category.

Benchmarks

UI

Apr 16 · 0h ago

Rank	Model	Score
1	Claude Sonnet 4.6	81.5
2	Claude Opus 4.6	81.1
3	Grok 4.20 Reasoning	79.3
4	Claude Opus 4.7	78.4
5	Gemma 4 31B	78.0
6	GLM 5.1	76.5
7	GPT-5.4 Mini	72.6
8	GLM 5V Turbo	72.6
9	GPT-5.4	65.7
10	GPT-5.4 Nano	65.3

Security

Apr 16 · 0h ago

Rank	Model	Score
1	Claude Sonnet 4.6	85.3
2	Gemini 3.1 Pro	85.2
3	GPT-5.4	84.4
4	GPT-5.4 Mini	83.3
5	Qwen 3.6 Plus	82.4
6	Claude Opus 4.6	81.6
7	GLM 5.1	80.8
8	Qwen3.5 Plus 2026-02-15	80.3
9	GPT-5.4 Nano	80.0
10	Grok 4.20 Reasoning	78.9

Refactoring

Apr 16 · 0h ago

Rank	Model	Score	Intent
1	Claude Opus 4.7	75.2	85.3%
2	Qwen 3.6 Plus	74.8	85.7%
3	Gemini 3.1 Pro	70.0	82.2%
4	Claude Opus 4.6	69.5	82.0%
5	Claude Sonnet 4.6	69.4	82.4%
6	Grok 4.20 Reasoning	68.1	81.9%
7	Grok 4.20 (Non-Reasoning)	67.6	80.7%
8	GPT-5.4	63.4	78.0%
9	GPT-5.4 Mini	62.3	76.5%
10	GLM 5V Turbo	61.0	78.1%

Hallucination

Apr 16 · 0h ago

Rank	Model	Score	Fab %
1	Gemini 3.1 Pro	79.1	26.7%
2	Qwen 3.6 Plus	79.1	27.0%
3	Qwen3.5 Plus 2026-02-15	77.3	29.0%
4	Claude Opus 4.7	77.1	27.5%
5	Claude Opus 4.5	76.9	27.9%
6	Claude Opus 4.6 (April 14)	76.9	29.1%
7	Claude Sonnet 4.6	76.6	28.9%
8	Grok 4.20 (Non-Reasoning)	76.1	29.7%
9	Grok 4.20 Reasoning	76.0	29.7%
10	Gemini 3 Pro	75.9	30.0%

Reasoning

Apr 16 · 0h ago

Rank	Model	Score	Accuracy
1	Grok 4.20 Reasoning	41.8	10.0%
2	GPT-5.4	40.6	10.0%
3	Claude Opus 4.7	40.3	6.7%
4	Grok 4.20 (Non-Reasoning)	40.0	6.7%
5	Claude Opus 4.6	39.6	10.0%
6	MiniMax M2.5	38.1	10.0%
7	Qwen 3.6 Plus	38.0	6.7%
8	Kimi K2.5	37.8	6.7%
9	MiniMax M2.7	37.5	6.7%
10	Claude Sonnet 4.6	37.2	3.3%

30 tasks · hard benchmark · grounded reasoning over mixed artifacts

Debugging

Apr 16 · 0h ago

Rank	Model	Score	Diagnose
1	Claude Opus 4.6	87.0	25.0%
2	Claude Sonnet 4.6	86.6	22.9%
3	Grok 4.20 (Non-Reasoning)	86.3	19.3%
4	Claude Opus 4.7	86.2	21.5%
5	Gemini 3.1 Pro	85.9	14.0%
6	MiMo-V2-Pro	85.8	17.0%
7	GPT-5.4	85.6	21.5%
8	o4-mini	85.6	21.0%
9	Grok 4.20 Reasoning	85.3	11.5%
10	Qwen 3.6 Plus	85.1	26.5%

Speed

Apr 16 · 0h ago

Rank	Model	tok/s	TTFT
1	Grok 4.20 (Non-Reasoning)	243.3	1999ms
2	Grok 4.20 Reasoning	237.7	1497ms
3	GPT-5.4 Mini	236.4	233ms
4	GPT-5.4 Nano	227.8	941ms
5	GLM 5V Turbo	221.2	5444ms
6	Qwen 3.6 Plus	158	11520ms
7	Gemini 3.1 Pro	122.2	7608ms
8	Claude Opus 4.7	116.4	852ms
9	Claude Sonnet 4.6	95.3	1207ms
10	Qwen3.5 Plus 2026-02-15	94.6	14952ms
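
The Speed card ranks by tok/s, but TTFT varies by more than an order of magnitude across the same rows, so the ranking a user experiences can differ with response length. The sketch below is a rough illustration, not the benchmark's own method: it assumes TTFT is the delay before the first output token and tok/s is a steady output rate afterwards, so total time is approximately TTFT plus tokens divided by tok/s. The function names `estimated_latency_s` and `rank_by_latency` are hypothetical, and only a few rows from the table are included.

```python
# Rough latency model (an assumption, not the benchmark's definition):
#   total_seconds ~= TTFT + output_tokens / tokens_per_second

def estimated_latency_s(tok_per_s: float, ttft_ms: float, output_tokens: int) -> float:
    """Estimate wall-clock seconds to receive `output_tokens` tokens."""
    return ttft_ms / 1000.0 + output_tokens / tok_per_s

# (tok/s, TTFT in ms) copied from a few rows of the Speed table above.
SPEED_ROWS = {
    "Grok 4.20 (Non-Reasoning)": (243.3, 1999),
    "GPT-5.4 Mini": (236.4, 233),
    "Qwen 3.6 Plus": (158.0, 11520),
    "Claude Opus 4.7": (116.4, 852),
}

def rank_by_latency(output_tokens: int) -> list[str]:
    """Return model names ordered from fastest to slowest estimated total time."""
    return sorted(
        SPEED_ROWS,
        key=lambda name: estimated_latency_s(*SPEED_ROWS[name], output_tokens),
    )

if __name__ == "__main__":
    for n in (100, 2000):
        print(f"{n} output tokens:")
        for name in rank_by_latency(n):
            seconds = estimated_latency_s(*SPEED_ROWS[name], n)
            print(f"  {name}: {seconds:.2f}s")
```

Under this simplified model, GPT-5.4 Mini comes out ahead of the top-ranked Grok 4.20 (Non-Reasoning) for both short and fairly long responses, because its much lower TTFT outweighs the small tok/s gap. Real latency also depends on factors the table does not show, such as hidden reasoning tokens and network conditions.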

DGX Spark Bench

Local

View

Apr 9 · 7d ago

Rank	Model	Size	Pass %	tok/s	TTFT
1	Qwen 3.5 27B	27B	76.3%	11.1	361ms
2	GPT-OSS 120B	120B	74.0%	41.9	498ms
3	Mistral Small 4	23.6B	69.0%	4.7	2910ms
4	Gemma 4 31B	31B	64.0%	16.5	10153ms

Coming Soon

Overall

Rank	Model	Score
1	GPT-5.4	95.5
2	GPT-5.4 Mini	94.8
3	GPT-5.4 Nano	92.9
4	GPT-4.1	91.8
5	Qwen 3.5 35B-A3B	91.7
6	Claude Sonnet 4.5	90.7
7	Qwen 3.5 122B-A10B	90.0
8	o3-mini	89.6
9	Qwen 3.5 27B	89.5
10	Gemini 2.5 Pro	88.9

Coming Soon

Generation

Rank	Model	Score
1	GPT-5.4	97.0
2	GPT-5.4 Mini	94.4
3	Qwen 3.5 35B-A3B	93.5
4	Qwen 3.5 122B-A10B	92.5
5	GPT-4.1	92.4
6	Qwen 3.5 27B	92.2
7	Qwen 3.5 Flash (02-23)	90.8
8	Claude Sonnet 4.5	90.4
9	GPT-5.4 Nano	90.1
10	Gemini 2.5 Pro	89.3