The World's #1 Vibe Coding Benchmark
See how leading AI coding models stack up across UI generation, algorithms, debugging, refactoring, reasoning, security, and speed. Each card provides a snapshot of the top performers in that category.
Benchmarks
UI
Apr 16 · 0h ago
| Rank | Model | Score |
|---|---|---|
| 1 | Claude Sonnet 4.6 | 81.5 |
| 2 | Claude Opus 4.6 | 81.1 |
| 3 | Grok 4.20 Reasoning | 79.3 |
| 4 | Claude Opus 4.7 | 78.4 |
| 5 | Gemma 4 31B | 78.0 |
| 6 | GLM 5.1 | 76.5 |
| 7 | GPT-5.4 Mini | 72.6 |
| 8 | GLM 5V Turbo | 72.6 |
| 9 | GPT-5.4 | 65.7 |
| 10 | GPT-5.4 Nano | 65.3 |
Security
Apr 16 · 0h ago
| Rank | Model | Score |
|---|---|---|
| 1 | Claude Sonnet 4.6 | 85.3 |
| 2 | Gemini 3.1 Pro | 85.2 |
| 3 | GPT-5.4 | 84.4 |
| 4 | GPT-5.4 Mini | 83.3 |
| 5 | Qwen 3.6 Plus | 82.4 |
| 6 | Claude Opus 4.6 | 81.6 |
| 7 | GLM 5.1 | 80.8 |
| 8 | Qwen3.5 Plus 2026-02-15 | 80.3 |
| 9 | GPT-5.4 Nano | 80.0 |
| 10 | Grok 4.20 Reasoning | 78.9 |
Refactoring
Apr 16 · 0h ago
| Rank | Model | Score | Intent |
|---|---|---|---|
| 1 | Claude Opus 4.7 | 75.2 | 85.3% |
| 2 | Qwen 3.6 Plus | 74.8 | 85.7% |
| 3 | Gemini 3.1 Pro | 70.0 | 82.2% |
| 4 | Claude Opus 4.6 | 69.5 | 82.0% |
| 5 | Claude Sonnet 4.6 | 69.4 | 82.4% |
| 6 | Grok 4.20 Reasoning | 68.1 | 81.9% |
| 7 | Grok 4.20 (Non-Reasoning) | 67.6 | 80.7% |
| 8 | GPT-5.4 | 63.4 | 78.0% |
| 9 | GPT-5.4 Mini | 62.3 | 76.5% |
| 10 | GLM 5V Turbo | 61.0 | 78.1% |
Hallucination
Apr 16 · 0h ago
| Rank | Model | Score | Fab % |
|---|---|---|---|
| 1 | Gemini 3.1 Pro | 79.1 | 26.7% |
| 2 | Qwen 3.6 Plus | 79.1 | 27.0% |
| 3 | Qwen3.5 Plus 2026-02-15 | 77.3 | 29.0% |
| 4 | Claude Opus 4.7 | 77.1 | 27.5% |
| 5 | Claude Opus 4.5 | 76.9 | 27.9% |
| 6 | Claude Opus 4.6 (April 14) | 76.9 | 29.1% |
| 7 | Claude Sonnet 4.6 | 76.6 | 28.9% |
| 8 | Grok 4.20 (Non-Reasoning) | 76.1 | 29.7% |
| 9 | Grok 4.20 Reasoning | 76.0 | 29.7% |
| 10 | Gemini 3 Pro | 75.9 | 30.0% |
Reasoning
Apr 16 · 0h ago
| Rank | Model | Score | Accuracy |
|---|---|---|---|
| 1 | Grok 4.20 Reasoning | 41.8 | 10.0% |
| 2 | GPT-5.4 | 40.6 | 10.0% |
| 3 | Claude Opus 4.7 | 40.3 | 6.7% |
| 4 | Grok 4.20 (Non-Reasoning) | 40.0 | 6.7% |
| 5 | Claude Opus 4.6 | 39.6 | 10.0% |
| 6 | MiniMax M2.5 | 38.1 | 10.0% |
| 7 | Qwen 3.6 Plus | 38.0 | 6.7% |
| 8 | Kimi K2.5 | 37.8 | 6.7% |
| 9 | MiniMax M2.7 | 37.5 | 6.7% |
| 10 | Claude Sonnet 4.6 | 37.2 | 3.3% |
30 tasks · hard benchmark · grounded reasoning over mixed artifacts
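The Accuracy column above takes only three values (10.0%, 6.7%, 3.3%), which is what a 30-task benchmark would produce if Accuracy were simply the share of tasks fully solved. The page does not define the metric, so the sketch below rests on that assumption, and `accuracy_percent` is a hypothetical helper, not part of the benchmark.

```python
def accuracy_percent(solved: int, total: int = 30) -> float:
    """Share of tasks solved, as a percentage rounded to one decimal.

    Assumption: Accuracy = solved / total. The source page does not
    state how the column is computed.
    """
    return round(100 * solved / total, 1)


# Map whole-task counts to the percentages seen in the table.
for solved in (1, 2, 3):
    print(f"{solved}/30 tasks -> {accuracy_percent(solved)}%")
```

Under this reading, the top Accuracy of 10.0% would correspond to 3 of 30 tasks, 6.7% to 2, and 3.3% to 1, so a single task separates adjacent tiers and the Score column carries most of the ranking signal.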
Debugging
Apr 16 · 0h ago
| Rank | Model | Score | Diagnose |
|---|---|---|---|
| 1 | Claude Opus 4.6 | 87.0 | 25.0% |
| 2 | Claude Sonnet 4.6 | 86.6 | 22.9% |
| 3 | Grok 4.20 (Non-Reasoning) | 86.3 | 19.3% |
| 4 | Claude Opus 4.7 | 86.2 | 21.5% |
| 5 | Gemini 3.1 Pro | 85.9 | 14.0% |
| 6 | MiMo-V2-Pro | 85.8 | 17.0% |
| 7 | GPT-5.4 | 85.6 | 21.5% |
| 8 | o4-mini | 85.6 | 21.0% |
| 9 | Grok 4.20 Reasoning | 85.3 | 11.5% |
| 10 | Qwen 3.6 Plus | 85.1 | 26.5% |
Speed
Apr 16 · 0h ago
| Rank | Model | tok/s | TTFT |
|---|---|---|---|
| 1 | Grok 4.20 (Non-Reasoning) | 243.3 | 1999ms |
| 2 | Grok 4.20 Reasoning | 237.7 | 1497ms |
| 3 | GPT-5.4 Mini | 236.4 | 233ms |
| 4 | GPT-5.4 Nano | 227.8 | 941ms |
| 5 | GLM 5V Turbo | 221.2 | 5444ms |
| 6 | Qwen 3.6 Plus | 158 | 11520ms |
| 7 | Gemini 3.1 Pro | 122.2 | 7608ms |
| 8 | Claude Opus 4.7 | 116.4 | 852ms |
| 9 | Claude Sonnet 4.6 | 95.3 | 1207ms |
| 10 | Qwen3.5 Plus 2026-02-15 | 94.6 | 14952ms |
DGX Spark Bench
Local
Apr 9 · 7d ago
| Rank | Model | Size | Pass % | tok/s | TTFT |
|---|---|---|---|---|---|
| 1 | Qwen 3.5 27B | 27B | 76.3% | 11.1 | 361ms |
| 2 | GPT-OSS 120B | 120B | 74.0% | 41.9 | 498ms |
| 3 | Mistral Small 4 | 23.6B | 69.0% | 4.7 | 2910ms |
| 4 | Gemma 4 31B | 31B | 64.0% | 16.5 | 10153ms |
Coming Soon
Overall
| Rank | Model | Score |
|---|---|---|
| 1 | GPT-5.4 | 95.5 |
| 2 | GPT-5.4 Mini | 94.8 |
| 3 | GPT-5.4 Nano | 92.9 |
| 4 | GPT-4.1 | 91.8 |
| 5 | Qwen 3.5 35B-A3B | 91.7 |
| 6 | Claude Sonnet 4.5 | 90.7 |
| 7 | Qwen 3.5 122B-A10B | 90.0 |
| 8 | o3-mini | 89.6 |
| 9 | Qwen 3.5 27B | 89.5 |
| 10 | Gemini 2.5 Pro | 88.9 |
Coming Soon
Generation
| Rank | Model | Score |
|---|---|---|
| 1 | GPT-5.4 | 97.0 |
| 2 | GPT-5.4 Mini | 94.4 |
| 3 | Qwen 3.5 35B-A3B | 93.5 |
| 4 | Qwen 3.5 122B-A10B | 92.5 |
| 5 | GPT-4.1 | 92.4 |
| 6 | Qwen 3.5 27B | 92.2 |
| 7 | Qwen 3.5 Flash (02-23) | 90.8 |
| 8 | Claude Sonnet 4.5 | 90.4 |
| 9 | GPT-5.4 Nano | 90.1 |
| 10 | Gemini 2.5 Pro | 89.3 |