BridgeBench
BS Benchmark (v2)

Do AI models push back on nonsensical premises, or do they confidently invent answers? 100 tasks across 5 domains — finance, legal, medical, physics, software — each seeded with made-up jargon or a reversed relationship. An LLM judge rates each response as pushback, partial, or accepted.

Updated 2026-04-20

Rank  Model                ID                                     Score
   1  Claude Opus 4.6      anthropic/claude-opus-4-6               95.0
   2  Claude Sonnet 4.6    anthropic/claude-sonnet-4-6             91.5
   3  GPT-5.4              openai/gpt-5.4                          91.5
   4  Grok 4.20            x-ai/grok-4.20-beta                     82.5
   5  GPT-5.4 Mini         openai/gpt-5.4-mini                     78.5
   6  Claude Opus 4.7      openrouter/anthropic/claude-opus-4.7    75.5
   7  Grok 4.20 Reasoning  x-ai/grok-4.20-reasoning                74.0
   8  Kimi K2.6            openrouter/moonshotai/kimi-k2.6         69.5
   9  Gemini 3.1 Pro       google/gemini-3.1-pro-preview           66.5
  10  Kimi K2.5            openrouter/moonshotai/kimi-k2.5         65.5
  11  GLM 5V Turbo         openrouter/z-ai/glm-5v-turbo            65.5
  12  MiniMax M2.7         minimax/MiniMax-M2.7                    47.0
  13  GLM 5.1              z-ai/glm-5.1                            36.5

13 models tested · 100 tasks · 5 domains · Score = pushback + 0.5 × partial
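The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual harness: the function name and the exact label strings are assumptions, mirroring the judge's three verdicts. With 100 tasks, a pushback counts 1 point and a partial counts 0.5, so the maximum score is 100.

```python
from collections import Counter

def bs_score(labels):
    """Compute a BS Benchmark-style score from per-task judge labels.

    Each label is one of "pushback", "partial", or "accepted".
    Score = (# pushback) + 0.5 * (# partial); "accepted" scores 0.
    """
    counts = Counter(labels)
    return counts["pushback"] + 0.5 * counts["partial"]

# Hypothetical run over 100 tasks: 90 pushbacks and 10 partials
labels = ["pushback"] * 90 + ["partial"] * 10
print(bs_score(labels))  # → 95.0
```

Because a fully credulous model ("accepted" on every task) scores 0 and a fully skeptical one scores 100, the score reads directly as points out of 100 tasks.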