GPT-5.5 Performs Worse Than GPT-5.4 on BridgeBench
Across eight BridgeBench v2 categories, GPT-5.5 regresses on debugging, BS-Bench, cost, and reasoning while charging roughly 8× more per task. Here's the full breakdown, with the data and the methodology.
TL;DR
We ran GPT-5.5 through the full BridgeBench v2 suite on 2026-04-25 and compared the results to GPT-5.4 head-to-head on identical tasks. GPT-5.5 is not a generational step up from GPT-5.4 — on the categories that matter most for real-world coding, it regresses.
- Debugging: 85.6 → 77.5 (−8.1). The largest single-category regression we've measured between adjacent OpenAI releases.
- BS-Bench (push-back on bad asks): 91.5 → 88.0 (−3.5). Acceptance rate of clearly-flawed requests jumped from 4% to 7%.
- Cost-Bench: 84.7 → 83.4 (−1.3) — but the headline is the price tag. GPT-5.5 spent $4.12 vs. $0.52 for GPT-5.4 across the same 40 tasks: roughly 8× more expensive for a slightly worse outcome.
- Reasoning: 40.6 → 39.7 (−0.9). Accuracy on multi-step reasoning fell from 10% to 6.7%.
- Where GPT-5.5 *does* gain: UI Bench (+8.2), Refactoring (+2.5), Security (+0.9), Hallucination (+0.8).
The cohort-wide average across the eight comparable categories: GPT-5.4 = 73.59, GPT-5.5 = 73.41. A statistical wash on aggregate — and an 8× cost increase to get there.
The Headline Table
All scores are 0–100 and higher is better. Δ is GPT-5.5 minus GPT-5.4, so a negative Δ means GPT-5.5 regressed.
| Benchmark | GPT-5.4 | GPT-5.5 | Δ | Result |
|---|---|---|---|---|
| Debugging | 85.6 | 77.5 | −8.1 | Regression |
| BS-Bench | 91.5 | 88.0 | −3.5 | Regression |
| Cost-Bench | 84.7 | 83.4 | −1.3 | Regression |
| Reasoning | 40.6 | 39.7 | −0.9 | Regression |
| Hallucination | 72.8 | 73.6 | +0.8 | Improvement |
| Security | 84.4 | 85.3 | +0.9 | Improvement |
| Refactoring | 63.4 | 65.9 | +2.5 | Improvement |
| UI Bench | 65.7 | 73.9 | +8.2 | Improvement |
| Mean | 73.59 | 73.41 | −0.18 | Wash |
Source: debugging-snapshot.json, bs-bench-snapshot.json, cost-bench-snapshot.json, reasoning-snapshot.json, hallucination-snapshot.json, security-snapshot.json, refactoring-snapshot.json, ui-bench-snapshot.json in the published BridgeBench v2 dataset.
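If you want to sanity-check the aggregate yourself, here is a minimal Python sketch that recomputes the per-category deltas and the eight-category mean from the table above (values typed in from the table, not read from the snapshot JSON):

```python
# Headline-table scores (0-100), copied from the table above.
scores = {
    "Debugging":     (85.6, 77.5),
    "BS-Bench":      (91.5, 88.0),
    "Cost-Bench":    (84.7, 83.4),
    "Reasoning":     (40.6, 39.7),
    "Hallucination": (72.8, 73.6),
    "Security":      (84.4, 85.3),
    "Refactoring":   (63.4, 65.9),
    "UI Bench":      (65.7, 73.9),
}

for name, (gpt54, gpt55) in scores.items():
    print(f"{name:<14} Δ = {gpt55 - gpt54:+.1f}")

mean_54 = sum(v[0] for v in scores.values()) / len(scores)
mean_55 = sum(v[1] for v in scores.values()) / len(scores)
print(f"Mean: GPT-5.4 = {mean_54:.2f}, GPT-5.5 = {mean_55:.2f}")  # 73.59 vs 73.41
```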
Where GPT-5.5 Loses
Debugging: −8.1 points
This is the regression we found most surprising. Debugging is one of the highest-signal categories in BridgeBench because it directly maps to what builders ask AI for in agentic coding: *"the test is failing, find and fix the bug."*
| Metric | GPT-5.4 | GPT-5.5 |
|---|---|---|
| Average score | 85.6 | 77.5 |
| Visible fix rate | 100% | 90% |
| Hidden bug pass rate | 100% | 90% |
| Regression pass rate | 100% | 90% |
| Diagnosis accuracy | 21.5% | 22.0% |
| Task success rate | 100% | 80% |
GPT-5.4 produced a passing fix on every one of the 10 debugging tasks. GPT-5.5 dropped two of them outright (success rate 80%) and slipped on hidden-bug detection and regression prevention by 10 points each. Diagnosis accuracy ticks up by half a point, but the fixes themselves regress — the model is talking a slightly better game and shipping a meaningfully worse fix.
For agentic coding workflows that run dozens of debug-fix loops in a single session, an 80% per-task success rate compounds badly. See the full breakdown on the Debugging benchmark page.
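To put a rough number on that compounding, here's a back-of-the-envelope sketch. It assumes each debug-fix loop succeeds independently at the measured per-task rate, which is our simplification, not something BridgeBench models:

```python
# Probability that every loop in a session succeeds, assuming
# independent loops at the measured per-task success rates.
for loops in (1, 5, 10, 20):
    p54 = 1.00 ** loops   # GPT-5.4: 100% per-task success in this pack
    p55 = 0.80 ** loops   # GPT-5.5: 80% per-task success
    print(f"{loops:>2} loops: GPT-5.4 {p54:.0%}  vs  GPT-5.5 {p55:.1%}")
# At 20 loops, GPT-5.5 finishes a clean session only ~1.2% of the time.
```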
Cost-Bench: same outcome, 8× the bill
Cost-Bench measures whether a model can solve a fixed task pack (40 tasks) and tracks total spend through the OpenAI API. The score itself moves only slightly, but the price moves a lot:
| Metric | GPT-5.4 | GPT-5.5 |
|---|---|---|
| Average score | 84.7 | 83.4 |
| Strict success rate | 57.5% | 50.0% |
| Qualified success rate | 80.0% | 72.5% |
| Total cost (40 tasks) | $0.524 | $4.123 |
| Cost per strict success | $0.0228 | $0.2062 |
| Score per dollar | 6,466 | 809 |
GPT-5.5 spends about 8× more to deliver about 9× worse cost-per-success. Score-per-dollar — BridgeBench's headline efficiency metric — collapses from 6,466 to 809, an 8× degradation. For builders running large agentic pipelines, this is the number that translates directly into burn rate.
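The efficiency numbers are straightforward arithmetic over the 40-task pack. The sketch below reproduces them from the table values, treating score-per-dollar as total score points across the pack divided by total spend, which matches the published 6,466 and 809:

```python
TASKS = 40

def cost_metrics(avg_score, strict_rate, total_cost):
    strict_successes = TASKS * strict_rate
    return (total_cost / strict_successes,       # cost per strict success
            (avg_score * TASKS) / total_cost)    # score points per dollar

for name, args in [("GPT-5.4", (84.7, 0.575, 0.524)),
                   ("GPT-5.5", (83.4, 0.500, 4.123))]:
    per_success, per_dollar = cost_metrics(*args)
    print(f"{name}: ${per_success:.4f} per strict success, {per_dollar:,.0f} score/$")
# Reproduces the table: ~$0.0228 and ~6,466 for GPT-5.4; ~$0.206 and ~809 for GPT-5.5.
```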
Full results on the Cost-Bench page.
BS-Bench: a bit more pushover
BS-Bench tests whether a model pushes back on clearly bogus or impossible requests. Push-back is good; quietly playing along with nonsense is bad.
| Metric | GPT-5.4 | GPT-5.5 |
|---|---|---|
| Average score | 91.5 | 88.0 |
| Push-back rate | 87% | 83% |
| Partial accept rate | 9% | 10% |
| Full accept rate (lower is better) | 4% | 7% |
GPT-5.5 accepts almost twice as many clearly-flawed asks at face value. Not a catastrophic shift, but the direction is the wrong one for an agent expected to act on instructions autonomously.
Reasoning: still saturated, slightly worse
Reasoning is the hardest BridgeBench category by design — every model in the cohort scores under 50. Within that ceiling:
| Metric | GPT-5.4 | GPT-5.5 |
|---|---|---|
| Average score | 40.6 | 39.7 |
| Accuracy rate | 10.0% | 6.7% |
| Evidence grounding | 90.6 | 91.3 |
Final-answer accuracy drops by a third (10% → 6.7%). Evidence grounding (whether the model cites the right source even when it gets the answer wrong) ticks up slightly, suggesting GPT-5.5 looks at the right places but lands on the wrong conclusion more often.
Where GPT-5.5 Wins
The wins are real, just narrower than the regressions.
UI Bench: +8.2 points
The clearest improvement. UI Bench tests 12 frontend generation tasks (animations, games, widgets) judged on completeness, visual quality, and interactivity.
| Metric | GPT-5.4 | GPT-5.5 |
|---|---|---|
| Average score | 65.7 | 73.9 |
| Completeness | 83.3 | 91.7 |
| Visual quality | 55.8 | 61.4 |
| Interactivity | 50.0 | 63.3 |
Every sub-metric moves up. If your primary use case is shipping HTML/CSS/JS components from a prompt, GPT-5.5 is meaningfully better. See the full board on the UI Bench page.
Refactoring: +2.5, with caveats
Refactoring shows a small headline gain, but the underlying mechanics are mixed:
| Metric | GPT-5.4 | GPT-5.5 |
|---|---|---|
| Average score | 63.4 | 65.9 |
| Visible behavior pass | 100% | 93.3% |
| Hidden regression pass | 99.4% | 92.7% |
| Intent compliance | 78.0 | 76.6 |
| Task success rate | 93.3% | 86.7% |
GPT-5.5's higher headline score is mostly a scoring-weight artifact: it's worse at preserving behavior (visible and hidden), worse at task completion, and slightly worse at intent compliance. The aggregate number rises, but the per-metric picture is closer to a regression than a win.
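To illustrate how a weighted aggregate can rise while most of its inputs fall, here's a toy example. The weights and the "quality" component below are hypothetical, invented only to show the mechanism; they are not the actual BridgeBench Refactoring weights:

```python
# Hypothetical weights and a hypothetical "quality" judge component --
# NOT the real BridgeBench weighting, just a demonstration of the mechanism.
def aggregate(visible, hidden, intent, quality, w=(0.15, 0.15, 0.20, 0.50)):
    parts = (visible, hidden, intent, quality)
    return sum(metric * weight for metric, weight in zip(parts, w))

before = aggregate(100.0, 99.4, 78.0, 40.0)  # GPT-5.4 pass rates, made-up quality = 40
after = aggregate(93.3, 92.7, 76.6, 55.0)    # GPT-5.5 pass rates, made-up quality = 55
print(f"{before:.1f} -> {after:.1f}")  # aggregate rises while three of four inputs fall
```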
Security and Hallucination: incremental
Both improve by less than 1 point, well within run-to-run noise on the cohort. We're calling these directional but not material:
| Benchmark | Metric | GPT-5.4 | GPT-5.5 |
|---|---|---|---|
| Security | Hidden pass rate | 80.9% | 81.9% |
| Security | Visible pass rate | 85.6% | 86.7% |
| Hallucination | Accuracy | 67.0% | 69.0% |
| Hallucination | Fabrication rate | 34.9% | 32.3% |
Less hallucination is genuinely good, but a 2.6-point drop in fabrication rate is a much smaller swing than the 8.1-point debugging regression, in both absolute and relative terms.
Methodology
When: GPT-5.5 was evaluated on 2026-04-24 and re-run for the canonical results on 2026-04-25 using BridgeBench v2 task packs. GPT-5.4 baseline numbers come from the same task packs published in the BridgeBench v2 cohort snapshots.
How: Both models were run through the OpenAI API directly (no router intermediaries for the canonical run). For each benchmark category, the same task pack hash and same evaluator scripts were used for both models — the only variable is the model under test.
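For readers who want a feel for what "only the model under test changes" looks like in code, here's a simplified harness sketch. The task-pack format, model IDs, and scoring hook are illustrative stand-ins, not the actual BridgeBench harness:

```python
import hashlib
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def run_pack(model_id: str, pack_path: str, score_fn):
    """Run one model over a fixed task pack; return the pack hash and per-task scores."""
    with open(pack_path) as f:
        pack = json.load(f)
    # Hash the pack so both runs can prove they saw identical tasks.
    pack_hash = hashlib.sha256(json.dumps(pack, sort_keys=True).encode()).hexdigest()
    scores = []
    for task in pack["tasks"]:
        resp = client.chat.completions.create(
            model=model_id,
            messages=[{"role": "user", "content": task["prompt"]}],
        )
        scores.append(score_fn(task, resp.choices[0].message.content))
    return pack_hash, scores

# Same pack, same scorer -- the only variable is the model string.
# ("gpt-5.4" / "gpt-5.5" are placeholders for whatever IDs the API exposes;
#  score_debug_task stands in for the benchmark's evaluator script.)
# hash_a, s54 = run_pack("gpt-5.4", "debugging-pack.json", score_debug_task)
# hash_b, s55 = run_pack("gpt-5.5", "debugging-pack.json", score_debug_task)
# assert hash_a == hash_b
```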
Categories scored: Algorithms, BS-Bench, Cost-Bench, Debugging, Hallucination, Reasoning, Refactoring, Security, UI Bench. We exclude Algorithms from the head-to-head because both GPT-5.4 and GPT-5.5 saturate near 100, leaving no comparison signal.
Run files: *-gpt55-direct-final-20260425-results.json in the BridgeBench results store; mirrored into the per-bench snapshot JSON files that power bridgebench.bridgemind.ai.
Caveats: Several categories use small task packs (Debugging: 10 tasks, UI Bench: 12 tasks, Refactoring: 15 tasks). Single-task swings can move headline scores by several points. We report the underlying pass rates alongside the score so you can judge significance yourself.
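The single-task swing is easy to quantify (a quick calculation on our part, not part of the published methodology): on an n-task pack, flipping one task moves any pass-rate metric by 100/n percentage points.

```python
# Percentage-point swing on a pass-rate metric from flipping a single task.
for bench, n in [("Debugging", 10), ("UI Bench", 12), ("Refactoring", 15)]:
    print(f"{bench}: one task = {100 / n:.1f} pp")
# Debugging: 10.0 pp, UI Bench: 8.3 pp, Refactoring: 6.7 pp
```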
What This Means for Builders
If you are choosing a model for an agentic coding workflow today, the case for switching from GPT-5.4 to GPT-5.5 is weak:
- Pure cost. 8× the spend for the same average quality is hard to justify in any production pipeline.
- Debugging-heavy workflows. If your agent spends most of its tokens chasing bugs and writing fixes, GPT-5.5 is a clear downgrade. Stay on 5.4.
- UI generation. This is the only workload where the upgrade is unambiguous. If you're shipping frontend components from prompts, GPT-5.5 is genuinely better.
- Refactoring. Don't read the +2.5 as an improvement until OpenAI explains why visible-behavior pass rate dropped from 100% to 93.3% on a 15-task pack.
For most builders, the practical recommendation is to stay on GPT-5.4 unless your workload is dominated by UI generation. The 8× cost premium for GPT-5.5 should be reserved for the specific tasks where it actually wins.
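If you want to encode that recommendation in an agent pipeline, a minimal routing sketch might look like this; the category labels and model IDs are assumptions on our part, so adapt them to however your tasks are tagged:

```python
# Route tasks to the cheaper model unless the workload is one where
# GPT-5.5 actually wins (UI generation, per the table above).
UI_CATEGORIES = {"ui", "frontend", "component", "animation"}

def pick_model(task_category: str) -> str:
    # "gpt-5.4" / "gpt-5.5" are placeholder model IDs.
    if task_category.lower() in UI_CATEGORIES:
        return "gpt-5.5"   # the only clear win; worth the premium here
    return "gpt-5.4"       # cheaper and as good or better everywhere else

print(pick_model("frontend"))   # gpt-5.5
print(pick_model("debugging"))  # gpt-5.4
```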
What's Next
We're re-running GPT-5.5 against the full v2 suite weekly to track whether OpenAI silently improves the model post-release (this happened with the GPT-5.4 cohort). We'll also publish a head-to-head against Claude Opus 4.7 and Gemini 3 once those have completed full v2 evaluation.
Track the live numbers on the BridgeBench leaderboard, or dig into a specific category:
- Debugging benchmark
- Cost-Bench
- BS-Bench
- Reasoning benchmark
- UI Bench
- Refactoring benchmark
- Security benchmark
- Hallucination benchmark
Questions about the methodology or want us to bench a specific model? Find us on Discord or follow @bridgebench on X.