BridgeMind Team · 9 min read

GPT-5.5 Performs Worse Than GPT-5.4 on BridgeBench

Across eight BridgeBench v2 categories, GPT-5.5 regresses on debugging, BS-Bench, cost, and reasoning while charging roughly 8x more per task. Here's the full breakdown with the data and the methodology.

Tags: GPT-5.5, OpenAI, benchmarks, BridgeBench v2, model evaluation

TL;DR

We ran GPT-5.5 through the full BridgeBench v2 suite on 2026-04-25 and compared the results to GPT-5.4 head-to-head on identical tasks. GPT-5.5 is not a generational step up from GPT-5.4 — on the categories that matter most for real-world coding, it regresses.

  • Debugging: 85.6 → 77.5 (−8.1). The largest single-category regression we've measured between adjacent OpenAI releases.
  • BS-Bench (push-back on bad asks): 91.5 → 88.0 (−3.5). Acceptance rate of clearly-flawed requests jumped from 4% to 7%.
  • Cost-Bench: 84.7 → 83.4 (−1.3) — but the headline is the price tag. GPT-5.5 spent $4.12 vs. $0.52 for GPT-5.4 across the same 40 tasks: roughly 8× more expensive for a slightly worse outcome.
  • Reasoning: 40.6 → 39.7 (−0.9). Accuracy on multi-step reasoning fell from 10% to 6.7%.
  • Where GPT-5.5 *does* gain: UI Bench (+8.2), Refactoring (+2.5), Security (+0.9), Hallucination (+0.8).

The cohort-wide average across the eight comparable categories: GPT-5.4 = 73.59, GPT-5.5 = 73.41. A statistical wash on aggregate — and an 8× cost increase to get there.
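The aggregate is easy to reproduce from the per-category scores. A minimal sketch, with the scores taken from the headline table below:

```python
# Per-category scores from the headline table (0-100, higher is better).
scores = {
    "Debugging":     (85.6, 77.5),
    "BS-Bench":      (91.5, 88.0),
    "Cost-Bench":    (84.7, 83.4),
    "Reasoning":     (40.6, 39.7),
    "Hallucination": (72.8, 73.6),
    "Security":      (84.4, 85.3),
    "Refactoring":   (63.4, 65.9),
    "UI Bench":      (65.7, 73.9),
}

mean_54 = sum(v[0] for v in scores.values()) / len(scores)  # 73.59
mean_55 = sum(v[1] for v in scores.values()) / len(scores)  # 73.41
print(f"GPT-5.4: {mean_54:.2f}  GPT-5.5: {mean_55:.2f}  delta: {mean_55 - mean_54:+.2f}")
```

An unweighted mean over eight categories, which is how the cohort-wide figure is quoted here.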

The Headline Table

All scores are 0–100. Δ is GPT-5.5 minus GPT-5.4. Higher is better in every column.

| Benchmark | GPT-5.4 | GPT-5.5 | Δ | Result |
|---|---|---|---|---|
| Debugging | 85.6 | 77.5 | −8.1 | Regression |
| BS-Bench | 91.5 | 88.0 | −3.5 | Regression |
| Cost-Bench | 84.7 | 83.4 | −1.3 | Regression |
| Reasoning | 40.6 | 39.7 | −0.9 | Regression |
| Hallucination | 72.8 | 73.6 | +0.8 | Improvement |
| Security | 84.4 | 85.3 | +0.9 | Improvement |
| Refactoring | 63.4 | 65.9 | +2.5 | Improvement |
| UI Bench | 65.7 | 73.9 | +8.2 | Improvement |
| Mean | 73.59 | 73.41 | −0.18 | Wash |

Source: debugging-snapshot.json, bs-bench-snapshot.json, cost-bench-snapshot.json, reasoning-snapshot.json, hallucination-snapshot.json, security-snapshot.json, refactoring-snapshot.json, ui-bench-snapshot.json in the published BridgeBench v2 dataset.

Where GPT-5.5 Loses

Debugging: −8.1 points

This is the regression we found most surprising. Debugging is one of the highest-signal categories in BridgeBench because it directly maps to what builders ask AI for in agentic coding: *"the test is failing, find and fix the bug."*

| Metric | GPT-5.4 | GPT-5.5 |
|---|---|---|
| Average score | 85.6 | 77.5 |
| Visible fix rate | 100% | 90% |
| Hidden bug pass rate | 100% | 90% |
| Regression pass rate | 100% | 90% |
| Diagnosis accuracy | 21.5% | 22.0% |
| Task success rate | 100% | 80% |

GPT-5.4 produced a passing fix on every one of the 10 debugging tasks. GPT-5.5 dropped two of them outright (success rate 80%) and slipped on hidden-bug detection and regression prevention by 10 points each. Diagnosis accuracy ticks up by half a point, but the fixes themselves regress — the model is talking a slightly better game and shipping a meaningfully worse fix.

For agentic coding workflows that run dozens of debug-fix loops in a single session, an 80% per-task success rate compounds badly. See the full breakdown on the Debugging benchmark page.
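The compounding effect is easy to quantify: if each debug-fix task succeeds independently, the chance that an n-loop session completes with no failed task decays geometrically. A rough sketch (the 80% and 100% per-task rates come from the table above; independence between loops is an assumption):

```python
def chain_success(p_task: float, n_steps: int) -> float:
    """Probability that an n-step chain completes with every task
    succeeding, assuming an independent per-task success rate."""
    return p_task ** n_steps

# At GPT-5.5's measured 80% per-task rate, a 20-loop session
# almost never completes cleanly; at 100% it always does.
print(f"{chain_success(0.80, 20):.3f}")  # ~0.012
print(f"{chain_success(1.00, 20):.3f}")  # 1.000
```

Even a modest per-task regression becomes a near-certain session failure at agentic loop counts.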

Cost-Bench: same outcome, 8× the bill

Cost-Bench measures whether a model can solve a fixed task pack (40 tasks) and tracks total spend through the OpenAI API. The score itself moves only slightly, but the price moves a lot:

| Metric | GPT-5.4 | GPT-5.5 |
|---|---|---|
| Average score | 84.7 | 83.4 |
| Strict success rate | 57.5% | 50.0% |
| Qualified success rate | 80.0% | 72.5% |
| Total cost (40 tasks) | $0.524 | $4.123 |
| Cost per strict success | $0.0228 | $0.2062 |
| Score per dollar | 6,466 | 809 |

GPT-5.5 spends about 8× more to deliver about 9× worse cost-per-success. Score-per-dollar — BridgeBench's headline efficiency metric — collapses from 6,466 to 809, an 8× degradation. For builders running large agentic pipelines, this is the number that translates directly into burn rate.
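The efficiency figures in the table follow from the raw numbers. The definitions below are our reading of the metrics (cost per strict success as total spend over strict passes, and score per dollar as score over average per-task cost), chosen because they reproduce the published values:

```python
def cost_metrics(score, total_cost, n_tasks, strict_successes):
    """Derive the Cost-Bench efficiency figures.
    Assumed definitions (they reproduce the published numbers):
      - cost per strict success = total spend / strict passes
      - score per dollar        = score / (total spend / task count)
    """
    return {
        "cost_per_strict_success": total_cost / strict_successes,
        "score_per_dollar": score / (total_cost / n_tasks),
    }

gpt54 = cost_metrics(84.7, 0.524, 40, 23)  # 57.5% strict = 23/40 tasks
gpt55 = cost_metrics(83.4, 4.123, 40, 20)  # 50.0% strict = 20/40 tasks
print(gpt54)  # ~$0.0228 per strict success, ~6,466 score/dollar
print(gpt55)  # ~$0.2062 per strict success, ~809 score/dollar
```

Dividing the two score-per-dollar figures gives the 8× efficiency collapse quoted above.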

Full results on the Cost-Bench page.

BS-Bench: a bit more pushover

BS-Bench tests whether a model pushes back on clearly bogus or impossible requests. Push-back is good; quietly playing along with nonsense is bad.

| Metric | GPT-5.4 | GPT-5.5 |
|---|---|---|
| Average score | 91.5 | 88.0 |
| Push-back rate | 87% | 83% |
| Partial accept rate | 9% | 10% |
| Full accept rate (worse) | 4% | 7% |

GPT-5.5 accepts almost twice as many clearly-flawed asks at face value. Not a catastrophic shift, but the direction is the wrong one for an agent expected to act on instructions autonomously.

Reasoning: still saturated, slightly worse

Reasoning is the hardest BridgeBench category by design — every model in the cohort scores under 50. Within that ceiling:

| Metric | GPT-5.4 | GPT-5.5 |
|---|---|---|
| Average score | 40.6 | 39.7 |
| Accuracy rate | 10.0% | 6.7% |
| Evidence grounding | 90.6 | 91.3 |

Final-answer accuracy drops by a third (10% → 6.7%). Evidence grounding (whether the model cites the right source even when it gets the answer wrong) ticks up slightly, suggesting GPT-5.5 looks at the right places but lands on the wrong conclusion more often.

Where GPT-5.5 Wins

The wins are real, just narrower than the regressions.

UI Bench: +8.2 points

The clearest improvement. UI Bench tests 12 frontend generation tasks (animations, games, widgets) judged on completeness, visual quality, and interactivity.

| Metric | GPT-5.4 | GPT-5.5 |
|---|---|---|
| Average score | 65.7 | 73.9 |
| Completeness | 83.3 | 91.7 |
| Visual quality | 55.8 | 61.4 |
| Interactivity | 50.0 | 63.3 |

Every sub-metric moves up. If your primary use case is shipping HTML/CSS/JS components from a prompt, GPT-5.5 is meaningfully better. See the full board on the UI Bench page.

Refactoring: +2.5, with caveats

Refactoring shows a small headline gain, but the underlying mechanics are mixed:

| Metric | GPT-5.4 | GPT-5.5 |
|---|---|---|
| Average score | 63.4 | 65.9 |
| Visible behavior pass | 100% | 93.3% |
| Hidden regression pass | 99.4% | 92.7% |
| Intent compliance | 78.0 | 76.6 |
| Task success rate | 93.3% | 86.7% |

GPT-5.5's higher headline score is mostly a scoring-weight artifact: it's worse at preserving behavior (visible and hidden), worse at task completion, and slightly worse at intent compliance. The aggregate number rises, but the per-metric picture is closer to a regression than a win.
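How can an aggregate rise while most sub-metrics fall? If the scoring formula weights one improving dimension heavily, that gain can outweigh broad regressions elsewhere. The weights and the third metric below are purely hypothetical, for illustration only (BridgeBench's actual Refactoring formula is not restated here); the point is the mechanism:

```python
# Hypothetical weights and metrics -- NOT BridgeBench's actual formula.
weights = {"behavior": 0.2, "intent": 0.2, "readability": 0.6}

old = {"behavior": 100.0, "intent": 78.0, "readability": 45.0}
new = {"behavior": 93.3,  "intent": 76.6, "readability": 60.0}

def aggregate(metrics):
    """Weighted average under the hypothetical weights above."""
    return sum(weights[k] * metrics[k] for k in weights)

# The aggregate rises even though behavior and intent both fall,
# because the heavily weighted dimension improves.
print(aggregate(old), aggregate(new))
```

This is why the per-metric table above is the better signal than the headline score.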

Security and Hallucination: incremental

Both improve by less than 1 point, well within run-to-run noise on the cohort. We're calling these directional but not material:

| Benchmark | Metric | GPT-5.4 | GPT-5.5 |
|---|---|---|---|
| Security | Hidden pass rate | 80.9% | 81.9% |
| Security | Visible pass rate | 85.6% | 86.7% |
| Hallucination | Accuracy | 67.0% | 69.0% |
| Hallucination | Fabrication rate | 34.9% | 32.3% |

Less hallucination is genuinely good — but a 2.6-point fabrication-rate drop is a smaller delta than the 8.1-point debugging regression by every meaningful comparison.

Methodology

When: GPT-5.5 was evaluated on 2026-04-24 and re-run for the canonical results on 2026-04-25 using BridgeBench v2 task packs. GPT-5.4 baseline numbers come from the same task packs published in the BridgeBench v2 cohort snapshots.

How: Both models were run through the OpenAI API directly (no router intermediaries for the canonical run). For each benchmark category, the same task pack hash and same evaluator scripts were used for both models — the only variable is the model under test.
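Concretely, a controlled head-to-head amounts to pinning the task pack (by hash) and the evaluator, and letting only the model vary. A minimal sketch of that harness shape (function and field names are illustrative, not BridgeBench's actual code):

```python
import hashlib
import json

def run_head_to_head(task_pack, models, evaluate):
    """Score each model on an identical task pack.
    `models` maps a model name to a completion function;
    `evaluate` scores one (task, output) pair 0-100.
    The pack hash is recorded so runs are provably comparable."""
    pack_hash = hashlib.sha256(
        json.dumps(task_pack, sort_keys=True).encode()
    ).hexdigest()
    results = {}
    for name, complete in models.items():
        scores = [evaluate(t, complete(t["prompt"])) for t in task_pack]
        results[name] = {
            "pack_hash": pack_hash,  # identical for every model under test
            "mean_score": sum(scores) / len(scores),
        }
    return results
```

With the pack hash and evaluator fixed, any score delta is attributable to the model rather than to the tasks or the grader.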

Categories scored: Algorithms, BS-Bench, Cost-Bench, Debugging, Hallucination, Reasoning, Refactoring, Security, UI Bench. We exclude Algorithms from the head-to-head because both GPT-5.4 and GPT-5.5 saturate near 100, leaving no comparison signal.

Run files: *-gpt55-direct-final-20260425-results.json in the BridgeBench results store; mirrored into the per-bench snapshot JSON files that power bridgebench.bridgemind.ai.

Caveats: Several categories use small task packs (Debugging: 10 tasks, UI Bench: 12 tasks, Refactoring: 15 tasks). Single-task swings can move headline scores by several points. We report the underlying pass rates alongside the score so you can judge significance yourself.
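The sensitivity is mechanical: on an n-task pack, each task carries 100/n percentage points of pass rate, so a single flipped task moves a category materially. A quick illustration using the pack sizes above:

```python
def pass_rate_swing(n_tasks: int) -> float:
    """Percentage-point move in a pass rate caused by one flipped task."""
    return 100.0 / n_tasks

for name, n in [("Debugging", 10), ("UI Bench", 12), ("Refactoring", 15)]:
    print(f"{name}: one task = {pass_rate_swing(n):.1f} points of pass rate")
```

On the 10-task debugging pack, the two dropped tasks alone account for the 20-point fall in task success rate.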

What This Means for Builders

If you are choosing a model for an agentic coding workflow today, the case for switching from GPT-5.4 to GPT-5.5 is weak:

  • Pure cost. 8× the spend for the same average quality is hard to justify in any production pipeline.
  • Debugging-heavy workflows. If your agent spends most of its tokens chasing bugs and writing fixes, GPT-5.5 is a clear downgrade. Stay on 5.4.
  • UI generation. This is the only workload where the upgrade is unambiguous. If you're shipping frontend components from prompts, GPT-5.5 is genuinely better.
  • Refactoring. Don't read the +2.5 as an improvement until OpenAI explains why visible-behavior pass rate dropped from 100% to 93.3% on a 15-task pack.

For most builders, the practical recommendation is to stay on GPT-5.4 unless your workload is dominated by UI generation. The 8× cost premium for GPT-5.5 should be reserved for the specific tasks where it actually wins.

What's Next

We're re-running GPT-5.5 against the full v2 suite weekly to track whether OpenAI silently improves the model post-release (this happened with the GPT-5.4 cohort). We'll also publish a head-to-head against Claude Opus 4.7 and Gemini 3 once those have completed full v2 evaluation.

Track the live numbers on the BridgeBench leaderboard, or dig into a specific category via the benchmark pages linked above.

Questions about the methodology or want us to bench a specific model? Find us on Discord or follow @bridgebench on X.