The Context Lab Research

Claude vs. Codex on SWE-Bench Pro

The industry has crowned Claude the best coding agent. We ran both agents on 731 real-world software engineering tasks under identical conditions. The data tells a different story.

731 real-world tasks · 11 repos, 4 languages
89% agreement rate · Cohen's κ = 0.72
2.1x cost difference · same resolution, half the price
01 — The Experiment

Identical Conditions. No Excuses.

SWE-Bench Pro: 731 real GitHub issues pulled from 11 open-source repositories in Python, Go, JavaScript, and TypeScript. Each agent reads the issue, explores the codebase, writes a fix, and runs tests. Fully autonomous — no human help, no retrieval augmentation, no documentation access.

CLAUDE OPUS 4.6
  • Claude Code CLI
  • Multi-model: Haiku for exploration, Opus for reasoning
  • Specialized tools (Read, Grep, Edit, Glob)
OPENAI GPT-5.4
  • Codex CLI
  • Single model for everything
  • Bash + explicit chain-of-thought reasoning blocks

Why comparison-native metrics? Absolute resolution rates aren't comparable across published reports — they depend on task sample, harness version, and scoring methodology. We use concordance, relative risk, and conditional win rates: metrics that describe how two agents performed relative to each other on identical tasks.

02 — The Verdict

No Statistical Advantage. Period.

On 652 of 731 tasks, both agents reach the same outcome. On the 79 where they disagree, neither has a statistically significant edge.

89% agreement
  • Both solved: 490
  • Both failed: 161
  • Claude only: 37
  • Codex only: 44
0.986 relative risk · effectively 1:1
45/55 conditional win rate · a coin flip (p > 0.05)
If you can only choose one agent, and the task happens to be one where they disagree — it's a coin flip which one succeeds.
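As a sketch, the headline statistics can be recomputed from the four cell counts above. One assumption: the report does not name its significance test, so McNemar's test on the discordant pairs (a standard choice for paired binary outcomes) stands in here.

```python
from math import erfc, sqrt

# 2x2 outcome table from the breakdown above
both_solved, both_failed = 490, 161
claude_only, codex_only = 37, 44
n = both_solved + both_failed + claude_only + codex_only

# Observed agreement and Cohen's kappa (chance-corrected agreement)
p_o = (both_solved + both_failed) / n
p_claude = (both_solved + claude_only) / n  # Claude resolution rate
p_codex = (both_solved + codex_only) / n    # Codex resolution rate
p_e = p_claude * p_codex + (1 - p_claude) * (1 - p_codex)
kappa = (p_o - p_e) / (1 - p_e)

# Relative risk of resolution, and conditional win rate on disagreements
rr = p_claude / p_codex
win_rate = claude_only / (claude_only + codex_only)

# McNemar's test on the discordant pairs (chi-square, 1 df)
chi2 = (claude_only - codex_only) ** 2 / (claude_only + codex_only)
p_value = erfc(sqrt(chi2 / 2))

print(f"agreement={p_o:.1%}  kappa={kappa:.2f}  RR={rr:.3f}  "
      f"win rate={win_rate:.0%}  p={p_value:.2f}")
```

The recomputed values match the published figures up to rounding, and the McNemar p-value lands well above 0.05, consistent with "no significant edge".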
03 — The Cost Reality

Same Results. Half the Price.

If resolution is identical, the next question is cost. Here, the gap is unmistakable and consistent.

Cost per resolved task: Claude $3.98 · Codex $1.86
81% of shared-success tasks where Codex is cheaper
2.65x median cost ratio on tasks both solve
Per-task costs span $0.06 (cheapest) to $47.73 (most expensive), a roughly 800x range
Codex achieves equivalent resolution at roughly half the price. Unlike resolution rate, the cost difference is large and consistent.
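A quick arithmetic check on the headline numbers. Only the $3.98 and $1.86 figures come from the report; the paired per-task costs below are hypothetical placeholders used solely to illustrate how a median cost ratio is computed.

```python
from statistics import median

# Headline ratio: cost per resolved task, Claude vs. Codex (reported figures)
claude_per_resolved = 3.98
codex_per_resolved = 1.86
print(f"{claude_per_resolved / codex_per_resolved:.1f}x")  # -> 2.1x

# Median cost ratio on tasks both agents solve; these paired costs are
# invented placeholders -- the real per-task data lives in the data room.
claude_costs = [1.20, 4.10, 0.96, 7.50, 2.30]
codex_costs = [0.55, 1.40, 0.40, 2.80, 1.10]
ratio = median(c / x for c, x in zip(claude_costs, codex_costs))
```

The median of paired ratios is robust to the long cost tail, which is why it can differ from the ratio of the per-resolved-task averages (2.65x vs. 2.1x).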
04 — Cognitive Profiles

Two Minds, One Outcome

Beneath identical resolution rates, these agents think completely differently. Claude is an explorer — it reads extensively before acting. Codex is a planner — it reasons explicitly before touching code.

THE EXPLORER · Claude Opus 4.6
% of session time spent in each phase:
  • Explore 74.8%
  • Plan 8.0%
  • Implement 9.9%
  • Verify 7.3%

First edit at 71% through the session. Reads extensively, delegates exploration to the cheaper Haiku sub-model, and iterates through multiple test cycles.

THE PLANNER · OpenAI GPT-5.4
% of session time spent in each phase:
  • Explore 36.3%
  • Plan 53.7%
  • Implement 7.1%
  • Verify 3.0%

First edit at 63% through the session. Reasons explicitly before acting, explores less but thinks more, and executes cleanly when the plan is correct.

TOOL USAGE · % of total tool calls per session

Claude
  • Bash 33%
  • Read 33%
  • Grep 12%
  • Edit 9%
  • Todo 8%
  • Other 5%
Codex
  • Reasoning 53%
  • Bash 39%
  • FileEdit 7%
  • Msg 1%
Claude explores. Codex plans. Different paths to the same destination. These cognitive signatures have implications for the next generation of AI coding agents.
05 — Resilience

When Things Go Wrong

Both agents hit obstacles — failing tests, wrong files, build errors. The difference is what happens next. Claude's iterative style gives it more chances to course-correct.

Recovery rate: Claude 83% · Codex 67%
THE EXPLORER'S ADVANTAGE

4.8 test runs per task on average. When something goes wrong, Claude has more test cycles to course-correct through iteration.

THE PLANNER'S TRADEOFF

2.4 test runs per task on average. When the plan is right, Codex executes cleanly. When it's wrong, recovery is harder.

A fundamental architectural tradeoff: Claude's exploration-heavy style provides resilience through iteration, Codex's planning-heavy style provides efficiency through foresight. Both reach the same resolution rate.
06 — Efficiency

Precision vs. Economy

Normalized to a pooled mean of 100: lower is better for cost, higher is better for resolution. Each agent has a distinct efficiency signature.

Metric                     Claude    Codex
Resolution rate              99.3    100.7
Avg cost per task           135.8     64.2
Cost per resolved task      136.4     63.6
Avg tool calls               85.6    114.4
Avg output tokens            73.9    126.1

Claude: precise · Codex: efficient
Codex wins on cost. Claude wins on precision — fewer tool calls, fewer tokens, smaller patches. Both achieve the same outcome.
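The normalization can be reproduced from the raw per-agent values: each index is the value divided by the two-agent pooled mean, times 100, so the pair always sums to 200. A sketch, using the reported cost per resolved task and the resolved-task counts implied by the outcome breakdown (527 vs. 534); small differences from the published indices reflect rounding in the published figures.

```python
def pooled_index(claude_value, codex_value):
    """Normalize a metric to a pooled mean of 100; the two indices sum to 200."""
    pooled_mean = (claude_value + codex_value) / 2
    return (round(claude_value / pooled_mean * 100, 1),
            round(codex_value / pooled_mean * 100, 1))

# Cost per resolved task ($3.98 vs. $1.86) -> roughly (136.3, 63.7)
print(pooled_index(3.98, 1.86))

# Resolved-task counts implied by the outcome table (527 vs. 534)
print(pooled_index(527, 534))  # -> (99.3, 100.7)
```

Because the indices are relative, "lower is better" for the cost rows and "higher is better" for resolution, exactly as the legend above states.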
07 — Where Both Fail

Shared Blind Spots

22% of tasks defeated both agents. The failure patterns reveal where the entire field needs to improve — not where one agent beats another.

~36%
Partial Fix, Remaining Failures

On the right track but couldn't resolve all issues — fixing target behavior but introducing regressions or missing edge cases.

~27%
Right Fix, Wrong Layer

Functionally correct fix applied to the wrong part of the codebase. The agent understood the symptom but misidentified the source.

~5%
Overwhelmed by Scale

Large codebases where both agents explored extensively but could not converge. The problem exceeded navigational capacity.

08 — Bottom Line

The Crown Is Shared

Claude Opus 4.6 and OpenAI GPT-5.4 are statistically identical on resolution. The narrative of Claude dominance in coding is not supported by the data.

On 731 real-world tasks, they agree 89% of the time. On the 11% where they disagree, neither has a significant advantage. The clearest differentiator is cost — Codex delivers the same results at half the price.

Beneath equivalent outcomes, they embody fundamentally different problem-solving philosophies. Claude explores and iterates. Codex plans and executes. Same destination, different paths — a finding with real implications for how we build and evaluate the next generation of AI coding agents.

89% task agreement · p > 0.05, no significant edge · 2.1x Codex cost advantage · 83% Claude recovery rate

Interested in our research?

We publish independent, data-driven analysis of AI agent performance.

Methodology: 731 tasks, proportional stratified sampling by repository, Wald method at 95% confidence. All metrics are comparison-native — concordance, relative risk, conditional win rates. Single run; per-repo advantages are suggestive, not definitive. Full data and analysis code available in the project data room.
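The Wald interval named in the methodology is the textbook normal approximation for a binomial proportion. A sketch at 95% confidence, again assuming the resolved-task counts implied by the outcome breakdown (527 and 534 of 731):

```python
from math import sqrt

def wald_ci(successes, n, z=1.96):
    """95% Wald confidence interval for a binomial proportion."""
    p = successes / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half

claude_ci = wald_ci(527, 731)  # ~ (0.688, 0.753)
codex_ci = wald_ci(534, 731)   # ~ (0.698, 0.763)
# The intervals overlap heavily, consistent with "no significant edge".
```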

The Context Lab · March 2026