The Context Lab Research

Claude vs. Codex on SWE-Bench Pro

The industry has crowned Claude the best coding agent. We ran both agents on 731 real-world software engineering tasks under identical conditions. The data tells a different story.

731 real-world tasks · 11 repos, 4 languages
89% agreement rate · Cohen's κ = 0.72
2.1x cost difference · same resolution, half the price
01 — The Experiment

Identical Conditions. No Excuses.

SWE-Bench Pro: 731 real GitHub issues pulled from 11 open-source repositories in Python, Go, JavaScript, and TypeScript. Each agent reads the issue, explores the codebase, writes a fix, and runs tests. Fully autonomous — no human help, no retrieval augmentation, no documentation access.

CLAUDE OPUS 4.6
  • Claude Code CLI
  • Multi-model: Haiku for exploration, Opus for reasoning
  • Specialized tools (Read, Grep, Edit, Glob)
OPENAI GPT-5.4
  • Codex CLI
  • Single model for everything
  • Bash + explicit chain-of-thought reasoning blocks

Why comparison-native metrics? Absolute resolution rates aren't comparable across published reports — they depend on task sample, harness version, and scoring methodology. We use concordance, relative risk, and conditional win rates: metrics that describe how two agents performed relative to each other on identical tasks.

02 — The Verdict

No Statistical Advantage. Period.

On 652 of 731 tasks, both agents reach the same outcome. On the 79 where they disagree, neither has a statistically significant edge.

89% agreement
  • Both solved: 490
  • Both failed: 161
  • Claude only: 37
  • Codex only: 44
0.986 relative risk · effectively 1:1
45/55 conditional win rate · a coin flip (p > 0.05)
If you can only choose one agent, and the task happens to be one where they disagree — it's a coin flip which one succeeds.
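As a sketch, the headline statistics can be recomputed from the four cell counts above. One assumption: the report does not name its significance test, so McNemar's test on the discordant pairs (a standard choice for paired binary outcomes) stands in here.

```python
from math import erfc, sqrt

# 2x2 outcome table from the breakdown above
both_solved, both_failed = 490, 161
claude_only, codex_only = 37, 44
n = both_solved + both_failed + claude_only + codex_only

# Observed agreement and Cohen's kappa (chance-corrected agreement)
p_o = (both_solved + both_failed) / n
p_claude = (both_solved + claude_only) / n  # Claude resolution rate
p_codex = (both_solved + codex_only) / n    # Codex resolution rate
p_e = p_claude * p_codex + (1 - p_claude) * (1 - p_codex)
kappa = (p_o - p_e) / (1 - p_e)

# Relative risk of resolution, and conditional win rate on disagreements
rr = p_claude / p_codex
win_rate = claude_only / (claude_only + codex_only)

# McNemar's test on the discordant pairs (chi-square, 1 df)
chi2 = (claude_only - codex_only) ** 2 / (claude_only + codex_only)
p_value = erfc(sqrt(chi2 / 2))

print(f"agreement={p_o:.1%}  kappa={kappa:.2f}  RR={rr:.3f}  "
      f"win rate={win_rate:.0%}  p={p_value:.2f}")
```

The recomputed values match the published figures up to rounding, and the McNemar p-value lands well above 0.05, consistent with "no significant edge".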
03 — The Cost Reality

Same Results. Half the Price.

If resolution is identical, the next question is cost. Here, the gap is unmistakable and consistent.

Cost per resolved task: Claude $3.98 · Codex $1.86
81% of shared-success tasks where Codex is cheaper
2.65x median cost ratio on tasks both solve
Per-task costs span $0.06 (cheapest) to $47.73 (most expensive), a roughly 800x range
Codex achieves equivalent resolution at roughly half the price. Unlike resolution rate, the cost difference is large and consistent.
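A quick arithmetic check on the headline numbers. Only the $3.98 and $1.86 figures come from the report; the paired per-task costs below are hypothetical placeholders used solely to illustrate how a median cost ratio is computed.

```python
from statistics import median

# Headline ratio: cost per resolved task, Claude vs. Codex (reported figures)
claude_per_resolved = 3.98
codex_per_resolved = 1.86
print(f"{claude_per_resolved / codex_per_resolved:.1f}x")  # -> 2.1x

# Median cost ratio on tasks both agents solve; these paired costs are
# invented placeholders -- the real per-task data lives in the data room.
claude_costs = [1.20, 4.10, 0.96, 7.50, 2.30]
codex_costs = [0.55, 1.40, 0.40, 2.80, 1.10]
ratio = median(c / x for c, x in zip(claude_costs, codex_costs))
```

The median of paired ratios is robust to the long cost tail, which is why it can differ from the ratio of the per-resolved-task averages (2.65x vs. 2.1x).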
04 — Cognitive Profiles

Two Minds, One Outcome

Beneath identical resolution rates, these agents think completely differently. Claude is an explorer — it reads extensively before acting. Codex is a planner — it reasons explicitly before touching code.

THE EXPLORER · Claude Opus 4.6
% of session time spent in each phase:
  • Explore 74.8%
  • Plan 8.0%
  • Implement 9.9%
  • Verify 7.3%

First edit at 71% through the session. Reads extensively, delegates exploration to the cheaper Haiku sub-model, and iterates through multiple test cycles.

THE PLANNER · OpenAI GPT-5.4
% of session time spent in each phase:
  • Explore 36.3%
  • Plan 53.7%
  • Implement 7.1%
  • Verify 3.0%

First edit at 63% through the session. Reasons explicitly before acting, explores less but thinks more, and executes cleanly when the plan is correct.

TOOL USAGE · % of total tool calls per session

Claude
  • Bash 33%
  • Read 33%
  • Grep 12%
  • Edit 9%
  • Todo 8%
  • Other 5%
Codex
  • Reasoning 53%
  • Bash 39%
  • FileEdit 7%
  • Msg 1%
Claude explores. Codex plans. Different paths to the same destination. These cognitive signatures have implications for the next generation of AI coding agents.
05 — Resilience

When Things Go Wrong

Both agents hit obstacles — failing tests, wrong files, build errors. The difference is what happens next. Claude's iterative style gives it more chances to course-correct.

Recovery rate: Claude 83% · Codex 67%
THE EXPLORER'S ADVANTAGE

4.8 test runs per task on average. When something goes wrong, Claude has more test cycles to course-correct through iteration.

THE PLANNER'S TRADEOFF

2.4 test runs per task on average. When the plan is right, Codex executes cleanly. When it's wrong, recovery is harder.

A fundamental architectural tradeoff: Claude's exploration-heavy style provides resilience through iteration, Codex's planning-heavy style provides efficiency through foresight. Both reach the same resolution rate.
06 — Efficiency

Precision vs. Economy

Normalized to a pooled mean of 100: lower is better for cost, higher is better for resolution. Each agent has a distinct efficiency signature.

Metric                     Claude    Codex
Resolution rate              99.3    100.7
Avg cost per task           135.8     64.2
Cost per resolved task      136.4     63.6
Avg tool calls               85.6    114.4
Avg output tokens            73.9    126.1

Claude: precise · Codex: efficient
Codex wins on cost. Claude wins on precision — fewer tool calls, fewer tokens, smaller patches. Both achieve the same outcome.
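The normalization can be reproduced from the raw per-agent values: each index is the value divided by the two-agent pooled mean, times 100, so the pair always sums to 200. A sketch, using the reported cost per resolved task and the resolved-task counts implied by the outcome breakdown (527 vs. 534); small differences from the published indices reflect rounding in the published figures.

```python
def pooled_index(claude_value, codex_value):
    """Normalize a metric to a pooled mean of 100; the two indices sum to 200."""
    pooled_mean = (claude_value + codex_value) / 2
    return (round(claude_value / pooled_mean * 100, 1),
            round(codex_value / pooled_mean * 100, 1))

# Cost per resolved task ($3.98 vs. $1.86) -> roughly (136.3, 63.7)
print(pooled_index(3.98, 1.86))

# Resolved-task counts implied by the outcome table (527 vs. 534)
print(pooled_index(527, 534))  # -> (99.3, 100.7)
```

Because the indices are relative, "lower is better" for the cost rows and "higher is better" for resolution, exactly as the legend above states.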
07 — Where Both Fail

Shared Blind Spots

22% of tasks defeated both agents. The failure patterns reveal where the entire field needs to improve — not where one agent beats another.

~36%
Partial Fix, Remaining Failures

On the right track but couldn't resolve all issues — fixing target behavior but introducing regressions or missing edge cases.

~27%
Right Fix, Wrong Layer

Functionally correct fix applied to the wrong part of the codebase. The agent understood the symptom but misidentified the source.

~5%
Overwhelmed by Scale

Large codebases where both agents explored extensively but could not converge. The problem exceeded navigational capacity.

08 — Bottom Line

The Crown Is Shared

Claude Opus 4.6 and OpenAI GPT-5.4 are statistically identical on resolution. The narrative of Claude dominance in coding is not supported by the data.

On 731 real-world tasks, they agree 89% of the time. On the 11% where they disagree, neither has a significant advantage. The clearest differentiator is cost — Codex delivers the same results at half the price.

Beneath equivalent outcomes, they embody fundamentally different problem-solving philosophies. Claude explores and iterates. Codex plans and executes. Same destination, different paths — a finding with real implications for how we build and evaluate the next generation of AI coding agents.

89% task agreement · p > 0.05, no significant edge · 2.1x Codex cost advantage · 83% Claude recovery rate

Interested in our research?

We publish independent, data-driven analysis of AI agent performance.

Methodology: 731 tasks, proportional stratified sampling by repository, Wald method at 95% confidence. All metrics are comparison-native — concordance, relative risk, conditional win rates. Single run; per-repo advantages are suggestive, not definitive. Full data and analysis code available in the project data room.
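The Wald interval named in the methodology is the textbook normal approximation for a binomial proportion. A sketch at 95% confidence, again assuming the resolved-task counts implied by the outcome breakdown (527 and 534 of 731):

```python
from math import sqrt

def wald_ci(successes, n, z=1.96):
    """95% Wald confidence interval for a binomial proportion."""
    p = successes / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half

claude_ci = wald_ci(527, 731)  # ~ (0.688, 0.753)
codex_ci = wald_ci(534, 731)   # ~ (0.698, 0.763)
# The intervals overlap heavily, consistent with "no significant edge".
```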

The Context Lab · March 2026