The industry has crowned Claude the best coding agent. We ran Claude and Codex on 731 real-world software engineering tasks under identical conditions. The data tells a different story.
SWE-Bench Pro: 731 real GitHub issues pulled from 11 open-source repositories in Python, Go, JavaScript, and TypeScript. Each agent reads the issue, explores the codebase, writes a fix, and runs tests. Fully autonomous — no human help, no retrieval augmentation, no documentation access.
Why comparison-native metrics? Absolute resolution rates aren't comparable across published reports — they depend on task sample, harness version, and scoring methodology. We use concordance, relative risk, and conditional win rates: metrics that describe how two agents performed relative to each other on identical tasks.
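The three metrics above can be computed directly from paired per-task outcomes. A minimal sketch (the data here is an illustrative toy sample, not the study's actual results; function and variable names are our own):

```python
def comparison_metrics(a, b):
    """Comparison-native metrics for two agents evaluated on the
    same task set. `a` and `b` are parallel lists of booleans
    (True = task resolved)."""
    n = len(a)
    # Concordance: fraction of tasks where both agents got the same outcome.
    concordance = sum(x == y for x, y in zip(a, b)) / n
    # Relative risk of resolution: agent A's resolution rate over agent B's.
    relative_risk = (sum(a) / n) / (sum(b) / n)
    # Conditional win rate: among discordant tasks, how often A was the winner.
    discordant = [(x, y) for x, y in zip(a, b) if x != y]
    win_rate_a = sum(x for x, _ in discordant) / len(discordant)
    return concordance, relative_risk, win_rate_a

# Toy outcomes for six tasks (hypothetical, for illustration only):
claude = [True, True, False, True, False, False]
codex  = [True, False, False, True, True, False]
print(comparison_metrics(claude, codex))
```

Because every metric is defined on the same paired task set, harness and sampling differences that plague cross-report comparisons cancel out by construction.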
On 652 of 731 tasks, both agents reach the same outcome. On the 79 where they disagree, neither has a statistically significant edge.
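The "no statistically significant edge" claim on discordant pairs is the kind of question an exact McNemar test answers: under the null hypothesis of no difference, the 79 disagreements should split roughly evenly between the two agents. A sketch, assuming an exact binomial formulation; the 44/35 split shown is hypothetical, since the source does not report how the 79 discordant tasks divided:

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar test on discordant pairs.
    b = tasks only agent A resolved; c = tasks only agent B resolved.
    Under H0 (no difference), b ~ Binomial(b + c, 0.5)."""
    n = b + c
    k = min(b, c)
    # Probability of a split at least this lopsided, one tail, then doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) * 0.5 ** n
    return min(1.0, 2 * tail)

# Hypothetical split of the 79 discordant tasks (illustrative only):
print(mcnemar_exact(44, 35))
```

Any split of 79 discordant tasks milder than roughly 50/29 yields p > 0.05, which is consistent with the article's finding that neither agent has a significant edge.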
If resolution is identical, the next question is cost. Here, the gap is unmistakable and consistent.
Beneath identical resolution rates, these agents think completely differently. Claude is an explorer — it reads extensively before acting. Codex is a planner — it reasons explicitly before touching code.
First edit at 71% through session. Reads extensively, delegates exploration to a cheaper Haiku sub-model, iterates through multiple test cycles.
First edit at 63% through session. Reasons explicitly before acting, explores less but thinks more, executes cleanly when the plan is correct.
Both agents hit obstacles — failing tests, wrong files, build errors. The difference is what happens next. Claude's iterative style gives it more chances to course-correct.
4.8 test runs per task on average. When something goes wrong, those extra cycles give Claude room to recover.
2.4 test runs per task on average. When the plan is right, Codex executes cleanly. When it's wrong, recovery is harder.
Normalized to a pooled mean of 100: lower is better for cost, higher is better for resolution. Each agent has a distinct efficiency signature.
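Indexing each metric to a pooled mean of 100 is a simple rescaling: divide each agent's value by the mean across both agents, then multiply by 100. A sketch with hypothetical per-task costs (the figures are illustrative, not the study's data):

```python
def index_to_pooled_mean(values):
    """Rescale a metric so the pooled (cross-agent) mean equals 100.
    An index below 100 means below-average; above 100, above-average."""
    pooled_mean = sum(values) / len(values)
    return [100 * v / pooled_mean for v in values]

# Hypothetical average cost per task in dollars for two agents:
costs = [2.0, 4.0]
print(index_to_pooled_mean(costs))  # the cheaper agent indexes below 100
```

This keeps cost and resolution on a common scale so the two efficiency signatures can sit on the same chart, even though the raw units differ.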
22% of tasks defeated both agents. The failure patterns reveal where the entire field needs to improve — not where one agent beats another.
On the right track but couldn't resolve all issues — fixing target behavior but introducing regressions or missing edge cases.
Functionally correct fix applied to the wrong part of the codebase. The agent understood the symptom but misidentified the source.
Large codebases where both agents explored extensively but could not converge. The problem exceeded navigational capacity.
Claude Opus 4.6 and OpenAI GPT-5.4 are statistically indistinguishable on resolution. The narrative of Claude dominance in coding is not supported by the data.
On 731 real-world tasks, they agree 89% of the time. On the 11% where they disagree, neither has a significant advantage. The clearest differentiator is cost — Codex delivers the same results at half the price.
Beneath equivalent outcomes, they embody fundamentally different problem-solving philosophies. Claude explores and iterates. Codex plans and executes. Same destination, different paths — a finding with real implications for how we build and evaluate the next generation of AI coding agents.
Interested in our research?
We publish independent, data-driven analysis of AI agent performance.