Bito x The Context Lab: Proving What Context Does for Coding Agents
Bito needed to quantify the impact of codebase context on coding agent performance. We provided phased evaluation infrastructure that ran 600 isolated evaluations with full trace capture, revealing a 39.4% relative improvement in success rate.
Customer
Bito — AI developer tools company building codebase intelligence for coding agents
The Challenge
Bito built AI Architect, a codebase intelligence layer that provides system-level context to coding agents via MCP. They believed it improved agent performance on complex tasks—but they had no way to prove it.
The problem wasn't whether AI Architect worked. The problem was quantifying exactly how much it helped, on what kinds of tasks, and why. That required running hundreds of controlled evaluations, capturing every agent action, and analyzing the data across multiple dimensions—without burning through budget on a massive run that might not yield useful results.
Bito needed a third-party evaluation on SWE-Bench Pro: approximately 300 tasks across five large repositories, each run with and without AI Architect. That's roughly 600 isolated runs with full trace capture and rigorous verification.
What made this hard
Deterministic isolation at scale. Each run had to execute in a fully isolated, deterministic environment. This required strict control over infrastructure state—container lifecycle, filesystem, network, credentials—and application state—repository checkout, dependency resolution, build artifacts. Any shared or residual state across runs risked cross-contamination and invalid comparisons.
Reliability and failure containment. At the scale of hundreds of parallel runs, infrastructure and setup failures are unavoidable—but silent failures are unacceptable. The system had to fail fast, classify failures precisely, and recover automatically. Clear boundaries between infrastructure failures and genuine agent failures were essential to prevent corrupted runs from entering the dataset.
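The fail-fast, classify, and retry behavior described above can be sketched as follows. This is a minimal illustration, not the platform's actual API; the stage names and retry policy are assumptions.

```python
"""Sketch: every run ends in an explicit terminal state, and only
infrastructure/setup failures are retried. Agent failures are data."""
from dataclasses import dataclass
from enum import Enum, auto


class FailureClass(Enum):
    INFRA = auto()    # container, network, credentials -> retryable
    SETUP = auto()    # checkout, build, dependencies -> retryable
    AGENT = auto()    # the agent failed the task -> terminal, counts as data
    SUCCESS = auto()


@dataclass
class RunResult:
    task_id: str
    failure_class: FailureClass
    detail: str


def classify(exit_code: int, stage: str) -> FailureClass:
    """Map the failing stage to a failure class (stage names are hypothetical)."""
    if exit_code == 0:
        return FailureClass.SUCCESS
    if stage in ("container_start", "network", "credentials"):
        return FailureClass.INFRA
    if stage in ("checkout", "build", "dependency_install"):
        return FailureClass.SETUP
    return FailureClass.AGENT


def run_with_retries(task_id: str, execute, max_retries: int = 2) -> RunResult:
    """Retry only INFRA/SETUP failures; never retry genuine agent outcomes."""
    for _attempt in range(max_retries + 1):
        exit_code, stage, detail = execute(task_id)
        cls = classify(exit_code, stage)
        if cls in (FailureClass.SUCCESS, FailureClass.AGENT):
            return RunResult(task_id, cls, detail)
    # Retries exhausted: mark unrecoverable so the run is excluded, not silently kept.
    return RunResult(task_id, cls, f"unrecoverable after {max_retries} retries: {detail}")
```

The key design point is the explicit terminal state: a run that exhausts retries is excluded from the dataset rather than recorded as an agent failure.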
MCP state validation. MCP-enabled treatment runs introduced a stateful control plane whose correctness directly impacted result validity. For every treatment run, the system had to verify—before agent execution—that the repository was fully indexed, the correct MCP application image and version were deployed, and all required state had converged to a known-good baseline. Partial convergence (incomplete indexing or stale context) could silently degrade results.
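A convergence gate of this kind can be sketched as a preflight poll that blocks agent execution until indexing is complete and the deployed version matches. The status callable and its field names are assumptions for illustration, not a real Bito interface.

```python
"""Sketch: treatment runs wait for full MCP state convergence before the
agent starts. Partial indexing is treated as not-ready, never 'good enough'."""
import time


def wait_for_mcp_convergence(get_status, expected_version: str,
                             timeout_s: float = 600.0, poll_s: float = 1.0) -> None:
    """get_status() -> dict like {"indexed_files": int, "total_files": int,
    "image_version": str}. Raises if state never reaches the known-good baseline."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()
        if status["image_version"] != expected_version:
            raise RuntimeError(
                f"wrong MCP image: {status['image_version']} != {expected_version}")
        if status["total_files"] > 0 and status["indexed_files"] == status["total_files"]:
            return  # fully indexed, correct version: safe to start the agent
        time.sleep(poll_s)
    raise TimeoutError("MCP index did not converge before the deadline")
```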
Instruction injection complexity. Custom MCP instructions were injected per run to ensure consistent agent-context interaction. This expanded the failure surface: instruction drift, partial injection, or mismatches between instructions and indexed context. Each run required verification that agent configuration, instruction payloads, and MCP state were mutually consistent.
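One way to catch instruction drift and partial injection is to hash the instruction payload at injection time and re-verify it against what the agent actually loaded, alongside a commit check against the indexed context. This is a sketch; the payload fields and commit parameters are hypothetical.

```python
"""Sketch: verify that injected instructions, the agent's loaded config,
and the MCP index are mutually consistent before a run starts."""
import hashlib
import json


def payload_digest(payload: dict) -> str:
    """Canonical JSON digest, so semantically equal payloads hash identically."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


def verify_instruction_consistency(injected: dict, agent_loaded: dict,
                                   mcp_index_commit: str, run_commit: str) -> None:
    """Raise on drift between injected and loaded instructions, or stale context."""
    if payload_digest(injected) != payload_digest(agent_loaded):
        raise RuntimeError("instruction drift: agent config != injected payload")
    if mcp_index_commit != run_commit:
        raise RuntimeError(
            f"stale context: index built at {mcp_index_commit}, run at {run_commit}")
```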
Application state orchestration. Some repositories and task scenarios required explicit control over application-level state beyond container startup—time-dependent logic, order-sensitive builds, historical state reconstruction. Supporting these cases pushed orchestration beyond infrastructure health into runtime validation of application state convergence.
Cost uncertainty. Running 600 agent evaluations with a frontier model isn't cheap. Bito needed visibility into token costs before committing—and safeguards against runaway consumption due to infrastructure or integration failures.
The Solution
Phased evaluation for cost efficiency
Rather than committing to a full 600-run evaluation upfront, we structured the work in phases.
Phase 1: Workflow validation and cost modeling. We ran a small set of tasks to validate the end-to-end workflow—repo builds, test execution, MCP integration, trace capture. This phase produced a cost estimation model: token usage per task, projected spend for the full run.
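A cost model of this shape can be sketched as a projection from pilot-run token usage, with a headroom multiplier for variance. The token prices and sample figures below are placeholders, not the actual numbers from the engagement.

```python
"""Sketch: project full-run spend from per-task token usage measured in Phase 1."""

def project_full_run_cost(sample_token_usage, n_total_runs: int,
                          usd_per_1m_input: float, usd_per_1m_output: float,
                          safety_factor: float = 1.25) -> float:
    """sample_token_usage: list of (input_tokens, output_tokens) per pilot run.
    Returns projected spend in USD, padded by a headroom multiplier."""
    n = len(sample_token_usage)
    avg_in = sum(i for i, _ in sample_token_usage) / n
    avg_out = sum(o for _, o in sample_token_usage) / n
    per_run = (avg_in / 1e6) * usd_per_1m_input + (avg_out / 1e6) * usd_per_1m_output
    return per_run * n_total_runs * safety_factor
```

The safety factor is the design choice that matters: it turns a point estimate into a budget ceiling a customer can commit to before scale-up.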
Phase 2: Pilot for directional signal. A small pilot run tested whether the evaluation was worth scaling. The pilot showed directional performance gains on resolve rate, confirming AI Architect made a measurable difference. With cost estimates validated and early results promising, Bito approved the full evaluation.
Phase 3: Full-scale run. Approximately 300 tasks in both baseline and treatment conditions—roughly 600 total runs with full trace capture. The larger dataset enabled deeper analysis: breakdowns by repository, task complexity, file count, and failure modes.
Platform: Evaluation infrastructure at scale
Hermetic execution environments. Pre-baked Docker images for all five repositories at the exact commits SWE-Bench Pro requires. Each run enforced hermetic initialization and reproducibility guarantees before agent execution began. No shared state between runs.
Preflight validation and failure classification. Automated validation confirming repos build and tests execute before agents start. When failures occurred, the platform classified them precisely—infrastructure issue, setup failure, or genuine agent failure—with deterministic retries and explicit terminal states.
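The build-and-test gate can be sketched as a pair of checked subprocess invocations; a failure here is attributed to infrastructure or setup, never to the agent. The command lists are per-repository configuration, not real SWE-Bench Pro values.

```python
"""Sketch: refuse to start the agent unless the repo builds and tests execute."""
import subprocess


def preflight(build_cmd, test_cmd, timeout_s: int = 1800):
    """Run build then test commands; return (ok, reason). Any failure is a
    setup failure to be retried or excluded, not an agent outcome."""
    for label, cmd in (("build", build_cmd), ("test", test_cmd)):
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
        if proc.returncode != 0:
            return False, f"preflight {label} failed: {proc.stderr[-500:]}"
    return True, "preflight passed"
```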
MCP control-plane verification. For treatment runs, the platform asserted repository indexing completion, correct MCP application image and version deployment, and full state convergence before execution proceeded. Instruction payloads were validated for consistency with indexed context.
Deep observability. Full trace capture on every run—tool calls, file operations, MCP queries, timing, token usage. Structured logs across distributed components, verbose enough for behavioral analysis while remaining queryable at scale.
FinOps enforcement. Cost control embedded in the execution pipeline: early termination on invalid states, cost modeling before scale-up, safeguards against runaway token consumption.
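A runaway-consumption safeguard can be sketched as token ceilings enforced inside the execution loop, per run and evaluation-wide. The limits and class names are illustrative.

```python
"""Sketch: budget guard that terminates a run early when a token ceiling is hit,
so an integration failure cannot loop the agent into unbounded spend."""

class BudgetExceeded(RuntimeError):
    pass


class TokenBudget:
    def __init__(self, per_run_limit: int, global_limit: int):
        self.per_run_limit = per_run_limit
        self.global_limit = global_limit
        self.global_used = 0

    def charge(self, run_used: int, new_tokens: int) -> int:
        """Record new_tokens against both ceilings; raise to terminate early."""
        run_used += new_tokens
        self.global_used += new_tokens
        if run_used > self.per_run_limit:
            raise BudgetExceeded(f"run exceeded {self.per_run_limit} tokens")
        if self.global_used > self.global_limit:
            raise BudgetExceeded("evaluation-wide token ceiling reached")
        return run_used
```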
FDEs: Workflow design and evaluation rubric
Our engineers worked with Bito to configure the AI Architect MCP integration and validate it end-to-end. They defined the evaluation rubric—what counts as success, how to classify failures, how to measure efficiency. They built the cost estimation model that gave Bito confidence to proceed. And they ensured agents invoked MCP naturally, the way developers would use context in real workflows.
Research: Multi-dimensional analysis
Trace parsers extracted quantitative metrics from every run. Patch analysis measured the scope of every code change. Results were structured into queryable layers so Bito could slice by repository, task complexity, and file count—answering not just "did it work" but "where and why."
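The extraction step can be sketched as a pass over JSONL trace events that aggregates the per-run metrics mentioned above. The event schema here ("type", "name", "tokens", "path") is an assumption for illustration.

```python
"""Sketch: parse a run's JSONL trace into queryable per-run metrics."""
import json
from collections import Counter


def summarize_trace(jsonl_lines):
    """Aggregate tool-call counts, token totals, and files touched."""
    by_tool = Counter()
    tokens = 0
    files = set()
    for line in jsonl_lines:
        event = json.loads(line)
        if event["type"] == "tool_call":
            by_tool[event["name"]] += 1
        tokens += event.get("tokens", 0)
        if event["type"] == "file_edit":
            files.add(event["path"])
    return {"tool_calls": sum(by_tool.values()),
            "by_tool": dict(by_tool),
            "tokens": tokens,
            "files_changed": len(files)}
```

Summaries in this shape are what make slicing by repository, task complexity, and file count a query rather than a re-run.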
The Impact
Quantifying what was previously unmeasurable
Before this evaluation, Bito knew AI Architect helped—but couldn't prove how much. The evaluation delivered concrete numbers they could stand behind: a 39.4% relative improvement in success rate (60.8% with AI Architect vs. 43.6% baseline). For the first time, Bito could quantify the value of codebase context with third-party validation.
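The headline figure is the relative gain over baseline, which follows directly from the two success rates:

```python
# Relative improvement in success rate: (treatment - baseline) / baseline.
baseline, treatment = 0.436, 0.608
relative_gain = (treatment - baseline) / baseline
print(f"{relative_gain:.1%}")  # → 39.4%
```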
Visibility into agent behavior
Trace analysis revealed the mechanism, not just the outcome. Without AI Architect, agents struggled—cycling through files, making redundant searches, hitting dead ends before finding the right code to modify. With AI Architect, agents converged faster, querying for dependencies and related files upfront instead of discovering them through trial and error.
The efficiency metrics made this visible: 19.6% faster task completion, 25.4% fewer tool calls. The traces showed exactly where context eliminated struggle.
Cost-efficient path to rigorous results
The phased approach let Bito validate the workflow, estimate costs, and see directional results before committing to a full evaluation. No wasted spend on a run that might not work. No surprises on token costs. When the full run executed, Bito already knew what to expect.
Analysis across multiple dimensions
The full-scale run enabled breakdowns the pilot couldn't support. Bito could see that gains concentrated on large repositories and multi-file tasks—exactly where codebase navigation matters most.
On tasks requiring 10+ file changes, AI Architect succeeded 4.5× more often than baseline. On 15+ file changes, baseline had zero successes while AI Architect completed four. The 412-file, 58,000-line refactor that baseline couldn't finish became a concrete proof point: AI Architect completed it 27% faster with 50% fewer tool calls.
These weren't anecdotes. They were patterns extracted from 600 traced runs, sliced across the dimensions that matter for enterprise adoption.
In Summary
Bito needed to quantify something they couldn't measure on their own: the impact of codebase context on coding agent performance. We provided a phased approach that validated the workflow and controlled costs, infrastructure that ran 600 isolated evaluations with full trace capture, and analysis that revealed where and why context made the difference.
The evaluation gave Bito rigorous, third-party validation with data they could stand behind.