
The Non-Determinism Problem: What It Takes to Evaluate Agents Reliably

Non-determinism is the biggest gap between agent pilots and reliable production deployment. This post breaks down why variance is fundamental to LLM-based agents, what evaluation infrastructure must look like in a non-deterministic world, and how to separate capability exploration from regression protection.

Feb 4, 2026 · AI Agents, Evaluation, Non-Determinism

Your agent works in demo. It fails in production. The gap is usually non-determinism.

In our previous post, we talked about why agent evaluation is a full-stack problem. Non-determinism was one of the key challenges we mentioned. This topic deserves a deeper look, because it's the single biggest factor in the gap between demo and production.

The short version: give an agent the same task twice and you might get different tool selections, different intermediate reasoning, different API call sequences, and different final outputs, all of which could be perfectly correct. This variance is fundamental to how LLMs work: they sample from probability distributions. The question is what evaluation infrastructure needs to look like when your system behaves this way.

The Numbers That Matter

Research from τ-bench found that agents achieving 60% pass@1 on benchmarks may exhibit only 25% consistency across multiple trials.[1]

An agent that succeeds on more than half of its individual runs may deliver a consistent success across repeated trials only a quarter of the time. This gap between benchmark performance and production reliability is central to understanding agent deployment challenges.
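A back-of-the-envelope sketch shows how quickly single-run pass rates decay into consistency numbers. It assumes (unrealistically) that trials are independent, so the chance that all k runs succeed is p^k; real agents show per-task correlated failures, which is what τ-bench's pass^k metric measures empirically:

```python
# Illustrative only: under independence, the probability that all k
# trials of a task succeed is p**k. Correlated failures in real agents
# make measured consistency deviate from this baseline.

def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent trials succeed (pass^k)."""
    return p ** k

for k in (1, 2, 4, 8):
    print(f"pass^{k} at p=0.6: {pass_hat_k(0.6, k):.3f}")
    # pass^1: 0.600, pass^2: 0.360, pass^4: 0.130, pass^8: 0.017
```

Even under this optimistic independence assumption, a 60% single-run agent succeeds on all eight trials less than 2% of the time.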

The "non-deterministic multiplier" compounds everything. Testing a prompt once isn't enough. Understanding whether an agent actually works requires testing across dozens of scenarios. Every prompt change or model swap means rerunning the entire statistical sample.

What Evaluation Requires

Addressing non-determinism means building several layers into the evaluation stack.[2]

The Infrastructure Layer

Statistical significance requires running the same evaluation hundreds of times. This demands purpose-built infrastructure:

  • Isolated environments for each run so they don't interfere with each other
  • Full trace capture to understand why variance occurs
  • Reproducible conditions so results are meaningful
  • Analytics pipelines that translate thousands of run logs into actionable insights
  • Concurrent execution so statistically significant samples complete in hours, not days
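As a concrete sketch of what such a harness can look like: each trial gets its own isolated working directory, and trials run concurrently. The names `execute_agent`, `run_trial`, and `run_sample` are hypothetical stand-ins, and the agent call is stubbed so the skeleton runs end to end.

```python
import concurrent.futures
import random
import tempfile

def execute_agent(task: dict, workdir: str) -> bool:
    # Stub: a real implementation would run the agent inside `workdir`
    # and return whether the task's success criteria were met.
    return random.random() < 0.6

def run_trial(task: dict, trial_id: int) -> dict:
    # One temp directory per trial so runs cannot interfere with each other.
    with tempfile.TemporaryDirectory() as workdir:
        passed = execute_agent(task, workdir=workdir)
        return {"trial": trial_id, "passed": passed}

def run_sample(task: dict, n_trials: int = 100, workers: int = 8) -> list[dict]:
    # Concurrent execution so a statistically meaningful sample
    # completes in hours, not days.
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_trial, task, i) for i in range(n_trials)]
        return [f.result() for f in concurrent.futures.as_completed(futures)]

results = run_sample({"name": "refund-flow"}, n_trials=50)
pass_rate = sum(r["passed"] for r in results) / len(results)
```

A production harness would add trace capture and per-run resource limits on top of this skeleton; the isolation-plus-concurrency shape stays the same.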
The Verification Layer

Running hundreds of trials is only useful if you can verify whether each trial succeeded. Different task types require different verification approaches:[2]

Code-based graders work well where success is deterministic—unit tests passing, correct state changes, valid tool call sequences.

Model-based graders handle subjective quality dimensions—rubric scoring, output comparison, tone and clarity assessment. These introduce their own variance, so calibration matters.

Human review provides ground truth for high-stakes cases and helps calibrate automated graders over time.

The most robust evaluation systems layer all three, using code-based verification where possible and model-based or human verification where necessary.

The Measurement Layer

Pass@1 is less important than consistency metrics. The questions that matter for production readiness:

  • Pass@k: How often does the agent succeed across k trials?[1]
  • Variance distribution: What's the spread of execution paths and outcomes?
  • Failure clustering: Where in the workflow do failures typically occur?
  • Cost and latency variance: How predictable is resource consumption?

Distributional views of agent behavior reveal what single-run metrics cannot.
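To make this concrete, here is a small sketch that turns run logs into distributional metrics. The record schema (`passed`, `path`, `failed_step`) is an assumption for illustration; adapt it to your own trace format.

```python
from collections import Counter

# Toy run logs: each record captures outcome, the tool-call path taken,
# and (for failures) the step where the run broke down.
runs = [
    {"passed": True,  "path": ("search", "book"),            "failed_step": None},
    {"passed": True,  "path": ("search", "compare", "book"), "failed_step": None},
    {"passed": False, "path": ("search", "book"),            "failed_step": "book"},
    {"passed": False, "path": ("search",),                   "failed_step": "search"},
    {"passed": False, "path": ("search", "book"),            "failed_step": "book"},
]

pass_rate = sum(r["passed"] for r in runs) / len(runs)
path_distribution = Counter(r["path"] for r in runs)  # spread of execution paths
failure_clusters = Counter(r["failed_step"] for r in runs if not r["passed"])

print(pass_rate)                        # 0.4
print(failure_clusters.most_common(1))  # [('book', 2)]
```

Even this toy sample shows what a single run cannot: the agent takes multiple distinct paths, and failures cluster at the booking step rather than being spread uniformly.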

Capability vs. Regression Evals

One useful distinction: capability evals measure what an agent can do, while regression evals check that existing capabilities still work after changes.[2]

Capability evals are exploratory. Low pass rates are expected—they help you understand limits and improve. Run them when developing new features or expanding agent scope.

Regression evals are protective. High pass rates are expected—they catch breakages. Run them continuously, especially after prompt changes or model swaps.

As capability evals mature and pass rates stabilize, they graduate into regression suites. This progression gives you both room to experiment and confidence that you're not breaking what already works.
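One way to encode that graduation step is a simple rule over recent pass rates: promote an eval once it is both high and stable. The thresholds below are illustrative, not recommendations.

```python
def should_graduate(pass_rates: list[float],
                    min_rate: float = 0.9,
                    max_spread: float = 0.05,
                    window: int = 5) -> bool:
    """Promote a capability eval to the regression suite once its last
    `window` pass rates are all high (>= min_rate) and stable
    (max - min within max_spread)."""
    if len(pass_rates) < window:
        return False
    recent = pass_rates[-window:]
    return min(recent) >= min_rate and (max(recent) - min(recent)) <= max_spread

print(should_graduate([0.4, 0.6, 0.92, 0.93, 0.95, 0.94, 0.93]))  # True
print(should_graduate([0.9, 0.95, 0.7, 0.94, 0.96]))              # False
```

The second example is the important one: a single unstable window keeps the eval in the capability bucket, because a flaky check in a regression suite trains people to ignore failures.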

What We've Found Works

Based on our experience building evaluation infrastructure:

Accept variance as the baseline. Characterize the distribution of agent behavior rather than looking for a single "correct" run. What percentage of runs succeed? What's the variance in paths taken? Where do failures cluster?

Invest in trace capture. Non-deterministic systems are only debuggable with full visibility into what happened on each run—tool calls, intermediate reasoning, decision points.
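A minimal sketch of trace capture at the tool boundary: a decorator that records every call, its arguments, result, and timing into a per-run trace. The `lookup_order` tool is a hypothetical example.

```python
import functools
import time

def traced(trace: list):
    """Wrap a tool function so each invocation is appended to `trace`."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            result = fn(*args, **kwargs)
            trace.append({
                "tool": fn.__name__,
                "args": args,
                "kwargs": kwargs,
                "result": result,
                "seconds": time.monotonic() - start,
            })
            return result
        return wrapper
    return decorator

trace: list = []

@traced(trace)
def lookup_order(order_id: str) -> str:
    # Hypothetical tool; a real one would hit an orders API.
    return f"order {order_id}: shipped"

lookup_order("A-123")
```

Intermediate reasoning and decision points need capture at the model-call layer as well; the pattern is the same, just wrapped around the LLM client instead of the tools.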

Build for scale. Statistical significance requires infrastructure that can execute hundreds of runs efficiently: cloud execution, proper isolation, parallelization.

Layer your verification. Code-based graders where possible, model-based graders where necessary, human review for calibration and high-stakes cases.

Separate capability from regression. Different eval types serve different purposes. Exploration and protection require different approaches.

In Summary

The non-determinism of agents is fundamental to how these systems work. Variance is the source of both their flexibility and their unreliability.

Addressing it requires infrastructure for scale, layered verification approaches, distributional metrics, and clear separation between capability and regression evaluation.

This is part of why we built The Context Lab—to provide the statistical evaluation infrastructure that non-deterministic systems require.

What's been your experience with agent variance? Have you found approaches that work for your use case?

References

[1] Yao, S., Shinn, N., et al. (2024). "τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains." arXiv preprint. https://arxiv.org/abs/2406.12045

[2] Anthropic. (2025). "Demystifying evals for AI agents." https://www.anthropic.com/research/demystifying-evals-for-ai-agents