January 15, 2026
Current agent evals are broken: here is how OpenAGI builds scalable, diverse, and reliable evals for its computer-use models
Most teams evaluate GUI agents in the simplest possible way: run the agent on a set of tasks and report its success rate. Benchmarks such as OSWorld and OnlineMind2Web follow this pattern by asking the agent to complete a long-horizon objective end-to-end, often spanning tens or hundreds of steps.
This design is easy to implement and easy to compare, but it collapses a GUI agent’s real complexity into a single binary outcome.
A GUI agent is a coupled system—visual perception, state tracking, planning, and low-level execution all interact—and an end-to-end success rate does not tell you which component failed, why it failed, or whether the failure mode is fixable. As a result, these benchmarks provide limited diagnostic value for people building agents.
More importantly, the end-to-end framing leaves several decision-critical axes under-measured: reliability across runs, controllability (can you constrain behavior and recover from deviations), and cost-effectiveness (tokens, latency, and tool usage per unit of useful work).
Those are often the metrics that determine whether an agent can be deployed as a dependable, generalizable workflow rather than a demo. This motivates our first principle for benchmark design: optimize for the developer and enterprise experience of building and customizing scalable workflows, not just reporting a headline success rate.
Agents introduce a new set of challenges for evaluation compared to language models
First, we need to capture human–agent interaction in a way that is both realistic and measurable: what is the user intent, what constraints are implied but unstated, and how do you score partial progress, recovery, and clarification behavior?
Second, we need to construct an environment the agent can execute against reliably—tool APIs, GUIs, web pages, and backends change.
Third, evaluation must produce informative feedback that maps back to engineering and training decisions: which failures are perception issues, which are planning errors, which are execution bugs, and which are mis-specified objectives?
A further complication is that agent evaluation in dynamic environments is inherently noisy. Outcomes are perturbed by randomness in page load timing, nondeterministic UI states, tool latency, stochastic policies, and external system variability. In static math/code benchmarks, the problem is largely fixed and repeatable; for agents, the same policy can succeed or fail depending on incidental factors.
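One practical consequence of this noise is that a single run per task is rarely informative. A minimal sketch (our own illustration, not an OpenAGI API) of treating each task's success rate as a binomial estimate over repeated runs, with a normal-approximation confidence interval:

```python
import math

def success_rate_with_ci(outcomes, z=1.96):
    """Estimate success rate from repeated runs of the same task,
    with a normal-approximation 95% confidence interval.

    outcomes: list of booleans, one per independent run.
    """
    n = len(outcomes)
    p = sum(outcomes) / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# Hypothetical runs of one policy on one task: same agent, noisy environment.
runs = [True, True, False, True, False, True, True, True, False, True]
rate, lo, hi = success_rate_with_ci(runs)
# rate == 0.7, but the wide interval shows how little 10 runs pin down.
```

The width of the interval makes the noise visible: with ten runs, two policies whose headline success rates differ by 20 points may be statistically indistinguishable.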
Where do today’s benchmarks fail to capture real production behavior?
The biggest challenge for existing benchmarks is capturing real-world production behavior in a way that actually improves the user experience. Today’s benchmarks often fail to model two crucial aspects.
First, reliability. Imagine a developer who stress-tests a workflow and ships a product, only to find certain steps failing intermittently. Existing benchmarks generally do not measure how the model performs under diverse website appearances, layouts, resolutions, and other real-world variations.
Second, cost-efficiency. Improving the user experience involves shortening wait time: beyond success rate, we care about the total time and cost needed to complete steps, so developers can build scalable and profitable workflows using our model.
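Cost-efficiency can be made concrete with a metric like cost per successful task, which charges failed attempts against the successes they accompany. A minimal sketch, with hypothetical field names:

```python
def cost_per_success(trials):
    """Aggregate cost-efficiency over a batch of task attempts.

    trials: list of dicts with hypothetical keys
      'success' (bool), 'tokens' (int), 'latency_s' (float).
    Returns (tokens, wall-clock seconds) spent per successful task,
    counting the cost of failed attempts as well.
    """
    successes = sum(t["success"] for t in trials)
    total_tokens = sum(t["tokens"] for t in trials)
    total_latency = sum(t["latency_s"] for t in trials)
    if successes == 0:
        return float("inf"), float("inf")
    return total_tokens / successes, total_latency / successes

trials = [
    {"success": True, "tokens": 12_000, "latency_s": 45.0},
    {"success": False, "tokens": 30_000, "latency_s": 120.0},
    {"success": True, "tokens": 9_000, "latency_s": 38.0},
]
tokens_per, seconds_per = cost_per_success(trials)
# 51,000 tokens and 203 s of attempts bought 2 successes:
# 25,500 tokens and 101.5 s per success.
```

Note that the failed attempt dominates the cost here, which is exactly why success rate alone understates the price of an unreliable workflow.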
At OpenAGI, to address these gaps, we start from the behaviors that appear most frequently in our real use cases and use them as the foundation of the benchmark. Specifically, we emphasize high-leverage atomic capabilities like search/navigation and reliable form filling with large, messy input—because these are the steps that most often make or break production workflows.
How OpenAGI evaluates its computer-use agents
We generate a controllable environment that can be easily customized to cover a wide range of user scenarios. Similar efforts exist, such as OSWorld's Docker-based environments and Meta's OpenApps.
However, our objectives differ. Compared with OSWorld, we focus on lighter-weight, more customizable building blocks so developers can remix scenarios quickly, vary conditions systematically, and iterate without heavy setup. Compared with OpenApps, we target a broader range of scenarios—covering a wider slice of real enterprise and developer workflows, not just a curated set of apps.
To achieve this, we take a hybrid approach. We build a large set of digital assets that enable controllable scenario generation (e.g., configurable pages, UI states, and task templates), and we also provide sandboxes that support evaluation on real-world software and web apps. This combination lets us test both robustness under controlled perturbations and performance in realistic, end-to-end settings.
How OpenAGI thinks about reliable scoring and diagnosis of failure modes
With atomic behaviors as the unit of evaluation, reliable scoring depends first on the reliability of the environment itself. We therefore use customized web apps and software—digital assets built specifically for evaluation—as the primary execution substrate. These assets are designed to mimic real-world appearance and functionality, but they also expose dedicated interfaces that log agent actions and export task outcomes in a structured form. That makes it possible to compare executions against verifiable signals (state diffs, emitted events, completion flags, extracted fields) rather than relying on brittle screen-level heuristics.
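Scoring against a state diff rather than screen-level heuristics can be sketched as follows (our own illustration; the field names and state shape are hypothetical, not OpenAGI's actual logging interface):

```python
def verify_by_state_diff(state_before, state_after, expected_diff):
    """Score a task by comparing the environment's structured state
    before and after the episode against an expected diff, instead of
    parsing screenshots. Returns True iff exactly the expected fields
    changed to exactly the expected values.
    """
    actual_diff = {
        key: state_after.get(key)
        for key in expected_diff
        if state_before.get(key) != state_after.get(key)
    }
    return actual_diff == expected_diff

# Hypothetical form-filling task: the backend exports structured state.
before = {"cart_items": 0, "form_submitted": False, "user": "alice"}
after = {"cart_items": 2, "form_submitted": True, "user": "alice"}
expected = {"cart_items": 2, "form_submitted": True}
ok = verify_by_state_diff(before, after, expected)  # True
```

Because the check reads exported state rather than pixels, it is stable under the appearance and layout perturbations discussed above.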
For tasks that remain inherently open-ended and lack rigorous programmatic signals, we use an LLM-as-judge as a backstop. We do this with per-scenario judging prompts that (1) describe the intended behavior and acceptable variations, (2) specify the rubric for correctness at the individual-sample level, and (3) require the judge to ground its decision in concrete evidence from the trajectory and final state. This combination—verifiable signals where possible, and tightly scoped judging where not—supports both consistent scoring and clearer diagnosis of failure modes.
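The three requirements on judging prompts can be sketched as a prompt template (an illustrative shape, not OpenAGI's actual prompt; all field names are hypothetical):

```python
def build_judge_prompt(scenario, trajectory_summary, final_state):
    """Assemble a per-scenario judging prompt with the three parts the
    post describes: intended behavior, a per-sample rubric, and a
    requirement to cite concrete evidence from the episode.
    """
    return f"""You are grading a single GUI-agent episode.

Intended behavior and acceptable variations:
{scenario['intended_behavior']}

Rubric (apply to this sample only):
{scenario['rubric']}

Trajectory summary:
{trajectory_summary}

Final environment state:
{final_state}

Answer PASS or FAIL, and quote the specific actions or state fields
from the trajectory and final state that justify your decision."""
```

Scoping the rubric per scenario, rather than reusing one generic grading prompt, is what keeps judge decisions consistent enough to aggregate.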
What’s next for agentic evals?
We believe agentic evaluation will evolve into an API wrapper for the real world. This claim has three implications.
Unified API for evaluation and training. The environment, task, and judge inside the benchmark are not evaluation-only artifacts; they are coherent components of the training framework. Executing the agent in our evaluation framework is the same procedure as collecting reward signals during RL training. Future evaluation will therefore become a unified API of the real world serving both training and evaluation. In our model research, we have integrated the digital assets into our training framework to shape the model's atomic behaviors.
Scalable API for real-world mapping. To serve both training and evaluation, we are building the evaluation benchmark as a scalable API that covers diverse use cases rather than taking the shortcut of using the real world directly for evaluation.
Value-aware API for agentic learning. The evaluation itself can provide direct learning signals, i.e., guidance toward the value function of a model. This becomes more important as we require agents to conduct long-horizon tasks: the evaluation environment should evolve to provide reliable process reward signals that hint at the quality—or danger—of actions.
That is, the evaluation environment (the API of the real world, in our language) should provide a value estimate of how good or bad past actions and current progress are, to guide efficient learning. Such a shift moves beyond the existing GRPO paradigm (a reinforcement learning algorithm that enhances mathematical reasoning in language models), which focuses on outcome reward, towards the principled idea of temporal difference (TD), a model-free reinforcement learning method.
Imagine a human learning to drive: we don’t need to actually hit another vehicle (outcome reward) to learn the skill. We receive a strong negative signal—fear—when we get close to danger. Similarly, the “feeling” of risk at dangerous behaviors, or the “feeling” of progress in long tasks, is how our evaluation will evolve: an intelligent real-world API providing values for learning.
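The contrast the analogy draws—stepwise "feelings" versus a single terminal outcome—is the classic TD(0) update, which propagates value estimates at every step. A minimal sketch (a generic textbook update, not OpenAGI's training code):

```python
def td0_update(values, state, next_state, reward, alpha=0.1, gamma=0.99):
    """One TD(0) update: move V(state) toward the bootstrapped target
    reward + gamma * V(next_state). The agent receives a learning signal
    at every step, not only when the episode's outcome is observed.
    """
    target = reward + gamma * values.get(next_state, 0.0)
    td_error = target - values.get(state, 0.0)
    values[state] = values.get(state, 0.0) + alpha * td_error
    return td_error

# Toy trajectory: a risky intermediate step receives a negative process
# reward long before any final outcome exists.
values = {}
td0_update(values, "near_danger", "crash", reward=-1.0)
# values["near_danger"] has already moved toward -1.0 (to -0.1 here),
# without waiting for an outcome reward.
```

An outcome-only scheme like GRPO would leave `values["near_danger"]` untouched until the episode ended; the process-reward view updates it the moment danger is sensed.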
If today’s agent benchmarks are mostly “run the task and report success rate,” the next generation will look more like an API wrapper for the real world: a programmable interface that reproduces the incentives and constraints agents face in production, yields reliable, machine-checkable signals, and scales to the diversity and volume required for both evaluation and learning.