What WorkArena's Task Design Makes Visible

One of the 33 benchmark tasks in WorkArena asks an agent to create a hardware asset inside ServiceNow. Fill in a model category, a vendor, an installed date, a serial number. The agent could navigate the auto-completion boxes and date pickers, click submit, and arrive at a confirmation screen. WorkArena ignores the confirmation screen entirely. Its validator queries the database, retrieves the record the agent created, and checks whether every saved field value matches the specification. A wrong value in a saved record is a terminal failure, because incorrect information has been pushed into a live system. An agent that reached the confirmation page with corrupted data underneath would look like a success to any observer watching the screen.
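The difference between checking the screen and checking the saved record can be sketched in a few lines. This is a minimal illustration, not WorkArena's actual validator. The in-memory `FAKE_TABLE`, the `validate_record` helper, the field names, and every value are hypothetical stand-ins for a real database query.

```python
from typing import Dict, List

# Hypothetical stand-in for the platform database: record id -> saved field values.
# A real validator would fetch this record from the live instance instead.
FAKE_TABLE: Dict[str, Dict[str, str]] = {
    "rec-1": {
        "model_category": "Computer",
        "vendor": "Example Vendor",
        "install_date": "2024-01-15",
        "serial_number": "SN-0042",
    }
}


def validate_record(record_id: str, expected: Dict[str, str]) -> List[str]:
    """Compare every expected field against the saved record.

    Returns a list of mismatch descriptions; an empty list means success.
    What the agent saw on screen plays no part in the result.
    """
    record = FAKE_TABLE.get(record_id)
    if record is None:
        return [f"record {record_id!r} was never saved"]
    errors = []
    for field, want in expected.items():
        got = record.get(field)
        if got != want:
            errors.append(f"{field}: expected {want!r}, saved {got!r}")
    return errors
```

Under these assumptions, an agent that reached the confirmation page but saved the wrong serial number would still produce a non-empty error list, which is the point: the saved state decides the outcome.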

That detail, tucked inside the benchmark's validation code, says more about enterprise agent evaluation than any benchmark score. As Head of Frontier AI Research at ServiceNow, Alexandre Drouin leads a team studying what he describes as barriers to adoption of enterprise AI capabilities, including trustworthiness, security, and "reliable measurement through rigorous benchmarking." Being inside the platform company gave his team something outside researchers don't have: direct access to the business logic that enterprise software encodes implicitly through form layouts, field dependencies, role permissions, and workflow routing. WorkArena, presented at ICML 2024, runs on a remote-hosted ServiceNow instance. Its tasks cover filtering lists, filling forms, ordering from service catalogs, searching knowledge bases, reading dashboards. To measure success at any of them, the team had to formalize something enterprises have always left implicit.

Compare WebArena, the ICLR 2024 benchmark that evaluates agents across self-hosted websites for e-commerce, forums, collaborative software development, and content management. WebArena's evaluation is genuinely capable. Its programmatic functions inspect databases, call APIs, check page content. For a task like "check out merge requests assigned to me," it verifies the resulting URL. This works because the success criteria are legible from the web layer. The sites were built for the benchmark. The rules are the sites.

WorkArena operates somewhere structurally different. When a service-catalog task asks an agent to order an iPad Pro with a specific color and storage configuration, the validator fetches the resulting request record, verifies that exactly one item type was ordered, and compares the item description, quantity, and configuration options against spec. The benchmark disables the "Add to Cart" button because shopping carts are shared between sessions, a detail that only matters when your evaluation environment is a real enterprise platform with persistent shared state, not a sandbox built for testing.

To build these validators, Drouin's team had to translate implicit business logic into explicit, machine-checkable rules. In practice, a field labeled "Priority" in a list view might display "High" while the underlying database stores a numeric code. A "Caller" field shows a name but references a user record in another table. The list-filtering validators resolve these reference fields and choice values through ServiceNow's table metadata before comparing results. The impersonation task queries the system's user object to verify the agent is logged in as the right person. Each validator amounts to a formal specification of what "correct" means for that particular enterprise operation.
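As a rough sketch of that resolution step: the task specification speaks in display labels, the database row holds stored codes and record ids, and the validator has to translate one into the other before comparing. The mapping tables, the helper names, and the specific codes and ids below are invented for illustration. A real validator would read them from the platform's table metadata.

```python
from typing import Dict

# Hypothetical metadata: for each choice field, display label -> stored code.
CHOICE_VALUES: Dict[str, Dict[str, str]] = {
    "priority": {"High": "2", "Low": "4"},
}

# Hypothetical reference data: for each reference field, display name -> record id.
REFERENCE_IDS: Dict[str, Dict[str, str]] = {
    "caller": {"Jane Doe": "user-001"},
}


def to_stored_value(field: str, display: str) -> str:
    """Translate what a list view shows into what the database stores."""
    if field in CHOICE_VALUES:
        return CHOICE_VALUES[field][display]
    if field in REFERENCE_IDS:
        return REFERENCE_IDS[field][display]
    return display  # plain fields are stored as displayed


def row_matches(row: Dict[str, str], display_filter: Dict[str, str]) -> bool:
    """Check a stored row against a filter written in display terms."""
    return all(
        row.get(field) == to_stored_value(field, display)
        for field, display in display_filter.items()
    )
```

Comparing the string "High" directly against a stored code would fail every time, even for a correct agent, which is why the translation has to happen inside the validator.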

WorkArena++ pushes this further by composing atomic tasks into workflows. An offboarding task creates a user and laptop asset during setup, then asks the agent to find the user's laptop, unassign the hardware, and delete the user according to a protocol stored in the knowledge base. Success depends on the agent retrieving procedural rules from one part of the system and executing them correctly across several others.
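One way to picture a composed validator is as a list of named checks that must all hold against the final system state. The sketch below is an assumption about shape only, not WorkArena++'s implementation. The `State` layout, the record ids, and the check functions are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, dict]  # hypothetical snapshot: table name -> records by id
Check = Tuple[str, Callable[[State], bool]]


def laptop_unassigned(state: State) -> bool:
    """The laptop created during setup no longer points at any user."""
    return state["assets"]["laptop-1"]["assigned_to"] is None


def user_deleted(state: State) -> bool:
    """The user created during setup is gone from the user table."""
    return "user-1" not in state["users"]


OFFBOARDING_CHECKS: List[Check] = [
    ("laptop unassigned", laptop_unassigned),
    ("user deleted", user_deleted),
]


def validate_workflow(state: State, checks: List[Check]) -> List[str]:
    """Return the names of the checks that failed; empty means success."""
    return [name for name, check in checks if not check(state)]
```

In this sketch an agent that deletes the user but leaves the laptop assigned fails the workflow, even though one of its two steps was carried out correctly.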

The underlying challenge

Enterprise systems encode their rules in layers no screen fully represents. Roles determine what's visible. Prior records constrain what's possible. Business logic governs what counts as correct.

Anyone building evaluation for enterprise agents will eventually face what Drouin's team formalized: you cannot measure whether an agent did the right thing without writing down, in machine-checkable form, what the right thing actually is. Enterprises have documented business logic for decades in process manuals and compliance frameworks. Writing it as machine-verifiable success criteria for autonomous agents is a different problem, and one that barely existed before agents started operating inside these systems.

Things to follow up on...

Pass-once vs. pass-always: Tau-bench's pass^k metric measures whether agents succeed reliably across repeated trials, and its ICLR 2025 paper reports that GPT-4o's retail pass^8 drops below 25% even when single-attempt success exceeds 50%.
Workplace beyond browsing: TheAgentCompany extends the evaluation surface to coding, running programs, and communicating with simulated coworkers inside a software company, and its NeurIPS 2025 paper reports the most competitive agent completed only 30% of tasks autonomously.
Benchmark fragmentation problem: The BrowserGym Ecosystem paper, which Drouin coauthored, frames the project as a response to inconsistent evaluation methodologies across web-agent benchmarks that make cross-benchmark comparison unreliable.
Safety as task performance: Drouin's recent talk on "Full Stack Benchmarking for Knowledge Work" argues that benchmarks for agentic safety and security cannot be meaningfully separated from primary task performance, a claim worth watching as enterprise deployments scale.