Specification by Default

When Infrastructure Writes the Spec

By Rina Takahashi, May 13, 2026

A support platform prices its AI agents per automated resolution: no escalation, and no ticket reopened within 72 hours. It is a clean metric, and one where a customer who gives up counts the same as one who's satisfied. Most organizations deploying agents haven't specified what "good" means in operational terms. Infrastructure is filling that vacuum: pricing models, protocol defaults, and eval frameworks each encode their own answer. The specification decision has already happened, whether anyone noticed or not.

Ecosystem Scan

Five developments from the past month, each moving at its own pace, each pushing the agent ecosystem toward a specificity it has been happy to avoid.

MCP's download numbers look like infrastructure that won. The governance structures don't. OpenAI ships sandboxing that concedes what production actually requires. The EU's enforcement calendar keeps bending under political gravity. Pricing migrates toward outcomes, which only works when you can count them. And prompt injection has crossed from research papers into documented field operations.

What connects them: somebody, somewhere, is being forced to write down a definition they'd been comfortable leaving vague.

Protocol Scale

MCP Hits 97 Million Downloads, Governance Lags Behind

MCP crossed 97 million monthly SDK downloads by March. 78% of enterprise AI teams report at least one MCP-backed agent in production. The protocol now sits under Linux Foundation governance, with OpenAI, Google, and Microsoft co-sponsoring, and its 2026 roadmap puts stateless HTTP transport and a Tasks primitive under review. Neither has shipped. No Enterprise Working Group exists. Server counts range from 5,800 to 17,468 depending on who's counting.

Agent Sandboxing

OpenAI Agents SDK Ships Native Sandbox Environments

OpenAI's April 15 update adds sandbox environments where agents can inspect files, run commands, edit code, and handle long-horizon tasks inside controlled, siloed workspaces with scoped access. It targets the blast-radius problem directly, one of the core reasons production agents fail. TypeScript support is still pending.

AI Regulation

EU AI Act Timeline Shifts Under Provisional Deal

August 2, 2026 remains the legally binding enforcement date for EU AI Act transparency rules. A May 7 provisional agreement between Council and Parliament would defer national regulatory sandboxes to August 2027. The Digital Omnibus package may push standalone high-risk system deadlines to December 2027. The provisional agreement still requires formal adoption.

Pricing Models

Outcome Pricing Spreads as Seat-Based Models Shrink

Zendesk charges $1.50 per automated resolution on committed volume, $2.00 pay-as-you-go. In January 2026, a billing change introduced automatic overage charges with no prior notification. Across the industry, seat-based pricing fell from 21% to 15% of SaaS companies in twelve months while hybrid models surged from 27% to 41%. Outcome pricing presupposes tasks that are measurable and bounded.
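The arithmetic is easier to argue about when it is written down. The sketch below is illustrative only: `Ticket`, its fields, and `is_automated_resolution` are hypothetical names, not any vendor's API. It applies the resolution definition quoted at the top of this issue (no escalation, no reopen within 72 hours) to the two per-resolution rates cited above, and it assumes a reopen at exactly 72 hours still counts as inside the window.

```python
from dataclasses import dataclass
from typing import Optional

REOPEN_WINDOW_HOURS = 72   # window from the resolution definition in the intro
COMMITTED_RATE = 1.50      # USD per automated resolution, committed volume
PAYG_RATE = 2.00           # USD per automated resolution, pay-as-you-go


@dataclass
class Ticket:
    escalated: bool
    reopened_after_hours: Optional[float]  # None means the ticket was never reopened
    customer_satisfied: bool               # hypothetical field; the metric never reads it


def is_automated_resolution(t: Ticket) -> bool:
    """Billable if the ticket was not escalated and not reopened inside the window."""
    if t.escalated:
        return False
    reopened_in_window = (
        t.reopened_after_hours is not None
        and t.reopened_after_hours <= REOPEN_WINDOW_HOURS
    )
    return not reopened_in_window


def bill(tickets, rate):
    """Total charge in USD at a given per-resolution rate."""
    return rate * sum(is_automated_resolution(t) for t in tickets)


tickets = [
    Ticket(False, None, True),    # solved, customer satisfied
    Ticket(False, None, False),   # customer gave up: still billable
    Ticket(True, None, True),     # escalated to a human: not billable
    Ticket(False, 10.0, False),   # reopened inside the window: not billable
    Ticket(False, 100.0, True),   # reopened after the window: billable
]

print(bill(tickets, COMMITTED_RATE), bill(tickets, PAYG_RATE))
```

Three of the five tickets are billable under this definition, and `customer_satisfied` plays no part in deciding which. That is the gap the metric leaves open.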

Agent Security

Indirect Prompt Injection Now Documented in Production

Google and Forcepoint documented production attacks: hidden instructions embedded in web pages, documents, and emails that agents process and execute. The AgentDojo benchmark, used by US and UK AI Safety Institutes, found that effective defenses significantly degrade agent utility. Kiteworks 2026 data shows 41–44% of organizations lack human-in-the-loop oversight, and 55–63% have no kill switches, purpose binding, or network isolation.

Research Radar

Towards a Science of AI Agent Reliability

Decomposes reliability into consistency, robustness, predictability, and safety across 14 models over 18 months. Reliability gains lag capability progress throughout. The paper amounts to specification work: defining what "reliable enough" demands beyond raw accuracy.

What can accuracy actually tell you?

It can't distinguish an agent that fails predictably from one that fails randomly, or benign mistakes from catastrophic ones.
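A toy example makes the point concrete. The two agents and their run logs below are invented for illustration, and `consistency` is one simple way to operationalize the idea (the share of tasks whose repeated runs all agree), not the paper's own definition.

```python
def accuracy(runs):
    """Mean success rate over every task/run pair. `runs` maps task -> list of bools."""
    outcomes = [ok for results in runs.values() for ok in results]
    return sum(outcomes) / len(outcomes)


def consistency(runs):
    """Fraction of tasks whose repeated runs all agree (all pass or all fail)."""
    agreeing = sum(1 for results in runs.values() if len(set(results)) == 1)
    return agreeing / len(runs)


# Hypothetical agent A: always fails the same task, never any other.
predictable = {
    "t1": [True, True, True, True],
    "t2": [True, True, True, True],
    "t3": [True, True, True, True],
    "t4": [False, False, False, False],
}

# Hypothetical agent B: fails one run in four on every task.
erratic = {
    "t1": [False, True, True, True],
    "t2": [True, False, True, True],
    "t3": [True, True, False, True],
    "t4": [True, True, True, False],
}

for name, runs in [("predictable", predictable), ("erratic", erratic)]:
    print(name, accuracy(runs), consistency(runs))
```

Both agents score 0.75 on accuracy. On consistency the first scores 1.0 and the second 0.0: you can route around a known bad task, but not around a failure that can land anywhere.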

How bad do failures get?

Replit's AI assistant deleted a production database; OpenAI's Operator made an unauthorized $31.43 Instacart purchase. Accuracy captured neither risk.

Beyond Accuracy: A Multi-Dimensional Framework for Evaluating Enterprise Agentic AI Systems

Analyzed 12 benchmarks and found enterprise-critical dimensions going unmeasured: 50x cost variations, consistency drops from 60% to 25%, a 37% lab-to-production gap. The resulting CLEAR framework is, in effect, the specification document most deploying organizations still lack.

Who's overpaying for accuracy?

Agents with the highest raw scores cost 4.4–10.8x more than Pareto-efficient alternatives delivering comparable real-world performance.
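For readers who haven't worked with Pareto efficiency, here is a minimal sketch on two axes, cost and score. The agents and every number in it are made up for illustration and are not taken from the paper, which may use different axes.

```python
from typing import NamedTuple


class Agent(NamedTuple):
    name: str
    cost_usd: float   # hypothetical cost per task
    score: float      # hypothetical benchmark score


def dominates(a: Agent, b: Agent) -> bool:
    """a dominates b if it is no worse on both axes and strictly better on one."""
    no_worse = a.cost_usd <= b.cost_usd and a.score >= b.score
    strictly_better = a.cost_usd < b.cost_usd or a.score > b.score
    return no_worse and strictly_better


def pareto_efficient(agents):
    """Keep only agents that no other agent dominates."""
    return [a for a in agents if not any(dominates(o, a) for o in agents)]


agents = [
    Agent("small", 0.10, 0.61),
    Agent("medium", 0.40, 0.72),
    Agent("medium-verbose", 0.90, 0.70),  # costs more and scores less than "medium"
    Agent("frontier", 4.00, 0.74),
]

print([a.name for a in pareto_efficient(agents)])
```

In a two-axis view like this one, the top scorer always sits on the frontier, so frontier membership alone settles little. The practical question is whether two extra points of score justify a tenfold cost multiple over `medium`.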

How broken are current benchmarks?

Do-nothing agents pass 38% of a leading benchmark's tasks, meaning evaluations can't reliably tell working agents from idle ones.
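A cheap sanity check follows from this: run a no-op baseline through your own eval suite and see what it passes. The tasks and checkers below are hypothetical, chosen to show how a check phrased as "nothing bad happened" can pass vacuously.

```python
def noop_agent(state):
    """Baseline that returns a copy of the environment state without acting on it."""
    return dict(state)


# Hypothetical tasks: (name, initial state, checker on the final state).
tasks = [
    ("create_report", {"files": []}, lambda s: "report.txt" in s["files"]),
    ("leave_no_temp_files", {"files": []}, lambda s: "tmp" not in s["files"]),
    ("keep_config_intact", {"files": ["config"]}, lambda s: "config" in s["files"]),
    ("archive_logs", {"files": ["log"]}, lambda s: "log.gz" in s["files"]),
]


def vacuous_tasks(tasks, agent):
    """Names of tasks that the given baseline agent passes."""
    return [name for name, state, check in tasks if check(agent(state))]


passed = vacuous_tasks(tasks, noop_agent)
print(passed, len(passed) / len(tasks))
```

Any task in the returned list says nothing about whether an agent did the work. It is a candidate for a stricter checker, or for removal from the headline score.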

Regulation as Specification

When Regulation Does the Specification Work

Organizations have been rationally avoiding the hardest upstream work in AI deployment: articulating what their systems actually do, what they optimize for, and why those choices are defensible. That avoidance is political as much as practical. Writing the spec forces disagreements into the open.

The EU AI Act's documentation requirements read like the design specification most teams never wrote. Intended purpose, trade-off rationale, metric justification. A regulator, in effect, mandating the homework the ecosystem kept deferring.

The GDPR parallel is instructive. Before 2018, most organizations couldn't inventory their own data. The deadline created the discipline. The AI Act forces something harder: not just knowing what you have, but explaining why it's defensible. Whether that discipline outlasts the compliance scramble will determine if this is a one-time audit or a genuine shift in how systems get built.

Procurement gate: Explainability is becoming a third evaluation axis alongside speed and accuracy, quietly disqualifying vendors who can't evidence decision logic

Shadow exposure: Most compliance gaps surface during discovery, meaning organizations can't document agent systems they don't know they're running

Retrofit penalty: Post-deployment remediation costs dwarf the weeks of pre-development specification workshops that would have prevented them

Metric justification: Regulators now require explaining why chosen performance metrics are appropriate, not just reporting favorable scores

Timing mismatch: Enterprise procurement cycles run over a year, so organizations starting compliance work now are remediating, not preparing
