TinyFish | Market Pulse

The Perimeter Problem

The Data Perimeter Split Both Architectures in Half

By Rina Takahashi | May 27, 2026

Last week, Anthropic and OpenAI both responded to the same brute constraint: regulated enterprise data cannot leave the building. Each vendor split the agent to accommodate that boundary. Anthropic moved tool execution inside the customer perimeter but kept the reasoning loop on its own infrastructure. OpenAI announced an on-premises partnership with Dell without specifying where inference actually runs. Neither architecture co-locates governance with the consequential decisions. When something goes wrong, the person who needs to intervene can kill the connection but cannot inspect or redirect the reasoning that led there.

Ecosystem Signals

While most of the industry took the long weekend off, the agent ecosystem kept doing what it's done all quarter: building walls and drawing maps. Standards for what sites expose to agents. Governance planes that catalog what's already running. Coordination protocols for establishing trust across organizational boundaries. Security reports documenting what happens when those boundaries fail.

These efforts look parallel. They aren't. They're all converging on the same structural problem: where does the perimeter sit when the thing inside it is autonomous, distributed, and increasingly worth attacking? The answer is arriving from every direction at once, faster than most organizations can absorb it.

Web Standards

WebMCP Origin Trial Opens in Chrome 149

Sites publish structured tools for agents instead of getting scraped. Booking.com, Expedia, Instacart, Intuit, Shopify, and Redfin are testing it. The catch: Chrome only, Gemini as the sole consuming agent, visible browser tab required. The spec lives in a W3C community group, not on the standards track. Mozilla and Apple remain uncommitted.

Enterprise Governance

Microsoft Ships Agent 365 as Governance Product

Generally available at $15/user/month, Agent 365 offers a control plane spanning Microsoft, AWS, and Google Cloud. Starting in June, Defender maps each agent's reachable devices, MCP servers, identities, and cloud resources. The shadow AI discovery feature detects unauthorized agents already running inside organizations. Governance just became a line item.

Protocol Layer

A2A Hits v1.0 With 150 Organizations Aboard

Now under Linux Foundation governance, A2A introduces signed agent cards for cryptographic identity. A two-layer architecture is hardening: MCP handles vertical tool integration (roughly 97 million monthly SDK downloads, though treat that number directionally), A2A handles horizontal coordination between agents. The stack is starting to look settled.

Market Signal

Gartner Puts Agentic AI at Peak Hype

The first standalone Agentic AI Hype Cycle finds 17% of organizations have deployed agents while 60%+ expect to within two years. That's the most aggressive adoption curve Gartner has tracked. Governance, security, and FinOps each appear as distinct profiles on the cycle. Gartner also predicts over 40% of agentic projects will be canceled by end of 2027.

Threat Landscape

OWASP Confirms Real Exploits Hitting Agent Infrastructure

The Q1 2026 round-up marks a shift from theoretical risk to documented exploitation. Attackers are going after agent identities, orchestration layers, and supply chains. Prompt injection is now a practical data-leakage vector in production. One agent registry was systematically poisoned: five of the seven top-downloaded skills turned out to be malware.

Research Spotlight

Towards a Science of AI Agent Reliability

Princeton researchers propose twelve metrics spanning consistency, robustness, predictability, and safety, drawn from safety-critical engineering. In tests across 14 models and two benchmarks, reliability improvements trail capability gains by factors of two to seven over 18 months of releases.

Why does accuracy keep climbing while reliability stalls?

Structured tasks show moderate gains; open-ended tasks show almost none. Single success metrics hide the gap.

Who's behind this?

The Princeton team running the Science of AI Agent Evaluation program, with a related paper accepted to ICLR 2026.

Consistency Amplifies: How Behavioral Variance Shapes Agent Accuracy

Across 50 SWE-bench runs, 71% of Claude's failures trace to the same wrong interpretation repeated every time. Consistency turns out to be a force multiplier: a model that misreads a task fails more reliably, not less. Retries can't fix a systematic misunderstanding.

What breaks retry-based strategies?

Interpretation accuracy matters more than execution consistency. Running the task again won't help if the reading was wrong from the start.

So consistency is good, right?

Only if the underlying assumption is correct. Otherwise you've just automated the same mistake at higher confidence.
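The retry argument can be made concrete with a little arithmetic. The sketch below is not from the paper: it assumes a simplified model in which a misreading repeats on every attempt while execution failures are independent, and the function name and example probabilities are hypothetical.

```python
def success_with_retries(p_misread: float, p_exec: float, attempts: int) -> float:
    """Probability that at least one of `attempts` tries succeeds.

    Assumed model (illustrative only):
    - p_misread: chance the agent misreads the task; a misreading is
      assumed to repeat identically on every retry.
    - p_exec: chance a single attempt succeeds given a correct reading,
      assumed independent across attempts.
    """
    if not (0.0 <= p_misread <= 1.0 and 0.0 <= p_exec <= 1.0):
        raise ValueError("probabilities must lie in [0, 1]")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    # Retries only help the share of runs that read the task correctly.
    p_any_exec_success = 1.0 - (1.0 - p_exec) ** attempts
    return (1.0 - p_misread) * p_any_exec_success


if __name__ == "__main__":
    for k in (1, 3, 10):
        print(k, round(success_with_retries(0.3, 0.7, k), 4))
```

Under these assumptions the success rate can never exceed `1 - p_misread`, however many retries are bought. Extra attempts wash out the random execution failures and leave the systematic ones untouched.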

When Agents Disagree With Themselves: Measuring Behavioral Consistency in LLM-Based Agents

ReAct-style agents produce two to four distinct action sequences across ten identical runs. Consistent tasks hit 80–92% accuracy; inconsistent ones drop to 25–60%. Setting temperature to zero doesn't fix it. Most divergence happens at step two, the first real decision point.

Where does variance actually start?

69% of trajectory divergence occurs at step two. Early commitments cascade through the rest of execution.

Can't you just set temperature to zero?

No. Behavioral variance persists regardless, suggesting the instability runs deeper than sampling randomness.
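One way to picture the two measurements described here (how many distinct action sequences appear across identical runs, and the step at which they first part ways) is the sketch below. It is not the paper's code: trajectories are assumed to be plain lists of action names, and every name and run in it is hypothetical.

```python
from typing import Optional, Sequence

Trajectory = Sequence[str]  # assumed representation: one action name per step


def distinct_sequences(runs: Sequence[Trajectory]) -> int:
    """Number of different action sequences seen across repeated runs."""
    return len({tuple(run) for run in runs})


def first_divergence_step(runs: Sequence[Trajectory]) -> Optional[int]:
    """1-indexed step where the runs first disagree, or None if all match."""
    if not runs:
        return None
    longest = max(len(run) for run in runs)
    for i in range(longest):
        # A run that has already ended counts as differing from one that continues.
        actions_at_step = {run[i] if i < len(run) else None for run in runs}
        if len(actions_at_step) > 1:
            return i + 1
    return None


# Hypothetical runs of the same task with the same prompt.
runs = [
    ["search", "open_doc", "answer"],
    ["search", "open_doc", "answer"],
    ["search", "search_again", "open_doc", "answer"],
    ["search", "answer"],
]

if __name__ == "__main__":
    print(distinct_sequences(runs), first_divergence_step(runs))
```

In this made-up sample every run opens with the same action and splits at the second step, which is the pattern the paper reports as most common.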

ReliabilityBench

ReliabilityBench quantifies the gap between passing once and passing reliably. An agent hitting 60% on a single attempt may show only 25% consistency across repeated trials. Standard benchmarks systematically flatter what production will punish.

How much do benchmarks overstate readiness?

Pass-at-one scores can run more than double the actual multi-trial consistency rate.

What does this mean for procurement?

A demo that works is not evidence of a system that works repeatedly. The gap is measurable and large.
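The gap between the two numbers is easy to reproduce on toy data. The sketch below is not ReliabilityBench's actual metric code: it assumes "consistency" means the share of tasks that pass on every repeated trial, and the task names and outcomes are invented to mirror the 60% and 25% figures above.

```python
from typing import Dict, List

Results = Dict[str, List[bool]]  # task name -> pass/fail outcome per trial


def _validate(results: Results) -> None:
    if not results or any(len(trials) == 0 for trials in results.values()):
        raise ValueError("need at least one task and one trial per task")


def pass_at_one(results: Results) -> float:
    """Expected single-attempt pass rate: passes over all trials of all tasks."""
    _validate(results)
    total = sum(len(trials) for trials in results.values())
    passed = sum(sum(trials) for trials in results.values())
    return passed / total


def all_trials_pass_rate(results: Results) -> float:
    """Share of tasks that pass on every trial (assumed consistency measure)."""
    _validate(results)
    return sum(all(trials) for trials in results.values()) / len(results)


# Invented outcomes: four tasks, five trials each.
results: Results = {
    "task_a": [True, True, True, True, True],
    "task_b": [True, True, False, True, False],
    "task_c": [True, False, True, False, False],
    "task_d": [False, True, False, False, True],
}

if __name__ == "__main__":
    print(pass_at_one(results), all_trials_pass_rate(results))
```

Twelve of the twenty trials pass, so a single attempt looks like a 60% agent, yet only one task in four passes every time. A leaderboard that reports just the first number hides the second.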

Pricing Shift

Outcome-Based Pricing Is Rising. The Definition of "Outcome" Isn't Keeping Up.

Per-seat pricing dropped from 21% to 15% of SaaS companies in twelve months. Hybrid models surged to 41%. Everyone agrees the old model is dying. Nobody agrees what replaces it.

Intercom charges $0.99 per resolution. HubSpot, since April 2026, undercuts at $0.50. Both say "resolved conversation." Intercom means the customer stopped replying. HubSpot means no human handoff within 72 hours. Same noun, wildly different measurements, entirely different incentive structures.

Pricing has outrun the definitions it depends on.
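To see how far the two readings of "resolved" can drift apart, consider the sketch below. It encodes the definitions only as paraphrased above, not either vendor's actual billing logic; the `Conversation` fields, the sample conversations, and the function names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Conversation:
    conv_id: str
    customer_stopped_replying: bool
    hours_until_human_handoff: Optional[float]  # None means never handed off


def resolved_by_silence(c: Conversation) -> bool:
    """Definition A (as paraphrased): the customer stopped replying."""
    return c.customer_stopped_replying


def resolved_by_no_handoff(c: Conversation, window_hours: float = 72.0) -> bool:
    """Definition B (as paraphrased): no human handoff within the window."""
    return c.hours_until_human_handoff is None or c.hours_until_human_handoff > window_hours


def bill_cents(convs: List[Conversation],
               rule: Callable[[Conversation], bool],
               price_cents: int) -> int:
    """Total charge in cents for conversations the rule counts as resolved."""
    return price_cents * sum(1 for c in convs if rule(c))


# Invented sample: c2 is a customer who gave up and was later rescued by a human;
# c3 is still chatting with the bot and has never been handed off.
convs = [
    Conversation("c1", True, None),
    Conversation("c2", True, 5.0),
    Conversation("c3", False, None),
    Conversation("c4", True, 100.0),
]

if __name__ == "__main__":
    print(bill_cents(convs, resolved_by_silence, 99))
    print(bill_cents(convs, resolved_by_no_handoff, 50))
```

In this sample both rules bill three conversations, but not the same three: the silence rule charges for the customer who gave up, and the handoff rule charges for the conversation that is still going. The per-unit price is the smaller difference between the two invoices.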

TAKE NOTE

Cost fan-out: A single agent action can cascade into model calls, tool retries, and evaluator loops, each carrying different cost curves that resist tidy unit pricing

Billing paradox: Intercom customers report monthly bills jumping from $3,000 to $8,500 as automation improves, effectively penalizing the performance gains they purchased

Revenue volatility: SaaS was built on predictable recurring revenue; outcome pricing introduces the kind of quarterly swings that CFOs and public-market investors instinctively resist

Domain ceiling: Resolution pricing works in support because the unit of work is bounded; complex multi-step agent workflows have no equivalent clean boundary

Sierra's signal: Pure outcome-based pricing from inception, now past $150M ARR, suggesting the model scales where attribution stays unambiguous and contained