CURRENT | Practitioner's Corner

Builder Profile

APIs for Websites That Don't Have APIs

By Rina Takahashi, June 11, 2026

Suchintan Singh describes Skyvern as building "APIs for websites that don't have APIs." It sounds like a pitch-deck line. But the phrase does real work if you've spent years watching ML platforms get measured by marketplace revenue rather than model accuracy, and watching Selenium scripts break every time a frontend developer renames a CSS class. His team built a harder benchmark, scored below 50% on it, and published the results anyway. The architecture and the measurement both point in the same direction: web agents are an infrastructure problem that the AI framing keeps obscuring.

Enforcement Gap

The Wall That Used to Be There

By Rina Takahashi, June 11, 2026

An agent running on an 8K-context model that entered a loop used to crash. The window filled, the API errored, the run died. That was an accidental safety mechanism, and it worked. Context windows now stretch past a million tokens, and the accidental wall is gone. A recent multi-agent loop ran for eleven days and $47,000 before a human noticed the billing dashboard. The team had monitoring. They could see it happening. They couldn't stop it programmatically. Watching and stopping, it turns out, require entirely different infrastructure.
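The difference between watching and stopping can be sketched as a hard limit that the agent loop itself has to consult before every step. This is a minimal illustration, not the setup of the team described above: `BudgetGuard`, `BudgetExceeded`, `run_agent`, and the `step_fn` contract (returning a cost in dollars and a done flag) are all hypothetical names chosen for the example.

```python
from typing import Callable, Tuple


class BudgetExceeded(RuntimeError):
    """Raised when a run hits its hard spending or step limit."""


class BudgetGuard:
    """Hypothetical enforcement point: it can refuse work, not just report it."""

    def __init__(self, max_usd: float, max_steps: int) -> None:
        self.max_usd = max_usd
        self.max_steps = max_steps
        self.spent_usd = 0.0
        self.steps = 0

    def authorize(self) -> None:
        # Called before each step; raising here is what actually stops the loop.
        if self.spent_usd >= self.max_usd:
            raise BudgetExceeded(f"spend limit reached: ${self.spent_usd:.2f}")
        if self.steps >= self.max_steps:
            raise BudgetExceeded(f"step limit reached: {self.steps}")

    def record(self, cost_usd: float) -> None:
        # Called after each step; this half alone is only monitoring.
        self.spent_usd += cost_usd
        self.steps += 1


def run_agent(step_fn: Callable[[], Tuple[float, bool]], guard: BudgetGuard) -> None:
    """Run step_fn until it reports done or the guard refuses another step."""
    while True:
        guard.authorize()
        cost_usd, done = step_fn()
        guard.record(cost_usd)
        if done:
            return


def looping_step() -> Tuple[float, bool]:
    # Stand-in for an agent stuck in a loop: costs $1 per step, never finishes.
    return 1.0, False


guard = BudgetGuard(max_usd=5.0, max_steps=1000)
try:
    run_agent(looping_step, guard)
except BudgetExceeded as exc:
    print(f"stopped after {guard.steps} steps: {exc}")
```

In this sketch, `record` is the monitoring half and `authorize` is the enforcement half. A dashboard built only on the first can show a runaway loop but has no way to end it.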

Enforcement Design

The Crash That Stopped Coming and the Engineer Who Had to Replace It

Compound Failure Math

The Compound Failure Rate: Why 85% Per-Step Accuracy Means 80% Workflow Failure

P(success) = a^m, where a is per-step accuracy and m is the number of sequential steps. At 85% accuracy across 10 steps, you land at 19.7% end-to-end success. Bump to 95% per step and a 10-step workflow still only reaches about 60%.

That's the generous version. The formula assumes independent errors. In practice, a botched early step poisons downstream context, so failures correlate. The math gives you a floor on the failure rate, not a ceiling. And most production workflows run longer than ten steps.
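The arithmetic is easy to check directly. The sketch below assumes the same independence the formula does. The function names are illustrative, and `max_steps_for_target` is an added helper that inverts the formula to ask how long a chain can get before success drops below a chosen target.

```python
import math


def workflow_success(step_accuracy: float, steps: int) -> float:
    """P(success) = a^m, assuming every step fails independently."""
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return step_accuracy ** steps


def max_steps_for_target(step_accuracy: float, target: float) -> int:
    """Largest m such that a^m is still at least the target success rate."""
    if not 0.0 < step_accuracy < 1.0 or not 0.0 < target <= 1.0:
        raise ValueError("need 0 < step_accuracy < 1 and 0 < target <= 1")
    return math.floor(math.log(target) / math.log(step_accuracy))


print(f"{workflow_success(0.85, 10):.1%}")   # 19.7%
print(f"{workflow_success(0.95, 10):.1%}")   # 59.9%
print(max_steps_for_target(0.95, 0.5))       # 13
```

At 95% per step, the chain drops below even odds after 13 steps, which is why chain length matters as much as model quality.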

TAKE NOTE

Pass^k reality: Tau-bench's pass^k metric asks whether an agent succeeds on every attempt across k runs. GPT-4o fell below 25% at pass^8 on retail tasks.

Even 99%: A 100-step workflow at 99% per-step accuracy succeeds 36.6% of the time. Exceptional components still produce unreliable systems.

Shorter chains: Eliminating steps can improve reliability faster than improving models, because every step removed lowers the exponent. How a workflow is decomposed is an engineering decision, and its payoff compounds.

Deterministic routing: Push exact operations like currency conversion and database lookups into typed code. Reserve model autonomy for genuinely ambiguous tasks.

Right metric: Track workflow completion rate, not step accuracy. Below 50% end-to-end success, irreversible tasks need human checkpoints before execution.
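The last point can be sketched from run logs. The example assumes each run is recorded as a list of per-step pass/fail flags. The data is made up for illustration, and `needs_human_checkpoint` is a hypothetical gate that applies the 50% threshold mentioned above.

```python
from typing import Sequence


def step_accuracy(runs: Sequence[Sequence[bool]]) -> float:
    """Fraction of individual steps that succeeded, pooled across all runs."""
    total_steps = sum(len(run) for run in runs)
    return sum(sum(run) for run in runs) / total_steps


def completion_rate(runs: Sequence[Sequence[bool]]) -> float:
    """Fraction of runs in which every step succeeded."""
    return sum(all(run) for run in runs) / len(runs)


def needs_human_checkpoint(rate: float, irreversible: bool,
                           threshold: float = 0.5) -> bool:
    """Hypothetical gate: irreversible tasks below the threshold need sign-off."""
    return irreversible and rate < threshold


# Illustrative data: four 5-step runs, three of which fail at exactly one step.
runs = [
    [True, True, True, True, True],
    [True, True, False, True, True],
    [True, False, True, True, True],
    [True, True, True, True, False],
]

print(step_accuracy(runs))     # 0.85
print(completion_rate(runs))   # 0.25
print(needs_human_checkpoint(completion_rate(runs), irreversible=True))  # True
```

The same logs give 85% step accuracy and 25% workflow completion, and only the second number reflects what a user of the workflow experiences.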