Shuchang Zheng and the Architecture That Notices

Skyvern's compile-to-code architecture was built for cost and speed, but its heal step accidentally produces timestamped environmental evidence.

Before co-founding Skyvern, Shuchang Zheng spent more than four years at Lyft building testing infrastructure used by over a thousand engineers, systems that verified whether services behaved correctly under peak load. Then came Patreon, where he helped scale payment processing to 20 million transactions a month. Both jobs share a discipline: knowing when something changed, and whether the change matters.

That discipline shows up in the load-bearing decision behind Skyvern's architecture. Keeping a large language model in the loop for every step of a browser workflow is expensive and nondeterministic. Two runs of the same task can take different paths, cost different amounts, and produce different results. Zheng's team built what they call Route Memorization: let the AI figure out a workflow once, compile the successful path into a deterministic Playwright script, then run that script from then on with zero model cost per execution. A paper studying the broader pattern puts a number on it: compiled workflows break even at roughly 17 runs and reduce token consumption by 57× at a thousand runs.
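The break-even arithmetic can be sketched with a toy cost model. This is not the paper's model: it assumes a single one-time compile cost, zero tokens per compiled run, and no heal costs, and the token figures are hypothetical values chosen only so that break-even lands at 17 runs.

```python
import math


def break_even_runs(compile_tokens: int, tokens_per_interpreted_run: int) -> int:
    """Smallest run count at which compiling once costs no more than
    keeping the model in the loop on every run."""
    return math.ceil(compile_tokens / tokens_per_interpreted_run)


def token_reduction(runs: int, compile_tokens: int, tokens_per_interpreted_run: int) -> float:
    """Ratio of interpreted token spend to compiled token spend after `runs` executions."""
    return (runs * tokens_per_interpreted_run) / compile_tokens


# Hypothetical figures, not taken from the paper or from Skyvern.
COMPILE_TOKENS = 170_000   # one-time cost of discovering and compiling the path
TOKENS_PER_RUN = 10_000    # cost of one fully model-driven run

if __name__ == "__main__":
    print(break_even_runs(COMPILE_TOKENS, TOKENS_PER_RUN))
    print(round(token_reduction(1000, COMPILE_TOKENS, TOKENS_PER_RUN), 1))
```

Under these assumptions the reduction at a thousand runs comes out a little under 59×, near the reported 57× but not identical, since the toy model counts nothing beyond the one-time compile cost.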

The economics are straightforward. The more interesting question is what happens when the compiled path breaks.

Websites change. A form field moves, a login flow adds a step, a dropdown renders differently. The compiled script hits a wall. Skyvern's validator checks whether each action actually worked — did the URL change, did a popup block the click. When it flags a failure, the system re-invokes the LLM to find a new path through the changed layout, then recompiles. The team calls this the heal step. The lineage from Zheng's Lyft work is visible here: a per-action verification loop that catches failures inline rather than discovering them at the end of a run.
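A minimal sketch of that validate-then-heal loop, assuming a toy "page" modeled as a dictionary from selectors to the URL a click leads to. Every name here (`run_with_heal`, `replan`, `RunResult`) is hypothetical and stands in for Skyvern's real components, which the source does not describe at this level; `replan` plays the role of the LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Toy stand-ins, not Skyvern's types.
Page = Dict[str, str]   # selector -> URL reached by clicking it
Step = str              # a compiled step is just the selector to click


@dataclass
class RunResult:
    url: str            # where the run ended up
    heals: int          # how many times the planner was re-invoked
    script: List[Step]  # the script after any recompilation


def run_with_heal(
    script: List[Step],
    page: Page,
    start_url: str,
    replan: Callable[[Page, int], Step],
) -> RunResult:
    """Replay compiled steps, validate each one, and call `replan` only on failure."""
    url, heals, healed = start_url, 0, list(script)
    for i, selector in enumerate(script):
        new_url = page.get(selector)            # "click"; None if the selector is gone
        if new_url is None or new_url == url:   # validator: did the action take effect?
            selector = replan(page, i)          # stand-in for re-invoking the model
            healed[i] = selector                # recompile: patch the stored path
            heals += 1
            new_url = page[selector]
        url = new_url
    return RunResult(url=url, heals=heals, script=healed)
```

The property worth noticing is that `replan` is never called on the happy path: a second run with the healed script completes with zero heals, which is where the cost savings come from.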

The explicit design rationale, from Zheng's YC page:

"Agents shine when things don't go as planned."

The heal step exists for resilience. You want intelligence available for the moments when the world shifts underfoot, without paying for it on every routine execution.

Each heal-step invocation, though, records something specific: a timestamp on an environment event. Something changed enough to break a path that previously worked. The model wasn't running when the break happened, so the failure can't be attributed to model behavior. The record is an environment change signature, produced as a byproduct of an architecture optimized for cost. A world event, captured in a log that was built to track workflow health.
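To make the idea concrete, here is one shape such a record could take. The schema is an assumption for illustration, not Skyvern's actual log format. One detail matters: the heal timestamp marks when the break was detected, so the change itself happened somewhere between the last successful run and that moment, and the sketch keeps both ends of that window.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class HealEvent:
    # Every field is hypothetical; none is drawn from Skyvern's logs.
    detected_at: str      # ISO 8601 UTC time the validator flagged the failure
    last_success_at: str  # ISO 8601 UTC time the compiled path last worked
    site: str             # host whose layout or flow changed
    step_index: int       # which compiled action stopped working
    failed_check: str     # e.g. "url_unchanged" or "element_missing"
    old_target: str       # what the compiled script expected to find
    new_target: str       # what the re-planned path used instead


def record_heal(
    site: str,
    step_index: int,
    failed_check: str,
    old_target: str,
    new_target: str,
    last_success: datetime,
    now: Optional[datetime] = None,
) -> str:
    """Serialize one heal as a JSON line describing the environment, not the model."""
    now = now or datetime.now(timezone.utc)
    event = HealEvent(
        detected_at=now.isoformat(),
        last_success_at=last_success.isoformat(),
        site=site,
        step_index=step_index,
        failed_check=failed_check,
        old_target=old_target,
        new_target=new_target,
    )
    return json.dumps(asdict(event), sort_keys=True)
```

Nothing in the record mentions prompts, tokens, or reasoning traces. It describes what the site looked like before and after, which is what separates it from agent-facing telemetry.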

Most agent observability points at the agent: did the model hallucinate, did the reasoning chain go sideways. A heal-step log answers something different. Did the world change? When, where, how much? Skyvern's Web Bench benchmark, built with Halluminate, spans 5,750 tasks across 452 websites and deliberately emphasizes write-heavy workflows against real anti-bot defenses. These are the conditions where compiled paths break most often, because write-heavy workflows touch site-specific logic that changes frequently. Heal-step patterns across that surface area would map where the web is shifting and how fast.
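As a rough sketch of what "where and how much" could mean in practice, heal events can be reduced to a per-site rate of heals per compiled run. The input shapes and the function name are assumptions for illustration; nothing here reflects how Skyvern or Web Bench stores results.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def heal_rates(
    events: Iterable[Tuple[str, str]],
    runs_per_site: Dict[str, int],
) -> Dict[str, float]:
    """Heals per compiled run for each site.

    `events` holds hypothetical (site, iso_timestamp) pairs, one per heal.
    Sites with no recorded runs are skipped to avoid dividing by zero.
    """
    heals = Counter(site for site, _ in events)
    return {
        site: heals[site] / runs
        for site, runs in runs_per_site.items()
        if runs > 0
    }


def most_volatile(rates: Dict[str, float], top: int = 3) -> List[str]:
    """Sites ordered from highest to lowest heal rate, truncated to `top`."""
    return sorted(rates, key=lambda site: rates[site], reverse=True)[:top]
```

A rate is used rather than a raw count so that a heavily automated site does not look unstable merely because it is run more often.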

If compile-to-code becomes the default approach to browser automation, every system running it generates this kind of environmental evidence as a side effect. The aggregate would constitute a sensing layer that nobody designed, pointed at the environment rather than the agent. Today's observability watches the model and assumes the world is stable. This would watch the world.

Nobody designed the heal step as a sensor. But the byproduct is timestamped evidence about the environment, already pointed at the right layer. Most tools built to monitor agents weren't designed to answer whether the environment was stable. A heal-step log happens to answer exactly that, even though answering that question was never the point.

Whether anyone reads the signal that way is a different problem, and it depends on who's looking at the logs.

Things to follow up on...

Compiled AI economics paper: A recent Stanford-affiliated study formalizes the compile-once-run-many pattern and measures its operational tradeoffs, finding 96% task completion with zero execution tokens on function-calling benchmarks.
NSA flags environment trust: The NSA's May 2026 cybersecurity advisory on MCP notes that idempotency is not enforced by the protocol, leaving environment-level verification to implementations that often overlook it.
Write-heavy benchmark gaps: Skyvern's Web Bench found that all tested agents performed surprisingly poorly on write-heavy tasks like authentication and form filling, a gap documented on their YC launch page that standard read-focused benchmarks don't capture.
Five Eyes on assuming misbehavior: The joint CISA/Five Eyes guidance from May 2026 explicitly recommends organizations assume agentic AI systems may behave unexpectedly and prioritize resilience over efficiency gains.