Suchintan Singh describes Skyvern, the company he co-founded in 2023, as building "APIs for websites that don't have APIs." It sounds like a pitch-deck line. But the phrase does real work if you listen to what it implies. It points at an infrastructure gap in the web itself. The browser is the seam where that gap shows, and it makes a terrible one.
Singh spent years on the other side of that seam. At Faire and Gopuff, he built ML infrastructure for marketplace search and discovery. The systems he was responsible for "generated over $100M of GMV." That's how he describes the work. Revenue, not model accuracy or F1 scores. When your ML platform gets measured by whether the marketplace made money, you develop a specific allergy to systems that work in evaluation and break in production.
Skyvern is his third startup. The first two are worth noting mostly as a narrowing lens. An engineer onboarding tool. Then Wyvern, an ML platform for marketplace search that reached $50K ARR before he pivoted away from it. He had a working product in his exact domain and walked away. What he'd seen across two employers and two startups of his own was that the hard problem kept landing in the same place: making systems reliable against surfaces that weren't designed for them. Government portals, insurance dashboards, vendor billing systems. Legacy enterprise software whose only interface is a browser window, and whose DOM structure changes whenever a frontend developer renames a CSS class.
His team calls this the "maintenance tax": Selenium scripts that work Tuesday, break Wednesday, get patched Thursday. "Whack-a-mole with the DOM." The phrase frames browser automation as an ongoing operational cost, one that compounds. That framing only comes from someone who has watched it compound.
The architectural response is consistent with the diagnosis. Skyvern's compile-to-code approach lets the AI figure out a workflow once, then compiles the successful path into a Playwright script. If the site changes and the script breaks, the AI re-engages, heals the path, and recompiles. Average run cost drops from $0.11 to $0.04. Run time drops from 279 seconds to 120. But the result Singh's team highlights is neither of those figures. It's that "runs are now deterministic." Deterministic. That's the vocabulary of someone who has lived with non-deterministic systems in production and knows what they cost.
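The shape of that loop is easy to sketch, even if the real thing is not. What follows is a minimal illustration in Python, not Skyvern's implementation: the `SelfHealingWorkflow` class, the `ScriptBroken` exception, and the `explore` callback are all hypothetical names, and a plain Python function stands in for the compiled Playwright script. The point is only the control flow: run the cached script, and call the expensive agent again only when the script fails.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class ScriptBroken(Exception):
    """Hypothetical signal that a compiled script no longer matches the site."""


@dataclass
class SelfHealingWorkflow:
    # Hypothetical stand-in for the AI agent: it explores the site once and
    # returns a deterministic, replayable script (here, a plain callable).
    explore: Callable[[], Callable[[], str]]
    compiled: Optional[Callable[[], str]] = None
    agent_runs: int = 0  # how often the expensive agent path was needed

    def _recompile(self) -> None:
        self.agent_runs += 1
        self.compiled = self.explore()

    def run(self) -> str:
        if self.compiled is None:
            self._recompile()          # first run: the agent finds the path
        try:
            return self.compiled()     # normal case: cheap deterministic replay
        except ScriptBroken:
            self._recompile()          # site changed: the agent heals the path
            return self.compiled()     # a second failure propagates to the caller


# Toy "website" whose submit button gets renamed by a frontend change.
site = {"submit_selector": "#submit"}


def explore() -> Callable[[], str]:
    selector = site["submit_selector"]  # the agent discovers the current selector

    def script() -> str:
        if site["submit_selector"] != selector:
            raise ScriptBroken(selector)
        return f"clicked {selector}"

    return script


wf = SelfHealingWorkflow(explore=explore)
results = [wf.run(), wf.run()]          # one agent run, then a cached replay
site["submit_selector"] = "#submit-v2"  # the site changes under the script
results.append(wf.run())                # script breaks, agent heals, replay succeeds
print(results, wf.agent_runs)
```

Three runs, two agent invocations: one to learn the path and one to repair it. Every other run is a replay of the same script, which is where a claim like "deterministic" would come from in a design of this kind.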
The same instinct shows up in measurement. Frustrated with what he called "vibes-based" evaluation, his team built Web Bench: 5,750 tasks across 452 websites, weighted toward write-heavy actions like authentication and form filling. The standard benchmark, WebVoyager, covers 643 tasks on 15 sites, mostly reads. Web Bench includes the adversarial conditions real automation faces. The results were unflattering to everyone, Skyvern included. Best fully automated agent on non-read tasks: 46.6%. Publishing that number is a choice. Building a harder test and reporting honestly when your own system scores below half suggests someone who believes accurate measurement serves you better than favorable measurement.
A recent LinkedIn post captures the philosophy in miniature. Singh wrote that the best "prompt engineering" he did all year wasn't prompt engineering. He gave a support agent read access to Skyvern's own database. It stopped guessing at customer errors and started diagnosing them from actual run data.
"None of the prompt optimization compares to letting the agent see the actual data."
The bottleneck was what the model could see. Everything else followed from that. An infrastructure observation through and through.
Things to follow up on...
- Write tasks remain hard: Web Bench's results on non-read tasks show that authentication, form filling, and file downloads are where every browser agent still struggles, because write-heavy workflows touch site-specific logic that changes frequently.
- Checkpoints aren't durability: Diagrid's analysis of why agent framework checkpointing falls short of true durable execution maps directly onto the reliability gap Singh is building against.
- Retry safety for agents: The idempotency problem Singh's compile-to-code approach sidesteps is explored in detail in this breakdown of why HTTP retry patterns don't translate to agent tool calls.
- The eval gap in production: LangChain's State of Agent Engineering survey found that 89% of teams running agents have observability but only 52% have implemented evals, reinforcing Singh's frustration with "vibes-based" measurement.

