AI Browser Automation as Code: How to Build Browser Agents That Don't Break

AI browser automation has gone from demo to daily tool in under two years. Point a model at a website, describe the task in plain English, and watch it click, type, and scroll its way to a result. It feels magical the first ten times. Then it breaks: a button moves, a modal pops up, or a page loads half a second slower, and the same agent that booked your flight yesterday confidently checks out the wrong item today. That gap between "works in a demo" and "works on the 500th run" is a major reason browser agents stall before reaching production.

The fix that keeps surfacing is not a smarter model. It is a different shape for the work: browser automation as code. Instead of letting a language model improvise every click in real time, you capture the automation as explicit, typed, replayable steps and reserve the model for the few moments where judgment is actually required. The recent Show HN launch of Intuned, pitched as "reliable browser automations as code," is the news hook here. But the pattern is bigger than any one tool, and it is worth understanding whether you build agents yourself or depend on ones that other people build.

What is AI browser automation, exactly?

AI browser automation is the use of an AI agent to operate a real web browser the way a person would: reading the rendered page, deciding what to do next, and acting through clicks, keystrokes, and navigation. It sits on top of a browser-driving layer, typically a library such as Playwright or Puppeteer, and adds a model that interprets goals and page state instead of relying on a hard-coded script.

There are roughly three styles in the wild:

Pure-LLM clicking. The model sees a screenshot or the DOM on every step and decides the next action live. Maximum flexibility, minimum determinism.
Record-and-replay. A human demonstrates the flow once; the tool replays the recorded selectors. Deterministic until the page changes, then brittle.
Automation as code. The flow is expressed as explicit code — typed steps with assertions — and the model is invoked only where the page is genuinely ambiguous or dynamic.

Most teams start at the top of that list because it demos best, then migrate down as reliability bites.

Why do browser agents break in production?

If you have shipped one, you already know the failure modes. If you are about to, here is what is waiting for you.

Flaky selectors. A pure-LLM agent re-derives "the blue Submit button" from scratch every run. Re-rendered class names, A/B tests, and localization all shift what the model sees, so the same instruction lands on different elements on different days.

Non-determinism. Ask a model to do the same task twice and you can get two different action sequences. That is fine for a one-off, but it makes automations hard to test, diff, or trust. You cannot certify a flow that never runs the same way twice.

Timing and state races. Pages load asynchronously. A human waits for the spinner; a naive agent clicks into the void. Single-page apps that mutate the DOM after "load" make this worse.
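
One common way to avoid clicking into the void is to poll for an explicit readiness condition before acting. The sketch below is a minimal illustration using only the Python standard library. `wait_until` and `FakeSpinnerPage` are hypothetical names invented for this example; a real browser driver would typically supply its own waiting mechanism.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout_s: float = 5.0,
               interval_s: float = 0.05) -> bool:
    """Poll `condition` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


class FakeSpinnerPage:
    """Hypothetical stand-in for a page whose results render after a delay."""

    def __init__(self, ready_after_s: float) -> None:
        self._ready_at = time.monotonic() + ready_after_s

    def results_visible(self) -> bool:
        return time.monotonic() >= self._ready_at


page = FakeSpinnerPage(ready_after_s=0.2)

# A naive agent would act immediately, while the spinner is still showing.
clicked_too_early = not page.results_visible()

# A careful flow waits for the condition and gives up after a bounded time.
ready = wait_until(page.results_visible, timeout_s=2.0)
```

The bounded timeout matters as much as the wait itself: a flow that returns `False` here can fail loudly at this step instead of acting on a half-rendered page.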

No real retries or recovery. When a step fails mid-flow, an improvising agent often barrels ahead on a now-invalid page instead of detecting the failure and recovering to a known state.

Cost and latency. Calling a frontier model on every single click is slow and expensive. A 30-step flow becomes 30 model round-trips — death by a thousand inferences when 28 of those steps were deterministic anyway.
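
The 30-step example above can be turned into rough arithmetic. In the sketch below, the per-call latency, per-call price, and per-step code latency are illustrative placeholders chosen for this example, not measurements of any real model or site. Only the step counts (30 total, 28 deterministic) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowCost:
    model_calls: int
    seconds: float
    dollars: float


def estimate(total_steps: int, model_steps: int, model_latency_s: float,
             model_cost_usd: float, code_latency_s: float) -> FlowCost:
    """Estimate one run: model steps pay inference cost, code steps do not."""
    code_steps = total_steps - model_steps
    seconds = model_steps * model_latency_s + code_steps * code_latency_s
    dollars = model_steps * model_cost_usd
    return FlowCost(model_steps, round(seconds, 2), round(dollars, 4))


# Assumed placeholder numbers, for illustration only.
ASSUMED = dict(model_latency_s=2.0, model_cost_usd=0.01, code_latency_s=0.1)

# Every step goes through the model.
pure_llm = estimate(total_steps=30, model_steps=30, **ASSUMED)

# Only the 2 ambiguous steps go through the model; 28 run as plain code.
as_code = estimate(total_steps=30, model_steps=2, **ASSUMED)
```

Under these assumed numbers, the pure-LLM run makes 30 model calls and the codified run makes 2. The absolute figures will differ in practice, but the ratio of model calls is set by how many steps are truly ambiguous.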

The through-line: flexibility and reliability are in tension, and pure-LLM clicking optimizes entirely for the first.

What does "automation as code" actually mean?

The automation-as-code pattern flips the default. Instead of "let the model decide everything, every time," you make the happy path explicit and bounded, and you spend model calls only where they earn their keep.

Concretely, an automation-as-code flow tends to have four properties:

Typed, explicit steps. Each action — navigate, fill, click, extract — is a named operation with defined inputs and outputs, not a free-text instruction. This is code you can read, review, and version.
Assertions between steps. After "click Checkout," the flow asserts the cart page actually loaded before proceeding. Failures are caught at the boundary, not three steps later on a broken page.
Deterministic replay. Given the same inputs, the flow runs the same way every time, so you can test it in CI and diff two runs to see exactly what changed.
The LLM in the loop, not in control. The model is called for the parts that genuinely need judgment — "which of these three results matches the user's intent?" or "extract the order total from this unfamiliar layout" — and not for the dozens of mechanical steps around them.
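
The first three properties above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not any particular tool's API: `FakePage`, `Step`, `run_flow`, and the `shop.example` URLs are hypothetical names, and the in-memory page stands in for a real browser. The model hook is deliberately left out to keep the skeleton visible.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class FakePage:
    """Hypothetical in-memory stand-in for a browser page."""
    url: str = "about:blank"
    fields: Dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class Step:
    name: str
    action: Callable[[FakePage], None]         # what the step does
    postcondition: Callable[[FakePage], bool]  # what must hold afterwards


class StepFailed(Exception):
    """Raised when a step's postcondition does not hold."""


def run_flow(page: FakePage, steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first failed postcondition."""
    completed: List[str] = []
    for step in steps:
        step.action(page)
        if not step.postcondition(page):
            raise StepFailed(f"postcondition failed after step '{step.name}'")
        completed.append(step.name)
    return completed


def navigate(url: str) -> Step:
    def action(page: FakePage) -> None:
        page.url = url
    return Step(f"navigate:{url}", action, lambda page: page.url == url)


def fill(name: str, value: str) -> Step:
    def action(page: FakePage) -> None:
        page.fields[name] = value
    return Step(f"fill:{name}", action,
                lambda page: page.fields.get(name) == value)


def click_checkout() -> Step:
    def action(page: FakePage) -> None:
        # Simulated site behaviour: checkout only works once email is filled.
        if page.fields.get("email"):
            page.url = "https://shop.example/cart"
    return Step("click:checkout", action,
                lambda page: page.url.endswith("/cart"))


flow = [navigate("https://shop.example/"),
        fill("email", "a@example.com"),
        click_checkout()]

completed = run_flow(FakePage(), flow)
replay = run_flow(FakePage(), flow)  # same inputs, same step sequence

# Dropping the fill step makes the failure surface at the checkout boundary.
failure: Optional[str] = None
try:
    run_flow(FakePage(), [navigate("https://shop.example/"), click_checkout()])
except StepFailed as exc:
    failure = str(exc)
```

Because every step carries its own postcondition, the broken run stops at `click:checkout` and names the step that failed, instead of continuing on a page that never changed.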

A useful mental model: treat the agent like a well-tested function with a small AI core, not a freewheeling intern with admin rights to your browser. The model handles ambiguity; the code handles everything else.

How is this different from just using Playwright?

This is the most common question, and the honest answer is: automation as code usually builds on a tool like Playwright rather than replacing it. Plain Playwright gives you deterministic, code-first browser control, which is exactly the reliability backbone you want. What it does not give you is adaptation: a hand-written Playwright script fails as soon as a selector it depends on stops matching, and it has no notion of "figure this part out."

Pure AI agents give you adaptation but sacrifice the determinism. The automation-as-code approach is the synthesis: keep Playwright-grade determinism for the structural skeleton of the flow, and inject model-driven adaptation only at the joints. Tools in this category — Intuned among them — essentially package that synthesis so you do not have to wire the model-in-the-loop plumbing yourself.

So it is not "Playwright vs. AI agent." It is "deterministic core, AI at the edges."
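
One way to picture "deterministic core, AI at the edges" is a resolver that tries a stable anchor first and asks a model only when that anchor is gone. The sketch below is an assumption-laden illustration: `resolve`, `stub_model`, and the dictionary-based page snapshot are hypothetical, and `stub_model` is a trivial text matcher standing in for a real model call.

```python
from typing import Callable, Dict, Optional

# Hypothetical page snapshot: element id -> attributes.
Snapshot = Dict[str, Dict[str, str]]
AskModel = Callable[[str, Snapshot], Optional[str]]


def resolve(snapshot: Snapshot, test_id: str, description: str,
            ask_model: AskModel, calls: Dict[str, int]) -> Optional[str]:
    """Find an element by stable test id; fall back to the model if absent."""
    # Deterministic path: no inference, same answer every run.
    for element_id, attrs in snapshot.items():
        if attrs.get("data-testid") == test_id:
            calls["deterministic"] += 1
            return element_id

    # Adaptive path: spend a model call only because the anchor is missing.
    calls["model"] += 1
    candidate = ask_model(description, snapshot)

    # Validate the answer: only accept ids that exist in the snapshot.
    return candidate if candidate in snapshot else None


def stub_model(description: str, snapshot: Snapshot) -> Optional[str]:
    """Stand-in for a real model: match on visible text only."""
    wanted = description.lower()
    for element_id, attrs in snapshot.items():
        text = attrs.get("text", "").lower()
        if text and text in wanted:
            return element_id
    return None


stable = {"e1": {"data-testid": "checkout-submit", "text": "Submit order"}}
redesigned = {"x9": {"text": "Submit order"}}  # test id dropped in a redesign

calls = {"deterministic": 0, "model": 0}
first = resolve(stable, "checkout-submit", "the submit order button",
                stub_model, calls)
second = resolve(redesigned, "checkout-submit", "the submit order button",
                 stub_model, calls)
```

On the unchanged page the resolver never touches the model. After the simulated redesign it makes exactly one model call and still checks that the returned element exists before using it.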

How do you make a browser agent reliable enough to ship?

You do not need a specific vendor to apply the pattern. A practical checklist:

Make the happy path code, not prose. Write the known steps explicitly. Reserve natural-language instructions for the genuinely variable parts.
Assert after every state transition. Treat each page as a contract: confirm you landed where you expected before acting.
Pin selectors to stable anchors. Prefer roles, labels, and test IDs over visual descriptions or brittle CSS paths.
Add real recovery, not retries-in-place. On failure, return to a known checkpoint and re-attempt from there, rather than re-clicking blindly.
Budget your model calls. Inference only where ambiguity lives. Every deterministic step you pull out of the model is faster, cheaper, and more reproducible.
Test like software. Run flows in CI against staging, and alert on assertion failures so you learn the page changed before your users do.
Log trajectories. Capture each step's input and outcome so a failed run is debuggable instead of a black box.
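
Two items from the checklist above, checkpoint recovery and trajectory logging, fit naturally into one small runner. The following is a minimal sketch under assumptions: `run_with_checkpoint`, `TrajectoryEntry`, and the simulated modal are hypothetical, and a plain dictionary stands in for browser state.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Dict, List, Tuple

State = Dict[str, str]
StepFn = Callable[[State], None]  # a step raises an exception on failure


@dataclass
class TrajectoryEntry:
    attempt: int
    step: str
    ok: bool
    detail: str


def run_with_checkpoint(steps: List[Tuple[str, StepFn]],
                        restore_checkpoint: Callable[[], State],
                        max_attempts: int = 2
                        ) -> Tuple[bool, List[TrajectoryEntry]]:
    """Run steps; on failure, restart from the checkpoint, not in place."""
    log: List[TrajectoryEntry] = []
    for attempt in range(1, max_attempts + 1):
        state = restore_checkpoint()  # known-good starting state
        failed = False
        for name, step in steps:
            try:
                step(state)
                log.append(TrajectoryEntry(attempt, name, True, ""))
            except Exception as exc:
                log.append(TrajectoryEntry(attempt, name, False, str(exc)))
                failed = True
                break
        if not failed:
            return True, log
    return False, log


# Simulate a modal that blocks the Pay button on the first attempt only.
modal_blocks = {"remaining": 1}


def open_cart(state: State) -> None:
    state["page"] = "cart"


def click_pay(state: State) -> None:
    if modal_blocks["remaining"] > 0:
        modal_blocks["remaining"] -= 1
        raise RuntimeError("unexpected modal covered the Pay button")
    if state.get("page") != "cart":
        raise RuntimeError("not on the cart page")
    state["page"] = "receipt"


ok, log = run_with_checkpoint(
    [("open_cart", open_cart), ("click_pay", click_pay)],
    restore_checkpoint=lambda: {"page": "home"},
)

# A serialized trajectory makes a failed run inspectable after the fact.
trajectory_json = json.dumps([asdict(entry) for entry in log], indent=2)
```

The second attempt replays `open_cart` from the restored state instead of re-clicking Pay on a page of unknown condition, and the log records both attempts step by step.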

None of these are exotic. They are the same disciplines that make any automation dependable — applied to a surface that an AI happens to be driving.

When is pure-LLM clicking still the right call?

Automation as code is not always the answer. For genuinely one-shot tasks, exploratory research, or flows where the page structure is unknown and changes every time, the flexibility of a pure-LLM agent is worth the unreliability — you are not going to run it 500 times anyway. The pattern earns its cost specifically when a flow is repeated, business-critical, and expected to stay stable. Match the tool to the job: improvise for the long tail, codify the workhorses.

Key takeaways

AI browser automation breaks in production because of flaky selectors, non-determinism, timing races, weak recovery, and per-click model cost — not because the model is "not smart enough."
The durable fix is automation as code: typed steps, assertions, deterministic replay, and the LLM in the loop only where judgment is required.
It complements browser-driving tools like Playwright rather than replacing them — deterministic core, AI at the edges.
Reliability is an engineering discipline: assert state transitions, pin to stable selectors, recover to checkpoints, budget inference, and test in CI.
Reach for pure-LLM clicking on one-off, exploratory tasks; reach for automation as code on the repeated, business-critical flows.
