Module 027 · Desk VIII · Frontier Capabilities

Computer use.

Agents that drive your browser. Capabilities, boundaries, and what's ready for production, all in 90 minutes.

90 minutes · 9 sections · ~7,500 words · Prereq: Module 026
Written for
Rookie Founder

Your agent can drive a browser now. Most of what you've read about it is wrong.

Computer use lets agents see a screen, move a mouse, and type into fields. The demos are impressive. The production use is narrower than the demos suggest, and the risks are bigger than most teams realize.

This module is 90 minutes of honest talk about computer use. By the end, you will have:

  • A clear view of what computer use does well, and what it doesn't.
  • One shipped browser-task agent.
  • Guardrails: what you must not let it touch.

Thinker.

Computer use is the screen-read-act loop:

  1. Screenshot the screen.
  2. Model decides what action to take.
  3. Action is executed (click, type, scroll).
  4. Screenshot again. Repeat.
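
The four steps above can be sketched as a plain loop. This is a minimal sketch, not a real agent: `take_screenshot`, `decide`, and `execute` are hypothetical callables you would supply (a browser driver and a model call in practice), and the `Action` type is invented here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    kind: str          # "click", "type", "scroll", or "done"
    detail: str = ""   # e.g. a target description or the text to type


def run_loop(
    take_screenshot: Callable[[], bytes],
    decide: Callable[[bytes], Action],
    execute: Callable[[Action], None],
    max_steps: int = 30,
) -> List[Action]:
    """Run screenshot -> decide -> act until the model says "done" or the cap is hit."""
    history: List[Action] = []
    for _ in range(max_steps):
        screen = take_screenshot()   # 1. screenshot the screen
        action = decide(screen)      # 2. model decides what action to take
        if action.kind == "done":
            break
        execute(action)              # 3. action is executed
        history.append(action)       # 4. loop back and screenshot again
    return history


# Usage with stand-ins: a scripted "model" and an executor that only records.
scripted = iter([Action("click", "Reports tab"), Action("type", "2024"), Action("done")])
performed: List[Action] = []
history = run_loop(lambda: b"", lambda _screen: next(scripted), performed.append)
```

The step cap lives in the loop itself, so a confused model cannot run forever even if it never reports "done".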

Where it works

  • Tasks with a clear start and end state.
  • Apps without an API (the legacy system your team hates).
  • Data entry across web forms.
  • Extraction from rendered content.

Where it doesn't

  • Tasks requiring fine precision (designer work, detailed spreadsheet edits).
  • Tasks on rapidly changing UIs (the agent's picture of the screen goes stale fast).
  • Anything requiring credentials the user doesn't want to hand over.

The new boundary

When an agent can click and type on your behalf, the blast radius changes. Every click is an action the agent now takes, not just advises on. Plan accordingly.

Talker.

The task prompt

You are a browser task agent. Execute the task below.

Task: [one-sentence task description]

Starting state: [URL and any preconditions]
Success criteria: [what "done" looks like, literally]
Failure criteria: [when to stop and report failure]

Rules:
- Narrate each action in one line before taking it.
- Never interact with elements outside the target domain.
- Never enter payment info, passwords, or PII unless they
  came from the approved-credentials list.
- If a screen asks for something unexpected, stop and ask
  the user.
- Maximum 30 actions. If you hit 30, stop and report.

Approved credentials (if any): [list or "none"]
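
If you fill this template from code, a small builder keeps placeholders from shipping unfilled. A sketch under assumptions: the function name `build_task_prompt` and its parameters are invented here, and the credentials argument is assumed to hold labels only, never the secrets themselves.

```python
from typing import Sequence

PROMPT_TEMPLATE = """\
You are a browser task agent. Execute the task below.

Task: {task}

Starting state: {starting_state}
Success criteria: {success}
Failure criteria: {failure}

Rules:
- Narrate each action in one line before taking it.
- Never interact with elements outside the target domain.
- Never enter payment info, passwords, or PII unless they
  came from the approved-credentials list.
- If a screen asks for something unexpected, stop and ask
  the user.
- Maximum {max_actions} actions. If you hit {max_actions}, stop and report.

Approved credentials (if any): {credentials}
"""


def build_task_prompt(
    task: str,
    starting_state: str,
    success: str,
    failure: str,
    credential_labels: Sequence[str] = (),
    max_actions: int = 30,
) -> str:
    """Fill the template; reject blank fields so no placeholder reaches the agent."""
    fields = {"task": task, "starting_state": starting_state,
              "success": success, "failure": failure}
    for name, value in fields.items():
        if not value.strip():
            raise ValueError(f"{name} must not be empty")
    # Labels only (e.g. "reports-readonly"); secret values stay out of the prompt.
    credentials = ", ".join(credential_labels) if credential_labels else "none"
    return PROMPT_TEMPLATE.format(max_actions=max_actions, credentials=credentials, **fields)


prompt = build_task_prompt(
    task="Read this week's total from the reports dashboard.",
    starting_state="https://reports.example.com/weekly, already loaded",
    success="The weekly total is reported as a number.",
    failure="A login wall or an error page appears.",
)
```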

Narration as a log

Requiring narration is the single highest-leverage rule. Every action comes with a stated reason, so when the agent misbehaves you have a trace in English, not pixels.

Rememberer.

Computer-use sessions are stateful. Log everything.

[browser-agent]/
  sessions/
    [YYYY-MM-DD-HH]/
      screenshots/      (every screen, numbered)
      narration.md      (one line per action)
      final-state.md    (what was done, what wasn't)
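
One way to produce that layout, as a sketch. The class name `SessionLog` and the `Done:` / `Not done:` wording in final-state.md are assumptions, and the PNG bytes are placeholders. The hour-level folder name comes straight from the layout above, so two sessions started in the same hour would share a folder.

```python
import tempfile
from datetime import datetime
from pathlib import Path


class SessionLog:
    """Writes one session folder: screenshots/, narration.md, final-state.md."""

    def __init__(self, root: Path, started: datetime):
        self.dir = Path(root) / "sessions" / started.strftime("%Y-%m-%d-%H")
        self.shots = self.dir / "screenshots"
        self.shots.mkdir(parents=True, exist_ok=True)
        self.step = 0

    def record(self, narration: str, screenshot_png: bytes) -> Path:
        """Save the numbered screenshot and append one narration line."""
        self.step += 1
        shot = self.shots / f"{self.step:04d}.png"
        shot.write_bytes(screenshot_png)
        with open(self.dir / "narration.md", "a", encoding="utf-8") as f:
            f.write(f"{self.step}. {narration}\n")
        return shot

    def finish(self, done: str, not_done: str) -> None:
        """Write the closing summary of what was and was not done."""
        (self.dir / "final-state.md").write_text(
            f"Done: {done}\nNot done: {not_done}\n", encoding="utf-8")


# Usage in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    log = SessionLog(Path(tmp), datetime(2025, 1, 6, 14))
    first = log.record("Open the reports page.", b"placeholder-png-bytes")
    log.record("Click the 'Weekly' tab.", b"placeholder-png-bytes")
    log.finish(done="Read the weekly total.", not_done="Nothing skipped.")
    session_name = log.dir.name
    first_shot_name = first.name
    narration_text = (log.dir / "narration.md").read_text(encoding="utf-8")
    final_exists = (log.dir / "final-state.md").exists()
```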

Why screenshots matter

When something goes wrong, the screenshots are the evidence. Regulators, auditors, or your own post-mortem all need them. Disk is cheap. Keep them.

Credential handling

Never put passwords in the prompt or the trace. Store credentials in a secrets manager and tag each one by purpose. Review the approved-credentials list quarterly.
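
A sketch of keeping secret values out of the trace. Environment variables stand in for a real secrets manager here, and the `AGENT_SECRET_` prefix, `load_secrets`, and `redact` are all hypothetical names chosen for illustration.

```python
import os
from typing import Dict, Iterable, Mapping

ENV_PREFIX = "AGENT_SECRET_"  # hypothetical naming convention, not a standard


def load_secrets(labels: Iterable[str], env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Look up each approved label; fail loudly if one was never provisioned."""
    secrets: Dict[str, str] = {}
    for label in labels:
        key = ENV_PREFIX + label.upper().replace("-", "_")
        if key not in env:
            raise KeyError(f"no secret provisioned for approved label {label!r}")
        secrets[label] = env[key]
    return secrets


def redact(text: str, secrets: Mapping[str, str]) -> str:
    """Replace any secret value in a trace line with its label."""
    # Longest values first, so one secret that contains another is replaced whole.
    for label, value in sorted(secrets.items(), key=lambda kv: len(kv[1]), reverse=True):
        if value:
            text = text.replace(value, f"[{label}]")
    return text


# Usage with a fake environment, so nothing real is read.
fake_env = {"AGENT_SECRET_REPORTS_READONLY": "hunter2"}
secrets = load_secrets(["reports-readonly"], env=fake_env)
safe_line = redact("Typed hunter2 into the password field.", secrets)
```

Run every narration line through `redact` before it is written, so the log names the credential by purpose and never by value.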

Doer.

Twelve minutes. Ship one browser task. Low-stakes, real data.

Build block · 12 minutes
First browser task agent

Step 1. Pick the task (2 min)

Something narrow and boring. "Check a known page for a specific value." "Download this week's report from a dashboard." Avoid anything with side effects.

Step 2. Set up the environment (3 min)

Use Claude's computer use (Claude Desktop with the feature flag, or the API with the computer use tool). Run it in a sandboxed browser only, never in your main logged-in browser.

Step 3. Write the task prompt (3 min)

Use the template from Talker. Be specific about success and failure.

Step 4. Run and watch (3 min)

Watch the narration live. Interrupt if it does anything surprising. Screenshot every step. After completion, read the full trace.

Step 5. Iterate or retire (1 min)

If the task worked, add it to your automation backlog. If it didn't, decide: is it fixable (prompt issue), or is this a bad fit for computer use?

Expected

One completed task, with trace. Or one clear "not ready" signal. Both are useful outcomes.

If something's wrong
  • Agent clicks the wrong thing: UI element recognition is imperfect. Add more specific language to the task ("the blue Submit button at the bottom of the form").
  • Agent loops: add a tighter action cap, or break the task into smaller steps.
  • Agent gives up: make sure the success criteria are reachable. Often the task needs decomposition, not more AI.

Rookie.

Failure 1. Running on your main browser

You hook computer use into your logged-in Chrome. Agent accidentally clicks something sensitive. You find out by being logged out of everything.

Fix: dedicated sandbox browser. Fresh profile. Limited credentials.

Failure 2. Treating it like an API

You schedule a computer-use agent to run hourly. Three days later the target site is redesigned. Your agent now clicks random things on a page it doesn't understand.

Fix: computer use is not a stable integration. It's a fallback for things without APIs. Expect UI changes to break it. Monitor. Fall back gracefully.
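
One cheap form of monitoring is a pre-flight check: before each scheduled run, confirm the page still shows the landmarks the task depends on, and skip the run if it doesn't. A sketch, assuming you can get the page's visible text as a string; `missing_landmarks` and `should_run` are hypothetical helpers.

```python
from typing import Iterable, List


def missing_landmarks(page_text: str, expected: Iterable[str]) -> List[str]:
    """Return the expected labels that no longer appear on the page."""
    lowered = page_text.lower()
    return [mark for mark in expected if mark.lower() not in lowered]


def should_run(page_text: str, expected: Iterable[str]) -> bool:
    """Run only when every landmark is still present; otherwise skip and alert."""
    return not missing_landmarks(page_text, expected)


# Usage: the old layout passes, the redesigned one is caught before any click.
expected = ["Weekly report", "Download CSV"]
old_page = "Dashboard | Weekly report | Download CSV"
new_page = "Dashboard | Insights | Export"
gone = missing_landmarks(new_page, expected)
```

A skipped run with a clear "landmarks missing" alert is the graceful fallback. Text matching will not catch every redesign, so it complements the review rather than replacing it.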

Failure 3. Giving it too much trust

You let it use your real credentials. It logs into a financial system. One misclick moves money.

Fix: the credentials you approve are minimal. The domains you approve are minimal. Dollar-value actions require human confirmation.
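
The human-confirmation rule can be enforced in code rather than left to the prompt. A rough sketch: the keyword list is an assumption and deliberately crude, so it errs toward asking too often. `gated_execute` and `confirm` are hypothetical names.

```python
from typing import Callable

# Hypothetical trigger words; tune for the systems you actually touch.
MONEY_WORDS = ("pay", "transfer", "purchase", "refund", "checkout")


def needs_confirmation(narration: str) -> bool:
    """Flag any narrated action that looks like it moves money."""
    lowered = narration.lower()
    return "$" in lowered or any(word in lowered for word in MONEY_WORDS)


def gated_execute(
    narration: str,
    execute: Callable[[], None],
    confirm: Callable[[str], bool],
) -> bool:
    """Run the action unless it looks financial and the human declines."""
    if needs_confirmation(narration) and not confirm(narration):
        return False
    execute()
    return True


# Usage: a harmless click runs; a transfer waits on the human's answer.
calls = []
ran_click = gated_execute("Click the Weekly tab.", lambda: calls.append("click"),
                          confirm=lambda _n: False)
ran_transfer = gated_execute("Click 'Transfer $500'.", lambda: calls.append("transfer"),
                             confirm=lambda _n: False)
```

Because the gate reads the narration line, it only works if the narrate-before-acting rule is enforced.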

Manager.

Deployment discipline

Computer-use agents run in isolated containers. Network policies restrict domains. Credentials via secrets manager only. This is infra work. Have an engineer set it up, even if operators use it.
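
Network policy is the real control. An application-level check before each navigation is a useful second layer. A minimal sketch of such a check; `is_allowed` is a hypothetical helper, and the domains are examples.

```python
from typing import Iterable
from urllib.parse import urlsplit


def is_allowed(url: str, allowed_domains: Iterable[str]) -> bool:
    """True only for http(s) URLs whose host is an approved domain or a subdomain of one."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


# Usage
allowed = ["example.com"]
ok = is_allowed("https://reports.example.com/weekly", allowed)
lookalike = is_allowed("https://example.com.evil.test/", allowed)
```

Matching on the parsed hostname, not on a substring of the URL, is what stops lookalike hosts such as `example.com.evil.test`.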

Review rhythm

Every computer-use agent has a weekly review: did it still work? Did it do anything unexpected? Target UIs change. Without review, you ship broken agents without knowing.

When to not use computer use

If the target system has an API, use the API. If the task is sensitive, build a real integration. Computer use is for legacy systems, third-party dashboards, one-off extraction.

Chief.

Risk 1. Blast radius

An agent that can click is an agent that can cause real-world effects. Unintended effects scale with permissions. A computer-use agent with write access to accounting is a different risk class than one reading dashboards.

Governance: permissions audit per agent. Write access requires explicit sign-off. No general-purpose "full-access" browser agents.
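
A permissions audit is easier to run when each agent declares its permissions as data. A sketch under assumptions: the `AgentPermissions` fields and the `audit` function are invented for illustration, not a standard schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class AgentPermissions:
    name: str
    domains: Tuple[str, ...]       # explicit list of target domains
    write_access: bool = False
    signed_off_by: str = ""        # who approved write access, if anyone


def audit(agent: AgentPermissions) -> List[str]:
    """Return the governance problems found for one agent (empty means it passes)."""
    problems: List[str] = []
    if not agent.domains or "*" in agent.domains:
        problems.append("full-access agents are not allowed: list explicit domains")
    if agent.write_access and not agent.signed_off_by:
        problems.append("write access requires explicit sign-off")
    return problems


# Usage: a read-only dashboard agent passes; an unsigned accounting writer does not.
reader = AgentPermissions("dashboard-reader", ("reports.example.com",))
writer = AgentPermissions("ledger-writer", ("accounting.example.com",), write_access=True)
reader_problems = audit(reader)
writer_problems = audit(writer)
```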

Risk 2. Screenshots as a data surface

Every session produces screenshots. Those screenshots may contain PII, trade secrets, or regulated data. They need the same treatment as other sensitive data.

Governance: screenshot storage policy. Retention, encryption, access logs. Same rigor as customer data.
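
The retention part of that policy can be a small job that finds session folders past their window. A sketch, assuming session folders are named by start hour (YYYY-MM-DD-HH); the retention period is a parameter, not a recommendation, and encryption and access logging are left to your storage layer.

```python
from datetime import datetime, timedelta
from typing import Iterable, List


def expired_sessions(
    session_names: Iterable[str],
    now: datetime,
    retention_days: int,
) -> List[str]:
    """Return the session folder names that started before the retention cutoff."""
    cutoff = now - timedelta(days=retention_days)
    expired: List[str] = []
    for name in session_names:
        try:
            started = datetime.strptime(name, "%Y-%m-%d-%H")
        except ValueError:
            continue  # not a session folder; leave it for a human to inspect
        if started < cutoff:
            expired.append(name)
    return expired


# Usage: with a 30-day window on 1 March, only the January session has expired.
names = ["2025-01-01-09", "2025-02-27-10", "notes"]
to_delete = expired_sessions(names, now=datetime(2025, 3, 1), retention_days=30)
```

Keeping the decision separate from the deletion makes it easy to log, review, or dry-run what would be removed.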

Risk 3. Vendor TOS violations

Many websites' Terms of Service prohibit automated access. A computer-use agent may violate the TOS of the sites you scrape or interact with. The cost: being blocked at best, legal action at worst.

Governance: review the TOS for every target. Avoid sites whose TOS clearly prohibit automation. When in doubt, email the site owner and ask.

Founder.

Solo founder: computer use is useful exactly where no API exists. Don't reach for it when an API is available.

The solo computer-use kit

  • Dedicated sandbox browser, fresh profile, no signed-in accounts.
  • One approved-credentials list, minimum scopes.
  • One task per agent, narrow scope.
  • Weekly review: did it work? Did anything change?

The ROI reality

Most solo use cases are better solved by: an API call, a vibe-coded scraper, or a direct integration. Computer use is the last resort, not the first tool. When you need it, it's great. Most of the time, you don't.

The one thing to remember

Computer use is a hammer for the last mile of automation.

For anything with an API, use the API. For anything without, computer use is the bridge. Keep the sandbox tight, the credentials small, the scope narrow. The blast radius is real. The discipline keeps it boring.
