Module 018 · Desk V · Build & Ship

Deploying agents.

From laptop to users. Where your agent actually runs, how it scales, how it fails, in 90 minutes.

90 minutes · 9 sections · ~7,500 words · Prereq: Module 017
Written for: Rookie · Manager · Founder

Your agent works on your laptop. That is not a deploy.

You built something. You ran it. You showed a teammate. They said "cool, can I use it?" The correct answer is "yes." The honest answer is "actually, it's kind of complicated." That gap is where most agent projects die.

Deploying an agent is not the same as deploying a web app. The agent has: a system prompt (versioned like code), tools (each with its own auth), state (per-user, per-session), and cost (per-call, not per-request). Every one of those needs a home in your deployment.

This module is 90 minutes of getting one agent from your laptop to a URL your teammates can use. By the end:

  • A working deployment of a real agent.
  • Observability: you can see every call, every tool, every cost.
  • A rollback plan.

Thinker.

A deployed agent has five surfaces that can fail. Know them before you ship.

  1. The prompt. Versioned. Reviewable. Rolled back independently of code.
  2. The code. The agent loop. Standard deployment.
  3. The tools. Each has its own auth, rate limit, failure mode.
  4. The model. The upstream LLM API. Can go down. Can rate-limit you. Can change behavior on a minor version bump.
  5. The state. User data, session data, cached results. Must be backed up like any DB.

Deploy once, observe forever

The day you deploy, structured logging goes in. Every agent call gets a trace: input, tool calls, tool results, final output, total tokens, total cost. You'll thank yourself the first time something misbehaves.

Three deploy targets

  • A URL. Web UI with login. Good for teammates.
  • A CLI. Script they run on their laptop. Good for engineers.
  • An API. For programmatic integration. Good for products.

Start with one. Add the others when someone asks.

Talker.

Two prompts matter at deploy time. The ops review prompt. The incident triage prompt.

The ops review prompt

Run this before you hit "deploy to prod":

You are an ops reviewer. I'm deploying this agent:

System prompt: [paste]
Tools: [list with descriptions]
Storage: [describe]
Deploy target: [URL / CLI / API]

Answer, specifically:
1. What's the blast radius if the prompt is wrong? (e.g.,
   agent emails all customers with garbage)
2. What's the blast radius if a tool call fails?
3. What data does this touch, and where does it go?
4. What's the rollback plan if we ship and it breaks?
5. How do I know in the first hour if it's misbehaving?

Flag anything that doesn't have a clear answer.

If the agent can't flag concerns, you're missing context in the deploy. Fix that before shipping.

The incident triage prompt

Save this. Use it when the agent misbehaves:

I have an agent incident. Trace: [paste full trace]

Tell me, in order:
1. Which step was the failure: prompt, tool, or model?
2. What's the fastest rollback?
3. What change would prevent this next time?

Rememberer.

Three deployment environments. Each has its own state.

[your-app]/
  prompts/
    system-v1.txt
    system-v2.txt       (current prod)
    system-v3.txt       (staging)
  deploy/
    prod.env            (never committed)
    staging.env
    local.env
  storage/
    prod.db
    staging.db
  logs/
    [YYYY-MM-DD]/traces.jsonl

Prompt versions

The prompt is code. Every version is a file. Prod runs system-v2.txt. When you deploy v3, you move the pointer; you don't overwrite v2. Rollback = flip the pointer back.
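One way to build that pointer, as a minimal sketch: a small text file holds the name of the active version, and the agent reads it at startup. The pointer file name `CURRENT` and the helper names are assumptions for illustration; nothing in the layout above requires them.

```python
from pathlib import Path

PROMPTS_DIR = Path("prompts")   # matches the layout above
POINTER_NAME = "CURRENT"        # hypothetical pointer file; contains e.g. "system-v2.txt"


def active_version(prompts_dir: Path = PROMPTS_DIR) -> str:
    """Return the prompt file name the pointer currently selects."""
    return (prompts_dir / POINTER_NAME).read_text(encoding="utf-8").strip()


def load_prompt(prompts_dir: Path = PROMPTS_DIR) -> str:
    """Load the system prompt that the pointer selects."""
    return (prompts_dir / active_version(prompts_dir)).read_text(encoding="utf-8")


def set_version(file_name: str, prompts_dir: Path = PROMPTS_DIR) -> None:
    """Deploy or roll back by rewriting the pointer; prompt files are never edited."""
    if not (prompts_dir / file_name).is_file():
        raise FileNotFoundError(f"no such prompt version: {file_name}")
    (prompts_dir / POINTER_NAME).write_text(file_name + "\n", encoding="utf-8")
```

Deploying v3 is then `set_version("system-v3.txt")`, and rollback is `set_version("system-v2.txt")`. Both prompt files stay on disk untouched.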

Logs as diagnostics

Every agent call writes one JSONL line to logs/[today]/traces.jsonl:

{"ts": "...", "user": "...", "prompt_version": "v2",
 "input": "...", "tool_calls": [...], "output": "...",
 "tokens": {"in": 1234, "out": 567}, "cost_usd": 0.034}

Rotate daily. Keep 30 days of raw traces on the box and archive older days to cheap storage. After a year you have enough data to spot drift.
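A minimal sketch of the writer and the 30-day cleanup, using only the standard library. The function names are hypothetical, and the trace fields are whatever your agent loop passes in. Archive anything you still want before pruning.

```python
import json
import shutil
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Optional

LOG_ROOT = Path("logs")   # matches the layout above
KEEP_DAYS = 30


def write_trace(trace: dict, log_root: Path = LOG_ROOT,
                now: Optional[datetime] = None) -> Path:
    """Append one trace as a single JSON line to logs/YYYY-MM-DD/traces.jsonl."""
    now = now or datetime.now(timezone.utc)
    day_dir = log_root / now.strftime("%Y-%m-%d")
    day_dir.mkdir(parents=True, exist_ok=True)   # a new directory per day is the rotation
    path = day_dir / "traces.jsonl"
    record = {"ts": now.isoformat(), **trace}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return path


def prune_old_days(log_root: Path = LOG_ROOT, keep_days: int = KEEP_DAYS,
                   now: Optional[datetime] = None) -> list:
    """Delete day directories older than keep_days; return the names removed."""
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=keep_days)).date()
    removed = []
    for day_dir in sorted(log_root.iterdir()):
        try:
            day = datetime.strptime(day_dir.name, "%Y-%m-%d").date()
        except ValueError:
            continue   # not a day directory; leave it alone
        if day_dir.is_dir() and day < cutoff:
            shutil.rmtree(day_dir)
            removed.append(day_dir.name)
    return removed
```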

Doer.

Twelve minutes. Deploy a simple agent to a URL your teammates can hit.

Build block · 12 minutes
Ship an agent to a URL

Step 1. Pick the target (2 min)

For this build, use Fly.io or Railway. Free tier. Deploys from a Dockerfile. Something like:

flyctl launch   # or railway login; railway init

Step 2. Wrap the agent in a tiny web server (3 min)

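from fastapi import FastAPI
from pydantic import BaseModel

from agent import run       # your loop from Module 016

app = FastAPI()


class Req(BaseModel):
    """Request body: the text to hand to the agent."""
    input: str


@app.post("/run")
def handle(req: Req):
    # Run the agent synchronously and return its final output as JSON.
    return {"output": run(req.input)}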

Step 3. Add observability (3 min)

Before every run, log the input. After, log the output and the cost. One JSON line per request.
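A sketch of that wrapper, under one loud assumption: your agent loop can report token usage, returning `(output, tokens_in, tokens_out)`. The `run` from the earlier module may not do that yet. The prices are placeholders, not any provider's real rates.

```python
import json
import sys
import time
from typing import Callable, Tuple

# Placeholder prices in USD per million tokens; replace with your provider's rates.
PRICE_IN_PER_MTOK = 3.00
PRICE_OUT_PER_MTOK = 15.00


def cost_usd(tokens_in: int, tokens_out: int) -> float:
    """Cost of one call at the placeholder prices above."""
    total = tokens_in * PRICE_IN_PER_MTOK + tokens_out * PRICE_OUT_PER_MTOK
    return round(total / 1_000_000, 6)


def logged_run(run_with_usage: Callable[[str], Tuple[str, int, int]],
               user_input: str, out=sys.stdout) -> str:
    """Call the agent and emit exactly one JSON line for the request."""
    started = time.time()
    record = {"ts": started, "input": user_input}
    try:
        output, tokens_in, tokens_out = run_with_usage(user_input)
        record.update(
            output=output,
            tokens={"in": tokens_in, "out": tokens_out},
            cost_usd=cost_usd(tokens_in, tokens_out),
        )
        return output
    except Exception as exc:
        record["error"] = repr(exc)   # failed calls get a log line too
        raise
    finally:
        record["latency_s"] = round(time.time() - started, 3)
        out.write(json.dumps(record) + "\n")
```

Inside the handler you would return `{"output": logged_run(run, req.input)}`. The `finally` block is the point: a call that raises still leaves a trace.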

Step 4. Deploy (2 min)

flyctl deploy
# or: railway up

Set ANTHROPIC_API_KEY as a secret on the deploy platform, never as an ENV line in the Dockerfile, where it would be baked into the image.
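On the app side, a small fail-fast check at startup turns a missing secret into an obvious error instead of a confusing auth failure on the first request. This is a sketch; `require_secret` is a hypothetical helper name.

```python
import os


def require_secret(name: str, env=os.environ) -> str:
    """Return the named secret from the environment, or fail with a clear message."""
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it as a secret on the deploy platform"
        )
    return value


# At startup, before serving any request:
#   API_KEY = require_secret("ANTHROPIC_API_KEY")
```

On Fly.io, for example, `flyctl secrets set ANTHROPIC_API_KEY=...` stores the value outside the image and exposes it to the app as an environment variable at runtime.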

Step 5. Share (2 min)

Send the URL to one teammate. Watch your logs. Iterate tomorrow.

Expected

A URL. One teammate using it. Structured logs you can grep.

If something's wrong
  • Build fails: check that the Python version in your Dockerfile matches the one you develop on. Platform defaults may differ from your laptop.
  • Auth fails at runtime: secret not set on deploy platform.
  • Works locally, breaks in prod: you're relying on a file path that doesn't exist in the container.
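For the file-path failure, one common fix is to resolve paths relative to the source file rather than the working directory, and to fail loudly when a file is missing. A sketch, with `resource_path` as a hypothetical helper:

```python
from pathlib import Path

# Directory containing this source file; stable no matter where the process starts.
# Falls back to the current directory when __file__ is undefined (e.g. in a REPL).
BASE_DIR = Path(__file__).resolve().parent if "__file__" in globals() else Path.cwd()


def resource_path(*parts: str, base: Path = BASE_DIR) -> Path:
    """Build a path under the app directory and raise if it does not exist."""
    path = base.joinpath(*parts)
    if not path.exists():
        raise FileNotFoundError(
            f"expected {path} inside the container; was it copied in the Dockerfile?"
        )
    return path
```

Calling `resource_path("prompts", "system-v2.txt")` at startup surfaces a missing file at boot rather than on the first user request.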

Rookie.

Failure 1. Deploy without logs

You ship. Agent misbehaves. Users complain. You can't reproduce because you have no trace of the bad call.

Fix: structured logs on day one, not day three. No exceptions.

Failure 2. Prompt in the Docker image

You bake the system prompt into the code. Every prompt change rebuilds and redeploys the whole container. Takes 10 minutes. You stop iterating.

Fix: prompt lives in a file, loaded at startup, from a versioned location. Prompt changes are near-instant deploys.

Failure 3. No rate limit on your own endpoint

Somebody shares your URL. 100 requests hit in 5 minutes. You blow through your LLM quota and get rate-limited for an hour.

Fix: basic auth or an API key, with a rate limit per key. Some deploy platforms can enforce this in configuration; otherwise add it in the app.
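If you enforce the limit in the app, a per-key sliding window is enough to start. The sketch below keeps counters in memory, so it assumes a single process; the class name and default limits are illustrative.

```python
import time
from collections import defaultdict, deque


class RateLimiter:
    """Allow at most `limit` requests per `window_s` seconds for each API key."""

    def __init__(self, limit: int = 20, window_s: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self._hits = defaultdict(deque)   # api_key -> timestamps of recent requests

    def allow(self, api_key: str) -> bool:
        now = self.clock()
        hits = self._hits[api_key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

In the request handler, check `limiter.allow(key)` before calling the agent and answer with HTTP 429 when it returns False. That way a shared URL burns through one key's allowance, not your whole LLM quota.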

Manager.

A team running multiple agents needs a deployment playbook, or every agent is a bespoke snowflake.

One deploy template

Pick one deploy path (Fly, Railway, your own K8s). Use it for every agent. New agents copy the template. This is how you go from shipping one agent to shipping ten.

Pre-deploy checklist

  • Eval suite passes (Module 006).
  • Ops review prompt run (Talker above).
  • Logs verified in staging.
  • Rollback plan in writing.

Four checks. Five minutes. Saves ten hours.

On-call

If the agent is user-facing, someone is on-call for it. Name the owner. Page them on error spikes. Real products have pagers, and agent products are products.

Chief.

Risk 1. The prompt outage

A bad prompt change ships at 2am. Agent replies with garbage to every user until 8am. Blast radius: every user touched in 6 hours.

Governance: prompt changes need the same approval as code changes. No "quick tweaks" from a Notion doc.

Risk 2. Model provider outage

Your agent depends on an upstream API. When that API has an incident, your product has an incident. Have a fallback (different model, queue, graceful degrade).
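The fallback can be sketched as an ordered list of providers with a degraded reply at the end. Each provider here is just a function from prompt to reply; how you wrap a real model client into that shape is left as an assumption.

```python
from typing import Callable, Sequence, Tuple

DEGRADED_MESSAGE = "The assistant is temporarily unavailable. Please try again in a few minutes."


def call_with_fallback(prompt: str,
                       providers: Sequence[Callable[[str], str]],
                       degraded: str = DEGRADED_MESSAGE) -> Tuple[str, int]:
    """Try each provider in order; return (reply, index of the provider used).

    Index -1 means every provider failed and the degraded message was returned.
    """
    for i, provider in enumerate(providers):
        try:
            return provider(prompt), i
        except Exception:
            # In a real service, record the exception in the trace before moving on.
            continue
    return degraded, -1
```

Logging the returned index in each trace shows how often you run on the fallback, which is worth knowing before the next incident.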

Risk 3. Observability debt

You deploy. You don't look at logs. After 3 months, cost is 5x what you budgeted because a tool is looping. Nobody noticed because nobody was watching.

Governance: weekly ops review. Read the cost chart, read 20 random traces, flag patterns.
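The weekly review can start as a short script over the JSONL traces. This sketch assumes each entry in `tool_calls` is an object with a `name` field, which the trace format above does not spell out, and it treats more than ten calls to one tool in a single trace as a possible loop. Both are assumptions to tune.

```python
import json
import random
from collections import Counter
from pathlib import Path


def load_traces(log_root: Path) -> list:
    """Read every logs/<day>/traces.jsonl under log_root into a list of dicts."""
    traces = []
    for path in sorted(log_root.glob("*/traces.jsonl")):
        with path.open(encoding="utf-8") as f:
            traces.extend(json.loads(line) for line in f if line.strip())
    return traces


def weekly_review(traces: list, sample_size: int = 20,
                  loop_threshold: int = 10, seed=None) -> dict:
    """Total the cost, flag traces where one tool repeats a lot, and pick a random sample."""
    total_cost = sum(t.get("cost_usd", 0.0) for t in traces)
    suspects = []
    for t in traces:
        counts = Counter(call.get("name") for call in t.get("tool_calls", []))
        for tool, n in counts.items():
            if n > loop_threshold:
                suspects.append({"ts": t.get("ts"), "tool": tool, "calls": n})
    rng = random.Random(seed)
    sample = rng.sample(traces, min(sample_size, len(traces)))
    return {
        "total_cost_usd": round(total_cost, 4),
        "suspected_loops": suspects,
        "sample": sample,
    }
```

The script only narrows the search. Someone still has to read the sampled traces.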

Founder.

Solo founder deploying an agent: the minimum viable ops stack fits on one laptop and costs almost nothing.

The solo deploy stack

  • Host. Fly.io or Railway free tier.
  • Storage. SQLite on a volume, backed up nightly to S3.
  • Logs. Local JSONL, rotated daily, synced to S3 or Cloudflare R2.
  • Monitoring. A cron job that pings your own /health endpoint, emails you when it breaks.

Total: $5-20/month. Scales to a hundred users easily.
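The monitoring piece can be a few lines of standard-library Python run from cron. This sketch assumes your app exposes a `/health` route that returns HTTP 200, which the web server in the build block does not define yet. Alerting is left to whatever mails you on a non-zero exit.

```python
import urllib.error
import urllib.request


def is_healthy(url: str, timeout_s: float = 5.0,
               opener=urllib.request.urlopen) -> bool:
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with opener(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection refused, DNS failure, timeouts, and HTTP error statuses.
        return False


def check(url: str) -> int:
    """Exit code for a cron job: 0 when healthy, 1 otherwise."""
    if is_healthy(url):
        return 0
    print(f"health check failed: {url}")
    return 1
```

A crontab entry that calls `check` with your URL every few minutes and exits with its return value is usually enough at this scale.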

The weekly ops ritual

Every Friday, 15 minutes:

  • Read 10 random traces from the week.
  • Look at the cost chart.
  • Read the top 3 user complaints or confusion points.
  • Note what's worth fixing this week.

The one thing to remember

Deploy is not a moment. It's a practice.

The first ship is easy. The hundredth is what tests whether you built infrastructure or a demo. Prompts as files, logs as traces, rollback as a flipped pointer. Every deploy you do from here uses the same five moves.
