Past the playground, into an app your users can use.
Every agent demo you've seen lives in Claude Desktop or a Jupyter notebook. Those are playgrounds. They're not apps. Your teammates can't use them. Your users will never see them.
The Agent SDK is the bridge. It gives you the primitives to embed Claude's agent loop in your own application, with your own tools, your own memory, your own UI.
This module is 90 minutes of going from playground to shipped app. By the end, you'll have:
- A working mental model of the SDK's primitives.
- A CLI agent app that runs on your laptop.
- A deployment path to put it in front of teammates.
Prereq: Modules 001 (agent loop) and 003 (tools). If the agent loop is still fuzzy, circle back.
Thinker.
The Agent SDK is four primitives:
- Messages. The conversation with the model.
- Tools. Functions the model can call. Defined in code, not MCP (though MCP interop exists).
- System prompt. Agent identity and rules.
- The loop. You call the model, handle tool calls, call again, stop when done.
The agent loop in code
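done = False
while not done:
    response = client.messages.create(...)  # model, max_tokens, system, tools, messages
    # Record the assistant turn so the next call can see the tool request.
    messages.append({"role": "assistant", "content": response.content})
    if response.stop_reason == "tool_use":
        # run_tools executes each tool_use block and returns tool_result blocks.
        tool_results = run_tools(response.content)
        messages.append({"role": "user", "content": tool_results})
    else:
        done = True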
That's the whole loop. Everything else (streaming, memory, budgets) is trim around it.
SDK vs. MCP vs. Claude Code
- SDK: you build the app.
- MCP: you expose tools to existing clients.
- Claude Code: the CLI/IDE client itself.
SDK is for when you want the agent inside your product.
Talker.
The system prompt for an SDK agent is the same contract from Module 002, plus one thing: it knows about your tools.
The template
You are [agent name], a [role] for [app name].
Rules:
- [imperative 1]
- [imperative 2]
- Never call a tool unless the user's request requires it.
- Always return a final text response after using tools.
Tools available:
- search_docs: searches our internal docs by keyword.
- create_ticket: files a bug or request.
- get_user_profile: looks up a user by email.
Use the minimum number of tools needed.
Escape hatch: if the request doesn't match any tool, say so
plainly and ask one clarifying question.
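One way to keep the "Tools available" list from drifting away from the code is to generate it from the same tool definitions you pass to the API. The sketch below is a minimal illustration, not part of the SDK: build_system_prompt is a hypothetical helper, and the agent and app names are made up.

```python
def build_system_prompt(agent_name, role, app_name, rules, tools):
    """Render the template above from structured inputs (hypothetical helper)."""
    lines = [f"You are {agent_name}, a {role} for {app_name}.", "Rules:"]
    lines += [f"- {rule}" for rule in rules]
    lines.append("Tools available:")
    # Reuse each tool's own description so prompt and code cannot disagree.
    lines += [f"- {t['name']}: {t['description']}" for t in tools]
    lines.append("Use the minimum number of tools needed.")
    return "\n".join(lines)


tools = [
    {"name": "search_docs", "description": "searches our internal docs by keyword."},
    {"name": "create_ticket", "description": "files a bug or request."},
]
prompt = build_system_prompt(
    "DocsBot", "support assistant", "Acme Wiki",  # illustrative names only
    ["Never call a tool unless the user's request requires it.",
     "Always return a final text response after using tools."],
    tools,
)
print(prompt)
```

Adding a tool to the list then updates the prompt on the next run, so there is one place to edit instead of two.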
Tool definitions in code
tools = [
    {
        "name": "search_docs",
        "description": "Searches internal docs. Returns up to 5 snippets matching the query.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"]
        }
    }
]
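A schema only tells the model what it may call; your code still has to route each call to a real function. The sketch below shows one possible dispatcher. The names TOOL_REGISTRY and run_tools and the stub search_docs are assumptions for illustration, and the fake blocks only imitate the shape of the SDK's tool_use blocks so the example runs offline.

```python
from types import SimpleNamespace


def search_docs(query):
    # Stand-in implementation; a real one would query your docs index.
    return f"No results for {query!r} (stub)"


# Map each tool name from the schema to the Python function that implements it.
TOOL_REGISTRY = {"search_docs": search_docs}


def run_tools(content_blocks):
    """Execute every tool_use block and return matching tool_result blocks."""
    results = []
    for block in content_blocks:
        if block.type != "tool_use":
            continue
        func = TOOL_REGISTRY.get(block.name)
        if func is None:
            output, is_error = f"Unknown tool: {block.name}", True
        else:
            try:
                output, is_error = str(func(**block.input)), False
            except Exception as exc:  # report the failure to the model
                output, is_error = f"{type(exc).__name__}: {exc}", True
        results.append({
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": output,
            "is_error": is_error,
        })
    return results


# Simulated response content, shaped like the SDK's tool_use blocks.
fake_blocks = [
    SimpleNamespace(type="text", text="Let me look that up."),
    SimpleNamespace(type="tool_use", id="toolu_01", name="search_docs",
                    input={"query": "refunds"}),
]
print(run_tools(fake_blocks))
```

Returning errors as tool results, rather than raising, gives the model a chance to explain the failure to the user.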
Rememberer.
SDK agents need state. Three shapes:
- Conversation state. Messages list in memory during a session.
- User state. Preferences, history, profile. Stored per-user in a DB.
- System state. Rate limits, cache, auth tokens. In Redis or similar.
File layout
my-agent-app/
  agent/
    __init__.py
    loop.py       (the main loop)
    tools.py      (tool implementations)
    system.txt    (system prompt)
  storage/
    sessions.db
    users.db
  tests/
  pyproject.toml
  .env.example
Sessions live short. Users live forever. Treat them differently.
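To make "treat them differently" concrete, here is a rough sketch of a session store with an expiry, using the standard library's sqlite3. The table layout, the function names, and the 24-hour TTL are all assumptions chosen for illustration; user records would live in a separate table that this purge never touches.

```python
import json
import sqlite3
import time

SESSION_TTL_SECONDS = 24 * 60 * 60  # assumed retention window; tune for your app


def open_sessions(path="storage/sessions.db"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sessions ("
        "session_id TEXT PRIMARY KEY, messages TEXT NOT NULL, updated_at REAL NOT NULL)"
    )
    return conn


def save_session(conn, session_id, messages, now=None):
    # Messages must be JSON-serializable dicts (not SDK block objects).
    now = time.time() if now is None else now
    conn.execute(
        "INSERT OR REPLACE INTO sessions VALUES (?, ?, ?)",
        (session_id, json.dumps(messages), now),
    )
    conn.commit()


def load_session(conn, session_id):
    row = conn.execute(
        "SELECT messages FROM sessions WHERE session_id = ?", (session_id,)
    ).fetchone()
    return json.loads(row[0]) if row else []


def purge_expired(conn, now=None):
    """Delete sessions older than the TTL; user records are never purged here."""
    now = time.time() if now is None else now
    cur = conn.execute(
        "DELETE FROM sessions WHERE updated_at < ?", (now - SESSION_TTL_SECONDS,)
    )
    conn.commit()
    return cur.rowcount


conn = open_sessions(":memory:")  # in-memory DB for the demo
save_session(conn, "s1", [{"role": "user", "content": "hi"}], now=0)
save_session(conn, "s2", [{"role": "user", "content": "hello"}], now=100_000)
print(purge_expired(conn, now=100_000))
```

In the demo, session s1 is older than the TTL and is deleted, while s2 survives.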
The context budget
Every tool result, every past message, and every system prompt token goes into the next call. Watch the budget. A long session adds up quickly: 20 tool calls with 5,000-token results is 100,000 tokens of history before you count the conversation itself, so you will hit the limit faster than you expect.
Trim old tool results. Summarize old messages. Context is finite.
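A simple way to trim is to keep the newest tool results intact and replace older ones with a short placeholder. The sketch below assumes messages are plain dicts in the shape used elsewhere in this module; the function name, the placeholder text, and the keep_last default are illustrative choices, not SDK features.

```python
def trim_old_tool_results(messages, keep_last=2, placeholder="[tool result trimmed]"):
    """Return a copy of messages with all but the newest tool results stubbed out."""
    # Find indexes of user messages that carry tool_result blocks.
    carriers = [
        i for i, m in enumerate(messages)
        if isinstance(m["content"], list)
        and any(isinstance(b, dict) and b.get("type") == "tool_result" for b in m["content"])
    ]
    to_trim = set(carriers[:-keep_last]) if keep_last else set(carriers)
    trimmed = []
    for i, m in enumerate(messages):
        if i not in to_trim:
            trimmed.append(m)
            continue
        blocks = [
            {**b, "content": placeholder}
            if isinstance(b, dict) and b.get("type") == "tool_result" else b
            for b in m["content"]
        ]
        trimmed.append({**m, "content": blocks})
    return trimmed


history = [{"role": "user", "content": "summarize three files"}]
for n in range(3):
    history.append({"role": "assistant", "content": [
        {"type": "tool_use", "id": f"t{n}", "name": "read_file", "input": {"path": f"{n}.md"}}]})
    history.append({"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": f"t{n}", "content": "x" * 5000}]})

slim = trim_old_tool_results(history, keep_last=2)
print(sum(len(str(m)) for m in history), "->", sum(len(str(m)) for m in slim))
```

The tool_use_id on each trimmed block is preserved, so every tool call in the history still has a matching result.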
Doer.
Twelve minutes. Ship a CLI agent app that uses tools and runs a real loop.
Step 1. Scaffold (2 min)
mkdir my-agent-app && cd my-agent-app
python -m venv .venv && source .venv/bin/activate
pip install anthropic
Step 2. Write the loop (5 min)
# agent.py (Step 3 runs this file by name)
import sys

import anthropic

# The client reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

SYSTEM = """You are a file-reading agent. When the user asks
about a file, use read_file. Always return a plain text
summary after reading."""

tools = [{
    "name": "read_file",
    "description": "Reads the contents of a local file.",
    "input_schema": {"type": "object", "properties":
        {"path": {"type": "string"}}, "required": ["path"]}
}]


def read_file(path):
    # Return errors as text so the model can report them instead of crashing.
    try:
        with open(path) as f:
            return f.read()
    except OSError as exc:
        return f"Error reading {path}: {exc}"


def run(user_input):
    messages = [{"role": "user", "content": user_input}]
    while True:
        resp = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            system=SYSTEM, tools=tools, messages=messages)
        # Keep the assistant turn in history, including any tool_use blocks.
        messages.append({"role": "assistant", "content": resp.content})
        if resp.stop_reason != "tool_use":
            # Join every text block; the first block is not guaranteed to be text.
            return "".join(b.text for b in resp.content if b.type == "text")
        tool_results = []
        for block in resp.content:
            if block.type == "tool_use":
                result = read_file(block.input["path"])
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result[:2000]  # truncate to protect the context budget
                })
        # Tool results go back to the model as a user message.
        messages.append({"role": "user", "content": tool_results})


if __name__ == "__main__":
    print(run(sys.argv[1]))
Step 3. Run (2 min)
python agent.py "summarize /tmp/test.md"
Step 4. Test edge cases (2 min)
Try: a nonexistent file (does it fail gracefully?), an empty file (does it say so?), an adversarial input such as a path outside the directory you meant to expose (does the agent read something it shouldn't?).
Step 5. Commit (1 min)
A working CLI agent. Takes a user query, reads files when needed, returns a summary. ~50 lines of code.
- Loops forever: add a max_iterations counter. If you hit 10 tool calls, bail.
- Tool result too big: truncate at 2000 chars like the example, or summarize first.
- Auth errors: check ANTHROPIC_API_KEY.
Rookie.
Failure 1. The infinite tool loop
Your agent calls a tool, gets a result that prompts another call, then another. 30 tool calls later, you've burned $5 and the user is waiting.
Fix: hard cap on iterations. 10 tool calls max per user turn. Above that, return an error and ask the user for more context.
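A capped loop might look like the sketch below. To keep it runnable without an API key, the model call and the tool runner are passed in as functions; run_turn, call_model, and their signatures are assumptions for illustration. Note that the cap counts loop iterations, and a single iteration can contain more than one tool_use block.

```python
MAX_ITERATIONS = 10  # the cap suggested above


def run_turn(call_model, run_tools, messages, max_iterations=MAX_ITERATIONS):
    """Run one user turn with a hard cap (hypothetical injected signatures).

    call_model(messages) -> (stop_reason, content)
    run_tools(content)   -> list of tool_result blocks
    """
    for _ in range(max_iterations):
        stop_reason, content = call_model(messages)
        messages.append({"role": "assistant", "content": content})
        if stop_reason != "tool_use":
            return content
        messages.append({"role": "user", "content": run_tools(content)})
    # Cap reached: stop spending and hand control back to the user.
    return "I hit the tool-call limit for this request. Can you give me more context?"


# A fake model that asks for a tool forever, to show the cap firing.
calls = {"n": 0}


def stuck_model(messages):
    calls["n"] += 1
    return "tool_use", [{"type": "tool_use", "id": f"t{calls['n']}", "name": "noop", "input": {}}]


def fake_tools(content):
    return [{"type": "tool_result", "tool_use_id": content[0]["id"], "content": "ok"}]


reply = run_turn(stuck_model, fake_tools, [{"role": "user", "content": "go"}])
print(calls["n"], reply)
```

The same pattern drops into the Doer example by replacing the unbounded loop with a bounded for loop and returning the fallback message after it.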
Failure 2. Dumping whole tool results
Your tool returns 50KB of JSON. The agent consumes all of it. Next turn, you're at 80% context.
Fix: summarize tool results before passing them back. "Returned 47 records; top 5 matches: ..." Agents don't need raw data to reason.
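As a rough sketch of that fix, the helper below turns a list of records into the kind of one-line summary quoted above. It assumes the records are already ranked, and summarize_records, top_n, and max_chars are illustrative names and defaults.

```python
import json


def summarize_records(records, top_n=5, max_chars=2000):
    """Compress a list of records into a short, model-friendly string."""
    head = records[:top_n]  # assumes records arrive sorted by relevance
    body = json.dumps(head, ensure_ascii=False)
    if len(body) > max_chars:
        body = body[:max_chars] + "... [truncated]"
    return f"Returned {len(records)} records; top {len(head)} matches: {body}"


records = [{"id": i, "title": f"Doc {i}"} for i in range(47)]
summary = summarize_records(records)
print(summary)
```

If the model needs a record that was cut, it can ask for it with a narrower query, which is usually cheaper than carrying all 47 in every later call.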
Failure 3. No streaming
Your user types a question. Your agent thinks for 8 seconds. Blank screen. User closes tab.
Fix: stream responses. Show the partial output as it arrives. Users will wait 30 seconds if they see progress, and 3 if they don't.
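The anthropic Python package provides a streaming helper, client.messages.stream, that yields text as it is generated. The sketch below wraps it in a hypothetical stream_reply function that takes the client as an argument, so the function can be exercised with a stand-in client; it covers a plain text reply only, not a turn that includes tool calls.

```python
def stream_reply(client, prompt, model, write=lambda s: print(s, end="", flush=True)):
    """Stream a single reply, writing text chunks as they arrive."""
    chunks = []
    # messages.stream is used as a context manager; text_stream yields text deltas.
    with client.messages.stream(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for text in stream.text_stream:
            write(text)          # show progress immediately
            chunks.append(text)  # keep the full reply for the message history
    return "".join(chunks)


# Usage with the real client (requires ANTHROPIC_API_KEY):
#   import anthropic
#   stream_reply(anthropic.Anthropic(), "Say hello.", "claude-sonnet-4-6")
```

The write callback is the seam for your UI: a CLI prints, while a web app would push each chunk over a websocket or server-sent events.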
Manager.
Agent apps in a team repo are real software. They get CI, code review, versioned deploys.
The eval suite
Every agent app has an eval suite. Real inputs, expected outputs, pass/fail. Run before every deploy. Same discipline as Module 006.
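An eval suite does not need a framework to start. The sketch below is one minimal shape: each case pairs an input with a check function, and the harness reports pass or fail. The case names, the checks, and fake_agent are placeholders; in practice you would pass your real agent function and inputs captured from actual use.

```python
def run_evals(agent, cases):
    """Run each case through agent(input) and collect pass/fail results."""
    results = []
    for case in cases:
        try:
            output = agent(case["input"])
            passed = bool(case["check"](output))
        except Exception as exc:  # a crash counts as a failure, not a skipped case
            output, passed = f"{type(exc).__name__}: {exc}", False
        results.append({"name": case["name"], "passed": passed, "output": output})
    return results


# Illustrative cases only.
CASES = [
    {"name": "mentions_file", "input": "summarize notes.md",
     "check": lambda out: "notes.md" in out},
    {"name": "non_empty", "input": "summarize empty.md",
     "check": lambda out: len(out.strip()) > 0},
]


def fake_agent(user_input):
    # Stand-in for the real agent so the harness runs offline.
    return f"Summary of {user_input.split()[-1]}: (stub)"


results = run_evals(fake_agent, CASES)
failed = [r["name"] for r in results if not r["passed"]]
print(f"{len(results) - len(failed)}/{len(results)} passed")
```

In CI, exiting with a non-zero status when failed is non-empty is enough to block a deploy.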
Ownership
One owner per agent. They merge PRs, approve prompt changes, respond to incidents. Agent apps behave like products, so they need product owners.
Prompt changes are deploys
A change to system.txt is a code change. Same PR, same review, same CI. Don't edit the prompt at runtime from a settings panel. That path ends in an outage.
Chief.
Agent apps in production carry three risks worth naming.
Risk 1. Per-call cost
Agent apps run more LLM calls per user action than simple chat apps. Budget accordingly. A chat app at $0.01/request becomes an agent app at $0.15/request. At scale, that is a different business model.
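To see what that gap means at volume, here is the arithmetic using the two per-request figures above. The monthly request count is an assumption picked only to make the numbers round.

```python
def monthly_cost_dollars(cents_per_request, requests_per_month):
    # Work in cents to keep the arithmetic exact.
    return cents_per_request * requests_per_month / 100


REQUESTS = 100_000  # assumed monthly volume, for illustration only
chat = monthly_cost_dollars(1, REQUESTS)    # $0.01 per request
agent = monthly_cost_dollars(15, REQUESTS)  # $0.15 per request
print(f"chat: ${chat:,.0f}  agent: ${agent:,.0f}  ratio: {agent / chat:.0f}x")
# prints: chat: $1,000  agent: $15,000  ratio: 15x
```

The ratio stays at 15x at any volume; what changes is whether the absolute number fits your pricing.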
Risk 2. Tool action risk
Agent tools that write, delete, or charge carry operational risk. One bug, and the agent executes the wrong thing 10,000 times. Require human-in-the-loop confirmation for any destructive tool call, until you trust the specific tool deeply.
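A confirmation gate can be a few lines in front of your tool dispatcher. In this sketch the tool names, the registry, and the confirm callback are all hypothetical; the point is only that destructive tools pass through a human decision before they run.

```python
DESTRUCTIVE_TOOLS = {"delete_record", "charge_card"}  # hypothetical tool names


def execute_tool(name, args, registry, confirm):
    """Run a tool, asking confirm(name, args) first if it is destructive."""
    if name in DESTRUCTIVE_TOOLS and not confirm(name, args):
        return {"status": "declined", "detail": f"{name} was not approved by a human."}
    return {"status": "ok", "detail": registry[name](**args)}


def cli_confirm(name, args):
    # Simple terminal prompt; a web app would render an approval dialog instead.
    answer = input(f"Allow {name} with {args}? [y/N] ")
    return answer.strip().lower() == "y"


registry = {
    "delete_record": lambda record_id: f"deleted {record_id}",
    "search_docs": lambda query: f"results for {query}",
}

# Demo with a canned approver that always says no, so it runs without a terminal.
print(execute_tool("search_docs", {"query": "refunds"}, registry, confirm=lambda n, a: False))
print(execute_tool("delete_record", {"record_id": 7}, registry, confirm=lambda n, a: False))
```

Read-only tools skip the prompt entirely, so the gate adds friction only where a mistake is expensive.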
Risk 3. Observability
When an agent misbehaves, you need to see the full trace: messages, tool calls, tool results. Set up structured logging from day one. Debugging an agent without traces is detective work with no evidence.
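Structured logging can start as one JSON line per event, tied together by a trace id. The sketch below uses only the standard library; log_event, the field names, and the event kinds are illustrative conventions, not a required schema.

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("agent.trace")


def log_event(trace_id, kind, **fields):
    """Emit one JSON line per event; kind is e.g. 'message', 'tool_call', 'tool_result'."""
    record = {"ts": time.time(), "trace_id": trace_id, "kind": kind, **fields}
    line = json.dumps(record, default=str)  # default=str keeps odd objects loggable
    logger.info(line)
    return line


logging.basicConfig(level=logging.INFO, format="%(message)s")
trace_id = uuid.uuid4().hex  # one id per user turn ties the events together
log_event(trace_id, "message", role="user", content="summarize /tmp/test.md")
log_event(trace_id, "tool_call", name="read_file", input={"path": "/tmp/test.md"})
line = log_event(trace_id, "tool_result", name="read_file", chars=1234)
```

Filtering the log by trace_id then reconstructs the full sequence for a single user turn.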
Founder.
Solo founder shipping an SDK-based agent app: the minimum viable stack is ~200 lines of Python.
The solo SDK loop
- Monday: write the agent loop. ~50 lines.
- Tuesday: define 3-5 tools. Implement each.
- Wednesday: write system.txt. Run it 10 times against real inputs.
- Thursday: add structured logging. Check every trace.
- Friday: wrap it in a tiny web UI (Gradio, Streamlit). Share the link.
Five-day ship. Not five months.
The three files that matter
- loop.py: the agent loop itself. Rarely changes after week 1.
- tools.py: one function per tool. Grows over time.
- system.txt: the prompt. Updated weekly as you discover failure modes.
The SDK is 50 lines of Python around a while loop.
Every agent app you'll ever ship is a variation on that loop. Learn it once. Ship forever. The rest (streaming, memory, cost control) is trim, not substance.