Module 029 · Desk VIII · Frontier Capabilities

Model upgrades.

Migrating your agents when the next model ships. What breaks, what gets better, how to move without downtime. 90 minutes.

90 minutes · 9 sections · ~7,500 words · Prereq: Module 028
Written for: Manager, Chief, Founder

The next Claude ships. Your agents don't work the same.

Model upgrades are supposed to be "drop in a new model name." In practice, your agents behave slightly differently on the new model: better on some tasks, worse on others. Regressions are possible. Catastrophic regressions are possible if you don't check.

Every major provider (Anthropic, OpenAI, Google) ships new models on irregular schedules. If you're running agents in production, migrations are a recurring event, roughly every 3-6 months, forever.

This module spends 90 minutes on running a migration cleanly. By the end, you'll have:

  • A model upgrade playbook you can run in under a day.
  • A regression suite that catches the behavior that slipped.
  • A rollback plan that takes 2 minutes when things go wrong.

Thinker.

Four things change between models, in order of impact.

  1. Instruction following. How strictly the model obeys your prompt. Usually gets better. Sometimes changes style.
  2. Tool use. Structure of tool calls, error handling, retry behavior. Often different in subtle ways.
  3. Output style. Verbosity, formatting preferences, default phrasing. Affects voice.
  4. Reasoning depth. Better on hard tasks, sometimes over-reasons on simple ones.

What doesn't change

The API shape is usually stable. The underlying concepts (messages, tools, system prompts) persist. Most of your code works unchanged. It's the model's behavior that shifts, not the interface.

The migration principle

Always have the previous model pinned as a fallback. Ship the new model behind a flag. Run evals. Watch cost and quality for a week. Flip fully only when evals and traces look clean.
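
The principle above can be sketched as a flag-gated call with the pinned model as the safety net. This is a minimal sketch, not a definitive implementation: ModelRollout, run_with_fallback, and the injected call_model function are hypothetical names, and falling back on any exception is an assumption about how you want errors handled.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class ModelRollout:
    pinned_model: str        # previous model, always kept as the fallback
    candidate_model: str     # new model, shipped behind the flag
    candidate_enabled: bool  # the flag; set to False to roll back


def run_with_fallback(
    rollout: ModelRollout,
    call_model: Callable[[str, str], str],  # hypothetical: (model, prompt) -> output
    prompt: str,
) -> Tuple[str, str]:
    """Return (model_used, output). Uses the candidate only when the flag is on."""
    if rollout.candidate_enabled:
        try:
            return rollout.candidate_model, call_model(rollout.candidate_model, prompt)
        except Exception:
            # Assumption: any candidate failure falls through to the pinned model.
            pass
    return rollout.pinned_model, call_model(rollout.pinned_model, prompt)
```

Returning the model name next to the output makes it easy to tag traces, so the week of watching cost and quality can be split by model.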

Talker.

The regression prompt

Run per agent, when a new model lands.

You are a regression auditor. I'm upgrading model X to
model Y for agent Z.

I'll provide:
- The system prompt.
- 20 sample inputs.
- Outputs from model X (baseline).
- Outputs from model Y (candidate).

For each input, report:
- Does Y match X's behavior on this input? yes / no /
  different-but-acceptable
- If no or different: what changed? One line.

At the end, produce:
- Count of matches, acceptable-differences, regressions.
- Top 3 themes in the differences.
- Recommendation: ship / hold / partial (flip for
  specific input types only).

This turns migration from vibes-based into eval-based. Decisions become defensible.
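
If you want to double-check the auditor's arithmetic, the per-input verdicts can be tallied in a few lines. A minimal sketch, assuming you have extracted one verdict string per input from the auditor's report. The function names and the zero-regression threshold are illustrative assumptions, not part of the prompt.

```python
from collections import Counter
from typing import Dict, Iterable

# The three verdict labels the regression prompt asks for.
VERDICTS = ("yes", "different-but-acceptable", "no")


def tally_verdicts(verdicts: Iterable[str]) -> Dict[str, int]:
    """Count matches, acceptable differences, and regressions."""
    counts = Counter()
    for raw in verdicts:
        verdict = raw.strip().lower()
        if verdict not in VERDICTS:
            raise ValueError(f"unexpected verdict: {raw!r}")
        counts[verdict] += 1
    return {
        "matches": counts["yes"],
        "acceptable_differences": counts["different-but-acceptable"],
        "regressions": counts["no"],
    }


def recommend(summary: Dict[str, int], max_regressions: int = 0) -> str:
    """Assumed rule: ship only if regressions stay at or under the threshold."""
    return "ship" if summary["regressions"] <= max_regressions else "hold"
```

The "partial" outcome is left out on purpose: deciding which input types to flip needs a human reading the themes, not a counter.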

Rememberer.

Every production agent pins a model. Every pin lives in version control.

[agent-repo]/
  config.yaml             (contains model: claude-sonnet-4-6)
  evals/
    eval-inputs.json
    baselines/
      claude-sonnet-4-6.json   (outputs from current prod)
      claude-sonnet-5-0.json   (outputs from candidate)
  migration-notes/
    2026-04-18-sonnet-4-to-5.md

The pin matters

Using claude-sonnet-latest feels convenient. It's a footgun. Provider ships a new model overnight, your agents behave differently in the morning, you didn't get to choose when.

Always pin a specific version. Upgrade on your schedule.
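
A pin policy is easy to enforce in CI. The sketch below flags any agent whose configured model name contains a "latest" token. It assumes floating aliases are spelled that way, which may not hold for every provider, and the function names are hypothetical.

```python
import re
from typing import Dict, List


def is_floating_alias(model: str) -> bool:
    """True if the model name contains a 'latest' token (assumed alias convention)."""
    tokens = re.split(r"[-_:/.]", model.strip().lower())
    return "latest" in tokens


def unpinned_agents(agent_models: Dict[str, str]) -> List[str]:
    """Return the agents whose config points at a floating alias, sorted by name."""
    return sorted(
        agent for agent, model in agent_models.items() if is_floating_alias(model)
    )
```

Run it over every agent's config.yaml value and fail the build when the list is non-empty.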

Baselines are assets

Every agent has a baseline: the outputs of the eval suite on the current model. When a new model ships, you diff. No baseline = no migration data.
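
A first-pass diff between two baseline files might look like this. It is a sketch under assumptions: each baseline is taken to be a mapping from input id to output text (the file layout above does not specify the JSON shape), and the 0.9 similarity threshold is arbitrary.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def diff_baselines(
    baseline: Dict[str, str],
    candidate: Dict[str, str],
    threshold: float = 0.9,  # assumed cutoff between "similar" and "changed"
) -> List[Tuple[str, str, float]]:
    """Return (input_id, status, similarity) for every input in the baseline."""
    report = []
    for input_id, old in baseline.items():
        new = candidate.get(input_id)
        if new is None:
            report.append((input_id, "missing", 0.0))
            continue
        ratio = SequenceMatcher(None, old, new).ratio()
        if old == new:
            status = "same"
        elif ratio >= threshold:
            status = "similar"
        else:
            status = "changed"
        report.append((input_id, status, round(ratio, 2)))
    return report
```

Text similarity only tells you where to look. It says nothing about which output is better, so the "changed" rows still go through the regression prompt.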

Doer.

Twelve minutes. Run a regression test on one agent against a new model.

Build block · 12 minutes
Dry-run a model migration

Step 1. Pick the agent (1 min)

The one handling the most calls in production.

Step 2. Grab 20 real inputs (2 min)

From your logs. Real diversity: happy path, edges, adversarial.

Step 3. Run both models (5 min)

Script: run each input through model X (current) and model Y (new). Save outputs. JSONL format:

{"input": "...", "model_x": "...", "model_y": "..."}
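
The script for this step can be short. A minimal sketch: call_model is a hypothetical stand-in for whatever provider SDK call you already use, injected so the loop stays testable, and the helper names are illustrative.

```python
import json
from typing import Callable, Dict, Iterable, List


def run_side_by_side(
    inputs: Iterable[str],
    call_model: Callable[[str, str], str],  # hypothetical: (model, input) -> output
    model_x: str,
    model_y: str,
) -> List[Dict[str, str]]:
    """Run every input through both models and collect one row per input."""
    rows = []
    for text in inputs:
        rows.append(
            {
                "input": text,
                "model_x": call_model(model_x, text),
                "model_y": call_model(model_y, text),
            }
        )
    return rows


def to_jsonl(rows: Iterable[Dict[str, str]]) -> str:
    """Serialize rows as JSONL: one JSON object per line."""
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows)
```

Write the result of to_jsonl to a file under evals/ so the run doubles as the candidate baseline.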

Step 4. Run the regression prompt (3 min)

Paste the outputs into the regression prompt from Talker. Get a verdict.

Step 5. Decide (1 min)

Based on the verdict: ship, hold, or partial. Write one paragraph in migration-notes/ explaining what you decided and why.

Expected

A structured migration decision. Evidence-based. Documented.

If something's wrong
  • Outputs differ a lot but you can't tell which is better: your evals measure structure, not quality. Add a "this is good because..." rubric.
  • Y is consistently worse: report it to the provider, then hold or adapt the prompt for Y.
  • Y is consistently better but costs 3x: partial flip for high-value tasks only. Keep X for volume work.
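
The cost side of a partial flip is simple arithmetic. The sketch below computes the blended cost relative to running everything on X, assuming cost scales linearly with the share of traffic and taking the 3x ratio from the case above. The function name is hypothetical.

```python
def blended_cost_multiplier(share_on_y: float, y_cost_ratio: float = 3.0) -> float:
    """Cost relative to all-X when share_on_y of traffic runs on Y.

    share_on_y is a fraction between 0.0 and 1.0.
    y_cost_ratio is how many times more Y costs per call than X.
    """
    if not 0.0 <= share_on_y <= 1.0:
        raise ValueError("share_on_y must be between 0.0 and 1.0")
    return (1.0 - share_on_y) + share_on_y * y_cost_ratio
```

Sending 20% of traffic to a 3x model gives a multiplier of about 1.4, so the bill goes up by roughly 40%, not 200%.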

Rookie.

Failure 1. Auto-upgrading with "latest"

You set model: latest. That isn't a pin. The provider ships a new flagship overnight. Your prompts break. You're debugging at 3am.

Fix: pin specific versions. Always. Your migrations happen on your schedule, not a provider's.

Failure 2. Migrating without evals

You update the model name, tests pass (because you have no tests), deploy. A week later, customers report regressions.

Fix: eval suite before migration. Baseline + candidate. No evals, no migration.

Failure 3. Full flip without canary

You flip 100% of traffic to the new model. Bug in the prompt interacting with new model behavior breaks 10,000 calls before you notice.

Fix: canary. 1% first, 10% next day, 50% day after, 100% by end of week. Always have a rollback flag.
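
One common way to implement the canary is deterministic hash bucketing, so the same request id always lands on the same model and raising the percentage only adds traffic. This is a sketch under assumptions: the function names are hypothetical, and bucketing by request id (rather than by user or account) is a choice you may want to change.

```python
import hashlib


def bucket(request_id: str) -> int:
    """Map a request id to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def pick_model(
    request_id: str,
    canary_percent: int,    # 1, 10, 50, 100 as the week goes on
    pinned: str,
    candidate: str,
    rollback: bool = False,  # the rollback flag: True sends everything to pinned
) -> str:
    """Route a request to the candidate if its bucket falls under the canary share."""
    if rollback or canary_percent <= 0:
        return pinned
    return candidate if bucket(request_id) < canary_percent else pinned
```

Because buckets are stable, a request that hit the candidate at 10% still hits it at 50%, which keeps traces comparable from day to day.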

Manager.

The migration calendar

When a provider ships a new model, the team has 30-60 days to evaluate. Put it on a calendar. Don't let it linger until someone asks "are we on the new model yet?"

Per-agent migration plans

Each agent owner runs the regression prompt, decides ship / hold / partial, writes a migration note. The team lead compiles a portfolio view.

Model portfolio

Not every agent needs to run on the flagship. Some should stay on cheaper or older models. A portfolio view shows which agents run on which models, and why. This view catches drift and surfaces optimization opportunities.
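
The portfolio view can start as a grouping over each agent's config. A minimal sketch, assuming each agent record carries a name, a pinned model, and a one-line reason. The record shape and function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def portfolio_view(
    agents: Iterable[Dict[str, str]],
) -> Dict[str, List[Tuple[str, str]]]:
    """Group agents by model: {model: [(agent_name, reason), ...]}."""
    by_model = defaultdict(list)
    for agent in agents:
        by_model[agent["model"]].append((agent["name"], agent["reason"]))
    # Sort models and agents so the view is stable across runs and easy to diff.
    return {model: sorted(entries) for model, entries in sorted(by_model.items())}
```

An agent with an empty reason is the drift this view is meant to catch.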

Chief.

Risk 1. Silent regression

You migrate. Evals pass. Something the evals didn't cover breaks. Customer reports surface weeks later.

Governance: eval suites get updated as new failure modes emerge. A good migration is also an opportunity to improve the eval suite. Any customer-surfaced regression becomes a new eval case, permanently.

Risk 2. Vendor dependency

Every model migration reminds you: your stack depends on one provider's release schedule. If they change terms, discontinue a model, or change behavior, you adapt on their timeline, not yours.

Governance: annual model-portability review. Can your top 3 agents run on an alternative provider? What would it take?

Risk 3. Accumulated pin debt

After 3 years of "we'll migrate next quarter," you have agents running on models that are 4 versions behind. Migration becomes painful; skipping becomes a habit.

Governance: no agent runs on a model more than 18 months old. Force-migration by policy. Budget the time.
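
The 18-month policy can be checked mechanically. A sketch under assumptions: you maintain your own table of model release dates (release_dates here is hypothetical input, not something a provider API is assumed to return), and age is counted in whole calendar months.

```python
from datetime import date
from typing import Dict, List, Tuple


def months_between(start: date, end: date) -> int:
    """Whole calendar months elapsed from start to end."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1  # the current month is not complete yet
    return months


def overdue_agents(
    agent_models: Dict[str, str],    # agent name -> pinned model
    release_dates: Dict[str, date],  # pinned model -> release date (your own table)
    today: date,
    max_age_months: int = 18,
) -> List[Tuple[str, str, int]]:
    """Return (agent, model, age_in_months) for agents past the policy limit."""
    overdue = []
    for agent, model in sorted(agent_models.items()):
        age = months_between(release_dates[model], today)
        if age > max_age_months:
            overdue.append((agent, model, age))
    return overdue
```

Run it on a schedule and put the output in front of whoever budgets migration time.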

Founder.

Solo founder: model migrations are a quarterly ritual, not a crisis.

The solo migration loop

  1. New model lands. You hear about it.
  2. Within 2 weeks: run the regression prompt on your top agent.
  3. If it's better, canary at 10% for a week.
  4. If no issues, flip to 100%.
  5. Commit the migration note. Pin the new version.

Two hours per migration. Quarterly. Forever.

The file you keep

  • ~/.bot/migrations/[agent-name].md with one entry per migration.

Six migrations in, you have a clear history of how your agents evolved alongside model progress.

The one thing to remember

Migration is a recurring event. Build for it.

Pin specific versions. Baseline outputs. Regression-test. Canary. Document. A good migration takes an afternoon. A bad one takes a week. The difference is whether you built the infrastructure before you needed it. This module is that infrastructure. You'll run this loop for the rest of your career with LLMs.
