Module 009 · Report pipelines.

Section 01

Hello

Opens the module·Names the problem

Data is not a report. The work of turning one into the other is exactly the work agents can do.

Every week, someone on your team pulls numbers from three systems, pastes them into a doc, writes a paragraph saying what they mean, adds some caveats, and sends it to a distribution list. That someone spends two hours on a task that's 80 percent mechanical and 20 percent judgment.

Today we build the agent that does the 80 percent. The 20 percent stays with the human.

By the end of 90 minutes:

A weekly-metrics report agent that takes a CSV or JSON of data and produces a narrative summary.
Constraints that prevent the agent from hallucinating numbers.
A template system so the report looks the same every week.
A review gate so a human always signs off before distribution.

Report pipelines are one of the most abused AI use cases. Done badly, the agent invents numbers and sounds confident. Done well, the agent saves hours a week with strict grounding in the data. This module is about doing it well.

Prereq: Module 008. Understanding the workflow pattern helps.

Thinker lays out the four stages.

Section 02

Thinker

Reasoning·Four stages

A report pipeline has four stages. Understanding them separately is the difference between a useful agent and one that makes stuff up.

Stage 1

Collect

Pull data from the source systems. CSV exports, database queries, API calls. Not an LLM task.

Stage 2

Analyze

Compute metrics, deltas, aggregations. Python or SQL. Deterministic.

Stage 3

Draft

Turn the analysis into narrative. This is where the agent earns its place.

Stage 4

Review

Human reads the draft, checks the numbers, approves, distributes.

Stages 1 and 2 are not LLM work

This is the most important insight in this module.

Data collection and analysis are deterministic. You have a query. It returns numbers. The numbers are correct or they're not. Don't ask an LLM to fetch your data. Don't ask an LLM to compute your aggregations. In both cases the model will sometimes hallucinate.

Write those stages as plain code. Python, SQL, a shell script. The output is a structured blob of numbers. That blob is the agent's input.
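A minimal sketch of what Stage 2 could look like in Python. The function names and the raw totals are hypothetical stand-ins for whatever Stage 1 hands you; the output shape mirrors the JSON used later in this module.

```python
import json

def metric(current: float, previous: float) -> dict:
    """Build one metric entry: value, percent change, and direction."""
    if previous == 0:
        # A percent change against zero is undefined; fail loudly instead of guessing.
        raise ValueError("previous value is zero, delta_pct is undefined")
    delta_pct = round((current - previous) / previous * 100, 1)
    if delta_pct > 0:
        direction = "up"
    elif delta_pct < 0:
        direction = "down"
    else:
        direction = "flat"
    return {"value": current, "delta_pct": delta_pct, "direction": direction}

def build_blob(period: str, current: dict, previous: dict) -> dict:
    """Combine per-metric entries into the structured input for Stage 3."""
    return {
        "period": period,
        "metrics": {name: metric(current[name], previous[name]) for name in current},
    }

# Hypothetical raw totals, standing in for the output of Stage 1.
this_week = {"active_users_weekly": 4820, "new_signups": 312}
last_week = {"active_users_weekly": 4463, "new_signups": 318}

blob = build_blob("2026-W16", this_week, last_week)
print(json.dumps(blob, indent=2))
```

Everything here is deterministic. Run it twice on the same totals and you get the same blob, which is exactly the property the drafting stage depends on.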

Stage 3 is the agent's whole job

The agent takes a blob of verified numbers and turns it into prose. "Active users: 4,820 (up 8% week-over-week)" becomes "User engagement continued its upward trend this week, reaching 4,820 active users, an 8% increase from last week."

The agent doesn't invent the number. It wraps the number in language. That's the craft: ground every sentence in a specific number from the input, and never introduce a number that isn't in the input.

Stage 4 is non-negotiable

Every report gets a human review before it goes out. No exceptions. Even the most reliable pipeline has the occasional LLM flub, and reports that go to stakeholders are exactly the wrong place to let a small error slip through.

Review takes five minutes. The agent saved ninety. The math is still strongly positive.

Why separate the stages

Because each stage has different failure modes and different fixes.

Stage 1 breaks: the data source changed. Fix the query.
Stage 2 breaks: the math is wrong. Fix the script.
Stage 3 breaks: the narrative is off. Fix the prompt.
Stage 4 breaks: the reviewer missed something. Tighten the review checklist.

When you conflate stages ("the agent does everything"), every failure requires investigating the whole pipeline. When you separate them, diagnosis is fast.

The north star of this module

The agent writes prose. Your code computes numbers.

Never let the LLM do arithmetic. Never let it pull data. Give it verified numbers and ask it to narrate. That's the whole discipline of a reliable report pipeline.

Talker on how to ground the narrative in the data.

Section 03

Talker

Prompts·Narrative from data

The drafting prompt has one job: turn numbers into narrative without introducing new numbers.

The grounding rule

Every factual claim in the report must cite a specific field from the input data. Not "users grew a lot" but "active_users_weekly grew 8%."

The prompt enforces this:

Rules:
- Every factual claim in the report must reference a specific
  field from the provided data.
- If a claim is not supported by a specific field, do not include it.
- Never compute new numbers. Only use the numbers provided.
- Never infer causation. Report correlation if present in data.

The structured input

The agent reads data as JSON, not as free text. JSON keys are the names the agent can cite.

{
  "period": "2026-W16",
  "metrics": {
    "active_users_weekly": {"value": 4820, "delta_pct": 8.0, "direction": "up"},
    "new_signups": {"value": 312, "delta_pct": -2.0, "direction": "down"},
    "revenue_weekly": {"value": 48200, "delta_pct": 12.0, "direction": "up"}
  },
  "highlights": ["Feature X launched Monday"],
  "notes": ["Week includes Easter holiday, may affect signups"]
}

The agent cites active_users_weekly.value, not "active users." The naming makes the grounding visible and checkable.
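One way to make the citable names explicit is to flatten the JSON into dotted paths before review. This is a sketch, not part of the agent itself; the `flatten` helper and the path format (including the `metrics.` prefix and `[0]` list indexes) are assumptions chosen for illustration.

```python
def flatten(node, prefix=""):
    """Yield (dotted_path, value) pairs for every scalar leaf in the data."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from flatten(child, f"{prefix}[{index}]")
    else:
        yield prefix, node

data = {
    "period": "2026-W16",
    "metrics": {
        "active_users_weekly": {"value": 4820, "delta_pct": 8.0, "direction": "up"},
    },
    "highlights": ["Feature X launched Monday"],
}

fields = dict(flatten(data))
for path, value in fields.items():
    print(f"{path} = {value!r}")
```

The printed list is the full vocabulary of facts available to the report. A reviewer can hold the draft against it line by line.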

The template constraint

Reports should look the same every week. Not because uniformity is virtuous; because readers can scan consistent structure. Enforce the template in the prompt:

Format:

# Weekly Report - {period}

## Headline
One sentence summarizing the most important metric change.

## Metrics
One paragraph per metric. Cite the value and the delta.

## Highlights
Bullet points for each highlight from the input.

## Notes
Any caveats or context from the notes field.

## Look-ahead
One sentence on what to watch next week.

Five sections, same every week. The agent fills the template with content derived from the structured input. Variance only in content, never in structure.

The anti-hallucination patterns

Three specific rules to prevent invented numbers:

- If the input data does not contain a metric relevant to the
  section, write "(no data provided for this metric)" instead of
  making up a number.
- Never use words like "approximately" or "about" with numbers.
  Either quote the exact number or omit it.
- Never combine metrics to calculate new ratios. If "users" and
  "revenue" are both in the data, do not report revenue-per-user
  unless that field is also provided.

The third rule is the most important. Computing ratios is a very tempting LLM behavior and a very common source of hallucinated numbers. Ban it in the prompt; do the computation in Stage 2 if you want it in the report.
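If you do want a ratio in the report, a Stage 2 step along these lines would add it to the blob before the agent ever sees it. The field name `revenue_per_active_user` and the helper are hypothetical.

```python
def add_revenue_per_user(blob: dict) -> dict:
    """Add a derived ratio to the blob in place, computed in code rather than by the LLM."""
    metrics = blob["metrics"]
    revenue = metrics["revenue_weekly"]["value"]
    users = metrics["active_users_weekly"]["value"]
    # Rounded to cents; no delta is included because last week's ratio is not in this blob.
    metrics["revenue_per_active_user"] = {"value": round(revenue / users, 2)}
    return blob

blob = {
    "metrics": {
        "active_users_weekly": {"value": 4820},
        "revenue_weekly": {"value": 48200},
    }
}
add_revenue_per_user(blob)
print(blob["metrics"]["revenue_per_active_user"])  # {'value': 10.0}
```

Once the field exists in the input, the agent can cite it like any other number, and the no-computation rule stays intact.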

The tone calibration

Reports read cold or warm. The LLM defaults to warm and vague. You usually want something more like factual with minor editorial.

Tone:
- Factual and concise.
- No superlatives ("amazing", "tremendous").
- No hedging ("seems like", "might suggest").
- Short sentences preferred.
- Active voice.

Five lines of tone guidance change the output character significantly. Write these once, keep them in every report's prompt.

Build block · 4 minutes

Sketch your report template

For a report you'd actually send, write down the template:

# [Report name]

## [Section 1 name]
[One-line description of what goes here]

## [Section 2 name]
[One-line description]

...etc

Three to six sections. Each section has a one-line description of what content goes there. This is the template the agent will fill.

Expected output

A markdown template with 3-6 sections. Saved for the Doer build block.

Rememberer on templates and consistency.

Section 04

Rememberer

Memory·Templates and consistency

Reports live in time. Week 16 connects to week 15, which connects to week 14. Memory across reports is how the series becomes useful.

What persists across reports

Three things:

The template. Same structure every week.
Prior weeks' data. For week-over-week comparisons and trend language.
Editorial notes. Standing context. "We launched Feature X on Monday. Expect an uptick in signups."

The historical data memory

The drafting prompt needs the current period's data, plus enough prior periods to support trend language. Usually the last 4-8 weeks.

Storage: a flat file or a database table. Each row is one week's data. The pipeline fetches the last N weeks and includes them in the agent's context.

reports/
  2026-W14.json
  2026-W15.json
  2026-W16.json  (current)

When Stage 3 runs on this three-file example, it loads W14-W16 and includes all three in context. The agent can reference prior weeks for comparisons the data itself might not encode.
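Loading the last N weeks can be a few lines of Python. This sketch assumes the flat-file layout shown above, with zero-padded week numbers in the file names; `load_recent_weeks` is a hypothetical helper, and the demo writes synthetic files to a temporary folder.

```python
import json
import tempfile
from pathlib import Path

def load_recent_weeks(directory, count=3):
    """Return the most recent `count` weekly blobs, oldest first."""
    # Names like 2026-W16.json sort chronologically as plain strings,
    # provided the week number is zero-padded (W09, not W9).
    paths = sorted(Path(directory).glob("*-W*.json"))[-count:]
    return [json.loads(path.read_text(encoding="utf-8")) for path in paths]

# Demo against a throwaway folder with four synthetic weeks.
with tempfile.TemporaryDirectory() as folder:
    for week in (13, 14, 15, 16):
        period = f"2026-W{week}"
        Path(folder, f"{period}.json").write_text(
            json.dumps({"period": period}), encoding="utf-8"
        )
    history = load_recent_weeks(folder, count=3)

periods = [blob["period"] for blob in history]
print(periods)  # ['2026-W14', '2026-W15', '2026-W16']
```

Keeping `count` as a parameter makes it easy to move between a short window and the 4-8 weeks that trend language usually needs.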

The editorial notes system

Some context doesn't live in the data. "We had a marketing campaign Tuesday." "The integration with Provider X went down for four hours." These matter for the narrative but aren't metrics.

Maintain a notes file, appended to by the team throughout the week:

notes/2026-W16.md

- 2026-W16 Monday: Feature X launch announced.
- 2026-W16 Wednesday: Blog post about new integration.
- 2026-W16 Thursday: 3-hour incident on provider Y.

Stage 3 includes this file in context. The agent can weave the notes into the narrative: "Active users increased 8 percent, coinciding with Monday's Feature X launch."

The consistency enforcement

Reports drift if you let them. Week 16 looks slightly different from week 15. Section names change. Metrics come and go. The pipeline that was supposed to reduce variance introduces it.

Two enforcement tactics:

The template is committed code. Changes to the template go through the same review process as any other code change. Not "I tweaked the prompt last Friday."
A weekly diff check. Automated comparison of this week's report structure against last week's. Flag deviations for human review. Catches drift early.
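The weekly diff check can start as a comparison of heading outlines. The sketch below is one assumed way to do it: it ignores the level-1 title (which changes every week because it carries the period) and reports section headings that went missing, appeared, or moved.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def outline(report: str) -> list:
    """Return (level, title) for each markdown heading, in document order."""
    result = []
    for line in report.splitlines():
        match = HEADING.match(line)
        if match:
            result.append((len(match.group(1)), match.group(2).strip()))
    return result

def structure_drift(previous: str, current: str) -> dict:
    """Compare section headings below level 1 between two reports."""
    before = [h for h in outline(previous) if h[0] > 1]
    after = [h for h in outline(current) if h[0] > 1]
    missing = [title for level, title in before if (level, title) not in after]
    added = [title for level, title in after if (level, title) not in before]
    reordered = not missing and not added and before != after
    return {"missing": missing, "added": added, "reordered": reordered}

week15 = "# Weekly Report - 2026-W15\n\n## Headline\ntext\n\n## Metrics\ntext\n"
week16 = "# Weekly Report - 2026-W16\n\n## Headline\ntext\n\n## Key Metrics\ntext\n"

drift = structure_drift(week15, week16)
print(drift)  # {'missing': ['Metrics'], 'added': ['Key Metrics'], 'reordered': False}
```

Any non-empty result is a flag for the reviewer, not an automatic failure. This check sees headings only; paragraph-versus-bullet drift inside a section still needs a human eye.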

What NOT to keep in memory

Don't feed the agent the last year of reports. Tokens, cost, no value.

Don't feed it raw database logs. Structured metric summaries only.

Don't feed it the draft of last week's report (as opposed to the data). The agent will mimic the previous week's narrative structure too closely and the reports will feel formulaic.

Doer.

Section 05

Doer

Actions·Build the weekly report

Build a weekly-metrics report agent. We'll use synthetic data to stand in for a real source.

Build block · 12 minutes

Ship a weekly report pipeline

Step 1. Create the data file (2 min)

Make /tmp/week16.json:

{
  "period": "2026-W16",
  "metrics": {
    "active_users_weekly": {"value": 4820, "delta_pct": 8.0, "direction": "up"},
    "new_signups": {"value": 312, "delta_pct": -2.0, "direction": "down"},
    "revenue_weekly": {"value": 48200, "delta_pct": 12.0, "direction": "up"},
    "support_tickets": {"value": 127, "delta_pct": -15.0, "direction": "down"}
  },
  "highlights": ["Feature X launched Monday", "Press mention in TechCrunch Thursday"],
  "notes": ["Week includes one holiday", "3-hour incident on Wednesday"]
}

Step 2. Create the report agent (4 min)

Create .claude/agents/weekly-report.md:

---
name: weekly-report
description: Takes a JSON file of weekly metrics and produces a
  structured markdown report. Use this for weekly operational
  reporting.
tools: Read
---

You are a report-drafting agent. You read a JSON file of weekly
metrics and produce a markdown report.

Rules:
- Every factual claim must reference a specific field from the
  input data.
- Never compute new numbers. Use only the numbers provided.
- If data is missing for a section, write "(no data provided)"
  instead of inventing.
- No words like "approximately", "about", "around" with numbers.
- Active voice. Short sentences. Factual tone.

Template:

# Weekly Report - {period}

## Headline
One sentence summarizing the most important metric change.

## Metrics
One paragraph per metric. Cite exact values and deltas.

## Highlights
Bullet points from the highlights field.

## Context
Any caveats from the notes field.

## Look-ahead
One sentence on what to watch next week.

If input is not valid JSON, reply "The input is not valid report
data. Check the source." Do not attempt to produce a report.

Step 3. Run the agent (2 min)

Use the weekly-report agent to generate the report for
/tmp/week16.json

Expected output: a markdown report with five sections, every number matching the input, no invented figures.

Step 4. Check for hallucinations (2 min)

Read the output carefully. Check:

Every number in the report appears in the JSON.
No ratio or rate was computed (like "revenue per user").
No causation was inferred unless clearly in the notes.
The notes about the holiday and incident appear in the Context section.
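The first check in that list can be partly automated. The sketch below is an assumption about how you might do it, not a complete verifier: it pulls every number out of the draft and flags any that appear nowhere in the input, treating "down 2%" as a match for a stored -2.0.

```python
import re

NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")

def numbers_in(text: str) -> set:
    """Extract numbers from text, ignoring thousands separators."""
    return {float(token.replace(",", "")) for token in NUMBER.findall(text)}

def allowed_numbers(node) -> set:
    """Collect every number the report may legitimately contain."""
    if isinstance(node, bool):
        return set()
    if isinstance(node, (int, float)):
        return {abs(float(node))}  # prose says "down 2%", data says -2.0
    if isinstance(node, str):
        return numbers_in(node)  # e.g. the digits in "2026-W16" or "3-hour incident"
    if isinstance(node, dict):
        node = list(node.values())
    if isinstance(node, list):
        return set().union(*(allowed_numbers(child) for child in node))
    return set()

def ungrounded_numbers(report: str, data: dict) -> list:
    """Return numbers that appear in the report but nowhere in the input data."""
    return sorted(numbers_in(report) - allowed_numbers(data))

data = {
    "period": "2026-W16",
    "metrics": {"active_users_weekly": {"value": 4820, "delta_pct": 8.0, "direction": "up"}},
}
report = (
    "# Weekly Report - 2026-W16\n\n"
    "Active users reached 4,820, up 8%. Revenue per user reached $42."
)

suspects = ungrounded_numbers(report, data)
print(suspects)  # [42.0]
```

This catches invented figures. It does not catch a real number attached to the wrong metric, or numbers written out in words, so it supplements the careful read rather than replacing it.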

Step 5. Stress test (2 min)

Try a malformed input to test the escape hatch:

Use the weekly-report agent to generate a report for
/tmp/notes.txt

(Point it at a plain text file.) The agent should reply "The input is not valid report data..." and refuse to generate a fake report.
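You can also catch malformed input in code before the agent is invoked at all. This sketch assumes the input shape used in this module (a `period` string and a `metrics` object whose entries carry `value`, `delta_pct`, and `direction`); the function name is hypothetical.

```python
import json

REQUIRED_METRIC_KEYS = {"value", "delta_pct", "direction"}

def validate_report_input(raw: str) -> list:
    """Return a list of problems; an empty list means the input looks usable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as error:
        return [f"not valid JSON: {error.msg}"]
    if not isinstance(data, dict):
        return ["top level must be a JSON object"]
    problems = []
    if not isinstance(data.get("period"), str):
        problems.append("missing or non-string 'period'")
    metrics = data.get("metrics")
    if not isinstance(metrics, dict) or not metrics:
        problems.append("missing or empty 'metrics'")
        return problems
    for name, entry in metrics.items():
        if isinstance(entry, dict):
            missing = REQUIRED_METRIC_KEYS - set(entry)
        else:
            missing = REQUIRED_METRIC_KEYS
        if missing:
            problems.append(f"metric '{name}' is missing {sorted(missing)}")
    return problems

good = '{"period": "2026-W16", "metrics": {"new_signups": {"value": 312, "delta_pct": -2.0, "direction": "down"}}}'
print(validate_report_input(good))                 # []
print(validate_report_input("just some notes"))    # one "not valid JSON" problem
```

The prompt-level escape hatch still matters. The code check handles the obvious cases cheaply; the agent's refusal covers whatever slips past it.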

Expected outcome

A report agent that produces structured reports grounded in real numbers, refuses malformed inputs, and maintains consistent format across runs.

If something's wrong

Agent invented a number: tighten the grounding rule, add the specific rule "Never write a percentage or ratio that isn't in the input."
Format drifts between runs: copy the template verbatim into the prompt, not a description of it.
Too verbose: add "Each paragraph is under 50 words" to rules.
Too dry: add one worked example in the prompt showing the tone you want.

What you built

A report-drafting agent. The Stage 3 piece of the four-stage pipeline. Stages 1 and 2 (collect, analyze) remain engineering work: a script that produces the JSON. Stage 4 (review) remains a human role.

At steady state, your weekly report cycle looks like: cron job runs the collect+analyze script Monday morning, agent generates the draft, you spend 5 minutes reviewing and hitting send. Two hours of work becomes about five minutes.

Rookie has the three ways report pipelines break.

Section 06

Rookie

Pitfalls·Three failure modes

Three ways report pipelines fail when first built.

Failure 1. Hallucinated numbers

The report says "revenue-per-user reached $42." Revenue-per-user is not in the input data. The agent computed it (incorrectly) or made it up.

Root cause: the grounding rules in the prompt weren't specific enough about forbidding computation.

Fix: explicit rule. "Never compute ratios, rates, or percentages that are not literally in the input data." If you want revenue-per-user in the report, compute it in Stage 2 and add it to the input data. Never let the LLM do arithmetic.

Failure 2. Inconsistent formatting

Week 15 has "## Metrics" with four paragraphs. Week 16 has "## Key Metrics" with bulleted numbers. Week 17 has "## Metrics" with a table.

The structure drifts because the prompt allowed drift. Maybe the template was described in prose ("include a section with the key metrics") instead of shown verbatim.

Fix: template literally in the prompt. Section names, heading levels, paragraph-vs-bullet all specified exactly. The agent has no room to improvise structure.

Failure 3. Missing human review gate

The pipeline runs, emails the draft to stakeholders automatically. One Monday the draft contains a flat-out wrong number (a data-pipeline bug in Stage 2, not the agent's fault). Stakeholders receive it. Email chains ensue.

Root cause: automation without a review gate.

Fix: the pipeline produces a draft that lands in a review inbox or a draft folder. A human clicks "send" after reading. The human review is not a suggestion; it's a required step in the pipeline. The 5 minutes of review per week is worth the catch rate.
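A review gate can be enforced in code rather than by convention. The sketch below assumes a file-based setup: drafts land in a folder, approval is recorded as a marker file next to the draft, and the distribute step refuses to run without it. All names here are hypothetical, and `send` stands in for whatever actually emails the list.

```python
import tempfile
from pathlib import Path

class NotApprovedError(Exception):
    """Raised when distribution is attempted without a recorded approval."""

def save_draft(folder: Path, period: str, text: str) -> Path:
    """Write the agent's draft into the review folder."""
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{period}.md"
    path.write_text(text, encoding="utf-8")
    return path

def approve(draft: Path, reviewer: str) -> Path:
    """Record who approved the draft in a marker file beside it."""
    marker = draft.with_suffix(".approved")
    marker.write_text(reviewer, encoding="utf-8")
    return marker

def distribute(draft: Path, send) -> str:
    """Send the draft only if an approval marker exists; return the reviewer."""
    marker = draft.with_suffix(".approved")
    if not marker.exists():
        raise NotApprovedError(f"{draft.name} has no recorded approval")
    send(draft.read_text(encoding="utf-8"))
    return marker.read_text(encoding="utf-8")

with tempfile.TemporaryDirectory() as tmp:
    sent = []
    draft = save_draft(Path(tmp) / "review", "2026-W16", "# Weekly Report - 2026-W16\n")
    try:
        distribute(draft, sent.append)
        blocked = False
    except NotApprovedError:
        blocked = True
    approve(draft, "reviewer@example.com")
    approved_by = distribute(draft, sent.append)

print(blocked, approved_by, len(sent))  # True reviewer@example.com 1
```

A real version would likely also invalidate the approval if the draft changes afterwards, for example by storing a hash of the approved text in the marker.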

The underlying discipline

Report pipelines are one of the few agent use cases where a single visible error can damage trust across an entire team. "The weekly report had a wrong number" is the kind of thing executives remember. The defensive posture (grounding rules, templates, review gates) is less about preventing rare disasters and more about preserving that trust across hundreds of weeks.

Manager on ownership and cadence.

Section 07

Manager

Team process·Ownership and review

Reports have readers. Readers notice when reports change. This is where team discipline matters.

Who owns a report

Every report has exactly one owner. That person is responsible for:

The data collection pipeline's correctness.
The report's structure and the agent's prompt.
The final review before distribution.
Any changes to the report, communicated to stakeholders.

Common failure: a report is "owned by the data team" which means no specific human. When something goes wrong, there's no one to call. When the report needs to change, nobody has authority. Reports without owners stagnate or drift, sometimes both.

The review handoff

The agent produces a draft. A human reviews. Who is that human?

Ideally the report owner. In practice, at team scale, you want a rotation so one person isn't reviewing every week.

The rotation runbook:

Monday 9am: pipeline runs. Draft lands in the review folder.
The reviewer this week gets notified.
Reviewer has until 11am to read, check numbers against source, approve or request changes.
Approval auto-distributes to the mailing list at 11am.
If not approved by 11am, the pipeline escalates to the report owner.
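The rotation and the deadline in this runbook are both easy to encode. A sketch, assuming a fixed reviewer list and a simple round-robin on the ISO week number; the names and the 11am default are placeholders.

```python
from datetime import date, datetime, time

REVIEWERS = ["alice", "bob", "carol"]  # hypothetical rotation

def reviewer_for(day: date, reviewers=REVIEWERS) -> str:
    """Pick this week's reviewer by ISO week number, round-robin."""
    week = day.isocalendar()[1]
    return reviewers[week % len(reviewers)]

def needs_escalation(now: datetime, approved: bool, deadline=time(11, 0)) -> bool:
    """True once the deadline has passed without an approval."""
    return not approved and now.time() >= deadline

monday = date(2026, 4, 13)  # Monday of 2026-W16
print(reviewer_for(monday))  # bob
print(needs_escalation(datetime(2026, 4, 13, 10, 30), approved=False))  # False
print(needs_escalation(datetime(2026, 4, 13, 11, 5), approved=False))   # True
```

Round-robin on the week number can hand the same person two weeks in a row at a year boundary. If that matters, keep an explicit schedule instead.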

Deadlines matter here. Without them, reviews drift to "whenever someone gets to it," which often means "later than anyone expected."

The change management

Changes to a regular report need explicit communication.

Renaming a section: mention in the next report ("Note: we've renamed 'Pipeline Health' to 'System Metrics' for clarity.").

Adding a metric: mention in the report where it first appears.

Changing the tone or level of detail: email the distribution list before the first changed report lands.

Readers scan reports. Subtle changes get noticed as "something feels different" which erodes trust. Telegraphing changes preserves trust.

The monthly review

Once a month, the report owner runs a meta-review:

Are we reporting the right metrics?
Has the agent's output quality changed?
What feedback have we heard from readers?
Are there metrics that should be added or removed?

Twenty minutes. Feeds into the next round of prompt and template refinement. Without this review, reports calcify.

Scaling to multiple reports

Once you have one report agent working, others follow easily. A team might end up with:

Weekly metrics report (operations)
Weekly sales update (sales)
Monthly exec review (leadership)
Quarterly board deck summary (leadership)

Each has its own owner, its own prompt, its own data pipeline. The shared infrastructure (the pattern, the template discipline, the review gate) is what makes them all work. One pipeline well-built becomes four or five pipelines at the same quality bar.

Chief on trust and executive reporting.

Section 08

Chief

Governance·Risk, cost, exposure

Executive reports are the highest-trust, lowest-tolerance output of any AI system you'll deploy.

A slightly wrong customer email is annoying. A slightly wrong number in the board deck is a disaster. Calibrate accordingly.

The trust asymmetry

Building trust in a report system takes months. One bad report destroys it. Stakeholders who see a wrong number once will verify every number forever after, which is the opposite of the efficiency gain.

Implication: errors in executive reports are more expensive than errors in operational ones. A 2 percent error rate is fine for internal dashboards. It's catastrophic for the quarterly business review.

Set the quality bar accordingly. For reports that go to executives, boards, investors, or regulators:

Every number verified against the source system before distribution.
At least two human eyes on every draft.
A checklist of specific things to check (not "does it look right").
An audit trail of what was verified and by whom.

The agent's appropriate role

In executive reporting, the agent's role is to draft prose that surrounds verified numbers. The numbers themselves come from systems of record. The agent does not compute, does not infer, does not estimate.

This feels like a limited role. It is. Limited is what you want in this setting. The agent is a first-draft writer, not an analyst.

What NOT to automate

Three kinds of reporting that shouldn't have LLM draft agents at all:

Financial reporting to regulators. No draft agent. Traditional templating only. Human-written prose.
Crisis communications. When something's going wrong, the voice and framing need more care than an agent can provide.
Novel strategic analysis. Reports that involve first-time interpretation of new trends. The agent can reliably narrate recurring patterns, not reason about new ones.

For these, the human writes. The time investment is justified by the stakes.

The calibration metric

Track how often the human reviewer changes the agent's draft. Three regimes:

Over 50% of drafts modified significantly. The agent is not earning its place. Review the prompt.
Under 10% of drafts modified. Either the agent is great or the reviewer isn't really reviewing. Spot-check.
20-40% of drafts modified lightly. Healthy range. Agent does the grunt work, human tightens.

This metric is the best signal of whether the agent-human system is working. Track it monthly.
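If you keep both the agent's draft and the version that was actually sent, the metric can be computed rather than estimated. This is a sketch under assumptions: "modified" means any change at all, and "significant" means more than 30% of the text changed as measured by `difflib`, a threshold you would tune for your own reports.

```python
from difflib import SequenceMatcher

def change_ratio(draft: str, final: str) -> float:
    """Rough fraction of text that changed between draft and sent version."""
    return 1.0 - SequenceMatcher(None, draft, final).ratio()

def review_stats(pairs, significant_threshold=0.30) -> dict:
    """Share of drafts modified at all, and share modified significantly."""
    total = len(pairs)
    modified = sum(1 for draft, final in pairs if draft.strip() != final.strip())
    significant = sum(
        1 for draft, final in pairs if change_ratio(draft, final) >= significant_threshold
    )
    return {"modified": modified / total, "significant": significant / total}

# Synthetic (draft, sent) pairs standing in for a month of reports.
pairs = [
    ("Revenue rose 12%.", "Revenue rose 12%."),
    ("Revenue rose 12%.", "Revenue rose 12% this week."),
    ("Signups fell 2%.", "Signups fell 2%."),
    ("Tickets fell 15%.", "Tickets fell 15%."),
]

stats = review_stats(pairs)
print(stats["modified"])  # 0.25
```

Here one draft in four was touched, and only lightly, which would sit in the healthy range described above.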

The governance three

Policy items for any organization running report agents:

All executive reports have a human approval step before distribution.
Every report has a named owner, visible on the report itself.
Change log for report structure, published with changes.

These are light-touch but critical. Organizations with them rarely have "the report is wrong" incidents. Organizations without them have them eventually.

Founder wraps it.

Section 09

Founder

Synthesis·The solo workflow

For a solo operator, report pipelines are how you get investor-grade output without an analyst team.

From weekly you to weekly you

As a solo operator, you're often the audience for your own reports. Nobody else reads the Monday-morning metrics email. You do. You look at numbers, think about trends, write a short note to yourself about what to focus on that week.

An agent can do most of that. You provide the judgment. It provides the grunt work of composing "revenue up 12 percent, signups down 2 percent, support tickets down 15 percent."

Your time goes from 30 minutes of number-staring to 5 minutes of reading the agent's summary and deciding what to do.

One report to start

Don't build five report pipelines at once. Build one.

Pick the report you write most often for yourself. The one you dread. Build the agent for that. Run it for a month. Tune the prompt. Then build the second.

Common candidates for "first report":

Weekly business metrics (revenue, users, churn).
Weekly marketing summary (traffic, conversion, top content).
Monthly financial snapshot.

The investor update pattern

For solo founders raising money: monthly investor updates are a report-pipeline use case that pays for itself many times over.

Structure:

Key metrics (month-over-month).
What happened this month (wins, learnings).
What's next.
Asks from investors.

The agent fills the first section from a data file. You write the qualitative sections. The full update takes 30 minutes instead of three hours. Monthly cadence becomes sustainable.

The drift habit

Once a quarter, re-read your last three months of reports.

Do they still feel right?
Are the metrics still the right ones?
Did the agent start sounding same-y?

Adjust the prompt accordingly. A good quarterly tune-up keeps reports fresh through years of use.

Keep the data pipeline simple

Your Stage 1 (collect) and Stage 2 (analyze) don't need to be fancy. A shell script that exports from Stripe, a Python script that computes deltas, output goes to a JSON file. That's enough.

The temptation as a solo operator is to build a fancy data-pipeline platform before building the agent. Don't. Build the simplest possible pipeline first. Iterate the agent against it. Make the pipeline fancier only when the simple version hits real limits.

The one thing to remember

Data is not a report. Numbers are not narrative.

The agent's job is to wrap verified numbers in language. That's it. Keep the arithmetic in code, the judgment in your head, and the narration in the agent. That division of labor is what makes report pipelines reliable.