The executive summary, generated by Claude.
Numbers are accurate. Executives want paragraphs. This page is what happens when you feed the comparison data to Claude Opus 4.7 with a carefully-structured prompt.
The truth-check on the previous page is right in numbers. A CFO does not read tables. A CFO reads paragraphs. So the next layer of the system feeds the comparison data to a Claude API call with a carefully-structured prompt and gets back an executive summary.
Three prompt-engineering decisions matter here, and I think they are the kind of details that separate "useful AI output" from "the model said something." First, the style rules sit in the system prompt as hard constraints: no em dashes, no "obviously" or "clearly," every numeric claim must trace back to a specific number in the data, conversational tone, no preamble. System prompts are stable across calls and set the role; the user message is what varies per request.
Second, negative constraints work better than positive ones for tone control. "Avoid 'obviously' and 'clearly'" is more effective than "be humble." Telling a model what NOT to do is concrete; telling it what TO do is interpretable in too many ways.
Third, the skip-list approach. The user message includes an explicit list of channels to discuss versus channels to omit. Asking the model to filter on its own is unreliable. Telling it which to skip is precise.
Cost is roughly two and a half cents per call (about 1300 input tokens plus 800 output tokens at Claude Opus 4.7 prices). The output below was generated by exactly this pipeline. Every numeric claim in it traces back to the comparison data on the previous page. Read it, and notice how it correctly hedges when an underlying p-value is weak and recommends a confirmatory holdout test before reallocating. That is prompt engineering doing real work.
The generated executive summary
Our last-touch attribution is overstating channel-driven conversions by roughly 3.2x (16,426 claimed versus 5,144 measured as truly incremental over 26 weeks), and the mix inside that number is wrong in ways that should change next quarter's budget.
Direct mail is badly under-credited. The model gives it 7.7% of channel credit, but geo-lift measurement puts its true contribution at 27.3%, a 19.5 percentage point gap and the largest mismatch in the portfolio. The measurement is statistically tight (p-value 0.000), so this is not noise. Direct mail is doing far more work than the dashboards show, and the recommended action is to protect and likely expand its budget rather than trim it in the next planning cycle.
Paid search is the most over-credited channel in absolute terms. The model assigns it 21.8% of channel credit; measurement supports only 8.2%, a swing of roughly 3,150 conversions that paid search is getting credit for but not actually causing. One important caveat: the p-value on the paid search measurement is 0.309, meaning the precise size of the overstatement is uncertain. The direction is consistent with brand search cannibalization, so the action is to run a planned paid search holdout test in one or two geos this quarter before making large cuts.
Display is over-credited and the evidence is stronger. The model claims 25.1% of channel credit; measurement shows 12.3%, a 12.9 percentage point gap and about 3,497 conversions of overstatement. The p-value of 0.071 is suggestive rather than conclusive, but the gap is large enough that the prudent move is to reduce display spend by a meaningful step (start with 20 to 30%) and reinvest the savings into the under-credited channels while monitoring total conversions.
TV brand is under-credited with solid statistical support. The model gives it 14.3% of credit; measurement says 20.1%, a 5.9 percentage point gap with a p-value of 0.021. TV is pulling more weight than the attribution model recognizes, particularly on upper-funnel demand that paid search and display are then taking credit for. Hold or modestly increase TV brand investment, and stop using last-touch numbers to justify cuts to it.
This week, commission the paid search geo holdout and rebalance roughly 15 to 20% of display budget toward direct mail and TV brand pending those results.
Generated by claude-opus-4-7 via the Anthropic SDK. Total cost for this call: $0.0263 (1,314 input tokens + 790 output tokens at Opus 4.7 list pricing). Re-runnable any time via scripts/run_narrative.py.