Pipeline component · 05

The validator, validated.

Precision, recall, F1, threshold sweep, calibration, and Claude grading Claude. The discipline most attribution work skips, borrowed from how clause classifiers are graded in legal AI.

Walkthrough · Episode 06

Marketing teams almost never apply classification metrics to attribution decisions. Attribution outputs look continuous, like credit shares, not categorical. But the decisions this system outputs are categorical: over-credited, under-credited, or accurate. So we should grade them the same way we grade any classifier.

What you are looking at on this page is the validator, validated. The confusion matrix shows where the system's labels match the configured ground truth and where they diverge. The threshold sweep shows where the F1-optimal cutoff is. The calibration check tells us whether the system's stated confidence matches its empirical accuracy. The narrative evaluation has Claude grade Claude on whether the executive summary's claims are consistent with the comparison data.

This page is the methodological discipline most attribution work skips, and it is the single strongest interview moment in the project. The framing is borrowed from how clause classifiers are evaluated in legal AI, applied to a domain that does not usually receive it.

Per-class precision, recall, F1

50 simulations × 6 channels = 300 predictions. Configured labels stay constant; engine predictions vary with measurement noise. Approximately 100 predictions per class.

Confusion matrix showing predicted vs configured labels.

Bar chart of precision, recall, F1 per class.

Three observations. Recall on actionable classes is high (0.85 on OVER, 0.79 on UNDER), so when there is real misallocation the system usually catches it. OVER and UNDER almost never get confused with each other (3 of 100 in each direction), so the system rarely tells you to cut a channel that should be expanded. The weakness is on ACCURATE recall (0.43), which the threshold sweep below addresses.
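For readers who want the arithmetic behind the per-class numbers, here is a minimal sketch of how precision, recall, and F1 fall out of the (configured, predicted) label pairs. The function names and the toy labels are illustrative assumptions, not the project's code or data.

```python
from collections import Counter

LABELS = ("OVER", "UNDER", "ACCURATE")


def per_class_metrics(configured, predicted, labels=LABELS):
    """Return precision, recall, and F1 per class from paired label lists."""
    # Each key is a (configured, predicted) pair, i.e. one confusion-matrix cell.
    counts = Counter(zip(configured, predicted))
    metrics = {}
    for label in labels:
        tp = counts[(label, label)]
        # Predicted as `label` but configured as something else.
        fp = sum(counts[(other, label)] for other in labels if other != label)
        # Configured as `label` but predicted as something else.
        fn = sum(counts[(label, other)] for other in labels if other != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Toy example (hypothetical labels, not the page's 300 predictions).
toy_configured = ["OVER", "OVER", "UNDER", "ACCURATE", "ACCURATE", "ACCURATE"]
toy_predicted = ["OVER", "OVER", "UNDER", "OVER", "UNDER", "ACCURATE"]
toy_metrics = per_class_metrics(toy_configured, toy_predicted)
```

The toy data mirrors the pattern described above in miniature: the actionable classes are caught, while ACCURATE channels get flagged too often and lose recall.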

Threshold sweep

The 5pp default for the truth-check is an educated guess. Sweeping it in 0.5pp steps across [1pp, 15pp] and recomputing F1 at each candidate shows where the optimum lives per class.

Two-panel chart: F1 by class vs threshold, and OVER class precision/recall trade-off.

The F1-optimal threshold for OVER_CREDITED (the budget-relevant class) is 8.0pp, and macro-F1 peaks at the same point. The biggest gain is on the ACCURATE class at 13.5pp (+0.141 F1), which would substantially cut the false-flag rate.
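The sweep itself is simple enough to sketch. The version below assumes the truth-check labels a channel by comparing a signed gap in percentage points against a symmetric threshold; that rule, the function names, and the toy gaps are assumptions for illustration, since the page does not spell out the exact decision rule.

```python
LABELS = ("OVER", "UNDER", "ACCURATE")


def label_gap(gap_pp, threshold_pp):
    """Assumed rule: a gap beyond +/- threshold counts as misattribution."""
    if gap_pp > threshold_pp:
        return "OVER"
    if gap_pp < -threshold_pp:
        return "UNDER"
    return "ACCURATE"


def f1_for_class(configured, predicted, label):
    """F1 for one class, written as 2TP / (2TP + FP + FN)."""
    pairs = list(zip(configured, predicted))
    tp = sum(c == label and p == label for c, p in pairs)
    fp = sum(c != label and p == label for c, p in pairs)
    fn = sum(c == label and p != label for c, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def sweep_thresholds(gaps_pp, configured, start=1.0, stop=15.0, step=0.5):
    """Relabel at each candidate threshold and recompute per-class F1."""
    n_steps = int(round((stop - start) / step))
    results = {}
    for i in range(n_steps + 1):
        threshold = start + i * step
        predicted = [label_gap(g, threshold) for g in gaps_pp]
        per_class = {lab: f1_for_class(configured, predicted, lab)
                     for lab in LABELS}
        per_class["macro"] = sum(per_class[lab] for lab in LABELS) / len(LABELS)
        results[threshold] = per_class
    return results


def best_threshold(results, key):
    """Lowest threshold that attains the maximum score for `key`."""
    return max(results, key=lambda t: results[t][key])


# Toy measured gaps in pp (hypothetical values) and their configured labels.
toy_gaps = [12.0, 9.0, 3.0, -2.0, -10.0, 6.0]
toy_configured = ["OVER", "OVER", "ACCURATE", "ACCURATE", "UNDER", "ACCURATE"]
toy_results = sweep_thresholds(toy_gaps, toy_configured)
```

Calling `best_threshold(toy_results, "OVER")` or `best_threshold(toy_results, "macro")` then picks the cutoff per class or overall, which is the comparison the chart summarizes.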

Calibration check

When the system says it is confident in a label, is it actually right that often? Confidence is defined as the margin from the decision boundary. Expected Calibration Error (ECE) is the bin-size-weighted mean absolute gap between stated confidence and empirical accuracy.

Reliability diagram of stated confidence vs empirical accuracy across confidence bins.

ECE = 0.155, meaningfully miscalibrated. The system is closest to calibrated at the extremes (the largest bin, at 99% stated confidence, has 91% empirical accuracy) but over-confident in the moderate-confidence range: at 85% stated confidence, empirical accuracy is only 52%. That is the dangerous failure mode you cannot see in macro-F1 alone.
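As a concrete reading of the ECE definition above, here is a minimal sketch using equal-width confidence bins. It assumes confidences have already been mapped into [0, 1]; how the margin from the decision boundary becomes a confidence score is not shown here, and the bin count and toy values are illustrative assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-size-weighted mean |stated confidence - empirical accuracy|."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's gap by the share of predictions it holds.
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


# Toy example (hypothetical values): four predictions at 0.95 confidence,
# three of them right, and two at 0.85, one of them right.
toy_conf = [0.95, 0.95, 0.95, 0.95, 0.85, 0.85]
toy_correct = [True, True, True, False, True, False]
toy_ece = expected_calibration_error(toy_conf, toy_correct)
```

Because each bin is weighted by its size, a large gap in a sparsely populated bin moves ECE only a little, which is why the reliability diagram is worth reading alongside the single number.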

LLM-grades-LLM narrative evaluation

Take the executive summary from the narrative page. Structure the comparison data as ground truth. Have Claude grade Claude on whether the narrative's claims are consistent. Per channel labeled OVER or UNDER, the grader returns three booleans: was the channel mentioned, was the direction correctly stated, were the cited numbers within tolerance.

Channel mentioned: 100%
Direction correct: 100%
Magnitude approx correct: 100%
Total API cost: $0.26

18 channel-level evaluations across 5 simulated runs. Two honest caveats: small N (with 50 evaluations we would likely find some failures, particularly on magnitude) and same-family LLM judging itself (the grader is also Opus 4.7; a rigorous follow-up uses a different judge family). Even with the caveats, 100% on direction correctness is a real signal: the prompt structure forces the writer to ground numeric claims in the input data.
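The grading loop can be sketched without committing to a particular API client. In the version below the judge is any callable that takes a prompt and returns a JSON string, so a real model call or a test stub can be plugged in; the prompt wording, the field names (`mentioned`, `direction_correct`, `magnitude_ok`), and the ground-truth row shape are all hypothetical, not the project's actual schema.

```python
import json
from typing import Callable

FIELDS = ("mentioned", "direction_correct", "magnitude_ok")


def build_grader_prompt(summary: str, truth: list) -> str:
    """Assemble a grading prompt from the summary and structured ground truth."""
    return (
        "Grade the executive summary against the ground truth.\n"
        "Return one JSON object keyed by channel name. Each value must have "
        f"the boolean fields: {', '.join(FIELDS)}.\n\n"
        f"Ground truth:\n{json.dumps(truth, indent=2)}\n\n"
        f"Executive summary:\n{summary}\n"
    )


def grade_summary(summary: str, truth: list, judge: Callable[[str], str]) -> dict:
    """Ask the judge for verdicts and normalise them to strict booleans."""
    parsed = json.loads(judge(build_grader_prompt(summary, truth)))
    grades = {}
    for row in truth:
        verdict = parsed.get(row["channel"])
        if not isinstance(verdict, dict):
            verdict = {}
        # Anything other than a literal JSON true counts as a failure.
        grades[row["channel"]] = {f: verdict.get(f) is True for f in FIELDS}
    return grades


def pass_rates(all_grades: list) -> dict:
    """Share of channel-level evaluations passing each check, across runs."""
    rows = [v for grades in all_grades for v in grades.values()]
    if not rows:
        return {f: 0.0 for f in FIELDS}
    return {f: sum(r[f] for r in rows) / len(rows) for f in FIELDS}
```

Treating a missing or malformed field as a failure is a deliberate choice in this sketch: it keeps a sloppy judge response from inflating the pass rates.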

What I would change at enterprise scale

Real data inputs. Replace the synthetic data source with warehouse connectors (Snowflake or Teradata). The interface stays the same; the implementation changes.
Attribution model integration. Pull the existing model's outputs from the warehouse and feed them in alongside measurement.
Real geo-experiments. Pull experiment metadata from Eppo or Statsig (or an internal tool) so geo-tests are real holdouts, not synthetic dark periods.
Single-channel synthetic control in src/methods/synthetic_control.py. Tighter per-channel magnitudes than multi-channel TWFE.
Larger-N evaluation runs. 200+ simulations for production-grade P/R/F1 numbers. With a different judge model (Sonnet) or, more rigorously, a non-Anthropic judge family, the narrative eval becomes rigorous.
Output destinations and audit logging. The narrative report auto-delivers to a Confluence page, a Slack channel, or an email distribution list. Every recalibration recommendation gets an audit trail.
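To illustrate the first item's point that the interface stays the same while the implementation changes, here is a minimal sketch of a data-source boundary. Every name in it (`ChannelDataSource`, `SyntheticSource`, `WarehouseSource`, the `spend` table, the `run_query` callable) is hypothetical; a real connector would wrap whatever Snowflake or Teradata client the team already uses.

```python
import random
from typing import Callable, Protocol


class ChannelDataSource(Protocol):
    """What the pipeline needs from any data source (hypothetical interface)."""

    def load(self, channels: list) -> dict:
        """Return total spend per requested channel."""
        ...


class SyntheticSource:
    """Stand-in for the synthetic generator: seeded random spend."""

    def __init__(self, seed: int = 0):
        self._rng = random.Random(seed)

    def load(self, channels: list) -> dict:
        return {c: round(self._rng.uniform(1000, 5000), 2) for c in channels}


class WarehouseSource:
    """Same interface, backed by an injected query function."""

    def __init__(self, run_query: Callable[[str], list]):
        # `run_query` is assumed to return (channel, spend) rows.
        self._run_query = run_query

    def load(self, channels: list) -> dict:
        rows = self._run_query(
            "SELECT channel, SUM(spend) FROM spend GROUP BY channel"
        )
        wanted = set(channels)
        return {ch: float(sp) for ch, sp in rows if ch in wanted}


def spend_shares(source: ChannelDataSource, channels: list) -> dict:
    """Downstream code depends only on the interface, not the backend."""
    spend = source.load(channels)
    total = sum(spend.values())
    return {c: v / total for c, v in spend.items()} if total else {}
```

Because `spend_shares` only calls `load`, swapping `SyntheticSource` for `WarehouseSource` requires no change to the measurement or evaluation code.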