Receipts · does the model actually work?

Simple test: when the model says a team has a 60% chance to win, does that team actually win about 60% of the time? Below is every graded pick we've ever made, grouped by how confident the model was and shown next to how often it was right. Green tick = the prediction. Amber bar = what actually happened. If the amber bars consistently land on the green ticks, the model is honest.


Calibration by predicted probability

Picks bucketed by what the model thought the win probability was, vs. how often they actually won. Each row shows the bucket, the actual hit rate as an amber bar, and the expected hit rate as a green tick. Sample size matters — sparse buckets are noisy.
Legend: amber bar = actual hit rate; green tick = expected hit rate (perfectly calibrated).
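For anyone who wants to check the bucketing by hand, here is a minimal sketch of how a calibration table like this could be built. It is an illustration, not the page's actual query: the function name `calibration_table`, the `(predicted_prob, won)` input shape, and the ten equal-width buckets are all assumptions.

```python
from collections import defaultdict


def calibration_table(picks, n_buckets=10):
    """Bucket (predicted_prob, won) pairs and compare expected vs. actual hit rate.

    `picks` is a hypothetical list of (float in [0, 1], bool) tuples;
    the real page reads graded rows from its own storage.
    """
    buckets = defaultdict(list)
    for prob, won in picks:
        # Equal-width buckets; a probability of exactly 1.0 goes in the top bucket.
        idx = min(int(prob * n_buckets), n_buckets - 1)
        buckets[idx].append((prob, won))

    rows = []
    for idx in sorted(buckets):
        items = buckets[idx]
        n = len(items)
        rows.append({
            "low": idx / n_buckets,
            "high": (idx + 1) / n_buckets,
            "n": n,  # sample size: small n means a noisy bar
            "expected": sum(p for p, _ in items) / n,       # the green tick
            "actual": sum(1 for _, w in items if w) / n,    # the amber bar
        })
    return rows


if __name__ == "__main__":
    sample = [(0.62, True), (0.65, False), (0.68, True), (0.55, True)]
    for row in calibration_table(sample):
        print(row)
```

A bucket where `actual` sits well below `expected` over a large `n` means the model was overconfident in that range; with only a handful of picks, the gap may just be noise.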

Performance by edge tier

How each tier has done historically. ROI assumes -110 American odds for spread/total picks and the actual fair price for moneylines. Locks are the bets the model has the most conviction on; Solid are the medium-conviction picks that still clear the noise threshold.
Tier | Threshold | Sample | Hit rate | Avg edge
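The ROI assumption above comes down to a bit of arithmetic on American odds. The sketch below shows one way to do it with flat one-unit stakes. The function names, the `(won, odds)` input shape, and the choice to leave pushes out of the input are assumptions for illustration, not the grader's actual code.

```python
def profit_per_unit(american_odds):
    """Profit on a winning 1-unit stake at the given American odds."""
    if american_odds == 0:
        raise ValueError("American odds cannot be 0")
    if american_odds > 0:
        return american_odds / 100.0      # e.g. +150 pays 1.5 units
    return 100.0 / abs(american_odds)     # e.g. -110 pays about 0.909 units


def break_even_rate(american_odds):
    """Hit rate needed for zero ROI at the given price."""
    return 1.0 / (1.0 + profit_per_unit(american_odds))


def flat_stake_roi(results, default_odds=-110):
    """ROI over (won, odds) pairs, staking 1 unit per pick.

    `odds` of None falls back to `default_odds`, mirroring the -110
    assumption for spread/total picks. Pushes are assumed to be
    filtered out before calling this.
    """
    staked = 0
    profit = 0.0
    for won, odds in results:
        price = default_odds if odds is None else odds
        staked += 1
        profit += profit_per_unit(price) if won else -1.0
    return profit / staked if staked else 0.0


if __name__ == "__main__":
    print(break_even_rate(-110))
    print(flat_stake_roi([(True, None), (False, None), (True, 135)]))
```

At -110, a pick has to hit at 110/210, a little over 52%, just to break even, which is why a tier's hit rate needs to be read against its price and not against 50%.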

AGREE vs FLIP

When the model agrees with the consensus market price (AGREE) vs. when it disagrees (FLIP). AGREEs historically hit at a higher rate but with smaller edges; FLIPs are riskier but pay more when they hit.
Direction | Sample | Hit rate | Avg edge
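A rough sketch of how the AGREE/FLIP split could be tallied is below. It assumes one particular reading of "agrees with the market": a pick is AGREE when the model backs the same side the market favors, and FLIP otherwise. That rule, the `GradedPick` fields, and the function names are all hypothetical; the page does not spell out the exact definition.

```python
from dataclasses import dataclass


@dataclass
class GradedPick:
    model_side: str   # side the model picked (hypothetical field)
    market_side: str  # side the consensus market favors (hypothetical field)
    won: bool         # graded result
    edge: float       # model probability minus market-implied probability


def direction(pick):
    """Assumed rule: AGREE if the model backs the market's favored side."""
    return "AGREE" if pick.model_side == pick.market_side else "FLIP"


def summarize_by_direction(picks):
    """Return sample size, hit rate, and average edge for each direction."""
    groups = {"AGREE": [], "FLIP": []}
    for pick in picks:
        groups[direction(pick)].append(pick)

    summary = {}
    for name, items in groups.items():
        n = len(items)
        summary[name] = {
            "sample": n,
            # None instead of 0.0 so an empty group is not read as "0% hit rate"
            "hit_rate": sum(1 for p in items if p.won) / n if n else None,
            "avg_edge": sum(p.edge for p in items) / n if n else None,
        }
    return summary


if __name__ == "__main__":
    demo = [
        GradedPick("NYY", "NYY", True, 0.02),
        GradedPick("LAD", "LAD", False, 0.04),
        GradedPick("SEA", "HOU", True, 0.08),
    ]
    print(summarize_by_direction(demo))
```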

How we keep this honest

Every pick is timestamped in picks before the game starts. Once the game is final, the grader (tg_mlb_grade.py) writes the result from the official MLB Stats API box score. Nothing on this page can be edited after the fact: it's a direct query of graded rows. If you spot something off, methodology.html walks through the model end-to-end.