Receipts · does the model actually work?
Simple test: when the model says a team has a 60% chance to win, does that team actually win 60% of the time? Below is every graded pick we've ever made, grouped by how confident the model was and compared with how often it was right. Green tick = the prediction. Amber bar = what actually happened. If the amber bars consistently land at the green ticks, the model is honest.
Calibration by predicted probability
Picks bucketed by what the model thought the win probability was, vs. how often they
actually won. Each row shows the bucket, the actual hit rate as an amber bar, and the
expected hit rate as a green tick. Sample size matters — sparse buckets are noisy.
Legend: amber bar = actual hit rate; green tick = expected hit rate (perfectly calibrated).
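For readers who want to reproduce the bucketing, here is a minimal sketch of one way it could be done. It is not the site's actual code: the function name `calibration_table`, the 5-point bucket width, and the `(predicted_prob, won)` input shape are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def calibration_table(picks, width_pct=5):
    """Bucket (predicted_prob, won) pairs by predicted win probability.

    Hypothetical helper: `picks` is an iterable of (float in [0, 1], bool).
    Returns one row per non-empty bucket, sorted by bucket.
    """
    buckets = defaultdict(list)
    last_bucket = 100 // width_pct - 1
    for prob, won in picks:
        # Work in whole percent; the tiny epsilon guards against float
        # error such as 0.57 * 100 == 56.99999999999999.
        pct = math.floor(prob * 100 + 1e-9)
        # Clamp so a probability of exactly 1.0 lands in the top bucket.
        idx = min(pct // width_pct, last_bucket)
        buckets[idx].append((prob, won))

    rows = []
    for idx in sorted(buckets):
        members = buckets[idx]
        n = len(members)
        rows.append({
            "lo": idx * width_pct,          # bucket lower bound, percent
            "hi": (idx + 1) * width_pct,    # bucket upper bound, percent
            "n": n,                         # sample size
            "expected": sum(p for p, _ in members) / n,  # green tick
            "actual": sum(1 for _, w in members if w) / n,  # amber bar
        })
    return rows


sample = [(0.52, True), (0.53, False),
          (0.61, True), (0.62, True), (0.63, False), (0.64, True)]
for row in calibration_table(sample):
    print(row)
```

In this sketch the expected rate is the mean predicted probability inside the bucket rather than the bucket midpoint, so a bucket whose picks cluster near one edge is not penalized for that. Buckets with a small `n` should be read with caution, as noted above.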
Performance by edge tier
How each tier has done historically. ROI assumes -110 American odds for spread/total
picks and the actual fair price for moneylines. Locks are the bets the model has
the most conviction on; Solid are the medium-conviction picks that still clear
the noise threshold.
| Tier | Threshold | Sample | Hit rate | Avg edge |
|---|---|---|---|---|
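To make the -110 assumption concrete, the following sketch shows how a flat-stake ROI and the break-even hit rate could be computed from American odds. The function names are hypothetical, every pick is assumed to risk one unit, and pushes are assumed to be excluded from the results; none of this is taken from the site's grader.

```python
def profit_per_unit(odds):
    """Profit on a winning 1-unit stake at the given American odds."""
    if odds < 0:
        return 100 / -odds   # e.g. -110 pays 100/110, about 0.909 units
    return odds / 100        # e.g. +150 pays 1.5 units


def flat_stake_roi(results, odds=-110):
    """ROI per unit staked for a list of win/loss booleans (1 unit each)."""
    if not results:
        raise ValueError("need at least one graded pick")
    wins = sum(1 for won in results if won)
    losses = len(results) - wins
    # Each win earns the payout; each loss costs the 1-unit stake.
    return (wins * profit_per_unit(odds) - losses) / len(results)


def break_even_rate(odds=-110):
    """Hit rate at which flat-stake ROI is exactly zero."""
    return 1 / (1 + profit_per_unit(odds))


print(f"break-even at -110: {break_even_rate():.4f}")
print(f"ROI at 55% hit rate: {flat_stake_roi([True] * 55 + [False] * 45):.4f}")
```

At -110 the break-even hit rate is 110/210, a little over 52%, which is why a tier's hit rate has to be read against that line rather than against 50%. For moneylines, the same functions can be called with each pick's own price instead of the -110 default.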
AGREE vs FLIP
When the model agrees with the consensus market price (AGREE) vs. when it
disagrees (FLIP). AGREEs historically hit at a higher rate but with smaller
edges; FLIPs are riskier but pay more when they hit.
| Direction | Sample | Hit rate | Avg edge |
|---|---|---|---|
How we keep this honest
Every pick is timestamped in picks before the game starts. Once the
game finalizes, the grader (tg_mlb_grade.py) writes the result from the
official MLB Stats API box score. Nothing on this page can be edited after the fact —
it's a direct query of graded rows. If you spot something off,
methodology.html walks through the model end-to-end.