Calibration is the statistical property that a probabilistic forecaster's stated confidence matches its empirical hit rate: events assigned 70% probability should occur roughly 70% of the time. A well-calibrated forecaster is neither systematically overconfident nor underconfident across the full probability range.
Calibration is assessed by bucketing forecasts by stated probability and comparing each bucket's average prediction to its realized frequency. A reliability diagram plots predicted probability against observed frequency, and a perfectly calibrated forecaster lies on the 45-degree line. Calibration is distinct from resolution (closely related to sharpness), which measures how far the observed frequencies in each bucket depart from the overall base rate: a forecaster can be calibrated yet uninformative by always quoting the base rate. The Brier score, a proper scoring rule, decomposes into calibration (reliability), resolution, and uncertainty components.
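The bucketing procedure can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name reliability_table, the choice of ten equal-width bins, and the sample data are all assumptions made for the example.

```python
from collections import defaultdict


def reliability_table(probs, outcomes, n_bins=10):
    """Group forecasts into equal-width probability bins (an assumed binning scheme).

    Returns one (mean_forecast, observed_frequency, count) tuple per
    non-empty bin, ordered from the lowest bin to the highest.
    """
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    bins = defaultdict(list)
    for p, o in zip(probs, outcomes):
        # Clamp the index so that p == 1.0 lands in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, o))
    table = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_p = sum(p for p, _ in pairs) / len(pairs)
        freq = sum(o for _, o in pairs) / len(pairs)
        table.append((mean_p, freq, len(pairs)))
    return table


# Hypothetical data: 100 forecasts of 0.8, of which 55 resolve true.
probs = [0.8] * 100
outcomes = [1] * 55 + [0] * 45
for mean_p, freq, count in reliability_table(probs, outcomes):
    print(f"forecast={mean_p:.2f} observed={freq:.2f} n={count}")
# prints: forecast=0.80 observed=0.55 n=100
```

Plotting mean_forecast against observed_frequency for each row gives the reliability diagram. Rows where the observed frequency sits closer to 50% than the forecast does indicate overconfidence in that bucket. Buckets with few forecasts give noisy frequencies, so the count column matters when reading the table.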
As prediction markets and LLM-generated probabilistic forecasts proliferate in 2025-2026 macro workflows, calibration is the primary diagnostic separating genuine forecasting skill from confident noise — and the metric by which agentic forecasting systems should be audited before their probabilities feed live positioning.
A forecaster issues 100 calls, each tagged "80% likely." If only 55 resolve true, the forecaster is overconfident by 25 percentage points in that bucket, and the reliability curve sags below the diagonal there. Philip Tetlock's Good Judgment Project (2011-2015) showed elite "superforecasters" remained close to the diagonal across buckets, whereas typical pundits clustered probabilities near 0 and 100 and were badly overconfident.
Brier score = (1/N) Σ (forecast_prob − outcome)², where outcome is 1 if the event occurred and 0 otherwise, and lower scores are better. It decomposes as reliability (calibration) − resolution + uncertainty.
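The score and its three-term decomposition can be checked numerically. The sketch below assumes forecasts are grouped by their exact stated probability, in which case the decomposition holds exactly; with coarser bins it is only approximate. The function names brier_score and murphy_decomposition are illustrative, and the data reuse the 80% bucket example above.

```python
from collections import defaultdict


def brier_score(probs, outcomes):
    """Mean squared difference between forecast probability and 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def murphy_decomposition(probs, outcomes):
    """Return (reliability, resolution, uncertainty).

    Forecasts are grouped by distinct stated probability (an assumption
    of this sketch), so reliability - resolution + uncertainty equals
    the Brier score up to floating-point error.
    """
    n = len(probs)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)
    for p, o in zip(probs, outcomes):
        groups[p].append(o)
    reliability = 0.0
    resolution = 0.0
    for p, obs in groups.items():
        freq = sum(obs) / len(obs)
        # Gap between stated probability and realized frequency.
        reliability += len(obs) * (p - freq) ** 2
        # Gap between realized frequency and the overall base rate.
        resolution += len(obs) * (freq - base_rate) ** 2
    uncertainty = base_rate * (1 - base_rate)
    return reliability / n, resolution / n, uncertainty


# The 80% bucket example: 100 forecasts of 0.8, 55 resolve true.
probs = [0.8] * 100
outcomes = [1] * 55 + [0] * 45
rel, res, unc = murphy_decomposition(probs, outcomes)
print(f"brier={brier_score(probs, outcomes):.4f}")
print(f"reliability={rel:.4f} resolution={res:.4f} uncertainty={unc:.4f}")
```

For this data the reliability term is (0.80 − 0.55)² = 0.0625, the resolution term is zero because every forecast is identical, and the uncertainty term is 0.55 × 0.45 = 0.2475, which sum to a Brier score of 0.31.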