What is forecast calibration?

Forecast calibration is the property that a probabilistic forecaster's stated confidence matches its observed hit rate. Events assigned an 80% probability should occur about 80% of the time across many such forecasts. Calibration is measured with reliability diagrams and proper scoring rules like the Brier score, and is a core diagnostic for prediction markets and model-generated forecasts.

How is calibration different from accuracy?

Calibration measures whether stated probabilities match realized frequencies. Accuracy is broader: it also depends on resolution, the degree to which a forecaster's probabilities separate events that happen from events that do not. A forecaster who always quotes the base rate (50% for an event that occurs half the time) can be perfectly calibrated yet useless. The best forecasters are both calibrated and sharp, pushing probabilities toward 0 or 100 only when the evidence justifies it.

How do you measure if a forecaster is well-calibrated?

Calibration is measured by grouping forecasts into probability buckets and comparing each bucket's average forecast to the frequency of events that actually occurred. Plotting predicted against observed produces a reliability diagram; points on the 45-degree line indicate perfect calibration. The Brier score and its reliability component quantify deviation numerically across all buckets.
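The bucketing procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `reliability_table`, the choice of ten equal-width buckets, and the sample data are all assumptions made for the example.

```python
from collections import defaultdict


def reliability_table(forecasts, outcomes, n_buckets=10):
    """Group forecasts into equal-width probability buckets and compare
    each bucket's mean forecast with its observed event frequency.

    forecasts: probabilities in [0, 1]
    outcomes:  1 if the event occurred, 0 otherwise
    Returns a list of (mean_forecast, observed_frequency, count) rows,
    ordered from the lowest bucket to the highest.
    """
    buckets = defaultdict(list)
    for p, y in zip(forecasts, outcomes):
        # Clamp the index so a forecast of exactly 1.0 lands in the top bucket.
        idx = min(int(p * n_buckets), n_buckets - 1)
        buckets[idx].append((p, y))

    rows = []
    for idx in sorted(buckets):
        pairs = buckets[idx]
        n = len(pairs)
        mean_forecast = sum(p for p, _ in pairs) / n
        observed_freq = sum(y for _, y in pairs) / n
        rows.append((mean_forecast, observed_freq, n))
    return rows


# Hypothetical data: five forecasts at 80% and five at 20%.
forecasts = [0.8] * 5 + [0.2] * 5
outcomes = [1, 1, 1, 1, 0] + [0, 0, 0, 0, 1]

for mean_forecast, observed_freq, n in reliability_table(forecasts, outcomes):
    gap = mean_forecast - observed_freq  # positive gap = overconfident bucket
    print(f"forecast={mean_forecast:.2f} observed={observed_freq:.2f} n={n} gap={gap:+.2f}")
```

Each row is one point on the reliability diagram: mean forecast on the horizontal axis, observed frequency on the vertical axis. Buckets with few forecasts give noisy points, so the count is returned alongside each pair.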

Why does calibration matter for prediction markets and AI forecasts?

Calibration is the primary test separating genuine forecasting skill from confident noise, which matters acutely as prediction markets and LLM-generated probabilities feed 2025-2026 macro positioning. An overconfident forecaster is miscalibrated in a specific direction: its extreme probabilities come true less often than stated, so it systematically misprices tail risk. Auditing calibration before trusting a model's probabilities prevents that error from propagating into live trades.

This Is Ledger · Prediction-native macro publication

Glossary

Calibration (forecast calibration)

forecast calibration · probabilistic calibration · reliability

Macro Frameworks

Calibration is the statistical property that a probabilistic forecaster's stated confidence matches its empirical hit rate: events assigned 70% probability should occur roughly 70% of the time. A well-calibrated forecaster is neither systematically overconfident nor underconfident across the full probability range.

How it works

Calibration is assessed by bucketing forecasts by stated probability and comparing each bucket's average prediction to its realized frequency. A reliability diagram plots predicted versus observed, and a perfectly calibrated forecaster lies on the 45-degree line. Calibration is distinct from resolution, the degree to which forecasts separate outcomes: a forecaster can be calibrated yet uninformative by always quoting the base rate. The Brier score, a proper scoring rule, decomposes into a calibration (reliability) term, a resolution term, and an uncertainty term that depends only on the base rate.

Why it matters now

As prediction markets and LLM-generated probabilistic forecasts proliferate in 2025-2026 macro workflows, calibration is the primary diagnostic separating genuine forecasting skill from confident noise — and the metric by which agentic forecasting systems should be audited before their probabilities feed live positioning.

Example

A forecaster issues 100 calls each tagged "80% likely." If only 55 resolve true, the forecaster is overconfident by 25 points in that bucket — the reliability curve sags below the diagonal. Philip Tetlock's Good Judgment Project (2011-2015) showed elite "superforecasters" remained close to the diagonal across buckets, whereas typical pundits clustered probabilities near 0 and 100 and were badly overconfident.
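The arithmetic behind the 80% bucket can be checked directly. The sketch below is illustrative only; the helper names `calibration_gap` and `bucket_brier` are made up for this example, and the comparison against a calibrated forecaster (80 of 100 resolving true) is an added assumption, not a figure from the example above.

```python
def calibration_gap(stated_prob, n_forecasts, n_true):
    """Stated probability minus observed frequency for one bucket.
    A positive value means the forecaster was overconfident."""
    return stated_prob - n_true / n_forecasts


def bucket_brier(stated_prob, n_forecasts, n_true):
    """Mean squared error of a single-probability bucket.
    Events that occurred contribute (p - 1)^2, the rest contribute p^2."""
    n_false = n_forecasts - n_true
    total = n_true * (stated_prob - 1) ** 2 + n_false * stated_prob ** 2
    return total / n_forecasts


# The case from the text: 100 calls at 80%, of which 55 resolve true.
gap = calibration_gap(0.80, 100, 55)        # 0.80 - 0.55
brier_observed = bucket_brier(0.80, 100, 55)  # (55 * 0.04 + 45 * 0.64) / 100

# Hypothetical comparison: the same 100 calls with 80 resolving true.
brier_calibrated = bucket_brier(0.80, 100, 80)  # (80 * 0.04 + 20 * 0.64) / 100

print(f"gap: {gap * 100:.0f} points")
print(f"Brier, 55 true: {brier_observed:.2f}")
print(f"Brier, 80 true: {brier_calibrated:.2f}")
```

The 25-point gap is the vertical distance between the bucket's point and the diagonal on a reliability diagram. The Brier score for the bucket is roughly twice as large as it would be if the 80% calls had resolved true 80% of the time.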

Mechanism

Brier score = (1/N) Σ (forecast_prob − outcome)²; decomposes into reliability (calibration) − resolution + uncertainty
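A minimal sketch of that decomposition follows, assuming forecasts are grouped by their exact stated value (the identity is exact under that grouping; with wider buckets it holds only approximately). The function name `brier_decomposition` and the sample data are hypothetical.

```python
from collections import defaultdict


def brier_decomposition(forecasts, outcomes):
    """Return (brier, reliability, resolution, uncertainty), where
    brier = reliability - resolution + uncertainty.

    Forecasts are grouped by exact stated probability, so the identity
    holds up to floating-point error.
    """
    n = len(forecasts)
    base_rate = sum(outcomes) / n

    groups = defaultdict(list)
    for p, y in zip(forecasts, outcomes):
        groups[p].append(y)

    reliability = 0.0  # calibration error: lower is better
    resolution = 0.0   # separation from the base rate: higher is better
    for p, ys in groups.items():
        freq = sum(ys) / len(ys)
        reliability += len(ys) * (p - freq) ** 2
        resolution += len(ys) * (freq - base_rate) ** 2
    reliability /= n
    resolution /= n

    uncertainty = base_rate * (1 - base_rate)  # depends only on the outcomes
    brier = sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / n
    return brier, reliability, resolution, uncertainty


# Hypothetical data: the 80% bucket resolves true 60% of the time (overconfident),
# the 20% bucket resolves true 20% of the time (calibrated).
forecasts = [0.8] * 5 + [0.2] * 5
outcomes = [1, 1, 1, 0, 0] + [0, 0, 0, 0, 1]

brier, rel, res, unc = brier_decomposition(forecasts, outcomes)
print(f"Brier={brier:.3f} reliability={rel:.3f} resolution={res:.3f} uncertainty={unc:.3f}")
```

A forecaster who always quotes the base rate has zero reliability error and zero resolution, so the Brier score equals the uncertainty term. Skill shows up as resolution gained without paying for it in reliability.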

How desks use it

Auditing LLM or agentic forecast outputs before their probabilities inform live positioning
Comparing pundit or prediction-market probabilities against realized base rates over time
Decomposing a Brier score into reliability and resolution to diagnose overconfidence

