Optimal testing for calibration of predictive models

Wed Mar 2, 2022 12:00 p.m.–1:00 p.m.

Seminar: 
Applied Mathematics

Event time: 
Wednesday, March 2, 2022 - 12:00pm

Location: 
https://yale.zoom.us/j/97458245891

Speaker: 
Edgar Dobriban

Speaker affiliation: 
University of Pennsylvania

Event description: 
The prediction accuracy of machine learning methods is steadily increasing, but the calibration of their uncertainty predictions poses a significant challenge. Numerous works focus on obtaining well-calibrated predictive models, but less is known about reliably assessing model calibration. This limits our ability to know when algorithms for improving calibration have a real effect, and when their improvements are merely artifacts due to random noise in finite datasets. In this work, we consider the problem of detecting mis-calibration of predictive models using a finite validation dataset. Due to the randomness in the data, plug-in measures of calibration need to be compared against a proper background distribution to reliably assess calibration. Thus, detecting mis-calibration in a classification setting can be formulated as a statistical hypothesis testing problem. The null hypothesis is that the model is perfectly calibrated, while the alternative hypothesis is that the deviation from calibration is sufficiently large. We find that detecting mis-calibration is only possible when the conditional probabilities of the classes are sufficiently smooth functions of the predictions. When the conditional class probabilities are Hölder continuous, we propose a minimax optimal test for calibration based on a debiased plug-in estimator of the ℓ₂-Expected Calibration Error (ECE). We further propose a version that is adaptive to unknown smoothness. We verify our theoretical findings with a broad range of experiments, including with several popular deep neural net architectures and several standard post-hoc calibration methods. Our algorithm is a general-purpose tool, which, combined with classical tests for calibration of discrete-valued predictors, can be used to test the calibration of virtually any classification method.
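
To make the hypothesis-testing formulation concrete, here is a minimal sketch for the binary case. It is not the speaker's test: it assumes fixed equal-width binning and calibrates the statistic by Monte Carlo simulation under the null, rather than using the minimax-optimal or smoothness-adaptive construction described in the abstract. The function names (debiased_ece_sq, calibration_test), the bin count, and the number of simulations are illustrative assumptions.

```python
import numpy as np


def debiased_ece_sq(probs, labels, n_bins=10):
    """Binned, debiased plug-in estimate of the squared l2-ECE (binary case).

    Illustrative only: equal-width bins are an assumption of this sketch.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    n = probs.size
    # Assign each prediction to one of n_bins equal-width bins on [0, 1].
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    resid = labels - probs
    total = 0.0
    for b in range(n_bins):
        r = resid[bins == b]
        if r.size < 2:
            continue  # a bin needs at least two points to form a pair
        # (sum r)^2 - sum r^2 equals the sum of r_i * r_j over pairs i != j.
        # Dropping the diagonal terms removes the upward bias of the naive
        # plug-in estimate; the statistic has mean zero under calibration.
        total += (r.sum() ** 2 - (r ** 2).sum()) / (r.size - 1)
    return total / n


def calibration_test(probs, labels, n_bins=10, n_sim=500, seed=0):
    """Monte Carlo test of the null hypothesis 'the model is calibrated'.

    Under the null, each label is Bernoulli(prob), so the background
    distribution of the statistic can be simulated directly.
    """
    rng = np.random.default_rng(seed)
    probs = np.asarray(probs, dtype=float)
    observed = debiased_ece_sq(probs, labels, n_bins)
    null_stats = np.array([
        debiased_ece_sq(probs, rng.random(probs.size) < probs, n_bins)
        for _ in range(n_sim)
    ])
    # One-sided p-value with the usual +1 correction.
    p_value = (1 + np.sum(null_stats >= observed)) / (n_sim + 1)
    return observed, p_value


# Synthetic usage example: predictions p, with true class probability p**2,
# so the model is mis-calibrated by construction.
rng = np.random.default_rng(1)
p = rng.random(2000)
y_mis = rng.random(2000) < p ** 2
stat_mis, p_mis = calibration_test(p, y_mis)
```

On the synthetic mis-calibrated data above, the observed statistic should fall far in the right tail of the simulated null distribution, giving a small p-value. A binned statistic of this kind can only detect deviations that survive averaging within a bin, which is one way to see why the abstract requires smoothness of the conditional class probabilities.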