The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables
Himabindu Lakkaraju (Stanford University); Jon Kleinberg (Cornell University); Jure Leskovec (Stanford University); Jens Ludwig (University of Chicago); Sendhil Mullainathan (Harvard University)
Abstract
Evaluating whether machines improve on human performance is one of the central questions of machine learning. However, there are many domains where the data is \emph{selectively labeled}, in the sense that the observed outcomes are themselves a consequence of the existing choices of the human decision-makers. For instance, in the context of judicial bail decisions, we observe the outcome of whether a defendant fails to return for their court appearance only if the human judge decides to release the defendant on bail. Comparing the performance of humans and machines on data with this type of bias can lead to erroneous estimates and wrong conclusions. Here we propose a novel framework for evaluating the performance of predictive models on selectively labeled data. We develop an evaluation methodology that is robust to the presence of unmeasured confounders (unobservables). We propose a metric that allows us to evaluate the effectiveness of any given black-box predictive model and benchmark it against the performance of human decision-makers. We also develop an approach called \emph{contraction}, which allows us to compute this metric without resorting to counterfactual inference by exploiting the heterogeneity of human decision-makers. Experimental results on real-world datasets spanning diverse domains such as health care, insurance, and criminal justice demonstrate the utility of our evaluation metric in comparing human decisions and machine predictions. Experiments on synthetic data also show that our contraction technique produces accurate estimates of our evaluation metric.
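The abstract only names the contraction idea; the mechanics are developed later in the paper. As a rough, hedged illustration of the kind of computation such an approach involves, the Python sketch below estimates a model's failure rate by "contracting" the released set of a lenient decision-maker, whose released cases have fully observed outcomes, down to a target acceptance rate. All names here (contraction_estimate, released_outcomes, model_risk, caseload_size, target_acceptance_rate) are illustrative assumptions and not the paper's notation.

```python
import numpy as np


def contraction_estimate(released_outcomes, model_risk, caseload_size, target_acceptance_rate):
    """Hedged sketch of a contraction-style failure-rate estimate.

    Assumptions (not taken from the abstract):
      released_outcomes      -- observed binary outcomes (1 = failure) for cases
                                released by a lenient decision-maker.
      model_risk             -- the model's predicted risk for those same released cases.
      caseload_size          -- total number of cases assigned to that decision-maker,
                                released and detained combined.
      target_acceptance_rate -- fraction of the full caseload the model may release.
    """
    released_outcomes = np.asarray(released_outcomes)
    model_risk = np.asarray(model_risk)

    # Number of cases the model is allowed to release out of the full caseload.
    n_release = min(int(target_acceptance_rate * caseload_size), len(released_outcomes))

    # Contract the released set: keep only the n_release cases the model rates lowest-risk.
    keep = np.argsort(model_risk)[:n_release]

    # Outcomes are observed for every released case, so the failure rate of the
    # contracted set can be computed directly, normalized by the full caseload.
    return released_outcomes[keep].sum() / caseload_size
```

For example, with a decision-maker who released 800 of 1,000 assigned cases, `contraction_estimate(outcomes_800, risks_800, 1000, 0.6)` would score the model at a 60% acceptance rate without requiring any counterfactual labels for the detained cases; this is only a sketch of the general idea described in the abstract, not the paper's exact procedure.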