Limits on the discrimination possible with discrete valued data, with application to medical risk prediction

D. R. Lovell, C. R. Dance, M. Niranjan, R. W. Prager, and K. J. Dalton

We describe an upper bound on the accuracy (in the ROC sense) attainable in two-alternative forced-choice risk prediction, for a specific set of data represented by discrete features. By accuracy, we mean the probability that a risk prediction system will correctly rank a randomly chosen high-risk case above a randomly chosen low-risk case.
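
As an illustration only, the following minimal sketch computes such a bound for a labelled data set. It relies on the standard result that, when cases are described solely by discrete features, no ranking beats ordering the distinct feature combinations by their proportion of high-risk cases, with cases that share a combination counted as ties (half credit). The function name `max_attainable_auc` and the input layout (one tuple of feature values per case, labels 1 for high risk and 0 for low risk) are assumptions made for this example, not notation taken from the paper.

```python
from collections import defaultdict


def max_attainable_auc(features, labels):
    """Upper bound on ROC area for cases described by discrete feature tuples.

    features: iterable of tuples of discrete values (hypothetical layout).
    labels:   iterable of 0 (low risk) / 1 (high risk), same length.
    """
    # Count low- and high-risk cases in each distinct feature combination.
    counts = defaultdict(lambda: [0, 0])  # combination -> [n_low, n_high]
    for x, y in zip(features, labels):
        counts[tuple(x)][1 if y else 0] += 1

    n_low = sum(c[0] for c in counts.values())
    n_high = sum(c[1] for c in counts.values())
    if n_low == 0 or n_high == 0:
        raise ValueError("need at least one case of each class")

    # Order combinations by their fraction of high-risk cases (ascending).
    cells = sorted(counts.values(), key=lambda c: c[1] / (c[0] + c[1]))

    # Count correctly ranked (high, low) pairs; pairs that share a
    # combination cannot be separated and score one half.
    correct = 0.0
    low_below = 0
    for lo, hi in cells:
        correct += hi * (low_below + 0.5 * lo)
        low_below += lo
    return correct / (n_low * n_high)


# Toy data: one binary feature, three low-risk and three high-risk cases.
toy_features = [(0,), (0,), (0,), (1,), (1,), (1,)]
toy_labels = [0, 0, 1, 0, 1, 1]
bound = max_attainable_auc(toy_features, toy_labels)
```

On the toy data, 6 of the 9 high/low pairs can be ranked correctly once ties are given half credit, so no predictor restricted to this single feature can exceed an accuracy of 2/3 on these cases.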

We also present methods for estimating the maximum accuracy we can expect to attain using a given set of discrete features to represent data sampled from a given population.

These techniques allow an experimenter to calculate the maximum performance that could be achieved without applying any specific risk prediction method. Furthermore, these techniques can be used to rank discrete features in order of their effect on maximum attainable accuracy.
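
One possible reading of the feature ranking idea, sketched here under stated assumptions, is to remove each feature in turn and record how far the maximum attainable accuracy falls. The leave-one-out scheme and the names `max_auc` and `rank_features_by_loss` are hypothetical choices for this example; the paper's own procedure may differ.

```python
from collections import Counter


def max_auc(rows, labels):
    """Best attainable ROC area when cases are known only by their row."""
    low, high = Counter(), Counter()
    for row, y in zip(rows, labels):
        (high if y else low)[tuple(row)] += 1
    n_low, n_high = sum(low.values()), sum(high.values())
    # Order distinct rows by their fraction of high-risk cases (ascending).
    cells = sorted(set(low) | set(high),
                   key=lambda c: high[c] / (high[c] + low[c]))
    correct, low_below = 0.0, 0
    for c in cells:
        # High-risk cases here outrank all low-risk cases in earlier cells
        # and tie (half credit) with low-risk cases in the same cell.
        correct += high[c] * (low_below + 0.5 * low[c])
        low_below += low[c]
    return correct / (n_low * n_high)


def rank_features_by_loss(rows, labels):
    """Return (feature index, loss in max AUC when removed), largest first."""
    full = max_auc(rows, labels)
    losses = {}
    for j in range(len(rows[0])):
        reduced = [tuple(row[:j]) + tuple(row[j + 1:]) for row in rows]
        losses[j] = full - max_auc(reduced, labels)
    return sorted(losses.items(), key=lambda kv: kv[1], reverse=True)


# Toy data: feature 0 determines the label, feature 1 is uninformative.
toy_rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
toy_labels = [0, 0, 1, 1]
ranking = rank_features_by_loss(toy_rows, toy_labels)
```

In the toy data, dropping feature 0 lowers the bound from 1.0 to 0.5, while dropping feature 1 leaves it unchanged, so feature 0 is ranked first.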