Ranking the effect of different features on the classification of discrete valued data

D. R. Lovell, C. R. Dance, M. Niranjan, R. W. Prager and K. J. Dalton

There is a fundamental upper limit on the extent to which any classifier, neural network or otherwise, can discriminate between two classes of discrete valued data. When classes overlap in feature space, discrimination will be less than perfect.

In a two-alternative forced-choice decision problem, a classifier's discrimination ability is often measured in terms of the area under its receiver operating characteristic (ROC) curve. In this paper, we show how to calculate the maximum possible area under the ROC curve obtained with a particular representation of a given set of discrete valued data. This result corresponds to the fundamental upper limit of discrimination.
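
As an illustration of the kind of calculation involved, the following is a minimal sketch and not necessarily the formulation used in the paper. It assumes labels coded as 0/1 and treats each distinct feature vector as a cell. It relies on the standard result that ranking cells by their fraction of positive examples (equivalently, by likelihood ratio) maximizes the area under the ROC curve, with examples in tied cells counted as half a correct ordering. The function name max_auc and its arguments are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction


def max_auc(rows, labels):
    """Largest AUC any scoring of the discrete cells can reach on this data.

    rows   : sequence of discrete feature vectors (hashable values)
    labels : sequence of 0/1 class labels, same length as rows
    """
    # Count negatives and positives in each distinct feature vector ("cell").
    cells = defaultdict(lambda: [0, 0])
    for row, y in zip(rows, labels):
        cells[tuple(row)][y] += 1

    # Pool cells with the same positive fraction; they tie in the ranking.
    # Fraction keeps the comparison exact.
    by_score = defaultdict(lambda: [0, 0])
    for neg, pos in cells.values():
        score = Fraction(pos, neg + pos)
        by_score[score][0] += neg
        by_score[score][1] += pos

    n_neg = sum(neg for neg, _ in by_score.values())
    n_pos = sum(pos for _, pos in by_score.values())
    if n_neg == 0 or n_pos == 0:
        raise ValueError("both classes must be present")

    # Count (positive, negative) pairs ranked correctly, ties as one half.
    # Everything is doubled so the arithmetic stays in integers.
    twice_wins = 0
    neg_below = 0  # negatives in cells with a strictly lower score
    for score in sorted(by_score):
        neg, pos = by_score[score]
        twice_wins += 2 * pos * neg_below + pos * neg
        neg_below += neg
    return twice_wins / (2 * n_pos * n_neg)
```

For example, max_auc([(0,), (0,), (1,)], [0, 1, 1]) evaluates to 0.75: the cell (0,) holds one example of each class, so no classifier using this representation can separate them.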

We extend this result to show how to estimate both the maximum and the average discrimination we can expect to achieve on unseen test data, again, for a particular representation of a discrete valued data set.

These results have practical engineering application in the selection of discriminative features for neural network training. We show how the bound on discrimination can be used in a backwards elimination algorithm to rank discrete features in order of their discriminative power. The algorithm commences with the saturated model (i.e., using all features and all interactions to represent the data) and removes the feature that has least impact on the upper limit of the model's discrimination. Thus, features are removed from the model in reverse order of their discriminative power.
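
The backwards elimination loop described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: labels are coded 0/1, the bound is computed on the given data by ranking cells by their fraction of positives, and ties between candidate features are broken by taking the first one encountered. The names auc_bound and rank_features are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction


def auc_bound(rows, labels):
    """Maximum AUC attainable on this data with the given discrete cells."""
    by_score = defaultdict(lambda: [0, 0])
    cells = defaultdict(lambda: [0, 0])
    for row, y in zip(rows, labels):
        cells[tuple(row)][y] += 1  # [negatives, positives] per cell
    for neg, pos in cells.values():
        pooled = by_score[Fraction(pos, neg + pos)]
        pooled[0] += neg
        pooled[1] += pos
    n_neg = sum(neg for neg, _ in by_score.values())
    n_pos = sum(pos for _, pos in by_score.values())
    twice_wins, neg_below = 0, 0
    for score in sorted(by_score):
        neg, pos = by_score[score]
        twice_wins += 2 * pos * neg_below + pos * neg  # ties count one half
        neg_below += neg
    return twice_wins / (2 * n_pos * n_neg)


def rank_features(rows, labels):
    """Return feature indices ordered from most to least discriminative."""
    remaining = list(range(len(rows[0])))
    removed = []
    while len(remaining) > 1:
        best = None  # (bound after removal, feature removed)
        for f in remaining:
            kept = [g for g in remaining if g != f]
            projected = [tuple(r[g] for g in kept) for r in rows]
            bound = auc_bound(projected, labels)
            # Least impact = highest bound left after removing the feature.
            if best is None or bound > best[0]:
                best = (bound, f)
        removed.append(best[1])
        remaining.remove(best[1])
    removed.extend(remaining)  # the last survivor is the most discriminative
    return removed[::-1]
```

With rows [(0, 0), (0, 1), (1, 0), (1, 1)] and labels [0, 0, 1, 1], feature 0 determines the class and feature 1 carries no information, so rank_features returns [0, 1]. Because each cell is a full feature vector, the representation at every step keeps all interactions among the remaining features, in the spirit of the saturated model.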

This feature ranking technique is applied to a machine learning benchmark: the classification of the Mushroom Database. We also describe its application to a larger real-world medical risk prediction problem: prediction of risk of adverse pregnancy outcome from information available at initial hospital booking. The results obtained with the latter task suggest that improved risk prediction would require more discriminative information, rather than better predictive techniques. This demonstrates how we can determine whether a classifier's performance is being limited by an inherent inseparability of the data.

Keywords: medical risk prediction, receiver operating characteristic, failure to progress.