LIMITS ON THE DISCRIMINATION POSSIBLE WITH DISCRETE VALUED DATA, WITH APPLICATION TO MEDICAL RISK PREDICTION

David Lovell, Chris Dance, Mahesan Niranjan, Richard Prager and Kevin Dalton

January 1996

We describe an upper bound on the accuracy (in the ROC sense) attainable in two-alternative forced choice risk prediction, for a specific set of data represented by discrete features. By accuracy, we mean the probability that a risk prediction system will correctly rank a randomly chosen high risk case and a randomly chosen low risk case.

We also present methods for estimating the maximum accuracy we can expect to attain using a given set of discrete features to represent data sampled from a given population.

These techniques allow an experimenter to calculate the maximum performance that could be achieved, without having to resort to applying specific risk prediction methods. Furthermore, these techniques can be used to rank discrete features in order of their effect on maximum attainable accuracy.
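As an illustration of the kind of bound described above, here is a minimal sketch in Python. It is not taken from the report. It assumes that, for a fixed data set, no predictor that sees only the discrete features can rank better than one that orders each distinct feature combination by its observed fraction of high risk cases, with ties inside a combination counted as half correct. The function name `max_attainable_auc` and the 0/1 label encoding are hypothetical choices made for this example.

```python
from collections import Counter


def max_attainable_auc(features, labels):
    """Largest two-alternative forced choice accuracy (ROC area) that any
    predictor using only `features` can reach on this data set.

    features: sequence of hashable discrete feature tuples (assumed encoding)
    labels:   sequence of 0 (low risk) or 1 (high risk) (assumed encoding)
    """
    pos = Counter()  # high risk count per distinct feature combination
    neg = Counter()  # low risk count per distinct feature combination
    for x, y in zip(features, labels):
        if y == 1:
            pos[x] += 1
        else:
            neg[x] += 1

    n_pos = sum(pos.values())
    n_neg = sum(neg.values())
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one case of each class")

    # Order feature combinations by increasing observed fraction of
    # high risk cases; this is the ranking assumed to be optimal.
    cells = sorted(set(pos) | set(neg),
                   key=lambda x: pos[x] / (pos[x] + neg[x]))

    correct = 0.0   # number of correctly ranked (high, low) pairs
    neg_below = 0   # low risk cases in strictly lower-ranked combinations
    for x in cells:
        # High risk cases here beat every low risk case ranked below them,
        # and tie (half credit) with low risk cases sharing their features.
        correct += pos[x] * (neg_below + 0.5 * neg[x])
        neg_below += neg[x]

    return correct / (n_pos * n_neg)


if __name__ == "__main__":
    # One binary feature, six cases: three low risk, three high risk.
    X = [(0,), (0,), (0,), (1,), (1,), (1,)]
    y = [0, 0, 1, 0, 1, 1]
    print(max_attainable_auc(X, y))
```

Under the same assumptions, one rough way to rank features would be to drop each feature in turn from the tuples, recompute the bound, and compare the decrease. The value returned describes the given sample only; estimating the bound for the underlying population is a separate problem that this sketch does not address.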

(ftp:) lovell_tr243.ps.Z | (http:) lovell_tr243.ps.Z

PDF (automatically generated from original PostScript document - may be badly aliased on screen):

(ftp:) lovell_tr243.pdf | (http:) lovell_tr243.pdf

If you have difficulty viewing files that end '.gz', which are gzip compressed, then you may be able to find tools to uncompress them at the gzip web site.

If you have difficulty viewing files that are in PostScript (ending '.ps' or '.ps.gz'), then you may be able to find tools to view them at the gsview web site.

We have attempted to provide automatically generated PDF copies of documents for which only PostScript versions have previously been available. These are clearly marked in the database. Due to the nature of the automatic conversion process, they are likely to be badly aliased when viewed on screen at default resolution in acroread.