Problems

 

Successful application of technology is often an incremental process. It takes a number of iterative refinements to make an idea work, and a thorough exposition of the problems encountered at each step is a necessary part of that evolution.

The QAMC project has encountered several interesting problems in developing a web based system for risk prediction in obstetrics. Surprisingly few of them are related to shortcomings in the risk prediction models themselves. As we shall see, far more difficulties arose in relation to the raw data and the interpretation of results.

Limitations of the data

None of the databases available to the QAMC project was collected with the objective of risk prediction in mind. At present, perinatal data is gathered for legal reasons, for hospital administration, and as a means of quality audit [9]. These objectives have a profound effect on the kinds of information gathered and the reliability with which they are recorded (more on the latter in the next section).

It is important to understand that, even though a large amount of data is available, it might contain very little information relevant to a particular outcome. Consequently, we must accept that the kind of data we collect places a fundamental upper limit on the accuracy of our predictions. This concept was presented in a medical context by Hanley and McNeil [2] and is discussed in relation to the QAMC project in [5,6,7].
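One way to make this upper limit concrete is the area under the ROC curve, the discriminability measure discussed by Hanley and McNeil [2]: the probability that a randomly chosen adverse-outcome case receives a higher risk score than a randomly chosen benign case. The sketch below computes it via the rank formulation; the risk scores are invented purely for illustration.

```python
# Area under the ROC curve via the Mann-Whitney rank formulation.
# The risk scores in the examples are invented for illustration.

def auc(pos_scores, neg_scores):
    """Probability that a random adverse-outcome case scores
    higher than a random benign case (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# A predictor carrying no relevant information hovers near 0.5,
# no matter how the model downstream of it is tuned:
print(auc([0.4, 0.6], [0.4, 0.6]))   # 0.5
# A perfectly informative one reaches 1.0:
print(auc([0.8, 0.9], [0.1, 0.2]))   # 1.0
```

If the recorded variables carry little information about the outcome, this quantity stays near 0.5 regardless of the sophistication of the model built on top of them.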

As an example of this, consider the prediction of birthweight. We can build a prediction model on the basis of maternal information, such as height, weight at different stages of pregnancy, etc. However, a far more accurate approach is to take measurements from the baby in utero, perhaps via ultrasound. This is precisely the kind of information which was not routinely recorded in the databases we had access to, and our ability to make an accurate forecast is diminished as a result.

In short, no amount of sophisticated risk modelling can make up for the absence of relevant information in the data that has been collected. This is certainly the limiting factor in predicting the risk of failure to progress (as discussed in [6]). In presenting this web page we hope that clinicians will see the potential benefits of gathering and recording more informative patient details.

Data reliability

Although quality audits (such as described in [6]) go some way towards reassuring us that database information is fairly consistent with case notes, they cannot establish whether case notes are consistent with reality. Systematic inconsistencies in case notes can have serious consequences for the accuracy of risk prediction models trained on retrospective data.

One way in which case notes could distort the true characteristics of a patient population is by presenting different levels of detail in different patient records. Let us consider how this might happen in the SMR2 data. Case details are completed and entered into the database when the mother is discharged from hospital. A mother who experienced complications at some stage is likely to have more detailed case notes than a mother whose pregnancy and delivery proceeded without problem. Even if both mothers have similar medical histories, it is now probable that the first mother will have her history recorded in more detail than the second. If this happens on a systematic basis, we will observe in the data a spurious association between certain conditions and adverse outcome - even though those conditions are equally prevalent in mothers whose pregnancies had benign outcomes - simply because those conditions have usually been recorded when adverse outcomes have occurred.

We suspect this to be the reason behind the following discrepancy in the SMR2 data (reported in [6]). Grand multiparity is a term used to describe a mother who has given birth to four or more children. The ICD-9 code for grand multiparity (6594) is recorded in the SMR2 data. A mother's parity is also recorded as a separate item, and we would expect agreement between that number and the ICD code.

However, parity 4+ and ICD-9 code 6594 are not identical across the database: there are 15,531 instances of parity 4+, but only 696 instances of code 6594. Furthermore, code 6594 is associated with a slightly higher incidence of failure to progress (2.9%) than parity 4+ (2.1%). Although these differences are small, the fact that there are differences at all shows that misreporting has the potential to affect the conclusions that we draw from data.
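A check of this kind is simple to automate. The sketch below cross-tabulates the parity item against the grand multiparity code over a handful of toy records; the field names are assumptions for illustration, not the actual SMR2 schema.

```python
# Consistency check between two items that should agree:
# recorded parity (4+) and ICD-9 code 6594 (grand multiparity).
# Field names and records are hypothetical, not the SMR2 schema.

def consistency_report(records):
    """Compare parity 4+ against ICD-9 code 6594 across records."""
    parity4 = [r for r in records if r["parity"] >= 4]
    coded = [r for r in records if "6594" in r["icd9_codes"]]

    def ftp_rate(group):
        # Incidence of failure to progress within a group.
        if not group:
            return 0.0
        return sum(r["failure_to_progress"] for r in group) / len(group)

    return {
        "parity4_count": len(parity4),
        "code6594_count": len(coded),
        "parity4_ftp": ftp_rate(parity4),
        "code6594_ftp": ftp_rate(coded),
        # Records where the two items disagree:
        "uncoded_parity4": sum(1 for r in parity4
                               if "6594" not in r["icd9_codes"]),
    }

records = [
    {"parity": 5, "icd9_codes": {"6594"}, "failure_to_progress": 1},
    {"parity": 4, "icd9_codes": set(),    "failure_to_progress": 0},
    {"parity": 6, "icd9_codes": set(),    "failure_to_progress": 0},
    {"parity": 1, "icd9_codes": set(),    "failure_to_progress": 0},
]
report = consistency_report(records)
```

Run over the full database, the `uncoded_parity4` figure corresponds to the 15,531 versus 696 gap described above.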

Again, we must keep in mind that the SMR2 database was not compiled for the construction of risk prediction models. However, reliability of data remains an important issue if we wish to make meaningful inferences using large databases of information.

Definition of ICD-9 and other diagnostic codes

The Achilles' heel of the QAMC risk prediction web page is its use of ICD-9 codes to represent maternal health status. At project review meetings, clinicians have consistently commented that codes such as 6525 (high head at term) or 6429 (unspecified hypertension) are vague and open to a considerable degree of interpretation. This is not just a problem with the SMR2 data; other datasets contain their own variety of imprecise terms. The net effect is to reduce the utility of any systems that use these codes as predictors of outcome.

Difficulties with ICD-9 and other diagnostic codes led to the development of an alternative risk prediction web page that used only non-ICD code information [6]. Still, the problems caused by vague diagnostic terms raise the question of why this information was recorded in the first place.

One thing is clear: if large medical databases are to be used to develop accurate and meaningful models of risk, the information they record must be defined more rigorously. It is only through attempting to develop such risk prediction models that this issue comes to light.

Interpretability of results

Leaving aside the question of whether accurate risk prediction models can be built, it is difficult to say whether there is a place for such systems in modern medical practice. Would clinicians actually find a percentage risk estimate useful in case management? This is an issue on which we hope to obtain feedback from visitors to the web page. As we remarked earlier, the successful application of technology is an incremental process, but if there is little enthusiasm for the ultimate objective of that application, the process will founder.

We noted earlier that the QAMC risk prediction web page requires all patient characteristics to be specified before a database query or risk prediction can be made. Bayesian belief networks (see, e.g., [11,4]) offer a promising approach to modelling risk with incomplete information; however, training these systems with large amounts of data remains a challenge. An even greater challenge is that of inferring causality from data, and this is currently a topic of great interest within the AI community.
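The appeal of the belief-network approach can be seen in a toy example: an outcome probability is produced whether or not every patient characteristic has been entered, by summing unspecified variables out against their priors. All probabilities below are invented for illustration; this is not one of the QAMC models.

```python
# A minimal sketch of handling missing inputs in a tiny belief
# network: two independent parent variables and one outcome.
# All probabilities are invented, purely for illustration.

p_h = {0: 0.9, 1: 0.1}      # prior on hypertension
p_m = {0: 0.95, 1: 0.05}    # prior on grand multiparity
# P(adverse outcome = 1 | hypertension, multiparity):
p_o = {(0, 0): 0.01, (0, 1): 0.02, (1, 0): 0.05, (1, 1): 0.10}

def risk(h=None, m=None):
    """P(outcome = 1 | whatever evidence is supplied).

    Unspecified parents are summed out against their priors,
    so the query works with complete or partial information."""
    hs = [h] if h is not None else [0, 1]
    ms = [m] if m is not None else [0, 1]
    num = 0.0   # P(outcome = 1, evidence)
    den = 0.0   # P(evidence)
    for hv in hs:
        for mv in ms:
            w = p_h[hv] * p_m[mv]
            num += w * p_o[(hv, mv)]
            den += w
    return num / den
```

With both characteristics entered, `risk(h=1, m=1)` simply reads the conditional table; with only hypertension known, `risk(h=1)` still returns a usable estimate, which a system demanding complete input cannot do.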

We shall return to the issue of interpretability below.

Generalisation issues

Implicit in our use of SMR2 data is an assumption that the underlying characteristics of the patient population (i.e., the training data) do not change over time. This assumption allows us to generalise about the risk of adverse outcome associated with new cases on the basis of cases we have seen in the past. At this stage, we have not assessed whether stationarity assumptions hold in the SMR2 data.
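A first assessment of stationarity could be as simple as comparing the adverse-outcome rate in an early and a late slice of the database. The sketch below flags a gap of more than two standard errors; the counts are invented, and this crude flag is no substitute for a formal test.

```python
import math

def drift_flag(events_a, n_a, events_b, n_b):
    """True if the outcome rates of two periods differ by more
    than two standard errors (a rough screen for drift)."""
    rate_a, rate_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return abs(rate_a - rate_b) > 2 * se

# e.g. an early five-year slice vs a later one (figures invented):
flag = drift_flag(events_a=420, n_a=20000, events_b=610, n_b=20000)
```

A raised flag would suggest that models trained on the early slice should not be trusted on the later one without re-validation.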

Stationarity is an important issue in the medical domain, where accepted methods of patient management are constantly evolving to improve patient outcome. One consequence of changing medical practice is that information in medical databases has an implicit "use-by date" with regard to building systems to predict certain outcomes.

Our ability to make predictions about patients outside the catchment area of the SMR2 data is another important issue in generalising to new domains. Standards of medical practice and methods of treatment vary between geographical regions, as do certain physical characteristics of the patient population. Thus, it is not clear whether the prediction systems built using the SMR2 data should be used to make forecasts for cases outside Scotland. In an ideal situation, we would have explored this topic by using data collected elsewhere to validate performance. However, as mentioned elsewhere, the types of information recorded in the datasets available to QAMC were too varied to allow valid comparisons to be made.

Quality indicators vs meaningful warnings

 

The term quality indicator refers to a statistic which is thought to be related to the quality of care delivered in a specific medical setting. Caesarean section and perinatal mortality rates are two indicators common in obstetric practice. These statistics allow clinicians and administrators to form rough comparisons between the quality of care delivered in different districts, different hospitals, and even by different doctors. Naturally, such crude estimates must be combined with other relevant information. For example, a large hospital may have a high perinatal mortality rate simply because it receives more than its fair share of high risk cases.

While these rates and measures might be associated with quality of care, their summary nature often makes them of little use for forecasting. (This point is emphasised elsewhere in relation to some of the prediction tasks originally proposed in the QAMC project.) It is hardly any use to make an accurate prediction about the risk of perinatal death unless one can identify reasons why individual cases are at risk. Effective intervention in high risk cases demands that the cause(s) of risk be understood.

This matter could be addressed by modelling more specific, well defined outcomes; however, there is a fundamental and more challenging issue at play. Statistical modelling of risk can have two objectives: one, to achieve an accurate forecast, and two, to achieve interpretable results [7]. At this point in time, it appears as though some trade-off between the two objectives is unavoidable, i.e., the most accurate prediction system makes inscrutable forecasts, and the most interpretable system is not (usually) so accurate. Anyone dealing with models of complex phenomena (such as the relationship between symptoms and disease) ought to bear this in mind.

Medico-legal issues

Few medical domains are as litigious as obstetrics. Hence, it is with a certain sense of trepidation (and a carefully worded disclaimer) that we present the QAMC web page. But suppose for a moment that this web page were actually intended for forecasting risk in real situations - a plethora of medico-legal questions arises.

Take the very nature of the information provided: what is the legal status of an estimated probability of adverse outcome? Unless a forecaster asserts something with certainty, it seems highly unlikely that they could ever be prosecuted for providing a poor prediction. What mechanism of checks and balances exists to ensure that poor predictions do not go unchallenged?

Consider the question of liability: who is ultimately responsible for the information provided? Even if that could be ascertained, if the information is made available across the Internet, the question of which jurisdiction prosecution should take place under is a contentious one. As telemedicine becomes more commonplace, more and more such questions are causing consternation in the courts, concern among clinicians, and lucre for the lawyers.

Still, it seems that these legal problems are not insurmountable. There are a number of medical applications of Bayesian networks available; most notably, a system developed by Knowledge Industries as part of the Microsoft Pregnancy and Child Care package. To the best of our knowledge, this system was trained using expert estimates of conditional probabilities alone, as opposed to a large database of case records.






D.R. Lovell
Mon Sep 15 18:08:31 BST 1997