News: Use of diagnostic codes in AI prediction models can inflate performance levels, study finds
A prognostic study published in the Journal of the American Medical Association (JAMA) raises concerns that artificial intelligence (AI) models designed to predict hospital outcomes may appear far more accurate than they truly are due to a subtle but serious methodological error known as label leakage, JustCoding reported. According to a National Library of Medicine article, label leakage occurs when a model is trained on information that would not actually be available at the time a prediction is made.
Researchers examined how ICD-9 and ICD-10 diagnostic codes—data that is typically finalized only after a patient is discharged—are being used in AI models that aim to predict outcomes occurring during the same hospital stay, such as inpatient mortality. They analyzed electronic health records from 180,640 patients treated in intensive care units or emergency departments at Beth Israel Deaconess Medical Center between 2008 and 2019, using the widely referenced Medical Information Mart for Intensive Care IV (MIMIC-IV) database. The cohort had a mean age of nearly 59 years, and 4.7% of the patients died during the admission.
Using standard training, validation, and test splits, the researchers developed several prediction models, including logistic regression, random forest, and XGBoost, to predict inpatient mortality using only ICD diagnostic codes as input features. All three models achieved exceptionally high performance, with areas under the receiver operating characteristic curve (AUROC) approaching 0.98, signifying near-perfect discrimination.
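To make that setup concrete, the sketch below shows roughly how such a model might be trained and scored on ICD-code indicators. The file and column names (diagnoses.csv, hadm_id, icd_code, died_in_hospital) are illustrative placeholders, not the study's actual pipeline or the MIMIC-IV schema.

```python
# Minimal sketch: predict inpatient mortality from one-hot ICD-code indicators
# and report AUROC on a held-out test set. All file and column names below are
# hypothetical placeholders, not the study's code or the MIMIC-IV schema.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

codes = pd.read_csv("diagnoses.csv")      # one row per (admission, ICD code)
labels = pd.read_csv("admissions.csv")    # one row per admission, with outcome

# Pivot ICD codes into a binary indicator matrix: one column per distinct code.
X = pd.crosstab(codes["hadm_id"], codes["icd_code"]).clip(upper=1)
y = labels.set_index("hadm_id").loc[X.index, "died_in_hospital"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "xgboost": XGBClassifier(eval_metric="logloss"),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    # An AUROC near 0.98, as reported in the study, should prompt a leakage check.
    print(f"{name}: AUROC = {auc:.3f}")
```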
However, the apparent success masked a serious problem: the most influential predictors in these models were ICD codes such as cardiac arrest, brain death, subdural hemorrhage, and encounters for palliative care, all diagnoses that are often assigned late in the hospital stay or after death has already occurred. Because these codes are not available at the point when providers would need to act, their inclusion effectively lets the model peek at the outcome it is supposed to predict rather than forecast it.
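A post-hoc check along these lines could look like the sketch below, which continues the previous example: it ranks the fitted XGBoost model's features by importance and flags codes whose descriptions hint at end-of-stay or post-mortem documentation. The keyword list and file names are illustrative assumptions, not the authors' method.

```python
# Continuing the sketch above: rank ICD-code features by importance and flag
# codes that look like end-of-stay documentation. The keyword list and the
# description file are illustrative assumptions, not the study's methodology.
import pandas as pd

xgb = models["xgboost"]
importance = pd.Series(xgb.feature_importances_, index=X.columns)
top_codes = importance.sort_values(ascending=False).head(20)

suspect_terms = ("cardiac arrest", "brain death", "palliative", "subdural")
descriptions = pd.read_csv("icd_descriptions.csv").set_index("icd_code")["long_title"]

for code, score in top_codes.items():
    title = descriptions.get(code, "unknown")
    late_stay = any(term in title.lower() for term in suspect_terms)
    flag = "  <-- likely assigned late in the stay or after death" if late_stay else ""
    print(f"{code}  importance={score:.3f}  {title}{flag}")
```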
The findings suggest that reliance on such ICD codes can dramatically inflate measured model performance while rendering the tools useless for real-time clinical decision-making.
To assess how widespread the issue is, the researchers also conducted a targeted review of published studies using MIMIC data to predict same-admission outcomes. Of 92 identified studies, 37—more than 40%—included ICD codes as predictive features, despite clear documentation from MIMIC stating that these codes are derived after hospital discharge.
This study has broad implications for the development and evaluation of AI tools in healthcare, particularly as clinical settings look to deploy predictive models at the bedside. Models trained with leaked information may undermine trust in medical AI, waste resources, and mislead clinical decision-making if adopted without careful scrutiny.
Greater diligence is needed to ensure that AI prediction models use only temporally appropriate data. Transparent reporting of feature availability, rigorous validation strategies, and deeper collaboration between providers, coders, and data scientists are also essential steps toward building AI systems that are not just accurate on paper, but reliable and deployable for real-time care.
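As a rough illustration of what temporally appropriate data can mean in practice, the sketch below fixes a prediction time of 24 hours after admission and keeps only features charted before that cutoff. The 24-hour horizon and the column names are assumptions for illustration, not a prescription from the study.

```python
# One possible guard against label leakage, with hypothetical column names:
# fix a prediction time per admission and drop anything recorded after it,
# so the model only sees data available at the moment a decision is made.
import pandas as pd

PREDICTION_HORIZON = pd.Timedelta(hours=24)

events = pd.read_csv("events.csv", parse_dates=["charttime"])         # timestamped features
admissions = pd.read_csv("admissions.csv", parse_dates=["admittime"])  # one row per admission

events = events.merge(admissions[["hadm_id", "admittime"]], on="hadm_id")
cutoff = events["admittime"] + PREDICTION_HORIZON

# ICD billing codes have no in-stay timestamp (they are finalized after
# discharge), so under this rule they never enter the feature set at all.
usable_events = events[events["charttime"] <= cutoff]
```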
Editor’s note: This article originally appeared in JustCoding.
