Thread by @MFGensheimer, Let’s talk about measuring performance of predictive models/biomarkers. Can we compare AUC [...]

Michael Gensheimer

MFGensheimer

Let’s talk about measuring performance of predictive models/biomarkers. Can we compare AUC across studies to rank models? Recent machine learning papers have had eye-popping AUC values (such as DeepMind's paper showing AUC=0.98 for predicting acute kidney injury). (1/n)

Here's the DeepMind paper:
https://www.nature.com/articles/s41586-019-1390-1
It looked at all VA patients who were admitted to the hospital, which would include some with low risk of developing kidney failure such as those getting elective procedures, and some at high risk such as those in septic shock.

A clinically applicable approach to continuous prediction of future acute kidney injury

A deep learning approach that predicts the risk of acute kidney injury may help to identify patients at risk of health deterioration within a time window that enables early treatment.

The area under the receiver operating characteristic curve (AUC) generally ranges from 0.5 (no better than chance) to 1.0 (all patients experiencing the outcome have higher predictor values than all the patients without the outcome).
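
One way to make that definition concrete: AUC equals the probability that a randomly chosen patient with the outcome has a higher predictor value than a randomly chosen patient without it, with ties counted as half. Here is a minimal Python sketch of that pairwise view. The function name `pairwise_auc` and the toy values are illustrative assumptions, not taken from the thread's own R code.

```python
import numpy as np

def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative counts as half a correct pair.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos = scores[labels]
    neg = scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one positive and one negative")
    # Compare every positive with every negative (fine for small datasets).
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Perfect separation: every positive outranks every negative.
perfect = pairwise_auc([1, 2, 3, 4], [0, 0, 1, 1])
# Uninformative predictor: all values are tied.
chance = pairwise_auc([5, 5, 5, 5], [0, 1, 0, 1])
print(perfect, chance)
```

Seen this way, AUC is an average over pairs of patients, so it depends on which patients are in the dataset as well as on the predictor.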

In the DeepMind/VA dataset, the broad patient population makes it easier to achieve a high AUC value. It would be easy to come up with simple criteria that could do a good job of selecting high-risk patients.

Let's use an example where the biomarker is hemoglobin A1c, and the outcome is development of a foot ulcer. Foot ulcers develop mainly in patients with diabetes and are more common in patients with severe diabetes and high hemoglobin A1c.

We will make a simulated dataset with 1,000 patients with diabetes, and we'll assign half the patients to having the outcome of foot ulcer. Here's the ROC curve for hemoglobin A1c as the predictor. We get a pretty good AUC of 0.78.

Now let's add 1,000 non-diabetic patients to the dataset, who have low hemoglobin A1c values and none of whom get a foot ulcer. Wow, AUC jumped up to 0.93!
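
The experiment can be sketched in a few lines of Python. This is not the thread's R code: the normal distributions, their means and standard deviations, and the seed are all assumptions chosen for illustration, so the exact AUC values will differ from the ones quoted above.

```python
import numpy as np

def auc(scores, labels):
    """Fraction of (ulcer, no-ulcer) pairs where the ulcer patient has higher A1c."""
    pos = scores[labels]
    neg = scores[~labels]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

rng = np.random.default_rng(0)

# Hypothetical hemoglobin A1c distributions (percent); assumed, not from the thread.
a1c_ulcer = rng.normal(8.8, 1.2, size=500)         # diabetic, develops foot ulcer
a1c_no_ulcer = rng.normal(7.5, 1.2, size=500)      # diabetic, no foot ulcer
a1c_nondiabetic = rng.normal(5.2, 0.4, size=1000)  # non-diabetic, no foot ulcer

# Dataset 1: diabetic patients only.
scores_diabetic = np.concatenate([a1c_ulcer, a1c_no_ulcer])
labels_diabetic = np.concatenate([np.ones(500, dtype=bool), np.zeros(500, dtype=bool)])

# Dataset 2: the same diabetic patients plus 1,000 non-diabetic patients.
scores_all = np.concatenate([scores_diabetic, a1c_nondiabetic])
labels_all = np.concatenate([labels_diabetic, np.zeros(1000, dtype=bool)])

auc_diabetic = auc(scores_diabetic, labels_diabetic)
auc_all = auc(scores_all, labels_all)
print(f"diabetics only: {auc_diabetic:.2f}, with non-diabetics added: {auc_all:.2f}")
```

The predictor is identical in both datasets. The second AUC is higher only because the added non-diabetic patients create many easy pairs (an ulcer patient versus someone with a low A1c) that are almost always ranked correctly.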

By plotting the raw data you can see why this is happening. First dataset (y=developed foot ulcer):

Second dataset:

Take-home points: metrics like AUC are useful for comparing performance of model 1 vs. 2 on the same dataset, but not so good at comparing model 1 on dataset 1 vs. model 2 on dataset 2. Think about the population the model was tested on. And Google/DeepMind is not Skynet (yet).

Link to R code and more detailed info below. It's on Google Colab so you can edit the assumptions and re-run in your browser. https://colab.research.google.com/github/MGensheimer/teaching/blob/master/auc_demo/auc_demo.ipynb

