I've seen a few papers describing the characteristics of people who tested positive for COVID-19, and this is sometimes being interpreted as describing the *probability of infection* for people with certain characteristics. Let's talk about why that's likely not true 👇🧵

1/22
👉 Usually when thinking about estimating the prevalence of a disease, we use the *sensitivity* and *specificity* of the test to help us
👉 The calculations assume that everyone is equally likely to get tested, and with COVID-19 that is likely not the case

2/22
Let's do some 💭 thought experiments. For these, my goal is to estimate the probability of being infected with 🦠COVID-19 given you have 🧩Disease X

For example, 🧩 Disease X could be:
♥️ heart disease
🩸 hypertension
➕ it could also be any subgroup (for example age, etc)

3/22
In these 💭 thought experiments, we don't actually have perfect information about who is infected with 🦠 COVID-19; we only know, among those who are 🧪 *tested*, who has been infected with 🦠 COVID-19. This is really the crux of the matter.

4/22
For these 💭 thought experiments, assume that the current tests are *perfect* (that is, there are 0 false positives and 0 false negatives)

☝️ Note that this is likely not the case: with the current testing framework, false (+) are unlikely but false (-) may be occurring

5/22
We want the probability of being infected with COVID-19 given you have disease X
P(🦠|🧩)

To get this, we need P(🧩|🦠) because based on Bayes' Theorem we know:

P(🦠|🧩) = P(🧩|🦠)P(🦠) / P(🧩)

6/22
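
To make the formula concrete, here is a minimal Python sketch of Bayes' Theorem as written above. The function name `posterior` and the example inputs are made up for illustration; they are not from any paper.

```python
def posterior(p_x_given_covid: float, p_covid: float, p_x: float) -> float:
    """Bayes' Theorem: P(covid | X) = P(X | covid) * P(covid) / P(X)."""
    if not 0 < p_x <= 1:
        raise ValueError("p_x must be in (0, 1]")
    return p_x_given_covid * p_covid / p_x


# Made-up inputs: 20% of COVID-19 cases have disease X,
# 50% of people have COVID-19, 20% of people have disease X.
example = posterior(p_x_given_covid=0.2, p_covid=0.5, p_x=0.2)
print(f"P(covid | X) = {example:.2f}")
```

The catch is the first argument: the formula needs P(🧩|🦠) for everyone with COVID-19, not only for those who happened to get tested.
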
BUT, instead of P(🧩|🦠), we actually have P(🧩|🦠, 🧪) - the probability of having disease X given you have COVID-19 AND you were tested. So the crux of these thought experiments will be trying to get an accurate estimate of P(🧩|🦠) so that we can get back to P(🦠|🧩)

7/22
💭 experiment 1️⃣: Best case scenario

🧩 20%¹ of the population has disease X
🦠 50%¹ have COVID-19
❌ There is no relationship between disease X and COVID-19
🧪 People with disease X are just as likely to get tested as people without disease X

-
¹ all numbers are made up
8/22
Why is 💭 experiment 1️⃣ a best case scenario?

It looks like:
🧪50% have COVID-19 among those tested
🧪 Of those who tested positive, the prevalence of disease X is 20%
P(🦠|🧩) = 50%

✅ Reality (no relationship between disease X and COVID-19) matches what we see

9/22
💭 experiment 2️⃣: Oversampling scenario

🧩 20% of the population has disease X
🦠 50% have COVID-19
❌ There is no relationship between disease X and COVID-19
🧪 People with disease X are ✨2x✨ more likely to get tested than people without disease X

10/22
Why is 💭 experiment 2️⃣ bad?

It looks like:
🧪 50% have COVID-19 among those tested
🧪 Of those who tested positive for COVID-19, the prevalence of disease X is 33% 😱

❌ If we plug in what we see, P(🧩|🦠, 🧪), for P(🧩|🦠), it looks like P(🦠|🧩) is about 83% (⅓ × 0.5 / 0.2); reality is 50%

11/22
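
The numbers in experiment 2️⃣ can be reproduced with a short sketch. It assumes, as the thread does, a perfect test and that the chance of being tested depends only on disease X status. The function name and its `test_ratio` parameter are hypothetical.

```python
def observed_among_tested(p_x, p_covid_given_x, p_covid_given_not_x, test_ratio):
    """Return (share positive among tested, share with X among tested positives).

    test_ratio = P(tested | X) / P(tested | no X). Testing is assumed to depend
    only on disease X status, and the test is assumed to be perfect.
    """
    # Relative size of each group in the tested pool (only ratios matter).
    tested_x = p_x * test_ratio
    tested_not_x = 1 - p_x
    pos_x = tested_x * p_covid_given_x
    pos_not_x = tested_not_x * p_covid_given_not_x
    share_positive = (pos_x + pos_not_x) / (tested_x + tested_not_x)
    share_x_among_positive = pos_x / (pos_x + pos_not_x)
    return share_positive, share_x_among_positive


p_x = 0.2  # made-up population prevalence of disease X
share_positive, share_x = observed_among_tested(p_x, 0.5, 0.5, test_ratio=2)
naive = share_x * share_positive / p_x  # plug the biased share into Bayes' Theorem
print(f"positive among tested:     {share_positive:.3f}")
print(f"disease X among positives: {share_x:.3f}")
print(f"naive P(covid | X):        {naive:.3f} (reality: 0.500)")
```
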
💭 experiment 3️⃣: Undersampling scenario

🧩 20% of the population has disease X
🦠 50% have COVID-19
❌ There is no relationship between disease X and COVID-19
🧪 People with disease X are ✨1/2✨ as likely to get tested as people without disease X

12/22
Why is 💭 experiment 3️⃣ bad?

It looks like:
🧪 50% have COVID-19 among those tested
🧪 Of those who tested positive for COVID-19, the prevalence of disease X is 11%

❌ If we plug in what we see, P(🧩|🦠, 🧪), for P(🧩|🦠), it looks like P(🦠|🧩) is about 28% (⅑ × 0.5 / 0.2); reality is 50%

13/22
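
Experiments 1️⃣ to 3️⃣ differ only in how much more (or less) likely people with disease X are to get tested, so they can be lined up in one sketch. The helper name `naive_estimate` is made up, and the defaults are the thread's made-up numbers.

```python
def naive_estimate(test_ratio, p_x=0.2, p_covid=0.5):
    """Naive P(covid | X) when X and COVID-19 are unrelated but testing is uneven.

    test_ratio = P(tested | X) / P(tested | no X).
    """
    # With no association, the share with disease X among tested positives
    # equals the share with disease X among everyone tested.
    share_x_tested = p_x * test_ratio / (p_x * test_ratio + (1 - p_x))
    return share_x_tested * p_covid / p_x


ratios = [0.5, 1, 2]  # undersampled, equal testing, oversampled
estimates = {ratio: naive_estimate(ratio) for ratio in ratios}
for ratio, estimate in estimates.items():
    print(f"testing ratio {ratio}: naive P(covid | X) = {estimate:.3f}")
```

Only a testing ratio of 1 gives back the true 50%. Anything else pushes the naive estimate up or down even though there is no relationship at all.
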
💭 experiment 4️⃣: two problems scenario

🧩 20% of the population has disease X
🦠 56% have COVID-19
✅ People with disease X are 1.6 times as likely to have COVID-19 as people without it: P(🦠|🧩) = 80%, vs. 50% for people without disease X
🧪 People with disease X are ✨5x✨ as likely to get tested as people without disease X
14/22
Why is 💭 experiment 4️⃣ bad?

It looks like:
🦠🧪 About 67% have COVID-19 among those tested
🧩🧪 Of those who tested positive for COVID-19, the prevalence of disease X is about 67%

❌ We're getting both the prevalence of COVID-19 *and* its association with Disease X wrong

15/22
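
For experiment 4️⃣ it can help to lay out the tested population as a 2x2 table. This is a sketch with the thread's made-up numbers, using exact fractions so the ⅔ values are visible. It assumes a perfect test and that testing depends only on disease X status.

```python
from fractions import Fraction

# Made-up inputs from experiment 4.
p_group = {"X": Fraction(1, 5), "no X": Fraction(4, 5)}
p_covid_given = {"X": Fraction(4, 5), "no X": Fraction(1, 2)}
test_weight = {"X": 5, "no X": 1}  # relative chance of being tested

# Relative size of each cell among the people who get tested.
cells = {}
for group in ("X", "no X"):
    tested = p_group[group] * test_weight[group]
    cells[(group, "positive")] = tested * p_covid_given[group]
    cells[(group, "negative")] = tested * (1 - p_covid_given[group])

total = sum(cells.values())
positive = cells[("X", "positive")] + cells[("no X", "positive")]
share_positive = positive / total
share_x_among_positive = cells[("X", "positive")] / positive

print(f"positive among tested:     {float(share_positive):.3f} (reality: 0.560)")
print(f"disease X among positives: {float(share_x_among_positive):.3f}")
```

The four cells come out in the ratio 4 : 1 (disease X, positive : negative) and 2 : 2 (no disease X).
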
OKAY, scenarios finished, so hopefully this highlights why we can't take the prevalence of characteristics in the *tested positive* population as the prevalence of characteristics in the overall COVID-19 population. Now, here are tips for how we can correct the numbers 👇

16/22
Scenario 2️⃣: Oversampling by 2x

👉 Take those with disease X who tested positive for COVID-19 and downweight them by a factor of 2.
✅ Tested positives split 1 : 2 (disease X : no disease X), so the adjusted prevalence of Disease X among those who tested positive for COVID-19 is 0.5 / (0.5 + 2) = 0.5 / 2.5 = 0.2 (20%)
P(🦠|🧩) = 50%

17/22
Scenario 3️⃣: Undersampling by 1/2

👉 Take those with disease X who tested positive for COVID-19 and upweight them by a factor of 2.
✅ Tested positives split 1 : 8 (disease X : no disease X), so the adjusted prevalence of Disease X among those who tested positive for COVID-19 is 2 / (2 + 8) = 2 / 10 = 0.2 (20%)
P(🦠|🧩) = 50%

18/22
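
The up- and downweighting in scenarios 2️⃣ and 3️⃣ is the same operation with a different factor. A minimal sketch, assuming the testing ratio is known (in practice it would have to be estimated somehow); `reweighted_share` is a made-up name.

```python
def reweighted_share(observed_share_x, test_ratio):
    """Undo unequal testing by dividing the disease X weight by test_ratio.

    observed_share_x: share with disease X among tested positives.
    test_ratio: P(tested | X) / P(tested | no X), assumed known.
    """
    weight_x = observed_share_x / test_ratio
    weight_not_x = 1 - observed_share_x
    return weight_x / (weight_x + weight_not_x)


scenario_2 = reweighted_share(1 / 3, test_ratio=2)    # oversampled 2x
scenario_3 = reweighted_share(1 / 9, test_ratio=0.5)  # undersampled by 1/2
print(f"scenario 2: adjusted P(X | covid) = {scenario_2:.2f}")
print(f"scenario 3: adjusted P(X | covid) = {scenario_3:.2f}")

# Plugging the adjusted share back into Bayes' Theorem (P(covid) = 0.5, P(X) = 0.2).
print(f"P(covid | X) = {scenario_2 * 0.5 / 0.2:.2f}")
```
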
Scenario 4️⃣: Two problems

For the prevalence of COVID-19, correct for the different probability of being tested in each subgroup by weighting each subgroup's positivity by its share of the population (🧩 = disease X, ❌🧩 = No disease X)

P(🦠) = P(🦠 | 🧩) P(🧩) + P(🦠 | ❌🧩) P(❌🧩)

✅P(🦠) = ⅘ * 0.2 + ½ * 0.8 = 56%

19/22
Scenario 4️⃣: Two problems

Said another way, for calculating the overall prevalence of COVID-19, this is like downweighting the oversampled Disease X people (divide by 5). Among those tested, the split is 4 : 1 (positive : negative) with Disease X and 2 : 2 without.

✅ (⅘ + 2) / (⅘ + 2 + ⅕ + 2) = 0.56

20/22
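
Here is a sketch showing that the two routes to the overall prevalence agree. The cell counts are the relative 4 : 1 and 2 : 2 splits among those tested in scenario 4️⃣, and the subgroup positivity is read off the tested cells, which only works under the assumption that testing depends on nothing but disease X status.

```python
from fractions import Fraction

# Relative counts among those tested (made-up numbers from scenario 4).
tested = {
    ("X", "positive"): 4,
    ("X", "negative"): 1,
    ("no X", "positive"): 2,
    ("no X", "negative"): 2,
}
test_weight = {"X": 5, "no X": 1}  # disease X oversampled 5x
p_x = Fraction(1, 5)               # population share with disease X

# Route 1: law of total probability with subgroup positivity from the tested cells.
p_covid_given_x = Fraction(tested[("X", "positive")], 4 + 1)
p_covid_given_not_x = Fraction(tested[("no X", "positive")], 2 + 2)
total_probability = p_covid_given_x * p_x + p_covid_given_not_x * (1 - p_x)

# Route 2: divide each cell by its group's testing weight, then take the share positive.
weighted = {cell: Fraction(n, test_weight[cell[0]]) for cell, n in tested.items()}
weighted_positive = weighted[("X", "positive")] + weighted[("no X", "positive")]
reweighted = weighted_positive / sum(weighted.values())

print(f"total probability: {float(total_probability):.2f}")
print(f"reweighted cells:  {float(reweighted):.2f}")
```
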
Scenario 4️⃣: Two problems

For calculating the prevalence of disease X among COVID-19 patients:
✅ P(🧩 | 🦠) = P(🦠 | 🧩) P(🧩) / P(🦠) = ⅘ * 0.2 / 0.56 ≈ 0.286
Again, downweight the oversampled Disease X population (divide by 5).
✅ ⅘ / (⅘ + 2) ≈ 0.286

P(🦠|🧩) = 80%
21/22
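
As a final check, a short sketch of the round trip for scenario 4️⃣: compute the corrected P(🧩|🦠) both ways, then plug it back into Bayes' Theorem. All inputs are the thread's made-up numbers.

```python
from fractions import Fraction

p_x = Fraction(1, 5)              # P(X)
p_covid_given_x = Fraction(4, 5)  # P(covid | X)
p_covid = Fraction(14, 25)        # corrected P(covid) = 0.56

# Bayes' Theorem for the share with disease X among all COVID-19 cases.
p_x_given_covid = p_covid_given_x * p_x / p_covid

# Same quantity from tested positives (4 with X, 2 without), dividing the X count by 5.
reweighted = Fraction(4, 5) / (Fraction(4, 5) + 2)

# Round trip: the corrected share recovers the true P(covid | X).
recovered = p_x_given_covid * p_covid / p_x

print(f"P(X | covid) via Bayes:       {float(p_x_given_covid):.3f}")
print(f"P(X | covid) via reweighting: {float(reweighted):.3f}")
print(f"P(covid | X) recovered:       {float(recovered):.2f}")
```
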
Hopefully this is somewhat helpful when reading about characteristics of those who are currently testing positive for COVID-19. As always, please let me know if there is something I've missed! 🙏

22/22
You can follow @LucyStats.