I've seen a few papers describing the characteristics of people who tested positive for COVID-19, and this is sometimes being interpreted as describing the *probability of infection* for people with certain characteristics. Let's talk about why that's likely not true.
1/22
Usually when thinking about estimating the prevalence of a disease, we use the *sensitivity* and *specificity* of the test to help us
The calculations assume that everyone is equally likely to get tested, and with COVID-19 that is likely not the case
2/22
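For reference, here is a minimal sketch of the kind of sensitivity/specificity correction this refers to. It uses the standard Rogan-Gladen adjustment, which is my choice of illustration and is not named in the thread; the function name and the example numbers are hypothetical. Like the calculations mentioned above, it only makes sense if everyone is equally likely to get tested.

```python
def adjusted_prevalence(apparent_prevalence, sensitivity, specificity):
    """Correct the fraction testing positive for an imperfect test.

    Rogan-Gladen adjustment (an assumption for illustration, not from the thread):
    true prevalence = (apparent + specificity - 1) / (sensitivity + specificity - 1)
    """
    denominator = sensitivity + specificity - 1.0
    if denominator <= 0:
        raise ValueError("the test must perform better than chance")
    estimate = (apparent_prevalence + specificity - 1.0) / denominator
    # Sampling noise can push the estimate slightly outside [0, 1]; clip it.
    return min(1.0, max(0.0, estimate))


# Hypothetical numbers: 30% test positive, sensitivity 70%, specificity 99%.
print(adjusted_prevalence(0.30, sensitivity=0.70, specificity=0.99))
```

With a perfect test (sensitivity and specificity both 1) the function returns the apparent prevalence unchanged, which is the simplification the thought experiments below rely on.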
Let's do some thought experiments. For these, my goal is to estimate the probability of being infected with COVID-19 given you have Disease X
For example, Disease X could be:
heart disease
hypertension
it could also be any subgroup (for example age, etc)
3/22
In these thought experiments, we don't actually have perfect information about who is infected with COVID-19. We only know, among those who are *tested*, who has been infected. This is really the crux of the matter.
4/22
For these thought experiments, assume that the current tests are *perfect* (that is, there are 0 false positives and 0 false negatives)
Note that this is likely not the case: with the current testing framework, false positives are unlikely but false negatives may be occurring
5/22
We want the probability of being infected with COVID-19 given you have disease X
P(COVID-19 | X)
To get this, we need P(X | COVID-19), because based on Bayes' Theorem we know:
P(COVID-19 | X) = P(X | COVID-19) P(COVID-19) / P(X)
6/22
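As a minimal sketch, here is that Bayes' Theorem step as a small Python function. The function and argument names are mine, and the inputs in the example call are made-up numbers rather than real estimates.

```python
def p_covid_given_x(p_x_given_covid, p_covid, p_x):
    """Bayes' Theorem: P(COVID-19 | X) = P(X | COVID-19) * P(COVID-19) / P(X)."""
    if not 0 < p_x <= 1:
        raise ValueError("P(X) must be in (0, 1]")
    return p_x_given_covid * p_covid / p_x


# Made-up inputs: 20% of COVID-19 cases have disease X, 50% of people have
# COVID-19, and 20% of the population has disease X.
print(p_covid_given_x(p_x_given_covid=0.2, p_covid=0.5, p_x=0.2))
```

Everything that follows is about what happens when the first argument comes from tested people only instead of from everyone with COVID-19.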
BUT, instead of P(X | COVID-19), we actually have P(X | COVID-19, tested): the probability of having disease X given you have COVID-19 AND you were tested. So the crux of these thought experiments will be trying to get an accurate estimate of P(X | COVID-19) so that we can get back to P(COVID-19 | X)
7/22
Experiment 1: Best case scenario
20%¹ of the population has disease X
50%¹ have COVID-19
There is no relationship between disease X and COVID-19
People with disease X are just as likely to get tested as people without disease X
-
¹ all numbers are made up
8/22
Why is experiment 1 a best case scenario?
It looks like:
50% have COVID-19 among those tested
Of those who tested positive, the prevalence of disease X is 20%
P(COVID-19 | X) = 50%
Reality (no relationship between disease X and COVID-19) matches what we see
9/22
Experiment 2: Oversampling scenario
20% of the population has disease X
50% have COVID-19
There is no relationship between disease X and COVID-19
People with disease X are 2x as likely to get tested as people without disease X
10/22
Why is experiment 2 bad?
It looks like:
50% have COVID-19 among those tested
Of those who tested positive for COVID-19, the prevalence of disease X is 33%
If we plug in what we see, P(X | COVID-19, tested), for P(X | COVID-19), it looks like P(COVID-19 | X) is about 83% (0.33 * 0.5 / 0.2 = 0.825 with the rounded 33%); reality is 50%
11/22
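To make the oversampling scenario concrete, here is a small simulation sketch. The absolute testing probabilities (20% for people with disease X, 10% for everyone else) are hypothetical; only their 2x ratio comes from the scenario. The population size, seed, and names are also my own choices.

```python
import random


def simulate(n=200_000, p_x=0.2, p_covid=0.5,
             p_test_x=0.2, p_test_not_x=0.1, seed=1):
    """Return (X share among tested positives, naive P(COVID-19 | X))."""
    rng = random.Random(seed)
    x_pos = 0      # tested positive, has disease X
    not_x_pos = 0  # tested positive, no disease X
    for _ in range(n):
        has_x = rng.random() < p_x
        has_covid = rng.random() < p_covid  # independent of X by construction
        p_test = p_test_x if has_x else p_test_not_x
        tested = rng.random() < p_test
        if tested and has_covid:  # perfect test: positive exactly when infected
            if has_x:
                x_pos += 1
            else:
                not_x_pos += 1
    x_share = x_pos / (x_pos + not_x_pos)
    # Plug the tested-only share into Bayes' Theorem as if it were P(X | COVID-19)
    naive = x_share * p_covid / p_x
    return x_share, naive


x_share, naive = simulate()
print(f"X among tested positives: {x_share:.3f}")
print(f"naive P(COVID-19 | X):    {naive:.3f}")
```

The simulated population has P(COVID-19 | X) = 50% by construction, so any gap between that and the naive estimate comes purely from who got tested.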
Experiment 3: Undersampling scenario
20% of the population has disease X
50% have COVID-19
There is no relationship between disease X and COVID-19
People with disease X are half as likely to get tested as people without disease X
12/22
Why is experiment 3 bad?
It looks like:
50% have COVID-19 among those tested
Of those who tested positive for COVID-19, the prevalence of disease X is 11%
If we plug in what we see, P(X | COVID-19, tested), for P(X | COVID-19), it looks like P(COVID-19 | X) is about 28% (0.11 * 0.5 / 0.2 = 0.275 with the rounded 11%); reality is 50%
13/22
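The over- and undersampling scenarios differ only in the testing ratio, so a short sketch can sweep over it. The function name is mine, and it assumes what those scenarios assume: no true relationship between disease X and COVID-19, and a perfect test.

```python
def naive_p_covid_given_x(test_ratio, p_x=0.2, p_covid=0.5):
    """Naive Bayes estimate when people with X are `test_ratio` times as likely to be tested."""
    # Share of X among tested positives. Because X and COVID-19 are independent
    # here, the infection rate cancels and only the testing ratio matters.
    x_share = p_x * test_ratio / (p_x * test_ratio + (1 - p_x))
    # Plug that share into Bayes' Theorem in place of P(X | COVID-19)
    return x_share * p_covid / p_x


for ratio in (0.5, 1, 2):
    print(ratio, round(naive_p_covid_given_x(ratio), 3))
```

Only the ratio of 1 (equal testing) recovers the true 50%; halving the ratio pulls the estimate down to roughly 28%, and doubling it pushes it up to roughly 83%.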
Experiment 4: Two problems scenario
20% of the population has disease X
56% have COVID-19
People with disease X are 1.6 times as likely to have COVID-19: P(COVID-19 | X) = 80%, versus 50% for people without disease X
People with disease X are 5x as likely to get tested as people without disease X
14/22
Why is experiment 4 bad?
It looks like:
About 67% have COVID-19 among those tested (reality is 56%)
Of those who tested positive for COVID-19, the prevalence of disease X is about 67% (reality is about 29%)
We're getting both the prevalence of COVID-19 *and* its association with Disease X wrong
15/22
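Here is a sketch that works through the two problems scenario with exact fractions, comparing the tested pool with the whole population. The variable names are mine, and the 50% infection rate among people without disease X is the value implied by the 56% overall prevalence.

```python
from fractions import Fraction as F

p_x = F(1, 5)            # 20% have disease X
p_covid_x = F(4, 5)      # P(COVID-19 | X) = 80%
p_covid_not_x = F(1, 2)  # P(COVID-19 | no X) = 50% (implied by 56% overall)
test_ratio = 5           # people with X are 5x as likely to be tested

# What is true in the whole population
true_prevalence = p_covid_x * p_x + p_covid_not_x * (1 - p_x)
true_x_among_covid = p_covid_x * p_x / true_prevalence

# Relative make-up of the tested pool
tested_x = p_x * test_ratio
tested_not_x = 1 - p_x
pos_x = tested_x * p_covid_x
pos_not_x = tested_not_x * p_covid_not_x

# What we see among tested people
covid_among_tested = (pos_x + pos_not_x) / (tested_x + tested_not_x)
x_among_positive = pos_x / (pos_x + pos_not_x)

print("COVID-19 prevalence: true", true_prevalence, "vs tested", covid_among_tested)
print("X among COVID-19:    true", true_x_among_covid, "vs tested", x_among_positive)
```

Both quantities come out at 2/3 in the tested pool, against 14/25 and 2/7 in the population.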
OKAY, scenarios finished. Hopefully this highlights why we can't take the prevalence of characteristics in the *tested positive* population as the prevalence of characteristics in the overall COVID-19 population. Now, here are some tips for how we can correct the numbers.
16/22
Scenario 2: Oversampling by 2x
Take those with disease X who tested positive for COVID-19 and downweight them by a factor of 2.
For every 1 positive with disease X there are 2 without, so the adjusted prevalence of Disease X among those who tested positive for COVID-19 is 0.5 / (0.5 + 2) = 0.5 / 2.5 = 0.2 (20%)
P(COVID-19 | X) = 50%
17/22
Scenario 3: Undersampling by 1/2
Take those with disease X who tested positive for COVID-19 and upweight them by a factor of 2.
For every 1 positive with disease X there are 8 without, so the adjusted prevalence of Disease X among those who tested positive for COVID-19 is 2 / (2 + 8) = 2 / 10 = 0.2 (20%)
P(COVID-19 | X) = 50%
18/22
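The two reweightings above are the same operation with different ratios: divide the disease X count by how much more (or less) likely that group was to be tested. A minimal sketch, with a function name of my own choosing and counts expressed per group of tested positives:

```python
def reweighted_x_share(x_positives, not_x_positives, test_ratio):
    """Share of disease X among positives after undoing unequal testing.

    test_ratio: how many times as likely people with X were to be tested
    (2 for the oversampling scenario, 0.5 for the undersampling scenario).
    """
    weighted_x = x_positives / test_ratio
    return weighted_x / (weighted_x + not_x_positives)


# Oversampling by 2x: 1 positive with X for every 2 without (33% observed)
print(reweighted_x_share(1, 2, test_ratio=2))
# Undersampling by 1/2: 1 positive with X for every 8 without (11% observed)
print(reweighted_x_share(1, 8, test_ratio=0.5))
```

This only works if the testing ratio is known or can be estimated from somewhere else, which in practice is the hard part.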
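Scenario 4: Two problems
For the prevalence of COVID-19, correct for the different probability of being tested in each subgroup by weighting each subgroup's COVID-19 rate by its share of the population (X = disease X, no X = no disease X)
P(COVID-19) = P(COVID-19 | X) P(X) + P(COVID-19 | no X) P(no X)
P(COVID-19) = ⅘ * 0.2 + ½ * 0.8 = 56%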
19/22
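A sketch of that subgroup-weighted prevalence as code. The function name and the list-of-pairs input are my own choices; it assumes the subgroup shares of the population are known from some outside source and that the subgroups cover everyone.

```python
def overall_prevalence(strata):
    """strata: (P(COVID-19 | group), P(group)) pairs covering the whole population."""
    total_share = sum(share for _, share in strata)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("group shares must sum to 1")
    # Weight each group's infection rate by its share of the population
    return sum(rate * share for rate, share in strata)


strata = [
    (4 / 5, 0.2),  # disease X: 80% infected, 20% of the population
    (1 / 2, 0.8),  # no disease X: 50% infected, 80% of the population
]
print(overall_prevalence(strata))
```

The infection rate within each subgroup can be read straight from the tested people, as long as testing within a subgroup doesn't depend on infection status.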
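Scenario 4: Two problems
Said another way, for calculating the overall prevalence of COVID-19, this is like downweighting the oversampled Disease X people (divide by 5).
For every 9 people tested, 5 have disease X (4 positive, 1 negative) and 4 don't (2 positive, 2 negative), so:
(⅘ + 2) / (⅘ + 2 + ⅕ + 2) = 0.56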
20/22
Scenario 4: Two problems
For calculating the prevalence of disease X among COVID-19 patients:
P(X | COVID-19) = P(COVID-19 | X) P(X) / P(COVID-19) = ⅘ * 0.2 / 0.56 ≈ 0.286
Again, downweight the oversampled Disease X population (divide by 5).
⅘ / (⅘ + 2) ≈ 0.286
P(COVID-19 | X) = 80%
21/22
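The downweighting in the two problems scenario can be sketched as inverse weighting of tested counts. The counts below (per 9 people tested) are my reading of the fractions used in the thread, and the dictionary layout and names are hypothetical.

```python
# Tested counts per 9 people tested: (group, infected?) -> count
tested_counts = {
    ("X", True): 4,
    ("X", False): 1,
    ("no X", True): 2,
    ("no X", False): 2,
}
# Relative probability of being tested, by group
relative_test_rate = {"X": 5, "no X": 1}

# Divide each count by how oversampled its group was
weighted = {
    key: count / relative_test_rate[key[0]]
    for key, count in tested_counts.items()
}

positives = weighted[("X", True)] + weighted[("no X", True)]
covid_prevalence = positives / sum(weighted.values())
x_among_covid = weighted[("X", True)] / positives
covid_given_x = weighted[("X", True)] / (
    weighted[("X", True)] + weighted[("X", False)]
)

print(f"P(COVID-19)     = {covid_prevalence:.3f}")
print(f"P(X | COVID-19) = {x_among_covid:.3f}")
print(f"P(COVID-19 | X) = {covid_given_x:.3f}")
```

Note that P(COVID-19 | X) is unaffected by the reweighting, since every count inside the disease X group is divided by the same factor.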
Hopefully this is somewhat helpful when reading about characteristics of those who are currently testing positive for COVID-19. As always, please let me know if there is something I've missed!
22/22