I've seen a few papers describing the characteristics of people who tested positive for COVID-19, and this is sometimes being interpreted as describing the *probability of infection* for people with certain characteristics. Let's talk about why that's likely not true.

1/22
2/22
Let's do some thought experiments. For these, my goal is to estimate the probability of being infected with COVID-19 given you have Disease X.
For example, Disease X could be:
heart disease
hypertension
it could also be any subgroup (for example age, etc.)
3/22
In these thought experiments, we don't actually have perfect information about who is infected with COVID-19; we just know, among those who are *tested*, who has been infected with COVID-19. This is really the crux of the matter.
4/22
For these thought experiments, assume that the current tests are *perfect* (that is, there are 0 false positives and 0 false negatives).
Note that this is likely not the case: with the current testing framework, false positives are unlikely but false negatives may be occurring.
5/22
We want the probability of being infected with COVID-19 given you have disease X:
P(COVID-19 | disease X)
To get this, we need P(disease X | COVID-19), because based on Bayes' Theorem we know:
P(COVID-19 | disease X) = P(disease X | COVID-19) P(COVID-19) / P(disease X)
6/22
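As a minimal sketch of the Bayes' Theorem step above: the function name `bayes_posterior` and the three input values are assumptions chosen for illustration (made up, like every number in this thread), not estimates from any real data.

```python
def bayes_posterior(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Return P(A | B) = P(B | A) * P(A) / P(B)."""
    if not 0 < p_b <= 1:
        raise ValueError("p_b must be in (0, 1]")
    return p_b_given_a * p_a / p_b


# Made-up inputs, assumed for illustration only.
p_x_given_covid = 0.20  # P(disease X | COVID-19)
p_covid = 0.50          # P(COVID-19)
p_x = 0.20              # P(disease X)

p_covid_given_x = bayes_posterior(p_x_given_covid, p_covid, p_x)
print(f"P(COVID-19 | disease X) = {p_covid_given_x:.2f}")
```

With these inputs disease X is equally common with and without COVID-19, so the result equals the overall P(COVID-19).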
BUT, instead of P(disease X | COVID-19), we actually have P(disease X | COVID-19, tested) - the probability of having disease X given you have COVID-19 AND you were tested. So the crux of these thought experiments will be trying to get an accurate estimate of P(disease X | COVID-19) so that we can get back to P(COVID-19 | disease X)
7/22
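To see how conditioning on being tested can move this number, here is a small sketch. It assumes that, among infected people, those with disease X are tested at some multiple of the rate of everyone else; the function name `observed_prevalence`, the `relative_testing_rate` parameter, and the 20% true prevalence are all hypothetical.

```python
def observed_prevalence(true_prevalence: float, relative_testing_rate: float) -> float:
    """P(disease X | COVID-19, tested) when infected people with disease X
    are tested `relative_testing_rate` times as often as those without it."""
    weighted_x = relative_testing_rate * true_prevalence
    weighted_not_x = 1.0 - true_prevalence
    return weighted_x / (weighted_x + weighted_not_x)


TRUE_PREVALENCE = 0.20  # assumed P(disease X | COVID-19), made up

# Equal testing, 2x oversampling, and 1/2 undersampling of disease X.
for rate in (1.0, 2.0, 0.5):
    seen = observed_prevalence(TRUE_PREVALENCE, rate)
    print(f"relative testing rate {rate}: observed prevalence {seen:.3f}")
```

Only when the relative testing rate is 1 does the tested-positive group show the true prevalence.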






¹ all numbers are made up
8/22
Why is experiment 1 a best case scenario?
It looks like:
50% have COVID-19 among those tested
Of those who tested positive, the prevalence of disease X is 20%
P(COVID-19 | disease X) = 50%
Reality (no relationship between disease X and COVID-19) matches what we see
9/22








10/22
Why is experiment 2 bad?
It looks like:
50% have COVID-19 among those tested
Of those who tested positive for COVID-19, the prevalence of disease X is 33%
If we plug in what we see, P(disease X | COVID-19, tested), for P(disease X | COVID-19), it looks like P(COVID-19 | disease X) is 82.5%; reality is 50%
11/22








12/22
Why is experiment 3 bad?
It looks like:
50% have COVID-19 among those tested
Of those who tested positive for COVID-19, the prevalence of disease X is 11%
If we plug in what we see, P(disease X | COVID-19, tested), for P(disease X | COVID-19), it looks like P(COVID-19 | disease X) is 27.5%; reality is 50%
13/22
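The plug-in error in the two experiments above can be reproduced with a few lines. This sketch assumes the thread's made-up values of 50% for P(COVID-19) and 20% for P(disease X); `plug_in_estimate` is a hypothetical helper name.

```python
P_COVID = 0.50  # assumed P(COVID-19), made up
P_X = 0.20      # assumed P(disease X), made up
TRUTH = 0.50    # the real P(COVID-19 | disease X) in these experiments


def plug_in_estimate(p_x_given_covid_tested: float) -> float:
    """Naive Bayes step that wrongly treats the tested-positive prevalence
    as if it were P(disease X | COVID-19)."""
    return p_x_given_covid_tested * P_COVID / P_X


for label, observed in [("oversampled", 0.33), ("undersampled", 0.11)]:
    estimate = plug_in_estimate(observed)
    print(f"{label}: estimate {estimate:.1%}, off by {estimate - TRUTH:+.1%}")
```

The direction of the error follows the direction of the sampling problem: oversampling disease X inflates the estimate, undersampling deflates it.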










14/22
Why is experiment 4 bad?
It looks like:
66% have COVID-19 among those tested
Of those who tested positive for COVID-19, the prevalence of disease X is 66%
We're getting both the prevalence of COVID-19 *and* its association with Disease X wrong
15/22
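One set of tested counts that would produce both 66% figures is sketched below. The counts (4, 1, 2, 2) are an assumption, chosen to be consistent with the correction arithmetic used later in the thread; they are not given for this experiment itself.

```python
from fractions import Fraction

# Hypothetical tested counts keyed by (group, test result).
tested = {
    ("disease X", "positive"): 4,
    ("disease X", "negative"): 1,
    ("no disease X", "positive"): 2,
    ("no disease X", "negative"): 2,
}

total_tested = sum(tested.values())
total_positive = sum(n for (_, result), n in tested.items() if result == "positive")

# Share of tested people who are positive.
positivity = Fraction(total_positive, total_tested)
# Share of positives who have disease X.
x_among_positive = Fraction(tested[("disease X", "positive")], total_positive)

print(f"COVID-19 among tested: {float(positivity):.1%}")
print(f"disease X among positives: {float(x_among_positive):.1%}")
```

Both shares come out to two thirds, which the thread reports as 66%.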
OKAY, scenarios finished, so hopefully this highlights why we can't take the prevalence of characteristics in the *tested positive* population as the prevalence of characteristics in the overall COVID-19 population. Now, here are tips for how we can correct the numbers 
16/22
Scenario 2: Oversampling by 2x
Take those with disease X that tested positive for COVID-19 and downweight them by a factor of 2.
The adjusted prevalence of Disease X among those that tested positive for COVID-19 is (0.5 / 2.5) = 0.2 (20%)
P(COVID-19 | disease X) = 50%
17/22
Scenario 3: Undersampling by 1/2
Take those with disease X that tested positive for COVID-19 and upweight them by a factor of 2.
The adjusted prevalence of Disease X among those that tested positive for COVID-19 is (2 / 10) = 0.2 (20%)
P(COVID-19 | disease X) = 50%
18/22
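Both corrections above are the same move: divide the disease X count by how heavily that group was sampled. A minimal sketch, where `adjusted_prevalence` is a hypothetical helper and the counts (1 vs 2, and 1 vs 8) are assumptions read off the arithmetic 0.5 / 2.5 and 2 / 10:

```python
def adjusted_prevalence(n_x_positive: float, n_not_x_positive: float,
                        sampling_ratio: float) -> float:
    """Reweight disease X positives by 1 / sampling_ratio, then recompute
    the prevalence of disease X among those who tested positive."""
    weighted_x = n_x_positive / sampling_ratio
    return weighted_x / (weighted_x + n_not_x_positive)


# Oversampling by 2x: 1 positive with disease X, 2 without (assumed counts).
oversampled = adjusted_prevalence(1, 2, sampling_ratio=2.0)

# Undersampling by 1/2: 1 positive with disease X, 8 without (assumed counts).
undersampled = adjusted_prevalence(1, 8, sampling_ratio=0.5)

print(f"oversampling corrected: {oversampled:.2f}")
print(f"undersampling corrected: {undersampled:.2f}")
```

A `sampling_ratio` above 1 downweights the group and a ratio below 1 upweights it, so one function covers both cases.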
Scenario 4: Two problems
For the prevalence of COVID-19, correct by weighting by the probability of being tested in each subgroup (disease X, no disease X)
P(COVID-19) = P(COVID-19 | disease X) P(disease X) + P(COVID-19 | no disease X) P(no disease X)
P(COVID-19) = ⅘ * 0.2 + ½ * 0.8 = 56%
19/22
Scenario 4: Two problems
Said another way, for calculating the overall prevalence of COVID-19, this is like downweighting the oversampled Disease X people (divide by 5).
(⅘ + 2) / (⅘ + 2 + ⅕ + 2) = 0.56
20/22
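The two routes to 0.56 can be checked against each other with exact fractions. This is only a sketch: the variable names are hypothetical, and the underlying counts (4 and 1 with disease X, 2 and 2 without) are assumptions read off the weighted expression above.

```python
from fractions import Fraction

OVERSAMPLING = 5  # disease X is assumed to be oversampled 5x among the tested

# Tested counts after dividing the disease X group by 5.
x_positive = Fraction(4, OVERSAMPLING)
x_negative = Fraction(1, OVERSAMPLING)
not_x_positive = Fraction(2)
not_x_negative = Fraction(2)

# Route 1: weighted positives over weighted total.
weighted_prevalence = (x_positive + not_x_positive) / (
    x_positive + x_negative + not_x_positive + not_x_negative
)

# Route 2: law of total probability with P(disease X) = 0.2.
p_x = Fraction(1, 5)
total_probability = Fraction(4, 5) * p_x + Fraction(1, 2) * (1 - p_x)

print(float(weighted_prevalence), float(total_probability))
```

Using `Fraction` keeps the comparison exact, so the two routes can be tested for equality rather than approximate closeness.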
Scenario 4: Two problems
For calculating the prevalence of disease X among COVID-19 patients
P(disease X | COVID-19) = P(COVID-19 | disease X) P(disease X) / P(COVID-19) = ⅘ * 0.2 / 0.56 = 0.286
Again, downweight the oversampled Disease X population (divide by 5).
⅘ / (⅘ + 2) = 0.286
P(COVID-19 | disease X) = 80%
21/22
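As a final check, here is a sketch of the round trip: correct the prevalence of disease X among COVID-19 patients, then apply Bayes' Theorem to get back to the probability of infection given disease X. The variable names are hypothetical and the inputs are the thread's made-up numbers.

```python
from fractions import Fraction

p_x = Fraction(1, 5)               # P(disease X), assumed 0.2
p_covid = Fraction(14, 25)         # corrected P(COVID-19), 0.56
p_covid_given_x = Fraction(4, 5)   # P(COVID-19 | disease X) among the tested

# Corrected prevalence of disease X among COVID-19 patients, via Bayes.
p_x_given_covid = p_covid_given_x * p_x / p_covid

# Same quantity from the weighted counts: (4/5) / (4/5 + 2).
weighted_version = Fraction(4, 5) / (Fraction(4, 5) + 2)

# Back to the target: P(COVID-19 | disease X).
recovered = p_x_given_covid * p_covid / p_x

print(f"P(disease X | COVID-19) = {float(p_x_given_covid):.3f}")
print(f"P(COVID-19 | disease X) = {float(recovered):.0%}")
```

The exact value of the corrected prevalence is 2/7, which rounds to 0.286.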
Hopefully this is somewhat helpful when reading about characteristics of those who are currently testing positive for COVID-19. As always, please let me know if there is something I've missed! 
22/22