Data acquisition is fun: the thrills, the suspense and sometimes the tears.

Then it's time to do your analysis, and you realize that you have empty spaces in your dataset.

Oh wow, what happened? This is a case of missing values.

Thread 🛢

1/7
Missing values in a dataset can be caused by:
1. Occasional system errors that prevent data from being recorded.
2. Subsets of the data that lack certain attributes or are missing entirely.

One of the simplest ways to check for missing values in a dataset is the pandas describe() function.

2/7
This function displays summary statistics for the numeric columns in the dataset. The count row reports the non-missing values per column, so if the counts differ between columns (or fall below the number of rows), that is evidence of missing values.
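
Here is a small sketch of that check. The DataFrame and its column names are made up for illustration. It also shows isna().sum(), which counts the gaps directly and covers non-numeric columns too, something describe() skips by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; the column names are invented for this example.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29],
    "income": [50000, np.nan, 61000, np.nan, 48000],
    "score": [7.5, 8.0, 6.5, 9.0, 7.0],
})

# describe() reports the number of non-missing values per numeric column.
counts = df.describe().loc["count"]
print(counts)

# A more direct check: count the missing values in each column.
missing = df.isna().sum()
print(missing)
```

Any column whose count is lower than the number of rows has gaps, and the isna().sum() result tells you how many.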

Dealing with missing values: The two main ways of dealing with them are deletion and imputation.

3/7
1. Deletion:

a. Partial deletion: This method involves limiting our analysis to the available data.

b. Listwise deletion: Excludes any data point with a missing value from all the analyses to be done.

c. Pairwise deletion: Excludes a case only from the analyses that are impossible without its missing values, and keeps it in all the others.

4/7
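
To make the difference between listwise and pairwise deletion concrete, here is a minimal pandas sketch on a made-up DataFrame (the columns x, y and z are hypothetical). It uses dropna() for listwise deletion and relies on corr(), which excludes missing values pair by pair.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in two of the three columns.
df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0],
    "y": [2.0, np.nan, 6.0, 8.0, 10.0],
    "z": [5.0, 3.0, 4.0, 1.0, 2.0],
})

# Listwise deletion: drop every row that has at least one missing value.
listwise = df.dropna()
print(len(listwise), "of", len(df), "rows survive listwise deletion")

# Pairwise deletion: each correlation uses all rows that are complete
# for that particular pair of columns. corr() excludes NaNs this way.
pairwise_corr = df.corr()
print(pairwise_corr)

# How many rows each pair of columns can actually use.
present = df.notna().astype(int)
pair_counts = present.T @ present
print(pair_counts)
```

Listwise deletion keeps the same rows for every analysis, while pairwise deletion lets each pair of columns use more of the data, at the cost of every statistic being based on a slightly different sample.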

2. Imputation: This method is employed when there isn't much data or when removing data would harm the analysis.

Many techniques exist, but each comes with its own anomalies and biases.

5/7
a. Using the mean of the other data points to fill in the missing values.

Pro:
Doesn't affect the mean across the sample.

Con:
Weakens the correlation between variables.
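
A quick sketch of both the pro and the con, on invented data (the hours and marks columns are hypothetical). It fills the gaps with fillna() and compares the mean, the spread and the correlation before and after.

```python
import numpy as np
import pandas as pd

# Hypothetical data: marks rise with hours, and two marks are missing.
df = pd.DataFrame({
    "hours": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "marks": [10.0, np.nan, 30.0, np.nan, 50.0, 60.0],
})

# mean() skips NaNs, so this is the mean of the observed marks.
mean_before = df["marks"].mean()
filled = df["marks"].fillna(mean_before)

# Pro: the sample mean stays the same after filling.
mean_after = filled.mean()

# Con: the filled values sit at the centre, so the spread shrinks
# and the relationship with other variables gets weaker.
std_before, std_after = df["marks"].std(), filled.std()
corr_before = df["hours"].corr(df["marks"])
corr_after = df["hours"].corr(filled)

print(mean_before, mean_after)
print(std_before, std_after)
print(corr_before, corr_after)
```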

b. Linear regression: This option involves fitting a predictive equation on the available information and then using that equation to predict the missing values.

6/7

Cons:
1. Overemphasized trends.
2. Too much certainty suggested by exact values.
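
Here is a minimal sketch of regression imputation, assuming a single predictor and made-up data (hours and marks are hypothetical columns). It fits a straight line with numpy.polyfit on the complete rows. This is just one simple way to do it.

```python
import numpy as np
import pandas as pd

# Hypothetical data: marks roughly follow hours, and two marks are missing.
df = pd.DataFrame({
    "hours": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "marks": [12.0, np.nan, 31.0, np.nan, 48.0, 61.0],
})

observed = df["marks"].notna()

# Fit marks = slope * hours + intercept on the complete rows only.
slope, intercept = np.polyfit(
    df.loc[observed, "hours"].to_numpy(),
    df.loc[observed, "marks"].to_numpy(),
    deg=1,
)

# Predict the missing marks from the fitted line.
imputed = df.copy()
imputed.loc[~observed, "marks"] = slope * df.loc[~observed, "hours"] + intercept

# The imputed points lie exactly on the line, so the trend looks
# stronger than the observed data alone supports.
corr_before = df["hours"].corr(df["marks"])
corr_after = imputed["hours"].corr(imputed["marks"])
print(corr_before, corr_after)
```

Both cons show up here: the correlation does not go down after imputation, and the filled values carry no noise at all, which suggests more certainty than we really have.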

If you learnt from this thread, like and retweet for others to learn too.

7/7
You can follow @adaihueze.