Dealing with Missing Values in Big Datasets

IntelData Pty Ltd | Asset Management and Business
Jan 3, 2016

Correcting missing values is one of the most time-consuming variable-cleaning steps in data preparation. The most desirable way to fix missing values is to impute them.

Missing Value Imputation

Missing value imputation means replacing missing data with a value that represents a plausible or expected value for the variable if it were actually known. The most common methods to fix missing values in large datasets are shown in Figure 2. A brief explanation of each method is given below:

Listwise Deletion

In this method the analyst removes any record with missing values, so that what is left is a dataset whose records contain no missing values. In some cases this makes perfect sense. For example, in asset condition monitoring data gathered from sensors in maintenance systems, a sensor that has sent no values indicates a data collection problem. If missing values occur only rarely, removing those records does not harm the analysis.
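As a minimal sketch of listwise deletion, assuming the data sits in a pandas DataFrame: the sensor table and its column names below are hypothetical, made up only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings; NaN marks a value the sensor never sent.
readings = pd.DataFrame({
    "sensor_id": [1, 2, 3, 4, 5],
    "temperature": [71.2, np.nan, 69.8, 70.5, 72.1],
    "vibration": [0.31, 0.28, np.nan, 0.30, 0.29],
})

# Listwise deletion: drop every record that has at least one missing value.
complete = readings.dropna(axis=0, how="any")

# Share of records lost, worth checking before committing to this method.
share_dropped = 1 - len(complete) / len(readings)
```

Checking `share_dropped` first is one way to judge whether missing values are rare enough for this method to be harmless.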

Column Deletion

In this method a variable that has many missing values is removed. This leaves the dataset with only fully populated variables. Although this approach avoids the problem of deleting too many records, it can cause other problems, such as deleting an important variable.
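Column deletion might look roughly like the sketch below. The 50% cutoff for "many missing values" is an assumption chosen for the example, not a rule from this article, and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset in which "income" is mostly missing.
df = pd.DataFrame({
    "age": [34.0, 51.0, np.nan, 45.0, 29.0],
    "income": [np.nan, np.nan, np.nan, 52000.0, np.nan],
    "tenure": [3, 7, 2, 10, 1],
})

# Assumed cutoff: drop any variable with more than half of its values missing.
max_missing_share = 0.5

# Fraction of missing values per column.
missing_share = df.isna().mean()

# Keep only the columns at or below the cutoff.
keep = missing_share[missing_share <= max_missing_share].index
reduced = df[keep]
```

Setting the cutoff to 0 would keep only fully populated variables, at the cost of dropping "age" here as well.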

Imputation with a Constant

In this method the missing values are replaced with a constant. For categorical variables, this can be achieved by simply filling the missing values with an appropriate string like “U” to indicate missing. For continuous variables, the constant is most often 0. For instance, bank account balances for account types that an individual does not have can be recorded as 0.

However, imputing with a 0 sometimes causes significant problems. For a variable like age, which in a given dataset might range from 18 to 68, replacing the missing values with 0 can distort the dataset significantly.
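The sketch below illustrates constant imputation and the age problem described above, assuming pandas. The customer table and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical customer records with gaps in all three columns.
customers = pd.DataFrame({
    "gender": ["F", None, "M", None],
    "savings_balance": [1200.0, np.nan, 350.0, np.nan],
    "age": [18.0, 68.0, np.nan, 43.0],
})

# Constant imputation: "U" for the categorical column, 0 for the balance.
filled = customers.fillna({"gender": "U", "savings_balance": 0})

# Why 0 is a poor constant for age: it drags the mean well below
# anything in the observed 18-68 range would suggest.
age_mean_observed = customers["age"].mean()               # ignores the gap
age_mean_zero_filled = customers["age"].fillna(0).mean()  # counts the gap as 0
```

Here a single zero-filled age lowers the mean from 43.0 to 32.25, which is why age is left unfilled in `filled`.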

Mean and Median Imputation for Continuous Variables

Mean Imputation

As stated before, imputing continuous variables with a 0 can at times be undesirable. The most commonly used method that deals with missing values while avoiding this problem is imputation with the mean. Besides being easy for analysts to apply, mean imputation rests on the idea that the imputed value should do the least amount of harm possible. The underlying assumption is that the missing values lie close to the mean value.

However, this method can also cause problems. For example, if many values are missing, replacing them all with the mean can distort the distribution of the variable, because the standard deviation shrinks as a result of this method of imputation.
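A small illustration of mean imputation and the shrinking standard deviation, assuming pandas. The age values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with three of eight values missing.
age = pd.Series([25.0, np.nan, 40.0, np.nan, 55.0, np.nan, 60.0, 30.0])

# Mean imputation: every gap receives the mean of the observed values.
age_imputed = age.fillna(age.mean())

# The mean is preserved, but the spread is not: the imputed values sit
# exactly on the mean, so they add no variation.
std_before = age.std()          # computed on observed values only
std_after = age_imputed.std()   # computed after imputation
```

Comparing `std_before` with `std_after` on real data gives a quick sense of how much the distribution has been compressed.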

Median Imputation

Mean imputation is by far the most common method used by predictive data analysts. However, where the mean and median differ from one another, as they do for skewed variables, imputing with the median may be better because it better represents the most typical value of the variable.
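To see why the median can be the better choice, here is a sketch with a hypothetical income column in which one very large value pulls the mean away from the typical case.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes: mostly around 30,000, plus one very large value.
income = pd.Series([30000.0, 32000.0, np.nan, 35000.0, 31000.0, 400000.0])

# The outlier pulls the mean far above what most records look like.
income_mean = income.mean()      # 105600.0
income_median = income.median()  # 32000.0

# Median imputation: the gap receives the middle observed value.
income_imputed = income.fillna(income_median)
```

Filling the gap with the mean would assign an income more than three times higher than four of the five observed values.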

Imputing with Distributions

When a large percentage of the values are missing, the summary statistics are affected by mean imputation. An alternative is, rather than imputing with a constant value, to impute with values drawn randomly from a known distribution.

For example, consider the variable AGE with a mean of 61.1 and a standard deviation of 16.6. The missing values in this variable can be generated from a normal distribution with a mean of 61.1 and a standard deviation of 16.6. These imputed values will retain almost the same distribution as the variable. Of course, this depends on how normal the original distribution of AGE is in the first place.

In other cases, where the variable appears to be uniformly distributed, a uniform distribution can be used rather than a normal distribution.
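A rough sketch of imputing from a normal distribution, assuming NumPy and pandas. The AGE column here is simulated to resemble the example (mean 61.1, standard deviation 16.6); the 1,000 records, the 30% missing rate, and the fixed seed are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable

# Simulated AGE column resembling the example, with about 30% set to missing.
age = pd.Series(rng.normal(loc=61.1, scale=16.6, size=1000))
age.loc[rng.random(1000) < 0.3] = np.nan

# Estimate the distribution's parameters from the observed values.
missing = age.isna()
mu, sigma = age.mean(), age.std()

# Draw one value per missing record from Normal(mu, sigma).
draws = rng.normal(loc=mu, scale=sigma, size=int(missing.sum()))
age_imputed = age.copy()
age_imputed.loc[missing] = draws

# For a variable that looks uniform, rng.uniform(age.min(), age.max(), ...)
# could be used for the draws instead.
std_mean_filled = age.fillna(mu).std()   # shrunken spread, for comparison
std_random_filled = age_imputed.std()    # close to the observed spread
```

Because the draws are unbounded, a real AGE column might also need implausible values (for example negative ages) clipped or redrawn.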

Random Imputation from Own Distributions

In cases where the distribution of the variable is not known, random imputation from the variable's own distribution can be used. In this method, instead of using a random number generator to produce a number, an actual value is selected at random from the non-missing values of the variable.

Below is how this can be achieved:

Copy the variable to impute (variable “X”) to a new column.
Remove the missing values from that column.
Randomly scramble the values in that column.
Replicate the values of the column so that there are as many values as there are records in the original data.
Join this column to the original data.
Impute each missing value of variable X with the value in the scrambled column.
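
The steps above can be sketched as follows, assuming NumPy and pandas. The values of X, the helper column names, and the fixed seed are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable

# Hypothetical variable X with three missing values.
df = pd.DataFrame({"X": [4.0, np.nan, 7.0, 7.0, np.nan, 1.0, np.nan, 9.0]})

# Copy the variable and remove its missing values.
observed = df["X"].dropna().to_numpy()

# Randomly scramble the observed values.
scrambled = rng.permutation(observed)

# Replicate the scrambled values until there is one per original record
# (np.resize repeats the array cyclically to reach the requested length).
donor = np.resize(scrambled, len(df))

# Join the donor column to the original data.
df["X_donor"] = donor

# Impute: keep X where it is present, otherwise take the donor value.
df["X_imputed"] = df["X"].fillna(df["X_donor"])
```

Every imputed value is one that actually occurs in X, so no assumption about the shape of the distribution is needed.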

Imputing Missing Values from a Model

This method begins by changing the role of the input variable with missing values so that it becomes a target variable. The inputs to the new model are the other input variables that may predict this new target variable. The training data should be large enough and preferably free of missing data.
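
As one possible illustration, the sketch below treats a hypothetical "age" column as the target and predicts it from a hypothetical "years_employed" column. A least-squares line fitted with NumPy is an assumption chosen for brevity; the article does not prescribe a particular model.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "age" has gaps, "years_employed" is fully populated.
df = pd.DataFrame({
    "years_employed": [1.0, 5.0, 10.0, 15.0, 20.0, 25.0],
    "age": [22.0, 27.0, np.nan, 38.0, np.nan, 47.0],
})

# Records where the new target variable is known form the training data.
known = df["age"].notna()

# Fit age = intercept + slope * years_employed by ordinary least squares.
x_train = df.loc[known, "years_employed"].to_numpy()
X_train = np.column_stack([np.ones(len(x_train)), x_train])
y_train = df.loc[known, "age"].to_numpy()
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Predict the target for the records where it is missing, then impute.
x_missing = df.loc[~known, "years_employed"].to_numpy()
X_missing = np.column_stack([np.ones(len(x_missing)), x_missing])
df.loc[~known, "age"] = X_missing @ coef
```

With more input variables, the same pattern applies: add one column per predictor to the design matrix, or swap in any other predictive model.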

This method can produce good imputations. However, there are some drawbacks to this method: