Missing data is a problem that plagues all manner of science, and there are a number of ways in which it can be dealt with. In this post I introduce the general ideas behind missing data and demonstrate a few methods for attacking data missingness.

Missing Data

The best way to attack the problem of missing data is simply not to have any. That is always the better solution, but it is not always a feasible one. Assuming it is necessary to work with a dataset that contains missing values, there are a number of operations that can be applied to work around them. However, it is important to first look at why the data is missing, as this can have a strong effect on any model. Here are the commonly used missingness mechanisms; a small simulation contrasting them follows the list.

  1. MCAR (Missing Completely at Random)
    This is the simplest missingness mechanism: the probability that a data point is missing is the same for all units. This matters because it means deleting entire cases with missing data does not introduce bias into inferences drawn from the reduced dataset.
  2. MAR (Missing at Random)
    Under this mechanism the missingness is not actually random, ironic name I agree, but it can be fully accounted for by variables for which there is complete information. In other words, the missingness is related only to observed data, for example a particular subgroup of survey respondents skipping a question at a higher rate. These cases can also be deleted, provided any inference controls for all variables that affect missingness.
  3. MNAR (Missing not at Random)
    MNAR covers a couple of possibilities, listed below. When data are MNAR, the usual strategy is to include predictors that bring the situation closer to MAR in order to reduce bias in any inferences made.
    1. The missingness is dependent upon non-recorded data.
    2. The missingness is dependent upon the value itself.
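
To make these mechanisms concrete, here is a small simulation sketch in Python that masks the same variable under each of the three mechanisms. The variable names, sizes, and probabilities are illustrative choices only, not part of any real study.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000

# Two correlated variables: age (fully observed) and income (to be masked).
age = rng.normal(40, 10, n)
income = 20_000 + 800 * age + rng.normal(0, 5_000, n)
df = pd.DataFrame({"age": age, "income": income})

# MCAR: every income value has the same 20% chance of being missing.
mcar = df.copy()
mcar.loc[rng.random(n) < 0.2, "income"] = np.nan

# MAR: missingness depends only on the fully observed variable (age).
mar = df.copy()
mar.loc[(df["age"] > 50) & (rng.random(n) < 0.5), "income"] = np.nan

# MNAR: missingness depends on the unobserved value itself (high incomes hide).
mnar = df.copy()
mnar.loc[(df["income"] > 55_000) & (rng.random(n) < 0.5), "income"] = np.nan
```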

Deleting Data

Deleting data is the simplest method for dealing with missing data. Imputation will typically, though not necessarily, lead to better results than deletion, but sometimes the data is so riddled with missing values that imputation is not a sensible step. This is where deleting data can be the proper choice.

Listwise Deletion

This deletion method drops every observation that has any missing value. Since it assumes MCAR, which is almost impossible to verify, inferences made from a dataset cleaned with listwise deletion are likely to contain bias.
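
With pandas, listwise deletion is a one-liner. The `mar` DataFrame here is the illustrative one from the simulation sketch above; any DataFrame with NaN values works the same way.

```python
# Listwise (complete-case) deletion: drop every row that contains any NaN.
complete_cases = mar.dropna()
print(len(mar), "rows before,", len(complete_cases), "rows after")
```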

Pairwise Deletion

This deletion method also assumes MCAR, but works to retain more data. Rather than deleting an entire observation, pairwise deletion uses all cases in which the variables currently being analyzed are present. The result is that different parts of the model are estimated from different numbers of observations, which can be tricky to combine in a robust manner.
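
Pandas applies pairwise deletion implicitly when computing covariances and correlations, so a sketch of the idea (again using the illustrative `mar` DataFrame) is just:

```python
# Each entry of the covariance/correlation matrix is computed from every row
# in which that particular pair of columns is observed, so different entries
# may be based on different numbers of observations.
pairwise_cov = mar.cov()
pairwise_corr = mar.corr()
print(mar.notna().sum())  # how many observed values each column contributes
```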

Variable Deletion

If enough values of a variable are missing, it may be worth dropping that variable completely from all observations.
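
A simple threshold-based sketch, where the 40% cutoff is an arbitrary illustration:

```python
# Drop every column in which more than 40% of the values are missing.
keep = mar.columns[mar.isna().mean() <= 0.4]
reduced = mar[keep]
```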

Imputing Data

Before imputing, it is important to look at what kind of data is present. Time series data, for example, requires additional work to remove seasonality and trend before continuing. It is also worth noting that preserving the covariance structure helps preserve the correlations between variables, bringing us closer to MAR and allowing better inferences from the data.


Mean Imputation

The simplest method of imputing data is to replace each missing data point with the mean of all observed values of that variable. This is not ideal, since it distorts the covariance structure and introduces bias.
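
In pandas this amounts to filling each column with its own mean; a minimal sketch:

```python
# Replace every missing entry with the mean of its column.  Quick, but it
# shrinks the variance of the imputed columns and weakens correlations.
mean_imputed = mar.fillna(mar.mean(numeric_only=True))
```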

Regression Imputation

Another simple method is regression imputation: linear regression for quantitative data or logistic regression for categorical data. This is not ideal either, since it imposes assumed relationships between variables, artificially deflates standard errors, and does not maintain the existing covariance structure of the data.
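
A sketch of deterministic regression imputation with scikit-learn, assuming the two-column illustrative DataFrame from earlier (`age` fully observed, `income` partly missing):

```python
from sklearn.linear_model import LinearRegression

# Fit the regression on the complete cases only.
obs = mar.dropna()
model = LinearRegression().fit(obs[["age"]], obs["income"])

# Fill each gap with the value predicted from the observed predictor.
missing = mar["income"].isna()
reg_imputed = mar.copy()
reg_imputed.loc[missing, "income"] = model.predict(mar.loc[missing, ["age"]])
```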

Machine Learning Imputation

There is also the possibility of using other machine learning methods for imputing data. Methods such as kNN (k-Nearest Neighbors) and various neural networks (convolutional and recurrent) have been used with varying degrees of success. I will not go into these methods in depth; in my opinion they are not as effective as the simpler statistical methods.
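
For completeness, a kNN-based sketch using scikit-learn's KNNImputer, where the choice of five neighbours is arbitrary:

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Each missing entry is filled with the mean of that feature over the
# k rows that are most similar on the observed features.
knn = KNNImputer(n_neighbors=5)
knn_imputed = pd.DataFrame(knn.fit_transform(mar), columns=mar.columns)
```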


The methods above all belong to a class of imputation methods called direct imputation. Below I recommend one direct imputation method and one model-based imputation method; both are recommended because they preserve the structure of the data by preserving its covariance structure.

Before introducing the methods I recommend for imputing data, I want to introduce the idea of multiple imputation. This is a Bayesian approach in which several plausible completed datasets are produced, the missing values being drawn from distributions formed by the observed data. What to do from there is up to the practitioner: one could pool the imputed values into a single new dataset, or keep all imputed datasets to see how the imputation affects any models built from the data, and so on.
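
One way to sketch multiple imputation in Python is scikit-learn's IterativeImputer with posterior sampling, drawing several completed datasets under different random seeds. Five imputations is an arbitrary choice here, and this is only one of many possible multiple-imputation setups.

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Draw several plausible completed datasets by sampling from the predictive
# distribution instead of always taking its mean.
imputed_datasets = []
for seed in range(5):
    imp = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = pd.DataFrame(imp.fit_transform(mar), columns=mar.columns)
    imputed_datasets.append(completed)

# Downstream, fit the analysis model on each completed dataset and pool the
# results (e.g. with Rubin's rules) rather than averaging the imputed values.
```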

Iterative PCA

The idea behind iterative PCA is fairly simple, but also powerful. It relies heavily on Principal Component Analysis. PCA does exactly what its name says: it analyzes the principal components of the data. It finds the most significant factors, the principal components, by identifying the eigenvectors of the covariance matrix with the largest eigenvalues, together with the directions orthogonal to them. This allows for dimension reduction, since the data can be projected onto the subspace spanned by the leading components and represented in fewer dimensions. With high-dimensional data, PCA essentially tells us along which directions the data varies the most.

What the Iterative PCA method does is the following.

  1. It initializes each missing value with the mean of that variable.
  2. It performs PCA on the completed data and projects the imputed entries onto the leading principal components.
    In essence this moves the imputed entries onto the low-rank fit defined by the leading principal components, since only the imputed entries are free to move. The leading components therefore determine how the imputed values are filled in.
  3. Repeat step 2 until the imputed values converge; a minimal sketch of this loop is shown below.
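
A minimal NumPy sketch of this loop, using a plain rank-\(r\) SVD reconstruction without the regularisation that a production implementation would add, might look like the following:

```python
import numpy as np

def iterative_pca_impute(X, rank=1, n_iter=100, tol=1e-6):
    """Bare-bones sketch of iterative PCA imputation (no regularisation)."""
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)

    # Step 1: initialise missing entries with the column means.
    col_means = np.nanmean(X, axis=0)
    X_hat = np.where(mask, col_means, X)

    for _ in range(n_iter):
        # Step 2: fit a low-rank PCA model to the current completed matrix...
        mu = X_hat.mean(axis=0)
        U, s, Vt = np.linalg.svd(X_hat - mu, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank] + mu

        # ...and move only the imputed entries onto that low-rank fit.
        X_new = np.where(mask, low_rank, X)

        # Step 3: stop once the imputed values no longer change.
        if np.max(np.abs(X_new - X_hat)) < tol:
            break
        X_hat = X_new

    return X_hat
```

The number of components is itself a tuning parameter and should be chosen carefully, for example by cross-validation.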

There is an excellent R library for imputing data using iPCA called missMDA.

Expectation Maximization

The Expectation Maximization method is a likelihood based method. This method is useful because it keeps the structure of the covariance matrix intact and is able to maintain relations between the variables.

The main idea behind the EM approach is that the complete likelihood, rather than the observed likelihood, is used to calculate the maximum likelihood estimates of the parameters, since maximizing the observed likelihood directly can be extremely complicated or numerically infeasible.

This is done by augmenting the observed data with filled-in values for the missing data points, creating a complete likelihood that is easier to work with numerically. At each step the missing data is replaced by its conditional expectation given the observed data and the current parameter estimates. This is the Expectation step.

New parameter estimates are then obtained by maximizing this expected complete likelihood as though it had come from a complete sample. This is the Maximization step.

These steps are iterated until convergence, yielding a good approximation of the maximum likelihood estimate. Thanks to this iterative structure, the original difficult likelihood maximization is replaced with a series of much simpler maximization problems.

This can be written as the following steps for imputing the missing data:

  1. Initialize the starting parameters \(\theta\)
  2. Perform a soft imputation of the missing values (the E step)
  3. Re-estimate the parameters (the M step)
  4. Iterate until convergence
  5. Re-run the algorithm with different starting parameters to check that it has not merely converged to a poor local maximum

One important note is that there are two ways to use the EM algorithm to impute the missing data: one performs hard imputation using the expectation under the parameters from the final iteration, and the other relies on the assumption that the data are Missing at Random, i.e. that the missingness is independent of the missing values given the observed values. Let us use the MAR assumption.

\[f(M|X,\theta)=f(M|X_{obs},\theta)\]

\[M\perp\!\!\!\perp X_{mis}|X_{obs},\theta\]

Since the data points are modeled as belonging to components of a Gaussian mixture model, the probability density function has the form:

\[p(x)=\sum_{k=1}^K\pi_k\mathcal{N}(x|\mu_k,\sigma_k)\]

Here \(x\) represents the data, \(\pi_k\) is the mixture weight, fixed to one for the component a data point belongs to and zero for all others, and \(\mu_k\) and \(\sigma_k\) are the mean and covariance of each mixture component. The assumption here is that every component follows a normal distribution.

Given the MAR assumption, the log likelihood of the observed data can be written as:

\[\ell(\theta) = \log p(x|\theta)\]

\[\ell(\theta)=\sum_{i=1}^N\log\sum_{k=1}^K\pi_k\mathcal{N}(x_i|\mu_k,\sigma_k)\]

where \(N\) is the number of data points in the dataset.

The main idea of the E step is to construct the function that the M step maximizes. In essence the M step solves the following maximization problem, giving a new \(\theta\) that becomes \(\theta_{old}\) for the next E step.

\[\theta_{new}=\underset{\theta}{\operatorname{argmax}}\; q(\theta|\theta_{old})\]

The justification for this procedure is that following an E step and a subsequent M step the likelihood never decreases, as was proved by Dempster et al. (1977).

The E step thus provides the function that the M step maximizes, and the output of the M step is fed back into the E step as \(\theta_{old}\). In the following, \(\mathbf{Z}\) represents the latent variables and \(\mathbf{X}\) represents the data under the current estimate. This allows the missing values as well as the observed values to be taken into account while still leaving room for further improvement.

\[q(\theta|\theta_{old}) = \mathbb{E}_{\mathbf{Z}|\mathbf{X},\theta_{old}}\left[\log p(\mathbf{X},\mathbf{Z}|\theta)\right]\]

The missing values are imputed using the following:

\[\mu_k= \begin{bmatrix} \mu_{k,obs} \\ \mu_{k,mis} \end{bmatrix} \quad\text{and}\quad \sigma_k = \begin{bmatrix} \sigma_k^{obs,obs} & \sigma_k^{obs,mis} \\ \sigma_k^{mis,obs} & \sigma_k^{mis,mis} \end{bmatrix}\]

\[x_{i,mis} = \mu_{k,mis} + \sigma_k^{mis,obs}\left(\sigma_k^{obs,obs}\right)^{-1}(x_{i,obs}-\mu_{k,obs})\]
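
As a concrete illustration of these formulas, here is a deliberately simplified NumPy sketch that alternates parameter re-estimation with conditional-mean imputation for a single Gaussian component (\(K=1\)). A full EM implementation would also add the conditional covariance of the missing block when re-estimating \(\sigma\).

```python
import numpy as np

def em_gaussian_impute(X, n_iter=50, tol=1e-6):
    """Sketch of EM-style imputation under a single Gaussian (K = 1)."""
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)
    X_hat = np.where(mask, np.nanmean(X, axis=0), X)   # mean-initialise

    for _ in range(n_iter):
        mu = X_hat.mean(axis=0)                        # M step: new parameters
        sigma = np.cov(X_hat, rowvar=False)

        X_new = X_hat.copy()
        for i in np.where(mask.any(axis=1))[0]:        # E step: conditional means
            mis, obs = mask[i], ~mask[i]
            # x_mis = mu_mis + sigma_{mis,obs} sigma_{obs,obs}^{-1} (x_obs - mu_obs)
            X_new[i, mis] = mu[mis] + sigma[np.ix_(mis, obs)] @ np.linalg.solve(
                sigma[np.ix_(obs, obs)], X_hat[i, obs] - mu[obs]
            )

        if np.max(np.abs(X_new - X_hat)) < tol:
            break
        X_hat = X_new

    return X_hat
```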

Conclusion

Working with missing data is a messy subject, and the right approach always depends on the structure of the data and on how it is missing. If inferences are to be made from a dataset completed via imputation, it is important to use an approach that preserves the covariance structure and thereby the relationships between the variables.

Resources

  • A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38, 1977.
  • J.-B. Durand. Lecture notes in fundamentals of probabilistic data mining, January 2018
  • A. Gelman and J. Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge, New York : Cambridge University Press, 2006.