Objectives
Upon completion of this lesson, you will be able to:
- list the different methods for dealing with missing values
- know when to drop data and when to impute
- apply imputation methods
- consider regulatory and privacy concerns
Introduction
A missing value is the absence of data in a place where a value is expected. In other words, missing values are gaps or blanks in your data.
For example, if you are dealing with a dataset of customer information, there might be some missing values in columns such as “email address” or “phone number” where some customers didn’t provide this information.
Missing values can cause a variety of issues. Many machine learning algorithms require complete datasets to function properly, meaning that if there are missing values, the algorithms might not work or might deliver inaccurate results. Missing data can also bias or distort representations and lead to misleading trends and conclusions.
Motivation
The chalk-talk below by Dr. Martin Schedlbauer of Khoury Boston introduces the topic of missing values and the impact on training machine learning models and doing data analytics.
Missing Data
Missing data refers to the absence of information in a dataset. It can occur for various reasons, such as data entry errors, data corruption, or survey non-responses. There are three main types of missing data:
Missing Completely at Random (MCAR): The missing data is unrelated to any observed or unobserved variables.
Missing at Random (MAR): The missing values are related to the observed data, but not the unobserved data.
Missing Not at Random (MNAR): The missing data is related to the unobserved data.
Handling missing data is crucial for accurate and unbiased analysis.
Let’s dig a little more into each one of these.
Missing Completely at Random (MCAR)
MCAR means that the probability of a value being missing is completely unrelated to any observed or unobserved data, i.e., neither the other features nor the missing value itself can be used to predict whether a value is missing. In other words, the missingness is random and has no pattern: there is no relationship between the missingness and the observed data, and the missingness is also unrelated to the unobserved (missing) data itself.
Since the missingness is random and unrelated to any other data, the incomplete records can be ignored without introducing bias, although doing so reduces the sample size and therefore the statistical power. So, the methods covered in this lesson, including Listwise Deletion, Mean/Median/Mode Imputation, or Advanced Imputation Techniques such as K-Nearest Neighbors (K-NN) imputation, can be applied.
Missing at Random (MAR)
MAR means that the probability of a value being missing is related to the observed data but not the missing data itself. The missingness can be explained by other observed variables in the dataset. In other words, there is a relationship between the missingness and some of the observed data, but the missingness is unrelated to the missing data itself once the observed data is taken into account.
Since the missingness is related to observed data, one can use the observed data to inform the imputation process. Methods might include Conditional Mean Imputation, Regression Imputation, and Multiple Imputation.
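As a minimal sketch of regression imputation under MAR, the example below fits a straight line to the complete records and uses it to predict the missing values. The variable names and the data are hypothetical, and a single fully observed predictor is assumed; real data would usually call for more predictors and a check of the model fit.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return a, b

def regression_impute(xs, ys):
    """Fill None entries of ys with predictions from a line fit on complete pairs."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
    return [y if y is not None else a + b * x for x, y in zip(xs, ys)]

# Hypothetical data: years of experience (fully observed), salary in $1000s
experience = [1, 2, 3, 4, 5, 6]
salary = [40, 45, None, 55, None, 65]

imputed = regression_impute(experience, salary)
print(imputed)  # [40, 45, 50.0, 55, 60.0, 65]
```

Because every imputed value lies exactly on the fitted line, this kind of imputation tends to overstate how strongly the variables are related; adding random noise to each prediction is one common remedy.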
Missing Not at Random (MNAR)
MNAR means that the probability of a value being missing is related to the unobserved data itself. The missingness depends on the value that is missing. The missingness is not entirely explained by the observed data, but rather the missingness is related to the missing values themselves.
MNAR is the most challenging type of missingness to handle because the missingness is directly related to the missing values. Potential methods include:
Model-Based Methods: Using more complex models that explicitly account for the missing data mechanism, which may be fit with techniques such as Expectation-Maximization (EM) algorithms.
Sensitivity Analysis: Performing analysis to understand how different assumptions about the missing data might affect the results.
Additional Assumptions or External Data: Adding assumptions or additional external data that can help in modeling the missing data mechanism.
Understanding the nature of the missing data mechanism is crucial for choosing the appropriate method to handle missing values in machine learning and feature engineering.
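To make the three mechanisms concrete, the following sketch simulates each one on a small synthetic population and compares the mean of the values that remain observed with the true mean. The population, the 30% missing rate, and the thresholds are all assumptions chosen for illustration.

```python
import random
from statistics import mean

rng = random.Random(0)

# Hypothetical population: (age, income), where income rises with age
people = [(age, 1000 * age + rng.gauss(0, 5000)) for age in range(20, 70)]
true_mean = mean(income for _, income in people)

def observed_mean(is_missing):
    """Mean income over the records whose income is not flagged as missing."""
    return mean(inc for (age, inc) in people if not is_missing(age, inc))

# MCAR: every income has the same 30% chance of being missing
mcar = observed_mean(lambda age, inc: rng.random() < 0.3)
# MAR: missingness depends only on an observed variable (age)
mar = observed_mean(lambda age, inc: age >= 55)
# MNAR: missingness depends on the unobserved income itself
mnar = observed_mean(lambda age, inc: inc > 50000)

print(f"true={true_mean:.0f} mcar={mcar:.0f} mar={mar:.0f} mnar={mnar:.0f}")
```

With this setup, the MAR and MNAR means should fall below the true mean because the higher earners are the ones dropped, while the MCAR mean should differ from it only by sampling noise. The difference between MAR and MNAR is that under MAR the gap can in principle be corrected using age, which is observed, whereas under MNAR the observed data alone does not reveal it.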
Dealing with Missing Values
There are several strategies to handle missing values in datasets:
Removal: If the missing data is limited to a small number of observations, you might simply delete those observations from your dataset. This method is direct, but can potentially remove valuable data or introduce bias, especially if data is not missing randomly.
Imputation: You can replace missing values with substituted values. There are many methods of imputation, including mean/mode/median imputation, random imputation, and prediction model imputation.
Mean/Median/Mode Imputation: This involves replacing the missing data for a certain variable with the mean or median (for numerical data) or the mode (for categorical data) of all known values of that variable. This is a common and easy method but can lead to an underestimate of the true variance.
Random Imputation: Randomly selected available values are used to fill in missing data.
Prediction Model Imputation: Machine learning algorithms, like K-Nearest Neighbors (K-NN) and regression models, can be used to predict and impute missing values.
Default Value: If you know why the data is missing, you might assign a value that indicates that a value was missing, such as -1 or 9999 for numerical variables, or “Unknown” for categorical variables. This might be helpful if the fact that data is missing is itself informative.
Multiple Imputation: This method involves imputing the missing values multiple times to create several “complete” datasets, analyzing each dataset separately, and pooling the results. It’s a more sophisticated method than single imputation methods like mean/median/mode imputation, and it gives a better estimate of the uncertainty introduced by imputation.
Advanced Techniques: More advanced techniques, like data augmentation methods or deep learning techniques (like using autoencoders), can be used for dealing with missing data.
Which method to use largely depends on the nature of the data and the problem at hand. Also, the analysis should account for the uncertainty in the imputations. After handling missing data, it’s important to recheck the quality of the data before proceeding with the next steps in your machine learning or data analytics workflow.
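The simple imputation strategies above can be sketched in a few lines. This is an illustrative example using only Python's standard library, with None standing in for a missing value; the helper names and the data are hypothetical.

```python
import random
from statistics import mean, median, mode, pvariance

def impute_constant(values, fill):
    """Replace every None with the same fill value."""
    return [fill if v is None else v for v in values]

def impute_random(values, rng):
    """Replace every None with a randomly chosen observed value."""
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]

ages = [23, 35, None, 41, 29, None, 52, 35]
observed = [v for v in ages if v is not None]

mean_filled = impute_constant(ages, mean(observed))
median_filled = impute_constant(ages, median(observed))
random_filled = impute_random(ages, random.Random(42))

colors = ["red", None, "blue", "red", None]
mode_filled = impute_constant(colors, mode(c for c in colors if c is not None))

# Mean imputation adds points exactly at the mean, so the variance shrinks
print(pvariance(observed), pvariance(mean_filled))
```

The final line illustrates the variance underestimate mentioned above: the mean-filled list has a smaller variance than the observed values alone.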
Removal/Deletion
“Removal” or “deletion” is one of the simplest methods for dealing with missing data in a dataset. It involves removing the instances (rows) or features (columns) that have missing values. There are two main types of deletion: listwise (or complete) deletion and pairwise deletion.
Listwise (Complete) Deletion: This is the most common form of deletion used. It involves removing entire observations (rows) where at least one value is missing. For example, in a dataset of customer information, if a customer didn’t provide their phone number, their entire record would be deleted.
Pairwise Deletion: In this case, an observation is only excluded from an analysis if the specific variable being analyzed is missing. Pairwise deletion makes use of all the available data. For example, if you’re analyzing income data and a customer didn’t provide their income, only that specific customer’s income data would be excluded, but the rest of the data for that customer would be used for other variables.
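The difference between the two types of deletion can be sketched as follows. The customer records are hypothetical, and None marks a missing value.

```python
from statistics import mean

# Hypothetical customer records; None marks a missing value
customers = [
    {"age": 34, "income": 52000, "phone": "555-0101"},
    {"age": 45, "income": None, "phone": "555-0102"},
    {"age": 29, "income": 48000, "phone": None},
    {"age": 51, "income": 61000, "phone": "555-0104"},
]

def listwise_delete(rows):
    """Keep only the rows in which every field is present."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise_values(rows, column):
    """Keep every available value of one column, ignoring gaps elsewhere."""
    return [r[column] for r in rows if r[column] is not None]

complete = listwise_delete(customers)
print(len(complete))                            # 2
print(mean(pairwise_values(customers, "age")))  # 39.75
print(len(pairwise_values(customers, "income")))  # 3
```

Listwise deletion leaves two of the four records, while pairwise deletion uses all four ages and three incomes. One consequence is that statistics computed pairwise can be based on different subsets of the records.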
The deletion method is straightforward to implement, but it also has its drawbacks:
- It can lead to a loss of information, particularly if the dataset is small or if the missing data is not random.
- It can introduce bias into the resulting dataset, which could lead to erroneous conclusions or predictions.
- The statistical power decreases with the decreasing size of the dataset.
Therefore, deletion should be used carefully, considering these limitations. It’s often suitable when the amount of missing data is small and appears to be missing completely at random. For example, if a small percentage of users in a survey forgot to answer a particular question, it might be reasonable to just delete those responses.
However, if the data is not missing completely at random and there are patterns in the missing data, deletion can lead to significant bias and should generally be avoided. In these cases, imputation methods might be more suitable.
Imputation
Imputation is a method used to fill in missing data with substituted values. The aim of imputation is to produce a complete dataset that can be used for machine learning models or statistical analysis.
Here are some common imputation techniques:
Mean/Median/Mode Imputation: This involves replacing missing values with the mean (average), median (middle value), or mode (most frequent value) from the non-missing values of that variable. It’s a simple method often used for initial handling of missing data, but it can reduce the variability in the data and potentially introduce bias.
Random Imputation: This involves replacing the missing value with a randomly selected observed value. It’s a relatively easy method to implement and preserves the spread of the variable better than mean/median/mode imputation, but it’s still a fairly naive method because it ignores relationships with other variables.
Regression Imputation: In this method, a regression model is used to predict missing values based on other data. The observed data is used to estimate the model and then predict the missing values. This method can be more accurate than the previous two, but it makes strong assumptions about the relationships in your data.
K-Nearest Neighbors (K-NN) Imputation: The K-NN method identifies the ‘K’ most similar observations to the one with missing data and imputes the missing value using these similar observations. The similarity is usually calculated using a distance metric such as Euclidean distance.
Multiple Imputation: This is a more sophisticated technique that imputes the missing data multiple times to create several different complete datasets. Each of these datasets is then analyzed, and the results are pooled to give a final result. This method gives a better estimation of the uncertainty around the imputed values.
Advanced Imputation Methods: More advanced methods, such as those using machine learning models or deep learning, can also be used. For example, an autoencoder (a type of neural network) can be trained to learn the most salient features in the data, and then used to impute missing values.
The choice of imputation method depends on the nature of the data and the specific problem. After imputation, it’s also important to validate the quality of the imputed data before using it for further analysis or modeling.
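As an illustration of K-NN imputation, the sketch below fills a missing value with the average of the target column over the K most similar complete rows, using Euclidean distance. It assumes that all other columns are complete and on comparable scales; the function name and the data are hypothetical.

```python
import math

def knn_impute(rows, target, k=2):
    """Fill None in column `target` with the mean of the k nearest complete rows.

    Distance is Euclidean over the other columns, which are assumed to be
    complete and on comparable scales (real data usually needs scaling first).
    """
    features = [i for i in range(len(rows[0])) if i != target]
    donors = [r for r in rows if r[target] is not None]

    def distance(a, b):
        return math.dist([a[i] for i in features], [b[i] for i in features])

    filled = []
    for row in rows:
        new_row = list(row)
        if row[target] is None:
            nearest = sorted(donors, key=lambda d: distance(row, d))[:k]
            new_row[target] = sum(d[target] for d in nearest) / len(nearest)
        filled.append(new_row)
    return filled

# Hypothetical rows: (height_cm, weight_kg, shoe_size)
rows = [
    (160, 55, 37),
    (162, 58, 38),
    (180, 80, 44),
    (182, 84, 45),
    (161, 56, None),
]
result = knn_impute(rows, target=2, k=2)
print(result[4])  # [161, 56, 37.5]
```

The last row is closest to the first two rows, so its shoe size is imputed as the average of 37 and 38.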
Default Value
The default value method (sometimes called “assignment”) deals with missing values by assigning a unique value to the missing data. This method is often used when the fact that the data is missing is itself potentially informative.
For numerical variables, you might assign a value that is outside the normal range of the variable, such as -1 or 9999 (or 0, but only if zero cannot occur as a legitimate value). For categorical variables, you might assign a new category like “Unknown” or “Missing”.
For example, consider a dataset of customer information where some customers didn’t provide their income. You could assign a value like -1 to represent missing income data. This approach might be useful if the absence of income information could be indicative of a specific customer behavior or characteristic.
Similarly, in a dataset of patient medical histories, if some patients didn’t answer a question about a specific disease, you might create a new category named “Unknown” for that question. This could be informative, as perhaps these patients were uncomfortable discussing that particular disease, which could be useful information for the analysis.
Assignment can be a useful method when you suspect that the missing data might not be missing at random, and could contain useful information. But it also has potential downsides:
- It can introduce bias, particularly if the assigned value is not chosen carefully. For instance, using 0 for missing income data might lead the model to inaccurately associate these customers with a low income group.
- It can lead to an artificial relationship between the variable with assigned missing values and other variables, as the assigned value might not reflect the true underlying data distribution.
- It might confuse the machine learning model if the value assigned for missing data is not distinct enough from the actual values.
Therefore, it’s important to consider these factors when deciding to use the assignment method to handle missing data. Remember that a missing value may itself carry meaning: it can point to information that the analyst has not yet uncovered.
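One way to reduce the risk that a model mistakes the default value for real data is to store an explicit missing-value flag alongside it. The sketch below is one possible approach, not a prescribed one; the helper name, the "_missing" column suffix, and the data are assumptions made for illustration.

```python
def with_missing_flag(rows, column, sentinel):
    """Replace None in `column` with `sentinel` and record a 0/1 missing flag."""
    out = []
    for row in rows:
        new_row = dict(row)
        was_missing = row[column] is None
        new_row[column + "_missing"] = int(was_missing)
        if was_missing:
            new_row[column] = sentinel
        out.append(new_row)
    return out

# Hypothetical customers; note that an income of 0 is a legitimate value here
customers = [
    {"income": 52000, "segment": "retail"},
    {"income": None, "segment": None},
    {"income": 0, "segment": "student"},
]

step1 = with_missing_flag(customers, "income", sentinel=-1)
step2 = with_missing_flag(step1, "segment", sentinel="Unknown")
print(step2[1])
```

Here -1 is used for income because 0 occurs as a real value in the data, and the flag columns let a model distinguish "missing" from "low" without relying on the sentinel alone.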
Advanced Techniques
Advanced techniques for handling missing values often involve statistical or machine learning methods that can model the relationships in the data in a more sophisticated way than simple imputation or deletion.
Here are a few advanced techniques:
Multiple Imputation: This technique involves creating several complete datasets from the one with missing data. Each of these is created by randomly imputing missing values based on a specified model, usually a regression or Bayesian model. The variability across these multiple imputations reflects the uncertainty about the right values to impute. Each dataset is analyzed separately, and then the results are combined, accounting for both within- and between-imputation variability.
Machine Learning Algorithms: Machine learning models like Decision Trees, Random Forests, or Support Vector Machines can be used to predict missing values based on other data. For instance, if we have a dataset with missing age values, we can use other variables (like income, job, marital status, etc.) to train a machine learning model, and then use this model to predict and impute the missing age values.
Deep Learning Techniques: Autoencoders, a type of neural network, can be used to handle missing values. An autoencoder is trained to learn a compressed representation of the data, then recreate the original data from this compressed form. When it encounters missing values, it fills them in with what it would expect based on its training. This approach can be particularly useful with high-dimensional data where simple imputation methods might not work well.
Probabilistic Models: These are statistical techniques that model the data distribution and use this to impute missing values. Examples include Expectation-Maximization (EM) algorithms and Bayesian models. These methods can account for the uncertainty of missing values, but they also require strong assumptions about the data.
Matrix Factorization: This technique is commonly used in recommendation systems. The idea is to factorize the matrix (dataset) into two lower dimensional matrices, and then use these to fill in the missing values.
These advanced techniques can provide more accurate imputations than simpler methods, particularly with large datasets and when the data is not missing completely at random. However, they also have their drawbacks, including increased computational complexity, difficulty in implementation, and the potential for overfitting.
When choosing an advanced method for handling missing data, you should consider the nature of the missing data (e.g., is it missing at random or not?), the amount of missing data, the complexity of the relationships in your data, and the computational resources available.
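To show how the results of multiple imputation are pooled, the sketch below estimates a mean from several imputed datasets and combines the estimates with Rubin's rules: the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The imputation step is deliberately simple: it draws from a bootstrap resample of the observed values, which assumes the data is MCAR. The function names and data are hypothetical, and a real analysis would normally use a model-based imputation step.

```python
import random
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Combine per-dataset estimates and their variances with Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)          # pooled point estimate
    within = mean(variances)         # average within-imputation variance
    between = variance(estimates)    # between-imputation variance
    total = within + (1 + 1 / m) * between
    return q_bar, total

def multiply_impute_mean(values, m=20, seed=0):
    """Estimate the mean of `values` from m randomly imputed datasets."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    estimates, variances = [], []
    for _ in range(m):
        # Resample the observed values first, so that each imputed dataset
        # also reflects uncertainty about the observed distribution
        donors = rng.choices(observed, k=len(observed))
        completed = [rng.choice(donors) if v is None else v for v in values]
        estimates.append(mean(completed))
        variances.append(variance(completed) / len(completed))
    return pool_rubin(estimates, variances)

values = [12.0, 15.0, None, 11.0, 14.0, None, 13.0, 16.0, None, 12.5]
estimate, total_variance = multiply_impute_mean(values)
print(estimate, total_variance)
```

The between-imputation term is what single imputation leaves out: it captures how much the estimate changes depending on which values happened to be imputed.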
Choosing a Technique
Choosing the appropriate technique to handle missing values in your dataset depends on several factors, including:
The Amount of Missing Data: If only a small percentage of the observations are missing values, it might be reasonable to use listwise deletion (assuming the missingness is completely at random). However, if a large portion of your data is missing, more advanced imputation methods might be more appropriate.
The Pattern of Missingness: If the data is missing completely at random (MCAR), simpler methods like deletion or mean imputation might suffice. But if the data is missing at random (MAR) or not at random (MNAR), where the missingness is related to observed or unobserved data, more advanced techniques like multiple imputation, predictive modeling or matrix factorization would be better.
The Type of Variable: If the variable is categorical, you might use mode imputation or assign a new category to represent missing data. For continuous variables, you could use mean or median imputation, regression imputation, or more advanced methods like K-NN or deep learning-based imputation.
The Importance of the Variable: If the variable with missing data is a crucial feature for your analysis or model, you might want to use a more sophisticated imputation method to preserve its variance and relationships with other variables. If it’s not as important, you might use a simpler method or even consider dropping the variable.
The Complexity of the Data: If the relationships between variables in your dataset are complex, advanced techniques like multiple imputation, machine learning models, or deep learning techniques might be more appropriate, as they can model these complex relationships better than simpler methods.
Computational Resources: More advanced techniques can require significant computational resources and time. If these are limited, you might need to use simpler methods or use a smaller sample of your data for multiple imputation or modeling-based imputation.
It’s often a good idea to try multiple methods and compare the results to see which performs best for your specific situation – data science is the art of experimentation. You might also want to perform sensitivity analysis to understand the potential impact of the missing data on your conclusions or predictions.
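One simple way to compare methods is to hide some values that are actually known, impute them, and measure how far the imputed values are from the truth. The sketch below does this for mean and median imputation; the helper names, the masking fraction, and the data are assumptions chosen for illustration.

```python
import random
from statistics import mean, median

def masked_error(values, impute_fn, mask_fraction=0.3, seed=1):
    """Hide some known values, impute them, and return the mean absolute error."""
    rng = random.Random(seed)
    known = [i for i, v in enumerate(values) if v is not None]
    hidden = rng.sample(known, max(1, int(len(known) * mask_fraction)))
    masked = [None if i in hidden else v for i, v in enumerate(values)]
    filled = impute_fn(masked)
    return mean(abs(filled[i] - values[i]) for i in hidden)

def mean_impute(values):
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def median_impute(values):
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

# Hypothetical incomes in $1000s, with one very large value and one real gap
incomes = [30, 32, 35, 31, 400, 33, 34, None, 36, 29]

mean_err = masked_error(incomes, mean_impute)
median_err = masked_error(incomes, median_impute)
print(f"mean imputation error:   {mean_err:.1f}")
print(f"median imputation error: {median_err:.1f}")
```

Using the same seed for both calls hides the same positions, so the two errors are directly comparable. Repeating the comparison over several seeds gives a more reliable picture than a single run.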
Outliers and Missing Values
There are occasions when an extreme outlier should be treated as a missing value. For example, when a value is so extreme that it is likely an error, and there are very few such outliers, treating them as missing values and applying the aforementioned techniques would be appropriate.
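As a sketch of this idea, the example below replaces values that lie far outside the interquartile range (IQR) with None so that they can then be handled like any other missing value. The cutoff of three IQRs is a common convention for "extreme" outliers, but it is an assumption here and should be adapted to the data.

```python
from statistics import quantiles

def outliers_to_missing(values, k=3.0):
    """Replace values more than k * IQR outside the quartiles with None."""
    observed = [v for v in values if v is not None]
    q1, _, q3 = quantiles(observed, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v if v is not None and low <= v <= high else None for v in values]

# Hypothetical heights in cm; 1800 is most likely a data entry error
heights_cm = [165, 172, 158, 1800, 169, 175, 161]
cleaned = outliers_to_missing(heights_cm)
print(cleaned)  # [165, 172, 158, None, 169, 175, 161]
```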
Regulatory and Privacy Considerations
Regulatory and privacy considerations can play a significant role in determining how you deal with missing data in certain contexts. Let’s look at each of these in detail.
Regulatory Considerations:
Certain industries, particularly those related to healthcare, finance, and personal data, have strict regulations that guide how data can be handled, manipulated, and stored. These regulations can impact how you deal with missing data.
For instance, in the healthcare industry, the U.S. Health Insurance Portability and Accountability Act (HIPAA) lays out strict guidelines on how to manage patient health information. Similarly, the finance industry has regulations that require accurate reporting and data integrity, and these may specify acceptable methods for handling missing data.
Regulatory considerations also extend to the fairness and transparency of models, especially in sensitive areas such as credit scoring or hiring. Improperly handling missing data could potentially lead to models that are biased or discriminatory, which would be a regulatory issue.
Privacy Considerations:
Handling missing data by imputation methods involves making an educated guess about what the missing data might be, often based on other data about the same individual. If these inferences are incorrect, they could potentially be misleading or violate an individual’s privacy.
In some cases, the fact that a certain piece of data is missing can reveal something about an individual, even if the actual data isn’t known. For example, if certain health data is missing, it might be because an individual has a particular health condition that they prefer not to disclose.
Also, in privacy-preserving data analysis, noise is often added to the data to protect individual privacy. In these cases, adding synthetic data (like imputed values) to your dataset might be contrary to the goals of privacy protection.
In general, it’s crucial to understand and comply with all relevant regulations and privacy considerations when dealing with missing data. When in doubt, consult with a legal expert or a data privacy officer.
Summary
Missing values in machine learning and data analytics refer to the absence of expected data in a dataset. These gaps can cause issues, as many algorithms require complete data to function correctly, and missing values can distort conclusions or trends.
Dealing with missing values involves several strategies:
Removal: Removing observations with missing data. Effective if the amount of missing data is small, but it can introduce bias.
Imputation: Replacing missing values with substituted values. This can include using mean/median/mode, random values, or predictive models for substitution.
Default Value: Assigning a unique value to represent missing data, which is helpful if the missing status itself carries information.
Multiple Imputation: Imputing missing values multiple times, analyzing each dataset, and pooling the results. This provides a better estimate of the uncertainty in the imputed values.
Advanced Techniques: Using methods like data augmentation or deep learning techniques for dealing with missing data.
The chosen method should depend on the nature of the data and the problem at hand, and the data quality should be reassessed after handling missing values.
