Class imbalance in supervised machine learning refers to the scenario where the distribution of classes in a dataset is significantly skewed, meaning one class (or a few classes) has far more examples than others. This imbalance poses a challenge because most machine learning algorithms are designed to optimize overall accuracy, often leading to poor performance on the minority class. This can lead to significant bias in prediction.
For example, consider a binary classification problem for detecting fraudulent transactions. If 98% of transactions are legitimate and only 2% are fraudulent, a naive model might predict all transactions as legitimate to achieve 98% accuracy. However, this approach fails to identify fraudulent cases, which are critical in this context.
The issue arises because algorithms like decision trees, neural networks, and support vector machines may focus on the majority class, neglecting the minority class. Metrics such as accuracy become misleading, as high accuracy can be achieved by ignoring minority cases altogether. Alternative evaluation metrics such as the true positive rate (recall), the true negative rate (specificity), or the F1-score expose poor minority-class performance that accuracy hides.
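To make the point about misleading accuracy concrete, here is a minimal sketch in base R. The 1,000 labels and the variable names are illustrative assumptions that mirror the 98%/2% fraud example; the "model" simply predicts the majority class for every case.

```r
# Hypothetical labels: 980 legitimate (0) and 20 fraudulent (1) transactions
actual <- c(rep(0, 980), rep(1, 20))

# A naive "model" that predicts the majority class for every transaction
naive_pred <- rep(0, length(actual))

# Accuracy looks excellent ...
accuracy <- mean(naive_pred == actual)

# ... but recall (the true positive rate) on the fraud class is zero
true_pos  <- sum(naive_pred == 1 & actual == 1)
false_neg <- sum(naive_pred == 0 & actual == 1)
recall <- true_pos / (true_pos + false_neg)

print(c(accuracy = accuracy, recall = recall))
```

The model scores 98% accuracy while detecting none of the fraudulent cases, which is why accuracy alone should not be trusted on imbalanced data.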
Several strategies have been developed to ameliorate class imbalance:
Data-Level Approaches: These include resampling methods such as oversampling the minority class (e.g., SMOTE) or undersampling the majority class to balance the dataset.
Algorithm-Level Modifications: Some algorithms can be adjusted to handle imbalances by incorporating class weights or cost-sensitive learning.
Evaluation Metrics: Using metrics like precision, recall, F1-score, and area under the ROC curve (AUC-ROC) provides a clearer picture of a model’s performance on imbalanced datasets.
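As a rough illustration of the evaluation metrics named above, the sketch below computes precision, recall, and the F1-score for the minority class in base R. The helper name minority_metrics() and the example predictions are assumptions made for this illustration, not part of any package.

```r
# Compute precision, recall, and F1 for the positive (minority) class.
# `actual` and `predicted` are 0/1 vectors; 1 marks the minority class.
minority_metrics <- function(actual, predicted) {
  tp <- sum(predicted == 1 & actual == 1)  # true positives
  fp <- sum(predicted == 1 & actual == 0)  # false positives
  fn <- sum(predicted == 0 & actual == 1)  # false negatives
  precision <- if (tp + fp > 0) tp / (tp + fp) else NA_real_
  recall    <- if (tp + fn > 0) tp / (tp + fn) else NA_real_
  f1 <- if (!is.na(precision) && !is.na(recall) && precision + recall > 0) {
    2 * precision * recall / (precision + recall)
  } else {
    NA_real_
  }
  c(precision = precision, recall = recall, f1 = f1)
}

# Hypothetical example: 90 majority and 10 minority cases; the model
# finds 6 of the 10 minority cases and raises 4 false alarms
actual    <- c(rep(0, 90), rep(1, 10))
predicted <- c(rep(0, 86), rep(1, 4), rep(1, 6), rep(0, 4))

m <- minority_metrics(actual, predicted)
print(round(m, 3))
```

Overall accuracy in this example is 92%, yet precision and recall on the minority class are only 0.6 each, which gives a more honest picture of the model.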
Understanding and addressing class imbalance is critical for building robust and meaningful machine learning models, particularly in applications like fraud detection, medical diagnosis, and anomaly detection, where the minority class often represents the cases of greatest interest but generally has the least representation in the data.
Data-level methods to manage class imbalance generally focus on resampling. Resampling methods are techniques used to address class imbalance in datasets by modifying the distribution of the minority and majority classes. These methods operate at the data level, aiming to create a balanced dataset that allows machine learning models to learn effectively from both classes. Common resampling methods include oversampling, undersampling, and hybrid approaches.
Oversampling is done by increasing the size of the minority class by replicating existing samples or generating synthetic samples. There are several common methods for oversampling minority classes:
Random Oversampling: Randomly duplicates instances of the minority class until the class distribution is balanced.
Synthetic Minority Over-sampling Technique (SMOTE): Generates new synthetic instances for the minority class by interpolating between existing samples and their nearest neighbors.
Random Over-Sampling Examples (ROSE): A resampling method designed to address class imbalance by generating synthetic samples for the minority class. Unlike basic oversampling, which duplicates existing samples, or SMOTE, which interpolates between existing data points, ROSE uses a kernel density estimation approach to generate synthetic data points. This method creates a smoother, more realistic representation of the minority class distribution.
Oversampling is generally used when the dataset is small, as it avoids losing information from the majority class. SMOTE is a particularly useful technique when the minority class has sufficient variability to synthesize meaningful new samples.
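Random oversampling, the simplest of the methods listed above, can be sketched in base R without any package. The data frame toy and the other names below are assumptions chosen for this illustration.

```r
set.seed(123)

# Hypothetical imbalanced data: 90 majority (0) and 10 minority (1) cases
toy <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  class = c(rep(0, 90), rep(1, 10))
)

majority <- toy[toy$class == 0, ]
minority <- toy[toy$class == 1, ]

# Draw additional minority rows with replacement until both classes match
n_extra <- nrow(majority) - nrow(minority)
extra_idx <- sample(nrow(minority), n_extra, replace = TRUE)

# Keep every original row and append the duplicated minority rows
oversampled <- rbind(majority, minority, minority[extra_idx, ])

table(oversampled$class)
```

Because the added rows are exact copies, this balances the class counts but adds no new information, which is the motivation for synthetic methods such as SMOTE and ROSE.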
Undersampling reduces the size of the majority class by randomly removing samples to balance the class distribution. Similar to oversampling, there are various strategies:
Undersampling is often used when the dataset is large, and the majority class contains redundant or noisy samples. It works best when there is no risk of losing valuable information about the majority class.
The code example below illustrates random undersampling.[1] Undersampling in R can be implemented without relying on a package by manually selecting a random subset of the majority class whose size matches that of the minority class, which yields a balanced dataset.
Let’s first create an imbalanced dataset for testing:
set.seed(123)
# Simulate a binary classification dataset
data <- data.frame(
x1 = rnorm(100),
x2 = rnorm(100),
class = c(rep(0, 90), rep(1, 10)) # Imbalanced with 90 majority and 10 minority
)
# Check class distribution
table(data$class)
##
## 0 1
## 90 10
Now we can separate the classes:
# Separate majority and minority classes
majority <- subset(data, class == 0)
minority <- subset(data, class == 1)
Next, we extract a random sample from the majority class:
# Randomly sample the majority class to match the size of the minority class
undersampled_majority <- majority[sample(nrow(majority), nrow(minority)), ]
Finally we can combine the subsets:
# Combine the undersampled majority class with the minority class
balanced_data <- rbind(undersampled_majority, minority)
# Check the new class distribution
table(balanced_data$class)
##
## 0 1
## 10 10
In the above code:
sample(): randomly selects rows from the majority class.
nrow(minority): ensures that the majority class is reduced to the size of the minority class.
rbind(): combines the reduced majority class with the minority class into a balanced dataset.
To visualize the effect of undersampling, you can plot the original and balanced datasets:
library(ggplot2)
# Plot original data
ggplot(data, aes(x = x1, y = x2, color = factor(class))) +
geom_point() +
ggtitle("Original Imbalanced Dataset")
# Plot balanced data
ggplot(balanced_data, aes(x = x1, y = x2, color = factor(class))) +
geom_point() +
ggtitle("Dataset After Undersampling")
Remember that undersampling discards majority-class samples and therefore potentially implies a loss of valuable information.
Hybrid methods combine oversampling and undersampling to achieve a balanced dataset with minimal redundancy and improved synthetic sample quality. They generally start by oversampling the minority class to a predefined level and then undersampling the majority class to reduce redundancy. Methods like SMOTE-Tomek Links or SMOTE-ENN combine synthetic oversampling with cleaning techniques.
Hybrid methods work best when the dataset contains noisy or overlapping class boundaries. They are most effective for highly imbalanced datasets where a simple oversampling or undersampling approach might fail.
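The two-stage idea behind hybrid methods can be sketched in base R. Note the simplifications: this sketch swaps in plain random duplication for SMOTE and plain random removal for Tomek-link or ENN cleaning, and the per-class target of 40 is an arbitrary assumption.

```r
set.seed(123)

# Hypothetical imbalanced data: 90 majority (0) and 10 minority (1) cases
toy <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  class = c(rep(0, 90), rep(1, 10))
)
majority <- toy[toy$class == 0, ]
minority <- toy[toy$class == 1, ]

target_n <- 40  # assumed per-class size after resampling

# Step 1: oversample the minority class up to the target
# (keep all originals, add randomly chosen duplicates)
extra_idx <- sample(nrow(minority), target_n - nrow(minority), replace = TRUE)
minority_up <- rbind(minority, minority[extra_idx, ])

# Step 2: undersample the majority class down to the target
majority_down <- majority[sample(nrow(majority), target_n), ]

hybrid <- rbind(majority_down, minority_up)
table(hybrid$class)
```

Meeting in the middle means fewer duplicated minority rows than pure oversampling and fewer discarded majority rows than pure undersampling.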
By applying these techniques judiciously, practitioners can mitigate the adverse effects of class imbalance and improve model performance on both the majority and minority classes.
Let’s take a closer look with additional details on the two most common oversampling methods: ROSE and SMOTE.
ROSE (Random Over-Sampling Examples) is a resampling method designed to address class imbalance by generating synthetic samples for the minority class. Unlike basic oversampling, which duplicates existing samples, or SMOTE, which interpolates between existing data points, ROSE uses a kernel density estimation approach to generate synthetic data points. This method creates a smoother, more realistic representation of the minority class distribution.
ROSE generates synthetic examples for both the minority and majority classes by sampling from a smoothed approximation of the original data distribution. This is achieved through the following steps:
Kernel Density Estimation (KDE): A kernel density function is applied to the data to estimate the probability density function of the feature space. This density is used to randomly sample new data points, ensuring that the synthetic points reflect the underlying data distribution.
Balanced Sampling: Synthetic samples are added to the minority class to balance the dataset. Optionally, some points from the majority class can also be synthetically generated or removed to ensure better class boundaries.
Noise Handling: ROSE incorporates a level of randomness, reducing the risk of creating exact duplicates or overly simplistic synthetic points, which helps improve generalization.
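The steps above amount to a smoothed bootstrap, which can be sketched in a few lines of base R. This is a simplified illustration under stated assumptions, not the ROSE package's implementation: it uses a Gaussian kernel with a normal-reference rule-of-thumb bandwidth per feature and treats only the minority class. All names are hypothetical.

```r
set.seed(123)

# Hypothetical minority-class observations with two numeric features
minority <- data.frame(
  x1 = rnorm(10, mean = 3),
  x2 = rnorm(10, mean = 3)
)

n_new <- 80             # number of synthetic points to generate (assumed)
n <- nrow(minority)
d <- ncol(minority)

# Per-feature bandwidth from a normal-reference rule of thumb (assumption)
h <- (4 / ((d + 2) * n))^(1 / (d + 4)) * sapply(minority, sd)

# 1. Pick existing minority observations at random (with replacement)
seed_idx <- sample(n, n_new, replace = TRUE)

# 2. Perturb each pick with Gaussian noise scaled by the bandwidth,
#    which is equivalent to sampling from a kernel density estimate
noise <- sapply(h, function(bw) rnorm(n_new, mean = 0, sd = bw))
synthetic <- as.data.frame(as.matrix(minority)[seed_idx, , drop = FALSE] + noise)
rownames(synthetic) <- NULL

head(synthetic)
```

Unlike random oversampling, no synthetic point is an exact copy of an original observation; each one lands in the neighborhood of an existing point.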
ROSE has some key advantages over other methods:
ROSE is particularly effective when the minority class is highly underrepresented and the dataset contains complex decision boundaries or overlaps between classes, situations in which simpler methods like random oversampling or SMOTE are often insufficient to capture the minority class’s diversity. The trade-off is that ROSE is computationally more complex.
The ROSE package in R provides an easy-to-use implementation of this method. Here’s how it can be applied:
Consider a binary classification problem with an imbalanced dataset. The code below generates a balanced synthetic dataset using the ROSE() function from the ROSE package.
library(ROSE)
## Loaded ROSE 0.0-4
# Simulate imbalanced data
set.seed(123)
data <- data.frame(
x1 = rnorm(100),
x2 = rnorm(100),
class = c(rep(0, 90), rep(1, 10))
)
# Check class distribution
table(data$class)
##
## 0 1
## 90 10
# Apply ROSE to generate balanced data
rose_data <- ROSE(class ~ ., data = data, seed = 1)$data
# Check the new class distribution
table(rose_data$class)
##
## 0 1
## 52 48
You can visualize how ROSE generates synthetic points using scatterplots:
library(ggplot2)
# Original data
ggplot(data, aes(x = x1, y = x2, color = factor(class))) +
geom_point() +
ggtitle("Original Data")
# ROSE data
ggplot(rose_data, aes(x = x1, y = x2, color = factor(class))) +
geom_point() +
ggtitle("Data After ROSE")
The ROSE package also provides functionality to evaluate the effectiveness of resampling using models. For example:
# Train a logistic regression model on the resampled data
model <- glm(class ~ ., family = binomial, data = rose_data)
# Predict and evaluate
pred <- predict(model, newdata = data, type = "response")
roc.curve(data$class, pred)
## Area under the curve (AUC): 0.641
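While ROSE is highly effective, it has some limitations: it can introduce unrelated points if the data distribution is poorly estimated, and the kernel density estimation adds computational cost.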
Nevertheless, ROSE is a versatile strategy for handling class imbalance by generating synthetic examples that reflect the original data distribution. It is particularly suitable for datasets with complex relationships between features and classes, providing a robust alternative to traditional resampling methods.
SMOTE is a popular resampling technique used to address class imbalance in datasets by generating synthetic samples for the minority class. Unlike random oversampling, which duplicates existing minority class samples, SMOTE creates new data points by interpolating between existing minority class samples. This method helps reduce overfitting and enhances the minority class representation in the dataset.
SMOTE applies the following steps:
Identify Nearest Neighbors: For each sample in the minority class, SMOTE identifies its nearest neighbors within the same class based on a distance metric (commonly Euclidean distance). This is essentially an application of the kNN algorithm. Note that this requires categorical features to be numerically encoded and numeric features to be normalized to a common scale.
Generate Synthetic Samples: New samples are created by taking a weighted average of a minority class sample and one of its nearest neighbors. This interpolation ensures that synthetic samples lie along the line segments connecting existing samples.
Repeat Until Balanced: This process is repeated until the desired class balance is achieved.
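The three steps above can be sketched in base R as follows. This is a teaching sketch under stated assumptions (Euclidean distance, features already on a common scale, two synthetic samples per minority case), not the implementation used by any particular package; smote_one() is a hypothetical helper name.

```r
set.seed(123)

# Hypothetical minority-class samples: 10 rows, two numeric features
minority <- matrix(rnorm(20, mean = 3), ncol = 2,
                   dimnames = list(NULL, c("x1", "x2")))
k <- 5  # number of nearest neighbors to consider

# Create one synthetic sample from minority sample i
smote_one <- function(i, X, k) {
  # Step 1: Euclidean distance from sample i to every minority sample
  dists <- sqrt(rowSums(sweep(X, 2, X[i, ])^2))
  dists[i] <- Inf                       # exclude the sample itself
  neighbors <- order(dists)[seq_len(k)]

  # Step 2: pick one neighbor and interpolate at a random position
  j <- neighbors[sample.int(k, 1)]
  gap <- runif(1)                       # 0 = sample i, 1 = neighbor j
  X[i, ] + gap * (X[j, ] - X[i, ])
}

# Step 3: repeat, here twice for every minority sample
base_idx <- rep(seq_len(nrow(minority)), each = 2)
synthetic <- t(sapply(base_idx, smote_one, X = minority, k = k))

dim(synthetic)
```

Every synthetic row lies on the line segment between two existing minority samples, so the new points never leave the region already spanned by the minority class.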
SMOTE has some key advantages over other methods:
Avoids Overfitting: By generating new synthetic samples rather than duplicating existing ones, SMOTE mitigates the risk of overfitting to the minority class.
Enhances Generalization: SMOTE encourages models to learn a broader representation of the minority class, especially near class boundaries.
Improves Decision Boundaries: By creating samples near the edges of the minority class, SMOTE helps the model distinguish better between the majority and minority classes.
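SMOTE is particularly effective in cases where the dataset is structured and clean and the minority class is not too sparse, that is, where it has sufficient variability to synthesize meaningful new samples.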
The DMwR, performanceEstimation, and smotefamily packages in R provide implementations of SMOTE. Here’s a step-by-step explanation with examples using a simulated imbalanced binary classification dataset:
set.seed(123)
data <- data.frame(
x1 = c(rnorm(90, mean = 0), rnorm(10, mean = 3)),
x2 = c(rnorm(90, mean = 0), rnorm(10, mean = 3)),
class = c(rep(0, 90), rep(1, 10))
)
# Check class distribution
table(data$class)
##
## 0 1
## 90 10
The dataset is imbalanced, with only 10 samples in the minority class. Now we can use the smote() function from the performanceEstimation package to generate synthetic samples:
library(performanceEstimation) # For SMOTE
set.seed(123)
# Simulate an imbalanced dataset
data <- data.frame(
x1 = c(rnorm(90, mean = 0), rnorm(10, mean = 3)),
x2 = c(rnorm(90, mean = 0), rnorm(10, mean = 3)),
class = as.factor(c(rep(0, 90), rep(1, 10))) # 90 majority (0), 10 minority (1)
)
# Check class distribution
table(data$class)
##
## 0 1
## 90 10
# Apply SMOTE
smote_data <- performanceEstimation::smote(
  form = class ~ ., # Formula specifying the target and predictors
  data = data, # Original dataset
  perc.over = 2, # Synthetic samples generated per minority case (a multiplier, not a percentage)
  perc.under = 1.5 # Majority samples kept per synthetic sample generated
)
# Check the new class distribution
table(smote_data$class)
##
## 0 1
## 30 30
Note the following outcomes:
perc.over = 2: Generates two synthetic samples for each of the 10 minority cases, so the minority class grows from 10 to 30.
perc.under = 1.5: Keeps 1.5 majority samples for each of the 20 synthetic samples, that is, 30 majority cases, which balances the dataset.
In performanceEstimation::smote(), both arguments are multipliers rather than percentages. Passing values such as 200 and 150 would produce 2,010 minority rows and 300,000 majority rows.
There are other packages for R that also implement SMOTE, such as the smotefamily package explained with an example later in this section.
We can visualize the effects of SMOTE by comparing the class distribution of the original dataset against that of the resampled dataset:
library(ggplot2)
# Original data
ggplot(data, aes(x = x1, y = x2, color = factor(class))) +
geom_point() +
ggtitle("Original Dataset")
# SMOTE data
ggplot(smote_data, aes(x = x1, y = x2, color = factor(class))) +
geom_point() +
ggtitle("Dataset After SMOTE")
The smote() function has two important parameters:
perc.over: Controls how many new synthetic samples are generated for each minority class case.
perc.under: Controls how many majority class samples are kept for each synthetic sample generated.
The smotefamily package provides additional flexibility, including options for multi-class datasets:
library(smotefamily)
set.seed(123)
# Simulate an imbalanced dataset
data <- data.frame(
x1 = c(rnorm(90, mean = 0), rnorm(10, mean = 3)),
x2 = c(rnorm(90, mean = 0), rnorm(10, mean = 3)),
class = as.factor(c(rep(0, 90), rep(1, 10))) # 90 majority (0), 10 minority (1)
)
# Apply SMOTE
smote_result <- smotefamily::SMOTE(
X = data[, c("x1", "x2")], # Features
target = data$class, # Target variable
K = 5, # Number of nearest neighbors
dup_size = 2 # Number of synthetic samples per minority class instance
)
# Extract the resampled data as a dataframe
smote_data <- data.frame(x1 = smote_result$data$x1,
x2 = smote_result$data$x2,
class = as.numeric(smote_result$data$class))
# Check the new class distribution
table(smote_data$class)
##
## 0 1
## 90 30
In the above example, there are two key parameters for the SMOTE() function:
K: Number of nearest neighbors to consider.
dup_size: Number of synthetic samples to generate per minority class instance.
To assess the effectiveness of SMOTE, train a model before and after applying it and compare the model’s performance:
library(ROCR)
# Train a logistic regression model on the original data
model_original <- glm(class ~ ., data = data, family = binomial)
pred_original <- predict(model_original, type = "response")
roc_original <- ROCR::prediction(pred_original, data$class)
auc_original <- ROCR::performance(roc_original, "auc")@y.values
# Train a logistic regression model on SMOTE data
model_smote <- glm(class ~ ., data = smote_data, family = binomial)
pred_smote <- predict(model_smote, newdata = smote_data, type = "response")
roc_smote <- ROCR::prediction(pred_smote, smote_data$class)
auc_smote <- ROCR::performance(roc_smote, "auc")@y.values
# Compare AUC
print(paste("AUC Before SMOTE:", auc_original))
## [1] "AUC Before SMOTE: 1"
print(paste("AUC After SMOTE:", auc_smote))
## [1] "AUC After SMOTE: 1"
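SMOTE has some important limitations that call for judicious use: it can amplify noise or class overlap when the minority class contains noisy samples, and its nearest-neighbor calculations become computationally heavy for large datasets.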
Despite its potential limitations, SMOTE is a powerful resampling technique for addressing class imbalance, particularly in datasets where the minority class has well-defined and representative samples. By generating synthetic examples, it improves model performance on the minority class, resulting in better generalization. However, careful parameter tuning and evaluation are essential to avoid introducing noise or overcomplicating class boundaries.
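One part of the careful evaluation mentioned above is to resample only the training data and to measure performance on untouched held-out data. The base R sketch below illustrates that workflow under stated assumptions: the data are simulated, the split is a stratified 70/30 split, and plain random oversampling stands in for SMOTE so that no package is required.

```r
set.seed(123)

# Hypothetical imbalanced data with partially overlapping classes
toy <- data.frame(
  x1 = c(rnorm(180, mean = 0), rnorm(20, mean = 1.5)),
  x2 = c(rnorm(180, mean = 0), rnorm(20, mean = 1.5)),
  class = c(rep(0, 180), rep(1, 20))
)

# Stratified 70/30 split so both sets contain minority cases
idx0 <- which(toy$class == 0)
idx1 <- which(toy$class == 1)
train_idx <- c(sample(idx0, round(0.7 * length(idx0))),
               sample(idx1, round(0.7 * length(idx1))))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]   # never resampled

# Balance the training set only (random oversampling as a stand-in for SMOTE)
min_train <- train[train$class == 1, ]
n_extra <- sum(train$class == 0) - nrow(min_train)
train_bal <- rbind(train, min_train[sample(nrow(min_train), n_extra, replace = TRUE), ])

# Fit on the balanced training data, evaluate on the original test data
model <- glm(class ~ x1 + x2, family = binomial, data = train_bal)
prob <- predict(model, newdata = test, type = "response")
pred <- as.integer(prob > 0.5)

recall <- sum(pred == 1 & test$class == 1) / sum(test$class == 1)
table(actual = test$class, predicted = pred)
```

Keeping the test set at its original class proportions means the reported recall reflects how the model would behave on new, naturally imbalanced data.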
SMOTE (Synthetic Minority Over-sampling Technique) and ROSE (Random Over-Sampling Examples) are two key resampling techniques designed to address class imbalance in datasets by generating synthetic samples. However, they differ in methodology and suitability for specific scenarios. The table below summarizes those differences:
Feature | SMOTE | ROSE |
---|---|---|
Methodology | Interpolates between existing minority class samples using nearest neighbors. | Uses kernel density estimation to generate synthetic points based on the entire data distribution. |
Synthetic Samples | Generated along line segments between existing minority samples. | Generated randomly across the feature space, with smoothing to approximate the data distribution. |
Focus | Enhances decision boundaries by creating samples near existing data points. | Provides a balanced and smoothed representation of both classes, not limited to minority class. |
Noise Handling | Can amplify noise or overlap if the minority class has noisy samples. | Less prone to overfitting but can introduce unrelated points if the data distribution is poorly estimated. |
Scalability | Computationally heavier for large datasets due to nearest-neighbor calculations. | Relatively lighter but depends on kernel density estimation complexity. |
Applicability | Works well with structured, clean datasets and when the minority class is not too sparse. | Effective for datasets with noisy or overlapping class boundaries. |
To summarize, we generally want to use SMOTE for structured datasets requiring precise synthetic data, and prefer ROSE for datasets needing broader smoothing or when noise in the minority class is a concern.
Algorithm-level modifications to address class imbalance involve adjusting the learning algorithm itself to account for the imbalance without altering the dataset. This is achieved by making the model sensitive to the importance of each class, typically by introducing class weights, cost-sensitive learning, or custom loss functions. These methods ensure that the minority class has a proportionally larger influence during training.
Many machine learning algorithms allow specifying class weights, which assign higher importance to the minority class during the training process. The code example below demonstrates this with a logistic regression model and assigning class weights.
# Simulate imbalanced data
set.seed(123)
data <- data.frame(
x1 = c(rnorm(90, mean = 0), rnorm(10, mean = 3)),
x2 = c(rnorm(90, mean = 0), rnorm(10, mean = 3)),
class = c(rep(0, 90), rep(1, 10))
)
# Check class distribution
table(data$class)
##
## 0 1
## 90 10
# Assign weights: higher for minority class
weights <- ifelse(data$class == 1, 9, 1) # Ratio 90:10, so minority gets 9x weight
# Train logistic regression with weights
model <- glm(class ~ x1 + x2, family = binomial, data = data, weights = weights)
# Model summary
summary(model)
##
## Call:
## glm(formula = class ~ x1 + x2, family = binomial, data = data,
## weights = weights)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -119.91 46002.47 -0.003 0.998
## x1 42.90 17740.59 0.002 0.998
## x2 31.54 14212.17 0.002 0.998
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2.4953e+02 on 99 degrees of freedom
## Residual deviance: 1.1412e-08 on 97 degrees of freedom
## AIC: 6
##
## Number of Fisher Scoring iterations: 25
# Predict probabilities
predicted <- predict(model, type = "response")
# Evaluate performance
library(ROCR)
roc_curve <- prediction(predicted, data$class)
auc <- performance(roc_curve, "auc")@y.values[[1]]
print(paste("AUC with weights:", auc))
## [1] "AUC with weights: 1"
Note that in the code above, the weights parameter adjusts the influence of each class. In this case, minority class instances are assigned a weight of 9, amplifying their impact on the model. This approach works well when the dataset is highly imbalanced but representative of real-world proportions. The choice of weights is empirical and requires experimentation.
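A common starting point for that experimentation is to weight each class inversely to its frequency. The sketch below shows one such heuristic in base R; the formula n / (number of classes × class count) is a widely used convention adopted here as an assumption, and the variable names are illustrative.

```r
# Hypothetical labels: 90 majority (0) and 10 minority (1) cases
class_labels <- c(rep(0, 90), rep(1, 10))

# Count the cases per class as a plain named vector
tab <- table(class_labels)
class_counts <- setNames(as.vector(tab), names(tab))

# Inverse-frequency weights: n / (number of classes * class count)
class_weights <- length(class_labels) / (length(class_counts) * class_counts)
print(class_weights)

# Map the class weights back to one weight per observation,
# ready to pass to a model's `weights` argument
obs_weights <- unname(class_weights[as.character(class_labels)])

# The minority-to-majority weight ratio equals the imbalance ratio
weight_ratio <- unname(class_weights["1"] / class_weights["0"])
print(weight_ratio)
```

With a 90:10 split the ratio is 9, matching the hand-picked weight in the example above, and each class contributes the same total weight to the fit.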
Cost-sensitive learning explicitly penalizes misclassifications of the minority class more heavily than those of the majority class. Decision tree algorithms, like rpart, allow specifying cost matrices to achieve this. The code below illustrates this for a decision tree.
library(rpart)
# Define cost matrix: higher penalty for minority misclassification
cost_matrix <- matrix(c(0, 10, 1, 0), nrow = 2, byrow = TRUE)
# Train a cost-sensitive decision tree
model <- rpart(
class ~ x1 + x2,
data = data,
method = "class",
parms = list(loss = cost_matrix)
)
# Print the tree structure
print(model)
## n= 100
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 100 10 0 (0.90000000 0.10000000)
## 2) x1< 2.270525 91 1 0 (0.98901099 0.01098901) *
## 3) x1>=2.270525 9 0 1 (0.00000000 1.00000000) *
# Predict and evaluate
predicted <- predict(model, type = "class")
confusion_matrix <- table(data$class, predicted)
print("Confusion Matrix:")
## [1] "Confusion Matrix:"
print(confusion_matrix)
## predicted
## 0 1
## 0 90 0
## 1 1 9
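Once again, the costs are chosen empirically and tuned through trial and error. The intent of the cost matrix is to penalize misclassification of the minority class (10) more heavily than misclassification of the majority class (1). Note that rpart reads the loss matrix with rows as the true class and columns as the predicted class, so the penalty for missing a minority case belongs in row 2, column 1. In the matrix as written above, the 10 sits in row 1, column 2, so the two off-diagonal entries should be swapped to obtain the intended behavior. This modification directly influences how the decision tree splits data, prioritizing minority class accuracy.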
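To see how a cost matrix changes the way a model is judged, the sketch below computes the total misclassification cost of a set of predictions in base R. The counts mirror the confusion matrix printed above. The orientation (rows are actual classes, columns are predicted classes) and the costs of 10 and 1 are assumptions stated in the comments.

```r
# Confusion matrix: rows = actual class, columns = predicted class
confusion <- matrix(c(90, 0,
                       1, 9),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Cost matrix in the same orientation (assumed costs):
# missing a minority case (actual 1, predicted 0) costs 10,
# a false alarm (actual 0, predicted 1) costs 1
costs <- matrix(c( 0, 1,
                  10, 0),
                nrow = 2, byrow = TRUE,
                dimnames = dimnames(confusion))

# Total cost = sum over cells of (count * cost)
total_cost <- sum(confusion * costs)

# Baseline for comparison: always predict the majority class
baseline <- matrix(c(90, 0,
                     10, 0),
                   nrow = 2, byrow = TRUE,
                   dimnames = dimnames(confusion))
baseline_cost <- sum(baseline * costs)

print(c(model = total_cost, always_majority = baseline_cost))
```

Under these costs the tree's single missed minority case costs 10, while the always-majority baseline costs 100, even though the two differ by only nine percentage points in accuracy.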
In deep learning, custom loss functions can be used to address class imbalance by assigning different penalties to errors based on class. The R example below uses a weighted loss in Keras. While not the only package for deep learning in R, the keras package provides an interface to the Keras deep learning library, which is built on top of TensorFlow, and is designed for creating and training deep learning models in a more accessible form. Keras and TensorFlow are mostly used for tasks such as image recognition, natural language processing, and time-series forecasting. TensorFlow operates on a computation graph model, where operations are represented as nodes and data flows between them along edges. The framework is optimized to handle large-scale numerical computations using this structure, particularly for neural networks. Training deep neural networks with many hidden layers is computationally very expensive, so the code below may require significant time to run on typical systems.
library(keras)
# Prepare imbalanced data
x <- as.matrix(data[, c("x1", "x2")])
y <- as.numeric(data$class)
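# Define a weighted binary cross-entropy loss
weighted_loss <- function(y_true, y_pred) {
  # Use tensor arithmetic rather than ifelse(), which does not operate on tensors:
  # the weight is 9 where y_true is 1 (minority class) and 1 where y_true is 0
  weights <- y_true * 8 + 1
  keras::k_mean(weights * keras::k_binary_crossentropy(y_true, y_pred), axis = -1)
}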
# Build a simple neural network
model <- keras_model_sequential() %>%
layer_dense(units = 16, activation = "relu", input_shape = ncol(x)) %>%
layer_dense(units = 1, activation = "sigmoid")
model %>% compile(
optimizer = "adam",
loss = weighted_loss,
metrics = c("accuracy")
)
# Train the model
model %>% fit(
x,
y,
epochs = 50,
batch_size = 10,
verbose = 1
)
Here, the custom loss function applies higher penalties to errors on minority class samples. This approach is particularly effective for neural networks or other algorithms where class weights may not be directly supported. However, neural networks are computationally expensive and time-consuming to train.
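The arithmetic behind a weighted binary cross-entropy can be checked on ordinary R vectors, without Keras. The function weighted_bce() below is a hypothetical helper written for this illustration; the default minority weight of 9 and the example probabilities are assumptions.

```r
# Weighted binary cross-entropy on plain numeric vectors
weighted_bce <- function(y_true, y_prob, pos_weight = 9, eps = 1e-7) {
  # Clip probabilities away from 0 and 1 to avoid log(0)
  y_prob <- pmin(pmax(y_prob, eps), 1 - eps)

  # Standard per-observation cross-entropy
  per_obs <- -(y_true * log(y_prob) + (1 - y_true) * log(1 - y_prob))

  # Errors on minority cases (y_true == 1) count pos_weight times as much
  weights <- ifelse(y_true == 1, pos_weight, 1)
  mean(weights * per_obs)
}

# Hypothetical batch: three majority cases predicted well,
# one minority case predicted poorly (probability 0.3 for the true class)
y_true <- c(0, 0, 0, 1)
y_prob <- c(0.1, 0.2, 0.1, 0.3)

unweighted <- weighted_bce(y_true, y_prob, pos_weight = 1)
weighted   <- weighted_bce(y_true, y_prob, pos_weight = 9)

print(c(unweighted = unweighted, weighted = weighted))
```

The weighted loss is much larger than the unweighted one because the single poorly predicted minority case dominates it, which is the pressure that pushes the optimizer toward the minority class.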
Algorithm-level modifications are best used when class imbalance is moderate and the algorithm supports weight adjustments (e.g., logistic regression, SVMs, kNN), or when misclassification costs vary significantly between classes, especially in decision trees or ensemble models. In deep learning or advanced models that require fine-grained control over optimization, custom loss functions are an option.
These techniques are effective when you want to keep the dataset intact (without resampling) and rely on the algorithm to balance the learning process. In practice, both methods, oversampling and algorithm modification, can be used together.
This lesson explored methods for addressing class imbalance in supervised machine learning, focusing on data-level resampling techniques and algorithm-level modifications. Resampling methods modify the dataset’s class distribution to balance the representation of the majority and minority classes. Two key techniques discussed were SMOTE (Synthetic Minority Over-sampling Technique) and ROSE (Random Over-Sampling Examples). SMOTE generates synthetic samples for the minority class by interpolating between existing instances and their nearest neighbors, making it particularly suitable for structured datasets with a well-defined minority class. In contrast, ROSE employs kernel density estimation to generate synthetic samples across the feature space, making it effective for handling noisy or overlapping class boundaries. While SMOTE is ideal for enhancing decision boundaries, ROSE offers versatility for noisy datasets.
Algorithm-level modifications address class imbalance by directly influencing the learning process. These include the use of class weights, cost-sensitive learning, and custom loss functions. Class weights, as shown in a weighted logistic regression example, assign higher importance to minority class samples, ensuring they have a greater impact during model training. Cost-sensitive learning, such as with cost-sensitive decision trees, penalizes misclassifications of the minority class more heavily, effectively guiding the algorithm to prioritize these cases. Custom loss functions, often applied in deep learning, allow precise control over the training process by introducing penalties tailored to the dataset’s needs, such as weighted binary cross-entropy. These algorithmic approaches are particularly useful when the dataset itself should remain unaltered, and the imbalance can be addressed through adjustments to the learning framework.
In summary, resampling methods like SMOTE and ROSE are best used when modifying the dataset is feasible, with SMOTE being preferred for well-defined minority classes and ROSE for noisy or sparse datasets. Algorithm-level modifications, including class weights, cost matrices, and custom loss functions, are preferable when the learning process requires more nuanced adjustments. Together, these strategies provide a robust toolkit for handling class imbalance, ensuring better performance and fair representation of minority classes in machine learning models.
[1] In prior versions of R, the package unbalanced provided support for undersampling, but the package is no longer available.