Overview

Class imbalance in supervised machine learning refers to the scenario where the distribution of classes in a dataset is significantly skewed, meaning one class (or a few classes) has far more examples than the others. This imbalance poses a challenge because most machine learning algorithms are designed to optimize overall accuracy, which often leads to biased predictions and poor performance on the minority class.

For example, consider a binary classification problem for detecting fraudulent transactions. If 98% of transactions are legitimate and only 2% are fraudulent, a naive model might predict all transactions as legitimate to achieve 98% accuracy. However, this approach fails to identify fraudulent cases, which are critical in this context.

The issue arises because algorithms like decision trees, neural networks, and support vector machines may focus on the majority class, neglecting the minority class. Metrics such as accuracy become misleading, as high accuracy can be achieved by ignoring minority cases altogether. Alternative evaluation metrics, such as precision, recall, and the F1-score, expose the poor minority-class performance that raw accuracy hides.

Several strategies have been developed to ameliorate class imbalance:

  1. Data-Level Approaches: These include resampling methods such as oversampling the minority class (e.g., SMOTE) or undersampling the majority class to balance the dataset.

  2. Algorithm-Level Modifications: Some algorithms can be adjusted to handle imbalances by incorporating class weights or cost-sensitive learning.

  3. Evaluation Metrics: Using metrics like precision, recall, F1-score, and area under the ROC curve (AUC-ROC) provides a clearer picture of a model’s performance on imbalanced datasets.
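
As a brief illustration, these metrics can be computed directly from confusion matrix counts in base R (the counts below are hypothetical):

```r
# Hypothetical confusion-matrix counts for a classifier on imbalanced data
TP <- 15; FP <- 5; FN <- 10; TN <- 970

precision <- TP / (TP + FP)                    # 0.75
recall    <- TP / (TP + FN)                    # 0.6
f1        <- 2 * precision * recall / (precision + recall)
accuracy  <- (TP + TN) / (TP + FP + FN + TN)

# Accuracy looks excellent (0.985) even though recall shows that 40% of
# minority cases are missed
round(c(accuracy = accuracy, precision = precision,
        recall = recall, f1 = f1), 3)
```

The contrast between the near-perfect accuracy and the modest recall and F1-score is exactly the signal that accuracy alone would hide.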

Understanding and addressing class imbalance is critical for building robust and meaningful machine learning models, particularly in applications like fraud detection, medical diagnosis, and anomaly detection, where the minority class often represents the cases of greatest interest but generally has the least representation in the data.

Data-Level Approaches

Data-level methods manage class imbalance through resampling: modifying the distribution of the minority and majority classes in the training data. These methods aim to create a balanced dataset that allows machine learning models to learn effectively from both classes. Common resampling methods include oversampling, undersampling, and hybrid approaches.

Oversampling

Oversampling increases the size of the minority class by replicating existing samples or generating synthetic ones. There are several common methods for oversampling minority classes:

  1. Random Oversampling: Randomly duplicates instances of the minority class until the class distribution is balanced.

  2. Synthetic Minority Over-sampling Technique (SMOTE): Generates new synthetic instances for the minority class by interpolating between existing samples and their nearest neighbors.

  3. Random Over-Sampling Examples (ROSE): Generates synthetic samples for the minority class using a kernel density estimation approach rather than duplicating existing samples (as in basic oversampling) or interpolating between data points (as in SMOTE). This creates a smoother, more realistic representation of the minority class distribution; ROSE is described in detail later in this lesson.

Oversampling is generally used when the dataset is small, as it avoids losing information from the majority class. SMOTE is a particularly useful technique when the minority class has sufficient variability to synthesize meaningful new samples.
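
Random oversampling needs no special package; the following base-R sketch, using a small simulated dataset, duplicates minority rows (sampling with replacement) until the classes match:

```r
set.seed(123)
# Simulated imbalanced dataset: 90 majority (0) and 10 minority (1) cases
data <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  class = c(rep(0, 90), rep(1, 10))
)

majority <- subset(data, class == 0)
minority <- subset(data, class == 1)

# Sample minority rows with replacement up to the majority class size
oversampled_minority <- minority[sample(nrow(minority), nrow(majority),
                                        replace = TRUE), ]

balanced <- rbind(majority, oversampled_minority)
table(balanced$class)  # 90 cases in each class
```

Because the new minority rows are exact copies, this simple approach can encourage overfitting, which is what motivates synthetic methods such as SMOTE and ROSE.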

Undersampling

Undersampling reduces the size of the majority class by randomly removing samples to balance the class distribution. Similar to oversampling, there are various strategies:

  1. Random Undersampling: Randomly selects a subset of majority class samples equal in size to the minority class.
  2. Tomek Links: Removes majority class instances that are closest to minority class instances to clean class boundaries.

Undersampling is often used when the dataset is large and the majority class contains redundant or noisy samples, so that removing majority examples loses little valuable information.

The code example below illustrates random undersampling¹. Undersampling in R can be implemented without relying on a package by manually selecting a random subset of the majority class to match the size of the minority class. This approach ensures a balanced dataset by reducing the majority class to the size of the minority class.

  1. Check Class Distribution: Determine the number of samples in the minority and majority classes.
  2. Subset the Majority Class: Randomly sample rows from the majority class to match the size of the minority class.
  3. Combine the Subsets: Merge the reduced majority class with the minority class to create a balanced dataset.

Let’s first create an imbalanced dataset for testing:

set.seed(123)
# Simulate a binary classification dataset
data <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  class = c(rep(0, 90), rep(1, 10))  # Imbalanced with 90 majority and 10 minority
)

# Check class distribution
table(data$class)
## 
##  0  1 
## 90 10

Now we can separate the classes:

# Separate majority and minority classes
majority <- subset(data, class == 0)
minority <- subset(data, class == 1)

Next, we extract a random sample from the majority class:

# Randomly sample the majority class to match the size of the minority class
undersampled_majority <- majority[sample(nrow(majority), nrow(minority)), ]

Finally we can combine the subsets:

# Combine the undersampled majority class with the minority class
balanced_data <- rbind(undersampled_majority, minority)

# Check the new class distribution
table(balanced_data$class)
## 
##  0  1 
## 10 10

In the above code,

  1. The sample() function is used to randomly select rows from the majority class.
  2. The size of the sample (nrow(minority)) ensures that the majority class is reduced to the size of the minority class.
  3. The rbind() function combines the reduced majority class with the minority class into a balanced dataset.

To visualize the effect of undersampling, you can plot the original and balanced datasets:

library(ggplot2)

# Plot original data
ggplot(data, aes(x = x1, y = x2, color = factor(class))) +
  geom_point() +
  ggtitle("Original Imbalanced Dataset")

# Plot balanced data
ggplot(balanced_data, aes(x = x1, y = x2, color = factor(class))) +
  geom_point() +
  ggtitle("Dataset After Undersampling")

Remember that undersampling carries two potential drawbacks:

  1. Loss of Information: Undersampling reduces the size of the dataset, potentially discarding useful information from the majority class.
  2. Risk of Overfitting: With a small dataset, the model might overfit to the reduced data.
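
The second undersampling strategy listed earlier, Tomek links, can also be sketched without a package. A Tomek link is a pair of opposite-class points that are each other's nearest neighbors; removing the majority member of each pair cleans the class boundary. The code below is a minimal illustration, not an optimized implementation:

```r
set.seed(123)
data <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  class = c(rep(0, 90), rep(1, 10))
)

# Pairwise Euclidean distances; a point must not be its own neighbor
X <- as.matrix(data[, c("x1", "x2")])
d <- as.matrix(dist(X))
diag(d) <- Inf
nn <- apply(d, 1, which.min)  # index of each point's nearest neighbor

# A Tomek link: i and nn[i] are mutual nearest neighbors of different classes
is_link <- sapply(seq_len(nrow(data)), function(i) {
  nn[nn[i]] == i && data$class[i] != data$class[nn[i]]
})

# Remove only the majority-class member of each link
cleaned <- data[!(is_link & data$class == 0), ]
table(cleaned$class)
```

Unlike random undersampling, this removes only boundary-adjacent majority points, so the resulting dataset is cleaner but usually still imbalanced.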

Hybrid Methods

Hybrid methods combine oversampling and undersampling to achieve a balanced dataset with minimal redundancy and improved synthetic sample quality. They generally start by oversampling the minority class to a predefined level and then undersampling the majority class to reduce redundancy. Methods like SMOTE-Tomek Links or SMOTE-ENN combine synthetic oversampling with cleaning techniques.

Hybrid methods work best when the dataset contains noisy or overlapping class boundaries. They are most effective for highly imbalanced datasets where a simple oversampling or undersampling approach might fail.
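
There is no single canonical R function for SMOTE-Tomek, but the two stages can be chained by hand. The sketch below first oversamples with performanceEstimation::smote() (assuming that package is installed) and then removes majority points involved in Tomek links (mutual nearest neighbors of opposite classes); the cleaning logic is illustrative rather than a packaged implementation:

```r
library(performanceEstimation)  # provides smote()

set.seed(123)
data <- data.frame(
  x1 = c(rnorm(90, mean = 0), rnorm(10, mean = 3)),
  x2 = c(rnorm(90, mean = 0), rnorm(10, mean = 3)),
  class = as.factor(c(rep(0, 90), rep(1, 10)))
)

# Stage 1: oversample (2 synthetic cases per minority case, 2 majority cases
# sampled per synthetic case)
over <- smote(class ~ ., data, perc.over = 2, perc.under = 2)

# Stage 2: remove majority members of Tomek links
X <- as.matrix(over[, c("x1", "x2")])
d <- as.matrix(dist(X))
diag(d) <- Inf
nn <- apply(d, 1, which.min)
is_link <- sapply(seq_len(nrow(over)), function(i) {
  nn[nn[i]] == i && over$class[i] != over$class[nn[i]]
})
hybrid <- over[!(is_link & over$class == "0"), ]

table(hybrid$class)
```

The oversampling stage balances the counts while the cleaning stage trims the synthetic and original points that blur the class boundary.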


Summary of Use Cases

  1. Random Oversampling: Use for small datasets or when synthetic generation might introduce noise.
  2. SMOTE: Use when synthetic sample quality is essential and the minority class is well-defined.
  3. Random Undersampling: Use for large datasets with redundant majority samples.
  4. Hybrid Methods: Use for noisy or complex class boundaries.

By applying these techniques judiciously, practitioners can mitigate the adverse effects of class imbalance and improve model performance on both the majority and minority classes.

Synthetic Oversampling Methods

Let’s take a closer look at two widely used synthetic oversampling methods: ROSE and SMOTE.

ROSE

ROSE (Random Over-Sampling Examples) is a resampling method designed to address class imbalance by generating synthetic samples for the minority class. Unlike basic oversampling, which duplicates existing samples, or SMOTE, which interpolates between existing data points, ROSE uses a kernel density estimation approach to generate synthetic data points. This method creates a smoother, more realistic representation of the minority class distribution.

ROSE generates synthetic examples for both the minority and majority classes by sampling from a smoothed approximation of the original data distribution. This is achieved through the following steps:

  1. Kernel Density Estimation (KDE): A kernel density function is applied to the data to estimate the probability density function of the feature space. This density is used to randomly sample new data points, ensuring that the synthetic points reflect the underlying data distribution.

  2. Balanced Sampling: Synthetic samples are added to the minority class to balance the dataset. Optionally, some points from the majority class can also be synthetically generated or removed to ensure better class boundaries.

  3. Noise Handling: ROSE incorporates a level of randomness, reducing the risk of creating exact duplicates or overly simplistic synthetic points, which helps improve generalization.

Advantages of ROSE

ROSE has some key advantages over other methods:

  1. Preserves Data Characteristics: Synthetic samples closely resemble the real distribution, avoiding artifacts introduced by simpler methods like random duplication.
  2. Reduces Overfitting: By generating new points rather than duplicating existing ones, ROSE mitigates the risk of overfitting to the minority class.
  3. Improves Class Boundaries: The method often generates points near decision boundaries, improving the model’s ability to distinguish between classes.

ROSE is particularly effective when the minority class is highly underrepresented and the dataset contains complex or overlapping decision boundaries, situations in which simpler methods such as random oversampling or SMOTE often fail to capture the minority class’s diversity. The trade-off is its higher computational complexity.

Example of Applying ROSE in R

The ROSE package in R provides an easy-to-use implementation of this method. Here’s how it can be applied:

Consider a binary classification problem with an imbalanced dataset. The code below generates a balanced sample using the ROSE() function from the ROSE package.

library(ROSE)
## Loaded ROSE 0.0-4
# Simulate imbalanced data
set.seed(123)
data <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  class = c(rep(0, 90), rep(1, 10))
)

# Check class distribution
table(data$class)
## 
##  0  1 
## 90 10
# Apply ROSE to generate balanced data
rose_data <- ROSE(class ~ ., data = data, seed = 1)$data

# Check the new class distribution
table(rose_data$class)
## 
##  0  1 
## 52 48

You can visualize how ROSE generates synthetic points using scatterplots:

library(ggplot2)

# Original data
ggplot(data, aes(x = x1, y = x2, color = factor(class))) +
  geom_point() +
  ggtitle("Original Data")

# ROSE data
ggplot(rose_data, aes(x = x1, y = x2, color = factor(class))) +
  geom_point() +
  ggtitle("Data After ROSE")

The ROSE package also provides functionality to evaluate the effectiveness of resampling using models. For example:

# Train a logistic regression model on the resampled data
model <- glm(class ~ ., family = binomial, data = rose_data)

# Predict and evaluate
pred <- predict(model, newdata = data, type = "response")
roc.curve(data$class, pred)

## Area under the curve (AUC): 0.641

While ROSE is highly effective, it has some limitations:

  1. Potential Noise Introduction: Excessive randomness in synthetic point generation can introduce noise if the data distribution is not well-captured by the kernel.
  2. Scalability: Computationally intensive for very large datasets.

Nevertheless, ROSE is a versatile strategy for handling class imbalance by generating synthetic examples that reflect the original data distribution. It is particularly suitable for datasets with complex relationships between features and classes, providing a robust alternative to traditional resampling methods.

SMOTE

SMOTE is a popular resampling technique used to address class imbalance in datasets by generating synthetic samples for the minority class. Unlike random oversampling, which duplicates existing minority class samples, SMOTE creates new data points by interpolating between existing minority class samples. This method helps reduce overfitting and enhances the minority class representation in the dataset.

SMOTE applies the following steps:

  1. Identify Nearest Neighbors: For each sample in the minority class, SMOTE identifies its nearest neighbors within the same class based on a distance metric (commonly Euclidean distance). This is essentially an application of the kNN algorithm. Note that this requires categorical features to be numerically encoded and numeric features to be normalized to a common scale.

  2. Generate Synthetic Samples: New samples are created by taking a weighted average of a minority class sample and one of its nearest neighbors. This interpolation ensures that synthetic samples lie along the line segments connecting existing samples.

  3. Repeat Until Balanced: This process is repeated until the desired class balance is achieved.
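
The interpolation in step 2 can be expressed in a few lines of base R. This sketch creates one synthetic point from a single minority sample and a randomly chosen member of its k nearest minority neighbors; it is illustrative only, not a complete SMOTE implementation:

```r
set.seed(123)
# A handful of minority-class points in two features
minority <- matrix(rnorm(10 * 2, mean = 3), ncol = 2)

k <- 3
i <- 1                                    # pick one minority sample
d <- sqrt(colSums((t(minority) - minority[i, ])^2))  # distances to all points
d[i] <- Inf                               # exclude the point itself
neighbors <- order(d)[1:k]                # its k nearest minority neighbors

# Interpolate: x_new = x_i + gap * (x_j - x_i), with gap drawn from (0, 1)
j   <- sample(neighbors, 1)
gap <- runif(1)
x_new <- minority[i, ] + gap * (minority[j, ] - minority[i, ])
x_new  # a synthetic sample on the line segment between the two points
```

Repeating this for every minority sample, with a fresh neighbor and gap each time, yields the synthetic points that SMOTE adds to the dataset.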

Advantages of SMOTE

SMOTE has some key advantages over other methods:

  1. Avoids Overfitting: By generating new synthetic samples rather than duplicating existing ones, SMOTE mitigates the risk of overfitting to the minority class.

  2. Enhances Generalization: SMOTE encourages models to learn a broader representation of the minority class, especially near class boundaries.

  3. Improves Decision Boundaries: By creating samples near the edges of the minority class, SMOTE helps the model distinguish better between the majority and minority classes.

SMOTE is particularly effective in cases where:

  1. The dataset has a significant class imbalance.
  2. The minority class is underrepresented, making it difficult for the model to learn its patterns.
  3. The minority class distribution is well-defined and not noisy.

Implementation of SMOTE in R

Several R packages implement SMOTE, including performanceEstimation and smotefamily (the older DMwR package has been archived on CRAN). Here’s a step-by-step explanation with examples using a simulated imbalanced binary classification dataset:

set.seed(123)
data <- data.frame(
  x1 = c(rnorm(90, mean = 0), rnorm(10, mean = 3)),
  x2 = c(rnorm(90, mean = 0), rnorm(10, mean = 3)),
  class = c(rep(0, 90), rep(1, 10))
)

# Check class distribution
table(data$class)
## 
##  0  1 
## 90 10

The dataset is imbalanced, with only 10 samples in the minority class. Now we can use the smote() function from the performanceEstimation package to generate synthetic samples:

library(performanceEstimation)     # For SMOTE

set.seed(123)

# Simulate an imbalanced dataset
data <- data.frame(
  x1 = c(rnorm(90, mean = 0), rnorm(10, mean = 3)),
  x2 = c(rnorm(90, mean = 0), rnorm(10, mean = 3)),
  class = as.factor(c(rep(0, 90), rep(1, 10)))  # 90 majority (0), 10 minority (1)
)

# Check class distribution
table(data$class)
## 
##  0  1 
## 90 10
# Apply SMOTE
smote_data <- performanceEstimation::smote(
  form = class ~ .,       # Formula specifying the target and predictors
  data = data,            # Original dataset
  perc.over = 200,        # Number of synthetic cases generated per minority case
  perc.under = 150        # Number of majority cases sampled per synthetic case
)

# Check the new class distribution
table(smote_data$class)
## 
##      0      1 
## 300000   2010

Note the following outcomes, which reveal how smote() interprets its parameters:

  • perc.over = 200: In performanceEstimation::smote(), perc.over is the number of new synthetic cases generated for each minority case, not a percentage. Here 10 × 200 = 2,000 synthetic samples are added, giving 2,010 minority cases.
  • perc.under = 150: For each synthetic minority case, perc.under majority cases are sampled (with replacement), yielding 2,000 × 150 = 300,000 majority cases.
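
In performanceEstimation::smote(), perc.over and perc.under act as per-case multipliers rather than percentages, which explains the very large counts above. Assuming that interpretation, a near-balanced sample can be obtained with small values (reusing the simulated dataset from above):

```r
library(performanceEstimation)

set.seed(123)
data <- data.frame(
  x1 = c(rnorm(90, mean = 0), rnorm(10, mean = 3)),
  x2 = c(rnorm(90, mean = 0), rnorm(10, mean = 3)),
  class = as.factor(c(rep(0, 90), rep(1, 10)))
)

# 2 synthetic cases per minority case (10 + 20 = 30 minority cases) and
# 2 majority cases sampled per synthetic case (20 * 2 = 40 majority cases)
balanced <- smote(class ~ ., data, perc.over = 2, perc.under = 2)
table(balanced$class)  # approximately 40 vs 30
```

As always with resampling, apply this only to the training partition so the test set keeps the original class proportions.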

There are other packages for R that also implement SMOTE, such as the smotefamily package explained with an example later in this section.

We can visualize the effects of SMOTE by comparing the class distribution of the original dataset against that of the oversampled dataset:

library(ggplot2)

# Original data
ggplot(data, aes(x = x1, y = x2, color = factor(class))) +
  geom_point() +
  ggtitle("Original Dataset")

# SMOTE data
ggplot(smote_data, aes(x = x1, y = x2, color = factor(class))) +
  geom_point() +
  ggtitle("Dataset After SMOTE")

The smote() function has two important parameters:

  1. perc.over: The number of synthetic samples to generate for each minority class case.
  2. perc.under: The number of majority class cases to sample for each synthetic minority case.

The smotefamily package provides additional flexibility, including options for multi-class datasets:

library(smotefamily)

set.seed(123)

# Simulate an imbalanced dataset
data <- data.frame(
  x1 = c(rnorm(90, mean = 0), rnorm(10, mean = 3)),
  x2 = c(rnorm(90, mean = 0), rnorm(10, mean = 3)),
  class = as.factor(c(rep(0, 90), rep(1, 10)))  # 90 majority (0), 10 minority (1)
)

# Apply SMOTE
smote_result <- smotefamily::SMOTE(
  X = data[, c("x1", "x2")],  # Features
  target = data$class,        # Target variable
  K = 5,                      # Number of nearest neighbors
  dup_size = 2                # Number of synthetic samples per minority class instance
)

# Extract the resampled data as a dataframe
smote_data <- data.frame(x1 = smote_result$data$x1,
                         x2 = smote_result$data$x2,
                         class = as.numeric(smote_result$data$class))



# Check the new class distribution
table(smote_data$class)
## 
##  0  1 
## 90 30

In the above example, there are two key parameters for the SMOTE() function:

  1. K: Number of nearest neighbors to consider.
  2. dup_size: Number of synthetic samples to generate per minority class instance.

To assess the effectiveness of SMOTE, train a model before and after applying it and compare the model’s performance. The comparison below evaluates on the training data for simplicity; in practice, compare on a held-out test set. Because the simulated classes here are well separated, both models achieve a perfect AUC:

library(ROCR)

# Train a logistic regression model on the original data
model_original <- glm(class ~ ., data = data, family = binomial)
pred_original <- predict(model_original, type = "response")
roc_original <- ROCR::prediction(pred_original, data$class)
auc_original <- ROCR::performance(roc_original, "auc")@y.values

# Train a logistic regression model on SMOTE data
model_smote <- glm(class ~ ., data = smote_data, family = binomial)
pred_smote <- predict(model_smote, newdata = smote_data, type = "response")
roc_smote <- ROCR::prediction(pred_smote, smote_data$class)
auc_smote <- ROCR::performance(roc_smote, "auc")@y.values

# Compare AUC
print(paste("AUC Before SMOTE:", auc_original))
## [1] "AUC Before SMOTE: 1"
print(paste("AUC After SMOTE:", auc_smote))
## [1] "AUC After SMOTE: 1"

Limitations of SMOTE

There are some important limitations to SMOTE that call for judicious use:

  1. Synthetic Data Quality:
    • If the minority class has noisy or overlapping samples, SMOTE may amplify this noise.
  2. Scalability:
    • Computational cost increases with the number of features and samples.
  3. Boundary Overlap:
    • SMOTE may generate synthetic samples that overlap with the majority class, leading to less discriminative decision boundaries.

Despite its potential limitations, SMOTE is a powerful resampling technique for addressing class imbalance, particularly in datasets where the minority class has well-defined and representative samples. By generating synthetic examples, it improves model performance on the minority class, resulting in better generalization. However, careful parameter tuning and evaluation are essential to avoid introducing noise or overcomplicating class boundaries.

ROSE vs SMOTE

SMOTE (Synthetic Minority Over-sampling Technique) and ROSE (Random Over-Sampling Examples) are two key resampling techniques designed to address class imbalance in datasets by generating synthetic samples. However, they differ in methodology and suitability for specific scenarios. The table below summarizes those differences:

| Feature | SMOTE | ROSE |
|---------|-------|------|
| Methodology | Interpolates between existing minority class samples using nearest neighbors. | Uses kernel density estimation to generate synthetic points based on the entire data distribution. |
| Synthetic Samples | Generated along line segments between existing minority samples. | Generated randomly across the feature space, with smoothing to approximate the data distribution. |
| Focus | Enhances decision boundaries by creating samples near existing data points. | Provides a balanced and smoothed representation of both classes, not limited to the minority class. |
| Noise Handling | Can amplify noise or overlap if the minority class has noisy samples. | Less prone to overfitting but can introduce unrelated points if the data distribution is poorly estimated. |
| Scalability | Computationally heavier for large datasets due to nearest-neighbor calculations. | Relatively lighter but depends on kernel density estimation complexity. |
| Applicability | Works well with structured, clean datasets and when the minority class is not too sparse. | Effective for datasets with noisy or overlapping class boundaries. |

To summarize, we generally want to use SMOTE for structured datasets requiring precise synthetic data, and prefer ROSE for datasets needing broader smoothing or when noise in the minority class is a concern. So, in short, use:

  1. SMOTE:
    • when you have a clean dataset with a well-defined minority class.
    • when the goal is to create synthetic samples that closely resemble the original minority class.
    • for tasks requiring stronger decision boundaries, such as fraud detection or medical diagnosis.
  2. ROSE:
    • when the dataset is noisy or has overlapping class boundaries.
    • when you want a broader smoothing of the feature space, especially if the minority class distribution is sparse.
    • for exploratory analysis or models sensitive to randomness in synthetic samples.

Algorithm Level Modifications

Algorithm-level modifications to address class imbalance involve adjusting the learning algorithm itself to account for the imbalance without altering the dataset. This is achieved by making the model sensitive to the importance of each class, typically by introducing class weights, cost-sensitive learning, or custom loss functions. These methods ensure that the minority class has a proportionally larger influence during training.

Example I: Using Class Weights in Logistic Regression

Many machine learning algorithms allow specifying class weights, which assign higher importance to the minority class during the training process. The code example below demonstrates this with a logistic regression model and assigning class weights.

# Simulate imbalanced data
set.seed(123)
data <- data.frame(
  x1 = c(rnorm(90, mean = 0), rnorm(10, mean = 3)),
  x2 = c(rnorm(90, mean = 0), rnorm(10, mean = 3)),
  class = c(rep(0, 90), rep(1, 10))
)

# Check class distribution
table(data$class)
## 
##  0  1 
## 90 10
# Assign weights: higher for minority class
weights <- ifelse(data$class == 1, 9, 1)  # Ratio 90:10, so minority gets 9x weight

# Train logistic regression with weights
model <- glm(class ~ x1 + x2, family = binomial, data = data, weights = weights)

# Model summary
summary(model)
## 
## Call:
## glm(formula = class ~ x1 + x2, family = binomial, data = data, 
##     weights = weights)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -119.91   46002.47  -0.003    0.998
## x1             42.90   17740.59   0.002    0.998
## x2             31.54   14212.17   0.002    0.998
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2.4953e+02  on 99  degrees of freedom
## Residual deviance: 1.1412e-08  on 97  degrees of freedom
## AIC: 6
## 
## Number of Fisher Scoring iterations: 25
# Predict probabilities
predicted <- predict(model, type = "response")

# Evaluate performance
library(ROCR)
roc_curve <- prediction(predicted, data$class)
auc <- performance(roc_curve, "auc")@y.values[[1]]
print(paste("AUC with weights:", auc))
## [1] "AUC with weights: 1"

Note that in the code above, the weights parameter adjusts the influence of each observation: minority class instances are assigned a weight of 9, amplifying their impact on the model. (The extremely large coefficients and standard errors in the summary indicate that this simulated data is completely separable; real datasets rarely behave this way.) This approach works well when the dataset is highly imbalanced but representative of real-world proportions. The choice of weights is empirical and requires experimentation.


Example II: Cost-Sensitive Decision Trees

Cost-sensitive learning explicitly penalizes misclassifications of the minority class more heavily than those of the majority class. Decision tree algorithms, like rpart, allow specifying cost matrices to achieve this. The code below illustrates this for a decision tree.

library(rpart)

# Define cost matrix: rows are the true class, columns the predicted class.
# Misclassifying a true minority (1) case as majority (0) costs 10; the
# reverse error costs 1.
cost_matrix <- matrix(c(0, 1, 10, 0), nrow = 2, byrow = TRUE)

# Train a cost-sensitive decision tree
model <- rpart(
  class ~ x1 + x2,
  data = data,
  method = "class",
  parms = list(loss = cost_matrix)
)

# Print the tree structure
print(model)
## n= 100 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 100 10 0 (0.90000000 0.10000000)  
##   2) x1< 2.270525 91  1 0 (0.98901099 0.01098901) *
##   3) x1>=2.270525 9  0 1 (0.00000000 1.00000000) *
# Predict and evaluate
predicted <- predict(model, type = "class")
confusion_matrix <- table(data$class, predicted)
print("Confusion Matrix:")
## [1] "Confusion Matrix:"
print(confusion_matrix)
##    predicted
##      0  1
##   0 90  0
##   1  1  9

Once again, the costs are chosen empirically and tuned through trial and error. In rpart’s loss matrix, rows correspond to the true class and columns to the predicted class; misclassification of the minority class (cost 10) is penalized more heavily than misclassification of the majority class (cost 1). This modification directly influences how the decision tree splits the data, prioritizing minority class accuracy.

Example III: Custom Loss Function in Neural Networks

In deep learning, custom loss functions can be used to address class imbalance by assigning different penalties to errors based on class. The R example below uses a weighted loss with the keras package. While not the only package for deep learning in R, keras provides an interface to the Keras deep learning library, which is built on top of TensorFlow and is designed for creating and training deep learning models in a more accessible form. Keras and TensorFlow are mostly used for tasks such as image recognition, natural language processing, and time-series forecasting. TensorFlow operates on a computation graph model, where operations are represented as nodes and data flows between them along edges; the framework is optimized to handle large-scale numerical computations using this structure, particularly for neural networks. Training deep neural networks with many hidden layers is computationally very expensive, so the code below may require significant time to run on typical systems.

library(keras)

# Prepare imbalanced data
x <- as.matrix(data[, c("x1", "x2")])
y <- as.numeric(data$class)

# Define a weighted binary cross-entropy loss
weighted_loss <- function(y_true, y_pred) {
  # y_true is a tensor, so use tensor arithmetic rather than ifelse()
  weights <- y_true * 8 + 1  # weight 9 for the minority class (1), 1 otherwise
  k_mean(weights * k_binary_crossentropy(y_true, y_pred), axis = -1)
}

# Build a simple neural network
model <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = ncol(x)) %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = "adam",
  loss = weighted_loss,
  metrics = c("accuracy")
)

# Train the model
model %>% fit(
  x,
  y,
  epochs = 50,
  batch_size = 10,
  verbose = 1
)

Here, the custom loss function applies higher penalties to errors on minority class samples. This approach is particularly effective for neural networks or other algorithms where class weights may not be directly supported. However, neural networks are computationally expensive and time-consuming to train.

Algorithm-level modifications are best used when class imbalance is moderate and the algorithm supports weight adjustments (e.g., logistic regression, SVMs, kNN), or when misclassification costs vary significantly between classes, especially in decision trees or ensemble models. In deep learning or other models that require fine-grained control over optimization, custom loss functions are an option.

These techniques are effective when you want to keep the dataset intact (without resampling) and rely on the algorithm to balance the learning process. In practice, both methods, oversampling and algorithm modification, can be used together.

Summary

This lesson explored methods for addressing class imbalance in supervised machine learning, focusing on data-level resampling techniques and algorithm-level modifications. Resampling methods modify the dataset’s class distribution to balance the representation of the majority and minority classes. Two key techniques discussed were SMOTE (Synthetic Minority Over-sampling Technique) and ROSE (Random Over-Sampling Examples). SMOTE generates synthetic samples for the minority class by interpolating between existing instances and their nearest neighbors, making it particularly suitable for structured datasets with a well-defined minority class. In contrast, ROSE employs kernel density estimation to generate synthetic samples across the feature space, making it effective for handling noisy or overlapping class boundaries. While SMOTE is ideal for enhancing decision boundaries, ROSE offers versatility for noisy datasets.

Algorithm-level modifications address class imbalance by directly influencing the learning process. These include the use of class weights, cost-sensitive learning, and custom loss functions. Class weights, as shown in a weighted logistic regression example, assign higher importance to minority class samples, ensuring they have a greater impact during model training. Cost-sensitive learning, such as with cost-sensitive decision trees, penalizes misclassifications of the minority class more heavily, effectively guiding the algorithm to prioritize these cases. Custom loss functions, often applied in deep learning, allow precise control over the training process by introducing penalties tailored to the dataset’s needs, such as weighted binary cross-entropy. These algorithmic approaches are particularly useful when the dataset itself should remain unaltered, and the imbalance can be addressed through adjustments to the learning framework.

In summary, resampling methods like SMOTE and ROSE are best used when modifying the dataset is feasible, with SMOTE being preferred for well-defined minority classes and ROSE for noisy or sparse datasets. Algorithm-level modifications, including class weights, cost matrices, and custom loss functions, are preferable when the learning process requires more nuanced adjustments. Together, these strategies provide a robust toolkit for handling class imbalance, ensuring better performance and fair representation of minority classes in machine learning models.


Files & Resources

All Files for Lesson 3.224

References

None yet.

Errata

Let us know.


  1. In prior versions of R, the package unbalanced provided support for undersampling, but the package is no longer available.↩︎

---
title: "Managing Class Imbalance"
params:
  category: 3
  stacks: 0
  number: 224
  time: 60
  level: intermediate
  tags: class imbalance,smote,oversampling
  description: "Explains common methods for reducing class imbalance."
date: "<small>`r Sys.Date()`</small>"
author: "<small>Martin Schedlbauer</small>"
email: "m.schedlbauer@neu.edu"
affiliation: "Northeastern University"
output: 
  bookdown::html_document2:
    toc: true
    toc_float: true
    collapsed: false
    number_sections: false
    code_download: true
    theme: lumen
    highlight: tango
---

---
title: "<small>`r params$category`.`r params$number`</small><br/><span style='color: #2E4053; font-size: 0.9em'>`r rmarkdown::metadata$title`</span>"
---

```{r code=xfun::read_utf8(paste0(here::here(),'/R/_insert2DB.R')), include = FALSE}
```

## Overview

Class imbalance in supervised machine learning refers to the scenario where the distribution of classes in a dataset is significantly skewed, meaning one class (or a few classes) has far more examples than others. This imbalance poses a challenge because most machine learning algorithms are designed to optimize overall accuracy, often leading to poor performance on, and biased predictions for, the minority class.

For example, consider a binary classification problem for detecting fraudulent transactions. If 98% of transactions are legitimate and only 2% are fraudulent, a naive model might predict all transactions as legitimate to achieve 98% accuracy. However, this approach fails to identify fraudulent cases, which are critical in this context.

The issue arises because algorithms like decision trees, neural networks, and support vector machines may focus on the majority class, neglecting the minority class. Metrics such as accuracy become misleading, as high accuracy can be achieved by ignoring minority cases altogether. Alternative evaluation metrics, such as per-class counts from the confusion matrix (true positives and true negatives) or the F1-score, can reveal poor performance on the minority class.
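To make this concrete, the short sketch below uses simulated labels matching the fraud example above to show how a naive majority-class predictor attains high accuracy with zero recall:

```{r}
# Simulated labels for the fraud example: 98 legitimate (0), 2 fraudulent (1)
actual    <- c(rep(0, 98), rep(1, 2))
predicted <- rep(0, 100)   # naive model: always predict "legitimate"

accuracy <- mean(predicted == actual)
recall   <- sum(predicted == 1 & actual == 1) / sum(actual == 1)
c(accuracy = accuracy, recall = recall)   # 0.98 accuracy, 0 recall
```

The model never identifies a single fraudulent case, yet accuracy alone would suggest it performs well.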

Several strategies have been developed to ameliorate class imbalance:

1.  **Data-Level Approaches**: These include resampling methods such as oversampling the minority class (*e.g.*, *SMOTE*) or undersampling the majority class to balance the dataset.

2.  **Algorithm-Level Modifications**: Some algorithms can be adjusted to handle imbalances by incorporating class weights or cost-sensitive learning.

3.  **Evaluation Metrics**: Using metrics like precision, recall, F1-score, and area under the ROC curve (AUC-ROC) provides a clearer picture of a model's performance on imbalanced datasets.

Understanding and addressing class imbalance is critical for building robust and meaningful machine learning models, particularly in applications like fraud detection, medical diagnosis, and anomaly detection, where the minority class often represents the cases of greatest interest yet generally has the least representation in the data.

## Data-Level Approaches

Data-level methods to manage class imbalance generally focus on resampling. Resampling methods are techniques used to address class imbalance in datasets by modifying the distribution of the minority and majority classes. These methods operate at the data level, aiming to create a balanced dataset that allows machine learning models to learn effectively from both classes. Common resampling methods include oversampling, undersampling, and hybrid approaches.

### Oversampling

Oversampling increases the size of the minority class by replicating existing samples or generating synthetic samples. There are several common methods for oversampling minority classes:

1.  **Random Oversampling**: Randomly duplicates instances of the minority class until the class distribution is balanced.

2.  **Synthetic Minority Over-sampling Technique (SMOTE)**: Generates new synthetic instances for the minority class by interpolating between existing samples and their nearest neighbors.

3.  **Random Over-Sampling Examples (ROSE)**: A resampling method designed to address class imbalance by generating synthetic samples for the minority class. Unlike basic oversampling, which duplicates existing samples, or SMOTE, which interpolates between existing data points, ROSE uses a kernel density estimation approach to generate synthetic data points. This method creates a smoother, more realistic representation of the minority class distribution.

Oversampling is generally used when the dataset is small, as it avoids losing information from the majority class. *SMOTE* is a particularly useful technique when the minority class has sufficient variability to synthesize meaningful new samples.
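As a simple illustration, random oversampling can be sketched in base R without any package by sampling the minority class with replacement (the small simulated dataset below is hypothetical):

```{r}
set.seed(123)
# Simulate an imbalanced dataset: 90 majority (0), 10 minority (1)
data <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  class = c(rep(0, 90), rep(1, 10))
)

majority <- subset(data, class == 0)
minority <- subset(data, class == 1)

# Sample minority rows with replacement until they match the majority size
oversampled_minority <- minority[sample(nrow(minority), nrow(majority),
                                        replace = TRUE), ]
balanced_data <- rbind(majority, oversampled_minority)
table(balanced_data$class)   # 90 of each class
```

Because the same minority rows are repeated many times, this approach risks overfitting, which motivates synthetic methods like SMOTE.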

### Undersampling

Undersampling reduces the size of the majority class by randomly removing samples to balance the class distribution. Similar to oversampling, there are various strategies:

1.  **Random Undersampling**: Randomly selects a subset of majority class samples equal in size to the minority class.
2.  **Tomek Links**: Removes majority class instances that are closest to minority class instances to clean class boundaries.

Undersampling is often used when the dataset is large and the majority class contains redundant or noisy samples. It works best when there is little risk of losing valuable information about the majority class.

The code example below illustrates random undersampling[^1]. Undersampling can be implemented in R without relying on a package by manually selecting a random subset of the majority class to match the size of the minority class, ensuring a balanced dataset.

[^1]: In prior versions of R, the package **unbalanced** provided support for undersampling, but the package is no longer available.

1.  **Check Class Distribution**: Determine the number of samples in the minority and majority classes.
2.  **Subset the Majority Class**: Randomly sample rows from the majority class to match the size of the minority class.
3.  **Combine the Subsets**: Merge the reduced majority class with the minority class to create a balanced dataset.

Let's first create an imbalanced dataset for testing:

```{r}
set.seed(123)
# Simulate a binary classification dataset
data <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  class = c(rep(0, 90), rep(1, 10))  # Imbalanced with 90 majority and 10 minority
)

# Check class distribution
table(data$class)
```

Now we can separate the classes:

```{r}
# Separate majority and minority classes
majority <- subset(data, class == 0)
minority <- subset(data, class == 1)
```

Next, we extract a random sample from the majority class:

```{r}
# Randomly sample the majority class to match the size of the minority class
undersampled_majority <- majority[sample(nrow(majority), nrow(minority)), ]
```

Finally, we can combine the subsets:

```{r}
# Combine the undersampled majority class with the minority class
balanced_data <- rbind(undersampled_majority, minority)

# Check the new class distribution
table(balanced_data$class)
```

In the above code,

1.  The `sample()` function is used to randomly select rows from the majority class.
2.  The size of the sample (`nrow(minority)`) ensures that the majority class is reduced to the size of the minority class.
3.  The `rbind()` function combines the reduced majority class with the minority class into a balanced dataset.

To visualize the effect of undersampling, you can plot the original and balanced datasets:

```{r}
library(ggplot2)

# Plot original data
ggplot(data, aes(x = x1, y = x2, color = factor(class))) +
  geom_point() +
  ggtitle("Original Imbalanced Dataset")

# Plot balanced data
ggplot(balanced_data, aes(x = x1, y = x2, color = factor(class))) +
  geom_point() +
  ggtitle("Dataset After Undersampling")
```

Remember that undersampling carries two potential drawbacks:

1.  **Loss of Information**: Undersampling reduces the size of the dataset, potentially discarding useful information from the majority class.
2.  **Risk of Overfitting**: With a small dataset, the model might overfit to the reduced data.
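Earlier we mentioned Tomek links as a second undersampling strategy. A minimal base-R sketch of the idea follows (illustrative only, using a small simulated dataset with overlapping classes; it removes the majority-class member of each pair of mutual nearest neighbors from opposite classes):

```{r}
set.seed(123)
# Simulate overlapping classes: 90 majority (0), 10 minority (1)
data <- data.frame(
  x1 = c(rnorm(90, mean = 0), rnorm(10, mean = 1)),
  x2 = c(rnorm(90, mean = 0), rnorm(10, mean = 1)),
  class = c(rep(0, 90), rep(1, 10))
)

# Pairwise Euclidean distances; each point's nearest neighbor minimizes distance
d <- as.matrix(dist(data[, c("x1", "x2")]))
diag(d) <- Inf
nn <- apply(d, 1, which.min)

# A Tomek link is a pair of mutual nearest neighbors from opposite classes
is_link <- sapply(seq_len(nrow(data)), function(i) {
  nn[nn[i]] == i && data$class[i] != data$class[nn[i]]
})

# Remove only the majority-class member of each link to clean the boundary
drop <- which(is_link & data$class == 0)
cleaned <- if (length(drop) > 0) data[-drop, ] else data
table(cleaned$class)
```

Note that this sketch never removes minority samples; it only trims majority points that sit directly against the minority class.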

### Hybrid Methods

Hybrid methods combine oversampling and undersampling to achieve a balanced dataset with minimal redundancy and improved synthetic sample quality. They generally start by oversampling the minority class to a predefined level and then undersampling the majority class to reduce redundancy. Methods like SMOTE-Tomek Links or SMOTE-ENN combine synthetic oversampling with cleaning techniques.

Hybrid methods work best when the dataset contains noisy or overlapping class boundaries. They are most effective for highly imbalanced datasets where a simple oversampling or undersampling approach might fail.
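A basic hybrid approach can be sketched in base R by oversampling the minority class and undersampling the majority class toward a common intermediate size. The target size of 50 below is an arbitrary illustrative choice; true SMOTE-Tomek or SMOTE-ENN hybrids additionally require the synthetic-generation and cleaning steps described above.

```{r}
set.seed(123)
# Simulate an imbalanced dataset: 90 majority (0), 10 minority (1)
data <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  class = c(rep(0, 90), rep(1, 10))
)

target_size <- 50  # intermediate class size (arbitrary, for illustration)
minority <- subset(data, class == 1)
majority <- subset(data, class == 0)

hybrid_data <- rbind(
  minority[sample(nrow(minority), target_size, replace = TRUE), ],  # oversample
  majority[sample(nrow(majority), target_size), ]                   # undersample
)
table(hybrid_data$class)   # 50 of each class
```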

------------------------------------------------------------------------

### Summary of Use Cases

1.  **Random Oversampling**: Use for small datasets or when synthetic generation might introduce noise.
2.  **SMOTE**: Use when synthetic sample quality is essential and the minority class is well-defined.
3.  **Random Undersampling**: Use for large datasets with redundant majority samples.
4.  **Hybrid Methods**: Use for noisy or complex class boundaries.

By applying these techniques judiciously, practitioners can mitigate the adverse effects of class imbalance and improve model performance on both the majority and minority classes.

## Synthetic Oversampling Methods

Let's take a closer look at the two most common synthetic oversampling methods: *ROSE* and *SMOTE*.

### ROSE

**ROSE (Random Over-Sampling Examples)** is a resampling method designed to address class imbalance by generating synthetic samples for the minority class. Unlike basic oversampling, which duplicates existing samples, or SMOTE, which interpolates between existing data points, ROSE uses a kernel density estimation approach to generate synthetic data points. This method creates a smoother, more realistic representation of the minority class distribution.

ROSE generates synthetic examples for both the minority and majority classes by sampling from a smoothed approximation of the original data distribution. This is achieved through the following steps:

1.  **Kernel Density Estimation (KDE)**: A kernel density function is applied to the data to estimate the probability density function of the feature space. This density is used to randomly sample new data points, ensuring that the synthetic points reflect the underlying data distribution.

2.  **Balanced Sampling**: Synthetic samples are added to the minority class to balance the dataset. Optionally, some points from the majority class can also be synthetically generated or removed to ensure better class boundaries.

3.  **Noise Handling**: ROSE incorporates a level of randomness, reducing the risk of creating exact duplicates or overly simplistic synthetic points, which helps improve generalization.

#### Advantages of ROSE

ROSE has some key advantages over other methods:

1.  **Preserves Data Characteristics**: Synthetic samples closely resemble the real distribution, avoiding artifacts introduced by simpler methods like random duplication.
2.  **Reduces Overfitting**: By generating new points rather than duplicating existing ones, ROSE mitigates the risk of overfitting to the minority class.
3.  **Improves Class Boundaries**: The method often generates points near decision boundaries, improving the model's ability to distinguish between classes.

ROSE is particularly effective when the minority class is highly underrepresented and the dataset contains complex decision boundaries or overlapping classes, although it is computationally more expensive than simpler approaches. It is a good choice when methods like random oversampling or SMOTE are insufficient to capture the minority class's diversity.

#### Example of Applying ROSE in R

The `ROSE` package in R provides an easy-to-use implementation of this method. Here's how it can be applied:

Consider a binary classification problem with an imbalanced dataset. The code below generates a synthetic, balanced dataset using functions from the **ROSE** package.

```{r}
library(ROSE)

# Simulate imbalanced data
set.seed(123)
data <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  class = c(rep(0, 90), rep(1, 10))
)

# Check class distribution
table(data$class)

# Apply ROSE to generate balanced data
rose_data <- ROSE(class ~ ., data = data, seed = 1)$data

# Check the new class distribution
table(rose_data$class)
```

You can visualize how ROSE generates synthetic points using scatterplots:

```{r}
library(ggplot2)

# Original data
ggplot(data, aes(x = x1, y = x2, color = factor(class))) +
  geom_point() +
  ggtitle("Original Data")

# ROSE data
ggplot(rose_data, aes(x = x1, y = x2, color = factor(class))) +
  geom_point() +
  ggtitle("Data After ROSE")
```

The **ROSE** package also provides functionality to evaluate the effectiveness of resampling using models. For example:

```{r}
# Train a logistic regression model on the resampled data
model <- glm(class ~ ., family = binomial, data = rose_data)

# Predict and evaluate
pred <- predict(model, newdata = data, type = "response")
roc.curve(data$class, pred)
```

While ROSE is highly effective, it has some limitations:

1.  **Potential Noise Introduction**: Excessive randomness in synthetic point generation can introduce noise if the data distribution is not well-captured by the kernel.
2.  **Scalability**: Computationally intensive for very large datasets.

Nevertheless, ROSE is a versatile strategy for handling class imbalance by generating synthetic examples that reflect the original data distribution. It is particularly suitable for datasets with complex relationships between features and classes, providing a robust alternative to traditional resampling methods.

### SMOTE

SMOTE is a popular resampling technique used to address class imbalance in datasets by generating synthetic samples for the minority class. Unlike random oversampling, which duplicates existing minority class samples, SMOTE creates new data points by interpolating between existing minority class samples. This method helps reduce overfitting and enhances the minority class representation in the dataset.

SMOTE applies the following steps:

1.  **Identify Nearest Neighbors**: For each sample in the minority class, SMOTE identifies its nearest neighbors within the same class based on a distance metric (commonly Euclidean distance). This is essentially an application of the *kNN* algorithm. Note that this requires categorical features to be numerically encoded and numeric features to be normalized to a common scale.

2.  **Generate Synthetic Samples**: New samples are created by taking a weighted average of a minority class sample and one of its nearest neighbors. This interpolation ensures that synthetic samples lie along the line segments connecting existing samples.

3.  **Repeat Until Balanced**: This process is repeated until the desired class balance is achieved.
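The interpolation step can be sketched in base R as follows. This is an illustrative simplification (a simulated minority class, Euclidean distance, and k = 3 are assumed), not the full SMOTE algorithm provided by packages such as **smotefamily**:

```{r}
set.seed(123)
# Simulated minority-class samples as a numeric matrix
minority <- cbind(x1 = rnorm(10, mean = 3), x2 = rnorm(10, mean = 3))
k <- 3

d <- as.matrix(dist(minority))  # pairwise Euclidean distances
diag(d) <- Inf                  # exclude each point as its own neighbor

# For each sample: pick one of its k nearest neighbors and interpolate
synthetic <- t(sapply(seq_len(nrow(minority)), function(i) {
  neighbors <- order(d[i, ])[1:k]
  j <- sample(neighbors, 1)
  gap <- runif(1)  # interpolation weight in [0, 1]
  minority[i, ] + gap * (minority[j, ] - minority[i, ])
}))
head(synthetic)  # synthetic points lie on segments between real samples
```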

#### Advantages of SMOTE

SMOTE has some key advantages over other methods:

1.  **Avoids Overfitting**: By generating new synthetic samples rather than duplicating existing ones, SMOTE mitigates the risk of overfitting to the minority class.

2.  **Enhances Generalization**: SMOTE encourages models to learn a broader representation of the minority class, especially near class boundaries.

3.  **Improves Decision Boundaries**: By creating samples near the edges of the minority class, SMOTE helps the model distinguish better between the majority and minority classes.

SMOTE is particularly effective in cases where:

1.  The dataset has a significant class imbalance.
2.  The minority class is underrepresented, making it difficult for the model to learn its patterns.
3.  The minority class distribution is well-defined and not noisy.

#### Implementation of SMOTE in R

The **performanceEstimation** and **smotefamily** packages in R provide implementations of SMOTE. Here's a step-by-step explanation with examples using a simulated imbalanced binary classification dataset:

```{r}
set.seed(123)
data <- data.frame(
  x1 = c(rnorm(90, mean = 0), rnorm(10, mean = 3)),
  x2 = c(rnorm(90, mean = 0), rnorm(10, mean = 3)),
  class = c(rep(0, 90), rep(1, 10))
)

# Check class distribution
table(data$class)
```

The dataset is imbalanced, with only 10 samples in the minority class. Now we can use the `smote()` function from the **performanceEstimation** package to generate synthetic samples:

```{r}
library(performanceEstimation)     # For SMOTE

set.seed(123)

# Simulate an imbalanced dataset
data <- data.frame(
  x1 = c(rnorm(90, mean = 0), rnorm(10, mean = 3)),
  x2 = c(rnorm(90, mean = 0), rnorm(10, mean = 3)),
  class = as.factor(c(rep(0, 90), rep(1, 10)))  # 90 majority (0), 10 minority (1)
)

# Check class distribution
table(data$class)

# Apply SMOTE
smote_data <- performanceEstimation::smote(
  form = class ~ .,       # Formula specifying the target and predictors
  data = data,            # Original dataset
  perc.over = 200,        # Percentage of new synthetic samples for the minority class
  perc.under = 150        # Percentage of majority samples to retain
)

# Check the new class distribution
table(smote_data$class)

```

Note the following outcomes:

-   `perc.over = 200`: Increases the minority class by 200% (adds twice as many synthetic samples).
-   `perc.under = 150`: Adjusts the majority class to balance the dataset.

There are other packages for R that also implement SMOTE, such as the **smotefamily** package explained with an example later in this section.

We can visualize the effect of SMOTE by comparing the class distribution of the original dataset against that of the oversampled dataset:

```{r}
library(ggplot2)

# Original data
ggplot(data, aes(x = x1, y = x2, color = factor(class))) +
  geom_point() +
  ggtitle("Original Dataset")

# SMOTE data
ggplot(smote_data, aes(x = x1, y = x2, color = factor(class))) +
  geom_point() +
  ggtitle("Dataset After SMOTE")
```

The `smote()` function has two important parameters:

1.  `perc.over`: Specifies the percentage of new synthetic samples to generate for the minority class.
2.  `perc.under`: Defines the proportion of majority class samples to keep after oversampling.

The `smotefamily` package provides additional flexibility, including options for multi-class datasets:

```{r}
library(smotefamily)

set.seed(123)

# Simulate an imbalanced dataset
data <- data.frame(
  x1 = c(rnorm(90, mean = 0), rnorm(10, mean = 3)),
  x2 = c(rnorm(90, mean = 0), rnorm(10, mean = 3)),
  class = as.factor(c(rep(0, 90), rep(1, 10)))  # 90 majority (0), 10 minority (1)
)

# Apply SMOTE
smote_result <- smotefamily::SMOTE(
  X = data[, c("x1", "x2")],  # Features
  target = data$class,        # Target variable
  K = 5,                      # Number of nearest neighbors
  dup_size = 2                # Number of synthetic samples per minority class instance
)

# Extract the resampled data as a dataframe
smote_data <- data.frame(x1 = smote_result$data$x1,
                         x2 = smote_result$data$x2,
                         class = as.numeric(smote_result$data$class))

# Check the new class distribution
table(smote_data$class)
```

In the above example, there are two key parameters for the `SMOTE()` function:

1.  `K`: Number of nearest neighbors to consider.
2.  `dup_size`: Number of synthetic samples to generate per minority class instance.

To assess the effectiveness of SMOTE, train a model before and after applying it and compare the model's performance:

```{r warning=FALSE}
library(ROCR)

# Train a logistic regression model on the original data
model_original <- glm(class ~ ., data = data, family = binomial)
pred_original <- predict(model_original, type = "response")
roc_original <- ROCR::prediction(pred_original, data$class)
auc_original <- ROCR::performance(roc_original, "auc")@y.values[[1]]

# Train a logistic regression model on SMOTE data
model_smote <- glm(class ~ ., data = smote_data, family = binomial)
pred_smote <- predict(model_smote, newdata = smote_data, type = "response")
roc_smote <- ROCR::prediction(pred_smote, smote_data$class)
auc_smote <- ROCR::performance(roc_smote, "auc")@y.values[[1]]

# Compare AUC
print(paste("AUC Before SMOTE:", auc_original))
print(paste("AUC After SMOTE:", auc_smote))
```

#### Limitations of SMOTE

There are some important limitations to SMOTE that call for judicious use:

1.  **Synthetic Data Quality**:
    -   If the minority class has noisy or overlapping samples, SMOTE may amplify this noise.
2.  **Scalability**:
    -   Computational cost increases with the number of features and samples.
3.  **Boundary Overlap**:
    -   SMOTE may generate synthetic samples that overlap with the majority class, leading to less discriminative decision boundaries.

Despite its potential limitations, SMOTE is a powerful resampling technique for addressing class imbalance, particularly in datasets where the minority class has well-defined and representative samples. By generating synthetic examples, it improves model performance on the minority class, resulting in better generalization. However, careful parameter tuning and evaluation are essential to avoid introducing noise or overcomplicating class boundaries.

### ROSE vs SMOTE

**SMOTE** (Synthetic Minority Over-sampling Technique) and **ROSE** (Random Over-Sampling Examples) are two key resampling techniques designed to address class imbalance in datasets by generating synthetic samples. However, they differ in methodology and suitability for specific scenarios. The table below summarizes those differences:

| Feature | SMOTE | ROSE |
|----|----|----|
| **Methodology** | Interpolates between existing minority class samples using nearest neighbors. | Uses kernel density estimation to generate synthetic points based on the entire data distribution. |
| **Synthetic Samples** | Generated along line segments between existing minority samples. | Generated randomly across the feature space, with smoothing to approximate the data distribution. |
| **Focus** | Enhances decision boundaries by creating samples near existing data points. | Provides a balanced and smoothed representation of both classes, not limited to minority class. |
| **Noise Handling** | Can amplify noise or overlap if the minority class has noisy samples. | Less prone to overfitting but can introduce unrelated points if the data distribution is poorly estimated. |
| **Scalability** | Computationally heavier for large datasets due to nearest-neighbor calculations. | Relatively lighter but depends on kernel density estimation complexity. |
| **Applicability** | Works well with structured, clean datasets and when the minority class is not too sparse. | Effective for datasets with noisy or overlapping class boundaries. |

To summarize, we generally want to use **SMOTE** for structured datasets requiring precise synthetic data, and prefer **ROSE** for datasets needing broader smoothing or when noise in the minority class is a concern. So, in short, use:

1.  **SMOTE**:
    -   when you have a clean dataset with a well-defined minority class.
    -   when the goal is to create synthetic samples that closely resemble the original minority class.
    -   for tasks requiring stronger decision boundaries, such as fraud detection or medical diagnosis.
2.  **ROSE**:
    -   when the dataset is noisy or has overlapping class boundaries.
    -   when you want a broader smoothing of the feature space, especially if the minority class distribution is sparse.
    -   for exploratory analysis or models sensitive to randomness in synthetic samples.

## Algorithm Level Modifications

Algorithm-level modifications to address class imbalance involve adjusting the learning algorithm itself to account for the imbalance without altering the dataset. This is achieved by making the model sensitive to the importance of each class, typically by introducing **class weights**, **cost-sensitive learning**, or **custom loss functions**. These methods ensure that the minority class has a proportionally larger influence during training.

### Example I: Using Class Weights in Logistic Regression

Many machine learning algorithms allow specifying class weights, which assign higher importance to the minority class during training. The code example below demonstrates this by assigning class weights in a logistic regression model.

```{r warning=FALSE}
# Simulate imbalanced data
set.seed(123)
data <- data.frame(
  x1 = c(rnorm(90, mean = 0), rnorm(10, mean = 3)),
  x2 = c(rnorm(90, mean = 0), rnorm(10, mean = 3)),
  class = c(rep(0, 90), rep(1, 10))
)

# Check class distribution
table(data$class)

# Assign weights: higher for minority class
weights <- ifelse(data$class == 1, 9, 1)  # Ratio 90:10, so minority gets 9x weight

# Train logistic regression with weights
model <- glm(class ~ x1 + x2, family = binomial, data = data, weights = weights)

# Model summary
summary(model)

# Predict probabilities
predicted <- predict(model, type = "response")

# Evaluate performance
library(ROCR)
roc_curve <- prediction(predicted, data$class)
auc <- performance(roc_curve, "auc")@y.values[[1]]
print(paste("AUC with weights:", auc))
```

Note that in the code above, the `weights` parameter adjusts the influence of each class. In this case, minority class instances are assigned a weight of 9, amplifying their impact on the model. This approach works well when the dataset is highly imbalanced but representative of real-world proportions. The choice of weights is empirical and requires experimentation.

------------------------------------------------------------------------

### Example II: Cost-Sensitive Decision Trees

Cost-sensitive learning explicitly penalizes misclassifications of the minority class more heavily than those of the majority class. Decision tree algorithms, like **rpart**, allow specifying cost matrices to achieve this. The code below illustrates this for a decision tree.

```{r}
library(rpart)

# Define cost matrix: rows are true classes (0, 1), columns are predicted classes.
# Misclassifying a true minority case (1) as majority (0) costs 10; the reverse costs 1.
cost_matrix <- matrix(c(0, 1, 10, 0), nrow = 2, byrow = TRUE)

# Train a cost-sensitive decision tree
model <- rpart(
  class ~ x1 + x2,
  data = data,
  method = "class",
  parms = list(loss = cost_matrix)
)

# Print the tree structure
print(model)

# Predict and evaluate
predicted <- predict(model, type = "class")
confusion_matrix <- table(data$class, predicted)
print("Confusion Matrix:")
print(confusion_matrix)
```

Once again, the weights are chosen empirically and tuned through trial-and-error. The cost matrix penalizes misclassification of the minority class (`10`) more heavily than the majority class (`1`). This modification directly influences how the decision tree splits data, prioritizing minority class accuracy.

### Example III: Custom Loss Function in Neural Networks

In deep learning, custom loss functions can be used to address class imbalance by assigning different penalties to errors based on class. The R example below uses a weighted loss in *Keras*. While not the only package for deep learning in R, the **keras** package provides an interface to the *Keras* deep learning library, which is built on top of *TensorFlow* and is designed for creating and training deep learning models in a more accessible form. *Keras* and *TensorFlow* are mostly used for tasks such as image recognition, natural language processing, and time-series forecasting. *TensorFlow* operates on a computation graph model, where operations are represented as nodes and data flows between them along edges; the framework is optimized for large-scale numerical computations using this structure, particularly for neural networks. Training deep neural networks with many hidden layers is computationally expensive, so the code below may require significant time to run on typical systems.

```{r echo=T, eval=F}
library(keras)

# Prepare imbalanced data
x <- as.matrix(data[, c("x1", "x2")])
y <- as.numeric(data$class)

# Define a weighted binary cross-entropy loss using tensor operations
# (ifelse() cannot be applied to tensors)
weighted_loss <- function(y_true, y_pred) {
  weights <- y_true * 9 + (1 - y_true)  # higher (9x) weight for the minority class
  keras::k_mean(weights * keras::k_binary_crossentropy(y_true, y_pred), axis = -1)
}

# Build a simple neural network
model <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = ncol(x)) %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = "adam",
  loss = weighted_loss,
  metrics = c("accuracy")
)

# Train the model
model %>% fit(
  x,
  y,
  epochs = 50,
  batch_size = 10,
  verbose = 1
)
```

Here, the custom loss function applies higher penalties to errors on minority class samples. This approach is particularly effective for neural networks or other algorithms where class weights may not be directly supported. However, neural networks are computationally expensive and time-consuming to train.

Algorithm-level modifications are best used when class imbalance is moderate and the algorithm supports weight adjustments (*e.g.*, logistic regression, SVMs, kNN), or when misclassification costs vary significantly between classes, especially in decision trees or ensemble models. In deep learning or other advanced models that require fine-grained control over optimization, custom loss functions are an option.

These techniques are effective when you want to keep the dataset intact (without resampling) and rely on the algorithm to balance the learning process. In practice, both methods, oversampling and algorithm modification, can be used together.

## Summary

This lesson explored methods for addressing class imbalance in supervised machine learning, focusing on data-level resampling techniques and algorithm-level modifications. Resampling methods modify the dataset’s class distribution to balance the representation of the majority and minority classes. Two key techniques discussed were SMOTE (Synthetic Minority Over-sampling Technique) and ROSE (Random Over-Sampling Examples). SMOTE generates synthetic samples for the minority class by interpolating between existing instances and their nearest neighbors, making it particularly suitable for structured datasets with a well-defined minority class. In contrast, ROSE employs kernel density estimation to generate synthetic samples across the feature space, making it effective for handling noisy or overlapping class boundaries. While SMOTE is ideal for enhancing decision boundaries, ROSE offers versatility for noisy datasets.

Algorithm-level modifications address class imbalance by directly influencing the learning process. These include the use of class weights, cost-sensitive learning, and custom loss functions. Class weights, as shown in a weighted logistic regression example, assign higher importance to minority class samples, ensuring they have a greater impact during model training. Cost-sensitive learning, such as with cost-sensitive decision trees, penalizes misclassifications of the minority class more heavily, effectively guiding the algorithm to prioritize these cases. Custom loss functions, often applied in deep learning, allow precise control over the training process by introducing penalties tailored to the dataset’s needs, such as weighted binary cross-entropy. These algorithmic approaches are particularly useful when the dataset itself should remain unaltered, and the imbalance can be addressed through adjustments to the learning framework.

In summary, resampling methods like SMOTE and ROSE are best used when modifying the dataset is feasible, with SMOTE being preferred for well-defined minority classes and ROSE for noisy or sparse datasets. Algorithm-level modifications, including class weights, cost matrices, and custom loss functions, are preferable when the learning process requires more nuanced adjustments. Together, these strategies provide a robust toolkit for handling class imbalance, ensuring better performance and fair representation of minority classes in machine learning models.

------------------------------------------------------------------------

## Files & Resources

```{r zipFiles, echo=FALSE}
zipName = sprintf("LessonFiles-%s-%s.zip", 
                 params$category,
                 params$number)

textALink = paste0("All Files for Lesson ", 
               params$category,".",params$number)

# downloadFilesLink() is included from _insert2DB.R
knitr::raw_html(downloadFilesLink(".", zipName, textALink))
```

------------------------------------------------------------------------

## References

None yet.

## Errata

[Let us know](https://form.jotform.com/212187072784157){target="_blank"}.
