Dimensionality reduction is an essential feature engineering step in many machine learning tasks, especially when dealing with high-dimensional datasets, i.e., datasets that have many features. Thoughtful dimensionality reduction aids in simplifying models, reduces overfitting, and improves computational efficiency.
This lesson provides an overview of dimensionality reduction techniques, focusing on Principal Component Analysis (PCA) and common feature selection methods, specifically in the context of supervised learning. The lesson explains the mathematical underpinnings and provides practical implementation examples in R. Additionally, it explains when each method is most appropriate and situations where it might not be ideal.
In general, dimensionality reduction attempts to reduce the number of independent input variables (i.e., features or columns) in a dataset while retaining as much information as possible to allow for the detection of useful signals. In supervised machine learning, the primary goal is to improve model performance by mitigating the “curse of dimensionality”, which occurs when the feature space becomes too large, leading to overfitting and increased computational cost.
In supervised learning, the input data often includes a high number of features, some of which may be irrelevant, redundant, or highly correlated. This high dimensionality can lead to overfitting, where the model captures noise instead of signal, especially when the number of observations is limited. Moreover, training models in high-dimensional spaces often requires more computation and memory, while interpretability decreases as the number of features grows.
Dimensionality reduction addresses these challenges by either selecting a subset of features or transforming the original features into a new, lower-dimensional representation. The goal is to preserve the structure in the data that is most informative for predicting the target variable.
Feature selection techniques choose a subset of the original features based on their relevance. These methods can be filter-based (e.g., ranking features by their correlation with the target variable), wrapper-based (e.g., recursive feature elimination or p-value-based backward elimination in regression), or embedded in the learning algorithm (e.g., LASSO, sketched below).
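To make the embedded approach concrete, here is a minimal sketch using the glmnet package (an assumption; it is not used elsewhere in this lesson) to fit a LASSO model on the built-in mtcars data: coefficients that are shrunk to exactly zero correspond to features that are effectively dropped.
# Minimal LASSO sketch with glmnet (assumes the package is installed)
library(glmnet)
X.mtcars <- as.matrix(mtcars[, -1])   # predictors
y.mpg <- mtcars$mpg                   # target
set.seed(1)
lasso.fit <- cv.glmnet(X.mtcars, y.mpg, alpha = 1)  # alpha = 1 gives the LASSO penalty
coef(lasso.fit, s = "lambda.min")     # zero coefficients indicate dropped features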
In contrast, feature extraction transforms the original feature space into a lower-dimensional space by combining features, typically through mathematical operations. These transformed features are often not directly interpretable but may capture the underlying structure of the data more effectively.
In this lesson, we focus primarily on feature extraction methods, as they embody the central principles of dimensionality reduction.
In this section, we provide an overview of common feature extraction techniques, with examples of how to apply each technique in R. In a subsequent section, we dive more deeply into the mathematics behind some of the techniques.
PCA is an unsupervised technique that projects the data onto new orthogonal axes (principal components), ordered by the amount of variance they explain in the data. Though unsupervised, PCA is often used in supervised learning pipelines to reduce noise and redundancy in the feature space before training a model.
The first step is to center and scale the data. PCA then computes the eigenvectors and eigenvalues of the covariance matrix, selecting the top components that explain a chosen proportion of variance.
Example in R:
The example below uses the prcomp() function on scaled data to perform PCA.
# Load dataset and preprocess
data(iris)
X <- iris[, 1:4]
y <- iris$Species
# Standardize the features
X.scaled <- scale(X)
# Perform PCA
pca.result <- prcomp(X.scaled, center = TRUE, scale. = TRUE)
# Proportion of variance explained
summary(pca.result)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.7084 0.9560 0.38309 0.14393
## Proportion of Variance 0.7296 0.2285 0.03669 0.00518
## Cumulative Proportion 0.7296 0.9581 0.99482 1.00000
# Use the first two principal components
X.pca <- pca.result$x[, 1:2]
# Combine with the response variable
df.pca <- data.frame(X.pca, Species = y)
# Visualize
library(ggplot2)
ggplot(df.pca, aes(PC1, PC2, color = Species)) +
geom_point(size = 2) +
labs(title = "PCA on Iris Dataset")
In supervised learning, the PCA-transformed features can then be used as inputs to a classifier, for example a logistic regression model or a support vector machine. So, rather than building the model on all of the original features, we restrict model construction to the principal components.
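As a minimal illustration of this idea, the sketch below uses the nnet package (which ships with R) to fit a multinomial logistic regression on the two retained components in df.pca and reports its in-sample accuracy; a proper evaluation would use held-out data or cross-validation.
# Fit a classifier on the two principal components only
library(nnet)
fit.pc <- multinom(Species ~ PC1 + PC2, data = df.pca, trace = FALSE)
# In-sample accuracy on the reduced feature space
mean(predict(fit.pc) == df.pca$Species)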
Unlike PCA, LDA is supervised. It seeks directions that best separate the classes by maximizing the ratio of between-class variance to within-class variance. The number of components LDA can return is bounded by the number of classes minus one.
Example in R:
library(MASS)
# Perform LDA
lda.result <- lda(Species ~ ., data = iris)
# Project the data
X.lda <- predict(lda.result)$x
# Visualize the LDA components
df.lda <- data.frame(X.lda, Species = iris$Species)
ggplot(df.lda, aes(LD1, LD2, color = Species)) +
geom_point(size = 2) +
labs(title = "LDA on Iris Dataset")
LDA is particularly effective when class labels are known and the classes are linearly separable in feature space.
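As a brief sketch, the fitted LDA object can itself be used as a classifier; below we check its resubstitution accuracy on iris (an honest estimate would use cross-validation, e.g. lda(..., CV = TRUE)).
# Predicted classes from the fitted LDA model (training data)
lda.pred <- predict(lda.result)$class
table(Predicted = lda.pred, Actual = iris$Species)
mean(lda.pred == iris$Species)   # resubstitution accuracy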
PLS is designed for regression problems where predictors are numerous and collinear. It projects both predictors and response onto a new space that maximizes the covariance between them. Unlike PCA, it directly considers the output variable during component construction.
Example in R:
# Load the pls package for partial least squares regression
library(pls)
# Use the 'yarn' dataset from the pls package
data(yarn)
X <- yarn$NIR
y <- yarn$density
# Fit PLS model
pls.model <- plsr(y ~ X, ncomp = 5, validation = "CV")
# Summary and validation
summary(pls.model)
## Data: X dimension: 28 268
## Y dimension: 28 1
## Fit method: kernelpls
## Number of components considered: 5
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps
## CV 27.46 6.503 4.488 2.203 1.111 0.7126
## adjCV 27.46 5.833 4.449 2.163 1.052 0.6820
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps
## X 46.83 98.38 99.46 99.67 99.85
## y 98.12 98.25 99.64 99.97 99.99
PLS is best used when predictors are highly collinear or when there are more features than observations.
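A short follow-on sketch, using only functions from the pls package, shows how one might inspect the cross-validated error visually and then generate fitted values for a chosen number of components (ncomp = 4 here is just an illustration, not a recommendation).
# Plot cross-validated RMSEP against the number of components
plot(RMSEP(pls.model), legendpos = "topright")
# Fitted values using, e.g., 4 components
y.hat <- predict(pls.model, ncomp = 4)
head(cbind(observed = y, predicted = drop(y.hat)))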
t-SNE is a non-linear technique useful for visualizing high-dimensional data. It models pairwise similarities in the high- and low-dimensional spaces and minimizes the divergence between these distributions. While t-SNE is not typically used directly in predictive pipelines because it lacks an inverse transformation, it is valuable in exploratory data analysis.
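Although t-SNE is mainly an exploratory tool, a minimal sketch using the Rtsne package (an assumption; it is not used elsewhere in this lesson) shows how a two-dimensional embedding of the scaled iris features could be produced and plotted.
library(Rtsne)
# iris contains duplicate rows, so duplicate checking is disabled here
set.seed(42)
tsne.result <- Rtsne(X.scaled, dims = 2, perplexity = 30, check_duplicates = FALSE)
df.tsne <- data.frame(tsne.result$Y, Species = iris$Species)
ggplot(df.tsne, aes(X1, X2, color = Species)) +
geom_point(size = 2) +
labs(title = "t-SNE on Iris Dataset")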
UMAP is another non-linear technique that preserves both local and global structure more effectively than t-SNE in many cases. Like t-SNE, it is mainly used for visualization or pre-processing.
Example in R:
library(umap)
# Apply UMAP
umap.result <- umap(X.scaled)
# Prepare data
df.umap <- data.frame(umap.result$layout, Species = iris$Species)
ggplot(df.umap, aes(X1, X2, color = Species)) +
geom_point(size = 2) +
labs(title = "UMAP on Iris Dataset")
Autoencoders are neural networks trained to reconstruct their input. The compressed representation in the hidden layer(s) can serve as a reduced-dimensional feature set. They are especially useful when working with image, audio, or text data.
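As a rough sketch only, assuming the keras R package with a working backend (neither is used elsewhere in this lesson), a small autoencoder with a two-unit bottleneck could be trained on the scaled iris features from the PCA example; the bottleneck activations then serve as a two-dimensional feature set.
library(keras)
n.features <- ncol(X.scaled)
# Encoder-decoder with a 2-unit bottleneck
input <- layer_input(shape = n.features)
encoded <- layer_dense(input, units = 2, activation = "relu")
decoded <- layer_dense(encoded, units = n.features, activation = "linear")
autoencoder <- keras_model(inputs = input, outputs = decoded)
autoencoder %>% compile(optimizer = "adam", loss = "mse")
# Train the network to reconstruct its own input
autoencoder %>% fit(X.scaled, X.scaled, epochs = 100, batch_size = 16, verbose = 0)
# The bottleneck activations are the reduced-dimensional representation
encoder <- keras_model(inputs = input, outputs = encoded)
X.encoded <- predict(encoder, X.scaled)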
Evaluation can be quantitative or qualitative. Quantitatively, one can assess the explained variance (as in PCA), classification accuracy using reduced features, or cross-validated performance of models. Qualitatively, visualizations of reduced data, class separation, and cluster coherence can provide insights.
One common approach is to compare the performance of a classifier using original versus reduced features. Cross-validation ensures that performance generalizes. The key is to experiment.
# Load the caret package for cross-validated model training
library(caret)
# Original features
set.seed(42)
model.original <- train(Species ~ ., data = iris, method = "rf", trControl = trainControl(method = "cv"))
# PCA-transformed features (first 2 PCs)
df.pca2 <- data.frame(pca.result$x[, 1:2], Species = iris$Species)
model.pca <- train(Species ~ ., data = df.pca2, method = "rf", trControl = trainControl(method = "cv"))
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
# Compare cross-validated accuracy
model.original$results
##   mtry Accuracy Kappa AccuracySD    KappaSD
## 1    2     0.96  0.94 0.04661373 0.06992059
## 2    3     0.96  0.94 0.04661373 0.06992059
## 3    4     0.96  0.94 0.04661373 0.06992059
model.pca$results
##   mtry  Accuracy Kappa AccuracySD   KappaSD
## 1    2 0.9133333  0.87 0.07730012 0.1159502
Dimensionality reduction should be applied when the number of features is large relative to the number of observations, or when predictors are collinear, noisy, or irrelevant. These techniques are also useful when interpretability or visualization is needed, or when models are to be deployed in resource-constrained environments. However, care must be taken not to apply unsupervised reductions blindly in supervised contexts without assessing their effect on predictive accuracy: experiment and evaluate rather than following a “recipe”.
Let’s conclude the lesson with a case study using a high-dimensional real-world dataset. This section illustrates how dimensionality reduction techniques can be effectively applied in practice when working with a dataset that contains many more features than observations. The goal is to demonstrate the impact of dimensionality reduction on model performance and interpretability in a supervised learning context.
We will walk step-by-step through a case study that demonstrates dimensionality reduction via Principal Component Analysis (PCA) in the context of needing to create a regression model. The case study uses the mtcars built-in dataset in R.
The workflow we will apply is the following:
1. Load and prepare the dataset.
2. Fit a linear regression model using the original features as a baseline.
3. Apply PCA to extract components that explain at least 95% of the variance.
4. Fit a linear regression model using the PCA components.
5. Compare the cross-validated performance of the two models.
Our target variable will be miles per gallon (mpg), and we’ll compare how well it can be predicted using both the original features and the PCA-reduced feature space.
Step 1: Load and Prepare the Dataset
We begin by loading the mtcars dataset and examining its structure to identify the predictors and the response variable.
# Inspect the structure of the built-in mtcars dataset
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Next, we will separate the target feature (mpg) from the predictor features.
# Define target and predictors
y <- mtcars$mpg ## target
X <- mtcars[, !names(mtcars) %in% "mpg"] ## predictors
All predictor variables are numeric and relatively continuous, making PCA appropriate.
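A quick sanity check, sketched below, confirms that the predictors are numeric and that many predictor pairs are strongly correlated, which is exactly the situation PCA is designed to address (the 0.7 cutoff is an arbitrary illustration; output not shown).
# Confirm all predictors are numeric and count strongly correlated pairs
stopifnot(all(sapply(X, is.numeric)))
cor.X <- cor(X)
sum(abs(cor.X[upper.tri(cor.X)]) > 0.7)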
Step 2: Fit a Linear Regression Model Using Original Features
We first fit a linear model using all original predictors and evaluate it using 10-fold cross-validation.
library(caret)
# Combine predictors and response
df.original <- data.frame(mpg = y, X)
# Set up cross-validation
set.seed(123)
ctrl <- trainControl(method = "cv", number = 10)
# Train linear regression model on original data
model.original <- train(mpg ~ ., data = df.original,
method = "lm",
trControl = ctrl)
# Print performance
model.original$results
## intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 TRUE 3.286242 0.8588116 2.715828 1.362054 0.1739366 1.120088
This provides a baseline root mean squared error (RMSE) using the full feature set.
Step 3: Apply PCA for Feature Extraction
We now perform PCA on the predictors and extract the principal components that explain at least 95% of the variance.
# Standardize predictors
X.scaled <- scale(X)
# Perform PCA
pca <- prcomp(X.scaled, center = TRUE, scale. = TRUE)
# Variance explained
explained.var <- summary(pca)$importance[3, ] # cumulative proportion
# Determine number of components for 95% variance
num.components <- which(explained.var >= 0.95)[1]
cat("Number of components to retain:", num.components, "\n")
## Number of components to retain: 6
We now use the top num.components principal components for our regression model.
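Before fitting the model, it can be helpful to see how quickly the explained variance accumulates; the short base-R sketch below marks the 95% threshold used above (plot not shown).
# Visualize cumulative variance explained and the 95% cutoff
plot(explained.var, type = "b",
xlab = "Number of principal components",
ylab = "Cumulative proportion of variance explained")
abline(h = 0.95, lty = 2)
abline(v = num.components, lty = 3)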
Step 4: Fit Linear Regression Using PCA Components
We extract the selected components, attach the response, and fit a model using the same cross-validation setup as the full model.
# Use top components
X.pca <- pca$x[, 1:num.components]
# Combine with response
df.pca <- data.frame(mpg = y, X.pca)
# Train linear regression model on PCA-transformed data
set.seed(123)
model.pca <- train(mpg ~ ., data = df.pca,
method = "lm",
trControl = ctrl)
# Print performance
model.pca$results
## intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 TRUE 2.628416 0.8446803 2.31206 0.9801583 0.3030947 0.9546273
Step 5: Compare Performance
We now compare the RMSE of the model using the original features with the model using the PCA components.
# Compare cross-validated RMSE of the two models
cat("RMSE with original features:", model.original$results$RMSE, "\n")
cat("RMSE with PCA components:", model.pca$results$RMSE, "\n")
## RMSE with original features: 3.286242
## RMSE with PCA components: 2.628416
This comparison quantifies the tradeoff between dimensionality and predictive accuracy. If PCA reduces RMSE or maintains it with fewer features, it improves model efficiency and reduces overfitting potential.
Interpretation and Discussion
If the PCA-based model performs similarly or better than the full model, it suggests that the reduced set of orthogonal components successfully captures the variance relevant for predicting mpg
, while eliminating redundancy and multicollinearity.
On the other hand, if performance drops significantly, this suggests that some components excluded from the PCA-based model contained signal relevant to mpg
, or that the transformation lost important interpretability.
This case study demonstrates that PCA can be effectively used as a feature extraction tool in regression tasks, and that its performance can be rigorously evaluated using cross-validation.
It is worth mentioning that we did not determine whether the coefficients of either model are statistically significant, nor whether p-value-based backward elimination might further improve the model.
Using the PCA-Based Model
Let’s analyze the PCA-based regression model in a bit more detail by examining the PCA loadings, visualizing the component space, and using the model to predict a new observation.
1. Examine PCA Loadings
The loadings (or eigenvectors) describe how each original variable contributes to each principal component. These values help us interpret the meaning of the principal components in terms of the original variables.
# Loadings: contributions of each variable to each component
loadings <- pca$rotation[, 1:num.components]
# View loadings for first few components
round(loadings, 3)
## PC1 PC2 PC3 PC4 PC5 PC6
## cyl 0.403 0.039 -0.139 0.000 -0.061 0.182
## disp 0.396 -0.054 -0.016 -0.265 -0.339 -0.357
## hp 0.354 0.245 0.182 0.060 -0.528 0.033
## drat -0.316 0.278 0.131 -0.853 -0.103 0.234
## wt 0.367 -0.147 0.386 -0.253 0.144 -0.432
## qsec -0.220 -0.461 0.403 -0.072 0.213 -0.293
## vs -0.333 -0.228 0.413 0.212 -0.624 0.117
## am -0.247 0.432 -0.235 0.032 -0.049 -0.609
## gear -0.221 0.465 0.279 0.262 -0.020 -0.246
## carb 0.227 0.412 0.562 0.123 0.366 0.258
Each column represents a principal component, and each row corresponds to an original variable. The magnitude of each loading tells us how much that variable contributes to the component. A high positive or negative value indicates a strong influence.
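A quick way to see which variables dominate a given component is to rank them by the absolute value of their loadings, as in this brief sketch for PC1.
# Rank original variables by the magnitude of their contribution to PC1
sort(abs(loadings[, "PC1"]), decreasing = TRUE)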
For example, if PC1 has high positive loadings for wt (weight) and disp (displacement), we can interpret PC1 as capturing a “size or mass” dimension of the car.
To visualize the loadings, we can create a biplot. This plot shows the observations projected onto the first two principal components, along with arrows indicating how strongly each original variable contributes to those components.
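Base R provides a biplot method for prcomp objects; a minimal call is sketched below (plot not shown).
# Observations as points and variable loadings as arrows on PC1/PC2
biplot(pca, scale = 0, cex = 0.7)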
2. Visualize the PCA Components
To better visualize class structure or spread in the component space, we can project the data onto the first two components and plot them.
# First two principal components
pca.df <- data.frame(PC1 = pca$x[,1], PC2 = pca$x[,2], mpg = y)
# Plot colored by mpg
library(ggplot2)
ggplot(pca.df, aes(PC1, PC2, color = mpg)) +
geom_point(size = 3) +
scale_color_gradient(low = "blue", high = "red") +
labs(title = "PCA Projection of mtcars",
x = "First Principal Component", y = "Second Principal Component")
This helps visualize patterns in the reduced space. Although PCA is unsupervised, plotting the continuous response (like mpg) as a gradient can show how variance in components relates to the target.
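To quantify what the color gradient suggests, a brief sketch: compute the correlation between each retained component and mpg.
# Correlation of each retained principal component with the response
round(cor(pca$x[, 1:num.components], y), 2)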
3. Use PCA-Based Model for Prediction
Suppose we want to use the PCA-based model to make a prediction for a new observation. We must standardize the new input in the same way, project it into PCA space, and use the regression model.
Let’s use the first row of mtcars as a new example.
# Original new observation (excluding mpg)
new.obs <- mtcars[1, -1]
# Standardize using training mean and sd
new.obs.scaled <- scale(new.obs, center = attr(X.scaled, "scaled:center"),
scale = attr(X.scaled, "scaled:scale"))
# Project onto PCA space
new.obs.pca <- as.matrix(new.obs.scaled) %*% pca$rotation[, 1:num.components]
# Create dataframe for prediction
new.df <- as.data.frame(new.obs.pca)
colnames(new.df) <- colnames(df.pca)[-1]
# Predict mpg using PCA-based model
predicted.mpg <- predict(model.pca, newdata = new.df)
predicted.mpg
## Mazda RX4
## 22.24574
This shows how to apply the entire transformation and modeling pipeline to make predictions on new data. It also highlights a critical point in PCA-based modeling: you must retain the scaling parameters and PCA rotation matrix from training to ensure consistent transformation of new data.
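One way to keep these pieces together is to wrap them in a small helper function; the sketch below (the name mpg.from.pca is hypothetical) bundles the stored scaling parameters, the rotation matrix, and the fitted model into a single prediction step.
# Hypothetical helper: standardize, project, and predict in one call
mpg.from.pca <- function(new.data) {
  scaled <- scale(new.data,
                  center = attr(X.scaled, "scaled:center"),
                  scale = attr(X.scaled, "scaled:scale"))
  comps <- as.data.frame(as.matrix(scaled) %*% pca$rotation[, 1:num.components])
  colnames(comps) <- colnames(df.pca)[-1]
  predict(model.pca, newdata = comps)
}
mpg.from.pca(mtcars[1:3, -1])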
Of course, the regression equation of a PCA-based model is expressed in terms of the components rather than the original variables, which makes it considerably less interpretable.
Hopefully, this end-to-end workflow demonstrates how PCA can be a practical and useful tool in a regression modeling pipeline.
In this section, we provide additional details underpinning the techniques and methods presented in the prior sections. These may be of interest to readers with the requisite mathematical background, but they are not essential for applying the techniques.
To understand Principal Component Analysis (PCA) deeply, especially in the context of supervised machine learning and dimensionality reduction, we must explore the mathematical principles that define it. PCA is grounded in linear algebra and statistics, and its objective is to find new orthogonal axes (principal components) that successively maximize the variance in the data. Although PCA is an unsupervised method, it is often used as a preprocessing step in supervised learning to reduce noise, eliminate redundancy, and improve generalization.
Let us consider a dataset \(\mathbf{X} \in \mathbb{R}^{n \times p}\), where \(n\) is the number of observations (rows) and \(p\) is the number of features (columns).
PCA seeks to find a set of orthogonal vectors \(\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_k \in \mathbb{R}^p\), where \(k \leq p\), such that the projection of \(\mathbf{X}\) onto these vectors captures as much variance as possible in the data.
We begin by centering the data so that each feature has mean zero:
\[ \tilde{\mathbf{X}} = \mathbf{X} - \mathbf{1}_n \bar{\mathbf{x}}^\top \]
where \(\bar{\mathbf{x}} \in \mathbb{R}^p\) is the vector of column means of \(\mathbf{X}\), and \(\mathbf{1}_n \in \mathbb{R}^n\) is a vector of ones. This centering ensures that PCA identifies directions of variance rather than biasing the components toward the mean.
We compute the sample covariance matrix:
\[ \mathbf{S} = \frac{1}{n - 1} \tilde{\mathbf{X}}^\top \tilde{\mathbf{X}} \]
This matrix \(\mathbf{S} \in \mathbb{R}^{p \times p}\) is symmetric and positive semi-definite, which ensures that it has real, non-negative eigenvalues and an orthonormal set of eigenvectors.
We solve the eigenvalue problem:
\[ \mathbf{S} \mathbf{w}_i = \lambda_i \mathbf{w}_i \]
Here, \(\lambda_i\) is the eigenvalue corresponding to eigenvector \(\mathbf{w}_i\). Each eigenvalue represents the amount of variance explained by its corresponding eigenvector. We sort the eigenvalues in descending order \(\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_p\), and select the top \(k\) eigenvectors to form the projection matrix:
\[ \mathbf{W}_k = [\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_k] \in \mathbb{R}^{p \times k} \]
The reduced-dimensional representation of the data is then given by:
\[ \mathbf{Z} = \tilde{\mathbf{X}} \mathbf{W}_k \in \mathbb{R}^{n \times k} \]
Each row of \(\mathbf{Z}\) is a projection of the original observation onto the lower-dimensional space spanned by the principal components.
PCA can be equivalently described as the solution to an optimization problem. For the first principal component \(\mathbf{w}_1\), we maximize the variance of the projection:
\[ \mathbf{w}_1 = \arg\max_{\|\mathbf{w}\| = 1} \mathrm{Var}(\tilde{\mathbf{X}} \mathbf{w}) = \arg\max_{\|\mathbf{w}\| = 1} \mathbf{w}^\top \mathbf{S} \mathbf{w} \]
This is a classic Rayleigh quotient, and its solution is the eigenvector corresponding to the largest eigenvalue of \(\mathbf{S}\). Each subsequent component is found similarly, subject to orthogonality with previously selected components.
The proportion of total variance explained by the \(i\)-th principal component is:
\[ \frac{\lambda_i}{\sum_{j=1}^p \lambda_j} \]
This measure guides how many components to retain in practice, often using a threshold (e.g., retain enough components to explain at least 95% of the variance).
Geometrically, PCA finds a new coordinate system for the data: the axes (principal components) are mutually orthogonal, the first axis points in the direction of greatest variance, and each subsequent axis captures the largest remaining variance subject to orthogonality with the previous axes.
PCA can also be seen as a matrix factorization:
\[ \tilde{\mathbf{X}} = \mathbf{Z} \mathbf{W}^\top + \mathbf{E} \]
where \(\mathbf{Z} \in \mathbb{R}^{n \times k}\) are the reduced components, \(\mathbf{W}^\top \in \mathbb{R}^{k \times p}\) is the loading matrix, and \(\mathbf{E}\) is the residual error. This perspective is helpful when understanding reconstruction or inverse transformation in PCA.
To reinforce the mathematical derivation, here is an R example that performs PCA manually using eigen decomposition:
# Sample centered data matrix
X <- scale(iris[, 1:4], center = TRUE, scale = FALSE)
# Covariance matrix
S <- cov(X)
# Eigen decomposition
eig <- eigen(S)
# Eigenvalues (variance explained)
eig$values
# Eigenvectors (principal directions)
eig$vectors
# Project data onto first two components
Z <- X %*% eig$vectors[, 1:2]
# Plot
plot(Z, col = as.numeric(iris$Species), pch = 19,
xlab = "PC1", ylab = "PC2", main = "PCA via Eigen Decomposition")
We saw that PCA is not just a computational trick but a principled method grounded in variance maximization, linear algebra, and matrix decomposition. It offers a powerful lens through which to view complex, high-dimensional data, especially when preparing it for supervised machine learning tasks.
Dimensionality reduction is a powerful tool in supervised machine learning, helping to improve model performance and interpretability. PCA is a robust technique for reducing dimensionality by transforming features into uncorrelated components, best suited for high-dimensional data with multicollinearity. However, because PCA is unsupervised, it may not always capture features most relevant to the target variable. Feature selection methods, on the other hand, are designed to retain the most important features for a given task and are better suited for supervised learning.
Each technique has its strengths and limitations, and the choice of method should be informed by the specific characteristics of the dataset and the objectives of the analysis. By combining mathematical understanding with practical implementation, you can leverage these techniques to build more efficient and effective machine learning models.