Multiple linear regression (MLR) is an extension of simple linear regression that allows for the prediction of a dependent variable using multiple independent variables. It is a fundamental statistical technique widely used in causal predictive modeling, particularly in domains such as finance and healthcare, where multiple factors influence a (numerically quantifiable) outcome.
This lesson is a primer and introduction to multiple linear regression. For more detailed coverage of regression, consult these lessons:
At the core of multiple linear regression is a linear mathematical expression that defines the relationship between the dependent variable (target feature) and multiple independent variables (predictive features). Mathematically, a regression model is expressed as the sum of the independent variables multiplied by a coefficient that is calculated from the data:
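\[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon
\]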
In this equation, \(Y\) represents the dependent variable, while \(X_1, X_2, ..., X_p\) denote the independent variables. The term \(\beta_0\) corresponds to the intercept, whereas \(\beta_1, \beta_2, ..., \beta_p\) are the regression coefficients that quantify the contribution of each predictor. The error term \(\epsilon\) accounts for variability not explained by the independent variables.
To estimate the regression coefficients, the ordinary least squares (OLS) method is employed. This technique minimizes the sum of the squared residuals (the differences between the actual and predicted values). The coefficients provide insight into the impact of each independent variable on the dependent variable.
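Formally, OLS chooses the coefficient estimates that minimize the residual sum of squares:

\[
\min_{\beta_0, \beta_1, \ldots, \beta_p} \; \sum_{i=1}^{n} \left( Y_i - \beta_0 - \beta_1 X_{i1} - \cdots - \beta_p X_{ip} \right)^2
\]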
To determine how well the model fits the data and how good a prediction is, the model’s “fit” is evaluated using various metrics, which are presented later.
Building a Regression Model
Implementing multiple linear regression begins with selecting an appropriate dataset that is suitable for regression. Regression is not a universal technique; when its assumptions are not met, an alternative approach (such as kNN regression or regression trees) should be used.
This section provides a step-by-step tutorial on building a regression model in R.
1. Load the Data
The data should be in tabular form containing the predictor features (independent variables) and the target feature (dependent variable) in columns. Most commonly, the data will be in CSV files but can be provided in other formats too, which would then need to be converted to a tabular format.
To illustrate the practical application of multiple linear regression, we will use a dataset (insurance.csv) that contains numerous factors influencing medical costs.
The first step involves loading and exploring the dataset in R:
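A minimal sketch of this step, assuming the file insurance.csv sits in the current working directory (the file name and location are assumptions):

## load the data from a CSV file
df <- read.csv("insurance.csv", stringsAsFactors = FALSE)

## explore the structure and the first few rows
str(df)
head(df)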
The target feature is charges; the other features are the predictors: age, sex, bmi, children, smoker, region. The goal is to develop a regression equation that allows us to predict the medical costs incurred by an insured (charges).
We have three numeric features (age, bmi, children), two binary features (sex and smoker) and one categorical feature (region). Each requires different treatment and preparation.
Preparing the data is an essential preliminary step and involves handling missing values, encoding categorical variables, evaluating distribution of the features, identifying outliers, and ensuring proper feature scaling where necessary.
2. Manage Outliers
The first step is to identify any outliers in the numeric features. Generally, values that are more than three standard deviations from the mean on either side (i.e., having an absolute z-score > 3) are considered outliers, and those observations should either be removed or the values should be treated as missing values and imputed.
## calculate mean and sd for each numeric column
numeric.cols <- sapply(df, is.numeric)

for (c in 1:length(numeric.cols)) {
  if (numeric.cols[c] == TRUE) {
    ## find outliers in numeric column
    m <- mean(df[,c], na.rm = T)
    s <- sd(df[,c], na.rm = T)

    outliers <- which(abs((m - df[,c]) / s) > 3.0)

    if (length(outliers) > 0) {
      ## found outliers; replace with NA and impute later
      cat("Found outliers in column '", names(df)[c], "': \n")
      cat("   --> ", df[outliers,c], "\n\n")
    }
  }
}
## Found outliers in column ' bmi ':
## --> 360.05
##
## Found outliers in column ' children ':
## --> 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
##
## Found outliers in column ' charges ':
## --> 51194.56 63770.43 58571.07 55135.4 52590.83 60021.4 62592.87
We need to be careful to ignore missing values in the calculation of the mean and standard deviation as those have not yet been dealt with.
So, three columns contain outlier values. While having 5 children might be flagged as an outlier in this dataset, it is by no means an unusual value. Therefore, we will ignore those “outliers”. Likewise, the high charges might be outliers but could be justified depending on the medical procedures that were performed. Again, it is best to ignore those in this situation.
On the other hand, the extremely high value for BMI is indeed very likely incorrect and is probably a data entry error; perhaps a slipped decimal when someone wanted to enter 36.005 instead of 360.05. Now, we could replace this value with NA and then treat it as an actual missing value or apply a correction that makes sense from a domain perspective. We will choose the latter in this situation.
df$bmi[which(df$bmi > 360)] <- 36.005
The treatment of outliers is situational and there is not always a clear course of action. Each method to manage outliers must be viewed in the context of the business and the goals of the analysis and regression modeling.
3. Impute Missing Values
Missing values must be addressed either by eliminating any observation (row) that has a missing value in any of its features or by imputing the missing value with an estimate. For numeric features, a common strategy is to impute the missing value with the mean or median, while for categorical features, the mode is often used.
The first step is to check each feature for missing values. While we can certainly check each feature manually, iterating over all columns is more practical. The code below uses a simple loop, but other approaches can be used as well, such as using sapply().
For missing numeric values we can check for NA with the function is.na(). However, that will not reveal missing values in a “character” column, such as smoker or region. For this, we also need to check whether the text is the empty string ("").
num.Rows <- nrow(df)
num.Cols <- ncol(df)

found <- F

for (c in 1:num.Cols){
  missing.Values <- which(is.na(df[,c]) | df[,c] == "")
  num.Missing.Values <- length(missing.Values)

  if (num.Missing.Values > 0) {
    print(paste0("Column '", names(df)[c], "' has ",
                 num.Missing.Values, " missing values"))
    found <- T
  }
}
## [1] "Column 'bmi' has 3 missing values"
## [1] "Column 'smoker' has 2 missing values"
if (!found) {print("no missing values detected")}
The numeric column bmi has three missing values, while the categorical column smoker has two missing values.
Since bmi is a numeric feature, the simplest approach is to replace the missing values with the mean of the column. However, the mean can be skewed by outliers, so the median is often a better choice. In this data, there is likely a significant difference between the BMI of male and female insured, so using the median of the appropriate sex is more reasonable. In fact, we could use a t-test to check whether that difference is statistically significant.
Let’s replace the missing values for bmi with the median of the respective sex, of course, ignoring any missing values in the calculation of the median.
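One possible implementation is sketched below; it assumes that sex contains only the values "male" and "female" with no missing entries:

## median bmi per sex, ignoring missing values
median.bmi.male   <- median(df$bmi[df$sex == "male"], na.rm = TRUE)
median.bmi.female <- median(df$bmi[df$sex == "female"], na.rm = TRUE)

## replace each missing bmi with the median of the matching sex
df$bmi <- ifelse(is.na(df$bmi),
                 ifelse(df$sex == "male", median.bmi.male, median.bmi.female),
                 df$bmi)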
Rather than using ifelse we could have also used a loop. Using ifelse on a vector means that we are using a vectorized operation, which is more efficient than a loop, albeit a bit harder to understand. The code also assumes only two values for sex ("male" and "female") and no missing values in the column sex.
We will use a similar approach for the missing values in the column smoker by replacing the value with the mode for each sex. One complication is that R does not have a built-in “mode” function, so we need to do the work ourselves. The table() function counts the number of occurrences for each type of smoker.
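A sketch of this imputation, assuming missing smoker values appear as empty strings or NA (as detected above):

## impute missing smoker values with the most frequent value (mode) per sex
for (s in unique(df$sex)) {
  rows    <- which(df$sex == s)
  missing <- rows[is.na(df$smoker[rows]) | df$smoker[rows] == ""]

  if (length(missing) > 0) {
    ## table() counts the occurrences of each smoker value; the mode is the most frequent one
    counts <- table(df$smoker[setdiff(rows, missing)])
    df$smoker[missing] <- names(counts)[which.max(counts)]
  }
}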
At this point in the data preparation phase, we have dealt with both outliers and missing values. Now, we need to check the suitability of the features for regression.
4. Resolve Multicollinearity
Multicollinearity refers to a situation in multiple linear regression where two or more independent variables exhibit a high degree of correlation. When multicollinearity is present, the regression model encounters difficulty in estimating the individual contribution of each predictor to the dependent variable because the predictors convey redundant or overlapping information; in other words, features are “double counted”.
The presence of multicollinearity does not affect the predictive capability of the regression model but severely undermines the interpretability of the coefficients. When independent variables are highly correlated, small changes in the dataset can result in vastly different coefficient estimates. This instability complicates decision-making, particularly in applications such as finance and healthcare, where precise variable impact assessments are crucial.
There are a number of ways to check for multicollinearity, but a correlation matrix is a common tool. Examining the correlation matrix of independent variables provides an initial indication of multicollinearity. If two or more numeric features exhibit correlations above 0.8 or so, multicollinearity may be an issue.
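A sketch of how such a correlation matrix can be produced for the numeric columns of our dataset:

## correlation matrix of the numeric predictors and the target
cor(df[, c("age", "bmi", "children", "charges")])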
## age bmi children charges
## age 1.0000000 0.10866959 0.04246900 0.29900819
## bmi 0.1086696 1.00000000 0.01102236 0.19813004
## children 0.0424690 0.01102236 1.00000000 0.06799823
## charges 0.2990082 0.19813004 0.06799823 1.00000000
For our dataset, there is no multicollinearity. However, if we had found correlations above some threshold (generally, about 0.8), then various strategies can be employed to mitigate its impact. One approach is to exclude one of the two variables in the regression modeling. This decision should be guided by domain knowledge and feature importance analysis. Instead of removing a variable, highly correlated predictors can be combined into a single feature, such as creating an index or an average score. This approach retains the predictive information while reducing redundancy. More advanced approaches are to use Principal Component Analysis (PCA) to transform them into a set of uncorrelated principal components or to use ridge regression instead. Scaling all features to a common scale can also be helpful.
5. Scale Features
In cases where multicollinearity arises due to the scaling of variables, centering (subtracting the mean) or standardizing (scaling to unit variance) may reduce its impact. This, of course, only applies to numeric features.
A drawback of scaling is that it changes the interpretability of the coefficients.
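As an illustration only (we do not scale the features in our working dataset), the numeric predictors could be standardized with scale(); the name df.scaled is used here just for this sketch:

## standardize (center to mean 0, scale to unit variance) the numeric predictors
df.scaled <- df
df.scaled$age      <- as.numeric(scale(df$age))
df.scaled$bmi      <- as.numeric(scale(df$bmi))
df.scaled$children <- as.numeric(scale(df$children))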
6. Encode Categorical Features
So far, we have primarily dealt with the numeric features. Let’s turn our attention to the categorical features. Binary categorical features (where there are only two possible values) should be turned into a 0/1 encoding.
In our dataset, we have two binary categorical features: smoker and sex. Naturally, in other domains, those features may be multi-level categorical rather than binary categorical.
## encode sex: male as 0 and female as 1
df$sex <- ifelse(df$sex == "male", 0, 1)

## encode smoker: 1 = yes and 0 = no
df$smoker <- ifelse(df$smoker == "yes", 1, 0)
The choice of 0 and 1 is arbitrary; we could have just as easily made “yes” for smoker 0 instead of 1.
The feature region is a multi-level categorical feature. There are several approaches to encoding categorical features as numeric values. We will only discuss one method: one-hot encoding. While it is simple and has high interpretability, it does not work well when there are more than about four levels.
Let’s start by finding out all of the different values for region, i.e., the levels.
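One simple way to do this:

## list the distinct values (levels) of region
unique(df$region)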
Before we get into details on how to perform one-hot encoding in R, let’s first explain the method. In one-hot encoding, we add additional binary (0/1) columns to the data, one column less than the number of levels. In our dataset, region has four levels: northeast, northwest, southeast, southwest. This is a partial set of columns and rows of the original arrangement:
head(df[1:5,c(1,2,6)])
## age sex region
## 1 19 1 southwest
## 2 18 0 southeast
## 3 28 0 southeast
## 4 33 0 northwest
## 5 32 0 northwest
We now add three additional columns: northeast, northwest, southeast and encode the values as follows:
region      northeast   northwest   southeast
northeast       1           0           0
northwest       0           1           0
southeast       0           0           1
southwest       0           0           0
Each unique category except one becomes a new column, and at most one of these columns contains a 1 per row. The category that was left out is represented by all 0s in its row.
In R, we can either use a loop or the apply() functions to make the assignments, or use functions from a package. Note that the lm function for building a regression model will do one-hot encoding automatically if the categorical column is a factor variable. We will show how to do one-hot encoding ourselves, so we have full control and flexibility.
n <- nrow(df)

## add new columns
df$northeast <- 0
df$northwest <- 0
df$southeast <- 0

for (i in 1:n) {
  if (df$region[i] == "northeast") df$northeast[i] <- 1
  else if (df$region[i] == "northwest") df$northwest[i] <- 1
  else if (df$region[i] == "southeast") df$southeast[i] <- 1
}
Here’s our updated dataset with the one-hot encoded categorical feature:
Once we have done the encoding, we need to either remove the categorical variable from the dataframe or be sure to exclude it when building the regression model.
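If we choose to remove it, one way is (a sketch):

## drop the original categorical column now that it has been encoded
df$region <- NULL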
One-hot, or dummy, encoding is a common method used in machine learning and data preprocessing to convert categorical variables into a numerical format that can be used by algorithms. It represents each category as a binary vector, where each category gets its own column and is marked as 1 (present) or 0 (absent). Lesson 3.207 - Encoding Categorical Features provides a more detailed treatment of this subject.
Alternatives to one-hot encoding are Label Encoding, where we assign numbers to categories (e.g., "North" → 1, "South" → 2), although this implies an ordinal relationship, which may be incorrect; Target Encoding, which replaces categories with their mean target value (used in predictive modeling); and Frequency Encoding, which replaces each category with its frequency of occurrence.
CAUTION: Some datasets may use numeric values to encode categorical features but that does not make the feature numeric. For example, the regions could have been defined as 1 for “northeast”, 2 for “northwest”, etc. That does not make them numeric as it would make no sense to talk about an “average region” or the standard deviation of the regions, or the arithmetic difference between northeast and southwest.
7. Select Features
There is no strict rule on how many features are too many in regression, as it depends on the size of the dataset and the nature of the problem. However, some general guidelines help determine whether the number of features is excessive:
Rule of Thumb: The Sample Size-to-Feature Ratio: A common heuristic suggests that the number of observations should be at least 10 times the number of features for reliable regression estimates. If the number of predictors approaches or exceeds the number of observations, the model is likely overfitted.
Adjusted \(R^2\) and Model Complexity: Unlike \(R^2\), which always increases as more predictors are added, adjusted \(R^2\) accounts for the number of features. If adjusted \(R^2\) starts to decline as more variables are introduced, it suggests that additional features do not contribute useful information.
Feature Importance Analysis: If many predictors have little to no effect on the dependent variable, their inclusion only adds noise. Identifying and removing irrelevant features improves model performance.
To mitigate the problems associated with too many features, various feature selection techniques can be applied to retain only the most informative predictors.
Filter methods evaluate individual feature relevance using statistical criteria before fitting the regression model:
Correlation Analysis: Identifies features that are highly correlated with each other and can be removed to avoid redundancy.
Variance Thresholding: Features with very low variance provide little discriminatory power and can be removed.
Stepwise Regression: A commonly used wrapper (rather than filter) method that iteratively adds or removes features based on model performance (a code sketch follows this list).
Forward Selection starts with no variables and adds predictors one at a time based on significance.
Backward Elimination starts with all variables and removes the least significant predictors iteratively.
Stepwise Selection is a combination of forward and backward selection, balancing feature inclusion and removal.
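As a sketch, an automated AIC-based stepwise selection is available in base R through the step() function (shown for illustration; the backward elimination tutorial later in this lesson uses p-values instead):

## fit the full model, then let step() remove predictors based on AIC
full.model <- lm(charges ~ age + sex + bmi + children + smoker +
                   northeast + northwest + southeast, data = df)
reduced.model <- step(full.model, direction = "backward", trace = 0)
summary(reduced.model)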
In addition, methods such as PCA and Ridge Regression are useful. All of these are beyond the scope of this introductory lesson.
8. Build Regression Model
After ensuring the data is clean and properly formatted, the regression model is built using the lm() function. Notice the “formula”: the dependent variable is on the left side, followed by a ~ and then a linear additive combination of the independent feature variables. The feature variables are the names of the numeric columns in the dataframe referenced with the data parameter. Feature selection guides which variables are included in the regression model.
The R code below demonstrates the construction of a regression model.
model <- lm(charges ~ age + sex + bmi + children + smoker +
              northeast + northwest + southeast, data = df)

## print summary of the model for investigation
summary(model)
##
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker +
## northeast + northwest + southeast, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11276.3 -2846.9 -980.3 1409.4 30004.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13106.23 1033.96 -12.676 < 2e-16 ***
## age 256.85 11.89 21.599 < 2e-16 ***
## sex 135.42 332.77 0.407 0.684111
## bmi 341.38 28.61 11.931 < 2e-16 ***
## children 478.27 137.72 3.473 0.000532 ***
## smoker 23854.58 412.92 57.770 < 2e-16 ***
## northeast 959.71 477.64 2.009 0.044711 *
## northwest 613.80 476.96 1.287 0.198352
## southeast -82.32 470.40 -0.175 0.861113
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6059 on 1329 degrees of freedom
## Multiple R-squared: 0.7512, Adjusted R-squared: 0.7497
## F-statistic: 501.6 on 8 and 1329 DF, p-value: < 2.2e-16
Interpreting the output of this regression model involves examining the estimated coefficients, assessing their statistical significance, and evaluating confidence intervals. Model diagnostics are also necessary to identify potential violations of regression assumptions.
Backward Feature Elimination
Any coefficients that are not statistically significant, i.e., have a p-value of more than 0.05, should be eliminated. The tutorial below explains this approach of backward elimination based on p-values. This method iteratively removes the least significant predictor variable, i.e., the variable with the highest p-value, until all remaining variables are statistically significant.
Step-by-Step Explanation
Fit the Full Model: Start with all predictor variables in the model.
Identify the Least Significant Predictor: Find the predictor variable with the highest p-value.
Remove the Least Significant Predictor: If the highest p-value is above a certain threshold (commonly 0.05), remove the corresponding predictor variable.
Refit the Model: Fit the model again without the removed predictor.
Repeat: Continue the process until all remaining predictor variables have p-values below the threshold.
A few tips to keep in mind:
ignore the p-value of the intercept
either include all or none of the columns that encode a categorical variable using dummy codes, even if some have a p-value greater than 0.05
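For example, in the model fitted above, sex has the highest p-value (0.68) among the removable predictors (the region dummies are kept together because northeast is significant), so a first elimination step might look like this sketch:

## refit the model without sex, the least significant predictor
model.2 <- lm(charges ~ age + bmi + children + smoker +
                northeast + northwest + southeast, data = df)
summary(model.2)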
9. Check Normality of Residuals
While it is not strictly necessary for numeric features (independent variables) to be normally distributed when building a regression model, there are nevertheless some important statistical considerations. Specifically, regression models estimate relationships between variables, and these relationships can be valid even when the predictors are not fully normally distributed. However, if predictors are highly skewed or have outliers, transformations (e.g., log, square root) may improve model performance.
What does matter is the distribution of residuals (errors) rather than the independent variables. In ordinary least squares (OLS) regression, an assumption is that residuals (differences between observed and predicted values) should be normally distributed for valid hypothesis testing (e.g., t-tests, confidence intervals, p-values). If residuals are non-normally distributed, it may indicate heteroscedasticity (non-constant variance), omitted variables, or model misspecification.
To check whether residuals are normally distributed, you can use the following strategies; to simplify the explanation, we use a partial model:
# Fit linear regression model
model <- lm(charges ~ age + bmi + children + smoker, data = df)
A histogram of the residuals should look somewhat “normal”, i.e., follow a Gaussian distribution or bell curve.
# Check residual normality with a histogram
hist(resid(model), main = "Histogram of Residuals", col = "lightblue")
There is some skewness to this distribution and transforming the target feature using a log transform might be appropriate.
An alternative is a Q-Q plot, in which the points should fall along a straight 45° reference line. If the plot looks more like a “hockey stick”, the distribution is not normal.
# Q-Q plot for normality check
qqnorm(resid(model))
qqline(resid(model), col = "red")
An alternative to a visual inspection of the distribution of the residuals is the Shapiro-Wilk statistical test of normality. The two key values in the result are the W-statistic (test statistic) and the p-value. A value for W close to 1 suggests normality, while a value far from 1 suggests deviation from normality. If the p-value is greater than 0.05, we fail to reject the null hypothesis, which means the data is consistent with a normal distribution. Likewise, a p-value less than 0.05 means that we reject the null hypothesis and conclude that the data is not normally distributed.
# Shapiro-Wilk test for normality
shapiro.test(resid(model))
##
## Shapiro-Wilk normality test
##
## data: resid(model)
## W = 0.90025, p-value < 2.2e-16
In the above result, the residuals are not normally distributed if we base our determination on the p-value alone. However, the plots are somewhat reasonable and the W statistic is close to 1, so while not perfect, we can accept the regression model.
10. Express Regression Equation
The regression equation is derived using the estimated coefficients from the model. The general form of a multiple linear regression equation is:
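\[
\hat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p
\]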
We can then write an equation which becomes our deployable model (with some rounding for display clarity and writing the intercept at the end rather than the beginning):
This equation predicts the expected medical charges based on age, BMI, number of children, and whether the person is a smoker or not. It is a simple equation that can be added to web applications, spreadsheet models, or calculated by hand. Making a prediction is extremely fast as it just involves simple addition and multiplication.
Note how easily the model can be interpreted. For example, being a smoker (1) adds an additional $23,816 to the medical charges, while each child accounts for an additional $476. Each one-point increase in BMI raises medical expenses by $324.
11. Make Prediction
Once the linear model (model) has been trained using the lm() function, predictions can be made using the predict() function in R or simply using the regression equation.
To predict charges for a patient with specific values of age, bmi, children, and smoker, we create a new data frame containing the input values and use the predict() function.
# Define the new observation as a data frame
new.data <- data.frame(age = 47,
                       bmi = 31,
                       children = 2,
                       smoker = 0)

# use the regression model to predict medical expenses
prediction <- predict(model, new.data)
The predicted medical charges or expenses are approximately $10,939 (i.e., \(1.0939 \times 10^4\)).
If you want to predict for multiple insured individuals at once, create a data frame containing multiple rows.
To obtain a confidence interval around the prediction, specify interval = "confidence".
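For example, using the new.data frame defined above; a prediction interval for an individual new observation is requested with interval = "prediction":

# confidence interval for the mean predicted charges
predict(model, new.data, interval = "confidence")

# prediction interval for an individual new observation
predict(model, new.data, interval = "prediction")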
This is wider than the confidence interval because it accounts for individual variance in future observations. If the goal is forecasting or making decisions based on individual outcomes, the prediction interval is more useful because it provides a range of likely values for a new observation. If the goal is to understand how well the model estimates the mean, the confidence interval is more relevant.
12. Evaluate Regression Model
Evaluating the performance of a regression model is essential to determine how well it fits the data and how reliable its predictions are. The “goodness” of a regression model is typically assessed using statistical measures, diagnostic plots, and residual analysis. The choice of evaluation metrics depends on the objective — whether the goal is to maximize explanatory power, assess predictive accuracy, or detect violations of assumptions.
The two most common evaluation metrics are the \(R^2\) statistic and the mean squared error.
Coefficient of Determination
The \(R^2\) statistic, also known as the coefficient of determination, measures the proportion of variance in the dependent variable that is explained by the independent variables and is mathematically defined as:
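\[
R^2 = 1 - \frac{\sum (Y_i - \hat{Y}_i)^2}{\sum (Y_i - \bar{Y})^2}
\]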
\(R^2\) ranges from 0 to 1, where a value closer to 1 indicates that the model explains most of the variability in the response variable. However, a high \(R^2\) does not necessarily mean the model is good; it does not account for overfitting.
In R, we can find the \(R^2\) and the Adjusted \(R^2\) as follows:
model <- lm(charges ~ age + bmi + sex + smoker, data = df)

r.squared <- summary(model)$r.squared
The \(R^2\) for the above regression model is 0.748. An \(R^2\) above 0.7 is generally reasonable, while above 0.8 is good, and above 0.9 is excellent. The model above has a reasonable \(R^2\), explaining about 75% of the observed variance.
Unlike \(R^2\), the adjusted \(R^2\) accounts for the number of predictors in the model and penalizes unnecessary variables.
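\[
R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}
\]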
where \(n\) is the number of observations and \(k\) is the number of predictors. Adjusted \(R^2\) prevents the illusion of improvement when adding irrelevant predictors. If adding new predictors causes Adjusted \(R^2\) to decrease, those predictors do not contribute significantly to the model. The Adjusted \(R^2\) can be read using summary(model)$adj.r.squared.
Mean Squared Error (MSE)
The mean squared error measures the average squared difference between the actual and predicted values.
\[
MSE = \frac{1}{n} \sum (Y_i - \hat{Y}_i)^2
\]
A smaller MSE indicates a better model fit. It penalizes large errors more than small ones due to squaring. We can calculate MSE in R as follows:
model <- lm(charges ~ age + bmi + sex + smoker, data = df)

mse <- mean((df$charges - predict(model))^2)
print(mse)
## [1] 36966773
Root mean squared error (RMSE) is simply the square root of MSE, making it interpretable in the same units as the dependent variable.
\[
RMSE = \sqrt{MSE}
\]
As before, lower RMSE values indicate a better fit. Unlike MSE, RMSE is directly interpretable in terms of the original response variable.
In addition to the two metrics shown here, other metrics include Mean Absolute Deviation (MAD) and Mean Absolute Percentage Error (MAPE).
13. Deploy Model
A regression model is one of the simplest models to deploy and use to make predictions as one simply has to provide the regression equation.
Regression Assumptions
While regression models (such as linear and multiple linear regression) are widely used for predicting numeric target variables, there are scenarios where they may not be the best choice. Regression assumes a linear relationship between predictors and the target variable, but real-world data often violate these assumptions. Below are key situations where regression may not be appropriate, along with alternative machine learning models better suited for each case.
1. When the Relationship Between Predictors and the Target Variable is Non-Linear
Multiple linear regression assumes that the dependent variable is a linear function of the independent variables. If the true relationship is highly non-linear, adding polynomial terms can improve fit, but it often leads to overfitting or poor generalization.
2. When There is Multicollinearity Among Independent Variables
Multicollinearity occurs when predictor variables are highly correlated with each other, making coefficient estimates unstable. Regression struggles because it cannot distinguish between the effects of correlated variables.
3. When the Data Contains Many Outliers
Ordinary least squares (OLS) regression minimizes squared errors, making it highly sensitive to outliers. A few extreme values can disproportionately influence the model, distorting predictions.
4. When the Data is Highly Skewed or Has Heteroscedasticity
If residual variance is not constant (heteroscedasticity), regression coefficients may be inefficient, leading to incorrect inferences. Skewed distributions cause biased coefficient estimates. Log Transformation or Box-Cox Transformation should be used before regression to stabilize variance.
5. When There are High-Dimensional or Sparse Features
If there are more predictors than observations (e.g., genetic data, text data), traditional regression overfits. Many features may be irrelevant, diluting the predictive power of relevant ones.
6. When There are Interactions or Non-Additive Effects Between Predictors
Traditional regression assumes independent effects of predictors unless interaction terms are explicitly added. Complex interactions can be difficult to specify manually.
7. When the Dataset is Large and Regression Becomes Computationally Inefficient
Linear regression performs matrix operations that become computationally expensive in very large datasets. Complex relationships may not be effectively captured with a simple linear function.
8. When the Data is Temporal (Time Series) and Regression Ignores Sequential Dependency
Standard regression does not account for time-based patterns such as trends, seasonality, or autocorrelation. Residuals in regression models may be autocorrelated, violating assumptions.
Summary
Multiple linear regression remains a fundamental technique for predictive modeling across various domains, including finance and healthcare. Through its ability to quantify relationships between multiple predictors and an outcome, it provides valuable insights for decision-making and forecasting. The practical examples presented in this tutorial illustrate how MLR can be applied to real-world problems, emphasizing the importance of model evaluation and diagnostics.
Regression models work well when relationships are linear, independent, homoscedastic, and free from severe multicollinearity or outliers. However, when these assumptions do not hold, alternative machine learning algorithms provide better predictive performance and robustness.