
---
title: "Ordinary Least Squares Regression"
params:
  category: 3
  stacks: 0
  number: 441
  time: 10
  level: beginner
  tags: regression,ols,machine learning
  description: "Introduces linear regression and statistical learners. Shows
                how to build and evaluate a regression model."
date: "<small>`r Sys.Date()`</small>"
author: "<small>Martin Schedlbauer</small>"
email: "m.schedlbauer@neu.edu"
affilitation: "Northeastern University"
output: 
  bookdown::html_document2:
    toc: true
    toc_float: true
    collapsed: false
    number_sections: false
    code_download: true
    theme: spacelab
    highlight: tango
---

---
title: "<small>`r params$category`.`r params$number`</small><br/><span style='color: #2E4053; font-size: 0.9em'>`r rmarkdown::metadata$title`</span>"
---

```{r code=xfun::read_utf8(paste0(here::here(),'/R/_insert2DB.R')), include = FALSE}
```

## Motivation

## Considerations

### Categorical Features

1)  Categorical variables must be encoded the same way as we did for kNN, so all of those techniques apply. The `lm()` function in R will automatically encode any categorical variable of type "factor", but it will always use one-hot (dummy) encoding. For that encoding the distribution does not matter; however, since regression calculates "distance" (from the regression line), it is sensitive to categorical variables and may not work so well with them -- decision trees will likely work better.

If you encode a categorical variable into a numeric variable, such as with weight-of-evidence, then it will of course have to be reasonably normally distributed -- regression is a parametric statistical technique that presumes normality.

If you cannot encode a variable, omit it.

2)  See the answer for (1). R always uses one-hot encoding, which is not great if a variable has many levels: you get many more variables, and that hurts regression, as regression does not deal well with many dimensions (variables/columns). Remember, if you (or R) encode a categorical variable with "n" levels, you introduce "n-1" new dimensions/variables/columns, which will likely reduce your Adjusted R\^2 and your predictive power. The sketch below shows how `lm()` performs this encoding.
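
As an illustration, here is a minimal sketch of how `lm()` expands a factor into dummy columns; the data frame and its column names are made up for this example. `model.matrix()` shows the design matrix that `lm()` actually fits against.

```{r dummyEncodingSketch}
# Illustrative data; the columns and values are made up
df <- data.frame(
  price = c(210, 340, 275, 455, 198, 510),
  type  = factor(c("condo", "single", "condo", "single", "condo", "single"))
)

# lm() expands the factor automatically: a factor with n levels
# becomes n - 1 dummy columns in the design matrix (here "typesingle")
m <- lm(price ~ type, data = df)
summary(m)

# Inspect the design matrix that lm() builds internally
model.matrix(~ type, data = df)
```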

### Ordinal Categorical Features

3)  Ordered categorical variables are best converted to a numeric column, but you need to choose the numbers. Example: education with the levels No HS, HS, College, Graduate, Doctorate can be ordered, and you could assign 0, 1, 2, 3, 4 as numeric equivalents. But now you have stated that the "distance" between HS and College is 1 and that the distance between Graduate and Doctorate is also 1 -- is that true? Is that appropriate for your domain? Perhaps, perhaps not. Maybe the intervals are not equidistant, and the encoding should instead be 0, 1, 3, 6, 10. It is domain dependent and might require some experimentation; a short sketch of both encodings follows.
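
Here is a short sketch of both encodings; the education levels and the alternative scores (0, 1, 3, 6, 10) are just the illustrative values from above.

```{r ordinalEncodingSketch}
# Ordered factor with the example education levels
education <- factor(
  c("HS", "College", "No HS", "Doctorate", "Graduate"),
  levels = c("No HS", "HS", "College", "Graduate", "Doctorate"),
  ordered = TRUE
)

# Equidistant encoding: 0, 1, 2, 3, 4
edu_equal <- as.integer(education) - 1

# Domain-chosen (non-equidistant) encoding: 0, 1, 3, 6, 10
scores <- c("No HS" = 0, "HS" = 1, "College" = 3, "Graduate" = 6, "Doctorate" = 10)
edu_custom <- as.numeric(scores[as.character(education)])
```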

### Transformation

4)  No. If you transform a variable, e.g., with a log transform, that transform applies only to that variable. Of course, if you make a prediction, you need to take the "input" value and transform it first before inserting it into the regression equation (see the sketch after the next paragraph). It might make more sense to log-transform one variable, not transform another because it is "normal enough", inverse-transform a third, and square-root-transform yet another; the transform to use depends on the initial distribution of the variable, so again, experimentation is required to find which one works best. Building a model requires lots of experimentation and evaluation. Also, your data might have clusters, and it may make sense to split the data into those clusters and build different models for different clusters. Example: predicting home sale prices. You will likely have clusters -- starter homes, condos, single-family homes, really nice homes, mansions -- and you might not be able to build one model that predicts the price of all of them; it may be better to build different models, so you end up with different regression equations, regression trees, or regression kNN models, or ensembles of them.

Now, how do you find the clusters? In the example above it was from domain knowledge -- I "knew" from experience that dividing by home type might be appropriate. Sometimes you don't know, so you either use visualization (like a scatter plot) or an automated cluster-detection method such as k-means (which we'll see later in the course; it is an unsupervised machine learning method, aka a data mining method).
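
The sketch below (with made-up data) shows a log transform of one predictor and the matching transform of a new input at prediction time.

```{r transformSketch}
# Illustrative data with a right-skewed predictor
homes <- data.frame(
  income = c(32000, 58000, 41000, 125000, 76000, 240000),
  spend  = c(1800, 2600, 2100, 5200, 3400, 8100)
)

# Log-transform the skewed predictor before fitting
homes$log_income <- log(homes$income)
fit <- lm(spend ~ log_income, data = homes)

# A new input must be transformed the same way before predicting
new_income <- 90000
predict(fit, newdata = data.frame(log_income = log(new_income)))
```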

### Feature Scaling

Scale matters for regression, but not to the same extent as for kNN. It is still often best to normalize the scales using the same methods we used for kNN (and will use for support vector machines and k-means clustering). Of course, you do the normalization of the scales after you have done any categorical encoding, outlier detection, imputation, and transformation, as in the sketch below.
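
A minimal sketch of min-max normalization, the same approach used in the kNN lesson; the feature values here are made up.

```{r scalingSketch}
# Min-max normalization to the [0, 1] range
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

features <- data.frame(sqft = c(850, 1200, 2400, 3100),
                       age  = c(42, 15, 8, 3))

# Apply only after encoding, outlier handling, imputation, and transformation
features_norm <- as.data.frame(lapply(features, normalize))
features_norm
```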

### Multicollinearity

Lastly, did you do any analysis of multicollinearity, i.e., determine whether any pair of features has a strong correlation? If so, you need to eliminate one of the two features, as you would otherwise be "double counting" them in the model.
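
A quick way to check is a pairwise correlation matrix; the 0.8 cutoff below is an illustrative choice, not a fixed rule.

```{r collinearitySketch}
# Illustrative numeric features; sqft and rooms are deliberately correlated
features <- data.frame(sqft  = c(850, 1200, 2400, 3100, 1750),
                       rooms = c(3, 4, 8, 10, 6),
                       age   = c(42, 15, 8, 3, 22))

cor_matrix <- cor(features)

# Flag pairs with |correlation| above the cutoff (upper triangle avoids duplicates)
which(abs(cor_matrix) > 0.8 & upper.tri(cor_matrix), arr.ind = TRUE)
```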

### Statistical Significance

The F-statistic is meaningful to interpret as long as the overall p-value for the regression model is \< 0.05 (or, I would argue, \< 0.01). Remember that the 0.05 threshold for statistical significance is somewhat arbitrary and a matter of convention; the recent argument has been that it should be lower -- I tend to agree and would argue it should really be 0.01.
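
The overall p-value can be computed from the F-statistic reported by `summary()`; the small data set below is only for illustration.

```{r fstatSketch}
# Illustrative fit
d <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                y = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2))
m <- lm(y ~ x, data = d)

# summary()$fstatistic holds the F value and its degrees of freedom
f <- summary(m)$fstatistic
overall_p <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
overall_p < 0.01   # significant at the stricter 0.01 threshold?
```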

## Evaluation

You definitely need to split the data into training and validation sets and then use the validation data to calculate MSE and/or MAD. That is particularly important if you want to compare a regression model to a non-statistical model such as kNN or a regression tree, since for those algorithms Adjusted R\^2 is not defined and we need a different way to evaluate and compare models; MAD and/or MSE can be useful.
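
A minimal sketch of a holdout split with MSE and MAD on the validation data; the simulated data and the 80/20 split are illustrative choices.

```{r evalSketch}
set.seed(123)

# Simulated data for illustration
d <- data.frame(x = runif(100, 0, 10))
d$y <- 3 + 2 * d$x + rnorm(100)

# 80/20 train/validation split
train_idx <- sample(nrow(d), size = 0.8 * nrow(d))
train <- d[train_idx, ]
valid <- d[-train_idx, ]

fit  <- lm(y ~ x, data = train)
pred <- predict(fit, newdata = valid)

mse <- mean((valid$y - pred)^2)      # mean squared error
mad <- mean(abs(valid$y - pred))     # mean absolute deviation
c(MSE = mse, MAD = mad)
```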

## Summary

------------------------------------------------------------------------

## Files & Resources

```{r zipFiles, echo=FALSE}
zipName = sprintf("LessonFiles-%s-%s.zip", 
                 params$category,
                 params$number)

textALink = paste0("All Files for Lesson ", 
               params$category,".",params$number)

# downloadFilesLink() is included from _insert2DB.R
knitr::raw_html(downloadFilesLink(".", zipName, textALink))
```

------------------------------------------------------------------------

## References

No references.

## Errata

[Let us know](https://form.jotform.com/212187072784157){target="_blank"}.
