Considerations
Categorical Features
- Categorical variables must be encoded the same way as we did for kNN, so all of those techniques apply. The lm() function in R will automatically encode any categorical variable of type “factor”, but by default it always uses dummy coding (one-hot encoding with one reference level dropped). For that encoding the distribution does not matter, but since regression calculates “distance” (from the regression line), it will be sensitive to categorical variables and may not work so well... decision trees will likely work better.
- If you encode a categorical variable as a numeric variable, e.g., with weight-of-evidence, then it should be reasonably normally distributed, as regression is a parametric statistical technique that presumes normality (strictly speaking, normality of the residuals, but a heavily skewed feature can still produce high-leverage points).
- If you can’t encode the variable, then omit it.
- See my answer for (1): R’s default encoding creates a separate 0/1 column for every level except the reference level, which isn’t great if you have many levels. You get many more variables, and that messes up regression, which is not great at dealing with many dimensions (variables/columns). Remember that if you encode (or R encodes) a categorical variable with “n” levels, you introduce “n-1” new dimensions/variables/columns... this will likely reduce your Adjusted R^2 and your predictive power.
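To make the “n levels become n-1 columns” point concrete, here is a minimal sketch in plain Python (not R). The helper name dummy_encode and the color feature are made up for illustration; the sketch only mimics the idea of dropping one reference level and is not how lm() is implemented.

```python
def dummy_encode(values, drop_first=True):
    """Encode category labels as 0/1 indicator columns.

    With drop_first=True the first level (in sorted order) acts as the
    reference level and gets no column, so n levels yield n-1 columns.
    """
    levels = sorted(set(values))
    kept = levels[1:] if drop_first else levels
    return {level: [1 if v == level else 0 for v in values] for level in kept}


# Hypothetical feature with 4 levels: blue, green, red, yellow
color = ["red", "green", "blue", "green", "yellow", "red"]
encoded = dummy_encode(color)

print(sorted(encoded))  # ['green', 'red', 'yellow'] ("blue" is the reference)
print(len(encoded))     # 3 new columns for 4 levels
```

A feature with 50 levels would add 49 columns this way, which is where the concern about dimensionality comes from.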
Ordinal Categorical Features
- Ordered categorical variables are best converted to a numeric column, and you need to choose the numbers. Example: education with levels No HS, HS, College, Graduate, Doctorate can be ordered, and you could assign 0, 1, 2, 3, 4 as numeric equivalents. But now you have stated that the “distance” between HS and College is 1 and that the distance between Graduate and Doctorate is also 1. Is that true? Is that appropriate for your domain? Perhaps, perhaps not... maybe the intervals are not equidistant and the encoding should be 0, 1, 3, 6, 10. This is domain dependent and might require some experimentation.
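The two candidate encodings for the education example can be sketched as simple lookup tables. This is plain Python for illustration; the names EQUAL_STEPS, WIDENING_STEPS, and encode_ordinal are hypothetical, and which mapping is appropriate remains a domain decision.

```python
# Two candidate numeric codings for the same ordered levels.
EQUAL_STEPS = {"No HS": 0, "HS": 1, "College": 2, "Graduate": 3, "Doctorate": 4}
WIDENING_STEPS = {"No HS": 0, "HS": 1, "College": 3, "Graduate": 6, "Doctorate": 10}


def encode_ordinal(values, mapping):
    """Replace each level with its numeric code; unknown levels raise KeyError."""
    return [mapping[v] for v in values]


education = ["HS", "College", "Doctorate", "No HS", "Graduate"]
equal = encode_ordinal(education, EQUAL_STEPS)
widening = encode_ordinal(education, WIDENING_STEPS)

print(equal)     # [1, 2, 4, 0, 3]
print(widening)  # [1, 3, 10, 0, 6]
```

One way to experiment is to fit the model once with each coding and compare the validation error.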
Feature Scaling
Scale kind of matters for regression, but not to the same extent as for kNN. It is still often best to normalize the scales using the same methods we used for kNN (and will use for support vector machines and k-means clustering). Of course, normalize the scales only after you have done any categorical encoding, outlier detection, imputation, and transformation.
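The notes do not name the normalization methods, so the sketch below assumes the two most common ones, min-max scaling and z-score standardization. It is plain Python, and the function names and the income values are made up for illustration.

```python
from statistics import mean, pstdev


def min_max_scale(xs):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in xs]
    return [(x - lo) / (hi - lo) for x in xs]


def z_score_scale(xs):
    """Center on the mean and divide by the population standard deviation."""
    mu, sigma = mean(xs), pstdev(xs)
    if sigma == 0:  # constant column: avoid division by zero
        return [0.0 for _ in xs]
    return [(x - mu) / sigma for x in xs]


# Hypothetical feature on a large scale
income = [20_000, 35_000, 50_000, 80_000, 120_000]
scaled = min_max_scale(income)
standardized = z_score_scale(income)

print(scaled)
print(standardized)
```

In practice the scaling parameters (min, max, mean, standard deviation) would be computed on the training data and reused for the validation data.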
Multicollinearity
Lastly, did you do any analysis of multicollinearity, i.e., determine whether any pair of features is strongly correlated? If so, you need to eliminate one of the two features, as you are “double counting” them in the model.
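A pairwise check like this can be sketched with Pearson correlations. The code is plain Python; the 0.8 cutoff, the function names, and the sample features are assumptions for illustration, not values from the notes.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlated_pairs(features, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for (a, xs), (b, ys) in combinations(features.items(), 2):
        r = pearson(xs, ys)
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged


# Hypothetical housing features
features = {
    "sqft": [800, 1000, 1200, 1500, 2000],
    "rooms": [2, 3, 3, 4, 5],
    "age": [30, 5, 22, 14, 9],
}
flagged = correlated_pairs(features)
print(flagged)  # only the sqft/rooms pair exceeds the cutoff
```

For a flagged pair, one of the two features would be dropped before fitting the model.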
Statistical Significance
The F-stat is meaningful to interpret as long as the overall p-value for the regression model is < 0.05. Remember that the 0.05 threshold for statistical significance is somewhat arbitrary and set by convention... the argument recently has been that it should be lower. I tend to agree with that and would argue that it should really be 0.01.
Evaluation
You definitely need to split the data into training and validation sets and then use the validation data to calculate MSE and/or MAD. That is particularly important if you want to compare a regression model to a non-statistical model such as kNN or a regression tree: for those algorithms Adjusted R^2 is not defined, so we need a different way to evaluate and compare models, and MAD and/or MSE can be useful.
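The split-then-score workflow can be sketched as follows. It is plain Python on synthetic data; the 70/30 split, the single-feature least-squares fit, and all names are assumptions for illustration, and MAD is taken here to mean the mean absolute deviation of the prediction errors.

```python
import random


def fit_simple_ols(xs, ys):
    """Least-squares fit of y = intercept + slope * x (one feature)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope


def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mad(actual, predicted):
    """Mean absolute deviation of the prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Synthetic data (made up): y = 3 + 2x plus Gaussian noise
rng = random.Random(42)
xs = [float(i) for i in range(50)]
ys = [3.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]

# Shuffle the row indices, then hold out 30% for validation
idx = list(range(len(xs)))
rng.shuffle(idx)
cut = int(0.7 * len(idx))
train, valid = idx[:cut], idx[cut:]

# Fit on the training rows only, score on the held-out rows
intercept, slope = fit_simple_ols([xs[i] for i in train], [ys[i] for i in train])
predicted = [intercept + slope * xs[i] for i in valid]
actual = [ys[i] for i in valid]

print("validation MSE:", mse(actual, predicted))
print("validation MAD:", mad(actual, predicted))
```

Because mse and mad only need actual and predicted values, the same two functions can score predictions from kNN or a regression tree on the same validation rows, which makes the models directly comparable.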
