Motivation
Multicollinearity is an important issue that must be addressed when developing machine learning models, particularly regression models – and especially when using ordinary least squares (OLS) regression. Unfortunately, in practice many data analytics professionals ignore multicollinearity and consequently create models that perform poorly, i.e., resulting in predictions with high variance and poor accuracy. The result is that the predictive model is not useful.
The issue may be that many data analytics professionals do not have a mathematics, statistics, or computer science background and therefore are often not aware of some of the critical underpinnings and assumptions of many machine learning algorithms. They often falsely assume that if the algorithm produces a model, that the model is “good”. Nothing can be further from the truth; just because R or Python didn’t crash doesn’t mean that one has built a useful model that makes usable predictions.
This tutorial takes a closer look at a particularly pernicious problem most often encountered when building a regression model.
Assumptions of Regression Methods
There are several different ways to build a regression model: a linear equation relating the features and weights in an additive manner.
Regression makes the assumption that the features (the independent variables) and the target variable (the dependent variable) all are reasonably normally distributed. In other words, they follow a Gaussian or Gauss-Laplace Distribution. If a variable is not normally distributed, as established through visual inspection of the variable’s histogram or a statistical test such as the Shapiro-Wilk Normality Test or the Kolmogorov–Smirnov Test, then one or more transforms are applied which may change the distribution just enough to be meet the normality criterion.
Another important assumption is the assumption of equal variance, or heteroscedasticity. If the variance is uneven, then there are likely clusters and different regression models must be constructed for the different clusters. Applying a clustering algorithm such as k-means can often reveal the clusters computationally rather than through visual inspection.
A third assumption is that the variables are independent and are not correlated to one another but rather only to the target variable. This assumption is the topic of this tutorial.
LS0tCnRpdGxlOiAiRWZmZWN0cyBvZiBNdWx0aWNvbGxpbmVhcml0eSBpbiBSZWdyZXNzaW9uIgpwYXJhbXM6CiAgY2F0ZWdvcnk6IDMKICBzdGFja3M6ICIxMCwxMSIKICBudW1iZXI6IDIwMQogIHRpbWU6IDEwCiAgbGV2ZWw6IGJlZ2lubmVyCiAgdGFnczogcmVncmVzc2lvbixzdGF0aXN0aWNzCiAgZGVzY3JpcHRpb246ICJFeHBsYWlucyB0aGUgZWZmZWN0cyBvZiBjb3JyZWxhdGlvbgogICAgICAgICAgICAgICAgYmV0d2VlbiBmZWF0dXJlcyBhbmQgdW5lcXVhbCB2YXJpYW5jZQogICAgICAgICAgICAgICAgYWNyb3NzIGEgZmVhdHVyZSBzcGFjZS4iCmRhdGU6ICI8c21hbGw+YHIgU3lzLkRhdGUoKWA8L3NtYWxsPiIKYXV0aG9yOiAiPHNtYWxsPk1hcnRpbiBTY2hlZGxiYXVlcjwvc21hbGw+IgplbWFpbDogIm0uc2NoZWRsYmF1ZXJAbmV1LmVkdSIKYWZmaWxpdGF0aW9uOiAiTm9ydGhlYXN0ZXJuIFVuaXZlcnNpdHkiCm91dHB1dDogCiAgYm9va2Rvd246Omh0bWxfZG9jdW1lbnQyOgogICAgdG9jOiB0cnVlCiAgICB0b2NfZmxvYXQ6IHRydWUKICAgIGNvbGxhcHNlZDogZmFsc2UKICAgIG51bWJlcl9zZWN0aW9uczogZmFsc2UKICAgIGNvZGVfZG93bmxvYWQ6IHRydWUKICAgIHRoZW1lOiBzcGFjZWxhYgogICAgaGlnaGxpZ2h0OiB0YW5nbwotLS0KCi0tLQp0aXRsZTogIjxzbWFsbD5gciBwYXJhbXMkY2F0ZWdvcnlgLmByIHBhcmFtcyRudW1iZXJgPC9zbWFsbD48YnIvPjxzcGFuIHN0eWxlPSdjb2xvcjogIzJFNDA1MzsgZm9udC1zaXplOiAwLjllbSc+YHIgcm1hcmtkb3duOjptZXRhZGF0YSR0aXRsZWA8L3NwYW4+IgotLS0KCmBgYHtyIGNvZGU9eGZ1bjo6cmVhZF91dGY4KHBhc3RlMChoZXJlOjpoZXJlKCksJy9SL19pbnNlcnQyREIuUicpKSwgaW5jbHVkZSA9IEZBTFNFfQpgYGAKCi0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLQoKIyMjIyBDb250ZW50cyBhdCBhIEdsYW5jZQoKLSAgIFtNb3RpdmF0aW9uXSgjbW90aXZhdGlvbikKLSAgIFtBc3N1bXB0aW9ucyBvZiBSZWdyZXNzaW9uIE1ldGhvZHNdKCNhc3N1bXB0aW9ucykKCi0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLQoKIyMgTW90aXZhdGlvbiB7I21vdGl2YXRpb259CgpNdWx0aWNvbGxpbmVhcml0eSBpcyBhbiBpbXBvcnRhbnQgaXNzdWUgdGhhdCBtdXN0IGJlIGFkZHJlc3NlZCB3aGVuIGRldmVsb3BpbmcgbWFjaGluZSBsZWFybmluZyBtb2RlbHMsIHBhcnRpY3VsYXJseSByZWdyZXNzaW9uIG1vZGVscyAtLSBhbmQgZXNwZWNpYWxseSB3aGVuIHVzaW5nIG9yZGluYXJ5IGxlYXN0IHNxdWFyZXMgKE9MUykgcmVncmVzc2lvbi4gVW5mb3J0dW5hdGVseSwgaW4gcHJhY3RpY2UgbWFueSBkYXRhIGFuYWx5dGljcyBwcm9mZXNzaW9uYWxzIGlnbm9yZSBtdWx0aWNvbGxpbmVhcml0eSBhbmQgY29uc2VxdWVudGx5IGNyZWF0ZSBtb2RlbHMgdGhhdCBwZXJmb3JtIHBvb3JseSwgKmkuZS4qLCByZXN1bHRpbmcgaW4gcHJlZGljdGlvbnMgd2l0aCBoaWdoIHZhcmlhbmNlIGFuZCBwb29yIGFjY3VyYWN5LiBUaGUgcmVzdWx0IGlzIHRoYXQgdGhlIHByZWRpY3RpdmUgbW9kZWwgaXMgbm90IHVzZWZ1bC4KClRoZSBpc3N1ZSBtYXkgYmUgdGhhdCBtYW55IGRhdGEgYW5hbHl0aWNzIHByb2Zlc3Npb25hbHMgZG8gbm90IGhhdmUgYSBtYXRoZW1hdGljcywgc3RhdGlzdGljcywgb3IgY29tcHV0ZXIgc2NpZW5jZSBiYWNrZ3JvdW5kIGFuZCB0aGVyZWZvcmUgYXJlIG9mdGVuIG5vdCBhd2FyZSBvZiBzb21lIG9mIHRoZSBjcml0aWNhbCB1bmRlcnBpbm5pbmdzIGFuZCBhc3N1bXB0aW9ucyBvZiBtYW55IG1hY2hpbmUgbGVhcm5pbmcgYWxnb3JpdGhtcy4gVGhleSBvZnRlbiBmYWxzZWx5IGFzc3VtZSB0aGF0IGlmIHRoZSBhbGdvcml0aG0gcHJvZHVjZXMgYSBtb2RlbCwgdGhhdCB0aGUgbW9kZWwgaXMgImdvb2QiLiBOb3RoaW5nIGNhbiBiZSBmdXJ0aGVyIGZyb20gdGhlIHRydXRoOyBqdXN0IGJlY2F1c2UgUiBvciBQeXRob24gZGlkbid0IGNyYXNoIGRvZXNuJ3QgbWVhbiB0aGF0IG9uZSBoYXMgYnVpbHQgYSB1c2VmdWwgbW9kZWwgdGhhdCBtYWtlcyB1c2FibGUgcHJlZGljdGlvbnMuCgpUaGlzIHR1dG9yaWFsIHRha2VzIGEgY2xvc2VyIGxvb2sgYXQgYSBwYXJ0aWN1bGFybHkgcGVybmljaW91cyBwcm9ibGVtIG1vc3Qgb2Z0ZW4gZW5jb3VudGVyZWQgd2hlbiBidWlsZGluZyBhIHJlZ3Jlc3Npb24gbW9kZWwuCgojIyBBc3N1bXB0aW9ucyBvZiBSZWdyZXNzaW9uIE1ldGhvZHMgeyNhc3N1bXB0aW9uc30KClRoZXJlIGFyZSBzZXZlcmFsIGRpZmZlcmVudCB3YXlzIHRvIGJ1aWxkIGEgcmVncmVzc2lvbiBtb2RlbDogYSBsaW5lYXIgZXF1YXRpb24gcmVsYXRpbmcgdGhlIGZlYXR1cmVzIGFuZCB3ZWlnaHRzIGluIGFuIGFkZGl0aXZlIG1hbm5lci4KClJlZ3Jlc3Npb24gbWFrZXMgdGhlIGFzc3VtcHRpb24gdGhhdCB0aGUgZmVhdHVyZXMgKHRoZSBpbmRlcGVuZGVudCB2YXJpYWJsZXMpIGFuZCB0aGUgdGFyZ2V0IHZhcmlhYmxlICh0aGUgZGVwZW5kZW50IHZhcmlhYmxlKSBhbGwgYXJlIHJlYXNvbmFibHkgbm9ybWFsbHkgZGlzdHJpYnV0ZWQuIEluIG90aGVyIHdvcmRzLCB0aGV5IGZvbGxvdyBhIFtHYXVzc2lhbiBvciBHYXVzcy1MYXBsYWNlIERpc3RyaWJ1dGlvbl0oaHR0cHM6Ly9lbi53aWtpcGVkaWEub3JnL3dpa2kvTm9ybWFsX2Rpc3RyaWJ1dGlvbikuIElmIGEgdmFyaWFibGUgaXMgbm90IG5vcm1hbGx5IGRpc3RyaWJ1dGVkLCBhcyBlc3RhYmxpc2hlZCB0aHJvdWdoIHZpc3VhbCBpbnNwZWN0aW9uIG9mIHRoZSB2YXJpYWJsZSdzIGhpc3RvZ3JhbSBvciBhIHN0YXRpc3RpY2FsIHRlc3Qgc3VjaCBhcyB0aGUgW1NoYXBpcm8tV2lsayBOb3JtYWxpdHkgVGVzdF0oaHR0cHM6Ly9lbi53aWtpcGVkaWEub3JnL3dpa2kvU2hhcGlybyVFMiU4MCU5M1dpbGtfdGVzdCkgb3IgdGhlIFtLb2xtb2dvcm92LS1TbWlybm92IFRlc3RdKGh0dHBzOi8vZW4ud2lraXBlZGlhLm9yZy93aWtpL0tvbG1vZ29yb3YlRTIlODAlOTNTbWlybm92X3Rlc3QpLCB0aGVuIG9uZSBvciBtb3JlIHRyYW5zZm9ybXMgYXJlIGFwcGxpZWQgd2hpY2ggbWF5IGNoYW5nZSB0aGUgZGlzdHJpYnV0aW9uIGp1c3QgZW5vdWdoIHRvIGJlIG1lZXQgdGhlIG5vcm1hbGl0eSBjcml0ZXJpb24uCgpBbm90aGVyIGltcG9ydGFudCBhc3N1bXB0aW9uIGlzIHRoZSBhc3N1bXB0aW9uIG9mIGVxdWFsIHZhcmlhbmNlLCBvciBbaGV0ZXJvc2NlZGFzdGljaXR5XShodHRwczovL2VuLndpa2lwZWRpYS5vcmcvd2lraS9IZXRlcm9zY2VkYXN0aWNpdHkpLiBJZiB0aGUgdmFyaWFuY2UgaXMgdW5ldmVuLCB0aGVuIHRoZXJlIGFyZSBsaWtlbHkgY2x1c3RlcnMgYW5kIGRpZmZlcmVudCByZWdyZXNzaW9uIG1vZGVscyBtdXN0IGJlIGNvbnN0cnVjdGVkIGZvciB0aGUgZGlmZmVyZW50IGNsdXN0ZXJzLiBBcHBseWluZyBhIGNsdXN0ZXJpbmcgYWxnb3JpdGhtIHN1Y2ggYXMgKmstbWVhbnMqIGNhbiBvZnRlbiByZXZlYWwgdGhlIGNsdXN0ZXJzIGNvbXB1dGF0aW9uYWxseSByYXRoZXIgdGhhbiB0aHJvdWdoIHZpc3VhbCBpbnNwZWN0aW9uLgoKQSB0aGlyZCBhc3N1bXB0aW9uIGlzIHRoYXQgdGhlIHZhcmlhYmxlcyBhcmUgaW5kZXBlbmRlbnQgYW5kIGFyZSBub3QgY29ycmVsYXRlZCB0byBvbmUgYW5vdGhlciBidXQgcmF0aGVyIG9ubHkgdG8gdGhlIHRhcmdldCB2YXJpYWJsZS4gVGhpcyBhc3N1bXB0aW9uIGlzIHRoZSB0b3BpYyBvZiB0aGlzIHR1dG9yaWFsLgo=