---
title: "Encoding Categorical Features"
params:
  category: 3
  stacks: 0
  number: 207
  time: 45
  level: beginner
  tags: categorical features,feature engineering,one-hot,frequency
  description: "Explains the various encoding schemes for categorical
                variables, including one-hot, frequency, and weight of
                evidence."
date: "<small>`r Sys.Date()`</small>"
author: "<small>Martin Schedlbauer</small>"
email: "m.schedlbauer@neu.edu"
affiliation: "Northeastern University"
output: 
  bookdown::html_document2:
    toc: true
    toc_float: true
    collapsed: false
    number_sections: false
    code_download: true
    theme: spacelab
    highlight: tango
---

---
title: "<small>`r params$category`.`r params$number`</small><br/><span style='color: #2E4053; font-size: 0.9em'>`r rmarkdown::metadata$title`</span>"
---

```{r code=xfun::read_utf8(paste0(here::here(),'/R/_insert2DB.R')), include = FALSE}
```

## Categorical Features

Most machine learning algorithms cannot handle categorical features (variables) unless they are converted to numerical values. The algorithms that cannot directly use categorical features are those that calculate distances in the feature space; that includes kNN, SVM (Support Vector Machine), ANN (Artificial Neural Network), k-means, and regression. In contrast, the Naive Bayes Classifier, Decision Trees, Decision Rules, and Association Rules do not require an encoding of categorical variables.

Categorical features can be divided into two categories:

-   Ordinal (some ordering or ranking), *e.g.*, education
-   Nominal (no particular order), *e.g.*, gender, race, ethnicity

### Encoding Mechanisms

Ordinal categorical features:

-   Label Encoding
-   Boolean Encoding

Nominal categorical features:

-   One-Hot Encoding
-   Count and Frequency Encoding
-   Weight-of-Evidence Encoding

## Encoding Ordinal Features

Ordinal categorical variables can be converted to a numeric scale where the distances between the values reflect the differences in ranks. For example, we could encode education with categorical values (also called *class levels*) of *None*, *High School*, *Bachelor*, *Masters*, and *Doctorate* as 0, 1, 3, 6, and 11. Of course, someone else might choose the encoding 1, 2, 3, 5, 8. The actual encoding is more of an art than a science.

This approach is often called "Label Encoding".
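As a minimal sketch (in Python, for illustration; the lesson's R environment would use a named vector or factor levels in the same way), label encoding is simply a lookup table from class level to the chosen numeric code:

```python
# Label encoding for the ordinal feature "education".
# The codes (0, 1, 3, 6, 11) follow the example in the text;
# the spacing between codes is a subjective modeling choice.
education_codes = {
    "None": 0,
    "High School": 1,
    "Bachelor": 3,
    "Masters": 6,
    "Doctorate": 11,
}

observations = ["Bachelor", "None", "Doctorate", "High School"]
encoded = [education_codes[v] for v in observations]
print(encoded)  # [3, 0, 11, 1]
```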

### Encoding Binary Features

Binary features are categorical features that have two possible values: *true* and *false*. Such features can be encoded with numeric values of 0 (*false*) and 1 (*true*).

## Encoding Nominal Features

Nominal features are multinomial variables with values from a defined set of values without any ordering or ranking of the values and thus no definition of a "distance". A simple example is gender where there are pre-defined values including female, male, non-binary, etc. In this situation it does not make sense to assign a numeric value to the class levels.

Note that some data sets might choose to use a numeric encoding of nominal features for performance reasons and to save storage, but that does not imply an ordering. It is critical that the analyst understands that a numeric feature does not always mean that it is numeric; it could be categorical.

Furthermore, all machine learning algorithms will gladly perform their calculations using numerically encoded nominal features and produce a model. However, the results of the model are likely not usable as the assumptions that the algorithm may have made are likely not correct.

For example, kNN needs to calculate distances between observations and thus needs to perform a subtraction of corresponding feature values. What does it mean to subtract female from non-binary? If they were encoded as female = 1, male = 2, non-binary = 3, unspecified = 99, then female - non-binary = -2. What is the semantics of this? It is a meaningless calculation as subtraction is simply not a defined operation on nominal variables.
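The pitfall can be made concrete with a short sketch (Python, for illustration, using the hypothetical codes from the paragraph above): under this encoding, *female* appears "closer" to *male* than to *non-binary*, and 98 units away from *unspecified* -- pure artifacts of the arbitrary code assignment.

```python
# Hypothetical numeric codes for a nominal feature (from the text).
codes = {"female": 1, "male": 2, "non-binary": 3, "unspecified": 99}

# A distance-based algorithm such as kNN would compute differences like:
d_female_male = abs(codes["female"] - codes["male"])                # 1
d_female_nonbinary = abs(codes["female"] - codes["non-binary"])     # 2
d_female_unspecified = abs(codes["female"] - codes["unspecified"])  # 98

# These quantities are meaningless, yet they would drive
# neighbor selection if the encoding were used directly.
print(d_female_male, d_female_nonbinary, d_female_unspecified)
```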

### One-Hot Encoding

A common encoding scheme is "one-hot encoding", which is also often called "dummy coding". It is quite simple: it converts the nominal feature to several new Boolean features, and the original nominal feature is removed from the data set. Naturally, this increases the number of columns, which can cause problems of its own, so it should only be used when there are very few categorical variables, each with a small number of levels (say, three to five).

> Note that encoding occurs before splitting the data into subsets for training and testing.

The technique is best explained with an example. Let's consider a data set with the categorical feature "eye color" with possible values of {amber, blue, brown, gray, green, hazel, red}. There are 𝔩 = 7 levels. So, we need to create 𝔩 - 1 = 6 new columns -- we always need one fewer column than the number of categories.

Then we create (arbitrarily) six new columns labeled by the category label; generally best to pick the first six. We then encode each value of eye color by placing a 1 into the column that matches the eye color and 0 for the other columns. The one value that does not have a column has 0 in all columns (in the example below, it is the color *red*).

| eyecolor | amber | blue | brown | gray | green | hazel |
|:--------:|:-----:|:----:|:-----:|:----:|:-----:|:-----:|
|   blue   |   0   |  1   |   0   |  0   |   0   |   0   |
|   gray   |   0   |  0   |   0   |  1   |   0   |   0   |
|   red    |   0   |  0   |   0   |  0   |   0   |   0   |
|  amber   |   1   |  0   |   0   |  0   |   0   |   0   |
|  brown   |   0   |  0   |   1   |  0   |   0   |   0   |
|  green   |   0   |  0   |   0   |  0   |   1   |   0   |
|  hazel   |   0   |  0   |   0   |  0   |   0   |   1   |
|   red    |   0   |  0   |   0   |  0   |   0   |   0   |

One-hot encoding can result in a substantial number of additional features (columns) if there are many levels in the categorical variable. Many machine learning algorithms do not perform well if there are too many features, *e.g.*, kNN, k-means, statistical regression, among others. When using such "distance-based" algorithms, one of the other encodings below might be preferable.
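The table above can be reproduced with a minimal sketch (plain Python, for illustration; libraries such as pandas provide `get_dummies()` and scikit-learn provides `OneHotEncoder` for production use). The helper name `one_hot` is our own; note how the dropped level (*red*) becomes the all-zeros reference row:

```python
# One-hot encoding of the "eye color" example from the text.
levels = ["amber", "blue", "brown", "gray", "green", "hazel", "red"]
kept = levels[:-1]  # l - 1 columns; the dropped level ("red") gets all zeros

def one_hot(value, columns=kept):
    # 1 in the matching column, 0 everywhere else
    return {col: int(value == col) for col in columns}

print(one_hot("blue"))  # {'amber': 0, 'blue': 1, 'brown': 0, 'gray': 0, 'green': 0, 'hazel': 0}
print(one_hot("red"))   # all zeros: red is the reference level
```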

### Removing Categorical Features

One-hot encoding is applied independently to each nominal feature, so it can produce a very large number of additional Boolean features, which may overwhelm some algorithms.

Sometimes it may be more beneficial to remove some or all categorical features, particularly for those algorithms that calculate distance such as *kNN* and *SVM*.

### Count or Frequency Encoding

Replace the categorical feature by the count of the observations that show that category in the dataset or by the frequency -- or percentage -- of observations in the dataset. For example, if 10 of 100 observations show the eye color *green*, we would replace any eye color value of *green* by 10 if doing count encoding, or by 0.1 if doing frequency encoding. The categorical variable is now a single-column numeric variable and can be used by distance-based algorithms such as *kNN* or *regression*.
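A minimal sketch of both variants in plain Python (the toy eye-color list is illustrative; with pandas one would typically map `value_counts()` back onto the column):

```python
from collections import Counter

# Toy data: 10 observations of the nominal feature "eye color".
eye_color = ["green", "blue", "green", "brown", "blue", "green",
             "gray", "blue", "green", "brown"]

counts = Counter(eye_color)                  # category -> count
n = len(eye_color)
freqs = {c: counts[c] / n for c in counts}   # category -> relative frequency

# Replace each categorical value with its count or its frequency.
count_encoded = [counts[c] for c in eye_color]
freq_encoded = [freqs[c] for c in eye_color]
print(count_encoded[:3], freq_encoded[:3])  # [4, 3, 4] [0.4, 0.3, 0.4]
```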

### "Weight of Evidence" Encoding

This approach is only applicable for binary classification models where the target variable is binary (positive or negative). For example, in a disease classification scenario, the target variable has the value positive (1) if the disease is present and negative (0) otherwise. Each categorical feature is replaced by

$\log_2\left(\frac{P\left(pos\right)}{P\left(neg\right)}\right)$

where $P(pos)$ is the proportion of observations with a positive target value and $P(neg)$ is the proportion with a negative target value within each category of the categorical variable.
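The formula can be sketched in a few lines of Python. The toy (category, target) pairs are illustrative, not taken from any real dataset; note that this sketch follows the text in using log base 2 (many implementations use the natural log) and assumes every category has at least one positive and one negative observation, since a zero count would make the ratio undefined:

```python
import math

# Toy data: (category, binary target) pairs.
data = [("A", 1), ("A", 1), ("A", 0),
        ("B", 1), ("B", 0), ("B", 0), ("B", 0)]

def woe(category, data):
    """Weight of evidence: log2 of the positive/negative ratio within a category."""
    pos = sum(1 for c, y in data if c == category and y == 1)
    neg = sum(1 for c, y in data if c == category and y == 0)
    return math.log2(pos / neg)  # assumes both counts are non-zero

print(round(woe("A", data), 3))  # log2(2/1) = 1.0
print(round(woe("B", data), 3))  # log2(1/3) ~ -1.585
```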

Using the ubiquitous "titanic" dataset as an example, with "Survived" as the target variable, one of the categorical variables is "Cabin", which can be encoded using *Weight of Evidence* as shown below.

![Weight of Evidence Example Using Titanic Dataset](l-3-207-WoETable.png){width="50%"}

This encoding mechanism is particularly well suited for logistic regression because the logit function is the odds ratio of $P(1)/P(0)$.

> As with the other encodings, the original categorical column must, of course, be removed from the dataset.

### Weight-of-Evidence vs Frequency Encoding

Both weight of evidence (WOE) encoding and frequency encoding are common techniques for encoding categorical variables in the context of machine learning, and their effectiveness can vary depending on the specific dataset and the nature of the categorical variables.

Weight of evidence encoding is commonly used in credit risk modeling and is helpful when dealing with binary classification problems. It calculates the relationship between the categories and the target variable by measuring the odds of the target variable being positive (or negative) for each category. This technique can handle imbalanced datasets and is useful when the categories have a strong association with the target variable.

Frequency encoding, on the other hand, replaces each category with the frequency of its occurrences in the dataset. This can be particularly useful for dealing with high cardinality categorical variables. It simplifies the data representation and can help capture information about the categories based on their frequency.

Ultimately, the choice between WOE encoding and frequency encoding depends on the characteristics of the dataset and the specific machine learning problem at hand. It is often worthwhile to experiment with different encoding methods and compare their performance using cross-validation or other evaluation metrics to determine which one works better for a given modeling effort.

## Conclusion

The handling of categorical features before applying any kind of machine learning algorithm is a critical step in feature engineering. While most of the encoding techniques can be applied, some work better for some algorithms. Weight-of-Evidence works best with logistic regression and with some non-linear models such as decision trees. If there are only a few categorical features with a small number of class levels, then one-hot encoding works just fine. Count or frequency encoding is useful when there are many nominal categorical variables. For ordinal categorical data, simple label encoding can be used.

------------------------------------------------------------------------

## Files & Resources

```{r zipFiles, echo=FALSE}
zipName = sprintf("LessonFiles-%s-%s.zip", 
                 params$category,
                 params$number)

textALink = paste0("All Files for Lesson ", 
               params$category,".",params$number)

# downloadFilesLink() is included from _insert2DB.R
knitr::raw_html(downloadFilesLink(".", zipName, textALink))
```

------------------------------------------------------------------------

## References

No references.

## Errata

None collected yet. [Let us know](https://form.jotform.com/212187072784157){target="_blank"}.

```{r code=xfun::read_utf8(paste0(here::here(),'/R/_deployKnit.R')), include = FALSE}
```
