Be sure that you go through Lesson 3.207 before working through this tutorial.
This code tutorial demonstrates five common encoding schemes for categorical variables using the classic Titanic Dataset:
Label Encoding and Boolean Encoding for ordinal categorical features
One-Hot Encoding, Count and Frequency Encoding, and Weight-of-Evidence Encoding for nominal categorical features.
Let’s start by loading the “Titanic Survivors” data set into a data frame.
df <- read.csv("titanic.csv")
Now, let’s inspect the data and see what kinds of features we have.
head(df,5)
Let’s learn a bit more about the structure of the rows (observations or cases) and the columns (features).
str(df)
## 'data.frame': 887 obs. of 8 variables:
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Mr. Owen Harris Braund" "Mrs. John Bradley (Florence Briggs Thayer) Cumings" "Miss. Laina Heikkinen" "Mrs. Jacques Heath (Lily May Peel) Futrelle" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 27 54 2 27 14 ...
## $ Siblings.Spouses.Aboard: int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parents.Children.Aboard: int 0 0 0 0 0 0 0 1 2 0 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
From this we can see that there are three categorical features: Survived, Pclass, and Sex.
The remaining variables are either discrete numeric, continuous numeric, or qualitative (e.g., Name).
Let’s assume that we want to predict whether a passenger might survive or not. This means that Survived is a binary target feature and thus we would not need to encode it. We are left with encoding Pclass and Sex.
So, let’s take a look at the different ways we might want to encode the categorical variables. For some, there might be more than one approach. We will take a look first at Pclass and then at Sex.
The variable Pclass encodes a passenger’s class of ticket, e.g., first class versus third class.
Since Pclass is ordinal and there is a ranking, we can use it as is. However, it may still make sense to encode it, to ensure the correct order and proper intervals. For example, first class (1) ranks above third class (3), yet \(1 < 3\). Also, the difference between second class and third class is not as great as the difference between first class and second class. So, an encoding of 9 for First Class (1), 4 for Second Class (2), and 1 for Third Class (3) might make more sense. In practice, you will need to experiment with different numeric values and intervals to see which yields the better model performance.
So, let’s apply that encoding to the data set.
df$Pclass <- ifelse(df$Pclass == 1, 9, ifelse(df$Pclass == 2, 4, 1))
Let’s see if the encoding was applied properly.
head(df,5)
And, indeed, it appears to be – though, of course, we should make sure by inspecting the whole data set, not just the first few rows.
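One quick way to check all rows at once is to tabulate the encoded values; after the encoding above, only the values 1, 4, and 9 should appear, with the same counts as the original classes.
# tabulate all values of Pclass; only 1, 4, and 9 should appear
table(df$Pclass)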
A common encoding scheme is “one-hot encoding”, also often called “dummy coding”. It is quite simple: the nominal feature is replaced with several new Boolean features, one per level. Naturally, this increases the number of columns, which may cause other issues, so it should only be used when there are few categorical variables with a small number of levels (say, three to five).
Before we apply this encoding, let’s go back to the original data set – where Pclass has levels 1, 2, and 3, for First Class, Second Class, and Third Class, respectively.
df <- read.csv("titanic.csv")
Since Pclass has three levels, we will need two new Boolean features; we’ll choose First for first class (value 1) and Second for second class (value 2). We always need one column less than the number of levels. For third class, the columns First and Second are both 0.
# create the dummy columns: 1 if the passenger is in the corresponding
# class, 0 otherwise (no separate initialization is needed, since
# ifelse() assigns both the 0s and the 1s)
df$First <- ifelse(df$Pclass == 1, 1, 0)
df$Second <- ifelse(df$Pclass == 2, 1, 0)
Let’s verify that it is encoded as expected. We’ll only list the columns we are interested in.
head(df[c("Pclass", "First", "Second")], 10)
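As an aside, R can build such dummy columns automatically from a factor. Here is a minimal sketch using the built-in model.matrix(); note that it uses the first factor level as the reference, so it creates dummies for levels 2 and 3 rather than for 1 and 2 as we did above:
# build dummy variables from Pclass as a factor; model.matrix() adds an
# intercept column, which we drop to keep only the two dummy columns
dummies <- model.matrix(~ as.factor(Pclass), data = df)[, -1]
head(dummies, 10)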
As an alternative, we could have encoded Pclass using the Weight of Evidence (WoE) encoding scheme. Recall that this approach is only applicable for binary classification models, i.e., where the target variable is binary (positive/negative, true/false, or \(1/0\)).
For this data set, let’s assume that Survived is the target variable. It has the value positive (1) if the passenger survived and negative (0) otherwise.
In the Weight of Evidence Encoding, each category of the categorical feature is replaced by
\(\ln\left(\frac{P\left(pos\right)}{P\left(neg\right)}\right)\)
where \(P(pos)\) is the proportion of all positive target values that fall into the category and \(P(neg)\) is the proportion of all negative target values that fall into it.
Before we apply this encoding, let’s go back to the original data set – where Pclass has levels 1, 2, and 3, for First Class, Second Class, and Third Class tickets.
df <- read.csv("titanic.csv")
We need to calculate the number of occurrences and the empirical probability (frequency) of each category for each outcome. The code below could certainly be written more elegantly using tables, but being more explicit should make it easier to understand.
count.Survived.FirstClass <- length(which((df$Survived == 1) &
(df$Pclass == 1)))
count.Died.FirstClass <- length(which((df$Survived == 0) &
(df$Pclass == 1)))
count.Survived.SecondClass <- length(which((df$Survived == 1) &
(df$Pclass == 2)))
count.Died.SecondClass <- length(which((df$Survived == 0) &
(df$Pclass == 2)))
count.Survived.ThirdClass <- length(which((df$Survived == 1) &
(df$Pclass == 3)))
count.Died.ThirdClass <- length(which((df$Survived == 0) &
(df$Pclass == 3)))
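As an aside, here is the more compact, table-based version alluded to above; it computes the same six counts in one step:
# the same counts as a contingency table: rows are the values of Pclass
# (1, 2, 3), columns are the values of Survived (0 = died, 1 = survived)
counts <- table(df$Pclass, df$Survived)
counts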
Now, we need to calculate the WoE expression: the natural logarithm of the ratio of \(P(1)\) and \(P(0)\).
We also have to guard against the possibility that an empirical probability is 0 when there are no occurrences, which would make the ratio 0 or undefined. We can add a Laplace modifier in the form of adding 0.5 to each count. Naturally, this changes the values slightly, but since it is applied to all counts, it does not affect the results materially.
# total number of passengers who survived and who died
allSurvived <- length(which(df$Survived == 1))
allDied <- length(which(df$Survived == 0))
# first class: proportion of survivors and of deaths, then the WoE
p.S.F <- (count.Survived.FirstClass + 0.5) / allSurvived
p.D.F <- (count.Died.FirstClass + 0.5) / allDied
woe.First <- log(p.S.F / p.D.F)
# second class
p.S.S <- (count.Survived.SecondClass + 0.5) / allSurvived
p.D.S <- (count.Died.SecondClass + 0.5) / allDied
woe.Second <- log(p.S.S / p.D.S)
# third class
p.S.T <- (count.Survived.ThirdClass + 0.5) / allSurvived
p.D.T <- (count.Died.ThirdClass + 0.5) / allDied
woe.Third <- log(p.S.T / p.D.T)
Note that the WoE can be a negative number; this happens when a category contains proportionally more negative outcomes than positive ones.
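Let’s take a quick look at the three values before applying them:
# collect the WoE values in a named vector for inspection
c(First = woe.First, Second = woe.Second, Third = woe.Third)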
The final step is to replace the original values of Pclass with the WoE values, essentially transforming the categorical variable into a continuous numeric variable that can be used directly.
# note: this works here because no WoE value happens to equal 2 or 3;
# in general, compute all the index sets before replacing any values
df$Pclass[which(df$Pclass == 1)] <- woe.First
df$Pclass[which(df$Pclass == 2)] <- woe.Second
df$Pclass[which(df$Pclass == 3)] <- woe.Third
Let’s verify that it is encoded as expected. We’ll only list the columns we are interested in.
head(df["Pclass"], 10)
In this data set, sex (or gender) is a binary (Boolean) variable and thus can be encoded with 0 and 1. If we had more than two levels of gender, then either a one-hot or a count/frequency encoding would be needed, as gender is clearly not ordinal. However, even with just two levels, a frequency or count encoding might be better than a Boolean encoding.
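For reference, here is what a simple Boolean encoding might look like; we place it in a separate vector so that the original values of Sex remain available for the frequency encoding below:
# Boolean encoding into a separate vector: 1 for male, 0 for female
# (which level gets the 1 is an arbitrary choice)
sex.boolean <- ifelse(df$Sex == "male", 1, 0)
head(sex.boolean, 5)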
Let’s do a frequency encoding first for Sex. We first need to get the class levels (the number of different categorical values). Of course, we know it is Boolean, so there are two levels, but let’s pretend we do not know. One approach is to convert Sex, which is <chr> in the data set, to a factor variable. Recall that the type factor is R’s way of encoding a categorical variable (see R Factors).
sex <- as.factor(df$Sex)
levels(sex)
## [1] "female" "male"
l <- length(levels(sex))
So, now we need to count how many passengers are male and how many are female, and compute the corresponding frequencies.
# total number of passengers
n <- nrow(df)
# counts of male and female passengers
m.count <- length(which(df$Sex == "male"))
f.count <- length(which(df$Sex == "female"))
# frequencies: counts relative to the total number of passengers
mf <- m.count / n
ff <- f.count / n
So, the frequency for male is about 0.65, while the frequency for female is about 0.35. We now replace the categorical values with the frequency in the data set.
# ifelse() replaces both levels in one step and yields a numeric column
# (assigning numbers into the character column would coerce them to strings)
df$Sex <- ifelse(df$Sex == "male", mf, ff)
head(df,5)
In a Count Encoding, we would have replaced the categorical values with the counts rather than the frequencies.
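As a sketch, reusing the counts computed above – we reload the data set first (as we did before), since Sex was just overwritten with the frequencies:
# count encoding: replace each category with its count instead of its
# frequency
df <- read.csv("titanic.csv")
df$Sex <- ifelse(df$Sex == "male", m.count, f.count)
head(df, 5)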
This tutorial demonstrated different strategies for encoding ordinal and nominal categorical variables. Of course, sometimes it may be more beneficial to remove some or all categorical features if they are found (through statistical correlation analysis or domain knowledge) to be unimportant or to have insufficient explanatory power. You will need to find out through experimentation and by trying different approaches. Data science is, to a large extent, experimentation, and sometimes more art than science.