Motivation

Among the various machine learning techniques, decision trees have gained widespread use in the banking sector due to their high accuracy and their ability to translate complex statistical models into plain language, making the models’ predictions explainable.

In many countries, lending practices are closely monitored by government agencies to ensure fairness. This scrutiny requires bank executives to clearly articulate the reasons behind loan approvals or rejections, which is where decision trees excel. They offer a clear, interpretable rationale for each decision, which is crucial not only for regulatory compliance but also for providing customers with understandable explanations for their credit ratings.

Automated credit scoring models are commonly employed in scenarios such as credit card mailings and instant online approval processes. In this lesson, we will develop a simple credit approval model in R using C5.0 decision trees. Additionally, we will explore how to fine-tune the model to reduce errors and to direct its bias toward reducing financial loss.

This lesson is an expansion of the tutorial in Chapter 5 of (Lantz, 2019).

Model Explainability

One of the reasons why decision trees remain a commonly used supervised machine learning algorithm is their superior explainability and the ease with which the decision process can be reproduced. Explainability in the context of predictive modeling and supervised machine learning refers to the extent to which the internal mechanics of a machine learning model and its decision-making processes can be understood by humans. It requires making the model’s operations transparent and its predictions interpretable, allowing stakeholders to comprehend how input features are transformed into output predictions.

Explainability is essential and desirable for several reasons:

  1. Trust and Adoption: When the workings of a model are transparent, users and stakeholders are more likely to trust and adopt the technology. Trust is particularly important in fields like healthcare, medicine, finance, banking, and criminal justice, where decisions have significant consequences and often deeply affect people’s lives.

  2. Regulatory Compliance: Many industries are subject to regulations that require clear justifications for decisions, especially those affecting individuals’ lives and finances. For instance, in banking, the Basel III framework and the General Data Protection Regulation (GDPR) necessitate explainability in credit scoring models to ensure fairness and accountability, as well as expose any potential covert bias.

  3. Debugging and Improvement: Explainable models make it easier to identify and rectify errors or biases in the model. Understanding why a model makes certain predictions can highlight areas for improvement and refinement, leading to better performance and more accurate results.

  4. Ethical Responsibility: Machine learning models can inadvertently perpetuate or exacerbate biases present in training data. Explainability helps in identifying such biases, enabling developers to take corrective actions to ensure ethical and fair outcomes.

  5. User Understanding and Engagement: When users understand how a model works, they are more likely to engage with it appropriately. For example, in medical diagnostics, explainability can help doctors understand the rationale behind a model’s prediction, allowing them to make more informed decisions about patient care.

  6. Decision Accountability: In many applications, it is essential to hold decision-makers accountable. Explainable models provide the necessary transparency to ensure that decisions can be scrutinized and justified, thereby maintaining accountability and integrity in automated decision-making systems.

In short, explainability bridges the gap between complex machine learning models and human understanding. It ensures that models are not only accurate but also transparent, fair, as free from bias as possible, and trustworthy, thereby fostering wider acceptance – and responsible use – of machine learning and applied artificial intelligence.

Decision Trees for Classification

Decision tree algorithms are useful for both classification and regression tasks, although we will focus on classification tasks. These algorithms work by recursively partitioning the data into subsets based on feature values, creating a tree-like model of decisions. Each internal node of the tree represents a conditional test on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label or a regression value. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. Decision trees are preferable when features are a mix of categorical and numeric values. Unlike regression methods, decision trees do not rely on any particular distribution of values.

Overview of Decision Tree Algorithms

A decision tree starts with a root node that encompasses the entire dataset. The algorithm then selects the feature that best divides the data into distinct classes or values, based on a criterion like Information Gain or Gini Index. This process is repeated recursively for each child node until one of the stopping conditions is met, such as reaching a maximum depth, having a minimum number of samples per node, or achieving homogeneous nodes. The construction of decision trees on data sets with many observations and a high number of dimensions can be quite time consuming.

Information Gain measures the reduction in entropy or uncertainty about the class labels after splitting the data based on an attribute, while the Gini Index measures the impurity or diversity of the split. It is calculated as the probability of misclassifying a randomly chosen element if it were labeled according to the distribution of labels in the subset. Some decision tree algorithms evaluate the statistical significance of the association between categorical variables or attempt to reduce variance. The latter approach is often used in regression trees, which work to minimize the variance of the target variable within each node.
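
To make these criteria concrete: for a set of examples \(S\) whose class labels occur with proportions \(p_1, \ldots, p_C\), the entropy and the Gini Index are

\[ H(S) = -\sum_{i=1}^{C} p_i \log_2 p_i \qquad Gini(S) = 1 - \sum_{i=1}^{C} p_i^2 \]

and the Information Gain of splitting \(S\) on an attribute \(A\) is

\[ IG(S, A) = H(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|} H(S_v) \]

where \(S_v\) is the subset of \(S\) for which \(A\) takes the value \(v\).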

To prevent overfitting, decision trees often use pruning techniques to remove branches that have little importance. Pruning can be done in a pre-pruning manner (stopping the tree growth early) or post-pruning (removing branches from a fully grown tree).

Common Decision Tree Algorithms

Several decision tree algorithms have been developed over the years, each with its unique approach to building and refining trees. Some of the most notable ones include:

  1. ID3 (Iterative Dichotomiser 3): Developed by Ross Quinlan, ID3 uses Information Gain as its splitting criterion. It builds a tree top-down, starting from a set of training instances and the features that describe them.

  2. C4.5 and C5.0: Also developed by Ross Quinlan, C4.5 is an extension of ID3. It uses Gain Ratio, a modification of Information Gain that reduces its bias towards features with many distinct values. C5.0 is a commercial version that is more efficient and provides better performance than C4.5.

  3. CART (Classification and Regression Trees): Introduced by Breiman et al., CART can be used for both classification and regression tasks. It uses the Gini Index for classification trees and reduction in variance for regression trees. CART produces binary trees, where each node has exactly two children.

  4. CHAID (Chi-squared Automatic Interaction Detector): This algorithm is used primarily for classification and relies on the Chi-Square test to determine the best splits. It is particularly effective for categorical data and handles multi-way splits.

  5. MARS (Multivariate Adaptive Regression Splines): Although not strictly a decision tree, MARS builds models by dividing the data into distinct regions using hinge functions. It is highly effective for regression tasks and can handle interactions between variables.

  6. Random Forest: An ensemble method that constructs multiple decision trees during training and outputs the mode of the classes for classification or the mean prediction for regression of the individual trees. It uses techniques like bagging and feature randomness to improve model accuracy and control overfitting.

  7. Gradient Boosted Trees: Another ensemble technique that builds trees sequentially. Each tree is constructed to correct the errors of the previous ones, focusing on the residual errors. Prominent algorithms in this category include XGBoost, LightGBM, and CatBoost.

  8. Oblique Decision Trees: These trees use linear combinations of features rather than a single feature for splitting nodes. This can lead to more compact and potentially more accurate trees, but such trees are computationally more intensive to build.

Decision tree algorithms offer a robust and interpretable approach to both classification and regression tasks. Their versatility and ease of understanding make them popular in various fields, from finance to healthcare. Each variant of decision tree algorithms brings unique strengths, and understanding these can help in selecting the most appropriate model for a given problem. This lesson will focus on the use of the C5.0 (aka C50) algorithm.

The C5.0 Algorithm

The C5.0 (aka C50) algorithm, developed by Ross Quinlan, is an advanced and commercially available version of the C4.5 algorithm. It is widely used for creating classification models due to its efficiency, accuracy, and interpretability/explainability. The C5.0 algorithm builds a decision tree by recursively splitting the data into subsets based on the attribute that provides the highest information gain ratio. This process continues until the algorithm meets a stopping criterion, such as reaching a specified depth or having too few samples to split further.

How C5.0 Works

  1. Data Preparation: Before building the tree, the C5.0 algorithm requires the data to be prepared, which involves handling missing values, converting categorical data to numeric where necessary, and optionally normalizing numeric data. In practice, C5.0 handles missing values quite well and doesn’t often necessitate removing outliers, imputing missing values, encoding categorical features, or normalizing numeric features.

  2. Tree Construction: The algorithm starts with the entire dataset and evaluates each attribute to find the one that provides the highest information gain ratio. This attribute becomes the root node of the tree. The information gain ratio is a measure of how well an attribute splits the data into distinct classes; it is calculated as the ratio of information gain (reduction in entropy) to the intrinsic information (a measure of the potential information generated by splitting the dataset based on this attribute) – the formula is given after this list. After selecting the root node, the dataset is divided into subsets based on the possible values of the chosen attribute. The algorithm then repeats the process for each subset, selecting the best attribute for further splitting. This recursive process continues, creating branches and leaf nodes, until a stopping criterion is met.

  3. Pruning: To avoid overfitting, the algorithm may use pre-pruning techniques such as limiting the depth of the tree, requiring a minimum number of samples per leaf, or setting a threshold for the information gain ratio. Once the tree is fully grown, post-pruning can be applied to remove branches that contribute little to the overall predictive power of the model. This involves evaluating the performance of subtrees and collapsing those that do not significantly improve accuracy.

  4. Rule Extraction: The C5.0 algorithm can convert the decision tree into a set of IF-THEN rules, which are often easier to interpret and understand. Each path from the root to a leaf node corresponds to a rule.

  5. Boosting: One of the notable features of C5.0 is its ability to use boosting, an ensemble technique that combines multiple weak classifiers to form a strong classifier. In boosting, the algorithm iteratively builds multiple decision trees, each focusing on the errors made by the previous ones. The final model is a weighted vote of these individual trees.

  6. Handling Missing Values and Noise: The C5.0 algorithm has robust mechanisms for handling missing values and noisy data. It can work with incomplete datasets by fractionally distributing cases with a missing value for the splitting attribute across the branches of the split, rather than discarding them.
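
As promised in step 2, here is the gain ratio in the notation used earlier: the information gain is normalized by the intrinsic (split) information of the attribute,

\[ GainRatio(S, A) = \frac{IG(S, A)}{SplitInfo(S, A)} \qquad SplitInfo(S, A) = -\sum_{v \in values(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|} \]

This normalization penalizes attributes that fragment the data into many small subsets, which would otherwise look artificially attractive to plain information gain.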

Advantages of C5.0

C5.0 is significantly faster than its predecessor, C4.5, both in terms of memory usage and processing time. It can handle large datasets more efficiently, making it suitable for real-world applications with substantial amounts of data. The boosting feature of C5.0 helps in improving the model’s accuracy by combining the strengths of multiple decision trees. This ensemble method reduces bias and variance, leading to better generalization on unseen data. Furthermore, decision trees produced by C5.0 are easy to understand and interpret. The conversion of trees into a set of rules further enhances their interpretability, making it easier for stakeholders to grasp the decision-making process.

C5.0 offers various parameters for fine-tuning the model, such as adjusting the confidence level for pruning, setting minimum sample sizes, and enabling boosting. This flexibility allows users to customize the model according to their specific needs.

Lastly, the algorithm’s ability to handle noisy data and missing values ensures that the model remains robust and reliable, even when the quality of the data is not perfect. The C5.0 algorithm stands out due to its efficiency, accuracy, and ease of interpretation. Its ability to handle large datasets, robust performance with noisy and incomplete data, and the option to use boosting make it a go-to algorithm for classification tasks. These advantages often make C5.0 preferable over other decision tree algorithms, particularly in applications where accuracy, speed, and interpretability are paramount.

Computational Intractability and Decision Trees

Building a perfect decision tree is an NP-complete problem, so no polynomial-time algorithm is known, and the time it takes to build a perfect decision tree is believed to grow exponentially with the size of the training data. This complexity arises from the combinatorial nature of the decision tree construction process. The various algorithms, including C50, use heuristics to create a tree that works well in practice and is computationally tractable, but may be suboptimal.

A decision tree is built by recursively splitting a dataset based on the values of its attributes to create branches until the dataset is perfectly classified or a stopping criterion is met. The goal is to find the optimal set of splits that results in the most accurate tree with the least complexity (in terms of depth and number of nodes). For any given dataset with \(n\) attributes, there is an exponential number of ways to partition the data and choose splits. Evaluating all possible trees to find the best one involves an exhaustive search through all these combinations, which grows exponentially with the size of the dataset and the number of attributes. The task of finding the optimal decision tree can be framed as a combinatorial optimization problem. Each possible split at each node represents a combinatorial choice, and the objective is to find the combination of choices that minimizes some cost function (e.g., misclassification rate, tree depth, etc.).

The problem of constructing an optimal decision tree can be reduced to a known NP-complete problem, such as the Boolean satisfiability problem (SAT). If one could solve the decision tree optimization problem efficiently, it would imply an efficient solution to all NP-complete problems, which is not believed to be possible.

Because constructing a perfect decision tree is NP-complete, there is no known polynomial-time algorithm to solve this problem for all instances. Therefore, heuristic methods and approximations, such as those employed by C50, are typically used in practice to construct decision trees. These methods aim to find a good (though not necessarily optimal) tree in a reasonable amount of time that produces acceptable results in practice. The most common heuristic is a greedy approach, which makes the best split decision at each step based on a certain criterion (e.g., information gain or Gini index). While this does not guarantee an optimal tree, it often results in a good and computationally feasible solution. To avoid overfitting and improve generalization, decision tree algorithms incorporate pruning methods, both pre-pruning (limiting the growth of the tree during construction) and post-pruning (removing branches from a fully grown tree). In addition, techniques such as bagging and boosting (e.g., Random Forests or Gradient Boosted Trees) build multiple decision trees on different sets of training data and combine their predictions to improve accuracy and robustness. In combination, these methods mitigate the suboptimality of individual trees by leveraging the strengths of multiple models and practical heuristics.
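
To make the greedy heuristic concrete, the sketch below scores every feature by information gain and returns the most attractive candidate for the next split – a single, simplified iteration of the recursive construction, not the actual C50 implementation (the function names are our own):

# entropy of a vector of class labels
entropy <- function(y) {
  p <- prop.table(table(y))
  -sum(p * log2(p))
}

# information gain of splitting the labels y by the values of feature x
info.gain <- function(x, y) {
  # weighted entropy of the subsets created by the split
  cond <- sum(sapply(split(y, x), function(s) (length(s) / length(y)) * entropy(s)))
  entropy(y) - cond
}

# greedy step: pick the feature with the highest information gain
best.split <- function(df, target) {
  features <- setdiff(names(df), target)
  gains <- sapply(features, function(f) info.gain(df[[f]], df[[target]]))
  gains[which.max(gains)]   # name and gain of the best greedy split
}

For example, best.split(credit.df, "Class") reports the most attractive first split. Note that this sketch treats every feature as categorical; real implementations instead search for binary thresholds on numeric features, and the naive treatment here overstates the gain of high-cardinality numeric features – precisely the bias that the gain ratio corrects.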

Worked Example: Credit Decision Making

The objective of our predictive model is to pinpoint factors associated with an increased risk of loan default. This should eventually assist lenders in making unbiased credit decisions or in automating the credit approval process. Achieving this requires access to extensive data on previous bank loans (a word of caution: the data may contain inherent bias which would then propagate to the model), coupled with detailed information about the applicants.

This worked example uses the dataset contributed to the UCI Machine Learning Repository by Hans Hofmann of the University of Hamburg (Hofmann, 1994). This dataset includes comprehensive information on loans sourced from a credit agency in Germany, so the applicability of any model might be limited to applicants from Germany – we need to be cautious in generalizing models as they are based on specific training data.

URL for data set: https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data

The credit dataset comprises 1,000 loan decisions made by loan officers, along with a set of numeric and nominal features that describe the characteristics of both the loans and the applicants. A (binary) class variable specifies whether each loan resulted in default. Our goal is to analyze this data to uncover patterns that can predict loan defaults and thus assist a loan officer in making a credit decision, i.e., an applicant who is likely to default might not be a good credit candidate or, alternatively, might require adjusted lending parameters such as a substantially higher interest rate and a shorter repayment period.

Workflow

Set up your own project in R Studio and create a new R Notebook in which to do your work and follow along. Experiment with the code, try additional techniques. The key is to type the code yourself and not simply copy and paste. You learn by doing!

Data Loading

We will load the data set directly from its URL using the utils package, although downloading to a project folder might be best when inspecting the data set. The code below also sets column headers as the data set does not have those. A data dictionary is provided at the above URL.

library(utils)

# URL of the dataset
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data"

# define the column names for the dataset (as the dataset does not contain headers)
col_names <- c("Status.of.existing.checking.account", "Duration.in.month", "Credit.history",
               "Purpose", "Credit.amount", "Savings.account.bonds", "Present.employment.since",
               "Installment.rate.in.percentage.of.disposable.income", "Personal.status.and.sex",
               "Other.debtors.guarantors", "Present.residence.since", "Property", "Age.in.years",
               "Other.installment.plans", "Housing", "Number.of.existing.credits.at.this.bank",
               "Job", "Number.of.people.being.liable.to.provide.maintenance.for", "Telephone",
               "Foreign.worker", "Class")

# fetch and read the dataset
credit.df <- read.csv(url, 
                      header = FALSE, sep = " ", 
                      col.names = col_names)

# Display the first few rows of the dataset to ensure it was loaded
head(credit.df[,c(2,5,8,11,21)], 4)
##   Duration.in.month Credit.amount
## 1                 6          1169
## 2                48          5951
## 3                12          2096
## 4                42          7882
##   Installment.rate.in.percentage.of.disposable.income Present.residence.since
## 1                                                   4                       4
## 2                                                   2                       2
## 3                                                   2                       3
## 4                                                   2                       4
##   Class
## 1     1
## 2     2
## 3     1
## 4     1

Data Inspection

d <- ncol(credit.df)
n <- nrow(credit.df)

The data set contains 1000 rows (observations/cases) and 21 dimensions (features). The loan amounts are in DM (Deutsche Mark), as this data set was collected prior to Germany adopting the Euro. The “target variable” is the column Class. A “1” indicates that the borrower did not default, while a “2” indicates that they did default.

t <- table(credit.df$Class)
print(t)
## 
##   1   2 
## 700 300

In your own code, inspect additional features. Which ones are numeric? Which ones are categorical? Which features might be likely predictors?
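
As a starting point, the snippet below lists each feature’s type and number of distinct values (in recent versions of R, the nominal features are read in as character columns and the numeric ones as integers):

str(credit.df)

sapply(credit.df, function(col) length(unique(col)))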

Data Preparation

The first step is to revise the encoding of the “Class” target variable to “yes” (default) and “no” (no default) rather than the values 2 and 1. This is purely cosmetic and for readability and has no impact on the model construction.

credit.df$Class <- ifelse(credit.df$Class == 1, "no", "yes")

There are 700 classified as “no” out of 1000 credit transactions, making the overall default rate 30%.

Sampling Training Data

We will train the model on a random sample of the cases in the data set, leaving the remainder for model validation. A random sample involves selecting a subset of records randomly from a larger dataset. In R, this can be accomplished using the sample() function or the runif() function to generate a sequence of random numbers following a uniform distribution. There are additional functions that produce random numbers that follow other distributions such as the Normal (Gaussian) Distribution.

It is a common practice to set a seed value beforehand using the set.seed() function. Setting a seed ensures that the randomization process follows a reproducible sequence. While this might seem counter-intuitive for generating random numbers, it serves an important purpose. By specifying a seed value, the analysis can be repeated in the future with identical results, ensuring consistency and reproducibility in the findings.

The code below generates a random sample (without replacement, i.e., any observation appears once in either the training data or the validation data, but never in both and never more than once in either). We are not specifying a probability distribution, so every observation has the same chance of appearing in the training data. We are selecting 90% of the data for training and the rest for validation.

Note that your sample might be different if you are using a different version of R than was used when generating this lesson; this will also affect the results for the remainder of the lesson. However, the analysis and the model are just as valid.

set.seed(617)

n <- nrow(credit.df)
train.data <- sample(n, n * 0.9, replace = FALSE)

train.df <- credit.df[train.data,]
val.df <- credit.df[-train.data,]

If the selection was done at random, we would expect the same proportion of defaults to appear in both the training and the validation data sets – and it should be about 30%, as that was the proportion of defaults in the original data set.

Training Data Set: Proportion of defaults

prop.table(table(train.df$Class))
## 
##        no       yes 
## 0.7022222 0.2977778

Validation Data Set: Proportion of defaults

prop.table(table(val.df$Class))
## 
##   no  yes 
## 0.68 0.32

Both the training and test datasets exhibit similar proportions of loan defaults, allowing us to proceed with building our decision tree. If the proportions had differed significantly, we might have needed to resample the dataset or employ a more sophisticated sampling approach to ensure balanced representation, such as a stratified sample or using k-fold cross validation rather than a dedicated validation data set.

A common problem of random sampling is class imbalance. If some level of the target variable appears only a few times in the data set, the random sample may place all of its instances in the validation data, and the model will never be trained on those cases. This will often cause an error when making a prediction. In those situations, stratified sampling or sampling with replacement (oversampling) may be better; a stratified-sampling sketch follows below.
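
As a sketch, here is one simple way to draw a stratified 90% sample in base R (the caret package’s createDataPartition() function provides similar functionality):

# sample 90% of the row indices within each class separately
strat.rows <- unlist(lapply(split(seq_len(n), credit.df$Class),
                            function(idx) sample(idx, floor(length(idx) * 0.9))))

strat.train.df <- credit.df[strat.rows, ]
strat.val.df   <- credit.df[-strat.rows, ]

Because the sampling is done within each class, the class proportions in the training and validation sets match those of the original data almost exactly.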

Random Number Generation

Let’s take a short detour on random numbers in computers, as random selection is common in many machine learning algorithms. The question is: how do computers generate random numbers?

Computers generate random numbers using algorithms called pseudo-random number generators (PRNGs). While these numbers are not truly random, they are sufficient for most practical purposes because they exhibit properties of randomness and cannot be predicted unless one knows the algorithm and the starting conditions of the algorithm.

You may (safely) skip the remainder of this section if you are not interested in how “random number generation” works.

A PRNG starts with an initial value called a seed. The seed can be any number, but it’s often derived from a variable source such as the current time or perhaps some external sensor that detects some natural phenomenon that is unpredictable. Once set, the seed allows the PRNG to produce a reproducible sequence of numbers that appears random. Naturally, the PRNG uses a deterministic algorithm to generate the sequence, so the numbers are not truly random but exhibit the statistical properties of random numbers. Common algorithms include the Linear Congruential Generator (LCG), the Mersenne Twister, and the Xorshift algorithm.

Linear Congruential Generator (LCG):

An LCG produces numbers based on the formula:

\[ X_{n+1} = (aX_n + c) \mod m \]

where:

  • \(X\) is the sequence of pseudorandom values,
  • \(a\), \(c\), and \(m\) are constants,
  • \(X_0\) is the initial seed.
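
As an illustration, here is a minimal LCG sketch in R (the constants are the well-known Numerical Recipes values; this is for illustration only – always use R’s built-in generators in practice):

lcg <- function(k, seed) {
  a <- 1664525; incr <- 1013904223; m <- 2^32
  x <- numeric(k)
  for (i in 1:k) {
    seed <- (a * seed + incr) %% m   # the LCG recurrence
    x[i] <- seed / m                 # scale to [0, 1)
  }
  x
}

lcg(5, seed = 617)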

Mersenne Twister:

The Mersenne Twister is widely used because of its long period and high-quality randomness; it is the default pseudo-random number generator in R. It generates numbers with a period of \(2^{19937} - 1\), ensuring that the sequence doesn’t repeat for a very long time.

Xorshift:

Xorshift algorithms use bitwise operations to generate random numbers. These are known for being fast and having good statistical properties.

For applications requiring true randomness, such as cryptography, pseudorandom numbers may not be sufficient. Instead, True Random Number Generators (TRNGs) are used. TRNGs generate numbers based on physical processes and natural phenomena that are inherently random, such as: measuring the thermal noise in electronic circuits, monitoring the decay of radioactive materials, or observing the behavior of photons passing through a semi-transparent mirror. TRNGs are generally slower, require an external sensor, and are thus more expensive than PRNGs but provide true randomness.

Model Training

We will employ the C5.0 algorithm from the C50 package to train our decision tree model. If you haven’t already installed the package, you can do so by running install.packages("C50") or using the interface of R Studio. After installation, load the package into your R session with library(C50). The code below checks if the package is installed and, if not, installs it.

package_name <- "C50"

if (!require(package_name, 
             character.only = T, 
             warn.conflicts = F,
             quietly = T)) {
    install.packages(package_name, dependencies = TRUE)
}

library(C50)

The model is built using the function C5.0(). This function has five key arguments:

  • x: a data frame of the training features
  • y: a factor vector with two or more levels of the corresponding target classes
  • trials: an optional argument indicating the number of boosting iterations that are applied (default is 1)
  • costs: a matrix of costs associated with possible errors; the matrix must have C rows and C columns, where C is the number of levels in the factor vector for y
  • rules: a logical argument indicating whether the tree should be decomposed into rules

We first convert the target variable from “character” to a factor, as that is what the C5.0() function expects. Of course, it would have been better to have done this conversion before splitting the data.

train.df$Class <- as.factor(train.df$Class)
val.df$Class <- as.factor(val.df$Class)

Next, we train a decision tree model using the C5.0 algorithm on all features, except for column 21, which is the target variable. Be sure not to include the target variable among the training features, as that would cause data leakage – the model would simply learn to read off the answer.

credit.model <- C5.0(train.df[-21], train.df$Class, trials = 1, rules = F)

The “decision tree” model is now available for prediction. We can view the model’s characteristics by printing it out:

print(credit.model)
## 
## Call:
## C5.0.default(x = train.df[-21], y = train.df$Class, trials = 1, rules = F)
## 
## Classification Tree
## Number of samples: 900 
## Number of predictors: 20 
## 
## Tree size: 46 
## 
## Non-standard options: attempt to group attributes

The actual “tree” can be visualized – in a rudimentary fashion – using the summary() function. Only a partial output of the model is shown.

summary(credit.model)
C5.0 [Release 2.07 GPL Edition]     Sun Jun 23 13:00:32 2024
-------------------------------

Class specified by attribute `outcome'

Read 900 cases (21 attributes) from undefined.data

Decision tree:

Status.of.existing.checking.account in {A14,A13}: no (416/56)
Status.of.existing.checking.account in {A12,A11}:
:...Credit.history in {A30,A31}:
    :...Housing = A151: yes (15)
    :   Housing in {A152,A153}:
    :   :...Other.debtors.guarantors in {A102,A103}: no (3)
    :       Other.debtors.guarantors = A101:
    :       :...Housing = A153: yes (11/1)
    :           Housing = A152:
    
    ...

In terms of decision tree construction with C5.0, the numbers in parentheses in the summary above indicate the number of examples that meet the criteria for that decision and the number incorrectly classified by the decision. For instance, on the first line, 416/56 means that out of the 416 examples reaching that decision point, 56 were incorrectly classified as “not likely to default.” This indicates that 56 applicants actually defaulted despite the model’s prediction to the contrary.

Occasionally, a decision tree produces decision sequences that may seem (or are) illogical. For example, it might suggest that an applicant with a very good credit history is likely to default, while those whose checking balance is unknown are not likely to default. These contradictory rules can occur and may either reflect real patterns in the data or be statistical anomalies. It is essential to investigate such unusual branches to determine whether the tree’s logic is valid and is reasonable for the business domain; naturally, this requires business subject matter expertise.

Model Evaluation

While the summary() function outputs a “confusion matrix”, we can generate one directly as well by using the decision tree model to predict the target variable for all of the cases in the validation data set where we know the actual outcome – the confusion matrix then compares the predicted output to the actual observed output.

In the code below, we use the function predict() to make predictions using the features of the validation data set (minus the target column “Class” which is column 21).

credit.preds.val <- predict(credit.model, val.df[-21])

We can now create a confusion matrix using the function confusionMatrix() from the caret package, comparing the predicted labels against the actual observed labels for the validation data set.

package_name <- "caret"

if (!require(package_name, 
             character.only = T, 
             warn.conflicts = F,
             quietly = T)) {
    install.packages(package_name, dependencies = TRUE)
}

library(caret)
conf.matrix.c50 <- confusionMatrix(credit.preds.val, val.df$Class)
print(conf.matrix.c50)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  59  16
##        yes  9  16
##                                           
##                Accuracy : 0.75            
##                  95% CI : (0.6534, 0.8312)
##     No Information Rate : 0.68            
##     P-Value [Acc > NIR] : 0.07961         
##                                           
##                   Kappa : 0.3902          
##                                           
##  Mcnemar's Test P-Value : 0.23014         
##                                           
##             Sensitivity : 0.8676          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.7867          
##          Neg Pred Value : 0.6400          
##              Prevalence : 0.6800          
##          Detection Rate : 0.5900          
##    Detection Prevalence : 0.7500          
##       Balanced Accuracy : 0.6838          
##                                           
##        'Positive' Class : no              
## 

Alternatively, we could use the function CrossTable() from the gmodels package. As always, we need to check first if the package is installed, and, if not, install it before loading the package.

package_name <- "gmodels"

if (!require(package_name, 
             character.only = T, 
             warn.conflicts = F,
             quietly = T)) {
    install.packages(package_name, dependencies = TRUE)
}

library(gmodels)
CrossTable(val.df$Class, credit.preds.val,
             prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
             dnn = c('actual default', 'predicted default'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                | predicted default 
## actual default |        no |       yes | Row Total | 
## ---------------|-----------|-----------|-----------|
##             no |        59 |         9 |        68 | 
##                |     0.590 |     0.090 |           | 
## ---------------|-----------|-----------|-----------|
##            yes |        16 |        16 |        32 | 
##                |     0.160 |     0.160 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |        75 |        25 |       100 | 
## ---------------|-----------|-----------|-----------|
## 
## 

From the above we can infer that out of the 100 loan applications in the validation data set, our C5.0 decision tree model correctly predicted that 59 did not default and 16 did default, resulting in an accuracy of 75% ((59 + 16) / 100) and an error rate of 25% where the model incorrectly predicted the actual observed outcome. Note that this is only modestly better than the roughly 70% accuracy we could achieve by simply predicting “no default” for every applicant, and that a model’s performance on unseen data is almost always worse than its performance on the training data.

It is noteworthy that the model correctly predicted only 16 of the actual 32 defaults, a correct prediction rate for loan defaults of just 50% – no better than a random guess. That is not very encouraging, and it is worth investigating whether a better model can be built that is more “biased” towards correctly predicting defaults, as those predictions are of greater business interest.
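
These rates can also be computed directly from the prediction vector rather than read off the table:

# overall accuracy on the validation data
mean(credit.preds.val == val.df$Class)

# fraction of actual defaults the model caught (sensitivity for the "yes" class)
sum(credit.preds.val == "yes" & val.df$Class == "yes") / sum(val.df$Class == "yes")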

Model Improvement

Once an initial model is trained and evaluated, we can tune model parameters or try out engineered features to see if the model’s accuracy, F1-Score, precision, or recall can be improved to better suit the business use cases. One clear issue is that we have a relatively small training data set and the random sampling into training and validation can have a significant effect on the decision tree that is generated.

A common strategy for C5.0 is boosting. Boosting is an ensemble learning technique that aims to improve the accuracy of a predictive model by combining the strengths of multiple weak learners. In the context of decision trees, boosting involves creating a series of decision trees, where each subsequent tree attempts to correct the errors made by the previous ones. The goal is to convert a collection of weak learners (trees that are only slightly better than random guessing) into a strong learner (a highly accurate model).

The process of building a boosted model begins by initializing the model with a single weak learner, typically a small decision tree. The initial model predicts the target variable based on the training data. Boosting proceeds iteratively, adding one tree at a time. Each new tree is trained to predict the residual errors (the differences between the actual values and the predictions made by the current model) from the previous iteration. During each iteration, boosting adjusts the weights of the training examples. Examples that were incorrectly predicted by the previous trees are given higher weights, making them more important in the next iteration. This ensures that the new tree focuses more on the hard-to-predict cases. The final boosted model is a weighted sum of all the individual trees. Each tree’s contribution to the final prediction is weighted based on its accuracy. This combination helps to smooth out individual errors and improve the overall model performance.
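
As a concrete illustration of the reweighting step, consider AdaBoost’s scheme (which C5.0’s boosting resembles, although its exact internals differ): a tree with weighted training error \(\epsilon_t\) receives the vote weight

\[ \alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t} \]

and each training example’s weight is multiplied by \(e^{\alpha_t}\) if the tree misclassified it (or by \(e^{-\alpha_t}\) otherwise) before the weights are renormalized, so that the next tree concentrates on the examples the current ensemble gets wrong.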

Key Variants of Boosting

  1. AdaBoost (Adaptive Boosting): AdaBoost is one of the earliest and most well-known boosting algorithms. It works by adjusting the weights of incorrectly classified instances so that subsequent classifiers focus more on difficult cases. AdaBoost combines the predictions of the weak learners using a weighted majority vote.

  2. Gradient Boosting: Gradient Boosting builds trees sequentially, where each tree is trained to reduce the residual errors of the previous tree using a gradient descent-like procedure. It generalizes AdaBoost by optimizing any differentiable loss function.

  3. XGBoost (Extreme Gradient Boosting): XGBoost is an efficient and scalable implementation of gradient boosting that includes several optimizations, such as regularization techniques, to prevent overfitting. It is highly effective in practice and has become a popular choice in many machine learning competitions.

  4. LightGBM and CatBoost: LightGBM (Light Gradient Boosting Machine) and CatBoost (Categorical Boosting) are other advanced implementations of gradient boosting. LightGBM is designed for efficiency and scalability, while CatBoost is optimized for categorical data and handles high-cardinality features well.

Advantages of Boosting

  1. Improved Accuracy: Boosting significantly improves the accuracy of the model by focusing on difficult-to-predict instances and reducing bias and variance.

  2. Robustness: By combining multiple weak learners, boosting creates a robust model that is less prone to overfitting compared to a single complex model.

  3. Flexibility: Boosting can be applied to various types of weak learners, although decision trees are commonly used due to their simplicity and interpretability.

  4. Handling Imbalanced Data: Boosting can be particularly effective in handling imbalanced datasets, as it focuses on hard-to-predict cases which often belong to the minority class.

The C5.0 algorithm in the C50 package supports boosting. Below is an example that uses boosting with C5.0 to train a decision tree model – the key parameter is trials, set to a value greater than 1 (the default):

credit.boosted.model <- C5.0(x = train.df[-21], 
                             y = train.df$Class, 
                             trials = 10)

Let’s see if the boosted model performs better:

credit.preds.boosted.val <- predict(credit.boosted.model, val.df[-21])
conf.matrix.boosted <- confusionMatrix(credit.preds.boosted.val, val.df$Class)
print(conf.matrix.boosted)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  60  15
##        yes  8  17
##                                           
##                Accuracy : 0.77            
##                  95% CI : (0.6751, 0.8483)
##     No Information Rate : 0.68            
##     P-Value [Acc > NIR] : 0.03154         
##                                           
##                   Kappa : 0.439           
##                                           
##  Mcnemar's Test P-Value : 0.21090         
##                                           
##             Sensitivity : 0.8824          
##             Specificity : 0.5312          
##          Pos Pred Value : 0.8000          
##          Neg Pred Value : 0.6800          
##              Prevalence : 0.6800          
##          Detection Rate : 0.6000          
##    Detection Prevalence : 0.7500          
##       Balanced Accuracy : 0.7068          
##                                           
##        'Positive' Class : no              
## 

And, once again, let’s build a prediction table:

CrossTable(val.df$Class, credit.preds.boosted.val,
             prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
             dnn = c('actual default', 'predicted default'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                | predicted default 
## actual default |        no |       yes | Row Total | 
## ---------------|-----------|-----------|-----------|
##             no |        60 |         8 |        68 | 
##                |     0.600 |     0.080 |           | 
## ---------------|-----------|-----------|-----------|
##            yes |        15 |        17 |        32 | 
##                |     0.150 |     0.170 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |        75 |        25 |       100 | 
## ---------------|-----------|-----------|-----------|
## 
## 

The boosted model made correct predictions at a rate of 77% (60 + 17 out of 100), a slight improvement. The rate of correctly predicting a default rose from 50% to 53% (17/32) – a modest improvement as well.

Given the apparent ease of adding boosting, one might wonder why it isn’t applied by default to every decision tree. There are two important reasons for not doing so. For one, if building a single decision tree requires substantial computational resources (it is, after all, a computationally complex algorithm), constructing multiple trees for boosting can become computationally impractical – it may simply take too long to train a boosted model, and during that training time new training data may already have been generated if data velocity is high. In addition, in cases where the training data is very noisy, boosting may not yield any improvement and could even exacerbate the noise. Nevertheless, when higher accuracy is important, it is worth experimenting with boosting to see if it provides the desired enhancement in performance, perhaps experimenting on a smaller random sample of the data.

Nevertheless, boosting is a useful (and commonly applied) technique that can significantly improve the performance of decision trees by iteratively focusing on hard-to-predict instances and combining multiple weak learners. This generally results in a robust and accurate model, making boosting a popular choice in various machine learning tasks.

Directing Bias

Some incorrect classifications are more impactful than others. For instance, giving a loan to someone who is likely to default is more consequential than denying a loan to someone who is unlikely to default. In other words, a false positive (predicting a default that does not occur) is preferable to a false negative (failing to predict a default that does occur). Data scientists can deliberately introduce bias into a model so that the most costly errors are avoided.

The C5.0 algorithm offers the opportunity to assign different penalties (costs) to various types of classification errors, thereby discouraging the model from making more costly mistakes – in other words, “biasing” the model. This is achieved through the use of a cost matrix, which details the relative cost of each type of error. By specifying these costs, we can guide the algorithm to prioritize minimizing the most expensive errors, improving the model’s performance in contexts where some mistakes are more critical than others.

A cost matrix for a binary classifier is a 2x2 matrix. The code below demonstrates how to construct a cost matrix.

matrix.dims <- list(c("no", "yes"), c("no", "yes"))
names(matrix.dims) <- c("predicted", "actual")

Next, we need to define the cost (i.e., the penalty) for each type of error. Since R populates a matrix by filling columns from top to bottom, we must provide the values in this order, given that “predicted” labels the rows and “actual” labels the columns:

  1. Predicted no, actual no
  2. Predicted yes, actual no
  3. Predicted no, actual yes
  4. Predicted yes, actual yes

The costs (or penalties) are relative. A cost of 0 means that there is no cost. Let’s fill the matrix with some cost values; the cost matrix must use the dimension names defined above:

cost.matrix <- matrix(c(0, 1, 4, 0), 
                      nrow = 2,
                      dimnames = matrix.dims)

Here is what the cost matrix contains:

print(cost.matrix)
##          actual
## predicted no yes
##       no   0   4
##       yes  1   0

For instance, the cost of a default when none was predicted is four times the cost of predicting a default when there was none, so a false negative has four times the impact of a false positive. Naturally, the correct predictions carry a cost of 0 as we do not want to discourage those.

Now that we have a cost matrix, we can train a new C5.0 decision tree model with boosting that also takes those costs into account when building the tree.

credit.boosted.costs.model <- C5.0(x = train.df[-21], 
                                   y = train.df$Class, 
                                   trials = 10,
                                   costs = cost.matrix)

Let’s see if the boosted model with costs performs better:

credit.preds.boosted.costs.val <- predict(credit.boosted.costs.model, 
                                          val.df[-21])

And, once again, let’s build a prediction table:

CrossTable(val.df$Class, credit.preds.boosted.costs.val,
             prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
             dnn = c('actual default', 'predicted default'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                | predicted default 
## actual default |        no |       yes | Row Total | 
## ---------------|-----------|-----------|-----------|
##             no |        60 |         8 |        68 | 
##                |     0.600 |     0.080 |           | 
## ---------------|-----------|-----------|-----------|
##            yes |        21 |        11 |        32 | 
##                |     0.210 |     0.110 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |        81 |        19 |       100 | 
## ---------------|-----------|-----------|-----------|
## 
## 

The accuracy (correct predictions, i.e., true positives and true negatives) is now 71% (60 + 11), which is lower than the boosted model without costs. False positives (a predicted default where none occurred) are at 8%, while false negatives (missed defaults) are at 21%. Note that in this particular run the cost matrix did not shift the errors in the intended direction – a reminder that results depend on the random sample and that cost values typically need tuning. In general, introducing costs trades some overall accuracy for fewer of the errors that are most damaging to business interests.
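
One way to compare the models in business terms is to total the relative cost of their errors on the validation data using the cost matrix defined above (a sketch; total.cost is our own helper function):

total.cost <- function(preds, actual, costs) {
  # index the cost matrix by (predicted, actual) pairs and sum the penalties
  sum(costs[cbind(as.character(preds), as.character(actual))])
}

total.cost(credit.preds.val, val.df$Class, cost.matrix)
total.cost(credit.preds.boosted.val, val.df$Class, cost.matrix)
total.cost(credit.preds.boosted.costs.val, val.df$Class, cost.matrix)

For instance, from the first prediction table, the unboosted model’s errors total 16 × 4 + 9 × 1 = 73 cost units. The model with the lowest total cost – rather than the highest accuracy – is the best fit for this business objective.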

Summary

In this lesson, we discussed the construction process and common enhancements of decision trees using the C5.0 algorithm. Decision trees produce powerful, interpretable models for classification tasks. Their performance can be further improved through boosting, which involves combining multiple trees to reduce errors and increase accuracy. However, boosting may not always be practical due to computational demands and sensitivity to noisy data.

We also explored the concept of a cost matrix in the C5.0 algorithm, which allows assigning different penalties to various types of classification errors. This helps prioritize the minimization of more costly mistakes, thereby tailoring the model’s performance to specific business needs.


Files & Resources

All Files for Lesson 3.435

References

Hofmann, Hans. (1994). Statlog (German Credit Data). UCI Machine Learning Repository. https://doi.org/10.24432/C5NC77.

Lantz, Brent. (2019). Machine learning with R: Expert Techniques for Predictive Modeling. Packt Publishing Ltd.

Errata

None collected yet. Let us know.

---
title: "Decision Trees: A Worked Example in R"
params:
  category: 3
  stacks: 0
  number: 435
  time: 90
  level: beginner
  tags: naive bayes,bayes,machine learning,classification,spam detection
  description: "Worked example for predicting credit default using
                the C5.0 decision tree algorithm in R."
date: "<small>`r Sys.Date()`</small>"
author: "<small>Martin Schedlbauer</small>"
email: "m.schedlbauer@neu.edu"
affilitation: "Northeastern University"
output: 
  bookdown::html_document2:
    toc: true
    toc_float: true
    collapsed: false
    number_sections: false
    code_download: true
    theme: spacelab
    highlight: tango
---

---
title: "<small>`r params$category`.`r params$number`</small><br/><span style='color: #2E4053; font-size: 0.9em'>`r rmarkdown::metadata$title`</span>"
---

```{r code=xfun::read_utf8(paste0(here::here(),'/R/_insert2DB.R')), include = FALSE}
```

## Motivation

Among the various machine learning techniques, decision trees have gained widespread use in the banking sector due to their high accuracy and the ability to translate complex statistical models into plain language and make the prediction of the model explainable.

In many countries, lending practices are closely monitored by government agencies to ensure fairness. This scrutiny requires bank executives to clearly articulate the reasons behind loan approvals or rejections, which is where decision trees excel. They offer a clear, interpretable rationale for each decision, which is crucial not only for regulatory compliance but also for providing customers with understandable explanations for their credit ratings.

Automated credit scoring models are likely employed in scenarios such as credit card mailings and instant online approval processes. In this lesson, we will develop a simple credit approval model in R using C5.0 decision trees. Additionally, we will explore how to fine-tune the model to minimize errors that improve the model and direct the bias of the model to reduce financial loss.

> This lesson is an expansion of the tutorial in Chapter 5 of (Lantz, 2019).

## Model Explainabiity

One of the reasons why decision trees remain a commonly used supervised machine learning algorithm, is it's superior explainability and ease with which the decision process can be reproduced. Explainability in the context of predictive modeling and supervised machine learning refers to the extent to which the internal mechanics of a machine learning model and its decision-making processes can be understood by humans. It requires making the model’s operations transparent and its predictions interpretable, allowing stakeholders to comprehend how input features are transformed into output predictions.

Explainability is essential and desirable for several reasons:

1.  **Trust and Adoption**: When the workings of a model are transparent, users and stakeholders are more likely to trust and adopt the technology. Trust is particularly important in fields like healthcare, medicine, finance, banking, and criminal justice, where decisions have significant consequences and often deeply affect people's lives.

2.  **Regulatory Compliance**: Many industries are subject to regulations that require clear justifications for decisions, especially those affecting individuals’ lives and finances. For instance, in banking, the Basel III framework and the General Data Protection Regulation (GDPR) necessitate explainability in credit scoring models to ensure fairness and accountability, as well as expose any potential covert bias.

3.  **Debugging and Improvement**: Explainable models make it easier to identify and rectify errors or biases in the model. Understanding why a model makes certain predictions can highlight areas for improvement and refinement, leading to better performance and more accurate results.

4.  **Ethical Responsibility**: Machine learning models can inadvertently perpetuate or exacerbate biases present in training data. Explainability helps in identifying such biases, enabling developers to take corrective actions to ensure ethical and fair outcomes.

5.  **User Understanding and Engagement**: When users understand how a model works, they are more likely to engage with it appropriately. For example, in medical diagnostics, explainability can help doctors understand the rationale behind a model’s prediction, allowing them to make more informed decisions about patient care.

6.  **Decision Accountability**: In many applications, it is essential to hold decision-makers accountable. Explainable models provide the necessary transparency to ensure that decisions can be scrutinized and justified, thereby maintaining accountability and integrity in automated decision-making systems.

In short, explainability bridges the gap between complex machine learning models and human understanding. It ensures that models are not only accurate but also transparent, fair, as free from bias as possible, and trustworthy, thereby fostering wider acceptance -- and responsible use -- of machine learning and applied artificial intelligence.

## Decision Trees for Classification

Decision tree algorithms are useful for both classification and regression tasks, although we will focus on classification tasks. These algorithms work by recursively partitioning the data into subsets based on feature values, creating a tree-like model of decisions. Each internal node of the tree represents a conditional test on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label or a regression value. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. Decision trees are preferable when features are a mix of categorical and numeric values. Unlike regression methods, decision trees do not rely on any particular distribution of values.

### Overview of Decision Tree Algorithms

A decision tree starts with a root node that encompasses the entire dataset. The algorithm then selects the feature that best divides the data into distinct classes or values, based on a criterion like *Information Gain* or *Gini Index*. This process is repeated recursively for each child node until one of the stopping conditions is met, such as reaching a maximum depth, having a minimum number of samples per node, or achieving homogeneous nodes. The construction of decision trees on data sets with many observations and a high number of dimensions can be time consuming and take a long time.

Information Gain measures the reduction in entropy or uncertainty about the class labels after splitting the data based on an attribute, while the Gini Index measures the impurity or diversity of the split. It is calculated as the probability of misclassifying a randomly chosen element if it were labeled according to the distribution of labels in the subset. Some decision tree algorithm will evaluate the statistical significance of the association between categorical variables or attempt to reduce variance. The latter approach is often used in regression trees which works to minimize the variance of the target variable within each node.

To prevent overfitting, decision trees often use pruning techniques to remove branches that have little importance. Pruning can be done in a pre-pruning manner (stopping the tree growth early) or post-pruning (removing branches from a fully grown tree).

### Common Decision Tree Algorithms

Several decision tree algorithms have been developed over the years, each with its unique approach to building and refining trees. Some of the most notable ones include:

1.  **ID3 (Iterative Dichotomiser 3)**: Developed by Ross Quinlan, ID3 uses Information Gain as its splitting criterion. It builds a tree top-down, starting from a set of training instances and the features that describe them.

2.  **C4.5 and C5.0**: Also developed by Ross Quinlan, C4.5 is an extension of ID3. It uses *Gain Ratio*, a modification of *Information Gain* that reduces its bias towards features with many distinct values. C5.0 is a commercial version that is more efficient and provides better performance than C4.5.

3.  **CART (Classification and Regression Trees)**: Introduced by Breiman et al., CART can be used for both classification and regression tasks. It uses the *Gini Index* for classification trees and reduction in variance for regression trees. CART produces binary trees, where each node has exactly two children.

4.  **CHAID (Chi-squared Automatic Interaction Detector)**: This algorithm is used primarily for classification and relies on the Chi-Square test to determine the best splits. It is particularly effective for categorical data and handles multi-way splits.

5.  **MARS (Multivariate Adaptive Regression Splines)**: Although not strictly a decision tree, MARS builds models by dividing the data into distinct regions using hinge functions. It is highly effective for regression tasks and can handle interactions between variables.

6.  **Random Forest**: An ensemble method that constructs multiple decision trees during training and outputs the mode of the classes for classification or the mean prediction for regression of the individual trees. It uses techniques like bagging and feature randomness to improve model accuracy and control overfitting.

7.  **Gradient Boosted Trees**: Another ensemble technique that builds trees sequentially. Each tree is constructed to correct the errors of the previous ones, focusing on the residual errors. Prominent algorithms in this category include XGBoost, LightGBM, and CatBoost.

8.  **Oblique Decision Trees**: These trees use linear combinations of features rather than a single feature for splitting nodes. This can lead to more compact and potentially more accurate trees, but they are computationally more intensive to build.

Decision tree algorithms offer a robust and interpretable approach to both classification and regression tasks. Their versatility and ease of understanding make them popular in various fields, from finance to healthcare. Each variant of decision tree algorithms brings unique strengths, and understanding these can help in selecting the most appropriate model for a given problem. This lesson will focus on the use of the C5.0 (*aka* C50) algorithm.

### The C5.0 Algorithm

The C5.0 (*aka* C50) algorithm, developed by Ross Quinlan, is an advanced and commercially available version of the C4.5 algorithm. It is widely used for creating classification models due to its efficiency, accuracy, and interpretability/explainability. The C5.0 algorithm builds a decision tree by recursively splitting the data into subsets based on the attribute that provides the highest information gain ratio. This process continues until the algorithm meets a stopping criterion, such as reaching a specified depth or having too few samples to split further.
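
Formally, the gain ratio normalizes the information gain (defined earlier) by the split, or intrinsic, information of the attribute, which penalizes attributes that fragment the data into many small subsets:

$$
\text{GainRatio}(S, A) = \frac{\text{Gain}(S, A)}{\text{SplitInfo}(S, A)}
\qquad
\text{SplitInfo}(S, A) = -\sum_{i=1}^{k} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}
$$

where $S_1, \dots, S_k$ are the subsets produced by splitting $S$ on attribute $A$.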

### How C5.0 Works

1.  **Data Preparation**: Before building the tree, the C5.0 algorithm requires the data to be prepared, which involves handling missing values, converting categorical data to numeric where necessary, and optionally normalizing numeric data. In practice, C5.0 handles missing values quite well and often does not necessitate removing outliers, imputing missing values, encoding categorical features, or normalizing numeric features.

2.  **Tree Construction**: The algorithm starts with the entire dataset and evaluates each attribute to find the one that provides the highest information gain ratio. This attribute becomes the root node of the tree. Splitting the tree is based on the information gain ratio. The information gain ratio is a measure of how well an attribute splits the data into distinct classes. It is calculated as the ratio of information gain (reduction in entropy) to the intrinsic information (a measure of the potential information generated by splitting the dataset based on this attribute). After selecting the root node, the dataset is divided into subsets based on the possible values of the chosen attribute. The algorithm then repeats the process for each subset, selecting the best attribute for further splitting. This recursive process continues, creating branches and leaf nodes, until a stopping criterion is met.

3.  **Pruning**: To avoid overfitting, the algorithm may use pre-pruning techniques such as limiting the depth of the tree, requiring a minimum number of samples per leaf, or setting a threshold for the information gain ratio. Once the tree is fully grown, post-pruning can be applied to remove branches that contribute little to the overall predictive power of the model. This involves evaluating the performance of subtrees and collapsing those that do not significantly improve accuracy.

4.  **Rule Extraction**: The C5.0 algorithm can convert the decision tree into a set of IF-THEN rules, which are often easier to interpret and understand. Each path from the root to a leaf node corresponds to a rule; a brief sketch of rule extraction follows this list.

5.  **Boosting**: One of the notable features of C5.0 is its ability to use boosting, an ensemble technique that combines multiple weak classifiers to form a strong classifier. In boosting, the algorithm iteratively builds multiple decision trees, each focusing on the errors made by the previous ones. The final model is a weighted vote of these individual trees.

6.  **Handling Missing Values and Noise**: The C5.0 algorithm has robust mechanisms for handling missing values and noisy data. Rather than discarding incomplete cases, it apportions them fractionally across the branches of a split, weighted by the frequency of the observed values, allowing it to proceed even when an attribute value is missing.
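
As promised above, rule extraction is requested at training time via the `rules` argument of `C5.0()`. Below is a minimal sketch using R's built-in `iris` data as a stand-in (our credit data is only loaded later in this lesson), so it can be run independently:

```{r eval=F}
library(C50)

# request a rule set instead of a tree
iris.rules <- C5.0(x = iris[, -5], y = iris$Species, rules = TRUE)

# prints each IF-THEN rule with its coverage and confidence
summary(iris.rules)
```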

### Advantages of C5.0

C5.0 is significantly faster than its predecessor, C4.5, and uses less memory. It can handle large datasets more efficiently, making it suitable for real-world applications with substantial amounts of data. The boosting feature of C5.0 helps in improving the model's accuracy by combining the strengths of multiple decision trees. This ensemble method reduces bias and variance, leading to better generalization on unseen data. Furthermore, decision trees produced by C5.0 are easy to understand and interpret. The conversion of trees into a set of rules further enhances their interpretability, making it easier for stakeholders to grasp the decision-making process.

C5.0 offers various parameters for fine-tuning the model, such as adjusting the confidence level for pruning, setting minimum sample sizes, and enabling boosting. This flexibility allows users to customize the model according to their specific needs.

Lastly, the algorithm's ability to handle noisy data and missing values ensures that the model remains robust and reliable, even when the quality of the data is not perfect. The C5.0 algorithm stands out due to its efficiency, accuracy, and ease of interpretation. Its ability to handle large datasets, robust performance with noisy and incomplete data, and the option to use boosting make it a go-to algorithm for classification tasks. These advantages often make C5.0 preferable over other decision tree algorithms, particularly in applications where accuracy, speed, and interpretability are paramount.

### Computational Intractability and Decision Trees

Building a perfect decision tree is an NP-complete problem, so no polynomial-time algorithm is known; in practice, the time it takes to build an optimal decision tree grows exponentially with the size of the training data. This complexity arises from the combinatorial nature of the decision tree construction process. The various algorithms, including C50, use heuristics to create a tree that works well in practice and is computationally tractable, but may be suboptimal.

A decision tree is built by recursively splitting a dataset based on the values of its attributes to create branches until the dataset is perfectly classified or a stopping criterion is met. The goal is to find the optimal set of splits that results in the most accurate tree with the least complexity (in terms of depth and number of nodes). For any given dataset with $n$ attributes, there is an exponential number of ways to partition the data and choose splits. Evaluating all possible trees to find the best one involves an exhaustive search through all these combinations, which grows exponentially with the size of the dataset and the number of attributes. The task of finding the optimal decision tree can be framed as a combinatorial optimization problem. Each possible split at each node represents a combinatorial choice, and the objective is to find the combination of choices that minimizes some cost function (*e.g.*, misclassification rate, tree depth, etc.).

The problem of constructing an optimal decision tree can be reduced to a known NP-complete problem, such as the Boolean satisfiability problem (SAT). If one could solve the decision tree optimization problem efficiently, it would imply an efficient solution to all NP-complete problems, which is not believed to be possible.

Because constructing a perfect decision tree is NP-complete, it implies that there is no known polynomial-time algorithm to solve this problem for all instances. Therefore, heuristic methods and approximations, such as those employed by C50, are typically used in practice to construct decision trees. These methods aim to find a good (though not necessarily optimal) tree in a reasonable amount of time that produces acceptable results in practice. Among the common heuristic methods are the use of a greedy approach which makes the best split decision at each step based on a certain criterion (*e.g.*, information gain or Gini index). While this does not guarantee an optimal tree, it often results in a good and computationally feasible solution. To avoid overfitting and improve generalization, decision tree algorithms incorporate pruning methods, both pre-pruning (limiting the growth of the tree during construction) and post-pruning (removing branches from a fully grown tree). In addition, techniques such as bagging and boosting (*e.g.*, Random Forests or Gradient Boosted Trees) build multiple decision trees on different sets of training data and combine their predictions to improve accuracy and robustness. In combination, these methods mitigate the suboptimality of individual trees by leveraging the strengths of multiple models and practical heuristics.
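
To make the greedy heuristic concrete, here is a minimal sketch that computes the information gain of each candidate attribute on a small, hypothetical data frame; a greedy tree builder would split on the attribute with the highest gain and then recurse on each subset:

```{r}
# entropy of a vector of class labels
entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(ifelse(p > 0, p * log2(p), 0))
}

# information gain of splitting labels y on categorical attribute x
info.gain <- function(x, y) {
  w <- table(x) / length(x)
  h.cond <- sum(sapply(names(w), function(v) w[v] * entropy(y[x == v])))
  entropy(y) - h.cond
}

# toy data: which attribute best separates the class?
toy <- data.frame(Checking = c("low","low","high","high","low","high"),
                  Housing  = c("own","rent","own","own","rent","rent"),
                  Default  = c("yes","yes","no","no","yes","no"))

sapply(toy[c("Checking", "Housing")], info.gain, y = toy$Default)
```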

## Worked Example: Credit Decision Making

The objective of our predictive model is to pinpoint factors associated with an increased risk of loan default. This should eventually assist lenders in making unbiased credit decisions or in automating the credit approval process. Achieving this requires access to extensive data on previous bank loans (a word of caution: the data may contain inherent bias which would then propagate to the model), coupled with detailed information about the applicants.

This worked example uses the dataset contributed to the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml) by Hans Hofmann of the University of Hamburg (Hofmann, 1994). This dataset includes comprehensive information on loans sourced from a credit agency in Germany, so the applicability of any model might be limited to applicants from Germany -- we need to be cautious in generalizing models as they are based on specific training data.

**URL for data set**: <https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data>

The credit dataset comprises 1,000 loan decisions made by loan officers, along with a set of numeric and nominal features that describe the characteristics of both the loans and the applicants. A (binary) class variable specifies whether each loan resulted in default. Our goal is to analyze this data to uncover patterns that might predict loan defaults and thus assist a loan officer in making a credit decision, *i.e.*, an applicant who is likely to default might not be a good credit candidate or, alternatively, would require adjusted lending parameters such as a substantially higher interest rate and a shorter repayment period.

## Workflow

Set up your own project in R Studio and create a new R Notebook in which to do your work and follow along. Experiment with the code and try additional techniques. The key is to type the code yourself and not simply copy and paste. You learn by doing!

### Data Loading

We will load the data set directly from its URL using the **utils** package, although downloading to a project folder might be best when inspecting the data set. The code below also sets column headers as the data set does not have those. A data dictionary is provided at the above URL.

```{r}
library(utils)

# URL of the dataset
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data"

# define the column names for the dataset (as the dataset does not contain headers)
col_names <- c("Status.of.existing.checking.account", "Duration.in.month", "Credit.history",
               "Purpose", "Credit.amount", "Savings.account.bonds", "Present.employment.since",
               "Installment.rate.in.percentage.of.disposable.income", "Personal.status.and.sex",
               "Other.debtors.guarantors", "Present.residence.since", "Property", "Age.in.years",
               "Other.installment.plans", "Housing", "Number.of.existing.credits.at.this.bank",
               "Job", "Number.of.people.being.liable.to.provide.maintenance.for", "Telephone",
               "Foreign.worker", "Class")

# fetch and read the dataset
credit.df <- read.csv(url, 
                      header = FALSE, sep = " ", 
                      col.names = col_names)

# Display the first few rows of the dataset to ensure it was loaded
head(credit.df[,c(2,5,8,11,21)], 4)

```

```{r echo=F}
df.orig <- credit.df
```

### Data Inspection

```{r echo = T}
d <- ncol(credit.df)
n <- nrow(credit.df)
```

The data set contains `r n` rows (observations/cases) and `r d` dimensions (features). The loan amounts are in *DM* (Deutsche Mark), as this data set was collected prior to Germany adopting the Euro. The "target variable" is the column *Class*. A "1" indicates that the borrower did not default, while a "2" indicates that they did default.

```{r}
t <- table(credit.df$Class)
print(t)
```

In your own code, inspect additional features. Which ones are numeric? Which ones are categorical? Which features might be likely predictors?
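
For example, `str()` and `summary()` are a good starting point; the snippet below (not evaluated here) is one way to begin:

```{r eval=F}
str(credit.df)                    # data types: chr columns hold categorical codes
summary(credit.df$Credit.amount)  # distribution of a numeric feature
table(credit.df$Purpose)          # levels of a categorical feature
```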

### Data Preparation

The first step is to revise the encoding of the "Class" target variable to "yes" (default) and "no" (no default) rather than the values 2 and 1. This is purely cosmetic, improving readability, and has no impact on the model construction.

```{r}
credit.df$Class <- ifelse(credit.df$Class == 1, "no", "yes")
```

There are `r as.numeric(t[1])` classified as "no" out of `r n` credit transactions, making the overall default rate `r round(as.numeric(t[2])/(as.numeric(t[1])+as.numeric(t[2]))*100,1)`%.

### Sampling Training Data

We will train the model on a random sample of the cases in the data set, leaving the remainder for model validation. A random sample involves selecting a subset of records randomly from a larger dataset. In R, this can be accomplished using the `sample()` function or the `runif()` function to generate a sequence of random numbers following a uniform distribution. There are additional functions that produce random numbers that follow other distributions such as the Normal (Gaussian) Distribution.

It is a common practice to set a seed value beforehand using the `set.seed()` function. Setting a seed ensures that the randomization process follows a reproducible sequence. While this might seem counter-intuitive for generating random numbers, it serves an important purpose. By specifying a seed value, the analysis can be repeated in the future with identical results, ensuring consistency and reproducibility in the findings.
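
A quick demonstration: resetting the seed reproduces the exact same "random" draw.

```{r}
set.seed(123)
sample(1:10, 3)

set.seed(123)
sample(1:10, 3)   # identical to the first draw
```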

The code below generates a random sample (without replacement, *i.e.*, any observation appears once in either the training data or the validation data, but never in both and never more than once in either). We are not specifying a probability distribution, so every observation has the same chance of appearing in the training data. We are selecting 90% of the data for training and the rest for validation.

Note that your sample might be different if you are using a different version of R than the one used when generating this lesson; this will also affect the results for the remainder of the lesson. However, the analysis and the model are just as valid.

```{r}
set.seed(617)

n <- nrow(credit.df)
train.data <- sample(n, n * 0.9, replace = FALSE)

train.df <- credit.df[train.data,]
val.df <- credit.df[-train.data,]
```

If the selection was done at random, we would expect the same proportion of defaults to appear in the training and the validation data sets -- and they should be about 30%, as that was the proportion of defaults in the original data set.

**Training Data Set: Proportion of defaults**

```{r}
prop.table(table(train.df$Class))
```

**Validation Data Set: Proportion of defaults**

```{r}
prop.table(table(val.df$Class))
```

Both the training and validation datasets exhibit similar proportions of loan defaults, allowing us to proceed with building our decision tree. If the proportions had differed significantly, we might have needed to resample the dataset or employ a more sophisticated sampling approach to ensure balanced representation, such as a stratified sample, or to use *k*-fold cross validation rather than a dedicated validation data set.

A common problem with random sampling is class imbalance. If some level of the target variable appears only a few times in the data set, a random sample might place all of its instances in the validation data, and the model would then never be trained on that level; this often causes an error when making a prediction. In those situations, stratified sampling or sampling with replacement (oversampling) may be better.
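
As a sketch of the stratified alternative, the `createDataPartition()` function from the **caret** package (which we load later in this lesson; install it first if needed) samples within each class so that both partitions preserve the class proportions. The variable names below are illustrative:

```{r eval=F}
library(caret)

set.seed(617)
# stratified 90/10 split: sampling is done within each level of Class
idx <- createDataPartition(credit.df$Class, p = 0.9, list = FALSE)

strat.train.df <- credit.df[idx, ]
strat.val.df   <- credit.df[-idx, ]

# both partitions should now show (nearly) identical default proportions
prop.table(table(strat.train.df$Class))
prop.table(table(strat.val.df$Class))
```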

#### Random Number Generation

Let's take a short detour on random numbers in computers, as random selection is common in many machine learning algorithms. The question is: how do computers generate random numbers?

Computers generate random numbers using algorithms called pseudo-random number generators (PRNGs). While these numbers are not truly random, they are sufficient for most practical purposes because they exhibit properties of randomness and cannot be predicted unless one knows the algorithm and the starting conditions of the algorithm.

***You may (safely) skip the remainder of this section if you are not interested in how "random number generation" works.***

A PRNG starts with an initial value called a *seed.* The seed can be any number, but it's often derived from a variable source such as the current time or perhaps some external sensor that detects some natural phenomenon that is unpredictable. Once set, the seed allows the PRNG to produce a reproducible sequence of numbers that appears to be random. Naturally, the PRNG uses a deterministic algorithm to generate the sequence, so the numbers are not truly random but merely exhibit the properties of random numbers. Common algorithms include the Linear Congruential Generator (LCG), the Mersenne Twister, and the Xorshift algorithm.

**Linear Congruential Generator (LCG)**:

An LCG produces numbers based on the formula:

$$
   X_{n+1} = (aX_n + c) \mod m
$$

where:

-   $X$ is the sequence of pseudorandom values,
-   $a$, $c$, and $m$ are constants,
-   $X_0$ is the initial seed.
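
As a toy illustration, an LCG can be implemented in a few lines of R; the constants below are the Park-Miller "minimal standard" choices ($a = 16807$, $c = 0$, $m = 2^{31} - 1$). Note that R's own default generator is the Mersenne Twister, not an LCG:

```{r}
# toy LCG: x_{n+1} = (a * x_n + c) mod m, scaled to [0, 1)
lcg <- function(n, seed = 42, a = 16807, c = 0, m = 2^31 - 1) {
  x <- numeric(n)
  for (i in seq_len(n)) {
    seed <- (a * seed + c) %% m
    x[i] <- seed / m
  }
  x
}

round(lcg(5), 4)   # the same seed always yields the same sequence
```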

**Mersenne Twister**:

The Mersenne Twister is widely used because of its long period and high-quality randomness. It generates numbers with a period of $2^{19937} - 1$, ensuring that the sequence doesn't repeat for a very long time.

**Xorshift**:

Xorshift algorithms use bitwise operations to generate random numbers. These are known for being fast and having good statistical properties.

For applications requiring true randomness, such as cryptography, pseudorandom numbers may not be sufficient. Instead, True Random Number Generators (TRNGs) are used. TRNGs generate numbers based on physical processes and natural phenomena that are inherently random, such as: measuring the thermal noise in electronic circuits, monitoring the decay of radioactive materials, or observing the behavior of photons passing through a semi-transparent mirror. TRNGs are generally slower, require an external sensor, and are thus more expensive than PRNGs but provide true randomness.

### Model Training

We will employ the C5.0 algorithm from the **C50** package to train our decision tree model. If you haven't already installed the package, you can do so by running `install.packages("C50")` or using the interface of R Studio. After installation, load the package into your R session with `library(C50)`. The code below checks whether the package is installed and, if not, installs it.

```{r warning=F, message=F}
package_name <- "C50"

if (!require(package_name, 
             character.only = T, 
             warn.conflicts = F,
             quietly = T)) {
    install.packages(package_name, dependencies = TRUE)
}

library(C50)
```

The model is built using the function `C5.0()`. This function has five key arguments:

-   *x*: a data frame of the training features
-   *y*: a factor vector with two or more levels of the corresponding target classes
-   *trials*: an optional argument indicating the number of boosting iterations that are applied (default is 1)
-   *costs*: a matrix of costs associated with possible errors; the matrix must have *C* columns and rows where *C* is the number of levels in the factor vector for *y*
-   *rules*: a logical argument indicating whether the tree should be decomposed into rules

We first convert the target variable from "character" to a factor, as that is what the `C5.0()` function expects. In hindsight, it would have been cleaner to do this conversion before splitting the data.

```{r}
train.df$Class <- as.factor(train.df$Class)
val.df$Class <- as.factor(val.df$Class)
```

Next, we train a decision tree model using the C5.0 algorithm on all features, except for column 21, which is the target variable. Be sure not to include the target variable's column among the training features, as that would leak the target into the model and render its "predictions" meaningless.

```{r}
credit.model <- C5.0(train.df[-21], train.df$Class, trials = 1, rules = F)
```

The "decision tree" model is now available for prediction. We can view the model's characteristics by printing it out:

```{r}
print(credit.model)
```

The actual "tree" can be visualized - in a rudimentary fashion -- using the `summary()` function. Only a partial output of the model is shown.

```{r eval=F}
summary(credit.model)
```

```         
C5.0 [Release 2.07 GPL Edition]     Sun Jun 23 13:00:32 2024
-------------------------------

Class specified by attribute `outcome'

Read 900 cases (21 attributes) from undefined.data

Decision tree:

Status.of.existing.checking.account in {A14,A13}: no (416/56)
Status.of.existing.checking.account in {A12,A11}:
:...Credit.history in {A30,A31}:
    :...Housing = A151: yes (15)
    :   Housing in {A152,A153}:
    :   :...Other.debtors.guarantors in {A102,A103}: no (3)
    :       Other.debtors.guarantors = A101:
    :       :...Housing = A153: yes (11/1)
    :           Housing = A152:
    
    ...
```

In terms of decision tree construction with C5.0, the numbers in parentheses in the summary above indicate the number of examples that meet the criteria for that decision and the number incorrectly classified by the decision. For instance, on the first line, 416/56 means that out of the 416 examples reaching that decision point, 56 were incorrectly classified as "not likely to default." This indicates that 56 applicants actually defaulted despite the model's prediction to the contrary.

Occasionally, a decision tree produces decision sequences that may seem (or are) illogical. For example, it might suggest that an applicant with a very good credit history is likely to default, while those whose checking balance is unknown are not likely to default. These contradictory rules can occur and may either reflect real patterns in the data or be statistical anomalies. It is essential to investigate such unusual branches to determine whether the tree's logic is valid and is reasonable for the business domain; naturally, this requires business subject matter expertise.

### Model Evaluation

While the `summary()` function outputs a "confusion matrix", we can generate one directly as well by using the decision tree model to predict the target variable for all of the cases in the validation data set where we know the actual outcome -- the confusion matrix then compares the predicted output to the actual observed output.

In the code below, we use the function `predict()` to make predictions using the features of the validation data set (minus the target column "Class" which is column 21).

```{r}
credit.preds.val <- predict(credit.model, val.df[-21])
```

We can now create a confusion matrix using the function `confusionMatrix()` from the **caret** package, comparing the predicted labels against the actual observed labels for the validation data set.

```{r warning=F, message=F}
package_name <- "caret"

if (!require(package_name, 
             character.only = T, 
             warn.conflicts = F,
             quietly = T)) {
    install.packages(package_name, dependencies = TRUE)
}

library(caret)
```

```{r}
conf.matrix.c50 <- confusionMatrix(credit.preds.val, val.df$Class)
print(conf.matrix.c50)
```

Alternatively, we could use the function `CrossTable()` from the **gmodels** package. As always, we need to check first if the package is installed, and, if not, install it before loading the package.

```{r warning=F, message=F}
package_name <- "gmodels"

if (!require(package_name, 
             character.only = T, 
             warn.conflicts = F,
             quietly = T)) {
    install.packages(package_name, dependencies = TRUE)
}

library(gmodels)
```

```{r}
CrossTable(val.df$Class, credit.preds.val,
             prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
             dnn = c('actual default', 'predicted default'))
```

From the above we can infer that out of the 100 loan applications in the validation data set, our C5.0 decision tree model correctly predicted that 57 did not default and 16 did default, resulting in an accuracy of 73% ((57 + 16) / 100) and an error rate of 27%, where the model incorrectly predicted the actual observed outcome. While this is a bit worse than one might expect given the 30% observed default rate in the training data, it is not surprising: a model's performance on unseen data is almost always worse than its performance on the training data.

It is noteworthy that the model correctly predicted only 16 of the actual 28 defaults, a default-detection rate of 57% (16/28) -- only a bit better than a random guess. That is not very encouraging, and it is worth investigating whether a better model can be built that is more "biased" towards correctly predicting defaults, as those are of greater business interest.
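
These metrics can also be pulled programmatically from the **caret** confusion matrix object. Note that `confusionMatrix()` treats the first factor level ("no") as the positive class by default, so the default-detection rate of 16/28 appears as the *specificity*:

```{r}
conf.matrix.c50$overall["Accuracy"]                       # overall accuracy
conf.matrix.c50$byClass[c("Sensitivity", "Specificity")]  # per-class rates
```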

### Model Improvement

Once an initial model is trained and evaluated, we can tune model parameters or try out engineered features to see if the model's accuracy, F1-Score, precision, or recall can be improved to better suit the business use cases. One clear issue is that we have a relatively small training data set and the random sampling into training and validation can have a significant effect on the decision tree that is generated.

A common strategy for C5.0 is *boosting*. Boosting is an ensemble learning technique that aims to improve the accuracy of a predictive model by combining the strengths of multiple weak learners. In the context of decision trees, boosting involves creating a series of decision trees, where each subsequent tree attempts to correct the errors made by the previous ones. The goal is to convert a collection of weak learners (trees that are only slightly better than random guessing) into a strong learner (a highly accurate model).

The process of building a boosted model begins by initializing the model with a single weak learner, typically a small decision tree. The initial model predicts the target variable based on the training data. Boosting proceeds iteratively, adding one tree at a time. Each new tree is trained to predict the residual errors (the differences between the actual values and the predictions made by the current model) from the previous iteration. During each iteration, boosting adjusts the weights of the training examples. Examples that were incorrectly predicted by the previous trees are given higher weights, making them more important in the next iteration. This ensures that the new tree focuses more on the hard-to-predict cases. The final boosted model is a weighted sum of all the individual trees. Each tree's contribution to the final prediction is weighted based on its accuracy. This combination helps to smooth out individual errors and improve the overall model performance.

#### Key Variants of Boosting

1.  **AdaBoost (Adaptive Boosting)**: AdaBoost is one of the earliest and most well-known boosting algorithms. It works by adjusting the weights of incorrectly classified instances so that subsequent classifiers focus more on difficult cases. AdaBoost combines the predictions of the weak learners using a weighted majority vote.

2.  **Gradient Boosting**: Gradient Boosting builds trees sequentially, where each tree is trained to reduce the residual errors of the previous tree using a gradient descent-like procedure. It generalizes AdaBoost by optimizing any differentiable loss function.

3.  **XGBoost (Extreme Gradient Boosting)**: XGBoost is an efficient and scalable implementation of gradient boosting that includes several optimizations, such as regularization techniques, to prevent overfitting. It is highly effective in practice and has become a popular choice in many machine learning competitions.

4.  **LightGBM and CatBoost**: LightGBM (Light Gradient Boosting Machine) and CatBoost (Categorical Boosting) are other advanced implementations of gradient boosting. LightGBM is designed for efficiency and scalability, while CatBoost is optimized for categorical data and handles high-cardinality features well.

#### Advantages of Boosting

1.  **Improved Accuracy**: Boosting significantly improves the accuracy of the model by focusing on difficult-to-predict instances and reducing bias and variance.

2.  **Robustness**: By combining multiple weak learners, boosting creates a robust model that is less prone to overfitting compared to a single complex model.

3.  **Flexibility**: Boosting can be applied to various types of weak learners, although decision trees are commonly used due to their simplicity and interpretability.

4.  **Handling Imbalanced Data**: Boosting can be particularly effective in handling imbalanced datasets, as it focuses on hard-to-predict cases which often belong to the minority class.

The C5.0 algorithm in the **C50** package supports boosting. Below is an example that uses boosting with C5.0 to train a decision tree model -- the key parameter is *trials*, set to a value greater than 1 (the default):

```{r boostedModel}
credit.boosted.model <- C5.0(x = train.df[-21], 
                             y = train.df$Class, 
                             trials = 10)
```

Let's see if the boosted model performs better:

```{r}
credit.preds.boosted.val <- predict(credit.boosted.model, val.df[-21])
```

```{r}
conf.matrix.boosted <- confusionMatrix(credit.preds.boosted.val, val.df$Class)
print(conf.matrix.boosted)
```

And, once again, let's build a prediction table:

```{r}
CrossTable(val.df$Class, credit.preds.boosted.val,
             prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
             dnn = c('actual default', 'predicted default'))
```

The boosted model made correct predictions at a rate of 78%, which is a slight improvement. However, the rate of correctly predicting a default increased to 68% (19/28) -- a more significant improvement.

Given the apparent ease of adding boosting, one might wonder why it isn't applied by default to every decision tree. There are two important reasons for not doing so. For one, if building a single decision tree requires substantial computational resources (it is, after all, a computationally complex algorithm), constructing multiple trees for boosting can become computationally impractical -- it would simply take too long to train a boosted model, and during that training time new training data may already have been generated if data velocity is high. In addition, in cases where the training data is very noisy, boosting may not yield any improvement and could even exacerbate the noise. Nevertheless, when higher accuracy is important, it is worth experimenting with boosting to see if it provides the desired enhancement in performance, perhaps experimenting on a smaller random sample of the data.

Overall, boosting is a useful (and commonly applied) technique that can significantly improve the performance of decision trees by iteratively focusing on hard-to-predict instances and combining multiple weak learners. This generally results in a robust and accurate model, making boosting a popular choice in various machine learning tasks.

### Directing Bias

Some incorrect classifications are more impactful than others. For instance, giving a loan to someone who is likely to default -- a false negative, since the model predicted no default -- is more consequential than denying a loan to someone who is unlikely to default (a false positive). In other words, false positives are preferable to false negatives here. Data scientists can deliberately introduce bias into the model so that the costlier errors are avoided.

The C5.0 algorithm offers the opportunity to assign different penalties (costs) to various types of classification errors, thereby discouraging the model from making more costly mistakes -- in other words, "biasing" the model. This is achieved through the use of a cost matrix, which details the relative cost of each type of error. By specifying these costs, we can guide the algorithm to prioritize minimizing the most expensive errors, improving the model's performance in contexts where some mistakes are more critical than others.

A cost matrix for a binary classifier is a 2x2 matrix. The code below demonstrates how to construct a cost matrix.

```{r}
matrix.dims <- list(c("no", "yes"), c("no", "yes"))
names(matrix.dims) <- c("predicted", "actual")
```

Next, we need to define the cost (*i.e.*, the penalty) for each type of error. Since R populates a matrix by filling columns from top to bottom, we must provide the values in this order, given that "predicted" labels the rows (the first dimension) and "actual" labels the columns (the second dimension):

1.  Predicted no, actual no
2.  Predicted yes, actual no
3.  Predicted no, actual yes
4.  Predicted yes, actual yes

The costs (or penalties) are relative. A cost of 0 means that there is no cost. Let's fill the matrix with some cost values; the cost matrix's dimensions must match the dimension names defined above:

```{r}
cost.matrix <- matrix(c(0, 1, 4, 0), 
                      nrow = 2,
                      dimnames = matrix.dims)
```

Here is what the cost matrix contains:

```{r}
print(cost.matrix)
```

For instance, the cost of a default when none was predicted is four times the cost of predicting a default when there was none, so a false negative has four times the impact of a false positive. Naturally, the correct predictions carry a cost of 0 as we do not want to discourage those.

Now that we have a cost matrix, we can train a new C5.0 decision tree model with boosting that also takes those costs into account when building the tree.

```{r boostedModelWithCost}
credit.boosted.costs.model <- C5.0(x = train.df[-21], 
                                   y = train.df$Class, 
                                   trials = 10,
                                   costs = cost.matrix)
```

Let's see if the boosted model with costs performs better:

```{r}
credit.preds.boosted.costs.val <- predict(credit.boosted.costs.model, 
                                          val.df[-21])
```

And, once again, let's build a prediction table:

```{r}
CrossTable(val.df$Class, credit.preds.boosted.costs.val,
             prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
             dnn = c('actual default', 'predicted default'))
```

The accuracy (correct predictions, *i.e.*, true positives and true negatives) is now 79%, about the same as before, although a bit better. However, false positives are now at 6% while false negatives are at 15%. Even if the accuracy of a model boosted with costs turns out somewhat lower than that of one boosted without costs, the mix of errors is more amenable to business interests.
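
One way to compare the models on business terms is by total misclassification cost rather than raw accuracy. The sketch below (with a hypothetical helper `total.cost()`) weights each cell of the confusion table by our cost matrix; note that `table(predicted = ..., actual = ...)` puts predictions in the rows, matching the orientation of `cost.matrix`:

```{r}
# total penalty: element-wise product of the confusion table and cost matrix
total.cost <- function(actual, predicted, costs) {
  cm <- table(predicted = predicted, actual = actual)  # rows: predicted
  sum(cm * costs)
}

total.cost(val.df$Class, credit.preds.boosted.val, cost.matrix)        # boosted only
total.cost(val.df$Class, credit.preds.boosted.costs.val, cost.matrix)  # boosted + costs
```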

## Summary

In this lesson, we discussed the construction process and common enhancements of decision trees using the C5.0 algorithm. Decision trees produce powerful, interpretable models for classification tasks. Their performance can be further improved through boosting, which involves combining multiple trees to reduce errors and increase accuracy. However, boosting may not always be practical due to computational demands and sensitivity to noisy data.

We also explored the concept of a cost matrix in the C5.0 algorithm, which allows assigning different penalties to various types of classification errors. This helps prioritize the minimization of more costly mistakes, thereby tailoring the model's performance to specific business needs.

------------------------------------------------------------------------

## Files & Resources

```{r zipFiles, echo=FALSE}
zipName = sprintf("LessonFiles-%s-%s.zip", 
                 params$category,
                 params$number)

textALink = paste0("All Files for Lesson ", 
               params$category,".",params$number)

# downloadFilesLink() is included from _insert2DB.R
knitr::raw_html(downloadFilesLink(".", zipName, textALink))
```

------------------------------------------------------------------------

## References

Hofmann, Hans. (1994). Statlog (German Credit Data). UCI Machine Learning Repository. <https://doi.org/10.24432/C5NC77>.

Lantz, Brent. (2019). *Machine learning with R: Expert Techniques for Predictive Modeling*. Packt Publishing Ltd.

## Errata

None collected yet. [Let us know](https://form.jotform.com/212187072784157){target="_blank"}.
