Overview
Evaluating classification models is a crucial aspect of supervised machine learning. It helps us understand how well our models are performing and guides us in improving their accuracy and effectiveness. There are several metrics used for this purpose, each with its own strengths and context of use. Some of the most commonly used metrics include:
Accuracy: This is the most intuitive performance measure: the ratio of correctly predicted observations to the total observations. It’s best used when the classes are roughly balanced and the costs of false positives and false negatives are roughly the same.
Precision: Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. It’s important when the cost of false positives is high.
Recall (Sensitivity): Recall is the ratio of correctly predicted positive observations to all observations in actual class. It’s used when the cost of false negatives is high.
F1 Score: The F1 Score is the harmonic mean of Precision and Recall. It takes both false positives and false negatives into account. It’s useful when you want to balance precision and recall.
Confusion Matrix: A confusion matrix is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known.
ROC Curve and AUC: The ROC curve is a graphical representation of the trade-off between the true positive rate and false positive rate at various thresholds. AUC represents the degree or measure of separability achieved by the model.
Specificity: Specificity measures the proportion of actual negatives that are correctly identified as such. It’s important in contexts where the cost of false positives is high.
Log Loss: Also known as logistic loss or logit loss, it measures the performance of a classification model where the prediction is a probability between 0 and 1.
Each of these metrics provides different insights into the performance of a classification model, and the choice of metrics often depends on the specific requirements and context of the problem being solved.
What is labeled “positive” and what is labeled “negative” in the target variable is a decision made by the analyst; it is a business choice rather than a property of the data. For example, in disease detection, the presence of a disease is commonly marked as “positive”, while the absence would be marked as “negative”. If we were to evaluate a model predicting whether a customer is likely to buy again, the “positive” might best be the outcome “customer buys again”.
When classification is between two classes (“positive” and “negative”), we call this binary classification. When there are more than two classes, it is a multiclass (or multivariate or multilevel) classification. Some supervised machine learning algorithms are principally used for binary classification (e.g., logistic regression), while others are inherently better suited for multiclass classification (e.g., kNN, decision tree, and random forest).
Model Evaluation
There are several methods for evaluating classification models. All methods involve, in some way, the training of a classification model using a supervised machine learning algorithm (e.g., kNN or logistic regression) on a labeled training data set. The most common methods are listed below; chief among these are the holdout method and k-fold cross validation (kCV).
Several techniques are used for validating supervised machine learning models. Each method has its advantages and is suited to different scenarios. Some of the most commonly used methods include:
- Holdout Method:
- Description: The dataset is divided into two parts: the training set and the testing (or holdout) set. The model is trained using the training data and evaluated using the testing data set.
- Advantages: Simple to apply.
- Disadvantages: The model and the evaluation can be greatly influenced by the choice of training data.
- K-Fold Cross-Validation:
- Description: The dataset is divided into ‘k’ equally (or nearly equally) sized folds or subsets. The model is trained ‘k’ times, each time using a different fold as the testing set and the remaining ‘k-1’ folds as the training set.
- Advantages: It ensures that every observation from the original dataset has the chance of appearing in the training and test set. This is especially useful with smaller datasets.
- Disadvantages: It can be computationally intensive, especially for large datasets or complex models.
- Stratified K-Fold Cross-Validation:
- Description: Similar to K-Fold, but the folds are made by preserving the percentage of samples for each class. This is crucial in dealing with imbalanced datasets.
- Advantages: Maintains a balanced representation of the original dataset, particularly important for classification problems with imbalanced class distributions.
- Disadvantages: More complex to implement than simple K-Fold cross-validation.
- Leave-One-Out Cross-Validation (LOOCV):
- Description: A special case of K-Fold cross-validation where ‘k’ equals the number of observations in the dataset. Essentially, each observation is used as a single test example, and the rest are used for training.
- Advantages: Maximizes the amount of data used for training the model.
- Disadvantages: Extremely computationally expensive and impractical with large datasets. It can also have high variance as a single observation can sometimes be a poor representation of the dataset.
- Leave-P-Out Cross-Validation:
- Description: Similar to LOOCV, but instead of leaving out one observation at a time, ‘p’ observations are left out.
- Advantages: Allows for more thorough testing than LOOCV in certain cases.
- Disadvantages: Computationally very intensive and less commonly used.
- Bootstrap Method:
- Description: Involves randomly sampling with replacement from the dataset to create multiple training datasets. The model is trained on these bootstrap samples and tested on the unseen instances.
- Advantages: Useful for estimating the distribution of a statistic (e.g., mean, variance) and provides a measure of uncertainty.
- Disadvantages: Can lead to overfitting if not implemented correctly, as it involves sampling with replacement.
- Time Series Split:
- Description: Specifically used for time series data. The dataset is split into a sequence of training and test sets, where each successive test set is ‘moved forward in time’.
- Advantages: Respects the temporal order of observations, which is critical in time-series analysis.
- Disadvantages: Not applicable to non-time-series datasets and can be sensitive to the period chosen for training and testing.
- Random Subsampling:
- Description: Similar to the holdout method, but the process is repeated multiple times with different random splits of the dataset into training and test sets.
- Advantages: Simpler and less computationally intensive than K-Fold cross-validation.
- Disadvantages: Less comprehensive and can have high variance depending on the splits.
Each of these methods offers a different approach to assessing the performance of a machine learning model, and the choice of method can depend on the specific characteristics of the data and the practical constraints of model training and evaluation.
Holdout Method
The holdout method is a simple and widely used technique for validating supervised machine learning models, particularly in classification tasks. This method involves splitting the dataset into two subsets: one for training the model and the other for testing its performance.
The process for evaluating a classification model obtained from a supervised machine learning algorithm generally follows these steps:
- Splitting the Dataset:
- The dataset is divided into two parts: the training set and the testing (or holdout) set.
- A common split ratio is 70% of the data for training and 30% for testing, but this can vary based on the dataset size and specific requirements.
- Training the Model:
- The model is trained exclusively on the training set. This set is used to fit the model parameters.
- Testing the Model:
- After training, the model is evaluated on the testing set. This set is not used during the training phase, so it provides an unbiased evaluation of the model.
- Performance metrics such as accuracy, precision, recall, F1 score, ROC-AUC, etc., are computed to assess the model’s performance.
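The splitting step above can be sketched in a few lines of Python. The `holdout_split` helper below is an illustrative name, not a library function; it shuffles indices with a fixed seed for reproducibility and carves off a 30% test set.

```python
import random

def holdout_split(data, labels, test_fraction=0.3, seed=42):
    """Randomly split a dataset into training and testing subsets."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    n_test = int(len(data) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = ([data[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([data[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test

# Toy dataset: 10 observations, 70/30 split
X = list(range(10))
y = [x % 2 for x in X]
(train_X, train_y), (test_X, test_y) = holdout_split(X, y)
print(len(train_X), len(test_X))  # 7 3
```

In practice a library routine (e.g., scikit-learn’s `train_test_split`) would be used, but the logic is the same: shuffle, then slice.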
Advantages of the Holdout Method
- Simplicity: It is straightforward and easy to implement.
- Speed: Less computationally expensive compared to methods like cross-validation, especially for large datasets.
Disadvantages
- Data Split Dependence: The performance estimate can be highly dependent on how the data is split. If the split is not representative of the overall dataset, it can lead to misleading performance estimates.
- Limited Data Utilization: Since a portion of the data is set aside for testing, it’s not used for training. This can be a drawback, especially with small datasets.
Example Scenario
Imagine you have a dataset of 10,000 images to build a model that classifies images as either cats or dogs. Using the holdout method, you might:
- Use 7,000 images to train the model.
- Reserve the remaining 3,000 images to test the model.
After training, you evaluate the model’s performance on the 3,000 test images. The accuracy, precision, recall, and other relevant metrics calculated from this test set give you an estimate of how well your model will perform on unseen data.
Best Practices
- Random Split: Ensure that the split between the training and testing sets is random. This helps in making the split representative of the whole dataset.
- Stratified Split: If the dataset is imbalanced (e.g., 90% cats and 10% dogs), use stratified sampling to maintain the same proportion in both training and testing sets.
- Iterative Approach: For more robust validation, consider using the holdout method in combination with techniques like cross-validation, especially when dealing with smaller datasets.
The holdout method, despite its simplicity, can be a powerful tool in model validation, provided it’s used correctly and the limitations are acknowledged.
k-Fold Cross Validation
K-Fold Cross-Validation is a robust method for assessing the performance of machine learning models, particularly useful for its ability to provide a more reliable estimate of model performance on unseen data.
Description of K-Fold Cross-Validation
- Splitting the Dataset:
- The entire dataset is divided into ‘k’ equal (or nearly equal) sized subsets or ‘folds’.
- Common choices for ‘k’ include 5 or 10, but the optimal number can depend on the size and specifics of the dataset.
- Model Training and Validation Process:
- The process is repeated ‘k’ times, with each of the ‘k’ folds used exactly once as the validation set.
- In each iteration, a different fold is treated as the validation set, and the remaining ‘k-1’ folds are combined to form the training set.
- The model is trained on the training set and evaluated on the validation set.
- After ‘k’ iterations, every data point has been used both for training and validation.
- Aggregating Results:
- The performance measure (e.g., accuracy, precision, recall) is calculated for each of the ‘k’ iterations.
- The final model performance is typically reported as the average of these ‘k’ performance measures.
Advantages
- Reduced Bias: Since every observation gets to be in a test set exactly once and in a training set ‘k-1’ times, it reduces bias compared to methods like the holdout method.
- Utilization of All Data: It allows for both training and testing on all available data, maximizing the use of data, which is particularly beneficial for smaller datasets.
- Robust Performance Estimate: Provides a more accurate and robust estimate of model performance, as it averages the results from ‘k’ iterations.
- Useful for Limited Data: Ideal for scenarios with limited data, where it’s essential to use the dataset efficiently.
Disadvantages
- Computational Cost: It can be computationally expensive, especially for large datasets and complex models, as it requires the model to be trained and evaluated ‘k’ times.
- Time-Consuming: The increased computational cost translates to longer training times, which can be a significant drawback in time-sensitive projects.
- Variance in Performance: The performance might still vary depending on how the data is split into folds, though less so than with the holdout method.
- Choice of ‘k’: Selecting the appropriate value of ‘k’ can be challenging. A larger ‘k’ provides less bias towards overestimating the true expected error (as each test set is smaller), but the variance of the resulting estimate can be higher.
Example
In a dataset with 100 observations, using 10-fold cross-validation would involve:
- Splitting the data into 10 folds of 10 observations each.
- In each iteration, 9 folds (90 observations) are used for training, and 1 fold (10 observations) is used for validation.
- After 10 iterations, the performance metric (e.g., accuracy) for each iteration is averaged to provide an overall performance estimate.
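The fold bookkeeping can be sketched in plain Python. The helper names `k_fold_indices` and `k_fold_splits` below are illustrative, not library functions; they partition sample indices into k nearly equal folds and yield one train/test split per fold.

```python
def k_fold_indices(n_samples, k):
    """Partition sample indices into k (nearly) equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    folds = k_fold_indices(n_samples, k)
    for i, test_idx in enumerate(folds):
        train_idx = [j for f_i, f in enumerate(folds) if f_i != i for j in f]
        yield train_idx, test_idx

splits = list(k_fold_splits(100, 10))
print(len(splits))                            # 10 iterations
print(len(splits[0][0]), len(splits[0][1]))   # 90 train, 10 test
```

Note that this sketch keeps the data in its original order; in practice the indices are usually shuffled (or stratified) before folding.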
K-Fold Cross-Validation is widely used due to its balance of efficiency and effectiveness, particularly in scenarios where the available data is limited and one needs to get the most reliable performance estimate possible from the dataset.
Accuracy
Accuracy is a fundamental metric in both binary and multiclass classification problems in supervised machine learning. It measures the proportion of correct predictions (both true positives and true negatives) made by the model out of all predictions.
Binary Classification
In binary classification, there are only two classes (often labeled as positive and negative, or 1 and 0).
Example
Imagine a medical test for a disease where:
- 90 people are correctly identified as having the disease (TP).
- 900 people are correctly identified as not having the disease (TN).
- 10 people are incorrectly identified as having the disease (FP).
- 100 people are incorrectly identified as not having the disease (FN).
The accuracy of this test is calculated as:
\[ \text{Accuracy} = \frac{90 + 900}{90 + 900 + 10 + 100} = \frac{990}{1100} = 0.90 \]
This means the test correctly identifies the disease status 90% of the time.
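The calculation can be expressed as a one-line Python function (the `accuracy` helper below is illustrative):

```python
def accuracy(tp, tn, fp, fn):
    """Proportion of correct predictions among all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

# The medical-test example: 90 TP, 900 TN, 10 FP, 100 FN
print(accuracy(tp=90, tn=900, fp=10, fn=100))  # 0.9
```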
Multiclass Classification
In multiclass classification, there are more than two classes.
Example
Consider a classification problem with three classes (A, B, and C) and a dataset with the following results:
- Class A: 30 correct predictions, 5 incorrect predictions.
- Class B: 40 correct predictions, 15 incorrect predictions.
- Class C: 50 correct predictions, 10 incorrect predictions.
The total number of observations is \(30 + 5 + 40 + 15 + 50 + 10 = 150\).
The accuracy of the model is:
\[ \text{Accuracy} = \frac{30 + 40 + 50}{150} = \frac{120}{150} = 0.80 \]
This means the model correctly predicts the class 80% of the time.
Key Points
- Binary Classification: Accuracy is straightforward, focusing on TP and TN out of all observations.
- Multiclass Classification: Accuracy considers correct predictions across all classes.
- Limitations: Accuracy can be misleading in imbalanced datasets where one class significantly outweighs others. In such cases, other metrics like precision, recall, and F1 score might provide a more nuanced understanding of the model’s performance.
Accuracy is a good initial indicator of model performance but should be used alongside other metrics for a comprehensive evaluation, especially in cases of class imbalance or when the costs of different types of errors vary significantly.
Precision
Precision is a common and often used metric in the evaluation of classification models, especially in scenarios where the cost of false positives (incorrectly predicting the positive class) is high. It gives us insight into the accuracy of the positive predictions made by the model.
Definition of Precision
Precision is defined as the ratio of correctly predicted positive observations to the total predicted positive observations. In simpler terms, it answers the question: “Of all the instances the model labeled as positive, how many were actually positive?”
Calculating Precision
The formula for precision is:
\[ \text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP) + False Positives (FP)}} \]
where,
- True Positives (TP) are the instances correctly predicted as positive
- False Positives (FP) are the instances incorrectly predicted as positive
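The formula translates directly into code (the `precision` helper below is illustrative):

```python
def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp)

# e.g., 90 true positives and 10 false positives
print(precision(tp=90, fp=10))  # 0.9
```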
Interpretation
- A precision of 1.0 means that every item labeled as positive is indeed positive (but says nothing about the items labeled as negative).
- A lower precision indicates a high number of false positives among the labeled positives.
Common Use
Precision is particularly important in fields where the cost of a false positive is high. For example, in email spam detection, a false positive (labeling a good email as spam) is more problematic than a false negative (failing to identify a spam email). Similarly, in medical testing, falsely diagnosing a healthy patient with a disease could be more critical than missing the disease in its early stages.
Examples
Example I: Email Spam Filtering
Context: In email spam filtering systems, the goal is to identify and filter out spam emails while ensuring legitimate emails reach the user’s inbox.
- True Positive (TP): A spam email correctly identified as spam.
- False Positive (FP): A legitimate email incorrectly identified as spam (this is particularly undesirable as it could lead to missing important emails).
Precision in this Scenario:
- High precision means that most of the emails identified as spam are indeed spam, minimizing the risk of important emails being wrongly filtered out.
- If a spam filter has a precision of 0.95, it means that 95% of the emails it marks as spam are actually spam, and only 5% are legitimate emails mistakenly identified as spam.
Importance:
- In email filtering, users typically prefer to receive a few spam emails in their inbox rather than miss an important legitimate email. Therefore, maintaining high precision is crucial to avoid the inconvenience and potential loss caused by missing important emails.
Example II: Medical Diagnosis for a Serious Disease
Context: Consider a medical test designed to diagnose a serious, potentially life-threatening disease.
- True Positive (TP): Correctly identifying a patient with the disease.
- False Positive (FP): Incorrectly diagnosing a healthy person with the disease.
Precision in this Scenario:
- High precision indicates that a high proportion of patients diagnosed with the disease actually have the disease.
- For instance, if a test has a precision of 0.90, it implies that 90% of the diagnosed cases are true positives, whereas 10% are false positives.
Importance:
- In medical diagnostics, especially for serious conditions, a false positive can lead to unnecessary stress, further invasive testing, and potentially harmful treatment for the patient. Therefore, having a high precision rate is crucial to minimize these risks.
In both examples, while high precision is desirable, it is also essential to balance it with other metrics like recall, especially in medical scenarios where missing a true case (high recall) can be as critical as avoiding false alarms (high precision). These examples highlight the importance of precision in contexts where the consequences of false positives are significant.
Precision in Multivariate Classification
In a multiclass (or multivariate) classification scenario, where there are more than two possible outcomes, precision is calculated for each class separately and then can be averaged to obtain an overall precision. This process involves considering each class as the “positive” class (of interest) and all other classes as “negative” (not of interest) for the purpose of the calculation.
Steps to Calculate Precision in Multiclass Classification
- Calculate Precision for Each Class:
- For each class, calculate precision as: \[ \text{Precision}_{\text{class}} = \frac{\text{True Positives (TP)}_{\text{class}}}{\text{True Positives (TP)}_{\text{class}} + \text{False Positives (FP)}_{\text{class}}} \]
- Here, TP for a class is the number of times the class was correctly predicted, and FP is the number of times other classes were incorrectly predicted as this class.
- Average the Precision Scores:
- Macro-average Precision: Calculate the average of the precision scores for each class. This treats all classes equally, regardless of their frequency in the dataset. \[ \text{Macro-average Precision} = \frac{\sum \text{Precision}_{\text{class}}}{\text{Number of classes}} \]
- Weighted-average Precision: Calculate the average of precision scores for each class, weighted by the number of true instances for each class. This accounts for class imbalance. \[ \text{Weighted-average Precision} = \sum \left( \frac{\text{Number of true instances in class}}{\text{Total number of instances}} \times \text{Precision}_{\text{class}} \right) \]
Example Calculation
Imagine a classification problem with three classes: A, B, and C. Let’s say we have the following counts:
- Class A: \(\text{TP}_A = 30, \text{FP}_A = 10\)
- Class B: \(\text{TP}_B = 40, \text{FP}_B = 20\)
- Class C: \(\text{TP}_C = 50, \text{FP}_C = 5\)
The precision for each class would be:
- Precision for Class A: \(\text{Precision}_A = \frac{30}{30 + 10} = 0.75\)
- Precision for Class B: \(\text{Precision}_B = \frac{40}{40 + 20} \approx 0.67\)
- Precision for Class C: \(\text{Precision}_C = \frac{50}{50 + 5} \approx 0.91\)
Then, the macro-average precision across all classes would be:
\[ \text{Macro-average Precision} = \frac{0.75 + 0.67 + 0.91}{3} \approx 0.78 \]
And if we were to calculate the weighted-average precision assuming, for simplicity, an equal number of true instances in each class, it would coincide with the macro-average.
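The per-class calculation and the macro-average can be reproduced in a few lines (variable names here are illustrative):

```python
def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp)

# class: (TP, FP), from the example above
counts = {"A": (30, 10), "B": (40, 20), "C": (50, 5)}
per_class = {c: precision(tp, fp) for c, (tp, fp) in counts.items()}

# Macro-average: unweighted mean of the per-class precisions
macro = sum(per_class.values()) / len(per_class)
print(round(macro, 2))  # 0.78
```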
Using these methods, you can account for the performance of a multiclass classification model in correctly identifying each class while considering the specific importance or frequency of each class.
Recall
Recall, also known as sensitivity, is an important metric in classification problems, used to measure the proportion of actual positives that are correctly identified by the classification model.
Recall in Binary Classification
In binary classification, there are two possible outcomes: positive and negative.
Example
Consider a medical test to identify a disease:
- The test correctly identifies 80 patients with the disease (TP).
- 20 patients with the disease are missed by the test (FN).
- The total number of actual patients with the disease is 100 (80 TP + 20 FN).
The recall of this test is:
\[ \text{Recall} = \frac{80}{80 + 20} = \frac{80}{100} = 0.80 \]
This means that the test correctly identifies 80% of the patients who actually have the disease.
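As with precision, the formula is a one-liner in code (the `recall` helper below is illustrative):

```python
def recall(tp, fn):
    """Fraction of actual positives the model identifies."""
    return tp / (tp + fn)

# The medical-test example: 80 TP, 20 FN
print(recall(tp=80, fn=20))  # 0.8
```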
Recall in Multiclass Classification
In multiclass classification, there are more than two possible outcomes.
Example
Consider a classification problem with three classes (A, B, C) with the following results:
- Class A: 30 TP, 5 FN.
- Class B: 40 TP, 10 FN.
- Class C: 50 TP, 20 FN.
Recall for each class would be:
- Class A: \(\text{Recall}_A = \frac{30}{30 + 5} \approx 0.86\)
- Class B: \(\text{Recall}_B = \frac{40}{40 + 10} = 0.80\)
- Class C: \(\text{Recall}_C = \frac{50}{50 + 20} \approx 0.71\)
The macro-average recall across all classes would be the average of these three values, approximately 0.79.
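The per-class recalls and their macro-average can be checked in code (variable names here are illustrative):

```python
def recall(tp, fn):
    """Fraction of actual positives the model identifies."""
    return tp / (tp + fn)

# class: (TP, FN), from the example above
counts = {"A": (30, 5), "B": (40, 10), "C": (50, 20)}
per_class = {c: round(recall(tp, fn), 2) for c, (tp, fn) in counts.items()}
print(per_class)  # {'A': 0.86, 'B': 0.8, 'C': 0.71}

macro = sum(recall(tp, fn) for tp, fn in counts.values()) / len(counts)
print(round(macro, 2))  # 0.79
```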
Key Points
- Binary Classification: Recall measures the proportion of actual positives correctly identified.
- Multiclass Classification: Recall is computed for each class individually and then averaged.
- Importance: High recall indicates a lower number of false negatives. It is particularly important in scenarios like medical diagnosis, where missing an actual positive case (a disease) can be critical.
Recall is a valuable metric, especially when the consequences of false negatives are significant. However, it should be balanced with other metrics like precision and accuracy for a well-rounded model evaluation.
F1 Score
The F1 score is a metric that combines precision and recall into a single number, providing a balanced measure of a model’s accuracy, especially when dealing with imbalanced datasets.
F1 Score in Binary Classification
In binary classification, where outcomes are labeled as positive or negative, the F1 score is particularly useful.
Example
Consider a binary classification task:
- Precision = 0.75 (75% of the predicted positives are correct)
- Recall = 0.60 (60% of actual positives are correctly identified)
The F1 score would be:
\[ \text{F1 Score} = 2 \times \frac{0.75 \times 0.60}{0.75 + 0.60} \approx 0.67 \]
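The harmonic-mean formula can be verified in code (the `f1_score` helper below is illustrative):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.75, 0.60), 2))  # 0.67
```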
F1 Score in Multiclass Classification
In multiclass classification, the F1 score needs to be calculated for each class and then averaged.
Steps to Calculate F1 Score
- Calculate Precision and Recall for Each Class: Treat each class as the positive class and calculate precision and recall.
- Calculate F1 Score for Each Class: Use the formula for each class individually.
- Average the F1 Scores: You can use either:
- Macro-average: Simply average the F1 scores of all classes.
- Weighted-average: Average the F1 scores, weighted by the number of true instances for each class.
Example
Let’s say we have a classification problem with three classes (A, B, and C), and we calculated the following precision and recall for each class:
- Class A: Precision = 0.80, Recall = 0.70
- Class B: Precision = 0.60, Recall = 0.50
- Class C: Precision = 0.90, Recall = 0.85
The F1 scores for each class would be:
- Class A: \(\text{F1}_A = 2 \times \frac{0.80 \times 0.70}{0.80 + 0.70} \approx 0.75\)
- Class B: \(\text{F1}_B = 2 \times \frac{0.60 \times 0.50}{0.60 + 0.50} \approx 0.55\)
- Class C: \(\text{F1}_C = 2 \times \frac{0.90 \times 0.85}{0.90 + 0.85} \approx 0.87\)
The macro-average F1 score would be the average of these values, approximately 0.72.
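The per-class F1 scores and macro-average can be checked in code (variable names here are illustrative):

```python
def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# class: (precision, recall), from the example above
scores = {"A": (0.80, 0.70), "B": (0.60, 0.50), "C": (0.90, 0.85)}
per_class = {c: round(f1_score(p, r), 2) for c, (p, r) in scores.items()}
print(per_class)  # {'A': 0.75, 'B': 0.55, 'C': 0.87}

macro = sum(f1_score(p, r) for p, r in scores.values()) / len(scores)
print(round(macro, 2))  # 0.72
```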
Key Points
- The F1 score provides a balance between precision and recall, being particularly useful in situations where there is an imbalance in the dataset or when false positives and false negatives carry different costs.
- In binary classification, it’s straightforward as it directly combines the model’s precision and recall.
- In multiclass classification, the F1 score is calculated for each class and then averaged, providing a comprehensive view of the model’s performance across all classes.
The F1 score is a useful metric in many scenarios, as it accounts for both the precision and recall of the model, providing a more holistic view of its performance.
Confusion Matrix
A confusion matrix is a tool used in supervised learning to visualize the performance of a classification model. It’s particularly useful for understanding the types of errors a model is making.
Confusion Matrix in Binary Classification
In binary classification, the confusion matrix is a 2x2 table that shows the number of true positives, true negatives, false positives, and false negatives.
Components of a Binary Confusion Matrix
- True Positives (TP): Correctly predicted positive cases.
- True Negatives (TN): Correctly predicted negative cases.
- False Positives (FP): Incorrectly predicted positive cases (Type I error).
- False Negatives (FN): Incorrectly predicted negative cases (Type II error).
Example
Imagine a medical test for a disease:
- 50 patients have the disease and the test correctly identifies 40 (TP = 40).
- 100 patients do not have the disease and the test correctly identifies 90 (TN = 90).
- The test incorrectly identifies 10 healthy patients as having the disease (FP = 10).
- The test fails to identify the disease in 10 patients who have it (FN = 10).
The confusion matrix would look like this:
|                 | Predicted Positive | Predicted Negative |
|-----------------|--------------------|--------------------|
| Actual Positive | 40 (TP)            | 10 (FN)            |
| Actual Negative | 10 (FP)            | 90 (TN)            |
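The four cells can be tallied from label lists with simple counting (the `confusion_matrix_binary` helper below is illustrative; 1 denotes positive, 0 negative):

```python
def confusion_matrix_binary(actual, predicted):
    """Count TP, FN, FP, TN for labels 1 (positive) and 0 (negative)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return tp, fn, fp, tn

# Reproduce the medical-test example: 40 TP, 10 FN, 10 FP, 90 TN
actual    = [1] * 50 + [0] * 100
predicted = [1] * 40 + [0] * 10 + [1] * 10 + [0] * 90
print(confusion_matrix_binary(actual, predicted))  # (40, 10, 10, 90)
```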
Confusion Matrix in Multiclass Classification
In multiclass classification, the confusion matrix is larger, with dimensions equal to the number of classes. Each row represents the instances in an actual class, and each column represents the instances in a predicted class.
Steps to Calculate a Multiclass Confusion Matrix
- Determine the Number of Classes: Suppose there are N classes.
- Create an NxN Matrix: Each cell (i, j) in the matrix represents the number of instances of class i (actual) predicted as class j.
Example
Consider a classification problem with three classes: A, B, and C. After applying the model on a dataset, we get the following results:
- 30 of Class A were correctly classified (TP for A), 5 were classified as B, and 5 as C.
- 4 of Class B were incorrectly classified as A, 40 were correctly classified (TP for B), and 6 as C.
- 6 of Class C were classified as A, 8 as B, and 46 were correctly classified (TP for C).
The confusion matrix would look like this:
|          | Predicted A | Predicted B | Predicted C |
|----------|-------------|-------------|-------------|
| Actual A | 30          | 5           | 5           |
| Actual B | 4           | 40          | 6           |
| Actual C | 6           | 8           | 46          |
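The NxN construction generalizes the binary case directly (the `confusion_matrix` helper below is illustrative):

```python
def confusion_matrix(actual, predicted, classes):
    """Build a matrix where rows = actual class, columns = predicted class."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

# Reproduce the three-class example above
actual    = ["A"] * 40 + ["B"] * 50 + ["C"] * 60
predicted = (["A"] * 30 + ["B"] * 5 + ["C"] * 5      # actual A
             + ["A"] * 4 + ["B"] * 40 + ["C"] * 6    # actual B
             + ["A"] * 6 + ["B"] * 8 + ["C"] * 46)   # actual C
print(confusion_matrix(actual, predicted, ["A", "B", "C"]))
# [[30, 5, 5], [4, 40, 6], [6, 8, 46]]
```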
Key Points
- In binary classification, the confusion matrix is a simple 2x2 table, whereas, in multiclass classification, it expands to accommodate all classes.
- The diagonal cells (top-left to bottom-right) represent the number of correct predictions (true positives for each class).
- Off-diagonal cells show the distribution of errors, indicating which classes are being confused with others.
A confusion matrix provides a detailed breakdown of a model’s performance and is especially useful for identifying whether a model is confusing two classes, which can be crucial for improving model accuracy.
Specificity
Specificity is a metric used in classification tasks to measure the proportion of actual negatives that are correctly identified. It’s particularly important in contexts where the cost of false positives is high.
Specificity in Binary Classification
In binary classification, where there are only two classes (positive and negative), specificity is straightforward.
Example
Consider a medical test for a disease:
- 100 healthy individuals are tested, 90 of whom are correctly identified as not having the disease (TN = 90).
- 10 healthy individuals are incorrectly identified as having the disease (FP = 10).
The specificity of this test is:
\[ \text{Specificity} = \frac{90}{90 + 10} = \frac{90}{100} = 0.90 \]
This means the test correctly identifies 90% of healthy individuals.
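The binary formula in code (the `specificity` helper below is illustrative):

```python
def specificity(tn, fp):
    """Fraction of actual negatives correctly identified."""
    return tn / (tn + fp)

# The medical-test example: 90 TN, 10 FP
print(specificity(tn=90, fp=10))  # 0.9
```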
Specificity in Multiclass Classification
In multiclass classification, specificity is calculated for each class in a one-vs-rest fashion: the class of interest is treated as “positive” and all other classes are grouped together as “negative.” Specificity for that class then measures how well the model avoids mislabeling instances of the other classes as the class of interest.
Steps to Calculate Specificity for Each Class
- Treat Each Class as Positive Once: For each class, pool the remaining classes as “negative” and calculate TN and FP with respect to that class.
- Calculate Specificity for Each Class: Use the binary specificity formula for each class.
- Average the Specificity Scores: You can calculate a macro-average or weighted-average specificity, similar to precision and recall.
Example
Consider a classification problem with three classes (A, B, and C). After applying the model, we have the following confusion matrix:
|          | Predicted A   | Predicted B   | Predicted C   |
|----------|---------------|---------------|---------------|
| Actual A | 30 (TP for A) | 5 (FP for B)  | 5 (FP for C)  |
| Actual B | 4 (FP for A)  | 40 (TP for B) | 6 (FP for C)  |
| Actual C | 6 (FP for A)  | 8 (FP for B)  | 46 (TP for C) |
To calculate specificity for each class:
- For Class A: treat A as “positive” and B and C as “negative”; specificity is the proportion of actual B and C instances not predicted as A.
- For Class B: treat B as “positive” and A and C as “negative”.
- For Class C: treat C as “positive” and A and B as “negative”.
Then calculate specificity for each class using the binary formula.
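Under the one-vs-rest convention (each class in turn treated as “positive”), the per-class specificities can be computed directly from the confusion matrix. The `specificity_per_class` helper below is illustrative:

```python
def specificity_per_class(matrix, classes):
    """One-vs-rest specificity: for each class i, FP = off-diagonal entries in
    column i, TN = all instances of other classes not predicted as class i."""
    total = sum(sum(row) for row in matrix)
    result = {}
    for i, c in enumerate(classes):
        actual_pos = sum(matrix[i])  # row total = actual instances of class i
        fp = sum(matrix[r][i] for r in range(len(classes)) if r != i)
        actual_neg = total - actual_pos
        tn = actual_neg - fp
        result[c] = round(tn / (tn + fp), 2)
    return result

matrix = [[30, 5, 5], [4, 40, 6], [6, 8, 46]]  # rows: actual A, B, C
print(specificity_per_class(matrix, ["A", "B", "C"]))
# {'A': 0.91, 'B': 0.87, 'C': 0.88}
```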
Key Points
- Binary Classification: Specificity measures the accuracy in identifying actual negatives.
- Multiclass Classification: Specificity is calculated for each class by treating it as the “positive” class once, with all other classes pooled as “negative.”
- Importance: Specificity is crucial in scenarios where false positives carry significant consequences.
Specificity is often used alongside sensitivity (recall) to provide a comprehensive view of a model’s performance, especially in medical testing, where distinguishing between healthy and diseased individuals accurately is vital.
Precision vs Specificity
Precision and specificity are both metrics used in classification tasks, but they focus on different aspects of the model’s performance.
Precision
- Definition: Precision measures the accuracy of the positive predictions made by the model. It is the ratio of true positives (correct positive predictions) to the total number of positive predictions made by the model (both true positives and false positives).
- Formula: \[ \text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP) + False Positives (FP)}} \]
- Use: Precision is used when the cost of a false positive is high. In scenarios where it’s critical not to label a negative instance as positive, precision becomes a key metric.
- Example: In email spam detection, precision is important. If a spam filter has low precision, it means many legitimate emails are incorrectly marked as spam (false positives), which could lead to important emails being missed.
Specificity
- Definition: Specificity measures the ability of the model to correctly identify negatives. It is the ratio of true negatives (correct negative predictions) to the total number of actual negative instances (both true negatives and false positives).
- Formula: \[ \text{Specificity} = \frac{\text{True Negatives (TN)}}{\text{True Negatives (TN) + False Positives (FP)}} \]
- Use: Specificity is used when it is crucial to correctly identify negative cases, i.e., when mislabeling a negative instance as positive has serious consequences.
- Example: In a medical test for a rare but serious disease, specificity is crucial. A low specificity means many healthy individuals are incorrectly diagnosed with the disease (false positives), leading to unnecessary stress and potentially harmful treatments.
Key Differences
- Focus: Precision focuses on the proportion of correct positive predictions out of all positive predictions, while specificity focuses on correctly identifying negative cases.
- False Positives: Precision is affected by false positives in the context of positive predictions, whereas specificity is affected by false positives in the context of negative cases.
- Scenarios: Precision is key in scenarios where wrongly labeling negatives as positives is problematic, whereas specificity is key in scenarios where failing to identify true negatives is problematic.
Precision and specificity address different aspects of a model’s performance. Precision asks how many of the items labeled positive are actually positive, while specificity asks how many of the actual negatives are correctly rejected. Depending on the application and the consequences of different types of errors (false positives vs. false negatives), one may prioritize one metric over the other.
ROC Curve and AUC
The Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) are powerful tools used for evaluating the performance of classification models, particularly in binary classification. They can also be adapted for multiclass classification.
ROC Curve and AUC in Binary Classification
ROC Curve
- What It Represents: The ROC curve is a graphical representation that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied.
- Plot Components: The ROC curve plots the True Positive Rate (TPR, or Recall) against the False Positive Rate (FPR) at various threshold settings.
- True Positive Rate (TPR): TPR = TP / (TP + FN)
- False Positive Rate (FPR): FPR = FP / (FP + TN)
AUC (Area Under the ROC Curve)
- What It Represents: The AUC provides an aggregate measure of the model’s performance across all possible classification thresholds. It ranges from 0 to 1, with a higher AUC indicating better model performance.
- Interpretation:
- An AUC of 0.5 suggests no discriminative ability (equivalent to random guessing).
- An AUC of 1.0 suggests perfect classification.
Example
Consider a medical test for a disease: - By adjusting the threshold for what counts as a positive prediction, you generate different sets of TPR and FPR, plotting these on the ROC curve. - If the test is highly accurate, the ROC curve will bow towards the top left corner of the plot, indicating high TPR and low FPR. - If the AUC is close to 1, it suggests the test is highly effective at distinguishing between patients with and without the disease.
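To make the threshold sweep concrete, here is a minimal base-R sketch that tabulates TPR and FPR at every threshold and computes AUC with the trapezoidal rule; the scores and labels are made-up illustrative values:

```r
# Hypothetical predicted probabilities and true labels (illustrative only)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
labels <- c(1,   1,   0,   1,   1,    0,   0,   0)

# Sweep thresholds from high to low, recording TPR and FPR at each
ths <- sort(unique(scores), decreasing = TRUE)
tpr <- sapply(ths, function(t) sum(scores >= t & labels == 1) / sum(labels == 1))
fpr <- sapply(ths, function(t) sum(scores >= t & labels == 0) / sum(labels == 0))

# AUC by the trapezoidal rule, padding the curve with (0,0) and (1,1)
x <- c(0, fpr, 1); y <- c(0, tpr, 1)
auc <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
auc  # 0.875 for these values
```

In practice, packages such as pROC compute these curves directly, but the sketch shows that ROC/AUC is nothing more than TPR and FPR tabulated over all thresholds.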
ROC Curve and AUC in Multiclass Classification
In multiclass classification, the ROC curve and AUC are extended to handle multiple classes through a few different methods:
One-vs-Rest (OvR) Approach
- Method: Treat each class as a binary classification (the class versus all other classes).
- Calculation: Calculate the ROC curve and AUC for each class separately, then average the results.
- This could be a simple average (macro-average) or a weighted average based on the prevalence of each class.
One-vs-One (OvO) Approach
- Method: For N classes, construct ROC curves for each pair of classes.
- Calculation: Average the AUCs from these ROC curves.
Example
Imagine a classification problem with three classes (A, B, and C). For the OvR method: - Calculate the ROC curve and AUC treating A as the positive class and B+C as the negative class, then repeat for B and C. - Average these AUC scores to get a single performance measure.
Key Points
- Binary Classification: ROC and AUC provide a comprehensive measure of model performance across all thresholds.
- Multiclass Classification: Extended through OvR or OvO approaches to handle multiple classes.
- Usefulness: These metrics are particularly useful for evaluating and comparing models, especially when dealing with imbalanced datasets or when the costs of different types of errors vary significantly.
Log Loss
Log Loss, also known as logistic loss or cross-entropy loss, is a performance metric that measures the accuracy of a classifier. It’s a probability-based metric, offering a more nuanced view of model performance, especially when the outputs are probabilities.
Log Loss in Binary Classification
In binary classification, log loss measures the uncertainty of the probability estimates by penalizing false classifications.
Example
Consider a binary classification task with 3 samples, and the model outputs the following probabilities and actual labels:
- Sample 1: Predicted probability for class 1 = 0.9, Actual label = 1
- Sample 2: Predicted probability for class 1 = 0.3, Actual label = 0
- Sample 3: Predicted probability for class 1 = 0.6, Actual label = 1
The log loss would be calculated as:
\[ \text{Log Loss} = - \frac{1}{3} [(1 \cdot \log(0.9) + (1 - 1) \cdot \log(1 - 0.9)) + (0 \cdot \log(0.3) + (1 - 0) \cdot \log(1 - 0.3)) + (1 \cdot \log(0.6) + (1 - 1) \cdot \log(1 - 0.6))] \]
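Evaluating this expression by hand is error-prone; a minimal R sketch with the three samples from the example gives the numeric result:

```r
p <- c(0.9, 0.3, 0.6)   # predicted probability of class 1 per sample
y <- c(1,   0,   1)     # actual labels
logloss <- -mean(y * log(p) + (1 - y) * log(1 - p))
round(logloss, 4)  # 0.3243
```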
Log Loss in Multiclass Classification
In multiclass classification, log loss is extended to cover multiple classes.
Example
Consider a classification problem with 3 samples and 3 classes:
- Sample 1: Predicted probabilities = [0.7, 0.2, 0.1], Actual class = 1
- Sample 2: Predicted probabilities = [0.1, 0.8, 0.1], Actual class = 2
- Sample 3: Predicted probabilities = [0.2, 0.2, 0.6], Actual class = 3
The log loss is calculated for each class and then averaged.
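As a sketch, the multiclass log loss for this example reduces to averaging the negative log-probability the model assigned to each sample's true class:

```r
probs  <- rbind(c(0.7, 0.2, 0.1),   # predicted class probabilities per sample
                c(0.1, 0.8, 0.1),
                c(0.2, 0.2, 0.6))
actual <- c(1, 2, 3)                # true class of each sample

# Pick out the probability of the true class for each row, then average
logloss <- -mean(log(probs[cbind(1:nrow(probs), actual)]))
round(logloss, 4)  # 0.3635
```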
Key Points
- Binary Classification: Log loss provides a measure of how close the predicted probabilities are to the actual labels.
- Multiclass Classification: The concept is extended to multiple classes, summing over all classes for each sample.
- Interpretation: Lower log loss values indicate better model performance, with a log loss of 0 representing perfect predictions.
- Usefulness: Log loss is particularly useful when the output of the model is a probability, giving insight into the uncertainty of the predictions.
Sample Implementation in R
# Loading dataset using url
url1 <- "https://drive.google.com/uc?export=download&id=12gzLfZBMSl2d-sF63D-qZJbCEocUdTKK"
df.wine1 <- read.csv(file = url1, header = TRUE, sep = ";")
df.wine1 <- df.wine1[,c(1, 2, 3, 4, 12)]
library(scales)
# Performing min-max transformation on my dataset without outliers
quality <- df.wine1$quality
wine.1 <- df.wine1[,-5]
#str(wine.1)
normaliz <- function(colum){
normalized <- scales::rescale(colum, to=c(0,1))
}
for (x in colnames(wine.1)){
normalized.column <- normaliz(wine.1[, x])
# here the column is getting normalized above but
# we have to assign it to the data frame to make changes
wine.1[,x] <- normalized.column
}
wine.1$quality <- quality
summary(wine.1)
## fixed.acidity volatile.acidity citric.acid residual.sugar quality
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :3.000
## 1st Qu.:0.2404 1st Qu.:0.1275 1st Qu.:0.1627 1st Qu.:0.01687 1st Qu.:5.000
## Median :0.2885 Median :0.1765 Median :0.1928 Median :0.07055 Median :6.000
## Mean :0.2937 Mean :0.1944 Mean :0.2013 Mean :0.08883 Mean :5.878
## 3rd Qu.:0.3365 3rd Qu.:0.2353 3rd Qu.:0.2349 3rd Qu.:0.14264 3rd Qu.:6.000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :9.000
wine.1$quality <- as.factor(wine.1$quality)
# Splitting the dataset into training and testing subsets
set.seed(101)
N = nrow(wine.1)
split <- 0.7
rows <- sample(nrow(wine.1))
training_ind <- sample(1:N, size = round(N * split),replace = FALSE)
train <- wine.1[training_ind,]
test <- wine.1[-training_ind,]
train_labels <- train$quality
test_labels <- test$quality
length(train_labels)
## [1] 3429
dim(test)
## [1] 1469 5
dim(train)
## [1] 3429 5
library(class)
# kNN model
# The 5th column is the quality column
knn_model <- knn(train = train[,-5], test = test[,-5], cl= train_labels, k=3)
# Evaluating model
confusion_matrix <- table(knn_model, test_labels)
confusion_matrix
## test_labels
## knn_model 3 4 5 6 7 8 9
## 3 0 0 1 1 0 0 0
## 4 0 7 14 7 5 5 0
## 5 1 22 194 137 39 12 0
## 6 7 13 179 380 103 24 1
## 7 0 6 39 104 112 5 1
## 8 0 1 9 20 11 8 1
## 9 0 0 0 0 0 0 0
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
accuracy
## [1] 0.4771954
# Rows of the confusion matrix are the predictions, so precision for a
# class divides the diagonal entry by the row total (all predictions of
# that class); dividing by column totals would give recall instead.
precision.per.class <- numeric(nrow(confusion_matrix))
for (cls in 1:nrow(confusion_matrix)) {
precision.per.class[cls] <- confusion_matrix[cls, cls] /
sum(confusion_matrix[cls, ])
}
# classes that are never predicted produce 0/0 = NaN; exclude them
precision <- mean(precision.per.class, na.rm = TRUE)
Summary
In this lesson, we discussed various aspects of evaluating classification models in supervised machine learning, focusing particularly on accuracy, precision, recall, specificity, F1 score, ROC/AUC, and log loss, and on how to interpret the outcomes for these metrics in relation to each other.
---
title: "Evaluating Classification Models"
params:
  category: 3
  stacks: 0
  number: 212
  time: 40
  level: beginner
  tags: knn,precision,recall,specificity,F1-Score,F1,classification,accuracy
  description: "Explains common evaluation metrics for classification models,
                including accuracy, precision, recall, sensitivity, F1 Score,
                among others. Shows how to calculate these metrics and how
                to interpret the results."
date: "<small>`r Sys.Date()`</small>"
author: "<small>Martin Schedlbauer</small>"
email: "m.schedlbauer@neu.edu"
affiliation: "Northeastern University"
output: 
  bookdown::html_document2:
    toc: true
    toc_float: true
    collapsed: false
    number_sections: false
    code_download: true
    theme: readable
    highlight: tango
---

---
title: "<small>`r params$category`.`r params$number`</small><br/><span style='color: #2E4053; font-size: 0.9em'>`r rmarkdown::metadata$title`</span>"
---

```{r code=xfun::read_utf8(paste0(here::here(),'/R/_insert2DB.R')), include = FALSE}
```

## Overview

Evaluating classification models is a crucial aspect of supervised machine learning. It helps us understand how well our models are performing and guides us in improving their accuracy and effectiveness. There are several metrics used for this purpose, each with its own strengths and context of use. Some of the most commonly used metrics include:

1.  **Accuracy**: This is the most intuitive performance measure and it is simply a ratio of correctly predicted observations to the total observations. It's best used when the class distribution is similar and the costs of false positives and false negatives are roughly the same.

2.  **Precision**: Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. It's important when the cost of false positives is high.

3.  **Recall (Sensitivity)**: Recall is the ratio of correctly predicted positive observations to all observations in actual class. It's used when the cost of false negatives is high.

4.  **F1 Score**: F1 Score is the weighted average of Precision and Recall. It takes both false positives and false negatives into account. It's useful when you want to balance precision and recall.

5.  **Confusion Matrix**: A confusion matrix is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known.

6.  **ROC Curve and AUC**: The ROC curve is a graphical representation of the trade-off between the true positive rate and false positive rate at various thresholds. AUC represents the degree or measure of separability achieved by the model.

7.  **Specificity**: Specificity measures the proportion of actual negatives that are correctly identified as such. It's important in contexts where the cost of false positives is high.

8.  **Log Loss**: Also known as logistic loss or logit loss, it measures the performance of a classification model where the prediction is a probability between 0 and 1.

Each of these metrics provides different insights into the performance of a classification model, and the choice of metrics often depends on the specific requirements and context of the problem being solved.

The labeling of the target variable is a decision made by the analyst, *i.e.*, which outcome is "positive" and which is "negative" is a business decision rather than a property of the data. For example, in disease detection, the presence of a disease is commonly marked as "positive", while its absence is marked as "negative". If we were to evaluate a model predicting whether a customer is likely to buy again, the "positive" outcome might best be "customer buys again".

When classification is between two classes ("positive" and "negative"), we call this binary classification. When there are more than two classes, it is a multiclass (or multivariate, or multilevel) classification. Some supervised machine learning algorithms are principally used for binary classification (*e.g.*, logistic regression), while others are inherently better suited for multiclass classification (*e.g.*, kNN, decision trees, and random forests).

## Model Evaluation

There are several methods for evaluating classification models. All methods involve, in some way, training a classification model with a supervised machine learning algorithm (*e.g.*, kNN or logistic regression) on a labeled training data set. The most widely used among the methods listed below are the holdout method and k-fold cross-validation (kCV).

A number of techniques are used for validating supervised machine learning models. Each method has its advantages and is suited to different scenarios. Some of the most commonly used methods include:

1.  **Holdout Method**:
    -   **Description**: The dataset is divided into two parts: the training set and the testing (or holdout) set. The model is trained using the training data and evaluated using the testing data set.
    -   **Advantages**: Simple to apply.
    -   **Disadvantages**: The model and the evaluation can be greatly influenced by the choice of training data.
2.  **K-Fold Cross-Validation**:
    -   **Description**: The dataset is divided into 'k' equally (or nearly equally) sized folds or subsets. The model is trained 'k' times, each time using a different fold as the testing set and the remaining 'k-1' folds as the training set.
    -   **Advantages**: It ensures that every observation from the original dataset has the chance of appearing in the training and test set. This is especially useful with smaller datasets.
    -   **Disadvantages**: It can be computationally intensive, especially for large datasets or complex models.
3.  **Stratified K-Fold Cross-Validation**:
    -   **Description**: Similar to K-Fold, but the folds are made by preserving the percentage of samples for each class. This is crucial in dealing with imbalanced datasets.
    -   **Advantages**: Maintains a balanced representation of the original dataset, particularly important for classification problems with imbalanced class distributions.
    -   **Disadvantages**: More complex to implement than simple K-Fold cross-validation.
4.  **Leave-One-Out Cross-Validation (LOOCV)**:
    -   **Description**: A special case of K-Fold cross-validation where 'k' equals the number of observations in the dataset. Essentially, each observation is used as a single test example, and the rest are used for training.
    -   **Advantages**: Maximizes the amount of data used for training the model.
    -   **Disadvantages**: Extremely computationally expensive and impractical with large datasets. It can also have high variance as a single observation can sometimes be a poor representation of the dataset.
5.  **Leave-P-Out Cross-Validation**:
    -   **Description**: Similar to LOOCV, but instead of leaving out one observation at a time, 'p' observations are left out.
    -   **Advantages**: Allows for more thorough testing than LOOCV in certain cases.
    -   **Disadvantages**: Computationally very intensive and less commonly used.
6.  **Bootstrap Method**:
    -   **Description**: Involves randomly sampling with replacement from the dataset to create multiple training datasets. The model is trained on these bootstrap samples and tested on the unseen instances.
    -   **Advantages**: Useful for estimating the distribution of a statistic (e.g., mean, variance) and provides a measure of uncertainty.
    -   **Disadvantages**: Can lead to overfitting if not implemented correctly, as it involves sampling with replacement.
7.  **Time Series Split**:
    -   **Description**: Specifically used for time series data. The dataset is split into a sequence of training and test sets, where each successive test set is 'moved forward in time'.
    -   **Advantages**: Respects the temporal order of observations, which is critical in time-series analysis.
    -   **Disadvantages**: Not applicable to non-time-series datasets and can be sensitive to the period chosen for training and testing.
8.  **Random Subsampling**:
    -   **Description**: Similar to the holdout method, but the process is repeated multiple times with different random splits of the dataset into training and test sets.
    -   **Advantages**: Simpler and less computationally intensive than K-Fold cross-validation.
    -   **Disadvantages**: Less comprehensive and can have high variance depending on the splits.

Each of these methods offers a different approach to assessing the performance of a machine learning model, and the choice of method can depend on the specific characteristics of the data and the practical constraints of model training and evaluation.

### Holdout Method

The holdout method is a simple and widely used technique for validating supervised machine learning models, particularly in classification tasks. This method involves splitting the dataset into two subsets: one for training the model and the other for testing its performance.

The process for evaluating a classification model obtained from a supervised machine learning algorithm generally follows these steps:

1.  **Splitting the Dataset**:
    -   The dataset is divided into two parts: the training set and the testing (or holdout) set.
    -   A common split ratio is 70% of the data for training and 30% for testing, but this can vary based on the dataset size and specific requirements.
2.  **Training the Model**:
    -   The model is trained exclusively on the training set. This set is used to fit the model parameters.
3.  **Testing the Model**:
    -   After training, the model is evaluated on the testing set. This set is not used during the training phase, so it provides an unbiased evaluation of the model.
    -   Performance metrics such as accuracy, precision, recall, F1 score, ROC-AUC, etc., are computed to assess the model's performance.
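The split-train-test cycle above can be sketched in a few lines of base R; the built-in `iris` data frame stands in for an arbitrary labeled dataset:

```r
set.seed(1)                               # reproducible split
n <- nrow(iris)                           # 150 observations
train_idx <- sample(n, size = round(0.7 * n))
train <- iris[train_idx, ]                # 70% for fitting the model
test  <- iris[-train_idx, ]               # 30% held out for evaluation
c(nrow(train), nrow(test))                # 105 45
```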

#### Advantages of the Holdout Method

-   **Simplicity**: It is straightforward and easy to implement.
-   **Speed**: Less computationally expensive compared to methods like cross-validation, especially for large datasets.

#### Disadvantages

-   **Data Split Dependence**: The performance estimate can be highly dependent on how the data is split. If the split is not representative of the overall dataset, it can lead to misleading performance estimates.
-   **Limited Data Utilization**: Since a portion of the data is set aside for testing, it's not used for training. This can be a drawback, especially with small datasets.

#### Example Scenario

Imagine you have a dataset of 10,000 images to build a model that classifies images as either cats or dogs. Using the holdout method, you might:

-   Use 7,000 images to train the model.
-   Reserve the remaining 3,000 images to test the model.

After training, you evaluate the model's performance on the 3,000 test images. The accuracy, precision, recall, and other relevant metrics calculated from this test set give you an estimate of how well your model will perform on unseen data.

#### Best Practices

-   **Random Split**: Ensure that the split between the training and testing sets is random. This helps in making the split representative of the whole dataset.
-   **Stratified Split**: If the dataset is imbalanced (e.g., 90% cats and 10% dogs), use stratified sampling to maintain the same proportion in both training and testing sets.
-   **Iterative Approach**: For more robust validation, consider using the holdout method in combination with techniques like cross-validation, especially when dealing with smaller datasets.

The holdout method, despite its simplicity, can be a powerful tool in model validation, provided it's used correctly and the limitations are acknowledged.

### k-Fold Cross Validation

K-Fold Cross-Validation is a robust method for assessing the performance of machine learning models, particularly useful for its ability to provide a more reliable estimate of model performance on unseen data.

#### Description of K-Fold Cross-Validation

1.  **Splitting the Dataset**:
    -   The entire dataset is divided into 'k' equal (or nearly equal) sized subsets or 'folds'.
    -   Common choices for 'k' include 5 or 10, but the optimal number can depend on the size and specifics of the dataset.
2.  **Model Training and Validation Process**:
    -   The process is repeated 'k' times, with each of the 'k' folds used exactly once as the validation set.
    -   In each iteration, a different fold is treated as the validation set, and the remaining 'k-1' folds are combined to form the training set.
    -   The model is trained on the training set and evaluated on the validation set.
    -   After 'k' iterations, every data point has been used both for training and validation.
3.  **Aggregating Results**:
    -   The performance measure (e.g., accuracy, precision, recall) is calculated for each of the 'k' iterations.
    -   The final model performance is typically reported as the average of these 'k' performance measures.
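The three steps can be sketched in base R. The "model" below is a deliberately trivial stand-in (classify by the sign of `x`, on synthetic data) so that the fold mechanics, not the learner, are the point:

```r
set.seed(42)
# Toy data: numeric feature x, binary label y (synthetic, for illustration)
x <- rnorm(100)
y <- as.integer(x + rnorm(100, sd = 0.5) > 0)

k <- 5
folds <- sample(rep(1:k, length.out = length(y)))  # random fold assignment

acc <- numeric(k)
for (i in 1:k) {
  test <- folds == i                 # fold i is the validation set
  # stand-in "model": predict class 1 when x is positive
  pred <- as.integer(x[test] > 0)
  acc[i] <- mean(pred == y[test])
}
mean(acc)   # averaged accuracy across the k folds
```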

#### Advantages

1.  **Reduced Bias**: Since every observation gets to be in a test set exactly once and in a training set 'k-1' times, it reduces bias compared to methods like the holdout method.
2.  **Utilization of All Data**: It allows for both training and testing on all available data, maximizing the use of data, which is particularly beneficial for smaller datasets.
3.  **Robust Performance Estimate**: Provides a more accurate and robust estimate of model performance, as it averages the results from 'k' iterations.
4.  **Useful for Limited Data**: Ideal for scenarios with limited data, where it's essential to use the dataset efficiently.

#### Disadvantages

1.  **Computational Cost**: It can be computationally expensive, especially for large datasets and complex models, as it requires the model to be trained and evaluated 'k' times.
2.  **Time-Consuming**: The increased computational cost translates to longer training times, which can be a significant drawback in time-sensitive projects.
3.  **Variance in Performance**: The performance might still vary depending on how the data is split into folds, though less so than with the holdout method.
4.  **Choice of 'k'**: Selecting the appropriate value of 'k' can be challenging. A larger 'k' provides less bias towards overestimating the true expected error (as each test set is smaller), but the variance of the resulting estimate can be higher.

#### Example

In a dataset with 100 observations, using 10-fold cross-validation would involve: - Splitting the data into 10 folds of 10 observations each. - In each iteration, 9 folds (90 observations) are used for training, and 1 fold (10 observations) is used for validation. - After 10 iterations, the performance metric (e.g., accuracy) for each iteration is averaged to provide an overall performance estimate.

K-Fold Cross-Validation is widely used due to its balance of efficiency and effectiveness, particularly in scenarios where the available data is limited and one needs to get the most reliable performance estimate possible from the dataset.

## Accuracy

Accuracy is a fundamental metric in both binary and multiclass classification problems in supervised machine learning. It measures the proportion of correct predictions (both true positives and true negatives) made by the model out of all predictions.

### Binary Classification

In binary classification, there are only two classes (often labeled as positive and negative, or 1 and 0).

#### Formula for Accuracy

The formula for accuracy in binary classification is:

$$ \text{Accuracy} = \frac{\text{True Positives (TP) + True Negatives (TN)}}{\text{Total Number of Observations}} $$

Where: - **True Positives (TP)**: Correctly predicted positive observations. - **True Negatives (TN)**: Correctly predicted negative observations.

#### Example

Imagine a medical test for a disease where: - 90 people are correctly identified as having the disease (TP). - 900 people are correctly identified as not having the disease (TN). - 10 people are incorrectly identified as having the disease (FP). - 100 people are incorrectly identified as not having the disease (FN).

The accuracy of this test is calculated as:

$$ \text{Accuracy} = \frac{90 + 900}{90 + 900 + 10 + 100} = \frac{990}{1100} \approx 0.90 $$

This means the test correctly identifies the disease status 90% of the time.
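The arithmetic above is straightforward to verify in R:

```r
tp <- 90; tn <- 900; fp <- 10; fn <- 100
accuracy <- (tp + tn) / (tp + tn + fp + fn)
accuracy  # 0.9
```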

### Multiclass Classification

In multiclass classification, there are more than two classes.

#### Formula for Accuracy

The formula for accuracy in multiclass classification is similar to that in binary classification, but it considers all classes:

$$ \text{Accuracy} = \frac{\text{Sum of Correct Predictions across all classes}}{\text{Total Number of Observations}} $$

#### Example

Consider a classification problem with three classes (A, B, and C) and a dataset with the following results: - Class A: 30 correct predictions, 5 incorrect predictions. - Class B: 40 correct predictions, 15 incorrect predictions. - Class C: 50 correct predictions, 10 incorrect predictions.

The total number of observations is $30 + 5 + 40 + 15 + 50 + 10 = 150$.

The accuracy of the model is:

$$ \text{Accuracy} = \frac{30 + 40 + 50}{150} = \frac{120}{150} = 0.80 $$

This means the model correctly predicts the class 80% of the time.

### Key Points

-   **Binary Classification**: Accuracy is straightforward, focusing on *TP* and *TN* out of all observations.
-   **Multiclass Classification**: Accuracy considers correct predictions across all classes.
-   **Limitations**: Accuracy can be misleading in imbalanced datasets where one class significantly outweighs others. In such cases, other metrics like precision, recall, and *F1* score might provide a more nuanced understanding of the model's performance.

Accuracy is a good initial indicator of model performance but should be used alongside other metrics for a comprehensive evaluation, especially in cases of class imbalance or when the costs of different types of errors vary significantly.

## Precision

Precision is a commonly used metric in the evaluation of classification models, especially in scenarios where the cost of false positives (incorrectly predicting the positive class) is high. It gives us insight into the accuracy of the positive predictions made by the model.

### Definition of Precision

Precision is defined as the ratio of correctly predicted positive observations to the total predicted positive observations. In simpler terms, it answers the question: "Of all the instances the model labeled as positive, how many were actually positive?"

### Calculating Precision

The formula for precision is:

$$ \text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP) + False Positives (FP)}} $$

where,

-   **True Positives (TP)** are the instances correctly predicted as positive
-   **False Positives (FP)** are the instances incorrectly predicted as positive

### Interpretation

-   A precision of 1.0 means that every item labeled as positive is indeed positive (but says nothing about the items labeled as negative).
-   A lower precision indicates a high number of false positives among the labeled positives.

### Common Use

Precision is particularly important in fields where the cost of a false positive is high. For example, in email spam detection, a false positive (labeling a good email as spam) is more problematic than a false negative (failing to identify a spam email). Similarly, in medical testing, falsely diagnosing a healthy patient with a disease could be more critical than missing the disease in its early stages.

### Examples

#### Example I: Email Spam Filtering

**Context**: In email spam filtering systems, the goal is to identify and filter out spam emails while ensuring legitimate emails reach the user's inbox.

-   **True Positive (TP)**: A spam email correctly identified as spam.
-   **False Positive (FP)**: A legitimate email incorrectly identified as spam (this is particularly undesirable as it could lead to missing important emails).

**Precision in this Scenario**: - High precision means that most of the emails identified as spam are indeed spam, minimizing the risk of important emails being wrongly filtered out. - If a spam filter has a precision of 0.95, it means that 95% of the emails it marks as spam are actually spam, and only 5% are legitimate emails mistakenly identified as spam.

**Importance**: - In email filtering, users typically prefer to receive a few spam emails in their inbox rather than miss an important legitimate email. Therefore, maintaining high precision is crucial to avoid the inconvenience and potential loss caused by missing important emails.

#### Example II: Medical Diagnosis for a Serious Disease

**Context**: Consider a medical test designed to diagnose a serious, potentially life-threatening disease.

-   **True Positive (TP)**: Correctly identifying a patient with the disease.
-   **False Positive (FP)**: Incorrectly diagnosing a healthy person with the disease.

**Precision in this Scenario**:

-   High precision indicates that a high proportion of patients diagnosed with the disease actually have the disease.
-   For instance, if a test has a precision of 0.90, it implies that 90% of the diagnosed cases are true positives, whereas 10% are false positives.

**Importance**:

-   In medical diagnostics, especially for serious conditions, a false positive can lead to unnecessary stress, further invasive testing, and potentially harmful treatment for the patient. Therefore, having a high precision rate is crucial to minimize these risks.

In both examples, while high precision is desirable, it is also essential to balance it with other metrics like recall, especially in medical scenarios where missing a true case (high recall) can be as critical as avoiding false alarms (high precision). These examples highlight the importance of precision in contexts where the consequences of false positives are significant.

### Precision in Multiclass Classification

In a multiclass (or multivariate) classification scenario, where there are more than two possible outcomes, precision is calculated for each class separately and then can be averaged to obtain an overall precision. This process involves considering each class as the "positive" class (of interest) and all other classes as "negative" (not of interest) for the purpose of the calculation.

#### Steps to Calculate Precision in Multiclass Classification

1.  **Calculate Precision for Each Class**:
    -   For each class, calculate precision as: $$ \text{Precision}_{\text{class}} = \frac{\text{True Positives (TP)}_{\text{class}}}{\text{True Positives (TP)}_{\text{class}} + \text{False Positives (FP)}_{\text{class}}} $$
    -   Here, **TP** for a class is the number of times the class was correctly predicted, and **FP** is the number of times other classes were incorrectly predicted as this class.
2.  **Average the Precision Scores**:
    -   **Macro-average Precision**: Calculate the average of the precision scores for each class. This treats all classes equally, regardless of their frequency in the dataset. $$ \text{Macro-average Precision} = \frac{\sum \text{Precision}_{\text{class}}}{\text{Number of classes}} $$
    -   **Weighted-average Precision**: Calculate the average of precision scores for each class, weighted by the number of true instances for each class. This accounts for class imbalance. $$ \text{Weighted-average Precision} = \sum \left( \frac{\text{Number of true instances in class}}{\text{Total number of instances}} \times \text{Precision}_{\text{class}} \right) $$

#### Example Calculation

Imagine a classification problem with three classes: A, B, and C. Let's say we have the following counts:

-   Class A: $\text{TP}_A = 30, \text{FP}_A = 10$
-   Class B: $\text{TP}_B = 40, \text{FP}_B = 20$
-   Class C: $\text{TP}_C = 50, \text{FP}_C = 5$

The precision for each class would be:

-   Precision for Class A: $\text{Precision}_A = \frac{30}{30 + 10} = 0.75$
-   Precision for Class B: $\text{Precision}_B = \frac{40}{40 + 20} \approx 0.67$
-   Precision for Class C: $\text{Precision}_C = \frac{50}{50 + 5} \approx 0.91$

Then, the macro-average precision across all classes would be:

$$ \text{Macro-average Precision} = \frac{0.75 + 0.67 + 0.91}{3} \approx 0.78 $$

To compute the weighted-average precision, each per-class score would be weighted by the number of true instances in that class; with equally sized classes it coincides with the macro-average, while under class imbalance the two can differ noticeably.
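The macro-average above can be verified with a few lines of R, a sketch using only the TP and FP counts from this example:

```{r}
tp <- c(A = 30, B = 40, C = 50)
fp <- c(A = 10, B = 20, C = 5)

# Per-class precision: TP / (TP + FP)
precision.by.class <- tp / (tp + fp)

# Macro-average: unweighted mean across classes
macro.precision <- mean(precision.by.class)
macro.precision  # about 0.78
```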

Using these methods, you can account for the performance of a multiclass classification model in correctly identifying each class while considering the specific importance or frequency of each class.

## Recall

Recall, also known as sensitivity, is an important metric in classification problems, used to measure the proportion of actual positives that are correctly identified by the classification model.

### Recall in Binary Classification

In binary classification, there are two possible outcomes: positive and negative.

#### Formula for Recall

The formula for recall in binary classification is:

$$ \text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP) + False Negatives (FN)}} $$

-   **True Positives (TP)**: Correctly predicted positive observations.
-   **False Negatives (FN)**: Actual positives that the model incorrectly predicted as negative.

#### Example

Consider a medical test to identify a disease:

-   The test correctly identifies 80 patients with the disease (*TP*).
-   20 patients with the disease are missed by the test (*FN*).
-   The total number of actual patients with the disease is 100 (*80 TP + 20 FN*).

The recall of this test is:

$$ \text{Recall} = \frac{80}{80 + 20} = \frac{80}{100} = 0.80 $$

This means that the test correctly identifies 80% of the patients who actually have the disease.

### Recall in Multiclass Classification

In multiclass classification, there are more than two possible outcomes.

#### Formula for Recall

Recall for each class in a multiclass setting is calculated by considering each class as the positive class and the rest as negative, then averaging the results.

$$ \text{Recall}_{\text{class}} = \frac{\text{True Positives (TP)}_{\text{class}}}{\text{True Positives (TP)}_{\text{class}} + \text{False Negatives (FN)}_{\text{class}}} $$

You can then calculate either the macro-average or weighted-average recall:

-   **Macro-average Recall**: Calculate the average of recall values for each class.
-   **Weighted-average Recall**: Calculate the average of recall values for each class, weighted by the number of true instances in each class.

#### Example

Consider a classification problem with three classes (A, B, C) with the following results:

-   Class A: 30 TP, 5 FN.
-   Class B: 40 TP, 10 FN.
-   Class C: 50 TP, 20 FN.

Recall for each class would be:

-   Class A: $\text{Recall}_A = \frac{30}{30 + 5} \approx 0.86$
-   Class B: $\text{Recall}_B = \frac{40}{40 + 10} = 0.80$
-   Class C: $\text{Recall}_C = \frac{50}{50 + 20} \approx 0.71$

The macro-average recall across all classes would be the average of these three values, approximately 0.79.
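As a quick check, the per-class recalls and their macro-average can be computed in R from the TP and FN counts above:

```{r}
tp <- c(A = 30, B = 40, C = 50)
fn <- c(A = 5, B = 10, C = 20)

# Per-class recall: TP / (TP + FN)
recall.by.class <- tp / (tp + fn)

# Macro-average recall
macro.recall <- mean(recall.by.class)
macro.recall  # about 0.79
```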

### Key Points

-   **Binary Classification**: Recall measures the proportion of actual positives correctly identified.
-   **Multiclass Classification**: Recall is computed for each class individually and then averaged.
-   **Importance**: High recall indicates a lower number of false negatives. It is particularly important in scenarios like medical diagnosis, where missing an actual positive case (a disease) can be critical.

Recall is a valuable metric, especially when the consequences of false negatives are significant. However, it should be balanced with other metrics like precision and accuracy for a well-rounded model evaluation.

## F1 Score

The F1 score is a metric that combines precision and recall into a single number, providing a balanced measure of a model's accuracy, especially when dealing with imbalanced datasets.

### F1 Score in Binary Classification

In binary classification, where outcomes are labeled as positive or negative, the F1 score is particularly useful.

#### Formula for F1 Score

The formula for the F1 score in binary classification is the harmonic mean of precision and recall:

$$ \text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$

Where:

-   **Precision**: True Positives / (True Positives + False Positives)
-   **Recall**: True Positives / (True Positives + False Negatives)

#### Example

Consider a binary classification task:

-   Precision = 0.75 (75% of the predicted positives are correct)
-   Recall = 0.60 (60% of actual positives are correctly identified)

The F1 score would be:

$$ \text{F1 Score} = 2 \times \frac{0.75 \times 0.60}{0.75 + 0.60} \approx 0.67 $$
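In R, this calculation is a one-liner, using the precision and recall values from this example:

```{r}
precision <- 0.75
recall <- 0.60

# F1 is the harmonic mean of precision and recall
f1 <- 2 * precision * recall / (precision + recall)
f1  # about 0.67
```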

### F1 Score in Multiclass Classification

In multiclass classification, the F1 score needs to be calculated for each class and then averaged.

#### Steps to Calculate F1 Score

1.  **Calculate Precision and Recall for Each Class**: Treat each class as the positive class and calculate precision and recall.
2.  **Calculate F1 Score for Each Class**: Use the formula for each class individually.
3.  **Average the F1 Scores**: You can use either:
    -   **Macro-average**: Simply average the F1 scores of all classes.
    -   **Weighted-average**: Average the F1 scores, weighted by the number of true instances for each class.

#### Example

Let's say we have a classification problem with three classes (A, B, and C), and we calculated the following precision and recall for each class:

-   Class A: Precision = 0.80, Recall = 0.70
-   Class B: Precision = 0.60, Recall = 0.50
-   Class C: Precision = 0.90, Recall = 0.85

The F1 scores for each class would be:

-   Class A: $\text{F1}_A = 2 \times \frac{0.80 \times 0.70}{0.80 + 0.70} \approx 0.75$
-   Class B: $\text{F1}_B = 2 \times \frac{0.60 \times 0.50}{0.60 + 0.50} \approx 0.55$
-   Class C: $\text{F1}_C = 2 \times \frac{0.90 \times 0.85}{0.90 + 0.85} \approx 0.87$

The macro-average F1 score would be the average of these values, approximately 0.72.
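The per-class F1 scores and their macro-average can be checked in R, vectorized over the three classes:

```{r}
precision <- c(A = 0.80, B = 0.60, C = 0.90)
recall    <- c(A = 0.70, B = 0.50, C = 0.85)

# Per-class F1, then macro-average
f1.by.class <- 2 * precision * recall / (precision + recall)
macro.f1 <- mean(f1.by.class)
macro.f1  # about 0.72
```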

### Key Points

-   The F1 score provides a balance between precision and recall, being particularly useful in situations where there is an imbalance in the dataset or when false positives and false negatives carry different costs.
-   In binary classification, it's straightforward as it directly combines the model's precision and recall.
-   In multiclass classification, the F1 score is calculated for each class and then averaged, providing a comprehensive view of the model's performance across all classes.

The F1 score is a useful metric in many scenarios, as it accounts for both the precision and recall of the model, providing a more holistic view of its performance.

## Confusion Matrix

A confusion matrix is a tool used in supervised learning to visualize the performance of a classification model. It's particularly useful for understanding the types of errors a model is making.

### Confusion Matrix in Binary Classification

In binary classification, the confusion matrix is a 2x2 table that shows the number of true positives, true negatives, false positives, and false negatives.

#### Components of a Binary Confusion Matrix

1.  **True Positives (TP)**: Correctly predicted positive cases.
2.  **True Negatives (TN)**: Correctly predicted negative cases.
3.  **False Positives (FP)**: Incorrectly predicted positive cases (Type I error).
4.  **False Negatives (FN)**: Incorrectly predicted negative cases (Type II error).

#### Example

Imagine a medical test for a disease:

-   50 patients have the disease and the test correctly identifies 40 (TP = 40).
-   100 patients do not have the disease and the test correctly identifies 90 (TN = 90).
-   The test incorrectly identifies 10 healthy patients as having the disease (FP = 10).
-   The test fails to identify the disease in 10 patients who have it (FN = 10).

The confusion matrix would look like this:

|                     | Predicted Positive | Predicted Negative |
|---------------------|--------------------|--------------------|
| **Actual Positive** | 40 (TP)            | 10 (FN)            |
| **Actual Negative** | 10 (FP)            | 90 (TN)            |
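This matrix, and the metrics derived from it, can be reproduced in R; a sketch whose row/column layout matches the table above:

```{r}
# Rows are actual classes, columns are predicted classes
cm <- matrix(c(40, 10,
               10, 90),
             nrow = 2, byrow = TRUE,
             dimnames = list(Actual = c("Positive", "Negative"),
                             Predicted = c("Positive", "Negative")))

tp <- cm["Positive", "Positive"]
fn <- cm["Positive", "Negative"]
fp <- cm["Negative", "Positive"]
tn <- cm["Negative", "Negative"]

accuracy  <- (tp + tn) / sum(cm)   # 130 / 150
precision <- tp / (tp + fp)        # 40 / 50
recall    <- tp / (tp + fn)        # 40 / 50
c(accuracy = accuracy, precision = precision, recall = recall)
```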

### Confusion Matrix in Multiclass Classification

In multiclass classification, the confusion matrix is larger, with dimensions equal to the number of classes. Each row represents the instances in an actual class, and each column represents the instances in a predicted class.

#### Steps to Calculate a Multiclass Confusion Matrix

1.  **Determine the Number of Classes**: Suppose there are N classes.
2.  **Create an NxN Matrix**: Each cell (i, j) in the matrix represents the number of instances of class i (actual) predicted as class j.

#### Example

Consider a classification problem with three classes: A, B, and C. After applying the model on a dataset, we get the following results:

-   30 of Class A were correctly classified (TP for A), 5 were classified as B, and 5 as C.
-   4 of Class B were incorrectly classified as A, 40 were correctly classified (TP for B), and 6 as C.
-   6 of Class C were classified as A, 8 as B, and 46 were correctly classified (TP for C).

The confusion matrix would look like this:

|              | Predicted A | Predicted B | Predicted C |
|--------------|-------------|-------------|-------------|
| **Actual A** | 30          | 5           | 5           |
| **Actual B** | 4           | 40          | 6           |
| **Actual C** | 6           | 8           | 46          |
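Because the diagonal holds the correct predictions, per-class precision and recall fall out of this matrix directly; a sketch in R:

```{r}
# Multiclass confusion matrix: rows are actual, columns are predicted
cm <- matrix(c(30,  5,  5,
                4, 40,  6,
                6,  8, 46),
             nrow = 3, byrow = TRUE,
             dimnames = list(Actual = c("A", "B", "C"),
                             Predicted = c("A", "B", "C")))

precision.by.class <- diag(cm) / colSums(cm)  # TP / all predicted as the class
recall.by.class    <- diag(cm) / rowSums(cm)  # TP / all actually in the class
accuracy           <- sum(diag(cm)) / sum(cm)
```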

### Key Points

-   In binary classification, the confusion matrix is a simple 2x2 table, whereas, in multiclass classification, it expands to accommodate all classes.
-   The diagonal cells (top-left to bottom-right) represent the number of correct predictions (true positives for each class).
-   Off-diagonal cells show the distribution of errors, indicating which classes are being confused with others.

A confusion matrix provides a detailed breakdown of a model's performance and is especially useful for identifying whether a model is confusing two classes, which can be crucial for improving model accuracy.

## Specificity

Specificity is a metric used in classification tasks to measure the proportion of actual negatives that are correctly identified. It's particularly important in contexts where the cost of false positives is high.

### Specificity in Binary Classification

In binary classification, where there are only two classes (positive and negative), specificity is straightforward.

#### Formula for Specificity

The formula for specificity in binary classification is:

$$ \text{Specificity} = \frac{\text{True Negatives (TN)}}{\text{True Negatives (TN) + False Positives (FP)}} $$

-   **True Negatives (TN)**: Correctly predicted negative observations.
-   **False Positives (FP)**: Incorrectly predicted positive observations.

#### Example

Consider a medical test for a disease:

-   100 healthy individuals are tested, 90 of whom are correctly identified as not having the disease (TN = 90).
-   10 healthy individuals are incorrectly identified as having the disease (FP = 10).

The specificity of this test is:

$$ \text{Specificity} = \frac{90}{90 + 10} = \frac{90}{100} = 0.90 $$

This means the test correctly identifies 90% of healthy individuals.

### Specificity in Multiclass Classification

In multiclass classification, specificity is calculated for each class using a one-vs-rest approach: the class of interest is treated as "positive" and all other classes are grouped together as "negative." Specificity for that class then measures how well the model avoids predicting it for instances that actually belong to the other classes.

#### Steps to Calculate Specificity for Each Class

1.  **Treat Each Class as Positive Once**: For each class, count the false positives (instances of other classes predicted as this class) and the true negatives (instances of other classes not predicted as this class).
2.  **Calculate Specificity for Each Class**: Use the binary specificity formula for each class.
3.  **Average the Specificity Scores**: You can calculate an average or weighted-average specificity, similar to precision and recall.

#### Example

Consider a classification problem with three classes (A, B, and C). After applying the model, we have the following confusion matrix:

|              | Predicted A   | Predicted B   | Predicted C   |
|--------------|---------------|---------------|---------------|
| **Actual A** | 30 (TP for A) | 5 (FP for B)  | 5 (FP for C)  |
| **Actual B** | 4 (FP for A)  | 40 (TP for B) | 6 (FP for C)  |
| **Actual C** | 6 (FP for A)  | 8 (FP for B)  | 46 (TP for C) |

To calculate specificity for each class:

-   For Class A: treat A as "positive"; the false positives are the B and C instances predicted as A, and the true negatives are the B and C instances predicted as anything other than A.
-   For Class B: treat B as "positive" and the A and C instances as the "negatives."
-   For Class C: treat C as "positive" and the A and B instances as the "negatives."

Then calculate specificity for each class using the binary formula.
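The one-vs-rest specificities for this matrix can be computed in R; a sketch in which, for each class, the negatives are all instances of the other classes:

```{r}
cm <- matrix(c(30,  5,  5,
                4, 40,  6,
                6,  8, 46),
             nrow = 3, byrow = TRUE,
             dimnames = list(Actual = c("A", "B", "C"),
                             Predicted = c("A", "B", "C")))

# One-vs-rest specificity: TN / (TN + FP), where TN are instances of the
# other classes that were NOT predicted as the class of interest
specificity <- sapply(rownames(cm), function(k) {
  fp <- sum(cm[rownames(cm) != k, k])          # others predicted as k
  actual.neg <- sum(cm[rownames(cm) != k, ])   # all instances not in k
  (actual.neg - fp) / actual.neg
})
round(specificity, 2)
```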

### Key Points

-   **Binary Classification**: Specificity measures the accuracy in identifying actual negatives.
-   **Multiclass Classification**: Specificity is calculated for each class by treating it as the "positive" class once and measuring how well instances of the remaining classes are kept out of it.
-   **Importance**: Specificity is crucial in scenarios where false positives carry significant consequences.

Specificity is often used alongside sensitivity (recall) to provide a comprehensive view of a model's performance, especially in medical testing, where distinguishing between healthy and diseased individuals accurately is vital.

## Precision vs Specificity

Precision and specificity are both metrics used in classification tasks, but they focus on different aspects of the model's performance.

### Precision

-   **Definition**: Precision measures the accuracy of the positive predictions made by the model. It is the ratio of true positives (correct positive predictions) to the total number of positive predictions made by the model (both true positives and false positives).
-   **Formula**: $$ \text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP) + False Positives (FP)}} $$
-   **Use**: Precision is used when the cost of a false positive is high. In scenarios where it's critical not to label a negative instance as positive, precision becomes a key metric.
-   **Example**: In email spam detection, precision is important. If a spam filter has low precision, it means many legitimate emails are incorrectly marked as spam (false positives), which could lead to important emails being missed.

### Specificity

-   **Definition**: Specificity measures the ability of the model to correctly identify negatives. It is the ratio of true negatives (correct negative predictions) to the total number of actual negative instances (both true negatives and false positives).
-   **Formula**: $$ \text{Specificity} = \frac{\text{True Negatives (TN)}}{\text{True Negatives (TN) + False Positives (FP)}} $$
-   **Use**: Specificity is used when it is crucial to correctly identify negative cases. It's important in cases where incorrectly flagging a negative instance as positive has serious consequences.
-   **Example**: In a medical test for a rare but serious disease, specificity is crucial. A low specificity means many healthy individuals are incorrectly diagnosed with the disease (false positives), leading to unnecessary stress and potentially harmful treatments.

### Key Differences

-   **Focus**: Precision focuses on the proportion of correct positive predictions out of all positive predictions, while specificity focuses on correctly identifying negative cases.
-   **False Positives**: Precision is affected by false positives in the context of positive predictions, whereas specificity is affected by false positives in the context of negative cases.
-   **Scenarios**: Precision is key in scenarios where wrongly labeling negatives as positives is problematic, whereas specificity is key in scenarios where failing to identify true negatives is problematic.

Precision and specificity address different aspects of a model's performance. Precision is about how many selected items are relevant, while specificity is about how many actual negatives are correctly rejected. Depending on the application and the consequences of different types of errors (false positives vs. false negatives), one may prioritize one metric over the other.

## ROC Curve and AUC

The Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) are powerful tools used for evaluating the performance of classification models, particularly in binary classification. They can also be adapted for multiclass classification.

### ROC Curve and AUC in Binary Classification

#### ROC Curve

-   **What It Represents**: The ROC curve is a graphical representation that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied.
-   **Plot Components**: The ROC curve plots the True Positive Rate (TPR, or Recall) against the False Positive Rate (FPR) at various threshold settings.
    -   **True Positive Rate (TPR)**: TPR = TP / (TP + FN)
    -   **False Positive Rate (FPR)**: FPR = FP / (FP + TN)

#### AUC (Area Under the ROC Curve)

-   **What It Represents**: The AUC provides an aggregate measure of the model's performance across all possible classification thresholds. It ranges from 0 to 1, with a higher AUC indicating better model performance.
-   **Interpretation**:
    -   An AUC of 0.5 suggests no discriminative ability (equivalent to random guessing).
    -   An AUC of 1.0 suggests perfect classification.

#### Example

Consider a medical test for a disease:

-   By adjusting the threshold for what counts as a positive prediction, you generate different pairs of TPR and FPR, which are plotted as the ROC curve.
-   If the test is highly accurate, the ROC curve bows towards the top-left corner of the plot, indicating high TPR and low FPR.
-   If the AUC is close to 1, the test is highly effective at distinguishing between patients with and without the disease.
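The AUC equals the probability that a randomly chosen positive is scored higher than a randomly chosen negative, so it can be computed from ranks in base R without plotting the curve (packages such as pROC provide full ROC tooling; the labels and scores below are made up for illustration):

```{r}
# Rank-based AUC (equivalent to the Wilcoxon / Mann-Whitney statistic);
# `labels` are 0/1 and `scores` are the model's predicted probabilities
auc <- function(labels, scores) {
  r  <- rank(scores)
  n1 <- sum(labels == 1)
  n0 <- sum(labels == 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

labels <- c(0, 0, 1, 1)
scores <- c(0.10, 0.40, 0.35, 0.80)
auc(labels, scores)  # 0.75: one of the four positive-negative pairs is mis-ranked
```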

### ROC Curve and AUC in Multiclass Classification

In multiclass classification, the ROC curve and AUC are extended to handle multiple classes through a few different methods:

#### One-vs-Rest (OvR) Approach

-   **Method**: Treat each class as a binary classification (the class versus all other classes).
-   **Calculation**: Calculate the ROC curve and AUC for each class separately, then average the results.
    -   This could be a simple average (macro-average) or a weighted average based on the prevalence of each class.

#### One-vs-One (OvO) Approach

-   **Method**: For N classes, construct ROC curves for each pair of classes.
-   **Calculation**: Average the AUCs from these ROC curves.

#### Example

Imagine a classification problem with three classes (A, B, and C). For the OvR method:

-   Calculate the ROC curve and AUC treating A as the positive class and B + C as the negative class, then repeat for B and C.
-   Average these AUC scores to get a single performance measure.

### Key Points

-   **Binary Classification**: ROC and AUC provide a comprehensive measure of model performance across all thresholds.
-   **Multiclass Classification**: Extended through OvR or OvO approaches to handle multiple classes.
-   **Usefulness**: These metrics are particularly useful for evaluating and comparing models, especially when dealing with imbalanced datasets or when the costs of different types of errors vary significantly.

## Log Loss

Log Loss, also known as logistic loss or cross-entropy loss, is a performance metric that measures the accuracy of a classifier. It's a probability-based metric, offering a more nuanced view of model performance, especially when the outputs are probabilities.

### Log Loss in Binary Classification

In binary classification, log loss measures the uncertainty of the probability estimates by penalizing false classifications.

#### Formula for Log Loss

For binary classification, the formula for log loss is:

$$ \text{Log Loss} = - \frac{1}{N} \sum_{i=1}^{N} [y_i \cdot \log(p_i) + (1 - y_i) \cdot \log(1 - p_i)] $$

Where:

-   $N$ is the number of observations.
-   $y_i$ is the actual label (0 or 1).
-   $p_i$ is the predicted probability of the observation being in class 1.
-   $\log$ is the natural logarithm.

#### Example

Consider a binary classification task with 3 samples, and the model outputs the following probabilities and actual labels:

-   Sample 1: Predicted probability for class 1 = 0.9, Actual label = 1
-   Sample 2: Predicted probability for class 1 = 0.3, Actual label = 0
-   Sample 3: Predicted probability for class 1 = 0.6, Actual label = 1

The log loss would be calculated as:

$$ \text{Log Loss} = - \frac{1}{3} [(1 \cdot \log(0.9) + (1 - 1) \cdot \log(1 - 0.9)) + (0 \cdot \log(0.3) + (1 - 0) \cdot \log(1 - 0.3)) + (1 \cdot \log(0.6) + (1 - 1) \cdot \log(1 - 0.6))] $$
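Evaluating this expression in R, with the labels and probabilities of the three samples above:

```{r}
y <- c(1, 0, 1)          # actual labels
p <- c(0.9, 0.3, 0.6)    # predicted probabilities for class 1

# Binary log loss (natural logarithm)
log.loss <- -mean(y * log(p) + (1 - y) * log(1 - p))
log.loss  # about 0.32
```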

### Log Loss in Multiclass Classification

In multiclass classification, log loss is extended to cover multiple classes.

#### Formula for Multiclass Log Loss

The formula for multiclass log loss is:

$$ \text{Log Loss} = - \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \cdot \log(p_{ij}) $$

Where:

-   $N$ is the number of observations.
-   $M$ is the number of classes.
-   $y_{ij}$ is 1 if observation $i$ belongs to class $j$, and 0 otherwise.
-   $p_{ij}$ is the predicted probability of observation $i$ belonging to class $j$.

#### Example

Consider a classification problem with 3 samples and 3 classes:

-   Sample 1: Predicted probabilities = [0.7, 0.2, 0.1], Actual class = 1
-   Sample 2: Predicted probabilities = [0.1, 0.8, 0.1], Actual class = 2
-   Sample 3: Predicted probabilities = [0.2, 0.2, 0.6], Actual class = 3

For each sample, only the term for the true class is nonzero, so the log loss reduces to the average of $-\log(p_{ij})$ over the samples, where $p_{ij}$ is the probability the model assigned to that sample's true class.
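With one-hot labels, only the probability assigned to each sample's true class contributes, so the multiclass log loss for this example can be computed in R as:

```{r}
# Rows are samples, columns are classes; each row sums to 1
p <- rbind(c(0.7, 0.2, 0.1),
           c(0.1, 0.8, 0.1),
           c(0.2, 0.2, 0.6))
actual <- c(1, 2, 3)  # true class of each sample

# Pick out the probability of each sample's true class, then average
log.loss <- -mean(log(p[cbind(1:3, actual)]))
log.loss  # about 0.36
```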

### Key Points

-   **Binary Classification**: Log loss provides a measure of how close the predicted probabilities are to the actual labels.
-   **Multiclass Classification**: The concept is extended to multiple classes, summing over all classes for each sample.
-   **Interpretation**: Lower log loss values indicate better model performance, with a log loss of 0 representing perfect predictions.
-   **Usefulness**: Log loss is particularly useful when the output of the model is a probability, giving insight into the uncertainty of the predictions.

## Sample Implementation in R

```{r}
# Loading dataset using url

url1 <- "https://drive.google.com/uc?export=download&id=12gzLfZBMSl2d-sF63D-qZJbCEocUdTKK"
df.wine1 <- read.csv(file = url1, header = TRUE, sep = ";")
```

```{r}
# Keep the first four predictor columns and the quality column (column 12)
df.wine1 <- df.wine1[, c(1, 2, 3, 4, 12)]
```

```{r}
library(scales)
# Apply a min-max transformation to each predictor column

quality <- df.wine1$quality
wine.1 <- df.wine1[, -5]

normaliz <- function(colum) {
  scales::rescale(colum, to = c(0, 1))
}

for (x in colnames(wine.1)) {
  # rescale the column, then assign it back so the data frame is updated
  wine.1[, x] <- normaliz(wine.1[, x])
}

wine.1$quality <- quality
summary(wine.1)
```

```{r}
wine.1$quality <- as.factor(wine.1$quality)
# Split the dataset into training (70%) and testing (30%) subsets
set.seed(101)
N <- nrow(wine.1)
split <- 0.7

training_ind <- sample(1:N, size = round(N * split), replace = FALSE)
train <- wine.1[training_ind, ]
test <- wine.1[-training_ind, ]
train_labels <- train$quality
test_labels <- test$quality
length(train_labels)
dim(test)
dim(train)
```

```{r}
library(class)
# kNN model with k = 3; column 5 is the quality label, so it is
# excluded from the feature matrices
knn_model <- knn(train = train[, -5], test = test[, -5], cl = train_labels, k = 3)
```

```{r}
# Evaluating the model: rows of this table are predictions,
# columns are the actual labels
confusion_matrix <- table(knn_model, test_labels)
confusion_matrix
```

```{r}
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
accuracy
```

```{r}
# Per-class precision: rows of confusion_matrix are predictions, so
# precision for each class is the diagonal entry divided by the row sum
precision.per.class <- diag(confusion_matrix) / rowSums(confusion_matrix)

# Macro-average precision (NA occurs when a class is never predicted)
precision <- mean(precision.per.class, na.rm = TRUE)
precision
```

## Summary

In this lesson, we discussed various aspects of evaluating classification models in supervised machine learning, focusing particularly on accuracy, precision, recall, and F1 score and how to interpret the outcomes for these metrics in relation to each other.

------------------------------------------------------------------------

## Files & Resources

```{r zipFiles, echo=FALSE}
zipName = sprintf("LessonFiles-%s-%s.zip", 
                 params$category,
                 params$number)

textALink = paste0("All Files for Lesson ", 
               params$category,".",params$number)

# downloadFilesLink() is included from _insert2DB.R
knitr::raw_html(downloadFilesLink(".", zipName, textALink))
```

------------------------------------------------------------------------

## References

[https://medium.com/\@shrutisaxena0617/precision-vs-recall-386cf9f89488](https://medium.com/@shrutisaxena0617/precision-vs-recall-386cf9f89488){.uri}

## Errata

[Let us know](https://form.jotform.com/212187072784157){target="_blank"}.
