Objectives

Upon completion of this lesson, you will be able to:

  • calculate z-scores
  • identify outliers

Introduction

In statistics and machine learning, an outlier is a data point that is significantly different from other similar points. Outliers might occur due to natural variability in the data, or they could indicate an error in data collection, entry, or processing.

In a dataset, an outlier can cause serious problems in statistical analyses. For example, if you are trying to perform a regression analysis, a single outlier could potentially have a large influence on the predictive model, skewing results.

Outliers can be univariate (found when looking at a distribution of values in a single feature space) or multivariate (found in an n-dimensional space).

Outlier Identification Methods

There are various methods to detect outliers, including:

  1. Standard Deviation: If a data point is more than three standard deviations from the mean, it is considered an outlier.

  2. Interquartile Range (IQR): In a boxplot, any data point that falls below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR is considered an outlier (see the sketch after this list).

  3. Z-Score: The z-score is a measure of how many standard deviations an element is from the mean. A z-score that is large in magnitude (strongly positive or strongly negative) indicates a data point that is far from the mean.

  4. DBSCAN Clustering: Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is an unsupervised method that groups core samples (points in dense regions of the data) into clusters and labels non-core samples in sparse regions as noise; the noise points are candidate outliers.

  5. Isolation Forest: This is a machine learning method for anomaly detection. It isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.
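As a quick illustration of the IQR rule from method 2, here is a minimal R sketch; the vector x and its values are made up for illustration and are not part of the lesson's data set.

## flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] in a sample vector
x <- c(12, 15, 14, 10, 8, 12, 16, 13, 40)   # 40 is a deliberate outlier

q1  <- quantile(x, 0.25)   # first quartile
q3  <- quantile(x, 0.75)   # third quartile
iqr <- q3 - q1             # interquartile range

outliers <- x[x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr]
print(outliers)
## [1] 40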

Z-Score Method

The z-score method is one of the most commonly used approaches to outlier detection.

The standard deviation is a measure that is used to quantify the amount of variation or dispersion of a set of data values. A low standard deviation means that the data points tend to be close to the mean (also called the expected value) of the set, while a high standard deviation means that the data points are spread out over a wider range of values.
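To make this concrete, consider a small example in R (the two vectors are made-up illustrations); both have a mean of 10, but very different spreads:

sd(c(9, 10, 11))   # 1  -- the values cluster tightly around the mean
sd(c(0, 10, 20))   # 10 -- the values are spread over a wide range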

When we’re identifying outliers using the z-score, we typically use the rule of thumb that any data point that is more than three standard deviations away from the mean is considered an outlier. This rule is based on the empirical (or 68-95-99.7) rule in statistics, which states that for a normal distribution:

  • About 68% of the data falls within one standard deviation of the mean.
  • About 95% falls within two standard deviations.
  • About 99.7% falls within three standard deviations.

So, if a data point is more than three standard deviations from the mean, it is considered to be an extreme value and, in many cases, an outlier.

Here’s the step-by-step process:

  1. Calculate the mean (average) of the data set.
  2. Calculate the standard deviation.
  3. For each value x in the data set, calculate the z-score z = (x - mean) / sd, i.e., the number of standard deviations that value is from the mean.
  4. Any value that has a z-score greater than 3 or less than -3 is considered an outlier.
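For example, in a data set with a mean of 100 and a standard deviation of 15, a value of 150 has a z-score of (150 - 100) / 15 ≈ 3.33 and would therefore be flagged as an outlier.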

This method assumes the data follows a normal distribution, and it may not work as intended if this assumption is not met. For instance, the data might be skewed or have heavy tails, in which case other methods might be more appropriate.

The threshold of 3 is somewhat arbitrary and is used by convention, since three standard deviations from the mean encompasses about 99.7% of all data points in a normal distribution.

Of course, given that we calculate the mean and standard deviation from the data itself, any outliers that are present will inflate both statistics, which can make extreme values appear less extreme and mask some outliers.

Implementation in R

For each column, calculate the mean (µ) and the standard deviation (σ). Then, for each value xi in the column, calculate |xi - µ| / σ, i.e., the distance of the value from the mean in units of standard deviations. This value is the (absolute) z-score for xi. Any z-score above some threshold (generally 3) means that the value is far from the mean and the value is considered an outlier.

For illustration, we will only look for outliers in one column, but in practice all numeric columns must be evaluated.

Naturally, we must first deal with any missing values before we identify outliers and perform any other data shaping.

url <- "https://s3.us-east-2.amazonaws.com/artificium.us/datasets/CerealData.csv"
df <- read.csv(url,
               header = TRUE,
               stringsAsFactors = TRUE)

## identify potential outliers in column 'Calories'

## calculate mean and sd for 'Calories'
m.cal <- mean(df$Calories)
sd.cal <- sd(df$Calories)

## calculate z-score for column 'Calories'
z.Calories <- abs((df$Calories - m.cal) / sd.cal)

## find outliers: flag rows whose z-score exceeds the threshold
## (2.5 here for illustration; 3 is the conventional cutoff)
rows.outliers <- which(z.Calories > 2.5)

print(rows.outliers)
## [1]  3 42 48 49

With the conventional threshold of 3, no values in 'Calories' would be flagged as outliers. Lowering the threshold to 2.5, as in the code above, flags four outlying values (rows 3, 42, 48, and 49).

Handling Outliers

Once outliers have been identified, they can be handled in a number of ways, depending on the specific context and goals of the analysis. They can be removed, transformed, or imputed with an appropriate value; alternatively, statistical methods robust to outliers can be used.

For example, we can remove rows that contain outlier values, or we could assume that an outlier is a mis-measurement, treat it as missing data, and impute a value. Of course, whether we remove or impute changes the data set and affects the training of any machine learning model. Which approach to take is up to the data scientist and requires domain knowledge.
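Continuing the cereal example above, a minimal sketch of both options might look as follows; the choice of the median as the imputed value is an illustrative assumption, not a recommendation:

## option 1: remove the rows that contain outlier values
df.trimmed <- df[-rows.outliers, ]

## option 2: treat the outliers as missing and impute a value,
## here the median of the non-outlying values
df.imputed <- df
df.imputed$Calories[rows.outliers] <- median(df$Calories[-rows.outliers])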

Having many outliers often means that there are clusters in the data. In that case, it can be prudent to split the data into several subsets, analyze each separately, and build separate (and likely very different) machine learning models for each.

Tutorial

The tutorial below by Dr. Schedlbauer of Khoury Boston has a presentation and a code walk-through.

Extreme Value Theory

Extreme Value Theory (EVT) is a fascinating branch of statistics that delves into understanding the extreme deviations from the median of probability distributions. Its primary goal is to predict the occurrence of rare or extreme events, which hold significant interest across various fields such as finance, insurance, environmental science, and engineering. Knowing how likely extreme (or outlying) values are can be very useful.

In the context of statistics, EVT is concerned with the behavior of the maximum or minimum values within a dataset. It introduces specific distributions to model these extremes, including the Gumbel, Fréchet, and Weibull distributions. The Gumbel distribution is used for modeling the distribution of maximum values, the Fréchet distribution applies to heavy-tailed distributions, and the Weibull distribution is useful for analyzing minimum values, particularly in reliability and life data analysis.

Two key approaches are central to EVT: the Block Maxima Approach and the Peaks Over Threshold (POT) Approach. The Block Maxima Approach involves dividing data into blocks, such as years or months, and modeling the maximum value within each block. In contrast, the POT Approach focuses on data points exceeding a certain threshold, offering a more flexible model for extreme values.
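To make the two approaches concrete, here is a minimal base-R sketch on simulated data; the ten "years" of daily values, the yearly block size, and the 99th-percentile threshold are all illustrative assumptions:

set.seed(1)
x <- rnorm(365 * 10)          # ten simulated "years" of daily values

## Block Maxima: divide the data into blocks (here, years) and
## keep only the maximum of each block
blocks <- rep(1:10, each = 365)
block.maxima <- tapply(x, blocks, max)

## Peaks Over Threshold: keep only the observations that exceed
## a high threshold (here, the empirical 99th percentile)
u <- quantile(x, 0.99)
exceedances <- x[x > u] - u   # excesses over the threshold

## in practice, block.maxima would be fit with a generalized extreme
## value (GEV) distribution and exceedances with a generalized Pareto
## distribution, e.g., using an EVT package such as 'evd' or 'extRemes'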

When discussing outliers in the context of EVT, it is essential to differentiate between outliers and extreme values. Outliers are data points that significantly deviate from other observations, which may result from data variability or experimental errors. On the other hand, extreme values specifically refer to the tails of the distribution, and EVT provides the tools to distinguish true extreme values from outliers.

EVT is particularly beneficial in modeling and analyzing the tail behavior of a distribution, aiding in understanding the probability and impact of extreme values. This is crucial for risk management and assessing the likelihood of extreme events, which might otherwise be mistaken for outliers in traditional analysis. For instance, in finance, EVT is used to estimate the risk of extreme market movements, while in insurance, it helps calculate premiums for natural disasters or rare catastrophic events. In environmental science, EVT can predict extreme weather conditions or rare natural events.

One of the key advantages of EVT over traditional outlier detection methods is its specific focus on extremes rather than anomalies. This focus allows for more accurate predictions and assessments, accounting for the fact that extreme events may follow a different statistical distribution compared to the rest of the data.

In short, Extreme Value Theory is a vital statistical tool for analyzing and understanding the tail ends of distributions. It provides valuable insights into rare but significant events, helping to differentiate between true extreme values and outliers. By doing so, EVT ensures more precise risk assessment and decision-making across various fields, highlighting its critical role in modern statistical analysis.

Summary

Outlier management must be done prior to any other data shaping and before calculating summary statistics such as the mean or variance, because outliers influence those metrics. Many machine learning algorithms are sensitive to outliers, for example, kNN and regression.


Files & Resources

All Files for Lesson 3.203

Errata

None collected yet. Let us know.