Objectives

Upon completion of this lesson, you will be able to:

  • calculate key statistical metrics
  • quantify correlations
  • determine the degree of normality of a data set
  • understand differences in means using t and z tests
  • apply a p-value to a t-test

Overview

Statistics is a branch of mathematics that deals with the collection, analysis, interpretation, presentation, and organization of numerical data. It involves the use of various techniques, tools, and methodologies to gather, summarize, and draw conclusions from data. The primary goal of statistics is to extract useful information from data, which can be used to make informed decisions and predictions. It plays a crucial role in many fields, including science, business, engineering, medicine, and social sciences. Some of the key topics in statistics include probability theory, descriptive statistics, inferential statistics, hypothesis testing, regression analysis, and experimental design.

This lesson is not a full course in statistics (obviously); it provides only a non-mathematical and (decidedly) superficial introduction to the topic. However, we hope that it provides sufficient scaffolding for understanding common statistical concepts and serves as a springboard to learn more.

Data Sets for Lesson

Data Concepts

Data used in statistical analysis, data mining, and machine learning is generally tabular in form and is most commonly stored in a dataframe in R and Python. R also supports a variant of the dataframe called a tibble, defined by the tidyverse package. This lesson focuses on R and uses Base R functions along with some commonly used packages, such as psych.

Most data is stored in either CSV files, XML files, or relational databases. The code fragment below loads a CSV file into a dataframe. The CSV is presumed to be in a subfolder called “data” within the project folder. See Lesson 6.202 – Working with R Projects on how to set up projects for working in R.

Of course, rather than loading data from a local file, we can also load data from a URL. In the call below we would substitute the URL for the file name.

df <- read.csv("data/GaltonsHeightData.csv",
               header = T)

Use the R function read.csv() to load a CSV file into a dataframe. The argument header = T is necessary if the first line in the CSV contains the column labels.
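
For example, a sketch of the same call reading from a web location (the URL below is a placeholder, not an actual file):

## substitute the actual URL of the CSV file
df <- read.csv("https://example.com/data/GaltonsHeightData.csv",
               header = T)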

For more information on how to load data from CSV and other file formats, consult Lesson 6.106 – Import Data into R from CSV, TSV, and Excel Files.

Each row is called an observation or instance and each column is generally referred to as a variable, dimension, or attribute. Values can be continuous (1.5, -6.2) or discrete (1, 12) numeric, categorical (a value from a fixed set, e.g., gender drawn from {m,f,b,o,u}), or textual (free-form, unstructured text without any consistent format). A categorical variable with two values such as T/F (or 0/1, or similar) is often called a binary or Boolean variable.

Categorical variables that can be ordered and ranked are ordinal variables. For example, gender {male,female,binary,other,unknown} cannot be ranked, but degrees {BS,MS,MD,PhD} can be (at least to some extent).

To display a portion of the dataframe’s rows, use the function head() or simply access specific rows and/or columns as demonstrated below. The second argument is the number of rows to display.

head(df, 4)
##   Family Father Mother Gender Height Kids
## 1      1   78.5     67      M   73.2    4
## 2      1   78.5     67      F   69.2    4
## 3      1   78.5     67      F   69.0    4
## 4      1   78.5     67      F   69.0    4

To display specific rows or columns, access the dataframe using indexed access. Note that the access is [rows,columns].

## specific rows and all columns
df[4:6,]
##   Family Father Mother Gender Height Kids
## 4      1   78.5   67.0      F   69.0    4
## 5      2   75.5   66.5      M   73.5    4
## 6      2   75.5   66.5      M   72.5    4
## specific columns
df[4:6,1:3]
##   Family Father Mother
## 4      1   78.5   67.0
## 5      2   75.5   66.5
## 6      2   75.5   66.5

The rows or columns do not have to be contiguous; the indexes can be a vector of indexes or ranges.

df[c(3,4,11:13),c(2:3,5)]
##    Father Mother Height
## 3    78.5     67   69.0
## 4    78.5     67   69.0
## 11   75.0     64   70.5
## 12   75.0     64   68.5
## 13   75.0     64   67.0

The columns can also be accessed by their column name.

df[1:4,c("Father","Mother","Height")]
##   Father Mother Height
## 1   78.5     67   73.2
## 2   78.5     67   69.2
## 3   78.5     67   69.0
## 4   78.5     67   69.0

An entire single column can also be accessed using the $ operator.

min.height <- min(df$Height)

New columns can be added by simply naming them. In the example below, we add a new column Height.cm that is the height of a child (column Height) in “cm” rather than “inches”. Note that an operation on a column (which is actually a vector in R) applies the operation to each element; in other programming languages such an operation would require a loop, but in R it can be done as a “vector operation”.

df$Height.cm <- df$Height * 2.54

For more information on how to work with dataframes and vectors, see Lesson 6.103 – Working with Vectors and Data Frames in R.

Exploring Structure

The functions summary() and str() are useful in helping to understand a dataframe’s structure and its variables. Other useful functions are also demonstrated in the code fragment below.

summary()

The function summary() provides key summary statistics for all variables across all observations.

summary(df)
##      Family          Father          Mother         Gender         
##  Min.   :  1.0   Min.   :62.00   Min.   :58.00   Length:898        
##  1st Qu.: 58.0   1st Qu.:68.00   1st Qu.:63.00   Class :character  
##  Median :105.0   Median :69.00   Median :64.00   Mode  :character  
##  Mean   :105.2   Mean   :69.23   Mean   :64.08                     
##  3rd Qu.:155.8   3rd Qu.:71.00   3rd Qu.:65.50                     
##  Max.   :205.0   Max.   :78.50   Max.   :70.50                     
##      Height           Kids          Height.cm    
##  Min.   :56.00   Min.   : 1.000   Min.   :142.2  
##  1st Qu.:64.00   1st Qu.: 4.000   1st Qu.:162.6  
##  Median :66.50   Median : 6.000   Median :168.9  
##  Mean   :66.76   Mean   : 6.136   Mean   :169.6  
##  3rd Qu.:69.70   3rd Qu.: 8.000   3rd Qu.:177.0  
##  Max.   :79.00   Max.   :15.000   Max.   :200.7

str()

The function str() shows the overall structure of the dataframe, including the number of rows (observations obs) and the number of columns (variables), along with each column’s name and data type.

str(df)
## 'data.frame':    898 obs. of  7 variables:
##  $ Family   : int  1 1 1 1 2 2 2 2 3 3 ...
##  $ Father   : num  78.5 78.5 78.5 78.5 75.5 75.5 75.5 75.5 75 75 ...
##  $ Mother   : num  67 67 67 67 66.5 66.5 66.5 66.5 64 64 ...
##  $ Gender   : chr  "M" "F" "F" "F" ...
##  $ Height   : num  73.2 69.2 69 69 73.5 72.5 65.5 65.5 71 68 ...
##  $ Kids     : int  4 4 4 4 4 4 4 4 2 2 ...
##  $ Height.cm: num  186 176 175 175 187 ...

Data Dimensions

These functions find the number of rows (nrow()) and the number of columns (ncol()) or both (dim()).

n <- nrow(df)
c <- ncol(df)
d <- dim(df)

To aggregate by group (similar to a SQL GROUP BY) we can use the function aggregate(). In the example below, we group by gender and then calculate the average (mean) height for each gender. The grouping variable must be categorical, while the aggregated variable must be numeric.

The “by” argument requires the grouping column to be wrapped in a list. The grouping variable can be named (gender in the example), and that name becomes the name of the grouping column in the result.

aggregate(df$Height, by = list(gender = df$Gender), FUN=mean)
##   gender        x
## 1      F 64.11016
## 2      M 69.22882

The result of an aggregation is a dataframe with two columns. It can be saved and used or processed later.

height.by.gender <- aggregate(df$Height, by = list(gender = df$Gender), FUN=mean)

Basic Measures

Range

The range of a variable is the difference between its maximum and minimum values and is calculated in R as follows:

height.min <- min(df$Height)
height.max <- max(df$Height)

height.range <- height.max - height.min

The code above calculates the range for the variable Height, which is 23.

Mean

The mean, i.e., the average of a set of values xi of a single variable X, is calculated with the formula:

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_{i} \]

It is the sum of the values divided by the number of values. Naturally, mean is only defined for numeric variables, and, to some extent, ordinal categorical variables.

The function mean() calculates the mean of a vector.

m <- mean(df$Height)

If the vector contains missing values (indicated in R with NA), then the mean is also NA. However, the function has a parameter na.rm that, if set to T or TRUE, ignores missing values.

m <- mean(df$Height, na.rm = T)

The mean is significantly affected by outliers. A single outlying value can skew the mean towards the outlier. Therefore, it is often advisable to remove outliers, calculate a trimmed mean, or divide the data into clusters and calculate the mean for each cluster.

Trimmed Mean

The trimmed mean removes a percentage of values from both ends of the ordered set of values. For example, the 10% trimmed mean removes the smallest 10% and the largest 10% of the values and takes the arithmetic mean of the remaining 80% of the values.

The example below calculates the 10% (0.1) trimmed mean using the mean() function. The parameter trim can be any value from 0 to 0.5.

m.trimmed <- mean(df$Height, trim = 0.1)

Median

The median is a measure of central tendency in statistics that represents the middle value of a set of data. It is the value that separates the data into two equal halves, such that half of the values are greater than the median and half are less than the median. To find the median, the data must first be sorted in either ascending or descending order. If there is an odd number of data points, the median is the middle value.

For example, for the data set {2, 4, 5, 6, 8}, the median is 5 because it is the middle value.

If there is an even number of data points, the median is the average of the two middle values. For example, for the data set {3, 6, 7, 9}, the two middle values are 6 and 7. To find the median, we add them together and divide by 2 to get 6.5. Therefore, the median is 6.5.

The median is a useful measure of central tendency because it is not affected by extreme values or outliers in the data, unlike the mean which can be heavily influenced by outliers. It is often used in situations where the data is not normally distributed or where the data contains extreme values.

Like the mean, median is only defined for numeric variables.

In R, we can calculate the median using the median() function. Like mean() it results in NA if any value in the provided vector is NA unless the argument na.rm = T is provided.

m <- median(df$Height)
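
As a quick check of the worked examples above:

median(c(2, 4, 5, 6, 8))   ## 5 (odd number of values)
median(c(3, 6, 7, 9))      ## 6.5 (mean of the two middle values)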

If the mean and the median are very close, it is an indication that outliers are unlikely. It is still recommended to investigate, either by finding values that are some distance from the mean or through visual inspection using a scatterplot.

Mode

The mode is a measure of central tendency in statistics that represents the most frequently occurring value in a vector. In other words, it is the value that appears most often. Unlike the median and mean, the mode is not affected by extreme values or outliers in the data.

To find the mode, we can simply count the number of times each value appears in the data set and identify the value with the highest count. For example, in the data set {2, 3, 3, 4, 5, 5, 5, 6, 6, 7}, the mode is 5 because it appears more frequently than any other value.

In some cases, a data set may have multiple modes. This can occur when two or more values appear with the same highest frequency. The mode also applies to categorical data; for example, in the data set {F, F, M, M, B, F, F, B}, “F” is the mode because it occurs most frequently.

The mode is a useful measure of central tendency for categorical or discrete data, such as nominal or ordinal data, where the data consists of categories or distinct values rather than a continuous range of values. It can also be used in conjunction with other measures of central tendency, such as the mean and median, to gain a more complete understanding of the data set.

There is no function in Base R to calculate the statistical mode, but we can define our own function, as shown below, or use one provided by a contributed package.

getmode <- function(v) {
   uniqv <- unique(v)   ## the distinct values in v
   ## count the occurrences of each distinct value and return the most frequent one
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

mode.gender <- getmode(df$Gender)

Note that there is a function mode() in Base R, but it returns the storage mode (data type) of an object rather than the statistical mode.

Normal Distribution

The normal distribution, also known as the Gaussian distribution, is a probability distribution that is commonly used in statistics, mathematics, and the natural and social sciences. It is a bell-shaped curve that is symmetrical around the mean and describes the distribution of many naturally occurring phenomena, such as heights, weights, test scores, and IQ scores.

The normal distribution has several important properties that make it a useful tool for statistical analysis. Firstly, it is characterized by two parameters, the mean and the standard deviation, which allow us to describe the central tendency and variability of the data. Secondly, it is a continuous distribution, meaning that it can take on any value within a certain range, rather than being limited to discrete values. Finally, the normal distribution has several statistical properties that make it easier to work with, such as the fact that the sum or average of many independent random variables with a normal distribution will also have a normal distribution.

It is used in hypothesis testing, statistical inference, and modeling of various phenomena, such as stock prices, environmental data, and the behavior of complex systems. It also plays a central role in the development of machine learning methods and regression analysis. In addition, many statistical techniques as well as machine learning algorithms, such as t-tests, confidence intervals, and linear regression, assume that the data follows a normal distribution.

Histograms

A histogram is a visual representation of the distribution of a dataset, showing where values are concentrated and where they are sparse. It displays the frequency of data points in intervals, with the x-axis showing the intervals and the y-axis showing the frequency. Histograms can be created from raw (ungrouped) data or from data already grouped into classes with defined boundaries. They are effective in analyzing the range and location of data and can reveal common shapes such as normal, skewed, and cliff distributions.

In contrast to a bar chart, a histogram has no gaps between the bars and uses bins to represent data in equal intervals. Histograms are used for continuous variables, while bar charts are used for nominal (categorical) data, and it is important to choose an appropriate bin width. In R, histograms can be created using the hist() function, which takes a vector of values to plot. The resulting histogram displays the range of continuous values on the x-axis and the frequency of data values on the y-axis using bars of varying heights.

hist(df$Height)

A histogram can be used to explain the normal distribution by showing the shape of the data and the location of the mean and standard deviation. The normal distribution, also known as the Gaussian distribution, is a bell-shaped curve that is symmetrical around the mean. The shape of the curve is determined by the mean and standard deviation of the data. In the above example, the data follows a somewhat normal distribution with few values at either extreme. Histograms can also be useful in determining whether the distribution is skewed.

In a histogram, the x-axis represents the range of data values, divided into intervals or bins, while the y-axis represents the frequency of the data points in each bin. If the data follows a normal distribution, the histogram will have a bell-shaped curve with a single peak at the mean. The curve will be symmetrical around the mean, with the same number of data points on either side.

Additionally, the spread of the data, as represented by the standard deviation, can also be seen in the histogram. A larger standard deviation will result in a wider, flatter curve, while a smaller standard deviation will result in a taller, narrower curve.

By analyzing the shape of the histogram and the location of the mean and standard deviation, we can determine if the data follows a normal distribution or not. If the data is not normally distributed, the histogram may show a skewed or asymmetrical shape, with data points clustered more heavily on one side of the mean than the other.

It is often useful to overlay the normal distribution curve (“Bell Curve”) as a visual aid in determining normality.

data <- df$Height

## histogram
hist.data <- hist(data, 
                  main = "Histogram of Height",
                  xlab = "Height",
                  ylab = "")

## calculate x and y values to use for normal curve
x_values <- seq(min(data), max(data), length = 150)
y_values <- dnorm(x_values, mean = mean(data), sd = sd(data)) 
y_values <- y_values * diff(hist.data$mids[1:2]) * length(data) 

## overlay normal curve on histogram
lines(x_values, y_values, lwd = 2, col="navy", lty="dashed")

Normality Testing

Normality testing is used to determine whether a data set follows a normal distribution. It is an important step in many statistical analyses, as many statistical methods and data mining algorithms, such as t-tests, ANOVA, and linear regression, assume that the data follows a normal distribution.

Normality testing involves comparing the observed distribution of the data to the expected distribution of a normal distribution. This is typically done by using a statistical test, such as the Shapiro-Wilk test or the Anderson-Darling test, which calculate a test statistic and p-value based on the sample data. The p-value indicates the probability that the sample data comes from a normal distribution. If the p-value is less than a predetermined significance level, commonly 0.05, then the data is considered to be significantly different from a normal distribution.

There are also graphical methods that can be used to assess normality, such as probability plots, Q-Q plots, and histograms. These methods can provide a visual representation of the data and help identify any deviations from a normal distribution.
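
For example, a Q-Q plot compares the quantiles of the sample against the quantiles of a theoretical normal distribution; if the points fall close to the reference line, the data is approximately normal. A minimal sketch using Base R:

## Q-Q plot of Height against a theoretical normal distribution
qqnorm(df$Height, main = "Normal Q-Q Plot of Height")
qqline(df$Height, col = "navy", lwd = 2)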

Normality testing is important because it allows researchers to determine if the data they are working with is appropriate for the statistical methods they intend to use. If the data does not follow a normal distribution, alternative statistical methods, such as non-parametric methods, may need to be used. Additionally, normality testing can help identify outliers, which are data points that are significantly different from the other data points and can have a large impact on the analysis.

Not using the correct methods can lead to incorrect inferences and when used in data mining and machine learning, it will lead to models that have low predictive power.

Shapiro-Wilk Test

The Shapiro-Wilk test is a statistical test that is used to determine whether a sample of data comes from a normal distribution. It is a commonly used normality test and is named after its developers, Samuel Shapiro and Martin Wilk.

The Shapiro-Wilk test works by calculating a test statistic W based on the sample data and comparing it to the expected distribution of W for a normal distribution. The test statistic W ranges from 0 to 1, with 1 indicating perfect agreement with a normal distribution. The p-value of the test indicates the probability that the sample data comes from a normal distribution. If the p-value is less than a predetermined significance level, such as 0.05, then the null hypothesis (that the data comes from a normal distribution) is rejected in favor of the alternative hypothesis (that the data does not come from a normal distribution).

It should be noted that the test can be affected by sample size and may not be appropriate for all types of data. Therefore, it is important to consider multiple normality tests and graphical methods in order to determine if the data follows a normal distribution.

To perform a Shapiro-Wilk Test in R, use the function shapiro.test().

r <- shapiro.test(df$Height)

print(r)
## 
##  Shapiro-Wilk normality test
## 
## data:  df$Height
## W = 0.98978, p-value = 6.713e-06

For the variable Height from our data set, the Shapiro-Wilk test showed that the data is not normally distributed (W = 0.99, p < 0.05) because, while the value of W is very close to 1, the p-value is less than 0.05.

The object r returned from the function shapiro.test() contains the value of W in r$statistic[[1]] and the p-value in r$p.value.

The sample size must be between 3 and 5000. If your data has more than 5000 values, consider extracting a sample of at most 5000 points using the sample() function. The example below illustrates how to extract a sample of size 30 from the vector df$Height without replacement (i.e., the same value cannot be chosen more than once).

s <- sample(x = df$Height, 
            size = 30,
            replace = F)

Dispersion

While the mean and median can help find the central value, they do not provide any insight into how the data is distributed around that center. There are two common measures of this dispersion: variance and standard deviation.

Variance

Variance is a measure of the spread or variability of a set of data points around their mean. It measures how far each value in the data set is from the mean, on average. A high variance indicates that the data points are widely spread out from the mean, while a low variance indicates that the data points are clustered closely around the mean.

Mathematically, variance is calculated as the average of the squared differences between each data point and the mean. The formula for variance is:

\[ var = \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_{i} - \mu)^2 \]

where n is the total number of data points, xi is each data point, and μ is the mean of x. It is common to use either a bar above the vector (\(\bar{x}\)) or the Greek letter μ as the symbol for the mean.

A large variance indicates a wide dispersion of the values around the mean.

Standard Deviation

Standard deviation (σ) is calculated as the square root of the variance.

\[ \sigma = \sqrt{\sigma^2} \]

where \(\sigma^2\) is the variance of the data set.

It is often used to describe the degree of variability or spread in a set of data and to compare different sets of data. A small standard deviation indicates that the data is closely grouped around the mean, while a large standard deviation indicates that the data is more spread out.

The function sd() is used in R to calculate the standard deviation. Squaring the value provides the variance.
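
For example (note that R’s sd() and var() use the sample formulas with n - 1 in the denominator rather than the population formulas shown above):

s <- sd(df$Height)          ## standard deviation
v <- s^2                    ## variance as the square of the standard deviation
v2 <- var(df$Height)        ## the same value, calculated directly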

Standard deviation is used to calculate the z-score of a value which is often used to mathematically identify outliers.

z-Score

A z-score (also sometimes called the standard score) is a measure of how many standard deviations a data point is away from the mean of a data set. It is used to standardize the values of different variables, making it easier to compare them.

Mathematically, the z-score of a data point x can be calculated as:

\[ z = \frac{x - \mu}{\sigma} \]

where μ is the mean of the data set and σ is the standard deviation of the data set.

Outliers are data points that are significantly far from the mean. A common approach is to define outliers as any data point whose absolute z-score is above some threshold; common thresholds are 2.5 or 3, but the setting of a threshold is subjective and depends on the dispersion of the data around the mean.

The code fragment below identifies outliers for the variable Height from the previously used data set.

## calculate mean and standard deviation
mu <- mean(df$Height)
sd <- sd(df$Height)

## calculate z-score for each height value
df$z <- (df$Height - mu) / sd

## identify any rows where  |z of height| > 3
outliers <- which(abs(df$z) > 3)

df.outliers <- df[outliers,c("Gender","Height")]

print(df.outliers)
##     Gender Height
## 126      M     78
## 289      M     79
## 673      F     56

In the code chunk above, the function which() returns the rows (index) where the column z has an absolute value greater than 3.

Rather than finding outliers through z-score analysis with some arbitrary threshold, we can often find the threshold by visually inspecting a histogram or a boxplot.

Confidence Interval

The confidence interval (CI) provides a range of values that likely contain the true population parameter, such as a population mean. The CI is built around a sample statistic (such as the sample mean) and has an associated confidence level that quantifies the level of confidence that the parameter lies within the interval.

A 95% confidence interval, for example, implies that if you were to take 100 different samples and compute a 95% confidence interval for each sample, then approximately 95 of the 100 confidence intervals will contain the true mean value (μ).

The width of the confidence interval gives some idea about how uncertain we are about the unknown parameter. A wide interval may indicate that more data should be collected before anything very definite can be said about the parameter, while a narrow interval gives a more precise estimate of the parameter.

It’s important to note that the confidence level refers to the frequency of possible confidence intervals that contain the true value, not a single interval. So, for a given sample, the observed 95% confidence interval either contains the true value or it does not; we cannot say that there is a 95% chance that a specific interval contains the true value.

It is important to remember that the confidence interval assumes your data is a random sample from the population. If your data does not meet the assumption of being a random sample, the confidence intervals may be invalid.

Confidence Interval for Mean

The 95% confidence interval for a mean is a range of values that you can be 95% certain contains the true mean of the population.

Here is how to calculate the confidence interval for the mean:

First, you need the standard deviation (σ) of your measurements. The standard deviation is a measure of how spread out your data points are around the mean.

You also need the number of measurements you’ve taken, which is your sample size (n).

Next, look up the z-score (also known as the standard score) for a 95% confidence interval. For a 95% confidence interval, the z-score is approximately 1.96. This is derived from the fact that in a normal distribution, 95% of the data lies within 1.96 standard deviations from the mean.

Now you can calculate the standard error (SE) of your measurements. The standard error is the standard deviation divided by the square root of the sample size:

\[ SE = \frac{\sigma}{\sqrt{n}} \]

Finally, calculate the 95% confidence interval for the mean (μ) using the following formulas:

\[ \mu \pm 1.96 \times SE \]

This gives you a range of values that you can be 95% certain contains the true mean of the population.

Please note, if the sample size is small (< 30) or the population is not normally distributed, you may want to use a t-score from a Student’s t-distribution instead of a z-score. The t-score takes into account the sample size and is more appropriate in these cases. For large sample sizes (usually considered as n > 30), t-scores and z-scores are essentially the same.
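
Putting these steps together, a minimal sketch of the 95% confidence interval for the mean of Height (using the z-score of 1.96, which assumes a sufficiently large sample):

x <- df$Height

n  <- length(x)
mu <- mean(x)                  ## sample mean
se <- sd(x) / sqrt(n)          ## standard error of the mean

ci.lower <- mu - 1.96 * se     ## lower bound of the 95% CI
ci.upper <- mu + 1.96 * se     ## upper bound of the 95% CI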

Common z-Scores

Z-scores are standardized scores that represent the number of standard deviations a data point is from the mean of a dataset. They’re based on the normal distribution (bell curve) where the mean is 0 and the standard deviation is 1.

The table below lists common z-scores.

Probability    z-Score
80%            1.28
90%            1.645
95%            1.96
99%            2.576

In R, you can use the qnorm() function to find the z-score associated with a given probability. The qnorm() function is the quantile function for the normal distribution.

Suppose you want to find the z-score associated with a probability of 0.975 (which corresponds to a 97.5th percentile and is commonly used for two-tailed tests for a 95% confidence interval), you would use:

qnorm(0.975)
## [1] 1.959964

This should output approximately 1.96, which is the z-score associated with a 97.5% probability.

Please note that qnorm() gives you the quantile function for a standard normal distribution (µ = 0, σ = 1). If your distribution has a different mean and standard deviation, you need to adjust the results accordingly.
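
For example, qnorm() accepts mean and sd arguments, so the quantile of a non-standard normal distribution can be obtained directly (the call below simply reuses the mean and standard deviation of Height and assumes Height is approximately normal):

## 97.5th percentile of Height, using Height's own mean and standard deviation
qnorm(0.975, mean = mean(df$Height), sd = sd(df$Height))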

Correlation

Correlation refers to the degree of association or relationship between two or more variables in a data set. It measures how strongly two variables are related and in what direction the relationship exists. Correlation can be positive (as one variable increases in value, so, generally, does the other), negative (as one variable increases in value, generally, the other decreases in value, i.e., they move in opposite directions), or zero. For example, there may be a positive correlation between the number of hours spent studying and the grades achieved in a class.

Negative correlation means that as one variable increases, the other variable decreases. This indicates an inverse relationship between the two variables. For example, there may be a negative correlation between the amount of exercise done and the weight of an individual.

Zero correlation means that there is no relationship between the two variables. The values of one variable do not affect the values of the other variable.

Correlation can be measured using statistical methods such as Pearson’s product-moment correlation coefficient, Spearman’s rank correlation coefficient rho, and Kendall’s tau correlation coefficient. These methods calculate a numerical value that indicates the strength and direction of the correlation between the variables. The correlation coefficient ranges from -1 to +1, where -1 indicates a perfect negative correlation, +1 indicates a perfect positive correlation, and 0 indicates no correlation.

It is important to remember that correlation does not imply causation, and other factors may be involved in the relationship between the variables.

Pearson Moment

The Pearson correlation coefficient r, also known as the Pearson product-moment correlation coefficient, is a statistical measure that quantifies the degree of linear relationship between two continuous variables. It is the most widely used correlation coefficient and is named after its developer, Karl Pearson.

The Pearson correlation coefficient is calculated as the covariance between two variables divided by the product of their standard deviations. It presumes a normal distribution of both data vectors.

The Pearson correlation coefficient r ranges from -1 to +1, with -1 indicating a perfect negative linear relationship, +1 indicating a perfect positive linear relationship, and 0 indicating no linear relationship between the two variables. A coefficient value of 0.5, for example, indicates a moderate positive relationship, while a value of -0.7 indicates a strong negative relationship.

The Pearson correlation can help data analysts understand the strength and direction of the relationship between two variables, make predictions about future outcomes, and identify outliers or influential data points. However, it should be noted that the Pearson correlation coefficient only measures linear relationships and may not capture other types of relationships, such as non-linear relationships.

In R, we calculate the Pearson correlation coefficient with either cor() or cor.test(). The latter also provides a p-value. A p-value > 0.05 indicates that there is insufficient evidence that the correlation differs from zero.

cor(df$Father, df$Height)
## [1] 0.2753548

For the above example, the coefficient of correlation between the father’s and the child’s height is approximately 0.28. It is not close to 1, so we conclude that there is, at best, a weak positive correlation. Whether we say that there is a meaningful correlation is subjective. If we detect a correlation of sufficient strength, we can build a regression model to predict the value of the dependent variable based on the value of the independent variable.
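
The function cor.test() reports the same coefficient along with a p-value and a confidence interval; a short sketch:

r <- cor.test(df$Father, df$Height)

r$estimate      ## the correlation coefficient
r$p.value       ## p-value for the test that the correlation is zero
r$conf.int      ## confidence interval for the coefficient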

Spearman Rho

Spearman’s rank correlation coefficient \(\rho\) is a statistical measure that quantifies the degree of association between two variables based on their rank order, rather than their actual values. It is a non-parametric test and is used to assess the strength and direction of monotonic relationships between variables. It does not presume a normal distribution of the data vectors. However, using Spearman’s test when the data is normally distributed may result in overlooking a correlation that does actually exist (Type II error).

Unlike the Pearson correlation coefficient, which measures the degree of linear relationship between two variables, Spearman’s rank correlation coefficient is based on the ranks of the data values.

The value of the coefficient ranges from -1 to +1, with -1 indicating a perfect negative monotonic relationship, +1 indicating a perfect positive monotonic relationship, and 0 indicating no monotonic relationship between the two variables. A coefficient value of 0.5, for example, indicates a moderate positive monotonic relationship, while a value of -0.7 indicates a strong negative monotonic relationship.

Spearman’s rank correlation coefficient is useful when the relationship between two variables is not linear but is monotonic. It is also useful when the data is measured on an ordinal scale or when there are outliers or non-normal distributions in the data.

To calculate Spearman’s rank correlation coefficient in R, we can use the cor() function with the method parameter set to “spearman”:

cor(df$Father, df$Height, method="spearman")
## [1] 0.260718

Kendall’s Tau

Kendall’s \(\tau\) (tau) is a non-parametric measure of association between two variables that is based on the concordant and discordant pairs of observations. It is similar to Spearman’s rank correlation coefficient but is based on the number of pairs of observations that are in agreement or disagreement, rather than the actual ranks of the data.

Kendall’s tau is denoted by the symbol tau (τ) and ranges from -1 to +1, with -1 indicating a perfect negative association, +1 indicating a perfect positive association, and 0 indicating no association between the two variables. A coefficient value of 0.5, for example, indicates a moderate positive association, while a value of -0.7 indicates a strong negative association.

Concordant pairs are pairs of observations that have the same direction of relationship between the two variables, while discordant pairs are pairs of observations that have opposite directions of relationship. Tied pairs, or pairs of observations that have the same value for both variables, are also taken into account in the calculation of Kendall’s tau.

Kendall’s tau is useful when the data is measured on an ordinal scale or when the relationship between two variables is non-linear.

To calculate Kendall’s tau in R, we can use the cor() function with the method parameter set to “kendall”:

cor(df$Father, df$Height, method="kendall")
## [1] 0.1849368

Differences in Means

Differences in means refer to the comparison of the means of two or more groups or samples. It is a common statistical technique used to determine whether there is a statistically significant difference between the means of two or more groups.

To compare the means of two groups, a t-test is commonly used. The t-test compares the means of two groups and determines whether the difference between them is statistically significant or simply due to chance. The t-test assumes that the data is normally distributed and that the variances of the two groups are equal. If these assumptions are not met, alternative tests such as the Wilcoxon rank-sum (Mann-Whitney U) test may be used.

To compare the means of more than two groups, an analysis of variance (ANOVA) is commonly used. ANOVA compares the means of two or more groups and determines whether there is a statistically significant difference between them. If ANOVA indicates that there is a significant difference between the means of the groups, post-hoc tests may be conducted to determine which groups differ significantly from each other.

Differences in means can be used to answer research questions such as:

  • Are there differences in test scores between students who received a new teaching method and those who received the standard teaching method?
  • Is there a difference in blood pressure between patients who received a new medication and those who received a placebo?
  • Do employees in different departments of a company have different levels of job satisfaction?

By comparing the means of two or more groups, researchers can determine whether there is a statistically significant difference between them, and draw conclusions about the factors that may be responsible for the observed differences.

T-tests and z-tests are two common statistical tests used to test hypotheses about population means. The main difference between them is that t-tests are used when the sample size is small (commonly taken as below 30) or the population standard deviation is unknown, while z-tests are used when the sample size is large and the population standard deviation is known or can be reliably estimated. Z-tests are commonly used in quality control and industrial applications where large samples are available.

Both t-tests and z-tests involve calculating a test statistic and comparing it to a critical value based on the desired level of significance (alpha). If the test statistic falls in the rejection region, the null hypothesis is rejected in favor of the alternative hypothesis. The null hypothesis is always that there is no difference between the groups.

In summary, the choice between t-tests and z-tests depends on the sample size, the distribution of the data, and whether the population standard deviation is known or estimated.

The z-test is not directly supported in Base R.
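
Because Base R does not provide a z-test function, the sketch below shows one way to code a two-sample z-test; it assumes large samples and uses the sample standard deviations as estimates of the (unknown) population standard deviations.

## two-sample z-test (two-sided); assumes large samples
z.test2 <- function(x, y) {
  z <- (mean(x) - mean(y)) /
       sqrt(var(x) / length(x) + var(y) / length(y))
  p <- 2 * pnorm(-abs(z))      ## two-sided p-value
  list(z = z, p.value = p)
}

z.test2(df$Father, df$Mother)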

t-Test

To perform a t-test in R, use the function t.test(). The example below illustrates the use of the function.

t.test(df$Father, df$Mother)
## 
##  Welch Two Sample t-test
## 
## data:  df$Father and df$Mother
## t = 45.645, df = 1785.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  4.927221 5.369661
## sample estimates:
## mean of x mean of y 
##  69.23285  64.08441

We look for a p-value of less than 0.05 (or whichever alpha or significance level has been chosen by the analyst). In the above test, p is less than 0.05. Because the probability of observing such a difference under the null hypothesis (that there is no difference between the groups) is extremely small, we conclude that there is a statistically significant difference between the two.

The confidence interval indicates that the true difference in means is likely to be between 4.93 and 5.37.

Of course, whether that difference is meaningful is a separate question; for that we may want to calculate Cohen’s d (but that is beyond the scope of this lesson).

We can capture the result of the t-test in an object and access its members to include them in markdown or to perform further calculations. For example, t.result$p.value is the p-value, and t.result$conf.int[[1]] and t.result$conf.int[[2]] are the bounds of the confidence interval of the difference in means.

t.result <- t.test(df$Father, df$Mother)
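
For example, the components can then be used directly:

t.result$p.value           ## p-value of the test
t.result$conf.int[[1]]     ## lower bound of the confidence interval
t.result$conf.int[[2]]     ## upper bound of the confidence interval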

Wilcoxon Rank-Sum Test

If the data of one or both vectors is not normally distributed and cannot be normalized using a log or other transform, then use a Wilcoxon Rank-Sum Test (also known as the Mann-Whitney U Test). The code below shows how to run such a test in R. Again, we look for a p-value of less than 0.05.

wilcox.test(df$Father, df$Mother)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  df$Father and df$Mother
## W = 756716, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

Using a non-parametric test on normally distributed data reduces statistical power and can result in a Type II error: there is a difference in means, but the test fails to detect it, so we would fail to see a difference between treatment groups.

Summary

This lesson provides an overview of common statistical analysis techniques. Naturally, the list of techniques presented is not complete and is only meant to provide a foundation for further study and to do essential data analysis.


Data Files, Code Files, Resources

All Files for Lesson 55.151

Resources & References

Errata

Let us know.