Upon completion of this lesson, you will be able to:
Study the following lessons first, as their content is assumed in this lesson:
In this lesson, we will take a look at common parametric and non-parametric tests that determine whether differences in the means of a measurement between groups are statistically significant. We will start by examining parametric tests, which assume that the data are normally distributed and exhibit homogeneity of variance. Next, we will look at non-parametric tests that can be used when the assumptions of normality and homogeneity are not met.
In this section we will explore four fundamental statistical tests often applied in quantitative research:
These tests are fundamental for analyzing the relationships and differences in data, helping us make inferences about populations based on sample data. Each test has specific assumptions and conditions that dictate its appropriate application.
In the context of this lesson, parametric tests refer to statistical tests that make specific assumptions about the parameters of the population distribution from which the sample is drawn. Specifically, these tests often assume that the data follows a normal distribution or another specific distribution (like a binomial or Poisson distribution) with defined characteristics, such as known means or variances. This reliance on distributional parameters differentiates parametric tests from non-parametric tests, which make fewer assumptions about the population’s distribution.
For example, t-tests, z-tests and ANOVA are considered “parametric” because they assume:
The primary benefit of parametric tests lies in their statistical power and efficiency, especially when these assumptions hold, making them a preferred choice for analyzing differences in means, effects of factors, and relationships when assumptions are reasonably met.
Related lesson on detecting and removing outliers.
The t-test is a statistical test used to compare the means of two groups and determine if they are statistically different from each other. There are three main types of t-tests: the one-sample t-test, the independent two-sample t-test, and the paired sample t-test.
The assumptions underlying a t-test include:
Suppose we are examining whether there is a significant difference in test scores between two independent classes, A and B.
# Generating sample data
set.seed(123)
class_A <- rnorm(30, mean = 75, sd = 10)
class_B <- rnorm(30, mean = 80, sd = 12)
# Conducting a two-sample t-test
t_test_result <- t.test(class_A, class_B, var.equal = TRUE)
print(t_test_result)
Here, t.test() compares the means of class_A and class_B, assuming equal variances (var.equal = TRUE). The output provides the t-value, degrees of freedom, and p-value, which help determine if the difference in means is statistically significant.
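Note that the two classes above were simulated with different standard deviations (10 vs. 12), so the equal-variance assumption is debatable here. As a small aside, omitting var.equal = TRUE makes t.test() default to Welch's t-test, which does not assume equal variances:
# Welch's t-test (R's default): does not assume equal variances
t_test_welch <- t.test(class_A, class_B)
print(t_test_welch)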
The z-test is similar to the t-test in purpose but is applied when the sample size is large (typically >30), and the population variance is known or the sample variance is assumed to approximate the population variance closely. This test is often used to compare sample and population means or proportions.
Key assumptions for a z-test include:
- Normality: Either the sample size should be large enough (by the Central Limit Theorem) or the data should follow a normal distribution.
- Known Population Variance: In cases where the population standard deviation is unknown, the t-test is typically preferred.
In this example, suppose we have a large sample of weights and want to test if our sample mean significantly deviates from a population mean of 70 kg.
# Sample data
sample_data <- rnorm(100, mean = 72, sd = 10)
population_mean <- 70
sample_mean <- mean(sample_data)
sample_sd <- sd(sample_data)
n <- length(sample_data)
# Calculating z-score
z_score <- (sample_mean - population_mean) / (sample_sd / sqrt(n))
z_score
The resulting z-score can then be compared to a critical value from the z-distribution to determine significance.
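As a small extension of the example above, the two-tailed p-value can be computed directly from the standard normal distribution with pnorm(); comparing the absolute z-score to the critical value qnorm(0.975) (about 1.96 for a two-tailed test at alpha = 0.05) is equivalent:
# Two-tailed p-value for the z-score computed above
p_value <- 2 * pnorm(-abs(z_score))
p_value
# Equivalent decision rule: reject H0 if |z| exceeds the critical value
abs(z_score) > qnorm(0.975)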
Within-subject testing and between-subject testing are two experimental designs used to study the effects of factors on a dependent variable. These designs differ in how participants are assigned to levels of the factor(s) being studied and have implications for statistical analysis, sensitivity, and interpretability of results.
In within-subject testing (also known as repeated measures), each participant is exposed to all levels of the factor or factors being tested. This design is commonly used in experiments where we measure the same group of subjects under different conditions or at different time points. For example, if we are studying the effect of a drug on reaction time, a within-subject design would mean each participant’s reaction time is measured under both “drug” and “placebo” conditions.
The key benefits of within-subject testing include:
However, within-subject designs have potential downsides, such as:
In between-subject testing, different participants are assigned to different levels of the factor(s) being tested. Each participant experiences only one level of the factor. For instance, if studying the effect of a training program on test scores, a between-subject design would assign participants to either a “trained” group or a “control” group, but not both.
The advantages of between-subject testing include:
However, this design has some drawbacks:
The choice between within-subject and between-subject designs impacts both the statistical power and the interpretability of the results:
Statistical Power and Sample Size: Within-subject designs typically provide higher power with fewer participants because individual variability is controlled, while between-subject designs often require larger sample sizes to detect similar effects due to increased variability from participant differences.
Interpretability and Applicability: Results from within-subject designs are generally easier to interpret when looking at changes or effects over time within the same individuals (e.g., pre-test vs. post-test). However, between-subject designs may provide clearer insights when the primary interest is in comparing groups that represent distinct populations or interventions.
In sum, while within-subject designs are more sensitive and efficient for detecting changes within individuals, they may suffer from carryover effects and require corrections for sphericity. Between-subject designs avoid these issues but generally need larger sample sizes due to increased variability and lower statistical power. The choice depends on the study’s aims, the nature of the factors involved, and practical considerations regarding the sample and experimental setup.
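To make the power difference concrete, the following sketch (with hypothetical simulated reaction times) analyzes the same effect once as a within-subject (paired) comparison and once as if it came from a between-subject design; the paired analysis typically yields a smaller p-value because individual variability is removed:
# Simulated reaction times for 25 participants measured under two conditions
set.seed(42)
baseline  <- rnorm(25, mean = 300, sd = 40)
follow_up <- baseline - rnorm(25, mean = 15, sd = 20)  # same subjects, small improvement
# Within-subject analysis: paired t-test on the same participants
t.test(baseline, follow_up, paired = TRUE)
# Treating the same data as if it were between-subject: independent t-test (less sensitive)
t.test(baseline, follow_up)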
The ANOVA test is used when comparing means across multiple groups (more than two). ANOVA assesses whether there is significant variation between group means. While a t-test compares only two groups, ANOVA can handle multiple group comparisons simultaneously.
ANOVA’s assumptions include:
Consider we have three different training programs, and we wish to analyze if there are differences in test scores across these programs.
# Generating sample data
set.seed(123)
program_A <- rnorm(30, mean = 85, sd = 5)
program_B <- rnorm(30, mean = 88, sd = 5)
program_C <- rnorm(30, mean = 90, sd = 5)
scores <- c(program_A, program_B, program_C)
group <- factor(rep(c("A", "B", "C"), each = 30))
# Conducting ANOVA
anova_result <- aov(scores ~ group)
summary(anova_result)
Here, aov() performs the ANOVA test, comparing the means of the three programs. The summary output shows if any significant differences exist across the group means.
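A significant ANOVA F-test only tells us that at least one group mean differs; it does not say which one. As a brief follow-up, base R's TukeyHSD() can be applied to the fitted aov object to obtain pairwise comparisons with an adjustment for multiple testing (the Bonferroni Correction discussed later in this lesson is an alternative):
# Post-hoc pairwise comparisons using Tukey's Honest Significant Difference
TukeyHSD(anova_result)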
ANCOVA combines elements of ANOVA and regression by adjusting the dependent variable based on one or more covariates. ANCOVA is especially useful when we want to control for the effect of an additional variable (covariate) while comparing group means.
Assumptions for ANCOVA include:
- Normality: Residuals should be normally distributed.
- Homogeneity of Regression Slopes: The relationship between covariate and dependent variable should be similar across groups.
- Independence and Homogeneity of Variance: Similar to ANOVA, with variances equal across groups and independent observations.
Suppose we want to examine whether the training programs differ in their effect on test scores, adjusting for previous scores as a covariate.
# Generating sample data
set.seed(123)
previous_scores <- rnorm(90, mean = 80, sd = 5)
program_A <- rnorm(30, mean = 85, sd = 5)
program_B <- rnorm(30, mean = 88, sd = 5)
program_C <- rnorm(30, mean = 90, sd = 5)
scores <- c(program_A, program_B, program_C)
group <- factor(rep(c("A", "B", "C"), each = 30))
# Conducting ANCOVA
ancova_result <- aov(scores ~ group + previous_scores)
summary(ancova_result)
This example uses aov() to assess if scores differ across training programs, adjusting for previous_scores. The summary output allows us to evaluate both the main effects and the impact of the covariate.
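One of the ANCOVA assumptions listed above, homogeneity of regression slopes, can be checked informally by adding a group-by-covariate interaction to the model; a non-significant interaction term is consistent with parallel slopes across groups. A minimal sketch using the simulated data above:
# Informal check of the homogeneity-of-slopes assumption
slopes_check <- aov(scores ~ group * previous_scores)
summary(slopes_check)  # a non-significant group:previous_scores term supports parallel slopes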
In summary, these tests – t-test, z-test, ANOVA, and ANCOVA – offer powerful tools for comparing group means under varying circumstances and assumptions. Choosing the appropriate test and verifying assumptions ensure the robustness and validity of conclusions drawn from statistical analyses.
Mixed-factor ANOVA, also known as mixed-design ANOVA, is an extension of the ANOVA technique used when we have both within-subjects (repeated measures) and between-subjects factors in the same analysis. It’s particularly useful for studying complex designs where some factors are applied to all subjects (within-subjects) and other factors vary across different groups (between-subjects).
Mixed-factor ANOVA is ideal in situations where:
1. We want to analyze the effects of one or more factors that vary within subjects (e.g., time, condition).
2. We simultaneously want to analyze the effects of one or more factors that vary between subjects (e.g., gender, treatment group).
For example, a clinical study could have two groups of patients (treatment vs. control) who are measured at multiple time points. Here, treatment is a between-subjects factor (different groups) and time is a within-subjects factor (repeated measures across time). Mixed-factor ANOVA would allow us to evaluate the effect of treatment, the effect of time, and whether these factors interact.
Mixed-factor ANOVA can be challenging due to the following:
Suppose we want to test if a new treatment impacts patient recovery scores over three time points, with scores recorded for each patient. Here, treatment is a between-subjects factor (two groups: treatment vs. control), and time is a within-subjects factor (three repeated measurements).
# Generating sample data
set.seed(123)
patient <- factor(rep(1:20, each = 3)) # 20 patients with 3 repeated measures
time <- factor(rep(c("Time1", "Time2", "Time3"), times = 20), ordered = TRUE)
treatment <- factor(rep(c("Control", "Treatment"), each = 30)) # 10 patients per group
# Simulated recovery scores with some treatment effect and time effect
scores <- c(rnorm(10, mean = 50, sd = 5),
rnorm(10, mean = 55, sd = 5),
rnorm(10, mean = 53, sd = 5),
rnorm(10, mean = 60, sd = 5),
rnorm(10, mean = 55, sd = 5),
rnorm(10, mean = 65, sd = 5))
# Data frame setup
data <- data.frame(patient, time, treatment, scores)
# Performing Mixed-Factor ANOVA
anova_model <- aov(scores ~ treatment * time + Error(patient/time),
data = data)
summary(anova_model)
##
## Error: patient
## Df Sum Sq Mean Sq F value Pr(>F)
## treatment 1 1073.7 1073.7 22.27 0.000171 ***
## Residuals 18 867.7 48.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Error: patient:time
## Df Sum Sq Mean Sq F value Pr(>F)
## time 2 17.1 8.542 0.293 0.748
## treatment:time 2 25.9 12.931 0.444 0.645
## Residuals 36 1049.2 29.145
In this example, we specify treatment * time to capture both main effects and their interaction, and Error(patient/time) to account for repeated measures within subjects. The output gives us insight into:
Mixed-factor ANOVA is powerful but sensitive to several issues. The most notable is sphericity, a key assumption in repeated measures. Violations of sphericity inflate the likelihood of Type I errors, leading to false positives. R provides corrections (like Greenhouse-Geisser) for sphericity violations, but these adjustments can reduce power, leading to more conservative results.
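The Error()-based aov() output above does not report Mauchly's test or the corrected p-values. As a hedged sketch, assuming the afex package is installed (it is not used elsewhere in this lesson), the same model can be refit so that sphericity diagnostics and the Greenhouse-Geisser and Huynh-Feldt corrections are printed automatically:
# Sketch: mixed-design ANOVA with sphericity diagnostics via the 'afex' package
library(afex)
fit <- aov_ez(id = "patient", dv = "scores", data = data,
              between = "treatment", within = "time")
summary(fit)  # includes Mauchly's test and Greenhouse-Geisser / Huynh-Feldt corrections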
Additionally, interactions in mixed designs can be challenging to interpret, especially when there are significant effects across time and between groups. This requires careful examination and sometimes follow-up tests or plots to clarify. Lastly, missing data on within-subjects factors (e.g., if a patient misses one of the time points) can disrupt the analysis, as ANOVA requires balanced designs for reliable results. Missing data can decrease power and lead to biased estimates, complicating the conclusions drawn from the model.
Regression analysis is a versatile tool in statistics, and it can indeed serve as an alternative to t-tests and ANOVA for analyzing differences in means and the effects of factors on a dependent variable. The advantage of regression lies in its flexibility: it can analyze not only group differences but also interactions between factors, continuous predictors, and more complex relationships.
In a two-sample t-test, we test for differences between the means of two groups. This can be achieved in regression by setting up a binary predictor variable representing the two groups and fitting a simple linear regression model. The mathematics underlying this approach is straightforward.
In a regression model for two groups (say, Group A and Group B), we define the binary predictor variable \(X\) as:
\[ X = \begin{cases} 1 & \text{if in Group B} \\ 0 & \text{if in Group A} \end{cases} \]
The regression model becomes:
\[ Y = \beta_0 + \beta_1 X + \epsilon \]
Here, \(\beta_0\) represents the mean of Group A, and \(\beta_1\) captures the mean difference between Group B and Group A. The test for \(\beta_1 \neq 0\) (typically a t-test on the coefficient \(\beta_1\)) tests whether the means of the two groups differ significantly.
Suppose we have two groups of test scores, group_A and group_B.
# Generating sample data
set.seed(123)
group_A <- rnorm(30, mean = 75, sd = 10)
group_B <- rnorm(30, mean = 80, sd = 10)
scores <- c(group_A, group_B)
group <- factor(rep(c("A", "B"), each = 30))
# Creating a binary predictor for regression
binary_group <- as.numeric(group) - 1
# Regression model
model <- lm(scores ~ binary_group)
summary(model)
##
## Call:
## lm(formula = scores ~ binary_group)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.195 -5.636 -1.127 5.591 19.906
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.529 1.663 44.809 < 2e-16 ***
## binary_group 7.254 2.352 3.084 0.00312 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.11 on 58 degrees of freedom
## Multiple R-squared: 0.1409, Adjusted R-squared: 0.1261
## F-statistic: 9.512 on 1 and 58 DF, p-value: 0.003125
In this regression model, the intercept (\(\beta_0\)) represents the mean of group_A, and the coefficient for binary_group (\(\beta_1\)) represents the mean difference between group_B and group_A. A significant p-value for \(\beta_1\) indicates a significant difference in means, just as in a two-sample t-test.
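As a quick sanity check, the p-value for binary_group above (0.00312) matches what a pooled-variance two-sample t-test reports on the same data, which illustrates the equivalence:
# Equivalent pooled-variance two-sample t-test via the formula interface
t.test(scores ~ group, var.equal = TRUE)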
For an ANOVA with multiple groups, regression can again be applied using categorical variables. In this case, we can use dummy coding or factor levels in R to represent each group. For a dataset with \(k\) groups, we define \(k-1\) dummy variables to represent each group’s mean difference from a reference group.
In this context, the regression model is:
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_{k-1} X_{k-1} + \epsilon \]
where each \(X_j\) is a binary variable indicating membership in a group relative to the reference group (e.g., Group A). The coefficients \(\beta_1, \beta_2, \ldots, \beta_{k-1}\) measure the mean difference between each group and the reference group.
Consider three groups with data on a response variable, and we want to check if the means differ across the groups.
# Generating sample data for three groups
set.seed(123)
group_A <- rnorm(30, mean = 75, sd = 10)
group_B <- rnorm(30, mean = 80, sd = 10)
group_C <- rnorm(30, mean = 85, sd = 10)
scores <- c(group_A, group_B, group_C)
group <- factor(rep(c("A", "B", "C"), each = 30))
# Regression model using factor levels for ANOVA
model <- lm(scores ~ group)
summary(model)
In this model, R treats group A as the reference level by default. The intercept (\(\beta_0\)) is the mean of group_A, and the coefficients for groupB and groupC represent the differences in means between group_B and group_A, and between group_C and group_A, respectively. If these coefficients are significantly different from zero, it indicates that those groups differ from the reference group's mean, similar to what we assess with ANOVA.
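For completeness, the overall F-test from this regression can be compared with a one-way ANOVA on the same data; both approaches produce the same F statistic and p-value for the group factor:
# The regression F-test for 'group' matches the one-way ANOVA
anova(model)
summary(aov(scores ~ group))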
Regression can also be extended to include covariates, making it a robust alternative to ANCOVA. In ANCOVA, we are interested in comparing group means while controlling for one or more continuous variables. The regression model for ANCOVA includes both the factor variable and the covariate(s).
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_{k-1} X_{k-1} + \gamma Z + \epsilon \]
where \(Z\) represents the covariate. The coefficients for \(X_1, X_2, \ldots, X_{k-1}\) now represent the adjusted means for each group, controlling for \(Z\).
Suppose we have three groups, and we want to test for differences in scores across groups, adjusting for a covariate (e.g., a pre-test score).
# Generating sample data
set.seed(123)
pre_test <- rnorm(90, mean = 70, sd = 5)
group_A <- rnorm(30, mean = 75, sd = 10)
group_B <- rnorm(30, mean = 80, sd = 10)
group_C <- rnorm(30, mean = 85, sd = 10)
scores <- c(group_A, group_B, group_C)
group <- factor(rep(c("A", "B", "C"), each = 30))
# Regression model with a covariate
model <- lm(scores ~ group + pre_test)
summary(model)
In this ANCOVA model, the regression adjusts the group means for differences in pre_test scores. The coefficients for groupB and groupC now represent mean differences adjusted for the pre_test covariate, providing insights into group effects while controlling for additional variables.
In short, regression offers a flexible approach to testing for group differences and the effects of factors on a dependent variable. By using binary or dummy variables, we can replicate t-tests and ANOVA in a regression framework. When covariates are included, regression performs similarly to ANCOVA, adjusting for additional variables to isolate the effect of each factor. This approach not only broadens the analytical capabilities beyond traditional t-tests and ANOVA but also enables complex modeling with interactions, continuous predictors, and controls in a single unified framework.
Non-parametric tests offer robust alternatives to parametric tests like the t-test and z-test, especially useful when data does not meet the assumptions required by parametric tests, such as normality or equal variances. For the t-test, common non-parametric alternatives include the Mann-Whitney U test (or Wilcoxon rank-sum test) and the Wilcoxon signed-rank test. For the z-test, particularly when comparing distributions, the Kolmogorov-Smirnov test is frequently used. Below, we will explore these tests, their applicability, and how they can be implemented in R.
The Mann-Whitney U test, also known as the Wilcoxon rank-sum test, serves as a non-parametric alternative to the independent two-sample t-test. It assesses whether there is a significant difference between two independent groups by comparing the ranks of the data rather than their means. This test does not require the data to be normally distributed, making it useful when data deviates from normality or when sample sizes are small.
This test has the following assumptions:
- Independence: The two samples should be independent.
- Ordinal or Continuous Data: The test can handle ordinal or continuous data.
- Similar Distributions: While not strictly necessary, it is beneficial if the distributions of the two groups are similar in shape.
Suppose we are comparing satisfaction scores between two different products (Product A and Product B) and we do not want to assume that the scores are normally distributed.
# Generating sample data
set.seed(123)
product_A <- rnorm(30, mean = 75, sd = 10)
product_B <- rnorm(30, mean = 80, sd = 12)
# Conducting Mann-Whitney U test
wilcox_test_result <- wilcox.test(product_A, product_B)
print(wilcox_test_result)
Here, wilcox.test() compares product_A and product_B without assuming normality. The output provides the W statistic and p-value, indicating whether the ranks differ significantly.
The Wilcoxon signed-rank test is a non-parametric alternative to the paired t-test. It is used for testing whether there is a significant difference between paired or matched observations, comparing the median of differences to zero. Like the Mann-Whitney U test, it ranks data rather than relying on means.
Assumptions for this test include:
- Paired Observations: Each subject has two related observations (e.g., before and after a treatment).
- Ordinal or Continuous Data: The data should be ordinal or continuous, and differences between paired observations should ideally be symmetrically distributed.
Suppose we have a group of individuals measured on a skill level before and after a training program, and we wish to test if the training had a significant effect.
# Generating sample data
set.seed(123)
before_training <- rnorm(30, mean = 70, sd = 10)
after_training <- rnorm(30, mean = 75, sd = 10)
# Conducting Wilcoxon signed-rank test
wilcox_signed_rank_result <- wilcox.test(before_training, after_training, paired = TRUE)
print(wilcox_signed_rank_result)
By specifying paired = TRUE, wilcox.test() performs a signed-rank test, providing the V statistic (the sum of the positive signed ranks) and p-value for the paired observations.
The Kolmogorov-Smirnov (K-S) test is a non-parametric test that compares two samples to determine whether they come from the same distribution. While it can be used as an alternative to a z-test for comparing distributions, it has broader applicability, as it does not assume any specific distribution for the data. The K-S test is particularly useful when comparing the overall shape of two distributions, including both the location and scale differences.
Assumptions for this test are minimal:
- Ordinal or Continuous Data: Data should be ordinal or continuous.
- Independent Samples: Samples should be independent of each other.
Suppose we are comparing the distributions of two groups’ reaction times to a stimulus. The K-S test can help determine if these distributions are significantly different.
# Generating sample data
set.seed(123)
group_X <- rnorm(50, mean = 5, sd = 1)
group_Y <- rnorm(50, mean = 6, sd = 1)
# Conducting Kolmogorov-Smirnov test
ks_test_result <- ks.test(group_X, group_Y)
print(ks_test_result)
The ks.test() function compares group_X and group_Y, outputting the D statistic and p-value. A significant result indicates that the two distributions differ.
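The same function can also be used in one-sample form to compare a sample against a fully specified reference distribution, which illustrates the broader applicability mentioned above; a minimal sketch on the simulated data:
# One-sample K-S test: does group_X plausibly come from N(mean = 5, sd = 1)?
ks.test(group_X, "pnorm", mean = 5, sd = 1)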
The non-parametric tests introduced in this section are common alternatives to their parametric counterparts, especially useful when data do not meet the assumptions of normality or when dealing with ordinal data. The Mann-Whitney U test, Wilcoxon signed-rank test, and Kolmogorov-Smirnov test allow us to test for differences in distributions or medians without relying on restrictive assumptions, making them versatile tools in non-parametric statistics.
Non-parametric tests are indeed robust and can be used when the normality assumption is violated. However, there are limitations that make parametric tests preferable in situations where the assumptions of normality, homogeneity, and independence are met.
Below are some key reasons why parametric tests should be used when data is reasonably normally distributed and why non-parametric tests should be reserved for situations where the normality assumption is not met:
Greater Statistical Power: Parametric tests generally have higher statistical power than non-parametric tests when data is normally distributed. This means they are more likely to detect a true effect when one exists, given the same sample size. The power of a test is crucial because it influences the likelihood of finding significant results, so a more powerful test (parametric) is preferred when assumptions hold.
Interpretation of Results: Parametric tests, like the t-test and ANOVA, provide results in terms of differences in means, which can be easier to interpret and communicate in terms of real-world implications. For example, stating that “the mean difference in scores between two groups is 5 points” is often clearer and more interpretable than stating differences in median ranks, which is typical in non-parametric methods.
Efficiency with Smaller Sample Sizes: When dealing with small samples, parametric tests can be more efficient if assumptions are met because they leverage more information from the data distribution (e.g., mean and standard deviation in a t-test). Non-parametric tests may require larger sample sizes to achieve similar power levels.
Lower Computational Complexity: Non-parametric tests generally require ranking the data, which in turn requires sorting. The time complexity of ranking a data sample, a necessary step in most non-parametric tests like the Mann-Whitney U test or the Wilcoxon signed-rank test, depends on the sorting algorithm used. Ranking essentially involves sorting the data and then assigning ranks based on the sorted order. To rank a sample of size \(n\), the data must first be sorted. The best-known efficient sorting algorithms, like Merge Sort or Quick Sort, have an average time complexity of \(O(n \log n)\). After sorting, assigning ranks to each element in a sorted list has a linear time complexity, \(O(n)\). Together, the overall time complexity for ranking a data sample is therefore driven by the sorting step, resulting in \(O(n \log n)\). A minimal R illustration of this ranking step appears after this list.
Range of Applications: Parametric tests allow for a broader range of analytical options, such as analyzing interactions in factorial ANOVA or including covariates in ANCOVA. Regression analysis, which is also parametric, offers even more flexibility by allowing for continuous and categorical predictors, interactions, and various covariate adjustments. These options can be limited or absent in non-parametric frameworks.
Normal Distribution’s Descriptive Power: Many real-world phenomena tend to approximate normal distributions, especially with large samples (due to the Central Limit Theorem). This natural fit allows parametric tests to use data in a way that closely reflects underlying population characteristics, giving results that are not only statistically valid but also practical and accurate for modeling and inference.
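To make the ranking step concrete, here is a minimal base-R sketch (on arbitrary simulated data) showing how ranks follow from a single sort plus a linear assignment pass, and that the result matches base R's rank():
# Minimal illustration of ranking via sorting: order() sorts (O(n log n)),
# then a single linear pass assigns ranks (O(n))
set.seed(123)
x <- rnorm(10)
ord <- order(x)                 # indices that would sort x
ranks <- integer(length(x))
ranks[ord] <- seq_along(x)      # rank 1 goes to the smallest value, and so on
ranks
rank(x, ties.method = "first")  # base R's rank() gives the same result here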
In conclusion, while non-parametric tests are invaluable for non-normal data, parametric tests make full use of distributional properties, providing more powerful and interpretable results when normality and other assumptions hold. Hence, when sample data is normally distributed and parametric assumptions are reasonably met, parametric tests are usually the better choice.
Effect size is a quantitative measure of the strength or magnitude of a phenomenon, representing how large the difference or association is between groups or variables. Unlike p-values, which merely indicate whether an effect is statistically significant, effect size provides insight into the practical significance of the results. Effect size is essential in determining the real-world impact or relevance of findings, especially when dealing with large sample sizes where even small differences can yield statistically significant p-values.
There are several ways to measure effect size, depending on the type of analysis. For instance, Cohen’s d is commonly used in t-tests to measure the standardized difference between two means, where values of 0.2, 0.5, and 0.8 generally indicate small, medium, and large effects, respectively. Eta-squared (η²) and partial eta-squared are used in ANOVA to express the proportion of total variance explained by a factor. For correlations, Pearson’s r directly serves as an effect size measure. Let’s explore how to calculate some of these effect sizes in R.
Suppose we have two groups with scores and we want to calculate Cohen’s d to quantify the difference in means.
# Generating sample data
set.seed(123)
group_A <- rnorm(30, mean = 75, sd = 10)
group_B <- rnorm(30, mean = 80, sd = 10)
# Calculating Cohen's d
mean_diff <- mean(group_B) - mean(group_A)
pooled_sd <- sqrt((sd(group_A)^2 + sd(group_B)^2) / 2)
cohen_d <- mean_diff / pooled_sd
cohen_d
Here, cohen_d represents the standardized difference between groups A and B, indicating the effect size of the difference in means.
In an ANOVA setting, eta-squared can provide insight into how much of the total variance is explained by group differences.
# Generating sample data for three groups
set.seed(123)
group_A <- rnorm(30, mean = 75, sd = 10)
group_B <- rnorm(30, mean = 80, sd = 10)
group_C <- rnorm(30, mean = 85, sd = 10)
scores <- c(group_A, group_B, group_C)
group <- factor(rep(c("A", "B", "C"), each = 30))
# Performing ANOVA
anova_result <- aov(scores ~ group)
summary(anova_result)
# Calculating eta-squared
ss_total <- sum((scores - mean(scores))^2)
ss_between <- sum((tapply(scores, group, mean) - mean(scores))^2) * length(group_A)
eta_squared <- ss_between / ss_total
eta_squared
In this example, eta_squared provides the proportion of variance explained by group differences, giving us a sense of the effect size in terms of variance explained by the factor.
Effect size is fundamental because it allows researchers to understand the magnitude of their findings rather than just the presence of an effect. This distinction is valuable in research and practical applications, where understanding the actual impact is more informative than just knowing if a result is statistically significant. Effect sizes, therefore, contribute to a more nuanced interpretation of data, guiding decisions in fields where practical significance is just as important as statistical significance.
The Bonferroni Correction is a statistical adjustment method used to control for Type I errors (false positives) when performing multiple comparisons. Each time we test a hypothesis, there is a chance we could incorrectly reject the null hypothesis. When conducting multiple tests, the probability of making at least one Type I error increases. The Bonferroni Correction addresses this by adjusting the significance level for each individual test, thus reducing the likelihood of false positives across the set of tests.
The Bonferroni Correction ensures that the overall (family-wise) error rate remains at a desired level, typically 0.05. This means that even if multiple hypotheses are being tested simultaneously, the cumulative probability of making a Type I error across all tests is controlled. By lowering the significance level for each individual test, the Bonferroni Correction helps prevent the detection of false significant results.
The Bonferroni Correction is applied when multiple statistical tests are conducted on the same dataset, such as in post-hoc testing after ANOVA or in studies with multiple comparisons across groups. It is especially important in exploratory studies with a high number of tests, where the chance of encountering false positives is elevated. However, a drawback is that the Bonferroni Correction can be quite conservative, increasing the risk of Type II errors (false negatives), especially with a large number of comparisons.
The correction works by dividing the overall significance level (α) by the number of comparisons being made. For instance, if the desired family-wise significance level is 0.05 and we conduct 10 tests, each test must meet a significance level of \(\alpha/10 = 0.005\) to be considered statistically significant.
Mathematically, if we perform m tests and we want to control the family-wise error rate at α, we use an adjusted significance level for each individual test:
\[ \alpha_{\text{Bonferroni}} = \frac{\alpha}{m} \]
Suppose we have three groups, and we wish to conduct multiple t-tests to compare each pair of groups. Using the Bonferroni Correction, we adjust the p-value threshold for each comparison.
# Generate sample data for three groups
set.seed(1234)
group_A <- rnorm(30, mean = 75, sd = 10)
group_B <- rnorm(30, mean = 80, sd = 10)
group_C <- rnorm(30, mean = 85, sd = 10)
# Conduct pairwise t-tests
p_value_AB <- t.test(group_A, group_B)$p.value
p_value_AC <- t.test(group_A, group_C)$p.value
p_value_BC <- t.test(group_B, group_C)$p.value
# Number of comparisons
m <- 3
# Bonferroni-adjusted significance level
alpha <- 0.05
alpha_bonferroni <- alpha / m
# Results
cat("Bonferroni-adjusted significance level:", alpha_bonferroni, "\n")
## Bonferroni-adjusted significance level: 0.01666667
## p-value for A vs. B: 0.312944
## p-value for A vs. C: 2.027699e-07
## p-value for B vs. C: 1.350111e-05
## Is A vs. B significant after correction? FALSE
## Is A vs. C significant after correction? TRUE
## Is B vs. C significant after correction? TRUE
In this example, we calculate the Bonferroni-adjusted significance level (alpha_bonferroni) by dividing the desired significance level (0.05) by the number of comparisons (3). We then check if each pairwise comparison meets this stricter threshold. Only comparisons with p-values below the adjusted threshold are considered significant, which helps control the family-wise error rate across all tests.
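An equivalent and often more convenient approach is to keep the significance level at 0.05 and adjust the p-values themselves with base R's p.adjust(), which for the Bonferroni method multiplies each p-value by the number of tests (capped at 1):
# Adjust the p-values directly instead of lowering the significance level
p_values <- c(AB = p_value_AB, AC = p_value_AC, BC = p_value_BC)
p.adjust(p_values, method = "bonferroni")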
The Greenhouse-Geisser and Huynh-Feldt adjustments are methods used in repeated measures ANOVA to correct for violations of the sphericity assumption. Sphericity is the assumption that the variances of differences between all pairs of within-subject conditions are equal. Violating this assumption increases the likelihood of Type I errors, meaning there is an increased chance of finding a false positive result. These adjustments correct the degrees of freedom for the F-test in repeated measures ANOVA to maintain the integrity of the test when sphericity is not met.
In contrast, the Bonferroni correction is a general method for controlling the family-wise error rate when conducting multiple comparisons. It is typically used when comparing multiple groups or performing multiple statistical tests, adjusting the significance level for each test to reduce the chance of a Type I error across all tests.
Greenhouse-Geisser and Huynh-Feldt Adjustments: Use these adjustments specifically in repeated measures ANOVA when the sphericity assumption is violated. They adjust the F-test by reducing the degrees of freedom, thereby making the test more conservative. The Greenhouse-Geisser adjustment is more conservative and is often used when sphericity is severely violated, while Huynh-Feldt is less conservative and can be used when the sphericity violation is mild.
Bonferroni Correction: Use the Bonferroni correction when performing multiple comparisons or multiple hypothesis tests in any statistical setting to control for the family-wise error rate. This correction is applied to the significance threshold, dividing it by the number of tests.
Both methods control for Type I errors, but they are used in different contexts: Greenhouse-Geisser and Huynh-Feldt for repeated measures with sphericity issues, and Bonferroni for general multiple comparisons across tests.
This lesson provided an overview of key statistical tests, such as the t-test and ANOVA. It explained when to use non-parametric tests as alternatives to parametric tests, particularly when data does not meet the strict assumptions of normality or homogeneity of variance. Non-parametric alternatives like the Mann-Whitney U test, Wilcoxon signed-rank test, and Kolmogorov-Smirnov test offer robust options that rank data rather than relying on means, making them useful when sample data is not normally distributed. Examples in R demonstrated how to perform each of the most commonly used tests.
Additionally, we discussed why parametric tests are generally preferred when their assumptions hold, as they have higher statistical power, interpretability, and efficiency with smaller sample sizes. In particular, parametric tests make use of distributional properties of the data, providing a more powerful and nuanced analysis.
We also examined regression as an alternative framework for t-tests and ANOVA, illustrating how it can extend analytical capabilities, especially when controlling for covariates.
To address the problem of inflated Type I errors (false positives) when performing multiple post-hoc comparisons, we adjust the significance level using the Bonferroni Correction.
Lastly, we briefly reviewed algorithmic time complexity for ranking data in non-parametric tests, noting it as \(O(n \log n)\), mainly due to the sorting step.
Generative AI Assistants were used in the preparation of drafts for this lesson. In particular, we used ChatGPT 4o and Claude 3.5 Sonnet.