Prequisites

This lesson presumes that you are familiar with the contents of:

Introduction

Data transformation is a fundamental aspect of data analysis, enabling the conversion of raw data into a structured format suitable for analytics. While there are many ways to transform and prepare data, the dplyr package – an integral part of the Tidyverse – offers a suite of functions designed to simplify this process. In this lesson, we explore key dplyr functions, including filter(), arrange(), select(), mutate(), and summarize().

What is the Tidyverse?

The “Tidyverse” is a collection of R packages designed to work seamlessly together, adhering to a consistent philosophy of “tidy data.” It provides tools for a wide range of data-related tasks, from importing and cleaning data to transforming, visualizing, and modeling it.

The Tidyverse was developed by Hadley Wickham (CTO at R Studio), a prominent statistician and software developer in the R community. Wickham is well-known for creating many influential R packages and shaping the modern landscape of data analysis in R. His work has revolutionized how analysts and researchers interact with data, emphasizing simplicity, consistency, and the concept of “tidy data.”

Key features of the Tidyverse include:

  • Tidy Data Paradigm: Data is organized so that each variable forms a column, each observation forms a row, and each type of observation unit forms a table.
  • Consistency: All packages in the Tidyverse use a consistent syntax and grammar, making them intuitive to learn and use.
  • Ease of Use: The Tidyverse simplifies complex operations, enabling analysts to focus more on insights rather than the mechanics of data handling.

The Tidyverse includes several key packages, including:

  • ggplot2: For data visualization.
  • dplyr: For data manipulation.
  • tidyr: For tidying and reshaping data.
  • readr: For reading data files.
  • tibble: For improved data frame functionality.
  • purrr: For functional programming with lists.

What is dplyr?

dplyr is a core package within the Tidyverse, specifically focused on data manipulation. It provides a grammar of data manipulation, making it easy to perform tasks such as filtering, sorting, summarizing, and transforming data.

Key operations in dplyr, often called the “verbs” of data manipulation, include:

  • filter(): Select rows based on conditions.
  • select(): Choose specific columns.
  • mutate(): Add or modify columns.
  • arrange(): Sort rows.
  • summarize(): Calculate summary statistics.
  • group_by(): Group data for grouped operations.

These functions are designed to be intuitive, readable, and chainable using the %>% operator (pipe), which allows you to express sequences of transformations in a clear and linear fashion. They present an alternative to using SQL for data filtering, summarizing, and selection.

The dplyr Package

dplyr is part of the Tidyverse, meaning it shares the same principles, syntax, and goals as other Tidyverse packages. It serves as the data manipulation backbone for many data analysis workflows in the Tidyverse. In this section, we will explore key functions in detail. To use the dplyr package, you must install it and then load it:

suppressPackageStartupMessages(library(dplyr))

The call to suppressPackageStartupMessages() is necessary to “suppress” all warnings and messages; if not done, those would end up in your knitted notebook.

The dplyr package uses a modified (and simplified) tabular structure called a tibble that replaces dataframes in Tidyverse.

To load all packages of the Tidyverse, you can simply load tidyverse.

library(tidyverse)

tibbles

Tibbles are a modernized re-imagination of data frames in R, introduced by the tibble package, which is part of the Tidyverse. The goal is to address some of the limitations and frustrations associated with base R data frames by providing a more “user-friendly” and consistent experience for data manipulation.

Tibbles provide some key benefits over dataframes, including:

  1. Simplified Printing
    Tibbles display only the first 10 rows and as many columns as fit the screen, avoiding overwhelming output for large datasets. For example:

    library(tibble)
    
    tibble_example <- tibble(
      x = 1:100,
      y = rnorm(100)
    )
    tibble_example
    ## # A tibble: 100 × 2
    ##        x       y
    ##    <int>   <dbl>
    ##  1     1  1.24  
    ##  2     2  0.0969
    ##  3     3 -0.403 
    ##  4     4  0.946 
    ##  5     5  0.323 
    ##  6     6  0.0385
    ##  7     7 -0.582 
    ##  8     8  0.416 
    ##  9     9  0.636 
    ## 10    10 -0.462 
    ## # ℹ 90 more rows

    Output will show:

    # A tibble: 100 × 2
          x       y
      <int>   <dbl>
    1     1  -0.276
    2     2   1.003
    3     3   0.735
    # … with 97 more rows
  2. Preservation of Data Types
    Unlike base R data frames, tibbles never automatically convert strings into factors or perform other unexpected type coercion. For example:

    df <- data.frame(name = "Alice", age = 25)
    class(df$name)  # Output: factor
    ## [1] "character"
    tb <- tibble(name = "Alice", age = 25)
    class(tb$name)  # Output: character
    ## [1] "character"
  3. Column Access by Name
    Tibbles enforce stricter rules for column access:

    • Accessing columns with $ or [[ requires exact matches to column names.
    • Partial matching of names is disallowed, reducing errors.
    tb <- tibble(name = "Alice", age = 25)
    tb$n      # Error: Unknown column 'n'
    ## Warning: Unknown or uninitialised column: `n`.
    ## NULL
  4. Support for Non-Syntactic Names
    Tibbles allow column names that aren’t syntactically valid in R, such as names containing spaces or special characters. These must be referenced using backticks:

    tb <- tibble(`student name` = "Alice", age = 25)
    tb$`student name`  # Access column
    ## [1] "Alice"
  5. Creation Flexibility
    Tibbles can be created using the tibble() function or the tribble() function for row-wise data entry. For example:

    # Using tibble()
    tb <- tibble(name = c("Alice", "Bob"), age = c(25, 30))
    
    # Using tribble()
    tb <- tribble(
      ~name, ~age,
      "Alice", 25,
      "Bob", 30
    )

Tibbles are especially useful when working with the Tidyverse because all Tidyverse packages are designed to work optimally with tibbles. They are recommended whenever you want a more predictable and consistent data structure for data manipulation and analysis.

All of the functions that expect a dataframe also work with tibbles.

Reading CSV Files

To load a CSV file into a tibble rather than a dataframe, use read_csv() instead of read.csv(). Using read_csv() over Base R’s read.csv() has several benefits (aside from returning a tibble).

  1. Better Defaults:
    • Strings are not automatically converted to factors.
    • Column types are automatically guessed and more accurately handled.
  2. Faster Performance:
    • read_csv() is (a bit) faster for large files.
  3. Tidyverse Integration:
    • Returns a tibble, which integrates seamlessly with other Tidyverse packages.
  4. Informative Output:
    • Prints metadata about the data during import (e.g., column types).

The function read_csv() is part of the readr package, so either install and load that package of the full tidyverse package set.

Consider a dataset students containing information on students’ names, majors, and GPA’s (from the CSV file students.csv):

library(readr)
students <- read_csv("students.csv")
## Rows: 7 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): name, major
## dbl (1): gpa
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The function read_csv() displays information about the structure of the CSV. If you wish to “suppress” this information, then use:

library(readr)
students <- read_csv("students.csv", show_col_types = FALSE)

Filtering Rows with filter()

The filter() function extracts rows from a data frame that meet specified logical conditions. To select students with a GPA of 3.5 or better, you can use the following dplyr pipe:

filtered_students <- students %>%
  filter(gpa >= 3.50)

This returns rows where the gpa column is greater than or equal to 3.50. To display the result, you can use the print() function:

print(filtered_students)
## # A tibble: 4 × 3
##   name              major   gpa
##   <chr>             <chr> <dbl>
## 1 Alice Jones       CS     3.82
## 2 Liu Chen          CS     3.9 
## 3 Karl Langenfelder Bus    3.68
## 4 Susan Myers       Arch   3.76

Arranging Rows with arrange()

The arrange() function orders rows based on the values of specified columns. To sort the students dataset by gpa in descending order:

arranged_students <- students %>%
  arrange(desc(gpa))

print(arranged_students)
## # A tibble: 7 × 3
##   name              major   gpa
##   <chr>             <chr> <dbl>
## 1 Liu Chen          CS     3.9 
## 2 Alice Jones       CS     3.82
## 3 Susan Myers       Arch   3.76
## 4 Karl Langenfelder Bus    3.68
## 5 Cameron Wu        IE     3.45
## 6 Sandeep Patel     Econ   3.4 
## 7 Gert Wilder       Math   2.84

This sorts the students from the highest to the lowest grade.

Selecting Columns with select()

The select() function allows for the extraction of specific columns from a data frame. To create a new dataset with only the name and gpa columns:

selected_students <- students %>%
  select(name, gpa)

print(selected_students)
## # A tibble: 7 × 2
##   name                gpa
##   <chr>             <dbl>
## 1 Alice Jones        3.82
## 2 Sandeep Patel      3.4 
## 3 Gert Wilder        2.84
## 4 Liu Chen           3.9 
## 5 Cameron Wu         3.45
## 6 Karl Langenfelder  3.68
## 7 Susan Myers        3.76

This results in a dataset containing only the name and gpa columns. From a relational perspective, this is a “projection” operation.

Adding or Modifying Columns with mutate()

The mutate() function is used to add new columns or modify existing ones. To add a column honors that categorizes students based on their GPA:

mutated_students <- students %>%
  mutate(grade_category = case_when(
    gpa >= 3.9 ~ "Excellent",
    gpa >= 3.5 ~ "Good",
    gpa >= 3.0 ~ "Passing",
    TRUE ~ "Below Passing"
  ))

print(mutated_students)
## # A tibble: 7 × 4
##   name              major   gpa grade_category
##   <chr>             <chr> <dbl> <chr>         
## 1 Alice Jones       CS     3.82 Good          
## 2 Sandeep Patel     Econ   3.4  Passing       
## 3 Gert Wilder       Math   2.84 Below Passing 
## 4 Liu Chen          CS     3.9  Excellent     
## 5 Cameron Wu        IE     3.45 Passing       
## 6 Karl Langenfelder Bus    3.68 Good          
## 7 Susan Myers       Arch   3.76 Good

This adds a honors column with values for various ranges of GPA.

Summarizing Data with summarize()

The summarize() function computes summary statistics for a data frame. For example, to calculate the mean (average) GPA of the students:

average_gpa <- students %>%
  summarize(mean_gpa = mean(gpa))

print(average_gpa)
## # A tibble: 1 × 1
##   mean_gpa
##      <dbl>
## 1     3.55

This computes the mean of the gpa column, providing the average GPA

Grouped Operations with group_by() and summarize()

For more complex analyses, group_by() in conjunction with summarize() enables grouped operations. To calculate the average GPA by major:

avg_by_major <- students %>%
  group_by(major) %>%
  summarize(avg_gpa = mean(gpa))

print(avg_by_major)
## # A tibble: 6 × 2
##   major avg_gpa
##   <chr>   <dbl>
## 1 Arch     3.76
## 2 Bus      3.68
## 3 CS       3.86
## 4 Econ     3.4 
## 5 IE       3.45
## 6 Math     2.84

Whenever there’s a “by” or “per” it means that we want to computer some “fact” for grouped data.

These examples demonstrate the versatility of dplyr in performing data transformations, facilitating efficient data manipulation for complex analyses using an alternative to SQL.

Summary

The Tidyverse and dplyr are integral parts of the modern data analysis ecosystem in R, designed to simplify and streamline data manipulation and visualization tasks.


Files & Resources

All Files for Lesson 6.107

References

  1. Wickham, H., & Grolemund, G. (2017). R for Data Science: Import, tidy, transform, visualize, and model data. O’Reilly Media. Retrieved from https://r4ds.hadley.nz

  2. Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1–23. https://doi.org/10.18637/jss.v059.i10

Portions of this lesson were created with the assistance of ChatGPT 4o.

Errata

Let us know.

---
title: "Data Manipulation with <b>dplyr</b>"
params:
  type : lesson
  category: 6
  number: 107
  time: 45
  level: beginner
  tags: "r,dplyr,filtering,summarizing,selecting"
  description: "Demonstrates the use of the <b>dplyr</b> package of the Tidyverse
                for data manipulation, summarization, and filtering."
date: "<small>`r Sys.Date()`</small>"
author: "<small>Martin Schedlbauer</small>"
email: "m.schedlbauer@neu.edu"
affilitation: "Northeastern University"
output: 
  bookdown::html_document2:
    toc: true
    toc_float: true
    collapsed: false
    number_sections: false
    code_download: true
    theme: spacelab
    highlight: tango
---

---
title: "<small>`r params$category`.`r params$number`</small><br/><span style='color: #2E4053; font-size: 0.9em'>`r rmarkdown::metadata$title`</span>"
---

```{r code=xfun::read_utf8(paste0(here::here(),'/R/_insert2DB.R')), include = FALSE}
```

## Prequisites

This lesson presumes that you are familiar with the contents of:

-   [6.104 -- Quick Guide to R For Programmers](http://artificium.us/lessons/06.r/l-6-104-r4progs/l-6-104.html)

## Introduction

Data transformation is a fundamental aspect of data analysis, enabling the conversion of raw data into a structured format suitable for analytics. While there are many ways to transform and prepare data, the `dplyr` package -- an integral part of the Tidyverse -- offers a suite of functions designed to simplify this process. In this lesson, we explore key `dplyr` functions, including `filter()`, `arrange()`, `select()`, `mutate()`, and `summarize()`.

### **What is the Tidyverse?**

The "Tidyverse" is a collection of R packages designed to work seamlessly together, adhering to a consistent philosophy of "tidy data." It provides tools for a wide range of data-related tasks, from importing and cleaning data to transforming, visualizing, and modeling it.

The Tidyverse was developed by Hadley Wickham (CTO at R Studio), a prominent statistician and software developer in the R community. Wickham is well-known for creating many influential R packages and shaping the modern landscape of data analysis in R. His work has revolutionized how analysts and researchers interact with data, emphasizing simplicity, consistency, and the concept of "tidy data."

Key features of the Tidyverse include:

-   **Tidy Data Paradigm:** Data is organized so that each variable forms a column, each observation forms a row, and each type of observation unit forms a table.
-   **Consistency:** All packages in the Tidyverse use a consistent syntax and grammar, making them intuitive to learn and use.
-   **Ease of Use:** The Tidyverse simplifies complex operations, enabling analysts to focus more on insights rather than the mechanics of data handling.

The Tidyverse includes several key packages, including:

-   **ggplot2:** For data visualization.
-   **dplyr:** For data manipulation.
-   **tidyr:** For tidying and reshaping data.
-   **readr:** For reading data files.
-   **tibble:** For improved data frame functionality.
-   **purrr:** For functional programming with lists.

### **What is dplyr?**

**dplyr** is a core package within the Tidyverse, specifically focused on data manipulation. It provides a grammar of data manipulation, making it easy to perform tasks such as filtering, sorting, summarizing, and transforming data.

Key operations in **dplyr**, often called the "verbs" of data manipulation, include:

-   **`filter()`**: Select rows based on conditions.
-   **`select()`**: Choose specific columns.
-   **`mutate()`**: Add or modify columns.
-   **`arrange()`**: Sort rows.
-   **`summarize()`**: Calculate summary statistics.
-   **`group_by()`**: Group data for grouped operations.

These functions are designed to be intuitive, readable, and chainable using the `%>%` operator (pipe), which allows you to express sequences of transformations in a clear and linear fashion. They present an alternative to using SQL for data filtering, summarizing, and selection.

## The **dplyr** Package

**dplyr** is part of the Tidyverse, meaning it shares the same principles, syntax, and goals as other Tidyverse packages. It serves as the **data manipulation backbone** for many data analysis workflows in the Tidyverse. In this section, we will explore key functions in detail. To use the **dplyr** package, you must install it and then load it:

```{r warning=FALSE}
suppressPackageStartupMessages(library(dplyr))
```

The call to `suppressPackageStartupMessages()` is necessary to "suppress" all warnings and messages; if not done, those would end up in your knitted notebook.

The **dplyr** package uses a modified (and simplified) tabular structure called a *tibble* that replaces dataframes in Tidyverse.

To load all packages of the Tidyverse, you can simply load **tidyverse**.

```{r warning=FALSE, eval=FALSE}
library(tidyverse)
```

### tibbles

**Tibbles** are a modernized re-imagination of data frames in R, introduced by the **tibble** package, which is part of the **Tidyverse**. The goal is to address some of the limitations and frustrations associated with base R data frames by providing a more "user-friendly" and consistent experience for data manipulation.

Tibbles provide some key benefits over dataframes, including:

1.  **Simplified Printing**\
    Tibbles display only the first 10 rows and as many columns as fit the screen, avoiding overwhelming output for large datasets. For example:

    ```{r}
    library(tibble)

    tibble_example <- tibble(
      x = 1:100,
      y = rnorm(100)
    )
    tibble_example
    ```

    Output will show:

    ```         
    # A tibble: 100 × 2
          x       y
      <int>   <dbl>
    1     1  -0.276
    2     2   1.003
    3     3   0.735
    # … with 97 more rows
    ```

2.  **Preservation of Data Types**\
    Unlike base R data frames, tibbles never automatically convert strings into factors or perform other unexpected type coercion. For example:

    ```{r}
    df <- data.frame(name = "Alice", age = 25)
    class(df$name)  # Output: factor

    tb <- tibble(name = "Alice", age = 25)
    class(tb$name)  # Output: character
    ```

3.  **Column Access by Name**\
    Tibbles enforce stricter rules for column access:

    -   Accessing columns with `$` or `[[` requires exact matches to column names.
    -   Partial matching of names is disallowed, reducing errors.

    ```{r}
    tb <- tibble(name = "Alice", age = 25)
    tb$n      # Error: Unknown column 'n'
    ```

4.  **Support for Non-Syntactic Names**\
    Tibbles allow column names that aren't syntactically valid in R, such as names containing spaces or special characters. These must be referenced using backticks:

    ```{r}
    tb <- tibble(`student name` = "Alice", age = 25)
    tb$`student name`  # Access column
    ```

5.  **Creation Flexibility**\
    Tibbles can be created using the `tibble()` function or the `tribble()` function for row-wise data entry. For example:

    ```{r}
    # Using tibble()
    tb <- tibble(name = c("Alice", "Bob"), age = c(25, 30))

    # Using tribble()
    tb <- tribble(
      ~name, ~age,
      "Alice", 25,
      "Bob", 30
    )
    ```

Tibbles are especially useful when working with the Tidyverse because all Tidyverse packages are designed to work optimally with tibbles. They are recommended whenever you want a more predictable and consistent data structure for data manipulation and analysis.

All of the functions that expect a dataframe also work with tibbles.

### Reading CSV Files

To load a CSV file into a tibble rather than a dataframe, use `read_csv()` instead of `read.csv()`. Using `read_csv()` over Base R's `read.csv()` has several benefits (aside from returning a tibble).

1.  **Better Defaults**:
    -   Strings are not automatically converted to factors.
    -   Column types are automatically guessed and more accurately handled.
2.  **Faster Performance**:
    -   `read_csv()` is (a bit) faster for large files.
3.  **Tidyverse Integration**:
    -   Returns a tibble, which integrates seamlessly with other Tidyverse packages.
4.  **Informative Output**:
    -   Prints metadata about the data during import (*e.g.*, column types).

The function `read_csv()` is part of the **readr** package, so either install and load that package of the full **tidyverse** package set.

Consider a dataset `students` containing information on students' names, majors, and GPA's (from the CSV file [students.csv](students.csv)):

```{r}
library(readr)
students <- read_csv("students.csv")
```

The function `read_csv()` displays information about the structure of the CSV. If you wish to "suppress" this information, then use:

```{r}
library(readr)
students <- read_csv("students.csv", show_col_types = FALSE)
```

### Filtering Rows with `filter()`

The `filter()` function extracts rows from a data frame that meet specified logical conditions. To select students with a GPA of 3.5 or better, you can use the following **dplyr** pipe:

```{r}
filtered_students <- students %>%
  filter(gpa >= 3.50)
```

This returns rows where the `gpa` column is greater than or equal to 3.50. To display the result, you can use the `print()` function:

```{r}
print(filtered_students)
```

### Arranging Rows with `arrange()`

The `arrange()` function orders rows based on the values of specified columns. To sort the `students` dataset by `gpa` in descending order:

```{r}
arranged_students <- students %>%
  arrange(desc(gpa))

print(arranged_students)
```

This sorts the students from the highest to the lowest grade.

### Selecting Columns with `select()`

The `select()` function allows for the extraction of specific columns from a data frame. To create a new dataset with only the `name` and `gpa` columns:

```{r}
selected_students <- students %>%
  select(name, gpa)

print(selected_students)
```

This results in a dataset containing only the `name` and `gpa` columns. From a relational perspective, this is a "projection" operation.

### Adding or Modifying Columns with `mutate()`

The `mutate()` function is used to add new columns or modify existing ones. To add a column `honors` that categorizes students based on their GPA:

```{r}
mutated_students <- students %>%
  mutate(grade_category = case_when(
    gpa >= 3.9 ~ "Excellent",
    gpa >= 3.5 ~ "Good",
    gpa >= 3.0 ~ "Passing",
    TRUE ~ "Below Passing"
  ))

print(mutated_students)
```

This adds a `honors` column with values for various ranges of GPA.

### Summarizing Data with `summarize()`

The `summarize()` function computes summary statistics for a data frame. For example, to calculate the mean (average) GPA of the students:

```{r}
average_gpa <- students %>%
  summarize(mean_gpa = mean(gpa))

print(average_gpa)
```

This computes the mean of the `gpa` column, providing the average GPA

### Grouped Operations with `group_by()` and `summarize()`

For more complex analyses, `group_by()` in conjunction with `summarize()` enables grouped operations. To calculate the average GPA by major:

```{r}
avg_by_major <- students %>%
  group_by(major) %>%
  summarize(avg_gpa = mean(gpa))

print(avg_by_major)
```

Whenever there's a "by" or "per" it means that we want to computer some "fact" for grouped data.

These examples demonstrate the versatility of `dplyr` in performing data transformations, facilitating efficient data manipulation for complex analyses using an alternative to SQL.

## Summary

The **Tidyverse** and **dplyr** are integral parts of the modern data analysis ecosystem in R, designed to simplify and streamline data manipulation and visualization tasks.

------------------------------------------------------------------------

## Files & Resources

```{r zipFiles, echo=FALSE}
zipName = sprintf("LessonFiles-%s-%s.zip", 
                 params$category,
                 params$number)

textALink = paste0("All Files for Lesson ", 
               params$category,".",params$number)

# downloadFilesLink() is included from _insert2DB.R
knitr::raw_html(downloadFilesLink(".", zipName, textALink))
```

------------------------------------------------------------------------

## References

1.  Wickham, H., & Grolemund, G. (2017). *R for Data Science: Import, tidy, transform, visualize, and model data*. O'Reilly Media. Retrieved from <https://r4ds.hadley.nz>

2.  Wickham, H. (2014). Tidy data. *Journal of Statistical Software, 59*(10), 1–23. <https://doi.org/10.18637/jss.v059.i10>

Portions of this lesson were created with the assistance of ChatGPT 4o.

## Errata

[Let us know](https://form.jotform.com/212187072784157){target="_blank"}.
