Objectives

Upon completion of this lesson, you will be able to:

  • explore data visually using bar charts, scatterplots, and line graphs
  • create common chart types in Base R
  • identify correlations and outliers visually

Overview

Exploratory data visualization is an essential step in the data analytics process and one of the fundamentals of data science. It involves creating visual representations of data to help identify patterns, trends, and anomalies, which leads to a better understanding of the underlying information. Because it is an effective way both to explore data and to convey complex ideas, it is a vital skill for data analysts.

This overview aims to introduce the basic concepts of exploratory data visualization, its importance in data analytics, and some common tools and techniques.

The examples use data from two CSV files, customertxndata.csv and pharmaSalesTxn.csv.

Download the files to try the example code yourself and to experiment with the parameters of each function.

Importance of Exploratory Data Visualization

  • Identifying Patterns and Trends: Visualizations help analysts detect relationships, trends, and patterns within data that might not be easily discernible otherwise.
  • Outlier Detection: By visually exploring data, analysts can quickly identify potential outliers or errors in the dataset.
  • Hypothesis Generation: Visualizations can help generate hypotheses about the data, which can later be tested using statistical methods.
  • Effective Communication: Visual representations of data allow analysts to communicate their findings to both technical and non-technical audiences.

Common Types of Data Visualizations

The three most common visualizations for exploring data are:

  • Scatter Plots: Suitable for exploring the relationship between two continuous variables.
  • Bar Charts: Useful for comparing categorical data or showing changes over time.
  • Line Charts: Ideal for visualizing continuous data, such as trends over time.

These three are also useful, but are not addressed in this lesson:

  • Bubble Charts: A scatterplot in which the size of each point encodes another data dimension.
  • Pie Charts: Great for representing proportions or percentages of a whole.
  • Heat Maps: Effective for visualizing large datasets and identifying patterns or clusters.

Tips for Effective Data Visualization

  • Choose the Right Chart Type: Select a visualization type that best represents the data and the story you want to tell.
  • Keep It Simple: Focus on clarity and avoid clutter by limiting the number of colors, shapes, and labels.
  • Use Appropriate Scales: Choose appropriate scales and axis limits to avoid misleading representations of the data.
  • Consider Accessibility: Ensure that visualizations are accessible to those with color vision deficiencies and other impairments.

Scatterplots

Scatterplots, also known as scatter graphs or scatter diagrams, are a type of data visualization used to display the relationship between two continuous variables. In a scatterplot, each data point is represented as a dot (or other symbol) on a Cartesian coordinate plane, with one variable plotted on the x-axis and the other on the y-axis.

Scatterplots are particularly useful for identifying patterns or trends between the two variables, such as positive or negative correlations, and for detecting outliers. By visually examining a scatterplot, one can gain insights into the strength and direction of the relationship between the variables.

For example, if the data points in a scatterplot form an upward-sloping pattern, it indicates a positive correlation between the two variables, meaning that as one variable increases, so does the other. Conversely, if the data points form a downward-sloping pattern, it suggests a negative correlation, meaning that as one variable increases, the other decreases. If the data points are scattered randomly without any discernible pattern, it indicates that there may be little or no correlation between the two variables.
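The three patterns described above can be reproduced with synthetic data. The sketch below is for illustration only: the variables x, y.pos, y.neg, and y.none are made-up values generated with rnorm(), not data from the lesson's CSV files.

```r
# Synthetic data (assumption: generated for illustration, not from the lesson files)
set.seed(42)
x      <- 1:50
y.pos  <- 2 * x + rnorm(50, sd = 10)    # tends to rise as x rises
y.neg  <- -2 * x + rnorm(50, sd = 10)   # tends to fall as x rises
y.none <- rnorm(50, sd = 10)            # generated independently of x

# Draw the three scatterplots side by side
old.par <- par(mfrow = c(1, 3))
plot(x, y.pos,  main = "Positive",       xlab = "x", ylab = "y")
plot(x, y.neg,  main = "Negative",       xlab = "x", ylab = "y")
plot(x, y.none, main = "Little or none", xlab = "x", ylab = "y")
par(old.par)   # restore the previous layout
```

The first panel slopes upward, the second slopes downward, and the third shows no discernible direction.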

To create a scatterplot in R, use the plot() function, which is part of base R and requires no additional packages. The example below creates a scatterplot of two variables:

# Sample data
df <- read.csv("customertxndata.csv")
df.agg <- aggregate(df$Total, list(df$NumVisits), FUN=mean)

# Create a scatterplot using the plot() function
plot(x = df.agg[,1], 
     y = df.agg[,2], 
     main = "Number of Visits vs Average Total Spend", 
     xlab = "#Visits", 
     ylab = "Avg Total ($)",
     type = 'p')

In the example above, we load a CSV of website visits into a dataframe and then use the aggregate() function to calculate the average total for each number of visits. We then plot the average spend for each number of visits to determine if there’s a correlation.
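To make the shape of the aggregate() result concrete, here is a minimal sketch on a tiny data frame. The data frame toy and its values are hypothetical and are not taken from customertxndata.csv.

```r
# Hypothetical toy data (illustration only)
toy <- data.frame(
  NumVisits = c(1, 1, 2, 2, 3),
  Total     = c(10, 20, 30, 50, 90)
)

# Average Total for each distinct value of NumVisits
toy.agg <- aggregate(toy$Total, list(toy$NumVisits), FUN = mean)
print(toy.agg)
#   Group.1  x
# 1       1 15
# 2       2 40
# 3       3 90
```

When called this way, aggregate() names the grouping column Group.1 and the result column x, which is why the lesson code refers to the columns by position (df.agg[,1] and df.agg[,2]).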

The pattern reveals a positive correlation, i.e., the average total spend increases with the number of visits. Once a correlation has been detected visually, it should be confirmed statistically by calculating an appropriate correlation coefficient, such as Pearson's product-moment correlation or Spearman's rank correlation.
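Both coefficients are available in base R through cor() and cor.test(). The sketch below uses hypothetical values for visits and avg.total that stand in for the aggregated columns; they are not results from the lesson's data.

```r
# Hypothetical aggregated values (illustration only)
visits    <- c(1, 2, 3, 4, 5, 6, 7, 8)
avg.total <- c(12, 19, 25, 41, 38, 55, 61, 80)

# Pearson measures linear association;
# Spearman measures monotonic association based on ranks
r.pearson  <- cor(visits, avg.total, method = "pearson")
r.spearman <- cor(visits, avg.total, method = "spearman")

# cor.test() adds a significance test for the chosen coefficient
test.pearson <- cor.test(visits, avg.total, method = "pearson")

print(r.pearson)
print(r.spearman)
print(test.pearson$p.value)
```

Values close to 1 or -1 indicate a strong relationship, and values near 0 indicate a weak one. Spearman's coefficient is often the safer choice when the relationship is monotonic but not linear.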

Bar Charts

Bar charts, also known as bar graphs, are a type of data visualization used to display and compare categorical or discrete data. In a bar chart, data categories are represented by rectangular bars, where the length or height of each bar is proportional to the value or frequency of the category it represents. Bar charts can be plotted either vertically or horizontally. In a vertical chart, the x-axis represents the categories and the y-axis represents the values or frequencies; in a horizontal chart, the axes are swapped. Vertical bar charts are also sometimes referred to as column charts.

Bar charts are best used when:

  • Comparing data across categories: Bar charts are an effective way to compare values or frequencies across different categories, making it easy to identify the highest or lowest values and observe general trends.

  • Showing changes over time: When the categories represent time periods (e.g., months, years), bar charts can be used to visualize changes in the data over time. In this case, it’s essential to maintain a consistent time interval between bars to avoid misinterpretation of the data.

  • Displaying distribution or composition: Bar charts can be used to show the distribution of data across different categories or the composition of a whole by using stacked or grouped bar charts.

  • Visualizing small to moderate-sized datasets: Bar charts are most effective for displaying small to moderate-sized datasets with a limited number of categories. For larger datasets or those with numerous categories, other visualization techniques like line charts or heat maps might be more appropriate.
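The stacked and grouped variants mentioned in the list above can be sketched with barplot() by passing a matrix instead of a vector. The matrix counts, its labels, and its values are hypothetical and chosen only for illustration.

```r
# Hypothetical counts (illustration only): rows are series, columns are categories
counts <- matrix(c(120, 80,     # Q1: New, Returning
                    90, 110,    # Q2: New, Returning
                    60, 40),    # Q3: New, Returning
                 nrow = 2,
                 dimnames = list(c("New", "Returning"),
                                 c("Q1", "Q2", "Q3")))

old.par <- par(mfrow = c(1, 2))

# Stacked: each bar shows how the category total is composed
barplot(counts, main = "Stacked",
        col = c("navy", "darkred"),
        legend.text = rownames(counts))

# Grouped: the bars of each series sit side by side within a category
barplot(counts, main = "Grouped",
        col = c("navy", "darkred"),
        beside = TRUE,
        legend.text = rownames(counts))

par(old.par)   # restore the previous layout
```

Stacked bars emphasize the total per category, while grouped bars make it easier to compare the individual series.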

Keep in mind that bar charts are not suitable for visualizing continuous data, as the data needs to be grouped into discrete categories for this type of chart. For continuous data, consider using line charts, scatterplots, or other visualization methods.
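If a bar chart of a continuous variable is still wanted, the values first have to be grouped into categories. The following sketch shows one way to do that with cut() and table(); the variable spend and the chosen break points are assumptions made for this illustration.

```r
# Hypothetical continuous values (illustration only)
set.seed(1)
spend <- runif(200, min = 0, max = 100)

# Group the values into four equal-width categories
spend.bins <- cut(spend,
                  breaks = c(0, 25, 50, 75, 100),
                  labels = c("0-25", "25-50", "50-75", "75-100"),
                  include.lowest = TRUE)

# Count the observations per category and plot the counts
binCounts <- table(spend.bins)
barplot(binCounts,
        main = "Spend by Range",
        xlab = "Spend range ($)",
        ylab = "Count")
```

The base R function hist() performs the binning and the plotting in a single step and may be more convenient when the goal is to inspect a distribution.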

Bar plots can be created in R using the barplot() function. It takes a vector (or a table of counts) and draws one bar per element, with the height of each bar equal to the value of the element. The example below creates a column chart from a CSV containing data about visits to a website; the function table() is used to count how many visits come from each OS.

# Sample data
df <- read.csv("customertxndata.csv")

# count number of visits by OS
visitsByOS <- table(df$OS)

barplot(visitsByOS,
        main = "Visits by OS",
        xlab = "OS",
        ylab = "#Visits",
        names.arg = c("Android", "iOS"),
        col = "darkred",
        horiz = FALSE)
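To see what table() returns before it is passed to barplot(), consider this small sketch. The vector os is hypothetical and only mimics the kind of values an OS column might hold.

```r
# Hypothetical values (illustration only)
os <- c("iOS", "Android", "Android", "iOS", "Android")

# table() counts the occurrences of each distinct value
osCounts <- table(os)
print(osCounts)

# When names.arg is omitted, barplot() uses the names of the table as bar labels
barplot(osCounts, main = "Rows by OS", ylab = "Count")
```

Relying on the names stored in the table avoids mislabeled bars if the categories in the data ever differ from a hard-coded label vector.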

The barplot() function has a few key parameters:

  • main is the main title for the chart
  • sub is a subtitle for the chart
  • horiz can be set to TRUE or FALSE; if FALSE then the plot is a column chart
  • xlab and ylab are the x- and y-axis labels
  • names.arg is a vector of labels for each bar
  • col is the color of the bars

The second example is a bar chart showing the total revenue by operating system.

# Sample data
df <- read.csv("customertxndata.csv")

# calculate the revenue by OS
df.agg <- aggregate(df$Total, list(df$OS), FUN=sum)

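barplot(df.agg$x,
        main = "Revenue by OS",
        sub = "in US$ for Q1 2023",
        # horiz = TRUE draws the values along the x-axis and the
        # categories along the y-axis, so the labels are set accordingly
        xlab = "Revenue (US$)",
        ylab = "OS",
        names.arg = c("Android", "iOS"),
        col = "navy",
        horiz = TRUE)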

Line Graphs

The examples below use the data from pharmaSalesTxn.csv containing sales transactions with dates.

# read data from CSV
df <- read.csv("pharmaSalesTxn.csv")

# for each transaction (row), extract amount and month;
# dates are in mm/dd/yyyy format, so the month is the first field
df.sales <- data.frame(
  month = sapply(strsplit(as.character(df$date), "/"), "[", 1),
  amount = df$amount,
  stringsAsFactors = FALSE
)

# sum the sales per month
df.agg <- aggregate(df.sales$amount, list(df.sales$month), FUN=sum)

# rename the columns
colnames(df.agg) <- c("month","amount")

# convert months to integers and order by month
df.agg$month <- as.integer(df.agg$month)
df.agg <- df.agg[order(df.agg$month), ]

# plot the revenue total by month
plot(x = df.agg$month, y = df.agg$amount, 
     type = "b", pch = 19, 
     col = "red", 
     xlab = "Month", ylab = "Total Sales ($)")
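As an alternative to splitting the date strings, the month can be extracted by parsing the dates with as.Date() and format(). The sketch below assumes the mm/dd/yyyy format noted in the code above; the data frame txn and its values are hypothetical and stand in for pharmaSalesTxn.csv.

```r
# Hypothetical transactions (illustration only), dates in mm/dd/yyyy format
txn <- data.frame(
  date   = c("01/15/2020", "01/28/2020", "02/03/2020", "03/21/2020"),
  amount = c(100, 250, 75, 300)
)

# Parse the strings as dates, then extract the month number
txn$month <- as.integer(format(as.Date(txn$date, format = "%m/%d/%Y"), "%m"))

# Sum the amounts per month; the formula interface keeps the column names
txn.agg <- aggregate(amount ~ month, data = txn, FUN = sum)
print(txn.agg)

# Plot the totals and label the x-axis with abbreviated month names
plot(txn.agg$month, txn.agg$amount, type = "b", pch = 19,
     xaxt = "n", xlab = "Month", ylab = "Total Sales ($)")
axis(1, at = txn.agg$month, labels = month.abb[txn.agg$month])
```

Parsing the dates makes the code fail loudly on malformed input rather than silently producing a wrong month, and the formula interface removes the need to rename the columns afterwards.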

Stacking Plots

Several plots can be added to the same visualization. The plot() function creates the canvas for the visualization and the call to lines() adds a line to the existing canvas created by plot().

When stacking plots, i.e., when adding two plots to a single canvas, it is critical that the values for both are accommodated by the range of x and y values of the first plot created with the plot() function.

# read data from CSV
df <- read.csv("pharmaSalesTxn.csv")

# for each transaction (row), extract amount and month;
# dates are in mm/dd/yyyy format, so the month is the first field
df.sales <- data.frame(
  month = sapply(strsplit(as.character(df$date), "/"), "[", 1),
  amount = df$amount,
  stringsAsFactors = FALSE
)

# sum and average the sales per month
df.agg <- aggregate(df.sales$amount, list(df.sales$month), 
                    FUN = function(x) c(total = sum(x), avg = mean(x) ))

# extract data to new dataframe
df <- data.frame(
  month = as.integer(df.agg$Group.1),
  total = as.numeric(df.agg$x[,1]),
  avg = as.numeric(df.agg$x[,2]),
  stringsAsFactors = FALSE,
  row.names = NULL)

# order by month
df <- df[order(df$month), ]

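# NOTE: the block below overwrites the aggregated data frame with randomly
# sampled values, so the plots that follow show synthetic data;
# skip it to plot the values computed from the CSV instead
df <- data.frame(
  month = 1:12,
  total = sample(2000:3000, 12),
  avg = sample(1000:1800, 12)
)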

In the code block below, we add lines for the “total” and the “average” sold per month, so we need a range of values that accommodates both numeric ranges.

# determine minimum and maximum y axis values
ylim.min <- min(df$total, df$avg) * 0.8
ylim.max <- max(df$total, df$avg) * 1.2

# Create a first line: monthly sales total
plot(df$month, df$total, type = "b", frame = FALSE, pch = 19, 
     xlim = c(1, 12),
     ylim = c(ylim.min, ylim.max),
     main = "Average vs Total Sales Per Month",
     sub = "Year 2020",
     col = "red", xlab = "Month", ylab = "Amount ($)")

# Add a second line: monthly average sales
lines(df$month, df$avg, pch = 18, col = "blue", type = "b", lty = 2)

Customizing Plots

R offers several options to customize the chart appearance with parameters to the plot() function.

  • cex → scaling factor for symbols and text
  • lwd → line width
  • col → color
  • lty → line type
  • pch → marker shape
  • type → how the points are drawn or connected

The argument type determines how the line segments (or, to be more precise, the connections between the points) are drawn:

  • ‘p’ – only draw points and no lines which creates a scatterplot
  • ‘b’ – connect segments but leave a gap around each point
  • ‘l’ – fully connect the points with lines

The argument pch determines how each point is drawn. It is either an integer from 0 to 25 or a single character. For example:

  • 19 – large dot
  • 4 – cross

The argument lwd determines the line width. It is a positive number with a default of 1; larger values draw thicker lines.

The argument lty determines the line type. Line types can either be specified as an integer (0=blank, 1=solid (default), 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash) or as one of the character strings “blank”, “solid”, “dashed”, “dotted”, “dotdash”, “longdash”, or “twodash”, where “blank” uses ‘invisible lines’ (i.e., does not draw them).

[Figure: cheat sheet summarizing the values of these parameters]
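A reference chart of this kind can also be drawn directly in R. The following sketch plots the pch symbols 0 to 25 and the lty line types 1 to 6; the layout coordinates are arbitrary choices made for this illustration.

```r
# Top row: one point per pch value from 0 to 25, with its number underneath
plot(x = 0:25, y = rep(2, 26), pch = 0:25, cex = 1.5,
     xlim = c(-1, 26), ylim = c(0, 2.5),
     xlab = "", ylab = "", axes = FALSE,
     main = "pch symbols (top) and lty line types (bottom)")
text(x = 0:25, y = 1.8, labels = 0:25, cex = 0.6)

# Below: one horizontal line per lty value from 1 to 6
for (i in 1:6) {
  y <- 1.5 - i * 0.2
  segments(x0 = 3, y0 = y, x1 = 25, y1 = y, lty = i, lwd = 2)
  text(x = 1, y = y, labels = paste("lty =", i), cex = 0.7)
}
```

Running the code in an R session produces a chart that can be kept open as a quick lookup while experimenting with the parameters.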

Adding Legends

Legends can be added to a plot using the legend() function. The example below reuses the dataframe and the y-axis limits from the previous code block.

# total per month
plot(df$month, df$total, type = "b", frame = FALSE, pch = 19, 
     xlim = c(1, 12),
     ylim = c(ylim.min, ylim.max),
     main = "Total vs Average Sales Per Month",
     col = "red", xlab = "Month", ylab = "$")

# average per month
lines(df$month, df$avg, pch = 18, col = "blue", type = "b", lty = 2)

# add a legend to the plot
legend("topleft", legend=c("Total Sales", "Average Sales"),
       col=c("red", "blue"), lty = 1:2, pch = c(19, 18), cex=0.8)

Summary

By gaining a strong foundation in exploratory data visualization, data analysts can enhance their data analytics skills and effectively communicate their findings. With the right tools and techniques, visualizations can unlock valuable insights and support data-driven decision-making.


Files & Resources

All Files for Lesson 6.724


