Objectives

Upon completion of this short, you will be able to:

  • generate synthetic data
  • save data to CSV
  • externalize data to XML

Overview

Synthetic data is artificially generated data that mimics the characteristics and statistical properties of real-world data. It is created using algorithms and statistical models that produce data with similar patterns, structures, and relationships to real-world data.

Synthetic data can be created in different ways, such as using generative models, statistical models, or simulations. The goal of generating synthetic data is to create a representative data set that can be used for testing, training, or experimentation, without the need to use real-world data that may be difficult, expensive, or unethical to obtain.

Synthetic data can be used in various fields, including machine learning, data analytics, and computer vision, to test algorithms and models, simulate scenarios, and develop and optimize systems. It can also be used to protect the privacy of sensitive information by substituting real-world data with synthetic data that retains the statistical properties of the original data, but does not reveal personal information.

This short demonstrates how to generate a synthetic game schedule with game results for a fictitious sports league.

Benefits of Synthetic Data

Generating synthetic datasets has several advantages over “real-world” data (even data that has been anonymized), including:

Privacy Protection: Synthetic data can be used to protect sensitive information such as personally identifiable information (PII), financial data, and medical records. By using synthetic data, organizations can reduce the risk of a data breach and protect the privacy of their customers and users.

Cost-effective: Creating synthetic data is a cost-effective alternative to collecting and processing real-world data. Synthetic data can be generated quickly and efficiently without the need for expensive data collection processes or lengthy ethical review processes.

Diversity: Synthetic data allows for the creation of diverse datasets that can represent a wide range of scenarios and use cases. This enables researchers and organizations to test their algorithms and models on a broader range of data and improve the accuracy and robustness of their systems.

Scalability: Synthetic data can be easily generated in large quantities, which is essential for training machine learning models and testing the scalability of algorithms.

Reusability: Synthetic data can be reused for different purposes, such as training multiple models or algorithms, without the need for additional data collection efforts.

Overcoming data scarcity: In some cases, there may be a scarcity of real-world data, which can limit the ability to train machine learning models or develop predictive algorithms. Synthetic data can be generated to overcome these limitations and enable researchers to develop and test their models on a broader range of data.

Controllable Correlations: Synthetic data sets can be constructed to exhibit specific correlations and other statistical relationships between variables, which is useful for testing machine learning models.

Overall, generating synthetic datasets can provide a cost-effective, scalable, diverse, and privacy-protecting alternative to using real-world data for research and development purposes.
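
As a small illustration of controllable correlations, the sketch below generates two standard normal variables with a chosen correlation. The names rho, x, and y and the target value 0.7 are assumptions made for this example; they are not part of the sports league data set built later in this short.

```r
set.seed(42)       # fix the random number stream so the example is repeatable

n <- 10000         # number of synthetic observations (illustrative)
rho <- 0.7         # target correlation (illustrative)

x <- rnorm(n)
## mix x with independent noise so that cor(x, y) is close to rho
y <- rho * x + sqrt(1 - rho^2) * rnorm(n)

sample.cor <- cor(x, y)
print(round(sample.cor, 2))
```

The sample correlation will not equal rho exactly, but it moves closer to the target as n grows.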

Generating Data

Generating Team Names

The team names for the league were generated using ChatGPT and saved to data/teams.csv. Each row of the CSV file contains a team number and a team name. The code below loads the teams into a data frame.

df.teams <- read.csv("data/teams.csv")

head(df.teams,3)
##   team.number          team.name
## 1           1   Arctic Avalanche
## 2           2 Frostbite Phantoms
## 3           3     Polar Pioneers
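
If data/teams.csv is not available, a small stand-in with the same two columns is enough to follow along. The sketch below uses only the three teams shown above; the object names df.teams.demo, df.teams.loaded, and tmp.file are hypothetical, and a league this small is an assumption made purely for illustration.

```r
## stand-in for data/teams.csv with the same column names
df.teams.demo <- data.frame(
  team.number = 1:3,
  team.name = c("Arctic Avalanche", "Frostbite Phantoms", "Polar Pioneers"),
  stringsAsFactors = FALSE
)

## round-trip through a temporary CSV file to mimic the load step
tmp.file <- tempfile(fileext = ".csv")
write.csv(df.teams.demo, tmp.file, row.names = FALSE)
df.teams.loaded <- read.csv(tmp.file)

print(df.teams.loaded)
```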

Generating Games

The first step is to define a data structure in which to store the game results. Let n be the number of teams in the league. If each team hosts every other team once, there are n(n - 1) games, and each team plays 2(n - 1) of them: n - 1 at home and n - 1 away (meaning at the opponent's stadium). The schedule generated here runs through that home-and-away cycle twice, so a season has 2n(n - 1) games and each team plays 4(n - 1) games.

num.teams <- nrow(df.teams)
num.games <- 2*(num.teams)*(num.teams-1)
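
A quick count for a hypothetical league of four teams shows how the season total of 2n(n - 1) games breaks down. The league size and the object names (pairs, games.per.cycle, games.per.season, team1.per.cycle) are assumptions for this sketch only.

```r
n <- 4                              # hypothetical league size

## every ordered (home, away) pairing with two different teams
pairs <- expand.grid(home = 1:n, away = 1:n)
pairs <- pairs[pairs$home != pairs$away, ]

games.per.cycle <- nrow(pairs)      # n * (n - 1)
games.per.season <- 2 * games.per.cycle

## games involving team 1 in one cycle, at home or away
team1.per.cycle <- sum(pairs$home == 1 | pairs$away == 1)

print(c(cycle = games.per.cycle,
        season = games.per.season,
        team1.season = 2 * team1.per.cycle))
```

For n = 4 this gives 12 games per cycle, 24 games per season, and 12 games for each team over the season.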

## create an empty data frame with one row per game;
## "ot" flags games that were decided in overtime
df.games <- data.frame(
  home.team = integer(num.games),
  away.team = integer(num.games),
  home.team.goals = integer(num.games),
  away.team.goals = integer(num.games),
  ot = logical(num.games)
)

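Next, we define the game pairings. The nested loop fills the first half of the data frame with every ordered (home, away) pairing, so each team hosts every other team once. The second half repeats those pairings with the home and away columns swapped, so every pair of teams meets four times over the season: twice at each team's stadium.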

all.teams <- 1:num.teams
game.counter <- 1

## iterate over all teams; t is the "home team"
for (t in all.teams) {
  ## make other teams the away team
  for (o in all.teams) {
    if (o != t) {
      df.games$home.team[game.counter] <- t
      df.games$away.team[game.counter] <- o
      game.counter <- game.counter + 1
    }
  }
}

## fill the second half of the schedule by swapping home and away;
## the first half already holds every ordered pairing, so this
## repeats each pairing once more
h <- df.games$home.team[1:(game.counter-1)]
a <- df.games$away.team[1:(game.counter-1)]
 
df.games$home.team[game.counter:num.games] <- a
df.games$away.team[game.counter:num.games] <- h
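
The same pairings can be built without explicit loops. The sketch below is an alternative, not the code used in this short; it uses a hypothetical league of four teams and hypothetical object names (first.half, second.half, schedule).

```r
n <- 4
all.teams <- 1:n

## every combination of home and away team, minus a team playing itself;
## away.team is listed first so that it varies fastest, as in the loops
first.half <- expand.grid(away.team = all.teams, home.team = all.teams)
first.half <- first.half[first.half$home.team != first.half$away.team,
                         c("home.team", "away.team")]

## second half: the same games with home and away swapped
second.half <- data.frame(home.team = first.half$away.team,
                          away.team = first.half$home.team)

schedule <- rbind(first.half, second.half)
rownames(schedule) <- NULL

print(nrow(schedule))
```

Whether the loop or the vectorized form is preferable is largely a matter of taste at this scale; the loop makes the home and away roles explicit, while expand.grid() is shorter.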

Now that the game pairings are set, let’s assign a score to each of the two teams in every game. The goals for the home team and for the away team are generated randomly and independently, each between 0 and 7. Ties are not allowed: when the two scores are equal, the game is flagged as an “overtime” game and one extra goal is awarded at random to one of the two teams, with the home team winning 60% of the time.

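## generate scores from 0 to 7; because a uniform draw is rounded,
## the end values 0 and 7 are half as likely as the values 1 through 6
df.games$home.team.goals <- as.integer(round(runif(num.games, min = 0, max = 7),0))
df.games$away.team.goals <- as.integer(round(runif(num.games, min = 0, max = 7),0))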

## flag tied games as overtime ("ot") games
df.games$ot <- (df.games$home.team.goals == df.games$away.team.goals)

## identify the tied games
ot.games <- which(df.games$ot)
## and (randomly) award an extra goal to one of the two teams
for (otg in ot.games) {
  ot.winner <- runif(1)
  ## the home team wins in overtime with probability 0.6
  if (ot.winner < 0.6) {
    ## award the win to the home team
    df.games$home.team.goals[otg] <- df.games$home.team.goals[otg] + 1
  } else {
    ## award the win to the away team
    df.games$away.team.goals[otg] <- df.games$away.team.goals[otg] + 1
  }
}
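
The scoring step can also be written without a loop. This sketch is an alternative under two stated assumptions: it draws goals with sample() so that every value from 0 to 7 is equally likely, which differs from the round(runif()) approach above, and it uses hypothetical names (n.games, home.goals, away.goals, home.wins.ot).

```r
set.seed(2024)
n.games <- 1000

## each score from 0 to 7 is equally likely
home.goals <- sample(0:7, n.games, replace = TRUE)
away.goals <- sample(0:7, n.games, replace = TRUE)

## tied games go to overtime
ot <- home.goals == away.goals

## one random draw per game; the home team wins overtime with probability 0.6
home.wins.ot <- runif(n.games) < 0.6

## add the overtime goal to exactly one side of each tied game
home.goals <- home.goals + as.integer(ot & home.wins.ot)
away.goals <- away.goals + as.integer(ot & !home.wins.ot)

print(table(ot))
```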

Let’s take a quick look at a few random rows to check that they make sense, as a kind of “smoke test”.

rand.rows <- sample(1:num.games, 6, replace = FALSE)

print(df.games[rand.rows,])
##      home.team away.team home.team.goals away.team.goals    ot
## 169          6        15               1               2 FALSE
## 1930         8        31               7               1 FALSE
## 321         11        12               6               2 FALSE
## 1258        19         9               4               6 FALSE
## 997          6         1               3               4 FALSE
## 1384        21        13               2               0 FALSE
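
Because the rows are drawn at random, a different set is displayed on every run. Calling set.seed() before sampling makes the selection repeatable. The seed value 123 and the small stand-in data frame df.demo are arbitrary choices for this sketch.

```r
## small stand-in data frame; the values carry no meaning
df.demo <- data.frame(game = 1:20, goals = rep(0:4, times = 4))

set.seed(123)
rows.first <- sample(1:nrow(df.demo), 6, replace = FALSE)

set.seed(123)      # resetting the seed reproduces the same draw
rows.second <- sample(1:nrow(df.demo), 6, replace = FALSE)

print(identical(rows.first, rows.second))
print(df.demo[rows.first, ])
```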

Validate Data

It is critical to validate that the synthetically generated data meets its requirements and contains no mistakes. For this example, we need to ensure that the goal difference between the home and away team is exactly one if there was an overtime win.

ot.games <- which(df.games$ot == TRUE)
goal.diff <- abs(df.games$home.team.goals[ot.games] - df.games$away.team.goals[ot.games])

## TRUE only if every overtime game was decided by exactly one goal
isValid <- all(goal.diff == 1)

Based on this check, the data set is valid.
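
Other requirements of the schedule can be checked in the same way. The sketch below runs a few further checks on a tiny hand-made data frame; df.check and its values are made up for illustration and only follow the column layout of df.games.

```r
## four hand-made games between two teams, laid out like df.games
df.check <- data.frame(
  home.team = c(1L, 2L, 2L, 1L),
  away.team = c(2L, 1L, 1L, 2L),
  home.team.goals = c(3L, 1L, 4L, 0L),
  away.team.goals = c(2L, 5L, 3L, 2L),
  ot = c(TRUE, FALSE, FALSE, FALSE)
)

## no team is scheduled against itself
no.self.play <- all(df.check$home.team != df.check$away.team)

## no game ends in a tie
no.ties <- all(df.check$home.team.goals != df.check$away.team.goals)

## every ordered (home, away) pairing occurs exactly twice
pairing.counts <- table(paste(df.check$home.team, df.check$away.team))
balanced <- all(pairing.counts == 2)

## overtime games are decided by exactly one goal
ot.diff <- abs(df.check$home.team.goals - df.check$away.team.goals)[df.check$ot]
ot.by.one <- all(ot.diff == 1)

## stop with an error if any requirement is violated
stopifnot(no.self.play, no.ties, balanced, ot.by.one)
```

Using stopifnot() turns each requirement into a hard failure, which is convenient when the generator is rerun often.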

Save to CSV

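We will write out a CSV that contains the team names and the scores. First, the schedule is saved with team numbers to game-schedule-raw.csv. Then each team number is replaced with its team name. The lookup compares against the team.number column rather than using the number as a row index, because we assume that the team numbers in the df.teams data frame are not necessarily sequential and do not have to be in the range from 1 to the number of teams.

## save the schedule with team numbers before they are replaced
write.csv(df.games, "data/game-schedule-raw.csv")

## replace each team number with the matching team name; the first
## assignment converts the column from integer to character
for (g in 1:num.games) {
  df.games$home.team[g] <- df.teams$team.name[
    which(df.games$home.team[g] == df.teams$team.number)]
  df.games$away.team[g] <- df.teams$team.name[
    which(df.games$away.team[g] == df.teams$team.number)]
}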
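
The same lookup can be done in one step per column with match(), which also handles team numbers that are not sequential. The data below is a hypothetical stand-in: the team numbers 10, 20, and 30 and the object names teams.demo and games.demo are assumptions for this sketch.

```r
teams.demo <- data.frame(
  team.number = c(10L, 20L, 30L),
  team.name = c("Arctic Avalanche", "Frostbite Phantoms", "Polar Pioneers"),
  stringsAsFactors = FALSE
)

games.demo <- data.frame(
  home.team = c(10L, 30L, 20L),
  away.team = c(20L, 10L, 30L)
)

## match() returns the row of teams.demo whose number equals each entry
home.rows <- match(games.demo$home.team, teams.demo$team.number)
away.rows <- match(games.demo$away.team, teams.demo$team.number)

games.demo$home.team <- teams.demo$team.name[home.rows]
games.demo$away.team <- teams.demo$team.name[away.rows]

print(games.demo)
```

A team number with no entry in the lookup table would produce NA here, so a check with anyNA() after the replacement is a reasonable safeguard.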

Now we are ready to save the game schedule, with the scores and full team names, to a CSV file.

write.csv(df.games, "data/game-schedule.csv")

And to ensure that it was written correctly, let’s read it back in and display some random games. The X column in the output holds the row names, which write.csv saves by default.

df.validate <- read.csv("data/game-schedule.csv", header = TRUE)

rand.rows <- sample(1:nrow(df.validate), 6, replace = FALSE)

print(df.validate[rand.rows,])
##         X            home.team            away.team home.team.goals
## 957   957       Tundra Falcons        Tundra Titans               2
## 1223 1223 Coldfront Conquerors      Winter Warriors               1
## 1430 1430    Glacial Guardians Coldfront Conquerors               2
## 1064 1064      Frostfire Foxes       Polar Pioneers               2
## 316   316      Frostfire Foxes   Snowstorm Stingers               0
## 170   170   Snowstorm Stingers    Chillzone Chasers               3
##      away.team.goals    ot
## 957                7 FALSE
## 1223               2  TRUE
## 1430               5 FALSE
## 1064               1  TRUE
## 316                6 FALSE
## 170                2 FALSE
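
Beyond inspecting a few rows by eye, the round trip can be checked programmatically. The sketch below writes a small made-up data frame to a temporary file and compares what comes back; df.out, df.in, and tmp.csv are hypothetical names, and row.names = FALSE is used here so that no X column is added.

```r
df.out <- data.frame(
  home.team = c("Arctic Avalanche", "Polar Pioneers"),
  away.team = c("Frostbite Phantoms", "Arctic Avalanche"),
  home.team.goals = c(3L, 1L),
  away.team.goals = c(2L, 4L),
  ot = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

## write to a temporary file and read the file back in
tmp.csv <- tempfile(fileext = ".csv")
write.csv(df.out, tmp.csv, row.names = FALSE)
df.in <- read.csv(tmp.csv, header = TRUE, stringsAsFactors = FALSE)

## all.equal() compares the column names and the values
round.trip.ok <- isTRUE(all.equal(df.out, df.in))
print(round.trip.ok)
```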

Summary

Synthetic data is artificially generated data that mimics the characteristics of real-world data. It can be created using algorithms and models and is used for testing, training, and experimentation in various fields, such as machine learning, data analytics, and computer vision, while also protecting the privacy of sensitive information. This short explained, through code, how to generate synthetic sports game data.


All Files for Short S-6.156
