Upon completion of this short, you will be able to:
Synthetic data is artificially generated data that mimics the characteristics and statistical properties of real-world data. It is created using algorithms and statistical models that produce data with similar patterns, structures, and relationships to real-world data.
Synthetic data can be created in different ways, such as using generative models, statistical models, or simulations. The goal of generating synthetic data is to create a representative data set that can be used for testing, training, or experimentation, without the need to use real-world data that may be difficult, expensive, or unethical to obtain.
Synthetic data can be used in various fields, including machine learning, data analytics, and computer vision, to test algorithms and models, simulate scenarios, and develop and optimize systems. It can also be used to protect the privacy of sensitive information by substituting real-world data with synthetic data that retains the statistical properties of the original data, but does not reveal personal information.
This short demonstrates how to generate a synthetic game schedule with game results for a fictitious sports league.
Generating synthetic datasets has several advantages over “real-world” data (even data that has been anonymized), including:
Privacy Protection: Synthetic data can be used to protect sensitive information such as personally identifiable information (PII), financial data, and medical records. By using synthetic data, organizations can reduce the risk of a data breach and protect the privacy of their customers and users.
Cost-effective: Creating synthetic data is a cost-effective alternative to collecting and processing real-world data. Synthetic data can be generated quickly and efficiently without the need for expensive data collection processes or lengthy ethical review processes.
Diversity: Synthetic data allows for the creation of diverse datasets that can represent a wide range of scenarios and use cases. This enables researchers and organizations to test their algorithms and models on a broader range of data and improve the accuracy and robustness of their systems.
Scalability: Synthetic data can be easily generated in large quantities, which is essential for training machine learning models and testing the scalability of algorithms.
Reusability: Synthetic data can be reused for different purposes, such as training multiple models or algorithms, without the need for additional data collection efforts.
Overcoming data scarcity: In some cases, there may be a scarcity of real-world data, which can limit the ability to train machine learning models or develop predictive algorithms. Synthetic data can be generated to overcome these limitations and enable researchers to develop and test their models on a broader range of data.
Controllable Correlations: Synthetic data sets can be built to exhibit specific correlations and other statistical relationships between variables, which is useful for testing machine learning models.
Overall, generating synthetic datasets can provide a cost-effective, scalable, diverse, and privacy-protecting alternative to using real-world data for research and development purposes.
The team names for the league were generated using ChatGPT and were saved to teams.csv. The code below loads the teams. The CSV file contains a team number and a team name.
## team.number team.name
## 1 1 Arctic Avalanche
## 2 2 Frostbite Phantoms
## 3 3 Polar Pioneers
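The loading code itself is not shown above. A minimal sketch of that step, assuming the CSV has a header row with the columns team.number and team.name, could look like the following. The temporary file is a hypothetical stand-in for teams.csv so that the example runs on its own.

```r
## Assumption: a two-column CSV with the headers team.number and team.name.
## A temporary file stands in for teams.csv so the sketch is self-contained.
teams.file <- tempfile(fileext = ".csv")
writeLines(c("team.number,team.name",
             "1,Arctic Avalanche",
             "2,Frostbite Phantoms",
             "3,Polar Pioneers"), teams.file)

## load the teams into a data frame and show the first rows
df.teams <- read.csv(teams.file, header = TRUE, stringsAsFactors = FALSE)
head(df.teams, 3)
```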
The first step is to define a data structure in which to store the game results. If the number of teams in the league is n and each team plays every other team once at home and once away (at the opponent's stadium), there are n(n - 1) games in a season and 2(n - 1) games for each team. The schedule built here is a double round robin in which those fixtures are played twice, so the games are doubled: 2n(n - 1) games in total and 4(n - 1) games for each team.
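The counts can be checked with a little arithmetic. The sketch below uses a hypothetical league of n = 4 teams (not the size of the league in this short) and derives the games per team from the 2n(n - 1) total that the code uses.

```r
## n is a hypothetical league size chosen only for illustration
n <- 4

## every ordered (home, away) pairing played once
single.home.away <- n * (n - 1)

## the same pairings played twice, which is the total used for num.games
season.games <- 2 * n * (n - 1)

## each game involves two teams, so divide the team appearances by n
games.per.team <- season.games * 2 / n
```

For this small league the season total is 24 games, and each team appears in 12 of them.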
num.teams <- nrow(df.teams)
num.games <- 2*(num.teams)*(num.teams-1)
## create data frame for games
df.games <- data.frame(
  home.team = integer(num.games),
  away.team = integer(num.games),
  home.team.goals = integer(num.games),
  away.team.goals = integer(num.games),
  ot = integer(num.games)
)
Next, we will define the game pairings of each team playing the other twice: once at home and once away.
all.teams <- 1:num.teams
game.counter <- 1
## iterate over all teams; t is the "home team"
for (t in all.teams) {
  ## make other teams the away team
  for (o in all.teams) {
    if (o != t) {
      df.games$home.team[game.counter] <- t
      df.games$away.team[game.counter] <- o
      game.counter <- game.counter + 1
    }
  }
}
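## now fill the second half of the schedule with home and away swapped;
## the loop above already covers every ordered pairing, so each home/away
## fixture ends up in the double round robin twice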
h <- df.games$home.team[1:(game.counter-1)]
a <- df.games$away.team[1:(game.counter-1)]
df.games$home.team[game.counter:num.games] <- a
df.games$away.team[game.counter:num.games] <- h
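The same pairings can also be built without explicit loops. The following is a minimal sketch, not the method used above: it relies on expand.grid to enumerate the ordered pairings for a hypothetical league of four teams, then appends the swapped copy and counts how often each fixture occurs. The names pairs, swapped, schedule, and fixture.counts are introduced here for illustration.

```r
## hypothetical small league for illustration
num.teams <- 4

## all ordered (home, away) combinations, minus a team playing itself
pairs <- expand.grid(away.team = 1:num.teams, home.team = 1:num.teams)
pairs <- pairs[pairs$home.team != pairs$away.team, c("home.team", "away.team")]

## second half of the schedule: home and away swapped
swapped <- data.frame(home.team = pairs$away.team, away.team = pairs$home.team)

schedule <- rbind(pairs, swapped)
rownames(schedule) <- NULL

## rows are home teams, columns are away teams
fixture.counts <- table(schedule$home.team, schedule$away.team)
```

Inspecting fixture.counts is a quick way to confirm that no team plays itself and that every home/away pairing occurs the same number of times.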
Now that we have the game pairings set for a double round robin schedule, let’s assign scores to each of the two teams. The “goals for” and “goals against” are randomly generated, so a game can initially end level. Ties are not allowed: each tied game is flagged as an “overtime” game, and an extra goal is randomly awarded to one of the two teams, with the home team receiving it 60% of the time.
## generate scores
df.games$home.team.goals <- as.integer(round(runif(num.games, min = 0, max = 7),0))
df.games$away.team.goals <- as.integer(round(runif(num.games, min = 0, max = 7),0))
## check for ties; set "ot" flag
df.games$ot <- (df.games$home.team.goals == df.games$away.team.goals)
## identify tie games
ot.games <- which(df.games$ot == TRUE)
## and (randomly) award extra goal to one of the two teams
for (otg in ot.games) {
  ot.winner <- runif(1)
  ## home team wins with 60% chance vs away team
  if (ot.winner < 0.6) {
    ## award win to home team
    df.games$home.team.goals[otg] <- df.games$home.team.goals[otg] + 1
  } else {
    ## award win to away team
    df.games$away.team.goals[otg] <- df.games$away.team.goals[otg] + 1
  }
}
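The scoring step can also be written without a loop. The sketch below is an alternative, not the code used above, and it swaps in a different sampling technique: sample(0:7, ...) draws each score from 0 to 7 with equal probability, whereas rounding runif(min = 0, max = 7) makes 0 and 7 half as likely as the scores in between. The seed, the number of games, and the variable names are assumptions chosen for illustration.

```r
set.seed(42)        # hypothetical seed, only to make the run repeatable
num.games <- 1000   # hypothetical number of games for illustration

## draw integer scores directly; every value from 0 to 7 is equally likely
home.goals <- sample(0:7, num.games, replace = TRUE)
away.goals <- sample(0:7, num.games, replace = TRUE)

## flag tied games as overtime games
ot <- home.goals == away.goals

## one random draw per tied game; TRUE means the home team gets the extra goal
home.wins.ot <- runif(sum(ot)) < 0.6
home.goals[ot] <- home.goals[ot] + home.wins.ot
away.goals[ot] <- away.goals[ot] + (!home.wins.ot)
```

Setting a seed makes the generated data set reproducible, which helps when the same synthetic data has to be regenerated for a later test.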
Let’s take a quick look at a few random rows to ensure they make sense; a kind of “smoke test”.
## home.team away.team home.team.goals away.team.goals ot
## 169 6 15 1 2 FALSE
## 1930 8 31 7 1 FALSE
## 321 11 12 6 2 FALSE
## 1258 19 9 4 6 FALSE
## 997 6 1 3 4 FALSE
## 1384 21 13 2 0 FALSE
It is critical to validate that the synthetically generated data meets the requirements for the data and that there are no mistakes in the data. For this example, we need to ensure that the goal difference between the home and away team is exactly one if there was an overtime win.
ot.games <- which(df.games$ot == TRUE)
goal.diff <- abs(df.games$home.team.goals[ot.games] - df.games$away.team.goals[ot.games])
isValid <- !any(goal.diff != 1)
Based on the analysis, the data set is valid.
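Further checks can be bundled in the same style. The sketch below is one possible set of rules, run against a small hand-made data frame with hypothetical values rather than the generated schedule. The name df.check and the individual check names are introduced here for illustration.

```r
## small hand-made stand-in for df.games (hypothetical values)
df.check <- data.frame(
  home.team = c(1, 2, 1, 3),
  away.team = c(2, 1, 3, 1),
  home.team.goals = c(3L, 1L, 2L, 0L),
  away.team.goals = c(2L, 4L, 1L, 1L),
  ot = c(TRUE, FALSE, FALSE, TRUE)
)

goal.diff <- abs(df.check$home.team.goals - df.check$away.team.goals)

## each entry is TRUE when the corresponding rule holds for every game
checks <- c(
  no.self.play  = all(df.check$home.team != df.check$away.team),
  no.ties       = all(goal.diff > 0),
  ot.margin.one = all(goal.diff[df.check$ot] == 1),
  goals.nonneg  = all(df.check$home.team.goals >= 0 &
                      df.check$away.team.goals >= 0)
)

isValid <- all(checks)
```

Keeping the rules in a named vector makes it easy to see which one failed when isValid is FALSE.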
We will write out a CSV that contains the team names and the scores. The code is a bit complex as we assume that the team numbers are not sequential in the df.teams data frame and that team numbers do not have to be in the range from 1 to the number of teams.
for (g in 1:num.games) {
  df.games$home.team[g] <- df.teams$team.name[
    which(df.games$home.team[g] == df.teams$team.number)]
  df.games$away.team[g] <- df.teams$team.name[
    which(df.games$away.team[g] == df.teams$team.number)]
}
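The lookup can also be done in one step per column with match(), which returns the position of each team number in df.teams$team.number. This is a sketch of an alternative, not the code used above. The team numbers 10, 42, and 7 are hypothetical and deliberately non-sequential, and the columns home.team.name and away.team.name are introduced here so that the numeric identifiers are kept.

```r
## hypothetical teams with non-sequential team numbers
df.teams <- data.frame(
  team.number = c(10, 42, 7),
  team.name = c("Arctic Avalanche", "Frostbite Phantoms", "Polar Pioneers"),
  stringsAsFactors = FALSE
)

## hypothetical games that refer to teams by number
df.games <- data.frame(home.team = c(42, 7, 10),
                       away.team = c(10, 42, 7))

## match() finds the row of each team number; index team.name with the result
df.games$home.team.name <-
  df.teams$team.name[match(df.games$home.team, df.teams$team.number)]
df.games$away.team.name <-
  df.teams$team.name[match(df.games$away.team, df.teams$team.number)]
```

Writing the names to separate columns avoids turning the numeric team columns into character columns partway through a loop.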
Now we are ready to save the game schedule, with the scores and full team names, to a CSV file.
And to ensure that it was written correctly, let’s read it back in and display some random games.
df.validate <- read.csv("data/game-schedule.csv", header = TRUE)
rand.rows <- sample(1:nrow(df.validate), 6, replace = FALSE)
print(df.validate[rand.rows,])
## X home.team away.team home.team.goals
## 957 957 Tundra Falcons Tundra Titans 2
## 1223 1223 Coldfront Conquerors Winter Warriors 1
## 1430 1430 Glacial Guardians Coldfront Conquerors 2
## 1064 1064 Frostfire Foxes Polar Pioneers 2
## 316 316 Frostfire Foxes Snowstorm Stingers 0
## 170 170 Snowstorm Stingers Chillzone Chasers 3
## away.team.goals ot
## 957 7 FALSE
## 1223 2 TRUE
## 1430 5 FALSE
## 1064 1 TRUE
## 316 6 FALSE
## 170 2 FALSE
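The X column in the output above is what read.csv shows when a file was written with row names, which is the default behavior of write.csv. The write step itself is not shown in this short, so the following is only a sketch of how the round trip might look with row.names = FALSE. The data values are hypothetical and a temporary file stands in for data/game-schedule.csv.

```r
## hypothetical two-game schedule for illustration
df.out <- data.frame(
  home.team = c("Arctic Avalanche", "Polar Pioneers"),
  away.team = c("Frostbite Phantoms", "Arctic Avalanche"),
  home.team.goals = c(3L, 1L),
  away.team.goals = c(2L, 4L),
  ot = c(TRUE, FALSE)
)

## temporary file used as a stand-in for data/game-schedule.csv
out.file <- tempfile(fileext = ".csv")

## row.names = FALSE keeps the row numbers out of the file
write.csv(df.out, out.file, row.names = FALSE)

## read the file back and inspect the column names
df.back <- read.csv(out.file, header = TRUE)
names(df.back)
```

With the row names left out, the file contains only the five schedule columns when it is read back.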
Synthetic data is artificially generated data that mimics the characteristics of real-world data. It can be created using algorithms and models and is used for testing, training, and experimentation in various fields, such as machine learning, data analytics, and computer vision, while also protecting the privacy of sensitive information. This short explained, through code, how to generate synthetic sports game data.