Read all instructions before starting.
Synthetic data is artificially generated data that mimics the characteristics and statistical properties of real-world data. It is created using algorithms and statistical models that produce data with similar patterns, structures, and relationships to real-world data. In this assignment you will use R to generate a synthetic game schedule with game results for a fictitious sports league.
Completing this assignment will give you an opportunity to
Working individually is recommended, but working in pairs may be helpful.
Prior to working on this assignment, it is suggested that you review these lessons and refer to them during the assignment:
Create a new project in RStudio and then, within that project, create a new R script (an R program, i.e., a .R file). Add comments as a file header with the title “Practice / Generate Synthetic Data”, your name, and today’s date.
Follow common programming practices:
The final deliverable is a game schedule where each team plays every other team twice: once as a home team and once as an away team. The game schedule must be saved as a CSV having this format:
"","home.team","away.team","home.team.goals","away.team.goals","ot"
"1","Arctic Avalanche","Frostbite Phantoms",5,7,FALSE
"2","Arctic Avalanche","Polar Pioneers",4,3,TRUE
...
"16","Frostbite Phantoms","Arctic Avalanche",7,3,FALSE
"17","Frostbite Phantoms","Polar Pioneers",7,6,TRUE
...
The above output shows randomly selected rows. Each row is a game. The games do not have to appear in any specific order, as long as the CSV contains every game: n × (n − 1) games for a league of n teams. The scores are randomly generated, with the caveat that ties are not permitted: when a game is tied at the end of regulation time, overtime is played until there is a winner. The overtime format is “sudden death”, which means that the team that scores first wins. So, a game that went into overtime always has a goal differential of exactly one goal, i.e., one of the teams gets the win by scoring that extra goal. In the CSV, the column “ot” is a Boolean (TRUE/FALSE) indicating whether the game went to overtime.
Load the CSV teams.csv into a data frame. This CSV contains names of teams for the league. They are fictitious and were originally created by a generative AI agent. Inspect the data frame and the CSV file to ensure it has been loaded properly. Consider creating a smaller version of the file that has only four teams; it is easier to debug that way – the use of abbreviated data files is common during development.
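The loading and inspection step might look like the sketch below. The layout of teams.csv is not shown in this assignment, so the example writes a stand-in four-team file to a temporary location first. The column name `team` and the entry "Placeholder Team" are assumptions made for illustration only.

```r
# Stand-in for teams.csv. The real file's column name and full team list are
# not shown in the assignment, so `team` and "Placeholder Team" are assumptions.
small.file <- tempfile(fileext = ".csv")
writeLines(c("team",
             "Arctic Avalanche",
             "Frostbite Phantoms",
             "Polar Pioneers",
             "Placeholder Team"),
           small.file)

# Load the file; keep team names as character strings rather than factors.
teams.df <- read.csv(small.file, header = TRUE, stringsAsFactors = FALSE)

# Inspect what was loaded.
str(teams.df)    # column types and a preview of the values
head(teams.df)   # first rows
nrow(teams.df)   # number of teams

# Basic sanity checks: no missing and no duplicated team names.
stopifnot(!anyNA(teams.df$team), !any(duplicated(teams.df$team)))
```

With the real file, replace `small.file` with "teams.csv" and adjust the column name to whatever `str()` reports.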
Create an empty data frame that has the columns you need.
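One way to do this, as a minimal sketch, is to give each column a zero-length vector of the intended type. The column names are taken from the CSV header shown earlier; the choice of integer for the goal columns is an assumption.

```r
# Zero-row data frame whose columns already have the intended types.
games <- data.frame(
  home.team       = character(),
  away.team       = character(),
  home.team.goals = integer(),   # assumption: goals stored as integers
  away.team.goals = integer(),
  ot              = logical(),
  stringsAsFactors = FALSE
)

str(games)   # 0 obs. of 5 variables
```

Declaring the types up front means rows appended later with rbind() keep consistent column types.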
Start by generating a game schedule where each team plays every other team once; then append the same games with the home and away teams reversed. After that, you will have a game schedule where each team plays every other team twice.
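The two-pass approach can be sketched with combn(), which enumerates every unordered pair of teams once. This is only one possible approach (nested loops work as well). The short team vector, including "Placeholder Team", is a stand-in for the names loaded from teams.csv.

```r
# Stand-in team list; "Placeholder Team" is a made-up entry for illustration.
teams <- c("Arctic Avalanche", "Frostbite Phantoms",
           "Polar Pioneers", "Placeholder Team")

# combn() returns a 2 x choose(n, 2) matrix; transpose it to one pair per row.
pairs <- t(combn(teams, 2))

# First pass: each pair meets once.
first.half <- data.frame(home.team = pairs[, 1],
                         away.team = pairs[, 2],
                         stringsAsFactors = FALSE)

# Second pass: the same games with home and away reversed.
second.half <- data.frame(home.team = first.half$away.team,
                          away.team = first.half$home.team,
                          stringsAsFactors = FALSE)

schedule <- rbind(first.half, second.half)

# n teams should yield n * (n - 1) games in total.
n <- length(teams)
stopifnot(nrow(schedule) == n * (n - 1))
```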
Generate scores for each team (perhaps use the random number generator function runif(), which generates uniformly distributed random numbers within a range). Because runif() returns real numbers, convert the results to whole goals, e.g., with floor() or round(). If the score is tied, set the “ot” flag and then randomly give an extra goal to one of the two teams to simulate an overtime win.
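The scoring step could be sketched as follows, using vectorized operations rather than a loop. The upper bound of 7 goals, the fixed seed, and the three-team stand-in schedule are assumptions chosen for illustration.

```r
set.seed(42)  # fixed seed so the sketch is reproducible (assumption)

# Stand-in schedule: every ordered pair of three teams.
teams <- c("Arctic Avalanche", "Frostbite Phantoms", "Polar Pioneers")
games <- expand.grid(home.team = teams, away.team = teams,
                     stringsAsFactors = FALSE)
games <- games[games$home.team != games$away.team, ]
rownames(games) <- NULL

max.goals <- 7            # assumed upper bound on regulation goals
n.games   <- nrow(games)

# runif() returns real numbers, so floor() maps them to whole goals 0..max.goals.
games$home.team.goals <- as.integer(floor(runif(n.games, min = 0, max = max.goals + 1)))
games$away.team.goals <- as.integer(floor(runif(n.games, min = 0, max = max.goals + 1)))

# A game tied after regulation goes to overtime.
games$ot <- games$home.team.goals == games$away.team.goals

# Sudden death: a coin flip per game decides who gets the extra goal.
home.wins.ot <- runif(n.games) < 0.5
add.home <- games$ot & home.wins.ot
add.away <- games$ot & !home.wins.ot
games$home.team.goals[add.home] <- games$home.team.goals[add.home] + 1L
games$away.team.goals[add.away] <- games$away.team.goals[add.away] + 1L

print(games)
```

Setting the flag before adding the extra goal matters: once the goal is added, the scores no longer look tied.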
Once all the game data has been generated, save the file as a CSV under the name “gameschedule.csv”.
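A minimal sketch of the save step follows. The leading "" column in the sample format corresponds to the row names that write.csv() writes by default, so no extra arguments are needed. The two rows are taken from the sample output above; writing to tempdir() instead of the project folder is a choice made only to keep the example free of side effects.

```r
# Two games copied from the sample output, as stand-in data.
games <- data.frame(
  home.team       = c("Arctic Avalanche", "Arctic Avalanche"),
  away.team       = c("Frostbite Phantoms", "Polar Pioneers"),
  home.team.goals = c(5L, 4L),
  away.team.goals = c(7L, 3L),
  ot              = c(FALSE, TRUE),
  stringsAsFactors = FALSE
)

# In the assignment this would simply be "gameschedule.csv" in the project folder.
out.file <- file.path(tempdir(), "gameschedule.csv")

# Default row.names = TRUE produces the leading "" column of row numbers.
write.csv(games, file = out.file)

# Show the raw file contents to compare against the required format.
cat(readLines(out.file), sep = "\n")
```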
In a separate R program, load the CSV and validate the schedule: Is the number of games correct given the number of teams? Are ties resolved? If the overtime flag is TRUE, is the goal differential exactly 1?
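These checks might be expressed as a named logical vector, as in the sketch below. To stay self-contained, the example first writes a tiny two-team schedule to a temporary file. The check names and the two extra checks (no team plays itself, no duplicate pairings) are additions beyond what the assignment lists.

```r
# Stand-in data: a complete two-team schedule written to a temporary file.
csv.file <- tempfile(fileext = ".csv")
write.csv(data.frame(
  home.team       = c("Arctic Avalanche", "Frostbite Phantoms"),
  away.team       = c("Frostbite Phantoms", "Arctic Avalanche"),
  home.team.goals = c(5L, 4L),
  away.team.goals = c(7L, 3L),
  ot              = c(FALSE, TRUE)
), csv.file)

# row.names = 1 treats the leading unnamed column as row names.
sched <- read.csv(csv.file, row.names = 1, stringsAsFactors = FALSE)

n.teams   <- length(unique(c(sched$home.team, sched$away.team)))
goal.diff <- abs(sched$home.team.goals - sched$away.team.goals)

checks <- c(
  game.count   = nrow(sched) == n.teams * (n.teams - 1),
  no.self.play = all(sched$home.team != sched$away.team),
  unique.pairs = !any(duplicated(sched[, c("home.team", "away.team")])),
  no.ties      = all(goal.diff > 0),
  ot.diff.one  = all(goal.diff[sched$ot] == 1)
)

print(checks)
stopifnot(all(checks))
```

Printing the named vector before stopifnot() shows which individual check failed when the schedule is invalid.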
Create an R Notebook and build a “report” that lists the teams and then produces the “standings”, i.e., a table that lists the teams in order of points. Award 2 points to the winner, 1 point to a team that loses in overtime, and 0 points to a team that loses in regulation.
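A standings table under this 2-1-0 scheme could be computed as sketched below. The first four games are the sample rows shown earlier; the last two are made up to complete a three-team schedule. Breaking points ties alphabetically is an assumption, since the assignment does not specify a tie-breaker.

```r
# Rows 1-4 come from the sample output; rows 5-6 are made up for illustration.
games <- data.frame(
  home.team = c("Arctic Avalanche", "Arctic Avalanche", "Frostbite Phantoms",
                "Frostbite Phantoms", "Polar Pioneers", "Polar Pioneers"),
  away.team = c("Frostbite Phantoms", "Polar Pioneers", "Arctic Avalanche",
                "Polar Pioneers", "Arctic Avalanche", "Frostbite Phantoms"),
  home.team.goals = c(5L, 4L, 7L, 7L, 2L, 1L),
  away.team.goals = c(7L, 3L, 3L, 6L, 3L, 4L),
  ot = c(FALSE, TRUE, FALSE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

home.won <- games$home.team.goals > games$away.team.goals

# Winner: 2 points. Loser: 1 point if the game went to overtime, otherwise 0.
home.pts <- ifelse(home.won,  2L, ifelse(games$ot, 1L, 0L))
away.pts <- ifelse(!home.won, 2L, ifelse(games$ot, 1L, 0L))

# One row per team per game, then total the points by team.
per.team <- data.frame(team   = c(games$home.team, games$away.team),
                       points = c(home.pts, away.pts),
                       stringsAsFactors = FALSE)
standings <- aggregate(points ~ team, data = per.team, FUN = sum)

# Highest points first; alphabetical order breaks ties (assumption).
standings <- standings[order(-standings$points, standings$team), ]
rownames(standings) <- NULL
print(standings)
```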
Add another section to the R Notebook that creates a different table based on a scoring system in which a regulation winner gets 3 points, an overtime winner gets 2 points, and an overtime loser gets 1 point.
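To compare the two schemes side by side, the point values can be passed as parameters to a helper, as in the sketch below. The function name `standings.table` and its parameters are hypothetical. The sketch assumes that “the winner” refers to a win in regulation and that a regulation loss is worth 0 points. The first four games are sample rows from earlier; the last two are made up.

```r
# Rows 1-4 come from the sample output; rows 5-6 are made up for illustration.
games <- data.frame(
  home.team = c("Arctic Avalanche", "Arctic Avalanche", "Frostbite Phantoms",
                "Frostbite Phantoms", "Polar Pioneers", "Polar Pioneers"),
  away.team = c("Frostbite Phantoms", "Polar Pioneers", "Arctic Avalanche",
                "Polar Pioneers", "Arctic Avalanche", "Frostbite Phantoms"),
  home.team.goals = c(5L, 4L, 7L, 7L, 2L, 1L),
  away.team.goals = c(7L, 3L, 3L, 6L, 3L, 4L),
  ot = c(FALSE, TRUE, FALSE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Hypothetical helper: standings for any point scheme.
standings.table <- function(games, reg.win, ot.win, ot.loss, reg.loss = 0) {
  home.won <- games$home.team.goals > games$away.team.goals
  win.pts  <- ifelse(games$ot, ot.win,  reg.win)    # points for the game's winner
  loss.pts <- ifelse(games$ot, ot.loss, reg.loss)   # points for the game's loser
  per.team <- data.frame(
    team   = c(games$home.team, games$away.team),
    points = c(ifelse(home.won, win.pts, loss.pts),
               ifelse(home.won, loss.pts, win.pts)),
    stringsAsFactors = FALSE
  )
  tab <- aggregate(points ~ team, data = per.team, FUN = sum)
  tab <- tab[order(-tab$points, tab$team), ]        # ties broken alphabetically
  rownames(tab) <- NULL
  tab
}

two.point   <- standings.table(games, reg.win = 2, ot.win = 2, ot.loss = 1)
three.point <- standings.table(games, reg.win = 3, ot.win = 2, ot.loss = 1)

# Side-by-side view of both schemes, one row per team.
comparison <- merge(two.point, three.point, by = "team",
                    suffixes = c(".2pt", ".3pt"))
print(comparison)
```

Under the second scheme only regulation wins gain value, so teams that win often in overtime fall behind teams that win in regulation.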
Provide some analysis of the difference in the standings.