Synthetic data is artificially generated information that mimics real-world data in structure, distribution, and relationships but is not directly derived from actual observations. In this lesson, we present a worked example of engineering a synthetic dataset of restaurant visits and sales transactions. Specifically, we combine statistical methods with rule-based data generation and make the result more realistic by adding noise and missing data.
The number of transactions generated as well as many parameters for the rules and statistical methods are configurable.
To hold synthetic data for restaurant visits, the CSV file should include attributes that capture the details of each visit. The structure should reflect real-world scenarios so that the data is versatile and relevant for analysis. In this example, we generate a synthetic tabular dataset (CSV) that can be used for database design, data warehouse design, data analytics, and regression modeling.
Dates are stored in the YYYY-MM-DD format and times in the HH:MM format.
This structure ensures the dataset is rich enough for various types of analyses, including customer behavior, meal preferences, peak hours, and revenue forecasting.
We will use a rules-based approach to generating the data, augmented with statistical methods and random noise.
The code below configures common parameters used later during the generation of the synthetic data.
The code below defines vectors of meal types, accepted payment methods, and genders for customers. Additional values can be added to these vectors, and the generation code will pick them up because values are drawn from the vectors at random. Note that other values depend on the values drawn; e.g., the base amount spent by each person in a party depends on the type of meal and is set in perPersonCheck, so if a meal type is added to the vector mealTypes, then a corresponding mean price and standard deviation for that meal must also be added to perPersonCheck.
mealTypes <- c("Breakfast","Lunch","Dinner","Take-Out")
payTypes <- c("Credit Card", "Cash", "Mobile Payment")
genders <- c("f","m","n","u")
genderProbs <- c(0.4,0.4,0.05,0.15)
perPersonCheck <- data.frame(
mean = c(8, 12, 22, 19),
sd = c(2, 4, 2.5, 2.1)
)
partySizeProbs <- c(0.38,0.3,0.09,0.13,0.07,0.03)
tippingScale <- data.frame (
amount = c(0.15,0.18,0.2,0.22,0.25,0.3),
prob = c(0.4,0.2,0.2,0.05,0.1,0.05)
)
numTxns <- 10 ## number of transactions/restaurant visits
#numTxns <- 139874 ## number of transactions/restaurant visits
probCustomerKnown <- 0.3 ## probability that customer is known and is in
## loyalty program
percentVisitTimesToRemove <- 0.06 ## percentage of visits that should not
## have a known visit time, i.e., missing value
avgWaitTime <- 8 ## average wait time (in minutes) for a table
sdWaitTime <- 4.5 ## standard deviation for wait time assuming
## wait times are normally distributed
minHourlyWage <- 10.25 ## minimum hourly wage for servers
hourlyWageIncrease <- 1.25 ## hourly wage increases
hireBirthYearMin <- 1960 ## birth years for servers
hireBirthYearMax <- 2007
firstHireYear <- 2018 ## first year for hire date
lastHireYear <- 2024 ## also range of dates for visits
curYear <- substr(Sys.Date(), 1, 4) ## current year (as a string)
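Because several of these parameters must stay in step with each other, a quick consistency check can catch mistakes before any data is generated. The following is a minimal sketch, not part of the original lesson; the helper `isProbVector` and the variable `configOK` are hypothetical names, and the parameter values are repeated only so that the sketch runs on its own.

```r
## stand-in copies of the parameters defined above
mealTypes <- c("Breakfast","Lunch","Dinner","Take-Out")
perPersonCheck <- data.frame(mean = c(8, 12, 22, 19), sd = c(2, 4, 2.5, 2.1))
genders <- c("f","m","n","u")
genderProbs <- c(0.4,0.4,0.05,0.15)
partySizeProbs <- c(0.38,0.3,0.09,0.13,0.07,0.03)
tippingScale <- data.frame(amount = c(0.15,0.18,0.2,0.22,0.25,0.3),
                           prob = c(0.4,0.2,0.2,0.05,0.1,0.05))

## hypothetical helper: TRUE if p is non-negative and sums to 1
## (a tolerance is needed because of floating-point rounding)
isProbVector <- function (p, tol = 1e-8)
{
  all(p >= 0) && abs(sum(p) - 1) < tol
}

## every meal type needs a row in perPersonCheck, every gender needs a
## probability, and all probability vectors must sum to 1
configOK <- nrow(perPersonCheck) == length(mealTypes) &&
  length(genderProbs) == length(genders) &&
  isProbVector(genderProbs) &&
  isProbVector(partySizeProbs) &&
  isProbVector(tippingScale$prob)

stopifnot(configOK)
```

Running a check like this after editing mealTypes or genders makes a forgotten row in perPersonCheck or a forgotten probability fail immediately rather than deep inside the generation loops.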
The dataframe df below is the shell for the synthetically generated data. Additional columns are added as they are generated.
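The code that creates the shell is not reproduced in this text. A minimal sketch, assuming the shell holds only a running visit number in a column named VisitID (the first column in the sample output), could look like this:

```r
numTxns <- 10   ## same value as in the parameters above

## one row per visit; all other columns are added by later steps
df <- data.frame(VisitID = 1:numTxns)
```

Starting from a frame that already has numTxns rows means that later assignments of the form df$Column[t] <- value always address an existing row.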
This section generates personal information about servers. The server names are read from a CSV containing artificial names generated using generatedata.com. The code below generates hire start and end dates; some servers have no end date, which means that they still work as servers. Birth dates are generated as well, along with tax IDs (social security numbers). Based on how long a server has been employed, an hourly wage is calculated from the starting minimum hourly wage in the parameters set above; the wage increases by the hourlyWageIncrease parameter for every six months (approximated as 180 days) of employment.
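To make the wage rule concrete, the sketch below applies it to a single server. The helper name `hourlyRateFor` is hypothetical; the arithmetic mirrors the vectorized calculation used in this section, and the two calls use date pairs from the sample output.

```r
minHourlyWage <- 10.25        ## as in the parameters above
hourlyWageIncrease <- 1.25

## hypothetical helper: hourly rate for one server
hourlyRateFor <- function (startDate, endDate)
{
  ## whole days employed
  daysWorked <- as.integer(as.Date(endDate) - as.Date(startDate))
  ## completed six-month periods, approximated as 180 days each
  periods <- daysWorked %/% 180
  return (round(minHourlyWage + periods * hourlyWageIncrease, 2))
}

## 456 days employed = 2 completed periods
rateA <- hourlyRateFor("2021-02-27", "2022-05-29")
## 205 days employed = 1 completed period
rateB <- hourlyRateFor("2018-01-02", "2018-07-26")
print(c(rateA, rateB))
```

These reproduce the rates of 12.75 and 11.50 shown for the first and fourth servers in the sample output.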
## load server names from generated CSV
server.names <- read.csv("synth-server-names.csv")
num.Servers <- nrow(server.names)
## dataframe for server information
servers.df <- data.frame(
EmpID = sample(1008:3389, num.Servers, replace = F),
ServerName = server.names$x
)
## generate a hire date after the first year parameter
## add a leading 0 to months < 10 and days < 10, so date is 2023-05-09 instead of 2023-5-9
months <- sample(1:12, num.Servers, replace = T)
months <- paste0(ifelse(months < 10, "0", ""), months)
## days are limited to 1..28 so that every date is valid in every month
days <- sample(1:28, num.Servers, replace = T)
days <- paste0(ifelse(days < 10, "0", ""), days)
servers.df$StartDateHired <- paste0("",
sample(firstHireYear:lastHireYear, num.Servers, replace = T), "-",
months, "-", days)
## generate a termination date assuming that 40% of servers still work with us
for (y in 1:num.Servers) {
startDate <- as.Date(servers.df$StartDateHired[y])
if (runif(1) < 0.4) {
# still working with restaurant; no termination date
servers.df$EndDateHired[y] <- ""
} else {
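servers.df$EndDateHired[y] <- (function(s) {
## whole days between the hire date and today
daysElapsed <- as.integer(Sys.Date() - s)
## end date falls at least 2 days after hiring and 2 days before today
daysInFuture <- sample(2:(daysElapsed-2), 1)
return(paste0("", s + daysInFuture))
}) (startDate)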
}
}
## if there's no end date of employment, use today's date to calculate hourly rate
hireEndDates <- ifelse(servers.df$EndDateHired == "",
paste0("",Sys.Date()),
servers.df$EndDateHired)
## assign hourly rate based on number of months worked
servers.df$HourlyRate <- round((minHourlyWage +
as.integer(((as.Date(hireEndDates) -
as.Date(servers.df$StartDateHired))) /
(6*30))* hourlyWageIncrease), 2)
## generate birthdate
servers.df$BirthDate <- paste0("",
sample(1:12, num.Servers, replace = T), "/",
sample(1:28, num.Servers, replace = T), "/",
sample(hireBirthYearMin:hireBirthYearMax, num.Servers, replace = T))
## generate artificial social security number
servers.df$SSN <- paste0("",
sample(100:999, num.Servers, replace = T), "-",
sample(10:99, num.Servers, replace = T), "-",
sample(1111:9899, num.Servers, replace = T))
To make the data more realistic, we remove several values from the birth date and social security number columns. In addition, for two of the servers, we append a middle initial to their name.
## remove some random values to create missing values
servers.df$BirthDate[sample(1:num.Servers, 4)] <- ""
servers.df$SSN[sample(1:num.Servers, 6)] <- ""
## for two of the names, add a middle initial at the end
z <- sample(1:num.Servers, 2)
servers.df$ServerName[z] <- paste0(servers.df$ServerName[z], " K.")
## EmpID ServerName StartDateHired EndDateHired HourlyRate BirthDate
## 1 3279 Smith, Emily 2021-02-27 2022-05-29 12.75 6/24/1990
## 2 3200 Johnson, Michael 2022-11-09 2023-07-28 11.50 2/28/1987
## 3 1102 Brown, Olivia 2018-02-01 26.50 10/14/1965
## 4 2385 Garcia, Daniel 2018-01-02 2018-07-26 11.50 7/6/1990
## 5 1837 Martinez, Sophia 2019-04-22 24.00 6/27/1961
## 6 1544 Davis, Christopher 2024-10-26 10.25 2/20/1981
## SSN
## 1 266-84-9166
## 2 278-89-9665
## 3 800-39-4198
## 4
## 5 491-25-4420
## 6 127-58-7150
The restaurant names are stored in a CSV file; the names were generated by ChatGPT 4o. The idea is that all of the restaurants belong to a holding company that operates several restaurants, so servers can move between restaurants.
## RestaurantName hasServers
## 1 Burger Haven yes
## 2 Patty Palace yes
## 3 Grill & Thrill yes
## 4 The Burger Joint yes
## 5 Stacked & Sizzled yes
## 6 Bite & Bun yes
Assign the servers to a restaurant at random if the restaurant has service.
r <- sample(1:nrow(restaurants.df), nrow(servers.df), replace = T)
servers.df$Restaurant <- ifelse(restaurants.df$hasServers[r] == "yes",
restaurants.df$RestaurantName[r], "")
To make it more realistic, we now assign a few servers to more than one restaurant. We pick three servers from the first restaurant and duplicate their records: two of them also work at the second restaurant and one also works at the third. Two of the duplicated records carry a higher hourly rate, and one has its end date cleared so that the server is still employed there.
s <- sample(which(servers.df$Restaurant == restaurants.df$RestaurantName[1]), 3, replace = F)
servers.df <- rbind(servers.df,servers.df[s,])
servers.df$Restaurant[nrow(servers.df)-2] <- restaurants.df$RestaurantName[2]
servers.df$Restaurant[nrow(servers.df)-1] <- restaurants.df$RestaurantName[2]
servers.df$Restaurant[nrow(servers.df)] <- restaurants.df$RestaurantName[3]
servers.df$HourlyRate[nrow(servers.df)-1] <- servers.df$HourlyRate[nrow(servers.df)-1] + 1.45
servers.df$HourlyRate[nrow(servers.df)] <- servers.df$HourlyRate[nrow(servers.df)] + 2.10
servers.df$EndDateHired[nrow(servers.df)-1] <- ""
Assign a restaurant to each visit at random, assuming that a visit to any restaurant is equally likely. If the restaurant is a “sit-down” restaurant with table service, choose a server from that restaurant and add all of the available server information. For a restaurant without servers, fill in default values for the missing server information.
for (t in 1:numTxns) {
## assign a random restaurant assuming a visit to any restaurant is equally likely
r <- sample(1:nrow(restaurants.df), 1)
df$Restaurant[t] <- restaurants.df$RestaurantName[r]
## assign a random server from that restaurant
## add all server information to the row
if (restaurants.df$hasServers[r] == "no") {
## restaurant has no servers
df$ServerEmpID[t] <- ""
df$ServerName[t] <- "N/A"
df$StartDateHired[t] <- "0000-00-00"
df$EndDateHired[t] <- "9999-99-99"
df$HourlyRate[t] <- 0
df$ServerBirthDate[t] <- ""
df$ServerTIN[t] <- ""
} else {
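# choose a random server at that restaurant; index with sample.int() because
# sample(k, 1) on a single candidate k would draw from 1:k instead
idx <- which(servers.df$Restaurant == df$Restaurant[t])
e <- idx[sample.int(length(idx), 1)]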
df$ServerEmpID[t] <- servers.df$EmpID[e]
df$ServerName[t] <- servers.df$ServerName[e]
df$StartDateHired[t] <- servers.df$StartDateHired[e]
df$EndDateHired[t] <- servers.df$EndDateHired[e]
df$HourlyRate[t] <- servers.df$HourlyRate[e]
df$ServerBirthDate[t] <- servers.df$BirthDate[e]
df$ServerTIN[t] <- servers.df$SSN[e]
}
}
Assign a random date for the visit that falls within the time that the server assigned to that visit works at that restaurant. A server can work at different times at more than one restaurant. Assign a random date for a restaurant without servers (where the end date is “9999-99-99”).
for (t in 1:numTxns) {
if (df$EndDateHired[t] == "9999-99-99") {
## no servers at the restaurant, so pick a date before today
month <- sample(1:12,1)
if (month < 10) month <- paste0("0", month)
df$VisitDate[t] <- paste0("20", sample(19:24, 1), "-",
month, "-",
sample(10:28, 1))
} else {
## find the server's hire start and end assigned to that visit
theServer.start <- as.Date(df$StartDateHired[t])
## ifelse() drops the Date class, so convert today's date to a string first
theServer.end <- as.Date(ifelse(df$EndDateHired[t] == "", paste0("", Sys.Date()), df$EndDateHired[t]))
## calculate the days between then and now; if there's no end date, use "today"
numDays <- as.integer(theServer.end - theServer.start)
## pick random date (hack is necessary to coerce date to a string)
df$VisitDate[t] <- paste0("", theServer.start + sample(0:numDays, 1))
}
}
Assigning a visit time first requires determining a meal type, so that the time is during the common meal time for that type of meal.
genVisitTime <- function (mealType)
{
  ## minutes run from 0 to 59
  minute <- sample(0:59,1)
  ## default window (used for take-out) covers all opening hours
  hour <- sample(5:23,1)
  if (mealType == "Breakfast") hour <- sample(5:11,1)
  if (mealType == "Lunch") hour <- sample(11:15,1)
  if (mealType == "Dinner") hour <- sample(16:20,1)
  ## zero-pad so the result is always in HH:MM format
  return (sprintf("%02d:%02d", hour, minute))
}
for (t in 1:numTxns) {
mealType <- mealTypes[sample(1:length(mealTypes),1)]
## add the visit time
df$VisitTime[t] <- genVisitTime(mealType)
## add the meal type
df$MealType[t] <- mealType
}
We’ll introduce some “noise” by removing some visit times so that it can be used as an example for data imputation.
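The code for this step is not reproduced in this text. A minimal sketch, assuming that a missing visit time is stored as an empty string (the wait time function later checks for exactly that) and that the share of rows to blank out comes from percentVisitTimesToRemove, might look as follows. The small stand-in data frame and the names `numToRemove` and `rowsToBlank` are hypothetical.

```r
set.seed(42)   ## only so that the sketch is reproducible
percentVisitTimesToRemove <- 0.06   ## as in the parameters above

## stand-in for the visits dataframe built so far
df <- data.frame(VisitID = 1:50,
                 VisitTime = rep("12:30", 50),
                 stringsAsFactors = FALSE)

## number of visit times to remove, rounded to whole rows
numToRemove <- round(nrow(df) * percentVisitTimesToRemove)

## pick distinct rows at random and blank out their visit time
rowsToBlank <- sample(1:nrow(df), numToRemove)
df$VisitTime[rowsToBlank] <- ""
```

With 50 rows and a rate of 0.06, three visit times end up blank; which rows are affected depends on the random draw.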
Generate a random party size, i.e., how many guests are eating in. For “take-out”, this is the number of meals ordered. Add random wait times for some customers, but only at times when the restaurant is likely to be busy and the order is not “take-out”. Party sizes are drawn using probabilities, as not all party sizes are equally likely. The wait time is drawn from a normal distribution with mean avgWaitTime and a standard deviation that is adjusted for the meal type. Since the random numbers drawn could be negative, an adjustment is made to ensure that wait times are not negative.
The genders for each member of the party are randomly selected from the genders parameter vector based on a configurable probability distribution in genderProbs.
genWaitTime <- function (mealType, partySize, visitTime)
{
  ## more likely to wait at dinner
  ## more wait time for parties > 4
  ## average wait time configurable
  t <- 1
  ## hour of the visit; 0 if the visit time is missing
  hour <- ifelse(visitTime == "",
                 0,
                 as.integer(substring(visitTime, 1, 2)))
  ## check the large-party case first so that it is not overwritten
  if (mealType == "Dinner" && partySize > 4) {
    t <- rnorm(1, mean = avgWaitTime, sd = 4) + 10
  } else if (mealType == "Dinner") {
    t <- rnorm(1, mean = avgWaitTime, sd = 4)
  } else if (mealType == "Lunch") {
    t <- rnorm(1, mean = avgWaitTime, sd = 3) - 1
  }
  ## no wait outside of the busy hours
  if ((hour < 7) || (hour == 10) || (hour >= 13 && hour < 18) || (hour >= 20))
    t <- 0
  ## a normal draw can be negative, so clamp the wait time at zero
  return (max(0, t))
}
Now, let’s assign a random party size (based on an empirical probability distribution making larger parties a lot less likely), random genders, and a random wait time (based on meal type and time of day).
for (t in 1:numTxns) {
## random party size
df$PartySize[t] <- sample(1:length(partySizeProbs), 1, prob = partySizeProbs)
## random party genders
memGenders <- ""
for (m in 1:df$PartySize[t]) {
## assign a random gender to each party member
memGenders <- paste0(memGenders,
genders[sample(1:length(genders), 1, prob = genderProbs)])
}
## store the genders of the party as one string, e.g., "fmf"
df$Genders[t] <- memGenders
## add a random wait time based on party size, time of day, and whether or not
## take-out or sit-down restaurant
df$WaitTime[t] <- round(genWaitTime(df$MealType[t], df$PartySize[t], df$VisitTime[t]),0)
}
We’ll introduce some “noise” by making some party sizes an “outlier” or “nonsensical” value so that it can be used as an example for bad data detection and used for data imputation, perhaps by predicting them from the other features and the total bill amount. We’ll use the value of 99 to indicate a missing party size – this, of course, is poor data quality management, but, unfortunately, quite common.
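The code for this step is not reproduced in this text. The sketch below shows one way it could be done and how the sentinel can be detected afterwards. The rate `percentPartySizesToCorrupt` is a hypothetical parameter (the lesson does not state the rate it uses), and the stand-in data frame exists only so that the sketch runs on its own.

```r
set.seed(7)   ## only so that the sketch is reproducible
percentPartySizesToCorrupt <- 0.05   ## hypothetical rate
partySizeProbs <- c(0.38,0.3,0.09,0.13,0.07,0.03)

## stand-in for the visits dataframe built so far
df <- data.frame(VisitID = 1:40, PartySize = rep(2, 40))

## overwrite a random subset of party sizes with the sentinel value 99
numToCorrupt <- round(nrow(df) * percentPartySizesToCorrupt)
df$PartySize[sample(1:nrow(df), numToCorrupt)] <- 99

## detection: any size above the largest configured party size is suspect
maxPartySize <- length(partySizeProbs)
suspectRows <- which(df$PartySize > maxPartySize)
```

Flagging values above the largest configured party size, rather than testing for 99 alone, also catches other implausible values that might enter the data.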
In this section, we will generate information about the party visiting the restaurant, i.e., the “customer”. A customer ID, name, phone, and email were synthetically generated using generatedata.com for about 2500 customers.
The function below selects a random customer, although for a configurable percentage of visits no customer information is selected and the fields are left blank.
## generate a customer info for some customers randomly
genCustomerInfo <- function ()
{
if (runif(1) > probCustomerKnown) {
# no customer info
return (NULL)
} else {
# pick a random customer from the list of customers
set.seed(80763+(runif(1)*500))
custInfo <- customers.df[sample(1:nrow(customers.df),1),]
return (custInfo)
}
}
Next, we’ll add the customer information to the visit. If a customer was selected, then we assume that this means that they are a member of the restaurant chain’s loyalty program. Of course, the same customer can visit the same restaurant or different restaurants multiple times.
for (t in 1:numTxns) {
## choose a random customer or no information
cust <- genCustomerInfo()
## add to dataframe if information was selected
if (!is.null(cust)) {
df$CustomerName[t] <- cust$name
df$CustomerPhone[t] <- cust$phone
df$CustomerEmail[t] <- cust$email
df$LoyaltyMember[t] <- TRUE
} else {
df$CustomerName[t] <- ""
df$CustomerPhone[t] <- ""
df$CustomerEmail[t] <- ""
df$LoyaltyMember[t] <- FALSE
}
}
Next, we’ll generate the total amount for the bill. It is a random value based on the meal type and the size of the party.
The cost of a meal (per person) is based on parameters for the meal type and is then selected randomly from a normal distribution. The code below selects costs repeatedly until the cost is more than $0.90.
genFoodBill <- function (mealType, partySize)
{
repeat {
c <- rnorm(1,
mean = perPersonCheck$mean[which(mealTypes == mealType)],
sd = perPersonCheck$sd[which(mealTypes == mealType)])
if (c > 0.9)
break
}
## multiply the average meal cost by size of party
## if the size of party is 99, then it means it is missing
## and we'll generate a random party size
if (partySize == 99)
partySize <- sample(1:length(partySizeProbs), 1, prob = partySizeProbs)
return(round(c * partySize, 2))
}
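The tip is a random percentage between 15% and 30%, drawn from tippingScale according to its probabilities; no tip is added for take-out orders. This is a bit unrealistic, as many customers would also round up to a whole dollar. Note that TipAmount stores the tip rate rather than a dollar amount. The function also accepts the party size and the genders, although the current implementation does not use them.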
genTipAmount <- function (mt, ps, gn)
{
tipAmount <- 0.15 # standard tip is 15%
tipAmount <- ifelse(mt == "Take-Out", 0,
sample(tippingScale$amount, 1,
prob = tippingScale$prob))
return(tipAmount)
}
If a customer is part of the loyalty program, we’ll generate a discount for them that is either 10% or 15%. The discount is 10% with probability 0.8 and 15% otherwise.
genDiscount <- function (flag)
{
discount <- 0
## discount only for loyalty program members
if (flag == TRUE)
discount <- ifelse(runif(1) < 0.8, 0.1, 0.15)
return(discount)
}
for (t in 1:numTxns)
{
## generate total amount for bill based on meal type and party size
df$FoodBill[t] <- genFoodBill(df$MealType[t], df$PartySize[t])
## add a random tip amount if it is not "take-out"
df$TipAmount[t] <- genTipAmount(df$MealType[t], df$PartySize[t])
## apply discount if in loyalty program
df$DiscountApplied[t] <- genDiscount(df$LoyaltyMember[t])
## add a random payment type
df$PaymentMethod[t] <- payTypes[sample(1:length(payTypes), 1)]
}
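Next, we randomly flag whether the party ordered alcohol (never for take-out), then add a separate alcohol charge and redraw the tip.
The function encodes the following “drinking rules”, which are checked in order; a party that does not trigger one rule falls through to the next. No alcohol is ordered for take-out or breakfast. A single diner orders alcohol with probability 0.9 at dinner and 0.4 at lunch. A dinner party of two of the same gender (“mm” or “ff”) orders with probability 0.8. Any dinner party of two or more orders with probability 0.6. Any lunch party orders with probability 0.1. Otherwise, no alcohol is ordered.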
genAlcoholOrdered <- function (ps, mt, gn)
{
## no alcohol for take-out or breakfast
if (mt == "Take-Out" || mt == "Breakfast") {
return ("no")
}
if ((mt == "Dinner") && (ps == 1) && (runif(1) < 0.9)) {
## single person for dinner
return ("yes")
}
if ((mt == "Lunch") && (ps == 1) && (runif(1) < 0.4)) {
## single person for lunch
return ("yes")
}
if ((mt == "Dinner") && (ps == 2) && (gn == "mm" || gn == "ff") && (runif(1) < 0.8)) {
## party of two of same-sex for dinner
return ("yes")
}
if ((mt == "Dinner") && (ps > 1) && (runif(1) < 0.6)) {
## any party of two or more for dinner
return ("yes")
}
## any party for lunch (the meal type must match "Lunch" in mealTypes exactly)
if ((mt == "Lunch") && (runif(1) < 0.1)) {
return ("yes")
}
## otherwise default is no drinking
return ("no")
}
The amount spent on alcohol should depend on the meal type, the number of people in the party, the average cost per drink, and the genders. The implementation below is a first pass at this.
genAlcoholBill <- function (mt, ps, gn)
{
## alcohol bill is number of people in the party
## multiplied by average cost per drink by random number of drinks
## for percentage of party who will drink
if (ps == 99)
ps <- sample(1:length(partySizeProbs), 1, prob = partySizeProbs)
alcoholBill <- ps * avgCostPerDrink * 1.25
return (alcoholBill)
}
for (t in 1:numTxns)
{
## randomly add alcohol
df$orderedAlcohol[t] <- genAlcoholOrdered(df$PartySize[t],
df$MealType[t],
df$Genders[t])
## if alcohol was consumed, then up the bill by a random amount
## based on a random number of people drinking
if (df$orderedAlcohol[t] == "yes") {
## add a separate amount for alcohol
df$AlcoholBill[t] <- genAlcoholBill(df$MealType[t],
df$PartySize[t],
df$Genders[t])
## adjust tip amount based on new bill
df$TipAmount[t] <- genTipAmount(df$MealType[t],
df$PartySize[t],
df$Genders[t])
} else {
df$AlcoholBill[t] <- 0
}
}
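The dataset stores the food bill, the alcohol bill, the tip rate, and the discount rate in separate columns, so a total has to be derived from them. The lesson does not define that formula; the sketch below is one possible reading, assuming that the discount applies to food only and that the tip rate applies to the discounted subtotal. The helper name `computeTotalBill` is hypothetical.

```r
## hypothetical helper; the formula is an assumption, not taken from the lesson
computeTotalBill <- function (foodBill, alcoholBill, tipRate, discountRate)
{
  ## assumption: the loyalty discount applies to food only
  subtotal <- foodBill * (1 - discountRate) + alcoholBill
  ## assumption: the tip rate applies to the discounted subtotal
  tip <- subtotal * tipRate
  return (round(subtotal + tip, 2))
}

## values taken from two rows of the sample output
totalLoyalty <- computeTotalBill(70.09, 0, 0.15, 0.1)       ## loyalty member, no alcohol
totalAlcohol <- computeTotalBill(21.78, 10.3125, 0.20, 0)   ## dinner for one with alcohol
print(c(totalLoyalty, totalAlcohol))
```

Because the function only uses arithmetic, it can also be applied to whole columns at once, e.g., computeTotalBill(df$FoodBill, df$AlcoholBill, df$TipAmount, df$DiscountApplied).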
## VisitID Restaurant ServerEmpID ServerName StartDateHired
## 1 1 Big Bite Burgers 1154 Thomas, Benjamin 2019-11-22
## 2 2 Flame Shack N/A 0000-00-00
## 3 3 Flame Shack N/A 0000-00-00
## 4 4 Big Bite Burgers 1154 Thomas, Benjamin 2019-11-22
## 5 5 Big Bite Burgers 1057 Hernandez, Charlotte 2018-02-19
## 6 6 The Burger Joint 2050 Walker, Grace 2018-06-18
## 7 7 Burger Haven 1151 Turner, Andrew 2022-07-22
## 8 8 Bun Fi N/A 0000-00-00
## 9 9 The Burger Joint 2050 Walker, Grace 2018-06-18
## 10 10 Burger Haven 2221 Lee, Liam 2023-05-26
## EndDateHired HourlyRate ServerBirthDate ServerTIN VisitDate VisitTime
## 1 22.75 3/16/1970 2020-11-03 19:08
## 2 9999-99-99 0.00 2020-07-14 12:20
## 3 9999-99-99 0.00 2024-06-26 13:02
## 4 22.75 3/16/1970 2020-08-02 08:04
## 5 26.50 1/12/1977 896-81-4376 2023-05-30 18:39
## 6 26.50 12/7/2006 912-95-4514 2019-02-13 06:43
## 7 2023-03-07 11.50 635-99-5777 2022-10-15 14:48
## 8 9999-99-99 0.00 2024-08-27 14:31
## 9 26.50 12/7/2006 912-95-4514 2022-10-26 14:36
## 10 2023-06-17 10.25 9/1/1966 979-83-2646 2023-06-03 09:08
## MealType PartySize Genders WaitTime CustomerName CustomerPhone
## 1 Dinner 99 f 7
## 2 Take-Out 99 fmfn 1
## 3 Lunch 99 ummmf 0
## 4 Breakfast 1 f 1
## 5 Dinner 1 m 12
## 6 Breakfast 99 um 0
## 7 Lunch 99 uf 0
## 8 Lunch 99 um 0 Aphrodite Watson (570) 651-7873
## 9 Lunch 99 mffmff 0
## 10 Breakfast 99 u 1
## CustomerEmail LoyaltyMember FoodBill TipAmount DiscountApplied
## 1 FALSE 20.14 0.18 0.0
## 2 FALSE 78.52 0.00 0.0
## 3 FALSE 15.38 0.20 0.0
## 4 FALSE 10.60 0.15 0.0
## 5 FALSE 21.78 0.20 0.0
## 6 FALSE 4.62 0.15 0.0
## 7 FALSE 9.12 0.20 0.0
## 8 lobortis@yahoo.org TRUE 70.09 0.15 0.1
## 9 FALSE 26.77 0.22 0.0
## 10 FALSE 52.85 0.18 0.0
## PaymentMethod orderedAlcohol AlcoholBill
## 1 Mobile Payment yes 10.3125
## 2 Cash no 0.0000
## 3 Credit Card no 0.0000
## 4 Mobile Payment no 0.0000
## 5 Mobile Payment yes 10.3125
## 6 Credit Card no 0.0000
## 7 Credit Card no 0.0000
## 8 Credit Card no 0.0000
## 9 Credit Card no 0.0000
## 10 Mobile Payment no 0.0000
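The step that writes the generated data to disk is not reproduced in this text. A minimal sketch using write.csv is shown below; the file name matches the one read back during validation, but the temporary folder and the three-row stand-in data frame are assumptions made so that the sketch runs on its own.

```r
## stand-in for the generated visits
df <- data.frame(VisitID = 1:3,
                 Restaurant = c("Burger Haven", "Flame Shack", "Bun Fi"),
                 stringsAsFactors = FALSE)

## written to a temporary folder here; the lesson reads "restaurant-visits.csv"
outFile <- file.path(tempdir(), "restaurant-visits.csv")

## row.names = FALSE keeps R's row numbers out of the file, so the number
## of columns in the CSV equals the number of columns in df
write.csv(df, outFile, row.names = FALSE)
```

Omitting row.names = FALSE would add an unnamed extra column of row numbers to the file.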
## add meta data: source, author, link to source (or program that generated data), date
## features with definitions, intended use, CC4
xml <- paste0(xml, '<meta-data>', '\n')
xml <- paste0(xml, ' <date-generated>', Sys.Date(), '</date-generated>', '\n')
xml <- paste0(xml, ' <author>', '\n')
xml <- paste0(xml, ' <name>','Martin Schedlbauer, PhD', '</name>', '\n')
xml <- paste0(xml, ' <contact>','m.schedlbauer@northeastern.edu', '</contact>', '\n')
xml <- paste0(xml, ' </author>', '\n')
xml <- paste0(xml, ' <source>','Synthetic', '</source>', '\n')
xml <- paste0(xml, ' <description>', '\n')
xml <- paste0(xml, ' A data set containing visits to a restaurant that is part of a chain\n')
xml <- paste0(xml, ' of restaurants.', '\n')
xml <- paste0(xml, ' </description>', '\n')
xml <- paste0(xml, ' <use-cases>', '\n')
xml <- paste0(xml, ' <use-case>machine learning</use-case>\n')
xml <- paste0(xml, ' <use-case>data analytics</use-case>\n')
xml <- paste0(xml, ' <use-case>missing values, outlier, and extreme value handling</use-case>\n')
xml <- paste0(xml, ' <use-case>normalization with functional dependencies</use-case>\n')
xml <- paste0(xml, ' <use-case>database import</use-case>\n')
xml <- paste0(xml, ' <use-case>CSV queries with SQL through sqldf</use-case>\n')
xml <- paste0(xml, ' <use-case>XML queries using XPath</use-case>\n')
xml <- paste0(xml, ' </use-cases>', '\n')
xml <- paste0(xml, ' <fields>', '\n')
xml <- paste0(xml,
' <field name="VisitID" type="numeric">',
'Unique ID for the visit.',
' </field>', '\n')
xml <- paste0(xml, ' <field name="BirthDate" type="date">')
xml <- paste0(xml, 'Birth date of server/employee. Blank if not known.')
xml <- paste0(xml, ' </field>', '\n')
xml <- paste0(xml, ' </fields>', '\n')
xml <- paste0(xml, '</meta-data>', '\n')
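The statements above append to a string named xml, which has to be initialized first and written to a file at the end; neither step is reproduced in this text. The sketch below shows both under stated assumptions: the file name is hypothetical, and the helper `xmlEscape` is an addition for text that contains characters such as the "&" in "Grill & Thrill".

```r
## hypothetical helper: escape characters that are special in XML text
xmlEscape <- function (s)
{
  s <- gsub("&", "&amp;", s, fixed = TRUE)   ## must be replaced first
  s <- gsub("<", "&lt;", s, fixed = TRUE)
  s <- gsub(">", "&gt;", s, fixed = TRUE)
  return (s)
}

## start with the XML declaration, then append elements as shown above
xml <- '<?xml version="1.0" encoding="UTF-8"?>\n'
xml <- paste0(xml, '<meta-data>', '\n')
xml <- paste0(xml, ' <source>', xmlEscape('Synthetic'), '</source>', '\n')
xml <- paste0(xml, '</meta-data>', '\n')

## hypothetical file name; written to a temporary folder for this sketch
metaFile <- file.path(tempdir(), "restaurant-visits-meta.xml")
cat(xml, file = metaFile)
```

cat() writes the string exactly as built, without adding a further line break at the end.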
df.validation <- read.csv("restaurant-visits.csv",
header = T,
stringsAsFactors = F)
print(paste0(nrow(df.validation), " rows read vs ",
numTxns, " rows written"))
## [1] "10 rows read vs 10 rows written"
## [1] "Num columns read: 25"
## VisitID Restaurant ServerEmpID ServerName StartDateHired
## 1 1 Big Bite Burgers 1154 Thomas, Benjamin 2019-11-22
## 2 2 Flame Shack NA N/A 0000-00-00
## 3 3 Flame Shack NA N/A 0000-00-00
## 4 4 Big Bite Burgers 1154 Thomas, Benjamin 2019-11-22
## 5 5 Big Bite Burgers 1057 Hernandez, Charlotte 2018-02-19
## 6 6 The Burger Joint 2050 Walker, Grace 2018-06-18
## EndDateHired HourlyRate ServerBirthDate ServerTIN VisitDate VisitTime
## 1 22.75 3/16/1970 2020-11-03 19:08
## 2 9999-99-99 0.00 2020-07-14 12:20
## 3 9999-99-99 0.00 2024-06-26 13:02
## 4 22.75 3/16/1970 2020-08-02 08:04
## 5 26.50 1/12/1977 896-81-4376 2023-05-30 18:39
## 6 26.50 12/7/2006 912-95-4514 2019-02-13 06:43
## MealType PartySize Genders WaitTime CustomerName CustomerPhone CustomerEmail
## 1 Dinner 99 f 7
## 2 Take-Out 99 fmfn 1
## 3 Lunch 99 ummmf 0
## 4 Breakfast 1 f 1
## 5 Dinner 1 m 12
## 6 Breakfast 99 um 0
## LoyaltyMember FoodBill TipAmount DiscountApplied PaymentMethod
## 1 FALSE 20.14 0.18 0 Mobile Payment
## 2 FALSE 78.52 0.00 0 Cash
## 3 FALSE 15.38 0.20 0 Credit Card
## 4 FALSE 10.60 0.15 0 Mobile Payment
## 5 FALSE 21.78 0.20 0 Mobile Payment
## 6 FALSE 4.62 0.15 0 Credit Card
## orderedAlcohol AlcoholBill
## 1 yes 10.3125
## 2 no 0.0000
## 3 no 0.0000
## 4 no 0.0000
## 5 yes 10.3125
## 6 no 0.0000
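Beyond comparing row counts, the file that was read back can be checked for the noise that was injected on purpose. The sketch below is an illustration only: the four-row stand-in frame and the counter names are hypothetical, and the column count line is a guess at the statement that produced the second line of output above.

```r
## stand-in for a few columns of the file read back above
df.validation <- data.frame(
  VisitID = 1:4,
  ServerEmpID = c(1154, NA, NA, 1057),
  VisitTime = c("19:08", "", "13:02", "08:04"),
  PartySize = c(99, 2, 99, 1),
  stringsAsFactors = FALSE
)

print(paste0("Num columns read: ", ncol(df.validation)))

## visit times that were removed come back as empty strings (or NA)
numMissingTimes <- sum(is.na(df.validation$VisitTime) |
                       df.validation$VisitTime == "")

## party sizes carrying the sentinel value for "missing"
numBadPartySizes <- sum(df.validation$PartySize == 99)

## blank employee IDs in a numeric column are read back as NA
numNoServer <- sum(is.na(df.validation$ServerEmpID))
```

The last check reflects what the output above shows: a blank ServerEmpID written to the CSV is read back as NA because the column is otherwise numeric.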
Synthetic data is artificially generated information that mimics real-world data in structure and characteristics. In this lesson, we presented a worked example showing some of the techniques for synthetic data generation. It is essential to keep in mind that synthetic data has limitations, including reduced realism, risks of overfitting, validation challenges, and potential regulatory issues.