Random number generation is a cornerstone of many computational tasks, including simulation, synthetic data generation, and statistical analysis. At its core, the objective is to produce sequences of numbers that appear random and adhere to the statistical properties required by a given application. However, computers, being deterministic machines, cannot generate truly random numbers without external physical processes. Instead, they rely on algorithms to produce “pseudo-random” numbers, which are deterministic yet statistically mimic true randomness. In this lesson we will explain the most common random number algorithms with examples in R.
The short video below provides an overview of the problem of random number generation on computers and common approaches to solving it.
Algorithms for Random Number Generation
The three most common random number generation algorithms are:
Linear Congruential Generator (LCG)
Mersenne Twister
Cryptographically Secure Generators
Linear Congruential Generator
One of the simplest and most widely used algorithms for generating pseudo-random numbers is the Linear Congruential Generator. Mathematically, an LCG produces a sequence of numbers using the recurrence relation:
\[
x_{n+1} = (a x_n + c) \mod m
\]
Where:
\(x_0\) is the seed or initial value
\(a\) is the multiplier
\(c\) is the increment
\(m\) is the modulus
The numbers \(x_n\) are then normalized to the interval [0,1) by dividing by \(m\). The choice of \(a\), \(c\), and \(m\) is crucial for ensuring the generator has a long period (the length of the sequence before it starts repeating) and good statistical properties. For instance, with \(c = 0\) and \(m\) a large prime, choosing \(a\) to be a primitive root modulo \(m\) yields the maximum possible period of \(m - 1\); with \(c \neq 0\), the Hull-Dobell theorem gives conditions on \(a\), \(c\), and \(m\) under which the period reaches \(m\) itself.
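To make the recurrence concrete, below is a minimal LCG sketch in R. The parameters are the well-known Numerical Recipes constants, used here purely for illustration; they are not the values any particular system uses.

# A minimal LCG sketch; parameters a, c, m are illustrative
lcg <- function(n, seed = 1, a = 1664525, c = 1013904223, m = 2^32) {
  x <- numeric(n)
  state <- seed
  for (i in 1:n) {
    state <- (a * state + c) %% m  # the LCG recurrence
    x[i] <- state / m              # normalize to [0, 1)
  }
  x
}
lcg(5)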
Mersenne Twister
The Mersenne Twister is a more modern and sophisticated pseudo-random number generator. It was specifically designed to have an extremely long period (\(2^{19937}-1\)) and excellent statistical properties. It operates using a large state vector (624-dimensional) and generates numbers in batches, which are then transformed to provide a uniform distribution in [0,1). The algorithm’s name derives from its period length being a Mersenne prime. While computationally more expensive than LCGs, the Mersenne Twister is widely used in simulation and scientific computation due to its high quality.
Cryptographically Secure Generators
For applications such as cryptography, where predictability is unacceptable, cryptographically secure pseudo-random number generators (CSPRNGs) are used. These generators rely on algorithms such as Blum-Blum-Shub or constructions based on secure hash functions. The key property of a CSPRNG is that even if part of the sequence is known, it is computationally infeasible to predict the next number in the sequence without knowledge of the internal state or seed.
Potential Problems with Random Number Generators
Despite their utility, pseudo-random number generators have several drawbacks and limitations. The primary issue is periodicity; all pseudo-random generators eventually repeat their sequences. Poorly chosen parameters in algorithms like LCG can lead to short periods or correlations that undermine the randomness. Additionally, deterministic algorithms can be exploited if the seed is known, making the sequence predictable. For critical applications, such as secure data transmission, hardware-based true random number generators (TRNGs), which rely on physical processes like thermal noise or radioactive decay, may be preferable.
Random Number Generation in R
The R programming language provides a robust set of functions and packages for generating random numbers. The default generator in R is based on the Mersenne Twister, but users can switch to other algorithms using the RNGkind function.
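As a quick sketch, you can query and switch the generator as follows (Wichmann-Hill is one of the alternative kinds base R supports):

RNGkind()                           # query the currently selected generators
RNGkind(kind = "Wichmann-Hill")     # switch to a different algorithm
set.seed(1)
runif(3)
RNGkind(kind = "Mersenne-Twister")  # restore the default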
To generate random numbers, R provides several functions tailored to specific distributions:
Uniform Distribution: runif(n, min = 0, max = 1) generates \(n\) random numbers uniformly distributed between min and max.
Normal Distribution: rnorm(n, mean = 0, sd = 1) generates \(n\) random numbers from a normal (Gaussian) distribution with specified mean and standard deviation.
Binomial Distribution: rbinom(n, size, prob) generates \(n\) random numbers from a binomial distribution with given size and probability.
Poisson Distribution: rpois(n, lambda) generates \(n\) random numbers from a Poisson distribution with rate parameter \(\lambda\).
Exponential Distribution: rexp(n, rate) generates \(n\) random numbers from an exponential distribution with a specified rate parameter.
Examples
To generate a sequence of 10 random numbers uniformly distributed between 0 and 1, you can use:
set.seed(42)  # Set a seed for reproducibility
random_numbers <- runif(10, min = 0, max = 1)
print(random_numbers)
The function set.seed() sets the “seed” for the random number generator. If the seed is fixed, the same set of random numbers is generated every time. This is useful for reproducibility but is problematic when the numbers must be random and should not be “guessable”. In that case, using a “random” seed is important; a common choice is the current time, e.g.,
# use system's current time as seed for random number generator
set.seed(Sys.time())
random_numbers <- runif(10, min = 0, max = 1)
print(random_numbers)
Verify for yourself that the sequence produced is different each time.
For a set of numbers that follow a normal distribution, use the function rnorm(). This produces a sequence of random numbers that are normally distributed around the provided mean with the provided standard deviation:
random_normals <- rnorm(10, mean = 5, sd = 2)
print(random_normals)
For a set of random numbers following a binomial distribution, use the function rbinom(). For example, to simulate flipping a coin 100 times with a 50% probability of heads:
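coin_flips <- rbinom(100, size = 1, prob = 0.5)  # each flip is one Bernoulli trial
sum(coin_flips)                                  # total number of heads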
Lastly, the function rpois() produces random numbers following a Poisson distribution, which commonly models counts of arrivals and departures in queuing systems, e.g., the number of cars passing through an intersection per minute or the number of visits to a website per hour. (The times between such events follow an exponential distribution instead.) For example, to model the number of events in a time interval using the Poisson distribution:
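# number of events in each of 10 intervals, assuming an average rate of 4 per interval
event_counts <- rpois(10, lambda = 4)
print(event_counts)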
Generating Random Numbers from Custom Distributions
In some cases, you may need random numbers from a distribution not directly supported by R. This can often be accomplished by transforming uniformly distributed random numbers. For example, to generate random numbers from a triangular distribution, you can use the inverse transform sampling method or the built-in functions in external R packages.
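As a sketch of the inverse transform approach, the function below inverts the triangular CDF in closed form. The helper name rtriangular and the parameter values are illustrative; this is not a base R function.

rtriangular <- function(n, a = 0, b = 1, c = 0.5) {
  u <- runif(n)             # uniform random numbers in [0, 1)
  f <- (c - a) / (b - a)    # value of the CDF at the mode c
  ifelse(u < f,
         a + sqrt(u * (b - a) * (c - a)),        # invert the left branch
         b - sqrt((1 - u) * (b - a) * (b - c)))  # invert the right branch
}

samples <- rtriangular(1000, a = 0, b = 10, c = 3)
hist(samples)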
Sampling
Sampling is the process of randomly selecting elements from a container, e.g., from a vector. This is useful when we need random samples of data, such as when selecting training data for machine learning.
The sample() function in R allows you to randomly select elements from a vector or from a range of values. The basic syntax of the function is:
sample(x, size, replace = FALSE, prob = NULL)
Here:
x is the vector or range from which to draw samples.
size specifies the number of samples to draw.
replace determines whether sampling is with replacement (default is FALSE), i.e., whether the same element can appear more than once in the sample.
prob is an optional vector of probabilities for weighted sampling; if not specified then all elements are equally likely to be selected.
For example, to draw 5 random samples from the numbers 1 through 10 without replacement, you can use:
set.seed(123)  # For reproducibility
samples <- sample(1:10, size = 5, replace = FALSE)
print(samples)
## [1] 3 10 2 8 6
If you want to sample with replacement, simply set replace = TRUE:
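set.seed(123)
samples_wr <- sample(1:10, size = 5, replace = TRUE)  # elements may repeat
print(samples_wr)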
The sample() function is particularly useful when you need to shuffle data, create bootstrap samples, or select random subsets of a dataset. For instance, to randomly reorder a vector, you can use:
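v <- 1:10
shuffled <- sample(v)  # with no size argument, sample() permutes the entire vector
print(shuffled)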
You may have wondered why the sample() function is necessary and why we could not simply have used the runif() function with some “programming”. In short, you would be correct: the sample() function is just a convenience function.
Synthetic Data
Synthetic data refers to artificially generated data that mimics the statistical properties and structure of real-world data without directly copying it. It is typically created through algorithms, mathematical models, or simulations rather than being collected from real-world observations. Synthetic data can take the form of text, images, video, or numerical datasets, depending on the context and application.
The importance of synthetic data lies in its ability to address several challenges associated with real-world data, such as privacy concerns, cost, data scarcity, and representativeness. In sensitive domains like healthcare or finance, using real data might compromise privacy, whereas synthetic data offers a privacy-preserving alternative. Additionally, synthetic data can help overcome the lack of sufficient training data in machine learning, ensuring that models perform better by training on diverse and balanced datasets.
Synthetic data is generally generated through a combination of domain-specific rules and mathematical models. Furthermore, using statistical techniques such as Gaussian distributions, copulas, or regression models based on random data can produce synthetic data that mirrors the statistical patterns observed in real datasets. This approach is common in economic or demographic modeling.
In addition, Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), the Synthetic Minority Oversampling Technique (SMOTE), data augmentation, and agent-based models are employed.
Generating Synthetic Sales Data
To generate a synthetic dataset for sales in an online store auctioning digital goods such as photos, we can create a realistic dataset that includes information such as transaction dates, customers, product categories, purchase quantities, and prices. This dataset can simulate actual patterns and is useful for testing analytics or machine learning applications.
1. Define the Dataset Structure
We start by defining the variables that our dataset will contain. For an online store auctioning unique digital photos, we might restrict ourselves to:
txnID: A unique identifier for each purchase.
category: A category for the photo (e.g., “Nature,” “Abstract,” “Travel”).
bid: Bid amount for the photo.
won: Indicator whether customer won the auction.
2. Generate the Data
Let’s take a look at how we can use R to simulate each variable with appropriate randomness. In the setup code below, we define the “seed” for the random number generators to ensure we get the same values every time the code is run. In addition, we set the number of rows, i.e., the number of auction transactions.
set.seed(33487)  # For reproducibility
n <- 1000        # Number of transactions
a. Transaction ID
We will generate the transaction identifiers as positive integers starting from a base and incrementing by one. Of course, any other method is fine as long as the identifiers are unique; starting them at 101 in the code below is an arbitrary choice.
# Generate Transaction IDs as sequential numbers starting at 101
transaction_ids <- 100 + seq(1, n)
Recall that n is the value for the number of transactions set prior.
b. Category
The photo category is randomly assigned using the sample() function. Each category is not equally likely to be selected because we provide a vector of probabilities (prob), which must add up to 1, of course. The values for the probabilities might have come from analyzing historical data or by looking at data from a related domain. If we had no prior knowledge of the probabilities, then we would assume each category to be equally likely to be chosen.
The categories are pre-defined in a vector, but could also be read from a file, which would make the code more extensible.
# Select a category from a list
product_categories <- c("Nature", "Abstract", "Travel", "Architecture", "Wildlife")
categories <- sample(product_categories, n, replace = TRUE,
                     prob = c(0.3, 0.2, 0.25, 0.15, 0.1))
Note that we are sampling with replacement (replace = TRUE) as each category can be chosen more than once.
c. Bid Amount
Next, we need to generate a bid amount. We can reasonably surmise that bids would follow a normal (Gaussian) distribution around some mean and having some standard deviation. We might obtain these parameters from historical data or by analyzing other types of auctions.
We assume that the distribution is normal, but this may not be a correct assumption; we should verify it by analyzing historical bid prices and checking whether they are reasonably normally distributed. Assuming normality for now, we can use the rnorm() function to generate bid prices, making some guesses regarding mean and standard deviation.
bids <- rnorm(n, mean = 3.87, sd = 1.72)
The rnorm() function will generate random numbers whose distribution is centered around the mean; about 95% of the values will fall within roughly two standard deviations of the mean. Of course, some values will be further from the mean, and some may even be negative. The code below sets all negative values to a default value of 0.01.
# Replace negative bids with a small positive default
bids <- ifelse(bids < 0, 0.01, bids)
Of course, there are other ways to have done this, such as using the which() function. Note that since bids is a vector, the ifelse() function is applied to each element of the vector. In other programming languages this would have required a loop, but R allows vector operations. Vector operations are preferable in R because they are simpler to write, make the code easier to read, and run significantly faster than loops. However, loops might be easier to understand for some programmers, and writing the code with loops at first is reasonable; looking to replace loops with vector operations afterwards is important for efficiency.
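For comparison, here is a sketch of the which()-based alternative mentioned above:

# Equivalent fix: use which() to find the indexes of the negative bids
negative_idx <- which(bids < 0)
bids[negative_idx] <- 0.01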
A remaining issue is the precision of the bid values. Since they represent a monetary quantity (perhaps US$), it might make sense to round the values to two digits of precision.
bids <- round(bids, 2)
d. Winning Flag
The final column to generate for our auction data set is a “flag” (Boolean indicator) that is T if the customer won the bid, and F if they did not win. We would again rely on historical data to see the frequency at which auctions are won. If we had an “Ask Price” from the artist, then we could apply a rule that would correlate the likelihood of winning with the “Bid Price”. However, for now, let’s assume that it is won with a probability of 0.5, i.e., equal likelihood. In the code below, we generate for each bid a uniformly distributed random number from 0 to 1 using the runif() function, i.e., every value is equally likely to occur and consequently half the numbers will be below 0.5 and half above.
won <- ifelse(runif(n) < 0.5, T, F)
Alternatively, we could have also used the sample() function. As an exercise, see if you can rewrite the code using the sample() function rather than runif().
We can expect that half the bids will be won and half will not be. Verify this for yourself.
Now, let’s say that we know from prior observation that only about 22% of the auctions have a winning bid that is accepted. Let’s modify the code so that for 22% of the bids the flag is T and for the remaining 78% it is F. Since runif() returns values between 0 and 1 that are uniformly distributed, we can simply check whether the value is less than 0.22; we can expect this to be the case for about 22% of the values.
won <- ifelse(runif(n) < 0.22, T, F)
Let’s assure ourselves that this is correct by counting how many of “won” values are T and what the percentage is.
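num_won <- length(which(won == TRUE))  # count how many flags are TRUE
print(num_won)
print(num_won / n)                     # fraction of auctions that were won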
In the code above, the which() function returns a vector of the indexes of the elements that have a value of T (TRUE), and the length() function counts the elements in that vector, telling us how many are T. We can see that the percentage is reasonably close to 22%; of course, it is not exact, as we are generating random numbers. The more we generate, the closer it will get to 22% (according to the Law of Large Numbers).
Create Data Frame
Now that we have generated the data, we can create a data frame containing the data as columns.
# Combine into a data frame
auction_data <- data.frame(txnID = transaction_ids,
                           category = categories,
                           bid = bids,
                           didWin = won)
Let’s display a few of those auction transactions:
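head(auction_data)  # show the first few rows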
By carefully defining the structure and statistical properties, the dataset somewhat realistically represents an online auction house’s operations while avoiding real-world data collection challenges.
Summary
Random number generation is essential for computational tasks ranging from simulations to statistical analyses. While pseudo-random number generators like the Mersenne Twister provide a balance between speed and statistical quality, users must understand the limitations and select appropriate methods for their specific applications. R offers an extensive toolkit for random number generation, allowing users to simulate a wide range of distributions efficiently.