Introduction

Improving the run-time performance of R code is critical for handling large datasets and complex computations, especially in machine learning, applied AI, data mining, and statistical analysis. This lesson explores several approaches, ranging from optimizing data structures to parallel computing techniques, with the objective of making R code more efficient.

Many of the techniques presented in this lesson are universally applicable, particularly to other interpreted languages like Python and JavaScript. The relative slowness of R and Python compared to languages like Java and C++ can be attributed to several factors related to their design philosophies, runtime environments, and typical use cases. Understanding these differences is essential when choosing an approach to code optimization.

Firstly, R and Python are interpreted languages, whereas C++ is compiled ahead of time to machine code and Java is compiled to bytecode that the Java Virtual Machine translates to machine code with a just-in-time (JIT) compiler. In interpreted languages, code is evaluated by an interpreter at runtime, which introduces overhead because each expression must be dispatched and executed dynamically. Compiled code, in contrast, runs on the CPU without that interpretation overhead, which generally results in faster execution times for Java and C++ programs.

R and Python are also dynamically typed languages. This means that type checking happens at runtime rather than at compile time. While this provides greater flexibility and ease of use, it incurs additional overhead because the interpreter must manage type information and perform type checking during execution. Java and C++, being statically typed languages, perform type checking at compile time, which helps optimize the generated machine code for better performance.

Another significant factor in the poorer run-time performance of R and Python is the way both manage memory. R and Python use automatic garbage collection to manage memory, which periodically frees up memory that is no longer in use. While this simplifies memory management for the programmer, it can cause unpredictable pauses in program execution as the garbage collector runs. Java also uses garbage collection but benefits from more advanced and optimized garbage collection algorithms compared to those typically found in R and Python. C++, on the other hand, requires manual memory management by the programmer, which affords full control over memory allocation and deallocation. This programmer-driven approach allows for highly optimized memory usage but requires careful programming to avoid memory leaks and other issues.

The standard libraries and data structures provided by R and Python are designed for ease of use and flexibility rather than raw performance. For example, Python’s list and dictionary objects are highly flexible but not as efficient as Java’s or C++’s array and hash table implementations. Similarly, R’s data frame and list structures prioritize convenience and ease of manipulation over performance. In contrast, Java and C++ standard libraries are often designed with performance in mind, providing more efficient data structures and algorithms. In fact, many of the high-performance functions in third-party packages for R and Python are written in C, C++, or Fortran.

R and Python are also heavily used in data analysis, scientific computing, and machine learning, where the ease of prototyping and the availability of extensive libraries (such as NumPy, pandas, and scikit-learn in Python, and caret, psych, data.table and ggplot2 in R) are more critical than raw execution speed. These languages often serve as “glue” languages, where performance-critical parts of the code are implemented in more efficient languages like C or C++ and then called from R or Python. This approach leverages the strengths of both high-level and low-level languages. It is not unusual for exploration and research to be done in R and Python and then, once an approach is settled upon, the code is rewritten for performance in Java or C++. In fact, R Notebooks allow code chunks to be in R, Python, Java, SQL, and C++ which provides full control to the programmer.

Furthermore, R and Python emphasize readability and ease of use, which can sometimes lead to less efficient code. For example, Python’s use of dynamic typing and high-level data structures encourages a programming style that may be more intuitive but less efficient. Java and C++, in contrast, encourage more explicit and optimized programming practices due to their statically typed nature and lower-level control over system resources.

Lastly, the execution model of these languages differs. R, for example, is designed for statistical computing and data analysis, where typical workloads involve vectorized operations and data manipulation tasks that can be more efficiently handled by specialized libraries written in C or even (good old) FORTRAN. Python, while more general-purpose, is often used in similar domains and benefits from extensive C and C++ extensions to mitigate performance issues.

In short, the slowness of R and Python compared to Java and C++ stems from their interpreted nature, dynamic typing, memory management strategies, and design priorities that emphasize ease of use and flexibility over raw performance. Understanding these trade-offs is essential for selecting the right language for your specific application and for leveraging the strengths of each language effectively.

A final word of caution: An experienced programmer always begins by writing correct code that is simple and easy to follow. Run-time is secondary to correctness. Only if the performance of the code does not meet essential non-functional requirements should changes to the code be entertained. Improvement must proceed in stages, starting with the simplest modifications.

Measuring Performance

Any attempt to improve run-time performance must start with measurement: profile the code and time the relevant sections so that you know where the time is actually spent. Lesson 6.134 – Measure Run-Time Performance of R Code demonstrates different methods for profiling code.

If an attempted optimization does not yield measurable and material reduction in run-time and the optimization resulted in less readable and less maintainable code, then the optimization should be rolled back. Readability, correctness, and maintainability are paramount.
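
As a minimal sketch of such a measurement (the lesson referenced above covers profiling in more depth), the base R function system.time() can time two candidate implementations so that an optimization is kept only if it actually helps. The function names slow_version and fast_version are hypothetical stand-ins for your own code.

```r
# Two hypothetical implementations of the same task: summing square roots
slow_version <- function(n) {
  total <- 0
  for (i in seq_len(n)) {
    total <- total + sqrt(i)
  }
  total
}

fast_version <- function(n) {
  sum(sqrt(seq_len(n)))
}

n <- 1e6

# system.time() reports user, system, and elapsed times in seconds
t_slow <- system.time(r_slow <- slow_version(n))
t_fast <- system.time(r_fast <- fast_version(n))

# confirm that both versions agree before comparing their speed
same_result <- isTRUE(all.equal(r_slow, r_fast))

cat("loop:", t_slow[["elapsed"]], "s; vectorized:", t_fast[["elapsed"]], "s\n")
```

Checking that the results agree first keeps correctness ahead of speed; the timings themselves will vary from machine to machine.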

Summary of Techniques

These techniques should generally be tried in this order:

  1. replace searching via loops with which() and any()
  2. use integers for lookup values (keys) rather than strings/text
  3. use apply() instead of loops
  4. use vectorized operations instead of loops
  5. avoid sqldf unless no simple alternatives exist
  6. avoid regular expressions or substring searches whenever possible
  7. prefer data.table over data frames
  8. pre-allocate memory when possible
  9. employ parallel processing
  10. incorporate C++
  11. save large objects

In the sections below, we will look at most of these strategies in more detail, with examples.
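
As a rough illustration of item 6, base R offers literal alternatives to pattern matching. The sketch below uses a small made-up vector x; whether the literal versions are measurably faster on your data is something to time rather than assume.

```r
x <- c("apple pie", "banana", "pineapple", "grape")

# regular expression: does the string start with "pine"?
by_regex <- grepl("^pine", x)

# the same test without a regular expression
by_prefix <- startsWith(x, "pine")

# literal substring search; fixed = TRUE skips regex interpretation
has_apple <- grepl("apple", x, fixed = TRUE)

print(by_prefix)
print(has_apple)
```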

Throwing Money at the Problem

Of course, a common “business approach” to performance optimization is to “throw iron at the problem”, i.e., to purchase a computer with a more powerful CPU, more memory, a GPU, and perhaps a Neural Processor – or to spin up a more powerful virtual machine with a cloud provider. It might cost a lot of money, but it may be less than having a programmer spend time improving the code (and perhaps introducing bugs) – it may be a “quick fix” but it may be the right approach in many situations.

Performance Improvement Approaches

This section provides specific strategies for improving the performance of R code by actually tweaking the code or rewriting it in a different – and more economical – way. They are presented in order of complexity, starting with the least complex. Naturally, not all strategies are necessary at all times and some of the later approaches may not be worth the complexity introduced into the code. Remember that readability and maintainability are also important, not just performance.

1. Avoid Loops

Loops are a fundamental construct in programming, allowing for repeated execution of a block of code. However, in languages like R and Python, loops can be significantly slower compared to languages like Java and C++. This slowness is due to several factors, including the interpreted nature of these languages, the overhead of dynamic typing, and the way memory management is handled. Understanding the limitations of loops in R and Python, and knowing how to use alternatives like the apply family of functions in R, can help improve performance.

In R, loops are typically slower because each iteration involves interpretation overhead, dynamic type checking, and memory management tasks. This can become a bottleneck, especially when dealing with large datasets or performing complex computations within the loop. For instance, consider the following example of a simple loop in R that calculates the square of each element in a vector:

n <- 1e6 # number of iterations

# pre-allocate memory for result
result <- numeric(n)

for (i in 1:n) {
  result[i] <- i^2
}

In this code, the loop iterates one million times, and in each iteration it performs an arithmetic operation and assigns the result to an element of a vector. This repeated execution can be inefficient, as the interpreter handles each iteration individually. How long it takes depends on your machine and R version; if it finishes too quickly to measure, increase n (keeping in mind that a numeric vector needs 8 bytes per element, so a billion elements require about 8 GB of memory). Notice that the vector result is pre-allocated to n elements rather than grown dynamically. This is another common performance improvement and is discussed in more detail later in this section.

To mitigate the poor performance of loops, R provides a family of apply functions, which are designed to perform operations on elements of data structures like arrays, matrices, and data frames in a more efficient manner. The apply functions are internally optimized and can reduce the overhead associated with loops. For instance, the sapply function can be used to achieve the same result as the loop above, but more efficiently:

# Using sapply to square each element in a vector
n <- 1e6
result <- sapply(1:n, function(x) x^2)

In this example, sapply applies the anonymous function function(x) x^2 to each element of the vector 1:n and simplifies the result to a vector. The sapply version is more concise than the loop and can be faster, although the function is still called once per element, so the gain over a pre-allocated loop is often small and should be measured.

Another member of the apply family is lapply, which is similar to sapply but returns a list instead of a vector. This can be useful when the result of the function applied is not necessarily of the same length or type as the input. For example:

# Using lapply to create a list of squared numbers
n <- 1e6
result <- lapply(1:n, function(x) x^2)

While lapply returns a list, it can still be more efficient than a traditional loop due to its optimized implementation.

For operations on matrices, the apply function can be particularly useful. It allows you to apply a function to the rows or columns of a matrix without explicitly writing loops. Here’s an example where apply is used to calculate the row sums of a matrix:

# Creating a matrix
mat <- matrix(1:1e6, nrow = 1e3, ncol = 1e3)

# Using apply to calculate row sums
row_sums <- apply(mat, 1, sum)

In this example, apply takes three arguments: the matrix mat, the margin (1 for rows, 2 for columns), and the function to apply (in this case, sum). The apply function efficiently iterates over the rows of the matrix and calculates their sums.

Moreover, R provides specialized functions like colSums, rowSums, colMeans, and rowMeans for common operations on matrices, which are even more efficient than using apply. For example:

# Using rowSums to calculate row sums
row_sums <- rowSums(mat)

These specialized functions are highly optimized and should be preferred over apply for their respective operations.

To summarize, while loops in R can be slow due to the interpreted nature of the language and the overhead of dynamic typing and memory management, using the apply family of functions can significantly improve performance. The apply functions are internally optimized to handle operations on data structures more efficiently, reducing the execution time compared to traditional loops. Understanding and leveraging these functions is crucial for writing efficient R code, especially when dealing with large datasets or complex computations.

Even when the use of the apply() functions results in better performance, some programmers might find loops more intuitive especially when coming from a procedural language such as C++. Therefore, a first implementation of an algorithm might use loops. Only once the code is running correctly, do some programmers transition to the use of the apply() functions. Of course, as one becomes more familiar with R, code follows more of the natural R programming idioms.
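
To see how the alternatives compare on a particular machine, a sketch along the following lines can be used. It times a pre-allocated loop, sapply(), and plain vectorized arithmetic on the same task and checks that all three agree. The variable names are illustrative, and the relative timings are not guaranteed; they depend on the R version and hardware.

```r
n <- 1e5

# 1. explicit loop with a pre-allocated result vector
loop_result <- numeric(n)
t_loop <- system.time(
  for (i in seq_len(n)) {
    loop_result[i] <- i^2
  }
)

# 2. sapply(): still one function call per element
t_sapply <- system.time(
  sapply_result <- sapply(seq_len(n), function(x) x^2)
)

# 3. vectorized arithmetic: no per-element function calls in R code
t_vector <- system.time(
  vector_result <- seq_len(n)^2
)

# all three approaches must produce the same values
all_agree <- isTRUE(all.equal(loop_result, sapply_result)) &&
  isTRUE(all.equal(loop_result, vector_result))

# elapsed seconds for each approach
print(c(loop       = t_loop[["elapsed"]],
        sapply     = t_sapply[["elapsed"]],
        vectorized = t_vector[["elapsed"]]))
```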

2. Use which()

Rather than looping through a vector or a column in a data frame, use the which() function (and its existential counterpart any()), as it is much faster. The code below demonstrates this when searching a vector for the position of a particular value. Note that the loop stops at the first occurrence, whereas which() returns the positions of all occurrences:

n <- 1e8      # size of vector

# vector of random integer values
v <- round(runif(n, min = 100, max = n*n),0)

x <- 445876   # value to find
k <- 0        # position at which found

# insert value at a last position to ensure it is present
v[n] <- x

Now, let’s try to find the position of the key value x in the vector v. We will measure the run-time performance of both approaches (of course, on your computer the random values in v and the timings will be different).

Approach I: Linear Search with Loops

bt <- Sys.time()

for (i in 1:n) {
  if (x == v[i]) {
    k <- i
    break
  }
}

et <- Sys.time()

t.I <- et - bt

cat("Approach I / Loop: found ", x, "at position", k, "in", round((t.I),3), "seconds")
## Approach I / Loop: found  445876 at position 100000000 in 3.016 seconds

Approach II: Using which()

The approach below uses the which() function; as you can see it is substantially faster than looping.

bt <- Sys.time()

k <- which(x == v)

et <- Sys.time()

t.II <- et - bt

cat("Approach II / which: found ", x, "at position", k, "in", round((t.II),3), "seconds")
## Approach II / which: found  445876 at position 100000000 in 0.34 seconds

The difference in time is quite pronounced: roughly a 9x improvement in speed in this run.

Of course, if the values in the vector had been sorted, then a binary search would have been substantially faster (complexity of \(O(\log n)\) vs \(O(n)\)). Placing the values into a hash map rather than a linear vector would also make search faster (\(O(1)\) on average). R has no dedicated hash map type, but environments use hashing internally and can serve as string-keyed maps; for very large or shared lookups, an external in-memory key-value store (e.g., memcached) is another option.
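
Both ideas can be sketched in base R. The example below is an illustration under stated assumptions rather than a benchmark: it uses findInterval(), which performs a binary search on a sorted vector, and an environment as a string-keyed lookup table. The data and key names are made up.

```r
set.seed(42)

# a sorted vector of distinct integers
v_sorted <- sort(sample.int(1e7, 1e6))
x <- v_sorted[750000]          # a value known to be present

# binary search: largest index pos such that v_sorted[pos] <= x
pos <- findInterval(x, v_sorted)
found <- pos > 0 && v_sorted[pos] == x

# hashed lookup: an environment maps string keys to values
h <- new.env(hash = TRUE)
assign("445876", 100L, envir = h)

hit  <- get0("445876", envir = h, inherits = FALSE)  # stored value
miss <- get0("999", envir = h, inherits = FALSE)     # NULL when the key is absent
```

The binary search only pays off if the vector is already sorted or will be searched many times, since sorting itself costs more than a single linear scan.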

Quick Tutorial: which()

The which() function in R is an efficient mechanism for identifying the indices of elements in a logical vector that meet a specified condition. It is often used as an alternative to loops for conditional searches, providing a more concise and typically faster way to accomplish such tasks. Recall that a column in a data frame is a vector, so this also can be used to search columns of data frames.

The which() function returns the positions in a vector where an element in the vector meets a logical condition. Essentially, it scans through a logical vector and identifies the positions where the condition is met.

Consider a simple example where we want to find the positions (indices) of elements in a numeric vector that are greater than a specified value. In this example, which(vec > 5) returns the indices of vec where the elements are greater than 5.

# Create a numeric vector
vec <- c(2, 4, 6, 8, 10, 3, 5, 7, 9)

# Use which() to find positions of elements greater than 5
indices <- which(vec > 5)

print(indices)
## [1] 3 4 5 8 9

The which() function is also useful for identifying rows in a data frame that meet certain conditions. Let’s consider a data frame example:

# Create a data frame
df <- data.frame(
  ID = 1:10,
  Score = c(85, 90, 88, 78, 92, 95, 70, 65, 88, 91)
)

# Use which() to find rows where Score is greater than 90
rows <- which(df$Score > 90)

print(rows)
## [1]  5  6 10

Often, you want to subset the original vector or data frame based on the indices returned by which(). Here’s how you can do it:

# select rows in the data frame based on condition in a column
df.subset <- df[which(df$Score > 90), ]

head(df.subset)
##    ID Score
## 5   5    92
## 6   6    95
## 10 10    91

The which() function is more efficient than loops for identifying indices that meet a condition, as the comparison and the scan are carried out in compiled code. Once one is used to R, which() also makes the code more concise and readable than explicit loops; it is the natural and idiomatic way to program in R.
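
The related functions mentioned in this section can be sketched on the same small vector. The example shows any() for an existence test, two ways to obtain only the first match, and which.max() for the position of the largest element; the variable names are illustrative.

```r
vec <- c(2, 4, 6, 8, 10, 3, 5, 7, 9)

# any(): is there at least one element greater than 9?
has_large <- any(vec > 9)

# position of the first element greater than 5
first_pos <- which(vec > 5)[1]

# match(): position of the first element equal to 8
first_eq <- match(8, vec)

# which.max(): position of the largest element
max_pos <- which.max(vec)

print(c(first_pos = first_pos, first_eq = first_eq, max_pos = max_pos))
```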

3. Use Integer Keys

Keys should be integers whenever possible rather than character strings as comparing integers is much faster than comparing text strings (which are sequences of characters that have to be compared one character at a time). Of course, using a hash function to map strings to integers would be a good alternative. The code below compares the time to search an integer column versus a text column.

n <- 1e7

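df <- data.frame(
  # as.integer() stores the key column as integers rather than doubles
  ID = as.integer(round(runif(n, min = 1000, max = 100000), 0)),
  Score = round(runif(n, min = 40, max = 99), 0)
)

# embed lookup value in the last row (the L suffix makes it an integer)
lookup.value.int <- 445632L
df$ID[nrow(df)] <- lookup.value.int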

# add a text column
df$NUID <- as.character(df$ID)
lookup.value.chr <- as.character(lookup.value.int)
df$NUID[nrow(df)] <- lookup.value.chr

Integer Key Search:

bt <- Sys.time()

# lookup score based on ID
k <- df$Score[which(df$ID == lookup.value.int)]

et <- Sys.time()

t.integer <- et - bt

cat("Integer Key:", round((t.integer),3), "seconds")
## Integer Key: 0.035 seconds

Text/Character Key Search:

bt <- Sys.time()

# lookup score based on NUID (text key)
k <- df$Score[which(df$NUID == lookup.value.chr)]

et <- Sys.time()

t.text <- et - bt

cat("Text Key:", round((t.text),3), "seconds")
## Text Key: 0.197 seconds

The lookup of a character key is roughly 5 to 6 times slower in this run. Although, to be fair, it is still quite fast when using which(). Try writing the same code with a loop for comparison. What do you notice? What is the performance difference?
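
When the natural keys are strings, one possible approach is to translate them into integer codes once and search on the codes afterwards. The sketch below does this with match() against a table of unique keys; the key values are made up for illustration.

```r
# hypothetical string keys
keys <- c("NY", "CA", "TX", "CA", "NY", "WA")

# build the table of distinct keys once
key_table <- unique(keys)

# integer code for every key: its position in key_table
int_keys <- match(keys, key_table)

# translate the lookup value once, then compare integers
lookup_code <- match("CA", key_table)
rows <- which(int_keys == lookup_code)

print(int_keys)
print(rows)
```

The translation itself costs time, so this is mainly worthwhile when the same column is searched repeatedly.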

4. Leverage Efficient Data Manipulation Packages

One of the primary ways to enhance performance is through efficient data manipulation. Data frames in R are versatile, but they may not always be the best choice for performance-critical applications. Instead, consider using the data.table package, which offers an enhanced version of data frames. Data tables are designed for fast aggregation, joining, and in-place updates. They also consume less memory and execute operations more quickly due to their optimized internal structure.

For example, suppose you have a large dataset and need to perform frequent subsetting and aggregations. Using data.table can significantly reduce execution time. Here is a demonstration:

Using a Data Frame:

# Creating a large data frame
df <- data.frame(x = rnorm(1e8), 
                 y = rnorm(1e8))

# Subsetting and aggregation
system.time({
  result <- mean(df$y[which(df$x > 0)])
})
##    user  system elapsed 
##   0.921   0.210   1.131

Using a data.table:

library(data.table)

# Creating a large data table
dt <- data.table(x = rnorm(1e8), 
                 y = rnorm(1e8))

# Subsetting and aggregation
system.time({
  result <- dt[x > 0, .(mean_y = mean(y))]
})
##    user  system elapsed 
##   1.045   0.142   1.210

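In this run, the two approaches take about the same time (1.131 versus 1.210 seconds elapsed), so a simple filter followed by a mean does not by itself show an advantage for data.table. Its benefits tend to appear in large-scale analysis with grouped aggregations, joins, keyed lookups, and in-place updates, particularly when there are many columns.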
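
The operations that data.table is designed for (grouped aggregation, in-place updates, and keyed subsetting) can be sketched as follows. This assumes the data.table package is installed; the column names g, y, and y2 are made up for illustration.

```r
library(data.table)

set.seed(1)
dt <- data.table(
  g = sample(c("a", "b", "c"), 1e5, replace = TRUE),
  y = rnorm(1e5)
)

# grouped aggregation: mean of y and row count per group
by_group <- dt[, .(mean_y = mean(y), n = .N), by = g]

# in-place update: := adds a column without copying the table
dt[, y2 := y * 2]

# keyed subsetting: setkey() sorts by g so lookups can use binary search
setkey(dt, g)
only_b <- dt[.("b")]

print(by_group)
```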

5. Vectorize Operations

Vectorization is another critical technique for improving performance. R is inherently a vectorized language, meaning it can operate on entire vectors of data at once. By leveraging vectorized operations, you can avoid the overhead of explicit loops, which are often slow in R. For instance, instead of using a loop to add two vectors element-wise and total the result, you can use vectorized addition together with sum():

# Vectorized addition
a <- rnorm(1e7)
b <- rnorm(1e7)

system.time({
  c <- sum(a + b)
})
##    user  system elapsed 
##   0.032   0.000   0.032

Here, the vectorized addition is not only more concise but also much faster than looping through each element as demonstrated with the loop-based code below:

# Loop-based addition
a <- rnorm(1e7)
b <- rnorm(1e7)
c <- 0

system.time({
  for (i in 1:length(a)) {
    c <- c + (a[i] + b[i])
  }
})
##    user  system elapsed 
##   0.393   0.000   0.395
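
Conditional logic inside a loop can often be vectorized as well. As a sketch, the code below replaces negative values with zero in three ways: with a loop, with pmax(), and with ifelse(). It then confirms that the results agree. The variable names are illustrative.

```r
set.seed(7)
x <- rnorm(1e5)

# loop version: test each element individually
clipped_loop <- numeric(length(x))
for (i in seq_along(x)) {
  clipped_loop[i] <- if (x[i] < 0) 0 else x[i]
}

# vectorized versions: one call operates on the whole vector
clipped_pmax   <- pmax(x, 0)
clipped_ifelse <- ifelse(x < 0, 0, x)

results_agree <- isTRUE(all.equal(clipped_loop, clipped_pmax)) &&
  isTRUE(all.equal(clipped_loop, clipped_ifelse))
```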

6. Preallocate Memory

Memory management is another aspect where performance can be improved. R’s garbage collector handles memory allocation and deallocation automatically, but inefficient memory usage can still slow down your code. Pre-allocating memory for large objects can prevent unnecessary memory reallocation during execution. For example, if you are appending elements to a vector inside a loop, it is better to pre-allocate the vector’s size than growing the vector dynamically as new elements are inserted.

In the code below, the initial vector is created empty; as new elements are inserted, R must allocate more memory. Conceptually, a new, larger vector is allocated and the original vector’s elements are copied into it, and repeating this as the vector grows is what makes the pattern expensive.

# dynamically growing a vector
n <- 1e7
v <- numeric(0)

system.time({
  for (i in 1:n) {
    v[i] <- rnorm(1)
  }
})
##    user  system elapsed 
##  17.162   2.208  19.382

The code below pre-allocates the memory. Of course, this only works when the required number of elements is known beforehand. If it is not known, then the memory must be estimated, making certain not to overallocate.

# pre-allocating memory for a vector
n <- 1e7
v <- numeric(n)

system.time({
  for (i in 1:n) {
    v[i] <- rnorm(1)
  }
})
##    user  system elapsed 
##  14.771   1.882  16.861

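This approach avoids the overhead associated with dynamically growing the vector. In this run the saving is modest (about 17 seconds versus 19 seconds elapsed); the penalty for growing an object is typically much larger when it is extended with c(), append(), or rbind() inside a loop, because each call copies the entire object.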
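
The same idea applies when building a data frame piece by piece. One common pattern, sketched below with made-up column names, is to pre-allocate a list, fill it inside the loop, and combine the pieces once at the end rather than calling rbind() in every iteration.

```r
n <- 1000

# pre-allocate a list with one slot per row
rows <- vector("list", n)

for (i in seq_len(n)) {
  # each slot holds a one-row data frame
  rows[[i]] <- data.frame(id = i, square = i^2)
}

# combine all pieces in a single step
result <- do.call(rbind, rows)

head(result, 3)
```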

7. Parallelize Processing

Parallel computing can also significantly reduce computation time for tasks that can be parallelized. R offers several packages for parallel processing, such as parallel, foreach, and future. The parallel package, for instance, provides functions to execute code on multiple CPU cores simultaneously. Of course, the more cores a CPU has, the better.

Here’s an example using the mclapply function:

library(parallel)

# Parallel processing using mclapply
system.time({
  result <- mclapply(1:8, function(x) {
    sum(rnorm(1e6))
  }, mc.cores = 4)
})
##    user  system elapsed 
##   0.002   0.017   0.113

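In this example, the eight tasks are distributed across four cores. Note that mclapply relies on process forking, so values of mc.cores greater than 1 are not supported on Windows. No single-threaded baseline is shown here; to confirm the speed-up on your machine, time the same call with lapply (or with mc.cores = 1) and compare the elapsed times.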
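
A socket cluster is an alternative that also works on Windows. The following is a minimal sketch using makeCluster() and parLapply() from the parallel package; the choice of at most two workers and the seed value are arbitrary assumptions for illustration.

```r
library(parallel)

# use at most two workers; detectCores() can return NA on some systems
n_workers <- max(1L, min(2L, detectCores(), na.rm = TRUE))

cl <- makeCluster(n_workers)

# give the workers reproducible, independent random number streams
clusterSetRNGStream(cl, iseed = 123)

# each task is sent to a worker process and evaluated there
result <- parLapply(cl, 1:8, function(i) {
  sum(rnorm(1e5))
})

# always release the worker processes when done
stopCluster(cl)

length(result)
```

Starting a cluster has a fixed cost, so this approach tends to pay off only when each task does a substantial amount of work.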

8. Profile Code

Profiling your code is an essential step in identifying performance bottlenecks. The profvis package in R provides a graphical interface for profiling, allowing you to visualize where your code spends the most time. By pinpointing these hotspots, you can focus your optimization efforts more effectively.

library(profvis)

# Profiling code execution
profvis({
  result <- sapply(1:1e6, function(x) x^2)
})

This profiling tool shows how much time each line or function call consumes, guiding you towards targeted improvements.

8. Leverage GenAI

Generative AI assistants, such as ChatGPT, Copilot, and Claude, can be quite valuable in identifying and resolving performance issues in R code. These AI systems can assist both novice and experienced programmers by providing insights, code reviews, optimization suggestions, and educational resources. Let’s explore some ways these AI assistants can be leveraged to improve the performance of R code.

One of the primary ways generative AI assistants can help is by reviewing your R code and suggesting optimizations. You can input sections of your code and ask the AI to analyze it for potential performance bottlenecks. For example, if you have a piece of code that involves loops or inefficient data manipulations, the AI can suggest more efficient alternatives, such as vectorized operations or the use of the data.table package.

Consider the following example where an AI assistant is asked to optimize a loop with the following prompt followed by pasting the code:

Optimize the following R code for performance:

# Loop to calculate cumulative sum
n <- 1e6
result <- numeric(n)
for (i in 1:n) {
  result[i] <- sum(1:i)
}

Below is the actual output for ChatGPT 4o (with some edits for brevity and formatting). Notice the explanations giving a novice programmer an opportunity to learn.

The provided R code calculates the cumulative sum using a loop, which can be optimized for better performance. Using vectorized operations, which are inherently faster in R, can significantly improve the efficiency of this code. Specifically, the `cumsum` function in R is designed for such tasks and is much more efficient than a loop.

Here's how you can optimize the code:

```r
n <- 1e6
result <- cumsum(1:n)
```

1.  **Vectorization**: The `cumsum` function calculates the cumulative sum of a numeric vector in a vectorized manner. This means that the computation is performed in an optimized way internally, avoiding the overhead of an explicit loop.

2.  **Avoiding Loops**: Loops in R, especially over large vectors, can be slow due to the interpreted nature of R and the overhead associated with repeatedly calling R functions. By using `cumsum`, you leverage highly optimized internal C code.

To demonstrate the performance improvement, we can compare the execution time of both the loop-based approach and the vectorized approach:

```r
n <- 1e6
result_loop <- numeric(n)
system.time({
  for (i in 1:n) {
    result_loop[i] <- sum(1:i)
  }
})

system.time({
  result_cumsum <- cumsum(1:n)
})
```

The explanation of code changes means that Generative AI can also be a teacher and coach. It can explain best practices for writing efficient R code, including providing guidance on vectorization, memory management, and the use of parallel computing. For example, an AI assistant can explain how to use the parallel package to speed up computations.

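Naturally, not all suggestions might be correct. AI assistants do have a (deserved) reputation for writing incorrect code or getting their math wrong, so it is important to double-check all suggestions and to be a critical consumer of AI suggestions: caveat emptor as they used to say in Rome. The suggestion above is a case in point: 1:n is an integer vector, and for n = 1e6 the running total exceeds the largest representable integer, so cumsum(1:n) issues an integer overflow warning and returns NA for the affected elements. Using cumsum(as.numeric(1:n)) avoids the problem.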
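
One way to act on this caution is to compare a suggested rewrite against the original code on a small input before trusting it on the full data. The sketch below does this for the cumulative sum example; note that it converts the sequence with as.numeric(), because cumsum() on an integer vector overflows once the running total exceeds the integer range.

```r
# small input: cheap to run with the original loop
n <- 1000

# original loop-based version
result_loop <- numeric(n)
for (i in seq_len(n)) {
  result_loop[i] <- sum(1:i)
}

# suggested vectorized version, computed in double precision
result_cumsum <- cumsum(as.numeric(1:n))

# the rewrite is only acceptable if both results agree
rewrite_ok <- isTRUE(all.equal(result_loop, result_cumsum))

# on a larger integer input the unconverted call overflows and yields NA
overflowed <- suppressWarnings(anyNA(cumsum(1:1e5)))
```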

Generative AI assistants can be particularly helpful in interactive debugging sessions, where you iteratively refine your code based on the AI’s feedback. You can describe the performance issues you’re encountering, and the AI will often suggest modifications and further optimizations. This iterative process helps in progressively improving the code’s performance.

9. Incorporate C++

Integrating C++ code into R can significantly enhance performance for critical sections of code, particularly when dealing with computationally intensive tasks. R, while powerful for data analysis and statistical computing, is not as fast as low-level languages like C++ due to its interpreted nature and dynamic typing. By offloading performance-critical tasks to C++, you can leverage the speed and efficiency of compiled code while maintaining the convenience of R for data manipulation and visualization. This approach is facilitated by the Rcpp package, which provides a seamless interface between R and C++.

R Notebooks allow code blocks in various languages and thus an R Notebook can be a mix of languages. The programmer can choose the “right language” for a programming task. In R scripts, C++ code can be passed as a string to the function cppFunction() from the Rcpp package.

C++ is a compiled language and executes much faster than R, especially for loops and complex computations. In addition, C++ allows for more efficient memory management and can handle large datasets or complex algorithms more effectively. So, using C++ for performance-critical sections of code while utilizing R’s rich ecosystem for data manipulation and analysis provides a balanced approach that leverages each language’s strength.

To begin using C++ in R, you need to install the Rcpp package. This package allows for easy integration of C++ code within R scripts and functions.

The Rcpp package provides several functions to facilitate the integration of C++ code. One of the most commonly used functions is cppFunction, which allows you to define and compile a C++ function directly in your R script.

Consider a simple example where we want to compute the sum of squares of a numeric vector. Here is how you can implement this in C++ using Rcpp. Note how the C++ code is an argument to the R function cppFunction(), which calls an installed C++ compiler, compiles the code (which can take some time), and then creates a language binding so the compiled C++ function can be called from R.

C++ Function:

library(Rcpp)

cppFunction('
double sumOfSquares(NumericVector x) {
  int n = x.size();
  double sum = 0;
  for (int i = 0; i < n; ++i) {
    sum += x[i] * x[i];
  }
  return sum;
}')

Using the C++ Function in R:

# Creating a numeric vector
vec <- rnorm(1e6)

# Calling the C++ function from R
result <- sumOfSquares(vec)
print(result)
## [1] 1001364

In this example, the C++ function sumOfSquares computes the sum of squares of the elements in the input vector x. This function is then called from R, demonstrating the seamless integration of R and C++. More complex C++ code is better kept in separate source files or in a package that includes compiled code, but that is beyond the scope of this lesson.

For a more complex example, consider matrix multiplication, a computationally intensive task that can benefit greatly from C++ optimization.

C++ Function for Matrix Multiplication:

cppFunction('
NumericMatrix matMult(NumericMatrix A, NumericMatrix B) {
  int n = A.nrow();
  int k = A.ncol();
  int m = B.ncol();
  NumericMatrix C(n, m);
  
  for (int i = 0; i < n; ++i) {
    for (int j = 0; j < m; ++j) {
      for (int l = 0; l < k; ++l) {
        C(i,j) += A(i,l) * B(l,j);
      }
    }
  }
  return C;
}')

Using the C++ Function in R:

# create two large matrices
A <- matrix(rnorm(1e6), nrow = 1000)
B <- matrix(rnorm(1e6), nrow = 1000)

# perform matrix multiplication using the C++ function
result <- matMult(A, B)

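In this example, the matMult function multiplies two matrices A and B. The C++ implementation is significantly faster than the equivalent nested loops written in pure R, especially for large matrices. Note, however, that R’s built-in %*% operator calls optimized linear algebra routines and will generally be faster still than this simple triple loop, so the example is best read as an illustration of the mechanism.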
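
For reference, a pure R baseline can be sketched as below. It implements the same triple loop in R and checks it against the built-in %*% operator on small matrices; the function name mat_mult_loop is hypothetical. The same all.equal() check can be applied to the output of matMult to confirm that a C++ version is correct before timing it.

```r
# the same triple-loop algorithm, written in pure R
mat_mult_loop <- function(A, B) {
  n <- nrow(A)
  k <- ncol(A)
  m <- ncol(B)
  C <- matrix(0, nrow = n, ncol = m)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      for (l in seq_len(k)) {
        C[i, j] <- C[i, j] + A[i, l] * B[l, j]
      }
    }
  }
  C
}

set.seed(3)
A <- matrix(rnorm(50 * 40), nrow = 50)   # 50 x 40
B <- matrix(rnorm(40 * 30), nrow = 40)   # 40 x 30

C_loop    <- mat_mult_loop(A, B)
C_builtin <- A %*% B

# both methods should agree up to floating point tolerance
products_agree <- isTRUE(all.equal(C_loop, C_builtin))
```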

10. Save Large Objects

In R, you can save any data object, such as a data frame, to an R object file (.RData) using the save() function and then restore it using the load() function. This approach is useful for preserving the state of your data between sessions, especially when working with large datasets that require significant processing time. It means that you create the data object once, save it to an object file, and then restore it from the file when needed rather than re-creating it.

First, create the data frame and save it to an R object file using the save() function. Later, in a new session, check whether the object file exists before restoring the data frame: if the file exists, load the data frame using the load() function; otherwise, create the data frame. The code below demonstrates this.

# load the data
df.txns <- read.csv('customertxndata.csv')

### do some intensive processing that takes significant time
### ...

# save the data frame to an R object file
save(df.txns, file = "df_txnData.RData")

### for testing we will remove the object from memory
rm(df.txns)

The code below checks if the object file exists and, if it does, restores the data frame from the object file. Notice that the object file contains the data frame’s name as well, so load() restores the object under its original name.

# file path and name of the object file
file_path <- "df_txnData.RData"

# restore the data frame from its file -- if it exists
if (file.exists(file_path)) {
  load(file_path)
} else {
  # create the data frame
  # ...
}

# display the first few rows of the restored data frame
head(df.txns, 3)
##   NumVisits NumOrders      OS Gender    Total
## 1         7         0 Android   Male   0.0000
## 2        20         1     iOS   <NA> 576.8668
## 3        22         1     iOS Female 850.0000

Summary

In summary, optimizing R code performance involves a multifaceted approach. By adopting efficient data structures like data.table, leveraging vectorized operations, managing memory effectively, utilizing parallel computing, and profiling your code, you can achieve significant run-time improvements. These techniques are not only essential for handling large datasets but also for ensuring that your computations are both fast and scalable, essential requirements for machine learning, statistical computing, and data applications.

The which() function in R is a highly efficient alternative to loops for conditional searches. It simplifies the process of identifying indices where a condition is met, providing both performance benefits and improved code readability. By leveraging which(), you can write more efficient and concise R code for tasks involving conditional subsetting and filtering.

Generative AI assistants like ChatGPT and Claude can be helpful in identifying and resolving performance issues in R code. For the critical programmer, they can provide code reviews, suggest optimizations, recommend efficient data structures, guide you through profiling techniques, and offer insights into best practices.

By integrating C++ code into R using the Rcpp package, you can achieve substantial performance improvements for computationally intensive tasks. This approach allows you to use R for data manipulation with the targeted efficiency of C++.

Files & Resources

All Files for Lesson 6.135

References

TBD

Errata

Let us know.

---
title: "Improving Run-Time Performance of R Code"
params:
  category: 6
  number: 135
  time: 45
  level: beginner
  tags: "R,debugging,runtime,time,Sys.time,tictoc,rbenchmark"
  description: "Lists several ways to improve and optimize the
                run-time performance of R code. Demonstrates
                strategies through examples."
date: "<small>`r Sys.Date()`</small>"
author: "<small>Martin Schedlbauer</small>"
email: "m.schedlbauer@neu.edu"
affilitation: "Northeastern University"
output: 
  bookdown::html_document2:
    toc: true
    toc_float: true
    collapsed: false
    number_sections: false
    code_download: true
    theme: spacelab
    highlight: tango
---

---
title: "<small>`r params$category`.`r params$number`</small><br/><span style='color: #2E4053; font-size: 0.9em'>`r rmarkdown::metadata$title`</span>"
---

```{r code=xfun::read_utf8(paste0(here::here(),'/R/_insert2DB.R')), include = FALSE}
```

## Introduction

Improving the run-time performance of R code is critical for handling large datasets and complex computations, especially in machine learning, applied AI, data mining, and statistical analysis. This lesson explores several approaches, ranging from optimizing data structures to parallel computing techniques, with the objective to make R code more efficient.

Many of the techniques presented in this lesson are universally applicable, particularly to other interpreted languages like Python and JavaScript. The relative slowness of R and Python compared to languages like Java and C++ can be attributed to several factors related to their design philosophies, runtime environments, and typical use cases. Understanding these differences is essential when choosing an approach to code optimization.

Firstly, R and Python are interpreted languages, whereas Java and C++ are compiled languages. In interpreted languages, code is executed by an interpreter at runtime, which introduces overhead because the interpreter needs to parse and dispatch each expression dynamically. In contrast, C++ is translated into machine code before execution, and Java is compiled to bytecode that the Java Virtual Machine then compiles to machine code just in time, so most of the work runs without the overhead of interpretation. This compilation step generally results in faster execution times for Java and C++ programs.

R and Python are also dynamically typed languages. This means that type checking happens at runtime rather than at compile time. While this provides greater flexibility and ease of use, it incurs additional overhead because the interpreter must manage type information and perform type checking during execution. Java and C++, being statically typed languages, perform type checking at compile time, which helps optimize the generated machine code for better performance.

Another significant factor in the poorer run-time performance of R and Python is the way both manage memory. R and Python use automatic garbage collection to manage memory, which periodically frees up memory that is no longer in use. While this simplifies memory management for the programmer, it can cause unpredictable pauses in program execution as the garbage collector runs. Java also uses garbage collection but benefits from more advanced and optimized garbage collection algorithms compared to those typically found in R and Python. C++, on the other hand, requires manual memory management by the programmer, which affords full control over memory allocation and deallocation. This programmer-driven approach allows for highly optimized memory usage but requires careful programming to avoid memory leaks and other issues.

The standard libraries and data structures provided by R and Python are designed for ease of use and flexibility rather than raw performance. For example, Python's list and dictionary objects are highly flexible but not as efficient as Java's or C++'s array and hash table implementations. Similarly, R's data frame and list structures prioritize convenience and ease of manipulation over performance. In contrast, Java and C++ standard libraries are often designed with performance in mind, providing more efficient data structures and algorithms. In fact, many of the high-performance functions in third-party packages for R and Python are written in C++ or Java.

R and Python are also heavily used in data analysis, scientific computing, and machine learning, where the ease of prototyping and the availability of extensive libraries (such as NumPy, pandas, and scikit-learn in Python, and caret, psych, data.table and ggplot2 in R) are more critical than raw execution speed. These languages often serve as "glue" languages, where performance-critical parts of the code are implemented in more efficient languages like C or C++ and then called from R or Python. This approach leverages the strengths of both high-level and low-level languages. It is not unusual for exploration and research to be done in R and Python and then, once an approach is settled upon, the code is rewritten for performance in Java or C++. In fact, R Notebooks allow code chunks to be in R, Python, Java, SQL, and C++ which provides full control to the programmer.

Furthermore, R and Python emphasize readability and ease of use, which can sometimes lead to less efficient code. For example, Python's use of dynamic typing and high-level data structures encourages a programming style that may be more intuitive but less efficient. Java and C++, in contrast, encourage more explicit and optimized programming practices due to their statically typed nature and lower-level control over system resources.

Lastly, the execution model of these languages differs. R, for example, is designed for statistical computing and data analysis, where typical workloads involve vectorized operations and data manipulation tasks that can be more efficiently handled by specialized libraries written in C or even (good old) FORTRAN. Python, while more general-purpose, is often used in similar domains and benefits from extensive C and C++ extensions to mitigate performance issues.

In short, the slowness of R and Python compared to Java and C++ stems from their interpreted nature, dynamic typing, memory management strategies, and design priorities that emphasize ease of use and flexibility over raw performance. Understanding these trade-offs is essential for selecting the right language for your specific application and for leveraging the strengths of each language effectively.

A final word of caution: An experienced programmer always begins by writing correct code that is simple and easy to follow. Run-time is secondary to correctness. Only if the performance of the code does not meet essential non-functional requirements should changes to the code be entertained. Improvement must proceed in stages, starting with the simplest modifications.

## Measuring Performance

Any attempt to improve run-time performance must start with measuring the code's run-time, either by profiling it or by instrumenting it with timing measurements. Lesson [6.134 -- Measure Run-Time Performance of R Code](http://artificium.us/lessons/06.r/l-6-134-perf-measure-r/l-6-134.html) demonstrates different methods for profiling code.

If an attempted optimization does not yield a measurable and material reduction in run-time and it results in less readable and less maintainable code, then the optimization should be rolled back. Readability, correctness, and maintainability are paramount.
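
As a minimal sketch of such a before-and-after measurement using only base R, the chunk below times two implementations of the same task and first confirms that they agree. The helper `time_it()` and the workload are hypothetical and chosen only for illustration; they are not part of the lesson's own tooling.

```{r}
# hypothetical helper: median elapsed seconds over several runs of a function
time_it <- function(f, reps = 5) {
  times <- vapply(seq_len(reps), function(i) {
    system.time(f())[["elapsed"]]
  }, numeric(1))
  median(times)
}

# two candidate implementations of the same task (illustrative workload)
xs <- runif(1e5)
slow_version <- function() { s <- 0; for (x in xs) s <- s + x; s }
fast_version <- function() sum(xs)

# confirm that both give the same answer before comparing speed
stopifnot(isTRUE(all.equal(slow_version(), fast_version())))

t_slow <- time_it(slow_version)
t_fast <- time_it(fast_version)
cat("loop:", t_slow, "s; sum():", t_fast, "s\n")
```

Taking the median of several runs reduces the influence of one-off pauses, such as a garbage collection, on the comparison.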

## Summary of Techniques

These techniques should generally be tried in roughly this order:

1.  replace searching via loops with `which()` and `any()`
2.  use integers for lookup values (keys) rather than strings/text
3.  use `apply()` instead of loops
4.  use vectorized operations instead of loops
5.  avoid **sqldf** unless no simple alternatives exist
6.  avoid regular expressions or substring searches whenever possible
7.  prefer `data.table` over data frames
8.  pre-allocate memory when possible
9.  employ parallel processing
10. incorporate C++
11. save large objects

In the sections that follow, we will look at these strategies in more detail, using examples.

## Throwing Money at the Problem

Of course, a common "business approach" to performance optimization is to "throw iron at the problem", *i.e.*, to purchase a computer with a more powerful CPU, more memory, a GPU, and perhaps a Neural Processor -- or to spin up a more powerful virtual machine with a cloud provider. It might cost a lot of money, but it may be less than having a programmer spend time improving the code (and perhaps introducing bugs) -- it may be a "quick fix" but it may be the right approach in many situations.

## Performance Improvement Approaches

This section provides specific strategies for improving the performance of R code by actually tweaking the code or rewriting it in a different -- and more economical -- way. They are presented in order of complexity, starting with the least complex. Naturally, not all strategies are necessary at all times and some of the later approaches may not be worth the complexity introduced into the code. Remember that readability and maintainability are also important, not just performance.

### 1. Avoid Loops

Loops are a fundamental construct in programming, allowing for repeated execution of a block of code. However, in languages like R and Python, loops can be significantly slower compared to languages like Java and C++. This slowness is due to several factors, including the interpreted nature of these languages, the overhead of dynamic typing, and the way memory management is handled. Understanding the limitations of loops in R and Python, and knowing how to use alternatives like the `apply` family of functions in R, can help improve performance.

In R, loops are typically slower because each iteration involves interpretation overhead, dynamic type checking, and memory management tasks. This can become a bottleneck, especially when dealing with large datasets or performing complex computations within the loop. For instance, consider the following example of a simple loop in R that calculates the square of each element in a vector:

```{r eval=F}
n <- 1e6 # number of iterations

# pre-allocate memory for result
result <- numeric(n)

for (i in 1:n) {
  result[i] <- i^2
}
```

In this code, the loop iterates one million times, and in each iteration, it performs an arithmetic operation and assigns the result to a vector. This repeated execution can be inefficient, as the interpreter handles each iteration individually, leading to increased execution time. How long it takes depends on your hardware; if you want to run the code for yourself and you have a fast CPU, increase *n* to make the run-time noticeable (keeping in mind that each element of *result* occupies 8 bytes of memory). By the way, notice that the vector *result* is pre-allocated to *n* elements rather than grown dynamically. This is another common performance improvement and is discussed in more detail later in this section.

To mitigate the poor performance of loops, R provides a family of `apply` functions, which are designed to perform operations on elements of data structures like arrays, matrices, and data frames in a more efficient manner. The `apply` functions are internally optimized and can reduce the overhead associated with loops. For instance, the `sapply` function can be used to achieve the same result as the loop above, but more efficiently:

```{r eval=F}
# Using sapply to square each element in a vector
n <- 1e6
result <- sapply(1:n, function(x) x^2)
```

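In this example, `sapply` applies the function `function(x) x^2` to each element of the vector `1:n`. The result is more concise than the loop, and `sapply` is often somewhat faster, although in recent versions of R a pre-allocated loop can be comparably fast because both still call an R function once per element. The largest gains come from fully vectorized expressions such as `(1:n)^2`, which avoid per-element calls altogether.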
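
To make the comparison concrete, the sketch below computes the same squares three ways (pre-allocated loop, `sapply`, and plain vectorized arithmetic) and checks that the results agree. The variable names and the smaller size `n_demo` are assumptions chosen so the chunk runs quickly; wrap each variant in `system.time()` to compare run-times on your own machine.

```{r}
n_demo <- 1e5   # kept small so the chunk runs quickly

# 1. explicit loop with a pre-allocated result vector
sq_loop <- numeric(n_demo)
for (i in seq_len(n_demo)) {
  sq_loop[i] <- i^2
}

# 2. sapply() with an anonymous function
sq_sapply <- sapply(seq_len(n_demo), function(x) x^2)

# 3. fully vectorized arithmetic: no per-element function call
sq_vec <- seq_len(n_demo)^2

# all three produce the same values
stopifnot(isTRUE(all.equal(sq_loop, sq_sapply)),
          isTRUE(all.equal(sq_loop, sq_vec)))
```

The vectorized form is usually the fastest of the three because the arithmetic for the entire vector happens in compiled code.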

Another member of the `apply` family is `lapply`, which is similar to `sapply` but returns a list instead of a vector. This can be useful when the result of the function applied is not necessarily of the same length or type as the input. For example:

```{r eval=F}
# Using lapply to create a list of squared numbers
n <- 1e6
result <- lapply(1:n, function(x) x^2)
```

While `lapply` returns a list, it can still be more efficient than a traditional loop due to its optimized implementation.
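
A related member of the family, `vapply`, may be worth knowing as well: it behaves like `sapply` but requires the caller to declare the type and length of each result. The short sketch below uses illustrative values only.

```{r}
# vapply() declares the expected type and length of each result up front
sq_vapply <- vapply(1:10, function(x) x^2, numeric(1))
sq_vapply

# a mismatch is reported as an error instead of silently changing the result type
bad <- tryCatch(
  vapply(1:3, function(x) as.character(x), numeric(1)),
  error = function(e) "type mismatch detected"
)
bad
```

Because the result type is fixed in advance, `vapply` tends to be the safer choice inside functions and packages, where an unexpected list or character result from `sapply` could cause errors further downstream.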

For operations on matrices, the `apply` function can be particularly useful. It allows you to apply a function to the rows or columns of a matrix without explicitly writing loops. Here’s an example where `apply` is used to calculate the row sums of a matrix:

```{r}
# Creating a matrix
mat <- matrix(1:1e6, nrow = 1e3, ncol = 1e3)

# Using apply to calculate row sums
row_sums <- apply(mat, 1, sum)
```

In this example, `apply` takes three arguments: the matrix `mat`, the margin (1 for rows, 2 for columns), and the function to apply (in this case, `sum`). The `apply` function efficiently iterates over the rows of the matrix and calculates their sums.

Moreover, R provides specialized functions like `colSums`, `rowSums`, `colMeans`, and `rowMeans` for common operations on matrices, which are even more efficient than using `apply`. For example:

```{r}
# Using rowSums to calculate row sums
row_sums <- rowSums(mat)
```

These specialized functions are highly optimized and should be preferred over `apply` for their respective operations.

To summarize, while loops in R can be slow due to the interpreted nature of the language and the overhead of dynamic typing and memory management, using the `apply` family of functions can significantly improve performance. The `apply` functions are internally optimized to handle operations on data structures more efficiently, reducing the execution time compared to traditional loops. Understanding and leveraging these functions is crucial for writing efficient R code, especially when dealing with large datasets or complex computations.

Even when the use of the `apply()` functions results in better performance, some programmers might find loops more intuitive especially when coming from a procedural language such as C++. Therefore, a first implementation of an algorithm might use loops. Only once the code is running correctly, do some programmers transition to the use of the `apply()` functions. Of course, as one becomes more familiar with R, code follows more of the natural R programming idioms.

### 2. Use `which()`

Rather than looping through a vector or a column in a data frame, use the `which()` function (and its existential counterpart `any()`) as it is much faster. The code below demonstrates this when searching a vector for the position of a particular value (finds the first occurrence only):

```{r useWhich}
n <- 1e8      # size of vector

# vector of random integer values
v <- round(runif(n, min = 100, max = n*n),0)

x <- 445876   # value to find
k <- 0        # position at which found

# insert value at the last position to ensure it is present
v[n] <- x
```

Now, let's try to find the position of the key value *x* in the vector `v`. We will measure the run-time performance of both approaches (of course, on your computer the random values in `v` and the timings will be different).

**Approach I: Linear Search with Loops**

```{r}
bt <- Sys.time()

for (i in 1:n) {
  if (x == v[i]) {
    k <- i
    break
  }
}

et <- Sys.time()

t.I <- difftime(et, bt, units = "secs")  # fix the unit so the message below is accurate

cat("Approach I / Loop: found ", x, "at position", k, "in", round((t.I),3), "seconds")
```

**Approach II: Using `which()`**

The approach below uses the `which()` function; as you can see it is substantially faster than looping.

```{r}
bt <- Sys.time()

k <- which(x == v)

et <- Sys.time()

t.II <- difftime(et, bt, units = "secs")  # fix the unit so the message below is accurate

cat("Approach II / which: found ", x, "at position", k, "in", round((t.II),3), "seconds")
```

The difference in time is quite pronounced -- about a 10x improvement in speed.

Of course, if the values in the vector had been sorted, then a binary search would have been substantially faster (complexity of $O(\log n)$ vs $O(n)$). Placing the values into a hashmap rather than a linear vector would also make search faster ($O(1)$ on average). R does not have a dedicated hashmap type, but environments are implemented as hash tables and can serve as one; external mechanisms such as an in-memory key-value database (*e.g.*, *memcached*) are also used.
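
Both alternatives can be sketched in base R. The chunk below is an illustration under assumptions rather than a benchmark: it uses `findInterval()`, which relies on its input being sorted, to locate a value in a sorted vector, and an environment as a simple key-value store. All names and values are hypothetical.

```{r}
# sorted vector: findInterval() exploits the ordering instead of scanning every element
sorted_v <- sort(c(12, 45, 7, 99, 63, 28))
target   <- 63
pos <- findInterval(target, sorted_v)     # index of the last element <= target
found_sorted <- pos > 0 && sorted_v[pos] == target

# hashed lookup: an environment used as a key-value store
lookup <- new.env(hash = TRUE)
assign("alice", 85, envir = lookup)
assign("bob",   92, envir = lookup)

score_bob <- get("bob", envir = lookup, inherits = FALSE)
has_carol <- exists("carol", envir = lookup, inherits = FALSE)
```

Note that environment keys must be character strings, so numeric keys would first have to be converted with `as.character()`.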

#### Quick Tutorial: `which()`

The `which()` function in R is an efficient mechanism for identifying the indices of elements in a logical vector that meet a specified condition. It is often used as an alternative to loops for conditional searches, providing a more concise and typically faster way to accomplish such tasks. Recall that a column in a data frame is a vector, so this also can be used to search columns of data frames.

The `which()` function returns the positions in a vector where an element in the vector meets a logical condition. Essentially, it scans through a logical vector and identifies the positions where the condition is met.

Consider a simple example where we want to find the positions (indices) of elements in a numeric vector that are greater than a specified value. In this example, `which(vec > 5)` returns the indices of `vec` where the elements are greater than 5.

```{r}
# Create a numeric vector
vec <- c(2, 4, 6, 8, 10, 3, 5, 7, 9)

# Use which() to find positions of elements greater than 5
indices <- which(vec > 5)

print(indices)
```

The `which()` function is also useful for identifying rows in a data frame that meet certain conditions. Let's consider a data frame example:

```{r}
# Create a data frame
df <- data.frame(
  ID = 1:10,
  Score = c(85, 90, 88, 78, 92, 95, 70, 65, 88, 91)
)

# use which() to find rows where Score is greater than 90
rows <- which(df$Score > 90)

print(rows)
```

Often, you want to subset the original vector or data frame based on the indices returned by `which()`. Here's how you can do it:

```{r}
# select rows in the data frame based on condition in a column
df.subset <- df[which(df$Score > 90), ]

head(df.subset)
```

The `which()` function is more efficient than loops for identifying indices that meet a condition, as it leverages internal optimizations. Once one is used to R, `which()` also makes the code more concise and readable than explicit loops; it is the natural and idiomatic way to program in R.
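
The existential counterpart `any()` mentioned at the start of this section, along with two closely related idioms, can be sketched as follows. The vector `scores` simply repeats the illustrative values used above.

```{r}
scores <- c(85, 90, 88, 78, 92, 95, 70, 65, 88, 91)

# any(): is there at least one score above 90? (returns a single TRUE/FALSE)
has_high <- any(scores > 90)

# which(...)[1]: position of the first score above 90
first_high <- which(scores > 90)[1]

# match(): position of the first element equal to a given value
first_88 <- match(88, scores)
```

Use `any()` when only a yes/no answer is needed, and `which()` or `match()` when the position itself matters.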

### 3. Use Integer Keys

Keys should be integers whenever possible rather than character strings as comparing integers is much faster than comparing text strings (which are sequences of characters that have to be compared one character at a time). Of course, using a hash function to map strings to integers would be a good alternative. The code below compares the time to search an integer column versus a text column.

```{r}
n <- 1e7

df <- data.frame(
  ID = round(runif(n, min = 1000, max = 100000),0),
  Score = round(runif(n, min = 40, max = 99),0)
)

# embed lookup value in the last row
lookup.value.int <- 445632
df$ID[nrow(df)] <- lookup.value.int

# add a text column
df$NUID <- as.character(df$ID)
lookup.value.chr <- as.character(lookup.value.int)
df$NUID[nrow(df)] <- lookup.value.chr
```

**Integer Key Search:**

```{r}
bt <- Sys.time()

# lookup score based on ID
k <- df$Score[which(df$ID == lookup.value.int)]

et <- Sys.time()

t.integer <- et - bt

cat("Integer Key:", round((t.integer),3), "seconds")
```

**Text/Character Key Search:**

```{r}
bt <- Sys.time()

# lookup score based on NUID (text key)
k <- df$Score[which(df$NUID == lookup.value.chr)]

et <- Sys.time()

t.text <- et - bt

cat("Text Key:", round((t.text),3), "seconds")
```

In this run, the lookup of a character key is about 10x slower than the integer lookup, although, to be fair, it is still quite fast when using `which()`. Try writing the same code with a loop for comparison. What do you notice? What is the performance difference?
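
The idea of mapping strings to integers, mentioned at the start of this section, can be approximated in base R without writing a hash function: build the mapping once with `unique()` and `match()`, then search the integer codes. The key values below are hypothetical.

```{r}
# illustrative string keys (hypothetical values)
str_keys <- c("NU-1001", "NU-2040", "NU-1001", "NU-3177", "NU-2040")

# build the mapping once: each distinct string gets an integer code
key_levels <- unique(str_keys)
int_keys   <- match(str_keys, key_levels)

# translate the lookup value once, then compare integers
target_code <- match("NU-2040", key_levels)
hits <- which(int_keys == target_code)
```

The one-time cost of building the mapping pays off only when the same column is searched repeatedly; for a single lookup, searching the text column directly is simpler.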

### 4. Leverage Efficient Data Manipulation Packages

One of the primary ways to enhance performance is through efficient data manipulation. Data frames in R are versatile, but they may not always be the best choice for performance-critical applications. Instead, consider using the `data.table` package, which offers an enhanced version of data frames. Data tables are designed for fast aggregation, joining, and in-place updates. They also consume less memory and execute operations more quickly due to their optimized internal structure.

For example, suppose you have a large dataset and need to perform frequent subsetting and aggregations. Using `data.table` can significantly reduce execution time. Here is a demonstration:

**Using a Data Frame:**

```{r}
# Creating a large data frame
df <- data.frame(x = rnorm(1e8), 
                 y = rnorm(1e8))

# Subsetting and aggregation
system.time({
  result <- mean(df$y[which(df$x > 0)])
})
```

**Using a `data.table`:**

```{r echo = F, warning=F, message=F}
package <- "data.table"

if (!require(package, character.only = TRUE)) {
  install.packages(package, dependencies = TRUE, quietly = TRUE)
}

library(package, character.only = TRUE)
```

```{r message=F}
library(data.table)

# Creating a large data table
dt <- data.table(x = rnorm(1e8), 
                 y = rnorm(1e8))

# Subsetting and aggregation
system.time({
  result <- dt[x > 0, .(mean_y = mean(y))]
})
```

In this example, `data.table` processes the subsetting and aggregation a bit faster than a regular data frame, making it ideal for large-scale data analysis, particularly when there are many columns.

### 5. Vectorize Operations

Vectorization is another critical technique for improving performance. R is inherently a vectorized language, meaning it can operate on entire vectors of data at once. By leveraging vectorized operations, you can avoid the overhead of explicit loops, which are often slow in R. For instance, instead of using a loop to add two vectors element-wise, you can directly use vectorized addition:

```{r}
# Vectorized addition
a <- rnorm(1e7)
b <- rnorm(1e7)

system.time({
  c <- sum(a + b)
})
```

Here, the vectorized addition is not only more concise but also much faster than looping through each element as demonstrated with the loop-based code below:

```{r}
# Loop-based addition
a <- rnorm(1e7)
b <- rnorm(1e7)
c <- 0

system.time({
  for (i in seq_along(a)) {
    c <- c + (a[i] + b[i])
  }
})
```
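
Vectorization also applies to conditional logic, not just arithmetic. As a small sketch under illustrative assumptions (random data, a "clamp negatives to zero" rule), the loop below and its vectorized equivalents produce the same result.

```{r}
set.seed(42)
vals <- rnorm(1e5)

# loop version: clamp negative values to zero, one element at a time
clamped_loop <- numeric(length(vals))
for (i in seq_along(vals)) {
  clamped_loop[i] <- if (vals[i] < 0) 0 else vals[i]
}

# vectorized versions of the same rule
clamped_pmax   <- pmax(vals, 0)
clamped_ifelse <- ifelse(vals < 0, 0, vals)

# counting without a loop: TRUE counts as 1 when summed
n_negative <- sum(vals < 0)
```

Of the two vectorized forms, `pmax()` expresses this particular rule most directly, while `ifelse()` generalizes to arbitrary element-wise conditions.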

### 6. Preallocate Memory

Memory management is another aspect where performance can be improved. R's garbage collector handles memory allocation and deallocation automatically, but inefficient memory usage can still slow down your code. Pre-allocating memory for large objects can prevent unnecessary memory reallocation during execution. For example, if you are appending elements to a vector inside a loop, it is better to pre-allocate the vector's size rather than grow the vector dynamically as new elements are inserted.

In the code below, the initial vector is created empty; as new elements are inserted, R automatically allocates memory. Conceptually, this means that memory for a new, larger vector is allocated and the original vector's elements are copied to it each time the vector outgrows its allocation (recent versions of R over-allocate slightly to reduce the number of copies, but the repeated copying is still costly).

```{r}
# dynamically growing a vector
n <- 1e7
v <- numeric(0)

system.time({
  for (i in 1:n) {
    v[i] <- rnorm(1)
  }
})
```

The code below pre-allocates the memory. Of course, this only works when the required number of elements is known beforehand. If it is not known, then the memory must be estimated, making certain not to overallocate.

```{r}
# pre-allocating memory for a vector
n <- 1e7
v <- numeric(n)

system.time({
  for (i in 1:n) {
    v[i] <- rnorm(1)
  }
})
```

This approach avoids the overhead associated with dynamically growing the vector. The difference becomes more pronounced as the number of elements increases.
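
When the final size is not known beforehand, one common compromise is to start with an estimate and double the capacity whenever it runs out, trimming the unused tail at the end. The sketch below illustrates the idea; the stopping rule, initial capacity, and variable names are hypothetical.

```{r}
# hypothetical scenario: values arrive one at a time and the total is unknown
set.seed(1)
capacity <- 16                 # initial guess (assumption)
buf      <- numeric(capacity)
count    <- 0

repeat {
  value <- runif(1)
  if (value > 0.999) break     # stand-in for "no more data"
  count <- count + 1
  if (count > capacity) {
    # double the capacity so that re-allocation happens only rarely
    capacity <- capacity * 2
    length(buf) <- capacity
  }
  buf[count] <- value
}

# trim the unused tail once at the end
buf <- buf[seq_len(count)]
```

Doubling keeps the number of re-allocations proportional to the logarithm of the final size, at the cost of temporarily holding up to twice the memory actually needed.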

### 7. Parallelize Processing

Parallel computing can also significantly reduce computation time for tasks that can be parallelized. R offers several packages for parallel processing, such as `parallel`, `foreach`, and `future`. The `parallel` package, for instance, provides functions to execute code on multiple CPU cores simultaneously. Of course, the more cores a CPU has, the better.

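Here’s an example using the `mclapply` function. Note that `mclapply` relies on process forking, so running it with `mc.cores` greater than 1 is not supported on Windows: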

```{r echo = F, warning=F, message=F}
package <- "parallel"

if (!require(package, character.only = TRUE)) {
  install.packages(package, dependencies = TRUE, quietly = TRUE)
}

library(package, character.only = TRUE)
```

```{r}
library(parallel)

# Parallel processing using mclapply
system.time({
  result <- mclapply(1:8, function(x) {
    sum(rnorm(1e6))
  }, mc.cores = 4)
})
```

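In this example, the computation is distributed across four cores (as separate worker processes), which can substantially decrease execution time compared to a sequential approach, provided each task is large enough to outweigh the overhead of starting the workers.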
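
Where forking is not available, the `parallel` package also offers socket clusters through `makeCluster()` and `parLapply()`. The sketch below assumes two workers and a small illustrative workload; adjust both to your machine.

```{r}
library(parallel)

# socket cluster: works on Windows as well as on macOS and Linux
cl <- makeCluster(2)           # two workers (assumption; adjust to your CPU)

par_result <- parLapply(cl, 1:4, function(i) {
  sum(rnorm(1e5))
})

stopCluster(cl)                # always release the worker processes

length(par_result)
```

Socket workers start as fresh R sessions, so any packages or objects that the worker function needs must be loaded or exported to them explicitly, for example with `clusterExport()`.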

### 8. Profile Code

Profiling your code is an essential step in identifying performance bottlenecks. The `profvis` package in R provides a graphical interface for profiling, allowing you to visualize where your code spends the most time. By pinpointing these hotspots, you can focus your optimization efforts more effectively.

```{r echo = F, warning=F, message=F}
package <- "profvis"

if (!require(package, character.only = TRUE)) {
  install.packages(package, dependencies = TRUE, quietly = TRUE)
}

library(package, character.only = TRUE)
```

```{r}
library(profvis)

# Profiling code execution
profvis({
  result <- sapply(1:1e6, function(x) x^2)
})
```

This profiling tool shows how much time (and memory) each part of the code consumes, guiding you towards targeted improvements.
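
If installing `profvis` is not an option, base R's sampling profiler offers a text-only alternative. The following is a minimal sketch with an illustrative workload; it assumes that your R build has profiling support enabled, which is the usual case.

```{r}
prof_file <- tempfile(fileext = ".out")

Rprof(prof_file, interval = 0.01)   # start sampling the call stack
for (i in 1:200) {
  tmp_sorted <- sort(runif(1e5))    # illustrative workload
}
Rprof(NULL)                         # stop profiling

prof_summary <- summaryRprof(prof_file)
head(prof_summary$by.self, 5)       # functions ranked by time spent in them
```

Because the profiler samples at fixed intervals, very short computations may record few or no samples; profile a workload that runs for at least a second or so.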

### 8. Leverage GenAI

Generative AI assistants, such as ChatGPT, Copilot, and Claude, can be quite valuable in identifying and resolving performance issues in R code. These AI systems can assist both novice and experienced programmers by providing insights, code reviews, optimization suggestions, and educational resources. Let's explore some ways these AI assistants can be leveraged to improve the performance of R code.

One of the primary ways generative AI assistants can help is by reviewing your R code and suggesting optimizations. You can input sections of your code and ask the AI to analyze it for potential performance bottlenecks. For example, if you have a piece of code that involves loops or inefficient data manipulations, the AI can suggest more efficient alternatives, such as vectorized operations or the use of the `data.table` package.

Consider the following example where an AI assistant is asked to optimize a loop with the following prompt followed by pasting the code:

```         
Optimize the following R code for performance:

# Loop to calculate cumulative sum
n <- 1e6
result <- numeric(n)
for (i in 1:n) {
  result[i] <- sum(1:i)
}
```

Below is the actual output for ChatGPT 4o (with some edits for brevity and formatting). Notice the explanations giving a novice programmer an opportunity to learn.

````         
The provided R code calculates the cumulative sum using a loop, which can be optimized for better performance. Using vectorized operations, which are inherently faster in R, can significantly improve the efficiency of this code. Specifically, the `cumsum` function in R is designed for such tasks and is much more efficient than a loop.

Here's how you can optimize the code:

```r
n <- 1e6
result <- cumsum(1:n)
```

1.  **Vectorization**: The `cumsum` function calculates the cumulative sum of a numeric vector in a vectorized manner. This means that the computation is performed in an optimized way internally, avoiding the overhead of an explicit loop.

2.  **Avoiding Loops**: Loops in R, especially over large vectors, can be slow due to the interpreted nature of R and the overhead associated with repeatedly calling R functions. By using `cumsum`, you leverage highly optimized internal C code.

To demonstrate the performance improvement, we can compare the execution time of both the loop-based approach and the vectorized approach:

```r
n <- 1e6
result_loop <- numeric(n)
system.time({
  for (i in 1:n) {
    result_loop[i] <- sum(1:i)
  }
})

system.time({
  result_cumsum <- cumsum(1:n)
})
```
````

The explanation of code changes means that Generative AI can also be a teacher and coach. It can explain best practices for writing efficient R code, including providing guidance on vectorization, memory management, and the use of parallel computing. For example, an AI assistant can explain how to use the `parallel` package to speed up computations.

Naturally, not all suggestions might be correct -- AI assistants do have a (deserved) reputation for writing incorrect code or getting their math wrong, so it is important to double-check all suggestions and to be a critical consumer of AI suggestions: *caveat emptor* as they used to say in Rome.
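
One way to be such a critical consumer is to test a suggestion before adopting it: compare it against the original on a small input, then probe edge cases at full size. The sketch below applies this to the `cumsum` suggestion above; the variable names are assumptions. On the R versions we are aware of, `cumsum()` on an integer vector returns `NA` with a warning once the running total exceeds `.Machine$integer.max`, which is the case for `n = 1e6`.

```{r}
# 1. check correctness on a small input, where the slow version is cheap
n_small <- 1000
loop_small <- numeric(n_small)
for (i in 1:n_small) {
  loop_small[i] <- sum(1:i)
}
same_small <- isTRUE(all.equal(loop_small, cumsum(1:n_small)))

# 2. probe the full size: 1:n is an integer vector, and the running total
#    for n = 1e6 exceeds .Machine$integer.max
n_large <- 1e6
suggested  <- suppressWarnings(cumsum(1:n_large))
overflowed <- anyNA(suggested)

# converting to double first avoids the overflow
safe    <- cumsum(as.numeric(1:n_large))
last_ok <- safe[n_large] == n_large * (n_large + 1) / 2
```

The suggestion passes the small test but needs `as.numeric()` to be safe at full size, which is the kind of detail that only shows up when a suggestion is actually run and checked.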

Generative AI assistants can be particularly helpful in interactive debugging sessions, where you iteratively refine your code based on the AI’s feedback. You can describe the performance issues you’re encountering, and the AI will often suggest modifications and further optimizations. This iterative process helps in progressively improving the code’s performance.

### 9. Incorporate C++

Integrating C++ code into R can significantly enhance performance for critical sections of code, particularly when dealing with computationally intensive tasks. R, while powerful for data analysis and statistical computing, is not as fast as low-level languages like C++ due to its interpreted nature and dynamic typing. By offloading performance-critical tasks to C++, you can leverage the speed and efficiency of compiled code while maintaining the convenience of R for data manipulation and visualization. This approach is facilitated by the `Rcpp` package, which provides a seamless interface between R and C++.

R Notebooks allow code blocks in various languages, so a single R Notebook can mix languages and the programmer can choose the "right language" for each programming task. In R scripts, C++ code can be passed as a string to the function `cppFunction()` from the **Rcpp** package.

C++ is a compiled language and executes much faster than R, especially for loops and complex computations. In addition, C++ allows for more efficient memory management and can handle large datasets or complex algorithms more effectively. So, using C++ for performance-critical sections of code while utilizing R's rich ecosystem for data manipulation and analysis provides a balanced approach that leverages each language's strength.

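To begin using C++ in R, you need to install the `Rcpp` package along with a working C++ compiler toolchain (for example, Rtools on Windows or the Xcode Command Line Tools on macOS). This package allows for easy integration of C++ code within R scripts and functions.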

```{r echo = F, warning=F, message=F}
package <- "Rcpp"

if (!require(package, character.only = TRUE)) {
  install.packages(package, dependencies = TRUE, quiet = TRUE)
}

library(package, character.only = TRUE)
```

The `Rcpp` package provides several functions to facilitate the integration of C++ code. One of the most commonly used functions is `cppFunction`, which allows you to define and compile a C++ function directly in your R script.

Consider a simple example where we want to compute the sum of squares of a numeric vector. Here is how you can implement this in C++ using `Rcpp`. Note that the C++ code is passed as a string argument to the R function `cppFunction()`, which invokes an installed C++ compiler to compile the code (which can take some time) and then creates a language binding so that the compiled C++ function can be called from R.

**C++ Function:**

```{r createCppFunc}
library(Rcpp)

cppFunction('
double sumOfSquares(NumericVector x) {
  int n = x.size();
  double sum = 0;
  for (int i = 0; i < n; ++i) {
    sum += x[i] * x[i];
  }
  return sum;
}')
```

**Using the C++ Function in R:**

```{r useCppFunc}
# Creating a numeric vector
vec <- rnorm(1e6)

# Calling the C++ function from R
result <- sumOfSquares(vec)
print(result)
```

In this example, the C++ function `sumOfSquares` computes the sum of squares of the elements in the input vector `x`. This function is then called from R, demonstrating the seamless integration of R and C++. More complex C++ code is best organized in separate source files or in an entire package, but that is beyond the scope of this lesson.
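
When offloading work to compiled code, it helps to have pure-R reference implementations to check results and timings against. The sketch below is one way to do that; the helper name `sum_of_squares_loop()` and the seed are illustrative assumptions, and no timings are claimed since they depend on the machine.

```r
# pure-R loop: the same algorithm as the C++ version, but interpreted
sum_of_squares_loop <- function(x) {
  total <- 0
  for (v in x) {
    total <- total + v * v
  }
  total
}

set.seed(42)  # arbitrary seed so the run is repeatable
vec <- rnorm(1e6)

# time the interpreted loop and the vectorized expression
time_loop <- system.time(res_loop <- sum_of_squares_loop(vec))
time_vec  <- system.time(res_vec  <- sum(vec^2))

# the two results should agree up to floating-point rounding
all.equal(res_loop, res_vec)

print(time_loop)
print(time_vec)
```

If `sumOfSquares` has been compiled as shown earlier, `all.equal(sumOfSquares(vec), res_vec)` offers the same correctness check for the C++ version.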

For a more complex example, consider matrix multiplication, a computationally intensive task that can benefit greatly from C++ optimization.

**C++ Function for Matrix Multiplication:**

```{r}
cppFunction('
NumericMatrix matMult(NumericMatrix A, NumericMatrix B) {
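  int n = A.nrow();
  int k = A.ncol();
  int m = B.ncol();
  // the inner dimensions must agree, otherwise the loop reads out of bounds
  if (B.nrow() != k) {
    stop("Incompatible dimensions: ncol(A) must equal nrow(B)");
  }
  NumericMatrix C(n, m);  // elements are initialized to zero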
  
  for (int i = 0; i < n; ++i) {
    for (int j = 0; j < m; ++j) {
      for (int l = 0; l < k; ++l) {
        C(i,j) += A(i,l) * B(l,j);
      }
    }
  }
  return C;
}')
```

**Using the C++ Function in R:**

```{r}
# create two large matrices
A <- matrix(rnorm(1e6), nrow = 1000)
B <- matrix(rnorm(1e6), nrow = 1000)

# perform matrix multiplication using the C++ function
result <- matMult(A, B)
```

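In this example, the `matMult` function multiplies two matrices `A` and `B`. The C++ implementation is significantly faster than the same triple loop written in pure R, especially for large matrices. Note, however, that R's built-in `%*%` operator delegates to optimized BLAS routines and will typically outperform a naive triple loop like this one; the example is meant to illustrate the mechanics of moving loops into C++, not to replace `%*%`.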
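
A hand-written multiplication routine can be validated against R's built-in `%*%` operator. The sketch below does this for a pure-R triple loop that mirrors the C++ algorithm; the helper name `mat_mult_loop()` and the matrix sizes are assumptions chosen so that the interpreted loop finishes quickly.

```r
# pure-R triple loop mirroring the C++ algorithm (hypothetical helper name)
mat_mult_loop <- function(A, B) {
  stopifnot(ncol(A) == nrow(B))
  C <- matrix(0, nrow = nrow(A), ncol = ncol(B))
  for (i in seq_len(nrow(A))) {
    for (j in seq_len(ncol(B))) {
      for (l in seq_len(ncol(A))) {
        C[i, j] <- C[i, j] + A[i, l] * B[l, j]
      }
    }
  }
  C
}

set.seed(1)  # arbitrary seed so the run is repeatable
# small matrices keep the interpreted loop fast enough to run interactively
A <- matrix(rnorm(60 * 40), nrow = 60)
B <- matrix(rnorm(40 * 50), nrow = 40)

C_loop    <- mat_mult_loop(A, B)
C_builtin <- A %*% B

# results should agree up to floating-point rounding
all.equal(C_loop, C_builtin)
dim(C_loop)
```

The same `all.equal()` comparison can be applied to the output of `matMult(A, B)` once the C++ function has been compiled.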

## 10. Save Large Objects

In R, you can save any data object, such as a data frame, to an R object file (*.RData*) using the `save()` function and then restore it using the `load()` function. This approach is useful for preserving the state of your data between sessions, especially when working with large datasets that require significant processing time. It means that you create the data object once, save it to an object file, and then restore it from the file when needed rather than re-creating it.

First, create the data frame and save it to an R object file using the `save()` function. Later, in a new session, check whether the object file exists before restoring the data frame: if the file exists, load the data frame using the `load()` function; otherwise, create the data frame again. The code below demonstrates the first part.

```{r}
# load the data
df.txns <- read.csv('customertxndata.csv')

### do some intensive processing that takes significant time
### ...

# save the data frame to an R object file
save(df.txns, file = "df_txnData.RData")

### for testing we will remove the object from memory
rm(df.txns)

```

The code below checks if the object file exists and, if it does, restores the data frame from the object file. Notice that the object file also stores the data frame's name, so `load()` restores the object as `df.txns` without an explicit assignment.

```{r}
# file path and name of the object file
file_path <- "df_txnData.RData"

# restore the data frame from its file -- if it exists
if (file.exists(file_path)) {
  load(file_path)
} else {
  # create the data frame
  # ...
}

# display the first few rows of the restored data frame
head(df.txns, 3)

```
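
The save-then-restore pattern can also be wrapped in a small helper. The following is a self-contained sketch, not part of the lesson's data pipeline: `build_txns()` and `load_or_create()` are hypothetical names, the tiny data frame stands in for an expensive computation, and a temporary file is used in place of `df_txnData.RData`.

```r
# hypothetical stand-in for an expensive data preparation step
build_txns <- function() {
  data.frame(id = 1:5, amount = c(10.5, 20, 7.25, 99, 3))
}

# a temporary file is used so the sketch does not touch the working directory
file_path <- tempfile(fileext = ".RData")

# hypothetical helper: restore the data frame if a saved copy exists, else build and save it
load_or_create <- function(path) {
  if (file.exists(path)) {
    # load() restores objects under their saved names into the given environment
    env <- new.env()
    load(path, envir = env)
    env$df.txns
  } else {
    df.txns <- build_txns()
    save(df.txns, file = path)
    df.txns
  }
}

first  <- load_or_create(file_path)  # file is missing: builds and saves
second <- load_or_create(file_path)  # file exists: restores from disk

# the restored data frame should match the one originally built
all.equal(first, second)

unlink(file_path)  # clean up the temporary file
```

Loading into a separate environment is a design choice that avoids silently overwriting objects of the same name in the global workspace.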

## Summary

In summary, optimizing R code performance involves a multifaceted approach. By adopting efficient data structures like `data.table`, leveraging vectorized operations, managing memory effectively, utilizing parallel computing, and profiling your code, you can achieve significant run-time improvements. These techniques are essential not only for handling large datasets but also for ensuring that your computations are both fast and scalable, which are essential requirements for machine learning, statistical computing, and data applications.

The `which()` function in R is a highly efficient alternative to loops for conditional searches. It simplifies the process of identifying indices where a condition is met, providing both performance benefits and improved code readability. By leveraging `which()`, you can write more efficient and concise R code for tasks involving conditional subsetting and filtering.

Generative AI assistants like ChatGPT and Claude can be helpful in identifying and resolving performance issues in R code. For the critical programmer, they can provide code reviews, suggest optimizations, recommend efficient data structures, guide you through profiling techniques, and offer insights into best practices.

By integrating C++ code into R using the `Rcpp` package, you can achieve substantial performance improvements for computationally intensive tasks. This approach allows you to keep R's convenience for data manipulation while applying the efficiency of C++ to the performance-critical sections.

## Files & Resources

```{r zipFiles, echo=FALSE}
zipName = sprintf("LessonFiles-%s-%s.zip", 
                 params$category,
                 params$number)

textALink = paste0("All Files for Lesson ", 
               params$category,".",params$number)

# downloadFilesLink() is included from _insert2DB.R
knitr::raw_html(downloadFilesLink(".", zipName, textALink))
```

------------------------------------------------------------------------

## References

TBD

## Errata

[Let us know](https://form.jotform.com/212187072784157){target="_blank"}.
