Introduction

This tutorial is a quick introduction to R for programmers of other high-level languages. It shows those features of R that are familiar to most programmers so that they can get started programming right away. It is important to note that the programming approach presented here is generally neither the most efficient nor the most common approach. We show the “R way” whenever possible and simple enough to explain. But the goal of the tutorial is to get programmers programming. This is especially useful to students in computer science courses that use R but who are new to R.

R is best suited for data projects: data loading, data transformation, databases, data analysis, data visualization, and data science. There is substantial support for statistical analysis, unsupervised data mining, supervised machine learning, and even interactive dashboards.

Often, programming tasks that take dozens of lines of code in most languages can be written with one statement or one line in R. While R is not the fastest language, for many vector processing and mathematical operations it generally outperforms most other high-level languages, particularly when vectorization hardware is present.

As of 2021, R is one of the top languages to learn and for any kind of data-related work it is critical (along with Python, Scala, and perhaps Go).

The tutorial is geared towards students in information science, data science, and database design. It demonstrates the basic R syntax that is most often used for data processing rather than statistics.

The R Language

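This tutorial treats R as a procedural language similar to C. R does have object systems (S3, S4, and Reference Classes) that support classes, inheritance, and polymorphism, but they work differently from those in C++ or Java and are not needed for the material covered here. The closest everyday equivalent of a struct is a list or a data frame; there is no class keyword as in C++ or Java.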

Programs in R are scripts. There is no “main function” or similar. R “programs” are collections of R statements that are interpreted when executed interactively. There is no compilation step. R does support reusable third-party code “libraries” in the form of packages.

Working in R

To write “programs” in R you will need Base R which you can download for Linux, MacOS, and Windows from R Project. This is the core language with an interactive console. Programs, or more aptly R scripts, can be built in any text editor (TextEdit, Notepad, vi, Sublime, JEdit, etc.).

Most programming is done with an IDE (Integrated Development Environment). The most common is R Studio downloadable from RStudio. There is a hosted version of R Studio available at rstudio.cloud.

Install R from R Project before installing R Studio.

The tutorial below explains how to get started with R Notebooks:

Execute chunks by clicking the Run button within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter. The code runs in the order in which the chunks are executed, so non-linear code execution is possible unless you instruct R Studio to run all chunks starting at the first chunk.

Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I.

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Ctrl+Shift+K to preview the HTML file).

The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.

Projects in R Studio

Projects are a better way to manage code than creating individual R Notebooks, R Scripts, and other code files. Projects allow all files, including data files, to be managed as a single unit, shared, and version controlled using services such as git and GitHub.

The tutorial below demonstrates how to create a project in R Studio and add files to the project.

R Code Chunks

While you can also write R Scripts, we will concentrate on how to write R “programs” by composing an R Notebook in R Studio. This is the most common way to use R in data-related projects where reproducibility is paramount. Programs in R run from start to end. Each chunk should be a step in your analysis or data project. Name your code chunks, so you can quickly navigate to them.

In the chunk below, the variable cars passed to the built-in Base R function plot is one of the dozens of “built-in” data frames; a data frame being data arranged in rows and columns similar to a spreadsheet or CSV file.

Note that you call a function by using the function’s name followed by the arguments you wish to pass to the function. Of course, you need to follow the definition of the function. Many functions are simply “built-in” while others come from packages that you need to explicitly load into your program.

Note that there is no semicolon at the end of a line.

plot(cars)

Expressions

R can be directly used to solve simple or complex mathematical expressions.

# [1] in the output below is the index of the first value shown on that line.
# R prefixes every line of printed output with such an index.

((2^3)*5)-1
## [1] 39
# sqrt and exp are built-in functions in R for finding Square root and exponential respectively.

sqrt(4)* exp(2)
## [1] 14.77811

Variables

Variables are not declared in R. When you use a variable for the first time, it is defined and the data type is based on what value you assign to the variable. So, R uses dynamic typing, which also means that when you assign a value of a different type to a previously used variable, it changes type. You can find the type of a variable using the functions typeof, class, and mode.
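
As a small illustration of dynamic typing, the sketch below reassigns a single variable (the name x is arbitrary) and inspects it with the functions mentioned above. The comments show what a standard R installation should report.

```r
x <- 10L            # the L suffix makes this an integer
typeof(x)           # "integer"

x <- 9.9            # the same variable now holds a double
typeof(x)           # "double"
class(x)            # "numeric"

x <- "some text"    # ... and now a character string
typeof(x)           # "character"
mode(x)             # "character"
```

Note that typeof reports the internal storage type, while class and mode group integer and double together as numeric.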

R supports the usual data types: integer, double, Boolean, and string. It also has pre-defined complex types, including vector, data frame, array, list, and matrix. There is also a Date data type and a few others.

The absence of a value is indicated in R using NA rather than nil, null or NULL. An NA means that there is no value for an integer, character, logical value, or numeric. On the other hand, NULL means that an object is “empty”, i.e. a reference to an object that does not exist.

Values are assigned to variables using either = or <-, the latter being more common but the former more familiar to programmers of other common languages.

a = 10L                      # defines an integer (without the L suffix, 10 is a double)
d = 9.9                      # defines a double
s = "some text"              # defines a string of characters
g <- 'also text'             # single quotes are the same as double quotes

Text can be enclosed in single or double quotes. Which you use depends on preference or if you need to nest quotes.

z = "It's a quote."

To echo the value of a variable to the console, either use the variable on a line by itself or use the function print. If you need to echo multiple values, combine them with paste0.

d <- 123.99
d
## [1] 123.99
print(paste0("Value of d = ", d))
## [1] "Value of d = 123.99"

Missing Values: NA vs NULL

In R, NA and NULL are used to represent different types of missing data, and they serve different purposes:

  1. NA (Not Available):
    • NA is used to represent missing values in vectors, lists, or data frames. It acts as a placeholder for an element that does not have a value but is expected to have one.
    • NA can be of any data type like numeric, character, or logical, meaning you can have NA_integer_, NA_real_, NA_complex_, and NA_character_.
    • Operations involving NA generally result in NA. For example, 5 + NA results in NA.
    • A value of NA is a value and is counted, e.g., length(c(3,NA,4)) has the value 3.
  2. NULL:
    • NULL is used to represent the absence of a value or no value at all. It is typically used to denote that a variable is empty or uninitialized.
    • NULL is often used in list or data frame operations to remove elements or indicate that an element is absent.
    • NULL has a different behavior in operations compared to NA. For instance, adding NULL to another object or concatenating it generally results in the other object unchanged. For example, c(1, 2, NULL) results in c(1, 2).

Essentially, NA is used when an element of the data exists but its value is missing, whereas NULL is used when the data itself does not exist.
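
The differences listed above can be checked directly at the console. The following is a minimal sketch using only base R; the values are arbitrary.

```r
5 + NA                  # NA: the missing value propagates
length(c(3, NA, 4))     # 3: NA occupies a position
length(c(1, 2, NULL))   # 2: NULL contributes nothing
length(NULL)            # 0: NULL itself has no elements
typeof(NA_integer_)     # "integer": NA exists for each basic type
typeof(NA_character_)   # "character"
```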

Many functions that return an object such as a data frame would return NULL if they could not generate the object.

In R, you can test for NA and NULL using specific functions designed for this purpose. Here’s how you can do it:

  1. Testing for NA:
    • Use the is.na() function, which checks for NA values in an object. It returns a logical vector of the same size as the input, with TRUE for elements that are NA and FALSE for those that are not.

    • Example:

      vec <- c(1, NA, 3, NA, 5)
      is.na(vec)
      # Output: FALSE  TRUE FALSE  TRUE FALSE
  2. Testing for NULL:
    • Use the is.null() function, which checks if an object is NULL. It returns TRUE if the object is NULL and FALSE otherwise.

    • Example:

      x <- NULL
      is.null(x)
      # Output: TRUE

These functions are useful in various programming scenarios, such as conditional execution of code depending on the presence of actual data, and handling missing values in data analysis and transformation tasks.
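
As one possible way to use these tests for conditional execution, the sketch below defines a small helper. The name describe is hypothetical and not part of base R; anyNA is a base R function that returns TRUE if at least one element is NA.

```r
# describe() is a hypothetical helper for illustration only
describe <- function(x) {
  if (is.null(x)) {
    return("no data at all")
  }
  if (anyNA(x)) {
    return(paste0(sum(is.na(x)), " missing value(s)"))
  }
  return("complete")
}

describe(NULL)               # "no data at all"
describe(c(1, NA, 3, NA))    # "2 missing value(s)"
describe(c(1, 2, 3))         # "complete"
```

The test for NULL comes first because a NULL object has no elements to inspect for NA.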

Naming Identifiers

The naming of identifiers, i.e., variable and function names, is the same as in most other languages, except that you can use the period as a valid identifier character. That can be confusing to Java and C++ programmers as they are used to using . as a method or object property access operator. In R, it’s just another character like a or _.

a.val = 10.5          # legal
a_val = 10.5          # legal
aVal1 = 10.5          # legal
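a$val = 10.5          # not a name: $ is the operator for list elements and data frame columns

The last one is a bit tricky. If a already exists as a vector, a$val = 10.5 will actually convert a to a list object (with a warning); if a does not exist, it is an error. Let’s just not use it unless you are accessing columns in a data frame, but that’s for another tutorial.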

The rules for naming an identifier (variable, function, or package name) for an object are as follows:

  • identifiers are case-sensitive and cannot contain spaces or special characters such as #, %, $, @, * , &, ^, !, ~

  • an identifier must start with a letter (or with a dot that is not followed by a digit), but may contain any combination of letters and digits thereafter

  • special characters dot (.) and underscore (_) are allowed

The dot (.) is a regular character in R, which can be confusing as other languages (e.g., Java) use the dot as an operator to designate property or method access, e.g., in Java x.val means that you are accessing the val property of the object x.

Some examples of legal variable names are: df, df2, df.txns, and df_all2017. These are some illegal variable names: 2df (cannot start with a digit), rs$all (cannot contain a $; the $ is used to access columns in a dataframe), rs# (only . and _ are allowed in addition to digits and letters).

It is considered good programming practice to give identifiers a sensible name that hints as to what is stored in the variable rather than using a random name like x, val, or i33. Instead use anItem or annualTotal.

Identifiers should be named consistently. Many programmers use one of two styles:

  • underscores, e.g., interest_rate
  • camelCase, e.g., squareRoot, graphData, currentWorkingDirectory

Note that R is case sensitive which means that R treats the identifiers AP and ap as different objects.

As a side note, file names may also be case sensitive but that depends on the operating system and file system. Linux is case sensitive, while Windows and, in its default configuration, MacOS are case aware but not case sensitive. For example, on Linux there is a difference between “AirPassengers.txt” and “airpassengers.txt” while on Windows there is not. SQL keywords are also not case sensitive. It is a best practice to assume case sensitivity.

Expressions

As a language with roots in statistical analysis, R supports most mathematical operators, including many that are not supported through operators in most other languages.

Precedence is like other languages and can be (and should be) specified explicitly using parentheses.

a <- 99

b = a + 20                 # addition
b = a - 20                 # subtraction
b = a * 20                 # multiplication
b = a / 20                 # division -- the result is a "double"
b = a ^ 3                  # exponentiation
b = a %% 2                 # modulus (mod)
b = a %/% 2                # integer division (div)

b = (a / 2) ^ 5            # force order of evaluation

There are also numerous built-in functions for mathematics and statistics, but those are beyond this tutorial on base syntax. For the sake of completeness, an example is shown below that calculates the z-score of a vector of numbers. In R, a vector is similar to an array; it contains primitive values (integer, double, Boolean, or string). Notice the automatic vector calculation even without a loop.

v = c(1,4,6,2,8,1,0,3)     # vector of numbers

m = mean(v)                # mean of values in v
s = sd(v)                  # standard deviation
z = (v - m) / s            # z-score of each value in v

print(round(z,2))          # print rounded values
## [1] -0.77  0.32  1.05 -0.41  1.77 -0.77 -1.14 -0.05

Boolean Variables and Logical Expressions

Boolean values are TRUE and FALSE; or T and F, respectively.

m = FALSE
w = TRUE

q = (m & w) | (!m & !w)

print(paste0("q is ", q))
## [1] "q is FALSE"

The Boolean operators are & for AND, | for OR, and ! for NOT. These operators perform logical operations on Boolean variables or vectors of Boolean elements, element by element. Exclusive or (xor) is performed with the function xor, e.g., xor(a,b).

There are also the long forms && and ||. They evaluate a single condition from left to right and stop as soon as the result is known (short-circuit evaluation), so they are meant for control flow and are generally preferred in if statements.
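
The difference between the element-wise operators and the long forms can be sketched as follows. The vectors a and b and the variable x are arbitrary values chosen for illustration.

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

a & b          # element-wise AND: TRUE FALSE FALSE
a | b          # element-wise OR:  TRUE TRUE TRUE
xor(a, b)      # element-wise XOR: FALSE TRUE TRUE

# && stops as soon as the result is known, so the comparison
# x > 0 is never evaluated when x is NULL
x <- NULL
if (!is.null(x) && x > 0) {
  msg <- "x is positive"
} else {
  msg <- "x is NULL or not positive"
}
print(msg)
```

With the single & in the if condition, both sides would be evaluated and the condition would have length zero, which is an error in an if statement.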

Flow Control

The control flow statements provided by R are similar to those found in most other programming languages: loops and if statements.

Loops

R supports the three common types of loops: a counting loop (for), a top-tested loop (while), and a bottom-tested loop (repeat).

Counting Loop: for

The counting loop is actually more like an iterator in R, although it can be set up to mimic the behavior of a typical loop in C/C++, Java, etc., that loops from a low value to a high value. See the examples below.

n = 5
for (i in 1:n) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

In the example above, i is the loop variable and it takes on the values in the vector 1:5, i.e., 1, 2, 3, 4, 5.

To count down, it would be 5:1 or n:1 in the above code example.

Like other languages, R uses curly braces to enclose the body of the loop, i.e., the statements that are executed repeatedly, once for each value of i. Of course, the loop counter can be any properly named variable.

The example below accesses a vector using positional access, e.g., v[1] accesses the first element. Note that vectors, data frames, lists, and arrays are indexed starting at 1 and not 0 as in C-based languages.

v = rnorm(5)      # vector of five random numbers
for (i in 1:length(v)) 
{
  v[i] = v[i] ^ 2
}

print(v)
## [1] 0.003630603 2.683820944 1.535107835 2.112020092 1.549035610

While you can use loops, you often do not need them. If you apply a mathematical operation to a vector, R automatically applies it to each element, but loops might be more natural in the beginning.

# note that you actually do not need loops in R
# this also squares each element in the vector v
v = v ^ 2

As already stated, looping in R is actually iteration over a set: a set of numbers as above or a set of any kind of primitive object, e.g., strings. Note that in the code below, k takes on each value in the vector over which the loop iterates.

s = c("one","two","three","four")

for (k in s) {
  print(k)
}
## [1] "one"
## [1] "two"
## [1] "three"
## [1] "four"

We could have written this using a non-iterator approach as well, which is more like what you’d do in C. Note that length returns the number of elements in a vector, i.e., its “length”.

s = c("one","two","three","four")

for (j in 1:length(s)) {
  print(s[j])
}
## [1] "one"
## [1] "two"
## [1] "three"
## [1] "four"
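
One caveat with the index-based form: 1:length(s) counts down from 1 to 0 when the vector is empty, so the loop body would run twice. A minimal sketch of the base R function seq_along, which avoids this, is shown below; the variable names e and n are arbitrary.

```r
e <- character(0)           # an empty character vector

1:length(e)                 # 1 0 -- a loop over this would run twice
seq_along(e)                # integer(0) -- a loop over this never runs

n <- 0                      # counts loop iterations
for (j in seq_along(e)) {
  n <- n + 1
}
n                           # 0: the body was never executed

s <- c("one", "two", "three", "four")
for (j in seq_along(s)) {
  print(paste0(j, ": ", s[j]))
}
```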

Iteration Continuation

Continuation of the loop to the next iteration and forgoing processing the remainder of the current iteration is done with next in R, similar to continue in C-based languages such as Java.

In the example below we are already reaching ahead to the if statement. The code fragment echoes only odd numbers. Of course, you could have done this differently by just iterating over the odd numbers from one to ten, but that would not have allowed us to share this example for using next.

for (i in 1:10) {
  if (!i %% 2) {
    next
  }
  print(i)
}
## [1] 1
## [1] 3
## [1] 5
## [1] 7
## [1] 9

But, just to show you how to iterate over just the odd numbers, we can use the seq() function which generates a sequence of numbers in steps. No if and no next needed – much simpler.

for (i in seq(1, 10, 2)) {
  print(i)
}
## [1] 1
## [1] 3
## [1] 5
## [1] 7
## [1] 9

Or, more explicitly by specifying the parameters by name.

for (i in seq(from = 1, to = 10, by = 2)) {
  print(i)
}
## [1] 1
## [1] 3
## [1] 5
## [1] 7
## [1] 9

Stopping a loop early and moving immediately to the next statement after the loop is done with break in R, which is identical to C, C++, Java, etc. In the code below we will find the position of the first occurrence of some number; x is the number we are looking for in v and p is the found position (0 if x is not found).

v = c(1, 3, 5, 7, 1, 9, 4)
x = 9
p = 0

for (i in 1:length(v)) {
  if (v[i] == x) {
    p = i
    break
  }
}

print(p)
## [1] 6

To be clear, there is a shorter and, for long vectors, generally faster way to do this using the which function. It has no direct equivalent in C or Java and shows the power of vector operations in R. Note that the code below finds all occurrences of x in v, not just the first. Note, once again, the use of a vector variable to refer to all elements.

v = c(1, 3, 5, 7, 1, 9, 4)
x = 9

p = which(v == x)

print(p)
## [1] 6
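
To see the difference between finding all occurrences and only the first, the sketch below searches the same vector for a value that occurs twice and for one that does not occur at all. It also uses match, a base R function that returns the position of the first match.

```r
v <- c(1, 3, 5, 7, 1, 9, 4)

which(v == 1)        # 1 5: every position holding the value 1
which(v == 1)[1]     # 1: only the first occurrence
match(9, v)          # 6: first position of 9, like the loop with break
match(2, v)          # NA: 2 does not occur in v
which(v == 2)        # integer(0): an empty result rather than 0
```

Unlike the loop, which leaves p at 0 when nothing is found, which returns an empty vector and match returns NA, so the "not found" case needs its own check.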

Conditional Loops: while and repeat

Like many other languages, R supports top and bottom tested loops. In the example below we repeatedly ask for a number from the user and only exit the loop if the number is 42. Of course, the example really should be done with a bottom tested loop.

response <- as.integer(readline(prompt="Enter a number: "))

while (response != 42) {
  print("Sorry, not correct")
  response <- as.integer(readline(prompt="Enter a number: "))
}

In the above example, we are using two new functions. readline is used to read input from the console, while as.integer coerces (or casts in C/C++ terminology) the input to an integer.

The code above is not very elegant as it has the code for the user prompt twice. It is really better to ask first and then check the condition: we need a bottom-tested loop.

Let’s look at the same example, but with a repeat loop that runs until a condition is reached and the loop is explicitly exited with break. So, if you want to run the code at least once, use repeat; if the code is run zero or more times, use while. Combined with a break at the end of its body, the repeat loop plays the role of the do or do while loop found in many other programming languages.

repeat {
  response <- as.integer(readline(prompt="Enter a number: "))
  if (response == 42) {
    break
  }
  print("Sorry, not correct")
}

There is no equivalent in R for the use of an infinite for loop as in C and C++, e.g., for(;;){ // do something indefinitely } that runs until a condition is reached and then uses break to exit the loop. The repeat loop construct is used instead. There is no testing of a condition in the repeat loop, so there’s no equivalent to the C-like construct do … while.
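
Because readline requires an interactive session, here is a non-interactive sketch of the same pattern: a repeat loop whose exit test sits at the bottom of the body, which is how a do … while loop can be emulated. The starting value 20 and the variable names are arbitrary.

```r
# count how many times n must be halved before it drops below 1
n <- 20
steps <- 0

repeat {
  n <- n / 2              # the body always runs at least once
  steps <- steps + 1
  if (n < 1) {            # exit test at the bottom of the body
    break
  }
}

print(steps)              # 5: 10, 5, 2.5, 1.25, 0.625
```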

Alternation: if

The use of if to selectively execute a block of code based on a condition is the same in R as it is in other languages. We already saw an example and the same example is below:

for (i in 1:10) {
  if (!i %% 2) {
    next
  }
  print(i)
}
## [1] 1
## [1] 3
## [1] 5
## [1] 7
## [1] 9

Note that the condition is in parentheses and that the block executed if the condition is true is in curly braces. This is identical to C, C++, Java, etc.

A more complex example is shown below that squares the numbers from 1 to 10 that are less than 5 and cubes the others (5 and above). Not a useful example but one that allows us to show the use of if-else.

for (i in 1:10) {
  if (i < 5) {
    print(i^2)
  } else {
    print(i^3)
  }
}
## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 125
## [1] 216
## [1] 343
## [1] 512
## [1] 729
## [1] 1000
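
For comparison, the same calculation can be written without a loop. The sketch below uses the base R function ifelse, which applies the test to every element of a vector and picks the matching value from its second or third argument; the variable names are arbitrary.

```r
i <- 1:10

# square where i < 5, cube everywhere else
r <- ifelse(i < 5, i^2, i^3)

print(r)   # 1 4 9 16 125 216 343 512 729 1000
```

Note that ifelse is a function for vectors, whereas if … else is a control-flow statement that expects a single condition.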

Functions

Code organization and reuse in R is done using functions. All functions are free functions and are not bound to an object or a class like methods in Java or C++. It’s the same way as in Python or C.

All functions in R can return a value, although they do not have to. So, R does not distinguish between functions and procedures and there is no void return type as in C, C++, and Java.

Defining a Function

The generic template for defining a function is:

function_name <- function(arg_1, arg_2, ...) {
   # function body
}

The example below defines a function called findSmallest() which takes a vector of positive integers as an argument and returns the smallest element in the vector. While it can be solved in several ways, we will show a design that uses loops and should be familiar to programmers of most other languages.

Note that we are using the predefined value Inf, which represents positive infinity and is larger than any number. There is also -Inf, negative infinity, which is smaller than any number.

findSmallest = function(v)
{
  s = Inf
  for (i in 1:length(v))
  {
    if (v[i] < s) {
      s = v[i]
    }
  }
  return (s)
}

While you can use = to define a function, you should really get used to using the more common <- syntax. So, let’s try again:

findSmallest <- function(v)
{
  s = Inf
  for (i in 1:length(v))
  {
    if (v[i] < s) {
      s = v[i]
    }
  }
  return (s)
}

Just to be clear, in practice you would use the min() function to find the smallest element rather than writing it yourself.

While there are several ways to return a value from a function, the way that is most congruent with other languages is the use of the return statement.
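
One of the other ways is worth recognizing because it is common in R code written by others: a function returns the value of the last expression it evaluates, so return can be left out. A minimal sketch, with the hypothetical function names square and squareExplicit:

```r
# both names are made up for this illustration
square <- function(x) {
  x^2                      # the value of the last expression is returned
}

squareExplicit <- function(x) {
  return(x^2)              # same result, stated explicitly
}

square(4)                  # 16
squareExplicit(4)          # 16
```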

Note that the type of return value and the type of arguments are not declared. R uses a lazy evaluation mechanism and no type checking is performed until run-time.

Calling a Function

To call a function, you would invoke it with its name and its required arguments.

x = c(3,1,9,7,3,6)

w = findSmallest(x)
print(w)
## [1] 1

Function Parameters

If a function takes several arguments you generally pass them in the order declared; the approach that is used by most other languages. However, in R you can pass the arguments in any order as long as you specify the name of the argument.

Argument matching is a bit different in R compared to other languages. Firstly, R does all argument checking at run-time. Secondly, while arguments can be matched positionally like in other languages, arguments can also be matched by parameter name, a syntax that languages such as C and Java do not support.

For example, the built-in function seq generates a sequence of numbers and returns those numbers in a vector. The definition of the function is as follows: seq(from, to, by, length.out, along.with).

Here are examples of using it. Note that by, length.out, and along.with have default values and are therefore optional.

v = seq(1, 10, 2)    # integers from 1 to 10 in increments of 2
w = seq(1, 5)        # integers from 1 to 5 (by default in increments of 1)

# pass arguments in a different order but specify by name
w = seq(from = 5, by = -0.5, to = 1)

R also supports variable numbers of arguments but that is beyond the scope of this tutorial.

Default Arguments

R functions can have default values for arguments which are then optional when the function is called. When the argument is missing, then the default value is passed. In the example below, the start argument is the position at which the search for the smallest element will start.

findSmallest <- function(v, start = 1)
{
  s = Inf
  for (i in start:length(v))
  {
    if (v[i] < s) {
      s = v[i]
    }
  }
  return (s)
}
x = c(3,1,9,7,3,6)

w = findSmallest(x, 3)
print(w)
## [1] 3
w = findSmallest(x)
print(w)
## [1] 1

Local Variables

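As in most other programming languages, R functions can define local variables that are not known outside the scope of the function. Unlike C or Java, however, the scope boundary in R is the function rather than a block enclosed in curly braces: a variable first assigned inside a loop or an if block remains visible in the rest of the enclosing function.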

In the example below, local.var is local to the function and thus is not visible outside of the function. The code below produces the error: “Error in print(local.var) : object ‘local.var’ not found”.

findSmallest <- function(v, start = 1)
{
  local.var = Inf
  for (i in start:length(v))
  {
    if (v[i] < local.var) {
      local.var = v[i]
    }
  }
  return (local.var)
}

x = c(3,1,9,7,3,6)

w = findSmallest(x, 3)

# we cannot echo or access the local variable "local.var"
print(local.var)
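
One detail that is easy to check for yourself: in R the scope of a variable is the enclosing function, so the curly braces of a loop or an if block do not hide a variable from the rest of that function. The sketch below uses a hypothetical function scopeDemo and assumes that no global variable named inside.loop exists.

```r
# scopeDemo() is a made-up function for illustration
scopeDemo <- function() {
  for (i in 1:3) {
    inside.loop <- i * 10     # first assigned inside the braces of the loop
  }
  return(inside.loop)         # still visible after the loop ends
}

scopeDemo()                   # 30: the value from the last iteration
exists("inside.loop")         # FALSE: not visible outside the function
```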

Recursion

R functions can be called recursively. The example below calculates factorial using recursion rather than a loop.

fac <- function(x)
{
  if (x <= 1)              # x <= 1 also ends the recursion for fac(0)
    return (1)
  else 
    return (x * fac(x-1))
}

print(fac(8))
## [1] 40320

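If it hasn’t been obvious yet, just like in other languages, the placement of curly braces is largely a matter of style. One caveat: outside of a function or other enclosing block, else must follow the closing curly brace of the if block on the same line, otherwise R considers the if statement complete. For single statement blocks, the curly braces can be omitted.

The parentheses around the value for return are required because return is a function in R.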

As an exercise, try writing the above function to calculate factorial using a loop.

Type Coercion

Type coercion (or casting in C/C++ terminology) is done with type conversion functions in R and not through operators like in C, C++, and Java. The example below shows the most common type conversion functions. Note that in some situations you will lose information, just like in other languages.

s = "12.4"                 # string (character)
i = as.integer(s)          # convert to integer: 12
d = as.numeric(s)          # convert to double: 12.4

w = "$12.3"                # additional characters
k = as.integer(w)          # cannot be converted due to $
## Warning: NAs introduced by coercion

If a coercion is not successful the result is NA, which indicates a null or missing value.
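
If the extra characters are known in advance, they can be removed before converting. The sketch below is one possible approach using the base R function sub with fixed = TRUE, so that $ is treated as a literal character rather than a regular expression anchor; the variable names are arbitrary.

```r
w <- "$12.3"

cleaned <- sub("$", "", w, fixed = TRUE)   # remove the literal dollar sign
k <- as.numeric(cleaned)                   # 12.3
i <- as.integer(k)                         # 12: the fractional part is dropped

# a failed coercion yields NA; suppressWarnings() hides the warning
bad <- suppressWarnings(as.integer(w))
is.na(bad)                                 # TRUE
```

Testing the result with is.na is a simple way to detect that a conversion failed.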

Vectors

Let’s talk more about vectors. In R, a vector is similar to a list or array in other programming languages. It is a collection of elements of the same basic type: numeric, character, or Boolean. A list in R is a collection of mixed data types. This tutorial applies to vectors only.

There are no specific packages required for these functions.

Creating a Vector

The code below creates an artificial vector of random integers for use in the tutorial. In practice, vectors are generally columns in data frames which are frequently the result of reading data from a CSV file or a database.

# vector of 50 random integers between 0 and 10

# set the seed for the random number generator to ensure same
# sequence of random numbers every time the code is run
set.seed(98788)
v <- round(runif(50, min = 0, max = 10),0)

# arguments do not have to be passed in the order that they are
# declared in the function definition as long as the names of the
# arguments are specified
v <- round(runif(n = 50, max = 10, min = 1),0)
v <- round(runif(max = 10, min = 1, n = 50),0)

print(v)
##  [1]  9  6  9  1  9  8 10  2  6  9  8  7  6  1  7 10  4  3  5  9  7  8  4  9  7
## [26]  4  6  5  1  4  4  2  7  9  9  6  3  2  3  7  8  6  3  9  4 10  7  1  4  2

Accessing Elements in a Vector

Elements are accessed positionally, although in R, the access index can be a vector of integers in which case all elements at those positions are retrieved. Positions are numbered from 1 to the number of elements in a vector. The number of elements (or length) of a vector can be obtained using the length() function.

In the example below, note that n:m generates a vector of integers from n to m, inclusive. The seq() function generates a sequence of numbers at an interval.

# access a single element at position 3
v[3]

# access element 10 through 15
v[10:15]

# access the last element
v[length(v)]

# access every other element
v[seq(from = 1, to = length(v), by = 2)]

# access specific elements at positions 2, 11, 19, and 28
i <- c(2,11,19,28)
v[i]
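
A few more forms of positional access are sketched below on a small vector u (an arbitrary example, so that the results are easy to verify by eye): negative positions exclude elements, and a Boolean vector keeps the elements where it is TRUE.

```r
u <- c(10, 20, 30, 40, 50)

u[-1]                 # 20 30 40 50: everything except the first element
u[-c(1, 2)]           # 30 40 50: drop the first two elements
u[c(TRUE, FALSE, TRUE, FALSE, TRUE)]   # 10 30 50: keep where TRUE
u[u > 25]             # 30 40 50: Boolean vector built from a comparison
head(u, 2)            # 10 20: the first two elements
tail(u, 1)            # 50: same as u[length(u)]
u[7]                  # NA: reading past the end is not an error
```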

Testing Predicate Expressions

It is possible in R to apply a predicate expression to every element in a vector. This generates a “Boolean vector” of TRUE/FALSE values that indicate which element matches the predicate expression (TRUE) and which doesn’t (FALSE).

Predicate expressions are built with relational operators (<, >, <=, >=, ==, !=) and can be combined with the logical operators &, |, and !.

v < 5
##  [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [13] FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
## [25] FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
## [37]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE
## [49]  TRUE  TRUE
(v < 1 | v > 9)
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
## [49] FALSE FALSE
(v <= 7 & v != 3)
##  [1] FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE
## [13]  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
## [25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE
## [37] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE
## [49]  TRUE  TRUE
v != 5
##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## [25]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [37]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [49]  TRUE  TRUE
l <- (v == 5)
print(l)
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE

Finding Matches

The which() function returns the positions that are TRUE in a Boolean vector.

# returns positions of vector that matches predicate expression
which(v != 5)
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 20 21 22 23 24 25 26
## [26] 27 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
# count the number of matches
length(which(v != 0))
## [1] 50
p <- which(v < 5)
print (v[p])
##  [1] 1 2 1 4 3 4 4 1 4 4 2 3 2 3 3 4 1 4 2
# or combine
x <- v[which(v < 5)]
print (x)
##  [1] 1 2 1 4 3 4 4 1 4 4 2 3 2 3 3 4 1 4 2
# find all elements that do not match the predicate
not.x <- v[-which(v < 5)]
print (not.x)
##  [1]  9  6  9  9  8 10  6  9  8  7  6  7 10  5  9  7  8  9  7  6  5  7  9  9  6
## [26]  7  8  6  9 10  7
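
As an alternative to which(), a Boolean vector can be used directly for counting and selecting. The sketch below uses a small vector u (arbitrary values) and also shows a pitfall of negating the result of which() when nothing matches.

```r
u <- c(9, 6, 1, 8, 2)

sum(u < 5)                 # 2: TRUE counts as 1, FALSE as 0
u[u < 5]                   # 1 2: select without which()
u[!(u < 5)]                # 9 6 8: the complement

# pitfall: if nothing matches, which() is empty and the
# negative index selects nothing instead of everything
u[-which(u > 100)]         # numeric(0)
u[!(u > 100)]              # 9 6 1 8 2
```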

Determining Any Matches

To determine if there are any matches, i.e., at least one element in a vector matches the predicate expression, use the any() function. The function any() returns TRUE if there’s at least one match, FALSE otherwise.

any(v < 5)
## [1] TRUE
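
The companion function all() answers the opposite question, namely whether every element matches. A minimal sketch on a small vector u with arbitrary values:

```r
u <- c(9, 6, 1, 8, 2)

any(u < 5)      # TRUE: 1 and 2 are below 5
all(u < 5)      # FALSE: not every element is below 5
all(u < 10)     # TRUE: every element is below 10
any(u > 10)     # FALSE: no element is above 10
```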

Dealing with Missing Values

Missing values in R are generally encoded with the special value NA. NA is a placeholder for an unknown value rather than an ordinary number, character string, or Boolean. Consequently, using == or != to check if a value is NA does not work: the comparison itself evaluates to NA rather than TRUE or FALSE (and an if statement given NA as its condition fails with an error). You must use the function is.na() to check if a value is NA. This is similar to NULL in SQL and many programming languages.

# copy the vector of random numbers and then randomly 
# remove 6 values, i.e., set them to NA

v.na <- v

v.na[round(runif(6, min = 1, max = length(v.na)), 0)] = NA

print(v.na)
##  [1]  9  6  9  1  9  8 10  2 NA  9  8  7 NA  1  7 10  4  3  5  9  7 NA  4 NA  7
## [26]  4  6  5  1  4  4  2  7  9 NA  6  3  2  3  7 NA  6  3  9  4 10  7  1  4  2

Many functions in R return NA when any value in their input vector is NA, because the result of a calculation that involves an unknown value is itself unknown.

# cannot add values containing NA
sum(v.na)
## [1] NA
# when applying operators, NA remains NA
v.na + 5
##  [1] 14 11 14  6 14 13 15  7 NA 14 13 12 NA  6 12 15  9  8 10 14 12 NA  9 NA 12
## [26]  9 11 10  6  9  9  7 12 14 NA 11  8  7  8 12 NA 11  8 14  9 15 12  6  9  7

Some functions have a parameter, commonly named na.rm, that directs the function to ignore NA values. Check the documentation of a function before using it to see which parameters it supports. Use ?sum to view the documentation of the sum function.
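
As a brief sketch, sum() and mean() both accept na.rm = TRUE. The vector x is hypothetical example data; other functions may name or handle this option differently, so check their documentation.

```r
x <- c(4, NA, 10, 6)   # hypothetical example vector

total <- sum(x, na.rm = TRUE)    # NA is skipped
avg   <- mean(x, na.rm = TRUE)   # mean of the 3 known values
known <- x[!is.na(x)]            # or drop the missing values explicitly

print(total)
## [1] 20
print(avg)
## [1] 6.666667
print(known)
## [1]  4 10  6
```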

Accessing Rows, Columns, and Elements (Cells) of a Data Frame

Data frames are very similar to tables in relational databases and spreadsheets. They have rows and columns, and the intersection of a row and column is a cell (or element). The order of access is row followed by column, e.g., the third element in the fourth row of the data frame mtcars is mtcars[4,3]. Note that this is reversed from the way Excel and other spreadsheets work, where a cell reference such as C4 names the column first.

The example code below uses the built-in data frame mtcars. You can find out more about its structure using str(mtcars) or display the first few rows with head(mtcars).

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
head(mtcars, 3)
##                mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
v <- mtcars[4,3]
x = mtcars[4,3]

print(paste0("v = ",v," and x = ",x))
## [1] "v = 258 and x = 258"

Leaving out a dimension (row or column) accesses the entire row or column. Leaving out the column, e.g., mtcars[4,], returns a data frame with a single row; leaving out the row, e.g., mtcars[,2], returns the column as a vector.

Often the values must be converted to a vector data type. Conversion of a variable from one type to another is done with the family of as.xxxx functions, e.g., as.vector, as.numeric, or as.factor. Vectors can contain numeric, character, or logical data, but all elements must be of the same type. In R, a list is similar to a vector but it may contain a mix of element types. A matrix is similar to a data frame, but all of its elements must be of the same type and it always has exactly two dimensions; an array generalizes a matrix to more than two dimensions.

Some functions expect data frames, some vectors, some lists. You need to read the documentation of a function to find out which it expects. Furthermore, some functions will automatically convert a variable from one type to the one they require.

You can also access a column in a data frame by its column name. For an entire column, use either the column's position or its name: df[,column] or df$columnName.

# all of row 4; the result is a data frame
r <- mtcars[4,]
sum(r)
## [1] 426.135
# all of row 3, then the third column of that one-row data frame
r3 <- mtcars[3,]
r3[1,3]
## [1] 108
mtcars[c(1,4)]   # columns 1 and 4 as a new dataframe
##                      mpg  hp
## Mazda RX4           21.0 110
## Mazda RX4 Wag       21.0 110
## Datsun 710          22.8  93
## Hornet 4 Drive      21.4 110
## Hornet Sportabout   18.7 175
## Valiant             18.1 105
## Duster 360          14.3 245
## Merc 240D           24.4  62
## Merc 230            22.8  95
## Merc 280            19.2 123
## Merc 280C           17.8 123
## Merc 450SE          16.4 180
## Merc 450SL          17.3 180
## Merc 450SLC         15.2 180
## Cadillac Fleetwood  10.4 205
## Lincoln Continental 10.4 215
## Chrysler Imperial   14.7 230
## Fiat 128            32.4  66
## Honda Civic         30.4  52
## Toyota Corolla      33.9  65
## Toyota Corona       21.5  97
## Dodge Challenger    15.5 150
## AMC Javelin         15.2 150
## Camaro Z28          13.3 245
## Pontiac Firebird    19.2 175
## Fiat X1-9           27.3  66
## Porsche 914-2       26.0  91
## Lotus Europa        30.4 113
## Ford Pantera L      15.8 264
## Ferrari Dino        19.7 175
## Maserati Bora       15.0 335
## Volvo 142E          21.4 109
mtcars[,2]       # all of column 2
##  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
mtcars[5:7,]     # rows 5 to 7 as a new dataframe
##                    mpg cyl disp  hp drat   wt  qsec vs am gear carb
## Hornet Sportabout 18.7   8  360 175 3.15 3.44 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.46 20.22  1  0    3    1
## Duster 360        14.3   8  360 245 3.21 3.57 15.84  0  0    3    4
mtcars$cyl       # column named "cyl"
##  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
mtcars$cyl[2]    # 2nd row in the column "cyl"
## [1] 6
mtcars$cyl[3:9]  # rows 3 to 9 for column "cyl" as a vector
## [1] 4 6 8 6 8 4 4
w <- mtcars$mpg
mean(w)
## [1] 20.09062
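
The following sketch rounds out the access patterns above with a few forms not shown there: unlist() to flatten a one-row data frame into a vector, name-based column access with [, "name"] and [["name"]], and drop = FALSE to keep a single column as a data frame. It uses only the built-in mtcars data frame.

```r
# a single row is a data frame; unlist() flattens it to a named numeric vector
row4 <- unlist(mtcars[4, ])
print(is.data.frame(mtcars[4, ]))
## [1] TRUE
print(is.numeric(row4))
## [1] TRUE
print(row4[["disp"]])
## [1] 258

# columns by name: all three forms return the same vector
a <- mtcars$mpg
b <- mtcars[, "mpg"]
d <- mtcars[["mpg"]]
print(identical(a, b) && identical(b, d))
## [1] TRUE

# drop = FALSE keeps a single column as a data frame
one.col <- mtcars[, "mpg", drop = FALSE]
print(dim(one.col))
## [1] 32  1
```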

Adding and Removing Columns from a Data Frame

To add a new column, assign values to a column name that does not yet exist in the data frame. Note in the example below that you can operate on entire columns (as vectors); the operation is applied to each pair of corresponding values in the two vectors. This is more concise, and usually much faster, than the explicit loops that many other programming languages require.

# copy the data frame mtcars to a new data frame df
df <- mtcars

# create a new column "dispcyl" which is the displacement per cylinder
df$dispcyl <- df$disp / df$cyl

head(df)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb  dispcyl
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 26.66667
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 26.66667
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 27.00000
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 43.00000
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 45.00000
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 37.50000
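
Removing a column is not shown above. One common approach, sketched here on a copy of mtcars, is to assign NULL to the column; another is to select every column except the ones to discard. The choice of "vs" and "am" as the columns to drop is arbitrary and for illustration only.

```r
df <- mtcars
df$dispcyl <- df$disp / df$cyl   # add a column
print(ncol(df))
## [1] 12

# remove a column by assigning NULL to it
df$dispcyl <- NULL
print(ncol(df))
## [1] 11

# or keep everything except the named columns
df2 <- mtcars[, !(names(mtcars) %in% c("vs", "am"))]
print(ncol(df2))
## [1] 9
```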

Create a New Data Frame

Data frames are created in various ways: with the data.frame function, by loading a CSV file, by executing a SQL query, or as the result of many package functions.

Load a Data Frame from CSV

Quick note: Capitalization in path and file names does not matter on Windows, but it does matter on Linux and can matter on macOS, depending on how the file system is configured. Furthermore, note that in R, even on Windows, the path delimiter is written as a forward slash / and not the usual backslash \. The \ is an “escape” character that is used to inject special characters into a string (text), e.g., the text This string contains "quotes" is written in R as "This string contains \"quotes\"".

The parameter header = F instructs read.csv not to interpret the first line as header labels. If the file has no labels, R assigns the default names V1, V2, and so on, and you can define your own with the col.names parameter.

Aside from CSV files, R can also load a number of other file formats using various packages, including XML, Excel, SPSS, and MATLAB, among many others.

df <- read.csv(file = "customertxndata.csv", header = F)
head(df)
##   V1 V2      V3     V4        V5
## 1  7  0 Android   Male    0.0000
## 2 20  1     iOS   <NA>  576.8668
## 3 22  1     iOS Female  850.0000
## 4 24  2     iOS Female 1050.0000
## 5  1  0 Android   Male    0.0000
## 6 13  1 Android   Male  460.0000
df <- read.csv(file = "customertxndata.csv", 
               header = F,
               col.names = c("numVisits","NumTxn","OS","Gender","TotSp"))
head(df)
##   numVisits NumTxn      OS Gender     TotSp
## 1         7      0 Android   Male    0.0000
## 2        20      1     iOS   <NA>  576.8668
## 3        22      1     iOS Female  850.0000
## 4        24      2     iOS Female 1050.0000
## 5         1      0 Android   Male    0.0000
## 6        13      1 Android   Male  460.0000

Note that the value of the Gender column in the second row is NA, which is the way that R indicates a missing data value. It is not 0 or an empty string; it is unknown. So, statistical functions and algebraic operations on it would result in NA as well.
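
The file customertxndata.csv is not reproduced here, so the sketch below substitutes a few made-up records passed through the text parameter of read.csv. It assumes that a missing gender appears in the file as an empty field, which na.strings = "" turns into NA; the actual file may encode missing values differently.

```r
# hypothetical inline data standing in for a CSV file; the empty
# field in the second record is read as NA via na.strings
csv.text <- "7,0,Android,Male,0
20,1,iOS,,576.87
22,1,iOS,Female,850"

txn <- read.csv(text = csv.text,
                header = FALSE,
                na.strings = "",
                col.names = c("numVisits", "NumTxn", "OS", "Gender", "TotSp"))

print(is.na(txn$Gender))
## [1] FALSE  TRUE FALSE

# keep only the rows without any missing value
complete <- txn[complete.cases(txn), ]
print(nrow(complete))
## [1] 2
```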

Strings vs Factors

The factor data type encodes categorical data, i.e., the value of a variable is one of a fixed set of values. Many statistical functions in R require categorical variables to be of type factor. However, often, during data processing, we need the actual text rather than having it encoded as a factor (a factor is actually stored in R as integer codes for efficiency). So, when reading a CSV file you need to decide if you want text columns to be character strings or factors by setting the stringsAsFactors parameter. Its default is FALSE as of R 4.0.0; in earlier versions it was TRUE.

You may use either F and T or FALSE and TRUE, although the full names are safer because T and F are ordinary variables that can be reassigned.

df <- read.csv(file = "customertxndata.csv", 
               header = F,
               stringsAsFactors = FALSE,
               col.names = c("numVisits","NumTxn","OS","Gender","TotSp"))
head(df)
##   numVisits NumTxn      OS Gender     TotSp
## 1         7      0 Android   Male    0.0000
## 2        20      1     iOS   <NA>  576.8668
## 3        22      1     iOS Female  850.0000
## 4        24      2     iOS Female 1050.0000
## 5         1      0 Android   Male    0.0000
## 6        13      1 Android   Male  460.0000
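
To make the difference between text and factors concrete, here is a small sketch with a made-up character vector os. It shows the levels of a factor, the integer codes behind it, and the conversion back to text.

```r
os <- c("Android", "iOS", "iOS", "Android")   # hypothetical values

f <- factor(os)
print(levels(f))
## [1] "Android" "iOS"
print(as.integer(f))      # the underlying integer codes
## [1] 1 2 2 1
print(as.character(f))    # back to text
## [1] "Android" "iOS"     "iOS"     "Android"
print(table(f)[["iOS"]])  # number of observations in one category
## [1] 2
```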

Create a Data Frame from Vectors

The code below creates a new data frame from column vectors. Notice how the column names are taken from the argument names (state, code, and score). A new vector is created with the c function, e.g., v <- c(3,5,1,9).

df1 <- data.frame(state = c('Arizona','Georgia', 'New York','Indiana','Washington','Texas'),
                  code = as.factor(c('AZ','GA','NY','IN','WA','TX')),
                  score = c(62,47,55,74,31,85))

head(df1)
##        state code score
## 1    Arizona   AZ    62
## 2    Georgia   GA    47
## 3   New York   NY    55
## 4    Indiana   IN    74
## 5 Washington   WA    31
## 6      Texas   TX    85

Search Data Frames

There are two important functions for “searching” data frames: which and any. The code below uses the built-in Orange data frame which contains measurements of orange trees. It has three columns: the tree, the age of the tree (days since 1968/12/31), and circumference (in mm).

which

df <- Orange

head(df)
##   Tree  age circumference
## 1    1  118            30
## 2    1  484            58
## 3    1  664            87
## 4    1 1004           115
## 5    1 1231           120
## 6    1 1372           142
# find all rows where the circumference is more than 200mm
rs <- which(df$circumference > 200)

# display all rows where the circumference is more than 200mm
df[rs,]
##    Tree  age circumference
## 13    2 1372           203
## 14    2 1582           203
## 27    4 1372           209
## 28    4 1582           214
# compound conditions are possible with & (and), | (or), and ! (not)
rs2 <- which(df$circumference > 200 & df$age < 1500)
rs3 <- which(df$circumference < 200 | !(df$age < 1500))
rs4 <- which(df$circumference > 400 | df$age > 1500)

rs2
## [1] 13 27
rs3
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 14 15 16 17 18 19 20 21 22 23 24 25 26
## [26] 28 29 30 31 32 33 34 35
rs4
## [1]  7 14 21 28 35
mean(df[rs4,2])
## [1] 1582
mean(df$age[rs3])
## [1] 894.8788

In the above example rs <- which(df$circumference > 200) finds all rows in the data frame df where circumference > 200. The positions (row numbers) of those rows are saved in rs, which is then used to index the data frame.
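
The same selections can be sketched without which(): a logical condition can index the rows directly, and the base function subset() evaluates the condition inside the data frame. This is an alternative formulation of the example above, not a replacement for it.

```r
df <- Orange

# logical indexing: the condition itself selects the rows
big <- df[df$circumference > 200, ]
print(nrow(big))
## [1] 4

# subset() lets you refer to columns without the df$ prefix
big.young <- subset(df, circumference > 200 & age < 1500)
print(big.young$age)
## [1] 1372 1372
```

One difference to keep in mind: which() silently drops positions where the condition is NA, while logical indexing returns a row of NA values for each of them. Orange has no missing values, so both forms give the same result here.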

any

The any function returns TRUE or FALSE depending on whether at least one element, e.g., of a data frame column, satisfies a Boolean expression.

# is there any tree with age > 25 days?
any(df$age > 25)
## [1] TRUE
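
For contrast, a short sketch of a check that comes back FALSE on the same Orange data, plus the common idiom of combining any() with is.na() to test a whole data frame for missing values.

```r
df <- Orange

print(max(df$age))
## [1] 1582

# no tree is older than 2000 days
old <- any(df$age > 2000)
print(old)
## [1] FALSE

# TRUE if any cell of the data frame is missing
has.na <- any(is.na(df))
print(has.na)
## [1] FALSE
```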

Memory Management

R is similar to Python and other interpreted languages in terms of memory management. In an interactive session, objects and variables remain in memory until you restart R or explicitly delete them. This can cause conflicts during development, e.g., a script may silently rely on a variable left over from an earlier run. Clearing the environment at the start of an R script or an R Notebook ensures that the program runs with an empty memory environment. This matters for languages like R and Python that are commonly run in a long-lived interactive session; it is not needed for programs that start in a fresh process each time, such as compiled Java and C++ programs.

Use the code below to find and then delete all objects, and reclaim memory. The function ls() lists all objects (variables) by name, while the rm() removes one or more objects from memory. Finally, the function gc() runs the garbage collector and returns freed memory to the usable memory pool for the process in which R is running.

rm(list = ls(all.names = TRUE))
gc()

Of course, rather than deleting all objects as in the code chunk above, you may wish to release large or unused objects selectively by name, e.g., rm(objName) or rm("objName").
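
A minimal sketch of selective removal, using two made-up objects big.obj and small.obj. The exists() function reports whether an object with the given name is still defined.

```r
big.obj <- numeric(1e6)   # hypothetical large object
small.obj <- 1:10

print(exists("big.obj"))
## [1] TRUE

rm(big.obj)               # remove one object by name
invisible(gc())           # ask R to reclaim the freed memory

print(exists("big.obj"))
## [1] FALSE
print(exists("small.obj"))
## [1] TRUE
```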

Install Packages on Demand

To make your code portable and reproducible, install packages within your code:

# packages needed in R program
packages <- c("stringr", "RSQLite")

# install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(!installed_packages)) {
  install.packages(packages[!installed_packages])
}

# load all packages by applying 'library' function
invisible(lapply(packages, library, character.only = TRUE))
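
An alternative way to test for installed packages, sketched below, is requireNamespace(), which returns TRUE or FALSE for a single package. The helper missing.packages is a hypothetical name introduced for this illustration; it is demonstrated with packages that ship with R so that the example does not install anything.

```r
# hypothetical helper: names of the packages that are not installed
missing.packages <- function(pkgs) {
  is.installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!is.installed]
}

# "stats" and "utils" ship with R, so nothing should be missing
todo <- missing.packages(c("stats", "utils"))
print(length(todo))
## [1] 0
```

Note that requireNamespace() loads a package's namespace as a side effect but does not attach it, so library() is still needed afterwards to use the package's functions without a prefix.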

Conclusion

As you saw, R is not a difficult language to learn: it is similar to other languages, and most language constructs that you are familiar with have an equivalent in R. But it is important that you go beyond this tutorial and learn the “R way” of programming using vectorized operations.



