Introduction
This tutorial is a quick introduction to R for programmers of other high-level languages. It shows those features of R that are familiar to most programmers so that they can get started programming right away. It is important to note that the programming approach presented here is generally neither the most efficient nor the most common approach. We show the “R way” whenever possible and simple enough to explain, but the goal of the tutorial is to get programmers programming. This is especially useful to students in computer science courses that use R but who are new to the language.
R is best suited for data projects: data loading, data transformation, databases, data analysis, data visualization, and data science. There is substantial support for statistical analysis, unsupervised data mining, supervised machine learning, and even interactive dashboards.
Often, programming tasks that take dozens of lines of code in most languages can be written with one statement or one line in R. While R is not the fastest language, for many vector processing and mathematical operations it generally outperforms most other high-level languages, particularly when vectorization hardware is present.
As of 2021, R is one of the top languages to learn and for any kind of data-related work it is critical (along with Python, Scala, and perhaps Go).
The tutorial is geared towards students in information science, data science, and database design. It demonstrates the basic R syntax that is most often used for data processing rather than statistics.
The R Language
R is usually written in a procedural style similar to C. R does provide object systems (S3, S4, and Reference Classes), but they work quite differently from the classes of C++ or Java and are not covered in this tutorial. For the purposes of this tutorial, there is no direct equivalent of class or struct in C/C++ or Java; the closest everyday structures are lists and data frames.
Programs in R are scripts. There is no “main function” or similar. R “programs” are collections of R statements that are interpreted when executed interactively. There is no compilation step. R does support reusable third-party code “libraries” in the form of packages.
Working in R
To write “programs” in R you will need Base R which you can download for Linux, MacOS, and Windows from R Project. This is the core language with an interactive console. Programs, or more aptly R scripts, can be built in any text editor (TextEdit, Notepad, vi, Sublime, JEdit, etc.).
Most programming is done with an IDE (Integrated Development Environment). The most common is R Studio downloadable from RStudio. There is a hosted version of R Studio available at rstudio.cloud.
Install R from R Project before installing R Studio.
The tutorial below explains how to get started with R Notebooks:
Execute chunks by clicking the Run button within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter. The code runs in the order in which the chunks are executed, so non-linear code execution is possible unless you instruct R Studio to run all chunks starting at the first chunk.
Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I.
When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Ctrl+Shift+K to preview the HTML file).
The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.
Projects in R Studio
Projects are a better way to manage code than creating individual R Notebooks, R Scripts, and other code files. Projects allow all files, including data files, to be managed as a single unit, shared, and version controlled using services such as git and GitHub.
The tutorial below demonstrates how to create a project in R Studio and add files to the project.
R Code Chunks
While you can also write R Scripts, we will concentrate on how to write R “programs” by composing an R Notebook in R Studio. This is the most common way to use R in data related projects where reproducibility is paramount. Programs in R run from start to end. Each chunk should be a step in your analysis or data project. Name your code chunks, so you can quickly navigate to them.
In the chunk below, the variable cars passed to the built-in Base R function plot is one of the dozens of “built-in” data frames; a data frame being data arranged in rows and columns similar to a spreadsheet or CSV file.
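A minimal sketch of such a chunk, assuming it does nothing more than inspect and plot the built-in cars data frame:

```r
# cars is a built-in data frame with 50 rows and 2 columns (speed, dist)
str(cars)

# plot() draws a scatter plot of the two columns
plot(cars)
```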
Note that you call a function by using the function’s name followed by the arguments you wish to pass to the function. Of course, you need to follow the definition of the function. Many functions are simply “built-in” while others come from packages that you need to explicitly load into your program.
Note that there is no semicolon at the end of a line.
Expressions
R can be directly used to solve simple or complex mathematical expressions.
((2^3)*5)-1
## [1] 39
# The [1] in the answer above is the index of the first value printed on that line.
# R prefixes every line of printed output with the index of its first value.
# sqrt and exp are built-in functions in R for the square root and the exponential, respectively.
sqrt(4)* exp(2)
## [1] 14.77811
Variables
Variables are not declared in R. When you use a variable for the first time, it is defined and the data type is based on what value you assign to the variable. So, R uses dynamic typing, which also means that when you assign a value of a different type to a previously used variable, it changes type. You can find the type of a variable using the functions typeof, class, and mode.
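As a small illustration of dynamic typing, the sketch below assigns values of different types to the same variable and inspects it each time. The variable name x is chosen only for this example.

```r
x <- 10          # a plain number is stored as a double
typeof(x)        # "double"
class(x)         # "numeric"

x <- 10L         # the L suffix requests an integer
typeof(x)        # "integer"

x <- "ten"       # same variable, now holding text
typeof(x)        # "character"
mode(x)          # "character"
```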
R supports the usual data types: integer, double, Boolean, and string. It also has pre-defined complex types, including vector, data frame, array, list, and matrix. There is also a Date data type and a few others.
The absence of a value is indicated in R using NA rather than nil, null or NULL. An NA means that there is no value for an integer, character, logical value, or numeric. On the other hand, NULL means that an object is “empty”, i.e. a reference to an object that does not exist.
Values are assigned to variables using either = or <-, the latter being more common but the former more familiar to programmers of other common languages.
a = 10 # defines a number (stored as a double; write 10L for an integer)
d = 9.9 # defines a double
s = "some text" # define a string of characters
g <- 'also text' # single quotes are the same as double quotes
Text can be enclosed in single or double quotes. Which you use depends on preference or if you need to nest quotes.
To echo the value of a variable to the console, either use the variable on a line by itself or use the function print. If you need to echo multiple values, combine them with paste0.
d = 123.99
print(d)
## [1] 123.99
print(paste0("Value of d = ", d))
## [1] "Value of d = 123.99"
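For comparison, here is a short sketch of paste0 next to its sibling paste, which inserts a separator between the pieces. The value of d is taken from the output above.

```r
d <- 123.99
paste0("Value of d = ", d)      # no separator: "Value of d = 123.99"
paste("Value of d", "=", d)     # pieces joined by a space: "Value of d = 123.99"
paste("a", "b", sep = "-")      # custom separator: "a-b"
```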
Missing Values: NA vs NULL
In R, NA and NULL are used to represent different types of missing data, and they serve different purposes:
- NA (Not Available):
  - NA is used to represent missing values in vectors, lists, or data frames. It acts as a placeholder for an element that does not have a value but is expected to have one.
  - NA can be of any data type, such as numeric, character, or logical; besides the logical NA there are the typed constants NA_integer_, NA_real_, NA_complex_, and NA_character_.
  - Operations involving NA generally result in NA. For example, 5 + NA results in NA.
  - A value of NA is a value and is counted, e.g., length(c(3,NA,4)) has the value 3.
- NULL:
  - NULL is used to represent the absence of a value or no value at all. It is typically used to denote that a variable is empty or uninitialized.
  - NULL is often used in list or data frame operations to remove elements or indicate that an element is absent.
  - NULL behaves differently from NA in operations. Concatenating NULL with another object leaves the other object unchanged; for example, c(1, 2, NULL) results in c(1, 2). Arithmetic with NULL, such as 1 + NULL, yields a zero-length vector rather than NA.
Essentially, NA is used when an element of the data exists but its value is missing, whereas NULL is used when the data itself does not exist.
Many functions that return an object such as a data frame return NULL if they cannot generate the object.
In R, you can test for NA
and NULL
using specific functions designed for this purpose. Here’s how you can do it:
- Testing for NA:
- Testing for NULL:
These functions are useful in various programming scenarios, such as conditional execution of code depending on the presence of actual data, and handling missing values in data analysis and transformation tasks.
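A brief sketch of both tests follows. The variable names v and r are placeholders, and the guard at the end is just one common way to use is.null.

```r
v <- c(3, NA, 4)
is.na(v)            # FALSE  TRUE FALSE
anyNA(v)            # TRUE: at least one element is NA
sum(is.na(v))       # 1: number of missing values

r <- NULL
is.null(r)          # TRUE
length(r)           # 0: NULL has no elements

# typical guard before using a result that may be NULL
if (!is.null(r)) {
  print(r)
}
```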
Naming Identifiers
The naming of identifiers, i.e., variable and function names, is the same as in most other languages, except that you can use the period as a valid identifier character. That can be confusing to Java and C++ programmers as they are used to using . as a method or object property access operator. In R, it’s just another character like a or _.
a.val = 10.5 # legal
a_val = 10.5 # legal
aVal1 = 10.5 # legal
a$val = 10.5 # not a plain variable name; $ accesses elements of lists and data frames
The last one is a bit tricky. If a already exists, a$val will actually turn a into a list object; if a does not exist, the assignment fails. Let’s just not use it unless you are accessing columns in a data frame, but that’s for another tutorial.
The rules for naming an identifier (variable, function, or package name) for an object are as follows:
- identifiers are case-sensitive and cannot contain spaces or special characters such as #, %, $, @, *, &, ^, !, ~
- an identifier must start with a letter (or with a dot that is not followed by a digit), and may contain any combination of letters and digits thereafter
- the special characters dot (.) and underscore (_) are allowed, although an identifier cannot start with an underscore
The dot (.) is a regular character in R and that can be confusing as other languages (e.g., Java) use dot as an operator to designate property or method access, e.g., in Java x.val means that you are accessing the val property of the object x.
Some examples of legal variable names are: df, df2, df.txns, and df_all2017. These are some illegal variable names: 2df (cannot start with a digit), rs$all (cannot contain a $; the $ is used to access columns in a dataframe), rs# (only . and _ are allowed in addition to digits and letters).
It is considered good programming practice to give identifiers a sensible name that hints at what is stored in the variable rather than using a random name like x, val, or i33. Instead use anItem or annualTotal.
Identifiers should be named consistently. Many programmers use one of two styles:
- underscores, e.g., interest_rate
- camelCase, e.g., squareRoot, graphData, currentWorkingDirectory
Note that R is case sensitive which means that R treats the identifiers AP and ap as different objects.
As a side note, files may also be case sensitive but that depends on the operating system. MacOS and Linux are case sensitive, while Windows is case aware but not case sensitive. For example, on MacOS and Linux there is a difference between “AirPassengers.txt” and “airpassengers.txt” while on Windows there is not. SQL is also not case sensitive. It is a best practice to assume case sensitivity.
Expressions
As a language with roots in statistical analysis, R supports most mathematical operators, including many that are not supported through operators in most other languages.
Precedence is like other languages and can be (and should be) specified explicitly using parentheses.
a <- 99
b = a + 20 # addition
b = a - 20 # subtraction
b = a * 20 # multiplication
b = a / 20 # division -- b is now "double"
b = a ^ 3 # exponentiation
b = a %% 2 # modulus (mod)
b = a %/% 2 # integer division (div)
b = (a / 2) ^ 5 # force order of evaluation
There are also numerous built-in functions for mathematics and statistics, but those are beyond this tutorial on base syntax. For the sake of completeness, an example is shown below that calculates the z-score of a vector of numbers. In R, a vector is similar to an array; it contains primitive values (integer, double, Boolean, or string). Notice the automatic vector calculation even without a loop.
v = c(1,4,6,2,8,1,0,3) # vector of numbers
m = mean(v) # mean of values in v
s = sd(v) # standard deviation
z = (v - m) / s # z-score of each value in v
print(round(z,2)) # print rounded values
## [1] -0.77 0.32 1.05 -0.41 1.77 -0.77 -1.14 -0.05
Boolean Variables and Logical Expressions
Boolean values are TRUE and FALSE; or T and F, respectively.
m = FALSE
w = TRUE
q = (m & w) | (!m & !w)
print(paste0("q is ", q))
## [1] "q is FALSE"
The Boolean operators are & for AND, | for OR, and ! for NOT. These operators perform element-wise logical operations on Boolean variables or vectors of Boolean elements. Exclusive or (xor) is performed with the function xor, e.g., xor(a,b).
The long forms && and || operate on single values and stop evaluating as soon as the result is known (short-circuit evaluation). They are meant for control flow and are generally preferred in if statements.
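The difference between the element-wise and the long forms can be sketched as follows. The variable names are illustrative only.

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)
a & b                      # element-wise AND: TRUE FALSE FALSE
a | b                      # element-wise OR:  TRUE TRUE TRUE
xor(a, b)                  # exclusive or:     FALSE TRUE TRUE

# && stops as soon as the result is known, so the second test
# (x > 0) is never evaluated when x is NULL
x <- NULL
ok <- !is.null(x) && x > 0
print(ok)
## [1] FALSE
```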
Flow Control
The control flow statements provided by R are similar to those found in most other programming languages: loops and if statements.
Loops
R supports the three common types of loops: a counting loop (for), a top-tested loop (while), and a bottom-tested loop (repeat).
Counting Loop: for
The counting loop is actually more like an iterator in R, although it can be set up to mimic the behavior of a typical loop in C/C++, Java, etc., that loops from a low value to a high value. See the examples below.
n = 5
for (i in 1:n) {
print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
In the example above, i is the loop variable and it takes on the values in the vector 1:5, i.e., 1, 2, 3, 4, 5.
To count down, it would be 5:1 or n:1 in the above code example.
Like other languages, R uses curly braces to enclose the body of the loop, i.e., the statements that are executed repeatedly, once for each value of i. Of course, the loop counter can be any properly named variable.
The example below accesses a vector using positional access, e.g., v[1] accesses the first element. Note that vectors, data frames, lists, and arrays are indexed starting at 1 and not 0 as in C-based languages.
v = rnorm(5) # vector of five random numbers
for (i in 1:length(v))
{
v[i] = v[i] ^ 2
}
print(v)
## [1] 0.003630603 2.683820944 1.535107835 2.112020092 1.549035610
While you can use loops you actually do not need them. If you apply a mathematical operation to a vector, R automatically applies it to each element, but loops might be more natural in the beginning.
# note that you actually do not need loops in R
# this also squares each element in the vector v
v = v ^ 2
As already stated, looping in R is actually iteration over a set: a set of numbers as above or a set of any kind of primitive object, e.g., strings. Note that in the code below, k takes on each value in the vector over which the loop iterates.
s = c("one","two","three","four")
for (k in s) {
print(k)
}
## [1] "one"
## [1] "two"
## [1] "three"
## [1] "four"
We could have written this using a non-iterator approach as well, which is more like what you’d do in C. Note that length returns the number of elements in a vector, i.e., its “length”.
s = c("one","two","three","four")
for (j in 1:length(s)) {
print(s[j])
}
## [1] "one"
## [1] "two"
## [1] "three"
## [1] "four"
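One caveat with the index-based form: when the vector is empty, 1:length(s) evaluates to 1:0, i.e., the vector c(1, 0), so the loop body still runs twice. A common safeguard, sketched below, is seq_along(), which yields an empty sequence for an empty vector. The counter variable count is only there to make the effect visible.

```r
s <- character(0)          # an empty character vector

1:length(s)                # 1 0: two iterations, probably not intended
seq_along(s)               # integer(0): zero iterations

count <- 0
for (j in seq_along(s)) {
  count <- count + 1
}
print(count)
## [1] 0
```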
Iteration Continuation
Continuation of the loop to the next iteration and forgoing processing the remainder of the current iteration is done with next in R, similar to continue in C-based languages such as Java.
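In the example below we are already reaching ahead to the if statement. The code fragment echoes only odd numbers. Note that ! has lower precedence than %% in R, so !i %% 2 means !(i %% 2); the remainder 0 of an even number counts as FALSE, so the negated test is TRUE exactly for the even numbers and they are skipped. Of course, you could have done this differently by just iterating over the odd numbers from one to ten, but that would not have allowed us to share this example for using next.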
for (i in 1:10) {
if (!i %% 2) {
next
}
print(i)
}
## [1] 1
## [1] 3
## [1] 5
## [1] 7
## [1] 9
But, just to show you how to iterate over just the odd numbers, we can use the seq() function which generates a sequence of numbers in steps. No if and no next needed, which is much simpler.
for (i in seq(1, 10, 2)) {
print(i)
}
## [1] 1
## [1] 3
## [1] 5
## [1] 7
## [1] 9
Or, more explicitly by specifying the parameters by name.
for (i in seq(from = 1, to = 10, by = 2)) {
print(i)
}
## [1] 1
## [1] 3
## [1] 5
## [1] 7
## [1] 9
To stop the execution of a loop and move immediately to the first statement after the loop, use break, which works as it does in C, C++, Java, etc. In the code below we find the position of the first occurrence of some number; x is the number we are looking for in v and p is the found position.
v = c(1, 3, 5, 7, 1, 9, 4)
x = 9
p = 0
for (i in 1:length(v)) {
if (v[i] == x) {
p = i
break
}
}
print(p)
## [1] 6
To be clear, there is a quicker and more efficient way to do this using the which function. Few other languages have a direct counterpart, and it shows the power of R. Incidentally, the code below finds all occurrences of x in v. Note, once again, the use of a vector variable to refer to all elements.
v = c(1, 3, 5, 7, 1, 9, 4)
x = 9
p = which(v == x)
print(p)
## [1] 6
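Because which returns every matching position, a value that occurs more than once yields several positions. The sketch below shows this, along with two ways to get only the first occurrence. The base function match is not used elsewhere in this tutorial and is shown here only as an alternative.

```r
v <- c(1, 3, 5, 7, 1, 9, 4)
which(v == 1)        # all positions holding 1: 1 5
which(v == 1)[1]     # first occurrence only: 1
match(9, v)          # position of the first 9: 6
which(v == 42)       # no match: integer(0)
```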
Conditional Loops: while and repeat
Like many other languages, R supports top and bottom tested loops. In the example below we repeatedly ask for a number from the user and only exit the loop if the number is 42. Of course, the example really should be done with a bottom tested loop.
response <- as.integer(readline(prompt="Enter a number: "))
while (response != 42) {
  print("Sorry, not correct")
  response <- as.integer(readline(prompt="Enter a number: "))
}
In the above example, we are using two new functions. readline is used to read input from the console, while as.integer coerces (or casts in C/C++ terminology) the input to an integer.
The code above is not very efficient or elegant as it has the code for the user prompt twice. It is really better to ask first and then check the condition: we need a bottom-tested loop.
Let’s look at the same example, but with a repeat loop that runs until a condition is reached and the loop is explicitly exited with break. So, if you want to run the code at least once, use repeat; if the code is run zero or more times, use while. With the test and break placed inside its body, the repeat loop takes the place of the do ... while loop found in many other programming languages.
repeat {
  response <- as.integer(readline(prompt="Enter a number: "))
  if (response == 42) {
    break
  }
  print("Sorry, not correct")
}
There is no equivalent in R for the use of an infinite for loop as in C and C++, e.g., for(;;){ // do something indefinitely }, that runs until a condition is reached and then uses break to exit the loop. The repeat loop construct is used instead. There is no testing of a condition in the repeat statement itself, so there’s no direct equivalent to the C-like construct do … while; the test must be written inside the body together with break.
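Since readline needs an interactive session, here is a non-interactive sketch of the same pattern. The values in the hypothetical vector guesses stand in for what a user would type.

```r
guesses <- c(7, 13, 42, 99)   # stand-in for values typed by a user
i <- 0
repeat {
  i <- i + 1                  # the body always runs at least once
  response <- guesses[i]
  if (response == 42) {       # exit test written inside the body
    break
  }
  print("Sorry, not correct")
}
print(i)
## [1] 3
```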
Alternation: if
The use of if to selectively execute a block of code based on a condition is the same in R as it is in other languages. We already saw an example and the same example is below:
for (i in 1:10) {
if (!i %% 2) {
next
}
print(i)
}
## [1] 1
## [1] 3
## [1] 5
## [1] 7
## [1] 9
Note that the condition is in parentheses and that the block executed if the condition is true is in curly braces. This is identical to C, C++, Java, etc.
A more complex example is shown below that squares all numbers from 1 to 10 that are less than 5 and cubes all others. Not a useful example but one that allows us to show the use of if-else.
for (i in 1:10) {
if (i < 5) {
print(i^2)
} else {
print(i^3)
}
}
## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 125
## [1] 216
## [1] 343
## [1] 512
## [1] 729
## [1] 1000
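The same computation can also be written without a loop. The sketch below uses the vectorized base function ifelse, which picks between two values for each element. It is shown only as an alternative to the loop above.

```r
i <- 1:10
r <- ifelse(i < 5, i^2, i^3)   # square where i < 5, cube elsewhere
print(r)                       # 1 4 9 16 125 216 343 512 729 1000
```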
Functions
Code organization and reuse in R is done using functions. All functions are free functions and are not bound to an object or a class like methods in Java or C++. It’s the same way as functions in Python or C.
Every function in R returns a value, even if that value is just NULL and the caller ignores it. So, R does not distinguish between functions and procedures and there is no void return type as in C, C++, and Java.
Defining a Function
The generic template for defining a function is:
function_name <- function(arg_1, arg_2, ...) {
Function body
}
The example below defines a function called findSmallest() which takes a vector of positive integers as an argument and returns the smallest element in the vector. While it can be solved in several ways, we will show a design that uses loops and should be familiar to programmers of most other languages.
Note that we are using the predefined value Inf, which represents positive infinity and is larger than any number. There is also -Inf, which is smaller than any number.
findSmallest = function(v)
{
s = Inf
for (i in 1:length(v))
{
if (v[i] < s) {
s = v[i]
}
}
return (s)
}
While you can use = to define a function, you should really get used to using the more common <- syntax. So, let’s try again:
findSmallest <- function(v)
{
s = Inf
for (i in 1:length(v))
{
if (v[i] < s) {
s = v[i]
}
}
return (s)
}
Just to be clear, in practice you would use the min() function to find the smallest element rather than writing it yourself.
While there are several ways to return a value from a function, the way that is most congruent with other languages is the use of the return statement.
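One of those other ways is the implicit return: a function returns the value of the last expression it evaluates. A minimal sketch with two hypothetical functions, squareA and squareB:

```r
# explicit return
squareA <- function(x) {
  return (x ^ 2)
}

# implicit return: the value of the last evaluated expression
squareB <- function(x) {
  x ^ 2
}

squareA(4)    # 16
squareB(4)    # 16
```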
Note that the type of return value and the type of arguments are not declared. R uses a lazy evaluation mechanism and no type checking is performed until run-time.
Calling a Function
To call a function, you would invoke it with its name and its required arguments.
x = c(3,1,9,7,3,6)
w = findSmallest(x)
print(w)
## [1] 1
Function Parameters
If a function takes several arguments you generally pass them in the order declared; the approach that is used by all other languages. However, in R you can pass the arguments in any order as long as you specify the name of the argument.
Argument matching is a bit different in R compared to other languages. Firstly, R does all argument checking at run-time. Secondly, while arguments can be matched positionally like in other languages, arguments can also be matched by parameter name, a syntax not supported by languages such as C or Java.
For example, the built-in function seq generates a sequence of numbers and returns those numbers in a vector. The definition of the function is as follows: seq(from, to, by, length.out, along.with).
Here are examples of using it. Note that by, length.out, and along.with have default values and are therefore optional.
v = seq(1, 10, 2) # integers from 1 to 10 in increments of 2
w = seq(1, 5) # integers from 1 to 5 (by default in increments of 1)
# pass arguments in a different order but specify by name
w = seq(from = 5, by = -0.5, to = 1)
R also supports variable numbers of arguments but that is beyond the scope of this tutorial.
Default Arguments
R functions can have default values for arguments which are then optional when the function is called. When the argument is missing, then the default value is passed. In the example below, the start argument is the position at which the search for the smallest element will start.
findSmallest <- function(v, start = 1)
{
s = Inf
for (i in start:length(v))
{
if (v[i] < s) {
s = v[i]
}
}
return (s)
}
x = c(3,1,9,7,3,6)
w = findSmallest(x, 3)
print(w)
## [1] 3
w = findSmallest(x)
print(w)
## [1] 1
Local Variables
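As in most other programming languages, R functions can define local variables that are not known outside the scope of the function. Unlike C or Java, however, the scope boundary in R is the function and not a block enclosed in curly braces: a variable first assigned inside a loop or if block remains visible in the rest of the function.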
In the example below, local.var is local to the function and thus is not visible outside of the function. The code below produces the error: “Error in print(local.var) : object ‘local.var’ not found”.
findSmallest <- function(v, start = 1)
{
local.var = Inf
for (i in start:length(v))
{
if (v[i] < local.var) {
local.var = v[i]
}
}
return (local.var)
}
x = c(3,1,9,7,3,6)
w = findSmallest(x, 3)
# we cannot echo or access the local variable "local.var"
print(local.var)
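A short sketch of how scoping behaves in practice (the names scopeDemo and inner.val are hypothetical): a variable created inside a loop body remains visible in the rest of the function, but not outside the function.

```r
scopeDemo <- function() {
  for (i in 1:3) {
    inner.val <- i * 2     # created inside the loop's curly braces
  }
  inner.val                # still visible here: braces alone do not limit scope
}

scopeDemo()                # 6
exists("inner.val")        # FALSE, provided no global variable has that name
```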
Recursion
R functions can be called recursively. The example below calculates factorial using recursion rather than a loop.
fac <- function(x)
{
  if (x <= 1)   # also stops the recursion for x = 0
    return (1)
  else
    return (x * fac(x-1))
}
print(fac(8))
## [1] 40320
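If it hasn’t been obvious yet, just like in other languages, the placement of curly braces is largely a matter of style. One exception: outside of a function body or other enclosing braces, else must appear on the same line as the closing brace of the if block, otherwise R treats the if statement as complete. For single statement blocks, the curly braces can be omitted.
The parentheses around the value for return are required because return is a function in R.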
As an exercise, try writing the above function to calculate factorial using a loop.
Type Coercion
Type coercion (or casting in C/C++ terminology) is done with type conversion functions in R and not through operators like in C, C++, and Java. The example below shows the most common type conversion functions. Note that in some situations you will lose information, just like in other languages.
s = "12.4" # string (character)
i = as.integer(s) # convert to integer: 12
d = as.numeric(s) # convert to double: 12.4
w = "$12.3" # additional characters
k = as.integer(w) # cannot be converted due to $
## Warning: NAs introduced by coercion
If a coercion is not successful, the result is NA, which indicates a missing value.
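A sketch of how a failed coercion might be detected and worked around, assuming the only problem is a leading dollar sign. The variable name cleaned is illustrative.

```r
w <- "$12.3"
k <- suppressWarnings(as.numeric(w))   # NA, with the warning silenced
is.na(k)                               # TRUE: the conversion failed

# remove the literal "$" first, then convert
cleaned <- sub("$", "", w, fixed = TRUE)
as.numeric(cleaned)                    # 12.3
```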
Vectors
Let’s talk more about vectors. In R, a vector is similar to a list or array in other programming languages. It is a collection of elements of the same basic type: numeric, character, or Boolean. A list in R is a collection of mixed data types. This tutorial applies to vectors only.
There are no specific packages required for these functions.
Creating a Vector
The code below creates an artificial vector of random integers for use in the tutorial. In practice, vectors are generally columns in data frames which are frequently the result of reading data from a CSV file or a database.
# vector of 50 random integers between 0 and 10
# set the seed for the random number generator to ensure same
# sequence of random numbers every time the code is run
set.seed(98788)
v <- round(runif(50, min = 0, max = 10),0)
# arguments do not have to be passed in the order that they are
# declared in the function definition as long as the names of the
# arguments are specified
v <- round(runif(n = 50, max = 10, min = 1),0)
v <- round(runif(max = 10, min = 1, n = 50),0)
print(v)
## [1] 9 6 9 1 9 8 10 2 6 9 8 7 6 1 7 10 4 3 5 9 7 8 4 9 7
## [26] 4 6 5 1 4 4 2 7 9 9 6 3 2 3 7 8 6 3 9 4 10 7 1 4 2
Accessing Elements in a Vector
Elements are accessed positionally, although in R, the access index can be a vector of integers in which case all elements at those positions are retrieved. Positions are numbered from 1 to the number of elements in a vector. The number of elements (or length) of a vector can be obtained using the length() function.
In the example below, note that n:m generates a vector of integers from n to m, inclusive. The seq() function generates a sequence of integers at an interval.
# access a single element at position 3
v[3]
# access elements 10 through 15
v[10:15]
# access the last element
v[length(v)]
# access every other element
v[seq(from = 1, to = length(v), by = 2)]
# access specific elements at positions 2, 11, 19, and 28
i <- c(2,11,19,28)
v[i]
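Two further forms of indexing are worth knowing, sketched here on a small hypothetical vector u: negative positions exclude elements, and a Boolean vector selects the elements where it is TRUE.

```r
u <- c(10, 20, 30, 40, 50)   # small vector for illustration
u[-1]                        # drop the first element: 20 30 40 50
u[-c(1, 2)]                  # drop the first two: 30 40 50
u[u > 25]                    # keep elements matching a condition: 30 40 50
u[c(TRUE, FALSE)]            # logical index is recycled: 10 30 50
```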
Testing Predicate Expressions
It is possible in R to apply a predicate expression to every element in a vector. This generates a “Boolean vector” of TRUE/FALSE values that indicate which element matches the predicate expression (TRUE) and which doesn’t (FALSE).
Predicate expressions are built with comparison operators (<, >, <=, >=, ==, !=). The outputs below are the Boolean vectors produced by several such expressions applied to v:
## [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [13] FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
## [25] FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
## [37] TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE
## [49] TRUE TRUE
## [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [49] FALSE FALSE
## [1] FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE
## [13] TRUE TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [25] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE
## [37] FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE TRUE TRUE
## [49] TRUE TRUE
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [13] TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
## [25] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [37] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [49] TRUE TRUE
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE
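To make the mechanics easier to follow, here is a sketch on a short hypothetical vector u, where each result can be checked by eye:

```r
u <- c(9, 6, 1, 10, 5)
u < 5            # FALSE FALSE  TRUE FALSE FALSE
u == 10          # FALSE FALSE FALSE  TRUE FALSE
u != 5           #  TRUE  TRUE  TRUE  TRUE FALSE
u >= 5 & u < 10  #  TRUE  TRUE FALSE FALSE  TRUE
```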
Finding Matches
The which() function returns the positions that are TRUE in a Boolean vector.
# returns positions of vector that matches predicate expression
which(v != 5)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 20 21 22 23 24 25 26
## [26] 27 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
# count the number of matches
length(which(v != 0))
## [1] 50
p <- which(v < 5)
print (v[p])
## [1] 1 2 1 4 3 4 4 1 4 4 2 3 2 3 3 4 1 4 2
# or combine
x <- v[which(v < 5)]
print (x)
## [1] 1 2 1 4 3 4 4 1 4 4 2 3 2 3 3 4 1 4 2
# find all elements that do not match, i.e., those not less than 5
not.x <- v[-which(v < 5)]
print (not.x)
## [1] 9 6 9 9 8 10 6 9 8 7 6 7 10 5 9 7 8 9 7 6 5 7 9 9 6
## [26] 7 8 6 9 10 7
Determining Any Matches
To determine if there are any matches, i.e., at least one element in a vector matches the predicate expression, use the any() function. The function any() returns TRUE if there’s at least one match, FALSE otherwise.
## [1] TRUE
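A small sketch of any() on a hypothetical vector u, together with its counterpart all(), which is TRUE only when every element matches:

```r
u <- c(9, 6, 1, 10, 5)
any(u < 5)       # TRUE: the value 1 matches
any(u > 10)      # FALSE: no element exceeds 10
all(u > 0)       # TRUE: every element is positive
```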
Dealing with Missing Values
Missing values in R are generally encoded with the special value NA. NA is not a number, not a character or text, and not a Boolean. Consequently, using == or != to check if a value is NA does not work: the comparison itself evaluates to NA rather than TRUE or FALSE. You must use the function is.na() to check if a value is NA. This is similar to NULL in SQL and many programming languages.
# copy the vector of random numbers and then randomly
# remove 6 values, i.e., set them to NA
v.na <- v
v.na[round(runif(6, min = 1, max = length(v.na)), 0)] = NA
print(v.na)
## [1] 9 6 9 1 9 8 10 2 NA 9 8 7 NA 1 7 10 4 3 5 9 7 NA 4 NA 7
## [26] 4 6 5 1 4 4 2 7 9 NA 6 3 2 3 7 NA 6 3 9 4 10 7 1 4 2
Many functions in R do not produce a usable result when a value in a vector is NA; they simply return NA.
# cannot add values containing NA
sum(v.na)
## [1] NA
# when applying operators, NA remains NA
v.na + 5
## [1] 14 11 14 6 14 13 15 7 NA 14 13 12 NA 6 12 15 9 8 10 14 12 NA 9 NA 12
## [26] 9 11 10 6 9 9 7 12 14 NA 11 8 7 8 12 NA 11 8 14 9 15 12 6 9 7
Some functions have a parameter that allows you to direct a function to ignore NA values. Check the documentation of functions before using them to see what parameters they support. Use ?sum to view the documentation of the sum function.
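For sum and mean, that parameter is na.rm. A minimal sketch on a hypothetical vector u:

```r
u <- c(3, NA, 4)
sum(u)                  # NA
sum(u, na.rm = TRUE)    # 7
mean(u, na.rm = TRUE)   # 3.5
sum(u[!is.na(u)])       # 7: same result by filtering out NA first
```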
Accessing Rows, Columns, and Elements (Cells) of a Data Frame
Data frames are very similar to tables in relational databases and spreadsheets. They have rows and columns and the intersection of a row and column is a cell (or element). The order of access is row followed by column, e.g., the third element in the fourth row of the data frame mtcars is mtcars[4,3]. Note that this is reversed from the way Excel and other spreadsheets work.
The example code below uses the built-in data frame mtcars. You can find out more about its structure using str(mtcars) or display the first few rows with head(mtcars).
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
v <- mtcars[4,3]
x = mtcars[4,3]
print(paste0("v = ",v," and x = ",x))
## [1] "v = 258 and x = 258"
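Leaving out a dimension (row or column) accesses the entire row or column. Leaving out the column, e.g., mtcars[4,], returns a data frame with a single row, while leaving out the row, e.g., mtcars[,2], returns the column as a vector.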
Often the values must be converted to a vector data type. Conversion of variables from one type to another is done with the family of as.xxxx functions, e.g., as.vector, as.numeric, or as.factor. Vectors can contain numeric or character data but all elements must be of the same type. In R, a list is similar to a vector but it may contain a mix of elements. A matrix is similar to a data frame but all of its elements must be of the same type and it always has exactly two dimensions; an array can have more than two dimensions.
Some functions expect data frames, some vectors, some lists. You need to read the documentation of a function to find out. Furthermore, some functions will automatically convert a variable from one type to the one it requires.
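A brief sketch of such conversions, using only base R functions and the built-in mtcars data frame. The variable names are chosen for illustration only.

```r
r <- mtcars[4, ]                      # one row: still a data frame
r.vec <- unlist(r)                    # flatten the row into a named numeric vector

cyl.f <- as.factor(mtcars$cyl)        # numeric column to categorical factor
cyl.chr <- as.character(mtcars$cyl)   # numeric column to text

print(class(r))
print(class(r.vec))
print(levels(cyl.f))
```

Here unlist is used rather than as.vector because a data frame is internally a list of columns, and unlist flattens that list into a single vector.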
You can also access a column in a data frame by its column name. For an entire column you either use the column's position or its name: df[,column] or df$columnName.
# all of row 4; the result is a data frame
r <- mtcars[4,]
sum(r)
## [1] 426.135
## [1] 108
mtcars[c(1,4)] # columns 1 and 4 as a new dataframe
## mpg hp
## Mazda RX4 21.0 110
## Mazda RX4 Wag 21.0 110
## Datsun 710 22.8 93
## Hornet 4 Drive 21.4 110
## Hornet Sportabout 18.7 175
## Valiant 18.1 105
## Duster 360 14.3 245
## Merc 240D 24.4 62
## Merc 230 22.8 95
## Merc 280 19.2 123
## Merc 280C 17.8 123
## Merc 450SE 16.4 180
## Merc 450SL 17.3 180
## Merc 450SLC 15.2 180
## Cadillac Fleetwood 10.4 205
## Lincoln Continental 10.4 215
## Chrysler Imperial 14.7 230
## Fiat 128 32.4 66
## Honda Civic 30.4 52
## Toyota Corolla 33.9 65
## Toyota Corona 21.5 97
## Dodge Challenger 15.5 150
## AMC Javelin 15.2 150
## Camaro Z28 13.3 245
## Pontiac Firebird 19.2 175
## Fiat X1-9 27.3 66
## Porsche 914-2 26.0 91
## Lotus Europa 30.4 113
## Ford Pantera L 15.8 264
## Ferrari Dino 19.7 175
## Maserati Bora 15.0 335
## Volvo 142E 21.4 109
mtcars[,2] # all of column 2
## [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
mtcars[5:7,] # rows 5 to 7 as a new dataframe
## mpg cyl disp hp drat wt qsec vs am gear carb
## Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
## Duster 360 14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
mtcars$cyl # column named "cyl"
## [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
mtcars$cyl[2] # 2nd row in the column "cyl"
## [1] 6
mtcars$cyl[3:9] # rows 3 to 9 for column "cyl" as a vector
## [1] 4 6 8 6 8 4 4
## [1] 20.09062
Adding and Removing Columns from a Data Frame
To add a new column, you simply assign to a column name that does not yet exist in the data frame. Note in the example below that you can operate on entire columns (as vectors) and the operation is applied to each pair of values in the two vectors. This is much more efficient in R than using loops, which would be necessary in many other programming languages.
# copy the data frame mtcars to a new data frame df
df <- mtcars
# create a new column "dispcyl" which is the displacement per cylinder
df$dispcyl <- df$disp / df$cyl
head(df)
## mpg cyl disp hp drat wt qsec vs am gear carb dispcyl
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 26.66667
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 26.66667
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 27.00000
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 43.00000
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 45.00000
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 37.50000
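Removing a column is not shown above. A minimal sketch, assuming a working copy named df.demo (a name made up for this example): assigning NULL to a column deletes it, and several columns can be dropped at once by selecting only the names to keep.

```r
df.demo <- mtcars
df.demo$dispcyl <- df.demo$disp / df.demo$cyl   # add a column (12 columns now)

df.demo$dispcyl <- NULL                         # remove it again (back to 11)

# drop several columns by keeping only those whose names are not listed
df.demo <- df.demo[, !(names(df.demo) %in% c("vs", "am"))]

print(names(df.demo))
```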
Create a New Data Frame
Data frames are created in various ways: use the data.frame function, load a CSV file, execute a SQL query, or as a result of many package functions.
Load a Data Frame from CSV
Quick note: Capitalization in path and file names does not matter on Windows, but may matter on MacOS and Linux. Furthermore, note that in R the path delimiter is written as a forward slash / even on Windows, and not as the usual backslash \. The \ is an “escape” character used to place special characters into a string (text), e.g., the text This string contains "quotes". would be written in R as "This string contains \"quotes\"."
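The note above can be sketched as follows. The folder name data is hypothetical; file.path is a base R function that joins path components with a forward slash.

```r
# \" places a double quote inside a double-quoted string
s <- "This string contains \"quotes\"."
cat(s, "\n")       # cat() prints the text without the escape characters

# build a path from its components instead of typing delimiters by hand
path <- file.path("data", "customertxndata.csv")
print(path)
```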
The parameter header = F instructs read.csv not to interpret the first line as header labels. Of course, if there are no labels, then you need to define your own.
Aside from CSV files, R can also load a number of other file formats using various packages, including XML, Excel, SPSS, and MATLAB, among many others.
df <- read.csv(file = "customertxndata.csv", header = F)
head(df)
## V1 V2 V3 V4 V5
## 1 7 0 Android Male 0.0000
## 2 20 1 iOS <NA> 576.8668
## 3 22 1 iOS Female 850.0000
## 4 24 2 iOS Female 1050.0000
## 5 1 0 Android Male 0.0000
## 6 13 1 Android Male 460.0000
df <- read.csv(file = "customertxndata.csv",
header = F,
col.names = c("numVisits","NumTxn","OS","Gender","TotSp"))
head(df)
## numVisits NumTxn OS Gender TotSp
## 1 7 0 Android Male 0.0000
## 2 20 1 iOS <NA> 576.8668
## 3 22 1 iOS Female 850.0000
## 4 24 2 iOS Female 1050.0000
## 5 1 0 Android Male 0.0000
## 6 13 1 Android Male 460.0000
Note that the value of the Gender column in the second row is NA, which is the way that R indicates a missing data value. It is not 0 or an empty string; it is unknown. So, statistical functions and algebraic operations on it would result in NA as well.
Strings vs Factors
The factor data type encodes categorical data, i.e., the value of a variable is one of a fixed set of values. Many statistical functions in R require categorical variables to be of type factor. However, often, during data processing, we need the actual text rather than having it encoded as a factor (which is actually stored in R as an integer for efficiency). So, when reading a CSV file you need to decide if you want text columns to be character strings or factors by setting the stringsAsFactors parameter (its default is FALSE since R 4.0.0).
You may use either F and T or FALSE and TRUE.
df <- read.csv(file = "customertxndata.csv",
header = F,
stringsAsFactors = FALSE,
col.names = c("numVisits","NumTxn","OS","Gender","TotSp"))
head(df)
## numVisits NumTxn OS Gender TotSp
## 1 7 0 Android Male 0.0000
## 2 20 1 iOS <NA> 576.8668
## 3 22 1 iOS Female 850.0000
## 4 24 2 iOS Female 1050.0000
## 5 1 0 Android Male 0.0000
## 6 13 1 Android Male 460.0000
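To make the difference between text and factors concrete, here is a small sketch on a made-up vector os (not the CSV data above). It shows that a factor stores integer codes plus a table of levels, and that the text can be recovered with as.character.

```r
os <- c("Android", "iOS", "iOS", "Android")   # hypothetical text values

os.f <- factor(os)                 # encode as categorical data
lv <- levels(os.f)                 # the distinct categories, sorted
codes <- as.integer(os.f)          # the integer code stored for each value
back <- as.character(os.f)         # recover the original text

print(lv)
print(codes)
print(back)
```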
Create a new Data Frame
The code below creates a new data frame from column vectors. Notice how the column names are the names given to the vectors in the call to data.frame. A new vector is created with the c function, e.g., v <- c(3,5,1,9).
df1 <- data.frame(state = c('Arizona','Georgia', 'New York','Indiana','Washington','Texas'),
code = as.factor(c('AZ','GA','NY','IN','WA','TX')),
score = c(62,47,55,74,31,85))
head(df1)
## state code score
## 1 Arizona AZ 62
## 2 Georgia GA 47
## 3 New York NY 55
## 4 Indiana IN 74
## 5 Washington WA 31
## 6 Texas TX 85
Search Data Frames
There are two important functions for “searching” data frames: which and any. The code below uses the built-in Orange data frame which contains measurements of orange trees. It has three columns: the tree, the age of the tree (days since 1968/12/31), and circumference (in mm).
which
## Tree age circumference
## 1 1 118 30
## 2 1 484 58
## 3 1 664 87
## 4 1 1004 115
## 5 1 1231 120
## 6 1 1372 142
# find all rows where the circumference is more than 200mm
rs <- which(df$circumference > 200)
# display all rows where the circumference is more than 200mm
df[rs,]
## Tree age circumference
## 13 2 1372 203
## 14 2 1582 203
## 27 4 1372 209
## 28 4 1582 214
# compound conditions are possible with & (and), | (or), and ! (not)
rs2 <- which(df$circumference > 200 & df$age < 1500)
rs3 <- which(df$circumference < 200 | !(df$age < 1500))
rs4 <- which(df$circumference > 400 | df$age > 1500)
rs2
## [1] 13 27
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 14 15 16 17 18 19 20 21 22 23 24 25 26
## [26] 28 29 30 31 32 33 34 35
## [1] 7 14 21 28 35
## [1] 1582
## [1] 894.8788
In the above example rs <- which(df$circumference > 200) finds all rows in the data frame df where circumference > 200. The row numbers are saved in rs.
any
The any function returns TRUE or FALSE depending on whether at least one element of a vector, such as a column of a data frame, satisfies a Boolean expression.
# is there any tree with age > 25?
any(df$age > 25)
## [1] TRUE
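Both functions operate on the logical vector produced by a comparison. The sketch below, which works directly on the built-in Orange data frame, makes that intermediate vector explicit and also shows that it can be used as a row index without calling which. The variable names are illustrative.

```r
big <- Orange$circumference > 200   # logical vector, one TRUE/FALSE per row

any.big <- any(big)                 # is at least one value TRUE?
all.big <- all(big)                 # are all values TRUE?
n.big <- sum(big)                   # TRUE counts as 1, so this counts matches
rows.big <- which(big)              # positions of the TRUE values

print(Orange[big, ])                # a logical vector also selects rows directly
```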
Memory Management
R is similar to Python and other interpreted languages in terms of memory management. Objects and variables remain in memory until you restart R or explicitly delete them. This can sometimes cause conflicts during development, for example when a script silently relies on a variable left over from an earlier run. Adding the code below to the start of an R script or an R Notebook ensures that the program runs with an empty memory environment. This matters for languages like R and Python where an interactive session stays alive between runs, but is not needed for languages such as Java and C++ whose programs each start in a fresh process.
Use the code below to find and then delete all objects, and reclaim memory. The function ls() lists all objects (variables) by name, while rm() removes one or more objects from memory. Finally, the function gc() runs the garbage collector and returns freed memory to the usable memory pool for the process in which R is running.
rm(list = ls(all.names = TRUE))
gc()
Of course, rather than deleting all objects as in the code chunk above, you may wish to release large objects or unused objects selectively by their name, e.g., rm(objName) or rm("objName").
Install Packages on Demand
To make your code portable and reproducible, install packages within your code:
# packages needed in R program
packages <- c("stringr", "RSQLite")
# install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
install.packages(packages[!installed_packages])
}
# load all packages by applying 'library' function
invisible(lapply(packages, library, character.only = TRUE))
Conclusion
As you saw, R is not a difficult language to learn as it is similar to other languages and for most language constructs that you are familiar with, there is an equivalent. But it is important that you go beyond this tutorial and learn the “R way” of programming using vectorized operations.
References
No references.
Errata
None collected yet. Let us know.
---
title: "Quick Guide to R For Programmers"
params:
  category: 6
  number: 104
  time: 60
  level: beginner
  tags: "r,primer,loops"
  description: "A quick guide for programmers transitioning to R from 
                C, C++, Java, JavaScript, Python, 
                and other high-level languages. Explains key control
                structures and programming paradigms for R."
date: "<small>`r Sys.Date()`</small>"
author: "<small>Martin Schedlbauer</small>"
email: "m.schedlbauer@neu.edu"
affilitation: "Northeastern University"
output: 
  bookdown::html_document2:
    toc: true
    toc_float: true
    collapsed: false
    number_sections: false
    code_download: true
    theme: spacelab
    highlight: tango
---

---
title: "<small>`r params$category`.`r params$number`</small><br/><span style='color: #2E4053; font-size: 0.9em'>`r rmarkdown::metadata$title`</span>"
---

```{r code=xfun::read_utf8(paste0(here::here(),'/R/_insert2DB.R')), include = FALSE}
```

## Introduction

This tutorial is a quick introduction to R for programmers of other high-level languages. It shows those features of R that are familiar to most programmers so that they can get started programming right away. It is important to note that the programming approach presented here is generally not the most efficient nor most common approach. We show the "R way" whenever possible and simple enough to explain. But the goal of the tutorial is to get programmers programming. This is especially useful to students in computer science courses that use R but who are new to R.

R is best suited for data projects: data loading, data transformation, databases, data analysis, data visualization, and data science. There is substantial support for statistical analysis, unsupervised data mining, supervised machine learning, and even interactive dashboards.

Often, programming tasks that take dozens of lines of code in most languages can be written with one statement or one line in R. While R is not the fastest language, for many vector processing and mathematical operations it generally outperforms most other high-level languages, particularly when vectorization hardware is present.

As of 2021, R is one of the top languages to learn and for any kind of data-related work it is critical (along with Python, Scala, and perhaps Go).

The tutorial is geared towards students in information science, data science, and database design. It demonstrates the basic R syntax that is most often used for data processing rather than statistics.

## The R Language

R is used in this tutorial as a procedural language similar to C. R does provide object systems (S3, S4, and Reference Classes), but they work differently from classes in C++ or Java and are not covered here. Everyday scripts make little use of data encapsulation or abstraction, and there is no direct equivalent for *class* or *struct* as in C/C++ or Java.

Programs in R are scripts. There is no "main function" or similar. R "programs" are collections of R statements that are interpreted when executed interactively. There is no compilation step. R does support reusable third-party code "libraries" in the form of *packages*.

## Working in R

To write "programs" in R you will need Base R which you can download for Linux, MacOS, and Windows from [R Project](https://www.r-project.org/). This is the core language with an interactive console. Programs, or more aptly R scripts, can be built in any text editor (TextEdit, Notepad, vi, Sublime, JEdit, etc.).

Most programming is done with an IDE (Integrated Development Environment). The most common is R Studio downloadable from [RStudio](https://www.rstudio.com/products/rstudio/download/). There is a hosted version of R Studio available at [rstudio.cloud](http://rstudio.cloud).

> Install R from [R Project](https://cloud.r-project.org/) **before** installing [R Studio](https://rstudio.com/products/rstudio/download/).

The tutorial below explains how to get started with R Notebooks:

<iframe src="https://northeastern.hosted.panopto.com/Panopto/Pages/Embed.aspx?id=80c2cf02-00d2-427c-8fcd-abe000f06f0d&amp;autoplay=false&amp;offerviewer=true&amp;showtitle=true&amp;showbrand=false&amp;captions=false&amp;interactivity=all" height="180" width="320" style="border: 1px solid #464646;" allowfullscreen allow="autoplay" data-external="1">

</iframe>

Execute chunks by clicking the *Run* button within the chunk or by placing your cursor inside it and pressing *Ctrl+Shift+Enter*. The code runs in the order in which the chunks are executed, so non-linear code execution is possible unless you instruct R Studio to run all chunks starting at the first chunk.

Add a new chunk by clicking the *Insert Chunk* button on the toolbar or by pressing *Ctrl+Alt+I*.

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the *Preview* button or press *Ctrl+Shift+K* to preview the HTML file).

The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike *Knit*, *Preview* does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.

### Projects in R Studio

Projects are a better way to manage code than creating individual R Notebooks, R Scripts, and other code files. Projects allow all files, including data files, to be managed as a single unit, shared, and version controlled using services such as *git* and *GitHub*.

The tutorial below demonstrates how to create a project in R Studio and add files to the project.

```{=html}
<iframe src="https://player.vimeo.com/video/607451374?h=3056e73073" width="480" height="270" frameborder="0" allow="autoplay; fullscreen; picture-in-picture" allowfullscreen data-external="1"></iframe>
```
## R Code Chunks

While you can also write R Scripts, we will concentrate on how to write R "programs" by composing an R Notebook in R Studio. This is the most common way to use R in data related projects where reproducibility is paramount. Programs in R run from start to end. Each chunk should be a step in your analysis or data project. Name your code chunks, so you can quickly navigate to them.

In the chunk below, the variable *cars* passed to the built-in Base R function <code>plot</code> is one of the dozens of "built-in" data frames; a data frame being data arranged in rows and columns similar to a spreadsheet or CSV file.

Note that you call a function by using the function's name followed by the arguments you wish to pass to the function. Of course, you need to follow the definition of the function. Many functions are simply "built-in" while others come from packages that you need to explicitly load into your program.

Note that there is no semicolon at the end of a line.

```{r namedChunk}
plot(cars)
```

### Expressions

R can be directly used to solve simple or complex mathematical expressions.

```{r}
# [1] in the output below is the index of the first value shown on that line.
# R always prefixes each line of printed results with such an index.

((2^3)*5)-1
```

```{r}
# sqrt and exp are built-in functions in R for finding Square root and exponential respectively.

sqrt(4)* exp(2)
```

## Variables

Variables are not declared in R. When you use a variable for the first time, it is defined and the data type is based on what value you assign to the variable. So, R uses dynamic typing, which also means that when you assign a value of a different type to a previously used variable, it changes type. You can find the type of a variable using the functions <code>typeof</code>, <code>class</code>, and <code>mode</code>.

R supports the usual data types: integer, double, Boolean, and string. It also has pre-defined complex types, including vector, data frame, array, list, and matrix. There is also a *Date* data type and a few others.
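
Dynamic typing can be observed with <code>typeof</code>. The sketch below reuses one variable name, *x* (chosen only for illustration), and records its type after each assignment.

```r
x <- 10L                  # the L suffix makes this an integer
t1 <- typeof(x)

x <- 9.9                  # same variable, now a double
t2 <- typeof(x)

x <- "some text"          # now a character string
t3 <- typeof(x)

x <- TRUE                 # now a logical (Boolean)
t4 <- typeof(x)

print(c(t1, t2, t3, t4))
print(class(Sys.Date()))  # dates have their own class
```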

The absence of a value is indicated in R using *NA* rather than *nil*, *null* or *NULL*. An *NA* means that there is no value for an integer, character, logical value, or numeric. On the other hand, *NULL* means that an object is "empty", *i.e.* a reference to an object that does not exist.

Values are assigned to variables using either *=* or *\<-*, the latter being more common but the former more familiar to programmers of other common languages.

```{r variables, eval=FALSE}
a = 10                       # defines a number (a double; use 10L for an integer)
d = 9.9                      # defines a double
s = "some text"              # define a string of characters
g <- 'also text'             # single quotes are the same as double quotes
```

Text can be enclosed in single or double quotes. Which you use depends on preference or if you need to nest quotes.

```{r textWithQuotes, eval=FALSE}
z = "It's a quote."
```

To echo the value of a variable to the console, either use the variable on a line by itself or use the function <code>print</code>. If you need to echo multiple values, combine them with <code>paste0</code>.

```{r printVars}
d <- 123.99
d
print(paste0("Value of d = ", d))
```

## Missing Values: NA vs NULL

In R, `NA` and `NULL` are used to represent different types of missing data, and they serve different purposes:

1.  **NA (Not Available)**:
    -   `NA` is used to represent missing values in vectors, lists, or data frames. It acts as a placeholder for an element that does not have a value but is expected to have one.
    -   `NA` can be of any data type like numeric, character, or logical, meaning you can have `NA_integer_`, `NA_real_`, `NA_complex_`, and `NA_character_`.
    -   Operations involving `NA` generally result in `NA`. For example, `5 + NA` results in `NA`.
    -   A value of `NA` is a value and is counted, *e.g.*, `length(c(3,NA,4))` has the value 3.
2.  **NULL**:
    -   `NULL` is used to represent the absence of a value or no value at all. It is typically used to denote that a variable is empty or uninitialized.
    -   `NULL` is often used in list or data frame operations to remove elements or indicate that an element is absent.
    -   `NULL` behaves differently from `NA` in operations. Concatenating `NULL` with another object leaves the other object unchanged, *e.g.*, `c(1, 2, NULL)` results in `c(1, 2)`, whereas arithmetic with `NULL`, *e.g.*, `1 + NULL`, results in an empty vector (`numeric(0)`).

Essentially, `NA` is used when an element of the data exists but its value is missing, whereas `NULL` is used when the data itself does not exist.

Many functions that return an object such as a data frame would return *NULL* if they could not generate the object.
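
The contrast described above can be sketched in a few lines. The list *settings* is a made-up example object.

```r
n1 <- length(c(3, NA, 4))      # NA is a value and is counted
n2 <- length(c(1, 2, NULL))    # NULL contributes nothing

r1 <- 5 + NA                   # arithmetic with NA gives NA

settings <- list(a = 1, b = 2) # hypothetical list with two elements
settings$b <- NULL             # assigning NULL removes the element

print(n1)
print(n2)
print(r1)
print(names(settings))
```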

In R, you can test for `NA` and `NULL` using specific functions designed for this purpose. Here’s how you can do it:

1.  **Testing for NA**:
    -   Use the `is.na()` function, which checks for `NA` values in an object. It returns a logical vector of the same size as the input, with `TRUE` for elements that are `NA` and `FALSE` for those that are not.

    -   Example:

        ``` r
        vec <- c(1, NA, 3, NA, 5)
        is.na(vec)
        # Output: FALSE  TRUE FALSE  TRUE FALSE
        ```
2.  **Testing for NULL**:
    -   Use the `is.null()` function, which checks if an object is `NULL`. It returns `TRUE` if the object is `NULL` and `FALSE` otherwise.

    -   Example:

        ``` r
        x <- NULL
        is.null(x)
        # Output: TRUE
        ```

These functions are useful in various programming scenarios, such as conditional execution of code depending on the presence of actual data, and handling missing values in data analysis and transformation tasks.

## Naming Identifiers

The naming of identifiers, *i.e.*, variable and function names, is the same as in most other languages, except that you can use the period as a valid identifier character. That can be confusing to Java and C++ programmers as they are used to using *.* as a method or object property access operator. In R, it's just another character like *a* or *\_*.

```{r identifiers, echo=TRUE, eval=FALSE}
a.val = 10.5          # legal
a_val = 10.5          # legal
aVal1 = 10.5          # legal
a$val = 10.5          # not an identifier: $ accesses an element of a list or data frame
```

The last one is a bit tricky. <code>a\$val</code> is not a variable name; it refers to the element *val* of the object *a*, and assigning to it can turn *a* into a list. Let's just not use it unless you are accessing columns in a data frame, but that's for another tutorial.

The rules for naming an identifier (variable, function, or package name) for an object are as follows:

-   identifiers are case-sensitive and cannot contain spaces or special characters such as #, %, \$, \@, \* , &, \^, !, \~

-   an identifier must start with a letter (or with a dot that is not followed by a digit), but may contain any combination of letters and digits thereafter

-   special characters dot (.) and underscore (\_) are allowed

The dot (.) is a regular character in R and that can be confusing as other languages (*e.g.*, Java) use dot as an operator to designate property or method access, *e.g.*, in Java *x.val* means that you are accessing the *val* property of the object *x*.

Some examples of legal variable names are: *df*, *df2*, *df.txns*, and *df_all2017*. These are some illegal variable names: *2df* (cannot start with a digit), *rs\$all* (cannot contain a \$; the \$ is used to access columns in a dataframe), *rs#* (only . and \_ are allowed in addition to digits and letters).
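
One way to check names like these is the base R function <code>make.names</code>, which rewrites a string into a syntactically valid name; a string is already valid if it comes back unchanged. This is only a sketch: the vector *candidates* simply repeats the examples above.

```r
candidates <- c("df", "df2", "df.txns", "df_all2017", "2df", "rs$all", "rs#")

# a name is syntactically valid if make.names() leaves it unchanged
is.valid <- make.names(candidates) == candidates

print(data.frame(name = candidates, valid = is.valid))
```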

It is considered good programming practice to give identifiers a sensible name that hints as to what is stored in the variable rather than using a random name like *x*, *val*, or *i33*. Instead use *anItem* or *annualTotal*.

Identifiers should be named consistently. Many programmers use one of two styles:

-   underscores, *e.g.*, *interest_rate*
-   camelCase, *e.g.*, *squareRoot*, *graphData*, *currentWorkingDirectory*

Note that R is case sensitive which means that R treats the identifiers *AP* and *ap* as different objects.

As a side note, file names may also be case sensitive but that depends on the operating system and file system. Linux is case sensitive and MacOS can be, depending on how the disk is formatted, while Windows is case aware but not case sensitive. For example, on a case-sensitive system there is a difference between "AirPassengers.txt" and "airpassengers.txt" while on Windows there is not. SQL keywords are also not case sensitive. It is a best practice to assume case sensitivity.

## Expressions

As a language with roots in statistical analysis, R supports most mathematical operators, including many that are not supported through operators in most other languages.

Precedence is like other languages and can be (and should be) specified explicitly using parentheses.

```{r operators, echo=TRUE}
a <- 99

b = a + 20                 # addition
b = a - 20                 # subtraction
b = a * 20                 # multiplication
b = a / 20                 # division -- b is now "double"
b = a ^ 3                  # exponentiation
b = a %% 2                 # modulus (mod)
b = a %/% 2                # integer division (div)

b = (a / 2) ^ 5            # force order of evaluation
```

There are also numerous built-in functions for mathematics and statistics, but those are beyond this tutorial on base syntax. For the sake of completeness, an example is shown below that calculates the *z*-score of a vector of numbers. In R, a vector is similar to an array; it contains primitive values (integer, double, Boolean, or string). Notice the automatic vector calculation even without a loop.

```{r funcs}
v = c(1,4,6,2,8,1,0,3)     # vector of numbers

m = mean(v)                # mean of values in v
s = sd(v)                  # standard deviation
z = (v - m) / s            # z-score of each value in v

print(round(z,2))          # print rounded values
```
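
As a cross-check, the *z*-score is conventionally defined as the value minus the mean, divided by the standard deviation. The sketch below compares that manual calculation with the base R function <code>scale</code>, which centers and scales a vector; the variable names are illustrative.

```r
v <- c(1, 4, 6, 2, 8, 1, 0, 3)

z.manual <- (v - mean(v)) / sd(v)   # value minus mean, over standard deviation
z.scale <- as.numeric(scale(v))     # scale() returns a matrix, so flatten it

print(round(z.manual, 2))
print(all.equal(z.manual, z.scale))
```

Standardized values should have a mean of 0 and a standard deviation of 1, which is a quick sanity test for any *z*-score calculation.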

### Boolean Variables and Logical Expressions

Boolean values are *TRUE* and *FALSE*; or *T* and *F*, respectively.

```{r booleans}
m = FALSE
w = TRUE

q = (m & w) | (!m & !w)

print(paste0("q is ", q))
```

The Boolean operators are *&* for AND, *\|* for OR, and *!* for NOT. These operators perform logical operations on Boolean variables or vectors of Boolean elements. Exclusive or (*xor*) is performed with the function *xor*, *e.g.*, <code>xor(a,b)</code>.

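There are also long forms *&&* and *\|\|* which operate on single values and stop evaluating as soon as the result is known (short-circuit evaluation). They are intended for control flow and are generally preferred in *if* statements.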
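
The difference between the short and long forms can be sketched as follows. The variable names are made up for this example.

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

elementwise <- a & b        # compares the vectors element by element

x <- NULL
# && stops at the first FALSE, so x > 0 is never evaluated for a NULL x
ok <- !is.null(x) && x > 0

print(elementwise)
print(ok)
```

The second expression is a common guard in *if* statements: the right-hand side only runs when the left-hand side has established that it is safe to do so.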

## Flow Control

The control flow statements provided by R are similar to those found in most other programming languages: loops and if statements.

### Loops

R supports the three common types of loops: a counting loop (*for*), a top-tested loop (*while*), and a loop without a built-in test (*repeat*) that serves as a bottom-tested loop when combined with *break*.

#### Counting Loop: *for*

The counting loop is actually more like an iterator in R, although it can be set up to mimic the behavior of a typical loop in C/C++, Java, *etc.*, that loops from a low value to a high value. See the examples below.

```{r simpleForLoop}
n = 5
for (i in 1:n) {
  print(i)
}
```

In the example above, *i* is the loop variable and it takes on the values in the vector *1:5*, *i.e.,* 1, 2, 3, 4, 5, one at a time.

To count down, it would be *5:1* or *n:1* in the above code example.

Like other languages, R uses curly braces to enclose the body of the loop, *i.e.*, the statements that are executed repeatedly, once for each value of *i*. Of course, the loop counter can be any properly named variable.

The example below accesses a vector using positional access, *e.g.*, *v[1]* accesses the first element. Note that vectors, data frames, lists, and arrays are indexed starting at 1 and not 0 as in C-based languages.

```{r loopsCalcs}
v = rnorm(5)      # vector of five random numbers
for (i in 1:length(v)) 
{
  v[i] = v[i] ^ 2
}

print(v)
```

While you can use loops, you often do not need them. If you apply a mathematical operation to a vector, R automatically applies it to each element, but loops might be more natural in the beginning.

```{r vectorProc}
# note that you actually do not need loops in R
# this also squares each element in the vector v
v = v ^ 2
```

As already stated, looping in R is actually iteration over a set: a set of numbers as above or a set of any kind of primitive object, *e.g.,* strings. Note that in the code below, *k* takes on each value in the vector over which the loop iterates.

```{r iterateOverStrings}
s = c("one","two","three","four")

for (k in s) {
  print(k)
}
```

We could have written this using a non-iterator approach as well, which is more like what you'd do in C. Note that <code>length</code> returns the number of elements in a vector, *i.e.*, its "length".

```{r loopOverStrings}
s = c("one","two","three","four")

for (j in 1:length(s)) {
  print(s[j])
}
```
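
One caveat with the index-based form is worth a sketch: for an empty vector, *1:length(s)* is *1:0*, which counts down from 1 to 0, so the loop body runs twice instead of not at all. The base R function <code>seq_along</code> avoids this. The counters below are illustrative.

```r
empty <- character(0)               # a vector with no elements

count.colon <- 0
for (j in 1:length(empty)) {        # 1:0 is the vector c(1, 0)
  count.colon <- count.colon + 1
}

count.seq <- 0
for (j in seq_along(empty)) {       # seq_along() of an empty vector is empty
  count.seq <- count.seq + 1
}

print(count.colon)
print(count.seq)
```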

##### Iteration Continuation

Continuation of the loop to the next iteration and forgoing processing the remainder of the current iteration is done with *next* in R, similar to *continue* in C-based languages such as Java.

In the example below we are already reaching ahead to the *if* statement. The code fragment echoes only odd numbers. Of course, you could have done this differently by just iterating over the odd numbers from one to ten, but that would not have allowed us to share this example for using *next*.

```{r nextInFor}
for (i in 1:10) {
  if (!i %% 2) {
    next
  }
  print(i)
}
```

But, just to show you how to iterate over just the odd numbers, we can use the <code>seq()</code> function which generates a sequence of numbers in steps. No *if* and no *next* needed -- much simpler.

```{r loopOverOddSeq}
for (i in seq(1, 10, 2)) {
  print(i)
}
```

Or, more explicitly by specifying the parameters by name.

```{r loopOverOddSeqWithParms}
for (i in seq(from = 1, to = 10, by = 2)) {
  print(i)
}
```

Stopping the execution of the rest of a loop and moving immediately to the next statement after the loop is done with *break* in R, which is identical to C, C++, Java, etc. In the code below we will find the position of the first occurrence of some number; *x* is the number we are looking for in *v* and *p* is the found position.

```{r breakFor}
v = c(1, 3, 5, 7, 1, 9, 4)
x = 9
p = 0

for (i in 1:length(v)) {
  if (v[i] == x) {
    p = i
    break
  }
}

print(p)
```

To be clear, there is a quicker and more efficient way to do this using the <code>which</code> function. This function does not exist in most other languages and it shows the power of R. Incidentally, unlike the loop above, the code below finds all occurrences of *x* in *v*. Note, once again, the use of a vector variable to refer to all elements.

```{r findValWithWhich}
v = c(1, 3, 5, 7, 1, 9, 4)
x = 9

p = which(v == x)

print(p)
```
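
When only the first occurrence is wanted, as in the loop with *break*, the base R function <code>match</code> is a closer equivalent; it returns *NA* when the value is absent. A short sketch with illustrative variable names:

```r
v <- c(1, 3, 5, 7, 1, 9, 4)

all.ones <- which(v == 1)    # every position holding the value 1
first.nine <- match(9, v)    # position of the first 9
not.found <- match(42, v)    # NA, because 42 does not occur in v
has.seven <- 7 %in% v        # TRUE/FALSE membership test

print(all.ones)
print(first.nine)
print(not.found)
print(has.seven)
```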

#### Conditional Loops: *while* and *repeat*

Like many other languages, R supports top and bottom tested loops. In the example below we repeatedly ask for a number from the user and only exit the loop if the number is 42. Of course, the example really should be done with a bottom tested loop.

```{r whileLoop, eval=F}
response <- as.integer(readline(prompt="Enter a number: "))

while (response != 42) {
  print("Sorry, not correct")
  response <- as.integer(readline(prompt="Enter a number: "))
}
```

In the above example, we are using two new functions. <code>readline</code> is used to read input from the console, while <code>as.integer</code> coerces (or casts in C/C++ terminology) the input to an integer.

The code above is not very efficient or elegant as it has the code for the user prompt twice. It is really better to ask first and then check the condition: we need a bottom-tested loop.

Let's look at the same example, but with a *repeat* loop that runs until a condition is reached and the loop is explicitly exited with *break*. So, if you want to run the code at least once, use *repeat*; if the code is run zero or more times, use *while*. A *repeat* loop with a *break* test inside its body plays the role of the *do* or *do while* loop found in many other programming languages.

```{r repeatLoop, eval=F}
repeat {   
  response <- as.integer(readline(prompt="Enter a number: "));
  if (response == 42) {
    break
  }
  print("Sorry, not correct");
}
```

R's *for* loop always iterates over the elements of a vector or list, so there is no equivalent of the infinite *for* loop in C and C++, *e.g.*, <code>for(;;){ // do something indefinitely }</code>, which runs until a condition is reached and then uses *break* to exit the loop. The *repeat* loop construct is used instead. The *repeat* loop itself does not test a condition, so there is no direct equivalent of the C-like construct *do ... while*; the test must be written as an *if* with a *break* inside the loop body.
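
Because the examples above wait for user input, here is a small non-interactive sketch of the same bottom-tested pattern. It halves a number until the result drops below 1; the body always runs at least once because the test comes last. The variable names are only illustrative.

```{r}
n <- 100
steps <- 0

repeat {
  n <- n / 2            # the body runs before any test is made
  steps <- steps + 1
  if (n < 1) {          # bottom test: leave once n is below 1
    break
  }
}

print(steps)   # 7
print(n)       # 0.78125
```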

### Alternation: *if*

The use of *if* to selectively execute a block of code based on a condition is the same in R as it is in other languages. We already saw an example and the same example is below:

```{r if}
for (i in 1:10) {
  if (!i %% 2) {
    next
  }
  print(i)
}
```

Note that the condition is in parentheses and that the block executed if the condition is true is in curly braces. This is identical to C, C++, and Java.

A more complex example is shown below that prints the square of every number from 1 to 10 that is less than 5 and the cube of every other number. Not a useful example but one that allows us to show the use of *if-else*.

```{r ifElse}
for (i in 1:10) {
  if (i < 5) {
    print(i^2)
  } else {
    print(i^3)
  }
}
```
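
The "R way" of writing the same logic avoids the loop: the base function <code>ifelse()</code> applies a condition to every element of a vector and picks a value from one of two vectors for each position. A minimal sketch, with illustrative variable names:

```{r}
nums <- 1:10

# square the elements that are less than 5, cube all others
result <- ifelse(nums < 5, nums^2, nums^3)

print(result)   # 1 4 9 16 125 216 343 512 729 1000
```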

## Functions

Code organization and reuse in R is done using functions. All functions are free functions and are not bound to an object or a class as methods are in Java or C++. This is the same as functions in C or module-level functions in Python.

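Every function in R returns a value: either the value passed to *return* or, if there is no *return*, the value of the last expression evaluated. The caller is free to ignore it. So, R does not distinguish between functions and procedures and there is no *void* return type as in C, C++, and Java.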

### Defining a Function

The generic template for defining a function is:

```{r eval=FALSE}
function_name <- function(arg_1, arg_2, ...) {
   Function body 
}
```

The example below defines a function called *findSmallest()* which takes a vector of positive integers as an argument and returns the smallest element in the vector. While it can be solved in several ways, we will show a design that uses loops and should be familiar to programmers of most other languages.

Note that we are using the predefined value *Inf*, which represents positive infinity and is larger than any number. There is also *-Inf*, which represents negative infinity and is smaller than any number.

```{r functionDef}
findSmallest = function(v)
{
  s = Inf
  for (i in 1:length(v))
  {
    if (v[i] < s) {
      s = v[i]
    }
  }
  return (s)
}
```

While you can use *=* to define a function, you should really get used to using the more common *\<-* syntax. So, let's try again:

```{r functionDefBetter}
findSmallest <- function(v)
{
  s = Inf
  for (i in 1:length(v))
  {
    if (v[i] < s) {
      s = v[i]
    }
  }
  return (s)
}
```

Just to be clear, in practice you would use the <code>min()</code> function to find the smallest element rather than writing it yourself.
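
One caveat about the loops above: <code>1:length(v)</code> counts down from 1 to 0 when the vector is empty, so the loop body would run twice on positions that do not exist. The base function <code>seq_along()</code> avoids this. A small sketch with an illustrative empty vector:

```{r}
empty <- numeric(0)

print(1:length(empty))    # 1 0
print(seq_along(empty))   # integer(0)

# a loop over seq_along() of an empty vector never executes its body
count <- 0
for (i in seq_along(empty)) {
  count <- count + 1
}
print(count)              # 0
```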

While there are several ways to return a value from a function, the way that is most congruent with other languages is the use of the *return* statement.
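
As an illustration of one of the other ways, a function without *return* hands back the value of the last expression it evaluates. The two hypothetical functions below behave identically:

```{r}
squareExplicit <- function(x) {
  return (x^2)
}

squareImplicit <- function(x) {
  x^2       # the last evaluated expression is the return value
}

print(squareExplicit(7))   # 49
print(squareImplicit(7))   # 49
```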

Note that the type of the return value and the types of the arguments are not declared. R is dynamically typed, so no type checking is performed until run-time.

### Calling a Function

To call a function, you would invoke it with its name and its required arguments.

```{r callFunc}
x = c(3,1,9,7,3,6)

w = findSmallest(x)
print(w)
```

### Function Parameters

If a function takes several arguments you generally pass them in the order declared, which is the approach used by most other languages. However, in R you can pass the arguments in any order as long as you specify the name of each argument.

Argument matching is a bit different in R compared to other languages. Firstly, R does all argument checking at run-time. Secondly, while arguments can be matched positionally like in other languages, arguments can also be matched by parameter name, a feature that Python also offers through keyword arguments but that C and Java lack.

For example, the built-in function <code>seq</code> generates a sequence of numbers and returns those numbers in a vector. The definition of the function is as follows: <code>seq(from, to, by, length.out, along.with)</code>.

Here are examples of using it. Note that *by*, *length.out*, and *along.with* have default values and are therefore optional.

```{r seqParmPassing}
v = seq(1, 10, 2)    # integers from 1 to 10 in increments of 2
w = seq(1, 5)        # integers from 1 to 5 (by default in increments of 1)

# pass arguments in a different order but specify by name
w = seq(from = 5, by = -0.5, to = 1)
```
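
The remaining parameters are used the same way. The sketch below shows <code>length.out</code>, which asks for a fixed number of evenly spaced values instead of a step size, and named arguments given in reverse order. The variable names are only illustrative.

```{r}
# five evenly spaced values from 0 to 1
quarters <- seq(from = 0, to = 1, length.out = 5)
print(quarters)   # 0.00 0.25 0.50 0.75 1.00

# named arguments can appear in any order
steps4 <- seq(to = 10, from = 2, by = 4)
print(steps4)     # 2 6 10
```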

R also supports variable numbers of arguments but that is beyond the scope of this tutorial.

### Default Arguments

R functions can have default values for arguments which are then optional when the function is called. When the argument is missing, then the default value is passed. In the example below, the *start* argument is the position at which the search for the smallest element will start.

```{r functionDefArg}
findSmallest <- function(v, start = 1)
{
  s = Inf
  for (i in start:length(v))
  {
    if (v[i] < s) {
      s = v[i]
    }
  }
  return (s)
}
```

```{r}
x = c(3,1,9,7,3,6)

w = findSmallest(x, 3)
print(w)

w = findSmallest(x)
print(w)
```

### Local Variables

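As in most other programming languages, R functions can define local variables that are not known outside the scope of the function. Unlike C, C++, and Java, the scope boundary in R is the function and not a block enclosed in curly braces: a variable created inside an *if* or *for* block remains visible in the rest of the function.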

In the example below, *local.var* is local to the function and thus is not visible outside of the function. The code below produces the error: "Error in print(local.var) : object 'local.var' not found".

```{r locaVars, echo=TRUE, eval=FALSE}
findSmallest <- function(v, start = 1)
{
  local.var = Inf
  for (i in start:length(v))
  {
    if (v[i] < local.var) {
      local.var = v[i]
    }
  }
  return (local.var)
}

x = c(3,1,9,7,3,6)

w = findSmallest(x, 3)

# we cannot echo or access the local variable "local.var"
print(local.var)
```
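
Note that in R the function body, rather than each pair of curly braces, is the unit of scope. The sketch below uses a hypothetical function <code>scopeDemo()</code> to show that a variable created inside an *if* block is still visible after the block, but not outside the function.

```{r}
scopeDemo <- function(flag) {
  if (flag) {
    block.var <- 42      # created inside the if block
  }
  return (block.var)     # still visible here, after the block
}

print(scopeDemo(TRUE))       # 42
print(exists("block.var"))   # FALSE: not visible outside the function
```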

### Recursion

R functions can be called recursively. The example below calculates factorial using recursion rather than a loop.

```{r recursiveFuncs}
fac <- function(x)
{
  if (x <= 1) 
    return (1)
  else 
    return (x * fac(x-1))
}

print(fac(8))
```

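> If it hasn't been obvious yet, the placement of curly braces is largely a matter of style, just like in other languages. For single statement blocks, the curly braces can be omitted. One caveat: outside of a function or another block in curly braces, *else* must be on the same line as the closing brace of the *if* block, otherwise R treats the *if* statement as complete.

> The parentheses around the value for *return* are required.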

As an exercise, try writing the above function to calculate factorial using a loop.

## Type Coercion

Type coercion (or casting in C/C++ terminology) is done with type conversion functions in R and not through operators like in C, C++, and Java. The example below shows the most common type conversion functions. Note that in some situations you will lose information, just like in other languages.

```{r typeConversion}
s = "12.4"                 # string (character)
i = as.integer(s)          # convert to integer: 12
d = as.numeric(s)          # convert to double: 12.4

w = "$12.3"                # additional characters
k = as.integer(w)          # cannot be converted due to $: NA with a warning
```

If a coercion is not successful the result is *NA*, which indicates a missing value, and R issues a warning.
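
A common remedy is to strip the offending characters before converting. The sketch below is one possible approach, using the base function <code>sub()</code> with <code>fixed = TRUE</code> so that the dollar sign is treated as a literal character rather than as a regular expression. The variable names are only illustrative.

```{r}
w <- "$12.3"

k <- suppressWarnings(as.integer(w))   # NA; the warning is suppressed here
print(is.na(k))                        # TRUE

# remove the literal "$" first, then convert
amount <- as.numeric(sub("$", "", w, fixed = TRUE))
print(amount)                          # 12.3
```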

## Vectors

Let's talk more about vectors. In R, a vector is similar to a list or array in other programming languages. It is a collection of elements of the same basic type: numeric, character, or Boolean. A list in R is a collection of mixed data types. This tutorial applies to vectors only.
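
To illustrate the difference, the sketch below mixes types in a call to <code>c()</code>: because a vector holds a single type, R silently coerces every element to character. A list keeps each element's own type. The variable names are only illustrative.

```{r}
mixed.vector <- c(1, "a", TRUE)
print(mixed.vector)              # "1" "a" "TRUE"
print(class(mixed.vector))       # "character"

mixed.list <- list(1, "a", TRUE)
print(class(mixed.list[[1]]))    # "numeric"
print(class(mixed.list[[3]]))    # "logical"
```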

There are no specific packages required for these functions.

## Creating a Vector

The code below creates an artificial vector of random integers for use in the tutorial. In practice, vectors are generally columns in data frames which are frequently the result of reading data from a CSV file or a database.

```{r createSampleVector}
# vector of 50 random integers between 0 and 10

# set the seed for the random number generator to ensure same
# sequence of random numbers every time the code is run
set.seed(98788)
v <- round(runif(50, min = 0, max = 10),0)

# arguments do not have to be passed in the order that they are
# declared in the function definition as long as the names of the
# arguments are specified
v <- round(runif(n = 50, max = 10, min = 1),0)
v <- round(runif(max = 10, min = 1, n = 50),0)

print(v)
```
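
A side note on the approach above: rounding uniform random numbers makes the two end points (here 0 or 1, and 10) about half as likely as the values in between, because only half as wide an interval rounds to them. When every integer should be equally likely, the base function <code>sample()</code> is an alternative. A minimal sketch, using a separate variable *u* so that *v* is left unchanged:

```{r}
set.seed(98788)

# 50 integers drawn with equal probability from 0, 1, ..., 10
u <- sample(0:10, size = 50, replace = TRUE)

print(u)
print(table(u))   # how often each value was drawn
```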

## Accessing Elements in a Vector

Elements are accessed positionally, although in R, the access index can be a vector of integers in which case all elements at those positions are retrieved. Positions are numbered from 1 to the number of elements in a vector. The number of elements (or length) of a vector can be obtained using the <code>length()</code> function.

In the example below, note that <code>n:m</code> generates a vector of integers from *n* to *m*, inclusive. The <code>seq()</code> function generates a sequence of numbers at a given interval.

```{r simpleVectorAccess, eval=F}
# access a single element at position 3
v[3]

# access element 10 through 15
v[10:15]

# access the last element
v[length(v)]

# access every other element
v[seq(from = 1, to = length(v), by = 2)]

# access specific elements at positions 2, 11, 19, and 28
i <- c(2,11,19,28)
v[i]
```
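
Programmers coming from Python should note that a negative index in R does not count from the end; it excludes the element at that position. The sketch below shows this on a small illustrative vector, together with the base functions <code>head()</code> and <code>tail()</code>.

```{r}
a <- c(10, 20, 30, 40, 50)

print(a[-1])         # everything except the first element: 20 30 40 50
print(a[-c(1, 2)])   # everything except the first two: 30 40 50
print(head(a, 2))    # the first two elements: 10 20
print(tail(a, 1))    # the last element: 50
```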

## Testing Predicate Expressions

It is possible in R to apply a predicate expression to every element in a vector. This generates a "Boolean vector" of *TRUE/FALSE* values that indicate which element matches the predicate expression (*TRUE*) and which doesn't (*FALSE*).

Predicate expressions are built with relational operators (\<, \>, \<=, \>=, ==, !=) and can be combined with the logical operators & (and), | (or), and ! (not).

```{r}
v < 5

(v < 1 | v > 9)

(v <= 7 & v != 3)

v != 5

l <- (v == 5)
print(l)
```

## Finding Matches

The <code>which()</code> function returns the positions that are *TRUE* in a Boolean vector.

```{r}
# returns positions of vector that matches predicate expression
which(v != 5)

# count the number of matches
length(which(v != 0))
```

```{r}
p <- which(v < 5)
print (v[p])

# or combine
x <- v[which(v < 5)]
print (x)

# find all elements that do not match the predicate
not.x <- v[-which(v < 5)]
print (not.x)
```
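
A Boolean vector can also be used as an index directly, without <code>which()</code>. This is often shorter, and it avoids a pitfall of the negative-index form: when nothing matches, <code>which()</code> returns an empty vector and the negative index then selects nothing instead of everything. A sketch on a small illustrative vector:

```{r}
m <- c(7, 2, 9, 4, 6)

print(m[m < 5])       # elements that match: 2 4
print(m[!(m < 5)])    # elements that do not match: 7 9 6
print(sum(m < 5))     # number of matches (TRUE counts as 1): 2

# pitfall: no element is greater than 100
print(m[-which(m > 100)])   # numeric(0), not the whole vector
print(m[!(m > 100)])        # 7 2 9 4 6, as intended
```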

## Determining Any Matches

To determine if there are any matches, *i.e.*, at least one element in a vector matches the predicate expression, use the <code>any()</code> function. The function <code>any()</code> returns *TRUE* if there's at least one match, *FALSE* otherwise.

```{r}
any(v < 5)
```
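
The counterpart of <code>any()</code> is the base function <code>all()</code>, which returns *TRUE* only if every element matches. A short sketch on an illustrative vector:

```{r}
h <- c(3, 8, 1, 6)

print(any(h > 7))   # TRUE: 8 is greater than 7
print(all(h > 0))   # TRUE: every element is positive
print(all(h > 2))   # FALSE: 1 is not greater than 2
```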

## Dealing with Missing Values

Missing values in R are generally encoded with the special value **NA**. **NA** is not a number, not a character or text, and not a Boolean. Consequently, using *==* or *!=* to check if a value is **NA** does not work: the result of the comparison is itself **NA** rather than *TRUE* or *FALSE*, and using that result as the condition of an *if* causes an error. You must use the function <code>is.na()</code> to check if a value is **NA**. This is similar to **NULL** in SQL and many programming languages.

```{r}
# copy the vector of random numbers and then randomly 
# remove 6 values, i.e., set them to NA

v.na <- v

# sample() draws 6 distinct positions, so exactly 6 values become NA
v.na[sample(length(v.na), 6)] = NA

print(v.na)
```

Many functions in R return **NA** when any value in a vector is **NA**.

```{r}
# cannot add values containing NA
sum(v.na)

# when applying operators, NA remains NA
v.na + 5
```

Some functions have a parameter, typically <code>na.rm = TRUE</code>, that allows you to direct a function to ignore **NA** values. Check the documentation of functions before using them to see what parameters they support. Use *?sum* to view the documentation of the *sum* function.
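
As a sketch of how this looks in practice, the example below uses a small illustrative vector with two missing values. It shows the result of comparing with **NA**, the use of <code>is.na()</code>, and the <code>na.rm</code> parameter of <code>sum()</code> and <code>mean()</code>.

```{r}
b <- c(4, NA, 7, 1, NA)

print(b == NA)                 # NA NA NA NA NA: the comparison is no help
print(is.na(b))                # FALSE TRUE FALSE FALSE TRUE
print(sum(is.na(b)))           # number of missing values: 2

print(sum(b, na.rm = TRUE))    # 12
print(mean(b, na.rm = TRUE))   # 4
print(b[!is.na(b)])            # keep only the known values: 4 7 1
```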

## Accessing Rows, Columns, and Elements (Cells) of a Data Frame

Data frames are very similar to tables in relational databases and spreadsheets. They have rows and columns and the intersection of a row and column is a cell (or element). The order of access is row followed by column, *e.g.*, the third element in the fourth row of the data frame <code>mtcars</code> is <code>mtcars[4,3]</code>. Note that this is reversed from the way Excel and other spreadsheets work.

The example code below uses the built-in data frame *mtcars*. You can find out more about its structure using <code>str(mtcars)</code> or displaying the first few rows with <code>head(mtcars)</code>.

```{r}
str(mtcars)
head(mtcars, 3)
```

```{r}
v <- mtcars[4,3]
x = mtcars[4,3]

print(paste0("v = ",v," and x = ",x))
```

Leaving out a dimension (row or column) accesses the entire row or column. Leaving out the column, as in <code>mtcars[4,]</code>, results in a data frame with a single row; leaving out the row, as in <code>mtcars[,2]</code>, results in a vector.

Often the values must be converted to a vector data type. Conversion of variables from one type to another is done with the family of <code>as.xxxx</code> functions, *e.g.*, <code>as.vector</code>, <code>as.numeric</code>, or <code>as.factor</code>. Vectors can contain numeric or character data but all elements must be of the same type. In R, a list is similar to a vector but it may contain a mix of elements. A matrix is similar to a data frame but all of its elements must be of the same type and it has exactly two dimensions; an array can have more than two dimensions.

Some functions expect data frames, some vectors, some lists. You need to read the documentation of a function to find out. Furthermore, some functions will automatically convert a variable from one type to the one it requires.

You can also access a column in a data frame by its column name. For an entire column you either use the column's position or its name: <code>df[,column]</code> or <code>df\$columnName</code>.

```{r}
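# all of row 4; the result is a data frame
r <- mtcars[4,]
sum(r)

# all of row 3; then the element in its third column
# (named r3 rather than c to avoid shadowing the c() function)
r3 <- mtcars[3,]
r3[1,3]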

mtcars[c(1,4)]   # columns 1 and 4 as a new dataframe

mtcars[,2]       # all of column 2
mtcars[5:7,]     # rows 5 to 7 as a new dataframe
mtcars$cyl       # column named "cyl"
mtcars$cyl[2]    # 2nd row in the column "cyl"

mtcars$cyl[3:9]  # rows 3 to 9 for column "cyl" as a vector

w <- mtcars$mpg
mean(w)
```
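
The sketch below checks what type each kind of access returns, and shows two base R tools for controlling it: <code>drop = FALSE</code> keeps a single column as a data frame, and <code>unlist()</code> turns a single-row data frame into a named vector. The variable names are only illustrative.

```{r}
row4 <- mtcars[4, ]                    # one row: a data frame
col2 <- mtcars[, 2]                    # one column: a vector
col2.df <- mtcars[, 2, drop = FALSE]   # one column kept as a data frame
row4.vec <- unlist(row4)               # one row converted to a named vector

print(is.data.frame(row4))      # TRUE
print(is.numeric(col2))         # TRUE
print(is.data.frame(col2.df))   # TRUE
print(row4.vec[["disp"]])       # 258, the same value as mtcars[4,3]
```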

## Adding and Removing Columns from a Data Frame

To add a new column, you simply assign to a column name that does not exist yet. Note in the example below that you can operate on entire columns (as vectors) and the operation is applied to each pair of values in the two vectors. This is much more concise, and in R much faster, than the explicit loops that are necessary in many other programming languages.

```{r}
# copy the data frame mtcars to a new data frame df
df <- mtcars

# create a new column "dispcyl" which is the displacement per cylinder
df$dispcyl <- df$disp / df$cyl

head(df)
```
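
Removing a column works in a similar way: assigning *NULL* to a column deletes it. Several columns can be dropped at once by selecting all columns whose names are not in a given set. A minimal sketch on separate copies of *mtcars*, with illustrative variable names:

```{r}
cars2 <- mtcars
cars2$dispcyl <- cars2$disp / cars2$cyl
print("dispcyl" %in% names(cars2))   # TRUE

# remove a single column by assigning NULL to it
cars2$dispcyl <- NULL
print("dispcyl" %in% names(cars2))   # FALSE

# remove several columns by keeping all the others
cars3 <- cars2[, !(names(cars2) %in% c("vs", "am"))]
print(ncol(cars3))                   # 9 of the original 11 columns remain
```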

## Create a New Data Frame

Data frames are created in various ways: use the <code>data.frame</code> function, load a CSV file, execute a SQL query, or as a result of many package functions.

### Load a Data Frame from CSV

Quick note: Capitalization in path and file names does not matter on Windows, but **does matter** on Linux and, depending on how the file system is configured, on macOS. Furthermore, note that in R the path delimiter is a forward slash / even on Windows, and not the usual backslash \\. The \\ is an "escape" character that is used to place special characters, such as quotes or non-printable characters, into a string (text), *e.g.*, the text *This string contains "quotes".* would be written in R as "This string contains \\"quotes\\"."

The parameter <code>header = F</code> instructs <code>read.csv</code> not to interpret the first line as header labels. Of course, if there are no labels, then you need to define your own.

Aside from CSV files, R can also load a number of other file formats using various packages, including XML, Excel, SPSS, and MATLAB, among many others.

```{r}
df <- read.csv(file = "customertxndata.csv", header = F)
head(df)

df <- read.csv(file = "customertxndata.csv", 
               header = F,
               col.names = c("numVisits","NumTxn","OS","Gender","TotSp"))
head(df)
```

> Note that the value of the 'Male' column in the first row is *NA* which is the way that R indicates a missing data value. It is not 0 or an empty string, it is unknown. So, statistical functions and algebraic operations would result in an *NA* as well.
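
To experiment without the data file, the sketch below writes two lines of made-up data to a temporary file and reads them back. The file contents, column names, and variable names are all hypothetical. Note the assumption made through <code>na.strings</code>: an empty text field is read as **NA** only because the empty string is listed there.

```{r}
# write a tiny CSV file without a header line to a temporary location
tmp.file <- tempfile(fileext = ".csv")
writeLines(c("3,Android,12.50",
             "7,,0.00"), tmp.file)

demo.df <- read.csv(file = tmp.file,
                    header = FALSE,
                    stringsAsFactors = FALSE,
                    na.strings = c("NA", ""),
                    col.names = c("visits", "os", "spend"))

print(demo.df)
print(is.na(demo.df$os[2]))   # TRUE: the empty field was read as NA
print(sum(demo.df$spend))     # 12.5

unlink(tmp.file)              # delete the temporary file
```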

#### Strings vs Factors

The *factor* data type encodes categorical data, *i.e.*, the value of a variable is one of a fixed set of values. Many statistical functions in R require categorical variables to be of type *factor*. However, often, during data processing, we need the actual text rather than having it encoded as a *factor* (which is actually stored in R as an integer for efficiency). So, when reading a CSV file you need to decide if you want text columns to be character strings or factors by setting the <code>stringsAsFactors</code> parameter. Since R 4.0.0 the default is <code>FALSE</code>; in earlier versions it was <code>TRUE</code>.

You may use either <code>F</code> and <code>T</code> or <code>FALSE</code> and <code>TRUE</code>, although the full names are safer because <code>T</code> and <code>F</code> are ordinary variables that can be reassigned.

```{r}
df <- read.csv(file = "customertxndata.csv", 
               header = F,
               stringsAsFactors = FALSE,
               col.names = c("numVisits","NumTxn","OS","Gender","TotSp"))
head(df)
```
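
The sketch below shows what the factor encoding looks like on a small made-up character vector: the distinct values become the *levels*, each element is stored as the integer position of its level, and <code>as.character()</code> recovers the text.

```{r}
os <- c("iOS", "Android", "iOS", "Android", "iOS")
os.factor <- factor(os)

print(levels(os.factor))         # "Android" "iOS"
print(as.integer(os.factor))     # 2 1 2 1 2
print(as.character(os.factor))   # back to the original text
print(table(os.factor))          # count per category
```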

### Create a new Data Frame

The code below creates a new data frame from column vectors. Notice how the column names are the names given to the vectors in the call. A new vector is created with the <code>c</code> function, *e.g.*, <code>v \<- c(3,5,1,9)</code>.

```{r}
df1 <- data.frame(state = c('Arizona','Georgia', 'New York','Indiana','Washington','Texas'),
                  code = as.factor(c('AZ','GA','NY','IN','WA','TX')),
                  score = c(62,47,55,74,31,85))

head(df1)

```

## Search Data Frames

There are two important functions for "searching" data frames: <code>which</code> and <code>any</code>. The code below uses the built-in [**Orange** data frame](https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/Orange) which contains measurements of orange trees. It has three columns: the tree, the *age* of the tree (days since 1968/12/31), and *circumference* (in *mm*).

### which

```{r}
df <- Orange

head(df)

# find all rows where the circumference is more than 200mm
rs <- which(df$circumference > 200)

# display all rows where the circumference is more than 200mm
df[rs,]

# compound conditions are possible with & (and), | (or), and ! (not)
rs2 <- which(df$circumference > 200 & df$age < 1500)
rs3 <- which(df$circumference < 200 | !(df$age < 1500))
rs4 <- which(df$circumference > 400 | df$age > 1500)

rs2
rs3
rs4

mean(df[rs4,2])
mean(df$age[rs3])

```

In the above example <code>rs \<- which(df\$circumference \> 200)</code> finds all rows in the data frame *df* where *circumference \> 200*. The row numbers are saved in *rs*.
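
As with vectors, the Boolean expression can be used as the row index directly, and the base function <code>subset()</code> allows the column names to be written without the data frame prefix. The sketch below works on a separate copy of **Orange**; the variable names are only illustrative.

```{r}
trees <- Orange

by.which <- trees[which(trees$circumference > 200), ]
by.logical <- trees[trees$circumference > 200, ]
by.subset <- subset(trees, circumference > 200 & age < 1500)

print(nrow(by.which) == nrow(by.logical))   # TRUE: same rows either way
print(by.subset)
```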

### any

The <code>any</code> function returns *TRUE* or *FALSE* depending on whether any value in a column (or row) of the data frame satisfies a Boolean expression.

```{r}
# is there any tree with age > 2000?
any(df$age > 2000)
```

## Memory Management

R is similar to Python and other interpreted languages in terms of memory management. Objects and variables remain in memory until you restart R or explicitly delete them. This can sometimes cause conflicts during development, because a script may silently depend on objects left over from an earlier run. Adding the code below to the start of an R script or an R Notebook ensures that the program runs with an empty environment. This matters for languages like R and Python that are often run in a long-lived interactive session, but is not needed for compiled programs, such as those written in Java or C++, that start a fresh process on every run.

Use the code below to find and then delete all objects, and reclaim memory. The function <code>ls()</code> lists all objects (variables) by name, while the <code>rm()</code> removes one or more objects from memory. Finally, the function <code>gc()</code> runs the garbage collector and returns freed memory to the usable memory pool for the process in which R is running.

```{r cleanMem, echo=T, eval=F}
rm(list = ls(all.names = TRUE))
gc()
```

Of course, rather than deleting all objects as in the code chunk above, you may wish to release large objects or unused objects selectively by their name, *e.g.*, <code>rm("objName")</code>.
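
A small sketch of selective removal, using made-up object names. The base function <code>exists()</code> is used here only to confirm that the object is gone while other objects are untouched.

```{r}
big.obj <- numeric(1e6)     # a large vector of one million zeros
small.obj <- 1

existed.before <- exists("big.obj")
rm(big.obj)                 # equivalent to rm("big.obj")
exists.after <- exists("big.obj")

print(existed.before)       # TRUE
print(exists.after)         # FALSE
print(exists("small.obj"))  # TRUE: other objects are not affected
```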

## Install Packages on Demand

To make your code portable and reproducible, install packages within your code:

```{r}
# packages needed in R program
packages <- c("stringr", "RSQLite")

# install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages])
}

# load all packages by applying 'library' function
invisible(lapply(packages, library, character.only = TRUE))
```

## Conclusion

As you saw, R is not a difficult language to learn as it is similar to other languages and for most language constructs that you are familiar with, there is an equivalent. But it is important that you go beyond this tutorial and learn the "R way" of programming using vectorized operations.

------------------------------------------------------------------------

## Files & Resources

```{r zipFiles, echo=FALSE}
zipName = sprintf("LessonFiles-%s-%s.zip", 
                 params$category,
                 params$number)

textALink = paste0("All Files for Lesson ", 
               params$category,".",params$number)

# downloadFilesLink() is included from _insert2DB.R
knitr::raw_html(downloadFilesLink(".", zipName, textALink))
```

------------------------------------------------------------------------

## References

No references.

## Errata

None collected yet. Let us know.

```{=html}
<script src="https://form.jotform.com/static/feedback2.js" type="text/javascript">
  new JotformFeedback({
    formId: "212187072784157",
    buttonText: "Feedback",
    base: "https://form.jotform.com/",
    background: "#F59202",
    fontColor: "#FFFFFF",
    buttonSide: "left",
    buttonAlign: "center",
    type: false,
    width: 700,
    height: 500,
    isCardForm: false
  });
</script>
```
```{r code=xfun::read_utf8(paste0(here::here(),'/R/_deployKnit.R')), include = FALSE}
```
