Prerequisite: R and R Studio

If you do not already have R and/or R Studio you will need to download and install them. You must first install R from R Project and then the R Studio IDE from R Studio. Alternatively, rather than installing R and R Studio locally, you can do the tutorial using R Studio Cloud.

Working with R and R Studio is much like programming in Java. You install the JRE and the JDK to get the Java language, compiler, and run-time environment, and you then add an IDE (Integrated Development Environment), such as Eclipse, in which you write Java programs. The same applies to Python: you install Python, the language, and then an IDE such as PyCharm. Of course, you do not strictly need R Studio (or any IDE) to write R programs, just as you do not need an IDE to program in Java, C, C++, or Python. All you need is a source code editor, and even Notepad or TextEdit would suffice. Some programmers prefer simple editors like Notepad++ or jEdit, while others like full-featured development environments like Visual Studio or Eclipse. The tutorial below assumes you will use R Studio.

Install R from R Project before installing R Studio.

Tutorial

The recorded tutorial demonstrates how to install R and R Studio and how to create projects, shows how to build R Notebooks using R Markdown and add R code chunks, and explains how to load data into a data frame from a CSV file and access the data. It summarizes the content of this tutorial.

The lesson files, including the data files, can be found at the end of this page.

Basic R

R is an interpreted scripting language, which means that you do not need to compile a program before running it. Statements and expressions are executed as you type them in the R Console, or when you run the chunk that contains them in an R Notebook.

Creating an R Notebook

R Code Chunks

This tutorial is limited to writing R “programs” using an R Notebook in R Studio. Programs in R run from start to end. Each chunk should be a step in your analysis or data project. Name your code chunks, so you can quickly navigate to them.

In the chunk below, mtcars, whose columns mpg and hp are passed to the built-in Base R function plot, is one of the dozens of “built-in” data frames; a data frame is data arranged in rows and columns, similar to a spreadsheet or CSV file.

Note that you call a function by using the function’s name followed by the arguments you wish to pass to the function. Of course, you need to follow the definition of the function. Many functions are simply “built-in” while others come from packages that you need to explicitly load into your program.

Note that there is no semicolon at the end of a line.

```{r namedChunk, eval=FALSE}
plot(x = mtcars$mpg, y = mtcars$hp)
```
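As a small illustration of calling functions with arguments, the sketch below passes arguments by position and by name to two Base R functions, round and seq. The variable names a, b, s1, and s2 are chosen for illustration only.

```r
# arguments can be passed by position ...
a <- round(3.14159, 2)

# ... or by name, which makes the call easier to read
b <- round(3.14159, digits = 2)

# both calls produce the same value
identical(a, b)

# with named arguments, the order in which they are written does not matter
s1 <- seq(from = 0, to = 10, by = 5)
s2 <- seq(by = 5, to = 10, from = 0)
identical(s1, s2)
s1
```

Naming the arguments is optional, but it tends to make calls to functions with many parameters less error-prone.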

Expressions

R can be directly used to solve simple or complex mathematical expressions.

((2^3)*5)-1
## [1] 39
# The [1] at the start of the output line is the index of the first value on that line.
# R prefixes every line of printed vector output with such an index.

# sqrt and exp are built-in functions for the square root and the exponential function, respectively.
sqrt(4)* exp(2)
## [1] 14.77811

Variables and Identifiers

Holding a value in a variable is done through assignment. Once you assign a value to a variable, the variable becomes an R object. There are two ways to do an assignment, using ‘=’ or ‘<-’. The latter is the preferred way in R but the former might be more familiar to programmers coming to R from Java, C++, or Python.

Note that variables are not explicitly defined or declared. The first assignment of a value to a variable defines the variable and its type. The type is based on the value that is assigned. Unlike statically typed programming languages such as C++, C#, or Java, R is dynamically typed: the type of a variable can change when a value of a different type is assigned. A variable can be used in an expression.

The value of a variable can be displayed either by using the variable by itself or using the print() function.

# assignment with '=' of a number
x = 12
# inspect (print/display) the value
x
## [1] 12
# assign a new value and change its type to "text"
x = "Hello"
x
## [1] "Hello"
# assignment with '<-'
x <- 12
print(x)
## [1] 12
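
To see the type change described above, the built-in class function can be applied after each assignment. This is a minimal sketch; the names type_before and type_after are made up for illustration.

```r
x <- 12
type_before <- class(x)   # "numeric"

# assigning text to the same variable changes its type
x <- "Hello"
type_after <- class(x)    # "character"

print(c(type_before, type_after))
```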

The rules for naming an identifier (variable, function, or package name) for an object are as follows:

  • identifiers are case-sensitive and cannot contain spaces or special characters such as #, %, $, @, *, &, ^, !, ~
  • an identifier must start with a letter (or with a dot that is not followed by a digit), but may contain any combination of letters and digits thereafter
  • the special characters dot (.) and underscore (_) are allowed

The dot (.) is a regular character in R, which can be confusing because other languages (e.g., Java) use the dot to designate property or method access; e.g., in Java x.val means that you are accessing the val property of the object x.

Some examples of legal variable names are: df, df2, df.txns, df_all2017. These are some illegal variable names: 2df (cannot start with a digit), rs$all (cannot contain a $; the $ is used to access columns in a dataframe), rs# (only . and _ are allowed in addition to digits and letters).
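
One way to check such names programmatically is sketched below. It relies on the Base R function make.names, which rewrites a string into a syntactically valid name; a string that comes back unchanged was already valid. The variables candidates and is_valid are illustrative names, not part of the lesson files.

```r
candidates <- c("df", "df2", "df.txns", "df_all2017", "2df", "rs$all", "rs#")

# make.names() repairs invalid names, so a name is valid
# exactly when make.names() leaves it unchanged
is_valid <- make.names(candidates) == candidates

data.frame(name = candidates, valid = is_valid)
```

Reserved words such as if or TRUE are also reported as invalid by this check, since make.names alters them as well.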

It is considered good programming practice to give identifiers a sensible name that hints at what is stored in the variable rather than using a random name like x, val, or i33. Identifiers should be named consistently. Many programmers use one of two styles:

  • underscores, e.g., interest_rate
  • camelCase, e.g., squareRoot, graphData, currentWorkingDirectory

Note that R is case sensitive, which means that R treats the identifiers AP and ap as different objects. As a side note, file names may also be case sensitive, but that depends on the operating system and file system. Linux is case sensitive, while Windows and macOS (with its default file system) preserve case but are not case sensitive. For example, on Linux there is a difference between “AirPassengers.txt” and “airpassengers.txt” while on Windows there is not. SQL keywords are also not case sensitive. It is a best practice to assume case sensitivity.

Built-in Data Frames

There are numerous data frames built into R that are accessible without loading them first from external files. These data frames are for experimentation and learning and not for actual analytics work. One such built-in data frame is mtcars. To get a list of all built-in data frames, run data().

mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

head and tail print out the first and last six rows of a data frame, respectively. You can specify the number of rows to display as a second argument.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
tail(mtcars)
##                 mpg cyl  disp  hp drat    wt qsec vs am gear carb
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2
head(mtcars, 3)
##                mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

Accessing Rows, Columns, and Elements (Cells) of a Data Frame

Data frames are very similar to tables in relational databases and spreadsheets. They have rows and columns, and the intersection of a row and a column is a cell (or element). The order of access is row followed by column, e.g., the third element in the fourth row of the data frame mtcars is mtcars[4,3]. Note that this is reversed from the way Excel and other spreadsheets work. The <- is the operator for assignment, although = also works. We will see and use both.

To display a value, either use the print function or just use the variable by itself. To print multiple items, use the paste0 function.

v <- mtcars[4,3]
x = mtcars[4,3]

print(paste0("v = ",v," and x = ",x))
## [1] "v = 258 and x = 258"

Leaving out a dimension (row or column) accesses the entire row or column. Selecting an entire row yields a data frame with a single row, while selecting a single column yields a vector. Often a row's values must be converted to a vector data type. Conversion of variables from one type to another is done with the family of as.xxxx functions, e.g., as.vector, as.numeric, or as.factor. Vectors can contain numeric or character data but all elements must be of the same type. In R, a list is similar to a vector but it may contain a mix of element types. A matrix is similar to a data frame but all of its elements must be of the same type and it always has exactly two dimensions; an array generalizes a matrix to more than two dimensions.

Some functions expect data frames, some vectors, some lists. You need to read the documentation of a function to find out. Furthermore, some functions will automatically convert (also called coerce) a variable from one type to the one it requires.
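
The differences between vectors and lists, and a few explicit conversions, can be sketched as follows. The variable names (v, l, r, rv) are illustrative assumptions; unlist is used here as one common way to turn a one-row data frame into a plain vector.

```r
# a vector holds one type; mixing types coerces everything to a common type
v <- c(1, "a", TRUE)
class(v)                 # "character"

# a list keeps each element's own type
l <- list(1, "a", TRUE)
sapply(l, class)         # "numeric" "character" "logical"

# a one-row data frame converted to a plain (named) numeric vector
r <- mtcars[4, ]
rv <- unlist(r)
class(rv)                # "numeric"
length(rv)               # 11

# explicit conversions with the as.xxxx family
as.numeric("3.5") + 1    # 4.5
as.character(42)         # "42"
```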

You can also access a column in a data frame by its column name. For an entire column you use either the column's position or its name: df[,column] or df$columnName.

# all of row 4; the result is a data frame
r <- mtcars[4,]
sum(r)
## [1] 426.135
# all of row 3; named r3 rather than c so that the built-in c() function is not masked
r3 <- mtcars[3,]
r3[1,3]
## [1] 108
mtcars[c(1,4)]   # columns 1 and 4 as a new dataframe
##                      mpg  hp
## Mazda RX4           21.0 110
## Mazda RX4 Wag       21.0 110
## Datsun 710          22.8  93
## Hornet 4 Drive      21.4 110
## Hornet Sportabout   18.7 175
## Valiant             18.1 105
## Duster 360          14.3 245
## Merc 240D           24.4  62
## Merc 230            22.8  95
## Merc 280            19.2 123
## Merc 280C           17.8 123
## Merc 450SE          16.4 180
## Merc 450SL          17.3 180
## Merc 450SLC         15.2 180
## Cadillac Fleetwood  10.4 205
## Lincoln Continental 10.4 215
## Chrysler Imperial   14.7 230
## Fiat 128            32.4  66
## Honda Civic         30.4  52
## Toyota Corolla      33.9  65
## Toyota Corona       21.5  97
## Dodge Challenger    15.5 150
## AMC Javelin         15.2 150
## Camaro Z28          13.3 245
## Pontiac Firebird    19.2 175
## Fiat X1-9           27.3  66
## Porsche 914-2       26.0  91
## Lotus Europa        30.4 113
## Ford Pantera L      15.8 264
## Ferrari Dino        19.7 175
## Maserati Bora       15.0 335
## Volvo 142E          21.4 109
mtcars[,2]       # all of column 2
##  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
mtcars[5:7,]     # rows 5 to 7 as a new dataframe
##                    mpg cyl disp  hp drat   wt  qsec vs am gear carb
## Hornet Sportabout 18.7   8  360 175 3.15 3.44 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.46 20.22  1  0    3    1
## Duster 360        14.3   8  360 245 3.21 3.57 15.84  0  0    3    4
mtcars$cyl       # column named "cyl"
##  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
mtcars$cyl[2]    # 2nd row in the column "cyl"
## [1] 6
mtcars$cyl[3:9]  # rows 3 to 9 for column "cyl" as a vector
## [1] 4 6 8 6 8 4 4
w <- mtcars$mpg
mean(w)
## [1] 20.09062
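
Rows and columns can also be addressed by name inside the square brackets. The following is a short sketch on mtcars; the variable names hp, sub, and one_col are chosen for illustration.

```r
# a single cell by row name and column name
mtcars["Valiant", "mpg"]            # 18.1

# a column by name inside the brackets; the result is a vector
hp <- mtcars[, "hp"]
length(hp)                          # 32

# several columns by name; the result is a data frame
sub <- mtcars[, c("mpg", "hp")]
dim(sub)                            # 32 2

# drop = FALSE keeps a single column as a data frame instead of a vector
one_col <- mtcars[, "mpg", drop = FALSE]
class(one_col)                      # "data.frame"
```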

Aggregation and Statistical Functions

As a language with its origin in statistics and statistical data processing, R has a plethora of statistical functions. Some of the most important functions for data processing are shown below. Consult online documentation and statistics references for more information, e.g., How To Get Descriptive Statistics In R and Base R Statistical Functions.
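
A few of the Base R summary functions, applied to the mpg column of mtcars, are sketched here. Which functions count as "most important" is an assumption made for this example; tapply is used as one of several ways to aggregate by group.

```r
mpg <- mtcars$mpg

mean(mpg)       # arithmetic mean
median(mpg)     # median
sd(mpg)         # standard deviation
min(mpg)        # smallest value
max(mpg)        # largest value
range(mpg)      # smallest and largest value together
summary(mpg)    # quartiles, mean, minimum, and maximum

# aggregate by group: mean mpg for each number of cylinders
by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
by_cyl
```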

Packages

There are thousands of functions across hundreds of packages (external libraries of functions written for specific purposes, e.g., data mining, statistical inference, machine learning, image processing, web development, visualization, XML processing, SQL, and so forth). You will learn them over time – and it’s unlikely you will ever learn all of them, so have patience. For a package to be usable in an R project it must be installed; installation is done once. Then every time you need an installed package in some R code, you must load it using the library function.

Installing Packages

To ensure that packages are automatically installed, you can use the following code. That way your code becomes portable.

if("RSQLite" %in% rownames(installed.packages()) == FALSE) {
  install.packages("RSQLite")
}

library("RSQLite")

In the above code the function installed.packages() returns a matrix with one row per installed package, and rownames() extracts the package names. The operator %in% checks whether “RSQLite” is one of those names; it evaluates to TRUE if it is and to FALSE otherwise. If it is FALSE, the package is not installed and the code in the if block installs it. That way, the loading of the package with library("RSQLite") does not fail because the package is missing.
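
The same idea can be wrapped in a small reusable function. This is a sketch, not the lesson's own approach: ensure_package is a hypothetical helper name, and it uses the Base R function requireNamespace instead of installed.packages to test whether a package is available.

```r
# hypothetical helper: install a package only if it is missing, then load it
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE tells library() that pkg holds the name as a string
  library(pkg, character.only = TRUE)
}

# "stats" ships with R, so this call loads it without installing anything;
# replace it with "RSQLite" to mirror the example above
ensure_package("stats")
```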

Data Frame Dimensions and Structure

Data frames are one of the most fundamental data structures of R, along with vectors. A data frame is a row/column arrangement of data where each column is a vector of data values of the same type, e.g., all numbers or all characters.

The example below uses one of the many built-in data frames of R: mtcars. These built-in data frames are wonderful for testing code or learning R.

nrow(mtcars)            # number of rows in the data frame
## [1] 32
ncol(mtcars)            # number of columns in the data frame
## [1] 11
str(mtcars)             # structure of the data frame
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
mtcars[nrow(mtcars),]   # last row only of a data frame   
##             mpg cyl disp  hp drat   wt qsec vs am gear carb
## Volvo 142E 21.4   4  121 109 4.11 2.78 18.6  1  1    4    2
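
A few more Base R functions for inspecting the shape and layout of a data frame are sketched below, again on mtcars.

```r
dim(mtcars)            # rows and columns at once: 32 11
names(mtcars)          # column names
rownames(mtcars)[1:3]  # first three row names
sapply(mtcars, class)  # type of each column
```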

Adding and Removing Columns from a Data Frame

To add a new column, you simply assign to a column name that does not yet exist in the data frame. Note in the example below that you can operate on entire columns (as vectors); the operation is applied to each pair of corresponding values in the two vectors. This is more concise, and usually much faster, than writing an explicit loop as is necessary in many other programming languages.

# copy the data frame mtcars to a new data frame df
df <- mtcars

# create a new column "dispcyl" which is the displacement per cylinder
df$dispcyl <- df$disp / df$cyl

head(df)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb  dispcyl
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 26.66667
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 26.66667
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 27.00000
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 43.00000
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 45.00000
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 37.50000
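
For comparison, the sketch below computes the same column with an explicit loop and then removes columns again. Assigning NULL to a column is one common way to remove it; the names loop_result and df2, and the choice of the vs and am columns, are assumptions made for illustration.

```r
df <- mtcars

# vectorized: one expression computes the value for every row
df$dispcyl <- df$disp / df$cyl

# the equivalent explicit loop, shown only for comparison
loop_result <- numeric(nrow(df))
for (i in seq_len(nrow(df))) {
  loop_result[i] <- df$disp[i] / df$cyl[i]
}
all.equal(df$dispcyl, loop_result)   # TRUE

# remove the column again by assigning NULL to it
df$dispcyl <- NULL
ncol(df)                             # 11

# alternatively, keep every column except the named ones
df2 <- mtcars[, !(names(mtcars) %in% c("vs", "am"))]
ncol(df2)                            # 9
```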

Create a New Data Frame

Data frames are created in various ways: use the data.frame function, load a CSV file, execute a SQL query, or as a result of many package functions.

Load a Data Frame from CSV

Quick note: Capitalization in path and file names matters on Linux but typically not on Windows or on macOS with its default file system, so it is safest to match the capitalization exactly. Furthermore, note that in R the path delimiter is a forward slash / even on Windows, not the usual backslash \. The \ is an “escape” character used to place special characters into a string (text); for example, the text This string contains "quotes". would be written in R as "This string contains \"quotes\"."

Also, the parameter header = F instructs read.csv not to interpret the first line as header labels. Of course, if the file has no labels, you need to define your own with col.names.

Aside from CSV files, R can also load a number of other file formats using various packages, including XML, Excel, SPSS, and MATLAB, among many others.

df <- read.csv(file = "customertxndata.csv", header = F)
head(df)

df <- read.csv(file = "customertxndata.csv", 
               header = F,
               col.names = c("numVisits","NumTxn","OS","Gender","TotSp"))
head(df)

Note that the value of the ‘Male’ column in the first row is NA, which is the way that R indicates a missing data value. It is not 0 or an empty string; it is unknown. So, statistical functions and algebraic operations on it result in NA as well, unless they are told to ignore missing values (e.g., with na.rm = TRUE).
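
To try reading a headerless CSV file and handling NA without the lesson's data file, the sketch below writes a tiny made-up file to a temporary location first. The three data rows are invented for illustration and are not taken from customertxndata.csv; only the column names follow the example above.

```r
# write a tiny, made-up CSV file without a header line (hypothetical data)
csv_file <- tempfile(fileext = ".csv")
writeLines(c("7,0,Android,NA,0",
             "20,1,iOS,Female,576.5",
             "22,NA,iOS,Male,850"),
           csv_file)

txns <- read.csv(file = csv_file,
                 header = FALSE,
                 col.names = c("numVisits", "NumTxn", "OS", "Gender", "TotSp"))

is.na(txns$Gender)                 # TRUE FALSE FALSE
mean(txns$NumTxn)                  # NA, because one value is missing
mean(txns$NumTxn, na.rm = TRUE)    # 0.5, missing values are skipped
sum(is.na(txns))                   # 2 missing values in total
```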

Strings vs Factors

The factor data type encodes categorical data, i.e., the value of a variable is one of a fixed set of values. Many statistical functions in R require categorical variables to be of type factor. However, often, during data processing, we need the actual text rather than having it encoded as a factor (which is actually stored in R as an integer for efficiency). So, when reading a CSV file you need to decide if you want text columns to be character strings or factors by setting the stringsAsFactors parameter. Its default is FALSE since R 4.0.0; in earlier versions it was TRUE.

You may use either F and T or FALSE and TRUE.

df <- read.csv(file = "customertxndata.csv", 
               header = F,
               stringsAsFactors = FALSE,
               col.names = c("numVisits","NumTxn","OS","Gender","TotSp"))
head(df)
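
The relationship between text values and their factor encoding can be sketched with a small made-up vector; os and f are illustrative names.

```r
os <- c("iOS", "Android", "iOS", "Android", "iOS")

f <- factor(os)
levels(f)          # the distinct categories
as.integer(f)      # the integer codes stored internally
table(f)           # counts per category

# convert back to the original text
identical(as.character(f), os)   # TRUE
```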

Create a New Data Frame

The code below creates a new data frame from column vectors. Notice how the column names are the names given to the vectors in the call. A new vector is created with the c function, e.g., v <- c(3,5,1,9).

df1 <- data.frame(state = c('Arizona','Georgia', 'New York','Indiana','Washington','Texas'),
                  code = as.factor(c('AZ','GA','NY','IN','WA','TX')),
                  score = c(62,47,55,74,31,85))

head(df1)
##        state code score
## 1    Arizona   AZ    62
## 2    Georgia   GA    47
## 3   New York   NY    55
## 4    Indiana   IN    74
## 5 Washington   WA    31
## 6      Texas   TX    85

Search Data Frames

There are two important functions for “searching” data frames: which and any. The code below uses the built-in Orange data frame which contains measurements of orange trees. It has three columns: the tree, the age of the tree (days since 1968/12/31), and circumference (in mm).

which

df <- Orange

head(df)
##   Tree  age circumference
## 1    1  118            30
## 2    1  484            58
## 3    1  664            87
## 4    1 1004           115
## 5    1 1231           120
## 6    1 1372           142
# find all rows where the circumference is more than 200mm
rs <- which(df$circumference > 200)

# display all rows where the circumference is more than 200mm
df[rs,]
##    Tree  age circumference
## 13    2 1372           203
## 14    2 1582           203
## 27    4 1372           209
## 28    4 1582           214
# compound conditions are possible with & (and), | (or), and ! (not)
rs2 <- which(df$circumference > 200 & df$age < 1500)
rs3 <- which(df$circumference < 200 | !(df$age < 1500))
rs4 <- which(df$circumference > 400 | df$age > 1500)

rs2
## [1] 13 27
rs3
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 14 15 16 17 18 19 20 21 22 23 24 25 26
## [26] 28 29 30 31 32 33 34 35
rs4
## [1]  7 14 21 28 35
mean(df[rs4,2])
## [1] 1582
mean(df$age[rs3])
## [1] 894.8788

In the above example rs <- which(df$circumference > 200) finds all rows in the data frame df where circumference > 200. The row numbers are saved in rs.
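
The any function, mentioned at the start of this section, can be sketched on the same Orange data. The use of all and of direct logical indexing are additions for illustration; big is a made-up variable name.

```r
df <- Orange

# any() answers whether at least one element satisfies a condition
any(df$circumference > 200)    # TRUE
any(df$circumference > 400)    # FALSE

# all() is the counterpart: TRUE only if every element satisfies it
all(df$age > 100)              # TRUE

# a logical vector can index the rows directly, without which()
big <- df[df$circumference > 200, ]
nrow(big)                      # 4
```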


Files & Resources

All Files for Lesson 6.100


