Prerequisite: R and R Studio
If you do not already have R and/or R Studio you will need to download and install them. You must first install R from R Project and then the R Studio IDE from R Studio. Alternatively, rather than installing R and R Studio locally, you can do the tutorial using R Studio Cloud.
Working with R and R Studio is similar to programming in Java. You install the JRE and the JDK to get the Java language, compiler, and run-time environment. You then usually also want an IDE (Integrated Development Environment) in which to write Java programs; for example, Eclipse. The same holds for Python: you install Python, the language, and then an IDE such as PyCharm. Of course, you do not strictly need R Studio (or any IDE) to write R programs, just as you do not need an IDE to program in Java, C, C++, or Python. You need a source code editor, and even Notepad or TextEdit would suffice. Some programmers prefer simple editors like Notepad++ or jEdit, while others like full-featured development environments like Visual Studio or Eclipse. The tutorial below assumes you will use R Studio.
Install R from R Project before installing R Studio.
Tutorial
The recorded tutorial demonstrates how to install R and R Studio and how to create projects. It shows how to build R Notebooks using R Markdown and add R code chunks, and it explains how to load data into a data frame from a CSV file and access the data. It summarizes the content of this tutorial.
The lesson files, including the data files, can be found at the end of this page.
Basic R
R is an interpreted (scripting) language, which means that you do not need to compile a program before running it. Statements and expressions are executed as you type them if you enter them in the R Console, or are run when you execute the code in a chunk in an R Notebook.
Creating an R Notebook
R Code Chunks
This tutorial is limited to writing R “programs” using an R Notebook in R Studio. Programs in R run from start to end. Each chunk should be a step in your analysis or data project. Name your code chunks so you can quickly navigate to them.
In the chunk below, the variable mtcars, whose columns are passed to the built-in Base R function plot, is one of the dozens of “built-in” data frames; a data frame being data arranged in rows and columns similar to a spreadsheet or CSV file.
Note that you call a function by using the function’s name followed by the arguments you wish to pass to the function. Of course, you need to follow the definition of the function. Many functions are simply “built-in” while others come from packages that you need to explicitly load into your program.
Note that there is no semicolon at the end of a line.
```{r namedChunk, eval=FALSE}
plot(x = mtcars$mpg, y = mtcars$hp)
```
Expressions
R can be directly used to solve simple or complex mathematical expressions.
# The [1] in front of the result below is the index of the first value on that output line.
# R prints this index at the start of every line of output for a vector.
((2^3)*5)-1
## [1] 39
# sqrt and exp are built-in R functions for the square root and the exponential function, respectively.
sqrt(4)* exp(2)
## [1] 14.77811
Variables and Identifiers
Holding a value in a variable is done through assignment. Once you assign a value to a variable, the variable becomes an R object. There are two ways to do an assignment: using ‘=’ or ‘<-’. The latter is the preferred way in R, but the former might be more familiar to programmers coming to R from Java, C++, or Python.
Note that variables are not explicitly declared. The first assignment of a value to a variable defines the variable and its type. The type is based on the value that is assigned. Unlike statically typed languages such as C++, C#, or Java, R is dynamically typed: the type of a variable can change when a value of a different type is assigned. A variable can be used in an expression.
The value of a variable can be displayed either by using the variable by itself or by using the print() function.
# assignment with '=' of a number
x = 12
# inspect (print/display) the value
x
## [1] 12
# assign a new value and change its type to "text"
x = "Hello"
x
## [1] "Hello"
# assignment with '<-'
x <- 12
print(x)
## [1] 12
The rules for naming an identifier (variable, function, or package name) for an object are as follows:
- identifiers are case-sensitive and cannot contain spaces or special characters such as #, %, $, @, *, &, ^, !, ~
- an identifier must start with a letter (or with a dot that is not followed by a digit), but may contain any combination of letters and digits thereafter
- the special characters dot (.) and underscore (_) are allowed, although an identifier cannot start with an underscore
The dot (.) is a regular character in R, which can be confusing because other languages (e.g., Java) use the dot to designate property or method access; e.g., in Java x.val means that you are accessing the val property of the object x.
Some examples of legal variable names are: df, df2, df.txns, df_all2017. These are some illegal variable names: 2df (cannot start with a digit), rs$all (cannot contain a $; the $ is used to access columns in a dataframe), rs# (only . and _ are allowed in addition to digits and letters).
It is considered good programming practice to give identifiers a sensible name that hints at what is stored in the variable rather than using an arbitrary name like x, val, or i33. Identifiers should be named consistently. Many programmers use one of two styles:
- underscores, e.g., interest_rate
- camelCase, e.g., squareRoot, graphData, currentWorkingDirectory
Note that R is case sensitive, which means that R treats the identifiers AP and ap as different objects. As a side note, file names may also be case sensitive, but that depends on the operating system and file system. Linux is case sensitive, while Windows and, in its default configuration, MacOS are case aware but not case sensitive. For example, on Linux there is a difference between “AirPassengers.txt” and “airpassengers.txt” while on Windows there is not. SQL keywords are also not case sensitive. It is a best practice to assume case sensitivity.
Built-in Data Frames
There are numerous data frames built into R that are accessible without loading them first from external files. These data frames are for experimentation and learning and not for actual analytics work. One such built-in data frame is mtcars. To get a list of all built-in data frames, run data().
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
head and tail print out the first and last six rows of a data frame, respectively. You can specify the number of rows to display.
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
## mpg cyl disp hp drat wt qsec vs am gear carb
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Accessing Rows, Columns, and Elements (Cells) of a Data Frame
Data frames are very similar to tables in relational databases and spreadsheets. They have rows and columns, and the intersection of a row and column is a cell (or element). The order of access is row followed by column; e.g., the third element in the fourth row of the data frame mtcars is mtcars[4,3]. Note that this is reversed from the way Excel and other spreadsheets address cells (column first, as in C4). The <- is the operator for assignment, although = also works. We will see and use both.
To display a value, either use the print function or just use the variable by itself. To print multiple items, use the paste0 function to combine them into a single string first.
v <- mtcars[4,3]
x = mtcars[4,3]
print(paste0("v = ",v," and x = ",x))
## [1] "v = 258 and x = 258"
Leaving out a dimension (row or column) accesses the entire row or column. Leaving out the column, as in mtcars[4,], returns a data frame with a single row, while leaving out the row, as in mtcars[,2], returns the column as a vector. Often the values of a row must be converted to a vector data type. Conversion of variables from one type to another is done with the family of as.xxxx functions, e.g., as.vector, as.numeric, or as.factor. Vectors can contain numeric or character data, but all elements must be of the same type. In R, a list is similar to a vector but it may contain a mix of element types. A matrix is similar to a data frame, but all of its elements must be of the same type and it always has exactly two dimensions; an array generalizes a matrix to more than two dimensions.
Some functions expect data frames, some vectors, some lists. You need to read the documentation of a function to find out. Furthermore, some functions will automatically convert (also called coerce) a variable from one type to the one it requires.
You can also access a column in a data frame by its column name. For an entire column you use either the column's position or its name: df[,column] or df$columnName.
# all of row 4; the result is a data frame
r <- mtcars[4,]
sum(r)
## [1] 426.135
# all of row 3, then the value in its third column
row3 <- mtcars[3,]
row3[1,3]
## [1] 108
mtcars[c(1,4)] # columns 1 and 4 as a new dataframe
## mpg hp
## Mazda RX4 21.0 110
## Mazda RX4 Wag 21.0 110
## Datsun 710 22.8 93
## Hornet 4 Drive 21.4 110
## Hornet Sportabout 18.7 175
## Valiant 18.1 105
## Duster 360 14.3 245
## Merc 240D 24.4 62
## Merc 230 22.8 95
## Merc 280 19.2 123
## Merc 280C 17.8 123
## Merc 450SE 16.4 180
## Merc 450SL 17.3 180
## Merc 450SLC 15.2 180
## Cadillac Fleetwood 10.4 205
## Lincoln Continental 10.4 215
## Chrysler Imperial 14.7 230
## Fiat 128 32.4 66
## Honda Civic 30.4 52
## Toyota Corolla 33.9 65
## Toyota Corona 21.5 97
## Dodge Challenger 15.5 150
## AMC Javelin 15.2 150
## Camaro Z28 13.3 245
## Pontiac Firebird 19.2 175
## Fiat X1-9 27.3 66
## Porsche 914-2 26.0 91
## Lotus Europa 30.4 113
## Ford Pantera L 15.8 264
## Ferrari Dino 19.7 175
## Maserati Bora 15.0 335
## Volvo 142E 21.4 109
mtcars[,2] # all of column 2
## [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
mtcars[5:7,] # rows 5 to 7 as a new dataframe
## mpg cyl disp hp drat wt qsec vs am gear carb
## Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
## Duster 360 14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
mtcars$cyl # column named "cyl"
## [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
mtcars$cyl[2] # 2nd row in the column "cyl"
## [1] 6
mtcars$cyl[3:9] # rows 3 to 9 for column "cyl" as a vector
## [1] 4 6 8 6 8 4 4
w <- mtcars$mpg # the column "mpg" as a vector
mean(w)
## [1] 20.09062
Aggregation and Statistical Functions
As a language with its origin in statistics and statistical data processing, R has a plethora of statistical functions. Some of the most important functions for data processing are shown below. Consult online documentation and statistics references for more information, e.g., How To Get Descriptive Statistics In R and Base R Statistical Functions.
Packages
There are thousands of functions across hundreds of packages (external libraries of functions written for specific purposes, e.g., data mining, statistical inference, machine learning, image processing, web development, visualization, XML processing, SQL, and so forth). You will learn them over time – and it’s unlikely you will ever learn all of them, so have patience. For a package to be usable in an R project it must be installed; installation is done once. Then every time you need an installed package in some R code, you must load it using the library function.
Installing Packages
To ensure that a required package is installed automatically when it is missing, you can use the following code. That way your code becomes portable.
if("RSQLite" %in% rownames(installed.packages()) == FALSE) {
  install.packages("RSQLite")
}
library("RSQLite")
In the above code the function installed.packages() returns a matrix with one row per installed package, and rownames() extracts the package names. The operator %in% checks whether “RSQLite” is one of those names; the test evaluates to TRUE if it is and to FALSE otherwise. If it is FALSE, the package is not installed and the code that installs the package is executed. That way, the subsequent loading of the package with library("RSQLite") does not fail because the package is missing.
Data Frame Dimensions and Structure
Data frames are one of the most fundamental data structures of R, along with vectors. A data frame is a row/column arrangement of data where each column is a vector of data values of the same type, e.g., all numbers or all characters.
The example below uses one of the many built-in data frames of R: mtcars. These built-in data frames are wonderful for testing code or learning R.
nrow(mtcars) # number of rows in the data frame
## [1] 32
ncol(mtcars) # number of columns in the data frame
## [1] 11
str(mtcars) # structure of the data frame
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
mtcars[nrow(mtcars),] # last row only of a data frame
## mpg cyl disp hp drat wt qsec vs am gear carb
## Volvo 142E 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
Adding and Removing Columns from a Data Frame
To add a new column, you simply assign values to a column name that does not yet exist. Note in the example below that you can operate on entire columns (as vectors); the operation is applied element by element to each pair of values in the two vectors. This is more concise, and generally faster, than writing an explicit loop as is necessary in many other programming languages.
# copy the data frame mtcars to a new data frame df
df <- mtcars
# create a new column "dispcyl" which is the displacement per cylinder
df$dispcyl <- df$disp / df$cyl
head(df)
## mpg cyl disp hp drat wt qsec vs am gear carb dispcyl
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 26.66667
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 26.66667
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 27.00000
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 43.00000
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 45.00000
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 37.50000
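The section title also mentions removing columns. A minimal sketch of two common ways to do that in base R is given here; the choice of the columns vs and am is only for illustration.

```r
# start from a copy of mtcars with the extra column added
df <- mtcars
df$dispcyl <- df$disp / df$cyl

# remove a single column by assigning NULL to it
df$dispcyl <- NULL

# remove several columns by name, keeping all others
df <- df[, !(names(df) %in% c("vs", "am"))]

names(df)
```

Assigning NULL changes the data frame in place, while the second form builds a new data frame from the columns whose names are not in the list.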
Create a New Data Frame
Data frames are created in various ways: use the data.frame function, load a CSV file, execute a SQL query, or obtain one as the result of many package functions.
Load a Data Frame from CSV
Quick note: Capitalization in path and file names does not matter on Windows, but it does matter on Linux and may matter on MacOS. Furthermore, note that in R the path delimiter is a forward slash /, even on Windows, and not the usual backslash \. The \ is an “escape” character that is used to insert special characters into a string (text); e.g., the text This string contains "quotes". would be written in R as "This string contains \"quotes\"."
Also, the parameter header = F instructs read.csv not to interpret the first line as header labels. Of course, if there are no labels, then you need to define your own.
Aside from CSV files, R can also load a number of other file formats using various packages, including XML, Excel, SPSS, and MATLAB, among many others.
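The path and escape rules noted above can be tried out with a short sketch. The folder name data and the Windows path are assumptions made up for this illustration and are not part of the lesson files.

```r
# forward slashes work on every platform, including Windows;
# file.path() joins the parts with "/"
csv_path <- file.path("data", "customertxndata.csv")
csv_path

# a literal backslash must itself be escaped inside a string
win_path <- "C:\\data\\customertxndata.csv"
cat(win_path, "\n")

# escaped double quotes inside a double-quoted string
msg <- "This string contains \"quotes\"."
cat(msg, "\n")
```

cat() prints the string as it is actually stored, which makes it easy to see that the backslashes used for escaping are not part of the text.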
df <- read.csv(file = "customertxndata.csv", header = F)
head(df)
df <- read.csv(file = "customertxndata.csv",
               header = F,
               col.names = c("numVisits","NumTxn","OS","Gender","TotSp"))
head(df)
Note that the value of the ‘Male’ column in the first row is NA, which is the way that R indicates a missing data value. It is not 0 or an empty string; it is unknown. So, statistical functions and algebraic operations involving it would result in NA as well.
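How NA propagates through calculations can be seen in a small sketch. The values below are made up for illustration and are not taken from the CSV file.

```r
spend <- c(120.5, NA, 80)   # illustrative values with one missing entry

mean(spend)                 # NA, because one value is unknown
mean(spend, na.rm = TRUE)   # 100.25, ignoring the missing value
is.na(spend)                # FALSE TRUE FALSE
sum(is.na(spend))           # number of missing values: 1
```

Many summary functions accept na.rm = TRUE to skip missing values; whether skipping them is appropriate depends on the analysis.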
Strings vs Factors
The factor data type encodes categorical data, i.e., a variable whose value is one of a fixed set of values. Many statistical functions in R require categorical variables to be of type factor. However, during data processing we often need the actual text rather than having it encoded as a factor (which is stored in R as an integer code for efficiency). So, when reading a CSV file you need to decide whether you want text columns to be character strings or factors by setting the stringsAsFactors parameter. Since R 4.0.0 the default is stringsAsFactors = FALSE.
You may use either F and T or FALSE and TRUE, although the full names are safer because T and F are ordinary variables that can be reassigned.
df <- read.csv(file = "customertxndata.csv",
               header = F,
               stringsAsFactors = FALSE,
               col.names = c("numVisits","NumTxn","OS","Gender","TotSp"))
head(df)
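The relationship between text and its factor encoding can be sketched as follows. The operating system names are illustrative values rather than rows read from the CSV file.

```r
os <- c("iOS", "Android", "iOS", "Android", "iOS")   # illustrative values
os_factor <- factor(os)

levels(os_factor)        # the distinct categories
as.integer(os_factor)    # the integer codes stored internally
as.character(os_factor)  # back to plain text
table(os_factor)         # counts per category
```

A character column can be converted with factor() or as.factor() at any time, so reading text as character strings first does not rule out using factors later.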
Create a New Data Frame
The code below creates a new data frame from column vectors. Notice how the column names are the names given to the vectors in the call. A new vector is created with the c function, e.g., v <- c(3,5,1,9).
df1 <- data.frame(state = c('Arizona','Georgia', 'New York','Indiana','Washington','Texas'),
                  code = as.factor(c('AZ','GA','NY','IN','WA','TX')),
                  score = c(62,47,55,74,31,85))
head(df1)
## state code score
## 1 Arizona AZ 62
## 2 Georgia GA 47
## 3 New York NY 55
## 4 Indiana IN 74
## 5 Washington WA 31
## 6 Texas TX 85
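As a small sketch of how data.frame picks column names, the example below builds data frames from vectors that already exist. The names df2, df3, st, and pts are chosen only for this illustration.

```r
state <- c("Arizona", "Georgia", "New York")
score <- c(62, 47, 55)

# unnamed arguments: column names are taken from the vector names
df2 <- data.frame(state, score)
names(df2)                 # "state" "score"

# named arguments: the argument names become the column names
df3 <- data.frame(st = state, pts = score)
names(df3)                 # "st" "pts"

str(df3)
```

All vectors passed to data.frame should have the same length, since each one becomes a column with one value per row.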
Search Data Frames
There are two important functions for “searching” data frames: which and any. The code below uses the built-in Orange data frame, which contains measurements of orange trees. It has three columns: the tree, the age of the tree (days since 1968/12/31), and circumference (in mm).
which
# copy the built-in data frame Orange and show its first rows
df <- Orange
head(df)
## Tree age circumference
## 1 1 118 30
## 2 1 484 58
## 3 1 664 87
## 4 1 1004 115
## 5 1 1231 120
## 6 1 1372 142
# find all rows where the circumference is more than 200mm
rs <- which(df$circumference > 200)
# display all rows where the circumference is more than 200mm
df[rs,]
## Tree age circumference
## 13 2 1372 203
## 14 2 1582 203
## 27 4 1372 209
## 28 4 1582 214
# compound conditions are possible with & (and), | (or), and ! (not)
rs2 <- which(df$circumference > 200 & df$age < 1500)
rs3 <- which(df$circumference < 200 | !(df$age < 1500))
rs4 <- which(df$circumference > 400 | df$age > 1500)
rs2
## [1] 13 27
rs3
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 14 15 16 17 18 19 20 21 22 23 24 25 26
## [26] 28 29 30 31 32 33 34 35
rs4
## [1] 7 14 21 28 35
## [1] 1582
## [1] 894.8788
In the above example rs <- which(df$circumference > 200) finds all rows in the data frame df where circumference > 200. The row numbers are saved in rs.
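The any function mentioned above is not shown in the example, so here is a minimal sketch of it next to which, again using the built-in Orange data frame. The variable name big is an illustrative choice.

```r
df <- Orange

# any() answers whether at least one value satisfies a condition
any(df$circumference > 200)   # TRUE
any(df$circumference > 400)   # FALSE

# which() returns row numbers, which can then select rows
big <- which(df$circumference > 200)
big                           # 13 14 27 28
df[big, ]

# a logical condition can also be used directly as the row index
nrow(df[df$circumference > 200, ])   # 4
```

Using any first is a cheap way to check whether a search will find anything before selecting rows with which or a logical index.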
---
title: "Beginning R"
params:
  category: 6
  number: 100
  time: 60
  level: beginner
  tags: "r,primer"
  description: "Introduces some basic concept of R, including statements, data
                frames, vectors, variables, and reading from a CSV file."
date: "<small>`r Sys.Date()`</small>"
author: "<small>Martin Schedlbauer</small>"
email: "m.schedlbauer@neu.edu"
affilitation: "Northeastern University"
output: 
  bookdown::html_document2:
    toc: true
    toc_float: true
    collapsed: false
    number_sections: false
    code_download: true
    theme: spacelab
    highlight: tango
---

---
title: "<small>`r params$category`.`r params$number`</small><br/><span style='color: #2E4053; font-size: 0.9em'>`r rmarkdown::metadata$title`</span>"
---

```{r code=xfun::read_utf8(paste0(here::here(),'/R/_insert2DB.R')), include = FALSE}
```

## Prerequisite: R and R Studio

If you do not already have R and/or R Studio you will need to download and install them. You must first install R from [R Project](https://cloud.r-project.org/) and then the R Studio IDE from [R Studio](https://rstudio.com/products/rstudio/download/). Alternatively, rather than installing R and R Studio locally, you can do the tutorial using [R Studio Cloud](https://rstudio.cloud/).

Working with R and R Studio is similar to programming in Java. You install the JRE and the JDK to get the Java language, compiler, and run-time environment. You then usually also want an IDE (Integrated Development Environment) in which to write Java programs; for example, *Eclipse*. The same holds for Python: you install Python, the language, and then an IDE such as *PyCharm*. Of course, you do not strictly need R Studio (or any IDE) to write R programs, just as you do not need an IDE to program in Java, C, C++, or Python. You need a source code editor, and even Notepad or TextEdit would suffice. Some programmers prefer simple editors like Notepad++ or jEdit, while others like full-featured development environments like Visual Studio or Eclipse. The tutorial below assumes you will use R Studio.

> Install R from [R Project](https://cloud.r-project.org/) **before** installing [R Studio](https://rstudio.com/products/rstudio/download/).

## Tutorial

The recorded tutorial demonstrates how to install R and R Studio and how to create projects. It shows how to build R Notebooks using R Markdown and add R code chunks, and it explains how to load data into a data frame from a CSV file and access the data. It summarizes the content of this tutorial.

The lesson files, including the data files, can be found at the end of this page.

```{=html}
<iframe src="https://player.vimeo.com/video/669796944?h=c56f6fee39" width="480" height="270" frameborder="1" allow="autoplay; fullscreen; picture-in-picture" allowfullscreen data-external="1"></iframe>
```

## Basic R

R is an interpreted (scripting) language, which means that you do not need to compile a program before running it. Statements and expressions are executed as you type them if you enter them in the R Console, or are run when you execute the code in a chunk in an R Notebook.

## Creating an R Notebook

![](r-project-posit.jpg)

## R Code Chunks

This tutorial is limited to writing R "programs" using an R Notebook in R Studio. Programs in R run from start to end. Each chunk should be a step in your analysis or data project. Name your code chunks so you can quickly navigate to them.

In the chunk below, the variable *mtcars*, whose columns are passed to the built-in Base R function <code>plot</code>, is one of the dozens of "built-in" data frames; a data frame being data arranged in rows and columns similar to a spreadsheet or CSV file.

Note that you call a function by using the function's name followed by the arguments you wish to pass to the function. Of course, you need to follow the definition of the function. Many functions are simply "built-in" while others come from packages that you need to explicitly load into your program.

Note that there is no semicolon at the end of a line.

<code>\`\`\`{r namedChunk, eval=FALSE}<br/> plot(x = mtcars\$mpg, y = mtcars\$hp)<br/> \`\`\` </code>

### Expressions

R can be directly used to solve simple or complex mathematical expressions.

```{r}
# The [1] in front of the result below is the index of the first value on that output line.
# R prints this index at the start of every line of output for a vector.

((2^3)*5)-1
```

```{r}
# sqrt and exp are built-in R functions for the square root and the exponential function, respectively.

sqrt(4)* exp(2)
```
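
To make the meaning of the bracketed index concrete, here is a small sketch that prints a vector long enough to wrap across several console lines. The sequence of squares is an illustrative choice and not part of the lesson data.

```r
# a vector with 30 values; printing it wraps across several lines,
# and each output line starts with the index of its first value
squares <- (1:30)^2
squares

# the same index can be used to pick out a single value
squares[12]   # the 12th value, i.e. 12^2
```

How many values fit on one output line depends on the width of the console, so the indexes shown at the start of each line vary from one setup to another.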

### Variables and Identifiers

Holding a value in a variable is done through assignment. Once you assign a value to a variable, the variable becomes an R object. There are two ways to do an assignment: using '=' or '\<-'. The latter is the preferred way in R, but the former might be more familiar to programmers coming to R from Java, C++, or Python.

Note that variables are not explicitly declared. The first assignment of a value to a variable defines the variable and its type. The type is based on the value that is assigned. Unlike statically typed languages such as C++, C#, or Java, R is dynamically typed: the type of a variable can change when a value of a different type is assigned. A variable can be used in an expression. Its value can be inspected by just using the variable by itself.

The value of a variable can be displayed either by using the variable by itself or using the <code>print()</code> function.

```{r}
# assignment with '=' of a number
x = 12
# inspect (print/display) the value
x

# assign a new value and change its type to "text"
x = "Hello"
x

# assignment with '<-'
x <- 12
print(x)
```
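
The way a variable's type follows the value assigned to it can be checked with the base R function class. The sketch below reuses the variable x from the chunk above; the third assignment is added only for illustration.

```r
x <- 12
class(x)        # "numeric"

x <- "Hello"
class(x)        # "character"

x <- TRUE
class(x)        # "logical"
```

No declaration is needed at any point; each assignment simply replaces both the value and the type of x.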

The rules for naming an identifier (variable, function, or package name) for an object are as follows:

-   identifiers are case-sensitive and cannot contain spaces or special characters such as #, %, \$, \@, \*, &, \^, !, \~
-   an identifier must start with a letter (or with a dot that is not followed by a digit), but may contain any combination of letters and digits thereafter
-   the special characters dot (.) and underscore (\_) are allowed, although an identifier cannot start with an underscore

The dot (.) is a regular character in R, which can be confusing because other languages (*e.g.*, Java) use the dot to designate property or method access; *e.g.*, in Java *x.val* means that you are accessing the *val* property of the object *x*.

Some examples of legal variable names are: df, df2, df.txns, df_all2017. These are some illegal variable names: *2df* (cannot start with a digit), *rs\$all* (cannot contain a \$; the \$ is used to access columns in a dataframe), *rs#* (only . and \_ are allowed in addition to digits and letters).
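
One way to check the examples above is the base R function make.names, which turns any string into a syntactically valid name. The sketch below treats a name as valid when make.names returns it unchanged; the variable names candidates and is_valid are illustrative.

```r
candidates <- c("df", "df2", "df.txns", "df_all2017", "2df", "rs$all", "rs#")

# make.names() returns a syntactically valid version of each name;
# a name is already valid if it comes back unchanged
is_valid <- make.names(candidates) == candidates

data.frame(name = candidates, valid = is_valid)
```

Note that this check is about syntax only; it says nothing about whether a name is descriptive or already in use.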

It is considered good programming practice to give identifiers a sensible name that hints at what is stored in the variable rather than using an arbitrary name like x, val, or i33. Identifiers should be named consistently. Many programmers use one of two styles:

-   underscores, *e.g.*, *interest_rate*
-   camelCase, *e.g.*, *squareRoot*, *graphData*, *currentWorkingDirectory*

Note that R is case sensitive, which means that R treats the identifiers *AP* and *ap* as different objects. As a side note, file names may also be case sensitive, but that depends on the operating system and file system. Linux is case sensitive, while Windows and, in its default configuration, MacOS are case aware but not case sensitive. For example, on Linux there is a difference between "AirPassengers.txt" and "airpassengers.txt" while on Windows there is not. SQL keywords are also not case sensitive. It is a best practice to assume case sensitivity.

## Built-in Data Frames

There are numerous data frames built into R that are accessible without loading them first from external files. These data frames are for experimentation and learning and not for actual analytics work. One such built-in data frame is <code>mtcars</code>. To get a list of all built-in data frames, run <code>data()</code>.

```{r}
mtcars
```

<code>head</code> and <code>tail</code> print out the first and last six rows of a data frame, respectively. You can specify the number of rows to display.

```{r}
head(mtcars)
tail(mtcars)

head(mtcars, 3)
```

## Accessing Rows, Columns, and Elements (Cells) of a Data Frame

Data frames are very similar to tables in relational databases and spreadsheets. They have rows and columns, and the intersection of a row and column is a cell (or element). The order of access is row followed by column; *e.g.*, the third element in the fourth row of the data frame <code>mtcars</code> is <code>mtcars[4,3]</code>. Note that this is reversed from the way Excel and other spreadsheets address cells (column first, as in C4). The <code>\<-</code> is the operator for assignment, although <code>=</code> also works. We will see and use both.

To display a value, either use the <code>print</code> function or just use the variable by itself. To print multiple items, use the <code>paste0</code> function.

```{r}
v <- mtcars[4,3]
x = mtcars[4,3]

print(paste0("v = ",v," and x = ",x))
```

Leaving out a dimension (row or column) accesses the entire row or column. Leaving out the column, as in <code>mtcars[4,]</code>, returns a data frame with a single row, while leaving out the row, as in <code>mtcars[,2]</code>, returns the column as a vector. Often the values of a row must be converted to a vector data type. Conversion of variables from one type to another is done with the family of <code>as.xxxx</code> functions, *e.g.*, <code>as.vector</code>, <code>as.numeric</code>, or <code>as.factor</code>. Vectors can contain numeric or character data, but all elements must be of the same type. In R, a list is similar to a vector but it may contain a mix of element types. A matrix is similar to a data frame, but all of its elements must be of the same type and it always has exactly two dimensions; an array generalizes a matrix to more than two dimensions.

Some functions expect data frames, some vectors, some lists. You need to read the documentation of a function to find out. Furthermore, some functions will automatically convert (also called coerce) a variable from one type to the one it requires.
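
Which kind of object a given access returns can be checked with class. The following sketch uses mtcars; the variable names are illustrative, and unlist is shown as one possible way to turn a single-row data frame into a vector.

```r
row4 <- mtcars[4, ]      # one row, all columns
class(row4)              # "data.frame"

col2 <- mtcars[, 2]      # one column, all rows
class(col2)              # "numeric"

# convert the single-row data frame into a (named) numeric vector
row4_values <- unlist(row4)
class(row4_values)       # "numeric"

# keep a single column as a data frame by adding drop = FALSE
col2_df <- mtcars[, 2, drop = FALSE]
class(col2_df)           # "data.frame"
```

Checking the class of an intermediate result like this is a quick way to find out why a function that expects a vector rejects a data frame, or the other way around.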

You can also access a column in a data frame by its column name. For an entire column you use either the column's position or its name: <code>df[,column]</code> or <code>df\$columnName</code>.

```{r}
# all of row 4; the result is a data frame
r <- mtcars[4,]
sum(r)

# all of row 3, then the value in its third column
row3 <- mtcars[3,]
row3[1,3]

mtcars[c(1,4)]   # columns 1 and 4 as a new dataframe

mtcars[,2]       # all of column 2
mtcars[5:7,]     # rows 5 to 7 as a new dataframe
mtcars$cyl       # column named "cyl"
mtcars$cyl[2]    # 2nd row in the column "cyl"

mtcars$cyl[3:9]  # rows 3 to 9 for column "cyl" as a vector

w <- mtcars$mpg
mean(w)
```

## Aggregation and Statistical Functions

As a language with its origin in statistics and statistical data processing, R has a plethora of statistical functions. Some of the most important functions for data processing are shown below. Consult online documentation and statistics references for more information, *e.g.*, [How To Get Descriptive Statistics In R](https://www.programmingr.com/statistics/descriptive-statistics-in-r/#:~:text=The%20summary%20function%20in%20R%20is%20one%20of,such%20as%20range%2C%20mean%2C%20median%20and%20interpercentile%20ranges) and [Base R Statistical Functions](https://www.dummies.com/education/math/statistics/base-r-statistical-functions/).
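
As a brief sketch of the kind of functions meant here, the chunk below applies several base R summary functions to the mpg column of mtcars. The selection of functions is an illustrative sample rather than a complete list.

```r
mpg <- mtcars$mpg      # fuel economy column as a numeric vector

mean(mpg)              # arithmetic mean
median(mpg)            # median
sd(mpg)                # standard deviation
var(mpg)               # variance
min(mpg)               # smallest value
max(mpg)               # largest value
range(mpg)             # smallest and largest value together
quantile(mpg)          # quartiles
summary(mpg)           # minimum, quartiles, mean, and maximum in one call
table(mtcars$cyl)      # number of cars per cylinder count
```

Most of these functions expect a vector, which is why the column is extracted with the \$ operator first; summary also accepts a whole data frame.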

## Packages

There are thousands of functions across hundreds of packages (external libraries of functions written for specific purposes, *e.g.*, data mining, statistical inference, machine learning, image processing, web development, visualization, XML processing, SQL, and so forth). You will learn them over time -- and it's unlikely you will ever learn all of them, so have patience. For a package to be usable in an R project it must be installed; installation is done once. Then every time you need an installed package in some R code, you must load it using the <code>library</code> function.

### Installing Packages

To ensure that packages are automatically installed, you can use the following code. That way your code becomes portable.

```{r}
if("RSQLite" %in% rownames(installed.packages()) == FALSE) {
  install.packages("RSQLite")
}

library("RSQLite")
```

In the above code, the function `installed.packages()` returns a matrix with one row per installed package, and `rownames()` extracts the package names from it. The operator `%in%` checks whether *"RSQLite"* is one of those names and evaluates to `TRUE` if it is and `FALSE` otherwise. When the result is `FALSE`, the package is not installed, so the code inside the `if` block runs and installs it. That way, loading the package with `library("RSQLite")` does not fail because the package is missing.
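
An alternative way to write the same check uses `requireNamespace()`, which returns `FALSE` rather than stopping when a package is missing. The helper `ensure_packages()` below is a hypothetical name made up for this sketch; it is not part of R. To avoid downloading anything, the example calls it with two packages that ship with R.

```{r}
# ensure_packages() is a hypothetical helper, not part of base R
ensure_packages <- function(pkgs) {
  for (p in pkgs) {
    # requireNamespace() returns FALSE (instead of stopping) if p is missing
    if (!requireNamespace(p, quietly = TRUE)) {
      install.packages(p)
    }
    # character.only = TRUE lets library() take the name from a variable
    library(p, character.only = TRUE)
  }
  invisible(pkgs)
}

# "stats" and "utils" ship with R, so nothing is downloaded here;
# replace them with the packages your project needs, e.g. "RSQLite"
ensure_packages(c("stats", "utils"))
```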

## Data Frame Dimensions and Structure

Data frames are one of the most fundamental data structures of R, along with vectors. A data frame is a row/column arrangement of data where each column is a vector of data values of the same type, *e.g.*, all numbers or all characters.

The example below uses one of the many built-in data frames of R: *mtcars*. These built-in data frames are wonderful for testing code or learning R.

```{r}
nrow(mtcars)            # number of rows in the data frame
ncol(mtcars)            # number of columns in the data frame

str(mtcars)             # structure of the data frame

mtcars[nrow(mtcars),]   # last row only of a data frame   
```
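
A few more base R functions are often useful for a first look at a data frame. The sketch below applies them to *mtcars*; which ones you need depends on your data.

```{r}
dims <- dim(mtcars)     # rows and columns in one call
dims

names(mtcars)           # column names
head(mtcars, 3)         # first three rows
tail(mtcars, 3)         # last three rows
summary(mtcars$hp)      # quick summary of a single column
```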

## Adding and Removing Columns from a Data Frame

To add a new column, assign values to a column name that does not exist yet. Note in the example below that you can operate on entire columns (as vectors): the operation is applied element by element to each pair of values in the two vectors. This is more concise, and usually much faster, than writing an explicit loop as is common in other programming languages.

```{r}
# copy the data frame mtcars to a new data frame df
df <- mtcars

# create a new column "dispcyl" which is the displacement per cylinder
df$dispcyl <- df$disp / df$cyl

head(df)
```
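
Removing a column can be sketched in a similar way. One common approach is to assign `NULL` to the column; another is to select only the columns you want to keep. The names *df2* and *drop_cols* are made up for this example.

```{r}
df <- mtcars
df$dispcyl <- df$disp / df$cyl    # add a column

# remove a single column by assigning NULL to it
df$dispcyl <- NULL

# remove several columns by name, keeping everything else
drop_cols <- c("vs", "am")
df2 <- df[, !(names(df) %in% drop_cols)]

ncol(df)
ncol(df2)
```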

## Create a New Data Frame

Data frames are created in various ways: use the <code>data.frame</code> function, load a CSV file, execute a SQL query, or as a result of many package functions.

### Load a Data Frame from CSV

Quick note: Capitalization in path and file names does not matter on Windows, but **does matter** on Linux and, depending on how the file system is formatted, on macOS. Matching the exact capitalization keeps your code portable. Furthermore, note that in R, even on Windows, the path delimiter is written as a forward slash / and not the usual backslash \\. In R strings the \\ is an "escape" character that is used to insert special characters into a string (text), *e.g.*, the text *This string contains "quotes".* would be written in R as "This string contains \\"quotes\\"."
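
The short sketch below illustrates both points. The folder name *data* is a hypothetical example; no file is actually opened.

```{r}
# forward slashes work on Windows, macOS, and Linux
p1 <- "data/customertxndata.csv"

# file.path() assembles a path from its parts using "/"
p2 <- file.path("data", "customertxndata.csv")
identical(p1, p2)

# \" places a double quote inside a double-quoted string
s <- "This string contains \"quotes\"."
cat(s, "\n")      # cat() prints the text without the escape characters
nchar(s)          # the backslashes are not counted as characters
```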

The parameter <code>header = F</code> instructs <code>read.csv</code> not to interpret the first line as header labels. Of course, if the file has no labels, you need to define your own with the <code>col.names</code> parameter.

Aside from CSV files, R can also load a number of other file formats using various packages, including XML, Excel, SPSS, and MATLAB, among many others.

```{r eval=FALSE}
df <- read.csv(file = "customertxndata.csv", header = F)
head(df)

df <- read.csv(file = "customertxndata.csv", 
               header = F,
               col.names = c("numVisits","NumTxn","OS","Gender","TotSp"))
head(df)
```

> Note that the value of the 'Male' column in the first row is *NA*, which is how R indicates a missing data value. It is not 0 or an empty string; it is unknown. Statistical functions and arithmetic operations on data that contain an *NA* therefore also result in *NA*, unless you tell them to ignore missing values.
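
To experiment with missing values without the lesson's data file, the sketch below writes a tiny CSV to a temporary file and reads it back. The three rows are invented for this illustration and are not taken from *customertxndata.csv*; the `na.strings = ""` setting is an assumption that empty fields should count as missing.

```{r}
# write a small CSV without a header row to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("7,0,Android,Male,0",
             "20,1,iOS,,576.87",
             "22,1,iOS,Female,850"), tmp)

demo <- read.csv(file = tmp,
                 header = FALSE,
                 na.strings = "",     # treat empty fields as missing
                 col.names = c("numVisits","NumTxn","OS","Gender","TotSp"))
demo

is.na(demo$Gender)                 # TRUE where the value is missing
mean(c(10, NA, 30))                # NA: the result is unknown
mean(c(10, NA, 30), na.rm = TRUE)  # ignore missing values

unlink(tmp)                        # remove the temporary file
```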

#### Strings vs Factors

The *factor* data type encodes categorical data, *i.e.*, data where the value of a variable is one of a fixed set of values. Many statistical functions in R require categorical variables to be of type *factor*. However, during data processing we often need the actual text rather than its encoding as a *factor* (which R stores internally as an integer for efficiency). So, when reading a CSV file you need to decide whether text columns should be character strings or factors by setting the <code>stringsAsFactors</code> parameter. Since R 4.0.0 the default is <code>FALSE</code>; in earlier versions it was <code>TRUE</code>.

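You may use either <code>F</code> and <code>T</code> or <code>FALSE</code> and <code>TRUE</code>. The long forms are safer because <code>T</code> and <code>F</code> are ordinary variables that can be reassigned, while <code>TRUE</code> and <code>FALSE</code> are reserved words.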

```{r eval=FALSE}
df <- read.csv(file = "customertxndata.csv", 
               header = F,
               stringsAsFactors = FALSE,
               col.names = c("numVisits","NumTxn","OS","Gender","TotSp"))
head(df)
```
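
The relationship between text and factors can be seen in a small sketch. The values in *os_chr* are made up for this example.

```{r}
os_chr <- c("Android", "iOS", "iOS", "Android", "iOS")   # made-up values
os_fct <- factor(os_chr)

levels(os_fct)          # the distinct categories
as.integer(os_fct)      # the integer codes stored internally
table(os_fct)           # counts per category
as.character(os_fct)    # convert back to plain text
```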

### Create a Data Frame from Vectors

The code below creates a new data frame from column vectors. Notice how the column names are the names given to the vectors in the call to <code>data.frame</code>. A new vector is created with the <code>c</code> function, *e.g.*, <code>v \<- c(3,5,1,9)</code>.

```{r}
df1 <- data.frame(state = c('Arizona','Georgia', 'New York','Indiana','Washington','Texas'),
                  code = as.factor(c('AZ','GA','NY','IN','WA','TX')),
                  score = c(62,47,55,74,31,85))

head(df1)

```
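
If the vectors already exist as variables, a data frame can also be built from them directly, in which case the variable names become the column names. The sketch below shows this along with one way to append a row; the names *df2*, *state*, and *score* are used only for this illustration.

```{r}
state <- c("Arizona", "Georgia", "New York")
score <- c(62, 47, 55)

df2 <- data.frame(state, score)   # columns are named "state" and "score"
names(df2)

# add a row by binding a one-row data frame with the same columns
df2 <- rbind(df2, data.frame(state = "Texas", score = 85))
df2
```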

## Search Data Frames

There are two important functions for "searching" data frames: <code>which</code> and <code>any</code>. The code below uses the built-in [**Orange** data frame](https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/Orange) which contains measurements of orange trees. It has three columns: the tree, the *age* of the tree (days since 1968/12/31), and *circumference* (in *mm*).

### which

```{r}
df <- Orange

head(df)

# find all rows where the circumference is more than 200mm
rs <- which(df$circumference > 200)

# display all rows where the circumference is more than 200mm
df[rs,]

# compound conditions are possible with & (and), | (or), and ! (not)
rs2 <- which(df$circumference > 200 & df$age < 1500)
rs3 <- which(df$circumference < 200 | !(df$age < 1500))
rs4 <- which(df$circumference > 400 | df$age > 1500)

rs2
rs3
rs4

mean(df[rs4,2])      # mean age (column 2) of the rows in rs4
mean(df$age[rs3])    # mean age of the rows in rs3

```

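In the above example, <code>rs \<- which(df\$circumference \> 200)</code> finds all rows in the data frame *df* where *circumference \> 200*. The function <code>which</code> returns the positions (row numbers) at which the condition is true; these are saved in *rs* and then used to index the data frame.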
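
### any

While <code>which</code> returns the positions of the matching rows, <code>any</code> answers only whether at least one value satisfies a condition. The sketch below uses the same *Orange* data frame; the thresholds are arbitrary examples, and <code>all</code> is included for comparison.

```{r}
df <- Orange

any(df$circumference > 200)    # is at least one circumference that large?
any(df$age < 100)              # is any measurement from before day 100?
all(df$circumference > 0)      # all() checks that every value matches

# any() is handy as a guard before doing further work
if (any(df$circumference > 200)) {
  big <- df[which(df$circumference > 200), ]
  nrow(big)
}
```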

------------------------------------------------------------------------

## Files & Resources

```{r zipFiles, echo=FALSE}
zipName = sprintf("LessonFiles-%s-%s.zip", 
                 params$category,
                 params$number)

textALink = paste0("All Files for Lesson ", 
               params$category,".",params$number)

# downloadFilesLink() is included from _insert2DB.R
knitr::raw_html(downloadFilesLink(".", zipName, textALink))
```

------------------------------------------------------------------------

## Errata

[Let us know](https://form.jotform.com/212187072784157){target="_blank"}.
