Vectors
In R, a vector is similar to a list or array in other programming languages. It is a collection of elements of the same basic type: numeric, character, or logical (Boolean). As an aside, a list in R is a collection of mixed data types. This section applies to vectors only.
Vectors emerge in different ways in R:
- result of manually creating them
- result of a call to a function
- extraction of elements from a vector
- column in a data frame
There are no specific packages required to manipulate, use, create, or access vectors.
Creating a Vector
A simple vector can be created by combining values with the c() function. Note that all elements in a vector must be of the same basic type.
v.numeric <- c(23, 77, 12, 98, -23, 0)
v.char <- c('R', 'Java', 'Python', 'C++', 'LISP')
v.logical <- c(TRUE, F, FALSE, T, T, T)
When elements are not of the same type, R attempts to coerce (cast or convert) them to the same “higher” data type. For example, in the code below, all elements are coerced to character (i.e., text/string):
v.mixed <- c(23, 0, 'R', TRUE)
print(v.mixed)
## [1] "23" "0" "R" "TRUE"
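As a minimal sketch of the coercion rule described above, the following lines mix types pairwise and inspect the result with class(). The variable names v.ln, v.nc, and v.lc are made up for this illustration; the ordering logical, then numeric, then character is standard R behavior.

```r
# Coercion moves "upward": logical -> numeric -> character
v.ln <- c(TRUE, FALSE, 5)   # logical values become the numbers 1 and 0
v.nc <- c(1.5, "a")         # the number becomes the text "1.5"
v.lc <- c(TRUE, "a")        # the logical value becomes the text "TRUE"

print(class(v.ln))          # "numeric"
print(class(v.nc))          # "character"
print(class(v.lc))          # "character"
print(v.ln)                 # 1 0 5
```

Checking class() after building a vector from mixed input is a quick way to see whether an unintended coercion took place.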
The code below shows a more complex example. It creates an artificial vector of random integers for use in the remainder of the tutorial.
# vector of 50 random integers between 0 and 10
# set the seed for the random number generator to ensure same
# sequence of random numbers every time the code is run
set.seed(98788)
v <- round(runif(50, min = 0, max = 10),0)
# arguments do not have to be passed in the order that they are
# declared in the function definition as long as the names of the
# arguments are specified
v <- round(runif(n = 50, max = 10, min = 1),0)
v <- round(runif(max = 10, min = 1, n = 50),0)
print(v)
## [1] 9 6 9 1 9 8 10 2 6 9 8 7 6 1 7 10 4 3 5 9 7 8 4 9 7 4 6 5 1 4 4 2 7 9 9 6 3 2 3 7 8 6
## [43] 3 9 4 10 7 1 4 2
Size of Vectors
To find the size (length) of a vector, i.e., the number of elements in a vector, use the function length().
v <- round(runif(50, min = 0, max = 10),0)
print(length(v))
## [1] 50
Vector Operations
Vectors can be used in algebraic operations without having to write loops as in other programming languages. In the example below, every element of the vector v.a is multiplied by 2.5:
v.a <- c(23, 33, 10, 8, 7)
r <- v.a * 2.5
print(r)
## [1] 57.5 82.5 25.0 20.0 17.5
If two vectors are used in an operation, then the operator is applied to each corresponding pair of values.
v.a <- c(23, 33, 10, 8, 7)
v.b <- c(0.3, 0.5, 0.1, 0.9, 0.8)
r <- v.a * v.b
print(r)
## [1] 6.9 16.5 1.0 7.2 5.6
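R does not strictly require the two vectors to be the same size: it recycles the elements of the shorter vector until it matches the length of the longer one. That is rarely what is intended for element-wise arithmetic, so the vectors should normally be of the same size. In the code below the vectors do not have the same number of elements, and because the longer length is not a multiple of the shorter one, R issues a warning.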
v.a <- c(23, 33, 10, 8, 7)
v.b <- c(23, 33, 10, 8, 7, 99)
r <- v.a * v.b
## Warning in v.a * v.b: longer object length is not a multiple of shorter object length
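The warning above comes from R's recycling of the shorter vector. The sketch below shows a case where recycling happens silently because the longer length is a multiple of the shorter one, followed by a small guard function. The names v.long, v.short, and multiply.strict are hypothetical and only serve this illustration.

```r
# Recycling: the shorter vector is repeated to match the longer one
v.long  <- c(1, 2, 3, 4, 5, 6)
v.short <- c(10, 100)
r.recycled <- v.long * v.short   # no warning: 6 is a multiple of 2
print(r.recycled)                # 10 200 30 400 50 600

# A hypothetical helper that refuses to recycle
multiply.strict <- function(a, b) {
  if (length(a) != length(b)) {
    stop("vectors differ in length")
  }
  a * b
}

print(multiply.strict(c(1, 2, 3), c(4, 5, 6)))   # 4 10 18
```

An explicit length check like this is one way to turn a silent recycling mistake into an error.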
Of course, we could also write the code explicitly using a loop. Note that this runs significantly slower and thus should be avoided.
Example: Dot Product
The code below calculates the sum of the products of the two vectors (their dot product), first using a loop, and then using vector operations, and finally using the %*% operator.
# Approach 1: Using Loops
v.a <- c(23, 33, 10, 8, 7)
v.b <- c(0.3, 0.5, 0.1, 0.9, 0.8)
n <- length(v.a)
r <- 0
for (i in 1:n) {
r <- r + (v.a[i] * v.b[i])
}
print(paste0("Approach 1 (loop): ", r))
## [1] "Approach 1 (loop): 37.2"
# Approach 2: Vector Operation
v.prod <- v.a * v.b
v.sum <- sum(v.prod)
print(paste0("Approach 2 (vector operation and function): ", v.sum))
## [1] "Approach 2 (vector operation and function): 37.2"
# Approach 3: Operator
dp <- v.a %*% v.b
print(paste0("Approach 3 (dot product operand): ", dp))
## [1] "Approach 3 (dot product operand): 37.2"
The above example illustrates that there is often more than one way to achieve a task in R. Of course, some approaches are more elegant, require less code, and are faster. Yet another way to calculate the dot product is the dot() function from the pracma package. We are certain that you can come up with still more ways.
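One further possibility that needs no extra package is sketched below, using the Base R function crossprod(). Like %*%, it returns a 1 x 1 matrix rather than a plain number, so drop() is used here to reduce it to a single value. The variable names dp.matrix and dp.scalar are chosen for this example only.

```r
v.a <- c(23, 33, 10, 8, 7)
v.b <- c(0.3, 0.5, 0.1, 0.9, 0.8)

# crossprod(x, y) computes t(x) %*% y; for two vectors that is the dot product
dp.matrix <- crossprod(v.a, v.b)
print(is.matrix(dp.matrix))      # TRUE: the result is a 1 x 1 matrix

# drop() removes the matrix dimensions and leaves a single number
dp.scalar <- drop(dp.matrix)
print(dp.scalar)                 # 37.2
```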
Functions on Vectors
As seen in previous examples, many functions take vectors as arguments. For example, the function mean() takes a numeric or logical vector as input and returns the average value. For a logical vector, the values of TRUE and FALSE are converted to numbers, where TRUE = 1 and FALSE = 0, and then the arithmetic average is calculated.
The code below illustrates some common functions. The function round() can be useful for output.
v <- c(2, 6, 8.2, 1.3, 9.4, 45.7, 32, 99, 104,55, 0.05)
n <- length(v)
print(paste0("n = ", n))
## [1] "n = 11"
m <- mean(v)
print(paste0("Mean = ", round(m,2)))
## [1] "Mean = 32.97"
s <- sd(v)
print(paste0("StdDev = ", round(s,2)))
## [1] "StdDev = 38.72"
d <- median(v)
print(paste0("Median = ", round(d,2)))
## [1] "Median = 9.4"
r <- max(v) - min(v)
print(paste0("Range = ", round(r,2)))
## [1] "Range = 103.95"
tm <- mean(v, trim = 0.1)
print(paste0("10% Trimmed Mean = ", round(tm,2)))
## [1] "10% Trimmed Mean = 28.73"
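The conversion of TRUE and FALSE to 1 and 0 mentioned earlier can be sketched as follows. Because of that conversion, sum() counts the TRUE values and mean() gives their proportion. The variable names n.true, p.true, and n.big are made up for this illustration.

```r
v.logical <- c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE)

n.true <- sum(v.logical)     # number of TRUE values
p.true <- mean(v.logical)    # proportion of TRUE values
print(n.true)                # 4
print(round(p.true, 2))      # 0.67

# the same idea applied to a numeric vector: how many values exceed 10?
v <- c(2, 6, 8.2, 1.3, 9.4, 45.7, 32, 99, 104, 55, 0.05)
n.big <- sum(v > 10)
print(n.big)                 # 5
```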
Accessing Elements in a Vector
Elements are accessed by position. In R, the access index can also be a vector of integers, in which case all elements at those positions are retrieved. Positions are numbered from 1 to the number of elements in the vector. The number of elements (or length) of a vector can be obtained using the length() function.
In the example below, note that n:m generates a vector of integers from n to m, inclusive. The seq() function generates a sequence of numbers at a fixed interval.
print(v)
## [1] 2.00 6.00 8.20 1.30 9.40 45.70 32.00 99.00 104.00 55.00 0.05
# access a single element at position 3
v[3]
## [1] 8.2
# access element 3 through 5
v[3:5]
## [1] 8.2 1.3 9.4
# access the first element
v[1]
## [1] 2
# access the last element
v[length(v)]
## [1] 0.05
# access the next to last element
v[length(v)-1]
## [1] 55
# access every other element
v[seq(from = 1, to = length(v), by = 2)]
## [1] 2.00 8.20 9.40 32.00 104.00 0.05
# access specific elements at positions 1, 5, and 7
i <- c(1,5,7)
v[i]
## [1] 2.0 9.4 32.0
Testing Predicate Expressions
It is possible in R to apply a predicate expression to every element in a vector. This generates a “Boolean vector” of TRUE/FALSE values that indicate which elements match the predicate expression (TRUE) and which don’t (FALSE).
Predicate expressions are built with comparison operators (<, >, <=, >=, ==, !=).
## [1] TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [1] TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
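The expressions that produced the output above are not shown, so the following is only a sketch of the kind of predicate expressions that yield such Boolean vectors. It reuses the values of the vector v from the earlier examples; the particular conditions and the names b.small, b.mid, and b.five are assumptions made for illustration.

```r
v <- c(2, 6, 8.2, 1.3, 9.4, 45.7, 32, 99, 104, 55, 0.05)

# each comparison is applied to every element and yields TRUE or FALSE
b.small <- v < 5
print(b.small)

# conditions can be combined element by element with & (and) and | (or)
b.mid <- v >= 9.4 & v <= 55
print(which(b.mid))          # 5 6 7 10

# no element equals 5, so every entry is FALSE
b.five <- v == 5
print(b.five)

# a Boolean vector can be used directly as an index
print(v[b.small])            # 2.00 1.30 0.05
```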
Finding Matches
The which() function returns the positions that are TRUE in a Boolean vector.
# returns the positions of the elements that match the predicate expression
which(v != 5)
## [1] 1 2 3 4 5 6 7 8 9 10 11
# count the number of matches
length(which(v != 0))
## [1] 11
p <- which(v < 5)
print (v[p])
## [1] 2.00 1.30 0.05
# or combine
x <- v[which(v < 5)]
print (x)
## [1] 2.00 1.30 0.05
Determining Any Matches
To determine if there are any matches, i.e., whether at least one element in a vector matches the predicate expression, use the any() function. The function any() returns TRUE if there’s at least one match, FALSE otherwise.
## [1] TRUE
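The call that produced the output above is not shown. As a sketch under that caveat, the lines below apply any() to the vector v from the earlier examples. The conditions are chosen for illustration, and all() is included only as the natural counterpart of any().

```r
v <- c(2, 6, 8.2, 1.3, 9.4, 45.7, 32, 99, 104, 55, 0.05)

# is at least one element greater than 100?
any.big <- any(v > 100)
print(any.big)               # TRUE (104 qualifies)

# is at least one element negative?
any.neg <- any(v < 0)
print(any.neg)               # FALSE

# all() asks whether every element matches
all.pos <- all(v > 0)
print(all.pos)               # TRUE
```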
Accessing Rows, Columns, and Elements (Cells) of a Data Frame
Data frames are very similar to tables in relational databases and spreadsheets. They have rows and columns, and the intersection of a row and a column is a cell (or element). The order of access is row followed by column, e.g., the third element in the fourth row of the data frame mtcars is mtcars[4,3].
Note that this row-then-column order is reversed from the way Excel and other spreadsheets address cells, where the column comes first (e.g., C4).
Every column in a data frame is a vector, but every row is a data frame with a single row.
The example code below uses the built-in data frame mtcars. You can find out more about its structure using str(mtcars) or display the first few rows with head(mtcars). It is also often useful to restrict the columns in the output.
The function str() is best used to explore data interactively in the R console rather than in an R program or R Notebook; its output is lengthy and generally not useful to those reading the output of an R program or a knitted R Notebook.
## 'data.frame': 32 obs. of 12 variables:
## $ miPerGa: num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp : num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat : num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec : num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear : num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb : num 4 4 1 1 2 1 4 2 2 4 ...
## $ foo : num 99 99 99 99 99 99 99 99 99 99 ...
## miPerGa cyl disp hp
## Mazda RX4 21.0 6 160 110
## Mazda RX4 Wag 21.0 6 160 110
## Datsun 710 22.8 4 108 93
v <- mtcars[4,3]
x = mtcars[4,3]
print(paste0("v = ",v," and x = ",x))
## [1] "v = 258 and x = 258"
Leaving out a dimension (row or column) accesses the entire row or column. Omitting the column index, as in mtcars[4,], returns the whole row as a data frame with a single row; omitting the row index, as in mtcars[,2], returns the whole column as a vector.
Often the values must be converted to a vector data type. Conversion of a variable from one type to another is done with the family of as.xxxx functions, e.g., as.vector, as.numeric, or as.factor. Vectors can contain numeric, character, or logical data, but all elements must be of the same type. In R, a list is similar to a vector but may contain a mix of element types. A matrix is similar to a data frame, but all of its elements must be of the same type and it has exactly two dimensions; an array generalizes a matrix to more than two dimensions.
Some functions expect data frames, some vectors, some lists. You need to read the documentation of a function to find out. Furthermore, some functions will automatically convert a variable from one type to the one it requires.
You can also access a column in a data frame by its column name. For an entire column, use either the column's position or its name: df[,column] or df$columnName.
# all of row 4; the result is a data frame
r <- mtcars[4,]
sum(r)
## [1] 525.135
## [1] 108
mtcars[c(1,4)] # columns 1 and 4 as a new dataframe
## miPerGa hp
## Mazda RX4 21.0 110
## Mazda RX4 Wag 21.0 110
## Datsun 710 22.8 93
## Hornet 4 Drive 21.4 110
## Hornet Sportabout 18.7 175
## Valiant 18.1 105
## Duster 360 14.3 245
## Merc 240D 24.4 62
## Merc 230 22.8 95
## Merc 280 19.2 123
## Merc 280C 17.8 123
## Merc 450SE 16.4 180
## Merc 450SL 17.3 180
## Merc 450SLC 15.2 180
## Cadillac Fleetwood 10.4 205
## Lincoln Continental 10.4 215
## Chrysler Imperial 14.7 230
## Fiat 128 32.4 66
## Honda Civic 30.4 52
## Toyota Corolla 33.9 65
## Toyota Corona 21.5 97
## Dodge Challenger 15.5 150
## AMC Javelin 15.2 150
## Camaro Z28 13.3 245
## Pontiac Firebird 19.2 175
## Fiat X1-9 27.3 66
## Porsche 914-2 26.0 91
## Lotus Europa 30.4 113
## Ford Pantera L 15.8 264
## Ferrari Dino 19.7 175
## Maserati Bora 15.0 335
## Volvo 142E 21.4 109
mtcars[,2] # all of column 2
## [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
mtcars[5:7,] # rows 5 to 7 as a new dataframe
## miPerGa cyl disp hp drat wt qsec vs am gear carb foo
## Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 99
## Valiant 18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 99
## Duster 360 14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 99
mtcars$cyl # column named "cyl"
## [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
mtcars$cyl[2] # 2nd row in the column "cyl"
## [1] 6
mtcars$cyl[3:9] # rows 3 to 9 for column "cyl" as a vector
## [1] 4 6 8 6 8 4 4
## Warning in mean.default(w): argument is not numeric or logical: returning NA
## [1] NA
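The warning above is what mean() reports when it is given something that is not a numeric or logical vector, such as a data frame row. The sketch below shows one way to convert a row to a vector first. It uses the standard built-in mtcars (11 columns), not the modified copy shown in this lesson's output, and the names r, w, col.vec, and col.df are chosen for this illustration.

```r
# a row of a data frame is itself a data frame
r <- mtcars[4, ]
print(is.data.frame(r))              # TRUE

# mean() does not accept a data frame: it warns and returns NA
m.bad <- suppressWarnings(mean(r))
print(m.bad)                         # NA

# unlist() flattens the single row into a named numeric vector
w <- unlist(r)
print(is.numeric(w))                 # TRUE
print(mean(w))

# columns: [, j] yields a vector, [j] yields a one-column data frame
col.vec <- mtcars[, 2]
col.df  <- mtcars[2]
print(is.vector(col.vec))            # TRUE
print(is.data.frame(col.df))         # TRUE
```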
Adding Columns to a Data Frame
To add a new column, you simply “access” the column or use a new name for the column. Note in the example below that you can operate on entire columns (as vectors) and the operation is applied to each pair of values in the two vectors in the operation. This is much more efficient than using loops as is necessary in other programming languages.
# copy the data frame mtcars to a new data frame df
df <- mtcars
# create a new column "dispcyl" which is the displacement per cylinder
df$dispcyl <- df$disp / df$cyl
head(df)
## miPerGa cyl disp hp drat wt qsec vs am gear carb foo dispcyl
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 99 26.66667
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 99 26.66667
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 99 27.00000
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 99 43.00000
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 99 45.00000
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 99 37.50000
Change Column Names
To change the names of columns you can either create a new dataframe and copy selected columns from an existing dataframe to the new dataframe, or you can use the colnames() function to update column names without copying.
The function colnames() returns the names of the columns, but it can also be used on the left side of an assignment, so we can change the column names that way. This is illustrated below.
# copy columns 1 through 3 to new dataframe
df <- mtcars[,1:3]
# rename all columns
colnames(df) <- c("mpg.all", "numCylinders", "Displacement.ltr")
# rename a single column
colnames(df)[1] <- "miPerGa"
head(df,3)
## miPerGa numCylinders Displacement.ltr
## Mazda RX4 21.0 6 160
## Mazda RX4 Wag 21.0 6 160
## Datsun 710 22.8 4 108
Adding Rows to a Data Frame
Simply assigning a value to a column in a row that does not (yet) exist causes R to allocate additional memory and append the row. The code below illustrates this:
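A minimal sketch of adding rows follows. The data frame df.scores and its values are invented for this example, and character columns are used so that new text values do not collide with factor levels.

```r
df.scores <- data.frame(state = c("Arizona", "Georgia"),
                        score = c(62, 47),
                        stringsAsFactors = FALSE)
n <- nrow(df.scores)

# assigning to a cell in row n + 1 creates the row; other cells start as NA
df.scores[n + 1, "state"] <- "Texas"
df.scores[n + 1, "score"] <- 85

# a complete row can be assigned in one step as a list
df.scores[nrow(df.scores) + 1, ] <- list("Indiana", 74)

# rbind() appends the rows of another data frame with the same columns
df.scores <- rbind(df.scores,
                   data.frame(state = "Ohio", score = 68,
                              stringsAsFactors = FALSE))
print(df.scores)
```

Growing a data frame one row at a time is convenient for a few rows; for many rows it is usually preferable to build the columns as vectors first and create the data frame once.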
Creating a New Data Frame
Data frames are created in various ways:
- use the data.frame function
- load a CSV, TSV, or other value separated file
- load a simple XML file
- execute a SQL query
- result of running a function
Load a Data Frame from CSV
Loading data into data frames from files, particularly CSV and TSV files, is covered in detail in Lesson 6.106 – Import Data into R from CSV, TSV, and Excel Files, so consult that lesson for more details. This section is a quick overview and summary.
The most commonly used function to load data from a CSV file is read.csv(). It is part of Base R, so no additional packages are required. This function can load a CSV from a local file or from a URL.
The parameter header = F instructs read.csv() not to interpret the first line as header labels. Of course, if there are no labels, then you need to define your own.
df <- read.csv(file = "customertxndata.csv", header = F)
head(df)
## V1 V2 V3 V4 V5
## 1 7 0 Android Male 0.0000
## 2 20 1 iOS <NA> 576.8668
## 3 22 1 iOS Female 850.0000
## 4 24 2 iOS Female 1050.0000
## 5 1 0 Android Male 0.0000
## 6 13 1 Android Male 460.0000
df <- read.csv(file = "customertxndata.csv",
header = F,
col.names = c("numVisits","NumTxn","OS","Gender","TotSp"))
head(df)
## numVisits NumTxn OS Gender TotSp
## 1 7 0 Android Male 0.0000
## 2 20 1 iOS <NA> 576.8668
## 3 22 1 iOS Female 850.0000
## 4 24 2 iOS Female 1050.0000
## 5 1 0 Android Male 0.0000
## 6 13 1 Android Male 460.0000
Note that the value of the Gender column in the second row is NA, which is the way that R indicates a missing data value. It is not 0 or an empty string; it is unknown. So, statistical functions and algebraic operations on it would result in an NA as well.
Aside from CSV files, R can also load a number of other file formats using various packages, including XML, Excel, SPSS, and MATLAB, among many others.
Nota Bene: Capitalization in path and file names does not matter on Windows, but it does matter on Linux and can matter on macOS, depending on how the file system is configured. Furthermore, note that even on Windows the path delimiter in R is a forward slash / and not the usual backslash \. In R strings, the \ is an “escape” character used to insert special characters into a string (text). For example, the text This string contains "quotes". would be written in R as "This string contains \"quotes\".".
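The points about path delimiters and the escape character can be sketched as follows. The directory names in the paths are hypothetical and no file is actually opened; file.path() is the Base R function for assembling paths.

```r
# forward slashes work in R on every platform, including Windows
path.fwd <- "C:/data/customertxndata.csv"      # hypothetical location

# a literal backslash must itself be escaped as \\
path.esc <- "C:\\data\\customertxndata.csv"

# file.path() joins path components with a forward slash
path.built <- file.path("data", "customertxndata.csv")
print(path.built)                  # "data/customertxndata.csv"

# \" places a double quote inside a double-quoted string
s <- "This string contains \"quotes\"."
cat(s, "\n")                       # This string contains "quotes".
print(nchar(s))                    # 30: the backslashes are not stored
```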
Strings vs Factors
The factor data type encodes categorical data, i.e., a variable whose value is one of a fixed set of values. Many statistical functions in R require categorical variables to be of type factor. However, during data processing we often need the actual text rather than having it encoded as a factor (which R stores internally as an integer for efficiency). So, when reading a CSV file you need to decide whether you want text columns to be character strings or factors by setting the stringsAsFactors parameter. Since R 4.0.0 the default is FALSE; in earlier versions it was TRUE.
You may use either F and T or FALSE and TRUE.
df <- read.csv(file = "customertxndata.csv",
header = F,
stringsAsFactors = FALSE,
col.names = c("numVisits","NumTxn","OS","Gender","TotSp"))
head(df)
## numVisits NumTxn OS Gender TotSp
## 1 7 0 Android Male 0.0000
## 2 20 1 iOS <NA> 576.8668
## 3 22 1 iOS Female 850.0000
## 4 24 2 iOS Female 1050.0000
## 5 1 0 Android Male 0.0000
## 6 13 1 Android Male 460.0000
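To make the difference visible without needing the CSV file, the sketch below reads a few made-up lines through the text argument of read.csv() and compares the resulting column types. The sample rows and the names csv.text, cols, df.chr, and df.fct are assumptions for illustration only.

```r
# three made-up lines in the same layout as the file used above
csv.text <- "7,0,Android,Male,0\n20,1,iOS,Female,576.8668\n22,1,iOS,Female,850"
cols <- c("numVisits", "NumTxn", "OS", "Gender", "TotSp")

df.chr <- read.csv(text = csv.text, header = FALSE,
                   col.names = cols, stringsAsFactors = FALSE)
df.fct <- read.csv(text = csv.text, header = FALSE,
                   col.names = cols, stringsAsFactors = TRUE)

print(class(df.chr$OS))        # "character"
print(class(df.fct$OS))        # "factor"

# a factor stores integer codes plus a table of levels
print(levels(df.fct$OS))       # "Android" "iOS"
print(as.integer(df.fct$OS))   # 1 2 2

# as.character() recovers the text from a factor
print(as.character(df.fct$OS))
```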
Create a new Data Frame
The code below creates a new data frame from column vectors. Notice how the column names are the names given to the vectors in the call. A new vector is created with the c function, e.g., v <- c(3,5,1,9).
df1 <- data.frame(state = c('Arizona','Georgia', 'New York','Indiana','Washington','Texas'),
code = as.factor(c('AZ','GA','NY','IN','WA','TX')),
score = c(62,47,55,74,31,85))
head(df1)
## state code score
## 1 Arizona AZ 62
## 2 Georgia GA 47
## 3 New York NY 55
## 4 Indiana IN 74
## 5 Washington WA 31
## 6 Texas TX 85
Search Data Frames
There are two important functions for “searching” data frames: which and any. The code below uses the built-in Orange data frame, which contains measurements of orange trees. It has three columns: the tree, the age of the tree (days since 1968/12/31), and circumference (in mm).
which
## Grouped Data: circumference ~ age | Tree
## Tree age circumference
## 1 1 118 30
## 2 1 484 58
## 3 1 664 87
## 4 1 1004 115
## 5 1 1231 120
## 6 1 1372 142
# find all rows where the circumference is more than 200mm
rs <- which(df$circumference > 200)
# display all rows where the circumference is more than 200mm
df[rs,]
## Grouped Data: circumference ~ age | Tree
## Tree age circumference
## 13 2 1372 203
## 14 2 1582 203
## 27 4 1372 209
## 28 4 1582 214
# compound conditions are possible with & (and), | (or), and ! (not)
rs2 <- which(df$circumference > 200 & df$age < 1500)
rs3 <- which(df$circumference < 200 | !(df$age < 1500))
rs4 <- which(df$circumference > 400 | df$age > 1500)
rs2
## [1] 13 27
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 14 15 16 17 18 19 20 21 22 23 24 25 26 28 29 30 31 32 33 34 35
## [1] 7 14 21 28 35
## [1] 1582
## [1] 894.8788
In the above example, rs <- which(df$circumference > 200) finds all rows in the data frame df where circumference > 200. The row numbers are saved in rs.
any
The any function returns TRUE if at least one element of a logical vector is TRUE and FALSE otherwise. Applied to a condition on a column, it tells you whether any row of the data frame satisfies that condition.
# is there any tree with age > 25?
any(df$age > 25)
## [1] TRUE
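As a further sketch, any() and a logical index can be combined to test a condition first and then retrieve the matching rows. The built-in Orange data is copied into a plain data frame; the name df.orange and the thresholds are chosen for this illustration.

```r
df.orange <- as.data.frame(Orange)

# no tree in the data set is older than 2000 days
old.tree <- any(df.orange$age > 2000)
print(old.tree)                     # FALSE

# at least one measurement exceeds 200 mm
wide.tree <- any(df.orange$circumference > 200)
print(wide.tree)                    # TRUE

# a logical vector can index the rows directly, without which()
wide.rows <- df.orange[df.orange$circumference > 200, ]
print(nrow(wide.rows))              # 4
```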
Memory Management
R is similar to Python and other interpreted languages in terms of memory management. Objects and variables remain in memory until you restart R or explicitly delete them. This can sometimes cause conflicts during development, when leftover objects from an earlier run are picked up unintentionally. Clearing the environment at the start of an R script or an R Notebook ensures that the program runs with an empty memory environment. This matters for languages like R and Python that are typically run in a long-lived interactive session, but it is not needed for languages such as Java and C++ whose programs each run in a separate process.
Use the code below to find and then delete all objects and reclaim memory. The function ls() lists all objects (variables) by name, while rm() removes one or more objects from memory. Finally, the function gc() runs the garbage collector and returns freed memory to the usable memory pool for the process in which R is running.
rm(list = ls(all.names = TRUE))
gc()
Of course, rather than deleting all objects as in the code chunk above, you may wish to release large or unused objects selectively by name, e.g., rm("objName").
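Selective removal can be sketched as follows. The object names big.obj, tmp.a, tmp.b, and small.obj are made up for this example; exists() is used only to confirm that an object is gone.

```r
big.obj   <- runif(1e6)    # a large object that is no longer needed
tmp.a     <- 1
tmp.b     <- 2
small.obj <- 42

# remove a single object by name
rm(big.obj)
print(exists("big.obj"))     # FALSE

# remove several objects by passing their names as a character vector
rm(list = c("tmp.a", "tmp.b"))

# objects that were not named are untouched
print(exists("small.obj"))   # TRUE

# ask R to run the garbage collector; invisible() hides the usage report
invisible(gc())
```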
Conclusion
As you saw, R is not a difficult language to learn: it is similar to other languages, and most language constructs that you are familiar with have an equivalent. But it is important that you go beyond this tutorial and learn the “R way” of programming using vectorized operations.
Tutorial
The video tutorial demonstrates the constructs introduced in this lesson.
