• Introduction
  • Vectors
  • Creating a Vector
  • Size of Vectors
  • Vector Operations
    • Example: Dot Product
  • Functions on Vectors
  • Accessing Elements in a Vector
  • Testing Predicate Expressions
  • Finding Matches
  • Determining Any Matches
  • Accessing Rows, Columns, and Elements (Cells) of a Data Frame
  • Adding Columns to a Data Frame
  • Change Column Names
  • Adding Rows to a Data Frame
  • Creating a New Data Frame
    • Load a Data Frame from CSV
    • Create a new Data Frame
  • Search Data Frames
    • which
    • any
  • Memory Management
  • Conclusion
  • Tutorial
  • Files & Resources
  • References
  • Errata

Introduction

Vectors

In R, a vector is similar to a list or array in other programming languages. It is a collection of elements of the same basic type: numeric, character, or logical (Boolean). As an aside, a list in R is a collection of mixed data types. This section applies to vectors only.

Vectors emerge in different ways in R:

  • result of manually creating them
  • result of a call to a function
  • extraction of elements from a vector
  • column in a data frame

There are no specific packages required to manipulate, use, create, or access vectors.

Creating a Vector

A simple vector is created by combining values with the c() function. Note that all elements in a vector must be of the same basic type.

v.numeric <- c(23, 77, 12, 98, -23, 0)
v.char <- c('R', 'Java', 'Python', 'C++', 'LISP')
v.logical <- c(TRUE, F, FALSE, T, T, T)

When elements are not of the same type, R attempts to coerce (cast or convert) them to the same “higher” data type. For example, in the code below, all elements are coerced to character (i.e., text/string):

v.mixed <- c(23, 0, 'R', TRUE)

print(v.mixed)
## [1] "23"   "0"    "R"    "TRUE"
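
The short sketch below illustrates the order in which coercion appears to proceed: logical values become integers, integers become numeric (double) values, and everything becomes character once a string is present. The values are made up for illustration only.

```r
# coercion moves to the most general type present in the vector:
# logical -> integer -> numeric (double) -> character
class(c(TRUE, 2L))             # "integer"
class(c(TRUE, 2L, 3.5))        # "numeric"
class(c(TRUE, 2L, 3.5, "R"))   # "character"

# TRUE becomes 1 and FALSE becomes 0 when combined with numbers
v.coerced <- c(TRUE, FALSE, 10)
print(v.coerced)               # 1 0 10
```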

The code below shows a more complex example. It creates an artificial vector of random integers for use in the remainder of the tutorial.

# vector of 50 random integers between 0 and 10

# set the seed for the random number generator to ensure same
# sequence of random numbers every time the code is run
set.seed(98788)
v <- round(runif(50, min = 0, max = 10),0)

# arguments do not have to be passed in the order that they are
# declared in the function definition as long as the names of the
# arguments are specified
v <- round(runif(n = 50, max = 10, min = 1),0)
v <- round(runif(max = 10, min = 1, n = 50),0)

print(v)
##  [1]  9  6  9  1  9  8 10  2  6  9  8  7  6  1  7 10  4  3  5  9  7  8  4  9  7  4  6  5  1  4  4  2  7  9  9  6  3  2  3  7  8  6
## [43]  3  9  4 10  7  1  4  2

Size of Vectors

To find the size (length) of a vector, i.e., the number of elements it contains, use the function length().

v <- round(runif(50, min = 0, max = 10),0)

print(length(v))
## [1] 50

Vector Operations

Vectors can be used in arithmetic operations without writing the loops that many other programming languages require. In the example below, every element of the vector v.a is multiplied by 2.5.

v.a <- c(23, 33, 10, 8, 7)

r <- v.a * 2.5

print(r)
## [1] 57.5 82.5 25.0 20.0 17.5

If two vectors are used in an operation, then the operator is applied to each corresponding pair of values.

v.a <- c(23, 33, 10, 8, 7)
v.b <- c(0.3, 0.5, 0.1, 0.9, 0.8)

r <- v.a * v.b

print(r)
## [1]  6.9 16.5  1.0  7.2  5.6

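The vectors should, naturally, be of the same size. If they are not, R recycles (repeats) the elements of the shorter vector to match the longer one, and it issues a warning when the longer length is not a multiple of the shorter length. So, the code below still returns a result, but probably not the intended one, as the vectors do not have the same number of elements.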

v.a <- c(23, 33, 10, 8, 7)
v.b <- c(23, 33, 10, 8, 7, 99)

r <- v.a * v.b
## Warning in v.a * v.b: longer object length is not a multiple of shorter object length
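
To make the recycling behavior concrete, the sketch below multiplies a vector of four elements by a vector of two. Because 4 is a multiple of 2, R recycles silently and gives no warning, which is why a length check can be worthwhile. The vector names are made up for this illustration.

```r
v.long  <- c(1, 2, 3, 4)
v.short <- c(10, 100)

# v.short is reused as 10, 100, 10, 100; no warning is issued
r.recycled <- v.long * v.short
print(r.recycled)   # 10 200 30 400

# a defensive check when equal lengths are expected
same.length <- length(v.long) == length(v.short)
print(same.length)  # FALSE
```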

Of course, we could also write the code explicitly using a loop. Note that in R a loop generally runs significantly slower than the equivalent vector operation and thus should be avoided.

Example: Dot Product

The code below calculates the sum of the products of the two vectors (their dot product) in three ways: first using a loop, then using vector operations, and finally using the %*% operator.

# Approach 1: Using Loops
v.a <- c(23, 33, 10, 8, 7)
v.b <- c(0.3, 0.5, 0.1, 0.9, 0.8)

n <- length(v.a)

r <- 0
for (i in 1:n) {
  r <- r + (v.a[i] * v.b[i])
}

print(paste0("Approach 1 (loop): ", r))
## [1] "Approach 1 (loop): 37.2"
# Approach 2: Vector Operation
v.prod <- v.a * v.b
v.sum <- sum(v.prod)

print(paste0("Approach 2 (vector operation and function): ", v.sum))
## [1] "Approach 2 (vector operation and function): 37.2"
# Approach 3: Operator (%*% returns a 1 x 1 matrix)

dp <- v.a %*% v.b
print(paste0("Approach 3 (dot product operator): ", dp))
## [1] "Approach 3 (dot product operator): 37.2"

The above example illustrates that there is often more than one way to achieve a task in R. Of course, some approaches are more elegant, require less code, and are faster than others. Yet another way to calculate the dot product is the dot() function from the pracma package. We are certain that you can come up with still more.
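
As one more possibility, the sketch below uses crossprod() from Base R, so no additional package is needed. It is offered only as an illustration of an alternative, not as the preferred approach.

```r
v.a <- c(23, 33, 10, 8, 7)
v.b <- c(0.3, 0.5, 0.1, 0.9, 0.8)

# crossprod() computes t(v.a) %*% v.b and returns a 1 x 1 matrix;
# drop() reduces that matrix to a plain number
dp.cross <- drop(crossprod(v.a, v.b))
print(dp.cross)   # 37.2

# the same reduction applied to the %*% result
dp.mat <- drop(v.a %*% v.b)
```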

Functions on Vectors

As seen in previous examples, many functions take vectors as arguments. For example, the function mean() takes a numeric or logical vector as input and returns the average value. For a logical vector, the values of TRUE and FALSE are converted to numbers where TRUE = 1 and FALSE = 0 and then the arithmetic average is calculated.
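
The conversion of logical values to numbers can be seen in the small sketch below, where the vector passed is made up for illustration: sum() counts the TRUE values and mean() gives the proportion of TRUE values.

```r
passed <- c(TRUE, FALSE, TRUE, TRUE)

# TRUE counts as 1 and FALSE as 0
n.passed <- sum(passed)     # 3
p.passed <- mean(passed)    # 0.75

print(paste0(n.passed, " of ", length(passed), " passed (", p.passed * 100, "%)"))
```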

The code below illustrates some common functions. The function round() can be useful for output.

v <- c(2, 6, 8.2, 1.3, 9.4, 45.7, 32, 99, 104,55, 0.05)

n <- length(v)
print(paste0("n = ", n))
## [1] "n = 11"
m <- mean(v)
print(paste0("Mean = ", round(m,2)))
## [1] "Mean = 32.97"
s <- sd(v)
print(paste0("StdDev = ", round(s,2)))
## [1] "StdDev = 38.72"
d <- median(v)
print(paste0("Median = ", round(d,2)))
## [1] "Median = 9.4"
r <- max(v) - min(v)
print(paste0("Range = ", round(r,2)))
## [1] "Range = 103.95"
tm <- mean(v, trim = 0.1)
print(paste0("10% Trimmed Mean = ", round(tm,2)))
## [1] "10% Trimmed Mean = 28.73"

Accessing Elements in a Vector

Elements are accessed positionally, although in R, the access index can be a vector of integers in which case all elements at those positions are retrieved. Positions are numbered from 1 to the number of elements in a vector. The number of elements (or length) of a vector can be obtained using the length() function.

In the example below, note that n:m generates a vector of integers from n to m, inclusive. The seq() function generates a vector that is a sequence of numbers at a fixed interval.

print(v)
##  [1]   2.00   6.00   8.20   1.30   9.40  45.70  32.00  99.00 104.00  55.00   0.05
# access a single element at position 3
v[3]
## [1] 8.2
# access element 3 through 5
v[3:5]
## [1] 8.2 1.3 9.4
# access the first element
v[1]
## [1] 2
# access the last element
v[length(v)]
## [1] 0.05
# access the next to last element
v[length(v)-1]
## [1] 55
# access every other element
v[seq(from = 1, to = length(v), by = 2)]
## [1]   2.00   8.20   9.40  32.00 104.00   0.05
# access specific elements at positions 1, 5, and 7
i <- c(1,5,7)
v[i]
## [1]  2.0  9.4 32.0
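
The index vectors used above can also be inspected on their own. The sketch below shows what n:m and seq() produce, along with a common pitfall: 1:length(x) counts down to 0 when x is empty, whereas seq_along(x) yields an empty index vector.

```r
3:7                                # 3 4 5 6 7
7:3                                # 7 6 5 4 3 (counts down)
seq(from = 2, to = 10, by = 2)     # 2 4 6 8 10
seq(from = 0, to = 1, by = 0.25)   # 0.00 0.25 0.50 0.75 1.00

# pitfall with empty vectors
x.empty <- numeric(0)
idx.colon <- 1:length(x.empty)     # 1 0, which is usually not what is wanted
idx.safe  <- seq_along(x.empty)    # integer(0), i.e., no positions at all
```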

Testing Predicate Expressions

It is possible in R to apply a predicate expression to every element in a vector. This generates a “Boolean vector” of TRUE/FALSE values that indicate which element matches the predicate expression (TRUE) and which doesn’t (FALSE).

Predicate expressions are built with relational operators (<, >, <=, >=, ==, !=) and can be combined with the logical operators & (and), | (or), and ! (not).

v < 5
##  [1]  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
(v < 1 | v > 9)
##  [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
(v <= 7 & v != 3)
##  [1]  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
v != 5
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
l <- (v == 5)
print(l)
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Finding Matches

The which() function returns the positions that are TRUE in a Boolean vector.

# returns positions of vector that matches predicate expression
which(v != 5)
##  [1]  1  2  3  4  5  6  7  8  9 10 11
# count the number of matches
length(which(v != 0))
## [1] 11
p <- which(v < 5)
print (v[p])
## [1] 2.00 1.30 0.05
# or combine
x <- v[which(v < 5)]
print (x)
## [1] 2.00 1.30 0.05
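
A Boolean vector can also be used directly as an index, without which(). The sketch below shows this alternative and one difference that may matter in practice: when the vector contains NA, logical indexing keeps an NA in the result while which() drops it. The vector w is made up for illustration.

```r
v <- c(2, 6, 8.2, 1.3, 9.4, 45.7, 32, 99, 104, 55, 0.05)

# logical indexing: keep the elements where the predicate is TRUE
x.logical <- v[v < 5]        # 2.00 1.30 0.05

# counting matches: each TRUE counts as 1
n.match <- sum(v < 5)        # 3

# difference in the presence of missing values
w <- c(2, NA, 7)
w.logical <- w[w < 5]         # 2 NA
w.which   <- w[which(w < 5)]  # 2
```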

Determining Any Matches

To determine if there are any matches, i.e., at least one element in a vector matches the predicate expression, use the any() function. The function any() returns TRUE if there’s at least one match, FALSE otherwise.

any(v < 5)
## [1] TRUE
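
The companion function all() in Base R returns TRUE only if every element matches. The sketch below contrasts the two on the same vector used above.

```r
v <- c(2, 6, 8.2, 1.3, 9.4, 45.7, 32, 99, 104, 55, 0.05)

any(v > 100)   # TRUE, because of the value 104
any(v < 0)     # FALSE, no negative values
all(v > 0)     # TRUE, every value is positive
all(v < 100)   # FALSE, 104 is not below 100
```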

Accessing Rows, Columns, and Elements (Cells) of a Data Frame

Data frames are very similar to tables in relational databases and spreadsheets. They have rows and columns and the intersection of a row and column is a cell (or element). The order of access is row followed by column, e.g., the third element in the fourth row of the data frame mtcars is mtcars[4,3].

Every column in a data frame is a vector, but every row is a data frame with a single row.

Note that the row-then-column order of access is reversed from the way Excel and other spreadsheets work, where a cell reference such as C4 names the column first.

The example code below uses the built-in data frame mtcars. You can find out more about its structure using str(mtcars) or display the first few rows with head(mtcars). It is also often useful to restrict the columns in the output. Note that the output below comes from a session in which mtcars had already been modified: the column mpg was renamed to miPerGa and a column foo was added. In a fresh R session, mtcars has 11 variables and its first column is named mpg.

The function str() should generally not be included in an R program or R Notebook; it is best used to explore data interactively in the R console. Its output is lengthy and generally not useful to those looking at the output of an R program or a knitted R Notebook.

str(mtcars)
## 'data.frame':    32 obs. of  12 variables:
##  $ miPerGa: num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl    : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp   : num  160 160 108 258 360 ...
##  $ hp     : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat   : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt     : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec   : num  16.5 17 18.6 19.4 17 ...
##  $ vs     : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am     : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear   : num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb   : num  4 4 1 1 2 1 4 2 2 4 ...
##  $ foo    : num  99 99 99 99 99 99 99 99 99 99 ...
head(mtcars[1:4], 3)
##               miPerGa cyl disp  hp
## Mazda RX4        21.0   6  160 110
## Mazda RX4 Wag    21.0   6  160 110
## Datsun 710       22.8   4  108  93
# both <- and = assign a value; <- is the conventional choice in R
v <- mtcars[4,3]
x = mtcars[4,3]

print(paste0("v = ",v," and x = ",x))
## [1] "v = 258 and x = 258"

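Leaving out a dimension (row or column) accesses the entire row or column. Leaving out the column, as in mtcars[4,], returns a data frame with a single row, while leaving out the row, as in mtcars[,2], returns the column as a vector.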
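
The difference in the type of the result can be checked directly. The sketch below uses the built-in mtcars data frame and also shows the drop = FALSE argument, which keeps a single column as a data frame instead of simplifying it to a vector.

```r
row4 <- mtcars[4, ]    # entire row: a data frame with one row
col3 <- mtcars[, 3]    # entire column: a numeric vector
cell <- mtcars[4, 3]   # a single value

is.data.frame(row4)    # TRUE
is.data.frame(col3)    # FALSE
is.numeric(col3)       # TRUE

# keep the column as a one-column data frame
col3.df <- mtcars[, 3, drop = FALSE]
is.data.frame(col3.df) # TRUE
```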

Often the values must be converted to a vector data type. Conversion of a variable from one type to another is done with the family of as.xxxx functions, e.g., as.vector, as.numeric, or as.factor. Vectors can contain numeric, character, or logical data, but all elements must be of the same type. In R, a list is similar to a vector but it may contain a mix of element types. A matrix is similar to a data frame, but all of its elements must be of the same type and it always has exactly two dimensions; an array generalizes a matrix to more than two dimensions.
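
A few conversions are sketched below with made-up values. The last lines show one way to turn a one-row data frame into a plain vector using unlist(); this is only one of several possible approaches.

```r
num <- as.numeric("3.14")     # 3.14
chr <- as.character(42)       # "42"
int <- as.integer(TRUE)       # 1

# text that is not a number converts to NA (with a warning)
bad <- suppressWarnings(as.numeric("abc"))   # NA

# a factor stores its categories as levels
f <- as.factor(c("low", "high", "low"))
levels(f)                     # "high" "low"

# a one-row data frame flattened into a named numeric vector
row.vec <- unlist(mtcars[4, ])
is.data.frame(row.vec)        # FALSE
is.numeric(row.vec)           # TRUE
```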

Some functions expect data frames, some vectors, some lists. You need to read the documentation of a function to find out. Furthermore, some functions will automatically convert a variable from one type to the one it requires.

You can also access a column in a data frame by its column name. For an entire column, use either the column's position or its name: df[,column] or df$columnName.

# all of row 4; the result is a data frame
r <- mtcars[4,]
sum(r)
## [1] 525.135
# all of row 3; column 3 of that one-row data frame is a single value
r3 <- mtcars[3,]
r3[1,3]
## [1] 108
mtcars[c(1,4)]   # columns 1 and 4 as a new dataframe
##                     miPerGa  hp
## Mazda RX4              21.0 110
## Mazda RX4 Wag          21.0 110
## Datsun 710             22.8  93
## Hornet 4 Drive         21.4 110
## Hornet Sportabout      18.7 175
## Valiant                18.1 105
## Duster 360             14.3 245
## Merc 240D              24.4  62
## Merc 230               22.8  95
## Merc 280               19.2 123
## Merc 280C              17.8 123
## Merc 450SE             16.4 180
## Merc 450SL             17.3 180
## Merc 450SLC            15.2 180
## Cadillac Fleetwood     10.4 205
## Lincoln Continental    10.4 215
## Chrysler Imperial      14.7 230
## Fiat 128               32.4  66
## Honda Civic            30.4  52
## Toyota Corolla         33.9  65
## Toyota Corona          21.5  97
## Dodge Challenger       15.5 150
## AMC Javelin            15.2 150
## Camaro Z28             13.3 245
## Pontiac Firebird       19.2 175
## Fiat X1-9              27.3  66
## Porsche 914-2          26.0  91
## Lotus Europa           30.4 113
## Ford Pantera L         15.8 264
## Ferrari Dino           19.7 175
## Maserati Bora          15.0 335
## Volvo 142E             21.4 109
mtcars[,2]       # all of column 2
##  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
mtcars[5:7,]     # rows 5 to 7 as a new dataframe
##                   miPerGa cyl disp  hp drat   wt  qsec vs am gear carb foo
## Hornet Sportabout    18.7   8  360 175 3.15 3.44 17.02  0  0    3    2  99
## Valiant              18.1   6  225 105 2.76 3.46 20.22  1  0    3    1  99
## Duster 360           14.3   8  360 245 3.21 3.57 15.84  0  0    3    4  99
mtcars$cyl       # column named "cyl"
##  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
mtcars$cyl[2]    # 2nd row in the column "cyl"
## [1] 6
mtcars$cyl[3:9]  # rows 3 to 9 for column "cyl" as a vector
## [1] 4 6 8 6 8 4 4
# in this session the column mpg was renamed to miPerGa (see the str() output
# above), so mtcars$mpg would be NULL; use the current column name instead
w <- mtcars$miPerGa
mean(w)
## [1] 20.09062
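
The different ways of reaching the same column can be compared directly, as in the sketch below, which assumes the unmodified built-in mtcars of a fresh R session. It also shows that $ silently returns NULL for a column name that does not exist; the name noSuchColumn is made up for this illustration.

```r
by.dollar   <- mtcars$cyl
by.double   <- mtcars[["cyl"]]
by.name     <- mtcars[, "cyl"]
by.position <- mtcars[, 2]

# all four return the same numeric vector
identical(by.dollar, by.double)     # TRUE
identical(by.dollar, by.name)       # TRUE
identical(by.dollar, by.position)   # TRUE

# single brackets with only a name return a data frame, not a vector
class(mtcars["cyl"])                # "data.frame"

# a misspelled or missing column name gives NULL rather than an error
missing.col <- mtcars$noSuchColumn
is.null(missing.col)                # TRUE
```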

Adding Columns to a Data Frame

To add a new column, simply assign values to a column name that does not yet exist in the data frame. Note in the example below that you can operate on entire columns (as vectors) and that the operation is applied to each pair of values in the two vectors. This is much more efficient than using loops, as is necessary in many other programming languages.

# copy the data frame mtcars to a new data frame df
df <- mtcars

# create a new column "dispcyl" which is the displacement per cylinder
df$dispcyl <- df$disp / df$cyl

head(df)
##                   miPerGa cyl disp  hp drat    wt  qsec vs am gear carb foo  dispcyl
## Mazda RX4            21.0   6  160 110 3.90 2.620 16.46  0  1    4    4  99 26.66667
## Mazda RX4 Wag        21.0   6  160 110 3.90 2.875 17.02  0  1    4    4  99 26.66667
## Datsun 710           22.8   4  108  93 3.85 2.320 18.61  1  1    4    1  99 27.00000
## Hornet 4 Drive       21.4   6  258 110 3.08 3.215 19.44  1  0    3    1  99 43.00000
## Hornet Sportabout    18.7   8  360 175 3.15 3.440 17.02  0  0    3    2  99 45.00000
## Valiant              18.1   6  225 105 2.76 3.460 20.22  1  0    3    1  99 37.50000
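
The same idea is sketched below on a small, made-up data frame so that each result is easy to verify by hand. It also shows that a column can be added with [[ ]] and that a single value is recycled to fill every row.

```r
# a small hypothetical data frame
cars <- data.frame(model = c("A", "B", "C"),
                   disp  = c(160, 108, 360),
                   cyl   = c(6, 4, 8))

# new column computed from two existing columns
cars$dispcyl <- cars$disp / cars$cyl

# new column from a predicate expression, added with [[ ]]
cars[["bigEngine"]] <- cars$disp > 200

# a single value is recycled to all rows
cars$source <- "demo"

print(cars)
```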

Change Column Names

To change the names of columns, you can either create a new data frame and copy selected columns from an existing data frame into it, or you can use the colnames() function to update column names without copying.

The function colnames() returns the names of the columns, but it can also be used on the left side of an assignment and so we can change the column names that way. This is illustrated below.

# copy columns 1 through 3 to new dataframe
df <- mtcars[,1:3]

# rename all columns
colnames(df) <- c("mpg.all", "numCylinders", "Displacement.ltr")

# rename a single column
colnames(df)[1] <- "miPerGa"

head(df,3)
##               miPerGa numCylinders Displacement.ltr
## Mazda RX4        21.0            6              160
## Mazda RX4 Wag    21.0            6              160
## Datsun 710       22.8            4              108
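
Renaming by position can break when the column order changes. A possible alternative, sketched below on a small made-up data frame, is to locate the column by its current name.

```r
df.small <- data.frame(mpg = c(21, 22.8), cyl = c(6, 4))

# find the column called "mpg" and rename only that one
colnames(df.small)[colnames(df.small) == "mpg"] <- "miPerGa"

colnames(df.small)   # "miPerGa" "cyl"
```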

Adding Rows to a Data Frame

Simply assigning a value to a column in a row that does not (yet) exist causes R to append that row to the data frame; the remaining columns of the new row are filled with NA. The code below illustrates this:

df[nrow(df)+1,3] <- 34
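
The effect is easier to see on a small data frame. The sketch below, which uses made-up data, appends one row by assignment and another with rbind(), which takes a complete row and so leaves no NA values behind.

```r
scores <- data.frame(name  = c("a", "b"),
                     score = c(10, 20),
                     stringsAsFactors = FALSE)

# assign into a row that does not exist yet: row 3 is created,
# and the column that was not assigned (name) is NA
scores[nrow(scores) + 1, "score"] <- 30

# append a complete row with rbind()
scores <- rbind(scores, data.frame(name = "d", score = 40))

print(scores)
```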

Creating a New Data Frame

Data frames are created in various ways:

  1. use the data.frame function
  2. load a CSV, TSV, or other value separated file
  3. load a simple XML file
  4. execute a SQL query
  5. result of running a function

Load a Data Frame from CSV

Loading data into data frames from files, particularly CSV and TSV files, is covered in detail in Lesson 6.106 – Import Data into R from CSV, TSV, and Excel Files, so consult that lesson for more details. This section is a quick overview and summary.

The most commonly used function to load data from CSV files is read.csv(). It is part of Base R, so no additional packages are required. This function can load a CSV from a local file or from a URL.

The parameter header = F instructs read.csv() not to interpret the first line as header labels. Of course, if there are no labels, then you need to define your own.

df <- read.csv(file = "customertxndata.csv", header = F)
head(df)
##   V1 V2      V3     V4        V5
## 1  7  0 Android   Male    0.0000
## 2 20  1     iOS   <NA>  576.8668
## 3 22  1     iOS Female  850.0000
## 4 24  2     iOS Female 1050.0000
## 5  1  0 Android   Male    0.0000
## 6 13  1 Android   Male  460.0000
df <- read.csv(file = "customertxndata.csv", 
               header = F,
               col.names = c("numVisits","NumTxn","OS","Gender","TotSp"))
head(df)
##   numVisits NumTxn      OS Gender     TotSp
## 1         7      0 Android   Male    0.0000
## 2        20      1     iOS   <NA>  576.8668
## 3        22      1     iOS Female  850.0000
## 4        24      2     iOS Female 1050.0000
## 5         1      0 Android   Male    0.0000
## 6        13      1 Android   Male  460.0000

Note that the value of the Gender column in the second row is NA, which is the way that R indicates a missing data value. It is not 0 or an empty string; it is unknown. So, statistical functions and arithmetic operations that involve it result in NA as well, unless the missing values are explicitly excluded, e.g., with the argument na.rm = TRUE.
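
The sketch below shows how NA propagates through a calculation and how it can be detected or excluded. The vector is made up for illustration.

```r
spend <- c(0, NA, 850, 1050)

mean(spend)                   # NA, because one value is unknown
mean(spend, na.rm = TRUE)     # 633.3333, the mean of the three known values

# detecting and counting missing values
is.na(spend)                  # FALSE TRUE FALSE FALSE
n.missing <- sum(is.na(spend))

# comparing with == does not work for NA; the result is itself NA
NA == NA                      # NA
```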

Aside from CSV files, R can also load a number of other file formats using various packages, including XML, Excel, SPSS, and MATLAB, among many others.

Nota Bene: Capitalization in path and file names does not matter on Windows, but does matter on Linux and, depending on how the file system is formatted, on macOS. Furthermore, note that even on Windows the path delimiter in R is a forward slash / and not the usual backslash \. The \ is an “escape” character that is used to place special or non-printable characters into a string (text), so a literal backslash in a path must be doubled (\\). For example, the text This string contains "quotes". would be written in R as "This string contains \"quotes\"."
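
The sketch below illustrates escaped quotes and the two ways of writing a Windows path in R. The folder names are hypothetical, and file.path() is shown as one way to build a path from its parts without typing the delimiter.

```r
# escaped double quotes inside a double-quoted string
s <- "This string contains \"quotes\"."
cat(s, "\n")     # This string contains "quotes".

# two ways of writing the same (hypothetical) Windows path
p.forward <- "C:/data/customertxndata.csv"
p.escaped <- "C:\\data\\customertxndata.csv"   # each \\ is one backslash

# build a path from its parts; file.path() uses / as the delimiter
p.built <- file.path("data", "customertxndata.csv")
print(p.built)   # "data/customertxndata.csv"
```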

Strings vs Factors

The factor data type encodes categorical data, i.e., a variable whose value is one of a fixed set of values. Many statistical functions in R require categorical variables to be of type factor. However, often, during data processing, we need the actual text rather than having it encoded as a factor (which is actually stored in R as an integer for efficiency). So, when reading a CSV file you need to decide if you want text columns to be character strings or factors by setting the stringsAsFactors parameter. Since R 4.0.0 the default is stringsAsFactors = FALSE; in earlier versions it was TRUE.

You may use either F and T or FALSE and TRUE.

df <- read.csv(file = "customertxndata.csv", 
               header = F,
               stringsAsFactors = FALSE,
               col.names = c("numVisits","NumTxn","OS","Gender","TotSp"))
head(df)
##   numVisits NumTxn      OS Gender     TotSp
## 1         7      0 Android   Male    0.0000
## 2        20      1     iOS   <NA>  576.8668
## 3        22      1     iOS Female  850.0000
## 4        24      2     iOS Female 1050.0000
## 5         1      0 Android   Male    0.0000
## 6        13      1 Android   Male  460.0000
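
The effect of stringsAsFactors can be seen without a file by passing the CSV content through the text argument of read.csv(). The three rows below are written out by hand in the style of the file used above and are an assumption made only for this illustration.

```r
csv.text <- "7,0,Android,Male,0\n22,1,iOS,Female,850\n1,0,Android,Male,0"
cols <- c("numVisits", "NumTxn", "OS", "Gender", "TotSp")

df.chr <- read.csv(text = csv.text, header = FALSE,
                   stringsAsFactors = FALSE, col.names = cols)
df.fct <- read.csv(text = csv.text, header = FALSE,
                   stringsAsFactors = TRUE, col.names = cols)

class(df.chr$OS)        # "character"
class(df.fct$OS)        # "factor"

# a factor keeps its categories as levels and stores integer codes
levels(df.fct$OS)       # "Android" "iOS"
as.integer(df.fct$OS)   # 1 2 1

# the text can always be recovered
as.character(df.fct$OS) # "Android" "iOS" "Android"
```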

Create a new Data Frame

The code below creates a new data frame from column vectors. Notice how the column names are the names given to the vectors in the call to data.frame(). A new vector is created with the c function, e.g., v <- c(3,5,1,9).

df1 <- data.frame(state = c('Arizona','Georgia', 'New York','Indiana','Washington','Texas'),
                  code = as.factor(c('AZ','GA','NY','IN','WA','TX')),
                  score = c(62,47,55,74,31,85))

head(df1)
##        state code score
## 1    Arizona   AZ    62
## 2    Georgia   GA    47
## 3   New York   NY    55
## 4    Indiana   IN    74
## 5 Washington   WA    31
## 6      Texas   TX    85
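
When the vectors already exist as variables, they can be passed to data.frame() without naming them, in which case the variable names become the column names. The sketch below shows both forms; the names st and pts are made up for illustration.

```r
state <- c("Arizona", "Georgia", "New York")
score <- c(62, 47, 55)

# unnamed arguments: column names are taken from the variable names
df2 <- data.frame(state, score)
colnames(df2)   # "state" "score"

# named arguments: the given names are used instead
df3 <- data.frame(st = state, pts = score)
colnames(df3)   # "st" "pts"

# dimensions of the result
nrow(df2)       # 3
ncol(df2)       # 2
```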

Search Data Frames

There are two important functions for “searching” data frames: which and any. The code below uses the built-in Orange data frame which contains measurements of orange trees. It has three columns: the tree, the age of the tree (days since 1968/12/31), and circumference (in mm).

which

df <- Orange

head(df)
## Grouped Data: circumference ~ age | Tree
##   Tree  age circumference
## 1    1  118            30
## 2    1  484            58
## 3    1  664            87
## 4    1 1004           115
## 5    1 1231           120
## 6    1 1372           142
# find all rows where the circumference is more than 200mm
rs <- which(df$circumference > 200)

# display all rows where the circumference is more than 200mm
df[rs,]
## Grouped Data: circumference ~ age | Tree
##    Tree  age circumference
## 13    2 1372           203
## 14    2 1582           203
## 27    4 1372           209
## 28    4 1582           214
# compound conditions are possible with & (and), | (or), and ! (not)
rs2 <- which(df$circumference > 200 & df$age < 1500)
rs3 <- which(df$circumference < 200 | !(df$age < 1500))
rs4 <- which(df$circumference > 400 | df$age > 1500)

rs2
## [1] 13 27
rs3
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 14 15 16 17 18 19 20 21 22 23 24 25 26 28 29 30 31 32 33 34 35
rs4
## [1]  7 14 21 28 35
mean(df[rs4,2])
## [1] 1582
mean(df$age[rs3])
## [1] 894.8788

In the above example rs <- which(df$circumference > 200) finds all rows in the data frame df where circumference > 200. The row numbers are saved in rs.
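
The same rows can be selected in other ways. The sketch below compares which() with direct logical indexing and with the subset() function from Base R. It first converts Orange to a plain data frame, which is a choice made here only to keep the printed output simple.

```r
orange <- as.data.frame(Orange)

# three ways to select rows where the circumference exceeds 200 mm
big.which   <- orange[which(orange$circumference > 200), ]
big.logical <- orange[orange$circumference > 200, ]
big.subset  <- subset(orange, circumference > 200)

nrow(big.which)         # 4
rownames(big.logical)   # "13" "14" "27" "28"
```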

any

The any function returns TRUE if at least one value in a column (or any other logical vector) satisfies a Boolean expression and FALSE otherwise.

# is there any measurement taken at an age of more than 25 days?
any(df$age > 25)
## [1] TRUE
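
A few further checks on the Orange data frame are sketched below, including a case where any() returns FALSE.

```r
# no measurement was taken after more than 2000 days
old.tree <- any(Orange$age > 2000)             # FALSE

# at least one measurement exceeds 200 mm
big.tree <- any(Orange$circumference > 200)    # TRUE

# all() checks that every value matches
all.aged <- all(Orange$age >= 118)             # TRUE
```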

Memory Management

R is similar to Python and other interpreted languages in terms of memory management: within a session, objects and variables remain in memory until you restart R or explicitly delete them. Objects left over from earlier runs can sometimes cause conflicts during development. Adding the code below to the start of an R script or an R Notebook ensures that the program runs with an empty environment. This matters for languages like R and Python, where code is often run repeatedly within one long-lived session, but is not needed for programs that start in a fresh process on every run, such as Java and C++ programs.

Use the code below to find and then delete all objects, and reclaim memory. The function ls() lists all objects (variables) by name, while the rm() removes one or more objects from memory. Finally, the function gc() runs the garbage collector and returns freed memory to the usable memory pool for the process in which R is running.

rm(list = ls(all.names = TRUE))
gc()

Of course, rather than deleting all objects as in the code chunk above, you may wish to release large objects or unused objects selectively by their name, e.g., rm(objName).
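
A selective removal might look like the sketch below. The object name big.obj is made up for illustration, and object.size() is used only to show roughly how much memory the object occupies before it is removed.

```r
# a large hypothetical object: one million numeric values
big.obj <- numeric(1e6)
print(object.size(big.obj), units = "Mb")

# remove only this object and confirm that it is gone
rm(big.obj)
still.there <- exists("big.obj", inherits = FALSE)
print(still.there)   # FALSE

# run the garbage collector without printing its summary table
invisible(gc())
```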

Conclusion

As you saw, R is not a difficult language to learn: it is similar to other languages, and most language constructs that you are familiar with have an equivalent in R. But it is important that you go beyond this tutorial and learn the “R way” of programming using vectorized operations.

Tutorial

The video tutorial demonstrates the constructs introduced in this lesson.


Files & Resources

All Files for Lesson 6.103

