Preface
This tutorial presumes that you have R and R Studio installed, or that you have an account on rstudio.cloud. If you do not already have R and/or R Studio you will need to download and install them. You must first install R from R Project and then the R Studio IDE from R Studio.
In this R Markdown Notebook we demonstrate how to parse, process, search, and work with text (also known as character strings). When you execute code within the notebook, the results appear beneath the code.
Basic Text Functionality
The functions below are part of “Base R” and do not require any additional packages, although in practice many R programmers prefer string processing packages such as stringr.
paste: glue (concatenate) text and numeric values together
substr and substring: extract or replace substrings in a character vector
grep: use regular expressions to deal with patterns of text
regexpr: find matching pattern using regular expressions
sub and gsub: replace parts of a string
strsplit: split strings
nchar: find number of characters in a string
as.numeric: convert a string to numeric if feasible
strtoi: convert a string to integer if it can be (faster than as.integer)
A common issue in R is that functions often return a list object when one would expect a vector. This sometimes requires an extra step or two of processing that ought not to be necessary, so be prepared for some extra code gymnastics using unlist.
Dates are not always text in R; they are more likely of a Date or DateTime type and require different functions for processing.
Text in R
Text in R consists of character strings. The data type (class) of a character string is “character”. There is no need to declare that a variable is of a string type; anything enclosed in either single quotes (') or double quotes (") is automatically a string. The fact that both can be used is useful when you need to embed quotes or apostrophes within strings. You can also “escape” special characters by prefixing them with a backslash (\).
Displaying (or “printing”) text is done either by using the variable name by itself or by using the print function.
s <- "this is a text or character string"
w <- 'and this is also text'
u <- 'and this is text with "double quotes"'
q <- "and this is a quote's quote"
t <- "text with escaped \"double quotes\" inside a string defined with \""
u
## [1] "and this is text with \"double quotes\""
q
## [1] "and this is a quote's quote"
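The difference between how a string is stored and how it is displayed can be seen in a short sketch. The variable name quoted and its value are made up for illustration; class, nchar, print, and cat are all Base R functions.

```r
# illustrative string with escaped double quotes (hypothetical value)
quoted <- "she said \"hello\""

# the class of any string is "character"
quoted_class <- class(quoted)
quoted_class

# the backslashes are escapes in the source code, not part of the string,
# so they do not count toward its length
quoted_length <- nchar(quoted)
quoted_length

# print() shows the escaped form surrounded by quotes
print(quoted)

# cat() writes the raw text without quotes or escapes
cat(quoted, "\n", sep = "")
```

Use cat rather than print when you want to see the text exactly as it would appear in a file or report.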
Convert Strings to Numbers
Use the coercion functions of the form as.<type> (such as as.numeric, as.integer, as.logical, or as.Date) to convert (coerce) strings to other data types, most commonly Booleans, dates, and numbers. Converting to an integer does not round; it simply drops the fractional part (it “truncates”). The function strtoi is a faster way to convert strings to integers, assuming that there is no fractional part.
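s1 <- "123.99"
n1 <- as.numeric(s1)
n1
## [1] 123.99
class(n1)
## [1] "numeric"
# converting to an integer truncates the fractional part
n2 <- as.integer(s1)
n2
## [1] 123
class(n2)
## [1] "integer"
# strtoi expects a string without a fractional part
strtoi("2234")
## [1] 2234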
Note that for conversions to work, the text cannot contain any non-numeric characters. The dot (.) is allowed as a decimal separator, but the comma (,) is not recognized as a thousands separator. If the text cannot be converted, the result is NA and R issues the warning “NAs introduced by coercion”.
# the following would not work
e <- as.numeric("$123.99") # contains $
## Warning: NAs introduced by coercion
e <- as.numeric("123,99") # comma not recognized
## Warning: NAs introduced by coercion
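One way around such failures is to strip the offending characters before coercing. The sketch below uses made-up input values and removes every dollar sign and comma with gsub; it assumes the comma is only ever a thousands separator, which would not hold for data that uses the comma as a decimal mark.

```r
# hypothetical input values, chosen for illustration
raw <- c("$123.99", "1,250.50", "$2,000")

# inside a bracket expression, $ and , are matched literally;
# gsub removes every occurrence of either character
cleaned <- gsub("[$,]", "", raw)
cleaned

# the cleaned strings now coerce without warnings
amounts <- as.numeric(cleaned)
amounts
```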
Convert Strings to Dates
Dates are a separate data type in R. A date object can be used in “date aware” calculations. Note that strings must be in the format YYYY-MM-DD or YYYY/MM/DD unless the format parameter is specified. The lubridate package has many more functions for converting from various other formats. The strings are converted to the standard format YYYY-MM-DD.
To get the current date from the system’s clock, use Sys.Date(). To get the current date and time as a character string, use date().
s2 <- "2021-02-15"
d1 <- as.Date(s2)
d1
## [1] "2021-02-15"
class(d1)
## [1] "Date"
s3 <- "2019/06/30"
d2 <- as.Date(s3)
d2
## [1] "2019-06-30"
as.Date("07/04/20", format = "%m/%d/%y")
## [1] "2020-07-04"
as.Date("JUL 04 2021", format = "%b %d %Y")
## [1] "2021-07-04"
as.Date("February 11 80", format = "%B %d %y")
## [1] "1980-02-11"
# get current date and time
date()
## [1] "Wed Feb 14 09:47:19 2024"
# get current date only as YYYY-MM-DD
Sys.Date()
## [1] "2024-02-14"
Here are the various date format specifications:
| Code | Meaning | Example |
| --- | --- | --- |
| %d | day as a number (01-31) | 06 |
| %a | abbreviated weekday | Mon |
| %A | full weekday | Monday |
| %m | month as a number (01-12) | 11 |
| %b | abbreviated month | Nov |
| %B | full month | March |
| %y | two-digit year | 20 |
| %Y | four-digit year | 2021 |
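A brief sketch of what “date aware” calculations look like in practice. The variable names and dates are illustrative; the example uses only Base R (as.Date, subtraction and addition on Date objects, and format), and it sticks to numeric format codes so that the result does not depend on the system locale.

```r
# two illustrative dates, one parsed with an explicit format
d_start <- as.Date("2021-02-15")
d_end   <- as.Date("07/04/21", format = "%m/%d/%y")

# subtracting two dates gives a time difference in days
days_between <- as.numeric(d_end - d_start)
days_between

# adding a number shifts a date by that many days
d_later <- d_start + 30
d_later

# format() converts a Date back into a string in the requested layout
us_style <- format(d_start, "%m/%d/%Y")
us_style
```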
String Concatenation
Since print displays only a single object, you need to concatenate (i.e., combine) multiple strings using either paste or paste0. The former inserts a space between the strings, while the latter inserts nothing. paste also allows you to specify a separator other than a space.
s <- "this is a text or character string"
w <- 'and this is also text'
r <- paste(s, w)
t <- paste(s, w, sep = "/")
print(r)
## [1] "this is a text or character string and this is also text"
print(t)
## [1] "this is a text or character string/and this is also text"
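The following sketch contrasts paste, paste0, the sep argument, and the collapse argument. The variable names and values are illustrative.

```r
first <- "Fred"
last  <- "Flintstone"

# paste separates its arguments with a space by default
with_space <- paste(first, last)
with_space

# paste0 uses no separator at all
no_space <- paste0(first, last)
no_space

# sep sets the separator placed between the arguments
last_first <- paste(last, first, sep = ", ")
last_first

# collapse joins the elements of a single vector into one string
states <- c("MA", "CT", "RI")
joined <- paste(states, collapse = "; ")
joined

# paste and paste0 are vectorized: one result per element
ids <- paste0("ID-", 1:3)
ids
```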
Replacing Parts of a String
The sub function is used to replace parts of a string with new text. Note that it replaces only the first occurrence, while the function gsub replaces all occurrences. Both functions return the new string.
s <- "Boston, MA 02115"
s <- sub("02115","02117",s)
print(s)
## [1] "Boston, MA 02117"
The replacement text does not have to be of the same length as the replaced text.
s <- "Boston, MA 02125"
s <- sub("Boston","Charlestown",s)
print(s)
## [1] "Charlestown, MA 02125"
The patterns passed to sub and gsub can also be regular expressions. More on regular expressions below.
s <- c("Charlestown, MA 02121","Brighton, MA 02137")
s <- sub("Charlestown|Brighton","Boston",s)
print(s)
## [1] "Boston, MA 02121" "Boston, MA 02137"
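To make the difference between sub and gsub concrete, here is a small sketch on a made-up phone number. It also shows the fixed = TRUE argument, which treats the pattern as literal text instead of a regular expression.

```r
# illustrative value
phone <- "617-555-0100"

# sub replaces only the first hyphen
first_only <- sub("-", ".", phone)
first_only

# gsub replaces every hyphen
all_hits <- gsub("-", ".", phone)
all_hits

# a dot is a regular expression wildcard, so use fixed = TRUE
# to match it literally when converting back
restored <- gsub(".", "-", all_hits, fixed = TRUE)
restored
```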
Splitting a String into Tokens
Splitting a string into tokens is done with the strsplit function.
s <- "MA,CT,RI,NH,CA,FL"
# split based on comma ,
states <- strsplit(s,",")
print(states)
## [[1]]
## [1] "MA" "CT" "RI" "NH" "CA" "FL"
The function returns a “list” rather than a “vector”, so we often use the function unlist() to convert the list to a vector. We can then use vector access to get to specific elements.
s <- "MA,CT,RI,NH,CA,FL"
# split based on comma ,
states <- unlist(strsplit(s,","))
# print all elements in the vector
print(states)
## [1] "MA" "CT" "RI" "NH" "CA" "FL"
# print a specific element
print(states[2])
## [1] "CT"
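The reason strsplit returns a list becomes clearer when it is given more than one string: each input string produces its own vector of tokens. The sketch below uses made-up addresses and picks the first token of each with sapply.

```r
# illustrative input: two strings to split
addresses <- c("Boston, MA 02115", "Hartford, CT 06103")

# one list element per input string
parts <- strsplit(addresses, ", ")
length(parts)

# take the first token from each list element
cities <- sapply(parts, function(p) p[1])
cities
```

Calling unlist here would flatten all tokens into one vector and lose track of which string each token came from, so it is best suited to the single-string case.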
Converting to Upper or Lower Case
Matching is often easier if a string is entirely in upper or lower case. The functions toupper and tolower do that conversion.
s <- "Fred Flintstone"
u <- toupper(s)
l <- tolower(s)
print(u)
## [1] "FRED FLINTSTONE"
print(l)
## [1] "fred flintstone"
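A minimal sketch of why case conversion helps with matching. The values are illustrative; the idea is to bring both sides of a comparison to the same case first.

```r
a <- "Boston"
b <- "BOSTON"

# a direct comparison is case sensitive
exact_match <- a == b
exact_match

# converting both sides to the same case makes them comparable
folded_match <- tolower(a) == tolower(b)
folded_match

# the same idea works for membership tests on vectors
cities <- c("boston", "CHICAGO", "Denver")
wanted <- c("Chicago", "Denver")
hits <- toupper(cities) %in% toupper(wanted)
hits
```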
Finding Parts of a String
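The grep and grepl functions find patterns in strings. grep returns the indexes of the elements that contain the pattern, while grepl returns a Boolean (logical) value for each element indicating whether the pattern was found. The patterns are expressed as regular expressions.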
x <- c("door", "apple", "color", "abba", "pattern")
# find which strings contain the pattern
grep("a", x)
## [1] 2 4 5
grepl("a", x)
## [1] FALSE TRUE FALSE TRUE TRUE
The regexpr function returns the position in each string where the first match occurs, or -1 if there is no match.
# find where the pattern starts
m <- regexpr("a", x)
m
## [1] -1 1 -1 1 2
## attr(,"match.length")
## [1] -1 1 -1 1 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
## [1] 2
In the example below we are looking for an occurrence of *, which has a special meaning in regular expressions, so we need to escape it with a backslash (\*). However, the backslash is itself the escape character in R string literals, so it must be escaped as well by writing \\. Hence the strange-looking pattern "\\*".
# let's see if there is an * in a string
s <- "US AIRWAYS*"
p <- regexpr("\\*", s)
if (p > 0)
{
print(paste("* at position:",p,"in string",s))
} else {
print("no asterisk found")
}
## [1] "* at position: 11 in string US AIRWAYS*"
To find, for example, the first space in a string, use regexpr and then process its return value, which is an integer vector carrying additional attributes.
x <- "Mazda Miata L"
# find where the pattern starts
m <- regexpr(" ", x)
pos.first.space <- m[1]
The variable pos.first.space now contains the position of the first space.
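Once the position of the first space is known, substr can cut the string around it. This is a small sketch; the names make and rest are illustrative, and it assumes the string contains at least one space (otherwise regexpr returns -1 and the cut would be wrong).

```r
x <- "Mazda Miata L"

# position of the first space, as a plain integer
pos <- as.integer(regexpr(" ", x))
pos

# everything before the first space
make <- substr(x, 1, pos - 1)
make

# everything after the first space
rest <- substr(x, pos + 1, nchar(x))
rest
```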
The function gregexpr() returns all occurrences of a pattern and not just the first one. Inspecting the return value m reveals that it is a list and that the first element in the list is a vector of all occurrences of the character:
x <- "Mazda Miata L"
# find where the pattern starts
m <- gregexpr(" ", x)
print(m)
## [[1]]
## [1] 6 12
## attr(,"match.length")
## [1] 1 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
# all occurrences of " "
print(unlist(m))
## [1] 6 12
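The match positions from gregexpr are often only a means to an end. The Base R function regmatches takes the original string together with the match data and returns the matched text itself. The input string in this sketch is made up for illustration.

```r
# illustrative input
x <- "Order 66 shipped 3 boxes on day 120"

# locate every run of one or more digits
m <- gregexpr("[0-9]+", x)

# regmatches returns a list with one character vector per input string
found <- regmatches(x, m)

# flatten the list and convert the matched text to numbers
nums <- as.numeric(unlist(found))
nums
```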
Regular Expression Syntax
Several string processing functions take regular expressions as input. The table below summarizes the most common regular expression syntax constructs.
| Pattern | Meaning |
| --- | --- |
| ^ | Beginning of the string |
| $ | End of the string |
| [ab] | a or b |
| [^ab] | Any character except a and b |
| [0-9] | Any digit |
| [A-Z] | Any uppercase letter from A to Z |
| [a-z] | Any lowercase letter from a to z |
| [A-z] | Any letter, but also the characters [ \ ] ^ _ ` that lie between Z and a in ASCII; [A-Za-z] is safer |
| i+ | i at least one time |
| i* | i zero or more times |
| i? | i zero or 1 time |
| i{n} | i occurs n times in sequence |
| [:alnum:] | Alphanumeric characters: [:alpha:] and [:digit:] |
| [:alpha:] | Alphabetic characters: [:lower:] and [:upper:] |
| [:blank:] | Blank characters, e.g., space, tab |
| [:digit:] | Digits: 0 1 2 3 4 5 6 7 8 9 |
| [:punct:] | Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { \| } ~ |
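A few of these constructs in action, as a sketch with made-up inputs. Note that the named classes such as [:punct:] and [:alpha:] only work inside a bracket expression, so they are written with doubled brackets, e.g. [[:punct:]].

```r
# anchors and a repetition count: exactly five digits and nothing else
zips <- c("02115", "2115", "02115-1234", "ABCDE")
is_zip5 <- grepl("^[0-9]{5}$", zips)
is_zip5

# a named class inside a bracket expression: strip all punctuation
no_punct <- gsub("[[:punct:]]", "", "Hello, world!")
no_punct

# one or more alphabetic characters from start to end
only_letters <- grepl("^[[:alpha:]]+$", c("abc", "abc1"))
only_letters

# ? makes the preceding character optional
has_color <- grepl("colou?r", c("color", "colour", "colr"))
has_color
```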
Examples
# remove the $ from a vector of "numbers" and convert to numbers then calculate mean
v <- c("$99.12","$3387.3","$0.998")
v <- as.numeric(substring(v, 2))
mean(v)
## [1] 1162.473
# remove the trailing * from each string in a vector (assumes every string ends with *)
v <- c("US AIRWAYS*","CONTINENTAL*","LUFTHANSA*","UNITED*","WOW*")
v <- substring(v, 1, nchar(v)-1)
v
## [1] "US AIRWAYS" "CONTINENTAL" "LUFTHANSA" "UNITED" "WOW"
# remove * characters from a vector of strings if there is one
v <- c("US AIRWAYS*","CONTINENTAL*","LUFTHANSA","UNITED","WOW*")
# the function below finds the starting positions of all *
# if there is no *, it returns -1
p <- regexpr("\\*", v)
# the function below returns the indexes of the strings in the vector v
# which actually contain a *
g <- grep("\\*",v)
# only remove the * from those strings in the vector v that have it
v[g] <- substring(v[g],1,p[g]-1)
# our new vector with * removed
print(v)
## [1] "US AIRWAYS" "CONTINENTAL" "LUFTHANSA" "UNITED" "WOW"
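The same result can be reached in a single step with sub and an anchored pattern, since sub leaves strings without a match unchanged. This is an alternative sketch, not a replacement for the approach above; the name v_clean is illustrative.

```r
v <- c("US AIRWAYS*", "CONTINENTAL*", "LUFTHANSA", "UNITED", "WOW*")

# \\* matches a literal asterisk and $ anchors it to the end of the string;
# strings that do not end in * are returned as they are
v_clean <- sub("\\*$", "", v)
v_clean
```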
# remove all commas from a number string, as coercion with as.numeric(s)
# fails if the string contains commas as thousands separators
s <- "100,340,998"
numParts <- unlist(strsplit(s, ","))
num <- paste(numParts, collapse = "")
n <- as.numeric(num)
print(n)
## [1] 100340998
