Preface

This tutorial presumes that you have R and R Studio installed, or that you have an account on rstudio.cloud. If you do not already have R and/or R Studio you will need to download and install them. You must first install R from R Project and then the R Studio IDE from R Studio.

In this R Markdown Notebook we demonstrate how to parse, process, search, and work with text (aka character strings). When you execute code within the notebook, the results appear beneath the code.

Basic Text Functionality

The functions below are part of “Base R” and do not require any additional packages, although in practice many R programmers prefer string processing packages such as stringr.

  • paste: glue (concatenate) text and numeric values together
  • substr and substring: extract or replace substrings in a character vector
  • grep: use regular expressions to deal with patterns of text
  • regexpr: find matching pattern using regular expressions
  • sub and gsub: replace parts of a string
  • strsplit: split strings
  • nchar: find number of characters in a string
  • as.numeric: convert a string to numeric if feasible
  • strtoi: convert a string to integer if it can be (faster than as.integer)

A common issue in R is that often functions return a list object, when one would expect a vector. This sometimes requires an additional step or two of further processing that ought not to be necessary, but unfortunately often is. So be prepared for some extra code gymnastics using unlist.

Dates are not always text in R; they are more likely of a Date or DateTime type and require different functions for processing.

Text in R

Text in R is stored as character strings. The data type (class) of a character string is "character". There is no need to declare that a variable is of a string type; anything enclosed in either single quotes (') or double quotes (") is automatically a string. Because both quote styles are accepted, quotes or apostrophes can be embedded within a string by enclosing it in the other style. You can also "escape" special characters by prefixing them with a backslash (\).

Displaying (or “printing”) text is done by either using the variable name by itself or by using the print function.

s <- "this is a text or character string"
w <- 'and this is also text'
u <- 'and this is text with "double quotes"'
q <- "and this is a quote's quote"
t <- "text with escaped \"double quotes\" inside a string defined with \""

u
## [1] "and this is text with \"double quotes\""
print(q)
## [1] "and this is a quote's quote"

Convert Strings to Numbers

Use the coercion functions as.type (for example as.numeric, as.integer, or as.Date) to convert (coerce) strings to other data types, most commonly Booleans, dates, and numbers. Converting to an integer does not round; it simply drops the fractional part (it "truncates"). The function strtoi is a faster way to convert strings to integers, assuming that there is no fractional part.

s1 <- "123.99"
n1 <- as.numeric(s1)
n1
## [1] 123.99
class(n1)
## [1] "numeric"
i1 <- as.integer(s1)
i1
## [1] 123
class(i1)
## [1] "integer"
i2 <- strtoi("2234")
i2
## [1] 2234

Note that for conversions to work, the text must be a valid number; characters such as currency symbols are not allowed. The dot (.) is allowed as a decimal separator, but the comma (,) is recognized neither as a decimal nor as a thousands separator. When a string cannot be converted, the result is NA and R issues the warning NAs introduced by coercion.

# the following would not work
e <- as.numeric("$123.99")        # contains $
## Warning: NAs introduced by coercion
e <- as.numeric("123,99")         # comma not recognized
## Warning: NAs introduced by coercion

Convert Strings to Dates

Dates are a separate data type in R. A date object can be used in "date aware" calculations. Note that strings must be in the format YYYY-MM-DD or YYYY/MM/DD unless the format parameter is specified. The lubridate package has many more functions for converting from various other formats. However a date is parsed, it is displayed in the standard format YYYY-MM-DD.

To get the current date from the system’s clock, use Sys.Date(). To get the current date and time, use date().

s2 <- "2021-02-15"
d1 <- as.Date(s2)
d1
## [1] "2021-02-15"
class(d1)
## [1] "Date"
s3 <- "2019/06/30"
d2 <- as.Date(s3)
d2
## [1] "2019-06-30"
as.Date("07/04/20", format = "%m/%d/%y")
## [1] "2020-07-04"
as.Date("JUL 04 2021", format = "%b %d %Y")
## [1] "2021-07-04"
as.Date("February 11 80", format = "%B %d %y")
## [1] "1980-02-11"
# get current date and time
date()
## [1] "Wed Feb 14 09:47:19 2024"
# get current date only as YYYY-MM-DD
Sys.Date()
## [1] "2024-02-14"

Here are the various date format specifications:

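| Symbol | Meaning                   | Example |
|--------|---------------------------|---------|
| %d     | day as a number (01-31)   | 06      |
| %a     | abbreviated weekday       | Mon     |
| %A     | full weekday              | Monday  |
| %m     | month as a number (01-12) | 11      |
| %b     | abbreviated month         | Nov     |
| %B     | full month                | March   |
| %y     | two-digit year            | 20      |
| %Y     | four-digit year           | 2021    |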

String Concatenation

Since print takes only a single object as its argument, you need to concatenate (i.e., combine) multiple strings using either paste or paste0. By default, paste inserts a space between the strings, while paste0 inserts nothing. paste also allows you to specify a separator other than a space through its sep parameter.

s <- "this is a text or character string"
w <- 'and this is also text'

r <- paste(s, w)
t <- paste(s, w, sep = "/")

print(r)
## [1] "this is a text or character string and this is also text"
print(t)
## [1] "this is a text or character string/and this is also text"

Extract Substrings

Extracting a specific set of characters from a string is done with substr(text, start, stop). An alternative is the substring(text, first, last = 1000000L) function; it works the same way except that it goes to the end of the string if the last parameter is omitted. Another difference is that substring recycles its first and last arguments, so several substrings can be extracted from one string with a single call. substr always returns exactly one result per element of text.

Both functions extract by position, while the regexpr function (described below) provides the location of a substring within a string.

Both functions treat a string as a “vector” of characters (although it really isn’t a vector type). Characters are in positions from 1 (leftmost character) to the end. You can find the length of a string (the number of characters in it) with the function nchar.

s <- "Boston, MA 02115 USA"

l <- nchar(s)
print(paste("length of s is:", l))
## [1] "length of s is: 20"
# extract the characters from position 1 through 6, inclusive.
city <- substr(s, 1, 6)
print(city)
## [1] "Boston"
city <- substring(s, 1, 6)
print(city)
## [1] "Boston"
# extract the characters from position 9 to the end, inclusive.
state <- substring(s, 9)
print(state)
## [1] "MA 02115 USA"
# extract three characters from the right
country <- substring(s, nchar(s)-2)
print(country)
## [1] "USA"

The functions can also be applied to vectors of strings which is useful for processing data frames as each column in a data frame is a vector.

vs <- c("$123.77","$12.99","$1.99")

# extract all characters from the second to the end
vs2 <- substring(vs,2)
print(vs2)
## [1] "123.77" "12.99"  "1.99"
# count all string lengths in the vector
vsl <- nchar(vs2)
print(vsl)
## [1] 6 5 4

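The substr and substring functions can also appear on the left side of an assignment, in which case parts of a string are overwritten with new characters. The length of the string never changes: if the replacement is shorter than the range being replaced, only as many characters as the replacement supplies are overwritten, and a longer replacement is truncated.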

s <- "Boston, MA 02115 USA"

substring(s,7,8) <- "."
print(s)
## [1] "Boston. MA 02115 USA"
substring(s,9) <- "RI"
print(s)
## [1] "Boston. RI 02115 USA"

Replacing Parts of a String

The sub function is used to replace parts of a string with new text. Note that it only replaces the first occurrence while the function gsub replaces all occurrences. The function returns the new string.

s <- "Boston, MA 02115"

s <- sub("02115","02117",s)
print(s)
## [1] "Boston, MA 02117"

The replacement text does not have to be of the same length as the replaced text.

s <- "Boston, MA 02125"
s <- sub("Boston","Charlestown",s)
print(s)
## [1] "Charlestown, MA 02125"

The pattern passed to sub and gsub is interpreted as a regular expression. More on regular expressions below.

s <- c("Charlestown, MA 02121","Brighton, MA 02137")

s <- sub("Charlestown|Brighton","Boston",s)
print(s)
## [1] "Boston, MA 02121" "Boston, MA 02137"

Splitting a String into Tokens

Splitting a string into tokens is done with the strsplit function.

s <- "MA,CT,RI,NH,CA,FL"

# split based on comma ,
states <- strsplit(s,",")
print(states)
## [[1]]
## [1] "MA" "CT" "RI" "NH" "CA" "FL"

The function returns a “list” rather than a “vector”, so we often use the function unlist() to convert the list to a vector. We can then use vector access to get to specific elements.

s <- "MA,CT,RI,NH,CA,FL"

# split based on comma ,
states <- unlist(strsplit(s,","))

# print all elements in the vector

print(states)
## [1] "MA" "CT" "RI" "NH" "CA" "FL"
# print a specific element
print(states[2])
## [1] "CT"

Converting to Upper or Lower Case

Matching is often easier if a string is in all upper or all lower case. The functions toupper and tolower do that.

s <- "Fred Flintstone"

u <- toupper(s)
l <- tolower(s)

print(u)
## [1] "FRED FLINTSTONE"
print(l)
## [1] "fred flintstone"

Finding Parts of a String

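The grep and grepl functions find patterns in strings. grep returns the indexes of the elements that contain the pattern, while grepl returns a Boolean for each element indicating whether the pattern was found. The patterns are expressed as regular expressions.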

x <- c("door", "apple", "color", "abba", "pattern")

# find which strings contain the pattern
grep("a", x)
## [1] 2 4 5
grepl("a", x)
## [1] FALSE  TRUE FALSE  TRUE  TRUE

The regexpr function returns the position in each string where the first match occurs, or -1 if there is no match.

# find where the pattern starts
m <- regexpr("a", x)
m
## [1] -1  1 -1  1  2
## attr(,"match.length")
## [1] -1  1 -1  1  1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
m[5]
## [1] 2

In the example below we are looking for an occurrence of *, which is a special character in regular expressions, so we need to "escape its meaning" with a \. However, the \ is also the escape character in R string literals, so it must itself be written as \\, hence the strange-looking pattern \\*.

# let's see if there is an * in a string
s <- "US AIRWAYS*"
p <- regexpr("\\*", s)

if (p > 0)
{
  print(paste("* at position: ",p,"in string",s))
} else {
  print("no asterisk found")
}
## [1] "* at position:  11 in string US AIRWAYS*"

To find, for example, the first space in a string, use regexpr and then process the return value, which is an integer vector (with attributes) holding one position per input string.

x <- "Mazda Miata L"
# find where the pattern starts
m <- regexpr(" ", x)

pos.first.space <- m[1]

The variable pos.first.space now contains the position of the first space.

The function gregexpr() returns all occurrences of a pattern and not just the first one. Inspecting the return value m reveals that it is a list with one element per input string, and that the first element is a vector of the positions of all occurrences:

x <- "Mazda Miata L"
# find where the pattern starts
m <- gregexpr(" ", x)

print(m)
## [[1]]
## [1]  6 12
## attr(,"match.length")
## [1] 1 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
# all occurrences of " "
print(unlist(m))
## [1]  6 12

Regular Expression Syntax

Several string processing functions take regular expressions as input. The table below summarizes the most common regular expression syntax constructs.

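| Syntax    | Description                                                                       |
|-----------|-----------------------------------------------------------------------------------|
| ^         | Beginning of the string                                                           |
| $         | End of the string                                                                 |
| [ab]      | a or b                                                                            |
| [^ab]     | Any character except a and b                                                      |
| [0-9]     | Any digit                                                                         |
| [A-Z]     | Any uppercase letter from A to Z                                                  |
| [a-z]     | Any lowercase letter from a to z                                                  |
| [A-Za-z]  | Any letter                                                                        |
| i+        | i at least one time                                                               |
| i*        | i zero or more times                                                              |
| i?        | i zero or 1 time                                                                  |
| i{n}      | i occurs n times in sequence                                                      |
| [:alnum:] | Alphanumeric characters: [:alpha:] and [:digit:]                                  |
| [:alpha:] | Alphabetic characters: [:lower:] and [:upper:]                                    |
| [:blank:] | Blank characters, e.g., space, tab                                                |
| [:digit:] | Digits: 0 1 2 3 4 5 6 7 8 9                                                       |
| [:punct:] | Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { \| } ~ |

The [:class:] names are only valid inside a bracket expression, so in a pattern they are written with double brackets, e.g., [[:digit:]]+.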

Examples

# remove the leading $ (if any) from a vector of "numbers", convert to numbers, then calculate the mean
v <- c("$99.12","13387.3","$0.998")
v <- as.numeric(sub("^\\$", "", v))
mean(v)
## [1] 4495.806
# remove the trailing * from every string in a vector
v <- c("US AIRWAYS*","CONTINENTAL*","LUFTHANSA*","UNITED*","WOW*")

v <- substring(v, 1, nchar(v)-1)
v
## [1] "US AIRWAYS"  "CONTINENTAL" "LUFTHANSA"   "UNITED"      "WOW"
# remove * characters from a vector of strings if there is one
v <- c("US AIRWAYS*","CONTINENTAL*","LUFTHANSA","UNITED","WOW*")

# the function below finds the starting positions of all *
# if there is no *, it returns -1
p <- regexpr("\\*", v)

# the function below returns the indexes of the strings in the vector v
# which actually contain a *
g <- grep("\\*",v)

# only remove the * from those strings in the vector v that have it
v[g] <- substring(v[g],1,p[g]-1)

# our new vector with * removed
print(v)
## [1] "US AIRWAYS"  "CONTINENTAL" "LUFTHANSA"   "UNITED"      "WOW"
# remove all commas from a number string, as coercion with as.numeric(s)
# fails if the string contains commas as thousands separators
s <- "100,340,998"

numParts <- unlist(strsplit(s, ","))
num <- paste(numParts, collapse = "")
n <- as.numeric(num)

print(n)
## [1] 100340998

Tutorial

The tutorial below demonstrates how to create a project in R Studio and add files to the project.


Files & Resources

All Files for Lesson 6.112

References

No references.

Errata

None collected yet. Let us know.

---
title: "Basics of Text & String Processing in R"
params:
  category: 6
  number: 112
  time: 45
  level: beginner
  tags: "r,r studio,primer"
  description: "Explains string and text processing in R, including
                regular expressions."
date: "<small>`r Sys.Date()`</small>"
author: "<small>Martin Schedlbauer</small>"
email: "m.schedlbauer@neu.edu"
affilitation: "Northeastern University"
output: 
  bookdown::html_document2:
    toc: true
    toc_float: true
    collapsed: false
    number_sections: false
    code_download: true
    theme: spacelab
    highlight: tango
---

---
title: "<small>`r params$category`.`r params$number`</small><br/><span style='color: #2E4053; font-size: 0.9em'>`r rmarkdown::metadata$title`</span>"
---

```{r code=xfun::read_utf8(paste0(here::here(),'/R/_insert2DB.R')), include = FALSE}
```

## Preface

This tutorial presumes that you have R and R Studio installed, or that you have an account on [rstudio.cloud](http://rstudio.cloud). If you do not already have R and/or R Studio you will need to download and install them. You must first install R from [R Project](https://cloud.r-project.org/) and then the R Studio IDE from [R Studio](https://rstudio.com/products/rstudio/download/).

In this [R Markdown](http://rmarkdown.rstudio.com) Notebook we demonstrate how to parse, process, search, and work with text (*aka* character strings). When you execute code within the notebook, the results appear beneath the code.

## Basic Text Functionality

The functions below are part of "Base R" and do not require any additional packages, although in practice many R programmers prefer string processing packages such as **stringr**.

-   <code>paste</code>: glue (concatenate) text and numeric values together
-   <code>substr</code> and <code>substring</code>: extract or replace substrings in a character vector
-   <code>grep</code>: use regular expressions to deal with patterns of text
-   <code>regexpr</code>: find matching pattern using regular expressions
-   <code>sub</code> and <code>gsub</code>: replace parts of a string
-   <code>strsplit</code>: split strings
-   <code>nchar</code>: find number of characters in a string
-   <code>as.numeric</code>: convert a string to numeric if feasible
-   <code>strtoi</code>: convert a string to integer if it can be (faster than <code>as.integer</code>)

A common issue in R is that often functions return a *list* object, when one would expect a *vector*. This sometimes requires an additional step or two of further processing that ought not to be necessary, but unfortunately often is. So be prepared for some extra code gymnastics using <code>unlist</code>.

> Dates are not always text in R; they are more likely of a *Date* or *DateTime* type and require different functions for processing.

## Text in R

Text in R is stored as character strings. The data type (class) of a character string is *"character"*. There is no need to declare that a variable is of a string type; anything enclosed in either single quotes (') or double quotes (") is automatically a string. Because both quote styles are accepted, quotes or apostrophes can be embedded within a string by enclosing it in the other style. You can also "escape" special characters by prefixing them with a backslash (`\`).

Displaying (or "printing") text is done by either using the variable name by itself or by using the <code>print</code> function.

```{r}
s <- "this is a text or character string"
w <- 'and this is also text'
u <- 'and this is text with "double quotes"'
q <- "and this is a quote's quote"
t <- "text with escaped \"double quotes\" inside a string defined with \""

u
print(q)
```
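
As a small illustration of the difference between how a string is stored and how it is printed, the sketch below uses a made-up variable `quoted`. `cat` is a base R function not otherwise covered in this lesson; it writes the text as stored, whereas `print` shows the escape backslashes.

```{r}
quoted <- "text with \"double quotes\""

class(quoted)       # "character"
print(quoted)       # displays the escapes: "text with \"double quotes\""
cat(quoted, "\n")   # displays the stored text: text with "double quotes"
nchar(quoted)       # 25, because the backslashes are not part of the string
```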

## Convert Strings to Numbers

Use the coercion functions *as.type* (for example <code>as.numeric</code>, <code>as.integer</code>, or <code>as.Date</code>) to convert (coerce) strings to other data types, most commonly Booleans, dates, and numbers. Converting to an integer does not round; it simply drops the fractional part (it "truncates"). The function <code>strtoi</code> is a faster way to convert strings to integers, assuming that there is no fractional part.

```{r}
s1 <- "123.99"
n1 <- as.numeric(s1)
n1
class(n1)

i1 <- as.integer(s1)
i1
class(i1)

i2 <- strtoi("2234")
i2


```

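Note that for conversions to work, the text must be a valid number; characters such as currency symbols are not allowed. The dot (.) is allowed as a decimal separator, but the comma (,) is recognized neither as a decimal nor as a thousands separator. When a string cannot be converted, the result is <code>NA</code> and R issues the warning *NAs introduced by coercion*.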

```{r}
# the following would not work
e <- as.numeric("$123.99")        # contains $
e <- as.numeric("123,99")         # comma not recognized
```
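
One way to handle such values, sketched here under the assumption that the only offending characters are a dollar sign and commas used as thousands separators, is to strip them with `gsub` before coercing. The input vector `raw` is invented for illustration.

```{r}
raw <- c("$1,234.50", "$99", "n/a")      # hypothetical input values

# drop every "$" and "," before coercing; inside [ ] the $ is a literal character
cleaned <- gsub("[$,]", "", raw)
amounts <- suppressWarnings(as.numeric(cleaned))

amounts            # 1234.5   99.0     NA
is.na(amounts)     # FALSE FALSE  TRUE
```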

## Convert Strings to Dates

Dates are a separate data type in R. A date object can be used in "date aware" calculations. Note that strings must be in the format YYYY-MM-DD or YYYY/MM/DD unless the *format* parameter is specified. The **lubridate** package has many more functions for converting from various other formats. However a date is parsed, it is displayed in the standard format YYYY-MM-DD.

To get the current date from the system's clock, use <code>Sys.Date()</code>. To get the current date **and** time, use <code>date()</code>.

```{r}
s2 <- "2021-02-15"
d1 <- as.Date(s2)
d1
class(d1)

s3 <- "2019/06/30"
d2 <- as.Date(s3)
d2

as.Date("07/04/20", format = "%m/%d/%y")
as.Date("JUL 04 2021", format = "%b %d %Y")
as.Date("February 11 80", format = "%B %d %y")

# get current date and time
date()

# get current date only as YYYY-MM-DD
Sys.Date()
```
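
Once a string has been parsed into a *Date*, it supports arithmetic and can be turned back into text with `format`. The sketch below uses arbitrary example dates and only numeric format symbols, so the results do not depend on the system's language settings.

```{r}
start <- as.Date("2021-02-15")
end   <- as.Date("03/01/2021", format = "%m/%d/%Y")

end - start                   # Time difference of 14 days
as.numeric(end - start)       # 14
start + 30                    # "2021-03-17"
format(start, "%d.%m.%Y")     # "15.02.2021"
```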

Here are the various date *format* specifications:

| Symbol | Meaning                 | Example |
|--------|-------------------------|---------|
| %d     | day as a number (01-31) | 06      |
| %a     | abbreviated weekday     | Mon     |
| %A     | full weekday            | Monday  |
| %m     | month (01-12)           | 11      |
| %b     | abbreviated month       | Nov     |
| %B     | full month              | March   |
| %y     | two-digit year          | 20      |
| %Y     | four-digit year         | 2021    |

## String Concatenation

Since <code>print</code> takes only a single object as its argument, you need to concatenate (*i.e.*, combine) multiple strings using either <code>paste</code> or <code>paste0</code>. By default, <code>paste</code> inserts a space between the strings, while <code>paste0</code> inserts nothing. <code>paste</code> also allows you to specify a separator other than a space through its *sep* parameter.

```{r}
s <- "this is a text or character string"
w <- 'and this is also text'

r <- paste(s, w)
t <- paste(s, w, sep = "/")

print(r)
print(t)
```
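
The following sketch contrasts `paste`, `paste0`, and the `sep` and `collapse` parameters. The variable names are illustrative only.

```{r}
city_name  <- "Boston"
state_code <- "MA"

paste(city_name, state_code)                # "Boston MA"
paste0(city_name, state_code)               # "BostonMA"
paste(city_name, state_code, sep = ", ")    # "Boston, MA"

# sep joins the arguments element by element; collapse then joins the results
paste0("Q", 1:4)                            # "Q1" "Q2" "Q3" "Q4"
paste0("Q", 1:4, collapse = "+")            # "Q1+Q2+Q3+Q4"
```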

## Extract Substrings

Extracting a specific set of characters from a string is done with <code>substr(text, start, stop)</code>. An alternative is the <code>substring(text, first, last = 1000000L)</code> function; it works the same way except that it goes to the end of the string if the last parameter is omitted. Another difference is that <code>substring</code> recycles its *first* and *last* arguments, so several substrings can be extracted from one string with a single call. <code>substr</code> always returns exactly one result per element of *text*.

Both functions extract by position, while the <code>regexpr</code> function (described below) provides the location of a substring within a string.

Both functions treat a string as a "vector" of characters (although it really isn't a *vector* type). Characters are in positions from 1 (leftmost character) to the end. You can find the length of a string (the number of characters in it) with the function <code>nchar</code>.

```{r}
s <- "Boston, MA 02115 USA"

l <- nchar(s)
print(paste("length of s is:", l))

# extract the characters from position 1 through 6, inclusive.
city <- substr(s, 1, 6)
print(city)

city <- substring(s, 1, 6)
print(city)

# extract the characters from position 9 to the end, inclusive.
state <- substring(s, 9)
print(state)

# extract three characters from the right
country <- substring(s, nchar(s)-2)
print(country)

```
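
To illustrate how the two functions treat vectors of positions, here is a brief sketch using an arbitrary five-character string named `zip`:

```{r}
zip <- "02115"

# substring recycles first/last, so one string yields several pieces
substring(zip, 1:5, 1:5)      # "0" "2" "1" "1" "5"
substring(zip, 1:3, 3:5)      # "021" "211" "115"

# substr returns one result per element of its first argument
substr(zip, 1:3, 3:5)         # "021"
```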

The functions can also be applied to vectors of strings which is useful for processing data frames as each column in a data frame is a vector.

```{r}
vs <- c("$123.77","$12.99","$1.99")

# extract all characters from the second to the end
vs2 <- substring(vs,2)
print(vs2)

# count all string lengths in the vector
vsl <- nchar(vs2)
print(vsl)
```

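The <code>substr</code> and <code>substring</code> functions can also appear on the left side of an assignment, in which case parts of a string are overwritten with new characters. The length of the string never changes: if the replacement is shorter than the range being replaced, only as many characters as the replacement supplies are overwritten, and a longer replacement is truncated.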

```{r}
s <- "Boston, MA 02115 USA"

substring(s,7,8) <- "."
print(s)

substring(s,9) <- "RI"
print(s)
```

## Replacing Parts of a String

The <code>sub</code> function is used to replace parts of a string with new text. Note that it only replaces the first occurrence while the function <code>gsub</code> replaces all occurrences. The function returns the new string.

```{r}
s <- "Boston, MA 02115"

s <- sub("02115","02117",s)
print(s)
```

The replacement text does not have to be of the same length as the replaced text.

```{r}
s <- "Boston, MA 02125"
s <- sub("Boston","Charlestown",s)
print(s)
```

The pattern passed to <code>sub</code> and <code>gsub</code> is interpreted as a regular expression. More on regular expressions below.

```{r}
s <- c("Charlestown, MA 02121","Brighton, MA 02137")

s <- sub("Charlestown|Brighton","Boston",s)
print(s)
```
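
The difference between `sub` and `gsub` is easiest to see on a string with repeated matches. The phone number below is a made-up value, and `fixed = TRUE` (a standard parameter of both functions) turns off regular-expression matching so that the pattern is taken literally.

```{r}
phone <- "617-555-0100"

sub("-", ".", phone, fixed = TRUE)     # "617.555-0100"  (first match only)
gsub("-", ".", phone, fixed = TRUE)    # "617.555.0100"  (every match)

# with a regular expression: replace each run of digits with "#"
gsub("[0-9]+", "#", phone)             # "#-#-#"
```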

## Splitting a String into Tokens

Splitting a string into tokens is done with the <code>strsplit</code> function.

```{r}
s <- "MA,CT,RI,NH,CA,FL"

# split based on comma ,
states <- strsplit(s,",")
print(states)
```

The function returns a "list" rather than a "vector", so we often use the function `unlist()` to convert the list to a vector. We can then use vector access to get to specific elements.

```{r}
s <- "MA,CT,RI,NH,CA,FL"

# split based on comma ,
states <- unlist(strsplit(s,","))

# print all elements in the vector

print(states)
# print a specific element
print(states[2])
```
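
`strsplit` returns a list because it is vectorized: each input string yields its own vector of tokens. A minimal sketch with two invented address strings:

```{r}
addresses <- c("Boston,MA,02115", "Hartford,CT,06103")

parts <- strsplit(addresses, ",")   # a list with one character vector per input
length(parts)                       # 2
parts[[2]]                          # "Hartford" "CT" "06103"

# pick the second token of every element
sapply(parts, function(p) p[2])     # "MA" "CT"
```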

## Converting to Upper or Lower Case

Matching is often easier if a string is in all upper or all lower case. The functions <code>toupper</code> and <code>tolower</code> do that.

```{r}
s <- "Fred Flintstone"

u <- toupper(s)
l <- tolower(s)

print(u)
print(l)
```
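
For example, a case-insensitive comparison might be sketched as follows. The vector `people` is invented, and `ignore.case` is a standard parameter of `grepl`, shown here as an alternative to converting the case first.

```{r}
people <- c("Fred Flintstone", "FRED FLINTSTONE", "Barney Rubble")

# normalize the case first, then compare
tolower(people) == "fred flintstone"               # TRUE  TRUE FALSE

# or let grepl ignore the case while matching
grepl("flintstone", people, ignore.case = TRUE)    # TRUE  TRUE FALSE
```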

## Finding Parts of a String

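The <code>grep</code> and <code>grepl</code> functions find patterns in strings. <code>grep</code> returns the indexes of the elements that contain the pattern, while <code>grepl</code> returns a Boolean for each element indicating whether the pattern was found. The patterns are expressed as regular expressions.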

```{r}
x <- c("door", "apple", "color", "abba", "pattern")

# find which strings contain the pattern
grep("a", x)
grepl("a", x)

```
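
A common use of these functions is to filter a vector. The sketch below, using a vector named `words` chosen for illustration, relies on the `value` parameter of `grep` and on logical subsetting with `grepl`.

```{r}
words <- c("door", "apple", "color", "abba", "pattern")

grep("a", words, value = TRUE)   # "apple" "abba" "pattern"
words[grepl("a", words)]         # the same elements, via logical subsetting
sum(grepl("a", words))           # 3 strings contain an "a"
words[!grepl("a", words)]        # "door" "color"
```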

The <code>regexpr</code> function returns the position in each string where the first match occurs, or -1 if there is no match.

```{r}
# find where the pattern starts
m <- regexpr("a", x)
m
m[5]
```

In the example below we are looking for an occurrence of `*`, which is a special character in regular expressions, so we need to "escape its meaning" with a `\`. However, the `\` is also the escape character in R string literals, so it must itself be written as `\\`, hence the strange-looking pattern `"\\*"`.

```{r}
# let's see if there is an * in a string
s <- "US AIRWAYS*"
p <- regexpr("\\*", s)

if (p > 0)
{
  print(paste("* at position: ",p,"in string",s))
} else {
  print("no asterisk found")
}
```

To find, for example, the first space in a string, use `regexpr` and then process the return value, which is an integer vector (with attributes) holding one position per input string.

```{r}
x <- "Mazda Miata L"
# find where the pattern starts
m <- regexpr(" ", x)

pos.first.space <- m[1]
```

The variable `pos.first.space` now contains the position of the first space.

The function `gregexpr()` returns all occurrences of a pattern and not just the first one. Inspecting the return value `m` reveals that it is a list with one element per input string, and that the first element is a vector of the positions of all occurrences:

```{r}
x <- "Mazda Miata L"
# find where the pattern starts
m <- gregexpr(" ", x)

print(m)

# all occurrences of " "
print(unlist(m))
```
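
Positions are usually a means to an end: extracting the matching text. As a hedged sketch, the position from `regexpr` can be passed to `substr`, or the base R function `regmatches` (not otherwise covered in this lesson) can return the matched text directly. The variable names are illustrative.

```{r}
car <- "Mazda Miata L"

# position of the first space, then the text before it
pos <- regexpr(" ", car)
substr(car, 1, pos - 1)          # "Mazda"

# regmatches() returns the matched text instead of positions;
# the pattern matches an uppercase letter followed by lowercase letters
hits <- gregexpr("[A-Z][a-z]+", car)
regmatches(car, hits)[[1]]       # "Mazda" "Miata"
```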

## Regular Expression Syntax

Several string processing functions take regular expressions as input. The table below summarizes the most common regular expression syntax constructs.

| Syntax    | Description                                                                              |
|-----------|------------------------------------------------------------------------------------------|
| \^        | Beginning of the string                                                                  |
| \$        | End of the string                                                                        |
| [ab]      | a or b                                                                                   |
| [\^ab]    | Any character except a and b                                                             |
| [0-9]     | Any digit                                                                                |
| [A-Z]     | Any uppercase letter from A to Z                                                         |
| [a-z]     | Any lowercase letter from a to z                                                         |
| [A-Za-z]  | Any letter                                                                               |
| i+        | *i* at least one time                                                                    |
| i\*       | *i* zero or more times                                                                   |
| i?        | *i* zero or 1 time                                                                       |
| i{n}      | *i* occurs *n* times in sequence                                                         |
| [:alnum:] | Alphanumeric characters: [:alpha:] and [:digit:]                                         |
| [:alpha:] | Alphabetic characters: [:lower:] and [:upper:]                                           |
| [:blank:] | Blank characters, *e.g.*, space, tab                                                     |
| [:digit:] | Digits: 0 1 2 3 4 5 6 7 8 9                                                              |
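| [:punct:] | Punctuation characters: ! " \# \$ % & ' ( ) \* + , - . / : ; \< = \> ? \@ [ \\ ] \^ \_ \` { \| } \~ |

The `[:class:]` names are only valid inside a bracket expression, so in a pattern they are written with double brackets, *e.g.*, `[[:digit:]]+`.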
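
To see a few of these constructs in action, consider this sketch, which tests some invented strings against simple patterns built from anchors, character classes, and quantifiers:

```{r}
codes <- c("02115", "2115", "ab123", "O2115")   # the last one starts with the letter O

grepl("^[0-9]{5}$", codes)       # TRUE FALSE FALSE FALSE  (exactly five digits)
grepl("^[[:alpha:]]+", codes)    # FALSE FALSE  TRUE  TRUE  (starts with a letter)
gsub("[^0-9]", "", codes)        # "02115" "2115" "123" "2115"  (keep digits only)

# ? makes the preceding character optional
grepl("^colou?r$", c("color", "colour", "colr"))   # TRUE  TRUE FALSE
```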

## Examples

```{r}
# remove the leading $ (if any) from a vector of "numbers", convert to numbers, then calculate the mean
v <- c("$99.12","13387.3","$0.998")
v <- as.numeric(sub("^\\$", "", v))
mean(v)
```

```{r}
# remove the trailing * from every string in a vector
v <- c("US AIRWAYS*","CONTINENTAL*","LUFTHANSA*","UNITED*","WOW*")

v <- substring(v, 1, nchar(v)-1)
v
```

```{r}
# remove * characters from a vector of strings if there is one
v <- c("US AIRWAYS*","CONTINENTAL*","LUFTHANSA","UNITED","WOW*")

# the function below finds the starting positions of all *
# if there is no *, it returns -1
p <- regexpr("\\*", v)

# the function below returns the indexes of the strings in the vector v
# which actually contain a *
g <- grep("\\*",v)

# only remove the * from those strings in the vector v that have it
v[g] <- substring(v[g],1,p[g]-1)

# our new vector with * removed
print(v)
```

```{r}
# remove all commas from a number string, as coercion with as.numeric(s)
# fails if the string contains commas as thousands separators
s <- "100,340,998"

numParts <- unlist(strsplit(s, ","))
num <- paste(numParts, collapse = "")
n <- as.numeric(num)

print(n)
```

## Tutorial

The tutorial below demonstrates how to create a project in R Studio and add files to the project.

```{=html}
<iframe src="" width="480" height="270" frameborder="0" allow="autoplay; fullscreen; picture-in-picture" allowfullscreen data-external="1"></iframe>
```

------------------------------------------------------------------------

## Files & Resources

```{r zipFiles, echo=FALSE}
zipName = sprintf("LessonFiles-%s-%s.zip", 
                 params$category,
                 params$number)

textALink = paste0("All Files for Lesson ", 
               params$category,".",params$number)

# downloadFilesLink() is included from _insert2DB.R
knitr::raw_html(downloadFilesLink(".", zipName, textALink))
```

------------------------------------------------------------------------

## References

No references.

## Errata

None collected yet. Let us know.

```{=html}
<script src="https://form.jotform.com/static/feedback2.js" type="text/javascript">
  new JotformFeedback({
    formId: "212187072784157",
    buttonText: "Feedback",
    base: "https://form.jotform.com/",
    background: "#F59202",
    fontColor: "#FFFFFF",
    buttonSide: "left",
    buttonAlign: "center",
    type: false,
    width: 700,
    height: 500,
    isCardForm: false
  });
</script>
```
```{r code=xfun::read_utf8(paste0(here::here(),'/R/_deployKnit.R')), include = FALSE}
```
