Introduction
Regular expressions (often abbreviated as “regex” or “regexp”) are sequences of characters that form a search pattern, primarily used for pattern matching within strings. They provide a flexible and concise means to identify strings of text, such as particular characters, words, or patterns. Regular expressions are commonly used in text searching, text replacement, and data validation.
They are important for many text (string) processing functions such as regexpr() in R as well as Unix search programs such as grep and fgrep.
Need for Regular Expressions
Programmers, software engineers, data engineers, and information scientists in particular benefit from understanding regular expressions for several reasons:
Text Processing: Regular expressions offer a powerful tool for extracting, modifying, and manipulating text in data files, configuration files, logs, codebases, and more.
Data Validation: They can be used to check if input data conforms to expected formats, such as email addresses, phone numbers, or custom formats specific to an application.
Web Scraping: When extracting information from websites, regular expressions can be used to parse relevant data from HTML or other textual formats.
Search & Replace: Regular expressions make it possible to perform complex search and replace operations in text editors and IDEs, which is especially handy during refactoring or codebase-wide updates.
Optimized Search: Instead of writing lengthy code to search for specific patterns in a string, a short regular expression can often achieve the same result, making the code cleaner and more efficient.
Flexibility: Regular expressions are versatile. Once you understand their syntax, you can use them across various programming languages and tools, as the fundamental concepts remain consistent.
Data Parsing: They can be used to split strings into components or capture specific parts from strings, like extracting values from structured texts such as CSV, JSON, or XML.
Enhanced User Experience: On the frontend, regex can be used for real-time data validation as users input data, guiding users to correct errors instantly.
Log Analysis: For system administrators or developers diagnosing issues, regular expressions can help sift through logs to identify errors, warnings, or specific events.
Cybersecurity: Regular expressions play a role in intrusion detection systems, where specific patterns in network traffic or logs might indicate malicious activity.
Understanding regular expressions can save programmers time, streamline their tasks, and enhance the functionality and efficiency of the software they develop. While there’s a learning curve to mastering regular expressions, the benefits for programmers make the investment worthwhile.
Regular Expressions in R
There are three common functions in R that accept regular expressions:
grep()
grepl()
regexpr()
Let’s look at each one of them in more detail and how to use them.
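The grep and grepl functions search for a pattern in each element of a character vector. grep returns the indices of the elements that contain the pattern, while grepl returns a logical (Boolean) vector indicating, element by element, whether the pattern was found. The patterns are expressed as regular expressions.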
x <- c("door", "apple", "color", "abba", "pattern")
# find which strings contain the pattern
grep("a", x)
## [1] 2 4 5
# test each string for the pattern
grepl("a", x)
## [1] FALSE TRUE FALSE TRUE TRUE
The regexpr function returns, for each string, the position at which the first match starts, or -1 if there is no match.
# find where the pattern starts
m <- regexpr("a", x)
m
## [1] -1 1 -1 1 2
## attr(,"match.length")
## [1] -1 1 -1 1 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
## [1] 2
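The positions returned by regexpr, together with the match.length attribute, can be combined with substring() to pull out the text that matched. The following is a minimal sketch under the assumption that we search the same vector x for one or more consecutive "o" characters; the variable names hit, starts, lens, and matched are illustrative only.

```r
x <- c("door", "apple", "color", "abba", "pattern")

# "o+" means one or more consecutive "o" characters
m <- regexpr("o+", x)

# keep only the strings that actually matched (position is not -1)
hit <- as.vector(m > 0)
starts <- m[hit]
lens <- attr(m, "match.length")[hit]

# extract the matched text from each matching string
matched <- substring(x[hit], starts, starts + lens - 1)
print(matched)
```

Here only "door" and "color" contain a match, so matched holds one entry for each of them.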
In the example below we are looking for an occurrence of *, which is a special character in regular expressions, so we need to “escape its meaning” with a \. However, the \ is also the escape character inside R string literals, so it must itself be escaped by writing \\. This is why the regular expression \* appears as the somewhat strange "\\*" in R code.
# let's see if there is an * in a string
s <- "US AIRWAYS*"
p <- regexpr("\\*", s)
if (p > 0) {
  print(paste("* at position:", p,
              "in string", s))
} else {
  print("no asterisk found")
}
## [1] "* at position: 11 in string US AIRWAYS*"
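Escaping with a backslash is not the only way to match a literal asterisk. As a hedged sketch, the three calls below should all locate the same character: the escaped form, a character class (inside brackets * has no special meaning), and the fixed = TRUE argument of regexpr, which turns off regular expression interpretation altogether. The names p1, p2, and p3 are illustrative.

```r
s <- "US AIRWAYS*"

# 1. escape the asterisk with a backslash (doubled inside an R string)
p1 <- regexpr("\\*", s)

# 2. put the asterisk in a character class, where it is taken literally
p2 <- regexpr("[*]", s)

# 3. treat the pattern as a plain string rather than a regular expression
p3 <- regexpr("*", s, fixed = TRUE)

# c() drops the attributes and leaves just the three positions
print(c(p1, p2, p3))
```

Which form to prefer is a matter of taste; fixed = TRUE is often the least error-prone when the whole pattern is meant literally.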
To find, for example, the first space in a string, use regexpr and then process the return value, which is an integer vector with attributes.
x <- "Mazda Miata L"
# find where the pattern starts
m <- regexpr(" ", x)
pos.first.space <- m[1]
The variable pos.first.space now contains the position of the first space.
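Once the position of the first space is known, it can be used to cut the string in two. The sketch below assumes the string contains at least one space (otherwise regexpr returns -1 and the result would not be meaningful); the names pos, make, and rest are hypothetical.

```r
x <- "Mazda Miata L"

# position of the first space; [1] drops the attributes
pos <- regexpr(" ", x)[1]

# everything before the first space, and everything after it
make <- substring(x, 1, pos - 1)
rest <- substring(x, pos + 1)

print(make)
print(rest)
```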
The function gregexpr() returns the positions of all matches in a string and not just the first one. Inspecting the return value m reveals that it is a list with one element per input string, and that the first element is a vector of all occurrences of the character:
x <- "Mazda Miata L"
# find where the pattern starts
m <- gregexpr(" ", x)
print(m)
## [[1]]
## [1] 6 12
## attr(,"match.length")
## [1] 1 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
# all occurrences of " "
print(unlist(m))
## [1] 6 12
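The vector of positions can be used to break the string into its words. This is only a sketch: it assumes the string contains at least one space and that words are separated by single spaces, and the names spaces, starts, ends, and words are illustrative.

```r
x <- "Mazda Miata L"
spaces <- unlist(gregexpr(" ", x))

# each word starts right after a space and ends right before the next one
starts <- c(1, spaces + 1)
ends <- c(spaces - 1, nchar(x))

words <- substring(x, starts, ends)
print(words)
```

In practice strsplit(x, " ") gives the same result more directly; the version above is meant to show how the match positions relate to the pieces of the string.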
Regular Expression Syntax
Several string processing functions take regular expressions as input. The table below summarizes the most common regular expression syntax constructs.
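| Syntax | Meaning |
|---|---|
| ^ | Beginning of the string |
| $ | End of the string |
| [ab] | a or b |
| [^ab] | Any character except a and b |
| [0-9] | Any digit |
| [A-Z] | Any uppercase letter from A to Z |
| [a-z] | Any lowercase letter from a to z |
| [A-z] | Any letter, plus the characters [ \ ] ^ _ ` that lie between Z and a in ASCII; use [A-Za-z] to match letters only |
| i+ | i at least one time |
| i* | i zero or more times |
| i? | i zero or one time |
| i{n} | i occurs n times in sequence |
| [:alnum:] | Alphanumeric characters: [:alpha:] and [:digit:] |
| [:alpha:] | Alphabetic characters: [:lower:] and [:upper:] |
| [:blank:] | Blank characters, e.g., space, tab |
| [:digit:] | Digits: 0 1 2 3 4 5 6 7 8 9 |
| [:punct:] | Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { \| } ~ |

The POSIX classes such as [:digit:] are only valid inside a bracket expression, so in a pattern they are written with double brackets, e.g., [[:digit:]].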
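To see several of these constructs working together, here is a small sketch using grepl on a made-up vector of identifiers. The vector ids and the result names are assumptions for illustration and do not come from any real data set.

```r
ids <- c("A12", "AB123", "12A", "A-1", "A")

# ^ and $ anchor the pattern to the whole string:
# one uppercase letter followed by one or more digits
letter_then_digits <- grepl("^[A-Z][0-9]+$", ids)

# POSIX classes go inside an extra pair of brackets;
# * allows zero digits, so a lone letter also matches
letter_opt_digits <- grepl("^[[:alpha:]][[:digit:]]*$", ids)

# two digits in a row, anywhere in the string
two_digits <- grepl("[0-9]{2}", ids)

print(ids[letter_then_digits])
print(ids[letter_opt_digits])
print(ids[two_digits])
```

Comparing the first two results shows the difference between + (at least one) and * (zero or more).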
Examples
# remove the leading $ from a vector of "numbers", convert to numbers, then calculate the mean
v <- c("$99.12","$3387.3","$0.998")
v <- as.numeric(substring(v, 2))
mean(v)
## [1] 1162.473
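Dropping the first character with substring() only works when every string really starts with a $. A more defensive variant, sketched below, uses sub() with an anchored pattern so that only a leading dollar sign is removed and strings without one are left alone. The input vector and the names clean and nums are hypothetical.

```r
# the second value deliberately has no leading $
v <- c("$99.12", "3387.3", "$0.998")

# "^\\$" matches a dollar sign only at the very start of the string
clean <- sub("^\\$", "", v)

nums <- as.numeric(clean)
print(mean(nums))
```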
# remove the trailing * from a vector of strings (every string ends in *)
v <- c("US AIRWAYS*","CONTINENTAL*","LUFTHANSA*","UNITED*","WOW*")
v <- substring(v, 1, nchar(v)-1)
v
## [1] "US AIRWAYS" "CONTINENTAL" "LUFTHANSA" "UNITED" "WOW"
# remove * characters from a vector of strings if there is one
v <- c("US AIRWAYS*","CONTINENTAL*","LUFTHANSA","UNITED","WOW*")
# the function below finds the starting positions of all *
# if there is no *, it returns -1
p <- regexpr("\\*", v)
# the function below returns the indexes of the strings in the vector v
# which actually contain a *
g <- grep("\\*",v)
# only remove the * from those strings in the vector v that have it
v[g] <- substring(v[g],1,p[g]-1)
# our new vector with * removed
print(v)
## [1] "US AIRWAYS" "CONTINENTAL" "LUFTHANSA" "UNITED" "WOW"
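The same clean-up can usually be written in a single step with gsub(), a base R function not used above that replaces every match of a pattern. The sketch below is an alternative under that assumption, not the approach taken in the example; the names cleaned and trimmed are illustrative.

```r
v <- c("US AIRWAYS*", "CONTINENTAL*", "LUFTHANSA", "UNITED", "WOW*")

# replace every literal * with the empty string;
# strings without a * are returned unchanged
cleaned <- gsub("\\*", "", v)
print(cleaned)

# to remove a * only when it is the last character, anchor with $
trimmed <- sub("\\*$", "", v)
print(trimmed)
```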
# remove all commas from a number string, as coercion with as.numeric(s)
# fails if the string contains commas as thousands separators
s <- "100,340,998"
numParts <- unlist(strsplit(s, ","))
num <- paste(numParts, collapse = "")
n <- as.numeric(num)
print(n)
## [1] 100340998
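Because the substitution functions are vectorized, a whole vector of such strings can be converted at once. A brief sketch, assuming the comma is only ever used as a thousands separator; the vector amounts is made up for illustration.

```r
amounts <- c("100,340,998", "1,250", "42")

# fixed = TRUE treats "," as a plain string; no regex features are needed here
n <- as.numeric(gsub(",", "", amounts, fixed = TRUE))
print(n)
```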
grep and fgrep
While the use of regular expressions for string processing in programming is common, they are also available for text searching from the command line in Unix. Let’s look at two text searching utilities for Unix: grep and fgrep. Both commands are used for searching text, but they interpret patterns differently. Here’s a breakdown of their differences:
Pattern Interpretation:
- grep: Stands for “global regular expression print.” It searches for a pattern using regular expressions (regex) by default. So, if you provide a pattern with special regex characters, such as . (dot) or *, grep will interpret them as regex metacharacters. For instance, . will match any character, and * will match the preceding character zero or more times. For example, with grep the pattern “a.b” will not match “ab” because the dot requires some character to be present between “a” and “b”.
- fgrep: Stands for “fixed grep” or “fast grep.” It treats the pattern as a fixed string. This means it does not recognize regular expression metacharacters as special; they are treated as regular characters. Thus, fgrep is useful when you want to search for a string that might contain regex metacharacters and you don’t want them to be interpreted as such. For example, with fgrep the pattern “a.b” matches only the literal string “a.b” because fgrep doesn’t interpret the dot as a regex character.
Speed:
- Due to its literal interpretation of patterns, fgrep can be faster than grep when searching for fixed strings because it doesn’t need to process regex metacharacters.
Alternatives:
- grep -F is equivalent to fgrep; GNU grep documents fgrep as an obsolescent name for grep -F.
Multiple Patterns:
One advantage of fgrep (or grep -F) is the ability to search for multiple fixed strings efficiently by placing each string on a new line in a pattern file.
fgrep -f pattern_file.txt target_file.txt
In the above command, each line in pattern_file.txt is treated as a fixed string, and fgrep searches for any of these lines in target_file.txt.
In summary, use grep when you need the power and flexibility of regular expressions. Opt for fgrep (or grep -F) when you want to search for exact, fixed strings, especially when these strings might contain characters that have special meanings in regex.
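The same distinction exists inside R: the fixed = TRUE argument of grepl plays roughly the role of grep -F. The sketch below is an R analogy rather than a transcript of the shell commands, and the vector lines is invented for illustration.

```r
lines <- c("a.b", "axb", "ab")

# regular expression: the dot matches any single character
as_regex <- grepl("a.b", lines)

# fixed string: the dot is just a dot
as_fixed <- grepl("a.b", lines, fixed = TRUE)

print(lines[as_regex])
print(lines[as_fixed])
```

With the regular expression both "a.b" and "axb" are found, while the fixed-string search finds only "a.b"; neither finds "ab".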
Detailed Regular Expressions
Classes and Literals
- Literals:
  - The simplest form of pattern matching supported by regular expressions is the match of a string literal.
      hello
- Dot (.):
  - The . character matches any single character except a newline.
      h.llo
- Character Classes ([]):
  - A set of characters enclosed in square brackets ([]) makes up a character class. This matches any single character in the set.
      [aeiou]    # matches any single vowel
- Negated Character Classes ([^ ]):
  - By placing a ^ symbol as the first character inside the square brackets, only characters not in the character class will be matched.
      [^aeiou]    # matches any single non-vowel character
- Predefined Character Classes:
  - \d: Matches any digit. Equivalent to [0-9].
  - \D: Matches any non-digit. Equivalent to [^0-9].
  - \w: Matches any word character (alphanumeric and underscore). Equivalent to [a-zA-Z0-9_].
  - \W: Matches any non-word character.
  - \s: Matches any whitespace character (spaces, tabs, line breaks).
  - \S: Matches any non-whitespace character.
- Quantifiers:
  - *: Matches 0 or more of the preceding element.
      ab*c    # matches "ac", "abc", "abbc", etc.
  - +: Matches 1 or more of the preceding element.
      ab+c    # matches "abc", "abbc", but not "ac"
  - ?: Matches 0 or 1 of the preceding element.
      ab?c    # matches "ac" or "abc"
  - {n}: Matches exactly n occurrences of the preceding element.
  - {n,}: Matches n or more occurrences.
  - {n,m}: Matches between n and m occurrences.
      a{2,4}    # matches "aa", "aaa", or "aaaa"
- Anchors:
  - ^: Matches the start of a line.
  - $: Matches the end of a line.
      ^Hello    # matches "Hello" at the beginning of a line
      world$    # matches "world" at the end of a line
- Alternation (|):
  - Acts like an OR operator.
      cat|dog    # matches "cat" or "dog"
- Grouping (()):
  - Elements within parentheses are treated as a single unit.
  - Also captures the matched text for use with backreferences.
      (abc)+    # matches "abc", "abcabc", etc.
- Backreferences:
  - Refers to the text matched by a previous capture group.
  - \1 refers to the text of the first capture group, \2 to the second, and so on.
      (a)b\1    # matches "aba"
- Escape Sequences:
  - To match characters that have special meanings in regex, you escape them with a backslash \.
      \.    # matches a literal dot
      \\    # matches a backslash
      \[    # matches a square bracket
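The constructs listed above can be tried out directly in R. The following is a small sketch rather than an exhaustive test: the vector words and the helper function matching() are hypothetical names introduced here, and every backslash in a pattern is doubled because the pattern is written as an R string.

```r
words <- c("ac", "abc", "abbc", "cat", "dog", "abcabc", "aba", "a.b", "x1")

# hypothetical helper: return the elements of words that match a pattern
matching <- function(pattern) words[grepl(pattern, words)]

star  <- matching("^ab*c$")    # zero or more b
plus  <- matching("^ab+c$")    # one or more b
opt   <- matching("^ab?c$")    # zero or one b
alt   <- matching("cat|dog")   # alternation
grp   <- matching("^(abc)+$")  # a repeated group
bref  <- matching("(a)b\\1")   # backreference to the first group
dot   <- matching("a\\.b")     # escaped, literal dot
digit <- matching("\\d")       # predefined digit class

print(list(star = star, plus = plus, opt = opt, alt = alt,
           grp = grp, bref = bref, dot = dot, digit = digit))
```

The anchors ^ and $ are added in several patterns so that the whole string has to match; without them grepl reports a match as soon as the pattern occurs anywhere in the string.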
Advanced Concepts:
- Lookaheads and Lookbehinds:
  - Lookaheads allow you to match a group ahead of your main pattern without including it in the result.
      q(?=u)    # matches the "q" in "queue" but not the "u"
  - Lookbehinds are similar, but they look behind.
      (?<=a)b    # matches the "b" after an "a"
- Non-capturing Groups:
  - Use (?:...) to create a non-capturing group.
      (?:abc)+    # matches "abc", "abcabc", etc., but does not capture the match
- Modifiers:
  - Some regex engines allow modifiers to change how the regex is interpreted.
      (?i)abc    # matches "abc", "ABC", "AbC", etc. (case-insensitive)
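In R, lookaheads and lookbehinds are features of the Perl-compatible engine, which is selected with the perl = TRUE argument of the matching functions. The sketch below tries each advanced construct on small made-up inputs; all variable names are illustrative.

```r
# lookahead: match "q" only when it is followed by "u"
la <- regexpr("q(?=u)", "queue", perl = TRUE)
la_pos <- la[1]
la_len <- attr(la, "match.length")  # the "u" is not part of the match

# lookbehind: match "b" only when it is preceded by "a"
lb <- regexpr("(?<=a)b", "cab", perl = TRUE)
lb_pos <- lb[1]

# non-capturing group, repeated one or more times
nc <- grepl("^(?:abc)+$", c("abc", "abcabc", "ab"), perl = TRUE)

# inline modifier for case-insensitive matching
ci <- grepl("(?i)abc", c("abc", "ABC", "AbC", "xyz"), perl = TRUE)

print(c(la_pos, la_len, lb_pos))
print(nc)
print(ci)
```

For case-insensitive matching, grepl and the related functions also accept an ignore.case = TRUE argument, which avoids the inline modifier.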
Regular expressions are powerful but can be complex. Mastering them requires practice. Test your regular expressions using tools like regex101 to ensure they match what you intend and to understand their behavior better.
Resources
- regex101 for testing regular expressions
