Prerequisites

This lesson presumes that you understand the structure of an XML document. If you are not familiar with XML, consult this lesson first.

Introduction

XML is a common means of externalizing structured data and is widely used for data interchange between systems and organizations. There are many standard XML languages for expressing data in specific domains, such as finance, agriculture, and publishing, among others. It is also often used for configuration files and for web content.

Being able to extract data from XML is an important skill that every data programmer must master. This lesson explains how to use the functions from the XML package to extract data from XML files (also often called XML documents) in R. Other programming languages have very similar mechanisms, and the skills learned in this lesson can be transferred to other languages, such as JavaScript, Java, C++, Rust, and many others.

Most often the data extracted from XML is analyzed statistically or used for constructing machine learning models, as well as being added to relational databases. These use cases are outside the scope of this lesson.

Required Packages

Extracting data from XML requires, at a minimum, the following package:

  • XML

In addition, these packages contain functions that are often quite useful:

  • stringr

Note that XML is not the only R package available for working with XML.

Loading an XML File

The process of extracting data starts by “parsing” the XML file from either a local folder or a URL. The code fragment below illustrates how to load and parse one of the included XML files (BookCatalog.xml) from a local file as well as from a URL. Download the file and run the code in your own R Notebook or R script.

library(XML)

xmlFile <- "BookCatalog.xml"
xmlURL <- "http://artificium.us/lessons/06.r/l-6-114-parse-xml-r-primer/BookCatalog.xml"

dom1 <- xmlParse(xmlFile)
dom2 <- xmlParse(xmlURL)

If you get the error “Error: XML content does not seem to be XML: ’’” then it is most likely that the file name, path, or URL is not correct.
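One way to guard against a wrong path is to check for the file before parsing and to catch parser errors. The sketch below is only an illustration: safeParse() is a hypothetical helper, not a function of the XML package, and it handles local files only (file.exists() does not work for URLs). The temporary file merely stands in for BookCatalog.xml.

```r
library(XML)

# Hypothetical helper (not part of the XML package): returns the parsed
# document, or NULL if the file is missing or cannot be parsed.
safeParse <- function(path) {
  if (!file.exists(path)) {
    message("File not found: ", path)
    return(NULL)
  }
  tryCatch(
    xmlParse(path),
    error = function(e) {
      message("Could not parse ", path, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Write a tiny XML file to a temporary location so the sketch is self-contained
tmp <- tempfile(fileext = ".xml")
writeLines("<catalog><book id='bk101'/></catalog>", tmp)

dom <- safeParse(tmp)                    # parsed document
absent <- safeParse("no-such-file.xml")  # NULL, with a message
```

Returning NULL instead of stopping lets a script decide how to react, for example by skipping the file or asking for a different path.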

Document Object Model (DOM)

The function xmlParse() returns a reference to an internal tree of nodes that represents the document object model (DOM) for the XML. Processing of the nodes and extraction of data is done through that reference.

Alternatively, one could use the function xmlTreeParse(), which will also return a DOM but one that is represented as an R data structure rather than a C data structure [1]. While xmlParse() is more efficient and uses less memory, xmlTreeParse() has the benefit of element access using the $ operator. In practice, xmlParse() is most commonly used, though.
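The difference between the two representations can be seen by inspecting the class of the returned objects. The following is a minimal sketch that parses a made-up one-book document from a string (using the asText argument) rather than from a file.

```r
library(XML)

# Made-up document, parsed from a string instead of a file
xmlText <- "<catalog><book id='bk101'><title>Midnight Rain</title></book></catalog>"

# xmlParse() returns a reference to a C-level (internal) document
domInternal <- xmlParse(xmlText, asText = TRUE)

# xmlTreeParse() builds the tree as nested R objects
domR <- xmlTreeParse(xmlText, asText = TRUE)

print(class(domInternal))
print(class(domR))

# xmlRoot() and xmlName() work on either representation
print(xmlName(xmlRoot(domInternal)))
print(xmlName(xmlRoot(domR)))
```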

The variable dom (or whichever variable you assigned the return value of xmlParse() to) points to an in-memory representation of the XML tree. All access to the XML elements is via that pointer.

Processing an XML document through its DOM requires that the document be loaded completely into memory. Naturally, this is only feasible when sufficient memory is available. For very large XML files, another approach is available for extraction: SAX, an event-driven parser that processes the document as a stream. This is beyond the scope of this tutorial.

Validation

By default, xmlParse() does not validate the file against any DTD or XML Schema; it only checks whether the XML is well-formed. To ensure that the XML conforms to the rules of a DTD or XML Schema, the parameter validate=T must be specified. Of course, this parameter is only meaningful if there is a DTD or Schema; if there isn’t and validate=T is specified then an error will result.

The XML file pagevisits.xml contains a DTD and therefore parsing with validation is possible.

library(XML)

xmlFile <- "pagevisits.xml"

dom <- xmlParse(xmlFile, validate=T)
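
To see validation at work without pagevisits.xml, the sketch below embeds a small, made-up internal DTD in a string and parses it with validate = TRUE. The helper parseValidated() is hypothetical and simply turns a parser error into NULL. The second document uses an element that the DTD does not declare, so the parser should reject it.

```r
library(XML)

# A small document with an internal DTD (made up for illustration)
validXml <- paste0(
  "<!DOCTYPE visits [",
  "<!ELEMENT visits (visit*)>",
  "<!ELEMENT visit (#PCDATA)>",
  "]>",
  "<visits><visit>home</visit></visits>"
)

# Same DTD, but <page> is not declared: well-formed, yet not valid
invalidXml <- sub("<visit>home</visit>", "<page>home</page>",
                  validXml, fixed = TRUE)

# Hypothetical helper: parse with validation, return NULL on failure
parseValidated <- function(txt) {
  tryCatch(xmlParse(txt, asText = TRUE, validate = TRUE),
           error = function(e) NULL)
}

ok  <- parseValidated(validXml)
bad <- parseValidated(invalidXml)

print(is.null(ok))
print(is.null(bad))
```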

Other useful parameters for xmlParse() include:

  • trim – a Boolean indicating whether to strip leading and trailing whitespace from values
  • getDTD – a Boolean flag indicating whether the DTD (both internal and external) should be returned along with the nodes
  • isURL – a Boolean indicating whether the document path is a URL; this is not strictly required if the URL starts with a common protocol such as http://

If you need to process HTML documents (which resemble XML but are often not well-formed), then the function htmlParse() is preferable because it tolerates malformed markup.
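
As a brief illustration, the sketch below parses a deliberately sloppy, made-up HTML fragment from a string with htmlParse() and then queries it with xpathSApply(), the same XPath function that is used for XML later in this lesson.

```r
library(XML)

# Deliberately sloppy HTML: unclosed <p> and <br> tags
html <- "<html><body><h1>Catalog</h1><p>First book<p>Second book<br></body></html>"

doc <- htmlParse(html, asText = TRUE)

# The same XPath functions work on the parsed HTML tree
heading <- xpathSApply(doc, "//h1", xmlValue)
paragraphs <- xpathSApply(doc, "//p", xmlValue)

print(heading)
print(paragraphs)
```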

Extracting Elements

The data in an XML document is contained in elements, which are marked by pairs of tags, e.g., <tag>value</tag>. There are two common ways to extract the values of elements (also sometimes called “nodes”): indexing into the DOM object and XPath. Let’s take a look at both, starting with indexing.

To illustrate the techniques, we will use a simple example XML (SimpleXML.xml) whose root element contains two <book> elements.

<catalog>
   <book id="bk101" edition="3">
      <author>
        <surname>Gambardella</surname>
        <given>Matthew</given>
      </author>
      <title>XML Developer's Guide</title>
      <outofprint />
      <price currency="R$">349</price>
   </book>
   <book id="bk102" edition="1">
      <author>
        <surname>Ralls</surname>
        <given>Kim</given></author>
      <title>Midnight Rain</title>
      <price currency="US$">5.95</price>
   </book>
</catalog>

Let’s start by loading the XML into a DOM object. As it has no associated DTD, we will not validate the file during parsing.

xmlDoc <- xmlParse("SimpleXML.xml", validate=F)

Indexed Access

The first step in accessing the elements (nodes) of the DOM tree is to get the root element (for the above XML that would be <catalog>) using the function xmlRoot().

root <- xmlRoot(xmlDoc)

Accessing the i-th child node directly underneath the root can be done using the list access operator [[i]], as the children of a node can be indexed like the elements of an R list.

# access the first child node underneath the root
aNode <- root[[1]]

print(aNode)
## <book id="bk101" edition="3">
##   <author>
##     <surname>Gambardella</surname>
##     <given>Matthew</given>
##   </author>
##   <title>XML Developer's Guide</title>
##   <outofprint/>
##   <price currency="R$">349</price>
## </book>

This can be continued down the tree. For example, to access the given name element of the author of the second book, you would use:

aNode <- root[[2]][[1]][[2]]
print(aNode)
## <given>Kim</given>

To get its value, you would need to use the function xmlValue().

v <- xmlValue(aNode)

print(v)
## [1] "Kim"
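
Counting positions by hand is error-prone, so it can help to list the names of a node's children before indexing. The sketch below is one possible way to do that. It parses a compact copy of the first <book> from a string, so it does not depend on SimpleXML.xml being present.

```r
library(XML)

# Compact copy of the first <book> from SimpleXML.xml
xmlText <- paste0(
  "<catalog><book id='bk101' edition='3'>",
  "<author><surname>Gambardella</surname><given>Matthew</given></author>",
  "<title>XML Developer's Guide</title><outofprint/>",
  "<price currency='R$'>349</price>",
  "</book></catalog>"
)
root <- xmlRoot(xmlParse(xmlText, asText = TRUE))
book <- root[[1]]

# Names of the direct children, in document order
childNames <- names(xmlChildren(book))
print(childNames)

# Look up the position of <title> instead of hard-coding it
pos <- match("title", childNames)
print(xmlValue(book[[pos]]))
```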

The image below illustrates the syntax to access child elements by position within the DOM tree.

Iterating through Nodes

One of the advantages of using this mode of access is that the tree can be processed in a loop. For example, to get all the last names of the authors of all books, we can loop through the nodes. The function xmlSize() returns the number of direct child nodes underneath a given node.

The code fragment below extracts the surnames of all authors and places them into a vector. Each root[[i]][[1]][[1]] returns the first child of the first child of the i-th node underneath <catalog>. So, if you visualize the tree of the XML, that is <book><author><surname>.

# number of <book> nodes
n <- xmlSize(root)

# pre-allocated character vector for the surnames
surnames <- character(n)

for (i in seq_len(n))
{
  surnames[i] <- xmlValue(root[[i]][[1]][[1]])
}

cat(surnames)
## Gambardella Ralls

Of course, rather than placing them into a standalone vector, one can also store them in a column of a data frame, since each column is, after all, simply a vector.
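
The same extraction can also be written without an explicit loop. The sketch below is one alternative, not the only one: it applies a function to each child returned by xmlChildren() using sapply(). The XML is a trimmed, made-up copy of the sample that contains only the author elements.

```r
library(XML)

# Trimmed copy of the sample with only the <author> elements
xmlText <- paste0(
  "<catalog>",
  "<book><author><surname>Gambardella</surname><given>Matthew</given></author></book>",
  "<book><author><surname>Ralls</surname><given>Kim</given></author></book>",
  "</catalog>"
)
root <- xmlRoot(xmlParse(xmlText, asText = TRUE))

# Apply a function to every <book> child of the root
surnames <- sapply(xmlChildren(root), function(book) {
  xmlValue(book[[1]][[1]])   # <book><author><surname>
})

# sapply() keeps the element names ("book"), so drop them
surnames <- unname(surnames)
print(surnames)
```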

Optional Elements

Dealing with optional elements can be a bit tricky when extracting nodes using indexed access. For example, in the sample XML, the element <outofprint /> is not present in all <book> nodes, so the <price> element can be child node 3 or 4, depending on whether <outofprint /> appears before it. One technique is to use the function xmlName() to check whether the name of the third element is price or outofprint.

The code fragment below also illustrates the conversion of text values to numbers. All values returned from xmlValue() are of type “character”, so using them as numeric values requires explicit coercion using as.numeric(). Of course, if the value had non-digit characters, such as “$”, then some string extraction would first be needed.

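n <- xmlSize(root)

for (i in 1:n)
{
  # name of the third child: either "outofprint" or "price"
  node <- xmlName(root[[i]][[3]])
  if (node == "outofprint")
    price.node <- 4
  else
    price.node <- 3
  
  # extract the price text and convert it to a number
  price <- xmlValue(root[[i]][[price.node]])
  price.value <- as.numeric(price)
}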
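
An alternative that avoids position arithmetic is to look up children by name. The sketch below relies on xmlChildren() returning a list whose names are the element names. The XML is a trimmed copy of the sample, and the variable names are only illustrative.

```r
library(XML)

# Trimmed copy of the sample: only the first book has <outofprint/>
xmlText <- paste0(
  "<catalog>",
  "<book><title>XML Developer's Guide</title><outofprint/>",
  "<price>349</price></book>",
  "<book><title>Midnight Rain</title><price>5.95</price></book>",
  "</catalog>"
)
root <- xmlRoot(xmlParse(xmlText, asText = TRUE))

n <- xmlSize(root)
prices <- numeric(n)
outOfPrint <- logical(n)

for (i in seq_len(n)) {
  kids <- xmlChildren(root[[i]])   # named list of child nodes
  prices[i] <- as.numeric(xmlValue(kids[["price"]]))
  outOfPrint[i] <- "outofprint" %in% names(kids)
}

print(prices)
print(outOfPrint)
```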

Storing XML Data in Data Frame

The code below extracts the titles and the prices and places them into a data frame. This is a common strategy when converting the data from XML to a tabular format for externalization in a CSV or when saving the data to a relational database.

# number of <book> nodes
n <- xmlSize(root)

# pre-allocated data frame with n rows
df <- data.frame(title = character(n),
                 price = numeric(n))

for (i in 1:n)
{
  title <- xmlValue(root[[i]][[2]])
  node <- xmlName(root[[i]][[3]])
  if (node == "outofprint")
    price.node <- 4
  else
    price.node <- 3
  
  price <- xmlValue(root[[i]][[price.node]])
  price.value <- as.numeric(price)
  
  # add to data frame
  df[i,"title"] <- title
  df[i,"price"] <- price.value
}

print(head(df,3))
##                   title  price
## 1 XML Developer's Guide 349.00
## 2         Midnight Rain   5.95

The XML package contains the function xmlToDataFrame(), which can extract XML data to a data frame more conveniently, but only if the XML has two levels and the elements are in the same order. It would not work properly on the sample XML SimpleXML.xml. See Lesson 6.323 Load Simple XML into Dataframe in R using xmlToDataFrame().
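
For comparison, here is a minimal sketch of xmlToDataFrame() on a flat, made-up document in which every <book> has the same child elements in the same order. It is not the sample file, which has nested and optional elements.

```r
library(XML)

# A flat, two-level document (made up for illustration)
xmlText <- paste0(
  "<catalog>",
  "<book><title>XML Developer's Guide</title><price>349</price></book>",
  "<book><title>Midnight Rain</title><price>5.95</price></book>",
  "</catalog>"
)
doc <- xmlParse(xmlText, asText = TRUE)

# One row per <book>, one column per child element
df.flat <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# Columns come back as text, so convert where needed
df.flat$price <- as.numeric(df.flat$price)

print(df.flat)
```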

Extracting Attributes

Data is not only in elements but can also be in attributes of elements. For example, in SimpleXML.xml, the edition and id are attributes of the <book> element, as shown below.

<catalog>
   <book id="bk101" edition="3">
      <author>
        <surname>Gambardella</surname>
        <given>Matthew</given>
      </author>
      ...

Attributes are extracted using the xmlAttrs() function, to which an element is passed. The function returns a named character vector of all attributes, so an individual attribute can be accessed with the double-bracket operator [[]], either by position or by name (e.g., book.attrs[["edition"]]).

n <- xmlSize(root)

for (i in 1:n)
{
  # get the i-th book
  aBook <- root[[i]]
  
  # get attributes of the i-th book
  book.attrs <- xmlAttrs(aBook)
  
  # second attribute in the list is the edition
  edition <- book.attrs[[2]]
  
  print(edition)
}
## [1] "3"
## [1] "1"

To extract a specific attribute, the function xmlGetAttr() is often more convenient. Like values, attributes are returned as character strings, so they must be converted to the appropriate data type using coercion functions such as as.numeric().

n <- xmlSize(root)

for (i in 1:n)
{
  # get the i-th book
  aBook <- root[[i]]
  
  # get the value of the attribute "edition"
  edition <- xmlGetAttr(aBook, "edition")
  
  print(edition)
}
## [1] "3"
## [1] "1"
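
Attributes can also be optional. The sketch below shows one way to handle that with the default and converter arguments of xmlGetAttr(). The document is made up: unlike SimpleXML.xml, its second book has no edition attribute.

```r
library(XML)

# Made-up document: the second book has no "edition" attribute
xmlText <- paste0(
  "<catalog>",
  "<book id='bk101' edition='3'/>",
  "<book id='bk102'/>",
  "</catalog>"
)
root <- xmlRoot(xmlParse(xmlText, asText = TRUE))

# Access by name on the named character vector returned by xmlAttrs()
attrs <- xmlAttrs(root[[1]])
print(attrs[["id"]])

# xmlGetAttr() with a fallback value and a converter
editions <- sapply(xmlChildren(root), function(book) {
  xmlGetAttr(book, "edition", default = NA_integer_, converter = as.integer)
})
editions <- unname(editions)
print(editions)
```

Accessing attributes by name rather than by position keeps the code working even if the order of the attributes in the file changes.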

Pretty Tables with “kable”

The code below demonstrates the use of the kableExtra package, which builds on the function knitr::kable(). First, we extract the title and the author of each book into a data frame, and then we use kableExtra to “pretty print” the data frame.

library(XML)

xmlURL <- "http://artificium.us/lessons/06.r/l-6-114-parse-xml-r-primer/BookCatalog.xml"

xmlDOM <- xmlParse(xmlURL, validate = F)

r <- xmlRoot(xmlDOM)

# number of <book> nodes
n <- xmlSize(r)

df.books <- data.frame(
  title = character(n),
  author = character(n)
)

for (i in 1:n)
{
  ## access the ith book node
  aBook <- r[[i]]
  
  ## extract title (child 2) and author (child 1) from
  ## the book node
  theTitle <- xmlValue(aBook[[2]])
  theAuthor <- xmlValue(aBook[[1]])
  
  ## store values in the ith row of the data frame
  df.books[i,1] <- theTitle
  df.books$author[i] <- theAuthor
}

head(df.books)
##                   title               author
## 1 XML Developer's Guide Gambardella, Matthew
## 2         Midnight Rain           Ralls, Kim
## 3       Maeve Ascendant          Corets, Eva
## 4       Oberon's Legacy          Corets, Eva
## 5    The Sundered Grail          Corets, Eva
## 6           Lover Birds     Randall, Cynthia

Example 1

Note that only the first six rows of the data frame are displayed.

library(kableExtra)

df.books[1:6,] %>%
  kbl() %>%
  kable_paper("hover", full_width = F)

title                  author
XML Developer’s Guide  Gambardella, Matthew
Midnight Rain          Ralls, Kim
Maeve Ascendant        Corets, Eva
Oberon’s Legacy        Corets, Eva
The Sundered Grail     Corets, Eva
Lover Birds            Randall, Cynthia

Example 2

This example uses a different format style and also prints the table over the entire width of the document.

df.books[1:6,] %>%
  kbl(caption = "Books by Title with Author") %>%
  kable_classic(full_width = T, html_font = "Cambria")

Table 1: Books by Title with Author

title                  author
XML Developer’s Guide  Gambardella, Matthew
Midnight Rain          Ralls, Kim
Maeve Ascendant        Corets, Eva
Oberon’s Legacy        Corets, Eva
The Sundered Grail     Corets, Eva
Lover Birds            Randall, Cynthia

Tutorial I

The content from the lesson to this point is narrated in the code walk below.

XPath

A simpler and more elegant, albeit less flexible and perhaps less efficient, way to access elements and attributes is to use XPath expressions. We will redo the above extractions using XPath rather than indexed node access.

Let’s start first by demonstrating how to execute an XPath query on an XML document. After the XML is parsed using xmlParse(), the XPath expression is executed using xpathSApply(). The function returns a list (not a vector) of all nodes that match the XPath path expression.

library(XML)

xmlDoc <- xmlParse("SimpleXML.xml", validate=F)

# XPath expression to get titles of all book nodes
xpathExpr <- "//book/title"

# execute/apply the XPath query
rs <- xpathSApply(xmlDoc, xpathExpr)

# print the list of matching elements
print(rs)
## [[1]]
## <title>XML Developer's Guide</title> 
## 
## [[2]]
## <title>Midnight Rain</title>

To get the values of the elements, add “xmlValue” as a parameter as shown below and xpathSApply() returns a vector of element values.

library(XML)

xmlDoc <- xmlParse("SimpleXML.xml", validate=F)

# XPath expression to get titles of all book nodes
xpathExpr <- "//book/title"

# execute/apply the XPath query and extract values
rs <- xpathSApply(xmlDoc, xpathExpr, xmlValue)

print(rs)
## [1] "XML Developer's Guide" "Midnight Rain"
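
XPath expressions can also filter nodes with predicates. The sketch below shows a few standard XPath 1.0 patterns on a trimmed copy of the sample document. The variable names are illustrative only.

```r
library(XML)

# Trimmed copy of the sample document
xmlText <- paste0(
  "<catalog>",
  "<book id='bk101'><title>XML Developer's Guide</title><outofprint/></book>",
  "<book id='bk102'><title>Midnight Rain</title></book>",
  "</catalog>"
)
doc <- xmlParse(xmlText, asText = TRUE)

# title of the book with a given id attribute
byId <- xpathSApply(doc, "//book[@id='bk102']/title", xmlValue)

# titles of books that have an <outofprint> child
oop <- xpathSApply(doc, "//book[outofprint]/title", xmlValue)

# title of the first <book> under <catalog>
first <- xpathSApply(doc, "/catalog/book[1]/title", xmlValue)

print(byId)
print(oop)
print(first)
```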

The example below extracts the title, price, and edition using XPath expressions and adds them to a data frame. Notice how we are no longer concerned about whether the optional <outofprint/> element is part of a <book> node or not. XPath simplifies access.

library(XML)

xmlDoc <- xmlParse("SimpleXML.xml", validate=F)

titles <- xpathSApply(xmlDoc, 
                      "//book/title", xmlValue)

prices <- xpathSApply(xmlDoc, 
                      "//book/price", xmlValue)

editions <- xpathSApply(xmlDoc, 
                      "//book/@edition")


# create new data frame
df <- data.frame(title = titles,
                 price = as.numeric(prices),
                 edition = as.numeric(editions))


print(head(df,3))
##                   title  price edition
## 1 XML Developer's Guide 349.00       3
## 2         Midnight Rain   5.95       1

Notice how we do not use xmlValue when retrieving attribute values: the XPath expression //book/@edition already yields the attribute values as character strings.

Missing Elements

Checking whether an element is present is a bit more difficult with XPath than with indexed access. The function xpathSApply() returns an empty list when the XPath expression has no matching elements. We can check whether a list is empty by finding its length (or size) using the function length(). If it is empty, then its length is 0.

xpath <- "//book/genre"

# XPath expression should not return a value
rs <- xpathSApply(xmlDoc, xpath, xmlValue)

if (length(rs) == 0)
{
  print("no genre")
}
## [1] "no genre"
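
When an element is missing from only some of the books, a document-wide XPath query returns fewer values than there are books, and the result no longer lines up with the other columns. One possible remedy, sketched below, is to query each <book> node separately and substitute NA when nothing matches. The <genre> element is made up for this illustration and is not part of SimpleXML.xml.

```r
library(XML)

# Made-up document: only the second book has a <genre>
xmlText <- paste0(
  "<catalog>",
  "<book><title>XML Developer's Guide</title></book>",
  "<book><title>Midnight Rain</title><genre>Fantasy</genre></book>",
  "</catalog>"
)
doc <- xmlParse(xmlText, asText = TRUE)

# one node per <book>
books <- getNodeSet(doc, "//book")

# query each book on its own; use NA when there is no <genre>
genres <- sapply(books, function(book) {
  g <- xpathSApply(book, "./genre", xmlValue)
  if (length(g) == 0) NA_character_ else g[[1]]
})

print(genres)
```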

XPath with Indexed Access

The aforementioned code presumes that every book has one title, one price, and an edition. If there are multiple titles or prices, or some do not exist, then a combination of node traversal and XPath would be necessary.

Let’s say we had a more complex XML that had multiple prices and we only wanted to extract the US prices (where the currency attribute is “US$”). Here’s what one of the nodes in the XML file SimpleXML-2.xml looks like:

<catalog>
   <book id="bk101" edition="3">
      <author>
        <surname>Gambardella</surname>
        <given>Matthew</given>
      </author>
      <title>XML Developer's Guide</title>
      <outofprint />
      <price currency="R$">349</price>
      <price currency="US$">29.95</price>
      <price currency="€">34.00</price>
   </book>
   ...
 </catalog>

There are multiple ways to solve this. One approach, of course, is to use an XPath expression that extracts price values only when the currency is “US$”. However, to demonstrate the mixing of indexed access and XPath, we will choose an approach that uses loops. This approach is generally slower (loops are slow, especially in R), but affords more flexibility.

library(XML)
xmlDoc <- xmlParse("SimpleXML-2.xml")

# get all <book> nodes from the XML
books <- xpathSApply(xmlDoc, "//book")

# number of <book> nodes in this document
n <- length(books)

# iterate over the <book> nodes
for (i in 1:n)
{
  # get the i-th book node
  aBook <- books[[i]]
  
  # use XPath to extract the <price> child elements
  price <- xpathSApply(aBook, 
                       "price[./@currency='US$']",
                       xmlValue)
  
  print(price)
}
## [1] "29.95"
## [1] "5.95"
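
For completeness, the pure XPath alternative mentioned above can be sketched as follows. The XML is a trimmed, made-up stand-in for SimpleXML-2.xml that keeps only the <price> elements.

```r
library(XML)

# Trimmed stand-in for SimpleXML-2.xml with only the <price> elements
xmlText <- paste0(
  "<catalog>",
  "<book><price currency='R$'>349</price>",
  "<price currency='US$'>29.95</price></book>",
  "<book><price currency='US$'>5.95</price></book>",
  "</catalog>"
)
doc <- xmlParse(xmlText, asText = TRUE)

# select only the prices whose currency attribute is "US$"
usPrices <- xpathSApply(doc, "//book/price[@currency='US$']", xmlValue)
usPrices <- as.numeric(usPrices)

print(usPrices)
```

This version is shorter, but it assumes that every book has exactly one US price. If some books have none, the result no longer lines up with the books, which is where the loop-based approach is more flexible.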

Summary of XML Functions

This section presents the most useful functions from the XML package. Naturally, the list is not exhaustive, and you should consult the package documentation for more information.

  • xmlValue
  • xmlSize
  • xmlRoot
  • xmlAttrs
  • xpathSApply
  • xmlParse
  • xmlName
  • xmlChildren

Summary

The XML package provides numerous functions for extracting data from XML documents in R.


Files & Resources

All Files for Lesson 6.114



  1. The XML package is written in C, so the “internal” representation of the DOM is a C data structure.

