---
title: "Traverse and Parse XML DOM in R"
params:
  category: 6
  number: 324
  time: 45
  level: beginner
  tags: "r,xpath,xml,dom"
  description: "Explains how to traverse an XML Document 
                Object Model (DOM)
                using a combination of XPath and node access."
date: "<small>`r Sys.Date()`</small>"
author: "<small>Martin Schedlbauer</small>"
email: "m.schedlbauer@neu.edu"
affilitation: "Northeastern University"
output: 
  bookdown::html_document2:
    toc: true
    toc_float: true
    collapsed: false
    number_sections: false
    code_download: true
    theme: spacelab
    highlight: tango
---

---
title: "<small>`r params$category`.`r params$number`</small><br/><span style='color: #2E4053; font-size: 0.9em'>`r rmarkdown::metadata$title`</span>"
---

```{r code=xfun::read_utf8(paste0(here::here(),'/R/_insert2DB.R')), include = FALSE}
```

## Introduction

This lesson demonstrates how to traverse the Document Object Model (DOM) of an XML document. The DOM is a tree that represents the nodes of an XML document. Because the entire tree is loaded into memory, DOM parsing is practical only when the XML document is reasonably small. For larger XML documents, event-based parsing with SAX (Simple API for XML) is advised.

## Load XML into DOM

The most common way to load an XML document from a file or URL is to use the function <code>xmlParse()</code> from the **XML** package. It returns a reference to an internal C-level document object. Another function that is sometimes used is <code>xmlTreeParse()</code>, which by default returns the tree as an R object. It is generally slower and less commonly used. It has the benefit of allowing symbolic traversal of the nodes, while results from <code>xmlParse()</code> are typically queried with XPath and indexed node access.


```{r}
library(XML)

xmlURL <- "http://artificium.us/lessons/06.r/l-6-324-parse-xml-dom/HockeyTeamRosters.xml"
xmlObj <- xmlParse(xmlURL)
```
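
The difference between the two parsers can be sketched on a small in-memory document. The XML text below is made-up sample data that only imitates the element names used in this lesson (the root element name `teams` is an assumption), so the example runs without network access.

```r
library(XML)

# Made-up miniature roster; only the element names follow the lesson's file.
xmlText <- paste0(
  "<teams>",
  "<team name='Pirates'><players>",
  "<player num='7'><firstname>Ann</firstname><lastname>Able</lastname>",
  "<position>Forward</position><salary>1000000</salary></player>",
  "</players></team>",
  "</teams>"
)

# xmlParse() keeps the tree in C-level structures (an internal document).
internalDoc <- xmlParse(xmlText, asText = TRUE)
print(class(internalDoc))

# xmlTreeParse() converts the tree into nested R objects by default.
rDoc <- xmlTreeParse(xmlText, asText = TRUE)
print(class(rDoc))

# Name-based traversal of the R-level tree, one level at a time.
rRoot <- xmlRoot(rDoc)
lastName <- xmlValue(rRoot[["team"]][["players"]][["player"]][["lastname"]])
print(lastName)
```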

To get a reference to the root node of the DOM, use <code>xmlRoot()</code>.

```{r}
root <- xmlRoot(xmlObj)
```
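
Once the root node is available, a few accessor functions give a quick overview of the tree. The sketch below again uses made-up sample data (the root name `teams` and the team names are assumptions, not the contents of the lesson's file).

```r
library(XML)

# Made-up sample data with two teams and no players, for brevity.
xmlText <- paste0(
  "<teams>",
  "<team name='Pirates'><players/></team>",
  "<team name='Comets'><players/></team>",
  "</teams>"
)

sampleDoc <- xmlParse(xmlText, asText = TRUE)
sampleRoot <- xmlRoot(sampleDoc)

# tag name of the root element
rootName <- xmlName(sampleRoot)

# number of child nodes directly below the root
numChildren <- xmlSize(sampleRoot)

# read the 'name' attribute of every child of the root
teamNames <- unname(sapply(xmlChildren(sampleRoot), xmlGetAttr, "name"))

print(rootName)
print(numChildren)
print(teamNames)
```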

## Retrieving Elements using XPath

The XPath expression below returns the last names of all players on all teams. The expression <code>nodeList[[1]]</code> accesses the first element in the list of returned elements.

<code>xpathSApply()</code> is generally preferable to <code>xpathApply()</code> for evaluating XPath expressions because it attempts to simplify the return value to a vector when possible, rather than returning a list.

<code>xpathSApply()</code> requires a parsed XML document (or a node within it) and an XPath expression. Optionally, the function <code>xmlValue</code> can be added as a parameter to retrieve the values of the matching nodes. The value of a node is the text between its opening and closing tags, *e.g.*, the value of the node (or element) *\<foo\>bar\</foo\>* is *bar*.

The code below retrieves the matching nodes.

```{r}
nodeList <- xpathSApply(xmlObj, "//players/player/lastname")

aNode <- nodeList[[1]]

print(aNode)

# get value of an element

aPlayerName <- xmlValue(aNode)
print(aPlayerName)

```

In contrast, the code below retrieves the values of the matching nodes. Note that <code>xpathSApply()</code> returns a character vector here rather than a list. The same query with <code>xpathApply()</code> returns a list object.

```{r}
nodeVals.v <- xpathSApply(xmlObj, "//players/player/lastname", xmlValue)

aNode <- nodeVals.v[1]

print(aNode)

nodeVals.l <- xpathApply(xmlObj, "//players/player/lastname", xmlValue)

aNode <- nodeVals.l[[1]]

print(aNode)
```
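
The return types can be checked directly. The following sketch uses made-up sample data (the names and the `nickname` element are hypothetical) and also shows what comes back when nothing matches.

```r
library(XML)

# Made-up sample data; only the element names follow the lesson's file.
xmlText <- paste0(
  "<teams><team name='Pirates'><players>",
  "<player num='7'><lastname>Able</lastname></player>",
  "<player num='9'><lastname>Baker</lastname></player>",
  "</players></team></teams>"
)
sampleDoc <- xmlParse(xmlText, asText = TRUE)

# with xmlValue, xpathSApply() simplifies the result to a character vector
values.v <- xpathSApply(sampleDoc, "//player/lastname", xmlValue)

# xpathApply() never simplifies; the result is a list
values.l <- xpathApply(sampleDoc, "//player/lastname", xmlValue)

# an expression that matches nothing yields a result of length 0
noMatch <- xpathSApply(sampleDoc, "//player/nickname", xmlValue)

print(is.character(values.v))
print(is.list(values.l))
print(length(noMatch))
```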

## XPath on Elements

Let's first inspect the returned list. Note that <code>xpathSApply()</code> returns the matching elements themselves when no function is passed. The \<player\> element has child elements of its own.

```{r}
playerList <- xpathSApply(xmlObj, "//players/player")

aPlayer <- playerList[[1]]

print(aPlayer)
```

Let's access the *\<salary\>* element of a player using an XPath expression. Note that the XPath expression starts with *./* so that the path is evaluated relative to the *\<player\>* subtree; a path starting with <code>/</code> or <code>//</code> is evaluated against the entire document. Also note that we pass <code>xmlValue</code> to <code>xpathSApply()</code> to get the value of the *\<salary\>* element.

```{r}
salaryList <- xpathSApply(playerList[[1]], "./salary", xmlValue)

aNode <- salaryList[[1]]

print(aNode)
```
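
The effect of the leading *./* can be seen by running a relative and an absolute query against the same node. This is a small sketch on made-up sample data; it also reads the `num` attribute with `xmlGetAttr()`, which is not otherwise used in this lesson.

```r
library(XML)

# Made-up sample data; only the element names follow the lesson's file.
xmlText <- paste0(
  "<teams><team name='Pirates'><players>",
  "<player num='7'><lastname>Able</lastname><salary>1000000</salary></player>",
  "<player num='9'><lastname>Baker</lastname><salary>2000000</salary></player>",
  "</players></team></teams>"
)
sampleDoc <- xmlParse(xmlText, asText = TRUE)
firstPlayer <- getNodeSet(sampleDoc, "//player")[[1]]

# relative path: searches only below the first <player> node
relSalary <- xpathSApply(firstPlayer, "./salary", xmlValue)

# absolute path: searches the whole document the node belongs to
allSalaries <- xpathSApply(firstPlayer, "//salary", xmlValue)

# attributes are read with xmlGetAttr(); the value is returned as text
jersey <- xmlGetAttr(firstPlayer, "num")

print(relSalary)
print(allSalaries)
print(jersey)
```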

## Processing Node Lists

Evaluating an XPath expression returns a list object containing all XML elements (or nodes) that match the expression. The elements in the list can be processed iteratively using one of the <code>apply()</code> family of functions or a loop.
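
As an illustration of the apply-style approach, the sketch below builds a data frame with `vapply()` instead of an explicit loop. It runs on made-up sample data and uses hypothetical variable names (`sampleDoc`, `samplePlayers`, `sample.df`); it is one possible arrangement, not the only one.

```r
library(XML)

# Made-up sample data; only the element names follow the lesson's file.
xmlText <- paste0(
  "<teams>",
  "<team name='Pirates'><players>",
  "<player num='7'><lastname>Able</lastname>",
  "<position>Forward</position><salary>1000000</salary></player>",
  "<player num='9'><lastname>Baker</lastname>",
  "<position>Goalie</position><salary>2000000</salary></player>",
  "</players></team>",
  "<team name='Comets'><players>",
  "<player num='4'><lastname>Clark</lastname>",
  "<position>Defense</position><salary>3000000</salary></player>",
  "</players></team>",
  "</teams>"
)
sampleDoc <- xmlParse(xmlText, asText = TRUE)
samplePlayers <- getNodeSet(sampleDoc, "//team[@name='Pirates']/players/player")

# helper: text of the first child element with the given name
childValue <- function(node, name) xmlValue(node[[name]])

# vapply() returns a vector of the declared type, even for an empty node list
sample.df <- data.frame(
  name     = vapply(samplePlayers, childValue, character(1), name = "lastname"),
  position = vapply(samplePlayers, childValue, character(1), name = "position"),
  salary   = as.integer(vapply(samplePlayers, childValue, character(1), name = "salary")),
  stringsAsFactors = FALSE
)

print(sample.df)
```

Compared with a loop, this version needs no pre-allocation, but it assumes that every player has all three child elements.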

In the code below, we retrieve all *\<player\>* elements for the team "Pirates" into the list object *playerList*. If the XPath expression does not match any elements, it returns an empty list of length 0. For each *\<player\>* element, we extract the *\<lastname\>*, *\<position\>*, and *\<salary\>* elements using three different techniques for demonstration. We store the player's last name, position, and salary in a data frame. The data frame is pre-allocated with one row of empty values for each player. We could have set the initial number of rows to 0 rather than *num.Players*, and R would add new rows as needed. Further note that all values in XML are text (character) and must be converted as necessary.

```{r}
xmlObj <- xmlParse('HockeyTeamRosters.xml')

xpath <- "//team[@name='Pirates']/players/player"
playerList <- xpathSApply(xmlObj, xpath)

num.Players <- length(playerList)

# data frame for storing last name, position, and salary
players.df <- data.frame(
  name = character(num.Players),
  position = character(num.Players),
  salary = integer(num.Players),
  stringsAsFactors = FALSE  # keep text columns as character in R < 4.0
)

# check if players were found
if (num.Players > 0)
{
  # iterate through each player element
  for (p in 1:num.Players)
  {
    # get the next player in the list
    aPlayer <- playerList[[p]]
    
    # get the name of the player using node access
    pname <- xmlValue((aPlayer['lastname'])[[1]])
    
    # get the position using positional access; this requires
    # knowing the position of the element and being certain that
    # it does not change -- a DTD is useful for that
    pposition <- xmlValue(aPlayer[[3]])
    
    # get the salary using XPath; be sure to start with ./
    # this approach only works if we use xmlParse to get the
    # DOM; it does not work with xmlTreeParse
    salary.xpath <- "./salary"
    salary.element <- xpathSApply(aPlayer, salary.xpath, xmlValue)
    salary.value <- salary.element[[1]]
    psalary <- as.integer(salary.value)
    
    # store the name (by accessing a cell with row and column)
    players.df[p,1] <- pname
    # store the position using a named column
    players.df[p,'position'] <- pposition
    # store the salary using a different technique to demonstrate
    players.df$salary[p] <- psalary
  }
}

```

Let's break the code apart.
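
The three access techniques used inside the loop can be tried in isolation on a single *\<player\>* element. The sketch below parses a made-up player (all values are sample data) so that each step can be run and inspected on its own.

```r
library(XML)

# A single made-up <player> element; the root of this tiny document
# plays the role of 'aPlayer' in the loop.
playerText <- paste0(
  "<player num='7'>",
  "<firstname>Ann</firstname><lastname>Able</lastname>",
  "<position>Forward</position><salary>1000000</salary>",
  "</player>"
)
onePlayer <- xmlRoot(xmlParse(playerText, asText = TRUE))

# 1. access by name: [ ] returns the list of children with that name,
#    [[1]] picks the first of them
byName <- xmlValue((onePlayer['lastname'])[[1]])

# 2. access by position: <position> is the third child element here
byIndex <- xmlValue(onePlayer[[3]])

# 3. access by relative XPath, then convert the text to an integer
salaryText <- xpathSApply(onePlayer, "./salary", xmlValue)
bySalary <- as.integer(salaryText[[1]])

print(byName)
print(byIndex)
print(bySalary)
```

Access by name and by XPath keep working if the order of the child elements changes, whereas access by position does not.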

## Missing Elements
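
When an element is absent, the XPath query returns a result of length 0, and indexing it with `[[1]]` fails. One possible way to guard against that is sketched below on made-up sample data in which the second player has no *\<salary\>* element; the helper name `firstValueOrNA` is hypothetical.

```r
library(XML)

# Made-up sample data: the second player has no <salary> and
# no <assistantcaptain/> element.
xmlText <- paste0(
  "<players>",
  "<player num='7'><lastname>Able</lastname>",
  "<salary>1000000</salary><assistantcaptain/></player>",
  "<player num='9'><lastname>Baker</lastname></player>",
  "</players>"
)
sampleDoc <- xmlParse(xmlText, asText = TRUE)
samplePlayers <- getNodeSet(sampleDoc, "//player")

# hypothetical helper: first matching value, or NA if nothing matches
firstValueOrNA <- function(node, path) {
  vals <- xpathSApply(node, path, xmlValue)
  if (length(vals) == 0) NA_character_ else vals[[1]]
}

# missing salaries become NA instead of causing an error
salaries <- as.integer(
  vapply(samplePlayers, firstValueOrNA, character(1), path = "./salary")
)

# for empty marker elements, test for presence rather than value
isAssistant <- vapply(
  samplePlayers,
  function(p) length(getNodeSet(p, "./assistantcaptain")) > 0,
  logical(1)
)

print(salaries)
print(isAssistant)
```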

## Play Ground

``` xml
<foo>
</foo>
```

## Tutorial

```{=html}
<iframe src="" width="480" height="270" frameborder="0" allow="autoplay; fullscreen; picture-in-picture" allowfullscreen data-external="1"></iframe>
```

------------------------------------------------------------------------

## Files & Resources

```{r zipFiles, echo=FALSE}
zipName = sprintf("LessonFiles-%s-%s.zip", 
                 params$category,
                 params$number)

textALink = paste0("All Files for Lesson ", 
               params$category,".",params$number)

# downloadFilesLink() is included from _insert2DB.R
knitr::raw_html(downloadFilesLink(".", zipName, textALink))
```

------------------------------------------------------------------------

## References

No references.

## Errata

None collected yet. Let us know.

```{=html}
<script src="https://form.jotform.com/static/feedback2.js" type="text/javascript">
  new JotformFeedback({
    formId: "212187072784157",
    buttonText: "Feedback",
    base: "https://form.jotform.com/",
    background: "#F59202",
    fontColor: "#FFFFFF",
    buttonSide: "left",
    buttonAlign: "center",
    type: false,
    width: 700,
    height: 500,
    isCardForm: false
  });
</script>
```
```{r code=xfun::read_utf8(paste0(here::here(),'/R/_deployKnit.R')), include = FALSE}
```
