Introduction

In this lesson we will demonstrate how to use a combination of node-by-node traversal and XPath expressions to extract data from an XML document and store the data in a data frame for further processing, analysis, or storage in a database.

Prerequisites

This lesson presumes that the learner understands XML and the structure of an XML DOM, and knows how to formulate XPath expressions.

See also:

Packages

To use any of the XML parsing or XPath functions, you will need an XML package. The XML package is one of several such packages and is the one used in this tutorial. Note that the XML package supports only XPath version 1.0 and not the newer 2.0 and 3.1 versions.

library(XML)

Loading an XML Document

Let’s start by loading an XML document. Several functions are available for this, and they generally work the same way. They differ in the in-memory structure they create to represent the XML tree, so some are more efficient than others. XML documents (or files) can be loaded from the local file system or from a URL.

Load XML from File

xmlFile <- "CDCatalog2.xml"

xmlObj <- xmlParse(xmlFile)
xmlObjTree <- xmlTreeParse(xmlFile)

The error Error: XML content does not seem to be XML: ’’ often means that the file cannot be found, usually because of a misspelled file or path name.

Load XML via URL

xmlURL <- "http://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"
xmlObjTree <- xmlTreeParse(xmlURL)

Note that the parsing functions in the XML package do not support https, so make sure that any URL starts with http:// rather than https://. An https URL is often the cause of the error “XML content does not seem to be XML”.

xmlParse vs xmlTreeParse

xmlParse is a version of xmlTreeParse where the argument useInternalNodes is set to TRUE. If you want to get an R object, use xmlTreeParse. While this is generally not very efficient for large documents and often unnecessary if you want to extract only parts of the XML document, it has the benefit that you can traverse the XML tree using named traversal, e.g., root$child1$child$…

Using xmlParse is generally more efficient because it returns a pointer to a C structure. That structure is typically accessed with XPath, although xmlTreeParse supports XPath as well.

Applying an XPath Expression

There are several ways to apply an XPath expression to a parsed XML object, the most common of which is to use the function xpathSApply.

The xpathSApply function applies the function passed as a parameter to every element that matches the XPath expression, rather than returning the elements themselves. In the code chunk below, the xmlValue function is applied to each matching element, so the values of the matching elements are extracted. Recall that the value of an element is everything between its opening and closing tags. For example, the value of <tag>some value</tag> is some value. Note that the returned object is a character vector (like an array of strings in other programming languages) and can be accessed as such.

xmlObj <- xmlParse(xmlFile)

xpathEx <- "//cd/title"
titles <- xpathSApply(xmlObj, xpathEx, xmlValue)

head(titles, 3)
## [1] "Empire Burlesque" "Hide your heart"  "Greatest Hits"
# access the second element
print(paste("The second title is: ", titles[2]))
## [1] "The second title is:  Hide your heart"

Retrieving XML Attributes

There are two ways to retrieve an element’s attributes. One, use an XPath expression with xpathSApply (but without applying the xmlValue function). Two, use the xmlAttrs function from a specific node – which requires traversing the tree.

The use of an XPath expression is generally preferable and more maintainable.

xmlObj <- xmlParse(xmlFile)

# Approach 1: use an XPath expression to get the attribute country

xpathEx <- "//cd/company/@country"
countries <- xpathSApply(xmlObj, xpathEx)

head(countries, 3)
## country country country 
##   "USA"    "UK"   "USA"

Using Values in R

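All of the values retrieved from XML are text (character strings) and must be converted to the appropriate R type, such as numeric, before they can be used in calculations. This often requires parsing the text first.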

xpathEx <- "//cd/price"
prices <- xpathSApply(xmlObj, xpathEx, xmlValue)

# the values in the vector "prices" are character strings
# mean(prices) would return NA with a warning

prices.n <- as.numeric(prices)
avg <- mean(prices.n)

print(paste0("The average price is $", round(avg,2)))
## [1] "The average price is $9.12"

Summary

Tutorial


Files & Resources

All Files for Lesson 6.305

References

No references.

Errata

None collected yet. Let us know.

---
title: "Process XML DOM via XPath and Node Traversal"
params:
  category: 6
  number: 305
  time: 45
  level: beginner
  tags: "r,xpath,xml"
  description: "Explains how to retrieve data from an XML document into a
                data frame using a combination of node traversal and
                XPath expressions."
date: "<small>`r Sys.Date()`</small>"
author: "<small>Martin Schedlbauer</small>"
email: "m.schedlbauer@neu.edu"
affiliation: "Northeastern University"
output: 
  bookdown::html_document2:
    toc: true
    toc_float: true
    collapsed: false
    number_sections: false
    code_download: true
    theme: spacelab
    highlight: tango
---

---
title: "<small>`r params$category`.`r params$number`</small><br/><span style='color: #2E4053; font-size: 0.9em'>`r rmarkdown::metadata$title`</span>"
---

```{r code=xfun::read_utf8(paste0(here::here(),'/R/_insert2DB.R')), include = FALSE}
```

## Introduction

In this lesson we will demonstrate how to use a combination of node-by-node traversal and XPath expressions to extract data from an XML document and store the data in a data frame for further processing, analysis, or storage in a database.

## Prerequisites

This lesson presumes that the learner understands XML and the structure of an XML DOM, and knows how to formulate XPath expressions.

See also:

-   [XML Lesson Here]()
-   [6.303 Data Retrieval from XML via XPath in R](http://artificium.us/lessons/06.r/l-6-303-xpath-in-r/l-6-303.html)
-   [6.323 Load Simple XML into Dataframe in R using xmlToDataFrame()](http://artificium.us/lessons/06.r/l-6-323-load-xml-xmlToDataFrame/l-6-323.html)

## Packages

To use any of the XML parsing or XPath functions, you will need an XML package. The **XML** package is one of several such packages and is the one used in this tutorial. Note that the **XML** package supports only XPath version 1.0 and not the newer 2.0 and 3.1 versions.

```{r}
library(XML)
```

## Loading an XML Document

Let's start by loading an XML document. Several functions are available for this, and they generally work the same way. They differ in the in-memory structure they create to represent the XML tree, so some are more efficient than others. XML documents (or files) can be loaded from the local file system or from a URL.

### Load XML from File

```{r}
xmlFile <- "CDCatalog2.xml"

xmlObj <- xmlParse(xmlFile)
xmlObjTree <- xmlTreeParse(xmlFile)
```

> The error **Error: XML content does not seem to be XML: ''** often means that the file cannot be found, usually because of a misspelled file or path name.
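
A minimal sketch of how the file-not-found case could be caught before parsing. The helper `parse_xml_file()` is hypothetical (it is not part of the **XML** package), and the temporary file only stands in for *CDCatalog2.xml*; its single `<cd>` element is an assumption about that file's structure.

```r
library(XML)

# Hypothetical helper (not part of the XML package): fail early with a
# clear message when the path is wrong, instead of a parser error.
parse_xml_file <- function(path) {
  if (!file.exists(path)) {
    stop("XML file not found: ", path, call. = FALSE)
  }
  xmlParse(path)
}

# Demonstration with a temporary file standing in for CDCatalog2.xml
tmp <- tempfile(fileext = ".xml")
writeLines("<catalog><cd><title>Empire Burlesque</title></cd></catalog>", tmp)

doc <- parse_xml_file(tmp)
root_name <- xmlName(xmlRoot(doc))
```

Calling `parse_xml_file()` with a misspelled name stops with the "XML file not found" message, which is usually easier to diagnose than the parser's own error.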

### Load XML via URL

```{r}
xmlURL <- "http://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"
xmlObjTree <- xmlTreeParse(xmlURL)
```

> Note that the parsing functions in the **XML** package do not support *https*, so make sure that any URL starts with `http://` rather than `https://`. An *https* URL is often the cause of the error "XML content does not seem to be XML".
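
One possible workaround for the *https* limitation, sketched under assumptions: download the document with base R's `download.file()`, which handles *https* in current R versions, and then parse the local copy. The helper name `parse_xml_from_url()` is hypothetical, and the URL in the usage comment is only a placeholder.

```r
library(XML)

# Hypothetical helper: fetch the document with base R, then parse the
# local copy so the XML package never has to open the URL itself.
parse_xml_from_url <- function(url) {
  tmp <- tempfile(fileext = ".xml")
  on.exit(unlink(tmp))  # remove the temporary copy when done
  download.file(url, destfile = tmp, quiet = TRUE, mode = "wb")
  xmlParse(tmp)
}

# Usage (placeholder URL, not a real resource):
# doc <- parse_xml_from_url("https://example.com/data.xml")
```

The document is fully parsed into memory before the temporary file is removed, so the returned object remains usable afterwards.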

### xmlParse *vs* xmlTreeParse

<code>xmlParse</code> is a version of <code>xmlTreeParse</code> where the argument *useInternalNodes* is set to *TRUE*. If you want to get an R object, use <code>xmlTreeParse</code>. While this is generally not very efficient for large documents and often unnecessary if you want to extract only parts of the XML document, it has the benefit that you can traverse the XML tree using named traversal, *e.g.*, `root$child1$child$...`

Using <code>xmlParse</code> is generally more efficient because it returns a pointer to a C structure. That structure is typically accessed with XPath, although <code>xmlTreeParse</code> supports XPath as well.
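
The difference between the two parsers can be sketched on a small inline document. The XML string below is an illustrative stand-in and an assumption about how the CD catalog is structured. Child access by name uses the `[[` operator here, which is one way to traverse the tree.

```r
library(XML)

# Small inline document, assumed to resemble the CD catalog
xmlText <- "<catalog><cd><title>Empire Burlesque</title></cd></catalog>"

# xmlTreeParse: the tree is copied into nested R objects
treeDoc <- xmlTreeParse(xmlText, asText = TRUE)
treeRoot <- xmlRoot(treeDoc)
titleByName <- xmlValue(treeRoot[["cd"]][["title"]])

# xmlParse: the tree stays in C-level memory and is queried with XPath
internalDoc <- xmlParse(xmlText, asText = TRUE)
titleByXPath <- xpathSApply(internalDoc, "//cd/title", xmlValue)

# The two parsers return objects of different classes
treeClass <- class(treeDoc)[1]
internalClass <- class(internalDoc)[1]
```

Both routes reach the same title; the choice mainly affects memory use and which access style is convenient.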

## Applying an XPath Expression

There are several ways to apply an XPath expression to a parsed XML object, the most common of which is to use the function <code>xpathSApply</code>.

The <code>xpathSApply</code> function applies the function passed as a parameter to every element that matches the XPath expression, rather than returning the elements themselves. In the code chunk below, the <code>xmlValue</code> function is applied to each matching element, so the values of the matching elements are extracted. Recall that the value of an element is everything between its opening and closing tags. For example, the value of `<tag>some value</tag>` is *some value*. Note that the returned object is a character vector (like an array of strings in other programming languages) and can be accessed as such.

```{r}
xmlObj <- xmlParse(xmlFile)

xpathEx <- "//cd/title"
titles <- xpathSApply(xmlObj, xpathEx, xmlValue)

head(titles, 3)

# access the second element
print(paste("The second title is: ", titles[2]))
```
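
To contrast applying a function with retrieving the nodes themselves, here is a small self-contained sketch. The inline XML is an illustrative assumption, not the contents of *CDCatalog2.xml*. `getNodeSet()` is another function from the **XML** package that returns the matching nodes without applying anything to them.

```r
library(XML)

# Illustrative inline document (an assumption, not the lesson's file)
xmlText <- "<catalog>
  <cd><title>Empire Burlesque</title></cd>
  <cd><title>Hide your heart</title></cd>
</catalog>"
doc <- xmlParse(xmlText, asText = TRUE)

# With a function: each matching node is passed to xmlValue
titles <- xpathSApply(doc, "//cd/title", xmlValue)

# Without a function: getNodeSet returns the nodes themselves
nodes <- getNodeSet(doc, "//cd/title")
secondTitle <- xmlValue(nodes[[2]])

# A positional predicate narrows the match to the first <cd>
firstTitle <- xpathSApply(doc, "//cd[1]/title", xmlValue)
```

Keeping the nodes is useful when more than one piece of information is needed from each match; applying `xmlValue` directly is shorter when only the text is needed.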

### Retrieving XML Attributes

There are two ways to retrieve an element's attributes. One, use an XPath expression with <code>xpathSApply</code> (but without applying the <code>xmlValue</code> function). Two, use the <code>xmlAttrs</code> function from a specific node -- which requires traversing the tree.

The use of an XPath expression is generally preferable and more maintainable.

```{r}
xmlObj <- xmlParse(xmlFile)

# Approach 1: use an XPath expression to get the attribute country

xpathEx <- "//cd/company/@country"
countries <- xpathSApply(xmlObj, xpathEx)

head(countries, 3)

```
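
The second approach, reading attributes from nodes, might look like the sketch below. The inline XML and its company names are illustrative assumptions; only the `country` attribute on `<company>` is taken from the lesson. `xmlGetAttr()` reads a single named attribute, while `xmlAttrs()` returns all attributes of a node.

```r
library(XML)

# Illustrative inline document (sample values are assumptions)
xmlText <- "<catalog>
  <cd><company country='USA'>Columbia</company></cd>
  <cd><company country='UK'>CBS Records</company></cd>
</catalog>"
doc <- xmlParse(xmlText, asText = TRUE)

# Approach 2a: select the elements, then read one attribute from each
countries <- xpathSApply(doc, "//cd/company", xmlGetAttr, "country")

# Approach 2b: traverse to a specific node and read all its attributes
root <- xmlRoot(doc)
firstCompany <- root[["cd"]][["company"]]
attrs <- xmlAttrs(firstCompany)
firstCountry <- attrs[["country"]]
```

Approach 2b depends on the exact position of the node in the tree, which is one reason the XPath form tends to be easier to maintain.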

## Using Values in R

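All of the values retrieved from XML are text (character strings) and must be converted to the appropriate R type, such as numeric, before they can be used in calculations. This often requires parsing the text first.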

```{r}
xpathEx <- "//cd/price"
prices <- xpathSApply(xmlObj, xpathEx, xmlValue)

# the values in the vector "prices" are character strings
# mean(prices) would return NA with a warning

prices.n <- as.numeric(prices)
avg <- mean(prices.n)

print(paste0("The average price is $", round(avg,2)))

```
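
Putting the pieces together, the following sketch shows one way to combine XPath and node traversal to fill a data frame. The inline XML, its sample values, and the column names are assumptions for illustration, and the code assumes that every `<cd>` has a `<title>`, a `<company>`, and a `<price>` child.

```r
library(XML)

# Illustrative inline document (structure and values are assumptions)
xmlText <- "<catalog>
  <cd><title>Empire Burlesque</title><company country='USA'>Columbia</company><price>10.90</price></cd>
  <cd><title>Hide your heart</title><company country='UK'>CBS Records</company><price>9.90</price></cd>
</catalog>"
doc <- xmlParse(xmlText, asText = TRUE)

# XPath selects the <cd> nodes; traversal reads the children of each
cdNodes <- getNodeSet(doc, "//cd")

cds <- data.frame(
  title   = sapply(cdNodes, function(n) xmlValue(n[["title"]])),
  country = sapply(cdNodes, function(n) xmlGetAttr(n[["company"]], "country")),
  # prices arrive as text and are converted to numeric here
  price   = as.numeric(sapply(cdNodes, function(n) xmlValue(n[["price"]]))),
  stringsAsFactors = FALSE
)

avgPrice <- mean(cds$price)
```

Each row of `cds` corresponds to one `<cd>` element, so the data frame can be passed on to further analysis or written to a database.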

## Summary

## Next Steps

-   [6.324 Traverse and Parse XML DOM in R](http://artificium.us/lessons/06.r/l-6-324-parse-xml-dom/l-6-324.html)
-   [6.328 Parsing an XML Document and Saving to SQLite Database in R](http://artificium.us/lessons/06.r/l-6-328-xml-to-reldb-sqlite/l-6-328.html)

## Tutorial

```{=html}
<iframe src="" width="480" height="270" frameborder="0" allow="autoplay; fullscreen; picture-in-picture" allowfullscreen data-external="1"></iframe>
```

------------------------------------------------------------------------

## Files & Resources

```{r zipFiles, echo=FALSE}
zipName = sprintf("LessonFiles-%s-%s.zip", 
                 params$category,
                 params$number)

textALink = paste0("All Files for Lesson ", 
               params$category,".",params$number)

# downloadFilesLink() is included from _insert2DB.R
knitr::raw_html(downloadFilesLink(".", zipName, textALink))
```

------------------------------------------------------------------------

## References

No references.

## Errata

None collected yet. Let us know.

```{=html}
<script src="https://form.jotform.com/static/feedback2.js" type="text/javascript">
  new JotformFeedback({
    formId: "212187072784157",
    buttonText: "Feedback",
    base: "https://form.jotform.com/",
    background: "#F59202",
    fontColor: "#FFFFFF",
    buttonSide: "left",
    buttonAlign: "center",
    type: false,
    width: 700,
    height: 500,
    isCardForm: false
  });
</script>
```
```{r code=xfun::read_utf8(paste0(here::here(),'/R/_deployKnit.R')), include = FALSE}
```
