Introduction
In this lesson we will demonstrate how to use a combination of node-by-node traversal and XPath expressions to extract data from an XML document and store the data in a data frame for further processing, analysis, or storage in a database.
Prerequisites
This lesson presumes that the learner has an understanding of XML, the structure of an XML DOM, and knows how to formulate XPath expressions.
Packages
To use any of the XML parsing or XPath functions you will need an XML package. The XML package is one of several such packages and is the one we use in this tutorial. Note that the XML package only supports XPath Version 1.0 and not the newer 2.0 and 3.1 versions.
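As a minimal sketch of getting started, the chunk below loads the package and runs a small XPath 1.0 query against an inline document. The inline XML string and the variable names are made up for illustration, and the commented install line assumes the package is installed from CRAN.

```r
# install.packages("XML")   # run once if the package is not yet installed
library(XML)

# Parse a tiny inline document (illustrative content, not the lesson's file)
doc <- xmlParse("<a><b>1</b><b>2</b></a>", asText = TRUE)

# getNodeSet returns the list of nodes matching an XPath 1.0 expression
nodes <- getNodeSet(doc, "//b")
numNodes <- length(nodes)
print(numNodes)
```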
Loading an XML Document
Let’s start by loading an XML document. There are several loading functions, and they generally work the same way. They differ in the in-memory structure they create to represent the XML tree, so some are more efficient than others. XML documents (or files) can be loaded from the local file system or from a URL.
Load XML from File
library(XML)
xmlFile <- "CDCatalog2.xml"
# parse into an internal (C-level) document
xmlObj <- xmlParse(xmlFile)
# parse into an R-level tree
xmlObjTree <- xmlTreeParse(xmlFile)
The error Error: XML content does not seem to be XML: ’’ usually means that the file cannot be found, most often because of a misspelled file or path name.
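One way to make a missing file easier to diagnose is to check that the path exists before parsing. The sketch below writes a small sample document to a temporary file so that it is self-contained; the sample content and the error message are illustrative assumptions, not part of the lesson's data.

```r
library(XML)

# Write a small sample document to a temporary file (illustrative content)
xmlFile <- tempfile(fileext = ".xml")
writeLines("<catalog><cd><title>Empire Burlesque</title></cd></catalog>", xmlFile)

# Fail early with a clear message if the path is wrong
if (!file.exists(xmlFile)) {
  stop("XML file not found: ", xmlFile)
}

xmlObj <- xmlParse(xmlFile)

# The name of the root element confirms that the document was parsed
rootName <- xmlName(xmlRoot(xmlObj))
print(rootName)
```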
Load XML via URL
xmlURL <- "http://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"
xmlObjTree <- xmlTreeParse(xmlURL)
Note that the parsing functions of the XML package do not support https, so be sure that any URL starts with http:// rather than https://. The error “XML content does not seem to be XML” is often caused by passing an https URL.
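If a document is only available over https, one possible workaround is to download it to a temporary file first and then parse the local copy. The helper below is a hypothetical sketch: the function name parseXMLFromURL is invented for illustration, and it assumes that download.file can reach the URL on your system.

```r
library(XML)

# Hypothetical helper: download an XML document to a temporary file, then parse it
parseXMLFromURL <- function(url) {
  tmp <- tempfile(fileext = ".xml")
  # remove the temporary file when the function exits; the parsed
  # document is already held in memory by then
  on.exit(unlink(tmp), add = TRUE)
  download.file(url, destfile = tmp, mode = "wb", quiet = TRUE)
  xmlParse(tmp)
}

# Usage (placeholder URL, not a real data source):
# doc <- parseXMLFromURL("https://example.org/data.xml")
```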
xmlParse vs xmlTreeParse
xmlParse is a version of xmlTreeParse where the argument useInternalNodes is set to TRUE. If you want to get an R object, use xmlTreeParse. While this is generally not very efficient for large documents and often unnecessary if you only want to extract parts of the XML document, it has the benefit that you can traverse the XML tree using named traversal, e.g., root$child1$child$…
Using xmlParse is generally more efficient as it returns a pointer to a C structure. Accessing this structure requires XPath, although xmlTreeParse supports XPath as well.
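The difference between the two parsers can be seen in a small sketch. The inline XML string and the variable names below are assumptions made for illustration. The R-level tree is walked by child name using double brackets, while the internal document is queried with XPath.

```r
library(XML)

# Illustrative inline document
xmlString <- "<catalog><cd><title>Empire Burlesque</title></cd></catalog>"

internalDoc <- xmlParse(xmlString, asText = TRUE)     # pointer to a C structure
treeDoc <- xmlTreeParse(xmlString, asText = TRUE)     # R-level tree

print(class(internalDoc))
print(class(treeDoc))

# R-level tree: walk down from the root by child name
root <- xmlRoot(treeDoc)
firstTitle <- xmlValue(root[["cd"]][["title"]])

# Internal document: reach the same node with an XPath expression
sameTitle <- xpathSApply(internalDoc, "//cd/title", xmlValue)

print(firstTitle)
print(sameTitle)
```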
Applying an XPath Expression
There are several ways to apply an XPath expression to a parsed XML object, the most common of which is to use the function xpathSApply.
The xpathSApply function applies the function passed as a parameter to all elements matching an XPath expression rather than returning the elements themselves. In the code chunk below, each matching element has the xmlValue function applied to it and thus the values of the matching elements are extracted. Recall that the value of an element is everything that is between the opening and closing tags. For example, the value of <tag>some value</tag> is some value. Note that the returned object is a character vector (like an array of strings in other programming languages) and thus can be accessed as such.
xmlObj <- xmlParse(xmlFile)
xpathEx <- "//cd/title"
# xmlValue is applied to every title node matched by the expression
titles <- xpathSApply(xmlObj, xpathEx, xmlValue)
head(titles, 3)
## [1] "Empire Burlesque" "Hide your heart" "Greatest Hits"
# access the second element
print(paste("The second title is:", titles[2]))
## [1] "The second title is: Hide your heart"
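The same pattern works with any XPath 1.0 expression, including predicates. The following sketch uses a small inline document so that it can be run on its own; the sample data and the price threshold are assumptions chosen for illustration.

```r
library(XML)

# Illustrative inline document with two cd elements
xmlString <- "<catalog>
  <cd><title>Empire Burlesque</title><price>10.90</price></cd>
  <cd><title>Hide your heart</title><price>9.90</price></cd>
</catalog>"
doc <- xmlParse(xmlString, asText = TRUE)

# All titles
titles <- xpathSApply(doc, "//cd/title", xmlValue)

# Only titles of cd elements whose price is below 10 (XPath predicate)
cheapTitles <- xpathSApply(doc, "//cd[price < 10]/title", xmlValue)

print(titles)
print(cheapTitles)
```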
Retrieving XML Attributes
There are two ways to retrieve an element’s attributes. One, use an XPath expression with xpathSApply (but without applying the xmlValue function). Two, use the xmlAttrs function from a specific node, which requires traversing the tree.
The use of an XPath expression is generally preferable and more maintainable.
xmlObj <- xmlParse(xmlFile)
# Approach 1: use an XPath expression to get the attribute country
xpathEx <- "//cd/company/@country"
countries <- xpathSApply(xmlObj, xpathEx)
head(countries, 3)
## country country country
## "USA" "UK" "USA"
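The second approach, reading attributes from a specific node, might look like the sketch below. It uses an inline document with made-up sample data so that it is self-contained. The xmlGetAttr call at the end is a related alternative that reads one named attribute from every node matched by an XPath expression.

```r
library(XML)

# Illustrative inline document
xmlString <- '<catalog>
  <cd><title>Empire Burlesque</title><company country="USA">Columbia</company></cd>
  <cd><title>Hide your heart</title><company country="UK">CBS Records</company></cd>
</catalog>'
doc <- xmlParse(xmlString, asText = TRUE)

# Approach 2: traverse to a specific node, then read its attributes
root <- xmlRoot(doc)
firstCompany <- root[["cd"]][["company"]]   # first cd, then its company child
attrs <- xmlAttrs(firstCompany)             # named character vector of attributes
firstCountry <- attrs[["country"]]

# Related alternative: apply xmlGetAttr to every matching company node
countries <- xpathSApply(doc, "//cd/company", xmlGetAttr, "country")

print(firstCountry)
print(countries)
```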
Using Values in R
All of the values retrieved from XML are text (character strings) and must be converted to the appropriate type, such as numeric, before they can be used in calculations, often after parsing the text.
xpathEx <- "//cd/price"
prices <- xpathSApply(xmlObj, xpathEx, xmlValue)
# the values in the vector "prices" are character strings
# mean(prices) results in an error
prices.n <- as.numeric(prices)
avg <- mean(prices.n)
print(paste0("The average price is $", round(avg,2)))
## [1] "The average price is $9.12"
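To store the extracted values in a data frame, one simple option is to build one column per XPath expression. The sketch below uses an inline document with made-up sample data, and it assumes that every cd element has exactly one title, one company, and one price. If elements can be missing, the vectors would have different lengths and the rows would no longer line up.

```r
library(XML)

# Illustrative inline document
xmlString <- '<catalog>
  <cd><title>Empire Burlesque</title><company country="USA">Columbia</company><price>10.90</price></cd>
  <cd><title>Hide your heart</title><company country="UK">CBS Records</company><price>9.90</price></cd>
</catalog>'
doc <- xmlParse(xmlString, asText = TRUE)

# One column per XPath expression; convert text to numeric where needed
cds <- data.frame(
  title   = xpathSApply(doc, "//cd/title", xmlValue),
  country = as.character(xpathSApply(doc, "//cd/company/@country")),
  price   = as.numeric(xpathSApply(doc, "//cd/price", xmlValue)),
  stringsAsFactors = FALSE
)

str(cds)
avgPrice <- mean(cds$price)
```

Once the values are in a data frame, the usual tools for filtering, aggregating, or writing to a database table apply.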
Summary
Tutorial
