Introduction
In this lesson we will demonstrate how to execute XPath queries against an XML document (or XML store).
Role of XPath
XPath is a powerful language for navigating and querying XML documents. Its role and significance lie in its ability to efficiently and precisely locate elements and data within XML documents. Here’s an overview of the role and significance of XPath in the context of querying XML documents:
Navigating the XML Tree Structure: XML documents are hierarchical in nature, organized as a tree-like structure with elements, attributes, and text nodes. XPath provides a standardized way to traverse this tree structure, allowing you to pinpoint specific nodes and relationships within the document.
Selecting Elements and Attributes: XPath allows you to select XML elements and attributes based on their names, positions, or characteristics. For example, you can easily select all elements of a particular name or filter elements based on their attributes and values.
Path Expressions: XPath uses path expressions that resemble file system paths to specify the location of elements in an XML document. These expressions make it intuitive to traverse the hierarchy and access the desired data.
Filtering and Predicates: XPath supports filtering using predicates, which enable you to refine your selections further. Predicates allow you to specify conditions that must be met for an element to be included in the result set. For example, you can select all <book> elements where the <author> is “John Doe.”
Relative and Absolute Paths: XPath supports both relative and absolute paths. Relative paths are specified in relation to the current context node, making it easy to navigate within specific sections of the document. Absolute paths start from the root of the document.
Accessing Text Content: XPath can retrieve the text content of elements and text nodes, allowing you to extract the data contained within XML tags. This is particularly useful when you want to extract values for further processing.
Support for Functions: XPath includes a wide range of built-in functions for performing operations on data, such as string manipulation, mathematical calculations, and date/time handling. These functions enhance the querying capabilities of XPath.
Integration with Other Technologies: XPath is not limited to standalone querying. It plays a crucial role in various XML-related technologies like XSLT (Extensible Stylesheet Language Transformations), XQuery (for querying XML documents), and XPointer (for addressing specific parts of XML documents). XPath expressions can be embedded within these technologies to achieve specific tasks.
Cross-Platform Compatibility: XPath is platform-independent and widely supported in various programming languages and tools. This ensures that XPath queries can be used in different environments and applications.
Standardization: XPath is a W3C (World Wide Web Consortium) standard, which means it has a well-defined syntax and behavior. This standardization promotes consistency and interoperability across different XML processing tools and libraries.
In short, XPath is a fundamental component of XML processing that facilitates the precise and flexible extraction of data from XML documents. Its standardized syntax and rich set of features make it a valuable tool for developers and data analysts working with XML data, enabling them to efficiently query and manipulate XML documents for various purposes, including data extraction, transformation, and validation.
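To make path expressions and predicates concrete, here is a minimal sketch in Python using only the standard library module xml.etree.ElementTree. Two caveats: ElementTree implements only a limited subset of XPath (paths are written relative to an element, hence the leading "./"), and the sample data and variable names are invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, invented for this illustration
xml_data = """
<library>
  <book year="2001"><title>First Steps</title><author>John Doe</author></book>
  <book year="2010"><title>Second Wind</title><author>Jane Roe</author></book>
  <book year="2015"><title>Third Act</title><author>John Doe</author></book>
</library>
"""
root = ET.fromstring(xml_data)

# Path expression: walk from the root to every <title> under a <book>
all_titles = [t.text for t in root.findall("./book/title")]

# Predicate on a child element: only books whose <author> is "John Doe"
doe_titles = [t.text for t in root.findall("./book[author='John Doe']/title")]

# Predicate on an attribute: only books whose year attribute is "2010"
titles_2010 = [t.text for t in root.findall("./book[@year='2010']/title")]

# Relative path: "author" is evaluated with each <book> as the context node
authors = [book.findtext("author") for book in root.findall("./book")]

print(all_titles)
print(doe_titles)
print(titles_2010)
print(authors)
```

A full XPath 1.0 engine, such as the one behind the R XML package used later in this lesson, accepts the same predicates and many more.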
XPath vs SQL
XPath and SQL are both query languages, but they are designed for querying and manipulating data in different contexts. Here are some key differences between XPath and SQL:
- Data Model:
- XPath: XPath is primarily used for querying and navigating XML and HTML documents. It operates on a tree-like structure, where elements, attributes, and text nodes are organized hierarchically.
- SQL: SQL (Structured Query Language) is used for querying relational databases. It works with tables consisting of rows and columns, and it represents data in a tabular format.
- Data Source:
- XPath: XPath is used to query semi-structured and hierarchical data, mainly XML documents. It’s suitable for navigating the complex structure of XML files.
- SQL: SQL is used for querying structured data stored in relational databases. It excels at handling large datasets with well-defined schemas.
- Query Syntax:
- XPath: XPath uses path expressions and functions to navigate and query XML documents. Path expressions resemble directory paths and are used to specify the location of elements or attributes within the document.
- SQL: SQL employs declarative statements that specify what data to retrieve, update, or manipulate. SQL queries consist of SELECT, INSERT, UPDATE, DELETE, and other commands.
- Data Manipulation:
- XPath: While XPath primarily focuses on querying and selecting data within XML documents, it lacks the ability to perform data modification operations like insertion, deletion, or updating.
- SQL: SQL is a comprehensive language that supports not only querying but also data modification operations such as INSERT, UPDATE, and DELETE, making it suitable for maintaining relational databases.
- Use Cases:
- XPath: It is commonly used in web scraping, XML document processing, and XML-based technologies like XSLT and XQuery for transforming and extracting data from XML sources.
- SQL: SQL is widely used for database management, data retrieval, reporting, and data manipulation in applications ranging from e-commerce platforms to financial systems.
- Data Complexity:
- XPath: XPath excels in handling complex hierarchies and relationships within XML documents, making it ideal for tasks like extracting data from deeply nested XML structures.
- SQL: SQL is designed for managing structured data with well-defined relationships, which is suitable for handling complex data dependencies in relational databases.
- Standardization:
- XPath: XPath is a W3C (World Wide Web Consortium) standard, ensuring consistency and compatibility among different XML processing tools.
- SQL: SQL is an ANSI/ISO standard with different dialects for various database systems. While there is a common core of SQL, each database may have its own extensions and variations.
To recap, XPath and SQL are specialized query languages tailored for different data models and use cases. XPath is focused on navigating and querying hierarchical, semi-structured data in XML documents, while SQL is designed for managing structured data in relational databases with support for data manipulation operations. The choice between XPath and SQL depends on the nature of the data and the specific requirements of the task at hand.
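The contrast can be seen side by side in a small sketch that asks the same question of the same data, once as rows in a table and once as nodes in a tree. It uses Python's standard sqlite3 and xml.etree.ElementTree modules; the table name, element names, and data are assumptions made up for this illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical data, invented for this illustration
rows = [("Album A", "Artist X"), ("Album B", "Artist Y"), ("Album C", "Artist X")]

# Relational model: rows and columns, queried declaratively with SQL
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cd (title TEXT, artist TEXT)")
conn.executemany("INSERT INTO cd (title, artist) VALUES (?, ?)", rows)
sql_titles = [r[0] for r in conn.execute(
    "SELECT title FROM cd WHERE artist = 'Artist X' ORDER BY title")]
conn.close()

# Hierarchical model: a tree of elements, queried with a path expression
catalog = ET.Element("catalog")
for title, artist in rows:
    cd = ET.SubElement(catalog, "cd")
    ET.SubElement(cd, "title").text = title
    ET.SubElement(cd, "artist").text = artist
xpath_titles = [t.text for t in catalog.findall("./cd[artist='Artist X']/title")]

print(sql_titles)
print(xpath_titles)
```

The WHERE clause and the predicate in square brackets play the same filtering role; what differs is the shape of the data each language expects.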
XPath Expression Evaluation
Evaluating an XPath expression against an XML document involves using a parser or library that supports XPath to search, navigate, and retrieve data from the XML document based on the specified XPath query.
There are two common ways to execute XPath queries against an XML document:
- write a program that loads the XML and evaluates the XPath query expression
- load the XML document into a tool to execute ad hoc queries
Ad-Hoc XPath Expression Evaluation
There are several online tools and websites that allow you to evaluate ad hoc XPath queries against XML documents without the need for setting up local development environments or writing code. These tools are handy for quick XPath testing and experimentation. Here are some common websites for evaluating ad hoc XPath queries:
FreeFormatter XPath Tester: FreeFormatter XPath Tester is a straightforward online tool for testing XPath expressions against XML data. It provides a clear and simple interface for entering XML data and XPath queries and viewing the results.
XPath Visualizer and Tester: XPath Visualizer and Tester offers a user-friendly interface for evaluating XPath expressions. It allows you to upload XML files, enter XPath queries, and see the results in a structured format.
CodeBeautify XPath Tester: CodeBeautify XPath Tester provides an online XPath evaluator with XML data input and an output pane to display the results of your XPath queries.
Online XPath Tester by DevGuru: Online XPath Tester by DevGuru is a simple online tool that lets you enter XML data and XPath expressions, and it displays the matching nodes.
XPath Evaluator by W3Schools: XPath Evaluator by W3Schools is part of the W3Schools website, known for its web development tutorials. It allows you to enter XML data and XPath expressions to see the results.
XPathFiddle: XPathFiddle is an online XPath tester with a clean and minimalistic interface. You can input your XML data and XPath queries and instantly view the results.
Online XML Tools: Online XML Tools offers a suite of XML-related tools, including an XPath tester. You can input your XML data and XPath expressions and visualize the results.
Online XPath Tester by TutorialsPoint: Online XPath Tester by TutorialsPoint is a handy tool for testing XPath expressions against XML data. It provides a simple editor and result display.
Oxygen XML Web Author: Oxygen XML Web Author is a powerful online XML editor that includes an XPath evaluator. It’s more comprehensive and feature-rich than some of the other tools listed here.
These websites provide convenient and accessible environments for trying out XPath queries against sample XML data. Depending on your specific needs and preferences, you can choose the one that best suits your workflow and requirements.
XPath within Programs
Here’s a programming-language-independent, step-by-step guide on how to evaluate an XPath expression against an XML document:
- Load the XML Document:
- First, you need to load the XML document into memory. This can be done using an XML parsing library or tool available in your programming language of choice. Common choices include libraries like lxml in Python, XmlDocument in C#, or built-in classes like DOMDocument in PHP.
- Initialize the XPath Processor:
- Next, you need to initialize the XPath processor or create an XPath object. This object allows you to compile and evaluate XPath expressions against the loaded XML document. The specific method for initializing the processor may vary depending on the programming language and libraries used.
- Compile the XPath Expression:
- Compile the XPath expression you want to evaluate. XPath expressions can vary in complexity and specificity, and they are used to define what data you want to retrieve from the XML document.
- Evaluate the XPath Expression:
- Use the XPath processor to evaluate the compiled XPath expression against the loaded XML document. The evaluation process will return a result, which can be one or more of the following:
- A single node (e.g., element, attribute, text node)
- A node list (multiple nodes matching the expression)
- A boolean value (true or false, depending on whether the expression matches)
- A string value (e.g., the text content of a selected element)
- A number (e.g., the result of the count() or sum() function)
- Handle the Result:
- Depending on the result of the XPath evaluation, you can perform various actions:
- If the result is a single node, you can access its data or attributes.
- If the result is a node list, you can iterate through the list to process each matching node.
- If the result is a boolean value, you can use it to make conditional decisions.
- If the result is a string value, you can access the selected text content.
- Repeat as Needed:
- You can evaluate multiple XPath expressions against the same XML document to retrieve different sets of data or information.
- Error Handling:
- Implement error handling to handle cases where the XPath expression is invalid or does not match any data in the XML document. Most XPath processors provide mechanisms to catch and handle exceptions or errors.
- Release Resources:
- After you’ve finished evaluating XPath expressions and processing the XML document, it’s essential to release any resources or memory associated with the XPath processor and XML document to prevent memory leaks.
Here’s a simplified example in Python (so you can see how it compares to R which is used in the remainder of the tutorial) using the lxml library to illustrate how you might evaluate an XPath expression against an XML document:
from lxml import etree
# Load the XML document
xml_data = """
<root>
<item id="1">Apple</item>
<item id="2">Banana</item>
</root>
"""
root = etree.fromstring(xml_data)
# Initialize the XPath processor
xpath_processor = etree.XPath("//item[@id='1']")
# Evaluate the XPath expression
result = xpath_processor(root)
# Handle the result
if result:
    print(result[0].text)  # Output: Apple
else:
    print("No matching node found.")
This code loads an XML document, initializes the XPath processor, evaluates the XPath expression, and handles the result.
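The error-handling step from the guide above can be sketched in the same spirit. This version deliberately swaps lxml for Python's standard xml.etree.ElementTree module (which supports only a subset of XPath) so that it runs without extra packages; the helper name first_item_text is hypothetical.

```python
import xml.etree.ElementTree as ET

def first_item_text(xml_text, item_id):
    """Return the text of the <item> with the given id, or None."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # The input was not well-formed XML
        return None
    # find() returns None when nothing matches the expression
    node = root.find(f"./item[@id='{item_id}']")
    return node.text if node is not None else None

good = '<root><item id="1">Apple</item><item id="2">Banana</item></root>'
found = first_item_text(good, "2")             # a matching node exists
missing = first_item_text(good, "9")           # no node has this id
broken = first_item_text("<root><item>", "1")  # malformed XML

print(found, missing, broken)
```

Returning None for both failure cases keeps the sketch short; a real program would probably want to distinguish a malformed document from an expression that matches nothing.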
XPath Evaluation in R
XPath can be used in any programming language that provides XML support, including Python, Java, JavaScript, and C#, among many others. In this lesson, we will focus on using XPath within R, although the XPath expressions themselves are language independent.
To use any of the XML parsing or XPath functions you will need an XML package. The XML package is one of several such packages for R and is the one we use in this lesson, so be sure to install it first.
Note that the XML package only supports XPath Version 1.0 and not the newer 2.0 and 3.1 versions.
Loading an XML Document
Let’s start by loading an XML document. There are several functions for loading them which generally all work the same way, although some create different in-memory structures representing the XML tree and thus some are more and some are less efficient. XML documents (or files) can be loaded from the local file system or from a URL.
Load XML from File
library(XML)
xmlFile <- "CDCatalog2.xml"
xmlObj <- xmlParse(xmlFile)
xmlObjTree <- xmlTreeParse(xmlFile)
The error “Error: XML content does not seem to be XML: ''” often means that the file cannot be found, usually because of a misspelled file or path name.
Load XML via URL
xmlURL <- "http://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"
xmlObjTree <- xmlTreeParse(xmlURL)
Note that the parsing functions of the XML package do not support https, so be sure that any URL starts with http:// rather than https://. If you get the error “XML content does not seem to be XML”, that is often the cause.
xmlParse vs xmlTreeParse
xmlParse is a version of xmlTreeParse where the argument useInternalNodes is set to TRUE. If you want to get an R object, use xmlTreeParse. While this is generally not very efficient for large documents and is often unnecessary if you want to extract only parts of the XML document, it has the benefit that you can traverse the XML tree using named traversal, e.g., root$child1$child$…
Using xmlParse is generally more efficient as it returns a pointer to a C structure. Accessing this structure requires XPath, although xmlTreeParse supports XPath as well.
You can see the different class types in the output of class() below; the first line is for the object returned by xmlParse and the second for the object returned by xmlTreeParse.
## [1] "XMLInternalDocument" "XMLAbstractDocument"
## [1] "XMLDocument" "XMLAbstractDocument"
Applying an XPath Expression
There are several ways to apply an XPath expression to a parsed XML object, the most common of which is to use the function xpathSApply. The code chunks below presume that xmlFile is the path or URL to an XML document.
The xpathSApply function applies the function passed as a parameter to all elements matching an XPath expression rather than returning the elements themselves. In the code chunk below, each matching element has the xmlValue function applied to it, and thus the values of the matching elements are extracted. Recall that the value of an element is everything between the opening and closing tags. For example, the value of <tag>some value</tag> is "some value". Note that the returned object is a character vector (like an array of strings in other programming languages) and can be accessed as such.
xmlObj <- xmlParse(xmlFile)
xpathEx <- "//cd/title"
titles <- xpathSApply(xmlObj, xpathEx, xmlValue)
head(titles, 3)
[1] "Empire Burlesque" "Hide your heart" "Greatest Hits"
[1] "character"
# access the second element
print(paste("The second title is:", titles[2]))
[1] "The second title is: Hide your heart"
Retrieving XML Attributes
There are two ways to retrieve an element’s attributes. One, use an XPath expression with xpathSApply (but without applying the xmlValue function). Two, use the xmlAttrs function from a specific node, which requires traversing the tree.
The use of an XPath expression is generally preferable and more maintainable.
xmlObj <- xmlParse(xmlFile)
# Approach 1: use an XPath expression to get the attribute country
xpathEx <- "//cd/company/@country"
countries <- xpathSApply(xmlObj, xpathEx)
head(countries, 3)
country country country
"USA" "UK" "USA"
Using Values in R
All of the values retrieved from XML are text (character strings) and must be converted to the appropriate data type, such as numeric, often after parsing the text.
xpathEx <- "//cd/price"
prices <- xpathSApply(xmlObj, xpathEx, xmlValue)
# the values in the vector "prices" are character strings
# mean(prices) results in an error
prices.n <- as.numeric(prices)
avg <- mean(prices.n)
print(paste0("The average price is $", round(avg,2)))
[1] "The average price is $9.12"
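The same text-to-number conversion is needed in other languages. For comparison with the R code above, here is a minimal Python sketch using the standard xml.etree.ElementTree and statistics modules; the catalog data is invented for this illustration and is not taken from CDCatalog2.xml.

```python
import xml.etree.ElementTree as ET
from statistics import mean

# Hypothetical sample data, invented for this illustration
xml_data = """
<catalog>
  <cd><title>Album A</title><price>10.90</price></cd>
  <cd><title>Album B</title><price>9.90</price></cd>
  <cd><title>Album C</title><price>8.20</price></cd>
</catalog>
"""
root = ET.fromstring(xml_data)

# Every value extracted from the XML arrives as a string
prices_text = [p.text for p in root.findall("./cd/price")]

# Convert explicitly before doing arithmetic
prices = [float(p) for p in prices_text]
avg = mean(prices)

print(f"The average price is ${avg:.2f}")
```

Calling mean() on prices_text instead would raise an error, just as mean(prices) does in R before the call to as.numeric().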
Review
So, to summarize, the way that XML documents are processed in R is as follows:
- install the package XML
- load the package XML
- parse the XML document with xmlParse()
- set up an XPath expression
- call xpathSApply()
- use the result node or call xmlValue() to get the value of the node
And, remember that all values are characters (text or strings) in R and must be coerced to the correct data type – perhaps after extracting parts of the returned character value. See Lesson 6.112 Basics of Text & String Processing in R.
Significance of XPath
Within the framework of this tutorial, you may have observed a recurrent emphasis on the term “select” when elucidating the functioning of XPath expressions in effectively pinpointing sections of an XML document. Nonetheless, it is crucial to underscore that XPath does not operate in isolation; it is inherently intertwined with other complementary technologies such as XSLT, XPointer, or XLink. The XPath illustrations presented in the preceding sections of this tutorial necessitate their integration with additional code to manifest their full utility.
For instance, consider the following code snippet, which demonstrates the utilization of an XPath expression in an XSLT stylesheet:
<xsl:value-of select="*/session[@type='running']" />
In this code excerpt, the XPath expression is embedded within the select attribute of the xsl:value-of element. This element assumes the responsibility of extracting content from a source XML document and incorporating it into an output document during the transformation process of the source document. For comprehensive insights into XSLT stylesheets and their operational dynamics, I recommend revisiting Tutorials 12 and 13.
The pivotal takeaway here is that the XPath expression derives its significance and functionality from its integration within the XSLT context. XPath plays an integral role in the realm of XSLT, as you may recall from the material covered in Tutorial 13.
In a manner analogous to its function within XSLT, XPath serves as the foundational addressing mechanism in XPointer. XPointer’s primary objective is to pinpoint specific sections within XML documents, and it assumes a central role in the broader context of XLink, a concept we will delve into shortly. XPointer harnesses XPath’s capabilities to facilitate navigation through the hierarchy of nodes comprising an XML document. This may sound reminiscent of XPath’s role thus far. However, XPointer elevates the capabilities of XPath by introducing a syntax for fragment identifiers, which are subsequently employed to precisely specify segments within documents. Through this innovation, XPointer empowers users with an exceptional degree of control over the addressing and manipulation of XML documents.
More Examples
The examples below retrieve data from the XML document TeamRosters.xml. Take a moment to open that file and inspect it. You can download the files in the section Files & resource or from the link.
xmlFile <- "TeamRosters.xml"
xmlObj <- xmlParse(xmlFile)
Find the goals scored by “McAvoy”
This path uses a full absolute path expression starting at the root (/) and specifying a filter through a predicate expression.
xpathEx <- "/rosters/team/player[lastname='McAvoy']/points/goals"
result <- xpathSApply(xmlObj, xpathEx, xmlValue)
cat(result)
5
Alternatively, rather than passing the function xmlValue() as a parameter to xpathSApply(), it can also be called separately on the returned full node.
xpathEx <- "/rosters/team/player[lastname='McAvoy']/points/goals"
result <- xpathSApply(xmlObj, xpathEx)
print(result)
[[1]]
<goals>5</goals>
value <- xmlValue(result[[1]])
cat(value)
5
Find the goals scored by all players
xpathEx <- "/rosters/team/player/points/goals"
result <- xpathSApply(xmlObj, xpathEx, xmlValue)
cat(result)
29 23 5 0 1
How many goals were scored in total by all players?
The first solution uses R to calculate the sum.
xpathEx <- "/rosters/team/player/points/goals"
result <- xpathSApply(xmlObj, xpathEx, xmlValue)
result <- as.numeric(result)
cat(sum(result))
58
The second solution uses the sum aggregation function of XPath.
xpathEx <- "sum(/rosters/team/player/points/goals)"
result <- xpathSApply(xmlObj, xpathEx, xmlValue)
cat(result)
58
What are the last names of everyone?
This path expression does not spell out the full path from the root. It starts with // so it matches any lastname element regardless of where it is in the tree.
xpathEx <- "//lastname"
result <- xpathSApply(xmlObj, xpathEx, xmlValue)
cat(result)
Cassidy Marchand Bergeron McAvoy Rask D'Or
What are the last names of all players?
Like the previous example, this path does not spell out the full path from the root, but it is more restrictive, as it only matches lastname elements that are direct child nodes of a player node.
xpathEx <- "//player/lastname"
result <- xpathSApply(xmlObj, xpathEx, xmlValue)
cat(result)
Marchand Bergeron McAvoy Rask D'Or
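The difference between the two expressions can be reproduced on a tiny document. The Python sketch below uses the standard xml.etree.ElementTree module, where a search through all descendants of an element is written ".//". The miniature roster is an assumption made for illustration; the real TeamRosters.xml may be laid out differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature roster, invented for this illustration
xml_data = """
<rosters>
  <team>
    <coach><lastname>Cassidy</lastname></coach>
    <player><lastname>Marchand</lastname></player>
    <player><lastname>Bergeron</lastname></player>
  </team>
</rosters>
"""
root = ET.fromstring(xml_data)

# Any <lastname> anywhere below the root, coach included
everyone = [n.text for n in root.findall(".//lastname")]

# Only a <lastname> that is a direct child of a <player>
players = [n.text for n in root.findall(".//player/lastname")]

print(everyone)
print(players)
```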
What are the first and last names of all players?
The XPath below is a correct XPath 2.0 expression, although it does not evaluate in R because the XML package only supports XPath 1.0.
//player/concat(firstname,',',lastname)
An R solution would be to retrieve the elements individually and then concatenate them in R.
xpathEx <- "//player/firstname"
r.fn <- xpathSApply(xmlObj, xpathEx, xmlValue)
xpathEx <- "//player/lastname"
r.ln <- xpathSApply(xmlObj, xpathEx, xmlValue)
result <- paste0(r.ln, ', ', r.fn)
What are the names of the players who did not score a goal?
For this query two paths are presented: the first path is an absolute path while the second path is not. For this XML document they produce the same result.
xpathEx <- "/rosters/team/player[points/goals = 0]/lastname"
# same result as
xpathEx <- "//player[points/goals = 0]/lastname"
result <- xpathSApply(xmlObj, xpathEx, xmlValue)
cat(result)
Rask
How many goals did player 63 score?
Note the predicate expression specifying the value of the num attribute of player. The XML for a player node looks like this:
<player num="63">
...
</player>
<player num="37">
...
</player>
xpathEx <- "/rosters/team/player[@num = '63']/points/goals"
result <- xpathSApply(xmlObj, xpathEx, xmlValue)
cat(result)
29
How many points did player 63 score?
Note the use of the union operator | to combine the nodes matched by either path expression, which means that we receive both the goals and the assists. Alternatively, we could have executed each path expression individually and added the results in R.
xpathEx <- "sum(/rosters/team/player[@num='63']/points/goals | /rosters/team/player[@num='63']/points/assists)"
result <- xpathSApply(xmlObj, xpathEx, xmlValue)
cat(result)
69
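The alternative of evaluating each path separately and adding the results in the host language might look like the following Python sketch. It uses the standard xml.etree.ElementTree module, which supports neither the | operator nor sum(), so the arithmetic has to happen outside XPath anyway. The two-player roster is hypothetical: the element names follow the paths used above, the values for player 63 are chosen to agree with the results above, and the values for player 37 are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-player roster, invented for this illustration
xml_data = """
<rosters>
  <team>
    <player num="63"><points><goals>29</goals><assists>40</assists></points></player>
    <player num="37"><points><goals>10</goals><assists>20</assists></points></player>
  </team>
</rosters>
"""
root = ET.fromstring(xml_data)

# Select the player by attribute, then evaluate two relative paths from that node
player = root.find("./team/player[@num='63']")
goals = int(player.findtext("points/goals"))
assists = int(player.findtext("points/assists"))
total = goals + assists

print(total)
```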
How many players scored more than 20 goals?
Once again the path uses a predicate expression, but this time the result is passed as an argument to the XPath count function, which, like its counterpart in SQL, returns the number of nodes in the result.
xpathEx <- "count(/rosters/team/player[points/goals > 20]/points/goals)"
result <- xpathSApply(xmlObj, xpathEx, xmlValue)
cat(result)
2
What is the lowest number of goals any player scored?
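# XPath 1.0 has no min() function (it was added in XPath 2.0),
# so retrieve the values and compute the minimum in R
xpathEx <- "/rosters/team/player/points/goals"
result <- xpathSApply(xmlObj, xpathEx, xmlValue)
cat(min(as.numeric(result)))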
0
How many goals did Laurent D’Or score?
This query is a bit “tricky” as the last name includes an apostrophe which we often use in XPath as a string delimiter. Of course, we could use the player’s number but we want to use the name as the number may not be fixed for some types of players.
One approach is to reverse the way we use string delimiters. Notice how the code below uses single quotes (') to delimit the XPath expression in R and double quotes (") to delimit the value in the XPath condition. Finally, notice how the apostrophe in the name is escaped by adding a backslash before it. By writing \', R treats the apostrophe as a literal character rather than as the end of the string, and XPath receives it as part of a value delimited by double quotes.
xpathEx <- '/rosters/team/player[lastname = "D\'Or"]/points/goals'
result <- xpathSApply(xmlObj, xpathEx, xmlValue)
cat(result)
1
An alternative is to build the string from parts, choosing the quotes judiciously: embed the ' within a string delimited by " and the " within strings delimited by '.
xpathEx <- paste0('/rosters/team/player[lastname = "D',
"'Or",
'"]/points/goals')
result <- xpathSApply(xmlObj, xpathEx, xmlValue)
cat(result)
1
We could have also placed the value into a variable and then pasted that into the XPath expression; this may be a bit simpler looking but has the same effect.
lastname <- "D'Or"
xpathEx <- paste0('/rosters/team/player[lastname = "',
lastname,
'"]/points/goals')
result <- xpathSApply(xmlObj, xpathEx, xmlValue)
cat(result)
1
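The same quoting concern arises in any host language. As a comparison, here is a minimal Python sketch with the standard xml.etree.ElementTree module that pastes a variable into the expression and delimits the value with double quotes. The two-player document is hypothetical and only mirrors the element names used above.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature roster, invented for this illustration
xml_data = """
<rosters>
  <team>
    <player><lastname>Rask</lastname><points><goals>0</goals></points></player>
    <player><lastname>D'Or</lastname><points><goals>1</goals></points></player>
  </team>
</rosters>
"""
root = ET.fromstring(xml_data)

lastname = "D'Or"

# The value contains an apostrophe, so delimit it with double quotes in the XPath
xpath = f'./team/player[lastname="{lastname}"]/points/goals'
goals = root.findtext(xpath)

print(xpath)
print(goals)
```

This works only because the value contains one kind of quote. A value containing both an apostrophe and a double quote cannot be written as a single XPath 1.0 string literal and would need a different approach.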
Practice Queries
- What are the last names of all goalies?
- What is the total number of points scored by the entire team?
- What is the average number of goals for all players who are not goalies?
- Which player has the highest salary?
- Which players have an above average salary?
Tutorial
The video tutorial below uses different XML files to revisit the concepts and demonstrates them by walking through R code. Follow along and type in the statements yourself. A link to the XML file used in the tutorial is below:
Before you start, remember to install the package XML if you haven’t already done so.
