Introduction

This primer looks at XPath functions and explains how to use them in expressions to augment the query power of XPath. We will start by looking at general XPath expression construction and then show how functions can be integrated. The functions can be grouped by the data types on which they operate:

  • Node functions

  • Text functions

  • Boolean functions

  • Mathematical functions

For a complete XPath function reference, see http://www.w3.org/TR/xpath#corelib.

Note that the XML package in R implements XPath 1.0 only.

We will use two sample XML documents for illustration. Executing an XPath expression requires an evaluator, and evaluators are easy to come by. Almost every browser has one built in, and web apps such as FreeFormatter.com can execute XPath expressions for testing. Alternatively, you can execute them from a programming language such as R, Python, JavaScript, C++, or Java, among many others.

In this tutorial we will execute them from within R using the XML package. Let’s start by loading the two XML sample files into DOM objects.

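library(XML)

# File names of the two sample documents (relative to the working directory)
xmlCatFile <- "CDCatalog2.xml"
xmlRosterFile <- "TeamRosters.xml"

# xmlParse() builds an internal DOM that the XPath functions can query
xmlCatDOM <- xmlParse(xmlCatFile)
# Same result as xmlParse(): xmlTreeParse() with internal nodes enabled
xmlCatDOMIntrnl <- xmlTreeParse(xmlCatFile, useInternalNodes = TRUE)
xmlRosterDOM <- xmlParse(xmlRosterFile)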

Before continuing, download all the files (see Files & Resources) and inspect the XML documents.
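If the sample files are not at hand, the same parsing step can be tried on an inline document. The sketch below assumes a small, made-up catalog (its titles and structure are illustrative, not the contents of CDCatalog2.xml) and uses the asText argument of xmlParse() to parse a string instead of a file.

```r
library(XML)

# Hypothetical stand-in catalog; NOT the contents of CDCatalog2.xml
demoXML <- "<catalog>
  <cd><title>First Album</title></cd>
  <cd><title>Second Album</title></cd>
  <cd><title>Third Album</title></cd>
</catalog>"

# asText = TRUE tells xmlParse() that the argument is XML content, not a file name
demoDOM <- xmlParse(demoXML, asText = TRUE)

# Confirm the document parsed by checking the root element and running a query
rootName <- xmlName(xmlRoot(demoDOM))
titles <- xpathSApply(demoDOM, "//cd/title", xmlValue)

print(rootName)
print(titles)
```

Any of the XPath expressions in this primer can be tried against demoDOM in the same way, although the results will differ from those shown for the sample files.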

Node Functions

Node functions operate on node sets in the document object model (DOM): they identify a node by its position within a set or report the size of a set.

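  • position(): returns the position of the current node within the node set being processed

  • last(): returns the position of the last node in the current node set, which equals the size of the set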

  • count(): calculate the number of nodes in a node set

Examples: Node Functions

The XPath expression //cd[last()]/title returns the title of the last CD in the collection:

<title>Unchain my heart</title>
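The R call that evaluates such an expression might look like the sketch below. It assumes a made-up three-item catalog (titles "A", "B", "C") rather than the sample file, and shows two ways of returning the match: saveXML to get the serialized node and xmlValue to get only its text.

```r
library(XML)

# Hypothetical three-CD catalog used only for illustration
demoDOM <- xmlParse(
  "<catalog><cd><title>A</title></cd><cd><title>B</title></cd><cd><title>C</title></cd></catalog>",
  asText = TRUE
)

xpath <- "//cd[last()]/title"

# saveXML serializes each matching node, markup included
lastNodeXML <- xpathSApply(demoDOM, xpath, saveXML)
# xmlValue extracts only the text content of each matching node
lastTitle <- xpathSApply(demoDOM, xpath, xmlValue)

cat(lastNodeXML, "\n")
print(lastTitle)
```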

The XPath expression //cd[position() = 3]/title returns the title of the 3rd CD in the collection. Positions are indexed starting at 1.

## <title>Greatest Hits</title>
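position() is not limited to equality tests; it can be compared against numbers or against last() to select ranges. The following sketch assumes a made-up four-item catalog (titles "A" through "D") and is meant only to illustrate the idea.

```r
library(XML)

# Hypothetical four-CD catalog used only for illustration
demoDOM <- xmlParse(
  "<catalog><cd><title>A</title></cd><cd><title>B</title></cd><cd><title>C</title></cd><cd><title>D</title></cd></catalog>",
  asText = TRUE
)

# Titles of the first two CDs
firstTwo <- xpathSApply(demoDOM, "//cd[position() <= 2]/title", xmlValue)

# Titles of every CD except the last one
allButLast <- xpathSApply(demoDOM, "//cd[position() < last()]/title", xmlValue)

print(firstTwo)
print(allButLast)
```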

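Rather than calling the position() function explicitly, a node can also be selected through its index: a numeric predicate such as [3] is shorthand for [position() = 3], as shown below.

xpath = "//cd[3]/title"
rs <- xpathSApply(xmlCatDOM, xpath, xmlValue)
print(rs)
## [1] "Greatest Hits"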
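One subtlety is worth a small experiment: in //cd[2], the index is evaluated relative to each parent element, so the expression can match more than one node when cd elements sit under several parents. Wrapping the path in parentheses, as in (//cd)[2], applies the index to the whole result set instead. The sketch below assumes a made-up document with hypothetical store and shelf elements that do not appear in the sample files.

```r
library(XML)

# Hypothetical document: two shelves, each holding two CDs
demoDOM <- xmlParse(
  "<store><shelf><cd><title>A</title></cd><cd><title>B</title></cd></shelf><shelf><cd><title>C</title></cd><cd><title>D</title></cd></shelf></store>",
  asText = TRUE
)

# Second cd under EACH parent: one match per shelf
perParent <- xpathSApply(demoDOM, "//cd[2]/title", xmlValue)

# Second cd in the whole document: a single match
overall <- xpathSApply(demoDOM, "(//cd)[2]/title", xmlValue)

print(perParent)
print(overall)
```

In a flat document where all cd elements share one parent, both forms return the same node.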

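The example below counts the title nodes matched by //cd/title. Because count() returns a number rather than a node set, no node-processing function is passed to xpathSApply().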

xpath = "count(//cd/title)"
rs <- xpathSApply(xmlCatDOM, xpath)

print(rs)
## [1] 26
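The argument to count() can be any node set, including one narrowed by a predicate. As an illustration, the sketch below assumes a made-up catalog in which only some cd elements have a price child (the price element and its values are hypothetical) and counts both all CDs and only the priced ones.

```r
library(XML)

# Hypothetical catalog: three CDs, two of which carry a price element
demoDOM <- xmlParse(
  "<catalog><cd><title>A</title><price>9.90</price></cd><cd><title>B</title></cd><cd><title>C</title><price>7.50</price></cd></catalog>",
  asText = TRUE
)

# Number of cd elements in the document
nCDs <- xpathSApply(demoDOM, "count(//cd)")

# Number of cd elements that have at least one price child
nPriced <- xpathSApply(demoDOM, "count(//cd[price])")

print(nCDs)
print(nPriced)
```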

Summary

XPath is an important query mechanism for hierarchical XML data, and XPath functions such as position(), last(), and count() add positional selection and aggregation to it.

Files & Resources

All Files for Lesson 80.112
