Introduction
This primer looks at XPath functions and explains how to use them in expressions to augment the query power of XPath. We will start by looking at general XPath expression construction and then show how functions can be integrated. The functions can be grouped based on the data types on which they operate:
Node functions
Text functions
Boolean functions
Mathematical functions
For a complete XPath function reference, see http://www.w3.org/TR/xpath#corelib.
Note that the XML package in R implements XPath 1.0 only.
We will use two sample XML documents for illustration. Executing an XPath expression requires an evaluator. Luckily, evaluators are easily available. Almost every browser has one built in, and there are web apps that can execute XPath expressions for testing, such as FreeFormatter.com. Alternatively, you can execute them in a programming language such as R, Python, JavaScript, C++, or Java, among many others.
In this tutorial we will execute them from within R using the XML package. Let’s start by loading the XML sample files into two separate DOM objects.
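library(XML)
# File names of the two sample documents (relative to the working directory)
xmlCatFile <- "CDCatalog2.xml"
xmlRosterFile <- "TeamRosters.xml"
# xmlParse() builds an internal DOM that the XPath functions can query
xmlCatDOM <- xmlParse(xmlCatFile)
# Equivalent call: xmlTreeParse() with useInternalNodes = TRUE
xmlCatDOMIntrnl <- xmlTreeParse(xmlCatFile, useInternalNodes = TRUE)
xmlRosterDOM <- xmlParse(xmlRosterFile)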
Before continuing, download all the files (see Files & Resources) and inspect the XML documents.
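If the sample files are not at hand yet, the same workflow can be tried on a small inline document. The sketch below is only an illustration: the XML content and the name demoDOM are assumptions and do not reproduce CDCatalog2.xml.

```r
library(XML)

# Hypothetical stand-in for the catalog file; element names are assumed.
xmlText <- '<catalog>
  <cd><title>Empire Burlesque</title></cd>
  <cd><title>Greatest Hits</title></cd>
</catalog>'

# asText = TRUE tells the parser that the argument is XML content,
# not a file name.
demoDOM <- xmlParse(xmlText, asText = TRUE)

# Inspect the root element and pull the text of every title node.
rootName <- xmlName(xmlRoot(demoDOM))
titles <- xpathSApply(demoDOM, "//cd/title", xmlValue)

print(rootName)
print(titles)
```

Parsing from a string keeps an example self-contained, which is handy when testing an expression before running it against the full documents.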
Node Functions
Node functions refer to specific nodes in the document object model (DOM).
position(): returns the position of the current node within the node set being evaluated
last(): returns the position of the last node in that node set, which equals its size
count(): returns the number of nodes in a node set
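The three functions can be compared side by side on a tiny document. This is a minimal sketch; the list/item document and the variable names are made up for illustration.

```r
library(XML)

# Hypothetical three-item document used only for illustration.
doc <- xmlParse("<list><item>a</item><item>b</item><item>c</item></list>",
                asText = TRUE)

# position(): keep the items whose position in the context is below 3.
firstTwo <- xpathSApply(doc, "/list/item[position() < 3]", xmlValue)

# last(): keep the item whose position equals the size of the context.
lastItem <- xpathSApply(doc, "/list/item[last()]", xmlValue)

# count(): return the size of the node set as a number.
n <- xpathSApply(doc, "count(/list/item)")

print(firstTwo)
print(lastItem)
print(n)
```

Note that position() and last() are used inside a predicate, while count() takes a node set as its argument and can form an expression on its own.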
Examples: Node Functions
The XPath expression //cd[last()]/title returns the title of the last CD in the collection:
<title>Unchain my heart</title>
The XPath expression //cd[position() = 3]/title returns the title of the third CD in the collection. Positions are indexed starting at 1.
## <title>Greatest Hits</title>
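Results like the two above can be produced by passing saveXML to xpathSApply(), which serializes each matching node together with its tags. The sketch below assumes a flat, four-CD catalog invented for illustration, so it does not depend on the sample file.

```r
library(XML)

# Hypothetical flat catalog standing in for the tutorial's sample file.
doc <- xmlParse('<catalog>
  <cd><title>Empire Burlesque</title></cd>
  <cd><title>Hide your heart</title></cd>
  <cd><title>Greatest Hits</title></cd>
  <cd><title>Unchain my heart</title></cd>
</catalog>', asText = TRUE)

# saveXML turns each matching node into its XML text.
lastTitle  <- xpathSApply(doc, "//cd[last()]/title", saveXML)
thirdTitle <- xpathSApply(doc, "//cd[position() = 3]/title", saveXML)

cat(lastTitle, "\n")
cat(thirdTitle, "\n")
```

Swapping saveXML for xmlValue would return only the text content of each node, without the surrounding title tags.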
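Rather than specifying a position with the position() function, a node can also be selected by writing its index directly in the predicate: //cd[3]/title is shorthand for //cd[position() = 3]/title. Extracting only the text of the matching node gives: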
## [1] "Greatest Hits"
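One subtlety of positional predicates is that they are evaluated relative to each parent element. The following sketch uses a hypothetical nested document (two shelf elements, which are not part of the tutorial's catalog) to show the shorthand equivalence and the difference between //cd[last()] and (//cd)[last()].

```r
library(XML)

# Hypothetical nested document: two shelves with two CDs each.
doc <- xmlParse('<catalog>
  <shelf><cd><title>A</title></cd><cd><title>B</title></cd></shelf>
  <shelf><cd><title>C</title></cd><cd><title>D</title></cd></shelf>
</catalog>', asText = TRUE)

# A bare number in a predicate is shorthand for position() = number.
byIndex    <- xpathSApply(doc, "//cd[2]/title", xmlValue)
byPosition <- xpathSApply(doc, "//cd[position() = 2]/title", xmlValue)
print(identical(byIndex, byPosition))

# //cd[last()] is evaluated per parent: the last cd of every shelf.
perParent <- xpathSApply(doc, "//cd[last()]/title", xmlValue)
print(perParent)

# (//cd)[last()] indexes the whole node set in document order instead.
overall <- xpathSApply(doc, "(//cd)[last()]/title", xmlValue)
print(overall)
```

When all cd elements share a single parent, the two forms select the same node, so the distinction only matters for nested documents.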
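The example below uses count() to return the number of title nodes that match the path //cd/title. Because count() yields a number rather than a node set, no extraction function is passed to xpathSApply().
xpath <- "count(//cd/title)"
rs <- xpathSApply(xmlCatDOM, xpath)
print(rs)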
## [1] 26
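count() can also be applied to a filtered node set. The sketch below assumes a hypothetical catalog in which each cd has a price child (titles and prices are made up), and cross-checks the result by measuring the node set on the R side with getNodeSet().

```r
library(XML)

# Hypothetical catalog; titles and prices are made up for illustration.
doc <- xmlParse('<catalog>
  <cd><title>A</title><price>10.90</price></cd>
  <cd><title>B</title><price>9.90</price></cd>
  <cd><title>C</title><price>8.20</price></cd>
</catalog>', asText = TRUE)

# count() over a filtered node set: CDs priced below 10.
cheap <- xpathSApply(doc, "count(//cd[price < 10])")

# The same number obtained in R from the length of the node set.
cheapNodes <- getNodeSet(doc, "//cd[price < 10]")

print(cheap)
print(length(cheapNodes))
```

Counting inside the XPath expression avoids materializing the nodes in R, whereas getNodeSet() is the better fit when the nodes themselves are needed afterwards.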
Summary
XPath is an important query mechanism for hierarchical XML data, and XPath functions such as position(), last(), and count() extend path expressions with positional selection and aggregation.
References
No references.
Errata
None collected yet. Let us know.