Objectives

Upon completion of this short, you will be able to:

  • parse and validate an XML document
  • traverse the XML using indexed access
  • load data from the XML into a data frame
  • externalize the data frame as a CSV

Overview

Inspect the XML document messages.xml before proceeding so you understand its structure.

Parse XML with Validation into R

Using the XML package, load the document into R with xmlParse(). Setting the validate argument to TRUE makes the parser check that the document is consistent with the grammar defined in its DTD. Finally, get the root element of the XML document.

library(XML)
xmlDOM <- xmlParse("messages.xml", validate = TRUE)

r <- xmlRoot(xmlDOM)
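
As a minimal sketch of validated parsing, the example below parses a small document from a string instead of a file. The document, its internal DTD, and the element names (log, msg, to, from) are hypothetical and chosen only for illustration; they are not taken from messages.xml.

```r
library(XML)

# hypothetical document with an internal DTD (names are assumptions)
demo.txt <- paste0(
  '<?xml version="1.0"?>',
  '<!DOCTYPE log [',
  '<!ELEMENT log (msg*)>',
  '<!ELEMENT msg (to, from)>',
  '<!ATTLIST msg mid CDATA #REQUIRED>',
  '<!ELEMENT to (#PCDATA)>',
  '<!ELEMENT from (#PCDATA)>',
  ']>',
  '<log>',
  '<msg mid="msg1"><to>Ann</to><from>Bob</from></msg>',
  '<msg mid="msg2"><to>Cy</to><from>Dee</from></msg>',
  '</log>')

# asText = TRUE tells xmlParse() that the first argument is XML content,
# not a file name; validate = TRUE checks the content against the DTD
demo.doc <- xmlParse(demo.txt, asText = TRUE, validate = TRUE)

demo.root <- xmlRoot(demo.doc)
xmlName(demo.root)   # name of the root element
xmlSize(demo.root)   # number of child nodes of the root
```

Writing TRUE in full rather than T is a small safety measure, since T is an ordinary variable that can be reassigned.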

Store Values in Data Frame

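Create an empty data frame to hold values from the XML. With zero-length columns, rows grow dynamically as values are assigned.

# create an empty data frame to hold values
df <- data.frame(to = character(),
                 from = character(),
                 stringsAsFactors = FALSE)

Setting stringsAsFactors to FALSE keeps text columns as character vectors instead of converting them to factors. Since R 4.0.0 this is the default, but in earlier versions the argument must be set explicitly.

If the number of rows is known in advance, the columns can be preallocated to reduce the overhead of memory reallocation. Here, the number of child elements of the root gives the row count.

# create data frame with n preallocated rows
n <- xmlSize(r)
df <- data.frame(mid = integer(n),
                 to = character(n),
                 from = character(n),
                 stringsAsFactors = FALSE)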
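
The difference between growing a data frame and preallocating it can be sketched with base R alone. The column values below (user1, user2, ...) are made up for illustration; both approaches end with the same contents, but the first extends the data frame on every iteration.

```r
n <- 5L

# (a) grow: start with zero rows and assign one row past the end each time
grown <- data.frame(mid = integer(), to = character(),
                    stringsAsFactors = FALSE)
for (i in seq_len(n)) {
  grown[i, ] <- list(i, paste0("user", i))
}

# (b) preallocate: create all n rows up front, then fill them in place
pre <- data.frame(mid = integer(n), to = character(n),
                  stringsAsFactors = FALSE)
for (i in seq_len(n)) {
  pre$mid[i] <- i
  pre$to[i] <- paste0("user", i)
}

# both data frames hold the same values
all(grown$mid == pre$mid)
all(grown$to == pre$to)
```

For a handful of rows the cost is negligible; preallocation mainly pays off when the document contains many elements.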

Let’s add data from the XML to the data frame through indexed traversal.

n <- xmlSize(r)

for (i in seq_len(n)) {
  # grab the ith message element
  aMessage <- r[[i]]
  
  # get the mid attribute of the message element as a character string
  mid.attr <- xmlGetAttr(aMessage, "mid")
  
  # extract only the integer portion of the mid
  # all mid values start with "msg"
  mid.val <- sub("^msg", "", mid.attr)
  mid <- as.integer(mid.val)
  
  # get the values of <to> and <from>: the first and second children
  # of the message's second child element
  val.to <- xmlValue(aMessage[[2]][[1]])
  val.from <- xmlValue(aMessage[[2]][[2]])
  
  # get the <important/> flag if it exists
  tmp <- aMessage[[3]]
  
  # tmp is NULL if <important/> is not present
  val.isImp <- !is.null(tmp)
  
  # add to new row in data frame
  df$mid[i] <- mid
  df$to[i] <- val.to
  df$from[i] <- val.from
  df$isImp[i] <- val.isImp
}

Note that the Boolean (true/false) column isImp was added after the data frame was created, as an example of adding a column on the fly. In addition, note that all data values in an XML document are text, so they must be converted to non-character types using functions such as as.integer() or as.numeric().
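
The following is a minimal sketch of the same kind of traversal on a small inline document, offered as a variation rather than the method used above. The document and its element names (log, msg, to, from, important) are hypothetical, since the structure of messages.xml is not reproduced here. The sketch indexes the messages by position but looks up child elements by name through xmlChildren(), which avoids depending on child order, and it converts the text of the mid attribute to an integer.

```r
library(XML)

# hypothetical document; element names are assumptions for illustration
doc.txt <- paste0(
  '<log>',
  '<msg mid="msg17"><to>Ann</to><from>Bob</from><important/></msg>',
  '<msg mid="msg18"><to>Cy</to><from>Dee</from></msg>',
  '</log>')

root <- xmlRoot(xmlParse(doc.txt, asText = TRUE))
n <- xmlSize(root)

# preallocate all columns, including the logical flag
out <- data.frame(mid = integer(n), to = character(n), from = character(n),
                  isImp = logical(n), stringsAsFactors = FALSE)

for (i in seq_len(n)) {
  node <- root[[i]]            # ith child of the root, by position
  kids <- xmlChildren(node)    # list of child nodes, named by element name

  # attribute values are text: strip the "msg" prefix, then convert
  out$mid[i] <- as.integer(sub("^msg", "", xmlGetAttr(node, "mid")))
  out$to[i] <- xmlValue(kids[["to"]])
  out$from[i] <- xmlValue(kids[["from"]])
  out$isImp[i] <- "important" %in% names(kids)
}

print(out)
```

Declaring isImp up front as logical(n) keeps every column the same length from the start, at the cost of needing to know all columns before the loop runs.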

Write to CSV

Finally, let’s write the data frame to a CSV file. By default, the function write.csv() also writes the row names (here, the row numbers), which can act as a sort of key, but we omit them in this case.

write.csv(df, file = "msg_log.csv", row.names = FALSE)
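
To see the effect of row.names, the sketch below writes a small made-up data frame to two temporary files and compares their header lines. The data values and the use of tempfile() are illustrative assumptions, not part of the example above.

```r
# small made-up data frame
demo <- data.frame(mid = c(1L, 2L),
                   to = c("Ann", "Cy"),
                   isImp = c(TRUE, FALSE),
                   stringsAsFactors = FALSE)

with.names <- tempfile(fileext = ".csv")
without.names <- tempfile(fileext = ".csv")

write.csv(demo, file = with.names)                        # default: row names written
write.csv(demo, file = without.names, row.names = FALSE)  # row names omitted

# the default header begins with an empty "" field for the row-name column
readLines(with.names, n = 1)
readLines(without.names, n = 1)

# reading the file back recovers the three original columns
back <- read.csv(without.names, stringsAsFactors = FALSE)
names(back)
```

Omitting the row names keeps the file free of an unnamed first column, which would otherwise show up as an extra column when the CSV is read back.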

Summary

This simple example demonstrates how to parse and validate an XML document, traverse it through indexed access, add values from the XML to a data frame, and finally write the data frame to a CSV file.

