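Upon completion of this short, you will be able to parse an XML document in R, traverse it through indexed access, add its values to a data frame, and write the data frame to a CSV.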
Inspect the XML document messages.xml before proceeding so you understand its structure.
Using the XML package, load the document into R with xmlParse(), setting the validate argument to T so that the parser checks the XML against the grammar defined in its DTD. Then get the root element of the XML with xmlRoot().
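library(XML)
# parse the file and validate it against its DTD
xmlDOM <- xmlParse("messages.xml", validate = T)
# get the root element of the parsed document
r <- xmlRoot(xmlDOM)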
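To get a feel for the tree before traversing it, a few accessor calls on the root and its first child are usually enough. The sketch below parses a small inline sample rather than messages.xml. The element names <date> and <envelope> and all values are assumptions made for illustration, so the real file may be laid out differently.

```r
library(XML)

# Hypothetical sample document; element names and values are illustrative only.
# It is built without whitespace between tags so that every child is an element.
sample.xml <- paste0(
  '<messages>',
  '<message mid="msg101"><date>2024-01-05</date>',
  '<envelope><to>Ann</to><from>Bob</from></envelope><important/></message>',
  '<message mid="msg102"><date>2024-01-06</date>',
  '<envelope><to>Cy</to><from>Dee</from></envelope></message>',
  '</messages>')

# asText = TRUE tells xmlParse() that the argument is XML content, not a file name
doc <- xmlParse(sample.xml, asText = TRUE)
root <- xmlRoot(doc)

xmlName(root)   # name of the root element: "messages"
xmlSize(root)   # number of child nodes under the root: 2

first <- root[[1]]
# names of the first message's children, in document order
child.names <- unname(sapply(xmlChildren(first), xmlName))
# attribute lookup by name on the named character vector from xmlAttrs()
first.mid <- xmlAttrs(first)[["mid"]]
# text content of the first grandchild under the second child
first.to <- xmlValue(first[[2]][[1]])
```

Printing child.names for one or two messages shows which positional indexes correspond to which elements, which is what indexed traversal relies on.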
# create data frame to hold values
df <- data.frame(to = character(),
                 from = character(),
                 stringsAsFactors = F)
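The argument stringsAsFactors must be F so that text columns are not converted to factor variables. In R versions before 4.0.0 the default is TRUE, so setting it explicitly keeps the behavior the same across versions.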
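The effect of the argument is easy to check on a toy data frame. This is a minimal sketch with made-up values, not data from messages.xml.

```r
# Same column built twice, differing only in stringsAsFactors
chars <- data.frame(to = c("Ann", "Cy"), stringsAsFactors = FALSE)
facs  <- data.frame(to = c("Ann", "Cy"), stringsAsFactors = TRUE)

class(chars$to)  # "character"
class(facs$to)   # "factor"
```

Character columns can be assigned arbitrary new strings, whereas a factor column only accepts values that are already among its levels.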
Create an empty data frame to hold the values from the XML. Rows can grow dynamically as values are added, but when the number of rows is known in advance, as it is here from xmlSize(r), the columns can be preallocated to reduce the overhead of repeated memory reallocation.
# create data frame to hold values
n <- xmlSize(r)
df <- data.frame(mid = integer(n),
                 to = character(n),
                 from = character(n),
                 stringsAsFactors = F)
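The two approaches can be compared side by side on toy values. The following is a sketch under the assumption of two rows with made-up names. The variable names grown and pre are hypothetical and not part of the lesson code.

```r
# Growing: start with zero rows and append one row at a time
grown <- data.frame(to = character(), from = character(),
                    stringsAsFactors = FALSE)
grown[nrow(grown) + 1, ] <- list("Ann", "Bob")
grown[nrow(grown) + 1, ] <- list("Cy", "Dee")

# Preallocating: reserve all rows up front, then fill them in by index
n.rows <- 2
pre <- data.frame(to = character(n.rows), from = character(n.rows),
                  stringsAsFactors = FALSE)
pre$to[1] <- "Ann"
pre$from[1] <- "Bob"

nrow(grown)  # 2
nrow(pre)    # 2, although row 2 still holds empty strings
```

Preallocated character columns start out as empty strings and integer columns as zeros, so any row the loop never fills keeps those placeholder values.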
Let’s add data from the XML to the data frame through indexed traversal.
n <- xmlSize(r)
for (i in seq_len(n)) {
  # grab the ith message element
  aMessage <- r[[i]]
  # get the mid attribute of the message element
  # xmlAttrs() returns a named character vector, so select "mid" by name
  mid.attr <- xmlAttrs(aMessage)[["mid"]]
  # extract only the integer portion of the mid
  # all mid values start with "msg"
  mid.val <- substr(mid.attr, 4, nchar(mid.attr))
  mid.val <- as.integer(mid.val)
  # get the values of the <to> and <from> elements,
  # the first and second children of the message's second child
  val.to <- xmlValue(aMessage[[2]][[1]])
  val.from <- xmlValue(aMessage[[2]][[2]])
  # get the <important/> flag if it exists
  tmp <- aMessage[[3]]
  # tmp is NULL if <important/> is not present
  val.isImp <- !is.null(tmp)
  # add to new row in data frame
  df$mid[i] <- mid.val
  df$to[i] <- val.to
  df$from[i] <- val.from
  df$isImp[i] <- val.isImp
}
Note that we added the Boolean (true/false) column isImp after creating the data frame, as an example. In addition, note that all data values in an XML document are text and need to be converted to non-character types using functions such as as.numeric() or as.integer().
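The conversion step can be tried in isolation on literal strings. This is a minimal sketch. The values "msg1047", "3.75", and "true" are made up for illustration and do not come from messages.xml.

```r
# An attribute value arrives as text, even when it looks numeric
mid.attr <- "msg1047"

# drop the "msg" prefix (characters 1 to 3) and convert the rest to an integer
mid.val <- as.integer(substr(mid.attr, 4, nchar(mid.attr)))
mid.val             # 1047

# other common conversions from text
price <- as.numeric("3.75") + 1   # 4.75
flag  <- as.logical("true")       # TRUE
```

If the text cannot be interpreted as a number, as.integer() and as.numeric() return NA with a warning rather than stopping, so it is worth checking for NA after converting.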
Finally, let’s write the data frame to a CSV. By default, the function write.csv() also writes the row names (here, the row numbers), which can act as a sort of key, but we want to omit them in this case.
write.csv(df, file = "msg_log.csv", row.names = F)
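To see what row.names changes, the header line of each variant can be read back. The sketch below uses a toy data frame and temporary files, so the name msgs and its values are illustrative only.

```r
# Toy data frame standing in for the one built from the XML
msgs <- data.frame(mid = c(101L, 102L), to = c("Ann", "Cy"),
                   stringsAsFactors = FALSE)

with.rows <- tempfile(fileext = ".csv")
without.rows <- tempfile(fileext = ".csv")

write.csv(msgs, file = with.rows)                        # default: row names included
write.csv(msgs, file = without.rows, row.names = FALSE)  # row names omitted

header.with <- readLines(with.rows)[1]        # "","mid","to"
header.without <- readLines(without.rows)[1]  # "mid","to"
```

With the default, the first column has an empty header and holds the row names, which many tools then import as an unnamed extra column.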
This simple example demonstrates how to parse an XML, traverse it through indexed access, add values from the XML to a data frame, and finally write the data frame to a CSV.