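Upon completion of this short lesson, you will be able to parse an XML document in R, traverse it through indexed access, copy its values into a data frame, and write the data frame to a CSV file.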
Inspect the XML document messages.xml before proceeding so you understand its structure.
Using the XML package, load the document into R with xmlParse(), setting the validate argument to TRUE so that the parser checks the XML against the grammar defined in its DTD. Finally, get the root element of the XML.
library(XML)
xmlDOM <- xmlParse("messages.xml", validate = TRUE)
r <- xmlRoot(xmlDOM)
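To get a feel for what inspecting the structure looks like from within R, here is a minimal sketch on a small inline document. The element names, nesting, and values are assumptions inferred from the indexing used later in this lesson; the real messages.xml and its DTD may differ. The sketch has no DTD, so it parses without validation.

```r
library(XML)

# Hypothetical stand-in for messages.xml (names and values are assumed).
# Written without whitespace between elements so every child is an element.
txt <- paste0(
  '<messages>',
  '<message mid="msg1"><date>2024-01-05</date>',
  '<header><to>ann@example.com</to><from>bob@example.com</from></header>',
  '<important/></message>',
  '<message mid="msg2"><date>2024-01-06</date>',
  '<header><to>bob@example.com</to><from>ann@example.com</from></header>',
  '</message>',
  '</messages>'
)

doc <- xmlParse(txt, asText = TRUE)   # parse from a string, no validation
root <- xmlRoot(doc)

xmlName(root)                         # name of the root element
xmlSize(root)                         # number of child nodes of the root

first <- root[[1]]                    # first child of the root
xmlAttrs(first)                       # named character vector of attributes
child.names <- unname(sapply(xmlChildren(first), xmlName))
child.names                           # names of the first message's children
```

Knowing the order of the child elements matters here because the indexed traversal relies on positions rather than names.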
# create data frame to hold values
df <- data.frame(to = character(),
                 from = character(),
                 stringsAsFactors = FALSE)
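The argument stringsAsFactors must be FALSE so that text columns are not converted to factors. (FALSE has been the default since R 4.0.0, but setting it explicitly keeps the code correct on older versions.)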
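The effect of the argument is easy to see on a small example. The following sketch uses made-up values and only base R:

```r
# Hypothetical recipient names, for illustration only
recipients <- c("ann", "bob", "ann")

as.text <- data.frame(to = recipients, stringsAsFactors = FALSE)
as.fact <- data.frame(to = recipients, stringsAsFactors = TRUE)

class(as.text$to)   # character: values can be assigned freely
class(as.fact$to)   # factor: values are restricted to the existing levels
levels(as.fact$to)  # the distinct values seen when the column was created
```

A factor column only accepts values that are already among its levels, which is why it is a poor fit for a column that is filled row by row with arbitrary text.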
Create an empty data frame to hold the values from the XML. Rows grow dynamically as they are added, although the number of rows can be preset to reduce the overhead of memory reallocation if it is known in advance.
# create data frame to hold values
n <- xmlSize(r)
df <- data.frame(mid = integer(n),
                 to = character(n),
                 from = character(n),
                 stringsAsFactors = FALSE)
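To illustrate what presetting the dimension gives us, here is a small sketch with a hard-coded row count (an assumption for the example; in the lesson the count comes from xmlSize(r)):

```r
# Hypothetical row count standing in for xmlSize(r)
n <- 3

# integer(n) and character(n) create vectors of length n,
# filled with 0 and "" respectively
df <- data.frame(mid = integer(n),
                 to = character(n),
                 from = character(n),
                 stringsAsFactors = FALSE)

# the rows already exist, so a value can be assigned by index
df$to[2] <- "ann@example.com"

nrow(df)
print(df)
```

The placeholder values (0 and empty strings) are overwritten as the rows are filled in.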
Let’s add data from the XML to the data frame through indexed traversal.
n <- xmlSize(r)

for (i in seq_len(n)) {
  # grab the ith message element
  aMessage <- r[[i]]

  # get the mid attribute of the message element
  # xmlAttrs() returns a named character vector, so select "mid" by name
  mid.attr <- xmlAttrs(aMessage)[["mid"]]

  # extract only the integer portion of the mid
  # all mid values start with "msg"
  mid.val <- substr(mid.attr, 4, nchar(mid.attr))
  mid <- as.integer(mid.val)

  # the second child of the message holds <to> (first) and <from> (second)
  val.to <- xmlValue(aMessage[[2]][[1]])
  val.from <- xmlValue(aMessage[[2]][[2]])

  # get the <important/> flag if it exists
  tmp <- aMessage[[3]]

  # tmp is NULL if <important/> is not present
  val.isImp <- !is.null(tmp)

  # add to new row in data frame
  df$mid[i] <- mid
  df$to[i] <- val.to
  df$from[i] <- val.from
  df$isImp[i] <- val.isImp
}
Note that, as an example, the Boolean (TRUE/FALSE) column isImp was added after the data frame was created. In addition, note that all data values in an XML document are text, so they must be converted to non-character types with functions such as as.integer() or as.numeric().
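The conversion step can be tried in isolation. The following sketch uses made-up text values of the kind that xmlValue() or xmlAttrs() would return; the variable names are illustrative only:

```r
# Hypothetical text values as they would arrive from the XML
mid.attr <- "msg1047"
amount.txt <- "19.95"
flag.txt <- "true"

mid <- as.integer(sub("^msg", "", mid.attr))  # strip the prefix, then convert
amount <- as.numeric(amount.txt)
flag <- as.logical(flag.txt)                  # also accepts "TRUE" and "T"

# text that is not a number converts to NA (normally with a warning)
bad <- suppressWarnings(as.integer("n/a"))

str(list(mid = mid, amount = amount, flag = flag, bad = bad))
```

Checking for NA after a conversion is a simple way to catch values that did not have the expected format.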
Finally, let’s write the data frame to a CSV file. By default, write.csv() also writes the row names (here, the row numbers), which can act as a sort of key, but we omit them in this case.
write.csv(df, file = "msg_log.csv", row.names = FALSE)
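To see what the row.names argument changes, the following sketch writes a small made-up data frame to temporary files and compares the header lines. The data and file paths are assumptions for illustration:

```r
# Hypothetical two-row data frame
demo <- data.frame(mid = c(1L, 2L),
                   to = c("ann@example.com", "bob@example.com"),
                   stringsAsFactors = FALSE)

with.names <- tempfile(fileext = ".csv")
without.names <- tempfile(fileext = ".csv")

write.csv(demo, file = with.names)                        # default: row names written
write.csv(demo, file = without.names, row.names = FALSE)  # row names omitted

header.with <- readLines(with.names)[1]
header.without <- readLines(without.names)[1]

header.with     # starts with an empty, unnamed column for the row names
header.without  # only the data frame's own columns

# reading the file back gives the original columns
back <- read.csv(without.names, stringsAsFactors = FALSE)
names(back)
```

Omitting the row names avoids an extra unnamed column when the CSV is read back or opened in another tool.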
This simple example demonstrates how to parse an XML, traverse it through indexed access, add values from the XML to a data frame, and finally write the data frame to a CSV.