Objectives

Upon completion of this short, you will be able to:

  • load a compressed XML file
  • load a compressed CSV file
  • load data from a URL

Read XML from a URL on GitHub

To get the raw URL to the actual file on GitHub, go to the file’s page and copy its URL, then modify it to fit the pattern for the raw URL: replace github.com with raw.githubusercontent.com and drop the /blob/ segment from the path.

So, for example, this URL, copied from the file’s page:

https://github.com/mschedlb/sandbox/blob/cb197113e6f0182b614d308ddf3537eab998ccec/pubmed22n0001-tf.xml.zip

would have the direct (raw) URL of

https://raw.githubusercontent.com/mschedlb/sandbox/cb197113e6f0182b614d308ddf3537eab998ccec/pubmed22n0001-tf.xml.zip

Note that GitHub does not allow non-SSL connections, so you must use https.
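If you prefer to do the rewrite programmatically, here is a minimal sketch using base R’s sub(); the URL is the same example as above, and nothing in it is specific to GitHub beyond the two substitutions:

## convert a github.com "blob" URL to its raw.githubusercontent.com equivalent
blob.url <- "https://github.com/mschedlb/sandbox/blob/cb197113e6f0182b614d308ddf3537eab998ccec/pubmed22n0001-tf.xml.zip"

raw.url <- sub("https://github.com/", "https://raw.githubusercontent.com/", blob.url, fixed = TRUE)
raw.url <- sub("/blob/", "/", raw.url, fixed = TRUE)

print(raw.url)
## [1] "https://raw.githubusercontent.com/mschedlb/sandbox/cb197113e6f0182b614d308ddf3537eab998ccec/pubmed22n0001-tf.xml.zip"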

Download File

The file is compressed, so we need to download the file to a (temporary) local folder, unzip it, and then load from the local folder. After it has been loaded, you can delete the local folder.

library(XML)

# create downloads folder in project folder
tf.dir <- "download.zip.xml"
if (!dir.exists(tf.dir)) {
  dir.create(tf.dir)
}

## Download the zip file
zip.file.name <- "messages.xml.zip"
url.zip <- paste0("https://raw.githubusercontent.com/mschedlb/sandbox/cb197113e6f0182b614d308ddf3537eab998ccec/", zip.file.name)
tf.zip.file <- paste0(tf.dir,"/",zip.file.name)

download.file(url = url.zip, 
              destfile = tf.zip.file, 
              quiet = TRUE)

## Unzip the temp folder
xml_files <- unzip(tf.zip.file, exdir = tf.dir)

## make sure the unzipped file exists, then parse it
if (file.exists(xml_files[1])) {
  ## Parse the first file
  xmlDOM <- xmlParse(xml_files[1])
}

## Delete temporary files and folder
unlink(tf.dir, recursive = TRUE, force = TRUE)

## check that the XML parsed properly
r <- xmlRoot(xmlDOM)
print(xmlSize(r))
## [1] 3

Of course, the above can be used to read zipped XML files from any hosting service and is not specific to GitHub; only the URLs need to change.
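To make that reuse explicit, the steps can also be wrapped in a small helper. The function below is a sketch, not part of the original code: it takes any URL to a zip file, downloads and unzips it into a scratch folder, parses the first file inside, and cleans up after itself.

## sketch of a reusable helper (not part of the original code)
readZippedXML <- function(url.zip, work.dir = "download.tmp") {
  ## create the scratch folder if it does not exist yet
  if (!dir.exists(work.dir)) {
    dir.create(work.dir)
  }

  ## download the zip file and extract its contents
  zip.path <- file.path(work.dir, basename(url.zip))
  download.file(url = url.zip, destfile = zip.path, quiet = TRUE)
  xml_files <- unzip(zip.path, exdir = work.dir)

  ## parse the first extracted file; the DOM lives in memory,
  ## so the scratch folder can be removed afterwards
  xmlDOM <- xmlParse(xml_files[1])
  unlink(work.dir, recursive = TRUE, force = TRUE)

  return(xmlDOM)
}

With such a helper, switching to a different host is only a matter of passing a different URL, e.g. xmlDOM <- readZippedXML(url.zip).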

Cache File

It can be time consuming and wasteful to repeatedly download large files. A better strategy is to download the file once, unzip it, and then, whenever it is needed, check whether it already exists locally. Only if it does not exist is it downloaded and unzipped again.

Let’s create the download folder at the beginning of our program and keep reusing it. Because the folder name is fixed and we check dir.exists() first, it is safe to run this code more than once; the folder is only created the first time.

# create downloads folder in project folder
tf.dir <- "download.zip.xml"
if (!dir.exists(tf.dir)) {
  dir.create(tf.dir)
}
library(XML)

## file to download
zip.file.name <- "messages.xml.zip"
url.zip <- paste0("https://raw.githubusercontent.com/mschedlb/sandbox/cb197113e6f0182b614d308ddf3537eab998ccec/", zip.file.name)
tf.zip.file <- paste0(tf.dir,"/",zip.file.name)

### strip the .zip extension
xml.fn <- substr(zip.file.name, 1, nchar(zip.file.name)-4)

### create path to local .xml file
xml.unzipped.fn <- paste0(tf.dir,"/",xml.fn)

## check if the file does not already exist
if (!file.exists(xml.unzipped.fn)) {
  ## it does not exist, so download and unzip
  download.file(url.zip, tf.zip.file, quiet = TRUE)

  ## Unzip the temp folder
  xml_files <- unzip(tf.zip.file, 
                     exdir = tf.dir)
}

xmlDOM <- xmlParse(xml.unzipped.fn)

r <- xmlRoot(xmlDOM)

print(xmlSize(r))
## [1] 3

It correctly prints the number of child elements directly underneath the root.
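If you want a slightly richer sanity check than the child count alone, the short sketch below (assuming xmlDOM was parsed as above) prints the name of the root element and the names of its direct children:

r <- xmlRoot(xmlDOM)
print(xmlName(r))              ## name of the root element
print(xmlSApply(r, xmlName))   ## names of its direct child elements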

Download and Cache CSV

In this section we will download a CSV from github into a dataframe and then save the dataframe object to an RDS file.

If we had a URL that starts with http, we could pass it to read.csv() directly. GitHub, however, only allows connections via https, which read.csv() cannot open in older versions of R. The more portable approach, then, is to again download the CSV into a temporary local directory and load it from there.

Unlike the prior example, where we created the folder ourselves, here we use a function to have R create a temporary folder for us. This is useful when we don’t want to keep the downloaded file.

## check if RDS file exists
strikesCacheFile <- "dfCSVCache.RDS"
if (file.exists(strikesCacheFile)) {
  ## load dataframe from cache
  df.strikes <- readRDS(file = strikesCacheFile)
} else {
  ## download CSV from https connection

  ## file to download
  csv.file.name <- "BirdStrikesData-V3.csv"
  
  url <- paste0(
    "https://raw.githubusercontent.com/mschedlb/sandbox/8aeb6264b8287148f9fccc923f4244b406b88400/", 
    csv.file.name)
  
  ## create temporary local folder
  t.dir <- tempdir()
  tf <- tempfile(tmpdir = t.dir)
  
  ## download file into temp folder
  download.file(url = url,
                destfile = tf,
                quiet = TRUE)
  
  ## read the CSV into dataframe
  df.strikes <- read.csv(file = tf, 
                         header = TRUE,
                         stringsAsFactors = FALSE)
  
  ## cache data frame for future use
  saveRDS(object = df.strikes,
          file = strikesCacheFile)
  
  ## delete the downloaded temp file (R manages the session temp folder)
  unlink(tf)
}

## ensure that the dataframe is loaded
head(df.strikes,3)
##      iid           airline    origin  aircraft           impact   flight_date
## 1 202152       US AIRWAYS*  New York B-737-400 Engine Shut Down 11/23/00 0:00
## 2 208159 AMERICAN AIRLINES     Texas     MD-80             None  7/25/01 0:00
## 3 207601          BUSINESS Louisiana     C-500             None  9/14/01 0:00
##          damage num_birds bird_size sky_conditions altitude_ft heavy_flag
## 1 Caused damage       859    Medium       No Cloud       1,500        Yes
## 2 Caused damage       424     Small     Some Cloud           0         No
## 3     No damage       261     Small       No Cloud          50         No

Naturally, the two approaches can be combined: download a zipped CSV, uncompress it, load it into a dataframe, and then save the dataframe object to a file for future use.
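As a sketch of that combination, the block below follows the same caching pattern; the cache file name and the URL are hypothetical placeholders, so substitute the raw URL of your own zipped CSV:

## check if the cached RDS file exists
zipCsvCacheFile <- "dfZippedCSVCache.RDS"
if (file.exists(zipCsvCacheFile)) {
  ## load dataframe from cache
  df.data <- readRDS(file = zipCsvCacheFile)
} else {
  ## hypothetical URL of a zipped CSV; replace with a real raw URL
  url.zip <- "https://raw.githubusercontent.com/<user>/<repo>/<commit>/data.csv.zip"

  ## download and unzip into R's session temp folder
  t.dir <- tempdir()
  tf <- tempfile(tmpdir = t.dir, fileext = ".zip")
  download.file(url = url.zip, destfile = tf, quiet = TRUE)
  csv_files <- unzip(tf, exdir = t.dir)

  ## read the first extracted CSV into a dataframe
  df.data <- read.csv(file = csv_files[1],
                      header = TRUE,
                      stringsAsFactors = FALSE)

  ## cache the dataframe for future use and remove the zip file
  saveRDS(object = df.data, file = zipCsvCacheFile)
  unlink(tf)
}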

Caching can be employed whenever a large object requires significant computation to create.
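For example, the same pattern can cache the result of an expensive computation rather than a download; the sketch below is hypothetical and assumes df.strikes (with a numeric num_birds column) is already loaded:

## cache an expensive-to-compute object, not just a downloaded one
aggCacheFile <- "strikesByAirline.RDS"
if (file.exists(aggCacheFile)) {
  agg <- readRDS(file = aggCacheFile)
} else {
  ## stand-in for a long-running computation
  agg <- aggregate(num_birds ~ airline, data = df.strikes, FUN = sum)
  saveRDS(object = agg, file = aggCacheFile)
}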

Summary

This short illustrated how to load compressed XML files from a file host and parse them, demonstrated how to deal with files only accessible via https, and explored the use of an RDS cache file to minimize file downloads.

