Upon completion of this short, you will be able to:
To get the raw URL to the actual file on GitHub, go to the file's page, grab its URL, and then modify it to fit the pattern for the raw URL: replace the github.com host with raw.githubusercontent.com and drop the /blob/ segment from the path.
So, for example, this URL:
obtained from:
would have the direct URL of
Note that GitHub does not allow non-SSL connections, so you must use https.
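This URL rewriting can also be done programmatically. The sketch below uses a hypothetical file-page URL (a branch named "main" is assumed, not one of the commit-pinned files used later); it swaps the host and drops the /blob/ path segment:

```r
## hypothetical file-page URL (assumes a branch named "main")
page.url <- "https://github.com/mschedlb/sandbox/blob/main/messages.xml.zip"

## swap the host and drop the "/blob/" path segment
raw.url <- sub("github.com", "raw.githubusercontent.com", page.url, fixed = TRUE)
raw.url <- sub("/blob/", "/", raw.url, fixed = TRUE)

print(raw.url)
## [1] "https://raw.githubusercontent.com/mschedlb/sandbox/main/messages.xml.zip"
```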
The file is compressed, so we need to download it to a (temporary) local folder, unzip it, and then load it from the local folder. After it has been loaded, you can delete the local folder.
library(XML)

# create downloads folder in project folder
tf.dir <- "download.zip.xml"
if (!dir.exists(tf.dir)) {
  dir.create(tf.dir)
}

## Download the zip file
zip.file.name <- "messages.xml.zip"
url.zip <- paste0("https://raw.githubusercontent.com/mschedlb/sandbox/cb197113e6f0182b614d308ddf3537eab998ccec/", zip.file.name)
tf.zip.file <- paste0(tf.dir, "/", zip.file.name)

download.file(url = url.zip,
              destfile = tf.zip.file,
              quiet = TRUE)

## Unzip into the temp folder
xml_files <- unzip(tf.zip.file, exdir = tf.dir)

## wait until the file exists and then parse
if (file.exists(xml_files[1])) {
  ## Parse the first file
  xmlDOM <- xmlParse(xml_files[1])
}

## Delete temporary files and folder
unlink(tf.dir, recursive = TRUE, force = TRUE)

## check that the XML parsed properly
r <- xmlRoot(xmlDOM)
print(xmlSize(r))
## [1] 3
Of course, the above can be used to read zipped XML files from any hosting service and is not specific to GitHub; naturally, the URLs have to be changed.
It can be time consuming and wasteful to repeatedly download large files. A better strategy is to download once, unzip the file, and then, when needed, check whether the file exists. If the file does not exist, it is downloaded and unzipped.
Let's create a temporary folder at the beginning of our program and keep reusing it. Don't run this multiple times, as it will create a new temp folder, with a new name, each time.
# create downloads folder in project folder
tf.dir <- "download.zip.xml"
if (!dir.exists(tf.dir)) {
  dir.create(tf.dir)
}

library(XML)

## file to download
zip.file.name <- "messages.xml.zip"
url.zip <- paste0("https://raw.githubusercontent.com/mschedlb/sandbox/cb197113e6f0182b614d308ddf3537eab998ccec/", zip.file.name)
tf.zip.file <- paste0(tf.dir, "/", zip.file.name)

### strip the .zip extension
xml.fn <- substr(zip.file.name, 1, nchar(zip.file.name) - 4)

### create path to the local .xml file
xml.unzipped.fn <- paste0(tf.dir, "/", xml.fn)

## check if the file does not already exist
if (!file.exists(xml.unzipped.fn)) {
  ## it does not exist, so download and unzip
  download.file(url.zip, tf.zip.file, quiet = TRUE)
  ## Unzip into the temp folder
  xml_files <- unzip(tf.zip.file, exdir = tf.dir)
}

xmlDOM <- xmlParse(xml.unzipped.fn)

r <- xmlRoot(xmlDOM)
print(xmlSize(r))
## [1] 3
It correctly prints the number of child elements directly underneath the root.
In this section we will download a CSV from GitHub into a dataframe and then save the dataframe object to an RDS file.
If we had a URL that starts with http, then we could use read.csv() to read the URL directly, but GitHub only allows connections via https, and read.csv() does not support that protocol. So, we again need to download the CSV into a temporary local directory and load it from there.
Unlike the prior example, where we created the temporary folder ourselves, here we will use a function to have R create a temporary folder for us. This is useful when we don't want to keep the downloaded file.
## check if the RDS cache file exists
strikesCacheFile <- "dfCSVCache.RDS"
if (file.exists(strikesCacheFile)) {
  ## load dataframe from cache
  df.strikes <- readRDS(file = strikesCacheFile)
} else {
  ## download CSV over an https connection
  ## file to download
  csv.file.name <- "BirdStrikesData-V3.csv"

  url <- paste0(
    "https://raw.githubusercontent.com/mschedlb/sandbox/8aeb6264b8287148f9fccc923f4244b406b88400/",
    csv.file.name)

  ## create temporary local folder and file
  t.dir <- tempdir()
  tf <- tempfile(tmpdir = t.dir)

  ## download file into temp folder
  download.file(url = url,
                destfile = tf,
                quiet = TRUE)

  ## read the CSV into a dataframe
  df.strikes <- read.csv(file = tf,
                         header = TRUE,
                         stringsAsFactors = FALSE)

  ## cache the dataframe for future use
  saveRDS(object = df.strikes,
          file = strikesCacheFile)

  ## delete the downloaded temp file
  unlink(tf)
}
}
## ensure that the dataframe is loaded
head(df.strikes,3)
## iid airline origin aircraft impact flight_date
## 1 202152 US AIRWAYS* New York B-737-400 Engine Shut Down 11/23/00 0:00
## 2 208159 AMERICAN AIRLINES Texas MD-80 None 7/25/01 0:00
## 3 207601 BUSINESS Louisiana C-500 None 9/14/01 0:00
## damage num_birds bird_size sky_conditions altitude_ft heavy_flag
## 1 Caused damage 859 Medium No Cloud 1,500 Yes
## 2 Caused damage 424 Small Some Cloud 0 No
## 3 No damage 261 Small No Cloud 50 No
Naturally, the two approaches can be combined: download a zipped CSV, uncompress the zipped file, load it into a dataframe, and then save the dataframe object to a file for future use.
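A minimal sketch of that combination might look as follows. The zip file name and URL here are hypothetical placeholders, not files from the repository used above; the caching logic mirrors the RDS check shown earlier:

```r
## cache file for the combined approach
cacheFile <- "dfZippedCSVCache.RDS"

if (file.exists(cacheFile)) {
  ## load dataframe from cache
  df.data <- readRDS(cacheFile)
} else {
  ## hypothetical zipped CSV; substitute a real URL
  zip.name <- "data.csv.zip"
  url.zip  <- paste0("https://example.com/files/", zip.name)

  ## download the zip into a temporary folder
  t.dir <- tempdir()
  tf    <- paste0(t.dir, "/", zip.name)
  download.file(url.zip, tf, quiet = TRUE)

  ## unzip and read the extracted CSV
  csv.files <- unzip(tf, exdir = t.dir)
  df.data   <- read.csv(csv.files[1],
                        header = TRUE,
                        stringsAsFactors = FALSE)

  ## cache the dataframe and remove the downloads
  saveRDS(df.data, cacheFile)
  unlink(c(tf, csv.files))
}
```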
Caching can be employed whenever a large object requires significant computation for its creation.
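For example, the same check-then-compute pattern can cache any expensive result; here a simple placeholder computation stands in for real work:

```r
## cache file for the computed object
cache.file <- "expensiveResult.RDS"

if (file.exists(cache.file)) {
  ## reuse the previously computed result
  result <- readRDS(cache.file)
} else {
  ## placeholder standing in for an expensive computation
  result <- sum(sqrt(1:1e6))
  ## cache the result for future runs
  saveRDS(result, cache.file)
}

print(result)
```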
This short illustrated how to load compressed XML files from a file host and parse them, demonstrated how to deal with files only accessible via https, and explored the use of an RDS cache file to minimize file downloads.