The overall goal of this tutorial is to load a subset of a sample of the “Birdstrikes” CSV file into a normalized database schema in a MySQL database. We will demonstrate several important methods and approaches. Of course, there are many other approaches that are equally valid. In fact, the approaches shown here are most definitely not be the most efficient in terms of memory use and speed, but they are simple to implement and easy to follow.
See Lesson 6.302 Bulk Load Data from CSV into Database in R for a more generic approach.
We start by loading the entire CSV file from a URL into a new data
frame df.csv
. This CSV is a sample of a larger birdstrikes
CSV.
Later on we will copy data from this data frame to two other data frames that will be mirrors of tables in the database. This approach is not terribly scalable and would likely be very slow or overtax our computer’s memory and processing capacity if we had several hundred thousand or more rows in the CSV.
CSV.URL = "http://tinyurl.com/3ukmyxjr"
df.csv <- read.csv(file = CSV.URL,
header = T,
stringsAsFactors = F)
Let’s inspect the raw data we read from the CSV. It might be useful to look at the entire CSV file in Excel or some other editor (such as viewing it in R Studio).
head(df.csv, 3)
We are good; the data is here. Now we will store the data into two
data frames: one for the incidents and one for
airports. All the information for both is in the
df.csv
data frame but that results in duplication – if an
airport experienced multiple flights with bird strikes, then the same
airport information is stored in each row; thus duplication. We will
store the airport information only once in an airports table
and then link a row for the airport to the incident, i.e., we
normalize the data. Eventually we will store the data in the data frames
in tables in the database.
Watch the video below to see what we are trying to accomplish.
We can several approaches to get the data from the CSV into tables in the database:
dbWriteTable()
(if the database
allows that operation), orOf course, there are other approaches too. We could use XSLT to transform the CSV into either another CSV or an XML and then load directly into the database. Or, we could write stored procedures to split the data.
The strategy used here is more simple than the one explained in Lesson 6.302 Bulk Load Data from CSV into Database in R but you should look at that lesson afterwards to see a more generic approach.
The strategy we will use to load the data from the CSV to the
database is to create internal data frames that mirror the tables in the
database and then write the data frames to the database. This requires
extra memory but is fast and convenient. Again, this strategy would not
work well if the CSV were very large or if the database does not support
bulk writing (for example, db4free.net does not allow the use
of dbWriteTable()
).
We will create two data frames: one for the incidents and one for airports. Of course, the CSV contains more data, but storing the other data columns is analogous to this approach.
We will first create a data frame for the incidents. We will not assume we know the exact number of incidents, so we will create a data frame with no rows – R will add new rows every time we insert a row at the end, so this expands dynamically as needed: a bit slower than pre-allocating the memory but it will work for smaller CSV files.
The data frame df.incidents will have three columns (for demonstration purposes; of course, there’s more data in the CSV that would need to be stored):
df.incidents <- data.frame(rid = integer(0),
date = character(0),
origin = integer(0))
Note that the date column in the CSV contains both the date and time and we only want to store the date, so we need to remove the time part of the column.
Next, we will need a data frame for the airports, so let’s create that with the following columns:
df.airports <- data.frame(aid = integer(0),
airportName = character(0),
airportCode = character(0),
state = character(0))
We will use a simplistic and somewhat “brute-force” approach to splitting the CSV data into the two data frame, albeit this approach is very easy to implement and understand, but it is not efficient and can take a long time for large CSV files. However, the data could be processed in batches to make this a bit more efficient.
We will use loops which are notoriously slow in R and Python, but
they are easy to understand. using R vector processing, functions such
as which()
, and using apply()
would be much
faster but the code is harder to write and understand, so we will forgo
that for now. Again, see Lesson 6.302
Bulk Load Data from CSV into Database in R on alternative
approaches.
The code below iterates through all rows in the CSV and adds the
required columns from the CSV to the df.incidents data frame;
we will not yet link the airports – that will be demonstrated in further
down. Note that the code uses an associative access mechanism
(df.incidents[r,'rid']
) based on column names in data frame
– this is convenient and a bit simpler to understand but slower than
using df.incidents[r,1]
.
# let's get the number of rows in the CSV
n <- nrow(df.csv)
# iterate through the incident rows in the CSV
for (r in 1:n)
{
# store rid column from the CSV in the df.incidents data frame's rid column
df.incidents[r,'rid'] <- df.csv[r,'rid']
# store the flight-date from the CSV in the df.incidents date column
# we should chop off the time but we'll not do that in this example
df.incidents[r,'date'] <- df.csv[r,'flight_date']
}
So, let’s see if we have good data – or at least from what we can tell.
head(df.incidents, 4)
The date column contains the time stamp of 0:00 which we should remove. Ideally this should be done in the loop above, but we’ll do that as a separate step in this example. We will do this as a vector operation on the entire column rather than using a loop – this is the “R way” of doing this and much more efficient than a loop. The time stamp are the last four characters, so we will grab everything from the first character to the remainder except the last five (the time stamp plus the leading space).
df.incidents$date <- substr(df.incidents$date,
1,
nchar(df.incidents$date)-5)
Let’s take a quick look to make sure we got it right.
head(df.incidents, 4)
So, now we need to deal with the airport information which should go
into the df.airports
data frame We will use the following
strategy: look to see if the airport name for an incident is already in
the df.airports
data frame. If it is, get its
aid
, but if it is not add the airport information to the
data frame and assign a new unique aid
.
This is best done using a function, so our code will not be as messy
and easier to read. The function findAirportPK() below takes an
airport name (aName), airport code (aCode), and the
airport state (aState) as input arguments. It will then try to
find that airport’s aid in the data frame
df.airports
; if it does, it’ll return it, if it doesn’t
it’ll add the information as a new row and generate an aid and
then return that. In all cases, what comes back from a call to that
function, is an aid, i.e., the primary key of an
airport.
We need to make sure we specify in the function that df.airport
refers to the existing global data frame and not a local variable. So,
we use <<-
for assignment rather than
<-
. Alternatively, we could have passed the object
df.airports as a parameter to the function.
If the airport name is blank, then we’ll save it under the name “UNKNOWN” so that we can better use that in future queries.
findAirportPK <- function (aName, aCode, aState)
{
# aid is the return value
aid <- 0
# does airport already exist?
doesExist <- any(df.airports$airportName == aName)
if (doesExist == TRUE) {
# we found the airport, so let's get it aid
# find the row in df.airports that contains this airport
r <- which(df.airports$airportName == aName)
# found aid is the first element of the returned vector
aid <- r[1]
} else {
# airport does not exist, let's add it after the last row
# R adds a new rows automatically
lastRow <- nrow(df.airports)
newRow <- lastRow + 1
# new aid is the row number
df.airports[newRow,'aid'] <<- newRow
# if the airport name aName is blank, use "unknown" as airport name
if (nchar(aName) == 0)
aName <- "UNKNOWN"
# save airport name, state, and code
df.airports[newRow,'airportName'] <<- aName
df.airports[newRow,'airportCode'] <<- aCode
df.airports[newRow,'state'] <<- aState
# return value: new aid is new row number
aid <- newRow
}
return (aid)
}
In the code above, the any()
function looks for a
matching airport by name; if it doesn’t find one, it’ll return FALSE,
TRUE otherwise. If it is FALSE, we need to add the airport as a new row
at the end. If the search with any()
returns TRUE, then we
need to find the aid for that using which()
.
We could have written that more efficiently using just a call
which()
as it returns a 0 length vector if it cannot find a
match. But the above code it clearer, albeit slower.
Let’s add the call to the function to our previous code.
# let's get the number of rows in the CSV
n <- nrow(df.csv)
# iterate through the incident rows in the CSV
for (r in 1:n)
{
# store rid column from the CSV in the df.incidents data frame's rid column
df.incidents[r,'rid'] <- df.csv[r,'rid']
# store the flight-date from the CSV in the df.incidents date column
df.incidents[r,'date'] <- df.csv[r,'flight_date']
# add the FK in the origin column linking to airport aid in df.airports
# add the airport if it doesn't exist
aName <- df.csv[r,'airport'] # airport name from CSV
aCode <- "" # leave blank
aState <- df.csv[r,'origin'] # airport state from CSV
# findAirportPK() returns ID of airport (and adds if it's not there)
df.incidents[r,'origin'] <- findAirportPK (aName, aCode, aState)
}
# clean up the date column by remove the time stamp
df.incidents$date <- substr(df.incidents$date,
1,
nchar(df.incidents$date)-5)
# if there is no flight date, we'll set to 1/1/1900
df.incidents$date <- ifelse(nchar(df.incidents$date) == 0,
"1/1/1900",
df.incidents$date)
Whew… let’s see if we have it captured:
head(df.incidents)
Now we’ll write the data to the the database. We’ll first use the
simpler and faster bulk loading approach using
dbWriteTable()
. Interestingly, when using
dbWriteTable()
it actually creates the tables in the
database based on the structure of the data frame – we did not actually
need to pre-create the tables.
We will connect to an existing SQLite database. The connection parameters are not shown for security reasons, so if you wish to run this code you need to substitute your own connection parameters – of course, this code works on any relational database, including MySQL on AWS, Oracle, SQL Server, JavaDB, PostgresDB, MariaDB, etc. 1
unlink("birdstrikeDB.sqlitedb")
library(RSQLite)
dbcon <- dbConnect(RSQLite::SQLite(),
"birdstrikeDB.sqlitedb")
If the tables were pre-created, we would need to overwrite them by specifying overwrite = T, but that would essentially ignore any prior table structure. While this is “quick and easy” but does not allow us to define foreign key constraints so the database cannot enforce referential integrity. How to write data frame to an existing table is explained more clearly in Lesson 6.302 Bulk Load Data from CSV into Database in R.
Note that the code will not work with the hosted MySQL database on db4free.net as this database does not
support dbWriteTable()
. However, MySQL on AWS does support
the function.
dbWriteTable(dbcon, "airports", df.airports, overwrite = T)
dbWriteTable(dbcon, "incidents", df.incidents, overwrite = T)
Let’s see if it worked… let’s first fetch the first few rows from the table incidents.
select * from incidents limit 5;
rid | date | origin |
---|---|---|
210435 | 8/13/2001 | 1 |
224522 | 7/24/2004 | 2 |
249714 | 8/17/2007 | 3 |
229778 | 10/1/2004 | 4 |
259731 | 1/1/1900 | 5 |
That appears to be as expected. Now, let’s look at the first few rows from the table airport.
select * from airports limit 5;
aid | airportName | airportCode | state |
---|---|---|---|
1 | BURKE LAKEFRONT | Ohio | |
2 | SACRAMENTO INTL | California | |
3 | BOB HOPE ARPT | California | |
4 | GREATER PITTSBURGH | Pennsylvania | |
5 | UNKNOWN | Florida |
This also seems to work. So, we are good and the data is split and save into a normalized schema. Well, at least part of the data is – the other data can be saved in a similar fashion.
After accessing a database and when no further access is needed, always close the connection and release local resources and resources on the database server.
dbDisconnect(dbcon)
None.
None collected yet. Let us know.
If the database does not exist, you need to create it with the same data or modify the connection information below, if you wish to run the code. Should you not be able to connect, it might be either because the database does not exist, the connection information is incorrect, or port 3306 is blocked on your computer due to a firewall or anti-virus program setting.↩︎