Load Subset of Birdstrikes Sample Data from CSV into Database in R

Introduction

The overall goal of this tutorial is to load a subset of a sample of the “Birdstrikes” CSV file into a normalized database schema in a MySQL database. We will demonstrate several important methods and approaches. Of course, there are many other approaches that are equally valid. In fact, the approaches shown here are most definitely not be the most efficient in terms of memory use and speed, but they are simple to implement and easy to follow.

See Lesson 6.302 Bulk Load Data from CSV into Database in R for a more generic approach.

Load CSV Data

We start by loading the entire CSV file from a URL into a new data frame df.csv. This CSV is a sample of a larger birdstrikes CSV.

Later on we will copy data from this data frame to two other data frames that will be mirrors of tables in the database. This approach is not terribly scalable and would likely be very slow or overtax our computer’s memory and processing capacity if we had several hundred thousand or more rows in the CSV.

CSV.URL = "http://tinyurl.com/3ukmyxjr"

df.csv <- read.csv(file = CSV.URL,
                   header = T,
                   stringsAsFactors = F)

Let’s inspect the raw data we read from the CSV. It might be useful to look at the entire CSV file in Excel or some other editor (such as viewing it in R Studio).

head(df.csv, 3)

We are good; the data is here. Now we will store the data into two data frames: one for the incidents and one for airports. All the information for both is in the df.csv data frame but that results in duplication – if an airport experienced multiple flights with bird strikes, then the same airport information is stored in each row; thus duplication. We will store the airport information only once in an airports table and then link a row for the airport to the incident, i.e., we normalize the data. Eventually we will store the data in the data frames in tables in the database.

Watch the video below to see what we are trying to accomplish.

We can several approaches to get the data from the CSV into tables in the database:

we can build data frames in R, place the data from the CSV into the data frames that mirror the database tables, and then write the data frames to the tables using dbWriteTable() (if the database allows that operation), or
we can write the data directly from the CSV to the database tables; that approach requires writing each row of the CSV one-by-one to the database using INSERT statements which can be slow but it doesn’t require extra memory for the data frames; so, this approach would work best if the CSV had millions of rows that wouldn’t fit into memory; this approach can be sped up by writing batches of rows (say, 1000 rows at a time)

Of course, there are other approaches too. We could use XSLT to transform the CSV into either another CSV or an XML and then load directly into the database. Or, we could write stored procedures to split the data.

The strategy used here is more simple than the one explained in Lesson 6.302 Bulk Load Data from CSV into Database in R but you should look at that lesson afterwards to see a more generic approach.

Create Data Frames

The strategy we will use to load the data from the CSV to the database is to create internal data frames that mirror the tables in the database and then write the data frames to the database. This requires extra memory but is fast and convenient. Again, this strategy would not work well if the CSV were very large or if the database does not support bulk writing (for example, db4free.net does not allow the use of dbWriteTable()).

We will create two data frames: one for the incidents and one for airports. Of course, the CSV contains more data, but storing the other data columns is analogous to this approach.

We will first create a data frame for the incidents. We will not assume we know the exact number of incidents, so we will create a data frame with no rows – R will add new rows every time we insert a row at the end, so this expands dynamically as needed: a bit slower than pre-allocating the memory but it will work for smaller CSV files.

The data frame df.incidents will have three columns (for demonstration purposes; of course, there’s more data in the CSV that would need to be stored):

rid – the incident ID from the rid column in the CSV
date – the incident date from the date column in the CSV
origin – a foreign key column which links to the primary key of the airport from which this flight departed (note that this column is not related to the origin column in the CSV)

df.incidents <- data.frame(rid = integer(0),
                           date = character(0),
                           origin = integer(0))

Note that the date column in the CSV contains both the date and time and we only want to store the date, so we need to remove the time part of the column.

Next, we will need a data frame for the airports, so let’s create that with the following columns:

aid – a unique identifier for the airport that we will assign
airportName – the name of the airport from the airport column in the CSV
airportCode – empty for now as we do not have that in the CSV
state – the state in which the airport is located from the origin column in the CSV

df.airports <- data.frame(aid = integer(0),
                          airportName = character(0),
                          airportCode = character(0),
                          state = character(0))

Split the CSV Data

We will use a simplistic and somewhat “brute-force” approach to splitting the CSV data into the two data frame, albeit this approach is very easy to implement and understand, but it is not efficient and can take a long time for large CSV files. However, the data could be processed in batches to make this a bit more efficient.

We will use loops which are notoriously slow in R and Python, but they are easy to understand. using R vector processing, functions such as which(), and using apply() would be much faster but the code is harder to write and understand, so we will forgo that for now. Again, see Lesson 6.302 Bulk Load Data from CSV into Database in R on alternative approaches.

The code below iterates through all rows in the CSV and adds the required columns from the CSV to the df.incidents data frame; we will not yet link the airports – that will be demonstrated in further down. Note that the code uses an associative access mechanism (df.incidents[r,'rid']) based on column names in data frame – this is convenient and a bit simpler to understand but slower than using df.incidents[r,1].

# let's get the number of rows in the CSV
n <- nrow(df.csv)

# iterate through the incident rows in the CSV
for (r in 1:n) 
{
  # store rid column from the CSV in the df.incidents data frame's rid column
  df.incidents[r,'rid'] <- df.csv[r,'rid']
  
  # store the flight-date from the CSV in the df.incidents date column
  # we should chop off the time but we'll not do that in this example
  df.incidents[r,'date'] <- df.csv[r,'flight_date']
}

So, let’s see if we have good data – or at least from what we can tell.

head(df.incidents, 4)

The date column contains the time stamp of 0:00 which we should remove. Ideally this should be done in the loop above, but we’ll do that as a separate step in this example. We will do this as a vector operation on the entire column rather than using a loop – this is the “R way” of doing this and much more efficient than a loop. The time stamp are the last four characters, so we will grab everything from the first character to the remainder except the last five (the time stamp plus the leading space).

df.incidents$date <- substr(df.incidents$date, 
                            1, 
                            nchar(df.incidents$date)-5)

Let’s take a quick look to make sure we got it right.

head(df.incidents, 4)

So, now we need to deal with the airport information which should go into the df.airports data frame We will use the following strategy: look to see if the airport name for an incident is already in the df.airports data frame. If it is, get its aid, but if it is not add the airport information to the data frame and assign a new unique aid.

This is best done using a function, so our code will not be as messy and easier to read. The function findAirportPK() below takes an airport name (aName), airport code (aCode), and the airport state (aState) as input arguments. It will then try to find that airport’s aid in the data frame df.airports; if it does, it’ll return it, if it doesn’t it’ll add the information as a new row and generate an aid and then return that. In all cases, what comes back from a call to that function, is an aid, i.e., the primary key of an airport.

We need to make sure we specify in the function that df.airport refers to the existing global data frame and not a local variable. So, we use <<- for assignment rather than <-. Alternatively, we could have passed the object df.airports as a parameter to the function.

If the airport name is blank, then we’ll save it under the name “UNKNOWN” so that we can better use that in future queries.

findAirportPK <- function (aName, aCode, aState)
{
  # aid is the return value
  aid <- 0
  
  # does airport already exist?
  doesExist <- any(df.airports$airportName == aName)
  
  if (doesExist == TRUE) {
    # we found the airport, so let's get it aid
    # find the row in df.airports that contains this airport
    r <- which(df.airports$airportName == aName)
    
    # found aid is the first element of the returned vector
    aid <- r[1]
  } else {
    # airport does not exist, let's add it after the last row
    # R adds a new rows automatically
    lastRow <- nrow(df.airports)
    newRow <- lastRow + 1
    
    # new aid is the row number
    df.airports[newRow,'aid'] <<- newRow
    
    # if the airport name aName is blank, use "unknown" as airport name
    if (nchar(aName) == 0)
      aName <- "UNKNOWN"
    
    # save airport name, state, and code
    df.airports[newRow,'airportName'] <<- aName
    df.airports[newRow,'airportCode'] <<- aCode
    df.airports[newRow,'state'] <<- aState
    
    # return value: new aid is new row number
    aid <- newRow
  }
  
  return (aid)
}

In the code above, the any() function looks for a matching airport by name; if it doesn’t find one, it’ll return FALSE, TRUE otherwise. If it is FALSE, we need to add the airport as a new row at the end. If the search with any() returns TRUE, then we need to find the aid for that using which().

We could have written that more efficiently using just a call which() as it returns a 0 length vector if it cannot find a match. But the above code it clearer, albeit slower.

Let’s add the call to the function to our previous code.

# let's get the number of rows in the CSV
n <- nrow(df.csv)

# iterate through the incident rows in the CSV
for (r in 1:n) 
{
  # store rid column from the CSV in the df.incidents data frame's rid column
  df.incidents[r,'rid'] <- df.csv[r,'rid']
  
  # store the flight-date from the CSV in the df.incidents date column
  df.incidents[r,'date'] <- df.csv[r,'flight_date']
  
  # add the FK in the origin column linking to airport aid in df.airports
  # add the airport if it doesn't exist
  aName <- df.csv[r,'airport']    # airport name from CSV
  aCode <- ""                     # leave blank
  aState <- df.csv[r,'origin']    # airport state from CSV
  
  # findAirportPK() returns ID of airport (and adds if it's not there)
  df.incidents[r,'origin'] <- findAirportPK (aName, aCode, aState)
}

# clean up the date column by remove the time stamp
df.incidents$date <- substr(df.incidents$date, 
                            1, 
                            nchar(df.incidents$date)-5)

# if there is no flight date, we'll set to 1/1/1900
df.incidents$date <- ifelse(nchar(df.incidents$date) == 0,
                            "1/1/1900",
                            df.incidents$date)

Whew… let’s see if we have it captured:

head(df.incidents)

Bulk Write Data to Database

Now we’ll write the data to the the database. We’ll first use the simpler and faster bulk loading approach using dbWriteTable(). Interestingly, when using dbWriteTable() it actually creates the tables in the database based on the structure of the data frame – we did not actually need to pre-create the tables.

Connect to Database

We will connect to an existing SQLite database. The connection parameters are not shown for security reasons, so if you wish to run this code you need to substitute your own connection parameters – of course, this code works on any relational database, including MySQL on AWS, Oracle, SQL Server, JavaDB, PostgresDB, MariaDB, etc. ¹

unlink("birdstrikeDB.sqlitedb")

library(RSQLite)

dbcon <- dbConnect(RSQLite::SQLite(), 
                   "birdstrikeDB.sqlitedb")

If the tables were pre-created, we would need to overwrite them by specifying overwrite = T, but that would essentially ignore any prior table structure. While this is “quick and easy” but does not allow us to define foreign key constraints so the database cannot enforce referential integrity. How to write data frame to an existing table is explained more clearly in Lesson 6.302 Bulk Load Data from CSV into Database in R.

Note that the code will not work with the hosted MySQL database on db4free.net as this database does not support dbWriteTable(). However, MySQL on AWS does support the function.

dbWriteTable(dbcon, "airports", df.airports, overwrite = T)
dbWriteTable(dbcon, "incidents", df.incidents, overwrite = T)

Let’s see if it worked… let’s first fetch the first few rows from the table incidents.

select * from incidents limit 5;

5 records
rid	date	origin
210435	8/13/2001	1
224522	7/24/2004	2
249714	8/17/2007	3
229778	10/1/2004	4
259731	1/1/1900	5

That appears to be as expected. Now, let’s look at the first few rows from the table airport.

select * from airports limit 5;

5 records
aid	airportName	state
1	BURKE LAKEFRONT	Ohio
2	SACRAMENTO INTL	California
3	BOB HOPE ARPT	California
4	GREATER PITTSBURGH	Pennsylvania
5	UNKNOWN	Florida

This also seems to work. So, we are good and the data is split and save into a normalized schema. Well, at least part of the data is – the other data can be saved in a similar fashion.

Close the Database

After accessing a database and when no further access is needed, always close the connection and release local resources and resources on the database server.

dbDisconnect(dbcon)

References

None.

Errata

None collected yet. Let us know.

If the database does not exist, you need to create it with the same data or modify the connection information below, if you wish to run the code. Should you not be able to connect, it might be either because the database does not exist, the connection information is incorrect, or port 3306 is blocked on your computer due to a firewall or anti-virus program setting.↩︎