Introduction

The overall goal of this tutorial is to load a CSV file, split the data into multiple tables that follow normal forms, create tables in a database, and then save the data from the CSV via data frames to the tables in the database. We will demonstrate several important methods and approaches. Of course, there are many other approaches that are equally valid. In fact, the approaches shown here may not be the most efficient in terms of memory use and speed, but they are simple to implement and easy to follow.

All work occurs in the R Notebook and is thus reproducible.

Note that the use of dbWriteTable() generally requires that “bulk or batch loading” is enabled for the database. It will work with SQLite, but this feature may not be available for some cloud implementations of MySQL. For example, dbWriteTable() does not work with MySQL on db4free.net. It does work with local installations of MySQL or a MySQL RDS on AWS.


If you are less experienced with R programming or find this tutorial a bit too complex to follow, consider reading this simpler tutorial first: 50.302A Load Subset of Birdstrikes Sample Data from CSV into Database in R.


Connect to Database

Since we will work with SQLite and MySQL, we will make two separate database connections.

First, we will create a new in-memory SQLite database. The code below makes a connection to an in-memory temporary SQLite database that will not persist once the program terminates. This is useful for testing or quick ad hoc work.

# change this code to connect to your specific database
library(RSQLite, quietly=T)

# this makes a connection to an in-memory temporary SQLite database
dbcon <- dbConnect(RSQLite::SQLite(), ":memory:")

Next, we will connect to an existing MySQL cloud database. The code below connects to an existing (pre-created) MySQL cloud database hosted on db4free.net1.

# this makes a connection to a cloud db4free database that was pre-created
library(RMySQL, quietly=T)
## 
## Attaching package: 'RMySQL'
## The following object is masked from 'package:RSQLite':
## 
##     isIdCurrent
db_user <- 'cs5200' 
db_password <- 'khour#2021'
db_name <- 'cs5200db1'
db_host <- 'db4free.net' # Host 85.10.205.173
db_port <- 3306

db4free <-  dbConnect(RMySQL::MySQL(), 
                      user = db_user, 
                      password = db_password,
                      dbname = db_name, 
                      host = db_host, 
                      port = db_port)

Load CSV Data

Now that we have a database connection, let’s load data from a CSV file and then store that CSV file into two tables in the database. Of course, loading a CSV file into tables can be done in other ways, too. Many databases allow loading of CSV files directly into tables using the database’s management console, e.g., MySQL Workbench for MySQL. Here we are demonstrating an simple ETL pipeline, where a CSV file is loaded into R, the data is transformed, and then stored in the database.

fn = "FlightsWithAirlines.csv"

df.raw <- read.csv(file = fn,
                   header = T,
                   stringsAsFactors = F)

Let’s inspect the raw data we read from the CSV.

head(df.raw, 3)
##   year month day dep_hr dep_min dep_delay carrier           airline country flight  equip tailnum origin dest
## 1 2013     9  18      2      38         0      UA            United     USA   1545 B737-8  N14228    EWR  IAH
## 2 2013     2  14     12      25        13      AA American Airlines     USA    441   B777  N24211    MIA  GRU
## 3 2015     1   7     13      56         9      AA American Airlines     USA   1141   A321  N619AA    JFK  MIA
str(df.raw)
## 'data.frame':    9 obs. of  14 variables:
##  $ year     : int  2013 2013 2015 2016 2015 2014 2017 2018 2020
##  $ month    : int  9 2 1 6 1 5 5 10 6
##  $ day      : int  18 14 7 7 30 2 5 24 16
##  $ dep_hr   : int  2 12 13 22 24 11 5 20 5
##  $ dep_min  : int  38 25 56 56 2 2 50 23 31
##  $ dep_delay: int  0 13 9 7 14 49 13 50 30
##  $ carrier  : chr  "UA" "AA" "AA" "B6" ...
##  $ airline  : chr  "United" "American Airlines" "American Airlines" "JetBlue" ...
##  $ country  : chr  "USA" "USA" "USA" "USA" ...
##  $ flight   : int  1545 441 1141 725 461 1696 507 5708 411
##  $ equip    : chr  "B737-8" "B777" "A321" "A321" ...
##  $ tailnum  : chr  "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin   : chr  "EWR" "MIA" "JFK" "JFK" ...
##  $ dest     : chr  "IAH" "GRU" "MIA" "BQN" ...

Create Data Frames

The strategy we will use to load the data from the CSV to the database is to create internal data frames that mirror the tables in the database and then write the data frames to the database. This requires extra memory but is fast and convenient. It would not work well if the CSV were very large.

We will need to find the unique set of airlines. There are many duplicates in the CSV file as it is unnormalized. We will use sqldf with a SQL query to find the unique set of airlines. We could do this also with tidyverse.

Notice how we place a fixed value in the first column of the result as we will need a data frame with the first column acting as an artificial primary key. We will assign those values as sequential values.

Airlines Data

We are going to make use of sqldf, a package for using SQL to query dataframes rather than tables in databases. Of course, we could also using tidyverse but sqldf is convenient and relies on SQL.

sqldf actually uses a database to temporarily store dataframes as tables, query them, and then delete them in the database. It will use a temporary in-memory SQLite database by default unless a connection has been made to another database. Unfortunately, this connection sharing can be an issue when using MySQL as that database requires authentication before tables can be created. So, we need to direct sqldf to use SQLite as a backing store. The code below accomplishes this.

# set backing store to SQLite for sqldf
options(sqldf.driver = 'SQLite')
# find the unique airlines
# notice how we select a fixed value for the first column
# it acts as a placeholder for the artificial PK we will assign later
df.airlines <- sqldf::sqldf("select 1 as aid, carrier, airline, country from `df.raw` group by carrier")

# assign an artificial PK using a counter
n.airlines <- nrow(df.airlines)
df.airlines[,1] <- seq(1, n.airlines)

Flights Data

# while we could also use sqldf here, we'll use a different approach to extract
# the data -- we will use entire columns; also note how the aid FK is set to 1;
# we will assign that later to the right PK in the airlines table
n.flights <- nrow(df.raw)
df.flights <- data.frame(fid = 100 + seq(1,n.flights),
                         dep = df.raw$origin,
                         arr = df.raw$dest,
                         aid = 1)

Bulk Load Data to Database

Now we’ll write the data to the tables in the database. We’ll first use the simpler and faster bulk loading approach using dbWriteTable(). Interestingly, when using dbWriteTable() it actually creates the tables in the database based on the structure of the data frame – we did not actually need to pre-create the tables. If the tables were pre-created, we would need to overwrite them by specifying overwrite = T, but that would essentially ignore any prior table structure.

dbWriteTable(dbcon, "airlines", df.airlines, overwrite = T)
dbWriteTable(dbcon, "flights", df.flights, overwrite = T)

Let’s see if it worked…

select * from flights limit 5;
Table 1: 5 records
fid dep arr aid
101 EWR IAH 6
102 MIA GRU 1
103 JFK MIA 1
104 JFK BQN 2
105 LGA ATL 3
select * from airlines limit 5;
Table 2: 5 records
aid carrier airline country
1 AA American Airlines USA
2 B6 JetBlue USA
3 DL Delta Airlines USA
4 LH Lufthansa Germany
5 NK Spirit Airways USA

Save to Existing Schema

If we want to use an existing schema and already created tables, then we have to change the parameters for dbWriteTable() and make sure that the columns in the dataframe match the columns in the table (same order and same data types). One reason for not overwriting the tables – and essentially letting R create tables for us – is that we want to have foreign key and other constraint checks.

Let’s take a look at how to do that. We will start by creating our schema.

Before we create the two tables flights and airlines, we need to delete the tables that were automatically created. In the flights table, we’ll store an artificial primary key value and the subset of the columns from the CSV: departure airport and arrival airport. We then add a foreign key that links to an airline in the airlines table. That table will contain airline information. While we could use the airline code as a primary key, we will create an artificial key for demonstration purposes.

create table airlines (
  aid integer primary key,
  carrier text not null,
  airline text not null,
  country text not null
);
create table flights (
  fid integer primary key,
  dep text not null,
  arr text not null,
  aid integer not null,
  foreign key (aid) references airlines(aid)
);

Note that the columns of the table must match the column names and data types of the dataframe – the dataframe must be a mirror of the table in the database.

Note the use of the parameters overwrite = F and append = T so that any existing table is not overwritten and the data is appended to the table. Of course, running the code twice would require deleting the data first otherwise you would get primary key duplicates.

dbWriteTable(dbcon, "airlines", df.airlines, overwrite = F, append = T)
dbWriteTable(dbcon, "flights", df.flights, overwrite = F, append = T)

Let’s see if it worked…

select * from flights limit 3;
Table 3: 3 records
fid dep arr aid
101 EWR IAH 6
102 MIA GRU 1
103 JFK MIA 1
select * from airlines limit 3;
Table 4: 3 records
aid carrier airline country
1 AA American Airlines USA
2 B6 JetBlue USA
3 DL Delta Airlines USA

To summarize, if you pre-create the schema and want dbWriteTable() then you need to append the data, not overwrite the table, and ensure that the column names and types of the dataframe match those in the table.

Load Data by Row

Some databases (notably db4free.net) may not allow bulk saving of data with dbWriteTable(). So, an alternate approach, if dbWriteTable() is not supported, is to write the data to a table row by row using INSERT.

This approach writes out the rows from each data frame to the corresponding table one by one – not very efficient and potentially very slow but effective and easy to understand.

There are two approaches we can use: prepared statements and manufactured statements.

We will first show prepared statements that take parameters. Besides being easier than to build the SQL INSERT, it also guards against SQL injection attacks.

Prepared Statements

The code below works for SQLite but does not work for MySQL as it does not support prepared statements through the RMySQL package. An alternative for MySQL is shown later.

Now that we have the tables we are ready to insert the data row by row into the cloud MySQL database. We will use prepared statements for this. To create a prepared statement, we first create a SQL statement with named parameters and then bind values to those named parameters. The names of the parameters are defined by the programmer and should reflect the names of the columns and data being inserted.

Of course, we assume that the database is used in single-user mode; if that’s not the case, then the insertion of the data below would have to be within a transaction.

The function dbSendStatement() can be used to send a fully constructed SQL statement or one with parameters – a so called prepared statement. In the latter case, the parameters must be bound to actual values before it can be executed. The function dbBind() binds the parameters and dbGetRowsAffected() executes the statement and returns the number of rows that were inserted in this case. You can specify parameters by position or by name – both are shown below.

n.airlines <- nrow(df.airlines)
for (r in 1:n.airlines){
  sqlStmt <- 'insert into airlines values (?, ?, ?, ?)'
  ps <- dbSendStatement(conn = dbcon, 
                        statement = sqlStmt,
                        immediate=F)
  dbBind(ps, params = list(df.airlines$aid[r],
                           df.airlines$carrier[r],
                           df.airlines$airline[r],
                           df.airlines$country[r]))
  nr <- dbGetRowsAffected(ps)
  if (nr < 1) {
    message(paste0("Could not save airline ", r))
  }

  dbClearResult(ps)
}

n.flights <- nrow(df.flights)
for (f in 1:n.flights) {
  sqlStmt <- "insert into flights values (:fid,:dep,:arr,:aid)"
  ps <- dbSendStatement(conn = dbcon, 
                        statement = sqlStmt)
  dbBind(ps, 
         params = list(fid = df.flights$fid[f],
                       dep = df.flights$dep[f],
                       arr = df.flights$arr[f],
                       aid = df.flights$aid[f]),
         immediate = T)

  nr <- dbGetRowsAffected(ps)
  if (nr < 1) {
    message(paste0("Could not save flight ", f))
  }
    
  dbClearResult(ps)
}

Let’s do a quick check to ensure that the data is in the database.

select * from airlines limit 3;
Table 5: 3 records
aid carrier airline country
1 AA American Airlines USA
2 B6 JetBlue USA
3 DL Delta Airlines USA

Manufactured SQL Command Strings

As an alternative to prepared statements we could also build SQL statements as text strings from the variables that we want to insert as values. This may inadvertently cause SQL injection security issues but is often necessary if a database or the programming language’s database drivers do not support prepared statements.

The code below shows this approach. This approach will work for db4free and other cloud MySQL databases.

create table airlines (
  aid integer primary key,
  carrier text not null,
  airline text not null,
  country text not null
);
create table flights (
  fid integer primary key,
  dep text not null,
  arr text not null,
  aid integer not null,
  foreign key (aid) references airlines(aid)
);
n.airlines <- nrow(df.airlines)
for (r in 1:n.airlines){
  sqlStmt <- paste0(
    "insert into airlines values (",
        df.airlines$aid[r], ",",
        "'", df.airlines$carrier[r], "',",
        "'", df.airlines$airline[r], "',",
        "'", df.airlines$country[r], "'",
        ")")
  nr <- dbSendStatement(conn = db4free, 
                        statement = sqlStmt)
}

n.flights <- nrow(df.flights)
for (r in 1:n.flights) {
  sqlStmt <- paste0(
    "insert into flights values (",
        df.flights$fid[r], ",",
        "'", df.flights$dep[r], "',",
        "'", df.flights$arr[r], "',",
        "'", df.flights$aid[r], "'",
        ")")
  nr <- dbSendStatement(conn = db4free, 
                        statement = sqlStmt)
}

Let’s do a final check to ensure that the data is in the database.

select * from airlines limit 3;
Table 6: 3 records
aid carrier airline country
1 AA American Airlines USA
2 B6 JetBlue USA
3 DL Delta Airlines USA
select * from flights limit 3;
Table 7: 3 records
fid dep arr aid
101 EWR IAH 6
102 MIA GRU 1
103 JFK MIA 1

It worked too. So, now you have two ways to load data from data frames into a database. You also have a way to split data from a single CSV into multiple tables and linking them. This type of work is often necessary when loading data from third party applications or external sources.

Close the Database

After accessing a database and when no further access is needed, always close the connection and release local resources and resources on the database server. Most databases restrict the number of simultaneous connections and by not disconnecting when needed we can often consume all resources needlessly and bar other applications from connecting to the database. The number of concurrent connections is a factor of the vendor’s license (more connections cost more money) or configuration. When connecting, you can sometimes specify the number of concurrent connections. Note that having too many connections open simultaneously can slow down database performance.

## [1] TRUE

Files & Resources

All Files for Lesson 6.302

References

[The sqldf Package. R Documentation.]](https://www.rdocumentation.org/packages/sqldf/versions/0.4-11)

Errata

None collected yet. Let us know.


  1. If the database does not exist, you need to create it with the same data or modify the connection information below, if you wish to run the code. Should you not be able to connect, it might be either because the database does not exist, the connection information is incorrect, or port 3306 is blocked on your computer due to a firewall or anti-virus program setting.↩︎

---
title: "Bulk Load Data from CSV into Database in R"
params:
  category: 6
  number: 302
  time: 45
  level: beginner
  tags: "R,SQL,sqldf,database,sqlite,dbWriteTable"
  description: "Explains how to load data from a CSV into a relational
                database of multiple tables and map primary key to
                foreign key references."
date: "<small>`r Sys.Date()`</small>"
author: "<small>Martin Schedlbauer</small>"
email: "m.schedlbauer@neu.edu"
affilitation: "Northeastern University"
output: 
  bookdown::html_document2:
    toc: true
    toc_float: true
    collapsed: false
    number_sections: false
    code_download: true
    theme: spacelab
    highlight: tango
---

---
title: "<small>`r params$category`.`r params$number`</small><br/><span style='color: #2E4053; font-size: 0.9em'>`r rmarkdown::metadata$title`</span>"
---

```{r code=xfun::read_utf8(paste0(here::here(),'/R/_insert2DB.R')), include = FALSE}
```

## Introduction

The overall goal of this tutorial is to load a CSV file, split the data into multiple tables that follow normal forms, create tables in a database, and then save the data from the CSV via data frames to the tables in the database. We will demonstrate several important methods and approaches. Of course, there are many other approaches that are equally valid. In fact, the approaches shown here may not be the most efficient in terms of memory use and speed, but they are simple to implement and easy to follow.

All work occurs in the R Notebook and is thus reproducible.

> Note that the use of <code>dbWriteTable()</code> generally requires that "bulk or batch loading" is enabled for the database. It will work with SQLite, but this feature may not be available for some cloud implementations of MySQL. For example, <code>dbWriteTable()</code> does not work with MySQL on *db4free.net*. It does work with local installations of MySQL or a MySQL RDS on AWS.

------------------------------------------------------------------------

If you are less experienced with R programming or find this tutorial a bit too complex to follow, consider reading this simpler tutorial first: [50.302A Load Subset of Birdstrikes Sample Data from CSV into Database in R](http://artificium.us/lessons/06.r/l-6-302-bulkload-data-into-db/l-6-302-sandbox-load-birdstrikes.html).

------------------------------------------------------------------------

### Connect to Database

Since we will work with SQLite and MySQL, we will make two separate database connections.

First, we will create a new in-memory SQLite database. The code below makes a connection to an in-memory temporary SQLite database that will not persist once the program terminates. This is useful for testing or quick *ad hoc* work.

```{r connect2SQLiteDB, eval=TRUE, warning=FALSE}
# change this code to connect to your specific database
library(RSQLite, quietly=T)

# this makes a connection to an in-memory temporary SQLite database
dbcon <- dbConnect(RSQLite::SQLite(), ":memory:")
```

Next, we will connect to an existing MySQL cloud database. The code below connects to an existing (pre-created) MySQL cloud database hosted on [db4free.net](http://db4free.net)[^l-6-302-1].

[^l-6-302-1]: If the database does not exist, you need to create it with the same data or modify the connection information below, if you wish to run the code. Should you not be able to connect, it might be either because the database does not exist, the connection information is incorrect, or port 3306 is blocked on your computer due to a firewall or anti-virus program setting.

```{r connect2CloudMySQL, eval=TRUE, warning=FALSE}
# this makes a connection to a cloud db4free database that was pre-created
library(RMySQL, quietly=T)

db_user <- 'cs5200' 
db_password <- 'khour#2021'
db_name <- 'cs5200db1'
db_host <- 'db4free.net' # Host 85.10.205.173
db_port <- 3306

db4free <-  dbConnect(RMySQL::MySQL(), 
                      user = db_user, 
                      password = db_password,
                      dbname = db_name, 
                      host = db_host, 
                      port = db_port)
```

## Load CSV Data

Now that we have a database connection, let's load data from a CSV file and then store that CSV file into two tables in the database. Of course, loading a CSV file into tables can be done in other ways, too. Many databases allow loading of CSV files directly into tables using the database's management console, *e.g.*, MySQL Workbench for MySQL. Here we are demonstrating an simple ETL pipeline, where a CSV file is loaded into R, the data is transformed, and then stored in the database.

```{r loadCSV}
fn = "FlightsWithAirlines.csv"

df.raw <- read.csv(file = fn,
                   header = T,
                   stringsAsFactors = F)
```

Let's inspect the raw data we read from the CSV.

```{r}
head(df.raw, 3)
str(df.raw)
```

### Create Data Frames

The strategy we will use to load the data from the CSV to the database is to create internal data frames that mirror the tables in the database and then write the data frames to the database. This requires extra memory but is fast and convenient. It would not work well if the CSV were very large.

We will need to find the unique set of airlines. There are many duplicates in the CSV file as it is unnormalized. We will use **sqldf** with a SQL query to find the unique set of airlines. We could do this also with **tidyverse**.

Notice how we place a fixed value in the first column of the result as we will need a data frame with the first column acting as an artificial primary key. We will assign those values as sequential values.

#### Airlines Data

We are going to make use of **sqldf**, a package for using SQL to query dataframes rather than tables in databases. Of course, we could also using **tidyverse** but **sqldf** is convenient and relies on SQL.

```{r loadsqldf, warning=FALSE, include=F}
# load the sqldf package
library(sqldf, quietly = T, verbose = F, warn.conflicts = F)
```

**sqldf** actually uses a database to temporarily store dataframes as tables, query them, and then delete them in the database. It will use a temporary in-memory SQLite database by default unless a connection has been made to another database. Unfortunately, this connection sharing can be an issue when using MySQL as that database requires authentication before tables can be created. So, we need to direct **sqldf** to use SQLite as a backing store. The code below accomplishes this.

```{r setsqldfbackstore}
# set backing store to SQLite for sqldf
options(sqldf.driver = 'SQLite')
```

```{r extractAirlineData, warning=FALSE}
# find the unique airlines
# notice how we select a fixed value for the first column
# it acts as a placeholder for the artificial PK we will assign later
df.airlines <- sqldf::sqldf("select 1 as aid, carrier, airline, country from `df.raw` group by carrier")

# assign an artificial PK using a counter
n.airlines <- nrow(df.airlines)
df.airlines[,1] <- seq(1, n.airlines)
```

#### Flights Data

```{r extractFlightData}
# while we could also use sqldf here, we'll use a different approach to extract
# the data -- we will use entire columns; also note how the aid FK is set to 1;
# we will assign that later to the right PK in the airlines table
n.flights <- nrow(df.raw)
df.flights <- data.frame(fid = 100 + seq(1,n.flights),
                         dep = df.raw$origin,
                         arr = df.raw$dest,
                         aid = 1)
```

## Link Tables

One hurdle with this approach is the linking of foreign keys in one table to newly created artificial primary keys. In our example, we will need to link each flight to an airline. The strategy we will use is to traverse the data frame containing the original flights data row by row using a *for loop*. For each row we will find the airline in the data frame containing the airlines -- we use the <code> which()</code> function to match. This strategy assumes that the rows in the original raw data correspond row by row to the subset data in the flights data frame. If we removed rows, then this would not work. An alternative might be to copy more data from the original data frame and then eventually remove columns.

```{r linkFKs}
# let's link each flight to the correct airline by assigning the FK to the 
# correct PK

# process each row (flight) one by one
for (r in 1:n.flights) {
  # find the airline PK for that flight in the airlines table
  a <- df.airlines$aid[which(df.airlines$carrier == df.raw$carrier[r] &
                               df.airlines$airline == df.raw$airline[r] &
                               df.airlines$country == df.raw$country[r])]
  df.flights$aid[r] <- a
}

```

## Bulk Load Data to Database

Now we'll write the data to the tables in the database. We'll first use the simpler and faster bulk loading approach using <code>dbWriteTable()</code>. Interestingly, when using <code>dbWriteTable()</code> it actually creates the tables in the database based on the structure of the data frame -- we did not actually need to pre-create the tables. If the tables were pre-created, we would need to overwrite them by specifying *overwrite = T*, but that would essentially ignore any prior table structure.

```{r loadDataBulk}
dbWriteTable(dbcon, "airlines", df.airlines, overwrite = T)
dbWriteTable(dbcon, "flights", df.flights, overwrite = T)
```

Let's see if it worked...

```{sql showFlights, connection=dbcon}
select * from flights limit 5;
```

```{sql showAirlines, connection=dbcon}
select * from airlines limit 5;
```

## Save to Existing Schema

If we want to use an existing schema and already created tables, then we have to change the parameters for <code>dbWriteTable()</code> and make sure that the columns in the dataframe match the columns in the table (same order and same data types). One reason for not overwriting the tables -- and essentially letting R create tables for us -- is that we want to have foreign key and other constraint checks.

Let's take a look at how to do that. We will start by creating our schema.

Before we create the two tables *flights* and *airlines*, we need to delete the tables that were automatically created. In the *flights* table, we'll store an artificial primary key value and the subset of the columns from the CSV: departure airport and arrival airport. We then add a foreign key that links to an airline in the *airlines* table. That table will contain airline information. While we could use the airline code as a primary key, we will create an artificial key for demonstration purposes.

```{sql connection=dbcon, echo=F}
drop table airlines;
```

```{sql connection=dbcon, echo=F}
drop table flights;
```

```{sql createTableAirlines, connection=dbcon}
create table airlines (
  aid integer primary key,
  carrier text not null,
  airline text not null,
  country text not null
);
```

```{sql createTableFlights, connection=dbcon}
create table flights (
  fid integer primary key,
  dep text not null,
  arr text not null,
  aid integer not null,
  foreign key (aid) references airlines(aid)
);
```

> Note that the columns of the table must match the column names and data types of the dataframe -- the dataframe must be a mirror of the table in the database.

Note the use of the parameters *overwrite = F* and *append = T* so that any existing table is not overwritten and the data is appended to the table. Of course, running the code twice would require deleting the data first otherwise you would get primary key duplicates.

```{r loadDataBulkNoOverwrite}
dbWriteTable(dbcon, "airlines", df.airlines, overwrite = F, append = T)
dbWriteTable(dbcon, "flights", df.flights, overwrite = F, append = T)
```

Let's see if it worked...

```{sql showFlightsAgain, connection=dbcon}
select * from flights limit 3;
```

```{sql showAirlinesAgain, connection=dbcon}
select * from airlines limit 3;
```

To summarize, if you pre-create the schema and want <code>dbWriteTable()</code> then you need to append the data, not overwrite the table, and ensure that the column names and types of the dataframe match those in the table.

## Load Data by Row

Some databases (notably *db4free.net*) may not allow bulk saving of data with <code>dbWriteTable()</code>. So, an alternate approach, if <code>dbWriteTable()</code> is not supported, is to write the data to a table row by row using INSERT.

This approach writes out the rows from each data frame to the corresponding table one by one -- not very efficient and potentially very slow but effective and easy to understand.

There are two approaches we can use: prepared statements and manufactured statements.

We will first show prepared statements that take parameters. Besides being easier than to build the SQL INSERT, it also guards against SQL injection attacks.

### Prepared Statements

> The code below works for SQLite but does not work for MySQL as it does not support prepared statements through the **RMySQL** package. An alternative for MySQL is shown later.

Now that we have the tables we are ready to insert the data row by row into the cloud MySQL database. We will use prepared statements for this. To create a prepared statement, we first create a SQL statement with named parameters and then bind values to those named parameters. The names of the parameters are defined by the programmer and should reflect the names of the columns and data being inserted.

Of course, we assume that the database is used in single-user mode; if that's not the case, then the insertion of the data below would have to be within a transaction.

```{sql connection=dbcon, echo=F}
delete from flights;
```

```{sql connection=dbcon, echo=F}
delete from airlines;
```

The function <code>dbSendStatement()</code> can be used to send a fully constructed SQL statement or one with parameters -- a so called *prepared statement*. In the latter case, the parameters must be bound to actual values before it can be executed. The function <code>dbBind()</code> binds the parameters and <code>dbGetRowsAffected()</code> executes the statement and returns the number of rows that were inserted in this case. You can specify parameters by position or by name -- both are shown below.

```{r loadDataByRowIntodb4free}
n.airlines <- nrow(df.airlines)
for (r in 1:n.airlines){
  sqlStmt <- 'insert into airlines values (?, ?, ?, ?)'
  ps <- dbSendStatement(conn = dbcon, 
                        statement = sqlStmt,
                        immediate=F)
  dbBind(ps, params = list(df.airlines$aid[r],
                           df.airlines$carrier[r],
                           df.airlines$airline[r],
                           df.airlines$country[r]))
  nr <- dbGetRowsAffected(ps)
  if (nr < 1) {
    message(paste0("Could not save airline ", r))
  }

  dbClearResult(ps)
}

n.flights <- nrow(df.flights)
for (f in 1:n.flights) {
  sqlStmt <- "insert into flights values (:fid,:dep,:arr,:aid)"
  ps <- dbSendStatement(conn = dbcon, 
                        statement = sqlStmt)
  dbBind(ps, 
         params = list(fid = df.flights$fid[f],
                       dep = df.flights$dep[f],
                       arr = df.flights$arr[f],
                       aid = df.flights$aid[f]),
         immediate = T)

  nr <- dbGetRowsAffected(ps)
  if (nr < 1) {
    message(paste0("Could not save flight ", f))
  }
    
  dbClearResult(ps)
}
```

Let's do a quick check to ensure that the data is in the database.

```{sql showAirlines3, connection=dbcon}
select * from airlines limit 3;
```

### Manufactured SQL Command Strings

As an alternative to prepared statements we could also build SQL statements as text strings from the variables that we want to insert as values. This may inadvertently cause SQL injection security issues but is often necessary if a database or the programming language's database drivers do not support prepared statements.

The code below shows this approach. This approach will work for *db4free* and other cloud MySQL databases.

```{sql connection=db4free, echo=F}
drop table if exists flights;
```

```{sql connection=db4free, echo=F}
drop table if exists airlines;
```

```{sql createTableAirlinesPrepStmt, connection=db4free}
create table airlines (
  aid integer primary key,
  carrier text not null,
  airline text not null,
  country text not null
);
```

```{sql createTableFlightsprepStmt, connection=db4free}
create table flights (
  fid integer primary key,
  dep text not null,
  arr text not null,
  aid integer not null,
  foreign key (aid) references airlines(aid)
);
```

```{sql connection=db4free, echo=F}
delete from flights;
```

```{sql connection=db4free, echo=F}
delete from airlines;
```

```{r loadDataByRowIntodb4freeWithPaste}
n.airlines <- nrow(df.airlines)
for (r in 1:n.airlines){
  sqlStmt <- paste0(
    "insert into airlines values (",
        df.airlines$aid[r], ",",
        "'", df.airlines$carrier[r], "',",
        "'", df.airlines$airline[r], "',",
        "'", df.airlines$country[r], "'",
        ")")
  nr <- dbSendStatement(conn = db4free, 
                        statement = sqlStmt)
}

n.flights <- nrow(df.flights)
for (r in 1:n.flights) {
  sqlStmt <- paste0(
    "insert into flights values (",
        df.flights$fid[r], ",",
        "'", df.flights$dep[r], "',",
        "'", df.flights$arr[r], "',",
        "'", df.flights$aid[r], "'",
        ")")
  nr <- dbSendStatement(conn = db4free, 
                        statement = sqlStmt)
}
```

Let's do a final check to ensure that the data is in the database.

```{sql showAirlines4, connection=db4free}
select * from airlines limit 3;
```

```{sql showFlights4, connection=db4free}
select * from flights limit 3;
```

It worked too. So, now you have two ways to load data from data frames into a database. You also have a way to split data from a single CSV into multiple tables and linking them. This type of work is often necessary when loading data from third party applications or external sources.

## Close the Database

After accessing a database and when no further access is needed, always close the connection and release local resources and resources on the database server. Most databases restrict the number of simultaneous connections and by not disconnecting when needed we can often consume all resources needlessly and bar other applications from connecting to the database. The number of concurrent connections is a factor of the vendor's license (more connections cost more money) or configuration. When connecting, you can sometimes specify the number of concurrent connections. Note that having too many connections open simultaneously can slow down database performance.

```{r disconnectDB, echo=F}
dbDisconnect(dbcon)
dbDisconnect(db4free)
```

------------------------------------------------------------------------

## Files & Resources

```{r zipFiles, echo=FALSE}
zipName = sprintf("LessonFiles-%s-%s.zip", 
                 params$category,
                 params$number)

textALink = paste0("All Files for Lesson ", 
               params$category,".",params$number)

# downloadFilesLink() is included from _insert2DB.R
knitr::raw_html(downloadFilesLink(".", zipName, textALink))
```

------------------------------------------------------------------------

## References

[The sqldf Package. R Documentation.]](<https://www.rdocumentation.org/packages/sqldf/versions/0.4-11>)

## Errata

None collected yet. Let us know.

```{=html}
<script src="https://form.jotform.com/static/feedback2.js" type="text/javascript">
  new JotformFeedback({
    formId: "212187072784157",
    buttonText: "Feedback",
    base: "https://form.jotform.com/",
    background: "#F59202",
    fontColor: "#FFFFFF",
    buttonSide: "left",
    buttonAlign: "center",
    type: false,
    width: 700,
    height: 500,
    isCardForm: false
  });
</script>
```
```{r code=xfun::read_utf8(paste0(here::here(),'/R/_deployKnit.R')), include = FALSE}
```
