60.101 Tabular Structures, Relational Data, and SQL
Martin Schedlbauer
2025-02-18
Introduction
Structured information objects have properties in the form of attributes. For example, instances of a Course have attributes such as title, length, credits, among others, depending on the use cases. Such information must often be externalized and stored in some structure where they can be shared and queried. A tabular organization is a common format for the externalization of structured information objects.
In this lesson, we will take a look at how tabular structures are queried using SQL. SQL has emerged as the most common way to find information objects stored in tables that meet specific criteria.
Tabular Structures
Information objects are stored in tabular form in:
CSV text data files
Excel (and other spreadsheet program) worksheets
Data frames in R and Python
Relational database tables
Two-Level XML documents
All tabular structures have the same format: rows with columns. Each row is an instance of an information object, while each column is an attribute value. Rows are generally numbered while columns are named.
CSV
The example below shows data in a CSV. Note the separation of columns by commas.
A cell or a range of cells cannot be directly specified until the CSV is loaded into a program for processing.
Excel
The image below shows an example of a tabular structure in Excel. Columns are letters (A..Z, AA-ZZ, etc.) and rows are numbered; each intersection of a column and a row is a cell. In addition to data values, cells can also contain formulas which are algebraic expressions referencing cells or ranges of cells. A range of cells is similar to a vector, array, or a matrix in R.
Sample Tabular Structure in an Excel Worksheet
A cell in Excel is referenced as a column/row pair, e.g., D4 references the Calories attribute of Product instance 3 and has the value 50. A range of cells can be specified by a starting and ending cell, e.g., D2:D8 is the vector of values \(\{70,70,50,110,110,110,130\}\).
Data Frames
Data frames are in-memory tabular structures common in several programming languages, including R and Python. While they can be programmatically generated, they are typically the result of loading a CSV file, Excel worksheet, or SQL query result into a data frame.
The R code fragment below loads a CSV file into a data frame, which is then shown.
Table 1: Example of a tabular structure from a CSV.
Product
CerealName
Manufacturer
Calories
1
100% Bran
Nabisco
70
2
All-Bran
Kellogg
70
3
All-Bran w/Extra Fiber
Kellogg
50
4
Almond Delight
Ralston Purina
110
Cells can be referenced in several ways: df[row-number,column-number], df[row-number,column-name]. The code block below demonstrates some ways to access specific cells and ranges.
# access a single celldf[1,2]# accessing a single columnhead(df[,2])# accessing several columnsdf[,2:4]# accessing rows 1 to 5 and columns 2 and 3df[1:5,2:3]# accessing all columns for some rowsdf[1:5,]# accessing a column by namedf[2:10,"Calories"]# accessing entire column by namedf$Calories# accessing the entire data framedf[,]# or, simplydf
Run the code and convince yourself that it works. In R, row and column specifications are ranges, or, to be precise, vectors – they do not need to be contiguous. The function c() concatenates several values into a vector.
The above access expressions can be used on the left or right side, which means they can also be used to modify a data frame. An example is shown below:
In some situations, XML documents where elements directly underneath the root represent rows and the child elements under the rows represents columns are used as an alternative to CSV. An example is shown below where each element is a row and the columns are , , and .
Most commonly, such XML files are converted to a CSV or read into a programming language as a table. For example, in R, the function xmlToDataFrame() is often used to read such a specially-formatted XML into a data frame.
The code below demonstrates how to load a 2-level XML document and convert the values to the appropriate numeric data type:
library(XML)fn <-"treedata2L.xml"# load from local filedf <-xmlToDataFrame(fn, stringsAsFactors = F)df$Girth <-as.numeric(df$Girth)df$Height <-as.numeric(df$Height)df$Volume <-as.numeric(df$Volume)head(df, 3)
Of course, this only works if the XML has this particular structure where each element directly under the root is a row and the child elements under each row are in the exact same order and represent columns. It does not work for more complex XML element structures. In that case, XPath or node traversal would be needed to read the data – and the data would not longer be a tabular structure but rather be a complex object structure.
The relational model is an approach to storing data that is tabular and that connects data in one table to data in another table. It is the foundation of relational databases, but the tables do not have to be in a database for the data to be “relational”.
Relations
The relational model is based on set theory and each table is more formally called a relation. A relation has attributes; the number of attributes that a relation has is called its degree. Fundamentally, a relation is a set of tuples. The cardinality of a relation is the number of tuples in the relation.
For example, a relational model representing a learning management system might have the relation courses which might have attributes course number, title, and credits, and it might contain the tuples shown below:
courses(
{'IS2000','Principles of Information Science', 4},
{'CS5200','Database Management Systems', 4},
{'CS5380','Project Management with Scrum', 2},
{'DA5021','Big Data', 1}
)
The relation is defined as follows:
\(courses(number,title,credits)\).
Let TEXT be the domain of all character strings. The attribute domains for the relation courses would then be
\(number,title \in TEXT\)
\(credits \in \mathbb{N}\).
The above relation has a degree of 3 and cardinality of 4. It is a set of four tuples. {‘CS5200’,‘Database Management Systems’, 4} is one such tuple.
In tabular form, the above is a table with three columns and four rows.
number
title
credits
IS2000
Principles of Information Science
4
CS5200
Database Management Systems
4
CS5380
Project Management with Scrum
2
DA5021
Big Data
1
Primary Key
Specific tuples are identified through a unique value of one attribute or a combination of several attributes. Tuples must be unique – this is a requirement of the relational model. The attribute (or combination of attributes) that uniquely identifies each tuple is referred to as the primary key. In the above example, course number would be the primary key as it is, by definition in the business domain, unique, i.e., no two different courses can have the same course number.
If there is no natural unique key or the key is too complex, then we generally “invent” one; a so-called artificial key. For example, for the courses relation we might wish to add a unique identifier column and assign a unique value for each course and use that as the primary key rather than the course number. Incidentally, this will make course number a so-called alternate key.
Here is what the new relation courses would look like:
courses(
{4453,'IS2000','Principles of Information Science', 4},
{6654,'CS5200','Database Management Systems', 4},
{1009,'CS5380','Project Management with Scrum', 2},
{4198,'DA5021','Big Data', 1}
)
The values for cid are “made up” – they could have been sequential numbers too, e.g., 1, 2, 3, and 4. The actual values are meaningless as long as they are unique. In a relation definition, primary key attributes are often underlined so they are easier to recognize.
Attributes
The set or range of values from which the value of an attribute can be drawn is called the attribute’s domain. In the above example, the domain for course number is TEXT and includes all possible course numbers that fit the pattern AANNNN, where AA is the course area, such as IS (Information Science) or CY (Cyber Security) and \(NNNN\) is a four digit course number. The domain for the credits attribute is all integers greater than 0 and no more than 5 – or at least that is the domain at Northeastern University.
More formally, we could express the domain for credits as a set using set generating notation:
\(courses \in \{x : x \gt 0 \land x \le 5\}\).
Naturally, different business areas will have different domains for attributes – they are a business rule.
Linked Relations
To show that relations are generally not by themselves but rather are linked, let’s say that we want to track the instructor who teaches a course. We might decide to extend the relation courses with an additional attribute for instructor:
Simple… but what if the same instructor teaches more than one course? Then we would have repetition. There would be even more information repeated if we tracked the instructor’s department, rank, and highest degree earned. While that is commonly done in a CSV or a data frame, it is not appropriate for a relational model where repetition is frowned upon as it violates so-called normal forms. A better approach would be to create another relation to track instructor-specific information: \(instructors(\underline{iid}, name,department,rank,degree)\). Note that we, once again, created an artificial key attribute – in this case there is no natural key as instructor name cannot be assumed to be unique.
There is still some repetition when instructors are in the same department or have the same rank, but we will ignore that for now rather than adding a departments relation and a lookup relation for rank. Also note that we chose not to associate part-time adjunct instructors with a department. This is not a universal rule, but rather a rule that an analyst might have uncovered during information analysis.
To track which instructor teaches which course, we would first remove all instructor information from the courses relation and add a new attribute that has as a value the primary key of the instructor who teaches the course. Such as attribute in courses is referred to as a foreign key attribute.
There are several important operations defined on a relational model:
selection
projection
rename
aggregation
grouping
equi-join
These are defined in a relational algebra. The full set of operations of the relational algebra are beyond the scope of this lesson. If you have an interest in learning more, consult Lesson 60.502 Relational Algebra.
Relational algebra is an important query abstraction mechanism and is often used to express queries without resorting to a specific query language such as SQL.
Relations as Sets
In the relational algebra, relations are treated as (mathematical) sets and thus cannot, by definition of a set, contain duplicates.
Entity-Relationship Diagrams
Entity-Relationship Diagrams (ERD) are used to visualize relational data models. There are several notations that are in common use:
Information Engineering Notation (aka, Crow’s Foot Notation)
Chen Notation (named after Peter Chen, its creator)
IDEF1X (developed by the US Military)
UML Class Diagrams
We will take a closer look at the first one, as it is the most commonly used in practice, albeit being an older notation developed in the early 1980s. It is being slowly supplanted by the UML Class Diagram, though.
In the IE notation, entities (represented as relations in the model and tables in a database or data frames in R/Python and identical to classes in UML) are rounded-corner rectangles with two compartments: the entity name in the first compartment and the attributes in the second compartment. The second compartment is generally divided into three or four columns: modifier, name, data type (domain), and default value. The most common modifiers are PK for primary key and FK for foreign key. Primary keys are linked to foreign keys with a line – and the line emanates from the FK and ends at the PK row, unlike in UML where the line connects two classes. In an ERD it connects attributes.
The diagram below, created in LucidChart, shows a simple IE (Crow’s Feet) ERD with three entities and three relationships (ignore the relationships for now; they will be explain shortly):
ERD in Information Engineering (Crow’s Foot) Notation
As an aside, many models use singular nouns for entities while many relational databases use the entity name in the plural. It is a matter of preference. Relationships are easier to follow and explain if the entities are in the singular.
Whether an attribute name is bold or not, entities are shaded or filled, and whether the attributes are banded has no meaning; it is only intended to improve readability.
Some analysts like to underline primary key attributes and italicize foreign keys, which is consistent with relational schema definitions.
Relationships
Relationships connect entities. In a relational model, relationships can only be one-to-one or one-to-many, but not many-to-many as the latter type of relationships cannot be implemented using tables with fixed numbers of columns or atomic values for the attributes.
A relationship line can be augmented with an optional label to explain the relationship using “business terminology”. It can help explain the model and should be used.
The diagram below shows three entities and two relationships. The relationships have a label and the ends indicate the multiplicity (explained in next section).
ERD in Information Engineering (Crow’s Foot) Notation
Multiplicity
The multiplicity of a relationship is indicated with symbols at the end of the relationship line. The diagram below summarizes the symbols and their semantics.
ERD Multiplicity Symbols
In (I), there is no multiplicity defined; most likely because it is not yet known. in (II), we show a multiplicity of 1 on one side and many of the other size – the triangle, crow’s foot like shape indicates “many” with no particular bound. A circle indicates 0, or an optional relationship. So, (III) shows that it is 0 or more on one side and 0 or 1 on the other.
There is no mechanism to show any multiplicity bound other than 0, 1, and many. So, a UML Class Diagram showing a multiplicity of 5..10 would need to be shown as “many” in an IE ERD.
Reading the multiplicity is done by fixing one instance of one entity and mapping it to the other entity. So, in the diagram below, we visualize the facts:
“An (one) author writes many lessons, possibly none, i.e., there are some authors who do not write any lessons.”
“A (one) lesson is written by no more than one author, although some lessons do not have an author.”
If a relationship is 0..1 then it means that the foreign key can be NULL.
ERD Relationship Example I
Exercise: Go back to the full diagram and state sentences about the relationship between Course and Lesson.
Junction Entities
If there is a many to many relationship then it must be resolved by adding a junction or association entity. For example, if we allow a lesson to be in more than one course, then we would have a many-to-many relationship between Course and Lesson and that would require the introduction of some kind of junction table. In the example below, we chose to name that table ContentMap.
Adding Foreign Keys
The foreign key is a “one-to-many” relationship is added to the table on the side of the “many” and it contains the value of the primary key is the side of the “one”. Sample tables can help clarify where they should be placed.
Implementing One-to-Many Relationships with FK/PK Links
The quick bite tutorial below illustrates how to resolve one-to-many relationships with FK/PK links.
SQL
As we saw, the relational model has a mechanism in which to express “queries” on relations called relational algebra, but that is not useful in practice. Instead, data is extracted from tables in a relational model (i.e., tables) using the industry-standard query language SQL. This section reviews the key operations of SQL for retrieving data from one or more tables. In the context of SQL, we generally prefer the term table over relation.
One noteworthy item is that tables in relational databases are not, in fact, relations and therefore several useful theorems about the relational algebra do not hold in the SQL counterpart. The reason is that the SQL table model is a bag (multiset), rather than a set which cannot contain duplicates and all tuples in the set must be unique.
Databases
Tabular data can be stored in in-memory dataframes created from reading CSV, Excel, or XML files. In addition, tables can be stored in a database which is preferable when there are many tables and the tables are larger than would fit into memory or when multiple applications need access to the same data.
There are many relational database in use today, including SQLite, MySQL, MariaDB, Oracle, Microsoft SQL Server, Informix, DB2, JavaDB, among many others. The aforementioned are all relational databases storing data in tables. Organizations also make use of non-relational databases that store data in a different structure: hierarchical, key/value, columnar, or as documents. These types of databases include CouchDB, MongoDB, Neo4J, among many others, and are also often called NoSQL databases as the primary query language is not SQL.
SQL in R
In R, there are two ways to use SQL for retrieving data from tables depending on where the tables are: tables in memory versus tables in a database. Fortunately, it does not matter where the tables are, we can use SQL to retrieve data from them. However, we need to know where the tables are stored so that we use the correct set of functions from the right package.
sqldf: A package containing functions to retrieve data using SQL from in-memory data frames. This package is actually built on top of the RSQLite package.
RSQLite: A package containing functions to retrieve data using SQL from tables in a SQLite database.
Of course, if we were to use a different database such as MySQL, we would need to use a different package (e.g., RMySQL) but could still use the same SQL queries, making SQL a more or less universal query language.
Before we dive into details, note that all table names and also all data frames used with sqldf cannot contain a period (.) or, if they do, they must be escaped by enclosing the table or data frame name in backticks, e.g.,, `df.table` and not simply df.table as the . is interpreted as a scope operator, i.e., df.table means to SQL the table column in the table df.
sqldf
sqldf is a package that allows data frames to be queried with SQL as if they were tables in a database. It allows a programmer or data analyst to use SQL to access, process, search, and aggregate data in data frames.
Many queries, while generally doable in Base R, are often simpler with a SQL query – albeit a bit slower, but that reduction in performance is often not perceptible. Underneath the hood, the sqldf package actually loads the data frame into an in-memory SQLite database.
sqldf is primarily used to:
summarize of data in data frames
harmonize data access via SQL for all tabular data
import parts of a CSV
It is also useful for learning SQL without being concerned about setting up a separate relational database.
In the explanation below, we will use a set of tables from the package nycflights13 created by Hadley Wickham of RStudio. The tables in that package are actually tibbles. Tibbles are a type of data frame created by Wickham as part of his tidyverse package. However, we will treat the tibbles as data frames and ignore the fact that they are “tibbles”.
library(nycflights13)
The diagram below illustrates the data model for the tables in that package, expressed as an Entity Relational Diagram (ERD) in the Information Engineering (aka “Crow’s Foot”) notation. This notation is common for relational data models, although a UML Class Diagram could also have been used. Irrelevant attributes (columns) have been omitted from the diagram for the sake of readability.
Explore the data model. Inspect each of the data frames.
Surrogate Artificial Primary Keys
One item of interesting is that some of the data frames have composite primary keys (i.e., primary keys that consist of several columns). For example, the primary key to uniquely identify a row in flights is the combination of the carrier, flight [number], year, month, and day. This can be very inconvenient when we want to combine data from multiple data frames with joins. So, we generally add a new column, an identifier, that is a single number that is unique and use that as the primary key instead. Such “invented or artificial” primary keys are also often called a surrogate key. They are quite common.
A simple way to do this is to add a column that is a sequence number. The code below does this – note the use of a starting number so that it is a bit easier to distinguish between keys of different data frames; not necessary but convenient.
# add artificial key to "flights"flights$fid <-seq(1000:(1000+nrow(flights)-1))# add artificial key to "weather"weather$wid <-seq(1:(nrow(weather)))
We now have two possible primary keys for the data frames weather and flights: the combination of the columns and the new surrogate ID. We now need to designate one the primary key, making the other an alternate key. Each would be called a “candidate key”. One of the candidate keys, the compound key, is a natural key as it comes from data in the business, while the other is an artificial key.
Getting Started with SQL
Note that SQL is not case sensitive, so SELECT and select are equally valid. By convention, SQL keywords are written in upper case. The semicolon is generally required by some ad hoc query tools or when used in a script. We generally include it for compatibility and due to convention.
The result of every SQL statement is a table which means that a query can be “queried”. In other words, a query can be used wherever a table is required and so we can formulate subqueries.
Before we can use sqldf, remember to load the library – and install if not yet installed.
In the sqldf::sqldf() function, tables are presumed to be the identifiers for data frames that have been loaded or created. So, in the SQL statement that is passed sqldf::sqldf() below, “airlines” refers to the data frame (well, tibble, to be more precise, but that doesn’t matter) airlines.
The SQL statement can be enclosed either in single ’ or double quotes “. Using one or the other is useful if we want to the the other within the query.
library(sqldf)
Retrieving Rows: SELECT
The SELECT statement is used to retrieve rows that meet certain conditions. It has the general form:
SELECT {columns | * | expression | literal}
FROM {tables}
[WHERE {conditions}]
[GROUP BY {column}]
[HAVING {group criteria}]
[ORDER BY {columns}]
[LIMIT number];
Retrieve All Rows
The simplest SELECT query retrieves all rows from a single table. The ‘*’ specifies all columns in a table. In the example below, we get the first five rows from the table airlines by using the LIMIT keyword; if we do not specify a limit then we would get all rows which might be very large.
library(sqldf)sqldf::sqldf("SELECT * FROM airlines LIMIT 5")
## carrier name
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
If we need to process the result further, then we would assign the return value of sqldf::sqldf() to a variable.
rs <- sqldf::sqldf("SELECT * FROM airlines LIMIT 5")
Since the result is a data frame, we can then apply R functions to that data frame.
WARNING. If your data frame identifier name contains a period (common way to name data frames in R as period is a legal identifier character in R), then you must escape the data frame when used in sqldf with backticks. Let’s say the data frame was df.airlines and not airlines. Not enclosing the data frame in backticks changes its meaning for SQL to: access column airlines in the table df and since that does not exist, it’ll throw a SQL error.
The code fragment below illustrates the correct way to do this.
rs <- sqldf::sqldf("SELECT * FROM `df.airlines` LIMIT 5")
The same approach of using backticks also applies to column names containing periods.
Retrieve Specific Columns
Rather than using ‘*’ to get all columns, we can specify a subset of the columns by name.
rs <- sqldf::sqldf("SELECT carrier FROM airlines LIMIT 5")print(rs)
## carrier
## 1 9E
## 2 AA
## 3 AS
## 4 B6
## 5 DL
Ordering Results: ORDER BY
The rows are returned in some order – and likely not the order in which they were inserted or might have appeared in another query. So, never rely on the order unless you specifically specify an ordering with ORDER BY. An optional sorting direction can be added; DESC or descending and ASC for ascending.
rs <- sqldf::sqldf("SELECT name FROM airlines ORDER BY name DESC LIMIT 5")print(rs)
## name
## 1 Virgin America
## 2 United Air Lines Inc.
## 3 US Airways Inc.
## 4 Southwest Airlines Co.
## 5 SkyWest Airlines Inc.
Rename Columns
By default, the names of the columns in the result table are the same as the source table, but they can be renamed using AS.
rs <- sqldf::sqldf("SELECT name AS airline FROM airlines ORDER BY name LIMIT 5")print(rs)
## airline
## 1 AirTran Airways Corporation
## 2 Alaska Airlines Inc.
## 3 American Airlines Inc.
## 4 Delta Air Lines Inc.
## 5 Endeavor Air Inc.
Expressions
The SELECT statement can include expressions as a “column”. The expression is often renamed with an AS. In the example below, we create a new column “hr_delay” as the departure delay expressed in hours rather than minutes, rounding to two significant digits – not particularly useful but instructive.
We are also using the paste0() function to break up the SQL statement into multiple lines so it is easier to read. R does not allow line breaks in strings like other languages.
sql <-paste0("SELECT carrier, flight, round(dep_delay / 60, 2) AS hr_delay"," FROM flights"," LIMIT 5;")rs <- sqldf::sqldf(sql)print(rs)
## carrier flight hr_delay
## 1 UA 1545 0.03
## 2 UA 1714 0.07
## 3 AA 1141 0.03
## 4 B6 725 -0.02
## 5 DL 461 -0.10
Specific Rows with WHERE
Adding a WHERE clause to a query retrieves specific rows that meet the conditions of the WHERE clause. It is equivalent to a relational selection operation.
The clause is a Boolean expression and can contain AND, OR, and NOT plus logical operations include equality =, less than <, less than or equal <=, greater than >, greater than or equal >=, and not equal <>.
Simple Conditions
sql <-paste0("SELECT carrier, name"," FROM airlines"," WHERE carrier = 'UA'")rs <- sqldf::sqldf(sql)print(rs)
## carrier name
## 1 UA United Air Lines Inc.
Boolean Clauses
Complex selection filters can be created using Boolean expressions containing AND, OR, and NOT. Using parenthesis helps enforce precedence.
sql <-paste0("SELECT carrier, name"," FROM airlines"," WHERE carrier = 'UA'"," OR carrier = 'AA'")rs <- sqldf::sqldf(sql)print(rs)
## carrier name
## 1 AA American Airlines Inc.
## 2 UA United Air Lines Inc.
Testing for NULL
To test if the value is NULL requires using IS NULL or IS NOT NULL. Using = or <> results in an error as NULL is not a value. NULL indicates the absence of a value and is used for “missing” values – it is the equivalent of NA in R and reading a table into a data frame would change all NULL values to NA.
sql <-paste0("SELECT origin, year, day, month"," FROM weather"," WHERE wind_gust IS NULL"," LIMIT 10")rs <- sqldf::sqldf(sql)print(rs)
To use an expression in the WHERE clause requires that the expression be aliased. In the example below, we want to find all flights that have more than a 15 hour delay. We add a column that is an alias for an expression that converts the departure delay from minutes to hours and then use that alias in the WHERE clause. And, yes, we do realize that we could have just tested for /> 900 but then we couldn’t have demonstrated how to use expressions in WHERE clauses…
The SQL statement below also adds an ORDER BY clause to sort the results by departure delay.
sql <-paste0("SELECT carrier, flight, "," round((dep_delay/60.0),1) as depHr"," FROM flights"," WHERE depHr > 15"," ORDER BY depHr DESC")rs <- sqldf::sqldf(sql)print(rs)
SQL has limited support for searching text strings; it does not support regular expressions. More complex string processing must be done in a programming language. The keyword LIKE is used for matching strings that fit a pattern.
The following matching characters are available:
% matches 0, one or more characters
_ matches any one (single) character
So, to find all airlines that contain the substring “America” would require the pattern “%America%”. it would match “American Airlines”, “Virgin America”, and “All American Airways”.
To perform string equality testing, do not use LIKE, use = instead.
The example illustrates the use of LIKE for string matching.
sql <-paste0("SELECT carrier, name"," FROM airlines"," WHERE name LIKE '%America%'")rs <- sqldf::sqldf(sql)print(rs)
## carrier name
## 1 AA American Airlines Inc.
## 2 VX Virgin America
Set Membership
The set membership operator IN tests if a value is one of several values in a set. It is often simpler than using multiple = with OR.
sql <-paste0("SELECT carrier, name"," FROM airlines"," WHERE carrier IN ('UA','AA','LH','VX')")rs <- sqldf::sqldf(sql)print(rs)
## carrier name
## 1 AA American Airlines Inc.
## 2 UA United Air Lines Inc.
## 3 VX Virgin America
Rather than testing if a value is in a set, it is often necessary to test if it is not in the set. So, in the query below, we would get all airlines that are not in the specified set, i.e., everything else.
sql <-paste0("SELECT carrier, name"," FROM airlines"," WHERE carrier NOT IN ('UA','AA','LH','VX')")rs <- sqldf::sqldf(sql)print(rs)
## carrier name
## 1 9E Endeavor Air Inc.
## 2 AS Alaska Airlines Inc.
## 3 B6 JetBlue Airways
## 4 DL Delta Air Lines Inc.
## 5 EV ExpressJet Airlines Inc.
## 6 F9 Frontier Airlines Inc.
## 7 FL AirTran Airways Corporation
## 8 HA Hawaiian Airlines Inc.
## 9 MQ Envoy Air
## 10 OO SkyWest Airlines Inc.
## 11 US US Airways Inc.
## 12 WN Southwest Airlines Co.
## 13 YV Mesa Airlines Inc.
Set membership is often combined with subqueries where the subquery is the set of values.
Dates
Dates are unique data types in SQL. The two most common date related data types are DATE and DATETIME. The data type DATE are values in the form “YYYY-MM-DD” while DATETIME are values in the form of “YYYY-MM-DD HH:MI:SS”.
To check if a date column is a particular date, you can use = and to check if it is within a date range you can use BETWEEN.
Our data does not have any date data types but if we had a column flightdate in flights and want all flights in a date range, we could use the query below.
sql <-paste0("SELECT carrier, flight, flightdate"," FROM flights"," WHERE flightdate BETWEEN '2013-01-01' AND '2013-01-31'")rs <- sqldf::sqldf(sql)print(rs)
Unique Rows
It is possible that a query results in duplicate rows. The keyword DISTINCT eliminates duplicates in the result set.
## carrier flight
## 1 UA 1545
## 2 UA 1714
## 3 AA 1141
## 4 B6 725
## 5 DL 461
Try running the above query without DISTINCT and observe the difference. How many rows does the result have when you remove DISTINCT?
Aggregation
sql <-paste0("SELECT avg(dep_delay) AS avg_delay, max(dep_delay) AS max_delay"," FROM flights")rs <- sqldf::sqldf(sql)print(rs)
## avg_delay max_delay
## 1 12.63907 1301
Counting Rows
Aggregation is also often used to count rows in the result set. For example, this query finds the number of flights.
sql <-paste0("SELECT count(*)"," FROM flights")rs <- sqldf::sqldf(sql)print(rs)
## count(*)
## 1 336776
The DISTINCT keyword removes duplicates and therefore only counts unique rows.
sql <-paste0("SELECT count(DISTINCT carrier)"," FROM flights")rs <- sqldf::sqldf(sql)print(rs)
## count(DISTINCT carrier)
## 1 16
Also, note that the COUNT function does not require a column name like other aggregation functions because it is counting rows and that is the same regardless of which column one specifies. However, if we use the DISTINCT keyword, then a column name is required and it counts the distinct or unique values for that column.
GROUP BY
sql <-paste0("SELECT carrier, round(avg(dep_delay),0) AS avg_delay_mins"," FROM flights"," GROUP BY carrier"," LIMIT 6;")rs <- sqldf::sqldf(sql)print(rs)
## carrier avg_delay_mins
## 1 9E 17
## 2 AA 9
## 3 AS 6
## 4 B6 13
## 5 DL 9
## 6 EV 20
Can you modify the above SQL so that it finds the average and maximum delay (rounded to the nearest minute) for each carrier? What about per airport? Or per airport per carrier? These are common analytical queries and are often supported by fact tables in data warehouses.
The groups can be selected using the HAVING clause. The HAVING clause applies to groups, while the WHERE clause applies to the rows that will be grouped. In other words, the conditions of the WHERE clause and applied first, then GROUP BY and then HAVING; and finally, ORDER BY.
INNER JOIN
An inner join (or simply a join) is an equi-join that selects all rows where the foreign key value of one data frame (table) matches a primary key value of the linked data frame (table).
As an example, let’s find carrier, flight number, and name of the origin airport. Recall that the data frame flights has a foreign key column origin that is the airport code of the origin, i.e., a link to the primary key in the airports table.
To build a join, we list both tables in the FROM clause and then add the keyword JOIN (or INNER JOIN if one wants to be more specific as to which join is being applied) between them and in parenthesis the join condition after the keyword ON.
sql <-paste0("SELECT carrier, flight, name"," FROM flights JOIN airports ON (origin = faa)"," LIMIT 5;")rs <- sqldf::sqldf(sql)print(rs)
## carrier flight name
## 1 UA 1545 Newark Liberty Intl
## 2 UA 1714 La Guardia
## 3 AA 1141 John F Kennedy Intl
## 4 B6 725 John F Kennedy Intl
## 5 DL 461 La Guardia
The above SQL statement is equivalent to the relational algebra expression below:
A point of clarification: the above SQL works as the same attributes names do not appear in the two tables. If we had an attribute att in both tables, then we would have to explicitly scope which table we mean and we would commonly alias the table names. This is also done even if there’s no confusion in order to make the query easier to understand. The updated SQL below illustrates this:
sql <-paste0("SELECT f.carrier, f.flight, a.name"," FROM flights AS f JOIN airports AS a ON (f.origin = a.faa)"," LIMIT 5;")rs <- sqldf::sqldf(sql)print(rs)
## carrier flight name
## 1 UA 1545 Newark Liberty Intl
## 2 UA 1714 La Guardia
## 3 AA 1141 John F Kennedy Intl
## 4 B6 725 John F Kennedy Intl
## 5 DL 461 La Guardia
Of course, we can combine several tables. Let’s say we need to know the carrier name rather than the carrier’s abbreviation code for the above query, i.e., “American Airlines” rather than “AA”. To do that, we need to combine three tables: flights, airlines, and airports.
sql <-paste0("SELECT r.name, f.flight, a.name"," FROM flights AS f "," JOIN airports AS a ON (f.origin = a.faa)"," JOIN airlines AS r ON (f.carrier = r.carrier)"," LIMIT 5;")rs <- sqldf::sqldf(sql)print(rs)
## name flight name
## 1 United Air Lines Inc. 1545 Newark Liberty Intl
## 2 United Air Lines Inc. 1714 La Guardia
## 3 American Airlines Inc. 1141 John F Kennedy Intl
## 4 JetBlue Airways 725 John F Kennedy Intl
## 5 Delta Air Lines Inc. 461 La Guardia
Since the result of a SELECT statement is a table, we can use a SELECT statement wherever a table is required, _e.g., in the FROM clause. We can use a SELECT statement that return one row and one column (a single value) wherever a single value is expected, such as in a condition in a WHERE clause.
The query below finds all flights that have a below average departure delay.
sql <-paste0("SELECT f.flight, f.dep_delay"," FROM flights AS f"," WHERE f.arr_delay < (SELECT avg(arr_delay)"," FROM flights)"," LIMIT 5;")rs <- sqldf::sqldf(sql)print(rs)
In this query, we use a subquery to generate a list of values which we then use in a set operation, such as IN.
SQL on Tabular Data in Files
Using SQL on CSV Files
To query data in a CSV file using SQL can be done in two ways:
load the CSV into a data frame and use sqldf::sqldf()
run the SQL query directly against the CSV using sqldf::read.csv.sql()
Note that in (2), underneath R will still load the data from the CSV into memory and then copy the data frame into an in-memory SQLite database on which the SQL query is actually executed. But that all happens underneath the hood and while no more efficient it is simpler to use.
Let’s query the data in the CSV CerealDataCSV.csv using both of these approaches. You can decide which you prefer.
Approach I: Load CSV into data frame and use sqldf()
library(sqldf)# load the CSV into a data framecereals <-read.csv("CerealDataCSV.csv")sqlStmt <-paste0("SELECT CerealName,Calories,Sodium,Fiber,Carbs "," FROM cereals "," WHERE Sodium > 250")rs <- sqldf::sqldf(sqlStmt)print(rs)
Approach II: Run SQL directly using read.csv.sql()
The function reads a CSV filtered by a SQL statement, resulting in a subset of the data contained in the CSV being imported into a data frame. Since the table is a file it has no name, so in the SQL statement we refer to the “table” as “file”.
Note how we build the SQL “string” using paste0() so that we can split it over multiple lines, embed the value of R variables, and make it more readable.
library(sqldf)sqlStmt <-paste0("SELECT CerealName,Calories,Sodium,Fiber,Carbs "," FROM file "," WHERE Sodium > 250")rs <- sqldf::read.csv.sql(file ="CerealDataCSV.csv", sql = sqlStmt)print(rs)
A relational database is a persistent collection of tables where each table holds information about one entity, e.g., a table of customers or a table of sales. Each table is like a CSV file or a data frame; it has rows and columns, but unlike data frames, the rows are not numbered and cannot be accessed by a row number.
Each table holds information about one entity, e.g., a table of customers or a table of sales. Each table is like a CSV file or a data frame; it has rows and columns, but unlike data frames, the rows are not numbered and cannot be accessed by a row number.
A relational database is generally only necessary when the data does not fit into memory, sharing of CSV or XML files is inconvenient, or multiple applications need access to the same data at the same time. Other times, tabular files are often sufficient.
SQLite
One of the most popular relational databases is SQLite. It is a server-less, file-based database that is fast and small so it runs on most common devices and operating systems, including Windows, MacOS, Linux, Android, and iOS.
This is not a complete tutorial on SQLite databases, rather this section seeks to show how to create and new SQLite database and how to connect to a existing SQLite database from R. If you want to know more about SQLite, read Lesson 70.801 The SQLite Database.
To connect to or access a SQLite database from R requires two steps:
The code below demonstrates connecting to a SQLite database located in the file “flightsDB.db” which is located within the current project folder or current working directory. If it is not, then the full path to the file must be specified. If the database file does not already exist, a new database is created.
The database is a replica of the tables in the nycflights13 package that we have been using.
Run SQL Queries in SQLite
Once the connection to the database has been made, you can work with the database from R using functions from the RSQLite package, such as dbExecute(), dbGetQuery(), and dbSendStatement(). Alternatively, in R Notebooks, SQL code blocks can be inserted that are then “knitted” into the aforementioned R functions. Note that this only works in R Notebooks and not R Scripts (programs).
dbGetQuery()
The code below demonstrates how to run a SQL query (SELECT) on the tables in the connected “flightsDB.db” SQLite database that contains the same tables as we have been using from the package nycflights13.
The function dbgetQuery() can only be used for SELECT statements and it always returns a data frame containing the result set. The function dbExecute(), and dbSendStatement() can be used for SELECT queries and also for other SQL statements, such as INSERT or DROP TABLE. This is beyond the scope of this tutorial. Consult Lesson 6.301 ┆ Working with Databases in R if you want to learn more.
SQL Code Chunks
If we are building R Notebook, then we can also use SQL code chunks. When the R Notebook is knitted, the SQL chunk is converted to a call to an R function, so it is no more or less efficient; it is merely a simpler and more convenient way to run SQL statements.
A SQL code chunk is in a code fence that starts with {sql connection=dbcon} where dbcon is a database connection obtained from a call to dbConnect().
The code chunk as it would be typed into an R Markdown document is shown in the image below:
SQL Code Fence
Table 2: 5 records
airline
code
flight
origin
United Air Lines Inc.
UA
1545
Newark Liberty Intl
United Air Lines Inc.
UA
1714
La Guardia
American Airlines Inc.
AA
1141
John F Kennedy Intl
JetBlue Airways
B6
725
John F Kennedy Intl
Delta Air Lines Inc.
DL
461
La Guardia
Running a SQL query in a code chunk displays the result set. However, if we need to further process the result set in R or want to run analytics, then we need to capture the result set in a data frame. When we used dbGetQuery() this was the return value, but in a code chunk we need to use a parameter: output.var
So, the code chunk we added to the R Notebook looked like this:
SQL Code Fence with Captured Reset Set
Once captured in a {sql} code chunk, we can use it in an R code chunk.
head(df.flights)
## airline code flight origin
## 1 United Air Lines Inc. UA 1545 Newark Liberty Intl
## 2 United Air Lines Inc. UA 1714 La Guardia
## 3 American Airlines Inc. AA 1141 John F Kennedy Intl
## 4 JetBlue Airways B6 725 John F Kennedy Intl
## 5 Delta Air Lines Inc. DL 461 La Guardia
Of course, the R code above is not very useful but it illustrates that the result set is now in a data frame and can be processed accordingly.
Creating a Relational Database
A new database can be created in two ways:
creating tables and then inserting data
writing existing data from a CSV or data frames to tables
CREATE TABLE
Building a database means creating tables with attributes. This is done using the CREATE TABLE SQL statement. SQL is not case sensitive, so it does not matter whether we say CREATE TABLE or create table. Creating a table that already exists results in a error.
Once a table is created, it can be removed or modified (to some extent and with some restrictions). Removing a table is done with the DROP TABLE statement, while modifying a table is done with ALTER TABLE.
Before we can add tables and data to a database, we need to connect to the database. recall that connecting to a SQLite database that does not exist will create it (of course, without any tables in it).
In the following code blocks, we will create a new table lessons. But, in case it already exists, we will first remove it. Note that the {sql} code fence has to set connection=lessonDB.
createtable lessons ( lname text notnull, llength integernotnull);
Inserting Data
Once we have a new table, we can add data using the SQL INSERT statement. Let’s add a few rows of data to the table so we can see how SQLite assigns a row id as a primary key.
Rather than using a {sql} code chunk we can also use R functions such as dbSendStatement(). This is an option for R Notebooks but required if we write an R Script where code fences cannot be used.
Now, finally, let’s run a query that returns some data so we can be sure it worked.
select*from lessons;
Table 3: 3 records
lname
llength
Intro C++
90
Intro Java
90
SQL Joins
60
So, now you have seen how to create, connect to, and work with a SQLite database in R.
Writing Data Frames
As an alternative to inserting data one (or multiple) rows at a time, R has a function that writes an entire data frame to a table in the database2.
The code below creates the “flightsDB.db” database from the data frames in the nycflights13 package that we have been using.
To verify that the creation of the database was successful, we will retrieve some data from multiple tables using a join. For convenience, we’ll use a SQL code chunk rather than using an R function.
select l.name as'airline', f.carrier as'code', f.flight, a.name as'origin'from flights f join airports a on (f.origin = a.faa)join airlines l on (f.carrier = l.carrier)limit5;
Table 4: 5 records
airline
code
flight
origin
United Air Lines Inc.
UA
1545
Newark Liberty Intl
United Air Lines Inc.
UA
1714
La Guardia
American Airlines Inc.
AA
1141
John F Kennedy Intl
JetBlue Airways
B6
725
John F Kennedy Intl
Delta Air Lines Inc.
DL
461
La Guardia
So, now we have seen how to write entire data frames to a relational database.
SQL vs R
In many cases, SQL is simpler and more “universal” but R is often faster. Some queries, particularly across multiple data frames and involving grouping, is simplest in SQL. Others, can be done either with SQL or with R functions. In many situations, if the data frame is in memory, then R functions are much faster.
Let’s look at a few queries and carry them out with SQL and with Base R.
Worked Examples
The worked examples below use the tables defined in the nycflights13 database and the data frames from the nycflight13 package. We will switch between the two but the data model is the same and shown below again for convenience.
Before we can make any queries in a database, we will need to connect to the database.
To follow along and try the examples for yourself, follow these steps:
Assuming that we have the data in memory and assuming that the data fits into memory, then we can also craft a solution that uses R functions and does not use SQL.
n <-length(which(flights$air_time <60))print(n)
## [1] 52433
Solution III: Relational Algebra
\(\pi\)
Example 3
Question
Use the in-memory data frames and create a query that finds the number of flights per airline.
Solution
The solution uses SQL with sqldf. Using Base R would be very difficult – if you do not agree, then build a solution using R and not using SQL.
sql <-paste0("SELECT carrier, count(*) AS numFlights"," FROM flights"," GROUP BY carrier")rs <- sqldf::sqldf(sql)print(rs)
## carrier numFlights
## 1 9E 18460
## 2 AA 32729
## 3 AS 714
## 4 B6 54635
## 5 DL 48110
## 6 EV 54173
## 7 F9 685
## 8 FL 3260
## 9 HA 342
## 10 MQ 26397
## 11 OO 32
## 12 UA 58665
## 13 US 20536
## 14 VX 5162
## 15 WN 12275
## 16 YV 601
Example 4
Question
Use the in-memory data frames and create a query using sqldf that finds the carriers with fewer than 1000 flights sorted alphabetically.
Solution
sql <-paste0("SELECT carrier, count(*) AS numFlights"," FROM flights"," GROUP BY carrier ","HAVING numFlights < 1000"," ORDER BY carrier")rs <- sqldf::sqldf(sql)print(rs)
# calculate the average departure delay ignoring missing values (NA)m <-mean(flights$dep_delay, na.rm = T)# select rows where departure delay is below averagers <- flights[which(flights$dep_delay < m),c("carrier","flight","dep_delay")]head(rs, 5)
## # A tibble: 5 × 3
## carrier flight dep_delay
## <chr> <int> <dbl>
## 1 UA 1545 2
## 2 UA 1714 4
## 3 AA 1141 2
## 4 B6 725 -1
## 5 DL 461 -6
Example 7
Question
List the names of all airlines that have at least one flight in alphabetical order. Remove all duplicates.
Solution: SQL
There may be airlines in the airlines tables that are not linked to from the flights table, i.e., there is not flight for them. So, to list only the airlines that have at least one flight we need to join airlines and flights.
In the solution we are introducing aliases for the two tables and we are using them in the join clause. They are required as the carrier attribute is in two tables and is therefore ambiguous. The keyword as could have been omitted.
selectdistinct name from flights as f join airlines as a on (f.carrier = a.carrier)orderby name;
Table 9: Displaying records 1 - 10
name
AirTran Airways Corporation
Alaska Airlines Inc.
American Airlines Inc.
Delta Air Lines Inc.
Endeavor Air Inc.
Envoy Air
ExpressJet Airlines Inc.
Frontier Airlines Inc.
Hawaiian Airlines Inc.
JetBlue Airways
Example 8
Question
What is the total delay (in hours) for each airline? Display the carrier (e.g., UA or AA) and the total delay (sum).
How many flights departed from each airport? List the name of the airport and number of flights.
Solution: SQL
select name, count(*)from flights f join airports a on (f.origin = a.faa)groupby originlimit10
Table 11: 3 records
name
count(*)
Newark Liberty Intl
120835
John F Kennedy Intl
111279
La Guardia
104662
Example 10
Question
List the name of the airport and number of flights during the winter months (Dec, Jan, Feb, Mar) for all airports that had at least 10000 flights, sorted from highest to lowest.
Solution: SQL
select name, count(*) as'numDepartures'from flights f join airports a on (f.origin = a.faa)wheremonthin (12,1,2,3)groupby originhaving numDepartures >10000orderby numDepartures DESC
Table 12: 3 records
name
numDepartures
Newark Liberty Intl
39342
John F Kennedy Intl
36425
La Guardia
33157
Example 11
Question
List the airline name, carrier code, and the total number of flights each had.
Solution: SQL
select name, count(*)from flights f join airports a on (f.origin = a.faa)groupby originlimit10
Table 13: 3 records
name
count(*)
Newark Liberty Intl
120835
John F Kennedy Intl
111279
La Guardia
104662
Example 12
Question
Find all airlines that contain the word “America”. List the airline carrier code and airline name.
Solution: SQL
select carrier, name from airlineswhere name like'%America%'
Table 14: 2 records
carrier
name
AA
American Airlines Inc.
VX
Virgin America
Example 13
Question
How many flights were delayed during the summer months (June to Sept) due to wind_gusts?
Solution: SQL
selectcount(*) as'delGusts'from flights as f join weather as w on (f.year= w.yearand f.month= w.monthand f.day= w.dayand f.hour= w.hour)where f.monthin (6,7,8,9) and w.wind_gust notnull
Table 15: 1 records
delGusts
60370
Example 14
Question
Find the flight number and airline name plus the date of departure and the name of the departure airport that had the longest arrival delay. Display the date in the format ‘M/D/YYYY’ and the delay in hours rounded to one digit of precision.
Solution: SQL
select name, carrier, flight, round(dep_delay/60,1) as'delay', month||'/'||day||'/'||yearas'date'from flights join airports on (origin = faa)where dep_delay = (selectmax(dep_delay) from flights)
Table 16: 1 records
name
carrier
flight
delay
date
John F Kennedy Intl
HA
51
21.7
1/9/2013
Example 15
Question
Find the flight number and airline name plus the date of departure and the name of the departure airport that had the second longest arrival delay. Display the date in the format ‘M/D/YYYY’ and the delay in hours rounded to one digit of precision.
Solution: SQL
select name, carrier, flight, round(dep_delay/60,1) as'delay', month||'/'||day||'/'||yearas'date'from flights join airports on (origin = faa)where dep_delay = (selectmin(dep_delay)from (select dep_delay from flights where dep_delay >0orderby dep_delay DESClimit2))
Table 17: 1 records
name
carrier
flight
delay
date
John F Kennedy Intl
MQ
3535
18.9
6/15/2013
Lectures and Tutorials
The (hour-long) video of a lecture on these topics by Dr. Martin Schedlbauer of Khoury Online is provided below. We recommend watching it to get additional insights, see the connections between ERDs, UML, SQL, Relational Databases, SQLite, and R. The lecture also provided an introduction to tools for building ERD diagrams, where he demonstrates how to build a relational model as a Crow’s Foot ERD using LucidChart and implement that model in SQLite in an R Notebook using {sql} chunks.
In this video tutorial below, Khoury Boston’s Prof. Schedlbauer explains how to map various associations, aggregations, multi-valued attributes, and generalizations to a relational schema. Generalization defines a specialization hierarchy of classes in a conceptual model. While some databases support generalization or type hierarchies, relational database do not. Nevertheless, there are ways to implement generalization in a relational model.
Summary
This lesson showed that SQL is a universal language for querying information in relationally organized tabular structures. In addition, the lesson explained key operations of relational algebra which forms the basis of SQL.
Finally, the lesson demonstrated, through examples, how to execute SQL queries on in-memory data frames using sqldf and on tables in SQLite databases using {sql} code chunks and R functions.
Naturally, you must first install the RSQLite package if is not yet installed on your installation of R.↩︎
This requires support from the database for “bulk insertion”. SQLite supports this by default, while MySQL and other databases must be configured to allow that. Some cloud database services such as db4free.net do not allow the use of dbWriteTable.↩︎
Be sure to right-click on the link and choose Save As… or Save Link As… rather than clicking on the link which will not work as the browser would try to display the database file as a document.↩︎