The relational model is an approach to storing data that is tabular and that connects data in one table to data in another table. It is the foundation of relational databases, but the tables do not have to be in a database for the data to be “relational”.
Relations
The relational model is based on set theory and each table is more formally called a relation. A relation has attributes; the number of attributes that a relation has is called its degree. Fundamentally, a relation is a set of tuples. The cardinality of a relation is the number of tuples in the relation.
For example, a relational model representing a learning management system might have the relation courses which might have attributes course number, title, and credits, and it might the tuples shown below:
courses(
{'IS2000','Principles of Information Science', 4},
{'CS5200','Database Management Systems', 4},
{'CS5380','Project Management with Scrum', 2},
{'DA5021','Big Data', 1}
)
The relation is defined as follows:
\(courses(number,title,credits)\).
Let TEXT be the domain of all character strings. The attribute domains for the relation courses would then be
\(number,title \in TEXT\)
\(credits \in \mathbb{N}\).
The above relation has a degree of 3 and cardinality of 4. It is a set of four tuples. {‘CS5200’,‘Database Management Systems’, 4} is one such tuple.
In tabular form, the above is a table with three columns and four rows.
IS2000 |
Principles of Information Science |
4 |
CS5200 |
Database Management Systems |
4 |
CS5380 |
Project Management with Scrum |
2 |
DA5021 |
Big Data |
1 |
Primary Key
Specific tuples are identified through a unique value of one attribute or a combination of several attributes. Tuples must be unique – this is a requirement of the relational model. The attribute (or combination of attributes) that uniquely identifies each tuple is referred to as the primary key. In the above example, course number would be the primary key as it is, by definition in the business domain, unique, i.e., no two different courses can have the same course number.
If there is no natural unique key or the key is too complex, then we generally “invent” one; a so-called artificial key. For example, for the courses relation we might wish to add a unique identifier column and assign a unique value for each course and use that as the primary key rather than the course number. Incidentally, this will make course number a so-called alternate key.
Here is what the new relation courses would look like:
\(courses(\underline{cid},number,title,credits)\).
courses(
{4453,'IS2000','Principles of Information Science', 4},
{6654,'CS5200','Database Management Systems', 4},
{1009,'CS5380','Project Management with Scrum', 2},
{4198,'DA5021','Big Data', 1}
)
The values for cid are “made up” – they could have been sequential numbers too, e.g., 1, 2, 3, and 4. The actual values are meaningless as long as they are unique. In a relation definition, primary key attributes are often underlined so they are easier to recognize.
Linked Relations
To show that relations are generally not by themselves but rather are linked, let’s say that we want to track the instructor who teaches a course. We might decide to extend the relation courses with an additional attribute for instructor:
\(courses(\underline{cid},number,title,credits,instructor)\).
An instance of that relation might now be:
courses(
{4453, 'IS2000','Principles of Information Science', 4, 'Jose Roja'},
{6654, 'CS5200','Database Management Systems', 4, 'Anneliese Frack'},
{1009, 'CS5380','Project Management with Scrum', 2, 'Rahul Prahit'},
{4198, 'DA5021','Big Data', 1, 'Xi Wang'}
)
Simple… but what if the same instructor teaches more than one course? Then we would have repetition. There would be even more information repeated if we tracked the instructor’s department, rank, and highest degree earned. While that is commonly done in a CSV or a data frame, it is not appropriate for a relational model where repetition is frowned upon as it violates so-called normal forms. A better approach would be to create another relation to track instructor-specific information: \(instructors(\underline{iid}, name,department,rank,degree)\). Note that we, once again, created an artificial key attribute – in this case there is no natural key as instructor name cannot be assumed to be unique.
instructors(
{101, 'Jose Roja', 'Computer Science', 'Associate', 'PhD'},
{102, 'Xi Wang', 'Computer Science', 'Full', 'PhD'},
{901, 'Anneliese Frack', NULL, 'Adjunct', 'MSc'},
{301, 'Rahul Prahit', 'Industrial Engineering', 'Research', 'DSc'}
{601, 'Elaine Campbell', 'Education', 'Assistant', 'EdD'}
)
There is still some repetition when instructors are in the same department or have the same rank, but we will ignore that for now rather than adding a departments relation and a lookup relation for rank. Also note that we chose not to associate part-time adjunct instructors with a department. This is not a universal rule, but rather a rule that an analyst might have uncovered during information analysis.
To track which instructor teaches which course, we would first remove all instructor information from the courses relation and add a new attribute that has as a value the primary key of the instructor who teaches the course. Such as attribute in courses is referred to as a foreign key attribute.
The updated \(courses\) relation is
\(courses(\underline{cid},number,title,credits,\boldsymbol{iid})\).
Notice how the foreign key attribute is bolded (or sometimes italicized) to emphasize it.
An instance is shown below:
courses(
{4453, 'IS2000','Principles of Information Science', 4, 101},
{6654, 'CS5200','Database Management Systems', 4, 901},
{1009, 'CS5380','Project Management with Scrum', 2, 301},
{4198, 'DA5021','Big Data', 1, 102}
)
Now, if an instructor teaches more than one course, we would not need to repeat the instructor information.
courses(
{4453, 'IS2000','Principles of Information Science', 4, 101},
{6654, 'CS5200','Database Management Systems', 4, 901},
{1009, 'CS5380','Project Management with Scrum', 2, 301},
{4198, 'DA5021','Big Data', 1, 102},
{8890, 'DA5029','Data Analytics Project', 1, 102},
)
Relational Operations
There are several important operations defined on a relational model:
- selection
- projection
- rename
- aggregation
- grouping
- equi-join
These are defined in a relational algebra. The full set of operations of the relational algebra are beyond the scope of this lesson. If you have an interest in learning more, consult Lesson 60.502 Relational Algebra.
Relational algebra is an important query abstraction mechanism and is often used to express queries without resorting to a specific query language such as SQL.
Relations as Sets
In the relational algebra, relations are treated as (mathematical) sets and thus cannot, by definition of a set, contain duplicates.
Selection
The selection operation is denoted with the letter \(\sigma\) and is used to choose a subset of tuples that satisfy a selection condition. The selection condition is expressed as a Boolean logical operation. It acts as a filter on the tuples of a relation. An alternate way of looking at selection is that it specifies conditions under which a tuple may belong to a relation. In more practical terms, we can look at it as selecting rows from a table that meet certain “search” criteria.
We will use the (simpler and single) relation
\(courses(number,title,credits)\)
with the instance shows below, for some of the upcoming examples:
courses(
{'IS2000','Principles of Information Science', 4},
{'CS5200','Database Management Systems', 4},
{'CS5380','Project Management with Scrum', 2},
{'DA5021','Big Data', 1}
)
Using the course relation defined above, to select all courses that are less than four credits, we would use the following operation:
\(\sigma_{credits \lt 4}(courses)\)
In general, the selection operation on any relation \(R\) is denoted by
\(\sigma_{<condition>}(R)\)
where is a Boolean logical expression of one or more clauses. An example with a compound Boolean condition is shown below:
\(\sigma_{(credits \geq 2) \land (credits \leq 4)}(courses)\)
The selection operator is unary – it applies to a single relation. The condition is applied to each tuple and can therefore only involve a single tuple. The degree of the resultant relation is the same as the relation to which selection is applied. The number of tuples of the resultant relation (its cardinality) is always less than or equal to the relation to which selection is applied. That is \(|\sigma_{C}(R)|\leq |R|\). The percentage of tuples selected by the selection operation is called the selection’s selectivity.
Selection operations can be cascaded since the result of a selection is a relation.
Selection is commutative, i.e.,
\(\sigma_{<C_1>}(\sigma_{<C_2>}(R)) = \sigma_{<C_2>}(\sigma_{<C_1>}(R))\)
A cascaded set of selections can always be transformed into a conjunction:
\(\sigma_{<C_1>}(\sigma_{<C_2>}(R)) = \sigma_{<C_1> \land <C_2>}(R)\)
Projection
The project operation is denoted with the Greek letter \(\pi\) and it selects specific attributes from a relation. It can be thought of as a vertical partitioning of a relation into two relations where one partition contains the attributes of interest and the other partition is discarded. The example below extracts the attributes number and title from the relation courses.
\(\pi_{number,title}(courses)\)
In general, the projection operation on any relation \(R\) is denoted by
\(\pi_{<attribute list>}(R)\)
The result of a projection is a relation having a degree equal to the number of attributes in the <attribute list> and is always less than or equal to the degree of \(R\). Similar to selection, the projection operator is also unary – it applies to a single relation. The number of tuples of the resultant relation (its cardinality) is always equal to the relation to which projection is applied.
The generalized form of projection allows functions on attributes to be included and is expressed as:
\(\pi_{F_1,F_2,...,F_j}(R)\),
where each \(F_k\) is a function over the attributes in the relation \(R\) and may include operations and scalar values.
For example,
\(\pi_{upper(title),credits/15}(courses)\),
where \(upper()\) should be interpreted a user defined function (in this example returning its argument in all upper case letters).
Rename
While relational algebra expressions can be cascaded (i.e., nested), this results is potentially long and unwieldy expressions. Alternatively, expressions can be decomposed into a sequence of individual expressions and intermediary resultant relations and attributes can be assigned new names using the rename operation. The rename operation is denoted with the Greek letter \(\rho\). Rename is a unary operation and takes the general form:
\(\rho_{S(B_1,B_2,\ldots,B_n)(R)}\)
This renames the relation \(R\) to \(S\) and renames each attribute \(A_i\) of \(R\) to \(B_i\).
The example below, selects the number and title of all courses with fewer than four credits and renames the resultant relation to shortCourses and the attributes to code and name.
\(\rho_{shortCourses(code,name)}(\pi_{credits,title}(\sigma_{credits \lt 4}(courses)))\)
Expressing the above as a series of relational algebra expression, we would get:
\(\rho_{E}(\sigma_{credits \lt 4}(courses))\)
\(\rho_{S}(\pi_{number,title}(E))\)
\(\rho_{shortCourses(code,name)}(S)\)
Both of the approaches are correct and the use is generally a personal preference. Often, using rename operations can make a complex expression simpler to follow.
Aggregation
An aggregate function operation is denoted with
\(\mathfrak{F}_{<functions>}(R)\).
The <functions> are one or more of
- COUNT (number of occurrences),
- SUM (summation),
- AVG (average or mean),
- MIN (minimum), and
- MAX (maximum) or
- any user-defined function definable over an attribute.
For example, the expression below defines a resultant relation that has two attributes (columns) that are the average and maximum credits for courses.
\(\mathfrak{F}_{<AVG(credits),MAX(credits)>}(courses)\)
The result of an aggregate function is, like every relational operation, a relation, albeit, in some cases, a relation of degree 1 and cardinality 1.
A function can also be an attribute in which case it is the “identify function” which is just the value of the attribute.
Grouping
Grouping is a common operation in which equal values of a column are “grouped” or “batched” and the resultant relation only contains the groups. A grouping operation is denoted with \({}_{<attribute-list>}\mathfrak{G}(R)\)
The result of an grouping function is always a relation, albeit, in some cases, a relation of degree 1 and cardinality of 1.
For example, the expression below defines a resultant relation that has one attribute (column) that are the different credits awarded to courses.
\({}_{<credits>}\mathfrak{G}(courses)\)
In another example, the expression below defines a resultant relation that has two attributes (columns) that are the credits and number of courses for each level of credits.
\({}_{<credits>}\mathfrak{G}_{<credits, COUNT>}(courses)\)
Equi-Join
The equi-join (also often called an inner join) operation is the most common form of a multi-relation operation and is often simply called a join. It is denoted by the symbol \(\bowtie\), is used to combine related tuples from two relations in a cross join (Cartesian product).
An equi-join has the general form:
\({R_1}{\bowtie}_{<join\;condition>}{R_2}\)
The \(<join\;condition>\) are the matching attributes in the two operand relations. For example, assume
\(R_1(pk_{R_1},a_1,a_2,fk)\), where \(fk\) is a foreign key referencing \(pk_{R_2}\), and \(R_2(pk_{R_2},a_1,a_2)\),
then an equi-join of the two relations is expressed as
\({R_1}{\bowtie}_{{R_1}.fk={R_2}.{pk_{R_2}}}{R_2}\)
The degree of a new relation resulting from an equi-join is \(degree(R_1) + degree(R_2)\) and the cardinality is at most \(|R_1| \cdot |R_2|\).
An equi-join matches tuples of the two relations based on equality of the attributes in the join condition.
In order to combine related tuples, one of the tables must have the primary key of the other as an attribute (i.e., as a foreign key). In a one-to-one relationship, it does not matter which relation has the foreign key; however, in a one-to-many relationship, the relation on the “side of the one” has the foreign key to the “side of the many”.
There are several other types of join, including natural, outer, and theta joins, but those are beyond the scope of this lesson.
To demonstrate this simple, but often confusing operation, let’s write a query that finds the name of a course and the name of the instructor who teaches the course. We will rely on the previous definitions for
\(courses(\underline{cid},number,title,credits,\boldsymbol{iid})\)
and
\(instructors(\underline{iid}, name,department,rank,degree)\).
The two relation instances are repeated below for convenience:
instructors(
{101, 'Jose Roja', 'Computer Science', 'Associate', 'PhD'},
{102, 'Xi Wang', 'Computer Science', 'Full', 'PhD'},
{901, 'Anneliese Frack', NULL, 'Adjunct', 'MSc'},
{301, 'Rahul Prahit', 'Industrial Engineering', 'Research', 'DSc'}
{601, 'Elaine Campbell', 'Education', 'Assistant', 'EdD'}
)
courses(
{4453, 'IS2000','Principles of Information Science', 4, 101},
{6654, 'CS5200','Database Management Systems', 4, 901},
{1009, 'CS5380','Project Management with Scrum', 2, 301},
{4198, 'DA5021','Big Data', 1, 102},
{8890, 'DA5029','Data Analytics Project', 1, 102},
)
The equi-join operation is:
\({courses}{\bowtie}_{courses.iid=instructors.iid}{instructors}\)
The foreign key in courses is iid and it “points to” or “references” the instructor who teaches the course in the instructors relation. So, that will need to be our join condition. It does not matter whether we put courses or instructors on the left or right side of the join operation. An equi-join is commutative.
We need to “scope” the names of the attributes as they are the same in both relations. We could have simplified the expression by first renaming the courses and instructors relations. This is left as an exercise for the reader – try it out.
Now the above gets as a relation that contains all attributes from courses and instructors. To get only the course name and instructor name, we need to add a projection. We will also take advantage a rename operation so we can split the expression.
\(\rho_{cni}({(courses)}{\bowtie}_{courses.iid=instructors.iid}{(instructors)})\)
\(\pi_{title,name}(cni)\)
There are several other ways that this query could have been written. Can you come up with one other way of writing this? Could you have projected attributes first before joining?
Let’s add one more wrinkle and say that we only want the four-credit courses and their instructors. Then we would need to add a selection before the projection.
\(\rho_{c}({(courses)}{\bowtie}_{courses.iid=instructors.iid}{(instructors)})\)
\(\rho_{s}(\sigma_{credits = 4}(c))\)
\(\pi_{title,name}(s)\)
Exercise: Try writing the above set of expressions as a single expression? Is is simpler? Easier or more difficult to “parse”?