60.502 Relational Algebra

Objectives

Upon completion of this lesson, you will be able to:

express a query with relational algebra
use relational operators

Overview

This lesson provides an overview of relational algebra, an abstract way to express relational queries. It explains the basic operations of selection, projection, join, rename, aggregation functions, and set operations.

Relational algebra is a foundation for SQL. It was originally proposed by Codd (1970). The first implementation of the relational algebra was Alpha, developed by E. F. Codd himself. After that, ISBL was created, and that language in turn was the basis for SQL. SQL is loosely based on a relational algebra, though the operands in SQL do not quite map to relational algebra. One issues is that tables in relational databases are not, in fact, relations and therefore several useful theorems about the relational algebra do not hold in the SQL counterpart. The reason is that the SQL table model is a bag (multiset), rather than a set which cannot contain duplicates and all tuples in the set must be unique.

Nevertheless, relational algebra is an important query abstraction mechanism and is often used to express queries without resorting to SQL contortions. It also forms the basis of most query optimizer implementations within database engines.

Queries and Query Languages

Retrieving data from a storage repository of any kind requires some kind of query language. Query languages fall into two general buckets: procedural and declarative.

Accessing files in the file system is procedural. For example, finding a file in R, Python, Java, or C++ is generally done by starting at some root folder and then searching all files within a folder hierarchy until the file is found. The programmer has to open the folder, list all files in the folder, determine if a file is actually a folder and navigate there recursively, compare each file name to the one in question. The programmers has to define the steps for the query. This is a procedure and thus the name “procedural query”.

A query expressed in SQL or relational algebra is declarative. The programmer defines the constraints that the result must meet and then the database management system determine how to access and search the data. The programmer does not actually define how tables are searched, whether indexes are used, or how a comparison is done.

This tutorial focuses on the syntax and operators of relational algebra. Note that there are other abstract query languages besides relational algebra. Another notable language is relational calculus which is based on first order (predicate) logic. SQL is influenced by both relational algebra and relational calculus, so it is important to have some understanding of both. Research papers in the area of databases and information retrieval will often express queries using relational algebra and relational calculus.

As you follow along, you might want to try “executing” relational algebra expressions using the interactive relational algebra calculator RelaX developed by the University of Innsbruck.

Set Mathematics

Relational algebra is based on the mathematics of sets. A set is a collection of distinct and unique objects, i.e., a set may not contain duplicate elements. These elements (aka objects or members) can be anything: numbers, letters, symbols, or even other mathematical entities like functions or sets themselves. The concept of sets is a fundamental building block in various branches of mathematics, including relational algebra, and it is a branch of mathematical logic.

Below are some key terms and concepts related to set mathematics that are needed for relational algebra:

Elements: The individual objects or members that belong to a set are called elements. For example, if you have a set A = {1, 2, 3}, then 1, 2, and 3 are the elements of set A.
Uniqueness: Elements is a set are unique and no element can be twice in the same set, so the set A = {1, 2, 3, 2} is not a legal set. Sets in which duplicate elements can appear are often referred to as multisets or bags.
Notation: Sets are typically represented using curly braces {}. For example, the set of natural numbers less than 10 can be written as {1, 2, 3, 4, 5, 6, 7, 8, 9}.
Set Generating Notation. Rather than listing the elements of a set, a set can be defined by a function that generates the members of the set. The most common way to write a generating set is with the notation {x : x ∈ ℝ ∧ x ≥ 1000 ∧ x ≤ 5000}. This defines the set of all real numbers between 1000 and 5000 inclusive.
Common Sets: Various sets of numbers have common set names. For example, ℝ is the set of real numbers, ℤ is the set of integers, and ℕ is the set of natural numbers.
Set Membership: An element either belongs to a set or does not. We use the symbol “∈” to denote membership. For example, if x is an element of set A, we write it as x ∈ A.
Set Equality: Two sets are considered equal if they have exactly the same elements. For example, if A = {1, 2, 3} and B = {3, 2, 1}, then A and B are equal sets. There is no implied order to the elements in a set.
Cardinality: The cardinality of a set is the number of elements it contains. It is denoted by |A|, where A is the set. For example, if A = {1, 2, 3}, then |A| = 3.
Subset: A set A is said to be a subset of another set B if every element of A is also an element of B. This is denoted as A ⊆ B. If A is a subset of B, but A is not equal to B, then it is a proper subset, denoted as A ⊂ B.
Universal Set: The universal set, often denoted as U, is the set that contains all the elements under consideration in a particular context.
Empty Set: The empty set, denoted as ∅ or {}, is a set that contains no elements.
Union and Intersection: The union of two sets A and B, denoted as A ∪ B, contains all the elements that are in A, in B, or in both. The intersection of two sets A and B, denoted as A ∩ B, contains only the elements that are in both A and B. If an element is in both A and B, then it would, of course, only be once in A ∪ B as set elements must be unique, i.e., no duplicates.
Complement: The complement of a set A, denoted as A’, is the set of all elements that are in the universal set U but not in A.

Set mathematics and its concepts form the foundation for various mathematical structures and operations in fields like calculus, algebra, and statistics, making it a fundamental and essential part of mathematics.

Logical Operations

Logical Operators

Logical (or Boolean) operators are used to build statements that evaluate to either TRUE or FALSE. These, in turn, are grouped into logical expressions which also have a truth value. The table below summarizes the logical operators. Note that the operators are the same as those found in first order logic.

Operator	Alternative	Meaning	Example	Explanation
∧	AND, +, &	AND	Q ∧ W	True only if both Q and W are true
⋁	OR, \|, \|\|	OR	Q ⋁ W	True only if either or both Q and W are true
⊻	XOR, EOR, ⊕	XOR	Q ⊻ W	True only if either but not both Q and W are true, false otherwise
¬	NOT, ~, !	NOT	¬ Q	True if Q is false, false if Q is true

The truth values of the various logical operations Q ⊛ W are summarized in the table below:

Q	W	∧	⋁	⊻
T	T	T	T	F
T	F	F	T	T
F	T	F	T	T
F	F	F	F	F

Exercise

Question

What is the truth value of the following Boolean expression

¬(Q ⋁ ¬X) ∧ (X ⊻ Q)

Q is true and X is false?
both Q and X are true

Answer

Let Q = T and X = T, then

¬(Q ⋁ ¬X) ∧ (X ⊻ Q) = ¬(T ⋁ ¬F) ∧ (F ⊻ T) = ¬(T ⋁ ¬T) ∧ (F ⊻ T) = ¬(T ⋁ F) ∧ (F ⊻ T) = ¬(T) ∧ (T) = F ∧ T = F

Let Q = T and X = T, then

¬(Q ⋁ ¬X) ∧ (X ⊻ Q) = ¬(T ⋁ ¬T) ∧ (T ⊻ T) = ¬(T ⋁ F) ∧ (T ⊻ T) = ¬(T) ∧ (T ⊻ T) = F ∧ F = F

Relations and Tuples

In the relational model, a relation, as originally defined by Codd (1970), is a set of tuples (a₁, a₂, …, a_n), where each element a_j is a member of 𝔇_j, a data domain, i.e., a_j ∈ 𝔇_j. Note that this means that the elements in a tuple must not all be of the same domain.

The degree of a relation is the number of elements in each tuple and is sometimes expressed as δ(R). The cardinality of a relation R is the number of tuples in that relation and is commonly expressed with |R|.

A relational schema assigns a name to a relation and a name to each element in a tuple and is often expressed as

R(A₁, A₂, …, A_n).

In a database implementation – and in SQL – a relation is a table, the relation name is the table name, the attribute names are column names, each tuple is a row in the table, the attribute domains are column data types, the degree of the relation is the number of columns, and the relation’s cardinality is the number of rows in the table.

A relation with a set of tuples with specific values for each a_i is called an instance of a relation.

For example, Employee(id,name,salary) is a relation. Let TEXT be the domain of all character strings and define the domains for the attributes as follows:

eid,ename ∈ TEXT

esalary ∈ ℝ⁺

Then,

{(1107,"Peter Maffei",123763), 
 (1009,"Nina Hagen",375000), 
 (1432,"Udo Lindenbergh",978000), 
 (3309,"Doris Day",1250000)}

is an instance of the relation schema \(Employee\).

The number of tuples in a relation instance is called its cardinality and is denoted with \(|R|\). For example, \(|Employee|=4\).

Relational Algebra

The five primitive operators of relational algebra are selection, projection, Cartesian product (cross join), set union, and set difference. Relational algebra borrows set union, set difference, and Cartesian product from set theory, but adds additional constraints to these operators.

For set union and set difference, the two relations involved must be union-compatible, i.e., the two relations must have the same set of attributes. Because set intersection is defined in terms of set union and set difference, the two relations in set intersection must also be union-compatible.

The operators of relational algebra are generally named with Greek and German letters to distinguish them from attributes and relation names.

A sequence of relational algebra operations forms a relational algebra expression, the result of which is always a relation, i.e., relational algebra is a closed mathematical system.

For matters of convenience, relational algebra also includes the rename operation.

Union Compatibility

A number of binary relational algebra and set operations assume that the two relation operands are union compatible, which means that both relations are of the same degree and corresponding attributes belong to the same domain.

In mathematical terms, two relations \(R_1\) and \(R_2\) are union compatible if and only if

\(degree(R_1)=degree(R_2)\), and \(\mathfrak{D}(R^{A_i}_{1}) = \mathfrak{D}(R^{A_i}_{2})\), where \(A_i\) is the ith attribute and \(\mathfrak{D}(R^{A_i}_{k})\) is the domain of the ith attribute of relation \(R_k\).

For example, the two relations \(A\) and \(B\) below are union compatible:

\(A(a_1(char),a_2(char),a_3(date))\)

\(B(b_1(char),b_2(char),b_3(date))\)

Both relations have 3 attributes of same date type, although the actual union many not make much sense.

The two relations \(X\) and \(Y\) below are not union compatible as the data types do not match for corresponding attributes.

\(X(x_1(char),x_2(char),x_3(date))\)

\(Y(y_1(integer),y_2(char),y_3(boolean))\)

The two relations \(Q\) and \(Z\) below are also not union compatible as the degrees of the two relations are not equal. However, by some they are considered union compatible as all attributes of \(Q\) are union compatible with the first \(degree(X)\) attributes of \(Z\).

\(Q(q_1(char),q_2(char))\) \(Z(z_1(char),z_2(char),z_3(boolean))\)

Boolean Expressions

Several relational algebra operations require the definition of conditions. These are expressed as a Boolean logical expression consisting of one or more clauses each of which evaluates to either true or false. Every clause takes the form , where is the name of an attribute of \(R\) and is a logical operator \({=,<,>,\leq,\geq,\neq}\) and the clauses are connected using Boolean operators of and (\(\land\)), or (\(\lor\)), and not (\(\neg\)).

A Boolean expression with clauses connected via (\(\land\)) is referred to as a conjunction while clauses connected via (\(\lor\)) is referred to as a disjunction. A conjunction is true if an only if all clauses are true, while a disjunction is true if and only if at least one of the clauses is true.

Selection

The selection operation is denoted with the letter \(\sigma\) and is used to choose a subset of tuples that satisfy a selection condition. The selection condition is expressed as a Boolean logical operation. It acts as a filter on the tuples of a relation. An alternate way of looking at selection is that it specifies conditions under which a tuple may belong to a relation.

Example: Selection

The relational algebra expression selects all tuples from the relation \(Employee\) where the value of the \(salary\) attribute is greater than or equal to 250,000.

\(\sigma_{salary \geq 250000}(Employee)\)

In general, the selection operation on any relation \(R\) is denoted by

\(\sigma_{<condition>}(R)\)

where is a Boolean logical expression of one or more clauses. An example is shown below:

\(\sigma_{(salary \geq 250000) \lor (salary \leq 100000)}(Employee)\)

Properties of Selection

The selection operator is unary – it applies to a single relation. The condition is applied to each tuple and can therefore only involve a single tuple. The degree of the resultant relation is the same as the relation to which selection is applied. The number of tuples of the resultant relation (its cardinality) is always less than or equal to the relation to which selection is applied. That is \(|\sigma_{C}(R)|\leq |R|\). The percentage of tuples selected by the selection operation is called the selection’s selectivity.

Selection operations can be cascaded since the result of a selection is a relation.

Selection is commutative, i.e.,

\(\sigma_{<C_1>}(\sigma_{<C_2>}(R)) = \sigma_{<C_2>}(\sigma_{<C_1>}(R))\)

A cascaded set of selections can always be transformed into a conjunction:

\(\sigma_{<C_1>}(\sigma_{<C_2>}(R)) = \sigma_{<C_1> \land <C_2>}(R)\)

In SQL, the condition is typically specified in the WHERE clause. So, for example, the above selection can be expressed in SQL as:

SELECT * 
  FROM R 
 WHERE C1 AND C2;

Exercise

Question

How would you express the following relational algebra expression in SQL?

\(\sigma_{(salary \geq 250000) \lor (salary \leq 100000)}(Employee)\)

Answer

SELECT * 
  FROM Employee 
 WHERE salary >= 250000 OR salary <= 100000;

SELECT * 
  FROM Employee 
 WHERE salary BETWEEN 100000 AND 250000;

Projection

The project operation is denoted with the Greek letter \(\pi\) and it selects specific attributes from a relation. It can be thought of as a vertical partitioning of a relation into two relations where one partition contains the attributes of interest and the other partition is discarded. The example below extracts the attributes name and salary from the relation \(Employee\).

\(\pi_{name,salary}(Employee)\)

In general, the projection operation on any relation \(R\) is denoted by

\(\pi_{<attribute list>}(R)\)

where is a comma-separated list of attributes from the relation \(R\).

The result of a projection is a relation having a degree equal to the number of attributes in the <attribute list> and is always less than or equal to the degree of \(R\).

Duplicates

If the attribute list includes non-key attributes of \(R\), then it is likely that duplicate tuples might be part of the resultant relation. However, since relational algebra treats relations as sets, and because sets cannot contain duplicates, duplicates resulting from a projection are automatically eliminated. While this is allowed in SQL, it is not allowed in relational algebra.

Example: Projection

The relational algebra expression below selects the name and salary attributes for all tuples from the relation \(Employee\) where the value of the \(salary\) attribute is greater than or equal to 250,000.

\(\pi_{name,salary}(\sigma_{salary \geq 250000}(Employee))\)

In SQL, a projection is expression by specifying the column names as part of the SELECT clause. For example, the above projection of a selection can be expressed in SQL as:

SELECT DISTINCT name, salary 
  FROM Employee 
 WHERE salary >= 250000;

Note that if the keyword DISTINCT were not present, then duplicates could result. There is no need for such an operation in the relational algebra.

Properties of Projection

Similar to selection, the projection operator is also unary – it applies to a single relation. The number of tuples of the resultant relation (its cardinality) is always equal to the relation to which projection is applied. Furthermore, its degree is always less than or equal to the original relation. If the projection list is a superkey of \(R\), then the cardinality of the projection is equal to the cardinality of the relation to which projection is applied.

Projection is not commutative, i.e.,

\(\pi_{<{att list}_1>}(\pi_{<{att list}_2>}(R)) \neq \pi_{<{att list}_2>}(\pi_{<{att list}_1>}(R))\)

However, if \({att list}_2 \subseteq {att list}_1\), then

\(\pi_{<{att list}_1>}(\pi_{<{att list}_2>}(R)) = \pi_{<{att list}_1>}(R)\)

\(\sigma_{<C_1>}(\sigma_{<C_2>}(R)) = \sigma_{<C_1> \land <C_2>}(R)\)

Generalized Projection

The generalized form of projection allows functions on attributes to be included and is expressed as:

\(\pi_{F_1,F_2,...,F_j}(R)\),

where each \(F_k\) is a function over the attributes in the relation \(R\) and may include operations and scalar values.

For example,

\(\pi_{upper(name),salary/12}(Employee)\),

where \(upper()\) should be interpreted a user defined function (in this example returning its argument in all upper case letters).

Rename

While relational algebra expressions can be cascaded (i.e., nested), this results is potentially long and unwieldy expressions. Alternatively, expressions can be decomposed into a sequence of individual expressions and intermediary resultant relations and attributes can be assigned new names using the rename operation. The rename operation is denoted with the Greek letter \(\rho\). Rename is a unary operation and takes the general form:

\(\rho_{S(B_1,B_2,\ldots,B_n)(R)}\)

This renames the relation \(R\) to \(S\) and renames each attribute \(A_i\) of \(R\) to \(B_i\).

The example below, selects the name and salary of each employee earning more than 250,000 and renames the resultant relation to \(Salaries\) and the attributes to \(FullName\) and \(Salary\).

\(\rho_{Salaries(FullName,Salary)}(\pi_{name,salary}(\sigma_{salary \geq 250000}(Employee)))\)

Expressing the above as a series of relational algebra expression, we would get:

\(\rho_{E}(\sigma_{salary \geq 250000}(Employee))\)

\(\rho_{S}(\pi_{name,salary}(E))\)

\(\rho_{Salaries(FullName,Salary)}(S)\)

Both of the approaches are correct and the use is generally a personal preference. Often, using rename operations can make the expression simpler to follow.

Rename is expressed in SQL using the keyword AS.

SELECT DISTINCT name AS FullName, salary AS Salary
  FROM Employee E
 WHERE E.salary >= 250000;

Cartesian Product

Cartesian product (also known as cross product or cross join) is a binary operation on two relations, denoted by \(\times\), although other symbols are sometimes used. The relations do not have to be union compatible, i.e., they can be of differing degrees and cardinality. This operation produces a new relation in which each tuple from one relation is combined with every tuple of the other relation.

In general, the resultant relation of the operation \(R \times S\), where \(R(A_1,A_2,\ldots,A_n)\) and \(S(B_1,B_2,\ldots,B_m)\), is a relation of degree \(n+m\) and cardinality of \(|R| \times |S|\).

It is noteworthy that Cartesian product is not a useful operation but rather that it is the foundation for joins of tuples in multiple tables.

The Cartesian product is best understood with an example. Assume that you have the following relations (tables): Employee and City. The primary key of \(Employee\) is \(eid\), while the primary key of \(City\) is \(cid\). The city in which an employee works is indicated through the foreign key attribute \(cid\) in \(Employee\). For example, “Peter Maffei” works in city with ID 100, which is, according to the \(City\) relation, “Boston”.

Relation: Employee

eid	name	salary	cid
1107	Peter Maffei	123763	100
1009	Nina Hagen	375000	200
1432	Udo Lindenbergh	978000	200
3309	Doris Day	1250000	100

Relation: City

cid	city	country
100	Boston	USA
200	Berlin	Germany

Then the Cartesian product of the two relations would be the relation:

eid	name	salary	cid	cid	city	country
1107	Peter Maffei	123763	100	100	Boston	USA
1107	Peter Maffei	123763	100	200	Berlin	Germany
1009	Nina Hagen	375000	200	100	Boston	USA
1009	Nina Hagen	375000	200	200	Berlin	Germany
1432	Udo Lindenbergh	978000	200	100	Boston	USA
1432	Udo Lindenbergh	978000	200	200	Berlin	Germany
3309	Doris Day	1250000	100	100	Boston	USA
3309	Doris Day	1250000	100	200	Berlin	Germany

Notice how every tuple (row) in the first table is combined with every tuple in the second table, even when this is not correct. After all “Peter Maffei” works in the Boston office and not in the Berlin office.

The SQL statement below has the same outcome: a cross join. Notice the lack of a WHERE clause.

SELECT *
  FROM Employee E, City c

Join

A join is an operation where a selection occurs after a Cartesian product operation. There are five types of join operations: cross join (Cartesian Product), equi-join (inner join), outer join, left outer join, and right outer join.

There is also a natural join often defined by databases, but that is actually simply an equi-join on matching attributes in the two relations. Additionally, there is a more general form of an equi-join called a theta join.

Equi-Join

The equi-join (also often called an inner join) operation is the most common and is often simply called a join. It is denoted by the symbol \(\bowtie\), is used to combine related tuples from two relations in a cross join (Cartesian product).

An equi-join has the general form:

\({R_1}{\bowtie}_{<join\;condition>}{R_2}\)

The \(<join\;condition>\) are the matching attributes in the two operand relations. For example, assume

\(R_1(pk_{R_1},a_1,a_2,fk)\), where \(fk\) is a foreign key referencing \(pk_{R_2}\), and \(R_2(pk_{R_2},a_1,a_2)\),

then a join of the two relations is expressed as

\({R_1}{\bowtie}_{{R_1}.fk={R_2}.{pk_{R_2}}}{R_2}\)

The degree of the new relation resulting from an equi-join is \(degree(R_1) + degree(R_2)\) and the cardinality is at most \(|R_1| \cdot |R_2|\).

An equi-join matches tuples of the two relations based on equality of the attributes in the join condition. This is in contrast to a theta join which allows any kind of logical condition.

In order to combine related tuples, one of the tables must have the primary key of the other as an attribute (called a foreign key). In a one-to-one relationship, it does not matter which relation has the foreign key; however, in a one-to-many relationship, the relation on the “side of the one” has the foreign key to the “side of the many”.

So, for the example below, a city has many employees working in it, while an employee can only work in one city. Consequently, the foreign key goes on the side of the \(Employee\) relation (the side of the “one”). The ERD below illustrates this relationship.

Employee-City ERD

The primary key and foreign key attribute do not need to have matching names, although it is customary to do so.

Natural Join

A natural join is the same as an equi-join except that the join attributes are not explicitly stated. Instead the join is performed by matching attributes that have the same name in the two relations. A natural join is commonly denoted with the symbol \(*\), although the equi-join without a condition is often interpreted as a natural join, e.g., \(\bowtie\).

Theta Join

A theta join is the similar to an equi-join except that the join conditions can be any condition and are not restricted to equality \(=\). A theta join is commonly denoted with the symbol \(\theta\).

Outer Join

Outer joins are an extension to relational algebra that are necessary for expressing some kinds of queries. In an equi-join or theta join any tuples that do not meet the join condition are omitted from the result. In addition, tuples with NULL values in an attribute that is part of the join condition (e.g., a foreign key with a value of NULL) are also not included in the result. This type of join is often called an inner join.

Including some tuples that do not meet the join conditions in the result is called an outer join and comes in three flavors: full outer join, left outer join, and right outer join. Only the left outer join is actually needed as the other two can be expressed with the former.

A left outer join of relations \(R\) and \(S\) (\(R \; {{}_*{\bowtie}} \; S\))², is defined as all tuples that meet the join condition plus all tuples in \(R\) (the left relation) for which no matches are found; missing attributes are padded with NULL.

For example, consider the two relations \(Employee\) and \(City\) below, where employee Doris Day does not have an assigned city (perhaps because she is remotely working):

Relation: Employee

eid	name	salary	cid
1107	Peter Maffei	123763	100
1009	Nina Hagen	375000	200
1432	Udo Lindenbergh	978000	200
3309	Doris Day	1250000	NULL

Relation: City

cid	city	country
100	Boston	USA
200	Berlin	Germany
300	Sao Paulo	Brazil
500	Toronto	Canada

Then the result of \({Employee} \; {{}_*{\bowtie}} \; {City}\):

eid	name	salary	cid	cid	city	country
1107	Peter Maffei	123763	100	100	Boston	USA
1009	Nina Hagen	375000	200	200	Berlin	Germany
1432	Udo Lindenbergh	978000	200	200	Berlin	Germany
3309	Doris Day	1250000	NULL	NULL	NULL	NULL

A right join of is the same as a left outer join with the operands in reverse order. In other words, if \(R \; {{}_*{\bowtie}} \; S\) is a left outer join of \(R\) and \(S\), then \(S \; {{}_*{\bowtie}} \; R\) is a right outer join of \(R\) and \(S\). A full outer join is simply the union of the left and a right outer join of a pair of relations.

Set Operations

An important collection of relational algebra operations are standard mathematical operations on sets, including union, intersection, difference, and division.

All set operations are binary and all relation operands must be union compatible.

Set Union

A union of two relations \(R\) and \(S\) is simply the combination of the tuples of two relations into a single relation with duplicates removed. It is denoted by the standard union operator \(R \cup S\).

For example, assume you have two relations \(Employee\) and \(Constractor\) defined as follows.

Employee
{(1107,"Peter Maffei",123763), 
 (1009,"Nina Hagen",375000), 
 (1432,"Udo Lindenbergh",978000), 
 (3309,"Doris Day",1250000)}

Contractor
{(9701,"Sarah Connor",575), 
 (9702,"Dieter Bohlen",250), 
 (9957,"Hans Zimmer",375)}

The union \({Employee} \cup {Contractor}\) of the two union-compatible relations is

{(1107,"Peter Maffei",123763), 
 (1009,"Nina Hagen",375000), 
 (1432,"Udo Lindenbergh",978000), 
 (3309,"Doris Day",1250000),
 (9701,"Sarah Connor",575), 
 (9702,"Dieter Bohlen",250), 
 (9957,"Hans Zimmer",375)}

Note that the domains of the attributes must be the same, although it may not often be truly compatible. For example, in the above relations, the third attribute is a number. However, in \(Employee\) the value is an annual salary, while for \(Contractor\) it is an hourly fee. So, it may make more sense to write the union as

\({Employee} \cup {\pi_{oid,name,fee*2080}(Contractor)}\).¹

Note that union is commutative, i.e., \(R \cup S = S \cup R\).

Set Difference

The set difference of two relations \(R\) and \(S\), denoted by \(R-S\), is a relation that includes all tuples that are in \(R\) but not in \(S\).

For example, consider the two relations \(Employee\) and \(Manager\):

Employee
{(1107,"Peter Maffei",123763), 
 (1009,"Nina Hagen",375000), 
 (1432,"Udo Lindenbergh",978000), 
 (3309,"Doris Day",1250000)}

Manager
{(1107,"Peter Maffei",123763), 
 (3310,"Peter Alexander",105000)}

Then, the result of \(Employee - Manager\) is:

{(1009,"Nina Hagen",375000), 
 (1432,"Udo Lindenbergh",978000),
 (3309,"Doris Day",1250000)}

Note that set difference is not commutative, i.e., \(R - S \neq S - R\).

Set Intersection

The set intersection of two relations \(R\) and \(S\), denoted by \(R \cap S\), is a relation that includes all tuples that are in both \(R\) and \(S\).

For example, consider the two relations \(Employee\) and \(Manager\):

Employee
{(1107,"Peter Maffei",123763), 
 (1009,"Nina Hagen",375000), 
 (1432,"Udo Lindenbergh",978000), 
 (3309,"Doris Day",1250000)}

Manager
{(1107,"Peter Maffei",123763), 
 (3310,"Peter Alexander",105000)}

Then, the result of \(Employee \cap Manager\) is:

{(1107,"Peter Maffei",123763)}

Note that union is commutative, i.e., \(R \cap S = S \cap R\).

Set Division

Set division is an operation that can be expressed with projection and join and is thus not needed. It is also not available in standard SQL, although some databases support extended SQL operators, including set division.

Complete Set of Operations

The set of foundational relational algebra operations are \(\{\sigma,\pi,\cup,\rho,-,\times\}\) and form a complete set, which means that all other relational algebra operations can be expressed by a combination of these. For example, equi-join can be expressed as:

\(R\bowtie_{\text{<condition>}}S=\sigma_{\text{<condition>}}(R \times S)\)

However, the additional operations simplify expressions and are useful, although not strictly necessary.

Aggregation Functions and Grouping

While basic relational algebra does not support grouping or mathematical aggregate functions, they are important for expressing queries. Therefore, extended relational algebra introduces an aggregate function operation, often denoted with \({}_{\text{<grouping>}}\mathfrak{F}_{\text{<functions>}}(R)\). The functions are one or more of COUNT (number of occurrences), SUM (summation), AVG (average or mean), MIN (minimum), and MAX (maximum). Either of the <grouping> and <functions> clauses are optional, although at least one of the two must be present.

If there is no grouping clause, then the aggregation applies to all tuples of the relation and the result is a single tuple.

If the functions clause is not present, then the resultant relation is a list of groups.

Note that duplicates are not removed; so if duplicates may result, then a projection must be applied subsequent to the aggregate function.

The result of an aggregate function is always a relation, albeit, in some cases, a relation of degree 1 and cardinality of 1.

For example, the extended relational algebra expression below calculates the average and maximum salaries for employees by country.

\({}_{<cid>}\mathfrak{F}_{<AVG(salary),MAX(salary)>}(Employee)\)

While \(\mathfrak{F}\) is commonly used for aggregation functions, we can also use \(\mathfrak{G}\) to express an explicit grouping. This is not strictly necessary as any “grouping” column to the left of either \(\mathfrak{F}\) or \(\mathfrak{G}\) implies a grouping.

So, the expression below is the same as above

\({}_{<cid>}\mathfrak{G}_{<AVG(salary),MAX(salary)>}(Employee)\)

If we do not want to group, then we omit the grouping columns on the left, i.e.:

\(\mathfrak{F}_{<AVG(salary),MAX(salary)>}(Employee)\)

calculate the average and maximum of all salaries (without any grouping).

Note that the grouping columns are always part of the result set.

Perhaps it helps if we consider the equivalent SQL statements for these expressions:

\({}_{<cid>}\mathfrak{G}_{<AVG(salary),MAX(salary)>}(Employee)\) is equivalent to this SQL expression:
```
select cid, avg(salary), max(salary)
  from Employee
 group by cid;
```
\(\mathfrak{F}_{<AVG(salary),MAX(salary)>}(Employee)\) is equivalent to this SQL expression:
```
select avg(salary), max(salary)
 from Employee
```

Practice Problems

Some of the practice problems refer to the following relations:

Relation: Employee (eid, name, salary, cid)

eid	name	salary	cid
1107	Peter Maffei	123763	100
1009	Nina Hagen	375000	200
1432	Udo Lindenbergh	978000	200
3309	Doris Day	1250000	NULL

Relation: City (cid, city, country)

cid	city	country
100	Boston	USA
200	Berlin	Germany
300	Sao Paulo	Brazil
500	Toronto	Canada

Both union and full outer join contains all tuples from both relations. What is the difference between the two?
What is the result of a right outer join between \(Employee\) and \(City\), i.e., \({Employee} \; {{\bowtie}_*} \; {City}\)?
What is the result of a right outer join between \(Employee\) and \(City\), i.e., \({Employee} \; {{}_*{\bowtie}_*} \; {City}\)?
What is the result of a natural join between \(Employee\) and \(City\), i.e., \({Employee} \; * \; {City}\)?
What is the result of \(Employee \cup City\)?
What is the result of \(Employee \cap City\)?
What is the result of \(\pi_{cid}(Employee) \cup \pi_{cid}(City)\)?
What is the result of \({}_{<cid>}\mathfrak{G}_{<COUNT(eid)>}(Employee)\)?

Summary

Relational algebra helps define queries in an abstract manner. Although relational algebra is powerful enough for most practical purposes, there are some simple and natural operators on relations that cannot be expressed by relational algebra. On such operation is a recursive closure.

Relational Algebra vs Tuple Relational Calculus

Relational Algebra and Tuple Relational Calculus are both formal languages associated with database systems, particularly focusing on describing queries on relational databases. They serve similar purposes but differ fundamentally in their approach and structure. Any discussion of relational algebra must include a discussion of the contrast between it and alternatives.

As we have seen in this lesson, relational algebra is a procedural query language, which means it explicitly defines a sequence of operations to perform to retrieve the desired result. It uses operators (such as select, project, join, union, and intersection) to manipulate relations (tables) and produce new relations as results. The operations in relational algebra are set-based, making it a powerful tool for describing queries because the operations are well-defined and their sequence clearly specifies the processing order. Recall how queries are constructed in relational algebra using the fundamental operations of:

Selection (σ): Filters rows based on a specified condition.
Projection (π): Selects certain columns from the table.
Join: Combines rows from two or more tables based on a related column.
Union, Intersection, Difference: Performs set operations on tuples in the tables.

In contrast, tuple relational calculus is a non-procedural query language, which means that instead of specifying how to retrieve the data, it focuses on describing what data to retrieve. It uses first-order logical formulas to express queries—in other words, it states the requirements for the data to be retrieved from the database without specifying the method to obtain it. Tuple relational calculus can be seen as more declarative, where you describe the properties of the required result without explicitly stating the algorithm or process to get there. A typical expression in tuple relational calculus involves:

Variables: Represent tuples in the universe of the database.
Formulas: Logical statements that specify conditions on the variables.
Quantifiers (Existential and Universal): Used to express queries involving some or all possible tuples.

In practical applications, especially in the design and implementation of most database systems, relational algebra is generally preferred over tuple relational calculus. Here are a few reasons why:

Explicit Operations: Relational algebra’s procedural nature makes it more straightforward for implementing query optimization techniques. Since the operations are explicitly stated, database systems can rearrange and optimize the sequence of operations efficiently to improve query performance.
Foundation for SQL: SQL, the most widely used database query language, is largely based on relational algebra. This makes relational algebra more practical and directly applicable to real-world database systems.
Ease of Understanding and Use: For many users, especially those involved in database management and optimization, understanding and controlling the exact sequence of operations on data is crucial. Relational algebra provides this control and clarity.

Tuple relational calculus, while equally powerful in terms of expressiveness (anything that can be expressed in one can be expressed in the other), is more theoretical in nature and finds its uses primarily in academic contexts or in the theoretical foundation of data retrieval.

Thus, while both relational algebra and tuple relational calculus are essential for understanding the underpinnings of query mechanisms in relational databases, relational algebra’s procedural nature and its direct influence on the structure of SQL make it more preferred for practical database design and querying.

Summary of Relational Algebra

Operation	Definition	Notation
SELECT	Selects all tuples from a relation \(R\) that satisfy the selection conditions.	\(\sigma_{<selection\;conditions}(R)\)
PROJECT	Results in a subset of attributes from a relation \(R\), with all duplicate tuples removed.	\(\pi_{<attribute\;list>}(R)\)
EQUI-JOIN	Results in an combination of tuples from \(R_1\) and \(R_2\) that satisfy the equality join condition.	\({R_1} \bowtie_{<join\;conditions>} {R_2}\)
OUTER JOIN
NATURAL JOIN	The same as an equi-join except that the join condition is not explicitly stated by rather assumes an implicit join condition based on attributes having the same name in both relations \(R_1\) and \(R_2\).	\({R_1} * {R_2}\) or \({R_1} \bowtie {R_2}\)
THETA JOIN	Results in an combination of tuples from \(R_1\) and \(R_2\) that satisfy the any join condition.	\({R_1} \theta_{<join\;conditions>} {R_2}\)
RENAME	Renames attributes and/or relations.	\(\rho_{S(B_1,B_2,\ldots,B_n)(R)}\)
PRODUCT	Results in a relation that has all combinations of all tuples of \(R_1\) with \(R_2\) and the resultant relation has all attributes of \(R_1\) and \(R_2\).	\(R_1 \times R_2\)
UNION	Results in a relation that has all tuples that are either in \(R_1\) or \(R_2\) or are both in \(R_1\) and \(R_2\), with duplicates removed. Assumes that \(R_1\) and \(R_2\) are union compatible.	\(R_1 \cup R_2\)
INTERSECTION	Results in a relation that has all tuples that are in both \(R_1\) and \(R_2\), with duplicates removed. Assumes that \(R_1\) and \(R_2\) are union compatible.	\(R_1 \cap R_2\)
DIFFERENCE	Results in a relation that has all tuples that are in \(R_1\) but not in \(R_2\). Assumes that \(R_1\) and \(R_2\) are union compatible.	\(R_1 - R_2\)

Files & Resources

All Files for Lesson 60.502

References

Codd, E. F. (1970). A relational model of data for large shared data banks. Communications of the ACM, 13(6), 377-387. PDF

RelaX - Relational Algebra Calculator. Universität Innsbruck

Footnotes

¹The annual salary equivalent of an employee in the US is calculated by multiplying the hourly rate by (40 * 52) = 2080 (the number of hours per year).

²There are various symbols in use for left, right, and full outer joins, but some are difficult to represent in LaTeX or HTML. Alternative symbols include *=*, and ⟖ (a combination of two characters).

Errata

Let us know.

60.502Relational Algebra

Martin Schedlbauer, PhD

2024-11-07

Objectives

Overview

Queries and Query Languages

Set Mathematics

Logical Operations

Logical Operators

Exercise

Question

Answer

Relations and Tuples

Relational Algebra

Union Compatibility

Boolean Expressions

Selection

Example: Selection

Properties of Selection

Exercise

Question

Answer

Projection

Duplicates

Example: Projection

Properties of Projection

Generalized Projection

Rename

Cartesian Product

Join

Equi-Join

Natural Join

Theta Join

Outer Join

Set Operations

Set Union

Set Difference

Set Intersection

Set Division

Complete Set of Operations

Aggregation Functions and Grouping

Practice Problems

Summary

Relational Algebra vs Tuple Relational Calculus

Summary of Relational Algebra

Files & Resources

References

Footnotes

Errata

60.502
Relational Algebra