Objectives

Upon completion of this lesson, you will be able to:

  • list the normal forms
  • apply the rules of normalization
  • design relational databases

Introduction

Database normalization is a process used in relational database design to reduce data redundancy and improve data integrity. By decomposing a table into less redundant (and smaller) tables and defining relationships among them, it minimizes certain kinds data anomalies including insert, update, and delete anomalies.

While normalization reduces data redundancy and improves integrity, it is not always the optimal approach for every situation. Sometimes, for reasons like query performance optimization, databases may intentionally be kept in a lower normal form.

It is also worth noting that normalization is typically done during the design of the database. Once the database has been implemented and is in use, changing the normalization level can be a complex (and often quite difficult and practically impossible) task. So, industrial databases often become (by accident and through entropy) de-normalized over time.

Naturally, normalization is applicable to relational databases only and is not meaningful for non-relational (NoSQL) databases as they are based on a non-relational data model.

You may find the tutorial below helpful and it might frame some of the more theoretical discussion later in this lesson.

Normalization

Normalization is a systematic process of organizing the attributes and relations (aka tables) of a relational database to reduce data redundancy and improve data integrity, albeit at generally requiring more relations and additional joins when retrieving data.

The primary objective of normalization is to create well-structured relations (tables) that are free from insertion, update, and deletion anomalies, which can occur when data is redundantly stored across multiple locations. For example, if an address is stored more than once, then an update to the address would have to occur in all instances of the address and omitting a single update would result in data integrity violations and incorrect query results.

The process of normalization generally involves decomposing larger, more complex relations into smaller, simpler ones while preserving relationships between the data. Any decomposition due to normalization is loss-less.

Normalization is based on the concept of functional dependencies and proceeds through a series of stages known as normal forms. The most common normal forms are First Normal Form (1NF), Second Normal Form (2NF), Third Normal Form (3NF), and Boyce-Codd Normal Form (BCNF). Each normal form adds additional constraints to ensure that the structure of the database eliminates potential anomalies.

It is worth noting that a database that is well-designed and based on a sound conceptual model is most often already in 3NF. In practice, database designers verify that there are no redundancies but they don’t commonly define functional dependencies and normal forms formally, but rather do so informally. An experienced practitioner can ascertain the normal forms that a relation meets through inspection.

Two additional normal forms, known as Fourth Normal Form (4NF) and Fifth Normal Form (5NF), have been introduced over time. These forms address more advanced dependencies: 4NF deals with the concept of multivalued dependencies, while 5NF is based on the concept of join dependencies.

The Problem Normalization Solves

In unnormalized databases, data redundancy can lead to several problems:

  1. Insertion anomalies: Adding new data may require duplicating information. For instance, if student and course data are stored in the same table, inserting a new course might require duplicating student data resulting in increased storage needs, larger indexes, and slower query performance.

  2. Update anomalies: Changing data in one place may require changes in many places, increasing the risk of inconsistencies. For example, if a student changes their address, every instance of the student’s address in multiple rows would need to be updated.

  3. Deletion anomalies: Deleting data can inadvertently remove other important data. For instance, if a student withdraws from their only course, removing the course could also remove their personal information from the database.

By normalizing the database, we can eliminate these anomalies, ensuring that each piece of information is stored in only one place.

Additionally, in a well-designed database, attributes that frequently contain null values (i.e., there exists no meaningful value) are generally avoided. When null values are necessary, they should be used only in rare or exceptional cases. This approach helps minimize the issues associated with null values, such as ambiguity or incomplete data, ensuring that the database maintains a higher level of consistency and integrity. Furthermore, SQL queries involving potential null values are more complex as null values are treated differently in SQL than empty values.

The Process of Normalization

Normalization is carried out by progressing through a series of normal forms, each of which places stricter rules on how functional dependencies are managed within a relation. Each subsequent normal form resolves issues not addressed by the previous one. Normal forms are stacked in that a higher normal form requires that all lower normal forms are met. Higher normal forms are achieved by decomposing relations.

In smaller applications, it might be practical for a database designer to manually decompose a relation schema. However, for large, real-world databases, which are typically highly complex, direct decomposition becomes more challenging because there are multiple ways a relation schema can be decomposed. Additionally, in some cases, decomposition may not even be necessary. As a result, determining whether a relation schema should be decomposed, and if so, deciding which decomposition is more optimal, requires a structured approach. To provide such guidance, several normal forms have been defined. These normal forms are based on the concept of functional dependencies, which play a critical role in the design of relational schemas.

Decomposition

In the process of decomposition, it is essential to ensure that each attribute from the original relation \(R\) appears in at least one of the decomposed relation schemas \(R_i\). This guarantees that no attributes are lost during the decomposition. Formally, this can be expressed as:

\[ \bigcup R_i = R \]

This condition is known as the attribute preservation property of decomposition, which ensures that all attributes from the original relation are retained in the decomposed relations, preserving the completeness of the data structure.

First Normal Form (1NF)

A relation is in First Normal Form (1NF) if it meets two conditions:

  1. All attributes must contain atomic (indivisible) values. This means that each attribute in the table can only hold a single value.

  2. The relation must have no repeating groups of attributes or tuples. Each row must represent a single, distinct entity or concept.

For example, consider a table of students and their enrolled courses:

StudentID StudentName Courses
1 Alice Math, History
2 Bob Science, Math

This table is not in 1NF because the Courses attribute contains multiple values for each student. To bring it into 1NF, we would need to split the courses into separate rows, as follows:

StudentID StudentName Course
1 Alice Math
1 Alice History
2 Bob Science
2 Bob Math

Now, the table is in 1NF, as each attribute contains atomic values and there are no repeating groups but some data is now stored redundantly which is resolved by a higher normal form.

Second Normal Form (2NF)

Second Normal Form (2NF) is built upon the concept of full functional dependency. A relation is in Second Normal Form (2NF) if:

  1. It is in 1NF.

  2. All non-prime attributes (attributes that are not part of the primary key or any candidate key) are fully functionally dependent on the entire primary key, meaning there are no partial dependencies.

A partial dependency occurs when a non-prime attribute (also known as a non-key attributes) is functionally dependent on part of a composite candidate key rather than the entire candidate key. To address partial dependencies, we decompose the table into smaller relations. Functional dependencies are explained in more detail in Functional Dependencies.

Consider the following relation, which is in 1NF but not in 2NF:

StudentID Course InstructorName
1 Math Dr. Smith
1 History Dr. Adams
2 Science Dr. Brown
2 Math Dr. Smith

Here, the primary key is the composite key \((\text{StudentID}, \text{Course})\), but the InstructorName attribute is functionally dependent only on Course, not on the entire composite key. This is a partial dependency.

To bring this table into 2NF, we can decompose it into two tables:

  1. A table that assigns instructors to courses:

    Course InstructorName
    Math Dr. Smith
    History Dr. Adams
    Science Dr. Brown
  2. A table that records student enrollments:

    StudentID Course
    1 Math
    1 History
    2 Science
    2 Math

Now, each table is in 2NF, as all non-prime attributes are fully functionally dependent on the entire primary key.

Note that a relation with a non-composite primary key (i.e. a primary key consisting of a single attribute) is, by definition, in 2NF.

Third Normal Form (3NF)

A relation is in Third Normal Form (3NF) if:

  1. It is in 2NF.

  2. No transitive dependencies exist. A transitive dependency occurs when a non-prime attribute is functionally dependent on another non-prime attribute. A non-prime attribute is an attribute that is not part of any candidate key of a relation.

To elaborate, a candidate key is a minimal set of attributes that can uniquely identify a tuple (row) in a relation, i.e., a potential primary key. A relation may have more than one candidate key: one of the candidate keys is designated as the primary key making all other candidate keys alternate keys. A prime attribute is any attribute that is part of a candidate key, while a non-prime attribute is any attribute that is not included in any candidate key of the relation.

Example

Consider a relation \(R\) with attributes \(\{A, B, C, D\}\) and the following candidate keys:

  • \((A, B)\) - \((C)\)

In this case:

  • \(A\), \(B\), and \(C\) are prime attributes because they are part of candidate keys.
  • \(D\) is a non-prime attribute because it is not part of any candidate key.

Non-prime attributes are the ones that tend to cause normalization issues, such as partial or transitive dependencies, and therefore are the focus when applying higher normal forms (like 2NF, 3NF, and BCNF).

Consider the following table, which is in 2NF but not in 3NF:

StudentID StudentName Course InstructorName
1 Alice Math Dr. Smith
1 Alice History Dr. Adams
2 Bob Science Dr. Brown
2 Bob Math Dr. Smith

Here, \(\text{StudentID} \rightarrow \text{StudentName}\) is a functional dependency, and \(\text{StudentName} \rightarrow \text{InstructorName}\) is a transitive dependency, meaning that InstructorName is indirectly dependent on StudentID through StudentName.

To remove the transitive dependency, we decompose the table into two:

  1. A table that stores student information:

    StudentID StudentName
    1 Alice
    2 Bob
  2. A table that stores course enrollments and instructor assignments:

    StudentID Course InstructorName
    1 Math Dr. Smith
    1 History Dr. Adams
    2 Science Dr. Brown
    2 Math Dr. Smith

Now, both tables are in 3NF, as there are no more transitive dependencies.

Boyce-Codd Normal Form (BCNF)

A relation is in Boyce-Codd Normal Form (BCNF) if:

  1. It is in 3NF.
  2. Every functional dependency \(X \rightarrow Y\) in the relation has \(X\) as a superkey.

BCNF is a stricter version of 3NF. While 3NF allows certain exceptions where non-superkey attributes can determine prime attributes, BCNF eliminates this exception entirely.

To demonstrate the difference between Third Normal Form (3NF) and Boyce-Codd Normal Form (BCNF), let’s work through an example.

Consider the following relation:

\[ R(\text{StudentID}, \text{CourseCode}, \text{InstructorName}) \]

Assume that this relation has the following functional dependencies:

  1. \(\text{StudentID}, \text{CourseCode} \rightarrow \text{InstructorName}\)
  2. \(\text{InstructorName} \rightarrow \text{CourseCode}\)

Step 1: Check if the Relation is in 3NF

For a relation to be in 3NF, one of two conditions must be true for each functional dependency \(X \rightarrow Y\):

  • \(X\) is a superkey.
  • \(Y\) is a prime attribute (an attribute that is part of a candidate key).

Let’s check each functional dependency:

  1. \(\text{StudentID}, \text{CourseCode} \rightarrow \text{InstructorName}\):
    Here, \((\text{StudentID}, \text{CourseCode})\) is a candidate key for the relation, and it uniquely identifies the InstructorName. Therefore, this dependency satisfies the conditions for 3NF because \((\text{StudentID}, \text{CourseCode})\) is a superkey.

  2. \(\text{InstructorName} \rightarrow \text{CourseCode}\):
    In this case, InstructorName determines CourseCode. However, InstructorName is not a superkey because it does not uniquely identify each tuple in the relation. Despite this, CourseCode is a prime attribute (since it is part of the candidate key \(\text{StudentID}, \text{CourseCode}\)). This satisfies the second condition for 3NF (non-superkey determining a prime attribute), so the relation is in 3NF.

Step 2: Check if the Relation is in BCNF

For a relation to be in BCNF, the condition is stricter: for every functional dependency \(X \rightarrow Y\), \(X\) must be a superkey.

Let’s check again:

  1. \(\text{StudentID}, \text{CourseCode} \rightarrow \text{InstructorName}\):
    This is fine because \((\text{StudentID}, \text{CourseCode})\) is a superkey.

  2. \(\text{InstructorName} \rightarrow \text{CourseCode}\):
    This functional dependency violates BCNF because InstructorName is not a superkey, even though it determines CourseCode. In BCNF, every determinant must be a superkey, which is not the case here.

The relation is in 3NF because it satisfies the looser condition for functional dependencies where non-superkeys can determine prime attributes. However, it is not in BCNF because the dependency \(\text{InstructorName} \rightarrow \text{CourseCode}\) violates the BCNF requirement that every determinant be a superkey.

Step 3: Decomposing the Relation to Achieve BCNF

To bring the relation into BCNF, we need to decompose it in such a way that all functional dependencies are preserved, and every determinant becomes a superkey.

  1. Create a new relation that isolates the dependency \(\text{InstructorName} \rightarrow \text{CourseCode}\): \[ R_1(\text{InstructorName}, \text{CourseCode}) \] This relation captures the fact that each InstructorName teaches a specific CourseCode.

  2. The remaining attributes form a second relation: \[ R_2(\text{StudentID}, \text{CourseCode}) \] This relation represents which students are enrolled in which courses.

So, now we have:

  1. InstructorCourse relation \(R_1(\text{InstructorName}, \text{CourseCode})\):

    InstructorName CourseCode
    Dr. Smith CS101
    Dr. Adams MATH101
  2. Enrollment relation \(R_2(\text{StudentID}, \text{CourseCode})\):

    StudentID CourseCode
    1 CS101
    2 MATH101

After the decomposition, both relations are in BCNF because: - In \(R_1\), \(\text{InstructorName}\) is a superkey, and it determines CourseCode. - In \(R_2\), \((\text{StudentID}, \text{CourseCode})\) is the composite superkey.

In short, the original relation \(R(\text{StudentID}, \text{CourseCode}, \text{InstructorName})\) is in 3NF but not in BCNF due to the functional dependency \(\text{InstructorName} \rightarrow \text{CourseCode}\), where InstructorName is not a superkey. By decomposing the relation into two smaller relations, \(R_1(\text{InstructorName}, \text{CourseCode})\) and \(R_2(\text{StudentID}, \text{CourseCode})\), we eliminate the violation, ensuring that both relations conform to BCNF.

Formal Definitions of Normal Forms

Formal definitions of the normal forms are based on two key concepts:

  1. Functional Dependencies
  2. Keys: Primary, Candidate, Super

Functional dependencies are a specific type of integrity constraint that extend the idea of keys. They are instrumental in the design of relational databases, helping to ensure that undesirable properties, such as redundancy or anomalies, are minimized. As a result, functional dependencies are fundamental in the creation of an effective and well-structured database schema.

Let’s first define functional dependencies and the various types of keys followed by formal definitions of the normal forms from 1NF through BCNF.

Functional Dependencies

A functional dependency (FD) is a formal relationship between two sets of attributes in a relation (or table) in a relational database. Formally, let \(R(A_1, A_2, ..., A_n)\) be a relation schema, and let \(X\) and \(Y\) be subsets of the attributes \(\{A_1, A_2, ..., A_n\}\). We say that X functionally determines Y (denoted \(X \rightarrow Y\)) if, for any two tuples \(t_1\) and \(t_2\) in the relation \(R\), whenever \(t_1\) and \(t_2\) agree on all the attributes of \(X\), they must also agree on all the attributes of \(Y\).

More formally, we say:

Given a relation \(R\), a functional dependency \(X \rightarrow Y\) holds if and only if for every pair of tuples \(t_1\) and \(t_2\) in \(R\), the following condition is satisfied:

\[ t_1[X] = t_2[X] \implies t_1[Y] = t_2[Y] \]

Where:

  • \(t_1[X]\) and \(t_2[X]\) denote the values of tuple \(t_1\) and tuple \(t_2\) on the attributes in \(X\).
  • \(t_1[Y]\) and \(t_2[Y]\) denote the values of tuple \(t_1\) and tuple \(t_2\) on the attributes in \(Y\).

The notation \(X \rightarrow Y\) is interpreted as “Y is functionally dependent on X” or “X determines Y.” In this context, \(X\) is referred to as the determinant, while \(Y\) is known as the dependent.

Intuitive Meaning

If \(X \rightarrow Y\), then the value of \(X\) uniquely determines the value of \(Y\). In other words, if two tuples in the relation have the same values for the attributes in \(X\), then they must also have the same values for the attributes in \(Y\).

In practical terms, functional dependencies are statements about the domain, i.e., they represent “business rules”. Functional dependencies are formal ways to express that the value of an attribute can uniquely determine the value of another attribute. While relationships are expressed in Entity-Relationship (ER) or UML Class Diagrams, functional dependencies are not expressed diagrammatically but rather as statements, either informally, formally through structured narratives1, or as functional dependencies.

Example

Consider a relation Student with the following attributes:

  • StudentID: A unique identifier for each student.
  • StudentName: The name of the student.
  • Major: The major field of study.

If every StudentID uniquely determines the StudentName, we can write this functional dependency as:

\[ \text{StudentID} \rightarrow \text{StudentName} \]

This means that if two students have the same StudentID, they must also have the same StudentName. Hence, \(\text{StudentID}\) functionally determines \(\text{StudentName}\).

Trivial vs Non-Trivial Dependencies

A functional dependency is called trivial if it holds true in all possible relations. Specifically, a functional dependency \(X \rightarrow Y\) is considered trivial if the set of attributes on the right-hand side \(Y\) is a subset of the set of attributes on the left-hand side \(X\). In other words, a dependency is trivial when the dependent attributes are already included in the determinant.

For instance,

  • \(A \rightarrow A\): This is trivial because the left-hand side and right-hand side are the same.
  • \(\{A, B\} \rightarrow A\): This is also trivial because \(A\) is part of the left-hand side \(\{A, B\}\).

Trivial dependencies don’t provide any new information about the relationship between attributes in a relation, as they are always satisfied by the structure of the data.

As another example, \(\{ \text{StudentID}, \text{CourseCode} \} \rightarrow \text{CourseCode}\) is trivial because the right-hand side is part of the left-hand side.

In a relation like Enrollment(StudentID, CourseCode, InstructorName), the dependencies \(\text{CourseCode} \rightarrow \text{CourseCode}\) and \(\{ \text{StudentID}, \text{CourseCode} \} \rightarrow \text{CourseCode}\) are examples of trivial dependencies.

On the other hand, non-trivial dependencies do not automatically hold in all relations and represent meaningful constraints in the database. These are the functional dependencies that reflect real relationships, like \(\text{CourseCode} \rightarrow \text{InstructorName}\). Non-trivial dependencies are important in practice because they define the essential integrity constraints that help maintain consistency in the database. However, in formal theory, both trivial and non-trivial dependencies must be considered when analyzing a schema.

Closure of a Set of Functional Dependencies

The closure of a set of functional dependencies refers to the complete set of all functional dependencies that can be logically inferred from a given set of functional dependencies. If you start with a set of functional dependencies \(F\), the closure of \(F\), denoted as \(F^+\), includes every functional dependency that can be derived using the dependencies in \(F\).

Formally, \(F^+\) is the set of all functional dependencies that can be inferred from \(F\) using rules of inference, such as Armstrong’s Axioms (explained below). The closure helps to determine all possible constraints that exist in a relation, even those not explicitly listed in \(F\).

The closure is useful when: - You want to check if a certain functional dependency \(X \rightarrow Y\) logically follows from a set of functional dependencies \(F\). - You need to find the minimal set of functional dependencies. - You want to check if a relation schema is in a certain normal form (e.g., 2NF, 3NF).

Armstrong’s Axioms

Armstrong’s Axioms are a set of inference rules that can be applied to derive the closure of a set of functional dependencies. They are sound (produce only valid dependencies) and complete (able to generate all possible dependencies from a given set). The three basic rules, along with some useful derived rules, are as follows:

  1. Reflexivity: If \(Y \subseteq X\), then \(X \rightarrow Y\).
    This rule states that if a set of attributes \(Y\) is a subset of \(X\), then \(X\) functionally determines \(Y\). For example, if \(X = \{A, B\}\) and \(Y = \{A\}\), then \(\{A, B\} \rightarrow \{A\}\) is valid.

  2. Augmentation: If \(X \rightarrow Y\), then \(XZ \rightarrow YZ\) for any \(Z\).
    This rule allows you to “add” attributes to both sides of a functional dependency without affecting the validity. If \(X \rightarrow Y\) holds, then combining \(X\) with additional attributes \(Z\) also leads to \(XZ \rightarrow YZ\).

  3. Transitivity: If \(X \rightarrow Y\) and \(Y \rightarrow Z\), then \(X \rightarrow Z\).
    This rule states that functional dependencies can be chained. If \(X\) determines \(Y\), and \(Y\) determines \(Z\), then \(X\) determines \(Z\).

Derived Rules

Using Armstrong’s Axioms, you can also derive additional inference rules, such as:

  1. Union: If \(X \rightarrow Y\) and \(X \rightarrow Z\), then \(X \rightarrow YZ\).
    This rule allows you to combine the right-hand sides of two functional dependencies with the same determinant.

  2. Decomposition: If \(X \rightarrow YZ\), then \(X \rightarrow Y\) and \(X \rightarrow Z\).
    This rule lets you break down a functional dependency with multiple attributes on the right-hand side into multiple simpler dependencies.

Example

Suppose you have the following set of functional dependencies for a relation:

\[ F = \{ \text{StudentID, CourseCode} \rightarrow \text{InstructorName}, \ \text{CourseCode} \rightarrow \text{Department} \} \]

The closure \(F^+\) would contain all the functional dependencies you can infer from \(F\). Using Armstrong’s Axioms, you might derive new functional dependencies such as:

\[ \text{StudentID, CourseCode} \rightarrow \text{Department} \]

This can be inferred by applying the transitivity rule, since: 1. \(\text{StudentID, CourseCode} \rightarrow \text{InstructorName}\) 2. \(\text{CourseCode} \rightarrow \text{Department}\)

By combining these, we conclude that \(\text{StudentID, CourseCode} \rightarrow \text{Department}\).

In short, the closure of a set of functional dependencies is the set of all dependencies that can be inferred from a given set, and Armstrong’s Axioms are the inference rules used to derive this closure. Together, these concepts play a critical role in reasoning about dependencies within relational schemas and ensuring a well-normalized database.

Closure of FDs

In practice, there’s often no need to compute the entire closure of a set of functional dependencies (FDs). Typically, what’s required is a specific subset of the closure, namely all FDs where a particular set of attributes, \(Z\), serves as the determinant. However, calculating the entire closure \(F^+\) is inefficient, since \(F^+\) can become very large. Additionally, computing \(F^+\) involves applying Armstrong’s Axioms repeatedly until no new FDs can be generated. Therefore, it’s more practical to have an algorithm that computes just the relevant subset of the closure without needing to generate all of \(F^+\).

For a given relation schema \(R\), a set of attributes \(Z\), and a set \(F\) of functional dependencies that hold on \(R\), the closure of \(Z\) under \(F\) (denoted \(Z^+\)) is the set of all attributes of \(R\) that are functionally dependent on \(Z\). The following algorithm efficiently computes \(Z^+\):

  1. Initialize \(Z^+\) to \(Z\) (i.e., start with \(Z\) itself).
  2. While there is a change in \(Z^+\), iterate over each functional dependency \(X \rightarrow Y\) in \(F\).
  3. For each dependency, if \(X \subseteq Z^+\) (meaning that the left-hand side \(X\) is a subset of the current closure), include the right-hand side \(Y\) in \(Z^+\).

This process continues until no further changes are made to \(Z^+\), at which point the algorithm terminates.

Consider a relation schema \(R(A, B, C, D, E, G, H)\) and a set of functional dependencies:

\[ F = \{ A \rightarrow B, \ BC \rightarrow DE, \ AEG \rightarrow H \} \]

We want to compute the closure \(\{A, C\}^+\) (i.e., the closure of the set \(\{A, C\}\)) under the functional dependencies in \(F\). Using the closure algorithm:

  1. Initialize the result \(Z^+\) to \(\{A, C\}\).
  2. During the first execution of the while loop, the inner loop runs over the three FDs:
    • For \(A \rightarrow B\), since \(A \subseteq Z^+\), we add \(B\) to \(Z^+\). Now, \(Z^+ = \{A, B, C\}\).
    • For \(BC \rightarrow DE\), since \(BC \subseteq Z^+\), we add \(D\) and \(E\) to \(Z^+\). Now, \(Z^+ = \{A, B, C, D, E\}\).
    • For \(AEG \rightarrow H\), no changes are made since \(AEG\) is not a subset of \(Z^+\).
  3. On the second execution of the while loop, the inner loop repeats but no new attributes are added, so the algorithm terminates. The final closure is \(\{A, C\}^+ = \{A, B, C, D, E\}\).

Beyond computing subsets of the closure, the closure algorithm has several other applications:

  1. Checking if a particular FD \(X \rightarrow Y\) is in the closure \(F^+\): Instead of computing the full closure \(F^+\), you can compute \(X^+\) using the closure algorithm and check if \(Y \subseteq X^+\).

  2. Testing if a set of attributes \(A\) is a superkey: You can compute \(A^+\) and verify if \(A^+\) contains all attributes of the relation \(R\). If it does, \(A\) is a superkey.

This algorithm provides a more efficient way to deal with functional dependencies, avoiding the need to compute the entire closure \(F^+\) when only specific information is required.

Keys

Keys in relational databases are necessary for maintaining data integrity, ensuring uniqueness of records, and connecting tables. They form the foundation of the relational model, allowing for efficient data retrieval and query optimization. Keys make it easier to understand and manage data. They enforce important constraints that maintain data validity and consistency, and often serve as the basis for database indexes in order to improve query performance. Proper use of keys is also essential for database normalization, which eliminates (or reduces) data redundancy and further enhances data integrity. Ultimately, the careful implementation of various key types helps create a robust, efficient, and reliable database system that can effectively store, manage, and retrieve data while maintaining its accuracy and relationships.

The video tutorial below provides an overview of the various types of keys encountered when working with databases. The remainder of this section provides formal definitions of the different types of keys.

Candidate Key

A candidate key is a set of attributes (or a minimal subset of attributes, to be more precise) in a relation \(R\) that uniquely identifies each tuple. Formally, a candidate key \(C \subseteq R\) must satisfy the following two conditions:

  1. Uniqueness: The set \(C\) functionally determines all attributes in the relation (i.e., \(C \rightarrow R\)).
  2. Minimality: No proper subset of \(C\) can functionally determine all attributes in the relation (i.e., \(C\) is minimal).

There can be multiple candidate keys in a relation. Generally one of them is chosen as the primary key. Each candidate key is capable of uniquely identifying the tuples in the relation.

Primary Key

A primary key is a specific candidate key chosen to uniquely identify each tuple (row) in a relation \(R\). Formally, a primary key is a minimal set of attributes \(P \subseteq R\) such that:

  1. The primary key functionally determines all attributes of the relation \(R\) (i.e., \(P \rightarrow R\)).
  2. There is no proper subset of \(P\) that can also functionally determine all attributes in the relation (i.e., \(P\) is minimal).

In other words, the primary key is both unique and irreducible. It ensures that no two rows in a table can have the same values for the primary key attributes, and it is the smallest possible combination of attributes that achieves this uniqueness.

Superkey

A superkey is a set of one or more attributes that can uniquely identify each tuple in a relation \(R\). Formally, a superkey \(S \subseteq R\) is defined as uniqueness, i.e., the set \(S\) functionally determines all attributes in the relation (i.e., \(S \rightarrow R\)).

However, unlike a candidate key, a superkey does not need to be minimal. A superkey can have extra attributes that are not necessary for uniqueness. All candidate keys are superkeys, but not all superkeys are candidate keys.

Relationships Between Primary Key, Candidate Key, and Superkey

  • Superkey: A superkey is any set of attributes that can uniquely identify tuples, but it may contain extraneous attributes.
  • Candidate Key: A candidate key is a minimal superkey, meaning it contains no extraneous attributes.
  • Primary Key: The primary key is one of the candidate keys selected to uniquely identify tuples in the relation.

For example, if a relation has attributes \(A, B, C\), and \(A\) alone can uniquely identify all tuples, then:

  • \(A\) is a candidate key.
  • \((A, B)\) is a superkey (but not a candidate key, since it includes an unnecessary attribute \(B\)).
  • \(A\) can be selected as the primary key.

First Normal Form (1NF)

A relation \(R\) is in First Normal Form (1NF) if it satisfies the following formal conditions:

  1. Atomicity: Every attribute in the relation \(R\) contains only atomic (indivisible) values. This means that each attribute holds a single, indivisible value rather than a set of values or composite structures.

  2. Uniqueness of Tuples: The relation contains no duplicate tuples, meaning that each tuple is uniquely identifiable (which is typically enforced by the existence of a primary key or candidate key).

  3. Absence of Repeating Groups: The relation does not allow any attribute to contain multiple values (i.e., it cannot have repeating groups of attributes or records).

Formal Definition of 1NF

Let \(R(A_1, A_2, ..., A_n)\) be a relation schema. \(R\) is in First Normal Form (1NF) if:

  1. For every attribute \(A_i \in \{A_1, A_2, ..., A_n\}\), the domain of \(A_i\) consists only of atomic values (i.e., there are no multi-valued or composite attributes).

  2. For every tuple \(t\) in the relation \(R\), the value of \(t[A_i]\) is a single atomic value for every attribute \(A_i\).

Example of a Relation Not in 1NF

Consider a table of students with their courses:

StudentID StudentName Courses
1 Alice {Math, History}
2 Bob {Science, Math}

Here, the attribute Courses is not atomic because it contains a set of values (i.e., multiple courses for each student). This violates the requirement for atomic values, so the relation is not in 1NF.

Example of a Relation in 1NF

To bring this relation into 1NF, we decompose the Courses attribute into individual rows, ensuring that each attribute contains only atomic values:

StudentID StudentName Course
1 Alice Math
1 Alice History
2 Bob Science
2 Bob Math

Now, the Course attribute contains atomic values, and the relation satisfies 1NF.

A relation \(R\) is in Second Normal Form (2NF) if it satisfies the following two conditions:

  1. The relation is in First Normal Form (1NF), meaning:
    • All attributes in the relation contain atomic values (i.e., each attribute contains indivisible values).
    • There are no repeating groups or multi-valued attributes in the relation.
    • Every tuple in the relation is unique (typically enforced via a primary key or candidate key).
  2. No partial dependencies: Every non-prime attribute in the relation is fully functionally dependent on the entire primary key, meaning that there are no functional dependencies where a non-prime attribute is dependent on only a part of a composite primary key.

Second Normal Form (2NF)

A relation is in Second Normal Form (2NF) if it is in First Normal Form (1NF) and all non-prime attributes are fully functionally dependent on the entire primary key.

Formal Definition of 2NF

Let \(R(A_1, A_2, ..., A_n)\) be a relation schema, and let \(K\) be a candidate key for \(R\). The relation \(R\) is in Second Normal Form (2NF) if and only if:

  1. The relation is in 1NF.
  2. For every functional dependency \(X \rightarrow Y\) in the relation, if \(Y\) is a non-prime attribute (i.e., not part of any candidate key), then \(X\) must be a superkey of the relation or a full composite key, i.e., \(X\) must include all attributes of any composite candidate key of the relation.

In simpler terms, every non-prime attribute must be fully functionally dependent on the entire primary or candidate key, not just a part of it.

Example of a Relation Not in 2NF

Consider a relation \(Enrollment(StudentID, CourseCode, InstructorName, Department)\) with the following functional dependencies:

  1. \((\text{StudentID}, \text{CourseCode}) \rightarrow \text{InstructorName}\)
  2. \(\text{CourseCode} \rightarrow \text{Department}\)

Here, \((\text{StudentID}, \text{CourseCode})\) is the composite primary key. The non-prime attributes are InstructorName and Department.

  • The attribute InstructorName is fully dependent on the composite key \((\text{StudentID}, \text{CourseCode})\), so this is acceptable.
  • However, Department is only dependent on CourseCode, which is part of the composite primary key but not the entire key. This represents a partial dependency since Department does not depend on the whole key \((\text{StudentID}, \text{CourseCode})\).
  • There are no other candidate keys.

Because of this partial dependency, the relation is not in 2NF.

Example of a Relation in 2NF

To convert this relation into 2NF, we decompose it to remove the partial dependency:

  1. Create a new relation to store the information about which department offers each course:

    CourseDepartment(CourseCode, Department):

    CourseCode Department
    CS101 Computer Science
    MATH201 Mathematics
  2. The original relation \(Enrollment\) is now reduced to:

    Enrollment(StudentID, CourseCode, InstructorName):

    StudentID CourseCode InstructorName
    1 CS101 Dr. Smith
    1 MATH201 Dr. Adams
    2 CS101 Dr. Smith

Now, both relations are in 2NF because there are no partial dependencies. The non-prime attributes are fully dependent on the respective primary keys: InstructorName is fully dependent on the composite key \((\text{StudentID}, \text{CourseCode})\), and Department is fully dependent on CourseCode in the new relation. There are no other candidate keys.

To summarize, a relation is in Second Normal Form (2NF) if it is in First Normal Form (1NF) and all non-prime attributes are fully functionally dependent on the entire candidate key. In practice, this means eliminating any partial dependencies where a non-prime attribute is dependent on part, but not all, of a composite primary key.

Third Normal Form (3NF)

A relation \(R\) is in Third Normal Form (3NF) if it satisfies the following two conditions:

  1. The relation is in Second Normal Form (2NF), meaning:
    • The relation is in First Normal Form (1NF), where all attributes contain atomic values, and there are no repeating groups.
    • Every non-prime attribute is fully functionally dependent on the entire primary key (i.e., there are no partial dependencies) and is fully dependent on any other candidate key.
  2. No transitive dependencies: Every non-prime attribute is non-transitively dependent on the primary key (or any other candidate key). In other words, for every functional dependency \(X \rightarrow Y\), at least one of the following must be true:
    • \(X\) is a superkey of the relation, or
    • \(Y\) is a prime attribute (an attribute that is part of a candidate key).

A transitive dependency occurs when a non-prime attribute depends on another non-prime attribute, which is itself dependent on the primary key.

Formal Definition of 3NF

Let \(R(A_1, A_2, ..., A_n)\) be a relation schema. The relation \(R\) is in Third Normal Form (3NF) if and only if:

  1. The relation is in 2NF.
  2. For every functional dependency \(X \rightarrow Y\) in the relation, one of the following conditions holds:
    • \(X\) is a superkey.
    • \(Y\) is a prime attribute (i.e., part of a key).
    • \(X \rightarrow Y\) is a trivial functional dependency, that is, \(Y \subseteq X\).

Example of a Relation Not in 3NF

Consider a relation \(Student(StudentID, StudentName, CourseCode, InstructorName)\) with the following functional dependencies:

  1. \(\text{StudentID} \rightarrow \text{StudentName}\)
  2. \((\text{StudentID}, \text{CourseCode}) \rightarrow \text{InstructorName}\)
  3. \(\text{CourseCode} \rightarrow \text{InstructorName}\)

Here, \((\text{StudentID}, \text{CourseCode})\) is the composite primary key, and StudentName and InstructorName are non-prime attributes.

While this relation is in 2NF (there are no partial dependencies), it is not in 3NF because there is a transitive dependency. Specifically, \(\text{CourseCode} \rightarrow \text{InstructorName}\) is a transitive dependency since InstructorName is dependent on CourseCode, and CourseCode is part of the primary key \((\text{StudentID}, \text{CourseCode})\).

Decomposing the Relation into 3NF

To eliminate the transitive dependency and bring the relation into 3NF, we decompose it as follows:

  1. Create a separate relation to store information about the course and the instructor:

    CourseInstructor(CourseCode, InstructorName):

    CourseCode InstructorName
    CS101 Dr. Smith
    MATH201 Dr. Adams
  2. The original relation \(Student\) is now reduced to:

    StudentEnrollment(StudentID, StudentName, CourseCode):

    StudentID StudentName CourseCode
    1 Alice CS101
    2 Bob MATH201

Now, both relations are in 3NF because:

  • In the StudentEnrollment relation, all non-prime attributes (StudentName) are fully dependent on the primary key \((\text{StudentID}, \text{CourseCode})\), and there are no transitive dependencies.
  • In the CourseInstructor relation, InstructorName is fully dependent on CourseCode, which is the primary key of that relation.
  • There are no other candidate keys.

To summarize, a relation is in Third Normal Form (3NF) if it is in Second Normal Form (2NF) and there are no transitive dependencies. This means that every non-prime attribute must be directly dependent on the primary key (or any other candidate key) and not on any other non-prime attributes. Transitive dependencies, where non-prime attributes depend on other non-prime attributes, must be eliminated by decomposing the relation into smaller, well-structured relations.

Boyce-Codd Normal Form (BCNF)

Boyce-Codd Normal Form (BCNF) is a more stringent version of Third Normal Form (3NF) in database normalization. Informally, a relation is in BCNF if every functional dependency has a determinant that is a superkey.

In simpler terms, BCNF ensures that if one set of attributes determines another set of attributes, the determining set must be capable of uniquely identifying every row in the table — that is, it must be a superkey. This rule prevents situations where non-key attributes can determine other attributes, which can lead to redundancy and potential update anomalies.

BCNF is a stricter version of Third Normal Form (3NF), so every relational that is in BCNF is, by definition, in 3NF. While 3NF allows a functional dependency if \(Y\) is a prime attribute (part of a candidate key), BCNF requires that the determinant \(X\) be a superkey in every functional dependency. While 3NF allows some flexibility by letting non-superkeys determine prime attributes (those that are part of a candidate key), BCNF removes that flexibility, enforcing a stricter condition that applies to every functional dependency.

Formal Definition of BCNF

Let \(R(A_1, A_2, ..., A_n)\) be a relation schema. \(R\) is in Boyce-Codd Normal Form (BCNF) if and only if for every functional dependency \(X \rightarrow Y\) in \(R\), the following condition holds:

  1. \(X\) is a superkey of the relation, meaning that \(X\) functionally determines all attributes in \(R\).

Example of a Relation in 3NF but Not in BCNF

Consider a relation \(ClassEnrollment(StudentID, CourseCode, InstructorName)\) with the following functional dependencies:

  1. \((\text{StudentID}, \text{CourseCode}) \rightarrow \text{InstructorName}\)
  2. \(\text{InstructorName} \rightarrow \text{CourseCode}\)

In this relation, \((\text{StudentID}, \text{CourseCode})\) is a candidate key, and InstructorName is a non-prime attribute.

This relation is in Third Normal Form (3NF) because there are no transitive dependencies (i.e., every non-prime attribute is either fully functionally dependent on a candidate key or prime). However, it is not in BCNF because of the second functional dependency \(\text{InstructorName} \rightarrow \text{CourseCode}\). In this case, InstructorName is not a superkey, yet it determines CourseCode, which violates the BCNF condition that the left-hand side of any non-trivial functional dependency must be a superkey.

Decomposing the Relation into BCNF

To bring the relation into BCNF, we decompose it as follows:

  1. Create a new relation to store information about which instructor teaches each course:

    InstructorCourse(InstructorName, CourseCode):

    InstructorName CourseCode
    Dr. Smith CS101
    Dr. Adams MATH201
  2. The original relation \(ClassEnrollment\) is reduced to:

    Enrollment(StudentID, CourseCode):

    StudentID CourseCode
    1 CS101
    2 MATH201

Now, both relations are in BCNF because:

  • In the InstructorCourse relation, the functional dependency \(\text{InstructorName} \rightarrow \text{CourseCode}\) is allowed, as InstructorName is now the primary key (and therefore a superkey) of this relation.
  • In the Enrollment relation, \((\text{StudentID}, \text{CourseCode})\) remains a composite key, and there are no functional dependencies violating the BCNF rules.

Key Difference Between 3NF and BCNF

The key difference lies in how they handle functional dependencies:

  • In 3NF, a non-superkey can determine a prime attribute (as long as it doesn’t introduce redundancy).
  • In BCNF, only superkeys are allowed to determine any attribute, whether it’s prime or not.

In short, a relation is in Boyce-Codd Normal Form (BCNF) if, for every functional dependency \(X \rightarrow Y\), the left-hand side \(X\) is a superkey. BCNF eliminates any dependencies where non-superkey attributes determine other attributes, even if the relation satisfies 3NF. Decomposition into BCNF may be required to eliminate these issues, resulting in a more rigorous form of normalization than 3NF, although in practice there often isn’t a material difference.

When Normalization Might Not Be Appropriate

Although normalization is important for reducing redundancy and preventing anomalies, there are situations where denormalization (the process of combining tables) might be more appropriate. Denormalization may be considered for performance optimization, particularly in databases where read operations far outnumber write operations.

For example, in large-scale web applications where data is frequently queried but rarely updated, combining tables into a denormalized form can reduce the number of joins needed for queries, improving performance. However, this comes at the cost of potentially introducing redundancy, leading to update anomalies and the need for more complex maintenance processes.

As an example, consider an e-commerce application where customer information is frequently queried alongside order information. Instead of storing customer and order data in separate normalized tables, the database might store customer details along with each order in a single denormalized table to speed up query response times.

While this reduces the need for joins and improves query performance, it increases the likelihood of insertion, update, and deletion anomalies. If a customer’s address changes, it must be updated across all their orders. If denormalization is chosen, careful attention must be paid to how anomalies will be handled.

Analytical databases (such as those used in data marts or data warehousing) often use “fact tables” that contain pre-computed and pre-joined data to speed up queries. These types of tables are decidedly not normalized, but because they are not updated frequently, this is an acceptable trade-off to faster analytical query performance.

Tutorial

In this recording of an online recitation, Khoury Boston’s Prof. Schedlbauer explains the concept and need for normalization in addition to the various normal forms and their definitions up to Boyce-Codd Normal Form (BCNF).

Slide Deck: Normal Forms and Normalization

Summary

Normalization is an essential phase in relational database design. It is intended to reduce data redundancy and improve data integrity by ensuring that functional dependencies are properly managed. The process of normalization involves decomposing relations by applying a series of normal forms, each adding an additional set of constraints to prevent anomalies such as insertion, update, and deletion issues.

While normalization offers significant benefits in terms of maintaining data integrity, there are cases where performance considerations may lead database designers to opt for a lower normal form or to partial or full denormalization. In such cases, the designer must carefully weigh the trade-offs between reducing redundancy, increasing storage needs, but generally improving query performance.

At the end of the day, the choice of whether to normalize or denormalize a database depends on the specific needs of the application, including factors such as the frequency of read versus write operations and the complexity of the queries being executed. Sometimes, for reasons like performance optimization, databases may intentionally be kept in a lower normal form.

It’s also worth noting that normalization is typically done during the design of the database. Once the database has been implemented and is in use, changing the normalization level can be a complex (and often a quite difficult and practically impossible) task. So, industrial databases are often (by accident and through entropy) de-normalized over time. Normalization is an essential phase in relational database design. It is intended to reduce data redundancy and improve data integrity by ensuring that functional dependencies are properly managed. The process of normalization involves decomposing relations by applying a series of normal forms, each adding an additional set of constraints to prevent anomalies such as insertion, update, and deletion issues.

While normalization offers significant benefits in terms of maintaining data integrity, there are cases where performance considerations may lead database designers to opt for a lower normal form or to partial or full denormalization. In such cases, the designer must carefully weigh the trade-offs between reducing redundancy, increasing storage needs, but generally improving query performance.

At the end of the day, the choice of whether to normalize or denormalize a database depends on the specific needs of the application, including factors such as the frequency of read versus write operations and the complexity of the queries being executed.

Naturally, normalization is intended for relational databases only and is not necessary for non-relational (NoSQL) databases.


Files & Resources

All Files for Lesson 60.111

Errata

Let us know.