Upon completion of this lesson, you will have an overview of databases and database management systems: what they are, the most common relational and NoSQL systems, how concurrent access is controlled, how structured and unstructured data differ, the history and applications of databases, their costs and risks, the roles involved in the database life cycle, the three-level DBMS architecture, data models, database languages, client/server architectures, and the database design process.
In the video narration below, Khoury Boston’s Prof. Schedlbauer provides an overview of databases and some context on their use and programming, as well as their history, a comparison of various approaches, and the most common database systems.
The video covers numerous topics and you are not expected to understand all of them; it is merely intended to provide a broad introduction to databases [1]. If you prefer reading first, you can skip the lecture and proceed to the next section.
Slide Deck: The Need for Databases
A database is an organized collection of data that is stored and accessed electronically. Databases are designed to manage large amounts of information by structuring data in a way that optimizes retrieval and modification. This structuring can involve objects, documents, tables, columns, rows, indexes, and various other mechanisms that facilitate efficient data management.
The structure of a database can be file-based, relational, record-based, document-based, key/value, or hierarchical, among many others. Relational databases, with their tables, columns, and use of SQL, are the most common organizational databases for transactional data in use today. MongoDB and CouchDB are examples of common non-relational (“NoSQL”) document databases that are often used in web application development.
Because the term database has become somewhat synonymous with relational database, the alternative term data store is also used to denote any digital storage of data, including unstructured data.
Besides storing and retrieving data, databases also allow you to add, change, and delete information. These operations are managed by a Database Management System (DBMS). A DBMS is a set of software programs designed to create and maintain a database. Its main purpose is to provide access to the data in the database and an efficient way to define, store, query, and manipulate data. Additionally, a DBMS protects the database from unauthorized access and ensures data can be recovered if the system fails. It also allows multiple users to share and access data simultaneously by managing concurrency.
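To make these operations concrete, the following is a minimal SQL sketch using a hypothetical customer table; the statements are standard SQL and should run on most relational systems, though data types and details vary slightly by product.

```sql
-- A minimal sketch of the core DBMS operations on a hypothetical table.
CREATE TABLE customer (
  customer_id INTEGER PRIMARY KEY,
  name        VARCHAR(100) NOT NULL,
  email       VARCHAR(255)
);

INSERT INTO customer (customer_id, name, email)           -- add
VALUES (1, 'Ada Lopez', 'ada@example.com');

UPDATE customer SET email = 'ada.lopez@example.com'       -- change
WHERE customer_id = 1;

SELECT name, email FROM customer WHERE customer_id = 1;   -- retrieve

DELETE FROM customer WHERE customer_id = 1;               -- delete
```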
In addition to its role in centralized data management, a DBMS offers several other advantages, which are listed below.
Program-Data Independence: Program-data independence, often referred to as data independence, is a crucial feature of DBMS. It entails the ability to modify the structure of the database without necessitating changes to the application programs that interact with it. To achieve a high degree of data independence, DBMS stores the description of the database structure (including the structure of each file and the storage format of data items) and various data constraints separately in its catalog. This catalog houses metadata, which is information about the data. This independence is made possible through a three-level DBMS architecture, as discussed in Section 1.7.
Data Abstraction: Data abstraction is the concept that underlies program-data independence. It enables the database system to present an abstract view of the data to users, shielding them from the physical storage and implementation details. Users interact with data through a conceptual, simplified representation rather than dealing with the intricacies of how data is stored and managed internally.
Supports Multiple Views of the Data: In a shared database environment, multiple users may have distinct perspectives or views of the data. A DBMS offers the capability to define different views of the data tailored to the requirements of individual users or user groups. A view, in this context, represents a subset of the database containing virtual data derived from the database files. Views do not exist in physical form; only the view’s definition is stored, which specifies how the view should be constructed from the underlying data, as the sketch below illustrates.
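As a concrete illustration, here is a minimal SQL sketch (table, column, and view names are hypothetical) of a view that exposes a subset of a table while hiding a sensitive column; note that only the view’s definition is stored, never a copy of the data.

```sql
CREATE TABLE employee (
  emp_id     INTEGER PRIMARY KEY,
  name       VARCHAR(100),
  department VARCHAR(50),
  salary     DECIMAL(10,2)      -- sensitive; hidden from most users
);

-- A view tailored to one user group: staffing information without pay.
CREATE VIEW staff_directory AS
SELECT emp_id, name, department
FROM employee;

-- Users query the view exactly as if it were a table.
SELECT name FROM staff_directory WHERE department = 'Sales';
```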
There are several types of database management systems in use today, characterized by how they store data and the format in which the data is stored, as well as the query mechanisms provided to applications and users.
Relational Database Management Systems (RDBMS) are widely used in various applications and industries. Below is a list of some common and popular RDBMS:
Oracle Database: Oracle is one of the most well-known RDBMS in the world, known for its scalability, security features, and comprehensive toolset for database management.
Microsoft SQL Server: Developed by Microsoft, SQL Server is a robust RDBMS that integrates well with Microsoft’s other products and offers features like business intelligence and analytics.
MySQL: An open-source RDBMS, MySQL is known for its speed and reliability. It is often used in web applications, including popular platforms like WordPress.
PostgreSQL: Another open-source RDBMS, PostgreSQL is known for its advanced features, extensibility, and support for complex data types and custom functions.
IBM Db2: Db2 is a family of data management products, with options for various platforms. It is known for its reliability, scalability, and support for AI and machine learning.
SQLite: A self-contained, serverless, and zero-configuration RDBMS, SQLite is embedded in many applications and devices, including mobile apps and web browsers.
MariaDB: A fork of MySQL, MariaDB is open source and designed for high performance, with a focus on compatibility with MySQL.
Amazon RDS: Amazon Relational Database Service (RDS) offers managed database solutions for several RDBMS options, including MySQL, PostgreSQL, SQL Server, and Oracle, making it easier to set up, operate, and scale databases in the cloud.
IBM Informix: Informix is known for its support for time-series data, high availability, and real-time analytics, making it suitable for applications requiring high-speed data processing.
SAP HANA: While primarily known as an in-memory database system for SAP applications, SAP HANA is also used as a general-purpose RDBMS for real-time analytics and data processing.
Teradata: Teradata is popular for data warehousing and analytics, providing high-performance and scalable solutions for large data sets.
These are just a few examples of the many relational database management systems available today. The choice of an RDBMS depends on factors such as the specific requirements of the application, scalability needs, budget, and the ecosystem in which it will be deployed.
A Database Management System (DBMS) is software that interacts with the user, applications, and the database itself to capture and analyze data. The DBMS provides an interface for users and applications to perform various operations on the database, such as data insertion, query, update, and deletion.
Key Functions of a DBMS:
Data Definition: defining and modifying the structure of the database.
Data Manipulation: inserting, querying, updating, and deleting data.
Security and Authorization: protecting the database from unauthorized access.
Backup and Recovery: restoring the database after a system failure.
Concurrency Control: coordinating simultaneous access by multiple users.
Concurrency control in database management systems (DBMS) is critical for ensuring that multiple users or processes can access and modify a database simultaneously without leading to data inconsistencies, integrity violations, or other problems. It’s essential in multi-user database environments to maintain data consistency and accuracy while allowing for concurrent access.
Here’s an explanation of concurrency control in DBMS:
Simultaneous Access: In a multi-user database system, multiple users or processes may attempt to read and modify data concurrently. Without concurrency control, this could lead to problems like data corruption or lost updates.
Concurrency Problems: Several issues can arise when multiple transactions (sequences of database operations) run concurrently:
Lost Updates: When two transactions try to update the same data simultaneously, one update might overwrite the other, leading to data loss.
Dirty Reads: One transaction reads uncommitted data changes made by another transaction, which can lead to incorrect results if the other transaction is rolled back.
Uncommitted Data: A transaction reads data that another transaction is currently modifying but has not yet committed. This can also result in incorrect data retrieval.
Inconsistent Retrieval: Inconsistent data can be read if one transaction reads some data before another transaction updates it, leading to a view of the database that doesn’t reflect any single consistent state.
Concurrency Control Techniques: To address these problems, a DBMS employs various concurrency control techniques, several of which are illustrated in the sketch following this list:
Locking: Transactions lock the data they want to access or modify. Locks prevent other transactions from accessing the same data until the lock is released. This ensures that only one transaction can modify a piece of data at a time.
Isolation Levels: DBMS offers different isolation levels (e.g., Read Uncommitted, Read Committed, Repeatable Read, Serializable) that define the degree of isolation between concurrent transactions. Each level balances data consistency with performance and concurrency.
Two-Phase Locking (2PL): Transactions follow a set of rules during their execution, including acquiring locks before accessing data and releasing locks after completing the transaction. The two phases are the growing phase (acquiring locks) and the shrinking phase (releasing locks).
Timestamp Ordering: Transactions are assigned unique timestamps, and the DBMS uses these timestamps to order and schedule transactions. Older transactions are given priority to avoid conflicts with newer ones.
Conflict Resolution: When conflicts between transactions occur, DBMS employs conflict resolution strategies, such as waiting, aborting, or rolling back transactions, to ensure data integrity.
Transaction Isolation: The choice of isolation level determines how transactions interact with each other. Higher isolation levels provide stronger guarantees but can impact system performance due to increased locking and blocking.
Commit and Rollback: Transactions in DBMS can be committed (making their changes permanent) or rolled back (undoing their changes). Concurrency control ensures that transactions can be safely committed without conflicting with other transactions.
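The following minimal SQL sketch shows what several of these mechanisms look like from the application’s side: an explicit transaction, a chosen isolation level, row locking, and a commit. The account table and values are hypothetical, and the placement of the SET TRANSACTION statement as well as support for FOR UPDATE vary across DBMSs.

```sql
CREATE TABLE account (
  account_id INTEGER PRIMARY KEY,
  balance    DECIMAL(10,2) NOT NULL
);

BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;  -- placement varies by DBMS

-- Lock the row being read so a concurrent transfer cannot cause a
-- lost update; other transactions must wait until the lock is released.
SELECT balance FROM account WHERE account_id = 42 FOR UPDATE;

UPDATE account SET balance = balance - 100 WHERE account_id = 42;
UPDATE account SET balance = balance + 100 WHERE account_id = 7;

COMMIT;      -- make both changes permanent atomically
-- ROLLBACK; -- or, instead, undo everything if a business check fails
```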
In short, concurrency control in DBMS is essential for managing simultaneous access to a database to maintain data consistency and integrity. It involves techniques like locking, isolation levels, and conflict resolution to prevent issues like lost updates and dirty reads when multiple transactions operate concurrently. The choice of concurrency control method should consider the specific requirements of the application and balance data consistency with system performance.
Most databases store structured data that has a defined set of fields. Structured data is highly organized and easily searchable within a database. This type of data is typically stored in a tabular format with rows and columns, making it straightforward to enter, query, and analyze.
Characteristics of Structured Data:
Conforms to a predefined schema of rows and columns.
Each field has a defined name and data type.
Easily entered, queried, and analyzed, typically with SQL.
Unstructured data lacks a predefined format or organization, making it more complex to manage and analyze. This type of data includes formats such as text, images, audio, video, and social media posts. Examples of databases containing unstructured data include an email system in which each message is stored as a document with fields such as sender, recipient, subject, body, and attachments, or a video streaming platform that stores video files along with metadata like title, description, and tags. Of course, we could impose some structure, as the sketch after the list below shows.
Characteristics of Unstructured Data:
No predefined schema or data model.
Appears in many formats, such as text, images, audio, and video.
More complex to store, search, and analyze than structured data.
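Returning to the email example above, the following minimal SQL sketch (all names hypothetical) shows one way to impose partial structure: the searchable header fields become typed columns, while the body and attachments remain opaque text and binary values.

```sql
CREATE TABLE email (
  email_id   INTEGER PRIMARY KEY,
  sender     VARCHAR(255) NOT NULL,
  recipient  VARCHAR(255) NOT NULL,
  subject    VARCHAR(500),
  sent_at    TIMESTAMP,
  body       TEXT,      -- unstructured free text
  attachment BLOB       -- unstructured binary content (type name varies by DBMS)
);

-- The structured fields support precise queries...
SELECT subject, sent_at FROM email WHERE sender = 'ada@example.com';

-- ...while the body can only be searched as raw text.
SELECT email_id FROM email WHERE body LIKE '%invoice%';
```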
Relational databases are designed to handle structured data. They use predefined schemas to enforce data integrity and relationships between tables. SQL is used for defining and manipulating data within these databases. As such, they are excellent for transactions requiring data integrity, complex queries, and relationships between data entities. NoSQL (i.e., non-relational) databases are designed to handle unstructured or semi-structured data. They offer flexible schemas and are optimized for various data models, including document, key-value, column-family, and graph.
Data lakes are large storage repositories that can hold vast amounts of raw data in its native format until it is needed. They are designed to handle both structured and unstructured data and are often used in big data analytics. They offer flexibility to store all types of data, scalability, and suitability for advanced analytics and machine learning. Amazon S3 and Hadoop are examples of products used to build data lakes.
In summary, databases manage structured data using relational databases with predefined schemas and SQL for querying. For unstructured data, NoSQL databases and data lakes offer the necessary flexibility and scalability. Understanding the differences between structured and unstructured data and the appropriate database systems for each is essential for effective data management and utilization.
Understanding where databases came from helps in understanding the database landscape today. Furthermore, some of the earliest databases are still in use: banks, insurance companies, governments, and militaries still use databases, database management systems, and applications written in the 1960s, 1970s, and 1980s.
Of course, database systems have evolved over time, progressing through distinct generations and responding to the changing needs of businesses and leveraging newer technologies and programming frameworks.
Let’s delve into the evolution of database systems, look at key developments, and recognize the impact of various eras on today’s databases.
First Generation (1960s): The first generation of database systems emerged in the early 1960s and was characterized by hierarchical and network data models. During this period, Charles Bachman introduced the Integrated Data Store, one of the earliest general-purpose Database Management Systems (DBMS). The advent of magnetic tapes automated various data processing tasks, such as payroll and inventory management. In the late 1960s, IBM developed the Information Management System (IMS) DBMS, which employed both network and hierarchical data models. The use of hard disks, rather than magnetic tapes, facilitated direct data access. It was in this environment that Edgar Frank ‘Ted’ Codd proposed the groundbreaking relational data model in 1970. Codd’s work laid the foundation for the development of relational databases, and he received the prestigious ACM Turing Award for his contributions.
Second Generation (1980s): The second generation of database systems saw the dominance of the relational model. ANSI/ISO standardization of SQL (Structured Query Language) began with SQL-86 and continued through SQL-92 and, later, SQL:1999. Research on parallel and distributed databases gained traction. Commercial relational databases like IBM’s DB2, Oracle, and Informix UDS expanded their capabilities to store complex data types, including images and text.
Third Generation (1990s-2000s): In the early 1990s, SQL evolved primarily to cater to decision support applications. Vendors introduced products for parallel databases, and object-relational support was incorporated into existing databases. With the rise of the World Wide Web, DBMS usage for storing web-accessible data became widespread in the late 1990s. Database systems were redesigned to handle high transaction processing rates and achieve 24/7 availability. Reliability of database systems improved significantly during this period. Additionally, in the early 2000s, XML databases and the XQuery language were developed to handle large volumes of data and high transaction rates.
Integration with Enterprise Resource Planning (ERP) and Manufacturing Resource Planning (MRP): The late 20th and early 21st centuries witnessed the emergence of enterprise resource planning (ERP) and manufacturing resource planning (MRP) packages such as PeopleSoft, SAP, and Siebel. These packages introduced a layer of application-oriented features atop the DBMS. They provided a general application layer capable of performing tasks such as financial analysis, inventory control, and human resource planning. Companies could customize this layer to meet their specific needs, resulting in reduced costs associated with developing the application layer from scratch.
In short, the evolution of database systems spans three generations, each marked by significant advancements in data modeling, access methods, and functionality. These advancements have been shaped by technological innovations, business requirements, and the growing complexity of data management in an age where data has become central to data-driven organizations.
Database systems are widely used across many domains due to their numerous benefits. They help streamline operations, manage data efficiently, and support better decision-making. In fact, it is hard to envision any modern organization that does not employ or use one or more databases in some form or another.
Here are some common applications of databases and their importance in different sectors:
1. Airlines and Railways: These companies use online databases to manage reservations and provide real-time schedule information to passengers. This helps optimize seat allocation and minimize scheduling conflicts, ensuring smooth travel experiences.
2. Banking: Banks rely on databases for customer inquiries, account management, loan processing, and other financial transactions. Databases securely store and retrieve large amounts of financial data, enabling accurate and efficient banking operations.
3. Education: Schools and colleges use databases for course registration, grade tracking, and managing student information. These systems make administrative tasks easier and improve record keeping.
4. Telecommunications: Telecom companies maintain records of communication networks, telephone numbers, and call histories using databases. These systems help generate accurate monthly bills and optimize network operations.
5. Credit Card Transactions: Databases track credit card purchases, allowing financial institutions to generate monthly statements for cardholders. They also support timely billing and fraud detection.
6. E-commerce: In e-commerce, databases integrate information from product catalogs, supporting online shopping, holiday bookings, and virtual consultations. This enhances convenience for consumers.
7. Health Care Information Systems: Databases in healthcare maintain electronic patient records and manage healthcare details, enabling providers to deliver accurate and timely medical services.
8. Digital Libraries and Digital Publishing: Databases manage and deliver large amounts of textual and multimedia data in digital libraries and publishing platforms, allowing efficient content organization and retrieval.
9. Finance: Financial institutions use databases to store data on sales, stock and bond purchases, and online trading information. Real-time data access is essential for informed financial decisions.
10. Sales: Retail businesses use databases to store information about products, customers, and transactions. These systems help manage inventory, track customer preferences, and optimize sales strategies.
11. Human Resources: Organizations use databases to keep detailed records of employees, including salaries, benefits, taxes, and payroll data. This streamlines HR processes, ensuring accurate compensation and compliance.
In short, database systems are essential in virtually all sectors, enhancing the efficiency, accuracy, and accessibility of data. Their flexibility and versatility make them a critical part of modern operations, driving innovation, improving service quality, and paving the way for organizations to become more data-driven.
Adopting database systems offers many benefits, improving data management and accessibility. However, deciding to switch from traditional file processing systems and spreadsheets to database systems requires a careful evaluation of the associated costs and risks.
Let’s look at some common challenges, as well as typical cost and risk factors:
1. Acquisition Costs: Setting up a new database system often requires significant investments in both hardware and software. Database Management Systems (DBMS) typically need more memory and disk storage, and purchasing DBMS software licenses can be expensive. Organizations must carefully consider the initial costs and balance them against the expected long-term benefits.
2. Training Requirements: Implementing a database system may necessitate hiring or training a specialized data administration team. This team is essential for designing data views, establishing recovery procedures, and optimizing data structures to meet organizational needs. Training or recruiting skilled professionals can be costly, and there may be a learning curve for existing staff during the transition.
3. Backup and Recovery: Ensuring data accuracy and availability in a shared corporate database is crucial. Database systems with online updating need robust backup and recovery procedures to protect against data loss or corruption. These procedures must be carefully planned and implemented to minimize downtime and maintain data integrity.
4. System Failures: If the computer system hosting the database fails, all users must wait until it is restored. Additionally, a failure within the DBMS or application can cause permanent database damage. System failures can disrupt operations, reduce productivity, and require extensive recovery efforts.
5. Data Security and Privacy: Database systems introduce concerns about data security and privacy. Unauthorized access, data breaches, and data loss can have severe financial and reputational consequences. Strong security measures and compliance with data protection regulations are essential for managing a database.
6. Migration Challenges: Moving from a file processing system to a database system can be logistically challenging. Existing data must be migrated and mapped to the new database structure, and legacy systems may need integration with the DBMS. Migrating large data volumes without disruptions or loss is complex.
7. Maintenance Costs: After the initial setup, organizations must consider ongoing maintenance costs, including software updates, hardware upgrades, and regular database maintenance to optimize performance and ensure data integrity.
8. Vendor Lock-In: Relying on a specific DBMS vendor can lead to vendor lock-in, making it difficult and costly to switch to a different DBMS or vendor in the future.
In short, while the adoption of database systems offers substantial benefits in terms of data management and accessibility, organizations must carefully assess the associated costs and risks. A comprehensive cost-benefit analysis, coupled with diligent planning and risk mitigation strategies, is essential to make informed decisions regarding the implementation and maintenance of database systems.
The development, utilization, and maintenance of a valuable asset like a database system involve a diverse group of individuals, each with specific roles and responsibilities. Within this ecosystem, several key stakeholders collaborate to ensure the effective functioning of the database. These key players include database users, system analysts, application programmers, and database administrators (DBAs).
1. Database Users: Database users form the primary interface between the database system and the real-world applications it supports. They interact with the database to retrieve information, update records, and generate reports essential for their tasks. Database users can be categorized into three main groups, each with distinct characteristics:
Naïve Users: Naïve users are those who interact with the database by invoking predefined application programs. For instance, a bookstore owner enters book details into the database using specific application programs tailored for this purpose. These users interact with the database through user-friendly form interfaces, making it easier to input and retrieve data.
Sophisticated Users: Sophisticated users, such as business analysts or scientists, possess a deeper understanding of the functionalities offered by a Database Management System (DBMS). They can interact with the system without having to write custom application programs. Instead, they employ database query languages to retrieve complex information from the database, effectively tailoring their queries to meet intricate requirements.
Specialized Users: Specialized users are proficient in developing specialized database programs that differ from traditional data processing applications. These individuals create applications like computer-aided design systems or knowledge-base and expert systems. These applications handle data with complex data types and require a deep understanding of both the database structure and the specific application domain.
2. System Analysts: System analysts bridge the gap between database users and developers. They are responsible for understanding the requirements of database users and translating them into technical specifications. System analysts play a critical role in designing the database schema, defining data structures, and ensuring that the database aligns with organizational needs.
3. Application Programmers: Application programmers are responsible for developing and maintaining the software applications that interact with the database. They write code to facilitate data entry, retrieval, and processing. These programmers work closely with system analysts and database users to create custom applications that fulfill specific business or operational needs.
4. Database Administrators (DBAs): Database administrators are responsible for the overall management of the database system. They handle tasks such as database installation, configuration, security, performance tuning, and backup and recovery. DBAs ensure data integrity, availability, and security while optimizing database performance to meet user demands.
In essence, the success of a database system hinges on the collaboration and expertise of these key individuals and their respective roles. Effective communication and coordination among database users, system analysts, application programmers, and DBAs are vital to maintaining a well-functioning and responsive database environment that serves the needs of organizations and their stakeholders.
In the realm of Database Management Systems (DBMS), the architecture plays a pivotal role in shaping how users perceive and interact with the underlying data. It is crucial to note that the DBMS architecture primarily concerns itself with presenting a structured view of the data to users, while it remains agnostic to the intricacies of data handling and processing within the DBMS. This abstraction ensures that users can manipulate data without concerning themselves with its physical storage and organization.
The DBMS architecture, as envisioned by ANSI/SPARC (American National Standards Institute/Standards Planning and Requirements Committee), introduces a conceptual framework known as the Three-Level DBMS Architecture. This framework defines the overall database description at three distinct levels: the internal level, the conceptual level, and the external level. Each of these levels contributes to the holistic understanding of the database system and plays a unique role:
1. Internal Level (Physical Level): At the lowest level of data abstraction, the internal level, also referred to as the physical level, pertains to the physical representation of the database within the computer system. It delves into how data is physically stored and organized on storage media. Considerations at this level encompass optimization of runtime performance and efficient utilization of storage space. Topics addressed include techniques for allocating storage space for data and indexes, defining access paths (e.g., indexes), implementing data compression and encryption methods, and strategically placing records for optimal retrieval.
2. Conceptual Level (Logical Level): The conceptual level, also known as the logical level, focuses on the logical structure of the entire database. It answers questions about what data is stored, the relationships between various data elements, and provides a comprehensive view of user requirements without delving into the physical implementation details. In essence, the conceptual level abstracts away the complexities of physical storage structures, enabling users to interact with a high-level representation of the database that aligns with their conceptual understanding.
3. External Level (View Level): At the highest level of abstraction, the external level, also referred to as the view level, caters to the user’s perspective of the database. In many cases, users and application programs do not require access to the entire dataset. Instead, they interact with a subset of the database tailored to their specific needs. The external level permits users to access data customized to their requirements, allowing the same dataset to be presented differently to various users simultaneously. This flexibility provides a robust security mechanism by concealing parts of the database from specific users, effectively hiding attributes that are not relevant to their view.
In the Three-Level DBMS Architecture, the schema of the database is described at each of these levels. Consequently, the architecture is often referred to as the Three-Schema Architecture, with each level having its own schema:
Internal Schema: describes the physical storage structures of the database.
Conceptual Schema: describes the structure of the entire database for the community of users.
External Schemas: describe the portions of the database visible to particular users or user groups.
The DBMS, in collaboration with the operating system, manages the physical level. Whenever a user at the external level issues a request for a new external view, the DBMS orchestrates a process known as mapping, transforming the request into corresponding operations at the conceptual and internal levels. This mapping ensures that data retrieval and presentation align with user needs.
A primary advantage of the Three-Level DBMS Architecture is its ability to provide data independence. Data independence is the capacity to modify the schema at one level without necessitating alterations at other levels. It can be categorized into two types, both illustrated in the sketch following this list:
Logical Data Independence: This type enables changes to the conceptual schema without affecting external schemas or application programs. Modifications to the conceptual schema could include constraints, addition or removal of data items, or other alterations, all of which can occur without affecting the external view.
Physical Data Independence: On the other hand, physical data independence empowers changes to the internal schema without impacting the conceptual or external schema. Adjustments to the internal schema might involve the creation of additional access structures, alterations in storage structures, or similar changes, all achieved without affecting the higher-level schemas.
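A minimal SQL sketch of both kinds of data independence, using hypothetical table, view, and index names:

```sql
CREATE TABLE orders (
  order_id    INTEGER PRIMARY KEY,
  customer_id INTEGER NOT NULL,
  total       DECIMAL(10,2)
);

-- External level: a user group works only through this view.
CREATE VIEW order_summary AS
SELECT order_id, customer_id, total FROM orders;

-- Logical data independence: extending the conceptual schema does not
-- break the external view or the queries written against it.
ALTER TABLE orders ADD COLUMN sales_channel VARCHAR(20);

-- Physical data independence: adding an index changes only the internal
-- schema; no table definition or query needs to change, although
-- retrieval by customer may become faster.
CREATE INDEX idx_orders_customer ON orders (customer_id);
```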
It’s worth noting that achieving logical data independence can be more challenging than attaining physical data independence because application programs often rely on the logical structure of the database. Consequently, changes to the logical structure may necessitate corresponding modifications to application programs.
In essence, the core objective of a database system is to provide data abstraction, concealing the intricacies of data storage and organization from users. This abstraction is realized through the use of a data model, which serves as an abstract model describing how data is represented and utilized within the system. A data model comprises data structures and conceptual tools that define the structure of the database, encompassing data types, relationships, and constraints.
A data model not only defines the structure of data but also specifies a set of operations that can be performed on the data. It consists of both data model theory, a formal description of data structure and usage, and data model instances tailored to specific applications. The process of applying data model theory to create a data model instance is known as data modeling.
Data models can be categorized based on the concepts they employ to represent database structure. Three primary types of data models include:
High-Level or Conceptual Data Models: These models focus on the abstract, logical representation of data and its relationships, independent of implementation details. They are designed to provide a high-level understanding of data organization and are often used during database design and planning.
Representational or Implementation Data Models: These models bridge the gap between high-level conceptual models and low-level physical models. They aim to represent data structures and relationships in a manner that is closer to the actual implementation, facilitating the translation of conceptual models into physical database designs.
Low-Level or Physical Data Models: These models detail the physical storage and implementation of data in the database system. They encompass aspects such as storage mechanisms, access methods, and indexing techniques, providing the foundational structure for data storage and retrieval.
In recent years, the semistructured data model has gained prominence as a means of representing data that does not conform to rigid, structured schemas. In this model, data items of the same type are not constrained to have identical sets of attributes, allowing for flexibility in data representation. Semistructured data is often self-describing, with schema information embedded within the data itself. This model is particularly relevant for handling sources like the web, where data lacks strict schema constraints, and flexibility in data representation is essential.
The emergence of semistructured data modeling has several motivations: it accommodates sources, such as the web, whose structure is not known in advance; it eases the exchange of data between databases with differing schemas; and it allows a schema to evolve along with the data it describes.
Notably, XML (Extensible Markup Language) serves as a prominent means of representing semistructured data. It allows the definition of schemas using XML schema language, enabling varying levels of schema flexibility over data elements.
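As a small illustration, the hypothetical XML fragment below shows the hallmarks of semistructured data: two items of the same type carry different sets of elements, and the markup itself describes the data.

```xml
<catalog>
  <book>
    <title>Database Systems</title>
    <author>A. Author</author>
    <price currency="USD">59.00</price>
  </book>
  <book>
    <!-- same type of item, but a different attribute set: an extra
         format element and no author or price at all -->
    <title>Data Modeling</title>
    <format>ebook</format>
  </book>
</catalog>
```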
In conclusion, the field of database management encompasses a rich tapestry of concepts, architectures, and data models, each contributing to the abstraction, organization, and utilization of data within complex computing systems. Understanding these fundamental aspects is crucial for graduate-level computer science students, as it forms the basis for database design, implementation, and optimization.
The primary objective of a Database Management System (DBMS) is to empower its users to perform various operations on the database, such as inserting, deleting, and retrieving data, all in abstract terms. Users interact with the database without needing to concern themselves with the physical representation of the data. To facilitate these operations and cater to different user requirements, DBMSs typically offer specialized programming languages known as Database Languages or DBMS Languages.
In addition to Data Definition Language (DDL) and Data Manipulation Language (DML), DBMSs also provide two other essential languages: Data Control Language (DCL) and Transaction Control Language (TCL).
Data Control Language (DCL): DCL is primarily responsible for managing access control and security within the database. It is used to create user roles, grant permissions, and exert control over database access, thus ensuring data security. DCL is instrumental in safeguarding the database from unauthorized access and manipulation.
Transaction Control Language (TCL): TCL, on the other hand, focuses on managing transactions that occur within the database. It includes commands for initiating transactions, committing or rolling back transactions, and ensuring the consistency and integrity of data during transactional operations.
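A minimal sketch of DCL and TCL statements follows; the role, table, and column names are hypothetical, and exact syntax differs somewhat across DBMSs.

```sql
CREATE TABLE orders (
  order_id    INTEGER PRIMARY KEY,
  customer_id INTEGER,
  total       DECIMAL(10,2)
);

-- DCL: create a role and grant it limited access, which can later
-- be withdrawn.
CREATE ROLE reporting_user;
GRANT SELECT ON orders TO reporting_user;
REVOKE SELECT ON orders FROM reporting_user;

-- TCL: group several changes into one atomic unit of work.
BEGIN;
INSERT INTO orders (order_id, customer_id, total) VALUES (101, 7, 250.00);
SAVEPOINT after_insert;               -- a point we can roll back to
UPDATE orders SET total = 275.00 WHERE order_id = 101;
ROLLBACK TO SAVEPOINT after_insert;   -- undo only the update
COMMIT;                               -- make the insert permanent
```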
The core of a DBMS comprises two fundamental database languages:
Data Definition Language (DDL): DDL serves as the foundation for defining the structure of the database schema. It enables users to specify the layout of the database, including data types, relationships between tables, and constraints. The DDL compiler processes these schema definitions and stores them in a specialized catalog within the DBMS, known as the data dictionary.
Data Manipulation Language (DML): DML, as the name suggests, is used for manipulating the data within the database. It allows users to perform operations such as inserting new records, deleting existing records, and modifying data. DML commands can be executed both in planned queries, embedded within application programs, and in ad-hoc, unplanned queries or interactive queries.
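A minimal sketch combining the two languages (all names hypothetical): the DDL declares the schema along with domain and referential integrity constraints, which the DBMS records in its data dictionary, and the DML then manipulates the data that the schema describes.

```sql
-- DDL: define structure and integrity constraints.
CREATE TABLE department (
  dept_id   INTEGER PRIMARY KEY,
  dept_name VARCHAR(50) UNIQUE NOT NULL
);

CREATE TABLE employee (
  emp_id  INTEGER PRIMARY KEY,
  name    VARCHAR(100) NOT NULL,
  salary  DECIMAL(10,2) CHECK (salary >= 0),       -- domain constraint
  dept_id INTEGER REFERENCES department (dept_id)  -- referential integrity
);

-- DML: insert, modify, query, and delete data.
INSERT INTO department VALUES (10, 'Engineering');
INSERT INTO employee VALUES (1, 'Ada Lopez', 95000.00, 10);
UPDATE employee SET salary = 99000.00 WHERE emp_id = 1;
SELECT name FROM employee WHERE dept_id = 10;
DELETE FROM employee WHERE emp_id = 1;
```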
In some DBMSs, particularly those where a strict separation between the conceptual and internal database levels is maintained, the DDL is primarily used to define the conceptual schema. In these cases, a distinct language, the Storage Definition Language (SDL), is employed to define the internal schema. Furthermore, in DBMSs following the true three-schema architecture, a third language called the View Definition Language (VDL) is utilized to define the external schema, ensuring that different user groups can access customized views of the database.
DDL statements not only define the schema but also specify integrity rules and constraints to maintain the database’s integrity. These integrity constraints encompass domain constraints, referential integrity, assertions, and authorization rules, all of which are essential to preserving the quality and consistency of data. DDL processes these instructions and stores the resulting schema description within the data dictionary—a specialized table containing metadata. Importantly, only the database system itself can modify the data dictionary, preventing users from making direct updates.
A database system, being a complex software entity, comprises various software components, each responsible for distinct tasks:
Data Definition: This component focuses on defining the data’s structure, encompassing the definition and modification of record structures, field data types, and constraints. Database administrators wield this component to define and modify the database structure using DDL and privileged commands. The DDL compiler within the DBMS processes these schema definitions and stores them in the data dictionary for reference by other DBMS components.
Data Manipulation: Once the data structure is defined, data manipulation becomes paramount. This component manages operations like insertion, deletion, and modification of records. It handles both planned queries, embedded within application programs and submitted to precompilers, and unplanned queries or interactive queries. The DML compiler processes these queries, optimizing them for execution.
Data Security and Integrity: Security and data integrity functions are embedded within the DBMS, alleviating the need for application programmers to implement these functions explicitly. They ensure data security by controlling access and safeguarding the integrity of the stored data.
Concurrency and Data Recovery: The DBMS incorporates functions to manage concurrent access by multiple users and to recover data after system failures. Concurrency control techniques and database recovery mechanisms are critical aspects of this component.
Performance Optimization: To optimize query performance, the DBMS offers functions for evaluating various execution plans for a query and selecting the most efficient one. These optimizations enhance the speed and efficiency of data retrieval.
In short, a DBMS is a complex system composed of various software components, each fulfilling a distinct role in managing and maintaining the database. Understanding these components is essential for effectively designing, implementing, and optimizing database systems, making them a crucial subject for graduate-level computer science students.
In the realm of centralized database systems, all components, comprising the database system itself, application programs, and the user interface, are co-located on a single system, with auxiliary dumb terminals merely serving as information display units. The processing capabilities of the solitary system are fully harnessed, while the dumb terminals act as passive conduits for information presentation. However, with the advent of increasingly powerful and affordable personal computers, a shift occurred. Database systems began capitalizing on the substantial processing power available at the user’s end, ushering in the era of client/server architecture. In this paradigm, the computing resources of the user’s system actively participate in processing the user interface.
In client/server architecture, a client is a computer system that initiates requests, transmitting them over a network to a server. The server, in turn, receives these requests, processes them, and subsequently returns the requested data to the client. These client and server entities are typically situated at different physical locations. End users, often referred to as remote database users, interact with the client computer system, while the database system itself operates on the server. Servers, by design, come in various types, including file servers, printer servers, web servers, database servers, and more. The client machines feature user interfaces that facilitate user interaction with these servers, and they also provide local processing power for running client-side applications.
Client/server architecture can be implemented through two distinct approaches. The first approach places the user interface and application programs on the client side, while the database system resides on the server side. This architecture is known as two-tier architecture. Application programs on the client side communicate with the database management system (DBMS) situated on the server side. Standards such as Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) govern the interaction between clients and servers.
The second approach, known as three-tier architecture, is commonly employed in web-based applications. It introduces an intermediary layer—the application server (or web server)—between the client and the database server. In this configuration, clients communicate with the application server, which in turn communicates with the database server. The application server houses business rules, including procedures and constraints, used to access data from the database server. It also verifies the client’s credentials before forwarding a request to the database server, thereby enhancing database security.
When a client requests information, the application server receives the request, processes it, and dispatches the relevant database commands to the database server. The database server, in turn, sends the results back to the application server, which converts the data into a graphical user interface (GUI) format before presenting it to the client.
Database management systems (DBMS) are further categorized based on the number of users they support, leading to the classification of single-user and multi-user systems. In a single-user system, the database resides on a single computer and is accessible by a sole user at any given time. The user assumes multiple roles, including database design, maintenance, and programming for data manipulation. In some cases, the user may hire database system designers to create the system, in which case their role primarily entails end-user activities. However, in most enterprise scenarios, where large volumes of data require management and concurrent access, multi-user systems become imperative. Multi-user systems enable multiple users to access the database simultaneously, each with their respective roles and access rights.
DBMS can also be categorized based on their intended purpose, leading to a distinction between general-purpose and specific-purpose systems. A DBMS is typically a general-purpose software system with broad applicability. However, it can be specialized for particular purposes, such as airline or railway reservation systems. These specialized systems are often tailored to specific applications and may require substantial modifications to adapt them to different use cases. They fall under the category of online transaction processing (OLTP) systems, which excel in handling data entry and retrieval tasks efficiently. OLTP systems are engineered to support a high volume of concurrent transactions with minimal delays, making them suitable for applications like automated teller machines (ATMs) in the banking sector, which handle online commercial transactions seamlessly.
In the short video below, Khoury Boston’s Prof. Durant provides an overview of common architectures for modern databases and contrasts them with file-based data storage approaches. Note that all of these architectures are still in use; even though file-based data storage has many drawbacks, it remains commonly used because it is simple, easy to build, and lightweight [2].
The database design process is a systematic undertaking comprising several well-defined steps, collectively referred to as a design methodology. At its core, this process involves comprehending an organization’s operational and business requirements, modeling these specified needs, and materializing them through the creation of a database. The ultimate aim of database design is to craft a database that is not only efficient but also of high quality and low cost. In larger organizations, the responsibility for designing an efficient database system often falls to the database administrator (DBA), who oversees the entire database life-cycle process. This process encompasses multiple phases, each with its unique role and significance.
Requirement Collection and Analysis: This initial phase revolves around comprehending and scrutinizing user expectations for the forthcoming database application in meticulous detail. A team of analysts or requirement experts conducts this process. They evaluate the existing file processing system or DBMS, extensively engage with users to fathom the business area’s nature, and rationalize the necessity for data and databases. It’s important to note that initial requirements can be informal, incomplete, inconsistent, and partially incorrect. Techniques such as Object-Oriented Analysis (OOA) and Data Flow Diagrams (DFDs) are employed to refine these requirements into a more structured form. While this phase can be time-consuming, its pivotal role in determining the database system’s success cannot be overstated. The output of this phase is a document specifying user requirements.
Conceptual Database Design: In this phase, the database designer selects an appropriate data model and translates the data requirements established in the previous phase into a conceptual database schema, aligning it with the chosen data model’s concepts. The conceptual schema is model-agnostic and serves as a high-level representation of the organization. During this phase, a high-level description of the data and associated constraints is formulated. The Entity-Relationship (E-R) diagram is a commonly used tool to represent the conceptual database design. Key attributes of a good conceptual schema include expressiveness, simplicity, clarity, conciseness, and formalization.
Selection of a DBMS: The selection of a Database Management System (DBMS) hinges on numerous factors, including cost, available features and tools, underlying data model, portability, and hardware requirements. Technical considerations impacting the choice of a DBMS include the DBMS type (e.g., relational, object, object-relational), supported storage structures and access mechanisms, available interfaces, query languages, and system architecture (e.g., client/server, parallel, or distributed). Cost considerations encompass software and hardware acquisition, maintenance, database creation, personnel, training, and operational expenses.
Logical Database Design: Following the selection of an appropriate DBMS, the focus shifts to mapping the high-level conceptual schema onto the implementation data model of the chosen DBMS. This phase entails transitioning from an abstract data model to the practical implementation of the database. For instance, in the case of a relational model, this phase usually involves mapping the E-R model into a relational schema.
Physical Database Design: During this phase, the physical aspects of the database are meticulously specified to optimize performance. Considerations include storage structures, file organization, and access paths for database files. Potential choices for file organization and access mechanisms encompass various indexing methods, record clustering techniques, and hashing schemes.
Database System Implementation: With the logical and physical database designs in place, the actual implementation of the database system begins. This stage employs Data Definition Language (DDL) statements provided by the selected DBMS. These statements are used to create the database schema and files, and subsequently load the database with data.
Testing and Evaluation: In this crucial phase, rigorous testing and performance evaluation of the database are conducted. Attention is paid to aspects such as performance optimization, data integrity, concurrent access, and security constraints. This phase often runs in parallel with application programming. If testing uncovers issues, corrective actions may involve modifications to the physical or logical design, software or hardware upgrades, or changes in the chosen DBMS.
By methodically progressing through these phases, the database design process aims to produce a robust, efficient, and cost-effective database system that aligns with the organization’s operational and business needs.
In this lesson, we explored various aspects of database management systems (DBMS) and their critical components. We began by defining DBMS as software systems designed to manage, store, and retrieve data efficiently. We discussed the importance of data concurrency control in DBMS to ensure data consistency in multi-user environments.
We then listed several common and popular relational database management systems (RDBMS), including SQLite, MySQL, PostgreSQL, Microsoft SQL Server, and Oracle.
Expanding on DBMS advantages, we highlighted program-data independence, data abstraction, and support for multiple views of the data as key benefits. We also delved into the evolution of database systems over three generations, from hierarchical and network models to the dominant relational model and onward to object-oriented and object-relational models.
Additionally, we explored various applications of database systems across domains such as airlines, banking, education, telecommunications, e-commerce, healthcare, finance, sales, and human resources.
The conversation expanded on the components and languages associated with DBMS, including Data Definition Language (DDL), Data Manipulation Language (DML), Data Control Language (DCL), and Transaction Control Language (TCL). We also discussed the three-level DBMS architecture, which includes internal, conceptual, and external levels, facilitating data abstraction and independence.
Further, we delved into database modeling and data models, distinguishing between high-level conceptual, representational, and low-level physical data models. We introduced the concept of semistructured data models, exemplified by XML, for handling diverse data types.
Lastly, we explored the multi-tiered client/server architecture and its two-tier and three-tier implementations. We discussed the roles of clients and servers, as well as the significance of the application server in web-based applications. The conversation concluded by categorizing DBMS into single-user and multi-user systems, discussing general-purpose and specific-purpose systems, and highlighting the relevance of online transaction processing (OLTP) systems.
Throughout this comprehensive discussion, we gained insights into the multifaceted world of database management systems, from their fundamental concepts to their real-world applications and evolving technologies.
How has the evolution of database management systems impacted the way organizations store and manage their data, particularly in the context of relational databases?
What are the key advantages of using a DBMS, and how do these advantages contribute to the efficiency and effectiveness of data management in various industries?
Can you provide real-world examples of how database systems are applied in different sectors, such as healthcare, finance, and e-commerce, and discuss the specific benefits they offer in each case?
How does data concurrency control in a multi-user environment influence the design and implementation of a database system, and what strategies can be employed to ensure data consistency?
Explore the concept of data abstraction in DBMS and its significance. How does it contribute to program-data independence, and why is this crucial in database design?
Compare and contrast the three generations of database systems, highlighting the key features and innovations that distinguish each generation.
What are the primary roles of different users involved in the database lifecycle, including database administrators, system analysts, and application programmers, and how do they collaborate to create an efficient database system?
Explain the concept of a three-level DBMS architecture and how it contributes to data independence. Can you provide examples of how each level is utilized in real-world scenarios?
Discuss the role of database modeling and data models in the design process. How do high-level conceptual, representational, and low-level physical data models contribute to creating a robust database system?
Reflect on the importance of client/server architecture in modern database systems. How does it leverage the processing power of client machines, and what are the benefits of two-tier and three-tier implementations in different application contexts?
[1] Introduction to Database Systems. ITL Education Solutions Limited. Pearson India, November 2008.
[2] Portions of the text were initially generated with ChatGPT (GPT-3.5) by OpenAI.