Upon completion of this lesson, you will be able to:
A database management system is required to manage large volumes of data and allow retrieval of that data through a high-level query language such as SQL. In a typical client/server architecture, a client application connects to the database server through a network connection (most commonly via a socket on a TCP/IP port) and sends SQL commands through that connection. The database server waits for commands from clients and then carries out the request. In a single-threaded server, one client request at a time is carried out, while in a multi-threaded server, multiple client requests can be processed concurrently. There are numerous activities involved in carrying out a single request: determine what data is to be retrieved and from which tables, locate the tables on the storage device, find the correct rows within the tables, and package the result set and return it. This process is referred to as query processing and involves translation of the SQL statement into low-level data access operations, query optimization, and actual access of the data on the physical storage device. Query optimization involves translation of the query from SQL into relational algebra, specifying alternative but equivalent query execution plans, scoring each execution plan based on time, memory use, and storage device access, and lastly determining the most efficient query plan for a particular mix of concurrent database accesses by multiple clients.
This lesson provides an overview of the steps that are performed when processing a client query, the algorithms that are commonly used to implement various relational algebra operations, and optimization strategies. The lesson also explains external sorting for ordering result sets in memory constrained environments.
Query processing and optimization in relational databases are critical components that ensure efficient data retrieval and manipulation. These processes involve several stages, from parsing the query to executing it in the most efficient manner possible. Here’s a brief overview of how query processing and optimization work in relational databases:
Query processing and optimization are ongoing areas of research and development, as database systems seek to handle increasingly complex queries over large datasets with maximum efficiency. The efficiency of these processes directly impacts the performance and scalability of database applications.
Processing a query and arriving at a result is done is three phases: parsing and translation, optimization, and execution. The process is illustrated in the diagram below.
During translation of the SQL query into an equivalent extended relational algebra expression, the parser performs a syntax check of the SQL statement and verifies that all table names, column names, and other object references do, in fact, exist in the database by consulting the database meta data.
As there are multiple ways to write a SQL query to retrieve some data, there are multiple ways to express the query in relational algebra. To assist in query planning, the relational algebra expression is represented as a parse tree. For example, the SQL statement below can be translated to two equivalent relational algebra expressions.
These are two equivalent relational algebra expressions for the above query.
The diagram below illustrates a parse tree for the first of the two queries. The second parse tree is analogous.
Once all possible parse trees for all equivalent query expressions have been created, then for each step in the parse tree, starting at the leaves, an operation is chosen that executes this step. each step in the tree results in a relation (table) that is used as the input to the step above. For example, the last step in the above tree is loading the table articles, the next step is to extract the columns price and title, and the final step is to do a linear scan of the result set from the projection. If there is no index, then an operation such as linear-scan might be called, while the availability of an index would use an index scan operation. Each operation has an associated “cost”, generally measured in terms of computational time but also taking into account memory. The query optimizer chooses the plan with the lowest “cost” given a set of memory, resource, and other constraints. Finally, the optimized query is given to the query evaluation engine which will actually carry out the operations in the query plan and produce a result for the query. That result is eventually packaged up into a network message and returned to the client where an database API unpacks the network message and builds a data structure in the programming language and returns that structure to the client application. In R, the API is via the package for the database (e.g., RSQLite or RMySQL) and the data structure for the result set is a data frame.
Most database management systems provide a mechanism or command for viewing the query plan for a query. For example, SQLite provides the EXPLAIN QUERY PLAN statement, while MySQL has an EXPLAIN statement. In SQLite, a query plan might look like this:
sqlite> EXPLAIN QUERY PLAN
SELECT *
FROM t1, t2
WHERE t1.a = 1 AND t1.b > 2;
QUERY PLAN
|--SEARCH t1 USING INDEX i2 (a=? AND b>?)
`--SCAN t2
By default (and by definition) a result set of a query is an unordered set of tuples. To ensure that the tuples are order, one must explicitly specify an ORDER BY clause in SQL. Ordering is accomplished by sorting the result set in some column or columns. However, sorting is an expensive operation and using common sorting algorithms, such as Quicksort or Shellsort, require all data to be in memory which is often unfeasible if the result of a query is too large. In such situations, the data must be sorted in secondary storage and not main memory using an external merge-sort algorithm. Naturally, if a resultant relation fits into main memory, then an internal sorting algorithm will be used.
The query optimizer and query execution engine will often look at query blocks of concurrently executing queries and cache results so that they can be shared.
Sorting is not only used when an ORDER BY clause is found in a query but is also often used to find (and eliminate) duplicates (perhaps when a DISTINCT constraint is present in the query).
The sort operation can be applied to the input relations before carrying out a join operator, union, or intersection in order to more efficiently carry out these operations. For example, a join on sorted relations is often faster than on unsorted relations. Of course, in the presence of applicable indexes, no sorting is necessary. For example, an index using a B*-tree can be traversed “in order” which will return the leaf nodes in sorted order. However, that operation can be expensive as accessing the index structure will require additional storage device access as well as using additional main memory. So, for large tables, a sorting of the actual tuples is often preferable.
Relational algebra operations can be carried out in different ways. There is no single “best way” and the choice of data access algorithm depends on the size of the relations, size of temporary relations, available main memory, speed of external data access, availability of indexes, and the mix of concurrently executing operations, as well as cached results from prior operations.
Common operations:
Use of Index: commonly used for selection and join operations; uses an index to find the tuples satisfying the query conditions; note that indexes with numeric keys are more efficient; there is always an index for the primary key column
Iteration: when indexes are not available, relations are scanned linearly and all tuples are examined to determine if they meet the selection of join criteria; generally the slowest as the algorithm has a run-time behavior of \(O(n)\); scanning can be done in memory if the relation is small or on the storage device if it is too large to fit into memory; if the relation is sorted then binary search can be used to find tuples that meet the selection conditions
Partitioning: tuples are partitioned using internal sorting and hashing algorithms to find tuples that meet selection criteria; hashing has fast lookup but creating the hash map is time consuming; sorting is slow (generally \(O(n log n)\)) but easy to partition
The overall goal of the query optimizer and query plan evaluation engine is to determine the most efficient way to process the query.
Joins are often the most expensive operations as the worst case outcome is that it requires a nested (double) loop algorithm where the query engine iterates over the outer relation while iterating over the inner relation for each element. That results in an expensive \(O(n^2)\) performance. This (simple but brute force) algorithm is illustrated below, showing how two relations \(R\) and \(S\) are joined on a common set of columns (commonly the primary key of one and the foreign key of the other, but that doesn’t have to be the case).
for each r in R
{
for each s in S
{
if r[PK] == S[FK] then rs = rs + (r+s)
}
}
The “cost” of a query is the foundation for choosing a query plan. There are several tuples of costs that must be taken into account:
cost of accessing secondary storage devices, such as disks, disk arrays, or solid-state drives; this includes cost of seeking, reading, data transfer to main memory, and writing
cost of storing intermediate results in secondary in storage and related access costs
cost of computation, i.e., CPU use
cost of storing results in main memory
cost of transferring results from servers or remote database via networks
A common misunderstanding concerns the “optimization of SQL” – there is little one can do as the query optimizer determines the best way to implement a query for a given database.
Query optimization depends on being able to transform queries into alternative forms that yield the same result. Two relational algebra expressions are said to be equivalent if they result in the same tuples in the result set, without regard to order, of course.
Equivalence rules provide specific equivalences and the more common rules are listed below.
Rule 1 - Cascading Select
\(\sigma_{c_1 \land c_2}(R) \equiv \sigma_{c_1}(\sigma_{c_2}(R))\)
Rule 2 - Commutative Select
\(\sigma_{c_1}(\sigma_{c_2}(R)) \equiv \sigma_{c_2}(\sigma_{c_1}(R))\)
Rule 3 - Cascading Project
\(\pi_{a_1}(\pi_{a_2}(\ldots \pi_{a_n}(R) \ldots)) \equiv \pi_{a_1}(R)\)
Rule 4 - Commutative Project
\(\pi_{a_1,a_2,\ldots,a_n}(\sigma_{c}(R)) \equiv \sigma_{c}(\pi_{a_1,a_2,\ldots,a_n}(R))\),
if
\(a_1,a_2,\ldots,a_n \in c\)
Rule 5 - Select in Join
\(\sigma_{c}(R_1 \times R_2) \equiv R_1 \bowtie_{c} R_2\)
\(\sigma_{c_1}(R_1 \bowtie_{c_2} R_2 \equiv R_1 \bowtie_{c_1 \land c_2} R_2\)
where \(\times\) is the cross product of two relations.
Rule 6 - Commutative Join
\(R_1 \times R_2 \equiv R_2 \times R_1\)
\(R_1 \bowtie_c R_2 \equiv R_2 \bowtie_c R_1\)
Rule 7 - Associative Join
\((R_1 \times R_2) \times R_3 \equiv R_1 \times (R_2 \times R_3)\)
\((R_1 \bowtie R_2) \bowtie R_3 \equiv R_1 \bowtie (R_2 \bowtie R_3)\)
Rule 8A - Distributed Select
\(\sigma_{c_1}(R_1 \bowtie_c R_2) \equiv (\sigma_{c_1}(R_1)) \bowtie_c R_2\)
if \(c_1\) includes only attributes of either \(R_1\) or \(R_2\)
Rule 8B - Distributed Select
\(\sigma_{c_1 \land c_2}(R_1 \bowtie_c R_2) \equiv (\sigma_{c_1}(R_1)) \bowtie_c (\sigma_{c_2}(R_2))\)
if \(c_1\) includes only attributes of \(R_1\) and \(c_2\) only includes the attributes of \(R_2\)
Rule 9 - Distributed Project
\(\pi_{A_1 \cup A_2}(R_1 \bowtie_c R_2) \equiv (\pi_{A_1}(R_1)) \bowtie_c (\pi_{A_2}(R_2))\)
if \(c\) only uses attributes in \(A_1 \cup A_2\) and \(A_i\) are only attributes in \(R_i\)
Rule 10 - Commutative Union and Intersection
\(R_1 \cup R_2 \equiv R_2 \cup R_1\) \(R_1 \cap R_2 \equiv R_2 \cap R_1\)
Rule 11 - Associate Union and Intersection
\(R_1 \cup (R_2 \cup R_3) \equiv (R_1 \cup R_2) \cup R_3)\)
\(R_1 \cap (R_2 \cap R_3) \equiv (R_1 \cap R_2) \cap R_3)\)
Rule 12 - Distributed Project over Union
\(\pi_A(R_1 \cup R_2) \equiv \pi_A(R_1) \cup \pi_A(R_2)\)
where \(A\) are attributes of either \(R_1\) or \(R_2\) or both
Query processing start with a translation of the SQL statement into low-level data access operations, query optimization, and actual access of the data on the physical storage device.
None.
None collected yet. Let us know.