Introduction

Subqueries and the WITH statement are useful query formulation mechanisms for writing complex queries. This lesson demonstrates, using examples, how to use both in queries and how to simplify database access using the construct.

Sample Database

For many of the examples we will use the Sakila Media and Film database. An ERD of the database is shown below. The examples are executed on SQLite using sakiladb.sqlitedb1.

Subqueries

A subquery, also known as a nested query or inner query, is an SQL query that is embedded within another SQL query. It can be used in various parts of a query, including SELECT, FROM, WHERE, and HAVING clauses. Subqueries are primarily used for performing operations that require multiple steps in data processing. They allow for more flexible and efficient data retrieval, enabling complex operations to be executed in a single query.

Benefits of Subqueries

Subqueries offer several benefits, including:

  • Enhancing Data Retrieval: Subqueries provide a powerful tool for retrieving data in complex situations where the use of standard SQL queries might be limiting or inefficient.
  • Complex Data Relationships: They are particularly useful in handling scenarios where data relationships are complex and require multiple layers of filtering or selection.
  • Nested Logic: Subqueries offer a way to implement nested logic in SQL queries, allowing for sophisticated data analysis and manipulation.

Characteristics of Subqueries

  • Return Types: Subqueries can return individual values, a single row, multiple rows, or even a table-like structure.
  • Independent vs. Correlated: They can be independent of the outer query (standalone) or correlated, where the subquery depends on data from the outer query.
  • Placement Flexibility: Subqueries can be used in various places within a query, such as:
    • In the SELECT clause, to add columns to the main query.
    • In the FROM clause, to create a derived table that the main query can use.
    • In the WHERE or HAVING clauses, to filter records based on complex conditions.

Advantages and Limitations

  • Advantages:
    • Modularity: Breaking down complex queries into simpler, more manageable parts.
    • Readability: Improving the readability of SQL scripts by encapsulating specific logic.
    • Reusability: Facilitating the reuse of code and logic within the database.
  • Limitations:
    • Performance Concerns: Nested queries can sometimes lead to performance issues, particularly if not optimized correctly.
    • Complexity: Overuse or misuse of subqueries can make SQL statements more complex and harder to maintain.

Simple Example

How many payments have a below average amount?

This query requires us to know the average payment amount, but that amount varies with the data, so it cannot be hard coded; a subquery that calculates the account and used where the average is needed resolves the issue. Of course, since average is a scalar value, the subquery must return one value, but because the result must be a table, the subquery actually returns a table with one row and one column.

SELECT COUNT(payment_id) AS numGrtAvg
FROM Payment
WHERE amount < (SELECT AVG(amount) FROM Payment);
Table 1: 1 records
numGrtAvg
8303

This section lays the foundation for understanding subqueries, setting the stage for more detailed exploration in the following sections of the lesson.

Expanding on the types of subqueries in SQL:

Types of Subqueries

Single-Row Subqueries

Definition: These return a single row from the inner query. They’re typically used with single-row comparison operators like =, >, <, <=, >=, and <>.

Use Cases: Ideal for situations where you expect the subquery to return exactly one row. For instance, finding an employee whose salary is equal to the maximum salary in a department or the value of a column is above or below average.

Example:

SELECT payment_id, amount 
  FROM Payment 
 WHERE amount = (SELECT MAX(amount) FROM Payment)
Table 2: Displaying records 1 - 10
payment_id amount
342 11.99
3146 11.99
5280 11.99
5281 11.99
5550 11.99
6409 11.99
8272 11.99
9803 11.99
15821 11.99
15850 11.99

The example above finds those payments that are for the biggest (maximum) amount.

Exercise

Try the query below for yourself on the sample database.

Query

Which customer (last name, first name, and ID) purchased the most in terms of total amount?

Solution

TBD

Multiple-Row Subqueries

Definition: These return multiple rows and are used with existential or set membership operators like IN, ANY, ALL, or comparisons combined with ANY or ALL.

Use Cases: Useful when the subquery is expected to return more than one row. For example, retrieving all orders that match any of the order IDs in a list.

Alternative: A simple inner-join or a theta-join can often be used instead.

Example:

The query below finds how many cities are located in a country that starts with a ‘B’.

SELECT count(*) AS `NumCities`
  FROM City 
 WHERE country_id IN 
    (SELECT country_id 
       FROM Country 
      WHERE country LIKE 'B%')
Table 3: 1 records
NumCities
39

Correlated Subqueries

Definition: A correlated subquery is one that uses values from the outer query. It is executed repeatedly, once for each row that might be selected by the outer query.

Use Cases: Effective in scenarios where each row in the outer query may need different results from the subquery. For instance, finding employees who earn more than the average salary in their respective departments.

Example:

SELECT staff_id, amount
  FROM Payment AS P
 WHERE amount > (SELECT AVG(amount)
                   FROM Payment
                  WHERE staff_id = P.staff_id)
Table 4: Displaying records 1 - 10
staff_id amount
1 5.99
2 9.99
1 4.99
1 4.99
2 5.99
1 5.99
1 4.99
1 4.99
1 7.99
1 4.99

Performance Note: Correlated subqueries can be less efficient because the subquery may need to be executed multiple times. A join with a “group by” is often simpler and preferable as it executes generally much faster and the database can use indexes.

Subquery as a Derived Table

Definition: This involves using a subquery in the FROM clause of the main query. The result of the subquery acts as a temporary table.

Use Cases: Useful for creating complex joins and aggregations. For example, summarizing data in a subquery before joining it to another table.

Example:

SELECT C.customer_id, C.last_name, P.total_amount
  FROM Customer AS C JOIN (SELECT customer_id, sum(amount) AS `total_amount`
                            FROM Payment
                           GROUP BY customer_id) AS P
                     ON (C.customer_id = P.customer_id)
Table 5: Displaying records 1 - 10
customer_id last_name total_amount
1 SMITH 118.68
2 JOHNSON 128.73
3 WILLIAMS 135.74
4 JONES 81.78
5 BROWN 144.62
6 DAVIS 93.72
7 MILLER 151.67
8 WILSON 92.76
9 MOORE 89.77
10 TAYLOR 99.75

Note how the subquery was provided with an alias using AS.

Existential Subqueries

Definition: These are used to test for the existence of rows in a subquery.

Use Cases: Commonly used in conditional logic, like checking if a certain condition is met in another table before performing an operation.

Example:

SELECT DISTINCT A.last_name, A.first_name
  FROM Film_Actor AS FA JOIN Actor AS A USING (actor_id)
 WHERE EXISTS
         (SELECT * FROM Film 
           WHERE film_id = FA.film_id AND length > 180)
 LIMIT 3;
Table 6: 3 records
last_name first_name
GUINESS PENELOPE
CHASE ED
DAVIS JENNIFER

Considerations

Unnecessary Complexity: Be careful not to overuse subqueries, as they can make the query complex and difficult to understand.

Subquery Optimization: It’s important to understand how subqueries can impact performance and to optimize them accordingly. For example, minimizing the number of rows processed by using appropriate WHERE clauses.

Nested Subqueries: Subqueries can be nested inside other subqueries, but this should be done judiciously to avoid excessive complexity and potential performance issues. While often simpler for some programmers to express a query, joins and group by are preferable when an option.

Linear Scans: Subqueries often result in query plans that use linear table scans rather than available indexes and are thus often vastly less efficient.

Independent Testing: Subqueries should be tested independently to ensure that the correct results are returned before incorporating them into other queries.

Each type of subquery offers unique functionalities and is suited for different scenarios in SQL querying. Understanding these types will help data analysts effectively use subqueries to solve complex data retrieval problems.

Common Table Expressions

The SQL WITH statement, also known as Common Table Expressions (CTEs), is a useful feature for creating temporary result sets that can be referenced within a SELECT, INSERT, UPDATE, or DELETE statement. CTEs are similar to the ρ (rename) operator in relational algebra.

Basic Syntax

The basic syntax of the WITH statement is as follows:

WITH CTE_Name AS (
    -- Your SQL query here
)

SELECT * FROM CTE_Name;

Here, CTE_Name is a temporary result set that you define within the WITH clause. This result set can then be used in the main SQL query that follows in place of any table.

Advantages of Using WITH

  1. Readability and Maintenance: CTEs can make complex queries more readable and maintainable by breaking them into simpler, modular parts.

  2. Recursive Queries: CTEs enable recursive queries, which are useful for hierarchical data processing, like tree structures.

  3. Performance Optimization: In some cases, CTEs can improve performance by simplifying complex joins and filters.

  4. Reusability: A single CTE can be referenced multiple times in the main query, avoiding the repetition of the same subquery.

Examples

The Sakila sample database is a good resource for demonstrating real-world applications of CTEs. Here are a few use cases:

1. Simplifying Complex Joins

Suppose you want to find films with the highest rental rates in each rating category. This might involve complex joins and aggregations. Using a CTE, you can simplify it:

WITH HighestRentalRates AS (
    SELECT rating, MAX(rental_rate) AS MaxRate
    FROM Film
    GROUP BY rating
)

SELECT f.film_id, f.title, f.rating, f.rental_rate
  FROM Film f INNER JOIN HighestRentalRates h 
              ON (f.rating = h.rating AND f.rental_rate = h.MaxRate);
Table 7: Displaying records 1 - 10
film_id title rating rental_rate
2 ACE GOLDFINGER G 4.99
7 AIRPLANE SIERRA PG-13 4.99
8 AIRPORT POLLOCK R 4.99
10 ALADDIN CALENDAR NC-17 4.99
13 ALI FOREVER PG 4.99
20 AMELIE HELLFIGHTERS R 4.99
21 AMERICAN CIRCUS R 4.99
28 ANTHEM LUKE PG-13 4.99
31 APACHE DIVINE NC-17 4.99
32 APOCALYPSE FLAMINGOS R 4.99

2. Recursive Queries

If the Sakila database had hierarchical data, like a table representing film categories and subcategories, you could use recursive CTEs to traverse this hierarchy. But it’s important to note that the Sakila database does not inherently include hierarchical data like categories and subcategories. For the purpose of this example, I’ll assume such a structure exists or you’re willing to adapt the example to your specific use case.

Assuming we have a table film_category with a hierarchical structure (which is not actually present in the Sakila database), the query might look like this:

WITH RECURSIVE SubCategory AS (
    SELECT category_id, parent_category_id
      FROM film_category
     WHERE parent_category_id IS NULL
    
    UNION ALL
    
    SELECT fc.category_id, sc.parent_category_id
      FROM film_category fc INNER JOIN SubCategory sc 
                            ON sc.parent_category_id = sc.category_id
                            
)

SELECT * FROM SubCategory;

3. Analyzing Patterns

Let’s say you want to analyze rental patterns, like finding customers who rented more than five movies in a month. A CTE can be used to aggregate rental data before joining it with customer data:

WITH MonthlyRentals AS (
    SELECT customer_id, strftime('%m', rental_date) AS rental_month, COUNT(*) AS total_rentals
      FROM rental
     GROUP BY customer_id, strftime('%m', rental_date)
    HAVING COUNT(*) > 5
)
SELECT c.customer_id, c.first_name, c.last_name, mr.rental_month, mr.total_rentals
  FROM customer c
       JOIN MonthlyRentals mr ON c.customer_id = mr.customer_id;
Table 8: Displaying records 1 - 10
customer_id first_name last_name rental_month total_rentals
1 MARY SMITH 06 7
1 MARY SMITH 07 12
1 MARY SMITH 08 11
2 PATRICIA JOHNSON 07 14
2 PATRICIA JOHNSON 08 11
3 LINDA WILLIAMS 07 13
3 LINDA WILLIAMS 08 7
4 BARBARA JONES 06 6
4 BARBARA JONES 08 11
5 ELIZABETH BROWN 07 16

SQLite has slightly different syntax and capabilities compared to other SQL databases like Oracle’s version used in Sakila.

In SQLite, the EXTRACT function isn’t available, so we’ll use strftime to extract the month from the rental_date.

Considerations

In summary, the SQL WITH statement is a versatile tool in the SQL developer’s arsenal, enabling the creation of more readable, maintainable, and often more efficient SQL queries. The examples using the Sakila film database illustrate how CTEs can be used for simplifying complex queries, working with recursive data, and performing detailed data analysis.

Partitions

The PARTITION BY clause in SQL is a powerful feature used in conjunction with window functions. It allows you to partition a result set into groups and perform calculations or apply functions to each group (partition) independently. Here’s a detailed explanation:

1. Purpose

  • Purpose: PARTITION BY divides the result set into partitions to which the window function is applied. Instead of treating the entire result set as a single group, it enables calculations across a set of rows that share common properties.

  • Use with Window Functions: It’s used with window functions like ROW_NUMBER(), RANK(), SUM(), AVG(), etc. These functions perform a calculation across a set of table rows related to the current row.

2. Use Cases

  • Data Analysis: For analyzing trends within subgroups of data. For instance, finding the average sales per department in a store.
  • Ranking: Assigning ranks within categories, such as ranking students by score within each class.
  • Running Totals: Calculating running totals, averages, or other cumulative measures within each partition.
  • Segmentation: Segmenting data into groups for more focused analysis, like monthly expenses per department.

3. Benefits

  • Improved Data Organization: It enables more granular data analysis without the need for complex subqueries or temporary tables.
  • Efficiency: More efficient than self-joins or subqueries for certain types of calculations.
  • Flexibility: Offers flexibility in data analysis, allowing for complex calculations across different segments of data.

4. Limitations

  • Performance: On large datasets, partitioning can lead to performance issues if not indexed or optimized properly.
  • Complexity: The syntax can be complex, especially for beginners or for complex analytical queries.
  • Database Support: Not all databases support all window functions, and there can be differences in implementation.

5. Examples

Assuming the same Sakila database and SQLite, let’s consider some examples:

Example 1: Ranking Customers by Amount Spent

SELECT customer_id, amount,
       RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) as rank
FROM payment
LIMIT 5;
Table 9: 5 records
customer_id amount rank
1 9.99 1
1 7.99 2
1 5.99 3
1 5.99 3
1 5.99 3
  • Description: This ranks customers based on the amount spent, partitioning by customer_id. The LIMIT is arbitrary and only serves to reduce the result set for display in this lesson.

Example 2: Running Total of Payments by Customer

SELECT customer_id, payment_date, amount,
       SUM(amount) OVER (PARTITION BY customer_id ORDER BY payment_date) as running_total
FROM payment
Table 10: Displaying records 1 - 10
customer_id payment_date amount running_total
1 2005-05-25 11:30:37 2.99 2.99
1 2005-05-28 10:35:23 0.99 3.98
1 2005-06-15 00:54:12 5.99 9.97
1 2005-06-15 18:02:53 0.99 10.96
1 2005-06-15 21:08:46 9.99 20.95
1 2005-06-16 15:18:57 4.99 25.94
1 2005-06-18 08:41:48 4.99 30.93
1 2005-06-18 13:33:59 0.99 31.92
1 2005-06-21 06:24:45 3.99 35.91
1 2005-07-08 03:17:05 5.99 41.90
  • Description: Calculates a running total of payments for each customer over time.

Conclusion

PARTITION BY is a versatile tool in SQL that enhances data analysis capabilities by enabling segment-specific calculations. While it has limitations, particularly in performance and complexity, its benefits in organizing and processing data for analytical queries are significant.

PARTION BY vs GROUP BY

PARTITION BY and GROUP BY are both SQL clauses used to organize data into subsets, but they serve different purposes and are used in different contexts.

PARTITION BY

  • Context: Used with window functions.
  • Functionality: It organizes data into partitions, allowing you to perform calculations across each partition. Each partition can be considered a ‘window’ over which SQL functions operate.
  • Row Retention: It does not change the number of rows in the query’s result. Each row in the original dataset remains in the output, along with any additional columns created by the window function.
  • Use Cases: Commonly used for ranking, calculating running totals, moving averages, and other complex analytics that require access to multiple rows of the dataset at once, without grouping them into a single output row.

Example of PARTITION BY

SELECT id, department, salary,
       AVG(salary) OVER (PARTITION BY department) as avg_department_salary
FROM employees;

Here, PARTITION BY calculates the average salary per department, but every row from the original employees table is retained in the output.

GROUP BY

  • Context: Used in aggregate queries.
  • Functionality: It groups rows that have the same values in specified columns into summary rows, like “find the total salary per department”.
  • Row Reduction: It reduces the number of rows in the result, providing one row per group. You can only select the grouped columns and the results of aggregate functions.
  • Use Cases: Ideal for summaries, such as totals, averages, counts, min, max, etc., where you need a single result row per group.

Example of GROUP BY

SELECT department, AVG(salary) as avg_department_salary
FROM employees
GROUP BY department;

Here, GROUP BY provides the average salary per department, but the output will only have one row per department.

Key Differences

  1. Row Count: PARTITION BY retains the original row count, whereas GROUP BY consolidates rows into groups, reducing the row count.
  2. Use with Functions: PARTITION BY is used with window functions, while GROUP BY is used with aggregate functions.
  3. Result Set: PARTITION BY adds additional information to each row, whereas GROUP BY provides a summarized result set.
  4. Scope of Calculation: In PARTITION BY, calculations are done within each partition and don’t affect other partitions. In GROUP BY, calculations are done on grouped data, producing a single result for each group.

Examples

Let’s use the Sakila database with a focus on a scenario where we might want to analyze the film rental data. We will compare the usage of PARTITION BY and GROUP BY in SQL queries using this context. Note that the structure of the Sakila database might differ slightly based on the adaptation for SQLite, but the general concept remains the same.

Scenario: Analyzing Film Rentals

We want to analyze film rentals, looking at both the total number of rentals per category and the rank of each film within its category based on the number of times it has been rented.

1. Using GROUP BY

Objective: To find the total number of rentals for each film category.

SQL Query:

SELECT category.name AS category_name, COUNT(rental.rental_id) AS total_rentals
  FROM category
       JOIN film_category ON category.category_id = film_category.category_id
       JOIN film ON film_category.film_id = film.film_id
       JOIN inventory ON film.film_id = inventory.film_id
       JOIN rental ON inventory.inventory_id = rental.inventory_id
 GROUP BY category.name;
Table 11: Displaying records 1 - 10
category_name total_rentals
Action 1112
Animation 1166
Children 945
Classics 939
Comedy 941
Documentary 1050
Drama 1060
Family 1096
Foreign 1033
Games 969

Explanation: - This query groups films by their category and counts the total number of rentals for each category. - The result will be a summary table with one row per category, showing the total rentals.

2. Using PARTITION BY

Objective: To rank each film within its category based on the number of times it has been rented.

SQL Query:

SELECT film.title, category.name AS category_name, 
       RANK() OVER (PARTITION BY category.name ORDER BY COUNT(rental.rental_id) DESC) AS rental_rank
  FROM category
       JOIN film_category ON category.category_id = film_category.category_id
       JOIN film ON film_category.film_id = film.film_id
       JOIN inventory ON film.film_id = inventory.film_id
       JOIN rental ON inventory.inventory_id = rental.inventory_id
 GROUP BY film.title, category.name;
Table 12: Displaying records 1 - 10
title category_name rental_rank
RUGRATS SHAKESPEARE Action 1
SUSPECTS QUILLS Action 1
HANDICAP BOONDOCK Action 3
STORY SIDE Action 3
TRIP NEWTON Action 3
PRIMARY GLASS Action 6
FANTASY TROOPERS Action 7
STAGECOACH ARMAGEDDON Action 7
CLUELESS BUCKET Action 9
HILLS NEIGHBORS Action 9

Explanation: - This query ranks films within each category based on their rental count. - The PARTITION BY clause is used to create partitions for each film category, and within each partition, films are ranked by their rental counts. - Unlike GROUP BY, this query does not reduce the number of rows in the output. Each film retains its row, but with additional information (rank) added.

Key Differences Illustrated:

  • Aggregation vs. Windowing: The GROUP BY query aggregates data, reducing the result to one row per category, while the PARTITION BY query applies a window function within each category, retaining individual film rows.
  • Output: GROUP BY produces a summary table, whereas PARTITION BY adds additional ranking information to each film within its category.

These examples highlight how GROUP BY is used for aggregating data into summary form, whereas PARTITION BY is used for performing calculations within subsets of data while retaining the original data structure.

Summary

This lesson covered the use of subqueries and the WITH statement in SQL for formulating complex queries, and the use of PARTITION BY to apply window functions to groups of data, using the Sakila Media and Film database for examples.

Subqueries

  • Definition: Subqueries are SQL queries embedded within another SQL query, used in SELECT, FROM, WHERE, and HAVING clauses.
  • Benefits: They enhance data retrieval in complex scenarios, handle complex data relationships, and implement nested logic.
  • Characteristics: Subqueries can return various types of results and can be independent or correlated with the main query.
  • Advantages and Limitations: They offer modularity and improved readability but can cause performance issues and add complexity.
  • Types of Subqueries: These include single-row, multiple-row, correlated, subqueries as derived tables, and existential subqueries, each with specific use cases and examples.

Common Table Expressions (CTEs)

  • Basics: The WITH statement creates temporary result sets for use in SQL queries.
  • Advantages: CTEs enhance readability, allow for recursive queries, and can improve performance.
  • Use Cases in Sakila Database: Examples are provided, like finding films with the highest rental rates and analyzing rental patterns, although it’s noted that the Sakila database doesn’t inherently include hierarchical data for some of the more complex examples.

Partitions

The purpose of partitions and the PARTITION BY clause is to apply window functions to partitions of data. The is often necessary for data analysis and ranking and provides improved data organization, but its use can lead to performance issues on large datasets.

Difference between PARTITION BY and GROUP BY: The distinction between these two clauses was clarified. PARTITION BY, used with window functions, retains the original row count and allows calculations within partitions. Conversely, GROUP BY, used for aggregate queries, reduces the number of rows to one per group and is ideal for summaries.

Tutorial

In this advanced tutorial video, James Oliver explains unions, partitions, recursion, and hierarchies for constructing complex and efficient SQL queries.


Files & Resources

All Files for Lesson 70.107

Errata

Let us know.


  1. Sakila Database is licensed under the New BSD License and may be used for any purpose with the provision that notice of © 2020 Oracle Corporation appears.↩︎

---
title: "Advanced SQL: Subqueries, Paritions, and Common Table Expressions"
params:
  category: 70
  number: 107
  time: 45
  level: beginner
  tags: "sql,SELECT,CTE,WITH,PARTITION BY, GROUP BY,NTILE"
  description: "Explains how to define complex queries in SQL using subqueries,
                as well as Common Table Expressions (CTE) using WITH,
                and how to apply window functions to data partitions using
                PARTITIOn BY. Illustrates the difference between GROUP BY
                and PARTITION BY."
date: "<small>`r Sys.Date()`</small>"
author: "<small>Martin Schedlbauer</small>"
email: "m.schedlbauer@neu.edu"
affilitation: "Northeastern University"
output: 
  bookdown::html_document2:
    toc: true
    toc_float: true
    collapsed: false
    number_sections: false
    code_download: true
    theme: spacelab
    highlight: tango
---

---
title: "<small>`r params$category`.`r params$number`</small><br/><span style='color: #2E4053; font-size: 0.9em'>`r rmarkdown::metadata$title`</span>"
---

```{r code=xfun::read_utf8(paste0(here::here(),'/R/_insert2DB.R')), include = FALSE}
```

## Introduction

Subqueries and the WITH statement are useful query formulation mechanisms for writing complex queries. This lesson demonstrates, using examples, how to use both in queries and how to simplify database access using the construct.

## Sample Database

For many of the examples we will use the Sakila Media and Film database. An ERD of the database is shown below. The examples are executed on SQLite using [sakiladb.sqlitedb](sakiladb.sqlitedb)[^1].

[^1]: Sakila Database is licensed under the New BSD License and may be used for any purpose with the provision that notice of © 2020 Oracle Corporation appears.

![](SakilaDB.png){width="70%"}

```{r openSakilaDB, eval=T, echo=F}
library(RSQLite)
dbcon <- dbConnect(RSQLite::SQLite(), "sakila.sqlitedb")
```

## Subqueries

A subquery, also known as a nested query or inner query, is an SQL query that is embedded within another SQL query. It can be used in various parts of a query, including `SELECT`, `FROM`, `WHERE`, and `HAVING` clauses. Subqueries are primarily used for performing operations that require multiple steps in data processing. They allow for more flexible and efficient data retrieval, enabling complex operations to be executed in a single query.

### Benefits of Subqueries

Subqueries offer several benefits, including:

-   **Enhancing Data Retrieval**: Subqueries provide a powerful tool for retrieving data in complex situations where the use of standard SQL queries might be limiting or inefficient.
-   **Complex Data Relationships**: They are particularly useful in handling scenarios where data relationships are complex and require multiple layers of filtering or selection.
-   **Nested Logic**: Subqueries offer a way to implement nested logic in SQL queries, allowing for sophisticated data analysis and manipulation.

### Characteristics of Subqueries

-   **Return Types**: Subqueries can return individual values, a single row, multiple rows, or even a table-like structure.
-   **Independent vs. Correlated**: They can be independent of the outer query (standalone) or correlated, where the subquery depends on data from the outer query.
-   **Placement Flexibility**: Subqueries can be used in various places within a query, such as:
    -   In the `SELECT` clause, to add columns to the main query.
    -   In the `FROM` clause, to create a derived table that the main query can use.
    -   In the `WHERE` or `HAVING` clauses, to filter records based on complex conditions.

### Advantages and Limitations

-   **Advantages**:
    -   **Modularity**: Breaking down complex queries into simpler, more manageable parts.
    -   **Readability**: Improving the readability of SQL scripts by encapsulating specific logic.
    -   **Reusability**: Facilitating the reuse of code and logic within the database.
-   **Limitations**:
    -   **Performance Concerns**: Nested queries can sometimes lead to performance issues, particularly if not optimized correctly.
    -   **Complexity**: Overuse or misuse of subqueries can make SQL statements more complex and harder to maintain.

### Simple Example

*How many payments have a below average amount?*

This query requires us to know the average payment amount, but that amount varies with the data, so it cannot be hard coded; a subquery that calculates the account and used where the average is needed resolves the issue. Of course, since average is a scalar value, the subquery must return one value, but because the result must be a table, the subquery actually returns a table with one row and one column.

```{sql connection=dbcon}
SELECT COUNT(payment_id) AS numGrtAvg
FROM Payment
WHERE amount < (SELECT AVG(amount) FROM Payment);
```

This section lays the foundation for understanding subqueries, setting the stage for more detailed exploration in the following sections of the lesson.

Expanding on the types of subqueries in SQL:

## Types of Subqueries

### Single-Row Subqueries

**Definition**: These return a single row from the inner query. They're typically used with single-row comparison operators like `=`, `>`, `<`, `<=`, `>=`, and `<>`.

**Use Cases**: Ideal for situations where you expect the subquery to return exactly one row. For instance, finding an employee whose salary is equal to the maximum salary in a department or the value of a column is above or below average.

**Example**:

```{sql connection=dbcon}
SELECT payment_id, amount 
  FROM Payment 
 WHERE amount = (SELECT MAX(amount) FROM Payment)
```

The example above finds those payments that are for the biggest (maximum) amount.

#### Exercise {.tabset}

Try the query below for yourself on the sample database.

##### Query {.active}

*Which customer (last name, first name, and ID) purchased the most in terms of total amount?*

##### Solution

TBD

### Multiple-Row Subqueries

**Definition**: These return multiple rows and are used with existential or set membership operators like `IN`, `ANY`, `ALL`, or comparisons combined with `ANY` or `ALL`.

**Use Cases**: Useful when the subquery is expected to return more than one row. For example, retrieving all orders that match any of the order IDs in a list.

**Alternative**: A simple inner-join or a theta-join can often be used instead.

**Example**:

The query below finds how many cities are located in a country that starts with a 'B'.

```{sql connection=dbcon}
SELECT count(*) AS `NumCities`
  FROM City 
 WHERE country_id IN 
    (SELECT country_id 
       FROM Country 
      WHERE country LIKE 'B%')
```

### Correlated Subqueries

**Definition**: A correlated subquery is one that uses values from the outer query. It is executed repeatedly, *once for each row that might be selected by the outer query*.

**Use Cases**: Effective in scenarios where each row in the outer query may need different results from the subquery. For instance, finding employees who earn more than the average salary in their respective departments.

**Example**:

```{sql connection=dbcon}
SELECT staff_id, amount
  FROM Payment AS P
 WHERE amount > (SELECT AVG(amount)
                   FROM Payment
                  WHERE staff_id = P.staff_id)
```

**Performance Note**: Correlated subqueries can be less efficient because the subquery may need to be executed multiple times. A join with a "group by" is often simpler and preferable as it executes generally much faster and the database can use indexes.

### Subquery as a Derived Table

**Definition**: This involves using a subquery in the `FROM` clause of the main query. The result of the subquery acts as a temporary table.

**Use Cases**: Useful for creating complex joins and aggregations. For example, summarizing data in a subquery before joining it to another table.

**Example**:

```{sql connection=dbcon}
SELECT C.customer_id, C.last_name, P.total_amount
  FROM Customer AS C JOIN (SELECT customer_id, sum(amount) AS `total_amount`
                            FROM Payment
                           GROUP BY customer_id) AS P
                     ON (C.customer_id = P.customer_id)
```

Note how the subquery was provided with an alias using *AS*.

### Existential Subqueries

**Definition**: These are used to test for the existence of rows in a subquery.

**Use Cases**: Commonly used in conditional logic, like checking if a certain condition is met in another table before performing an operation.

**Example**:

```{sql connection=dbcon}
SELECT DISTINCT A.last_name, A.first_name
  FROM Film_Actor AS FA JOIN Actor AS A USING (actor_id)
 WHERE EXISTS
         (SELECT * FROM Film 
           WHERE film_id = FA.film_id AND length > 180)
 LIMIT 3;
```

### Considerations

**Unnecessary Complexity**: Be careful not to overuse subqueries, as they can make the query complex and difficult to understand.

**Subquery Optimization**: It's important to understand how subqueries can impact performance and to optimize them accordingly. For example, minimizing the number of rows processed by using appropriate WHERE clauses.

**Nested Subqueries**: Subqueries can be nested inside other subqueries, but this should be done judiciously to avoid excessive complexity and potential performance issues. While often simpler for some programmers to express a query, joins and group by are preferable when an option.

**Linear Scans**: Subqueries often result in query plans that use linear table scans rather than available indexes and are thus often vastly less efficient.

**Independent Testing**: Subqueries should be tested independently to ensure that the correct results are returned before incorporating them into other queries.

Each type of subquery offers unique functionalities and is suited for different scenarios in SQL querying. Understanding these types will help data analysts effectively use subqueries to solve complex data retrieval problems.

## Common Table Expressions

The SQL `WITH` statement, also known as Common Table Expressions (CTEs), is a useful feature for creating temporary result sets that can be referenced within a `SELECT`, `INSERT`, `UPDATE`, or `DELETE` statement. CTEs are similar to the ρ (rename) operator in relational algebra.

### Basic Syntax

The basic syntax of the `WITH` statement is as follows:

``` sql
WITH CTE_Name AS (
    -- Your SQL query here
)

SELECT * FROM CTE_Name;
```

Here, `CTE_Name` is a temporary result set that you define within the `WITH` clause. This result set can then be used in the main SQL query that follows in place of any table.

### Advantages of Using `WITH`

1.  **Readability and Maintenance**: CTEs can make complex queries more readable and maintainable by breaking them into simpler, modular parts.

2.  **Recursive Queries**: CTEs enable recursive queries, which are useful for hierarchical data processing, like tree structures.

3.  **Performance Optimization**: In some cases, CTEs can improve performance by simplifying complex joins and filters.

4.  **Reusability**: A single CTE can be referenced multiple times in the main query, avoiding the repetition of the same subquery.

### Examples

The Sakila sample database is a good resource for demonstrating real-world applications of CTEs. Here are a few use cases:

#### 1. Simplifying Complex Joins

Suppose you want to find films with the highest rental rates in each rating category. This might involve complex joins and aggregations. Using a CTE, you can simplify it:

```{sql connection=dbcon}
WITH HighestRentalRates AS (
    SELECT rating, MAX(rental_rate) AS MaxRate
    FROM Film
    GROUP BY rating
)

SELECT f.film_id, f.title, f.rating, f.rental_rate
  FROM Film f INNER JOIN HighestRentalRates h 
              ON (f.rating = h.rating AND f.rental_rate = h.MaxRate);
```

#### 2. Recursive Queries

If the Sakila database had hierarchical data, like a table representing film categories and subcategories, you could use recursive CTEs to traverse this hierarchy. But it's important to note that the Sakila database does not inherently include hierarchical data like categories and subcategories. For the purpose of this example, I'll assume such a structure exists or you're willing to adapt the example to your specific use case.

Assuming we have a table `film_category` with a hierarchical structure (which is not actually present in the Sakila database), the query might look like this:

``` sql
WITH RECURSIVE SubCategory AS (
    SELECT category_id, parent_category_id
      FROM film_category
     WHERE parent_category_id IS NULL
    
    UNION ALL
    
    SELECT fc.category_id, sc.parent_category_id
      FROM film_category fc INNER JOIN SubCategory sc 
                            ON sc.parent_category_id = sc.category_id
                            
)

SELECT * FROM SubCategory;
```

#### 3. Analyzing Patterns

Let's say you want to analyze rental patterns, like finding customers who rented more than five movies in a month. A CTE can be used to aggregate rental data before joining it with customer data:

```{sql connection=dbcon}
WITH MonthlyRentals AS (
    SELECT customer_id, strftime('%m', rental_date) AS rental_month, COUNT(*) AS total_rentals
      FROM rental
     GROUP BY customer_id, strftime('%m', rental_date)
    HAVING COUNT(*) > 5
)
SELECT c.customer_id, c.first_name, c.last_name, mr.rental_month, mr.total_rentals
  FROM customer c
       JOIN MonthlyRentals mr ON c.customer_id = mr.customer_id;

```

SQLite has slightly different syntax and capabilities compared to other SQL databases like Oracle's version used in Sakila.

In SQLite, the `EXTRACT` function isn't available, so we'll use `strftime` to extract the month from the *rental_date*.

### Considerations

In summary, the SQL `WITH` statement is a versatile tool in the SQL developer's arsenal, enabling the creation of more readable, maintainable, and often more efficient SQL queries. The examples using the Sakila film database illustrate how CTEs can be used for simplifying complex queries, working with recursive data, and performing detailed data analysis.

## Partitions

The `PARTITION BY` clause in SQL is a powerful feature used in conjunction with window functions. It allows you to partition a result set into groups and perform calculations or apply functions to each group (partition) independently. Here's a detailed explanation:

### 1. Purpose

-   **Purpose**: `PARTITION BY` divides the result set into partitions to which the window function is applied. Instead of treating the entire result set as a single group, it enables calculations across a set of rows that share common properties.

-   **Use with Window Functions**: It's used with window functions like `ROW_NUMBER()`, `RANK()`, `SUM()`, `AVG()`, etc. These functions perform a calculation across a set of table rows related to the current row.

### 2. Use Cases

-   **Data Analysis**: For analyzing trends within subgroups of data. For instance, finding the average sales per department in a store.
-   **Ranking**: Assigning ranks within categories, such as ranking students by score within each class.
-   **Running Totals**: Calculating running totals, averages, or other cumulative measures within each partition.
-   **Segmentation**: Segmenting data into groups for more focused analysis, like monthly expenses per department.

### 3. Benefits

-   **Improved Data Organization**: It enables more granular data analysis without the need for complex subqueries or temporary tables.
-   **Efficiency**: More efficient than self-joins or subqueries for certain types of calculations.
-   **Flexibility**: Offers flexibility in data analysis, allowing for complex calculations across different segments of data.

### 4. Limitations

-   **Performance**: On large datasets, partitioning can lead to performance issues if not indexed or optimized properly.
-   **Complexity**: The syntax can be complex, especially for beginners or for complex analytical queries.
-   **Database Support**: Not all databases support all window functions, and there can be differences in implementation.

### 5. Examples

Assuming the same Sakila database and SQLite, let's consider some examples:

#### Example 1: Ranking Customers by Amount Spent

```{sql connection=dbcon}
SELECT customer_id, amount,
       RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) as rank
FROM payment
LIMIT 5;
```

-   **Description**: This ranks customers based on the amount spent, partitioning by `customer_id`. The LIMIT is arbitrary and only serves to reduce the result set for display in this lesson.

#### Example 2: Running Total of Payments by Customer

```{sql connection=dbcon}
SELECT customer_id, payment_date, amount,
       SUM(amount) OVER (PARTITION BY customer_id ORDER BY payment_date) as running_total
FROM payment
```

-   **Description**: Calculates a running total of payments for each customer over time.

### Conclusion

`PARTITION BY` is a versatile tool in SQL that enhances data analysis capabilities by enabling segment-specific calculations. While it has limitations, particularly in performance and complexity, its benefits in organizing and processing data for analytical queries are significant.

### PARTION BY vs GROUP BY

`PARTITION BY` and `GROUP BY` are both SQL clauses used to organize data into subsets, but they serve different purposes and are used in different contexts.

### `PARTITION BY`

-   **Context**: Used with window functions.
-   **Functionality**: It organizes data into partitions, allowing you to perform calculations across each partition. Each partition can be considered a 'window' over which SQL functions operate.
-   **Row Retention**: It does not change the number of rows in the query's result. Each row in the original dataset remains in the output, along with any additional columns created by the window function.
-   **Use Cases**: Commonly used for ranking, calculating running totals, moving averages, and other complex analytics that require access to multiple rows of the dataset at once, without grouping them into a single output row.

#### Example of `PARTITION BY`

``` sql
SELECT id, department, salary,
       AVG(salary) OVER (PARTITION BY department) as avg_department_salary
FROM employees;
```

Here, `PARTITION BY` calculates the average salary per department, but every row from the original `employees` table is retained in the output.

### `GROUP BY`

-   **Context**: Used in aggregate queries.
-   **Functionality**: It groups rows that have the same values in specified columns into summary rows, like "find the total salary per department".
-   **Row Reduction**: It reduces the number of rows in the result, providing one row per group. You can only select the grouped columns and the results of aggregate functions.
-   **Use Cases**: Ideal for summaries, such as totals, averages, counts, min, max, etc., where you need a single result row per group.

#### Example of `GROUP BY`

``` sql
SELECT department, AVG(salary) as avg_department_salary
FROM employees
GROUP BY department;
```

Here, `GROUP BY` provides the average salary per department, but the output will only have one row per department.

### Key Differences

1.  **Row Count**: `PARTITION BY` retains the original row count, whereas `GROUP BY` consolidates rows into groups, reducing the row count.
2.  **Use with Functions**: `PARTITION BY` is used with window functions, while `GROUP BY` is used with aggregate functions.
3.  **Result Set**: `PARTITION BY` adds additional information to each row, whereas `GROUP BY` provides a summarized result set.
4.  **Scope of Calculation**: In `PARTITION BY`, calculations are done within each partition and don't affect other partitions. In `GROUP BY`, calculations are done on grouped data, producing a single result for each group.

### Examples

Let's use the Sakila database with a focus on a scenario where we might want to analyze the film rental data. We will compare the usage of `PARTITION BY` and `GROUP BY` in SQL queries using this context. Note that the structure of the Sakila database might differ slightly based on the adaptation for SQLite, but the general concept remains the same.

### Scenario: Analyzing Film Rentals

We want to analyze film rentals, looking at both the total number of rentals per category and the rank of each film within its category based on the number of times it has been rented.

### 1. Using `GROUP BY`

**Objective**: To find the total number of rentals for each film category.

#### SQL Query:

```{sql connection=dbcon}
SELECT category.name AS category_name, COUNT(rental.rental_id) AS total_rentals
  FROM category
       JOIN film_category ON category.category_id = film_category.category_id
       JOIN film ON film_category.film_id = film.film_id
       JOIN inventory ON film.film_id = inventory.film_id
       JOIN rental ON inventory.inventory_id = rental.inventory_id
 GROUP BY category.name;
```

**Explanation**: - This query groups films by their category and counts the total number of rentals for each category. - The result will be a summary table with one row per category, showing the total rentals.

### 2. Using `PARTITION BY`

**Objective**: To rank each film within its category based on the number of times it has been rented.

#### SQL Query:

```{sql connection=dbcon}
SELECT film.title, category.name AS category_name, 
       RANK() OVER (PARTITION BY category.name ORDER BY COUNT(rental.rental_id) DESC) AS rental_rank
  FROM category
       JOIN film_category ON category.category_id = film_category.category_id
       JOIN film ON film_category.film_id = film.film_id
       JOIN inventory ON film.film_id = inventory.film_id
       JOIN rental ON inventory.inventory_id = rental.inventory_id
 GROUP BY film.title, category.name;
```

**Explanation**: - This query ranks films within each category based on their rental count. - The `PARTITION BY` clause is used to create partitions for each film category, and within each partition, films are ranked by their rental counts. - Unlike `GROUP BY`, this query does not reduce the number of rows in the output. Each film retains its row, but with additional information (rank) added.

### Key Differences Illustrated:

-   **Aggregation vs. Windowing**: The `GROUP BY` query aggregates data, reducing the result to one row per category, while the `PARTITION BY` query applies a window function within each category, retaining individual film rows.
-   **Output**: `GROUP BY` produces a summary table, whereas `PARTITION BY` adds additional ranking information to each film within its category.

These examples highlight how `GROUP BY` is used for aggregating data into summary form, whereas `PARTITION BY` is used for performing calculations within subsets of data while retaining the original data structure.

## Summary

This lesson covered the use of subqueries and the `WITH` statement in SQL for formulating complex queries, and the use of `PARTITION BY` to apply window functions to groups of data, using the Sakila Media and Film database for examples.

### Subqueries

-   **Definition**: Subqueries are SQL queries embedded within another SQL query, used in `SELECT`, `FROM`, `WHERE`, and `HAVING` clauses.
-   **Benefits**: They enhance data retrieval in complex scenarios, handle complex data relationships, and implement nested logic.
-   **Characteristics**: Subqueries can return various types of results and can be independent or correlated with the main query.
-   **Advantages and Limitations**: They offer modularity and improved readability but can cause performance issues and add complexity.
-   **Types of Subqueries**: These include single-row, multiple-row, correlated, subqueries as derived tables, and existential subqueries, each with specific use cases and examples.

### Common Table Expressions (CTEs)

-   **Basics**: The `WITH` statement creates temporary result sets for use in SQL queries.
-   **Advantages**: CTEs enhance readability, allow for recursive queries, and can improve performance.
-   **Use Cases in Sakila Database**: Examples are provided, like finding films with the highest rental rates and analyzing rental patterns, although it's noted that the Sakila database doesn't inherently include hierarchical data for some of the more complex examples.

### Partitions

The purpose of partitions and the `PARTITION BY` clause is to apply window functions to partitions of data. The is often necessary for data analysis and ranking and provides improved data organization, but its use can lead to performance issues on large datasets.

**Difference between `PARTITION BY` and `GROUP BY`**: The distinction between these two clauses was clarified. `PARTITION BY`, used with window functions, retains the original row count and allows calculations within partitions. Conversely, `GROUP BY`, used for aggregate queries, reduces the number of rows to one per group and is ideal for summaries.

## Tutorial

In this advanced tutorial video, James Oliver explains unions, partitions, recursion, and hierarchies for constructing complex and efficient SQL queries.

<iframe style src="https://www.youtube-nocookie.com/embed/C7CPXeEvKN0" width="480" height="270" allowfullscreen="allowfullscreen" data-external="1">

</iframe>

------------------------------------------------------------------------

## Files & Resources

```{r zipFiles, echo=FALSE}
zipName = sprintf("LessonFiles-%s-%s.zip", 
                 params$category,
                 params$number)

textALink = paste0("All Files for Lesson ", 
               params$category,".",params$number)

# downloadFilesLink() is included from _insert2DB.R
knitr::raw_html(downloadFilesLink(".", zipName, textALink))
```

------------------------------------------------------------------------

## References

[SQLite Documention: The WITH Clause](https://www.sqlite.org/lang_with.html)

## Errata

[Let us know](https://form.jotform.com/212187072784157){target="_blank"}.
