Introduction
GROUP BY and PARTITION BY represent two fundamentally different operations in SQL. GROUP BY performs aggregation by collapsing rows into groups, returning one row per group. PARTITION BY, used with window functions, divides rows into partitions but preserves all original rows while adding computed values based on partition-level calculations. Understanding this distinction is essential for writing efficient analytical queries.
Execution Order and Query Processing
SQL queries execute in a specific order that determines how GROUP BY and PARTITION BY interact. The execution sequence proceeds as follows: FROM and JOIN clauses retrieve data, WHERE filters rows, GROUP BY aggregates and collapses rows, HAVING filters groups, SELECT applies window functions including PARTITION BY, ORDER BY sorts results, and finally LIMIT restricts output. This sequence means GROUP BY executes before window functions, so PARTITION BY operates on already-aggregated data when both appear in the same query.
Sample Dataset
To illustrate the use of GROUP BY and PARTITION BY, consider an employee salary dataset:
CREATE TABLE employees (
employee_id INTEGER,
name TEXT,
department TEXT,
salary INTEGER,
hire_date DATE
);
INSERT INTO employees VALUES
(1, 'Kern, Jason', 'Sales', 60000, '2020-01-15'),
(2, 'Samansky, Alison', 'Sales', 65000, '2019-03-20'),
(3, 'McCullum, Charlie', 'Sales', 70000, '2018-07-10'),
(4, 'Kaiser, David', 'Engineering', 80000, '2021-02-01'),
(5, 'da Silva, Evina', 'Engineering', 85000, '2020-06-15'),
(6, 'Olase, Francis', 'Engineering', 90000, '2019-11-30'),
(7, 'Kirti, Gautham', 'HR', 55000, '2021-08-22'),
(8, 'Cheng, Yufan', 'HR', 58000, '2020-12-05');
GROUP BY Behavior
GROUP BY collapses rows based on specified grouping columns and requires aggregate functions for non-grouped columns. The operation fundamentally reduces the result set size.
SELECT
department,
COUNT(*) as employee_count,
AVG(salary) as avg_salary,
MIN(salary) as min_salary,
MAX(salary) as max_salary
FROM employees
GROUP BY department;
Table 1: 3 records
| Engineering |
3 |
85000 |
80000 |
90000 |
| HR |
2 |
56500 |
55000 |
58000 |
| Sales |
3 |
65000 |
60000 |
70000 |
This query returns three rows, one per department, with aggregated statistics. The original eight employee records collapse into three department summaries. Any column in the SELECT clause must either appear in the GROUP BY clause or be wrapped in an aggregate function.
-- Invalid: name is neither grouped nor aggregated
SELECT department, name, AVG(salary)
FROM employees
GROUP BY department;
-- Valid: all non-aggregated columns are in GROUP BY
SELECT department, name, AVG(salary) as avg_salary
FROM employees
GROUP BY department, name;
PARTITION BY Behavior
PARTITION BY divides rows into partitions for window function calculations while preserving all original rows. Window functions compute values based on sets of rows related to the current row. Notice how the PARTITION BY clause is part of the SELECT clause and applied to aggregation functions. Of course, some aggregation results will be the same for all rows, but others can be the result of calculations over partitions.
SELECT
employee_id,
name,
department,
salary,
COUNT(*) OVER (PARTITION BY department) as dept_employee_count,
AVG(salary) OVER (PARTITION BY department) as dept_avg_salary,
salary - AVG(salary) OVER (PARTITION BY department) as diff_from_avg
FROM employees;
Table 2: 8 records
| 4 |
Kaiser, David |
Engineering |
80000 |
3 |
85000 |
-5000 |
| 5 |
da Silva, Evina |
Engineering |
85000 |
3 |
85000 |
0 |
| 6 |
Olase, Francis |
Engineering |
90000 |
3 |
85000 |
5000 |
| 7 |
Kirti, Gautham |
HR |
55000 |
2 |
56500 |
-1500 |
| 8 |
Cheng, Yufan |
HR |
58000 |
2 |
56500 |
1500 |
| 1 |
Kern, Jason |
Sales |
60000 |
3 |
65000 |
-5000 |
| 2 |
Samansky, Alison |
Sales |
65000 |
3 |
65000 |
0 |
| 3 |
McCullum, Charlie |
Sales |
70000 |
3 |
65000 |
5000 |
This query returns all eight rows with department-level statistics added to each row. Every employee in Sales has the same department average of 65000, while every employee in Engineering has 85000. The partition-level calculations are repeated for each row in the partition.
Ranking and Analytical Functions
Window functions support operations impossible with GROUP BY alone. Ranking functions assign sequential numbers to rows within partitions.
SELECT
name,
department,
salary,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as salary_rank,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rank_with_ties,
DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) as dense_rank
FROM employees;
Table 3: 8 records
| Olase, Francis |
Engineering |
90000 |
1 |
1 |
1 |
| da Silva, Evina |
Engineering |
85000 |
2 |
2 |
2 |
| Kaiser, David |
Engineering |
80000 |
3 |
3 |
3 |
| Cheng, Yufan |
HR |
58000 |
1 |
1 |
1 |
| Kirti, Gautham |
HR |
55000 |
2 |
2 |
2 |
| McCullum, Charlie |
Sales |
70000 |
1 |
1 |
1 |
| Samansky, Alison |
Sales |
65000 |
2 |
2 |
2 |
| Kern, Jason |
Sales |
60000 |
3 |
3 |
3 |
ROW_NUMBER assigns unique sequential integers even when values tie. RANK assigns the same rank to tied values and skips subsequent ranks. DENSE_RANK assigns the same rank to ties without skipping ranks. Within the Sales department, Charlie ranks first at 70000, Bob second at 65000, and Alice third at 60000.
LEAD and LAG functions access values from other rows relative to the current row within a partition.
SELECT
name,
department,
salary,
LAG(salary) OVER (PARTITION BY department ORDER BY salary) as prev_salary,
LEAD(salary) OVER (PARTITION BY department ORDER BY salary) as next_salary,
salary - LAG(salary) OVER (PARTITION BY department ORDER BY salary) as increase
FROM employees
ORDER BY department, salary;
Table 4: 8 records
| Kaiser, David |
Engineering |
80000 |
NA |
85000 |
NA |
| da Silva, Evina |
Engineering |
85000 |
80000 |
90000 |
5000 |
| Olase, Francis |
Engineering |
90000 |
85000 |
NA |
5000 |
| Kirti, Gautham |
HR |
55000 |
NA |
58000 |
NA |
| Cheng, Yufan |
HR |
58000 |
55000 |
NA |
3000 |
| Kern, Jason |
Sales |
60000 |
NA |
65000 |
NA |
| Samansky, Alison |
Sales |
65000 |
60000 |
70000 |
5000 |
| McCullum, Charlie |
Sales |
70000 |
65000 |
NA |
5000 |
Combining GROUP BY and PARTITION BY
Queries can combine both operations by using subqueries or common table expressions. GROUP BY first aggregates data, then PARTITION BY performs window calculations on the aggregated results. We will illustrate this with a different database.
-- Create the table first
CREATE TABLE test_scores (
student_id INTEGER,
student_name TEXT,
class TEXT,
test_name TEXT,
score INTEGER
);
-- Insert sample data
INSERT INTO test_scores VALUES
(1, 'Alice', 'Math', 'Test1', 85),
(1, 'Alice', 'Math', 'Test2', 90),
(1, 'Alice', 'Math', 'Test3', 88),
(2, 'Bob', 'Math', 'Test1', 78),
(2, 'Bob', 'Math', 'Test2', 82),
(2, 'Bob', 'Math', 'Test3', 80),
(3, 'Charlie', 'Science', 'Test1', 92),
(3, 'Charlie', 'Science', 'Test2', 95),
(3, 'Charlie', 'Science', 'Test3', 94),
(4, 'David', 'Science', 'Test1', 88),
(4, 'David', 'Science', 'Test2', 90),
(4, 'David', 'Science', 'Test3', 89);
SELECT
student_name,
class,
avg_score,
RANK() OVER (PARTITION BY class ORDER BY avg_score DESC) as class_rank,
AVG(avg_score) OVER (PARTITION BY class) as class_overall_avg
FROM (
SELECT
student_name,
class,
AVG(score) as avg_score
FROM test_scores
GROUP BY student_name, class
)
ORDER BY class, class_rank;
Table 5: 4 records
| Alice |
Math |
87.66667 |
1 |
83.83333 |
| Bob |
Math |
80.00000 |
2 |
83.83333 |
| Charlie |
Science |
93.66667 |
1 |
91.33333 |
| David |
Science |
89.00000 |
2 |
91.33333 |
-- Calculate student averages, then rank within classes
SELECT
student_name,
class,
avg_score,
RANK() OVER (PARTITION BY class ORDER BY avg_score DESC) as class_rank,
AVG(avg_score) OVER (PARTITION BY class) as class_overall_avg
FROM (
SELECT
student_name,
class,
AVG(score) as avg_score
FROM test_scores
GROUP BY student_name, class
) student_averages
ORDER BY class, class_rank;
Table 6: 4 records
| Alice |
Math |
87.66667 |
1 |
83.83333 |
| Bob |
Math |
80.00000 |
2 |
83.83333 |
| Charlie |
Science |
93.66667 |
1 |
91.33333 |
| David |
Science |
89.00000 |
2 |
91.33333 |
The subquery first collapses multiple test scores per student into a single average using GROUP BY. The outer query then applies window functions to rank students within their classes and compare individual averages to class averages. This pattern is common in analytical queries requiring both aggregation and comparative analysis.
Sales Analysis Example
Consider a sales dataset demonstrating practical applications of combined GROUP BY and PARTITION BY operations.
CREATE TABLE sales (
sale_id INTEGER,
salesperson TEXT,
region TEXT,
sale_amount REAL,
sale_date TEXT
);
INSERT INTO sales VALUES
(1, 'Alice', 'North', 1000, '2024-01-15'),
(2, 'Alice', 'North', 1500, '2024-01-20'),
(3, 'Bob', 'North', 800, '2024-01-18'),
(4, 'Bob', 'North', 1200, '2024-01-22'),
(5, 'Charlie', 'South', 2000, '2024-01-16'),
(6, 'Charlie', 'South', 1800, '2024-01-25'),
(7, 'David', 'South', 1500, '2024-01-17'),
(8, 'David', 'South', 1600, '2024-01-23');
A comprehensive analysis ranks salespeople within regions while calculating regional statistics.
SELECT
salesperson,
region,
total_sales,
RANK() OVER (PARTITION BY region ORDER BY total_sales DESC) as regional_rank,
AVG(total_sales) OVER (PARTITION BY region) as regional_avg,
total_sales - AVG(total_sales) OVER (PARTITION BY region) as diff_from_avg,
ROUND(100.0 * total_sales / SUM(total_sales) OVER (PARTITION BY region), 2) as pct_of_regional_total
FROM (
SELECT
salesperson,
region,
SUM(sale_amount) as total_sales
FROM sales
GROUP BY salesperson, region
) salesperson_totals
ORDER BY region, regional_rank;
Table 7: 4 records
| Alice |
North |
2500 |
1 |
2250 |
250 |
55.56 |
| Bob |
North |
2000 |
2 |
2250 |
-250 |
44.44 |
| Charlie |
South |
3800 |
1 |
3450 |
350 |
55.07 |
| David |
South |
3100 |
2 |
3450 |
-350 |
44.93 |
The inner query aggregates individual sales into per-salesperson totals using GROUP BY. The outer query then ranks salespeople within regions and calculates each salesperson’s percentage of regional total sales. Alice leads the North region with 2500 in total sales, representing 55.56 percent of the regional total, while Bob trails with 2000 representing 44.44 percent.
Running Totals and Window Frames
Window functions support frame specifications for calculating running aggregates. The frame defines which rows relative to the current row participate in the calculation.
SELECT
name,
department,
salary,
hire_date,
SUM(salary) OVER (
PARTITION BY department
ORDER BY hire_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) as cumulative_dept_salary
FROM employees
ORDER BY department, hire_date;
Table 8: 8 records
| Olase, Francis |
Engineering |
90000 |
2019-11-30 |
90000 |
| da Silva, Evina |
Engineering |
85000 |
2020-06-15 |
175000 |
| Kaiser, David |
Engineering |
80000 |
2021-02-01 |
255000 |
| Cheng, Yufan |
HR |
58000 |
2020-12-05 |
58000 |
| Kirti, Gautham |
HR |
55000 |
2021-08-22 |
113000 |
| McCullum, Charlie |
Sales |
70000 |
2018-07-10 |
70000 |
| Samansky, Alison |
Sales |
65000 |
2019-03-20 |
135000 |
| Kern, Jason |
Sales |
60000 |
2020-01-15 |
195000 |
This query calculates a running total of salaries within each department ordered by hire date. For each employee, the running total includes all employees hired in the same department up to and including the current employee’s hire date. The frame clause “ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW” specifies this behavior explicitly.
Decision Framework
Choose GROUP BY when creating summary reports that require one row per group, when reducing result sets through aggregation, or when building dashboards with aggregate statistics. The operation is appropriate when detail-level data is unnecessary in the output.
Choose PARTITION BY when preserving detail rows while adding aggregate context, when ranking rows within groups, when comparing individual values to group statistics, or when calculating running totals and moving averages. Window functions excel at analytical queries requiring both detail and summary information simultaneously.
Combine both operations when first aggregating data then performing comparative analysis on the aggregates, when ranking aggregated groups, or when calculating percentages of group totals from already-aggregated data.
Common Errors
Attempting to select non-aggregated columns without including them in GROUP BY causes errors. Every column in the SELECT clause must either appear in GROUP BY or be wrapped in an aggregate function.
-- Error: name must be grouped or aggregated
SELECT department, name, AVG(salary)
FROM employees
GROUP BY department;
-- Correct: include name in GROUP BY
SELECT department, name, AVG(salary)
FROM employees
GROUP BY department, name;
When combining GROUP BY and PARTITION BY, ensure partition columns exist in the grouped result set. If GROUP BY collapses away a column, that column cannot be used in PARTITION BY.
-- Error: employee_id grouped away but needed for partition
SELECT department, SUM(salary),
RANK() OVER (PARTITION BY employee_id ORDER BY salary)
FROM employees GROUP BY department;
-- Correct: partition by what remains after grouping
SELECT department, SUM(salary) as total_salary,
RANK() OVER (ORDER BY SUM(salary) DESC) as dept_rank
FROM employees GROUP BY department;
Expecting PARTITION BY to reduce row counts is a common misconception. Window functions always preserve the input row count unless DISTINCT or additional filtering is applied.
Notes
SQLite added comprehensive window function support including PARTITION BY in version 3.25.0 released in September 2018. Any modern SQLite installation supports these features. Check the version with SELECT sqlite_version(); to confirm compatibility. Versions prior to 3.25.0 require complex workarounds using self-joins and subqueries that are both inefficient and difficult to maintain.
Acknowledgements
Claude 4.6 (Sonnet) was used in the preparation of this lesson.
---
title: "GROUP BY and PARTITION BY: Two Approaches to Analytical Queries in SQL"
params:
  category: 70
  number: 115
  time: 45
  level: beginner
  tags: "sql,group by,partition,analytics"
  description: "Examines the fundamental differences between GROUP BY 
                aggregation, which collapses rows into summary groups, 
                and PARTITION BY window functions, which preserve all rows 
                while adding partition-level calculations. Combine both 
                operations for complex analytical queries. Provides
                examples using SQLite and R but lesson is applicable to
                any database and programming language."
date: "<small>`r Sys.Date()`</small>"
author: "<small>Martin Schedlbauer</small>"
email: "m.schedlbauer@neu.edu"
affilitation: "Northeastern University"
output: 
  bookdown::html_document2:
    toc: true
    toc_float: true
    collapsed: false
    number_sections: false
    code_download: true
    theme: lumen
    highlight: tango
---

---
title: "<small>`r params$category`.`r params$number`</small><br/><span style='color: #2E4053; font-size: 0.9em'>`r rmarkdown::metadata$title`</span>"
---

```{r echo=F}
# Package names
packages <- c("here", "RSQLite")

# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages], repos = "https://cloud.r-project.org")
}

# Packages loading
suppressPackageStartupMessages(invisible(lapply(packages, library, character.only = TRUE)))
```

```{r code=xfun::read_utf8(paste0(here::here(),'/R/_insert2DB.R')), include = FALSE}
```

## Introduction

GROUP BY and PARTITION BY represent two fundamentally different operations in SQL. GROUP BY performs aggregation by collapsing rows into groups, returning one row per group. PARTITION BY, used with window functions, divides rows into partitions but preserves all original rows while adding computed values based on partition-level calculations. Understanding this distinction is essential for writing efficient analytical queries.

## Execution Order and Query Processing

SQL queries execute in a specific order that determines how GROUP BY and PARTITION BY interact. The execution sequence proceeds as follows: FROM and JOIN clauses retrieve data, WHERE filters rows, GROUP BY aggregates and collapses rows, HAVING filters groups, SELECT applies window functions including PARTITION BY, ORDER BY sorts results, and finally LIMIT restricts output. This sequence means GROUP BY executes before window functions, so PARTITION BY operates on already-aggregated data when both appear in the same query.

## Sample Dataset

To illustrate the use of GROUP BY and PARTITION BY, consider an employee salary dataset:

```{r echo=F, warning=FALSE}
library(RSQLite)
dbcon <- dbConnect(RSQLite::SQLite(), ":memory:")
```

```{sql connection=dbcon}
CREATE TABLE employees (
    employee_id INTEGER,
    name TEXT,
    department TEXT,
    salary INTEGER,
    hire_date DATE
);
```

```{sql connection=dbcon}
INSERT INTO employees VALUES
(1, 'Kern, Jason', 'Sales', 60000, '2020-01-15'),
(2, 'Samansky, Alison', 'Sales', 65000, '2019-03-20'),
(3, 'McCullum, Charlie', 'Sales', 70000, '2018-07-10'),
(4, 'Kaiser, David', 'Engineering', 80000, '2021-02-01'),
(5, 'da Silva, Evina', 'Engineering', 85000, '2020-06-15'),
(6, 'Olase, Francis', 'Engineering', 90000, '2019-11-30'),
(7, 'Kirti, Gautham', 'HR', 55000, '2021-08-22'),
(8, 'Cheng, Yufan', 'HR', 58000, '2020-12-05');
```

## GROUP BY Behavior

GROUP BY collapses rows based on specified grouping columns and requires aggregate functions for non-grouped columns. The operation fundamentally reduces the result set size.

```{sql connection=dbcon}
SELECT 
    department,
    COUNT(*) as employee_count,
    AVG(salary) as avg_salary,
    MIN(salary) as min_salary,
    MAX(salary) as max_salary
FROM employees
GROUP BY department;
```

This query returns three rows, one per department, with aggregated statistics. The original eight employee records collapse into three department summaries. Any column in the SELECT clause must either appear in the GROUP BY clause or be wrapped in an aggregate function.

``` sql
-- Invalid: name is neither grouped nor aggregated
SELECT department, name, AVG(salary) 
  FROM employees 
 GROUP BY department;

-- Valid: all non-aggregated columns are in GROUP BY
SELECT department, name, AVG(salary) as avg_salary
  FROM employees
 GROUP BY department, name;
```

## PARTITION BY Behavior

PARTITION BY divides rows into partitions for window function calculations while preserving all original rows. Window functions compute values based on sets of rows related to the current row. Notice how the PARTITION BY clause is part of the SELECT clause and applied to aggregation functions. Of course, some aggregation results will be the same for all rows, but others can be the result of calculations over partitions.

```{sql connection=dbcon}
SELECT 
    employee_id,
    name,
    department,
    salary,
    COUNT(*) OVER (PARTITION BY department) as dept_employee_count,
    AVG(salary) OVER (PARTITION BY department) as dept_avg_salary,
    salary - AVG(salary) OVER (PARTITION BY department) as diff_from_avg
FROM employees;
```

This query returns all eight rows with department-level statistics added to each row. Every employee in Sales has the same department average of 65000, while every employee in Engineering has 85000. The partition-level calculations are repeated for each row in the partition.

## Ranking and Analytical Functions

Window functions support operations impossible with GROUP BY alone. Ranking functions assign sequential numbers to rows within partitions.

```{sql connection=dbcon}
SELECT 
    name,
    department,
    salary,
    ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as salary_rank,
    RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rank_with_ties,
    DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) as dense_rank
FROM employees;
```

ROW_NUMBER assigns unique sequential integers even when values tie. RANK assigns the same rank to tied values and skips subsequent ranks. DENSE_RANK assigns the same rank to ties without skipping ranks. Within the Sales department, Charlie ranks first at 70000, Bob second at 65000, and Alice third at 60000.

LEAD and LAG functions access values from other rows relative to the current row within a partition.

```{sql connection=dbcon}
SELECT 
    name,
    department,
    salary,
    LAG(salary) OVER (PARTITION BY department ORDER BY salary) as prev_salary,
    LEAD(salary) OVER (PARTITION BY department ORDER BY salary) as next_salary,
    salary - LAG(salary) OVER (PARTITION BY department ORDER BY salary) as increase
FROM employees
ORDER BY department, salary;
```

## Combining GROUP BY and PARTITION BY

Queries can combine both operations by using subqueries or common table expressions. GROUP BY first aggregates data, then PARTITION BY performs window calculations on the aggregated results. We will illustrate this with a different database.

```{sql connection=dbcon}
-- Create the table first
CREATE TABLE test_scores (
    student_id INTEGER,
    student_name TEXT,
    class TEXT,
    test_name TEXT,
    score INTEGER
);
```

```{sql connection=dbcon}
-- Insert sample data
INSERT INTO test_scores VALUES
(1, 'Alice', 'Math', 'Test1', 85),
(1, 'Alice', 'Math', 'Test2', 90),
(1, 'Alice', 'Math', 'Test3', 88),
(2, 'Bob', 'Math', 'Test1', 78),
(2, 'Bob', 'Math', 'Test2', 82),
(2, 'Bob', 'Math', 'Test3', 80),
(3, 'Charlie', 'Science', 'Test1', 92),
(3, 'Charlie', 'Science', 'Test2', 95),
(3, 'Charlie', 'Science', 'Test3', 94),
(4, 'David', 'Science', 'Test1', 88),
(4, 'David', 'Science', 'Test2', 90),
(4, 'David', 'Science', 'Test3', 89);
```

```{sql connection=dbcon}
SELECT 
    student_name,
    class,
    avg_score,
    RANK() OVER (PARTITION BY class ORDER BY avg_score DESC) as class_rank,
    AVG(avg_score) OVER (PARTITION BY class) as class_overall_avg
FROM (
    SELECT 
        student_name,
        class,
        AVG(score) as avg_score
    FROM test_scores
    GROUP BY student_name, class
)
ORDER BY class, class_rank;
```

```{sql connection=dbcon}
-- Calculate student averages, then rank within classes
SELECT 
    student_name,
    class,
    avg_score,
    RANK() OVER (PARTITION BY class ORDER BY avg_score DESC) as class_rank,
    AVG(avg_score) OVER (PARTITION BY class) as class_overall_avg
FROM (
    SELECT 
        student_name,
        class,
        AVG(score) as avg_score
    FROM test_scores
    GROUP BY student_name, class
) student_averages
ORDER BY class, class_rank;
```

The subquery first collapses multiple test scores per student into a single average using GROUP BY. The outer query then applies window functions to rank students within their classes and compare individual averages to class averages. This pattern is common in analytical queries requiring both aggregation and comparative analysis.

## Sales Analysis Example

Consider a sales dataset demonstrating practical applications of combined GROUP BY and PARTITION BY operations.

```{sql connection=dbcon}
CREATE TABLE sales (
    sale_id INTEGER,
    salesperson TEXT,
    region TEXT,
    sale_amount REAL,
    sale_date TEXT
);
```

```{sql connection=dbcon}
INSERT INTO sales VALUES
(1, 'Alice', 'North', 1000, '2024-01-15'),
(2, 'Alice', 'North', 1500, '2024-01-20'),
(3, 'Bob', 'North', 800, '2024-01-18'),
(4, 'Bob', 'North', 1200, '2024-01-22'),
(5, 'Charlie', 'South', 2000, '2024-01-16'),
(6, 'Charlie', 'South', 1800, '2024-01-25'),
(7, 'David', 'South', 1500, '2024-01-17'),
(8, 'David', 'South', 1600, '2024-01-23');
```

A comprehensive analysis ranks salespeople within regions while calculating regional statistics.

```{sql connection=dbcon}
SELECT 
    salesperson,
    region,
    total_sales,
    RANK() OVER (PARTITION BY region ORDER BY total_sales DESC) as regional_rank,
    AVG(total_sales) OVER (PARTITION BY region) as regional_avg,
    total_sales - AVG(total_sales) OVER (PARTITION BY region) as diff_from_avg,
    ROUND(100.0 * total_sales / SUM(total_sales) OVER (PARTITION BY region), 2) as pct_of_regional_total
FROM (
    SELECT 
        salesperson,
        region,
        SUM(sale_amount) as total_sales
    FROM sales
    GROUP BY salesperson, region
) salesperson_totals
ORDER BY region, regional_rank;
```

The inner query aggregates individual sales into per-salesperson totals using GROUP BY. The outer query then ranks salespeople within regions and calculates each salesperson's percentage of regional total sales. Alice leads the North region with 2500 in total sales, representing 55.56 percent of the regional total, while Bob trails with 2000 representing 44.44 percent.

## Running Totals and Window Frames

Window functions support frame specifications for calculating running aggregates. The frame defines which rows relative to the current row participate in the calculation.

```{sql connection=dbcon}
SELECT 
    name,
    department,
    salary,
    hire_date,
    SUM(salary) OVER (
        PARTITION BY department 
        ORDER BY hire_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) as cumulative_dept_salary
FROM employees
ORDER BY department, hire_date;
```

This query calculates a running total of salaries within each department ordered by hire date. For each employee, the running total includes all employees hired in the same department up to and including the current employee's hire date. The frame clause "ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW" specifies this behavior explicitly.

## Decision Framework

Choose GROUP BY when creating summary reports that require one row per group, when reducing result sets through aggregation, or when building dashboards with aggregate statistics. The operation is appropriate when detail-level data is unnecessary in the output.

Choose PARTITION BY when preserving detail rows while adding aggregate context, when ranking rows within groups, when comparing individual values to group statistics, or when calculating running totals and moving averages. Window functions excel at analytical queries requiring both detail and summary information simultaneously.

Combine both operations when first aggregating data then performing comparative analysis on the aggregates, when ranking aggregated groups, or when calculating percentages of group totals from already-aggregated data.

## Common Errors

Attempting to select non-aggregated columns without including them in GROUP BY causes errors. Every column in the SELECT clause must either appear in GROUP BY or be wrapped in an aggregate function.

``` sql
-- Error: name must be grouped or aggregated
SELECT department, name, AVG(salary) 
  FROM employees 
 GROUP BY department;

-- Correct: include name in GROUP BY
SELECT department, name, AVG(salary) 
  FROM employees 
 GROUP BY department, name;
```

When combining GROUP BY and PARTITION BY, ensure partition columns exist in the grouped result set. If GROUP BY collapses away a column, that column cannot be used in PARTITION BY.

``` sql
-- Error: employee_id grouped away but needed for partition
SELECT department, SUM(salary),
       RANK() OVER (PARTITION BY employee_id ORDER BY salary)
  FROM employees GROUP BY department;

-- Correct: partition by what remains after grouping
SELECT department, SUM(salary) as total_salary,
       RANK() OVER (ORDER BY SUM(salary) DESC) as dept_rank
  FROM employees GROUP BY department;
```

Expecting PARTITION BY to reduce row counts is a common misconception. Window functions always preserve the input row count unless DISTINCT or additional filtering is applied.

## Performance Considerations

Window functions typically perform better than equivalent self-joins or correlated subqueries. Modern query optimizers recognize window function patterns and generate efficient execution plans. However, partitioning on unindexed columns or ordering by expressions can degrade performance on large datasets.

Consider creating indexes on columns used in PARTITION BY and ORDER BY clauses for large tables. When combining GROUP BY and PARTITION BY, the initial aggregation reduces the dataset size before window calculations, often improving overall query performance compared to window functions on raw data.

Multiple window functions with identical PARTITION BY and ORDER BY clauses can share computation. Modern optimizers recognize this pattern and avoid redundant sorting and partitioning operations.

``` sql
-- Efficient: same window specification reused
SELECT 
    name,
    salary,
    RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rank,
    DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) as dense_rank,
    ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as row_num
FROM employees;
```

Understanding the distinction between GROUP BY aggregation and PARTITION BY window functions, their execution order, and their appropriate use cases enables construction of sophisticated analytical queries that efficiently answer complex business questions.

```{r echo=FALSE, warning=FALSE}
dbDisconnect(dbcon)
```

------------------------------------------------------------------------

## Files & Resources

```{r zipFiles, echo=FALSE}
zipName = sprintf("LessonFiles-%s-%s.zip", 
                 params$category,
                 params$number)

textALink = paste0("All Files for Lesson ", 
               params$category,".",params$number)

# downloadFilesLink() is included from _insert2DB.R
knitr::raw_html(downloadFilesLink(".", zipName, textALink))
```

------------------------------------------------------------------------

## Notes

SQLite added comprehensive window function support including PARTITION BY in version 3.25.0 released in September 2018. Any modern SQLite installation supports these features. Check the version with `SELECT sqlite_version();` to confirm compatibility. Versions prior to 3.25.0 require complex workarounds using self-joins and subqueries that are both inefficient and difficult to maintain.

## Acknowledgements

Claude 4.6 (Sonnet) was used in the preparation of this lesson.

## Errata

[Let us know](https://form.jotform.com/212187072784157){target="_blank"}.
