Introduction

GROUP BY and PARTITION BY represent two fundamentally different operations in SQL. GROUP BY performs aggregation by collapsing rows into groups, returning one row per group. PARTITION BY, used with window functions, divides rows into partitions but preserves all original rows while adding computed values based on partition-level calculations. Understanding this distinction is essential for writing efficient analytical queries.

Execution Order and Query Processing

SQL queries execute in a specific order that determines how GROUP BY and PARTITION BY interact. The execution sequence proceeds as follows: FROM and JOIN clauses retrieve data, WHERE filters rows, GROUP BY aggregates and collapses rows, HAVING filters groups, SELECT applies window functions including PARTITION BY, ORDER BY sorts results, and finally LIMIT restricts output. This sequence means GROUP BY executes before window functions, so PARTITION BY operates on already-aggregated data when both appear in the same query.

Sample Dataset

To illustrate the use of GROUP BY and PARTITION BY, consider an employee salary dataset:

CREATE TABLE employees (
    employee_id INTEGER,
    name TEXT,
    department TEXT,
    salary INTEGER,
    hire_date DATE
);
INSERT INTO employees VALUES
(1, 'Kern, Jason', 'Sales', 60000, '2020-01-15'),
(2, 'Samansky, Alison', 'Sales', 65000, '2019-03-20'),
(3, 'McCullum, Charlie', 'Sales', 70000, '2018-07-10'),
(4, 'Kaiser, David', 'Engineering', 80000, '2021-02-01'),
(5, 'da Silva, Evina', 'Engineering', 85000, '2020-06-15'),
(6, 'Olase, Francis', 'Engineering', 90000, '2019-11-30'),
(7, 'Kirti, Gautham', 'HR', 55000, '2021-08-22'),
(8, 'Cheng, Yufan', 'HR', 58000, '2020-12-05');

GROUP BY Behavior

GROUP BY collapses rows based on specified grouping columns and requires aggregate functions for non-grouped columns. The operation fundamentally reduces the result set size.

SELECT 
    department,
    COUNT(*) as employee_count,
    AVG(salary) as avg_salary,
    MIN(salary) as min_salary,
    MAX(salary) as max_salary
FROM employees
GROUP BY department;
Table 1: 3 records
department employee_count avg_salary min_salary max_salary
Engineering 3 85000 80000 90000
HR 2 56500 55000 58000
Sales 3 65000 60000 70000

This query returns three rows, one per department, with aggregated statistics. The original eight employee records collapse into three department summaries. Any column in the SELECT clause must either appear in the GROUP BY clause or be wrapped in an aggregate function.

-- Invalid: name is neither grouped nor aggregated
SELECT department, name, AVG(salary) 
  FROM employees 
 GROUP BY department;

-- Valid: all non-aggregated columns are in GROUP BY
SELECT department, name, AVG(salary) as avg_salary
  FROM employees
 GROUP BY department, name;

PARTITION BY Behavior

PARTITION BY divides rows into partitions for window function calculations while preserving all original rows. Window functions compute values based on sets of rows related to the current row. Notice how the PARTITION BY clause is part of the SELECT clause and applied to aggregation functions. Of course, some aggregation results will be the same for all rows, but others can be the result of calculations over partitions.

SELECT 
    employee_id,
    name,
    department,
    salary,
    COUNT(*) OVER (PARTITION BY department) as dept_employee_count,
    AVG(salary) OVER (PARTITION BY department) as dept_avg_salary,
    salary - AVG(salary) OVER (PARTITION BY department) as diff_from_avg
FROM employees;
Table 2: 8 records
employee_id name department salary dept_employee_count dept_avg_salary diff_from_avg
4 Kaiser, David Engineering 80000 3 85000 -5000
5 da Silva, Evina Engineering 85000 3 85000 0
6 Olase, Francis Engineering 90000 3 85000 5000
7 Kirti, Gautham HR 55000 2 56500 -1500
8 Cheng, Yufan HR 58000 2 56500 1500
1 Kern, Jason Sales 60000 3 65000 -5000
2 Samansky, Alison Sales 65000 3 65000 0
3 McCullum, Charlie Sales 70000 3 65000 5000

This query returns all eight rows with department-level statistics added to each row. Every employee in Sales has the same department average of 65000, while every employee in Engineering has 85000. The partition-level calculations are repeated for each row in the partition.

Ranking and Analytical Functions

Window functions support operations impossible with GROUP BY alone. Ranking functions assign sequential numbers to rows within partitions.

SELECT 
    name,
    department,
    salary,
    ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as salary_rank,
    RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rank_with_ties,
    DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) as dense_rank
FROM employees;
Table 3: 8 records
name department salary salary_rank rank_with_ties dense_rank
Olase, Francis Engineering 90000 1 1 1
da Silva, Evina Engineering 85000 2 2 2
Kaiser, David Engineering 80000 3 3 3
Cheng, Yufan HR 58000 1 1 1
Kirti, Gautham HR 55000 2 2 2
McCullum, Charlie Sales 70000 1 1 1
Samansky, Alison Sales 65000 2 2 2
Kern, Jason Sales 60000 3 3 3

ROW_NUMBER assigns unique sequential integers even when values tie. RANK assigns the same rank to tied values and skips subsequent ranks. DENSE_RANK assigns the same rank to ties without skipping ranks. Within the Sales department, Charlie ranks first at 70000, Bob second at 65000, and Alice third at 60000.

LEAD and LAG functions access values from other rows relative to the current row within a partition.

SELECT 
    name,
    department,
    salary,
    LAG(salary) OVER (PARTITION BY department ORDER BY salary) as prev_salary,
    LEAD(salary) OVER (PARTITION BY department ORDER BY salary) as next_salary,
    salary - LAG(salary) OVER (PARTITION BY department ORDER BY salary) as increase
FROM employees
ORDER BY department, salary;
Table 4: 8 records
name department salary prev_salary next_salary increase
Kaiser, David Engineering 80000 NA 85000 NA
da Silva, Evina Engineering 85000 80000 90000 5000
Olase, Francis Engineering 90000 85000 NA 5000
Kirti, Gautham HR 55000 NA 58000 NA
Cheng, Yufan HR 58000 55000 NA 3000
Kern, Jason Sales 60000 NA 65000 NA
Samansky, Alison Sales 65000 60000 70000 5000
McCullum, Charlie Sales 70000 65000 NA 5000

Combining GROUP BY and PARTITION BY

Queries can combine both operations by using subqueries or common table expressions. GROUP BY first aggregates data, then PARTITION BY performs window calculations on the aggregated results. We will illustrate this with a different database.

-- Create the table first
CREATE TABLE test_scores (
    student_id INTEGER,
    student_name TEXT,
    class TEXT,
    test_name TEXT,
    score INTEGER
);
-- Insert sample data
INSERT INTO test_scores VALUES
(1, 'Alice', 'Math', 'Test1', 85),
(1, 'Alice', 'Math', 'Test2', 90),
(1, 'Alice', 'Math', 'Test3', 88),
(2, 'Bob', 'Math', 'Test1', 78),
(2, 'Bob', 'Math', 'Test2', 82),
(2, 'Bob', 'Math', 'Test3', 80),
(3, 'Charlie', 'Science', 'Test1', 92),
(3, 'Charlie', 'Science', 'Test2', 95),
(3, 'Charlie', 'Science', 'Test3', 94),
(4, 'David', 'Science', 'Test1', 88),
(4, 'David', 'Science', 'Test2', 90),
(4, 'David', 'Science', 'Test3', 89);
SELECT 
    student_name,
    class,
    avg_score,
    RANK() OVER (PARTITION BY class ORDER BY avg_score DESC) as class_rank,
    AVG(avg_score) OVER (PARTITION BY class) as class_overall_avg
FROM (
    SELECT 
        student_name,
        class,
        AVG(score) as avg_score
    FROM test_scores
    GROUP BY student_name, class
)
ORDER BY class, class_rank;
Table 5: 4 records
student_name class avg_score class_rank class_overall_avg
Alice Math 87.66667 1 83.83333
Bob Math 80.00000 2 83.83333
Charlie Science 93.66667 1 91.33333
David Science 89.00000 2 91.33333
-- Calculate student averages, then rank within classes
SELECT 
    student_name,
    class,
    avg_score,
    RANK() OVER (PARTITION BY class ORDER BY avg_score DESC) as class_rank,
    AVG(avg_score) OVER (PARTITION BY class) as class_overall_avg
FROM (
    SELECT 
        student_name,
        class,
        AVG(score) as avg_score
    FROM test_scores
    GROUP BY student_name, class
) student_averages
ORDER BY class, class_rank;
Table 6: 4 records
student_name class avg_score class_rank class_overall_avg
Alice Math 87.66667 1 83.83333
Bob Math 80.00000 2 83.83333
Charlie Science 93.66667 1 91.33333
David Science 89.00000 2 91.33333

The subquery first collapses multiple test scores per student into a single average using GROUP BY. The outer query then applies window functions to rank students within their classes and compare individual averages to class averages. This pattern is common in analytical queries requiring both aggregation and comparative analysis.

Sales Analysis Example

Consider a sales dataset demonstrating practical applications of combined GROUP BY and PARTITION BY operations.

CREATE TABLE sales (
    sale_id INTEGER,
    salesperson TEXT,
    region TEXT,
    sale_amount REAL,
    sale_date TEXT
);
INSERT INTO sales VALUES
(1, 'Alice', 'North', 1000, '2024-01-15'),
(2, 'Alice', 'North', 1500, '2024-01-20'),
(3, 'Bob', 'North', 800, '2024-01-18'),
(4, 'Bob', 'North', 1200, '2024-01-22'),
(5, 'Charlie', 'South', 2000, '2024-01-16'),
(6, 'Charlie', 'South', 1800, '2024-01-25'),
(7, 'David', 'South', 1500, '2024-01-17'),
(8, 'David', 'South', 1600, '2024-01-23');

A comprehensive analysis ranks salespeople within regions while calculating regional statistics.

SELECT 
    salesperson,
    region,
    total_sales,
    RANK() OVER (PARTITION BY region ORDER BY total_sales DESC) as regional_rank,
    AVG(total_sales) OVER (PARTITION BY region) as regional_avg,
    total_sales - AVG(total_sales) OVER (PARTITION BY region) as diff_from_avg,
    ROUND(100.0 * total_sales / SUM(total_sales) OVER (PARTITION BY region), 2) as pct_of_regional_total
FROM (
    SELECT 
        salesperson,
        region,
        SUM(sale_amount) as total_sales
    FROM sales
    GROUP BY salesperson, region
) salesperson_totals
ORDER BY region, regional_rank;
Table 7: 4 records
salesperson region total_sales regional_rank regional_avg diff_from_avg pct_of_regional_total
Alice North 2500 1 2250 250 55.56
Bob North 2000 2 2250 -250 44.44
Charlie South 3800 1 3450 350 55.07
David South 3100 2 3450 -350 44.93

The inner query aggregates individual sales into per-salesperson totals using GROUP BY. The outer query then ranks salespeople within regions and calculates each salesperson’s percentage of regional total sales. Alice leads the North region with 2500 in total sales, representing 55.56 percent of the regional total, while Bob trails with 2000 representing 44.44 percent.

Running Totals and Window Frames

Window functions support frame specifications for calculating running aggregates. The frame defines which rows relative to the current row participate in the calculation.

SELECT 
    name,
    department,
    salary,
    hire_date,
    SUM(salary) OVER (
        PARTITION BY department 
        ORDER BY hire_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) as cumulative_dept_salary
FROM employees
ORDER BY department, hire_date;
Table 8: 8 records
name department salary hire_date cumulative_dept_salary
Olase, Francis Engineering 90000 2019-11-30 90000
da Silva, Evina Engineering 85000 2020-06-15 175000
Kaiser, David Engineering 80000 2021-02-01 255000
Cheng, Yufan HR 58000 2020-12-05 58000
Kirti, Gautham HR 55000 2021-08-22 113000
McCullum, Charlie Sales 70000 2018-07-10 70000
Samansky, Alison Sales 65000 2019-03-20 135000
Kern, Jason Sales 60000 2020-01-15 195000

This query calculates a running total of salaries within each department ordered by hire date. For each employee, the running total includes all employees hired in the same department up to and including the current employee’s hire date. The frame clause “ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW” specifies this behavior explicitly.

Decision Framework

Choose GROUP BY when creating summary reports that require one row per group, when reducing result sets through aggregation, or when building dashboards with aggregate statistics. The operation is appropriate when detail-level data is unnecessary in the output.

Choose PARTITION BY when preserving detail rows while adding aggregate context, when ranking rows within groups, when comparing individual values to group statistics, or when calculating running totals and moving averages. Window functions excel at analytical queries requiring both detail and summary information simultaneously.

Combine both operations when first aggregating data then performing comparative analysis on the aggregates, when ranking aggregated groups, or when calculating percentages of group totals from already-aggregated data.

Common Errors

Attempting to select non-aggregated columns without including them in GROUP BY causes errors. Every column in the SELECT clause must either appear in GROUP BY or be wrapped in an aggregate function.

-- Error: name must be grouped or aggregated
SELECT department, name, AVG(salary) 
  FROM employees 
 GROUP BY department;

-- Correct: include name in GROUP BY
SELECT department, name, AVG(salary) 
  FROM employees 
 GROUP BY department, name;

When combining GROUP BY and PARTITION BY, ensure partition columns exist in the grouped result set. If GROUP BY collapses away a column, that column cannot be used in PARTITION BY.

-- Error: employee_id grouped away but needed for partition
SELECT department, SUM(salary),
       RANK() OVER (PARTITION BY employee_id ORDER BY salary)
  FROM employees GROUP BY department;

-- Correct: partition by what remains after grouping
SELECT department, SUM(salary) as total_salary,
       RANK() OVER (ORDER BY SUM(salary) DESC) as dept_rank
  FROM employees GROUP BY department;

Expecting PARTITION BY to reduce row counts is a common misconception. Window functions always preserve the input row count unless DISTINCT or additional filtering is applied.

Performance Considerations

Window functions typically perform better than equivalent self-joins or correlated subqueries. Modern query optimizers recognize window function patterns and generate efficient execution plans. However, partitioning on unindexed columns or ordering by expressions can degrade performance on large datasets.

Consider creating indexes on columns used in PARTITION BY and ORDER BY clauses for large tables. When combining GROUP BY and PARTITION BY, the initial aggregation reduces the dataset size before window calculations, often improving overall query performance compared to window functions on raw data.

Multiple window functions with identical PARTITION BY and ORDER BY clauses can share computation. Modern optimizers recognize this pattern and avoid redundant sorting and partitioning operations.

-- Efficient: same window specification reused
SELECT 
    name,
    salary,
    RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rank,
    DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) as dense_rank,
    ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as row_num
FROM employees;

Understanding the distinction between GROUP BY aggregation and PARTITION BY window functions, their execution order, and their appropriate use cases enables construction of sophisticated analytical queries that efficiently answer complex business questions.


Files & Resources

All Files for Lesson 70.115

Notes

SQLite added comprehensive window function support including PARTITION BY in version 3.25.0 released in September 2018. Any modern SQLite installation supports these features. Check the version with SELECT sqlite_version(); to confirm compatibility. Versions prior to 3.25.0 require complex workarounds using self-joins and subqueries that are both inefficient and difficult to maintain.

Acknowledgements

Claude 4.6 (Sonnet) was used in the preparation of this lesson.

Errata

Let us know.

---
title: "GROUP BY and PARTITION BY: Two Approaches to Analytical Queries in SQL"
params:
  category: 70
  number: 115
  time: 45
  level: beginner
  tags: "sql,group by,partition,analytics"
  description: "Examines the fundamental differences between GROUP BY 
                aggregation, which collapses rows into summary groups, 
                and PARTITION BY window functions, which preserve all rows 
                while adding partition-level calculations. Combine both 
                operations for complex analytical queries. Provides
                examples using SQLite and R but lesson is applicable to
                any database and programming language."
date: "<small>`r Sys.Date()`</small>"
author: "<small>Martin Schedlbauer</small>"
email: "m.schedlbauer@neu.edu"
affilitation: "Northeastern University"
output: 
  bookdown::html_document2:
    toc: true
    toc_float: true
    collapsed: false
    number_sections: false
    code_download: true
    theme: lumen
    highlight: tango
---

---
title: "<small>`r params$category`.`r params$number`</small><br/><span style='color: #2E4053; font-size: 0.9em'>`r rmarkdown::metadata$title`</span>"
---

```{r echo=F}
# Package names
packages <- c("here", "RSQLite")

# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages], repos = "https://cloud.r-project.org")
}

# Packages loading
suppressPackageStartupMessages(invisible(lapply(packages, library, character.only = TRUE)))
```

```{r code=xfun::read_utf8(paste0(here::here(),'/R/_insert2DB.R')), include = FALSE}
```

## Introduction

GROUP BY and PARTITION BY represent two fundamentally different operations in SQL. GROUP BY performs aggregation by collapsing rows into groups, returning one row per group. PARTITION BY, used with window functions, divides rows into partitions but preserves all original rows while adding computed values based on partition-level calculations. Understanding this distinction is essential for writing efficient analytical queries.

## Execution Order and Query Processing

SQL queries execute in a specific order that determines how GROUP BY and PARTITION BY interact. The execution sequence proceeds as follows: FROM and JOIN clauses retrieve data, WHERE filters rows, GROUP BY aggregates and collapses rows, HAVING filters groups, SELECT applies window functions including PARTITION BY, ORDER BY sorts results, and finally LIMIT restricts output. This sequence means GROUP BY executes before window functions, so PARTITION BY operates on already-aggregated data when both appear in the same query.

## Sample Dataset

To illustrate the use of GROUP BY and PARTITION BY, consider an employee salary dataset:

```{r echo=F, warning=FALSE}
library(RSQLite)
dbcon <- dbConnect(RSQLite::SQLite(), ":memory:")
```

```{sql connection=dbcon}
CREATE TABLE employees (
    employee_id INTEGER,
    name TEXT,
    department TEXT,
    salary INTEGER,
    hire_date DATE
);
```

```{sql connection=dbcon}
INSERT INTO employees VALUES
(1, 'Kern, Jason', 'Sales', 60000, '2020-01-15'),
(2, 'Samansky, Alison', 'Sales', 65000, '2019-03-20'),
(3, 'McCullum, Charlie', 'Sales', 70000, '2018-07-10'),
(4, 'Kaiser, David', 'Engineering', 80000, '2021-02-01'),
(5, 'da Silva, Evina', 'Engineering', 85000, '2020-06-15'),
(6, 'Olase, Francis', 'Engineering', 90000, '2019-11-30'),
(7, 'Kirti, Gautham', 'HR', 55000, '2021-08-22'),
(8, 'Cheng, Yufan', 'HR', 58000, '2020-12-05');
```

## GROUP BY Behavior

GROUP BY collapses rows based on specified grouping columns and requires aggregate functions for non-grouped columns. The operation fundamentally reduces the result set size.

```{sql connection=dbcon}
SELECT 
    department,
    COUNT(*) as employee_count,
    AVG(salary) as avg_salary,
    MIN(salary) as min_salary,
    MAX(salary) as max_salary
FROM employees
GROUP BY department;
```

This query returns three rows, one per department, with aggregated statistics. The original eight employee records collapse into three department summaries. Any column in the SELECT clause must either appear in the GROUP BY clause or be wrapped in an aggregate function.

``` sql
-- Invalid: name is neither grouped nor aggregated
SELECT department, name, AVG(salary) 
  FROM employees 
 GROUP BY department;

-- Valid: all non-aggregated columns are in GROUP BY
SELECT department, name, AVG(salary) as avg_salary
  FROM employees
 GROUP BY department, name;
```

## PARTITION BY Behavior

PARTITION BY divides rows into partitions for window function calculations while preserving all original rows. Window functions compute values based on sets of rows related to the current row. Notice how the PARTITION BY clause is part of the SELECT clause and applied to aggregation functions. Of course, some aggregation results will be the same for all rows, but others can be the result of calculations over partitions.

```{sql connection=dbcon}
SELECT 
    employee_id,
    name,
    department,
    salary,
    COUNT(*) OVER (PARTITION BY department) as dept_employee_count,
    AVG(salary) OVER (PARTITION BY department) as dept_avg_salary,
    salary - AVG(salary) OVER (PARTITION BY department) as diff_from_avg
FROM employees;
```

This query returns all eight rows with department-level statistics added to each row. Every employee in Sales has the same department average of 65000, while every employee in Engineering has 85000. The partition-level calculations are repeated for each row in the partition.

## Ranking and Analytical Functions

Window functions support operations impossible with GROUP BY alone. Ranking functions assign sequential numbers to rows within partitions.

```{sql connection=dbcon}
SELECT 
    name,
    department,
    salary,
    ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as salary_rank,
    RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rank_with_ties,
    DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) as dense_rank
FROM employees;
```

ROW_NUMBER assigns unique sequential integers even when values tie. RANK assigns the same rank to tied values and skips subsequent ranks. DENSE_RANK assigns the same rank to ties without skipping ranks. Within the Sales department, Charlie ranks first at 70000, Bob second at 65000, and Alice third at 60000.

LEAD and LAG functions access values from other rows relative to the current row within a partition.

```{sql connection=dbcon}
SELECT 
    name,
    department,
    salary,
    LAG(salary) OVER (PARTITION BY department ORDER BY salary) as prev_salary,
    LEAD(salary) OVER (PARTITION BY department ORDER BY salary) as next_salary,
    salary - LAG(salary) OVER (PARTITION BY department ORDER BY salary) as increase
FROM employees
ORDER BY department, salary;
```

## Combining GROUP BY and PARTITION BY

Queries can combine both operations by using subqueries or common table expressions. GROUP BY first aggregates data, then PARTITION BY performs window calculations on the aggregated results. We will illustrate this with a different database.

```{sql connection=dbcon}
-- Create the table first
CREATE TABLE test_scores (
    student_id INTEGER,
    student_name TEXT,
    class TEXT,
    test_name TEXT,
    score INTEGER
);
```

```{sql connection=dbcon}
-- Insert sample data
INSERT INTO test_scores VALUES
(1, 'Alice', 'Math', 'Test1', 85),
(1, 'Alice', 'Math', 'Test2', 90),
(1, 'Alice', 'Math', 'Test3', 88),
(2, 'Bob', 'Math', 'Test1', 78),
(2, 'Bob', 'Math', 'Test2', 82),
(2, 'Bob', 'Math', 'Test3', 80),
(3, 'Charlie', 'Science', 'Test1', 92),
(3, 'Charlie', 'Science', 'Test2', 95),
(3, 'Charlie', 'Science', 'Test3', 94),
(4, 'David', 'Science', 'Test1', 88),
(4, 'David', 'Science', 'Test2', 90),
(4, 'David', 'Science', 'Test3', 89);
```

```{sql connection=dbcon}
SELECT 
    student_name,
    class,
    avg_score,
    RANK() OVER (PARTITION BY class ORDER BY avg_score DESC) as class_rank,
    AVG(avg_score) OVER (PARTITION BY class) as class_overall_avg
FROM (
    SELECT 
        student_name,
        class,
        AVG(score) as avg_score
    FROM test_scores
    GROUP BY student_name, class
)
ORDER BY class, class_rank;
```

```{sql connection=dbcon}
-- Calculate student averages, then rank within classes
SELECT 
    student_name,
    class,
    avg_score,
    RANK() OVER (PARTITION BY class ORDER BY avg_score DESC) as class_rank,
    AVG(avg_score) OVER (PARTITION BY class) as class_overall_avg
FROM (
    SELECT 
        student_name,
        class,
        AVG(score) as avg_score
    FROM test_scores
    GROUP BY student_name, class
) student_averages
ORDER BY class, class_rank;
```

The subquery first collapses multiple test scores per student into a single average using GROUP BY. The outer query then applies window functions to rank students within their classes and compare individual averages to class averages. This pattern is common in analytical queries requiring both aggregation and comparative analysis.

## Sales Analysis Example

Consider a sales dataset demonstrating practical applications of combined GROUP BY and PARTITION BY operations.

```{sql connection=dbcon}
CREATE TABLE sales (
    sale_id INTEGER,
    salesperson TEXT,
    region TEXT,
    sale_amount REAL,
    sale_date TEXT
);
```

```{sql connection=dbcon}
INSERT INTO sales VALUES
(1, 'Alice', 'North', 1000, '2024-01-15'),
(2, 'Alice', 'North', 1500, '2024-01-20'),
(3, 'Bob', 'North', 800, '2024-01-18'),
(4, 'Bob', 'North', 1200, '2024-01-22'),
(5, 'Charlie', 'South', 2000, '2024-01-16'),
(6, 'Charlie', 'South', 1800, '2024-01-25'),
(7, 'David', 'South', 1500, '2024-01-17'),
(8, 'David', 'South', 1600, '2024-01-23');
```

A comprehensive analysis ranks salespeople within regions while calculating regional statistics.

```{sql connection=dbcon}
SELECT 
    salesperson,
    region,
    total_sales,
    RANK() OVER (PARTITION BY region ORDER BY total_sales DESC) as regional_rank,
    AVG(total_sales) OVER (PARTITION BY region) as regional_avg,
    total_sales - AVG(total_sales) OVER (PARTITION BY region) as diff_from_avg,
    ROUND(100.0 * total_sales / SUM(total_sales) OVER (PARTITION BY region), 2) as pct_of_regional_total
FROM (
    SELECT 
        salesperson,
        region,
        SUM(sale_amount) as total_sales
    FROM sales
    GROUP BY salesperson, region
) salesperson_totals
ORDER BY region, regional_rank;
```

The inner query aggregates individual sales into per-salesperson totals using GROUP BY. The outer query then ranks salespeople within regions and calculates each salesperson's percentage of regional total sales. Alice leads the North region with 2500 in total sales, representing 55.56 percent of the regional total, while Bob trails with 2000 representing 44.44 percent.

## Running Totals and Window Frames

Window functions support frame specifications for calculating running aggregates. The frame defines which rows relative to the current row participate in the calculation.

```{sql connection=dbcon}
SELECT 
    name,
    department,
    salary,
    hire_date,
    SUM(salary) OVER (
        PARTITION BY department 
        ORDER BY hire_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) as cumulative_dept_salary
FROM employees
ORDER BY department, hire_date;
```

This query calculates a running total of salaries within each department ordered by hire date. For each employee, the running total includes all employees hired in the same department up to and including the current employee's hire date. The frame clause "ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW" specifies this behavior explicitly.

## Decision Framework

Choose GROUP BY when creating summary reports that require one row per group, when reducing result sets through aggregation, or when building dashboards with aggregate statistics. The operation is appropriate when detail-level data is unnecessary in the output.

Choose PARTITION BY when preserving detail rows while adding aggregate context, when ranking rows within groups, when comparing individual values to group statistics, or when calculating running totals and moving averages. Window functions excel at analytical queries requiring both detail and summary information simultaneously.

Combine both operations when first aggregating data then performing comparative analysis on the aggregates, when ranking aggregated groups, or when calculating percentages of group totals from already-aggregated data.

## Common Errors

Attempting to select non-aggregated columns without including them in GROUP BY causes errors. Every column in the SELECT clause must either appear in GROUP BY or be wrapped in an aggregate function.

``` sql
-- Error: name must be grouped or aggregated
SELECT department, name, AVG(salary) 
  FROM employees 
 GROUP BY department;

-- Correct: include name in GROUP BY
SELECT department, name, AVG(salary) 
  FROM employees 
 GROUP BY department, name;
```

When combining GROUP BY and PARTITION BY, ensure partition columns exist in the grouped result set. If GROUP BY collapses away a column, that column cannot be used in PARTITION BY.

``` sql
-- Error: employee_id grouped away but needed for partition
SELECT department, SUM(salary),
       RANK() OVER (PARTITION BY employee_id ORDER BY salary)
  FROM employees GROUP BY department;

-- Correct: partition by what remains after grouping
SELECT department, SUM(salary) as total_salary,
       RANK() OVER (ORDER BY SUM(salary) DESC) as dept_rank
  FROM employees GROUP BY department;
```

Expecting PARTITION BY to reduce row counts is a common misconception. Window functions always preserve the input row count unless DISTINCT or additional filtering is applied.

## Performance Considerations

Window functions typically perform better than equivalent self-joins or correlated subqueries. Modern query optimizers recognize window function patterns and generate efficient execution plans. However, partitioning on unindexed columns or ordering by expressions can degrade performance on large datasets.

Consider creating indexes on columns used in PARTITION BY and ORDER BY clauses for large tables. When combining GROUP BY and PARTITION BY, the initial aggregation reduces the dataset size before window calculations, often improving overall query performance compared to window functions on raw data.

Multiple window functions with identical PARTITION BY and ORDER BY clauses can share computation. Modern optimizers recognize this pattern and avoid redundant sorting and partitioning operations.

``` sql
-- Efficient: same window specification reused
SELECT 
    name,
    salary,
    RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rank,
    DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) as dense_rank,
    ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as row_num
FROM employees;
```

Understanding the distinction between GROUP BY aggregation and PARTITION BY window functions, their execution order, and their appropriate use cases enables construction of sophisticated analytical queries that efficiently answer complex business questions.

```{r echo=FALSE, warning=FALSE}
dbDisconnect(dbcon)
```

------------------------------------------------------------------------

## Files & Resources

```{r zipFiles, echo=FALSE}
zipName = sprintf("LessonFiles-%s-%s.zip", 
                 params$category,
                 params$number)

textALink = paste0("All Files for Lesson ", 
               params$category,".",params$number)

# downloadFilesLink() is included from _insert2DB.R
knitr::raw_html(downloadFilesLink(".", zipName, textALink))
```

------------------------------------------------------------------------

## Notes

SQLite added comprehensive window function support including PARTITION BY in version 3.25.0 released in September 2018. Any modern SQLite installation supports these features. Check the version with `SELECT sqlite_version();` to confirm compatibility. Versions prior to 3.25.0 require complex workarounds using self-joins and subqueries that are both inefficient and difficult to maintain.

## Acknowledgements

Claude 4.6 (Sonnet) was used in the preparation of this lesson.

## Errata

[Let us know](https://form.jotform.com/212187072784157){target="_blank"}.
