# Window Functions: SQL's Most Powerful Feature Nobody Uses
You are going to hit a problem. Not now. A few weeks into data analysis work, or maybe during a job interview, someone will ask you something like: "For each sale, can you show the sale amount alongside the running total for that salesperson?" Or: "Rank employees by salary within each department without removing any rows." Or: "For each order, show the previous order's amount for comparison."

You will try GROUP BY. It collapses rows. Wrong. Then someone tells you about window functions and you feel both relieved and slightly cheated that nobody mentioned these sooner. This is that mention.

Window functions compute a value for each row using other rows, without collapsing the result into groups. With a window function, 8 rows go in and 8 rows come out. With GROUP BY, 8 rows go in and 3 come out. That distinction is everything.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("window_demo.db")
conn.executescript("""
DROP TABLE IF EXISTS sales;
CREATE TABLE sales (
    sale_id INTEGER PRIMARY KEY,
    rep_name TEXT,
    region TEXT,
    amount REAL,
    sale_date TEXT,
    product TEXT
);
INSERT INTO sales VALUES
    (1, 'Alex', 'North', 45000, '2024-01-05', 'Laptop'),
    (2, 'Priya', 'South', 32000, '2024-01-08', 'Phone'),
    (3, 'Alex', 'North', 12000, '2024-01-12', 'Watch'),
    (4, 'Sam', 'East', 28000, '2024-01-15', 'Tablet'),
    (5, 'Priya', 'South', 75000, '2024-01-18', 'Laptop'),
    (6, 'Alex', 'North', 8500, '2024-01-22', 'Headphones'),
    (7, 'Sam', 'East', 45000, '2024-02-01', 'Laptop'),
    (8, 'Priya', 'South', 18000, '2024-02-05', 'Watch'),
    (9, 'Alex', 'North', 62000, '2024-02-10', 'Laptop'),
    (10, 'Sam', 'East', 15000, '2024-02-14', 'Headphones'),
    (11, 'Priya', 'South', 41000, '2024-02-18', 'Tablet'),
    (12, 'Alex', 'North', 33000, '2024-02-22', 'Phone');
""")
conn.commit()

def q(sql):
    return pd.read_sql_query(sql, conn)
```

Every window function uses the OVER() clause. That is what makes it a window function instead of a regular aggregate.
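The row-count claim is easy to verify side by side. A minimal, self-contained sketch — the toy table here is invented for illustration, and SQLite 3.25+ is assumed for window-function support (bundled with most modern Python builds):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (rep TEXT, amount REAL);
INSERT INTO t VALUES
    ('Alex', 10), ('Alex', 20), ('Alex', 30),
    ('Priya', 5), ('Priya', 15),
    ('Sam', 40), ('Sam', 50), ('Sam', 60);
""")

# GROUP BY collapses: one row per rep
grouped = conn.execute(
    "SELECT rep, SUM(amount) FROM t GROUP BY rep"
).fetchall()

# A window function keeps every row, attaching the group total to each
windowed = conn.execute(
    "SELECT rep, amount, SUM(amount) OVER (PARTITION BY rep) FROM t"
).fetchall()

print(len(grouped))   # 3
print(len(windowed))  # 8
```

Eight rows in: GROUP BY hands back three, the window version hands back all eight, each carrying its rep's total.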
```sql
function_name() OVER (
    PARTITION BY column       -- optional: like GROUP BY for the window
    ORDER BY column           -- optional: defines row order within the window
    ROWS/RANGE BETWEEN ...    -- optional: defines which rows to include
)
```

Without PARTITION BY, the window is the entire table. With PARTITION BY, each group gets a separate window — like GROUP BY, but the rows stay.

Every aggregate you know (SUM, AVG, COUNT, MIN, MAX) can become a window function by adding OVER.

```python
result = q("""
SELECT
    rep_name,
    sale_date,
    amount,
    SUM(amount) OVER () AS company_total,
    AVG(amount) OVER () AS company_avg,
    SUM(amount) OVER (PARTITION BY rep_name) AS rep_total,
    AVG(amount) OVER (PARTITION BY rep_name) AS rep_avg,
    ROUND(100.0 * amount / SUM(amount) OVER (PARTITION BY rep_name), 1)
        AS pct_of_rep_total
FROM sales
ORDER BY rep_name, sale_date;
""")
print(result.round(1))
```

Output:

```
  rep_name   sale_date  amount  company_total  company_avg  rep_total  rep_avg  pct_of_rep_total
0     Alex  2024-01-05   45000       414500.0      34541.7   160500.0  32100.0              28.0
1     Alex  2024-01-12   12000       414500.0      34541.7   160500.0  32100.0               7.5
2     Alex  2024-01-22    8500       414500.0      34541.7   160500.0  32100.0               5.3
3     Alex  2024-02-10   62000       414500.0      34541.7   160500.0  32100.0              38.6
4     Alex  2024-02-22   33000       414500.0      34541.7   160500.0  32100.0              20.6
5    Priya  2024-01-08   32000       414500.0      34541.7   166000.0  41500.0              19.3
...
```

Every single row is still there. No collapsing. But each row now shows:

- The company-wide total alongside the individual sale
- The rep's total alongside each individual sale
- What percentage of that rep's revenue this one sale represents

With GROUP BY alone, this takes aggregated subqueries joined back to the detail rows. With window functions, it is one query.

Add ORDER BY inside OVER() and the window becomes cumulative.
```python
result = q("""
SELECT
    rep_name,
    sale_date,
    amount,
    SUM(amount) OVER (
        PARTITION BY rep_name
        ORDER BY sale_date
    ) AS running_total
FROM sales
ORDER BY rep_name, sale_date;
""")
print(result)
```

Output:

```
  rep_name   sale_date  amount  running_total
0     Alex  2024-01-05   45000        45000.0
1     Alex  2024-01-12   12000        57000.0
2     Alex  2024-01-22    8500        65500.0
3     Alex  2024-02-10   62000       127500.0
4     Alex  2024-02-22   33000       160500.0
5    Priya  2024-01-08   32000        32000.0
6    Priya  2024-01-18   75000       107000.0
...
```

The running total accumulates within each rep's partition in chronological order. Alex starts at 45000, adds 12000 to get 57000, and so on. Priya's running total starts fresh at 32000 because it is a separate partition.

The three ranking functions look similar and produce different results when there are ties.

```python
result = q("""
SELECT
    rep_name,
    amount,
    ROW_NUMBER() OVER (ORDER BY amount DESC) AS row_num,
    RANK() OVER (ORDER BY amount DESC) AS rank_val,
    DENSE_RANK() OVER (ORDER BY amount DESC) AS dense_rank_val
FROM sales
ORDER BY amount DESC;
""")
print(result)
```

Output:

```
  rep_name  amount  row_num  rank_val  dense_rank_val
0    Priya   75000        1         1               1
1     Alex   62000        2         2               2
2     Alex   45000        3         3               3
3      Sam   45000        4         3               3
4    Priya   41000        5         5               4
5     Alex   33000        6         6               5
...
```

Alex and Sam both have 45000. Watch what happens:

- ROW_NUMBER assigns 3 and 4. No ties, just sequential numbering. Which of the two gets 3 is arbitrary.
- RANK assigns 3 and 3, then skips to 5. The skip represents the gap where 4 would have been.
- DENSE_RANK assigns 3 and 3, then continues with 4. No gaps.

When to use which:

- ROW_NUMBER when you want exactly N rows with no duplicates (top 3, strictly).
- RANK for sports-style rankings where ties share a rank and the next rank skips.
- DENSE_RANK when you want tied entries to share a rank without creating gaps downstream.

The real power: rank within partitions.
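One more ROW_NUMBER idiom worth knowing: deduplication, keeping exactly one row per key. A minimal sketch on an invented events table (the names and data here are illustrative, not part of the post's sales schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (user_id TEXT, seen_at TEXT, payload TEXT);
INSERT INTO events VALUES
    ('u1', '2024-01-01', 'old'),
    ('u1', '2024-03-01', 'newest'),
    ('u2', '2024-02-01', 'only');
""")

# Number each user's rows newest-first, then keep only row number 1:
# exactly one row per user survives, and it is always the latest one.
rows = conn.execute("""
    SELECT user_id, payload FROM (
        SELECT user_id, payload,
               ROW_NUMBER() OVER (
                   PARTITION BY user_id
                   ORDER BY seen_at DESC
               ) AS rn
        FROM events
    )
    WHERE rn = 1
    ORDER BY user_id;
""").fetchall()

print(rows)  # [('u1', 'newest'), ('u2', 'only')]
```

Because ROW_NUMBER never ties, the filter `rn = 1` is guaranteed to keep exactly one row per partition — RANK could keep two.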
```python
result = q("""
SELECT
    rep_name,
    region,
    amount,
    sale_date,
    RANK() OVER (
        PARTITION BY rep_name
        ORDER BY amount DESC
    ) AS rank_within_rep,
    RANK() OVER (
        PARTITION BY region
        ORDER BY amount DESC
    ) AS rank_within_region
FROM sales
ORDER BY rep_name, rank_within_rep;
""")
print(result)
```

Output:

```
  rep_name region  amount   sale_date  rank_within_rep  rank_within_region
0     Alex  North   62000  2024-02-10                1                   1
1     Alex  North   45000  2024-01-05                2                   2
2     Alex  North   33000  2024-02-22                3                   3
3     Alex  North   12000  2024-01-12                4                   4
4     Alex  North    8500  2024-01-22                5                   5
5    Priya  South   75000  2024-01-18                1                   1
...
```

Each rep is ranked within their own results, and also within their region (which happens to be identical here, since each rep owns one region). In a real dataset with multiple reps per region, the two rankings would differ.

LAG looks at the previous row. LEAD looks at the next row. Both without any joins.

```python
result = q("""
SELECT
    rep_name,
    sale_date,
    amount,
    LAG(amount) OVER (PARTITION BY rep_name ORDER BY sale_date) AS prev_sale,
    LEAD(amount) OVER (PARTITION BY rep_name ORDER BY sale_date) AS next_sale,
    amount - LAG(amount) OVER (
        PARTITION BY rep_name
        ORDER BY sale_date
    ) AS change_from_prev
FROM sales
ORDER BY rep_name, sale_date;
""")
print(result)
```

Output:

```
  rep_name   sale_date  amount  prev_sale  next_sale  change_from_prev
0     Alex  2024-01-05   45000        NaN    12000.0               NaN
1     Alex  2024-01-12   12000    45000.0     8500.0          -33000.0
2     Alex  2024-01-22    8500    12000.0    62000.0           -3500.0
3     Alex  2024-02-10   62000     8500.0    33000.0           53500.0
4     Alex  2024-02-22   33000    62000.0        NaN          -29000.0
```

The first row has no previous sale, so prev_sale is NULL. The last row has no next sale, so next_sale is NULL. Every other row shows both neighbors. change_from_prev shows whether each sale was up or down from the previous one. This is month-over-month or sale-by-sale comparison without any self-joins or CTEs.

LAG(amount, 2) looks two rows back. LEAD(amount, 1, 0) looks one row ahead and returns 0 instead of NULL when there is no next row.
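The offset and default arguments are easy to sanity-check on a toy table (invented data, in-memory database; a float default is used so the return type matches the column):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE s (day INTEGER, amount REAL);
INSERT INTO s VALUES (1, 100), (2, 200), (3, 300), (4, 400);
""")

rows = conn.execute("""
    SELECT day,
           amount,
           LAG(amount, 2)       OVER (ORDER BY day) AS two_back,
           LEAD(amount, 1, 0.0) OVER (ORDER BY day) AS next_or_zero
    FROM s
    ORDER BY day;
""").fetchall()

for row in rows:
    print(row)
# (1, 100.0, None, 200.0)   <- no row two back: NULL
# (2, 200.0, None, 300.0)   <- still no row two back
# (3, 300.0, 100.0, 400.0)
# (4, 400.0, 200.0, 0.0)    <- no next row: default 0.0 instead of NULL
```

The first two rows get NULL from LAG because there is nothing two rows back, while the last row gets the supplied default from LEAD instead of NULL.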
The default fill value is the third argument.

NTILE(n) divides rows into n equal buckets and assigns each row a bucket number.

```python
result = q("""
SELECT
    rep_name,
    amount,
    NTILE(4) OVER (ORDER BY amount DESC) AS quartile
FROM sales
ORDER BY amount DESC;
""")
print(result)
```

Output:

```
  rep_name  amount  quartile
0    Priya   75000         1
1     Alex   62000         1
2     Alex   45000         1
3      Sam   45000         2
4    Priya   41000         2
5     Alex   33000         2
6    Priya   32000         3
...
```

The top 25% of sales land in quartile 1. This is how you create performance tiers, risk buckets, or customer segments by value without hardcoding thresholds.

FIRST_VALUE and LAST_VALUE get the first or last value in a window.

```python
result = q("""
SELECT
    rep_name,
    sale_date,
    amount,
    FIRST_VALUE(amount) OVER (
        PARTITION BY rep_name
        ORDER BY sale_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS first_sale,
    LAST_VALUE(amount) OVER (
        PARTITION BY rep_name
        ORDER BY sale_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS latest_sale
FROM sales
ORDER BY rep_name, sale_date;
""")
print(result)
```

ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING means the window includes every row in the partition, first to last. Without it, the default frame ends at the current row, so LAST_VALUE just returns the current row's own value — almost never what you want.

Sales performance dashboard in one query.
```python
result = q("""
WITH ranked AS (
    SELECT
        rep_name,
        sale_date,
        amount,
        SUM(amount) OVER (PARTITION BY rep_name ORDER BY sale_date) AS running_total,
        AVG(amount) OVER (PARTITION BY rep_name) AS rep_avg,
        RANK() OVER (ORDER BY amount DESC) AS overall_rank,
        RANK() OVER (PARTITION BY rep_name ORDER BY amount DESC) AS personal_best_rank,
        LAG(amount) OVER (PARTITION BY rep_name ORDER BY sale_date) AS prev_amount,
        amount - LAG(amount) OVER (
            PARTITION BY rep_name
            ORDER BY sale_date
        ) AS change_from_prev
    FROM sales
)
SELECT
    rep_name,
    sale_date,
    amount,
    ROUND(running_total, 0) AS cumulative,
    ROUND(rep_avg, 0) AS rep_average,
    overall_rank,
    personal_best_rank,
    COALESCE(ROUND(change_from_prev, 0), 0) AS change_vs_prev
FROM ranked
ORDER BY rep_name, sale_date;
""")
print(result)
```

Running total per rep. Company-wide rank per sale. Whether each sale beat the rep's previous sale. All in one query. No joins. No Python post-processing.

Use the Index, Luke at use-the-index-luke.com has a section on window functions that explains not just the syntax but why the database executes them the way it does. Written by Markus Winand, it is the most practically useful resource for understanding SQL performance and window functions together. Search "Use the Index Luke window functions."

Mode Analytics' SQL tutorial also has an excellent interactive window functions section at mode.com/sql-tutorial/sql-window-functions that lets you run every query in this post directly in your browser. Zero setup. Immediately useful.

Create window_functions_practice.py and use the sales database from this post.

1. Write a query that shows each sale with the rep's running total, the rep's all-time best sale (use MAX as a window function), and how far this sale is from their personal best.
2. Find the top 2 sales per region using RANK and a CTE to filter. Only show rank 1 and rank 2 within each region.
3. Calculate the month-over-month growth rate for total company revenue. January is the baseline; show February as a percentage change from January.
4. Using NTILE(3), bucket all sales into three tiers: top third, middle third, bottom third. Show how many sales are in each tier and the total revenue per tier.
5. For each rep, show their first sale date, latest sale date, first sale amount, and latest sale amount in a single query using FIRST_VALUE and LAST_VALUE.

One post left in Phase 4: connecting SQL to Python so your data pipelines can use both languages seamlessly. Then Phase 5: Git, GitHub, Jupyter, Colab — the tools that organize your work and make it shareable. Then the real thing starts: machine learning.
