AI News Hub Logo

AI News Hub

Subqueries and CTEs: SQL Gets Readable

DEV Community
Akhilesh

You learned joins. You can connect tables. Now you have a problem that requires multiple steps. First find the average salary per department. Then find employees who earn above their department average. One query cannot do this directly because you need the result of one calculation as input to another. Two solutions exist. Subqueries nest one query inside another. CTEs give each step a name and build them sequentially. Both solve the same problem. CTEs do it without making you want to delete your laptop. A subquery is a SELECT statement wrapped in parentheses and embedded inside another query. import sqlite3 import pandas as pd conn = sqlite3.connect("company.db") conn.executescript(""" DROP TABLE IF EXISTS employees; DROP TABLE IF EXISTS departments; CREATE TABLE departments ( dept_id INTEGER PRIMARY KEY, dept_name TEXT, budget INTEGER ); CREATE TABLE employees ( emp_id INTEGER PRIMARY KEY, name TEXT, salary INTEGER, dept_id INTEGER, hire_year INTEGER ); INSERT INTO departments VALUES (1, 'Engineering', 500000), (2, 'Marketing', 300000), (3, 'Sales', 400000); INSERT INTO employees VALUES (1, 'Alex', 55000, 1, 2022), (2, 'Priya', 82000, 2, 2021), (3, 'Sam', 43000, 1, 2023), (4, 'Jordan', 95000, 3, 2019), (5, 'Lisa', 67000, 2, 2022), (6, 'Ravi', 71000, 1, 2021), (7, 'Tom', 88000, 3, 2018), (8, 'Nina', 59000, 1, 2023); """) conn.commit() def q(sql): return pd.read_sql_query(sql, conn) Find employees who earn above the company average: result = q(""" SELECT name, salary FROM employees WHERE salary > (SELECT AVG(salary) FROM employees) ORDER BY salary DESC; """) print(result) print(f"\nCompany average: {q('SELECT AVG(salary) FROM employees').iloc[0,0]:,.0f}") Output: name salary 0 Jordan 95000 1 Tom 88000 2 Priya 82000 3 Ravi 71000 Company average: 70,000 The inner query (SELECT AVG(salary) FROM employees) runs first, produces 70000, and the outer query uses that number in its WHERE clause. Two separate operations written as one. A subquery in the FROM clause creates a temporary table you can query against. result = q(""" SELECT dept_id, name, salary, dept_avg, ROUND(salary - dept_avg, 0) AS vs_dept_avg FROM ( SELECT e.emp_id, e.name, e.salary, e.dept_id, AVG(e.salary) OVER (PARTITION BY e.dept_id) AS dept_avg FROM employees e ) sub ORDER BY dept_id, salary DESC; """) print(result) The subquery labeled sub computes department averages alongside individual rows. The outer query then selects from that result. This is called a derived table or inline view. A correlated subquery references a column from the outer query. It runs once per row of the outer query. result = q(""" SELECT e.name, e.salary, e.dept_id, (SELECT AVG(salary) FROM employees WHERE dept_id = e.dept_id) AS dept_avg FROM employees e ORDER BY e.dept_id, e.salary DESC; """) print(result.round(0)) Output: name salary dept_id dept_avg 0 Ravi 71000 1 57000.0 1 Alex 55000 1 57000.0 2 Sam 43000 1 57000.0 3 Nina 59000 1 57000.0 4 Priya 82000 2 74500.0 5 Lisa 67000 2 74500.0 6 Jordan 95000 3 91500.0 7 Tom 88000 3 91500.0 The WHERE dept_id = e.dept_id part references e.dept_id from the outer query. For each row, the subquery calculates the average for that specific employee's department. Correlated subqueries are powerful but can be slow on large tables because they run for every row. CTE stands for Common Table Expression. It lets you give a subquery a name and reference it like a table. The syntax uses the WITH keyword. result = q(""" WITH dept_averages AS ( SELECT dept_id, AVG(salary) AS avg_salary FROM employees GROUP BY dept_id ) SELECT e.name, e.salary, e.dept_id, d.avg_salary AS dept_avg, ROUND(e.salary - d.avg_salary, 0) AS above_avg FROM employees e JOIN dept_averages d ON e.dept_id = d.dept_id WHERE e.salary > d.avg_salary ORDER BY above_avg DESC; """) print(result) Output: name salary dept_id dept_avg above_avg 0 Jordan 95000 3 91500.0 3500.0 1 Ravi 71000 1 57000.0 14000.0 2 Priya 82000 2 74500.0 7500.0 3 Tom 88000 3 91500.0 -3500.0 Wait, Tom earns less than the dept average but appears? No, look again. Tom earns 88000 and the Sales average is 91500. Tom is below average. Only Jordan (95000 vs 91500), Ravi (71000 vs 57000), and Priya (82000 vs 74500) are above their department average. The CTE dept_averages is defined once, named clearly, and used in the main query like any other table. The logic is readable. You can see exactly what each piece does. CTEs chain together. Each one can reference those defined before it. result = q(""" WITH dept_stats AS ( SELECT dept_id, COUNT(*) AS headcount, AVG(salary) AS avg_salary, SUM(salary) AS total_salary FROM employees GROUP BY dept_id ), dept_with_budget AS ( SELECT d.dept_id, d.dept_name, ds.headcount, ROUND(ds.avg_salary, 0) AS avg_salary, ds.total_salary, dep.budget, ROUND(100.0 * ds.total_salary / dep.budget, 1) AS budget_used_pct FROM dept_stats ds JOIN departments dep ON ds.dept_id = dep.dept_id JOIN (SELECT dept_id, dept_name FROM departments) d ON ds.dept_id = d.dept_id ), over_budget AS ( SELECT * FROM dept_with_budget WHERE budget_used_pct > 20 ) SELECT * FROM over_budget ORDER BY budget_used_pct DESC; """) print(result) Output: dept_id dept_name headcount avg_salary total_salary budget budget_used_pct 0 1 Engineering 4 57000.0 228000 500000 45.6 1 3 Sales 2 91500.0 183000 400000 45.8 2 2 Marketing 2 74500.0 149000 300000 49.7 Three CTEs. Each builds on the previous. The final SELECT just picks from the last one. Reading top to bottom, the logic is completely clear. If this were a nested subquery it would be a 40-line monster that nobody could read or debug. A recursive CTE references itself. It is how you traverse tree structures in SQL, like organizational hierarchies. conn.executescript(""" DROP TABLE IF EXISTS org; CREATE TABLE org ( emp_id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER ); INSERT INTO org VALUES (1, 'CEO', NULL), (2, 'VP Eng', 1), (3, 'VP Sales', 1), (4, 'Manager A', 2), (5, 'Manager B', 2), (6, 'Alex', 4), (7, 'Priya', 4), (8, 'Sam', 5), (9, 'Jordan', 3); """) conn.commit() result = q(""" WITH RECURSIVE org_tree AS ( SELECT emp_id, name, manager_id, 0 AS level, name AS path FROM org WHERE manager_id IS NULL UNION ALL SELECT o.emp_id, o.name, o.manager_id, ot.level + 1, ot.path || ' > ' || o.name FROM org o JOIN org_tree ot ON o.manager_id = ot.emp_id ) SELECT SUBSTR(' ', 1, level * 4) || name AS org_chart, level, path FROM org_tree ORDER BY path; """) print(result.to_string(index=False)) Output: org_chart level path CEO 0 CEO VP Eng 1 CEO > VP Eng Manager A 2 CEO > VP Eng > Manager A Alex 3 CEO > VP Eng > Manager A > Alex Priya 3 CEO > VP Eng > Manager A > Priya Manager B 2 CEO > VP Eng > Manager B Sam 3 CEO > VP Eng > Manager B > Sam VP Sales 1 CEO > VP Sales Jordan 2 CEO > VP Sales > Jordan The recursive CTE starts at the top of the tree (where manager_id IS NULL), then repeatedly joins the table to itself to traverse downward. The level tracks depth. The path builds a breadcrumb trail. You will use this for org charts, folder structures, comment threads, and any hierarchical data. Subquery in WHERE: when you need one value or a list of values to filter against. Fast, simple, effective. Subquery in FROM: when you need an intermediate table that does not exist but can be computed. Works but CTEs are more readable. Correlated subquery: when the calculation must reference the current row of the outer query. Use sparingly, they can be slow. CTE: almost everything else. Multi-step logic. Readable. Debuggable. Test each CTE independently by running it alone. Add the next one when the first works. Recursive CTE: hierarchy traversal. Nothing else handles this as cleanly in SQL. Create cte_practice.py. Write a CTE that calculates the running total of salary per department. Each employee row shows their salary and the cumulative salary spent in their department up to and including their hire year. Write a three-CTE chain: first CTE gets all employees hired after 2021, second CTE calculates their department averages, third CTE finds which of those newer employees already earn above their department average. Using a recursive CTE, find all employees who report to "VP Eng" directly or indirectly. Show their names and how many levels below VP Eng they are. Window functions. The most powerful SQL feature most people learn last. They let you compute values across rows without collapsing them into groups. Running totals. Rankings. Lead and lag comparisons. The things that would require multiple CTEs or Python loops become single function calls.