In programming, we rarely write all our code in a single function. We decompose complex logic into named functions, each with a clear purpose, each testable in isolation. This modular approach makes code maintainable, understandable, and reusable.
SQL, historically, lacked this capability. Every derived table was an anonymous inline subquery—unnamed, untestable, and invisible to the surrounding query's context. Named subqueries change everything. CTEs allow you to give names to intermediate result sets, transforming SQL from a monolithic block into a structured, modular composition of clearly-defined data transformations.
By the end of this page, you will understand how to think about CTEs as modular building blocks, when to extract subqueries into named CTEs versus keeping them inline, how to design CTE chains that tell a coherent data story, and patterns for building complex analytical queries from simple, composable parts.
A CTE is fundamentally a named subquery—a complete SELECT statement given an identifier that can be referenced elsewhere in the query. But calling them "named subqueries" understates their transformative impact on how we write and think about SQL.
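To ground that definition, here is a minimal sketch of the shape (the `recent_orders` name and the specific columns are illustrative): a WITH clause declares the named result set, and the main query then references it exactly as it would a table.

```sql
-- Minimal CTE shape: name an intermediate result, then query it like a table
-- (the recent_orders CTE and its columns are illustrative)
WITH recent_orders AS (
    SELECT order_id, customer_id, amount
    FROM orders
    WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
)
SELECT customer_id, SUM(amount) as recent_spend
FROM recent_orders
GROUP BY customer_id;
```

The same pattern scales to any number of named steps, as the examples throughout this page show.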
Consider what naming does in programming:
`calculate_tax(amount)` tells you something; `(amount * 0.0825)` makes you guess.

The paradigm shift:
Without CTEs, SQL is expression-oriented—you build complex expressions by nesting simpler ones. The outer expression's reader must mentally unwrap each layer.
With CTEs, SQL becomes declaration-oriented—you declare named intermediate results, then compose them. The reader follows a sequence of named transformations.
```sql
-- EXPRESSION-ORIENTED (nested, anonymous)
-- Reader must decode from inside-out
SELECT
    department_name,
    (SELECT AVG(salary)
     FROM employees e
     WHERE e.department_id = d.department_id
       AND e.employee_id IN (
           SELECT employee_id
           FROM performance_reviews
           WHERE rating >= 4
             AND review_date >= DATE_SUB(CURRENT_DATE, INTERVAL 1 YEAR)
       )) as avg_top_performer_salary
FROM departments d;

-- DECLARATION-ORIENTED (named, sequential)
-- Reader follows a narrative
WITH recent_top_performers AS (
    SELECT employee_id
    FROM performance_reviews
    WHERE rating >= 4
      AND review_date >= CURRENT_DATE - INTERVAL '1 year'
),
top_performer_salaries AS (
    SELECT e.department_id, e.salary
    FROM employees e
    INNER JOIN recent_top_performers rtp USING (employee_id)
),
department_averages AS (
    SELECT department_id, AVG(salary) as avg_salary
    FROM top_performer_salaries
    GROUP BY department_id
)
SELECT
    d.department_name,
    da.avg_salary as avg_top_performer_salary
FROM departments d
LEFT JOIN department_averages da USING (department_id);
```

Can a colleague understand your query's purpose by reading only the CTE names and the final SELECT? If yes, your names are good. If they need to read every CTE's body, your names need work.
Let's dissect what makes an effective named subquery. A well-designed CTE has specific characteristics that make it useful as a modular component.
```sql
-- A well-designed named subquery has these properties:

WITH quarterly_revenue_by_product (
    -- 1. EXPLICIT COLUMN NAMES (when beneficial)
    product_id,
    product_name,
    category,
    quarter,
    revenue,
    units_sold
) AS (
    -- 2. SINGLE RESPONSIBILITY
    -- Does one thing: calculates quarterly revenue per product
    SELECT
        p.product_id,
        p.product_name,
        p.category,
        DATE_TRUNC('quarter', o.order_date) as quarter,
        SUM(oi.quantity * oi.unit_price) as revenue,
        SUM(oi.quantity) as units_sold
    FROM products p
    INNER JOIN order_items oi ON p.product_id = oi.product_id
    INNER JOIN orders o ON oi.order_id = o.order_id
    WHERE o.status = 'completed'
      -- 3. APPROPRIATE SCOPE
      -- Not overly broad (all history) or overly narrow (hardcoded dates)
      AND o.order_date >= DATE_TRUNC('year', CURRENT_DATE - INTERVAL '2 years')
    GROUP BY p.product_id, p.product_name, p.category, DATE_TRUNC('quarter', o.order_date)
)
-- 4. CLEAR CONTRACT
-- Users of this CTE know exactly what columns they get
-- and what each row represents (one product-quarter combination)
SELECT
    product_name,
    quarter,
    revenue,
    revenue - LAG(revenue) OVER (
        PARTITION BY product_id
        ORDER BY quarter
    ) as revenue_change
FROM quarterly_revenue_by_product
WHERE category = 'Electronics'
ORDER BY quarter, revenue DESC;
```

| Property | Description | Why It Matters |
|---|---|---|
| Single Responsibility | CTE does exactly one logical operation | Easier to understand, test, and modify independently |
| Clear Contract | Predictable columns and row semantics | Users don't need to read the implementation |
| Appropriate Scope | Neither too broad nor too narrow | Reusable across multiple main queries |
| Meaningful Name | Name describes WHAT, not HOW | Self-documenting; reduces need for comments |
| Explicit Columns (when needed) | Column list in CTE definition | Prevents ambiguity with expressions/calculations |
The "What vs How" naming principle:
A CTE name should describe what result set it represents, not how it computes that result:
| ❌ How (Implementation-Focused) | ✅ What (Intent-Focused) |
|---|---|
| `orders_joined_with_customers` | `customer_orders` |
| `sales_grouped_by_month` | `monthly_sales` |
| `products_filtered_by_active` | `active_products` |
| `users_where_login_recent` | `recently_active_users` |
| `aggregated_stats` | `department_performance_metrics` |
Not every subquery should become a CTE. Overusing CTEs leads to verbose queries where the structure obscures rather than clarifies. Extraction pays off when a subquery is referenced more than once or when its logic is complex enough to carry real business meaning; trivial lookups read better inline:

- A simple `WHERE id IN (SELECT id FROM other_table)` needs no CTE
- An `EXISTS (SELECT 1 FROM ...)` check is clear enough inline
```sql
-- ❌ OVER-EXTRACTION: CTE adds verbosity without value
WITH active_filter AS (
    SELECT user_id FROM users WHERE status = 'active'
)
SELECT * FROM orders
WHERE user_id IN (SELECT user_id FROM active_filter);

-- ✅ BETTER: Simple enough to be inline
SELECT * FROM orders
WHERE user_id IN (SELECT user_id FROM users WHERE status = 'active');

-- ✅ GOOD EXTRACTION: Multiple references
WITH active_users AS (
    SELECT user_id, username, email, region
    FROM users
    WHERE status = 'active'
      AND last_login >= CURRENT_DATE - INTERVAL '30 days'
)
SELECT
    r.region_name,
    COUNT(au.user_id) as active_count,
    (SELECT SUM(amount)
     FROM orders o
     WHERE o.user_id IN (SELECT user_id
                         FROM active_users a2
                         WHERE a2.region = r.region_id)) as region_revenue
FROM regions r
LEFT JOIN active_users au ON r.region_id = au.region
GROUP BY r.region_id, r.region_name;

-- ✅ GOOD EXTRACTION: Complex logic with business meaning
WITH customer_lifetime_value AS (
    -- Complex calculation deserves a name
    SELECT
        c.customer_id,
        c.customer_name,
        c.acquisition_date,
        SUM(o.amount) as total_revenue,
        COUNT(DISTINCT o.order_id) as total_orders,
        AVG(o.amount) as avg_order_value,
        MAX(o.order_date) as last_order_date,
        DATEDIFF(day, c.acquisition_date, MAX(o.order_date)) as customer_lifespan_days,
        SUM(o.amount) / NULLIF(
            DATEDIFF(month, c.acquisition_date, CURRENT_DATE), 0
        ) as monthly_revenue_rate
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.customer_name, c.acquisition_date
)
SELECT *
FROM customer_lifetime_value
WHERE monthly_revenue_rate > 100
ORDER BY total_revenue DESC;
```

The most effective use of named subqueries is to tell a data story—a logical narrative that transforms raw data into actionable insights through a sequence of named steps. Each CTE represents a chapter in this story.
The narrative pattern:
```sql
-- A COMPLETE DATA STORY: Identifying at-risk high-value customers
-- Business question: Which premium customers are showing signs of churn?

WITH
-- Chapter 1: Define our universe (base data)
premium_customers AS (
    SELECT
        customer_id,
        customer_name,
        email,
        account_manager_id,
        DATE_DIFF('day', created_at, CURRENT_DATE) as tenure_days
    FROM customers
    WHERE tier = 'premium'
      AND status = 'active'
),

-- Chapter 2: Understand their history (enrichment)
customer_order_history AS (
    SELECT
        pc.customer_id,
        pc.customer_name,
        pc.email,
        pc.account_manager_id,
        pc.tenure_days,
        COUNT(o.order_id) as total_orders,
        SUM(o.amount) as lifetime_value,
        AVG(o.amount) as avg_order_value,
        MAX(o.order_date) as last_order_date,
        DATE_DIFF('day', MAX(o.order_date), CURRENT_DATE) as days_since_last_order
    FROM premium_customers pc
    LEFT JOIN orders o ON pc.customer_id = o.customer_id
    GROUP BY pc.customer_id, pc.customer_name, pc.email, pc.account_manager_id, pc.tenure_days
),

-- Chapter 3: Calculate behavioral signals (transformation)
customer_engagement_signals AS (
    SELECT
        coh.*,
        -- Calculate expected order frequency
        CASE
            WHEN tenure_days > 90 THEN total_orders * 90.0 / tenure_days
            ELSE total_orders
        END as orders_per_90_days,
        -- Calculate recency score (0-100, higher = more recent = better)
        GREATEST(0, 100 - days_since_last_order * 2) as recency_score,
        -- Calculate value score (percentile within cohort)
        PERCENT_RANK() OVER (ORDER BY lifetime_value) * 100 as value_percentile
    FROM customer_order_history coh
),

-- Chapter 4: Apply churn risk criteria (focus)
at_risk_customers AS (
    SELECT
        *,
        -- Risk scoring algorithm
        CASE
            WHEN days_since_last_order > 60 AND orders_per_90_days > 3 THEN 'CRITICAL'
            WHEN days_since_last_order > 45 AND value_percentile > 75 THEN 'HIGH'
            WHEN days_since_last_order > 30 AND value_percentile > 50 THEN 'MEDIUM'
            WHEN days_since_last_order > 21 THEN 'LOW'
            ELSE 'STABLE'
        END as risk_level
    FROM customer_engagement_signals
    -- Only customers who previously ordered regularly
    WHERE total_orders >= 3
      AND tenure_days >= 60
),

-- Chapter 5: Prepare actionable output (aggregation/formatting)
prioritized_risk_report AS (
    SELECT
        arc.customer_id,
        arc.customer_name,
        arc.email,
        am.account_manager_name,
        arc.lifetime_value,
        arc.last_order_date,
        arc.days_since_last_order,
        arc.risk_level,
        -- Calculate priority score for sorting
        CASE arc.risk_level
            WHEN 'CRITICAL' THEN 100
            WHEN 'HIGH' THEN 75
            WHEN 'MEDIUM' THEN 50
            WHEN 'LOW' THEN 25
            ELSE 0
        END + (arc.value_percentile / 2) as priority_score
    FROM at_risk_customers arc
    LEFT JOIN account_managers am ON arc.account_manager_id = am.account_manager_id
    WHERE arc.risk_level != 'STABLE'
)

-- The Final Reveal: Actionable results
SELECT
    customer_id,
    customer_name,
    email,
    account_manager_name,
    lifetime_value,
    last_order_date,
    days_since_last_order,
    risk_level
FROM prioritized_risk_report
ORDER BY priority_score DESC, lifetime_value DESC
LIMIT 100;
```

Notice how someone can understand this query's purpose by reading only the CTE names: premium_customers → customer_order_history → customer_engagement_signals → at_risk_customers → prioritized_risk_report. The narrative flows logically, each step building on the previous.
Like function signatures in programming, CTEs have implicit "interfaces"—the columns they expose and the semantics of their rows. Well-designed CTE interfaces make your query components reusable and understandable.
| Principle | Description | Example |
|---|---|---|
| Include keys | Always include natural/surrogate keys for joining | customer_id, order_id |
| Name calculations | Give computed columns descriptive names | avg_order_value, days_since_last_login |
| Preserve granularity | Document what each row represents | One row per customer-month |
| Limit scope | Only include columns needed downstream | Avoid SELECT * |
| Use consistent types | Dates as dates, amounts as decimals | No stringly-typed dates |
```sql
-- ❌ POOR INTERFACE: Unclear what we're getting
WITH data AS (
    SELECT * FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
)
SELECT * FROM data;
-- What columns exist? What does each row represent?

-- ✅ GOOD INTERFACE: Clear contract
WITH customer_order_summary (
    -- Explicit column list documents the interface
    customer_id,      -- PK: unique per row
    total_orders,     -- Count of completed orders
    total_revenue,    -- Sum in USD
    first_order_date, -- Earliest order
    last_order_date,  -- Most recent order
    avg_order_value   -- Mean order value in USD
) AS (
    SELECT
        customer_id,
        COUNT(*) as total_orders,
        SUM(amount) as total_revenue,
        MIN(order_date) as first_order_date,
        MAX(order_date) as last_order_date,
        AVG(amount) as avg_order_value
    FROM orders
    WHERE status = 'completed'
    GROUP BY customer_id
)
-- Consumer knows exactly what they're getting
SELECT customer_id, total_revenue, avg_order_value
FROM customer_order_summary
WHERE total_orders >= 5;

-- ✅ GOOD: Documenting row semantics in the name
WITH
-- Name indicates: one row per product per day
daily_product_sales AS (
    SELECT
        product_id,
        DATE(order_date) as sale_date,
        SUM(quantity) as units_sold,
        SUM(quantity * unit_price) as revenue
    FROM order_items oi
    JOIN orders o USING (order_id)
    GROUP BY product_id, DATE(order_date)
),
-- Name indicates: one row per product per month
monthly_product_sales AS (
    SELECT
        product_id,
        DATE_TRUNC('month', sale_date) as sale_month,
        SUM(units_sold) as units_sold,
        SUM(revenue) as revenue,
        COUNT(*) as selling_days
    FROM daily_product_sales
    GROUP BY product_id, DATE_TRUNC('month', sale_date)
)
SELECT * FROM monthly_product_sales;
```

Always be clear about what each row represents. A CTE named 'order_data' is ambiguous—does it have one row per order, per order item, per customer, or per day? Names like 'order_line_items' or 'daily_order_totals' eliminate this ambiguity.
One of the most practical benefits of named subqueries is testability. Each CTE can be validated independently before being combined into the final query. This is particularly valuable when debugging complex queries.
```sql
-- The full query we're debugging
WITH base_orders AS (
    SELECT order_id, customer_id, amount, order_date
    FROM orders
    WHERE order_date >= '2024-01-01'
      AND status = 'completed'
),
customer_aggregates AS (
    SELECT
        customer_id,
        COUNT(*) as order_count,
        SUM(amount) as total_spent,
        AVG(amount) as avg_order
    FROM base_orders
    GROUP BY customer_id
),
high_value_customers AS (
    SELECT customer_id, total_spent, order_count
    FROM customer_aggregates
    WHERE total_spent > 1000
      AND order_count >= 3
)
SELECT
    c.customer_name,
    hvc.total_spent,
    hvc.order_count
FROM high_value_customers hvc
JOIN customers c USING (customer_id);

-- TESTING STRATEGY: Build up incrementally

-- TEST 1: Validate base_orders
WITH base_orders AS (
    SELECT order_id, customer_id, amount, order_date
    FROM orders
    WHERE order_date >= '2024-01-01'
      AND status = 'completed'
)
-- Check row count and sample
SELECT
    COUNT(*) as total_rows,
    MIN(order_date) as earliest_date,
    MAX(order_date) as latest_date,
    COUNT(DISTINCT customer_id) as unique_customers
FROM base_orders;

-- TEST 2: Validate customer_aggregates
WITH base_orders AS (
    SELECT order_id, customer_id, amount, order_date
    FROM orders
    WHERE order_date >= '2024-01-01'
      AND status = 'completed'
),
customer_aggregates AS (
    SELECT
        customer_id,
        COUNT(*) as order_count,
        SUM(amount) as total_spent,
        AVG(amount) as avg_order
    FROM base_orders
    GROUP BY customer_id
)
-- Verify aggregation logic for a known customer
SELECT * FROM customer_aggregates
WHERE customer_id = 'KNOWN_TEST_CUSTOMER_ID';
-- Cross-check: manually verify against raw orders

-- TEST 3: Validate threshold logic
WITH base_orders AS (...),
customer_aggregates AS (...),
high_value_customers AS (
    SELECT customer_id, total_spent, order_count
    FROM customer_aggregates
    WHERE total_spent > 1000
      AND order_count >= 3
)
-- Check edge cases
SELECT
    COUNT(*) as qualifying_customers,
    MIN(total_spent) as min_spent_in_set,
    MIN(order_count) as min_orders_in_set
FROM high_value_customers;
-- Verify: min_spent should be > 1000, min_orders should be >= 3
```

During debugging, temporarily change your main query to `SELECT * FROM [cte_name] LIMIT 100`. This lets you inspect any intermediate step without modifying the CTE definitions. Once verified, restore the original main query.
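As a concrete illustration of that tip, here is a minimal sketch reusing the CTE definitions from the query above: the WITH clause is left untouched, and only the final SELECT is temporarily swapped to inspect one intermediate step.

```sql
-- Same CTE definitions as the full query above; only the final SELECT changes
WITH base_orders AS (
    SELECT order_id, customer_id, amount, order_date
    FROM orders
    WHERE order_date >= '2024-01-01'
      AND status = 'completed'
),
customer_aggregates AS (
    SELECT
        customer_id,
        COUNT(*) as order_count,
        SUM(amount) as total_spent,
        AVG(amount) as avg_order
    FROM base_orders
    GROUP BY customer_id
)
-- Temporary debug output: inspect this intermediate step, then restore the original SELECT
SELECT * FROM customer_aggregates LIMIT 100;
```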
Over time, certain CTE patterns emerge repeatedly. Recognizing these patterns accelerates query development and ensures consistent, battle-tested approaches.
The Filter Pattern isolates row selection logic into a named CTE, making the filtering criteria explicit and reusable.
```sql
-- Filter Pattern: Named row selection
WITH eligible_products AS (
    SELECT product_id, product_name, category_id
    FROM products
    WHERE status = 'active'
      AND inventory_count > 0
      AND category_id NOT IN (
          SELECT category_id FROM restricted_categories
      )
)
SELECT
    ep.product_name,
    COUNT(oi.order_item_id) as times_ordered,
    SUM(oi.quantity) as total_units
FROM eligible_products ep
LEFT JOIN order_items oi ON ep.product_id = oi.product_id
GROUP BY ep.product_id, ep.product_name;
```

We've explored CTEs as named, modular building blocks for SQL queries. Let's consolidate the key insights:

- A CTE is a named subquery; name it for what it represents, not how it is computed.
- Extract a subquery when it is referenced more than once or encapsulates meaningful business logic; keep trivial lookups inline.
- Chain CTEs into a narrative of named steps that a reader can follow from base data to final result.
- Treat each CTE's columns and row granularity as an explicit interface for the rest of the query.
- Validate CTEs incrementally—each named step can be inspected on its own before the final composition.
What's Next:
Now that you understand CTEs as named, modular subqueries, the next page explores Multiple CTEs—how to define and compose multiple CTEs in a single query, manage dependencies between them, and build sophisticated data pipelines entirely within a single SQL statement.
You now understand how to think about CTEs as modular building blocks, when to extract logic into named subqueries, how to design CTE interfaces, and patterns for building complex queries from simple, composable parts. You're ready to explore multi-CTE compositions.