Expressive Power - Learning Module

Extensions

Beyond Relational Completeness

In the previous page, we charted the boundaries of relational algebra—the queries it cannot express. But boundaries need not be walls. Over the past fifty years, database theorists and practitioners have developed extensions to the relational model that transcend these limitations while preserving, as much as possible, the elegance and optimizability of the original framework.

This page explores these extensions in depth. From SQL's aggregation and grouping to recursive queries, window functions, and beyond, we'll examine how the theoretical foundation expands to meet practical demands. Understanding these extensions is essential for the working database professional—they represent the true expressive power of modern query languages.

What You Will Learn

By the end of this page, you will understand the major extensions to relational expressiveness: aggregation and grouping, recursive queries (transitive closure), window functions, outer joins, and specialized operations. You'll see how each extension addresses a specific limitation while attempting to preserve the benefits of the relational foundation.

Extension: Aggregation and Grouping

The most widely used extension to relational algebra is aggregation—the ability to compute summary values (counts, sums, averages) over sets of tuples, often grouped by key attributes.

Extended Algebra: Grouping Operator

The grouping operator γ (gamma) extends relational algebra:

γ_{grouping-attrs; agg-function → result-attr}(R)

It partitions R by the grouping attributes, applies the aggregate function to each partition, and produces tuples with the grouping values and aggregated results.

The formal extension:

-- Relational Algebra Extended with Grouping
γ_{department; COUNT(*) → emp_count, SUM(salary) → total_salary}(Employee)

This produces one tuple per department with the employee count and total salary for that department.
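
The partition-then-aggregate behavior of γ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `gamma`, the dict-based row representation, and the sample rows are hypothetical and are not part of any database API.

```python
from collections import defaultdict


def gamma(rows, group_attrs, aggregates):
    """Group `rows` (a list of dicts) by `group_attrs` and aggregate each group.

    `aggregates` maps a result attribute name to a function that receives
    the list of rows in one partition and returns a single value.
    """
    partitions = defaultdict(list)
    for row in rows:
        key = tuple(row[a] for a in group_attrs)
        partitions[key].append(row)

    result = []
    for key, members in partitions.items():
        out = dict(zip(group_attrs, key))      # grouping values
        for name, fn in aggregates.items():    # aggregated results
            out[name] = fn(members)
        result.append(out)
    return result


# Hypothetical sample data
employee = [
    {"name": "Ann", "department": "Eng", "salary": 120},
    {"name": "Bob", "department": "Eng", "salary": 100},
    {"name": "Cy", "department": "Ops", "salary": 80},
]

summary = gamma(
    employee,
    ["department"],
    {
        "emp_count": len,
        "total_salary": lambda rs: sum(r["salary"] for r in rs),
    },
)
print(summary)
```

Each output dict corresponds to one tuple of the γ result: one per distinct department.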

SQL syntax:

SELECT department, COUNT(*) AS emp_count, SUM(salary) AS total_salary
FROM Employee
GROUP BY department;

Aggregate functions:

Function	Description	Null Handling
COUNT(*)	Number of rows	Counts all rows
COUNT(col)	Number of non-null values	Ignores nulls
SUM(col)	Sum of values	Ignores nulls
AVG(col)	Average of values	Ignores nulls
MIN(col)	Minimum value	Ignores nulls
MAX(col)	Maximum value	Ignores nulls
COUNT(DISTINCT col)	Distinct value count	Ignores nulls
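
The null-handling column is easy to check empirically. The sketch below uses Python's built-in `sqlite3` module with a hypothetical `bonus` column (not part of the schema used elsewhere on this page) to contrast COUNT(*) with the null-ignoring aggregates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (name TEXT, bonus INTEGER)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?)",
    [("Ann", 10), ("Bob", None), ("Cy", 20)],  # Bob's bonus is NULL
)

# COUNT(*) counts rows; the other aggregates skip the NULL bonus.
null_demo = conn.execute(
    "SELECT COUNT(*), COUNT(bonus), SUM(bonus), AVG(bonus) FROM Employee"
).fetchone()
print(null_demo)
```

Note that AVG divides by the number of non-null values (2), not by the number of rows (3).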

The HAVING clause:

HAVING filters whole groups on aggregate conditions, which a WHERE clause cannot do because WHERE is evaluated before grouping:

SELECT department, COUNT(*) AS emp_count
FROM Employee
GROUP BY department
HAVING COUNT(*) > 10;   -- Only large departments

Algebraic representation:

σ_{emp_count > 10}(γ_{department; COUNT(*) → emp_count}(Employee))

The selection applies AFTER the grouping, filtering on aggregate results.
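
To see that HAVING really is a selection applied to the grouped result, the following sketch runs both formulations in SQLite and compares them. The table contents are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (name TEXT, department TEXT)")
# Hypothetical data: 12 employees in Eng, 3 in Ops
people = [(f"e{i}", "Eng") for i in range(12)] + [(f"o{i}", "Ops") for i in range(3)]
conn.executemany("INSERT INTO Employee VALUES (?, ?)", people)

with_having = conn.execute(
    """
    SELECT department, COUNT(*) AS emp_count
    FROM Employee
    GROUP BY department
    HAVING COUNT(*) > 10
    """
).fetchall()

# Same query written as sigma over gamma: group first, then select.
as_selection = conn.execute(
    """
    SELECT department, emp_count
    FROM (SELECT department, COUNT(*) AS emp_count
          FROM Employee
          GROUP BY department) AS grouped
    WHERE emp_count > 10
    """
).fetchall()

print(with_having, as_selection)
```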

Why Aggregation Fits the Relational Model

•Output is a relation — The result of grouping with aggregation is still a set of tuples (the set of groups with their aggregated values).
•Composable — Aggregation results can be used in further operations (selections, joins, nesting).
•Deterministic — Given the same input, aggregation produces the same output. No undefined behavior.
•Optimizable — Aggregation pushdown, partial aggregation, and other techniques can optimize aggregate queries.

Extension: Recursion and Transitive Closure

The most significant expressiveness gap—transitive closure—is addressed through recursive query extensions. Two main approaches exist: recursive Datalog and SQL's recursive Common Table Expressions (CTEs).

Datalog: Logic Programming for Databases

Datalog extends relational calculus with recursive rule definitions. A Datalog program consists of rules that can reference themselves, enabling fixed-point computation of transitive closure and similar queries.

Datalog approach:

% Base case: direct edges are reachable
Reachable(X, Y) :- Edge(X, Y).

% Recursive case: if X has an edge to Z and Z reaches Y, then X reaches Y
Reachable(X, Y) :- Edge(X, Z), Reachable(Z, Y).

This computes transitive closure through recursive rule application until a fixed point is reached (no new tuples can be derived).
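
The fixed-point computation can be made concrete with a naive evaluation loop. This is a minimal sketch of the idea, not how a Datalog engine is actually implemented (real engines typically use semi-naive evaluation to avoid re-deriving known tuples); the function name and sample edges are assumptions.

```python
def transitive_closure(edges):
    """Naive fixed-point evaluation of the two Reachable rules."""
    reachable = set(edges)  # base rule: Reachable(X, Y) :- Edge(X, Y).
    while True:
        # recursive rule: Reachable(X, Y) :- Edge(X, Z), Reachable(Z, Y).
        derived = {
            (x, y)
            for (x, z) in edges
            for (z2, y) in reachable
            if z == z2
        }
        new = derived - reachable
        if not new:          # fixed point: nothing new can be derived
            return reachable
        reachable |= new


edges = {("a", "b"), ("b", "c"), ("c", "d")}
closure = transitive_closure(edges)
print(sorted(closure))
```

Because the set of possible pairs over a finite set of nodes is finite and the loop only ever adds tuples, it must terminate, even on cyclic graphs.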

SQL recursive CTEs:

SQL:1999 introduced recursive Common Table Expressions:

WITH RECURSIVE Reachable AS (
    -- Base case: direct edges
    SELECT source, target FROM Edge
    UNION
    -- Recursive case: extend paths
    SELECT e.source, r.target
    FROM Edge e
    JOIN Reachable r ON e.target = r.source
)
SELECT * FROM Reachable;
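
The choice of UNION rather than UNION ALL matters when the graph contains cycles: UNION discards rows that were already produced, so the recursion stops once no new pairs appear. The sketch below runs the query in SQLite on a hypothetical three-node cycle; it assumes a SQLite build with recursive CTE support (3.8.3 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Edge (source TEXT, target TEXT)")
# Hypothetical cyclic graph: a -> b -> c -> a
conn.executemany(
    "INSERT INTO Edge VALUES (?, ?)",
    [("a", "b"), ("b", "c"), ("c", "a")],
)

reachable = conn.execute(
    """
    WITH RECURSIVE Reachable(source, target) AS (
        SELECT source, target FROM Edge
        UNION
        SELECT e.source, r.target
        FROM Edge e
        JOIN Reachable r ON e.target = r.source
    )
    SELECT source, target FROM Reachable ORDER BY source, target
    """
).fetchall()
print(reachable)
```

On a cycle every node reaches every node, including itself, so the result contains all nine pairs. With UNION ALL the same query would keep regenerating pairs and never terminate on this data.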

Practical example—organizational hierarchy:

-- Find all reports (direct and indirect) of a manager
WITH RECURSIVE Reports AS (
    -- Direct reports
    SELECT emp_id, name, manager_id
    FROM Employee
    WHERE manager_id = 100  -- Start from manager #100
    UNION ALL
    -- Indirect reports: people who report to direct reports
    SELECT e.emp_id, e.name, e.manager_id
    FROM Employee e
    JOIN Reports r ON e.manager_id = r.emp_id
)
SELECT * FROM Reports;

Recursion Capabilities by Query Language
Feature	Pure RA	Datalog	SQL Recursive CTE
Fixed-length paths	✓ (k joins)	✓	✓
Arbitrary-length paths	✗	✓	✓
Transitive closure	✗	✓	✓
Aggregation in recursion	N/A	With extensions	Restricted (often disallowed in the recursive term)
Guaranteed termination	✓	✓ (finite domain)	Depends on query

Theoretical foundations:

Least fixed-point semantics: a recursive query computes the smallest relation that satisfies its recursive definition
Stratification: ensures negation is applied only to relations that are already fully computed, keeping recursive rules well-defined
Data complexity: evaluating recursive Datalog is PTIME-complete, strictly harder than RA's AC⁰

Expressiveness implications:

Adding recursion to RA creates a strictly more powerful language:

Pure RA < RA + Aggregation < RA + Aggregation + Recursion

Each extension adds genuine expressiveness—there are queries expressible with the extension that are impossible without it.

Extension: Window Functions

Window functions (also called analytic functions) are one of the most powerful modern SQL extensions. They compute values across a 'window' of related rows without collapsing the result to a single row per group.

Window Functions vs. Aggregation

GROUP BY aggregation: N input rows → 1 output row per group
Window functions: N input rows → N output rows, each with computed values from its window

Window functions add data to each row; aggregation collapses rows.
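
The difference in row counts can be observed directly. The following sketch computes the same department average twice in SQLite, once with GROUP BY and once with OVER; the sample rows are hypothetical, and window functions require SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?)",
    [("Ann", "Eng", 120), ("Bob", "Eng", 100), ("Cy", "Ops", 80)],
)

# Aggregation: one row per department
grouped = conn.execute(
    "SELECT department, AVG(salary) FROM Employee "
    "GROUP BY department ORDER BY department"
).fetchall()

# Window function: one row per employee, each carrying its department average
windowed = conn.execute(
    "SELECT name, department, AVG(salary) OVER (PARTITION BY department) "
    "FROM Employee ORDER BY name"
).fetchall()

print(len(grouped), len(windowed))
```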

Basic syntax:

SELECT 
    employee_id,
    department,
    salary,
    AVG(salary) OVER (PARTITION BY department) AS dept_avg,
    RANK() OVER (ORDER BY salary DESC) AS salary_rank,
    SUM(salary) OVER (ORDER BY hire_date 
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM Employee;

Components of a window function:

Function: What to compute (SUM, AVG, RANK, ROW_NUMBER, LEAD, LAG, etc.)
PARTITION BY: Divides rows into groups (like GROUP BY but doesn't collapse)
ORDER BY: Orders rows within the partition
Frame: Defines which rows relative to current row are included (ROWS/RANGE BETWEEN)

Window function categories:

Category	Functions	Description
Aggregate Windows	SUM, AVG, COUNT, MIN, MAX	Aggregate over window frame
Ranking	ROW_NUMBER, RANK, DENSE_RANK, NTILE	Position within partition
Value Functions	FIRST_VALUE, LAST_VALUE, NTH_VALUE	Specific values from frame
Offset Functions	LAG, LEAD	Values from offset rows
Distribution	PERCENT_RANK, CUME_DIST, PERCENTILE_CONT	Statistical distributions

Queries enabled by window functions:

-- Running totals
SELECT date, amount,
       SUM(amount) OVER (ORDER BY date) AS running_total
FROM Sales;

-- Compare each employee to department average
SELECT name, salary, department,
       salary - AVG(salary) OVER (PARTITION BY department) AS diff_from_avg
FROM Employee;

-- Find the previous and next order for each customer
SELECT customer_id, order_date,
       LAG(order_date) OVER (PARTITION BY customer_id ORDER BY order_date) AS prev_order,
       LEAD(order_date) OVER (PARTITION BY customer_id ORDER BY order_date) AS next_order
FROM Orders;

-- Top-N per group
SELECT * FROM (
    SELECT name, department, salary,
           ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn
    FROM Employee
) ranked WHERE rn <= 3;  -- Top 3 per department
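
One design choice in top-N queries is which ranking function to use when values tie. ROW_NUMBER always assigns distinct positions, so a tie at the cutoff is broken by whatever ordering the query supplies, whereas RANK and DENSE_RANK give tied rows the same position. The sketch below illustrates this on hypothetical data with two equal salaries (SQLite 3.25 or later assumed).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?)",
    [("Ann", 100), ("Bob", 100), ("Cy", 90)],  # Ann and Bob tie
)

ranked = conn.execute(
    """
    SELECT name, salary,
           ROW_NUMBER() OVER (ORDER BY salary DESC, name) AS rn,
           RANK()       OVER (ORDER BY salary DESC)       AS rnk,
           DENSE_RANK() OVER (ORDER BY salary DESC)       AS drnk
    FROM Employee
    ORDER BY salary DESC, name
    """
).fetchall()
for row in ranked:
    print(row)
```

Filtering on `rn <= 1` would return only Ann, while filtering on `rnk <= 1` would return both Ann and Bob.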

Theoretical significance:

Window functions require ordered evaluation—they're outside the set-based relational model. They represent a controlled crossing of the boundary into ordered computation while producing relational (set-valued) output.

Extension: Outer Joins

Regular (inner) joins only include tuples that match on both sides. Outer joins extend this to preserve non-matching tuples, padding with NULLs where no match exists.

Outer join types:

-- LEFT OUTER JOIN: Keep all left tuples
SELECT e.name, d.dname
FROM Employee e
LEFT OUTER JOIN Department d ON e.dno = d.dno;
-- Employees without departments get NULL for dname

-- RIGHT OUTER JOIN: Keep all right tuples
SELECT e.name, d.dname  
FROM Employee e
RIGHT OUTER JOIN Department d ON e.dno = d.dno;
-- Departments without employees get NULL for name

-- FULL OUTER JOIN: Keep all tuples from both
SELECT e.name, d.dname
FROM Employee e
FULL OUTER JOIN Department d ON e.dno = d.dno;
-- Both unmatched employees and empty departments appear

Algebraic notation:

Type	Symbol	Description
Left Outer Join	⟕	R ⟕ S preserves all R tuples
Right Outer Join	⟖	R ⟖ S preserves all S tuples
Full Outer Join	⟗	R ⟗ S preserves all tuples

Why outer joins extend the model:

Pure relational algebra has no operator that introduces NULLs: every value in a result is copied from an input tuple. Outer joins explicitly generate NULLs as padding for non-matching tuples, which forces NULL handling (three-valued logic in predicates, NULL-aware aggregates) throughout the system.
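
One way to see exactly what the outer join adds is to rebuild it by hand: a left outer join equals the inner join plus the unmatched left tuples padded with NULL. The sketch below checks this in SQLite; the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (name TEXT, dno INTEGER)")
conn.execute("CREATE TABLE Department (dno INTEGER, dname TEXT)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?)",
    [("Ann", 1), ("Bob", None), ("Cy", 9)],  # Bob and Cy have no matching department
)
conn.execute("INSERT INTO Department VALUES (1, 'Eng')")

outer = conn.execute(
    """
    SELECT e.name, d.dname
    FROM Employee e
    LEFT OUTER JOIN Department d ON e.dno = d.dno
    ORDER BY 1
    """
).fetchall()

# Inner join, plus unmatched employees padded with an explicit NULL
manual = conn.execute(
    """
    SELECT e.name, d.dname
    FROM Employee e
    JOIN Department d ON e.dno = d.dno
    UNION ALL
    SELECT e.name, NULL
    FROM Employee e
    WHERE NOT EXISTS (SELECT 1 FROM Department d WHERE d.dno = e.dno)
    ORDER BY 1
    """
).fetchall()

print(outer)
```

The explicit NULL constant in the second branch is the ingredient the outer join supplies implicitly.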

Practical importance:

Outer joins are essential for:

Reporting (showing all categories even if empty)
Data quality checks (finding orphan records)
Optional relationships (customers who may or may not have orders)
Comparative analysis (comparing what exists vs. what could exist)

Outer Joins and Optimization

Outer joins complicate query optimization because they don't have all the algebraic properties of inner joins. For example, outer join order matters: (R ⟕ S) ⟕ T ≠ R ⟕ (S ⟕ T) in general. Optimizers must handle outer joins specially.
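
A small counterexample makes the reordering problem tangible. In the sketch below the second join predicate uses COALESCE, so it can be satisfied by a NULL-padded row; this is a deliberately chosen assumption, since with plain equality predicates that reject NULLs the two groupings would typically agree. Tables R, S, and T and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (a INTEGER)")
conn.execute("CREATE TABLE S (a INTEGER, b INTEGER)")  # left empty on purpose
conn.execute("CREATE TABLE T (b INTEGER)")
conn.execute("INSERT INTO R VALUES (1)")
conn.execute("INSERT INTO T VALUES (0)")

# (R left-join S) left-join T
left_first = conn.execute(
    """
    SELECT r.a, s.a, s.b, t.b
    FROM R r
    LEFT JOIN S s ON r.a = s.a
    LEFT JOIN T t ON COALESCE(s.b, 0) = t.b
    """
).fetchall()

# R left-join (S left-join T)
right_first = conn.execute(
    """
    SELECT r.a, st.sa, st.sb, st.tb
    FROM R r
    LEFT JOIN (
        SELECT s.a AS sa, s.b AS sb, t.b AS tb
        FROM S s
        LEFT JOIN T t ON COALESCE(s.b, 0) = t.b
    ) AS st ON r.a = st.sa
    """
).fetchall()

print(left_first, right_first)
```

In the first grouping the NULL-padded S row still matches T through COALESCE; in the second, S is empty, so nothing from T survives.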

Extension: Extended Projection and Expressions

Pure projection (π) only selects existing attributes. Extended projection allows computing new attributes using expressions, functions, and conditionals.

Extended projection capabilities:

-- Arithmetic expressions
SELECT name, salary, salary * 12 AS annual_salary
FROM Employee;

-- String operations  
SELECT first_name || ' ' || last_name AS full_name
FROM Person;

-- Conditional expressions
SELECT name,
       CASE 
           WHEN salary > 100000 THEN 'High'
           WHEN salary > 50000 THEN 'Medium'
           ELSE 'Low'
       END AS salary_tier
FROM Employee;

-- Type conversions
SELECT CAST(hire_date AS VARCHAR) AS hire_date_text
FROM Employee;

-- Function applications
SELECT name, UPPER(name), LENGTH(name), SUBSTRING(name FROM 1 FOR 1)
FROM Employee;

Algebraic representation:

Extended projection is written as:

π_{A, B, A+B → C, f(D) → E}(R)

This projects A and B, computes A+B as C, and applies function f to D as E.
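
Extended projection amounts to mapping each tuple through a set of named expressions. A minimal Python sketch, assuming rows are dicts and using `str.upper` as a stand-in for the unspecified function f:

```python
def extended_project(rows, columns):
    """Apply each named expression in `columns` to every input row.

    `columns` maps an output attribute name to a function of the input row.
    The result is a list, so duplicates are kept (bag semantics).
    """
    return [{name: expr(row) for name, expr in columns.items()} for row in rows]


# Hypothetical relation R(A, B, D)
r = [
    {"A": 1, "B": 2, "D": "x"},
    {"A": 3, "B": 4, "D": "y"},
]

projected = extended_project(
    r,
    {
        "A": lambda t: t["A"],           # plain attribute
        "B": lambda t: t["B"],           # plain attribute
        "C": lambda t: t["A"] + t["B"],  # A+B -> C
        "E": lambda t: t["D"].upper(),   # f(D) -> E, with f = str.upper (assumed)
    },
)
print(projected)
```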

Built-in function categories:

Category	Examples	Purpose
Arithmetic	+, -, *, /, MOD, ABS, ROUND	Numeric computation
String	CONCAT, SUBSTRING, UPPER, LOWER, TRIM	Text manipulation
Date/Time	NOW, DATE_ADD, EXTRACT, DATE_DIFF	Temporal computation
Conditional	CASE, COALESCE, NULLIF, GREATEST, LEAST	Control flow
Type	CAST, CONVERT	Type conversion
Null handling	COALESCE, NULLIF, IS NULL	Null management

Computed Attributes in Selection

Extended projection enables computed attributes, which can then be used in selections: σ_{annual_salary > 100000}(π_{name, salary*12 → annual_salary}(Employee)). This pattern is common: compute derived values, then filter on them.

Extension: Bag (Multiset) Semantics

Pure relational algebra uses set semantics: duplicates are automatically eliminated. SQL, for performance reasons, defaults to bag (multiset) semantics: duplicates are preserved unless explicitly removed.

Set vs. Bag semantics:

Operation	Set Semantics	Bag Semantics
Projection	Removes duplicates	Preserves duplicates (unless DISTINCT)
Union	{1,1,2} ∪ {1,2,2} = {1,2}	{1,1,2} UNION ALL {1,2,2} = {1,1,1,2,2,2}
Intersection	Takes common elements	Takes min of counts
Difference	Removes all matching	Subtracts counts
Output uniqueness	Always unique	May have duplicates

SQL operators:

-- Set semantics (removes duplicates)
SELECT DISTINCT department FROM Employee;
SELECT name FROM Employee UNION SELECT name FROM Contractor;

-- Bag semantics (preserves duplicates)
SELECT department FROM Employee;  -- May have duplicates
SELECT name FROM Employee UNION ALL SELECT name FROM Contractor;

Why bag semantics in practice?

Performance — Duplicate elimination requires sorting or hashing. Skipping it is faster.
Aggregation accuracy — COUNT and SUM need to count every occurrence, not just unique values.
Intermediate results — During query processing, preserving duplicates enables correct final aggregation.

Algebraic adaptation:

Bag algebra uses multiplicity-tracking operators:

Selection: preserves multiplicities
Projection: keeps all produced tuples (may create more duplicates)
Union: adds multiplicities
Difference: subtracts multiplicities (down to 0)
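
These multiplicity rules map directly onto Python's `collections.Counter`, which makes a convenient model for experimenting with bag algebra. The sketch reuses the bags {1,1,2} and {1,2,2} from the table above.

```python
from collections import Counter

r = Counter([1, 1, 2])   # bag {1, 1, 2}
s = Counter([1, 2, 2])   # bag {1, 2, 2}

bag_union = r + s          # adds multiplicities (UNION ALL)
bag_intersection = r & s   # takes the minimum of the counts
bag_difference = r - s     # subtracts counts, dropping anything at or below 0
set_union = set(r) | set(s)  # set semantics for comparison

print(sorted(bag_union.elements()))
print(sorted(bag_intersection.elements()))
print(sorted(bag_difference.elements()))
print(sorted(set_union))
```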

The DISTINCT Keyword

SQL's DISTINCT converts bag results to set results. It should be used consciously—it adds cost and may change semantics (especially with aggregation). Default to bag semantics unless duplicate elimination is specifically needed.

Extension: Specialized Query Constructs

Beyond the major extensions, modern SQL includes numerous specialized constructs that address specific query patterns not easily expressed in pure RA.

Specialized SQL Constructs

•EXISTS / NOT EXISTS — Efficient semi-join and anti-join patterns for testing existence without retrieving values.
•IN / NOT IN — Membership testing against subqueries or value lists.
•LATERAL joins — Subqueries that can reference columns from preceding tables in the FROM clause.
•PIVOT / UNPIVOT — Transpose rows to columns or vice versa for reporting.
•MERGE (UPSERT) — Combine insert and update based on match conditions.
•FETCH FIRST / LIMIT — Top-k queries with ordering.
•OFFSET — Pagination for result sets.
•DISTINCT ON — Keep first row per group (PostgreSQL extension).

Examples of specialized constructs:

-- EXISTS for semi-join
SELECT * FROM Department d
WHERE EXISTS (SELECT 1 FROM Employee e WHERE e.dno = d.dno);

-- LATERAL for correlated subquery in FROM
SELECT d.dname, top_earner.name, top_earner.salary
FROM Department d
CROSS JOIN LATERAL (
    SELECT name, salary FROM Employee
    WHERE dno = d.dno
    ORDER BY salary DESC
    LIMIT 1
) AS top_earner;

-- FETCH FIRST for top-k
SELECT * FROM Employee
ORDER BY salary DESC
FETCH FIRST 10 ROWS ONLY;

-- OFFSET for pagination
SELECT * FROM Products
ORDER BY product_id
LIMIT 20 OFFSET 40;  -- Page 3 of 20-item pages
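
The semi-join behavior of EXISTS is worth contrasting with an ordinary join: EXISTS returns each qualifying department once, while a join returns it once per matching employee. A small SQLite sketch with hypothetical rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Department (dno INTEGER, dname TEXT)")
conn.execute("CREATE TABLE Employee (name TEXT, dno INTEGER)")
conn.executemany("INSERT INTO Department VALUES (?, ?)", [(1, "Eng"), (2, "Ops")])
conn.executemany("INSERT INTO Employee VALUES (?, ?)", [("Ann", 1), ("Bob", 1)])

# Semi-join: departments that have at least one employee
semi = conn.execute(
    """
    SELECT d.dname FROM Department d
    WHERE EXISTS (SELECT 1 FROM Employee e WHERE e.dno = d.dno)
    """
).fetchall()

# Inner join: one output row per (department, employee) match
joined = conn.execute(
    """
    SELECT d.dname FROM Department d
    JOIN Employee e ON e.dno = d.dno
    """
).fetchall()

print(semi, joined)
```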

Algebraic representation:

These constructs don't have standard algebraic notation. They're query language conveniences that the optimizer translates to plans using physical operators. The representation varies by system.

Extension: Temporal and Spatial Data

Specialized application domains require extensions beyond standard relational operations. Temporal and spatial extensions are two important examples.

Temporal Extensions:

SQL:2011 introduced temporal tables:

-- System-versioned temporal table
CREATE TABLE Employee (
    emp_id INT PRIMARY KEY,
    name VARCHAR(100),
    salary INT,
    valid_from TIMESTAMP GENERATED ALWAYS AS ROW START,
    valid_to TIMESTAMP GENERATED ALWAYS AS ROW END,
    PERIOD FOR SYSTEM_TIME (valid_from, valid_to)
) WITH SYSTEM VERSIONING;

-- Query historical state
SELECT * FROM Employee
FOR SYSTEM_TIME AS OF '2023-01-01';

-- Query entire history
SELECT * FROM Employee
FOR SYSTEM_TIME ALL;

Spatial Extensions (PostGIS example):

-- Spatial types and operations
CREATE TABLE Stores (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    location GEOMETRY(POINT, 4326)
);

-- Spatial queries
SELECT name
FROM Stores
WHERE ST_DWithin(
    location::geography,
    ST_SetSRID(ST_MakePoint(-122.4194, 37.7749), 4326)::geography,
    1000  -- within 1 km (geography distances are in meters)
);

-- Spatial joins
SELECT s.name, d.name AS district
FROM Stores s
JOIN Districts d ON ST_Contains(d.boundary, s.location);

Why these require extensions:

Temporal data needs period overlap tests, time-based versions, and "as-of" semantics—concepts foreign to pure RA.
Spatial data requires specialized indexes (R-trees, quadtrees), distance functions, and geometric predicates that pure equality testing can't express.
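
The "as-of" and period-overlap concepts can be sketched in plain Python to show what a temporal query has to compute. This is an illustration only: the history rows are hypothetical, and the half-open interval convention (start included, end excluded) is an assumption matching how system-time periods are usually defined.

```python
from datetime import datetime

# Hypothetical version history for one employee.
# Each version is valid on the half-open interval [valid_from, valid_to).
history = [
    {"emp_id": 1, "salary": 100,
     "valid_from": datetime(2022, 1, 1), "valid_to": datetime(2023, 6, 1)},
    {"emp_id": 1, "salary": 120,
     "valid_from": datetime(2023, 6, 1), "valid_to": datetime.max},
]


def as_of(rows, instant):
    """Return the versions that were current at `instant`."""
    return [r for r in rows if r["valid_from"] <= instant < r["valid_to"]]


def overlaps(a_from, a_to, b_from, b_to):
    """True if two half-open periods share at least one instant."""
    return a_from < b_to and b_from < a_to


snapshot = as_of(history, datetime(2023, 1, 1))
print(snapshot[0]["salary"])
```

Both helpers rely on ordered comparison of timestamps, which is why equality-based joins alone are not enough for temporal data.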

Other domain-specific extensions:

Domain	Extensions	Examples
Graph	Path patterns, reachability	Cypher (Neo4j), SPARQL
JSON/XML	Path navigation, extraction	JSON_EXTRACT, XPath
Full-text	Search ranking, relevance	tsvector, tsquery
Array	Element access, operations	UNNEST, ARRAY_AGG

The Expressiveness Hierarchy

With all these extensions, we can now chart the hierarchy of expressiveness from pure relational algebra to full SQL and beyond.

Expressiveness levels:

Level 1: Pure Relational Algebra (= First-Order Logic)
    ↓
Level 2: RA + Aggregation (COUNT, SUM, GROUP BY)
    ↓
Level 3: RA + Aggregation + Outer Joins + Extended Projection
    ↓
Level 4: RA + ... + Window Functions + Bag Semantics
    ↓
Level 5: RA + ... + Recursion (CTEs, Datalog)
    ↓
Level 6: Full SQL (+ temporal, spatial, domain-specific)
    ↓
Level 7: Procedural extensions (PL/SQL, stored procedures)
    ↓
Level 8: Turing-complete (general programming)

Key boundaries:

Level 1 → 2: Adds counting (not first-order definable)
Level 2 → 5: Adds transitive closure (requires fixed-point computation)
Level 5 → 7: Adds side effects, control flow, arbitrary computation
Level 7 → 8: Full Turing completeness (but loses decidability guarantees)

Practical SQL

Modern SQL sits around Level 5-6: powerful enough for almost all practical queries, while maintaining optimization potential and avoiding Turing-complete complexity. This balance explains SQL's enduring success—it's powerful enough to be useful and restricted enough to be efficient.

Summary: The Extended Relational Model

We've explored the extensions that transform pure relational algebra into the powerful query languages used in modern database systems.

Key Takeaways

•Aggregation and grouping (γ operator) enable COUNT, SUM, AVG, and GROUP BY—addressing the counting limitation.
•Recursion (Datalog, recursive CTEs) enables transitive closure and arbitrary-length path queries—the most significant expressiveness addition.
•Window functions enable row-by-row computation over ordered partitions without collapsing results—uniquely powerful for analytics.
•Outer joins preserve non-matching tuples with NULL padding—essential for many reporting patterns.
•Extended projection allows computing new attributes via expressions and functions—enabling derived columns.
•Bag semantics preserve duplicates for performance and correct aggregation—SQL's default behavior.
•Domain extensions (temporal, spatial, JSON) adapt the relational model to specialized data types and operations.
•The expressiveness hierarchy shows how each extension adds genuine power, culminating in Level 5-6 SQL that balances power with optimizability.

Module completion:

With this page, we complete our exploration of Expressive Power in relational query languages. We've traveled from the foundational concept of relational completeness, through Codd's elegant theorem proving algebra-calculus equivalence, into the formal proof structure, through the inherent limitations of first-order logic, and finally to the extensions that transform theoretical elegance into practical power.

Understanding expressive power isn't merely theoretical—it informs query design (knowing what's possible), performance analysis (knowing what's hard), and system selection (knowing what different systems can express). This knowledge separates database professionals who use SQL from those who truly understand it.

Module Complete

Congratulations! You've completed Module 6: Expressive Power. You now understand relational completeness as the benchmark for query languages, Codd's Theorem proving algebra-calculus equivalence, the formal proofs connecting these paradigms, the inherent limitations of the relational model, and the extensions that address those limitations. This knowledge forms the theoretical foundation for understanding any query language, present or future.

Extensions

Beyond Relational Completeness

What You Will Learn

Extension: Aggregation and Grouping

The most widely used extension to relational algebra is aggregation—the ability to compute summary values (counts, sums, averages) over sets of tuples, often grouped by key attributes.

Extended Algebra: Grouping Operator

The grouping operator γ (gamma) extends relational algebra:

γ_{grouping-attrs, agg-function → result-attr}(R)

It partitions R by the grouping attributes, applies the aggregate function to each partition, and produces tuples with the grouping values and aggregated results.

The formal extension:

-- Relational Algebra Extended with Grouping
γ_{department; COUNT(*) → emp_count, SUM(salary) → total_salary}(Employee)

This produces one tuple per department with the employee count and total salary for that department.

SQL syntax:

SELECT department, COUNT(*) AS emp_count, SUM(salary) AS total_salary
FROM Employee
GROUP BY department;

Aggregate functions:

Function	Description	Null Handling
COUNT(*)	Number of rows	Counts all rows
COUNT(col)	Number of non-null values	Ignores nulls
SUM(col)	Sum of values	Ignores nulls
AVG(col)	Average of values	Ignores nulls
MIN(col)	Minimum value	Ignores nulls
MAX(col)	Maximum value	Ignores nulls
COUNT(DISTINCT col)	Distinct value count	Ignores nulls

The HAVING clause:

HAVING filters groups by aggregate conditions—a capability not possible without aggregation:

SELECT department, COUNT(*) AS emp_count
FROM Employee
GROUP BY department
HAVING COUNT(*) > 10;   -- Only large departments

Algebraic representation:

σ_{emp_count > 10}(γ_{department; COUNT(*) → emp_count}(Employee))

The selection applies AFTER the grouping, filtering on aggregate results.

Why Aggregation Fits the Relational Model

•Output is a relation — The result of grouping with aggregation is still a set of tuples (the set of groups with their aggregated values).
•Composable — Aggregation results can be used in further operations (selections, joins, nesting).
•Deterministic — Given the same input, aggregation produces the same output. No undefined behavior.
•Optimizable — Aggregation pushdown, partial aggregation, and other techniques can optimize aggregate queries.

Extension: Recursion and Transitive Closure

Datalog: Logic Programming for Databases

Datalog approach:

% Base case: direct edges are reachable
Reachable(X, Y) :- Edge(X, Y).

% Recursive case: if X reaches Z and Z reaches Y, then X reaches Y
Reachable(X, Y) :- Edge(X, Z), Reachable(Z, Y).

This computes transitive closure through recursive rule application until a fixed point is reached (no new tuples can be derived).

SQL recursive CTEs:

SQL:1999 introduced recursive Common Table Expressions:

WITH RECURSIVE Reachable AS (
    -- Base case: direct edges
    SELECT source, target FROM Edge
    UNION
    -- Recursive case: extend paths
    SELECT e.source, r.target
    FROM Edge e
    JOIN Reachable r ON e.target = r.source
)
SELECT * FROM Reachable;

Practical example—organizational hierarchy:

-- Find all reports (direct and indirect) of a manager
WITH RECURSIVE Reports AS (
    -- Direct reports
    SELECT emp_id, name, manager_id
    FROM Employee
    WHERE manager_id = 100  -- Start from manager #100
    UNION ALL
    -- Indirect reports: people who report to direct reports
    SELECT e.emp_id, e.name, e.manager_id
    FROM Employee e
    JOIN Reports r ON e.manager_id = r.emp_id
)
SELECT * FROM Reports;

Recursion Capabilities by Query Language
Feature	Pure RA	Datalog	SQL Recursive CTE
Fixed-length paths	✓ (k joins)	✓	✓
Arbitrary-length paths	✗	✓	✓
Transitive closure	✗	✓	✓
Aggregation in recursion	N/A	With extensions	✓ (limited)
Guaranteed termination	✓	✓ (stratified)	Depends on query

Theoretical foundations:

Least fixed-point semantics — Recursive queries compute the smallest relation satisfying the recursive definition
Stratification — Ensures safe use of negation in recursive rules
Data Complexity — Recursive Datalog is PTIME-complete (more powerful than RA's AC⁰)

Expressiveness implications:

Adding recursion to RA creates a strictly more powerful language:

Pure RA < RA + Aggregation < RA + Aggregation + Recursion

Each extension adds genuine expressiveness—there are queries expressible with the extension that are impossible without it.

Extension: Window Functions

Window Functions vs. Aggregation

GROUP BY aggregation: N input rows → 1 output row per group Window functions: N input rows → N output rows, each with computed values from its window

Window functions add data to each row; aggregation collapses rows.

Basic syntax:

SELECT 
    employee_id,
    department,
    salary,
    AVG(salary) OVER (PARTITION BY department) AS dept_avg,
    RANK() OVER (ORDER BY salary DESC) AS salary_rank,
    SUM(salary) OVER (ORDER BY hire_date 
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM Employee;

Components of a window function:

Function: What to compute (SUM, AVG, RANK, ROW_NUMBER, LEAD, LAG, etc.)
PARTITION BY: Divides rows into groups (like GROUP BY but doesn't collapse)
ORDER BY: Orders rows within the partition
Frame: Defines which rows relative to current row are included (ROWS/RANGE BETWEEN)

Window function categories:

Category	Functions	Description
Aggregate Windows	SUM, AVG, COUNT, MIN, MAX	Aggregate over window frame
Ranking	ROW_NUMBER, RANK, DENSE_RANK, NTILE	Position within partition
Value Functions	FIRST_VALUE, LAST_VALUE, NTH_VALUE	Specific values from frame
Offset Functions	LAG, LEAD	Values from offset rows
Distribution	PERCENT_RANK, CUME_DIST, PERCENTILE_CONT	Statistical distributions

Queries enabled by window functions:

-- Running totals
SELECT date, amount,
       SUM(amount) OVER (ORDER BY date) AS running_total
FROM Sales;

-- Compare each employee to department average
SELECT name, salary, department,
       salary - AVG(salary) OVER (PARTITION BY department) AS diff_from_avg
FROM Employee;

-- Find the previous and next order for each customer
SELECT customer_id, order_date,
       LAG(order_date) OVER (PARTITION BY customer_id ORDER BY order_date) AS prev_order,
       LEAD(order_date) OVER (PARTITION BY customer_id ORDER BY order_date) AS next_order
FROM Orders;

-- Top-N per group
SELECT * FROM (
    SELECT name, department, salary,
           ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn
    FROM Employee
) ranked WHERE rn <= 3;  -- Top 3 per department

Theoretical significance:

Extension: Outer Joins

Regular (inner) joins only include tuples that match on both sides. Outer joins extend this to preserve non-matching tuples, padding with NULLs where no match exists.

Outer join types:

-- LEFT OUTER JOIN: Keep all left tuples
SELECT e.name, d.dname
FROM Employee e
LEFT OUTER JOIN Department d ON e.dno = d.dno;
-- Employees without departments get NULL for dname

-- RIGHT OUTER JOIN: Keep all right tuples
SELECT e.name, d.dname  
FROM Employee e
RIGHT OUTER JOIN Department d ON e.dno = d.dno;
-- Departments without employees get NULL for name

-- FULL OUTER JOIN: Keep all tuples from both
SELECT e.name, d.dname
FROM Employee e
FULL OUTER JOIN Department d ON e.dno = d.dno;
-- Both unmatched employees and empty departments appear

Algebraic notation:

Type	Symbol	Description
Left Outer Join	⟕	R ⟕ S preserves all R tuples
Right Outer Join	⟖	R ⟖ S preserves all S tuples
Full Outer Join	⟗	R ⟗ S preserves all tuples

Why outer joins extend the model:

Practical importance:

Outer joins are essential for:

Reporting (showing all categories even if empty)
Data quality checks (finding orphan records)
Optional relationships (customers who may or may not have orders)
Comparative analysis (comparing what exists vs. what could exist)

Outer Joins and Optimization

Extension: Extended Projection and Expressions

Pure projection (π) only selects existing attributes. Extended projection allows computing new attributes using expressions, functions, and conditionals.

Extended projection capabilities:

-- Arithmetic expressions
SELECT name, salary, salary * 12 AS annual_salary
FROM Employee;

-- String operations  
SELECT first_name || ' ' || last_name AS full_name
FROM Person;

-- Conditional expressions
SELECT name,
       CASE 
           WHEN salary > 100000 THEN 'High'
           WHEN salary > 50000 THEN 'Medium'
           ELSE 'Low'
       END AS salary_tier
FROM Employee;

-- Type conversions
SELECT CAST(hire_date AS VARCHAR) AS hire_date_text
FROM Employee;

-- Function applications
SELECT name, UPPER(name), LENGTH(name), SUBSTRING(name FROM 1 FOR 1)
FROM Employee;

Algebraic representation:

Extended projection is written as:

π_{A, B, A+B → C, f(D) → E}(R)

This projects A and B, computes A+B as C, and applies function f to D as E.

Built-in function categories:

Category	Examples	Purpose
Arithmetic	+, -, *, /, MOD, ABS, ROUND	Numeric computation
String	CONCAT, SUBSTRING, UPPER, LOWER, TRIM	Text manipulation
Date/Time	NOW, DATE_ADD, EXTRACT, DATE_DIFF	Temporal computation
Conditional	CASE, COALESCE, NULLIF, GREATEST, LEAST	Control flow
Type	CAST, CONVERT	Type conversion
Null handling	COALESCE, NULLIF, IS NULL	Null management

Computed Attributes in Selection

Extension: Bag (Multiset) Semantics

Set vs. Bag semantics:

Operation	Set Semantics	Bag Semantics
Projection	Removes duplicates	Preserves duplicates (unless DISTINCT)
Union	{1,1,2} ∪ {1,2,2} = {1,2}	{1,1,2} UNION ALL {1,2,2} = {1,1,1,2,2,2}
Intersection	Takes common elements	Takes min of counts
Difference	Removes all matching	Subtracts counts
Output uniqueness	Always unique	May have duplicates

SQL operators:

-- Set semantics (removes duplicates)
SELECT DISTINCT department FROM Employee;
SELECT name FROM Employee UNION SELECT name FROM Contractor;

-- Bag semantics (preserves duplicates)
SELECT department FROM Employee;  -- May have duplicates
SELECT name FROM Employee UNION ALL SELECT name FROM Contractor;

Why bag semantics in practice?

Performance — Duplicate elimination requires sorting or hashing. Skipping it is faster.
Aggregation accuracy — COUNT and SUM need to count every occurrence, not just unique values.
Intermediate results — During query processing, preserving duplicates enables correct final aggregation.

Algebraic adaptation:

Bag algebra uses multiplicity-tracking operators:

Selection: preserves multiplicities
Projection: keeps all produced tuples (may create more duplicates)
Union: adds multiplicities
Difference: subtracts multiplicities (down to 0)

The DISTINCT Keyword

Extension: Specialized Query Constructs

Beyond the major extensions, modern SQL includes numerous specialized constructs that address specific query patterns not easily expressed in pure RA.

Specialized SQL Constructs

•EXISTS / NOT EXISTS — Efficient semi-join and anti-join patterns for testing existence without retrieving values.
•IN / NOT IN — Membership testing against subqueries or value lists.
•LATERAL joins — Subqueries that can reference columns from preceding tables in the FROM clause.
•PIVOT / UNPIVOT — Transpose rows to columns or vice versa for reporting.
•MERGE (UPSERT) — Combine insert and update based on match conditions.
•FETCH FIRST / LIMIT — Top-k queries with ordering.
•OFFSET — Pagination for result sets.
•DISTINCT ON — Keep first row per group (PostgreSQL extension).

Examples of specialized constructs:

-- EXISTS for semi-join
SELECT * FROM Department d
WHERE EXISTS (SELECT 1 FROM Employee e WHERE e.dno = d.dno);

-- LATERAL for correlated subquery in FROM
SELECT d.dname, top_earner.name, top_earner.salary
FROM Department d
CROSS JOIN LATERAL (
    SELECT name, salary FROM Employee
    WHERE dno = d.dno
    ORDER BY salary DESC
    LIMIT 1
) AS top_earner;

-- FETCH FIRST for top-k
SELECT * FROM Employee
ORDER BY salary DESC
FETCH FIRST 10 ROWS ONLY;

-- OFFSET for pagination
SELECT * FROM Products
ORDER BY product_id
LIMIT 20 OFFSET 40;  -- Page 3 of 20-item pages

Algebraic representation:

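Most of these constructs have no standard algebraic notation. EXISTS and NOT EXISTS are the exception: they correspond to the semi-join (⋉) and anti-join (▷) operators found in extended algebras. Top-k and OFFSET additionally depend on row order, which unordered relations do not model. The rest are query-language features that the optimizer translates into plans built from physical operators, and the internal representation varies by system.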
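To make the semi-join and anti-join reading of EXISTS and NOT EXISTS concrete, here is a small sketch using Python's built-in sqlite3 module. The Department and Employee rows are invented for illustration. The third query is an ordinary join, included to show the duplication that a semi-join avoids.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Department (dno INTEGER PRIMARY KEY, dname TEXT);
    CREATE TABLE Employee (name TEXT, dno INTEGER);
    INSERT INTO Department VALUES (1, 'Eng'), (2, 'Ops'), (3, 'Legal');
    INSERT INTO Employee VALUES ('Ann', 1), ('Bob', 1), ('Cy', 2);
""")

# Semi-join: departments with at least one employee, each listed once.
semi = [row[0] for row in conn.execute("""
    SELECT dname FROM Department d
    WHERE EXISTS (SELECT 1 FROM Employee e WHERE e.dno = d.dno)
    ORDER BY dname
""")]

# Anti-join: departments with no employees at all.
anti = [row[0] for row in conn.execute("""
    SELECT dname FROM Department d
    WHERE NOT EXISTS (SELECT 1 FROM Employee e WHERE e.dno = d.dno)
    ORDER BY dname
""")]

# Ordinary join: one output row per matching employee, so 'Eng' repeats.
joined = [row[0] for row in conn.execute("""
    SELECT d.dname FROM Department d
    JOIN Employee e ON e.dno = d.dno
    ORDER BY d.dname
""")]

print(semi, anti, joined)
```

The semi-join returns only Department columns and never multiplies rows, which is why EXISTS is the usual choice when the question is only whether a match exists.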

Extension: Temporal and Spatial Data

Specialized application domains require extensions beyond standard relational operations. Temporal and spatial extensions are two important examples.

Temporal Extensions:

SQL:2011 introduced temporal tables:

-- System-versioned temporal table
CREATE TABLE Employee (
    emp_id INT PRIMARY KEY,
    name VARCHAR(100),
    salary INT,
    valid_from TIMESTAMP GENERATED ALWAYS AS ROW START,
    valid_to TIMESTAMP GENERATED ALWAYS AS ROW END,
    PERIOD FOR SYSTEM_TIME (valid_from, valid_to)
) WITH SYSTEM VERSIONING;

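-- Query historical state (standard timestamp literal; accepted forms vary by system)
SELECT * FROM Employee
FOR SYSTEM_TIME AS OF TIMESTAMP '2023-01-01 00:00:00';

-- Query entire history (ALL is a vendor extension, e.g. SQL Server and MariaDB)
SELECT * FROM Employee
FOR SYSTEM_TIME ALL;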

Spatial Extensions (PostGIS example):

-- Spatial types and operations
CREATE TABLE Stores (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    location GEOMETRY(POINT, 4326)
);

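-- Spatial queries: stores within 1 km of a point
-- (cast to geography so the distance is measured in meters, not degrees)
SELECT name
FROM Stores
WHERE ST_DWithin(
    location::geography,
    ST_SetSRID(ST_MakePoint(-122.4194, 37.7749), 4326)::geography,
    1000  -- meters
);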

-- Spatial joins
SELECT s.name, d.name AS district
FROM Stores s
JOIN Districts d ON ST_Contains(d.boundary, s.location);
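To give a feel for what a "within distance" predicate has to compute, the following sketch does a brute-force radius search in plain Python using the haversine formula. It is only an approximation under stated assumptions: the Earth is treated as a sphere with a mean radius of 6,371 km, every store is scanned (no spatial index), and the store names and coordinates are hypothetical. A spatial database would typically use more precise geodesic calculations together with an index.

```python
import math

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius (spherical approximation)

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two lon/lat points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def stores_within(stores, lon, lat, radius_m):
    """Return names of stores whose location lies within radius_m of (lon, lat)."""
    return [name for name, slon, slat in stores
            if haversine_m(slon, slat, lon, lat) <= radius_m]

# Hypothetical stores as (name, longitude, latitude).
stores = [
    ("Downtown", -122.4194, 37.7799),  # roughly 0.5 km north of the query point
    ("Uptown", -122.4194, 37.8049),    # roughly 3 km north of the query point
]

nearby = stores_within(stores, -122.4194, 37.7749, 1000)
print(nearby)
```

The scan is linear in the number of stores, which is the cost that R-trees and similar spatial indexes are designed to avoid.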

Why these require extensions:

Temporal data needs period overlap tests, time-based versions, and "as-of" semantics—concepts foreign to pure RA.
Spatial data requires distance functions and geometric predicates (containment, intersection) that the simple comparisons of pure RA can't express, plus specialized indexes (R-trees, quadtrees) to evaluate them efficiently.
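The "as-of" and period-overlap ideas can be sketched in a few lines of Python. We assume each row version carries a half-open period [valid_from, valid_to), so a version is visible from its start up to, but not including, its end. The history list and the FOREVER sentinel are invented for illustration.

```python
from datetime import datetime

FOREVER = datetime.max  # hypothetical sentinel for "still current"

# Each version: (emp_id, salary, valid_from, valid_to), period is [from, to).
history = [
    (1, 50000, datetime(2022, 1, 1), datetime(2023, 6, 1)),
    (1, 60000, datetime(2023, 6, 1), FOREVER),
    (2, 70000, datetime(2023, 3, 1), FOREVER),
]

def as_of(versions, t):
    """Return the versions that were current at time t."""
    return [v for v in versions if v[2] <= t < v[3]]

def periods_overlap(a_from, a_to, b_from, b_to):
    """True if two half-open periods share at least one instant."""
    return a_from < b_to and b_from < a_to

snapshot = as_of(history, datetime(2023, 1, 1))
print([(emp_id, salary) for emp_id, salary, _, _ in snapshot])
```

With half-open periods, two consecutive versions of the same row meet at a boundary without overlapping, so an as-of lookup never returns two versions of one row.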

Other domain-specific extensions:

Domain	Extensions	Examples
Graph	Path patterns, reachability	Cypher (Neo4j), SPARQL
JSON/XML	Path navigation, extraction	JSON_EXTRACT, XPath
Full-text	Search ranking, relevance	tsvector, tsquery
Array	Element access, operations	UNNEST, ARRAY_AGG

The Expressiveness Hierarchy

With all these extensions, we can now chart the hierarchy of expressiveness from pure relational algebra to full SQL and beyond.

Expressiveness levels:

Level 1: Pure Relational Algebra (= First-Order Logic)
    ↓
Level 2: RA + Aggregation (COUNT, SUM, GROUP BY)
    ↓
Level 3: RA + Aggregation + Outer Joins + Extended Projection
    ↓
Level 4: RA + ... + Window Functions + Bag Semantics
    ↓
Level 5: RA + ... + Recursion (CTEs, Datalog)
    ↓
Level 6: Full SQL (+ temporal, spatial, domain-specific)
    ↓
Level 7: Procedural extensions (PL/SQL, stored procedures)
    ↓
Level 8: Turing-complete (general programming)

Key boundaries:

Level 1 → 2: Adds counting (not first-order definable)
Level 2 → 5: Adds transitive closure (requires fixed-point computation)
Level 5 → 7: Adds side effects, control flow, arbitrary computation
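Level 7 → 8: General-purpose programming outside the database. Procedural extensions are typically already Turing-complete, so this step changes where computation runs rather than what can be computed; decidability guarantees are lost once arbitrary control flow is allowed.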
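The fixed-point computation behind the transitive closure boundary can be illustrated with a short Python sketch. It uses naive evaluation: join the pairs found so far with the base edges, add anything new, and stop when a round adds nothing. The function name and the sample edges are our own. Real systems generally use more refined strategies such as semi-naive evaluation.

```python
def transitive_closure(edges):
    """Return all (x, y) pairs such that y is reachable from x via one or more edges."""
    closure = set(edges)
    while True:
        # One round: a join of the current closure with the base edges.
        new_pairs = {(a, d) for (a, b) in closure for (c, d) in edges if b == c}
        if new_pairs <= closure:  # fixed point reached: nothing new was derived
            return closure
        closure |= new_pairs

# Hypothetical chain a -> b -> c -> d.
edges = {("a", "b"), ("b", "c"), ("c", "d")}
tc = transitive_closure(edges)
print(sorted(tc))
```

Each round is an ordinary join followed by a union, both expressible in relational algebra. What pure RA lacks is the loop itself, because the number of rounds depends on the data rather than on the query text.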

Summary: The Extended Relational Model

We've explored the extensions that transform pure relational algebra into the powerful query languages used in modern database systems.

Key Takeaways

•Aggregation and grouping (γ operator) enable COUNT, SUM, AVG, and GROUP BY—addressing the counting limitation.
•Recursion (Datalog, recursive CTEs) enables transitive closure and arbitrary-length path queries—the most significant expressiveness addition.
•Window functions enable row-by-row computation over ordered partitions without collapsing results—uniquely powerful for analytics.
•Outer joins preserve non-matching tuples with NULL padding—essential for many reporting patterns.
•Extended projection allows computing new attributes via expressions and functions—enabling derived columns.
•Bag semantics preserve duplicates for performance and correct aggregation—SQL's default behavior.
•Domain extensions (temporal, spatial, JSON) adapt the relational model to specialized data types and operations.
•The expressiveness hierarchy shows how each extension adds genuine power, culminating in Level 5-6 SQL that balances power with optimizability.
