Selection And Projection - Learning Module

Loading content...

0/252

Projection (π) Operator

Selecting the Columns You Need

Consider a database table with 50 columns storing comprehensive employee information—everything from basic demographics to compensation details, emergency contacts, and performance metrics. Now imagine you need just three pieces of information: employee ID, name, and department. Fetching all 50 columns when you need only 3 wastes bandwidth, memory, and processing power.

This is the problem that the Projection operator (π) solves. While Selection filters rows based on conditions, Projection filters columns—extracting only the attributes you need from a relation. It's the vertical counterpart to Selection's horizontal filtering.

The Projection operator, denoted by the Greek letter pi (π), appears in virtually every database query. Each time you specify column names in a SQL SELECT clause instead of using SELECT *, you're invoking Projection. Understanding this operation deeply is essential for query optimization, data privacy enforcement, and efficient system design.

What You Will Learn

By the end of this page, you will understand the formal mathematical definition of Projection, master attribute list specification, comprehend the critical role of duplicate elimination, understand implementation strategies including column pruning, and recognize Projection's relationship to SQL's SELECT clause.

Formal Definition and Notation

The Projection operator is a unary operation that extracts specified attributes from a relation, producing a new relation containing only those attributes.

Formal Definition:

Given a relation R with schema R(A₁, A₂, ..., Aₙ) and an attribute list L = {Aᵢ₁, Aᵢ₂, ..., Aᵢₖ} where each Aᵢⱼ ∈ {A₁, A₂, ..., Aₙ}, the Projection operation is defined as:

$$π_L(R) = {t[L] \mid t ∈ R}$$

Where t[L] denotes the tuple t restricted to only the attributes in L.

In plain language: Projection πₗ(R) produces a relation containing all tuples from R, but each tuple contains only the attribute values specified in L.

Key Properties:

Schema Transformation: The output schema differs from the input schema. If R has attributes (A₁, A₂, ..., Aₙ) and L = {A₂, A₅}, then πₗ(R) has schema (A₂, A₅).
Cardinality Behavior: Projection may reduce cardinality due to duplicate elimination: |πₗ(R)| ≤ |R|. This is a critical distinction from Selection.
Closure Property: The output is a valid relation, maintaining the set-theoretic foundation of the relational model.

Basic Projection NotationProjection applied to extract specific columns from an Employee relation

Input

Output

Notation Variations

Different textbooks may use: π_{L}(R), πL, PROJECT(R, L), or R[L]. Some use angle brackets: π⟨A₁, A₂⟩(R). The attribute list may be comma-separated or space-separated. All express the same operation—extracting specified columns from a relation.

Mathematical Foundation

Projection has deep mathematical foundations in set theory and function theory. Understanding these foundations clarifies why Projection behaves as it does.

Projection as Tuple Restriction:

A tuple t over schema S = (A₁, A₂, ..., Aₙ) is a function from attribute names to values:

t: S → D₁ ∪ D₂ ∪ ... ∪ Dₙ

where Dᵢ is the domain of attribute Aᵢ.

Projection restricts this function to a subset of its domain:

t[L] : L → D where L ⊆ S

Set-Theoretic Properties:

1. Idempotence: $$π_L(π_L(R)) = π_L(R)$$

Projecting the same attributes twice has no additional effect.

2. Absorption (for nested projections): If L₁ ⊆ L₂, then: $$π_{L₁}(π_{L₂}(R)) = π_{L₁}(R)$$

Projecting to a superset and then to a subset equals projecting directly to the subset.

3. Commutativity with Selection (under conditions): If all attributes referenced in predicate P are in L, then: $$π_L(σ_P(R)) = σ_P(π_L(R))$$

This property is crucial for query optimization.

Projection Algebraic Laws
Law Name	Expression	Condition
Identity	π_{all attributes}(R) = R	L contains all attributes of R
Empty Projection	π_{}(R) = single tuple or empty	Degenerative case, not standard
Idempotence	π_L(π_L(R)) = π_L(R)	Always holds
Absorption	π_{L₁}(π_{L₂}(R)) = π_{L₁}(R)	When L₁ ⊆ L₂
Selection Push	π_L(σ_P(R)) = σ_P(π_L(R))	When P references only attributes in L
Cardinality Bound	\|π_L(R)\| ≤ \|R\|	Always holds; equality when L includes a key

Cardinality Analysis:

Unlike Selection where the predicate determines cardinality, Projection's cardinality depends on the uniqueness of projected values:

Projecting a superkey: |π_L(R)| = |R| (no duplicates possible)
Projecting a candidate key: |π_L(R)| = |R| (no duplicates possible)
Projecting non-key attributes: |π_L(R)| ≤ |R| (duplicates may exist)

Maximum and Minimum Cardinality:

Maximum: |π_L(R)| = |R| when L includes a key
Minimum: |π_L(R)| = 1 when all tuples have identical values for attributes in L

Cardinality Estimation: For attribute A with NDV(A) distinct values in R:

|π_{A}(R)| = min(NDV(A), |R|) = NDV(A) typically
For multiple attributes: |π_{A,B}(R)| ≤ min(NDV(A) × NDV(B), |R|)

Key Preservation Insight

If your projection includes a candidate key of the relation, you're guaranteed no duplicates. This is why SELECT DISTINCT on a table's primary key is redundant—the key already ensures uniqueness. Understanding this helps you predict when duplicate elimination is necessary.

Duplicate Elimination: Theory and Practice

One of the most significant aspects of Projection—and a frequent source of confusion—is its handling of duplicate tuples. In pure relational algebra, relations are sets, and sets by definition contain no duplicates.

The Relational Model Perspective:

In Codd's original relational model:

Relations are sets of tuples
No two tuples in a relation are identical
Therefore, Projection must eliminate duplicates to produce a valid relation

This means: $$π_{department}(Employee) = {Engineering, Marketing, HR}$$

Even though 'Engineering' appears in 3 tuples of the original relation.

The SQL Perspective (Important Distinction!):

SQL deviates from pure relational algebra. SQL tables are multisets (bags) by default, allowing duplicates:

-- This does NOT eliminate duplicates
SELECT department FROM Employee;
-- Returns: Engineering, Marketing, Engineering, HR, Engineering

-- This eliminates duplicates (matching relational algebra)
SELECT DISTINCT department FROM Employee;
-- Returns: Engineering, Marketing, HR

Relational Algebra (Set Semantics)

•Relations are sets
•No duplicate tuples allowed
•Projection always eliminates duplicates
•Matches theoretical foundations
•Guaranteed unique results
•|π_L(R)| ≤ |R|

SQL (Bag Semantics by Default)

•Tables are multisets (bags)
•Duplicates allowed by default
•SELECT preserves duplicates
•SELECT DISTINCT eliminates duplicates
•Performance-driven design choice
•|SELECT cols| = |table| always

Why SQL Uses Bag Semantics:

The SQL standard chose bag semantics for practical reasons:

Performance: Duplicate elimination is expensive (requires sorting or hashing)
Aggregation correctness: SELECT SUM(salary) must count all values, including duplicates
Intermediate results: Many operations produce duplicates that get resolved later
Programmer control: Developers can add DISTINCT when needed

Cost of Duplicate Elimination:

Method	Time Complexity	Space Complexity
Sort-based	O(n log n)	O(n)
Hash-based	O(n) expected	O(d) where d = distinct values
Index-based	O(n)	Uses existing index

For large relations, duplicate elimination can be the dominant cost of a query.

Duplicate Elimination ImpactDemonstrating when duplicates matter

Input

Output

DISTINCT Overuse

Adding DISTINCT to 'fix' a query that returns unexpected duplicates often masks a deeper problem—usually a missing or incorrect join condition. If you're surprised by duplicates, investigate why they appear before reflexively adding DISTINCT. The duplicates may indicate a Cartesian product or other logical error.

Attribute List Specification

The attribute list in a Projection specifies which columns appear in the result. Understanding the various ways to specify attributes helps you write concise and maintainable queries.

Basic Attribute Reference:

The simplest form lists attribute names explicitly:

$$π_{emp_id, name, salary}(Employee)$$

Attribute Order:

The order of attributes in the list determines the order in the result schema:

π_{A, B}(R) produces schema (A, B)
π_{B, A}(R) produces schema (B, A)

These are different relations (different schemas) even though they contain the same information.

Qualified Attribute Names:

When working with multiple relations (in combined operations), qualify attributes with relation names:

$$π_{Employee.name, Department.location}(...)$$

All Attributes:

Projecting all attributes is equivalent to no projection:

$$π_{A₁, A₂, ..., Aₙ}(R) = R$$

In SQL, this is SELECT *.

Attribute Specification PatternsVarious ways to specify projection attributes

Input

Output

Column Expressions in SQL:

SQL extends pure projection with computed columns:

SELECT 
    emp_id,
    CONCAT(first_name, ' ', last_name) AS full_name,
    salary * 12 AS annual_salary,
    DATEDIFF(CURRENT_DATE, hire_date) AS days_employed
FROM Employee;

This isn't pure projection—it's projection combined with tuple transformation. In relational algebra, this would require the Extended Projection or Generalized Projection operator.

Extended Projection:

Denoted π_{f₁, f₂, ..., fₖ}(R) where each fᵢ is either:

An attribute of R
An expression involving constants and attributes of R
An attribute renamed using AS

Considerations:

Type Safety: Expressions must produce valid types for result columns
NULL Handling: Expressions involving NULL typically produce NULL
Determinism: Non-deterministic functions (RAND(), NOW()) may complicate query optimization

Column Selection Best Practices

Always select only the columns you need. Avoid SELECT * in production code—it wastes resources, breaks if schema changes, and may expose sensitive columns. Explicit column lists document intent, enable column pruning optimization, and make code resilient to schema evolution.

Implementation Strategies

Database systems implement Projection using various strategies depending on whether duplicate elimination is required and the characteristics of the data.

Strategy 1: Simple Column Extraction (No Deduplication)

When duplicates are acceptable (SQL without DISTINCT):

Algorithm: SimpleProjection(R, L)
  result ← empty list
  for each tuple t in R:
    projected_t ← extract attributes L from t
    append projected_t to result
  return result

Time Complexity: O(n) - single pass through data
Space Complexity: O(|L| × n) for output
I/O Complexity: One scan of input relation

Strategy 2: Sort-Based Deduplication

For DISTINCT queries on unsorted data:

Algorithm: SortProjection(R, L)
  // Phase 1: Project
  projected ← []
  for each tuple t in R:
    projected.append(extract L from t)
  
  // Phase 2: Sort
  sort projected
  
  // Phase 3: Eliminate duplicates
  result ← [projected[0]]
  for i from 1 to len(projected)-1:
    if projected[i] ≠ projected[i-1]:
      result.append(projected[i])
  return result

Time Complexity: O(n log n) dominated by sort
Space Complexity: O(n) for sorted intermediate result

Sort-Based Approach

•Time: O(n log n)
•Space: O(n) for sorting
•Produces sorted output (bonus)
•Stable performance
•Can use external sort for large data
•Efficient when output is large

Hash-Based Approach

•Time: O(n) expected
•Space: O(d) where d = distinct values
•Unordered output
•Best when d << n
•Memory-bound for large d
•Fastest for small result sets

Strategy 3: Hash-Based Deduplication

Algorithm: HashProjection(R, L)
  hash_set ← empty HashSet
  result ← []
  for each tuple t in R:
    projected_t ← extract L from t
    if projected_t not in hash_set:
      hash_set.add(projected_t)
      result.append(projected_t)
  return result

Time Complexity: O(n) expected (hash operations are O(1) amortized)
Space Complexity: O(d) where d is number of distinct values
Best Case: Very few distinct values → small hash set
Worst Case: All values distinct → hash set size = n

Strategy 4: Index-Based Projection

When an index covers the projection columns:

Algorithm: IndexOnlyProjection(Index, L)
  result ← []
  for each entry e in Index:
    if first occurrence of e.key:  -- Index naturally deduplicates
      projected_t ← extract L from e
      result.append(projected_t)
  return result

Advantage: No table access required
Requirement: Index must contain all projected columns (covering index)
Natural deduplication: Unique indexes inherently deduplicate

Optimizer's Choice

The query optimizer selects the implementation strategy based on: estimated number of distinct values, available memory, whether a covering index exists, order requirements for downstream operations, and overall query plan. EXPLAIN/EXPLAIN ANALYZE reveals which strategy was chosen.

Column Pruning Optimization

One of the most important optimizations related to Projection is column pruning (also called column projection pushdown)—eliminating unnecessary columns as early as possible in query execution.

The Problem:

Consider a query that joins two large tables but only needs a few columns from each:

SELECT e.name, d.department_name
FROM Employee e
JOIN Department d ON e.dept_id = d.dept_id
WHERE e.salary > 100000;

Without optimization, the database might:

Scan all 50 columns of Employee
Scan all 20 columns of Department
Join (carrying 70 columns through)
Filter on salary
Finally project to name, department_name

The Solution - Column Pruning:

With optimization:

Scan only needed Employee columns (name, dept_id, salary)
Scan only needed Department columns (dept_id, department_name)
Join (carrying only 4-5 columns)
Filter on salary
Final projection is trivial (columns already limited)

Column Pruning Impact
Metric	Without Pruning	With Pruning	Improvement
Disk I/O	Read all columns per row	Read only needed columns	60-90% reduction typical
Memory Usage	Buffer all columns	Buffer minimal columns	Proportional to column reduction
Network Transfer	Send full rows	Send minimal rows	Significant in distributed systems
Cache Efficiency	Pollute cache with unused data	Maximize useful data in cache	Better cache hit rates
Join Performance	Hash/sort on wide tuples	Hash/sort on narrow tuples	Faster comparisons

How Column Pruning Works:

Analyze final output: Determine columns in SELECT clause
Trace dependencies: Identify columns needed for WHERE, JOIN, ORDER BY, GROUP BY
Compute minimal set: For each table, find columns actually referenced
Push projection down: Add projection as early as possible in the plan

Interaction with Storage Formats:

Row-oriented storage (traditional RDBMS):

Rows stored contiguously
Must read entire row even if only some columns needed
Column pruning reduces data copied after read

Column-oriented storage (analytical databases):

Columns stored separately
Can read only needed columns from disk
Column pruning directly reduces I/O

This is why analytical queries on column stores benefit dramatically from explicit column lists.

Column Pruning in ActionQuery plan transformation with column pruning

Input

Output

SELECT * Performance Impact

SELECT * defeats column pruning because the optimizer must assume all columns are needed. This is why production code should always specify explicit columns: it enables column pruning, reduces data transfer, and protects against schema changes adding unexpected columns.

Projection in SQL

Projection in SQL is expressed through the SELECT clause. Understanding the nuances of SQL's implementation helps bridge theory and practice.

Basic Mapping:

Relational Algebra	SQL Equivalent
π_{A,B}(R)	SELECT DISTINCT A, B FROM R
π_{A,B}(R) (bag semantics)	SELECT A, B FROM R

Important SQL Distinctions:

projection_examples.sql
SQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
-- Pure projection (set semantics, matches relational algebra)
SELECT DISTINCT department
FROM Employee;
 
-- Projection without duplicate elimination (SQL default)
SELECT department
FROM Employee;
 
-- Multi-column projection
SELECT emp_id, name, salary
FROM Employee;
 
-- Projection with column aliasing (renaming)
SELECT 
    emp_id AS employee_id,
    name AS employee_name,
    salary AS annual_salary
FROM Employee;
 
-- Extended projection with expressions
SELECT 
    name,
    salary AS monthly_salary,
    salary * 12 AS annual_salary,
    salary * 0.2 AS bonus,
    CONCAT(name, ' (', department, ')') AS display_name
FROM Employee;
 
-- Projection of all columns (defeats column pruning)
SELECT * FROM Employee;
 
-- Projection with prefix for multiple tables
SELECT 
    e.name AS employee_name,
    e.salary,
    d.department_name,
    d.location
FROM Employee e
JOIN Department d ON e.dept_id = d.dept_id;
 
-- DISTINCT on multiple columns
-- Eliminates duplicate (department, status) combinations
SELECT DISTINCT department, status
FROM Employee;

SQL-Specific Projection Features:

1. Column Aliasing (AS):

SELECT salary AS compensation FROM Employee;

This combines projection with renaming (ρ operator in relational algebra).

2. Expressions in SELECT:

SELECT salary * 1.1 AS raised_salary FROM Employee;

This is Extended Projection—computing new values from existing attributes.

3. Aggregate Functions:

SELECT department, AVG(salary) AS avg_salary
FROM Employee
GROUP BY department;

This combines projection, grouping, and aggregation.

4. Window Functions:

SELECT name, salary, RANK() OVER (ORDER BY salary DESC) as rank
FROM Employee;

Advanced computed columns using window context.

5. Subqueries in SELECT:

SELECT name, 
       (SELECT COUNT(*) FROM Orders o WHERE o.emp_id = e.emp_id) as order_count
FROM Employee e;

Correlated subqueries for row-by-row computation.

SELECT Clause Evaluation Order

Despite appearing first in the query, SELECT is logically evaluated after FROM, WHERE, GROUP BY, and HAVING. This is why you can't reference SELECT aliases in WHERE. The logical order is: FROM → WHERE → GROUP BY → HAVING → SELECT → DISTINCT → ORDER BY → LIMIT.

Summary and Key Takeaways

The Projection operator is the fundamental mechanism for column selection in relational algebra, and understanding its properties is essential for efficient query design.

Key Takeaways

•Projection (π) performs vertical filtering — It extracts specified attributes from a relation, producing a new relation with a potentially different schema.
•Duplicate elimination is definitional in relational algebra — Relations are sets; identical tuples cannot exist. SQL relaxes this with bag semantics by default, requiring DISTINCT for true projection.
•Cardinality depends on projected attributes — Projecting a superkey preserves cardinality; projecting other attributes may reduce it through deduplication.
•Implementation varies by requirements — Sort-based deduplication is O(n log n) but produces ordered output; hash-based is O(n) expected but requires memory proportional to distinct values.
•Column pruning optimization is critical — Pushing projection down eliminates unnecessary columns early, reducing I/O, memory usage, and data transfer throughout query execution.
•SQL extends basic projection — Column aliasing, expressions, aggregates, and window functions all build upon the fundamental projection concept.
•Avoid SELECT * in production — Explicit column lists enable optimization, protect against schema changes, and document intent clearly.

What's Next:

With both Selection (horizontal filtering) and Projection (vertical filtering) mastered, we now examine a subtle but important topic: duplicate elimination in depth. We'll explore why SQL chose bag semantics, when DISTINCT is truly needed versus reflexively added, and how to reason about duplicates in complex queries.

Page Complete

You now have a comprehensive understanding of the Projection operator—from its formal mathematical definition through implementation strategies to practical SQL usage. This knowledge enables you to select precisely the columns you need while understanding performance implications. Next, we'll dive deep into duplicate elimination.

Projection (π) Operator

Selecting the Columns You Need

What You Will Learn

Formal Definition and Notation

The Projection operator is a unary operation that extracts specified attributes from a relation, producing a new relation containing only those attributes.

Formal Definition:

$$π_L(R) = {t[L] \mid t ∈ R}$$

Where t[L] denotes the tuple t restricted to only the attributes in L.

In plain language: Projection πₗ(R) produces a relation containing all tuples from R, but each tuple contains only the attribute values specified in L.

Key Properties:

Schema Transformation: The output schema differs from the input schema. If R has attributes (A₁, A₂, ..., Aₙ) and L = {A₂, A₅}, then πₗ(R) has schema (A₂, A₅).
Cardinality Behavior: Projection may reduce cardinality due to duplicate elimination: |πₗ(R)| ≤ |R|. This is a critical distinction from Selection.
Closure Property: The output is a valid relation, maintaining the set-theoretic foundation of the relational model.

Basic Projection NotationProjection applied to extract specific columns from an Employee relation

Input

Output

Notation Variations

Mathematical Foundation

Projection has deep mathematical foundations in set theory and function theory. Understanding these foundations clarifies why Projection behaves as it does.

Projection as Tuple Restriction:

A tuple t over schema S = (A₁, A₂, ..., Aₙ) is a function from attribute names to values:

t: S → D₁ ∪ D₂ ∪ ... ∪ Dₙ

where Dᵢ is the domain of attribute Aᵢ.

Projection restricts this function to a subset of its domain:

t[L] : L → D where L ⊆ S

Set-Theoretic Properties:

1. Idempotence: $$π_L(π_L(R)) = π_L(R)$$

Projecting the same attributes twice has no additional effect.

2. Absorption (for nested projections): If L₁ ⊆ L₂, then: $$π_{L₁}(π_{L₂}(R)) = π_{L₁}(R)$$

Projecting to a superset and then to a subset equals projecting directly to the subset.

3. Commutativity with Selection (under conditions): If all attributes referenced in predicate P are in L, then: $$π_L(σ_P(R)) = σ_P(π_L(R))$$

This property is crucial for query optimization.

Projection Algebraic Laws
Law Name	Expression	Condition
Identity	π_{all attributes}(R) = R	L contains all attributes of R
Empty Projection	π_{}(R) = single tuple or empty	Degenerative case, not standard
Idempotence	π_L(π_L(R)) = π_L(R)	Always holds
Absorption	π_{L₁}(π_{L₂}(R)) = π_{L₁}(R)	When L₁ ⊆ L₂
Selection Push	π_L(σ_P(R)) = σ_P(π_L(R))	When P references only attributes in L
Cardinality Bound	\|π_L(R)\| ≤ \|R\|	Always holds; equality when L includes a key

Cardinality Analysis:

Unlike Selection where the predicate determines cardinality, Projection's cardinality depends on the uniqueness of projected values:

Projecting a superkey: |π_L(R)| = |R| (no duplicates possible)
Projecting a candidate key: |π_L(R)| = |R| (no duplicates possible)
Projecting non-key attributes: |π_L(R)| ≤ |R| (duplicates may exist)

Maximum and Minimum Cardinality:

Maximum: |π_L(R)| = |R| when L includes a key
Minimum: |π_L(R)| = 1 when all tuples have identical values for attributes in L

Cardinality Estimation: For attribute A with NDV(A) distinct values in R:

|π_{A}(R)| = min(NDV(A), |R|) = NDV(A) typically
For multiple attributes: |π_{A,B}(R)| ≤ min(NDV(A) × NDV(B), |R|)

Key Preservation Insight

Duplicate Elimination: Theory and Practice

The Relational Model Perspective:

In Codd's original relational model:

Relations are sets of tuples
No two tuples in a relation are identical
Therefore, Projection must eliminate duplicates to produce a valid relation

This means: $$π_{department}(Employee) = {Engineering, Marketing, HR}$$

Even though 'Engineering' appears in 3 tuples of the original relation.

The SQL Perspective (Important Distinction!):

SQL deviates from pure relational algebra. SQL tables are multisets (bags) by default, allowing duplicates:

-- This does NOT eliminate duplicates
SELECT department FROM Employee;
-- Returns: Engineering, Marketing, Engineering, HR, Engineering

-- This eliminates duplicates (matching relational algebra)
SELECT DISTINCT department FROM Employee;
-- Returns: Engineering, Marketing, HR

Relational Algebra (Set Semantics)

•Relations are sets
•No duplicate tuples allowed
•Projection always eliminates duplicates
•Matches theoretical foundations
•Guaranteed unique results
•|π_L(R)| ≤ |R|

SQL (Bag Semantics by Default)

•Tables are multisets (bags)
•Duplicates allowed by default
•SELECT preserves duplicates
•SELECT DISTINCT eliminates duplicates
•Performance-driven design choice
•|SELECT cols| = |table| always

Why SQL Uses Bag Semantics:

The SQL standard chose bag semantics for practical reasons:

Performance: Duplicate elimination is expensive (requires sorting or hashing)
Aggregation correctness: SELECT SUM(salary) must count all values, including duplicates
Intermediate results: Many operations produce duplicates that get resolved later
Programmer control: Developers can add DISTINCT when needed

Cost of Duplicate Elimination:

Method	Time Complexity	Space Complexity
Sort-based	O(n log n)	O(n)
Hash-based	O(n) expected	O(d) where d = distinct values
Index-based	O(n)	Uses existing index

For large relations, duplicate elimination can be the dominant cost of a query.

Duplicate Elimination ImpactDemonstrating when duplicates matter

Input

Output

DISTINCT Overuse

Attribute List Specification

The attribute list in a Projection specifies which columns appear in the result. Understanding the various ways to specify attributes helps you write concise and maintainable queries.

Basic Attribute Reference:

The simplest form lists attribute names explicitly:

$$π_{emp_id, name, salary}(Employee)$$

Attribute Order:

The order of attributes in the list determines the order in the result schema:

π_{A, B}(R) produces schema (A, B)
π_{B, A}(R) produces schema (B, A)

These are different relations (different schemas) even though they contain the same information.

Qualified Attribute Names:

When working with multiple relations (in combined operations), qualify attributes with relation names:

$$π_{Employee.name, Department.location}(...)$$

All Attributes:

Projecting all attributes is equivalent to no projection:

$$π_{A₁, A₂, ..., Aₙ}(R) = R$$

In SQL, this is SELECT *.

Attribute Specification PatternsVarious ways to specify projection attributes

Input

Output

Column Expressions in SQL:

SQL extends pure projection with computed columns:

SELECT 
    emp_id,
    CONCAT(first_name, ' ', last_name) AS full_name,
    salary * 12 AS annual_salary,
    DATEDIFF(CURRENT_DATE, hire_date) AS days_employed
FROM Employee;

This isn't pure projection—it's projection combined with tuple transformation. In relational algebra, this would require the Extended Projection or Generalized Projection operator.

Extended Projection:

Denoted π_{f₁, f₂, ..., fₖ}(R) where each fᵢ is either:

An attribute of R
An expression involving constants and attributes of R
An attribute renamed using AS

Considerations:

Type Safety: Expressions must produce valid types for result columns
NULL Handling: Expressions involving NULL typically produce NULL
Determinism: Non-deterministic functions (RAND(), NOW()) may complicate query optimization

Column Selection Best Practices

Implementation Strategies

Database systems implement Projection using various strategies depending on whether duplicate elimination is required and the characteristics of the data.

Strategy 1: Simple Column Extraction (No Deduplication)

When duplicates are acceptable (SQL without DISTINCT):

Algorithm: SimpleProjection(R, L)
  result ← empty list
  for each tuple t in R:
    projected_t ← extract attributes L from t
    append projected_t to result
  return result

Time Complexity: O(n) - single pass through data
Space Complexity: O(|L| × n) for output
I/O Complexity: One scan of input relation

Strategy 2: Sort-Based Deduplication

For DISTINCT queries on unsorted data:

Algorithm: SortProjection(R, L)
  // Phase 1: Project
  projected ← []
  for each tuple t in R:
    projected.append(extract L from t)
  
  // Phase 2: Sort
  sort projected
  
  // Phase 3: Eliminate duplicates
  result ← [projected[0]]
  for i from 1 to len(projected)-1:
    if projected[i] ≠ projected[i-1]:
      result.append(projected[i])
  return result

Time Complexity: O(n log n) dominated by sort
Space Complexity: O(n) for sorted intermediate result

Sort-Based Approach

•Time: O(n log n)
•Space: O(n) for sorting
•Produces sorted output (bonus)
•Stable performance
•Can use external sort for large data
•Efficient when output is large

Hash-Based Approach

•Time: O(n) expected
•Space: O(d) where d = distinct values
•Unordered output
•Best when d << n
•Memory-bound for large d
•Fastest for small result sets

Strategy 3: Hash-Based Deduplication

Algorithm: HashProjection(R, L)
  hash_set ← empty HashSet
  result ← []
  for each tuple t in R:
    projected_t ← extract L from t
    if projected_t not in hash_set:
      hash_set.add(projected_t)
      result.append(projected_t)
  return result

Time Complexity: O(n) expected (hash operations are O(1) amortized)
Space Complexity: O(d) where d is number of distinct values
Best Case: Very few distinct values → small hash set
Worst Case: All values distinct → hash set size = n

Strategy 4: Index-Based Projection

When an index covers the projection columns:

Algorithm: IndexOnlyProjection(Index, L)
  result ← []
  for each entry e in Index:
    if first occurrence of e.key:  -- Index naturally deduplicates
      projected_t ← extract L from e
      result.append(projected_t)
  return result

Advantage: No table access required
Requirement: Index must contain all projected columns (covering index)
Natural deduplication: Unique indexes inherently deduplicate

Optimizer's Choice

Column Pruning Optimization

The Problem:

Consider a query that joins two large tables but only needs a few columns from each:

SELECT e.name, d.department_name
FROM Employee e
JOIN Department d ON e.dept_id = d.dept_id
WHERE e.salary > 100000;

Without optimization, the database might:

Scan all 50 columns of Employee
Scan all 20 columns of Department
Join (carrying 70 columns through)
Filter on salary
Finally project to name, department_name

The Solution - Column Pruning:

With optimization:

Scan only needed Employee columns (name, dept_id, salary)
Scan only needed Department columns (dept_id, department_name)
Join (carrying only 4-5 columns)
Filter on salary
Final projection is trivial (columns already limited)

Column Pruning Impact
Metric	Without Pruning	With Pruning	Improvement
Disk I/O	Read all columns per row	Read only needed columns	60-90% reduction typical
Memory Usage	Buffer all columns	Buffer minimal columns	Proportional to column reduction
Network Transfer	Send full rows	Send minimal rows	Significant in distributed systems
Cache Efficiency	Pollute cache with unused data	Maximize useful data in cache	Better cache hit rates
Join Performance	Hash/sort on wide tuples	Hash/sort on narrow tuples	Faster comparisons

How Column Pruning Works:

Analyze final output: Determine columns in SELECT clause
Trace dependencies: Identify columns needed for WHERE, JOIN, ORDER BY, GROUP BY
Compute minimal set: For each table, find columns actually referenced
Push projection down: Add projection as early as possible in the plan

Interaction with Storage Formats:

Row-oriented storage (traditional RDBMS):

Rows stored contiguously
Must read entire row even if only some columns needed
Column pruning reduces data copied after read

Column-oriented storage (analytical databases):

Columns stored separately
Can read only needed columns from disk
Column pruning directly reduces I/O

This is why analytical queries on column stores benefit dramatically from explicit column lists.

Column Pruning in ActionQuery plan transformation with column pruning

Input

Output

SELECT * Performance Impact

Projection in SQL

Projection in SQL is expressed through the SELECT clause. Understanding the nuances of SQL's implementation helps bridge theory and practice.

Basic Mapping:

Relational Algebra	SQL Equivalent
π_{A,B}(R)	SELECT DISTINCT A, B FROM R
π_{A,B}(R) (bag semantics)	SELECT A, B FROM R

Important SQL Distinctions:

projection_examples.sql
SQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
-- Pure projection (set semantics, matches relational algebra)
SELECT DISTINCT department
FROM Employee;
 
-- Projection without duplicate elimination (SQL default)
SELECT department
FROM Employee;
 
-- Multi-column projection
SELECT emp_id, name, salary
FROM Employee;
 
-- Projection with column aliasing (renaming)
SELECT 
    emp_id AS employee_id,
    name AS employee_name,
    salary AS annual_salary
FROM Employee;
 
-- Extended projection with expressions
SELECT 
    name,
    salary AS monthly_salary,
    salary * 12 AS annual_salary,
    salary * 0.2 AS bonus,
    CONCAT(name, ' (', department, ')') AS display_name
FROM Employee;
 
-- Projection of all columns (defeats column pruning)
SELECT * FROM Employee;
 
-- Projection with prefix for multiple tables
SELECT 
    e.name AS employee_name,
    e.salary,
    d.department_name,
    d.location
FROM Employee e
JOIN Department d ON e.dept_id = d.dept_id;
 
-- DISTINCT on multiple columns
-- Eliminates duplicate (department, status) combinations
SELECT DISTINCT department, status
FROM Employee;

SQL-Specific Projection Features:

1. Column Aliasing (AS):

SELECT salary AS compensation FROM Employee;

This combines projection with renaming (ρ operator in relational algebra).

2. Expressions in SELECT:

SELECT salary * 1.1 AS raised_salary FROM Employee;

This is Extended Projection—computing new values from existing attributes.

3. Aggregate Functions:

SELECT department, AVG(salary) AS avg_salary
FROM Employee
GROUP BY department;

This combines projection, grouping, and aggregation.

4. Window Functions:

SELECT name, salary, RANK() OVER (ORDER BY salary DESC) as rank
FROM Employee;

Advanced computed columns using window context.

5. Subqueries in SELECT:

SELECT name, 
       (SELECT COUNT(*) FROM Orders o WHERE o.emp_id = e.emp_id) as order_count
FROM Employee e;

Correlated subqueries for row-by-row computation.

SELECT Clause Evaluation Order

Summary and Key Takeaways

The Projection operator is the fundamental mechanism for column selection in relational algebra, and understanding its properties is essential for efficient query design.

Key Takeaways

•Projection (π) performs vertical filtering — It extracts specified attributes from a relation, producing a new relation with a potentially different schema.
•Duplicate elimination is definitional in relational algebra — Relations are sets; identical tuples cannot exist. SQL relaxes this with bag semantics by default, requiring DISTINCT for true projection.
•Cardinality depends on projected attributes — Projecting a superkey preserves cardinality; projecting other attributes may reduce it through deduplication.
•Implementation varies by requirements — Sort-based deduplication is O(n log n) but produces ordered output; hash-based is O(n) expected but requires memory proportional to distinct values.
•Column pruning optimization is critical — Pushing projection down eliminates unnecessary columns early, reducing I/O, memory usage, and data transfer throughout query execution.
•SQL extends basic projection — Column aliasing, expressions, aggregates, and window functions all build upon the fundamental projection concept.
•Avoid SELECT * in production — Explicit column lists enable optimization, protect against schema changes, and document intent clearly.

What's Next:

Page Complete