Database Management SystemsRelational Algebra

Selection and Projection

LevelIntermediate

Duration90 mins

TopicRelational Algebra

1 / 5

Selection (σ) Operator

The Art of Filtering Data

Imagine you have a database containing millions of employee records, but you only need information about employees in the Engineering department who earn more than $100,000. How do you extract precisely the rows you need without retrieving and discarding irrelevant data? This is the fundamental problem that the Selection operator (σ) solves—and it does so with mathematical elegance and computational efficiency.

The Selection operator, denoted by the Greek letter sigma (σ), is one of the most frequently used operations in relational algebra. Every time you write a WHERE clause in SQL, you're invoking the conceptual machinery that Selection provides. Understanding this operator deeply isn't just academic—it's essential for writing efficient queries, understanding query optimization, and debugging performance problems in database systems.

What You Will Learn

By the end of this page, you will understand the formal mathematical definition of the Selection operator, master predicate construction and evaluation, comprehend the implementation strategies used by real database systems, and recognize how Selection forms the foundation for declarative data retrieval across all SQL operations.

Formal Definition and Notation

The Selection operator is a unary operation in relational algebra, meaning it operates on a single relation to produce a new relation as output. Its purpose is horizontal filtering—selecting specific tuples (rows) from a relation based on a specified condition while preserving all attributes (columns).

Formal Definition:

Given a relation R with schema R(A₁, A₂, ..., Aₙ) and a selection predicate P, the Selection operation is defined as:

$$σ_P(R) = {t \mid t ∈ R ∧ P(t)}$$

In plain language: the Selection σₚ(R) returns the set of all tuples t from relation R where the predicate P evaluates to true for t.

Key Properties:

Type Preservation: The output schema is identical to the input schema. If R has attributes (A₁, A₂, ..., Aₙ), then σₚ(R) also has attributes (A₁, A₂, ..., Aₙ).
Cardinality Reduction: The number of tuples in the result is less than or equal to the number in the input: |σₚ(R)| ≤ |R|. Selection can only reduce or maintain cardinality, never increase it.
Closure Property: The output is always a valid relation, ensuring that Selection can be composed with other relational operations.

Basic Selection NotationConsider a relation Employee with schema Employee(emp_id, name, department, salary, hire_date). Here are various Selection expressions and their meanings:

Input

Output

Notation Variations

You may encounter different notational conventions: σ_{P}(R), σP, SELECTP, or R WHERE P. All express the same operation. The subscript or bracketed expression contains the selection predicate, and the relation appears either inside parentheses or to the left of the operator symbol. This course uses σ_{P}(R) as the standard notation.

Mathematical Foundation

Understanding the Selection operator requires grounding in set theory, as relational algebra is fundamentally a set-based formalism. The Selection operator performs set restriction—it produces a subset of the input relation.

Set-Theoretic Properties:

1. Subset Guarantee: For any relation R and predicate P: $$σ_P(R) ⊆ R$$

Every tuple in the result must be a tuple in the original relation. Selection never creates new tuples or modifies existing ones.

2. Idempotence: Applying the same Selection twice produces the same result: $$σ_P(σ_P(R)) = σ_P(R)$$

Once tuples not satisfying P are eliminated, applying the same filter again has no additional effect.

3. Commutativity: The order of cascaded Selections doesn't matter: $$σ_{P1}(σ_{P2}(R)) = σ_{P2}(σ_{P1}(R))$$

This property is crucial for query optimization, allowing the optimizer to reorder Selection operations.

4. Cascade into Conjunction: Multiple Selections can be combined into one with a conjunctive predicate: $$σ_{P1}(σ_{P2}(R)) = σ_{P1 ∧ P2}(R)$$

This equivalence allows query optimizers to either split or merge Selection operations based on execution efficiency.

Selection Algebraic Laws
Law Name	Expression	Meaning
Identity	σ_{true}(R) = R	Selecting with always-true predicate returns entire relation
Empty Result	σ_{false}(R) = ∅	Selecting with always-false predicate returns empty relation
Cascade	σ_{P1}(σ_{P2}(R)) = σ_{P1 ∧ P2}(R)	Cascaded selections combine into single selection
Commutativity	σ_{P1}(σ_{P2}(R)) = σ_{P2}(σ_{P1}(R))	Order of cascaded selections doesn't matter
Idempotence	σ_P(σ_P(R)) = σ_P(R)	Repeated identical selection has no additional effect
Selectivity Bound	0 ≤ \|σ_P(R)\| ≤ \|R\|	Result cardinality bounded by input cardinality

Relationship to Predicate Logic:

The selection predicate P is a formula in first-order predicate logic. For each tuple t, P(t) evaluates to either true or false (we'll address NULL values later). The predicate can reference:

Attribute values: Referencing specific columns of the tuple being evaluated
Constants: Literal values (numbers, strings, dates)
Logical connectives: AND (∧), OR (∨), NOT (¬)
Comparison operators: =, ≠, <, >, ≤, ≥

The expressiveness of selection predicates directly corresponds to the fragment of first-order logic restricted to single-tuple evaluation.

Why Mathematical Foundations Matter

These algebraic laws aren't just theoretical curiosities—they're the foundation of query optimization. When a database optimizer sees a complex query, it uses these equivalences to rewrite the query into a more efficient form while guaranteeing identical results. Understanding these laws helps you predict optimizer behavior and write queries that are easier to optimize.

Anatomy of Selection Predicates

The power and flexibility of Selection lies in its predicate—the boolean expression that determines which tuples pass through the filter. Understanding predicate construction is essential for effective query writing.

Atomic Predicates (Simple Conditions):

The simplest predicates compare a single attribute to a value or another attribute:

Attribute-Constant Comparison: salary > 100000
Attribute-Attribute Comparison: end_date > start_date

General form: <attribute> <comparison_op> <value|attribute>

Comparison operators include: =, ≠ (or <>), <, >, ≤ (<=), ≥ (>=)

Compound Predicates:

Atomic predicates combine using logical connectives to form compound predicates:

Conjunction (AND): Both conditions must be true
- department = 'Engineering' AND salary > 100000
Disjunction (OR): At least one condition must be true
- department = 'Engineering' OR department = 'Research'
Negation (NOT): The condition must be false
- NOT (status = 'inactive')

Operator Precedence (highest to lowest):

Comparison operators (=, <, >, etc.)
NOT
AND
OR

Compound Predicate ExamplesDemonstrating various compound predicate constructions on the Employee relation

Input

Output

Common Predicate Pitfalls

•Operator Precedence Confusion: A OR B AND C evaluates as A OR (B AND C), not (A OR B) AND C. Always use explicit parentheses for clarity.
•Type Mismatches: Comparing string '100' with integer 100 may cause implicit conversions with unexpected results. Ensure type consistency.
•NULL Comparisons: The expression salary > 100000 returns UNKNOWN (not false) when salary is NULL. We'll cover three-valued logic in detail.
•Case Sensitivity: String comparisons may or may not be case-sensitive depending on database collation. Always verify your database's behavior.

Three-Valued Logic and NULL Handling

One of the most subtle aspects of Selection is how it handles NULL values. In standard relational algebra, tuples either satisfy a predicate or they don't—a two-valued Boolean world. However, SQL and practical database systems use three-valued logic (3VL) to handle unknown or missing values.

The Three Truth Values:

TRUE: The predicate definitely holds
FALSE: The predicate definitely does not hold
UNKNOWN: The result cannot be determined (typically due to NULL)

Selection passes only tuples where the predicate evaluates to TRUE. Tuples evaluating to FALSE or UNKNOWN are excluded.

NULL in Comparisons:

When any operand in a comparison is NULL, the result is UNKNOWN:

NULL = NULL → UNKNOWN (not TRUE!)
NULL > 100 → UNKNOWN
100 = NULL → UNKNOWN
'Alice' <> NULL → UNKNOWN

Three-Valued Logic Truth Tables
P	Q	P AND Q	P OR Q	NOT P
TRUE	TRUE	TRUE	TRUE	FALSE
TRUE	FALSE	FALSE	TRUE	FALSE
TRUE	UNKNOWN	UNKNOWN	TRUE	FALSE
FALSE	TRUE	FALSE	TRUE	TRUE
FALSE	FALSE	FALSE	FALSE	TRUE
FALSE	UNKNOWN	FALSE	UNKNOWN	TRUE
UNKNOWN	TRUE	UNKNOWN	TRUE	UNKNOWN
UNKNOWN	FALSE	FALSE	UNKNOWN	UNKNOWN
UNKNOWN	UNKNOWN	UNKNOWN	UNKNOWN	UNKNOWN

NULL Handling in SelectionDemonstrating how NULLs affect selection results

Input

Output

The NULL Paradox

A critical implication: σ_{P}(R) ∪ σ_{NOT P}(R) ≠ R when NULLs are present. Tuples with NULL values in predicate-relevant columns disappear from both results! To include NULLs, you must explicitly test: salary > 100000 OR salary IS NULL.

Testing for NULL:

Since comparisons with NULL yield UNKNOWN, SQL provides special predicates:

IS NULL: Returns TRUE if the value is NULL
IS NOT NULL: Returns TRUE if the value is not NULL

These are the only predicates that return TRUE or FALSE (never UNKNOWN) when evaluating NULL values.

Practical Implications:

When writing selections involving nullable columns:

Consider whether NULL rows should be included
Use explicit NULL tests when needed
Remember that negating a predicate doesn't capture NULL rows
COALESCE or IFNULL functions can provide default values for comparison

Selectivity and Computational Cost

Understanding the performance characteristics of Selection is essential for writing efficient queries and predicting system behavior. Two key concepts govern Selection performance: selectivity and evaluation cost.

Selectivity:

The selectivity of a predicate P on relation R is the fraction of tuples that satisfy the predicate:

$$sel(P, R) = \frac{|σ_P(R)|}{|R|}$$

Selectivity ranges from 0 (no tuples match) to 1 (all tuples match).

Examples of Selectivity:

emp_id = 'E001' on primary key: selectivity ≈ 1/n (highly selective)
department = 'Engineering' on 5-department company: selectivity ≈ 0.2
salary > 0 on salary column: selectivity ≈ 1.0 (non-selective)
status = 'active' AND country = 'US' AND age > 30: compound selectivity

Selectivity Estimation:

Database query optimizers estimate selectivity using:

Histograms: Statistical distributions of column values
Distinct Value Counts: Uniformity assumption: selectivity ≈ 1/NDV for equality
Range Statistics: Min/max values for range predicates
Sampling: Running predicate on data sample

Selectivity Estimation Heuristics
Predicate Type	Typical Selectivity Estimate	Assumptions
column = constant	1/NDV(column)	Uniform distribution of values
column > constant	(max - constant)/(max - min)	Uniform distribution in range
column < constant	(constant - min)/(max - min)	Uniform distribution in range
column BETWEEN a AND b	(b - a)/(max - min)	Uniform distribution in range
column LIKE 'prefix%'	Varies	Uses histogram if available
P₁ AND P₂	sel(P₁) × sel(P₂)	Independence assumption
P₁ OR P₂	sel(P₁) + sel(P₂) - sel(P₁)×sel(P₂)	Inclusion-exclusion
NOT P	1 - sel(P)	Complement

Time Complexity:

Without Indexes: Selection requires a full table scan: O(n) where n is the number of tuples. Every tuple must be read and evaluated against the predicate.

With Indexes: If an index exists on the selection attribute, complexity improves dramatically:

B-tree index equality: O(log n) to find first match + O(k) for k matches
B-tree index range: O(log n) + O(k)
Hash index equality: O(1) expected + O(k) for k matches
Bitmap index: O(n/w) where w is word size (efficient for low-cardinality columns)

Space Complexity:

Selection produces a result relation that requires storage. In the worst case (selectivity = 1), the output equals the input size. In practice:

Streaming evaluation: Process one tuple at a time, O(1) working memory
Materialized result: O(sel × n) space for result storage

Selectivity Matters for Query Planning

Query optimizers use selectivity estimates to choose execution strategies. A highly selective predicate (sel < 0.01) might warrant an index lookup, while a non-selective predicate (sel > 0.3) might favor a full table scan. Inaccurate selectivity estimates cause poor query plans—one reason database statistics must be kept current.

Implementation Strategies

Understanding how database systems implement Selection illuminates why some queries are fast and others are slow. The query execution engine has several strategies available, each suited to different scenarios.

Strategy 1: Full Table Scan

The simplest approach reads every tuple and evaluates the predicate:

Algorithm: FullTableScan(R, P)
  result ← ∅
  for each page p in R:
    for each tuple t in p:
      if P(t) = TRUE:
        add t to result
  return result

Cost: O(n) tuple evaluations + O(B) I/O operations (B = number of pages)
When Used: No suitable index, low selectivity predicates, small tables
Advantage: Simple, predictable, uses sequential I/O

Strategy 2: Index Scan

When an index exists on selection attributes:

Algorithm: IndexScan(R, P, Index)
  result ← ∅
  keys ← SearchIndex(Index, P)  -- Returns matching keys
  for each key k in keys:
    rid ← Index.lookup(k)       -- Get record ID
    t ← R.fetch(rid)            -- Fetch tuple by ID
    if P(t) = TRUE:             -- Verify (for partial matches)
      add t to result
  return result

Cost: O(log n) + O(k) for k matching tuples + random I/O per match
When Used: Highly selective predicates on indexed columns
Advantage: Examines only relevant tuples
Disadvantage: Random I/O pattern can be costly for many matches

Full Table Scan: Best When

•Table is small (fits in memory)
•Selectivity > 10-20% of table
•No suitable index exists
•Sequential I/O is much faster than random
•Query touches most columns (wide rows)
•Predicate involves functions on columns

Index Scan: Best When

•Predicate is highly selective (< 5%)
•Index covers predicate columns
•Table is large and doesn't fit in memory
•Clustered index aligns with predicate
•Index-only query (all needed columns in index)
•Point queries on primary key

Strategy 3: Bitmap Index Scan

For complex predicates combining multiple conditions:

Algorithm: BitmapScan(R, P₁ AND P₂, Index₁, Index₂)
  bitmap1 ← Index₁.getBitmap(P₁)  -- Bitmap of RIDs matching P₁
  bitmap2 ← Index₂.getBitmap(P₂)  -- Bitmap of RIDs matching P₂
  combined ← bitmap1 AND bitmap2  -- Bitwise AND
  result ← ∅
  for each bit set in combined:
    rid ← position of bit
    t ← R.fetch(rid)
    add t to result
  return result

When Used: Compound predicates with multiple indexed columns
Advantage: Combines index results before fetching, reduces random I/O

Strategy 4: Index-Only Scan (Covering Index)

When all needed columns exist in the index:

Algorithm: IndexOnlyScan(Index, P)
  result ← ∅
  for each entry e in Index where P(e) = TRUE:
    add ProjectedTuple(e) to result
  return result

Advantage: No table access required—index contains all needed data
Fastest possible: Avoids I/O for tuple fetching entirely

Selection in SQL

The Selection operator manifests in SQL through the WHERE clause. Understanding this connection helps bridge theoretical algebra with practical query writing.

Direct Mapping:

Relational Algebra	SQL Equivalent
σ_{P}(R)	SELECT * FROM R WHERE P

SQL extends the basic Selection capability with additional features not present in pure relational algebra:

selection_examples.sql
SQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
-- Basic Selection: σ_{department='Engineering'}(Employee)
SELECT * 
FROM Employee 
WHERE department = 'Engineering';
 
-- Compound Selection: σ_{department='Engineering' ∧ salary>100000}(Employee)
SELECT * 
FROM Employee 
WHERE department = 'Engineering' 
  AND salary > 100000;
 
-- Range Selection
SELECT * 
FROM Employee 
WHERE salary BETWEEN 80000 AND 120000;
 
-- Pattern Matching (SQL extension beyond pure relational algebra)
SELECT * 
FROM Employee 
WHERE name LIKE 'A%';  -- Names starting with 'A'
 
-- Set Membership (simpler than multiple OR conditions)
SELECT * 
FROM Employee 
WHERE department IN ('Engineering', 'Research', 'Product');
 
-- NULL Handling
SELECT * 
FROM Employee 
WHERE salary IS NOT NULL 
  AND salary > 100000;
 
-- Combining NULL check with OR
SELECT * 
FROM Employee 
WHERE salary > 100000 
   OR salary IS NULL;  -- Include unknown salaries

SQL Predicates Beyond Basic Algebra:

SQL provides several predicate forms that extend beyond simple comparisons:

BETWEEN: salary BETWEEN 50000 AND 100000 (inclusive range)
IN: department IN ('Eng', 'HR', 'Sales') (set membership)
LIKE: name LIKE 'John%' (pattern matching with wildcards)
IS NULL / IS NOT NULL: Explicit NULL testing
EXISTS: Subquery existence testing (correlated with other operators)
ANY / ALL: Subquery comparison operators

Expression in Predicates:

SQL allows expressions, not just attributes, in predicates:

-- Function application in predicate
SELECT * FROM Employee WHERE YEAR(hire_date) = 2023;

-- Arithmetic expression
SELECT * FROM Employee WHERE salary * 1.1 > 100000;

-- String function
SELECT * FROM Employee WHERE UPPER(department) = 'ENGINEERING';

Warning: Expressions on columns often prevent index usage, causing full table scans. Prefer hire_date >= '2023-01-01' AND hire_date < '2024-01-01' over YEAR(hire_date) = 2023.

SARGability: Index-Friendly Predicates

A predicate is 'SARGable' (Search ARGument able) if it can leverage an index. Non-SARGable predicates include: functions on columns (YEAR(date)), arithmetic on columns (salary * 1.1), implicit type conversions, and LIKE patterns starting with wildcards ('%suffix'). Rewrite predicates to be SARGable whenever possible for optimal performance.

Summary and Key Takeaways

The Selection operator is the fundamental filtering mechanism in relational algebra, and mastering its properties is essential for database professionals. Let's consolidate what we've learned:

Key Takeaways

•Selection (σ) performs horizontal filtering — It extracts tuples matching a predicate while preserving all attributes. The output schema equals the input schema.
•Predicates can be atomic or compound — Simple comparisons combine with AND, OR, NOT to form complex boolean expressions. Operator precedence matters.
•Three-valued logic handles NULLs — Comparisons with NULL yield UNKNOWN. Only tuples evaluating to TRUE pass Selection; UNKNOWN tuples are excluded.
•Algebraic laws enable optimization — Commutativity and cascade properties allow query rewriting. Selection is idempotent and always produces a subset.
•Selectivity drives execution strategy — High selectivity favors index scans; low selectivity favors full table scans. Query optimizers estimate selectivity from statistics.
•Implementation varies by context — Full scans, index scans, bitmap scans, and index-only scans each have optimal use cases. Understanding when to expect each helps diagnose performance.
•SQL's WHERE clause implements Selection — The mapping is direct, but SQL extends with BETWEEN, IN, LIKE, and other conveniences. SARGable predicates enable index optimization.

What's Next:

Having mastered the Selection operator for horizontal filtering, we'll next explore the predicates in greater depth. Understanding the different types of conditions you can express—and how databases evaluate them—is crucial for writing effective queries. The next page dives deep into predicate conditions, covering comparison types, logical operators, and advanced predicate patterns.

Page Complete

You now have a comprehensive understanding of the Selection operator—from its formal mathematical definition through implementation strategies to practical SQL usage. This knowledge forms the foundation for all filtering operations in relational databases. Next, we'll dive deeper into the predicate conditions that power Selection.

1 / 5

Loading learning content...

Database Management SystemsRelational Algebra

Selection and Projection

LevelIntermediate

Duration90 mins

TopicRelational Algebra

1 / 5

Selection (σ) Operator

The Art of Filtering Data

What You Will Learn

Formal Definition and Notation

Formal Definition:

Given a relation R with schema R(A₁, A₂, ..., Aₙ) and a selection predicate P, the Selection operation is defined as:

$$σ_P(R) = {t \mid t ∈ R ∧ P(t)}$$

In plain language: the Selection σₚ(R) returns the set of all tuples t from relation R where the predicate P evaluates to true for t.

Key Properties:

Type Preservation: The output schema is identical to the input schema. If R has attributes (A₁, A₂, ..., Aₙ), then σₚ(R) also has attributes (A₁, A₂, ..., Aₙ).
Cardinality Reduction: The number of tuples in the result is less than or equal to the number in the input: |σₚ(R)| ≤ |R|. Selection can only reduce or maintain cardinality, never increase it.
Closure Property: The output is always a valid relation, ensuring that Selection can be composed with other relational operations.