When you write SELECT * FROM employees WHERE salary > 50000, you're issuing a high-level declarative request. You're stating what you want, not how to get it. The magic that transforms this human-readable statement into a sequence of disk reads, memory operations, and data comparisons is performed by the Query Processor—one of the most sophisticated components in any DBMS.
The Query Processor is the intellectual heart of the database system. It must understand your intent, validate your request against the schema, explore potentially millions of execution strategies, select the most efficient one, and orchestrate its execution—all while appearing instantaneous to the user.
This page takes you deep inside the Query Processor, revealing the layers of transformation that convert declarative SQL into optimized physical operations.
By the end of this page, you will understand the complete query processing pipeline: parsing and syntax analysis, semantic checking, query optimization strategies, and execution engine mechanics. You'll see why query optimization is often called the most challenging problem in database systems, and how modern DBMS implementations tackle it.
The journey from SQL text to executed results follows a well-defined pipeline. Each stage performs a specific transformation, progressively converting abstract requests into concrete machine operations. Understanding this pipeline reveals how declarative power becomes operational efficiency.
The fundamental insight: SQL is a specification language, not an execution language. The Query Processor's job is to bridge this semantic gap—translating user intent into an executable plan that leverages all available optimizations.
The pipeline stages in detail:
1. Parsing Phase — Converts raw SQL text into structured representations. Lexical analysis tokenizes the input; syntax analysis builds a parse tree according to SQL grammar rules.
2. Validation Phase — Ensures the query is semantically meaningful. Checks that referenced tables exist, columns have correct types, and the user has necessary permissions.
3. Optimization Phase — Transforms the validated query into an efficient execution plan. This is where the DBMS applies its intelligence—rewriting queries, generating alternative plans, and using statistics to choose the best approach.
4. Execution Phase — The execution engine runs the chosen plan, coordinating with storage and buffer managers to retrieve and process data.
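As an illustrative sketch, the four stages can be chained as plain function composition. Everything below — the types, the stub bodies, the `SeqScan(employees)` plan — is a hypothetical placeholder, not any real engine's API:

```typescript
// Hypothetical stub types: a real parser/optimizer is far more involved.
type ParseTree = { kind: string; text: string };
type ValidatedTree = ParseTree & { valid: true };
type Plan = { operator: string; estimatedCost: number };

function parse(sql: string): ParseTree {
  // Stage 1: lexical + syntax analysis would build a real tree here.
  return { kind: "select", text: sql.trim() };
}

function validate(tree: ParseTree): ValidatedTree {
  // Stage 2: check tables/columns against the catalog, check permissions.
  return { ...tree, valid: true };
}

function optimize(tree: ValidatedTree): Plan {
  // Stage 3: enumerate candidate plans, estimate costs, keep the cheapest.
  return { operator: "SeqScan(employees)", estimatedCost: 4.5 };
}

function execute(plan: Plan): string[] {
  // Stage 4: pull results from storage via the chosen physical operators.
  return [`ran ${plan.operator}`];
}

const rows = execute(optimize(validate(parse("SELECT * FROM employees"))));
```

The point of the sketch is the shape of the pipeline: each stage consumes the previous stage's output, and nothing downstream ever sees raw SQL text again.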
Some DBMS implementations interpret plans at runtime (PostgreSQL's tree of executor nodes works this way), while others compile queries into executable code. Modern systems often use JIT (Just-In-Time) compilation for frequently executed queries, combining interpretation flexibility with compiled performance.
The DDL Interpreter handles Data Definition Language statements—commands that define, alter, or remove database structures. Unlike DML queries that operate on data, DDL statements modify the database's metadata, the schema definitions stored in the system catalog.
The DDL Interpreter's responsibilities:
```sql
-- CREATE TABLE: DDL Interpreter must:
-- 1. Parse the entire statement
-- 2. Validate data types and constraints
-- 3. Check for name conflicts
-- 4. Allocate storage structures
-- 5. Update system catalog

CREATE TABLE employees (
    employee_id   INT PRIMARY KEY,
    first_name    VARCHAR(50) NOT NULL,
    last_name     VARCHAR(50) NOT NULL,
    email         VARCHAR(100) UNIQUE,
    department_id INT REFERENCES departments(dept_id),
    salary        DECIMAL(10,2) CHECK (salary > 0),
    hire_date     DATE DEFAULT CURRENT_DATE
);

-- The DDL Interpreter creates entries in:
-- - pg_class (table metadata)
-- - pg_attribute (column metadata)
-- - pg_constraint (constraint definitions)
-- - pg_index (for PRIMARY KEY, UNIQUE)
-- - pg_depend (dependency tracking)

-- ALTER TABLE: Complex coordination required
ALTER TABLE employees
    ADD COLUMN manager_id INT REFERENCES employees(employee_id),
    DROP CONSTRAINT IF EXISTS email_unique,
    ALTER COLUMN salary SET DEFAULT 50000;

-- The interpreter must:
-- 1. Acquire exclusive lock on the table
-- 2. Validate new column against existing data
-- 3. Update all catalog entries atomically
-- 4. Rebuild affected indexes if necessary

-- DROP with dependency checking
DROP TABLE employees CASCADE;
-- CASCADE: Also drops dependent views, foreign keys
-- RESTRICT (default): Fails if dependencies exist
```

The System Catalog: Metadata About Metadata
The system catalog (also called the data dictionary) is a collection of tables that describe all database objects. When the DDL Interpreter processes CREATE TABLE employees, it doesn't just create the table—it inserts rows into catalog tables describing that table's structure.
This creates an elegant recursive property: the catalog describes itself. You can query the catalog to discover what tables exist, what columns they have, what indexes are available—using the same SQL that queries regular data.
| Catalog Table | Purpose | Key Columns |
|---|---|---|
| pg_class | All relations (tables, indexes, views, sequences) | relname, relkind, reltuples, relpages |
| pg_attribute | Columns of all tables | attname, atttypid, attnum, attnotnull |
| pg_index | Index definitions | indrelid, indkey, indisunique, indisprimary |
| pg_constraint | All constraints | conname, contype, conrelid, confrelid |
| pg_namespace | Schemas | nspname, nspowner |
| pg_type | Data type definitions | typname, typlen, typtype |
| pg_proc | Functions and procedures | proname, proargtypes, prosrc |
| pg_depend | Object dependencies | classid, objid, refobjid, deptype |
The Query Optimizer constantly reads the system catalog to understand table structures, available indexes, and statistical information. A well-maintained catalog with accurate statistics is essential for good query plans. This is why ANALYZE/UPDATE STATISTICS commands exist—they refresh the catalog's statistical information.
The DML Compiler transforms Data Manipulation Language statements (SELECT, INSERT, UPDATE, DELETE) from SQL text into internal query representations. This compilation process involves multiple sophisticated stages, each adding structure and removing ambiguity.
The compilation pipeline transforms:
SQL Text → Tokens → Parse Tree → Abstract Syntax Tree → Relational Algebra → Execution Plan
Each transformation step serves a purpose: tokenization handles lexical structure, parsing validates grammar, semantic analysis resolves names and types, and algebraic translation creates the foundation for optimization.
Lexical Analysis (Tokenization) breaks the SQL string into a stream of tokens—meaningful units like keywords, identifiers, operators, and literals.
Token Categories:

- Keywords: reserved words such as SELECT, FROM, WHERE, JOIN
- Identifiers: table, column, and alias names
- Operators: comparison and arithmetic symbols such as =, >, <
- Literals: numeric and string constants such as 50000 or 'Alice'
- Punctuation: commas, dots, parentheses, and the terminating semicolon
The lexer also handles SQL's case-insensitivity for keywords, whitespace normalization, and string escape sequences.
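An illustrative lexer sketch (not any real DBMS's lexer) for a tiny SQL subset shows how these categories and case-insensitive keywords fall out of a single matching loop; the keyword list and regular expression are simplified assumptions:

```typescript
// Minimal SQL lexer sketch: keywords are case-insensitive, identifiers are not.
type Token = {
  type: "KEYWORD" | "IDENTIFIER" | "OPERATOR" | "LITERAL" | "PUNCT";
  value: string;
};

const KEYWORDS = new Set(["SELECT", "FROM", "WHERE", "JOIN", "ON"]);

function tokenize(sql: string): Token[] {
  // Sticky regex: at each position, skip whitespace, then match one token.
  const pattern = /\s*(\d+|[A-Za-z_][A-Za-z0-9_]*|>=|<=|<>|[=><.,;*()])/y;
  const tokens: Token[] = [];
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(sql)) !== null) {
    const v = m[1];
    if (/^\d+$/.test(v)) tokens.push({ type: "LITERAL", value: v });
    else if (KEYWORDS.has(v.toUpperCase()))
      tokens.push({ type: "KEYWORD", value: v.toUpperCase() });
    else if (/^[A-Za-z_]/.test(v)) tokens.push({ type: "IDENTIFIER", value: v });
    else if (/^(>=|<=|<>|[=><])$/.test(v)) tokens.push({ type: "OPERATOR", value: v });
    else tokens.push({ type: "PUNCT", value: v });
  }
  return tokens;
}

// "select" is lowercase in the input but normalized to the keyword SELECT:
const toks = tokenize("select salary > 50000;");
```

Running it on `select salary > 50000;` yields the token stream KEYWORD, IDENTIFIER, OPERATOR, LITERAL, PUNCT — the same shape as the full example below.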
```sql
-- Input SQL:
SELECT e.name, d.dept_name
FROM employees e
JOIN departments d ON e.dept_id = d.id
WHERE e.salary > 50000;

-- Tokenized output:
-- [KEYWORD:SELECT]
-- [IDENTIFIER:e] [DOT] [IDENTIFIER:name] [COMMA]
-- [IDENTIFIER:d] [DOT] [IDENTIFIER:dept_name]
-- [KEYWORD:FROM]
-- [IDENTIFIER:employees] [IDENTIFIER:e]
-- [KEYWORD:JOIN]
-- [IDENTIFIER:departments] [IDENTIFIER:d]
-- [KEYWORD:ON]
-- [IDENTIFIER:e] [DOT] [IDENTIFIER:dept_id]
-- [OPERATOR:=]
-- [IDENTIFIER:d] [DOT] [IDENTIFIER:id]
-- [KEYWORD:WHERE]
-- [IDENTIFIER:e] [DOT] [IDENTIFIER:salary]
-- [OPERATOR:>]
-- [LITERAL:50000]
-- [SEMICOLON]
```

From Parse Tree to Relational Algebra
After semantic analysis, the validated parse tree is converted to relational algebra—a formal mathematical notation for expressing data operations. This translation is crucial because relational algebra has well-defined algebraic properties that enable query optimization.
The SQL query:
```sql
SELECT e.name, d.dept_name
FROM employees e JOIN departments d ON e.dept_id = d.id
WHERE e.salary > 50000
```
Becomes the relational algebra expression:
```
π(name, dept_name)(
  σ(salary > 50000)(
    employees ⋈(dept_id = id) departments
  )
)
```
This algebraic form is the input to the Query Optimizer.
The Query Optimizer is arguably the most intellectually sophisticated component of any DBMS. Its task sounds simple: given a declarative query, find the most efficient way to execute it. In practice, this is an extraordinarily complex problem.
Why is optimization hard?
For a simple query joining 5 tables, there may be thousands of valid execution plans—different join orders, different join algorithms, different index choices. For 10 tables, the number explodes to millions. The optimizer must navigate this vast search space quickly, using heuristics and cost models to find a good (ideally optimal) plan without exhaustive enumeration.
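The plan-count explosion follows from simple combinatorics: left-deep join orders grow as n!, and bushy trees multiply that by the number of binary tree shapes, the Catalan number C(n−1). A quick back-of-the-envelope calculation (join orderings only, before even counting join algorithms or access paths):

```typescript
// Left-deep join orderings of n tables: n! permutations.
function factorial(n: number): number {
  let f = 1;
  for (let i = 2; i <= n; i++) f *= i;
  return f;
}

// Catalan numbers count binary tree shapes with n+1 leaves:
// C(0) = 1, C(k+1) = C(k) * 2(2k+1) / (k+2)
function catalan(n: number): number {
  let c = 1;
  for (let k = 0; k < n; k++) c = (c * 2 * (2 * k + 1)) / (k + 2);
  return c;
}

for (const n of [5, 10]) {
  const leftDeep = factorial(n);              // join orderings
  const bushy = leftDeep * catalan(n - 1);    // orderings × tree shapes
  console.log(`${n} tables: ${leftDeep} left-deep, ${bushy} bushy plans`);
}
// 5 tables: 120 left-deep orders; 10 tables: 3,628,800 — and bushy
// trees push 10 tables past 17 billion shapes.
```

Multiply each ordering by the available join algorithms and index choices per operator, and exhaustive enumeration is clearly hopeless beyond a handful of tables.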
Query optimization is so challenging that it's been studied for over 40 years, and modern optimizers still contain thousands of transformation rules, statistical models, and specialized algorithms.
The optimizer faces a meta-problem: spending more time optimizing might find a better plan, but if optimization takes too long, the total query time increases. Modern optimizers have time budgets and will terminate with a 'good enough' plan rather than search exhaustively. For OLTP queries (simple, frequent), less optimization time makes sense. For OLAP queries (complex, running hours), extensive optimization is worthwhile.
Cost-Based Optimization: The Statistical Foundation
Modern optimizers are cost-based, meaning they estimate the execution cost of each candidate plan and select the lowest-cost option. The cost model considers:

- I/O cost: the number of disk pages that must be read or written
- CPU cost: the number of tuples processed and predicates evaluated
- Memory: working memory available for sorts, hash tables, and buffers
- Network cost: data transfer between nodes in distributed deployments
These estimates depend on statistics maintained about the data:
| Statistic Type | What It Measures | How It's Used |
|---|---|---|
| Table Cardinality | Number of rows in a table | Estimate result sizes, join costs |
| Column Cardinality | Number of distinct values in a column | Estimate selectivity of equality predicates |
| Histograms | Distribution of values in a column | Handle skewed data, range predicates |
| Correlation | Relationship between columns | Estimate compound predicate selectivity |
| Index Statistics | B-tree depth, leaf pages, clustering factor | Estimate index access costs |
| Null Fraction | Percentage of NULL values | Adjust selectivity for nullable columns |
| Average Width | Average byte size of column values | Estimate memory and I/O requirements |
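As a rough sketch of how such statistics become row estimates, here are the textbook uniformity and linear-interpolation formulas. This is illustrative only — real optimizers use histograms and much more refined models — and the column statistics below are invented numbers:

```typescript
// Hypothetical catalog statistics for one column.
interface ColumnStats {
  tableRows: number;      // table cardinality
  distinctValues: number; // column cardinality
  nullFraction: number;   // fraction of NULL values
  min: number;            // smallest observed value
  max: number;            // largest observed value
}

// Equality predicate (col = v): assume values are uniformly distributed
// over the distinct values, excluding NULLs.
function eqSelectivity(s: ColumnStats): number {
  return (1 - s.nullFraction) / s.distinctValues;
}

// Range predicate (col > v): linear interpolation between min and max.
function gtSelectivity(s: ColumnStats, v: number): number {
  if (v <= s.min) return 1 - s.nullFraction;
  if (v >= s.max) return 0;
  return (1 - s.nullFraction) * ((s.max - v) / (s.max - s.min));
}

// Made-up stats for a salary column:
const salary: ColumnStats = {
  tableRows: 1000, distinctValues: 200, nullFraction: 0,
  min: 20000, max: 120000,
};

// WHERE salary > 50000 → (120000 - 50000) / (120000 - 20000) = 0.7
const estRows = Math.round(salary.tableRows * gtSelectivity(salary, 50000));
```

The `rows=…` figures in an EXPLAIN plan are exactly this kind of estimate, refined with histograms to handle skew that the uniform-distribution assumption gets wrong.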
```sql
-- PostgreSQL: EXPLAIN shows the optimizer's chosen plan
EXPLAIN SELECT e.name, d.dept_name
FROM employees e
JOIN departments d ON e.dept_id = d.id
WHERE e.salary > 50000;

-- Output:
-- Hash Join  (cost=3.25..8.60 rows=37 width=64)
--   Hash Cond: (e.dept_id = d.id)
--   ->  Seq Scan on employees e  (cost=0.00..4.50 rows=37 width=40)
--         Filter: (salary > 50000)
--   ->  Hash  (cost=2.00..2.00 rows=100 width=36)
--         ->  Seq Scan on departments d  (cost=0.00..2.00 rows=100 width=36)

-- EXPLAIN ANALYZE executes and shows actual vs. estimated
EXPLAIN ANALYZE SELECT e.name, d.dept_name
FROM employees e
JOIN departments d ON e.dept_id = d.id
WHERE e.salary > 50000;

-- Shows:
-- actual time=0.015..0.020 rows=42 loops=1
-- (Compares estimated rows=37 to actual rows=42)
```

Bad plans usually stem from bad statistics. If the optimizer thinks a table has 100 rows when it actually has 1 million, it might choose a nested-loop join (O(n²)) instead of a hash join (O(n)). This is why running ANALYZE/UPDATE STATISTICS regularly is crucial, especially after bulk data loads.
Once the optimizer produces an execution plan, the Execution Engine takes over. This component interprets (or executes compiled code for) the plan, coordinating with the Storage Manager and Buffer Manager to retrieve data and produce results.
Execution Models:
Modern DBMS implementations use one of three primary execution models, each with distinct performance characteristics: the Volcano/iterator model (tuples flow one at a time through a tree of operators), the materialization model (each operator computes its full result before the next consumes it), and the vectorized model (operators exchange batches of tuples for better cache behavior).
The Volcano/Iterator Model is the classic approach, used by PostgreSQL, MySQL, and most traditional databases.
How it works:
- Every operator implements a uniform interface: open(), next(), close()
- The root operator calls next() on its children, which recursively call their children
- Tuples are pulled up the operator tree one at a time, with no intermediate results materialized

Advantages:

- Simple, uniform interface: any operator can be composed with any other
- Pipelined execution: low memory footprint, and first results arrive early

Disadvantages:

- One function call (often virtual) per tuple per operator adds significant CPU overhead
- Poor cache locality and branch prediction, which is what vectorized and JIT-compiled engines address
```typescript
// Conceptual Volcano-style operator interface
// (Tuple and Table are assumed types supplied by the engine)
interface Operator {
  open(): void;
  next(): Tuple | null; // Returns next tuple, null when exhausted
  close(): void;
}

// Example: Sequential Scan Operator
class SeqScanOperator implements Operator {
  private table: Table;
  private cursor: number = 0;

  open() { this.cursor = 0; }

  next(): Tuple | null {
    if (this.cursor >= this.table.rowCount) return null;
    return this.table.getRow(this.cursor++);
  }

  close() { /* cleanup */ }
}

// Example: Filter Operator (Selection σ)
class FilterOperator implements Operator {
  private child: Operator;
  private predicate: (t: Tuple) => boolean;

  open() { this.child.open(); }

  next(): Tuple | null {
    while (true) {
      const tuple = this.child.next();
      if (tuple === null) return null;
      if (this.predicate(tuple)) return tuple;
      // Keep pulling until predicate satisfied or exhausted
    }
  }

  close() { this.child.close(); }
}
```

Physical Operators: The Execution Toolbox
The execution engine implements various physical operators that correspond to relational algebra operations. Each operator may have multiple implementations optimized for different conditions:
| Operation | Implementations | Best For |
|---|---|---|
| Table Access | Sequential Scan, Index Scan, Index-Only Scan, Bitmap Scan | Depends on selectivity and index availability |
| Selection (Filter) | On-the-fly evaluation, Index-based filter | All workloads—always needed |
| Projection | Column elimination, Expression evaluation | Reducing data width early |
| Nested-Loop Join | Simple nested loop, Index nested loop, Block nested loop | Small tables, inequality joins, or when index exists |
| Hash Join | In-memory hash, Grace hash (partitioned) | Equi-joins, larger tables, sufficient memory |
| Merge Join | Sort-merge with pre-sorted inputs | Already-sorted data, range joins |
| Aggregation | Hash aggregate, Sort aggregate, Pre-aggregation | GROUP BY operations |
| Sort | Quicksort, External merge sort | ORDER BY, merge join preparation |
| Limit | Top-N with heap, Early termination | LIMIT/TOP queries |
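As a concrete sketch of one entry in this toolbox, here is a minimal in-memory hash join over plain objects — illustrative only, not a real engine's implementation, and without the spilling-to-disk machinery a Grace hash join adds:

```typescript
// In-memory hash join: build a hash table on the smaller input,
// then probe it with each row of the larger input (equi-join only).
type Row = Record<string, unknown>;

function hashJoin(build: Row[], probe: Row[], buildKey: string, probeKey: string): Row[] {
  // Build phase: hash the (ideally smaller) relation on its join key.
  const table = new Map<unknown, Row[]>();
  for (const r of build) {
    const k = r[buildKey];
    const bucket = table.get(k);
    if (bucket) bucket.push(r);
    else table.set(k, [r]);
  }
  // Probe phase: each probe row looks up its matching build rows.
  const out: Row[] = [];
  for (const p of probe) {
    for (const b of table.get(p[probeKey]) ?? []) {
      out.push({ ...b, ...p });
    }
  }
  return out;
}

const departments = [{ id: 1, dept_name: "Eng" }, { id: 2, dept_name: "Sales" }];
const employees = [
  { name: "Ada", dept_id: 1 },
  { name: "Bob", dept_id: 2 },
  { name: "Cy", dept_id: 1 },
];
const joined = hashJoin(departments, employees, "id", "dept_id");
```

Each input is scanned exactly once, which is why hash joins dominate for large equi-joins when memory suffices; the same build/probe structure underlies the Hash Join node in the EXPLAIN output shown earlier.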
Modern Query Processors must leverage multi-core CPUs and distributed architectures to handle massive datasets. Parallel query execution transforms the execution plan to run portions concurrently, dramatically improving performance.
Parallelism Dimensions:

- Inter-query parallelism: independent queries run concurrently on different cores
- Intra-query, inter-operator parallelism: different operators of one plan run concurrently in a pipeline
- Intra-query, intra-operator parallelism: a single operator, such as a scan or aggregate, is partitioned across worker processes
```sql
-- PostgreSQL parallel query example
SET max_parallel_workers_per_gather = 4;

EXPLAIN ANALYZE
SELECT department, COUNT(*), AVG(salary)
FROM employees
WHERE hire_date > '2015-01-01'
GROUP BY department;

-- Parallel plan:
-- Finalize HashAggregate  (rows=50)
--   Group Key: department
--   ->  Gather  (workers planned: 4)
--         Workers Launched: 4
--         ->  Partial HashAggregate  (rows=50)
--               Group Key: department
--               ->  Parallel Seq Scan on employees
--                     Filter: (hire_date > '2015-01-01')
--                     Rows Removed by Filter: 150000
--                     Workers: 4

-- Each worker scans 1/4 of the table,
-- builds partial aggregates,
-- leader combines partial results
```

Not all query components can be parallelized. Serial bottlenecks (like final aggregation, sorting, or result sending) limit speedup. A query with 90% parallelizable work and 10% serial work can achieve at most 10x speedup, regardless of core count. This is why query structure significantly affects parallel efficiency.
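That serial-bottleneck bound is an instance of Amdahl's law, which is easy to check numerically (a quick illustrative calculation, not tied to any particular DBMS):

```typescript
// Amdahl's law: with parallelizable fraction p of the work and n workers,
// speedup = 1 / ((1 - p) + p / n); as n grows, it approaches 1 / (1 - p).
function speedup(p: number, n: number): number {
  return 1 / ((1 - p) + p / n);
}

// 90% parallelizable work:
console.log(speedup(0.9, 4).toFixed(2));  // ≈ 3.08 on 4 workers
console.log(speedup(0.9, 16).toFixed(2)); // 6.40 on 16 workers
console.log(speedup(0.9, 1_000_000));     // approaches the 10x ceiling
```

Even a million workers cannot push the 90%-parallel query past 10x, because the 10% serial portion always runs at single-core speed.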
The Query Processor stands as the most intellectually demanding component of any DBMS, transforming declarative SQL into efficient executable plans through a series of well-defined transformations.
Key takeaways:

- SQL is a specification language; the Query Processor translates intent into execution
- The pipeline proceeds through parsing, validation, optimization, and execution, consulting the system catalog throughout
- Cost-based optimization stands or falls on accurate statistics; keep them fresh with ANALYZE/UPDATE STATISTICS
- The execution engine composes physical operators, and modern engines increasingly parallelize them across cores
What's next:
The Query Processor relies heavily on the Storage Manager to read and write data, and on the Buffer Manager to efficiently cache frequently accessed pages. The next page explores the Storage Manager—the component responsible for organizing data on disk and providing efficient access paths through indexes and file organizations.
You now understand the Query Processor's complete pipeline: from SQL text through parsing, validation, optimization, and execution. You can appreciate why query optimization is considered one of the hardest problems in database systems, and how modern execution engines achieve high performance through vectorization and JIT compilation.