When you type a SQL query and press execute, an intricate sequence of transformations begins. The database engine doesn't directly understand human-readable SQL text—it needs to break down your query into structured components, validate each piece, and build an internal representation that captures your intent precisely. This process is query parsing, and it's the critical first step in every SQL execution.
Query parsing is the process by which the DBMS analyzes the textual SQL statement and converts it into an internal data structure called a parse tree or abstract syntax tree (AST). This transformation involves multiple sophisticated stages that mirror the techniques used in programming language compilers—lexical analysis, syntactic parsing, and semantic validation.
Understanding query parsing isn't just academic curiosity. Knowing how parsing works helps you interpret error messages precisely, write SQL that parses cleanly, and take full advantage of prepared statements and parse caching.
By the end of this page, you will understand the complete parsing pipeline: how SQL text is tokenized, how syntax rules are applied, how meaning is validated against the database schema, and how parse trees are constructed. You'll gain insight into the compiler-like sophistication that underpins every SQL statement you've ever written.
SQL query parsing follows a multi-stage pipeline that progressively transforms raw text into structured representations. Each stage performs a specific transformation, and the output of one stage becomes the input for the next.
The parsing pipeline consists of three main phases:
1. Lexical Analysis (Scanning): The raw SQL string is broken into a sequence of tokens—keywords, identifiers, literals, operators, and punctuation.
2. Syntactic Analysis (Parsing): The token sequence is analyzed according to SQL grammar rules to create a parse tree that represents the query's structure.
3. Semantic Analysis (Validation): The parse tree is validated against the database catalog to ensure all referenced objects exist, data types are compatible, and the query is logically meaningful.
This pipeline architecture keeps concerns cleanly separated: each stage performs one well-defined transformation, and errors can be reported in terms specific to its phase.
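To make the hand-offs concrete, here is a minimal sketch of the pipeline as a single function. The three phase functions are passed in as parameters because they stand in for the machinery described in the rest of this page; real engines are far more elaborate, but the data flow is the same.

```python
def compile_query(sql_text, catalog, tokenize, parse, validate):
    """Sketch of the parsing pipeline: text -> tokens -> tree -> validated tree."""
    tokens = tokenize(sql_text)   # 1. lexical analysis: characters become tokens
    tree = parse(tokens)          # 2. syntactic analysis: tokens become a parse tree / AST
    validate(tree, catalog)       # 3. semantic analysis: names, types, privileges checked
    return tree                   # the validated tree is handed on to the optimizer
```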
Let's examine each phase in detail, starting with lexical analysis.
Lexical analysis (also called scanning or tokenization) is the first phase of parsing. The lexer reads the raw SQL character stream and groups characters into meaningful units called tokens. Think of it as breaking a sentence into individual words before understanding their grammatical relationships.
A token is the smallest meaningful unit in SQL. Each token has a type and a value; the main categories are:
| Token Type | Examples | Description |
|---|---|---|
| Keywords | SELECT, FROM, WHERE, INSERT, JOIN | Reserved words with special meaning in SQL syntax |
| Identifiers | employees, first_name, order_total | Names for tables, columns, schemas, aliases |
| Literals | 'John', 42, 3.14, TRUE, NULL | Constant values: strings, numbers, booleans, null |
| Operators | =, <>, >=, AND, OR, LIKE | Symbols for comparison, arithmetic, and logical operations |
| Delimiters | ,, ;, (, ), . | Punctuation that structures the query |
| Comments | -- single line, /* block */ | Non-executable text for documentation |
| Whitespace | Spaces, tabs, newlines | Separates tokens (usually discarded after tokenization) |
Consider how the lexer processes this query:
```sql
SELECT employee_name, salary * 1.1 AS new_salary
FROM employees
WHERE department_id = 10 AND status = 'active';
```

The lexer produces the following token stream:
| Position | Token | Type | Value |
|---|---|---|---|
| 1 | SELECT | Keyword | SELECT |
| 2 | employee_name | Identifier | employee_name |
| 3 | , | Delimiter | , |
| 4 | salary | Identifier | salary |
| 5 | * | Operator | * |
| 6 | 1.1 | Numeric Literal | 1.1 |
| 7 | AS | Keyword | AS |
| 8 | new_salary | Identifier | new_salary |
| 9 | FROM | Keyword | FROM |
| 10 | employees | Identifier | employees |
| 11 | WHERE | Keyword | WHERE |
| 12 | department_id | Identifier | department_id |
| 13 | = | Operator | = |
| 14 | 10 | Numeric Literal | 10 |
| 15 | AND | Keyword/Operator | AND |
| 16 | status | Identifier | status |
| 17 | = | Operator | = |
| 18 | 'active' | String Literal | active |
| 19 | ; | Delimiter | ; |
SQL keywords are case-insensitive—SELECT, select, and SeLeCt all tokenize to the same keyword type. However, identifier handling varies: some databases (PostgreSQL) fold unquoted identifiers to lowercase, others (Oracle) to uppercase, and quoted identifiers preserve exact case. Understanding your DBMS's case handling prevents subtle bugs.
Lexer Challenges:
The lexer must handle several tricky situations:
1. Keyword vs. Identifier Ambiguity: Is order the keyword for ORDER BY, or a table named 'order'? Context and quoting resolve this.
2. String Escaping: How to tokenize 'It''s a string' (escaped apostrophe) or E'new\nline' (escape sequences)?
3. Numeric Formats: Handling 42, 3.14, 1.5E10, 0x1A (hex), and database-specific formats.
4. Multi-character Operators: Distinguishing < from <= from <> requires lookahead.
5. Unicode: Modern databases support Unicode identifiers and string literals, complicating character classification.
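To ground the idea, here is a minimal regex-based tokenizer sketch. The keyword set and token categories are deliberately tiny and illustrative; a production lexer also handles comments, escape sequences, Unicode, hex literals, and unknown-character errors.

```python
import re

KEYWORDS = {"SELECT", "FROM", "WHERE", "AS", "AND", "OR"}  # illustrative subset

TOKEN_SPEC = [
    ("STRING",     r"'(?:[^']|'')*'"),          # string literal; '' escapes a quote
    ("NUMBER",     r"\d+(?:\.\d+)?"),           # integer or decimal literal
    ("IDENT",      r"[A-Za-z_][A-Za-z0-9_]*"),  # identifier (or keyword, decided below)
    ("OPERATOR",   r"<>|<=|>=|[=<>*+\-/]"),     # multi-character operators listed first
    ("DELIMITER",  r"[(),;.]"),
    ("WHITESPACE", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(sql):
    """Yield (token_type, value) pairs for a SQL string."""
    for match in MASTER.finditer(sql):
        kind, value = match.lastgroup, match.group()
        if kind == "WHITESPACE":
            continue                             # whitespace separates tokens, then is discarded
        if kind == "IDENT" and value.upper() in KEYWORDS:
            kind = "KEYWORD"                     # keywords are matched case-insensitively
        yield kind, value

# Roughly reproduces the token stream tabulated above.
for token in tokenize("SELECT employee_name, salary * 1.1 AS new_salary "
                      "FROM employees WHERE department_id = 10 AND status = 'active';"):
    print(token)
```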
Syntactic analysis (parsing proper) takes the token stream and organizes it according to the rules of SQL grammar. The output is a parse tree (or abstract syntax tree)—a hierarchical data structure that represents the grammatical structure of the query.
SQL grammar is defined formally using a notation similar to BNF (Backus-Naur Form). Here's a simplified example showing how a SELECT statement might be defined:
```
-- Simplified SQL Grammar (BNF-like notation)

<select_statement> ::= SELECT <select_list>
                       FROM <table_reference>
                       [WHERE <search_condition>]
                       [GROUP BY <grouping_columns>]
                       [HAVING <search_condition>]
                       [ORDER BY <sort_specification>]

<select_list>      ::= '*'
                     | <select_item> [',' <select_item>]*

<select_item>      ::= <expression> [AS <alias>]

<expression>       ::= <column_reference>
                     | <literal>
                     | <expression> <operator> <expression>
                     | <function_call>
                     | '(' <expression> ')'

<search_condition> ::= <predicate>
                     | <search_condition> AND <search_condition>
                     | <search_condition> OR <search_condition>
                     | NOT <search_condition>
                     | '(' <search_condition> ')'
```

The parser uses these grammar rules to match the token sequence and build a tree structure for the query.
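For our earlier example, the resulting tree might be sketched roughly as the following nested structure. This is an illustrative Python representation, not any particular engine's internal node format; the node names and field layout are assumptions made for readability.

```python
# Simplified AST for:
#   SELECT employee_name, salary * 1.1 AS new_salary
#   FROM employees
#   WHERE department_id = 10 AND status = 'active';
ast = {
    "type": "select_statement",
    "select_list": [
        {"type": "column_ref", "name": "employee_name"},
        {"type": "alias", "name": "new_salary",
         "expr": {"type": "binary_op", "op": "*",
                  "left": {"type": "column_ref", "name": "salary"},
                  "right": {"type": "literal", "value": 1.1}}},
    ],
    "from": [{"type": "table_ref", "name": "employees"}],
    "where": {"type": "binary_op", "op": "AND",
              "left": {"type": "binary_op", "op": "=",
                       "left": {"type": "column_ref", "name": "department_id"},
                       "right": {"type": "literal", "value": 10}},
              "right": {"type": "binary_op", "op": "=",
                        "left": {"type": "column_ref", "name": "status"},
                        "right": {"type": "literal", "value": "active"}}},
}
```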
Parse Tree Properties:
The parse tree accurately captures operator precedence and grouping: * binds tighter than AND, so the multiplication sits deeper in the tree (inside a select item) while AND forms the root of the WHERE condition.

Parsing Algorithms:
DBMS parsers typically use one of several parsing strategies:
Recursive Descent: Hand-written parsers that directly implement grammar rules as recursive functions. Simple and fast, but requires careful grammar design.
LL Parsing: Top-down parsing that reads input left-to-right and produces a leftmost derivation. Efficient for many SQL constructs.
LALR Parsing: Bottom-up parsing (used by tools like Yacc/Bison) that handles a broader class of grammars. Many production databases use LALR parsers.
PEG Parsing: Parsing Expression Grammars offer unambiguous parsing with ordered choice, increasingly popular in modern implementations.
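To make the first of these strategies concrete, here is a tiny hand-written recursive-descent parser for the <search_condition> fragment of the grammar above. It is a teaching sketch under simplifying assumptions (tokens arrive as (type, value) pairs like the earlier tokenizer produces, and a predicate is just identifier-operator-value); it is not any DBMS's actual parser.

```python
class ConditionParser:
    """Minimal recursive-descent parser for the <search_condition> grammar fragment."""

    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else ("EOF", "")

    def eat(self, kind=None, value=None):
        tok_kind, tok_value = self.peek()
        if (kind and tok_kind != kind) or (value and tok_value.upper() != value):
            raise SyntaxError(f"expected {value or kind}, found {tok_value!r}")
        self.pos += 1
        return tok_value

    # <search_condition> ::= <and_condition> (OR <and_condition>)*
    def search_condition(self):
        node = self.and_condition()
        while self.peek() == ("KEYWORD", "OR"):
            self.eat()
            node = {"op": "OR", "left": node, "right": self.and_condition()}
        return node

    # <and_condition> ::= <not_condition> (AND <not_condition>)*
    def and_condition(self):
        node = self.not_condition()
        while self.peek() == ("KEYWORD", "AND"):
            self.eat()
            node = {"op": "AND", "left": node, "right": self.not_condition()}
        return node

    # <not_condition> ::= NOT <not_condition> | <primary>
    def not_condition(self):
        if self.peek() == ("KEYWORD", "NOT"):
            self.eat()
            return {"op": "NOT", "operand": self.not_condition()}
        return self.primary()

    # <primary> ::= '(' <search_condition> ')' | IDENT OPERATOR value
    def primary(self):
        if self.peek() == ("DELIMITER", "("):
            self.eat()
            node = self.search_condition()
            self.eat("DELIMITER", ")")
            return node
        left = self.eat("IDENT")
        op = self.eat("OPERATOR")
        right = self.eat()  # literal or identifier on the right-hand side
        return {"op": op, "left": left, "right": right}

# Parses: department_id = 10 AND status = 'active'
tokens = [("IDENT", "department_id"), ("OPERATOR", "="), ("NUMBER", "10"),
          ("KEYWORD", "AND"),
          ("IDENT", "status"), ("OPERATOR", "="), ("STRING", "'active'")]
print(ConditionParser(tokens).search_condition())
```

Note that the left-recursive AND/OR rules from the grammar are rewritten as loops here, since naive recursive descent cannot handle left recursion directly.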
A parse tree directly mirrors the grammar rules, including every intermediate production. An abstract syntax tree (AST) is a simplified version that removes redundant nodes and focuses on semantic content. Most databases work with ASTs internally, discarding pure syntactic structure that doesn't affect meaning.
A syntactically valid query isn't necessarily meaningful. Consider:
SELECT imaginary_column FROM nonexistent_table;
This is grammatically correct SQL—it follows all syntax rules. But it's semantically invalid because the table and column don't exist. Semantic analysis validates that the query makes sense given the actual database schema.
Semantic analysis performs several critical checks: it verifies that referenced tables and columns exist, resolves names to specific schema objects, checks that data types are compatible in comparisons and expressions, validates aggregate and GROUP BY usage, and confirms the user has the required privileges.
During semantic analysis, the DBMS consults the database catalog (also called the data dictionary or system catalog). This is a set of internal tables containing metadata about all database objects:
| Catalog Component | Information Provided | Semantic Use |
|---|---|---|
| Table Definitions | Table names, schemas, owners | Verifying table existence and access |
| Column Definitions | Column names, data types, nullability | Name resolution, type checking |
| Constraints | Primary keys, foreign keys, checks, unique | DML validation, join inference |
| Indexes | Index columns, types, statistics | Optimization hints (prepared for next phase) |
| Views | View definitions, dependencies | View expansion and validation |
| Functions/Procedures | Signatures, return types, parameters | Function call resolution |
| Privileges | Grants, roles, permissions | Access control validation |
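To see how the catalog is used, here is a heavily reduced sketch of semantic checks against catalog metadata. The catalog is represented as a plain dictionary and the table, column, and type names are invented for the example; a real catalog is itself a set of system tables, and real type checking also handles NULLs, coercions, and overloading.

```python
# Toy catalog: table name -> {column name -> declared type}
CATALOG = {
    "orders": {"order_id": "INTEGER", "total": "NUMERIC", "status": "VARCHAR"},
}

def check_comparison(table, column, literal):
    """Validate a 'column <op> literal' predicate: existence first, then type compatibility."""
    columns = CATALOG.get(table)
    if columns is None:
        raise ValueError(f"table '{table}' does not exist")
    col_type = columns.get(column)
    if col_type is None:
        raise ValueError(f"column '{column}' not found in table '{table}'")
    literal_type = "VARCHAR" if isinstance(literal, str) else "NUMERIC"
    if col_type in ("INTEGER", "NUMERIC") and literal_type == "VARCHAR":
        # A strict analyzer rejects this; a permissive one inserts an implicit cast instead.
        raise TypeError(f"cannot compare {col_type} column '{column}' with a {literal_type} literal")

check_comparison("orders", "total", 1000)        # passes: numeric vs numeric
check_comparison("orders", "status", "pending")  # passes: string vs string
# check_comparison("orders", "total", "1000")    # raises unless implicit coercion is allowed
# check_comparison("customers", "name", "A")     # raises: table not in the catalog
```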
Type Checking in Detail:
SQL's type system is more complex than it might appear. Consider these typing challenges:
Implicit Conversions: What happens with WHERE salary > '50000'? Is this an error, or should the string be converted to a number?
NULL Handling: NULL has special typing rules—it's compatible with any type but propagates through most operations.
Function Overloading: SUBSTRING(string, int) vs SUBSTRING(string, int, int) must be resolved based on argument count and types.
Aggregate Context: Aggregates like SUM() require numeric inputs and can only appear in specific query positions.
```sql
-- Type checking examples and potential issues

-- Valid: integer comparison
SELECT * FROM orders WHERE total > 1000;

-- Needs type coercion: string to number
SELECT * FROM orders WHERE total > '1000';

-- Type error: cannot compare date with string pattern
-- (unless DBMS has liberal coercion rules)
SELECT * FROM orders WHERE order_date LIKE '%2024%';

-- Valid: LIKE works with string types
SELECT * FROM orders WHERE status LIKE '%pending%';

-- Semantic error: aggregate in wrong context
SELECT order_id, SUM(total) FROM orders;
-- Error: order_id must be in GROUP BY or aggregate

-- Ambiguous column reference (semantic error)
SELECT name FROM employees, departments;
-- Error if both tables have 'name' column
```

Semantic analysis catches errors before any execution begins. This is crucial for efficiency: detecting 'table not found' after starting to execute a complex join would waste resources. All validation happens during parsing so execution can proceed without repeated checks.
After semantic analysis, the validated parse tree often undergoes transformations before proceeding to optimization. These transformations normalize the query representation while preserving semantic meaning.
View expansion: SELECT * FROM active_employees becomes SELECT * FROM (SELECT * FROM employees WHERE status = 'active')

Subquery transformation: WHERE id IN (SELECT id FROM orders) might become a semi-join

Constant folding: WHERE date > '2024-01-01' + INTERVAL '30 days' computes the date once

Predicate simplification: WHERE 1=1 AND status='active' simplifies to WHERE status='active' (sketched below)

Star expansion: SELECT * is expanded to the explicit column list from catalog metadata
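As a small illustration of one of these rewrites, the sketch below removes a trivially true conjunct such as 1=1 from a WHERE condition while preserving its meaning. The node layout loosely follows the illustrative AST shown earlier and is not a real engine's representation.

```python
# The classic always-true predicate: 1 = 1
TRUE_PREDICATE = {"op": "=", "left": {"type": "literal", "value": 1},
                  "right": {"type": "literal", "value": 1}}

def simplify(node):
    """Drop trivially true conjuncts from an AND tree without changing its meaning."""
    if isinstance(node, dict) and node.get("op") == "AND":
        left, right = simplify(node["left"]), simplify(node["right"])
        if left == TRUE_PREDICATE:
            return right
        if right == TRUE_PREDICATE:
            return left
        return {"op": "AND", "left": left, "right": right}
    return node

# WHERE 1=1 AND status = 'active'  ->  WHERE status = 'active'
where = {"op": "AND",
         "left": TRUE_PREDICATE,
         "right": {"op": "=", "left": {"type": "column_ref", "name": "status"},
                   "right": {"type": "literal", "value": "active"}}}
print(simplify(where))
```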
View Expansion Example:

Consider a view that filters for high-value customers:
```sql
-- View definition
CREATE VIEW high_value_customers AS
SELECT c.id, c.name, c.email, SUM(o.total) as lifetime_value
FROM customers c
JOIN orders o ON c.id = o.customer_id
GROUP BY c.id, c.name, c.email
HAVING SUM(o.total) > 10000;

-- User query
SELECT name, lifetime_value FROM high_value_customers WHERE name LIKE 'A%';

-- After view expansion (internally transformed to):
SELECT derived.name, derived.lifetime_value
FROM (
    SELECT c.id, c.name, c.email, SUM(o.total) as lifetime_value
    FROM customers c
    JOIN orders o ON c.id = o.customer_id
    GROUP BY c.id, c.name, c.email
    HAVING SUM(o.total) > 10000
) AS derived
WHERE derived.name LIKE 'A%';
```

After view expansion, the optimizer may push the LIKE 'A%' predicate down into the inner query for efficiency, but that's an optimization step, not a parsing transformation.
Normalization Benefits: these rewrites give every later phase a single canonical representation to work from, so the optimizer sees one predictable form rather than every syntactic variation of the same query.
When parsing fails, the DBMS must provide useful error messages. Error quality significantly impacts developer productivity—a message like "syntax error" is far less helpful than "expected column name after SELECT, found 'FROM' at line 1, column 8."
Types of Parsing Errors:
| Error Type | Phase | Example | Typical Message |
|---|---|---|---|
| Lexical Error | Tokenization | SELECT 'unterminated string | Unterminated string literal |
| Syntax Error | Parsing | SELECT FROM employees | Expected column list after SELECT |
| Unknown Identifier | Semantic | SELECT xyz FROM employees | Column 'xyz' not found in table 'employees' |
| Type Mismatch | Semantic | WHERE salary = 'high' | Cannot compare INTEGER with VARCHAR |
| Ambiguous Reference | Semantic | SELECT id FROM a, b | Column 'id' is ambiguous |
| Privilege Error | Semantic | SELECT * FROM secret_table | Permission denied on table 'secret_table' |
Error Recovery Strategies:
Good parsers don't stop at the first error—they attempt to recover and find additional errors. Strategies include:
Panic Mode: Skip tokens until a synchronizing token (like ; or a keyword) is found, then resume parsing.
Phrase-Level Recovery: Make local corrections (insert missing token, delete extra token) and continue.
Error Productions: Grammar rules that match common errors and produce specific messages.
Global Correction: Find the minimal edit (insertions/deletions) to make the query valid. Computationally expensive but provides excellent diagnostics.
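A minimal sketch of panic-mode recovery: when a statement fails to parse, skip ahead to the next synchronizing token (here the statement terminator ';') and continue, so one bad statement doesn't hide errors in the statements that follow. The parse_statement argument stands in for a hypothetical single-statement parser; the token format matches the earlier sketches.

```python
def recover_to_semicolon(tokens, pos):
    """Panic mode: discard tokens up to and including the next ';' (or end of input)."""
    while pos < len(tokens) and tokens[pos] != ("DELIMITER", ";"):
        pos += 1
    return pos + 1  # resume just past the synchronizing token

def parse_script(tokens, parse_statement):
    """Parse a multi-statement script, reporting every error rather than only the first."""
    pos, trees, errors = 0, [], []
    while pos < len(tokens):
        try:
            tree, pos = parse_statement(tokens, pos)  # hypothetical: returns (tree, new_pos)
            trees.append(tree)
        except SyntaxError as exc:
            errors.append(str(exc))
            pos = recover_to_semicolon(tokens, pos)   # skip the broken statement, keep going
    return trees, errors
```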
```sql
-- Common parsing errors and their fixed versions

-- ERROR: Missing comma between columns
-- (many databases silently accept this, treating 'department' as an alias for 'name')
SELECT name department FROM employees;
-- FIX: SELECT name, department FROM employees;

-- ERROR: Missing FROM clause
SELECT name, salary WHERE id = 5;
-- FIX: SELECT name, salary FROM employees WHERE id = 5;

-- ERROR: Mismatched parentheses
SELECT * FROM employees WHERE (status = 'active' AND (department = 'IT');
-- FIX: Add closing parenthesis

-- ERROR: Wrong keyword order
SELECT * FROM employees ORDER BY name WHERE status = 'active';
-- FIX: WHERE comes before ORDER BY

-- ERROR: Using reserved keyword as identifier
SELECT order FROM orders;
-- FIX: SELECT "order" FROM orders; (or use a different name)
```

When you encounter a parse error, look for: (1) the line and character position, (2) what the parser expected to find, and (3) what it actually found. The error is often immediately before the reported position: if the message says 'expected FROM at line 1, column 25', the problem is likely in the SELECT list that precedes it.
Parsing isn't free—it consumes CPU cycles and memory. For frequently executed queries, re-parsing each time is wasteful. Prepared statements and parse caching address this inefficiency.
Prepared Statements:
A prepared statement is parsed and validated once, then executed multiple times with different parameter values:
```sql
-- Without prepared statement: parsed every time
SELECT * FROM orders WHERE customer_id = 1001;
SELECT * FROM orders WHERE customer_id = 1002;
SELECT * FROM orders WHERE customer_id = 1003;
-- Each execution: full parse + optimize + execute

-- With prepared statement: parsed once
PREPARE get_orders AS
SELECT * FROM orders WHERE customer_id = $1;

EXECUTE get_orders(1001);
EXECUTE get_orders(1002);
EXECUTE get_orders(1003);
-- Each execution: just bind parameters + execute
-- (Optimization may happen once or per-execution depending on DBMS)
```

Benefits of Prepared Statements: the parsing and validation cost is paid once rather than on every execution, and each subsequent call only binds new parameter values before executing.
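From application code, the same idea appears as parameterized queries through the driver. Below is a minimal sketch using Python's standard-library sqlite3 module; the table and data are invented for the example. The statement text stays constant and only the bound value changes per call, which is what lets the driver and engine reuse the parsed form rather than re-parsing each time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders (customer_id, total) VALUES (?, ?)",
                 [(1001, 250.0), (1002, 99.5), (1001, 40.0)])

# One statement text, many executions: only the parameter value differs.
query = "SELECT order_id, total FROM orders WHERE customer_id = ?"
for customer_id in (1001, 1002, 1003):
    print(customer_id, conn.execute(query, (customer_id,)).fetchall())
```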
Parse Caching (Query Cache):
Many databases cache parsed query trees keyed by the query text. When the same query arrives again, the cached parse tree is used.
| Database | Cache Type | Key | Notes |
|---|---|---|---|
| Oracle | Shared SQL Area | Query text hash | Extensive caching; cursor sharing for similar queries |
| SQL Server | Plan Cache | Query text + SET options | Auto-parameterization for simple queries |
| PostgreSQL | Per-connection cache | Prepared statement name | Manual preparation; pg_stat_statements tracks queries |
| MySQL | Query Cache (deprecated) | Exact query text | Removed in MySQL 8.0; use application-level caching |
Cached parse trees become invalid when the schema changes. DDL operations (ALTER TABLE, DROP INDEX) typically flush relevant cached entries. This is why frequent DDL in high-traffic systems can cause performance degradation—queries suddenly need re-parsing.
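The mechanism can be pictured as a small cache keyed by query text with a flush hook for DDL. This is a conceptual sketch only; real caches key on more than the raw text (session settings, schema versions, search path) and handle memory pressure and partial invalidation.

```python
class ParseCache:
    """Toy parse cache: query text -> parsed tree, flushed when the schema changes."""

    def __init__(self, parse_fn):
        self.parse_fn = parse_fn  # the expensive parse + validate step
        self.cache = {}

    def get_tree(self, sql_text):
        tree = self.cache.get(sql_text)
        if tree is None:                      # miss: pay the parsing cost and remember the result
            tree = self.parse_fn(sql_text)
            self.cache[sql_text] = tree
        return tree                           # hit: reuse the previously parsed tree

    def on_ddl(self):
        # ALTER TABLE, DROP INDEX, and similar DDL may invalidate cached trees: flush them.
        self.cache.clear()
```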
Query parsing transforms your SQL text into a validated, structured representation that the database can work with: the lexer breaks the text into tokens, the parser organizes those tokens into a tree according to SQL grammar, semantic analysis validates the tree against the catalog, and rewriting normalizes it, all before a single row is touched.
What's Next:
With the query successfully parsed into a validated internal representation, the next phase is optimization. The optimizer examines the parse tree, considers multiple execution strategies, estimates their costs, and chooses the most efficient approach. Understanding optimization is key to writing queries that perform well—we'll explore this in the next page.
You now understand how SQL queries are parsed—from raw text to validated parse trees. This knowledge helps you interpret error messages, write cleaner SQL, and leverage prepared statements effectively. Next, we'll explore how the optimizer transforms parse trees into efficient execution strategies.