Once query text arrives at the database server, it faces a fundamental challenge: computers don't understand human languages. The SQL statement `SELECT name FROM users WHERE id = 42` is just a sequence of characters, bytes without inherent meaning. The database system must extract the structure and intent from this text.
This is the job of the parser—a sophisticated component that transforms raw SQL text into a structured representation the rest of the query processing pipeline can work with. Parsing is one of the most thoroughly studied areas in computer science, with decades of theory and practice informing modern database parsers.
By the end of this page, you will understand how SQL text is tokenized into lexemes, how grammar rules define valid SQL syntax, how parsers construct parse trees representing query structure, and how syntax errors are detected and reported. You'll gain insight into the compiler construction techniques that power every database system.
Most database parsers employ a two-phase architecture borrowed from compiler design. This separation of concerns simplifies implementation and enables independent optimization of each phase.
Phase 1: Lexical Analysis (Lexing/Tokenizing)
The lexer (also called tokenizer or scanner) reads the character stream and groups characters into meaningful units called tokens (or lexemes). Think of this as identifying individual words and punctuation in a sentence.
Phase 2: Syntax Analysis (Parsing)
The parser takes the token stream and verifies that tokens appear in valid sequences according to SQL grammar rules. It builds a parse tree (or abstract syntax tree) representing the query's hierarchical structure.
This separation mirrors how you read English: first recognizing words, then understanding how they combine into sentences following grammatical rules.
Why Two Phases?
Simplicity — Each phase handles one level of complexity. The lexer doesn't worry about grammar; the parser doesn't worry about character-level details.
Efficiency — Lexers use simple finite automata; parsers use more complex pushdown automata. Separation allows optimal algorithms for each task.
Modularity — Lexer and parser can be developed, tested, and maintained independently. New keywords can be added to the lexer without parser changes.
Error Handling — Errors can be categorized as lexical ("invalid character") or syntactic ("unexpected token"), enabling clearer error messages.
The lexer scans the input character-by-character, recognizing patterns that form valid tokens. Each token has a type (category) and a value (the actual text or converted value).
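For example, the statement from the opening of this page lexes into the following stream of (type, value) pairs (the token type names here are illustrative; each database uses its own naming):

```
Input:  SELECT name FROM users WHERE id = 42

KEYWORD(SELECT)   IDENTIFIER(name)   KEYWORD(FROM)   IDENTIFIER(users)
KEYWORD(WHERE)    IDENTIFIER(id)     OPERATOR(=)     INTEGER_LITERAL(42)
```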
Token Categories in SQL:
| Token Type | Examples | Description |
|---|---|---|
| Keywords | SELECT, FROM, WHERE, INSERT, JOIN | Reserved words with special meaning in SQL |
| Identifiers | users, order_date, CustomerName | Names of tables, columns, aliases, functions |
| Literals | 42, 'hello', 3.14159, TRUE | Constant values (numeric, string, boolean, null) |
| Operators | =, <>, >=, +, -, AND, OR | Comparison, arithmetic, and logical operators |
| Delimiters | `,` `;` `(` `)` `.` | Punctuation separating or grouping elements |
| Whitespace | Spaces, tabs, newlines | Usually discarded; separates other tokens |
| Comments | -- comment, /* block */ | Documentation; usually discarded or extracted for hints |
Consider how the lexer would break this query into tokens:

```sql
SELECT customer_name, SUM(order_total)
FROM orders o
INNER JOIN customers c ON o.cust_id = c.id
WHERE order_date >= '2024-01-01' AND status = 'completed'
GROUP BY customer_name
HAVING SUM(order_total) > 1000.00;
```

Lexer Implementation Concepts:
Lexers are typically implemented as finite automata (state machines) that transition between states based on input characters:
Modern databases often use lexer generators (like Flex, RE2C) that compile regular expression patterns into efficient state machines.
```
// Simplified SQL lexer pseudocode

class SQLLexer:
    input: String          // SQL text to tokenize
    position: Integer      // Current position in input
    tokens: List<Token>    // Accumulated tokens

    function tokenize() -> List<Token>:
        while not at_end():
            skip_whitespace()
            if at_end():
                break

            char = peek()

            // Comment checks must come before the operator check,
            // since '-' and '/' also start operators
            if char is '-' and next is '-':
                skip_line_comment()
            else if char is '/' and next is '*':
                skip_block_comment()
            else if char is letter or underscore:
                read_identifier_or_keyword()
            else if char is digit:
                read_number()
            else if char is single_quote:
                read_string_literal()
            else if char is double_quote:
                read_quoted_identifier()
            else if char is operator_start:
                read_operator()
            else if char is delimiter:
                emit_single_char_token()
            else:
                raise LexicalError("Unexpected character: " + char)

        emit(Token(EOF, ""))
        return tokens

    function read_identifier_or_keyword():
        start = position
        while is_alphanumeric(peek()) or peek() == '_':
            advance()
        text = input[start..position]
        // Check if it's a reserved keyword
        if text.upper() in KEYWORDS:
            emit(Token(KEYWORD, text.upper()))
        else:
            emit(Token(IDENTIFIER, text))

    function read_number():
        start = position
        while is_digit(peek()):
            advance()
        if peek() == '.' and is_digit(peek_next()):
            advance()  // consume '.'
            while is_digit(peek()):
                advance()
            emit(Token(FLOAT_LITERAL, input[start..position]))
        else:
            emit(Token(INTEGER_LITERAL, input[start..position]))
```

Some words might be both keywords and valid identifiers. For example, USER is a keyword in some SQL dialects but might also be a table name. Databases handle this through context-sensitivity or by requiring quoted identifiers ("USER") for reserved word identifiers. PostgreSQL has about 100 reserved keywords; MySQL has over 200.
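The same logic can be sketched as runnable code. Below is a minimal Python tokenizer for a tiny SQL subset; it uses one master regular expression in place of a hand-coded state machine (the compiled regex plays the role of the automaton). Token and keyword names are illustrative, not any particular database's:

```python
import re

# Minimal illustrative SQL tokenizer; real lexers are usually
# generated state machines (e.g. via Flex or RE2C).
KEYWORDS = {"SELECT", "FROM", "WHERE", "AND", "OR", "GROUP", "BY", "HAVING"}

TOKEN_SPEC = [
    ("WHITESPACE", r"\s+"),                     # discarded
    ("COMMENT",    r"--[^\n]*"),                # line comment, discarded
    ("FLOAT",      r"\d+\.\d+"),                # must precede INTEGER
    ("INTEGER",    r"\d+"),
    ("STRING",     r"'(?:[^']|'')*'"),          # '' escapes a quote
    ("WORD",       r"[A-Za-z_][A-Za-z0-9_]*"),  # keyword or identifier
    ("OPERATOR",   r"<>|>=|<=|[=<>+\-*/]"),     # multi-char ops first
    ("DELIMITER",  r"[(),;.]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(sql):
    tokens, pos = [], 0
    while pos < len(sql):
        m = MASTER.match(sql, pos)
        if not m:
            raise ValueError(f"Unexpected character {sql[pos]!r} at {pos}")
        kind, text = m.lastgroup, m.group()
        if kind == "WORD":  # decide keyword vs identifier after matching
            kind = "KEYWORD" if text.upper() in KEYWORDS else "IDENTIFIER"
        if kind not in ("WHITESPACE", "COMMENT"):
            tokens.append((kind, text))
        pos = m.end()
    tokens.append(("EOF", ""))
    return tokens

print(tokenize("SELECT name FROM users WHERE id = 42"))
# [('KEYWORD', 'SELECT'), ('IDENTIFIER', 'name'), ('KEYWORD', 'FROM'),
#  ('IDENTIFIER', 'users'), ('KEYWORD', 'WHERE'), ('IDENTIFIER', 'id'),
#  ('OPERATOR', '='), ('INTEGER', '42'), ('EOF', '')]
```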
With tokens identified, the parser's job is to verify they appear in valid sequences according to SQL's grammar rules. Grammar defines the legal structure of SQL statements.
What is a Grammar?
A grammar is a formal specification of a language's syntax using production rules. Each rule shows how complex constructs decompose into simpler ones.
SQL grammars are typically expressed in BNF (Backus-Naur Form) or EBNF (Extended BNF) notation, borrowed from programming language specification.
```ebnf
(* Simplified SQL SELECT grammar in EBNF notation *)

select_statement =
    SELECT [DISTINCT | ALL] select_list
    FROM table_reference_list
    [WHERE condition]
    [GROUP BY expression_list [HAVING condition]]
    [ORDER BY order_by_list]
    [LIMIT limit_specification] ;

select_list =
    '*'
  | select_item { ',' select_item } ;

select_item =
    expression [AS alias] ;

table_reference_list =
    table_reference { ',' table_reference } ;

table_reference =
    table_name [alias]
  | table_reference join_type table_reference ON condition
  | '(' select_statement ')' [alias] ;

join_type =
    [INNER] JOIN
  | LEFT [OUTER] JOIN
  | RIGHT [OUTER] JOIN
  | FULL [OUTER] JOIN
  | CROSS JOIN ;

condition =
    expression comparison_op expression
  | expression IN '(' value_list ')'
  | expression BETWEEN expression AND expression
  | expression IS [NOT] NULL
  | expression LIKE pattern
  | condition AND condition
  | condition OR condition
  | NOT condition
  | '(' condition ')' ;

expression =
    literal
  | column_reference
  | function_call
  | expression arith_op expression
  | '(' expression ')'
  | '(' select_statement ')' ;

(* ... and hundreds more rules for complete SQL support *)
```

Understanding Grammar Notation:
| Notation | Meaning | Example |
|---|---|---|
| `=` | is defined as | `stmt = SELECT ...` |
| `\|` | or (alternatives) | `DISTINCT \| ALL` |
| `[ ]` | optional | `[WHERE condition]` |
| `{ }` | zero or more | `{ ',' column }` |
| `( )` | grouping | `(INNER \| OUTER)` |
| `'x'` | literal token | `','` or `'SELECT'` |
| `name` | rule reference | `expression` |
The grammar forms a tree structure: complex constructs (like select_statement) are built from simpler constructs (like expression), ultimately grounding in terminal symbols (tokens).
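For example, using the simplified grammar above, the fragment SELECT name FROM users derives from select_statement by repeatedly expanding rules until only tokens remain (optional clauses omitted at each step):

```
select_statement
  → SELECT select_list FROM table_reference_list
  → SELECT select_item FROM table_reference
  → SELECT expression FROM table_name
  → SELECT column_reference FROM table_name
  → SELECT name FROM users
```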
The full SQL grammar is massive. The SQL:2016 standard specification runs over 1,700 pages. PostgreSQL's grammar file (gram.y) contains over 15,000 lines. MySQL's grammar is similarly complex. Modern SQL supports not just queries but DDL, DML, transactions, security, procedural extensions, and more.
As the parser validates the token stream against grammar rules, it simultaneously constructs a parse tree (also called concrete syntax tree or CST). This tree explicitly represents the hierarchical structure of the SQL statement.
Parse Tree Properties:
Root node — represents the entire statement (select_statement)

Interior nodes — represent grammar rules, i.e. nonterminals (expression, condition)

Leaf nodes — are the tokens themselves, i.e. terminals (SELECT, 42, users)

Consider this simple query:

```sql
SELECT name, age
FROM users
WHERE age > 21;
```

Parse Tree vs. Abstract Syntax Tree (AST):
Many databases simplify the parse tree into an Abstract Syntax Tree (AST) during or after parsing:
| Aspect | Parse Tree (CST) | Abstract Syntax Tree (AST) |
|---|---|---|
| Detail Level | Preserves all grammar structure | Simplifies to essential structure |
| Punctuation | Includes delimiters, keywords | Omits syntactic sugar |
| Size | Larger, more nodes | Smaller, more compact |
| Purpose | Validates syntax | Represents semantic structure |
For example, parentheses (a + b) appear in the parse tree but not the AST—the AST captures precedence through structure, not explicit parentheses.
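A small sketch makes this concrete. Assuming minimal hypothetical node classes, a + b * c and (a + b) * c produce differently shaped trees, and the shape alone encodes evaluation order:

```python
from dataclasses import dataclass

# Hypothetical minimal AST node types, for illustration only.
@dataclass
class Column:
    name: str

@dataclass
class BinOp:
    op: str
    left: object
    right: object

# a + b * c  — precedence pushes the '*' deeper into the tree
expr1 = BinOp("+", Column("a"), BinOp("*", Column("b"), Column("c")))

# (a + b) * c — no parenthesis node; the structure itself carries the grouping
expr2 = BinOp("*", BinOp("+", Column("a"), Column("b")), Column("c"))
```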
```c
/* PostgreSQL-style parse tree node structures */

/* Base node type - all nodes start with this */
typedef struct Node {
    NodeTag type;           /* Discriminator for node type */
} Node;

/* SELECT statement node */
typedef struct SelectStmt {
    NodeTag type;           /* T_SelectStmt */
    List *distinctClause;   /* NULL, DISTINCT, or DISTINCT ON */
    List *targetList;       /* Result columns (ResTarget nodes) */
    List *fromClause;       /* FROM clause (table refs/joins) */
    Node *whereClause;      /* WHERE qualification */
    List *groupClause;      /* GROUP BY columns */
    Node *havingClause;     /* HAVING predicate */
    List *sortClause;       /* ORDER BY columns */
    Node *limitCount;       /* LIMIT expression */
    Node *limitOffset;      /* OFFSET expression */
    /* ... additional fields ... */
} SelectStmt;

/* Result target (select list item) */
typedef struct ResTarget {
    NodeTag type;           /* T_ResTarget */
    char *name;             /* Column alias (AS name) */
    List *indirection;      /* Subscripts/field selections */
    Node *val;              /* Expression to compute value */
    int location;           /* Token position for errors */
} ResTarget;

/* Column reference (e.g., "table.column") */
typedef struct ColumnRef {
    NodeTag type;           /* T_ColumnRef */
    List *fields;           /* List of identifiers (schema.table.column) */
    int location;           /* Token position for errors */
} ColumnRef;

/* Simple expression (operator expression) */
typedef struct A_Expr {
    NodeTag type;           /* T_A_Expr */
    A_Expr_Kind kind;       /* AND, OR, NOT, OP, etc. */
    List *name;             /* Operator name (String list) */
    Node *lexpr;            /* Left operand */
    Node *rexpr;            /* Right operand */
    int location;           /* Token position for errors */
} A_Expr;

/* Constant value */
typedef struct A_Const {
    NodeTag type;           /* T_A_Const */
    Value val;              /* The value (integer, float, string) */
    int location;           /* Token position for errors */
} A_Const;
```

Many databases let you view parse trees for debugging. In PostgreSQL, use SET debug_print_parse = on; before a query. MySQL has EXPLAIN FORMAT=TREE. These help understand how the parser interprets complex queries.
Different parsing algorithms offer various trade-offs between power (what grammars they can handle) and efficiency. Most database parsers use one of these approaches:
Recursive Descent (LL) Parsers:
Top-down approach where each grammar rule becomes a function. Easy to implement manually, good error messages, but cannot handle left-recursive grammars directly.
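A minimal sketch of the idea in Python, for the toy rule select_stmt = "SELECT" ident { "," ident } "FROM" ident; each grammar rule becomes one function, and the call stack mirrors the parse tree (tokens are plain strings here for brevity):

```python
class ParseError(Exception):
    pass

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, tok):
        if self.peek() != tok:
            raise ParseError(f"expected {tok!r}, found {self.peek()!r}")
        self.pos += 1

    def ident(self):
        tok = self.peek()
        if tok is None or tok.upper() in ("SELECT", "FROM", ","):
            raise ParseError(f"expected identifier, found {tok!r}")
        self.pos += 1
        return tok

    def select_stmt(self):               # one function per grammar rule
        self.expect("SELECT")
        columns = [self.ident()]
        while self.peek() == ",":        # the { ',' ident } repetition
            self.pos += 1
            columns.append(self.ident())
        self.expect("FROM")
        return {"columns": columns, "from": self.ident()}

print(Parser(["SELECT", "name", ",", "age", "FROM", "users"]).select_stmt())
# {'columns': ['name', 'age'], 'from': 'users'}
```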
LR Parsers (LALR, SLR, LR(1)):
Bottom-up approach using a parsing table generated from the grammar. More powerful than LL parsers, can handle more complex grammars, but harder to implement manually. Most database parsers use this approach via parser generators.
Pratt Parsers (Precedence Climbing):
Specialized for expression parsing with operator precedence. Often used alongside other techniques for parsing SQL expressions.
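A compact sketch of precedence climbing, assuming tokens are already split into a list: a binding-power table decides whether the loop keeps consuming operators or returns to a caller holding a lower-precedence operator:

```python
# Minimal precedence-climbing expression parser (Pratt style).
# Higher number = binds tighter; all operators here are left-associative.
PRECEDENCE = {"OR": 1, "AND": 2, "=": 3, "<": 3, ">": 3,
              "+": 4, "-": 4, "*": 5, "/": 5}

def parse_expr(tokens, pos=0, min_prec=1):
    """Parse tokens[pos:] into a nested-tuple AST; returns (ast, next_pos)."""
    left, pos = tokens[pos], pos + 1      # operand (no parens in this sketch)
    while pos < len(tokens):
        op = tokens[pos]
        prec = PRECEDENCE.get(op)
        if prec is None or prec < min_prec:
            break
        # Left associativity: the recursive call only accepts tighter operators.
        right, pos = parse_expr(tokens, pos + 1, prec + 1)
        left = (op, left, right)
    return left, pos

ast, _ = parse_expr(["a", "+", "b", "*", "c", ">", "d"])
print(ast)   # ('>', ('+', 'a', ('*', 'b', 'c')), 'd')
```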
| Algorithm | Direction | Grammar Power | Common Usage |
|---|---|---|---|
| Recursive Descent | Top-down (LL) | Limited | Hand-written parsers, simple languages |
| LL(1) | Top-down | Limited | Many programming languages |
| SLR | Bottom-up | Moderate | Simple languages |
| LALR(1) | Bottom-up | High | Most databases (via Bison/Yacc) |
| LR(1) | Bottom-up | Highest | When LALR insufficient |
| GLR | Bottom-up | Ambiguous grammars | Complex/ambiguous languages |
Parser Generators:
Most production databases use parser generators rather than hand-written parsers. The grammar is written in a domain-specific language, and the generator produces parser source code. PostgreSQL and MySQL generate their parsers with Bison (the GNU descendant of Yacc); SQLite uses its own generator, Lemon.
Example: How LALR Parsing Works
LALR parsers use a parsing table and a stack to process tokens:
```
Parsing: SELECT id FROM t WHERE x = 5

Stack                           Input                  Action
------------------------------------------------------------------
[0]                             SELECT id FROM t...    SHIFT (push SELECT, goto 3)
[0 SELECT 3]                    id FROM t WHERE...     SHIFT (push id, goto 7)
[0 SELECT 3 id 7]               FROM t WHERE x = 5     REDUCE (id -> column_ref)
[0 SELECT 3 column_ref 12]      FROM t WHERE x = 5     REDUCE (column_ref -> select_item)
[0 SELECT 3 select_item 15]     FROM t WHERE x = 5     REDUCE (select_item -> select_list)
[0 SELECT 3 select_list 20]     FROM t WHERE x = 5     SHIFT (push FROM, goto 25)
[... 20 FROM 25]                t WHERE x = 5 ;        SHIFT (push t, goto 30)
[... FROM 25 t 30]              WHERE x = 5 ;          REDUCE (t -> table_name)
[... FROM 25 table_name 33]     WHERE x = 5 ;          REDUCE (table_name -> table_ref)
[... FROM 25 table_ref 35]      WHERE x = 5 ;          SHIFT (push WHERE, goto 40)
...
[0 select_stmt 99]              ;                      SHIFT (push ;, goto 100)
[0 select_stmt 99 ; 100]        <EOF>                  REDUCE (complete stmt)
[0 statement 1]                 <EOF>                  ACCEPT

Parsing successful! Parse tree constructed.
```

When parsing fails, the database must detect the error, generate a helpful error message, and (optionally) attempt recovery to find additional errors.
Error Detection:
Syntax errors occur when the parser encounters a token that cannot legally appear in the current context. The parser's decision table has no valid action for (current state, current token).
Error Reporting:
Good error messages are crucial for developer productivity. Quality error reporting includes the location of the error (line number and position), the offending token, and, where possible, a hint about what the parser expected:
```sql
-- Query with typo: FORM instead of FROM
postgres=# SELECT id FORM users;
ERROR: syntax error at or near "users"
LINE 1: SELECT id FORM users;
                       ^

-- Missing closing parenthesis
postgres=# SELECT COUNT(* FROM orders;
ERROR: syntax error at or near "FROM"
LINE 1: SELECT COUNT(* FROM orders;
                       ^

-- Invalid keyword order
postgres=# SELECT * WHERE id = 1 FROM users;
ERROR: syntax error at or near "WHERE"
LINE 1: SELECT * WHERE id = 1 FROM users;
                 ^
HINT: You may need a FROM clause before the WHERE clause.

-- Unterminated string literal
postgres=# SELECT * FROM users WHERE name = 'John;
ERROR: unterminated quoted string at or near "'John;"
LINE 1: SELECT * FROM users WHERE name = 'John;
                                         ^
```

Error Recovery Strategies:
Some parsers attempt to recover from errors to report multiple issues in a single parse:
Panic Mode — Skip tokens until a synchronization point (like ; or FROM) is reached, then resume parsing
Phrase-Level Recovery — Insert or delete tokens to match an expected pattern
Error Productions — Add grammar rules that explicitly match common errors
Database parsers typically use simple panic mode or report only the first error, as interactive use patterns mean users fix and retry immediately.
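As a sketch of panic mode for a token-list parser (the parse_statement callback and its contract are hypothetical, not any specific database's API):

```python
# Illustrative panic-mode recovery: on a syntax error, discard tokens
# until a synchronization point, then resume so later statements are
# still checked in the same pass.
SYNC_TOKENS = {";"}   # statement boundary: a natural synchronization point

def parse_all(tokens, parse_statement):
    """parse_statement(tokens, pos) must return (ast, next_pos), consuming
    at least one token, or raise SyntaxError on invalid input."""
    asts, errors, pos = [], [], 0
    while pos < len(tokens):
        try:
            ast, pos = parse_statement(tokens, pos)
            asts.append(ast)
        except SyntaxError as err:
            errors.append(f"at token {pos}: {err}")
            # Panic mode: skip ahead to the next ';' and consume it,
            # then resume parsing from the following statement.
            while pos < len(tokens) and tokens[pos] not in SYNC_TOKENS:
                pos += 1
            pos += 1
    return asts, errors
```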
The most frequent SQL syntax errors include: missing or misplaced commas, unbalanced parentheses, using reserved words as identifiers without quoting, incorrect keyword order (WHERE before FROM), missing table aliases in self-joins, and single quotes around identifiers instead of double quotes.
SQL's handling of identifier case is a common source of confusion and portability issues. The SQL standard specifies one behavior, but databases implement varying interpretations.
SQL Standard Behavior:
"TableName") are case-sensitive and preserve exact caseReality: Database-Specific Behavior:
| Database | Unquoted Identifiers | Quoted Identifiers | Quote Character |
|---|---|---|---|
| PostgreSQL | Folded to lowercase | Case-sensitive | "DoubleQuotes" |
| MySQL (default) | Case varies by OS | Case-sensitive | `Backticks` |
| SQL Server | Case-insensitive* | Case-sensitive | [Brackets] or "Quotes" |
| Oracle | Folded to uppercase | Case-sensitive | "DoubleQuotes" |
| SQLite | Case-insensitive | Case-sensitive | "Quotes" or `Backticks` |

*SQL Server's case sensitivity depends on the database collation; the common default collations are case-insensitive.
```sql
-- PostgreSQL folds unquoted identifiers to lowercase

CREATE TABLE MyTable (Id INT, Name TEXT);
-- Actually creates: mytable(id, name)

SELECT * FROM MyTable;   -- Works (MyTable -> mytable)
SELECT * FROM MYTABLE;   -- Works (MYTABLE -> mytable)
SELECT * FROM mytable;   -- Works (already lowercase)

-- To preserve case, use double quotes
CREATE TABLE "MyTable" ("Id" INT, "Name" TEXT);
-- Creates exactly: MyTable(Id, Name)

SELECT * FROM "MyTable";   -- Works
SELECT * FROM MyTable;     -- ERROR: relation "mytable" does not exist
SELECT * FROM "mytable";   -- ERROR: relation "mytable" does not exist

-- Common gotcha: mixed quoted/unquoted access
CREATE TABLE "Users" (id INT);
SELECT * FROM Users;     -- Looks for 'users', not found!
SELECT * FROM "Users";   -- Correct
```

If your application must work across multiple database systems, use a consistent convention: either always use lowercase unquoted identifiers, or always use quoted identifiers with consistent casing. Never mix approaches, as it creates subtle bugs when migrating between databases.
While parsing is typically fast compared to optimization and execution, it can become a bottleneck in high-throughput systems executing thousands of queries per second.
Parsing Overhead:
For a typical OLTP query, parsing might take tens of microseconds for a simple statement and hundreds of microseconds for a statement with many joins or a long IN list. That sounds negligible, but it compounds: when executing 10,000 queries per second, even 100 microseconds of parsing overhead consumes an entire CPU core (10,000 × 100 μs = 1 second of CPU time per second). This is a key reason prepared statements exist.
```sql
-- PostgreSQL: Viewing parse time vs execution time

-- Enable timing
\timing on

-- First execution: includes parsing
SELECT * FROM users WHERE id = 42;
-- Time: 2.345 ms (includes parse + plan + execute)

-- Enable detailed timing breakdown
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM users WHERE id = 42;
/*
Index Scan using users_pkey on users
  Index Cond: (id = 42)
  Buffers: shared hit=3
Planning Time: 0.089 ms   <-- Parse + Plan time
Execution Time: 0.031 ms  <-- Actual execution
*/

-- With prepared statement (parsing happens once)
PREPARE get_user(int) AS
    SELECT * FROM users WHERE id = $1;

EXPLAIN (ANALYZE) EXECUTE get_user(42);
/*
Planning Time: 0.012 ms   <-- Much faster (no parse, minimal plan)
Execution Time: 0.029 ms
*/

EXPLAIN (ANALYZE) EXECUTE get_user(100);
/*
Planning Time: 0.008 ms   <-- Even faster (plan caching kicks in)
Execution Time: 0.025 ms
*/
```

Parser Limits:
Databases impose limits on query complexity to prevent parser resource exhaustion:
| Limit Type | PostgreSQL | MySQL | Purpose |
|---|---|---|---|
| Max query length | 1 GB | 64 MB default | Prevent memory exhaustion |
| Max expression depth | ~thousands | ~1000 | Prevent stack overflow |
| Max table references | ~thousands | 61 per statement | Prevent exponential planning |
| Max IN list items | ~millions | ~thousands | Prevent memory exhaustion |
| Max parameters | 65535 | 65535 | Protocol limitation |
We've explored the parsing phase in comprehensive detail. Let's consolidate the key concepts:

Two phases — lexical analysis turns characters into tokens; syntax analysis validates token sequences against the grammar and builds a tree

Grammar — SQL syntax is specified with BNF/EBNF production rules, and the full grammar is enormous

Trees — the parser produces a parse tree, usually simplified into an abstract syntax tree (AST) for later phases

Algorithms — most databases use LALR parsers produced by generators such as Bison/Yacc rather than hand-written parsers

Errors — syntax errors are reported with position and context; most databases stop at the first error rather than attempting deep recovery

Identifiers — case-folding rules differ across databases and are a common portability trap

Performance — parsing cost is small per query but significant at high throughput; prepared statements amortize it
What's Next:
With the parse tree constructed, the query processing pipeline advances to the translation phase. The next page examines how parse trees are transformed into relational algebra representations—the logical query plan. This translation bridges SQL's declarative syntax with the formal mathematical foundations of relational databases.
You now understand how SQL text is transformed into structured parse trees through lexical and syntax analysis. This foundation prepares you for understanding semantic analysis and relational algebra translation in the next phase.