Once query text arrives at the database server, it faces a fundamental challenge: computers don't understand human languages. The SQL statement `SELECT name FROM users WHERE id = 42` is just a sequence of characters, bytes without inherent meaning. The database system must extract the structure and intent from this text.
This is the job of the parser—a sophisticated component that transforms raw SQL text into a structured representation the rest of the query processing pipeline can work with. Parsing is one of the most thoroughly studied areas in computer science, with decades of theory and practice informing modern database parsers.
By the end of this page, you will understand how SQL text is tokenized into lexemes, how grammar rules define valid SQL syntax, how parsers construct parse trees representing query structure, and how syntax errors are detected and reported. You'll gain insight into the compiler construction techniques that power every database system.
Most database parsers employ a two-phase architecture borrowed from compiler design. This separation of concerns simplifies implementation and enables independent optimization of each phase.
Phase 1: Lexical Analysis (Lexing/Tokenizing)
The lexer (also called tokenizer or scanner) reads the character stream and groups characters into meaningful units called tokens (or lexemes). Think of this as identifying individual words and punctuation in a sentence.
Phase 2: Syntax Analysis (Parsing)
The parser takes the token stream and verifies that tokens appear in valid sequences according to SQL grammar rules. It builds a parse tree (or abstract syntax tree) representing the query's hierarchical structure.
This separation mirrors how you read English: first recognizing words, then understanding how they combine into sentences following grammatical rules.
Why Two Phases?
Simplicity — Each phase handles one level of complexity. The lexer doesn't worry about grammar; the parser doesn't worry about character-level details.
Efficiency — Lexers use simple finite automata; parsers use more complex pushdown automata. Separation allows optimal algorithms for each task.
Modularity — Lexer and parser can be developed, tested, and maintained independently. New keywords can be added to the lexer without parser changes.
Error Handling — Errors can be categorized as lexical ("invalid character") or syntactic ("unexpected token"), enabling clearer error messages.
The lexer scans the input character-by-character, recognizing patterns that form valid tokens. Each token has a type (category) and a value (the actual text or converted value).
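For example, the statement from the opening of this page lexes into the following stream of (type, value) pairs (the token type names here are illustrative; each database uses its own naming):

```
Input:  SELECT name FROM users WHERE id = 42

KEYWORD(SELECT)   IDENTIFIER(name)   KEYWORD(FROM)   IDENTIFIER(users)
KEYWORD(WHERE)    IDENTIFIER(id)     OPERATOR(=)     INTEGER_LITERAL(42)
```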
Token Categories in SQL:
| Token Type | Examples | Description |
|---|---|---|
| Keywords | SELECT, FROM, WHERE, INSERT, JOIN | Reserved words with special meaning in SQL |
| Identifiers | users, order_date, CustomerName | Names of tables, columns, aliases, functions |
| Literals | 42, 'hello', 3.14159, TRUE | Constant values (numeric, string, boolean, null) |
| Operators | =, <>, >=, +, -, AND, OR | Comparison, arithmetic, and logical operators |
| Delimiters | `,` `;` `(` `)` `.` | Punctuation separating or grouping elements |
| Whitespace | Spaces, tabs, newlines | Usually discarded; separates other tokens |
| Comments | -- comment, /* block */ | Documentation; usually discarded or extracted for hints |
Consider how the lexer would break this query into tokens:

```sql
SELECT customer_name, SUM(order_total)
FROM orders o
INNER JOIN customers c ON o.cust_id = c.id
WHERE order_date >= '2024-01-01' AND status = 'completed'
GROUP BY customer_name
HAVING SUM(order_total) > 1000.00;
```

Lexer Implementation Concepts:
Lexers are typically implemented as finite automata (state machines) that transition between states based on input characters:
Modern databases often use lexer generators (like Flex, RE2C) that compile regular expression patterns into efficient state machines.
```
// Simplified SQL lexer pseudocode

class SQLLexer:
    input: String          // SQL text to tokenize
    position: Integer      // Current position in input
    tokens: List<Token>    // Accumulated tokens

    function tokenize() -> List<Token>:
        while not at_end():
            skip_whitespace()
            if at_end():
                break

            char = peek()

            // Comment checks must come before the operator check,
            // since '-' and '/' also start operators
            if char is '-' and next is '-':
                skip_line_comment()
            else if char is '/' and next is '*':
                skip_block_comment()
            else if char is letter or underscore:
                read_identifier_or_keyword()
            else if char is digit:
                read_number()
            else if char is single_quote:
                read_string_literal()
            else if char is double_quote:
                read_quoted_identifier()
            else if char is operator_start:
                read_operator()
            else if char is delimiter:
                emit_single_char_token()
            else:
                raise LexicalError("Unexpected character: " + char)

        emit(Token(EOF, ""))
        return tokens

    function read_identifier_or_keyword():
        start = position
        while is_alphanumeric(peek()) or peek() == '_':
            advance()
        text = input[start..position]
        // Check if it's a reserved keyword
        if text.upper() in KEYWORDS:
            emit(Token(KEYWORD, text.upper()))
        else:
            emit(Token(IDENTIFIER, text))

    function read_number():
        start = position
        while is_digit(peek()):
            advance()
        if peek() == '.' and is_digit(peek_next()):
            advance()  // consume '.'
            while is_digit(peek()):
                advance()
            emit(Token(FLOAT_LITERAL, input[start..position]))
        else:
            emit(Token(INTEGER_LITERAL, input[start..position]))
```

Some words might be both keywords and valid identifiers. For example, USER is a keyword in some SQL dialects but might also be a table name. Databases handle this through context-sensitivity or by requiring quoted identifiers ("USER") for reserved word identifiers. PostgreSQL has about 100 reserved keywords; MySQL has over 200.
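The same logic can be sketched as runnable code. Below is a minimal Python tokenizer for a tiny SQL subset; it uses one master regular expression in place of a hand-coded state machine (the compiled regex plays the role of the automaton). Token and keyword names are illustrative, not any particular database's:

```python
import re

# Minimal illustrative SQL tokenizer; real lexers are usually
# generated state machines (e.g. via Flex or RE2C).
KEYWORDS = {"SELECT", "FROM", "WHERE", "AND", "OR", "GROUP", "BY", "HAVING"}

TOKEN_SPEC = [
    ("WHITESPACE", r"\s+"),                     # discarded
    ("COMMENT",    r"--[^\n]*"),                # line comment, discarded
    ("FLOAT",      r"\d+\.\d+"),                # must precede INTEGER
    ("INTEGER",    r"\d+"),
    ("STRING",     r"'(?:[^']|'')*'"),          # '' escapes a quote
    ("WORD",       r"[A-Za-z_][A-Za-z0-9_]*"),  # keyword or identifier
    ("OPERATOR",   r"<>|>=|<=|[=<>+\-*/]"),     # multi-char ops first
    ("DELIMITER",  r"[(),;.]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(sql):
    tokens, pos = [], 0
    while pos < len(sql):
        m = MASTER.match(sql, pos)
        if not m:
            raise ValueError(f"Unexpected character {sql[pos]!r} at {pos}")
        kind, text = m.lastgroup, m.group()
        if kind == "WORD":  # decide keyword vs identifier after matching
            kind = "KEYWORD" if text.upper() in KEYWORDS else "IDENTIFIER"
        if kind not in ("WHITESPACE", "COMMENT"):
            tokens.append((kind, text))
        pos = m.end()
    tokens.append(("EOF", ""))
    return tokens

print(tokenize("SELECT name FROM users WHERE id = 42"))
# [('KEYWORD', 'SELECT'), ('IDENTIFIER', 'name'), ('KEYWORD', 'FROM'),
#  ('IDENTIFIER', 'users'), ('KEYWORD', 'WHERE'), ('IDENTIFIER', 'id'),
#  ('OPERATOR', '='), ('INTEGER', '42'), ('EOF', '')]
```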
With tokens identified, the parser's job is to verify they appear in valid sequences according to SQL's grammar rules. Grammar defines the legal structure of SQL statements.
What is a Grammar?
A grammar is a formal specification of a language's syntax using production rules. Each rule shows how complex constructs decompose into simpler ones.
SQL grammars are typically expressed in BNF (Backus-Naur Form) or EBNF (Extended BNF) notation, borrowed from programming language specification.
```ebnf
(* Simplified SQL SELECT grammar in EBNF notation *)

select_statement =
    SELECT [DISTINCT | ALL] select_list
    FROM table_reference_list
    [WHERE condition]
    [GROUP BY expression_list [HAVING condition]]
    [ORDER BY order_by_list]
    [LIMIT limit_specification] ;

select_list =
    '*'
  | select_item { ',' select_item } ;

select_item =
    expression [AS alias] ;

table_reference_list =
    table_reference { ',' table_reference } ;

table_reference =
    table_name [alias]
  | table_reference join_type table_reference ON condition
  | '(' select_statement ')' [alias] ;

join_type =
    [INNER] JOIN
  | LEFT [OUTER] JOIN
  | RIGHT [OUTER] JOIN
  | FULL [OUTER] JOIN
  | CROSS JOIN ;

condition =
    expression comparison_op expression
  | expression IN '(' value_list ')'
  | expression BETWEEN expression AND expression
  | expression IS [NOT] NULL
  | expression LIKE pattern
  | condition AND condition
  | condition OR condition
  | NOT condition
  | '(' condition ')' ;

expression =
    literal
  | column_reference
  | function_call
  | expression arith_op expression
  | '(' expression ')'
  | '(' select_statement ')' ;

(* ... and hundreds more rules for complete SQL support *)
```

Understanding Grammar Notation:
| Notation | Meaning | Example |
|---|---|---|
| `=` | is defined as | `stmt = SELECT ...` |
| `\|` | or (alternatives) | `DISTINCT \| ALL` |
| `[ ]` | optional | `[WHERE condition]` |
| `{ }` | zero or more | `{ ',' column }` |
| `( )` | grouping | `(INNER \| OUTER)` |
| `'x'` | literal token | `','` or `'SELECT'` |
| `name` | rule reference | `expression` |
The grammar forms a tree structure: complex constructs (like select_statement) are built from simpler constructs (like expression), ultimately grounding in terminal symbols (tokens).
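For example, using the simplified grammar above, the fragment SELECT name FROM users derives from select_statement by repeatedly expanding rules until only tokens remain (optional clauses omitted at each step):

```
select_statement
  → SELECT select_list FROM table_reference_list
  → SELECT select_item FROM table_reference
  → SELECT expression FROM table_name
  → SELECT column_reference FROM table_name
  → SELECT name FROM users
```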
The full SQL grammar is massive. The SQL:2016 standard specification runs over 1,700 pages. PostgreSQL's grammar file (gram.y) contains over 15,000 lines. MySQL's grammar is similarly complex. Modern SQL supports not just queries but DDL, DML, transactions, security, procedural extensions, and more.
As the parser validates the token stream against grammar rules, it simultaneously constructs a parse tree (also called concrete syntax tree or CST). This tree explicitly represents the hierarchical structure of the SQL statement.
Parse Tree Properties:
Root node — represents the entire statement (select_statement)

Interior nodes — represent grammar rules, i.e. nonterminals (expression, condition)

Leaf nodes — are the tokens themselves, i.e. terminals (SELECT, 42, users)

Consider this simple query:

```sql
SELECT name, age
FROM users
WHERE age > 21;
```

Parse Tree vs. Abstract Syntax Tree (AST):
Many databases simplify the parse tree into an Abstract Syntax Tree (AST) during or after parsing:
| Aspect | Parse Tree (CST) | Abstract Syntax Tree (AST) |
|---|---|---|
| Detail Level | Preserves all grammar structure | Simplifies to essential structure |
| Punctuation | Includes delimiters, keywords | Omits syntactic sugar |
| Size | Larger, more nodes | Smaller, more compact |
| Purpose | Validates syntax | Represents semantic structure |
For example, parentheses (a + b) appear in the parse tree but not the AST—the AST captures precedence through structure, not explicit parentheses.
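A small sketch makes this concrete. Assuming minimal hypothetical node classes, a + b * c and (a + b) * c produce differently shaped trees, and the shape alone encodes evaluation order:

```python
from dataclasses import dataclass

# Hypothetical minimal AST node types, for illustration only.
@dataclass
class Column:
    name: str

@dataclass
class BinOp:
    op: str
    left: object
    right: object

# a + b * c  — precedence pushes the '*' deeper into the tree
expr1 = BinOp("+", Column("a"), BinOp("*", Column("b"), Column("c")))

# (a + b) * c — no parenthesis node; the structure itself carries the grouping
expr2 = BinOp("*", BinOp("+", Column("a"), Column("b")), Column("c"))
```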
```c
/* PostgreSQL-style parse tree node structures */

/* Base node type - all nodes start with this */
typedef struct Node {
    NodeTag type;           /* Discriminator for node type */
} Node;

/* SELECT statement node */
typedef struct SelectStmt {
    NodeTag type;           /* T_SelectStmt */
    List *distinctClause;   /* NULL, DISTINCT, or DISTINCT ON */
    List *targetList;       /* Result columns (ResTarget nodes) */
    List *fromClause;       /* FROM clause (table refs/joins) */
    Node *whereClause;      /* WHERE qualification */
    List *groupClause;      /* GROUP BY columns */
    Node *havingClause;     /* HAVING predicate */
    List *sortClause;       /* ORDER BY columns */
    Node *limitCount;       /* LIMIT expression */
    Node *limitOffset;      /* OFFSET expression */
    /* ... additional fields ... */
} SelectStmt;

/* Result target (select list item) */
typedef struct ResTarget {
    NodeTag type;           /* T_ResTarget */
    char *name;             /* Column alias (AS name) */
    List *indirection;      /* Subscripts/field selections */
    Node *val;              /* Expression to compute value */
    int location;           /* Token position for errors */
} ResTarget;

/* Column reference (e.g., "table.column") */
typedef struct ColumnRef {
    NodeTag type;           /* T_ColumnRef */
    List *fields;           /* List of identifiers (schema.table.column) */
    int location;           /* Token position for errors */
} ColumnRef;

/* Simple expression (operator expression) */
typedef struct A_Expr {
    NodeTag type;           /* T_A_Expr */
    A_Expr_Kind kind;       /* AND, OR, NOT, OP, etc. */
    List *name;             /* Operator name (String list) */
    Node *lexpr;            /* Left operand */
    Node *rexpr;            /* Right operand */
    int location;           /* Token position for errors */
} A_Expr;

/* Constant value */
typedef struct A_Const {
    NodeTag type;           /* T_A_Const */
    Value val;              /* The value (integer, float, string) */
    int location;           /* Token position for errors */
} A_Const;
```

Many databases let you view parse trees for debugging. In PostgreSQL, use SET debug_print_parse = on; before a query. MySQL has EXPLAIN FORMAT=TREE. These help understand how the parser interprets complex queries.
Different parsing algorithms offer various trade-offs between power (what grammars they can handle) and efficiency. Most database parsers use one of these approaches:
Recursive Descent (LL) Parsers:
Top-down approach where each grammar rule becomes a function. Easy to implement manually, good error messages, but cannot handle left-recursive grammars directly.
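A minimal sketch of the idea in Python, for the toy rule select_stmt = "SELECT" ident { "," ident } "FROM" ident; each grammar rule becomes one function, and the call stack mirrors the parse tree (tokens are plain strings here for brevity):

```python
class ParseError(Exception):
    pass

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, tok):
        if self.peek() != tok:
            raise ParseError(f"expected {tok!r}, found {self.peek()!r}")
        self.pos += 1

    def ident(self):
        tok = self.peek()
        if tok is None or tok.upper() in ("SELECT", "FROM", ","):
            raise ParseError(f"expected identifier, found {tok!r}")
        self.pos += 1
        return tok

    def select_stmt(self):               # one function per grammar rule
        self.expect("SELECT")
        columns = [self.ident()]
        while self.peek() == ",":        # the { ',' ident } repetition
            self.pos += 1
            columns.append(self.ident())
        self.expect("FROM")
        return {"columns": columns, "from": self.ident()}

print(Parser(["SELECT", "name", ",", "age", "FROM", "users"]).select_stmt())
# {'columns': ['name', 'age'], 'from': 'users'}
```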
LR Parsers (LALR, SLR, LR(1)):
Bottom-up approach using a parsing table generated from the grammar. More powerful than LL parsers, can handle more complex grammars, but harder to implement manually. Most database parsers use this approach via parser generators.
Pratt Parsers (Precedence Climbing):
Specialized for expression parsing with operator precedence. Often used alongside other techniques for parsing SQL expressions.
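A compact sketch of precedence climbing, assuming tokens are already split into a list: a binding-power table decides whether the loop keeps consuming operators or returns to a caller holding a lower-precedence operator:

```python
# Minimal precedence-climbing expression parser (Pratt style).
# Higher number = binds tighter; all operators here are left-associative.
PRECEDENCE = {"OR": 1, "AND": 2, "=": 3, "<": 3, ">": 3,
              "+": 4, "-": 4, "*": 5, "/": 5}

def parse_expr(tokens, pos=0, min_prec=1):
    """Parse tokens[pos:] into a nested-tuple AST; returns (ast, next_pos)."""
    left, pos = tokens[pos], pos + 1      # operand (no parens in this sketch)
    while pos < len(tokens):
        op = tokens[pos]
        prec = PRECEDENCE.get(op)
        if prec is None or prec < min_prec:
            break
        # Left associativity: the recursive call only accepts tighter operators.
        right, pos = parse_expr(tokens, pos + 1, prec + 1)
        left = (op, left, right)
    return left, pos

ast, _ = parse_expr(["a", "+", "b", "*", "c", ">", "d"])
print(ast)   # ('>', ('+', 'a', ('*', 'b', 'c')), 'd')
```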
| Algorithm | Direction | Grammar Power | Common Usage |
|---|---|---|---|
| Recursive Descent | Top-down (LL) | Limited | Hand-written parsers, simple languages |
| LL(1) | Top-down | Limited | Many programming languages |
| SLR | Bottom-up | Moderate | Simple languages |
| LALR(1) | Bottom-up | High | Most databases (via Bison/Yacc) |
| LR(1) | Bottom-up | Highest | When LALR insufficient |
| GLR | Bottom-up | Ambiguous grammars | Complex/ambiguous languages |
Parser Generators:
Most production databases use parser generators rather than hand-written parsers. The grammar is written in a domain-specific language, and the generator produces parser source code. PostgreSQL and MySQL generate their parsers with Bison (the GNU descendant of Yacc); SQLite uses its own generator, Lemon.
Example: How LALR Parsing Works
LALR parsers use a parsing table and a stack to process tokens:
```
Parsing: SELECT id FROM t WHERE x = 5

Stack                           Input                  Action
------------------------------------------------------------------
[0]                             SELECT id FROM t...    SHIFT (push SELECT, goto 3)
[0 SELECT 3]                    id FROM t WHERE...     SHIFT (push id, goto 7)
[0 SELECT 3 id 7]               FROM t WHERE x = 5     REDUCE (id -> column_ref)
[0 SELECT 3 column_ref 12]      FROM t WHERE x = 5     REDUCE (column_ref -> select_item)
[0 SELECT 3 select_item 15]     FROM t WHERE x = 5     REDUCE (select_item -> select_list)
[0 SELECT 3 select_list 20]     FROM t WHERE x = 5     SHIFT (push FROM, goto 25)
[... 20 FROM 25]                t WHERE x = 5 ;        SHIFT (push t, goto 30)
[... FROM 25 t 30]              WHERE x = 5 ;          REDUCE (t -> table_name)
[... FROM 25 table_name 33]     WHERE x = 5 ;          REDUCE (table_name -> table_ref)
[... FROM 25 table_ref 35]      WHERE x = 5 ;          SHIFT (push WHERE, goto 40)
...
[0 select_stmt 99]              ;                      SHIFT (push ;, goto 100)
[0 select_stmt 99 ; 100]        <EOF>                  REDUCE (complete stmt)
[0 statement 1]                 <EOF>                  ACCEPT

Parsing successful! Parse tree constructed.
```

When parsing fails, the database must detect the error, generate a helpful error message, and (optionally) attempt recovery to find additional errors.
Error Detection:
Syntax errors occur when the parser encounters a token that cannot legally appear in the current context. The parser's decision table has no valid action for (current state, current token).
Error Reporting:
Good error messages are crucial for developer productivity. Quality error reporting includes the location of the error (line number and position), the offending token, and, where possible, a hint about what the parser expected:
```sql
-- Query with typo: FORM instead of FROM
postgres=# SELECT id FORM users;
ERROR: syntax error at or near "users"
LINE 1: SELECT id FORM users;
                       ^

-- Missing closing parenthesis
postgres=# SELECT COUNT(* FROM orders;
ERROR: syntax error at or near "FROM"
LINE 1: SELECT COUNT(* FROM orders;
                       ^

-- Invalid keyword order
postgres=# SELECT * WHERE id = 1 FROM users;
ERROR: syntax error at or near "WHERE"
LINE 1: SELECT * WHERE id = 1 FROM users;
                 ^
HINT: You may need a FROM clause before the WHERE clause.

-- Unterminated string literal
postgres=# SELECT * FROM users WHERE name = 'John;
ERROR: unterminated quoted string at or near "'John;"
LINE 1: SELECT * FROM users WHERE name = 'John;
                                         ^
```

Error Recovery Strategies:
Some parsers attempt to recover from errors to report multiple issues in a single parse:
Panic Mode — Skip tokens until a synchronization point (like ; or FROM) is reached, then resume parsing
Phrase-Level Recovery — Insert or delete tokens to match an expected pattern
Error Productions — Add grammar rules that explicitly match common errors
Database parsers typically use simple panic mode or report only the first error, as interactive use patterns mean users fix and retry immediately.
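As a sketch of panic mode for a token-list parser (the parse_statement callback and its contract are hypothetical, not any specific database's API):

```python
# Illustrative panic-mode recovery: on a syntax error, discard tokens
# until a synchronization point, then resume so later statements are
# still checked in the same pass.
SYNC_TOKENS = {";"}   # statement boundary: a natural synchronization point

def parse_all(tokens, parse_statement):
    """parse_statement(tokens, pos) must return (ast, next_pos), consuming
    at least one token, or raise SyntaxError on invalid input."""
    asts, errors, pos = [], [], 0
    while pos < len(tokens):
        try:
            ast, pos = parse_statement(tokens, pos)
            asts.append(ast)
        except SyntaxError as err:
            errors.append(f"at token {pos}: {err}")
            # Panic mode: skip ahead to the next ';' and consume it,
            # then resume parsing from the following statement.
            while pos < len(tokens) and tokens[pos] not in SYNC_TOKENS:
                pos += 1
            pos += 1
    return asts, errors
```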
The most frequent SQL syntax errors include: missing or misplaced commas, unbalanced parentheses, using reserved words as identifiers without quoting, incorrect keyword order (WHERE before FROM), missing table aliases in self-joins, and single quotes around identifiers instead of double quotes.
SQL's handling of identifier case is a common source of confusion and portability issues. The SQL standard specifies one behavior, but databases implement varying interpretations.
SQL Standard Behavior:
"TableName") are case-sensitive and preserve exact caseReality: Database-Specific Behavior:
| Database | Unquoted Identifiers | Quoted Identifiers | Quote Character |
|---|---|---|---|
| PostgreSQL | Folded to lowercase | Case-sensitive | "DoubleQuotes" |
| MySQL (default) | Case varies by OS | Case-sensitive | `Backticks` |
| SQL Server | Case-insensitive* | Case-sensitive | [Brackets] or "Quotes" |
| Oracle | Folded to uppercase | Case-sensitive | "DoubleQuotes" |
| SQLite | Case-insensitive | Case-sensitive | "Quotes" or `Backticks` |

*SQL Server's case sensitivity depends on the database collation; the common default collations are case-insensitive.
```sql
-- PostgreSQL folds unquoted identifiers to lowercase

CREATE TABLE MyTable (Id INT, Name TEXT);
-- Actually creates: mytable(id, name)

SELECT * FROM MyTable;   -- Works (MyTable -> mytable)
SELECT * FROM MYTABLE;   -- Works (MYTABLE -> mytable)
SELECT * FROM mytable;   -- Works (already lowercase)

-- To preserve case, use double quotes
CREATE TABLE "MyTable" ("Id" INT, "Name" TEXT);
-- Creates exactly: MyTable(Id, Name)

SELECT * FROM "MyTable";   -- Works
SELECT * FROM MyTable;     -- ERROR: relation "mytable" does not exist
SELECT * FROM "mytable";   -- ERROR: relation "mytable" does not exist

-- Common gotcha: mixed quoted/unquoted access
CREATE TABLE "Users" (id INT);
SELECT * FROM Users;     -- Looks for 'users', not found!
SELECT * FROM "Users";   -- Correct
```

If your application must work across multiple database systems, use a consistent convention: either always use lowercase unquoted identifiers, or always use quoted identifiers with consistent casing. Never mix approaches, as it creates subtle bugs when migrating between databases.
While parsing is typically fast compared to optimization and execution, it can become a bottleneck in high-throughput systems executing thousands of queries per second.
Parsing Overhead:
For a typical OLTP query, parsing might take tens of microseconds for a simple statement and hundreds of microseconds for a statement with many joins or a long IN list. That sounds negligible, but it compounds: when executing 10,000 queries per second, even 100 microseconds of parsing overhead consumes an entire CPU core (10,000 × 100 μs = 1 second of CPU time per second). This is a key reason prepared statements exist.
```sql
-- PostgreSQL: Viewing parse time vs execution time

-- Enable timing
\timing on

-- First execution: includes parsing
SELECT * FROM users WHERE id = 42;
-- Time: 2.345 ms (includes parse + plan + execute)

-- Enable detailed timing breakdown
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM users WHERE id = 42;
/*
Index Scan using users_pkey on users
  Index Cond: (id = 42)
  Buffers: shared hit=3
Planning Time: 0.089 ms   <-- Parse + Plan time
Execution Time: 0.031 ms  <-- Actual execution
*/

-- With prepared statement (parsing happens once)
PREPARE get_user(int) AS
    SELECT * FROM users WHERE id = $1;

EXPLAIN (ANALYZE) EXECUTE get_user(42);
/*
Planning Time: 0.012 ms   <-- Much faster (no parse, minimal plan)
Execution Time: 0.029 ms
*/

EXPLAIN (ANALYZE) EXECUTE get_user(100);
/*
Planning Time: 0.008 ms   <-- Even faster (plan caching kicks in)
Execution Time: 0.025 ms
*/
```

Parser Limits:
Databases impose limits on query complexity to prevent parser resource exhaustion:
| Limit Type | PostgreSQL | MySQL | Purpose |
|---|---|---|---|
| Max query length | 1 GB | 64 MB default | Prevent memory exhaustion |
| Max expression depth | ~thousands | ~1000 | Prevent stack overflow |
| Max table references | ~thousands | 61 per statement | Prevent exponential planning |
| Max IN list items | ~millions | ~thousands | Prevent memory exhaustion |
| Max parameters | 65535 | 65535 | Protocol limitation |
We've explored the parsing phase in comprehensive detail. Let's consolidate the key concepts:

Two phases — lexical analysis turns characters into tokens; syntax analysis validates token sequences against the grammar and builds a tree

Grammar — SQL syntax is specified with BNF/EBNF production rules, and the full grammar is enormous

Trees — the parser produces a parse tree, usually simplified into an abstract syntax tree (AST) for later phases

Algorithms — most databases use LALR parsers produced by generators such as Bison/Yacc rather than hand-written parsers

Errors — syntax errors are reported with position and context; most databases stop at the first error rather than attempting deep recovery

Identifiers — case-folding rules differ across databases and are a common portability trap

Performance — parsing cost is small per query but significant at high throughput; prepared statements amortize it
What's Next:
With the parse tree constructed, the query processing pipeline advances to the translation phase. The next page examines how parse trees are transformed into relational algebra representations—the logical query plan. This translation bridges SQL's declarative syntax with the formal mathematical foundations of relational databases.
You now understand how SQL text is transformed into structured parse trees through lexical and syntax analysis. This foundation prepares you for understanding semantic analysis and relational algebra translation in the next phase.