Every SQL query you write must pass through a gauntlet of validation before a single row of data is ever touched. The database system scrutinizes your query at multiple levels, and the very first checkpoint is syntax analysis—the process of ensuring your query is structurally valid according to the rules of the SQL language.
When you execute a query like SELECT name FROM employees WHERE salary > 50000, the database doesn't immediately start searching through tables. Instead, it first breaks down your query into individual tokens, checks that these tokens are arranged according to SQL's grammar rules, and constructs a structured representation suitable for further processing.
This seemingly simple step is surprisingly sophisticated. Modern database systems process millions of queries per second, and each one must pass through syntax analysis with microsecond-level efficiency while providing clear, actionable error messages when something goes wrong.
By the end of this page, you will understand the complete syntax analysis pipeline: from raw query text to structured parse tree. You'll learn how lexical analysis breaks queries into tokens, how grammar rules define valid SQL structure, how databases detect and report syntax errors with precision, and why this phase is critical for both correctness and performance.
Before diving into the details of syntax analysis, let's understand where it fits in the broader query processing pipeline. When a SQL query arrives at the database engine, it undergoes a series of transformations:
The Complete Query Processing Pipeline:

1. Query reception — the raw SQL text arrives at the database engine
2. Lexical analysis — the character stream is broken into tokens
3. Parsing — the token stream is checked against SQL's grammar and a parse tree is built
4. Semantic analysis — names are resolved to database objects and types are checked
5. Query optimization — an efficient execution plan is chosen
6. Execution — the plan runs and produces result rows
Syntax analysis encompasses steps 2 and 3 in this pipeline, though some systems combine them into a single parsing phase. The output of syntax analysis is typically a parse tree (also called Abstract Syntax Tree or AST)—a hierarchical data structure that represents the query's structure in a form suitable for subsequent processing.
The parsing techniques used in database systems are direct descendants of compiler technology developed in the 1960s and 1970s. Many database parsers are actually built using the same tools (like yacc/bison and lex/flex) that have been used to build programming language compilers for decades. This shared heritage explains why database professionals often speak the same language as compiler engineers when discussing parsing.
Lexical analysis—also called tokenization or scanning—is the process of converting a raw character stream into a sequence of tokens. Each token represents a meaningful unit in the SQL language: a keyword, an identifier, a literal value, an operator, or punctuation.
Why Tokenization Matters:
Consider this SQL query:
SELECT employee_name, salary FROM employees WHERE department_id = 10
To a human, this is clearly a query that selects employee names and salaries from employees in department 10. But to the database, it initially arrives as a stream of 68 ASCII characters including spaces. The lexer's job is to transform this character stream into structured tokens that subsequent phases can process efficiently.
The lexer component (also called a scanner or tokenizer) reads characters left-to-right, identifies token boundaries, classifies each token, and passes the token stream to the parser.
| Token Category | Examples | Description |
|---|---|---|
| Keywords | SELECT, FROM, WHERE, INSERT, UPDATE | Reserved words with special meaning in SQL syntax |
| Identifiers | employees, department_id, customer_name | Names of tables, columns, schemas, aliases, etc. |
| String Literals | 'John Doe', 'New York' | Quoted character sequences |
| Numeric Literals | 42, 3.14159, -100, 1.5e10 | Integer and floating-point numbers |
| Operators | =, <>, >=, +, -, *, || | Comparison, arithmetic, and string operators |
| Delimiters | (, ), ,, ; | Punctuation that structures the query |
| Comments | -- comment, /* block */ | Ignored by parser, used for documentation |
| Whitespace | spaces, tabs, newlines | Token separators (usually discarded) |
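Each token the lexer emits is more than just text. The sketch below (Python, with illustrative field names rather than any particular engine's layout) shows the kind of record a lexer typically produces, carrying the token type, the matched text, and the position that later powers error reporting:

```python
from typing import NamedTuple

class Token(NamedTuple):
    type: str     # e.g. "KEYWORD", "IDENTIFIER", "OPERATOR"
    value: str    # the exact text that was matched
    line: int     # 1-based line number, kept for error messages
    column: int   # 1-based column of the token's first character

# The first three tokens of "SELECT name FROM employees" might look like:
tokens = [
    Token("KEYWORD", "SELECT", 1, 1),
    Token("IDENTIFIER", "name", 1, 8),
    Token("KEYWORD", "FROM", 1, 13),
]
```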
Tokenization Example:
Let's trace how the lexer processes our example query character by character:
```sql
-- Original Query:
SELECT employee_name, salary FROM employees WHERE department_id = 10

-- Resulting Token Stream:
Token 1:  KEYWORD     "SELECT"
Token 2:  IDENTIFIER  "employee_name"
Token 3:  COMMA       ","
Token 4:  IDENTIFIER  "salary"
Token 5:  KEYWORD     "FROM"
Token 6:  IDENTIFIER  "employees"
Token 7:  KEYWORD     "WHERE"
Token 8:  IDENTIFIER  "department_id"
Token 9:  OPERATOR    "="
Token 10: INTEGER     "10"

-- Note: Whitespace between tokens is consumed but not emitted
-- Each token carries:
--   1. Token type (KEYWORD, IDENTIFIER, etc.)
--   2. Token value (the actual text)
--   3. Position information (line, column) for error reporting
```

Understanding how lexers actually work reveals important details about SQL parsing behavior. Modern database lexers are typically implemented using one of two approaches:
1. Hand-Written Lexers: Some databases (notably PostgreSQL) use carefully hand-crafted lexers written in C. These offer maximum control over tokenization behavior and can be highly optimized for specific SQL dialects.
2. Generated Lexers: Other databases use lexer generators like Lex or Flex that create lexer code from a declarative specification. This approach is faster to develop and easier to maintain.
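To illustrate the declarative style, the following simplified Python sketch (not any real database's lexer) drives tokenization from a table of regular expressions, much as a Flex specification pairs patterns with actions:

```python
import re

# Each entry pairs a token type with a pattern; order matters, since the regex
# engine takes the first alternative that matches at the current position.
TOKEN_SPEC = [
    ("COMMENT",    r"--[^\n]*"),
    ("STRING",     r"'(?:[^']|'')*'"),          # '' is the standard SQL quote escape
    ("NUMBER",     r"\d+(?:\.\d+)?(?:[eE][+-]?\d+)?"),
    ("OPERATOR",   r"<>|<=|>=|\|\||::|[=<>+\-*/]"),
    ("DELIMITER",  r"[(),;.]"),
    ("WORD",       r"[A-Za-z_][A-Za-z0-9_]*"),  # keyword or identifier
    ("WHITESPACE", r"\s+"),
]
KEYWORDS = {"SELECT", "FROM", "WHERE", "AND", "OR", "GROUP", "BY", "ORDER", "HAVING"}
MASTER_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(sql: str):
    # Note: characters matching no pattern are silently skipped here;
    # a real lexer would raise an "unexpected character" error instead.
    for match in MASTER_RE.finditer(sql):
        kind, text = match.lastgroup, match.group()
        if kind in ("WHITESPACE", "COMMENT"):
            continue                              # separators are consumed, not emitted
        if kind == "WORD":
            kind = "KEYWORD" if text.upper() in KEYWORDS else "IDENTIFIER"
        yield kind, text

print(list(tokenize("SELECT name FROM employees WHERE department_id = 10")))
```

Order matters in the pattern table: multi-character operators and string literals must be tried before the shorter patterns they contain, which mirrors the lookahead logic a hand-written lexer performs explicitly.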
Regardless of implementation approach, all lexers must handle several challenging aspects of SQL tokenization:
- Keywords vs. identifiers: Is SELECT a keyword or could it be a table name? Most SQL dialects reserve keywords, but some allow escaping (e.g., "SELECT" as an identifier in PostgreSQL).
- Case sensitivity: Keywords are case-insensitive (SELECT = select = SeLeCt), but identifier handling varies by database. MySQL is case-insensitive on Windows by default, case-sensitive on Linux.
- String escaping: Standard SQL doubles the quote character with '' to escape it ('O''Brien'), but some databases support backslash escaping.
- Numeric formats: Lexers must accept integers, decimals, and scientific notation (42, 3.14, 1.5e-10), plus handle negative signs correctly.
- Multi-character operators: Tokens like <>, >=, ||, and :: require lookahead to distinguish them from single-character operators.
- Comments: Line comments (--) and block comments (/* */) must be recognized and typically stripped before parsing.
```
// Simplified Lexer State Machine
function nextToken():
    skipWhitespaceAndComments()
    if atEndOfInput():
        return Token(EOF)

    char = peekChar()

    // Identifiers and Keywords
    if isLetter(char) or char == '_':
        word = consumeWhile(isAlphanumeric or '_')
        if word.toUpperCase() in RESERVED_KEYWORDS:
            return Token(KEYWORD, word)
        else:
            return Token(IDENTIFIER, word)

    // Numeric Literals
    if isDigit(char):
        num = consumeWhile(isDigit)
        if peekChar() == '.':
            consume('.')
            num += '.' + consumeWhile(isDigit)
            return Token(FLOAT_LITERAL, num)
        return Token(INTEGER_LITERAL, num)

    // String Literals
    if char == "'":
        consume("'")
        str = ""
        while not atEndOfInput():
            if peekChar() == "'" and peekAhead(1) == "'":
                str += "'"
                consume("''")
            else if peekChar() == "'":
                consume("'")
                return Token(STRING_LITERAL, str)
            else:
                str += consumeChar()
        error("Unterminated string literal")

    // Operators (with lookahead for multi-char)
    if char == '<':
        consume('<')
        if peekChar() == '=':
            consume('=')
            return Token(OPERATOR, "<=")
        if peekChar() == '>':
            consume('>')
            return Token(OPERATOR, "<>")
        return Token(OPERATOR, "<")

    // ... similar patterns for other operators

    // Single-character tokens
    if char in "(),;":
        return Token(DELIMITER, consumeChar())

    error("Unexpected character: " + char)
```

Production database lexers are heavily optimized. They often use lookup tables for character classification (rather than function calls), buffer tokens to avoid repeated allocation, and minimize branching to improve CPU pipeline efficiency. PostgreSQL's lexer, for instance, can tokenize millions of characters per second.
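The lookup-table optimization is easy to picture. This small Python sketch (illustrative only; production lexers do the equivalent in C) replaces per-character function calls with a single index into a precomputed 256-entry table:

```python
# Precompute a character-class table once, at lexer startup.
LETTER, DIGIT, QUOTE, OTHER = range(4)
CHAR_CLASS = [OTHER] * 256
for c in range(ord("a"), ord("z") + 1):
    CHAR_CLASS[c] = LETTER
for c in range(ord("A"), ord("Z") + 1):
    CHAR_CLASS[c] = LETTER
CHAR_CLASS[ord("_")] = LETTER
for c in range(ord("0"), ord("9") + 1):
    CHAR_CLASS[c] = DIGIT
CHAR_CLASS[ord("'")] = QUOTE

def classify(ch: str) -> int:
    # One array index replaces a chain of isalpha()/isdigit() calls in the hot loop.
    return CHAR_CLASS[ord(ch)] if ord(ch) < 256 else OTHER

print(classify("A"), classify("7"), classify("'"), classify("!"))  # 0 1 2 3
```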
Once the lexer produces a token stream, the parser takes over. The parser's job is to verify that the tokens are arranged according to the grammar rules of SQL and to build a parse tree representing the query's structure.
What is a Grammar?
A grammar is a formal specification of what constitutes a valid statement in a language. SQL grammars are typically specified using Backus-Naur Form (BNF) or a variant like Extended BNF (EBNF). These notations define:
- Non-terminal symbols: abstract grammatical constructs (like <select_statement> or <expression>)
- Terminal symbols: the actual tokens produced by the lexer (like SELECT or IDENTIFIER)
- Production rules: how each non-terminal expands into a sequence of terminals and other non-terminals

Let's look at a simplified excerpt from an SQL grammar:
```
-- Simplified SQL SELECT Statement Grammar (BNF notation)

<select_statement> ::= SELECT <select_list>
                       FROM <table_reference>
                       [WHERE <search_condition>]
                       [GROUP BY <group_by_clause>]
                       [HAVING <having_clause>]
                       [ORDER BY <order_by_clause>]

<select_list> ::= '*'
                | <select_item> {',' <select_item>}

<select_item> ::= <expression> [AS <alias>]
                | <table_name> '.' '*'

<table_reference> ::= <table_name> [<alias>]
                    | <table_reference> <join_type> <table_reference> ON <search_condition>
                    | '(' <select_statement> ')' [AS <alias>]

<expression> ::= <term>
               | <expression> <additive_op> <term>

<term> ::= <factor>
         | <term> <multiplicative_op> <factor>

<factor> ::= <column_reference>
           | <literal>
           | <function_call>
           | '(' <expression> ')'
           | <subquery>

<search_condition> ::= <predicate>
                     | <search_condition> AND <search_condition>
                     | <search_condition> OR <search_condition>
                     | NOT <search_condition>
                     | '(' <search_condition> ')'

<predicate> ::= <expression> <comparison_op> <expression>
              | <expression> IS [NOT] NULL
              | <expression> [NOT] IN <in_list>
              | <expression> [NOT] BETWEEN <expression> AND <expression>
              | <expression> [NOT] LIKE <pattern>
```

Grammar Complexity:
The actual SQL grammar is vastly more complex than this excerpt. The SQL:2016 standard grammar has hundreds of production rules covering data definition, data manipulation, every flavor of join, subqueries, window functions, common table expressions, and much more.
PostgreSQL's gram.y file (the grammar specification), for example, is over 15,000 lines of grammar rules. MySQL's grammar is similarly extensive. This complexity is why writing a complete SQL parser from scratch is a multi-month undertaking.
Database systems use various parsing algorithms, each with different tradeoffs. Understanding these helps explain why some SQL constructs parse successfully while others cause unexpected errors.
Bottom-Up Parsing (LR Parsing):
Most production database parsers use LR parsing or its variants (SLR, LALR). These parsers work bottom-up: they shift tokens onto a stack and reduce them to grammar rules as soon as a complete construct is recognized, driven by precomputed parsing tables. LR parsers handle left-recursive grammars naturally and are almost always generated from a grammar specification rather than written by hand.
The most common tool for generating LR parsers is Yacc (Yet Another Compiler Compiler) or its GNU equivalent Bison. PostgreSQL, MySQL, and many other databases use Bison-generated parsers.
Top-Down Parsing (LL Parsing):
Some databases and SQL parsers use LL parsing or recursive descent: the parser works top-down from the grammar's start symbol, typically with one function per grammar rule, deciding which production to apply by looking at the next token (or a small, fixed number of tokens).
Recursive descent parsers are easier to understand and debug but may struggle with left-recursive grammars.
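To make the top-down approach concrete, here is a minimal recursive-descent sketch in Python for a toy slice of a WHERE condition (a hypothetical grammar, not any real database's parser): condition ::= comparison (AND comparison)*, comparison ::= IDENTIFIER op literal. Each grammar rule becomes one function that consumes tokens and returns an AST node:

```python
# Tokens are (type, value) pairs, e.g. ("IDENTIFIER", "salary"), ("OPERATOR", ">").
class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else ("EOF", "")

    def expect(self, token_type):
        kind, value = self.peek()
        if kind != token_type:
            raise SyntaxError(f"expected {token_type}, found {kind} {value!r} at token {self.pos + 1}")
        self.pos += 1
        return value

    # condition ::= comparison (AND comparison)*
    def condition(self):
        node = self.comparison()
        while self.peek() == ("KEYWORD", "AND"):
            self.pos += 1
            node = ("AND", node, self.comparison())
        return node

    # comparison ::= IDENTIFIER op literal
    def comparison(self):
        column = self.expect("IDENTIFIER")
        op = self.expect("OPERATOR")
        kind, value = self.peek()
        if kind not in ("NUMBER", "STRING"):
            raise SyntaxError(f"expected a literal, found {kind} {value!r}")
        self.pos += 1
        return (op, ("column", column), ("literal", value))

tokens = [("IDENTIFIER", "salary"), ("OPERATOR", ">"), ("NUMBER", "50000"),
          ("KEYWORD", "AND"), ("IDENTIFIER", "dept"), ("OPERATOR", "="), ("STRING", "'Sales'")]
print(Parser(tokens).condition())
# ('AND', ('>', ('column', 'salary'), ('literal', '50000')),
#         ('=', ('column', 'dept'), ('literal', "'Sales'")))
```

Note how the loop over AND sidesteps left recursion: a naive recursive translation of condition ::= condition AND comparison would call itself forever, which is exactly the left-recursion problem mentioned above.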
Parsing Complexity:
Both LR and LL parsing run in O(n) time where n is the number of tokens. This linear complexity is crucial—imagine if parsing complexity were O(n²) for queries with thousands of tokens. For a SELECT with 500 columns (not unusual in data warehouses), quadratic parsing would be catastrophic.
The constant factors matter too. LR parsers using optimized table-driven approaches can parse tokens with just a few memory accesses per token. Combined with efficient lexing, modern databases can parse complex queries in microseconds.
Many databases use Bison (or its predecessor Yacc) combined with Flex (or Lex) for lexer generation. This combination has been standard since the 1970s. More modern tools like ANTLR offer similar functionality with better debugging support and multiple target languages. Some newer databases (like CockroachDB) use Go-based parser generators that integrate with their codebase.
One of the most visible aspects of syntax analysis is error detection and reporting. When you write malformed SQL, the parser must detect the problem, pinpoint where it occurred, and report it in a way that helps you fix the query rather than exposing internal parser state.
Types of Syntax Errors:
| Error Type | Example | Parser Detection |
|---|---|---|
| Missing keyword | SELECT name employees | Expected FROM after select list |
| Misspelled keyword | SLECT name FROM employees | SLECT not recognized as keyword or valid identifier position |
| Missing delimiter | SELECT name salary FROM employees | Expected , or FROM after name |
| Unbalanced parentheses | WHERE (a = 1 AND b = 2 | End of input reached with unclosed parenthesis |
| Unterminated string | WHERE name = 'John | End of line/input in string literal |
| Invalid operator | WHERE salary = = 100 | Unexpected = after operator |
| Missing expression | SELECT FROM employees | Expected expression in select list |
| Reserved word misuse | SELECT select FROM table | Unexpected keyword in identifier position |
```sql
-- Example 1: Missing FROM keyword
SELECT employee_name, salary employees WHERE department = 'Sales';
-- PostgreSQL error:
-- ERROR: syntax error at or near "employees"
-- LINE 1: SELECT employee_name, salary employees WHERE department = 'Sa...
--                                      ^

-- Example 2: Mismatched parentheses
SELECT * FROM orders WHERE (status = 'pending' AND amount > 100;
-- PostgreSQL error:
-- ERROR: syntax error at or near ";"
-- LINE 1: ...orders WHERE (status = 'pending' AND amount > 100;
--                                                             ^
-- Hint: Missing closing parenthesis

-- Example 3: Invalid column list syntax
SELECT name,, salary FROM employees;
-- PostgreSQL error:
-- ERROR: syntax error at or near ","
-- LINE 1: SELECT name,, salary FROM employees;
--                     ^

-- Example 4: String literal issues
SELECT * FROM employees WHERE name = 'O'Brien';
-- PostgreSQL error:
-- ERROR: syntax error at or near "Brien"
-- LINE 1: SELECT * FROM employees WHERE name = 'O'Brien';
--                                                 ^
-- Correct: WHERE name = 'O''Brien'
```

Error Message Quality:
The quality of syntax error messages varies dramatically between database systems. Good error messages point to the exact position of the problem, quote the offending token, indicate what the parser expected instead, and suggest a likely fix when one is obvious.
PostgreSQL is generally praised for its error messages, while some databases produce notoriously cryptic errors. This quality difference stems from how much engineering effort is invested in the parser's error-handling routines.
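Much of that quality comes down to carrying token positions from the lexer all the way through to the error report. The following Python sketch (a hypothetical helper, not PostgreSQL's actual code) shows how a parser can render the familiar LINE/caret display from the query text and an error offset:

```python
def format_syntax_error(query: str, error_offset: int, message: str) -> str:
    """Render a PostgreSQL-style error with a caret under the offending token."""
    line_start = query.rfind("\n", 0, error_offset) + 1       # start of the offending line
    line_end = query.find("\n", error_offset)
    if line_end == -1:
        line_end = len(query)
    line_no = query.count("\n", 0, error_offset) + 1           # 1-based line number
    line_text = query[line_start:line_end]
    prefix = f"LINE {line_no}: "
    caret_line = " " * (len(prefix) + error_offset - line_start) + "^"
    return f"ERROR: {message}\n{prefix}{line_text}\n{caret_line}"

print(format_syntax_error(
    "SELECT name,, salary FROM employees;",
    error_offset=12,                     # position of the second comma
    message='syntax error at or near ","'))
```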
When facing a cryptic syntax error, try these strategies: (1) Simplify the query by removing clauses until it parses, then add them back one at a time; (2) Check for invisible characters (copy-pasted text sometimes includes zero-width characters); (3) Verify string delimiters and escape sequences; (4) Use a SQL formatter to restructure the query visually.
As the parser validates the token stream against grammar rules, it simultaneously constructs a parse tree (also called an Abstract Syntax Tree or AST). This tree is the structured representation of the query that subsequent phases—semantic analysis, optimization, execution—will operate upon.
Parse Tree Structure:
The parse tree mirrors the grammar's hierarchical structure:
- Internal nodes represent grammatical constructs (like SELECT_STATEMENT, WHERE_CLAUSE)
- Leaf nodes represent individual tokens such as identifiers, literals, and operators

Let's trace the parse tree for a simple query:
```
-- Query:
SELECT name, salary FROM employees WHERE department = 'Sales'

-- Simplified Parse Tree Structure:

SelectStatement
├── SelectList
│   ├── SelectItem
│   │   └── ColumnRef: "name"
│   └── SelectItem
│       └── ColumnRef: "salary"
├── FromClause
│   └── TableRef: "employees"
└── WhereClause
    └── ComparisonExpr
        ├── Operator: "="
        ├── Left: ColumnRef "department"
        └── Right: StringLiteral "'Sales'"
```

AST vs. Concrete Syntax Tree:
Database systems typically build an Abstract Syntax Tree (AST) rather than a full Concrete Syntax Tree (CST). The difference: a CST records every token exactly as it appeared, punctuation and all, while an AST keeps only the nodes that carry meaning and discards purely syntactic detail.
For example, the parentheses in WHERE (a = 1) might appear in a CST but would be omitted from an AST since they don't affect meaning—the grouping is already captured in the tree structure.
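This is easy to see with a few illustrative node classes. The Python dataclasses below are purely for demonstration (real engines define such nodes as C structs or their own class hierarchies), with names that echo the node types listed in the next section; because the parentheses never become a node, both spellings of the predicate produce an identical AST:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ColumnRef:
    name: str

@dataclass(frozen=True)
class Const:
    value: object

@dataclass(frozen=True)
class OpExpr:
    op: str
    left: object
    right: object

# WHERE a = 1    -> OpExpr("=", ColumnRef("a"), Const(1))
# WHERE (a = 1)  -> the parentheses guide the parser but leave no node behind
with_parens    = OpExpr("=", ColumnRef("a"), Const(1))
without_parens = OpExpr("=", ColumnRef("a"), Const(1))
print(with_parens == without_parens)  # True: the AST records meaning, not punctuation
```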
AST Node Types:
Typical database AST nodes include:
- SelectStmt, InsertStmt, UpdateStmt, DeleteStmt — Statement types
- ResTarget — Items in a SELECT list
- RangeVar — Table references
- JoinExpr — Join expressions with conditions
- BoolExpr — AND/OR/NOT combinations
- OpExpr — Operator expressions (comparisons, arithmetic)
- Const — Literal values
- ColumnRef — Column references
- FuncCall — Function invocations
- SubLink — Subqueries

Syntax analysis forms the critical first validation layer for every SQL query: the lexer converts raw text into tokens, the parser checks those tokens against SQL's grammar and builds a parse tree, and precise position tracking makes it possible to report errors exactly where they occur.
What's Next:
Syntax analysis ensures your query is structurally valid SQL, but it says nothing about whether the tables, columns, and operations you've specified actually make sense. That's the job of semantic analysis, which we'll explore in the next page. There, we'll see how databases resolve names to actual database objects, verify type compatibility, and prepare the query for optimization.
You now understand the complete syntax analysis pipeline: from raw SQL text through lexical analysis, token generation, grammar-based parsing, error detection, and parse tree construction. This foundation enables the deeper semantic checks covered next.