Parsing and Validation

Level: Intermediate

Duration: 75 mins

Topic: Query Processing

Parse Tree

The Query's Skeleton

When you write a SQL query like SELECT name FROM employees WHERE salary > 50000, you're writing for humans to read. But databases don't execute text; they execute structured operations on data. The bridge between human-readable SQL and executable operations is the parse tree. In practice most systems keep a condensed form of it, the Abstract Syntax Tree (AST), and the two terms are often used interchangeably; this page does the same except where the distinction matters.

The parse tree is a hierarchical representation of your query's structure. Each clause becomes a subtree, each column reference becomes a node, and the relationships between elements become edges. This tree is the "skeleton" upon which all subsequent processing hangs—semantic analysis annotates it, optimization transforms it, and execution interprets it.

Understanding parse trees reveals how databases "see" your queries and helps explain why certain query formulations produce different execution plans.

What You Will Learn

By the end of this page, you will understand parse tree structure, node types, and construction process. You'll learn how annotations enrich the tree during semantic analysis, how the tree is transformed for optimization, and how to visualize and interpret parse trees in real database systems.

Parse Tree Fundamentals

A parse tree is a tree data structure that represents the syntactic structure of a query according to the SQL grammar. Understanding its properties is essential for understanding query processing.

Tree Properties:

Root Node: Represents the entire statement (SELECT, INSERT, UPDATE, etc.)
Internal Nodes: Represent grammatical constructs (clauses, expressions, operators)
Leaf Nodes: Represent tokens (literals, identifiers, keywords)
Edges: Represent containment relationships ("this clause contains these elements")
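As a rough illustration of these properties, here is a minimal sketch of a tree for SELECT name FROM employees WHERE salary > 50000. The Node class, its field names, and the node kinds are assumptions made for this example and do not mirror any particular database's data structures.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """Hypothetical parse tree node: a kind, optional token text, and children."""
    kind: str
    value: Optional[str] = None
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def leaves(node: Node) -> Iterator[Node]:
    """Yield the leaf nodes of the tree from left to right."""
    if node.is_leaf():
        yield node
    else:
        for child in node.children:
            yield from leaves(child)


# Root node: the whole statement. Internal nodes: clauses and the operator.
# Leaf nodes: identifiers and literals. Edges: the parent/child containment.
tree = Node("SelectStmt", children=[
    Node("TargetList", children=[Node("ColumnRef", "name")]),
    Node("FromClause", children=[Node("TableRef", "employees")]),
    Node("WhereClause", children=[
        Node("OpExpr", ">", children=[
            Node("ColumnRef", "salary"),
            Node("Const", "50000"),
        ]),
    ]),
])

print([leaf.value for leaf in leaves(tree)])
# ['name', 'employees', 'salary', '50000']
```

Walking the leaves recovers the identifiers and literals in query order, while the clause structure lives entirely in the internal nodes.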

Concrete Syntax Tree (CST) vs. Abstract Syntax Tree (AST):

Parsers can produce two types of trees:

CST (Parse Tree): Includes every token, every punctuation mark, preserves exact syntax
AST (Abstract Syntax Tree): Omits purely syntactic tokens (punctuation, grouping parentheses, most keywords), preserves only semantically significant elements

Most databases build ASTs because:

Smaller memory footprint
Easier to analyze and transform
Irrelevant details (parentheses, commas) don't clutter processing

cst-vs-ast.txt
-- Query: SELECT (a + b) * c FROM t WHERE x = 1;
 
-- Concrete Syntax Tree (CST) - includes everything:
SelectStatement
├── SELECT (keyword)
├── SelectList
│   └── Expression
│       ├── LPAREN "("
│       ├── Expression
│       │   ├── ColumnRef "a"
│       │   ├── PLUS "+"
│       │   └── ColumnRef "b"
│       ├── RPAREN ")"
│       ├── STAR "*"
│       └── ColumnRef "c"
├── FROM (keyword)
├── TableRef "t"
├── WHERE (keyword)
├── Expression
│   ├── ColumnRef "x"
│   ├── EQUALS "="
│   └── IntLiteral "1"
└── SEMICOLON ";"
 
-- Abstract Syntax Tree (AST) - semantics only:
SelectStmt
├── targetList: [MultExpr]
│   └── MultExpr (*)
│       ├── left: AddExpr (+)
│       │   ├── left: ColumnRef "a"
│       │   └── right: ColumnRef "b"
│       └── right: ColumnRef "c"
├── fromClause: [RangeVar "t"]
└── whereClause: OpExpr (=)
    ├── left: ColumnRef "x"
    └── right: Const 1

Notice how the AST omits parentheses—the tree structure itself encodes operator precedence. The addition happens first because it's a child of the multiplication node. This is why ASTs are preferred: they encode meaning in structure rather than relying on syntactic markers.
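The same effect can be observed with any AST-producing parser. The sketch below uses Python's built-in ast module on a Python expression, not a SQL parser, purely to show that parentheses disappear and precedence survives as nesting. The shape helper is a name invented for this example.

```python
import ast


def shape(expr: str) -> str:
    """Describe the operator nesting of a simple arithmetic expression."""
    def walk(node: ast.AST) -> str:
        if isinstance(node, ast.BinOp):
            op = type(node.op).__name__  # "Add", "Mult", ...
            return f"{op}({walk(node.left)}, {walk(node.right)})"
        if isinstance(node, ast.Name):
            return node.id
        raise ValueError(f"unsupported node: {type(node).__name__}")

    return walk(ast.parse(expr, mode="eval").body)


print(shape("(a + b) * c"))  # Mult(Add(a, b), c)
print(shape("a + b * c"))    # Add(a, Mult(b, c))
print(shape("a + (b * c)"))  # Add(a, Mult(b, c))
```

The last two expressions produce identical trees: the redundant parentheses in the third leave no trace, because the nesting already says which operation happens first.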

Common Node Types in SQL Parse Trees

SQL ASTs contain many node types, each representing a specific syntactic construct. Let's examine the major categories.

Statement Nodes:

These are root nodes representing complete SQL statements:

SQL Statement Node Types
Node Type	SQL Statement	Key Child Nodes
SelectStmt	SELECT query	targetList, fromClause, whereClause, groupClause, sortClause
InsertStmt	INSERT statement	relation, cols, selectStmt/valuesList
UpdateStmt	UPDATE statement	relation, targetList, whereClause, fromClause
DeleteStmt	DELETE statement	relation, whereClause, usingClause
CreateStmt	CREATE TABLE	relation, tableElts (columns), constraints
AlterTableStmt	ALTER TABLE	relation, cmds
IndexStmt	CREATE INDEX	idxname, relation, indexParams

Expression Nodes:

These represent values, calculations, and conditions:

expression-nodes.txt
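-- Expression Node Types (PostgreSQL-style names; the list mixes raw-parser nodes
-- such as ColumnRef and FuncCall with post-analysis nodes such as Const, OpExpr, and Aggref)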
 
Const           -- Literal values: 42, 'hello', TRUE
ColumnRef       -- Column references: employees.name, e.salary
FuncCall        -- Function calls: UPPER(name), COUNT(*)
OpExpr          -- Binary operators: a + b, x > 5
BoolExpr        -- Boolean combinations: AND, OR, NOT
SubLink         -- Subqueries: (SELECT max(x) FROM t)
CaseExpr        -- CASE expressions
CoalesceExpr    -- COALESCE(a, b, c)
NullTest        -- IS NULL, IS NOT NULL
BooleanTest     -- IS TRUE, IS FALSE, IS UNKNOWN
TypeCast        -- CAST(x AS type), x::type
ArrayExpr       -- ARRAY[1,2,3]
RowExpr         -- ROW(a, b, c)
CoerceExpr      -- Type coercion wrapper (illustrative name; PostgreSQL uses RelabelType, CoerceViaIO, or FuncExpr)
Aggref          -- Aggregate function: SUM, AVG, COUNT
WindowFunc      -- Window function: ROW_NUMBER() OVER(...)

Clause and Reference Nodes:

These represent structural elements within statements:

Clause and Reference Nodes

•RangeVar — Table reference with schema, table name, and alias
•JoinExpr — JOIN with type (INNER/LEFT/etc.), left table, right table, and condition
•RangeSubselect — Subquery in FROM clause with alias
•ResTarget — Item in SELECT list (expression + optional alias)
•SortBy — ORDER BY element with direction and nulls handling
•GroupClause — GROUP BY specification
•WindowDef — Window specification for OVER clause
•CommonTableExpr — CTE definition in WITH clause
•WithClause — Container for CTEs

AST Construction During Parsing

The parser constructs the AST incrementally as it processes the token stream. Understanding this process helps explain how parsing and tree building intertwine.

Bottom-Up Construction (LR Parsers):

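Many database parsers are generated by LR-family tools (PostgreSQL and MySQL both use Bison, which produces LALR(1) parsers by default), and these build the tree bottom-up: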

Read tokens left-to-right
When tokens match a grammar rule, reduce them to a parent node
Continue until the entire query is reduced to a single root node

ast-construction.txt
-- Query: SELECT a FROM t WHERE b > 5
 
-- Parsing and AST construction trace:
 
Token Stream: SELECT  a  FROM  t  WHERE  b  >  5
 
Step 1: Read 'SELECT' → Shift to stack
Step 2: Read 'a' → Shift (identifier)
Step 3: 'a' matches column_ref rule → Reduce to ColumnRef node
Step 4: ColumnRef matches target rule → Reduce to ResTarget
Step 5: ResTarget matches target_list → Reduce to [ResTarget]
Step 6: Read 'FROM', 't' → Reduce 't' to RangeVar
Step 7: RangeVar matches from_clause → Reduce to FROM clause
Step 8: Read 'WHERE', 'b', '>', '5'
Step 9: 'b' → ColumnRef, '5' → Const
Step 10: ColumnRef > Const matches operator_expr → Reduce to OpExpr
Step 11: OpExpr matches where_clause → Reduce to WHERE clause
Step 12: target_list + from_clause + where_clause → Reduce to SelectStmt
 
Final AST:
SelectStmt
├── targetList: [ResTarget(ColumnRef "a")]
├── fromClause: [RangeVar "t"]
└── whereClause: OpExpr(ColumnRef "b" > Const 5)
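The trace above can be approximated with a toy shift-reduce loop. This is a sketch under strong assumptions: it is not a table-driven LR parser, it handles only queries of the form SELECT col FROM table WHERE operand op operand, and the rule list, symbol names, and dictionary-based node layout are all invented for illustration.

```python
import re

KEYWORDS = {"SELECT", "FROM", "WHERE"}


def tokenize(sql):
    """Split the query into (symbol, text) pairs and append an EOF marker."""
    tokens = []
    for text in re.findall(r"[A-Za-z_]\w*|\d+|[<>=]", sql):
        if text.upper() in KEYWORDS:
            tokens.append((text.upper(), text))
        elif text.isdigit():
            tokens.append(("NUMBER", text))
        elif text in "<>=":
            tokens.append(("OP", text))
        else:
            tokens.append(("IDENT", text))
    return tokens + [("EOF", "")]


# Each rule: (symbols expected on top of the stack,
#             required lookahead symbol or None,
#             semantic action that builds the new stack entry)
RULES = [
    (("FROM", "IDENT"), None,
     lambda _kw, name: ("from_clause", [{"RangeVar": name}])),
    (("IDENT",), None, lambda name: ("expr", {"ColumnRef": name})),
    (("NUMBER",), None, lambda num: ("expr", {"Const": int(num)})),
    (("expr", "OP", "expr"), None,
     lambda left, op, right: ("expr", {"OpExpr": {"op": op, "left": left, "right": right}})),
    (("SELECT", "expr"), "FROM",
     lambda _kw, expr: ("target_list", [{"ResTarget": expr}])),
    (("WHERE", "expr"), "EOF",
     lambda _kw, expr: ("where_clause", expr)),
    (("target_list", "from_clause", "where_clause"), "EOF",
     lambda targets, from_, where: ("select_stmt", {"SelectStmt": {
         "targetList": targets, "fromClause": from_, "whereClause": where}})),
]


def try_reduce(stack, lookahead):
    """Apply the first matching rule to the top of the stack; report success."""
    for pattern, needed, action in RULES:
        if needed is not None and needed != lookahead:
            continue
        size = len(pattern)
        if tuple(symbol for symbol, _ in stack[-size:]) == pattern:
            values = [value for _, value in stack[-size:]]
            del stack[-size:]
            stack.append(action(*values))  # the semantic action creates the node
            return True
    return False


def parse(sql):
    tokens = tokenize(sql)
    stack = []
    for index, token in enumerate(tokens[:-1]):
        stack.append(token)                       # shift
        lookahead = tokens[index + 1][0]
        while try_reduce(stack, lookahead):       # reduce as long as a rule matches
            pass
    if len(stack) != 1 or stack[0][0] != "select_stmt":
        raise SyntaxError(f"could not reduce to a statement: {[s for s, _ in stack]}")
    return stack[0][1]


print(parse("SELECT a FROM t WHERE b > 5"))
```

Real generators such as Bison compile the grammar into state tables instead of scanning a rule list, but the shift, reduce, and build-a-node rhythm is the same.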

Parser Actions:

During reduction, the parser executes semantic actions—code that creates AST nodes:

// Heavily simplified sketch, loosely modeled on the simple_select rule in PostgreSQL's gram.y
simple_select:
    SELECT target_list from_clause where_clause
    {
        // Semantic action: build a SelectStmt node from the values
        // produced by the already-reduced child rules ($2, $3, $4)
        SelectStmt *n = makeNode(SelectStmt);
        n->targetList = $2;
        n->fromClause = $3;
        n->whereClause = $4;
        $$ = (Node *) n;  // Pass the new node up to the enclosing rule
    }
    ;

Each grammar rule has associated code that builds AST nodes when that rule is matched.

Memory Allocation

Database parsers typically allocate AST nodes from a memory context (memory pool) associated with the query. This allows efficient bulk deallocation when the query completes, avoiding individual free() calls for thousands of nodes.

AST Annotation During Semantic Analysis

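The initial AST from parsing contains only syntactic information. Semantic analysis enriches it with annotations: additional metadata about each node. Some systems annotate the tree in place; PostgreSQL instead builds a separate Query tree from the raw parse tree, but the information added is the same in spirit.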

Types of Annotations:

AST Annotations Added During Semantic Analysis
Annotation Type	Applied To	Information Added
Type Info	All expression nodes	Data type, nullability, collation
Catalog References	RangeVar, ColumnRef	OID of table/column, position
Resolution Info	ColumnRef	Which table the column comes from
Coercion Nodes	Type mismatches	Inserted nodes for implicit casts
Aggregate Info	Aggref	Aggregate function OID, distinct flag
Subquery Type	SubLink	EXISTS, IN, scalar, etc.
Parameter Info	Parameter nodes	Parameter number, type

annotated-ast.txt
-- Query: SELECT name, salary * 1.1 FROM employees WHERE dept = 'Sales'
 
-- Raw AST (after parsing, before semantic analysis):
SelectStmt
├── targetList:
│   ├── ResTarget(ColumnRef "name")
│   └── ResTarget(OpExpr: ColumnRef "salary" * Const 1.1)
├── fromClause: [RangeVar "employees"]
└── whereClause: OpExpr(ColumnRef "dept" = Const 'Sales')
 
-- Annotated AST (after semantic analysis):
SelectStmt
├── targetList:
│   ├── ResTarget
│   │   └── ColumnRef "name"
│   │       ├── table_oid: 16384 (employees)
│   │       ├── column_num: 2 (name)
│   │       └── type: VARCHAR(100), collation: en_US
│   └── ResTarget  
│       └── OpExpr(*)
│           ├── type: NUMERIC(12,2)
│           ├── left: ColumnRef "salary"
│           │   ├── table_oid: 16384
│           │   ├── column_num: 4
│           │   └── type: NUMERIC(10,2)
│           └── right: Const 1.1
│               └── type: NUMERIC
├── fromClause:
│   └── RangeVar "employees"
│       ├── schema_oid: 2200 (public)
│       ├── relation_oid: 16384
│       └── alias: none
└── whereClause:
    └── OpExpr(=)
        ├── type: BOOLEAN
        ├── left: ColumnRef "dept"
        │   ├── table_oid: 16384
        │   ├── column_num: 3
        │   └── type: VARCHAR(50)
        └── right: Const 'Sales'
            └── type: VARCHAR (coerced from unknown)
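A minimal sketch of the name-resolution part of this step might look as follows. The CATALOG dictionary, the OID and column positions (copied from the example above), and the function and exception names are assumptions for illustration; a real analyzer consults the system catalogs and handles qualified names, aliases, and scopes.

```python
# Hypothetical in-memory catalog: table name -> (OID, {column: (position, type)})
CATALOG = {
    "employees": (16384, {
        "id": (1, "INTEGER"),
        "name": (2, "VARCHAR(100)"),
        "dept": (3, "VARCHAR(50)"),
        "salary": (4, "NUMERIC(10,2)"),
    }),
}


class SemanticError(Exception):
    """Raised when a name cannot be resolved unambiguously."""


def annotate_column(column, from_tables):
    """Resolve an unqualified column against the FROM tables and attach metadata."""
    matches = []
    for table in from_tables:
        if table not in CATALOG:
            raise SemanticError(f'relation "{table}" does not exist')
        table_oid, columns = CATALOG[table]
        if column in columns:
            position, type_name = columns[column]
            matches.append({
                "ColumnRef": column,
                "table_oid": table_oid,
                "column_num": position,
                "type": type_name,
            })
    if not matches:
        raise SemanticError(f'column "{column}" does not exist')
    if len(matches) > 1:
        raise SemanticError(f'column reference "{column}" is ambiguous')
    return matches[0]


print(annotate_column("salary", ["employees"]))
# {'ColumnRef': 'salary', 'table_oid': 16384, 'column_num': 4, 'type': 'NUMERIC(10,2)'}
```

The two error paths correspond to the familiar "does not exist" and "is ambiguous" messages that surface during semantic analysis, well before any data is read.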

Coercion Node Insertion:

When types don't match exactly, the analyzer inserts coercion nodes:

-- Original expression: int_column = 3.14
OpExpr(=)
├── left: ColumnRef (type: INTEGER)
└── right: Const 3.14 (type: NUMERIC)

-- After type checking (coercion inserted):
OpExpr(=)
├── left: CoerceExpr 
│   ├── arg: ColumnRef (type: INTEGER)
│   └── resulttype: NUMERIC
└── right: Const 3.14 (type: NUMERIC)

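The CoerceExpr node tells the executor to convert the integer to numeric before comparison. (CoerceExpr is an illustrative name here; PostgreSQL represents such an implicit cast as a FuncExpr that calls the cast function, or as a RelabelType or CoerceViaIO node.)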
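The insertion step can be sketched as a small function that compares operand types and wraps the narrower side. The widening order in TYPE_RANK, the dictionary node layout, and the function name are assumptions for this example; real systems drive the decision from a catalog of casts and type categories.

```python
# Hypothetical widening order for numeric types: higher rank is wider
TYPE_RANK = {"INTEGER": 1, "BIGINT": 2, "NUMERIC": 3}


def coerce_operands(left, right):
    """Return both operands, wrapping the narrower one in a CoerceExpr if needed."""
    left_type, right_type = left["type"], right["type"]
    if left_type == right_type:
        return left, right  # types already match, nothing to insert
    if left_type not in TYPE_RANK or right_type not in TYPE_RANK:
        raise TypeError(f"no implicit cast between {left_type} and {right_type}")
    target = left_type if TYPE_RANK[left_type] > TYPE_RANK[right_type] else right_type

    def wrap(node):
        if node["type"] == target:
            return node
        return {"node": "CoerceExpr", "arg": node, "type": target}

    return wrap(left), wrap(right)


column = {"node": "ColumnRef", "name": "int_column", "type": "INTEGER"}
constant = {"node": "Const", "value": "3.14", "type": "NUMERIC"}

new_left, new_right = coerce_operands(column, constant)
print(new_left["node"], "->", new_left["type"])  # CoerceExpr -> NUMERIC
print(new_right is constant)                     # True
```

Only the integer side is wrapped; the constant is returned untouched, which matches the before-and-after trees shown above.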

AST Transformations and Rewrites

After semantic analysis, the AST may undergo transformations that rewrite it into equivalent but more optimizable forms. These transformations prepare the query for optimization.

Common AST Transformations:

Query Rewrite Transformations

•View Expansion — Replace view references with underlying query definition
•Subquery Flattening — Convert some subqueries to joins for better optimization
•Constant Folding — Evaluate constant expressions at compile time (2 + 3 → 5)
•Predicate Normalization — Convert to canonical form (NOT (a > b) → a <= b)
•IN-list Expansion — Convert IN (1,2,3) to OR conditions or array format
•Exists-to-Join — Convert EXISTS subqueries to semi-joins where possible
•Rule-Based Rewriting — Apply user-defined rewrite rules
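Predicate normalization from the list above can be sketched as pushing NOT down to the comparisons. The tuple-based node layout (("not", x), ("and", l, r), ("or", l, r), ("cmp", op, l, r)) and the function name are assumptions for this example. The rewrite relies on De Morgan's laws and on flipping comparison operators, both of which remain valid under SQL's three-valued logic.

```python
# Comparison operator -> its negation
NEGATED = {">": "<=", "<": ">=", ">=": "<", "<=": ">", "=": "<>", "<>": "="}


def push_not(node, negate=False):
    """Eliminate NOT nodes by pushing negation down to the comparisons."""
    kind = node[0]
    if kind == "not":
        return push_not(node[1], not negate)
    if kind in ("and", "or"):
        if negate:  # De Morgan: NOT (x AND y) == (NOT x) OR (NOT y), and vice versa
            kind = "or" if kind == "and" else "and"
        return (kind, push_not(node[1], negate), push_not(node[2], negate))
    if kind == "cmp":
        _, op, left, right = node
        return ("cmp", NEGATED[op] if negate else op, left, right)
    raise ValueError(f"unsupported node kind: {kind}")


# NOT (a > b AND c = d)
predicate = ("not", ("and", ("cmp", ">", "a", "b"), ("cmp", "=", "c", "d")))
print(push_not(predicate))
# ('or', ('cmp', '<=', 'a', 'b'), ('cmp', '<>', 'c', 'd'))
```

With the NOT removed, each comparison is a plain predicate that later stages can match against indexes or push into scans.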

ast-transformation-examples.sql
-- Example 1: View Expansion
 
-- View definition:
CREATE VIEW active_employees AS 
SELECT * FROM employees WHERE status = 'active';
 
-- User query:
SELECT name FROM active_employees WHERE salary > 50000;
 
-- After view expansion (conceptual AST transformation):
SELECT name FROM employees WHERE status = 'active' AND salary > 50000;
 
-- Example 2: Subquery Flattening
 
-- Original:
SELECT * FROM orders WHERE customer_id IN (
    SELECT id FROM customers WHERE region = 'West'
);
 
-- Flattened to semi-join (pseudo-syntax: SEMI JOIN is planner notation, not standard SQL):
SELECT orders.* FROM orders 
SEMI JOIN customers ON orders.customer_id = customers.id
WHERE customers.region = 'West';
 
-- Example 3: Constant Folding
 
-- Original:
SELECT * FROM t WHERE date > '2024-01-01'::DATE + INTERVAL '30 days';
 
-- After constant folding (in PostgreSQL, date + interval yields a timestamp):
SELECT * FROM t WHERE date > '2024-01-31 00:00:00'::TIMESTAMP;
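Constant folding itself is a short bottom-up tree rewrite. The sketch below works on a tuple-based expression tree (("const", v), ("col", name), ("op", symbol, left, right)) that is an assumption for this example, and it folds only integer arithmetic; a real system also has to respect function volatility and type rules.

```python
import operator

# Operators this sketch knows how to evaluate at "compile time"
FOLDABLE = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def fold(node):
    """Replace operator nodes whose operands are all constants by their value."""
    if node[0] != "op":
        return node  # constants and column references are left as-is
    _, symbol, left, right = node
    left, right = fold(left), fold(right)  # fold children first (bottom-up)
    if symbol in FOLDABLE and left[0] == "const" and right[0] == "const":
        return ("const", FOLDABLE[symbol](left[1], right[1]))
    return ("op", symbol, left, right)


# salary * (2 + 3)
expr = ("op", "*", ("col", "salary"), ("op", "+", ("const", 2), ("const", 3)))
print(fold(expr))
# ('op', '*', ('col', 'salary'), ('const', 5))
```

The multiplication survives because one operand is a column, whose value is only known at execution time.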

Viewing Parse Trees in Database Systems

Many database systems provide ways to inspect the parse tree or its representations. This is invaluable for understanding query processing and debugging.

PostgreSQL:

PostgreSQL offers several debugging options:

viewing-parse-trees.sql
-- PostgreSQL: Enable parse tree debugging
SET debug_print_parse = on;
SET debug_print_rewritten = on;
SET debug_print_plan = on;
SET client_min_messages = log;  -- also send LOG-level output to this session
 
-- Execute a query; the trees appear in the server log (and in this session)
SELECT name, salary FROM employees WHERE department_id = 5;
 
-- The log will contain the analyzed query tree (a Query node), for example:
-- DETAIL: {QUERY 
--    :commandType 1 
--    :querySource 0 
--    :canSetTag true 
--    :utilityStmt <> 
--    :resultRelation 0 
--    :targetList (
--       {TARGETENTRY 
--          :expr {VAR :varno 1 :varattno 2 :vartype 1043 ...}
--          :resno 1 
--          :resname name 
--          ...
--       }
--       ...
--    )
--    ...
-- }
 
-- MySQL: EXPLAIN shows the optimizer's plan rather than the parse tree itself
EXPLAIN FORMAT=JSON SELECT * FROM employees;
 
-- SQL Server: SHOWPLAN_XML returns the estimated execution plan as XML
SET SHOWPLAN_XML ON;
GO
SELECT name FROM employees;
GO
SET SHOWPLAN_XML OFF;

Third-Party Tools:

Various tools can visualize query plans or expose parse trees:

pgAdmin — Query explain with visual plan
DBeaver — Execution plan visualization
SQL Fiddle — Online query analysis
Language-specific parsers — libpg_query, sql-parser-cst, etc.

Using libpg_query:

This library exposes PostgreSQL's parser as a standalone component:

libpg-query-example.py
# Python example using pglast (a libpg_query wrapper); written for the pglast 3.x+ API
import json
from pglast.parser import parse_sql_json
 
query = "SELECT a, b FROM t WHERE c > 5 ORDER BY a"
 
# parse_sql_json returns the raw parse tree as a JSON string
tree = json.loads(parse_sql_json(query))
 
# Pretty-print the first statement's parse tree
print(json.dumps(tree["stmts"][0]["stmt"], indent=2))
 
# Output (simplified; exact field names and nesting vary by libpg_query version):
# {
#   "SelectStmt": {
#     "targetList": [
#       {"ResTarget": {"val": {"ColumnRef": {"fields": [{"String": "a"}]}}}},
#       {"ResTarget": {"val": {"ColumnRef": {"fields": [{"String": "b"}]}}}}
#     ],
#     "fromClause": [
#       {"RangeVar": {"relname": "t", "inh": true, "relpersistence": "p"}}
#     ],
#     "whereClause": {
#       "A_Expr": {
#         "kind": 0,
#         "name": [">"],
#         "lexpr": {"ColumnRef": {"fields": [{"String": "c"}]}},
#         "rexpr": {"A_Const": {"val": {"Integer": 5}}}
#       }
#     },
#     "sortClause": [...]
#   }
# }

Learning by Exploring

Parsing queries and examining their AST is one of the best ways to deeply understand SQL semantics. Try parsing queries with subtle differences (e.g., implicit vs. explicit joins, different WHERE clause structures) and compare the resulting ASTs.

From Parse Tree to Query Plan

The annotated, transformed parse tree is the input to the query optimizer. Understanding how the AST relates to query plans completes the parsing picture.

Parse Tree vs. Query Plan:

Parse Tree (AST)	Query Plan
Represents SQL structure	Represents execution strategy
Declarative (what to compute)	Imperative (how to compute)
One form per query	Many possible plans per query
Language-level operations	Physical operations (scans, joins)
Input to optimizer	Output of optimizer

Translation Process:

The optimizer translates AST constructs into physical operations:

ast-to-plan.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
-- Query
SELECT e.name, d.dept_name
FROM employees e
JOIN departments d ON e.dept_id = d.id
WHERE e.salary > 50000
ORDER BY e.name;
 
-- Parse Tree (simplified)
SelectStmt
├── targetList: [e.name, d.dept_name]
├── fromClause: JoinExpr(employees e, departments d, e.dept_id = d.id)
├── whereClause: e.salary > 50000
└── sortClause: e.name ASC
 
-- Possible Query Plans (optimizer chooses best):
 
Plan 1: Nested Loop with Index
Sort (by name)
└── Nested Loop Join
    ├── Index Scan on employees (salary > 50000)
    └── Index Scan on departments (id = e.dept_id)
 
Plan 2: Hash Join with Seq Scan
Sort (by name)
└── Hash Join (dept_id = id)
    ├── Seq Scan on employees (salary > 50000)
    └── Hash on departments
 
Plan 3: Merge Join (if both sorted)
Sort (by name)
└── Merge Join (dept_id = id)
    ├── Index Scan on employees (salary > 50000, ordered by dept_id)
    └── Index Scan on departments (ordered by id)

Key Translation Decisions:

FROM clause → Access methods (table scan, index scan, etc.)
JOIN expressions → Join algorithms (nested loop, hash, merge)
WHERE predicates → Filters (may push down to scans)
SELECT expressions → Projection operations
ORDER BY → Sort operations (may use index order)
GROUP BY/aggregates → GroupAggregate or HashAggregate
Subqueries → SubPlan nodes or flattened into main plan

The AST provides the semantic specification; the plan provides the physical recipe.
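The choice among candidate plans can be pictured as picking the cheapest option under a cost model. The sketch below is a deliberately crude assumption: each plan has a fixed startup cost plus a per-row cost, and all numbers are invented. Real optimizers estimate costs from table statistics, I/O, and CPU parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    startup_cost: float   # hypothetical fixed cost (e.g. building a hash table)
    per_row_cost: float   # hypothetical cost per input row

    def total_cost(self, rows: int) -> float:
        return self.startup_cost + self.per_row_cost * rows


# Invented cost figures for the three candidate plans shown above
CANDIDATES = [
    Plan("Nested Loop with Index", startup_cost=1.0, per_row_cost=0.50),
    Plan("Hash Join with Seq Scan", startup_cost=500.0, per_row_cost=0.02),
    Plan("Merge Join", startup_cost=50.0, per_row_cost=0.10),
]


def choose_plan(estimated_rows: int) -> Plan:
    """Return the candidate with the lowest estimated total cost."""
    return min(CANDIDATES, key=lambda plan: plan.total_cost(estimated_rows))


for rows in (10, 2_000, 100_000):
    print(rows, "->", choose_plan(rows).name)
# 10 -> Nested Loop with Index
# 2000 -> Merge Join
# 100000 -> Hash Join with Seq Scan
```

The same AST yields a different winner as the row estimate changes, which is why stale statistics can lead to a poor plan even though the query text is unchanged.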

Optimizer's Freedom

The optimizer is free to rearrange operations as long as the result is semantically equivalent. It can reorder joins, push predicates into scans, and switch between plan shapes, all guided by cost estimates. The AST constrains what is computed; the plan determines how.

Summary: Parse Tree

The parse tree is the fundamental representation of your query that underlies all subsequent processing. Let's consolidate the key concepts:

Key Takeaways

•Parse trees (ASTs) are hierarchical representations of query structure, with statement nodes at the root and expression/literal nodes at leaves.
•AST vs. CST — ASTs omit syntactic details (parentheses, commas) and encode structure (like operator precedence) in tree shape.
•Node types include statement nodes (SelectStmt), expression nodes (OpExpr, FuncCall), and reference nodes (ColumnRef, RangeVar).
•Construction happens during parsing, with semantic actions creating nodes as grammar rules are matched.
•Annotations are added during semantic analysis—type information, catalog references, coercion nodes.
•Transformations rewrite the AST for optimization—view expansion, subquery flattening, constant folding.
•Tools exist to view parse trees—debug settings, EXPLAIN variants, and standalone parsing libraries.
•Plan generation translates the declarative AST into an imperative execution plan with physical operations.

Module Complete:

Congratulations! You've completed the Parsing and Validation module. You now understand the complete journey from raw SQL text to validated, annotated parse tree:

Lexical analysis breaks text into tokens
Syntax analysis validates grammar and builds the parse tree
Semantic analysis verifies meaning (objects exist, types compatible)
Name resolution maps identifiers to specific objects
Type checking ensures type compatibility throughout
The parse tree captures all this information for the optimizer

This foundation is essential for understanding query optimization, which determines HOW your queries execute—the subject of upcoming modules.

Module Complete

You've mastered the parsing and validation phase of query processing. You understand how databases transform SQL text into structured, validated parse trees ready for optimization. This knowledge helps you write better queries, understand error messages, and appreciate the sophistication underlying every SQL statement you execute.

5 / 5

Loading learning content...

Database Management SystemsQuery Processing

Parsing and Validation

LevelIntermediate

Duration75 mins

TopicQuery Processing

5 / 5

Parse Tree

The Query's Skeleton

Understanding parse trees reveals how databases "see" your queries and helps explain why certain query formulations produce different execution plans.

What You Will Learn

Parse Tree Fundamentals

Tree Properties:

Root Node: Represents the entire statement (SELECT, INSERT, UPDATE, etc.)
Internal Nodes: Represent grammatical constructs (clauses, expressions, operators)
Leaf Nodes: Represent tokens (literals, identifiers, keywords)
Edges: Represent containment relationships ("this clause contains these elements")

Concrete Syntax Tree (CST) vs. Abstract Syntax Tree (AST):

Parsers can produce two types of trees:

CST (Parse Tree): Includes every token, every punctuation mark, preserves exact syntax
AST (Abstract Syntax Tree): Omits syntactic sugar, preserves only semantically significant elements

Most databases build ASTs because:

Smaller memory footprint
Easier to analyze and transform
Irrelevant details (parentheses, commas) don't clutter processing

cst-vs-ast.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
-- Query: SELECT (a + b) * c FROM t WHERE x = 1;
 
-- Concrete Syntax Tree (CST) - includes everything:
SelectStatement
├── SELECT (keyword)
├── SelectList
│   └── Expression
│       ├── LPAREN "("
│       ├── Expression
│       │   ├── ColumnRef "a"
│       │   ├── PLUS "+"
│       │   └── ColumnRef "b"
│       ├── RPAREN ")"
│       ├── STAR "*"
│       └── ColumnRef "c"
├── FROM (keyword)
├── TableRef "t"
├── WHERE (keyword)
├── Expression
│   ├── ColumnRef "x"
│   ├── EQUALS "="
│   └── IntLiteral "1"
└── SEMICOLON ";"
 
-- Abstract Syntax Tree (AST) - semantics only:
SelectStmt
├── targetList: [MultExpr]
│   └── MultExpr (*)
│       ├── left: AddExpr (+)
│       │   ├── left: ColumnRef "a"
│       │   └── right: ColumnRef "b"
│       └── right: ColumnRef "c"
├── fromClause: [RangeVar "t"]
└── whereClause: OpExpr (=)
    ├── left: ColumnRef "x"
    └── right: Const 1

Common Node Types in SQL Parse Trees

SQL ASTs contain many node types, each representing a specific syntactic construct. Let's examine the major categories.

Statement Nodes:

These are root nodes representing complete SQL statements:

SQL Statement Node Types
Node Type	SQL Statement	Key Child Nodes
SelectStmt	SELECT query	targetList, fromClause, whereClause, groupClause, sortClause
InsertStmt	INSERT statement	relation, cols, selectStmt/valuesList
UpdateStmt	UPDATE statement	relation, targetList, whereClause, fromClause
DeleteStmt	DELETE statement	relation, whereClause, usingClause
CreateStmt	CREATE TABLE	relation, tableElts (columns), constraints
AlterStmt	ALTER TABLE	relation, alterCmds
IndexStmt	CREATE INDEX	idxname, relation, indexParams

Expression Nodes:

These represent values, calculations, and conditions:

expression-nodes.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
-- Expression Node Types (PostgreSQL internal names)
 
Const           -- Literal values: 42, 'hello', TRUE
ColumnRef       -- Column references: employees.name, e.salary
FuncCall        -- Function calls: UPPER(name), COUNT(*)
OpExpr          -- Binary operators: a + b, x > 5
BoolExpr        -- Boolean combinations: AND, OR, NOT
SubLink         -- Subqueries: (SELECT max(x) FROM t)
CaseExpr        -- CASE expressions
CoalesceExpr    -- COALESCE(a, b, c)
NullTest        -- IS NULL, IS NOT NULL
BooleanTest     -- IS TRUE, IS FALSE, IS UNKNOWN
TypeCast        -- CAST(x AS type), x::type
ArrayExpr       -- ARRAY[1,2,3]
RowExpr         -- ROW(a, b, c)
CoerceExpr      -- Type coercion wrapper
Aggref          -- Aggregate function: SUM, AVG, COUNT
WindowFunc      -- Window function: ROW_NUMBER() OVER(...)

Clause and Reference Nodes:

These represent structural elements within statements:

Clause and Reference Nodes

•RangeVar — Table reference with schema, table name, and alias
•JoinExpr — JOIN with type (INNER/LEFT/etc.), left table, right table, and condition
•RangeSubselect — Subquery in FROM clause with alias
•ResTarget — Item in SELECT list (expression + optional alias)
•SortBy — ORDER BY element with direction and nulls handling
•GroupClause — GROUP BY specification
•WindowDef — Window specification for OVER clause
•CommonTableExpr — CTE definition in WITH clause
•WithClause — Container for CTEs

AST Construction During Parsing

The parser constructs the AST incrementally as it processes the token stream. Understanding this process helps explain how parsing and tree building intertwine.

Bottom-Up Construction (LR Parsers):

Most database parsers use LR parsing, which builds the tree bottom-up:

Read tokens left-to-right
When tokens match a grammar rule, reduce them to a parent node
Continue until the entire query is reduced to a single root node

ast-construction.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
-- Query: SELECT a FROM t WHERE b > 5
 
-- Parsing and AST construction trace:
 
Token Stream: SELECT  a  FROM  t  WHERE  b  >  5
 
Step 1: Read 'SELECT' → Shift to stack
Step 2: Read 'a' → Shift (identifier)
Step 3: 'a' matches column_ref rule → Reduce to ColumnRef node
Step 4: ColumnRef matches target rule → Reduce to ResTarget
Step 5: ResTarget matches target_list → Reduce to [ResTarget]
Step 6: Read 'FROM', 't' → Reduce 't' to RangeVar
Step 7: RangeVar matches from_clause → Reduce to FROM clause
Step 8: Read 'WHERE', 'b', '>', '5'
Step 9: 'b' → ColumnRef, '5' → Const
Step 10: ColumnRef > Const matches operator_expr → Reduce to OpExpr
Step 11: OpExpr matches where_clause → Reduce to WHERE clause
Step 12: target_list + from_clause + where_clause → Reduce to SelectStmt
 
Final AST:
SelectStmt
├── targetList: [ResTarget(ColumnRef "a")]
├── fromClause: [RangeVar "t"]
└── whereClause: OpExpr(ColumnRef "b" > Const 5)

Parser Actions:

During reduction, the parser executes semantic actions—code that creates AST nodes:

// Simplified from PostgreSQL's gram.y
select_with_parens:
    '(' select_no_parens ')'
    {
        // Semantic action: create SelectStmt node
        SelectStmt *n = makeNode(SelectStmt);
        n->targetList = $2->targetList;
        n->fromClause = $2->fromClause;
        n->whereClause = $2->whereClause;
        $$ = n;  // Return the created node
    }
    ;

Each grammar rule has associated code that builds AST nodes when that rule is matched.

Memory Allocation

AST Annotation During Semantic Analysis

The initial AST from parsing contains only syntactic information. Semantic analysis enriches this tree with annotations—additional metadata about each node.

Types of Annotations:

AST Annotations Added During Semantic Analysis
Annotation Type	Applied To	Information Added
Type Info	All expression nodes	Data type, nullability, collation
Catalog References	RangeVar, ColumnRef	OID of table/column, position
Resolution Info	ColumnRef	Which table the column comes from
Coercion Nodes	Type mismatches	Inserted nodes for implicit casts
Aggregate Info	Aggref	Aggregate function OID, distinct flag
Subquery Type	SubLink	EXISTS, IN, scalar, etc.
Parameter Info	Parameter nodes	Parameter number, type

annotated-ast.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
-- Query: SELECT name, salary * 1.1 FROM employees WHERE dept = 'Sales'
 
-- Raw AST (after parsing, before semantic analysis):
SelectStmt
├── targetList:
│   ├── ResTarget(ColumnRef "name")
│   └── ResTarget(OpExpr: ColumnRef "salary" * Const 1.1)
├── fromClause: [RangeVar "employees"]
└── whereClause: OpExpr(ColumnRef "dept" = Const 'Sales')
 
-- Annotated AST (after semantic analysis):
SelectStmt
├── targetList:
│   ├── ResTarget
│   │   └── ColumnRef "name"
│   │       ├── table_oid: 16384 (employees)
│   │       ├── column_num: 2 (name)
│   │       └── type: VARCHAR(100), collation: en_US
│   └── ResTarget  
│       └── OpExpr(*)
│           ├── type: NUMERIC(12,2)
│           ├── left: ColumnRef "salary"
│           │   ├── table_oid: 16384
│           │   ├── column_num: 4
│           │   └── type: NUMERIC(10,2)
│           └── right: Const 1.1
│               └── type: NUMERIC
├── fromClause:
│   └── RangeVar "employees"
│       ├── schema_oid: 2200 (public)
│       ├── relation_oid: 16384
│       └── alias: none
└── whereClause:
    └── OpExpr(=)
        ├── type: BOOLEAN
        ├── left: ColumnRef "dept"
        │   ├── table_oid: 16384
        │   ├── column_num: 3
        │   └── type: VARCHAR(50)
        └── right: Const 'Sales'
            └── type: VARCHAR (coerced from unknown)

Coercion Node Insertion:

When types don't match exactly, the analyzer inserts coercion nodes:

-- Original expression: int_column = 3.14
OpExpr(=)
├── left: ColumnRef (type: INTEGER)
└── right: Const 3.14 (type: NUMERIC)

-- After type checking (coercion inserted):
OpExpr(=)
├── left: CoerceExpr 
│   ├── arg: ColumnRef (type: INTEGER)
│   └── resulttype: NUMERIC
└── right: Const 3.14 (type: NUMERIC)

The CoerceExpr node tells the executor to convert the integer to numeric before comparison.

AST Transformations and Rewrites

After semantic analysis, the AST may undergo transformations that rewrite it into equivalent but more optimizable forms. These transformations prepare the query for optimization.

Common AST Transformations:

Query Rewrite Transformations

•View Expansion — Replace view references with underlying query definition
•Subquery Flattening — Convert some subqueries to joins for better optimization
•Constant Folding — Evaluate constant expressions at compile time (2 + 3 → 5)
•Predicate Normalization — Convert to canonical form (NOT (a > b) → a <= b)
•IN-list Expansion — Convert IN (1,2,3) to OR conditions or array format
•Exists-to-Join — Convert EXISTS subqueries to semi-joins where possible
•Rule-Based Rewriting — Apply user-defined rewrite rules

ast-transformation-examples.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
-- Example 1: View Expansion
 
-- View definition:
CREATE VIEW active_employees AS 
SELECT * FROM employees WHERE status = 'active';
 
-- User query:
SELECT name FROM active_employees WHERE salary > 50000;
 
-- After view expansion (conceptual AST transformation):
SELECT name FROM employees WHERE status = 'active' AND salary > 50000;
 
-- Example 2: Subquery Flattening
 
-- Original:
SELECT * FROM orders WHERE customer_id IN (
    SELECT id FROM customers WHERE region = 'West'
);
 
-- Flattened to semi-join:
SELECT orders.* FROM orders 
SEMI JOIN customers ON orders.customer_id = customers.id
WHERE customers.region = 'West';
 
-- Example 3: Constant Folding
 
-- Original:
SELECT * FROM t WHERE date > '2024-01-01'::DATE + INTERVAL '30 days';
 
-- After constant folding:
SELECT * FROM t WHERE date > '2024-01-31'::DATE;

Converting Mermaid diagram...

Viewing Parse Trees in Database Systems

Many database systems provide ways to inspect the parse tree or its representations. This is invaluable for understanding query processing and debugging.

PostgreSQL:

PostgreSQL offers several debugging options:

viewing-parse-trees.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
-- PostgreSQL: Enable parse tree debugging
SET debug_print_parse = on;
SET debug_print_rewritten = on;
SET debug_print_plan = on;
SET client_min_messages = log;
 
-- Execute a query and check server log for parse tree output
SELECT name, salary FROM employees WHERE department_id = 5;
 
-- The log will contain detailed parse tree representation:
-- DETAIL: {QUERY 
--    :commandType 1 
--    :querySource 0 
--    :canSetTag true 
--    :utilityStmt <> 
--    :resultRelation 0 
--    :targetList (
--       {TARGETENTRY 
--          :expr {VAR :varno 1 :varattno 2 :vartype 1043 ...}
--          :resno 1 
--          :resname name 
--          ...
--       }
--       ...
--    )
--    ...
-- }
 
-- MySQL: EXPLAIN FORMAT=JSON shows the chosen execution plan, not the parse tree itself
EXPLAIN FORMAT=JSON SELECT * FROM employees;
 
-- SQL Server: SHOWPLAN_XML returns the estimated plan as XML without running the query
SET SHOWPLAN_XML ON;
GO
SELECT name FROM employees;
GO
SET SHOWPLAN_XML OFF;

Third-Party Tools:

Various tools can visualize parse trees or the execution plans derived from them:

pgAdmin — Query explain with visual plan
DBeaver — Execution plan visualization
SQL Fiddle — Online query analysis
Language-specific parsers — libpg_query, sql-parser-cst, etc.

Using libpg_query:

This library exposes PostgreSQL's parser as a standalone component:

libpg-query-example.py
# Python example using pglast (a libpg_query wrapper); assumes pglast 3 or later
import json

from pglast.parser import parse_sql_json

query = "SELECT a, b FROM t WHERE c > 5 ORDER BY a"

# Parse the query; libpg_query returns the tree serialized as JSON
tree = json.loads(parse_sql_json(query))

# Pretty-print the parse tree of the first statement
print(json.dumps(tree["stmts"][0]["stmt"], indent=2))

# Output (simplified; exact field names vary across libpg_query versions):
# {
#   "SelectStmt": {
#     "targetList": [
#       {"ResTarget": {"val": {"ColumnRef": {"fields": [{"String": "a"}]}}}},
#       {"ResTarget": {"val": {"ColumnRef": {"fields": [{"String": "b"}]}}}}
#     ],
#     "fromClause": [
#       {"RangeVar": {"relname": "t", "inh": true, "relpersistence": "p"}}
#     ],
#     "whereClause": {
#       "A_Expr": {
#         "kind": 0,
#         "name": [">"],
#         "lexpr": {"ColumnRef": {"fields": [{"String": "c"}]}},
#         "rexpr": {"A_Const": {"val": {"Integer": 5}}}
#       }
#     },
#     "sortClause": [...]
#   }
# }
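
Once the tree is plain nested data, analysing it is ordinary tree traversal. The sketch below walks a hand-written dictionary that mirrors the simplified output above and collects every column reference. The `column_refs` helper is a hypothetical name, and the dictionary layout is an assumption copied from the simplified listing; real libpg_query output has more fields and differs between versions.

```python
def column_refs(node):
    """Yield the column name of every ColumnRef in a nested dict/list tree."""
    if isinstance(node, dict):
        for key, child in node.items():
            if key == "ColumnRef":
                # Qualified names (table.column) have several fields.
                yield ".".join(field["String"] for field in child["fields"])
            else:
                yield from column_refs(child)
    elif isinstance(node, list):
        for child in node:
            yield from column_refs(child)


# Hand-written stand-in for the simplified parse tree (assumed layout).
tree = {
    "SelectStmt": {
        "targetList": [
            {"ResTarget": {"val": {"ColumnRef": {"fields": [{"String": "a"}]}}}},
            {"ResTarget": {"val": {"ColumnRef": {"fields": [{"String": "b"}]}}}},
        ],
        "fromClause": [{"RangeVar": {"relname": "t"}}],
        "whereClause": {
            "A_Expr": {
                "kind": 0,
                "name": [">"],
                "lexpr": {"ColumnRef": {"fields": [{"String": "c"}]}},
                "rexpr": {"A_Const": {"val": {"Integer": 5}}},
            }
        },
    }
}

print(list(column_refs(tree)))
```

The same recursive pattern works for other questions, such as listing referenced tables by matching `RangeVar` instead of `ColumnRef`.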

Learning by Exploring

From Parse Tree to Query Plan

The annotated, transformed parse tree is the input to the query optimizer. Understanding how the AST relates to query plans completes the parsing picture.

Parse Tree vs. Query Plan:

Parse Tree (AST)	Query Plan
Represents SQL structure	Represents execution strategy
Declarative (what to compute)	Imperative (how to compute)
One form per query	Many possible plans per query
Language-level operations	Physical operations (scans, joins)
Input to optimizer	Output of optimizer

Translation Process:

The optimizer translates AST constructs into physical operations:

ast-to-plan.txt
-- Query
SELECT e.name, d.dept_name
FROM employees e
JOIN departments d ON e.dept_id = d.id
WHERE e.salary > 50000
ORDER BY e.name;
 
-- Parse Tree (simplified)
SelectStmt
├── targetList: [e.name, d.dept_name]
├── fromClause: JoinExpr(employees e, departments d, e.dept_id = d.id)
├── whereClause: e.salary > 50000
└── sortClause: e.name ASC
 
-- Possible Query Plans (optimizer chooses best):
 
Plan 1: Nested Loop with Index
Sort (by name)
└── Nested Loop Join
    ├── Index Scan on employees (salary > 50000)
    └── Index Scan on departments (id = e.dept_id)
 
Plan 2: Hash Join with Seq Scan
Sort (by name)
└── Hash Join (dept_id = id)
    ├── Seq Scan on employees (salary > 50000)
    └── Hash on departments
 
Plan 3: Merge Join (if both sorted)
Sort (by name)
└── Merge Join (dept_id = id)
    ├── Index Scan on employees (salary > 50000, ordered by dept_id)
    └── Index Scan on departments (ordered by id)
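
How the optimizer picks among such candidates can be sketched with a toy cost model. Everything in the block below is an assumption made for illustration: the `TableStats` fields, the two cost formulas, and the probe cost of 4.0 are invented, and real optimizers use far more detailed statistics and cost terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableStats:
    rows: int           # total rows in the table
    selectivity: float  # fraction of rows that survive the table's filter


def nested_loop_cost(outer: TableStats, inner: TableStats, probe_cost: float = 4.0) -> float:
    # Toy formula: one index probe into the inner table per qualifying outer row.
    return outer.rows * outer.selectivity * probe_cost


def hash_join_cost(outer: TableStats, inner: TableStats) -> float:
    # Toy formula: read both inputs once (build on inner, probe with outer).
    return float(outer.rows + inner.rows)


def choose_plan(outer: TableStats, inner: TableStats):
    """Return the cheapest plan name plus the cost of every candidate."""
    costs = {
        "Nested Loop with Index": nested_loop_cost(outer, inner),
        "Hash Join with Seq Scan": hash_join_cost(outer, inner),
    }
    return min(costs, key=costs.get), costs


departments = TableStats(rows=50, selectivity=1.0)
selective = TableStats(rows=100_000, selectivity=0.01)    # salary filter keeps 1%
unselective = TableStats(rows=100_000, selectivity=0.80)  # salary filter keeps 80%

print(choose_plan(selective, departments)[0])
print(choose_plan(unselective, departments)[0])
```

With these made-up numbers the same query text gets a nested loop when the salary filter is selective and a hash join when it is not, which is why the parse tree alone does not determine the plan.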

Key Translation Decisions:

FROM clause → Access methods (table scan, index scan, etc.)
JOIN expressions → Join algorithms (nested loop, hash, merge)
WHERE predicates → Filters (may push down to scans)
SELECT expressions → Projection operations
ORDER BY → Sort operations (may use index order)
GROUP BY/aggregates → GroupAggregate or HashAggregate
Subqueries → SubPlan nodes or flattened into main plan

The AST provides the semantic specification; the plan provides the physical recipe.
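
The clause-to-operator mapping listed above can be sketched as a naive translator for a single-table query. The dictionary keys (`select`, `from`, `where`, `order_by`) and the operator names are hypothetical, and the sketch performs no optimization: it always picks a sequential scan and applies operators in a fixed order.

```python
def to_naive_plan(select_stmt: dict) -> dict:
    """Build an operator tree bottom-up: SeqScan -> Filter -> Sort -> Project."""
    plan = {"op": "SeqScan", "table": select_stmt["from"]}
    if "where" in select_stmt:
        plan = {"op": "Filter", "predicate": select_stmt["where"], "input": plan}
    if "order_by" in select_stmt:
        # Sort before projecting so the sort key is still available.
        plan = {"op": "Sort", "keys": select_stmt["order_by"], "input": plan}
    return {"op": "Project", "columns": select_stmt["select"], "input": plan}


def operators(plan: dict) -> list:
    """List operator names from the root of the plan down to the scan."""
    names = []
    while plan is not None:
        names.append(plan["op"])
        plan = plan.get("input")
    return names


# SELECT name FROM employees WHERE salary > 50000 ORDER BY name
stmt = {
    "select": ["name"],
    "from": "employees",
    "where": "salary > 50000",
    "order_by": ["name"],
}
plan = to_naive_plan(stmt)
print(operators(plan))
```

A real optimizer starts from the same information but considers alternatives at each step, for example replacing the scan and filter with an index scan, or dropping the sort when an index already delivers rows in order.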

Optimizer's Freedom

Summary: Parse Tree

The parse tree is the fundamental representation of your query that underlies all subsequent processing. Let's consolidate the key concepts:

Key Takeaways

•Parse trees (ASTs) are hierarchical representations of query structure, with statement nodes at the root and expression/literal nodes at leaves.
•AST vs. CST — ASTs omit syntactic details (parentheses, commas) and encode structure (like operator precedence) in tree shape.
•Node types include statement nodes (SelectStmt), expression nodes (OpExpr, FuncCall), and reference nodes (ColumnRef, RangeVar).
•Construction happens during parsing, with semantic actions creating nodes as grammar rules are matched.
•Annotations are added during semantic analysis—type information, catalog references, coercion nodes.
•Transformations rewrite the AST for optimization—view expansion, subquery flattening, constant folding.
•Tools exist to view parse trees—debug settings, EXPLAIN variants, and standalone parsing libraries.
•Plan generation translates the declarative AST into an imperative execution plan with physical operations.

Module Complete:

Congratulations! You've completed the Parsing and Validation module. You now understand the complete journey from raw SQL text to validated, annotated parse tree:

Lexical analysis breaks text into tokens
Syntax analysis validates grammar and builds the parse tree
Semantic analysis verifies meaning (objects exist, types compatible)
Name resolution maps identifiers to specific objects
Type checking ensures type compatibility throughout
The parse tree captures all this information for the optimizer

This foundation is essential for understanding query optimization, which determines HOW your queries execute—the subject of upcoming modules.
