Loading learning content...
When you write a SQL query like SELECT name FROM employees WHERE salary > 50000, you're writing for humans to read. But databases don't execute text—they execute structured operations on data. The bridge between human-readable SQL and executable operations is the parse tree, also called the Abstract Syntax Tree (AST).
The parse tree is a hierarchical representation of your query's structure. Each clause becomes a subtree, each column reference becomes a node, and the relationships between elements become edges. This tree is the "skeleton" upon which all subsequent processing hangs—semantic analysis annotates it, optimization transforms it, and execution interprets it.
Understanding parse trees reveals how databases "see" your queries and helps explain why certain query formulations produce different execution plans.
By the end of this page, you will understand parse tree structure, node types, and construction process. You'll learn how annotations enrich the tree during semantic analysis, how the tree is transformed for optimization, and how to visualize and interpret parse trees in real database systems.
A parse tree is a tree data structure that represents the syntactic structure of a query according to the SQL grammar. Understanding its properties is essential for understanding query processing.
Tree Properties:
Concrete Syntax Tree (CST) vs. Abstract Syntax Tree (AST):
Parsers can produce two types of trees:
Most databases build ASTs because:
123456789101112131415161718192021222324252627282930313233343536
-- Query: SELECT (a + b) * c FROM t WHERE x = 1; -- Concrete Syntax Tree (CST) - includes everything:SelectStatement├── SELECT (keyword)├── SelectList│ └── Expression│ ├── LPAREN "("│ ├── Expression│ │ ├── ColumnRef "a"│ │ ├── PLUS "+"│ │ └── ColumnRef "b"│ ├── RPAREN ")"│ ├── STAR "*"│ └── ColumnRef "c"├── FROM (keyword)├── TableRef "t"├── WHERE (keyword)├── Expression│ ├── ColumnRef "x"│ ├── EQUALS "="│ └── IntLiteral "1"└── SEMICOLON ";" -- Abstract Syntax Tree (AST) - semantics only:SelectStmt├── targetList: [MultExpr]│ └── MultExpr (*)│ ├── left: AddExpr (+)│ │ ├── left: ColumnRef "a"│ │ └── right: ColumnRef "b"│ └── right: ColumnRef "c"├── fromClause: [RangeVar "t"]└── whereClause: OpExpr (=) ├── left: ColumnRef "x" └── right: Const 1Notice how the AST omits parentheses—the tree structure itself encodes operator precedence. The addition happens first because it's a child of the multiplication node. This is why ASTs are preferred: they encode meaning in structure rather than relying on syntactic markers.
SQL ASTs contain many node types, each representing a specific syntactic construct. Let's examine the major categories.
Statement Nodes:
These are root nodes representing complete SQL statements:
| Node Type | SQL Statement | Key Child Nodes |
|---|---|---|
| SelectStmt | SELECT query | targetList, fromClause, whereClause, groupClause, sortClause |
| InsertStmt | INSERT statement | relation, cols, selectStmt/valuesList |
| UpdateStmt | UPDATE statement | relation, targetList, whereClause, fromClause |
| DeleteStmt | DELETE statement | relation, whereClause, usingClause |
| CreateStmt | CREATE TABLE | relation, tableElts (columns), constraints |
| AlterStmt | ALTER TABLE | relation, alterCmds |
| IndexStmt | CREATE INDEX | idxname, relation, indexParams |
Expression Nodes:
These represent values, calculations, and conditions:
123456789101112131415161718
-- Expression Node Types (PostgreSQL internal names) Const -- Literal values: 42, 'hello', TRUEColumnRef -- Column references: employees.name, e.salaryFuncCall -- Function calls: UPPER(name), COUNT(*)OpExpr -- Binary operators: a + b, x > 5BoolExpr -- Boolean combinations: AND, OR, NOTSubLink -- Subqueries: (SELECT max(x) FROM t)CaseExpr -- CASE expressionsCoalesceExpr -- COALESCE(a, b, c)NullTest -- IS NULL, IS NOT NULLBooleanTest -- IS TRUE, IS FALSE, IS UNKNOWNTypeCast -- CAST(x AS type), x::typeArrayExpr -- ARRAY[1,2,3]RowExpr -- ROW(a, b, c)CoerceExpr -- Type coercion wrapperAggref -- Aggregate function: SUM, AVG, COUNTWindowFunc -- Window function: ROW_NUMBER() OVER(...)Clause and Reference Nodes:
These represent structural elements within statements:
The parser constructs the AST incrementally as it processes the token stream. Understanding this process helps explain how parsing and tree building intertwine.
Bottom-Up Construction (LR Parsers):
Most database parsers use LR parsing, which builds the tree bottom-up:
123456789101112131415161718192021222324
-- Query: SELECT a FROM t WHERE b > 5 -- Parsing and AST construction trace: Token Stream: SELECT a FROM t WHERE b > 5 Step 1: Read 'SELECT' → Shift to stackStep 2: Read 'a' → Shift (identifier)Step 3: 'a' matches column_ref rule → Reduce to ColumnRef nodeStep 4: ColumnRef matches target rule → Reduce to ResTargetStep 5: ResTarget matches target_list → Reduce to [ResTarget]Step 6: Read 'FROM', 't' → Reduce 't' to RangeVarStep 7: RangeVar matches from_clause → Reduce to FROM clauseStep 8: Read 'WHERE', 'b', '>', '5'Step 9: 'b' → ColumnRef, '5' → ConstStep 10: ColumnRef > Const matches operator_expr → Reduce to OpExprStep 11: OpExpr matches where_clause → Reduce to WHERE clauseStep 12: target_list + from_clause + where_clause → Reduce to SelectStmt Final AST:SelectStmt├── targetList: [ResTarget(ColumnRef "a")]├── fromClause: [RangeVar "t"]└── whereClause: OpExpr(ColumnRef "b" > Const 5)Parser Actions:
During reduction, the parser executes semantic actions—code that creates AST nodes:
// Simplified from PostgreSQL's gram.y
select_with_parens:
'(' select_no_parens ')'
{
// Semantic action: create SelectStmt node
SelectStmt *n = makeNode(SelectStmt);
n->targetList = $2->targetList;
n->fromClause = $2->fromClause;
n->whereClause = $2->whereClause;
$$ = n; // Return the created node
}
;
Each grammar rule has associated code that builds AST nodes when that rule is matched.
Database parsers typically allocate AST nodes from a memory context (memory pool) associated with the query. This allows efficient bulk deallocation when the query completes, avoiding individual free() calls for thousands of nodes.
The initial AST from parsing contains only syntactic information. Semantic analysis enriches this tree with annotations—additional metadata about each node.
Types of Annotations:
| Annotation Type | Applied To | Information Added |
|---|---|---|
| Type Info | All expression nodes | Data type, nullability, collation |
| Catalog References | RangeVar, ColumnRef | OID of table/column, position |
| Resolution Info | ColumnRef | Which table the column comes from |
| Coercion Nodes | Type mismatches | Inserted nodes for implicit casts |
| Aggregate Info | Aggref | Aggregate function OID, distinct flag |
| Subquery Type | SubLink | EXISTS, IN, scalar, etc. |
| Parameter Info | Parameter nodes | Parameter number, type |
1234567891011121314151617181920212223242526272829303132333435363738394041
-- Query: SELECT name, salary * 1.1 FROM employees WHERE dept = 'Sales' -- Raw AST (after parsing, before semantic analysis):SelectStmt├── targetList:│ ├── ResTarget(ColumnRef "name")│ └── ResTarget(OpExpr: ColumnRef "salary" * Const 1.1)├── fromClause: [RangeVar "employees"]└── whereClause: OpExpr(ColumnRef "dept" = Const 'Sales') -- Annotated AST (after semantic analysis):SelectStmt├── targetList:│ ├── ResTarget│ │ └── ColumnRef "name"│ │ ├── table_oid: 16384 (employees)│ │ ├── column_num: 2 (name)│ │ └── type: VARCHAR(100), collation: en_US│ └── ResTarget │ └── OpExpr(*)│ ├── type: NUMERIC(12,2)│ ├── left: ColumnRef "salary"│ │ ├── table_oid: 16384│ │ ├── column_num: 4│ │ └── type: NUMERIC(10,2)│ └── right: Const 1.1│ └── type: NUMERIC├── fromClause:│ └── RangeVar "employees"│ ├── schema_oid: 2200 (public)│ ├── relation_oid: 16384│ └── alias: none└── whereClause: └── OpExpr(=) ├── type: BOOLEAN ├── left: ColumnRef "dept" │ ├── table_oid: 16384 │ ├── column_num: 3 │ └── type: VARCHAR(50) └── right: Const 'Sales' └── type: VARCHAR (coerced from unknown)Coercion Node Insertion:
When types don't match exactly, the analyzer inserts coercion nodes:
-- Original expression: int_column = 3.14
OpExpr(=)
├── left: ColumnRef (type: INTEGER)
└── right: Const 3.14 (type: NUMERIC)
-- After type checking (coercion inserted):
OpExpr(=)
├── left: CoerceExpr
│ ├── arg: ColumnRef (type: INTEGER)
│ └── resulttype: NUMERIC
└── right: Const 3.14 (type: NUMERIC)
The CoerceExpr node tells the executor to convert the integer to numeric before comparison.
After semantic analysis, the AST may undergo transformations that rewrite it into equivalent but more optimizable forms. These transformations prepare the query for optimization.
Common AST Transformations:
12345678910111213141516171819202122232425262728293031
-- Example 1: View Expansion -- View definition:CREATE VIEW active_employees AS SELECT * FROM employees WHERE status = 'active'; -- User query:SELECT name FROM active_employees WHERE salary > 50000; -- After view expansion (conceptual AST transformation):SELECT name FROM employees WHERE status = 'active' AND salary > 50000; -- Example 2: Subquery Flattening -- Original:SELECT * FROM orders WHERE customer_id IN ( SELECT id FROM customers WHERE region = 'West'); -- Flattened to semi-join:SELECT orders.* FROM orders SEMI JOIN customers ON orders.customer_id = customers.idWHERE customers.region = 'West'; -- Example 3: Constant Folding -- Original:SELECT * FROM t WHERE date > '2024-01-01'::DATE + INTERVAL '30 days'; -- After constant folding:SELECT * FROM t WHERE date > '2024-01-31'::DATE;Many database systems provide ways to inspect the parse tree or its representations. This is invaluable for understanding query processing and debugging.
PostgreSQL:
PostgreSQL offers several debugging options:
12345678910111213141516171819202122232425262728293031323334353637
-- PostgreSQL: Enable parse tree debuggingSET debug_print_parse = on;SET debug_print_rewritten = on;SET debug_print_plan = on;SET client_min_messages = log; -- Execute a query and check server log for parse tree outputSELECT name, salary FROM employees WHERE department_id = 5; -- The log will contain detailed parse tree representation:-- DETAIL: {QUERY -- :commandType 1 -- :querySource 0 -- :canSetTag true -- :utilityStmt <> -- :resultRelation 0 -- :targetList (-- {TARGETENTRY -- :expr {VAR :varno 1 :varattno 2 :vartype 1043 ...}-- :resno 1 -- :resname name -- ...-- }-- ...-- )-- ...-- } -- MySQL: EXPLAIN can show some internal representationEXPLAIN FORMAT=JSON SELECT * FROM employees; -- SQL Server: Execution plan XML includes parsed structureSET SHOWPLAN_XML ON;GOSELECT name FROM employees;GOSET SHOWPLAN_XML OFF;Third-Party Tools:
Various tools can visualize parse trees:
Using libpg_query:
This library exposes PostgreSQL's parser as a standalone component:
12345678910111213141516171819202122232425262728293031323334
# Python example using pglast (libpg_query wrapper)from pglast import parse_sqlfrom pglast.stream import RawStreamimport json query = "SELECT a, b FROM t WHERE c > 5 ORDER BY a" # Parse query to ASTtree = parse_sql(query) # Pretty-print the parse treeprint(json.dumps(tree.stmts[0].stmt, indent=2, default=str)) # Output (simplified):# {# "SelectStmt": {# "targetList": [# {"ResTarget": {"val": {"ColumnRef": {"fields": [{"String": "a"}]}}}},# {"ResTarget": {"val": {"ColumnRef": {"fields": [{"String": "b"}]}}}}# ],# "fromClause": [# {"RangeVar": {"relname": "t", "inh": true, "relpersistence": "p"}}# ],# "whereClause": {# "A_Expr": {# "kind": 0,# "name": [">"],# "lexpr": {"ColumnRef": {"fields": [{"String": "c"}]}},# "rexpr": {"A_Const": {"val": {"Integer": 5}}}# }# },# "sortClause": [...]# }# }Parsing queries and examining their AST is one of the best ways to deeply understand SQL semantics. Try parsing queries with subtle differences (e.g., implicit vs. explicit joins, different WHERE clause structures) and compare the resulting ASTs.
The annotated, transformed parse tree is the input to the query optimizer. Understanding how the AST relates to query plans completes the parsing picture.
Parse Tree vs. Query Plan:
| Parse Tree (AST) | Query Plan |
|---|---|
| Represents SQL structure | Represents execution strategy |
| Declarative (what to compute) | Imperative (how to compute) |
| One form per query | Many possible plans per query |
| Language-level operations | Physical operations (scans, joins) |
| Input to optimizer | Output of optimizer |
Translation Process:
The optimizer translates AST constructs into physical operations:
123456789101112131415161718192021222324252627282930313233
-- QuerySELECT e.name, d.dept_nameFROM employees eJOIN departments d ON e.dept_id = d.idWHERE e.salary > 50000ORDER BY e.name; -- Parse Tree (simplified)SelectStmt├── targetList: [e.name, d.dept_name]├── fromClause: JoinExpr(employees e, departments d, e.dept_id = d.id)├── whereClause: e.salary > 50000└── sortClause: e.name ASC -- Possible Query Plans (optimizer chooses best): Plan 1: Nested Loop with IndexSort (by name)└── Nested Loop Join ├── Index Scan on employees (salary > 50000) └── Index Scan on departments (id = e.dept_id) Plan 2: Hash Join with Seq ScanSort (by name)└── Hash Join (dept_id = id) ├── Seq Scan on employees (salary > 50000) └── Hash on departments Plan 3: Merge Join (if both sorted)Sort (by name)└── Merge Join (dept_id = id) ├── Index Scan on employees (salary > 50000, ordered by dept_id) └── Index Scan on departments (ordered by id)Key Translation Decisions:
The AST provides the semantic specification; the plan provides the physical recipe.
The optimizer has complete freedom to rearrange operations as long as the result is semantically equivalent. It can reorder joins, push predicates into scans, convert between plan shapes, and more—all guided by cost estimates. The AST constrains WHAT is computed; the plan determines HOW.
The parse tree is the fundamental representation of your query that underlies all subsequent processing. Let's consolidate the key concepts:
Module Complete:
Congratulations! You've completed the Parsing and Validation module. You now understand the complete journey from raw SQL text to validated, annotated parse tree:
This foundation is essential for understanding query optimization, which determines HOW your queries execute—the subject of upcoming modules.
You've mastered the parsing and validation phase of query processing. You understand how databases transform SQL text into structured, validated parse trees ready for optimization. This knowledge helps you write better queries, understand error messages, and appreciate the sophistication underlying every SQL statement you execute.