Loading learning content...
The parse tree tells us the structure of the SQL statement—which keywords appear where, how expressions nest, what identifiers are referenced. But structure alone isn't enough. The database must now understand what this query means and how to compute its result.
This is the job of the translation phase (also called semantic analysis or query translation). It transforms the syntactic parse tree into a logical query plan expressed in relational algebra—the mathematical foundation of relational databases.
Translation is where SQL meets database theory. The declarative "what I want" becomes a precise specification of "what operations produce it."
By the end of this page, you will understand semantic analysis (name resolution, type checking, scope analysis), the relational algebra operations that form logical query plans, how SQL constructs map to relational algebra operators, and how the logical plan provides a foundation for optimization.
The parser verifies that SQL is syntactically valid—tokens appear in legal sequences. Semantic analysis verifies that SQL is semantically valid—it makes sense in the context of the database schema.
Syntax vs. Semantics:
-- Syntactically valid, semantically invalid:
SELECT unicorn_column FROM nonexistent_table WHERE 'hello' > 42;
This query parses successfully—the structure is correct. But semantic analysis will reject it:
nonexistent_table doesn't existunicorn_column is unknownCore Semantic Analysis Tasks:
The first semantic task is name resolution: connecting identifiers in the query to actual database objects. This requires consulting the system catalog (also called data dictionary or metadata repository).
The System Catalog:
Every relational database maintains metadata tables describing all objects:
Resolution Process:
12345678910111213
-- Query to resolve:SELECT o.order_id, c.customer_name, o.totalFROM orders o JOIN customers c ON o.customer_id = c.idWHERE o.status = 'completed'; -- Name resolution must answer:-- 1. Does table 'orders' exist? In which schema?-- 2. Does table 'customers' exist? In which schema?-- 3. Does 'orders' have columns: order_id, customer_id, total, status?-- 4. Does 'customers' have columns: customer_name, id?-- 5. Are aliases 'o' and 'c' unique within this query scope?-- 6. Does current user have SELECT permission on these tables/columns?Schema Search Path:
When a table name isn't schema-qualified (e.g., orders not sales.orders), the database searches schemas in a configured order:
-- PostgreSQL search path
SHOW search_path;
-- Result: "$user", public
-- Table 'orders' will be resolved by checking:
-- 1. schema = current_user (e.g., 'john')
-- 2. schema = 'public'
-- First match wins; if no match, error
-- Explicitly set search path
SET search_path = sales, inventory, public;
Alias Handling:
Aliases create temporary names within query scope. The resolver maintains a scope stack tracking which aliases are valid at each nesting level.
When multiple tables have columns with the same name, unqualified references are ambiguous. For example, if both 'orders' and 'order_items' have a column 'created_at', the query SELECT created_at is ambiguous. Most databases reject ambiguous references with an error like "column reference 'created_at' is ambiguous".
After resolving names, the analyzer must verify that operations are type-safe. This involves checking that operators and functions receive compatible argument types.
Type Checking Rules:
Type Coercion:
When types don't exactly match but are compatible, the database applies implicit type coercion (also called type casting or type promotion):
| From Type | To Type | Example | Notes |
|---|---|---|---|
| INTEGER | BIGINT | INT + BIGINT | Promote to wider type |
| INTEGER | FLOAT | INT * FLOAT | Promote to float |
| VARCHAR | TEXT | VARCHAR = TEXT | Promote to longer text |
| DATE | TIMESTAMP | DATE < TIMESTAMP | Add midnight time |
| INTEGER | VARCHAR | INT \|\| VARCHAR | Concatenation context |
| NULL | (any) | Comparison with NULL | NULL propagation |
123456789101112131415161718192021222324
-- Implicit coercion examples -- Integer to float in arithmeticSELECT 42 + 3.14; -- Result: 45.14 (INT promoted to FLOAT) -- Integer to string in concatenationSELECT 'Order #' || 12345; -- Result: 'Order #12345' -- Date to timestamp in comparisonSELECT * FROM events WHERE event_time > '2024-01-15'; -- String coerced to timestamp -- Integer to decimal for precisionSELECT 100 / 3; -- Integer division: 33SELECT 100.0 / 3; -- Decimal division: 33.333...SELECT CAST(100 AS DECIMAL) / 3; -- Explicit cast -- Type errors (depending on database strictness)SELECT 'hello' + 5; -- Error in strict SQLSELECT 'hello' < 100; -- Error: incompatible types -- Explicit casts when implicit coercion isn't availableSELECT CAST(quantity AS VARCHAR) || ' units' FROM inventory;SELECT order_date::TEXT FROM orders; -- PostgreSQL-specific syntaxType Inference:
For expressions and subqueries, the analyzer infers result types by examining the operations:
-- Inferred types for SELECT expressions
SELECT
order_id, -- Type: from catalog (INTEGER)
quantity * price AS total, -- Type: numeric (from * operator)
UPPER(customer_name), -- Type: TEXT (from function signature)
order_date + INTERVAL '7 days' -- Type: TIMESTAMP (date + interval)
FROM orders;
Type information propagates through the query tree, ensuring every expression has a known type before execution planning begins.
Implicit coercion can cause unexpected results. Comparing VARCHAR to INTEGER may work but not use indexes efficiently. Always use explicit casts when types don't naturally align, and be especially careful with date/time comparisons against string literals.
With semantic analysis complete, the parse tree transforms into a logical query plan expressed in relational algebra. Relational algebra is the mathematical foundation of relational databases, providing a formal language for expressing data manipulation operations.
Why Relational Algebra?
Core Relational Algebra Operators:
| Operator | Symbol | Description | SQL Equivalent |
|---|---|---|---|
| Selection | σ (sigma) | Filter rows matching a predicate | WHERE clause |
| Projection | π (pi) | Choose specific columns | SELECT column list |
| Cartesian Product | × (cross) | All combinations of rows | CROSS JOIN |
| Join | ⋈ (bowtie) | Combine related rows | [INNER] JOIN ... ON |
| Union | ∪ | Combine row sets | UNION |
| Difference | − | Rows in first but not second | EXCEPT |
| Intersection | ∩ | Rows in both sets | INTERSECT |
| Rename | ρ (rho) | Rename relation or attributes | AS aliases |
Extended Operators for SQL Features:
Standard relational algebra is extended to support SQL's full feature set:
| Extended Operator | Purpose | SQL Feature |
|---|---|---|
| Aggregation (γ) | Group-by and aggregate functions | GROUP BY, SUM(), COUNT() |
| Outer Join (⟕⟖⟗) | Preserve unmatched rows | LEFT/RIGHT/FULL OUTER JOIN |
| Duplicate Elimination (δ) | Remove duplicate rows | DISTINCT |
| Sorting (τ) | Order result rows | ORDER BY |
| Limit (λ) | Restrict result size | LIMIT, OFFSET |
The translator systematically converts SQL constructs into relational algebra operators. Each SQL clause maps to specific algebra operations.
Translation Rules:
123456789101112131415161718192021222324252627282930313233343536373839
-- SQL QuerySELECT DISTINCT c.name, o.totalFROM customers c JOIN orders o ON c.id = o.customer_idWHERE o.status = 'completed' AND o.total > 1000ORDER BY o.total DESCLIMIT 10; -- Relational Algebra Translation (bottom-up): -- Step 1: Base relations with renameρ c: customers (Customers)ρ o: orders (Orders) -- Step 2: Join operation⋈ c.id = o.customer_id (c × o) -- Step 3: Selection (WHERE clause)σ o.status = 'completed' ∧ o.total > 1000 (joined) -- Step 4: Projection (SELECT columns)π c.name, o.total (selected) -- Step 5: Duplicate elimination (DISTINCT)δ (projected) -- Step 6: Sorting (ORDER BY)τ o.total DESC (distinct) -- Step 7: Limitλ 10 (sorted) -- Complete algebra expression (nested notation):λ₁₀ (τ_{total DESC} (δ (π_{c.name, o.total} ( σ_{status='completed' ∧ total>1000} ( ⋈_{c.id = o.customer_id} (ρ_c(Customers), ρ_o(Orders)) )))))Translation Rules for Common SQL Clauses:
| SQL Clause | Algebra Operator | Notes |
|---|---|---|
FROM table | Base relation | Starting point |
FROM t1, t2 | t1 × t2 (cross product) | Implicit join |
JOIN ... ON | ⋈ condition | Theta join |
WHERE cond | σ cond | Selection filter |
SELECT cols | π cols | Projection |
SELECT DISTINCT | δ (π cols) | Projection + dedup |
GROUP BY ... HAVING | γ grouping, agg; σ having | Grouping with filter |
ORDER BY | τ (sort) | May be optimized away |
UNION/EXCEPT/INTERSECT | ∪/−/∩ | Set operations |
Subquery in FROM | Nested algebra tree | Virtual relation |
SQL's conceptual order of operations defines the translation sequence: FROM (including JOINs) → WHERE → GROUP BY → HAVING → SELECT → DISTINCT → ORDER BY → LIMIT. This order determines how algebra operators nest in the logical plan.
Subqueries add complexity to translation. Different subquery types require different handling strategies.
Subquery Types:
123456789101112131415161718192021222324252627282930
-- Scalar subquery (single value)SELECT order_id, (SELECT name FROM customers WHERE id = orders.customer_id) AS customer_nameFROM orders;-- Translation: Nested loop with scalar aggregation enforced -- EXISTS subquery (semi-join)SELECT *FROM customers cWHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);-- Translation: Semi-join ⋉ (returns c rows that have matching o rows) -- NOT EXISTS (anti-join) SELECT *FROM customers cWHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);-- Translation: Anti-join ▷ (returns c rows with NO matching o rows) -- IN subquery (also semi-join)SELECT *FROM productsWHERE category_id IN (SELECT id FROM categories WHERE active = true);-- Translation: Semi-join on products.category_id = categories.id -- Correlated subquery (references outer query)SELECT c.name, (SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.id) AS order_countFROM customers c;-- Translation: Decoration pattern or lateral join-- Often "decorrelated" to a more efficient join form during optimizationSubquery Decorrelation:
Correlated subqueries execute once per outer row—potentially very expensive. The optimizer often "decorrelates" them into equivalent join expressions:
-- Correlated (slow: one subquery execution per customer)
SELECT c.name,
(SELECT SUM(total) FROM orders WHERE customer_id = c.id) AS total
FROM customers c;
-- Decorrelated (fast: single join operation)
SELECT c.name, COALESCE(o.total_sum, 0) AS total
FROM customers c
LEFT JOIN (
SELECT customer_id, SUM(total) AS total_sum
FROM orders
GROUP BY customer_id
) o ON c.id = o.customer_id;
Decorrelation is a key optimization performed during or after translation.
When a query references a view, the translator must expand the view definition into the logical plan. Views are essentially named, stored queries that appear as virtual tables.
View Expansion Process:
12345678910111213141516171819202122232425262728293031
-- View definitionCREATE VIEW active_orders ASSELECT o.order_id, o.customer_id, o.total, c.name AS customer_nameFROM orders o JOIN customers c ON o.customer_id = c.idWHERE o.status = 'active'; -- Query using the viewSELECT customer_name, SUM(total) AS total_valueFROM active_ordersWHERE total > 100GROUP BY customer_name; -- After view expansion, equivalent to:SELECT customer_name, SUM(total) AS total_valueFROM ( SELECT o.order_id, o.customer_id, o.total, c.name AS customer_name FROM orders o JOIN customers c ON o.customer_id = c.id WHERE o.status = 'active') AS active_ordersWHERE total > 100GROUP BY customer_name; -- The optimizer can then merge and simplify:SELECT c.name AS customer_name, SUM(o.total) AS total_valueFROM orders o JOIN customers c ON o.customer_id = c.idWHERE o.status = 'active' AND o.total > 100GROUP BY c.name;Materialized View Handling:
Unlike regular views, materialized views store computed results. Translation differs:
| View Type | Translation Behavior |
|---|---|
| Regular View | Expand definition, optimize combined query |
| Materialized View | Reference stored data directly (like a table) |
| Materialized View (stale) | May expand or use stale data based on settings |
Some databases perform query rewriting to use materialized views even when the original query doesn't reference them—if the materialized view can answer the query more efficiently.
Views don't inherently add overhead—after expansion and optimization, the query plan may be identical to writing the query directly. However, complex nested views can obscure optimization opportunities. If performance matters, examine the actual execution plan, not the view nesting depth.
The translation phase produces a query tree (or logical plan tree) that represents the algebraic expression as a data structure. This tree flows to the optimizer for transformation.
Query Tree Properties:
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061
/* Logical plan node types */typedef enum PlanNodeType { /* Scan operators */ T_SeqScan, /* Sequential table scan */ T_IndexScan, /* Index scan */ T_SubqueryScan, /* Scan subquery result */ /* Join operators */ T_Join, /* Generic join (type specified in node) */ T_NestLoop, /* Nested loop join */ T_MergeJoin, /* Sort-merge join */ T_HashJoin, /* Hash join */ /* Set operators */ T_SetOp, /* UNION/INTERSECT/EXCEPT */ /* Transformation operators */ T_Sort, /* ORDER BY */ T_Group, /* GROUP BY */ T_Agg, /* Aggregation */ T_Unique, /* DISTINCT elimination */ T_Limit, /* LIMIT/OFFSET */ /* Result operators */ T_Result, /* Project/compute expressions */} PlanNodeType; /* Generic plan node */typedef struct Plan { PlanNodeType type; /* Cost estimates (filled by optimizer) */ Cost startup_cost; /* Cost before first tuple */ Cost total_cost; /* Cost for all tuples */ double plan_rows; /* Estimated row count */ int plan_width; /* Estimated row width in bytes */ /* Output schema */ List *targetlist; /* Expressions to compute */ List *qual; /* Selection predicates */ /* Child nodes */ struct Plan *lefttree; struct Plan *righttree;} Plan; /* Join-specific node */typedef struct Join { Plan plan; JoinType jointype; /* INNER, LEFT, RIGHT, FULL, SEMI, ANTI */ List *joinqual; /* Join conditions */} Join; /* Aggregation node */typedef struct Agg { Plan plan; AggStrategy aggstrategy; /* PLAIN, SORTED, HASHED */ int numCols; /* Number of grouping columns */ AttrNumber *grpColIdx; /* Grouping column indices */ List *aggfunctions; /* Aggregate function calls */} Agg;Viewing Query Trees:
Most databases provide ways to inspect the logical plan:
-- PostgreSQL: View parse tree, rewritten query, plan
SET debug_print_parse = on;
SET debug_print_rewritten = on;
SET debug_print_plan = on;
-- MySQL: Show query tree
EXPLAIN FORMAT=TREE SELECT ...;
-- SQL Server: Show execution plan
SET SHOWPLAN_ALL ON;
GO
SELECT ...;
GO
SET SHOWPLAN_ALL OFF;
Understanding these representations helps debug unexpected query behavior.
We've explored the translation phase comprehensively. Let's consolidate the key concepts:
What's Next:
With the logical query plan constructed, we enter the most intellectually rich phase: optimization. The next page examines how the query optimizer evaluates countless equivalent execution strategies to find the most efficient plan—using cost models, statistics, and sophisticated search algorithms.
You now understand how SQL statements transform into relational algebra expressions through semantic analysis and translation. This foundation prepares you for understanding query optimization—where the database finds the fastest path to your answer.