Loading learning content...
Relational algebra trees represent queries as hierarchical structures—operations flowing from leaves to root. But this tree-centric view obscures something critical: the relationships between tables.
Consider a query joining five tables. In a tree representation, those tables appear as leaves, and their connections are encoded in join nodes scattered throughout the tree. To understand how tables relate—which tables share join conditions, which predicates apply to which tables—you must mentally reconstruct the relationships by traversing the tree.
Query graphs flip this perspective. Instead of emphasizing the computation (operations flowing through nodes), they emphasize the structure (tables and their connections). This structural view proves invaluable for join ordering, predicate placement, and understanding query complexity at a glance.
By the end of this page, you will understand query graphs, their construction from SQL, their role in optimization, and how they complement relational algebra trees. You'll learn to read and reason about query structure in a way that directly informs optimization decisions.
A query graph is a graph-based representation of a SQL query where:
Unlike the hierarchical relational algebra tree, the query graph is a flat structure that shows all tables and their interconnections simultaneously.
Formal Definition:
A query graph G = (V, E, P) where:
Complementary Representations:
Query graphs and relational algebra trees are not competitors—they're complementary tools:
Optimizers typically maintain both representations, using query graphs for structural analysis (especially join ordering) and algebra trees for transformation and execution planning.
Query graph construction follows a systematic process of extracting entities (relations) and relationships (join conditions) from SQL.
Construction Algorithm:
Example Construction:
123456789101112131415161718192021222324
-- Original QuerySELECT c.name, o.order_id, p.product_name, oi.quantityFROM customers cJOIN orders o ON c.customer_id = o.customer_idJOIN order_items oi ON o.order_id = oi.order_idJOIN products p ON oi.product_id = p.product_idWHERE c.region = 'West' AND o.order_date > '2024-01-01' AND p.category = 'Electronics'; -- Step 1: Identify nodes (tables)-- Nodes: customers (c), orders (o), order_items (oi), products (p) -- Step 2-3: Classify predicates-- Join predicates:-- c.customer_id = o.customer_id → edge: c -- o-- o.order_id = oi.order_id → edge: o -- oi-- oi.product_id = p.product_id → edge: oi -- p-- Selection predicates:-- c.region = 'West' → annotates node c-- o.order_date > '2024-01-01' → annotates node o-- p.category = 'Electronics' → annotates node p -- Step 4-5: Build graph with labeled edges and nodesAnalysis of the Example Graph:
This query graph immediately reveals several properties:
Chain topology: The tables form a linear chain (c → o → oi → p). This is the most common topology for transactional queries following entity relationships.
No cycles: Each table connects to at most two others, with no circular paths. This simplifies join ordering.
Predicate distribution: Selection predicates appear on three of four tables (c, o, p), indicating opportunity for early filtering.
Critical path: Any join order must respect the chain—you can't directly join customers to products without intermediate tables.
Selectivity opportunities: With predicates on c, o, and p, starting from the most selective table could minimize intermediate result sizes.
The shape of a query graph tells you about query complexity. Chains are simple. Stars (one central table connected to many) are common in data warehouses. Cycles indicate complex relationships requiring careful handling. Dense graphs with many cross-connections suggest expensive optimizations.
Query graph topology—the pattern of connections between tables—has profound implications for optimization difficulty, join algorithm selection, and potential execution strategies.
Common Topology Patterns:
Chain (Linear) Topology:
T1 — T2 — T3 — T4 — ... — Tn
Tables form a single path with no branching. Each table connects to at most two neighbors.
| Topology | Join Orders | Optimization Complexity | Common Domain |
|---|---|---|---|
| Chain (n tables) | O(2^n) | Low—DP efficient | OLTP, transactional |
| Star (n dimensions) | O(n!) | Medium—can exploit structure | OLAP, data warehouse |
| Snowflake | Varies | Medium-High | Analytics, normalized DW |
| General/Cyclic | O(n!) | High—NP-hard in general | Complex applications |
| Clique (all-to-all) | O(n!) | Very High | Rare—usually design issue |
Predicates—the boolean conditions in WHERE, HAVING, and JOIN clauses—play a central role in query graphs. Their classification and placement determine filtering opportunities and join optimization.
customer.region = 'West', order.amount > 100. Placed as annotations on the corresponding node. Can be evaluated immediately after/during table access.order.customer_id = customer.id. Become edges in the query graph. Define the graph structure itself.a.x + b.y > c.z. Rarely occur; require special handling. Often indicate opportunity for query rewriting.Predicate Decomposition:
Complex predicates often need decomposition before placement. Consider:
WHERE (a.x = b.y AND a.region = 'East')
OR (c.z > 100)
This predicate cannot be simply placed because:
a.region is in a subexpression)Optimizers use predicate extraction rules to identify portions that can be evaluated early.
12345678910111213141516171819202122
-- Predicate Classification Example SELECT *FROM orders o, customers c, products p, order_items oiWHERE -- Join predicates (become edges) o.customer_id = c.id -- Edge: orders — customers AND o.order_id = oi.order_id -- Edge: orders — order_items AND oi.product_id = p.id -- Edge: order_items — products -- Selection predicates (node annotations) AND c.status = 'active' -- Node: customers AND o.order_date > CURRENT_DATE - 30 -- Node: orders AND p.price > 50 -- Node: products -- Post-join predicate (evaluated after join) AND o.total > c.credit_limit * 0.8 -- Involves both orders AND customers -- Must be evaluated after o-c join -- The last predicate references both 'o' and 'c', so it cannot be placed-- on either node independently. It becomes an annotation on the edge,-- or is evaluated after the join produces combined rows.Each predicate has a selectivity—the fraction of rows it eliminates. Highly selective predicates (e.g., filtering to one customer by ID) should be applied as early as possible. Query graphs help identify which predicates can be pushed to base table access and which must wait for joins.
The primary use of query graphs in optimization is join ordering—determining the sequence in which tables should be joined. This is one of the most critical optimization decisions, as different orderings can differ in cost by orders of magnitude.
The Join Ordering Problem:
For n tables, there are:
The query graph constrains possibilities: you can only join tables connected by an edge (or their joined results).
Using the Graph for Ordering:
Given the chain A — B — C — D, possible join orderings include:
The graph structure shows which orderings are valid—you cannot join A directly with D without first involving B and C.
Dynamic Programming Approach:
Optimizers use the graph to enumerate valid subsets:
| Join Order | Estimated Rows | Estimated Cost | Notes |
|---|---|---|---|
| (((A ⋈ B) ⋈ C) ⋈ D) | 500K → 5K → 2K | High | B is large—early join produces large intermediate |
| (((C ⋈ D) ⋈ B) ⋈ A) | 8K → 40K → 35K | Medium | Start small, grow moderately |
| ((A ⋈ B) ⋈ (C ⋈ D)) | 500K ⋈ 8K | Highest | Two large intermediates for final join |
| (((C ⋈ B) ⋈ A) ⋈ D) | 10K → 80K → 40K | Medium-Low | C as anchor reduces B join size |
| Start from most selective | Depends on predicates | Often Lowest | Filter early, join filtered results |
Join ordering decisions depend entirely on cardinality estimates. The optimizer needs to predict how many rows each intermediate join produces. Errors in these estimates—from stale statistics or correlated columns—can lead to catastrophically bad plans.
Basic query graphs represent simple join queries. Real databases extend the model to handle more complex SQL features.
1234567891011121314151617181920212223242526
-- Query with outer join and subquerySELECT c.name, o.order_id, (SELECT MAX(oi.price) FROM order_items oi WHERE oi.order_id = o.order_id) as max_item_priceFROM customers cLEFT JOIN orders o ON c.customer_id = o.customer_idWHERE c.region = 'West' AND (o.order_id IS NULL OR o.total > 100); -- Extended Query Graph Representation:-- -- Nodes:-- [customers] with selection: region = 'West'-- [orders] with selection: order_id IS NULL OR total > 100-- [order_items] - in scalar subquery---- Edges:-- customers ←LEFT_OUTER→ orders (customer_id = customer_id)-- orders ⟿ order_items (correlated: oi.order_id = o.order_id)-- (scalar aggregate subquery)---- The LEFT_OUTER edge indicates orders may be NULL-- The correlated subquery edge shows data dependency-- Aggregate boundary exists at the subquery levelOuter Join Complexity:
Outer joins complicate query graphs because they are not symmetric:
A ⋈ B = B ⋈ A (commutative)A LEFT JOIN B ≠ B LEFT JOIN A (not commutative)This restricts join reordering. You cannot freely reorder outer joins like inner joins. The query graph must track edge direction and type to ensure valid transformations.
Association patterns for outer joins:
A query with multiple outer joins has far fewer valid join orderings than an equivalent inner-join query. This is a significant optimization constraint. Understanding this helps explain why some queries are harder to optimize and why execution plans may seem suboptimal.
Query graphs and relational algebra trees work together in the optimization process. The query graph provides structural insight for decisions like join ordering, while the algebra tree provides the actual execution structure.
The Transformation Process:
12345678910111213141516171819202122232425262728293031323334353637
// Conceptual algorithm: Query Graph → Relational Algebra Tree function graphToTree(queryGraph: QueryGraph): AlgebraTree { // 1. Determine join order using query graph // (e.g., via dynamic programming on graph structure) const joinOrder = findOptimalJoinOrder(queryGraph); // 2. Build tree bottom-up from join order let tree = null; for (const step of joinOrder) { if (tree === null) { // First step: access base table with local predicates tree = createScan(step.table, step.localPredicates); } else { // Join current tree with next table const nextScan = createScan(step.table, step.localPredicates); tree = createJoin(tree, nextScan, step.joinPredicate); } } // 3. Add remaining operations (aggregation, sorting, projection) if (queryGraph.hasGroupBy) { tree = createGroupBy(tree, queryGraph.groupingCols, queryGraph.aggregates); } if (queryGraph.hasHaving) { tree = createSelect(tree, queryGraph.havingPredicate); } if (queryGraph.hasOrderBy) { tree = createSort(tree, queryGraph.orderByColumns); } tree = createProject(tree, queryGraph.selectList); return tree;} // The result is a fully-formed relational algebra tree// ready for physical planning and executionKey Transformation Points:
Graph remains undirected for joins: The graph doesn't dictate order; the optimizer chooses based on cost.
Tree becomes strictly ordered: Once transformed, the tree specifies exact operation sequence.
Predicates placed appropriately: Local predicates become selection nodes (often pushed to scans); join predicates become join node conditions.
Topology constrains but doesn't determine: The graph's structure limits valid orderings, but cost estimates determine the actual choice.
Both representations remain useful: The graph for understanding structure; the tree for executing operations.
Query graphs provide a structural lens for understanding and optimizing queries. While relational algebra trees capture computation, query graphs reveal relationships.
What's Next:
With relational algebra trees and query graphs understood, we turn to logical plans—the comprehensive representation that combines algebraic operations with optimization annotations. Logical plans are the working representation optimizers manipulate, transform, and refine into efficient execution strategies.
You now understand query graphs and their role in representing query structure. This knowledge helps you visualize complex queries, understand join ordering decisions, and reason about optimization possibilities. Next: Logical Plans and their transformation.