Every SQL query you execute follows an intricate journey from human-readable text to machine-executable operations. For decades, databases relied on interpreted execution—traversing operator trees at runtime, invoking virtual function calls for each tuple processed. This approach, while flexible, left significant performance on the table.
Query compilation represents a fundamental paradigm shift. Instead of interpreting queries, modern databases compile them into optimized machine code, eliminating interpretation overhead and unlocking performance gains that were previously unattainable. This technique has become the defining characteristic of high-performance analytical databases and is increasingly adopted by traditional OLTP systems.
By the end of this page, you will understand the fundamental principles of query compilation, including the distinction between interpretation and compilation, the phases of the compilation pipeline, code generation strategies, and how leading database systems implement compiled query execution.
To appreciate query compilation, we must first understand what it replaces. Traditional database query execution follows the Volcano iterator model (also known as the pull-based model), introduced by Goetz Graefe in the 1990s. In this model:
- Each operator implements an iterator interface: `open()`, `next()`, and `close()`
- To produce results, the root operator calls `next()` on its child, and each operator in turn calls `next()` on its own children, one tuple at a time, until data is exhausted

While elegant and composable, this approach introduces substantial overhead:
Virtual function call overhead: Each next() call is typically a virtual function invocation. For a query processing millions of tuples, this translates to millions of indirect calls—each incurring branch misprediction penalties and instruction cache pollution.
Tuple-at-a-time processing: Processing one tuple per next() call defeats modern CPU optimizations. Tight inner loops cannot leverage SIMD instructions, and function call overhead dominates actual computation time.
Interpretation overhead: The executor must repeatedly examine operator types, dispatch to appropriate handlers, and manage generic data structures—work that could be eliminated with compile-time knowledge.
```cpp
// Traditional Volcano Iterator Model
// Each operator inherits from a base class with virtual next()

class Operator {
public:
    virtual ~Operator() = default;
    virtual void open() = 0;
    virtual Tuple* next() = 0;  // Virtual call overhead!
    virtual void close() = 0;
};

class TableScan : public Operator {
    Table* table;
    size_t position = 0;
public:
    TableScan(Table* t) : table(t) {}
    void open() override { position = 0; }
    Tuple* next() override {
        // Each call processes exactly ONE tuple
        if (position >= table->size()) return nullptr;
        return table->getTuple(position++);
    }
    void close() override {}
};

class Filter : public Operator {
    Operator* child;
    Predicate* predicate;
public:
    Filter(Operator* c, Predicate* p) : child(c), predicate(p) {}
    Tuple* next() override {
        // Another virtual call, plus predicate evaluation
        while (Tuple* tuple = child->next()) {   // Virtual call!
            if (predicate->evaluate(tuple)) {    // Another virtual call!
                return tuple;
            }
        }
        return nullptr;
    }
    // ... open/close implementations
};

// Query execution: millions of virtual calls for large tables
void executeQuery(Operator* root) {
    root->open();
    while (Tuple* tuple = root->next()) {  // Virtual call per tuple!
        outputTuple(tuple);
    }
    root->close();
}
```

Studies show that interpreted execution spends only 10-20% of CPU time on actual data processing. The remaining 80-90% goes to interpretation overhead—virtual function dispatch, type checking, data structure management, and memory indirection. Query compilation aims to eliminate this tax entirely.
Query compilation transforms a declarative SQL query into imperative, executable code that can be run directly by the CPU without interpretation overhead. The compiled code is specifically generated for the exact query at hand, allowing aggressive optimizations impossible with generic interpreted execution.
The core insight:
At query compilation time, we know everything about the query: the exact tables and schemas involved, the types and offsets of every column, the specific predicates and constants, and the chosen join order and physical operators.

This complete knowledge enables generating specialized code rather than relying on generic, parameterized operators.
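To make this concrete, here is a minimal sketch (all names hypothetical) contrasting a generic comparison that must dispatch on the operator at runtime with the specialized form a query compiler can emit once the column type, operator, and constant are known:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Generic interpreted path: the operator is discovered at runtime.
enum class CmpOp { GT, LT, EQ };

bool eval_generic(CmpOp op, int64_t lhs, int64_t rhs) {
    switch (op) {                  // Runtime dispatch on every tuple
        case CmpOp::GT: return lhs > rhs;
        case CmpOp::LT: return lhs < rhs;
        case CmpOp::EQ: return lhs == rhs;
    }
    return false;
}

// Specialized path: at query compile time we KNOW the predicate is
// "salary > 50000", so the generated code hard-wires the type, the
// operator, and the constant. No dispatch, trivially inlinable.
inline bool eval_specialized(int64_t salary) {
    return salary > 50000;
}

int count_matches(const std::vector<int64_t>& salaries) {
    int n = 0;
    for (int64_t s : salaries)
        if (eval_specialized(s)) ++n;   // compiles to a tight compare loop
    return n;
}
```

The specialized version gives the compiler a plain compare-and-branch loop to optimize, which is exactly what generated query code looks like.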
Query compilation follows a multi-stage pipeline that transforms SQL text into executable machine code. Understanding this pipeline is essential for grasping how databases achieve high-performance execution.
Stage 1: Parsing and Analysis
The SQL query is parsed into an Abstract Syntax Tree (AST), then analyzed for semantic correctness. This includes name resolution, type checking, and authorization verification. The output is a validated query representation.
Stage 2: Logical Planning
The query is transformed into a logical plan expressed in relational algebra. This plan represents what the query computes, not how it will be executed.
Stage 3: Optimization
The query optimizer explores equivalent logical plans and selects the most efficient physical plan based on cost estimation. This includes selecting access methods, join algorithms, and operator ordering.
Stage 4: Code Generation
The physical plan is transformed into executable code. This is where compilation diverges from interpretation—instead of building an operator tree, we generate specialized code.
Stage 5: Compilation and Execution
The generated code is compiled (via JIT or ahead-of-time compilation) and executed. The result is streamed to the client.
| Stage | Input | Output | Key Transformations |
|---|---|---|---|
| Parsing | SQL text | AST | Lexical analysis, syntax validation, tree construction |
| Semantic Analysis | AST | Validated AST | Name resolution, type checking, privilege verification |
| Logical Planning | Validated AST | Logical Plan | Relational algebra translation, view expansion |
| Optimization | Logical Plan | Physical Plan | Cost estimation, plan enumeration, access path selection |
| Code Generation | Physical Plan | IR/Source Code | Operator fusion, specialization, vectorization |
| Compilation | IR/Source Code | Machine Code | JIT compilation, register allocation, instruction selection |
| Execution | Machine Code | Query Results | Direct CPU execution, minimal interpretation |
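The table above can be read as a chain of transformations, each stage consuming the previous stage's output. A schematic sketch, with hypothetical stage types and a toy "query language" (`SELECT <n>`) standing in for real parser, optimizer, and codegen machinery:

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical stage types modeling the pipeline above.
struct Ast          { long literal; };
struct LogicalPlan  { long literal; };
struct PhysicalPlan { long literal; };
using MachineCode = std::function<long()>;   // stands in for JIT'd native code

Ast parse(const std::string& sql) {          // Stage 1: SQL text -> AST
    return { std::stol(sql.substr(sql.find(' ') + 1)) };
}
LogicalPlan toLogical(const Ast& a)         { return { a.literal }; }  // Stage 2
PhysicalPlan optimize(const LogicalPlan& l) { return { l.literal }; }  // Stage 3
MachineCode codegenAndJit(const PhysicalPlan& p) {                     // Stages 4-5
    long v = p.literal;
    return [v] { return v; };   // "compiled" query: executes directly
}

long runQuery(const std::string& sql) {
    return codegenAndJit(optimize(toLogical(parse(sql))))();
}
```

The point of the sketch is the shape, not the stubs: each stage has a well-defined input and output, and only the final artifact is executed.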
Query compilation adds upfront latency (typically 10-100ms) before query execution begins. This is acceptable for long-running analytical queries where compilation time is amortized over billions of tuples. For short OLTP queries, databases use prepared statements and plan caching to avoid repeated compilation.
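Plan caching, as mentioned above, amortizes that upfront cost. A minimal sketch of a compiled-plan cache keyed by query text (all names hypothetical; the lambda stands in for real code generation):

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>

// Compiling is expensive (10-100ms), so we pay it once per distinct
// query shape and reuse the compiled function on subsequent executions.
using CompiledQuery = std::function<long()>;

class PlanCache {
    std::unordered_map<std::string, CompiledQuery> cache_;
public:
    int compilations = 0;   // instrumentation for illustration

    CompiledQuery getOrCompile(const std::string& sql) {
        auto it = cache_.find(sql);
        if (it != cache_.end()) return it->second;   // hit: skip compilation

        ++compilations;                              // miss: compile once
        CompiledQuery fn = [] { return 7L; };        // stand-in for codegen+JIT
        cache_.emplace(sql, fn);
        return fn;
    }
};
```

A real cache would key on a normalized plan (so parameter values don't fragment it) and handle invalidation on schema changes, which is precisely the ground prepared statements cover.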
There are multiple approaches to generating executable code from query plans. Each strategy offers different trade-offs between compilation speed, execution speed, and implementation complexity.
Approach 1: Transpilation to C/C++
Generate C or C++ source code, then invoke a system compiler (gcc, clang). This leverages decades of compiler optimization but incurs high compilation latency (seconds per query).
Approach 2: LLVM JIT Compilation
Generate LLVM Intermediate Representation (IR) and use LLVM's JIT compiler. Faster than transpilation (~100ms), with excellent optimization quality.
Approach 3: Bytecode Interpretation
Generate bytecode for a custom virtual machine. Extremely fast code generation but lower execution performance than native compilation.
Approach 4: Adaptive Compilation
Start with interpretation or bytecode, then compile to native code for hot queries. Balances compilation latency with execution performance.
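Approach 3 can be sketched concretely. Below is a tiny stack-based bytecode VM for predicate expressions; the opcodes and instruction layout are invented for illustration. Generating such bytecode from a plan takes microseconds, but each instruction still pays a dispatch cost at runtime—the trade-off noted above:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical opcodes for a minimal expression VM.
enum Op : uint8_t { PUSH_CONST, LOAD_COL, CMP_GT, HALT };

struct Instr { Op op; int64_t arg; };

// Evaluates the bytecode against one row of column values.
bool run_bytecode(const std::vector<Instr>& code,
                  const std::vector<int64_t>& row) {
    std::vector<int64_t> stack;
    for (const Instr& ins : code) {
        switch (ins.op) {                     // per-instruction dispatch
            case PUSH_CONST: stack.push_back(ins.arg); break;
            case LOAD_COL:   stack.push_back(row[ins.arg]); break;
            case CMP_GT: {
                int64_t b = stack.back(); stack.pop_back();
                int64_t a = stack.back(); stack.pop_back();
                stack.push_back(a > b ? 1 : 0);
                break;
            }
            case HALT: return stack.back() != 0;
        }
    }
    return !stack.empty() && stack.back() != 0;
}
```

For example, the predicate `col0 > 50000` compiles to the sequence LOAD_COL 0, PUSH_CONST 50000, CMP_GT, HALT. An adaptive engine (Approach 4) would count executions of such bytecode and promote hot queries to native compilation.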
```cpp
// Example: Generating C code for a simple query
// SELECT name, salary FROM employees WHERE dept_id = 5 AND salary > 50000

// Generated code eliminates all interpretation overhead:
void execute_query_12345(
    Table* employees,
    OutputBuffer* output)
{
    // Column offsets resolved at compile time
    constexpr int NAME_OFFSET = 0;
    constexpr int SALARY_OFFSET = 32;
    constexpr int DEPT_OFFSET = 40;

    // Scan with inlined predicates - no virtual calls
    for (size_t i = 0; i < employees->num_rows; ++i) {
        char* row = employees->data + (i * employees->row_size);

        // Predicate evaluation - fully inlined
        int32_t dept_id = *reinterpret_cast<int32_t*>(row + DEPT_OFFSET);
        if (dept_id != 5) continue;  // Constant folded

        int64_t salary = *reinterpret_cast<int64_t*>(row + SALARY_OFFSET);
        if (salary <= 50000) continue;  // Constant folded

        // Project - only needed columns
        output->append(row + NAME_OFFSET, 32);  // name
        output->append(&salary, 8);             // salary
    }
}

// Contrast with interpreted version which would require:
// - Virtual calls to TableScan::next()
// - Virtual calls to Filter::next() with predicate object
// - Virtual calls to Project::next() with column list
// - Type dispatch for each column access
// - Runtime column offset lookup
```

| Strategy | Compile Time | Execution Speed | Optimization Quality | Implementation Complexity |
|---|---|---|---|---|
| Transpilation (C/C++) | 1-10 seconds | Excellent | Best (mature compilers) | Moderate |
| LLVM JIT | 50-200ms | Excellent | Excellent | High |
| Custom JIT | 5-20ms | Very Good | Good | Very High |
| Bytecode + VM | 1-5ms | Good | Limited | Moderate |
| Adaptive | Variable | Adapts to workload | Varies | Very High |
Query compilation often employs the push model (also called data-centric or producer-driven execution), which inverts the traditional Volcano pull model. Instead of operators pulling data from children, the producer pushes data toward consumers.
Why push is better for compilation:
In the push model, control flow is inverted. The innermost loop is the data source (scan), and all downstream operators are inlined into that loop. This enables loop fusion—combining multiple operators into a single, cache-efficient loop.
The key insight (from Thomas Neumann's seminal work):
Data should be pushed from one operator to the next, keeping tuples in CPU registers as long as possible.
This approach minimizes materialization, maximizes register utilization, and enables aggressive compiler optimizations.
```cpp
// Push Model with Loop Fusion
// Query: SELECT SUM(price) FROM orders WHERE status = 'shipped'

// PULL MODEL (Volcano) - operator boundaries prevent optimization
class Sum {
    Operator* child;
    int64_t sum = 0;
public:
    void execute() {
        while (Tuple* t = child->next()) {   // Virtual call
            sum += t->getInt64("price");     // Virtual call
        }
    }
};

// PUSH MODEL (Compiled) - single fused loop
void execute_sum_of_shipped_orders(Table* orders, int64_t* result) {
    int64_t sum = 0;  // Register variable!

    // All operators fused into single loop
    for (size_t i = 0; i < orders->num_rows; ++i) {
        // Table scan
        auto* row = orders->getRow(i);

        // Filter (inlined)
        if (row->status != Status::SHIPPED) continue;

        // Aggregation (inlined) - accumulator stays in register
        sum += row->price;
    }
    *result = sum;
}

// Benefits:
// 1. No function call overhead
// 2. 'sum' stays in CPU register throughout
// 3. No intermediate materialization
// 4. Compiler can auto-vectorize the loop
// 5. Excellent cache locality
```

Not all operators can be fused. Pipeline breakers like sorts, hash builds, and subquery results require materialization, breaking the pipeline. Compiled databases identify pipeline boundaries and generate separate code fragments for each pipeline, passing materialized results between them.
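A hash join illustrates such a pipeline boundary. The sketch below (table and column names hypothetical) shows the two code fragments a compiler would generate: pipeline 1 ends at the hash build, which materializes its input, and pipeline 2 is a second fused loop that probes it:

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Query shape: SELECT b.payload FROM probe p JOIN build b ON p.key = b.key

// Pipeline 1: scan the build side and materialize into a hash table.
// The hash build is the pipeline breaker - tuples cannot flow past it.
std::unordered_map<int64_t, int64_t>
pipeline1_build(const std::vector<std::pair<int64_t, int64_t>>& build) {
    std::unordered_map<int64_t, int64_t> ht;
    for (const auto& [key, payload] : build)
        ht[key] = payload;                    // fused scan + build loop
    return ht;
}

// Pipeline 2: scan the probe side, probe the hash table, emit matches.
// Scan + probe + projection are again fused into one loop.
std::vector<int64_t>
pipeline2_probe(const std::vector<int64_t>& probe,
                const std::unordered_map<int64_t, int64_t>& ht) {
    std::vector<int64_t> out;
    for (int64_t key : probe) {
        auto it = ht.find(key);
        if (it != ht.end()) out.push_back(it->second);  // inlined join + project
    }
    return out;
}
```

Each pipeline is its own tight loop; only the hash table crosses the boundary between them.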
Modern CPUs support SIMD (Single Instruction, Multiple Data) instructions that operate on multiple values simultaneously. A single AVX-512 instruction can process 16 32-bit integers or 8 64-bit integers in parallel.
Vectorization adapts query execution to exploit SIMD capabilities. Rather than processing one tuple at a time, vectorized execution processes vectors (batches) of column values, enabling the CPU to parallelize computation.
Query compilation can generate vectorized code that processes column values in batches, evaluates predicates across entire SIMD registers at once, and replaces per-tuple branches with selection masks.
```cpp
// Vectorized Predicate Evaluation using AVX-512
// Evaluates: salary > 50000 for 16 integers simultaneously

#include <immintrin.h>

// Scalar version - processes 1 value per iteration
void filter_salary_scalar(
    const int32_t* salaries, size_t count, bool* result)
{
    for (size_t i = 0; i < count; ++i) {
        result[i] = salaries[i] > 50000;
    }
}

// Vectorized version - processes 16 values per iteration
void filter_salary_simd(
    const int32_t* salaries, size_t count, bool* result)
{
    // Broadcast threshold to all lanes
    __m512i threshold = _mm512_set1_epi32(50000);

    size_t i = 0;
    for (; i + 16 <= count; i += 16) {
        // Load 16 salary values
        __m512i values = _mm512_loadu_si512(&salaries[i]);

        // Compare all 16 simultaneously (single instruction!)
        __mmask16 mask = _mm512_cmpgt_epi32_mask(values, threshold);

        // Expand mask to 16 bytes (1 where true, 0 where false)
        // and store all 16 boolean results at once
        __m128i bytes = _mm_maskz_mov_epi8(mask, _mm_set1_epi8(1));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(&result[i]), bytes);
    }

    // Handle remaining elements
    for (; i < count; ++i) {
        result[i] = salaries[i] > 50000;
    }
}

// Performance: SIMD version achieves ~10-16x throughput
// on predicate evaluation compared to scalar code
```

| Instruction Set | Register Width | 32-bit Ints per Op | 64-bit Ints per Op | Availability |
|---|---|---|---|---|
| SSE2 | 128-bit | 4 | 2 | Universal (2001+) |
| AVX2 | 256-bit | 8 | 4 | Common (2013+) |
| AVX-512 | 512-bit | 16 | 8 | Modern servers (2017+) |
| ARM NEON | 128-bit | 4 | 2 | ARM processors |
| SVE/SVE2 | Variable | Variable | Variable | ARM v8.2+ |
When query code is generated as tight loops over columnar data, modern compilers (LLVM, GCC) can automatically apply SIMD optimizations. This means compiled queries often gain vectorization 'for free' without explicit SIMD intrinsics, as long as the loop structure is amenable to auto-vectorization.
Query compilation has been adopted by numerous database systems, each with distinct implementation strategies. Understanding these implementations provides insight into the practical considerations of query compilation.
| Database | Compilation Strategy | Target | Notable Features |
|---|---|---|---|
| HyPer/Umbra | LLVM JIT | Native code | Pioneered data-centric compilation; morsel-driven parallelism |
| Apache Spark | Whole-stage codegen (Janino) | JVM bytecode | Tungsten engine; fuses multiple operators |
| ClickHouse | LLVM JIT (optional) | Native code | Vectorized + compiled hybrid; column-oriented |
| DuckDB | Vectorized interpretation | Vector batches | Optimized for embedded use; favors vectorization over JIT compilation |
| PostgreSQL | JIT via LLVM | Native code | Expression compilation; optional for complex queries |
| MemSQL/SingleStore | LLVM JIT | Native code | Full query compilation; compiled plans cached for reuse |
| Presto/Trino | Bytecode generation | JVM bytecode | Adaptive compilation based on query complexity |
Case Study: HyPer's Data-Centric Compilation
HyPer (developed by Thomas Neumann at TU Munich) pioneered modern query compilation. Its key innovations:
Data-centric execution: Tuples are kept in CPU registers as long as possible, pushed through pipelines without materialization.
LLVM-based code generation: Queries are translated to LLVM IR, then JIT-compiled to native code. Compilation takes ~100ms per query.
Morsel-driven parallelism: Work is divided into morsels (small chunks), enabling parallel execution with good load balancing.
Adaptive re-compilation: Execution can switch between interpreted and compiled modes based on runtime characteristics.
HyPer demonstrated that compiled queries can achieve 10-100x performance improvements over interpreted execution for analytical workloads.
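The morsel-driven scheduling idea can be sketched as follows: input is carved into small fixed-size morsels, and worker threads claim the next morsel through an atomic counter, which yields natural load balancing. Sizes and names below are illustrative, not HyPer's actual values:

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

constexpr size_t MORSEL_SIZE = 1024;  // illustrative morsel size

int64_t parallel_sum(const std::vector<int32_t>& col, unsigned num_threads) {
    std::atomic<size_t> next_morsel{0};
    std::atomic<int64_t> total{0};

    auto worker = [&] {
        for (;;) {
            // Claim the next unprocessed morsel
            size_t start = next_morsel.fetch_add(MORSEL_SIZE);
            if (start >= col.size()) break;
            size_t end = std::min(start + MORSEL_SIZE, col.size());

            int64_t local = 0;                  // per-thread accumulator
            for (size_t i = start; i < end; ++i) local += col[i];
            total.fetch_add(local);             // publish once per morsel
        }
    };

    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) threads.emplace_back(worker);
    for (auto& th : threads) th.join();
    return total.load();
}
```

Because fast threads simply claim more morsels, skew in per-morsel cost balances out without any central scheduler.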
PostgreSQL 11+ includes optional JIT compilation via LLVM. When enabled (jit = on), PostgreSQL compiles expression evaluation, tuple deforming, and other hot paths. JIT is triggered when estimated query cost exceeds jit_above_cost (default: 100000). This hybrid approach offers compilation benefits without universal compilation overhead.
We've explored the fundamental concepts of query compilation—a transformative technique that converts SQL queries into highly optimized machine code. The key insights:

- Interpreted (Volcano-style) execution pays heavily for virtual dispatch and tuple-at-a-time processing
- Compilation exploits complete compile-time knowledge of the query to generate specialized code
- Code generation strategies (transpilation, LLVM JIT, bytecode, adaptive) trade compilation latency against execution speed and implementation complexity
- The push model enables loop fusion, keeping tuples in CPU registers across operator boundaries
- Tight generated loops unlock SIMD vectorization, often automatically via the backend compiler
- Systems such as HyPer, Spark, ClickHouse, and PostgreSQL apply these ideas with different trade-offs
What's next:
Compiled queries are most valuable when compilation cost is amortized over repeated executions. The next page explores prepared statements—the mechanism by which applications reuse compiled query plans, avoiding redundant parsing and optimization while maintaining security against SQL injection.
You now understand the fundamentals of query compilation, including why interpretation is costly, how compilation eliminates overhead, and how modern databases generate and execute compiled queries. Next, we'll explore prepared statements and their role in efficient query execution.