You've learned that static polymorphism resolves at compile time with zero runtime overhead, while dynamic polymorphism incurs vtable lookups and blocks inlining. But when does this actually matter? How much overhead are we talking about—nanoseconds or milliseconds? When should you sacrifice design elegance for speed?
These questions don't have universal answers. The impact of polymorphism overhead depends on your specific context: how hot the code path is, how small the methods are, what the alternative would be, and what your performance requirements are.
This page equips you to make informed decisions by understanding the actual costs, measuring them in your context, and applying strategies to optimize without abandoning good design.
By the end of this page, you will understand the quantitative costs of polymorphism, how to measure dispatch overhead in your systems, when polymorphism costs matter (and when they don't), and specific strategies for optimizing hot polymorphic code paths while maintaining design quality.
Let's put concrete numbers on the costs we've discussed. These figures are approximate and vary by CPU architecture, cache state, and compiler, but they provide a framework for reasoning.
Component costs of a virtual method call:
| Operation | Cost (CPU cycles) | Notes |
|---|---|---|
| Direct function call | 1-2 cycles | Baseline: call instruction with known target |
| Load vptr from object | 0-4 cycles (L1 hit) / 12+ cycles (L2) / 200+ (RAM) | Depends on cache state |
| Load function pointer from vtable | 0-4 cycles (L1 hit) / 12+ cycles (L2) | Often in cache if class frequently used |
| Indirect call (branch) | 1-5 cycles | Plus potential branch misprediction penalty |
| Branch misprediction | 10-20 cycles | If CPU predicted wrong target |
| Lost inlining opportunity | Varies greatly | Prevents further optimizations |
Best case scenario (everything cached, correctly predicted): ~3-7 cycles total — only marginally slower than a direct call.

Worst case scenario (cold cache, branch mispredicted): 200+ cycles, dominated by the vptr load missing all the way to RAM, plus a 10-20 cycle misprediction penalty.

Real-world typical case: ~5-15 cycles, roughly 2-3x the cost of a direct call — low single-digit nanoseconds on a modern CPU.
The 2-3x overhead for vtable lookup understates the impact. The real cost is often the lost optimization opportunities. An inlined method enables constant propagation, dead code elimination, loop unrolling, and many other optimizations. A virtual call is an optimization barrier that prevents these cascading improvements.
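To make the inlining argument concrete, here is a small illustrative sketch (the class and method names are ours, not from any library): the same computation written as a directly bound static call, which the JIT can inline and fold into the loop, and behind an interface reference, where each iteration is a dispatch site that may stay opaque to the optimizer.

```java
// Illustrative sketch: the dispatch cost itself is small; the bigger loss
// is that a virtual call can hide the callee from the optimizer, blocking
// inlining and the constant propagation / loop optimizations that follow.
public class InliningDemo {

    interface Scaler { int scale(int x); }

    // Direct, statically bound path: trivially inlinable.
    // After inlining, the JIT sees `sum += i * 2` and can optimize the loop.
    static long directSum(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) sum += i * 2;
        return sum;
    }

    // Same computation behind an interface reference: unless the JIT
    // devirtualizes the call, the loop body stays opaque to optimization.
    static long virtualSum(int n, Scaler s) {
        long sum = 0;
        for (int i = 0; i < n; i++) sum += s.scale(i);
        return sum;
    }

    public static void main(String[] args) {
        Scaler doubler = x -> x * 2;
        // Identical results; very different optimization potential.
        System.out.println(directSum(1_000));
        System.out.println(virtualSum(1_000, doubler));
    }
}
```

Both methods return the same value; the difference is what the compiler is allowed to see and transform.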
Memory overhead:
| Item | Size (64-bit) | When Incurred |
|---|---|---|
| vptr per object | 8 bytes | Every polymorphic object |
| vtable per class | 8 bytes × virtual methods | Once per class |
| Type information (RTTI) | Varies (~20-100 bytes) | Once per class (if used) |
For objects with many instances (millions of particles, graph nodes, etc.), the 8-byte vptr overhead multiplies significantly. For typical business objects with few instances, it's negligible.
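A quick back-of-envelope calculation makes the scale effect visible (the object counts below are arbitrary examples; the 8-byte figure is the per-object pointer cost from the table above):

```java
// Back-of-envelope: per-object pointer overhead at scale.
// 8 bytes per polymorphic object becomes substantial once
// object counts reach the millions.
public class VptrOverhead {

    static long overheadBytes(long objectCount) {
        final long POINTER_BYTES = 8; // 64-bit vptr/class pointer per object
        return objectCount * POINTER_BYTES;
    }

    public static void main(String[] args) {
        long particles = 10_000_000L; // e.g., a particle system
        System.out.println(overheadBytes(particles) / (1024 * 1024)
                + " MiB of pure dispatch metadata");

        long orders = 10_000L; // a typical business-object count
        System.out.println(overheadBytes(orders) / 1024
                + " KiB - negligible");
    }
}
```

Ten million objects carry roughly 76 MiB of pointer overhead before any useful data; ten thousand carry under 80 KiB.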
Not all code is equal. The impact of polymorphism overhead ranges from "completely irrelevant" to "critical bottleneck" depending on the context.
The 90/10 rule applies:
In most applications, 90% of execution time is spent in 10% of the code (often less). Optimizing that 10% matters enormously; optimizing the other 90% yields negligible benefit. Polymorphism overhead only matters if it's in that critical 10%.
A simple heuristic:
If the method being called does more than ~100 CPU operations of actual work, the dispatch overhead is noise. If the method is trivial (returns a field, does simple arithmetic) and is called millions of times, the overhead may dominate.
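The arithmetic behind that heuristic can be sketched directly (the cycle figures are the rough estimates from the table above, not measurements):

```java
// Rough ratio arithmetic behind the ~100-operations heuristic:
// ~10 cycles of dispatch overhead is enormous relative to a 2-cycle
// method body, and noise relative to hundreds of cycles of real work.
public class DispatchRatio {

    static double overheadPercent(double dispatchCycles, double workCycles) {
        return 100.0 * dispatchCycles / (dispatchCycles + workCycles);
    }

    public static void main(String[] args) {
        double dispatch = 10.0; // typical virtual-call cost in cycles

        // Trivial method: returns a field (~2 cycles of work)
        System.out.printf("trivial method: %.0f%% overhead%n",
                overheadPercent(dispatch, 2));

        // Substantial method: ~1000 cycles of real work
        System.out.printf("substantial method: %.1f%% overhead%n",
                overheadPercent(dispatch, 1000));
    }
}
```

For the trivial method, dispatch is over 80% of total time; for the substantial one, it is under 1%.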
Never guess about performance. Profile your actual application with real workloads. If polymorphic dispatch appears in your profiler's hot spots, optimize it. If it doesn't appear, leave it alone. Premature optimization destroys code quality for zero benefit.
To make informed decisions, you need to measure polymorphism overhead in your specific context. Here's how to benchmark dispatch costs accurately.
Microbenchmarking pitfalls:
Measuring individual method calls is tricky because:

- A single call takes low single-digit nanoseconds, far below timer resolution, so calls must be measured in aggregate.
- JIT compilers need warmup before steady-state performance is reached, and may eliminate benchmark code entirely as dead code if results are unused.
- A call site that is monomorphic in a benchmark may be devirtualized and inlined, while the same site in production is truly polymorphic.
- Tight benchmark loops keep caches and branch predictors unrealistically warm compared to real workloads.
Proper benchmarking approach:
```java
// Using JMH (Java Microbenchmark Harness) for accurate measurements
// JMH handles warmup, fork isolation, and statistical analysis

import org.openjdk.jmh.annotations.*;
import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
@Fork(2) // Run in fresh JVMs
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 10, time = 1)
public class DispatchBenchmark {

    // Interface for polymorphic dispatch
    interface Operation {
        int execute(int value);
    }

    // Concrete implementation
    static class Doubler implements Operation {
        @Override
        public int execute(int value) { return value * 2; }
    }

    // Final class - helps devirtualization
    static final class FinalDoubler implements Operation {
        @Override
        public int execute(int value) { return value * 2; }
    }

    // Direct method (baseline)
    static int directDouble(int value) { return value * 2; }

    private Operation virtualOp;
    private Operation finalOp;
    private int input;

    @Setup
    public void setup() {
        virtualOp = new Doubler();
        finalOp = new FinalDoubler();
        input = 42;
    }

    @Benchmark
    public int directCall() {
        return directDouble(input); // Direct static call
    }

    @Benchmark
    public int virtualCall() {
        return virtualOp.execute(input); // Virtual dispatch
    }

    @Benchmark
    public int finalVirtualCall() {
        return finalOp.execute(input); // Final class, may devirtualize
    }

    // Measure polymorphic site (alternating types)
    private Operation[] mixedOps;
    private int index;

    @Setup(Level.Invocation)
    public void setupMixed() {
        mixedOps = new Operation[] {
            new Doubler(), new FinalDoubler(), new Doubler()
        };
        index = (index + 1) % mixedOps.length;
    }

    @Benchmark
    public int polymorphicCall() {
        return mixedOps[index].execute(input); // True polymorphic dispatch
    }
}

// Expected results (typical modern JVM):
// directCall:       ~1-2 ns   (baseline)
// finalVirtualCall: ~1-3 ns   (likely devirtualized)
// virtualCall:      ~2-5 ns   (monomorphic, inline cached)
// polymorphicCall:  ~5-15 ns  (megamorphic, vtable dispatch)
```

The absolute numbers matter less than the ratios. If virtual dispatch is 3x slower than direct calls in your benchmark, that 3x ratio will roughly hold in real code. But 3x of 2 nanoseconds is 6 nanoseconds—still trivial compared to a network call taking 1 million nanoseconds.
When profiling reveals that polymorphic dispatch is a genuine bottleneck, several strategies can help without abandoning object-oriented design entirely.
Optimization strategies (apply only after profiling):

- **Devirtualization hints** — `final` classes and methods can be inlined aggressively.
- **Concrete types in hot paths** — declare `Circle circle` instead of `Shape shape` where the type is known, so the compiler can eliminate virtual dispatch.
- **Homogeneous collections** — instead of one `List<Shape>` with mixed types, maintain separate `List<Circle>`, `List<Rectangle>`, etc. Process each type optimally.
- **Batch processing** — amortize dispatch with `processAll(List<Item>)` instead of `process(Item)` × N.
```java
// BEFORE: Virtual dispatch in tight loop
void processShapes(List<Shape> shapes) {
    for (Shape shape : shapes) {
        shape.draw(); // Virtual call per shape
    }
}

// OPTIMIZATION 1: Homogeneous collections
record ShapeCollections(
    List<Circle> circles,
    List<Rectangle> rectangles,
    List<Triangle> triangles) {}

void processShapesOptimized(ShapeCollections shapes) {
    // Each loop is monomorphic - JIT can devirtualize and inline
    for (Circle c : shapes.circles()) {
        c.draw(); // Concrete type, may inline
    }
    for (Rectangle r : shapes.rectangles()) {
        r.draw();
    }
    for (Triangle t : shapes.triangles()) {
        t.draw();
    }
}

// OPTIMIZATION 2: Batch processing
interface ShapeProcessor {
    void processAll(List<? extends Shape> shapes); // Batch method
}

class CircleProcessor implements ShapeProcessor {
    @Override
    public void processAll(List<? extends Shape> shapes) {
        // Process all at once, virtual call happens once
        for (Shape s : shapes) {
            // Implementation knows all are Circles
            Circle c = (Circle) s;
            // ... optimized circle processing
        }
    }
}

// OPTIMIZATION 3: Manual inline caching
void processWithInlineCache(List<Shape> shapes) {
    for (Shape shape : shapes) {
        Class<?> currentType = shape.getClass();
        // Fast path: known concrete types get direct calls
        if (currentType == Circle.class) {
            ((Circle) shape).drawOptimized(); // Direct call
        } else if (currentType == Rectangle.class) {
            ((Rectangle) shape).drawOptimized(); // Direct call
        } else {
            shape.draw(); // Fallback virtual dispatch
        }
    }
}
```

Every optimization technique trades something: homogeneous collections lose flexibility, CRTP loses heterogeneous containers, manual caching adds complexity. Only optimize when profiling proves it's necessary. Premature optimization damages maintainability for no measurable gain.
Beyond dispatch overhead, polymorphism affects memory layout and cache performance in ways that can dominate raw call costs.
The cache locality problem:
Polymorphic collections often suffer from poor cache locality:
Scattered vtables — Objects of different types have vtables at different memory locations. Iterating mixed collections causes vtable cache misses.
Pointer chasing — Polymorphic references are pointers to heap objects. Following pointers defeats CPU prefetchers and spatial locality.
Object size variation — Different derived types have different sizes. Collections become arrays of pointers, not contiguous data.
RTTI overhead — Runtime type checking requires accessing type information structures, adding more cache pressure.
| Pattern | Cache Behavior | Performance |
|---|---|---|
| struct array (values) | Sequential, predictable | Excellent (prefetch effective) |
| homogeneous object array | Pointer chase, but same vtable | Good (vtable cached) |
| heterogeneous object array | Pointer chase, different vtables | Poor (vtable thrashing) |
| Virtual calls per element | N vtable lookups + N function pointers | Poor for small methods |
```cpp
// Data-oriented design: Cache-friendly alternatives

// ANTI-PATTERN: Array of pointers to polymorphic objects
std::vector<Shape*> shapes; // Scattered memory, vtable thrashing

// BETTER: Separate homogeneous arrays (Structure of Arrays)
struct ShapeData {
    std::vector<Circle> circles;
    std::vector<Rectangle> rectangles;
    std::vector<Triangle> triangles;
};

void processAllShapes(ShapeData& data) {
    // Process circles - all Circle vtables in cache
    for (auto& c : data.circles) {
        c.draw(); // Same vtable for all
    }
    // Process rectangles - now Rectangle vtable in cache
    for (auto& r : data.rectangles) {
        r.draw();
    }
    // etc.
}

// EVEN BETTER: Data-oriented, no polymorphism at all
struct CircleData {
    std::vector<double> x; // Positions
    std::vector<double> y;
    std::vector<double> radius; // Circle-specific
};

void drawAllCircles(const CircleData& data, size_t count) {
    // Pure data iteration - maximum cache efficiency
    for (size_t i = 0; i < count; ++i) {
        drawCircle(data.x[i], data.y[i], data.radius[i]);
    }
    // Can vectorize with SIMD
}

// Trade-off: Lose OOP elegance, gain performance
// Use when processing millions of entities per frame (games, simulations)
```

In extremely performance-critical contexts (game engines, scientific computing), Data-Oriented Design (DOD) replaces OOP hierarchies with flat data structures optimized for cache access. This is the extreme end of the spectrum—maximum performance, minimum abstraction. Most applications don't need this, but knowing it exists helps understand the tradeoff space.
Given everything we've covered, how do you decide when to use static vs dynamic polymorphism? Here's a practical decision framework.
| Situation | Recommendation | Reasoning |
|---|---|---|
| Plugin architecture | Dynamic (interfaces) | Types not known at compile time |
| Mathematical operations on known types | Static (generics/templates) | Full optimization, type safety |
| Business logic handlers | Dynamic (strategy pattern) | Flexibility trumps micro-performance |
| High-frequency trading loop | Static or no polymorphism | Every nanosecond counts |
| GUI event handling | Dynamic (observer) | Type of handler varies at runtime |
| Container library internals | Static (templates) | Performance critical, types known |
| Unit test mocking | Dynamic (interfaces) | Runtime injection of test doubles |
Major systems and teams have navigated these tradeoffs in a remarkably consistent way: use dynamic polymorphism by default, measure actual performance, and optimize the specific hot spots that matter using techniques appropriate to the context. No one-size-fits-all solution exists; context determines the right tradeoff.
We've covered the performance landscape of polymorphism comprehensively. Here are the key insights:

- Virtual dispatch itself costs low single-digit nanoseconds per call; the larger cost is usually lost inlining and the cascading optimizations it blocks.
- Overhead matters only in hot paths where trivial methods are called millions of times — profile before optimizing.
- When profiling justifies it, techniques like `final` types, concrete types in hot paths, homogeneous collections, and batch processing reduce dispatch cost without abandoning object-oriented design.
- Memory layout and cache locality often dominate raw dispatch costs; data-oriented design is the extreme option for entity-heavy workloads.
- Prefer static polymorphism when types are known at compile time and performance is critical; prefer dynamic polymorphism when runtime flexibility matters.
Module complete:
You now have a comprehensive understanding of compile-time and runtime polymorphism—from the mechanisms and resolution processes to the performance implications that guide real-world design decisions. This knowledge enables you to write flexible, maintainable code that performs well, choosing the right polymorphism approach for each specific context.
Congratulations! You've mastered compile-time vs runtime polymorphism. You understand static dispatch mechanics, dynamic dispatch via vtables, how compilers and runtimes resolve calls, and how to make performance-informed decisions. Apply this knowledge to write systems that are both elegant and efficient.