Top-K Problems - Learning Module

Loading content...

0/276

Top-K Pattern with Heaps

The Ubiquitous Top-K Problem

Imagine you're building a recommendation system for an e-commerce platform with 100 million products. A user searches for "wireless headphones," and your algorithm assigns a relevance score to each product. Now you need to display the top 10 most relevant results. How do you efficiently find these 10 products without sorting all 100 million items?

Or consider a real-time analytics dashboard that monitors server performance across 10,000 machines. You need to identify the 5 servers with the highest CPU usage to alert the operations team. The data changes every second—you can't afford to sort all 10,000 entries each time.

These are instances of the Top-K problem—one of the most ubiquitous patterns in software engineering. The naive approach (sort everything, take the first K) has O(n log n) time complexity. But with the right data structure—the heap—we can solve this in O(n log K) time, or even better in streaming scenarios.

What You Will Learn

By the end of this page, you will understand the Top-K pattern at a deep level: why heaps are the optimal choice, when to use a min-heap versus a max-heap, the critical insight of maintaining a "bounded window" of K elements, and how to recognize Top-K problems in disguise. This pattern appears in coding interviews, system design, and production systems regularly.

Defining the Top-K Problem

The Top-K problem has a deceptively simple formulation:

Given a collection of n elements, find the K largest (or smallest) elements.

But this simple statement hides considerable depth. Let's formalize it precisely:

Formal Definition:

Input: A collection A of n comparable elements and an integer K where 1 ≤ K ≤ n
Output: A set S containing exactly K elements from A such that every element in S is greater than or equal to every element in A − S (for Top-K largest)

Key Observations:

Order within the K elements may or may not matter. Some problems ask for just the K elements (unordered), while others want them in sorted order. This affects the final step of your solution.
The Kth largest element has special significance. It serves as the "threshold"—any element greater than or equal to this value is in the Top-K set.
K is typically much smaller than n. Most real-world Top-K problems have K << n. If K approaches n, you're essentially sorting the entire collection.
The elements don't need to be distinct. Duplicates are handled naturally by the algorithms we'll explore.

The Kth Element as Threshold

A crucial insight: finding the "Kth largest element" is equivalent to finding the threshold that separates the Top-K from the rest. Once you have this threshold, you can identify all elements in the Top-K set with a single pass. This connection is fundamental to efficient Top-K algorithms.

Variants of the Top-K Problem:

Variant	Description	Example Use Case
Top-K Largest	Find K elements with highest values	Highest-rated products, top-earning employees
Top-K Smallest	Find K elements with lowest values	Nearest neighbors, minimum costs
Top-K Frequent	Find K most frequently occurring elements	Most common words, trending hashtags
Kth Largest/Smallest	Find the single element at position K	Median (K = n/2), percentile calculations
Top-K in Streaming	Maintain Top-K as new elements arrive	Real-time leaderboards, live monitoring

Each variant uses the same fundamental heap-based approach, with slight modifications. Mastering the core pattern enables you to solve all variants.

Why Not Just Sort?

The most intuitive approach to Top-K is simple: sort the collection, then take the first K elements. Let's analyze why this isn't optimal.

The Sorting Approach:

function topKBySorting(arr, k):
    sort(arr)  // O(n log n)
    return arr[0...k-1]  // O(k)

Time Complexity: O(n log n) — dominated by the sorting step Space Complexity: O(1) to O(n) depending on sorting algorithm

The Problem:

The sorting approach does far more work than necessary. Consider the search engine example with n = 100,000,000 products and K = 10:

Sorting approach: ~2.66 billion comparisons (n log n ≈ 100M × 26.6)
What we actually need: Information about just 10 elements

We're using a sledgehammer to hang a picture frame. Sorting computes the complete ordering of all elements when we only need the partial ordering—specifically, which elements are in the top K.

The Wasted Work of Full Sorting

When K is small relative to n, full sorting wastes effort determining the relative order of elements outside the Top-K. Whether element 1,000 ranks higher than element 2,000 is irrelevant if both are outside the Top-10. This is the key insight that motivates the heap-based approach.

Alternative Approaches and Their Trade-offs:

Approach	Time Complexity	Space Complexity	Notes
Full Sort	O(n log n)	O(1)–O(n)	Simple but inefficient for small K
Partial Sort	O(n log K)	O(K)	Some languages offer this; same as heap
Heap-based	O(n log K)	O(K)	Optimal for small K; works with streams
QuickSelect	O(n) average, O(n²) worst	O(1)	Faster average case but modifies input
Median of Medians + QuickSelect	O(n) guaranteed	O(1)	Optimal worst-case but complex

For most practical scenarios, the heap-based approach provides the best balance:

O(n log K) time complexity
O(K) space complexity
Works on streams (elements arriving one at a time)
Simple to implement and understand
Consistent performance (no worst-case degradation)

When K is very small (say, K ≤ 100) and n is very large, the heap approach is almost always the right choice.

The Heap-Based Solution — The Core Pattern

Now we arrive at the heart of the Top-K pattern. The heap-based approach is elegant, efficient, and widely applicable. Here's the key insight:

To find the K largest elements, maintain a min-heap of size K. The smallest element in the heap is our "admission threshold"—any element larger must be in the Top-K.

Wait—a min-heap for finding the largest elements? This counterintuitive choice is precisely what makes the algorithm efficient. Let's break it down step by step.

The Min-Heap Paradox

Using a min-heap to find the maximum elements seems backwards, but it's the key insight. The min-heap lets us efficiently identify and evict the WEAKEST candidate from our Top-K set. We always know which element would be "first to go" if a better candidate arrives—it's right there at the top of the heap.

The Algorithm:

function topKLargest(arr, k):
    minHeap = new MinHeap()
    
    for each element in arr:
        if minHeap.size() < k:
            // Phase 1: Fill the heap until we have K elements
            minHeap.insert(element)
        else if element > minHeap.peek():
            // Phase 2: New element beats our weakest candidate
            minHeap.extractMin()  // Remove weakest
            minHeap.insert(element)  // Add new champion
    
    return minHeap.elements()

Step-by-Step Trace:

Let's trace through finding the Top-3 from [5, 2, 9, 1, 7, 6, 8]:

Step	Element	Heap Contents	Heap Min	Action
1	5	[5]	5	Insert (heap not full)
2	2	[2, 5]	2	Insert (heap not full)
3	9	[2, 5, 9]	2	Insert (heap now full with K=3)
4	1	[2, 5, 9]	2	Skip (1 ≤ 2, not a top candidate)
5	7	[5, 7, 9]	5	7 > 2 → Extract 2, Insert 7
6	6	[6, 7, 9]	6	6 > 5 → Extract 5, Insert 6
7	8	[7, 8, 9]	7	8 > 6 → Extract 6, Insert 8

Result: [7, 8, 9] — the three largest elements.

top_k_largest.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
import heapq
from typing import List
 
def top_k_largest(arr: List[int], k: int) -> List[int]:
    """
    Find the K largest elements using a min-heap.
    
    Time Complexity: O(n log K)
    - We process each of n elements exactly once
    - Each heap operation (insert/extract) is O(log K)
    - Since heap size never exceeds K, operations are O(log K), not O(log n)
    
    Space Complexity: O(K)
    - The heap stores at most K elements at any time
    
    Args:
        arr: Input array of n elements
        k: Number of largest elements to find
    
    Returns:
        List of K largest elements (not necessarily sorted)
    """
    if k <= 0:
        return []
    if k >= len(arr):
        return arr[:]  # All elements are in Top-K
    
    # Python's heapq is a min-heap by default
    # We'll use it directly for finding K largest
    min_heap = []
    
    for element in arr:
        if len(min_heap) < k:
            # Phase 1: Fill the heap until we have K elements
            heapq.heappush(min_heap, element)
        elif element > min_heap[0]:
            # Phase 2: New element beats our weakest candidate
            # heapreplace is more efficient than extractMin + insert
            heapq.heapreplace(min_heap, element)
    
    return min_heap
 
 
def top_k_smallest(arr: List[int], k: int) -> List[int]:
    """
    Find the K smallest elements using a max-heap.
    
    For K smallest, we use a MAX-heap of size K.
    The max element is our threshold—anything smaller must be in Top-K.
    
    Since Python only provides min-heap, we negate values.
    """
    if k <= 0:
        return []
    if k >= len(arr):
        return arr[:]
    
    # Negate values to simulate max-heap
    max_heap = []
    
    for element in arr:
        if len(max_heap) < k:
            heapq.heappush(max_heap, -element)
        elif element < -max_heap[0]:  # Compare with current max
            heapq.heapreplace(max_heap, -element)
    
    # Negate back to get original values
    return [-x for x in max_heap]
 
 
# Example usage and verification
if __name__ == "__main__":
    arr = [5, 2, 9, 1, 7, 6, 8]
    
    print(f"Array: {arr}")
    print(f"Top 3 largest: {sorted(top_k_largest(arr, 3), reverse=True)}")
    print(f"Top 3 smallest: {sorted(top_k_smallest(arr, 3))}")
    
    # Output:
    # Array: [5, 2, 9, 1, 7, 6, 8]
    # Top 3 largest: [9, 8, 7]
    # Top 3 smallest: [1, 2, 5]

Why Min-Heap for Finding Maximum Elements?

This is the most common point of confusion for engineers learning the Top-K pattern. Let's build a deep, intuitive understanding of why the heap choice seems "backwards."

The Insight: Gatekeeping, Not Championing

Think of the heap as a waiting room with limited capacity K. We want to keep the K strongest candidates inside.

The crucial question isn't "Who's the strongest?" but rather "Who should we kick out if someone better arrives?"

For this gatekeeping role, the heap root tells us:

In a min-heap: Who is the weakest current occupant? (First to be evicted)
In a max-heap: Who is the strongest current occupant? (Not useful for eviction)

When finding K largest elements, we need quick access to the weakest candidate—the one we'll evict when a better candidate arrives. That's what the min-heap root provides.

Min-Heap for K Largest

•Root = smallest in Top-K (threshold)
•Root is our 'weakest link'
•New element > root? Evict root, admit new element
•New element ≤ root? Reject—can't beat the weakest
•O(1) to check if element qualifies
•O(log K) to perform replacement

Max-Heap for K Largest (Wrong)

•Root = largest in Top-K (champion)
•But we don't need to evict the champion!
•To find who to evict, must scan entire heap
•Eviction becomes O(K) instead of O(log K)
•Total complexity degrades to O(nK)
•Defeats the purpose of using a heap

The Golden Rule

For Top-K Largest → Use Min-Heap (quick access to weakest candidate) For Top-K Smallest → Use Max-Heap (quick access to strongest candidate)

The heap type is OPPOSITE to what you're looking for because you need quick access to the element that would be EVICTED, not the element you want to keep.

Worked Example: Why Max-Heap Fails

Suppose we incorrectly use a max-heap to find the Top-3 largest from [5, 2, 9, 1, 7]:

Step	Element	Max-Heap	To find eviction candidate...
1	5	[5]	— (not full)
2	2	[5, 2]	— (not full)
3	9	[9, 5, 2]	— (now full)
4	1	[9, 5, 2]	Must scan all 3 to find min (2). O(K)
5	7	[9, 7, 5]	Must scan all 3 to find min (2), then restructure. O(K)

With a max-heap, every new element requires scanning the entire heap to find the minimum. For n elements, that's O(nK) total—worse than sorting when K is large!

With a min-heap, the minimum is always at the root—O(1) access, and O(log K) for the replacement. Total: O(n log K).

Complexity Analysis — Understanding the Efficiency

Let's rigorously analyze the time and space complexity of the heap-based Top-K algorithm.

Time Complexity: O(n log K)

Breaking down the operations:

For each of n elements:
- Comparison with heap root: O(1)
- If heap not full: Insert → O(log K)
- If new element qualifies: Extract + Insert (or Replace) → O(log K)
Key insight: Heap size never exceeds K
- All heap operations are O(log K), not O(log n)
- This is the source of efficiency when K << n
Worst case: Every element triggers insertion
- Happens when elements arrive in increasing order
- Still O(n log K) because each insertion is O(log K)
Best case: After initial K elements, no replacements
- Happens when K largest elements come first
- O(K log K) for initial build + O(n - K) for comparisons
- Effectively O(n) when K << n

Time Complexity Comparison for Top-K
Approach	Time Complexity	When n=10⁸, K=10	Notes
Full Sort	O(n log n)	~2.66 billion ops	Wasteful for small K
Heap-Based	O(n log K)	~332 million ops	8x faster for this example
QuickSelect	O(n) average	~100 million ops	Fastest average, but worst case O(n²)
Median of Medians	O(n) guaranteed	~500 million ops	Larger constants, complex implementation

Space Complexity: O(K)

The heap maintains exactly K elements at any time. This is crucial for:

Memory-constrained environments: Can process billions of elements with tiny heap
Stream processing: Never need to store the full input
Distributed systems: Each node keeps local Top-K, merge at the end

Concrete Example:

n = 1 billion elements (need ~8GB to store all)
K = 100 (need ~800 bytes for heap)
Memory reduction: 10,000,000x!

This space efficiency is what enables stream processing of massive datasets.

The Power of Bounded Space

The O(K) space bound means you can find the Top-100 elements from an infinite stream using constant memory. This is impossible with sorting approaches that require storing all elements. This property makes heaps indispensable for real-time analytics and streaming systems.

Variations and Extensions

The basic Top-K pattern extends naturally to several common variations. Understanding these helps you recognize Top-K problems in different disguises.

Variation 1: Kth Largest Element (Single Element)

Sometimes you don't need all K elements—just the Kth one:

def kth_largest(arr, k):
    """The Kth largest is the root of the min-heap after processing."""
    min_heap = []
    for element in arr:
        heapq.heappush(min_heap, element)
        if len(min_heap) > k:
            heapq.heappop(min_heap)  # Evict smallest
    return min_heap[0]  # Root is Kth largest

The min-heap root IS the Kth largest—no additional work needed.

Variation 2: Top-K Frequent Elements

Instead of comparing by value, compare by frequency:

def top_k_frequent(arr, k):
    """Find K most frequently occurring elements."""
    # Step 1: Count frequencies - O(n)
    freq = Counter(arr)
    
    # Step 2: Use min-heap on frequencies - O(m log K) where m = unique elements
    min_heap = []  # (frequency, element)
    for element, count in freq.items():
        heapq.heappush(min_heap, (count, element))
        if len(min_heap) > k:
            heapq.heappop(min_heap)
    
    return [element for _, element in min_heap]

Variation 3: Top-K with Custom Comparator

Real-world objects have multiple attributes. Use a comparison key:

def top_k_products_by_rating(products, k):
    """Find K highest-rated products, breaking ties by price (lower wins)."""
    min_heap = []  # (rating, -price, product) - negate price for tie-breaking
    
    for product in products:
        key = (product.rating, -product.price, product)
        heapq.heappush(min_heap, key)
        if len(min_heap) > k:
            heapq.heappop(min_heap)
    
    return [item[2] for item in min_heap]

Sorted Output Requirement

If the problem requires the Top-K elements in sorted order, add a final O(K log K) sorting step. The heap gives you the K elements but not in sorted order. For K = 10, this adds negligible overhead. For larger K, consider whether sorted output is truly necessary.

Variation 4: Top-K by Key/Score

Very common in recommendation systems and search engines:

def top_k_by_score(items: List[Tuple[str, float]], k: int) -> List[str]:
    """
    Given items with scores, return K items with highest scores.
    
    items: List of (item_id, score) tuples
    Returns: List of item_ids for Top-K items
    """
    min_heap = []  # (score, item_id)
    
    for item_id, score in items:
        if len(min_heap) < k:
            heapq.heappush(min_heap, (score, item_id))
        elif score > min_heap[0][0]:
            heapq.heapreplace(min_heap, (score, item_id))
    
    # Return in descending score order
    return [item_id for _, item_id in sorted(min_heap, reverse=True)]

This pattern appears in:

Search result ranking
Product recommendations
Feed personalization
Fraud score prioritization

Common Mistakes and Pitfalls

The Top-K pattern is simple in concept but has several common implementation pitfalls. Being aware of these will save you debugging time.

Common Pitfalls to Avoid

•Wrong heap type: Using max-heap for K largest (should be min-heap). Remember: heap type is OPPOSITE to what you're finding.
•Forgetting the K bound: Not limiting heap size leads to O(n log n) complexity—same as sorting!
•Off-by-one in heap size: Should evict when size > K, not when size >= K.
•Incorrect comparison: For Top-K largest, new element must be STRICTLY greater than heap root to qualify.
•Forgetting edge cases: K = 0, K >= n, empty input array.
•Tie-breaking issues: When values are equal, order may matter. Define consistent tie-breaking rules.
•Modifying input: Some implementations modify the input array. Document and handle appropriately.
•Heap pollution: In languages with type erasure (Java), mixing types causes subtle bugs.

top_k_edge_cases.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
def top_k_largest_robust(arr: List[int], k: int) -> List[int]:
    """
    Robust Top-K implementation handling all edge cases.
    """
    # Edge case 1: Invalid K
    if k <= 0:
        return []
    
    # Edge case 2: K >= n (return all elements)
    if k >= len(arr):
        return list(arr)  # Return copy, not reference
    
    # Edge case 3: Empty array
    if not arr:
        return []
    
    min_heap = []
    
    for element in arr:
        if len(min_heap) < k:
            heapq.heappush(min_heap, element)
        elif element > min_heap[0]:  # STRICTLY greater
            heapq.heapreplace(min_heap, element)
    
    return min_heap
 
 
# Test edge cases
assert top_k_largest_robust([], 5) == []
assert top_k_largest_robust([1, 2, 3], 0) == []
assert top_k_largest_robust([1, 2, 3], 5) == [1, 2, 3]
assert top_k_largest_robust([1, 1, 1], 2) == [1, 1]  # Duplicates OK
print("All edge cases passed!")

Summary: The Top-K Pattern

We've explored the Top-K pattern in depth—one of the most practical applications of the heap data structure. Let's consolidate the key insights:

Key Takeaways

•The Top-K problem is ubiquitous—finding K largest/smallest elements appears in search, recommendations, analytics, and countless other systems.
•Full sorting is wasteful when K << n. Why compute the complete ordering when you only need K elements?
•Use a min-heap for Top-K largest (and max-heap for smallest). The heap gives O(1) access to the eviction candidate.
•Time complexity is O(n log K), which is a significant improvement over O(n log n) sorting when K is small.
•Space complexity is O(K)—enabling processing of massive streams with bounded memory.
•The pattern extends naturally to Kth element, Top-K frequent, and custom comparators.
•Watch for common pitfalls: wrong heap type, missing bound enforcement, and edge cases.

What's Next:

In the next page, we'll explore space-efficient K element tracking—how to maintain a Top-K set that can be dynamically updated as elements are added or removed, and techniques for reducing memory overhead even further.

Pattern Mastered

You now understand the core Top-K pattern with heaps. This single pattern solves a wide variety of interview problems and real-world engineering challenges. In the upcoming pages, we'll build on this foundation to tackle more advanced scenarios including streaming data and real-world applications.