Loading content...
Imagine you're building a recommendation system for an e-commerce platform with 100 million products. A user searches for "wireless headphones," and your algorithm assigns a relevance score to each product. Now you need to display the top 10 most relevant results. How do you efficiently find these 10 products without sorting all 100 million items?
Or consider a real-time analytics dashboard that monitors server performance across 10,000 machines. You need to identify the 5 servers with the highest CPU usage to alert the operations team. The data changes every second—you can't afford to sort all 10,000 entries each time.
These are instances of the Top-K problem—one of the most ubiquitous patterns in software engineering. The naive approach (sort everything, take the first K) has O(n log n) time complexity. But with the right data structure—the heap—we can solve this in O(n log K) time, or even better in streaming scenarios.
By the end of this page, you will understand the Top-K pattern at a deep level: why heaps are the optimal choice, when to use a min-heap versus a max-heap, the critical insight of maintaining a "bounded window" of K elements, and how to recognize Top-K problems in disguise. This pattern appears in coding interviews, system design, and production systems regularly.
The Top-K problem has a deceptively simple formulation:
Given a collection of n elements, find the K largest (or smallest) elements.
But this simple statement hides considerable depth. Let's formalize it precisely:
Formal Definition:
Key Observations:
Order within the K elements may or may not matter. Some problems ask for just the K elements (unordered), while others want them in sorted order. This affects the final step of your solution.
The Kth largest element has special significance. It serves as the "threshold"—any element greater than or equal to this value is in the Top-K set.
K is typically much smaller than n. Most real-world Top-K problems have K << n. If K approaches n, you're essentially sorting the entire collection.
The elements don't need to be distinct. Duplicates are handled naturally by the algorithms we'll explore.
A crucial insight: finding the "Kth largest element" is equivalent to finding the threshold that separates the Top-K from the rest. Once you have this threshold, you can identify all elements in the Top-K set with a single pass. This connection is fundamental to efficient Top-K algorithms.
Variants of the Top-K Problem:
| Variant | Description | Example Use Case |
|---|---|---|
| Top-K Largest | Find K elements with highest values | Highest-rated products, top-earning employees |
| Top-K Smallest | Find K elements with lowest values | Nearest neighbors, minimum costs |
| Top-K Frequent | Find K most frequently occurring elements | Most common words, trending hashtags |
| Kth Largest/Smallest | Find the single element at position K | Median (K = n/2), percentile calculations |
| Top-K in Streaming | Maintain Top-K as new elements arrive | Real-time leaderboards, live monitoring |
Each variant uses the same fundamental heap-based approach, with slight modifications. Mastering the core pattern enables you to solve all variants.
The most intuitive approach to Top-K is simple: sort the collection, then take the first K elements. Let's analyze why this isn't optimal.
The Sorting Approach:
function topKBySorting(arr, k):
sort(arr) // O(n log n)
return arr[0...k-1] // O(k)
Time Complexity: O(n log n) — dominated by the sorting step Space Complexity: O(1) to O(n) depending on sorting algorithm
The Problem:
The sorting approach does far more work than necessary. Consider the search engine example with n = 100,000,000 products and K = 10:
We're using a sledgehammer to hang a picture frame. Sorting computes the complete ordering of all elements when we only need the partial ordering—specifically, which elements are in the top K.
When K is small relative to n, full sorting wastes effort determining the relative order of elements outside the Top-K. Whether element 1,000 ranks higher than element 2,000 is irrelevant if both are outside the Top-10. This is the key insight that motivates the heap-based approach.
Alternative Approaches and Their Trade-offs:
| Approach | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Full Sort | O(n log n) | O(1)–O(n) | Simple but inefficient for small K |
| Partial Sort | O(n log K) | O(K) | Some languages offer this; same as heap |
| Heap-based | O(n log K) | O(K) | Optimal for small K; works with streams |
| QuickSelect | O(n) average, O(n²) worst | O(1) | Faster average case but modifies input |
| Median of Medians + QuickSelect | O(n) guaranteed | O(1) | Optimal worst-case but complex |
For most practical scenarios, the heap-based approach provides the best balance:
When K is very small (say, K ≤ 100) and n is very large, the heap approach is almost always the right choice.
Now we arrive at the heart of the Top-K pattern. The heap-based approach is elegant, efficient, and widely applicable. Here's the key insight:
To find the K largest elements, maintain a min-heap of size K. The smallest element in the heap is our "admission threshold"—any element larger must be in the Top-K.
Wait—a min-heap for finding the largest elements? This counterintuitive choice is precisely what makes the algorithm efficient. Let's break it down step by step.
Using a min-heap to find the maximum elements seems backwards, but it's the key insight. The min-heap lets us efficiently identify and evict the WEAKEST candidate from our Top-K set. We always know which element would be "first to go" if a better candidate arrives—it's right there at the top of the heap.
The Algorithm:
function topKLargest(arr, k):
minHeap = new MinHeap()
for each element in arr:
if minHeap.size() < k:
// Phase 1: Fill the heap until we have K elements
minHeap.insert(element)
else if element > minHeap.peek():
// Phase 2: New element beats our weakest candidate
minHeap.extractMin() // Remove weakest
minHeap.insert(element) // Add new champion
return minHeap.elements()
Step-by-Step Trace:
Let's trace through finding the Top-3 from [5, 2, 9, 1, 7, 6, 8]:
| Step | Element | Heap Contents | Heap Min | Action |
|---|---|---|---|---|
| 1 | 5 | [5] | 5 | Insert (heap not full) |
| 2 | 2 | [2, 5] | 2 | Insert (heap not full) |
| 3 | 9 | [2, 5, 9] | 2 | Insert (heap now full with K=3) |
| 4 | 1 | [2, 5, 9] | 2 | Skip (1 ≤ 2, not a top candidate) |
| 5 | 7 | [5, 7, 9] | 5 | 7 > 2 → Extract 2, Insert 7 |
| 6 | 6 | [6, 7, 9] | 6 | 6 > 5 → Extract 5, Insert 6 |
| 7 | 8 | [7, 8, 9] | 7 | 8 > 6 → Extract 6, Insert 8 |
Result: [7, 8, 9] — the three largest elements.
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182
import heapqfrom typing import List def top_k_largest(arr: List[int], k: int) -> List[int]: """ Find the K largest elements using a min-heap. Time Complexity: O(n log K) - We process each of n elements exactly once - Each heap operation (insert/extract) is O(log K) - Since heap size never exceeds K, operations are O(log K), not O(log n) Space Complexity: O(K) - The heap stores at most K elements at any time Args: arr: Input array of n elements k: Number of largest elements to find Returns: List of K largest elements (not necessarily sorted) """ if k <= 0: return [] if k >= len(arr): return arr[:] # All elements are in Top-K # Python's heapq is a min-heap by default # We'll use it directly for finding K largest min_heap = [] for element in arr: if len(min_heap) < k: # Phase 1: Fill the heap until we have K elements heapq.heappush(min_heap, element) elif element > min_heap[0]: # Phase 2: New element beats our weakest candidate # heapreplace is more efficient than extractMin + insert heapq.heapreplace(min_heap, element) return min_heap def top_k_smallest(arr: List[int], k: int) -> List[int]: """ Find the K smallest elements using a max-heap. For K smallest, we use a MAX-heap of size K. The max element is our threshold—anything smaller must be in Top-K. Since Python only provides min-heap, we negate values. """ if k <= 0: return [] if k >= len(arr): return arr[:] # Negate values to simulate max-heap max_heap = [] for element in arr: if len(max_heap) < k: heapq.heappush(max_heap, -element) elif element < -max_heap[0]: # Compare with current max heapq.heapreplace(max_heap, -element) # Negate back to get original values return [-x for x in max_heap] # Example usage and verificationif __name__ == "__main__": arr = [5, 2, 9, 1, 7, 6, 8] print(f"Array: {arr}") print(f"Top 3 largest: {sorted(top_k_largest(arr, 3), reverse=True)}") print(f"Top 3 smallest: {sorted(top_k_smallest(arr, 3))}") # Output: # Array: [5, 2, 9, 1, 7, 6, 8] # Top 3 largest: [9, 8, 7] # Top 3 smallest: [1, 2, 5]This is the most common point of confusion for engineers learning the Top-K pattern. Let's build a deep, intuitive understanding of why the heap choice seems "backwards."
The Insight: Gatekeeping, Not Championing
Think of the heap as a waiting room with limited capacity K. We want to keep the K strongest candidates inside.
The crucial question isn't "Who's the strongest?" but rather "Who should we kick out if someone better arrives?"
For this gatekeeping role, the heap root tells us:
When finding K largest elements, we need quick access to the weakest candidate—the one we'll evict when a better candidate arrives. That's what the min-heap root provides.
For Top-K Largest → Use Min-Heap (quick access to weakest candidate) For Top-K Smallest → Use Max-Heap (quick access to strongest candidate)
The heap type is OPPOSITE to what you're looking for because you need quick access to the element that would be EVICTED, not the element you want to keep.
Worked Example: Why Max-Heap Fails
Suppose we incorrectly use a max-heap to find the Top-3 largest from [5, 2, 9, 1, 7]:
| Step | Element | Max-Heap | To find eviction candidate... |
|---|---|---|---|
| 1 | 5 | [5] | — (not full) |
| 2 | 2 | [5, 2] | — (not full) |
| 3 | 9 | [9, 5, 2] | — (now full) |
| 4 | 1 | [9, 5, 2] | Must scan all 3 to find min (2). O(K) |
| 5 | 7 | [9, 7, 5] | Must scan all 3 to find min (2), then restructure. O(K) |
With a max-heap, every new element requires scanning the entire heap to find the minimum. For n elements, that's O(nK) total—worse than sorting when K is large!
With a min-heap, the minimum is always at the root—O(1) access, and O(log K) for the replacement. Total: O(n log K).
Let's rigorously analyze the time and space complexity of the heap-based Top-K algorithm.
Time Complexity: O(n log K)
Breaking down the operations:
For each of n elements:
Key insight: Heap size never exceeds K
Worst case: Every element triggers insertion
Best case: After initial K elements, no replacements
| Approach | Time Complexity | When n=10⁸, K=10 | Notes |
|---|---|---|---|
| Full Sort | O(n log n) | ~2.66 billion ops | Wasteful for small K |
| Heap-Based | O(n log K) | ~332 million ops | 8x faster for this example |
| QuickSelect | O(n) average | ~100 million ops | Fastest average, but worst case O(n²) |
| Median of Medians | O(n) guaranteed | ~500 million ops | Larger constants, complex implementation |
Space Complexity: O(K)
The heap maintains exactly K elements at any time. This is crucial for:
Concrete Example:
This space efficiency is what enables stream processing of massive datasets.
The O(K) space bound means you can find the Top-100 elements from an infinite stream using constant memory. This is impossible with sorting approaches that require storing all elements. This property makes heaps indispensable for real-time analytics and streaming systems.
The basic Top-K pattern extends naturally to several common variations. Understanding these helps you recognize Top-K problems in different disguises.
Variation 1: Kth Largest Element (Single Element)
Sometimes you don't need all K elements—just the Kth one:
def kth_largest(arr, k):
"""The Kth largest is the root of the min-heap after processing."""
min_heap = []
for element in arr:
heapq.heappush(min_heap, element)
if len(min_heap) > k:
heapq.heappop(min_heap) # Evict smallest
return min_heap[0] # Root is Kth largest
The min-heap root IS the Kth largest—no additional work needed.
Variation 2: Top-K Frequent Elements
Instead of comparing by value, compare by frequency:
def top_k_frequent(arr, k):
"""Find K most frequently occurring elements."""
# Step 1: Count frequencies - O(n)
freq = Counter(arr)
# Step 2: Use min-heap on frequencies - O(m log K) where m = unique elements
min_heap = [] # (frequency, element)
for element, count in freq.items():
heapq.heappush(min_heap, (count, element))
if len(min_heap) > k:
heapq.heappop(min_heap)
return [element for _, element in min_heap]
Variation 3: Top-K with Custom Comparator
Real-world objects have multiple attributes. Use a comparison key:
def top_k_products_by_rating(products, k):
"""Find K highest-rated products, breaking ties by price (lower wins)."""
min_heap = [] # (rating, -price, product) - negate price for tie-breaking
for product in products:
key = (product.rating, -product.price, product)
heapq.heappush(min_heap, key)
if len(min_heap) > k:
heapq.heappop(min_heap)
return [item[2] for item in min_heap]
If the problem requires the Top-K elements in sorted order, add a final O(K log K) sorting step. The heap gives you the K elements but not in sorted order. For K = 10, this adds negligible overhead. For larger K, consider whether sorted output is truly necessary.
Variation 4: Top-K by Key/Score
Very common in recommendation systems and search engines:
def top_k_by_score(items: List[Tuple[str, float]], k: int) -> List[str]:
"""
Given items with scores, return K items with highest scores.
items: List of (item_id, score) tuples
Returns: List of item_ids for Top-K items
"""
min_heap = [] # (score, item_id)
for item_id, score in items:
if len(min_heap) < k:
heapq.heappush(min_heap, (score, item_id))
elif score > min_heap[0][0]:
heapq.heapreplace(min_heap, (score, item_id))
# Return in descending score order
return [item_id for _, item_id in sorted(min_heap, reverse=True)]
This pattern appears in:
The Top-K pattern is simple in concept but has several common implementation pitfalls. Being aware of these will save you debugging time.
123456789101112131415161718192021222324252627282930313233
def top_k_largest_robust(arr: List[int], k: int) -> List[int]: """ Robust Top-K implementation handling all edge cases. """ # Edge case 1: Invalid K if k <= 0: return [] # Edge case 2: K >= n (return all elements) if k >= len(arr): return list(arr) # Return copy, not reference # Edge case 3: Empty array if not arr: return [] min_heap = [] for element in arr: if len(min_heap) < k: heapq.heappush(min_heap, element) elif element > min_heap[0]: # STRICTLY greater heapq.heapreplace(min_heap, element) return min_heap # Test edge casesassert top_k_largest_robust([], 5) == []assert top_k_largest_robust([1, 2, 3], 0) == []assert top_k_largest_robust([1, 2, 3], 5) == [1, 2, 3]assert top_k_largest_robust([1, 1, 1], 2) == [1, 1] # Duplicates OKprint("All edge cases passed!")We've explored the Top-K pattern in depth—one of the most practical applications of the heap data structure. Let's consolidate the key insights:
What's Next:
In the next page, we'll explore space-efficient K element tracking—how to maintain a Top-K set that can be dynamically updated as elements are added or removed, and techniques for reducing memory overhead even further.
You now understand the core Top-K pattern with heaps. This single pattern solves a wide variety of interview problems and real-world engineering challenges. In the upcoming pages, we'll build on this foundation to tackle more advanced scenarios including streaming data and real-world applications.