Before we explore how binary search works, we must understand why it works. Binary search isn't magic—it's a logical consequence of one specific property in our data: order.
Imagine trying to find a word in a dictionary. You don't start at page one and scan every word. Instead, you open the dictionary somewhere in the middle, see if your word comes before or after that point, and eliminate half the remaining pages. You do this instinctively because dictionaries are alphabetically sorted.
Now imagine a dictionary where words appear in random order. Every search becomes a nightmare—you'd have to check every single entry. The sorting property that makes dictionaries usable is precisely the property that makes binary search possible.
By the end of this page, you will understand:
• Why sorted data isn't just helpful, but mathematically essential for binary search
• What properties sorted data guarantees and how binary search exploits them
• The formal definition of sortedness and its implications
• Common scenarios where this prerequisite is satisfied or must be established
• The relationship between sorting cost and search efficiency
Before we can appreciate why sorted data enables binary search, we must precisely define what it means for data to be sorted.
Formal Definition:
A sequence A = [a₀, a₁, a₂, ..., aₙ₋₁] is said to be sorted in ascending order if and only if:
For all indices i and j where 0 ≤ i < j < n, we have aᵢ ≤ aⱼ
Equivalently, for consecutive elements:
For all i where 0 ≤ i < n-1, we have aᵢ ≤ aᵢ₊₁
This seemingly simple property—that every element is less than or equal to the one following it—has profound implications that binary search exploits.
We use ≤ (less than or equal) rather than < (strictly less than) to allow duplicate values. A sorted array can contain repeated elements: [1, 2, 2, 3, 5, 5, 5, 8] is perfectly valid sorted data. This distinction becomes critical when we later discuss finding first/last occurrences of a target value.
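As a quick illustration of the pairwise definition, here is a minimal sketch in Python; the helper name is_ascending is ours, not from any library:

```python
def is_ascending(a):
    """Check the pairwise definition: a[i] <= a[i+1] for every consecutive pair."""
    return all(a[i] <= a[i + 1] for i in range(len(a) - 1))

print(is_ascending([1, 2, 2, 3, 5, 5, 5, 8]))  # True: duplicates are allowed under <=
print(is_ascending([1, 3, 2]))                 # False: 3 > 2 violates the definition
```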
Descending Order:
An array sorted in descending order reverses the inequality:
For all i where 0 ≤ i < n-1, we have aᵢ ≥ aᵢ₊₁
Binary search works identically on descending-sorted arrays—we simply reverse our elimination logic. When the midpoint value is greater than our target, we search right (not left).
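A minimal sketch of that reversed elimination logic, assuming the standard iterative implementation (not tied to any particular library):

```python
def binary_search_descending(arr, target):
    """Binary search over an array sorted in descending order."""
    left, right = 0, len(arr) - 1
    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] > target:
            # Values shrink left-to-right, so anything <= target lies to the right.
            left = mid + 1
        else:
            right = mid - 1
    return -1

print(binary_search_descending([9, 8, 5, 3, 2, 1], 3))  # 3
```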
The Transitivity Property:
Sortedness guarantees transitivity: if we know a₃ < a₇ and the array is sorted, we can immediately conclude that every element at or before index 3 is less than every element at or after index 7: aᵢ ≤ a₃ < a₇ ≤ aⱼ for all i ≤ 3 and j ≥ 7.
This transitivity is the mathematical foundation of binary search. By examining a single element, we gain information about all elements on either side of it.
Sortedness establishes what we call the order invariant—a property that holds at every position in the array and enables logical deduction about element locations.
The Key Insight:
In a sorted array, if we're searching for a target value T and we examine any element at position mid with value M:
If M < T: The target cannot exist at indices 0 through mid (if it exists at all). Why? Because all elements before mid are ≤ M, and M < T, so all elements before mid are < T.
If M > T: The target cannot exist at indices mid through n-1. All elements after mid are ≥ M > T, so they're all too large.
If M = T: We've found the target (or one instance of it).
This is the order invariant in action: a single comparison gives us information about half the array.
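Expressed as code, the decision at a single midpoint looks like this; the helper name eliminate_half is ours, and the full algorithm appears later on this page:

```python
def eliminate_half(arr, left, right, target):
    """One step of binary search: decide which half of arr[left..right] to discard."""
    mid = (left + right) // 2
    if arr[mid] == target:
        return ("found", mid)
    elif arr[mid] < target:
        return ("search right", mid + 1, right)  # indices left..mid are all < target
    else:
        return ("search left", left, mid - 1)    # indices mid..right are all > target

print(eliminate_half([1, 2, 3, 5, 7, 8, 9], 0, 6, 8))  # ('search right', 4, 6)
```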
| Comparison Result | Sorted Array | Unsorted Array |
|---|---|---|
| M < Target | Target cannot be in left half (indices 0 to mid) | No information about other elements |
| M > Target | Target cannot be in right half (indices mid to n-1) | No information about other elements |
| M = Target | Found target (may be duplicates elsewhere) | Found target (may be duplicates elsewhere) |
| Information per comparison | Eliminates ~50% of remaining elements | Eliminates exactly 1 element |
The Power of Elimination:
The table above reveals something profound: in sorted data, each comparison eliminates roughly half of the remaining search space. In unsorted data, each comparison only tells us about the specific element we examined—providing no information about unexamined elements.
This is why binary search achieves O(log n) complexity: we halve the problem with each step. After k comparisons, we've reduced the search space to n/2ᵏ elements. In unsorted data, after k comparisons, we still have n - k elements (linear reduction).
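To make the n/2ᵏ claim concrete, a quick back-of-the-envelope check (plain arithmetic, nothing algorithm-specific):

```python
import math

n = 1_000_000
# After k comparisons the search space is at most n / 2**k elements,
# so about log2(n) comparisons shrink it to a single candidate.
print(math.ceil(math.log2(n)))  # 20
print(n // 2**10)               # 976 candidates remain after just 10 comparisons
```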
Think of a sorted array as a number line where values increase left-to-right. Any vertical line (representing a position) divides all values into two groups: 'less than or equal' on the left, 'greater than or equal' on the right. Binary search exploits this partition to navigate toward the target.
To truly understand why sorted data is essential, let's examine what happens when we attempt binary search on unsorted data.
Example: Searching for 5 in an Unsorted Array
Consider the array: [7, 2, 9, 5, 1, 8, 3]
Using binary search logic, we check the middle element first: mid = 3 and arr[3] = 5, which is our target. Found at index 3.
This seems to work! But that's pure luck. Let's search for 8: arr[3] = 5 < 8, so we search the right half; mid = 5 and arr[5] = 8. Found at index 5.
Still lucky. Now search for 2: arr[3] = 5 > 2, so we search the left half; mid = 1 and arr[1] = 2. Found at index 1.
But now search for 3: arr[3] = 5 > 3, so we go left; arr[1] = 2 < 3, so we go right; arr[2] = 9 > 3, so we go left again, and the search range is now empty. Binary search reports "not found".
But 3 IS in the array (at index 6)! Binary search failed because it made assumptions ("3 must be to the right of 2") that only hold for sorted data.
In unsorted data, seeing that M < Target tells us nothing about where Target might be. It could be anywhere—left, right, or not present at all. The logical deduction that powers binary search becomes invalid, and the algorithm produces incorrect results.
Formal Proof by Counterexample:
For any unsorted array of n distinct elements (n ≥ 3), there is always some target for which binary search fails: binary search can only find an element if every comparison on the path to it is consistent with ascending order, and if that were true for every element, the array would have to be sorted, contradicting our assumption. At least one element in the array is therefore unreachable.
Since this counterexample exists for every unsorted array of distinct elements, binary search is guaranteed to be unreliable without sorted data. Not "often wrong"—provably incorrect.
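As an empirical sanity check of this claim, here is a brute-force sketch of ours (not part of the lesson's reference code) that enumerates every permutation of a few distinct values and confirms each unsorted one has at least one element binary search misses:

```python
from itertools import permutations

def binary_search(arr, target):
    left, right = 0, len(arr) - 1
    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1

# For every unsorted permutation of n distinct values, some element is missed.
for n in range(3, 7):
    for arr in permutations(range(n)):
        if list(arr) != sorted(arr):
            assert any(binary_search(arr, x) == -1 for x in arr), arr
print("Every unsorted permutation (n = 3..6) has at least one unfindable element.")
```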
```python
# Demonstration: Binary search fails on unsorted data
def binary_search(arr, target):
    """Standard binary search (assumes sorted input)"""
    left, right = 0, len(arr) - 1
    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1  # Not found

# Unsorted array
unsorted = [7, 2, 9, 5, 1, 8, 3]

# Search for 3 (which exists at index 6)
result = binary_search(unsorted, 3)
print(f"Search for 3: index {result}")  # Outputs: -1 (wrong!)

# Search for 7 (which exists at index 0)
result = binary_search(unsorted, 7)
print(f"Search for 7: index {result}")  # Outputs: -1 (wrong!)

# The same array, but sorted
sorted_arr = sorted(unsorted)  # [1, 2, 3, 5, 7, 8, 9]

# Now binary search works correctly
result = binary_search(sorted_arr, 3)
print(f"Sorted - Search for 3: index {result}")  # Outputs: 2 (correct!)
```

Understanding when data arrives already sorted—and when we must sort it ourselves—is crucial for effective algorithm selection.
Naturally Sorted Data:
Some data sources inherently produce sorted data:
• Timestamped records such as logs and event streams, which are appended in chronological order
• Auto-incrementing identifiers, such as database primary keys
• The output of an earlier processing stage that already sorted the data, such as a merge step
Data Requiring Explicit Sorting:
Most real-world data is not naturally sorted. User inputs, API responses, file contents, and random samples all arrive in arbitrary order. To use binary search, we must first apply a sorting algorithm.
The Sorting Investment:
Here's where strategic thinking becomes essential. Sorting takes O(n log n) time for comparison-based sorts. If we're only searching once, spending O(n log n) to enable O(log n) search is worse than just using O(n) linear search!
The calculus changes when we search repeatedly:
| Searches (k) | Linear Search Total | Sort Once + Binary Search | Winner |
|---|---|---|---|
| 1 | O(n) | O(n log n) + O(log n) ≈ O(n log n) | Linear Search |
| log n | O(n log n) | O(n log n) + O(log² n) ≈ O(n log n) | Approximately Equal |
| n | O(n²) | O(n log n) + O(n log n) = O(n log n) | Binary Search (by n/log n factor) |
| n² | O(n³) | O(n log n) + O(n² log n) = O(n² log n) | Binary Search (by n/log n factor) |
If you'll search the same data more than log n times, sorting first and using binary search is more efficient than repeated linear searches. For n = 1,000,000, that's about 20 searches. For n = 1 billion, it's about 30. In practice, most applications perform thousands or millions of searches on the same data, making the sorting investment highly worthwhile.
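To see where that break-even point falls, here is a rough cost model as a sketch; it compares only the dominant terms and ignores constant factors:

```python
import math

def linear_total(n, k):
    """Total comparisons for k independent linear searches."""
    return k * n

def sort_then_binary_total(n, k):
    """One O(n log n) sort followed by k O(log n) binary searches."""
    return n * math.log2(n) + k * math.log2(n)

n = 1_000_000
for k in (1, 20, 1_000, 1_000_000):
    print(k, linear_total(n, k), round(sort_then_binary_total(n, k)))
# Around k ≈ log2(n) ≈ 20 the two strategies cost about the same;
# beyond that, sorting once and binary searching wins by a growing margin.
```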
Real systems often face a challenge: data changes over time. Elements are inserted, deleted, or modified. How do we maintain sorted order efficiently?
Option 1: Re-Sort After Each Modification
The simplest approach—run a full sort after any change—is often impractical. For n = 1,000,000 elements, each modification triggers O(n log n) ≈ 20 million operations.
Option 2: Insertion at Correct Position
When inserting into a sorted array:
• Locate the correct insertion position with binary search: O(log n)
• Shift every element after that position one slot to the right to make room: O(n)
Total: O(n) per insertion. Better than re-sorting, but still expensive for frequent insertions.
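In Python, the standard library's bisect module implements exactly this insert-in-place pattern; a small sketch:

```python
import bisect

scores = [10, 20, 30, 40, 50]

# O(log n) to locate the position, O(n) to shift the tail and insert.
bisect.insort_left(scores, 35)
print(scores)  # [10, 20, 30, 35, 40, 50]

# Equivalent two-step version that shows the insertion position explicitly.
pos = bisect.bisect_left(scores, 45)
scores.insert(pos, 45)
print(scores)  # [10, 20, 30, 35, 40, 45, 50]
```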
Option 3: Sorted Data Structures
For truly dynamic data, specialized structures maintain sorted order efficiently:
• Balanced binary search trees (AVL, red-black): O(log n) insertion, deletion, and search
• B-trees and B+-trees: the same logarithmic guarantees, organized for disk and cache efficiency; the backbone of database indexes
• Skip lists: probabilistic structures with expected O(log n) operations
The Trade-off Triangle:
Choosing a data representation involves balancing:
• Search speed: how quickly we can locate an element
• Update speed: how quickly we can insert, delete, or modify elements
• Space and simplicity: how much memory overhead and implementation complexity we accept
For read-heavy workloads (many searches, few modifications), sorted arrays with binary search are ideal. For write-heavy workloads, balanced trees or other dynamic structures are preferred.
Batch Operations:
A common optimization for mixed workloads: buffer modifications, then periodically rebuild. For example:
• Keep the bulk of the data in a sorted array and collect new insertions in a small unsorted buffer
• Answer each lookup with a binary search over the sorted array plus a linear scan of the small buffer
• When the buffer exceeds a threshold, merge it into the sorted array and start a fresh buffer
This amortizes the sorting cost across multiple insertions, achieving better average-case performance.
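A minimal sketch of that buffering pattern, assuming a simple list-based store; the class and method names here are illustrative, not from any library:

```python
import bisect

class BufferedSortedStore:
    """Sorted array for the bulk of the data plus a small unsorted insert buffer."""

    def __init__(self, items=(), buffer_limit=64):
        self.sorted_items = sorted(items)
        self.buffer = []
        self.buffer_limit = buffer_limit

    def insert(self, value):
        self.buffer.append(value)      # O(1) amortized
        if len(self.buffer) >= self.buffer_limit:
            self._rebuild()            # pay the sorting cost occasionally

    def _rebuild(self):
        self.sorted_items = sorted(self.sorted_items + self.buffer)
        self.buffer = []

    def contains(self, value):
        # Binary search over the sorted bulk, linear scan over the small buffer.
        i = bisect.bisect_left(self.sorted_items, value)
        in_sorted = i < len(self.sorted_items) and self.sorted_items[i] == value
        return in_sorted or value in self.buffer

store = BufferedSortedStore([5, 1, 9])
store.insert(7)
print(store.contains(7), store.contains(4))  # True False
```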
Production systems often use hybrid approaches. A database might use a B-tree for the live index (supporting efficient updates) but periodically compact data into sorted runs for optimal read performance. Understanding sorted data's role in binary search helps you appreciate these architectural decisions.
Binary search works whenever data has a total ordering—a consistent way to compare any two elements. This extends far beyond numbers.
Requirements for a Valid Comparison:
A comparison function must satisfy three properties:
• Totality: any two elements can be compared (either a ≤ b or b ≤ a)
• Antisymmetry: if a ≤ b and b ≤ a, then a and b are equivalent for ordering purposes
• Transitivity: if a ≤ b and b ≤ c, then a ≤ c
Any comparison satisfying these properties enables binary search.
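One way to gain confidence in a custom comparator is to brute-force these properties over sample data; a sketch, where check_total_order is our own helper rather than a library function:

```python
from itertools import product

def check_total_order(values, leq):
    """Brute-force check of totality, antisymmetry, and transitivity on a sample."""
    for a, b in product(values, repeat=2):
        assert leq(a, b) or leq(b, a), f"not total: {a}, {b}"
        if leq(a, b) and leq(b, a):
            assert a == b, f"antisymmetry violated: {a}, {b}"
    for a, b, c in product(values, repeat=3):
        if leq(a, b) and leq(b, c):
            assert leq(a, c), f"not transitive: {a}, {b}, {c}"
    return True

# Example: case-insensitive ordering, checked on strings with distinct lowercase forms.
print(check_total_order(["Apple", "banana", "Cherry"],
                        lambda x, y: x.lower() <= y.lower()))  # True
```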
```python
from dataclasses import dataclass
import bisect

@dataclass
class Student:
    name: str
    gpa: float
    id: int

    def __lt__(self, other):
        # Primary sort: by GPA (descending)
        if self.gpa != other.gpa:
            return self.gpa > other.gpa  # Higher GPA comes first
        # Secondary sort: by name (ascending)
        if self.name != other.name:
            return self.name < other.name
        # Tertiary sort: by ID (ascending)
        return self.id < other.id

# Creating a sorted list of students
students = [
    Student("Alice", 3.9, 101),
    Student("Bob", 3.7, 102),
    Student("Charlie", 3.9, 103),
    Student("Diana", 3.5, 104),
]

# Sort according to our custom ordering
students.sort()

# Binary search for a student with GPA 3.7
# Using bisect with a key function (Python 3.10+)
target_gpa = 3.7
# Finding the position where a student with this GPA would be inserted.
# students is already in descending-GPA order, so the key negates each GPA
# to give bisect an ascending sequence; the target must be negated the same way.
gpas = [s.gpa for s in students]
position = bisect.bisect_left(gpas, -target_gpa, key=lambda x: -x)
print(f"A student with GPA {target_gpa} would be inserted at index {position}")

print("Students sorted by custom ordering:")
for s in students:
    print(f"  {s.name}: GPA {s.gpa}")
```

Common Orderings in Practice:

Beyond plain numbers, practical total orderings include lexicographic order on strings, chronological order on dates and timestamps, and composite (tuple) orderings like the GPA-then-name-then-ID scheme above.
The Critical Constraint:
Whatever comparison you use for sorting, you must use the same comparison for binary search. Sorting by one criterion and searching by another produces incorrect results—the order invariant only holds for the specific comparison used to establish order.
A common bug: sorting a list of (key, value) pairs by key, then binary searching by value. This fails silently—the element may exist but won't be found because the order invariant applies to keys, not values. Always verify that your search criterion matches your sort criterion.
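A small sketch of that failure mode, using hypothetical (key, value) pairs and our own helper functions:

```python
import bisect

# Pairs sorted by key; the values are NOT in any particular order.
pairs = [(1, "pear"), (2, "apple"), (3, "mango"), (4, "banana")]

def search_by_value_wrong(pairs, value):
    # BUG: the order invariant holds for keys, not values,
    # so binary searching the values fails silently.
    values = [v for _, v in pairs]
    i = bisect.bisect_left(values, value)
    return i if i < len(values) and values[i] == value else -1

def search_by_key_correct(pairs, key):
    keys = [k for k, _ in pairs]
    i = bisect.bisect_left(keys, key)
    return i if i < len(keys) and keys[i] == key else -1

print(search_by_value_wrong(pairs, "apple"))  # -1, even though "apple" is present
print(search_by_key_correct(pairs, 3))        # 2
```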
In production code, trusting that input is sorted can lead to subtle, hard-to-diagnose bugs. Binary search on unsorted data doesn't crash—it returns wrong answers silently.
When to Verify:
• During development and testing, where an O(n) check catches ordering bugs early
• When data crosses a trust boundary: user uploads, external APIs, or files your code didn't produce
• After any change to the pipeline that establishes the ordering
In validated, performance-critical production paths, skip the check; the code below shows both modes.
```python
from typing import Any, Callable, Sequence, TypeVar

T = TypeVar('T')

def is_sorted(
    arr: Sequence[T],
    key: Callable[[T], Any] = lambda x: x,
    reverse: bool = False
) -> bool:
    """
    Verify that a sequence is sorted.

    Args:
        arr: Sequence to check
        key: Function to extract comparison key
        reverse: True if checking descending order

    Returns:
        True if sorted according to criteria

    Time: O(n), Space: O(1)
    """
    if len(arr) <= 1:
        return True
    prev = key(arr[0])
    for i in range(1, len(arr)):
        curr = key(arr[i])
        if reverse:
            if curr > prev:
                return False
        else:
            if curr < prev:
                return False
        prev = curr
    return True

def find_sorted_violations(
    arr: Sequence[T],
    key: Callable[[T], Any] = lambda x: x,
    reverse: bool = False
) -> list[tuple[int, T, T]]:
    """
    Find all positions where sorted order is violated.

    Returns:
        List of (index, element, previous_element) for violations
    """
    violations = []
    if len(arr) <= 1:
        return violations
    prev = key(arr[0])
    for i in range(1, len(arr)):
        curr = key(arr[i])
        violated = (curr > prev) if reverse else (curr < prev)
        if violated:
            violations.append((i, arr[i], arr[i - 1]))
        prev = curr
    return violations

def safe_binary_search(arr: Sequence[T], target: T, verify: bool = True) -> int:
    """
    Binary search with optional sortedness verification.

    In development/testing: verify=True catches bugs early
    In production: verify=False for performance (after validation)
    """
    if verify and not is_sorted(arr):
        raise ValueError("Binary search requires sorted input!")

    left, right = 0, len(arr) - 1
    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1

# Usage examples
sorted_data = [1, 3, 5, 7, 9, 11, 13, 15]
unsorted_data = [1, 5, 3, 7, 9, 11, 13, 15]

print(f"Sorted check: {is_sorted(sorted_data)}")      # True
print(f"Unsorted check: {is_sorted(unsorted_data)}")  # False

violations = find_sorted_violations(unsorted_data)
print(f"Violations found: {violations}")  # [(2, 3, 5)]

try:
    safe_binary_search(unsorted_data, 7, verify=True)
except ValueError as e:
    print(f"Caught error: {e}")
```

Verification costs O(n)—the same as linear search. Enable it during development and testing to catch bugs early, but disable it in performance-critical production paths (after you've validated your data pipeline ensures sortedness).
Sorted data isn't just a nice-to-have for binary search—it's the mathematical foundation that makes the algorithm correct. Without order, we cannot make deductions; without deductions, we cannot eliminate half the search space; without elimination, we cannot achieve logarithmic efficiency.
Let's consolidate what we've learned:
• Sortedness formally means every element is ≤ the one after it; duplicates are allowed
• The order invariant lets a single comparison eliminate roughly half of the remaining candidates, which is where the O(log n) complexity comes from
• On unsorted data, binary search doesn't crash; it silently returns wrong answers
• Sorting costs O(n log n), so it pays off once you search the same data more than about log n times
• Binary search works for any total ordering, but the search must use exactly the same comparison as the sort
• Verifying sortedness costs O(n); do it in development and testing, and trust a validated pipeline in production
What's Next:
Now that we understand why sorted data is essential, we're ready to explore how binary search exploits this property. The next page introduces the divide-and-eliminate strategy—the elegant approach that transforms the order invariant into logarithmic efficiency.
You now understand the essential prerequisite for binary search: sorted data. This isn't merely a requirement to memorize—it's the deep mathematical reason the algorithm works at all. With this foundation, you're ready to learn the divide-and-eliminate strategy that makes binary search one of the most elegant algorithms in computer science.