Every modern search system—from Google's web search to your IDE's code search to PostgreSQL's full-text capabilities—relies on a deceptively simple yet profoundly powerful data structure: the inverted index.
The name comes from 'inverting' the natural relationship between documents and words. Normally, we think of a document as containing words. An inverted index flips this perspective: for each word, we record which documents contain it. This simple inversion transforms an O(n × m) scan problem (check every word in every document) into an O(k) lookup problem (find the k documents containing a specific word).
By the end of this page, you will understand the complete architecture of an inverted index: the vocabulary (lexicon), posting lists, positional information, and skip pointers. You'll learn how indexes are stored on disk and how they support efficient query operations.
Let's build intuition with a concrete example. Consider three documents:
Doc 1: "The database stores data efficiently." Doc 2: "Efficient data indexing improves database queries." Doc 3: "Query processing requires efficient indexing."
Forward View (Documents → Words):
Naturally, we'd represent these as:
```
Forward Index (Document → Words):

Doc 1: {the, database, stores, data, efficiently}
Doc 2: {efficient, data, indexing, improves, database, queries}
Doc 3: {query, processing, requires, efficient, indexing}

To find documents containing "database":
- Scan Doc 1 words → Found!
- Scan Doc 2 words → Found!
- Scan Doc 3 words → Not found.
- Result: [Doc 1, Doc 2]

Cost: O(total words across all documents) = O(n × average doc length)
```

This forward view requires scanning all documents for every query. With millions of documents and thousands of words each, this is catastrophically slow.
Inverted View (Words → Documents):
```
Inverted Index (Word → Documents):

database    → [Doc 1, Doc 2]
data        → [Doc 1, Doc 2]
efficient   → [Doc 2, Doc 3]
efficiently → [Doc 1]
improves    → [Doc 2]
indexing    → [Doc 2, Doc 3]
processing  → [Doc 3]
queries     → [Doc 2]
query       → [Doc 3]
requires    → [Doc 3]
stores      → [Doc 1]
the         → [Doc 1]

To find documents containing "database":
- Look up "database" → [Doc 1, Doc 2]
- Result: [Doc 1, Doc 2]

Cost: O(1) hash lookup + O(k) to read k matching docs
```

The inverted index pre-computes the answer to 'which documents contain word X?' for every word. At query time, we simply look up the precomputed answer. Building the index is expensive, but queries become nearly instant.
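As a minimal sketch of this inversion in Python (the `tokenize` helper and its lowercase/split normalization are illustrative assumptions, not a prescribed implementation):

```python
import re
from collections import defaultdict

docs = {
    1: "The database stores data efficiently.",
    2: "Efficient data indexing improves database queries.",
    3: "Query processing requires efficient indexing.",
}

def tokenize(text):
    # Simplistic normalization: lowercase, split on non-letters
    return re.findall(r"[a-z]+", text.lower())

# Invert the document -> words relationship into words -> documents
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for word in tokenize(text):
        inverted[word].add(doc_id)

print(sorted(inverted["database"]))   # [1, 2]
print(sorted(inverted["efficient"]))  # [2, 3]
```

Building `inverted` costs one pass over every word, but each subsequent query is a single dictionary lookup.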
The Core Components:
An inverted index consists of two main components:
Vocabulary (Lexicon): A sorted list or hash of all unique terms (words) in the document collection. Each term points to its posting list.
Posting Lists: For each term, a list of documents containing that term, along with additional information (positions, frequency, etc.).
The vocabulary is the entry point to the inverted index. For each unique term in the collection, it stores the term string itself, the term's document frequency, and a pointer (typically a byte offset) to the term's posting list.
Size Considerations:
The vocabulary size depends on the collection:
| Collection Type | Documents | Total Tokens | Unique Terms | Avg. Term Length |
|---|---|---|---|---|
| News Articles (1 year) | ~500K | ~500M | ~500K | 7 chars |
| Wikipedia (English) | ~6M | ~4B | ~10M | 8 chars |
| Web Crawl | ~100B+ | ~1T+ | ~1B+ | 8 chars |
| Source Code Repository | ~10M files | ~50B | ~50M | 10 chars |
| Legal Documents | ~1M | ~2B | ~2M | 9 chars |
Vocabulary Organization Options:
The vocabulary can be organized in several ways, each with tradeoffs:
Trie-Based Vocabulary:
For large vocabularies with prefix query support, tries are highly effective:
```
Trie for vocabulary: {data, database, date, index, indexing}

        [root]
        /    \
       d      i
       |      |
       a      n
       |      |
       t      d
      / \     |
     a   e    e
     |   |    |
  [base] [*]  x
     |       / \
    [*]    [*]  i
                |
                n
                |
                g
                |
               [*]

[*] = term endpoint
(note: "data" and "index" are terms themselves as well as prefixes
 of "database" and "indexing")

Benefits:
- Shared prefixes save space ("data" and "database" share "data")
- Prefix query "data*" just traverses to "data" and collects all descendants
- Natural support for wildcard queries
```

For small vocabularies (under a few million terms), the entire vocabulary fits in memory. For web-scale vocabularies with billions of terms, disk-resident structures like B+-trees or compressed tries become necessary.
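To illustrate how prefix queries fall out of the structure, here is a minimal, uncompressed trie sketch in Python (a toy version; production systems use the compressed or disk-resident variants mentioned above):

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.is_term = False  # True if a vocabulary term ends here

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, term):
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.is_term = True

    def prefix_query(self, prefix):
        """Return all terms starting with `prefix` (e.g., the query 'data*')."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        # Collect all descendant terms via depth-first traversal
        results, stack = [], [(node, prefix)]
        while stack:
            node, term = stack.pop()
            if node.is_term:
                results.append(term)
            for ch, child in node.children.items():
                stack.append((child, term + ch))
        return sorted(results)

trie = Trie()
for term in ["data", "database", "date", "index", "indexing"]:
    trie.insert(term)

print(trie.prefix_query("data"))  # ['data', 'database']
```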
Posting lists are the heart of the inverted index. Each posting list contains all documents where a term appears, along with information needed for ranking and phrase queries.
Basic Posting:
The simplest posting contains just a document ID:
```
Basic Posting Format:
--------------------------------------------------
term        | posting list
--------------------------------------------------
database    | [1, 5, 12, 47, 89, 156, ...]
index       | [1, 3, 12, 15, 47, ...]
performance | [5, 47, 89, 234, ...]

Each number is a document ID (sorted in ascending order).

Query "database"           → Return documents [1, 5, 12, 47, 89, 156, ...]
Query "database AND index" → Intersect lists → [1, 12, 47, ...]
```

Extended Posting with Frequency:
For relevance ranking, we need term frequency—how many times the term appears in each document:
```
Posting with Term Frequency (TF):
--------------------------------------------------
term     | posting list: (docID, termFreq)
--------------------------------------------------
database | [(1, 3), (5, 1), (12, 7), (47, 2), ...]
         | Doc 1 has 'database' 3 times
         | Doc 5 has 'database' 1 time
         | Doc 12 has 'database' 7 times (likely more relevant!)

This enables TF-IDF ranking:
- Term Frequency (TF): How often the term appears in THIS document
- Document Frequency (DF): In how many documents the term appears
- Inverse Document Frequency (IDF): log(N/DF) - rarer terms score higher
```
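As a sketch of how a frequency-bearing posting list feeds ranking (using one common TF-IDF variant; the collection size `N` and the postings below are assumed example values):

```python
import math

N = 1000  # assumed total number of documents in the collection

# term -> list of (docID, termFreq), as in the diagram above
postings = {
    "database": [(1, 3), (5, 1), (12, 7), (47, 2)],
}

def tf_idf_scores(term):
    posting_list = postings.get(term, [])
    df = len(posting_list)   # document frequency: how many docs contain the term
    if df == 0:
        return {}
    idf = math.log(N / df)   # inverse document frequency: rarer terms weigh more
    return {doc_id: tf * idf for doc_id, tf in posting_list}

# Doc 12 (tf=7) outranks Doc 5 (tf=1) for the query "database"
print(sorted(tf_idf_scores("database").items(), key=lambda kv: -kv[1]))
```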
Positional Postings:

For phrase queries and proximity searches, we need positions:
```
Positional Posting Format:
--------------------------------------------------
term     | posting list: (docID, [positions])
--------------------------------------------------
database | [(1, [0, 15, 42]), (5, [7]), (12, [3, 8, 12, 45, 67, 89, 102]), ...]
         | Doc 1: 'database' at positions 0, 15, and 42
         | Doc 5: 'database' at position 7
         | Doc 12: 'database' at 7 positions

Example: Find phrase "database index"
--------------------------------------------------
database | [(1, [0, 15, 42]), (5, [7]), (12, [3])]
index    | [(1, [1, 43]), (5, [12]), (12, [4])]

Check each shared document:
- Doc 1:  database at 0, index at 1   → positions differ by 1 → PHRASE MATCH!
          database at 42, index at 43 → positions differ by 1 → PHRASE MATCH!
- Doc 5:  database at 7, index at 12  → differ by 5 → no phrase match
- Doc 12: database at 3, index at 4   → differ by 1 → PHRASE MATCH!

Result: [Doc 1, Doc 12] contain the phrase "database index"
```

Positional indexes are significantly larger than frequency-only indexes—often 2-4× larger. Each word occurrence requires storing its position. A document with 1000 words has 1000 positions to store. This space cost is the price for phrase and proximity query support.
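The phrase-matching procedure traced above can be sketched in Python as follows (a simplified two-term version; real engines generalize this to n-term phrases):

```python
def phrase_match(postings_a, postings_b):
    """
    Find documents where term B occurs immediately after term A,
    given positional posting lists of the form [(doc_id, [positions]), ...].
    """
    a, b = dict(postings_a), dict(postings_b)
    results = []
    for doc_id in a.keys() & b.keys():  # only documents containing both terms
        positions_b = set(b[doc_id])
        # Phrase match if some position of A is directly before one of B
        if any(p + 1 in positions_b for p in a[doc_id]):
            results.append(doc_id)
    return sorted(results)

database = [(1, [0, 15, 42]), (5, [7]), (12, [3])]
index    = [(1, [1, 43]), (5, [12]), (12, [4])]

print(phrase_match(database, index))  # [1, 12]
```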
Complete Posting Entry Structure:
A production inverted index entry typically contains:
| Field | Description | Purpose |
|---|---|---|
| Document ID | Unique identifier for document | Identify which document |
| Term Frequency | Count of term occurrences | TF-IDF ranking |
| Positions | List of byte/word offsets | Phrase/proximity queries |
| Field ID | Which field (title/body/etc.) | Field-weighted ranking |
| Payload | Custom per-occurrence data | Domain-specific scoring |
When processing boolean queries, we must compute intersections and unions of posting lists. For AND queries with long posting lists, naively scanning both lists is inefficient. Skip pointers enable jumping ahead in posting lists.
The Intersection Problem:
Query: "database AND performance" database → [1, 3, 7, 12, 15, 23, 45, 67, 89, 156, 234, 345, 456, ...]performance → [12, 156, 789, ...] Naive intersection:- Compare database[0]=1 with performance[0]=12 → 1 < 12, advance database- Compare database[1]=3 with performance[0]=12 → 3 < 12, advance database- Compare database[2]=7 with performance[0]=12 → 7 < 12, advance database- Compare database[3]=12 with performance[0]=12 → MATCH! Add 12 to result- Compare database[4]=15 with performance[1]=156 → 15 < 156, advance database- ... keep advancing database through many non-matching entries Problem: If performance list is short and database list is long,we scan most of database list just to find a few matches.Skip Pointer Solution:
Skip Pointer Solution:

Add skip pointers at regular intervals to allow jumping ahead:
Query: "database AND performance" (performance[0] = 12) database with skip pointers every 3 elements:[1] -skip→ [12] -skip→ [45] -skip→ [156] ... ↓ ↓ ↓ ↓[3] [15] [67] [234] ↓ ↓ ↓ ↓[7] [23] [89] [345] Searching for match with performance[0]=12:1. At database[0]=1: Check skip to database[3]=12 - 12 >= target 12, so we CAN skip! Jump to [12]2. Compare database[3]=12 with performance[0]=12 → MATCH! We skipped comparing 3 and 7 entirely! With longer lists and searching for 156:1. At [1]: skip to [12]=12 < 156, take skip2. At [12]: skip to [45]=45 < 156, take skip 3. At [45]: skip to [156]=156 >= 156, take skip4. At [156]: 156 == 156, MATCH! Skipped: 1,3,7,12,15,23,45,67,89 → Only visited 4 entries instead of 10Optimal Skip Distance:
Optimal Skip Distance:

How far apart should skip pointers be? This involves a tradeoff:
The optimal skip distance for a posting list of length L is approximately √L. This balances expected skip distance against pointer overhead.
| List Length (L) | Optimal Skip (√L) | Max Comparisons (worst) | Comparisons without skip |
|---|---|---|---|
| 100 | 10 | 20 | 100 |
| 1,000 | ~32 | ~64 | 1,000 |
| 10,000 | 100 | 200 | 10,000 |
| 1,000,000 | 1,000 | 2,000 | 1,000,000 |
For very long posting lists, multi-level skip pointers (like skip lists for search) can provide O(log n) intersection cost. Each level skips farther ahead, with fewer pointers at higher levels.
Inverted indexes can be enormous—often larger than the original document collection. Compression is essential for practical systems. Fortunately, posting lists have properties that make them highly compressible.
Key Observation: Gaps are Small
Posting lists store document IDs in sorted order. Instead of storing absolute IDs, store the gaps (differences) between consecutive IDs:
```
Original posting list (document IDs):
[1023, 1047, 1089, 1156, 1178, 1234, 1289, 1301]

Gap-encoded (first value + gaps):
[1023, 24, 42, 67, 22, 56, 55, 12]

Why this helps:
- Original values: 4 digits each, need ~11 bits minimum
- Gap values: mostly 2 digits, need only ~6 bits each

For common terms appearing in many documents:
Original: [1, 5, 12, 18, 23, 31, 45, 52, 67, 78, ...]
Gaps:     [1, 4, 7, 6, 5, 8, 14, 7, 15, 11, ...]

Most gaps are small numbers requiring few bits!
```
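Gap encoding and decoding are a few lines each; this sketch reproduces the example above:

```python
def gap_encode(doc_ids):
    """Store the first docID, then differences between consecutive IDs."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def gap_decode(gaps):
    """Rebuild absolute docIDs by accumulating the gaps."""
    doc_ids = [gaps[0]]
    for gap in gaps[1:]:
        doc_ids.append(doc_ids[-1] + gap)
    return doc_ids

ids = [1023, 1047, 1089, 1156, 1178, 1234, 1289, 1301]
gaps = gap_encode(ids)
print(gaps)                   # [1023, 24, 42, 67, 22, 56, 55, 12]
assert gap_decode(gaps) == ids
```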
Variable-Byte (VByte) Encoding:

Variable-byte encoding uses 1 byte for small numbers, more bytes for larger numbers:
```
Variable-Byte Encoding:
Each byte uses 7 bits for data, 1 bit as a continuation flag.
High bit = 1 means "more bytes follow"
High bit = 0 means "this is the last byte"

Examples:
Number 5:     0|0000101                      (1 byte)  - high bit 0 = done
Number 127:   0|1111111                      (1 byte)  - high bit 0 = done
Number 128:   1|0000001 0|0000000            (2 bytes) - 1×128 + 0
Number 130:   1|0000001 0|0000010            (2 bytes) - 1×128 + 2
Number 16383: 1|1111111 0|1111111            (2 bytes) - 127×128 + 127
Number 16384: 1|0000001 1|0000000 0|0000000  (3 bytes)

Compression ratio for posting lists:
- Frequent terms (small gaps): ~1.2 bytes per posting (vs 4 bytes fixed)
- Rare terms (large gaps): ~2-3 bytes per posting
- Overall: typical 4:1 compression ratio
```
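A sketch of VByte encoding and decoding matching the bit layout above (high bit 1 = more bytes follow):

```python
def vbyte_encode(n):
    """Encode a non-negative integer as variable-length bytes."""
    parts = [n & 0x7F]  # low 7 bits; continuation bit 0 marks the last byte
    n >>= 7
    while n > 0:
        parts.append((n & 0x7F) | 0x80)  # continuation bit 1: more bytes follow
        n >>= 7
    return bytes(reversed(parts))

def vbyte_decode(data):
    """Decode a stream of VByte-encoded integers."""
    numbers, n = [], 0
    for byte in data:
        n = (n << 7) | (byte & 0x7F)
        if byte & 0x80 == 0:  # continuation bit 0: number is complete
            numbers.append(n)
            n = 0
    return numbers

encoded = b"".join(vbyte_encode(n) for n in [5, 127, 128, 130, 16383, 16384])
print(vbyte_decode(encoded))  # [5, 127, 128, 130, 16383, 16384]
```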
Other Compression Schemes:

| Technique | Idea | Pros | Cons |
|---|---|---|---|
| VByte | Variable-length byte encoding | Simple, byte-aligned, fast | Not optimal compression |
| γ (Gamma) Coding | Unary length prefix + binary value | Good for small values | Complex bit operations |
| δ (Delta) Coding | γ-encode the length, then value | Better for medium range | More complex than VByte |
| Simple-9 | Pack multiple small ints in 32-bit word | Very fast decoding | Wasted bits for irregular gaps |
| PForDelta | Pack frame + exceptions | Fastest modern scheme | Complex implementation |
Better compression means smaller indexes (less I/O) but slower decoding. Modern systems often choose VByte or SIMD-accelerated schemes like PForDelta that balance well. Compression also affects random access—some schemes require sequential decoding.
How an inverted index is organized on disk significantly impacts query performance. Several design decisions affect I/O efficiency.
Single-File vs. Multi-File Layout:
```
Single-File Layout:
--------------------
[Vocabulary Section]
term1 → offset1
term2 → offset2
...

[Posting Lists Section]
offset1: [docID, docID, docID, ...]
offset2: [docID, docID, ...]
...

Multi-File Layout:
------------------
vocabulary.idx - Terms and metadata
postings.idx   - Posting lists (sequential)
positions.idx  - Position data (optional)
skip.idx       - Skip pointer data

Trade-offs:
- Single file: Simpler, atomic updates, one file handle
- Multi-file: Parallel I/O, independent updates, easier memory mapping
```

Block-Based Organization:
For disk efficiency, posting lists are organized into fixed-size blocks:
```
Block-Based Posting List Storage:

Term: "database" (very frequent, long posting list)
          _________________________________________
Block 0: | docIDs 1-500    | skip → Block 1        |
         | compressed      | max docID: 12547      |
         |_________________|_______________________|
Block 1: | docIDs 501-1000 | skip → Block 2        |
         | compressed      | max docID: 28934      |
         |_________________|_______________________|
Block 2: | docIDs 1001-1500| skip → Block 3        |
         | compressed      | max docID: 45123      |
         |_________________|_______________________|
...

Each block:
- Fixed size (e.g., 4KB = one disk page)
- Contains compressed postings
- Stores max docID (for skipping without decompression)
- Has skip pointer to next block

Benefits:
- Aligned with disk pages → efficient I/O
- Can skip entire blocks without reading
- Enables efficient random access within long lists
```

Memory Mapping:
Modern systems often use memory-mapped files: the index file is mapped into the process's address space, the OS page cache keeps frequently accessed regions in RAM, and the engine reads postings as if they were an in-memory array without explicit buffer management.
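A sketch of reading a posting list through a memory-mapped file in Python (the file name `postings.idx`, the fixed 4-byte little-endian docID layout, and the offset/count values are all hypothetical assumptions for illustration):

```python
import mmap

with open("postings.idx", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    # Offset and count would come from a vocabulary lookup (assumed values here)
    offset, count = 4096, 100
    raw = mm[offset : offset + 4 * count]

    # Decode fixed-width 4-byte docIDs; the OS page cache keeps hot
    # regions of the file in RAM with no explicit buffering code
    doc_ids = [int.from_bytes(raw[i:i + 4], "little")
               for i in range(0, len(raw), 4)]
    mm.close()
```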
Frequent terms ("the", "and") have long posting lists accessed often—keep these in fast storage or cache. Rare terms have short lists accessed infrequently—can reside on slower storage. This tiered approach optimizes cost/performance.
Building an inverted index from a document collection is a significant computational task. For web-scale collections with billions of documents, efficient construction algorithms are critical.
Single-Pass In-Memory Indexing (SPIMI):
For collections that fit in memory:
```python
import re

def tokenize(document):
    """Simplistic normalization: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", document.lower())

def spimi_index(documents):
    """
    Single-Pass In-Memory Indexing:
    builds a positional inverted index entirely in memory.
    """
    inverted_index = {}  # term -> list of (docID, [positions])

    for doc_id, document in enumerate(documents):
        # Tokenize and normalize
        tokens = tokenize(document)

        for position, token in enumerate(tokens):
            # Get or create the posting list for this term
            posting_list = inverted_index.setdefault(token, [])

            # Same document as the last posting: just record another position
            if posting_list and posting_list[-1][0] == doc_id:
                posting_list[-1][1].append(position)
            else:
                # First occurrence of this term in this document
                posting_list.append((doc_id, [position]))

    # Keep the vocabulary sorted so posting lists can be written out
    # sequentially (the on-disk write step is omitted in this sketch)
    return {term: inverted_index[term] for term in sorted(inverted_index)}
```
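For example, running the sketch above on the three documents from the start of the page (document IDs are zero-based here):

```python
docs = [
    "The database stores data efficiently.",
    "Efficient data indexing improves database queries.",
    "Query processing requires efficient indexing.",
]
index = spimi_index(docs)
print(index["database"])  # [(0, [1]), (1, [4])]
```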
Block Sort-Based Indexing (BSBI):

For large collections that don't fit in memory, use external sorting:
```
Block Sort-Based Indexing (BSBI):
==================================

Phase 1: Create sorted runs
---------------------------
For each block of documents that fits in memory:
  1. Parse documents, create (term, docID, position) triples
  2. Sort triples by (term, docID, position)
  3. Write sorted block to disk as a "run"

Result: Many sorted runs on disk

Phase 2: Merge sorted runs
--------------------------
Use k-way merge:
  1. Open all runs, read first entry from each
  2. Output smallest entry
  3. Read next entry from that run
  4. Repeat until all runs exhausted

Result: Single sorted file of all (term, docID, position) triples

Phase 3: Build final index
--------------------------
Scan sorted file:
  - Accumulate postings for each term
  - When term changes, write posting list to index
  - Update vocabulary with term and offset

Complexity:
- I/O: O(n log(n/M)) block transfers, where M = memory size
- Dominated by external merge sort
```

For web-scale indexing (billions of documents), MapReduce-style distributed construction is used. Map: emit (term, docID, positions) pairs. Reduce: aggregate all postings for each term. This parallelizes both phases across thousands of machines.
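Phase 2's k-way merge can be sketched with Python's `heapq.merge`, here over in-memory runs of sorted (term, docID, position) triples (a toy stand-in for the disk-resident runs that BSBI actually merges):

```python
import heapq

def merge_runs(runs):
    """
    k-way merge of sorted runs of (term, doc_id, position) triples,
    grouping the merged stream into per-term posting lists.
    """
    index = {}
    for term, doc_id, position in heapq.merge(*runs):
        index.setdefault(term, []).append((doc_id, position))
    return index

run1 = [("data", 0, 3), ("database", 0, 1)]
run2 = [("data", 1, 1), ("database", 1, 4)]
print(merge_runs([run1, run2])["database"])  # [(0, 1), (1, 4)]
```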
We've thoroughly explored the inverted index—the data structure that makes full-text search possible. Let's consolidate the key concepts: the vocabulary maps every unique term to a posting list; postings carry document IDs, term frequencies, and positions to support ranking, phrase, and proximity queries; skip pointers make boolean intersections fast; gap encoding with variable-byte compression keeps the index small; and block-based disk layouts plus SPIMI/BSBI construction make all of this practical at scale.
What's Next:
With the inverted index structure understood, the next page explores Term Frequency and TF-IDF—how we quantify the relevance of a document to a query and rank results meaningfully.
You now understand the complete architecture of inverted indexes—from vocabulary organization through posting list structure, skip pointers, compression, and physical storage. This foundation enables understanding of how queries are processed and results ranked.