Every modern search system—from Google's web search to your IDE's code search to PostgreSQL's full-text capabilities—relies on a deceptively simple yet profoundly powerful data structure: the inverted index.
The name comes from 'inverting' the natural relationship between documents and words. Normally, we think of a document as containing words. An inverted index flips this perspective: for each word, we record which documents contain it. This simple inversion transforms an O(n × m) scan problem (check every word in every document) into an O(k) lookup problem (find the k documents containing a specific word).
By the end of this page, you will understand the complete architecture of an inverted index: the vocabulary (lexicon), posting lists, positional information, and skip pointers. You'll learn how indexes are stored on disk and how they support efficient query operations.
Let's build intuition with a concrete example. Consider three documents:
Doc 1: "The database stores data efficiently." Doc 2: "Efficient data indexing improves database queries." Doc 3: "Query processing requires efficient indexing."
Forward View (Documents → Words):
Naturally, we'd represent these as:
```
Forward Index (Document → Words):

Doc 1: {the, database, stores, data, efficiently}
Doc 2: {efficient, data, indexing, improves, database, queries}
Doc 3: {query, processing, requires, efficient, indexing}

To find documents containing "database":
- Scan Doc 1 words → Found!
- Scan Doc 2 words → Found!
- Scan Doc 3 words → Not found.
- Result: [Doc 1, Doc 2]

Cost: O(total words across all documents) = O(n × average doc length)
```

This forward view requires scanning all documents for every query. With millions of documents and thousands of words each, this is catastrophically slow.
Inverted View (Words → Documents):
```
Inverted Index (Word → Documents):

database    → [Doc 1, Doc 2]
data        → [Doc 1, Doc 2]
efficient   → [Doc 2, Doc 3]
efficiently → [Doc 1]
improves    → [Doc 2]
indexing    → [Doc 2, Doc 3]
processing  → [Doc 3]
queries     → [Doc 2]
query       → [Doc 3]
requires    → [Doc 3]
stores      → [Doc 1]
the         → [Doc 1]

To find documents containing "database":
- Look up "database" → [Doc 1, Doc 2]
- Result: [Doc 1, Doc 2]

Cost: O(1) hash lookup + O(k) to read k matching docs
```

The inverted index pre-computes the answer to 'which documents contain word X?' for every word. At query time, we simply look up the precomputed answer. Building the index is expensive, but queries become nearly instant.
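As a minimal sketch of this inversion in Python (the `tokenize` helper and its lowercase/split normalization are illustrative assumptions, not a prescribed implementation):

```python
import re
from collections import defaultdict

docs = {
    1: "The database stores data efficiently.",
    2: "Efficient data indexing improves database queries.",
    3: "Query processing requires efficient indexing.",
}

def tokenize(text):
    # Simplistic normalization: lowercase, split on non-letters
    return re.findall(r"[a-z]+", text.lower())

# Invert the document -> words relationship into words -> documents
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for word in tokenize(text):
        inverted[word].add(doc_id)

print(sorted(inverted["database"]))   # [1, 2]
print(sorted(inverted["efficient"]))  # [2, 3]
```

Building `inverted` costs one pass over every word, but each subsequent query is a single dictionary lookup.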
The Core Components:
An inverted index consists of two main components:
Vocabulary (Lexicon): A sorted list or hash of all unique terms (words) in the document collection. Each term points to its posting list.
Posting Lists: For each term, a list of documents containing that term, along with additional information (positions, frequency, etc.).
The vocabulary is the entry point to the inverted index. For each unique term in the collection, it stores the term string itself, the term's document frequency, and a pointer (typically a byte offset) to the term's posting list.
Size Considerations:
The vocabulary size depends on the collection:
| Collection Type | Documents | Total Tokens | Unique Terms | Avg. Term Length |
|---|---|---|---|---|
| News Articles (1 year) | ~500K | ~500M | ~500K | 7 chars |
| Wikipedia (English) | ~6M | ~4B | ~10M | 8 chars |
| Web Crawl | ~100B+ | ~1T+ | ~1B+ | 8 chars |
| Source Code Repository | ~10M files | ~50B | ~50M | 10 chars |
| Legal Documents | ~1M | ~2B | ~2M | 9 chars |
Vocabulary Organization Options:
The vocabulary can be organized in several ways, each with tradeoffs:
Trie-Based Vocabulary:
For large vocabularies with prefix query support, tries are highly effective:
```
Trie for vocabulary: {data, database, date, index, indexing}

        [root]
        /    \
       d      i
       |      |
       a      n
       |      |
       t      d
      / \     |
     a   e    e
     |   |    |
  [base] [*]  x
     |       / \
    [*]    [*]  i
                |
                n
                |
                g
                |
               [*]

[*] = term endpoint
(note: "data" and "index" are terms themselves as well as prefixes
 of "database" and "indexing")

Benefits:
- Shared prefixes save space ("data" and "database" share "data")
- Prefix query "data*" just traverses to "data" and collects all descendants
- Natural support for wildcard queries
```

For small vocabularies (under a few million terms), the entire vocabulary fits in memory. For web-scale vocabularies with billions of terms, disk-resident structures like B+-trees or compressed tries become necessary.
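To illustrate how prefix queries fall out of the structure, here is a minimal, uncompressed trie sketch in Python (a toy version; production systems use the compressed or disk-resident variants mentioned above):

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.is_term = False  # True if a vocabulary term ends here

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, term):
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.is_term = True

    def prefix_query(self, prefix):
        """Return all terms starting with `prefix` (e.g., the query 'data*')."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        # Collect all descendant terms via depth-first traversal
        results, stack = [], [(node, prefix)]
        while stack:
            node, term = stack.pop()
            if node.is_term:
                results.append(term)
            for ch, child in node.children.items():
                stack.append((child, term + ch))
        return sorted(results)

trie = Trie()
for term in ["data", "database", "date", "index", "indexing"]:
    trie.insert(term)

print(trie.prefix_query("data"))  # ['data', 'database']
```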
Posting lists are the heart of the inverted index. Each posting list contains all documents where a term appears, along with information needed for ranking and phrase queries.
Basic Posting:
The simplest posting contains just a document ID:
```
Basic Posting Format:
--------------------------------------------------
term        | posting list
--------------------------------------------------
database    | [1, 5, 12, 47, 89, 156, ...]
index       | [1, 3, 12, 15, 47, ...]
performance | [5, 47, 89, 234, ...]

Each number is a document ID (sorted in ascending order).

Query "database"           → Return documents [1, 5, 12, 47, 89, 156, ...]
Query "database AND index" → Intersect lists → [1, 12, 47, ...]
```

Extended Posting with Frequency:
For relevance ranking, we need term frequency—how many times the term appears in each document:
```
Posting with Term Frequency (TF):
--------------------------------------------------
term     | posting list: (docID, termFreq)
--------------------------------------------------
database | [(1, 3), (5, 1), (12, 7), (47, 2), ...]
         | Doc 1 has 'database' 3 times
         | Doc 5 has 'database' 1 time
         | Doc 12 has 'database' 7 times (likely more relevant!)

This enables TF-IDF ranking:
- Term Frequency (TF): How often the term appears in THIS document
- Document Frequency (DF): In how many documents the term appears
- Inverse Document Frequency (IDF): log(N/DF) - rarer terms score higher
```
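As a sketch of how a frequency-bearing posting list feeds ranking (using one common TF-IDF variant; the collection size `N` and the postings below are assumed example values):

```python
import math

N = 1000  # assumed total number of documents in the collection

# term -> list of (docID, termFreq), as in the diagram above
postings = {
    "database": [(1, 3), (5, 1), (12, 7), (47, 2)],
}

def tf_idf_scores(term):
    posting_list = postings.get(term, [])
    df = len(posting_list)   # document frequency: how many docs contain the term
    if df == 0:
        return {}
    idf = math.log(N / df)   # inverse document frequency: rarer terms weigh more
    return {doc_id: tf * idf for doc_id, tf in posting_list}

# Doc 12 (tf=7) outranks Doc 5 (tf=1) for the query "database"
print(sorted(tf_idf_scores("database").items(), key=lambda kv: -kv[1]))
```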
Positional Postings:

For phrase queries and proximity searches, we need positions:
```
Positional Posting Format:
--------------------------------------------------
term     | posting list: (docID, [positions])
--------------------------------------------------
database | [(1, [0, 15, 42]), (5, [7]), (12, [3, 8, 12, 45, 67, 89, 102]), ...]
         | Doc 1: 'database' at positions 0, 15, and 42
         | Doc 5: 'database' at position 7
         | Doc 12: 'database' at 7 positions

Example: Find phrase "database index"
--------------------------------------------------
database | [(1, [0, 15, 42]), (5, [7]), (12, [3])]
index    | [(1, [1, 43]), (5, [12]), (12, [4])]

Check each shared document:
- Doc 1:  database at 0, index at 1   → positions differ by 1 → PHRASE MATCH!
          database at 42, index at 43 → positions differ by 1 → PHRASE MATCH!
- Doc 5:  database at 7, index at 12  → differ by 5 → no phrase match
- Doc 12: database at 3, index at 4   → differ by 1 → PHRASE MATCH!

Result: [Doc 1, Doc 12] contain the phrase "database index"
```

Positional indexes are significantly larger than frequency-only indexes—often 2-4× larger. Each word occurrence requires storing its position. A document with 1000 words has 1000 positions to store. This space cost is the price for phrase and proximity query support.
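The phrase-matching procedure traced above can be sketched in Python as follows (a simplified two-term version; real engines generalize this to n-term phrases):

```python
def phrase_match(postings_a, postings_b):
    """
    Find documents where term B occurs immediately after term A,
    given positional posting lists of the form [(doc_id, [positions]), ...].
    """
    a, b = dict(postings_a), dict(postings_b)
    results = []
    for doc_id in a.keys() & b.keys():  # only documents containing both terms
        positions_b = set(b[doc_id])
        # Phrase match if some position of A is directly before one of B
        if any(p + 1 in positions_b for p in a[doc_id]):
            results.append(doc_id)
    return sorted(results)

database = [(1, [0, 15, 42]), (5, [7]), (12, [3])]
index    = [(1, [1, 43]), (5, [12]), (12, [4])]

print(phrase_match(database, index))  # [1, 12]
```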
Complete Posting Entry Structure:
A production inverted index entry typically contains:
| Field | Description | Purpose |
|---|---|---|
| Document ID | Unique identifier for document | Identify which document |
| Term Frequency | Count of term occurrences | TF-IDF ranking |
| Positions | List of byte/word offsets | Phrase/proximity queries |
| Field ID | Which field (title/body/etc.) | Field-weighted ranking |
| Payload | Custom per-occurrence data | Domain-specific scoring |
When processing boolean queries, we must compute intersections and unions of posting lists. For AND queries with long posting lists, naively scanning both lists is inefficient. Skip pointers enable jumping ahead in posting lists.
The Intersection Problem:
Query: "database AND performance" database → [1, 3, 7, 12, 15, 23, 45, 67, 89, 156, 234, 345, 456, ...]performance → [12, 156, 789, ...] Naive intersection:- Compare database[0]=1 with performance[0]=12 → 1 < 12, advance database- Compare database[1]=3 with performance[0]=12 → 3 < 12, advance database- Compare database[2]=7 with performance[0]=12 → 7 < 12, advance database- Compare database[3]=12 with performance[0]=12 → MATCH! Add 12 to result- Compare database[4]=15 with performance[1]=156 → 15 < 156, advance database- ... keep advancing database through many non-matching entries Problem: If performance list is short and database list is long,we scan most of database list just to find a few matches.Skip Pointer Solution:
Skip Pointer Solution:

Add skip pointers at regular intervals to allow jumping ahead:
Query: "database AND performance" (performance[0] = 12) database with skip pointers every 3 elements:[1] -skip→ [12] -skip→ [45] -skip→ [156] ... ↓ ↓ ↓ ↓[3] [15] [67] [234] ↓ ↓ ↓ ↓[7] [23] [89] [345] Searching for match with performance[0]=12:1. At database[0]=1: Check skip to database[3]=12 - 12 >= target 12, so we CAN skip! Jump to [12]2. Compare database[3]=12 with performance[0]=12 → MATCH! We skipped comparing 3 and 7 entirely! With longer lists and searching for 156:1. At [1]: skip to [12]=12 < 156, take skip2. At [12]: skip to [45]=45 < 156, take skip 3. At [45]: skip to [156]=156 >= 156, take skip4. At [156]: 156 == 156, MATCH! Skipped: 1,3,7,12,15,23,45,67,89 → Only visited 4 entries instead of 10Optimal Skip Distance:
Optimal Skip Distance:

How far apart should skip pointers be? This involves a tradeoff:
The optimal skip distance for a posting list of length L is approximately √L. This balances expected skip distance against pointer overhead.
| List Length (L) | Optimal Skip (√L) | Max Comparisons (worst) | Comparisons without skip |
|---|---|---|---|
| 100 | 10 | 20 | 100 |
| 1,000 | ~32 | ~64 | 1,000 |
| 10,000 | 100 | 200 | 10,000 |
| 1,000,000 | 1,000 | 2,000 | 1,000,000 |
For very long posting lists, multi-level skip pointers (like skip lists for search) can provide O(log n) intersection cost. Each level skips farther ahead, with fewer pointers at higher levels.
Inverted indexes can be enormous—often larger than the original document collection. Compression is essential for practical systems. Fortunately, posting lists have properties that make them highly compressible.
Key Observation: Gaps are Small
Posting lists store document IDs in sorted order. Instead of storing absolute IDs, store the gaps (differences) between consecutive IDs:
```
Original posting list (document IDs):
[1023, 1047, 1089, 1156, 1178, 1234, 1289, 1301]

Gap-encoded (first value + gaps):
[1023, 24, 42, 67, 22, 56, 55, 12]

Why this helps:
- Original values: 4 digits each, need ~11 bits minimum
- Gap values: mostly 2 digits, need only ~6 bits each

For common terms appearing in many documents:
Original: [1, 5, 12, 18, 23, 31, 45, 52, 67, 78, ...]
Gaps:     [1, 4, 7, 6, 5, 8, 14, 7, 15, 11, ...]

Most gaps are small numbers requiring few bits!
```
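Gap encoding and decoding are a few lines each; this sketch reproduces the example above:

```python
def gap_encode(doc_ids):
    """Store the first docID, then differences between consecutive IDs."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def gap_decode(gaps):
    """Rebuild absolute docIDs by accumulating the gaps."""
    doc_ids = [gaps[0]]
    for gap in gaps[1:]:
        doc_ids.append(doc_ids[-1] + gap)
    return doc_ids

ids = [1023, 1047, 1089, 1156, 1178, 1234, 1289, 1301]
gaps = gap_encode(ids)
print(gaps)                   # [1023, 24, 42, 67, 22, 56, 55, 12]
assert gap_decode(gaps) == ids
```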
Variable-Byte (VByte) Encoding:

Variable-byte encoding uses 1 byte for small numbers, more bytes for larger numbers:
```
Variable-Byte Encoding:
Each byte uses 7 bits for data, 1 bit as a continuation flag.
High bit = 1 means "more bytes follow"
High bit = 0 means "this is the last byte"

Examples:
Number 5:     0|0000101                      (1 byte)  - high bit 0 = done
Number 127:   0|1111111                      (1 byte)  - high bit 0 = done
Number 128:   1|0000001 0|0000000            (2 bytes) - 1×128 + 0
Number 130:   1|0000001 0|0000010            (2 bytes) - 1×128 + 2
Number 16383: 1|1111111 0|1111111            (2 bytes) - 127×128 + 127
Number 16384: 1|0000001 1|0000000 0|0000000  (3 bytes)

Compression ratio for posting lists:
- Frequent terms (small gaps): ~1.2 bytes per posting (vs 4 bytes fixed)
- Rare terms (large gaps): ~2-3 bytes per posting
- Overall: typical 4:1 compression ratio
```
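A sketch of VByte encoding and decoding matching the bit layout above (high bit 1 = more bytes follow):

```python
def vbyte_encode(n):
    """Encode a non-negative integer as variable-length bytes."""
    parts = [n & 0x7F]  # low 7 bits; continuation bit 0 marks the last byte
    n >>= 7
    while n > 0:
        parts.append((n & 0x7F) | 0x80)  # continuation bit 1: more bytes follow
        n >>= 7
    return bytes(reversed(parts))

def vbyte_decode(data):
    """Decode a stream of VByte-encoded integers."""
    numbers, n = [], 0
    for byte in data:
        n = (n << 7) | (byte & 0x7F)
        if byte & 0x80 == 0:  # continuation bit 0: number is complete
            numbers.append(n)
            n = 0
    return numbers

encoded = b"".join(vbyte_encode(n) for n in [5, 127, 128, 130, 16383, 16384])
print(vbyte_decode(encoded))  # [5, 127, 128, 130, 16383, 16384]
```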
Other Compression Schemes:

| Technique | Idea | Pros | Cons |
|---|---|---|---|
| VByte | Variable-length byte encoding | Simple, byte-aligned, fast | Not optimal compression |
| γ (Gamma) Coding | Unary length prefix + binary value | Good for small values | Complex bit operations |
| δ (Delta) Coding | γ-encode the length, then value | Better for medium range | More complex than VByte |
| Simple-9 | Pack multiple small ints in 32-bit word | Very fast decoding | Wasted bits for irregular gaps |
| PForDelta | Pack frame + exceptions | Fastest modern scheme | Complex implementation |
Better compression means smaller indexes (less I/O) but slower decoding. Modern systems often choose VByte or SIMD-accelerated schemes like PForDelta that balance well. Compression also affects random access—some schemes require sequential decoding.
How an inverted index is organized on disk significantly impacts query performance. Several design decisions affect I/O efficiency.
Single-File vs. Multi-File Layout:
```
Single-File Layout:
--------------------
[Vocabulary Section]
term1 → offset1
term2 → offset2
...

[Posting Lists Section]
offset1: [docID, docID, docID, ...]
offset2: [docID, docID, ...]
...

Multi-File Layout:
------------------
vocabulary.idx - Terms and metadata
postings.idx   - Posting lists (sequential)
positions.idx  - Position data (optional)
skip.idx       - Skip pointer data

Trade-offs:
- Single file: Simpler, atomic updates, one file handle
- Multi-file: Parallel I/O, independent updates, easier memory mapping
```

Block-Based Organization:
For disk efficiency, posting lists are organized into fixed-size blocks:
```
Block-Based Posting List Storage:

Term: "database" (very frequent, long posting list)
          _________________________________________
Block 0: | docIDs 1-500    | skip → Block 1        |
         | compressed      | max docID: 12547      |
         |_________________|_______________________|
Block 1: | docIDs 501-1000 | skip → Block 2        |
         | compressed      | max docID: 28934      |
         |_________________|_______________________|
Block 2: | docIDs 1001-1500| skip → Block 3        |
         | compressed      | max docID: 45123      |
         |_________________|_______________________|
...

Each block:
- Fixed size (e.g., 4KB = one disk page)
- Contains compressed postings
- Stores max docID (for skipping without decompression)
- Has skip pointer to next block

Benefits:
- Aligned with disk pages → efficient I/O
- Can skip entire blocks without reading
- Enables efficient random access within long lists
```

Memory Mapping:
Modern systems often use memory-mapped files: the index file is mapped into the process's address space, the OS page cache keeps frequently accessed regions in RAM, and the engine reads postings as if they were an in-memory array without explicit buffer management.
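A sketch of reading a posting list through a memory-mapped file in Python (the file name `postings.idx`, the fixed 4-byte little-endian docID layout, and the offset/count values are all hypothetical assumptions for illustration):

```python
import mmap

with open("postings.idx", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    # Offset and count would come from a vocabulary lookup (assumed values here)
    offset, count = 4096, 100
    raw = mm[offset : offset + 4 * count]

    # Decode fixed-width 4-byte docIDs; the OS page cache keeps hot
    # regions of the file in RAM with no explicit buffering code
    doc_ids = [int.from_bytes(raw[i:i + 4], "little")
               for i in range(0, len(raw), 4)]
    mm.close()
```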
Frequent terms ("the", "and") have long posting lists accessed often—keep these in fast storage or cache. Rare terms have short lists accessed infrequently—can reside on slower storage. This tiered approach optimizes cost/performance.
Building an inverted index from a document collection is a significant computational task. For web-scale collections with billions of documents, efficient construction algorithms are critical.
Single-Pass In-Memory Indexing (SPIMI):
For collections that fit in memory:
```python
import re

def tokenize(document):
    """Simplistic normalization: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", document.lower())

def spimi_index(documents):
    """
    Single-Pass In-Memory Indexing:
    builds a positional inverted index entirely in memory.
    """
    inverted_index = {}  # term -> list of (docID, [positions])

    for doc_id, document in enumerate(documents):
        # Tokenize and normalize
        tokens = tokenize(document)

        for position, token in enumerate(tokens):
            # Get or create the posting list for this term
            posting_list = inverted_index.setdefault(token, [])

            # Same document as the last posting: just record another position
            if posting_list and posting_list[-1][0] == doc_id:
                posting_list[-1][1].append(position)
            else:
                # First occurrence of this term in this document
                posting_list.append((doc_id, [position]))

    # Keep the vocabulary sorted so posting lists can be written out
    # sequentially (the on-disk write step is omitted in this sketch)
    return {term: inverted_index[term] for term in sorted(inverted_index)}
```
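For example, running the sketch above on the three documents from the start of the page (document IDs are zero-based here):

```python
docs = [
    "The database stores data efficiently.",
    "Efficient data indexing improves database queries.",
    "Query processing requires efficient indexing.",
]
index = spimi_index(docs)
print(index["database"])  # [(0, [1]), (1, [4])]
```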
Block Sort-Based Indexing (BSBI):

For large collections that don't fit in memory, use external sorting:
```
Block Sort-Based Indexing (BSBI):
==================================

Phase 1: Create sorted runs
---------------------------
For each block of documents that fits in memory:
  1. Parse documents, create (term, docID, position) triples
  2. Sort triples by (term, docID, position)
  3. Write sorted block to disk as a "run"

Result: Many sorted runs on disk

Phase 2: Merge sorted runs
--------------------------
Use k-way merge:
  1. Open all runs, read first entry from each
  2. Output smallest entry
  3. Read next entry from that run
  4. Repeat until all runs exhausted

Result: Single sorted file of all (term, docID, position) triples

Phase 3: Build final index
--------------------------
Scan sorted file:
  - Accumulate postings for each term
  - When term changes, write posting list to index
  - Update vocabulary with term and offset

Complexity:
- I/O: O(n log(n/M)) block transfers, where M = memory size
- Dominated by external merge sort
```

For web-scale indexing (billions of documents), MapReduce-style distributed construction is used. Map: emit (term, docID, positions) pairs. Reduce: aggregate all postings for each term. This parallelizes both phases across thousands of machines.
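Phase 2's k-way merge can be sketched with Python's `heapq.merge`, here over in-memory runs of sorted (term, docID, position) triples (a toy stand-in for the disk-resident runs that BSBI actually merges):

```python
import heapq

def merge_runs(runs):
    """
    k-way merge of sorted runs of (term, doc_id, position) triples,
    grouping the merged stream into per-term posting lists.
    """
    index = {}
    for term, doc_id, position in heapq.merge(*runs):
        index.setdefault(term, []).append((doc_id, position))
    return index

run1 = [("data", 0, 3), ("database", 0, 1)]
run2 = [("data", 1, 1), ("database", 1, 4)]
print(merge_runs([run1, run2])["database"])  # [(0, 1), (1, 4)]
```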
We've thoroughly explored the inverted index—the data structure that makes full-text search possible. Let's consolidate the key concepts: the vocabulary maps every unique term to a posting list; postings carry document IDs, term frequencies, and positions to support ranking, phrase, and proximity queries; skip pointers make boolean intersections fast; gap encoding with variable-byte compression keeps the index small; and block-based disk layouts plus SPIMI/BSBI construction make all of this practical at scale.
What's Next:
With the inverted index structure understood, the next page explores Term Frequency and TF-IDF—how we quantify the relevance of a document to a query and rank results meaningfully.
You now understand the complete architecture of inverted indexes—from vocabulary organization through posting list structure, skip pointers, compression, and physical storage. This foundation enables understanding of how queries are processed and results ranked.