Consider what happens when you type a query into a search box. Within 200 milliseconds—faster than you can blink—a system has scanned through billions of documents, evaluated their relevance to your specific intent, ranked them by a sophisticated algorithm considering hundreds of signals, and returned a neatly ordered list of results. This happens roughly 8.5 billion times per day on Google alone.
This is not magic. It is engineering at its finest—a carefully orchestrated symphony of specialized components, each designed to excel at one part of the search problem. Understanding these components is not merely academic; it is essential knowledge for any engineer building systems that need to find things quickly.
Whether you're building product search for an e-commerce platform, implementing document retrieval for a knowledge base, designing log search for observability, or architecting semantic search for an AI application—the fundamental components remain remarkably consistent.
By the end of this page, you will understand the complete anatomy of a search system—from document ingestion to result ranking. You will be able to identify each component, articulate its responsibilities, explain how components interact, and make informed decisions about which components to prioritize for different search use cases. This knowledge forms the foundation for every subsequent topic in search systems.
Before dissecting components, we must understand why search requires specialized architecture. The naive approach—scanning every document for matches—fails catastrophically at scale.
The mathematics of brute force:
Imagine searching 1 billion documents, each 10KB on average: that is roughly 10TB of raw text. Even at an optimistic 10GB/s of sequential read bandwidth, a single full scan takes about 16 minutes per query, nearly 5,000 times over a 200ms latency budget.
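A quick back-of-envelope check in Python makes the gap concrete; the 10 GB/s scan rate is an optimistic assumption, not a measured figure:

```python
# Back-of-envelope: brute-force scan of 1B documents at 10KB each
num_docs = 1_000_000_000
doc_size_bytes = 10 * 1024
total_bytes = num_docs * doc_size_bytes    # ~10 TB of raw text
scan_rate = 10 * 1024**3                   # assume ~10 GB/s sequential read

seconds_per_query = total_bytes / scan_rate
print(f"{total_bytes / 1024**4:.1f} TiB, {seconds_per_query:.0f} s per query")
# 9.3 TiB, 954 s per query -> ~4,800x over a 200 ms latency budget
```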
This gap cannot be closed by faster hardware alone. Moore's Law cannot save us when data grows faster than processing power. The solution lies in fundamentally different data organization—structures that trade storage and preprocessing time for query speed.
Search systems solve the scale problem through a simple but profound trade-off: invest heavily in indexing time (when documents are added) to dramatically reduce query time (when users search). This is the architectural foundation upon which all search systems are built. We accept slower writes to enable dramatically faster reads.
The three fundamental challenges:
Every search system must solve three interrelated problems:
Discovery: How do we find and acquire documents to search? (Crawling, ingestion)
Organization: How do we structure documents for fast retrieval? (Indexing)
Retrieval: How do we find and rank relevant results quickly? (Query processing)
These challenges naturally decompose into specialized components, each optimized for its specific task. Let's examine each component in depth.
The document acquisition layer is responsible for discovering, fetching, and preparing documents for indexing. This is where raw data enters the search system. The specific implementation varies dramatically based on the data source, but the conceptual responsibilities remain consistent.
Web Crawling: A Case Study in Complexity
Web crawling exemplifies the complexity of document acquisition. A production web crawler must handle:

- Politeness: respecting robots.txt and per-host rate limits
- Frontier management: prioritizing billions of discovered but unfetched URLs
- Crawler traps: infinite calendars, session-ID URLs, auto-generated pages
- Duplicate and near-duplicate detection across mirrors and URL variants
- Freshness: re-crawling pages at rates matched to how often they change
- Dynamic content: pages whose content only exists after JavaScript executes
Googlebot, for example, renders JavaScript using a headless Chrome instance—a massive investment in infrastructure to handle the reality of the modern web.
| Use Case | Acquisition Pattern | Refresh Strategy | Key Challenges |
|---|---|---|---|
| Web Search | Active crawling | Continuous, priority-based | Scale, freshness, traps |
| E-commerce Search | Database CDC streams | Real-time sync | Consistency, deletions |
| Log Search | Agent push / Syslog | Append-only streams | Volume, retention |
| Document Search | File system watch + API | On-change triggers | Format diversity |
| Email Search | IMAP sync / Push | Incremental sync | Privacy, permissions |
The quality of your document acquisition layer directly impacts search quality. Garbage in, garbage out. If crawlers miss pages, users can't find them. If extractors fail to parse content correctly, relevance suffers. If deduplication is weak, results become cluttered. Investment in acquisition infrastructure pays dividends across the entire search experience.
Once documents are acquired, they must be transformed into a representation suitable for indexing. This processing pipeline—often called the analysis chain or ingestion pipeline—applies a series of transformations that dramatically impact search quality.
The Analysis Chain:
Document processing typically follows this sequence:
Raw Document → Character Filtering → Tokenization → Token Filtering → Normalization → Indexing
Each stage serves a critical purpose:
- Character Filtering: cleans raw input before tokenization, e.g., decoding HTML entities (&amp; → &) and removing <script> blocks.
- Tokenization: splits text into tokens, making consequential decisions along the way: don't → don + t or kept as don't; 192.168.1.1 → kept as a single token.
- Token Filtering: removes low-value tokens via stop word lists (the, is, at), length filtering (remove single characters), and pattern-based filtering (remove numbers-only tokens). Each filter trades recall for precision.
- Normalization: canonicalizes tokens via lowercasing (DOG → dog), stemming (running → run), lemmatization (better → good), and synonym expansion (NYC → New York City). These transformations increase recall but may reduce precision.

A worked example traces a sentence through all four stages:

Original Text:
"The Quick Brown Fox Jumped Over the Lazy Dog's Fence!"

Stage 1 - Character Filtering:
"The Quick Brown Fox Jumped Over the Lazy Dog's Fence"

Stage 2 - Tokenization:
["The", "Quick", "Brown", "Fox", "Jumped", "Over", "the", "Lazy", "Dog's", "Fence"]

Stage 3 - Token Filtering (stop word removal):
["Quick", "Brown", "Fox", "Jumped", "Lazy", "Dog's", "Fence"]

Stage 4 - Normalization (lowercase + stemming):
["quick", "brown", "fox", "jump", "lazi", "dog", "fenc"]

Final Indexed Tokens:
["quick", "brown", "fox", "jump", "lazi", "dog", "fenc"]

This document will match queries for: "quickly", "foxes", "jumping", "lazy", "dogs", "fenced"
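A minimal analysis chain is easy to sketch in Python. The stop word list and the suffix-stripping "stemmer" below are toy assumptions; a real system would use a proper stemmer (e.g., Porter), so the output differs slightly from the Porter-style tokens above:

```python
import re

STOP_WORDS = {"the", "is", "at", "over"}  # toy list; real lists are much larger

def analyze(text: str) -> list[str]:
    # Stage 1: character filtering - drop punctuation, keep apostrophes
    text = re.sub(r"[^\w\s']", " ", text)
    # Stage 2: tokenization - naive whitespace split
    tokens = text.split()
    # Stage 3: token filtering - remove stop words and single characters
    tokens = [t for t in tokens if t.lower() not in STOP_WORDS and len(t) > 1]
    # Stage 4: normalization - lowercase plus a crude suffix stripper
    def stem(token: str) -> str:
        for suffix in ("ing", "ed", "'s", "s"):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[: -len(suffix)]
        return token
    return [stem(t.lower()) for t in tokens]

print(analyze("The Quick Brown Fox Jumped Over the Lazy Dog's Fence!"))
# ['quick', 'brown', 'fox', 'jump', 'lazy', 'dog', 'fence']
```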
Language-Specific Challenges:

Text processing varies dramatically across languages:
| Language | Challenge | Solution |
|---|---|---|
| Chinese/Japanese | No word boundaries | Segmentation algorithms (Jieba, Kuromoji) |
| German | Compound words | Decompounding (Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz) |
| Arabic | Right-to-left, root-based | Morphological analysis |
| Thai | No spaces | Dictionary-based segmentation |
Production search systems often maintain separate analysis chains per language, auto-detecting document language during processing.
The same analysis chain must be applied to both documents during indexing AND queries during search. If you stem documents but not queries, a document containing 'running' won't match the query 'running': the index stores 'run', but the query searches for the unstemmed 'running'. This symmetry requirement is a common source of bugs in search implementations, as the sketch below demonstrates.
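A tiny illustration of the failure mode, using a hypothetical toy stemmer. Note that even though 'runn' is not a real word, the match works as long as both sides run the same chain:

```python
def naive_stem(token: str) -> str:
    # Toy suffix stripper, for illustration only
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

indexed_terms = {naive_stem("running")}        # index side stems: {'runn'}

print("running" in indexed_terms)              # False: raw query term misses
print(naive_stem("running") in indexed_terms)  # True: symmetric analysis matches
```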
Field-Level Processing:
Real documents contain multiple fields with different semantics:

- Title: short text, heavily weighted for relevance, fully analyzed
- Body/description: long text, analyzed with stemming and stop word removal
- Identifiers (SKU, username): exact-match keywords, not analyzed
- Numeric fields (price, rating): indexed for range queries and sorting
- Dates (created_at): indexed for range filtering and recency ranking
Each field type may have its own analysis chain and storage format. This field-level configuration is a major aspect of search schema design.
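A sketch of what such a schema might look like, expressed as a Python dict in the style of an Elasticsearch mapping; the field names and analyzer choices are illustrative assumptions, not a prescribed schema:

```python
# Illustrative field-level schema in the style of an Elasticsearch mapping
product_mapping = {
    "properties": {
        "title":       {"type": "text", "analyzer": "english"},  # analyzed, stemmed
        "description": {"type": "text", "analyzer": "english"},  # analyzed long text
        "sku":         {"type": "keyword"},                      # exact match, not analyzed
        "tags":        {"type": "keyword"},                      # facetable labels
        "price":       {"type": "float"},                        # numeric, range-queryable
        "created_at":  {"type": "date"},                         # sortable, range-filterable
    }
}
```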
The indexing engine is the heart of any search system. It takes processed documents and builds data structures optimized for fast retrieval. The primary structure is the inverted index, but modern search engines maintain multiple complementary indexes for different query types.
The Inverted Index: Core Data Structure
An inverted index flips the document-to-word relationship: instead of mapping each document to the words it contains (a forward index), it maps each word to the list of documents that contain it.
This inversion is what enables fast search. Instead of scanning every document for a word, we look up the word directly and get the list of matching documents instantly.
Forward Index (how documents are stored):
Doc1: "the quick brown fox"
Doc2: "the lazy brown dog"
Doc3: "the quick red fox"
Inverted Index (how search works):
"the" → [Doc1, Doc2, Doc3]
"quick" → [Doc1, Doc3]
"brown" → [Doc1, Doc2]
"fox" → [Doc1, Doc3]
"lazy" → [Doc2]
"dog" → [Doc2]
"red" → [Doc3]
Searching for "quick fox" becomes: intersect([Doc1, Doc3], [Doc1, Doc3]) = [Doc1, Doc3]
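The structure and the intersection step are simple to sketch in memory; here is a minimal Python version using the toy corpus above. (Real engines store postings as sorted, compressed on-disk lists rather than Python sets.)

```python
from collections import defaultdict

docs = {
    "Doc1": "the quick brown fox",
    "Doc2": "the lazy brown dog",
    "Doc3": "the quick red fox",
}

# Build the inverted index: term -> set of documents containing it
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query: str) -> set[str]:
    # AND semantics: intersect the posting list of every query term
    postings = [index.get(term, set()) for term in query.split()]
    return set.intersection(*postings) if postings else set()

print(sorted(search("quick fox")))  # ['Doc1', 'Doc3']
```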
Segment-Based Architecture:
Modern search engines (Lucene, Elasticsearch, OpenSearch) use segment-based indexing:

- New documents are buffered in memory, then flushed to disk as a new segment
- Segments are immutable: once written, they are never modified
- Deletes are recorded as tombstone markers; updates are a delete plus a reinsert
- Background merge processes periodically combine small segments into larger ones
This approach provides:

- Lock-free reads: immutable segments can be searched without coordination
- Fast writes: indexing appends new segments instead of rewriting existing ones
- Crash safety: a partially written segment can simply be discarded
- Cache friendliness: immutable files are easy for the OS page cache to hold
The trade-off: queries must search across multiple segments and merge results, which is why segment management (merge policies, number of segments) significantly impacts query performance.
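A toy illustration of that per-query merge across segments (the segment contents are hypothetical):

```python
# Hypothetical segments: each is an immutable term -> postings mapping
segments = [
    {"fox": ["d1", "d3"], "dog": ["d2"]},   # older, larger segment
    {"fox": ["d7"], "cat": ["d8"]},         # newly flushed segment
]

def search_term(term: str) -> list[str]:
    # Every live segment must be consulted, then partial results merged
    hits: list[str] = []
    for segment in segments:
        hits.extend(segment.get(term, []))
    return hits

print(search_term("fox"))  # ['d1', 'd3', 'd7']
```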
| Index Type | Data Structure | Query Support | Example Use |
|---|---|---|---|
| Inverted Index | Term Dictionary + Posting Lists | Full-text search | Find documents containing 'machine learning' |
| Doc Values | Column-oriented storage | Sorting, aggregations | Sort results by date, compute facet counts |
| BKD Tree | Block K-D tree | Numeric ranges | Find products with price between $10-$50 |
| Points Index | Multi-dimensional points | Geo queries | Find restaurants within 5km |
| Vector Index | HNSW / IVF | Similarity search | Find similar images, semantic search |
| Stored Fields | Row-oriented document store | Document retrieval | Return title and description for display |
Every index you build is an investment toward specific query patterns. If you don't build a numeric index on the 'price' field, range queries on price will be slow or impossible. If you don't store the 'description' field, you can't display it in results. Schema design in search is fundamentally about anticipating query requirements and building the right indexes.
The query processing engine transforms a user's raw query into a structured execution plan. This is where the search system interprets user intent, applies analysis, and constructs the operations needed to find relevant documents.
Query Processing Pipeline:
User Query → Query Parsing → Query Analysis → Query Expansion → Query Planning → Execution
Each stage refines the query toward executable form:
User Input:
"machine lerning tutorials" category:technology

Stage 1 - Query Parsing (structure extraction):
{
  type: "boolean",
  must: [
    { type: "text", field: "_all", value: "machine lerning tutorials" },
    { type: "term", field: "category", value: "technology" }
  ]
}

Stage 2 - Query Analysis (tokenization, normalization):
{
  text_terms: ["machin", "lern", "tutori"],   // stemmed
  filter_terms: ["technology"]
}

Stage 3 - Query Expansion (spell correction, synonyms):
{
  original: ["machin", "lern", "tutori"],
  expanded: ["machin", "learn", "ml", "tutori", "guide"],  // "lerning" → "learning"
  filter_terms: ["technology"]
}

Stage 4 - Query Planning (execution strategy):
{
  strategy: "WAND",
  term_order: ["ml", "learn", "tutori", "machin", "guide"],  // rarest first
  filter: { field: "category", value: "technology" }
}

Stage 5 - Execution:
→ Retrieve posting lists for terms
→ Apply WAND algorithm for top-K scoring
→ Filter by category=technology
→ Return ranked document IDs

Query Understanding: Beyond Text Matching
Advanced search systems go beyond textual analysis to understand query intent: recognizing entities, classifying intent (navigational, informational, transactional), and incorporating context such as user history, location, and session behavior.

For example, the query "apple" requires disambiguation: Apple the company (tech intent), apple the fruit (recipe or nutrition intent), or Apple Records (music intent), with the right reading often inferred from user history, session context, location, or trending events.
This contextual understanding dramatically improves search relevance but requires significant ML infrastructure.
Query processing must be fast—typically under 10-20ms of a 200ms total latency budget. This constrains how much processing is practical. Spell correction, for instance, must use precomputed dictionaries rather than computing edit distances on-the-fly. Complex ML models must be distilled or approximated. Every millisecond counts.
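One common precomputation strategy (in the spirit of the SymSpell approach) maps single-character deletions of every vocabulary term back to the term, so correction at query time is a handful of hash lookups instead of edit-distance computation. A hedged sketch with a toy vocabulary:

```python
from collections import defaultdict

VOCAB = ["machine", "learning", "tutorial", "guide"]  # toy vocabulary

def deletes(term: str) -> set[str]:
    # All strings reachable by deleting exactly one character
    return {term[:i] + term[i + 1:] for i in range(len(term))}

# Offline: map each term and its one-character deletions back to the term
candidates: dict[str, set[str]] = defaultdict(set)
for term in VOCAB:
    candidates[term].add(term)
    for variant in deletes(term):
        candidates[variant].add(term)

def correct(query_term: str) -> set[str]:
    # Online: hash lookups only, no on-the-fly edit-distance computation
    found = set(candidates.get(query_term, set()))
    for variant in deletes(query_term):
        found |= candidates.get(variant, set())
    return found or {query_term}

print(correct("lerning"))  # {'learning'}
```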
The retrieval and ranking engine is where the index pays off. Given a processed query, this component finds matching documents and orders them by relevance. This is typically the most computationally intensive part of search, and the target of extensive optimization.
Two-Phase Retrieval:
Most search systems use a two-phase approach:
Phase 1 - Candidate Retrieval (Fast, Low Precision): use the inverted index with a cheap scoring function (e.g., BM25 or simple term matching) to narrow billions of documents down to a few hundred or thousand candidates.

Phase 2 - Ranking/Reranking (Slow, High Precision): apply expensive signals, richer features, or machine-learned models to reorder only that small candidate set.
This cascade allows using fast, approximate methods for initial filtering while reserving expensive, accurate methods for final ranking.
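A toy end-to-end sketch of the cascade. The corpus and both scoring functions are stand-in assumptions; a real system would use BM25 for phase 1 and a learned model for phase 2:

```python
# Toy corpus and inverted index (assumptions for illustration)
docs = {
    "d1": "machine learning tutorial for beginners",
    "d2": "deep learning with python",
    "d3": "gardening tips and tricks",
}
index: dict[str, set[str]] = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def retrieve(query: str, k: int = 100) -> list[str]:
    # Phase 1: cheap score = how many query terms a document contains
    counts: dict[str, int] = {}
    for term in query.split():
        for doc_id in index.get(term, ()):
            counts[doc_id] = counts.get(doc_id, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)[:k]

def rerank(query: str, candidates: list[str], k: int = 10) -> list[str]:
    # Phase 2: a pricier score, run only on the small candidate set
    q = set(query.split())
    def score(doc_id: str) -> float:
        terms = docs[doc_id].split()
        # coverage of query terms, with a mild document-length penalty
        return len(q & set(terms)) / len(q) - 0.01 * len(terms)
    return sorted(candidates, key=score, reverse=True)[:k]

print(rerank("machine learning", retrieve("machine learning")))  # ['d1', 'd2']
```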
Learning to Rank (LTR):
Modern ranking systems use machine learning to combine signals:

- Text relevance: per-field BM25 scores, term proximity, phrase matches
- Document quality: link authority, spam scores, content freshness
- Engagement: historical click-through rate, dwell time
- Personalization: user history, location, device
LTR models learn optimal signal combinations from training data (click logs, human judgments). They can capture complex interactions between features that hand-tuned formulas miss.
Key LTR algorithms:

- Pointwise (e.g., regression on relevance grades): simplest, scores each document independently
- Pairwise (e.g., RankNet, LambdaRank): learns from preferences between document pairs
- Listwise (e.g., LambdaMART, ListNet): optimizes whole-list metrics such as NDCG directly
Users typically look at only the first 10 results. Position 1 gets ~30% of clicks, position 2 gets ~15%, position 10 gets ~2%. This extreme skew means the difference between position 1 and position 2 can be enormous in user experience and business impact. Ranking isn't just 'nice to have'—it's existential for search quality.
For search systems of any significant scale, a single machine cannot hold the entire index or handle all queries. The serving infrastructure distributes the workload across many machines while maintaining low latency and high availability.
Distribution Strategies:
1. Sharding (Partitioning) — Horizontal
Split the document corpus across multiple shards:
Query → [Shard 1] → Results 1-10 ─┐
→ [Shard 2] → Results 1-10 ─┼→ Merge → Top 10 overall
→ [Shard 3] → Results 1-10 ─┘
2. Replication — Redundant Copies
Copy shards across multiple replicas:
Shard 1: [Replica A] [Replica B] [Replica C] ← Load balanced
Shard 2: [Replica A] [Replica B] [Replica C] ← Load balanced
Shard 3: [Replica A] [Replica B] [Replica C] ← Load balanced
| Aspect | Sharding (More Shards) | Replication (More Replicas) |
|---|---|---|
| Capacity | ✓ More documents | ✗ Same documents |
| Throughput | ✗ Same (still need all shards) | ✓ More queries/second |
| Latency | ↑ More merge overhead | ↓ Less load per replica |
| Failure impact | Lose fraction of data | Lose throughput, not data |
| Index updates | Simpler (one shard) | Complex (sync all replicas) |
The Scatter-Gather Pattern:
Distributed search follows the scatter-gather pattern:

1. A coordinator node receives the query
2. Scatter: the query is fanned out to one replica of every shard
3. Each shard executes the query locally and returns its top-k results
4. Gather: the coordinator merges the per-shard results into a global top-k
5. The coordinator fetches stored fields for the winners and responds

A minimal version of the gather step is sketched below.
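The gather step reduces to merging per-shard top-k lists; here is a minimal Python sketch with hypothetical scores and document IDs:

```python
import heapq

# Hypothetical per-shard results: (score, doc_id) pairs, sorted descending
shard_results = [
    [(0.92, "d17"), (0.85, "d3"), (0.61, "d42")],    # shard 1 top-3
    [(0.88, "d101"), (0.79, "d55"), (0.40, "d77")],  # shard 2 top-3
    [(0.95, "d9"), (0.50, "d88"), (0.33, "d12")],    # shard 3 top-3
]

# Gather: merge per-shard top-k lists into a single global top-k
top_k = heapq.nlargest(3, (hit for shard in shard_results for hit in shard))
print(top_k)  # [(0.95, 'd9'), (0.92, 'd17'), (0.88, 'd101')]
```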
Latency implications:

- A query is only as fast as the slowest shard: tail latency dominates
- More shards means more fan-out and a higher chance of hitting a slow node
- Common mitigations include hedged/backup requests and adaptive replica selection
Google's web search runs on millions of servers, with indexes sharded across thousands of machines. Elasticsearch clusters at large companies run hundreds of nodes with terabytes of RAM. Even 'small' search deployments at startups often run 10+ nodes. The distributed nature of search is not optional at scale—it's fundamental to the architecture.
Now that we've examined each component individually, let's see how they work together in a complete search flow. Understanding these interactions is crucial for debugging, optimization, and system design.
The Indexing Path (Write Path):

1. Acquisition: a document arrives via crawler, CDC stream, or API
2. Processing: the analysis chain transforms it into indexed tokens
3. Buffering: the document is added to an in-memory segment
4. Flush: the segment is written to disk and becomes searchable
5. Merge: background processes consolidate small segments
6. Replication: the update propagates to replica copies
The Query Path (Read Path):

1. Parsing: the raw query is parsed into structured form
2. Analysis: the same chain used at index time normalizes query terms
3. Scatter: the query fans out to one replica of each shard
4. Retrieval: each shard finds candidates via its inverted index
5. Ranking: candidates are scored and the local top-k selected
6. Gather: per-shard results merge into a global ranking
7. Fetch: stored fields for the final results are retrieved and returned
Search systems are heavily read-optimized. The indexing path can be slow (seconds to minutes for a document to become searchable) because it runs in the background. The query path must be fast (milliseconds) because users are waiting. This asymmetry shapes every design decision: we invest in indexing complexity to simplify query execution.
We've dissected the complete anatomy of a modern search system. Let's consolidate the key components and their responsibilities:

- Document Acquisition: discovers, fetches, and prepares raw data (crawlers, CDC streams, agents)
- Document Processing: runs the analysis chain that turns text into indexable tokens
- Indexing Engine: builds the inverted index and complementary per-field indexes
- Query Processing: parses, analyzes, expands, and plans user queries
- Retrieval and Ranking: finds candidates fast, then reranks them precisely
- Serving Infrastructure: shards, replicates, and scatter-gathers at scale
Key Architectural Principles:

- Trade indexing time for query time: slower writes buy dramatically faster reads
- Apply the same analysis chain at index time and query time
- Cascade retrieval: cheap candidate generation, then expensive reranking
- Build indexes for anticipated query patterns; schema design is query design
- Scale out with sharding for capacity and replication for throughput
What's Next:
Now that you understand the components, the next page dives deep into the critical distinction between indexing and querying—the two fundamental paths through a search system, each with distinct optimization strategies and trade-offs.
You now understand the complete anatomy of a search system—from document acquisition to result ranking. You can identify each component, articulate its responsibilities, and understand how they interact. This foundation will inform every subsequent topic in search systems architecture.