Every search system lives in a constant tension between two competing concerns: how fast can we add new content versus how fast can we find existing content. These are not just different operations—they are fundamentally different paths through the system, with different architectures, different bottlenecks, and often different teams responsible for them.
Understanding this dichotomy is not academic trivia: it directly shapes how you design, scale, and operate a search system.
This page will make you fluent in thinking about search systems as two interrelated but distinct paths, each with its own optimization strategies.
By the end of this page, you will understand: (1) Why search systems separate indexing from querying, (2) The distinct characteristics, bottlenecks, and optimization strategies for each path, (3) How to reason about consistency guarantees between paths, (4) Real-world patterns for balancing indexing freshness with query performance, and (5) How modern systems blur these boundaries for specific use cases.
The separation of indexing and querying is not an arbitrary design choice—it emerges from fundamental physics and mathematics. Let's understand why.
The Brute Force Problem:
The naive approach to search—scan all documents for each query—has linear time complexity: O(n) per query. For small collections, this works fine:
| Documents | Scan Time (1ms/doc) |
|---|---|
| 1,000 | 1 second |
| 100,000 | 1.7 minutes |
| 10,000,000 | 2.8 hours |
| 1,000,000,000 | 11.6 days |
Clearly, brute force doesn't scale. But if we can't scan documents at query time, when can we scan them?
We scan documents at indexing time—when they're added or updated—and build data structures that allow O(log n) or O(1) lookups at query time. This is the fundamental trade-off: invest time when documents arrive (indexing) to save time when users search (querying). Write once, read many.
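A toy illustration of this trade in Python: build the lookup structure once at indexing time (one pass over every term), then answer each query with a hash lookup whose cost is independent of corpus size.

```python
from collections import defaultdict

def build_index(docs):
    """Scan every document once at indexing time: O(total terms)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, term):
    """Query time is a single dictionary lookup, independent of corpus size."""
    return index.get(term.lower(), set())

docs = {
    1: "fast search systems",
    2: "indexing pays the cost up front",
    3: "search is a lookup",
}
index = build_index(docs)
print(sorted(search(index, "search")))  # [1, 3]
```

This is the write-once, read-many bargain in miniature: `build_index` is slow and runs rarely; `search` is cheap and runs constantly.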
The Write-Read Asymmetry:
This trade-off is justified by a crucial asymmetry in real systems:
Writes are rare, reads are common: Most search systems see 100-10,000x more read traffic than write traffic.
Writes can be batched, reads cannot: Users expect instant search results, but indexing can happen in the background.
Writes are produced by systems, reads by humans: Humans are impatient; systems can wait.
Because of this asymmetry, it makes economic sense to invest heavily in indexing complexity to achieve query simplicity.
| System Type | ~Writes/Day | ~Reads/Day | Ratio |
|---|---|---|---|
| Web Search (Google-scale) | Billions (crawl) | 8.5 billion queries | ~1:1 to 1:10 |
| E-commerce Search | 100K product updates | 100M searches | 1:1,000 |
| Email Search (Gmail-scale) | 300M emails | Billions of searches | 1:10+ |
| Log Search (Observability) | TB/day of logs | 1000s of dev queries | Write-heavy (>1:1) |
| Document Search (Enterprise) | 1000s of docs | 100Ks of searches | 1:100 |
Exception: Write-Heavy Systems
Log search and observability platforms are interesting exceptions—they are write-heavy. Systems like Elasticsearch for logs, Splunk, and Datadog ingest terabytes of logs per day while serving relatively few queries. These systems still use indexes, but their optimization focus shifts toward write throughput and storage efficiency rather than query latency.
The indexing path is responsible for transforming raw documents into optimized data structures. Let's trace a document's journey through this path and understand each step's purpose and challenges.
The Indexing Pipeline:
Document → Ingestion → Analysis → Index Building → Commit → Refresh → Searchable
Each stage has distinct characteristics:
Near-Real-Time (NRT) Indexing:
Modern search systems offer near-real-time indexing, where documents become searchable within seconds of ingestion. This is achieved through:
- An in-memory indexing buffer that accepts new documents without waiting on disk I/O
- Frequent, lightweight refreshes that seal the buffer into small searchable segments
- Decoupling visibility (refresh) from durability (commit), with a transaction log protecting writes that have not yet been flushed
However, NRT involves trade-offs:
| Refresh Interval | Freshness | Resource Cost | Query Overhead |
|---|---|---|---|
| 100ms | Excellent | Very High (many small segments) | High (many segments to search) |
| 1 second | Great | High | Moderate |
| 5 seconds | Good | Moderate | Low |
| 30 seconds | Acceptable | Low | Very Low |
| 5 minutes | Poor | Very Low | Minimal |
Frequent refreshes create many small segments. Each segment adds overhead: file handles, memory for index structures, merge candidates. Without background merge processes, segment counts explode. Elasticsearch defaults to 1-second refresh, but production systems often increase this to reduce segment pressure during high ingestion rates.
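The buffer-and-refresh mechanics can be modeled in a few lines. This is a toy model, not a real engine: documents accumulate in an in-memory buffer and become visible only when a refresh seals the buffer into an immutable segment.

```python
class ToyNRTIndex:
    """Toy model of near-real-time indexing: docs buffer in memory and
    only become searchable when refresh() seals the buffer into a segment."""

    def __init__(self):
        self.buffer = []    # indexed but not yet searchable
        self.segments = []  # each refresh produces one immutable segment

    def index(self, doc):
        self.buffer.append(doc)

    def refresh(self):
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def search(self, term):
        # Queries must visit every segment, so frequent refreshes
        # (many small segments) add per-query overhead.
        return [d for seg in self.segments for d in seg if term in d]

idx = ToyNRTIndex()
idx.index("hello world")
print(idx.search("hello"))  # [] - indexed but not yet visible
idx.refresh()
print(idx.search("hello"))  # ['hello world'] - visible after refresh
```

Note how the visibility window falls exactly between `index()` and `refresh()`: shrinking it means refreshing more often, which is precisely the segment-count pressure described above.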
Indexing Throughput Optimization:
When indexing throughput is critical (bulk imports, migration, log ingestion), these strategies help:
- Disable automatic refreshes (`refresh_interval: -1`) during the load, then refresh manually when done
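Batching is the other big lever: fewer, larger requests amortize per-request overhead and reduce refresh and merge pressure. A sketch of a bulk-batching helper (`chunk` is a hypothetical name, not a client API):

```python
def chunk(actions, batch_size=1000):
    """Group documents into bulk batches; fewer, larger requests amortize
    per-request overhead and cut refresh/merge pressure."""
    batch = []
    for action in actions:
        batch.append(action)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

# 2,500 documents become three bulk requests instead of 2,500 small ones
batches = list(chunk(range(2500), batch_size=1000))
print([len(b) for b in batches])  # [1000, 1000, 500]
```

In practice you would tune `batch_size` empirically: too small wastes round-trips, too large risks request-size limits and memory spikes on the ingest node.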
The querying path is where all the indexing investment pays off. Users expect sub-second responses regardless of corpus size. Let's trace a query through the system and understand where time is spent.
The Query Pipeline:
Query → Parse → Analyze → Plan → Route → Execute → Score → Merge → Return
Each stage has a latency budget—typically summing to under 200ms for competitive search:
Typical Query Latency Breakdown (200ms budget):

| Stage | Time (ms) | % of Budget |
|---|---|---|
| Network round-trip | 20-50 | 10-25% |
| Query parsing | 1-2 | <1% |
| Query analysis | 2-5 | 1-3% |
| Query planning | 1-5 | 1-3% |
| Cache check | 1-2 | <1% |
| Scatter to shards | 5-10 | 3-5% |
| Shard execution | 50-100 | 25-50% |
| ├─ Index lookup | 10-30 | |
| ├─ Posting traversal | 20-50 | |
| └─ Scoring | 20-40 | |
| Gather from shards | 5-10 | 3-5% |
| Result merging | 5-10 | 3-5% |
| Document fetch | 10-30 | 5-15% |
| Response serialization | 5-10 | 3-5% |
| Total | ~150-250 | |

Query Latency Optimization:
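A budget like this is most useful as a tool: sum the worst-case stage estimates, check them against the target, and find the stage that dominates. A small sketch (the stage numbers are the upper ends of the ranges above):

```python
BUDGET_MS = 200

# Representative worst-case stage estimates (ms), upper ends of the ranges
stages = {
    "network": 50, "parse": 2, "analyze": 5, "plan": 5, "cache": 2,
    "scatter": 10, "shard_execute": 100, "gather": 10,
    "merge": 10, "fetch": 30, "serialize": 10,
}

total = sum(stages.values())
print(total)               # 234
print(total <= BUDGET_MS)  # False: worst-case stages blow the budget

# Flag any stage consuming more than a quarter of the total
over = {k: v for k, v in stages.items() if v / total > 0.25}
print(over)                # {'shard_execute': 100} - optimize here first
```

The exercise makes the priority obvious: shard execution dominates, so that is where optimization effort pays off first.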
Query latency optimization is a deep art. Key strategies include:
Index-Level Optimizations: force-merge read-only indices into fewer, larger segments; sort the index by a common access pattern; keep hot segments resident in the filesystem cache.
Query-Level Optimizations: cache frequent queries, apply cheap filters before expensive scoring, terminate early once enough good hits are found, and paginate with search_after instead of deep offsets.
In distributed search, you're waiting for the slowest shard. If you have 100 shards and each has 1% chance of being slow (>500ms), you have 63% chance of at least one being slow. Strategies: Hedged requests (send to multiple replicas, take first response), aggressive timeouts with partial results, replica load balancing considering recent latency.
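The 63% figure comes from the complement rule: with n independent shards each slow with probability p, the chance at least one is slow is 1 - (1 - p)^n. A quick check, plus a minimal model of the hedged-request idea:

```python
def p_any_slow(n_shards, p_slow):
    """Probability that at least one of n independent shards is slow."""
    return 1 - (1 - p_slow) ** n_shards

print(round(p_any_slow(100, 0.01), 3))   # 0.634 - matches the 63% above
print(round(p_any_slow(100, 0.001), 3))  # 0.095 - 10x fewer slow shards helps a lot

def hedged(replica_latencies_ms):
    """Hedged request: issue the call to several replicas, keep the fastest.
    One slow replica no longer determines the response time."""
    return min(replica_latencies_ms)

print(hedged([480, 35, 42]))  # 35
```

The second print shows why per-shard tail work compounds: cutting each shard's slow-probability tenfold drops the fleet-wide tail from 63% to under 10%.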
The separation of indexing and querying creates an inherent consistency challenge: there is a window between when a document is indexed and when it becomes searchable. Understanding and managing this window is crucial for system correctness.
The Visibility Window:
Time:        T0          T1          T2           T3           T4
             │           │           │            │            │
Document:  Created →  Indexed →  Committed →  Refreshed →  Searchable
                         │           │                         │
                      Memory        Disk                  Query Path
Between T0 and T4, the document exists but isn't visible to search. This window can range from milliseconds to minutes depending on configuration.
| Model | Visibility Delay | Performance | Use Case |
|---|---|---|---|
| Real-time | <100ms | Lowest throughput | Financial systems, gaming |
| Near-real-time | 1-5 seconds | Good balance | E-commerce, social media |
| Eventual | 30s - 5min | Highest throughput | Log search, batch analytics |
| Batch | Hours | Maximum efficiency | Web search crawl, data warehouse |
Managing Consistency Expectations:
Users expect to immediately see content they just created. Solutions: force a refresh on the write path for that user's request (read-your-own-writes), echo the just-created item into results client-side, or route the author's reads to the freshest replica.
Deleted documents appearing in search is jarring. Solutions: mark deletions with tombstones that are filtered out at query time, so deletes take effect immediately even though background merges only physically remove the documents later.
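Tombstone filtering is simple to sketch: the deleted ids still live in the immutable segment files, so the query path subtracts them from every result set until a merge rewrites those segments.

```python
def search_with_tombstones(index_hits, tombstones):
    """Filter deleted doc ids at query time; physical removal happens
    later, when background merges rewrite the affected segments."""
    return [doc_id for doc_id in index_hits if doc_id not in tombstones]

hits = [101, 102, 103]
deleted = {102}  # marked deleted, but still present in segment files
print(search_with_tombstones(hits, deleted))  # [101, 103]
```

The cost is a per-query set lookup; the benefit is that deletes become visible at refresh speed rather than merge speed.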
Partially updated documents are confusing. Solutions: treat every update as an atomic replacement of the whole document, and use versioning so a query never observes a half-applied update.
Under high load, indexing queues grow. When queue depth exceeds capacity, either indexing blocks (backpressure) or documents are dropped. Monitor queue depth as a leading indicator of indexing problems. Solutions include: scaling indexing nodes, temporarily disabling replicas, increasing batch size, reducing refresh frequency.
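A toy admission-control policy makes the backpressure choice concrete: accept documents until the indexing queue is full, then push the overflow back to the caller to retry with backoff (the alternative, silently dropping, loses data).

```python
def ingest(queue_depth, capacity, incoming):
    """Toy backpressure policy: accept documents while the indexing queue
    has room, reject the rest so callers can retry with backoff."""
    accepted = min(incoming, capacity - queue_depth)
    rejected = incoming - accepted
    return queue_depth + accepted, accepted, rejected

depth, ok, pushed_back = ingest(queue_depth=900, capacity=1000, incoming=250)
print(depth, ok, pushed_back)  # 1000 100 150 - 150 docs returned to the client
```

Monitoring `queue_depth / capacity` as a ratio gives the leading indicator mentioned above: sustained values near 1.0 mean it is time to scale indexing nodes or shed load, not a transient burst.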
Distributed Consistency Complications:
In a distributed search cluster, consistency becomes more complex:
| Scenario | What Can Go Wrong | Impact |
|---|---|---|
| Replica lag | Query hits stale replica | Missing recent documents |
| Primary failure | Indexing to new primary | Brief inconsistency during failover |
| Network partition | Shard isolation | Partial results or errors |
| Split brain | Two primaries | Conflicting documents |
Search systems typically choose availability over consistency (AP in CAP theorem). Missing a document for a few seconds is acceptable; being completely unavailable is not.
Although indexing and querying are logically separate, they share physical resources. Understanding this contention is essential for capacity planning and performance tuning.
Mitigation Strategies:
1. Physical Separation
Dedicate nodes to specific roles: dedicated master nodes for cluster state, ingest nodes for document processing, data nodes for storage and shard execution, and coordinating-only nodes for query fan-out.
For critical workloads, separate query-serving data nodes from indexing-receiving data nodes.
2. Temporal Separation
Schedule heavy operations during off-peak hours: force merges, reindexing jobs, large bulk imports, and snapshot backups.
3. Resource Quotas
Limit resource consumption: cap indexing buffer memory, throttle merge I/O, and bound bulk queue sizes so background work cannot starve queries.
4. Priority Queuing
Prioritize the query path: give search its own, larger thread pool and let indexing requests queue behind it.
```yaml
# Elasticsearch thread pool configuration example
# Separate pools prevent indexing from starving queries

thread_pool:
  # Search operations - highest priority
  search:
    type: fixed
    size: 16            # cores * 3 / 2 + 1 recommended
    queue_size: 1000

  # Indexing operations - allow to queue
  write:
    type: fixed
    size: 8             # lower than search for query priority
    queue_size: 10000   # larger queue to absorb bursts

  # Bulk indexing - lowest priority
  bulk:
    type: fixed
    size: 4
    queue_size: 50000   # very large queue, can wait

  # Background operations
  refresh:
    type: scaling
    min: 1
    max: 4

  # Merge operations - can be throttled
  merge:
    type: scaling
    min: 1
    max: 2
```

Track metrics for both paths separately: indexing rate (docs/sec), indexing latency (time to commit), refresh latency, query rate (QPS), query latency (p50, p95, p99), and cache hit rate. Correlate indexing spikes with query latency increases; this visibility helps diagnose contention issues.
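The latency percentiles worth tracking (p50, p95, p99) can be computed with a simple nearest-rank sketch; monitoring systems do this internally, but the definition is worth knowing because p99 is what exposes the tail that averages hide.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile over recorded query latencies (ms)."""
    ordered = sorted(samples)
    rank = math.ceil(len(ordered) * pct / 100)
    return ordered[rank - 1]

latencies = [12, 15, 18, 20, 22, 25, 30, 45, 80, 400]  # one tail outlier
print(percentile(latencies, 50))  # 22  - the median looks healthy
print(percentile(latencies, 99))  # 400 - the tail tells the real story
```

Here the median suggests a fast system while p99 reveals the contention spike; this is exactly why both paths' metrics must be tracked and correlated.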
Different systems organize indexing and querying differently based on their requirements. Let's examine common patterns and when to use each.
Pattern 1: Unified Cluster (Simple)
              ┌─────────────────────┐
Documents ──► │   Search Cluster    │ ◄── Queries
              │   (Index + Query)   │
              └─────────────────────┘
All nodes handle both indexing and querying. Simple, but indexing can impact query latency.
When to use: Small to medium deployments, balanced workloads, limited operational complexity.
Pattern 2: Role-Based Separation (Medium)
Documents ──► [Ingest Nodes] ──► [Data Nodes] ◄── [Coordinator] ◄── Queries
                                      │
                                  (Shards)
Specialized node roles reduce contention. Ingest nodes handle document processing; coordinators handle query routing.
When to use: Larger clusters, predictable workloads, some operational maturity.
Pattern 3: Full Separation (Complex)
Documents ──► [Indexing Cluster] ──► [Object Storage] ──► [Query Cluster]
               (Build segments)      (Segment files)      (Serve queries)
Completely separate indexing and serving clusters. Segments are built offline and pushed to query servers.
When to use: Very high scale, strict query latency SLAs, different scaling requirements for read vs. write.
| Pattern | Operational Complexity | Isolation | Common Users |
|---|---|---|---|
| Unified Cluster | Low | None | Most Elasticsearch deployments |
| Role-Based | Medium | Partial | Large Elasticsearch, Solr |
| Full Separation | High | Complete | Google, Bing, Large tech companies |
Pattern 4: Lambda Architecture for Search
Combine batch and streaming indexing:
                    ┌─────────────────┐
Documents ─────┬───►│  Stream Layer   │───► Real-time Index ───┐
               │    │  (Kafka, etc.)  │                        │
               │    └─────────────────┘                        ▼
               │    ┌─────────────────┐                  ┌─────────┐
               └───►│   Batch Layer   │─────────────────►│  Merge  │──► Query Cluster
                    │  (Spark, etc.)  │                  └─────────┘
                    └─────────────────┘
When to use: When freshness matters but complete processing is too slow, when data requires complex enrichment.
Trade-off: Complexity of maintaining two data paths, potential for inconsistency between layers.
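The merge step is where the two layers reconcile. A sketch under simple assumptions: each layer returns a doc-id to version map, and the real-time layer wins conflicts because it carries the freshest data (`merge_views` is a hypothetical helper name).

```python
def merge_views(batch_hits, realtime_hits):
    """Merge batch and real-time layers: the real-time view wins on
    conflicts because it holds the freshest version of each document."""
    merged = dict(batch_hits)
    merged.update(realtime_hits)  # real-time entries overwrite batch ones
    return merged

batch = {"doc1": "v1", "doc2": "v1"}
realtime = {"doc2": "v2", "doc3": "v1"}  # doc2 was updated after the batch ran
print(merge_views(batch, realtime))
# {'doc1': 'v1', 'doc2': 'v2', 'doc3': 'v1'}
```

The inconsistency risk mentioned above lives in this function: if the real-time layer lags or drops an update, the stale batch version silently wins.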
Simpler patterns push complexity to application code (handling eventual consistency, managing refresh timing). Complex patterns push complexity to infrastructure (more systems, more failure modes, more ops burden). Choose based on team capability and SLA requirements, not just technical elegance.
Let's examine how major search systems handle the indexing-querying balance:
Elasticsearch / OpenSearch
Indexing Path: documents are written to a transaction log for durability, buffered in memory, and made searchable by periodic refreshes that create immutable Lucene segments; background merges keep segment counts under control.
Querying Path: a coordinating node parses the query, scatters it to one copy of each relevant shard, gathers per-shard top results, and merges them into the final ranked response.
Key Characteristics: near-real-time by default (1-second refresh), unified node roles unless explicitly separated, and tunable freshness via refresh and replication settings.
Each system makes different trade-offs. Elasticsearch prioritizes flexibility and near-real-time. Google-scale systems prioritize query performance at massive scale. Algolia prioritizes simplicity and ultra-low latency. Choose based on your specific requirements: scale, latency, freshness, operational complexity, and cost.
When designing a search system, you'll face decisions about how to balance indexing and querying. Here's a framework for making these decisions:
Common Decision Patterns:
| Scenario | Recommendation |
|---|---|
| E-commerce search | NRT indexing, optimize query path heavily, cache popular queries |
| Log analytics | Optimize indexing throughput, batch refreshes, tolerate query latency |
| Enterprise document search | Moderate NRT, focus on relevance/ML ranking, tolerate some latency |
| Social media search | Real-time indexing for user content, cache/batch for analytics |
| Web search | Batch indexing, extreme query optimization, multi-tier ranking |
Don't over-engineer initially. Start with defaults (Elasticsearch 1s refresh, unified cluster), measure actual performance, identify bottlenecks, then optimize. Premature optimization of the wrong path wastes effort. Let production data guide your optimization priorities.
The separation of indexing and querying is fundamental to search system architecture. Let's consolidate the key insights:
The Mental Model:
Think of search as a time-shifting operation: work is moved from query time, when users are waiting, to indexing time, when nobody is. You pre-pay computation as documents arrive so that every subsequent query is cheap; the visibility window is the price of that deferral.
What's Next:
Now that you understand the indexing/querying dichotomy, the next page explores the heart of the indexing path: the inverted index. This data structure is the secret sauce that makes sub-second search possible, and understanding it deeply is essential for effective search system design.
You now understand the fundamental separation between indexing and querying paths in search systems. You can reason about trade-offs, identify bottlenecks, and make informed architectural decisions. This understanding will inform every subsequent topic in search systems.