When you type a search query on Amazon and see results in milliseconds, you're witnessing the culmination of sophisticated indexing strategies that have been refined over decades. Behind every fast search experience lies a critical architectural decision: how should documents be added to the search index?
This question sits at the heart of search system design. The answer determines not just how quickly new content becomes searchable, but also how your system behaves under load, how much infrastructure you need, and whether your users experience stale results or fresh ones.
The two fundamental approaches—real-time indexing and batch indexing—represent different philosophies about the trade-offs between freshness, consistency, resource utilization, and operational complexity. Understanding when to use each approach, and how to combine them, separates competent search engineers from exceptional ones.
By the end of this page, you will understand the fundamental differences between real-time and batch indexing, their architectural implications, when to choose each approach, and how to design hybrid systems that leverage the strengths of both. You'll gain the vocabulary and mental models used by principal engineers at companies like Google, Amazon, and LinkedIn when designing search infrastructure.
Before diving into specific strategies, let's establish the core tension that drives indexing architecture decisions. Every search system must balance multiple competing requirements:
Freshness: How quickly do new or updated documents appear in search results?
Throughput: How many documents can be indexed per unit time?
Resource Efficiency: How much CPU, memory, and I/O does indexing consume?
Query Performance: How does indexing activity affect search latency?
Consistency: Do all users see the same search results at the same time?
No approach optimizes all dimensions simultaneously. Real-time indexing prioritizes freshness at the cost of throughput and resource efficiency. Batch indexing prioritizes efficiency and consistency at the cost of freshness. Understanding this trade-off is essential because the right choice depends entirely on your use case.
| Dimension | Real-Time Indexing | Batch Indexing |
|---|---|---|
| Freshness | Seconds to low minutes | Minutes to hours |
| Throughput | Limited by write amplification | Optimized for bulk operations |
| Resource Efficiency | Higher per-document overhead | Lower per-document overhead |
| Query Impact | Potential interference | Isolated from queries |
| Consistency | Eventually consistent | Point-in-time consistent |
| Operational Complexity | Continuous monitoring | Scheduled job management |
Most production search systems at scale are neither purely real-time nor purely batch. They use hybrid architectures that combine both approaches strategically. A common pattern: real-time indexing for critical updates (price changes, stock availability) and batch indexing for comprehensive data refreshes and new document ingestion.
Real-time indexing (also called near-real-time or NRT indexing) makes documents searchable within seconds of ingestion. This is the default mode for most search engines today, including Elasticsearch, Apache Solr, and managed services like Amazon OpenSearch.
The mechanics involve several coordinated steps that happen in rapid succession:
1. Submission: A client sends the document to the indexing API (e.g., POST /_doc in Elasticsearch)
2. Analysis: The engine tokenizes and analyzes each field into indexable terms
3. Buffering: The analyzed document is written to an in-memory indexing buffer
4. Durability: The operation is appended to a transaction log (translog) so it survives a crash
5. Refresh: On the next refresh cycle, the buffer is flushed as a new searchable segment

This architecture reveals a key insight: real-time doesn't mean instant. There's always a configurable delay (the refresh interval) between document ingestion and searchability.
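To make that delay concrete, here is a minimal sketch using the Python Elasticsearch client; the demo index, title field, and localhost URL are illustrative, and the exact call style varies by client version:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Index a document: durable immediately (translog), but searchable
# only after the next refresh.
es.index(index="demo", id="1", body={"title": "fresh document"})

# Searching right away usually misses it -- no refresh has happened yet.
resp = es.search(index="demo", body={"query": {"match": {"title": "fresh"}}})
print(resp["hits"]["total"]["value"])  # typically 0

# Force a refresh to open a new searchable segment (fine in a demo,
# expensive when done on every write at scale).
es.indices.refresh(index="demo")

resp = es.search(index="demo", body={"query": {"match": {"title": "fresh"}}})
print(resp["hits"]["total"]["value"])  # now 1
```

In production you tune this delay rather than forcing refreshes. The settings below control refresh frequency, translog durability, and merge behavior: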
```
// Index settings for real-time indexing behavior
{
  "settings": {
    "index": {
      // How often to refresh (make new docs searchable)
      // 1s is default, can be decreased for faster updates
      "refresh_interval": "1s",

      // Transaction log durability settings
      "translog": {
        // fsync on every request vs. async
        "durability": "request",
        // Max size before force flush
        "flush_threshold_size": "512mb"
      },

      // Merge policy affects indexing performance
      "merge": {
        "scheduler": {
          // Max concurrent merges
          "max_thread_count": 1
        }
      }
    }
  }
}

// Force refresh for immediate visibility (use sparingly!)
POST /my_index/_refresh

// Index with refresh=true (expensive, use for critical updates only)
POST /my_index/_doc?refresh=true
{
  "title": "Urgent Update",
  "price": 29.99
}
```

Setting refresh_interval too low (e.g., 100ms) or using refresh=true on every write can devastate cluster performance. Each refresh creates a new segment, causing segment explosion and excessive I/O. A common anti-pattern: developers set aggressive refresh intervals during development, forget to change them, and wonder why production clusters struggle.
Real-time indexing appears straightforward on the surface, but production deployments reveal several significant challenges that architects must address:
Every document written to a search index triggers multiple subsequent operations:

- An append to the transaction log for durability
- A write of the analyzed document into the in-memory buffer, flushed as a new segment on refresh
- Repeated rewrites as background merges fold small segments into progressively larger ones
- The same work replayed on every replica shard
This write amplification means that indexing 1 GB of data might result in 5-10 GB of actual disk I/O. At scale, this becomes the dominant bottleneck.
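A back-of-envelope sketch shows where that multiplier comes from. The factor counts below are illustrative assumptions, not measured values:

```python
# Rough write-amplification estimate for a Lucene-style engine.
raw_gb = 1.0

translog_write = raw_gb        # append to the transaction log
initial_segment = raw_gb       # flush the buffer to a first segment
# Each document is rewritten roughly once per merge "generation" as
# small segments combine into larger ones; 3-8 passes is a common range.
merge_passes = 5

total_io = translog_write + initial_segment + merge_passes * raw_gb
print(f"~{total_io:.0f} GB written per {raw_gb:.0f} GB ingested")
# -> ~7 GB, consistent with the 5-10x range above (before replicas,
# which multiply the whole figure again)
```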
Frequent refreshes create many small segments. While each is individually fast to create, having hundreds or thousands of small segments degrades query performance dramatically. The search engine must query each segment and merge results—more segments mean more overhead.
Indexing and querying compete for the same resources: CPU for analysis and merging, memory for caches and buffers, I/O bandwidth for reads and writes. Heavy indexing loads can cause query latency spikes, a phenomenon known as indexing interference.
If a node fails, it must replay its transaction log to recover uncommitted documents. Larger transaction logs (from higher throughput) mean longer recovery times, extending your system's vulnerability window during incidents.
Real-time indexing systems require comprehensive monitoring: segment count, merge rate, transaction log size, indexing latency percentiles, and refresh timing. Without visibility into these metrics, problems compound silently until they cause outages.
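As a starting point, the metrics above can be pulled from the standard index stats API. A sketch using the Python client; the products index name is illustrative:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

def indexing_health(index: str) -> dict:
    """Collect the indexing metrics worth alerting on."""
    stats = es.indices.stats(index=index)["indices"][index]["primaries"]
    return {
        # Segment explosion: worry when this climbs into the hundreds
        "segment_count": stats["segments"]["count"],
        # Merge pressure relative to indexing work
        "merge_time_ms": stats["merges"]["total_time_in_millis"],
        "index_time_ms": stats["indexing"]["index_time_in_millis"],
        # Recovery exposure: translog size bounds replay time after a crash
        "translog_bytes": stats["translog"]["size_in_bytes"],
        # Refresh churn: each refresh may create a new segment
        "refresh_count": stats["refresh"]["total"],
    }

print(indexing_health("products"))
```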
Batch indexing takes a fundamentally different approach: instead of indexing documents as they arrive, it accumulates documents and processes them in large, scheduled jobs. This model dominated early search systems and remains essential for many use cases.
This approach leverages a crucial observation: bulk operations are dramatically more efficient than individual operations. Indexing 1 million documents in a single bulk request uses a fraction of the resources required to index the same documents one at a time.
```python
from elasticsearch import Elasticsearch, helpers

class BatchIndexingPipeline:
    """
    Production batch indexing pipeline with optimized settings.

    This pipeline demonstrates patterns used at LinkedIn, Twitter,
    and other companies processing billions of documents.
    """

    def __init__(self, es_hosts: list[str]):
        self.es = Elasticsearch(es_hosts)

    def prepare_index_for_bulk_load(self, index_name: str) -> None:
        """
        Optimizes index settings for bulk ingestion.
        These settings dramatically improve indexing throughput.
        """
        self.es.indices.put_settings(
            index=index_name,
            body={
                "index": {
                    # Disable refresh during bulk load
                    "refresh_interval": "-1",
                    # Increase indexing buffer size
                    "translog.flush_threshold_size": "1gb",
                    # Reduce replica overhead (restore after)
                    "number_of_replicas": 0,
                }
            }
        )

    def bulk_index_documents(
        self,
        index_name: str,
        documents: list[dict],
        batch_size: int = 5000
    ) -> dict:
        """
        Index documents using bulk API with proper error handling.

        batch_size of 5000-15000 is typically optimal, depending on
        document size. Larger batches reduce overhead but increase
        memory pressure and retry scope on failure.
        """
        def generate_actions():
            for doc in documents:
                yield {
                    "_index": index_name,
                    "_id": doc["id"],
                    "_source": doc
                }

        success, failed = 0, []

        # Use parallel_bulk for multi-threaded ingestion
        for ok, result in helpers.parallel_bulk(
            self.es,
            generate_actions(),
            chunk_size=batch_size,
            thread_count=4,
            raise_on_error=False
        ):
            if ok:
                success += 1
            else:
                failed.append(result)

        return {
            "success": success,
            "failed": len(failed),
            "errors": failed[:10]  # Sample of errors
        }

    def finalize_index(self, index_name: str, replicas: int = 1) -> None:
        """
        Restore production settings and optimize index.
        """
        # Force merge to reduce segment count (expensive but worth it)
        self.es.indices.forcemerge(
            index=index_name,
            max_num_segments=1  # Single segment = fastest queries
        )

        # Restore production settings
        self.es.indices.put_settings(
            index=index_name,
            body={
                "index": {
                    "refresh_interval": "1s",
                    "number_of_replicas": replicas,
                }
            }
        )

        # Final refresh to make all docs searchable
        self.es.indices.refresh(index=index_name)

# Example usage
pipeline = BatchIndexingPipeline(["http://es-cluster:9200"])
pipeline.prepare_index_for_bulk_load("products_v2")

# Simulated batch of 1M documents
documents = [{"id": i, "title": f"Product {i}"} for i in range(1_000_000)]
result = pipeline.bulk_index_documents("products_v2", documents)

pipeline.finalize_index("products_v2", replicas=2)
```

The efficiency gains from batch indexing are substantial. In typical benchmarks:
| Indexing Mode | Documents/Second | CPU Usage | I/O Writes |
|---|---|---|---|
| One-by-one (sync refresh) | 100-500 | High | Very High |
| One-by-one (1s refresh) | 1,000-3,000 | High | High |
| Bulk API (5K chunks) | 10,000-30,000 | Medium | Medium |
| Bulk API (optimized) | 50,000-100,000+ | Low | Low |
These gains come from amortizing fixed costs (connection setup, segment creation, cache invalidation) across many documents instead of paying them repeatedly.
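A simple way to see the amortization yourself is to time both paths against the same cluster. A minimal harness, assuming the Python client and a throwaway bench index; treat the ratio, not the absolute numbers, as the result:

```python
import time
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])
docs = [{"id": i, "title": f"Product {i}"} for i in range(10_000)]

# One-by-one: pays request and indexing overhead per document.
start = time.perf_counter()
for doc in docs:
    es.index(index="bench", id=doc["id"], body=doc)
one_by_one = time.perf_counter() - start

# Bulk: amortizes that overhead across thousands of documents.
start = time.perf_counter()
helpers.bulk(
    es,
    ({"_index": "bench", "_id": d["id"], "_source": d} for d in docs),
    chunk_size=5000,
)
bulk = time.perf_counter() - start

print(f"one-by-one: {one_by_one:.1f}s  bulk: {bulk:.1f}s  "
      f"speedup: {one_by_one / bulk:.0f}x")
```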
Batch indexing isn't a legacy approach—it's the optimal choice for many production scenarios. Understanding when to use it is crucial for efficient system design.
Data Warehouse Integration: When search indexes are populated from nightly ETL jobs processing data warehouses, batch indexing aligns perfectly with the data availability pattern.
Full Catalog Rebuilds: E-commerce catalogs, content management systems, and media libraries often need periodic full refreshes. Batch indexing makes these operations predictable and efficient.
Historical Data Migration: Moving years of data from legacy systems to new search infrastructure requires the throughput only batch processing can provide.
ML Model Updates: When search relevance depends on machine learning features (embeddings, classifications), batch processing allows recomputing features for all documents when models change.
Consistency Requirements: Some applications require that all users see exactly the same search results. Batch indexing to a new index version provides point-in-time consistency that real-time indexing cannot.
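The standard way to get that point-in-time consistency is to bulk-load into a fresh index version and then repoint an alias, so every query sees exactly one complete dataset. A sketch of the swap, assuming the Python client and illustrative index names (the hybrid pipeline later on this page calls the same idea atomicAliasSwap):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

def swap_alias(alias: str, new_index: str) -> None:
    """Atomically repoint a read alias at a freshly built index."""
    # Find whichever index currently backs the alias (none on first deploy).
    current = (
        list(es.indices.get_alias(name=alias).keys())
        if es.indices.exists_alias(name=alias)
        else []
    )

    actions = [{"add": {"index": new_index, "alias": alias}}]
    for old_index in current:
        actions.append({"remove": {"index": old_index, "alias": alias}})

    # A single update_aliases call applies all actions atomically:
    # no query ever sees zero catalogs or two catalog versions.
    es.indices.update_aliases(body={"actions": actions})

# After bulk-loading and finalizing products_v2:
swap_alias("products", "products_v2")
```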
Users often care less about absolute freshness than engineers assume. An e-commerce search with 2-hour-old prices is usually acceptable; users understand prices can change. But a stock trading platform with 2-second-old prices is unacceptable. Understand your domain's actual freshness requirements before choosing an indexing strategy.
Most sophisticated search systems at scale use hybrid architectures that combine real-time and batch indexing strategically. The key insight is that different types of data have different freshness requirements.
Consider a product search engine for a large e-commerce platform. Documents have multiple field types:

- Static fields that change rarely: name, description, category, brand
- Dynamic fields that change constantly: price, inventory level
- Computed fields derived from models and aggregate behavior: popularity score, ML embeddings
A hybrid approach indexes these differently:

- Dynamic fields flow through a real-time path: partial document updates driven by an event stream
- Static and computed fields are rebuilt in scheduled batch jobs that recompute features for the whole catalog
- At query time, results can optionally be merged with the freshest values from a cache
This pattern, sometimes called the Lambda Architecture for Search, provides fresh critical data while maintaining efficient bulk processing for everything else.
```typescript
/**
 * Hybrid Indexing Architecture
 *
 * Combines batch indexing for full catalog with real-time
 * updates for critical fields like price and inventory.
 */

interface ProductDocument {
  id: string;

  // Static fields (batch indexed)
  name: string;
  description: string;
  category: string;
  brand: string;

  // Dynamic fields (real-time indexed)
  price: number;
  inventory: number;
  lastUpdated: Date;

  // Computed fields (batch recomputed)
  popularityScore: number;
  mlEmbedding: number[];
}

interface UpdateEvent {
  productId: string;
  field: 'price' | 'inventory';
  value: number;
  timestamp: Date;
}

class HybridIndexingPipeline {
  private esClient: ElasticsearchClient;
  private updateQueue: KafkaConsumer;

  /**
   * Real-time path: Process update events as they arrive.
   * Uses partial document updates to minimize overhead.
   */
  async processRealTimeUpdates(): Promise<void> {
    for await (const batch of this.updateQueue.consume()) {
      // Group updates by document ID
      const grouped = this.groupByProductId(batch);

      // Use bulk partial update API
      const bulkOps = Object.entries(grouped).flatMap(
        ([productId, updates]) => [
          { update: { _id: productId, _index: 'products' } },
          { doc: { ...this.mergeUpdates(updates), lastUpdated: new Date() } }
        ]
      );

      await this.esClient.bulk({ body: bulkOps });
    }
  }

  /**
   * Batch path: Full catalog rebuild on schedule.
   * Rebuilds entire index with fresh computed features.
   */
  async runBatchReindex(newVersion: string): Promise<void> {
    const newIndex = `products_${newVersion}`;

    // Create optimized index for bulk loading
    await this.createIndexWithBulkSettings(newIndex);

    // Stream all products from source of truth
    for await (const batch of this.streamFromDataWarehouse()) {
      // Compute ML features for entire batch
      const enriched = await this.enrichWithMLFeatures(batch);
      await this.bulkIndex(newIndex, enriched);
    }

    // Finalize and swap
    await this.finalizeIndex(newIndex);
    await this.atomicAliasSwap('products', newIndex);
  }

  /**
   * Query-time merge for real-time overrides.
   * Not needed if using Elasticsearch's doc values,
   * but useful for complex merge logic.
   */
  async searchWithOverrides(query: SearchQuery): Promise<SearchResults> {
    // Get base results from search index
    const results = await this.esClient.search(query);

    // Fetch any real-time overrides from cache
    const overrides = await this.fetchRealtimeOverrides(
      results.hits.map(h => h._id)
    );

    // Merge overrides into results
    return this.applyOverrides(results, overrides);
  }

  private groupByProductId(
    updates: UpdateEvent[]
  ): Record<string, UpdateEvent[]> {
    // Group updates by product ID
    return updates.reduce((acc, update) => {
      (acc[update.productId] ??= []).push(update);
      return acc;
    }, {} as Record<string, UpdateEvent[]>);
  }

  private mergeUpdates(
    updates: UpdateEvent[]
  ): Partial<ProductDocument> {
    // Take latest value for each field
    return updates.reduce((acc, update) => {
      acc[update.field] = update.value;
      return acc;
    }, {} as Partial<ProductDocument>);
  }
}
```

Given the trade-offs we've explored, how should you choose an indexing strategy for your system? Use this framework:
Start by quantifying freshness. Ask stakeholders: "If a document is updated, how quickly must it appear in search?" The answer, mapped against your workload, usually points directly at a strategy:
| Use Case | Recommended Strategy | Key Considerations |
|---|---|---|
| Chat/Messaging Search | Real-time | Sub-second freshness critical; accept higher resource cost |
| E-commerce Catalog | Hybrid | Price/inventory real-time; descriptions/features batch |
| Document Search (Enterprise) | Batch + Real-time deltas | Nightly full rebuild; real-time for new documents |
| Log Analytics | Real-time with retention | Accept eventual consistency; optimize for write throughput |
| Data Warehouse Search | Batch only | Aligned with ETL schedules; consistency matters |
| Social Media Feed | Real-time with fan-out | Post visibility critical; handle viral content spikes |
Many systems begin with pure real-time indexing because it's simpler to implement. As scale increases and operational pain accumulates, teams add batch processing for specific workloads. This evolution is natural—don't over-engineer initially, but design with eventual hybrid capability in mind.
We've explored the foundational decision in search indexing architecture. Let's consolidate the key insights:

- Real-time indexing buys freshness at the cost of write amplification, segment churn, and contention with query traffic
- Batch indexing buys throughput, resource efficiency, and point-in-time consistency at the cost of staleness between runs
- Bulk operations amortize fixed costs, yielding order-of-magnitude throughput gains over one-by-one indexing
- Most systems at scale are hybrid: real-time updates for critical fields, batch rebuilds for everything else
- Choose based on your domain's actual freshness requirements, not assumed ones
What's next:
Now that we understand the high-level choice between real-time and batch indexing, we'll examine how to handle the ongoing challenge of index updates and delta indexing—keeping search indexes synchronized with source data without full rebuilds.
You now understand the fundamental trade-offs between real-time and batch indexing strategies. This mental model will inform every subsequent decision about search infrastructure design. Next, we'll explore how to efficiently update existing indexes without full rebuilds.