When you type a search query on Amazon and see results in milliseconds, you're witnessing the culmination of sophisticated indexing strategies that have been refined over decades. Behind every fast search experience lies a critical architectural decision: how should documents be added to the search index?
This question sits at the heart of search system design. The answer determines not just how quickly new content becomes searchable, but also how your system behaves under load, how much infrastructure you need, and whether your users experience stale results or fresh ones.
The two fundamental approaches—real-time indexing and batch indexing—represent different philosophies about the trade-offs between freshness, consistency, resource utilization, and operational complexity. Understanding when to use each approach, and how to combine them, separates competent search engineers from exceptional ones.
By the end of this page, you will understand the fundamental differences between real-time and batch indexing, their architectural implications, when to choose each approach, and how to design hybrid systems that leverage the strengths of both. You'll gain the vocabulary and mental models used by principal engineers at companies like Google, Amazon, and LinkedIn when designing search infrastructure.
Before diving into specific strategies, let's establish the core tension that drives indexing architecture decisions. Every search system must balance multiple competing requirements:
Freshness: How quickly do new or updated documents appear in search results?
Throughput: How many documents can be indexed per unit time?
Resource Efficiency: How much CPU, memory, and I/O does indexing consume?
Query Performance: How does indexing activity affect search latency?
Consistency: Do all users see the same search results at the same time?
No approach optimizes all dimensions simultaneously. Real-time indexing prioritizes freshness at the cost of throughput and resource efficiency. Batch indexing prioritizes efficiency and consistency at the cost of freshness. Understanding this trade-off is essential because the right choice depends entirely on your use case.
| Dimension | Real-Time Indexing | Batch Indexing |
|---|---|---|
| Freshness | Seconds to low minutes | Minutes to hours |
| Throughput | Limited by write amplification | Optimized for bulk operations |
| Resource Efficiency | Higher per-document overhead | Lower per-document overhead |
| Query Impact | Potential interference | Isolated from queries |
| Consistency | Eventually consistent | Point-in-time consistent |
| Operational Complexity | Continuous monitoring | Scheduled job management |
Most production search systems at scale are neither purely real-time nor purely batch. They use hybrid architectures that combine both approaches strategically. A common pattern: real-time indexing for critical updates (price changes, stock availability) and batch indexing for comprehensive data refreshes and new document ingestion.
Real-time indexing (also called near-real-time or NRT indexing) makes documents searchable within seconds of ingestion. This is the default mode for most search engines today, including Elasticsearch, Apache Solr, and managed services like Amazon OpenSearch.
The mechanics involve several coordinated steps that happen in rapid succession:
1. Submission: A client sends the document to the indexing API (e.g., POST /_doc in Elasticsearch)
2. Analysis: The engine tokenizes and analyzes each field into indexable terms
3. Buffering: The analyzed document is written to an in-memory indexing buffer
4. Durability: The operation is appended to a transaction log (translog) so it survives a crash
5. Refresh: On the next refresh cycle, the buffer is flushed as a new searchable segment

This architecture reveals a key insight: real-time doesn't mean instant. There's always a configurable delay (the refresh interval) between document ingestion and searchability.
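To make that delay concrete, here is a minimal sketch using the Python Elasticsearch client; the demo index, title field, and localhost URL are illustrative, and the exact call style varies by client version:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Index a document: durable immediately (translog), but searchable
# only after the next refresh.
es.index(index="demo", id="1", body={"title": "fresh document"})

# Searching right away usually misses it -- no refresh has happened yet.
resp = es.search(index="demo", body={"query": {"match": {"title": "fresh"}}})
print(resp["hits"]["total"]["value"])  # typically 0

# Force a refresh to open a new searchable segment (fine in a demo,
# expensive when done on every write at scale).
es.indices.refresh(index="demo")

resp = es.search(index="demo", body={"query": {"match": {"title": "fresh"}}})
print(resp["hits"]["total"]["value"])  # now 1
```

In production you tune this delay rather than forcing refreshes. The settings below control refresh frequency, translog durability, and merge behavior: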
```
// Index settings for real-time indexing behavior
{
  "settings": {
    "index": {
      // How often to refresh (make new docs searchable)
      // 1s is default, can be decreased for faster updates
      "refresh_interval": "1s",

      // Transaction log durability settings
      "translog": {
        // fsync on every request vs. async
        "durability": "request",
        // Max size before force flush
        "flush_threshold_size": "512mb"
      },

      // Merge policy affects indexing performance
      "merge": {
        "scheduler": {
          // Max concurrent merges
          "max_thread_count": 1
        }
      }
    }
  }
}

// Force refresh for immediate visibility (use sparingly!)
POST /my_index/_refresh

// Index with refresh=true (expensive, use for critical updates only)
POST /my_index/_doc?refresh=true
{
  "title": "Urgent Update",
  "price": 29.99
}
```

Setting refresh_interval too low (e.g., 100ms) or using refresh=true on every write can devastate cluster performance. Each refresh creates a new segment, causing segment explosion and excessive I/O. A common anti-pattern: developers set aggressive refresh intervals during development, forget to change them, and wonder why production clusters struggle.
Real-time indexing appears straightforward on the surface, but production deployments reveal several significant challenges that architects must address:
Every document written to a search index triggers multiple subsequent operations:

- An append to the transaction log for durability
- A write of the analyzed document into the in-memory buffer, flushed as a new segment on refresh
- Repeated rewrites as background merges fold small segments into progressively larger ones
- The same work replayed on every replica shard
This write amplification means that indexing 1 GB of data might result in 5-10 GB of actual disk I/O. At scale, this becomes the dominant bottleneck.
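A back-of-envelope sketch shows where that multiplier comes from. The factor counts below are illustrative assumptions, not measured values:

```python
# Rough write-amplification estimate for a Lucene-style engine.
raw_gb = 1.0

translog_write = raw_gb        # append to the transaction log
initial_segment = raw_gb       # flush the buffer to a first segment
# Each document is rewritten roughly once per merge "generation" as
# small segments combine into larger ones; 3-8 passes is a common range.
merge_passes = 5

total_io = translog_write + initial_segment + merge_passes * raw_gb
print(f"~{total_io:.0f} GB written per {raw_gb:.0f} GB ingested")
# -> ~7 GB, consistent with the 5-10x range above (before replicas,
# which multiply the whole figure again)
```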
Frequent refreshes create many small segments. While each is individually fast to create, having hundreds or thousands of small segments degrades query performance dramatically. The search engine must query each segment and merge results—more segments mean more overhead.
Indexing and querying compete for the same resources: CPU for analysis and merging, memory for caches and buffers, I/O bandwidth for reads and writes. Heavy indexing loads can cause query latency spikes, a phenomenon known as indexing interference.
If a node fails, it must replay its transaction log to recover uncommitted documents. Larger transaction logs (from higher throughput) mean longer recovery times, extending your system's vulnerability window during incidents.
Real-time indexing systems require comprehensive monitoring: segment count, merge rate, transaction log size, indexing latency percentiles, and refresh timing. Without visibility into these metrics, problems compound silently until they cause outages.
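As a starting point, the metrics above can be pulled from the standard index stats API. A sketch using the Python client; the products index name is illustrative:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

def indexing_health(index: str) -> dict:
    """Collect the indexing metrics worth alerting on."""
    stats = es.indices.stats(index=index)["indices"][index]["primaries"]
    return {
        # Segment explosion: worry when this climbs into the hundreds
        "segment_count": stats["segments"]["count"],
        # Merge pressure relative to indexing work
        "merge_time_ms": stats["merges"]["total_time_in_millis"],
        "index_time_ms": stats["indexing"]["index_time_in_millis"],
        # Recovery exposure: translog size bounds replay time after a crash
        "translog_bytes": stats["translog"]["size_in_bytes"],
        # Refresh churn: each refresh may create a new segment
        "refresh_count": stats["refresh"]["total"],
    }

print(indexing_health("products"))
```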
Batch indexing takes a fundamentally different approach: instead of indexing documents as they arrive, it accumulates documents and processes them in large, scheduled jobs. This model dominated early search systems and remains essential for many use cases.
This approach leverages a crucial observation: bulk operations are dramatically more efficient than individual operations. Indexing 1 million documents in a single bulk request uses a fraction of the resources required to index the same documents one at a time.
```python
from elasticsearch import Elasticsearch, helpers

class BatchIndexingPipeline:
    """
    Production batch indexing pipeline with optimized settings.

    This pipeline demonstrates patterns used at LinkedIn, Twitter,
    and other companies processing billions of documents.
    """

    def __init__(self, es_hosts: list[str]):
        self.es = Elasticsearch(es_hosts)

    def prepare_index_for_bulk_load(self, index_name: str) -> None:
        """
        Optimizes index settings for bulk ingestion.
        These settings dramatically improve indexing throughput.
        """
        self.es.indices.put_settings(
            index=index_name,
            body={
                "index": {
                    # Disable refresh during bulk load
                    "refresh_interval": "-1",
                    # Increase indexing buffer size
                    "translog.flush_threshold_size": "1gb",
                    # Reduce replica overhead (restore after)
                    "number_of_replicas": 0,
                }
            }
        )

    def bulk_index_documents(
        self,
        index_name: str,
        documents: list[dict],
        batch_size: int = 5000
    ) -> dict:
        """
        Index documents using bulk API with proper error handling.

        batch_size of 5000-15000 is typically optimal, depending on
        document size. Larger batches reduce overhead but increase
        memory pressure and retry scope on failure.
        """
        def generate_actions():
            for doc in documents:
                yield {
                    "_index": index_name,
                    "_id": doc["id"],
                    "_source": doc
                }

        success, failed = 0, []

        # Use parallel_bulk for multi-threaded ingestion
        for ok, result in helpers.parallel_bulk(
            self.es,
            generate_actions(),
            chunk_size=batch_size,
            thread_count=4,
            raise_on_error=False
        ):
            if ok:
                success += 1
            else:
                failed.append(result)

        return {
            "success": success,
            "failed": len(failed),
            "errors": failed[:10]  # Sample of errors
        }

    def finalize_index(self, index_name: str, replicas: int = 1) -> None:
        """
        Restore production settings and optimize index.
        """
        # Force merge to reduce segment count (expensive but worth it)
        self.es.indices.forcemerge(
            index=index_name,
            max_num_segments=1  # Single segment = fastest queries
        )

        # Restore production settings
        self.es.indices.put_settings(
            index=index_name,
            body={
                "index": {
                    "refresh_interval": "1s",
                    "number_of_replicas": replicas,
                }
            }
        )

        # Final refresh to make all docs searchable
        self.es.indices.refresh(index=index_name)

# Example usage
pipeline = BatchIndexingPipeline(["http://es-cluster:9200"])
pipeline.prepare_index_for_bulk_load("products_v2")

# Simulated batch of 1M documents
documents = [{"id": i, "title": f"Product {i}"} for i in range(1_000_000)]
result = pipeline.bulk_index_documents("products_v2", documents)

pipeline.finalize_index("products_v2", replicas=2)
```

The efficiency gains from batch indexing are substantial. In typical benchmarks:
| Indexing Mode | Documents/Second | CPU Usage | I/O Writes |
|---|---|---|---|
| One-by-one (sync refresh) | 100-500 | High | Very High |
| One-by-one (1s refresh) | 1,000-3,000 | High | High |
| Bulk API (5K chunks) | 10,000-30,000 | Medium | Medium |
| Bulk API (optimized) | 50,000-100,000+ | Low | Low |
These gains come from amortizing fixed costs (connection setup, segment creation, cache invalidation) across many documents instead of paying them repeatedly.
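A simple way to see the amortization yourself is to time both paths against the same cluster. A minimal harness, assuming the Python client and a throwaway bench index; treat the ratio, not the absolute numbers, as the result:

```python
import time
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])
docs = [{"id": i, "title": f"Product {i}"} for i in range(10_000)]

# One-by-one: pays request and indexing overhead per document.
start = time.perf_counter()
for doc in docs:
    es.index(index="bench", id=doc["id"], body=doc)
one_by_one = time.perf_counter() - start

# Bulk: amortizes that overhead across thousands of documents.
start = time.perf_counter()
helpers.bulk(
    es,
    ({"_index": "bench", "_id": d["id"], "_source": d} for d in docs),
    chunk_size=5000,
)
bulk = time.perf_counter() - start

print(f"one-by-one: {one_by_one:.1f}s  bulk: {bulk:.1f}s  "
      f"speedup: {one_by_one / bulk:.0f}x")
```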
Batch indexing isn't a legacy approach—it's the optimal choice for many production scenarios. Understanding when to use it is crucial for efficient system design.
Data Warehouse Integration: When search indexes are populated from nightly ETL jobs processing data warehouses, batch indexing aligns perfectly with the data availability pattern.
Full Catalog Rebuilds: E-commerce catalogs, content management systems, and media libraries often need periodic full refreshes. Batch indexing makes these operations predictable and efficient.
Historical Data Migration: Moving years of data from legacy systems to new search infrastructure requires the throughput only batch processing can provide.
ML Model Updates: When search relevance depends on machine learning features (embeddings, classifications), batch processing allows recomputing features for all documents when models change.
Consistency Requirements: Some applications require that all users see exactly the same search results. Batch indexing to a new index version provides point-in-time consistency that real-time indexing cannot.
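The standard way to get that point-in-time consistency is to bulk-load into a fresh index version and then repoint an alias, so every query sees exactly one complete dataset. A sketch of the swap, assuming the Python client and illustrative index names (the hybrid pipeline later on this page calls the same idea atomicAliasSwap):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

def swap_alias(alias: str, new_index: str) -> None:
    """Atomically repoint a read alias at a freshly built index."""
    # Find whichever index currently backs the alias (none on first deploy).
    current = (
        list(es.indices.get_alias(name=alias).keys())
        if es.indices.exists_alias(name=alias)
        else []
    )

    actions = [{"add": {"index": new_index, "alias": alias}}]
    for old_index in current:
        actions.append({"remove": {"index": old_index, "alias": alias}})

    # A single update_aliases call applies all actions atomically:
    # no query ever sees zero catalogs or two catalog versions.
    es.indices.update_aliases(body={"actions": actions})

# After bulk-loading and finalizing products_v2:
swap_alias("products", "products_v2")
```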
Users often care less about absolute freshness than engineers assume. An e-commerce search with 2-hour-old prices is usually acceptable; users understand prices can change. But a stock trading platform with 2-second-old prices is unacceptable. Understand your domain's actual freshness requirements before choosing an indexing strategy.
Most sophisticated search systems at scale use hybrid architectures that combine real-time and batch indexing strategically. The key insight is that different types of data have different freshness requirements.
Consider a product search engine for a large e-commerce platform. Documents have multiple field types:

- Static fields that change rarely: name, description, category, brand
- Dynamic fields that change constantly: price, inventory level
- Computed fields derived from models and aggregate behavior: popularity score, ML embeddings
A hybrid approach indexes these differently:

- Dynamic fields flow through a real-time path: partial document updates driven by an event stream
- Static and computed fields are rebuilt in scheduled batch jobs that recompute features for the whole catalog
- At query time, results can optionally be merged with the freshest values from a cache
This pattern, sometimes called the Lambda Architecture for Search, provides fresh critical data while maintaining efficient bulk processing for everything else.
```typescript
/**
 * Hybrid Indexing Architecture
 *
 * Combines batch indexing for full catalog with real-time
 * updates for critical fields like price and inventory.
 */

interface ProductDocument {
  id: string;

  // Static fields (batch indexed)
  name: string;
  description: string;
  category: string;
  brand: string;

  // Dynamic fields (real-time indexed)
  price: number;
  inventory: number;
  lastUpdated: Date;

  // Computed fields (batch recomputed)
  popularityScore: number;
  mlEmbedding: number[];
}

interface UpdateEvent {
  productId: string;
  field: 'price' | 'inventory';
  value: number;
  timestamp: Date;
}

class HybridIndexingPipeline {
  private esClient: ElasticsearchClient;
  private updateQueue: KafkaConsumer;

  /**
   * Real-time path: Process update events as they arrive.
   * Uses partial document updates to minimize overhead.
   */
  async processRealTimeUpdates(): Promise<void> {
    for await (const batch of this.updateQueue.consume()) {
      // Group updates by document ID
      const grouped = this.groupByProductId(batch);

      // Use bulk partial update API
      const bulkOps = Object.entries(grouped).flatMap(
        ([productId, updates]) => [
          { update: { _id: productId, _index: 'products' } },
          { doc: { ...this.mergeUpdates(updates), lastUpdated: new Date() } }
        ]
      );

      await this.esClient.bulk({ body: bulkOps });
    }
  }

  /**
   * Batch path: Full catalog rebuild on schedule.
   * Rebuilds entire index with fresh computed features.
   */
  async runBatchReindex(newVersion: string): Promise<void> {
    const newIndex = `products_${newVersion}`;

    // Create optimized index for bulk loading
    await this.createIndexWithBulkSettings(newIndex);

    // Stream all products from source of truth
    for await (const batch of this.streamFromDataWarehouse()) {
      // Compute ML features for entire batch
      const enriched = await this.enrichWithMLFeatures(batch);
      await this.bulkIndex(newIndex, enriched);
    }

    // Finalize and swap
    await this.finalizeIndex(newIndex);
    await this.atomicAliasSwap('products', newIndex);
  }

  /**
   * Query-time merge for real-time overrides.
   * Not needed if using Elasticsearch's doc values,
   * but useful for complex merge logic.
   */
  async searchWithOverrides(query: SearchQuery): Promise<SearchResults> {
    // Get base results from search index
    const results = await this.esClient.search(query);

    // Fetch any real-time overrides from cache
    const overrides = await this.fetchRealtimeOverrides(
      results.hits.map(h => h._id)
    );

    // Merge overrides into results
    return this.applyOverrides(results, overrides);
  }

  private groupByProductId(
    updates: UpdateEvent[]
  ): Record<string, UpdateEvent[]> {
    // Group updates by product ID
    return updates.reduce((acc, update) => {
      (acc[update.productId] ??= []).push(update);
      return acc;
    }, {} as Record<string, UpdateEvent[]>);
  }

  private mergeUpdates(
    updates: UpdateEvent[]
  ): Partial<ProductDocument> {
    // Take latest value for each field
    return updates.reduce((acc, update) => {
      acc[update.field] = update.value;
      return acc;
    }, {} as Partial<ProductDocument>);
  }
}
```

Given the trade-offs we've explored, how should you choose an indexing strategy for your system? Use this framework:
Start by quantifying freshness. Ask stakeholders: "If a document is updated, how quickly must it appear in search?" The answer, mapped against your workload, usually points directly at a strategy:
| Use Case | Recommended Strategy | Key Considerations |
|---|---|---|
| Chat/Messaging Search | Real-time | Sub-second freshness critical; accept higher resource cost |
| E-commerce Catalog | Hybrid | Price/inventory real-time; descriptions/features batch |
| Document Search (Enterprise) | Batch + Real-time deltas | Nightly full rebuild; real-time for new documents |
| Log Analytics | Real-time with retention | Accept eventual consistency; optimize for write throughput |
| Data Warehouse Search | Batch only | Aligned with ETL schedules; consistency matters |
| Social Media Feed | Real-time with fan-out | Post visibility critical; handle viral content spikes |
Many systems begin with pure real-time indexing because it's simpler to implement. As scale increases and operational pain accumulates, teams add batch processing for specific workloads. This evolution is natural—don't over-engineer initially, but design with eventual hybrid capability in mind.
We've explored the foundational decision in search indexing architecture. Let's consolidate the key insights:

- Real-time indexing buys freshness at the cost of write amplification, segment churn, and contention with query traffic
- Batch indexing buys throughput, resource efficiency, and point-in-time consistency at the cost of staleness between runs
- Bulk operations amortize fixed costs, yielding order-of-magnitude throughput gains over one-by-one indexing
- Most systems at scale are hybrid: real-time updates for critical fields, batch rebuilds for everything else
- Choose based on your domain's actual freshness requirements, not assumed ones
What's next:
Now that we understand the high-level choice between real-time and batch indexing, we'll examine how to handle the ongoing challenge of index updates and delta indexing—keeping search indexes synchronized with source data without full rebuilds.
You now understand the fundamental trade-offs between real-time and batch indexing strategies. This mental model will inform every subsequent decision about search infrastructure design. Next, we'll explore how to efficiently update existing indexes without full rebuilds.