Search indexes are not databases of record—they're derived data stores that must stay synchronized with authoritative sources. This synchronization is straightforward when you can simply reindex everything, but that approach breaks down at scale.
Imagine you're running the search infrastructure for a major e-commerce platform with 500 million products. A full reindex takes 6 hours and consumes significant cluster resources. Now consider that 10 million products change every day—prices update, descriptions change, inventory fluctuates. You can't reindex everything daily, but you can't ignore changes either.
Delta indexing solves this problem by identifying and processing only the documents that have changed since the last index update. Done well, delta indexing reduces a 6-hour full reindex to a 15-minute incremental job. Done poorly, it becomes a source of subtle bugs, inconsistent search results, and operational nightmares.
Mastering delta indexing is what separates teams that struggle with search from teams that make it look effortless.
By the end of this page, you will understand how to track document changes, implement efficient delta detection, handle updates, deletes, and partial modifications, and design robust synchronization pipelines. You'll learn the patterns used by companies like Shopify, Airbnb, and Netflix to keep billions of documents synchronized.
Before designing delta indexing strategies, we must understand what happens when we update a search index. The mechanics differ significantly from traditional database updates.
Most search engines, including Lucene-based systems (Elasticsearch, Solr, OpenSearch), use immutable segments. When you "update" a document, the engine does not modify it in place: the existing version is marked as deleted, a complete new version is written to a fresh segment, and the deleted version is only physically removed later during segment merges.
This design has profound implications: every update is effectively a delete plus a reinsert, partial updates are not cheaper internally than full replacements, and a high update rate inflates deleted-document counts and merge activity. The examples below show the update APIs you'll work with in practice.
```
// 1. Full document update (replaces entire document)
PUT /products/_doc/12345
{
  "product_id": "12345",
  "name": "Wireless Headphones",
  "price": 79.99,
  "inventory": 150,
  "category": "electronics",
  "last_updated": "2024-01-15T10:30:00Z"
}

// 2. Partial update (preferred for single field changes)
// Still rewrites document internally, but API is simpler
POST /products/_update/12345
{
  "doc": {
    "price": 69.99,
    "last_updated": "2024-01-15T11:00:00Z"
  }
}

// 3. Scripted update (for computed changes)
POST /products/_update/12345
{
  "script": {
    "source": "ctx._source.inventory -= params.sold",
    "params": { "sold": 5 }
  }
}

// 4. Upsert (update or insert if not exists)
POST /products/_update/12345
{
  "doc": {
    "price": 69.99,
    "last_updated": "2024-01-15T11:00:00Z"
  },
  "upsert": {
    "product_id": "12345",
    "name": "New Product",
    "price": 69.99,
    "inventory": 100,
    "last_updated": "2024-01-15T11:00:00Z"
  }
}

// 5. Bulk update (efficient for multiple documents)
POST /_bulk
{"update": {"_index": "products", "_id": "12345"}}
{"doc": {"price": 69.99}}
{"update": {"_index": "products", "_id": "12346"}}
{"doc": {"price": 89.99}}
{"update": {"_index": "products", "_id": "12347"}}
{"doc": {"price": 49.99}}
```

While APIs like _update appear to modify single fields, internally the search engine must fetch the current document, merge changes, and write the complete new version. This means partial updates to large documents are nearly as expensive as full replacements. Design your documents with update patterns in mind.
The foundation of delta indexing is change detection—identifying which documents have been created, modified, or deleted since the last indexing run. There are several approaches, each with distinct trade-offs.
The most common approach: every document has a last_modified timestamp, and delta queries select documents modified after the last checkpoint.
Advantages:

- Simple to implement against any database that has a reliable last_modified column
- No extra infrastructure: just a query and a stored checkpoint
- Easy to reason about and to backfill (rewind the checkpoint)

Disadvantages:

- Cannot detect hard deletes; the row is gone, so there is no timestamp left to query
- Depends on every write path updating the timestamp correctly
- Clock skew and long-running transactions can cause missed or reordered changes, so ordering is only approximate
- Polling adds load and latency proportional to the polling interval

A minimal sketch of this approach follows.
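To make the trade-offs concrete, here is a minimal sketch of timestamp-based delta detection, assuming a hypothetical `products` table with an `updated_at` column, a one-row `index_checkpoints` table holding the watermark, and psycopg2 for database access. The overlap window is one common way to tolerate clock skew and late-committing transactions, at the cost of re-reading a few rows.

```python
# Minimal sketch of timestamp-based delta detection (hypothetical schema).
from datetime import datetime, timedelta

import psycopg2  # assumed driver; any DB-API connection works the same way

OVERLAP = timedelta(seconds=30)  # re-read a small window to tolerate clock skew / late commits


def fetch_changed_products(conn, last_checkpoint: datetime) -> tuple[list[dict], datetime]:
    """Return rows modified since the checkpoint, plus the new watermark."""
    watermark = datetime.utcnow()
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, name, price, category_id, updated_at
            FROM products
            WHERE updated_at > %s AND updated_at <= %s
            ORDER BY updated_at
            """,
            (last_checkpoint - OVERLAP, watermark),
        )
        columns = [col[0] for col in cur.description]
        rows = [dict(zip(columns, row)) for row in cur.fetchall()]
    return rows, watermark


def save_checkpoint(conn, watermark: datetime) -> None:
    """Persist the watermark only after the batch was indexed successfully."""
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE index_checkpoints SET last_run = %s WHERE pipeline = 'products'",
            (watermark,),
        )
    conn.commit()
```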
Capture changes at the database level using transaction logs, triggers, or database features like PostgreSQL's logical replication.
Advantages:

- Captures every committed change, including deletes
- Exact commit ordering and low latency
- No application code changes and no reliance on timestamp discipline

Disadvantages:

- Requires additional infrastructure (e.g., Kafka Connect and Debezium) and operational expertise
- Replication slots and connector state must be monitored and managed
- Schema changes in the source database need an evolution strategy
Every change is published as an event to a message queue, consumed by the indexing pipeline.
Advantages:

- Changes carry application-level meaning, not just row diffs
- Decouples producers from the indexing pipeline and supports fan-out to other consumers
- Deletes are explicit events

Disadvantages:

- Requires application changes to publish an event on every write path
- Dual writes (database plus queue) can silently drop events unless they are published transactionally, for example via an outbox table
- Ordering is only guaranteed per partition

A minimal consumer sketch follows.
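As referenced above, here is a minimal consumer sketch, assuming kafka-python and a hypothetical `product-changes` topic carrying JSON change events. Committing offsets only after indexing succeeds makes the offset itself the checkpoint; this gives at-least-once delivery, which is why indexing must be idempotent (discussed later on this page).

```python
# Minimal sketch of consuming change events from a queue (hypothetical topic and payload).
import json

from kafka import KafkaConsumer  # kafka-python

consumer = KafkaConsumer(
    "product-changes",
    bootstrap_servers=["kafka-1:9092"],
    group_id="search-indexer",
    enable_auto_commit=False,  # commit only after indexing succeeds
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)


def index_event(event: dict) -> None:
    """Placeholder for enrich + transform + bulk index."""
    ...


for message in consumer:
    event = message.value  # e.g. {"entity_id": "...", "operation": "UPDATE", ...}
    index_event(event)
    consumer.commit()  # the committed offset acts as the pipeline's checkpoint
```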
| Strategy | Detects Creates | Detects Updates | Detects Deletes | Ordering | Complexity |
|---|---|---|---|---|---|
| Timestamp Query | ✓ | ✓ | ✗ | Approximate | Low |
| Full Diff (Hash) | ✓ | ✓ | ✓ | N/A | High |
| CDC (Debezium) | ✓ | ✓ | ✓ | Exact | Medium |
| Event Sourcing | ✓ | ✓ | ✓ | Per-partition | Medium |
| Soft Deletes + Timestamps | ✓ | ✓ | ✓ | Approximate | Low-Medium |
Deletes deserve special attention because they're inherently difficult to detect. If a record is deleted from the source database, it's gone—there's nothing to query for "modified since" timestamps.
Common solutions:
Soft Deletes: Never delete records; mark them with a deleted_at timestamp. Delta queries include these, and the indexer removes them from the search index.
Tombstone Tables: On delete, write the deleted ID to a separate "tombstones" table with a timestamp. Delta queries check both the main table and tombstones.
Full ID Reconciliation: Periodically, compare all IDs in the search index against all IDs in the source database. Missing IDs in source = deleted.
CDC: Database-level change capture naturally includes delete events.
Most production systems use a combination: CDC or soft deletes for operational deletes, with periodic full reconciliation as a safety net.
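As a concrete illustration, here is a minimal sketch of delete detection that combines soft deletes and a tombstone table, assuming hypothetical `products.deleted_at` and `product_tombstones(id, deleted_at)` schemas and the same psycopg2-style connection as the earlier sketch.

```python
# Minimal sketch: collect IDs to remove from the index since the last checkpoint.
from datetime import datetime


def fetch_deletions(conn, last_checkpoint: datetime) -> list[str]:
    """Return IDs that must be removed from the search index."""
    deleted_ids: set[str] = set()
    with conn.cursor() as cur:
        # Soft deletes: rows still present in the main table, but flagged
        cur.execute(
            "SELECT id FROM products WHERE deleted_at IS NOT NULL AND deleted_at > %s",
            (last_checkpoint,),
        )
        deleted_ids.update(row[0] for row in cur.fetchall())

        # Tombstones: rows that were physically removed from the main table
        cur.execute(
            "SELECT id FROM product_tombstones WHERE deleted_at > %s",
            (last_checkpoint,),
        )
        deleted_ids.update(row[0] for row in cur.fetchall())
    return sorted(deleted_ids)
```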
A robust delta indexing pipeline consists of several coordinated components. Let's examine a production-grade architecture used by major platforms.
1. Change Source: The authoritative database or event stream where changes originate. This could be a relational database with reliable timestamps, a CDC stream (e.g., Debezium topics), or an application event bus.
2. Change Collector: A service that polls or subscribes to the change source, maintaining a checkpoint of the last processed change. This component must be fault-tolerant, idempotent on replay, and able to resume exactly from its checkpoint after a crash.
3. Document Enricher: Changes often contain only IDs or partial data. The enricher fetches complete documents from source systems, resolving foreign keys, joining related data, and computing derived fields.
4. Document Transformer: Converts source documents into the search index schema, applying normalization, tokenization, and field mappings.
5. Bulk Indexer: Batches transformed documents and writes them to the search cluster using bulk APIs.
6. Checkpoint Manager: Persists the latest successfully indexed change marker, enabling resumption after failures.
"""Production Delta Indexing Pipeline This implementation demonstrates patterns used at companiesprocessing millions of document updates per hour.""" from dataclasses import dataclassfrom datetime import datetime, timedeltafrom typing import Generator, Optionalimport json @dataclassclass ChangeEvent: """Represents a single change in the source system.""" event_id: str # Unique identifier for idempotency entity_type: str # e.g., 'product', 'user', 'order' entity_id: str # Primary key of changed entity operation: str # 'INSERT', 'UPDATE', 'DELETE' timestamp: datetime # When the change occurred payload: Optional[dict] # Changed data (may be partial) @dataclassclass IndexCheckpoint: """Tracks progress through the change stream.""" last_event_id: str last_timestamp: datetime events_processed: int last_updated: datetime class DeltaIndexingPipeline: """ Coordinates delta indexing from change detection through bulk indexing with fault tolerance and checkpointing. """ def __init__( self, change_source: 'ChangeSource', document_enricher: 'DocumentEnricher', search_client: 'SearchClient', checkpoint_store: 'CheckpointStore', batch_size: int = 1000, max_batch_wait_seconds: int = 5 ): self.change_source = change_source self.enricher = document_enricher self.search = search_client self.checkpoints = checkpoint_store self.batch_size = batch_size self.max_wait = max_batch_wait_seconds async def run(self) -> None: """ Main loop: consume changes, enrich, transform, index. This loop is designed to run continuously in production, handling backpressure and failures gracefully. """ while True: checkpoint = await self.checkpoints.load() try: async for batch in self._consume_batches(checkpoint): # Separate deletes from upserts deletes = [c for c in batch if c.operation == 'DELETE'] upserts = [c for c in batch if c.operation != 'DELETE'] # Process upserts: enrich, transform, index if upserts: documents = await self._process_upserts(upserts) await self._bulk_index(documents) # Process deletes: remove from index if deletes: await self._bulk_delete( [c.entity_id for c in deletes] ) # Checkpoint after successful batch new_checkpoint = IndexCheckpoint( last_event_id=batch[-1].event_id, last_timestamp=batch[-1].timestamp, events_processed=( checkpoint.events_processed + len(batch) ), last_updated=datetime.utcnow() ) await self.checkpoints.save(new_checkpoint) checkpoint = new_checkpoint except Exception as e: # Log error, wait, retry from checkpoint await self._handle_failure(e, checkpoint) async def _consume_batches( self, checkpoint: IndexCheckpoint ) -> Generator[list[ChangeEvent], None, None]: """ Consume changes in batches, balancing latency and throughput. Uses time-based batching: emit batch when either size limit or time limit is reached, whichever comes first. """ batch = [] batch_start = datetime.utcnow() async for change in self.change_source.stream(checkpoint): batch.append(change) batch_age = (datetime.utcnow() - batch_start).seconds if len(batch) >= self.batch_size or batch_age >= self.max_wait: yield batch batch = [] batch_start = datetime.utcnow() # Emit remaining items if batch: yield batch async def _process_upserts( self, changes: list[ChangeEvent] ) -> list[dict]: """ Enrich and transform changes into index-ready documents. Handles the common case where change events contain only IDs, requiring enrichment from the source database. 
""" # Deduplicate: if same entity changed multiple times in batch, # only process the latest latest_changes = {} for change in changes: key = f"{change.entity_type}:{change.entity_id}" existing = latest_changes.get(key) if not existing or change.timestamp > existing.timestamp: latest_changes[key] = change # Enrich: fetch full documents from source enriched = await self.enricher.enrich_batch( list(latest_changes.values()) ) # Transform: convert to search schema documents = [] for entity_id, data in enriched.items(): doc = self._transform_to_search_schema(data) documents.append(doc) return documents async def _bulk_index(self, documents: list[dict]) -> None: """ Write documents to search index using bulk API. Implements retry with exponential backoff for transient failures. """ for attempt in range(3): try: result = await self.search.bulk_index(documents) if result.errors: # Handle partial failures failed_ids = [ item['_id'] for item in result.items if 'error' in item ] raise PartialIndexingError(failed_ids) return except TransientError as e: wait = 2 ** attempt await asyncio.sleep(wait) raise IndexingError("Max retries exceeded") async def _bulk_delete(self, entity_ids: list[str]) -> None: """Remove deleted entities from search index.""" await self.search.bulk_delete(entity_ids) def _transform_to_search_schema(self, data: dict) -> dict: """ Map source data to search index schema. This is where you apply field mappings, compute derived fields, and normalize data for search. """ return { '_id': data['id'], 'name': data['name'], 'description': data.get('description', ''), 'price': float(data['price']), 'category': data['category']['name'], 'brand': data.get('brand', {}).get('name'), 'in_stock': data.get('inventory', 0) > 0, 'popularity_score': self._compute_popularity(data), 'indexed_at': datetime.utcnow().isoformat() }Delta indexing pipelines must handle numerous edge cases that can cause subtle data inconsistencies if not addressed properly.
In distributed systems, updates can arrive out of order. If a product's price changes from $100 → $80 → $90, but updates arrive as $80, $90, $100, the index will show $100 (the last processed) instead of $90 (the current value).
Solutions:

- External versioning: use the source row's version or a monotonic timestamp as the document version, so stale writes are rejected
- Optimistic concurrency control (if_seq_no and if_primary_term in Elasticsearch)
- Partition the change stream by entity ID so all updates to one entity are processed in order by a single worker

A minimal sketch of external versioning follows.
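Here is a minimal sketch of the external-versioning approach, assuming documents carry a hypothetical `source_version` field (a row version or epoch-millisecond timestamp from the source system). With external versioning, a write whose version is not greater than the one already stored is rejected, so a stale update arriving late simply produces a version conflict the pipeline can ignore. Bulk metadata key names vary slightly across Elasticsearch versions, so treat this as illustrative.

```python
# Minimal sketch: build an NDJSON _bulk body where each action carries an external version.
import json


def build_versioned_bulk_body(documents: list[dict]) -> str:
    """Serialize bulk actions that let the search engine reject out-of-order writes."""
    lines = []
    for doc in documents:
        action = {
            "index": {
                "_index": "products",
                "_id": doc["id"],
                "version": doc["source_version"],  # e.g. row version or epoch millis
                "version_type": "external",
            }
        }
        lines.append(json.dumps(action))
        lines.append(json.dumps({k: v for k, v in doc.items() if k != "source_version"}))
    return "\n".join(lines) + "\n"
```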
When fetching full documents during enrichment, related data might be missing (e.g., a product references a deleted category).
Solutions:

- Treat missing relations defensively: index with a placeholder or null value rather than failing the whole document
- Re-queue the document for later enrichment if the relation is expected to appear (eventual consistency between systems)
- Emit a data-quality metric so persistent dangling references get investigated
When indexing 1,000 documents, 995 might succeed and 5 might fail due to mapping conflicts or validation errors.
Solutions:

- Inspect the bulk response item by item rather than treating the batch as all-or-nothing
- Retry transient failures (e.g., backpressure rejections) with backoff; send permanent failures (mapping conflicts, validation errors) to a dead-letter queue
- Alert on dead-letter queue growth so root causes get fixed

A minimal sketch of this handling follows.
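A minimal sketch of that handling, assuming the standard Elasticsearch bulk response shape (`{"errors": ..., "items": [...]}`) and a hypothetical dead-letter destination for documents that will never succeed:

```python
# Minimal sketch: split a bulk response into retryable failures and dead letters.
RETRYABLE_STATUSES = {429, 503}  # backpressure / transient unavailability


def split_bulk_failures(bulk_response: dict) -> tuple[list[str], list[dict]]:
    """Return (ids to retry, failed items to dead-letter)."""
    retry_ids, dead_letters = [], []
    if not bulk_response.get("errors"):
        return retry_ids, dead_letters

    for item in bulk_response["items"]:
        # Each item is keyed by the operation type: "index", "update", or "delete"
        op, result = next(iter(item.items()))
        if "error" not in result:
            continue
        if result.get("status") in RETRYABLE_STATUSES:
            retry_ids.append(result["_id"])
        else:
            # Mapping conflicts, validation errors, etc.: retrying will not help
            dead_letters.append({"_id": result["_id"], "op": op, "error": result["error"]})
    return retry_ids, dead_letters
```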
Every operation in your delta indexing pipeline must be idempotent—safe to execute multiple times with the same result. When failures occur (and they will), you'll restart from the last checkpoint and reprocess some events. Non-idempotent operations cause data corruption or duplicates.
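A small illustration of what idempotency means for the indexing step: deriving the document `_id` from the source primary key means a replayed batch overwrites documents rather than duplicating them. Names here are illustrative.

```python
# Minimal sketch: idempotent bulk action keyed by the source primary key.
def to_idempotent_action(entity: dict) -> tuple[dict, dict]:
    """Build a bulk action whose identity is derived from the source record."""
    action = {"index": {"_index": "products", "_id": entity["id"]}}  # never auto-generate IDs
    document = {
        "name": entity["name"],
        "price": entity["price"],
        # Prefer source timestamps over wall-clock time for fields used in comparisons,
        # so replaying the same change produces the same document
        "last_updated": entity["updated_at"],
    }
    return action, document
```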
Delta indexing performance directly impacts how fresh your search results are. Slow delta processing means stale data. Here are proven optimization techniques:
The optimal batch size balances latency and throughput:

- Larger batches amortize per-request overhead and raise throughput, but increase end-to-end latency and memory pressure
- Very large bulk requests can overwhelm the search cluster; a few thousand documents (or a few megabytes) per bulk request is a common starting point
- Use time-based flushing (as in the pipeline above) so small trickles of changes still get indexed promptly

Delta pipelines often have embarrassingly parallel workloads:

- Run multiple enrichment workers and multiple bulk-indexing streams concurrently
- Partition work by entity ID to preserve per-entity ordering while parallelizing across entities
- Pipeline the stages so enrichment of one batch overlaps with indexing of another

Enrichment (fetching full documents) is often the bottleneck:

- Fetch entities in batches with a single query instead of one query per document (avoid N+1 queries)
- Cache slowly changing relations such as categories and brands
- Select only the columns the index actually needs

Minimize data transfer and connection overhead:

- Compress bulk request bodies and reuse connections
- Disable per-request refresh on bulk writes and let the index refresh on its normal interval
- Co-locate indexers with the search cluster and source database where possible

The implementation below applies these techniques.
```typescript
/**
 * Performance-optimized delta indexing with parallel processing.
 *
 * Achieves 50,000+ documents/second on typical hardware.
 */

interface PerformanceConfig {
  // Parallelism settings
  enrichmentConcurrency: number;   // Parallel enrichment workers
  indexingConcurrency: number;     // Parallel bulk request streams

  // Batch settings
  enrichmentBatchSize: number;     // IDs per enrichment query
  indexingBatchSize: number;       // Documents per bulk request

  // Caching settings
  relationCacheTTLSeconds: number; // TTL for category/brand cache
  relationCacheMaxSize: number;    // Max cached relations
}

const PRODUCTION_CONFIG: PerformanceConfig = {
  enrichmentConcurrency: 8,
  indexingConcurrency: 4,
  enrichmentBatchSize: 500,
  indexingBatchSize: 2000,
  relationCacheTTLSeconds: 300,
  relationCacheMaxSize: 100_000,
};

class OptimizedDeltaPipeline {
  private enrichmentSemaphore: Semaphore;
  private indexingSemaphore: Semaphore;
  private relationCache: LRUCache<string, Relation>;

  constructor(private config: PerformanceConfig) {
    this.enrichmentSemaphore = new Semaphore(config.enrichmentConcurrency);
    this.indexingSemaphore = new Semaphore(config.indexingConcurrency);
    this.relationCache = new LRUCache({
      maxSize: config.relationCacheMaxSize,
      ttl: config.relationCacheTTLSeconds * 1000,
    });
  }

  /**
   * Process changes with optimal parallelism.
   *
   * Uses a pipeline pattern: enrichment and indexing
   * happen concurrently on different batches.
   */
  async processBatch(changes: ChangeEvent[]): Promise<void> {
    // Split into enrichment batches
    const enrichmentBatches = chunk(
      changes,
      this.config.enrichmentBatchSize
    );

    // Process enrichment batches in parallel
    const enrichedDocs = await Promise.all(
      enrichmentBatches.map(batch =>
        this.enrichmentSemaphore.runExclusive(() =>
          this.enrichBatch(batch)
        )
      )
    );

    // Flatten and split into indexing batches
    const allDocs = enrichedDocs.flat();
    const indexingBatches = chunk(allDocs, this.config.indexingBatchSize);

    // Index batches in parallel
    await Promise.all(
      indexingBatches.map(batch =>
        this.indexingSemaphore.runExclusive(() =>
          this.indexBatch(batch)
        )
      )
    );
  }

  /**
   * Batch enrichment with caching for relations.
   */
  private async enrichBatch(changes: ChangeEvent[]): Promise<Document[]> {
    const entityIds = changes.map(c => c.entity_id);

    // Single query for all entities
    const entities = await this.db.query(`
      SELECT * FROM products WHERE id = ANY($1::uuid[])
    `, [entityIds]);

    // Collect unique relation IDs
    const categoryIds = new Set(entities.map(e => e.category_id));
    const brandIds = new Set(entities.map(e => e.brand_id).filter(Boolean));

    // Fetch uncached relations
    const uncachedCategoryIds = [...categoryIds].filter(
      id => !this.relationCache.has(`category:${id}`)
    );
    const uncachedBrandIds = [...brandIds].filter(
      id => !this.relationCache.has(`brand:${id}`)
    );

    if (uncachedCategoryIds.length > 0) {
      const categories = await this.db.query(
        'SELECT * FROM categories WHERE id = ANY($1::uuid[])',
        [uncachedCategoryIds]
      );
      categories.forEach(c =>
        this.relationCache.set(`category:${c.id}`, c)
      );
    }

    if (uncachedBrandIds.length > 0) {
      const brands = await this.db.query(
        'SELECT * FROM brands WHERE id = ANY($1::uuid[])',
        [uncachedBrandIds]
      );
      brands.forEach(b =>
        this.relationCache.set(`brand:${b.id}`, b)
      );
    }

    // Build enriched documents
    return entities.map(entity => ({
      ...entity,
      category: this.relationCache.get(`category:${entity.category_id}`),
      brand: entity.brand_id
        ? this.relationCache.get(`brand:${entity.brand_id}`)
        : null,
    }));
  }

  /**
   * Bulk index with compression and connection reuse.
   */
  private async indexBatch(documents: Document[]): Promise<void> {
    const body = documents.flatMap(doc => [
      { index: { _index: 'products', _id: doc.id } },
      this.transformDocument(doc)
    ]);

    await this.esClient.bulk({
      body,
      refresh: false,  // Don't refresh after each bulk
      timeout: '30s',
      // Enable request compression
      headers: { 'Content-Encoding': 'gzip' }
    });
  }
}
```

A delta indexing pipeline is only as good as your ability to understand its behavior. Comprehensive observability is essential for maintaining data freshness and diagnosing issues.
Lag Metrics: time since the newest committed change was indexed, consumer lag in events behind the head of the change stream, and checkpoint age.

Throughput Metrics: changes consumed per second, documents indexed per second, and average batch size.

Error Metrics: enrichment failures, bulk indexing failure rate, retry counts, and dead letter queue size.

Resource Metrics: indexer CPU and heap usage, source database query latency, and search cluster indexing pressure (bulk queue depth and rejections).
| Alert | Condition | Severity | Response |
|---|---|---|---|
| Lag Exceeds Threshold | Consumer lag > 1 hour | High | Scale indexers, check for bottlenecks |
| Indexing Failure Rate | Failures > 1% of documents | High | Check mapping conflicts, document validation |
| Enrichment Timeout | DB queries > 30s | Medium | Check database performance, add indexes |
| Checkpoint Staleness | No checkpoint update in 10min | Critical | Check pipeline health, possible crash |
| Memory Pressure | Heap > 90% | Medium | Reduce batch sizes, check for memory leaks |
| Dead Letter Queue Growth | DLQ size increasing | Medium | Investigate failed documents, fix root cause |
Define a freshness Service Level Objective (SLO), such as '99% of changes indexed within 5 minutes.' Monitor this continuously, alert when violated, and use it in capacity planning. This transforms abstract 'delta indexing performance' into a concrete, measurable target.
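As one way to make the SLO measurable, here is a minimal sketch that exposes indexing lag as a metric, assuming prometheus_client and the checkpoint timestamps used earlier on this page.

```python
# Minimal sketch: publish indexing lag so it can be charted and alerted on.
from datetime import datetime

from prometheus_client import Gauge, start_http_server

indexing_lag_seconds = Gauge(
    "search_delta_indexing_lag_seconds",
    "Seconds between now and the timestamp of the last indexed change",
)


def report_lag(last_indexed_change: datetime) -> None:
    """Call after every checkpoint save; alert when this exceeds the freshness SLO."""
    lag = (datetime.utcnow() - last_indexed_change).total_seconds()
    indexing_lag_seconds.set(max(lag, 0.0))


if __name__ == "__main__":
    start_http_server(9108)  # expose /metrics for Prometheus to scrape
```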
Change Data Capture (CDC) represents the most robust approach to delta detection. By tapping into database transaction logs, CDC captures every change reliably, including deletes, with guaranteed ordering.
Traditional timestamp-based polling has fundamental limitations:

- Hard deletes are invisible
- Ordering is approximate, and clock skew or long-running transactions can drop changes
- Freshness is bounded by the polling interval, while frequent polling adds database load
- It depends on every writer maintaining the timestamp column correctly

CDC solves all of these by streaming changes as they're committed, providing:

- Every committed change, including deletes
- Exact commit ordering
- Near-real-time latency without polling load
- No reliance on application-level timestamp discipline
Debezium: Open-source platform supporting PostgreSQL, MySQL, MongoDB, SQL Server. Runs on Kafka Connect.
Maxwell: Lightweight MySQL CDC to Kafka. Simple but limited to MySQL.
AWS DMS: Managed CDC service for AWS databases. Supports various targets.
Google Datastream: CDC for Cloud SQL, AlloyDB to BigQuery, GCS, or Pub/Sub.
```
// Debezium PostgreSQL connector configuration
// Captures changes from products, categories, and brands tables

{
  "name": "search-indexing-connector",
  "config": {
    // Connector class for PostgreSQL
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",

    // Database connection
    "database.hostname": "primary.db.example.com",
    "database.port": "5432",
    "database.user": "replication_user",
    "database.password": "${secrets.db_password}",
    "database.dbname": "ecommerce",

    // Logical replication slot name
    "slot.name": "search_indexing_slot",
    "plugin.name": "pgoutput",

    // What to capture
    "table.include.list": "public.products,public.categories,public.brands",

    // Kafka topic routing
    "topic.prefix": "cdc",
    // Produces topics: cdc.public.products, cdc.public.categories, etc.

    // Schema handling
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter.schemas.enable": "false",
    "value.converter.schemas.enable": "true",

    // Snapshot mode: what to do on first start
    "snapshot.mode": "initial",  // Take snapshot then stream changes

    // Heartbeat for health monitoring
    "heartbeat.interval.ms": "10000",

    // Transforms for convenience
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.add.fields": "op,ts_ms",
    "transforms.unwrap.delete.handling.mode": "rewrite"
  }
}

// Sample Debezium change event (after ExtractNewRecordState transform)
{
  "id": "12345",
  "name": "Wireless Headphones",
  "price": 79.99,
  "category_id": "cat-001",
  "brand_id": "brand-abc",
  "updated_at": "2024-01-15T10:30:00.000000Z",
  "__op": "u",  // 'c' = create, 'u' = update, 'd' = delete
  "__ts_ms": 1705315800000,
  "__deleted": "false"
}
```

CDC introduces new failure modes: replication slot growth during connector outages, schema changes breaking parsing, and slot invalidation from long-running transactions. Plan for these scenarios with monitoring, slot size limits, and schema evolution strategies.
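Tying this back to the pipeline: here is a minimal sketch of routing flattened events like the sample above into upserts and deletes, assuming the `ExtractNewRecordState` transform configured with `add.fields=op,ts_ms` and `delete.handling.mode=rewrite` as shown.

```python
# Minimal sketch: route a flattened Debezium event to the indexing pipeline.
def route_cdc_event(event: dict) -> tuple[str, str, dict]:
    """Return (action, entity_id, payload) for the indexing pipeline."""
    if event.get("__deleted") == "true" or event.get("__op") == "d":
        return "delete", event["id"], {}
    # Creates and updates are handled identically: index the current row state
    payload = {k: v for k, v in event.items() if not k.startswith("__")}
    return "upsert", event["id"], payload
```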
Delta indexing is the bridge between your authoritative data stores and your search indexes. Mastering it ensures fresh, consistent search results without the cost of full rebuilds. Let's consolidate our learning:

- Search indexes are derived data; keeping them synchronized means reliably detecting creates, updates, and deletes
- Timestamp polling is simple but misses deletes; CDC and event streams capture everything with better ordering at higher operational cost
- A production pipeline separates change collection, enrichment, transformation, bulk indexing, and checkpointing, and makes every step idempotent
- Edge cases such as out-of-order updates, missing relations, and partial bulk failures need explicit handling, and a freshness SLO backed by lag monitoring tells you whether the pipeline is doing its job
What's next:
Delta indexing handles incremental updates, but sometimes you need to completely rebuild an index—due to schema changes, data structure evolution, or accumulated inconsistencies. Next, we'll explore reindexing strategies for performing full rebuilds without disrupting production search.
You now understand how to keep search indexes synchronized with source data through delta indexing. This capability is essential for maintaining fresh search results at scale. Next, we'll tackle the challenge of full reindexing when incremental updates aren't enough.