Imagine deploying a major search index change—new analyzers, restructured documents, completely rebuilt from source—with the confidence of a routine code deploy. That's the promise of blue-green indexing.
Borrowed from application deployment practices, blue-green indexing maintains two complete, independent index environments. At any moment, one environment (say, blue) serves all production traffic while the other (green) is rebuilt and validated with the new configuration.
When green is ready, you switch traffic atomically. If problems emerge, you switch back. The failed deployment becomes a learning opportunity, not an outage.
This pattern synthesizes everything we've explored: real-time and batch indexing, delta updates, reindexing strategies, and index aliases. It's the operational model that enables teams to iterate on search quality without fear.
Companies like Spotify, Airbnb, and Stripe use blue-green indexing to deploy search changes multiple times per day. For them, search index updates are as routine as code deployments—because they've built the infrastructure to make them so.
By the end of this page, you will understand the complete blue-green indexing workflow, how to maintain synchronization between environments, advanced patterns like canary deployments and shadow traffic, and how to build organizational confidence in frequent search deployments.
The blue-green pattern is conceptually simple but requires careful implementation across several dimensions.
Two Independent Environments: Blue and green are completely separate index sets. They may have different mappings, different analyzers, or contain different data at any moment.
One Serves Traffic: At any time, exactly one environment serves production search traffic. This is the "live" environment.
One Prepares for Deployment: The other environment is used for building and validating changes. This is the "staging" environment.
Atomic Switch: When staging is ready, traffic switches from live to staging atomically. The former live becomes the new staging.
Role Reversal: After switching, blue becomes green and green becomes blue—the roles alternate with each deployment.
| State | Blue | Green | Traffic Routing |
|---|---|---|---|
| Initial | Live, current data | Idle or previous version | 100% → Blue |
| Building | Live, receiving updates | Reindexing from source | 100% → Blue |
| Syncing | Live, receiving updates | Applying delta catchup | 100% → Blue |
| Validating | Live, receiving updates | Complete, being tested | 100% → Blue |
| Switching | Being deprecated | Becoming live | 0% → Blue, 100% → Green |
| Observing | Standby for rollback | Live, receiving updates | 100% → Green |
| Committed | Ready for next build | Live, current data | 100% → Green |
Blue/green is just a naming convention. Some teams use production/staging, active/passive, or v1/v2. The key is having two environments that alternate roles. Use whatever naming resonates with your team.
Implementing blue-green indexing requires coordinating several components. Here's a production-grade architecture.
Index Naming Convention: Use consistent names that identify the environment:

- `products_blue_v{version}`
- `products_green_v{version}`

Alias Structure (see the sketch after this list):

- `products` → points to the live environment's index
- `products_blue` → points to the blue environment's current index
- `products_green` → points to the green environment's current index

State Tracking: A durable store (database, config service) tracks which environment is live, each environment's current index name and version, whether a deployment is in progress, and when the last deployment completed.

Dual-Write Coordinator: During synchronization, writes must go to both environments. This component manages enabling and disabling write duplication and routing each write to both indexes.

Verification Service: Automated testing against the staging environment before any switch, for example parity queries and minimum document counts.

Switch Controller: Orchestrates the atomic switch, updating the alias so traffic moves in a single operation.
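Before the full orchestrator below, here is a minimal sketch of how that alias layout might be established with the Elasticsearch Python client. The index names, the local connection URL, and the `body=` calling convention (older clients use `body=`, newer ones accept `actions=`) are illustrative assumptions, not part of the reference implementation.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # connection details are assumed

# Hypothetical current index for each environment
blue_index = "products_blue_v7_20240115093000"
green_index = "products_green_v8_20240116110000"

# Environment aliases track each environment's current index;
# the bare "products" alias points at whichever environment is live.
es.indices.update_aliases(body={
    "actions": [
        {"add": {"index": blue_index, "alias": "products_blue"}},
        {"add": {"index": green_index, "alias": "products_green"}},
        {"add": {"index": blue_index, "alias": "products"}},  # blue is live here
    ]
})
```

Queries always target `products`; a deployment only ever moves that one alias.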
```typescript
/**
 * Blue-Green Index Orchestrator
 *
 * Manages the complete lifecycle of blue-green search index deployments.
 * This is a production-grade implementation used at scale.
 */

type Environment = 'blue' | 'green';

interface DeploymentState {
  liveEnvironment: Environment;
  blueIndex: string;
  greenIndex: string;
  blueVersion: number;
  greenVersion: number;
  deploymentInProgress: boolean;
  lastDeployedAt: Date;
}

interface DeploymentConfig {
  baseName: string;             // e.g., "products"
  sourceDatabase: string;       // Data source for reindexing
  verificationQueries: Query[]; // Queries for parity testing
  minDocumentCount: number;     // Minimum expected documents
  maxSyncLagSeconds: number;    // Max acceptable sync delay
}

class BlueGreenOrchestrator {
  constructor(
    private esClient: ElasticsearchClient,
    private stateStore: StateStore,
    private changeSource: ChangeDataCapture,
    private config: DeploymentConfig,
  ) {}

  /**
   * Execute a complete blue-green deployment.
   *
   * This method orchestrates the entire process from building
   * the new index through to traffic switchover.
   */
  async deploy(
    newMappings: object,
    newSettings: object,
  ): Promise<DeploymentResult> {
    const state = await this.stateStore.getState(this.config.baseName);

    if (state.deploymentInProgress) {
      throw new Error('Deployment already in progress');
    }

    const stagingEnv = this.getStagingEnvironment(state);
    const newVersion = this.getNextVersion(state, stagingEnv);
    const newIndexName = this.generateIndexName(stagingEnv, newVersion);

    try {
      // Mark deployment in progress
      await this.stateStore.updateState({
        ...state,
        deploymentInProgress: true,
      });

      // Step 1: Create new index with target schema
      await this.createIndex(newIndexName, newMappings, newSettings);
      console.log(`Created index: ${newIndexName}`);

      // Step 2: Start change capture BEFORE bulk load
      const changeCursor = await this.changeSource.startCapture();
      console.log('Started change capture');

      // Step 3: Bulk load all data to new index
      const bulkResult = await this.bulkLoadFromSource(newIndexName);
      console.log(`Bulk loaded ${bulkResult.documentCount} documents`);

      // Step 4: Enable dual-write to keep new index synchronized
      await this.enableDualWrite(newIndexName);
      console.log('Enabled dual-write');

      // Step 5: Replay changes that occurred during bulk load
      const catchupResult = await this.replayChanges(
        newIndexName,
        changeCursor
      );
      console.log(`Replayed ${catchupResult.count} changes`);

      // Step 6: Wait for synchronization (compare against the live index)
      await this.waitForSync(this.getCurrentLiveIndex(state), newIndexName);
      console.log('Environments synchronized');

      // Step 7: Run verification suite
      const verification = await this.verify(newIndexName);
      if (!verification.passed) {
        throw new DeploymentVerificationError(verification.issues);
      }
      console.log('Verification passed');

      // Step 8: Atomic switch
      const previousLiveIndex = this.getCurrentLiveIndex(state);
      await this.atomicSwitch(previousLiveIndex, newIndexName);
      console.log(`Switched traffic: ${previousLiveIndex} → ${newIndexName}`);

      // Step 9: Update state
      const newState: DeploymentState = {
        liveEnvironment: stagingEnv,
        blueIndex: stagingEnv === 'blue' ? newIndexName : state.blueIndex,
        greenIndex: stagingEnv === 'green' ? newIndexName : state.greenIndex,
        blueVersion: stagingEnv === 'blue' ? newVersion : state.blueVersion,
        greenVersion: stagingEnv === 'green' ? newVersion : state.greenVersion,
        deploymentInProgress: false,
        lastDeployedAt: new Date(),
      };
      await this.stateStore.updateState(newState);

      // Step 10: Disable dual-write (optional, can keep for fast rollback)
      await this.disableDualWrite();

      return {
        success: true,
        previousIndex: previousLiveIndex,
        newIndex: newIndexName,
        documentsIndexed: bulkResult.documentCount,
        changesReplayed: catchupResult.count,
      };
    } catch (error) {
      // Mark deployment failed, state unchanged
      await this.stateStore.updateState({
        ...state,
        deploymentInProgress: false,
      });

      // Cleanup partial index if exists
      await this.cleanupFailedIndex(newIndexName);

      throw error;
    }
  }

  /**
   * Quickly rollback to the previous environment.
   *
   * If dual-write is still active, the previous index is current.
   * This is essentially an alias switch.
   */
  async rollback(): Promise<void> {
    const state = await this.stateStore.getState(this.config.baseName);
    const currentLive = state.liveEnvironment;
    const previousEnv = currentLive === 'blue' ? 'green' : 'blue';
    const previousIndex = currentLive === 'blue'
      ? state.greenIndex
      : state.blueIndex;
    const currentIndex = this.getCurrentLiveIndex(state);

    // Verify previous index exists and has data
    const exists = await this.esClient.indices.exists({
      index: previousIndex
    });
    if (!exists) {
      throw new Error(
        `Rollback target ${previousIndex} does not exist`
      );
    }

    // Atomic switch back
    await this.atomicSwitch(currentIndex, previousIndex);

    // Update state
    await this.stateStore.updateState({
      ...state,
      liveEnvironment: previousEnv,
    });

    console.log(`Rolled back: ${currentIndex} → ${previousIndex}`);
  }

  private getStagingEnvironment(state: DeploymentState): Environment {
    // Staging is whichever environment is NOT live
    return state.liveEnvironment === 'blue' ? 'green' : 'blue';
  }

  private getNextVersion(state: DeploymentState, env: Environment): number {
    const currentVersion = env === 'blue'
      ? state.blueVersion
      : state.greenVersion;
    return currentVersion + 1;
  }

  private generateIndexName(env: Environment, version: number): string {
    const timestamp = new Date().toISOString()
      .replace(/[-:T]/g, '')
      .slice(0, 14);
    return `${this.config.baseName}_${env}_v${version}_${timestamp}`;
  }

  private getCurrentLiveIndex(state: DeploymentState): string {
    return state.liveEnvironment === 'blue'
      ? state.blueIndex
      : state.greenIndex;
  }

  private async atomicSwitch(
    fromIndex: string,
    toIndex: string
  ): Promise<void> {
    const alias = this.config.baseName;

    await this.esClient.indices.updateAliases({
      body: {
        actions: [
          { remove: { index: fromIndex, alias } },
          { add: { index: toIndex, alias } },
        ],
      },
    });
  }

  private async waitForSync(
    liveIndex: string,
    stagingIndex: string
  ): Promise<void> {
    const maxWait = 300_000;    // 5 minutes
    const checkInterval = 1000; // 1 second
    let elapsed = 0;

    while (elapsed < maxWait) {
      const [liveCount, stagingCount] = await Promise.all([
        this.esClient.count({ index: liveIndex }),
        this.esClient.count({ index: stagingIndex }),
      ]);

      const lag = Math.abs(liveCount.count - stagingCount.count);
      if (lag === 0) {
        return; // Synchronized
      }

      await this.sleep(checkInterval);
      elapsed += checkInterval;
    }

    throw new Error('Synchronization timeout exceeded');
  }

  // Helpers such as createIndex, bulkLoadFromSource, enableDualWrite,
  // replayChanges, verify, disableDualWrite, cleanupFailedIndex, and sleep
  // are omitted here for brevity.
}
```

The critical challenge in blue-green indexing is keeping both environments synchronized. When you switch traffic, the staging environment must be exactly current with production.
While building the new index, you need writes to go to both environments:
Option 1: Application-Level Dual-Write. The application writes to both indexes explicitly (sketched below). Simple but tightly coupled.
Option 2: Message Queue Fan-Out. Writes go to a queue, and consumers update both indexes. Decoupled but adds latency.
Option 3: CDC-Based Replication. Change data capture records all writes and applies them to both indexes. Most robust but most complex.
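As a concrete illustration of Option 1, here is a minimal sketch of an application-level dual-writer, assuming an Elasticsearch 8.x-style Python client. The `DualWriter` class and its error-handling policy are hypothetical, not the pattern's only form.

```python
from elasticsearch import Elasticsearch


class DualWriter:
    """Sends every document change to both the live and staging indexes."""

    def __init__(self, es: Elasticsearch, live_index: str, staging_index: str):
        self.es = es
        self.live = live_index
        self.staging = staging_index

    def index_document(self, doc_id: str, doc: dict) -> None:
        # The live write is authoritative: let its failures propagate.
        self.es.index(index=self.live, id=doc_id, document=doc)
        try:
            self.es.index(index=self.staging, id=doc_id, document=doc)
        except Exception as exc:
            # Failed staging writes must be retried or replayed later,
            # otherwise the environments drift apart before the switch.
            print(f"staging write failed for {doc_id}: {exc}")

    def delete_document(self, doc_id: str) -> None:
        self.es.delete(index=self.live, id=doc_id)
        try:
            self.es.delete(index=self.staging, id=doc_id)
        except Exception as exc:
            print(f"staging delete failed for {doc_id}: {exc}")
```

This shows the "tightly coupled" trade-off directly: every producer of writes has to know a deployment is in progress, which is why larger systems tend to prefer queue fan-out or CDC.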
During bulk load (which takes hours), the production index receives continuous updates. When bulk load completes:
- The new index contains data as of T_start, when change capture began
- Production has continued to receive changes from T_start to T_end
- You must replay the changes from T_start to T_end to the new index

The key is that change capture must start BEFORE bulk load, capturing changes during the entire build process.
For systems with high update rates (thousands per second), catchup can be challenging:
"""Synchronization Manager for Blue-Green Indexing Handles the complex task of keeping staging synchronized withproduction during the build and transition phases.""" from datetime import datetime, timedeltafrom typing import AsyncIteratorimport asyncio class SynchronizationManager: """ Manages synchronization between live and staging indexes. The critical invariant: staging must never fall behind live by more than max_lag_seconds at switch time. """ def __init__( self, es_client: 'ElasticsearchClient', change_source: 'ChangeDataCapture', live_index: str, staging_index: str, max_lag_seconds: int = 30 ): self.es = es_client self.changes = change_source self.live = live_index self.staging = staging_index self.max_lag = max_lag_seconds self._dual_write_enabled = False self._replay_cursor = None async def start_capture_before_build(self) -> 'ChangeCursor': """ Start capturing changes BEFORE beginning bulk load. This is critical - we need all changes from this moment forward to replay after bulk load completes. """ self._replay_cursor = await self.changes.start_capture() return self._replay_cursor async def replay_and_synchronize(self) -> SyncResult: """ Replay captured changes and achieve synchronization. Called after bulk load completes. Will: 1. Replay all changes captured during bulk load 2. Enable dual-write for new changes 3. Continue replay until caught up 4. Return when synchronized """ if not self._replay_cursor: raise ValueError( "Must call start_capture_before_build() first" ) # Phase 1: Replay historical changes (during bulk load) replay_count = 0 replay_start = datetime.utcnow() async for batch in self._read_changes(self._replay_cursor): await self._apply_batch(batch) replay_count += len(batch) # Log progress every 10k changes if replay_count % 10000 == 0: lag = await self._measure_lag() print(f"Replayed {replay_count}, lag: {lag} docs") # Phase 2: Enable dual-write so new changes go to both await self._enable_dual_write() # Phase 3: Wait for complete synchronization await self._wait_for_sync() return SyncResult( changes_replayed=replay_count, duration=datetime.utcnow() - replay_start, final_lag=0 ) async def _enable_dual_write(self) -> None: """ Enable dual-write mode. All writes to live index are now also applied to staging. Implementation depends on architecture: - Application-level: configure write router - CDC-level: add staging as CDC consumer - Proxy-level: enable write duplication """ # Example: Configure CDC to write to both await self.changes.add_target(self.staging) self._dual_write_enabled = True print(f"Dual-write enabled: {self.live} + {self.staging}") async def _wait_for_sync(self) -> None: """ Wait until staging is synchronized with live. Polls document counts until they match (within tolerance). With dual-write enabled, this should converge quickly unless replay is behind. """ max_attempts = 300 # 5 minutes with 1s intervals for attempt in range(max_attempts): lag = await self._measure_lag() if lag == 0: print("Synchronization achieved!") return if lag < 100: print(f"Nearly synchronized, lag: {lag}") elif attempt % 10 == 0: print(f"Waiting for sync, lag: {lag}") await asyncio.sleep(1) raise SynchronizationError( f"Failed to synchronize within {max_attempts}s" ) async def _measure_lag(self) -> int: """ Measure the document count difference between indexes. Note: count equality doesn't guarantee content equality, but it's a necessary condition for synchronization. 
""" live_count = await self.es.count(index=self.live) staging_count = await self.es.count(index=self.staging) return abs(live_count['count'] - staging_count['count']) async def _apply_batch(self, changes: list[Change]) -> None: """Apply a batch of changes to the staging index.""" bulk_ops = [] for change in changes: if change.operation == 'DELETE': bulk_ops.extend([ {"delete": { "_index": self.staging, "_id": change.id }} ]) else: bulk_ops.extend([ {"index": { "_index": self.staging, "_id": change.id }}, change.document ]) if bulk_ops: await self.es.bulk(body=bulk_ops) async def _read_changes( self, cursor: 'ChangeCursor' ) -> AsyncIterator[list[Change]]: """Read changes from the capture source.""" async for batch in self.changes.stream(cursor): yield batch async def disable_dual_write(self) -> None: """Disable dual-write after successful switch.""" if self._dual_write_enabled: await self.changes.remove_target(self.staging) self._dual_write_enabled = False print("Dual-write disabled")While atomic switching works well, some organizations prefer more gradual transitions. Canary deployments and gradual rollout patterns reduce risk further by exposing only a fraction of traffic to the new index initially.
Application-Level Routing: The application randomly selects which alias to query based on a percentage. Simple to implement but requires application changes.
Load Balancer Routing: Route a percentage of requests to different search endpoints. Completely transparent to applications.
Search Proxy Layer: A dedicated proxy routes queries to different backends. Adds latency but provides rich control.
```typescript
/**
 * Canary Router for Gradual Search Index Rollout
 *
 * Routes a configurable percentage of search traffic to the
 * canary (new) index while the rest goes to the stable index.
 */

interface CanaryConfig {
  stableAlias: string;    // e.g., "products"
  canaryAlias: string;    // e.g., "products_canary"
  canaryPercent: number;  // 0-100
  stickyRouting: boolean; // Same user always hits same index
}

class CanarySearchRouter {
  private config: CanaryConfig;
  private metrics: MetricsClient;

  constructor(
    private esClient: ElasticsearchClient,
    config: CanaryConfig,
    metrics: MetricsClient,
  ) {
    this.config = config;
    this.metrics = metrics;
  }

  /**
   * Route a search request to the appropriate index.
   *
   * Respects canary percentage and optionally provides sticky
   * routing so the same user always hits the same index.
   */
  async search(
    query: SearchQuery,
    context: RequestContext
  ): Promise<SearchResult> {
    const useCanary = this.shouldUseCanary(context);
    const targetAlias = useCanary
      ? this.config.canaryAlias
      : this.config.stableAlias;

    const startTime = Date.now();

    try {
      const result = await this.esClient.search({
        index: targetAlias,
        body: query,
      });

      // Track metrics for comparison
      const latency = Date.now() - startTime;
      this.recordMetrics(useCanary, latency, result, null);

      return result;
    } catch (error) {
      const latency = Date.now() - startTime;
      this.recordMetrics(useCanary, latency, null, error);

      // Optionally fall back to stable on canary error
      if (useCanary && this.config.canaryPercent < 100) {
        console.warn(`Canary failed, falling back to stable: ${error}`);
        return this.esClient.search({
          index: this.config.stableAlias,
          body: query,
        });
      }

      throw error;
    }
  }

  /**
   * Determine if this request should use the canary index.
   *
   * Uses consistent hashing for sticky routing if enabled,
   * otherwise random selection based on percentage.
   */
  private shouldUseCanary(context: RequestContext): boolean {
    if (this.config.canaryPercent === 0) return false;
    if (this.config.canaryPercent === 100) return true;

    let value: number;

    if (this.config.stickyRouting && context.userId) {
      // Consistent hash of user ID
      value = this.consistentHash(context.userId) % 100;
    } else {
      // Random selection
      value = Math.random() * 100;
    }

    return value < this.config.canaryPercent;
  }

  private consistentHash(input: string): number {
    // Simple hash for consistent routing
    let hash = 0;
    for (let i = 0; i < input.length; i++) {
      hash = ((hash << 5) - hash) + input.charCodeAt(i);
      hash = hash & hash; // Convert to 32-bit integer
    }
    return Math.abs(hash);
  }

  private recordMetrics(
    isCanary: boolean,
    latencyMs: number,
    result: SearchResult | null,
    error: Error | null,
  ): void {
    const labels = {
      index_type: isCanary ? 'canary' : 'stable',
      success: error ? 'false' : 'true',
    };

    this.metrics.histogram('search_latency_ms', latencyMs, labels);
    this.metrics.counter('search_requests_total', 1, labels);

    if (result) {
      this.metrics.histogram(
        'search_result_count',
        result.hits.total.value,
        labels
      );
    }

    if (error) {
      this.metrics.counter('search_errors_total', 1, {
        ...labels,
        error_type: error.constructor.name,
      });
    }
  }

  /**
   * Adjust canary percentage dynamically.
   *
   * Called by automation based on metrics analysis
   * or manually by operators.
   */
  async setCanaryPercent(percent: number): Promise<void> {
    if (percent < 0 || percent > 100) {
      throw new Error('Canary percent must be between 0 and 100');
    }

    const previous = this.config.canaryPercent;
    this.config.canaryPercent = percent;

    console.log(
      `Canary traffic: ${previous}% → ${percent}%`
    );

    // Record the change for audit
    this.metrics.gauge('canary_traffic_percent', percent);
  }
}

/**
 * Automated canary promotion based on metrics.
 *
 * Watches error rates and latency, automatically increases
 * traffic if healthy or rolls back if problems detected.
 */
class CanaryPromoter {
  private stages = [1, 5, 10, 25, 50, 75, 100];
  private currentStageIndex = 0;

  constructor(
    private router: CanarySearchRouter,
    private metrics: MetricsClient,
    private alerting: AlertingService,
  ) {}

  async runPromotion(): Promise<PromotionResult> {
    for (const stage of this.stages) {
      await this.router.setCanaryPercent(stage);

      // Wait for metrics to accumulate
      await this.sleep(60_000); // 1 minute per stage

      // Check health
      const health = await this.checkHealth();

      if (!health.healthy) {
        // Rollback to 0%
        await this.router.setCanaryPercent(0);

        await this.alerting.sendAlert({
          severity: 'warning',
          title: 'Canary rollback triggered',
          details: health.issues.join(', '),
        });

        return {
          success: false,
          rolledBackAtStage: stage,
          issues: health.issues,
        };
      }

      console.log(`Stage ${stage}% healthy, proceeding...`);
    }

    return { success: true, finalPercent: 100 };
  }

  private async checkHealth(): Promise<HealthCheckResult> {
    const issues: string[] = [];

    // Compare canary vs stable error rates
    const canaryErrors = await this.metrics.query(
      'rate(search_errors_total{index_type="canary"}[5m])'
    );
    const stableErrors = await this.metrics.query(
      'rate(search_errors_total{index_type="stable"}[5m])'
    );

    if (canaryErrors > stableErrors * 1.5) {
      issues.push(
        `Canary error rate ${canaryErrors} > stable ${stableErrors}`
      );
    }

    // Compare latencies
    const canaryP95 = await this.metrics.query(
      'histogram_quantile(0.95, search_latency_ms{index_type="canary"})'
    );
    const stableP95 = await this.metrics.query(
      'histogram_quantile(0.95, search_latency_ms{index_type="stable"})'
    );

    if (canaryP95 > stableP95 * 1.2) {
      issues.push(
        `Canary P95 latency ${canaryP95}ms > stable ${stableP95}ms`
      );
    }

    return {
      healthy: issues.length === 0,
      issues,
    };
  }
}
```

Shadow traffic testing (also called "dark launching") sends production queries to both indexes but only returns results from the stable index. This allows comprehensive testing with real traffic without any user impact.
Zero User Impact: Users always get results from the proven stable index.
Real Traffic Patterns: No need to construct synthetic test queries.
Scale Testing: You verify the new index handles full production load.
Result Comparison: Identify queries where results differ significantly.
Performance Overhead: Every query costs 2x. Ensure capacity.
Async Shadow Queries: Fire shadow query without blocking response to user.
Sampling: For very high traffic, shadow a percentage rather than 100%.
Comparison Logic: Define what "different enough to investigate" means.
"""Shadow Traffic Testing for Search Indexes Duplicates production queries to a shadow index for comparisonwithout affecting user experience.""" import asynciofrom dataclasses import dataclassfrom typing import Optional @dataclassclass ComparisonResult: query_id: str stable_latency_ms: float shadow_latency_ms: float stable_count: int shadow_count: int result_overlap_percent: float score_correlation: float class ShadowTrafficTester: """ Routes queries to both stable and shadow indexes, comparing results for validation. """ def __init__( self, es_client: 'ElasticsearchClient', stable_alias: str, shadow_alias: str, comparison_store: 'ComparisonStore', sample_rate: float = 1.0 # 1.0 = 100% shadow ): self.es = es_client self.stable = stable_alias self.shadow = shadow_alias self.store = comparison_store self.sample_rate = sample_rate async def search( self, query: dict, query_id: str ) -> 'SearchResult': """ Execute search with shadow comparison. Returns stable results immediately while shadow query runs asynchronously in the background. """ # Always execute stable query synchronously stable_start = asyncio.get_event_loop().time() stable_result = await self.es.search( index=self.stable, body=query ) stable_latency = (asyncio.get_event_loop().time() - stable_start) * 1000 # Maybe execute shadow query asynchronously if self._should_shadow(): asyncio.create_task( self._shadow_and_compare( query, query_id, stable_result, stable_latency ) ) return stable_result async def _shadow_and_compare( self, query: dict, query_id: str, stable_result: 'SearchResult', stable_latency: float ) -> None: """ Execute shadow query and store comparison. Runs in background, doesn't affect user response. """ try: shadow_start = asyncio.get_event_loop().time() shadow_result = await self.es.search( index=self.shadow, body=query ) shadow_latency = ( asyncio.get_event_loop().time() - shadow_start ) * 1000 # Compare results comparison = self._compare( query_id, stable_result, stable_latency, shadow_result, shadow_latency ) # Store for analysis await self.store.save(comparison) # Alert on significant divergence if comparison.result_overlap_percent < 0.8: await self._alert_divergence(query_id, query, comparison) except Exception as e: # Shadow failures are logged but don't affect users await self.store.save_error(query_id, str(e)) def _compare( self, query_id: str, stable: 'SearchResult', stable_latency: float, shadow: 'SearchResult', shadow_latency: float ) -> ComparisonResult: """ Compare stable and shadow results. """ stable_ids = [hit['_id'] for hit in stable['hits']['hits']] shadow_ids = [hit['_id'] for hit in shadow['hits']['hits']] # Calculate overlap stable_set = set(stable_ids[:10]) # Compare top 10 shadow_set = set(shadow_ids[:10]) overlap = len(stable_set & shadow_set) / max(len(stable_set), 1) # Calculate score correlation for overlapping docs correlation = self._score_correlation(stable, shadow) return ComparisonResult( query_id=query_id, stable_latency_ms=stable_latency, shadow_latency_ms=shadow_latency, stable_count=stable['hits']['total']['value'], shadow_count=shadow['hits']['total']['value'], result_overlap_percent=overlap, score_correlation=correlation ) def _score_correlation( self, stable: 'SearchResult', shadow: 'SearchResult' ) -> float: """ Calculate Pearson correlation of scores for matching docs. High correlation means ranking is similar even if absolute scores differ (which is expected with different indexes). 
""" stable_scores = { hit['_id']: hit['_score'] for hit in stable['hits']['hits'] } pairs = [] for hit in shadow['hits']['hits']: if hit['_id'] in stable_scores: pairs.append(( stable_scores[hit['_id']], hit['_score'] )) if len(pairs) < 2: return 0.0 # Not enough data # Calculate Pearson correlation n = len(pairs) sum_x = sum(p[0] for p in pairs) sum_y = sum(p[1] for p in pairs) sum_xy = sum(p[0] * p[1] for p in pairs) sum_x2 = sum(p[0] ** 2 for p in pairs) sum_y2 = sum(p[1] ** 2 for p in pairs) numerator = n * sum_xy - sum_x * sum_y denominator = ( (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2) ) ** 0.5 if denominator == 0: return 0.0 return numerator / denominator def _should_shadow(self) -> bool: """Determine if this request should be shadowed.""" return asyncio.get_event_loop().time() % 1 < self.sample_rate async def _alert_divergence( self, query_id: str, query: dict, comparison: ComparisonResult ) -> None: """Alert when shadow results diverge significantly.""" # Log for investigation print(f"""Shadow divergence detected: Query ID: {query_id} Overlap: {comparison.result_overlap_percent:.1%} Correlation: {comparison.score_correlation:.3f} Latency: {comparison.stable_latency_ms:.0f}ms stable, {comparison.shadow_latency_ms:.0f}ms shadow""")Run shadow traffic for at least 24-48 hours to capture diverse query patterns: weekday vs weekend, peak vs off-peak, different user segments. One hour of shadow testing might miss rare but important query types.
The power of blue-green indexing lies in its ability to quickly recover from problems. A well-designed rollback procedure is essential.
If the previous environment is still available and was receiving dual-writes, it is already current with production.
This is the ideal scenario—rollback is literally one API call.
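Against Elasticsearch, that one call is an alias swap. A minimal sketch, assuming the Python client and placeholder index names:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Move the public alias back to the previous environment's index in a single
# atomic request; searches see either the old index or the new one, never neither.
es.indices.update_aliases(body={
    "actions": [
        {"remove": {"index": "products_green_v8_20240116110000", "alias": "products"}},
        {"add": {"index": "products_blue_v7_20240115093000", "alias": "products"}},
    ]
})
```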
If dual-write was disabled and the previous environment is stale:
Alternatively, if data loss is unacceptable:
If both environments are unusable:
| Symptom | Severity | Action |
|---|---|---|
| Slightly higher latency | Low | Monitor, consider fix forward |
| Some queries returning errors | Medium | Rollback if > 1% error rate |
| Wrong results for specific queries | Medium | Assess impact, maybe rollback |
| All queries failing | Critical | Immediate rollback |
| Data corruption detected | Critical | Rollback, investigate |
| Performance degraded 10x | High | Rollback, investigate |
The rollback itself is a single `POST /_aliases` request that swaps the alias back to the previous index; `GET /_alias/products` confirms which index is currently live.

Define a clear rollback window (e.g., 48 hours) after which the previous environment may be deleted. If issues emerge after this window, you'll need to fix forward or do a full rebuild. Communicate this window to stakeholders.
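One way to enforce that window is a scheduled cleanup job that refuses to delete the standby index too early. A minimal sketch, assuming the Python client; the 48-hour constant and the `switched_at` timestamp source are assumptions:

```python
from datetime import datetime, timedelta, timezone

from elasticsearch import Elasticsearch

ROLLBACK_WINDOW = timedelta(hours=48)  # assumed window; match your own policy


def cleanup_previous_index(
    es: Elasticsearch, previous_index: str, switched_at: datetime
) -> None:
    """Delete the standby index only after the rollback window has passed."""
    if datetime.now(timezone.utc) - switched_at < ROLLBACK_WINDOW:
        print(f"Keeping {previous_index}: still inside the rollback window")
        return
    es.indices.delete(index=previous_index)
    print(f"Deleted {previous_index}: rollback window expired")
```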
Technical implementation is only half the battle. Achieving frequent, confident search deployments requires organizational practices that build trust in the process.
Practice in Production (safely):
The goal: deployment becomes routine muscle memory, not a rare event.
Start slow and accelerate as confidence builds:
Maintain comprehensive documentation:
Define and track deployment health:
| Frequency | Automation | Verification | Confidence |
|---|---|---|---|
| Quarterly | Manual | Manual checks | Low, high stress |
| Monthly | Scripted | Checklist-based | Growing |
| Weekly | Fully automated | Automated suite | High |
| Daily/On-demand | Self-service | Continuous | Complete trust |
Acknowledge successful deployments, especially early ones. Moving from quarterly to weekly deployments is a significant achievement. Recognition reinforces the value of investing in deployment infrastructure and encourages continued improvement.
Blue-green indexing transforms search index deployments from high-risk events into routine operations. By maintaining two environments and switching atomically, you gain the confidence to iterate rapidly on search quality.
Module Complete:
You've now mastered the full spectrum of search indexing strategies, from the fundamental choice between real-time and batch indexing, through delta updates and reindexing, to the operational patterns that make search index management routine. These capabilities will serve you whether you're building a startup's first search feature or scaling to billions of documents at a major platform.
You now have a complete mental model for search index management at scale. From the fundamentals of indexing approaches through to sophisticated blue-green deployment patterns, you're equipped to design and operate world-class search infrastructure. The next module explores Search Relevance Tuning—the art of making search results truly useful.