Imagine deploying a major search index change—new analyzers, restructured documents, completely rebuilt from source—with the confidence of a routine code deploy. That's the promise of blue-green indexing.
Borrowed from application deployment practices, blue-green indexing maintains two complete, independent index environments. At any moment, one environment (say, blue) serves all production traffic while the other (green) is rebuilt and validated with the new configuration.
When green is ready, you switch traffic atomically. If problems emerge, you switch back. The failed deployment becomes a learning opportunity, not an outage.
This pattern synthesizes everything we've explored: real-time and batch indexing, delta updates, reindexing strategies, and index aliases. It's the operational model that enables teams to iterate on search quality without fear.
Companies like Spotify, Airbnb, and Stripe use blue-green indexing to deploy search changes multiple times per day. For them, search index updates are as routine as code deployments—because they've built the infrastructure to make them so.
By the end of this page, you will understand the complete blue-green indexing workflow, how to maintain synchronization between environments, advanced patterns like canary deployments and shadow traffic, and how to build organizational confidence in frequent search deployments.
The blue-green pattern is conceptually simple but requires careful implementation across several dimensions.
Two Independent Environments: Blue and green are completely separate index sets. They may have different mappings, different analyzers, or contain different data at any moment.
One Serves Traffic: At any time, exactly one environment serves production search traffic. This is the "live" environment.
One Prepares for Deployment: The other environment is used for building and validating changes. This is the "staging" environment.
Atomic Switch: When staging is ready, traffic switches from live to staging atomically. The former live becomes the new staging.
Role Reversal: After switching, blue becomes green and green becomes blue—the roles alternate with each deployment.
| State | Blue | Green | Traffic Routing |
|---|---|---|---|
| Initial | Live, current data | Idle or previous version | 100% → Blue |
| Building | Live, receiving updates | Reindexing from source | 100% → Blue |
| Syncing | Live, receiving updates | Applying delta catchup | 100% → Blue |
| Validating | Live, receiving updates | Complete, being tested | 100% → Blue |
| Switching | Being deprecated | Becoming live | 0% → Blue, 100% → Green |
| Observing | Standby for rollback | Live, receiving updates | 100% → Green |
| Committed | Ready for next build | Live, current data | 100% → Green |
Blue/green is just a naming convention. Some teams use production/staging, active/passive, or v1/v2. The key is having two environments that alternate roles. Use whatever naming resonates with your team.
Implementing blue-green indexing requires coordinating several components. Here's a production-grade architecture.
Index Naming Convention: Use consistent names that identify the environment:

- `products_blue_v{version}`
- `products_green_v{version}`

Alias Structure (see the sketch after this list):

- `products` → points to the live environment's index
- `products_blue` → points to the blue environment's current index
- `products_green` → points to the green environment's current index

State Tracking: A durable store (database, config service) tracks which environment is live, each environment's current index name and version, whether a deployment is in progress, and when the last deployment completed.

Dual-Write Coordinator: During synchronization, writes must go to both environments. This component manages enabling and disabling write duplication and routing each write to both indexes.

Verification Service: Automated testing against the staging environment before any switch, for example parity queries and minimum document counts.

Switch Controller: Orchestrates the atomic switch, updating the alias so traffic moves in a single operation.
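Before the full orchestrator below, here is a minimal sketch of how that alias layout might be established with the Elasticsearch Python client. The index names, the local connection URL, and the `body=` calling convention (older clients use `body=`, newer ones accept `actions=`) are illustrative assumptions, not part of the reference implementation.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # connection details are assumed

# Hypothetical current index for each environment
blue_index = "products_blue_v7_20240115093000"
green_index = "products_green_v8_20240116110000"

# Environment aliases track each environment's current index;
# the bare "products" alias points at whichever environment is live.
es.indices.update_aliases(body={
    "actions": [
        {"add": {"index": blue_index, "alias": "products_blue"}},
        {"add": {"index": green_index, "alias": "products_green"}},
        {"add": {"index": blue_index, "alias": "products"}},  # blue is live here
    ]
})
```

Queries always target `products`; a deployment only ever moves that one alias.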
```typescript
/**
 * Blue-Green Index Orchestrator
 *
 * Manages the complete lifecycle of blue-green search index deployments.
 * This is a production-grade implementation used at scale.
 */

type Environment = 'blue' | 'green';

interface DeploymentState {
  liveEnvironment: Environment;
  blueIndex: string;
  greenIndex: string;
  blueVersion: number;
  greenVersion: number;
  deploymentInProgress: boolean;
  lastDeployedAt: Date;
}

interface DeploymentConfig {
  baseName: string;             // e.g., "products"
  sourceDatabase: string;       // Data source for reindexing
  verificationQueries: Query[]; // Queries for parity testing
  minDocumentCount: number;     // Minimum expected documents
  maxSyncLagSeconds: number;    // Max acceptable sync delay
}

class BlueGreenOrchestrator {
  constructor(
    private esClient: ElasticsearchClient,
    private stateStore: StateStore,
    private changeSource: ChangeDataCapture,
    private config: DeploymentConfig,
  ) {}

  /**
   * Execute a complete blue-green deployment.
   *
   * This method orchestrates the entire process from building
   * the new index through to traffic switchover.
   */
  async deploy(
    newMappings: object,
    newSettings: object,
  ): Promise<DeploymentResult> {
    const state = await this.stateStore.getState(this.config.baseName);

    if (state.deploymentInProgress) {
      throw new Error('Deployment already in progress');
    }

    const stagingEnv = this.getStagingEnvironment(state);
    const newVersion = this.getNextVersion(state, stagingEnv);
    const newIndexName = this.generateIndexName(stagingEnv, newVersion);

    try {
      // Mark deployment in progress
      await this.stateStore.updateState({
        ...state,
        deploymentInProgress: true,
      });

      // Step 1: Create new index with target schema
      await this.createIndex(newIndexName, newMappings, newSettings);
      console.log(`Created index: ${newIndexName}`);

      // Step 2: Start change capture BEFORE bulk load
      const changeCursor = await this.changeSource.startCapture();
      console.log('Started change capture');

      // Step 3: Bulk load all data to new index
      const bulkResult = await this.bulkLoadFromSource(newIndexName);
      console.log(`Bulk loaded ${bulkResult.documentCount} documents`);

      // Step 4: Enable dual-write to keep new index synchronized
      await this.enableDualWrite(newIndexName);
      console.log('Enabled dual-write');

      // Step 5: Replay changes that occurred during bulk load
      const catchupResult = await this.replayChanges(
        newIndexName,
        changeCursor
      );
      console.log(`Replayed ${catchupResult.count} changes`);

      // Step 6: Wait for synchronization (compare against the live index)
      await this.waitForSync(this.getCurrentLiveIndex(state), newIndexName);
      console.log('Environments synchronized');

      // Step 7: Run verification suite
      const verification = await this.verify(newIndexName);
      if (!verification.passed) {
        throw new DeploymentVerificationError(verification.issues);
      }
      console.log('Verification passed');

      // Step 8: Atomic switch
      const previousLiveIndex = this.getCurrentLiveIndex(state);
      await this.atomicSwitch(previousLiveIndex, newIndexName);
      console.log(`Switched traffic: ${previousLiveIndex} → ${newIndexName}`);

      // Step 9: Update state
      const newState: DeploymentState = {
        liveEnvironment: stagingEnv,
        blueIndex: stagingEnv === 'blue' ? newIndexName : state.blueIndex,
        greenIndex: stagingEnv === 'green' ? newIndexName : state.greenIndex,
        blueVersion: stagingEnv === 'blue' ? newVersion : state.blueVersion,
        greenVersion: stagingEnv === 'green' ? newVersion : state.greenVersion,
        deploymentInProgress: false,
        lastDeployedAt: new Date(),
      };
      await this.stateStore.updateState(newState);

      // Step 10: Disable dual-write (optional, can keep for fast rollback)
      await this.disableDualWrite();

      return {
        success: true,
        previousIndex: previousLiveIndex,
        newIndex: newIndexName,
        documentsIndexed: bulkResult.documentCount,
        changesReplayed: catchupResult.count,
      };
    } catch (error) {
      // Mark deployment failed, state unchanged
      await this.stateStore.updateState({
        ...state,
        deploymentInProgress: false,
      });

      // Cleanup partial index if exists
      await this.cleanupFailedIndex(newIndexName);

      throw error;
    }
  }

  /**
   * Quickly rollback to the previous environment.
   *
   * If dual-write is still active, the previous index is current.
   * This is essentially an alias switch.
   */
  async rollback(): Promise<void> {
    const state = await this.stateStore.getState(this.config.baseName);
    const currentLive = state.liveEnvironment;
    const previousEnv = currentLive === 'blue' ? 'green' : 'blue';
    const previousIndex = currentLive === 'blue'
      ? state.greenIndex
      : state.blueIndex;
    const currentIndex = this.getCurrentLiveIndex(state);

    // Verify previous index exists and has data
    const exists = await this.esClient.indices.exists({
      index: previousIndex
    });
    if (!exists) {
      throw new Error(
        `Rollback target ${previousIndex} does not exist`
      );
    }

    // Atomic switch back
    await this.atomicSwitch(currentIndex, previousIndex);

    // Update state
    await this.stateStore.updateState({
      ...state,
      liveEnvironment: previousEnv,
    });

    console.log(`Rolled back: ${currentIndex} → ${previousIndex}`);
  }

  private getStagingEnvironment(state: DeploymentState): Environment {
    // Staging is whichever environment is NOT live
    return state.liveEnvironment === 'blue' ? 'green' : 'blue';
  }

  private getNextVersion(state: DeploymentState, env: Environment): number {
    const currentVersion = env === 'blue'
      ? state.blueVersion
      : state.greenVersion;
    return currentVersion + 1;
  }

  private generateIndexName(env: Environment, version: number): string {
    const timestamp = new Date().toISOString()
      .replace(/[-:T]/g, '')
      .slice(0, 14);
    return `${this.config.baseName}_${env}_v${version}_${timestamp}`;
  }

  private getCurrentLiveIndex(state: DeploymentState): string {
    return state.liveEnvironment === 'blue'
      ? state.blueIndex
      : state.greenIndex;
  }

  private async atomicSwitch(
    fromIndex: string,
    toIndex: string
  ): Promise<void> {
    const alias = this.config.baseName;

    await this.esClient.indices.updateAliases({
      body: {
        actions: [
          { remove: { index: fromIndex, alias } },
          { add: { index: toIndex, alias } },
        ],
      },
    });
  }

  private async waitForSync(
    liveIndex: string,
    stagingIndex: string
  ): Promise<void> {
    const maxWait = 300_000;    // 5 minutes
    const checkInterval = 1000; // 1 second
    let elapsed = 0;

    while (elapsed < maxWait) {
      const [liveCount, stagingCount] = await Promise.all([
        this.esClient.count({ index: liveIndex }),
        this.esClient.count({ index: stagingIndex }),
      ]);

      const lag = Math.abs(liveCount.count - stagingCount.count);
      if (lag === 0) {
        return; // Synchronized
      }

      await this.sleep(checkInterval);
      elapsed += checkInterval;
    }

    throw new Error('Synchronization timeout exceeded');
  }

  // Helpers such as createIndex, bulkLoadFromSource, enableDualWrite,
  // replayChanges, verify, disableDualWrite, cleanupFailedIndex, and sleep
  // are omitted here for brevity.
}
```

The critical challenge in blue-green indexing is keeping both environments synchronized. When you switch traffic, the staging environment must be exactly current with production.
While building the new index, you need writes to go to both environments:
Option 1: Application-Level Dual-Write. The application writes to both indexes explicitly (sketched below). Simple but tightly coupled.
Option 2: Message Queue Fan-Out. Writes go to a queue, and consumers update both indexes. Decoupled but adds latency.
Option 3: CDC-Based Replication. Change data capture records all writes and applies them to both indexes. Most robust but most complex.
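As a concrete illustration of Option 1, here is a minimal sketch of an application-level dual-writer, assuming an Elasticsearch 8.x-style Python client. The `DualWriter` class and its error-handling policy are hypothetical, not the pattern's only form.

```python
from elasticsearch import Elasticsearch


class DualWriter:
    """Sends every document change to both the live and staging indexes."""

    def __init__(self, es: Elasticsearch, live_index: str, staging_index: str):
        self.es = es
        self.live = live_index
        self.staging = staging_index

    def index_document(self, doc_id: str, doc: dict) -> None:
        # The live write is authoritative: let its failures propagate.
        self.es.index(index=self.live, id=doc_id, document=doc)
        try:
            self.es.index(index=self.staging, id=doc_id, document=doc)
        except Exception as exc:
            # Failed staging writes must be retried or replayed later,
            # otherwise the environments drift apart before the switch.
            print(f"staging write failed for {doc_id}: {exc}")

    def delete_document(self, doc_id: str) -> None:
        self.es.delete(index=self.live, id=doc_id)
        try:
            self.es.delete(index=self.staging, id=doc_id)
        except Exception as exc:
            print(f"staging delete failed for {doc_id}: {exc}")
```

This shows the "tightly coupled" trade-off directly: every producer of writes has to know a deployment is in progress, which is why larger systems tend to prefer queue fan-out or CDC.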
During bulk load (which takes hours), the production index receives continuous updates. When bulk load completes:
- The new index contains data as of T_start, when change capture began
- Production has continued to receive changes from T_start to T_end
- You must replay the changes from T_start to T_end to the new index

The key is that change capture must start BEFORE bulk load, capturing changes during the entire build process.
For systems with high update rates (thousands per second), catchup can be challenging:
"""Synchronization Manager for Blue-Green Indexing Handles the complex task of keeping staging synchronized withproduction during the build and transition phases.""" from datetime import datetime, timedeltafrom typing import AsyncIteratorimport asyncio class SynchronizationManager: """ Manages synchronization between live and staging indexes. The critical invariant: staging must never fall behind live by more than max_lag_seconds at switch time. """ def __init__( self, es_client: 'ElasticsearchClient', change_source: 'ChangeDataCapture', live_index: str, staging_index: str, max_lag_seconds: int = 30 ): self.es = es_client self.changes = change_source self.live = live_index self.staging = staging_index self.max_lag = max_lag_seconds self._dual_write_enabled = False self._replay_cursor = None async def start_capture_before_build(self) -> 'ChangeCursor': """ Start capturing changes BEFORE beginning bulk load. This is critical - we need all changes from this moment forward to replay after bulk load completes. """ self._replay_cursor = await self.changes.start_capture() return self._replay_cursor async def replay_and_synchronize(self) -> SyncResult: """ Replay captured changes and achieve synchronization. Called after bulk load completes. Will: 1. Replay all changes captured during bulk load 2. Enable dual-write for new changes 3. Continue replay until caught up 4. Return when synchronized """ if not self._replay_cursor: raise ValueError( "Must call start_capture_before_build() first" ) # Phase 1: Replay historical changes (during bulk load) replay_count = 0 replay_start = datetime.utcnow() async for batch in self._read_changes(self._replay_cursor): await self._apply_batch(batch) replay_count += len(batch) # Log progress every 10k changes if replay_count % 10000 == 0: lag = await self._measure_lag() print(f"Replayed {replay_count}, lag: {lag} docs") # Phase 2: Enable dual-write so new changes go to both await self._enable_dual_write() # Phase 3: Wait for complete synchronization await self._wait_for_sync() return SyncResult( changes_replayed=replay_count, duration=datetime.utcnow() - replay_start, final_lag=0 ) async def _enable_dual_write(self) -> None: """ Enable dual-write mode. All writes to live index are now also applied to staging. Implementation depends on architecture: - Application-level: configure write router - CDC-level: add staging as CDC consumer - Proxy-level: enable write duplication """ # Example: Configure CDC to write to both await self.changes.add_target(self.staging) self._dual_write_enabled = True print(f"Dual-write enabled: {self.live} + {self.staging}") async def _wait_for_sync(self) -> None: """ Wait until staging is synchronized with live. Polls document counts until they match (within tolerance). With dual-write enabled, this should converge quickly unless replay is behind. """ max_attempts = 300 # 5 minutes with 1s intervals for attempt in range(max_attempts): lag = await self._measure_lag() if lag == 0: print("Synchronization achieved!") return if lag < 100: print(f"Nearly synchronized, lag: {lag}") elif attempt % 10 == 0: print(f"Waiting for sync, lag: {lag}") await asyncio.sleep(1) raise SynchronizationError( f"Failed to synchronize within {max_attempts}s" ) async def _measure_lag(self) -> int: """ Measure the document count difference between indexes. Note: count equality doesn't guarantee content equality, but it's a necessary condition for synchronization. 
""" live_count = await self.es.count(index=self.live) staging_count = await self.es.count(index=self.staging) return abs(live_count['count'] - staging_count['count']) async def _apply_batch(self, changes: list[Change]) -> None: """Apply a batch of changes to the staging index.""" bulk_ops = [] for change in changes: if change.operation == 'DELETE': bulk_ops.extend([ {"delete": { "_index": self.staging, "_id": change.id }} ]) else: bulk_ops.extend([ {"index": { "_index": self.staging, "_id": change.id }}, change.document ]) if bulk_ops: await self.es.bulk(body=bulk_ops) async def _read_changes( self, cursor: 'ChangeCursor' ) -> AsyncIterator[list[Change]]: """Read changes from the capture source.""" async for batch in self.changes.stream(cursor): yield batch async def disable_dual_write(self) -> None: """Disable dual-write after successful switch.""" if self._dual_write_enabled: await self.changes.remove_target(self.staging) self._dual_write_enabled = False print("Dual-write disabled")While atomic switching works well, some organizations prefer more gradual transitions. Canary deployments and gradual rollout patterns reduce risk further by exposing only a fraction of traffic to the new index initially.
Application-Level Routing: The application randomly selects which alias to query based on a percentage. Simple to implement but requires application changes.
Load Balancer Routing: Route a percentage of requests to different search endpoints. Completely transparent to applications.
Search Proxy Layer: A dedicated proxy routes queries to different backends. Adds latency but provides rich control.
```typescript
/**
 * Canary Router for Gradual Search Index Rollout
 *
 * Routes a configurable percentage of search traffic to the
 * canary (new) index while the rest goes to the stable index.
 */

interface CanaryConfig {
  stableAlias: string;    // e.g., "products"
  canaryAlias: string;    // e.g., "products_canary"
  canaryPercent: number;  // 0-100
  stickyRouting: boolean; // Same user always hits same index
}

class CanarySearchRouter {
  private config: CanaryConfig;
  private metrics: MetricsClient;

  constructor(
    private esClient: ElasticsearchClient,
    config: CanaryConfig,
    metrics: MetricsClient,
  ) {
    this.config = config;
    this.metrics = metrics;
  }

  /**
   * Route a search request to the appropriate index.
   *
   * Respects canary percentage and optionally provides sticky
   * routing so the same user always hits the same index.
   */
  async search(
    query: SearchQuery,
    context: RequestContext
  ): Promise<SearchResult> {
    const useCanary = this.shouldUseCanary(context);
    const targetAlias = useCanary
      ? this.config.canaryAlias
      : this.config.stableAlias;

    const startTime = Date.now();

    try {
      const result = await this.esClient.search({
        index: targetAlias,
        body: query,
      });

      // Track metrics for comparison
      const latency = Date.now() - startTime;
      this.recordMetrics(useCanary, latency, result, null);

      return result;
    } catch (error) {
      const latency = Date.now() - startTime;
      this.recordMetrics(useCanary, latency, null, error);

      // Optionally fall back to stable on canary error
      if (useCanary && this.config.canaryPercent < 100) {
        console.warn(`Canary failed, falling back to stable: ${error}`);
        return this.esClient.search({
          index: this.config.stableAlias,
          body: query,
        });
      }

      throw error;
    }
  }

  /**
   * Determine if this request should use the canary index.
   *
   * Uses consistent hashing for sticky routing if enabled,
   * otherwise random selection based on percentage.
   */
  private shouldUseCanary(context: RequestContext): boolean {
    if (this.config.canaryPercent === 0) return false;
    if (this.config.canaryPercent === 100) return true;

    let value: number;

    if (this.config.stickyRouting && context.userId) {
      // Consistent hash of user ID
      value = this.consistentHash(context.userId) % 100;
    } else {
      // Random selection
      value = Math.random() * 100;
    }

    return value < this.config.canaryPercent;
  }

  private consistentHash(input: string): number {
    // Simple hash for consistent routing
    let hash = 0;
    for (let i = 0; i < input.length; i++) {
      hash = ((hash << 5) - hash) + input.charCodeAt(i);
      hash = hash & hash; // Convert to 32-bit integer
    }
    return Math.abs(hash);
  }

  private recordMetrics(
    isCanary: boolean,
    latencyMs: number,
    result: SearchResult | null,
    error: Error | null,
  ): void {
    const labels = {
      index_type: isCanary ? 'canary' : 'stable',
      success: error ? 'false' : 'true',
    };

    this.metrics.histogram('search_latency_ms', latencyMs, labels);
    this.metrics.counter('search_requests_total', 1, labels);

    if (result) {
      this.metrics.histogram(
        'search_result_count',
        result.hits.total.value,
        labels
      );
    }

    if (error) {
      this.metrics.counter('search_errors_total', 1, {
        ...labels,
        error_type: error.constructor.name,
      });
    }
  }

  /**
   * Adjust canary percentage dynamically.
   *
   * Called by automation based on metrics analysis
   * or manually by operators.
   */
  async setCanaryPercent(percent: number): Promise<void> {
    if (percent < 0 || percent > 100) {
      throw new Error('Canary percent must be between 0 and 100');
    }

    const previous = this.config.canaryPercent;
    this.config.canaryPercent = percent;

    console.log(
      `Canary traffic: ${previous}% → ${percent}%`
    );

    // Record the change for audit
    this.metrics.gauge('canary_traffic_percent', percent);
  }
}

/**
 * Automated canary promotion based on metrics.
 *
 * Watches error rates and latency, automatically increases
 * traffic if healthy or rolls back if problems detected.
 */
class CanaryPromoter {
  private stages = [1, 5, 10, 25, 50, 75, 100];
  private currentStageIndex = 0;

  constructor(
    private router: CanarySearchRouter,
    private metrics: MetricsClient,
    private alerting: AlertingService,
  ) {}

  async runPromotion(): Promise<PromotionResult> {
    for (const stage of this.stages) {
      await this.router.setCanaryPercent(stage);

      // Wait for metrics to accumulate
      await this.sleep(60_000); // 1 minute per stage

      // Check health
      const health = await this.checkHealth();

      if (!health.healthy) {
        // Rollback to 0%
        await this.router.setCanaryPercent(0);

        await this.alerting.sendAlert({
          severity: 'warning',
          title: 'Canary rollback triggered',
          details: health.issues.join(', '),
        });

        return {
          success: false,
          rolledBackAtStage: stage,
          issues: health.issues,
        };
      }

      console.log(`Stage ${stage}% healthy, proceeding...`);
    }

    return { success: true, finalPercent: 100 };
  }

  private async checkHealth(): Promise<HealthCheckResult> {
    const issues: string[] = [];

    // Compare canary vs stable error rates
    const canaryErrors = await this.metrics.query(
      'rate(search_errors_total{index_type="canary"}[5m])'
    );
    const stableErrors = await this.metrics.query(
      'rate(search_errors_total{index_type="stable"}[5m])'
    );

    if (canaryErrors > stableErrors * 1.5) {
      issues.push(
        `Canary error rate ${canaryErrors} > stable ${stableErrors}`
      );
    }

    // Compare latencies
    const canaryP95 = await this.metrics.query(
      'histogram_quantile(0.95, search_latency_ms{index_type="canary"})'
    );
    const stableP95 = await this.metrics.query(
      'histogram_quantile(0.95, search_latency_ms{index_type="stable"})'
    );

    if (canaryP95 > stableP95 * 1.2) {
      issues.push(
        `Canary P95 latency ${canaryP95}ms > stable ${stableP95}ms`
      );
    }

    return {
      healthy: issues.length === 0,
      issues,
    };
  }
}
```

Shadow traffic testing (also called "dark launching") sends production queries to both indexes but only returns results from the stable index. This allows comprehensive testing with real traffic without any user impact.
Zero User Impact: Users always get results from the proven stable index.
Real Traffic Patterns: No need to construct synthetic test queries.
Scale Testing: You verify the new index handles full production load.
Result Comparison: Identify queries where results differ significantly.
Performance Overhead: Every query costs 2x. Ensure capacity.
Async Shadow Queries: Fire shadow query without blocking response to user.
Sampling: For very high traffic, shadow a percentage rather than 100%.
Comparison Logic: Define what "different enough to investigate" means.
"""Shadow Traffic Testing for Search Indexes Duplicates production queries to a shadow index for comparisonwithout affecting user experience.""" import asynciofrom dataclasses import dataclassfrom typing import Optional @dataclassclass ComparisonResult: query_id: str stable_latency_ms: float shadow_latency_ms: float stable_count: int shadow_count: int result_overlap_percent: float score_correlation: float class ShadowTrafficTester: """ Routes queries to both stable and shadow indexes, comparing results for validation. """ def __init__( self, es_client: 'ElasticsearchClient', stable_alias: str, shadow_alias: str, comparison_store: 'ComparisonStore', sample_rate: float = 1.0 # 1.0 = 100% shadow ): self.es = es_client self.stable = stable_alias self.shadow = shadow_alias self.store = comparison_store self.sample_rate = sample_rate async def search( self, query: dict, query_id: str ) -> 'SearchResult': """ Execute search with shadow comparison. Returns stable results immediately while shadow query runs asynchronously in the background. """ # Always execute stable query synchronously stable_start = asyncio.get_event_loop().time() stable_result = await self.es.search( index=self.stable, body=query ) stable_latency = (asyncio.get_event_loop().time() - stable_start) * 1000 # Maybe execute shadow query asynchronously if self._should_shadow(): asyncio.create_task( self._shadow_and_compare( query, query_id, stable_result, stable_latency ) ) return stable_result async def _shadow_and_compare( self, query: dict, query_id: str, stable_result: 'SearchResult', stable_latency: float ) -> None: """ Execute shadow query and store comparison. Runs in background, doesn't affect user response. """ try: shadow_start = asyncio.get_event_loop().time() shadow_result = await self.es.search( index=self.shadow, body=query ) shadow_latency = ( asyncio.get_event_loop().time() - shadow_start ) * 1000 # Compare results comparison = self._compare( query_id, stable_result, stable_latency, shadow_result, shadow_latency ) # Store for analysis await self.store.save(comparison) # Alert on significant divergence if comparison.result_overlap_percent < 0.8: await self._alert_divergence(query_id, query, comparison) except Exception as e: # Shadow failures are logged but don't affect users await self.store.save_error(query_id, str(e)) def _compare( self, query_id: str, stable: 'SearchResult', stable_latency: float, shadow: 'SearchResult', shadow_latency: float ) -> ComparisonResult: """ Compare stable and shadow results. """ stable_ids = [hit['_id'] for hit in stable['hits']['hits']] shadow_ids = [hit['_id'] for hit in shadow['hits']['hits']] # Calculate overlap stable_set = set(stable_ids[:10]) # Compare top 10 shadow_set = set(shadow_ids[:10]) overlap = len(stable_set & shadow_set) / max(len(stable_set), 1) # Calculate score correlation for overlapping docs correlation = self._score_correlation(stable, shadow) return ComparisonResult( query_id=query_id, stable_latency_ms=stable_latency, shadow_latency_ms=shadow_latency, stable_count=stable['hits']['total']['value'], shadow_count=shadow['hits']['total']['value'], result_overlap_percent=overlap, score_correlation=correlation ) def _score_correlation( self, stable: 'SearchResult', shadow: 'SearchResult' ) -> float: """ Calculate Pearson correlation of scores for matching docs. High correlation means ranking is similar even if absolute scores differ (which is expected with different indexes). 
""" stable_scores = { hit['_id']: hit['_score'] for hit in stable['hits']['hits'] } pairs = [] for hit in shadow['hits']['hits']: if hit['_id'] in stable_scores: pairs.append(( stable_scores[hit['_id']], hit['_score'] )) if len(pairs) < 2: return 0.0 # Not enough data # Calculate Pearson correlation n = len(pairs) sum_x = sum(p[0] for p in pairs) sum_y = sum(p[1] for p in pairs) sum_xy = sum(p[0] * p[1] for p in pairs) sum_x2 = sum(p[0] ** 2 for p in pairs) sum_y2 = sum(p[1] ** 2 for p in pairs) numerator = n * sum_xy - sum_x * sum_y denominator = ( (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2) ) ** 0.5 if denominator == 0: return 0.0 return numerator / denominator def _should_shadow(self) -> bool: """Determine if this request should be shadowed.""" return asyncio.get_event_loop().time() % 1 < self.sample_rate async def _alert_divergence( self, query_id: str, query: dict, comparison: ComparisonResult ) -> None: """Alert when shadow results diverge significantly.""" # Log for investigation print(f"""Shadow divergence detected: Query ID: {query_id} Overlap: {comparison.result_overlap_percent:.1%} Correlation: {comparison.score_correlation:.3f} Latency: {comparison.stable_latency_ms:.0f}ms stable, {comparison.shadow_latency_ms:.0f}ms shadow""")Run shadow traffic for at least 24-48 hours to capture diverse query patterns: weekday vs weekend, peak vs off-peak, different user segments. One hour of shadow testing might miss rare but important query types.
The power of blue-green indexing lies in its ability to quickly recover from problems. A well-designed rollback procedure is essential.
If the previous environment is still available and was receiving dual-writes, it is already current with production.
This is the ideal scenario—rollback is literally one API call.
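Against Elasticsearch, that one call is an alias swap. A minimal sketch, assuming the Python client and placeholder index names:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Move the public alias back to the previous environment's index in a single
# atomic request; searches see either the old index or the new one, never neither.
es.indices.update_aliases(body={
    "actions": [
        {"remove": {"index": "products_green_v8_20240116110000", "alias": "products"}},
        {"add": {"index": "products_blue_v7_20240115093000", "alias": "products"}},
    ]
})
```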
If dual-write was disabled and the previous environment is stale:
Alternatively, if data loss is unacceptable:
If both environments are unusable:
| Symptom | Severity | Action |
|---|---|---|
| Slightly higher latency | Low | Monitor, consider fix forward |
| Some queries returning errors | Medium | Rollback if > 1% error rate |
| Wrong results for specific queries | Medium | Assess impact, maybe rollback |
| All queries failing | Critical | Immediate rollback |
| Data corruption detected | Critical | Rollback, investigate |
| Performance degraded 10x | High | Rollback, investigate |
The rollback itself is a single `POST /_aliases` request that swaps the alias back to the previous index; `GET /_alias/products` confirms which index is currently live.

Define a clear rollback window (e.g., 48 hours) after which the previous environment may be deleted. If issues emerge after this window, you'll need to fix forward or do a full rebuild. Communicate this window to stakeholders.
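One way to enforce that window is a scheduled cleanup job that refuses to delete the standby index too early. A minimal sketch, assuming the Python client; the 48-hour constant and the `switched_at` timestamp source are assumptions:

```python
from datetime import datetime, timedelta, timezone

from elasticsearch import Elasticsearch

ROLLBACK_WINDOW = timedelta(hours=48)  # assumed window; match your own policy


def cleanup_previous_index(
    es: Elasticsearch, previous_index: str, switched_at: datetime
) -> None:
    """Delete the standby index only after the rollback window has passed."""
    if datetime.now(timezone.utc) - switched_at < ROLLBACK_WINDOW:
        print(f"Keeping {previous_index}: still inside the rollback window")
        return
    es.indices.delete(index=previous_index)
    print(f"Deleted {previous_index}: rollback window expired")
```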
Technical implementation is only half the battle. Achieving frequent, confident search deployments requires organizational practices that build trust in the process.
Practice in Production (safely):
The goal: deployment becomes routine muscle memory, not a rare event.
Start slow and accelerate as confidence builds:
Maintain comprehensive documentation:
Define and track deployment health:
| Frequency | Automation | Verification | Confidence |
|---|---|---|---|
| Quarterly | Manual | Manual checks | Low, high stress |
| Monthly | Scripted | Checklist-based | Growing |
| Weekly | Fully automated | Automated suite | High |
| Daily/On-demand | Self-service | Continuous | Complete trust |
Acknowledge successful deployments, especially early ones. Moving from quarterly to weekly deployments is a significant achievement. Recognition reinforces the value of investing in deployment infrastructure and encourages continued improvement.
Blue-green indexing transforms search index deployments from high-risk events into routine operations. By maintaining two environments and switching atomically, you gain the confidence to iterate rapidly on search quality.
Module Complete:
You've now mastered the full spectrum of search indexing strategies, from the fundamental choice between real-time and batch indexing, through delta updates and reindexing, to the operational patterns that make search index management routine. These capabilities will serve you whether you're building a startup's first search feature or scaling to billions of documents at a major platform.
You now have a complete mental model for search index management at scale. From the fundamentals of indexing approaches through to sophisticated blue-green deployment patterns, you're equipped to design and operate world-class search infrastructure. The next module explores Search Relevance Tuning—the art of making search results truly useful.