When Alice searches for "python," she's a software developer looking for programming documentation. When Bob searches for the same query, he's a pet enthusiast looking for snake care information. A static ranking that treats all users identically will fail one of them—or both.
Personalization is the art and science of adapting search results to individual users. It transforms search from a one-size-fits-all experience into something that anticipates what this specific user actually wants.
The potential is enormous: Amazon attributes 35% of its revenue to personalized recommendations. Netflix estimates personalization is worth $1 billion annually. But personalization also carries risks—filter bubbles, privacy concerns, and the complexity of building systems that truly understand user intent.
This page explores personalization comprehensively: the signals that enable it, the architectures that implement it, the trade-offs that constrain it, and the best practices that make it effective.
By the end of this page, you will understand the spectrum of personalization approaches, how to build user profiles from behavioral signals, the architecture of real-time personalization systems, privacy-preserving personalization techniques, and how to balance personalization with exploration and serendipity.
Personalization exists on a spectrum from no personalization (everyone sees identical results) to heavy personalization (results are almost entirely determined by user profile). Different points on this spectrum are appropriate for different contexts.
The spectrum visualized:
| Level | Description | Signals Used | Example Use Cases |
|---|---|---|---|
| None | Identical results for all users | Query only | Legal search, academic databases |
| Contextual | Results vary by context, not user | Location, time, device | Local business results, mobile optimization |
| Segment-based | Results for user groups | Demographics, cohorts | Age-appropriate content, language localization |
| History-based | Recent behavior affects ranking | Session history, recent clicks | Continuing interrupted searches |
| Profile-based | Long-term preferences shape results | Full behavioral history, preferences | E-commerce, content platforms |
| Predictive | Anticipate needs before expression | ML models on rich profiles | Proactive recommendations, smart home |
Query types and appropriate personalization levels:
Not all queries benefit equally from personalization. Navigational queries ("gmail login") have one correct answer for every user; broad, ambiguous queries ("python", "jaguar") benefit the most; specific informational queries fall somewhere in between.
The best systems don't apply uniform personalization. They estimate query ambiguity and user confidence, then adjust personalization strength accordingly. A known power user searching an ambiguous query gets heavy personalization. A new user searching a specific query gets minimal personalization.
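This adaptive behavior can be made concrete with a minimal sketch. The function name and both inputs (`query_ambiguity`, `user_confidence`) are hypothetical, assumed to be precomputed scores in [0, 1]:

```python
def personalization_strength(query_ambiguity: float,
                             user_confidence: float,
                             base: float = 0.5) -> float:
    """Scale personalization by how ambiguous the query is and how
    well we understand the user. Both inputs are in [0, 1]."""
    # Specific queries need little personalization regardless of profile depth.
    # Ambiguous queries from well-understood users get the strongest boost.
    strength = base * query_ambiguity * (0.5 + 0.5 * user_confidence)
    return max(0.0, min(1.0, strength))

# A known power user searching an ambiguous query:
heavy = personalization_strength(query_ambiguity=0.9, user_confidence=1.0)
# A new user searching a specific query:
light = personalization_strength(query_ambiguity=0.1, user_confidence=0.0)
assert heavy > light
```

The exact weighting is a design choice; the point is that strength is a function of both signals, not a global constant.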
Personalization requires understanding who the user is and what they want. This understanding is encoded in a user profile—a data structure that captures interests, preferences, behaviors, and context.
Profile data sources:
```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set
from datetime import datetime, timedelta
from enum import Enum

import numpy as np


class InterestLevel(Enum):
    """Confidence level for inferred interests."""
    EXPLICIT = 4   # User stated preference
    STRONG = 3     # Multiple behavioral signals
    MODERATE = 2   # Some behavioral signals
    WEAK = 1       # Single signal or inferred
    NEGATIVE = 0   # User indicated disinterest


@dataclass
class UserInterest:
    """A single interest with confidence and recency."""
    topic: str
    level: InterestLevel
    score: float  # 0.0-1.0 interest strength
    first_seen: datetime
    last_seen: datetime
    signal_count: int  # Number of supporting signals

    def decay(self, half_life_days: float = 30.0) -> float:
        """Apply time decay to interest score."""
        age_days = (datetime.now() - self.last_seen).days
        decay_factor = 0.5 ** (age_days / half_life_days)
        return self.score * decay_factor


@dataclass
class BehavioralEvent:
    """A recorded user behavior."""
    event_type: str  # click, purchase, view, dwell, etc.
    item_id: str
    timestamp: datetime
    metadata: Dict = field(default_factory=dict)
    # metadata examples: dwell_time_seconds, scroll_depth, add_to_cart


@dataclass
class UserProfile:
    """
    Complete user profile for personalization.

    This represents the accumulated understanding of a user
    that drives personalized ranking.
    """
    user_id: str

    # Explicit preferences
    preferred_categories: Set[str] = field(default_factory=set)
    preferred_brands: Set[str] = field(default_factory=set)
    blocked_categories: Set[str] = field(default_factory=set)
    price_preference: str = "medium"  # low, medium, high

    # Inferred interests (derived from behavior)
    interests: Dict[str, UserInterest] = field(default_factory=dict)

    # Recent behavior (for session personalization)
    recent_events: List[BehavioralEvent] = field(default_factory=list)

    # Aggregated statistics
    total_searches: int = 0
    total_purchases: int = 0
    average_order_value: float = 0.0

    # ML embeddings (dense representations)
    interest_embedding: Optional[np.ndarray] = None
    context_embedding: Optional[np.ndarray] = None

    # Segment memberships
    segments: Set[str] = field(default_factory=set)

    # Privacy and compliance
    personalization_consent: bool = True
    data_retention_date: Optional[datetime] = None

    def get_active_interests(self,
                             min_score: float = 0.1,
                             min_level: InterestLevel = InterestLevel.WEAK
                             ) -> List[UserInterest]:
        """Get interests above threshold after decay."""
        active = []
        for interest in self.interests.values():
            decayed_score = interest.decay()
            if decayed_score >= min_score and interest.level.value >= min_level.value:
                active.append(interest)
        return sorted(active, key=lambda i: i.decay(), reverse=True)

    def get_session_interests(self,
                              lookback_minutes: int = 30) -> List[str]:
        """Extract interests from recent session activity."""
        cutoff = datetime.now() - timedelta(minutes=lookback_minutes)
        session_items = [
            e.item_id for e in self.recent_events
            if e.timestamp > cutoff
        ]
        # Would resolve items to categories/topics in a real system
        return session_items


class UserProfileBuilder:
    """
    Builds and updates user profiles from behavioral events.

    Key design decisions:
    - Incremental updates (don't reprocess entire history)
    - Configurable decay to forget old preferences
    - Separate short-term and long-term signals
    - Privacy-aware (respects consent, retention limits)
    """

    def __init__(self,
                 interest_half_life_days: float = 30.0,
                 max_events_retained: int = 1000,
                 min_signals_for_interest: int = 3):
        self.half_life = interest_half_life_days
        self.max_events = max_events_retained
        self.min_signals = min_signals_for_interest

    def process_event(self,
                      profile: UserProfile,
                      event: BehavioralEvent,
                      item_metadata: Dict) -> UserProfile:
        """Update profile based on a new behavioral event."""
        if not profile.personalization_consent:
            return profile  # Respect privacy preference

        # Add to recent events (with eviction)
        profile.recent_events.append(event)
        if len(profile.recent_events) > self.max_events:
            profile.recent_events = profile.recent_events[-self.max_events:]

        # Extract topics from item
        topics = item_metadata.get("topics", [])
        categories = item_metadata.get("categories", [])
        brand = item_metadata.get("brand")

        # Update interests based on event type
        interest_boost = self._event_to_interest_boost(event)
        for topic in topics + categories:
            self._update_interest(profile, topic, interest_boost, event.timestamp)
        if brand:
            self._update_interest(profile, f"brand:{brand}",
                                  interest_boost * 0.5, event.timestamp)

        # Update statistics
        if event.event_type == "search":
            profile.total_searches += 1
        elif event.event_type == "purchase":
            profile.total_purchases += 1
            price = item_metadata.get("price", 0)
            self._update_price_preference(profile, price)

        return profile

    def _event_to_interest_boost(self, event: BehavioralEvent) -> float:
        """Map event types to interest score increases."""
        # Different events indicate different interest levels
        boost_map = {
            "view": 0.1,            # Weak signal
            "click": 0.2,           # Moderate signal
            "dwell": 0.3,           # Spent time = interest
            "add_to_cart": 0.5,     # Strong commercial intent
            "purchase": 1.0,        # Strongest signal
            "save": 0.7,            # Explicit interest
            "share": 0.6,           # Advocacy signal
            "rate_positive": 0.4,
            "rate_negative": -0.5,  # Negative signal
        }
        base_boost = boost_map.get(event.event_type, 0.1)

        # Adjust by dwell time if available (check the longer window first,
        # otherwise the 3-minute branch is unreachable)
        dwell_seconds = event.metadata.get("dwell_time_seconds", 0)
        if dwell_seconds > 180:
            base_boost *= 2.0
        elif dwell_seconds > 60:
            base_boost *= 1.5

        return base_boost

    def _update_interest(self, profile: UserProfile, topic: str,
                         boost: float, timestamp: datetime):
        """Update a single interest in the profile."""
        if topic in profile.interests:
            interest = profile.interests[topic]
            interest.score = min(1.0, interest.score + boost)
            interest.last_seen = timestamp
            interest.signal_count += 1
            # Upgrade level based on signal count
            if interest.signal_count >= 10:
                interest.level = InterestLevel.STRONG
            elif interest.signal_count >= self.min_signals:
                interest.level = InterestLevel.MODERATE
        else:
            profile.interests[topic] = UserInterest(
                topic=topic,
                level=InterestLevel.WEAK,
                score=boost,
                first_seen=timestamp,
                last_seen=timestamp,
                signal_count=1
            )

    def _update_price_preference(self, profile: UserProfile, price: float):
        """Infer price sensitivity from purchase patterns."""
        # Simplified: would use statistical analysis in production
        n = profile.total_purchases
        old_avg = profile.average_order_value
        profile.average_order_value = ((old_avg * (n - 1)) + price) / n

        if profile.average_order_value > 200:
            profile.price_preference = "high"
        elif profile.average_order_value < 50:
            profile.price_preference = "low"
        else:
            profile.price_preference = "medium"


# Example usage
def build_sample_profile():
    builder = UserProfileBuilder()
    profile = UserProfile(user_id="user_123")

    events = [
        BehavioralEvent("view", "prod_001",
                        datetime.now() - timedelta(days=5),
                        {"categories": ["electronics", "headphones"]}),
        BehavioralEvent("click", "prod_001",
                        datetime.now() - timedelta(days=5),
                        {"categories": ["electronics", "headphones"],
                         "brand": "Sony"}),
        BehavioralEvent("purchase", "prod_001",
                        datetime.now() - timedelta(days=4),
                        {"categories": ["electronics", "headphones"],
                         "brand": "Sony", "price": 150}),
        BehavioralEvent("view", "prod_002",
                        datetime.now() - timedelta(days=2),
                        {"categories": ["electronics", "speakers"],
                         "brand": "Bose"}),
    ]

    item_metadata = {
        "prod_001": {"topics": ["audio", "wireless"],
                     "categories": ["electronics", "headphones"],
                     "brand": "Sony", "price": 150},
        "prod_002": {"topics": ["audio", "home"],
                     "categories": ["electronics", "speakers"],
                     "brand": "Bose"},
    }

    for event in events:
        profile = builder.process_event(
            profile, event, item_metadata.get(event.item_id, {}))

    return profile
```

Personalization must happen in real time. When a user searches, you have milliseconds to incorporate their profile into ranking. This requires careful architectural planning.
The personalization pipeline:
```python
import asyncio
import logging
import time
from abc import ABC, abstractmethod
from collections import OrderedDict
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

import numpy as np

logger = logging.getLogger(__name__)


@dataclass
class SearchRequest:
    """Incoming search request with context."""
    query: str
    user_id: Optional[str]
    session_id: str
    device_type: str
    location: Optional[Dict]  # lat, lon
    timestamp: float


@dataclass
class PersonalizedSearchContext:
    """Assembled context for personalized ranking."""
    request: SearchRequest
    user_profile: Optional['UserProfile']
    session_signals: List['BehavioralEvent']
    personalization_enabled: bool
    personalization_strength: float  # 0.0 to 1.0


class LRUCache:
    """Minimal in-process LRU cache (illustrative stand-in)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # Mark as recently used
        return self._data[key]

    def set(self, key: str, value: Any) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # Evict least recently used


class ProfileStore(ABC):
    """Abstract interface for user profile storage."""

    @abstractmethod
    async def get_profile(self, user_id: str) -> Optional['UserProfile']:
        ...

    @abstractmethod
    async def get_session_events(self, session_id: str) -> List['BehavioralEvent']:
        ...


class RedisProfileStore(ProfileStore):
    """
    Redis-based profile store for low-latency access.

    Design considerations:
    - Profile data is serialized to Redis hashes
    - Hot profiles are cached in local memory (LRU)
    - Session events use Redis lists with TTL
    - Interest embeddings stored in Redis for vector search
    """

    def __init__(self, redis_client, local_cache_size: int = 10000):
        self.redis = redis_client
        self.local_cache = LRUCache(local_cache_size)

    async def get_profile(self, user_id: str) -> Optional['UserProfile']:
        # Check local cache first (fastest)
        cached = self.local_cache.get(user_id)
        if cached:
            return cached

        # Fetch from Redis
        profile_data = await self.redis.hgetall(f"profile:{user_id}")
        if not profile_data:
            return None

        profile = self._deserialize_profile(profile_data)  # helper omitted
        self.local_cache.set(user_id, profile)
        return profile

    async def get_session_events(self, session_id: str) -> List['BehavioralEvent']:
        # Get last N events from session list
        events_data = await self.redis.lrange(f"session:{session_id}", 0, 50)
        return [self._deserialize_event(e) for e in events_data]  # helper omitted


class PersonalizationService:
    """
    Main personalization service that assembles search context.

    This service is called on every search request to prepare
    the personalization context that influences ranking.

    Latency budget: < 10ms (we can't slow down search)
    """

    def __init__(self,
                 profile_store: ProfileStore,
                 default_strength: float = 0.5,
                 anonymous_strength: float = 0.2):
        self.profiles = profile_store
        self.default_strength = default_strength
        self.anonymous_strength = anonymous_strength

    async def prepare_context(self, request: SearchRequest
                              ) -> PersonalizedSearchContext:
        """
        Prepare personalization context for a search request.

        Must be fast - called on every query.
        """
        start = time.monotonic()

        # Fetch profile and session events in parallel
        if request.user_id:
            profile, session_events = await asyncio.gather(
                self.profiles.get_profile(request.user_id),
                self.profiles.get_session_events(request.session_id)
            )
        else:
            profile = None
            session_events = await self.profiles.get_session_events(
                request.session_id)

        # Determine personalization strength
        strength = self._calculate_strength(profile, session_events, request)
        enabled = strength > 0 and bool(profile or session_events)

        elapsed = time.monotonic() - start
        if elapsed > 0.010:  # > 10ms
            logger.warning("Slow personalization: %.3fs", elapsed)

        return PersonalizedSearchContext(
            request=request,
            user_profile=profile,
            session_signals=session_events,
            personalization_enabled=enabled,
            personalization_strength=strength
        )

    def _calculate_strength(self,
                            profile: Optional['UserProfile'],
                            session_events: List['BehavioralEvent'],
                            request: SearchRequest) -> float:
        """
        Determine how strongly to personalize.

        Factors:
        - Profile completeness (rich profile -> more confidence)
        - Query ambiguity (ambiguous -> more personalization helps)
        - Session activity (active session -> session signals matter)
        - Explicit consent
        """
        if profile and not profile.personalization_consent:
            return 0.0

        base_strength = self.anonymous_strength
        if profile:
            base_strength = self.default_strength

            # Boost for rich profiles
            interest_count = len(profile.get_active_interests())
            if interest_count >= 20:
                base_strength *= 1.3
            elif interest_count >= 10:
                base_strength *= 1.15

            # Boost for high-value users
            if profile.total_purchases > 10:
                base_strength *= 1.2

        # Session recency boost
        if session_events:
            recency = time.time() - session_events[-1].timestamp.timestamp()
            if recency < 300:  # Active in last 5 minutes
                base_strength *= 1.2

        return min(1.0, base_strength)


class PersonalizedRanker:
    """
    Applies personalization to search results.

    Two-phase approach:
    1. Query modification: Inject boost terms for user interests
    2. Result re-ranking: Adjust scores based on profile match
    """

    def __init__(self, search_client):
        self.search = search_client

    async def search_with_personalization(
        self,
        query: str,
        context: PersonalizedSearchContext,
        num_results: int = 10
    ) -> List[Dict]:
        """Execute personalized search."""
        if not context.personalization_enabled:
            return await self.search.basic_search(query, num_results)

        # Phase 1: Build personalized query
        personalized_query = self._build_personalized_query(query, context)

        # Phase 2: Execute and optionally re-rank
        results = await self.search.execute(personalized_query, num_results * 3)

        if (context.user_profile
                and context.user_profile.interest_embedding is not None):
            results = self._rerank_by_embedding(
                results,
                context.user_profile.interest_embedding
            )

        return results[:num_results]

    def _build_personalized_query(
        self,
        query: str,
        context: PersonalizedSearchContext
    ) -> Dict:
        """Construct query with personalization boosts."""
        strength = context.personalization_strength
        should_clauses = []

        # Add interest boosts from profile
        if context.user_profile:
            for interest in context.user_profile.get_active_interests()[:10]:
                boost = interest.decay() * strength * 0.5
                should_clauses.append({
                    "term": {
                        "categories": {
                            "value": interest.topic,
                            "boost": boost
                        }
                    }
                })

            # Brand preference boosts
            for brand in context.user_profile.preferred_brands:
                should_clauses.append({
                    "term": {
                        "brand.keyword": {
                            "value": brand,
                            "boost": strength * 0.8
                        }
                    }
                })

        # Session-based boosts (recency matters more)
        for event in context.session_signals[-5:]:  # Last 5 events
            should_clauses.append({
                "term": {
                    "category": {
                        "value": event.metadata.get("category", ""),
                        "boost": strength * 1.5
                    }
                }
            })

        return {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["name^3", "description"]
                    }
                },
                "should": should_clauses
            }
        }

    def _rerank_by_embedding(
        self,
        results: List[Dict],
        user_embedding: np.ndarray
    ) -> List[Dict]:
        """
        Re-rank results by similarity to user interest embedding.

        Embedding-based personalization captures semantic similarity
        beyond keyword matching.
        """
        for result in results:
            if "embedding" in result:
                similarity = np.dot(user_embedding, result["embedding"])
                result["personalization_score"] = similarity
                result["final_score"] = (
                    result["text_score"] * 0.7 + similarity * 0.3
                )
            else:
                result["final_score"] = result["text_score"]

        return sorted(results, key=lambda r: -r["final_score"])
```

Personalization cannot add significant latency to search. Profile fetches should be cached and parallelized. If personalization would add >20ms, consider skipping it for that request. Users notice 100ms delays; they abandon after 200ms. Don't let personalization hurt the baseline experience.
Even without user profiles, we can personalize based on context—signals available from the current request that indicate what the user might want.
Available contextual signals:
| Signal | What It Tells Us | How to Use It | Privacy Level |
|---|---|---|---|
| Location (IP/GPS) | Geographic relevance | Boost local results, currency, language | Medium - can be sensitive |
| Time of day | Temporal intent patterns | Breakfast queries at 7am vs 7pm | Low - not personal |
| Day of week | Work vs leisure patterns | Business queries Mon-Fri | Low - not personal |
| Device type | Mobile vs desktop needs | Mobile-friendly formatting, quick answers | Low - technical |
| Browser language | Language preference | Localized results | Low - explicit setting |
| Referrer URL | Entry context | Coming from review site → review intent | Medium - browsing context |
| Search within session | Query refinement | Building on previous queries | Medium - session tracking |
```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class RequestContext:
    """Contextual signals from the current request."""
    # Geographic
    country_code: str
    region: Optional[str]
    city: Optional[str]
    latitude: Optional[float]
    longitude: Optional[float]

    # Temporal
    local_time: datetime
    timezone: str

    # Technical
    device_type: str  # mobile, tablet, desktop
    os: str
    browser: str
    browser_language: str
    screen_width: int

    # Session
    referrer: Optional[str]
    queries_in_session: int
    session_duration_seconds: int


class ContextualPersonalizer:
    """
    Applies contextual personalization without user profiles.

    This works for anonymous users and augments profile-based
    personalization for known users.
    """

    def __init__(self):
        # Time-of-day intent patterns (learned from historical data)
        self.time_patterns = {
            "breakfast": (6, 10),
            "lunch": (11, 14),
            "dinner": (17, 21),
            "nightlife": (21, 2),
            "morning_commute": (7, 9),
            "evening_commute": (17, 19),
        }

        # Device-specific preferences
        self.device_preferences = {
            "mobile": {
                "prefer_quick_answers": True,
                "prefer_mobile_friendly": True,
                "max_text_length": 150,
            },
            "desktop": {
                "prefer_detailed_content": True,
                "prefer_mobile_friendly": False,
                "max_text_length": 500,
            },
        }

    def build_context_boosts(self, context: RequestContext,
                             query: str) -> List[Dict]:
        """Generate boost clauses from context."""
        boosts = []

        # Geographic boosts
        if context.city:
            boosts.extend(self._geo_boosts(context))

        # Temporal boosts
        boosts.extend(self._temporal_boosts(context, query))

        # Device-specific boosts
        boosts.extend(self._device_boosts(context))

        # Language boosts
        boosts.extend(self._language_boosts(context))

        return boosts

    def _geo_boosts(self, context: RequestContext) -> List[Dict]:
        """Boost results relevant to user's location."""
        boosts = []

        # Local business preference
        if context.latitude and context.longitude:
            boosts.append({
                "function_score": {
                    "functions": [{
                        "gauss": {
                            "location": {
                                "origin": {
                                    "lat": context.latitude,
                                    "lon": context.longitude
                                },
                                "scale": "10km",
                                "decay": 0.5
                            }
                        },
                        "weight": 2.0
                    }]
                }
            })

        # Country-specific content
        boosts.append({
            "term": {
                "country": {
                    "value": context.country_code,
                    "boost": 1.5
                }
            }
        })

        # Regional boost
        if context.region:
            boosts.append({
                "term": {
                    "region": {
                        "value": context.region,
                        "boost": 1.3
                    }
                }
            })

        return boosts

    def _temporal_boosts(self, context: RequestContext,
                         query: str) -> List[Dict]:
        """Boost based on time of day patterns."""
        boosts = []
        hour = context.local_time.hour
        day_of_week = context.local_time.weekday()

        # Check for food-related queries and time patterns
        food_keywords = ["restaurant", "food", "eat",
                         "dinner", "lunch", "breakfast"]
        is_food_query = any(kw in query.lower() for kw in food_keywords)

        if is_food_query:
            # Determine meal type from time
            for meal, (start, end) in self.time_patterns.items():
                if meal in ["breakfast", "lunch", "dinner"]:
                    # Handle windows that wrap past midnight
                    in_window = (start <= hour <= end) if start < end \
                        else (hour >= start or hour <= end)
                    if in_window:
                        boosts.append({
                            "term": {
                                "meal_type": {
                                    "value": meal,
                                    "boost": 1.5
                                }
                            }
                        })

        # Weekend vs weekday patterns
        is_weekend = day_of_week >= 5
        if is_weekend:
            boosts.append({
                "term": {
                    "weekend_friendly": {
                        "value": True,
                        "boost": 1.2
                    }
                }
            })

        return boosts

    def _device_boosts(self, context: RequestContext) -> List[Dict]:
        """Boost content suitable for the device."""
        boosts = []

        if context.device_type == "mobile":
            boosts.append({
                "term": {
                    "mobile_optimized": {
                        "value": True,
                        "boost": 1.4
                    }
                }
            })

            # Penalize very long content on mobile
            boosts.append({
                "function_score": {
                    "functions": [{
                        "linear": {
                            "content_length": {
                                "origin": 500,
                                "scale": 2000,
                                "decay": 0.5
                            }
                        },
                        "weight": 0.8
                    }]
                }
            })

        return boosts

    def _language_boosts(self, context: RequestContext) -> List[Dict]:
        """Boost content in user's preferred language."""
        lang = context.browser_language.split("-")[0]  # "en-US" -> "en"
        return [{
            "term": {
                "language": {
                    "value": lang,
                    "boost": 2.0
                }
            }
        }]

    def infer_intent_from_context(self, context: RequestContext,
                                  query: str) -> Dict[str, float]:
        """
        Infer likely user intent from contextual signals.

        Returns probability distribution over intent types.
        """
        intent_scores = {
            "navigational": 0.2,
            "informational": 0.4,
            "transactional": 0.2,
            "local": 0.1,
            "entertainment": 0.1,
        }

        hour = context.local_time.hour
        is_weekend = context.local_time.weekday() >= 5
        is_mobile = context.device_type == "mobile"

        # Time-based adjustments
        if 9 <= hour <= 17 and not is_weekend:
            # Business hours: more work-related queries
            intent_scores["informational"] += 0.1
            intent_scores["entertainment"] -= 0.05

        if hour >= 20 or is_weekend:
            # Evening/weekend: more entertainment
            intent_scores["entertainment"] += 0.1
            intent_scores["informational"] -= 0.05

        # Device-based adjustments
        if is_mobile:
            intent_scores["local"] += 0.15
            intent_scores["navigational"] += 0.1

        # Location signals
        if context.latitude and context.longitude:
            intent_scores["local"] += 0.1

        # Normalize
        total = sum(intent_scores.values())
        return {k: v / total for k, v in intent_scores.items()}
```

Personalization requires user data, but users increasingly—and rightfully—demand privacy. Regulations (GDPR, CCPA) mandate explicit consent, data minimization, and right to deletion. Building personalization systems that respect privacy isn't just ethical; it's legally required.
Privacy-preserving approaches:
```python
import hashlib
import math
import random
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class PrivacyConfig:
    """Privacy configuration for personalization."""
    # Data retention
    max_history_days: int = 30
    max_events_stored: int = 100

    # Consent requirements
    require_explicit_consent: bool = True
    allow_cross_site_tracking: bool = False

    # Anonymization
    hash_user_id: bool = True
    strip_ip_last_octet: bool = True

    # Data minimization
    store_only_categories: bool = True  # Don't store specific items
    aggregate_by_day: bool = True       # Don't store exact timestamps


class PrivacyAwareProfileStore:
    """
    Profile store with built-in privacy protections.

    Key principles:
    1. Data minimization: Store only what's needed
    2. Purpose limitation: Use data only for stated purpose
    3. Storage limitation: Delete data after retention period
    4. Consent management: Respect user consent choices
    """

    def __init__(self, config: PrivacyConfig):
        self.config = config

    def store_event(self, user_id: str, event: Dict,
                    consent_record: Dict) -> bool:
        """Store an event with privacy protections applied."""
        # Check consent first
        if not self._check_consent(consent_record, "personalization"):
            return False

        # Apply data minimization
        minimized_event = self._minimize_event(event)

        # Hash user ID if configured
        storage_id = self._get_storage_id(user_id)

        # Set automatic expiration
        expiry = datetime.now() + timedelta(days=self.config.max_history_days)

        # Store with expiration (e.g., Redis TTL)
        # self.storage.set(storage_id, minimized_event, expiry)
        return True

    def _check_consent(self, consent_record: Dict, purpose: str) -> bool:
        """Verify user has consented to this data use."""
        if not self.config.require_explicit_consent:
            return True
        return (
            consent_record.get("personalization", False)
            and consent_record.get("purpose_" + purpose, False)
        )

    def _minimize_event(self, event: Dict) -> Dict:
        """Apply data minimization rules."""
        minimized = {}

        if self.config.store_only_categories:
            # Store category, not specific product
            minimized["category"] = event.get("category")
            # Don't store: item_id, item_name, etc.
        else:
            minimized = event.copy()

        if self.config.aggregate_by_day:
            # Round timestamp to day
            if "timestamp" in event:
                ts = event["timestamp"]
                minimized["date"] = ts.date() if hasattr(ts, 'date') else ts[:10]

        return minimized

    def _get_storage_id(self, user_id: str) -> str:
        """Generate privacy-preserving storage ID."""
        if self.config.hash_user_id:
            # Use cryptographic hash; can't reverse to identify user
            return hashlib.sha256(user_id.encode()).hexdigest()
        return user_id

    def delete_user_data(self, user_id: str) -> bool:
        """
        GDPR right to erasure / CCPA right to delete.

        Must delete all user data when requested.
        """
        storage_id = self._get_storage_id(user_id)

        # Delete profile
        # self.storage.delete(f"profile:{storage_id}")
        # Delete all events
        # self.storage.delete(f"events:{storage_id}")
        # Delete embeddings
        # self.storage.delete(f"embedding:{storage_id}")

        # Log deletion for compliance
        self._log_deletion(user_id)
        return True

    def _log_deletion(self, user_id: str) -> None:
        # Audit trail for compliance (persistent logging omitted here)
        pass

    def export_user_data(self, user_id: str) -> Dict:
        """
        GDPR right to portability / CCPA right to know.

        Export all data about a user in machine-readable format.
        """
        storage_id = self._get_storage_id(user_id)
        return {
            "profile": {},   # self.storage.get(f"profile:{storage_id}"),
            "events": [],    # self.storage.lrange(f"events:{storage_id}", 0, -1),
            "preferences": {},
            "export_date": datetime.now().isoformat(),
        }


class DifferentiallyPrivateAggregator:
    """
    Aggregate personalization signals with differential privacy.

    Differential privacy adds calibrated noise to prevent identifying
    individuals from aggregate statistics.
    """

    def __init__(self, epsilon: float = 1.0):
        """
        epsilon: Privacy budget. Lower = more privacy, more noise.
        Typical values: 0.1 (strong privacy) to 10 (weak privacy)
        """
        self.epsilon = epsilon

    def add_laplace_noise(self, value: float,
                          sensitivity: float = 1.0) -> float:
        """
        Add Laplace noise for epsilon-differential privacy.

        Sensitivity: max change to output from a single record change.
        For counting queries, sensitivity = 1.
        For averages, sensitivity = range / n.
        """
        scale = sensitivity / self.epsilon
        # Sample Laplace(0, scale) via inverse transform sampling
        u = random.uniform(-0.5, 0.5)
        noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
        return value + noise

    def private_category_counts(self, events: List[Dict],
                                categories: List[str]) -> Dict[str, int]:
        """
        Get differentially private category interest counts.

        Even if an attacker knows all events except one, they cannot
        determine the last event's category.
        """
        # True counts
        counts = {cat: 0 for cat in categories}
        for event in events:
            cat = event.get("category")
            if cat in counts:
                counts[cat] += 1

        # Add noise to each count
        private_counts = {
            cat: max(0, int(self.add_laplace_noise(count)))
            for cat, count in counts.items()
        }
        return private_counts
```

Stronger privacy means weaker personalization. With no data, you can only do contextual personalization. With full behavioral tracking, you can achieve highly relevant results. Most systems operate somewhere in between, balancing user experience improvements against privacy risks. Make this trade-off explicit and give users control.
Personalization optimizes for what users have liked before. This creates filter bubbles—users see increasingly narrow content matching their past preferences, missing potentially valuable content outside their historical interests.
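This feedback loop can be demonstrated with a toy simulation (all parameters here are illustrative, not from the original system): a recommender that always exploits the historically top category collapses to showing a single category, while even a small exploration rate preserves variety.

```python
import random

def simulate(exploration_rate: float, steps: int = 500, n_categories: int = 5) -> int:
    """Toy model: always show the most-clicked category, except with
    probability `exploration_rate` pick one at random. Returns how many
    distinct categories the user ends up seeing."""
    random.seed(0)
    counts = [1] * n_categories  # uniform prior interest
    seen = set()
    for _ in range(steps):
        if random.random() < exploration_rate:
            cat = random.randrange(n_categories)   # explore
        else:
            cat = counts.index(max(counts))        # pure exploitation
        seen.add(cat)
        counts[cat] += 1  # user engages; interest is reinforced
    return len(seen)

print(simulate(0.0))  # prints 1: with no exploration, the feed collapses
print(simulate(0.1))  # modest exploration keeps multiple categories in play
```

The zero-exploration run never escapes its initial winner because every impression further reinforces it, which is exactly the filter-bubble dynamic described above.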
The risks of over-personalization include narrowing interests, reduced discovery, and reinforcement of the user's existing biases.

The solution is exploration and diversity:
Well-designed personalization systems balance exploitation (showing what users are likely to want) with exploration (showing new things to learn more about user preferences and introduce variety).
```python
from dataclasses import dataclass
from typing import Dict, List
import random


@dataclass
class SearchResult:
    id: str
    score: float
    categories: List[str]
    is_familiar: bool  # User has seen similar content


class DiversityOptimizer:
    """
    Ensure result diversity while maintaining personalization.

    Key principle: The best result isn't always the most relevant.
    Users benefit from variety, and the system benefits from exploration.
    """

    @staticmethod
    def maximal_marginal_relevance(
        results: List[SearchResult],
        lambda_param: float = 0.7,
        top_k: int = 10
    ) -> List[SearchResult]:
        """
        Maximal Marginal Relevance (MMR) diversification.

        Balances relevance and diversity by penalizing results that are
        too similar to already-selected results.

        lambda_param: 1.0 = pure relevance, 0.0 = pure diversity
        """
        if not results:
            return []

        selected = [results[0]]  # Start with most relevant
        remaining = results[1:]

        while len(selected) < top_k and remaining:
            best_idx = -1
            best_mmr = -float('inf')

            for i, candidate in enumerate(remaining):
                # Relevance score
                relevance = candidate.score

                # Diversity penalty: similarity to already-selected results
                max_similarity = max(
                    DiversityOptimizer._category_similarity(candidate, s)
                    for s in selected
                )

                # MMR score balances both
                mmr = lambda_param * relevance - (1 - lambda_param) * max_similarity
                if mmr > best_mmr:
                    best_mmr = mmr
                    best_idx = i

            if best_idx >= 0:
                selected.append(remaining.pop(best_idx))

        return selected

    @staticmethod
    def _category_similarity(r1: SearchResult, r2: SearchResult) -> float:
        """Jaccard similarity between result categories."""
        set1 = set(r1.categories)
        set2 = set(r2.categories)
        if not set1 or not set2:
            return 0.0
        return len(set1 & set2) / len(set1 | set2)


class ExplorationStrategies:
    """
    Strategies for incorporating exploration into personalized results.
    """

    @staticmethod
    def epsilon_greedy(
        personalized_results: List[SearchResult],
        exploration_pool: List[SearchResult],
        epsilon: float = 0.1
    ) -> List[SearchResult]:
        """
        With probability epsilon, swap in an exploratory result.

        Simple but effective. epsilon=0.1 means 10% of results are
        exploratory items from outside the user's profile.
        """
        results = personalized_results.copy()
        for i in range(len(results)):
            if random.random() < epsilon and exploration_pool:
                results[i] = exploration_pool.pop(
                    random.randint(0, len(exploration_pool) - 1)
                )
        return results

    @staticmethod
    def thompson_sampling_exploration(
        candidates: List[Dict],
        user_history: Dict[str, int],  # category -> interaction count
        prior_alpha: float = 1.0,
        prior_beta: float = 1.0
    ) -> List[Dict]:
        """
        Thompson Sampling for exploration-exploitation balance.

        Treats personalization as a multi-armed bandit problem.
        Categories with uncertain reward get more exploration.
        Categories with known reward get more exploitation.
        """
        scored_candidates = []
        for candidate in candidates:
            category = candidate.get("category", "unknown")

            # Historical success rate for this category
            successes = user_history.get(f"{category}_successes", 0)
            failures = user_history.get(f"{category}_failures", 0)

            # Sample from Beta distribution (uncertainty modeling)
            alpha = prior_alpha + successes
            beta = prior_beta + failures
            sampled_score = random.betavariate(alpha, beta)

            # Combine with base relevance score
            base_score = candidate.get("score", 0.5)
            exploration_score = 0.5 * base_score + 0.5 * sampled_score

            scored_candidates.append({
                **candidate,
                "exploration_score": exploration_score
            })

        return sorted(scored_candidates, key=lambda x: -x["exploration_score"])

    @staticmethod
    def inject_serendipity_slots(
        personalized_results: List[SearchResult],
        serendipity_candidates: List[SearchResult],
        positions: List[int] = [3, 7]
    ) -> List[SearchResult]:
        """
        Reserve specific slots for serendipitous content.

        Positions 3 and 7 (1-indexed: 4th and 8th) show content that's
        intentionally outside the user's normal preferences.
        """
        results = personalized_results.copy()
        for pos in positions:
            if pos < len(results) and serendipity_candidates:
                # Insert serendipitous result
                serendipity_item = serendipity_candidates.pop(0)
                serendipity_item.is_exploratory = True
                results.insert(pos, serendipity_item)

        # Trim back to the original page size after insertions
        return results[:len(personalized_results)]


class FeedbackLoop:
    """
    Use exploration results to improve personalization over time.
    """

    def __init__(self, learning_rate: float = 0.1):
        self.learning_rate = learning_rate

    def record_exploration_result(
        self,
        user_id: str,
        item_id: str,
        item_category: str,
        was_successful: bool  # Click, purchase, long dwell, etc.
    ):
        """
        Record user response to exploratory content.

        If the user engages with exploratory content, expand their profile.
        If they ignore it, that's valuable signal too.
        """
        # Pseudocode for profile update
        if was_successful:
            # This is a new interest! Add to profile with moderate confidence
            # profile.add_interest(item_category, level=MODERATE, score=0.4)
            pass
        else:
            # Note the lack of interest (but don't permanently block)
            # profile.note_exploration_failure(item_category)
            pass

    def adjust_exploration_rate(
        self,
        user_id: str,
        exploration_success_rate: float,
        current_epsilon: float
    ) -> float:
        """
        Dynamically adjust the exploration rate per user.

        Users who often engage with exploratory content get more exploration.
        Users who always ignore it get less.
        """
        target_success_rate = 0.15  # We want 15% of explorations to succeed

        if exploration_success_rate > target_success_rate:
            # Exploration is working well, increase it
            return min(0.3, current_epsilon * 1.1)
        else:
            # Exploration isn't landing, reduce it
            return max(0.05, current_epsilon * 0.9)
```

Track separate metrics for personalized (exploitation) and exploratory results. Success rate for exploration should be lower than exploitation (by definition, it's showing less-certain content).
But if exploration success rate drops to zero, you're not finding new user interests. If it's too high, your personalization is too conservative.
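One way to implement that separation is to bucket impressions and clicks by slot type. This is a minimal sketch with made-up engagement data; `SlotMetrics` and the slot names are illustrative, not part of the system above:

```python
from collections import defaultdict

class SlotMetrics:
    """Track click-through rate separately per slot type (exploit vs explore)."""

    def __init__(self):
        self.impressions = defaultdict(int)
        self.clicks = defaultdict(int)

    def record(self, slot_type: str, clicked: bool) -> None:
        self.impressions[slot_type] += 1
        if clicked:
            self.clicks[slot_type] += 1

    def ctr(self, slot_type: str) -> float:
        n = self.impressions[slot_type]
        return self.clicks[slot_type] / n if n else 0.0

m = SlotMetrics()
for clicked in [True, True, False, True]:   # personalized slots
    m.record("exploit", clicked)
for clicked in [True, False, False, False]: # exploratory slots
    m.record("explore", clicked)
print(m.ctr("exploit"), m.ctr("explore"))   # prints 0.75 0.25
```

Comparing the two rates over time gives the signal `adjust_exploration_rate` needs: an explore CTR stuck at zero means your exploration pool is missing, while an explore CTR approaching the exploit CTR suggests the personalization is too conservative.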
Personalization transforms search from a generic experience into something uniquely tailored to each user. When done well, it dramatically improves relevance for ambiguous queries and helps users discover content they'll love. When done poorly, it traps users in filter bubbles, violates privacy, and adds latency without value.
What's next:
We've explored how to build personalization. But how do we know it's working? The next page covers A/B testing relevance—how to rigorously measure whether relevance changes (including personalization) actually improve user satisfaction.
You now understand search personalization comprehensively—from the spectrum of approaches to profile construction, real-time architecture, privacy mechanisms, and anti-filter-bubble techniques. Next, we'll learn how to measure whether all this complexity is actually making search better through A/B testing.