"What's happening?" is Twitter's core value proposition. The trending topics feature surfaces the most-discussed subjects in real-time, helping users discover breaking news, viral moments, and cultural events as they unfold.
The Trending Challenge:
Detecting trends sounds simple: count hashtags, show the most popular ones. But at Twitter's scale, this naive approach fails spectacularly: perennially popular hashtags crowd out genuinely new topics, coordinated campaigns and bots can manufacture volume, and exactly counting hundreds of millions of unique terms in real time doesn't fit in memory.
This page explores how to build a trending system that handles these challenges at scale.
By the end of this page, you'll understand streaming data pipelines for trend detection, approximate counting algorithms (Count-Min Sketch, HyperLogLog), velocity-based trend scoring, personalization strategies, and anti-gaming measures.
Before building the system, we must precisely define what makes something "trending." This definition drives our algorithm design.
A common mistake is equating "trending" with "popular." They're fundamentally different:
A topic is "trending" when its current velocity significantly exceeds its baseline:
```
Trending Score Formula
======================

Core insight: Trending = Current Volume / Expected Volume

    trending_score = (current_count - expected_count) / sqrt(expected_count)

Where:
- current_count: tweets in the last N minutes (e.g., 15 min)
- expected_count: typical tweets in this time window (historical baseline)
- sqrt(expected_count): normalization factor (variance adjustment)

Example calculations:

Topic: #SuperBowl
- Current (last 15 min): 50,000 tweets
- Expected (typical Sunday 6pm, 15 min): 200 tweets
- Score: (50000 - 200) / sqrt(200) ≈ 3521

Topic: #love
- Current (last 15 min): 10,000 tweets
- Expected (typical 15 min): 9,800 tweets
- Score: (10000 - 9800) / sqrt(9800) ≈ 2.02

Result: #SuperBowl (score 3521) >> #love (score 2.02)
#SuperBowl is trending because its volume far exceeds its baseline; #love,
despite its consistently large volume, shows no unusual spike.
```

Trends also have lifecycles: a topic that was trending 2 hours ago might no longer be newsworthy, which is why the scorer later in this page includes a freshness decay.
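To make the formula concrete, here is a minimal sketch of the same computation in Python (the function name and the guard against a zero baseline are choices made for illustration):

```python
import math

def trending_score(current_count: int, expected_count: float) -> float:
    """Velocity score: how far current volume sits above its historical baseline."""
    expected = max(expected_count, 1.0)  # guard against a zero baseline
    return (current_count - expected) / math.sqrt(expected)

print(trending_score(50_000, 200))    # ≈ 3521 -> #SuperBowl is trending
print(trending_score(10_000, 9_800))  # ≈ 2.02 -> #love is merely popular
```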
When asked to design trending topics, immediately clarify: 'Are we looking for absolute popularity or unusual velocity?' This demonstrates understanding of the core challenge. Most interviewers want velocity-based trending.
Processing 6,000+ tweets per second in real time requires a streaming architecture. Batch processing is too slow: by the time batch results are ready, the trend might have passed.
Each tweet must be parsed to extract trendable entities (hashtags, keywords, named entities):
"""Entity Extractor for Trend Detection Extracts trending-relevant entities from tweets. Goes beyondsimple hashtag extraction to capture keyword-based trends.""" import refrom typing import List, Set, Dictfrom dataclasses import dataclassimport spacy @dataclassclass TrendEntity: text: str entity_type: str # hashtag, keyword, person, location, event normalized: str # lowercase, stemmed version for counting weight: float # importance weight class TweetEntityExtractor: def __init__(self): self.nlp = spacy.load("en_core_web_sm") self.stopwords = self._load_stopwords() self.common_hashtags = self._load_common_hashtags() def extract_entities(self, tweet: str) -> List[TrendEntity]: """ Extract all trendable entities from a tweet. Entity types: - Hashtags: #SuperBowl - Mentions (for person trends): @elonmusk - Keywords: 2-3 word phrases that spike - Named entities: detected by NLP (people, orgs, events) """ entities = [] # 1. Extract hashtags hashtags = self._extract_hashtags(tweet) entities.extend(hashtags) # 2. Extract mentioned usernames (for person trends) mentions = self._extract_mentions(tweet) entities.extend(mentions) # 3. Extract named entities via NLP named_entities = self._extract_named_entities(tweet) entities.extend(named_entities) # 4. Extract key phrases phrases = self._extract_key_phrases(tweet) entities.extend(phrases) return entities def _extract_hashtags(self, tweet: str) -> List[TrendEntity]: """Extract hashtags with filtering.""" hashtag_pattern = r'#(w+)' matches = re.findall(hashtag_pattern, tweet) entities = [] for tag in matches: normalized = tag.lower() # Skip common/spammy hashtags if normalized in self.common_hashtags: continue entities.append(TrendEntity( text=f"#{tag}", entity_type="hashtag", normalized=normalized, weight=1.0 )) return entities def _extract_named_entities(self, tweet: str) -> List[TrendEntity]: """ Use NLP to extract named entities. Captures trends that don't use hashtags. Example: "RIP Steve Jobs" wouldn't have a hashtag but NLP detects "Steve Jobs" as a PERSON entity. """ doc = self.nlp(tweet) entities = [] for ent in doc.ents: if ent.label_ in {"PERSON", "ORG", "EVENT", "GPE", "LOC"}: entities.append(TrendEntity( text=ent.text, entity_type=ent.label_.lower(), normalized=ent.text.lower().strip(), weight=0.8 # Slightly lower weight than hashtags )) return entities def _extract_key_phrases(self, tweet: str) -> List[TrendEntity]: """ Extract significant 2-3 word phrases. Captures trends like "World Cup Final" without hashtags. """ doc = self.nlp(tweet) # Extract noun chunks entities = [] for chunk in doc.noun_chunks: words = [token.text for token in chunk if not token.is_stop and token.is_alpha] if 2 <= len(words) <= 3: phrase = " ".join(words) entities.append(TrendEntity( text=phrase, entity_type="phrase", normalized=phrase.lower(), weight=0.6 )) return entities123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
"""Trend Stream Processor Uses Apache Flink-style windowed aggregation to count entitiesin real-time sliding windows. Built for exactly-once processingsemantics and horizontal scaling.""" from typing import Dict, Listfrom collections import defaultdictimport time class TrendStreamProcessor: def __init__( self, window_size_seconds: int = 900, # 15-minute window slide_interval_seconds: int = 60 # Update every minute ): self.window_size = window_size_seconds self.slide_interval = slide_interval_seconds # In-memory counts for current window # In production: distributed state store (RocksDB, Redis) self.entity_counts: Dict[str, Dict[str, int]] = defaultdict( lambda: defaultdict(int) ) # Location-based counts for local trends self.location_counts: Dict[str, Dict[str, int]] = defaultdict( lambda: defaultdict(int) ) def process_tweet(self, tweet: Dict): """ Process a single tweet event. Called for each tweet as it arrives in the stream. """ timestamp = tweet["created_at"] window_key = self._get_window_key(timestamp) location = tweet.get("location", "global") # Extract entities entities = self.entity_extractor.extract_entities(tweet["text"]) # Update counts for entity in entities: # Global count self.entity_counts[window_key][entity.normalized] += 1 # Location-specific count loc_key = f"{window_key}:{location}" self.location_counts[loc_key][entity.normalized] += 1 def get_sliding_window_counts( self, location: str = "global" ) -> Dict[str, int]: """ Get aggregate counts across all windows in the sliding window. Example: For 15-min window, sum counts from last 15 one-minute buckets. """ current_time = time.time() counts = defaultdict(int) # Sum counts from all windows in the sliding range for bucket in range(0, self.window_size, 60): window_time = current_time - bucket window_key = self._get_window_key(window_time) if location == "global": for entity, count in self.entity_counts[window_key].items(): counts[entity] += count else: loc_key = f"{window_key}:{location}" for entity, count in self.location_counts[loc_key].items(): counts[entity] += count return dict(counts) def _get_window_key(self, timestamp: float) -> str: """Get bucket key for a timestamp (1-minute buckets).""" bucket = int(timestamp // 60) return f"w:{bucket}" def cleanup_old_windows(self): """ Remove windows older than retention period. Called periodically to prevent memory growth. """ current_bucket = int(time.time() // 60) retention_buckets = self.window_size // 60 + 10 # Keep some buffer old_keys = [ key for key in self.entity_counts.keys() if int(key.split(":")[1]) < current_bucket - retention_buckets ] for key in old_keys: del self.entity_counts[key]Real implementations use Apache Flink or Kafka Streams for stateful stream processing. These frameworks handle state checkpointing, exactly-once semantics, and fault recovery automatically. The concepts above illustrate the core logic.
With millions of unique hashtags and billions of tweets, exact counting becomes impractical. Approximate counting algorithms provide memory-efficient solutions with bounded error.
Count-Min Sketch provides frequency estimation using sublinear space:
"""Count-Min Sketch Implementation A probabilistic data structure for frequency estimation.Uses O(width × depth) space regardless of number of elements. Properties:- Never underestimates: actual_count <= estimate- Overestimates by small factor with high probability- Additive error: estimate <= actual_count + ε × N""" import mmh3 # MurmurHash3 for fast hashingimport numpy as np class CountMinSketch: def __init__( self, width: int = 10000, # Number of counters per row depth: int = 5 # Number of hash functions ): """ Initialize Count-Min Sketch. Error guarantees (with high probability): - Relative error: ε = e / width ≈ 0.00027 for width=10000 - Probability of exceeding error: δ = e^(-depth) ≈ 0.0067 for depth=5 """ self.width = width self.depth = depth self.table = np.zeros((depth, width), dtype=np.int64) self.total_count = 0 def add(self, item: str, count: int = 1): """Add an item to the sketch.""" self.total_count += count for i in range(self.depth): # Each row uses a different hash function (seed) hash_val = mmh3.hash(item, seed=i) % self.width self.table[i][hash_val] += count def estimate(self, item: str) -> int: """ Estimate the count of an item. Returns the minimum across all hash functions. """ min_count = float('inf') for i in range(self.depth): hash_val = mmh3.hash(item, seed=i) % self.width min_count = min(min_count, self.table[i][hash_val]) return int(min_count) def merge(self, other: 'CountMinSketch'): """ Merge another sketch into this one. Enables parallel processing with distributed sketches. """ if self.width != other.width or self.depth != other.depth: raise ValueError("Sketches must have same dimensions") self.table += other.table self.total_count += other.total_count class TrendingCountMinSketch: """ Application of Count-Min Sketch for trending topics. Uses separate sketches for current and baseline periods. """ def __init__(self): # Current 15-minute window self.current_sketch = CountMinSketch(width=50000, depth=7) # Baseline (rolling 7-day average for this time window) self.baseline_sketch = CountMinSketch(width=50000, depth=7) # Unique user tracking (prevent spam) self.user_sketches: Dict[str, 'HyperLogLog'] = {} def record_entity(self, entity: str, user_id: str): """Record an entity mention from a user.""" # Add to current window self.current_sketch.add(entity) # Track unique users mentioning this entity if entity not in self.user_sketches: self.user_sketches[entity] = HyperLogLog() self.user_sketches[entity].add(user_id) def get_trending_score(self, entity: str) -> float: """Calculate trending score using sketch estimates.""" current = self.current_sketch.estimate(entity) baseline = self.baseline_sketch.estimate(entity) / (7 * 96) # 7 days × 96 windows/day if baseline < 1: baseline = 1 # Avoid division by zero # Z-score style calculation return (current - baseline) / np.sqrt(baseline) def get_unique_user_estimate(self, entity: str) -> int: """Estimate unique users mentioning entity.""" if entity in self.user_sketches: return self.user_sketches[entity].cardinality() return 0We also need to count unique users per topic (for anti-gaming). HyperLogLog provides cardinality estimation:
"""HyperLogLog Implementation Estimates cardinality (count of unique elements) using O(log log N) space.Can estimate billions of unique items using only ~12KB of memory. Properties:- Space: O(m) where m = 2^p (typically 2^14 = 16K registers)- Error: Standard error ≈ 1.04 / sqrt(m) ≈ 0.81% for m=16384- Mergeable: Can union multiple HLLs without re-processing elements""" import mmh3import math class HyperLogLog: def __init__(self, precision: int = 14): """ Initialize HyperLogLog. precision = 14 means 2^14 = 16384 registers Standard error ≈ 0.81% """ self.precision = precision self.num_registers = 1 << precision # 2^p self.registers = [0] * self.num_registers # Alpha correction factor if self.num_registers >= 128: self.alpha = 0.7213 / (1 + 1.079 / self.num_registers) elif self.num_registers >= 64: self.alpha = 0.709 elif self.num_registers >= 32: self.alpha = 0.697 else: self.alpha = 0.673 def add(self, item: str): """Add an item to the HLL.""" hash_val = mmh3.hash64(item, signed=False)[0] # Use first p bits to select register register_idx = hash_val >> (64 - self.precision) # Use remaining bits to count leading zeros remaining = hash_val & ((1 << (64 - self.precision)) - 1) leading_zeros = self._count_leading_zeros(remaining, 64 - self.precision) # Store maximum observed leading zeros + 1 self.registers[register_idx] = max( self.registers[register_idx], leading_zeros + 1 ) def cardinality(self) -> int: """Estimate the number of unique items.""" # Harmonic mean of 2^(-register) sum_inv = sum(2 ** (-reg) for reg in self.registers) raw_estimate = (self.alpha * self.num_registers ** 2) / sum_inv # Bias correction for small and large cardinalities if raw_estimate <= 2.5 * self.num_registers: # Small range correction zeros = self.registers.count(0) if zeros > 0: return int( self.num_registers * math.log(self.num_registers / zeros) ) elif raw_estimate > (1 << 32) / 30: # Large range correction return int(-((1 << 32) * math.log(1 - raw_estimate / (1 << 32)))) return int(raw_estimate) def merge(self, other: 'HyperLogLog'): """Merge another HLL into this one.""" if self.precision != other.precision: raise ValueError("HLLs must have same precision") for i in range(self.num_registers): self.registers[i] = max(self.registers[i], other.registers[i]) def _count_leading_zeros(self, value: int, max_bits: int) -> int: if value == 0: return max_bits count = 0 for i in range(max_bits - 1, -1, -1): if value & (1 << i): break count += 1 return count| Algorithm | Purpose | Space | Error | Use Case |
|---|---|---|---|---|
| Count-Min Sketch | Frequency estimation | O(w × d) | ε × N | How many times #topic mentioned? |
| HyperLogLog | Cardinality estimation | O(2^p) | 1.04/√m | How many unique users? |
| Bloom Filter | Membership test | O(n) | False positive ε | Has this user already been counted? |
| Top-K (Heavy Hitters) | Find most frequent | O(k) | Depends on skew | Top 10 trending topics |
Exact counting of 100 million unique hashtags would require ~800MB+ just for the hash map structure. A Count-Min Sketch with width=50000, depth=7 uses only ~2.6MB. HyperLogLog uses ~12KB per entity. These savings enable in-memory processing at scale.
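As a rough sanity check on those figures, here is the back-of-envelope arithmetic (assuming 8-byte counters for the sketch and a packed 6-bit-register HyperLogLog; the simple list-based HLL above uses more):

```python
cms_bytes = 50_000 * 7 * 8      # width × depth counters, int64 each
print(cms_bytes / 2**20)        # ≈ 2.7 MiB for one Count-Min Sketch

hll_bytes = (1 << 14) * 6 / 8   # 2^14 registers, ~6 bits each when bit-packed
print(hll_bytes / 1024)         # ≈ 12 KiB per HyperLogLog
```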
With entity counts computed, we need to score and rank topics to determine what's truly trending.
"""Multi-Factor Trend Scorer Combines multiple signals to produce a final trending score.Each factor handles a different aspect of what makes somethingworth surfacing to users.""" from dataclasses import dataclassfrom typing import Dictimport math @dataclassclass TrendCandidate: entity: str entity_type: str current_count: int baseline_count: float unique_users: int unique_sources: int # Different apps/clients geographic_spread: float # 0-1, how globally distributed account_variety: float # Mix of verified/regular accounts first_seen: float # Timestamp when trend started peak_velocity: float # Maximum tweets per minute observed class TrendScorer: def __init__(self): # Weight for each factor (tuned via experimentation) self.weights = { "velocity": 0.35, "acceleration": 0.20, "diversity": 0.20, "quality": 0.15, "freshness": 0.10 } def score(self, candidate: TrendCandidate) -> float: """ Calculate overall trending score. Higher score = more likely to be shown as trending. """ scores = { "velocity": self._velocity_score(candidate), "acceleration": self._acceleration_score(candidate), "diversity": self._diversity_score(candidate), "quality": self._quality_score(candidate), "freshness": self._freshness_score(candidate), } # Weighted sum total = sum( scores[factor] * self.weights[factor] for factor in scores ) # Apply anti-gaming penalty penalty = self._gaming_penalty(candidate) return total * penalty def _velocity_score(self, c: TrendCandidate) -> float: """ Score based on current vs baseline volume. Higher velocity = more trending. """ if c.baseline_count < 1: c.baseline_count = 1 # Z-score calculation z = (c.current_count - c.baseline_count) / math.sqrt(c.baseline_count) # Normalize to 0-100 scale return min(100, max(0, z * 10)) def _acceleration_score(self, c: TrendCandidate) -> float: """ Score based on how quickly velocity is increasing. Rapid acceleration = breaking news. """ # Compare current velocity to velocity 5 minutes ago # Higher acceleration = more newsworthy if c.peak_velocity > 0: current_velocity = c.current_count / 15 # per minute acceleration = current_velocity / c.peak_velocity return min(100, acceleration * 50) return 0 def _diversity_score(self, c: TrendCandidate) -> float: """ Score based on diversity of users discussing topic. Real trends have diverse participants; bot campaigns don't. """ # Unique users / total mentions ratio if c.current_count == 0: return 0 user_ratio = c.unique_users / c.current_count # Geographic diversity geo_score = c.geographic_spread * 50 # Source diversity (different apps/clients) source_score = min(50, c.unique_sources * 5) return (user_ratio * 50) + (geo_score * 0.3) + (source_score * 0.2) def _quality_score(self, c: TrendCandidate) -> float: """ Score based on quality signals. Verified accounts, engagement, etc. """ # Mix of verified and regular accounts variety_score = c.account_variety * 100 return variety_score def _freshness_score(self, c: TrendCandidate) -> float: """ Score based on how recently trend started. Newer trends are more interesting. """ age_hours = (time.time() - c.first_seen) / 3600 # Decay over 24 hours if age_hours > 24: return 0 return 100 * (1 - age_hours / 24) def _gaming_penalty(self, c: TrendCandidate) -> float: """ Detect potential gaming and apply penalty. Returns multiplier between 0 (blocked) and 1 (no penalty). 
""" # Low user diversity = likely bot campaign if c.unique_users < c.current_count * 0.3: return 0.1 # Single geographic region = potential coordinated campaign if c.geographic_spread < 0.1: return 0.5 # Sudden spike from new accounts = suspicious # (would need additional signals in practice) return 1.0Trends vary dramatically by location. A local news event might trend in one city but be irrelevant globally:
"""Geographic Trending Computes trends at multiple geographic granularities:- Global (worldwide)- Country- City/Metro area Uses user-specified location and tweet geolocation when available.""" from typing import List, Dict, Optionalfrom dataclasses import dataclass @dataclassclass GeoTrend: entity: str score: float volume: int rank: int class GeoTrendingService: # Geographic hierarchy for trend levels TREND_LEVELS = ["global", "country", "city"] def __init__( self, trend_scorer: TrendScorer, geo_counter: Dict[str, TrendingCountMinSketch] # Per-location counters ): self.scorer = trend_scorer self.counters = geo_counter def get_trends_for_location( self, location: str, limit: int = 10 ) -> List[GeoTrend]: """ Get trending topics for a specific location. Falls back to broader geography if insufficient local trends. """ trends = [] # Try to get trends at this location level local_trends = self._get_local_trends(location, limit) trends.extend(local_trends) # If not enough local trends, supplement with broader if len(trends) < limit: parent_location = self._get_parent_location(location) remaining = limit - len(trends) if parent_location: parent_trends = self._get_local_trends( parent_location, remaining ) # Avoid duplicates existing = {t.entity for t in trends} for t in parent_trends: if t.entity not in existing: trends.append(t) # Sort by score and assign ranks trends.sort(key=lambda t: t.score, reverse=True) for i, trend in enumerate(trends[:limit]): trend.rank = i + 1 return trends[:limit] def _get_local_trends( self, location: str, limit: int ) -> List[GeoTrend]: """Get trends specific to one location.""" if location not in self.counters: return [] counter = self.counters[location] # Get candidates above minimum threshold candidates = self._get_candidates_above_threshold(counter, location) # Score each candidate scored = [] for candidate in candidates: score = self.scorer.score(candidate) if score > 0: scored.append(GeoTrend( entity=candidate.entity, score=score, volume=candidate.current_count, rank=0 # Assigned later )) return scored def assign_user_location(self, user: Dict) -> str: """ Determine which location's trends to show a user. Uses profile location, IP geolocation, or activity patterns. """ if user.get("profile_location"): # Geocode profile location to normalized form return self._geocode(user["profile_location"]) if user.get("ip_location"): return user["ip_location"] # Default to global return "global"Geographic trending uses aggregate data only—individual user locations are never exposed. The system answers 'What's trending in New York?' without revealing which specific users are in New York discussing the topic.
Trending topics are valuable real estate. Advertisers, political campaigns, and bad actors will try to game the system. Robust anti-gaming measures are essential.
"""Anti-Gaming Detection System Multi-layer defense against trend manipulation.Each layer catches different types of gaming attempts.""" from dataclasses import dataclassfrom typing import List, Setimport numpy as np @dataclassclass GamingSignal: signal_type: str confidence: float # 0-1 details: str class AntiGamingDetector: def __init__( self, bot_detector, spam_classifier, account_scorer ): self.bot_detector = bot_detector self.spam_classifier = spam_classifier self.account_scorer = account_scorer def analyze_trend( self, entity: str, tweets: List[Dict], users: List[Dict] ) -> List[GamingSignal]: """ Analyze a potential trend for gaming signals. Returns list of detected signals with confidence scores. """ signals = [] # 1. Account age analysis signals.extend(self._analyze_account_ages(users)) # 2. Temporal pattern analysis signals.extend(self._analyze_temporal_patterns(tweets)) # 3. Bot probability analysis signals.extend(self._analyze_bot_probability(users)) # 4. Content similarity analysis signals.extend(self._analyze_content_similarity(tweets)) # 5. Engagement authenticity signals.extend(self._analyze_engagement_patterns(tweets)) return signals def _analyze_account_ages(self, users: List[Dict]) -> List[GamingSignal]: """ Detect suspicious concentration of new accounts. Bot campaigns often use freshly created accounts. """ signals = [] age_days = [u.get("account_age_days", 0) for u in users] new_account_ratio = sum(1 for a in age_days if a < 30) / len(age_days) if new_account_ratio > 0.5: signals.append(GamingSignal( signal_type="new_account_concentration", confidence=min(1.0, new_account_ratio), details=f"{new_account_ratio:.0%} of participants are <30 days old" )) return signals def _analyze_temporal_patterns( self, tweets: List[Dict] ) -> List[GamingSignal]: """ Detect unnatural timing patterns. Real trends have organic growth; gaming has suspicious spikes. """ signals = [] # Calculate inter-arrival times timestamps = sorted([t["created_at"] for t in tweets]) intervals = np.diff(timestamps) if len(intervals) > 10: # Check for suspiciously regular intervals cv = np.std(intervals) / np.mean(intervals) # Coefficient of variation if cv < 0.1: # Too regular = likely automated signals.append(GamingSignal( signal_type="regular_timing", confidence=1 - cv * 10, details=f"Suspiciously regular tweet intervals (CV={cv:.3f})" )) # Check for burst patterns burst_threshold = np.percentile(intervals, 5) if burst_threshold < 0.1: # Sub-second bursts signals.append(GamingSignal( signal_type="burst_pattern", confidence=0.9, details="Abnormal burst of activity detected" )) return signals def _analyze_content_similarity( self, tweets: List[Dict] ) -> List[GamingSignal]: """ Detect copy-paste or template-based tweets. Real discussion has variety; gaming often uses templates. """ signals = [] contents = [t["text"] for t in tweets] # Calculate text similarity matrix # (In production: use MinHash or SimHash for efficiency) unique_ratio = len(set(contents)) / len(contents) if unique_ratio < 0.5: signals.append(GamingSignal( signal_type="content_duplication", confidence=1 - unique_ratio, details=f"Only {unique_ratio:.0%} of tweets are unique" )) return signals def calculate_gaming_penalty( self, signals: List[GamingSignal] ) -> float: """ Calculate overall penalty multiplier based on detected signals. Returns 0-1 where 0 = definitely gaming, 1 = no signals. 
""" if not signals: return 1.0 # Weight signals by type severity severity_weights = { "new_account_concentration": 0.7, "regular_timing": 0.9, "burst_pattern": 0.8, "content_duplication": 0.85, "bot_probability": 0.95, } penalty = 1.0 for signal in signals: weight = severity_weights.get(signal.signal_type, 0.5) penalty *= (1 - signal.confidence * weight) return max(0, penalty)Beyond gaming detection, we filter trends for content quality and safety:
Anti-gaming is an ongoing arms race. Attackers adapt to defenses. Continuous monitoring, regular model updates, and human review of edge cases are essential. No automated system is perfect—but layered defenses raise the bar significantly.
You've now covered the core concepts of real-time trend detection: streaming entity extraction and windowed counting, approximate counting with Count-Min Sketch and HyperLogLog, velocity-based multi-factor trend scoring, geographic trends, and layered anti-gaming defenses.
What's Next:
The final page in this module covers Scaling Considerations—how to take everything we've designed so far and make it work at Twitter's full scale. We'll discuss global distribution, caching at the edge, capacity planning, and graceful degradation strategies.
You now understand how to detect and surface trending topics in real-time. This knowledge applies to any system needing real-time signal detection—not just social media trends, but also anomaly detection, breaking news alerts, and viral content identification.