Notification System - Learning Module

Notification Routing

The Intelligence Behind Every Notification

When a notification event occurs—a new message, a security alert, a friend request—a cascade of decisions must happen in milliseconds: Which users should receive this notification? Through which channels? With what priority? Should it be delivered immediately or batched? This decision-making layer is notification routing, and it's the brain of any notification system.

Routing transforms raw notification events into targeted, channel-specific deliveries. A poorly designed routing layer creates frustrated users (too many notifications through wrong channels) and wasted resources (redundant deliveries, expensive SMS for non-critical events). A well-designed routing layer feels invisible—users receive exactly the right information through the right channel at the right time.

What You Will Learn

This page explores the architecture and algorithms of notification routing: how to design a flexible routing rules engine, implement priority-based delivery, handle large-scale fanout for viral content, and build intelligent channel selection that adapts to user behavior and system state.

Routing Architecture Overview

The routing layer sits between notification producers (services that trigger notifications) and channel-specific delivery systems. Its responsibilities include recipient resolution, channel selection, priority assignment, and delivery scheduling.

Components of the Routing Layer:

Routing Queue — Ingests notification requests from all producer services. Provides buffering during traffic spikes and ensures durability.
Rules Engine — Evaluates routing rules to determine how each notification should be processed. Rules can be based on notification type, user attributes, time of day, etc.
Preference Service — Fetches and caches user notification preferences. Critical path that must be highly available and low-latency.
Channel Selector — Determines which channel(s) to use based on preferences, notification type, delivery requirements, and channel availability.
Priority Router — Routes notifications to channel-specific queues with appropriate priority levels, ensuring critical notifications skip the line.

Decoupling Producers from Channels

Producers should never need to know about channels. They emit events like 'user_received_message' or 'order_shipped' without specifying push vs. email. The routing layer handles all channel logic, enabling new channels to be added without modifying producer code.
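
As a minimal sketch of this separation, the producer below emits an event that names no channel, and a lookup owned by the routing layer decides where it goes. The `NotificationEvent` type, the `CHANNELS_BY_EVENT` table, and its contents are illustrative assumptions, not part of any particular system.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class NotificationEvent:
    """What a producer emits: event name, recipient, payload. No channel."""
    event_type: str
    user_id: str
    payload: dict = field(default_factory=dict)


# Hypothetical routing table owned by the routing layer, not by producers.
CHANNELS_BY_EVENT = {
    "user_received_message": ["push", "in_app"],
    "order_shipped": ["push", "email"],
}
DEFAULT_CHANNELS = ["in_app"]


def resolve_channels(event: NotificationEvent) -> list[str]:
    """Look up channels for an event type; unknown types use the default."""
    return list(CHANNELS_BY_EVENT.get(event.event_type, DEFAULT_CHANNELS))
```

Adding a channel then means editing the routing table only; the producer code that builds `NotificationEvent` stays unchanged.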

Priority-Based Routing

Not all notifications are created equal. A security alert about a compromised account is far more important than a notification about someone liking a post. Priority-based routing ensures critical notifications receive immediate processing while less urgent ones can wait or be batched.

Notification Priority Levels
Priority	SLA	Bypass Rate Limits	Examples
P0 - Critical	< 30 seconds	Yes	Security alerts, fraud alerts, emergency notifications
P1 - High	< 2 minutes	Partial	2FA codes, password resets, payment confirmations
P2 - Medium	< 15 minutes	No	New messages, order updates, friend requests
P3 - Low	< 1 hour	No	Social updates, recommendations, weekly digests
P4 - Bulk	Best effort	No	Marketing campaigns, announcements, non-urgent updates

Implementing Priority Queues:

Priority routing requires separate queues or queue priorities:

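from enum import IntEnum

class Priority(IntEnum):
    CRITICAL = 0  # P0
    HIGH = 1      # P1
    MEDIUM = 2    # P2
    LOW = 3       # P3
    BULK = 4      # P4

class NotificationRouter:
    # Queue name for each priority level; P2 and P3 share the 'normal' queue
    QUEUE_FOR_PRIORITY = {
        Priority.CRITICAL: 'critical',
        Priority.HIGH: 'high',
        Priority.MEDIUM: 'normal',
        Priority.LOW: 'normal',
        Priority.BULK: 'bulk',
    }

    def __init__(self, priority_rules: dict[str, Priority]):
        # PriorityQueue and Notification are placeholders defined elsewhere.
        # PriorityQueue stands for a queue client with its own worker pool,
        # not the standard library's queue.PriorityQueue.
        self.queues = {
            'critical': PriorityQueue(workers=50),   # P0
            'high': PriorityQueue(workers=30),       # P1
            'normal': PriorityQueue(workers=20),     # P2-P3
            'bulk': PriorityQueue(workers=5),        # P4
        }
        # Default priority per notification type
        self.priority_rules = priority_rules

    def route(self, notification: Notification) -> None:
        priority = self.determine_priority(notification)
        queue_name = self.QUEUE_FOR_PRIORITY[priority]
        self.queues[queue_name].enqueue(notification)

    def determine_priority(self, notification: Notification) -> Priority:
        # An explicit priority from the producer takes precedence
        if notification.priority is not None:
            return Priority(notification.priority)

        # Otherwise derive it from the notification type
        return self.priority_rules.get(notification.type, Priority.MEDIUM)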

Worker Allocation:

Different priority levels should have dedicated worker pools:

Critical queue: Most workers, always available capacity
High queue: Generous allocation, auto-scales aggressively
Normal queue: Standard allocation, scales based on depth
Bulk queue: Minimal workers, processes during off-peak

Starvation Prevention

Pure priority queuing can starve low-priority notifications during high load. Implement aging: notifications waiting beyond their SLA automatically get priority boosts. Also reserve minimum capacity for each priority level to ensure all notifications eventually process.
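
One possible aging rule, sketched under assumptions: a waiting notification is promoted one level for each full SLA period it has waited, capped at a ceiling so that aged traffic never competes with P0. The SLA values come from the priority table above; the `effective_priority` function, the one-level-per-period rule, and the ceiling are illustrative choices. Bulk traffic has no SLA, so this sketch leaves it to the reserved minimum capacity.

```python
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 0
    HIGH = 1
    MEDIUM = 2
    LOW = 3
    BULK = 4


# SLA targets in seconds from the priority table; BULK is best effort.
SLA_SECONDS = {
    Priority.CRITICAL: 30,
    Priority.HIGH: 120,
    Priority.MEDIUM: 900,
    Priority.LOW: 3600,
}


def effective_priority(priority: Priority, waited_seconds: float,
                       ceiling: Priority = Priority.HIGH) -> Priority:
    """Promote one level per full SLA period waited, never above `ceiling`."""
    sla = SLA_SECONDS.get(priority)
    # Best-effort traffic is not aged, and nothing at or above the ceiling moves.
    if sla is None or priority <= ceiling:
        return priority
    boosts = int(waited_seconds // sla)
    return Priority(max(int(ceiling), int(priority) - boosts))
```

A scheduler could call `effective_priority` when it picks the next item, so that a P3 notification that has waited over an hour is treated as P2.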

The Routing Rules Engine

A routing rules engine evaluates conditions against notification context to make routing decisions. Well-designed rules engines enable product teams to modify routing behavior without engineering changes.

Rule Structure:

Each rule consists of conditions (when to apply) and actions (what to do):

{
  "rule_id": "security-alerts-multi-channel",
  "priority": 1,
  "conditions": {
    "all": [
      {"fact": "notification_type", "operator": "in", "value": ["password_reset", "login_from_new_device", "account_compromised"]},
      {"fact": "user.has_phone", "operator": "equals", "value": true}
    ]
  },
  "actions": [
    {"type": "set_priority", "value": "critical"},
    {"type": "add_channel", "value": "sms"},
    {"type": "add_channel", "value": "push"},
    {"type": "add_channel", "value": "email"},
    {"type": "bypass_rate_limit", "value": true}
  ]
}

Rule Evaluation Order:

Rules should be evaluated by priority, with first-match or all-match semantics:

First-Match — Stop after first matching rule (simpler, may miss combinations)
All-Match — Apply all matching rules, merge actions (more flexible, requires conflict resolution)

A common choice is all-match with priority-based conflict resolution: rules are applied in rule-priority order, and when two matching rules set different notification priorities, the more urgent one wins.
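
All-match evaluation with priority-based conflict resolution could look roughly like the sketch below. It assumes rules shaped like the JSON example above, facts supplied as a flat dictionary keyed by names such as `user.has_phone`, and only the `equals` and `in` operators. The `PRIORITY_RANK` ordering and the merge behavior for channels are assumptions made for illustration.

```python
from typing import Any

# Lower rank means more urgent (an assumed ordering).
PRIORITY_RANK = {"critical": 0, "high": 1, "normal": 2, "bulk": 3}

OPERATORS = {
    "equals": lambda actual, expected: actual == expected,
    "in": lambda actual, expected: actual in expected,
}


def matches(rule: dict, facts: dict[str, Any]) -> bool:
    """True when every condition under "all" holds; missing facts never match."""
    for cond in rule["conditions"]["all"]:
        if cond["fact"] not in facts:
            return False
        if not OPERATORS[cond["operator"]](facts[cond["fact"]], cond["value"]):
            return False
    return True


def evaluate(rules: list[dict], facts: dict[str, Any]) -> dict:
    """Apply every matching rule in rule-priority order and merge the actions."""
    decision = {"priority": "normal", "channels": [], "bypass_rate_limit": False}
    for rule in sorted(rules, key=lambda r: r["priority"]):
        if not matches(rule, facts):
            continue
        for action in rule["actions"]:
            kind, value = action["type"], action["value"]
            if kind == "set_priority":
                # Conflict resolution: keep the more urgent priority.
                if PRIORITY_RANK[value] < PRIORITY_RANK[decision["priority"]]:
                    decision["priority"] = value
            elif kind == "add_channel":
                if value not in decision["channels"]:
                    decision["channels"].append(value)
            elif kind == "bypass_rate_limit":
                decision["bypass_rate_limit"] = decision["bypass_rate_limit"] or bool(value)
    return decision
```

With first-match semantics the loop would simply `break` after the first rule that matches, which is simpler but cannot combine a security rule with, say, a quiet-hours rule.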

Supported Conditions

•Notification Attributes — type, source_service, payload_size, contains_attachment
•User Attributes — country, language, subscription_tier, account_age, is_verified
•User State — is_online, last_active_timestamp, current_platform, timezone
•Delivery Context — time_of_day, day_of_week, is_quiet_hours, current_channel_load
•Historical Data — notifications_sent_today, last_notification_time, engagement_rate
•External Signals — channel_availability, provider_error_rate, cost_per_channel

Rule Performance

Rules are evaluated on the hot path for every notification, so evaluation cost adds directly to delivery latency. Pre-compile rules into optimized execution plans, and cache results for identical condition combinations. For large rule sets, use the Rete algorithm or a similar technique that shares condition checks across rules. Target < 5ms for full rule evaluation.

Fanout Strategies

When a single event needs to notify many users—a celebrity posts, a breaking news alert goes out, a service outage affects all customers—the notification system must efficiently fan out to potentially millions of recipients. This is one of the most challenging aspects of notification system design.

Fanout on Write

•Generate individual notifications immediately when event occurs
•Pre-materialize all deliveries into channel queues
•Low read-time latency (notifications ready to send)
•High write-time latency for large audiences
•Best for: Real-time requirements, smaller audiences

Fanout on Read

•Store event once, resolve recipients at delivery time
•Pull-based: recipients fetch their notifications
•Low write-time latency (single event stored)
•Higher read-time latency (must resolve audience)
•Best for: In-app notifications, less time-sensitive content

Hybrid Approach (Most Common):

Real-world systems use hybrid strategies:

def fanout_notification(event: Event, audience: Audience):
    audience_size = audience.get_size()
    
    if audience_size <= 1000:
        # Small audience: immediate fanout
        return fanout_on_write(event, audience)
    
    elif audience_size <= 100_000:
        # Medium audience: async fanout workers
        return async_fanout(event, audience)
    
    else:
        # Large audience: hybrid approach
        # Immediate fanout to active users
        active_users = audience.get_active_subset(max_size=10_000)
        fanout_on_write(event, active_users)
        
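        # Everyone not covered by the immediate fanout
        remaining = audience.exclude(active_users)

        # Lazy fanout for the rest (when they come online)
        store_pending_event(event, remaining)

        # Background worker continues fanout for the remaining users only,
        # so active users are not notified twice; it should skip users
        # already served from the pending store
        schedule_background_fanout(event, remaining)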

Fanout Scaling Techniques:

Partition by User — Distribute fanout across worker pools partitioned by user ID hash
Batch Processing — Group recipients for bulk API calls to providers (for example, an FCM multicast call accepts up to 500 device tokens)
Priority Segmentation — Fanout to most engaged users first, then rest
Rate Limiting — Spread fanout over time to avoid overwhelming providers
Delta Fanout — For audience updates, only process adds/removes, not full recalculation
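
Two of these techniques, partitioning by user and batching, are small enough to sketch directly. The function names are illustrative; the default batch size of 500 mirrors the FCM figure mentioned above and would differ for other providers.

```python
import hashlib
from typing import Iterator, Sequence


def partition_for(user_id: str, num_partitions: int) -> int:
    """Stable partition index derived from a hash of the user ID.

    hashlib is used instead of the built-in hash(), which is randomized
    per process for strings and so would not agree across workers.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def batched(tokens: Sequence[str], batch_size: int = 500) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `batch_size` tokens."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(tokens), batch_size):
        yield tokens[start:start + batch_size]
```

Each worker pool would own a set of partition indexes and send one provider request per batch it produces.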

The Celebrity Problem

Under pure fanout on write, a single post from a celebrity with 100 million followers creates 100 million notification records at once. Solutions: (1) Dedicated high-fanout processing clusters, (2) Pre-computed follower lists partitioned across shards, (3) Probabilistic delivery (notify 10% immediately, rest via feed pull), (4) Throttled fanout (spread over minutes, not seconds).
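
To make throttled fanout concrete: at an illustrative rate of 500,000 sends per second, 100 million recipients take 100,000,000 / 500,000 = 200 seconds. The sketch below splits a recipient list into per-second windows; the `throttled_schedule` name and the fixed per-second rate are assumptions, and a production system would more likely use a token bucket driven by provider feedback.

```python
def throttled_schedule(total_recipients: int,
                       max_per_second: int) -> list[tuple[int, int, int]]:
    """Return (second_offset, start_index, end_index) windows covering everyone.

    end_index is exclusive, so each window maps to recipients[start:end].
    """
    if max_per_second <= 0:
        raise ValueError("max_per_second must be positive")
    windows = []
    for offset, start in enumerate(range(0, total_recipients, max_per_second)):
        end = min(start + max_per_second, total_recipients)
        windows.append((offset, start, end))
    return windows
```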

Intelligent Channel Selection

Beyond static rules, advanced notification systems use machine learning and real-time signals to optimize channel selection for each user and notification combination.

Channel Selection Signals
Signal Category	Signals	Usage
User Behavior	App usage patterns, notification interactions, preferred channels	Predict which channel user will engage with
Context	Time of day, device state, user location, WiFi vs cellular	Determine optimal delivery timing and channel
Notification	Content type, urgency, rich media, action required	Match content to channel capabilities
System State	Channel availability, queue depths, provider health	Route around failures, balance load
Historical	Past delivery success, open rates, conversion rates	Learn from previous notification performance

ML-Based Channel Selection:

A machine learning model can predict the best channel for each notification:

class ChannelPredictor:
    def __init__(self, model_path: str):
        self.model = load_model(model_path)
    
    def predict_channel(self, notification: Notification, user: User) -> ChannelSelection:
        features = self.extract_features(notification, user)
        
        # Model outputs a probability for each channel, keyed by channel name
        probabilities = self.model.predict(features)
        
        # Keep channels whose probability clears the per-channel threshold
        channels = []
        for channel, prob in probabilities.items():
            if prob > self.threshold(channel):
                channels.append(ChannelConfig(
                    channel=channel,
                    confidence=prob,
                    delay=self.compute_optimal_delay(channel, user)
                ))
        
        # Rank once, most confident first, and build fallbacks from the ranked list
        ranked = sorted(channels, key=lambda c: c.confidence, reverse=True)
        return ChannelSelection(
            channels=ranked,
            fallback_chain=self.generate_fallbacks(ranked)
        )
    
    def extract_features(self, notification: Notification, user: User) -> dict:
        return {
            'notification_type': notification.type,
            # user.timezone must be a tzinfo object, e.g. zoneinfo.ZoneInfo('Europe/Paris')
            'hour_of_day': datetime.now(user.timezone).hour,
            'user_last_active_channel': user.last_active_channel,
            'push_open_rate_7d': user.metrics.push_open_rate_7d,
            'email_open_rate_7d': user.metrics.email_open_rate_7d,
            'days_since_last_notification': user.metrics.days_since_last,
            'user_tier': user.subscription_tier,
            'device_type': user.primary_device_type,
            # ... many more features
        }

A/B Testing Channels:

Continuously experiment with channel strategies:

Randomly assign users to channel strategy variants
Measure engagement metrics per variant
Gradually roll out winning strategies
Account for long-term effects (notification fatigue)
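
Variant assignment is often implemented with a hash rather than a random draw, so that a user stays in the same variant across sessions and servers. The sketch below takes that approach as an assumption; the `assign_variant` name and the experiment and variant labels are illustrative.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, variants: list[str]) -> str:
    """Deterministically map a user to a variant by hashing (experiment, user)."""
    if not variants:
        raise ValueError("variants must not be empty")
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return variants[bucket % len(variants)]
```

Including the experiment name in the hash keeps assignments independent between experiments, so a user who lands in the control group of one test is not systematically in the control group of the next.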

Respecting User Agency

ML predictions should enhance, not override, user preferences. If a user has explicitly disabled email notifications, no model should re-enable them. Use predictions to optimize within user-permitted channels, not to circumvent user choices.
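
One way to enforce this is to filter model output against the user's enabled channels before anything else looks at the scores. This is a sketch under assumptions: the `permitted_channels` name, the score dictionary, and the single global threshold are illustrative.

```python
def permitted_channels(scores: dict[str, float], enabled: set[str],
                       threshold: float = 0.5) -> list[str]:
    """Rank channels by score, dropping any the user has not enabled.

    A high score never re-enables a channel: the preference filter is
    applied before the threshold and the ranking.
    """
    allowed = {ch: s for ch, s in scores.items()
               if ch in enabled and s >= threshold}
    return sorted(allowed, key=allowed.get, reverse=True)
```

Here a user who disabled email receives nothing by email even when the model scores email at 0.9.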

Routing State Management

Sophisticated routing requires maintaining state about notifications in flight, delivery attempts, and user interactions. This state enables features like fallback chains, acknowledgment tracking, and duplicate suppression.

Notification Lifecycle States:

┌─────────────┐     ┌──────────────┐     ┌────────────┐
│   CREATED   │ ──▶ │    QUEUED    │ ──▶ │   ROUTED   │
└─────────────┘     └──────────────┘     └────────────┘
                                               │
                    ┌──────────────────────────┼──────────────────────────┐
                    │                          │                          │
                    ▼                          ▼                          ▼
            ┌──────────────┐           ┌──────────────┐           ┌──────────────┐
            │ SENT (Push)  │           │ SENT (Email) │           │ SENT (SMS)   │
            └──────────────┘           └──────────────┘           └──────────────┘
                    │                          │                          │
          ┌─────────┼─────────┐      ┌─────────┼─────────┐      ┌─────────┴─────────┐
          ▼         ▼         ▼      ▼         ▼         ▼      ▼                   ▼
      DELIVERED  FAILED   EXPIRED  BOUNCED  DELIVERED  OPENED  DELIVERED        FAILED
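
The lifecycle can be enforced with an explicit transition table, as in this sketch. It tracks a single channel's delivery, and it assumes that OPENED follows DELIVERED, which the diagram does not state. The `State` enum and `transition` function are illustrative names.

```python
from enum import Enum


class State(Enum):
    CREATED = "created"
    QUEUED = "queued"
    ROUTED = "routed"
    SENT = "sent"
    DELIVERED = "delivered"
    OPENED = "opened"
    FAILED = "failed"
    EXPIRED = "expired"
    BOUNCED = "bounced"


# Allowed next states; states missing from the table are terminal.
ALLOWED = {
    State.CREATED: {State.QUEUED},
    State.QUEUED: {State.ROUTED},
    State.ROUTED: {State.SENT},
    State.SENT: {State.DELIVERED, State.FAILED, State.EXPIRED, State.BOUNCED},
    State.DELIVERED: {State.OPENED},  # assumption: opens are recorded after delivery
}


def transition(current: State, target: State) -> State:
    """Return the new state, or raise if the move is not in the table."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target
```

Rejecting illegal moves catches bugs such as a late provider callback marking an already failed notification as delivered.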

State Storage Requirements:

In-Flight Notifications — Track notifications currently being processed. Redis or similar for sub-millisecond lookups.
Delivery Attempts — Log each attempt per channel with timestamps and outcomes. Enables retry logic and debugging.
Acknowledgment State — Track which notifications user has seen/interacted with. Synced across devices.
Notification History — Long-term storage for analytics, compliance, and user-facing notification center.

State Management Patterns

•Idempotency Keys — Every notification carries a unique key. Duplicate submissions with same key are rejected.
•TTL-Based Expiration — Transient state (in-flight tracking) expires automatically. Prevents unbounded state growth.
•Event Sourcing — Store notification events immutably. Reconstruct current state from event log. Excellent for audit trails.
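•Distributed Locks — Prevent concurrent processing of same notification. Use Redis SET with the NX option and an expiry (or similar), so a crashed worker cannot hold a lock forever.
•Saga Pattern — Model multi-channel deliveries as a sequence of steps with compensating actions. A sent notification cannot be recalled, so compensation means cancelling sends that are still pending or issuing a follow-up.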
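
Idempotency keys and TTL-based expiration combine naturally, as in the sketch below. It is an in-memory stand-in for a shared store such as Redis and is not safe across processes; the `IdempotencyStore` class and its `claim` method are illustrative names. The clock is injectable so that expiry can be tested without sleeping.

```python
import time
from typing import Callable


class IdempotencyStore:
    """Remembers recently seen keys for `ttl_seconds` (single process only)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: dict[str, float] = {}

    def claim(self, key: str) -> bool:
        """Return True the first time a key is seen within the TTL.

        Duplicates inside the TTL window return False. Expired entries are
        overwritten on reuse but not purged, which a real store would handle.
        """
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False
        self._expires_at[key] = now + self._ttl
        return True
```

A router would call `claim(notification_key)` before enqueueing and drop the notification when it returns False.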

State Scalability

At 1 billion notifications per day, storing complete history forever is impractical. Implement tiered storage: hot tier (Redis) for last 24 hours, warm tier (PostgreSQL/DynamoDB) for 30 days, cold tier (S3/BigQuery) for long-term analytics. Define clear retention policies aligned with compliance requirements.
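
The tier boundaries above reduce to a simple age check, sketched here with the 24-hour and 30-day cutoffs from the text. The `storage_tier` name and the tier labels are illustrative, and real retention rules would come from the compliance policy rather than constants.

```python
from datetime import timedelta

HOT_WINDOW = timedelta(hours=24)
WARM_WINDOW = timedelta(days=30)


def storage_tier(age: timedelta) -> str:
    """Map a notification's age to the tier that should hold it."""
    if age < HOT_WINDOW:
        return "hot"
    if age < WARM_WINDOW:
        return "warm"
    return "cold"
```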

Fallback and Retry Logic

Delivery failures are inevitable. Devices go offline, email servers reject messages, SMS carriers experience outages. Robust routing includes sophisticated retry and fallback mechanisms.

Retry Strategy:

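import random
from dataclasses import dataclass

@dataclass(frozen=True)
class RetryConfig:
    max_attempts: int
    initial_delay: float       # seconds
    max_delay: float           # seconds
    backoff_multiplier: float
    jitter: bool

class RetryPolicy:
    # Per-channel retry settings
    CONFIGS = {
        'push': RetryConfig(
            max_attempts=3,
            initial_delay=1,
            max_delay=60,
            backoff_multiplier=2,
            jitter=True
        ),
        'email': RetryConfig(
            max_attempts=5,
            initial_delay=30,
            max_delay=3600,
            backoff_multiplier=2,
            jitter=True
        ),
        'sms': RetryConfig(
            max_attempts=3,
            initial_delay=10,
            max_delay=300,
            backoff_multiplier=2,
            jitter=True
        ),
    }

    def __init__(self, channel: str):
        # Fail fast on an unknown channel instead of storing None
        if channel not in self.CONFIGS:
            raise ValueError(f"No retry config for channel: {channel}")
        self.config = self.CONFIGS[channel]

    def should_retry(self, attempt: int, error) -> bool:
        # `attempt` is the number of attempts already made (1 after the first failure).
        # `error` is any object that exposes is_permanent().
        if attempt >= self.config.max_attempts:
            return False

        # Don't retry permanent failures
        if error.is_permanent():
            return False

        return True

    def get_delay(self, attempt: int) -> float:
        # Exponential backoff: initial_delay after the first failed attempt,
        # multiplied by backoff_multiplier for each attempt after that
        delay = self.config.initial_delay * (
            self.config.backoff_multiplier ** (attempt - 1)
        )

        # Jitter spreads retries out so that failed sends do not retry in lockstep
        if self.config.jitter:
            delay *= random.uniform(0.5, 1.5)

        # Cap after jitter so the delay never exceeds max_delay
        return min(delay, self.config.max_delay)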

Distinguishing Error Types:

Transient — Rate limited, temporary network issue, provider overloaded → Retry with backoff
Permanent — Invalid token, user unsubscribed, address doesn't exist → Don't retry, update user state
Unknown — Provider timeout, ambiguous response → Retry cautiously, check delivery status later
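
The three categories can be captured as a small classifier, sketched below. The error codes are hypothetical and provider-agnostic; real providers return their own codes, which would need to be mapped into these sets.

```python
from enum import Enum


class ErrorKind(Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"
    UNKNOWN = "unknown"


# Hypothetical normalized error codes, one per example in the list above.
TRANSIENT_CODES = {"rate_limited", "network_error", "provider_overloaded"}
PERMANENT_CODES = {"invalid_token", "unsubscribed", "address_not_found"}

ACTIONS = {
    ErrorKind.TRANSIENT: "retry_with_backoff",
    ErrorKind.PERMANENT: "drop_and_update_user_state",
    ErrorKind.UNKNOWN: "retry_cautiously_and_check_status",
}


def classify(code: str) -> ErrorKind:
    """Anything not explicitly listed (timeouts, odd responses) is UNKNOWN."""
    if code in TRANSIENT_CODES:
        return ErrorKind.TRANSIENT
    if code in PERMANENT_CODES:
        return ErrorKind.PERMANENT
    return ErrorKind.UNKNOWN


def next_action(code: str) -> str:
    """Map a normalized error code to the handling described above."""
    return ACTIONS[classify(code)]
```

Defaulting to UNKNOWN rather than TRANSIENT is the conservative choice: an unrecognized error may mean the message was in fact delivered, so a blind retry risks a duplicate.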

Fallback Chain Examples
Notification Type	Primary	Fallback 1	Fallback 2	Final Fallback
2FA Code	SMS	Push	Email	Voice Call
Fraud Alert	Push + SMS (parallel)	Voice Call	Email
New Message	Push	In-App	Email (if 1hr passed)
Order Shipped	Push	Email
Marketing	Email	Push (if opted in)
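
Walking a chain like those in the table can be sketched as a loop that stops at the first channel that succeeds. The chain data below covers two rows of the table, the type keys are illustrative names, and conditions such as "if 1hr passed" or parallel sends are left out of this sketch. The `send` callable is an assumed hook that returns True on success.

```python
from typing import Callable, Optional

# Two chains from the table above, primary channel first.
FALLBACK_CHAINS = {
    "2fa_code": ["sms", "push", "email", "voice_call"],
    "order_shipped": ["push", "email"],
}


def deliver_with_fallbacks(notification_type: str,
                           send: Callable[[str], bool]) -> Optional[str]:
    """Try each channel in order; return the one that succeeded, else None."""
    for channel in FALLBACK_CHAINS.get(notification_type, []):
        if send(channel):
            return channel
    return None
```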

Acknowledgment-Based Fallbacks

For critical notifications, trigger fallbacks based on user acknowledgment, not just delivery success. If a fraud alert push was delivered but not opened within 5 minutes, escalate to SMS. This requires tracking notification opens, not just sends.
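
The escalation check itself is a comparison of timestamps, as in this sketch. The 5-minute window comes from the example above; the `should_escalate` name and its parameters are illustrative, and the timestamps are assumed to come from the delivery and open tracking described earlier.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_escalate(delivered_at: datetime,
                    opened_at: Optional[datetime],
                    now: datetime,
                    ack_window: timedelta = timedelta(minutes=5)) -> bool:
    """True when a delivered notification is still unopened after the window."""
    if opened_at is not None:
        return False
    return now - delivered_at >= ack_window
```

A periodic job could run this check over unacknowledged critical notifications and hand any that return True to the next channel in the fallback chain.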

Summary: Notification Routing

Notification routing is the intelligent core of the notification system. It transforms raw events into targeted, prioritized, channel-specific deliveries while handling the complexities of scale, failures, and user preferences.

Key Takeaways

•Decoupled Architecture — Routing layer separates producers from channels, enabling independent evolution
•Priority-Based Routing — Different priority levels with dedicated resources keep critical notifications from waiting behind lower-priority traffic
•Rules Engine — Flexible, configurable routing rules enable product teams to adjust behavior without code changes
•Fanout Strategies — Hybrid approaches (fanout on write + read) handle both small and massive audiences efficiently
•Intelligent Selection — ML models and real-time signals optimize channel selection beyond static rules
•State Management — Tracking notification lifecycle enables fallbacks, deduplication, and analytics
•Fallback Chains — Automatic fallback to alternative channels when primary delivery fails

What's Next:

With routing covered, we'll address a critical user experience challenge: notification overload. The next page explores Batching and Deduplication—techniques for grouping similar notifications, preventing duplicates, and respecting user attention while maintaining deliverability.

You now understand the architecture and algorithms behind notification routing. You can design priority systems, build flexible rules engines, handle massive fanout, and implement intelligent channel selection. These skills are essential for any notification system serving more than a handful of users.