When a notification event occurs—a new message, a security alert, a friend request—a cascade of decisions must happen in milliseconds: Which users should receive this notification? Through which channels? With what priority? Should it be delivered immediately or batched? This decision-making layer is notification routing, and it's the brain of any notification system.
Routing transforms raw notification events into targeted, channel-specific deliveries. A poorly designed routing layer creates frustrated users (too many notifications through wrong channels) and wasted resources (redundant deliveries, expensive SMS for non-critical events). A well-designed routing layer feels invisible—users receive exactly the right information through the right channel at the right time.
This page explores the architecture and algorithms of notification routing: how to design a flexible routing rules engine, implement priority-based delivery, handle large-scale fanout for viral content, and build intelligent channel selection that adapts to user behavior and system state.
The routing layer sits between notification producers (services that trigger notifications) and channel-specific delivery systems. Its responsibilities include recipient resolution, channel selection, priority assignment, and delivery scheduling.
Components of the Routing Layer:
Routing Queue — Ingests notification requests from all producer services. Provides buffering during traffic spikes and ensures durability.
Rules Engine — Evaluates routing rules to determine how each notification should be processed. Rules can be based on notification type, user attributes, time of day, etc.
Preference Service — Fetches and caches user notification preferences. Critical path that must be highly available and low-latency.
Channel Selector — Determines which channel(s) to use based on preferences, notification type, delivery requirements, and channel availability.
Priority Router — Routes notifications to channel-specific queues with appropriate priority levels, ensuring critical notifications skip the line.
Producers should never need to know about channels. They emit events like 'user_received_message' or 'order_shipped' without specifying push vs. email. The routing layer handles all channel logic, enabling new channels to be added without modifying producer code.
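As a minimal sketch of this separation, the producer below emits an event that carries no channel information, and a routing-side lookup decides the channels. The `NotificationEvent` class, the `CHANNELS_BY_EVENT_TYPE` table, and the channel lists in it are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class NotificationEvent:
    """What a producer emits: an event type and a recipient, no channels."""
    event_type: str
    recipient_id: str
    payload: dict = field(default_factory=dict)


# Hypothetical table owned by the routing layer, never by producers.
CHANNELS_BY_EVENT_TYPE = {
    "user_received_message": ["push", "in_app"],
    "order_shipped": ["push", "email"],
}
DEFAULT_CHANNELS = ["in_app"]


def resolve_channels(event: NotificationEvent) -> list[str]:
    """Return the channels for an event; unknown types get the default."""
    return list(CHANNELS_BY_EVENT_TYPE.get(event.event_type, DEFAULT_CHANNELS))


event = NotificationEvent("order_shipped", recipient_id="user-1")
print(resolve_channels(event))
```

Adding a new channel then means editing the routing table only; the producer code that builds `NotificationEvent` does not change.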
Not all notifications are created equal. A security alert about a compromised account is far more important than a notification about someone liking a post. Priority-based routing ensures critical notifications receive immediate processing while less urgent ones can wait or be batched.
| Priority | SLA | Bypass Rate Limits | Examples |
|---|---|---|---|
| P0 - Critical | < 30 seconds | Yes | Security alerts, fraud alerts, emergency notifications |
| P1 - High | < 2 minutes | Partial | 2FA codes, password resets, payment confirmations |
| P2 - Medium | < 15 minutes | No | New messages, order updates, friend requests |
| P3 - Low | < 1 hour | No | Social updates, recommendations, weekly digests |
| P4 - Bulk | Best effort | No | Marketing campaigns, announcements, non-urgent updates |
Implementing Priority Queues:
Priority routing requires separate queues or queue priorities:
    class NotificationRouter:
        def __init__(self, priority_rules=None):
            # One queue (with its own worker pool) per priority band
            self.queues = {
                'critical': PriorityQueue(workers=50),  # P0
                'high': PriorityQueue(workers=30),      # P1
                'normal': PriorityQueue(workers=20),    # P2-P3
                'bulk': PriorityQueue(workers=5),       # P4
            }
            # Maps notification type -> default priority
            self.priority_rules = priority_rules or {}

        def route(self, notification: Notification):
            priority = self.determine_priority(notification)
            queue_name = self.priority_to_queue(priority)
            self.queues[queue_name].enqueue(notification)

        def determine_priority(self, notification: Notification) -> int:
            # Check explicit priority from producer
            if notification.priority is not None:
                return notification.priority
            # Derive from notification type
            return self.priority_rules.get(
                notification.type,
                Priority.NORMAL
            )
Worker Allocation:
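Different priority levels should have dedicated worker pools, so that critical traffic is never stuck behind bulk traffic. In the router above, the critical queue gets 50 workers, high gets 30, normal gets 20, and bulk gets 5.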
Pure priority queuing can starve low-priority notifications during high load. Implement aging: notifications waiting beyond their SLA automatically get priority boosts. Also reserve minimum capacity for each priority level to ensure all notifications eventually process.
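One way to sketch aging is to compute an effective priority at dequeue time: a notification moves up one level for every full SLA period it has waited. The SLA values for P0 to P3 come from the table above; the 4-hour window for P4 and the rule that aged notifications never enter P0 are assumptions made for this example.

```python
# Aging window per priority level, in seconds (P0-P3 follow the SLA table).
# P4 is "best effort" in the table; the 4-hour window is an assumption.
AGING_WINDOW_SECONDS = {0: 30, 1: 120, 2: 900, 3: 3600, 4: 14400}

# Assumption: aging never promotes a notification into P0 (critical).
HIGHEST_BOOSTED_LEVEL = 1


def effective_priority(priority: int, enqueued_at: float, now: float) -> int:
    """Return the priority to schedule with (lower number = more urgent)."""
    window = AGING_WINDOW_SECONDS[priority]
    waited = max(0.0, now - enqueued_at)
    boost = int(waited // window)  # one level per full window waited
    boosted = max(HIGHEST_BOOSTED_LEVEL, priority - boost)
    # Never demote: a P0 notification stays P0.
    return min(priority, boosted)
```

A scheduler could call `effective_priority` when picking the next notification, so a P3 item that has waited two hours competes as P1 instead of waiting behind every newer P2 item.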
A routing rules engine evaluates conditions against notification context to make routing decisions. Well-designed rules engines enable product teams to modify routing behavior without engineering changes.
Rule Structure:
Each rule consists of conditions (when to apply) and actions (what to do):
{
"rule_id": "security-alerts-multi-channel",
"priority": 1,
"conditions": {
"all": [
{"fact": "notification_type", "operator": "in", "value": ["password_reset", "login_from_new_device", "account_compromised"]},
{"fact": "user.has_phone", "operator": "equals", "value": true}
]
},
"actions": [
{"type": "set_priority", "value": "critical"},
{"type": "add_channel", "value": "sms"},
{"type": "add_channel", "value": "push"},
{"type": "add_channel", "value": "email"},
{"type": "bypass_rate_limit", "value": true}
]
}
Rule Evaluation Order:
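Rules should be evaluated in priority order, with either first-match semantics (stop at the first rule whose conditions hold) or all-match semantics (apply the actions of every matching rule):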
Most notification systems use all-match with priority-based conflict resolution: if two rules set different priorities, the higher priority wins.
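The following is a minimal sketch of all-match evaluation over rules shaped like the JSON example above. It supports only the `all` condition group and the `equals` and `in` operators; the `PRIORITY_RANK` ordering and the function names are assumptions for illustration, not a reference implementation.

```python
from typing import Any

# Assumed ordering of priority names; a lower rank wins a conflict.
PRIORITY_RANK = {"critical": 0, "high": 1, "normal": 2, "bulk": 3}

OPERATORS = {
    "equals": lambda actual, expected: actual == expected,
    "in": lambda actual, expected: actual in expected,
}


def lookup(facts: dict, path: str) -> Any:
    """Resolve dotted fact names such as 'user.has_phone'."""
    value: Any = facts
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def matches(rule: dict, facts: dict) -> bool:
    conditions = rule["conditions"].get("all", [])
    return all(
        OPERATORS[c["operator"]](lookup(facts, c["fact"]), c["value"])
        for c in conditions
    )


def evaluate(rules: list[dict], facts: dict) -> dict:
    """Apply every matching rule; the most urgent priority wins."""
    decision = {"priority": None, "channels": [], "bypass_rate_limit": False}
    for rule in sorted(rules, key=lambda r: r["priority"]):
        if not matches(rule, facts):
            continue
        for action in rule["actions"]:
            kind, value = action["type"], action["value"]
            if kind == "set_priority":
                current = decision["priority"]
                if current is None or PRIORITY_RANK[value] < PRIORITY_RANK[current]:
                    decision["priority"] = value
            elif kind == "add_channel" and value not in decision["channels"]:
                decision["channels"].append(value)
            elif kind == "bypass_rate_limit":
                decision["bypass_rate_limit"] = decision["bypass_rate_limit"] or bool(value)
    return decision
```

Because channels are merged and priorities are resolved by rank, the order in which matching rules fire affects only the order of the channel list, not the final priority.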
Rules are evaluated on the hot path for every notification. Pre-compile rules into optimized execution plans. Cache rule results for identical condition combinations. Use a Rete algorithm or similar for efficient multi-rule evaluation. Target < 5ms for full rule evaluation.
When a single event needs to notify many users—a celebrity posts, a breaking news alert goes out, a service outage affects all customers—the notification system must efficiently fan out to potentially millions of recipients. This is one of the most challenging aspects of notification system design.
Hybrid Approach (Most Common):
Real-world systems use hybrid strategies that choose a fanout method based on audience size:
    def fanout_notification(event: Event, audience: Audience):
        audience_size = audience.get_size()

        if audience_size <= 1000:
            # Small audience: immediate fanout
            return fanout_on_write(event, audience)
        elif audience_size <= 100_000:
            # Medium audience: async fanout workers
            return async_fanout(event, audience)
        else:
            # Large audience: hybrid approach
            # Immediate fanout to active users
            active_users = audience.get_active_subset(max_size=10_000)
            fanout_on_write(event, active_users)

            # Exclude users already notified so nobody is notified twice
            remaining = audience.exclude(active_users)
            # Lazy fanout for rest (when they come online)
            store_pending_event(event, remaining)
            # Background worker continues fanout for the remaining users
            schedule_background_fanout(event, remaining)
Fanout Scaling Techniques:
A celebrity with 100 million followers posting creates 100 million notification records instantly. Solutions: (1) Dedicated high-fanout processing clusters, (2) Pre-computed follower lists partitioned across shards, (3) Probabilistic delivery (notify 10% immediately, rest via feed pull), (4) Throttled fanout (spread over minutes, not seconds).
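Throttled fanout, option (4) above, can be sketched as a generator that splits the recipient list into batches and assigns each batch a send offset. The function name and parameters are hypothetical; a real system would hand each batch to a worker queue rather than materialize the recipient list in memory.

```python
from typing import Iterator, Sequence


def throttled_batches(
    recipient_ids: Sequence[str],
    batch_size: int,
    batches_per_second: float,
) -> Iterator[tuple[float, Sequence[str]]]:
    """Yield (send_offset_seconds, batch) pairs at a fixed batch rate."""
    if batch_size <= 0 or batches_per_second <= 0:
        raise ValueError("batch_size and batches_per_second must be positive")
    interval = 1.0 / batches_per_second
    for index, start in enumerate(range(0, len(recipient_ids), batch_size)):
        yield index * interval, recipient_ids[start:start + batch_size]


for offset, batch in throttled_batches(["a", "b", "c", "d", "e"], 2, 2.0):
    print(offset, batch)
```

As a rough sizing example under these assumptions, 100 million recipients in batches of 10,000 produce 10,000 batches; at 100 batches per second the fanout takes about 100 seconds instead of arriving as a single spike.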
Beyond static rules, advanced notification systems use machine learning and real-time signals to optimize channel selection for each user and notification combination.
| Signal Category | Signals | Usage |
|---|---|---|
| User Behavior | App usage patterns, notification interactions, preferred channels | Predict which channel user will engage with |
| Context | Time of day, device state, user location, WiFi vs cellular | Determine optimal delivery timing and channel |
| Notification | Content type, urgency, rich media, action required | Match content to channel capabilities |
| System State | Channel availability, queue depths, provider health | Route around failures, balance load |
| Historical | Past delivery success, open rates, conversion rates | Learn from previous notification performance |
ML-Based Channel Selection:
A machine learning model can predict the best channel for each notification:
    from datetime import datetime

    class ChannelPredictor:
        def __init__(self, model_path: str):
            self.model = load_model(model_path)

        def predict_channel(self, notification: Notification, user: User) -> ChannelSelection:
            features = self.extract_features(notification, user)
            # Model outputs probability for each channel
            probabilities = self.model.predict(features)
            # Keep only channels whose probability clears the per-channel threshold
            channels = []
            for channel, prob in probabilities.items():
                if prob > self.threshold(channel):
                    channels.append(ChannelConfig(
                        channel=channel,
                        confidence=prob,
                        delay=self.compute_optimal_delay(channel, user)
                    ))
            # Most confident channel first
            return ChannelSelection(
                channels=sorted(channels, key=lambda c: -c.confidence),
                fallback_chain=self.generate_fallbacks(channels)
            )

        def extract_features(self, notification: Notification, user: User) -> dict:
            return {
                'notification_type': notification.type,
                # user.timezone must be a tzinfo object (e.g. zoneinfo.ZoneInfo)
                'hour_of_day': datetime.now(user.timezone).hour,
                'user_last_active_channel': user.last_active_channel,
                'push_open_rate_7d': user.metrics.push_open_rate_7d,
                'email_open_rate_7d': user.metrics.email_open_rate_7d,
                'days_since_last_notification': user.metrics.days_since_last,
                'user_tier': user.subscription_tier,
                'device_type': user.primary_device_type,
                # ... many more features
            }
A/B Testing Channels:
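Continuously experiment with channel strategies: keep a control group on the current routing, give a treatment group the candidate strategy, and compare engagement metrics such as open rate before rolling the change out.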
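Channel experiments need a stable assignment of users to variants, so that a user sees the same strategy on every notification. A common way to get this is to hash the user ID together with the experiment name. The sketch below assumes equally weighted variants; `assign_variant` and the experiment name are hypothetical.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, variants: list[str]) -> str:
    """Deterministically map a user to one of the experiment's variants."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


variant = assign_variant("user-42", "push-vs-email-digest", ["control", "treatment"])
print(variant)
```

Including the experiment name in the hash keeps assignments independent across experiments, so the same users do not always land in the treatment group.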
ML predictions should enhance, not override, user preferences. If a user has explicitly disabled email notifications, no model should re-enable them. Use predictions to optimize within user-permitted channels, not to circumvent user choices.
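A minimal sketch of this guardrail: filter the model's output against the user's enabled channels before anything else, then rank what is left. The function name, the input shapes, and the 0.5 default threshold are assumptions for illustration.

```python
def permitted_channels(
    predicted: dict[str, float],
    enabled: set[str],
    threshold: float = 0.5,
) -> list[str]:
    """Rank predicted channels, keeping only those the user has enabled."""
    candidates = [
        (channel, probability)
        for channel, probability in predicted.items()
        if channel in enabled and probability >= threshold
    ]
    candidates.sort(key=lambda item: item[1], reverse=True)
    return [channel for channel, _ in candidates]


# Email scores highest but the user disabled it, so it is never selected.
print(permitted_channels({"email": 0.9, "push": 0.7, "sms": 0.2}, {"push", "sms"}))
```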
Sophisticated routing requires maintaining state about notifications in flight, delivery attempts, and user interactions. This state enables features like fallback chains, acknowledgment tracking, and duplicate suppression.
Notification Lifecycle States:
    ┌─────────────┐     ┌──────────────┐     ┌────────────┐
    │   CREATED   │ ──▶ │    QUEUED    │ ──▶ │   ROUTED   │
    └─────────────┘     └──────────────┘     └────────────┘
                                                    │
                         ┌──────────────────────────┼──────────────────────────┐
                         │                          │                          │
                         ▼                          ▼                          ▼
                 ┌──────────────┐           ┌──────────────┐           ┌──────────────┐
                 │ SENT (Push)  │           │ SENT (Email) │           │  SENT (SMS)  │
                 └──────────────┘           └──────────────┘           └──────────────┘
                         │                          │                          │
               ┌─────────┼─────────┐      ┌─────────┼─────────┐      ┌─────────┴─────────┐
               ▼         ▼         ▼      ▼         ▼         ▼      ▼                   ▼
           DELIVERED  FAILED   EXPIRED  BOUNCED DELIVERED OPENED  DELIVERED           FAILED
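The lifecycle above can be enforced with a small transition table, so that a notification can never jump from, say, CREATED straight to SENT. This sketch collapses the three per-channel SENT boxes into a single `SENT` state and treats every outcome as terminal; both simplifications, and all names, are assumptions for the example.

```python
from enum import Enum


class State(Enum):
    CREATED = "created"
    QUEUED = "queued"
    ROUTED = "routed"
    SENT = "sent"
    DELIVERED = "delivered"
    FAILED = "failed"
    EXPIRED = "expired"
    BOUNCED = "bounced"
    OPENED = "opened"


# Allowed transitions; states missing from the table are terminal.
TRANSITIONS = {
    State.CREATED: {State.QUEUED},
    State.QUEUED: {State.ROUTED},
    State.ROUTED: {State.SENT},
    State.SENT: {State.DELIVERED, State.FAILED, State.EXPIRED,
                 State.BOUNCED, State.OPENED},
}


def advance(current: State, target: State) -> State:
    """Return the new state, or raise if the transition is not allowed."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```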
State Storage Requirements:
In-Flight Notifications — Track notifications currently being processed. Redis or similar for sub-millisecond lookups.
Delivery Attempts — Log each attempt per channel with timestamps and outcomes. Enables retry logic and debugging.
Acknowledgment State — Track which notifications user has seen/interacted with. Synced across devices.
Notification History — Long-term storage for analytics, compliance, and user-facing notification center.
At 1 billion notifications per day, storing complete history forever is impractical. Implement tiered storage: hot tier (Redis) for last 24 hours, warm tier (PostgreSQL/DynamoDB) for 30 days, cold tier (S3/BigQuery) for long-term analytics. Define clear retention policies aligned with compliance requirements.
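For scale, 1 billion notifications per day averages roughly 11,600 per second. The tier boundaries described above can be sketched as a function of record age; the tier names follow the text, while the function itself is a hypothetical helper.

```python
from datetime import timedelta

HOT_WINDOW = timedelta(hours=24)   # hot tier: last 24 hours
WARM_WINDOW = timedelta(days=30)   # warm tier: up to 30 days


def storage_tier(age: timedelta) -> str:
    """Pick the storage tier for a notification record of the given age."""
    if age < HOT_WINDOW:
        return "hot"
    if age < WARM_WINDOW:
        return "warm"
    return "cold"
```

A periodic job could use `storage_tier` to decide which records to migrate, and a separate retention policy would decide when cold records are deleted.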
Delivery failures are inevitable. Devices go offline, email servers reject messages, SMS carriers experience outages. Robust routing includes sophisticated retry and fallback mechanisms.
Retry Strategy:
    import random

    class RetryPolicy:
        def __init__(self, channel: str):
            # Per-channel retry settings; all delays are in seconds
            self.configs = {
                'push': RetryConfig(
                    max_attempts=3,
                    initial_delay=1,
                    max_delay=60,
                    backoff_multiplier=2,
                    jitter=True
                ),
                'email': RetryConfig(
                    max_attempts=5,
                    initial_delay=30,
                    max_delay=3600,
                    backoff_multiplier=2,
                    jitter=True
                ),
                'sms': RetryConfig(
                    max_attempts=3,
                    initial_delay=10,
                    max_delay=300,
                    backoff_multiplier=2,
                    jitter=True
                ),
            }
            # Fail fast on an unknown channel instead of storing None
            self.config = self.configs[channel]

        def should_retry(self, attempt: int, error: Error) -> bool:
            if attempt >= self.config.max_attempts:
                return False
            # Don't retry permanent failures
            if error.is_permanent():
                return False
            return True

        def get_delay(self, attempt: int) -> float:
            # Exponential backoff on the attempt number
            delay = self.config.initial_delay * (self.config.backoff_multiplier ** attempt)
            if self.config.jitter:
                delay *= random.uniform(0.5, 1.5)
            # Cap after jitter so the delay never exceeds max_delay
            return min(delay, self.config.max_delay)
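To make the backoff arithmetic concrete, the sketch below computes the delay schedule without jitter, assuming `attempt` is zero-based. `backoff_schedule` is a hypothetical helper, not part of the policy class above.

```python
def backoff_schedule(
    initial_delay: float,
    multiplier: float,
    max_delay: float,
    max_attempts: int,
) -> list[float]:
    """Delays in seconds for attempts 0..max_attempts-1, capped at max_delay."""
    return [
        min(initial_delay * multiplier ** attempt, max_delay)
        for attempt in range(max_attempts)
    ]


print(backoff_schedule(1, 2, 60, 3))      # push settings
print(backoff_schedule(30, 2, 3600, 5))   # email settings
```

With the push settings the delays are 1, 2, and 4 seconds; with the email settings they are 30, 60, 120, 240, and 480 seconds. Jitter then scales each value by a random factor so that many clients failing at once do not retry in lockstep.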
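Distinguishing Error Types:
Retry only transient failures such as timeouts, provider outages, or rate limiting. Permanent failures, such as an invalid device token or a hard email bounce, will not succeed on retry, so move straight to the next channel in the fallback chain.
Fallback Chains: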
| Notification Type | Primary | Fallback 1 | Fallback 2 | Final Fallback |
|---|---|---|---|---|
| 2FA Code | SMS | Push | Voice Call | |
| Fraud Alert | Push + SMS (parallel) | Voice Call | | |
| New Message | Push | In-App | Email (if 1 hour has passed) | |
| Order Shipped | Push | | | |
| Marketing | Push (if opted in) | | | |
For critical notifications, trigger fallbacks based on user acknowledgment, not just delivery success. If a fraud alert push was delivered but not opened within 5 minutes, escalate to SMS. This requires tracking notification opens, not just sends.
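Acknowledgment-based escalation can be sketched as a check that runs some time after delivery: if the notification was opened, do nothing; if the timeout has passed without an open, move to the next channel in the chain. `DeliveryRecord`, `next_escalation`, and the example chain are hypothetical; the 300-second default matches the 5-minute example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeliveryRecord:
    channel: str
    delivered_at: float                 # Unix timestamp, seconds
    opened_at: Optional[float] = None   # None until the user opens it


def next_escalation(
    record: DeliveryRecord,
    fallback_chain: list[str],
    now: float,
    ack_timeout: float = 300.0,
) -> Optional[str]:
    """Return the channel to escalate to, or None if no escalation is due."""
    if record.opened_at is not None:
        return None  # acknowledged, nothing to do
    if now - record.delivered_at < ack_timeout:
        return None  # still within the acknowledgment window
    if record.channel not in fallback_chain:
        return None  # channel is not part of this chain
    position = fallback_chain.index(record.channel)
    if position + 1 >= len(fallback_chain):
        return None  # chain exhausted
    return fallback_chain[position + 1]
```

For a chain of `["push", "sms", "voice"]`, an unopened push delivered more than 5 minutes ago escalates to `"sms"`, and an unopened SMS would later escalate to `"voice"`.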
Notification routing is the intelligent core of the notification system. It transforms raw events into targeted, prioritized, channel-specific deliveries while handling the complexities of scale, failures, and user preferences.
What's Next:
With routing covered, we'll address a critical user experience challenge: notification overload. The next page explores Batching and Deduplication—techniques for grouping similar notifications, preventing duplicates, and respecting user attention while maintaining deliverability.
You now understand the architecture and algorithms behind notification routing. You can design priority systems, build flexible rules engines, handle massive fanout, and implement intelligent channel selection. These skills are essential for any notification system serving more than a handful of users.