Design a real-time ad click aggregation system that captures billions of ad click and impression events, aggregates them across multiple dimensions (campaign, geography, device, time window) using stream processing (Apache Flink), ensures exactly-once counting for billing accuracy, detects and filters click fraud, stores results in an OLAP database for interactive analytics, and reconciles real-time aggregates with batch-recomputed counts for billing integrity.
| Metric | Value |
|---|---|
| Ad impressions per day | 10 billion |
| Ad clicks per day | 500 million |
| Click rate (avg) | ~5,800 clicks/sec (500 million ÷ 86,400 s); ingestion target 10,000+ clicks/sec |
| Click rate (peak) | 100,000 clicks/sec |
| Unique ads | 10 million |
| Active campaigns | 1 million |
| Advertisers | 500,000 |
| Aggregation dimensions | 50+ (ad_id × campaign × country × device × time) |
| Aggregation window | 1 minute (primary), 5-min, hourly, daily rollups |
| Click-to-dashboard latency | < 1 minute |
| Billing accuracy | 99.99%+ (exactly-once) |
| Fraud rate (industry) | 10–30% of clicks |
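
As a quick sanity check on the table, the daily volumes can be divided by 86,400 seconds to get raw average rates. The click average lands near 5,800 per second, so the 10,000 clicks/sec figure is probably best read as a target with headroom rather than a strict mean; that reading is an interpretation, not something the source states. The variable names below are illustrative.

```python
# Back-of-envelope rates derived from the daily volumes in the table.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

clicks_per_day = 500_000_000
impressions_per_day = 10_000_000_000
peak_clicks_per_sec = 100_000

avg_clicks_per_sec = clicks_per_day / SECONDS_PER_DAY
avg_impressions_per_sec = impressions_per_day / SECONDS_PER_DAY
overall_ctr = clicks_per_day / impressions_per_day
peak_to_avg = peak_clicks_per_sec / avg_clicks_per_sec

print(f"avg clicks/sec:      {avg_clicks_per_sec:,.0f}")
print(f"avg impressions/sec: {avg_impressions_per_sec:,.0f}")
print(f"overall CTR:         {overall_ctr:.1%}")
print(f"peak / avg clicks:   {peak_to_avg:.1f}x")
```

The impression stream is roughly 20 times larger than the click stream, which matters for any component that has to join the two.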
Click event ingestion: capture ad click events in real-time from billions of ad impressions; each event: {click_id, ad_id, campaign_id, advertiser_id, user_id, timestamp, ip, device, country, referrer}; ingest at 10,000+ clicks per second (peak: 100K/s)
Real-time aggregation: compute click counts aggregated by multiple dimensions — per ad_id, per campaign_id, per advertiser_id — in 1-minute windows; results available within 1 minute of the click occurring (near real-time)
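
The 1-minute aggregation described above can be illustrated with a tumbling-window counter in plain Python. This is a minimal sketch of the idea only, not Flink code: `window_start` and `aggregate_clicks` are hypothetical helpers, and timestamps are assumed to be epoch seconds.

```python
from collections import Counter

WINDOW_SECONDS = 60  # primary 1-minute tumbling window


def window_start(ts: int) -> int:
    """Return the start of the 1-minute window containing epoch second ts."""
    return ts - ts % WINDOW_SECONDS


def aggregate_clicks(events):
    """Count clicks per (dimension, value, window_start) for three dimensions."""
    counts = Counter()
    for e in events:
        w = window_start(e["timestamp"])
        for dim in ("ad_id", "campaign_id", "advertiser_id"):
            counts[(dim, e[dim], w)] += 1
    return counts


events = [
    {"ad_id": "a1", "campaign_id": "c1", "advertiser_id": "v1", "timestamp": 120},
    {"ad_id": "a1", "campaign_id": "c1", "advertiser_id": "v1", "timestamp": 179},
    {"ad_id": "a2", "campaign_id": "c1", "advertiser_id": "v1", "timestamp": 180},
]
for key, n in sorted(aggregate_clicks(events).items()):
    print(key, n)
```

In a real deployment the same keying and windowing would be expressed in the stream processor, which also handles state, parallelism, and recovery.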
Multi-dimensional rollups: aggregate clicks across dimensions — by time window (1-min, 5-min, 1-hour, 1-day), by geography (country, region), by device type, by campaign, by ad creative; support drill-down from high-level to granular views
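
Because click counts are additive, coarser time windows can be derived by summing the 1-minute buckets instead of re-reading raw events. The sketch below assumes counts keyed by `(key, window_start)` with epoch-second window starts; `rollup` is a hypothetical helper name.

```python
from collections import Counter


def rollup(minute_counts, bucket_seconds):
    """Sum fine-grained window counts into coarser buckets of bucket_seconds."""
    out = Counter()
    for (key, start), n in minute_counts.items():
        out[(key, start - start % bucket_seconds)] += n
    return out


minute_counts = {
    ("ad:a1", 0): 4,
    ("ad:a1", 60): 6,
    ("ad:a1", 300): 1,
    ("ad:a2", 240): 3,
}
five_min = rollup(minute_counts, 5 * 60)
hourly = rollup(five_min, 60 * 60)
print(dict(five_min))
print(dict(hourly))
```

Chaining the rollups (1-minute to 5-minute to hourly to daily) keeps each level cheap, since every step only reads the level below it.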
Click-through rate (CTR): compute CTR = clicks / impressions for each ad and campaign; requires correlating click events with impression events; update CTR in near real-time for advertiser dashboards
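
The CTR formula above can be sketched as a join of click and impression counts on a shared key. The function names are hypothetical, and returning `None` for zero impressions is an assumption about how undefined CTR should be reported.

```python
from typing import Dict, Hashable, Optional


def ctr(clicks: int, impressions: int) -> Optional[float]:
    """CTR = clicks / impressions; undefined (None) when there are no impressions."""
    if impressions <= 0:
        return None
    return clicks / impressions


def ctr_by_key(
    click_counts: Dict[Hashable, int], impression_counts: Dict[Hashable, int]
) -> Dict[Hashable, Optional[float]]:
    """Join the two count maps on key; keys with no clicks get a CTR of 0.0."""
    return {
        key: ctr(click_counts.get(key, 0), imps)
        for key, imps in impression_counts.items()
    }


print(ctr_by_key({"a1": 5}, {"a1": 100, "a2": 50}))
```

Keying both streams on the same `(ad_id, window)` or `(campaign_id, window)` tuple is what makes this join possible in near real time.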
Click fraud detection: identify and filter fraudulent clicks — bot traffic (abnormal click patterns), click farms (many clicks from same IP/device), competitor click fraud; filter before counting for billing
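
One simple signal for the "many clicks from same IP" case is a per-IP sliding-window rate check. The sketch below is only one heuristic among many that a production fraud filter would combine; the class name and the thresholds (5 clicks per 60 seconds) are assumptions chosen for illustration.

```python
from collections import defaultdict, deque


class IpRateFilter:
    """Flag a click when its IP exceeds max_clicks within window_seconds."""

    def __init__(self, max_clicks: int = 5, window_seconds: int = 60):
        self.max_clicks = max_clicks
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)  # ip -> timestamps of recent clicks

    def is_suspicious(self, ip: str, ts: int) -> bool:
        q = self._recent[ip]
        # Evict timestamps that have fallen out of the sliding window.
        while q and q[0] <= ts - self.window_seconds:
            q.popleft()
        q.append(ts)
        return len(q) > self.max_clicks


f = IpRateFilter()
flags = [f.is_suspicious("203.0.113.7", t) for t in range(6)]
print(flags)  # the sixth click within the window is flagged
```

Flagged clicks would be excluded from billable counts but typically still retained in raw storage so the decision can be audited later.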
Billing and attribution: accurately count billable clicks per advertiser per day; clicks are the basis for advertiser billing (cost-per-click model), so each click must be counted exactly once (no missed clicks, no double-counting); attribution identifies which click led to a conversion
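
One ingredient of exactly-once counting is making the count idempotent, so that a redelivered event does not increment the total twice. The sketch below deduplicates on `click_id` in memory; it is an illustration of the idea under that assumption, not a full exactly-once pipeline, and `BillableClickCounter` is a hypothetical name. A real system would bound the dedup state (for example with a TTL) and persist it.

```python
from collections import Counter
from datetime import datetime, timezone


class BillableClickCounter:
    """Count each click_id at most once per (advertiser_id, UTC day)."""

    def __init__(self):
        self._seen = set()       # click_ids already counted
        self.counts = Counter()  # (advertiser_id, "YYYY-MM-DD") -> clicks

    def record(self, click: dict) -> bool:
        """Return True if the click was counted, False if it was a duplicate."""
        if click["click_id"] in self._seen:
            return False
        self._seen.add(click["click_id"])
        day = datetime.fromtimestamp(click["timestamp"], tz=timezone.utc).date()
        self.counts[(click["advertiser_id"], day.isoformat())] += 1
        return True


counter = BillableClickCounter()
click = {"click_id": "k1", "advertiser_id": "v1", "timestamp": 1_700_000_000}
print(counter.record(click))        # first delivery is counted
print(counter.record(dict(click)))  # a retried delivery is ignored
```

Deduplication guards against double-counting; avoiding missed clicks additionally depends on durable ingestion and replay from the event log.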
Advertiser reporting dashboard: real-time and historical reports — click count, impression count, CTR, spend, conversions, CPA (cost per acquisition); filterable by campaign, ad group, date range, geography, device
Late-arriving events: handle click events that arrive out of order or late (up to 5 minutes late due to mobile network delays, CDN buffering); late events should still be counted in the correct time window
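
Late events are commonly handled with event-time windows and a watermark that trails the newest event time by the allowed lateness. The sketch below is a simplified, single-threaded illustration of that idea, not Flink's implementation: the watermark rule (max event time minus 5 minutes) and the class name are assumptions.

```python
from collections import Counter

WINDOW = 60             # 1-minute event-time windows
ALLOWED_LATENESS = 300  # accept events up to 5 minutes late


class LateTolerantAggregator:
    def __init__(self):
        self.open = Counter()  # window_start -> running count
        self.finalized = {}    # window_start -> final count
        self.too_late = []     # events whose window was already finalized
        self.max_event_time = 0

    def on_event(self, ts: int) -> None:
        self.max_event_time = max(self.max_event_time, ts)
        watermark = self.max_event_time - ALLOWED_LATENESS
        start = ts - ts % WINDOW
        if start + WINDOW <= watermark:
            # Window already closed: set aside for later correction.
            self.too_late.append(ts)
        else:
            self.open[start] += 1
        # Finalize every open window that the watermark has passed.
        for s in [s for s in self.open if s + WINDOW <= watermark]:
            self.finalized[s] = self.open.pop(s)


agg = LateTolerantAggregator()
for ts in (0, 30, 100, 45, 400, 10, 70):
    agg.on_event(ts)
print(agg.finalized, dict(agg.open), agg.too_late)
```

Here the out-of-order event at 45 still lands in window 0, while the event at 10 arrives after that window has been finalized and is set aside; events like that are a natural input for batch correction.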
Data reconciliation: periodically reconcile real-time aggregated counts against batch-recomputed counts from raw events (Lambda architecture); discrepancies flagged and corrected; ensures billing accuracy
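
The reconciliation step can be sketched as a key-by-key comparison of the two sets of counts. The sketch assumes the batch recomputation is treated as the source of truth and uses a hypothetical relative tolerance of 0.01%, chosen here to mirror the 99.99% billing accuracy target; `reconcile` is an illustrative name.

```python
def reconcile(realtime: dict, batch: dict, tolerance: float = 1e-4) -> dict:
    """Return keys whose real-time count deviates from batch beyond tolerance."""
    discrepancies = {}
    for key in realtime.keys() | batch.keys():
        rt = realtime.get(key, 0)
        b = batch.get(key, 0)
        # Relative error against the batch count; guard against division by zero.
        if abs(rt - b) / max(b, 1) > tolerance:
            discrepancies[key] = {"realtime": rt, "batch": b, "corrected": b}
    return discrepancies


realtime = {"v1": 1_000_000, "v2": 500, "v3": 7}
batch = {"v1": 1_000_050, "v2": 505, "v4": 3}
for key, info in sorted(reconcile(realtime, batch).items()):
    print(key, info)
```

Keys present on only one side are flagged as well, since a missing aggregate is as much a billing risk as a wrong one.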
Advertiser-level isolation: each advertiser sees only their own data; per-advertiser rate limits on API queries; data partitioned by advertiser for security and performance
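
Per-advertiser rate limits on API queries are often implemented with a token bucket per advertiser. The sketch below is one such approach under assumed limits (1 query per second, burst of 2); the class names are hypothetical, and the clock is passed in explicitly so the behaviour is deterministic.

```python
class TokenBucket:
    """Refill at rate tokens/second up to capacity; each request costs one token."""

    def __init__(self, rate: float, capacity: float, now: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = now

    def allow(self, now: float) -> bool:
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class AdvertiserRateLimiter:
    """Keep an independent bucket per advertiser so tenants cannot starve each other."""

    def __init__(self, rate: float = 1.0, capacity: float = 2.0):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def allow(self, advertiser_id: str, now: float) -> bool:
        if advertiser_id not in self.buckets:
            self.buckets[advertiser_id] = TokenBucket(self.rate, self.capacity, now)
        return self.buckets[advertiser_id].allow(now)


limiter = AdvertiserRateLimiter()
print([limiter.allow("v1", 0.0) for _ in range(3)])  # burst of 2, then throttled
print(limiter.allow("v2", 0.0))                      # other advertisers unaffected
```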