Every click on a short URL is a data point. At 1 billion redirects per day, that's 1 billion data points—a treasure trove of insights about user behavior, campaign effectiveness, geographic distribution, and content engagement.
For many URL shortener businesses (like Bitly), analytics are the product. The URL shortening is merely the means to collect valuable marketing intelligence. Understanding who clicks, from where, when, and on what devices transforms a simple redirect service into a powerful analytics platform.
But collecting analytics at this scale is challenging: event capture must never slow the redirect path, raw events arrive at roughly 500 GB per day, unique visitors can only be counted approximately, and storage must serve both live dashboards and historical queries.
By the end of this page, you will understand event collection patterns, streaming architectures with Kafka, aggregation strategies for time-series data, storage optimization techniques, and how to serve analytics queries efficiently.
Before building the collection pipeline, we must define what data we're collecting and how we'll structure it.
```typescript
/**
 * Raw Click Event
 *
 * Captured at the moment of redirect, before any aggregation.
 * This is the highest-fidelity data we collect.
 */
interface RawClickEvent {
  // Identity
  eventId: string;        // UUID for deduplication
  shortCode: string;      // Which short URL was clicked
  timestamp: number;      // Unix milliseconds

  // User Information (derived from request)
  ip: string;             // Visitor IP (for geo-lookup)
  userAgent: string;      // Browser/device identification
  acceptLanguage: string; // Language preference

  // Traffic Source
  referer: string | null;     // Referring page
  utmSource: string | null;   // utm_source parameter
  utmMedium: string | null;   // utm_medium parameter
  utmCampaign: string | null; // utm_campaign parameter

  // Session/Visitor Tracking
  visitorId: string | null; // Cookie-based visitor ID
  sessionId: string | null; // Session identifier

  // Technical Details
  protocol: string;     // HTTP or HTTPS
  responseTime: number; // Redirect latency in ms
  cacheLevel: string;   // local/redis/database
}

// Estimated size per event: ~500 bytes
// 1 billion events × 500 bytes = 500 GB/day of raw events

/**
 * Enriched Click Event
 *
 * After processing, we add derived fields from lookups.
 */
interface EnrichedClickEvent extends RawClickEvent {
  // Geo-derived (from IP lookup)
  country: string;  // ISO country code
  region: string;   // State/province
  city: string;     // City name
  latitude: number; // Coordinates
  longitude: number;

  // Device-derived (from User-Agent parsing)
  deviceType: 'desktop' | 'mobile' | 'tablet' | 'bot';
  os: string;       // Operating system
  osVersion: string;
  browser: string;  // Browser name
  browserVersion: string;
  isBot: boolean;   // Crawler detection

  // Time-derived
  hour: number;      // 0-23
  dayOfWeek: number; // 0-6
  isWeekend: boolean;

  // Visitor classification
  isUnique: boolean;    // First visit to this short URL
  isReturning: boolean; // Seen this visitor before
}
```

Raw events are aggregated into time-bucketed summaries for efficient querying:
```typescript
/**
 * Aggregated Click Metrics
 *
 * Pre-computed aggregations stored at various time granularities.
 */

// Hourly aggregation (most detailed, retained 30 days)
interface HourlyMetrics {
  shortCode: string;
  hour: string; // "2024-01-15T14:00:00Z"
  clicks: number;
  uniqueVisitors: number;

  // Grouped distributions
  byCountry: Record<string, number>; // {"US": 523, "UK": 234, ...}
  byDevice: Record<string, number>;  // {"mobile": 612, "desktop": 145}
  byBrowser: Record<string, number>; // {"Chrome": 432, "Safari": 215}
  byReferer: Record<string, number>; // {"twitter.com": 324, "direct": 433}
}

// Daily aggregation (retained 1 year)
interface DailyMetrics {
  shortCode: string;
  date: string; // "2024-01-15"
  clicks: number;
  uniqueVisitors: number;
  byCountry: Record<string, number>;
  byDevice: Record<string, number>;
  peakHour: number; // Hour with most clicks
  avgResponseTime: number;
}

// Monthly aggregation (retained indefinitely)
interface MonthlyMetrics {
  shortCode: string;
  month: string; // "2024-01"
  clicks: number;
  uniqueVisitors: number;
  topCountries: { country: string; clicks: number }[]; // Top 10
  topReferers: { referer: string; clicks: number }[];  // Top 10
}

// Real-time counters (for dashboard display)
interface RealtimeCounter {
  shortCode: string;
  windowStart: number;    // Unix timestamp
  windowDuration: number; // Seconds (60, 300, 3600)
  clicks: number;
}

/**
 * Data Retention Policy
 *
 * Raw events: 7 days (then deleted)
 * Hourly:     30 days
 * Daily:      1 year
 * Monthly:    Forever
 * Real-time:  Rolling (auto-expires)
 */
```

One million clicks on a URL over a day become 24 hourly records instead of 1 million raw events. This roughly 40,000x reduction makes long-term storage and querying feasible. The trade-off: you lose individual event details after the raw-event retention window.
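To make the rollup concrete, here is a minimal sketch of collapsing one day of `HourlyMetrics` into a single `DailyMetrics` record, using the interfaces above. The `mergeCounts` helper and the peak-hour logic are illustrative, not part of the original design.

```typescript
// Sketch: roll up one day's HourlyMetrics into a DailyMetrics record.
// Assumes the HourlyMetrics and DailyMetrics interfaces defined above.

function mergeCounts(
  target: Record<string, number>,
  source: Record<string, number>,
): void {
  for (const [key, count] of Object.entries(source)) {
    target[key] = (target[key] ?? 0) + count;
  }
}

function rollupDaily(date: string, hours: HourlyMetrics[]): DailyMetrics {
  const byCountry: Record<string, number> = {};
  const byDevice: Record<string, number> = {};
  let clicks = 0;
  let peakHour = 0;
  let peakClicks = -1;

  for (const h of hours) {
    clicks += h.clicks;
    mergeCounts(byCountry, h.byCountry);
    mergeCounts(byDevice, h.byDevice);

    // Track the hour of day with the most clicks
    const hourOfDay = new Date(h.hour).getUTCHours();
    if (h.clicks > peakClicks) {
      peakClicks = h.clicks;
      peakHour = hourOfDay;
    }
  }

  return {
    shortCode: hours[0]?.shortCode ?? '',
    date,
    clicks,
    // Caveat: summing hourly uniques overcounts visitors who span hours;
    // exact daily uniques need HyperLogLog merging (covered later).
    uniqueVisitors: hours.reduce((sum, h) => sum + h.uniqueVisitors, 0),
    byCountry,
    byDevice,
    peakHour,
    avgResponseTime: 0, // Omitted: needs a click-weighted average per hour
  };
}
```

Note the unique-visitor caveat: uniques are not additive across windows, which is why the ClickHouse materialized view later on this page marks the same sum as approximate.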
The cardinal rule of analytics collection is: never let analytics slow down the user's redirect. We achieve this through asynchronous, decoupled event emission.
```
Analytics Collection Pipeline
==============================

┌──────────────┐       ┌──────────────────────────────────┐
│  Redirect    │       │        Analytics Pipeline        │
│  Service     │       │                                  │
│              │       │  ┌───────────────────────────┐   │
│  ┌────────┐  │       │  │      Kafka / Kinesis      │   │
│  │Redirect│  │   ┌──▶│  │     (Event Streaming)     │   │
│  │Handler │  │   │   │  └────────────┬──────────────┘   │
│  └───┬────┘  │   │   │               │                  │
│      │ fire  │   │   │               ▼                  │
│      │ and   │   │   │  ┌───────────────────────────┐   │
│      │ forget│   │   │  │     Stream Processors     │   │
│      ▼       │   │   │  │      (Flink / Spark)      │   │
│  ┌────────┐  │   │   │  │  • Enrich (geo, device)   │   │
│  │ Async  │──┼───┘   │  │  • Deduplicate            │   │
│  │Producer│  │       │  │  • Aggregate by time      │   │
│  └────────┘  │       │  └────────────┬──────────────┘   │
│              │       │               │                  │
└──────────────┘       │               ▼                  │
                       │  ┌───────────────────────────┐   │
                       │  │    Time-Series Storage    │   │
                       │  │  (InfluxDB / TimescaleDB) │   │
                       │  └────────────┬──────────────┘   │
                       │               │                  │
                       │               ▼                  │
                       │  ┌───────────────────────────┐   │
                       │  │       Analytics API       │   │
                       │  │      (Query Service)      │   │
                       │  └───────────────────────────┘   │
                       │                                  │
                       └──────────────────────────────────┘

Key Design Principles:
1. Redirect path is isolated from analytics processing
2. Kafka provides durability buffer (survives downstream failures)
3. Stream processing handles enrichment and aggregation
4. Time-series DB optimized for analytics queries
```

The redirect handler must emit events without blocking:
```typescript
/**
 * Asynchronous Event Emission
 *
 * Fire-and-forget event publishing that never blocks redirects.
 */

import { Kafka, Producer } from 'kafkajs';
import crypto from 'node:crypto';

class AnalyticsEmitter {
  private producer: Producer;
  private localBuffer: RawClickEvent[] = [];
  private readonly BATCH_SIZE = 100;
  private readonly FLUSH_INTERVAL_MS = 100;

  constructor(kafka: Kafka) {
    this.producer = kafka.producer({
      allowAutoTopicCreation: false,
      idempotent: true, // Exactly-once semantics
    });

    // Connect in the background; events buffer locally until ready
    this.producer.connect().catch(err =>
      console.error('Kafka producer connect failed:', err));

    // Start background flush loop
    setInterval(() => this.flush(), this.FLUSH_INTERVAL_MS);
  }

  /**
   * Emit a click event. Returns immediately.
   * Event is buffered and sent in background.
   */
  emit(event: RawClickEvent): void {
    // Generate unique event ID for deduplication
    event.eventId = crypto.randomUUID();

    // Add to local buffer (fast memory operation)
    this.localBuffer.push(event);

    // If buffer is full, trigger immediate flush (still async)
    if (this.localBuffer.length >= this.BATCH_SIZE) {
      setImmediate(() => this.flush());
    }
  }

  /**
   * Flush buffered events to Kafka.
   * Runs in background; errors are logged but never thrown to the caller.
   */
  private async flush(): Promise<void> {
    if (this.localBuffer.length === 0) return;

    // Swap buffer (atomic operation)
    const events = this.localBuffer;
    this.localBuffer = [];

    try {
      await this.producer.send({
        topic: 'click-events',
        messages: events.map(event => ({
          key: event.shortCode, // Partition by short code
          value: JSON.stringify(event),
          timestamp: String(event.timestamp),
        })),
      });
    } catch (error) {
      // Log error but don't crash - analytics are non-critical
      console.error('Failed to send analytics batch:', error);
      // Optional: retry failed events
      // this.retryBuffer.push(...events);
    }
  }
}

// Usage in redirect handler:
const analytics = new AnalyticsEmitter(kafka);

function handleRedirect(request: Request): Response {
  const shortCode = parseShortCode(request.url);
  const longUrl = lookupUrl(shortCode);

  // Emit analytics event (non-blocking)
  analytics.emit({
    shortCode,
    timestamp: Date.now(),
    ip: request.headers.get('cf-connecting-ip') ?? '',
    userAgent: request.headers.get('user-agent') ?? '',
    referer: request.headers.get('referer'),
    // ... other fields
  });

  // Return redirect immediately
  return Response.redirect(longUrl, 302);
}
```

Sending 1 billion individual Kafka messages per day is expensive. Batching 100 events per send reduces this to 10 million requests, a 100x reduction in Kafka overhead. The trade-off is slight latency (a 100 ms buffer window) before events appear in the pipeline.
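The `retryBuffer` hinted at in the catch block can be made concrete. Here is a minimal sketch, not part of the original design: a bounded buffer whose cap (an illustrative value) ensures a prolonged Kafka outage degrades to dropped analytics rather than unbounded memory growth.

```typescript
// Sketch: bounded retry buffer for failed batches (illustrative assumption,
// not from the original design). Oldest events are dropped first when the
// cap is hit, since losing analytics beats exhausting memory.

const MAX_RETRY_EVENTS = 10_000; // Assumption: tune to available memory

class RetryBuffer {
  private events: RawClickEvent[] = [];

  push(failed: RawClickEvent[]): void {
    this.events.push(...failed);
    // Enforce the cap by discarding the oldest events
    const overflow = this.events.length - MAX_RETRY_EVENTS;
    if (overflow > 0) {
      this.events.splice(0, overflow);
    }
  }

  /** Drain up to `batchSize` events for the next flush attempt. */
  drain(batchSize: number): RawClickEvent[] {
    return this.events.splice(0, batchSize);
  }

  get size(): number {
    return this.events.length;
  }
}
```

On each flush cycle, drained retry events would simply be prepended to the outgoing batch.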
Raw events need enrichment (geo-location, device parsing) and aggregation before storage. Stream processing frameworks like Apache Flink or Kafka Streams handle this in real-time.
"""Analytics Stream Processor (Apache Flink / Conceptual) Processes raw click events through enrichment and aggregation stages.""" from pyflink.datastream import StreamExecutionEnvironmentfrom pyflink.table import StreamTableEnvironment # Stage 1: Parse and Validatedef parse_event(raw_event: str) -> ClickEvent: """Parse JSON and validate required fields.""" event = json.loads(raw_event) # Validate required fields required = ['shortCode', 'timestamp', 'ip', 'userAgent'] if not all(field in event for field in required): raise ValueError(f"Missing required field in event") return ClickEvent(**event) # Stage 2: Enrich with Geo Datadef enrich_geo(event: ClickEvent) -> EnrichedEvent: """Add geographic information from IP address.""" geo = geo_ip_lookup(event.ip) # MaxMind or similar return EnrichedEvent( **event.__dict__, country=geo.country_code, region=geo.region, city=geo.city, latitude=geo.latitude, longitude=geo.longitude, ) # Stage 3: Parse User-Agentdef enrich_device(event: EnrichedEvent) -> EnrichedEvent: """Add device information from User-Agent.""" ua = parse_user_agent(event.user_agent) # ua-parser library event.device_type = ua.device.family # mobile, desktop, etc. event.os = ua.os.family event.os_version = ua.os.version_string event.browser = ua.browser.family event.browser_version = ua.browser.version_string event.is_bot = ua.is_bot return event # Stage 4: Deduplicatedef deduplicate(events: Stream[EnrichedEvent]) -> Stream[EnrichedEvent]: """Remove duplicate events within a time window.""" # Key by event_id, keep first occurrence return events \ .key_by(lambda e: e.event_id) \ .reduce(lambda a, b: a) # Keep first # Stage 5: Time-Window Aggregationdef aggregate_hourly(events: Stream[EnrichedEvent]): """Aggregate events into hourly buckets.""" return events \ .key_by(lambda e: e.short_code) \ .window(TumblingEventTimeWindows.of(Time.hours(1))) \ .aggregate( HourlyAggregator(), # Custom aggregator ProcessWindowFunction() ) class HourlyAggregator: """Aggregate click metrics for one hour window.""" def create_accumulator(self): return { 'clicks': 0, 'unique_visitors': set(), 'by_country': Counter(), 'by_device': Counter(), 'by_browser': Counter(), 'by_referer': Counter(), } def add(self, event: EnrichedEvent, accumulator): accumulator['clicks'] += 1 accumulator['unique_visitors'].add(event.visitor_id) accumulator['by_country'][event.country] += 1 accumulator['by_device'][event.device_type] += 1 accumulator['by_browser'][event.browser] += 1 referer = extract_domain(event.referer) or 'direct' accumulator['by_referer'][referer] += 1 return accumulator def get_result(self, accumulator): return HourlyMetrics( clicks=accumulator['clicks'], unique_visitors=len(accumulator['unique_visitors']), by_country=dict(accumulator['by_country']), by_device=dict(accumulator['by_device']), by_browser=dict(accumulator['by_browser']), by_referer=dict(accumulator['by_referer']), )Events may arrive late due to network delays or mobile devices coming online. Use event-time (timestamp in event) not processing-time, and configure watermarks with allowed lateness (e.g., 1 hour). Late events trigger window re-computation.
Different analytics use cases have different freshness requirements. A Lambda or Kappa architecture can serve both the real-time and batch ends of this spectrum, as the routing sketch after the table illustrates.
| Use Case | Freshness Needed | Acceptable Latency | Approach |
|---|---|---|---|
| Live dashboard 'Now' | Real-time | < 1 second | Streaming counters |
| Hourly reports | Near real-time | < 5 minutes | Micro-batch |
| Daily summaries | Batch | < 1 hour | Daily batch jobs |
| Historical analysis | Batch | Next day | Nightly aggregation |
| Ad-hoc queries | On-demand | Minutes | Query on aggregates |
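One way to realize this table in code is a thin router that picks the backing store by requested freshness. A minimal sketch, assuming the Redis counters and ClickHouse aggregates described later on this page; `queryRealtimeCounters` and `queryAggregates` are hypothetical helpers, not real APIs.

```typescript
// Sketch: route a clicks query to the store matching its freshness need.
// queryRealtimeCounters / queryAggregates are hypothetical helpers standing
// in for the Redis and ClickHouse access shown later on this page.

declare function queryRealtimeCounters(
  shortCode: string, windowSeconds: number): Promise<number>;
declare function queryAggregates(
  shortCode: string, windowSeconds: number, table: string): Promise<number>;

type Freshness = 'realtime' | 'near-realtime' | 'batch';

async function getClicks(
  shortCode: string,
  windowSeconds: number,
  freshness: Freshness,
): Promise<number> {
  switch (freshness) {
    case 'realtime':
      // Live dashboard: Redis counters, sub-second, recent data only
      return queryRealtimeCounters(shortCode, windowSeconds);
    case 'near-realtime':
      // Hourly reports: micro-batched aggregates, minutes of lag
      return queryAggregates(shortCode, windowSeconds, 'hourly_metrics');
    case 'batch':
      // Historical analysis: daily rollups, updated nightly
      return queryAggregates(shortCode, windowSeconds, 'daily_metrics');
  }
}
```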
For truly real-time metrics ('clicks in last 60 seconds'), we use Redis sorted sets or HyperLogLog:
```typescript
/**
 * Real-Time Click Counters
 *
 * Uses Redis for sub-second counter updates and queries.
 */

import Redis from 'ioredis';

class RealtimeCounters {
  private redis: Redis;

  constructor(redis: Redis) {
    this.redis = redis;
  }

  /**
   * Record a click in the rolling window.
   * Called from stream processor, not redirect handler.
   */
  async incrementClick(shortCode: string, timestamp: number): Promise<void> {
    const key = `clicks:${shortCode}:realtime`;

    // ZADD to sorted set with the event timestamp as score.
    // Member must be unique per click (timestamp + random suffix).
    await this.redis.zadd(key, timestamp, `${timestamp}:${Math.random()}`);

    // Trim entries older than 1 hour, then refresh expiration
    await this.redis.zremrangebyscore(key, 0, timestamp - 3_600_000);
    await this.redis.expire(key, 3600);
  }

  /**
   * Get clicks in the last N seconds.
   */
  async getClicksInWindow(shortCode: string, windowSeconds: number): Promise<number> {
    const key = `clicks:${shortCode}:realtime`;
    const now = Date.now();
    const windowStart = now - (windowSeconds * 1000);

    // Count members with score >= windowStart
    return await this.redis.zcount(key, windowStart, now);
  }

  /**
   * Track unique visitors using HyperLogLog (approximate, memory efficient).
   */
  async addUniqueVisitor(shortCode: string, visitorId: string, date: string): Promise<void> {
    const key = `unique:${shortCode}:${date}`;
    await this.redis.pfadd(key, visitorId);
    await this.redis.expire(key, 86400 * 7); // Keep 7 days
  }

  async getUniqueVisitors(shortCode: string, date: string): Promise<number> {
    const key = `unique:${shortCode}:${date}`;
    return await this.redis.pfcount(key);
  }
}

// HyperLogLog properties:
// - Uses only 12KB per counter regardless of cardinality
// - 0.81% standard error (accurate enough for analytics)
// - Can merge multiple HLLs for combined counts

// For 1M short URLs with daily unique tracking:
// Memory: 1M × 12KB = 12GB (very efficient!)
```

Counting exact unique visitors requires storing every visitor ID, which is impossible at scale. HyperLogLog provides approximate counts (±1% error) using only 12KB per counter. It's the industry standard for unique counting in web analytics.
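The merge property noted in the comments above is what makes HyperLogLog practical for rollups: daily HLLs combine into a weekly unique count without rescanning visitor IDs. A sketch using ioredis, assuming the daily `unique:{code}:{date}` keys created by `addUniqueVisitor`; the scratch-key name is an illustrative choice.

```typescript
import Redis from 'ioredis';

const redis = new Redis();

/**
 * Weekly unique visitors by merging 7 daily HyperLogLogs.
 * PFMERGE unions the sketches; PFCOUNT reads the merged estimate.
 * Note: uniques are NOT additive, so summing daily counts would overcount.
 */
async function getWeeklyUniques(shortCode: string, dates: string[]): Promise<number> {
  const dailyKeys = dates.map(date => `unique:${shortCode}:${date}`);
  const mergedKey = `unique:${shortCode}:weekly:tmp`; // Illustrative scratch key

  await redis.pfmerge(mergedKey, ...dailyKeys);
  const count = await redis.pfcount(mergedKey);
  await redis.del(mergedKey); // Discard the scratch sketch after reading

  return count;
}
```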
Choosing the right storage for analytics data depends on query patterns, retention requirements, and scale.
| Database | Best For | Query Speed | Write Speed | Scalability |
|---|---|---|---|---|
| InfluxDB | Time-series metrics | Fast for time queries | High throughput | Clustering available |
| TimescaleDB | Time-series with SQL | Excellent (PostgreSQL) | Very high | Excellent |
| ClickHouse | OLAP analytics | Extremely fast | Very high | Excellent |
| Druid | Real-time OLAP | Sub-second | High | Excellent |
| BigQuery | Ad-hoc analytics | Fast for aggregations | Batch preferred | Serverless |
ClickHouse is purpose-built for analytics workloads with columnar storage and vectorized query execution:
```sql
-- ClickHouse Schema for URL Shortener Analytics

-- Table for hourly aggregated metrics
CREATE TABLE hourly_metrics (
    short_code LowCardinality(String),
    hour DateTime,

    -- Core metrics
    clicks UInt64,
    unique_visitors UInt64,

    -- Distributions (stored as nested structures)
    country_clicks Nested(
        country LowCardinality(String),
        count UInt32
    ),
    device_clicks Nested(
        device LowCardinality(String),
        count UInt32
    ),
    browser_clicks Nested(
        browser LowCardinality(String),
        count UInt32
    ),
    referer_clicks Nested(
        referer String,
        count UInt32
    )
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(hour)    -- Partition by month
ORDER BY (short_code, hour)    -- Primary key
TTL hour + INTERVAL 30 DAY;    -- Auto-delete after 30 days

-- Table for daily rollups (longer retention)
CREATE TABLE daily_metrics (
    short_code LowCardinality(String),
    date Date,
    clicks UInt64,
    unique_visitors UInt64,

    -- Top-N only (reduces storage)
    top_countries Array(Tuple(String, UInt32)),  -- Top 10
    top_devices Array(Tuple(String, UInt32)),
    top_referers Array(Tuple(String, UInt32)),

    peak_hour UInt8,
    avg_response_ms Float32
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY (short_code, date)
TTL date + INTERVAL 1 YEAR;

-- Materialized view for automatic rollup from hourly to daily
CREATE MATERIALIZED VIEW daily_metrics_mv
TO daily_metrics
AS SELECT
    short_code,
    toDate(hour) AS date,
    sum(clicks) AS clicks,
    sum(unique_visitors) AS unique_visitors,  -- Approximate!
    -- ... aggregation logic
FROM hourly_metrics
GROUP BY short_code, toDate(hour);
```
```sql
-- Common Analytics Queries

-- 1. Clicks over time for a short URL (dashboard chart)
SELECT
    toStartOfHour(hour) AS time,
    sum(clicks) AS clicks
FROM hourly_metrics
WHERE short_code = 'a7Xk2B'
  AND hour >= now() - INTERVAL 7 DAY
GROUP BY time
ORDER BY time;

-- 2. Top countries for a URL
SELECT
    country_clicks.country AS country,
    sum(country_clicks.count) AS clicks
FROM hourly_metrics
ARRAY JOIN country_clicks
WHERE short_code = 'a7Xk2B'
  AND hour >= now() - INTERVAL 30 DAY
GROUP BY country
ORDER BY clicks DESC
LIMIT 10;

-- 3. Total clicks across all URLs (global stats)
SELECT sum(clicks) AS total_clicks
FROM daily_metrics
WHERE date = today();

-- 4. Top performing URLs today
SELECT
    short_code,
    sum(clicks) AS clicks
FROM hourly_metrics
WHERE hour >= today()
GROUP BY short_code
ORDER BY clicks DESC
LIMIT 100;

-- Query performance: sub-second even on billions of rows
-- ClickHouse scans 100M+ rows/second on commodity hardware
```

ClickHouse stores data by column, not by row. For a query selecting only `clicks` from 1 billion rows, it reads only the `clicks` column (8 bytes × 1B = 8 GB), not entire rows (500 bytes × 1B = 500 GB). This roughly 60x reduction in data scanned is what makes analytics queries so fast.
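For completeness, here is a sketch of how the stream processor's output might land in `hourly_metrics` using the official `@clickhouse/client` Node package. The flattening of `Nested` columns into parallel arrays follows ClickHouse's storage model for `Nested`; the endpoint and client options are assumptions for illustration.

```typescript
import { createClient } from '@clickhouse/client';

const clickhouse = createClient({ url: 'http://localhost:8123' }); // Assumed endpoint

/** Insert one hourly aggregate. Nested columns are written as parallel arrays. */
async function insertHourlyMetrics(m: HourlyMetrics): Promise<void> {
  await clickhouse.insert({
    table: 'hourly_metrics',
    format: 'JSONEachRow',
    values: [{
      short_code: m.shortCode,
      hour: m.hour,
      clicks: m.clicks,
      unique_visitors: m.uniqueVisitors,
      // Nested(country, count) maps to two parallel arrays
      'country_clicks.country': Object.keys(m.byCountry),
      'country_clicks.count': Object.values(m.byCountry),
      'device_clicks.device': Object.keys(m.byDevice),
      'device_clicks.count': Object.values(m.byDevice),
      'browser_clicks.browser': Object.keys(m.byBrowser),
      'browser_clicks.count': Object.values(m.byBrowser),
      'referer_clicks.referer': Object.keys(m.byReferer),
      'referer_clicks.count': Object.values(m.byReferer),
    }],
  });
}
```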
Counting unique visitors is one of the most challenging analytics problems, especially in a privacy-conscious world where cookies are blocked and IP addresses are shared.
| Approach | Accuracy | Privacy Impact | Limitations |
|---|---|---|---|
| First-Party Cookie | High (when available) | Low (consented) | Blocked by ~30% of users |
| IP + User-Agent Hash | Medium | Medium | Shared IPs, VPNs, NAT |
| Browser Fingerprinting | High | High (controversial) | Privacy regulations restrict |
| Login-based | Perfect | Low (explicit) | Only works for logged-in users |
| Statistical Estimation | Medium | None | Approximate, not individual-level |
```typescript
/**
 * Visitor Identification Strategy
 *
 * Privacy-respecting approach combining multiple signals.
 */

import crypto from 'node:crypto';

interface VisitorIdentification {
  visitorId: string | null; // Primary identifier
  sessionId: string | null; // Session tracking
  confidence: 'high' | 'medium' | 'low';
  method: 'cookie' | 'fingerprint' | 'statistical';
}

class VisitorIdentifier {
  /**
   * Identify visitor from request headers and optional cookie.
   * Falls back to fingerprint, then to no identification.
   */
  identify(request: Request): VisitorIdentification {
    // Tier 1: First-party cookie (highest accuracy, consented)
    const cookieId = this.extractVisitorCookie(request);
    if (cookieId) {
      return {
        visitorId: cookieId,
        sessionId: this.extractSessionCookie(request),
        confidence: 'high',
        method: 'cookie',
      };
    }

    // Tier 2: IP + User-Agent hash (medium accuracy)
    const ip = request.headers.get('cf-connecting-ip') ?? '';
    const userAgent = request.headers.get('user-agent') ?? '';
    if (ip && userAgent) {
      const fingerprint = this.hashFingerprint(ip, userAgent);
      return {
        visitorId: fingerprint,
        sessionId: null,
        confidence: 'medium',
        method: 'fingerprint',
      };
    }

    // Tier 3: No identification possible
    return {
      visitorId: null,
      sessionId: null,
      confidence: 'low',
      method: 'statistical',
    };
  }

  private hashFingerprint(ip: string, userAgent: string): string {
    // Hash IP and UA together - not unique, but better than nothing.
    // Add a date component so the same visitor is counted fresh each day.
    const date = new Date().toISOString().split('T')[0];
    const input = `${ip}:${userAgent}:${date}`;
    return crypto.createHash('sha256').update(input).digest('hex').slice(0, 16);
  }

  /**
   * Set visitor cookie in redirect response.
   * Only if privacy policy allows and user hasn't opted out.
   */
  setVisitorCookie(response: Response, visitorId?: string): Response {
    const id = visitorId ?? crypto.randomUUID();
    response.headers.set('Set-Cookie',
      `vid=${id}; ` +
      `Max-Age=31536000; ` + // 1 year
      `Path=/; ` +
      `Secure; ` +
      `HttpOnly; ` +
      `SameSite=Lax`
    );
    return response;
  }
}

// Privacy considerations:
// - Cookie requires user consent in GDPR regions
// - Fingerprinting may be restricted by ePrivacy regulations
// - Always honor DNT (Do Not Track) header
// - Provide opt-out mechanism in privacy policy
```

GDPR, CCPA, and ePrivacy regulations restrict tracking technologies: cookies require consent banners, and fingerprinting is increasingly restricted. Always consult legal counsel before implementing visitor tracking, and provide clear opt-out mechanisms.
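The DNT and opt-out handling mentioned in the comments can be a small guard in front of the identifier. A sketch, assuming the `VisitorIdentifier` above; the `analytics_optout` cookie name is an illustrative assumption.

```typescript
// Sketch: honor Do Not Track and explicit opt-out before identifying.
// The 'analytics_optout' cookie name is an illustrative assumption.

function identifyWithConsent(
  request: Request,
  identifier: VisitorIdentifier,
): VisitorIdentification {
  const dnt = request.headers.get('dnt');
  const cookies = request.headers.get('cookie') ?? '';
  const optedOut = cookies.includes('analytics_optout=1');

  if (dnt === '1' || optedOut) {
    // Still count the click, but attach no identity at all
    return {
      visitorId: null,
      sessionId: null,
      confidence: 'low',
      method: 'statistical',
    };
  }

  return identifier.identify(request);
}
```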
The analytics API serves dashboards and integrations. It must be fast, flexible, and protect user data.
```typescript
/**
 * Analytics API Endpoints
 */

// GET /api/v1/analytics/{shortCode}
// Get overview metrics for a short URL
interface AnalyticsOverviewResponse {
  shortCode: string;
  shortUrl: string;
  longUrl: string;

  // Summary metrics
  totalClicks: number;
  uniqueVisitors: number;

  // Period comparisons
  clicksToday: number;
  clicksYesterday: number;
  clicks7Days: number;
  clicks30Days: number;

  // Growth indicators
  dailyGrowth: number; // Percentage
  weeklyGrowth: number;

  // Metadata
  createdAt: string;
  lastClickAt: string;
}

// GET /api/v1/analytics/{shortCode}/timeseries
// Get clicks over time for charting
interface TimeseriesRequest {
  shortCode: string;
  startDate: string; // ISO date
  endDate: string;
  granularity: 'hour' | 'day' | 'week' | 'month';
  metrics: ('clicks' | 'uniqueVisitors' | 'avgResponseTime')[];
}

interface TimeseriesResponse {
  data: {
    timestamp: string;
    clicks: number;
    uniqueVisitors: number;
    avgResponseTime?: number;
  }[];
  granularity: string;
  timezone: string;
}

// GET /api/v1/analytics/{shortCode}/breakdown
// Get breakdown by dimension
interface BreakdownRequest {
  shortCode: string;
  startDate: string;
  endDate: string;
  dimension: 'country' | 'device' | 'browser' | 'os' | 'referer';
  limit?: number; // Default 10
}

interface BreakdownResponse {
  dimension: string;
  data: {
    value: string; // Country code, device name, etc.
    clicks: number;
    percentage: number;
  }[];
  total: number;
  other: number; // Clicks in items beyond limit
}

// GET /api/v1/analytics/{shortCode}/realtime
// Get live metrics (last 5 minutes)
interface RealtimeResponse {
  shortCode: string;
  clicksLastMinute: number;
  clicksLast5Minutes: number;
  clicksLastHour: number;
  activeNow: number; // Approximate concurrent viewers
  recentClicks: {
    timestamp: string;
    country: string;
    device: string;
    referer: string;
  }[]; // Last 10 clicks
}
```

For dashboard views, pre-fetch the most common queries (overview, 7-day chart, top 5 countries) in parallel. Users perceive a faster load when data appears progressively rather than waiting on a single large query.
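A minimal sketch of that progressive-loading pattern, using the endpoints above. The `fetchJson` helper, the `render` callback, and the concrete dates are illustrative assumptions.

```typescript
// Sketch: fetch the three most common dashboard queries in parallel and
// render each panel as soon as its data arrives.

async function fetchJson<T>(url: string): Promise<T> {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`${url}: ${res.status}`);
  return res.json() as Promise<T>;
}

async function loadDashboard(
  shortCode: string,
  render: (panel: string, data: unknown) => void,
): Promise<void> {
  const base = `/api/v1/analytics/${shortCode}`;

  // Fire all three requests at once; each panel renders independently
  // instead of the whole dashboard blocking on the slowest query.
  await Promise.all([
    fetchJson<AnalyticsOverviewResponse>(base)
      .then(data => render('overview', data)),
    fetchJson<TimeseriesResponse>(
      `${base}/timeseries?granularity=day&startDate=2024-01-08&endDate=2024-01-15`,
    ).then(data => render('chart', data)),
    fetchJson<BreakdownResponse>(`${base}/breakdown?dimension=country&limit=5`)
      .then(data => render('countries', data)),
  ]);
}
```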
We've built a comprehensive analytics collection and serving system. Let's consolidate the key architectural decisions:
| Component | Technology | Purpose |
|---|---|---|
| Event Emission | Async producer + Kafka | Decouple from redirect, handle bursts |
| Stream Processing | Flink / Kafka Streams | Enrich, dedupe, aggregate in real-time |
| Real-time Counters | Redis HyperLogLog | Sub-second unique counts, live dashboards |
| Time-series Storage | ClickHouse | Fast analytical queries on aggregates |
| Raw Event Storage | S3 / GCS (7-day) | Audit trail, reprocessing capability |
| API Layer | REST + caching | Serve dashboards and integrations |
You now understand how to collect, process, and serve analytics from billions of redirect events. Next, we'll explore custom short URLs—how to let users choose their own short codes while maintaining uniqueness and security.