At the heart of every URL shortener lies a deceptively simple problem: How do you generate a unique, short string for each of billions of URLs?
The short code (like a7Xk2B in https://short.url/a7Xk2B) must satisfy several competing constraints: it must be short enough to type, unique across billions of entries, built from URL-safe characters, and hard to enumerate.
This seemingly trivial problem touches on encoding theory, distributed systems, database design, and security—making it a perfect microcosm of system design challenges.
By the end of this page, you will master multiple URL encoding strategies: Base62 encoding, hash-based approaches, counter-based systems, and distributed ID generators. You'll understand the trade-offs of each and when to apply them.
Before choosing an encoding strategy, we must understand how many URLs we can represent with different character sets and lengths.
URL-safe characters that can appear in a path without encoding:
| Character Set | Characters Included | Size | Considerations |
|---|---|---|---|
| Numeric | 0-9 | 10 | Very limited space; requires long codes |
| Lowercase Alpha | a-z | 26 | Better but still limited |
| Base36 | 0-9, a-z | 36 | Case-insensitive; good for voice/print |
| Base62 | 0-9, a-z, A-Z | 62 | Most common; maximum density without special chars |
| Base64 (URL-safe) | Base62 + - and _ | 64 | Slightly denser; special chars may cause issues |
Let's calculate how many unique URLs we can represent with Base62 at different lengths:
```text
Base62 Capacity Calculation
===========================

Base62 character set: 0-9 (10) + a-z (26) + A-Z (26) = 62 characters

Length | Possible Codes     | Human Readable
-------|--------------------|----------------
1      | 62^1 = 62          | 62
2      | 62^2 = 3,844       | ~4K
3      | 62^3 = 238,328     | ~238K
4      | 62^4 = 14.8M       | ~15 million
5      | 62^5 = 916M        | ~1 billion
6      | 62^6 = 56.8B       | ~57 billion
7      | 62^7 = 3.5T        | ~3.5 trillion
8      | 62^8 = 218T        | ~218 trillion

Our Requirements:
- 100M URLs/day × 365 days × 10 years = 365 billion URLs total
- Need buffer for deletions, collisions, custom codes

Conclusion:
- 6 characters (56.8B) insufficient for 10-year horizon
- 7 characters (3.5T) provides ~10x buffer - IDEAL
- 8 characters overkill for most use cases
```

With 7 characters in Base62, we have 3.5 trillion possible codes. At 100 million URLs/day, that's 35,000 days (96 years) of capacity. Even accounting for collisions, reserved codes, and custom aliases, 7 characters provides ample headroom while keeping URLs short.
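The capacity figures above can be double-checked with a few lines of TypeScript (a quick sketch using BigInt to avoid precision loss):

```typescript
// Number of distinct codes of a given length over a given alphabet size
function codeSpace(alphabetSize: bigint, length: number): bigint {
  let total = 1n;
  for (let i = 0; i < length; i++) {
    total *= alphabetSize;
  }
  return total;
}

console.log(codeSpace(62n, 6)); // 56800235584n  (~57 billion)
console.log(codeSpace(62n, 7)); // 3521614606208n (~3.5 trillion)
```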
Base62 encoding converts a numeric ID into a string using 62 characters. The key insight is that any unique integer can be converted to a unique Base62 string.
```typescript
// Base62 Encoding/Decoding Implementation

const BASE62_CHARS = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
const BASE = 62n; // Using BigInt for large numbers

/**
 * Encode a numeric ID to Base62 string
 * Example: 125 → "21" (2*62 + 1 = 125)
 */
function encodeBase62(id: bigint, minLength: number = 7): string {
  if (id === 0n) {
    return BASE62_CHARS[0].repeat(minLength);
  }

  let result = '';
  let num = id;

  while (num > 0n) {
    const remainder = Number(num % BASE);
    result = BASE62_CHARS[remainder] + result;
    num = num / BASE; // BigInt division truncates (integer division)
  }

  // Pad to minimum length if needed
  while (result.length < minLength) {
    result = BASE62_CHARS[0] + result;
  }

  return result;
}

/**
 * Decode a Base62 string back to numeric ID
 * Example: "21" → 125
 */
function decodeBase62(code: string): bigint {
  let result = 0n;

  for (const char of code) {
    const index = BASE62_CHARS.indexOf(char);
    if (index === -1) {
      throw new Error(`Invalid Base62 character: ${char}`);
    }
    result = result * BASE + BigInt(index);
  }

  return result;
}

// Example usage:
// encodeBase62(1000000000n)     → "015FTGg" (padded to 7 chars)
// encodeBase62(3521614606207n)  → "ZZZZZZZ" (max 7-char value, 62^7 - 1)
// decodeBase62("a7Xk2B")        → corresponding integer
```

Base62 encoding is deterministic—the same integer always produces the same string. The real challenge is generating unique integers efficiently at scale.
We have several strategies for ID generation, each with distinct trade-offs:
| Strategy | Mechanism | Pros | Cons |
|---|---|---|---|
| Auto-increment DB | Database generates sequential IDs | Simple, guaranteed unique, efficient | Single point of failure, ID leakage, sharding issues |
| UUID → Truncate | Generate UUID, take first N chars | No coordination needed | High collision risk, wastes entropy |
| Hash-based | Hash the long URL | Deterministic, idempotent | Collision handling complex, predictable |
| Distributed Counter | Coordinated counter across nodes | Sequential-ish, predictable | Coordination overhead, complexity |
| Snowflake-style | Timestamp + worker ID + sequence | Sorted, distributed, fast | Clock dependencies, worker ID management |
Notice that all strategies fundamentally solve the same problem: mapping something (a URL, a timestamp, a counter) to a unique integer, then encoding that integer in Base62. The strategy you choose depends on your system constraints.
The simplest approach is to use a database's auto-increment feature. Each new URL gets the next sequential ID, which is then encoded.
```typescript
// Auto-Increment ID Generation

async function createShortUrl(longUrl: string): Promise<ShortUrl> {
  // 1. Insert URL into database, get auto-generated ID
  const result = await db.query(
    'INSERT INTO urls (long_url, created_at) VALUES ($1, NOW()) RETURNING id',
    [longUrl]
  );
  const numericId = result.rows[0].id; // e.g., 1000000001

  // 2. Encode ID to Base62
  const shortCode = encodeBase62(BigInt(numericId)); // e.g., "015FTGh"

  // 3. Update record with short code (optional, for indexed lookup)
  await db.query(
    'UPDATE urls SET short_code = $1 WHERE id = $2',
    [shortCode, numericId]
  );

  return {
    shortCode,
    longUrl,
    shortUrl: `https://short.url/${shortCode}`
  };
}

// Redirect lookup is fast since the ID is embedded in the short code:
async function resolveShortCode(shortCode: string): Promise<string> {
  const numericId = decodeBase62(shortCode);

  // Direct lookup by primary key - extremely fast!
  const result = await db.query(
    'SELECT long_url FROM urls WHERE id = $1',
    [numericId]
  );

  return result.rows[0].long_url;
}
```

Despite its limitations, auto-increment can scale with modifications:
```sql
-- Strategy: Range-based ID allocation
-- Each server reserves ID ranges, eliminating coordination

-- Server 1: IDs 1-1,000,000
-- Server 2: IDs 1,000,001-2,000,000
-- Server 3: IDs 2,000,001-3,000,000

-- ID Range Manager Table
CREATE TABLE id_ranges (
  server_id INT PRIMARY KEY,
  range_start BIGINT NOT NULL,
  range_end BIGINT NOT NULL,
  current_value BIGINT NOT NULL,
  created_at TIMESTAMP DEFAULT NOW()
);

-- Each server atomically reserves a new range when needed:
UPDATE id_ranges
SET range_start = (SELECT MAX(range_end) + 1 FROM id_ranges),
    range_end = (SELECT MAX(range_end) + 1000000 FROM id_ranges),
    current_value = (SELECT MAX(range_end) + 1 FROM id_ranges)
WHERE server_id = ?;

-- Alternative: Use stride-based approach
-- Server 1: 1, 4, 7, 10, ... (n*3 + 1)
-- Server 2: 2, 5, 8, 11, ... (n*3 + 2)
-- Server 3: 3, 6, 9, 12, ... (n*3 + 3)
-- This works but creates non-sequential IDs per server
```

Sequential IDs expose information. If your latest short code is 'a7Xk2G', attackers know 'a7Xk2F' exists. To prevent URL enumeration, you can randomly shuffle the mapping or use a more sophisticated encoding that obscures the sequence.
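One way to make "obscure the sequence" concrete is a bijective remapping of the 7-character code space: multiply each sequential ID by a constant coprime with 62^7 before encoding, and invert with the modular multiplicative inverse when decoding. This is a sketch, not a cipher from any production system, and the multiplier constant below is purely illustrative:

```typescript
const SPACE = 62n ** 7n; // 3,521,614,606,208 possible 7-char codes

// MULTIPLIER must share no factors with SPACE (= 2^7 * 31^7)
// so that multiplication mod SPACE is a bijection. 3^18 qualifies.
const MULTIPLIER = 387420489n; // illustrative constant

// Extended Euclidean algorithm: modular inverse of a mod m
function modInverse(a: bigint, m: bigint): bigint {
  let [oldR, r] = [a % m, m];
  let [oldS, s] = [1n, 0n];
  while (r !== 0n) {
    const q = oldR / r;
    [oldR, r] = [r, oldR - q * r];
    [oldS, s] = [s, oldS - q * s];
  }
  return ((oldS % m) + m) % m;
}

const INVERSE = modInverse(MULTIPLIER, SPACE);

// Scramble a sequential ID (must be < SPACE) before Base62-encoding it
function obfuscate(id: bigint): bigint {
  return (id * MULTIPLIER) % SPACE;
}

// Recover the original sequential ID after Base62-decoding
function deobfuscate(scrambled: bigint): bigint {
  return (scrambled * INVERSE) % SPACE;
}
```

Consecutive IDs now map to codes that are far apart, so knowing one short code reveals nothing about its neighbors, while lookups by primary key still work after deobfuscation.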
Hash-based encoding derives the short code directly from the long URL. This approach is deterministic and idempotent—the same URL always produces the same code.
```typescript
import crypto from 'crypto';

/**
 * Generate short code by hashing the long URL.
 * Takes 41 bits of the MD5 hash, encoded as 7 Base62 characters.
 */
function hashBasedShortCode(longUrl: string): string {
  // Step 1: Compute hash of URL
  const hash = crypto.createHash('md5').update(longUrl).digest();

  // Step 2: Read the first 8 bytes, then mask down to 41 bits
  // (2^41 ≈ 2.2T < 62^7 ≈ 3.5T, so the result always fits in 7 chars)
  const numeric = hash.readBigUInt64BE(0) & ((1n << 41n) - 1n);

  // Step 3: Encode as Base62 (7 characters)
  return encodeBase62(numeric, 7);
}

// Problem: Different URLs may produce the same short code (collision!)
// Solution: Check and handle collisions

async function createShortUrlWithHash(longUrl: string): Promise<ShortUrl> {
  let shortCode = hashBasedShortCode(longUrl);
  let attempt = 0;

  while (attempt < 5) {
    // Check if code already exists
    const existing = await db.query(
      'SELECT long_url FROM urls WHERE short_code = $1',
      [shortCode]
    );

    if (existing.rows.length === 0) {
      // Code is available
      await db.query(
        'INSERT INTO urls (short_code, long_url) VALUES ($1, $2)',
        [shortCode, longUrl]
      );
      return { shortCode, longUrl, shortUrl: `https://short.url/${shortCode}` };
    }

    if (existing.rows[0].long_url === longUrl) {
      // Same URL already shortened - return existing code
      return { shortCode, longUrl, shortUrl: `https://short.url/${shortCode}` };
    }

    // Collision! A different URL has the same code.
    // Strategy: Append the attempt number and re-hash
    shortCode = hashBasedShortCode(longUrl + String(++attempt));
  }

  throw new Error('Failed to generate unique short code after 5 attempts');
}
```

Understanding collision probability is crucial for hash-based approaches:
```text
Collision Probability Analysis (Birthday Paradox)
=================================================

For a hash of k bits with n items:
P(collision) ≈ 1 - e^(-n²/2^(k+1))

With 7 Base62 characters (≈41.7 bits of entropy; call it k = 42):
- 2^42 ≈ 4.4 trillion possible values

At 100 billion URLs (10 years of data):
- P(at least one collision) ≈ 1 - e^(-(100B)²/(2 × 4.4T))
- P(collision) ≈ 1 - e^(-1.1 billion)
- P(collision) ≈ 100% (virtually guaranteed!)

Expected number of collisions:
- E[collisions] ≈ n²/(2 × 2^k) = (100B)²/(2 × 4.4T) ≈ 1.1 billion

Reality Check:
- With 7-character codes and 100 billion URLs, roughly 1% of inserts
  collide and need a retry - manageable, but far from negligible
- Using 8 characters (62^8 = 218T) cuts expected collisions to ~23 million
```

Hash-based encoding works well when: (1) you want repeat submissions of the same URL to return the same short code, (2) you need fully distributed generation with no coordination, and (3) you accept occasional collision handling.
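The birthday-paradox arithmetic is easy to sanity-check in code. This sketch uses the standard approximation E ≈ n²/(2m) for the expected number of colliding inserts; plain JS number math is precise enough at this scale:

```typescript
// Expected number of colliding inserts when hashing n items into m slots
// (standard birthday-paradox approximation)
function expectedCollisions(n: number, m: number): number {
  return (n * n) / (2 * m);
}

const urls = 100e9; // 100 billion URLs over 10 years

console.log(expectedCollisions(urls, 2 ** 42)); // ~1.1 billion (42-bit codes)
console.log(expectedCollisions(urls, 62 ** 8)); // ~23 million  (8-char codes)
```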
Twitter's Snowflake and similar systems generate unique, roughly time-ordered IDs without central coordination. This is the gold standard for distributed ID generation at scale.
```text
Snowflake ID Structure (64 bits total)
======================================

┌───┬───────────────────────────┬───────────┬────────────────────┐
│ 1 │ 41 bits                   │ 10 bits   │ 12 bits            │
│   │ (timestamp ms)            │ (machine) │ (sequence)         │
└───┴───────────────────────────┴───────────┴────────────────────┘
  │              │                    │               │
  │              │                    │               └─ Sequence number (0-4095)
  │              │                    │                  Reset each millisecond
  │              │                    │
  │              │                    └─ Machine/Worker ID (0-1023)
  │              │                       Unique per server
  │              │
  │              └─ Milliseconds since epoch (Jan 1, 2015)
  │                 ~69 years of range
  │
  └─ Sign bit (always 0 for positive)

Capacity:
- 41 bits timestamp: 2^41 ms = ~69 years from epoch
- 10 bits machine ID: 1024 unique machines
- 12 bits sequence: 4096 IDs per millisecond per machine

Total throughput: 4096 × 1000 = 4.1 million IDs/second/machine
```
```typescript
/**
 * Snowflake-style ID Generator
 * Generates unique, time-sorted 64-bit IDs without coordination
 */
class SnowflakeGenerator {
  private readonly EPOCH = 1420070400000n; // Jan 1, 2015 UTC
  private readonly MACHINE_ID_BITS = 10n;
  private readonly SEQUENCE_BITS = 12n;
  private readonly maxMachineId = (1n << this.MACHINE_ID_BITS) - 1n;
  private readonly maxSequence = (1n << this.SEQUENCE_BITS) - 1n;

  private machineId: bigint;
  private sequence: bigint = 0n;
  private lastTimestamp: bigint = 0n;

  constructor(machineId: number) {
    this.machineId = BigInt(machineId);
    if (this.machineId > this.maxMachineId) {
      throw new Error(`Machine ID must be between 0 and ${this.maxMachineId}`);
    }
  }

  generate(): bigint {
    let timestamp = BigInt(Date.now()) - this.EPOCH;

    if (timestamp === this.lastTimestamp) {
      // Same millisecond - increment sequence
      this.sequence = (this.sequence + 1n) & this.maxSequence;
      if (this.sequence === 0n) {
        // Sequence overflow - wait for next millisecond
        timestamp = this.waitNextMillis(timestamp);
      }
    } else {
      // New millisecond - reset sequence
      this.sequence = 0n;
    }

    this.lastTimestamp = timestamp;

    // Compose the 64-bit ID
    return (timestamp << (this.MACHINE_ID_BITS + this.SEQUENCE_BITS)) |
           (this.machineId << this.SEQUENCE_BITS) |
           this.sequence;
  }

  private waitNextMillis(lastTimestamp: bigint): bigint {
    let timestamp = BigInt(Date.now()) - this.EPOCH;
    while (timestamp <= lastTimestamp) {
      timestamp = BigInt(Date.now()) - this.EPOCH;
    }
    return timestamp;
  }
}

// Usage for URL shortening:
const generator = new SnowflakeGenerator(42); // Machine #42

async function createShortUrl(longUrl: string): Promise<ShortUrl> {
  const id = generator.generate();       // e.g., 1234567890123456789n
  const shortCode = encodeBase62(id, 7); // NOTE: a full 64-bit ID can need up to 11 chars

  // Store mapping in database
  await db.insert({ id, shortCode, longUrl });
  return { shortCode, longUrl };
}
```

Standard Snowflake produces 64-bit IDs, which are too large for 7-character Base62 codes (3.5T values, ~42 bits). A modified structure can fit:
```text
Compact Snowflake (41 bits for 7-char Base62)
=============================================

┌─────────────────────┬──────────┬────────────────┐
│ 31 bits             │ 4 bits   │ 6 bits         │
│ (timestamp secs)    │ (worker) │ (sequence)     │
└─────────────────────┴──────────┴────────────────┘

- 31 bits timestamp (seconds): ~68 years from epoch
- 4 bits worker ID: 16 workers maximum
- 6 bits sequence: 64 IDs per second per worker

Total: 41 bits → 2.2T possible IDs → always fits in 7 Base62 chars
(2^41 ≈ 2.2T < 62^7 ≈ 3.5T; anything over 41 bits can overflow 7 chars)

Throughput: 64 IDs/sec/worker × 16 workers = 1,024 IDs/second
(Sufficient for many use cases)

For higher throughput, use 8-character codes (62^8 = 218T, which holds
up to 47 bits since 2^47 < 218T < 2^48):

┌─────────────────────┬──────────┬────────────────┐
│ 33 bits             │ 6 bits   │ 8 bits         │
│ (timestamp secs)    │ (worker) │ (sequence)     │
└─────────────────────┴──────────┴────────────────┘
```

Snowflake-style IDs combine the best of all approaches: no coordination (each server generates independently), guaranteed uniqueness (timestamp + worker + sequence), time-sortability (useful for debugging and analytics), and high throughput (millions per second).
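A minimal sketch of composing such a compact ID, using a 31-bit seconds timestamp, 4-bit worker ID, and 6-bit sequence (one possible 41-bit layout; the epoch constant is illustrative, and since 2^41 < 62^7 the result always encodes to at most 7 Base62 characters):

```typescript
const COMPACT_EPOCH = 1420070400n; // Jan 1, 2015 UTC, in seconds (illustrative)

// Pack timestamp | worker | sequence into a single 41-bit integer
function composeCompactId(nowSecs: bigint, workerId: bigint, sequence: bigint): bigint {
  if (workerId < 0n || workerId > 15n) throw new Error('worker ID must fit in 4 bits');
  if (sequence < 0n || sequence > 63n) throw new Error('sequence must fit in 6 bits');
  const ts = nowSecs - COMPACT_EPOCH;
  // Shift timestamp past the 4 worker bits + 6 sequence bits
  return (ts << 10n) | (workerId << 6n) | sequence;
}

// Example: 100 seconds after the epoch, worker 3, sequence 17
// (100 << 10) + (3 << 6) + 17 = 102400 + 192 + 17 = 102609
const id = composeCompactId(COMPACT_EPOCH + 100n, 3n, 17n); // → 102609n
```

The per-second sequence counter and overflow handling would follow the same pattern as the millisecond-based `SnowflakeGenerator` above.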
Range-based allocation pre-assigns ID ranges to servers, eliminating real-time coordination. This approach is used by Flickr's ticket servers and similar systems.
```typescript
/**
 * Range-Based ID Generator
 *
 * Central service allocates ranges of IDs to worker servers.
 * Workers use their range locally without further coordination.
 */

// Central Range Allocator Service
class RangeAllocator {
  private readonly rangeSize: bigint = 100000n; // 100K IDs per range

  constructor(private db: Database) {}

  async allocateRange(): Promise<{ start: bigint; end: bigint }> {
    return await this.db.transaction(async (tx) => {
      // Atomically advance the counter and claim the previous range
      const result = await tx.query(
        'UPDATE range_counter SET value = value + $1 RETURNING value - $1 AS start',
        [this.rangeSize]
      );
      const start = BigInt(result.rows[0].start);
      const end = start + this.rangeSize; // exclusive upper bound
      return { start, end };
    });
  }
}

// Worker Server (URL Shortener Instance)
class RangeBasedIdGenerator {
  private currentId: bigint = 0n;
  private rangeEnd: bigint = 0n; // exclusive

  constructor(private allocator: RangeAllocator) {}

  async getId(): Promise<bigint> {
    // Check if we need a new range
    if (this.currentId >= this.rangeEnd) {
      await this.fetchNewRange();
    }

    const id = this.currentId;
    this.currentId += 1n;
    return id;
  }

  private async fetchNewRange() {
    const range = await this.allocator.allocateRange();
    this.currentId = range.start;
    this.rangeEnd = range.end;
    console.log(`Fetched new range: ${range.start} - ${range.end - 1n}`);
  }
}

// Usage flow:
// 1. Server starts → requests range (e.g., 1-100,000)
// 2. Generates IDs 1, 2, 3, ... 100,000
// 3. Requests new range (100,001-200,000)
// 4. Continues generating...
```

Flickr uses two MySQL servers with auto-increment starting at 1 and 2, each incrementing by 2. This simple setup provides high availability: if one server fails, the other continues generating (with odd or even IDs only). It's elegant for moderate scale.
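Flickr's odd/even scheme maps onto two standard MySQL system variables plus a one-row "ticket" table. A sketch of the idea (the `tickets` table name and column layout here are illustrative, not Flickr's exact schema):

```sql
-- On ticket server 1:
--   auto_increment_increment = 2, auto_increment_offset = 1  → 1, 3, 5, ...
-- On ticket server 2:
--   auto_increment_increment = 2, auto_increment_offset = 2  → 2, 4, 6, ...
SET @@auto_increment_increment = 2;
SET @@auto_increment_offset = 1;

CREATE TABLE tickets (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  stub CHAR(1) NOT NULL DEFAULT '',
  PRIMARY KEY (id),
  UNIQUE KEY (stub)
);

-- REPLACE INTO reuses the single row (via the unique stub) but still
-- advances the auto-increment counter, so the table stays one row tall
-- no matter how many IDs are issued.
REPLACE INTO tickets (stub) VALUES ('a');
SELECT LAST_INSERT_ID();
```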
Many URL shorteners allow users to specify their own short codes (e.g., short.url/my-brand instead of short.url/a7Xk2B). This introduces additional complexity.
| Aspect | Requirement | Implementation |
|---|---|---|
| Uniqueness | No conflicts with generated or other custom codes | Check both custom and auto-generated namespace |
| Character Validation | Only URL-safe characters allowed | Regex validation: ^[a-zA-Z0-9_-]+$ |
| Length Limits | 2-30 characters typical | Minimum prevents single-char hoarding; max for sanity |
| Reserved Words | Block system paths and common words | Blocklist: api, admin, login, www, etc. |
| Availability Check | Real-time availability feedback | Fast lookup before commit |
| Case Sensitivity | Match URL case rules (typically insensitive) | Store and compare lowercase |
```typescript
/**
 * Custom Alias Handling
 */

const RESERVED_WORDS = new Set([
  'api', 'admin', 'login', 'logout', 'signup', 'settings',
  'www', 'mail', 'help', 'support', 'about', 'terms', 'privacy'
]);

const ALIAS_REGEX = /^[a-zA-Z0-9][a-zA-Z0-9_-]{1,29}$/; // 2-30 chars total

interface AliasValidationResult {
  valid: boolean;
  error?: string;
}

function validateCustomAlias(alias: string): AliasValidationResult {
  // Normalize to lowercase for case-insensitive comparison
  const normalized = alias.toLowerCase();

  // Check format
  if (!ALIAS_REGEX.test(alias)) {
    return {
      valid: false,
      error: 'Alias must be 2-30 characters, alphanumeric with hyphens/underscores'
    };
  }

  // Check reserved words
  if (RESERVED_WORDS.has(normalized)) {
    return { valid: false, error: 'This alias is reserved and cannot be used' };
  }

  // Check for confusing patterns
  if (/^[0-9]+$/.test(alias)) {
    return {
      valid: false,
      error: 'Alias cannot be purely numeric (conflicts with auto-generated codes)'
    };
  }

  return { valid: true };
}

async function createWithCustomAlias(
  longUrl: string,
  customAlias: string
): Promise<ShortUrl> {
  // Validate format
  const validation = validateCustomAlias(customAlias);
  if (!validation.valid) {
    throw new ValidationError(validation.error!);
  }

  const normalized = customAlias.toLowerCase();

  try {
    // Attempt to insert; a unique constraint on short_code_lower
    // guards against races between concurrent requests
    await db.query(
      `INSERT INTO urls (short_code, short_code_lower, long_url)
       VALUES ($1, $2, $3)`,
      [customAlias, normalized, longUrl]
    );

    return {
      shortCode: customAlias,
      longUrl,
      shortUrl: `https://short.url/${customAlias}`
    };
  } catch (err) {
    if (err.code === '23505') { // Postgres unique_violation
      throw new ConflictError('This alias is already taken');
    }
    throw err;
  }
}
```

Critical: auto-generated codes must not conflict with custom aliases. Solutions: (1) prefix auto-generated codes with a character never allowed in custom aliases (e.g., '0'), (2) use different length constraints (auto: exactly 7 chars; custom: 2-6 or 8+ chars), (3) check both namespaces on every creation.
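Option (2), separating the namespaces by length, can be sketched in a couple of helpers (assuming auto-generated codes are always exactly 7 characters, as this design uses):

```typescript
const AUTO_CODE_LENGTH = 7; // every auto-generated Base62 code uses exactly this length

// Auto-generated codes and custom aliases can never collide if custom
// aliases are forbidden from using the auto-generated length.
function isAutoGeneratedLength(code: string): boolean {
  return code.length === AUTO_CODE_LENGTH;
}

function isAllowedCustomAliasLength(alias: string): boolean {
  return alias.length >= 2 && alias.length <= 30 && alias.length !== AUTO_CODE_LENGTH;
}

// "a7Xk2B1" (7 chars) belongs to the generator; "my-brand" (8 chars) is fine
```

This check would slot in alongside `validateCustomAlias` above, replacing the need to query both namespaces on every creation.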
We've explored four major strategies for generating unique short codes. Each has its place depending on system constraints:
| Strategy | Best For | Avoid When | Complexity |
|---|---|---|---|
| Auto-Increment | Simple systems, single DB, low scale | Need distribution, security matters | Low |
| Hash-Based | Deduplication needed, fully distributed | Collision avoidance critical | Medium |
| Snowflake-Style | High scale, distributed systems | Clock sync issues, simpler options work | Medium-High |
| Range-Based | Moderate scale, simple distribution | True high availability needed | Medium |
You now understand the core encoding strategies for URL shorteners. Next, we'll focus on redirect latency optimization—how to serve those short codes at sub-50ms latency for billions of daily requests.