At the heart of every URL shortener lies a deceptively simple problem: How do you generate a unique, short string for each of billions of URLs?
The short code (like a7Xk2B in https://short.url/a7Xk2B) must satisfy several competing constraints: it must be short enough to type, unique across billions of entries, built from URL-safe characters, and hard to enumerate.
This seemingly trivial problem touches on encoding theory, distributed systems, database design, and security—making it a perfect microcosm of system design challenges.
By the end of this page, you will master multiple URL encoding strategies: Base62 encoding, hash-based approaches, counter-based systems, and distributed ID generators. You'll understand the trade-offs of each and when to apply them.
Before choosing an encoding strategy, we must understand how many URLs we can represent with different character sets and lengths.
URL-safe characters that can appear in a path without encoding:
| Character Set | Characters Included | Size | Considerations |
|---|---|---|---|
| Numeric | 0-9 | 10 | Very limited space; requires long codes |
| Lowercase Alpha | a-z | 26 | Better but still limited |
| Base36 | 0-9, a-z | 36 | Case-insensitive; good for voice/print |
| Base62 | 0-9, a-z, A-Z | 62 | Most common; maximum density without special chars |
| Base64 (URL-safe) | Base62 + - and _ | 64 | Slightly denser; special chars may cause issues |
Let's calculate how many unique URLs we can represent with Base62 at different lengths:
```text
Base62 Capacity Calculation
===========================

Base62 character set: 0-9 (10) + a-z (26) + A-Z (26) = 62 characters

Length | Possible Codes     | Human Readable
-------|--------------------|----------------
1      | 62^1 = 62          | 62
2      | 62^2 = 3,844       | ~4K
3      | 62^3 = 238,328     | ~238K
4      | 62^4 = 14.8M       | ~15 million
5      | 62^5 = 916M        | ~1 billion
6      | 62^6 = 56.8B       | ~57 billion
7      | 62^7 = 3.5T        | ~3.5 trillion
8      | 62^8 = 218T        | ~218 trillion

Our Requirements:
- 100M URLs/day × 365 days × 10 years = 365 billion URLs total
- Need buffer for deletions, collisions, custom codes

Conclusion:
- 6 characters (56.8B) insufficient for 10-year horizon
- 7 characters (3.5T) provides ~10x buffer - IDEAL
- 8 characters overkill for most use cases
```

With 7 characters in Base62, we have 3.5 trillion possible codes. At 100 million URLs/day, that's 35,000 days (96 years) of capacity. Even accounting for collisions, reserved codes, and custom aliases, 7 characters provides ample headroom while keeping URLs short.
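The capacity figures above can be double-checked with a few lines of TypeScript (a quick sketch using BigInt to avoid precision loss):

```typescript
// Number of distinct codes of a given length over a given alphabet size
function codeSpace(alphabetSize: bigint, length: number): bigint {
  let total = 1n;
  for (let i = 0; i < length; i++) {
    total *= alphabetSize;
  }
  return total;
}

console.log(codeSpace(62n, 6)); // 56800235584n  (~57 billion)
console.log(codeSpace(62n, 7)); // 3521614606208n (~3.5 trillion)
```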
Base62 encoding converts a numeric ID into a string using 62 characters. The key insight is that any unique integer can be converted to a unique Base62 string.
```typescript
// Base62 Encoding/Decoding Implementation

const BASE62_CHARS = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
const BASE = 62n; // Using BigInt for large numbers

/**
 * Encode a numeric ID to Base62 string
 * Example: 125 → "21" (2*62 + 1 = 125)
 */
function encodeBase62(id: bigint, minLength: number = 7): string {
  if (id === 0n) {
    return BASE62_CHARS[0].repeat(minLength);
  }

  let result = '';
  let num = id;

  while (num > 0n) {
    const remainder = Number(num % BASE);
    result = BASE62_CHARS[remainder] + result;
    num = num / BASE; // BigInt division truncates (integer division)
  }

  // Pad to minimum length if needed
  while (result.length < minLength) {
    result = BASE62_CHARS[0] + result;
  }

  return result;
}

/**
 * Decode a Base62 string back to numeric ID
 * Example: "21" → 125
 */
function decodeBase62(code: string): bigint {
  let result = 0n;

  for (const char of code) {
    const index = BASE62_CHARS.indexOf(char);
    if (index === -1) {
      throw new Error(`Invalid Base62 character: ${char}`);
    }
    result = result * BASE + BigInt(index);
  }

  return result;
}

// Example usage:
// encodeBase62(1000000000n)     → "015FTGg" (padded to 7 chars)
// encodeBase62(3521614606207n)  → "ZZZZZZZ" (max 7-char value, 62^7 - 1)
// decodeBase62("a7Xk2B")        → corresponding integer
```

Base62 encoding is deterministic—the same integer always produces the same string. The real challenge is generating unique integers efficiently at scale.
We have several strategies for ID generation, each with distinct trade-offs:
| Strategy | Mechanism | Pros | Cons |
|---|---|---|---|
| Auto-increment DB | Database generates sequential IDs | Simple, guaranteed unique, efficient | Single point of failure, ID leakage, sharding issues |
| UUID → Truncate | Generate UUID, take first N chars | No coordination needed | High collision risk, wastes entropy |
| Hash-based | Hash the long URL | Deterministic, idempotent | Collision handling complex, predictable |
| Distributed Counter | Coordinated counter across nodes | Sequential-ish, predictable | Coordination overhead, complexity |
| Snowflake-style | Timestamp + worker ID + sequence | Sorted, distributed, fast | Clock dependencies, worker ID management |
Notice that all strategies fundamentally solve the same problem: mapping something (a URL, a timestamp, a counter) to a unique integer, then encoding that integer in Base62. The strategy you choose depends on your system constraints.
The simplest approach is to use a database's auto-increment feature. Each new URL gets the next sequential ID, which is then encoded.
```typescript
// Auto-Increment ID Generation

async function createShortUrl(longUrl: string): Promise<ShortUrl> {
  // 1. Insert URL into database, get auto-generated ID
  const result = await db.query(
    'INSERT INTO urls (long_url, created_at) VALUES ($1, NOW()) RETURNING id',
    [longUrl]
  );
  const numericId = result.rows[0].id; // e.g., 1000000001

  // 2. Encode ID to Base62
  const shortCode = encodeBase62(BigInt(numericId)); // e.g., "015FTGh"

  // 3. Update record with short code (optional, for indexed lookup)
  await db.query(
    'UPDATE urls SET short_code = $1 WHERE id = $2',
    [shortCode, numericId]
  );

  return {
    shortCode,
    longUrl,
    shortUrl: `https://short.url/${shortCode}`
  };
}

// Redirect lookup is fast since the ID is embedded in the short code:
async function resolveShortCode(shortCode: string): Promise<string> {
  const numericId = decodeBase62(shortCode);

  // Direct lookup by primary key - extremely fast!
  const result = await db.query(
    'SELECT long_url FROM urls WHERE id = $1',
    [numericId]
  );

  return result.rows[0].long_url;
}
```

Despite its limitations, auto-increment can scale with modifications:
```sql
-- Strategy: Range-based ID allocation
-- Each server reserves ID ranges, eliminating coordination

-- Server 1: IDs 1-1,000,000
-- Server 2: IDs 1,000,001-2,000,000
-- Server 3: IDs 2,000,001-3,000,000

-- ID Range Manager Table
CREATE TABLE id_ranges (
  server_id INT PRIMARY KEY,
  range_start BIGINT NOT NULL,
  range_end BIGINT NOT NULL,
  current_value BIGINT NOT NULL,
  created_at TIMESTAMP DEFAULT NOW()
);

-- Each server atomically reserves a new range when needed:
UPDATE id_ranges
SET range_start = (SELECT MAX(range_end) + 1 FROM id_ranges),
    range_end = (SELECT MAX(range_end) + 1000000 FROM id_ranges),
    current_value = (SELECT MAX(range_end) + 1 FROM id_ranges)
WHERE server_id = ?;

-- Alternative: Use stride-based approach
-- Server 1: 1, 4, 7, 10, ... (n*3 + 1)
-- Server 2: 2, 5, 8, 11, ... (n*3 + 2)
-- Server 3: 3, 6, 9, 12, ... (n*3 + 3)
-- This works but creates non-sequential IDs per server
```

Sequential IDs expose information. If your latest short code is 'a7Xk2G', attackers know 'a7Xk2F' exists. To prevent URL enumeration, you can randomly shuffle the mapping or use a more sophisticated encoding that obscures the sequence.
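One way to make "obscure the sequence" concrete is a bijective remapping of the 7-character code space: multiply each sequential ID by a constant coprime with 62^7 before encoding, and invert with the modular multiplicative inverse when decoding. This is a sketch, not a cipher from any production system, and the multiplier constant below is purely illustrative:

```typescript
const SPACE = 62n ** 7n; // 3,521,614,606,208 possible 7-char codes

// MULTIPLIER must share no factors with SPACE (= 2^7 * 31^7)
// so that multiplication mod SPACE is a bijection. 3^18 qualifies.
const MULTIPLIER = 387420489n; // illustrative constant

// Extended Euclidean algorithm: modular inverse of a mod m
function modInverse(a: bigint, m: bigint): bigint {
  let [oldR, r] = [a % m, m];
  let [oldS, s] = [1n, 0n];
  while (r !== 0n) {
    const q = oldR / r;
    [oldR, r] = [r, oldR - q * r];
    [oldS, s] = [s, oldS - q * s];
  }
  return ((oldS % m) + m) % m;
}

const INVERSE = modInverse(MULTIPLIER, SPACE);

// Scramble a sequential ID (must be < SPACE) before Base62-encoding it
function obfuscate(id: bigint): bigint {
  return (id * MULTIPLIER) % SPACE;
}

// Recover the original sequential ID after Base62-decoding
function deobfuscate(scrambled: bigint): bigint {
  return (scrambled * INVERSE) % SPACE;
}
```

Consecutive IDs now map to codes that are far apart, so knowing one short code reveals nothing about its neighbors, while lookups by primary key still work after deobfuscation.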
Hash-based encoding derives the short code directly from the long URL. This approach is deterministic and idempotent—the same URL always produces the same code.
```typescript
import crypto from 'crypto';

/**
 * Generate short code by hashing the long URL.
 * Takes 41 bits of the MD5 hash, encoded as 7 Base62 characters.
 */
function hashBasedShortCode(longUrl: string): string {
  // Step 1: Compute hash of URL
  const hash = crypto.createHash('md5').update(longUrl).digest();

  // Step 2: Read the first 8 bytes, then mask down to 41 bits
  // (2^41 ≈ 2.2T < 62^7 ≈ 3.5T, so the result always fits in 7 chars)
  const numeric = hash.readBigUInt64BE(0) & ((1n << 41n) - 1n);

  // Step 3: Encode as Base62 (7 characters)
  return encodeBase62(numeric, 7);
}

// Problem: Different URLs may produce the same short code (collision!)
// Solution: Check and handle collisions

async function createShortUrlWithHash(longUrl: string): Promise<ShortUrl> {
  let shortCode = hashBasedShortCode(longUrl);
  let attempt = 0;

  while (attempt < 5) {
    // Check if code already exists
    const existing = await db.query(
      'SELECT long_url FROM urls WHERE short_code = $1',
      [shortCode]
    );

    if (existing.rows.length === 0) {
      // Code is available
      await db.query(
        'INSERT INTO urls (short_code, long_url) VALUES ($1, $2)',
        [shortCode, longUrl]
      );
      return { shortCode, longUrl, shortUrl: `https://short.url/${shortCode}` };
    }

    if (existing.rows[0].long_url === longUrl) {
      // Same URL already shortened - return existing code
      return { shortCode, longUrl, shortUrl: `https://short.url/${shortCode}` };
    }

    // Collision! A different URL has the same code.
    // Strategy: Append the attempt number and re-hash
    shortCode = hashBasedShortCode(longUrl + String(++attempt));
  }

  throw new Error('Failed to generate unique short code after 5 attempts');
}
```

Understanding collision probability is crucial for hash-based approaches:
```text
Collision Probability Analysis (Birthday Paradox)
=================================================

For a hash of k bits with n items:
P(collision) ≈ 1 - e^(-n²/2^(k+1))

With 7 Base62 characters (≈41.7 bits of entropy; call it k = 42):
- 2^42 ≈ 4.4 trillion possible values

At 100 billion URLs (10 years of data):
- P(at least one collision) ≈ 1 - e^(-(100B)²/(2 × 4.4T))
- P(collision) ≈ 1 - e^(-1.1 billion)
- P(collision) ≈ 100% (virtually guaranteed!)

Expected number of collisions:
- E[collisions] ≈ n²/(2 × 2^k) = (100B)²/(2 × 4.4T) ≈ 1.1 billion

Reality Check:
- With 7-character codes and 100 billion URLs, roughly 1% of inserts
  collide and need a retry - manageable, but far from negligible
- Using 8 characters (62^8 = 218T) cuts expected collisions to ~23 million
```

Hash-based encoding works well when: (1) you want repeat submissions of the same URL to return the same short code, (2) you need fully distributed generation with no coordination, and (3) you accept occasional collision handling.
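The birthday-paradox arithmetic is easy to sanity-check in code. This sketch uses the standard approximation E ≈ n²/(2m) for the expected number of colliding inserts; plain JS number math is precise enough at this scale:

```typescript
// Expected number of colliding inserts when hashing n items into m slots
// (standard birthday-paradox approximation)
function expectedCollisions(n: number, m: number): number {
  return (n * n) / (2 * m);
}

const urls = 100e9; // 100 billion URLs over 10 years

console.log(expectedCollisions(urls, 2 ** 42)); // ~1.1 billion (42-bit codes)
console.log(expectedCollisions(urls, 62 ** 8)); // ~23 million  (8-char codes)
```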
Twitter's Snowflake and similar systems generate unique, roughly time-ordered IDs without central coordination. This is the gold standard for distributed ID generation at scale.
```text
Snowflake ID Structure (64 bits total)
======================================

┌───┬───────────────────────────┬───────────┬────────────────────┐
│ 1 │ 41 bits                   │ 10 bits   │ 12 bits            │
│   │ (timestamp ms)            │ (machine) │ (sequence)         │
└───┴───────────────────────────┴───────────┴────────────────────┘
  │              │                    │               │
  │              │                    │               └─ Sequence number (0-4095)
  │              │                    │                  Reset each millisecond
  │              │                    │
  │              │                    └─ Machine/Worker ID (0-1023)
  │              │                       Unique per server
  │              │
  │              └─ Milliseconds since epoch (Jan 1, 2015)
  │                 ~69 years of range
  │
  └─ Sign bit (always 0 for positive)

Capacity:
- 41 bits timestamp: 2^41 ms = ~69 years from epoch
- 10 bits machine ID: 1024 unique machines
- 12 bits sequence: 4096 IDs per millisecond per machine

Total throughput: 4096 × 1000 = 4.1 million IDs/second/machine
```
```typescript
/**
 * Snowflake-style ID Generator
 * Generates unique, time-sorted 64-bit IDs without coordination
 */
class SnowflakeGenerator {
  private readonly EPOCH = 1420070400000n; // Jan 1, 2015 UTC
  private readonly MACHINE_ID_BITS = 10n;
  private readonly SEQUENCE_BITS = 12n;
  private readonly maxMachineId = (1n << this.MACHINE_ID_BITS) - 1n;
  private readonly maxSequence = (1n << this.SEQUENCE_BITS) - 1n;

  private machineId: bigint;
  private sequence: bigint = 0n;
  private lastTimestamp: bigint = 0n;

  constructor(machineId: number) {
    this.machineId = BigInt(machineId);
    if (this.machineId > this.maxMachineId) {
      throw new Error(`Machine ID must be between 0 and ${this.maxMachineId}`);
    }
  }

  generate(): bigint {
    let timestamp = BigInt(Date.now()) - this.EPOCH;

    if (timestamp === this.lastTimestamp) {
      // Same millisecond - increment sequence
      this.sequence = (this.sequence + 1n) & this.maxSequence;
      if (this.sequence === 0n) {
        // Sequence overflow - wait for next millisecond
        timestamp = this.waitNextMillis(timestamp);
      }
    } else {
      // New millisecond - reset sequence
      this.sequence = 0n;
    }

    this.lastTimestamp = timestamp;

    // Compose the 64-bit ID
    return (timestamp << (this.MACHINE_ID_BITS + this.SEQUENCE_BITS)) |
           (this.machineId << this.SEQUENCE_BITS) |
           this.sequence;
  }

  private waitNextMillis(lastTimestamp: bigint): bigint {
    let timestamp = BigInt(Date.now()) - this.EPOCH;
    while (timestamp <= lastTimestamp) {
      timestamp = BigInt(Date.now()) - this.EPOCH;
    }
    return timestamp;
  }
}

// Usage for URL shortening:
const generator = new SnowflakeGenerator(42); // Machine #42

async function createShortUrl(longUrl: string): Promise<ShortUrl> {
  const id = generator.generate();       // e.g., 1234567890123456789n
  const shortCode = encodeBase62(id, 7); // NOTE: a full 64-bit ID can need up to 11 chars

  // Store mapping in database
  await db.insert({ id, shortCode, longUrl });
  return { shortCode, longUrl };
}
```

Standard Snowflake produces 64-bit IDs, which are too large for 7-character Base62 codes (3.5T values, ~42 bits). A modified structure can fit:
```text
Compact Snowflake (41 bits for 7-char Base62)
=============================================

┌─────────────────────┬──────────┬────────────────┐
│ 31 bits             │ 4 bits   │ 6 bits         │
│ (timestamp secs)    │ (worker) │ (sequence)     │
└─────────────────────┴──────────┴────────────────┘

- 31 bits timestamp (seconds): ~68 years from epoch
- 4 bits worker ID: 16 workers maximum
- 6 bits sequence: 64 IDs per second per worker

Total: 41 bits → 2.2T possible IDs → always fits in 7 Base62 chars
(2^41 ≈ 2.2T < 62^7 ≈ 3.5T; anything over 41 bits can overflow 7 chars)

Throughput: 64 IDs/sec/worker × 16 workers = 1,024 IDs/second
(Sufficient for many use cases)

For higher throughput, use 8-character codes (62^8 = 218T, which holds
up to 47 bits since 2^47 < 218T < 2^48):

┌─────────────────────┬──────────┬────────────────┐
│ 33 bits             │ 6 bits   │ 8 bits         │
│ (timestamp secs)    │ (worker) │ (sequence)     │
└─────────────────────┴──────────┴────────────────┘
```

Snowflake-style IDs combine the best of all approaches: no coordination (each server generates independently), guaranteed uniqueness (timestamp + worker + sequence), time-sortability (useful for debugging and analytics), and high throughput (millions per second).
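A minimal sketch of composing such a compact ID, using a 31-bit seconds timestamp, 4-bit worker ID, and 6-bit sequence (one possible 41-bit layout; the epoch constant is illustrative, and since 2^41 < 62^7 the result always encodes to at most 7 Base62 characters):

```typescript
const COMPACT_EPOCH = 1420070400n; // Jan 1, 2015 UTC, in seconds (illustrative)

// Pack timestamp | worker | sequence into a single 41-bit integer
function composeCompactId(nowSecs: bigint, workerId: bigint, sequence: bigint): bigint {
  if (workerId < 0n || workerId > 15n) throw new Error('worker ID must fit in 4 bits');
  if (sequence < 0n || sequence > 63n) throw new Error('sequence must fit in 6 bits');
  const ts = nowSecs - COMPACT_EPOCH;
  // Shift timestamp past the 4 worker bits + 6 sequence bits
  return (ts << 10n) | (workerId << 6n) | sequence;
}

// Example: 100 seconds after the epoch, worker 3, sequence 17
// (100 << 10) + (3 << 6) + 17 = 102400 + 192 + 17 = 102609
const id = composeCompactId(COMPACT_EPOCH + 100n, 3n, 17n); // → 102609n
```

The per-second sequence counter and overflow handling would follow the same pattern as the millisecond-based `SnowflakeGenerator` above.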
Range-based allocation pre-assigns ID ranges to servers, eliminating real-time coordination. This approach is used by Flickr's ticket servers and similar systems.
```typescript
/**
 * Range-Based ID Generator
 *
 * Central service allocates ranges of IDs to worker servers.
 * Workers use their range locally without further coordination.
 */

// Central Range Allocator Service
class RangeAllocator {
  private readonly rangeSize: bigint = 100000n; // 100K IDs per range

  constructor(private db: Database) {}

  async allocateRange(): Promise<{ start: bigint; end: bigint }> {
    return await this.db.transaction(async (tx) => {
      // Atomically advance the counter and claim the previous range
      const result = await tx.query(
        'UPDATE range_counter SET value = value + $1 RETURNING value - $1 AS start',
        [this.rangeSize]
      );
      const start = BigInt(result.rows[0].start);
      const end = start + this.rangeSize; // exclusive upper bound
      return { start, end };
    });
  }
}

// Worker Server (URL Shortener Instance)
class RangeBasedIdGenerator {
  private currentId: bigint = 0n;
  private rangeEnd: bigint = 0n; // exclusive

  constructor(private allocator: RangeAllocator) {}

  async getId(): Promise<bigint> {
    // Check if we need a new range
    if (this.currentId >= this.rangeEnd) {
      await this.fetchNewRange();
    }

    const id = this.currentId;
    this.currentId += 1n;
    return id;
  }

  private async fetchNewRange() {
    const range = await this.allocator.allocateRange();
    this.currentId = range.start;
    this.rangeEnd = range.end;
    console.log(`Fetched new range: ${range.start} - ${range.end - 1n}`);
  }
}

// Usage flow:
// 1. Server starts → requests range (e.g., 1-100,000)
// 2. Generates IDs 1, 2, 3, ... 100,000
// 3. Requests new range (100,001-200,000)
// 4. Continues generating...
```

Flickr uses two MySQL servers with auto-increment starting at 1 and 2, each incrementing by 2. This simple setup provides high availability: if one server fails, the other continues generating (with odd or even IDs only). It's elegant for moderate scale.
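Flickr's odd/even scheme maps onto two standard MySQL system variables plus a one-row "ticket" table. A sketch of the idea (the `tickets` table name and column layout here are illustrative, not Flickr's exact schema):

```sql
-- On ticket server 1:
--   auto_increment_increment = 2, auto_increment_offset = 1  → 1, 3, 5, ...
-- On ticket server 2:
--   auto_increment_increment = 2, auto_increment_offset = 2  → 2, 4, 6, ...
SET @@auto_increment_increment = 2;
SET @@auto_increment_offset = 1;

CREATE TABLE tickets (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  stub CHAR(1) NOT NULL DEFAULT '',
  PRIMARY KEY (id),
  UNIQUE KEY (stub)
);

-- REPLACE INTO reuses the single row (via the unique stub) but still
-- advances the auto-increment counter, so the table stays one row tall
-- no matter how many IDs are issued.
REPLACE INTO tickets (stub) VALUES ('a');
SELECT LAST_INSERT_ID();
```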
Many URL shorteners allow users to specify their own short codes (e.g., short.url/my-brand instead of short.url/a7Xk2B). This introduces additional complexity.
| Aspect | Requirement | Implementation |
|---|---|---|
| Uniqueness | No conflicts with generated or other custom codes | Check both custom and auto-generated namespace |
| Character Validation | Only URL-safe characters allowed | Regex validation: ^[a-zA-Z0-9_-]+$ |
| Length Limits | 2-30 characters typical | Minimum prevents single-char hoarding; max for sanity |
| Reserved Words | Block system paths and common words | Blocklist: api, admin, login, www, etc. |
| Availability Check | Real-time availability feedback | Fast lookup before commit |
| Case Sensitivity | Match URL case rules (typically insensitive) | Store and compare lowercase |
```typescript
/**
 * Custom Alias Handling
 */

const RESERVED_WORDS = new Set([
  'api', 'admin', 'login', 'logout', 'signup', 'settings',
  'www', 'mail', 'help', 'support', 'about', 'terms', 'privacy'
]);

const ALIAS_REGEX = /^[a-zA-Z0-9][a-zA-Z0-9_-]{1,29}$/; // 2-30 chars total

interface AliasValidationResult {
  valid: boolean;
  error?: string;
}

function validateCustomAlias(alias: string): AliasValidationResult {
  // Normalize to lowercase for case-insensitive comparison
  const normalized = alias.toLowerCase();

  // Check format
  if (!ALIAS_REGEX.test(alias)) {
    return {
      valid: false,
      error: 'Alias must be 2-30 characters, alphanumeric with hyphens/underscores'
    };
  }

  // Check reserved words
  if (RESERVED_WORDS.has(normalized)) {
    return { valid: false, error: 'This alias is reserved and cannot be used' };
  }

  // Check for confusing patterns
  if (/^[0-9]+$/.test(alias)) {
    return {
      valid: false,
      error: 'Alias cannot be purely numeric (conflicts with auto-generated codes)'
    };
  }

  return { valid: true };
}

async function createWithCustomAlias(
  longUrl: string,
  customAlias: string
): Promise<ShortUrl> {
  // Validate format
  const validation = validateCustomAlias(customAlias);
  if (!validation.valid) {
    throw new ValidationError(validation.error!);
  }

  const normalized = customAlias.toLowerCase();

  try {
    // Attempt to insert; a unique constraint on short_code_lower
    // guards against races between concurrent requests
    await db.query(
      `INSERT INTO urls (short_code, short_code_lower, long_url)
       VALUES ($1, $2, $3)`,
      [customAlias, normalized, longUrl]
    );

    return {
      shortCode: customAlias,
      longUrl,
      shortUrl: `https://short.url/${customAlias}`
    };
  } catch (err) {
    if (err.code === '23505') { // Postgres unique_violation
      throw new ConflictError('This alias is already taken');
    }
    throw err;
  }
}
```

Critical: auto-generated codes must not conflict with custom aliases. Solutions: (1) prefix auto-generated codes with a character never allowed in custom aliases (e.g., '0'), (2) use different length constraints (auto: exactly 7 chars; custom: 2-6 or 8+ chars), (3) check both namespaces on every creation.
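Option (2), separating the namespaces by length, can be sketched in a couple of helpers (assuming auto-generated codes are always exactly 7 characters, as this design uses):

```typescript
const AUTO_CODE_LENGTH = 7; // every auto-generated Base62 code uses exactly this length

// Auto-generated codes and custom aliases can never collide if custom
// aliases are forbidden from using the auto-generated length.
function isAutoGeneratedLength(code: string): boolean {
  return code.length === AUTO_CODE_LENGTH;
}

function isAllowedCustomAliasLength(alias: string): boolean {
  return alias.length >= 2 && alias.length <= 30 && alias.length !== AUTO_CODE_LENGTH;
}

// "a7Xk2B1" (7 chars) belongs to the generator; "my-brand" (8 chars) is fine
```

This check would slot in alongside `validateCustomAlias` above, replacing the need to query both namespaces on every creation.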
We've explored four major strategies for generating unique short codes. Each has its place depending on system constraints:
| Strategy | Best For | Avoid When | Complexity |
|---|---|---|---|
| Auto-Increment | Simple systems, single DB, low scale | Need distribution, security matters | Low |
| Hash-Based | Deduplication needed, fully distributed | Collision avoidance critical | Medium |
| Snowflake-Style | High scale, distributed systems | Clock sync issues, simpler options work | Medium-High |
| Range-Based | Moderate scale, simple distribution | True high availability needed | Medium |
You now understand the core encoding strategies for URL shorteners. Next, we'll focus on redirect latency optimization—how to serve those short codes at sub-50ms latency for billions of daily requests.