Uploading large files over the internet presents unique challenges. A 4K video file might be 50GB. A database backup might be 200GB. Uploading such files in a single HTTP request is impractical—any network interruption means starting from scratch.
The Core Problems:
A single-request upload cannot be resumed: any interruption forces a restart from byte zero. It also offers no granular progress reporting, cannot spread the transfer across parallel connections, and ties up one long-lived request for the entire file.
The Solution: Chunked Uploads
Break large files into smaller chunks (typically 4-16MB each) and upload each chunk independently. This enables resumption, parallel uploads, and progress tracking.
By the end of this page, you'll understand: (1) Chunked upload protocols and their design rationale, (2) Resumable upload implementation, (3) Parallel chunk upload for speed, (4) Server-side chunk assembly and verification, and (5) Content-defined chunking for deduplication.
There are two fundamental approaches to dividing files into chunks: fixed-size chunking and content-defined chunking (CDC). Each has distinct trade-offs.
Content-Defined Chunking (CDC) Algorithm:
Rolling Hash Chunking:
1. Initialize rolling hash window (e.g., 48 bytes)
2. Slide window byte-by-byte through file
3. At each position, check if hash matches pattern:
if (hash & MASK) == TARGET: // e.g., last 13 bits are zero
Create chunk boundary here
4. Enforce minimum chunk size (don't cut too often)
5. Enforce maximum chunk size (cut even without pattern match)
Example with average 8KB chunks:
MASK = 0x1FFF (13 bits)
TARGET = 0
Expected avg chunk: 2^13 = 8192 bytes
Min chunk: 2KB (avoid tiny chunks)
Max chunk: 64KB (ensure progress on non-matching content)
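A minimal sketch of this loop in TypeScript is shown below. It uses a gear-style rolling hash (the approach behind FastCDC) instead of the explicit 48-byte window described above, and the constants mirror the example parameters; treat it as an illustration, not a tuned implementation.

```typescript
// Content-defined chunking sketch: gear-style rolling hash with min/max bounds.
const MASK = 0x1fff;           // 13 low bits => ~8 KB average chunk size
const TARGET = 0;
const MIN_CHUNK = 2 * 1024;    // 2 KB minimum
const MAX_CHUNK = 64 * 1024;   // 64 KB forced cut

// 256 deterministic pseudo-random 32-bit constants, one per byte value.
const GEAR = new Uint32Array(256);
let seed = 0x9e3779b9;
for (let i = 0; i < 256; i++) {
  seed ^= seed << 13; seed ^= seed >>> 17; seed ^= seed << 5; // xorshift32
  GEAR[i] = seed >>> 0;
}

function contentDefinedChunks(data: Uint8Array): Uint8Array[] {
  const chunks: Uint8Array[] = [];
  let start = 0;
  let hash = 0;

  for (let i = 0; i < data.length; i++) {
    // Slide the rolling hash by one byte; the shift ages out old bytes.
    hash = (((hash << 1) >>> 0) + GEAR[data[i]]) >>> 0;
    const length = i - start + 1;

    const boundary =
      (length >= MIN_CHUNK && (hash & MASK) === TARGET) || // pattern match
      length >= MAX_CHUNK;                                 // forced cut

    if (boundary) {
      chunks.push(data.subarray(start, i + 1));
      start = i + 1;
      hash = 0; // restart hash for the next chunk
    }
  }
  if (start < data.length) chunks.push(data.subarray(start)); // trailing chunk
  return chunks;
}
```

Because boundaries depend only on local content, inserting bytes near the start of a file shifts data without moving most later chunk boundaries, which is what makes CDC effective for deduplication.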
| Chunk Size | Chunks per 1GB | Metadata Overhead | Dedup Efficiency | Best Use Case |
|---|---|---|---|---|
| 256 KB | 4,096 | High | Excellent | Text documents, code |
| 1 MB | 1,024 | Moderate | Very Good | Mixed content |
| 4 MB | 256 | Low | Good | General purpose |
| 8 MB | 128 | Very Low | Moderate | Large media files |
| 16 MB | 64 | Minimal | Lower | Streaming video |
Production systems often use a hybrid: fixed-size chunking for the initial upload (simpler, parallel-friendly), and content-defined chunking for subsequent syncs (better deduplication). The storage layer uses content-addressed storage regardless of how chunks were created.
A resumable upload protocol enables clients to continue interrupted uploads without retransmitting already-uploaded chunks. Here's how a well-designed protocol works:
```typescript
// Resumable upload client implementation
interface UploadSession {
  uploadId: string;
  uploadUrl: string;
  expiry: Date;
  uploadedChunks: Set<number>;
}

class ResumableUploader {
  private chunkSize = 4 * 1024 * 1024; // 4MB

  async uploadFile(file: File, onProgress: (pct: number) => void): Promise<string> {
    // Phase 1: Initialize or resume session
    const session = await this.getOrCreateSession(file);

    // Phase 2: Upload chunks
    const totalChunks = Math.ceil(file.size / this.chunkSize);
    for (let i = 0; i < totalChunks; i++) {
      if (session.uploadedChunks.has(i)) continue; // Already uploaded

      const start = i * this.chunkSize;
      const end = Math.min(start + this.chunkSize, file.size);
      const chunk = file.slice(start, end);

      await this.uploadChunkWithRetry(session, i, chunk);
      onProgress((i + 1) / totalChunks * 100);
      this.saveSessionToLocalStorage(session); // Persist progress
    }

    // Phase 3: Complete upload
    const fileId = await this.completeUpload(session, file);
    this.clearSession(session.uploadId);
    return fileId;
  }

  private async uploadChunkWithRetry(
    session: UploadSession,
    index: number,
    chunk: Blob,
    maxRetries = 3
  ): Promise<void> {
    const hash = await this.computeHash(chunk);

    for (let attempt = 0; attempt < maxRetries; attempt++) {
      try {
        const response = await fetch(`${session.uploadUrl}/chunk/${index}`, {
          method: 'PUT',
          headers: {
            'Content-Type': 'application/octet-stream',
            'Content-MD5': hash,
            'Content-Range': `bytes ${index * this.chunkSize}-${index * this.chunkSize + chunk.size - 1}`,
          },
          body: chunk,
        });

        if (response.ok) {
          session.uploadedChunks.add(index);
          return;
        }

        if (response.status === 409) {
          // Chunk already exists (idempotent retry)
          session.uploadedChunks.add(index);
          return;
        }

        throw new Error(`Upload failed: ${response.status}`);
      } catch (error) {
        if (attempt === maxRetries - 1) throw error;
        await this.exponentialBackoff(attempt);
      }
    }
  }

  private async getOrCreateSession(file: File): Promise<UploadSession> {
    // Check for existing session in localStorage
    const existing = this.loadSessionFromLocalStorage(file.name, file.size);
    if (existing) {
      // Verify session still valid on server
      const status = await this.getSessionStatus(existing.uploadId);
      if (status.valid) {
        existing.uploadedChunks = new Set(status.uploadedChunks);
        return existing;
      }
    }

    // Create new session
    const response = await fetch('/api/uploads/init', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        filename: file.name,
        size: file.size,
        chunkSize: this.chunkSize,
        contentHash: await this.computeFileHash(file),
      }),
    });

    return response.json();
  }
}
```

The 'tus' protocol (tus.io) is an open standard for resumable uploads, implemented by many cloud providers. It defines HTTP-based resumable upload semantics that are widely supported. Consider using tus rather than inventing a custom protocol—it handles edge cases you might miss.
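For comparison, here is roughly how the same flow looks with tus-js-client, the reference client for the tus protocol mentioned above. This is a sketch based on the library's documented usage pattern; the endpoint path is a placeholder and option names may vary slightly between versions.

```typescript
// Resumable upload via the tus protocol (sketch; endpoint is a placeholder).
import * as tus from 'tus-js-client';

function uploadWithTus(file: File, onProgress: (pct: number) => void): Promise<string> {
  return new Promise((resolve, reject) => {
    const upload = new tus.Upload(file, {
      endpoint: '/api/tus/files/',                 // your tus server endpoint
      chunkSize: 4 * 1024 * 1024,                  // optional: 4MB chunks
      retryDelays: [0, 3000, 5000, 10000, 20000],  // built-in retry with backoff
      metadata: { filename: file.name, filetype: file.type },
      onProgress: (bytesUploaded, bytesTotal) =>
        onProgress((bytesUploaded / bytesTotal) * 100),
      onSuccess: () => resolve(upload.url ?? ''),
      onError: (error) => reject(error),
    });

    // Resume a previous upload of the same file if the library finds one.
    upload.findPreviousUploads().then((previous) => {
      if (previous.length > 0) upload.resumeFromPreviousUpload(previous[0]);
      upload.start();
    });
  });
}
```

The library persists upload state locally and negotiates offsets with the server, which is exactly the bookkeeping the hand-rolled ResumableUploader above does itself.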
Sequential chunk upload is simple but doesn't maximize bandwidth utilization. Parallel uploading multiple chunks simultaneously dramatically improves upload speed, especially on high-latency connections.
Why Parallel Uploads Are Faster:
Sequential Upload (1 chunk at a time):
Total Time = N × (latency + chunk_transfer_time)
For 10 chunks, 100ms latency, 2s transfer each:
Total = 10 × (0.1 + 2) = 21 seconds
Parallel Upload (4 concurrent):
Total Time ≈ ceil(N/4) × (latency + chunk_transfer_time)
For 10 chunks, 100ms latency, 2s transfer each:
Total ≈ 3 × (0.1 + 2) = 6.3 seconds
3.3x faster!
The key insight: while one chunk is transferring, other connections can be established and chunks can be queued. TCP slow-start is amortized across multiple connections.
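The arithmetic above can be captured in a small helper to compare configurations; this idealized model assumes chunks do not compete for bandwidth, which real concurrent uploads eventually do.

```typescript
// Idealized upload-time model: rounds of `concurrency` chunks, each round
// paying one latency plus one chunk transfer time.
function estimateUploadSeconds(
  chunks: number,
  latencySec: number,
  transferSecPerChunk: number,
  concurrency: number
): number {
  const rounds = Math.ceil(chunks / concurrency);
  return rounds * (latencySec + transferSecPerChunk);
}

console.log(estimateUploadSeconds(10, 0.1, 2, 1)); // ≈ 21s  (sequential)
console.log(estimateUploadSeconds(10, 0.1, 2, 4)); // ≈ 6.3s (4 concurrent, ~3.3x faster)
```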
```typescript
// Parallel chunk upload with concurrency control
class ParallelUploader {
  private concurrency = 4; // Simultaneous uploads
  private chunkSize = 4 * 1024 * 1024;

  async uploadFile(file: File, session: UploadSession): Promise<void> {
    const totalChunks = Math.ceil(file.size / this.chunkSize);
    const pendingChunks: number[] = [];

    // Build list of chunks that need uploading
    for (let i = 0; i < totalChunks; i++) {
      if (!session.uploadedChunks.has(i)) {
        pendingChunks.push(i);
      }
    }

    // Process chunks with limited concurrency
    await this.processWithConcurrency(
      pendingChunks,
      async (chunkIndex) => {
        const start = chunkIndex * this.chunkSize;
        const end = Math.min(start + this.chunkSize, file.size);
        const chunk = file.slice(start, end);

        await this.uploadChunk(session, chunkIndex, chunk);
        session.uploadedChunks.add(chunkIndex);
      },
      this.concurrency
    );
  }

  private async processWithConcurrency<T>(
    items: T[],
    processor: (item: T) => Promise<void>,
    limit: number
  ): Promise<void> {
    const queue = [...items];

    // Each worker pulls the next pending item until the queue is drained
    const processNext = async (): Promise<void> => {
      while (queue.length > 0) {
        const item = queue.shift()!;
        await processor(item);
      }
    };

    // Start 'limit' concurrent workers
    const workers = Array(Math.min(limit, items.length))
      .fill(null)
      .map(() => processNext());

    await Promise.all(workers);
  }
}

// Advanced: Priority queue for smart chunk ordering
class SmartParallelUploader extends ParallelUploader {
  // Upload chunks near current read position first
  // This enables streaming playback during upload
  getChunkPriority(chunkIndex: number, totalChunks: number): number {
    // Priority: first few chunks (for preview), then sequential
    if (chunkIndex < 3) return 0; // Highest priority
    return chunkIndex; // Then sequential
  }

  // Adaptive concurrency based on bandwidth
  async measureBandwidth(): Promise<number> {
    const testChunk = new Uint8Array(64 * 1024); // 64KB test
    const start = performance.now();
    await this.uploadTestChunk(testChunk);
    const elapsed = performance.now() - start;
    const mbps = (64 * 8) / 1024 / (elapsed / 1000); // convert the 64KB test to megabits per second

    // Adjust concurrency based on observed bandwidth
    if (mbps > 10) return 6; // Fast connection: more parallel
    if (mbps > 2) return 4;  // Medium: moderate parallel
    return 2;                // Slow: fewer parallel
  }
}
```

| Connection | Bandwidth | Latency | Optimal Concurrency |
|---|---|---|---|
| Slow WiFi | 1-5 Mbps | 50-200ms | 2 |
| Home Broadband | 10-50 Mbps | 20-50ms | 4 |
| Fast Fiber | 100-500 Mbps | 5-20ms | 6-8 |
| Enterprise LAN | 1+ Gbps | 1-5ms | 8-16 |
| Same Datacenter | 10+ Gbps | <1ms | 16-32 |
Browsers cap concurrent HTTP/1.1 connections at six per origin; HTTP/2 multiplexes many streams over a single connection, so prioritization matters more than connection count. Requests beyond the HTTP/1.1 cap simply queue in the browser, and multiplexing everything over one HTTP/2 connection can still suffer TCP-level head-of-line blocking on lossy links. Solutions: shard uploads across multiple subdomains, ensure HTTP/2 support, or use WebSocket-based upload protocols.
The server must efficiently receive, store, and reassemble chunks while handling concurrent uploads from millions of users. This requires careful architecture design.
```typescript
// Server-side chunk handling (simplified)
interface ChunkUploadRequest {
  uploadId: string;
  chunkIndex: number;
  contentMD5: string;
  data: Buffer;
}

class ChunkHandler {
  constructor(
    private redis: RedisClient,
    private storage: ObjectStorage,
    private queue: MessageQueue,
  ) {}

  async handleChunkUpload(req: ChunkUploadRequest): Promise<void> {
    // 1. Validate session exists and is not expired
    const session = await this.redis.get(`upload:${req.uploadId}`);
    if (!session) throw new UploadExpiredError();

    // 2. Verify chunk hash
    const actualHash = md5(req.data);
    if (actualHash !== req.contentMD5) {
      throw new ChunkCorruptedError('Hash mismatch');
    }

    // 3. Store chunk (streaming write, not buffered)
    const chunkKey = `chunks/${req.uploadId}/${req.chunkIndex}`;
    await this.storage.put(chunkKey, req.data, {
      contentMD5: req.contentMD5,
      metadata: { uploadId: req.uploadId, index: req.chunkIndex },
    });

    // 4. Mark chunk as received in session
    const allReceived = await this.redis.eval(`
      redis.call('SADD', KEYS[1], ARGV[1])
      local count = redis.call('SCARD', KEYS[1])
      local expected = redis.call('HGET', KEYS[2], 'totalChunks')
      return count >= tonumber(expected) and 1 or 0
    `, [`upload:${req.uploadId}:chunks`, `upload:${req.uploadId}`], [req.chunkIndex]);

    // 5. If all chunks received, queue assembly
    if (allReceived) {
      await this.queue.send('file-assembly', {
        uploadId: req.uploadId,
        timestamp: Date.now(),
      });
    }
  }

  async assembleFile(uploadId: string): Promise<string> {
    const session = await this.redis.hgetall(`upload:${uploadId}`);
    const totalChunks = parseInt(session.totalChunks);

    // Stream chunks to final location
    const finalKey = `files/${session.userId}/${session.filename}`;
    const multipartUpload = await this.storage.createMultipartUpload(finalKey);

    for (let i = 0; i < totalChunks; i++) {
      const chunkKey = `chunks/${uploadId}/${i}`;
      await this.storage.copyPart(
        multipartUpload,
        chunkKey,
        i + 1 // Parts are 1-indexed
      );
    }

    const result = await this.storage.completeMultipartUpload(multipartUpload);

    // Verify final hash
    if (result.etag !== session.expectedHash) {
      await this.storage.delete(finalKey);
      throw new AssemblyError('Final hash mismatch');
    }

    // Cleanup temp chunks
    await this.cleanupTempChunks(uploadId, totalChunks);
    await this.redis.del(`upload:${uploadId}`, `upload:${uploadId}:chunks`);

    return result.fileId;
  }
}
```

AWS S3 (and compatible storage) has native multipart upload support. Clients can upload parts directly to S3 with pre-signed URLs, bypassing your servers entirely for large files. This dramatically reduces server load and bandwidth costs—the data flows Client → S3 directly, not Client → Server → S3.
For maximum efficiency, production systems enable clients to upload directly to object storage (S3, GCS, Azure Blob) using pre-signed URLs. This bypasses your API servers entirely for the bulk data transfer.
```typescript
// Generate presigned URLs for direct-to-S3 upload
import { S3Client, CreateMultipartUploadCommand, UploadPartCommand } from '@aws-sdk/client-s3';
import { getSignedUrl } from '@aws-sdk/s3-request-presigner';

async function initializeDirectUpload(
  userId: string,
  filename: string,
  fileSize: number,
  chunkSize: number = 10 * 1024 * 1024 // 10MB
): Promise<DirectUploadSession> {
  const s3 = new S3Client({ region: 'us-east-1' });
  const key = `uploads/${userId}/${Date.now()}-${filename}`;

  // Create multipart upload
  const createCommand = new CreateMultipartUploadCommand({
    Bucket: 'my-bucket',
    Key: key,
    ContentType: getMimeType(filename),
  });
  const { UploadId } = await s3.send(createCommand);

  // Calculate number of parts
  const numParts = Math.ceil(fileSize / chunkSize);

  // Generate presigned URLs for each part
  const presignedUrls: PresignedPart[] = [];
  for (let partNumber = 1; partNumber <= numParts; partNumber++) {
    const uploadPartCommand = new UploadPartCommand({
      Bucket: 'my-bucket',
      Key: key,
      UploadId,
      PartNumber: partNumber,
    });

    const url = await getSignedUrl(s3, uploadPartCommand, {
      expiresIn: 3600 * 24, // 24 hours
    });

    presignedUrls.push({
      partNumber,
      url,
      startByte: (partNumber - 1) * chunkSize,
      endByte: Math.min(partNumber * chunkSize, fileSize),
    });
  }

  // Store session in database
  await db.uploadSessions.create({
    uploadId: UploadId,
    userId,
    key,
    filename,
    status: 'in_progress',
    createdAt: new Date(),
    expiresAt: new Date(Date.now() + 24 * 3600 * 1000),
  });

  return {
    uploadId: UploadId,
    key,
    presignedUrls,
    chunkSize,
  };
}
```

Presigned URLs are bearer tokens—anyone with the URL can upload. Mitigations: (1) Short expiry times (1-24 hours), (2) Include content-type and content-length in signature to prevent misuse, (3) Use separate buckets for uploads with restricted permissions, (4) Validate uploaded content before accepting (virus scan, format validation).
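On the client side, each presigned URL is consumed with a plain HTTP PUT, and S3 returns an ETag per part that must be collected for the final completion call. A sketch under those assumptions (the /api/uploads/complete route is hypothetical; the DirectUploadSession shape matches the function above):

```typescript
// Client: upload each part directly to S3 using its presigned URL.
// The bucket's CORS configuration must expose the ETag header
// (ExposeHeaders: ["ETag"]) for the browser to read it.
async function uploadParts(file: File, session: DirectUploadSession): Promise<void> {
  const completedParts: { PartNumber: number; ETag: string }[] = [];

  for (const part of session.presignedUrls) {
    const body = file.slice(part.startByte, part.endByte);
    const response = await fetch(part.url, { method: 'PUT', body });
    if (!response.ok) throw new Error(`Part ${part.partNumber} failed: ${response.status}`);

    const etag = response.headers.get('ETag');
    if (!etag) throw new Error('Missing ETag header (check bucket CORS ExposeHeaders)');
    completedParts.push({ PartNumber: part.partNumber, ETag: etag });
  }

  // Report the uploaded parts so the backend can finalize the multipart upload.
  await fetch('/api/uploads/complete', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ uploadId: session.uploadId, key: session.key, parts: completedParts }),
  });
}
```

The backend handler for that route finishes the job by sending a CompleteMultipartUploadCommand with the sorted { PartNumber, ETag } list; until that call succeeds, S3 holds the parts but does not expose the assembled object.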
Before accepting uploaded files into the system, rigorous validation ensures data integrity and security. This is especially critical when clients upload directly to storage.
```typescript
// Comprehensive upload validation pipeline
class UploadValidator {
  async validateUpload(
    uploadId: string,
    expectedHash: string,
    expectedSize: number,
    declaredType: string
  ): Promise<ValidationResult> {
    const errors: string[] = [];
    const warnings: string[] = [];

    // 1. Size verification
    const actualSize = await this.storage.getSize(`uploads/${uploadId}`);
    if (actualSize !== expectedSize) {
      errors.push(`Size mismatch: expected ${expectedSize}, got ${actualSize}`);
    }

    // 2. Hash verification
    const actualHash = await this.storage.computeHash(`uploads/${uploadId}`);
    if (actualHash !== expectedHash) {
      errors.push('Hash mismatch: file may be corrupted');
    }

    // 3. Content type verification
    const magicBytes = await this.storage.readBytes(`uploads/${uploadId}`, 0, 8);
    const detectedType = this.detectMimeType(magicBytes);
    if (detectedType !== declaredType) {
      if (this.isDangerous(detectedType)) {
        errors.push(`Dangerous file type detected: ${detectedType}`);
      } else {
        warnings.push(`Type mismatch: claimed ${declaredType}, detected ${detectedType}`);
      }
    }

    // 4. Virus scan (async, may take time)
    const scanResult = await this.virusScanner.scan(`uploads/${uploadId}`);
    if (scanResult.infected) {
      errors.push(`Malware detected: ${scanResult.threat}`);
      await this.quarantineFile(uploadId);
    }

    // 5. Format-specific validation
    if (this.isImage(detectedType)) {
      const imageValid = await this.validateImage(`uploads/${uploadId}`);
      if (!imageValid) errors.push('Invalid image format');
    }

    // 6. Zip bomb detection
    if (this.isArchive(detectedType)) {
      const compressionRatio = await this.getCompressionRatio(`uploads/${uploadId}`);
      if (compressionRatio > 100) { // 100:1 expansion ratio
        errors.push('Suspicious compression ratio (potential zip bomb)');
      }
    }

    return {
      valid: errors.length === 0,
      errors,
      warnings,
      metadata: {
        actualSize,
        actualHash,
        detectedType,
      },
    };
  }

  // MIME type detection via magic bytes
  detectMimeType(bytes: Buffer): string {
    // PNG: 89 50 4E 47
    if (bytes.slice(0, 4).equals(Buffer.from([0x89, 0x50, 0x4E, 0x47]))) {
      return 'image/png';
    }
    // JPEG: FF D8 FF
    if (bytes.slice(0, 3).equals(Buffer.from([0xFF, 0xD8, 0xFF]))) {
      return 'image/jpeg';
    }
    // PDF: 25 50 44 46
    if (bytes.slice(0, 4).equals(Buffer.from([0x25, 0x50, 0x44, 0x46]))) {
      return 'application/pdf';
    }
    // ... more types
    return 'application/octet-stream';
  }
}
```

Synchronous validation delays upload completion. For better UX, accept the upload immediately with 'processing' status, then validate asynchronously. Notify the user if validation fails. This improves perceived performance while maintaining security.
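A sketch of that asynchronous flow follows; the queue topic, database handle, and notifier are hypothetical stand-ins, and the worker simply wraps the UploadValidator shown above.

```typescript
// Asynchronous validation flow (sketch; db, queue, and notifier are illustrative stand-ins).
declare const db: { files: { update(id: string, fields: Record<string, unknown>): Promise<void> } };
declare const queue: { send(topic: string, payload: unknown): Promise<void> };
declare const notifier: { notifyUser(uploadId: string, message: string, details: string[]): Promise<void> };

// API path: accept immediately so the client is never blocked on scanning.
async function acceptUpload(uploadId: string, meta: { hash: string; size: number; type: string }) {
  await db.files.update(uploadId, { status: 'processing' });
  await queue.send('upload-validation', { uploadId, ...meta });
}

// Background worker: run the full validation pipeline and flip the status.
async function validationWorker(job: { uploadId: string; hash: string; size: number; type: string }) {
  const validator = new UploadValidator();
  const result = await validator.validateUpload(job.uploadId, job.hash, job.size, job.type);

  if (result.valid) {
    await db.files.update(job.uploadId, { status: 'available', ...result.metadata });
  } else {
    await db.files.update(job.uploadId, { status: 'rejected', errors: result.errors });
    await notifier.notifyUser(job.uploadId, 'Upload failed validation', result.errors);
  }
}
```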
Efficient chunk storage requires content-addressed storage with deduplication. When multiple users upload the same file, or different files share common chunks, each unique chunk is stored only once.
Content-Addressed Storage Model:
Traditional Path-Based Storage:
/users/alice/report.pdf → [file content]
/users/bob/report.pdf → [same file content] (duplicate!)
Content-Addressed Storage:
/chunks/sha256-abc123... → [chunk A]
/chunks/sha256-def456... → [chunk B]
/files/alice/report.pdf → manifest: [abc123, def456, ghi789]
/files/bob/report.pdf → manifest: [abc123, def456, ghi789]
Same chunks, different manifests. Storage saved!
| File Type | Typical Dedup Ratio | Reason |
|---|---|---|
| Source code repos | 60-80% | Many common files (libraries, configs) |
| Office documents | 40-60% | Common templates, shared files |
| Photos (same camera) | 20-30% | Similar headers, embedded profiles |
| Compressed media | 5-15% | Already compressed, unique content |
| Random data | 0-1% | No patterns to deduplicate |
```typescript
// Content-addressed chunk storage with deduplication
interface ChunkManifest {
  fileId: string;
  chunks: string[]; // Array of chunk hashes
  totalSize: number;
  createdAt: Date;
}

class DedupStorage {
  // Upload chunk with inline deduplication
  async storeChunk(content: Buffer): Promise<{ hash: string; stored: boolean }> {
    const hash = sha256(content);

    // Check if chunk already exists
    const exists = await this.storage.exists(`chunks/${hash}`);
    if (exists) {
      // Increment reference count
      await this.incrementRefCount(hash);
      return { hash, stored: false }; // Deduped!
    }

    // Store new chunk
    await this.storage.put(`chunks/${hash}`, content, {
      contentType: 'application/octet-stream',
      metadata: { refCount: 1 },
    });

    return { hash, stored: true };
  }

  // Create file manifest from chunks
  async createManifest(
    fileId: string,
    chunkHashes: string[],
    totalSize: number
  ): Promise<void> {
    const manifest: ChunkManifest = {
      fileId,
      chunks: chunkHashes,
      totalSize,
      createdAt: new Date(),
    };

    await this.db.manifests.create(manifest);
  }

  // Read file by streaming chunks
  async* streamFile(fileId: string): AsyncGenerator<Buffer> {
    const manifest = await this.db.manifests.findById(fileId);

    for (const chunkHash of manifest.chunks) {
      const chunk = await this.storage.get(`chunks/${chunkHash}`);
      yield chunk;
    }
  }

  // Delete file (decrement refs, garbage collect)
  async deleteFile(fileId: string): Promise<void> {
    const manifest = await this.db.manifests.findById(fileId);

    // Decrement reference counts
    for (const chunkHash of manifest.chunks) {
      const refCount = await this.decrementRefCount(chunkHash);
      if (refCount === 0) {
        // No more references, chunk can be deleted
        await this.storage.delete(`chunks/${chunkHash}`);
      }
    }

    await this.db.manifests.delete(fileId);
  }
}
```

Reference counting seems simple but has edge cases: What if decrement crashes after delete but before updating count? Use transactions or idempotent operations. Some systems use periodic garbage collection instead—scan for unreferenced chunks and delete. Simpler but delayed space recovery.
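For the garbage-collection alternative mentioned above, a periodic mark-and-sweep job can reconcile stored chunks against the manifests. This is a sketch with illustrative interfaces (the *Like types stand in for your storage and metadata layers); the grace period keeps chunks from in-flight uploads out of the sweep.

```typescript
// Periodic mark-and-sweep GC for unreferenced chunks (sketch).
interface StoredChunk { key: string; lastModified: Date; }
interface ObjectStorageLike {
  list(prefix: string): AsyncIterable<StoredChunk>;
  delete(key: string): Promise<void>;
}
interface ManifestDatabaseLike {
  iterateManifests(): AsyncIterable<{ chunks: string[] }>;
}

class ChunkGarbageCollector {
  constructor(private storage: ObjectStorageLike, private db: ManifestDatabaseLike) {}

  async collect(gracePeriodMs = 24 * 3600 * 1000): Promise<number> {
    // Mark: gather every chunk hash referenced by any manifest.
    const referenced = new Set<string>();
    for await (const manifest of this.db.iterateManifests()) {
      for (const hash of manifest.chunks) referenced.add(hash);
    }

    // Sweep: delete chunks no manifest references, skipping anything newer
    // than the grace period so chunks from in-flight uploads survive.
    let deleted = 0;
    for await (const chunk of this.storage.list('chunks/')) {
      const hash = chunk.key.replace('chunks/', '');
      const ageMs = Date.now() - chunk.lastModified.getTime();
      if (!referenced.has(hash) && ageMs > gracePeriodMs) {
        await this.storage.delete(chunk.key);
        deleted++;
      }
    }
    return deleted;
  }
}
```

The trade-off is the one noted above: no per-write reference-count updates to get wrong, but unreferenced chunks occupy space until the next sweep runs.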
Chunked uploads are essential for handling large files reliably. Let's consolidate the key concepts:
Fixed-size chunking is simple and parallel-friendly; content-defined chunking maximizes deduplication across syncs.
Resumable sessions track which chunks have arrived, so an interrupted upload continues where it left off.
Parallel chunk uploads with bounded concurrency amortize latency and TCP slow-start across connections.
The server verifies each chunk's hash, assembles the file, and validates the result before accepting it.
Pre-signed URLs let clients send bytes directly to object storage, keeping bulk traffic off your API servers.
Content-addressed storage with manifests stores each unique chunk once and reclaims space via reference counting or garbage collection.
What's Next:
With upload and sync mechanics covered, the next page explores Version History—how systems track file changes over time, enable rollback, and manage storage costs of keeping historical versions.
You now understand the complete chunked upload pipeline: from client-side chunking through parallel upload to server-side assembly and storage optimization. These techniques enable reliable upload of files of any size over any network condition. Next, we explore version history and its architectural implications.