When a user presses play on Spotify, they expect music within milliseconds—not seconds. This seemingly simple interaction triggers a sophisticated orchestration of content delivery networks, adaptive bitrate algorithms, predictive buffering, and distributed caching. The user never sees this complexity; they simply hear music.
Building audio streaming at this scale requires solving multiple interconnected problems: How do we encode audio for different devices and network conditions? How do we store petabytes of audio files efficiently? How do we deliver audio to users anywhere in the world with sub-200ms latency? How do we handle the unpredictable nature of mobile networks?
This page explores the architecture that makes instantaneous, reliable audio streaming possible.
You will understand the complete audio delivery pipeline: from ingest and encoding, through storage and distribution, to client-side playback. We'll cover adaptive bitrate streaming, CDN architecture, buffering strategies, and the trade-offs at every layer.
Before any streaming occurs, audio content must be ingested, processed, and prepared for delivery. The ingestion pipeline handles content from labels, distributors, and independent artists.
Ingestion Process Overview:
Label/Distributor → Upload API → Validation → Processing Queue → Transcoding → Quality Control → Storage → CDN Propagation → Available for Streaming
This pipeline runs continuously, processing approximately 40,000 new tracks daily.
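A quick sanity check on that rate: 40,000 tracks per day is roughly one new track every two seconds, and with five output formats per track the transcoding tier must complete on the order of 200,000 encode jobs daily, a sustained two to three jobs per second around the clock.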
Transcoding Architecture:
Each source file must be encoded into multiple output formats to support different quality tiers, device capabilities, and network conditions.
| Quality Tier | Codec | Bitrate | Use Case | File Size (4 min track) |
|---|---|---|---|---|
| Low | AAC-HE | 24 kbps | Very poor network, data saving | ~750 KB |
| Normal | Ogg Vorbis | 96 kbps | Standard mobile streaming | ~3 MB |
| High | Ogg Vorbis | 160 kbps | Premium mobile | ~5 MB |
| Very High | Ogg Vorbis | 320 kbps | Premium desktop/WiFi | ~10 MB |
| Lossless | FLAC | ~1,000 kbps | HiFi tier (audiophiles) | ~30 MB |
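The file sizes follow directly from bitrate × duration ÷ 8: at 320 kbps, a 4-minute track is 320 kb/s × 240 s ÷ 8 ≈ 9.6 MB, matching the ~10 MB figure above; the same arithmetic gives ~720 KB at 24 kbps and ~30 MB at ~1,000 kbps FLAC.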
```yaml
# Transcoding configuration per quality tier
output_formats:
  - name: low_24kbps
    codec: aac-he
    bitrate: 24000
    sample_rate: 44100
    channels: 2
  - name: normal_96kbps
    codec: vorbis
    bitrate: 96000
    sample_rate: 44100
    channels: 2
  - name: high_160kbps
    codec: vorbis
    bitrate: 160000
    sample_rate: 44100
    channels: 2
  - name: very_high_320kbps
    codec: vorbis
    bitrate: 320000
    sample_rate: 44100
    channels: 2
  - name: lossless
    codec: flac
    bitrate: variable
    sample_rate: 44100  # Or preserve source if higher
    channels: 2
    bit_depth: 16       # Or preserve source

processing:
  normalize_loudness: true
  target_lufs: -14
  analyze_true_peak: true
  generate_waveform: true
```

Spotify primarily uses Ogg Vorbis for streaming because it's royalty-free (unlike AAC/MP3), offers an excellent quality-to-bitrate ratio, and supports gapless playback natively. FLAC is used for the lossless tier due to broad device support and efficient compression.
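To make one step of this pipeline concrete, here is a minimal sketch of a single transcode job using ffmpeg. The tool choice, the helper function, and the -1 dBTP true-peak ceiling are illustrative assumptions, not Spotify's actual pipeline; only the bitrate, sample rate, channel count, and -14 LUFS target come from the config above.

```python
import subprocess

def transcode_vorbis(source_path: str, out_path: str, bitrate_kbps: int) -> None:
    """Encode a source file to Ogg Vorbis at the given bitrate,
    normalizing loudness to -14 LUFS (true-peak ceiling assumed at -1 dBTP)."""
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", source_path,
            "-af", "loudnorm=I=-14:TP=-1.0",  # target_lufs / assumed true-peak ceiling
            "-c:a", "libvorbis",
            "-b:a", f"{bitrate_kbps}k",
            "-ar", "44100",                   # sample_rate from config
            "-ac", "2",                       # channels from config
            out_path,
        ],
        check=True,
    )

# e.g. the high_160kbps tier:
# transcode_vorbis("master.wav", "v1-160k.ogg", 160)
```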
With 100+ million tracks each encoded at 5 quality levels, audio storage must be massive, durable, and efficient. This isn't just about raw capacity—it's about access patterns, redundancy, and cost optimization.
Storage Scale:
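A back-of-envelope estimate from the numbers above: 100 million tracks at roughly 0.75 + 3 + 5 + 10 + 30 ≈ 49 MB per track across the five quality tiers is about 5 PB for a single copy; with multi-region replication (e.g., three copies) the total lands near the ~15 PB origin figure shown in the CDN diagram later on this page.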
File Organization:
Audio files are organized by a content-addressable scheme that enables efficient caching, deduplication, and version management:
```
# Origin storage path structure
/audio/{track_id}/{quality}/{version}-{checksum}.ogg

# Example:
/audio/5Gu0PDLN4YJeW52vHuLgR5/320k/v3-a7d8f923c4.ogg

# Components:
# - track_id: Spotify Track URI identifier
# - quality: 24k, 96k, 160k, 320k, flac
# - version: Encoding version (allows re-encoding with improved codec)
# - checksum: First 10 chars of file hash (cache invalidation)

# CDN edge cache key (includes geo for licensing):
/{country_code}/audio/{track_id}/{quality}/{version}-{checksum}.ogg

# This allows:
# - Easy cache invalidation via checksum change
# - Codec version upgrades without breaking existing streams
# - Geo-based routing for licensing compliance
# - Efficient CDN cache key design
```

Access Pattern Optimization:
Audio streaming has a Zipf-like distribution—a small percentage of tracks account for the majority of streams:
| Content Tier | % of Catalog | % of Streams | Caching Strategy |
|---|---|---|---|
| Mega-hits (top 1%) | 1% | 30% | Always cached at edge, pre-warmed |
| Popular (top 10%) | 9% | 40% | Cached at regional + edge when accessed |
| Long-tail (remaining) | 90% | 30% | Origin fetch with regional caching |
Before major album releases (Taylor Swift, Drake), CDN caches are pre-warmed by pushing content to edge PoPs proactively. This prevents origin overload during the release spike when millions stream simultaneously.
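A minimal sketch of what pre-warming can look like, assuming a hypothetical convention in which fetching a URL through an edge PoP populates that PoP's cache. The hostnames, URL layout, and segment naming are invented for illustration.

```python
import concurrent.futures
import urllib.request

# Hypothetical edge PoP hostnames (illustrative only).
EDGE_POPS = ["nyc.edge.example.com", "lon.edge.example.com", "tok.edge.example.com"]

def prewarm_track(track_id: str, quality: str, num_segments: int) -> None:
    """Request every segment of a track through every edge PoP so the
    content is cached before the release-day traffic spike."""
    urls = [
        f"https://{pop}/audio/{track_id}/{quality}/segment_{i}.ogg"
        for pop in EDGE_POPS
        for i in range(num_segments)
    ]
    with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
        # Each fetch pulls the segment through the PoP, populating its cache.
        pool.map(lambda u: urllib.request.urlopen(u).read(), urls)

# Before midnight on release day, for every track on the album:
# prewarm_track("5Gu0PDLN4YJeW52vHuLgR5", "320k", num_segments=24)
```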
A global CDN is essential for delivering audio with low latency to users worldwide. The CDN architecture spans hundreds of Points of Presence (PoPs) and must handle hundreds of thousands of concurrent streams.
CDN Design Principles:

- Proximity: serve each request from the PoP closest to the user
- Popularity-tiered caching: hot content at the edge, the long tail at regional caches and origin
- Licensing awareness: route requests only to PoPs that satisfy geo-restrictions
- Redundancy: fail over across PoPs (and CDN providers) when one degrades
Tiered CDN Architecture:
```
┌─────────────────────────────────────────────────────────────────┐
│                        CDN ARCHITECTURE                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │                    TIER 1: EDGE PoPs                     │  │
│  │  ┌─────┐  ┌─────┐  ┌─────┐  ┌─────┐  ┌─────┐  ┌─────┐   │  │
│  │  │ NYC │  │ LA  │  │ LON │  │ TOK │  │ SYD │  │ ... │   │  │
│  │  └─────┘  └─────┘  └─────┘  └─────┘  └─────┘  └─────┘   │  │
│  │  ~100KB cache per PoP. Cache popular content.            │  │
│  │  Serves ~95% of requests. <10ms response.                │  │
│  └──────────────────────────────────────────────────────────┘  │
│                             │                                   │
│                             ▼  Cache MISS                       │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │                 TIER 2: REGIONAL CACHES                  │  │
│  │  ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌──────────┐  │  │
│  │  │  US-EAST  │ │  US-WEST  │ │  EU-WEST  │ │   APAC   │  │  │
│  │  └───────────┘ └───────────┘ └───────────┘ └──────────┘  │  │
│  │  ~10TB cache per region. Cache regionally popular.       │  │
│  │  Serves ~4% of requests. <50ms response.                 │  │
│  └──────────────────────────────────────────────────────────┘  │
│                             │                                   │
│                             ▼  Cache MISS                       │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │                  TIER 3: ORIGIN STORAGE                  │  │
│  │  ┌────────────────────────────────────────────────────┐  │  │
│  │  │  Cloud Object Storage (S3/GCS)                     │  │  │
│  │  │  Multi-region replication                          │  │  │
│  │  │  ~15 PB total storage                              │  │  │
│  │  │  Serves ~1% of requests. <500ms response.          │  │  │
│  │  └────────────────────────────────────────────────────┘  │  │
│  └──────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
```

PoP Selection Algorithm:
When a client requests audio, the system must select the optimal PoP:
```python
def select_optimal_pop(user_location, track_id, user_country):
    """
    Select the optimal CDN PoP for serving audio.

    Factors considered:
    1. Geographic proximity (latency)
    2. Licensing compliance (geo-restrictions)
    3. PoP health and load
    4. Cache probability (is content likely cached?)
    """
    # Get candidate PoPs within acceptable latency
    candidate_pops = get_pops_by_proximity(user_location, max_latency_ms=100)

    # Filter by licensing compliance
    licensed_pops = [
        pop for pop in candidate_pops
        if is_track_available(track_id, pop.country, user_country)
    ]

    if not licensed_pops:
        raise GeoRestrictionError(f"Track {track_id} not available in {user_country}")

    # Score each PoP
    scored_pops = []
    for pop in licensed_pops:
        score = calculate_pop_score(
            latency=pop.estimated_latency(user_location),
            load=pop.current_load_percent,
            cache_probability=pop.cache_probability(track_id),
            health_score=pop.health_score
        )
        scored_pops.append((pop, score))

    # Select best PoP (randomization spreads load across near-equal PoPs)
    best_pop = weighted_random_selection(scored_pops)
    return best_pop


def calculate_pop_score(latency, load, cache_probability, health_score):
    """
    Weighted scoring for PoP selection.
    Lower latency, lower load, higher cache probability = higher score.
    """
    return (
        (100 - latency) * 0.3 +          # Latency weight: 30%
        (100 - load) * 0.2 +             # Load weight: 20%
        cache_probability * 100 * 0.3 +  # Cache weight: 30%
        health_score * 0.2               # Health weight: 20%
    )
```

Large streaming services often use multiple CDN providers (Akamai, CloudFlare, Fastly, plus custom infrastructure) for redundancy and cost optimization. Real-time monitoring determines which CDN performs best for each region and automatically routes traffic accordingly.
Mobile users experience constantly varying network conditions: switching between WiFi and cellular, entering tunnels, moving through congested areas. Adaptive Bitrate Streaming (ABR) dynamically adjusts audio quality to match available bandwidth, ensuring uninterrupted playback.
The ABR Challenge:
Balancing three competing goals:

- Maximize quality: play at the highest bitrate the network can sustain
- Minimize stalls: never let the playback buffer run empty mid-song
- Minimize switching: avoid frequent, audible quality oscillations
ABR Algorithm Implementation:
```python
def percentile(values, pct):
    """Nearest-rank percentile over recent bandwidth samples."""
    ordered = sorted(values)
    index = max(0, int(len(ordered) * pct / 100) - 1)
    return ordered[index]


class AdaptiveBitrateController:
    """
    ABR controller that selects optimal audio quality
    based on network conditions and playback state.
    """

    # Available quality levels (bitrate in kbps)
    QUALITY_LEVELS = [24, 96, 160, 320]

    # Thresholds
    MIN_BUFFER_SECONDS = 10     # Minimum buffer before quality upgrade
    CRITICAL_BUFFER = 5         # Below this, aggressively downgrade
    STARTUP_BUFFER_TARGET = 5   # Buffer to accumulate before starting

    def __init__(self, user_max_quality=320):
        self.user_max_quality = user_max_quality  # Subscription limit
        self.current_quality = 96                 # Start at medium quality
        self.bandwidth_history = []
        self.buffer_level = 0

    def estimate_bandwidth(self, segment_size_bytes, download_time_ms):
        """
        Estimate available bandwidth from recent segment downloads.
        Keep a sliding window of measurements and return a conservative
        (25th percentile) estimate to avoid overestimation.
        """
        current_bw = (segment_size_bytes * 8) / (download_time_ms / 1000)  # bits/sec
        self.bandwidth_history.append(current_bw)
        if len(self.bandwidth_history) > 10:
            self.bandwidth_history.pop(0)

        # Use conservative estimate (lower percentile) to avoid overestimation
        return percentile(self.bandwidth_history, 25)

    def select_quality(self, estimated_bandwidth_bps):
        """
        Select quality level based on bandwidth and buffer state.

        Core algorithm:
        1. If buffer is critical, immediately drop quality
        2. If bandwidth is stable and buffer is healthy, consider upgrade
        3. Stay conservative on upgrades, aggressive on downgrades
        """
        # Convert to kbps for comparison with quality levels
        bandwidth_kbps = estimated_bandwidth_bps / 1000

        # Aggressive downgrade if buffer is critical
        if self.buffer_level < self.CRITICAL_BUFFER:
            return self._select_safe_quality(bandwidth_kbps * 0.5)

        # Conservative quality selection:
        # require 1.5x bandwidth headroom for a quality level
        safe_bitrate = bandwidth_kbps / 1.5

        # Can only upgrade if buffer is healthy
        can_upgrade = self.buffer_level > self.MIN_BUFFER_SECONDS

        target_quality = self._select_safe_quality(safe_bitrate)

        if target_quality > self.current_quality and not can_upgrade:
            # Don't upgrade when buffer isn't healthy
            return self.current_quality

        # Downgrades (and holds) are always allowed for safety
        return target_quality

    def _select_safe_quality(self, max_bitrate_kbps):
        """Select highest quality that fits within bitrate constraint and subscription."""
        available = [q for q in self.QUALITY_LEVELS
                     if q <= max_bitrate_kbps and q <= self.user_max_quality]
        return max(available) if available else self.QUALITY_LEVELS[0]

    def startup_strategy(self, estimated_bandwidth):
        """
        Startup uses lower quality for faster time-to-first-byte,
        then upgrades as buffer builds.
        """
        # Start at lowest quality for instant playback
        initial_quality = min(self.QUALITY_LEVELS)

        # After STARTUP_BUFFER_TARGET seconds buffered, switch to optimal
        if self.buffer_level >= self.STARTUP_BUFFER_TARGET:
            return self.select_quality(estimated_bandwidth)

        return initial_quality
```

Quality changes can only happen at segment boundaries. If segments are 10 seconds, the worst-case reaction time to a bandwidth drop is 10 seconds. Smaller segments (3-5 seconds) enable faster adaptation but increase request overhead and reduce compression efficiency.
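To see the controller above in action, here is a short simulation (the traffic values are illustrative):

```python
abr = AdaptiveBitrateController(user_max_quality=320)

# Healthy network: a 600 KB segment downloads in 1 second (~4.8 Mbps).
abr.buffer_level = 25
bw = abr.estimate_bandwidth(segment_size_bytes=600_000, download_time_ms=1_000)
print(abr.select_quality(bw))        # 320 -- 4800 kbps / 1.5 headroom clears 320

# Network drops and the buffer drains below CRITICAL_BUFFER.
abr.current_quality = 320
abr.buffer_level = 3
print(abr.select_quality(400_000))   # 160 -- 400 kbps halved to 200; best fit is 160
```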
The client (mobile app, desktop app, web player) is responsible for fetching audio, managing buffers, decoding, and rendering audio output. A well-designed client masks network imperfections from the user.
Client Playback Pipeline:
```
┌─────────────────────────────────────────────────────────────────┐
│                  CLIENT PLAYBACK ARCHITECTURE                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────┐   Next    ┌──────────┐  Segments   ┌─────────┐   │
│  │ Playlist │──────────▶│ Prefetch │────────────▶│  Cache  │   │
│  │ Manager  │  tracks   │  Engine  │             │ Manager │   │
│  └──────────┘           └──────────┘             └─────────┘   │
│        │                     │                        │        │
│        ▼                     ▼                        ▼        │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  SEGMENT FETCHER                                         │  │
│  │  • Manages HTTP connections to CDN                       │  │
│  │  • Implements retry logic with exponential backoff       │  │
│  │  • Reports bandwidth measurements to ABR controller      │  │
│  └──────────────────────────────────────────────────────────┘  │
│                              │                                  │
│                              ▼                                  │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  SEGMENT BUFFER                                          │  │
│  │  • Stores downloaded compressed segments                 │  │
│  │  • Typical capacity: 30-60 seconds                       │  │
│  │  • Reports buffer level to ABR controller                │  │
│  └──────────────────────────────────────────────────────────┘  │
│                              │                                  │
│                              ▼                                  │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  AUDIO DECODER                                           │  │
│  │  • Decodes Ogg Vorbis/FLAC to PCM                        │  │
│  │  • Hardware-accelerated where available                  │  │
│  │  • Handles gapless transition between tracks             │  │
│  └──────────────────────────────────────────────────────────┘  │
│                              │                                  │
│                              ▼                                  │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  AUDIO EFFECTS                                           │  │
│  │  • Loudness normalization (ReplayGain)                   │  │
│  │  • Equalizer (user-configurable)                         │  │
│  │  • Crossfade between tracks                              │  │
│  └──────────────────────────────────────────────────────────┘  │
│                              │                                  │
│                              ▼                                  │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  AUDIO OUTPUT                                            │  │
│  │  • Platform audio API (AVAudioSession, AudioTrack,       │  │
│  │    Web Audio)                                            │  │
│  │  • Bluetooth/AirPlay/Chromecast routing                  │  │
│  │  • Volume control integration                            │  │
│  └──────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
```

Predictive Prefetching:
To achieve instant playback, the client predicts what the user will play next, based on the play queue and listening behavior, and prefetches the opening segments ahead of time (a sketch follows the note below).
Users think instant playback is about fast networks. In reality, it's about smart prefetching. By the time the user presses play, the first several seconds are already cached locally. Edge CDN proximity handles the rest.
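A minimal sketch of queue-based prefetching, with a hypothetical Track type and fetcher interface (something like the ResilientSegmentFetcher shown later in this page would slot in). Only the first few segments of each upcoming track are fetched: enough to start playback instantly while the rest streams on demand.

```python
from dataclasses import dataclass

PREFETCH_TRACKS = 3      # how far ahead in the play queue to look
PREFETCH_SEGMENTS = 2    # ~10-20 s of audio at 5-10 s per segment

@dataclass
class Track:
    id: str

async def prefetch_upcoming(queue: list[Track], fetcher, quality: str) -> None:
    """Warm the local cache with the opening segments of upcoming tracks.

    `fetcher.is_cached` and `fetcher.fetch_segment` are assumed interfaces,
    not an actual client API.
    """
    for track in queue[:PREFETCH_TRACKS]:
        for segment_index in range(PREFETCH_SEGMENTS):
            if not fetcher.is_cached(track.id, segment_index, quality):
                await fetcher.fetch_segment(track.id, segment_index, quality)
```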
Many albums are designed with seamless audio transitions between tracks—classical symphonies, DJ mixes, concept albums like Pink Floyd's "Dark Side of the Moon". Gapless playback ensures these transitions are preserved.
The Challenge:
Audio codecs add extra samples to a stream: encoder delay at the start and padding at the end of the final frame. If these samples aren't trimmed precisely, they become audible gaps of silence at track transitions.
```python
class GaplessPlaybackEngine:
    """
    Handles seamless audio transitions between tracks.
    """

    def __init__(self, audio_output_buffer, decoder):
        self.output_buffer = audio_output_buffer
        self.decoder = decoder
        self.current_track = None
        self.next_track = None
        self.trim_end_samples = 0
        self.trim_start_samples = 0
        self.is_first_segment_of_next = False

    def prepare_transition(self, current_track, next_track):
        """
        Prepare for gapless transition between tracks.
        Must be called before current track ends.
        """
        self.current_track = current_track
        self.next_track = next_track

        # Get gapless metadata from track info
        current_end_padding = current_track.metadata.get('end_padding_samples', 0)
        next_start_padding = next_track.metadata.get('encoder_delay_samples', 0)

        # Store for trim operation during transition
        self.trim_end_samples = current_end_padding
        self.trim_start_samples = next_start_padding
        self.is_first_segment_of_next = True

    def decode_and_queue(self, segment, is_last_of_track=False):
        """
        Decode segment and queue to output buffer.
        Apply trimming for gapless playback.
        """
        pcm_samples = self.decoder.decode(segment)

        if is_last_of_track and self.trim_end_samples > 0:
            # Trim encoder padding from end of current track
            pcm_samples = pcm_samples[:-self.trim_end_samples]

        if self.is_first_segment_of_next:
            if self.trim_start_samples > 0:
                # Trim encoder delay from start of next track
                pcm_samples = pcm_samples[self.trim_start_samples:]
            self.is_first_segment_of_next = False

        # Queue samples maintaining sample-accurate timing
        self.output_buffer.append(pcm_samples)

    def calculate_transition_point(self, current_track):
        """
        Calculate exact sample where current track content ends.
        This is total_samples - end_padding_samples.
        """
        total_samples = current_track.metadata['total_samples']
        end_padding = current_track.metadata.get('end_padding_samples', 0)
        return total_samples - end_padding


# Ogg Vorbis gapless metadata handling
class OggVorbisGaplessParser:
    """
    Parse encoder delay and padding from Ogg Vorbis files.
    Vorbis stores this in granule position calculations.
    """

    def parse_gapless_info(self, ogg_file):
        """
        Extract gapless playback information from Ogg Vorbis file.

        The crucial fields are:
        - preskip: Encoder delay in samples (skip at start)
        - granule_position: Allows calculation of end padding
        """
        # Parse Ogg pages to find identification header
        id_header = self.parse_identification_header(ogg_file)

        # preskip is encoder delay
        encoder_delay = id_header.get('preskip', 0)

        # Calculate end padding from final granule position
        final_granule = self.get_final_granule_position(ogg_file)
        total_pcm_samples = final_granule - encoder_delay

        # End padding is the difference from actual content length
        content_samples = self.calculate_content_samples(ogg_file)
        end_padding = total_pcm_samples - content_samples

        return {
            'encoder_delay_samples': encoder_delay,
            'end_padding_samples': max(0, end_padding),
            'total_content_samples': content_samples
        }
```

Gapless playback might seem like a minor feature, but for audiophiles and album listeners, gaps are immediately noticeable and frustrating. This attention to detail is what separates good streaming services from great ones.
In distributed systems, failures are inevitable. For streaming, failures manifest as network timeouts, CDN errors, corrupt data, or server overload. The goal is graceful degradation—maintain playback even when components fail.
Failure Modes and Recovery:
| Failure Mode | Detection | Recovery Strategy | User Impact |
|---|---|---|---|
| CDN PoP timeout | Request takes >2s | Retry with different PoP | Brief stall if buffer low |
| Segment corrupt | Checksum mismatch | Re-fetch segment, try alt PoP | None if detected in buffer |
| Complete network loss | All requests fail | Continue from cache/offline | Playback stops when buffer empties |
| Origin failure | Origin fetch (after regional miss) fails | Failover to backup origin | Possible quality degradation |
| Decoder error | Invalid audio frame | Skip frame, log error | Brief audio glitch |
| Session expiry | 401 from CDN | Refresh access token | Brief pause during re-auth |
```python
import asyncio
import logging

log = logging.getLogger(__name__)


class NetworkError(Exception):
    pass

class CorruptSegmentError(Exception):
    pass

class SegmentFetchError(Exception):
    def __init__(self, message, cause=None):
        super().__init__(message)
        self.cause = cause


class ResilientSegmentFetcher:
    """
    Fetch audio segments with intelligent retry and failover logic.
    """

    MAX_RETRIES = 3
    INITIAL_TIMEOUT_MS = 2000
    BACKOFF_MULTIPLIER = 1.5

    def __init__(self, cdn_selector, http_client):
        self.cdn_selector = cdn_selector
        self.http_client = http_client
        self.failed_pops = set()  # Track recently failed PoPs

    async def fetch_segment(self, track_id, segment_index, quality):
        """
        Fetch segment with automatic retry and PoP failover.
        """
        last_error = None
        timeout = self.INITIAL_TIMEOUT_MS

        for attempt in range(self.MAX_RETRIES):
            # Select PoP, excluding recently failed ones
            pop = self.cdn_selector.select_pop(track_id, exclude=self.failed_pops)
            url = self.build_segment_url(pop, track_id, segment_index, quality)

            try:
                response = await self.http_client.get(url, timeout_ms=timeout)

                # Validate response
                if not self.validate_segment(response.body, response.checksum):
                    raise CorruptSegmentError(f"Checksum mismatch for {url}")

                # Success - clear failed PoP if we had marked it
                self.failed_pops.discard(pop.id)
                return response.body

            except (TimeoutError, NetworkError) as e:
                last_error = e
                self.failed_pops.add(pop.id)
                timeout = int(timeout * self.BACKOFF_MULTIPLIER)
                # Log for monitoring
                log.warning(f"Segment fetch failed: {pop.id}, attempt {attempt + 1}")

            except CorruptSegmentError as e:
                last_error = e
                # Don't back off for corruption, just try a different PoP
                self.failed_pops.add(pop.id)

        # All retries exhausted
        raise SegmentFetchError(
            f"Failed to fetch segment after {self.MAX_RETRIES} attempts",
            cause=last_error
        )

    def schedule_pop_recovery(self, pop_id, delay_seconds=60):
        """
        Remove PoP from failed set after delay.
        Transient failures shouldn't permanently blacklist a PoP.
        """
        async def recover():
            await asyncio.sleep(delay_seconds)
            self.failed_pops.discard(pop_id)

        asyncio.create_task(recover())
```

The buffer is everything. With 30 seconds buffered, you have 30 seconds to recover from failures before the user notices. This is why aggressive pre-buffering and conservative quality selection (to maintain buffer) are critical.
We've covered the complete audio streaming architecture. Let's consolidate the key architectural decisions:
| Component | Decision | Rationale |
|---|---|---|
| Codec | Ogg Vorbis (streaming), FLAC (lossless) | Royalty-free, excellent quality, gapless support |
| Quality Tiers | 24k, 96k, 160k, 320k, lossless | Cover all network conditions and subscription tiers |
| Storage | Tiered: Edge → Regional → Origin | Balance latency vs. cost vs. capacity |
| Delivery | Segment-based with ABR | Adapt to network conditions, enable quality switching |
| Segment Size | 5-10 seconds | Balance responsiveness vs. efficiency |
| Buffer Target | 30-60 seconds | Survive typical network disruptions |
| Prefetch | Predictive based on queue and behavior | Achieve instant playback |
What's next:
With streaming architecture covered, we'll move to Playlist and Library Management—how to design data models and systems that support billions of playlists and user libraries at scale.
You now understand the complete audio streaming architecture: from ingestion and encoding, through CDN distribution and adaptive bitrate streaming, to client-side playback and error handling. This forms the technical core of any music streaming platform.