Human perception is inherently multimodal. When we watch a video, we don't just see images in sequence—we hear synchronized sounds that provide crucial context. The bark identifies a dog before we see it; the crash of waves tells us we're at a beach; a speaker's lip movements help us understand speech in noisy environments. This natural integration of audio and visual information has inspired a rich area of machine learning research: audio-visual learning.
Audio-visual learning extends multimodal AI beyond the vision-language paradigm to incorporate temporal, acoustic signals. This creates new opportunities: learning visual representations from naturally co-occurring sounds, improving speech recognition using visual cues, generating realistic audio for silent videos, and building AI systems that understand the world through multiple senses simultaneously.
This page covers the theoretical foundations and practical techniques for audio-visual learning, including self-supervised approaches that learn from video without labels, cross-modal generation, and applications in speech, music, and video understanding.
The fundamental insight underlying audio-visual learning is that sounds and visual events are naturally correlated in videos. When we see a guitar being strummed, we expect to hear music; when a dog appears, we may hear barking. This correspondence provides a powerful supervisory signal for learning without explicit labels.
Types of Audio-Visual Correspondence:
Spatial Correspondence: Objects that produce sound are visible in the scene. A barking dog appears in a specific location; a speaking person's face is visible.
Temporal Correspondence: Audio and visual events are synchronized. The sound of a bouncing ball aligns with the moment of impact; lip movements sync with speech.
Semantic Correspondence: Audio content relates meaningfully to visual content. Music genres correlate with visual styles; environmental sounds match scene types.
Exploiting Correspondence for Self-Supervised Learning:
The key insight is that audio-visual correspondence creates natural positive and negative pairs:
Models learn representations by distinguishing aligned from misaligned pairs, developing an understanding of both modalities in the process.
| Correspondence Type | Learning Signal | Example Applications |
|---|---|---|
| Spatial | Sound source localization | Visual attention, source separation |
| Temporal | Synchronization detection | Lip sync, action recognition |
| Semantic | Content matching | Video classification, retrieval |
| Cross-modal generation | Reconstruction | Audio-to-video, video-to-audio |
Like CLIP's use of image-caption pairs, audio-visual learning leverages naturally occurring data—billions of hours of video with synchronized audio. This eliminates the need for manual annotation while providing diverse, real-world training signals.
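The aligned/misaligned pairing described above can be sketched directly: given a batch where row i of the audio and visual embeddings comes from the same video, positives are matching indices and negatives come from a cyclic shift (a common trick; the function name here is illustrative).

```python
import torch


def make_avc_pairs(audio_embs: torch.Tensor, visual_embs: torch.Tensor):
    """Build aligned (positive) and misaligned (negative) audio-visual pairs.

    audio_embs, visual_embs: (B, D) embeddings where row i of each
    tensor comes from the same source video i.
    """
    B = audio_embs.shape[0]
    # Negatives: pair each audio with the *next* video's frames (cyclic shift),
    # so every negative mixes two different source videos
    neg_v = torch.roll(visual_embs, shifts=1, dims=0)
    pairs_a = torch.cat([audio_embs, audio_embs], dim=0)
    pairs_v = torch.cat([visual_embs, neg_v], dim=0)
    labels = torch.cat([torch.ones(B), torch.zeros(B)])  # 1 = aligned pair
    return pairs_a, pairs_v, labels
```

In practice the negatives come for free from the rest of the batch; no annotation is ever needed, which is what makes this signal scalable.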
Before combining audio with vision, we need effective audio representations. Modern audio encoders transform raw waveforms into semantic embeddings suitable for multimodal fusion.
Spectral Representations:
The most common approach converts raw audio waveforms into spectrograms—2D representations of frequency content over time.
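For concreteness, the spectrogram's shape follows directly from the STFT parameters. This small helper (hypothetical, mirroring torchaudio's defaults) computes the resulting (n_mels, frames) shape:

```python
def spectrogram_shape(num_samples: int, n_fft: int = 1024,
                      hop_length: int = 160, n_mels: int = 128,
                      center: bool = True) -> tuple:
    """Shape of a mel spectrogram for a mono waveform of num_samples samples.

    With center=True (torchaudio's default), the signal is padded by
    n_fft // 2 on both sides, giving 1 + num_samples // hop_length frames.
    """
    if center:
        num_frames = 1 + num_samples // hop_length
    else:
        num_frames = 1 + (num_samples - n_fft) // hop_length
    return (n_mels, num_frames)


# One second of 16 kHz audio with a 10 ms hop -> about 100 frames
print(spectrogram_shape(16000))  # (128, 101)
```

At a 10 ms hop, each second of audio becomes roughly 100 spectrogram columns, which sets the sequence length seen by any downstream encoder.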
Spectrogram to Embedding:
Once audio is represented as a spectrogram (essentially an "image" of sound), we can apply visual architectures such as CNNs or ViT-style patch transformers.
Waveform-Based Models:
Alternatively, models can process raw waveforms directly.
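A minimal waveform encoder in the spirit of wav2vec's convolutional feature extractor, as a sketch (the layer sizes and pooling here are illustrative, not a published configuration):

```python
import torch
import torch.nn as nn


class WaveformEncoder(nn.Module):
    """Minimal 1D-convolutional encoder over raw audio.

    Strided convolutions learn filterbank-like features end-to-end
    instead of using a fixed spectrogram transform; the strides
    multiply to an overall downsampling factor of 5 * 4 * 2 = 40.
    """

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=10, stride=5), nn.GELU(),
            nn.Conv1d(64, 128, kernel_size=8, stride=4), nn.GELU(),
            nn.Conv1d(128, embed_dim, kernel_size=4, stride=2), nn.GELU(),
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        x = waveform.unsqueeze(1)      # (B, T) -> (B, 1, T)
        feats = self.conv(x)           # (B, embed_dim, T')
        return feats.mean(dim=-1)      # mean-pool to (B, embed_dim)
```

The trade-off versus spectrograms: raw-waveform models learn their own frequency decomposition but typically need more data and compute to match hand-crafted mel features.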
```python
import torch
import torch.nn as nn
import torchaudio
import torchaudio.transforms as T


class AudioEncoder(nn.Module):
    """
    Audio encoder that converts waveforms to embeddings.
    Uses mel spectrogram followed by a transformer encoder.
    """

    def __init__(
        self,
        sample_rate: int = 16000,
        n_mels: int = 128,
        n_fft: int = 1024,
        hop_length: int = 160,  # 10ms at 16kHz
        embed_dim: int = 768,
        num_layers: int = 12,
        num_heads: int = 12,
    ):
        super().__init__()
        self.embed_dim = embed_dim

        # Mel spectrogram transform
        self.mel_transform = T.MelSpectrogram(
            sample_rate=sample_rate,
            n_fft=n_fft,
            hop_length=hop_length,
            n_mels=n_mels,
        )

        # Patch embedding (treat spectrogram like image)
        self.patch_size = (16, 16)  # (freq, time)
        patch_dim = self.patch_size[0] * self.patch_size[1]
        self.patch_embed = nn.Linear(patch_dim, embed_dim)

        # Transformer encoder
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim,
            nhead=num_heads,
            dim_feedforward=embed_dim * 4,
            batch_first=True,
        )
        self.transformer = nn.TransformerEncoder(
            encoder_layer, num_layers=num_layers
        )

        # [CLS] token for pooled representation
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def patchify(self, mel: torch.Tensor) -> torch.Tensor:
        """Split a (B, n_mels, time) spectrogram into non-overlapping patches."""
        pf, pt = self.patch_size
        B, F, T_len = mel.shape
        # Trim so both axes divide evenly into patches
        mel = mel[:, : (F // pf) * pf, : (T_len // pt) * pt]
        patches = mel.unfold(1, pf, pf).unfold(2, pt, pt)  # (B, F//pf, T//pt, pf, pt)
        return patches.reshape(B, -1, pf * pt)  # (B, num_patches, patch_dim)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        """
        Args:
            waveform: (B, T) raw audio waveform at 16kHz
        Returns:
            (B, embed_dim) audio embedding
        """
        # Convert to mel spectrogram: (B, n_mels, time)
        mel = self.mel_transform(waveform)
        mel = torch.log(mel + 1e-6)  # Log scale

        B = mel.shape[0]

        # Reshape to patches and project
        patches = self.patchify(mel)          # (B, num_patches, patch_dim)
        tokens = self.patch_embed(patches)    # (B, num_patches, embed_dim)

        # Prepend [CLS] token
        cls = self.cls_token.expand(B, -1, -1)
        tokens = torch.cat([cls, tokens], dim=1)

        # Transformer encoding
        encoded = self.transformer(tokens)

        # Return [CLS] token as audio embedding
        return encoded[:, 0]
```

Self-supervised audio-visual learning leverages the natural alignment between audio and video to learn representations without manual labels.
Several approaches have proven effective:
Audio-Visual Correspondence (AVC) Learning:
The foundational approach trains models to determine whether audio and video are from the same source:
$$\mathcal{L}_{AVC} = -\mathbb{E}\left[y\log p(\text{match}|a,v) + (1-y)\log(1-p(\text{match}|a,v))\right]$$
Where $y=1$ for aligned pairs and $y=0$ for misaligned pairs.
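A minimal sketch of this objective: a small fusion head (the architecture here is hypothetical) scores concatenated audio-visual embeddings, and binary cross-entropy implements $\mathcal{L}_{AVC}$:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AVCHead(nn.Module):
    """Fusion head for audio-visual correspondence: concatenates the two
    embeddings and predicts a single match logit per pair."""

    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, 1),
        )

    def forward(self, a: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # a, v: (N, dim) -> (N,) raw match logits
        return self.mlp(torch.cat([a, v], dim=-1)).squeeze(-1)


def avc_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """L_AVC: binary cross-entropy between the predicted match
    probability sigmoid(logit) and the label y in {0, 1}."""
    return F.binary_cross_entropy_with_logits(logits, labels)
```

Here the sigmoid of the logit plays the role of $p(\text{match}|a,v)$ in the equation above.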
Contrastive Audio-Visual Learning:
Mirroring CLIP's approach, we can use contrastive learning with audio-video pairs: each clip's audio and visual frames form a positive pair, while the other clips in the batch serve as negatives.
Landmark Methods:
| Method | Year | Key Innovation |
|---|---|---|
| SoundNet | 2016 | Transfer visual knowledge to audio via distillation |
| L³-Net | 2017 | Audio-visual correspondence as pretext task |
| AVID | 2020 | Audio-visual instance discrimination |
| XDC | 2020 | Cross-modal deep clustering |
| AudioCLIP | 2021 | Extend CLIP to audio modality |
| ImageBind | 2023 | Bind audio, vision, text, depth, thermal, IMU |
ImageBind (Meta, 2023) demonstrates that, by using images as a 'binding' modality, audio, text, depth, thermal, and IMU data can be aligned into a single embedding space. By training each modality against images (which have abundant paired data), all modalities become comparable without requiring all-pairs data.
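The binding idea can be sketched as two independent contrastive losses, both anchored on images (a simplified sketch, assuming row i of each modality batch is paired with image i; audio and text are never paired directly):

```python
import torch
import torch.nn.functional as F


def binding_loss(image_embs: torch.Tensor, audio_embs: torch.Tensor,
                 text_embs: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """ImageBind-style objective: align each non-image modality to images
    with its own symmetric InfoNCE loss. Because both losses share the
    image embedding space, audio and text become comparable transitively."""

    def infonce(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        x = F.normalize(x, dim=-1)
        y = F.normalize(y, dim=-1)
        logits = x @ y.T / temperature            # (B, B) similarity matrix
        labels = torch.arange(x.shape[0], device=x.device)
        return (F.cross_entropy(logits, labels) +
                F.cross_entropy(logits.T, labels)) / 2

    return infonce(audio_embs, image_embs) + infonce(text_embs, image_embs)
```

The key property is that no audio-text pairs are needed at training time, yet audio-to-text retrieval emerges from the shared image anchor.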
```python
import torch
import torch.nn.functional as F


def audio_visual_contrastive_loss(
    audio_embeddings: torch.Tensor,
    visual_embeddings: torch.Tensor,
    temperature: float = 0.07,
) -> torch.Tensor:
    """
    Contrastive loss for audio-visual learning.

    Aligns audio and visual embeddings from the same video
    while pushing apart embeddings from different videos.

    Args:
        audio_embeddings: (B, D) normalized audio embeddings
        visual_embeddings: (B, D) normalized visual embeddings
        temperature: Softmax temperature
    Returns:
        Scalar loss value
    """
    batch_size = audio_embeddings.shape[0]

    # Compute similarity matrix
    # audio_embeddings @ visual_embeddings.T gives (B, B) matrix
    # Diagonal entries are positive pairs
    logits = (audio_embeddings @ visual_embeddings.T) / temperature

    # Labels: each audio matches its corresponding video
    labels = torch.arange(batch_size, device=logits.device)

    # Symmetric loss: audio-to-visual and visual-to-audio
    loss_a2v = F.cross_entropy(logits, labels)
    loss_v2a = F.cross_entropy(logits.T, labels)

    return (loss_a2v + loss_v2a) / 2


class AudioVisualModel(torch.nn.Module):
    """Simple audio-visual model with separate encoders."""

    def __init__(self, audio_encoder, visual_encoder, embed_dim=512):
        super().__init__()
        self.audio_encoder = audio_encoder
        self.visual_encoder = visual_encoder

        # Projection heads to shared space
        self.audio_proj = torch.nn.Linear(audio_encoder.embed_dim, embed_dim)
        self.visual_proj = torch.nn.Linear(visual_encoder.embed_dim, embed_dim)

    def forward(self, audio, frames):
        # Encode modalities
        audio_feat = self.audio_encoder(audio)
        visual_feat = self.visual_encoder(frames)

        # Project and normalize
        audio_embed = F.normalize(self.audio_proj(audio_feat), dim=-1)
        visual_embed = F.normalize(self.visual_proj(visual_feat), dim=-1)

        return audio_embed, visual_embed
```

A compelling application of audio-visual learning is sound source localization: identifying which regions of an image or video correspond to a given sound.
When we hear a dog bark, can the model highlight where the dog is in the frame?
The Localization Challenge:
Unlike supervised object detection, sound source localization uses audio as the query:
Input: Video frame + Audio clip
Output: Spatial attention map or bounding box indicating sound source
Attention-Based Localization:
A common approach uses cross-modal attention between audio and spatial visual features:
$$\text{Attention}(q_a, K_v) = \text{softmax}\left(\frac{q_a K_v^T}{\sqrt{d}}\right)$$
The resulting attention map highlights regions most relevant to the audio.
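The attention computation above can be sketched as follows, assuming a pooled audio query and flattened spatial features from a visual backbone (function and argument names are illustrative):

```python
import torch
import torch.nn.functional as F


def localize_sound(audio_query: torch.Tensor, visual_feats: torch.Tensor,
                   h: int, w: int) -> torch.Tensor:
    """Audio-queried spatial attention for sound source localization.

    Args:
        audio_query: (B, D) pooled audio embedding q_a
        visual_feats: (B, H*W, D) spatial visual features K_v
        h, w: spatial grid size of the visual feature map
    Returns:
        (B, H, W) attention map; each map sums to 1 over locations.
    """
    d = audio_query.shape[-1]
    # Scaled dot product between the audio query and every spatial key
    scores = torch.einsum('bd,bnd->bn', audio_query, visual_feats) / d ** 0.5
    attn = F.softmax(scores, dim=-1)
    return attn.view(-1, h, w)
```

Upsampling this low-resolution map back to the image size gives the heatmap typically shown in localization papers.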
Training Approaches:
| Method | Approach | Supervision | Performance |
|---|---|---|---|
| Attention10k | Cross-modal attention | Self-supervised | Baseline |
| DMC | Discriminative audio-visual | Self-supervised | +5% AP |
| LAVISH | Lightweight cross-attention | Self-supervised | SOTA efficiency |
| AudioCLIP | CLIP-extended to audio | Contrastive | Zero-shot capable |
Sound source localization enables: hearing aid systems that focus on speakers, video editing tools that automatically link audio to sources, security systems that correlate sounds with visual events, and robotic systems that attend to relevant audio sources.
Speech is an inherently audio-visual phenomenon. Lip movements, facial expressions, and head gestures all convey information that complements the acoustic signal. Audio-visual speech processing leverages this multimodal nature for improved recognition, synthesis, and understanding.
Lip Reading (Visual Speech Recognition):
Lip reading recognizes speech from visual input alone—without audio. While humans achieve ~40-60% word accuracy (for untrained individuals), machine learning models can now exceed human performance on benchmark datasets.
Key Components: a visual frontend that detects and crops the mouth region, a spatiotemporal encoder (3D CNN or transformer) over the frame sequence, and a CTC- or attention-based decoder that maps visual features to characters or words.
Audio-Visual Speech Recognition (AVSR):
AVSR combines audio and visual streams for robust speech recognition, especially valuable in noisy environments where audio alone degrades.
$$P(\text{words}|\text{audio}, \text{video}) > P(\text{words}|\text{audio})$$
The visual modality provides complementary information: lip shapes (visemes) disambiguate sounds that are acoustically similar, and visual cues remain reliable when noise corrupts the audio stream.
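One simple way to exploit this complementarity is late fusion of the two streams' per-frame token distributions. This is a sketch; production AVSR systems usually fuse features earlier and learn the weighting, and the fixed `snr_weight` here is a stand-in for an estimated acoustic-reliability score:

```python
import torch
import torch.nn.functional as F


def fuse_avsr_logits(audio_logits: torch.Tensor, visual_logits: torch.Tensor,
                     snr_weight: float = 0.7) -> torch.Tensor:
    """Late fusion for audio-visual speech recognition.

    Combines per-frame vocabulary logits from the audio and visual
    streams as a weighted sum of log-probabilities (an unnormalized
    geometric mean of the two posteriors). In heavy noise, a lower
    snr_weight shifts reliance toward the visual stream.

    Args:
        audio_logits, visual_logits: (B, T, V) per-frame token logits
    Returns:
        (B, T, V) fused scores for decoding
    """
    log_pa = F.log_softmax(audio_logits, dim=-1)
    log_pv = F.log_softmax(visual_logits, dim=-1)
    return snr_weight * log_pa + (1 - snr_weight) * log_pv
```

The fused scores can be fed to the same CTC or beam-search decoder that the audio-only model would use.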
Lip Sync Detection:
Detecting whether audio and video are synchronized has applications in deepfake detection, automatic dubbing and re-synchronization, active speaker detection, and video editing quality control.
AV-HuBERT (Meta, 2022) extends HuBERT to audio-visual input, learning representations from unlabeled video by predicting clusters of audio-visual features. It achieves state-of-the-art lip reading by first pre-training on 1,759 hours of unlabeled data, then fine-tuning on labeled datasets.
Cross-modal generation synthesizes one modality conditioned on another. This includes generating audio from video (e.g., foley sound), generating video from audio (e.g., music visualizers), and synthesizing talking heads from speech.
Video-to-Audio Generation:
Given silent video, generate appropriate synchronized audio, such as footsteps, impacts, and ambient sound (the foley problem).
Audio-to-Video Generation:
Given audio, generate synchronized visual content, such as music-driven animation or speech-driven facial motion.
Talking Head Generation:
A prominent application is generating realistic talking head videos from speech audio:
| Direction | Method | Approach | Application |
|---|---|---|---|
| Video→Audio | SpecVQGAN | VQ-GAN + spectrogram | Foley generation |
| Video→Audio | Im2Wav | Diffusion-based | Environmental sounds |
| Audio→Video | Wav2Lip | GAN-based lip sync | Video dubbing |
| Audio→Video | SadTalker | 3DMM + diffusion | Talking head generation |
| Audio→Video | MakeItTalk | Content + speaker disentangle | Animation |
Audio-visual generation, especially talking head synthesis, raises significant ethical concerns around deepfakes and misinformation. Research and deployment must incorporate detection methods, watermarking, and responsible use guidelines to mitigate misuse.
In the next page, we'll explore cross-modal retrieval—the task of searching across modalities, such as finding images from text queries or retrieving videos from audio descriptions, and the embedding spaces that make this possible.