Human perception is inherently multimodal. When we watch a video, we don't just see images in sequence—we hear synchronized sounds that provide crucial context. The bark identifies a dog before we see it; the crash of waves tells us we're at a beach; a speaker's lip movements help us understand speech in noisy environments. This natural integration of audio and visual information has inspired a rich area of machine learning research: audio-visual learning.
Audio-visual learning extends multimodal AI beyond the vision-language paradigm to incorporate temporal, acoustic signals. This creates new opportunities: learning visual representations from naturally co-occurring sounds, improving speech recognition using visual cues, generating realistic audio for silent videos, and building AI systems that understand the world through multiple senses simultaneously.
This page covers the theoretical foundations and practical techniques for audio-visual learning, including self-supervised approaches that learn from video without labels, cross-modal generation, and applications in speech, music, and video understanding.
The fundamental insight underlying audio-visual learning is that sounds and visual events are naturally correlated in videos. When we see a guitar being strummed, we expect to hear music; when a dog appears, we may hear barking. This correspondence provides a powerful supervisory signal for learning without explicit labels.
Types of Audio-Visual Correspondence:
Spatial Correspondence: Objects that produce sound are visible in the scene. A barking dog appears in a specific location; a speaking person's face is visible.
Temporal Correspondence: Audio and visual events are synchronized. The sound of a bouncing ball aligns with the moment of impact; lip movements sync with speech.
Semantic Correspondence: Audio content relates meaningfully to visual content. Music genres correlate with visual styles; environmental sounds match scene types.
Exploiting Correspondence for Self-Supervised Learning:
The key insight is that audio-visual correspondence creates natural positive and negative pairs:
Models learn representations by distinguishing aligned from misaligned pairs, developing an understanding of both modalities in the process.
| Correspondence Type | Learning Signal | Example Applications |
|---|---|---|
| Spatial | Sound source localization | Visual attention, source separation |
| Temporal | Synchronization detection | Lip sync, action recognition |
| Semantic | Content matching | Video classification, retrieval |
| Cross-modal generation | Reconstruction | Audio-to-video, video-to-audio |
Like CLIP's use of image-caption pairs, audio-visual learning leverages naturally occurring data—billions of hours of video with synchronized audio. This eliminates the need for manual annotation while providing diverse, real-world training signals.
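The aligned/misaligned pairing described above can be sketched directly: given a batch where row i of the audio and visual embeddings comes from the same video, positives are matching indices and negatives come from a cyclic shift (a common trick; the function name here is illustrative).

```python
import torch


def make_avc_pairs(audio_embs: torch.Tensor, visual_embs: torch.Tensor):
    """Build aligned (positive) and misaligned (negative) audio-visual pairs.

    audio_embs, visual_embs: (B, D) embeddings where row i of each
    tensor comes from the same source video i.
    """
    B = audio_embs.shape[0]
    # Negatives: pair each audio with the *next* video's frames (cyclic shift),
    # so every negative mixes two different source videos
    neg_v = torch.roll(visual_embs, shifts=1, dims=0)
    pairs_a = torch.cat([audio_embs, audio_embs], dim=0)
    pairs_v = torch.cat([visual_embs, neg_v], dim=0)
    labels = torch.cat([torch.ones(B), torch.zeros(B)])  # 1 = aligned pair
    return pairs_a, pairs_v, labels
```

In practice the negatives come for free from the rest of the batch; no annotation is ever needed, which is what makes this signal scalable.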
Before combining audio with vision, we need effective audio representations. Modern audio encoders transform raw waveforms into semantic embeddings suitable for multimodal fusion.
Spectral Representations:
The most common approach converts raw audio waveforms into spectrograms—2D representations of frequency content over time.
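For concreteness, the spectrogram's shape follows directly from the STFT parameters. This small helper (hypothetical, mirroring torchaudio's defaults) computes the resulting (n_mels, frames) shape:

```python
def spectrogram_shape(num_samples: int, n_fft: int = 1024,
                      hop_length: int = 160, n_mels: int = 128,
                      center: bool = True) -> tuple:
    """Shape of a mel spectrogram for a mono waveform of num_samples samples.

    With center=True (torchaudio's default), the signal is padded by
    n_fft // 2 on both sides, giving 1 + num_samples // hop_length frames.
    """
    if center:
        num_frames = 1 + num_samples // hop_length
    else:
        num_frames = 1 + (num_samples - n_fft) // hop_length
    return (n_mels, num_frames)


# One second of 16 kHz audio with a 10 ms hop -> about 100 frames
print(spectrogram_shape(16000))  # (128, 101)
```

At a 10 ms hop, each second of audio becomes roughly 100 spectrogram columns, which sets the sequence length seen by any downstream encoder.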
Spectrogram to Embedding:
Once audio is represented as a spectrogram (essentially an "image" of sound), we can apply visual architectures such as CNNs or ViT-style patch transformers.
Waveform-Based Models:
Alternatively, models can process raw waveforms directly.
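A minimal waveform encoder in the spirit of wav2vec's convolutional feature extractor, as a sketch (the layer sizes and pooling here are illustrative, not a published configuration):

```python
import torch
import torch.nn as nn


class WaveformEncoder(nn.Module):
    """Minimal 1D-convolutional encoder over raw audio.

    Strided convolutions learn filterbank-like features end-to-end
    instead of using a fixed spectrogram transform; the strides
    multiply to an overall downsampling factor of 5 * 4 * 2 = 40.
    """

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=10, stride=5), nn.GELU(),
            nn.Conv1d(64, 128, kernel_size=8, stride=4), nn.GELU(),
            nn.Conv1d(128, embed_dim, kernel_size=4, stride=2), nn.GELU(),
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        x = waveform.unsqueeze(1)      # (B, T) -> (B, 1, T)
        feats = self.conv(x)           # (B, embed_dim, T')
        return feats.mean(dim=-1)      # mean-pool to (B, embed_dim)
```

The trade-off versus spectrograms: raw-waveform models learn their own frequency decomposition but typically need more data and compute to match hand-crafted mel features.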
```python
import torch
import torch.nn as nn
import torchaudio
import torchaudio.transforms as T


class AudioEncoder(nn.Module):
    """
    Audio encoder that converts waveforms to embeddings.
    Uses mel spectrogram followed by a transformer encoder.
    """

    def __init__(
        self,
        sample_rate: int = 16000,
        n_mels: int = 128,
        n_fft: int = 1024,
        hop_length: int = 160,  # 10ms at 16kHz
        embed_dim: int = 768,
        num_layers: int = 12,
        num_heads: int = 12,
    ):
        super().__init__()
        self.embed_dim = embed_dim

        # Mel spectrogram transform
        self.mel_transform = T.MelSpectrogram(
            sample_rate=sample_rate,
            n_fft=n_fft,
            hop_length=hop_length,
            n_mels=n_mels,
        )

        # Patch embedding (treat spectrogram like image)
        self.patch_size = (16, 16)  # (freq, time)
        patch_dim = self.patch_size[0] * self.patch_size[1]
        self.patch_embed = nn.Linear(patch_dim, embed_dim)

        # Transformer encoder
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim,
            nhead=num_heads,
            dim_feedforward=embed_dim * 4,
            batch_first=True,
        )
        self.transformer = nn.TransformerEncoder(
            encoder_layer, num_layers=num_layers
        )

        # [CLS] token for pooled representation
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def patchify(self, mel: torch.Tensor) -> torch.Tensor:
        """Split a (B, n_mels, time) spectrogram into non-overlapping patches."""
        pf, pt = self.patch_size
        B, F, T_len = mel.shape
        # Trim so both axes divide evenly into patches
        mel = mel[:, : (F // pf) * pf, : (T_len // pt) * pt]
        patches = mel.unfold(1, pf, pf).unfold(2, pt, pt)  # (B, F//pf, T//pt, pf, pt)
        return patches.reshape(B, -1, pf * pt)  # (B, num_patches, patch_dim)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        """
        Args:
            waveform: (B, T) raw audio waveform at 16kHz
        Returns:
            (B, embed_dim) audio embedding
        """
        # Convert to mel spectrogram: (B, n_mels, time)
        mel = self.mel_transform(waveform)
        mel = torch.log(mel + 1e-6)  # Log scale

        B = mel.shape[0]

        # Reshape to patches and project
        patches = self.patchify(mel)          # (B, num_patches, patch_dim)
        tokens = self.patch_embed(patches)    # (B, num_patches, embed_dim)

        # Prepend [CLS] token
        cls = self.cls_token.expand(B, -1, -1)
        tokens = torch.cat([cls, tokens], dim=1)

        # Transformer encoding
        encoded = self.transformer(tokens)

        # Return [CLS] token as audio embedding
        return encoded[:, 0]
```

Self-supervised audio-visual learning leverages the natural alignment between audio and video to learn representations without manual labels.
Several approaches have proven effective:
Audio-Visual Correspondence (AVC) Learning:
The foundational approach trains models to determine whether audio and video are from the same source:
$$\mathcal{L}_{AVC} = -\mathbb{E}\left[y\log p(\text{match}|a,v) + (1-y)\log(1-p(\text{match}|a,v))\right]$$
Where $y=1$ for aligned pairs and $y=0$ for misaligned pairs.
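A minimal sketch of this objective: a small fusion head (the architecture here is hypothetical) scores concatenated audio-visual embeddings, and binary cross-entropy implements $\mathcal{L}_{AVC}$:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AVCHead(nn.Module):
    """Fusion head for audio-visual correspondence: concatenates the two
    embeddings and predicts a single match logit per pair."""

    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, 1),
        )

    def forward(self, a: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # a, v: (N, dim) -> (N,) raw match logits
        return self.mlp(torch.cat([a, v], dim=-1)).squeeze(-1)


def avc_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """L_AVC: binary cross-entropy between the predicted match
    probability sigmoid(logit) and the label y in {0, 1}."""
    return F.binary_cross_entropy_with_logits(logits, labels)
```

Here the sigmoid of the logit plays the role of $p(\text{match}|a,v)$ in the equation above.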
Contrastive Audio-Visual Learning:
Mirroring CLIP's approach, we can use contrastive learning with audio-video pairs: each clip's audio and visual frames form a positive pair, while the other clips in the batch serve as negatives.
Landmark Methods:
| Method | Year | Key Innovation |
|---|---|---|
| SoundNet | 2016 | Transfer visual knowledge to audio via distillation |
| L³-Net | 2017 | Audio-visual correspondence as pretext task |
| AVID | 2020 | Audio-visual instance discrimination |
| XDC | 2020 | Cross-modal deep clustering |
| AudioCLIP | 2021 | Extend CLIP to audio modality |
| ImageBind | 2023 | Bind audio, vision, text, depth, thermal, IMU |
ImageBind (Meta, 2023) demonstrates that, by using images as a 'binding' modality, audio, text, depth, thermal, and IMU data can be aligned into a single embedding space. By training each modality against images (which have abundant paired data), all modalities become comparable without requiring all-pairs data.
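The binding idea can be sketched as two independent contrastive losses, both anchored on images (a simplified sketch, assuming row i of each modality batch is paired with image i; audio and text are never paired directly):

```python
import torch
import torch.nn.functional as F


def binding_loss(image_embs: torch.Tensor, audio_embs: torch.Tensor,
                 text_embs: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """ImageBind-style objective: align each non-image modality to images
    with its own symmetric InfoNCE loss. Because both losses share the
    image embedding space, audio and text become comparable transitively."""

    def infonce(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        x = F.normalize(x, dim=-1)
        y = F.normalize(y, dim=-1)
        logits = x @ y.T / temperature            # (B, B) similarity matrix
        labels = torch.arange(x.shape[0], device=x.device)
        return (F.cross_entropy(logits, labels) +
                F.cross_entropy(logits.T, labels)) / 2

    return infonce(audio_embs, image_embs) + infonce(text_embs, image_embs)
```

The key property is that no audio-text pairs are needed at training time, yet audio-to-text retrieval emerges from the shared image anchor.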
```python
import torch
import torch.nn.functional as F


def audio_visual_contrastive_loss(
    audio_embeddings: torch.Tensor,
    visual_embeddings: torch.Tensor,
    temperature: float = 0.07,
) -> torch.Tensor:
    """
    Contrastive loss for audio-visual learning.

    Aligns audio and visual embeddings from the same video
    while pushing apart embeddings from different videos.

    Args:
        audio_embeddings: (B, D) normalized audio embeddings
        visual_embeddings: (B, D) normalized visual embeddings
        temperature: Softmax temperature
    Returns:
        Scalar loss value
    """
    batch_size = audio_embeddings.shape[0]

    # Compute similarity matrix
    # audio_embeddings @ visual_embeddings.T gives (B, B) matrix
    # Diagonal entries are positive pairs
    logits = (audio_embeddings @ visual_embeddings.T) / temperature

    # Labels: each audio matches its corresponding video
    labels = torch.arange(batch_size, device=logits.device)

    # Symmetric loss: audio-to-visual and visual-to-audio
    loss_a2v = F.cross_entropy(logits, labels)
    loss_v2a = F.cross_entropy(logits.T, labels)

    return (loss_a2v + loss_v2a) / 2


class AudioVisualModel(torch.nn.Module):
    """Simple audio-visual model with separate encoders."""

    def __init__(self, audio_encoder, visual_encoder, embed_dim=512):
        super().__init__()
        self.audio_encoder = audio_encoder
        self.visual_encoder = visual_encoder

        # Projection heads to shared space
        self.audio_proj = torch.nn.Linear(audio_encoder.embed_dim, embed_dim)
        self.visual_proj = torch.nn.Linear(visual_encoder.embed_dim, embed_dim)

    def forward(self, audio, frames):
        # Encode modalities
        audio_feat = self.audio_encoder(audio)
        visual_feat = self.visual_encoder(frames)

        # Project and normalize
        audio_embed = F.normalize(self.audio_proj(audio_feat), dim=-1)
        visual_embed = F.normalize(self.visual_proj(visual_feat), dim=-1)

        return audio_embed, visual_embed
```

A compelling application of audio-visual learning is sound source localization: identifying which regions of an image or video correspond to a given sound.
When we hear a dog bark, can the model highlight where the dog is in the frame?
The Localization Challenge:
Unlike supervised object detection, sound source localization uses audio as the query:
Input: Video frame + Audio clip
Output: Spatial attention map or bounding box indicating sound source
Attention-Based Localization:
A common approach uses cross-modal attention between audio and spatial visual features:
$$\text{Attention}(q_a, K_v) = \text{softmax}\left(\frac{q_a K_v^T}{\sqrt{d}}\right)$$
The resulting attention map highlights regions most relevant to the audio.
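The attention computation above can be sketched as follows, assuming a pooled audio query and flattened spatial features from a visual backbone (function and argument names are illustrative):

```python
import torch
import torch.nn.functional as F


def localize_sound(audio_query: torch.Tensor, visual_feats: torch.Tensor,
                   h: int, w: int) -> torch.Tensor:
    """Audio-queried spatial attention for sound source localization.

    Args:
        audio_query: (B, D) pooled audio embedding q_a
        visual_feats: (B, H*W, D) spatial visual features K_v
        h, w: spatial grid size of the visual feature map
    Returns:
        (B, H, W) attention map; each map sums to 1 over locations.
    """
    d = audio_query.shape[-1]
    # Scaled dot product between the audio query and every spatial key
    scores = torch.einsum('bd,bnd->bn', audio_query, visual_feats) / d ** 0.5
    attn = F.softmax(scores, dim=-1)
    return attn.view(-1, h, w)
```

Upsampling this low-resolution map back to the image size gives the heatmap typically shown in localization papers.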
Training Approaches:
| Method | Approach | Supervision | Performance |
|---|---|---|---|
| Attention10k | Cross-modal attention | Self-supervised | Baseline |
| DMC | Discriminative audio-visual | Self-supervised | +5% AP |
| LAVISH | Lightweight cross-attention | Self-supervised | SOTA efficiency |
| AudioCLIP | CLIP-extended to audio | Contrastive | Zero-shot capable |
Sound source localization enables: hearing aid systems that focus on speakers, video editing tools that automatically link audio to sources, security systems that correlate sounds with visual events, and robotic systems that attend to relevant audio sources.
Speech is an inherently audio-visual phenomenon. Lip movements, facial expressions, and head gestures all convey information that complements the acoustic signal. Audio-visual speech processing leverages this multimodal nature for improved recognition, synthesis, and understanding.
Lip Reading (Visual Speech Recognition):
Lip reading recognizes speech from visual input alone—without audio. While humans achieve ~40-60% word accuracy (for untrained individuals), machine learning models can now exceed human performance on benchmark datasets.
Key Components: a visual frontend that detects and crops the mouth region, a spatiotemporal encoder (3D CNN or transformer) over the frame sequence, and a CTC- or attention-based decoder that maps visual features to characters or words.
Audio-Visual Speech Recognition (AVSR):
AVSR combines audio and visual streams for robust speech recognition, especially valuable in noisy environments where audio alone degrades.
$$P(\text{words}|\text{audio}, \text{video}) > P(\text{words}|\text{audio})$$
The visual modality provides complementary information: lip shapes (visemes) disambiguate sounds that are acoustically similar, and visual cues remain reliable when noise corrupts the audio stream.
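One simple way to exploit this complementarity is late fusion of the two streams' per-frame token distributions. This is a sketch; production AVSR systems usually fuse features earlier and learn the weighting, and the fixed `snr_weight` here is a stand-in for an estimated acoustic-reliability score:

```python
import torch
import torch.nn.functional as F


def fuse_avsr_logits(audio_logits: torch.Tensor, visual_logits: torch.Tensor,
                     snr_weight: float = 0.7) -> torch.Tensor:
    """Late fusion for audio-visual speech recognition.

    Combines per-frame vocabulary logits from the audio and visual
    streams as a weighted sum of log-probabilities (an unnormalized
    geometric mean of the two posteriors). In heavy noise, a lower
    snr_weight shifts reliance toward the visual stream.

    Args:
        audio_logits, visual_logits: (B, T, V) per-frame token logits
    Returns:
        (B, T, V) fused scores for decoding
    """
    log_pa = F.log_softmax(audio_logits, dim=-1)
    log_pv = F.log_softmax(visual_logits, dim=-1)
    return snr_weight * log_pa + (1 - snr_weight) * log_pv
```

The fused scores can be fed to the same CTC or beam-search decoder that the audio-only model would use.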
Lip Sync Detection:
Detecting whether audio and video are synchronized has applications in deepfake detection, automatic dubbing and re-synchronization, active speaker detection, and video editing quality control.
AV-HuBERT (Meta, 2022) extends HuBERT to audio-visual input, learning representations from unlabeled video by predicting clusters of audio-visual features. It achieves state-of-the-art lip reading by first pre-training on 1,759 hours of unlabeled data, then fine-tuning on labeled datasets.
Cross-modal generation synthesizes one modality conditioned on another. This includes generating audio from video (e.g., foley sound), generating video from audio (e.g., music visualizers), and synthesizing talking heads from speech.
Video-to-Audio Generation:
Given silent video, generate appropriate synchronized audio, such as footsteps, impacts, and ambient sound (the foley problem).
Audio-to-Video Generation:
Given audio, generate synchronized visual content, such as music-driven animation or speech-driven facial motion.
Talking Head Generation:
A prominent application is generating realistic talking head videos from speech audio:
| Direction | Method | Approach | Application |
|---|---|---|---|
| Video→Audio | SpecVQGAN | VQ-GAN + spectrogram | Foley generation |
| Video→Audio | Im2Wav | Diffusion-based | Environmental sounds |
| Audio→Video | Wav2Lip | GAN-based lip sync | Video dubbing |
| Audio→Video | SadTalker | 3DMM + diffusion | Talking head generation |
| Audio→Video | MakeItTalk | Content + speaker disentangle | Animation |
Audio-visual generation, especially talking head synthesis, raises significant ethical concerns around deepfakes and misinformation. Research and deployment must incorporate detection methods, watermarking, and responsible use guidelines to mitigate misuse.
In the next page, we'll explore cross-modal retrieval—the task of searching across modalities, such as finding images from text queries or retrieving videos from audio descriptions, and the embedding spaces that make this possible.