Foundation models are large-scale models trained on broad data that can be adapted to a wide range of downstream tasks. They represent the convergence of self-supervised learning techniques with massive compute and data.
The term, coined by Stanford HAI in 2021, captures a fundamental shift: instead of training task-specific models from scratch, we train one versatile model that serves as the foundation for many applications.
GPT-3 (175B parameters, 2020) demonstrated that scale enables emergent capabilities like few-shot learning. Vision foundation models like DINOv2, SAM, and multimodal models like GPT-4V extend this paradigm to visual understanding, creating general-purpose visual AI systems.
Scaling laws describe how model performance improves as we increase compute, data, and parameters. They guide resource allocation for training foundation models.
Hoffmann et al. (2022) modeled training loss as $$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$ where $E$ is the irreducible loss and the fitted constants are $A \approx 406.4$, $B \approx 410.7$, $\alpha \approx 0.34$, $\beta \approx 0.28$. Minimizing this loss under a fixed compute budget $C \approx 6ND$ gives the Chinchilla result: parameters $N$ and tokens $D$ should be scaled in roughly equal proportion ($N, D \propto \sqrt{C}$), or about 20 training tokens per parameter.
For vision models, Zhai et al. (2022) found: $$\text{Error} \propto \left(\frac{1}{\text{Samples}}\right)^{0.5} + \left(\frac{1}{\text{Params}}\right)^{0.5}$$
Both data and model size contribute equally to performance.
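As a quick sanity check of the ~20 tokens-per-parameter rule (my own arithmetic, using the standard $C \approx 6ND$ approximation), plugging in the published Chinchilla compute budget recovers the configuration that was actually trained:

$$C \approx 6ND = 120N^2 \;\Rightarrow\; N = \sqrt{\frac{C}{120}}, \qquad D = 20N$$

For Chinchilla's budget of roughly $5.8 \times 10^{23}$ FLOPs this gives $N \approx 70$B parameters and $D \approx 1.4$T tokens, matching the published model.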
```python
import numpy as np


def chinchilla_optimal(compute_budget_flops):
    """Compute-optimal model size and training tokens for a given budget.

    Chinchilla scaling law (Hoffmann et al., 2022): for compute C (in FLOPs),
    loss is modeled as L(N, D) = E + A / N**alpha + B / D**beta, and the
    compute-optimal allocation scales N and D roughly equally.
    Rule of thumb: train for ~20 tokens per parameter.
    """
    # Empirical loss-fit constants from Hoffmann et al. (2022)
    a = 406.4     # Loss coefficient for N
    b = 410.7     # Loss coefficient for D
    alpha = 0.34  # Scaling exponent for N
    beta = 0.28   # Scaling exponent for D

    # Optimal allocation from C ~ 6*N*D with D ~ 20*N
    optimal_n = (compute_budget_flops / (6 * 20)) ** 0.5
    optimal_d = 20 * optimal_n

    # Predicted reducible loss at this allocation (irreducible term E omitted)
    predicted_loss = a / optimal_n ** alpha + b / optimal_d ** beta

    return {
        'optimal_params': optimal_n,
        'optimal_tokens': optimal_d,
        'tokens_per_param': optimal_d / optimal_n,
        'predicted_reducible_loss': predicted_loss,
    }


def estimate_training_cost(params_b, tokens_t, gpu_tflops=312):
    """Estimate training cost for a model.

    Rule of thumb: ~6 FLOPs per parameter per token for forward + backward.
    Assumes 100% GPU utilization; real runs typically reach 30-50% of peak.
    """
    flops = 6 * params_b * 1e9 * tokens_t * 1e12  # params in billions, tokens in trillions
    gpu_hours = flops / (gpu_tflops * 1e12 * 3600)
    return {
        'total_flops': flops,
        'gpu_hours': gpu_hours,
        'node_days_8_gpu': gpu_hours / (24 * 8),  # days on one 8-GPU node
    }
```

| Model | Parameters | Training Data | Compute (PF-days) |
|---|---|---|---|
| GPT-3 | 175B | 300B tokens | 3,640 |
| CLIP ViT-L | 428M | 400M images | ~300 |
| DINOv2 ViT-g | 1.1B | 142M images | ~500 |
| LLaMA 2 70B | 70B | 2T tokens | ~6,000 |
| GPT-4 (est.) | ~1.8T MoE | ~10T tokens | ~100,000 |
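As a usage sketch (my own numbers, not from the table above), feeding the GPT-3 row into the helpers defined earlier roughly reproduces the compute column and shows what a Chinchilla-optimal allocation of the same budget would look like:

```python
# Cross-check the GPT-3 row: 175B parameters, 300B tokens
cost = estimate_training_cost(params_b=175, tokens_t=0.3)  # tokens_t in trillions
print(f"{cost['total_flops']:.2e} FLOPs")    # ~3.15e23 FLOPs, i.e. ~3,640 PF-days
print(f"{cost['gpu_hours']:.0f} GPU-hours")  # ~2.8e5 A100-hours at 312 TFLOPS (100% util.)

# Chinchilla-optimal allocation for the same compute budget
alloc = chinchilla_optimal(cost['total_flops'])
print(f"{alloc['optimal_params'] / 1e9:.0f}B params, "
      f"{alloc['optimal_tokens'] / 1e12:.1f}T tokens")  # ~51B params, ~1.0T tokens
```

This illustrates the well-known observation that GPT-3 was under-trained by Chinchilla standards: the same compute would have been better spent on a smaller model trained on roughly a trillion tokens.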
Vision foundation models apply self-supervised learning at scale to create general-purpose visual representations.
DINOv2 by Oquab et al. (2023) scales DINO to produce features that transfer across diverse vision tasks without fine-tuning, combining DINO-style image-level self-distillation with iBOT-style masked patch prediction (see the architecture sketch below).
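To see how these frozen features are consumed in practice, here is a hedged usage sketch. It assumes the pretrained checkpoints published on torch.hub by the `facebookresearch/dinov2` repository and a normalized 224×224 RGB input; the input tensor below is a placeholder.

```python
import torch

# Load a pretrained DINOv2 backbone from torch.hub (downloads weights on first use)
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
model.eval()

# Extract a global image embedding from a (batch, 3, 224, 224) tensor
image = torch.randn(1, 3, 224, 224)  # placeholder for a preprocessed RGB image
with torch.no_grad():
    features = model(image)          # CLS-token features, e.g. 1024-dim for ViT-L/14

# Frozen features like these are typically fed to a linear probe or k-NN classifier
print(features.shape)
```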
SAM by Kirillov et al. (2023) creates a promptable segmentation foundation model: a heavyweight image encoder, a prompt encoder for points, boxes, and masks, and a lightweight mask decoder that can produce masks for many prompts from a single image embedding (see the architecture sketch below).
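As a usage sketch of this promptable interface (hedged; the checkpoint path and input image below are placeholders), the official `segment_anything` package is typically used roughly as follows:

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a pretrained SAM checkpoint (filename is a placeholder for a downloaded checkpoint)
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# Encode the image once, then prompt it repeatedly at low cost
image = np.zeros((480, 640, 3), dtype=np.uint8)  # placeholder HxWx3 RGB array
predictor.set_image(image)

masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),  # one foreground click
    point_labels=np.array([1]),           # 1 = foreground, 0 = background
    multimask_output=True,                # return several candidate masks
)
```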
```python
import copy

import torch
import torch.nn as nn


class DINOv2(nn.Module):
    """DINOv2: scaled self-supervised vision model (architecture sketch).

    Combines DINO self-distillation with iBOT masked patch modeling.
    Assumes `DINOHead` and the DINO/iBOT loss modules are defined elsewhere.
    """

    def __init__(self, backbone, head_dim=65536, patch_size=14):
        super().__init__()
        self.student = backbone
        self.teacher = copy.deepcopy(backbone)  # updated as an EMA of the student
        self.patch_size = patch_size

        # DINO head for image-level distillation
        self.dino_head = DINOHead(backbone.embed_dim, head_dim)
        # iBOT head for patch-level masked prediction
        self.ibot_head = nn.Linear(backbone.embed_dim, 8192)

    def forward(self, global_crops, local_crops, masks):
        # DINO: distill teacher outputs on global crops into student outputs on all crops
        teacher_global = self.teacher(global_crops)
        student_all = [self.student(c) for c in global_crops + local_crops]
        dino_loss = self.dino_loss(teacher_global, student_all)

        # iBOT: predict teacher patch tokens at masked student positions
        student_masked = self.student(global_crops, mask=masks)
        teacher_patches = self.teacher.get_patch_tokens(global_crops)
        ibot_loss = self.ibot_loss(student_masked, teacher_patches, masks)

        return dino_loss + ibot_loss


class SAM(nn.Module):
    """Segment Anything Model architecture (sketch)."""

    def __init__(self, image_encoder, prompt_encoder, mask_decoder):
        super().__init__()
        self.image_encoder = image_encoder    # heavyweight ViT-H
        self.prompt_encoder = prompt_encoder  # points, boxes, masks
        self.mask_decoder = mask_decoder      # lightweight transformer decoder

    def forward(self, image, prompts):
        # Encode the image once (reusable across many prompts)
        image_embedding = self.image_encoder(image)

        # Encode prompts (points, boxes, or masks)
        sparse_embeds, dense_embeds = self.prompt_encoder(prompts)

        # Decode masks plus a predicted IoU/quality score per mask
        masks, iou_pred = self.mask_decoder(
            image_embedding, sparse_embeds, dense_embeds
        )
        return masks, iou_pred
```

Training foundation models requires sophisticated distributed systems and efficient implementations.
```python
import torch
import torch.distributed as dist
from torch.cuda.amp import GradScaler, autocast
from torch.nn.parallel import DistributedDataParallel as DDP


def setup_foundation_training(model, world_size, local_rank):
    """Set up distributed data-parallel training for foundation models.

    Typically launched with: torchrun --nproc_per_node=<num_gpus> train.py
    """
    # Initialize the process group (single-node setup, so rank == local_rank)
    dist.init_process_group("nccl", rank=local_rank, world_size=world_size)
    torch.cuda.set_device(local_rank)

    # Wrap the model in DDP
    model = model.cuda(local_rank)
    model = DDP(model, device_ids=[local_rank], find_unused_parameters=False)

    # Gradient scaler for mixed-precision training
    scaler = GradScaler()
    return model, scaler


def train_step_foundation(model, batch, optimizer, scaler):
    """Single training step with mixed precision."""
    optimizer.zero_grad()

    # Forward pass under autocast
    with autocast():
        loss = model(batch)

    # Scaled backward pass
    scaler.scale(loss).backward()

    # Unscale gradients before clipping
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```

Foundation model quality depends critically on training data. Careful curation improves efficiency and capability.
```python
import torch.nn.functional as F
from imagededup.methods import PHash


def deduplicate_images(image_dir, threshold=0.9):
    """Remove near-duplicate images in a directory using perceptual hashing."""
    hasher = PHash()
    # Map of filename -> 64-bit perceptual hash
    encodings = hasher.encode_images(image_dir=image_dir)
    # Files to drop so that exactly one image per duplicate group is kept
    to_remove = set(hasher.find_duplicates_to_remove(
        encoding_map=encodings,
        max_distance_threshold=int((1 - threshold) * 64),
    ))
    return [f for f in encodings if f not in to_remove]


def compute_clip_score(model, images, texts):
    """Score image-text pairs by CLIP cosine similarity.

    Assumes images and texts are already preprocessed/tokenized for the model.
    """
    image_embeds = F.normalize(model.encode_image(images), dim=-1)
    text_embeds = F.normalize(model.encode_text(texts), dim=-1)
    scores = (image_embeds * text_embeds).sum(dim=-1)
    return scores


def quality_filter_pipeline(dataset, clip_model, min_score=0.25):
    """Multi-stage quality filtering for vision-language datasets."""
    filtered = []
    for image, text in dataset:
        # Stage 1: image quality (resolution, aspect ratio)
        if image.size[0] < 200 or image.size[1] < 200:
            continue
        if max(image.size) / min(image.size) > 3:
            continue

        # Stage 2: text quality (length; language ID could be added here)
        if len(text.split()) < 3:
            continue

        # Stage 3: CLIP score filtering
        score = compute_clip_score(clip_model, image, text)
        if score > min_score:
            filtered.append((image, text))
    return filtered
```

Foundation models are designed for adaptation. Key strategies include the following; a rough parameter-count comparison appears after the table.
| Method | Trainable Params | Compute | Best For |
|---|---|---|---|
| Full Fine-tuning | 100% | High | Plenty of data, domain shift |
| Linear Probing | <1% | Low | Quick evaluation, limited data |
| LoRA | ~1% | Low | Efficient fine-tuning |
| Prompt Tuning | <0.1% | Very Low | Few-shot, no weight changes |
| Zero-shot | 0% | None | Direct application |
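To make the "Trainable Params" column concrete, here is a small back-of-the-envelope sketch (my own example, not from the original text) for a single 1024×1024 linear layer adapted with rank-8 LoRA:

```python
d_in, d_out, rank = 1024, 1024, 8

full_params = d_in * d_out                # 1,048,576 weights updated by full fine-tuning
lora_params = rank * d_in + rank * d_out  # 16,384 weights in the low-rank A and B matrices

print(f"LoRA trains {lora_params / full_params:.1%} of the layer's weights")  # ~1.6%
```

Across a full transformer the fraction is similar, which is why LoRA lands in the ~1% row of the table above.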
```python
import torch
import torch.nn as nn


class LoRALayer(nn.Module):
    """Low-Rank Adaptation (LoRA) wrapper around a frozen linear layer."""

    def __init__(self, original_layer, rank=8, alpha=16):
        super().__init__()
        self.original = original_layer
        self.rank = rank
        self.alpha = alpha

        # Freeze the original weights
        for p in self.original.parameters():
            p.requires_grad = False

        # Low-rank decomposition: W' = W + (alpha / rank) * B A
        in_features = original_layer.in_features
        out_features = original_layer.out_features
        self.lora_A = nn.Linear(in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, out_features, bias=False)

        # Initialize A with a Gaussian and B with zeros, so the adapter
        # contributes nothing at the start of fine-tuning
        nn.init.normal_(self.lora_A.weight)
        nn.init.zeros_(self.lora_B.weight)

    def forward(self, x):
        original_out = self.original(x)
        lora_out = self.lora_B(self.lora_A(x)) * (self.alpha / self.rank)
        return original_out + lora_out
```

You've completed Module 6: Modern Self-Supervised Methods! You now understand the cutting-edge techniques from BYOL through foundation models. These methods power today's most capable AI systems, from image classifiers to multimodal assistants. The self-supervised paradigm of learning from unlabeled data at scale has become the dominant approach in modern deep learning.