The remarkable capabilities of modern AI systems come at equally remarkable computational costs. Training GPT-4 reportedly required compute measured in tens of millions of dollars; inference for large language models demands specialized hardware and consumes substantial energy. As AI becomes ubiquitous—running on phones, embedded devices, and serving billions of users—efficiency becomes not just an optimization but a requirement.
Efficient ML is the research agenda dedicated to achieving the same (or better) capabilities with less compute, memory, and energy. This encompasses techniques applied across the entire ML pipeline: efficient architectures that compute less, training methods that converge faster, inference optimizations that speed up serving, and specialized hardware that performs more operations per watt.
The stakes are significant. Efficiency determines:
By the end of this page, you will understand the key techniques for efficient ML including quantization, pruning, and knowledge distillation; efficient neural network architectures designed for speed and low resource usage; training efficiency methods that reduce compute requirements; hardware-software co-design for ML acceleration; and the tradeoffs involved in making ML systems more efficient.
Understanding why efficiency matters requires grasping the scale of computation in modern ML and the constraints that limit deployment.
The Scale of Compute
Modern AI training and inference consume extraordinary resources:
At these scales, even small efficiency improvements translate to millions of dollars saved and significant environmental impact reduced.
Deployment Constraints
Real-world deployment imposes hard constraints:
Edge Devices:
Cloud Economics:
Accessibility:
| Stage | Efficiency Challenge | Key Techniques |
|---|---|---|
| Architecture Design | Reduce compute per forward pass | Efficient attention, MoE, pruning-aware design |
| Training | Reduce total training compute | Mixed precision, gradient checkpointing, curriculum learning |
| Compression | Reduce model size | Quantization, pruning, distillation |
| Inference | Reduce latency, increase throughput | Batching, caching, speculative decoding |
| Hardware | Maximize ops/watt | Custom accelerators, sparsity support |
Efficiency research runs counter to the 'bigger is better' trend that has dominated recent ML. While scaling laws show that larger models trained on more data generally perform better, efficiency research asks: can we achieve the same capability with less? Often the answer is yes—through clever architecture design, training methods, and compression.
Quantization reduces model size and accelerates inference by using lower-precision numerical representations. Instead of 32-bit floating-point (FP32) weights and activations, quantized models use 16-bit (FP16/BF16), 8-bit (INT8), or even 4-bit (INT4) or lower representations.
Why Quantization Works
Neural networks are surprisingly robust to precision reduction:
Quantization Approaches
Post-Training Quantization (PTQ): Quantize a trained model without additional training:
PTQ is simple but can degrade performance, especially at very low bit-widths.
Quantization-Aware Training (QAT): Simulate quantization during training, allowing the model to adapt:
QAT typically preserves accuracy better than PTQ but requires retraining.
```python
# Conceptual quantization implementation
import torch
import torch.nn as nn
import torch.nn.functional as F


def quantize_tensor(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """
    Symmetric quantization of tensor to num_bits.

    Maps floating-point values to integers in [-2^(b-1), 2^(b-1)-1]
    then back to floating point (simulating quantization effects).
    """
    qmin = -(2 ** (num_bits - 1))
    qmax = 2 ** (num_bits - 1) - 1

    # Compute scale factor (per-tensor symmetric quantization)
    max_val = x.abs().max()
    scale = max_val / qmax if max_val > 0 else 1.0

    # Quantize: round to integer, clip to valid range
    x_int = torch.round(x / scale).clamp(qmin, qmax)

    # Dequantize: convert back to float
    x_quant = x_int * scale
    return x_quant


class QuantizedLinear(nn.Module):
    """
    Linear layer with weight quantization.

    In practice, weights are stored as integers and dequantized
    during computation. Here we simulate this for illustration.
    """
    def __init__(self, in_features: int, out_features: int,
                 weight_bits: int = 8, activation_bits: int = 8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.weight_bits = weight_bits
        self.activation_bits = activation_bits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Quantize weights (simulating stored low-precision weights)
        weight_q = quantize_tensor(self.weight, self.weight_bits)

        # Quantize input activations
        x_q = quantize_tensor(x, self.activation_bits)

        # Matrix multiply (in practice, uses integer arithmetic)
        output = F.linear(x_q, weight_q, self.bias)
        return output


class StraightThroughEstimator(torch.autograd.Function):
    """
    Straight-through estimator for quantization-aware training.

    Forward: Apply quantization
    Backward: Pass gradients through unchanged (as if no quantization)
    """
    @staticmethod
    def forward(ctx, x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
        return quantize_tensor(x, num_bits)

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor) -> torch.Tensor:
        # Gradient passes through unchanged
        return grad_output, None
```

Modern Quantization for LLMs
Recent work has pushed quantization to extreme low bit-widths for large language models:
GPTQ: Data-free quantization using approximate second-order information:
AWQ (Activation-aware Weight Quantization):
GGML/llama.cpp Quantization:
Mixed-Precision Quantization:
Quantization Speedups:
Advances in quantization have enabled running large language models locally: 4-bit quantization shrinks a model like LLaMA-65B from 130+ GB at FP16 to roughly 33 GB, and lets 30B-class models fit on a single 24GB consumer GPU. This has democratized access to large language models and enabled local, private inference on consumer hardware.
Pruning removes unnecessary weights or structures from neural networks, creating sparse models that require less compute and memory. The key insight is that many parameters in neural networks are redundant or unimportant.
Types of Pruning
Unstructured (Weight) Pruning: Remove individual weights, creating irregular sparsity patterns:
Structured Pruning: Remove entire structures (neurons, channels, layers, attention heads):
Block Sparsity: Remove regular blocks of weights (e.g., 4×4 blocks):
When to Prune
Post-Training Pruning:
During-Training Pruning:
Importance Metrics
How do we decide which weights to prune?
Magnitude-Based:
Smaller weights contribute less to outputs:
importance(w) = |w|
Simple and often effective, but assumes magnitude reflects importance.
Gradient-Based:
Weights with small gradients matter less for the objective:
importance(w) = |w × ∂L/∂w|
Captures interaction between weight value and its sensitivity.
Hessian-Based:
Second-order information captures the loss increase from removing a weight:
importance(w) ≈ ½ w² H_{ww}
More accurate but computationally expensive.
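As a minimal sketch, magnitude-based pruning amounts to zeroing the smallest-magnitude fraction of a weight tensor (the tensor size and sparsity level here are illustrative):

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Unstructured magnitude pruning: zero the smallest-magnitude weights."""
    k = int(sparsity * weight.numel())
    if k == 0:
        return weight.clone()
    # Threshold = k-th smallest absolute value across the whole tensor
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = weight.abs() > threshold
    return weight * mask

w = torch.randn(256, 256)
w_pruned = magnitude_prune(w, sparsity=0.9)
achieved = (w_pruned == 0).float().mean().item()  # fraction of zeroed weights
```

In practice the mask would be stored and reapplied after each training step during fine-tuning, so pruned weights stay zero.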
Lottery Ticket Hypothesis
Frankle and Carbin's Lottery Ticket Hypothesis (2019) revealed a surprising property:
This suggests pruning doesn't just compress; it finds the essential structure hidden in overparameterized networks.
Sparse Training
Rather than pruning after training, train sparse models from the start:
SET (Sparse Evolutionary Training):
RigL:
A major challenge is converting theoretical sparsity to practical speedups. Unstructured 90% sparsity should give 10× speedup, but irregular memory access patterns often prevent this on GPUs designed for dense computation. Hardware-software co-design (structured sparsity, specialized kernels) is key to realizing sparsity benefits.
Knowledge distillation trains small, efficient 'student' models to mimic the behavior of large, accurate 'teacher' models. Rather than training on hard labels alone, students learn from the teacher's softer output distributions, which contain richer information about class relationships.
The Core Idea
A trained teacher model produces soft probability distributions over classes:
These soft labels contain more information than hard labels:
Distillation Loss
The standard distillation objective combines:
Hard label loss: Cross-entropy with true labels
L_hard = CE(student_logits, true_labels)
Soft label loss: KL divergence from teacher's softened distribution
L_soft = KL(softmax(student_logits/T), softmax(teacher_logits/T))
Combined: L = α L_hard + (1-α) L_soft
The temperature T (typically 3-10) softens distributions further, revealing more structure in the teacher's outputs.
```python
# Knowledge Distillation Implementation
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Tuple


class DistillationLoss(nn.Module):
    """
    Knowledge distillation loss combining hard labels and soft targets.

    L = alpha * CE(student, labels) + (1-alpha) * KL(student_soft, teacher_soft)
    """
    def __init__(self, temperature: float = 4.0, alpha: float = 0.5):
        super().__init__()
        self.temperature = temperature
        self.alpha = alpha
        self.ce_loss = nn.CrossEntropyLoss()
        self.kl_loss = nn.KLDivLoss(reduction='batchmean')

    def forward(
        self,
        student_logits: torch.Tensor,
        teacher_logits: torch.Tensor,
        labels: torch.Tensor,
    ) -> Tuple[torch.Tensor, dict]:
        """
        Compute distillation loss.

        Args:
            student_logits: Raw logits from student model
            teacher_logits: Raw logits from teacher model (detached)
            labels: Ground truth labels

        Returns:
            Total loss and dict of component losses
        """
        # Hard label loss: student vs ground truth
        hard_loss = self.ce_loss(student_logits, labels)

        # Soft label loss: student vs teacher (both softened by temperature)
        student_soft = F.log_softmax(student_logits / self.temperature, dim=-1)
        teacher_soft = F.softmax(teacher_logits / self.temperature, dim=-1)
        soft_loss = self.kl_loss(student_soft, teacher_soft)

        # Scale by T^2 as gradients are 1/T^2 smaller with temperature
        soft_loss = soft_loss * (self.temperature ** 2)

        # Combined loss
        total_loss = self.alpha * hard_loss + (1 - self.alpha) * soft_loss

        return total_loss, {
            'hard_loss': hard_loss.item(),
            'soft_loss': soft_loss.item(),
            'total_loss': total_loss.item(),
        }


def distill_model(
    teacher: nn.Module,
    student: nn.Module,
    train_loader: torch.utils.data.DataLoader,
    epochs: int = 10,
    temperature: float = 4.0,
    alpha: float = 0.5,
    learning_rate: float = 1e-3,
):
    """Train student model via distillation from teacher."""
    teacher.eval()  # Teacher in eval mode (no updates)
    student.train()

    criterion = DistillationLoss(temperature=temperature, alpha=alpha)
    optimizer = torch.optim.Adam(student.parameters(), lr=learning_rate)

    for epoch in range(epochs):
        total_loss = 0.0
        for inputs, labels in train_loader:
            # Get teacher predictions (no gradients needed)
            with torch.no_grad():
                teacher_logits = teacher(inputs)

            # Get student predictions
            student_logits = student(inputs)

            # Compute distillation loss
            loss, loss_dict = criterion(student_logits, teacher_logits, labels)

            # Update student
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        print(f"Epoch {epoch+1}: Loss = {total_loss / len(train_loader):.4f}")

    return student
```

Advanced Distillation Techniques
Feature/Intermediate Distillation: Match not just final outputs but intermediate representations:
Self-Distillation: A model distills knowledge from itself:
Data-Free Distillation: Distill without access to original training data:
Distillation in Modern LLMs
Distillation is key to deploying efficient language models:
Why does distillation work better than training small models directly? Teacher outputs provide richer supervision than hard labels. The teacher's uncertainty (e.g., predicting 0.7 vs 0.99) conveys difficulty. Teachers trained on more data/compute transfer that knowledge to students. Essentially, distillation lets students benefit from the teacher's expensive training.
Beyond compressing existing models, we can design architectures that are inherently more efficient. These architectures achieve accuracy comparable to larger models with fundamentally fewer operations.
Efficient Convolutional Networks
Depthwise Separable Convolutions: Standard 3×3 convolution with C_in inputs and C_out outputs: O(k² × C_in × C_out × H × W) ops.
Depthwise separable splits this into:
Result: O(k² × C_in × H × W + C_in × C_out × H × W)—roughly 8-9× fewer operations.
Used in: MobileNet, EfficientNet, ShuffleNet
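The operation-count arithmetic above can be checked directly. This sketch compares parameter counts for a standard convolution versus a depthwise-separable pair; the channel sizes are arbitrary illustrative choices:

```python
import torch
import torch.nn as nn

c_in, c_out, k = 64, 128, 3

# Standard convolution: k * k * C_in * C_out weights
standard = nn.Conv2d(c_in, c_out, k, padding=1, bias=False)

# Depthwise separable: per-channel k*k filters, then 1x1 pointwise mixing
depthwise = nn.Conv2d(c_in, c_in, k, padding=1, groups=c_in, bias=False)
pointwise = nn.Conv2d(c_in, c_out, 1, bias=False)

def n_params(*mods):
    return sum(p.numel() for m in mods for p in m.parameters())

# Ratio of standard to separable parameter counts (~8.4x here)
ratio = n_params(standard) / n_params(depthwise, pointwise)

# Both paths map (1, 64, H, W) -> (1, 128, H, W)
y = pointwise(depthwise(torch.randn(1, c_in, 16, 16)))
```

Because per-pixel FLOPs scale with the same weight counts, the parameter ratio mirrors the compute ratio, landing in the 8-9× range quoted above for k=3.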
Neural Architecture Search (NAS): Automatically search for efficient architectures:
NAS has discovered architectures like EfficientNet that achieve state-of-the-art efficiency.
Efficient Attention
Self-attention has O(n²) complexity in sequence length—prohibitive for long sequences.
Linear Attention:
Sparse Attention:
Mixture of Experts (MoE)
MoE architectures activate only a subset of parameters for each input:
Benefits:
Challenges:
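A minimal top-k routing sketch illustrates the core MoE mechanism; the expert count, gating network, and looped dispatch below are illustrative simplifications, not any production system's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Mixture of Experts layer: each token activates only k of the experts."""
    def __init__(self, dim: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(num_experts)]
        )
        self.gate = nn.Linear(dim, num_experts)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Router scores -> keep only the top-k experts per token
        scores = self.gate(x)                          # (batch, num_experts)
        topk_vals, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_vals, dim=-1)         # renormalize over chosen experts

        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e          # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

moe = TopKMoE(dim=16)
y = moe(torch.randn(4, 16))
```

With k=2 of 8 experts active, each token touches only a quarter of the expert parameters, which is the source of the capacity-versus-compute decoupling.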
State Space Models (SSMs)
Recent alternatives to attention for sequence modeling:
Mamba:
SSMs may represent a path to efficient long-context modeling without attention's quadratic cost.
KV Cache Optimization
For autoregressive generation, key-value caching avoids recomputation:
Problem: KV cache grows linearly with sequence length, consuming memory
Solutions:
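The caching mechanism itself is simple; this bare-bones sketch shows KV caching for a single attention head (shapes and the dict-free cache structure are illustrative):

```python
import torch
import torch.nn.functional as F

class KVCache:
    """Append-only key/value cache for autoregressive decoding (single head)."""
    def __init__(self):
        self.keys, self.values = None, None

    def append(self, k: torch.Tensor, v: torch.Tensor):
        # k, v: (new_tokens, head_dim) -- grow along the sequence axis
        self.keys = k if self.keys is None else torch.cat([self.keys, k], dim=0)
        self.values = v if self.values is None else torch.cat([self.values, v], dim=0)
        return self.keys, self.values

def attend(q, cache, k_new, v_new):
    """Attention for the newest query token, reusing all cached keys/values."""
    keys, values = cache.append(k_new, v_new)
    scores = (q @ keys.T) / keys.shape[-1] ** 0.5  # (1, seq_len)
    return F.softmax(scores, dim=-1) @ values      # (1, head_dim)

cache = KVCache()
d = 8
for _ in range(5):  # decode 5 tokens; the cache grows by one entry per step
    out = attend(torch.randn(1, d), cache, torch.randn(1, d), torch.randn(1, d))
```

Each decoding step computes keys and values only for the newest token, but the cache's linear growth in sequence length is exactly the memory problem the solutions above target.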
For edge deployment, prioritize depthwise separable convs (vision) or small transformers/SSMs (sequence). For long contexts, consider sparse attention or SSMs. For large models with a limited inference budget, explore MoE. For latency-sensitive applications, profile carefully—theoretical FLOP counts don't always predict real-world latency.
Inference efficiency is crucial for deployment, but training efficiency determines development costs and accessibility. Training modern models requires enormous compute—making training more efficient expands who can participate in AI development.
Mixed Precision Training
Use lower-precision formats where possible:
FP16/BF16 Training:
Loss Scaling:
BF16 Advantages:
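In PyTorch, mixed precision is typically a small change via autocast. This sketch runs on CPU with BF16 for portability; on GPU you would use device_type="cuda", and FP16 training would add a GradScaler for loss scaling:

```python
import torch
import torch.nn as nn

model = nn.Linear(32, 8)  # master weights stay FP32
x = torch.randn(4, 32)

# Ops inside the autocast region run in BF16 where it is numerically safe
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)
    loss = out.float().pow(2).mean()  # reduce loss in FP32

loss.backward()  # gradients accumulate into the FP32 master weights
```

Note that only the forward computation is downcast; parameters and optimizer state remain FP32, which is what preserves training stability.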
Gradient Checkpointing
Trade compute for memory:
Enables training larger models or larger batch sizes with limited GPU memory.
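PyTorch exposes this trade via torch.utils.checkpoint: activations inside a checkpointed segment are discarded after the forward pass and recomputed during backward. A minimal sketch (layer sizes illustrative):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# A stack of blocks whose intermediate activations we don't want to store
layers = nn.Sequential(
    *[nn.Sequential(nn.Linear(64, 64), nn.ReLU()) for _ in range(4)]
)
x = torch.randn(8, 64, requires_grad=True)

# Forward through each block without saving its activations;
# they are recomputed on-the-fly during the backward pass
h = x
for block in layers:
    h = checkpoint(block, h, use_reentrant=False)

h.sum().backward()  # gradients are correct despite the recomputation
```

The memory saved scales with the number of checkpointed segments, at the cost of roughly one extra forward pass during backward.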
Efficient Optimizers
AdaFactor:
LAMB/LARS:
8-bit Optimizers:
Data Efficiency
Make more of less data:
Curriculum Learning:
Core-set Selection:
Synthetic Data:
Distributed Training Efficiency
Pipeline Parallelism:
Tensor Parallelism:
ZeRO Optimizer:
Research continues pushing the training efficiency frontier. Recent work on 1-bit optimizers, no-optimizer training methods like weight-decay-only, and hardware-aware training schedules suggest significant room for further improvement. Halving training cost effectively doubles accessible model sizes for a fixed budget.
Algorithmic efficiency is only half the story. The other half is efficiently mapping algorithms to hardware. Hardware-software co-design can yield multiplicative improvements beyond what either alone achieves.
ML Hardware Landscape
GPUs (NVIDIA, AMD):
TPUs (Google):
Specialized Accelerators:
Edge/Mobile:
Key Hardware Metrics:
Inference Optimization Systems
Serving Frameworks:
vLLM:
TensorRT-LLM:
llama.cpp:
Key Optimization Techniques:
Kernel Fusion:
Batching Strategies:
Speculative Decoding:
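Greedy speculative decoding can be sketched with toy next-token functions standing in for the draft and target models (the "models" below are hypothetical stand-ins, and real implementations verify against sampled distributions rather than exact token matches):

```python
def speculative_decode(target_next, draft_next, prefix: list,
                       k: int = 4, steps: int = 3) -> list:
    """Greedy speculative decoding: the draft proposes k tokens; the target
    verifies them in one pass; accept the longest agreeing prefix."""
    tokens = list(prefix)
    for _ in range(steps):
        # Draft model proposes k tokens autoregressively (cheap)
        proposal, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target model checks all proposed positions (one parallel pass in practice)
        accepted, ctx = [], list(tokens)
        for t in proposal:
            if target_next(ctx) == t:
                accepted.append(t)
                ctx.append(t)
            else:
                break
        # Always gain at least one token: target's prediction at the first mismatch
        if len(accepted) < len(proposal):
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens

# Toy next-token functions: the draft agrees with the target except on token 3
target = lambda ctx: len(ctx) % 5
draft = lambda ctx: len(ctx) % 5 if len(ctx) % 5 != 3 else 0
out = speculative_decode(target, draft, prefix=[0])
```

The output is identical to decoding with the target alone; the speedup comes from the target verifying several draft tokens per expensive forward pass.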
Maximum efficiency requires full-stack optimization: efficient algorithms, appropriate precision, optimized kernels, smart batching, and suitable hardware. A 4-bit quantized model served with PagedAttention (vLLM) or TensorRT-LLM can be 10-100× more efficient than a naive FP32 implementation. The gains multiply across the stack.
Efficient ML is essential for deploying AI at scale, on edge devices, and sustainably. Let's consolidate the key insights:
Looking Forward:
Efficiency research will remain crucial as:
The most impactful AI of the future will likely combine frontier capabilities with practical efficiency—powerful enough to be useful, efficient enough to be deployable.
Congratulations! You've completed Module 6: Emerging Directions. From neurosymbolic AI through causal ML, world models, AI safety, and efficient ML, you now have a comprehensive view of the research frontiers shaping AI's future. These emerging directions represent where the field is headed—the problems that will define the next era of machine learning.