Machine learning is a computationally intensive discipline. From training neural networks with billions of parameters to running real-time inference on millions of predictions per second, computational resources fundamentally shape what's possible in ML.
Consider the scale of modern ML computation: state-of-the-art models are trained on clusters of thousands of GPUs for weeks or months at a time.
Yet most practitioners don't have access to clusters of thousands of GPUs or unlimited cloud budgets. Understanding computational resources—what they enable, what they constrain, and how to maximize their efficiency—is essential for practical ML success.
The good news: hardware efficiency has improved dramatically. Techniques like transfer learning, model distillation, and efficient architectures mean that meaningful ML work is increasingly accessible to individuals and small teams with modest resources.
By the end of this page, you will understand the hardware landscape for ML (CPUs, GPUs, TPUs, specialized accelerators), know how to assess computational requirements for different ML tasks, master strategies for efficient resource utilization, and appreciate the infrastructure considerations for production ML systems.
Modern machine learning leverages diverse hardware optimized for different computational patterns. Understanding this landscape helps you match workloads to appropriate hardware:
Why Hardware Matters for ML:
Machine learning workloads have distinct computational characteristics: they are dominated by large matrix and tensor operations, they are highly data-parallel, and they demand high memory bandwidth to keep arithmetic units fed. Different hardware excels at different aspects of these requirements.
Central Processing Units (CPUs) are general-purpose processors optimized for sequential, branching workloads with complex control flow.
Architecture Characteristics: a relatively small number of powerful cores (typically 4-64), large caches, strong single-thread performance, and sophisticated branch prediction for complex control flow.
ML Strengths: classical ML algorithms (linear models, tree ensembles), data preprocessing and feature engineering, and small-scale or low-throughput inference.
ML Limitations: far less parallelism than GPUs for the large matrix multiplications that dominate deep learning, so training neural networks of any significant size is slow.
When to Use CPUs: tabular data with classical algorithms, prototyping and debugging, data pipelines, and inference where request volume is modest or models are small.
Libraries like Intel MKL, OpenBLAS, and NumPy compiled with optimizations can significantly accelerate CPU-based ML. Enable multi-threading in scikit-learn (n_jobs=-1), use efficient libraries like LightGBM which are heavily optimized for CPU, and consider vectorization for custom code.
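As a quick illustration (assuming scikit-learn and LightGBM are installed; the synthetic dataset and hyperparameters are arbitrary), enabling CPU parallelism is often a one-line change:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
import lightgbm as lgb

X, y = make_classification(n_samples=50_000, n_features=40, random_state=0)

# scikit-learn: n_jobs=-1 spreads tree building across all CPU cores
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
rf.fit(X, y)

# LightGBM: histogram-based boosting, heavily optimized for multi-core CPUs
booster = lgb.LGBMClassifier(n_estimators=200, n_jobs=-1)
booster.fit(X, y)
```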
Before investing in hardware or cloud resources, it's valuable to estimate the computational requirements of your ML workload. This helps you budget appropriately and avoid over- or under-provisioning.
| Task Type | Training Time (Typical) | Hardware Recommendation | Memory Requirements |
|---|---|---|---|
| Classical ML (tabular) | Seconds to minutes | CPU (modern multi-core) | < 1 GB to 10s of GB |
| Random Forest/Gradient Boosting | Minutes to hours | CPU (parallelized) | Dataset size + model size |
| Small Neural Networks (< 1M params) | Minutes to hours | CPU or single GPU | 4-8 GB GPU memory |
| CNN on ImageNet-scale | Hours to days | 1-8 GPUs | 8-32 GB per GPU |
| Transformer (BERT-base scale) | Days to weeks | 4-16 GPUs | 16-32 GB per GPU |
| Large Language Models (1-10B params) | Weeks to months | 100s of GPUs | Distributed across cluster |
| Foundation Models (100B+ params) | Months | 1000s of GPUs | Multi-node with model parallelism |
Estimating FLOPS:
For neural networks, you can estimate the floating-point operations (FLOPs) required for training:

Forward pass FLOPs ≈ 2 × (number of multiply-accumulate operations)
Backward pass FLOPs ≈ 2 × forward pass FLOPs
Total training FLOPs ≈ 6 × parameters × tokens (for transformers), or equivalently ≈ 3 × forward-pass FLOPs × samples × epochs
Practical Estimation Formula:
Training time ≈ (6 × parameters × tokens) / (GPU FLOPS × utilization)
Typical GPU utilization is 30-50% for training due to memory transfers, synchronization, and inefficiencies. Account for this when planning.
```python
def estimate_training_compute(
    model_params: int,
    dataset_tokens: int,
    gpu_tflops: float = 312,   # A100 FP16/BF16 tensor-core TFLOPS
    utilization: float = 0.4,
    num_gpus: int = 1,
) -> dict:
    """
    Estimate training compute requirements for a neural network.

    Based on the scaling law: C ≈ 6 × N × D
    where C is compute (FLOPs), N is parameters, D is tokens/samples.

    Args:
        model_params: Number of model parameters
        dataset_tokens: Number of tokens/samples in training
        gpu_tflops: GPU performance in TFLOPS
        utilization: Expected GPU utilization (0.3-0.5 typical)
        num_gpus: Number of GPUs for training

    Returns:
        Dictionary with compute estimates
    """
    # Total FLOPs for training (6 × params × tokens)
    total_flops = 6 * model_params * dataset_tokens

    # Effective throughput (accounting for utilization)
    effective_tflops = gpu_tflops * utilization * num_gpus
    effective_flops_per_second = effective_tflops * 1e12

    # Training time in seconds
    training_seconds = total_flops / effective_flops_per_second
    training_hours = training_seconds / 3600
    training_days = training_hours / 24

    # Cost estimation (approximate)
    gpu_hour_cost = 2.0  # Approximate A100 cloud cost per GPU-hour
    total_cost = training_hours * num_gpus * gpu_hour_cost

    return {
        "total_flops": total_flops,
        "total_pflops": total_flops / 1e15,
        "training_hours": training_hours,
        "training_days": training_days,
        "gpu_hours": training_hours * num_gpus,
        "estimated_cost_usd": total_cost,
    }


# Example: Training a BERT-base sized model
result = estimate_training_compute(
    model_params=110_000_000,       # 110M parameters
    dataset_tokens=3_300_000_000,   # 3.3B tokens (like BERT pretraining)
    gpu_tflops=312,                 # A100
    utilization=0.4,
    num_gpus=8,
)
print(f"Estimated training time: {result['training_days']:.1f} days")
print(f"Estimated cost: ${result['estimated_cost_usd']:,.0f}")
```

These formulas provide order-of-magnitude estimates. Actual training time depends on many factors: batch size, optimizer, learning rate schedule, checkpointing, data loading bottlenecks, and hyperparameter tuning runs. Plan for 2-3× your initial estimate for a complete project.
A fundamental infrastructure decision is whether to run ML workloads on cloud resources, on-premises hardware, or a hybrid approach. Each has distinct advantages:
Cloud Provider Comparison for ML:
| Provider | GPU Offerings | ML Services | Strengths |
|---|---|---|---|
| AWS | P4d/P5 (A100/H100), Inf2, Trn1 | SageMaker, Bedrock | Broadest services, mature ecosystem |
| GCP | A2/A3 (A100/H100), TPU v4/v5 | Vertex AI, TPU access | Best for TPU, TensorFlow/JAX |
| Azure | NCsv3/NDm (A100/H100) | Azure ML, OpenAI integration | Enterprise integration, OpenAI partnership |
| Lambda Labs | A100/H100 clusters | GPU cloud focused | Lower cost for pure GPU compute |
| CoreWeave | Large GPU clusters | GPU-optimized cloud | Price-competitive for scale |
Cost Optimization Strategies:
Cloud is generally cheaper for utilization < 40-50%. Above that, on-premises becomes cost-effective if you have the expertise to manage it. For most ML teams, a hybrid approach works well: on-premises for sustained training workloads, cloud for experiments and burst capacity.
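To make the breakeven point concrete, here is a back-of-the-envelope sketch; the prices (cloud GPU at ~$2/hour, a ~$15,000 GPU server amortized over three years plus ~$2,000/year for power and administration) are illustrative assumptions, not vendor quotes:

```python
def yearly_gpu_cost(utilization: float,
                    cloud_hourly: float = 2.0,         # assumed cloud GPU $/hour
                    server_price: float = 15_000.0,    # assumed on-prem GPU server
                    amortization_years: float = 3.0,
                    overhead_per_year: float = 2_000.0):  # power, cooling, admin
    """Yearly cost of one GPU in the cloud vs. on-premises at a given utilization."""
    cloud = cloud_hourly * 8760 * utilization                        # pay only for hours used
    on_prem = server_price / amortization_years + overhead_per_year  # pay regardless of use
    return cloud, on_prem


for util in (0.2, 0.4, 0.6, 0.8):
    cloud, on_prem = yearly_gpu_cost(util)
    cheaper = "cloud" if cloud < on_prem else "on-prem"
    print(f"utilization {util:.0%}: cloud ${cloud:,.0f}/yr vs on-prem ${on_prem:,.0f}/yr -> {cheaper}")
```

With these assumptions the crossover lands around 40% utilization, in line with the guideline above; plug in your own prices to see where your breakeven sits.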
Given fixed hardware, numerous techniques can dramatically improve training efficiency, allowing you to train larger models or iterate faster:
```python
import torch
from torch.cuda.amp import GradScaler, autocast
from torch.utils.data import DataLoader
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.checkpoint import checkpoint_sequential


def efficient_training_loop(
    model,
    optimizer,
    train_loader,
    epochs: int,
    gradient_accumulation_steps: int = 4,
    use_mixed_precision: bool = True,
):
    """
    Training loop with key efficiency optimizations:
    - Mixed precision training (FP16/BF16)
    - Gradient accumulation for larger effective batch sizes
    - Proper gradient scaling for mixed precision
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)

    # Initialize gradient scaler for mixed precision
    scaler = GradScaler(enabled=use_mixed_precision)

    for epoch in range(epochs):
        model.train()
        optimizer.zero_grad()

        for step, (inputs, labels) in enumerate(train_loader):
            inputs, labels = inputs.to(device), labels.to(device)

            # Mixed precision forward pass
            with autocast(enabled=use_mixed_precision):
                outputs = model(inputs)
                loss = torch.nn.functional.cross_entropy(outputs, labels)

            # Scale loss for gradient accumulation
            loss = loss / gradient_accumulation_steps

            # Scaled backward pass
            scaler.scale(loss).backward()

            # Update weights every N steps (gradient accumulation)
            if (step + 1) % gradient_accumulation_steps == 0:
                # Unscale gradients and clip
                scaler.unscale_(optimizer)
                torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

                # Optimizer step with gradient scaling
                scaler.step(optimizer)
                scaler.update()
                optimizer.zero_grad()

        print(f"Epoch {epoch + 1} complete")


# Gradient checkpointing example
class MemoryEfficientModel(torch.nn.Module):
    """Model using gradient checkpointing for memory efficiency."""

    def __init__(self, layers):
        super().__init__()
        self.layers = torch.nn.ModuleList(layers)

    def forward(self, x):
        # Split the layers into 2 segments; activations inside each segment
        # are recomputed during the backward pass instead of being stored
        return checkpoint_sequential(self.layers, 2, x)
```

The highest-impact optimization is often simply using mixed precision training: it's nearly free (small code changes, minimal accuracy impact) and provides ~1.5-2× speedup. Enable it first before investing in more complex optimizations.
For production ML systems, inference efficiency often matters more than training efficiency. Training happens once (or infrequently); inference happens millions of times. Optimizing inference reduces latency, cost, and energy consumption.
Key Inference Optimization Techniques:
| Technique | Description | Typical Speedup | Trade-offs |
|---|---|---|---|
| Quantization | Reduce precision (FP32 → INT8/INT4) | 2-4× | Small accuracy loss; hardware support required |
| Pruning | Remove less important weights | 1.5-3× | Retraining often needed; structured vs unstructured |
| Knowledge Distillation | Train smaller model to mimic larger one | 10-100× | Architecture change; training required |
| Operator Fusion | Combine multiple operations into one kernel | 1.2-2× | Framework-specific; limited generality |
| Batching | Process multiple inputs together | 2-10× | Latency trade-off; requires request buffering |
| Caching | Cache common predictions or embeddings | 10-1000× | Memory overhead; staleness concerns |
Quantization in Detail:
Quantization is often the most impactful inference optimization. Modern approaches include:
Post-Training Quantization (PTQ): Quantize a trained model without retraining. Fast to apply but may lose accuracy on sensitive layers.
Quantization-Aware Training (QAT): Simulate quantization during training, allowing the model to adapt. Better accuracy but requires retraining.
Mixed-Precision Quantization: Use different precision for different layers based on sensitivity analysis.
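As a rough sketch of the QAT workflow using PyTorch's eager-mode quantization API (the tiny two-layer model, layer sizes, and the "fbgemm" x86 backend here are illustrative assumptions, and the fine-tuning loop is elided):

```python
import torch
import torch.nn as nn
from torch.ao.quantization import (
    QuantStub, DeQuantStub, get_default_qat_qconfig, prepare_qat, convert
)


class TinyNet(nn.Module):
    """Toy model wrapped with quant/dequant stubs for eager-mode QAT."""

    def __init__(self):
        super().__init__()
        self.quant = QuantStub()       # tensors enter the int8 domain here
        self.fc1 = nn.Linear(64, 32)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(32, 10)
        self.dequant = DeQuantStub()   # back to float at the output

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)


model = TinyNet()
model.train()
model.qconfig = get_default_qat_qconfig("fbgemm")  # x86 server backend
qat_model = prepare_qat(model)   # insert fake-quantization observers

# ... fine-tune qat_model for a few epochs with your usual training loop ...

qat_model.eval()
int8_model = convert(qat_model)  # swap modules for real int8 kernels
```

Because the model sees simulated quantization noise during fine-tuning, the converted int8 model typically recovers most of the accuracy that naive post-training quantization would lose.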
Inference Serving Frameworks: dedicated servers such as NVIDIA Triton Inference Server, TorchServe, TensorFlow Serving, and ONNX Runtime handle request batching, model versioning, and hardware acceleration so you don't have to build that plumbing yourself.
```python
import torch
from torch.quantization import quantize_dynamic, get_default_qconfig
from torch.ao.quantization import prepare, convert


def apply_dynamic_quantization(model):
    """
    Apply dynamic quantization - weights are quantized at rest,
    activations quantized dynamically during inference.

    Best for models with significant linear layers (e.g., transformers).
    Typical speedup: 2-4× on CPU, minimal accuracy loss.
    """
    quantized_model = quantize_dynamic(
        model,
        {torch.nn.Linear, torch.nn.LSTM, torch.nn.LSTMCell},
        dtype=torch.qint8,
    )
    return quantized_model


def export_to_onnx_for_optimization(model, sample_input, output_path):
    """
    Export model to ONNX format for cross-platform optimization.

    ONNX models can be optimized by:
    - ONNX Runtime for CPU/GPU inference
    - TensorRT for NVIDIA GPU optimization
    - OpenVINO for Intel hardware
    """
    model.eval()
    torch.onnx.export(
        model,
        sample_input,
        output_path,
        export_params=True,
        opset_version=14,
        do_constant_folding=True,   # Optimize constant expressions
        input_names=['input'],
        output_names=['output'],
        dynamic_axes={
            'input': {0: 'batch_size'},
            'output': {0: 'batch_size'},
        },
    )
    print(f"Model exported to {output_path}")
    print("Optimize with: python -m onnxruntime.transformers.optimize_model")


# Example: Optimize with TorchScript for production
def optimize_for_inference(model, sample_input):
    """
    Compile model with TorchScript for optimized inference.
    Enables operator fusion, constant folding, and other optimizations.
    """
    model.eval()

    # Trace the model with sample input
    traced = torch.jit.trace(model, sample_input)

    # Freeze the model first (inline parameters, remove training artifacts)
    frozen = torch.jit.freeze(traced)

    # Then apply inference optimizations (operator fusion, constant folding)
    optimized = torch.jit.optimize_for_inference(frozen)

    return optimized
```

Before implementing complex optimizations, ensure you're batching inference requests. Processing 32 inputs together is often 10× more efficient than 32 sequential single-input calls. This one change can dramatically reduce per-prediction cost.
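To see the effect for yourself, here is a minimal timing sketch; the small MLP and the batch of 32 random inputs are arbitrary stand-ins for a real model and real requests:

```python
import time
import torch


@torch.no_grad()
def compare_batched_vs_sequential(model, inputs):
    """Time one batched forward pass against many single-item calls."""
    model.eval()

    start = time.perf_counter()
    model(inputs)                       # one call on the full batch
    batched = time.perf_counter() - start

    start = time.perf_counter()
    for i in range(inputs.shape[0]):    # one call per example
        model(inputs[i:i + 1])
    sequential = time.perf_counter() - start

    # For GPU models, call torch.cuda.synchronize() before reading each timer
    print(f"batched: {batched * 1e3:.1f} ms, sequential: {sequential * 1e3:.1f} ms, "
          f"speedup ~{sequential / batched:.1f}x")


mlp = torch.nn.Sequential(
    torch.nn.Linear(256, 512), torch.nn.ReLU(), torch.nn.Linear(512, 10)
)
compare_batched_vs_sequential(mlp, torch.randn(32, 256))
```

In a real serving system, this grouping is what dynamic batching in frameworks like Triton or TorchServe does automatically: requests arriving within a short window are combined into one forward pass.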
Efficient resource utilization requires visibility into how resources are being used. Monitoring enables identification of bottlenecks, underutilization, and optimization opportunities.
Key Metrics to Monitor: GPU utilization and GPU memory usage, CPU utilization, data-loading throughput, disk and network I/O, training throughput (samples or tokens per second), and cost per experiment.
Monitoring Tools: nvidia-smi and nvitop for quick GPU checks, the PyTorch Profiler and TensorBoard for per-operator breakdowns, and Prometheus/Grafana or your cloud provider's dashboards for fleet-level monitoring.
Common Bottleneck Patterns:
| GPU Util | CPU Util | Likely Bottleneck | Solution |
|---|---|---|---|
| Low | High | Data loading | More workers, prefetching, faster storage |
| Low | Low | Small batch size | Increase batch, gradient accumulation |
| High | Low | Optimal | Maintain current configuration |
| High | High | Both saturated | Need more hardware |
Don't guess where bottlenecks are—measure them. Running PyTorch Profiler for a few training steps reveals exactly where time is spent. Optimizing non-bottlenecks wastes effort; profiling ensures you invest in impactful improvements.
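A minimal profiling sketch is shown below; the toy model, optimizer, and random batches are placeholders for your actual training step:

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 10)
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

# Profile a handful of training steps; that is usually enough to spot bottlenecks
with profile(activities=activities, record_shapes=True) as prof:
    for _ in range(5):
        inputs = torch.randn(64, 512)
        labels = torch.randint(0, 10, (64,))
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()

# Top operators by CPU time; sort by "cuda_time_total" when profiling on GPU
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```

If most of the time lands in data loading or host-to-device copies rather than matrix multiplies, the bottleneck table above points to the likely fix.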
ML workloads can be expensive. Without active cost management, cloud bills can spiral out of control. Here are strategies for managing ML compute costs effectively:
Training Cost Optimization: use spot/preemptible instances with frequent checkpointing, right-size GPU types to the job, start from pretrained models instead of training from scratch, and shut down idle instances.
Inference Cost Optimization: quantize or distill models, batch requests, autoscale replicas with traffic, and cache frequent predictions or embeddings.
Cost Tracking:
Implement cost attribution to understand where money goes: tag cloud resources by team, project, and experiment, and review spend against those tags regularly.
A single A100 GPU instance costs ~$2-4/hour. Left running idle for a month (roughly 720 hours), that's $1,400-2,900 wasted. Implement auto-shutdown for development instances and rigorously shut down resources after experiments complete. Cloud cost monitoring alerts are essential.
We've explored the critical role of computational resources in machine learning success: the hardware landscape, how to estimate requirements, efficiency techniques for training and inference, monitoring, and cost management.
What's Next:
With data, features, algorithms, and computational resources covered, we turn to the final success factor: Domain Expertise. Technical excellence in ML is necessary but not sufficient—understanding the problem domain, stakeholder needs, and real-world deployment context often determines whether an ML project delivers value. The next page explores how domain knowledge amplifies ML success.
You now understand the computational dimension of ML success. You can match hardware to workloads, estimate resource requirements, leverage efficiency techniques, optimize inference, and manage costs—skills essential for practical ML at any scale.