Foundation models are large-scale models trained on broad data that can be adapted to a wide range of downstream tasks. They represent the convergence of self-supervised learning techniques with massive compute and data.
The term, coined by Stanford HAI in 2021, captures a fundamental shift: instead of training task-specific models from scratch, we train one versatile model that serves as the foundation for many applications.
GPT-3 (175B parameters, 2020) demonstrated that scale enables emergent capabilities like few-shot learning. Vision foundation models like DINOv2, SAM, and multimodal models like GPT-4V extend this paradigm to visual understanding, creating general-purpose visual AI systems.
Scaling laws describe how model performance improves as we increase compute, data, and parameters. They guide resource allocation for training foundation models.
Hoffmann et al. (2022) modeled training loss as $$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$ where $E$ is the irreducible loss and the fitted constants are $A \approx 406.4$, $B \approx 410.7$, $\alpha \approx 0.34$, $\beta \approx 0.28$. Minimizing this loss under a fixed compute budget $C \approx 6ND$ gives the Chinchilla result: parameters $N$ and tokens $D$ should be scaled in roughly equal proportion ($N, D \propto \sqrt{C}$), or about 20 training tokens per parameter.
For vision models, Zhai et al. (2022) found: $$\text{Error} \propto \left(\frac{1}{\text{Samples}}\right)^{0.5} + \left(\frac{1}{\text{Params}}\right)^{0.5}$$
Both data and model size contribute equally to performance.
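As a quick sanity check of the ~20 tokens-per-parameter rule (my own arithmetic, using the standard $C \approx 6ND$ approximation), plugging in the published Chinchilla compute budget recovers the configuration that was actually trained:

$$C \approx 6ND = 120N^2 \;\Rightarrow\; N = \sqrt{\frac{C}{120}}, \qquad D = 20N$$

For Chinchilla's budget of roughly $5.8 \times 10^{23}$ FLOPs this gives $N \approx 70$B parameters and $D \approx 1.4$T tokens, matching the published model.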
```python
import numpy as np


def chinchilla_optimal(compute_budget_flops):
    """Compute-optimal model size and training tokens for a given budget.

    Chinchilla scaling law (Hoffmann et al., 2022): for compute C (in FLOPs),
    loss is modeled as L(N, D) = E + A / N**alpha + B / D**beta, and the
    compute-optimal allocation scales N and D roughly equally.
    Rule of thumb: train for ~20 tokens per parameter.
    """
    # Empirical loss-fit constants from Hoffmann et al. (2022)
    a = 406.4     # Loss coefficient for N
    b = 410.7     # Loss coefficient for D
    alpha = 0.34  # Scaling exponent for N
    beta = 0.28   # Scaling exponent for D

    # Optimal allocation from C ~ 6*N*D with D ~ 20*N
    optimal_n = (compute_budget_flops / (6 * 20)) ** 0.5
    optimal_d = 20 * optimal_n

    # Predicted reducible loss at this allocation (irreducible term E omitted)
    predicted_loss = a / optimal_n ** alpha + b / optimal_d ** beta

    return {
        'optimal_params': optimal_n,
        'optimal_tokens': optimal_d,
        'tokens_per_param': optimal_d / optimal_n,
        'predicted_reducible_loss': predicted_loss,
    }


def estimate_training_cost(params_b, tokens_t, gpu_tflops=312):
    """Estimate training cost for a model.

    Rule of thumb: ~6 FLOPs per parameter per token for forward + backward.
    Assumes 100% GPU utilization; real runs typically reach 30-50% of peak.
    """
    flops = 6 * params_b * 1e9 * tokens_t * 1e12  # params in billions, tokens in trillions
    gpu_hours = flops / (gpu_tflops * 1e12 * 3600)
    return {
        'total_flops': flops,
        'gpu_hours': gpu_hours,
        'node_days_8_gpu': gpu_hours / (24 * 8),  # days on one 8-GPU node
    }
```

| Model | Parameters | Training Data | Compute (PF-days) |
|---|---|---|---|
| GPT-3 | 175B | 300B tokens | 3,640 |
| CLIP ViT-L | 428M | 400M images | ~300 |
| DINOv2 ViT-g | 1.1B | 142M images | ~500 |
| LLaMA 2 70B | 70B | 2T tokens | ~6,000 |
| GPT-4 (est.) | ~1.8T MoE | ~10T tokens | ~100,000 |
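As a usage sketch (my own numbers, not from the table above), feeding the GPT-3 row into the helpers defined earlier roughly reproduces the compute column and shows what a Chinchilla-optimal allocation of the same budget would look like:

```python
# Cross-check the GPT-3 row: 175B parameters, 300B tokens
cost = estimate_training_cost(params_b=175, tokens_t=0.3)  # tokens_t in trillions
print(f"{cost['total_flops']:.2e} FLOPs")    # ~3.15e23 FLOPs, i.e. ~3,640 PF-days
print(f"{cost['gpu_hours']:.0f} GPU-hours")  # ~2.8e5 A100-hours at 312 TFLOPS (100% util.)

# Chinchilla-optimal allocation for the same compute budget
alloc = chinchilla_optimal(cost['total_flops'])
print(f"{alloc['optimal_params'] / 1e9:.0f}B params, "
      f"{alloc['optimal_tokens'] / 1e12:.1f}T tokens")  # ~51B params, ~1.0T tokens
```

This illustrates the well-known observation that GPT-3 was under-trained by Chinchilla standards: the same compute would have been better spent on a smaller model trained on roughly a trillion tokens.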
Vision foundation models apply self-supervised learning at scale to create general-purpose visual representations.
DINOv2 by Oquab et al. (2023) scales DINO to produce features that transfer across diverse vision tasks without fine-tuning, combining DINO-style image-level self-distillation with iBOT-style masked patch prediction (see the architecture sketch below).
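To see how these frozen features are consumed in practice, here is a hedged usage sketch. It assumes the pretrained checkpoints published on torch.hub by the `facebookresearch/dinov2` repository and a normalized 224×224 RGB input; the input tensor below is a placeholder.

```python
import torch

# Load a pretrained DINOv2 backbone from torch.hub (downloads weights on first use)
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
model.eval()

# Extract a global image embedding from a (batch, 3, 224, 224) tensor
image = torch.randn(1, 3, 224, 224)  # placeholder for a preprocessed RGB image
with torch.no_grad():
    features = model(image)          # CLS-token features, e.g. 1024-dim for ViT-L/14

# Frozen features like these are typically fed to a linear probe or k-NN classifier
print(features.shape)
```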
SAM by Kirillov et al. (2023) creates a promptable segmentation foundation model: a heavyweight image encoder, a prompt encoder for points, boxes, and masks, and a lightweight mask decoder that can produce masks for many prompts from a single image embedding (see the architecture sketch below).
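As a usage sketch of this promptable interface (hedged; the checkpoint path and input image below are placeholders), the official `segment_anything` package is typically used roughly as follows:

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a pretrained SAM checkpoint (filename is a placeholder for a downloaded checkpoint)
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# Encode the image once, then prompt it repeatedly at low cost
image = np.zeros((480, 640, 3), dtype=np.uint8)  # placeholder HxWx3 RGB array
predictor.set_image(image)

masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),  # one foreground click
    point_labels=np.array([1]),           # 1 = foreground, 0 = background
    multimask_output=True,                # return several candidate masks
)
```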
```python
import copy

import torch
import torch.nn as nn


class DINOv2(nn.Module):
    """DINOv2: scaled self-supervised vision model (architecture sketch).

    Combines DINO self-distillation with iBOT masked patch modeling.
    Assumes `DINOHead` and the DINO/iBOT loss modules are defined elsewhere.
    """

    def __init__(self, backbone, head_dim=65536, patch_size=14):
        super().__init__()
        self.student = backbone
        self.teacher = copy.deepcopy(backbone)  # updated as an EMA of the student
        self.patch_size = patch_size

        # DINO head for image-level distillation
        self.dino_head = DINOHead(backbone.embed_dim, head_dim)
        # iBOT head for patch-level masked prediction
        self.ibot_head = nn.Linear(backbone.embed_dim, 8192)

    def forward(self, global_crops, local_crops, masks):
        # DINO: distill teacher outputs on global crops into student outputs on all crops
        teacher_global = self.teacher(global_crops)
        student_all = [self.student(c) for c in global_crops + local_crops]
        dino_loss = self.dino_loss(teacher_global, student_all)

        # iBOT: predict teacher patch tokens at masked student positions
        student_masked = self.student(global_crops, mask=masks)
        teacher_patches = self.teacher.get_patch_tokens(global_crops)
        ibot_loss = self.ibot_loss(student_masked, teacher_patches, masks)

        return dino_loss + ibot_loss


class SAM(nn.Module):
    """Segment Anything Model architecture (sketch)."""

    def __init__(self, image_encoder, prompt_encoder, mask_decoder):
        super().__init__()
        self.image_encoder = image_encoder    # heavyweight ViT-H
        self.prompt_encoder = prompt_encoder  # points, boxes, masks
        self.mask_decoder = mask_decoder      # lightweight transformer decoder

    def forward(self, image, prompts):
        # Encode the image once (reusable across many prompts)
        image_embedding = self.image_encoder(image)

        # Encode prompts (points, boxes, or masks)
        sparse_embeds, dense_embeds = self.prompt_encoder(prompts)

        # Decode masks plus a predicted IoU/quality score per mask
        masks, iou_pred = self.mask_decoder(
            image_embedding, sparse_embeds, dense_embeds
        )
        return masks, iou_pred
```

Training foundation models requires sophisticated distributed systems and efficient implementations.
```python
import torch
import torch.distributed as dist
from torch.cuda.amp import GradScaler, autocast
from torch.nn.parallel import DistributedDataParallel as DDP


def setup_foundation_training(model, world_size, local_rank):
    """Set up distributed data-parallel training for foundation models.

    Typically launched with: torchrun --nproc_per_node=<num_gpus> train.py
    """
    # Initialize the process group (single-node setup, so rank == local_rank)
    dist.init_process_group("nccl", rank=local_rank, world_size=world_size)
    torch.cuda.set_device(local_rank)

    # Wrap the model in DDP
    model = model.cuda(local_rank)
    model = DDP(model, device_ids=[local_rank], find_unused_parameters=False)

    # Gradient scaler for mixed-precision training
    scaler = GradScaler()
    return model, scaler


def train_step_foundation(model, batch, optimizer, scaler):
    """Single training step with mixed precision."""
    optimizer.zero_grad()

    # Forward pass under autocast
    with autocast():
        loss = model(batch)

    # Scaled backward pass
    scaler.scale(loss).backward()

    # Unscale gradients before clipping
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```

Foundation model quality depends critically on training data. Careful curation improves efficiency and capability.
```python
import torch.nn.functional as F
from imagededup.methods import PHash


def deduplicate_images(image_dir, threshold=0.9):
    """Remove near-duplicate images in a directory using perceptual hashing."""
    hasher = PHash()
    # Map of filename -> 64-bit perceptual hash
    encodings = hasher.encode_images(image_dir=image_dir)
    # Files to drop so that exactly one image per duplicate group is kept
    to_remove = set(hasher.find_duplicates_to_remove(
        encoding_map=encodings,
        max_distance_threshold=int((1 - threshold) * 64),
    ))
    return [f for f in encodings if f not in to_remove]


def compute_clip_score(model, images, texts):
    """Score image-text pairs by CLIP cosine similarity.

    Assumes images and texts are already preprocessed/tokenized for the model.
    """
    image_embeds = F.normalize(model.encode_image(images), dim=-1)
    text_embeds = F.normalize(model.encode_text(texts), dim=-1)
    scores = (image_embeds * text_embeds).sum(dim=-1)
    return scores


def quality_filter_pipeline(dataset, clip_model, min_score=0.25):
    """Multi-stage quality filtering for vision-language datasets."""
    filtered = []
    for image, text in dataset:
        # Stage 1: image quality (resolution, aspect ratio)
        if image.size[0] < 200 or image.size[1] < 200:
            continue
        if max(image.size) / min(image.size) > 3:
            continue

        # Stage 2: text quality (length; language ID could be added here)
        if len(text.split()) < 3:
            continue

        # Stage 3: CLIP score filtering
        score = compute_clip_score(clip_model, image, text)
        if score > min_score:
            filtered.append((image, text))
    return filtered
```

Foundation models are designed for adaptation. Key strategies include the following; a rough parameter-count comparison appears after the table.
| Method | Trainable Params | Compute | Best For |
|---|---|---|---|
| Full Fine-tuning | 100% | High | Plenty of data, domain shift |
| Linear Probing | <1% | Low | Quick evaluation, limited data |
| LoRA | ~1% | Low | Efficient fine-tuning |
| Prompt Tuning | <0.1% | Very Low | Few-shot, no weight changes |
| Zero-shot | 0% | None | Direct application |
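To make the "Trainable Params" column concrete, here is a small back-of-the-envelope sketch (my own example, not from the original text) for a single 1024×1024 linear layer adapted with rank-8 LoRA:

```python
d_in, d_out, rank = 1024, 1024, 8

full_params = d_in * d_out                # 1,048,576 weights updated by full fine-tuning
lora_params = rank * d_in + rank * d_out  # 16,384 weights in the low-rank A and B matrices

print(f"LoRA trains {lora_params / full_params:.1%} of the layer's weights")  # ~1.6%
```

Across a full transformer the fraction is similar, which is why LoRA lands in the ~1% row of the table above.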
```python
import torch
import torch.nn as nn


class LoRALayer(nn.Module):
    """Low-Rank Adaptation (LoRA) wrapper around a frozen linear layer."""

    def __init__(self, original_layer, rank=8, alpha=16):
        super().__init__()
        self.original = original_layer
        self.rank = rank
        self.alpha = alpha

        # Freeze the original weights
        for p in self.original.parameters():
            p.requires_grad = False

        # Low-rank decomposition: W' = W + (alpha / rank) * B A
        in_features = original_layer.in_features
        out_features = original_layer.out_features
        self.lora_A = nn.Linear(in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, out_features, bias=False)

        # Initialize A with a Gaussian and B with zeros, so the adapter
        # contributes nothing at the start of fine-tuning
        nn.init.normal_(self.lora_A.weight)
        nn.init.zeros_(self.lora_B.weight)

    def forward(self, x):
        original_out = self.original(x)
        lora_out = self.lora_B(self.lora_A(x)) * (self.alpha / self.rank)
        return original_out + lora_out
```

You've completed Module 6: Modern Self-Supervised Methods! You now understand the cutting-edge techniques from BYOL through foundation models. These methods power today's most capable AI systems, from image classifiers to multimodal assistants. The self-supervised paradigm of learning from unlabeled data at scale has become the dominant approach in modern deep learning.