In 2020, researchers from Amazon Web Services introduced AutoGluon, an AutoML framework that challenged fundamental assumptions of the field. While systems like Auto-sklearn focused on finding the single best pipeline through sophisticated search, AutoGluon asked a heretical question: What if we simply trained everything and let ensembling do the work?
This seemingly naive approach, when implemented with careful engineering and multi-layer stacking, produced results that consistently outperformed optimization-heavy systems. AutoGluon achieved top rankings on multiple benchmarks while often being faster than systems that spent compute on search rather than training.
The implications were profound: perhaps the AutoML community had been over-investing in search complexity and under-investing in ensemble architecture. AutoGluon's success demonstrated that with enough model diversity and intelligent stacking, explicit hyperparameter optimization becomes less critical than simply having well-chosen defaults and letting ensemble methods extract complementary value.
By the end of this page, you will understand AutoGluon's complete architecture—from its multi-layer stacking design to its quality presets and time management. You will be able to deploy AutoGluon for tabular, text, image, and multimodal tasks, configure quality-speed trade-offs for production, and understand why this less-is-more approach often outperforms traditional AutoML.
To appreciate AutoGluon's design, we must understand its departure from conventional AutoML wisdom.
Systems like Auto-sklearn, TPOT, and H2O AutoML operate on an implicit assumption: the search for optimal hyperparameters is the primary challenge. These systems dedicate substantial compute to exploring configuration spaces, using techniques like Bayesian optimization, evolutionary search, or grid/random search.
This approach has intuitive appeal—surely finding the right hyperparameters must matter. And indeed, for a single model, hyperparameter tuning can yield significant improvements.
AutoGluon's creators observed several phenomena that challenged this orthodoxy:

- An ensemble of diverse models trained with sensible defaults frequently outperforms a single, extensively tuned model.
- Hyperparameter search can overfit the validation data, so gains found during search often fail to transfer to new data.
- Every hour spent searching configurations is an hour not spent training additional models that could strengthen an ensemble.

Based on these observations, AutoGluon implements a fundamentally different strategy:

- Train a fixed portfolio of diverse algorithms with carefully chosen default configurations instead of searching hyperparameter spaces.
- Combine everything through multi-layer stack ensembling built on out-of-fold predictions.
- Allocate nearly all of the compute budget to training models rather than to hyperparameter optimization.
This approach trades theoretical optimality for practical robustness. AutoGluon may not find the single best possible model, but it consistently produces ensembles that generalize well across diverse datasets.
In systematic benchmarks across 50+ datasets, AutoGluon's ensemble approach achieved the highest average rank among open-source AutoML systems, while often completing in less time than search-heavy competitors. This wasn't luck—it reflected a genuinely more efficient allocation of compute resources.
The core innovation in AutoGluon is its multi-layer stack ensembling architecture. Unlike simple ensembles that average predictions, stacking uses the outputs of base models as features for meta-models, enabling learning of optimal combination strategies.
In traditional stacking:

- Base models are trained on the original input features.
- Their out-of-fold predictions become input features for a meta-model.
- The meta-model learns how to combine the base models' outputs into a final prediction.

This works because the meta-model can learn:

- which base models to trust in which regions of the input space,
- how to correct systematic biases of individual models, and
- combination strategies more expressive than simple averaging.
AutoGluon extends this to multiple stacking layers:
```python
# AutoGluon Multi-Layer Stacking Architecture
"""
Layer 0 (Base Models):
    - LightGBM, CatBoost, XGBoost (gradient boosting variants)
    - Random Forest, Extra Trees (bagging variants)
    - Neural Network
    - KNN
    Each trained on original features.

Layer 1 (First Stack):
    - New models trained on [original_features + layer_0_predictions]
    - Same algorithm families as Layer 0
    - Out-of-fold predictions prevent overfitting

Layer 2 (Second Stack - optional):
    - Models trained on [original_features + layer_0_preds + layer_1_preds]
    - Further refinement of predictions

Final Layer (Weighted Ensemble):
    - Learns optimal weights for all models across all layers
    - Ensemble selection similar to Caruana et al.
"""
import numpy as np
from sklearn.model_selection import KFold


class MultiLayerStackEnsemble:
    """
    Simplified implementation of AutoGluon's stacking architecture.

    Helper methods _create_model, _optimize_ensemble_weights, and
    _apply_ensemble_weights are omitted for brevity.
    """

    def __init__(
        self,
        base_models: list,
        num_stack_levels: int = 2,
        num_folds: int = 5,
        use_original_features: bool = True,
    ):
        """
        Args:
            base_models: List of base model configurations
            num_stack_levels: Number of stacking layers (0 = no stacking)
            num_folds: K-fold for out-of-fold prediction generation
            use_original_features: Whether higher layers see original features
        """
        self.base_models = base_models
        self.num_stack_levels = num_stack_levels
        self.num_folds = num_folds
        self.use_original_features = use_original_features

        # Storage for trained models at each level
        self.level_models = {level: [] for level in range(num_stack_levels + 1)}

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Trains the full multi-layer stack ensemble."""
        self._num_classes = len(np.unique(y))
        current_features = X.copy()

        for level in range(self.num_stack_levels + 1):
            print(f"Training Level {level} models...")
            level_oof_predictions = []

            for model_config in self.base_models:
                # Train with K-fold to generate OOF predictions
                oof_preds = self._train_with_oof(
                    model_config, current_features, y, level
                )
                level_oof_predictions.append(oof_preds)

            # Stack: concatenate OOF predictions
            stacked_preds = np.column_stack(level_oof_predictions)

            if self.use_original_features and level < self.num_stack_levels:
                # Next level sees original features + all previous predictions
                current_features = np.hstack([X, stacked_preds])
            else:
                current_features = stacked_preds

        # Final ensemble weight optimization
        self._optimize_ensemble_weights()
        return self

    def _train_with_oof(
        self, model_config, X: np.ndarray, y: np.ndarray, level: int
    ) -> np.ndarray:
        """
        Trains a model using K-fold and returns out-of-fold predictions.

        This prevents information leakage: each prediction is made by a
        model that never saw that sample during training.
        """
        kfold = KFold(n_splits=self.num_folds, shuffle=True, random_state=42)
        oof_predictions = np.zeros((len(X), self._num_classes))
        fold_models = []

        for train_idx, val_idx in kfold.split(X):
            X_train, X_val = X[train_idx], X[val_idx]
            y_train = y[train_idx]

            # Instantiate and train model
            model = self._create_model(model_config)
            model.fit(X_train, y_train)

            # Generate out-of-fold predictions
            oof_predictions[val_idx] = model.predict_proba(X_val)
            fold_models.append(model)

        # Store all fold models for this level
        self.level_models[level].append(fold_models)
        return oof_predictions

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Generates predictions by propagating through all layers."""
        current_features = X.copy()

        for level in range(self.num_stack_levels + 1):
            level_predictions = []

            for model_group in self.level_models[level]:
                # Average predictions across folds
                fold_preds = np.mean(
                    [model.predict_proba(current_features) for model in model_group],
                    axis=0,
                )
                level_predictions.append(fold_preds)

            stacked_preds = np.column_stack(level_predictions)

            if self.use_original_features and level < self.num_stack_levels:
                current_features = np.hstack([X, stacked_preds])
            else:
                current_features = stacked_preds

        # Apply ensemble weights to final predictions
        return self._apply_ensemble_weights(current_features)
```

Information Aggregation: Each layer aggregates information from the previous layer's models. If Layer 0 gradient boosting is strong on certain samples and Random Forest on others, Layer 1 models can learn to route predictions appropriately.
Error Correction: Higher layers can correct systematic errors from lower layers. If Layer 0 models collectively underpredict a certain region, Layer 1 models trained on their errors can compensate.
Capacity Scaling: Adding layers adds capacity without the overfitting risks of simply making individual models larger. The out-of-fold training prevents information leakage.
Complementary Learning: Different algorithm families learn different patterns. By forcing all algorithms to see each other's predictions, the stack creates opportunities for complementary learning that wouldn't arise from independent training.
The key to successful stacking is out-of-fold (OOF) prediction generation. Each sample's stacked features must come from models that never saw that sample during training. AutoGluon uses repeated K-fold cross-validation internally to ensure this property, preventing the stack from simply memorizing training data.
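The OOF mechanism is easy to demonstrate with scikit-learn's `cross_val_predict`. The sketch below is a minimal illustration of the idea, not AutoGluon's actual internals (AutoGluon additionally uses repeated bagging and its own model implementations):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Out-of-fold probabilities: each row is predicted by a model
# that never saw that row during training.
rf_oof = cross_val_predict(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X, y, cv=5, method='predict_proba'
)

# Stacked features for the meta-model: original features + OOF predictions
stacked_features = np.hstack([X, rf_oof])

# Meta-model trained on leakage-free stacked features
meta = LogisticRegression(max_iter=1000).fit(stacked_features, y)
print(stacked_features.shape)  # (500, 22): 20 original features + 2 class probabilities
```

Because every stacked feature comes from a fold where the sample was held out, the meta-model cannot simply memorize base-model outputs on the training data.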
AutoGluon's effectiveness depends critically on the diversity of its base learners. The framework includes carefully selected model families that maximize prediction diversity while maintaining individual quality.
| Model Family | Algorithms Included | Diversity Contribution |
|---|---|---|
| Gradient Boosting | LightGBM, CatBoost, XGBoost | Sequential additive learning; excellent on tabular data |
| Random Forest Family | Random Forest, Extra Trees | Parallel ensemble; different from sequential boosting |
| Neural Networks | FastAI Tabular NN, Custom MLP | Non-tree approach; learns different feature interactions |
| K-Nearest Neighbors | Weighted KNN, Bagged KNN | Instance-based; captures local structure |
| Linear Models | Regularized Logistic/Linear Regression | Simple baselines; surprisingly effective stacking components |
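One way to see why these families complement each other is to measure how correlated their errors are. The sketch below (illustrative, with scikit-learn stand-ins rather than AutoGluon's actual models) computes pairwise error correlations on out-of-fold predictions; low correlation means the models fail on different samples, which is exactly what an ensemble exploits:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_predict

# flip_y adds label noise so every model makes some mistakes
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1, random_state=0)

models = {
    'boosting': GradientBoostingClassifier(random_state=0),
    'forest': RandomForestClassifier(n_estimators=100, random_state=0),
    'knn': KNeighborsClassifier(n_neighbors=15),
}

# Out-of-fold error indicator per model (1.0 = misclassified)
errors = {
    name: (cross_val_predict(model, X, y, cv=5) != y).astype(float)
    for name, model in models.items()
}

# Pairwise error correlation across model families
for a in models:
    for b in models:
        if a < b:
            corr = np.corrcoef(errors[a], errors[b])[0, 1]
            print(f"{a} vs {b}: error correlation = {corr:.2f}")
```

If two models had error correlation near 1.0, ensembling them would add little; diverse algorithm families keep these correlations well below that.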
Rather than searching hyperparameter spaces, AutoGluon uses hyperparameter portfolios: multiple configurations of each algorithm that together capture the useful hyperparameter space.
For example, LightGBM might be trained with three configurations:
This portfolio approach provides hyperparameter diversity without search overhead. The ensemble selection process then learns which configurations are most useful for the specific dataset.
```python
# AutoGluon Hyperparameter Portfolio Example
"""
AutoGluon defines multiple hyperparameter configurations per algorithm.
This provides diversity without explicit search.
"""

LIGHTGBM_PORTFOLIO = [
    # Default configuration
    {
        'num_boost_round': 10000,
        'learning_rate': 0.03,
        'num_leaves': 128,
        'feature_fraction': 0.9,
        'min_data_in_leaf': 5,
        'early_stopping_rounds': 150,
    },
    # Large/Complex - more capacity
    {
        'num_boost_round': 20000,
        'learning_rate': 0.01,
        'num_leaves': 256,
        'feature_fraction': 0.75,
        'min_data_in_leaf': 3,
        'extra_trees': True,  # Additional randomness
    },
    # Regularized - prevents overfitting
    {
        'num_boost_round': 5000,
        'learning_rate': 0.1,
        'num_leaves': 64,
        'feature_fraction': 0.8,
        'min_data_in_leaf': 20,
        'lambda_l1': 0.1,
        'lambda_l2': 1.0,
    },
]

NEURAL_NETWORK_PORTFOLIO = [
    # Default TabularNN
    {
        'hidden_layers': [256, 128],
        'dropout': 0.1,
        'learning_rate': 1e-3,
        'batch_size': 128,
        'epochs': 100,
    },
    # Wide and Shallow
    {
        'hidden_layers': [512],
        'dropout': 0.2,
        'learning_rate': 5e-4,
        'batch_size': 256,
        'epochs': 50,
    },
    # Deep Network
    {
        'hidden_layers': [256, 128, 64, 32],
        'dropout': 0.15,
        'learning_rate': 1e-3,
        'batch_size': 64,
        'epochs': 200,
        'use_batchnorm': True,
    },
]


def get_model_portfolio(algorithm: str, quality_preset: str) -> list:
    """
    Returns hyperparameter configurations based on quality preset.

    Args:
        algorithm: 'lightgbm', 'nn', 'rf', etc.
        quality_preset: 'best_quality', 'high_quality', 'good_quality',
            'medium_quality'

    Returns:
        List of hyperparameter dictionaries to train
    """
    portfolios = {
        'lightgbm': LIGHTGBM_PORTFOLIO,
        'nn': NEURAL_NETWORK_PORTFOLIO,
        # ... more algorithms
    }
    full_portfolio = portfolios.get(algorithm, [])

    # Quality presets control how many configurations to use
    if quality_preset == 'best_quality':
        return full_portfolio       # All configurations
    elif quality_preset == 'high_quality':
        return full_portfolio[:2]   # Top 2 configurations
    else:
        return full_portfolio[:1]   # Just default
```

LightGBM, CatBoost, and XGBoost implement gradient boosting differently (leaf-wise vs depth-wise growth, native categorical handling, regularization approaches). Despite solving the same problem, they produce sufficiently different predictions that all three contribute value in the ensemble. This is model diversity at the algorithm level, not just the hyperparameter level.
AutoGluon provides quality presets that automatically configure the system for different accuracy-speed trade-offs. This addresses a fundamental challenge: different use cases have vastly different requirements.
AutoGluon defines presets from fastest to most accurate:
| Preset | Use Case | Model Count | Stacking | HPO | Typical Time |
|---|---|---|---|---|---|
| medium_quality | Fast iteration, prototyping | ~5 models | No | No | 1-5 min |
| good_quality | Balance of speed and accuracy | ~8 models | 1 layer | Minimal | 5-20 min |
| high_quality | Production-grade accuracy | ~15 models | 2 layers | Some | 20-120 min |
| best_quality | Maximum accuracy, competitions | ~25+ models | 3 layers | Extensive | 2+ hours |
```python
from autogluon.tabular import TabularPredictor
import pandas as pd

# Load example dataset
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
label = 'target'

# =============================================
# Basic Usage with Presets
# =============================================

# Fast prototyping
predictor_fast = TabularPredictor(label=label).fit(
    train_data,
    presets='medium_quality',
    time_limit=300  # 5 minutes max
)

# Production-grade
predictor_prod = TabularPredictor(label=label).fit(
    train_data,
    presets='high_quality',
    time_limit=3600  # 1 hour
)

# Competition mode - maximum accuracy
predictor_best = TabularPredictor(label=label).fit(
    train_data,
    presets='best_quality',
    time_limit=14400  # 4 hours
)

# =============================================
# Advanced Configuration
# =============================================

predictor_advanced = TabularPredictor(
    label=label,
    eval_metric='roc_auc',     # Optimize for AUC
    path='./autogluon_models'  # Model save path
).fit(
    train_data,
    presets='high_quality',
    time_limit=7200,
    # Fine-grained control
    num_bag_folds=8,     # More folds = more robust OOF
    num_bag_sets=1,      # Bagging repetitions
    num_stack_levels=2,  # Stacking depth
    # Model inclusion/exclusion
    excluded_model_types=['KNN'],  # Exclude slow models
    # Hyperparameter tuning
    hyperparameters={
        'GBM': [
            {'num_boost_round': 10000},
            {'num_boost_round': 20000, 'extra_trees': True},
        ],
        'CAT': {'iterations': 10000},
        'NN_TORCH': {},  # Use default
    },
    # Resource management
    ag_args_fit={
        'num_gpus': 1,  # Enable GPU for neural networks
    },
)

# =============================================
# Inspecting Results
# =============================================

# Leaderboard of all models
leaderboard = predictor_advanced.leaderboard(test_data)
print(leaderboard)

# Feature importance
importance = predictor_advanced.feature_importance(test_data)
print(importance)

# Model analysis
fit_summary = predictor_advanced.fit_summary()
print(fit_summary)

# Generate predictions
predictions = predictor_advanced.predict(test_data)
probabilities = predictor_advanced.predict_proba(test_data)
```

AutoGluon doesn't simply run models until time runs out. It implements intelligent time allocation that maximizes ensemble quality within constraints:
Estimated Model Times: Based on dataset size and model complexity, AutoGluon estimates how long each model will take.
Priority Ordering: Models are ordered by expected contribution per time unit. Fast, reliable models (LightGBM) run first; slow, potentially high-value models (large neural networks) run if time permits
Progressive Stacking: Only builds higher stack levels if time budget allows and lower levels provide benefit
Early Stopping: Iterative algorithms use validation-based early stopping to avoid wasting time on converged models
Dynamic Reallocation: If models complete faster than expected, AutoGluon can train additional configurations
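The allocation logic above can be sketched as a simple greedy scheduler. The model names, time estimates, and value scores below are illustrative stand-ins, not AutoGluon's actual internals:

```python
def allocate_training_budget(candidates, time_limit):
    """Greedy scheduler: run models in order of expected value per
    second of estimated training time, skipping any model that no
    longer fits in the remaining budget."""
    # Sort by value density: expected contribution per unit time
    ranked = sorted(
        candidates, key=lambda m: m['value'] / m['est_time'], reverse=True
    )
    schedule, remaining = [], time_limit
    for model in ranked:
        if model['est_time'] <= remaining:
            schedule.append(model['name'])
            remaining -= model['est_time']
    return schedule, remaining

# Illustrative candidates (times in seconds, values are relative scores)
candidates = [
    {'name': 'LightGBM',     'est_time': 60,  'value': 0.9},
    {'name': 'CatBoost',     'est_time': 120, 'value': 0.9},
    {'name': 'RandomForest', 'est_time': 90,  'value': 0.7},
    {'name': 'NeuralNet',    'est_time': 900, 'value': 0.8},
    {'name': 'KNN',          'est_time': 30,  'value': 0.3},
]

schedule, leftover = allocate_training_budget(candidates, time_limit=300)
print(schedule)  # ['LightGBM', 'KNN', 'RandomForest', 'CatBoost'] -- NeuralNet skipped
```

Note that the slow neural network is dropped entirely under a tight budget, mirroring how AutoGluon prioritizes fast, reliable models when time is short.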
Set time_limit slightly higher than your actual budget. AutoGluon's time estimation isn't perfect, and a small buffer prevents incomplete runs. For production, consider using 2x your target time during development, then tightening for deployment after you understand typical run times on your data.
A major differentiator of AutoGluon is its multimodal architecture. Unlike most AutoML systems that focus exclusively on tabular data, AutoGluon provides unified APIs for tabular, text, image, and multimodal (combined) data.
| Module | Data Types | Key Models | Use Cases |
|---|---|---|---|
| TabularPredictor | Numeric, categorical, datetime, text (as feature) | GBM, RF, NN, Stacking | Structured data, classic ML |
| TextPredictor | Raw text | BERT, ELECTRA, transformer variants | Classification, regression, NER |
| ImagePredictor | Image files | EfficientNet, ResNet, ViT | Image classification, object detection |
| MultiModalPredictor | Any combination of above | Fusion models, late fusion, cross-modal attention | Product description + image → price prediction |
```python
# AutoGluon Multimodal Capabilities
from autogluon.multimodal import MultiModalPredictor
from autogluon.text import TextPredictor
from autogluon.vision import ImagePredictor, ImageDataset
import pandas as pd

# =============================================
# Text Classification
# =============================================
text_data = pd.DataFrame({
    'review': [
        "This product exceeded all my expectations!",
        "Terrible quality, would not recommend.",
        "Average product, nothing special.",
    ],
    'sentiment': ['positive', 'negative', 'neutral']
})

text_predictor = TextPredictor(label='sentiment')
text_predictor.fit(text_data, time_limit=600)

# =============================================
# Image Classification
# =============================================

# AutoGluon can work with image folders directly
train_dataset = ImageDataset.from_folder('path/to/train/')

image_predictor = ImagePredictor(label='category')
image_predictor.fit(train_dataset, time_limit=1200)

# Or fine-tune a specific backbone
image_predictor_custom = ImagePredictor(
    label='category',
    hyperparameters={'model': 'efficientnet_b4'}
)
image_predictor_custom.fit(
    train_dataset,
    epochs=20,
    lr=1e-4,
    time_limit=3600
)

# =============================================
# Multimodal: Tabular + Text + Image
# =============================================
multimodal_data = pd.DataFrame({
    'product_name': ['Laptop Pro 15', 'Budget Phone X', 'Smart Speaker'],
    'description': [
        'High-performance laptop with stunning display',
        'Affordable mobile device with basic features',
        'Voice-controlled smart home assistant'
    ],
    'category': ['electronics', 'electronics', 'smart_home'],
    'image_path': ['laptop.jpg', 'phone.jpg', 'speaker.jpg'],
    'price': [1299.99, 199.99, 49.99]  # Target
})

# Multimodal predictor automatically:
# 1. Detects text columns and applies transformers
# 2. Detects image columns and applies vision models
# 3. Handles categorical/numeric columns with tabular models
# 4. Fuses representations across modalities

multimodal_predictor = MultiModalPredictor(label='price')
multimodal_predictor.fit(
    multimodal_data,
    time_limit=3600,
    hyperparameters={
        'model.names': ['numerical_mlp', 'categorical_mlp', 'hf_text',
                        'timm_image', 'fusion_mlp'],
        'data.text.normalize_text': True,
    }
)

# Feature extraction for downstream tasks
embeddings = multimodal_predictor.extract_embedding(multimodal_data)
print(f"Embedding shape: {embeddings.shape}")
```

For multimodal data, AutoGluon employs late fusion by default:
Modality-Specific Encoders: Each modality (text, image, tabular) is processed by specialized encoders
Representation Extraction: Each encoder produces a fixed-size embedding vector
Fusion Layer: Modality embeddings are concatenated and fed through fusion MLPs
Output Head: Task-specific head (regression/classification) produces final predictions
This architecture allows AutoGluon to leverage pre-trained models (BERT, ImageNet weights) while learning dataset-specific fusion strategies.
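The four steps above can be sketched as a single forward pass. The encoders below are stand-in functions (real systems would use BERT, a CNN or ViT, and a tabular MLP), and all dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_text(texts, dim=64):
    # Stand-in for a transformer encoder: one embedding per text
    return rng.normal(size=(len(texts), dim))

def encode_image(paths, dim=64):
    # Stand-in for a CNN/ViT backbone: one embedding per image
    return rng.normal(size=(len(paths), dim))

def encode_tabular(X, dim=64):
    # Stand-in for a tabular MLP encoder
    W = rng.normal(size=(X.shape[1], dim))
    return np.tanh(X @ W)

def fusion_head(z, out_dim=1):
    # Fusion MLP + task head: concatenated embeddings -> hidden -> output
    W1 = rng.normal(size=(z.shape[1], 32))
    W2 = rng.normal(size=(32, out_dim))
    return np.maximum(z @ W1, 0) @ W2  # ReLU hidden layer

# One sample per product: text + image + tabular features
texts = ['High-performance laptop', 'Affordable phone']
images = ['laptop.jpg', 'phone.jpg']
tabular = np.array([[1299.99, 1.0], [199.99, 0.0]])

# Late fusion: encode each modality separately, then concatenate
z = np.hstack([encode_text(texts), encode_image(images), encode_tabular(tabular)])
predictions = fusion_head(z)
print(z.shape, predictions.shape)  # (2, 192) (2, 1)
```

The key property of late fusion is that each encoder can be pre-trained independently on its own modality; only the fusion layers must be learned on the combined dataset.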
While AutoGluon-Tabular runs efficiently on CPU, the text and image modules benefit heavily from GPU acceleration. BERT inference without a GPU is prohibitively slow for large datasets. Plan for GPU resources when using TextPredictor, ImagePredictor, or MultiModalPredictor.
AutoGluon's ensemble approach creates specific deployment considerations. Understanding these is crucial for production success.
```python
# AutoGluon Production Deployment Strategies
from autogluon.tabular import TabularPredictor

# =============================================
# Model Saving and Loading
# =============================================

# Training with explicit path
predictor = TabularPredictor(
    label='target',
    path='./production_models/my_model'
).fit(train_data, presets='high_quality')

# Model is automatically saved during and after training
model_path = predictor.path
print(f"Model saved to: {model_path}")

# Loading for inference
loaded_predictor = TabularPredictor.load('./production_models/my_model')
predictions = loaded_predictor.predict(new_data)

# =============================================
# Model Cloning for Smaller Deployment
# =============================================

# Full model may be large (GB) due to all trained models
# Clone with only the best model for smaller deployment
predictor.clone_for_deployment(
    path='./production_models/my_model_small',
    model='WeightedEnsemble_L2',  # Or a single model name
)

# Even smaller: single best non-ensemble model
leaderboard = predictor.leaderboard()
best_single_model = leaderboard[
    ~leaderboard['model'].str.contains('Ensemble')
].iloc[0]['model']
predictor.clone_for_deployment(
    path='./production_models/single_model',
    model=best_single_model,
)

# =============================================
# Optimizing Inference Speed
# =============================================

# Refit on full data (no validation split) for final deployment
predictor.refit_full(model='best')

# Compile models for faster inference (experimental)
predictor.compile_models(compiler='onnx')  # If supported

# Get inference time estimates
inference_times = predictor.get_model_best_info()
print(f"Inference time per sample: {inference_times}")

# =============================================
# Serving Options
# =============================================

# Option 1: Direct Python serving
class AutoGluonPredictor:
    def __init__(self, model_path):
        self.predictor = TabularPredictor.load(model_path)

    def predict(self, data):
        return self.predictor.predict(data)

# Option 2: FastAPI endpoint
'''
from fastapi import FastAPI
import pandas as pd

app = FastAPI()
predictor = TabularPredictor.load('./model')

@app.post("/predict")
async def predict(data: dict):
    df = pd.DataFrame([data])
    prediction = predictor.predict(df)
    return {"prediction": prediction.tolist()[0]}
'''

# Option 3: SageMaker deployment (AWS)
# AutoGluon has native SageMaker integration
"""
from autogluon.cloud import TabularCloudPredictor

cloud_predictor = TabularCloudPredictor(
    cloud_output_path='s3://bucket/autogluon-models/'
)
cloud_predictor.fit(train_data, instance_type='ml.m5.2xlarge')
cloud_predictor.deploy(instance_type='ml.m5.large')
predictions = cloud_predictor.predict(test_data)
"""
```

Use clone_for_deployment to create minimal inference artifacts, and use predict_proba to obtain calibrated probabilities for monitoring.

Often, 80% of ensemble accuracy comes from 20% of the models. Analyze the model weights in your final ensemble—frequently, LightGBM and CatBoost dominate. Deploying just those two models may sacrifice only 0.1-0.5% accuracy while dramatically reducing complexity.
Understanding AutoGluon's position relative to alternatives helps in system selection. Each AutoML framework makes different trade-offs.
| Dimension | AutoGluon | Auto-sklearn | H2O AutoML |
|---|---|---|---|
| Philosophy | Train everything, ensemble intelligently | Search for optimal configuration | Balanced search + stacking |
| HPO Strategy | Minimal (portfolios + defaults) | Extensive (SMAC Bayesian) | Grid + random + early stopping |
| Ensembling | Multi-layer stacking | Post-hoc greedy selection | Single-layer stacking |
| Modalities | Tabular, text, image, multimodal | Tabular only | Tabular only |
| GPU Support | Yes (NN, text, image) | No | Yes (XGBoost, Deep Learning) |
| Explainability | Feature importance, SHAP integration | Model-level interpretability | Variable importance, MOJO export |
| Best For | Kaggle competitions, multimodal, high accuracy | Academic benchmarks, interpretability | Enterprise production, scalability |
Ideal Scenarios:
Less Ideal Scenarios:
AutoGluon's core insight—that ensemble construction can substitute for hyperparameter search—has influenced the entire AutoML field. Its success demonstrated that well-engineered defaults combined with intelligent stacking often outperform sophisticated search methods, challenging practitioners to reconsider where to invest compute resources.
For practitioners, AutoGluon represents the ease-of-use frontier of AutoML: minimal configuration, maximum automation, competitive accuracy. Its multimodal capabilities extend this philosophy beyond tabular data, providing a unified framework for modern ML challenges.
You now understand AutoGluon's multi-layer stacking architecture, quality presets, time management, multimodal capabilities, and production deployment strategies. You can evaluate when AutoGluon is the optimal choice for your AutoML needs. Next, we explore H2O AutoML, an enterprise-focused system with different design priorities.