Neural architecture search, hyperparameter optimization, and when AutoML makes sense
AutoML promises to automate the tedious parts of machine learning model development: feature engineering, architecture design, and hyperparameter tuning. Yet many practitioners find AutoML disappointing: slow, expensive, and sometimes producing solutions worse than those of domain experts who understand their data.
This guide cuts through the hype to explain what AutoML actually does, when it helps, and how to integrate it practically into your workflow.
Hyperparameter optimization (HPO): the most mature and widely useful AutoML component. It tunes learning rates, regularization strengths, tree depths, and other settings that significantly impact model performance.
Neural architecture search (NAS): automatically designing neural network architectures. Much more computationally expensive than HPO, but it can discover novel designs.
Automated feature engineering: automated generation of input features from raw data, including entity embedding learning, polynomial features, and interaction terms.
Model selection: automatically comparing different model families (XGBoost vs. LightGBM vs. Random Forest) and selecting the best performer.
Grid search is the brute-force approach: try every combination.
Search space: lr ∈ {0.001, 0.01, 0.1}, depth ∈ {4, 6, 8}
Total: 3 × 3 = 9 configurations
Pros: Exhaustive, parallelizable
Cons: Exponential in number of hyperparameters, inefficient
Grid search is inefficient for continuous hyperparameters and scales poorly. Most practitioners use it only with very few hyperparameters.
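For reference, the 3 × 3 grid above maps directly onto scikit-learn's GridSearchCV. This is a minimal sketch; the gradient-boosted estimator and the synthetic dataset are placeholder assumptions, not part of the original example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)  # placeholder data

# 3 learning rates x 3 depths = 9 configurations, each cross-validated
param_grid = {"learning_rate": [0.001, 0.01, 0.1], "max_depth": [4, 6, 8]}
search = GridSearchCV(GradientBoostingClassifier(), param_grid, cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```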
Random search samples configurations uniformly at random from the search space:
Pros:
- Simple to implement
- Finds good solutions faster than grid for continuous params
- Naturally parallelizable
Cons:
- No systematic exploration
- May miss optimal regions
Bergstra and Bengio (2012) showed that random search often outperforms grid search with the same number of trials, because grid search wastes evaluations on dimensions that turn out not to matter.
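A comparable sketch for random search, again assuming scikit-learn (with SciPy distributions for continuous sampling) and placeholder data:

```python
from scipy.stats import loguniform, randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)  # placeholder data

param_distributions = {
    "learning_rate": loguniform(1e-3, 1e-1),  # sampled log-uniformly
    "max_depth": randint(4, 9),               # integers 4..8
}
search = RandomizedSearchCV(GradientBoostingClassifier(), param_distributions,
                            n_iter=9, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```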
Bayesian optimization is often the most practical method for HPO. It builds a surrogate model of the objective function and selects the next configuration to try using an acquisition function such as expected improvement (a minimal sketch follows the list below):
1. Start with random configurations
2. Fit Gaussian Process (or similar) to observations
3. Compute acquisition function (e.g., Expected Improvement)
4. Select and evaluate the next configuration
5. Update the surrogate model with the new observation
6. Repeat
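The loop above can be written from scratch in a few lines. The sketch below uses a toy one-dimensional objective (learning rate only) with scikit-learn's Gaussian process regressor as the surrogate; it is for illustration, not a production optimizer.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(lr):
    # toy "validation loss" as a function of log10(lr), with a little noise
    return (np.log10(lr) + 2.0) ** 2 + 0.1 * np.random.randn()

# 1. Start with random configurations
rng = np.random.default_rng(0)
X = rng.uniform(-5, -1, size=(5, 1))                 # log10(lr) in [-5, -1]
y = np.array([objective(10 ** x[0]) for x in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
candidates = np.linspace(-5, -1, 200).reshape(-1, 1)

for _ in range(20):
    # 2. Fit the Gaussian Process to observations so far
    gp.fit(X, y)
    # 3. Compute Expected Improvement over a candidate grid (minimization)
    mu, sigma = gp.predict(candidates, return_std=True)
    best = y.min()
    imp = best - mu
    z = imp / (sigma + 1e-9)
    ei = imp * norm.cdf(z) + sigma * norm.pdf(z)
    # 4. Select and evaluate the next configuration
    x_next = candidates[np.argmax(ei)]
    y_next = objective(10 ** x_next[0])
    # 5. Update the surrogate's data; 6. repeat
    X = np.vstack([X, x_next])
    y = np.append(y, y_next)

print("best log10(lr):", X[np.argmin(y), 0])
```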
| Method | Efficiency | Scalability | Best For |
|---|---|---|---|
| Grid Search | Low | Low (exponential) | Few categorical params |
| Random Search | Moderate | Moderate | Quick baselines |
| Bayesian (GP) | High | Low-Moderate | Continuous params, <20 dims |
| Bayesian (RF/TPE) | High | Moderate-High | Mixed param types |
| Hyperband/ASHA | High | High | Long training times |
For expensive training runs, early stopping of unpromising trials dramatically improves efficiency:
ASHA (Asynchronous Successive Halving):
1. Randomly assign trials to rungs (budget levels)
2. Run trials for minimum budget
3. Keep top 1/η, discard rest
4. Increase budget for survivors
5. Repeat until convergence
Result: 10-100x faster than random search for neural networks
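A tiny synchronous successive-halving sketch (ASHA is the asynchronous variant, which promotes survivors without waiting for a full rung to finish). The `fake_train` function is a stand-in assumption for real training at a given budget.

```python
import random

def successive_halving(configs, train_for, min_budget=1, eta=3, rounds=3):
    """Synchronous successive halving; lower score is better."""
    budget, survivors = min_budget, configs
    for _ in range(rounds):
        # Run every surviving trial at the current budget
        scores = {c: train_for(c, budget) for c in survivors}
        # Keep the top 1/eta, discard the rest
        k = max(1, len(survivors) // eta)
        survivors = sorted(survivors, key=lambda c: scores[c])[:k]
        # Increase the budget for the survivors
        budget *= eta
    return survivors[0]

# toy usage: "configs" are learning rates, training longer reduces noise
def fake_train(lr, epochs):
    return abs(lr - 0.01) + random.random() / epochs

best = successive_halving([0.001, 0.003, 0.01, 0.03, 0.1, 0.3], fake_train)
print(best)
```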
NAS automates the design of neural network architectures. It's computationally expensive but can discover designs that outperform human-engineered alternatives.
NAS operates on a search space defined by the researcher; the main search strategies are evolutionary algorithms, reinforcement learning, and gradient-based (differentiable) methods.
Evolutionary search evolves architectures through mutation and crossover:
1. Start with random population of architectures
2. Train each architecture, measure fitness
3. Select top performers
4. Generate offspring via mutation/crossover
5. Replace weakest performers
6. Repeat
AmoebaNet (Real et al., 2019) used evolutionary search and found architectures matching human designs on ImageNet.
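A toy regularized-evolution loop in the spirit of AmoebaNet. The architecture encoding and the `fitness` function are placeholder assumptions; real fitness would require training each candidate.

```python
import random

OPS = ["conv3x3", "conv5x5", "maxpool", "identity"]

def random_arch(n_layers=6):
    return [random.choice(OPS) for _ in range(n_layers)]

def mutate(arch):
    child = list(arch)
    child[random.randrange(len(child))] = random.choice(OPS)
    return child

def fitness(arch):
    # placeholder: in practice, train the architecture and return validation accuracy
    return sum(op != "identity" for op in arch) + random.random()

# Tournament selection, mutate the winner, drop the oldest member (age-based
# removal, as in regularized evolution)
population = [random_arch() for _ in range(20)]
scores = [fitness(a) for a in population]
for _ in range(100):
    sample_idx = random.sample(range(len(population)), 5)
    parent = population[max(sample_idx, key=lambda i: scores[i])]
    child = mutate(parent)
    population.append(child)
    scores.append(fitness(child))
    population.pop(0)
    scores.pop(0)

best = population[max(range(len(population)), key=lambda i: scores[i])]
print(best)
```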
Reinforcement-learning-based NAS trains a controller network that generates architectures:
Controller: RNN that outputs architecture description
Reward: Validation accuracy of generated architecture
Training: Policy gradient (REINFORCE) to maximize expected reward
Result: Controller learns to design good architectures
NASNet (Zoph et al., 2017) used RL to discover architectures that outperformed human designs on CIFAR-10 and ImageNet.
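A minimal REINFORCE sketch with a tabular "controller" (independent softmax logits per layer rather than an RNN) and a placeholder reward; it only illustrates the policy-gradient update, not the original NASNet setup.

```python
import numpy as np

OPS = ["conv3x3", "conv5x5", "maxpool", "identity"]
N_LAYERS = 4
rng = np.random.default_rng(0)

# Controller: independent softmax logits per layer (stand-in for the RNN)
logits = np.zeros((N_LAYERS, len(OPS)))

def sample_arch():
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return [rng.choice(len(OPS), p=probs[i]) for i in range(N_LAYERS)]

def reward(arch):
    # placeholder: in practice, train the sampled architecture and return val accuracy
    return sum(OPS[i] != "identity" for i in arch) / N_LAYERS + 0.1 * rng.random()

baseline, step_size = 0.0, 0.1
for _ in range(200):
    arch = sample_arch()
    r = reward(arch)
    baseline = 0.9 * baseline + 0.1 * r            # moving-average baseline
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    for layer, op in enumerate(arch):
        # REINFORCE: grad of log pi(op) = onehot(op) - probs
        grad = -probs[layer]
        grad[op] += 1.0
        logits[layer] += step_size * (r - baseline) * grad

print([OPS[i] for i in sample_arch()])
```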
Gradient-based (differentiable) NAS, exemplified by DARTS, relaxes the discrete architecture choice so it can be optimized with gradient descent:
1. Define a super-network containing all possible operations
2. Relax discrete choice to weighted mixture
3. Optimize operation weights via gradient descent
4. Prune low-weight operations
5. Result: Discovered sub-network
Efficiency: 1-4 GPU days, versus thousands of GPU days for RL-based approaches
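A minimal sketch of the continuous relaxation at the heart of differentiable NAS, assuming PyTorch: a single mixed edge rather than a full DARTS cell, and omitting the alternating optimization of network weights and architecture parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Continuous relaxation of one discrete choice among candidate operations."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 5, padding=2),
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Identity(),
        ])
        # Architecture parameters: one learnable weight per candidate op
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

op = MixedOp(16)
x = torch.randn(2, 16, 8, 8)
y = op(x)                        # weighted mixture of all ops during search
best = int(op.alpha.argmax())    # discrete op kept after pruning
```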
One-shot (weight-sharing) NAS builds on a key insight: train once, then evaluate many architectures by sharing weights:
Supernet training:
- A single network contains all possible sub-networks
- Sub-networks share weights
Benefits:
- Evaluate thousands of architectures for the cost of one training run
- Performance on the shared-weight proxy often transfers to the target task
Drawback: weight sharing is an approximation; it may miss good architectures
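A structural sketch of the weight-sharing idea, assuming PyTorch: once the shared blocks are trained, any sub-network is just a path through them and can be scored without further training (the blocks here are untrained and purely illustrative).

```python
import random
import torch
import torch.nn as nn

class SuperNetBlock(nn.Module):
    """One supernet block: every candidate op keeps persistent shared weights."""
    def __init__(self, channels):
        super().__init__()
        self.candidates = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 5, padding=2),
            nn.Identity(),
        ])

    def forward(self, x, choice):
        return self.candidates[choice](x)

blocks = nn.ModuleList([SuperNetBlock(16) for _ in range(4)])

def evaluate(arch, x):
    # A sub-network is just a path through the shared blocks
    for block, choice in zip(blocks, arch):
        x = block(x, choice)
    return x

x = torch.randn(1, 16, 8, 8)
# After training the supernet once, many architectures can be scored cheaply
archs = [[random.randrange(3) for _ in range(4)] for _ in range(1000)]
outputs = [evaluate(a, x) for a in archs[:3]]
```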
AutoGluon is Amazon's AutoML framework, particularly strong for tabular data:
from autogluon.tabular import TabularDataset, TabularPredictor
train_data = TabularDataset('train.csv')
test_data = TabularDataset('test.csv')
predictor = TabularPredictor(label='target').fit(train_data)
predictions = predictor.predict(test_data)
AutoGluon automatically handles feature preprocessing and missing values, trains multiple model families (gradient-boosted trees, neural networks, and more), and combines them with stacking and bagging ensembles.
On benchmark datasets, AutoGluon often matches or beats Kaggle competition winners with zero tuning.
H2O's AutoML provides a simple interface:
import h2o
from h2o.automl import H2OAutoML
h2o.init()
train = h2o.import_file('train.csv')
y = 'target'                               # response column name
x = [c for c in train.columns if c != y]   # predictor column names
aml = H2OAutoML(max_models=20, max_runtime_secs=3600)
aml.train(x=x, y=y, training_frame=train)
leaderboard = aml.leaderboard
best_model = aml.leader
Microsoft's FLAML (Fast and Lightweight AutoML) focuses on efficiency:
from flaml import tune

def train_model(config):
    # config contains the hyperparameters to tune
    model = train_with_config(config)
    return {"val_loss": model.score(val)}

result = tune.run(train_model, config=config_space,
                  metric="val_loss", mode="min", time_budget_s=3600)
FLAML uses a novel search strategy (BlendSearch) that's more efficient than standard Bayesian optimization for large search spaces.
Ray Tune is a scalable hyperparameter tuning library:
from ray import tune
from ray.tune.schedulers import ASHAScheduler
config = {
"lr": tune.loguniform(1e-5, 1e-1),
"depth": tune.randint(4, 12),
"features": tune.choice(["all", "top_50", "top_100"])
}
results = tune.run(
    train_model,
    config=config,
    num_samples=100,
    metric="val_loss",   # must match the key reported by train_model
    mode="min",
    scheduler=ASHAScheduler()
)
1. Start with simple model (LogisticRegression, XGBoost defaults)
2. Get a working pipeline (data loading, preprocessing, evaluation)
3. Measure baseline performance
4. Only then consider AutoML
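A minimal baseline pipeline along the lines of steps 1-3 above, assuming scikit-learn; the CSV path and `target` column are placeholder assumptions. This establishes the number AutoML would have to beat.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("train.csv")                         # placeholder path
X, y = df.drop(columns=["target"]), df["target"]      # placeholder target column

# Simple default-settings model: the baseline AutoML must beat
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(baseline, X, y, cv=5)
print(f"baseline accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```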
Real parameters:
- Learning rate: loguniform(1e-5, 1e-1)
- Regularization: uniform(0, 1)
Categorical parameters:
- Optimizer: choice(["adam", "sgd", "rmsprop"])
- Activation: choice(["relu", "gelu", "silu"])
Conditional:
- If optimizer == "sgd": momentum ∈ uniform(0, 0.99)
- If optimizer == "adam": betas ∈ [(0.9, 0.999), (0.95, 0.999)]
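Define-by-run APIs express conditional parameters like these naturally. A sketch with Optuna; the objective here is a toy stand-in chosen only to make the example runnable, not a real training loop.

```python
import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)        # real, log-scaled
    reg = trial.suggest_float("reg", 0.0, 1.0)                   # real, uniform
    optimizer = trial.suggest_categorical("optimizer", ["adam", "sgd", "rmsprop"])
    penalty = 0.0
    if optimizer == "sgd":                                       # conditional parameter
        momentum = trial.suggest_float("momentum", 0.0, 0.99)
        penalty = 0.01 * (1 - momentum)
    # toy objective standing in for "train a model and return validation loss"
    return (lr - 1e-2) ** 2 + 0.1 * reg + penalty

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```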
Estimate computational cost:
Total trials = time_budget / avg_trial_time
For 1 hour budget with 5-minute trials:
Total trials ≈ 12 configurations
Bayesian optimization typically needs ~50 trials for good results
So budget: 50 × 5 min = ~4 hours minimum
Instead of training on full data for every trial:
1. Train on 10-50% of data during search
2. Train final model on full data with best config
Warning: the best configuration found on a subsample may not transfer perfectly, since optimal hyperparameters can shift with dataset size
Use cheap approximations to filter configs:
1. Train for 1 epoch, eliminate worst 50%
2. Train survivors for 10 epochs, eliminate worst 50%
3. Train survivors for full training
4. Result: ~same quality at 1/4 the cost
1. Search on small proxy task (CIFAR-10)
2. Transfer best architecture to large task (ImageNet)
3. Fine-tune transferred architecture
Cost reduction: 100-1000x for large-scale tasks
AutoML is most valuable when you lack deep ML expertise or need strong baselines quickly. For structured tabular data, frameworks like AutoGluon are mature enough to use in production. For custom architectures or specialized domains, AutoML techniques require more expertise to apply effectively.
The key is starting simple: establish a baseline with defaults, then decide if AutoML is worth the computational cost. AutoML doesn't replace understanding your data—it amplifies whatever baseline you start from.