Scalable Hyperparameter-Divergent Ensemble Training with Automatic Learning Rate Exploration for Large Models

TL;DR

Proposes HDET, a method that improves the optimization quality and generalization of large models via automatic learning rate exploration.

cs.LG · Advanced · 2026-04-28
Hailing Cheng Tao Huang Chen Zhu Antonio Alonso
hyperparameter learning rate large models automation parallel computing

Key Findings

Methodology

This paper introduces a novel method called Hyperparameter-Divergent Ensemble Training (HDET), which repurposes existing GPU replicas for simultaneous learning rate exploration without additional hardware overhead. HDET operates in alternating phases: during the fan-out phase, each replica trains independently under a symmetric spread of learning rates; during the converge phase, parameters are averaged across all replicas via AllReduce every T steps. Building on this ensemble substrate, the paper also proposes an automatic learning rate controller that uses the relative training loss across replicas as a performance signal, updating the shared base schedule toward higher-performing configurations via a momentum-based gradient-free meta-update. This method generates a self-adapting learning rate schedule that improves both optimization quality and generalization without additional hyperparameter sweeps or training budget.
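
To make the alternating fan-out/converge cycle concrete, the following is a minimal sketch of a training loop organized this way, assuming a standard torch.distributed data-parallel setup. The helper names (replica_lr, average_parameters, hdet_train), the geometric spacing of the learning rate spread, and the loss function are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def replica_lr(base_lr, spread_ratio, rank, world_size):
    """Symmetric learning rate for this replica (log-spaced spread is an assumption)."""
    offset = rank - (world_size - 1) / 2      # e.g. -3.5 ... +3.5 for 8 replicas
    return base_lr * spread_ratio ** offset

@torch.no_grad()
def average_parameters(model):
    """Converge step: AllReduce-average parameters across all replicas."""
    world = dist.get_world_size()
    for p in model.parameters():
        dist.all_reduce(p.data, op=dist.ReduceOp.SUM)
        p.data.div_(world)

def hdet_train(model, optimizer, loader, base_lr, spread_ratio=1.2, T=100):
    rank, world = dist.get_rank(), dist.get_world_size()
    for step, (x, y) in enumerate(loader):
        # Fan-out: each replica follows its own learning rate (no gradient sync).
        for group in optimizer.param_groups:
            group["lr"] = replica_lr(base_lr, spread_ratio, rank, world)
        loss = F.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Converge: merge replica parameters via AllReduce every T steps.
        if (step + 1) % T == 0:
            average_parameters(model)
```

Note that during fan-out the replicas intentionally do not synchronize gradients, which is why the sketch operates on a bare model rather than a DDP-wrapped one.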

Key Results

  • The HDET method significantly improves final model quality and convergence speed on production-scale training tasks. For example, in experiments using 8 H100 GPUs, HDET achieved a training loss of 3.277, compared to the baseline model's 3.294.
  • Through the automatic learning rate controller, HDET autonomously discovers the decay ordering of learning rates for each parameter group without manual tuning, and performs well in a large-scale production recommendation system.
  • Experiments show that HDET trains stably under high learning rates, whereas traditional DDP would diverge under the same conditions.

Significance

The HDET method is significant in large model training, especially in scenarios requiring efficient exploration of hyperparameter space. Traditional methods often require extensive computational resources for grid searches, whereas HDET achieves zero hardware overhead hyperparameter exploration by leveraging existing GPU replicas. This method not only improves optimization quality and generalization but also offers a new perspective for large-scale distributed training, potentially influencing future deep learning framework designs.

Technical Contribution

The technical contribution of the HDET method lies in transforming existing DDP replicas into a structured learning rate exploration ensemble without additional hardware overhead. Through the fan-out/converge cycle and automatic learning rate controller, HDET achieves online adaptive learning rate adjustment, eliminating the need for a priori schedule selection in traditional methods. Additionally, the generality of the HDET framework allows exploration of any scalar hyperparameter that does not alter model architecture, such as dropout rate, attention temperature, and weight decay coefficient.
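
Since the framework is described as generalizing to any scalar hyperparameter that leaves the architecture unchanged, the sketch below spreads weight decay and dropout across replicas in the same spirit. The geometric spacing and helper names are illustrative assumptions, not part of the paper.

```python
import torch.nn as nn

def spread_value(base, ratio, rank, world_size):
    """Symmetric geometric spread of a scalar hyperparameter across replicas (illustrative)."""
    offset = rank - (world_size - 1) / 2
    return base * ratio ** offset

def apply_replica_hparams(model, optimizer, rank, world_size):
    # Per-replica weight decay: divergent during fan-out, reconciled by parameter averaging.
    for group in optimizer.param_groups:
        group["weight_decay"] = spread_value(1e-2, 1.1, rank, world_size)
    # Per-replica dropout rate: another scalar knob that leaves the architecture unchanged.
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.p = spread_value(0.1, 1.1, rank, world_size)
```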

Novelty

HDET is the first method to utilize existing GPU replicas for simultaneous learning rate exploration. Unlike existing methods, HDET does not rely on gradient information but uses inter-replica loss differences as signals for gradient-free optimization. This innovative exploration approach not only enhances training stability and efficiency but also provides new insights for hyperparameter optimization.

Limitations

  • The stability of the HDET method under high learning rates relies on periodic parameter averaging, which may increase communication overhead in certain scenarios.
  • Although HDET can explore multiple hyperparameters, its performance might be influenced by specific tasks or datasets, requiring further validation of its generality.
  • In some cases, the automatic learning rate controller may not quickly adapt to extreme learning rate changes, leading to instability in the initial training phase.

Future Work

Future research directions include further optimizing the communication efficiency of HDET and exploring its applicability across different tasks and datasets. Additionally, combining it with other optimization algorithms, such as Adam's per-parameter adaptation, might further enhance HDET's performance. Investigating the application of HDET on larger-scale datasets is also a promising direction.

AI Executive Summary

In the field of deep learning, the choice of learning rate is crucial for the training effectiveness of large models. However, existing methods typically require the learning rate schedule to be fixed before training, limiting the model's adaptability during the training process. Traditional grid search methods are not only time-consuming and resource-intensive but may also become suboptimal as the scale of the model or dataset changes.

To address this issue, this paper proposes the Hyperparameter-Divergent Ensemble Training (HDET) method. HDET repurposes existing GPU replicas for simultaneous learning rate exploration, achieving zero hardware overhead hyperparameter optimization. The method operates in alternating phases: during the fan-out phase, each replica trains independently under a symmetric spread of learning rates; during the converge phase, parameters are averaged across all replicas via AllReduce every T steps.

The core technical principle of HDET lies in its automatic learning rate controller, which uses the relative training loss across replicas as a performance signal, updating the shared base schedule toward higher-performing configurations via a momentum-based gradient-free meta-update. This innovative exploration approach not only enhances training stability and efficiency but also provides new insights for hyperparameter optimization.
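
The paper describes the controller qualitatively; the following is a minimal sketch of one plausible realization of a momentum-based, gradient-free meta-update operating in log learning rate space. The particular weighting of replica losses and the constants beta and eta are assumptions, not the paper's exact formula.

```python
def auto_lr_update(base_log_lr, offsets, losses, state, beta=0.9, eta=0.05):
    """Momentum-based, gradient-free meta-update of the shared base schedule (sketch).

    offsets : per-replica log-LR offsets relative to the base (symmetric spread)
    losses  : per-replica training losses from the last fan-out phase
    state   : dict persisting the momentum buffer across converge steps
    """
    mean_loss = sum(losses) / len(losses)
    # Zero-order signal: replicas with below-average loss pull the base schedule
    # toward their offset; above-average replicas push it away.
    signal = sum(o * (mean_loss - l) for o, l in zip(offsets, losses)) / len(losses)
    state["m"] = beta * state.get("m", 0.0) + (1.0 - beta) * signal
    return base_log_lr + eta * state["m"]


# Hypothetical usage: with 4 replicas, the higher-LR replicas achieved lower loss,
# so the base learning rate is nudged upward in log space.
state = {}
new_log_lr = auto_lr_update(
    base_log_lr=-8.1,                  # roughly log(3e-4)
    offsets=[-1.5, -0.5, 0.5, 1.5],
    losses=[3.31, 3.30, 3.28, 3.27],
    state=state,
)
```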

Experimental results show that HDET significantly improves final model quality and convergence speed on production-scale training tasks. For example, in experiments using 8 H100 GPUs, HDET achieved a training loss of 3.277, compared to the baseline model's 3.294. Additionally, HDET autonomously discovers the decay ordering of learning rates for each parameter group without manual tuning.

The HDET method is significant in large model training, especially in scenarios requiring efficient exploration of hyperparameter space. Traditional methods often require extensive computational resources for grid searches, whereas HDET achieves zero hardware overhead hyperparameter exploration by leveraging existing GPU replicas. This method not only improves optimization quality and generalization but also offers a new perspective for large-scale distributed training.

Despite its many strengths, the stability of the HDET method under high learning rates relies on periodic parameter averaging, which may increase communication overhead in certain scenarios. Future research directions include further optimizing the communication efficiency of HDET and exploring its applicability across different tasks and datasets.

Deep Analysis

Background

In recent years, as the scale of deep learning models continues to grow, the choice of learning rate has become increasingly critical to the effectiveness of model training. Traditional learning rate schedules, such as one-cycle annealing, cosine decay, and linear warmup-decay, typically need to be fixed before training, limiting the model's adaptability during the training process. Moreover, methods like grid search, while helpful in finding optimal learning rate schedules, are computationally expensive and may become suboptimal as the scale of the model or dataset changes. To address these challenges, researchers have been exploring new methods for adaptive learning rate adjustment, such as Hypergradient Descent, L4, and Schedule-Free, yet these methods largely rely on gradient information and fail to fully utilize existing hardware resources.

Core Problem

In large-scale model training, efficiently exploring the learning rate space is a key challenge. Traditional learning rate schedules need to be fixed before training, limiting the model's adaptability during the training process. Moreover, methods like grid search, while helpful in finding optimal learning rate schedules, are computationally expensive and may become suboptimal as the scale of the model or dataset changes. Therefore, achieving simultaneous exploration and adaptive adjustment of learning rates without increasing hardware overhead is a pressing issue that needs to be addressed.

Innovation

The core innovation of the HDET method lies in its use of existing GPU replicas for simultaneous learning rate exploration without additional hardware overhead. Specifically:

1. HDET operates in alternating fan-out and converge phases, allowing each replica to train independently under a symmetric spread of learning rates and merge parameters via AllReduce.

2. The automatic learning rate controller uses inter-replica relative training loss as a performance signal, adjusting the shared base schedule via a momentum-based gradient-free meta-update.

3. The generality of the HDET framework allows exploration of any scalar hyperparameter that does not alter model architecture, such as dropout rate, attention temperature, and weight decay coefficient.

Methodology

The implementation of the HDET method includes the following steps:

  • Initialization phase: assign each GPU replica a learning rate drawn from a symmetric spread around the shared base schedule.
  • Fan-out phase: each replica trains independently under its assigned learning rate, exploring a distinct learning rate trajectory.
  • Converge phase: every T steps, parameters are averaged across all replicas via AllReduce, merging the replica parameters.
  • Automatic learning rate controller: uses the inter-replica relative training loss as a performance signal, adjusting the shared base schedule via a momentum-based gradient-free meta-update (a sketch of how this loss signal can be gathered follows this list).
  • Periodic parameter averaging: prevents training divergence under high learning rates.
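
As a complement to the list above, here is a hedged sketch of how the per-replica loss signal could be collected at a converge boundary and handed to the controller. The paper does not spell out this plumbing; the use of all_gather and the scalar-per-replica encoding are assumptions.

```python
import torch
import torch.distributed as dist

def gather_replica_losses(running_loss: float) -> list:
    """Collect each replica's mean training loss from the last fan-out phase.

    The controller only needs relative losses across replicas, so a single
    scalar per replica suffices; this all_gather is negligible next to the
    AllReduce used for parameter averaging.
    """
    world = dist.get_world_size()
    local = torch.tensor([running_loss], dtype=torch.float32)
    buckets = [torch.zeros_like(local) for _ in range(world)]
    dist.all_gather(buckets, local)
    return [b.item() for b in buckets]
```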

Experiments

The experimental design includes multiple tests conducted on a production-scale recommendation system using one year of user-item interaction log data. Models are jointly trained on three engagement tasks, with baselines including standard DDP and different learning rate configurations. Key hyperparameters include the maximum learning rate, spread ratio, and parameters of the automatic learning rate controller. Ablation studies were also conducted to verify the independent contributions of each component.

Results

Experimental results show that HDET significantly improves final model quality and convergence speed on production-scale training tasks. For example, in experiments using 8 H100 GPUs, HDET achieved a training loss of 3.277, compared to the baseline model's 3.294. Additionally, HDET autonomously discovers the decay ordering of learning rates for each parameter group without manual tuning. These results demonstrate that HDET trains stably under high learning rates, whereas traditional DDP would diverge under the same conditions.

Applications

The HDET method performs excellently in the production environment of large-scale recommendation systems, especially in scenarios requiring efficient exploration of hyperparameter space. Its zero hardware overhead characteristic makes it applicable to various tasks requiring large-scale distributed training, such as natural language processing and computer vision. Furthermore, the generality of HDET allows exploration of any scalar hyperparameter that does not alter model architecture, further expanding its application scope.

Limitations & Outlook

Despite its many strengths, the stability of the HDET method under high learning rates relies on periodic parameter averaging, which may increase communication overhead in certain scenarios. Additionally, although HDET can explore multiple hyperparameters, its performance might be influenced by specific tasks or datasets, requiring further validation of its generality. Future research directions include further optimizing the communication efficiency of HDET and exploring its applicability across different tasks and datasets.

Plain Language (Accessible to non-experts)

Imagine you're in a kitchen with a group of chefs, each cooking different dishes. Every chef has their own recipe and spice ratios, but they gather periodically to share their experiences and techniques. This is like the fan-out and converge phases in the HDET method. Each GPU replica is like a chef, independently training under different learning rates, akin to using different spice ratios in cooking. Periodic parameter averaging is like the chefs coming together to share experiences, ensuring each dish reaches its best flavor. The automatic learning rate controller is like a head chef, adjusting the spice ratios based on the taste of each dish, ensuring every dish achieves optimal taste. Through this approach, the HDET method achieves simultaneous exploration and adaptive adjustment of learning rates, much like continuously optimizing the flavor of each dish in the kitchen.

ELI14 (Explained like you're 14)

Hey there! Imagine you're playing an online multiplayer game where everyone has their own character and skills. You're all on the same map, but each person has different missions and strategies. The HDET method is like a new feature in this game that lets each player explore different skill combinations without changing their character. Every now and then, all players gather to share their experiences and tactics, like having a little meeting in the game to discuss the next strategy. This method not only helps each player make the most of their character but also helps the whole team perform better in the game. In this way, the HDET method helps large models find the best learning rate combinations during training, just like finding the best skill combinations in the game.

Glossary

Hyperparameter-Divergent Ensemble Training (HDET)

A method that repurposes existing GPU replicas for simultaneous learning rate exploration, achieving zero hardware overhead hyperparameter optimization through fan-out and converge phases.

The core method proposed in this paper to improve optimization quality and generalization of large model training.

Fan-out phase

A phase in the HDET method where each GPU replica trains independently under a symmetric spread of learning rates, exploring different learning rate trajectories.

Used to explore different learning rate configurations during training.

Converge phase

A phase in the HDET method where parameters are averaged across all replicas via AllReduce every T steps, merging all replica parameters.

Used to prevent training divergence under high learning rates.

Automatic learning rate controller

A component in the HDET method that uses inter-replica relative training loss as a performance signal, adjusting the shared base schedule via a momentum-based gradient-free meta-update.

Used to achieve adaptive learning rate adjustment.

AllReduce

A communication operation in distributed computing used to average parameters across multiple GPUs.

Used in the HDET method to merge all replica parameters.

Learning rate schedule

A strategy used to adjust the learning rate during training, such as one-cycle annealing, cosine decay, and linear warmup-decay.

Traditionally needs to be fixed before training, limiting model adaptability.

Gradient-free optimization

An optimization method that does not rely on gradient information, using other signals for parameter adjustment.

Used in the HDET method for learning rate adjustment.

Momentum-based meta-update

A method of parameter update using momentum information, commonly used in optimization algorithms.

Used in the HDET method for learning rate adjustment.

Spread ratio

A parameter in the HDET method used to define the range of learning rates, determining the configuration of learning rates during the fan-out phase.

Used to control the exploration range of learning rates.

Periodic parameter averaging

A method of preventing training divergence by periodically averaging parameters.

Used in the HDET method to improve training stability.

Production-scale recommendation system

A system used for large-scale recommendation tasks, typically requiring efficient hyperparameter optimization.

The experimental environment for the HDET method.

Ablation study

An experimental method of verifying the independent contribution of certain components by removing or modifying them.

Used to verify the independent contribution of each component in the HDET method.

High learning rate

Using a relatively large learning rate during training, which can accelerate convergence but may lead to instability.

Prevented from divergence in the HDET method through periodic parameter averaging.

Learning rate decay ordering

The decay order of learning rates for different parameter groups during training, affecting model optimization.

Autonomously discovered by the automatic learning rate controller in the HDET method.

Large-scale distributed training

A method of training models in parallel on multiple GPUs, typically used for training large models.

The application scenario for the HDET method.

Open Questions (Unanswered questions from this research)

  1. How can the communication efficiency of the HDET method be further optimized to reduce the overhead introduced by periodic parameter averaging? The current implementation may increase communication burden in certain scenarios, requiring exploration of more efficient parameter synchronization strategies.
  2. What is the generality of the HDET method across different tasks and datasets? While it performs well in production-scale recommendation systems, its applicability in other domains remains to be validated.
  3. How does the automatic learning rate controller adapt to extreme learning rate changes? Instability may occur in the initial phase, necessitating further research into its performance under different learning rate conditions.
  4. How can the HDET method be combined with other optimization algorithms, such as Adam's per-parameter adaptation, to further enhance its performance? This could provide new insights for hyperparameter optimization.
  5. What is the effect of applying the HDET method on larger-scale datasets? As the dataset scale increases, the performance and stability of HDET may be affected, requiring further research.

Applications

Immediate Applications

Large-scale Recommendation Systems

The HDET method can be used in the production environment of large-scale recommendation systems, efficiently exploring hyperparameter space to improve model optimization quality and generalization.

Natural Language Processing

In natural language processing tasks, the HDET method can help explore different learning rate configurations, improving model performance on large-scale datasets.

Computer Vision

The HDET method is also applicable in computer vision tasks, especially in scenarios requiring large-scale distributed training.

Long-term Vision

Automated Hyperparameter Optimization

The generality of the HDET method allows exploration of any scalar hyperparameter that does not alter model architecture, potentially becoming a standard method for automated hyperparameter optimization in the future.

Deep Learning Framework Design

The HDET method offers a new perspective for large-scale distributed training, potentially influencing future deep learning framework designs and promoting more efficient training methods.

Abstract

Training large neural networks with data-parallel stochastic gradient descent allocates N GPU replicas to compute effectively identical updates -- a practice that leaves the rich space of learning rate configurations entirely unexplored during training. We propose Hyperparameter-Divergent Ensemble Training (HDET), a method that repurposes these replicas for simultaneous learning rate exploration at negligible communication overhead. HDET operates in alternating phases: a fan-out stage in which replicas train independently under a structured, symmetric spread of learning rates, and a converge stage in which parameters are averaged across all replicas via AllReduce every T steps. Building on this ensemble substrate, we further propose an automatic learning rate (auto-LR) controller that treats the relative training loss across replicas as a performance signal, updating the shared base schedule toward higher-performing configurations via a momentum-based gradient-free meta-update. The combined method produces a self-adapting learning rate schedule that improves both optimization quality and generalization without additional hyperparameter sweeps or training budget. Crucially, the framework generalizes beyond learning rate: any scalar hyperparameter that does not alter model architecture -- such as dropout rate, attention scale temperature, or weight-decay coefficient -- can be explored across replicas using the same fan-out/converge protocol, with inter-replica loss differences serving as zero-order hypergradients that guide the search direction. HDET is implemented as a drop-in replacement for PyTorch's OneCycleLR scheduler, requiring no changes to model architecture, optimizer, or data pipeline.

cs.LG cs.AI

References (20)

  • Adam Paszke, Sam Gross, Francisco Massa et al. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library.
  • Balaji Lakshminarayanan, A. Pritzel, C. Blundell (2016). Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles.
  • Arthur Douillard, Qixuang Feng, Andrei A. Rusu et al. (2023). DiLoCo: Distributed Low-Communication Training of Language Models.
  • Michal Rolinek, G. Martius (2018). L4: Practical loss-based stepsize adaptation for deep learning.
  • Aaron Defazio, Konstantin Mishchenko (2023). Learning-Rate-Free Learning by D-Adaptation.
  • Daan Wierstra, T. Schaul, Jan Peters et al. (2008). Natural Evolution Strategies.
  • Sixin Zhang, A. Choromańska, Yann LeCun (2014). Deep learning with Elastic Averaging SGD.
  • Farzin Haddadpour, Mohammad Mahdi Kamani, M. Mahdavi et al. (2019). Local SGD with Periodic Averaging: Tighter Analysis and Adaptive Synchronization.
  • Pavel Izmailov, Dmitrii Podoprikhin, T. Garipov et al. (2018). Averaging Weights Leads to Wider Optima and Better Generalization.
  • Léonard Blier, Pierre Wolinski, Y. Ollivier (2018). Learning with Random Learning Rates.
  • I. Loshchilov, F. Hutter (2016). SGDR: Stochastic Gradient Descent with Warm Restarts.
  • Max Jaderberg, Valentin Dalibard, Simon Osindero et al. (2017). Population Based Training of Neural Networks.
  • Maor Ivgi, Oliver Hinder, Y. Carmon (2023). DoG is SGD's Best Friend: A Parameter-Free Dynamic Step Size Schedule.
  • Vipul Gupta, Santiago Akle Serrano, D. DeCoste (2020). Stochastic Weight Averaging in Parallel: Large-Batch Training that Generalizes Well.
  • L. Smith, Nicholay Topin (2018). Super-convergence: very fast training of neural networks using large learning rates.
  • Stanislav Fort, Huiyi Hu, Balaji Lakshminarayanan (2019). Deep Ensembles: A Loss Landscape Perspective.
  • Aaron Defazio, Xingyu Yang, Harsh Mehta et al. (2024). The Road Less Scheduled.
  • Jonathan Ho, Ajay Jain, P. Abbeel (2020). Denoising Diffusion Probabilistic Models.
  • Mitchell Wortsman, Gabriel Ilharco, S. Gadre et al. (2022). Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time.
  • A. G. Baydin, R. Cornish, David Martínez-Rubio et al. (2017). Online Learning Rate Adaptation with Hypergradient Descent.