Spend Less, Fit Better: Budget-Efficient Scaling Law Fitting via Active Experiment Selection

TL;DR

Uncertainty-aware active experiment selection fits scaling laws nearly as well as using the full experimental set while spending only about 10% of the training budget.

cs.LG 2026-04-25
Sijie Li Shanda Li Haowei Lin Weiwei Sun Ameet Talwalkar Yiming Yang
scaling laws budget optimization active experiment design uncertainty large-scale models

Key Findings

Methodology

The paper proposes an uncertainty-aware active experiment selection method for budget-constrained scaling law fitting. This method selects the most valuable experiments by maximizing extrapolation accuracy in the target region. Specifically, it uses an uncertainty objective to evaluate candidate experiments' utility and optimizes the experiment selection process through a sequential design strategy.
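The sequential loop described above can be sketched in a few lines. Everything below is a hypothetical stand-in rather than the paper's implementation: candidates are model sizes with size-proportional costs, and the uncertainty objective is approximated by classical prediction variance at the extrapolation target under a log-linear fit.

```python
import numpy as np

# Hypothetical sketch of a budget-aware sequential selection loop.
pool = np.logspace(6, 8, 15)   # candidate model sizes (assumed)
costs = pool / pool.min()      # assumed cost model: cost grows with size
target = 1e10                  # high-cost target size we cannot afford to run

def design_row(n):
    # locally linear law in log space: log L = log a - b * log n
    return np.array([1.0, np.log(n)])

def target_variance(selected):
    """Prediction variance at the target (up to the noise scale sigma^2)."""
    X = np.stack([design_row(n) for n in selected])
    xt = design_row(target)
    info = X.T @ X + 1e-9 * np.eye(2)   # ridge keeps tiny designs invertible
    return float(xt @ np.linalg.solve(info, xt))

# Seed with the two cheapest runs, then greedily add the candidate that buys
# the largest variance reduction at the target per unit of cost.
selected = [pool[0], pool[1]]
spent, budget = costs[0] + costs[1], 60.0
remaining = list(range(2, len(pool)))
while remaining:
    base = target_variance(selected)
    scores = []
    for j in remaining:
        if spent + costs[j] > budget:
            scores.append(-np.inf)   # unaffordable candidate
        else:
            gain = base - target_variance(selected + [pool[j]])
            scores.append(gain / costs[j])
    best = int(np.argmax(scores))
    if scores[best] == -np.inf:
        break                        # budget exhausted
    selected.append(pool[remaining[best]])
    spent += costs[remaining[best]]
    remaining.pop(best)

print(f"{len(selected)} runs chosen, {spent:.1f}/{budget:.0f} budget spent")
```

The per-cost greedy rule is one simple way to trade informativeness against heterogeneous run costs; the paper's actual objective is richer (see the mixture-based uncertainty terms discussed in the ablation).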

Key Results

  • Result 1: Across diverse scaling-law tasks, the method approaches the performance of fitting on the full experimental set while using only about 10% of the total training budget, significantly outperforming classical design-based baselines.
  • Result 2: In the lr&bsz (learning-rate and batch-size) task, the method reaches the low-loss region using only 1% of the budget, demonstrating superior performance in low-budget scenarios.
  • Result 3: Ablation studies show that removing the inter-basin uncertainty term has minimal impact on performance, while removing the intra-basin uncertainty term significantly degrades performance.

Significance

This research holds significant implications for academia and industry by addressing budget allocation challenges in large-scale model training. It provides an efficient experimental design method that achieves high-precision scaling law fitting under limited budgets. This approach can substantially reduce the cost of large-scale model training, encouraging more researchers and companies to adopt scaling laws for model training optimization.

Technical Contribution

Technical contributions include: 1) a novel uncertainty-aware experiment selection strategy for high-precision scaling law fitting under budget constraints; 2) a sequential design method that significantly enhances experiment selection efficiency and effectiveness; 3) a new experimental design framework applicable to diverse tasks and cost structures for effective scaling law fitting.

Novelty

This paper is the first to formalize scaling law fitting as a budget-aware sequential experimental design problem and proposes an uncertainty-aware experiment selection method. Compared to existing work, this approach achieves higher prediction accuracy under budget constraints, significantly reducing experimental costs.

Limitations

  • Limitation 1: The method relies on uncertainty evaluation during experiment selection, which may be sensitive to parameter initialization in some cases, affecting final prediction accuracy.
  • Limitation 2: Although the method performs well across various tasks, it may still experience performance degradation in specific tasks, especially when task heterogeneity is high.
  • Limitation 3: The current method primarily targets scaling law fitting problems and may need further extension to accommodate other types of experimental design problems.

Future Work

Future research directions include: 1) extending the method to accommodate a broader range of experimental design problems, such as parameter estimation for nonlinear models; 2) exploring more efficient uncertainty evaluation methods to further enhance experiment selection efficiency; 3) validating the method's effectiveness in more practical application scenarios and optimizing its adaptability across different tasks.

AI Executive Summary

In today's AI research, scaling laws have become a crucial tool for planning large-scale language model training. However, fitting these scaling laws can itself require a substantial budget. Traditionally, researchers manually select experimental configurations, conduct numerous pilot training runs, and then fit a parametric law to the resulting observations. At industrial scale this approach can consume massive budgets, especially when hundreds of training runs are needed.

This paper introduces a novel method that formalizes scaling law fitting as a budget-aware sequential experimental design problem. By selecting the most valuable experiments from a finite pool of runnable experiments, the method achieves high-precision extrapolation in the target region under budget constraints. Specifically, the paper proposes an uncertainty-aware method that maximizes prediction accuracy in the target region during experiment selection.

Across diverse scaling-law tasks, the method approaches the performance of fitting on the full experimental set while using only about 10% of the total training budget, significantly outperforming classical design-based baselines. Experimental results demonstrate that the method can achieve efficient scaling law fitting in low-budget scenarios, particularly in the lr&bsz task, where it reaches the low-loss region using only 1% of the budget.

For academia and industry alike, the practical payoff is direct: accurate scaling law fits obtained under tight budgets lower the cost of planning large-scale training runs, making scaling-law-guided optimization accessible to more researchers and companies.

However, the method relies on uncertainty evaluation during experiment selection, which may be sensitive to parameter initialization in some cases, affecting final prediction accuracy. Future research directions include extending the method to accommodate a broader range of experimental design problems and validating its effectiveness in more practical application scenarios.

Deep Analysis

Background

Scaling laws have become increasingly important in recent AI research, revealing predictable relationships among model size, data volume, and compute budget, and providing guidance for large-scale language model training. Early studies focused on model architectures, data scaling, and inference-time settings. In practice, however, fitting scaling laws remains costly and heavily reliant on manual experiment design: researchers select experimental configurations by hand, conduct numerous pilot training runs, and fit a parametric law to the resulting observations. At industrial scale this can consume massive budgets, especially when hundreds of training runs are needed.
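As a toy illustration of what fitting a parametric law involves: a pure power law L(N) = a * N^(-b) becomes linear in log-log space, so ordinary least squares recovers its parameters from a handful of pilot runs. The constants below are made up for illustration, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic pilot runs following a hypothetical power law L(N) = a * N**(-b)
a_true, b_true = 400.0, 0.076          # illustrative values only
N = np.logspace(6, 8, 8)               # pilot model sizes
L = a_true * N ** (-b_true) * np.exp(0.01 * rng.standard_normal(8))

# A power law is linear in log-log space: log L = log a - b * log N
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
a_hat, b_hat = np.exp(intercept), -slope

# Extrapolate far beyond the pilot range
N_big = 1e10
print(f"b_hat={b_hat:.3f}, predicted loss at N=1e10: {a_hat * N_big ** (-b_hat):.3f}")
```

The catch, and the motivation for careful experiment selection, is that small errors in the fitted exponent are amplified when extrapolating orders of magnitude beyond the pilot runs.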

Core Problem

The core problem of scaling law fitting lies in selecting experiments under limited budgets to ensure that the fitted scaling law extrapolates accurately in the target region. Traditional methods often rely on manual experiment selection, which becomes increasingly inefficient as task diversity and cost heterogeneity increase. Therefore, optimizing the experiment selection process under budget constraints becomes a crucial research problem.

Innovation

The core innovations of this paper include: 1) formalizing scaling law fitting as a budget-aware sequential experimental design problem; 2) proposing an uncertainty-aware experiment selection method that maximizes prediction accuracy in the target region during experiment selection; 3) significantly enhancing experiment selection efficiency and effectiveness through a sequential design strategy.

Methodology

  • Formalize scaling law fitting as a budget-aware sequential experimental design problem.
  • Propose an uncertainty-aware experiment selection method using an uncertainty objective to evaluate candidate experiments' utility.
  • Optimize the experiment selection process through a sequential design strategy to maximize prediction accuracy in the target region.
  • Validate the method's effectiveness across diverse scaling-law task benchmarks.

Experiments

The experimental design includes multiple scaling-law task benchmarks, covering pre-training hyperparameter tuning, data allocation, architecture design, sparsity, and inference-time scaling. Each task specifies a parametric law family, a finite pool of runnable candidate experiments with associated costs, and a held-out target region for evaluation. Baselines used include random selection, cheapest selection, cost-random selection, D-optimal, and V-optimal.
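For context, the two classical design criteria used as baselines can be written down directly for a linearized law. The feature map and design points below are illustrative assumptions, not the paper's setup: D-optimality maximizes the log-determinant of the information matrix (parameter precision), while V-optimality minimizes the average prediction variance over a set of target points.

```python
import numpy as np

def features(n):
    # locally linear law: log L ≈ log a - b * log n
    return np.array([1.0, np.log(n)])

def d_criterion(design):
    """D-optimality: log det of the information matrix (higher is better)."""
    X = np.stack([features(n) for n in design])
    sign, logdet = np.linalg.slogdet(X.T @ X)
    return logdet if sign > 0 else -np.inf

def v_criterion(design, targets):
    """V-optimality: mean prediction variance over targets (lower is better)."""
    X = np.stack([features(n) for n in design])
    info_inv = np.linalg.inv(X.T @ X + 1e-9 * np.eye(2))
    return float(np.mean([features(t) @ info_inv @ features(t) for t in targets]))

spread = [1e6, 1e7, 1e8]        # design points spanning the feasible range
narrow = [1e6, 1.2e6, 1.5e6]    # clustered cheap runs
targets = [1e9, 1e10]           # extrapolation region

# A spread-out design wins under both criteria
print(d_criterion(spread), d_criterion(narrow))
print(v_criterion(spread, targets), v_criterion(narrow, targets))
```

Note that neither criterion accounts for run costs or for nonlinearity of the law away from the linearization point, which is part of what the proposed method improves on.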

Results

Experimental results show that the method performs exceptionally well across diverse scaling-law tasks. It approaches the performance of fitting on the full experimental set while using only about 10% of the total training budget, significantly outperforming classical design-based baselines. Notably, in the lr&bsz task, the method reaches the low-loss region using only 1% of the budget, demonstrating superior performance in low-budget scenarios.

Applications

The method can be directly applied to optimize large-scale language model training, especially under budget constraints. By optimizing the experiment selection process, researchers and companies can achieve high-precision scaling law fitting under limited budgets, reducing the cost of large-scale model training.

Limitations & Outlook

Despite the method's excellent performance across various tasks, it may still experience performance degradation in specific tasks, especially when task heterogeneity is high. Additionally, the method relies on uncertainty evaluation during experiment selection, which may be sensitive to parameter initialization in some cases, affecting final prediction accuracy. Future research directions include extending the method to accommodate a broader range of experimental design problems and validating its effectiveness in more practical application scenarios.

Plain Language (accessible to non-experts)

Imagine you're in a kitchen trying to cook a delicious meal with a limited budget. You need to choose the ingredients that will maximize flavor, not just buy the cheapest ones. This method is like a smart chef who knows how to select the best ingredients within a budget to create the tastiest dish. The chef evaluates each ingredient's flavor and cost to decide which ones are worth buying. In the same way, this method selects the most valuable experiments under budget constraints to achieve high-precision scaling law fitting.

ELI14 (explained like you're 14)

Hey there! Imagine you're playing a game and you have a limited number of coins to buy gear, but you want to defeat the toughest boss. To do that, you need to choose the gear that gives you the most power, not just buy the cheapest stuff. This method is like a smart gamer who knows how to pick the best gear with limited coins to beat the boss. The gamer looks at each gear's power and price to decide which ones are worth buying. Just like that, this method picks the best experiments to get the most accurate results without spending too much money.

Glossary

Scaling Laws

Scaling laws describe predictable relationships among model size, data volume, and compute budget. They are crucial for guiding large-scale language model training.

In this paper, scaling laws are used to guide large-scale language model training.

Budget-Aware

Budget-aware refers to making decisions with consideration of budget constraints to achieve optimal resource allocation.

The paper formalizes scaling law fitting as a budget-aware sequential experimental design problem.

Uncertainty-Aware

Uncertainty-aware involves considering uncertainty factors in decision-making to improve accuracy.

The paper proposes an uncertainty-aware experiment selection method.

Sequential Experimental Design

Sequential experimental design is a method of selecting experiments step-by-step to optimize results.

The paper optimizes the experiment selection process through a sequential design strategy.

Target Region

The target region is the set of high-cost configurations where the fitted law must extrapolate accurately; running experiments there directly is typically unaffordable.

The method maximizes prediction accuracy in the target region during experiment selection.

D-Optimality

D-optimality is a design criterion that maximizes the determinant of the information matrix, yielding the most precise parameter estimates overall.

The paper compares D-optimality as a baseline.

V-Optimality

V-optimality is a design criterion that minimizes the average prediction variance over a specified set of points of interest.

The paper compares V-optimality as a baseline.

Ablation Study

An ablation study evaluates the impact of removing certain components on overall performance.

The paper conducts ablation studies to evaluate the impact of different uncertainty terms on performance.

Local Linearization

Local linearization is a method of approximating a nonlinear model as a linear model in a local region.

The paper evaluates the utility of experiments in a locally linearized model.
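A minimal sketch of this idea, assuming a saturating power law L(n) = a * n^(-b) + c: differentiating the model with respect to its parameters at the current estimate gives a Jacobian, and its rows define the linear model in which design criteria are evaluated. All numbers below are hypothetical.

```python
import numpy as np

def jacobian_row(n, a, b, c):
    """Partial derivatives of L(n) = a * n**(-b) + c w.r.t. (a, b, c)."""
    return np.array([n ** (-b), -a * np.log(n) * n ** (-b), 1.0])

a, b, c = 5.0, 0.3, 1.0          # hypothetical current parameter estimates
ns = np.logspace(6, 8, 5)        # candidate experiment sizes
J = np.stack([jacobian_row(n, a, b, c) for n in ns])

# Gauss-Newton approximation to the Fisher information of the linearized model;
# classical criteria (D, V, ...) are then functions of this matrix.
info = J.T @ J
print(np.linalg.matrix_rank(info))
```

The linearization is only valid near the current estimate, which is one reason a single linearized criterion can mislead when several distinct parameter basins remain plausible.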

Mixture of Gaussians

A mixture of Gaussians is a probabilistic model representing a combination of multiple Gaussian distributions.

The method uses a mixture of Gaussians to represent multiple plausible parameter regions.
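This representation connects to the intra- and inter-basin uncertainty terms studied in the ablation: for a Gaussian mixture, total variance splits (by the law of total variance) into an average within-component part and a between-component part. A one-dimensional sketch with made-up numbers:

```python
import numpy as np

# Hypothetical: two plausible parameter basins for the exponent b,
# represented as a two-component Gaussian mixture over b.
weights = np.array([0.6, 0.4])
means = np.array([0.28, 0.35])
stds = np.array([0.01, 0.02])

# Law of total variance: total = intra-basin + inter-basin
intra = np.sum(weights * stds ** 2)              # average within-component variance
mix_mean = np.sum(weights * means)
inter = np.sum(weights * (means - mix_mean) ** 2)  # variance of component means
total = intra + inter
print(f"intra={intra:.5f}, inter={inter:.5f}, total={total:.5f}")
```

In this toy split the inter-basin term dominates, but the ablation reported in the paper found the intra-basin term to be the more important one for selection quality.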

Open Questions (unanswered questions from this research)

  • Open Question 1: How can this method be applied to a broader range of experimental design problems? The current method primarily targets scaling law fitting problems and may need further extension to accommodate other types of experimental design problems.
  • Open Question 2: How can the efficiency of uncertainty evaluation be improved? The current method relies on uncertainty evaluation during experiment selection, which may be sensitive to parameter initialization in some cases, affecting final prediction accuracy.
  • Open Question 3: How can the method's effectiveness be validated in more practical application scenarios? Although the method performs well across various tasks, it may still experience performance degradation in specific tasks.
  • Open Question 4: How can the method's adaptability across different tasks be optimized? The current method may experience performance degradation when task heterogeneity is high, requiring further optimization of its adaptability.
  • Open Question 5: How can experiment selection efficiency be further improved under higher budgets? Although the method performs well in low-budget scenarios, there is still room for improvement under higher budgets.

Applications

Immediate Applications

Large-Scale Language Model Training Optimization

The method can be used to optimize the training process of large-scale language models, especially under budget constraints. Researchers and companies can reduce the cost of large-scale model training by optimizing the experiment selection process.

Hyperparameter Tuning

Researchers can efficiently conduct hyperparameter tuning under limited budgets using this method, improving model performance and training efficiency.

Data Allocation Optimization

The method can be used to optimize data allocation strategies to achieve optimal training results under limited budgets.

Long-term Vision

Automated Experimental Design

The long-term vision of this method is to achieve automation in experimental design, reducing human intervention and improving experiment efficiency and effectiveness.

Cross-Domain Applications

In the future, the method can be extended to other fields' experimental design problems, such as biomedical research and materials science.

Abstract

Scaling laws are used to plan multi-million-dollar training runs, but fitting those laws can itself cost millions. In modern large-scale workflows, assembling a sufficiently informative set of pilot experiments is already a major budget-allocation problem rather than a routine preprocessing step. We formulate scaling-law fitting as budget-aware sequential experimental design: given a finite pool of runnable experiments with heterogeneous costs, choose which runs to execute so as to maximize extrapolation accuracy in a high-cost target region. We then propose an uncertainty-aware method for sequentially allocating experimental budget toward the runs most useful for target-region extrapolation. Across a diverse benchmark of scaling-law tasks, our method consistently outperforms classical design-based baselines, and often approaches the performance of fitting on the full experimental set while using only about 10% of the total training budget. Our code is available at https://github.com/PlanarG/active-sl.

