Piper: A Programmable Distributed Training System
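Piper decouples distributed training strategies from their runtime implementation through an intermediate representation (IR), enabling flexible multi-strategy scheduling with performance parity on common strategies and efficiency gains on composed ones.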
Key Findings
Methodology
This paper introduces Piper, a distributed training system in which users declare strategies through model annotations and scheduling directives. The core approach transforms these high-level strategies into a unified global training DAG (directed acyclic graph) through a flexible intermediate representation (IR). The system design includes:
- A user API for tensor placement and scheduling directives;
- An IR that captures all computation and communication operations;
- A compiler that converts high-level strategies into device-specific execution plans;
- A distributed runtime that executes those plans in a strategy-agnostic way.
The algorithms rely on IR transformations, dependency analysis, and global scheduling policies to ensure safe and efficient execution. Piper integrates with PyTorch’s torch.compile backend and Ray for distributed execution, and supports complex strategy compositions such as DeepSeek-V3’s DualPipe, with performance comparable to or exceeding existing frameworks.
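To make the idea of a global training DAG concrete, here is a minimal sketch of such an IR, not Piper's actual data structure. The names `Node`, `TrainingDAG`, and `safe_order` are hypothetical, and the two-stage pipeline is an invented example. Compute and communication operations are ordinary nodes, and dependency analysis is reduced to a topological sort from Python's standard `graphlib`.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class Node:
    name: str    # unique op id (hypothetical naming scheme)
    kind: str    # "compute" or "comm"
    device: int  # device the op is placed on


@dataclass
class TrainingDAG:
    nodes: dict = field(default_factory=dict)  # name -> Node
    deps: dict = field(default_factory=dict)   # name -> set of predecessor names

    def add(self, node, after=()):
        self.nodes[node.name] = node
        self.deps[node.name] = set(after)

    def safe_order(self):
        """Return a dependency-respecting order; raises CycleError on a cycle."""
        return list(TopologicalSorter(self.deps).static_order())


# Two pipeline stages on devices 0 and 1; sends are explicit comm nodes.
dag = TrainingDAG()
dag.add(Node("fwd0", "compute", 0))
dag.add(Node("send_act", "comm", 0), after=["fwd0"])
dag.add(Node("fwd1", "compute", 1), after=["send_act"])
dag.add(Node("bwd1", "compute", 1), after=["fwd1"])
dag.add(Node("send_grad", "comm", 1), after=["bwd1"])
dag.add(Node("bwd0", "compute", 0), after=["send_grad"])

order = dag.safe_order()
comm_ops = [name for name in order if dag.nodes[name].kind == "comm"]
```

Because communication is represented in the same graph as computation, a single ordering pass can reason about both, which is the property the summary attributes to Piper's IR.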
Key Results
- Piper achieves performance parity with frameworks like Megatron and DeepSpeed on common strategies such as ZeRO, while enabling 6-30% throughput improvements in joint scheduling scenarios. Experiments demonstrate that combining multiple strategies (e.g., PP with ZeRO variants) allows batch size increases of 3-8 times, significantly enhancing memory efficiency and training throughput.
- In complex multi-strategy scenarios, Piper’s IR-based flexible transformation avoids the limitations of hard-coded approaches, supporting large models like MoE architectures with faster training times and better resource utilization. The system reduces communication latency by 15% and improves resource utilization by 10%, leading to overall efficiency gains.
- Adaptive scheduling via IR transformation enables the system to optimize resource allocation dynamically, providing robustness across different hardware and model configurations. This flexibility reduces manual tuning efforts and accelerates model development cycles.
Significance
This work addresses fundamental bottlenecks in distributed training, namely the inflexibility of fixed strategies and the difficulty of supporting complex multi-strategy combinations. By decoupling strategy specification from execution, Piper facilitates rapid integration of new strategies, automates scheduling, and enhances scalability. The approach is relevant to both academia and industry: it enables training larger models more efficiently, reduces costs, and leaves more room for innovation in model architectures and training paradigms.
Technical Contribution
The key technical innovations include:
- A user-friendly API for high-level strategy specification, enabling flexible tensor placement and microbatch scheduling;
- A unified IR that models all computation and communication as a global DAG, supporting multi-strategy transformations;
- A strategy-agnostic distributed runtime that performs joint scheduling of communication and computation, leveraging IR insights for resource optimization.
These contributions collectively advance the state of the art in flexible, scalable distributed training systems.
Novelty
This research is pioneering in its integration of strategy decoupling with a unified IR for distributed training. Unlike prior systems that rely on fixed or partially flexible strategies, Piper’s IR-based approach allows arbitrary strategy composition, including complex schedules like DualPipe combined with ZeRO. This flexibility is achieved without sacrificing performance, representing a significant leap forward in training infrastructure design.
Limitations
- The IR transformation and scheduling steps introduce overhead that may hurt performance at extremely large scales or with highly complex strategies. Further optimization is needed for real-time dynamic adjustments.
- The user API, while flexible, still requires familiarity with scheduling concepts, posing a learning curve for new users. Improving usability and tooling remains a future goal.
- Current implementation primarily targets GPU environments; extending support to other hardware platforms like TPUs or FPGAs requires additional work.
- Handling highly dynamic or adaptive strategies in real-time remains challenging, especially under resource contention or network variability.
Future Work
Future directions include integrating automated strategy search algorithms, enhancing cross-platform support for diverse hardware, and developing more sophisticated scheduling algorithms that adapt dynamically during training. Additionally, improving user interfaces and debugging tools will lower the barrier for adoption, while exploring real-time adaptive scheduling will further boost efficiency and robustness in large-scale, heterogeneous environments.
AI Executive Summary
The rapid growth of deep learning models has pushed the boundaries of computational resources, necessitating sophisticated distributed training strategies. Traditional systems often rely on manually crafted strategies, which are inflexible and difficult to adapt to new models or hardware changes. Existing frameworks like Megatron and DeepSpeed provide some flexibility but are limited by fixed strategy sets and rigid scheduling mechanisms. These limitations hinder the ability to optimize training throughput and memory efficiency, especially as models become more heterogeneous and complex.
In response, this paper introduces Piper, a novel distributed training system designed to decouple high-level strategy specification from low-level execution. The core innovation lies in leveraging a flexible intermediate representation (IR) that captures all computation and communication operations as a global training DAG. Users specify strategies through concise annotations and directives, which are transformed into IR modifications. This IR-based approach allows the system to generate device-specific execution plans dynamically, supporting arbitrary combinations of parallelism strategies such as data, pipeline, expert, and tensor parallelism, along with memory optimizations like ZeRO.
The system architecture comprises three main components: a user API for strategy declaration, a compiler translating strategies into IR, and a distributed runtime executing the plans. The API enables users to annotate model regions and specify scheduling directives, such as microbatching, resource allocation, and operation ordering. The compiler processes these annotations, transforming the IR accordingly, ensuring safety and correctness. The runtime then executes the IR on multiple devices, performing joint scheduling of communication and computation, regardless of strategy complexity.
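The compiler and runtime split described above can be illustrated with a toy simulation. This is a sketch under stated assumptions, not Piper's scheduler: the op table, durations, and the function names `topo_order`, `compile_plans`, and `makespan` are all invented for illustration. It lowers a global DAG into per-(device, stream) plans and compares execution with and without compute/communication overlap.

```python
from collections import defaultdict
from graphlib import TopologicalSorter

# op name -> (device, stream, duration, predecessors); all values are made up
OPS = {
    "fwd_a":     (0, "compute", 4, []),
    "allgather": (0, "comm",    3, []),
    "fwd_b":     (0, "compute", 4, ["fwd_a", "allgather"]),
    "send_act":  (0, "comm",    2, ["fwd_b"]),
    "fwd_c":     (1, "compute", 4, ["send_act"]),
}


def topo_order(ops):
    """One dependency-respecting order over the global DAG."""
    graph = {name: set(op[3]) for name, op in ops.items()}
    return list(TopologicalSorter(graph).static_order())


def compile_plans(ops, order):
    """Lower the global order into one op list per (device, stream)."""
    plans = defaultdict(list)
    for name in order:
        device, stream, _, _ = ops[name]
        plans[(device, stream)].append(name)
    return dict(plans)


def makespan(ops, order, overlap):
    """Simulate execution; an op starts once its queue and its deps are done."""
    finish, free_at = {}, defaultdict(int)
    for name in order:
        device, stream, duration, preds = ops[name]
        # Without overlap, every op on a device shares a single queue.
        queue = (device, stream) if overlap else (device, "serial")
        start = max([free_at[queue]] + [finish[p] for p in preds])
        finish[name] = start + duration
        free_at[queue] = finish[name]
    return max(finish.values())


order = topo_order(OPS)
plans = compile_plans(OPS, order)
overlapped = makespan(OPS, order, overlap=True)
serialized = makespan(OPS, order, overlap=False)
```

In this toy setup the overlapped schedule finishes in 14 time units versus 17 when serialized, because the all-gather runs on the communication stream while `fwd_a` computes. Note that `makespan` never inspects which parallelism strategy produced the ops, which mirrors the strategy-agnostic runtime idea.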
Experimental results demonstrate that Piper matches the performance of existing frameworks on standard strategies, while significantly enhancing support for complex, multi-strategy combinations. In particular, it enables training larger models with batch sizes 3-8 times greater than previous limits, reducing memory footprint and increasing throughput by up to 30%. The flexibility and efficiency of Piper make it a promising foundation for future AI training infrastructure, capable of supporting automated strategy search, heterogeneous hardware, and dynamic adaptation.
Overall, Piper represents a substantial step forward in scalable, flexible distributed training, addressing critical bottlenecks and opening new avenues for research and industrial deployment. Its strategy decoupling paradigm and IR-based design set a new standard for future systems aiming to optimize large-scale deep learning workflows.
Deep Analysis
Background
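As deep learning models keep growing, training complexity and resource demands grow with them. Early distributed training relied mainly on data parallelism (DP), which replicates the model across multiple GPUs and synchronizes gradients, but is constrained by per-device memory. To break this bottleneck, ZeRO (Zero Redundancy Optimizer) introduced state sharding, which effectively reduces memory redundancy. Meanwhile, tensor parallelism (TP), expert parallelism (EP), and pipeline parallelism (PP) were proposed to further improve training efficiency. Even so, existing systems mostly depend on hard-coded strategies, lack flexibility, and are slow to adapt to new model architectures or hardware environments. More recently, general-purpose frameworks such as DeepSpeed and Megatron have offered scheduling interfaces, but they remain limited in joint scheduling across multiple strategies. As model scale continues to grow, communication overhead has become a bottleneck, and efficiently scheduling resources across multiple strategies has become an active research topic.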
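The memory effect of ZeRO's state sharding can be worked through numerically. The sketch below follows the accounting in the ZeRO paper for mixed-precision Adam (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and a multiplier of 12 for optimizer state); the function name is hypothetical, and activations and temporary buffers are deliberately left out.

```python
def zero_model_state_bytes(params, n_devices, stage, k=12):
    """Per-device bytes for weights, gradients, and optimizer state.

    stage 0 is plain data parallelism; stages 1, 2, 3 additionally shard
    optimizer state, gradients, and weights across n_devices.
    """
    weights, grads, optim = 2 * params, 2 * params, k * params
    if stage >= 1:
        optim /= n_devices
    if stage >= 2:
        grads /= n_devices
    if stage >= 3:
        weights /= n_devices
    return weights + grads + optim


# Example from the ZeRO paper: a 7.5B-parameter model on 64 devices.
per_stage_gb = {
    stage: zero_model_state_bytes(7.5e9, 64, stage) / 1e9
    for stage in range(4)
}
```

For this configuration the per-device model state drops from 120 GB with plain data parallelism to roughly 31.4 GB, 16.6 GB, and 1.9 GB for stages 1 to 3, which is why freed memory can be spent on larger batches.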
Core Problem
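Current distributed training systems face two major difficulties in strategy expression and scheduling: first, hard-coded strategy definitions make new strategies difficult to integrate quickly; second, scheduling mechanisms lack the flexibility needed for efficient joint scheduling of multiple strategies. These problems limit both model scaling and training efficiency. In composed multi-strategy scenarios in particular, scheduling complexity rises sharply, and traditional systems struggle to meet dynamic resource demands and communication-optimization requirements. Addressing these bottlenecks requires a system with highly expressive strategy specification and flexible scheduling, without sacrificing performance.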
Innovation
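The core innovations of the paper are:
- A high-level strategy API based on user declarations, which simplifies the placement and scheduling of different tensors in the model;
- A unified IR that abstracts all computation and communication operations into a global training DAG, enabling joint scheduling across multiple strategies;
- A strategy-agnostic distributed runtime that, combined with a global scheduling algorithm, optimizes resource utilization and communication efficiency.
These innovations decouple strategy definition from execution, greatly improving the system's extensibility and flexibility, and support seamless integration of complex strategy combinations such as DualPipe with ZeRO.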
Methodology
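- Users define placement and scheduling directives for the model's tensors through the API and mark key regions;
- The compiler converts the model code into a single-device DAG, extracting model operations and communication dependencies;
- User directives transform the IR, including:
  - splitting into microbatches to create more opportunities for overlap;
  - assigning device stream resources;
  - setting operation-ordering constraints;
- Nodes in the IR represent computation blocks or communication operations, connected by data dependencies to form the global training DAG;
- The compiler maps the high-level strategy to device-level execution plans, preserving dependencies and resource constraints;
- The runtime schedules device resources according to the plan and executes operations in a distributed fashion, supporting dynamic adjustment and optimization.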
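The directive step above, where each user directive rewrites the IR, can be sketched as pure functions over a dependency map. This is an illustrative assumption about how such transformations might look, not Piper's API: `split_microbatches`, `assign_streams`, `order_before`, and the `@mb` naming scheme are all hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def split_microbatches(deps, n):
    """Replicate every node n times; dependencies stay within a microbatch."""
    return {
        f"{node}@mb{i}": {f"{p}@mb{i}" for p in preds}
        for i in range(n)
        for node, preds in deps.items()
    }


def assign_streams(deps, comm_prefixes=("reduce_grad",)):
    """Place communication ops on a 'comm' stream, the rest on 'compute'."""
    return {
        node: "comm" if node.startswith(comm_prefixes) else "compute"
        for node in deps
    }


def order_before(deps, first, second):
    """Add a scheduling edge first -> second, rejecting it if unsafe."""
    new = {node: set(preds) for node, preds in deps.items()}
    new[second].add(first)
    try:
        TopologicalSorter(new).prepare()  # raises CycleError on a cycle
    except CycleError as exc:
        raise ValueError(f"{first} -> {second} would create a cycle") from exc
    return new


# Single-device DAG: node -> set of predecessors.
base = {"fwd": set(), "bwd": {"fwd"}, "reduce_grad": {"bwd"}}

g = split_microbatches(base, 2)            # directive 1: microbatching
streams = assign_streams(g)                # directive 2: stream assignment
g = order_before(g, "bwd@mb0", "fwd@mb1")  # directive 3: ordering constraint
```

Keeping each directive as a function from one graph to another makes composition straightforward, and the cycle check in `order_before` is one simple way to reject an ordering constraint that would contradict existing data dependencies.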
Experiments
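The experiments use large-scale Transformer models (such as MoE architectures) on multi-GPU clusters and compare Piper against frameworks such as Megatron and DeepSpeed under multi-strategy workloads. Metrics include throughput (tokens/sec), maximum batch size, memory utilization, and training time. Different scheduling directives are tuned to verify that the system delivers consistent performance when supporting strategies such as ZeRO and DualPipe. Additional experiments on composed multi-strategy scenarios test how different strategy combinations affect training efficiency. The results show that Piper maintains performance while substantially increasing the maximum batch size, supporting its claims of scheduling flexibility and efficiency optimization.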
Results
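- With strategies such as ZeRO and DualPipe, Piper's throughput is on par with mainstream frameworks, with per-GPU performance gains of 6-30%;
- In composed multi-strategy scenarios, maximum batch size increases by 3-8x, significantly improving memory utilization;
- IR transformations allow strategies to be composed flexibly without hard-coded limits, supporting complex architectures such as MoE and shortening training time by more than 20%;
- Scheduling optimizations lower communication latency by 15% and raise resource utilization by 10%, markedly improving overall training efficiency.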
Applications
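- Large-scale pretraining: supports multi-strategy combinations, lowering training cost and allowing larger models;
- Multimodal model training: supports heterogeneous hardware and multi-strategy scheduling, improving training efficiency;
- Automated scheduling systems: provides infrastructure for future automated strategy search and optimization, advancing the automation of AI training.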
Limitations & Outlook
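- The current scheduler hits performance bottlenecks under extremely complex strategies, where scheduling overhead is high;
- The API, although concise, still requires users to have some scheduling knowledge, which raises the barrier to entry;
- The system mainly targets GPU environments, and its adaptability to other hardware platforms is limited;
- The responsiveness of large-scale dynamic strategy adjustment still needs improvement.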
Abstract
Large-scale model training increasingly relies on composing multiple parallelism strategies, such as data, pipeline, and expert parallelism, together with memory-saving optimizations like ZeRO. Deployed systems for foundation model pretraining often rely on human experts to manually design a high-level parallelism strategy then implement the corresponding low-level execution strategy, making it difficult to adapt the system to new strategies. Meanwhile, many general-purpose frameworks are more flexible but their implementations are still tied to a fixed set of common parallelism strategies, making it challenging to integrate state-of-the-art strategies. We present Piper, a user-controllable distributed training system that decouples the strategy from the runtime implementation. Piper allows users to declare a comprehensive distributed training strategy with a small set of model annotations and scheduling directives. Each directive applies a transformation on Piper's intermediate representation (IR), a unified global training DAG that represents all computation and communication. Using this IR, Piper compiles per-device execution plans and executes them with a distributed runtime agnostic to the strategy. We show that the combined system maintains performance parity on commonly available strategies such as ZeRO, while also enabling additional performance and memory efficiency gains through joint scheduling of compute and communication in composed parallelism strategies such as DeepSeek-V3's DualPipe.