An Empirical Study of SFT-DPO Interaction and Parameterization in Small Language Models
A study of SFT-DPO interaction in small models finds that full fine-tuning outperforms LoRA.
Key Findings
Methodology
This study employs a systematic experimental approach, comparing SFT-only, DPO-only, and staged SFT-to-DPO training strategies, and contrasts full fine-tuning (FFT) with low-rank adaptation (LoRA) on a GPT-2 scale decoder. Performance is analyzed on paraphrase detection and Shakespearean sonnet continuation. The central findings are that DPO yields small, task-dependent gains over strong SFT, while FFT consistently outperforms LoRA at matched training depth.
Key Results
- In paraphrase detection, using a 283k dataset, the FFT method achieved 89.87% accuracy and 89.21% F1 score on the development set, while the best LoRA performance was 87.70% accuracy and 87.00% F1 score.
- In the sonnet continuation task, the DPO strategy with V1 preference pair construction showed slight improvement, achieving a chrF score of 41.94, whereas the V3 strategy did not yield significant improvement.
- The study indicates that in small-scale models and datasets, DPO and LoRA provide limited gains, with FFT remaining the primary performance lever.
Significance
This research highlights that in small language models, traditional full fine-tuning remains the primary means of performance enhancement, while preference optimization and low-rank adaptation offer limited marginal returns. This finding matters for both academia and industry, particularly in resource-constrained environments, because it challenges the assumption that the benefits LoRA and DPO show at large scale carry over to small models, and it suggests that researchers should prioritize full fine-tuning in small-scale settings.
Technical Contribution
The technical contributions of this paper include a systematic analysis of the interaction between SFT and DPO in small models and experimental validation of the performance differences between full fine-tuning (FFT) and low-rank adaptation (LoRA). The study provides empirical data on DPO hyperparameter selection and SFT-to-DPO handoff timing, offering new insights into fine-tuning strategies for small models. Additionally, the paper reveals that in small-scale models, parameterization strategies have a greater impact on performance than the preference optimization stage.
Novelty
This study is the first to systematically compare the performance of SFT, DPO, FFT, and LoRA in small language models, particularly in paraphrase detection and poetry generation tasks. This comprehensive empirical study fills a gap in understanding the interaction of these methods in small-scale models, providing a new perspective on optimizing fine-tuning strategies for small models.
Limitations
- The study is conducted only on GPT-2 scale models, not covering the performance of larger models.
- The experimental environment is limited to specific hardware (an NVIDIA H100 GPU), which may limit the generality of the LoRA efficiency findings.
- The gains from DPO are inconsistent across tasks, requiring further exploration of its applicability in different tasks.
Future Work
Future research could apply these methods to larger models and datasets, particularly evaluating the behavior of DPO and the efficiency of LoRA in different hardware environments. Further work could also investigate combining other parameter-efficient fine-tuning methods to improve the adaptability and performance of small models.
AI Executive Summary
In the field of natural language processing, fine-tuning pretrained language models to adapt to downstream tasks is a common challenge, especially for smaller models with limited compute and parameter budgets. This paper explores two widely used fine-tuning methods: full fine-tuning (FFT) and low-rank adaptation (LoRA), as well as the interaction between supervised fine-tuning (SFT) and direct preference optimization (DPO).
The study is conducted on a GPT-2 scale decoder, with tasks including paraphrase detection and Shakespearean sonnet continuation. Experimental results show that although DPO yields small, task-dependent gains over strong SFT, full fine-tuning (FFT) consistently outperforms LoRA at matched training depth. Furthermore, LoRA does not significantly reduce training time on the authors' hardware.
These findings indicate that in small-scale models, supervised full-parameter adaptation remains the primary performance lever, while preference optimization and low-rank adaptation provide limited marginal returns. The results challenge the assumption that the benefits LoRA and DPO show at large scale carry over to small models, suggesting that researchers should prioritize full fine-tuning in small-scale settings.
The experiments utilized the Quora Question Pairs dataset and the Shakespearean sonnet dataset, examining the impact of different data scales and parameterization strategies on model performance. The results demonstrate that data diversity is more valuable than repeated exposure, with larger datasets outperforming smaller ones under the same training time budget.
The study also finds that the gains from DPO are inconsistent across tasks, so its applicability needs further exploration. Future research could apply these methods to larger models and datasets, particularly evaluating the behavior of DPO and the efficiency of LoRA in different hardware environments.
Deep Analysis
Background
In recent years, pretrained language models have made significant advances in the field of natural language processing, particularly for large-scale models. However, for smaller models, efficiently adapting them to downstream tasks remains a challenge. Early work such as GPT-2 demonstrated the strong performance of large autoregressive language models across various tasks, but the application of these methods to small-scale models still requires further exploration. Parameter-efficient fine-tuning methods like LoRA, which update a small subset of parameters while freezing most weights, offer a solution in resource-constrained environments. Additionally, preference optimization methods like DPO provide a simplified alternative to reinforcement learning from human feedback, aligning models with human intent by directly optimizing preference pairs.
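For reference, the DPO objective from Rafailov et al. (listed in the references) optimizes preference pairs directly. Given a prompt $x$, a preferred completion $y_w$, a dispreferred completion $y_l$, a frozen reference policy $\pi_{\text{ref}}$ (typically the SFT model), and a temperature $\beta$:

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta;\pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\right)\right]
$$

The implicit reward is the $\beta$-scaled log-ratio between policy and reference, which is why no separate reward model needs to be trained.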
Core Problem
Efficiently adapting small language models to downstream tasks remains a challenge. Specifically, how LoRA compares to full fine-tuning (FFT) when adapting smaller models, and how SFT interacts with DPO, are open questions. They matter because small models operate under tight compute and parameter budgets, which makes the choice of fine-tuning strategy consequential.
Innovation
The innovations of this paper lie in systematically analyzing the interaction between SFT and DPO in small models and experimentally validating the performance differences between full fine-tuning (FFT) and low-rank adaptation (LoRA). The study provides empirical data on DPO hyperparameter selection and SFT-to-DPO handoff timing, offering new insights into fine-tuning strategies for small models. Additionally, the paper reveals that in small-scale models, parameterization strategies have a greater impact on performance than the preference optimization stage.
Methodology
- Use GPT-2 (124M parameters) as the base model for paraphrase detection and sonnet continuation tasks.
- Compare SFT-only, DPO-only, and staged SFT-to-DPO training strategies (a minimal sketch of the SFT and DPO losses appears after this list).
- Contrast full fine-tuning (FFT) with low-rank adaptation (LoRA) at matched training depth.
- For paraphrase detection, use the Quora Question Pairs dataset for training and evaluation.
- For sonnet continuation, use the Shakespearean sonnet dataset for training and evaluation.
- Conduct experiments on DPO hyperparameter selection and SFT-to-DPO handoff timing.
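As a concrete illustration, below is a minimal PyTorch sketch of the two losses involved in a staged SFT-to-DPO pipeline. This is a simplified sketch, not the paper's code: sequence-level log-probabilities are assumed to be precomputed by summing token log-probs over the response, and the function and variable names are our own.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Stage 1 (SFT): standard next-token cross-entropy on supervised targets."""
    return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Stage 2 (DPO): widen the policy/reference log-ratio margin between
    chosen and rejected completions. Inputs are sequence-level log-probs
    (token log-probs summed over the response)."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy check with a batch of 4 preference pairs (random log-probs):
pol_c, pol_r = torch.randn(4), torch.randn(4)
ref_c, ref_r = torch.randn(4), torch.randn(4)
print(dpo_loss(pol_c, pol_r, ref_c, ref_r))
```

In the staged setting, the model that minimizes `sft_loss` is snapshotted and frozen as the reference policy before DPO begins; the SFT-to-DPO handoff timing studied in the paper is, roughly, when that snapshot is taken.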
Experiments
The experiments use the Quora Question Pairs dataset and the Shakespearean sonnet dataset, examining the impact of different data scales and parameterization strategies on model performance. In the paraphrase detection task, a 283k training dataset is used, with evaluation on the development set. In the sonnet continuation task, preference pairs are constructed from the Shakespearean sonnet dataset for DPO training (an illustrative record is sketched below). The experiments also cover DPO hyperparameter selection and SFT-to-DPO handoff timing.
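Purely as an illustration of the data structure, a DPO preference record for sonnet continuation might look like the following; the paper's actual V1 and V3 construction schemes are not reproduced here, and all strings are hypothetical.

```python
# Hypothetical preference record for sonnet-continuation DPO training.
pair = {
    "prompt":   "Shall I compare thee to a summer's day?\n",
    "chosen":   "Thou art more lovely and more temperate:",  # preferred continuation
    "rejected": "The quick brown fox jumps over the dog.",   # dispreferred continuation
}
```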
Results
In paraphrase detection, using a 283k dataset, the FFT method achieved 89.87% accuracy and 89.21% F1 score on the development set, while the best LoRA performance was 87.70% accuracy and 87.00% F1 score. In the sonnet continuation task, the DPO strategy with V1 preference pair construction showed slight improvement, achieving a chrF score of 41.94, whereas the V3 strategy did not yield significant improvement. The study indicates that in small-scale models and datasets, DPO and LoRA provide limited gains, with FFT remaining the primary performance lever.
Applications
The study's findings have practical implications for fine-tuning strategies in small language models, particularly in resource-constrained environments. Full fine-tuning (FFT) performs well at this scale and suits tasks requiring high precision. While low-rank adaptation (LoRA) offers efficiency advantages in large models, it shows neither a performance nor a wall-clock advantage in this small-scale setting.
Limitations & Outlook
The study is conducted only on GPT-2 scale models and does not cover larger models. The experimental environment is limited to specific hardware (an NVIDIA H100 GPU), which may limit the generality of the LoRA efficiency findings. The gains from DPO are inconsistent across tasks, so its applicability needs further exploration. Future research could apply these methods to larger models and datasets, particularly evaluating the behavior of DPO and the efficiency of LoRA in different hardware environments.
Plain Language (Accessible to non-experts)
Imagine you have a small robot that can learn to perform different tasks. This robot has two ways of learning: one is to fully immerse itself in learning all the details (like full fine-tuning), and the other is to focus only on a few key points (similar to LoRA). In our study, we found that when the robot is working on small tasks, the fully immersive learning method works better because it can grasp all the skills needed for the task. The method of focusing only on key points does not have a significant advantage in small tasks because it might miss some important details. Additionally, we tried a new learning method called preference optimization, which is like giving the robot some preference options to let it know which choices are better. However, in small tasks, this method is also not as effective as the fully immersive learning method. Overall, for small tasks, the fully immersive learning method remains the most effective.
ELI14 (Explained like you're 14)
Imagine you're playing a game and you have a small robot assistant that can help you complete tasks. You have two ways to train it: one is to let it learn all the details, and the other is to let it focus on a few important places. We found that when the task is small, letting it learn all the details works better because it can understand the task more comprehensively. The method of focusing on a few places doesn't work as well in small tasks because it might miss some important things. We also tried a new method called preference optimization, which is like giving the robot some hints to let it know which choices are better. But in small tasks, this method is also not as good as letting it learn all the details. Overall, for small tasks, letting the robot learn all the details is still the best choice.
Glossary
Full Fine-Tuning
A method of fine-tuning a pretrained model by updating all its parameters to adapt to a specific task.
In this paper, full fine-tuning is used to compare with low-rank adaptation.
Low-Rank Adaptation
A parameter-efficient fine-tuning method that introduces low-rank matrices to approximate weight updates while keeping most pretrained weights fixed.
The paper compares low-rank adaptation with full fine-tuning in small models.
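To make this definition concrete, here is a minimal PyTorch sketch of the LoRA idea (illustrative, not the paper's implementation): the pretrained weight is frozen and the update is factored as B·A with rank r, scaled by alpha/r.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: y = W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))   # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Usage: wrap a projection layer; only A and B receive gradients.
layer = LoRALinear(nn.Linear(768, 768), r=8, alpha=16)
out = layer(torch.randn(2, 10, 768))
```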
Supervised Fine-Tuning
A technique of fine-tuning pretrained models to adapt to specific tasks using supervised learning methods.
In this paper, supervised fine-tuning serves as the foundational step for preference optimization.
Direct Preference Optimization
A method that optimizes preference pairs directly without explicitly training a reward model, used to align language models with human intent.
The paper studies the application of direct preference optimization in small models.
Paraphrase Detection
A natural language processing task that determines whether two sentences express the same meaning.
The paper uses the Quora Question Pairs dataset for paraphrase detection experiments.
Sonnet Generation
A generation task where the model autoregressively generates the remaining part of a poem given its beginning.
The paper uses the Shakespearean sonnet dataset for sonnet generation experiments.
GPT-2
An autoregressive language model developed by OpenAI, known for its strong performance across various natural language processing tasks.
The paper uses GPT-2 as the base model for experiments.
chrF
A character-level n-gram F-score used to evaluate the quality of machine translation and text generation.
In the paper, chrF is used to evaluate the quality of sonnet generation.
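As an illustration of the metric, chrF can be computed with the sacrebleu library (one common implementation); the example strings below are placeholders, not the paper's data.

```python
# chrF via sacrebleu; strings are placeholders.
from sacrebleu.metrics import CHRF

chrf = CHRF()  # defaults: character 6-grams, beta = 2 (chrF2)
hypotheses = ["Thou art more lovely and more temperate"]
references = [["Thou art more lovely and more temperate"]]  # one reference stream
print(chrf.corpus_score(hypotheses, references).score)      # 100.0 for an exact match
```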
Quora Question Pairs Dataset
A dataset containing pairs of questions used for paraphrase detection tasks.
The paper uses this dataset for paraphrase detection experiments.
Shakespeare Sonnet Dataset
A text dataset containing Shakespeare's sonnets, used for poetry generation tasks.
The paper uses this dataset for sonnet generation experiments.
Open Questions (Unanswered questions from this research)
1. Although the paper shows that full fine-tuning outperforms low-rank adaptation in small models, its behavior on larger models still needs validation.
2. The applicability and gains of DPO are inconsistent across tasks, requiring exploration of its potential in other tasks.
3. The efficiency evaluation of LoRA on a single hardware configuration may not be comprehensive, requiring testing across a broader range of hardware.
4. The study is conducted only on GPT-2 scale models and does not cover larger models.
5. The results are validated on specific datasets and tasks, requiring validation on more diverse datasets and tasks.
Applications
Immediate Applications
Small Model Fine-Tuning
The study's findings can guide fine-tuning strategies for small models in resource-constrained environments, prioritizing full fine-tuning for better performance.
Task-Specific Model Optimization
Apply DPO in specific tasks to enhance model performance, especially when preference signals closely align with supervised signals.
Education and Training
The study's findings can be used in education and training to help students and researchers understand effective fine-tuning strategies for small models.
Long-term Vision
Large-Scale Model Optimization
Future research could explore combining DPO and LoRA in large-scale models to improve efficiency and performance.
Cross-Task General Models
Develop general models that perform well across multiple tasks, combining different fine-tuning strategies to adapt to various task requirements.
Abstract
Direct Preference Optimization (DPO) is widely used after supervised fine-tuning (SFT) to align language models, yet empirical behavior under small backbones and modest data is under-specified. We systematically compare SFT-only, DPO-only, and staged SFT-to-DPO training alongside full fine-tuning (FFT) versus LoRA on a GPT-2-scale decoder, evaluating paraphrase detection and Shakespearean sonnet continuation. DPO yields small, task-dependent gains over strong SFT and can match competitive SFT accuracy without a warm start when the preference construction closely parallels the supervised objective. In contrast, parameterization dominates: FFT consistently outperforms LoRA at matched training depth, and LoRA does not reduce wall-clock time on our hardware. These findings indicate that, in this small-scale regime, supervised full-parameter adaptation remains the primary performance lever, while preference optimization and low-rank adaptation provide limited marginal returns.
References (5)
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafael Rafailov, Archit Sharma, E. Mitchell et al.
Reliability and Learnability of Human Bandit Feedback for Sequence-to-Sequence Reinforcement Learning
Julia Kreutzer, Joshua Uyheng, S. Riezler
Language Models are Unsupervised Multitask Learners
Alec Radford, Jeff Wu, R. Child et al.
LoRA: Low-Rank Adaptation of Large Language Models
J. Hu, Yelong Shen, Phillip Wallis et al.
Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang et al.