CRAFT: Clustered Regression for Adaptive Filtering of Training data
CRAFT improves BLEU by 2.13 points over TSDS on English-Hindi translation through clustered regression for adaptive filtering of training data.
Key Findings
Methodology
The CRAFT method decomposes the source-target joint distribution and employs a two-stage selection strategy: (i) match the validation source distribution through proportional budget allocation across k-means clusters, and (ii) within each source cluster, select training pairs whose target embeddings minimize a conditional expected distance derived from the validation target distribution. The authors prove that proportional cluster allocation bounds the continuous KL divergence between the selected and validation distributions, with the residual controlled by cluster diameters.
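The two-stage strategy can be sketched in code. This is a minimal, NumPy-only illustration under assumed details (Lloyd's k-means, Euclidean distances, largest-remainder rounding of the per-cluster budget); it is not the authors' implementation.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal Lloyd's k-means, an illustrative stand-in for a library call."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

def craft_select(train_src, train_tgt, val_src, val_tgt, budget, k=4):
    """Two-stage selection sketch: (i) allocate the budget across k-means
    clusters of the validation source embeddings in proportion to cluster
    mass, (ii) inside each cluster, keep the training pairs whose target
    embeddings have the smallest mean distance to the validation targets."""
    centers, val_labels = kmeans(val_src, k)
    train_labels = np.argmin(
        np.linalg.norm(train_src[:, None] - centers[None], axis=-1), axis=1)

    # Stage 1: proportional budget with largest-remainder rounding.
    ideal = np.bincount(val_labels, minlength=k) / len(val_labels) * budget
    alloc = np.floor(ideal).astype(int)
    alloc[np.argsort(ideal - alloc)[::-1][: budget - alloc.sum()]] += 1

    # Stage 2: conditional expected distance within each source cluster.
    selected = []
    for j in range(k):
        cand = np.where(train_labels == j)[0]
        val_j = val_tgt[val_labels == j]
        if len(cand) == 0 or len(val_j) == 0 or alloc[j] == 0:
            continue
        dist = np.linalg.norm(
            train_tgt[cand][:, None] - val_j[None], axis=-1).mean(axis=1)
        selected.extend(cand[np.argsort(dist)[: alloc[j]]].tolist())
    return selected
```

In practice the embeddings would come from a sentence encoder or TF-IDF vectorizer, and the distance computation would be batched or approximated for pools of millions of pairs.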
Key Results
- CRAFT achieves 43.34 BLEU on English-Hindi translation when selecting from 33 million NLLB sentence pairs, outperforming TSDS by 2.13 points while completing selection over 40 times faster.
- With TF-IDF vectorization, the entire pipeline completes in under one minute on CPU; selection itself takes 26.86 seconds versus TAROT's 75.6 seconds, a 2.8x speedup.
- On a 1M candidate pool, CRAFT achieves 43.34 BLEU with sentence embeddings and 41.78 BLEU with TF-IDF, the latter comparable to TSDS's 41.21 BLEU.
Significance
The CRAFT method significantly reduces computational cost while improving model performance by selecting high-quality subsets for fine-tuning from large-scale corpora. It addresses the shortcomings of traditional methods in handling source-target conditional relationships, especially in multilingual translation tasks. Beyond its academic interest, it offers industry an efficient data selection strategy.
Technical Contribution
CRAFT fundamentally differs from existing state-of-the-art methods by independently clustering source and target embeddings to capture conditional structure in the validation set, avoiding the joint embedding processing used by traditional methods. It provides new theoretical guarantees by bounding the KL divergence between the selected and validation distributions, ensuring the selected subset aligns with the validation distribution. Additionally, CRAFT achieves significant speed improvements, particularly on large-scale datasets.
Novelty
CRAFT is the first method to perform data selection through source-target distribution decomposition. Compared to existing methods, CRAFT captures conditional structures in the validation set by independently clustering source and target embeddings, offering a new perspective on data selection and avoiding the shortcomings of traditional methods in handling source-target conditional relationships.
Limitations
- CRAFT may be affected by clustering quality when dealing with very high-dimensional embeddings, leading to a mismatch between the selected subset and the validation distribution.
- In some low-resource language pairs, the conditional relationship between source and target may not be apparent, affecting CRAFT's selection effectiveness.
- CRAFT relies on the quality and representativeness of the validation set; if the validation set is not representative, it may lead to a decline in the performance of the selected subset.
Future Work
Future research can explore the application of CRAFT in other tasks, such as image classification or text generation. Additionally, research can focus on enhancing CRAFT's performance in low-resource language pairs or combining other data selection strategies to improve selection accuracy and efficiency.
AI Executive Summary
As corpora continue to grow, selecting a small, high-quality subset for fine-tuning becomes increasingly important. Existing methods fall short in handling source-target conditional relationships, leading to a mismatch between the selected subset and the validation distribution. The CRAFT method offers a new solution through clustered regression for adaptive filtering of training data.
CRAFT decomposes the source-target joint distribution and employs a two-stage selection strategy. First, it matches the validation source distribution through proportional budget allocation across k-means clusters. Then, within each source cluster, it selects training pairs whose target embeddings minimize a conditional expected distance derived from the validation target distribution. The authors prove that proportional cluster allocation bounds the continuous KL divergence between the selected and validation distributions, with the residual controlled by cluster diameters.
In experiments, CRAFT excels on English-Hindi translation. Selecting from 33 million NLLB sentence pairs and fine-tuning the mBART model, CRAFT achieves a BLEU score of 43.34, outperforming TSDS by 2.13 points while completing selection over 40 times faster. With TF-IDF vectorization the entire pipeline completes in under one minute on CPU; selection itself takes 26.86 seconds versus TAROT's 75.6 seconds, a 2.8x speedup.
The CRAFT method significantly reduces computational costs while improving model performance by selecting high-quality subsets for fine-tuning from large-scale corpora. It addresses the shortcomings of traditional methods in handling source-target conditional relationships, especially in multilingual translation tasks.
However, CRAFT may be affected by clustering quality when dealing with very high-dimensional embeddings, leading to a mismatch between the selected subset and the validation distribution. Additionally, in some low-resource language pairs, the conditional relationship between source and target may not be apparent, affecting CRAFT's selection effectiveness. Future research can explore the application of CRAFT in other tasks, such as image classification or text generation.
Deep Analysis
Background
As the field of natural language processing rapidly evolves, the performance of machine translation models critically depends on the quality and relevance of their training data. In recent years, parallel corpora have expanded to tens of millions of sentence pairs, such as the 33 million English-Hindi pairs in the NLLB corpus. However, full fine-tuning on such large-scale corpora is computationally expensive and often unnecessary. Selecting a small, well-chosen subset can match or exceed the performance of training on the full dataset, provided the selection captures the right distributional properties. Existing methods address this problem with varying trade-offs between quality and computational cost, such as lexical methods that are fast but fail to capture semantic structure, and gradient-based methods that achieve strong performance but require expensive encoder inference or optimal transport solves over the full candidate pool.
Core Problem
Selecting appropriate training data from large-scale corpora has emerged as a critical challenge. Full fine-tuning is not only computationally expensive but often unnecessary. A small, well-chosen subset can match or exceed the performance of training on the full dataset, provided the selection captures the right distributional properties. Existing methods address this problem with varying trade-offs between quality and computational cost, such as lexical methods that are fast but fail to capture semantic structure, and gradient-based methods that achieve strong performance but require expensive encoder inference or optimal transport solves over the full candidate pool.
Innovation
The CRAFT method decomposes the source-target joint distribution and employs a two-stage selection strategy: (i) match the validation source distribution through proportional budget allocation across k-means clusters, and (ii) within each source cluster, select training pairs whose target embeddings minimize a conditional expected distance derived from the validation target distribution. The authors prove that proportional cluster allocation bounds the continuous KL divergence between the selected and validation distributions, with the residual controlled by cluster diameters. CRAFT fundamentally differs from existing state-of-the-art methods by independently clustering source and target embeddings to capture conditional structure in the validation set, avoiding the joint embedding processing used by traditional methods.
Methodology
- CRAFT decomposes the source-target joint distribution and employs a two-stage selection strategy.
- First, it matches the validation source distribution through proportional budget allocation across k-means clusters.
- Then, within each source cluster, it selects training pairs whose target embeddings minimize a conditional expected distance derived from the validation target distribution.
- The authors prove that proportional cluster allocation bounds the continuous KL divergence between the selected and validation distributions, with the residual controlled by cluster diameters.
- CRAFT fundamentally differs from existing state-of-the-art methods by independently clustering source and target embeddings to capture conditional structure in the validation set, avoiding the joint embedding processing used by traditional methods.
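The proportional budget allocation in the first step can be made concrete with a small helper. The largest-remainder rounding used here is an assumption for illustration, not necessarily the paper's exact rule.

```python
import numpy as np

def allocate_budget(cluster_masses, budget):
    """Give each cluster a share of the budget proportional to its
    validation mass, rounding by largest remainder so the allocations
    sum exactly to the budget."""
    props = np.asarray(cluster_masses, dtype=float)
    props /= props.sum()
    ideal = props * budget
    alloc = np.floor(ideal).astype(int)
    leftover = budget - alloc.sum()
    # Hand the leftover units to the clusters with the largest remainders.
    alloc[np.argsort(ideal - alloc)[::-1][:leftover]] += 1
    return alloc

print(allocate_budget([50, 30, 15, 5], 10))  # [5 3 1 1]
```

Note that the rounding step is what keeps the selected subset's cluster proportions close to the validation set's, which is the quantity the KL bound controls.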
Experiments
In experiments, CRAFT excels on English-Hindi translation tasks. Selecting from 33 million NLLB sentence pairs and fine-tuning the mBART model via LoRA, CRAFT achieves a BLEU score of 43.34, outperforming TSDS by 2.13 points while completing selection over 40 times faster. With TF-IDF vectorization the entire pipeline completes in under one minute on CPU; selection itself takes 26.86 seconds versus TAROT's 75.6 seconds, a 2.8x speedup. CRAFT thus significantly reduces computational cost while improving model performance by selecting high-quality subsets for fine-tuning from large-scale corpora.
Results
CRAFT excels on English-Hindi translation tasks. Selecting from 33 million NLLB sentence pairs and fine-tuning the mBART model, CRAFT achieves a BLEU score of 43.34, outperforming TSDS by 2.13 points while completing selection over 40 times faster. With TF-IDF vectorization the entire pipeline completes in under one minute on CPU; selection itself takes 26.86 seconds versus TAROT's 75.6 seconds, a 2.8x speedup.
Applications
The CRAFT method significantly reduces computational costs while improving model performance by selecting high-quality subsets for fine-tuning from large-scale corpora. It addresses the shortcomings of traditional methods in handling source-target conditional relationships, especially in multilingual translation tasks.
Limitations & Outlook
CRAFT may be affected by clustering quality when dealing with very high-dimensional embeddings, leading to a mismatch between the selected subset and the validation distribution. Additionally, in some low-resource language pairs, the conditional relationship between source and target may not be apparent, affecting CRAFT's selection effectiveness. Future research can explore the application of CRAFT in other tasks, such as image classification or text generation.
Plain Language Accessible to non-experts
Imagine you're in a huge library with thousands of books. You need to pick a few books to write an essay on a specific topic, but you don't have time to read them all. The CRAFT method is like a smart librarian who knows how to quickly find the most relevant books. First, they group all the books by topic, like putting them on different shelves. Then, they pick the books on each shelf that best represent the entire topic, rather than grabbing a few at random. This way, you can quickly find the most useful information for your essay without having to read every book.
ELI14 Explained like you're 14
Hey there! Imagine you have a giant toy box with thousands of toys, but you can only pick a few to play with because you don't have much time. The CRAFT method is like a super-smart toy picker! First, it sorts the toys into different boxes by type, like cars, dolls, blocks, and so on. Then, it picks the coolest and most fun toys from each box. This way, you get to play with the best toys in a short amount of time, instead of wasting time on toys that aren't as fun. The CRAFT method helps you quickly find the best things to play with! Isn't that awesome?
Glossary
CRAFT (Clustered Regression for Adaptive Filtering)
A method for adaptive filtering of training data through clustered regression, aimed at selecting high-quality subsets from large-scale corpora for fine-tuning.
Used in the paper for selecting training data for English-Hindi translation tasks.
k-means Clustering
An unsupervised learning algorithm that partitions data points into k groups, each represented by a centroid.
Used to partition source and target embeddings into different clusters.
BLEU (Bilingual Evaluation Understudy)
A metric for evaluating machine translation quality by measuring n-gram overlap between a system's output and one or more reference translations.
Used to evaluate the performance of the CRAFT method in English-Hindi translation tasks.
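For intuition, a toy BLEU-style score (n-gram precisions plus brevity penalty, limited to bigrams and without smoothing) can be computed as follows; real evaluations use full 4-gram BLEU with standardized tokenization, e.g. via sacreBLEU.

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=2):
    """Toy BLEU sketch: geometric mean of clipped n-gram precisions,
    scaled by a brevity penalty for short candidates."""
    c, r = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c_ngrams = Counter(tuple(c[i:i + n]) for i in range(len(c) - n + 1))
        r_ngrams = Counter(tuple(r[i:i + n]) for i in range(len(r) - n + 1))
        overlap = sum(min(v, r_ngrams[g]) for g, v in c_ngrams.items())
        precisions.append(overlap / max(sum(c_ngrams.values()), 1))
    bp = 1.0 if len(c) >= len(r) else math.exp(1 - len(r) / len(c))
    if min(precisions) == 0:
        return 0.0
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
```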
TF-IDF (Term Frequency-Inverse Document Frequency)
A text vectorization method that weights each word by how frequent it is in a document and how rare it is across the corpus.
Used in the vectorization step of the CRAFT method.
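A minimal TF-IDF weighting (raw term frequency times inverse document frequency) looks like this; production pipelines typically use a library vectorizer with extra normalization and smoothing.

```python
import math
from collections import Counter

def tfidf(docs):
    """Minimal TF-IDF: weight(t, d) = tf(t, d) * log(N / df(t)).
    Terms appearing in every document get weight 0."""
    N = len(docs)
    df = Counter(t for doc in docs for t in set(doc.split()))
    vecs = []
    for doc in docs:
        tf = Counter(doc.split())
        vecs.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return vecs

v = tfidf(["the cat sat", "the dog sat", "the cat ran"])
# "the" occurs in all three documents, so its weight is 0 everywhere
```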
LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning method for large language models that learns low-rank update matrices while keeping the pretrained weights frozen.
Used to fine-tune the mBART model.
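The core LoRA idea can be shown numerically: the frozen weight W receives a low-rank update B @ A, cutting the trainable parameter count from d*d to 2*r*d. The dimensions below are illustrative, not the paper's configuration.

```python
import numpy as np

d, r = 1024, 8
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                 # trainable, zero-init so W is unchanged at start
W_adapted = W + B @ A

full_params = d * d      # 1,048,576 if the whole matrix were trainable
lora_params = 2 * r * d  # 16,384 trainable parameters with rank 8
```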
KL Divergence
A measure of the difference between two probability distributions.
Used to bound the gap between the selected and validation distributions in CRAFT's theoretical analysis.
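For discrete distributions, KL divergence is a short computation; it is asymmetric, non-negative, and zero only when the two distributions match.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i), skipping zero-mass terms."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
print(kl_divergence(p, p))                    # 0.0
print(kl_divergence(p, [0.4, 0.4, 0.2]) > 0)  # True
```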
NLLB Corpus
A large multilingual parallel corpus; its 33 million English-Hindi sentence pairs serve as the candidate pool in the experiments.
Used as the experimental dataset for evaluating the CRAFT method.
mBART Model
A multilingual sequence-to-sequence pre-trained model suitable for multilingual translation tasks.
Used to evaluate the translation performance of the CRAFT method.
Conditional Expected Distance
The average distance from a candidate pair's target embedding to the validation target embeddings associated with the same source cluster.
Used in the target selection step of the CRAFT method.
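A plausible form of this score, assuming Euclidean distances between embeddings (an illustration, not the paper's exact definition): lower scores mean a candidate's target sits closer to the validation targets of its cluster.

```python
import numpy as np

def conditional_expected_distance(cand_tgt, val_tgt):
    """Mean Euclidean distance from each candidate target embedding to
    the validation target embeddings of the same source cluster."""
    d = np.linalg.norm(cand_tgt[:, None, :] - val_tgt[None, :, :], axis=-1)
    return d.mean(axis=1)

cands = np.array([[0.0, 0.0], [5.0, 5.0]])
vals = np.array([[0.0, 1.0], [1.0, 0.0]])
scores = conditional_expected_distance(cands, vals)
# the first candidate is closer to the validation targets on average
```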
Distribution Matching
A method for adjusting the selected dataset to make its distribution similar to the validation set.
Used in the source distribution matching step of the CRAFT method.
Open Questions Unanswered questions from this research
- 1 CRAFT may be affected by clustering quality when dealing with very high-dimensional embeddings, leading to a mismatch between the selected subset and the validation distribution. Future research can explore how to improve clustering quality to ensure the selected subset aligns more closely with the validation distribution.
- 2 In some low-resource language pairs, the conditional relationship between source and target may not be apparent, affecting CRAFT's selection effectiveness. Future research can explore how to enhance CRAFT's performance in low-resource language pairs or combine other data selection strategies to improve selection accuracy and efficiency.
- 3 CRAFT relies on the quality and representativeness of the validation set; if the validation set is not representative, it may lead to a decline in the performance of the selected subset. Future research can explore how to select high-quality subsets even when the validation set is not representative.
- 4 While CRAFT performs well on large-scale datasets, its performance on small-scale datasets has not been fully verified. Future research can explore CRAFT's performance on small-scale datasets and compare it with other methods.
- 5 CRAFT shows superiority in multilingual translation tasks, but its application in other tasks has not been fully verified. Future research can explore the application of CRAFT in other tasks, such as image classification or text generation.
Applications
Immediate Applications
Multilingual Translation
The CRAFT method can be used for multilingual translation tasks to select high-quality training data and improve translation model performance.
Text Classification
The CRAFT method can be used for text classification tasks by selecting the most relevant training data to improve classification model accuracy.
Speech Recognition
The CRAFT method can be used for speech recognition tasks to select high-quality training data and improve recognition model performance.
Long-term Vision
Autonomous Driving
The CRAFT method can be applied to the perception module in autonomous driving to select high-quality training data and improve perception model accuracy and robustness.
Intelligent Customer Service
The CRAFT method can be used in intelligent customer service systems to select high-quality training data and improve response quality and user satisfaction.
Abstract
Selecting a small, high-quality subset from a large corpus for fine-tuning is increasingly important as corpora grow to tens of millions of datapoints, making full fine-tuning expensive and often unnecessary. We propose CRAFT (Clustered Regression for Adaptive Filtering of Training data), a vectorization-agnostic selection method for training sequence-to-sequence models. CRAFT decomposes the joint source-target distribution and performs a two-stage selection: (i) match the validation source distribution through proportional budget allocation across k-means clusters, and (ii) within each source cluster, select training pairs whose target embeddings minimize a conditional expected distance derived from the validation target distribution. We prove that proportional cluster allocation bounds the continuous KL divergence between selected and validation distributions, with the residual controlled by cluster diameters. We evaluate CRAFT on English-Hindi translation by selecting training data from 33 million NLLB sentence pairs and fine-tuning mBART via LoRA. CRAFT achieves 43.34 BLEU, outperforming TSDS (41.21) by 2.13 points on the same candidate pool and encoder while completing selection over 40 times faster. With TF-IDF vectorization, the entire pipeline completes in under one minute on CPU. TAROT achieves 45.61 BLEU, but CRAFT completes selection in 26.86 seconds versus TAROT's 75.6 seconds, a 2.8x speedup.
References (20)
TAROT: Targeted Data Selection via Optimal Transport
Lang Feng, Fan Nie, Yuejiang Liu et al.
TSDS: Data Selection for Task-Specific Model Finetuning
Zifan Liu, Amin Karbasi, Theodoros Rekatsinas
Data Selection for Language Models via Importance Resampling
Sang Michael Xie, Shibani Santurkar, Tengyu Ma et al.
LESS: Selecting Influential Data for Targeted Instruction Tuning
Mengzhou Xia, Sadhika Malladi, Suchin Gururangan et al.
Performance Scaling via Optimal Transport: Enabling Data Selection from Partially Revealed Sources
Feiyang Kang, H. Just, Anit Kumar Sahu et al.
A Survey on Data Selection for Language Models
Alon Albalak, Yanai Elazar, Sang Michael Xie et al.
Bleu: a Method for Automatic Evaluation of Machine Translation
Kishore Papineni, Salim Roukos, T. Ward et al.
DsDm: Model-Aware Dataset Selection with Datamodels
Logan Engstrom, Axel Feldmann, A. Mądry
Large Language Models for Summarizing Czech Historical Documents and Beyond
Václav Tran, Jakub Šmíd, J. Martínek et al.
Multilingual Translation from Denoising Pre-Training
Y. Tang, C. Tran, Xian Li et al.
LoRA: Low-Rank Adaptation of Large Language Models
J. Hu, Yelong Shen, Phillip Wallis et al.
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Nils Reimers, Iryna Gurevych
Sampling techniques.
B. Longest
Comparative Analysis of Neural Translation Models based on Transformers Architecture
Alexander V. Smirnov, N. Teslya, N. Shilov et al.
METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments
Satanjeev Banerjee, A. Lavie
Beyond English-Centric Multilingual Machine Translation
Angela Fan, Shruti Bhosale, Holger Schwenk et al.
Neural Machine Translation for Low-resource Languages: A Survey
Surangika Ranathunga, E. Lee, M. Skenduli et al.
A statistical interpretation of term specificity and its application in retrieval
Karen Spärck Jones
Billion-Scale Similarity Search with GPUs
Jeff Johnson, Matthijs Douze, H. Jégou
chrF: character n-gram F-score for automatic MT evaluation
Maja Popovic