CRAFT: Clustered Regression for Adaptive Filtering of Training data

TL;DR

CRAFT improves BLEU by 2.13 points over TSDS on English-Hindi translation by using clustered regression for adaptive filtering of training data.

cs.CL 🔴 Advanced 2026-04-25
Parthasarathi Panda Asheswari Swain Subhrakanta Panda
data selection clustering machine translation distribution matching TF-IDF

Key Findings

Methodology

The CRAFT method decomposes the source-target joint distribution and employs a two-stage selection strategy: (i) match the validation source distribution through proportional budget allocation across k-means clusters, and (ii) within each source cluster, select the training pairs whose target embeddings minimize a conditional expected distance derived from the validation target distribution. The authors prove that proportional cluster allocation bounds the continuous KL divergence between the selected and validation distributions, with the residual term controlled by cluster diameters.
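
The two-stage strategy can be sketched in plain Python. This is an illustrative simplification, not the authors' implementation: cluster labels and validation statistics are assumed precomputed, simple rounding stands in for exact budget allocation, and distance to the per-cluster validation target mean stands in for the paper's conditional expected distance.

```python
import math

def craft_select(train_src_cluster, train_tgt_emb, val_cluster_share,
                 val_tgt_mean_by_cluster, budget):
    """Two-stage CRAFT-style selection (illustrative sketch).

    train_src_cluster: cluster id per training pair (k-means on source embeddings).
    train_tgt_emb:     target-side embedding vector per training pair.
    val_cluster_share: {cluster_id: fraction of validation sources in that cluster}.
    val_tgt_mean_by_cluster: {cluster_id: mean validation target embedding}.
    budget:            total number of pairs to select.
    """
    # Stage 1: proportional budget allocation across source clusters
    # (simple rounding; an exact scheme would use largest-remainder rounding).
    alloc = {c: round(budget * p) for c, p in val_cluster_share.items()}

    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    # Stage 2: within each cluster, keep the pairs whose target embeddings
    # lie closest to the validation target mean for that cluster.
    selected = []
    for c, n_c in alloc.items():
        members = [i for i, cl in enumerate(train_src_cluster) if cl == c]
        members.sort(key=lambda i: dist(train_tgt_emb[i],
                                        val_tgt_mean_by_cluster[c]))
        selected.extend(members[:n_c])
    return selected
```

For example, with five training pairs in two clusters and a budget of two, the sketch keeps the pair in each cluster whose target embedding is nearest the cluster's validation target mean.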

Key Results

  • CRAFT achieves 43.34 BLEU on English-Hindi translation, selecting from 33 million NLLB sentence pairs, outperforming TSDS by 2.13 points while completing selection more than 40 times faster.
  • With TF-IDF vectorization, the entire pipeline completes in under one minute on CPU: CRAFT takes 26.86 seconds versus TAROT's 75.6 seconds, a 2.8x speedup.
  • With embeddings on a 1M candidate pool CRAFT achieves 43.34 BLEU, and with TF-IDF it achieves 41.78 BLEU, comparable to TSDS's 41.21 BLEU.

Significance

The CRAFT method significantly reduces computational costs while improving model performance by selecting high-quality subsets for fine-tuning from large-scale corpora. It addresses the shortcomings of traditional methods in handling source-target conditional relationships, especially in multilingual translation tasks. Beyond its academic contribution, the method offers industry an efficient, practical data selection strategy.

Technical Contribution

CRAFT fundamentally differs from existing state-of-the-art methods by independently clustering source and target embeddings to capture conditional structure in the validation set, avoiding the joint-embedding processing used by traditional methods. It provides new theoretical guarantees by bounding the KL divergence between the selected and validation distributions, ensuring the selected subset aligns with the validation distribution. Additionally, CRAFT achieves significant speed improvements, particularly on large-scale datasets.

Novelty

CRAFT is the first method to perform data selection through source-target distribution decomposition. Compared to existing methods, CRAFT captures conditional structures in the validation set by independently clustering source and target embeddings, offering a new perspective on data selection and avoiding the shortcomings of traditional methods in handling source-target conditional relationships.

Limitations

  • CRAFT may be affected by clustering quality when dealing with very high-dimensional embeddings, leading to a mismatch between the selected subset and the validation distribution.
  • In some low-resource language pairs, the conditional relationship between source and target may not be apparent, affecting CRAFT's selection effectiveness.
  • CRAFT relies on the quality and representativeness of the validation set; if the validation set is not representative, it may lead to a decline in the performance of the selected subset.

Future Work

Future research can explore the application of CRAFT in other tasks, such as image classification or text generation. Additionally, research can focus on enhancing CRAFT's performance in low-resource language pairs or combining other data selection strategies to improve selection accuracy and efficiency.

AI Executive Summary

As corpora continue to grow, selecting a small, high-quality subset for fine-tuning becomes increasingly important. Existing methods fall short in handling source-target conditional relationships, leading to a mismatch between the selected subset and the validation distribution. The CRAFT method offers a new solution through clustered regression for adaptive filtering of training data.

CRAFT decomposes the source-target joint distribution and employs a two-stage selection strategy. First, it matches the validation source distribution through proportional budget allocation across k-means clusters. Then, within each source cluster, it selects training pairs whose target embeddings minimize a conditional expected distance derived from the validation target distribution. The authors prove that proportional cluster allocation bounds the continuous KL divergence between the selected and validation distributions, with the residual controlled by cluster diameters.

In experiments on English-Hindi translation, CRAFT selects data from 33 million NLLB sentence pairs to fine-tune the mBART model and achieves a BLEU score of 43.34, outperforming TSDS by 2.13 points while completing selection more than 40 times faster. With TF-IDF vectorization the entire pipeline finishes in under one minute on CPU: CRAFT takes 26.86 seconds versus TAROT's 75.6 seconds, a 2.8x speedup.

The CRAFT method significantly reduces computational costs while improving model performance by selecting high-quality subsets for fine-tuning from large-scale corpora. It addresses the shortcomings of traditional methods in handling source-target conditional relationships, especially in multilingual translation tasks.

However, CRAFT may be affected by clustering quality when dealing with very high-dimensional embeddings, leading to a mismatch between the selected subset and the validation distribution. Additionally, in some low-resource language pairs, the conditional relationship between source and target may not be apparent, affecting CRAFT's selection effectiveness. Future research can explore the application of CRAFT in other tasks, such as image classification or text generation.

Deep Analysis

Background

As the field of natural language processing rapidly evolves, the performance of machine translation models depends critically on the quality and relevance of their training data. In recent years, parallel corpora have expanded to tens of millions of sentence pairs, such as the 33 million English-Hindi pairs in the NLLB corpus. Full fine-tuning on corpora of this scale, however, is computationally expensive and often unnecessary.

Core Problem

Selecting appropriate training data from large-scale corpora has emerged as a critical challenge. Full fine-tuning is not only computationally expensive but often unnecessary. A small, well-chosen subset can match or exceed the performance of training on the full dataset, provided the selection captures the right distributional properties. Existing methods address this problem with varying trade-offs between quality and computational cost, such as lexical methods that are fast but fail to capture semantic structure, and gradient-based methods that achieve strong performance but require expensive encoder inference or optimal transport solves over the full candidate pool.

Innovation

The CRAFT method decomposes the source-target joint distribution and employs a two-stage selection strategy: (i) match the validation source distribution through proportional budget allocation across k-means clusters, and (ii) within each source cluster, select the training pairs whose target embeddings minimize a conditional expected distance derived from the validation target distribution. The authors prove that proportional cluster allocation bounds the continuous KL divergence between the selected and validation distributions, with the residual controlled by cluster diameters. CRAFT fundamentally differs from existing state-of-the-art methods by independently clustering source and target embeddings to capture conditional structure in the validation set, avoiding the joint-embedding processing used by traditional methods.

Methodology

  • CRAFT decomposes the source-target joint distribution and employs a two-stage selection strategy.
  • First, it matches the validation source distribution through proportional budget allocation across k-means clusters.
  • Then, within each source cluster, it selects training pairs whose target embeddings minimize a conditional expected distance derived from the validation target distribution.
  • The authors prove that proportional cluster allocation bounds the continuous KL divergence between the selected and validation distributions, with the residual controlled by cluster diameters.
  • Unlike existing state-of-the-art methods, CRAFT independently clusters source and target embeddings to capture conditional structure in the validation set, avoiding joint-embedding processing.
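
The first stage's proportional budget allocation can be made exact with largest-remainder rounding, so the per-cluster budgets sum precisely to the total. A minimal sketch (the function name and rounding scheme are illustrative assumptions, not taken from the paper):

```python
def proportional_allocation(cluster_shares, budget):
    """Allocate an integer budget across clusters in proportion to each
    cluster's validation-set share, using largest-remainder rounding so
    the allocations sum exactly to the budget."""
    raw = {c: budget * p for c, p in cluster_shares.items()}
    alloc = {c: int(r) for c, r in raw.items()}  # floor each allocation
    shortfall = budget - sum(alloc.values())
    # Hand the leftover units to the clusters with the largest remainders.
    by_remainder = sorted(raw, key=lambda c: raw[c] - alloc[c], reverse=True)
    for c in by_remainder[:shortfall]:
        alloc[c] += 1
    return alloc
```

With shares 0.5 / 0.3 / 0.2 and a budget of 7, floor allocation gives 3 + 2 + 1 = 6, and the leftover unit goes to the cluster with remainder 0.5.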

Experiments

The experiments target English-Hindi translation. Training data is selected from a pool of 33 million NLLB sentence pairs, and the selected subset is used to fine-tune the mBART model via LoRA. CRAFT is compared against TSDS and TAROT on the same candidate pool and encoder, measuring both BLEU and wall-clock selection time.

Results

CRAFT achieves a BLEU score of 43.34, outperforming TSDS (41.21 BLEU) by 2.13 points while completing selection more than 40 times faster. With TF-IDF vectorization the entire pipeline finishes in under one minute on CPU: 26.86 seconds for CRAFT versus 75.6 seconds for TAROT, a 2.8x speedup.

Applications

The CRAFT method significantly reduces computational costs while improving model performance by selecting high-quality subsets for fine-tuning from large-scale corpora. It addresses the shortcomings of traditional methods in handling source-target conditional relationships, especially in multilingual translation tasks.

Limitations & Outlook

CRAFT may be affected by clustering quality when dealing with very high-dimensional embeddings, leading to a mismatch between the selected subset and the validation distribution. Additionally, in some low-resource language pairs, the conditional relationship between source and target may not be apparent, affecting CRAFT's selection effectiveness. Future research can explore the application of CRAFT in other tasks, such as image classification or text generation.

Plain Language Accessible to non-experts

Imagine you're in a huge library with thousands of books. You need to pick a few books to write an essay on a specific topic, but you don't have time to read them all. The CRAFT method is like a smart librarian who knows how to quickly find the most relevant books. First, they group all the books by topic, like putting them on different shelves. Then, they pick the books on each shelf that best represent the entire topic. It's like finding the most valuable books on each shelf rather than just picking a few at random. This way, you can quickly find the most useful information to write your essay without having to read all the books. The CRAFT method helps you quickly find the most relevant information without needing to read everything.

ELI14 Explained like you're 14

Hey there! Imagine you have a giant toy box with thousands of toys, but you can only pick a few to play with because you don't have much time. The CRAFT method is like a super-smart toy picker! First, it sorts the toys into different boxes by type, like cars, dolls, blocks, and so on. Then, it picks the coolest and most fun toys from each box. This way, you get to play with the best toys in a short amount of time, instead of wasting time on toys that aren't as fun. The CRAFT method helps you quickly find the best things to play with! Isn't that awesome?

Glossary

CRAFT (Clustered Regression for Adaptive Filtering of Training data)

A method for adaptive filtering of training data through clustered regression, aimed at selecting high-quality subsets from large-scale corpora for fine-tuning.

Used in the paper for selecting training data for English-Hindi translation tasks.

k-means Clustering

An unsupervised learning algorithm that partitions data points into k groups, each represented by a centroid.

Used to partition source and target embeddings into different clusters.
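
For reference, the Lloyd's algorithm underlying k-means fits in a few lines of plain Python. This is an illustrative toy; production pipelines would use an optimized library implementation (e.g. FAISS, which appears in the references).

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's k-means over plain Python float vectors."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]

    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda j: sqdist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(xs) / len(members) for xs in zip(*members)]
    return labels, centroids
```

On two well-separated groups of points, the two clusters are recovered regardless of which points seed the centroids.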

BLEU (Bilingual Evaluation Understudy)

A metric for evaluating machine translation quality by measuring n-gram overlap between a system's output and one or more reference translations.

Used to evaluate the performance of the CRAFT method in English-Hindi translation tasks.

TF-IDF (Term Frequency-Inverse Document Frequency)

A text vectorization method that weights each term by its frequency within a document relative to its rarity across the corpus.

Used in the vectorization step of the CRAFT method.
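
A minimal TF-IDF computation makes the glossary entry concrete. This is an illustrative sketch on whitespace-tokenized text, not the vectorizer used in the paper:

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute per-document TF-IDF weights for whitespace-tokenized docs."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    # Inverse document frequency: rare terms get larger weights.
    idf = {t: math.log(n / df[t]) for t in df}

    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        weights.append({t: (tf[t] / len(toks)) * idf[t] for t in tf})
    return weights
```

A term that occurs in only one document ("dog" below) outweighs a common term ("the") at equal within-document frequency.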

LoRA (Low-Rank Adaptation)

A parameter-efficient fine-tuning method for large language models that trains low-rank update matrices while keeping the original weights frozen.

Used to fine-tune the mBART model.

KL Divergence

A measure of the difference between two probability distributions.

Used to bound the gap between the selected and validation distributions in CRAFT's theoretical analysis.
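
For the discrete case, KL divergence is a one-liner (illustrative; the paper's bound concerns the continuous KL divergence):

```python
import math

def kl_divergence(p, q):
    """Discrete KL divergence D(p || q) in nats.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

It is zero only when the two distributions coincide, and it is asymmetric in its arguments.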

NLLB Corpus

A large multilingual parallel corpus for machine translation; its English-Hindi portion of 33 million sentence pairs serves as the candidate pool in this paper.

Used as the experimental dataset for evaluating the CRAFT method.

mBART Model

A multilingual sequence-to-sequence pre-trained model suitable for multilingual translation tasks.

Used to evaluate the translation performance of the CRAFT method.

Conditional Expected Distance

The expected distance from a candidate's target embedding to the validation target distribution, conditioned on the candidate's source cluster.

Used in the target selection step of the CRAFT method.

Distribution Matching

A method for adjusting the selected dataset to make its distribution similar to the validation set.

Used in the source distribution matching step of the CRAFT method.

Open Questions Unanswered questions from this research

  1. CRAFT may be affected by clustering quality when dealing with very high-dimensional embeddings, leading to a mismatch between the selected subset and the validation distribution. Future research can explore how to improve clustering quality to ensure the selected subset aligns more closely with the validation distribution.
  2. In some low-resource language pairs, the conditional relationship between source and target may not be apparent, affecting CRAFT's selection effectiveness. Future research can explore how to enhance CRAFT's performance in low-resource language pairs or combine other data selection strategies to improve selection accuracy and efficiency.
  3. CRAFT relies on the quality and representativeness of the validation set; if the validation set is not representative, it may lead to a decline in the performance of the selected subset. Future research can explore how to select high-quality subsets even when the validation set is not representative.
  4. While CRAFT performs well on large-scale datasets, its performance on small-scale datasets has not been fully verified. Future research can explore CRAFT's performance on small-scale datasets and compare it with other methods.
  5. CRAFT shows superiority in multilingual translation tasks, but its application in other tasks has not been fully verified. Future research can explore the application of CRAFT in other tasks, such as image classification or text generation.

Applications

Immediate Applications

Multilingual Translation

The CRAFT method can be used for multilingual translation tasks to select high-quality training data and improve translation model performance.

Text Classification

The CRAFT method can be used for text classification tasks by selecting the most relevant training data to improve classification model accuracy.

Speech Recognition

The CRAFT method can be used for speech recognition tasks to select high-quality training data and improve recognition model performance.

Long-term Vision

Autonomous Driving

The CRAFT method can be applied to the perception module in autonomous driving to select high-quality training data and improve perception model accuracy and robustness.

Intelligent Customer Service

The CRAFT method can be used in intelligent customer service systems to select high-quality training data and improve response quality and user satisfaction.

Abstract

Selecting a small, high-quality subset from a large corpus for fine-tuning is increasingly important as corpora grow to tens of millions of datapoints, making full fine-tuning expensive and often unnecessary. We propose CRAFT (Clustered Regression for Adaptive Filtering of Training data), a vectorization-agnostic selection method for training sequence-to-sequence models. CRAFT decomposes the joint source-target distribution and performs a two-stage selection: (i) match the validation source distribution through proportional budget allocation across k-means clusters, and (ii) within each source cluster, select training pairs whose target embeddings minimize a conditional expected distance derived from the validation target distribution. We prove that proportional cluster allocation bounds the continuous KL divergence between selected and validation distributions, with the residual controlled by cluster diameters. We evaluate CRAFT on English-Hindi translation by selecting training data from 33 million NLLB sentence pairs and fine-tuning mBART via LoRA. CRAFT achieves 43.34 BLEU, outperforming TSDS (41.21) by 2.13 points on the same candidate pool and encoder while completing selection over 40 times faster. With TF-IDF vectorization, the entire pipeline completes in under one minute on CPU. TAROT achieves 45.61 BLEU, but CRAFT completes selection in 26.86 seconds versus TAROT's 75.6 seconds, a 2.8x speedup.

cs.CL cs.AI

References (20)

  1. Lang Feng, Fan Nie, Yuejiang Liu et al. (2024). TAROT: Targeted Data Selection via Optimal Transport.
  2. Zifan Liu, Amin Karbasi, Theodoros Rekatsinas (2024). TSDS: Data Selection for Task-Specific Model Finetuning.
  3. Sang Michael Xie, Shibani Santurkar, Tengyu Ma et al. (2023). Data Selection for Language Models via Importance Resampling.
  4. Mengzhou Xia, Sadhika Malladi, Suchin Gururangan et al. (2024). LESS: Selecting Influential Data for Targeted Instruction Tuning.
  5. Feiyang Kang, H. Just, Anit Kumar Sahu et al. (2023). Performance Scaling via Optimal Transport: Enabling Data Selection from Partially Revealed Sources.
  6. Alon Albalak, Yanai Elazar, Sang Michael Xie et al. (2024). A Survey on Data Selection for Language Models.
  7. Kishore Papineni, Salim Roukos, T. Ward et al. (2002). Bleu: a Method for Automatic Evaluation of Machine Translation.
  8. Logan Engstrom, Axel Feldmann, A. Mądry (2024). DsDm: Model-Aware Dataset Selection with Datamodels.
  9. Václav Tran, Jakub Šmíd, J. Martínek et al. (2025). Large Language Models for Summarizing Czech Historical Documents and Beyond.
  10. Y. Tang, C. Tran, Xian Li et al. (2021). Multilingual Translation from Denoising Pre-Training.
  11. J. Hu, Yelong Shen, Phillip Wallis et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models.
  12. Nils Reimers, Iryna Gurevych (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.
  13. B. Longest (1971). Sampling Techniques.
  14. Alexander V. Smirnov, N. Teslya, N. Shilov et al. (2022). Comparative Analysis of Neural Translation Models based on Transformers Architecture.
  15. Satanjeev Banerjee, A. Lavie (2005). METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments.
  16. Angela Fan, Shruti Bhosale, Holger Schwenk et al. (2020). Beyond English-Centric Multilingual Machine Translation.
  17. Surangika Ranathunga, E. Lee, M. Skenduli et al. (2021). Neural Machine Translation for Low-resource Languages: A Survey.
  18. Karen Spärck Jones (2021). A statistical interpretation of term specificity and its application in retrieval.
  19. Jeff Johnson, Matthijs Douze, H. Jégou (2017). Billion-Scale Similarity Search with GPUs.
  20. Maja Popovic (2015). chrF: character n-gram F-score for automatic MT evaluation.