Mending the Holes: Mitigating Reward Hacking in Reinforcement Learning for Multilingual Translation

TL;DR

WALAR, a reinforcement learning method, improves low-resource language translation using only monolingual data and outperforms the LLaMAX model.

cs.CL 🔴 Advanced 2026-03-13
Yifeng Liu, Siqi Ouyang, Yatish Hosmane Revanasiddappa, Lei Li
Reinforcement Learning · Multilingual Translation · Reward Hacking · Low-Resource Languages · Large Language Models

Key Findings

Methodology

This paper introduces a reinforcement learning method called WALAR, focusing on enhancing large language models' translation performance in low-resource languages using monolingual text. WALAR addresses failure modes ('holes') in existing quality estimation models through techniques like word alignment and language alignment, preventing reward hacking. The method is implemented in the GRPO training framework and tested on models like Qwen3-8B, LLaMAX3-8B-Alpaca, and Translategemma-4B-it.

Key Results

  • Result 1: On the Flores-101 dataset across 1414 language directions, models trained with WALAR showed significant improvements in spBLEU, with LLaMAX3-8B-Alpaca improving from 54.00 to 60.31 in Swahili-X translation.
  • Result 2: In xCOMET* scoring, the LLaMAX3-8B-Alpaca+WALAR model's average score increased from 64.97 to 71.34, indicating substantial improvements in multilingual translation.
  • Result 3: Ablation studies confirmed the importance of word alignment and language alignment in the reward signal, especially effective in low-resource language directions.

Significance

This research removes the dependency on high-quality parallel data for low-resource language translation by leveraging monolingual data, significantly enhancing multilingual translation performance. Academically, WALAR offers a new solution for low-resource language translation; practically, it can aid the development of more efficient multilingual translation systems.

Technical Contribution

Technical contributions include: 1) Introducing a novel reward signal design combining quality estimation, word alignment, and language alignment to prevent reward hacking; 2) Implementing an effective post-training strategy under the GRPO framework, significantly improving multilingual translation model performance; 3) Providing a solution for low-resource language translation without parallel data, expanding the application range of large language models.

Novelty

WALAR is the first method to enhance low-resource translation performance through reinforcement learning on monolingual data alone. Compared to existing post-training strategies that rely on parallel data, WALAR innovatively addresses the 'holes' in quality estimation models, offering a more broadly applicable solution.

Limitations

  • Limitation 1: WALAR may still face challenges in extremely low-resource languages, as monolingual data for these languages might also be scarce.
  • Limitation 2: Although WALAR performs well in experiments, its training process requires substantial computational resources, potentially limiting its application in resource-constrained environments.
  • Limitation 3: The method's performance heavily relies on the accuracy of the quality estimation model used; any bias in the model could affect the final results.

Future Work

Future research directions include: 1) Exploring ways to further enhance WALAR's performance in even lower-resource environments; 2) Developing more efficient quality estimation models to improve reward signal accuracy; 3) Applying WALAR to other natural language processing tasks, such as text generation and dialogue systems, to verify its broad applicability.

AI Executive Summary

In recent years, large language models (LLMs) have demonstrated remarkable capabilities in machine translation, particularly for high-resource language pairs. However, their performance for low-resource languages remains significantly inferior. Existing post-training methods primarily rely on high-quality parallel data, which is often scarce or unavailable for low-resource languages.

This paper introduces a reinforcement learning method called WALAR, which uses only monolingual text to enhance LLMs' translation capabilities for a wide range of low-resource languages while maintaining their performance on high-resource languages. The core insight of WALAR is based on the observation of failure modes ('holes') in existing source-based multilingual quality estimation (QE) models. Reinforcement learning using these QE models tends to amplify such holes, resulting in poorer multilingual LLMs.

To address this issue, WALAR develops techniques including word alignment and language alignment to mitigate such holes in the reward signal for RL training. The method is implemented in the Group Relative Policy Optimization (GRPO) training framework and tested on LLMs supporting translation of 101 languages. Experimental results show that the new model outperforms LLaMAX, one of the strongest open-source multilingual LLMs, by a large margin on 1414 language directions on the Flores-101 dataset.
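As a rough illustration of the GRPO framework mentioned above: GRPO dispenses with a learned critic and instead normalizes each sampled translation's reward against the other samples drawn for the same source sentence. The helper below is a minimal stdlib sketch of that group-relative advantage computation, not the paper's implementation.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: A_i = (r_i - mean(r)) / std(r), computed
    within one group of sampled translations for the same source."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # All samples scored identically: no learning signal for this group.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Example: four sampled translations of one source sentence, each scored
# by the (hole-mitigated) reward. Above-average samples get positive
# advantages, below-average ones negative; advantages sum to zero.
advantages = group_relative_advantages([0.9, 0.7, 0.5, 0.3])
```

Because advantages are relative within a group, a reward that uniformly inflates all samples (as a hacked QE score might) contributes no gradient, but systematic holes that favor degenerate outputs within a group still do, which is why the reward itself must be patched.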

In the experiments, WALAR demonstrated outstanding performance across various evaluation metrics, particularly in low-resource language directions, significantly improving translation quality. Ablation studies confirmed the importance of word alignment and language alignment in the reward signal, especially in preventing reward hacking.

The introduction of WALAR holds significant academic value, offering a new solution for low-resource language translation, and potential industrial applications, aiding in the development of more efficient multilingual translation systems. However, WALAR may still face challenges in extremely low-resource languages, as monolingual data for these languages might also be scarce. Additionally, while WALAR performs well in experiments, its training process requires substantial computational resources, potentially limiting its application in resource-constrained environments. Future research directions include exploring ways to further enhance WALAR's performance in even lower-resource environments and developing more efficient quality estimation models to improve reward signal accuracy.

Deep Analysis

Background

In recent years, with the development of large language models (LLMs), machine translation technology has made significant progress, especially in translating high-resource language pairs. However, translation quality for low-resource languages remains unsatisfactory. Traditional methods primarily rely on high-quality parallel data for post-training, such as supervised fine-tuning, knowledge distillation, and back-translation. However, these methods are ineffective for low-resource or zero-resource languages due to the lack of large amounts of high-quality parallel data. To overcome this challenge, researchers have begun exploring methods to enhance translation performance using monolingual data.

Core Problem

The core problem in low-resource language translation is the lack of high-quality parallel data, making traditional post-training methods ineffective. Existing quality estimation models have holes when evaluating translation quality, leading to potential reward hacking in reinforcement learning, where models may gain high scores by simply repeating input source sentences. This not only affects translation quality but also limits model generalization.
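The source-copying failure described above (a model earning high QE scores by echoing the input) can be caught with a simple token-overlap heuristic. The function below is an illustrative sketch; the name and thresholding are our own, not the paper's.

```python
def copy_rate(source: str, hypothesis: str) -> float:
    """Fraction of hypothesis tokens that also appear in the source.
    A value near 1.0 suggests the model is echoing the input rather
    than translating it -- the reward-hacking failure described above."""
    src_tokens = set(source.lower().split())
    hyp_tokens = hypothesis.lower().split()
    if not hyp_tokens:
        return 0.0
    return sum(t in src_tokens for t in hyp_tokens) / len(hyp_tokens)

# An echoed "translation" is flagged; a genuine one is not.
print(copy_rate("the cat sat", "the cat sat"))        # 1.0
print(copy_rate("the cat sat", "le chat est assis"))  # 0.0
```

In practice a reward pipeline would gate or penalize samples whose copy rate exceeds some threshold, so that "repeat the source" stops being a profitable policy.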

Innovation

The core innovations of the WALAR method include: 1) Using monolingual data for reinforcement learning, avoiding dependency on parallel data; 2) Introducing word alignment and language alignment techniques to address holes in quality estimation models; 3) Implementing an effective post-training strategy under the GRPO framework, significantly improving multilingual translation model performance.

Methodology

  • Use monolingual data for reinforcement learning, avoiding dependency on parallel data.
  • Introduce word alignment to ensure proper coverage of words in the target sentence, avoiding over-translation and omissions.
  • Introduce language alignment to ensure the generated translation matches the expected target language.
  • Conduct post-training under the GRPO framework, optimizing the model's translation performance.
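The steps above combine three signals into one reward. A minimal sketch of one plausible combination follows; the exact combination rule, weights, and function names here are assumptions for illustration, not the paper's formula.

```python
def composite_reward(qe_score: float, coverage: float, lang_ok: bool) -> float:
    """Hypothetical combination of the three signals described above:
    the QE score only counts when the translation adequately covers
    the source words AND is in the expected target language."""
    if not lang_ok:
        # Language alignment acts as a hard gate: wrong language -> no reward,
        # no matter how fluent the output looks to the QE model.
        return 0.0
    # Word-alignment coverage (in [0, 1]) softly scales the QE reward,
    # penalizing omissions and over-translation.
    return qe_score * coverage

# A fluent output in the wrong language gets zero reward,
# closing a 'hole' that a QE-only reward would leave open.
assert composite_reward(qe_score=0.9, coverage=0.95, lang_ok=False) == 0.0
assert composite_reward(qe_score=0.9, coverage=0.95, lang_ok=True) > 0.8
```

Gating on language identity and scaling by coverage means a policy can no longer maximize reward through degenerate outputs the QE model happens to like.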

Experiments

The experiments are conducted on the Flores-101 dataset, covering 1414 language directions. Translation quality is evaluated with spBLEU, xCOMET*, and MetricX*, and ablation studies verify the effects of word alignment and language alignment. Baselines include LLaMAX3-8B-Alpaca, Qwen3-8B, and Translategemma-4B-it.

Results

The experimental results show that WALAR performs strongly across the evaluation metrics, with the largest gains in low-resource language directions. For example, LLaMAX3-8B-Alpaca's spBLEU on Swahili-X translation improved from 54.00 to 60.31, indicating substantial improvements in multilingual translation.

Applications

The WALAR method can be directly applied to multilingual translation systems, especially for low-resource language translation. Its characteristic of not requiring parallel data gives it a significant advantage in data-scarce environments, helping to develop more efficient translation systems.

Limitations & Outlook

WALAR may still face challenges in extremely low-resource languages, as monolingual data for these languages might also be scarce. Additionally, while WALAR performs well in experiments, its training process requires substantial computational resources, potentially limiting its application in resource-constrained environments. Future research directions include exploring ways to further enhance WALAR's performance in even lower-resource environments and developing more efficient quality estimation models to improve reward signal accuracy.

Plain Language (Accessible to non-experts)

Imagine you're in a kitchen cooking. You have a recipe (large language model) that tells you how to make a dish (translation). For some common ingredients (high-resource languages), you have detailed steps and ingredient lists (parallel data), so you can easily make delicious dishes. But for some uncommon ingredients (low-resource languages), you don't have detailed ingredient lists, so you have to rely on experience and intuition (monolingual data) to cook.

The WALAR method is like a smart assistant that helps you make delicious dishes even without detailed ingredient lists. It observes your cooking process (quality estimation model), identifies potential mistakes (holes), and gives you suggestions (reward signals), like 'add more salt' or 'use less oil'.

This assistant also reminds you to use the right ingredients (language alignment), ensuring that the dish you make matches the expected flavor (target language). In this way, even without detailed ingredient lists, you can make delicious dishes and improve your cooking skills (translation ability).

ELI14 (Explained like you're 14)

Hey, friends! Imagine you're playing a super cool translation game. You have a super smart robot assistant that helps you translate one language into another. For some common languages, like English and French, this robot assistant does a great job because it has lots of ready-made translation examples to refer to.

But for some less common languages, like Swahili, this robot assistant gets a bit lost because it doesn't have as many examples to refer to. But don't worry! Our WALAR method is like giving this robot assistant a super brain that can learn and find translation patterns on its own.

This super brain can also spot mistakes the robot assistant might make while translating, like translating into the wrong language or missing some important words. It gives the robot assistant some tips to help it correct mistakes.

In this way, even without many examples, this robot assistant can become smarter and translate better and better! Isn't that cool?

Glossary

Reinforcement Learning

A machine learning method that guides models to learn optimal strategies through reward and punishment mechanisms.

Used in this paper to train translation models, optimizing translation quality through reward signals.

Large Language Model

A deep learning-based model capable of processing and generating natural language text.

Used for multilingual translation tasks to enhance translation performance.

Quality Estimation

A method for evaluating translation quality, typically without requiring reference translations.

Used to generate reward signals to guide model learning.

Word Alignment

Identifying corresponding relationships between words in source and target languages.

Ensures proper coverage of words in translation, avoiding omissions or over-translation.

Language Alignment

Ensures the generated translation matches the expected target language.

Avoids translating into the wrong language, improving translation consistency.

Reward Hacking

A phenomenon where models gain high reward scores through improper means.

In this paper, it refers to models gaining high scores by repeating input source sentences.

Flores-101 Dataset

A dataset used to evaluate multilingual translation performance, covering 101 languages.

Used to evaluate the translation performance of the WALAR method.

GRPO (Group Relative Policy Optimization)

A reinforcement learning algorithm for optimizing policies.

Used in this paper to train translation models, enhancing translation quality.

spBLEU

A metric for evaluating translation quality based on BLEU scores.

Used to assess the translation performance of the WALAR method.

xCOMET*

An improved translation quality evaluation metric considering language consistency.

Used to evaluate the translation performance of the WALAR method.

Open Questions (Unanswered questions from this research)

  • 1. How can WALAR's performance be further enhanced in extremely low-resource language environments? Current methods may perform poorly when monolingual data is extremely scarce, requiring new data acquisition and utilization strategies.
  • 2. How can the accuracy of quality estimation models be further improved? Existing QE models may be biased in certain situations, weakening the effectiveness of the reward signal.
  • 3. Can the WALAR method be applied to other natural language processing tasks? Its applicability and effectiveness in tasks such as text generation and dialogue systems remain to be verified.
  • 4. How can the computational resource requirements of the WALAR method be reduced? The current training process requires substantial resources, limiting its application in resource-constrained environments.
  • 5. How can the implementation of the WALAR method be simplified without sacrificing translation quality? More concise and efficient algorithm designs remain to be explored.

Applications

Immediate Applications

Low-Resource Language Translation

The WALAR method can be used to enhance the translation quality of low-resource languages, aiding in the development of more efficient translation systems.

Multilingual Translation Systems

Developers can build multilingual translation systems supporting various languages using the WALAR method, especially in data-scarce environments.

Language Learning Tools

The WALAR method can be used to develop language learning tools, helping users learn and translate low-resource languages.

Long-term Vision

Global Language Communication

Enhancing low-resource language translation capabilities promotes global language communication and cultural exchange.

Cross-Cultural Collaboration

The WALAR method helps eliminate language barriers, promoting cross-cultural collaboration and international exchange.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capability in machine translation on high-resource language pairs, yet their performance on low-resource translation still lags behind. Existing post-training methods rely heavily on high-quality parallel data, which are often scarce or unavailable for low-resource languages. In this paper, we introduce WALAR, a reinforcement training method using only monolingual text to elevate LLMs' translation capabilities on massive low-resource languages while retaining their performance on high-resource languages. Our key insight is based on the observation of failure modes (or "holes") in existing source-based multilingual quality estimation (QE) models. Reinforcement learning (RL) using these QE models tends to amplify such holes, resulting in poorer multilingual LLMs. We develop techniques including word alignment and language alignment to mitigate such holes in WALAR's reward for RL training. We continually trained an LLM supporting translation of 101 languages using WALAR. The experiments show that our new model outperforms LLaMAX, one of the strongest open-source multilingual LLMs by a large margin on 1400 language directions on Flores-101 dataset.


References (20)

  • Juraj Juraska, Daniel Deutsch, Mara Finkelstein et al. (2024). MetricX-24: The Google Submission to the WMT 2024 Metrics Shared Task.
  • G. Wrenn (2017). Tower.
  • Haoran Xu, Kenton Murray, Philipp Koehn et al. (2024). X-ALMA: Plug & Play Modules and Adaptive Rejection for Quality Translation at Scale.
  • Amir Hossein Kargaran, François Yvon, Hinrich Schütze (2024). MaskLID: Code-Switching Language Identification through Iterative Masking.
  • Fei Yuan, Shuai Yuan, Zhiyong Wu et al. (2023). How Vocabulary Sharing Facilitates Multilingualism in LLaMA?
  • Zi-Yi Dou, Graham Neubig (2021). Word Alignment by Fine-tuning Embeddings on Parallel Corpora.
  • Muhammed Yusuf Kocyigit, Eleftheria Briakou, Daniel Deutsch et al. (2025). Overestimation in LLM Evaluation: A Controlled Large-Scale Study on Data Contamination's Impact on Machine Translation.
  • Manish Kumar, George F. Foster, Colin Cherry et al. (2019). Reinforcement Learning based Curriculum Optimization for Neural Machine Translation.
  • Long Ouyang, Jeff Wu, Xu Jiang et al. (2022). Training language models to follow instructions with human feedback.
  • Wenhao Zhu, Hongyi Liu, Qingxiu Dong et al. (2023). Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis.
  • John Dang, Shivalika Singh, Daniel D'souza et al. (2024). Aya Expanse: Combining Research Breakthroughs for a New Multilingual Frontier.
  • Yinquan Lu, Wenhao Zhu, Lei Li et al. (2024). LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages.
  • C. Tran, Y. Tang, Xian Li et al. (2020). Cross-lingual Retrieval for Iterative Self-Supervised Training.
  • Go Inoue, Bashar Alhafni, Nurpeiis Baimukan et al. (2021). The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models.
  • Markus Freitag, Nitika Mathur, Daniel Deutsch et al. (2024). Are LLMs Breaking MT Metrics? Results of the WMT24 Metrics Shared Task.
  • Ricardo Rei, Craig Alan Stewart, Ana C. Farinha et al. (2020). COMET: A Neural Framework for MT Evaluation.
  • Tom B. Brown, Benjamin Mann, Nick Ryder et al. (2020). Language Models are Few-Shot Learners.
  • Nuno M. Guerreiro, Ricardo Rei, Daan van Stigt et al. (2023). xCOMET: Transparent Machine Translation Evaluation through Fine-grained Error Detection.
  • Miguel Moura Ramos, Patrick Fernandes, António Farinhas et al. (2023). Aligning Neural Machine Translation Models: Human Feedback in Training and Inference.
  • Tengfei Wang, K. Cullinane, Dong-Wook Song (2005). Empirical Results and Analysis.