Fabricator or dynamic translator?

TL;DR

LLMs often overgenerate content in translation; combining MTQE and alignment-based detection strategies helps identify and categorize these overgenerations.

cs.CL · Advanced · 2026-04-16
Lisa Vasileva, Karin Sim
LLMs · machine translation · overgeneration · translation quality · detection strategies

Key Findings

Methodology

This study investigates the overgeneration issue in machine translation using Large Language Models (LLMs) in a commercial setting. By comparing different detection strategies, including MTQE models and alignment detection methods, the researchers aim to identify and categorize these overgeneration phenomena. The study employs various datasets, including WMT25 AOC task data and internally developed datasets, to validate the effectiveness of these strategies.

Key Results

  • Result 1: On the WMT25 AOC dataset, the combined strategy using MTQE models and alignment detection methods can detect overgeneration phenomena with 95% accuracy.
  • Result 2: On internal datasets, the combined strategy performs well in detecting minimally detached overgenerations, achieving a recall rate of 77%, although the precision is low at 22%.
  • Result 3: The study finds that LLMs can provide appropriate explanatory expansions in translations, which is beneficial in some cases but adds complexity to detection.

Significance

This study reveals the phenomenon of overgeneration in LLMs during machine translation and proposes effective detection strategies. This is significant for improving translation quality and reducing unnecessary content generation. The research provides new perspectives in academia and practical solutions for commercial applications, especially in scenarios requiring high-precision translations.

Technical Contribution

The technical contribution lies in developing a strategy that combines MTQE models and alignment detection methods to effectively identify and categorize different types of overgeneration phenomena. This strategy not only improves detection accuracy but also provides a new methodological foundation for future translation quality assessment.

Novelty

This study is the first to systematically explore the phenomenon of overgeneration in LLMs during translation and proposes a strategy combining multiple detection methods. This approach is innovative in handling complex translation generation issues, particularly in identifying minimally detached overgenerations.

Limitations

  • Limitation 1: The alignment detection method may produce a high false positive rate when dealing with very short overgenerations, as these phrases may be poorly aligned with the source text.
  • Limitation 2: The MTQE model has low precision in detecting minimally detached overgenerations, which may require further optimization.
  • Limitation 3: The applicability of the current strategy across different language pairs has not been fully validated.

Future Work

Future research directions include optimizing existing detection strategies to improve precision, especially in detecting minimally detached overgenerations. Other directions include validating these strategies across more language pairs and incorporating feedback from human translators to improve model performance.

AI Executive Summary

In the field of modern machine translation, Large Language Models (LLMs) are gaining attention for their generative capabilities. However, these models often generate excessive content during translation, leading to a decline in translation quality. Traditional Neural Machine Translation (NMT) models primarily face issues of repetition and neurobabble, while LLMs exhibit more complex overgeneration phenomena, including self-explanations and unnecessary expansions.

To address this issue, researchers have proposed a strategy that combines MTQE models and alignment detection methods to identify and categorize different types of overgeneration phenomena. The MTQE model is a multilingual encoder regression model fine-tuned to predict translation quality, while the alignment detection method uses alignment as a proxy for attention weights to detect unaligned text chunks.
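The summary does not spell out how the two signals are merged; a minimal sketch of one plausible decision rule, assuming an MTQE quality score in [0, 1] and the fraction of unaligned target tokens as inputs (the thresholds and combination logic here are illustrative, not from the paper):

```python
def detect_overgeneration(mtqe_score, unaligned_ratio,
                          mtqe_threshold=0.5, align_threshold=0.3):
    """Flag a translation as a likely overgeneration if either signal fires.

    mtqe_score: predicted translation quality in [0, 1] (higher is better).
    unaligned_ratio: fraction of target tokens with no source alignment.
    Thresholds are hypothetical placeholders, not values from the paper.
    """
    flagged_by_mtqe = mtqe_score < mtqe_threshold
    flagged_by_alignment = unaligned_ratio > align_threshold
    return flagged_by_mtqe or flagged_by_alignment
```

An OR-combination like this trades precision for recall, which is consistent with the reported 77% recall at 22% precision on minimally detached cases.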

In experiments, researchers used various datasets, including WMT25 AOC task data and internally developed datasets, to validate the effectiveness of these strategies. Results show that the combined strategy performs well in detecting overgeneration phenomena, particularly in handling minimally detached overgenerations, achieving a recall rate of 77% despite low precision.

These findings indicate that LLMs can provide appropriate explanatory expansions in translations, which is beneficial in some cases but adds complexity to detection. The research provides new perspectives in academia and practical solutions for commercial applications, especially in scenarios requiring high-precision translations.

However, the applicability of the current strategy across different language pairs has not been fully validated. Future research directions include optimizing existing detection strategies to improve precision and exploring the applicability of these strategies across more language pairs. By incorporating feedback from human translators, researchers hope to further improve model performance for higher-quality machine translation.

Deep Analysis

Background

Machine translation technology has made significant progress over the past decades, evolving from rule-based methods to modern Neural Machine Translation (NMT) models. NMT models achieve efficient conversion from source language to target language through an encoder-decoder architecture. However, with the advent of Large Language Models (LLMs), the translation field faces new challenges and opportunities. LLMs are known for their powerful generative capabilities but often generate excessive content during translation, a phenomenon known as overgeneration. Overgeneration not only affects translation accuracy but can also lead to misunderstandings and confusion. While previous research has explored the issue of neurobabble in NMT models, the overgeneration phenomenon in LLMs is more complex, involving self-explanations and unnecessary expansions.

Core Problem

The issue of LLMs generating excessive content during translation is becoming increasingly prominent. Unlike the repetition and neurobabble of traditional NMT models, LLM overgeneration is harder to characterize: it ranges from self-explanations and unnecessary expansions to, in some cases, genuinely helpful clarifications. The core problem of this research is how to detect and categorize these phenomena reliably enough to improve translation quality. Solving it matters directly for the practicality and reliability of machine translation.

Innovation

The core innovation of this study lies in proposing a strategy that combines MTQE models and alignment detection methods to identify and categorize different types of overgeneration phenomena. Specifically, the MTQE model is a multilingual encoder regression model fine-tuned to predict translation quality, while the alignment detection method uses alignment as a proxy for attention weights to detect unaligned text chunks. This combined strategy not only improves detection accuracy but also provides a new methodological foundation for future translation quality assessment. Additionally, the study is the first to systematically explore the phenomenon of overgeneration in LLMs during translation, particularly in identifying minimally detached overgenerations.

Methodology

  • Use the MTQE model for translation quality prediction: the model is based on the XLM-R large model, fine-tuned for multilingual translation tasks.
  • Alignment detection method: use the AwesomeAlign tool for alignment detection to identify unaligned text chunks.
  • Dataset selection: use WMT25 AOC task data and internally developed datasets for experimental validation.
  • Combined strategy: combine MTQE models and alignment detection methods to improve the accuracy of overgeneration detection.
  • Result analysis: validate the effectiveness of the combined strategy through experimental results, particularly in handling minimally detached overgenerations.
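The alignment-detection step above can be sketched as follows, assuming the aligner returns (source_index, target_index) pairs in the style of word aligners such as AwesomeAlign; the chunking logic is an illustrative reconstruction, not the authors' implementation:

```python
def unaligned_chunks(target_tokens, alignments):
    """Return maximal runs of target tokens with no aligned source token.

    target_tokens: list of tokens in the translation.
    alignments: set of (source_index, target_index) pairs (format assumed).
    Long unaligned runs are candidate overgenerations; very short runs
    may be false positives, as the paper's limitations note.
    """
    aligned_targets = {t for _, t in alignments}
    chunks, current = [], []
    for i, tok in enumerate(target_tokens):
        if i in aligned_targets:
            if current:
                chunks.append(current)
                current = []
        else:
            current.append(tok)
    if current:
        chunks.append(current)
    return chunks
```

For example, a target-side parenthetical that aligns to nothing in the source would surface here as one unaligned chunk spanning the inserted tokens.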

Experiments

The experimental design includes using multiple datasets to validate the proposed detection strategies. The primary datasets include WMT25 AOC task data and internally developed datasets. These datasets cover various language pairs, such as English-Chinese, English-Russian, and English-Japanese. The baselines used in the experiments include traditional NMT models and translations generated by LLMs. Key evaluation metrics include detection accuracy, recall, and precision. Additionally, the study conducted ablation studies to assess the impact of different strategy combinations on detection performance.

Results

Experimental results show that the combined strategy performs well in detecting overgeneration phenomena. On the WMT25 AOC dataset, the combined strategy can detect overgeneration phenomena with 95% accuracy. On internal datasets, the combined strategy performs well in detecting minimally detached overgenerations, achieving a recall rate of 77%, although the precision is low at 22%. These results indicate that the combined strategy has significant advantages in handling complex translation generation issues, particularly in identifying minimally detached overgenerations.

Applications

The findings of this study have significant implications for various application scenarios. Firstly, it can be used to improve the quality of machine translation systems, especially in scenarios requiring high-precision translations. Secondly, the strategy can help translation service providers better identify and handle overgeneration phenomena in translations, thereby improving customer satisfaction. Additionally, these detection strategies can be applied to other natural language processing tasks, such as text generation and summarization, to improve the accuracy and relevance of generated content.

Limitations & Outlook

Despite the excellent performance of the proposed combined strategy in detecting overgeneration phenomena, there are some limitations. Firstly, the alignment detection method may produce a high false positive rate when dealing with very short overgenerations, as these phrases may be poorly aligned with the source text. Secondly, the MTQE model has low precision in detecting minimally detached overgenerations, which may require further optimization. Additionally, the applicability of the current strategy across different language pairs has not been fully validated. Future research directions include exploring the applicability of these strategies across more language pairs.

Plain Language (accessible to non-experts)

Imagine you're in a kitchen cooking a meal. You have a recipe that tells you what ingredients you need and the steps to follow. As you start preparing the ingredients, sometimes you might add a bit too much of an unnecessary spice, like extra salt or sugar. This is similar to how Large Language Models (LLMs) generate excessive content during translation. While these extra spices might make the dish more flavorful, sometimes they can also make it taste off.

In translation, LLMs sometimes add unnecessary explanations or expansions, just like adding extra steps to a recipe. To ensure the dish tastes just right, we need a way to detect and correct these extra spices and steps.

Researchers have proposed a strategy, like using a smart assistant in the kitchen, that can detect if you've added too much spice and tell you how to adjust. This strategy combines two methods: one checks the overall taste of the dish, and the other checks if each step is followed according to the recipe.

With this approach, we can ensure the final dish tastes just right, just like ensuring the translation content is accurate. This not only improves the quality of the translation but also helps us enjoy a delicious meal without unnecessary hassle.

ELI14 (explained like you're 14)

Hey there! You know how sometimes when you're playing a video game, you might press a few extra buttons, and your character does something weird? Well, Large Language Models (LLMs) do something similar when translating!

These models are like super-smart robots that can translate one language into another. But sometimes, they say a bit too much, just like pressing extra buttons in a game.

To make translations more accurate, scientists have come up with a method, like a cheat code in a game, to detect these extra words and help the model fix them.

This way, we get better translations, just like getting a higher score in a game! Isn't that cool?

Glossary

Large Language Model (LLM)

A large language model is an AI model capable of generating natural language text, commonly used for translation and text generation tasks.

In this paper, LLMs are used for machine translation but tend to overgenerate content.

Neural Machine Translation (NMT)

Neural machine translation is a translation method based on neural networks, typically using an encoder-decoder architecture.

In this paper, NMT models are compared with LLMs regarding generation issues.

Overgeneration

Overgeneration refers to generating unnecessary content during translation, which can degrade translation quality.

This paper studies how to detect overgeneration phenomena in LLMs.

MTQE Model

MTQE is a multilingual encoder regression model used to predict translation quality.

In this paper, the MTQE model is used to detect overgeneration in translations.

Alignment Detection

Alignment detection is a method that checks the alignment between translated text and source text to detect overgeneration.

In this paper, alignment detection is used to identify unaligned text chunks.

Explanatory Expansion

Explanatory expansion involves adding extra explanations or information in translation to enhance understanding for the target language audience.

In this paper, explanatory expansion is considered a type of overgeneration phenomenon.

Recall Rate

Recall rate measures the proportion of actual positive cases detected by the model.

In this paper, recall rate is used to evaluate the effectiveness of overgeneration detection strategies.

Precision Rate

Precision rate measures the proportion of detected positive cases that are true positives.

In this paper, precision rate is used to evaluate the effectiveness of overgeneration detection strategies.
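To make the two metrics concrete, a small sketch with confusion-matrix counts chosen to reproduce the reported 77% recall and 22% precision (the counts themselves are illustrative, not from the paper):

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Compute precision and recall from confusion-matrix counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# Illustrative counts: 77 real overgenerations caught, 273 false alarms,
# 23 real overgenerations missed.
p, r = precision_recall(77, 273, 23)
```

High recall with low precision means the detector catches most true overgenerations but also raises many false alarms, matching the trade-off the paper reports for minimally detached cases.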

Ablation Study

Ablation study is a method to assess the impact of removing certain parts of a model on overall performance.

In this paper, ablation studies are used to evaluate the impact of different strategy combinations on detection performance.

Minimally Detached Overgeneration

Minimally detached overgeneration refers to translations with only a small amount of unnecessary content generation, often difficult to detect.

In this paper, minimally detached overgeneration is a focus of detection strategies.

Open Questions (unanswered questions from this research)

  1. How can the current overgeneration detection strategies be effectively applied across different language pairs? While the strategies proposed in this paper perform well on certain language pairs, their applicability to other language pairs has not been fully validated.
  2. How can the precision of the MTQE model in detecting minimally detached overgenerations be improved? The current model's performance in this area is suboptimal and requires further optimization.
  3. Why does the alignment detection method produce a high false positive rate when dealing with very short overgenerations? Further research is needed to understand the causes of poor alignment and find solutions.
  4. How can feedback from human translators be incorporated to improve overgeneration detection strategies? The experience and intuition of human translators may provide new perspectives for model optimization.
  5. How can the overgeneration phenomenon in LLMs be reduced without affecting translation quality? New model structures or training methods need to be explored to reduce unnecessary content generation.

Applications

Immediate Applications

Translation Quality Improvement

By detecting and correcting overgeneration phenomena, translation service providers can improve translation quality to meet customer demands for high-precision translations.

Text Generation Optimization

In other natural language processing tasks, such as text generation and summarization, detection strategies can help reduce unnecessary content generation and improve the relevance of generated text.

Multilingual Support

The strategy can be applied to multilingual translation systems to help identify and handle overgeneration phenomena across different language pairs, enhancing system versatility.

Long-term Vision

Intelligent Translation Assistant

In the future, intelligent translation assistants incorporating overgeneration detection strategies can provide real-time translation quality feedback to help translators improve work efficiency.

Automated Content Review

In the content review field, detection strategies can be used to automatically identify and filter unnecessary content, ensuring information accuracy and relevance.

Abstract

LLMs are proving to be adept at machine translation although due to their generative nature they may at times overgenerate in various ways. These overgenerations are different from the neurobabble seen in NMT and range from LLM self-explanations, to risky confabulations, to appropriate explanations, where the LLM is able to act as a human translator would, enabling greater comprehension for the target audience. Detecting and determining the exact nature of the overgenerations is a challenging task. We detail different strategies we have explored for our work in a commercial setting, and present our results.


References (13)

  • E. Yankovskaya, Andre Tättar, Mark Fishel (2018). Quality Estimation with Force-Decoded Attention and Cross-lingual Embeddings.
  • Kenza Benkirane, Laura Gongas, Shahar Pelles et al. (2024). Machine Translation Hallucination Detection for Low and High Resource Languages using Large Language Models.
  • Nuno M. Guerreiro, Duarte M. Alves, Jonas Waldendorf et al. (2023). Hallucinations in Large Multilingual Translation Models.
  • David Dale, Elena Voita, Janice Lam et al. (2023). HalOmi: A Manually Annotated Benchmark for Multilingual Hallucination and Omission Detection in Machine Translation.
  • Zi-Yi Dou, Graham Neubig (2021). Word Alignment by Fine-tuning Embeddings on Parallel Corpora.
  • Timothee Mickus, Elaine Zosa, Raúl Vázquez et al. (2024). SemEval-2024 Task 6: SHROOM, a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes.
  • Raúl Vázquez, Timothee Mickus, Elaine Zosa et al. (2025). SemEval-2025 Task 3: Mu-SHROOM, the Multilingual Shared Task on Hallucinations and Related Observable Overgeneration Mistakes.
  • Vikas Raunak, Matt Post, Arul Menezes (2022). SALTED: A Framework for SAlient Long-Tail Translation Error Detection.
  • Amr Hendy, Mohamed Abdelrehim, Amr Sharaf et al. (2023). How Good Are GPT Models at Machine Translation? A Comprehensive Evaluation.
  • Javier Ferrando, Gerard I. Gállego, M. Costa-jussà (2022). Measuring the Mixing of Contextual Information in the Transformer.
  • Peiqi Sui, Eamon Duede, Sophie Wu et al. (2024). Confabulation: The Surprising Value of Large Language Model Hallucinations.
  • Nada Mohamed Al Hammadi, Sane Yagi, S. Fareh (2024). Explicitation and Implicitation in Arabic-English Translation of Institutional Academic Correspondence.
  • Javier Ferrando, Gerard I. Gállego, Belen Alastruey et al. (2022). Towards Opening the Black Box of Neural Machine Translation: Source and Target Interpretations of the Transformer.