SN-WER: Script-Normalized WER for Multi-Script Indic ASR Evaluation
Proposes Script-Normalized WER (SN-WER), which reduces script-mismatch inflation of model gaps by up to 12% across five Indic languages and makes multi-script ASR evaluation more accurate.
Key Findings
Methodology
This paper introduces a training-free, evaluation-only scoring method called SN-WER, which transliterates both reference and hypothesis texts into a language-specific canonical script before computing WER. The core process involves:
- Applying deterministic transliteration mappings to convert multi-script tokens into a unified script;
- Normalizing Unicode, punctuation, and digits to ensure consistency;
- Validating robustness across multiple transliteration tools and schemes such as ICU, IAST, and ITRANS.
The evaluation spans five Indic languages (Hindi, Bengali, Tamil, Odia, Gujarati) using two datasets (FLEURS and Common Voice) and three ASR models (Whisper-large-v3, MMS, Whisper-small). Results show SN-WER reduces model gap inflation by up to 12% on curated datasets, with smaller or inconsistent reductions on noisier data, indicating it isolates script mismatch effects from genuine recognition errors. Controlled stress tests and lexical substitution experiments further confirm its ability to attenuate artificial script-induced errors while maintaining sensitivity to semantic errors. The method is largely invariant to the choice of transliteration tool (disagreement < 0.002) and to normalization variations (score change < 0.05), and token-collision rates stay below 0.1%.
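The scoring procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the lookup table `ROMAN_TO_DEVANAGARI`, the helper names, and the choice to normalize before transliterating are all assumptions made for the example, and a real setup would replace the toy table with a deterministic transliterator.

```python
import unicodedata

# Hypothetical toy mapping from romanized Hindi tokens to Devanagari.
# A real evaluation would call a deterministic transliterator instead.
ROMAN_TO_DEVANAGARI = {
    "namaste": "नमस्ते",
    "duniya": "दुनिया",
}

# Map Devanagari digits to ASCII digits so both forms compare equal.
DIGIT_MAP = str.maketrans("०१२३४५६७८९", "0123456789")


def normalize(text: str) -> list[str]:
    """NFC-normalize, unify digits, drop punctuation, lowercase, and tokenize."""
    text = unicodedata.normalize("NFC", text).translate(DIGIT_MAP)
    kept = [ch for ch in text if not unicodedata.category(ch).startswith("P")]
    return "".join(kept).lower().split()


def to_canonical(tokens: list[str]) -> list[str]:
    """Map each token to the canonical script: one token in, one token out."""
    return [ROMAN_TO_DEVANAGARI.get(tok, tok) for tok in tokens]


def word_edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance computed over word tokens."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str, canonicalize: bool = False) -> float:
    """Plain WER by default; the script-normalized variant if canonicalize=True."""
    ref_toks, hyp_toks = normalize(ref), normalize(hyp)
    if canonicalize:
        ref_toks, hyp_toks = to_canonical(ref_toks), to_canonical(hyp_toks)
    if not ref_toks:
        raise ValueError("reference must contain at least one token")
    return word_edit_distance(ref_toks, hyp_toks) / len(ref_toks)


reference = "नमस्ते दुनिया।"
hypothesis = "namaste duniya"
print(wer(reference, hypothesis))                     # plain WER
print(wer(reference, hypothesis, canonicalize=True))  # script-normalized WER
```

In this toy case the hypothesis differs from the reference only in script, so the plain score counts every word as an error while the script-normalized score does not. A genuinely wrong word (one absent from the mapping) still counts as an error in both.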
Key Results
- On the FLEURS dataset, SN-WER reduces inflated model gaps by up to 12%. Per-model scores drop from 0.32 to 0.30 for MMS (−5.4%), from 0.70 to 0.64 for Whisper-large (−8.0%), and from 1.27 to 1.21 for Whisper-small (−4.7%). On Common Voice, the reductions are 23% for MMS (0.46 to 0.36), 4.3% for Whisper-large (0.86 to 0.82), and 6.9% for Whisper-small (1.46 to 1.36). The Common Voice reductions vary more across models, in line with the abstract's observation that errors on noisier data reflect genuine recognition weaknesses rather than only script mismatch.
- The cross-script extension to Arabic and Urdu on FLEURS shows moderate improvements, with Arabic WER decreasing by 4.9–6.9% and Urdu by 6.4–9.0%, suggesting the method applies beyond Indic scripts. In stress tests that romanized 50% of tokens, WER inflated by 0.234 on average, and SN-WER removed about 0.158 of that inflation, a 67% attenuation (0.158 / 0.234 ≈ 0.675). Lexical substitution controls with 20–30% token corruption yielded nearly identical sensitivity in SN-WER and WER (ΔSN-WER / ΔWER ≈ 1.09), indicating that the normalization preserves genuine recognition errors.
- Robustness analyses across transliteration tools (ICU, IAST, ITRANS) showed disagreement below 0.002, normalization variations caused less than 0.05 change in scores, and collision rates remained below 0.1%. These findings confirm the stability of SN-WER in diverse practical scenarios, making it a reliable companion metric for script-insensitive evaluation.
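The percentages in these results can be recomputed with two small helpers. This is a sketch under stated assumptions: the function names are illustrative, and the inputs are the two-decimal scores quoted above. Recomputing from rounded scores does not reproduce every quoted percentage exactly (for MMS on FLEURS it gives a drop of roughly 6% rather than 5.4%), presumably because the quoted figures were derived from unrounded or per-language scores. For the stress test, 0.158 / 0.234 ≈ 0.675 matches the reported 67% attenuation when 0.158 is read as the amount of inflation removed.

```python
def relative_change(before: float, after: float) -> float:
    """Signed relative change; -0.05 means a 5% drop."""
    return (after - before) / before


def attenuation(wer_inflation: float, sn_wer_inflation: float) -> float:
    """Share of the WER inflation that does not show up in SN-WER."""
    return 1.0 - sn_wer_inflation / wer_inflation


# (WER, SN-WER) pairs quoted for FLEURS, rounded to two decimals.
fleurs = {
    "MMS": (0.32, 0.30),
    "Whisper-large": (0.70, 0.64),
    "Whisper-small": (1.27, 1.21),
}
for model, (plain, normalized) in fleurs.items():
    print(f"{model}: {relative_change(plain, normalized):+.1%}")

# Stress test, assuming 0.158 is the inflation removed by normalization,
# which leaves a residual SN-WER inflation of 0.234 - 0.158.
print(f"attenuation: {attenuation(0.234, 0.234 - 0.158):.3f}")
```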
Significance
This research addresses a critical challenge in multilingual ASR evaluation—accurately measuring recognition performance across different scripts. Traditional WER often overestimates errors when references and hypotheses are encoded in different scripts, especially in Indic languages where romanization is common. By introducing SN-WER, the authors provide a tool that isolates script mismatch effects, enabling fairer comparisons of models regardless of script choice. This innovation has significant implications for both academic research and industry applications, including multilingual search, indexing, and downstream NLP pipelines, where script-insensitive metrics are essential. It facilitates more accurate benchmarking, promotes development of robust multilingual models, and supports fair evaluation in diverse linguistic contexts. Moreover, the method’s validation on multiple languages and datasets underscores its broad applicability and potential to influence future standards in multilingual ASR evaluation.
Technical Contribution
The core technical innovation lies in the development of a deterministic, language-specific transliteration process that transforms multi-script hypotheses and references into a canonical script before WER computation. This process involves selecting a reliable transliteration tool (e.g., ICU, IAST, ITRANS), normalizing Unicode characters, punctuation, and digits, and ensuring boundary-preserving mappings that do not merge or split tokens. The approach guarantees that script-only mismatches can be corrected without increasing edit distance, thus providing a more accurate measure of lexical errors. The method is evaluation-only, requiring no retraining or decoding modifications, and is validated through extensive robustness tests, including cross-tool agreement, normalization stability, and adversarial perturbations. The authors also quantify collision risks and disagreement levels, demonstrating that the method introduces negligible bias while significantly reducing script bias effects. This systematic validation establishes SN-WER as a reliable, scalable, and practical tool for multilingual ASR assessment.
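Two of the properties named above, token collisions and boundary preservation, are easy to check for any candidate mapping. The sketch below is illustrative only: the paper's exact definition of collision rate is not given here, so the one used (the fraction of distinct vocabulary words that share a canonical form with another word) is an assumption, as are the function names and the deliberately lossy `toy_canon` mapping. The vocabulary is assumed to hold distinct words in a single source script, so any shared canonical form is unintended.

```python
from collections import defaultdict
from typing import Callable


def collision_rate(vocab: set[str], canon: Callable[[str], str]) -> float:
    """Fraction of vocabulary words sharing a canonical form with another word."""
    groups: dict[str, set[str]] = defaultdict(set)
    for token in vocab:
        groups[canon(token)].add(token)
    collided = sum(len(group) for group in groups.values() if len(group) > 1)
    return collided / len(vocab) if vocab else 0.0


def preserves_boundaries(tokens: list[str], canon: Callable[[str], str]) -> bool:
    """True if every token maps to exactly one non-empty, whitespace-free token."""
    return all(len(canon(tok).split()) == 1 for tok in tokens)


def toy_canon(token: str) -> str:
    """Hypothetical lossy mapping that discards vowel length."""
    return token.replace("aa", "a")


vocab = {"kal", "kaal", "ghar", "paani"}
print(collision_rate(vocab, toy_canon))           # "kal" and "kaal" collide
print(preserves_boundaries(sorted(vocab), toy_canon))
```

A mapping that merges distinct words can hide real substitution errors, and one that splits or deletes tokens changes the reference length, so both checks are worth running before trusting a normalized score.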
Novelty
Unlike prior transliteration-based metrics such as toWER, which were primarily designed for training data normalization, SN-WER is an evaluation-only metric that explicitly targets script mismatch effects in monolingual, multi-script settings. Its novelty lies in the systematic validation across multiple languages, datasets, and models, as well as its robustness to different transliteration tools and normalization strategies. The approach does not require retraining or model modifications, making it highly practical. It also provides diagnostic insights into script bias, helping distinguish genuine recognition errors from surface script mismatches. This focus on evaluation, combined with extensive empirical validation, sets SN-WER apart from existing metrics and addresses a longstanding gap in multilingual ASR benchmarking.
Limitations
- The method relies on the accuracy of deterministic transliteration mappings; errors or ambiguities in these mappings can introduce biases, especially for low-resource or underrepresented scripts where canonical mappings are less reliable.
- SN-WER primarily addresses single-script normalization and may not fully capture complexities in code-switching or mixed-script scenarios where context-dependent interpretation is necessary.
- In cases of severe noise, spelling variation, or non-standard orthographies, script normalization might inadvertently merge semantically distinct tokens, potentially reducing sensitivity to certain errors. Additionally, the computational overhead of multiple transliteration tools and normalization steps could be a concern in large-scale evaluations.
Future Work
Future research will focus on extending SN-WER to handle code-switching scenarios, where multiple scripts coexist within a single utterance, by incorporating contextual cues. Combining SN-WER with downstream task-specific metrics could further improve holistic evaluation frameworks. Additionally, integrating probabilistic or machine learning-based transliteration models might enhance accuracy in low-resource or ambiguous scripts. Exploring adaptive normalization strategies that consider language context and dialectal variations can also improve robustness. Lastly, applying SN-WER in real-world applications such as multilingual voice assistants, search engines, and educational tools will help refine its utility and drive the development of more equitable, script-insensitive speech recognition systems.
AI Executive Summary
In an increasingly interconnected world, multilingual speech recognition systems must accurately interpret languages that are written in diverse scripts. Traditional evaluation metrics like Word Error Rate (WER) serve well in monolingual, single-script settings but falter in multi-script environments, such as those encountered in Indic languages. In these contexts, models often produce romanized output while the references are written in the native script, leading to inflated error measurements that do not reflect true recognition performance. This discrepancy hampers fair comparison, model development, and deployment in real-world multilingual applications.
To address this challenge, the authors propose Script-Normalized WER (SN-WER), a novel evaluation metric that mitigates script mismatch effects without requiring retraining or model modifications. The core idea involves transliterating both reference and hypothesis texts into a language-specific canonical script before computing WER. This process leverages deterministic transliteration mappings, Unicode normalization, and robust validation across multiple tools such as ICU, IAST, and ITRANS. By standardizing scripts, SN-WER isolates genuine lexical errors from superficial script differences, providing a more accurate assessment of recognition quality.
Extensive experiments across five Indic languages—Hindi, Bengali, Tamil, Odia, and Gujarati—demonstrate SN-WER’s effectiveness. Using two datasets, FLEURS and Common Voice, and three ASR models, the authors show that SN-WER reduces inflated model gaps by up to 12% in curated data and reveals true recognition weaknesses in noisy data. The method’s robustness is validated through stress tests involving artificial romanization, lexical substitution controls, and adversarial perturbations, confirming its stability and sensitivity to semantic errors. Cross-lingual validation on Arabic and Urdu further illustrates its broader applicability.
The significance of this work lies in its potential to transform multilingual ASR evaluation. By providing a script-insensitive metric, SN-WER enables fairer model comparisons, supports the development of more robust multilingual systems, and enhances downstream NLP tasks such as search, indexing, and information retrieval. Its low disagreement rates, negligible collision risks, and stability across tools and normalization strategies underscore its practicality. Future directions include extending SN-WER to code-switching scenarios, integrating contextual cues, and applying it in real-world multilingual applications. Overall, SN-WER marks a substantial step toward more equitable and accurate evaluation of speech recognition systems in our linguistically diverse world.
Deep Dive
Abstract
Word Error Rate (WER) is the dominant metric for automatic speech recognition (ASR), but it can overestimate errors when references and hypotheses encode the same words in different scripts. This issue is common in multilingual settings where ASR models may emit romanized text. We propose Script-Normalized WER (SN-WER), a training-free, evaluation-only scoring method that transliterates both reference and hypothesis text into a language-specific canonical script before computing WER. We evaluate SN-WER on 5 Indic languages, 2 datasets, and 3 ASR models. On curated FLEURS data, SN-WER reduces inflated model gaps by up to 12%, while on noisier Common Voice data the reductions are smaller or inconsistent, indicating genuine recognition weaknesses rather than only script mismatch. Controlled stress tests show a 67% attenuation of artificial romanization-induced WER inflation, while lexical-substitution controls show near-identical sensitivity to semantic errors, with ΔSN-WER / ΔWER ≈ 1.09. SN-WER is robust to transliterator choice and normalization changes, and shows token-collision rates below 0.1% in the evaluated Indic setting. We argue that SN-WER should be reported alongside WER and CER as a companion metric for script-insensitive ASR evaluation, especially when transcripts feed downstream search, indexing, or multilingual LLM pipelines.
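The metric described in the abstract can be written compactly as follows. The notation is an assumption introduced for illustration and does not come from the paper: $S$, $D$, and $I$ are the substitution, deletion, and insertion counts of the word-level alignment, $N$ is the number of reference words, $\nu$ is the Unicode, punctuation, and digit normalization, $\tau_{\ell}$ is the deterministic transliteration into the canonical script of language $\ell$, and $\rho$ is the sensitivity ratio from the lexical-substitution controls. The order in which $\nu$ and $\tau_{\ell}$ are applied is also assumed.

```latex
\[
\mathrm{WER}(r, h) = \frac{S + D + I}{N},
\qquad
\text{SN-WER}_{\ell}(r, h) = \mathrm{WER}\bigl(\tau_{\ell}(\nu(r)),\; \tau_{\ell}(\nu(h))\bigr)
\]
\[
\rho = \frac{\Delta\,\text{SN-WER}}{\Delta\,\mathrm{WER}} \approx 1.09
\]
```

A value of $\rho$ near 1 means that a genuine lexical error moves SN-WER about as much as it moves WER, which is the sense in which the normalization keeps sensitivity to semantic errors.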
References (11)
Common Voice: A Massively-Multilingual Speech Corpus
Rosana Ardila, Megan Branson, Kelly Davis et al.
FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech
Alexis Conneau, Min Ma, Simran Khanuja et al.
Lenient Evaluation of Japanese Speech Recognition: Modeling Naturally Occurring Spelling Inconsistency
Shigeki Karita, R. Sproat, Haruko Ishikawa
WERD: Using social text spelling variants for evaluating dialectal speech recognition
Ahmed M. Ali, Preslav Nakov, P. Bell et al.
What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations
Kavya Manohar, L. G. Pillai
From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition
A. Morris, V. Maier, P. Green
Advocating Character Error Rate for Multilingual ASR Evaluation
T. K, Jesin James, D. Gopinath et al.
Transliteration Based Approaches to Improve Code-Switched Speech Recognition Performance
Jesse Emond, B. Ramabhadran, Brian Roark et al.
Language (Technology) is Power: A Critical Survey of “Bias” in NLP
Su Lin Blodgett, Solon Barocas, Hal Daumé III et al.
WER We Stand: Benchmarking Urdu ASR Models
Samee Arif, A. Khan, Mustafa Abbas et al.
Multi-reference WER for evaluating ASR for languages with no orthographic rules
Ahmed M. Ali, Walid Magdy, P. Bell et al.