Zero-Shot Morphological Discovery in Low-Resource Bantu Languages via Cross-Lingual Transfer and Unsupervised Clustering

TL;DR

Combining cross-lingual transfer from Swahili with unsupervised clustering, this work discovers noun class assignments for 2,455 Giriama words and two previously undocumented morphological patterns, all without labeled target-language data.

cs.LG · Advanced · 2026-04-25
Hillary Mutisya, John Mugane
cross-lingual transfer · unsupervised learning · morphology · low-resource languages · Bantu languages

Key Findings

Methodology

The paper presents a method combining cross-lingual transfer learning and unsupervised clustering for morphological feature discovery in low-resource Bantu languages. Specifically, it uses the BantuMorph model to map words into a shared embedding space, applies K-nearest neighbors for transfer learning, and employs UMAP and K-means for unsupervised clustering. The results are combined using weighted voting.

Key Results

  • In Giriama, the method discovered noun class assignments for 2,455 words and identified two novel morphological patterns: a vowel coalescence prefix variant for Class 2 (95.1% consistency) and a contracted k'- prefix (98.5% consistency).
  • External validation on 444 known Giriama verb paradigms showed 78.2% lemmatization accuracy, while a corpus expansion to 19,624 words achieved 97.3% segmentation and 86.7% lemmatization rates.
  • The combination of transfer learning and unsupervised clustering leveraged Swahili's high-resource data to discover Giriama-specific linguistic innovations.

Significance

This research provides a novel approach for morphological documentation of low-resource Bantu languages, particularly in data-scarce scenarios. By combining high-resource language transfer learning with unsupervised clustering, it effectively discovers language-specific features. This method not only enhances morphological coverage for Giriama but also offers a framework applicable to other low-resource languages.

Technical Contribution

Technically, the study demonstrates how to integrate cross-lingual transfer learning with unsupervised clustering for zero-shot morphological discovery. Using the BantuMorph model, it maps vocabulary into a shared embedding space and combines K-nearest-neighbors transfer with UMAP+K-means unsupervised clustering, providing a novel tool for morphological analysis.

Novelty

This study is the first to achieve zero-shot morphological discovery in low-resource Bantu languages by combining cross-lingual transfer learning and unsupervised clustering. The innovation lies in its ability to perform effective morphological analysis using high-resource language data in the absence of annotated data.

Limitations

  • The method relies on lexical overlap between high-resource and target languages, which may limit transfer learning effectiveness if overlap is low.
  • Unsupervised clustering may fail for classes with ambiguous prefixes, such as those that could belong to multiple categories.
  • The study's coverage is limited to nouns in the corpus, with rare classes underrepresented.

Future Work

Future research directions include expanding to more low-resource languages, especially those with lower lexical overlap with high-resource languages. Additionally, improving model accuracy and generalization capabilities, and exploring applications in other linguistic tasks are promising areas for further investigation.

AI Executive Summary

Morphological analysis is fundamental to linguistic documentation and natural language processing, yet most of the world's languages lack comprehensive morphological resources. This issue is particularly acute for the Bantu language family, where noun class systems remain undocumented for many languages.

This paper introduces a method combining cross-lingual transfer learning and unsupervised clustering for morphological feature discovery in low-resource Bantu languages. The study focuses on Giriama, a language with only 91 labeled paradigms. Using this method, researchers discovered noun class assignments for 2,455 words and identified two previously undocumented morphological patterns: a vowel coalescence prefix variant for Class 2 and a contracted k'- prefix.

The core of this method is the use of the BantuMorph model, which maps Bantu language vocabulary into a shared embedding space. Transfer learning is performed using the K-nearest neighbors algorithm, while unsupervised clustering is achieved through UMAP and K-means. The results are combined using weighted voting, successfully achieving zero-shot noun class discovery in Giriama.

Experimental results show that external validation on 444 known Giriama verb paradigms achieved 78.2% lemmatization accuracy, while a corpus expansion to 19,624 words resulted in 97.3% segmentation and 86.7% lemmatization rates. This demonstrates that the method not only enhances morphological coverage for Giriama but also offers a framework applicable to other low-resource languages.

Despite the significant achievements in Giriama, the method relies on lexical overlap between high-resource and target languages, which may limit transfer learning effectiveness if overlap is low. Additionally, unsupervised clustering may fail for classes with ambiguous prefixes. Future research directions include expanding to more low-resource languages and improving model accuracy and generalization capabilities.

Deep Analysis

Background

Morphological analysis is a crucial component of linguistic research, especially in natural language processing and language documentation. However, most of the world's more than 7,000 languages lack comprehensive morphological resources, a gap that is particularly acute in the Bantu language family. Bantu languages are known for their rich agglutinative morphology and noun class systems, yet many of these systems remain undocumented. Giriama, a Bantu language with a sizable speaker population, has only 91 morphological paradigms annotated and available in computational form. Traditional supervised learning approaches struggle to achieve good coverage with such minimal data, but Giriama shares approximately 60% of its vocabulary with Swahili, which makes cross-lingual transfer learning a viable alternative.

Core Problem

The morphological analysis of Giriama faces the challenge of data scarcity. With only 91 labeled paradigms, traditional supervised learning methods struggle to achieve effective coverage. Noun class systems are a significant feature of Bantu languages, yet remain undocumented for many languages. The core problem is how to perform effective morphological analysis using high-resource language data in the absence of annotated data.

Innovation

The innovation of this paper lies in combining cross-lingual transfer learning and unsupervised clustering for zero-shot morphological discovery in low-resource Bantu languages. Specifically:

1. The BantuMorph model is used to map Bantu language vocabulary into a shared embedding space, leveraging K-nearest neighbors for transfer learning.

2. UMAP and K-means are employed for unsupervised clustering to identify language-specific features.

3. Results from both methods are combined using weighted voting to achieve noun class discovery in Giriama.

Methodology

The methodology includes the following steps:

1. Encode the vocabulary with the BantuMorph model, mapping it into a shared embedding space.
2. Apply the K-nearest neighbors algorithm in the embedding space for transfer learning, identifying vocabulary similar to high-resource languages such as Swahili.
3. Use UMAP for dimensionality reduction and K-means for unsupervised clustering to identify language-specific features.
4. Combine the transfer-learning and clustering results via weighted voting to assign noun classes.
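The transfer step (step 2 above) can be sketched as a plain nearest-neighbor vote. This is a minimal illustration, not the authors' code: the 2-D vectors, class labels, and the vote-share confidence below are invented stand-ins for the paper's shared BantuMorph/ByT5 embedding space, where K=5 is used.

```python
from collections import Counter

def knn_transfer(target_vec, source_vecs, source_labels, k=5):
    """Label a target-language word by majority vote among its k nearest
    labeled source-language neighbors in the shared embedding space."""
    # Rank source words by squared Euclidean distance to the target.
    order = sorted(
        range(len(source_vecs)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(target_vec, source_vecs[i])),
    )
    votes = Counter(source_labels[i] for i in order[:k])
    label, count = votes.most_common(1)[0]
    return label, count / k  # vote share doubles as a crude confidence

# Toy 2-D stand-ins for embeddings of Swahili words with known noun classes.
swahili_vecs = [(0.1, 0.2), (0.15, 0.25), (0.9, 0.8), (0.85, 0.75), (0.2, 0.1)]
swahili_classes = ["class1", "class1", "class7", "class7", "class1"]

# A (hypothetical) Giriama word whose embedding falls near the class-1 words.
label, conf = knn_transfer((0.12, 0.22), swahili_vecs, swahili_classes, k=3)
# label == "class1", conf == 1.0
```

Because cognates land close together in the embedding space, this vote is strongest exactly where Giriama overlaps with Swahili, which is why the paper pairs it with clustering for the non-overlapping vocabulary.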

Experiments

The experimental design includes using 7,812 sentences from the Giriama language as an unlabeled corpus, with Swahili as the high-resource source language. Transfer learning is performed using the K-nearest neighbors algorithm in the ByT5 embedding space, with K=5. Unsupervised clustering uses UMAP for dimensionality reduction and K-means for clustering, with K=12. Results are combined using weighted voting, with a confidence threshold of 0.70.
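The final combination step can be sketched as follows. The 0.70 confidence threshold comes from the paper; the per-source weights (0.6/0.4) and the exact scoring rule are assumptions for illustration, since the summary does not spell out the weighting scheme.

```python
def weighted_vote(transfer_pred, cluster_pred,
                  w_transfer=0.6, w_cluster=0.4, threshold=0.70):
    """Combine (label, confidence) predictions from transfer learning and
    clustering by weighted voting; abstain if confidence falls below threshold."""
    scores = {}
    for (label, conf), weight in ((transfer_pred, w_transfer),
                                  (cluster_pred, w_cluster)):
        scores[label] = scores.get(label, 0.0) + weight * conf
    best = max(scores, key=scores.get)
    total = sum(scores.values())
    confidence = scores[best] / total if total else 0.0
    # Below the threshold the word is left unassigned rather than guessed.
    return (best, confidence) if confidence >= threshold else (None, confidence)

weighted_vote(("class2", 0.9), ("class2", 0.8))  # agreement: assigns class2
weighted_vote(("class2", 0.9), ("class7", 0.8))  # conflict below 0.70: abstains
```

Abstaining on low-confidence words trades coverage for precision, which matters when the discovered lexicon is meant to feed documentation work.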

Results

Experimental results show that in Giriama, the method discovered noun class assignments for 2,455 words and identified two novel morphological patterns: a vowel coalescence prefix variant for Class 2 (95.1% consistency) and a contracted k'- prefix (98.5% consistency). External validation on 444 known Giriama verb paradigms achieved 78.2% lemmatization accuracy, while a corpus expansion to 19,624 words resulted in 97.3% segmentation and 86.7% lemmatization rates.

Applications

The method's direct application scenarios include morphological documentation of low-resource languages, particularly in data-scarce scenarios where high-resource language data can be leveraged. Additionally, the method can be applied to other linguistic tasks, such as vocabulary expansion and identification of linguistic innovations.

Limitations & Outlook

The method's main limitation is its reliance on lexical overlap between the source and target languages: where overlap is low, transfer learning contributes little. Unsupervised clustering can also misassign classes whose prefixes are ambiguous. Promising next steps include extending the approach to additional low-resource languages and improving the model's accuracy and generalization.

Plain Language (accessible to non-experts)

Imagine you're in a library where books are categorized into different sections like fiction, non-fiction, science, and history. Each book has a label indicating its category. Now, suppose you enter a new library where books have no labels, and you need to guess their categories based on their content and style.

This is what the method in this paper does. It observes the features of words (like books) to infer their categories (noun classes). To achieve this, researchers use a tool called BantuMorph, which acts like a super librarian, quickly scanning books and finding similarities.

They also use a method called K-nearest neighbors, which is like asking other librarians how they categorize similar books. Finally, they use a method called unsupervised clustering, which groups books based on their covers and summaries.

By combining these methods, researchers can categorize new books without explicit labels. This approach not only helps us better understand languages but can also be applied to other fields requiring categorization.

ELI14 (explained like you're 14)

Hey there! Imagine you're playing a super cool game with lots of different characters, each with their own skills and attributes. You need to figure out which team each character belongs to, like warriors, mages, or archers, based on their skills and attributes.

Now, suppose you enter a new game world where characters don't have clear team labels. You need to guess their teams by observing their skills and behavior. That's what this paper's method does!

Researchers use a tool called BantuMorph, like a super game guide, to quickly analyze characters' skills and attributes. They also use a method called K-nearest neighbors, which is like asking other players how they categorize similar characters.

Finally, they use a method called unsupervised clustering, which groups characters based on their appearance and behavior. By combining these methods, researchers can categorize new characters without explicit labels. This approach not only helps us better understand the game world but can also be applied to other fields requiring categorization.

Glossary

Cross-Lingual Transfer Learning

A method that utilizes knowledge from high-resource languages to enhance models for low-resource languages by transferring learning outcomes through shared features or structures.

Used in this paper to transfer morphological knowledge from Swahili to Giriama.

Unsupervised Clustering

A clustering method that does not require pre-labeled data, grouping data based on inherent structures. Common algorithms include K-means and UMAP.

Used in this paper to identify language-specific features in Giriama.

BantuMorph

A model for morphological analysis of Bantu languages. It maps vocabulary into a shared embedding space at the character level.

Used in this paper to map Giriama vocabulary into a shared embedding space.

K-Nearest Neighbors

A distance-based classification method that classifies data points based on the majority class of their nearest neighbors.

Used in this paper for transfer learning in the embedding space.

UMAP

An unsupervised learning algorithm for dimensionality reduction that preserves local data structure.

Used in this paper for dimensionality reduction before K-means clustering.

K-means

A common clustering algorithm that partitions data into K clusters by minimizing within-cluster variance.

Used in this paper for clustering reduced-dimensional data.
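For concreteness, the standard K-means procedure (Lloyd's algorithm) can be written in a few lines of plain Python. This is a didactic sketch: a real pipeline would use a library implementation with k-means++ initialization rather than the naive first-k init below.

```python
def kmeans(points, k, iters=50):
    """Lloyd's algorithm: alternate nearest-centroid assignment and
    centroid recomputation until the centroids stop moving."""
    def sqdist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    centroids = list(points[:k])  # naive init; k-means++ is the usual choice
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sqdist(p, centroids[c]))
            clusters[j].append(p)
        # New centroid = mean of assigned points; keep old one if cluster empties.
        new = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        if new == centroids:
            break
        centroids = new
    labels = [min(range(k), key=lambda c: sqdist(p, centroids[c])) for p in points]
    return labels, centroids

# Two well-separated blobs recover two clusters.
pts = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
labels, _ = kmeans(pts, 2)
```

In the paper's setting, the points would be UMAP-reduced word embeddings and k=12, roughly matching the number of major noun classes.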

Noun Class

A grammatical category in Bantu languages where nouns are classified based on prefixes, affecting morphological changes in other sentence words.

Used in this paper to analyze noun class assignments in Giriama.

Lemmatization

The process of reducing word forms to their base form, commonly used in natural language processing tasks.

Used in this paper to validate morphological analysis results in Giriama.
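As a toy illustration of what noun-class-aware lemmatization involves (not the paper's pipeline), one can strip a plural class prefix and substitute the corresponding singular prefix. The prefix pairs below are well-known Swahili ones; a real Giriama system would additionally need the vowel-coalescence and contraction patterns the paper discovers.

```python
# A few well-known Swahili plural -> singular noun-class prefix pairs
# (class 2 wa- -> class 1 m-, class 8 vi- -> class 7 ki-, class 4 mi- -> class 3 m-).
PREFIX_PAIRS = {"wa": "m", "vi": "ki", "mi": "m"}

def lemmatize(word):
    """Rebuild a singular citation form by swapping a plural class prefix.
    Words with no known plural prefix are returned unchanged."""
    for plural, singular in PREFIX_PAIRS.items():
        if word.startswith(plural):
            return singular + word[len(plural):]
    return word

lemmatize("watu")    # "people" -> "mtu" ("person")
lemmatize("vitabu")  # "books"  -> "kitabu" ("book")
lemmatize("nyumba")  # no plural prefix -> unchanged
```

Naive prefix matching like this over-triggers (any word that happens to begin with wa- gets rewritten), which is one reason the paper validates its lemmatization against known paradigms.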

Vowel Coalescence

The merging of two adjacent vowels into one, common in certain languages' morphological changes.

Identified as a novel morphological pattern in Giriama in this paper.

Contracted Prefix

A phenomenon where prefixes shorten under specific conditions in some languages.

Identified as another novel morphological pattern in Giriama in this paper.

Open Questions (unanswered questions from this research)

1. How can effective cross-lingual transfer be achieved when lexical overlap is low? The current method relies on lexical overlap between the high-resource and target languages, which may limit transfer effectiveness when overlap is small.
2. How should classes with ambiguous prefixes be handled? Unsupervised clustering may fail for classes whose prefixes could belong to multiple categories, and improving accuracy here requires further research.
3. Can the method be applied to other linguistic tasks? While successful for morphological analysis, its application to other tasks remains to be explored.
4. How well does the model generalize? It performs well on the languages studied, but its generalization to other languages needs verification.
5. How can morphological analysis proceed without any annotated data? Although the method succeeds in data-scarce scenarios, further techniques are needed to improve accuracy.

Applications

Immediate Applications

Low-Resource Language Documentation

This method can be directly applied to morphological documentation of low-resource languages, especially in data-scarce scenarios where high-resource language data can be leveraged.

Linguistic Research

By identifying language-specific features, this method provides new perspectives for linguistic research, helping researchers better understand language evolution and change.

Natural Language Processing

The method can be applied to NLP tasks such as machine translation and automatic summarization, particularly in processing low-resource languages.

Long-term Vision

Global Language Preservation

By enhancing morphological analysis capabilities for low-resource languages, this method can provide technical support for global language preservation and revitalization, helping to save endangered languages.

Cross-Language Technology Applications

The successful application of this method can drive the development of cross-language technologies, promoting innovation and application in multilingual environments.

Abstract

We present a method for discovering morphological features in low-resource Bantu languages by combining cross-lingual transfer learning with unsupervised clustering. Applied to Giriama (nyf), a language with only 91 labeled paradigms, our pipeline discovers noun class assignments for 2,455 words and identifies two previously undocumented morphological patterns: an a- prefix variant for Class 2 (vowel coalescence - the merger of two adjacent vowels - of wa-, 95.1% consistency) and a contracted k'- prefix (98.5% consistency). External validation on 444 known Giriama verb paradigms confirms 78.2% lemmatization accuracy, while a v3 corpus expansion to 19,624 words (9,014 unique lemmas) achieves 97.3% segmentation and 86.7% lemmatization rates across all major word classes. Our ensemble of transfer learning from Swahili and unsupervised clustering, combined via weighted voting, exploits complementary strengths: transfer excels at cognate detection (leveraging ~60% vocabulary overlap) while clustering discovers language-specific innovations invisible to transfer. We release all code and discovered lexicons to support morphological documentation for low-resource Bantu languages.

cs.LG · cs.CL

References (20)

Cross-Lingual Morphological Tagging for Low-Resource Languages
Jan Buys, Jan A. Botha (2016), 50 citations

Unsupervised Cross-lingual Representation Learning at Scale
Alexis Conneau, Kartikay Khandelwal, Naman Goyal et al. (2019), 8224 citations

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee et al. (2019), 113527 citations

Neural Multi-Source Morphological Reinflection
Hinrich Schütze, Ryan Cotterell, Katharina Kann (2016), 34 citations

The CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection
Ryan Cotterell, Christo Kirov, John Sylak-Glassman et al. (2018), 158 citations

SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection
Ekaterina Vylomova, Jennifer C. White, Elizabeth Salesky et al. (2020), 87 citations

A Universal Feature Schema for Rich Morphological Annotation and Fine-Grained Cross-Lingual Part-of-Speech Tagging
John Sylak-Glassman, Christo Kirov, Matt Post et al. (2015), 35 citations

Unsupervised Learning of the Morphology of a Natural Language
J. Goldsmith (2001), 891 citations

Character-Aware Neural Language Models
Yoon Kim, Yacine Jernite, D. Sontag et al. (2015), 1714 citations

Object marking and morphosyntactic variation in Bantu
L. Marten, N. Kula (2012), 99 citations

Marrying Universal Dependencies and Universal Morphology
Arya D. McCarthy, Miikka Silfverberg, Ryan Cotterell et al. (2018), 47 citations

Unsupervised models for morpheme segmentation and morphology learning
Mathias Creutz, K. Lagus (2007), 419 citations

A Two-Level Computer Formalism for the Analysis of Bantu Morphology: An Application to Swahili
A. Hurskainen (2005), 14 citations

Unsupervised Learning of Morphology
H. Hammarström, L. Borin (2011), 153 citations

ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models
Linting Xue, Aditya Barua, Noah Constant et al. (2021), 664 citations

Deep Contextualized Word Representations
Matthew E. Peters, Mark Neumann, Mohit Iyyer et al. (2018), 12115 citations

UniMorph 2.0: Universal Morphology
Christo Kirov, Ryan Cotterell, John Sylak-Glassman et al. (2018), 152 citations

A comparative study of Bantu noun classes
E. Vajda (2002), 114 citations

Exploiting Cross-Linguistic Similarities in Zulu and Xhosa Computational Morphology
L. Pretorius, Sonja E. Bosch (2009), 25 citations

CoNLL-SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection in 52 Languages
Ryan Cotterell, Christo Kirov, John Sylak-Glassman et al. (2017), 203 citations