Neural Recovery of Historical Lexical Structure in Bantu Languages from Modern Data

TL;DR

Using BantuMorph v7, a neural model trained on modern morphological data, the study recovers historical lexical structure in Bantu languages: 10 of the top 11 noun candidates (90.9%) align with reconstructed Proto-Bantu forms.

cs.LG Advanced 2026-04-25
Hillary Mutisya, John Mugane
neural networks, historical linguistics, Bantu languages, lexical structure, cross-lingual analysis

Key Findings

Methodology

The study uses BantuMorph v7, a character-level transformer over Bantu morphological paradigms, to analyze 14 Eastern and Southern Bantu languages. Encoder embeddings are extracted for noun and verb lemmas and used to identify 728 noun and 1,525 verb cognate candidates. The candidates are then checked against historical resources, namely the Bantu Lexical Reconstructions database (BLR3) and the ASJP basic vocabulary; most of the top-ranked candidates align with reconstructed Proto-Bantu forms.

Key Results

  • Result 1: Among the top 11 noun candidates, 10 (90.9%) align with historical resources, such as *-ntU 'person' (8 languages) and *gombe 'cow' (9 languages).
  • Result 2: For verbs, 12 cognates align with reconstructed Proto-Bantu roots, including *-bon- 'see' and *-jIm- 'stand', each attested across a wide geographic range.
  • Result 3: Cross-model validation using an independent translation model (NLLB-600M) confirms these patterns: both models recover cognate clusters and phylogenetic groupings consistent with Guthrie-zone classifications (p < 0.01).

Significance

This research demonstrates that neural models trained on modern morphological data can recover historical lexical structure, providing new tools for historical linguistics and cross-lingual lexical analysis. Because its output is consistent with established historical resources, the method shows promise for studying language evolution, particularly in large and complex families like Bantu.

Technical Contribution

Technical contributions include using a character-level transformer model to capture cross-lingual lexical structures, demonstrating how historical language structures can be recovered from modern data. Additionally, cross-model validation shows consistency between independent models, enhancing the reliability of the results.

Novelty

The study is presented as the first to show that neural models trained solely on modern morphological data can recover cross-lingual lexical structures consistent with historical reconstruction. Rather than relying on traditional phonological reconstruction, the method obtains its results from embeddings learned by a neural network.

Limitations

  • Limitation 1: The dataset is restricted to Eastern and Southern Bantu languages, limiting the ability to distinguish Proto-Bantu retentions from later regional innovations.
  • Limitation 2: The model operates on characters and does not explicitly capture systematic phonological correspondences.
  • Limitation 3: BLR3 matching uses substring comparison; formal cognate coding requires expert judgment.
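To make Limitation 3 concrete, here is a minimal sketch of what substring matching against a reconstruction list can look like, and why it needs expert follow-up. The normalization rule, the function names, and the three-entry toy list are assumptions for illustration; the summary does not describe the actual BLR3 matching procedure. The reconstructed forms are the ones quoted in the results, and the modern forms are common Swahili and Zulu words.

```python
# Hypothetical sketch of substring-based matching; not the authors' procedure.

def normalize(form):
    """Lowercase and keep letters only (drops '*', '-', apostrophes).

    Assumption: lowercasing collapses vowel distinctions such as 'U' vs 'u'.
    """
    return "".join(ch for ch in form.lower() if ch.isalpha())


def substring_matches(candidate, reconstructions):
    """Return the (proto_form, gloss) pairs whose root occurs inside candidate."""
    cand = normalize(candidate)
    hits = []
    for proto, gloss in reconstructions:
        root = normalize(proto)
        if root and root in cand:
            hits.append((proto, gloss))
    return hits


# Toy reconstruction list using forms quoted in this summary.
TOY_RECONSTRUCTIONS = [
    ("*-ntU", "person"),
    ("*gombe", "cow"),
    ("*-bon-", "see"),
]

if __name__ == "__main__":
    for word in ["umuntu", "ng'ombe", "bona", "mtu"]:
        print(word, substring_matches(word, TOY_RECONSTRUCTIONS))
```

Swahili "mtu" ('person') is not matched in this sketch because the nasal has changed and the string "ntu" no longer appears in it. Misses of this kind, along with accidental hits on short roots, are why substring matching cannot replace formal cognate coding.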

Future Work

Future work could expand to Western Bantu languages for more comprehensive validation. Combining the approach with other linguistic methods, such as phonological reconstruction, could also help separate inherited forms from later innovations.

AI Executive Summary

The Bantu language family, spoken by over 300 million people across sub-Saharan Africa, shares a common ancestor known as Proto-Bantu. Traditionally, historical linguists have reconstructed Proto-Bantu forms through the comparative method, a time-consuming process.

This study proposes a novel approach using neural models trained on modern morphological data to recover historical lexical structures in Bantu languages. The research employs BantuMorph v7, a character-level transformer over Bantu morphological paradigms, to analyze 14 Eastern and Southern Bantu languages. Encoder embeddings for noun and verb lemmas are extracted, identifying 728 noun and 1,525 verb cognate candidates.

These candidates are evaluated against historical resources such as the Bantu Lexical Reconstructions database (BLR3) and the ASJP basic vocabulary, confirming many candidates align with reconstructed Proto-Bantu forms. Notably, among the top 11 noun candidates, 10 (90.9%) align with historical resources, such as *-ntU 'person' and *gombe 'cow'.

Furthermore, cross-model validation using an independent translation model (NLLB-600M) confirms these patterns: both models recover cognate clusters and phylogenetic groupings consistent with Guthrie-zone classifications (p < 0.01).

While the study achieves significant results, it also has limitations. The dataset is restricted to Eastern and Southern Bantu languages, limiting the ability to distinguish Proto-Bantu retentions from later regional innovations. Future work could expand to Western Bantu languages for more comprehensive validation.

Deep Analysis

Background

The Bantu language family includes over 500 languages spoken by more than 300 million people across sub-Saharan Africa. Its common ancestor, Proto-Bantu, was spoken approximately 4,500 to 4,000 years ago in the Cameroon highlands. Historical linguistics has traditionally reconstructed Proto-Bantu forms through the comparative method, identifying regular sound correspondences across daughter languages, a process that has taken over a century.

Neural networks have recently given linguistic research new tools drawn from natural language processing, which raises the question of whether models trained on modern data can recover historical language structure. The size and diversity of the Bantu family make it a good test case for recovering cross-lingual lexical structure.

Core Problem

The core problem addressed in this study is whether neural models trained solely on modern morphological data can recover cross-lingual lexical structures consistent with historical reconstruction. Traditional methods rely on phonological reconstruction and the comparative method, which are time-consuming and require extensive linguistic expertise. The study asks whether neural network models can speed up part of this process without sacrificing accuracy.

Innovation

The core innovations of this study include:

1. Using BantuMorph v7, a character-level transformer over Bantu morphological paradigms, to analyze lexical structures in Bantu languages.

2. Extracting encoder embeddings for noun and verb lemmas to identify cross-lingual cognate candidates.

3. Validating candidate words against historical resources (BLR3 and ASJP), demonstrating the potential of neural models in recovering historical language structures.

4. Cross-model validation using an independent translation model (NLLB-600M), enhancing the reliability of the results.

Methodology

Method details:

  • Use the BantuMorph v7 model to analyze 14 Eastern and Southern Bantu languages.
  • Extract encoder embeddings for noun and verb lemmas, identifying 728 noun and 1,525 verb cognate candidates.
  • Validate against the Bantu Lexical Reconstructions database (BLR3) and the ASJP basic vocabulary.
  • Cross-validate with an independent translation model (NLLB-600M) to confirm consistency of cognate clusters and phylogenetic groupings.
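The candidate-identification step can be sketched as follows. The abstract describes candidates as shared across 5+ languages, so this sketch links lemmas from different languages whose embedding similarity exceeds a threshold and keeps only clusters spanning at least five languages. The union-find grouping, the 0.8 threshold, and the placeholder language and lemma labels are all assumptions for illustration; the summary does not state how clusters were actually formed.

```python
# Hypothetical sketch: group cross-lingual links into candidate clusters.
from collections import defaultdict


def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def candidate_clusters(links, threshold=0.8, min_langs=5):
    """links: iterable of ((lang, lemma), (lang, lemma), similarity).

    Returns clusters (sorted lists of (lang, lemma)) covering >= min_langs
    distinct languages. threshold is an assumed value, not from the paper.
    """
    parent = {}
    for a, b, sim in links:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        # Only link lemmas from different languages that are similar enough.
        if sim >= threshold and a[0] != b[0]:
            parent[find(parent, a)] = find(parent, b)

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(parent, node)].add(node)

    return [
        sorted(members)
        for members in groups.values()
        if len({lang for lang, _ in members}) >= min_langs
    ]


# Placeholder data: "L1".."L5" are languages, "a1".."b3" are lemmas.
TOY_LINKS = [
    (("L1", "a1"), ("L2", "a2"), 0.91),
    (("L2", "a2"), ("L3", "a3"), 0.88),
    (("L3", "a3"), ("L4", "a4"), 0.86),
    (("L4", "a4"), ("L5", "a5"), 0.90),
    (("L1", "b1"), ("L2", "b2"), 0.95),  # strong link, but only 2 languages
    (("L2", "b2"), ("L3", "b3"), 0.40),  # below threshold, not linked
]

if __name__ == "__main__":
    print(candidate_clusters(TOY_LINKS))
```

Single-link grouping like this can chain unrelated items together through intermediate neighbours, so a real pipeline would likely need a stricter criterion or manual review.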

Experiments

The experimental design includes:

  • Datasets: Data from 14 Eastern and Southern Bantu languages.
  • Baselines: Candidates are checked against the Bantu Lexical Reconstructions database (BLR3) and the ASJP basic vocabulary, which serve as reference resources rather than competing systems.
  • Metrics: Validation rate of cognate candidates and cross-model consistency as primary metrics.
  • Key hyperparameters: BantuMorph v7 with 300M parameters and character-level inputs.
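As a rough illustration of the two metrics named above, the sketch below computes a validation rate and one possible consistency score. The function names and the choice of Jaccard overlap for "model consistency" are assumptions for illustration; the summary does not say how consistency was measured.

```python
# Hypothetical metric helpers; names and the Jaccard choice are assumptions.

def validation_rate(confirmed, evaluated):
    """Percentage of evaluated candidates confirmed by a reference resource."""
    if evaluated <= 0:
        raise ValueError("evaluated must be positive")
    return round(100.0 * confirmed / evaluated, 1)


def jaccard(a, b):
    """Overlap between two sets of candidate identifiers (1.0 if both empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


if __name__ == "__main__":
    # The reported noun result: 10 of the top 11 candidates confirmed.
    print(validation_rate(10, 11))
    # Toy overlap between candidate sets from two models.
    print(jaccard({"c1", "c2", "c3"}, {"c2", "c3", "c4"}))
```

Note that the reported 90.9% rests on 11 evaluated items, so it describes the top of the ranking and not all 728 noun candidates.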

Results

Results analysis:

  • Among the top 11 noun candidates, 10 (90.9%) align with historical resources.
  • 12 verb cognates align with reconstructed Proto-Bantu roots, showing that the approach extends from nouns to verbs.
  • Cross-model validation shows that both models recover cognate clusters and phylogenetic groupings consistent with Guthrie-zone classifications (p < 0.01).
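The summary does not say which test produced p < 0.01. One common way to check whether language similarities agree with a zone classification is a label-permutation test, sketched below under that assumption. The statistic (mean within-zone similarity minus mean between-zone similarity), the toy similarity matrix, and the two zone labels are all illustrative choices and not the paper's data.

```python
# Hypothetical sketch of a label-permutation test for zone agreement.
import random
from itertools import combinations


def zone_contrast(sim, zones):
    """Mean within-zone similarity minus mean between-zone similarity."""
    within, between = [], []
    for i, j in combinations(range(len(zones)), 2):
        (within if zones[i] == zones[j] else between).append(sim[i][j])
    return sum(within) / len(within) - sum(between) / len(between)


def permutation_p_value(sim, zones, n_perm=2000, seed=0):
    """One-sided Monte Carlo p-value: how often shuffled labels do as well."""
    rng = random.Random(seed)
    observed = zone_contrast(sim, zones)
    shuffled = list(zones)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if zone_contrast(sim, shuffled) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Toy data: 8 languages in two zones, high similarity inside each zone.
TOY_ZONES = ["E"] * 4 + ["S"] * 4
TOY_SIM = [
    [1.0 if i == j else (0.9 if TOY_ZONES[i] == TOY_ZONES[j] else 0.2)
     for j in range(8)]
    for i in range(8)
]

if __name__ == "__main__":
    print(zone_contrast(TOY_SIM, TOY_ZONES))
    print(permutation_p_value(TOY_SIM, TOY_ZONES))
```

With only eight languages the number of distinct labelings is small, which puts a floor on the achievable p-value; the 14 languages in the study leave more room for a small p.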

Applications

Application scenarios:

  • Historical linguistics research: a new tool for generating and checking hypotheses about language evolution.
  • Cross-lingual lexical analysis: identifying and validating cognate candidates across related languages.
  • Linguists and historians: an efficient way to verify and explore the historical structure of languages.

Limitations & Outlook

Limitations & outlook:

  • Dataset limitations: Restricted to Eastern and Southern Bantu languages, which prevents validation across the whole family.
  • Model limitations: The model operates on characters and does not explicitly capture systematic phonological correspondences.
  • Future improvements: Combining the approach with other linguistic methods, such as phonological reconstruction, could provide deeper insights into historical linguistics.

Plain Language Accessible to non-experts

Imagine you have a giant jigsaw puzzle, where each piece represents the vocabulary of a language. Traditionally, linguists have to manually compare these pieces to figure out which ones belong to the same whole, a process that is both time-consuming and complex. Now, researchers have developed a smart machine that can automatically analyze these pieces and find the connections between them. This machine is BantuMorph v7, which analyzes modern language data to help us recover those ancient language structures.

To switch metaphors, think of cooking, where different ingredients combine into one dish. BantuMorph v7 is like a smart chef that can tell which ingredients (words from different languages) belong to the same dish (a shared historical form). In this way, we can trace the evolution of languages faster.

This method not only improves efficiency but also provides new perspectives for linguistic research. Just like learning history in school, we need to know not only what happened in the past but also how these events affect the present. BantuMorph v7 helps us better understand the history of languages, providing a solid foundation for future research.

ELI14 Explained like you're 14

Hey there! Have you ever wondered how there are so many languages in the world and where they come from? It's like playing a puzzle game where each language is a little piece, and when you put them together, you can see the whole history of a language family.

Scientists have been studying the history of these languages, but it's not an easy task. Imagine trying to put together hundreds of puzzle pieces and figuring out which ones are part of the same picture. It takes a lot of time and effort.

But now, there's a super cool tool called BantuMorph v7. It's like a smart assistant that helps us quickly find out which language words share the same ancestor. This way, we can understand the history of languages much faster!

Although this tool is amazing, it does have some small problems, like it can only analyze a part of the languages. But scientists are working hard to improve it and make it even more powerful! So, in the future, understanding the history of languages will be much easier!

Glossary

BantuMorph v7

A character-level transformer model based on Bantu morphological paradigms used to analyze lexical structures in Bantu languages.

Used to extract encoder embeddings for noun and verb lemmas.

Transformer

A neural network architecture widely used in natural language processing tasks, especially effective in handling sequence data.

BantuMorph v7 is based on the Transformer architecture.

Proto-Bantu

The common ancestor language of the Bantu language family, spoken approximately 4,500 to 4,000 years ago.

Used in the study to validate the historical consistency of cognate candidates.

BLR3

A database containing 4,786 reconstructed Proto-Bantu forms, used for historical linguistic research.

Used to validate the effectiveness of cognate candidates.

ASJP

The Automated Similarity Judgment Program, which provides standardized 40-item basic vocabulary lists for computational comparison between languages.

Used to validate the effectiveness of cognate candidates.

Cognate

Words in different languages that have a common origin, usually similar in form and meaning.

The target for identification and validation in the study.

NLLB-600M

An independent translation model used for cross-model validation of cognate clusters and phylogenetic groupings.

Used to validate the results of BantuMorph v7.

Cosine Similarity

A metric used to measure the similarity between two vectors, ranging from -1 to 1.

Used to analyze the consistency of noun class structures.
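For reference, cosine similarity and the within-class versus between-class comparison mentioned in the abstract can be sketched as below. The toy two-dimensional vectors and the class and language labels are placeholders invented for illustration, and restricting the comparison to cross-language pairs is an assumption about how such an analysis might be set up.

```python
# Hypothetical sketch: cosine similarity and a within/between-class comparison.
import math
from itertools import combinations
from statistics import mean


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def within_between(items):
    """items: list of (noun_class, language, vector).

    Compares only pairs from different languages and returns
    (mean within-class similarity, mean between-class similarity).
    """
    within, between = [], []
    for (c1, l1, v1), (c2, l2, v2) in combinations(items, 2):
        if l1 == l2:
            continue  # cross-lingual pairs only
        (within if c1 == c2 else between).append(cosine(v1, v2))
    return mean(within), mean(between)


# Placeholder class centroids for two noun classes in two languages.
TOY_ITEMS = [
    ("class1", "L1", [1.0, 0.1]),
    ("class1", "L2", [0.9, 0.2]),
    ("class2", "L1", [0.1, 1.0]),
    ("class2", "L2", [0.2, 0.9]),
]

if __name__ == "__main__":
    w, b = within_between(TOY_ITEMS)
    print(f"within={w:.3f} between={b:.3f}")
```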

Guthrie Zone

A classification system for Bantu languages, divided into zones based on geographic and linguistic features.

Used to analyze cognate clusters and phylogenetic groupings.

Phylogenetic Grouping

Classification based on genetic relationships between languages, reflecting the evolutionary history of languages.

One of the targets validated in the study.

Open Questions Unanswered questions from this research

  • Open question 1: How can the approach be extended to Western Bantu languages, ideally without requiring much larger datasets? The current study is limited to Eastern and Southern Bantu languages, and future research needs to expand to a broader range of languages.
  • Open question 2: How can phonological reconstruction methods be integrated to enhance the model's historical linguistic interpretability? While BantuMorph v7 performs well in lexical structure recovery, its character-level analysis does not capture systematic phonological correspondences.
  • Open question 3: How can the model's computational efficiency be improved without sacrificing accuracy? The current model may face computational resource limitations when processing large-scale data.
  • Open question 4: How can borrowed words be better identified and excluded from cognate recognition? The model may mistake borrowed words for cognates, so more precise identification mechanisms are needed.
  • Open question 5: How can more linguistic knowledge be incorporated into the model to improve its handling of complex linguistic phenomena? The current model relies primarily on data-driven methods, which may overlook some linguistic details.

Applications

Immediate Applications

Historical Linguistics Research

Provides linguists with an efficient tool to analyze and recover the historical structure of languages, especially in complex language families.

Cross-Lingual Lexical Analysis

Helps identify and validate cognates across related languages, which is useful in multilingual language research.

Language Education

Offers new perspectives and methods for language education by recovering the historical structure of languages, helping students better understand language evolution.

Long-term Vision

Language Preservation and Revival

Helps preserve endangered languages and provides scientific basis for their revival by recovering and analyzing their historical structures.

Intelligent Translation Systems

Provides deeper language understanding for intelligent translation systems, enhancing their translation accuracy and naturalness in multilingual environments.

Abstract

We investigate whether neural models trained exclusively on modern morphological data can recover cross-lingual lexical structure consistent with historical reconstruction. Using BantuMorph v7, a transformer over Bantu morphological paradigms, we analyze 14 Eastern and Southern Bantu languages, extract encoder embeddings for their noun and verb lemmas, and identify 728 noun and 1,525 verb cognate candidates shared across 5+ languages. Evaluating these candidates against established historical resources, namely the Bantu Lexical Reconstructions database (BLR3; 4,786 reconstructed Proto-Bantu forms) and the ASJP basic vocabulary, we confirm 10 of the top 11 noun candidates (90.9%) align with previously reconstructed Proto-Bantu forms, including *-ntU 'person' (8 languages), *gombe 'cow' (9 languages), and *mUn (9 languages). Extending to verbs, 12 verb cognates align with reconstructed Proto-Bantu roots, including *-bon- 'see' and *-jIm- 'stand', each attested across wide geographic ranges. Cross-model validation using an independent translation model (NLLB-600M) confirms these patterns: both models recover cognate clusters and phylogenetic groupings consistent with established Guthrie-zone classifications (p < 0.01). Cross-lingual noun class analysis reveals that all 13 productive classes maintain >0.83 cosine similarity across languages (within-class > between-class, p < 10^-9). Our dataset is restricted to Eastern and Southern Bantu, so we interpret these results as recovering shared Bantu lexical structure consistent with Proto-Bantu rather than definitively distinguishing Proto-Bantu retentions from later regional innovations.

cs.LG cs.CL

References (4)

Finding Universal Grammatical Relations in Multilingual BERT

Ethan A. Chi, John Hewitt, Christopher D. Manning

2020 173 citations

Comparative Bantu: An Introduction to the comparative linguistics and prehistory of the Bantu languages

M. Guthrie

1967 372 citations

The Bantu Languages

D. Nurse, G. Philippson

2003 266 citations

Bantu grammatical reconstructions

A. E. Meeussen

1967 425 citations