F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World
F2LLM-v2 offers efficient multilingual embeddings through two-stage training and matryoshka learning, supporting over 200 languages.
Key Findings
Methodology
F2LLM-v2 employs a two-stage LLM embedding training pipeline, integrating matryoshka learning, model pruning, and knowledge distillation. Initially, it builds a robust semantic foundation using seven large-scale retrieval datasets. Subsequently, it refines training for specific downstream applications, enhancing model capabilities with task-specific instructions. The model architecture is based on the standard Transformer decoder of Qwen3, supporting eight distinct model sizes.
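As a hedged illustration of the second stage, a task-specific instruction can simply be prepended to the query before it is embedded. The exact prompt template is not specified in this summary, so the "Instruct:/Query:" format below is an assumption modeled on common instruction-tuned embedding recipes:

```python
# Minimal sketch of task-instructed embedding inputs.
# The "Instruct:/Query:" template is an assumption, not F2LLM-v2's documented format.

def format_query(task_instruction: str, query: str) -> str:
    """Prepend a task-specific instruction to the query text."""
    return f"Instruct: {task_instruction}\nQuery: {query}"

query_text = format_query(
    "Given a web search query, retrieve passages that answer the query",
    "how does matryoshka representation learning work?",
)
# Documents are typically embedded without an instruction prefix.
document_text = "Matryoshka representation learning trains nested embeddings ..."
```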
Key Results
- F2LLM-v2-14B ranks first on 11 MTEB benchmarks, demonstrating outstanding multilingual embedding capabilities. The smaller 330M and 0.6B models also excel in resource-constrained applications, surpassing Qwen3-Embedding and EmbeddingGemma.
- Through knowledge distillation, F2LLM-v2 performs strongly on several language-specific benchmarks, particularly at the 80M and 160M scales, demonstrating a favorable balance between performance and efficiency.
- Ablation studies indicate that knowledge distillation significantly enhances model performance, especially for smaller models, confirming effective transfer of the teacher model's capabilities.
Significance
The introduction of F2LLM-v2 marks a significant advancement in multilingual embedding research, particularly in addressing language imbalance and training transparency. By supporting over 200 languages, especially mid- and low-resource ones, the model holds substantial significance in both academia and industry. It not only addresses existing models' shortcomings in multilingual support but also promotes research transparency and reproducibility through open-source initiatives.
Technical Contribution
F2LLM-v2 differs substantially from existing SOTA methods. Its integration of matryoshka learning with a two-stage training strategy opens new engineering possibilities, and the combination of model pruning and knowledge distillation allows smaller models to approach the performance of larger ones, providing efficient solutions for resource-constrained environments.
Novelty
F2LLM-v2 is the first to achieve efficient multilingual embeddings by combining two-stage training with matryoshka learning. Compared to existing multilingual embedding models, it breaks new ground in language coverage and training transparency.
Limitations
- Despite F2LLM-v2's excellent multilingual support, performance on certain low-resource languages still needs improvement, particularly where high-quality training data is lacking.
- The model still demands significant computational resources, especially for larger-scale models like the 14B version.
- Performance on specific tasks may be affected by the distribution of training data, leading to limitations in generalization capabilities.
Future Work
Future research directions include further optimizing performance on low-resource languages, exploring more efficient training methods to reduce computational demands, and validating the model's effectiveness in more practical application scenarios.
AI Executive Summary
F2LLM-v2 is a novel family of multilingual embedding models designed to address the current imbalances in embedding research regarding language support and transparency. Existing embedding models often focus on high-resource languages like English and Chinese, neglecting the needs of mid- and low-resource languages. Additionally, many top-performing embedding models lack transparency in training data and methodologies, limiting reproducibility.
F2LLM-v2 integrates a two-stage LLM embedding training pipeline, matryoshka learning, model pruning, and knowledge distillation to provide an efficient and inclusive solution. The model family supports over 200 languages, with a particular emphasis on mid- and low-resource languages, and includes eight distinct model sizes ranging from 80M to 14B.
Technically, F2LLM-v2 employs a standard Transformer decoder architecture based on Qwen3, utilizing the final hidden states of the EOS token as sequence representation. Through a two-stage training strategy, the model excels in building semantic foundations and handling diverse downstream applications.
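A minimal sketch of this pooling scheme is shown below; the checkpoint name is a placeholder rather than an official F2LLM-v2 identifier, and tokenization details (such as whether EOS is appended automatically) are assumptions:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "Qwen/Qwen3-0.6B"  # placeholder decoder backbone, not an F2LLM-v2 checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

texts = ["multilingual semantic search", "busca semantica multilingue"]
# Append EOS explicitly in case the tokenizer does not add it on its own.
texts = [t + tokenizer.eos_token for t in texts]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state       # [batch, seq_len, dim]

# Take the hidden state at the last non-padding position (the EOS token).
last_idx = batch["attention_mask"].sum(dim=1) - 1    # [batch]
embeddings = hidden[torch.arange(hidden.size(0)), last_idx]
embeddings = F.normalize(embeddings, dim=-1)         # unit-length sequence embeddings
```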
Experimental results show that F2LLM-v2-14B ranks first on 11 MTEB benchmarks, with smaller models like 330M and 0.6B also performing exceptionally well in resource-constrained applications. Ablation studies further validate the effectiveness of knowledge distillation in enhancing model performance, particularly in smaller-scale models.
The release of F2LLM-v2 not only holds significant implications for academia and industry but also promotes research transparency and reproducibility through open-source initiatives. Future research directions include optimizing performance on low-resource languages, exploring more efficient training methods, and validating the model's effectiveness in more practical application scenarios.
Deep Analysis
Background
In recent years, text embedding models have played a crucial role in AI applications such as semantic search, text classification, and clustering. Traditional embedding models were primarily based on encoder architectures like XLM-R and mBART. With the rise of decoder architectures, however, LLM-based embedding models like E5-Mistral and NV-Embed have become dominant, having gained extensive reasoning and linguistic capabilities through large-scale pre-training. Current embedding research nonetheless faces two major issues: a pervasive English-centric bias in training and evaluation, and a lack of transparency, with many top-performing models not disclosing training data and methodologies, limiting reproducibility.
Core Problem
Current embedding research primarily focuses on high-resource languages, resulting in insufficient support for mid- and low-resource languages. Additionally, many top-performing embedding models lack transparency in training data and methodologies, limiting reproducibility and global applicability. Addressing these issues is crucial for building truly inclusive, general-purpose embedding systems.
Innovation
F2LLM-v2's core innovations lie in its multilingual support and training transparency. Firstly, the model family supports over 200 languages, with a particular emphasis on mid- and low-resource languages. Secondly, by releasing all models, data, code, and intermediate checkpoints, F2LLM-v2 promotes research transparency and reproducibility. Additionally, F2LLM-v2 employs a two-stage LLM embedding training pipeline, integrating matryoshka learning, model pruning, and knowledge distillation to provide an efficient solution.
Methodology
- Data Collection: Aggregated data from 157 publicly available sources, creating a collection of 60 million training samples spanning 282 natural languages and over 40 programming languages.
- Two-Stage Training: The first stage builds a semantic foundation using seven large-scale retrieval datasets. The second stage refines training for specific downstream applications.
- Model Architecture: Based on the standard Transformer decoder of Qwen3, supporting eight distinct model sizes.
- Knowledge Distillation: Enhances model performance by computing the mean squared error between student and teacher model sequence embeddings.
- Matryoshka Learning: Applied in both training stages to ensure high performance (a combined sketch of the matryoshka and distillation losses follows this list).
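Below is a minimal sketch of how the distillation and matryoshka objectives can be combined. The in-batch contrastive (InfoNCE) formulation, the dimension schedule, and the loss weighting are illustrative assumptions rather than the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def matryoshka_contrastive_loss(q_emb, d_emb, dims=(64, 256, 1024), temp=0.05):
    """In-batch contrastive (InfoNCE) loss averaged over nested embedding prefixes."""
    losses = []
    for d in dims:
        q = F.normalize(q_emb[:, :d], dim=-1)
        k = F.normalize(d_emb[:, :d], dim=-1)
        logits = q @ k.T / temp                      # other rows serve as in-batch negatives
        labels = torch.arange(q.size(0), device=q.device)
        losses.append(F.cross_entropy(logits, labels))
    return torch.stack(losses).mean()

def distillation_loss(student_emb, teacher_emb):
    """Mean squared error between student and teacher sequence embeddings
    (a projection would be needed if their dimensions differ)."""
    return F.mse_loss(student_emb, teacher_emb)

def total_loss(q_emb, d_emb, teacher_q_emb, distill_weight=1.0):
    # The weighting of the distillation term is an assumption for illustration.
    return matryoshka_contrastive_loss(q_emb, d_emb) + distill_weight * distillation_loss(q_emb, teacher_q_emb)
```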
Experiments
The experimental design includes evaluating F2LLM-v2 on 17 MTEB benchmarks, totaling 430 tasks covering retrieval, reranking, classification, and more. Datasets used include CodeSearchNet, MMARCO, and ParaCrawl. Ablation studies were conducted to validate the effectiveness of knowledge distillation and matryoshka learning.
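For context, a minimal sketch of running such an evaluation with the open-source mteb package is shown below; the model checkpoint and task name are placeholders rather than the paper's exact setup, and the benchmark composition used in the paper may differ:

```python
import mteb
from sentence_transformers import SentenceTransformer

# Placeholder model; any SentenceTransformer-compatible embedding model works here.
model = SentenceTransformer("intfloat/multilingual-e5-small")

# Example task; the paper evaluates on 17 MTEB benchmarks totaling 430 tasks.
tasks = mteb.get_tasks(tasks=["Banking77Classification"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="mteb_results")
```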
Results
F2LLM-v2-14B ranks first on 11 MTEB benchmarks, demonstrating outstanding multilingual embedding capabilities. The smaller 330M and 0.6B models also excel in resource-constrained applications, surpassing Qwen3-Embedding and EmbeddingGemma. Ablation studies indicate that knowledge distillation significantly enhances model performance, especially for smaller models, confirming effective transfer of the teacher model's capabilities.
Applications
F2LLM-v2 can be applied in multilingual semantic search, text classification, and clustering scenarios. Its multilingual support makes it widely applicable globally, especially in mid- and low-resource language applications. Additionally, smaller-scale models provide efficient solutions in resource-constrained environments.
Limitations & Outlook
Despite F2LLM-v2's excellent multilingual support, performance on certain low-resource languages still needs improvement, particularly where high-quality training data is lacking. The model still demands significant computational resources, especially for larger-scale models like the 14B version. Performance on specific tasks may be affected by the distribution of training data, leading to limitations in generalization capabilities.
Plain Language (Accessible to Non-Experts)
Imagine you're in a large library, and F2LLM-v2 is like a super-smart librarian. This librarian can not only quickly find the book you want but also explain its contents in your preferred language. Whether you speak English, Chinese, or any of the other 200+ languages, this librarian understands and responds to you.
F2LLM-v2 learns from a vast number of books and articles, mastering the essence of various languages. It's like a multilingual translator, helping you switch seamlessly between different languages. Even for some less common languages, it can provide assistance, much like a knowledgeable language expert.
Moreover, this librarian is highly efficient. Even in resource-limited situations, it can quickly find answers. This is because it has undergone special training to make optimal decisions within limited time and resources. It's like an experienced detective who can quickly find clues in complex cases.
In summary, F2LLM-v2 is a versatile assistant that helps us communicate and understand better in a multilingual world.
ELI14 (Explained Like You're 14)
Hey there! Imagine you have a super-smart friend named F2LLM-v2. This friend can speak over 200 languages! Yes, you heard that right, not just English and Chinese, but many languages you've probably never even heard of.
F2LLM-v2 is like a language wizard. It can help you switch between different languages, like playing a super cool language game. Whether you're looking up information or doing homework, it can help you find answers quickly.
What's even cooler is that this friend can perform well even when resources are limited. It's like when you're playing a game, and your battery is running low, but you still manage to score high!
So, next time you have a language problem, remember to ask F2LLM-v2 for help! It's your multilingual buddy!
Glossary
F2LLM-v2
F2LLM-v2 is a family of multilingual embedding models supporting over 200 languages, with a focus on mid- and low-resource languages.
In the paper, F2LLM-v2 is used to address efficiency and inclusivity in multilingual embeddings.
Matryoshka Learning
Matryoshka learning is a training strategy that nests smaller embedding dimensions inside larger ones, so that truncated prefixes of an embedding remain useful representations.
In F2LLM-v2, matryoshka learning is applied in both training stages so that embeddings stay performant at reduced dimensions.
Knowledge Distillation
Knowledge distillation is a technique that improves smaller models by transferring knowledge from larger models.
In F2LLM-v2, knowledge distillation is used to enhance the performance of smaller-scale models.
Model Pruning
Model pruning reduces the number of model parameters to improve computational efficiency.
F2LLM-v2 uses model pruning to support different model scales.
MTEB Benchmark
MTEB (Massive Text Embedding Benchmark) is a suite of standardized tasks for evaluating embedding model performance across languages and task types.
F2LLM-v2 performs exceptionally well on multiple MTEB benchmarks.
Qwen3
Qwen3 is a family of open large language models built on a standard Transformer decoder architecture, on which F2LLM-v2 is based.
F2LLM-v2's model architecture is based on Qwen3.
EOS Token
The EOS (end-of-sequence) token marks the end of an input sequence; its final hidden state can summarize the entire sequence.
F2LLM-v2 uses the hidden state of the EOS token as sequence representation.
Retrieval Dataset
Retrieval datasets pair queries with relevant documents and are used to train models for information retrieval.
F2LLM-v2 uses several retrieval datasets in its first training stage.
Ablation Study
Ablation studies evaluate the contribution of individual components by systematically removing them to observe performance changes.
F2LLM-v2 uses ablation studies to validate the effectiveness of its techniques.
Multilingual Support
Multilingual support refers to a model's ability to process and understand multiple languages.
F2LLM-v2 supports over 200 languages, providing extensive multilingual support.
Open Questions (Unanswered Questions from This Research)
1. Despite F2LLM-v2's excellent multilingual support, performance on certain low-resource languages still needs improvement. Future research needs to further optimize model performance for these languages to ensure global applicability.
2. The model still demands significant computational resources, especially for larger-scale models like the 14B version. Research needs to explore more efficient training methods to reduce computational costs.
3. Performance on specific tasks may be affected by the distribution of training data, leading to limitations in generalization capabilities. Future research should focus on improving model generalization across different tasks.
4. Although F2LLM-v2 has made efforts in training transparency, further opening of training data and methodologies is needed to promote research reproducibility and transparency.
5. Current models may require more task-specific optimization when handling diverse downstream applications. Future research should explore how to improve task-specific performance without compromising model generality.
Applications
Immediate Applications
Multilingual Semantic Search
F2LLM-v2 can be used for multilingual semantic search, helping users quickly find relevant information across documents in different languages, suitable for global enterprises and multilingual platforms.
Text Classification
By supporting multiple languages, F2LLM-v2 can be applied in text classification tasks globally, especially in scenarios involving mid- and low-resource languages.
Clustering Analysis
F2LLM-v2 can be used for clustering analysis of multilingual texts, helping researchers and businesses discover potential patterns and trends in large datasets.
Long-term Vision
Global Language Services
F2LLM-v2's multilingual support can drive the development of global language services, helping businesses and organizations better communicate and collaborate across languages.
Intelligent Translation Systems
With further optimization, F2LLM-v2 is expected to become a core component of intelligent translation systems, providing more efficient and accurate translation services.
Abstract
We present F2LLM-v2, a new family of general-purpose, multilingual embedding models in 8 distinct sizes ranging from 80M to 14B. Trained on a newly curated composite of 60 million publicly available high-quality data samples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages. By integrating a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation techniques, we present models that are far more efficient than previous LLM-based embedding models while retaining competitive performances. Extensive evaluations confirm that F2LLM-v2-14B ranks first on 11 MTEB benchmarks, while the smaller models in the family also set a new state of the art for resource-constrained applications. To facilitate open-source embedding model research, we release all models, data, code, and intermediate checkpoints.
References (20)
Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models
Yanzhao Zhang, Mingxin Li, Dingkun Long et al.
EmbeddingGemma: Powerful and Lightweight Text Representations
Henrique Schechter Vera, Sahil Dua, Biao Zhang et al.
MTEB: Massive Text Embedding Benchmark
Niklas Muennighoff, Nouamane Tazi, L. Magne et al.
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models
Chankyu Lee, Rajarshi Roy, Mengyao Xu et al.
A question-entailment approach to question answering
Asma Ben Abacha, Dina Demner-Fushman
DuReader: a Chinese Machine Reading Comprehension Dataset from Real-world Applications
Wei He, Kai Liu, Jing Liu et al.
WildChat: 1M ChatGPT Interaction Logs in the Wild
Wenting Zhao, Xiang Ren, J. Hessel et al.
Improving Text Embeddings with Large Language Models
Liang Wang, Nan Yang, Xiaolong Huang et al.
I Wish I Would Have Loved This One, But I Didn’t – A Multilingual Dataset for Counterfactual Detection in Product Review
James O'Neill, Polina Rozenshtein, Ryuichi Kiryo et al.
Multi-Scale Attentive Interaction Networks for Chinese Medical Question Answer Selection
Sheng Zhang, Xin Zhang, Hui Wang et al.
M2Lingual: Enhancing Multilingual, Multi-Turn Instruction Alignment in Large Language Models
Rishabh Maheshwary, Vikas Yadav, Hoang Nguyen et al.
ProCQA: A Large-scale Community-based Programming Question Answering Dataset for Code Search
Zehan Li, Jianfei Zhang, Chuantao Yin et al.
LinkSO: a dataset for learning to retrieve similar question answer pairs on software development forums
Xueqing Liu, Chi Wang, Yue Leng et al.
F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data
Ziyin Zhang, Zihan Liao, Hang Yu et al.
SPECTER: Document-level Representation Learning using Citation-informed Transformers
Arman Cohan, Sergey Feldman, Iz Beltagy et al.
Applying deep matching networks to Chinese medical question answering: a study and a dataset
Junqing He, Mingming Fu, Manshu Tu
SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects
David Ifeoluwa Adelani, Hannah Liu, Xiaoyu Shen et al.
D2LLM: Decomposed and Distilled Large Language Models for Semantic Search
Zihan Liao, Hang Yu, Jianguo Li et al.
MTOP: A Comprehensive Multilingual Task-Oriented Semantic Parsing Benchmark
Haoran Li, Abhinav Arora, Shuohui Chen et al.
Matryoshka Representation Learning
Aditya Kusupati, Gantavya Bhatt, Aniket Rege et al.