F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World

TL;DR

F2LLM-v2 offers efficient multilingual embeddings through two-stage training and matryoshka learning, supporting over 200 languages.

cs.CL | 2026-03-20
Ziyin Zhang Zihan Liao Hang Yu Peng Di Rui Wang
multilingual embedding model knowledge distillation model pruning matryoshka learning

Key Findings

Methodology

F2LLM-v2 employs a two-stage LLM embedding training pipeline, integrating matryoshka learning, model pruning, and knowledge distillation. Initially, it builds a robust semantic foundation using seven large-scale retrieval datasets. Subsequently, it refines training for specific downstream applications, enhancing model capabilities with task-specific instructions. The model architecture is based on the standard Transformer decoder of Qwen3, supporting eight distinct model sizes.
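As a hedged illustration of the matryoshka idea (a sketch of the general technique, not the paper's actual implementation), the snippet below truncates one full embedding to several nested dimensions and re-normalizes each prefix; during training, a contrastive loss would be applied at every such dimension so that shorter prefixes stay usable as standalone embeddings. The function names and the dimension set `(64, 128, 256)` are illustrative assumptions.

```python
import numpy as np

def truncate_and_normalize(emb, dim):
    """Keep the first `dim` coordinates and re-normalize to unit length."""
    v = emb[:dim]
    return v / np.linalg.norm(v)

def matryoshka_scores(query, doc, dims=(64, 128, 256)):
    """Cosine similarity between query and doc at each nested dimension.

    In matryoshka training, a contrastive loss is computed at every
    dimension in `dims`, so prefixes of the full embedding remain
    usable as lower-dimensional embeddings on their own.
    """
    return {d: float(truncate_and_normalize(query, d) @ truncate_and_normalize(doc, d))
            for d in dims}
```

At inference time, a user can then pick whichever prefix length fits their storage and latency budget without retraining the model.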

Key Results

  • F2LLM-v2-14B ranks first on 11 MTEB benchmarks, demonstrating outstanding multilingual embedding capabilities. Smaller models like 330M and 0.6B also excel in resource-constrained applications, surpassing Qwen3-Embedding and EmbeddingGemma.
  • Through knowledge distillation, F2LLM-v2 shows superior performance on several language-specific benchmarks, particularly in the 80M and 160M models, verifying an ideal balance between performance and efficiency.
  • Ablation studies indicate that knowledge distillation significantly enhances model performance, especially in smaller-scale models, proving effective transfer of teacher model capabilities.

Significance

The introduction of F2LLM-v2 marks a significant advancement in multilingual embedding research, particularly in addressing language imbalance and training transparency. By supporting over 200 languages, especially mid- and low-resource ones, the model holds substantial significance in both academia and industry. It not only addresses existing models' shortcomings in multilingual support but also promotes research transparency and reproducibility through open-source initiatives.

Technical Contribution

F2LLM-v2 differs from existing SOTA methods in several ways. Integrating matryoshka learning with a two-stage training strategy opens new engineering possibilities, while the combination of model pruning and knowledge distillation allows smaller-scale models to approach the performance of larger ones, providing efficient solutions in resource-constrained environments.

Novelty

F2LLM-v2 is the first to achieve efficient multilingual embeddings by combining two-stage training with matryoshka learning. Compared to existing multilingual embedding models, it substantially advances both language diversity and training transparency.

Limitations

  • Despite F2LLM-v2's excellent multilingual support, performance on certain low-resource languages still needs improvement, particularly where high-quality training data is lacking.
  • The model still demands significant computational resources, especially for larger-scale models like the 14B version.
  • Performance on specific tasks may be affected by the distribution of training data, leading to limitations in generalization capabilities.

Future Work

Future research directions include further optimizing performance on low-resource languages, exploring more efficient training methods to reduce computational demands, and validating the model's effectiveness in more practical application scenarios.

AI Executive Summary

F2LLM-v2 is a novel family of multilingual embedding models designed to address the current imbalances in embedding research regarding language support and transparency. Existing embedding models often focus on high-resource languages like English and Chinese, neglecting the needs of mid- and low-resource languages. Additionally, many top-performing embedding models lack transparency in training data and methodologies, limiting reproducibility.

F2LLM-v2 integrates a two-stage LLM embedding training pipeline, matryoshka learning, model pruning, and knowledge distillation to provide an efficient and inclusive solution. The model family supports over 200 languages, with a particular emphasis on mid- and low-resource languages, and includes eight distinct model sizes ranging from 80M to 14B.

Technically, F2LLM-v2 employs a standard Transformer decoder architecture based on Qwen3, utilizing the final hidden states of the EOS token as sequence representation. Through a two-stage training strategy, the model excels in building semantic foundations and handling diverse downstream applications.
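The EOS-pooling step described above can be sketched in a few lines. This is a minimal illustration assuming a padded batch of decoder hidden states and a binary attention mask; the function name and array shapes are assumptions, not the paper's code.

```python
import numpy as np

def eos_pool(hidden_states, attention_mask):
    """Select the final hidden state of the last real (EOS) token per sequence.

    hidden_states:  (batch, seq_len, hidden_dim) decoder outputs
    attention_mask: (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    # Index of the last non-padding token in each row.
    last_idx = attention_mask.sum(axis=1) - 1
    pooled = hidden_states[np.arange(hidden_states.shape[0]), last_idx]
    # Unit-normalize so cosine similarity reduces to a dot product.
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
```

Because the decoder is causal, the last token's hidden state has attended to the entire sequence, which is why it is a natural choice for the sequence representation.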

Experimental results show that F2LLM-v2-14B ranks first on 11 MTEB benchmarks, with smaller models like 330M and 0.6B also performing exceptionally well in resource-constrained applications. Ablation studies further validate the effectiveness of knowledge distillation in enhancing model performance, particularly in smaller-scale models.

The release of F2LLM-v2 not only holds significant implications for academia and industry but also promotes research transparency and reproducibility through open-source initiatives. Future research directions include optimizing performance on low-resource languages, exploring more efficient training methods, and validating the model's effectiveness in more practical application scenarios.

Deep Analysis

Background

In recent years, text embedding models have played a crucial role in AI applications such as semantic search, text classification, and clustering. Traditional embedding models were primarily based on encoder architectures like XLM-R and mBART. However, with the rise of decoder architectures, LLM-based embedding models like E5-Mistral and NV-Embed have become dominant. These models have gained extensive reasoning and linguistic capabilities through large-scale pre-training. However, current embedding research faces two major issues: a pervasive English-centric bias in training and evaluation, and a lack of transparency, with many top-performing models not disclosing training data and methodologies, limiting reproducibility.

Core Problem

Current embedding research primarily focuses on high-resource languages, resulting in insufficient support for mid- and low-resource languages. Additionally, many top-performing embedding models lack transparency in training data and methodologies, limiting reproducibility and global applicability. Addressing these issues is crucial for building truly inclusive, general-purpose embedding systems.

Innovation

F2LLM-v2's core innovations lie in its multilingual support and training transparency. Firstly, the model family supports over 200 languages, with a particular emphasis on mid- and low-resource languages. Secondly, by releasing all models, data, code, and intermediate checkpoints, F2LLM-v2 promotes research transparency and reproducibility. Additionally, F2LLM-v2 employs a two-stage LLM embedding training pipeline, integrating matryoshka learning, model pruning, and knowledge distillation to provide an efficient solution.

Methodology

  • Data Collection: Aggregated data from 157 publicly available sources, creating a collection of 60 million training samples spanning 282 natural languages and over 40 programming languages.

  • Two-Stage Training: The first stage builds a semantic foundation using seven large-scale retrieval datasets. The second stage refines training for specific downstream applications.

  • Model Architecture: Based on the standard Transformer decoder of Qwen3, supporting eight distinct model sizes.

  • Knowledge Distillation: Enhances model performance by computing the mean squared error between student and teacher model sequence embeddings.

  • Matryoshka Learning: Applied in both training stages to ensure high performance.
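The distillation objective described in the list above (MSE between student and teacher sequence embeddings) can be sketched as follows. This assumes both models produce embeddings of the same dimension; the function name is an illustrative assumption.

```python
import numpy as np

def distillation_mse(student_emb, teacher_emb):
    """Mean squared error between student and teacher sequence embeddings.

    Both inputs have shape (batch, hidden_dim). The summary describes
    distillation as an MSE between the two models' sequence embeddings;
    matching embedding dimensions are assumed here.
    """
    return float(np.mean((student_emb - teacher_emb) ** 2))
```

In practice this term would be added to the contrastive training loss, pulling the smaller student's embedding space toward the larger teacher's.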

Experiments

The experimental design includes evaluating F2LLM-v2 on 17 MTEB benchmarks, totaling 430 tasks covering retrieval, reranking, classification, and more. Datasets used include CodeSearchNet, MMARCO, and ParaCrawl. Ablation studies were conducted to validate the effectiveness of knowledge distillation and matryoshka learning.

Results

F2LLM-v2-14B ranks first on 11 MTEB benchmarks, demonstrating outstanding multilingual embedding capabilities. Smaller models like 330M and 0.6B also excel in resource-constrained applications, surpassing Qwen3-Embedding and EmbeddingGemma. Ablation studies indicate that knowledge distillation significantly enhances model performance, especially in smaller-scale models, proving effective transfer of teacher model capabilities.

Applications

F2LLM-v2 can be applied in multilingual semantic search, text classification, and clustering scenarios. Its multilingual support makes it widely applicable globally, especially in mid- and low-resource language applications. Additionally, smaller-scale models provide efficient solutions in resource-constrained environments.

Limitations & Outlook

Despite F2LLM-v2's excellent multilingual support, performance on certain low-resource languages still needs improvement, particularly where high-quality training data is lacking. The model still demands significant computational resources, especially for larger-scale models like the 14B version. Performance on specific tasks may be affected by the distribution of training data, leading to limitations in generalization capabilities.

Plain Language: Accessible to Non-Experts

Imagine you're in a large library, and F2LLM-v2 is like a super-smart librarian. This librarian can not only quickly find the book you want but also explain its contents in your preferred language. Whether you speak English, Chinese, or any of the other 200+ languages, this librarian understands and responds to you.

F2LLM-v2 learns from a vast number of books and articles, mastering the essence of various languages. It's like a multilingual translator, helping you switch seamlessly between different languages. Even for some less common languages, it can provide assistance, much like a knowledgeable language expert.

Moreover, this librarian is highly efficient. Even in resource-limited situations, it can quickly find answers. This is because it has undergone special training to make optimal decisions within limited time and resources. It's like an experienced detective who can quickly find clues in complex cases.

In summary, F2LLM-v2 is a versatile assistant that helps us communicate and understand better in a multilingual world.

ELI14: Explained Like You're 14

Hey there! Imagine you have a super-smart friend named F2LLM-v2. This friend can speak over 200 languages! Yes, you heard that right, not just English and Chinese, but many languages you've probably never even heard of.

F2LLM-v2 is like a language wizard. It can help you switch between different languages, like playing a super cool language game. Whether you're looking up information or doing homework, it can help you find answers quickly.

What's even cooler is that this friend can perform well even when resources are limited. It's like when you're playing a game, and your battery is running low, but you still manage to score high!

So, next time you have a language problem, remember to ask F2LLM-v2 for help! It's your multilingual buddy!

Glossary

F2LLM-v2

F2LLM-v2 is a family of multilingual embedding models supporting over 200 languages, with a focus on mid- and low-resource languages.

In the paper, F2LLM-v2 is used to address efficiency and inclusivity in multilingual embeddings.

Matryoshka Learning

Matryoshka learning is a training strategy that supervises nested prefixes of an embedding vector, so that truncated, lower-dimensional embeddings remain usable on their own.

In F2LLM-v2, matryoshka learning is used to enhance model performance across training stages.

Knowledge Distillation

Knowledge distillation is a technique that improves smaller models by transferring knowledge from larger models.

In F2LLM-v2, knowledge distillation is used to enhance the performance of smaller-scale models.

Model Pruning

Model pruning reduces the number of model parameters to improve computational efficiency.

F2LLM-v2 uses model pruning to support different model scales.

MTEB Benchmark

MTEB (Massive Text Embedding Benchmark) is a standardized suite of tasks for evaluating text embedding models across languages and task types.

F2LLM-v2 performs exceptionally well on multiple MTEB benchmarks.

Qwen3

Qwen3 is a family of large language models built on a standard Transformer decoder architecture.

F2LLM-v2's model architecture is based on Qwen3.

EOS Token

The EOS (end-of-sequence) token marks the end of an input sequence; its final hidden state can serve as a representation of the whole sequence.

F2LLM-v2 uses the hidden state of the EOS token as sequence representation.

Retrieval Dataset

Retrieval datasets are used to train models to improve information retrieval capabilities.

F2LLM-v2 uses several retrieval datasets in its first training stage.

Ablation Study

Ablation studies evaluate the contribution of individual components by systematically removing them to observe performance changes.

F2LLM-v2 uses ablation studies to validate the effectiveness of its techniques.

Multilingual Support

Multilingual support refers to a model's ability to process and understand multiple languages.

F2LLM-v2 supports over 200 languages, providing extensive multilingual support.

Open Questions: Unanswered Questions from This Research

  1. Despite F2LLM-v2's excellent multilingual support, performance on certain low-resource languages still needs improvement. Future research needs to further optimize model performance for these languages to ensure global applicability.
  2. The model still demands significant computational resources, especially for larger-scale models like the 14B version. Research needs to explore more efficient training methods to reduce computational costs.
  3. Performance on specific tasks may be affected by the distribution of training data, leading to limitations in generalization capabilities. Future research should focus on improving model generalization across different tasks.
  4. Although F2LLM-v2 has made efforts in training transparency, further opening of training data and methodologies is needed to promote research reproducibility and transparency.
  5. Current models may require more task-specific optimization when handling diverse downstream applications. Future research should explore how to improve task-specific performance without compromising model generality.

Applications

Immediate Applications

Multilingual Semantic Search

F2LLM-v2 can be used for multilingual semantic search, helping users quickly find relevant information across documents in different languages, suitable for global enterprises and multilingual platforms.

Text Classification

By supporting multiple languages, F2LLM-v2 can be applied in text classification tasks globally, especially in scenarios involving mid- and low-resource languages.

Clustering Analysis

F2LLM-v2 can be used for clustering analysis of multilingual texts, helping researchers and businesses discover potential patterns and trends in large datasets.

Long-term Vision

Global Language Services

F2LLM-v2's multilingual support can drive the development of global language services, helping businesses and organizations better communicate and collaborate across languages.

Intelligent Translation Systems

With further optimization, F2LLM-v2 is expected to become a core component of intelligent translation systems, providing more efficient and accurate translation services.

Abstract

We present F2LLM-v2, a new family of general-purpose, multilingual embedding models in 8 distinct sizes ranging from 80M to 14B. Trained on a newly curated composite of 60 million publicly available high-quality data samples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages. By integrating a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation techniques, we present models that are far more efficient than previous LLM-based embedding models while retaining competitive performances. Extensive evaluations confirm that F2LLM-v2-14B ranks first on 11 MTEB benchmarks, while the smaller models in the family also set a new state of the art for resource-constrained applications. To facilitate open-source embedding model research, we release all models, data, code, and intermediate checkpoints.


References (20)

  1. Yanzhao Zhang, Mingxin Li, Dingkun Long et al. Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. 2025.
  2. Henrique Schechter Vera, Sahil Dua, Biao Zhang et al. EmbeddingGemma: Powerful and Lightweight Text Representations. 2025.
  3. Niklas Muennighoff, Nouamane Tazi, L. Magne et al. MTEB: Massive Text Embedding Benchmark. 2022.
  4. Chankyu Lee, Rajarshi Roy, Mengyao Xu et al. NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models. 2024.
  5. Asma Ben Abacha, Dina Demner-Fushman. A Question-Entailment Approach to Question Answering. 2019.
  6. Wei He, Kai Liu, Jing Liu et al. DuReader: a Chinese Machine Reading Comprehension Dataset from Real-world Applications. 2017.
  7. Wenting Zhao, Xiang Ren, J. Hessel et al. WildChat: 1M ChatGPT Interaction Logs in the Wild. 2024.
  8. Liang Wang, Nan Yang, Xiaolong Huang et al. Improving Text Embeddings with Large Language Models. 2023.
  9. James O'Neill, Polina Rozenshtein, Ryuichi Kiryo et al. I Wish I Would Have Loved This One, But I Didn't: A Multilingual Dataset for Counterfactual Detection in Product Review. 2021.
  10. Sheng Zhang, Xin Zhang, Hui Wang et al. Multi-Scale Attentive Interaction Networks for Chinese Medical Question Answer Selection. 2018.
  11. Rishabh Maheshwary, Vikas Yadav, Hoang Nguyen et al. M2Lingual: Enhancing Multilingual, Multi-Turn Instruction Alignment in Large Language Models. 2024.
  12. Zehan Li, Jianfei Zhang, Chuantao Yin et al. ProCQA: A Large-scale Community-based Programming Question Answering Dataset for Code Search. 2024.
  13. Xueqing Liu, Chi Wang, Yue Leng et al. LinkSO: a dataset for learning to retrieve similar question answer pairs on software development forums. 2018.
  14. Ziyin Zhang, Zihan Liao, Hang Yu et al. F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data. 2025.
  15. Arman Cohan, Sergey Feldman, Iz Beltagy et al. SPECTER: Document-level Representation Learning using Citation-informed Transformers. 2020.
  16. Junqing He, Mingming Fu, Manshu Tu. Applying deep matching networks to Chinese medical question answering: a study and a dataset. 2019.
  17. David Ifeoluwa Adelani, Hannah Liu, Xiaoyu Shen et al. SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects. 2023.
  18. Zihan Liao, Hang Yu, Jianguo Li et al. D2LLM: Decomposed and Distilled Large Language Models for Semantic Search. 2024.
  19. Haoran Li, Abhinav Arora, Shuohui Chen et al. MTOP: A Comprehensive Multilingual Task-Oriented Semantic Parsing Benchmark. 2020.
  20. Aditya Kusupati, Gantavya Bhatt, Aniket Rege et al. Matryoshka Representation Learning. 2022.