Long-Context Encoder Models for Polish Language Understanding
Introduced a Polish long-context encoder model handling up to 8192 tokens, significantly improving long-document task performance.
Key Findings
Methodology
The paper proposes a two-stage training strategy: positional embeddings are first adapted to extend the model's context window, followed by full parameter continuous pre-training. The enhanced Polish RoBERTa encoder also supports Flash Attention and contamination-free packing, boosting training efficiency and long-document processing capabilities. Additionally, compressed model variants were trained via knowledge distillation, reducing layer counts by 50% and 75% for efficiency-critical applications like edge devices.
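A minimal sketch of the first stage, assuming learned absolute positional embeddings initialized for the longer window by cyclically copying the original table (the paper's exact adaptation scheme may differ); the extended table would then be refined during continued pre-training:

```python
import numpy as np

def extend_positional_embeddings(pos_emb: np.ndarray, new_len: int) -> np.ndarray:
    """Extend a learned positional embedding matrix (old_len, dim) to new_len
    positions by cyclically copying the original rows as an initialization."""
    old_len, dim = pos_emb.shape
    reps = -(-new_len // old_len)  # ceiling division
    return np.tile(pos_emb, (reps, 1))[:new_len]

# Toy example: a 512-position table grown to the paper's 8192-token window.
rng = np.random.default_rng(0)
table = rng.normal(size=(512, 768))
extended = extend_positional_embeddings(table, 8192)
print(extended.shape)  # (8192, 768)
```

The first 512 rows are unchanged, so the model's behavior on short inputs is preserved at initialization.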
Key Results
- Result 1: Across 25 tasks, including the KLEJ benchmark and the newly introduced financial task suite FinBench, the model achieves the best average performance among Polish and multilingual models, with the largest margins on long-document tasks.
- Result 2: On short text tasks, the model performs comparably to existing solutions, demonstrating adaptability across different context length tasks.
- Result 3: Through knowledge distillation, the compressed models maintain the original model's performance while significantly reducing computational resource consumption, especially in edge device deployments.
Significance
This research provides significant advancements in Polish NLP, particularly in long-document understanding. By extending the context window to 8192 tokens, the model can handle longer texts, crucial for fields like finance and law that require extensive information analysis. Additionally, the application of knowledge distillation allows the model to operate efficiently in resource-constrained environments, broadening its application scenarios. The study not only offers new insights into long-context processing in academia but also provides technical support for industry-specific applications.
Technical Contribution
The technical contributions of this paper include: firstly, extending the context window of the Polish RoBERTa encoder to handle texts up to 8192 tokens; secondly, introducing architectural improvements like Flash Attention and contamination-free packing to enhance training efficiency and model performance; and finally, successfully compressing the model through knowledge distillation, significantly reducing computational requirements and making it suitable for edge device applications.
Novelty
This study presents the first Polish encoder model to support an 8192-token context window, together with distilled variants that are compressed without sacrificing performance. Compared to existing Polish and multilingual models, this model excels in long-document tasks, filling a gap in Polish long-context processing.
Limitations
- Limitation 1: Despite excellent performance in long-document tasks, further fine-tuning and optimization may be needed for specific domains or tasks.
- Limitation 2: The model's training and inference still require substantial computational resources, especially for ultra-long texts, potentially limiting its application in resource-constrained environments.
- Limitation 3: During knowledge distillation, the performance recovery of the compressed model may not fully match the original, particularly in complex tasks.
Future Work
Future research directions include: further optimizing training efficiency and performance, especially in resource-constrained environments; exploring applications in more domains requiring long-context tasks, such as law and medicine; and continuing to improve knowledge distillation techniques to better recover the original model's performance.
AI Executive Summary
Encoder models face a context-window bottleneck when handling long-text tasks: traditional encoders like BERT support only 512 tokens, which is insufficient for long-document processing. To address this issue, this paper introduces a novel Polish encoder model capable of processing texts up to 8192 tokens long. Through a two-stage training strategy involving positional embedding adaptation and full parameter continuous pre-training, the model excels in long-document tasks.
The model is based on enhancements to the Polish RoBERTa encoder, supporting Flash Attention and contamination-free packing to improve training efficiency and model performance. Additionally, compressed model variants were trained via knowledge distillation, reducing layer counts by 50% and 75% to suit efficiency-critical applications like edge devices.
In experiments, the model demonstrated superior performance across 25 tasks, including the KLEJ benchmark and the newly introduced financial task suite FinBench, particularly excelling in long-document tasks compared to competitors. On short text tasks, the model performs comparably to existing solutions, showcasing its adaptability across different context length tasks.
This research provides significant advancements in Polish NLP, particularly in long-document understanding. By extending the context window to 8192 tokens, the model can handle longer texts, crucial for fields like finance and law that require extensive information analysis. Additionally, the application of knowledge distillation allows the model to operate efficiently in resource-constrained environments, broadening its application scenarios.
However, the model's training and inference still require substantial computational resources, especially for ultra-long texts, potentially limiting its application in resource-constrained environments. Future research directions include further optimizing training efficiency and performance, exploring applications in more domains requiring long-context tasks, and continuing to improve knowledge distillation techniques to better recover the original model's performance.
Deep Analysis
Background
In recent years, the introduction of the Transformer architecture has led to significant advancements in the field of natural language processing. Encoder models like BERT and RoBERTa have excelled in tasks such as text classification and named entity recognition. However, these models typically have a context window limited to 512 tokens, which becomes a bottleneck when processing long documents. To address this issue, researchers have begun exploring encoder models with extended context windows, such as ModernBERT and NeoBERT. However, these models primarily target English, and there is still a lack of models for Polish that can handle long contexts. This paper addresses this gap by introducing a novel Polish encoder model designed to extend the context window and enhance long-document processing capabilities.
Core Problem
Traditional encoder models face the challenge of context window limitations when processing long documents. Classic models like BERT and RoBERTa have a context window of only 512 tokens, which is insufficient for tasks requiring long text processing, such as legal document analysis and financial report interpretation. The core problem addressed in this paper is how to extend the model's context window to support processing texts up to 8192 tokens without significantly increasing computational resource demands.
Innovation
The core innovations of this paper include:
1) Extending the context window of the Polish RoBERTa encoder to 8192 tokens through positional embedding adaptation and full parameter continuous pre-training, enhancing long-document processing capabilities.
2) Introducing Flash Attention and contamination-free packing techniques to improve training efficiency and model performance, ensuring superior performance in long-document tasks.
3) Successfully compressing the model through knowledge distillation, reducing layer counts to suit efficiency-critical applications like edge devices.
Methodology
The methodology of this paper includes the following key steps:
- Extending positional embeddings: Adapting positional embeddings to extend the context window of the Polish RoBERTa encoder to 8192 tokens.
- Full parameter continuous pre-training: Conducting full parameter continuous pre-training after extending positional embeddings to adapt the model to long-document processing.
- Introducing Flash Attention: Optimizing the attention mechanism to reduce memory consumption and improve computational efficiency.
- Contamination-free packing: Limiting the attention mechanism from crossing document boundaries to avoid cross-contamination of content from different documents.
- Knowledge distillation: Training compressed model variants through knowledge distillation, reducing layer counts to suit resource-constrained environments.
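The contamination-free packing step above can be sketched as a block-diagonal attention mask, so that tokens from different documents packed into the same sequence never attend to each other (a minimal illustration, not the paper's implementation):

```python
import numpy as np

def packing_attention_mask(doc_lengths):
    """Build a block-diagonal attention mask for several documents packed
    into one sequence: tokens may only attend within their own document."""
    total = sum(doc_lengths)
    mask = np.zeros((total, total), dtype=bool)
    start = 0
    for n in doc_lengths:
        mask[start:start + n, start:start + n] = True
        start += n
    return mask

# Three documents of lengths 3, 2 and 4 packed into one 9-token sequence.
mask = packing_attention_mask([3, 2, 4])
print(mask.shape)  # (9, 9)
print(mask[0, 4])  # False: token 0 cannot see into the second document
print(mask[3, 4])  # True: tokens 3 and 4 share a document
```

In practice the mask (or equivalent variable-length sequence metadata) is passed to the attention kernel so that packing improves throughput without mixing content across documents.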
Experiments
The experimental design includes evaluating the model's performance across 25 tasks, covering the KLEJ benchmark, the financial task suite FinBench, and other classification and regression tasks. Key hyperparameters used in the experiments include: AdamW optimizer, a maximum learning rate of 2e-5, a warmup phase of 500 batches, a batch size of 128, and a sequence length of 8192. By comparing the performance of different models on long and short text tasks, the superiority of the proposed model in long-document tasks is validated.
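The reported optimizer settings suggest a schedule like the following sketch. Only the peak learning rate of 2e-5 and the 500-batch warmup come from the paper; the linear decay shape and `total_steps` are illustrative assumptions:

```python
def learning_rate(step: int, max_lr: float = 2e-5, warmup: int = 500,
                  total_steps: int = 10_000) -> float:
    """Linear warmup to max_lr over `warmup` batches, then linear decay to zero.
    The decay shape and total_steps are assumptions, not taken from the paper."""
    if step < warmup:
        return max_lr * step / warmup
    remaining = max(total_steps - step, 0)
    return max_lr * remaining / (total_steps - warmup)

print(learning_rate(0))      # 0.0
print(learning_rate(500))    # peak learning rate (~2e-05)
print(learning_rate(10_000)) # 0.0
```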
Results
Experimental results show that the proposed model excels in long-document tasks, particularly in the KLEJ benchmark and FinBench tasks, achieving the best average performance among Polish and multilingual models. On short text tasks, the model performs comparably to existing solutions, demonstrating adaptability across different context length tasks. Additionally, through knowledge distillation, the compressed models maintain the original model's performance while significantly reducing computational resource consumption, especially in edge device deployments.
Applications
The model has broad application prospects in fields like finance and law that require long-document processing. By extending the context window, the model can handle longer texts, crucial for tasks requiring extensive information analysis. Additionally, the application of knowledge distillation allows the model to operate efficiently in resource-constrained environments, broadening its application scenarios.
Limitations & Outlook
Despite excellent performance in long-document tasks, further fine-tuning and optimization may be needed for specific domains or tasks. Additionally, the model's training and inference still require substantial computational resources, especially for ultra-long texts, potentially limiting its application in resource-constrained environments. Future research directions include further optimizing training efficiency and performance, exploring applications in more domains requiring long-context tasks, and continuing to improve knowledge distillation techniques to better recover the original model's performance.
Plain Language (Accessible to non-experts)
Imagine you're in a library faced with a stack of thick books. Traditional encoder models are like a student who can only read one page at a time, needing to understand the entire book in a limited time. The Polish encoder model proposed in this paper is like a student who can quickly skim through the entire book, grasping the key points in a short time. This improvement is due to the extension of the model's context window, akin to enhancing the student's reading speed and comprehension ability. Additionally, through knowledge distillation, this student can lighten their backpack without losing comprehension ability, allowing them to adapt flexibly to various challenges in different environments.
ELI14 (Explained like you're 14)
Hey there, imagine you're playing a super complex game where you need to remember lots of clues to win. Traditional encoder models are like a player who can only remember a small part of the clues, while our new Polish encoder model is like a player with a super memory, able to remember more clues to help you win faster! Plus, this player has learned how to travel light without losing important information, adapting to different game environments. It's like in school, where you can remember the key points the teacher mentioned and use them flexibly in exams to score high!
Glossary
Encoder Model
A neural network architecture used to process and understand text input, typically used for tasks like classification and named entity recognition.
In this paper, the encoder model is used for long-document tasks.
Context Window
The maximum number of tokens the model can focus on simultaneously when processing text.
This paper extends the context window to 8192 tokens to improve long-document processing capabilities.
Positional Embedding
Vectors used to represent the position of each token in a sequence.
This paper adapts positional embeddings to extend the model's context window.
Flash Attention
An optimized attention mechanism designed to reduce memory consumption and computational overhead.
This paper introduces Flash Attention to improve training efficiency.
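As an illustration, PyTorch's fused attention entry point dispatches to a FlashAttention kernel when one is available, computing exact attention without materializing the full score matrix. The tensor shapes below are illustrative, not taken from the paper:

```python
import torch
import torch.nn.functional as F

# Toy batch: 2 sequences, 4 heads, 128 tokens, head dimension 64.
q = torch.randn(2, 4, 128, 64)
k = torch.randn(2, 4, 128, 64)
v = torch.randn(2, 4, 128, 64)

# On supported hardware this call runs a FlashAttention-style kernel;
# elsewhere it falls back to an equivalent exact-attention implementation.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 4, 128, 64])
```

The memory saving matters most at long sequence lengths, since naive attention stores an L x L score matrix per head.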
Contamination-Free Packing
A technique to avoid cross-contamination of content from different documents by limiting the attention mechanism from crossing document boundaries.
This paper uses contamination-free packing to ensure model performance in long-document tasks.
Knowledge Distillation
A technique for training smaller models to mimic the behavior of larger models, aiming to reduce model size and computational resource requirements.
This paper uses knowledge distillation to train compressed model variants.
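A minimal sketch of the standard soft-label distillation objective: KL divergence between temperature-softened teacher and student distributions. The temperature and toy logits are illustrative assumptions; the paper's exact distillation loss may differ:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) over softened distributions, scaled by T^2
    as is conventional so gradients stay comparable across temperatures."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q))) * T * T)

# Identical logits give zero loss; diverging logits give a positive loss.
print(distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0
```

Layer-reduced students are typically initialized from a subset of teacher layers and trained against such soft targets, which is consistent with the 50% and 75% layer reductions described in the paper.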
KLEJ Benchmark
A benchmark used to evaluate the performance of Polish NLP models, consisting of multiple tasks.
This paper validates the model's performance on the KLEJ benchmark.
FinBench
A Polish benchmark focused on tasks in the financial and banking domains.
This paper introduces FinBench to evaluate the model's performance in financial tasks.
Long-Document Task
Tasks that require processing and understanding long texts, typically exceeding the context window of traditional encoders.
The model proposed in this paper excels in long-document tasks.
Edge Device
Devices with limited computational resources, such as smartphones and IoT devices.
This paper uses knowledge distillation to make the model suitable for edge devices.
Open Questions (Unanswered questions from this research)
1) Despite the model's excellent performance in long-document tasks, further fine-tuning and optimization may be needed for specific domains or tasks. This requires exploring more fine-grained domain adaptation techniques.
2) The model's training and inference still require substantial computational resources, especially for ultra-long texts, potentially limiting its application in resource-constrained environments. How to further reduce computational resource demands without compromising performance is a question worth exploring.
3) During knowledge distillation, the performance recovery of the compressed model may not fully match the original, particularly in complex tasks. How to improve knowledge distillation techniques to better recover the original model's performance is a future research direction.
4) Although the context window has been extended, the model may still encounter performance bottlenecks when processing extremely long texts. How to further extend the context window while maintaining computational efficiency is a challenge.
5) In multilingual environments, how to effectively apply the methods proposed in this paper to other languages, especially low-resource languages, is a direction worth exploring.
Applications
Immediate Applications
Financial Report Analysis
Financial institutions can use this model to analyze long financial reports, extract key information, and improve decision-making efficiency.
Legal Document Processing
Law firms can use this model to process and analyze long legal documents, supporting legal research and case analysis.
Customer Service Automation
Companies can apply this model to customer service systems to automatically process and understand long customer feedback and complaints.
Long-term Vision
Multilingual Long-Document Processing
In the future, this model can be extended to other languages, supporting multilingual long-document processing and facilitating cross-language information exchange.
Intelligent Document Management System
Develop an intelligent document management system that uses this model to automatically archive, classify, and retrieve long documents, improving enterprise information management efficiency.
Abstract
While decoder-only Large Language Models (LLMs) have recently dominated the NLP landscape, encoder-only architectures remain a cost-effective and parameter-efficient standard for discriminative tasks. However, classic encoders like BERT are limited by a short context window, which is insufficient for processing long documents. In this paper, we address this limitation for Polish by introducing a high-quality Polish model capable of processing sequences of up to 8192 tokens. The model was developed by employing a two-stage training procedure that involves positional embedding adaptation and full parameter continuous pre-training. Furthermore, we propose compressed model variants trained via knowledge distillation. The models were evaluated on 25 tasks, including the KLEJ benchmark, a newly introduced financial task suite (FinBench), and other classification and regression tasks, specifically those requiring long-document understanding. The results demonstrate that our model achieves the best average performance among Polish and multilingual models, significantly outperforming competitive solutions in long-context tasks while maintaining comparable quality on short texts.
References (20)
Pre-training Polish Transformer-based Language Models at Scale
Slawomir Dadas, Michał Perełkiewicz, Rafal Poswiata
MIPD: Exploring Manipulation and Intention In a Novel Corpus of Polish Disinformation
Arkadiusz Modzelewski, Giovanni Da San Martino, Pavel Savov et al.
How to Distill your BERT: An Empirical Study on the Impact of Weight Initialisation and Distillation Objectives
Xinpeng Wang, Leonie Weissweiler, Hinrich Schutze et al.
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Tri Dao, Daniel Y. Fu, Stefano Ermon et al.
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
Benjamin Warner, Antoine Chaffin, Benjamin Clavié et al.
Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder et al.
MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining
J. Portes, Alex Trott, Sam Havens et al.
Training Effective Neural Sentence Encoders from Automatically Mined Paraphrases
Slawomir Dadas
MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers
Wenhui Wang, Furu Wei, Li Dong et al.
WWW'18 Open Challenge: Financial Opinion Mining and Question Answering
Macedo Maia, S. Handschuh, A. Freitas et al.
mmBERT: A Modern Multilingual Encoder with Annealed Language Learning
Marc Marone, Orion Weller, William Fleshman et al.
HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish
Robert Mroczkowski, Piotr Rybak, Alina Wróblewska et al.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee et al.
Evaluation of Sentence Representations in Polish
Slawomir Dadas, Michał Perełkiewicz, Rafal Poswiata
EuroBERT: Scaling Multilingual Encoders for European Languages
Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Duarte M. Alves et al.
KLEJ: Comprehensive Benchmark for Polish Language Understanding
Piotr Rybak, Robert Mroczkowski, Janusz Tracz et al.
Impact of News on the Commodity Market: Dataset and Results
Ankur Sinha, Tanmay Khandait
NeoBERT: A Next-Generation BERT
Lola Le Breton, Quentin Fournier, Mariam El Mezouar et al.
Efficient Intent Detection with Dual Sentence Encoders
I. Casanueva, Tadas Temvcinas, D. Gerz et al.
Large-Scale Multi-Label Text Classification on EU Legislation
Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis et al.