Understanding Data Temporality Impact on Large Language Models Pre-training
6B-parameter LLMs pretrained on chronologically ordered Common Crawl snapshots score approximately 15% higher F1 on KairosQA, a temporal knowledge benchmark, than shuffled baselines.
Key Findings
Methodology
This work presents a controlled experimental framework comparing 6B-parameter Transformer decoder models pretrained on temporally ordered versus randomly shuffled Common Crawl snapshots spanning 2018 to 2025. The data undergoes rigorous multi-stage filtering including language identification, deduplication, and quality scoring. The authors introduce KairosQA, a novel benchmark with 7,167 temporally grounded question-answer pairs extracted from Wikidata, designed to evaluate models' ability to associate facts with their correct time periods. Evaluation protocols combine cloze-style and generative QA tasks, alongside OLMES and TAQA benchmarks, to comprehensively assess temporal factual knowledge acquisition and general language understanding.
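The two regimes differ only in the order in which documents reach the model. The following is a minimal sketch of that difference, not the released training code: the `Doc` record and the `order_documents` helper are hypothetical names, and shuffling within each year is an assumption of this sketch.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Hypothetical record: a text plus the year of its Common Crawl snapshot."""
    text: str
    year: int


def order_documents(docs, sequential, seed=0):
    """Return documents in training order.

    sequential=True  -> years appear chronologically; documents are shuffled
                        only within a year (an assumption of this sketch).
    sequential=False -> one global shuffle that ignores the year.
    """
    rng = random.Random(seed)
    docs = list(docs)
    if not sequential:
        rng.shuffle(docs)
        return docs
    by_year = {}
    for doc in docs:
        by_year.setdefault(doc.year, []).append(doc)
    ordered = []
    for year in sorted(by_year):
        bucket = by_year[year]
        rng.shuffle(bucket)
        ordered.extend(bucket)
    return ordered
```

Both orderings contain exactly the same documents, so any difference in downstream behavior in this toy setup can only come from the order itself.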
Key Results
- Sequentially pretrained models achieve approximately 15% higher F1 scores on KairosQA compared to shuffled baselines, especially excelling on recent years (2023-2024), demonstrating enhanced factual freshness and temporal precision.
- Both training paradigms perform comparably on the OLMES benchmark for general language understanding, indicating that temporal ordering does not compromise core language capabilities.
- Sequential models exhibit a clear recency bias, with peak accuracy aligned with their training cutoff year, whereas shuffled models favor older data, likely due to repeated exposure to historical facts.
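The recency-bias observation rests on scoring a model separately for each evaluation year and seeing where the curve peaks. A minimal sketch of that bookkeeping follows. The function names are hypothetical and the input format, a list of `(year, is_correct)` pairs, is an assumption rather than the benchmark's actual output format.

```python
from collections import defaultdict


def accuracy_by_year(records):
    """records: iterable of (year, is_correct) pairs, one per evaluated question."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for year, is_correct in records:
        totals[year] += 1
        hits[year] += int(bool(is_correct))
    return {year: hits[year] / totals[year] for year in sorted(totals)}


def peak_year(per_year):
    """Year with the highest accuracy; ties go to the earliest year."""
    return max(sorted(per_year), key=lambda year: per_year[year])


# Toy records, made up for illustration only.
toy = [(2022, True), (2022, False), (2023, True), (2023, True), (2024, False)]
per_year = accuracy_by_year(toy)
```

Comparing `peak_year` against a checkpoint's data cutoff gives a simple, if coarse, indicator of whether a model's knowledge is aligned with its most recent training data.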
Significance
This study advances understanding of how pretraining data temporality shapes what large language models know and how current that knowledge is. It addresses knowledge freezing by showing that ordering data chronologically during pretraining improves models' ability to internalize and recall up-to-date facts. By releasing the KairosQA dataset, code, and checkpoints, the work gives the community resources for developing continual learning methods and building more temporally aligned LLMs, which should make them more reliable in time-sensitive scenarios.
Technical Contribution
The paper introduces a novel temporal curriculum learning paradigm for LLM pretraining, departing from the conventional random shuffling approach. It designs KairosQA, a temporally annotated QA benchmark, and employs a dual evaluation protocol combining cloze and generative tasks to precisely measure temporal knowledge alignment. The training employs a staged cooldown learning rate schedule to ensure convergence stability. The authors systematically analyze intermediate checkpoints to reveal temporal knowledge acquisition trajectories and forgetting patterns, providing new insights into the temporal dynamics of LLM knowledge.
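The learning rate schedule can be sketched from the values reported in the methodology: a peak rate of 10^-3 for the main run, and a cooldown branch of 30k steps with cosine decay to 10^-4. The warmup length and its linear shape are assumptions of this sketch, since the summary does not state them, and the function names are hypothetical.

```python
import math

PEAK_LR = 1e-3           # reported peak learning rate
FINAL_LR = 1e-4          # reported end point of the cooldown
COOLDOWN_STEPS = 30_000  # reported cooldown length
WARMUP_STEPS = 2_000     # assumption: warmup length is not given in the summary


def main_run_lr(step):
    """Linear warmup (assumed shape), then constant at the peak rate."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    return PEAK_LR


def cooldown_lr(steps_since_branch):
    """Cosine decay from PEAK_LR to FINAL_LR over COOLDOWN_STEPS after branching."""
    clipped = min(max(steps_since_branch, 0), COOLDOWN_STEPS)
    progress = clipped / COOLDOWN_STEPS
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine
```

Because the cooldown branches off the main run, the stable phase can keep training on later data while each branch yields a checkpoint that has been annealed for evaluation.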
Novelty
This is the first comprehensive study to empirically compare the effects of temporally ordered versus shuffled pretraining data on LLMs' temporal factual knowledge. The creation of KairosQA as a temporally grounded QA dataset is a fundamental innovation, enabling precise evaluation of models' understanding of fact-time associations. The work challenges the prevailing assumption that random shuffling suffices, proposing temporal ordering as a key factor to enhance knowledge freshness and temporal grounding.
Limitations
- Sequential pretraining induces forgetting of older knowledge in favor of recent facts, potentially compromising long-term knowledge retention.
- Experiments are limited to 6B-parameter models; scalability and effectiveness on larger models remain to be validated.
- KairosQA primarily covers sports and awards domains, limiting the diversity of temporal fact types evaluated.
Future Work
Future research directions include integrating sequential pretraining with continual learning techniques to mitigate knowledge forgetting and enhance adaptability to evolving information. Expanding KairosQA to cover more domains and languages will improve evaluation breadth. Additionally, exploring temporal curriculum strategies on larger models and longer time horizons will further advance the development of temporally aware LLMs suited for dynamic real-world applications.
AI Executive Summary
Large language models (LLMs) have revolutionized natural language processing by leveraging vast amounts of text data during pretraining. However, a persistent challenge is that these models' knowledge becomes effectively frozen at the time their training data is collected, limiting their ability to answer questions about recent events or evolving facts. Traditional pretraining pipelines typically shuffle data randomly, disregarding temporal information, which may hinder models from accurately capturing the temporal dynamics of knowledge.
In this study, the authors propose a novel approach that explicitly incorporates temporal ordering into the pretraining process. They train 6-billion-parameter Transformer decoder models on Common Crawl snapshots arranged chronologically from 2018 to 2025, contrasting this with conventional shuffled pretraining on the same data volume. To rigorously evaluate temporal knowledge acquisition, they introduce KairosQA, a new benchmark comprising over 7,000 temporally grounded question-answer pairs derived from Wikidata, designed to test whether models correctly associate facts with their respective time periods.
The core technical insight is that sequential pretraining enables models to develop a recency bias, focusing learning on the most recent data and thereby improving the freshness and temporal precision of their factual knowledge. The authors employ both cloze-style and generative QA tasks, alongside established benchmarks like OLMES and TAQA, to comprehensively assess model performance. They also analyze intermediate checkpoints to understand how temporal knowledge evolves and how forgetting of older facts occurs.
Experimental results reveal that sequentially pretrained models match shuffled baselines on general language understanding tasks but significantly outperform them on temporally sensitive questions, achieving approximately 15% higher F1 scores on KairosQA for recent years. The shuffled models tend to peak on older data, likely due to repeated exposure, while sequential models maintain up-to-date knowledge aligned with their training cutoff. This demonstrates the effectiveness of temporal curriculum learning in enhancing LLMs' temporal alignment.
The broader impact of this work lies in its potential to improve the reliability and applicability of LLMs in dynamic, real-world settings where knowledge evolves rapidly. By releasing the KairosQA dataset, code, and pretrained checkpoints, the authors provide valuable tools for the community to further explore continual learning and temporal knowledge integration. Nonetheless, challenges remain, including mitigating forgetting of older knowledge and scaling the approach to larger models and more diverse domains.
In summary, this research marks a significant step toward temporally aware language models, highlighting the critical role of data temporality in pretraining and offering a promising pathway for developing LLMs that better reflect the evolving nature of human knowledge.
Deep Analysis
Background
Large language models (LLMs) have emerged as foundational tools in natural language processing, demonstrating strong capabilities in text generation, comprehension, and reasoning. Models such as GPT and LLaMA rely on pretraining over massive corpora, often sourced from internet-scale datasets like Common Crawl. While these models excel at many tasks, a notable limitation is their static knowledge base, which is frozen at the time of training data collection. This restricts their ability to respond accurately to queries about recent events or evolving facts. Prior research has explored continual learning and fine-tuning to update model knowledge post hoc, but the influence of pretraining data temporality, specifically the order in which data is presented, remains underexplored. Understanding how temporal data ordering affects knowledge acquisition is crucial for developing models that better track the dynamic nature of real-world information.
Core Problem
The core problem addressed is the temporal misalignment between LLMs' internal knowledge and the real-world timeline of facts. Conventional pretraining shuffles data randomly, mixing information from different time periods without regard to chronology. This leads to models that disproportionately memorize older, frequently repeated facts and underperform on recent knowledge, even when recent data is available. The bottlenecks include: 1) lack of mechanisms for models to associate facts with their correct time frames; 2) ineffective prioritization of recent information during training; 3) absence of robust benchmarks to evaluate temporal factual knowledge. Addressing these issues is vital for improving LLMs' relevance and accuracy in time-sensitive applications.
Innovation
This work introduces several key innovations: 1) a temporal curriculum learning approach that feeds pretraining data in strict chronological order, simulating natural knowledge acquisition over time; 2) the creation of KairosQA, a temporally annotated QA dataset with over 7,000 question-answer pairs derived from Wikidata, focusing on facts that change over time; 3) a dual evaluation protocol combining cloze-style and generative QA tasks to measure temporal knowledge alignment precisely; 4) a staged cooldown learning rate schedule to ensure stable convergence during sequential training; 5) comprehensive analysis of intermediate checkpoints to track temporal knowledge acquisition and forgetting dynamics. These innovations collectively provide new insights into how data temporality shapes LLM knowledge.
Methodology
- Data Collection and Filtering: Gather Common Crawl snapshots from 2018 to 2025, applying multi-stage filtering including character length thresholds, fastText language identification (retaining 24 European languages), Bloom filter-based deduplication, domain-weighted quality scoring, and repetition rate controls.
- Model Architecture: Utilize a 6B-parameter Transformer decoder with 32 layers, 32 attention heads, hidden size 4096, incorporating Grouped-Query Attention (4 key-value heads), Rotary Positional Embeddings (RoPE), and SwiGLU activations.
- Training Regimes:
  - Baseline: Randomly shuffled data from 2020-2024, totaling 2.5 trillion tokens.
  - Sequential: Strict chronological order from 2018-2025, with annual data segments (~315B tokens each), also totaling 2.5 trillion tokens.
  - Optimization via AdamW with Warmup-Stable-Decay scheduler, peak learning rate 10^-3.
  - Post-training cooldown: Branching off main run for 30k steps with cosine decay to 10^-4 learning rate.
- Evaluation Datasets:
  - KairosQA: 7,167 temporally grounded QA pairs, filtered for popularity and temporal variation.
  - OLMES: General language understanding benchmark.
  - TAQA: Existing temporal QA dataset for supplementary evaluation.
- Evaluation Protocols:
  - Cloze formulation (masked token prediction) and generative QA with normalized F1 scoring.
  - Multiple-choice format with distractors from neighboring years.
  - Temporal alignment assessed by measuring accuracy/F1 across evaluation years.
- Experimental Analysis:
  - Compare sequential and shuffled models at matched token counts.
  - Analyze intermediate checkpoints to observe temporal knowledge dynamics and forgetting.
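Of the filtering stages listed above, Bloom filter-based deduplication is the easiest to illustrate in isolation. The sketch below is a generic Bloom filter, not the authors' pipeline: the filter size, the number of hash functions, the SHA-256 double-hashing scheme, and the whitespace and case normalization are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with k hash positions per item (sizes are assumptions)."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def deduplicate(texts, bloom=None):
    """Keep the first occurrence of each text; drop texts the filter has probably seen."""
    if bloom is None:
        bloom = BloomFilter()
    kept = []
    for text in texts:
        key = " ".join(text.lower().split())  # assumption: light normalization
        if key in bloom:
            continue
        bloom.add(key)
        kept.append(text)
    return kept
```

A Bloom filter never misses a true duplicate but can occasionally flag a new document as already seen, so a small amount of unique text is traded for constant memory per document, which is what makes the approach practical at web scale.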
Experiments
The experiments involve training two sets of 6B-parameter Transformer decoder models: one on randomly shuffled Common Crawl data from 2020 to 2024, and the other on temporally ordered data from 2018 to 2025. Both training regimes process approximately 2.5 trillion tokens. Eight checkpoints are saved for each setup, corresponding to yearly data cutoffs. Evaluation uses the newly introduced KairosQA dataset, containing 7,167 temporally annotated question-answer pairs, alongside OLMES for general language tasks and TAQA for additional temporal QA evaluation. Metrics include cloze task accuracy, multiple-choice accuracy, and generative F1 scores. The authors also benchmark several open-source LLMs (e.g., LLaMA 3.1-8B, Gemma3, Olmo3, Qwen3) to contextualize their results. Ablation studies analyze the impact of training length and data ordering on temporal knowledge acquisition and forgetting.
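The multiple-choice metric relies on distractors drawn from neighboring years. One plausible way to build such items is sketched below. The input structure, a mapping from year to answer for a single time-varying fact, and the `build_multiple_choice` helper are assumptions for illustration and may differ from how KairosQA is actually constructed.

```python
import random


def build_multiple_choice(answers_by_year, target_year, num_distractors=3, seed=0):
    """Build options for one question whose answer changes over time.

    answers_by_year: dict mapping year -> answer for a single relation,
                     e.g. the winner of a yearly award (toy structure, assumed).
    Distractors are answers from the years closest to target_year that differ
    from the gold answer and from each other.
    """
    gold = answers_by_year[target_year]
    neighbours = sorted(
        (year for year in answers_by_year if year != target_year),
        key=lambda year: (abs(year - target_year), year),
    )
    distractors = []
    for year in neighbours:
        candidate = answers_by_year[year]
        if candidate != gold and candidate not in distractors:
            distractors.append(candidate)
        if len(distractors) == num_distractors:
            break
    options = [gold] + distractors
    random.Random(seed).shuffle(options)
    return {"options": options, "label": options.index(gold)}
```

Distractors built this way are all plausible answers to the same question at a different time, so a model cannot pick the right option from the entity type alone and has to know which answer belongs to which year.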
Results
Sequentially pretrained models demonstrate a clear advantage in temporal knowledge freshness, achieving approximately 15% higher F1 scores on KairosQA for recent years (2023-2024) compared to shuffled baselines. Both models perform similarly on the OLMES benchmark, confirming no degradation in general language understanding due to temporal ordering. Sequential models show a recency bias, with peak accuracy aligned to their training cutoff year, while shuffled models peak on older data, likely due to repeated exposure to historical facts. Generative QA results corroborate these findings, with sequential models producing more temporally precise answers. Open-source LLMs evaluated exhibit temporal decay in knowledge, aligning with shuffled baseline trends, highlighting the novelty and effectiveness of the sequential approach.
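The generative results are reported as normalized F1. The exact normalization is not described in this summary, so the sketch below assumes the common SQuAD-style recipe (lowercasing, stripping punctuation and English articles, then token-level overlap). The function names are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction, reference):
    """Token-level F1 between a generated answer and a reference answer."""
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Token-level F1 gives partial credit when a generated answer contains the reference plus extra words, which is one source of the scoring ambiguity that generative QA evaluation is prone to.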
Applications
The findings apply most directly to question answering systems that need to respond accurately about recent events. The sequential pretraining paradigm supports dynamic knowledge base maintenance by facilitating continuous integration of new information, and news summarization systems can benefit from improved freshness and factual accuracy. In the longer term, integrating sequential pretraining with continual learning frameworks could yield LLMs capable of lifelong learning and adaptation. Extending temporal knowledge modeling to domains such as healthcare and law could benefit applications that depend on up-to-date information.
Limitations & Outlook
The sequential pretraining approach, while enhancing recent knowledge, leads to forgetting of older facts, potentially reducing long-term knowledge completeness. Experiments are limited to 6B-parameter models; scalability to larger architectures remains untested. KairosQA's domain coverage is focused mainly on sports and awards, limiting generalizability. Generative QA answers can be ambiguous to score, which may add noise to the reported results. The computational cost of sequential training over large temporal spans is also substantial, posing practical challenges for widespread adoption.
Abstract
Large language models (LLMs) are typically trained on shuffled corpora, yielding models whose knowledge is frozen at train time and whose temporal grounding remains poorly understood. In this work, we study the impact of pre-training dynamics on the acquisition of time-sensitive factual knowledge, focusing specifically on data ordering. Our main contributions are twofold. First, we introduce a comprehensive benchmark of over 7,000 temporally grounded questions and an evaluation protocol that enables analysis of whether models correctly associate facts with their corresponding time periods. Second, we pretrain 6B-parameter models on temporally ordered Common Crawl snapshots and compare them against standard shuffled pre-training. Our results show that sequentially trained models match shuffled baselines on general language understanding and common knowledge while consistently exhibiting more up-to-date and temporally precise knowledge. Temporally ordered pre-training yields improved factual freshness, while shuffled pre-training peaks on older data, possibly due to increased factual repetition. These findings, along with the release of our code at https://github.com/kyutai-labs/kairos, checkpoints, and datasets at https://huggingface.co/collections/kyutai/kairos, provide a foundation for future research on continual learning for LLMs.