SPA: A Simple but Tough-to-Beat Baseline for Knowledge Injection

TL;DR

SPA uses carefully designed prompts to generate large-scale synthetic data for effective knowledge injection.

cs.LG · 2026-03-24
Kexian Tang · Jiani Wang · Shaowen Wang · Kaifeng Lyu
knowledge injection · large language models · prompt engineering · synthetic data · benchmarking

Key Findings

Methodology

SPA (Scaling Prompt-engineered Augmentation) uses a small set of carefully designed prompt templates to generate large-scale synthetic data for knowledge injection. Drawing on learning strategies from cognitive science and educational psychology, it defines seven prompt templates, including Concept Learning, Critical Thinking, and Generative Learning. By repeatedly prompting a large language model to rewrite the source content, SPA produces a large-scale synthetic corpus, which is then used to train the target model.
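
A minimal sketch of how this kind of prompt-engineered augmentation can be implemented in Python is shown below. The template wordings, the llm_generate placeholder, and the choice to show only two of the seven templates are illustrative assumptions rather than the paper's exact prompts.

    import random

    # Illustrative stand-ins for two of SPA's seven learning-strategy templates;
    # the exact wording here is an assumption, not the paper's prompt text.
    PROMPT_TEMPLATES = {
        "concept_learning": (
            "Read the passage below and explain its key concepts, giving examples "
            "and non-examples for each.\n\nPassage:\n{document}"
        ),
        "critical_thinking": (
            "Read the passage below and critically analyze its claims, the evidence "
            "behind them, and possible counterarguments.\n\nPassage:\n{document}"
        ),
    }

    def llm_generate(prompt: str) -> str:
        """Placeholder for a call to whichever large language model serves as the generator."""
        raise NotImplementedError

    def augment(document: str, num_samples: int) -> list[str]:
        """Repeatedly rewrite one source document, each time under a randomly chosen template."""
        synthetic = []
        for _ in range(num_samples):
            template = random.choice(list(PROMPT_TEMPLATES.values()))
            synthetic.append(llm_generate(template.format(document=document)))
        return synthetic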

Key Results

  • On the SQuAD dataset, SPA achieved 91.27% accuracy, surpassing Active Reading's 90.25% and SEAL's 74.23%. On the QuALITY dataset, SPA achieved 57.03% accuracy, outperforming EntiGraph's 56.22% and Active Reading's 51.13%. On the MultiHop-RAG dataset, SPA achieved 86.64% on Qwen2.5-7B and 88.36% on Meta-Llama-3-8B, surpassing all baselines.
  • SPA demonstrates consistent advantages across different generation models and adapted model families, indicating its broad applicability. When using a weaker generator like gpt-oss-120b, SPA still outperforms baseline methods on the QuALITY dataset.
  • SPA's performance keeps improving as the synthetic corpus is progressively scaled, ultimately achieving the best results across all benchmarks and demonstrating strong potential for large-scale data generation.

Significance

SPA is significant for knowledge injection because it pairs simple prompt design with large-scale synthetic data generation, addressing the shortcomings of earlier methods that were developed for small-scale data. It not only outperforms more complex methods but also offers a cost-effective and efficient route to injecting knowledge. The results suggest that careful prompt design combined with large-scale data generation can be surprisingly effective, providing a strong baseline for future research.

Technical Contribution

SPA's technical contributions lie in its simple yet effective prompt design and large-scale data generation strategy. Compared to existing reinforcement learning and multi-stage prompting methods, SPA generates high-quality synthetic data through a single-stage prompt without relying on downstream task supervision. Its innovation lies in utilizing human learning strategies to design prompt templates, resulting in more diverse and comprehensive generated data, thereby enhancing the effectiveness of knowledge injection.

Novelty

SPA's novelty lies in its simple prompt design and large-scale data generation strategy, distinguishing it from complex multi-stage prompting and reinforcement learning methods. By integrating learning strategies from cognitive science, SPA achieves efficient knowledge injection without relying on downstream tasks, making it a novel approach in this field.

Limitations

  • SPA slightly underperforms SEAL on small-scale datasets, possibly due to insufficient diversity in small-scale data generation.
  • While SPA excels in large-scale data generation, its flexibility in generator selection may be limited, especially when using weaker generators.
  • SPA relies on the quality of prompt templates, which may require adjustment for specific knowledge injection needs in different domains.

Future Work

Future research can explore further optimization of SPA's prompt design to enhance its applicability across different domains. Additionally, researchers can attempt to integrate other data generation techniques to further improve SPA's performance and efficiency. Exploring SPA's application in real-time knowledge updates and dynamic knowledge injection scenarios is also a promising direction.

AI Executive Summary

In the field of large language models (LLMs), knowledge injection has been a persistent challenge, particularly in specialized domains with scarce data. Existing methods, such as reinforcement learning and multi-stage prompting, although effective on small-scale data, often face issues like diversity collapse and diminishing returns when scaling up.

To address these challenges, researchers have proposed a new method called SPA (Scaling Prompt-engineered Augmentation). SPA generates large-scale synthetic data for knowledge injection using carefully designed prompt templates. This method is based on learning strategies from cognitive science and educational psychology, designing seven prompt templates including Concept Learning, Critical Thinking, and Generative Learning.

The core technical principle of SPA lies in its simple prompt design and large-scale data generation strategy, achieving efficient knowledge injection. Unlike complex multi-stage prompting and reinforcement learning methods, SPA generates high-quality synthetic data through a single-stage prompt without relying on downstream task supervision.

In experiments, SPA demonstrated superior performance across multiple benchmarks, including SQuAD, QuALITY, and MultiHop-RAG, surpassing various complex methods such as SEAL and Active Reading. Notably, in large-scale data generation, SPA continues to improve performance by progressively scaling the synthetic corpus, ultimately achieving the best performance across all benchmarks.

The significance of SPA lies in providing a simple yet effective pathway for knowledge injection, addressing the shortcomings of previous methods in small-scale data scenarios. The results suggest that careful prompt design combined with large-scale data generation can yield unexpectedly effective results in knowledge injection tasks, offering a strong baseline for future research.

While SPA excels in large-scale data generation, its flexibility in generator selection may be limited, especially when using weaker generators. Future research can explore further optimization of SPA's prompt design to enhance its applicability across different domains.

Deep Analysis

Background

Large language models (LLMs) have made significant strides in the field of natural language processing, capable of learning broad world knowledge and general capabilities from massive web text. However, in specialized domains, particularly those with scarce data, the knowledge coverage of LLMs remains incomplete. To address this gap, researchers have attempted to enhance models' domain knowledge through knowledge injection. Knowledge injection typically involves further fine-tuning or continual pretraining using domain-specific data. However, these domain-specific datasets are often limited in scale and diversity, and directly fine-tuning LLMs on such sparse data often leads to overfitting to specific surface forms rather than robust knowledge acquisition.

Core Problem

In knowledge injection, existing methods face two major challenges. First, reinforcement learning-based methods, while improving token efficiency in small-scale data generation, often suffer from diversity collapse as data scales, leading to diminishing returns. Second, multi-stage prompting methods, although superior to simple augmentation in some cases, may lose their advantages once simpler prompts are carefully tuned. These issues limit the effectiveness of existing methods in large-scale data generation.

Innovation

The core innovations of the SPA method lie in its simple yet effective prompt design and large-scale data generation strategy. Specifically:

1. Prompt Design: SPA designs seven prompt templates based on learning strategies from cognitive science and educational psychology, including Concept Learning, Critical Thinking, and Generative Learning. These prompt templates help the generator produce more diverse and comprehensive synthetic data.

2. Large-scale Data Generation: By repeatedly prompting a large language model to rewrite the source content, SPA generates a large-scale synthetic corpus for knowledge injection.

3. Single-stage Prompting: Unlike complex multi-stage prompting methods, SPA generates high-quality synthetic data through a single-stage prompt, simplifying system complexity.

Methodology

The implementation of the SPA method involves the following key steps:

  • Prompt Design: Design seven prompt templates based on learning strategies from cognitive science and educational psychology, including Concept Learning, Critical Thinking, and Generative Learning.
  • Data Generation: Use the designed prompt templates to repeatedly prompt a large language model to rewrite the source content, generating a large-scale synthetic corpus (a minimal sketch follows this list).
  • Model Training: Train the target model on the generated synthetic corpus to enhance its domain knowledge.
  • Performance Evaluation: Evaluate SPA's performance on multiple benchmarks to verify its effectiveness in knowledge injection tasks.
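
A minimal sketch of the corpus-construction step, under the same assumptions as the earlier snippet (the augment helper is the one sketched above; the JSONL format and field name are illustrative choices, not prescribed by the paper):

    import json

    def build_corpus(documents: list[str], samples_per_doc: int, out_path: str) -> None:
        """Scale the synthetic corpus by rewriting each source document many times,
        then write it as JSONL for fine-tuning / continual pretraining."""
        with open(out_path, "w", encoding="utf-8") as f:
            for doc in documents:
                for text in augment(doc, samples_per_doc):  # augment() from the earlier sketch
                    f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")

    # Progressively scaling the corpus amounts to increasing samples_per_doc;
    # the target model (e.g., Qwen2.5-7B or Meta-Llama-3-8B) is then trained on out_path.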

Experiments

The experimental design includes the following aspects:

  • Datasets: Select SQuAD, QuALITY, and MultiHop-RAG as benchmark datasets.
  • Baseline Methods: Choose complex methods such as SEAL and Active Reading as comparison baselines.
  • Evaluation Metrics: Use accuracy as the main evaluation metric to assess SPA's performance on different datasets.
  • Hyperparameter Settings: Match the number of training tokens across all methods to ensure fair comparisons (see the sketch after this list).
  • Ablation Studies: Analyze how SPA's performance changes as the synthetic corpus is progressively scaled.
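
One way the training-token matching can be enforced is sketched below with the Hugging Face transformers tokenizer; the model name and the greedy truncation strategy are assumptions for illustration.

    from transformers import AutoTokenizer

    def truncate_to_budget(texts: list[str], budget: int,
                           model_name: str = "Qwen/Qwen2.5-7B") -> list[str]:
        """Keep only as many synthetic documents as fit within a fixed training-token
        budget, so that every method is trained on the same number of tokens."""
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        kept, used = [], 0
        for text in texts:
            n_tokens = len(tokenizer(text)["input_ids"])
            if used + n_tokens > budget:
                break
            kept.append(text)
            used += n_tokens
        return kept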

Results

In experiments, SPA demonstrated superior performance across all benchmarks, surpassing various complex methods such as SEAL and Active Reading. Specifically:

  • On the SQuAD dataset, SPA achieved 91.27% accuracy, surpassing Active Reading's 90.25% and SEAL's 74.23%.
  • On the QuALITY dataset, SPA achieved 57.03% accuracy, outperforming EntiGraph's 56.22% and Active Reading's 51.13%.
  • On the MultiHop-RAG dataset, SPA achieved 86.64% on Qwen2.5-7B and 88.36% on Meta-Llama-3-8B, surpassing all baselines.

Applications

The SPA method has broad application potential in various fields:

  • Domain-specific Knowledge Injection: SPA can be used for knowledge injection in specialized fields such as medicine, finance, and law, enhancing the model's domain knowledge coverage by generating large-scale synthetic data.
  • Data-scarce Scenarios: In data-scarce scenarios, SPA provides a cost-effective and efficient pathway for knowledge injection, helping models better understand and answer domain-related questions.
  • Real-time Knowledge Updates: SPA can be applied in real-time knowledge update scenarios, generating new synthetic data to help models quickly adapt to the latest domain knowledge.

Limitations & Outlook

While SPA excels in large-scale data generation, its flexibility in generator selection may be limited, especially when using weaker generators. Additionally, SPA relies on the quality of prompt templates, which may require adjustment for specific knowledge injection needs in different domains. Future research can explore further optimization of SPA's prompt design to enhance its applicability across different domains.

Plain Language (accessible to non-experts)

Imagine you have a massive library with all sorts of books, but some of them have very little content or are even blank. To make these books more complete, you decide to fill in the gaps using the knowledge you already have. This is essentially what the SPA method does. It uses cleverly designed prompts to guide a large language model to generate new content, much like you writing new chapters for those blank books in the library.

These prompts are like writing themes you set for yourself, such as 'explain this concept' or 'pose a question and answer it.' With these prompts, the model can generate a large amount of synthetic data, helping it become smarter in specific domains.

Just like in the library, you might find some books need more details, while others need a broader perspective. SPA ensures the model improves in all aspects by continuously adjusting the prompts to generate different types of content.

In the end, after all these efforts, the model becomes like a knowledgeable librarian, ready to provide accurate and rich information whenever needed.

ELI14 (explained like you're 14)

Hey there! Imagine you're playing a super cool game where the goal is to make your character a knowledge master. The problem is, some knowledge points in the game are incomplete, like some levels have blank maps.

So, you decide to use something called SPA to fix this. SPA is like a super smart assistant that gives you prompts like 'explain this concept' or 'come up with a question and answer it.' Then, your character uses these prompts to create new knowledge, just like drawing new paths on the blank maps.

As you keep using these prompts, your character gets smarter and can handle all sorts of challenges in the game. Just like in school, where you learn new things and get better at stuff!

So, SPA is this awesome tool that helps your character grow and become the ultimate knowledge master in the game! Isn't that cool?

Glossary

SPA (Scaling Prompt-engineered Augmentation)

A method that uses carefully designed prompt templates to generate large-scale synthetic data for knowledge injection.

SPA is used to generate synthetic data to enhance the domain knowledge of large language models.

LLM (Large Language Model)

A large-scale language model capable of learning broad world knowledge and general capabilities from massive web text.

LLMs are widely used in natural language processing tasks but have incomplete knowledge coverage in specialized domains.

Knowledge Injection

The process of injecting domain-specific knowledge into large language models through further fine-tuning or continual pretraining.

Knowledge injection is used to enhance models' knowledge coverage in data-scarce domains.

Prompt Engineering

The design of prompt templates used to guide language models in generating specific content.

SPA uses prompt engineering to generate large-scale synthetic data.

Synthetic Data

Simulated data generated by models to enhance the training dataset.

The large-scale synthetic data generated by SPA is used for knowledge injection.

Concept Learning

A learning strategy that requires learners to search for and test attributes to distinguish exemplars of a concept from non-exemplars.

One of SPA's prompt templates, helping the generator produce diverse data.

Critical Thinking

The process of systematically analyzing facts, evidence, observations, and arguments to arrive at well-reasoned conclusions.

One of SPA's prompt templates, encouraging the generator to produce in-depth understanding data.

Generative Learning

A strategy that requires learners to actively make sense of learning material so they can apply it to new situations.

One of SPA's prompt templates, promoting the generation of highly applicable data.

Reinforcement Learning

A machine learning method that trains models through reward signals to improve their performance on specific tasks.

An existing method for knowledge injection, but faces diversity collapse issues in large-scale data generation.

Multi-stage Prompting

A prompting pipeline that transforms original corpus into final synthetic data through several intermediate steps.

An existing approach to knowledge injection, but its advantages may disappear after careful tuning of simpler prompts.

Open Questions (unanswered questions from this research)

  1. How can SPA's prompt design be further optimized to enhance its applicability across different domains? While the existing prompt templates are effective, they may require adjustments to suit specific knowledge injection needs in different domains.
  2. SPA's flexibility in generator selection may be limited, especially when using weaker generators. How SPA's performance can be improved across different generators is a question worth investigating.
  3. SPA slightly underperforms SEAL on small-scale datasets, possibly due to insufficient diversity in small-scale data generation. How to improve SPA's diversity on small-scale datasets is a challenge that needs to be addressed.
  4. The current SPA method relies heavily on the quality of its prompt templates. How to design higher-quality prompt templates to further enhance SPA's performance is a direction worth exploring.
  5. SPA's potential for real-time knowledge updates and dynamic knowledge injection has not been fully explored. How to apply SPA effectively in these scenarios is an important direction for future research.

Applications

Immediate Applications

Domain-specific Knowledge Injection

SPA can be used for knowledge injection in specialized fields such as medicine, finance, and law, enhancing the model's domain knowledge coverage by generating large-scale synthetic data.

Data-scarce Scenarios

In data-scarce scenarios, SPA provides a cost-effective and efficient pathway for knowledge injection, helping models better understand and answer domain-related questions.

Real-time Knowledge Updates

SPA can be applied in real-time knowledge update scenarios, generating new synthetic data to help models quickly adapt to the latest domain knowledge.

Long-term Vision

Dynamic Knowledge Injection

SPA's application potential in dynamic knowledge injection scenarios has not been fully explored. Future research can optimize prompt design to achieve real-time knowledge updates and injection.

Cross-domain Knowledge Transfer

The synthetic data generated by SPA can enable cross-domain knowledge transfer, helping models quickly adapt and apply in different fields.

Abstract

While large language models (LLMs) are pretrained on massive amounts of data, their knowledge coverage remains incomplete in specialized, data-scarce domains, motivating extensive efforts to study synthetic data generation for knowledge injection. We propose SPA (Scaling Prompt-engineered Augmentation), a simple but tough-to-beat baseline that uses a small set of carefully designed prompts to generate large-scale synthetic data for knowledge injection. Through systematic comparisons, we find that SPA outperforms several strong baselines. Furthermore, we identify two key limitations of prior approaches: (1) while RL-based methods may improve the token efficiency of LLM-based data augmentation at small scale, they suffer from diversity collapse as data scales, leading to diminishing returns; and (2) while multi-stage prompting may outperform simple augmentation methods, their advantages can disappear after careful prompt tuning. Our results suggest that, for knowledge injection, careful prompt design combined with straightforward large-scale augmentation can be surprisingly effective, and we hope SPA can serve as a strong baseline for future studies in this area. Our code is available at https://github.com/Tangkexian/SPA.

cs.LG cs.AI cs.CL