CLASP: Defending Hybrid Large Language Models Against Hidden State Poisoning Attacks

TL;DR

The CLASP model detects malicious tokens with an XGBoost classifier over Mamba block output embeddings, achieving a 95.9% token-level F1 score.

cs.CL · Advanced · 2026-03-13
Alexandre Le Mercier, Thomas Demeester, Chris Develder
State Space Model, Hidden State Poisoning Attack, XGBoost, Resume Screening, Security Defense

Key Findings

Methodology

The paper introduces the CLASP model, framing the mitigation of Hidden State Poisoning Attacks (HiSPA) as a token-level binary classification problem. By leveraging Mamba's Block Output Embeddings (BOE) features and combining them with an XGBoost classifier, CLASP effectively identifies malicious tokens with minimal computational overhead.

Key Results

  • CLASP was evaluated on a corpus of 2,483 resumes totaling 9.5M tokens, achieving a token-level F1 score of 95.9% and a document-level F1 score of 99.3% in detecting malicious tokens.
  • Under leave-one-out cross-validation, CLASP maintained a high document-level F1 score of 96.9% even for unseen attack patterns.
  • In clustered cross-validation with structurally novel triggers, CLASP sustained a useful detection capability with an average document-level F1 score of 91.6%.
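The two reported granularities (token-level vs. document-level F1) can be reproduced on toy labels; the values below are invented, and only the metric definitions are the point:

```python
from sklearn.metrics import f1_score

# Per-token ground truth / predictions for three short "documents"
# (1 = malicious token, 0 = benign).
y_true = [[0, 0, 1, 1], [0, 0, 0], [1, 0, 0]]
y_pred = [[0, 0, 1, 0], [0, 0, 0], [1, 0, 0]]

# Token-level F1: pool every token across all documents.
flat_true = [t for doc in y_true for t in doc]
flat_pred = [t for doc in y_pred for t in doc]
token_f1 = f1_score(flat_true, flat_pred)

# Document-level F1: a document is positive if it contains any malicious token.
doc_true = [int(any(doc)) for doc in y_true]
doc_pred = [int(any(doc)) for doc in y_pred]
doc_f1 = f1_score(doc_true, doc_pred)
```

Note that document-level F1 can stay high even when some individual malicious tokens are missed, which matches the gap between the 95.9% and 99.3% figures.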

Significance

This study addresses the vulnerability of State Space Models (SSMs) and their hybrid variants to Hidden State Poisoning Attacks (HiSPA) by introducing the CLASP model. The model combines strong detection performance with efficiency and operates independently of downstream models, making it suitable for real-world deployment as a lightweight front-line defense tool.

Technical Contribution

The technical contribution of CLASP lies in its innovative use of Mamba's Block Output Embeddings (BOE) features combined with an XGBoost classifier to efficiently detect malicious tokens. This approach differs from existing Transformer-based defense strategies by focusing on SSM-specific vulnerabilities, offering new engineering possibilities.

Novelty

CLASP is the first dedicated defense model against Hidden State Poisoning Attacks (HiSPA). Unlike previous studies, it not only identifies malicious tokens but also maintains high detection capability for unseen attack patterns, filling a gap in the existing literature.

Limitations

  • CLASP's performance declines when handling structurally novel triggers, particularly in the third fold of clustered cross-validation where the F1 score drops to 82.17%.
  • The model assumes that all possible trigger patterns are represented in the training set, which may not hold true in practical applications.
  • Because of its time-invariance constraint, CLASP cannot use contextual information to disambiguate tokens, which limits its token-level performance.

Future Work

Future research directions include: 1) Improving CLASP's performance in handling structurally novel triggers; 2) Exploring broader injection attack defense strategies; 3) Investigating the model's transferability to other recurrent architectures and developing systematic frameworks for evaluating model security before large-scale deployment.

AI Executive Summary

In modern language-model applications, Hidden State Poisoning Attacks (HiSPA) pose an emerging threat, especially to State Space Models (SSMs) and their hybrid variants. Existing defense strategies primarily target Transformer-based models, while SSMs exhibit unique vulnerabilities when facing HiSPA. To address this issue, the paper introduces the CLASP model, framing the mitigation of HiSPA as a token-level binary classification problem. By leveraging Mamba's Block Output Embeddings (BOE) features and combining them with an XGBoost classifier, CLASP effectively detects and intercepts potential attacks with minimal computational overhead.

In experiments, CLASP was evaluated on a corpus of 2,483 resumes totaling 9.5M tokens, achieving a token-level F1 score of 95.9% and a document-level F1 score of 99.3% in detecting malicious tokens. Under leave-one-out cross-validation, CLASP maintained a high document-level F1 score of 96.9% even for unseen attack patterns. In clustered cross-validation with structurally novel triggers, CLASP sustained a useful detection capability with an average document-level F1 score of 91.6%. This indicates that CLASP not only performs well for known attack patterns but also maintains high detection capability for unseen attack patterns.

The technical contribution of CLASP lies in its innovative use of Mamba's Block Output Embeddings (BOE) features combined with an XGBoost classifier to efficiently detect malicious tokens. This approach differs from existing Transformer-based defense strategies by focusing on SSM-specific vulnerabilities, offering new engineering possibilities. The independence of CLASP allows it to operate without affecting downstream models, making it suitable for real-world deployment as a lightweight front-line defense tool.

However, CLASP's performance declines on structurally novel triggers; in the third fold of clustered cross-validation, the F1 score drops to 82.17%. The model also assumes that all possible trigger patterns are represented in the training set, which may not hold in practice. Finally, because of its time-invariance constraint, CLASP cannot use contextual information to disambiguate tokens, which limits its token-level performance.

Future research directions include improving CLASP's performance in handling structurally novel triggers, exploring broader injection attack defense strategies, and investigating the model's transferability to other recurrent architectures. Through these efforts, CLASP is expected to play a greater role in the security defense of language models.

Deep Analysis

Background

In recent years, as large language models (LLMs) have been widely applied in document-centric workflows, injection attacks have become an increasingly serious security threat. Prompt Injection Attacks (PIAs), in particular, are considered one of the most critical practical threats. Existing defense strategies primarily focus on Transformer-based models, employing token-level detection and specialized fine-tuning strategies to resist PIAs. However, with the rise of State Space Models (SSMs) and their hybrid variants, Hidden State Poisoning Attacks (HiSPA) have emerged as a new challenge. SSMs, such as Mamba, achieve performance comparable to Transformers with linear complexity, but their unique recurrent dynamics make them vulnerable to HiSPA. HiSPA corrupts the hidden state of SSMs through malicious tokens, leading to irreversible memory damage, severely impacting model performance and reliability.

Core Problem

Hidden State Poisoning Attacks (HiSPA) pose an emerging threat, especially to State Space Models (SSMs) and their hybrid variants. HiSPA corrupts the hidden state of SSMs through malicious tokens, leading to irreversible memory damage, severely impacting model performance and reliability. Existing defense strategies primarily target Transformer-based models, while SSMs exhibit unique vulnerabilities when facing HiSPA. Effectively detecting and intercepting HiSPA has become an urgent problem to solve.
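The hidden-state corruption can be pictured with a toy linear recurrence. This is a deliberately simplified stand-in for an SSM state update, with invented values, not the actual Mamba dynamics:

```python
# Toy scalar SSM-style recurrence: h_t = a * h_{t-1} + b * x_t.
# A single extreme input can dominate the state and drown out all prior
# context -- a crude picture of hidden-state poisoning.
a, b = 0.9, 1.0
benign = [0.5, -0.3, 0.8, 0.1]
poisoned = benign[:2] + [1e6] + benign[2:]  # one adversarial "token" injected

def run(xs):
    h = 0.0
    for x in xs:
        h = a * h + b * x
    return h

h_clean = run(benign)        # stays small, reflects all inputs
h_poisoned = run(poisoned)   # dominated by the injected token
```

Because the state is carried forward recurrently, the injected value keeps contaminating every subsequent step; there is no later attention pass that could route around it, which is why the damage is described as irreversible.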

Innovation

The innovation of the CLASP model lies in framing the mitigation of Hidden State Poisoning Attacks (HiSPA) as a token-level binary classification problem. By leveraging Mamba's Block Output Embeddings (BOE) features and combining them with an XGBoost classifier, CLASP effectively detects and intercepts potential attacks with minimal computational overhead. Unlike existing Transformer-based defense strategies, CLASP focuses on SSM-specific vulnerabilities, offering new engineering possibilities. The independence of CLASP allows it to operate without affecting downstream models, making it suitable for real-world deployment as a lightweight front-line defense tool.

Methodology

  • CLASP frames the mitigation of Hidden State Poisoning Attacks (HiSPA) as a token-level binary classification problem.
  • It uses Mamba's Block Output Embeddings (BOE) as features, combined with an XGBoost classifier, to identify malicious tokens.
  • It detects and intercepts potential attacks with minimal computational overhead.
  • Evaluated on a corpus of 2,483 resumes totaling 9.5M tokens, CLASP achieves a token-level F1 score of 95.9% and a document-level F1 score of 99.3%.
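Extracting per-token block outputs as features might be done with a forward hook, as in the sketch below. `ToyBlock` and the layer wiring are placeholders, not the actual Mamba implementation or its module names:

```python
import torch
from torch import nn

# Illustrative only: capture per-token outputs of an intermediate block via a
# forward hook. ToyBlock stands in for a Mamba block.
class ToyBlock(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.proj = nn.Linear(d, d)

    def forward(self, x):
        return torch.tanh(self.proj(x))

d = 16
model = nn.Sequential(ToyBlock(d), ToyBlock(d))
captured = {}

def hook(module, inputs, output):
    captured["boe"] = output.detach()  # (batch, seq, d) block outputs

model[0].register_forward_hook(hook)
tokens = torch.randn(1, 8, d)          # a batch of 8 "token" embeddings
_ = model(tokens)
features = captured["boe"].squeeze(0)  # one feature vector per token
```

The resulting per-token vectors are what a downstream classifier (XGBoost in CLASP's case) would consume.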

Experiments

The experimental design includes evaluation on a corpus of 2,483 resumes totaling 9.5M tokens. Controlled injection is used to assess CLASP's performance in detecting malicious tokens. Leave-one-out cross-validation and clustered cross-validation are employed to test CLASP's generalization capability for unseen attack patterns. The results demonstrate that CLASP performs well for known attack patterns and maintains high detection capability for unseen attack patterns.
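The generalization protocol described above can be sketched as grouped cross-validation, where each held-out group is one attack/trigger family. The data, group labels, and the logistic-regression stand-in classifier below are all invented for illustration:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))         # stand-in token features
y = rng.integers(0, 2, size=120)      # stand-in malicious/benign labels
groups = np.repeat([0, 1, 2], 40)     # three trigger families

# Train on two families, test on the held-out one -- so test-time
# triggers are never seen during training.
scores = []
for train, test in LeaveOneGroupOut().split(X, y, groups):
    clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    scores.append(f1_score(y[test], clf.predict(X[test]), zero_division=0))
mean_f1 = float(np.mean(scores))
```

Clustering the triggers by structural similarity before grouping makes the held-out fold genuinely novel, which is the harder of the two settings reported.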

Results

The results show that CLASP was evaluated on a corpus of 2,483 resumes totaling 9.5M tokens, achieving a token-level F1 score of 95.9% and a document-level F1 score of 99.3% in detecting malicious tokens. Under leave-one-out cross-validation, CLASP maintained a high document-level F1 score of 96.9% even for unseen attack patterns. In clustered cross-validation with structurally novel triggers, CLASP sustained a useful detection capability with an average document-level F1 score of 91.6%.

Applications

The CLASP model is applicable in scenarios requiring defense against Hidden State Poisoning Attacks (HiSPA), such as resume screening, compliance checks, and customer support in document-centric workflows. Due to its independence and efficiency, CLASP can serve as a lightweight front-line defense tool, protecting systems based on State Space Models (SSMs) from potential attacks.

Limitations & Outlook

CLASP's performance declines when handling structurally novel triggers, particularly in the third fold of clustered cross-validation where the F1 score drops to 82.17%. Additionally, the model assumes that all possible trigger patterns are represented in the training set, which may not hold true in practical applications. Due to the time-invariance constraint, CLASP's performance at the token level is limited, unable to leverage contextual information to distinguish ambiguous tokens. Future research directions include improving CLASP's performance in handling structurally novel triggers, exploring broader injection attack defense strategies, and investigating the model's transferability to other recurrent architectures.

Plain Language (accessible to non-experts)

Imagine you're cooking in a kitchen. You have a big pot filled with various ingredients. Each ingredient has its own flavor, just like each token has its own information. During the process, someone secretly adds some strange spices to the pot, which change the entire dish's flavor, making it taste bad. This is like a Hidden State Poisoning Attack (HiSPA), where malicious tokens alter the model's memory, causing it to make wrong decisions. The CLASP model is like a smart chef who can detect these strange spices before they are added and remove them, ensuring your dish isn't ruined. CLASP analyzes the characteristics of each ingredient, identifying those that might be problematic, and removes them before they affect the entire dish. This way, even if someone tries to ruin your dish, CLASP helps you keep it delicious. This method is not only effective but also doesn't add extra workload, like a helpful kitchen assistant aiding you in making a tasty meal.

ELI14 (explained like you're 14)

Hey there, friends! Did you know that sometimes computers can get tricked by pranksters? It's like when you're playing a game, and someone secretly changes the rules so you always lose. In the computer's brain, some bad guys use weird codes to trick it into making wrong decisions. This kind of bad guy attack is called a Hidden State Poisoning Attack (HiSPA).

But don't worry! Scientists have invented a super tool called CLASP. It's like a smart detective that can spot the pranksters' tricks before they mess things up. CLASP carefully checks every line of code, finds the parts that look suspicious, and removes them so the computer doesn't get tricked!

Imagine you're doing a science experiment at school, and someone secretly adds weird stuff to your test tube, making the results strange. CLASP is like your good friend who helps you check the test tube before the experiment starts, making sure there's nothing weird inside. That way, your experiment won't get ruined!

So next time you hear someone talk about CLASP, you'll know it's a superhero in the computer world, protecting our computers from getting tricked by bad guys!

Glossary

State Space Model

An efficient model alternative to Transformers, featuring linear complexity, suitable for long-sequence processing.

Used as an alternative to Transformers for improved efficiency.

Hidden State Poisoning Attack

Corrupts the hidden state of SSMs through malicious tokens, leading to irreversible memory damage.

Attacks SSMs, affecting model performance.

Block Output Embedding

Output features of the Mamba model used to identify characteristics of malicious tokens.

Used by CLASP to detect malicious tokens.

XGBoost

An efficient gradient boosting decision tree algorithm used for classification tasks.

Used by CLASP for malicious token classification.

Resume Screening

The process of using LLMs to screen resumes to identify the best candidates.

CLASP is evaluated in the resume screening scenario.

Leave-One-Out Cross-Validation

A validation method where one sample is used as the test set, and the rest as the training set.

Used to evaluate CLASP's performance on unseen attack patterns.

Clustered Cross-Validation

A validation method that divides data into clusters of structurally similar triggers to test generalization capability.

Used to test CLASP's performance on structurally novel triggers.

Token-Level F1 Score

Measures the harmonic mean of precision and recall at the token level for classification models.

CLASP's performance metric in detecting malicious tokens.

Document-Level F1 Score

Measures the harmonic mean of precision and recall at the document level for classification models.

CLASP's performance metric in detecting malicious documents.

Time-Invariance Constraint

A limitation where CLASP does not use contextual information for token-level detection.

Limits CLASP's token-level performance.

Open Questions (unanswered questions from this research)

  1. How can CLASP's performance be improved when handling structurally novel triggers? The existing model's performance declines on unseen attack patterns, particularly in the third fold of clustered cross-validation, where the F1 score drops to 82.17%. New methods are needed to enhance the model's generalization capability.
  2. CLASP assumes that all possible trigger patterns are represented in the training set, but this assumption may not hold in practice. How can the model's applicability be extended without increasing computational overhead?
  3. The time-invariance constraint limits CLASP's performance at the token level. How can contextual information be leveraged to improve detection accuracy while maintaining model efficiency?
  4. Existing defense strategies primarily target Transformer-based models, while SSMs exhibit unique vulnerabilities to HiSPA. How can dedicated defense strategies be developed specifically for SSMs?
  5. CLASP performs well in the resume screening scenario, but its applicability to other document-centric workflows has not been validated. How can CLASP be evaluated and extended across different application scenarios?

Applications

Immediate Applications

Resume Screening

CLASP can be used in corporate HR departments to detect and intercept potential malicious injection attacks during resume screening, ensuring the accuracy and fairness of the screening results.

Compliance Checks

In compliance checks, CLASP can serve as a front-line defense tool, protecting document processing systems from Hidden State Poisoning Attacks, ensuring the reliability of compliance reviews.

Customer Support

CLASP can be used in customer support systems to detect and intercept potential malicious injection attacks, protecting system stability and customer data security.

Long-term Vision

Cross-Domain Applications

As CLASP is successfully applied in different document-centric workflows, its methods can be extended to other domains such as finance and healthcare, providing broader security protection.

Security Protection for Recurrent Architectures

In the future, CLASP's technology can be extended to other recurrent architectures, providing security protection for a wider range of models, advancing the field of model security.

Abstract

State space models (SSMs) like Mamba have gained significant traction as efficient alternatives to Transformers, achieving linear complexity while maintaining competitive performance. However, Hidden State Poisoning Attacks (HiSPAs), a recently discovered vulnerability that corrupts SSM memory through adversarial strings, pose a critical threat to these architectures and their hybrid variants. Framing the HiSPA mitigation task as a binary classification problem at the token level, we introduce the CLASP model to defend against this threat. CLASP exploits distinct patterns in Mamba's block output embeddings (BOEs) and uses an XGBoost classifier to identify malicious tokens with minimal computational overhead. We consider a realistic scenario in which both SSMs and HiSPAs are likely to be used: an LLM screening résumés to identify the best candidates for a role. Evaluated on a corpus of 2,483 résumés totaling 9.5M tokens with controlled injections, CLASP achieves 95.9% token-level F1 score and 99.3% document-level F1 score on malicious tokens detection. Crucially, the model generalizes to unseen attack patterns: under leave-one-out cross-validation, performance remains high (96.9% document-level F1), while under clustered cross-validation with structurally novel triggers, it maintains useful detection capability (91.6% average document-level F1). Operating independently of any downstream model, CLASP processes 1,032 tokens per second with under 4GB VRAM consumption, potentially making it suitable for real-world deployment as a lightweight front-line defense for SSM-based and hybrid architectures. All code and detailed results are available at https://anonymous.4open.science/r/hispikes-91C0.

