CLASP: Defending Hybrid Large Language Models Against Hidden State Poisoning Attacks
The CLASP model detects malicious tokens using an XGBoost classifier, achieving a 95.9% token-level F1 score.
Key Findings
Methodology
The paper introduces the CLASP model, framing the mitigation of Hidden State Poisoning Attacks (HiSPA) as a token-level binary classification problem. By leveraging Mamba's Block Output Embeddings (BOE) features and combining them with an XGBoost classifier, CLASP effectively identifies malicious tokens with minimal computational overhead.
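The token-level framing can be sketched as follows. This is a minimal illustration only: the feature values are invented, and a trivial threshold rule stands in for the paper's actual XGBoost classifier over Mamba BOE features.

```python
# Hypothetical sketch of CLASP's framing as token-level binary
# classification. Each token carries a feature vector (in the paper,
# Mamba Block Output Embeddings); a trivial threshold rule stands in
# for the paper's XGBoost classifier.

def classify_tokens(boe_features, threshold=0.5):
    """Return one 0/1 label per token: 1 = flagged as malicious."""
    # Illustrative decision rule: flag tokens whose first feature
    # component exceeds the threshold.
    return [1 if feats[0] > threshold else 0 for feats in boe_features]

# Toy document: four tokens with 2-dimensional "embeddings".
features = [[0.1, 0.3], [0.9, 0.2], [0.2, 0.8], [0.05, 0.1]]
print(classify_tokens(features))  # [0, 1, 0, 0]
```

In the real system, the per-token features would come from a forward pass through Mamba blocks, and the classifier would be a trained gradient-boosted tree ensemble rather than a fixed threshold.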
Key Results
- CLASP was evaluated on a corpus of 2,483 resumes totaling 9.5M tokens, achieving a token-level F1 score of 95.9% and a document-level F1 score of 99.3% in detecting malicious tokens.
- Under leave-one-out cross-validation, CLASP maintained a high document-level F1 score of 96.9% even for unseen attack patterns.
- In clustered cross-validation with structurally novel triggers, CLASP sustained a useful detection capability with an average document-level F1 score of 91.6%.
Significance
This study addresses the vulnerability of State Space Models (SSMs) and their hybrid variants to Hidden State Poisoning Attacks (HiSPA) by introducing the CLASP model. The model not only demonstrates excellent detection efficiency but also operates independently of downstream models, making it suitable for real-world deployment as a lightweight front-line defense tool.
Technical Contribution
The technical contribution of CLASP lies in its innovative use of Mamba's Block Output Embeddings (BOE) features combined with an XGBoost classifier to efficiently detect malicious tokens. This approach differs from existing Transformer-based defense strategies by focusing on SSM-specific vulnerabilities, offering new engineering possibilities.
Novelty
CLASP is the first dedicated defense model against Hidden State Poisoning Attacks (HiSPA). Unlike previous studies, it not only identifies malicious tokens but also maintains high detection capability for unseen attack patterns, filling a gap in the existing literature.
Limitations
- CLASP's performance declines when handling structurally novel triggers, particularly in the third fold of clustered cross-validation where the F1 score drops to 82.17%.
- The model assumes that all possible trigger patterns are represented in the training set, which may not hold true in practical applications.
- Due to its time-invariance constraint, CLASP's token-level performance is limited: it cannot leverage contextual information to distinguish ambiguous tokens.
Future Work
Future research directions include: 1) Improving CLASP's performance in handling structurally novel triggers; 2) Exploring broader injection attack defense strategies; 3) Investigating the model's transferability to other recurrent architectures and developing systematic frameworks for evaluating model security before large-scale deployment.
AI Executive Summary
In the application of modern language models, Hidden State Poisoning Attacks (HiSPA) pose an emerging threat, especially to State Space Models (SSMs) and their hybrid variants. Existing defense strategies primarily target Transformer-based models, while SSMs exhibit unique vulnerabilities when facing HiSPA. To address this issue, the paper introduces the CLASP model, framing the mitigation of HiSPA as a token-level binary classification problem. By leveraging Mamba's Block Output Embeddings (BOE) features and combining them with an XGBoost classifier, CLASP effectively detects and intercepts potential attacks with minimal computational overhead.
In experiments, CLASP was evaluated on a corpus of 2,483 resumes totaling 9.5M tokens, achieving a token-level F1 score of 95.9% and a document-level F1 score of 99.3% in detecting malicious tokens. Under leave-one-out cross-validation, CLASP maintained a high document-level F1 score of 96.9% even for unseen attack patterns. In clustered cross-validation with structurally novel triggers, CLASP sustained a useful detection capability with an average document-level F1 score of 91.6%. This indicates that CLASP not only performs well for known attack patterns but also maintains high detection capability for unseen attack patterns.
The technical contribution of CLASP lies in its innovative use of Mamba's Block Output Embeddings (BOE) features combined with an XGBoost classifier to efficiently detect malicious tokens. This approach differs from existing Transformer-based defense strategies by focusing on SSM-specific vulnerabilities, offering new engineering possibilities. The independence of CLASP allows it to operate without affecting downstream models, making it suitable for real-world deployment as a lightweight front-line defense tool.
However, CLASP's performance declines when handling structurally novel triggers, particularly in the third fold of clustered cross-validation, where the F1 score drops to 82.17%. Additionally, the model assumes that all possible trigger patterns are represented in the training set, which may not hold true in practical applications. Due to its time-invariance constraint, CLASP's token-level performance is limited: it cannot leverage contextual information to distinguish ambiguous tokens.
Future research directions include improving CLASP's performance in handling structurally novel triggers, exploring broader injection attack defense strategies, and investigating the model's transferability to other recurrent architectures. Through these efforts, CLASP is expected to play a greater role in the security defense of language models.
Deep Analysis
Background
In recent years, as large language models (LLMs) have been widely applied in document-centric workflows, injection attacks have become an increasingly serious security threat. Prompt Injection Attacks (PIAs), in particular, are considered one of the most critical practical threats. Existing defense strategies primarily focus on Transformer-based models, employing token-level detection and specialized fine-tuning strategies to resist PIAs. However, with the rise of State Space Models (SSMs) and their hybrid variants, Hidden State Poisoning Attacks (HiSPA) have emerged as a new challenge. SSMs, such as Mamba, achieve performance comparable to Transformers with linear complexity, but their unique recurrent dynamics make them vulnerable to HiSPA. HiSPA corrupts the hidden state of SSMs through malicious tokens, leading to irreversible memory damage, severely impacting model performance and reliability.
Core Problem
Hidden State Poisoning Attacks (HiSPA) pose an emerging threat, especially to State Space Models (SSMs) and their hybrid variants. HiSPA corrupts the hidden state of SSMs through malicious tokens, leading to irreversible memory damage, severely impacting model performance and reliability. Existing defense strategies primarily target Transformer-based models, while SSMs exhibit unique vulnerabilities when facing HiSPA. Effectively detecting and intercepting HiSPA has become an urgent problem to solve.
Innovation
The innovation of the CLASP model lies in framing the mitigation of Hidden State Poisoning Attacks (HiSPA) as a token-level binary classification problem. By leveraging Mamba's Block Output Embeddings (BOE) features and combining them with an XGBoost classifier, CLASP effectively detects and intercepts potential attacks with minimal computational overhead. Unlike existing Transformer-based defense strategies, CLASP focuses on SSM-specific vulnerabilities, offering new engineering possibilities. Because CLASP operates independently of downstream models, it is suitable for real-world deployment as a lightweight front-line defense tool.
Methodology
- The CLASP model frames the mitigation of Hidden State Poisoning Attacks (HiSPA) as a token-level binary classification problem.
- It uses Mamba's Block Output Embeddings (BOE) features, combined with an XGBoost classifier, to identify malicious tokens.
- It effectively detects and intercepts potential attacks with minimal computational overhead.
- In experiments, CLASP was evaluated on a corpus of 2,483 resumes totaling 9.5M tokens, achieving a token-level F1 score of 95.9% and a document-level F1 score of 99.3%.
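The relationship between the token-level and document-level views can be illustrated with a simple aggregation rule. The paper reports both metrics; the any-token-flagged rule below is our illustrative assumption, not a confirmed detail of CLASP.

```python
# Illustrative aggregation from per-token flags to a per-document
# verdict (assumed rule: a document is flagged if any token is).

def document_verdict(token_flags):
    """Return 1 if any token in the document was flagged, else 0."""
    return int(any(token_flags))

# A clean document and one containing a flagged (injected) token.
print(document_verdict([0, 0, 0, 0]))  # 0 (clean)
print(document_verdict([0, 0, 1, 0]))  # 1 (flagged)
```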
Experiments
The experimental design includes evaluation on a corpus of 2,483 resumes totaling 9.5M tokens. Controlled injection is used to assess CLASP's performance in detecting malicious tokens. Leave-one-out cross-validation and clustered cross-validation are employed to test CLASP's generalization capability for unseen attack patterns. The results demonstrate that CLASP performs well for known attack patterns and maintains high detection capability for unseen attack patterns.
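The leave-one-out protocol over attack patterns amounts to a leave-one-group-out split, where each attack family is held out in turn and the model is trained on the rest. A minimal sketch (the document IDs and family labels here are hypothetical):

```python
# Sketch of leave-one-group-out splitting over attack families,
# as used to test generalization to unseen attack patterns.

def leave_one_group_out(samples):
    """Yield (held_out_family, train, test) splits.

    samples: list of (document_id, attack_family) pairs.
    Each attack family is held out exactly once as the test set.
    """
    families = sorted({fam for _, fam in samples})
    for held_out in families:
        train = [s for s in samples if s[1] != held_out]
        test = [s for s in samples if s[1] == held_out]
        yield held_out, train, test

# Toy dataset: four documents across three attack families.
data = [("doc1", "A"), ("doc2", "A"), ("doc3", "B"), ("doc4", "C")]
for fam, train, test in leave_one_group_out(data):
    print(fam, len(train), len(test))
```

Clustered cross-validation follows the same shape, except that the groups are clusters of structurally similar triggers rather than single attack patterns.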
Results
The results show that CLASP was evaluated on a corpus of 2,483 resumes totaling 9.5M tokens, achieving a token-level F1 score of 95.9% and a document-level F1 score of 99.3% in detecting malicious tokens. Under leave-one-out cross-validation, CLASP maintained a high document-level F1 score of 96.9% even for unseen attack patterns. In clustered cross-validation with structurally novel triggers, CLASP sustained a useful detection capability with an average document-level F1 score of 91.6%.
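The reported F1 scores are the harmonic mean of precision and recall over the positive (malicious) class; the same formula applies at the token level and, after aggregation, at the document level. A minimal computation on toy labels (not the paper's data):

```python
# Binary F1 = 2PR / (P + R), with label 1 = malicious.

def f1_score(y_true, y_pred):
    """Compute F1 for binary 0/1 label sequences."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Toy example: six tokens, one missed malicious token (false negative).
print(f1_score([1, 1, 0, 0, 1, 0], [1, 1, 0, 0, 0, 0]))  # 0.8
```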
Applications
The CLASP model is applicable in scenarios requiring defense against Hidden State Poisoning Attacks (HiSPA), such as resume screening, compliance checks, and customer support in document-centric workflows. Due to its independence and efficiency, CLASP can serve as a lightweight front-line defense tool, protecting systems based on State Space Models (SSMs) from potential attacks.
Limitations & Outlook
CLASP's performance declines when handling structurally novel triggers, particularly in the third fold of clustered cross-validation, where the F1 score drops to 82.17%. Additionally, the model assumes that all possible trigger patterns are represented in the training set, which may not hold true in practical applications. Due to its time-invariance constraint, CLASP's token-level performance is limited: it cannot leverage contextual information to distinguish ambiguous tokens. Future research directions include improving CLASP's performance in handling structurally novel triggers, exploring broader injection attack defense strategies, and investigating the model's transferability to other recurrent architectures.
Plain Language: Accessible to non-experts
Imagine you're cooking in a kitchen. You have a big pot filled with various ingredients. Each ingredient has its own flavor, just like each token has its own information. During the process, someone secretly adds some strange spices to the pot, which change the entire dish's flavor, making it taste bad. This is like a Hidden State Poisoning Attack (HiSPA), where malicious tokens alter the model's memory, causing it to make wrong decisions. The CLASP model is like a smart chef who can detect these strange spices before they are added and remove them, ensuring your dish isn't ruined. CLASP analyzes the characteristics of each ingredient, identifying those that might be problematic, and removes them before they affect the entire dish. This way, even if someone tries to ruin your dish, CLASP helps you keep it delicious. This method is not only effective but also doesn't add extra workload, like a helpful kitchen assistant aiding you in making a tasty meal.
ELI14: Explained like you're 14
Hey there, friends! Did you know that sometimes computers can get tricked by pranksters? It's like when you're playing a game, and someone secretly changes the rules so you always lose. In the computer's brain, some bad guys use weird codes to trick it into making wrong decisions. This kind of bad guy attack is called a Hidden State Poisoning Attack (HiSPA).
But don't worry! Scientists have invented a super tool called CLASP. It's like a smart detective that can spot the pranksters' tricks before they mess things up. CLASP carefully checks every line of code, finds the parts that look suspicious, and removes them so the computer doesn't get tricked!
Imagine you're doing a science experiment at school, and someone secretly adds weird stuff to your test tube, making the results strange. CLASP is like your good friend who helps you check the test tube before the experiment starts, making sure there's nothing weird inside. That way, your experiment won't get ruined!
So next time you hear someone talk about CLASP, you'll know it's a superhero in the computer world, protecting our computers from getting tricked by bad guys!
Glossary
State Space Model
An efficient model alternative to Transformers, featuring linear complexity, suitable for long-sequence processing.
Used as an alternative to Transformers for improved efficiency.
Hidden State Poisoning Attack
Corrupts the hidden state of SSMs through malicious tokens, leading to irreversible memory damage.
Attacks SSMs, affecting model performance.
Block Output Embedding
Output features of the Mamba model used to identify characteristics of malicious tokens.
Used by CLASP to detect malicious tokens.
XGBoost
An efficient gradient boosting decision tree algorithm used for classification tasks.
Used by CLASP for malicious token classification.
Resume Screening
The process of using LLMs to screen resumes to identify the best candidates.
CLASP is evaluated in the resume screening scenario.
Leave-One-Out Cross-Validation
A validation method where one sample is used as the test set, and the rest as the training set.
Used to evaluate CLASP's performance on unseen attack patterns.
Clustered Cross-Validation
A validation method that divides data into clusters of structurally similar triggers to test generalization capability.
Used to test CLASP's performance on structurally novel triggers.
Token-Level F1 Score
Measures the harmonic mean of precision and recall at the token level for classification models.
CLASP's performance metric in detecting malicious tokens.
Document-Level F1 Score
Measures the harmonic mean of precision and recall at the document level for classification models.
CLASP's performance metric in detecting malicious documents.
Time-Invariance Constraint
A limitation where CLASP does not use contextual information for token-level detection.
Limits CLASP's token-level performance.
Open Questions: Unanswered questions from this research
1. How can CLASP's performance be improved when handling structurally novel triggers? The existing model shows a decline in performance when facing unseen attack patterns, particularly in the third fold of clustered cross-validation, where the F1 score drops to 82.17%. New methods need to be explored to enhance the model's generalization capability.
2. The CLASP model assumes that all possible trigger patterns are represented in the training set, but this assumption may not hold true in practical applications. How can the model's applicability be extended without increasing computational overhead?
3. The time-invariance constraint limits CLASP's performance at the token level. How can contextual information be leveraged to improve detection accuracy while maintaining model efficiency?
4. Existing defense strategies primarily target Transformer-based models, while SSMs exhibit unique vulnerabilities when facing HiSPA. How can dedicated defense strategies be developed specifically for SSMs to enhance their security?
5. CLASP performs well in the resume screening scenario, but its applicability in other document-centric workflows has not been validated. How can CLASP's performance be evaluated and extended in different application scenarios?
Applications
Immediate Applications
Resume Screening
CLASP can be used in corporate HR departments to detect and intercept potential malicious injection attacks during resume screening, ensuring the accuracy and fairness of the screening results.
Compliance Checks
In compliance checks, CLASP can serve as a front-line defense tool, protecting document processing systems from Hidden State Poisoning Attacks, ensuring the reliability of compliance reviews.
Customer Support
CLASP can be used in customer support systems to detect and intercept potential malicious injection attacks, protecting system stability and customer data security.
Long-term Vision
Cross-Domain Applications
As CLASP is successfully applied in different document-centric workflows, its methods can be extended to other domains such as finance and healthcare, providing broader security protection.
Security Protection for Recurrent Architectures
In the future, CLASP's technology can be extended to other recurrent architectures, providing security protection for a wider range of models, advancing the field of model security.
Abstract
State space models (SSMs) like Mamba have gained significant traction as efficient alternatives to Transformers, achieving linear complexity while maintaining competitive performance. However, Hidden State Poisoning Attacks (HiSPAs), a recently discovered vulnerability that corrupts SSM memory through adversarial strings, pose a critical threat to these architectures and their hybrid variants. Framing the HiSPA mitigation task as a binary classification problem at the token level, we introduce the CLASP model to defend against this threat. CLASP exploits distinct patterns in Mamba's block output embeddings (BOEs) and uses an XGBoost classifier to identify malicious tokens with minimal computational overhead. We consider a realistic scenario in which both SSMs and HiSPAs are likely to be used: an LLM screening résumés to identify the best candidates for a role. Evaluated on a corpus of 2,483 résumés totaling 9.5M tokens with controlled injections, CLASP achieves a 95.9% token-level F1 score and a 99.3% document-level F1 score on malicious token detection. Crucially, the model generalizes to unseen attack patterns: under leave-one-out cross-validation, performance remains high (96.9% document-level F1), while under clustered cross-validation with structurally novel triggers, it maintains useful detection capability (91.6% average document-level F1). Operating independently of any downstream model, CLASP processes 1,032 tokens per second with under 4GB VRAM consumption, potentially making it suitable for real-world deployment as a lightweight front-line defense for SSM-based and hybrid architectures. All code and detailed results are available at https://anonymous.4open.science/r/hispikes-91C0.
References (20)
Attention is All you Need
Ashish Vaswani, Noam Shazeer, Niki Parmar et al.
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao, Stella Biderman, Sid Black et al.
Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models
Nvidia Aaron Blakeman, Aarti Basant, Abhinav Khattar et al.
Green AI: exploring carbon footprints, mitigation strategies, and trade offs in large language model training
V. Liu, Yiqiao Yin
Hidden State Poisoning Attacks against Mamba-based Language Models
A. Mercier, Chris Develder, Thomas Demeester
PISanitizer: Preventing Prompt Injection to Long-Context LLMs via Prompt Sanitization
Runpeng Geng, Yanting Wang, Chenlong Yin et al.
XGBoost: A Scalable Tree Boosting System
Tianqi Chen, Carlos Guestrin
Carbon Emissions and Large Neural Network Training
David A. Patterson, Joseph Gonzalez, Quoc V. Le et al.
Prompt Injection attack against LLM-integrated Applications
Yi Liu, Gelei Deng, Yuekang Li et al.
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Tri Dao, Albert Gu
Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information
Zhengmian Hu, Gang Wu, Saayan Mitra et al.
Efficiently Modeling Long Sequences with Structured State Spaces
Albert Gu, Karan Goel, Christopher R'e
Hymba: A Hybrid-head Architecture for Small Language Models
Xin Dong, Y. Fu, Shizhe Diao et al.
Recurrent Neural Networks (RNNs): A gentle Introduction and Overview
Robin M. Schmidt
Formalizing and Benchmarking Prompt Injection Attacks and Defenses
Yupei Liu, Yuqi Jia, Runpeng Geng et al.
Ignore Previous Prompt: Attack Techniques For Language Models
Fábio Perez, I. Ribeiro
Attention is All You Need to Defend Against Indirect Prompt Injection Attacks in LLMs
Yinan Zhong, Qianhao Miao, Yanjiao Chen et al.
Can Indirect Prompt Injection Attacks Be Detected and Removed?
Yulin Chen, Haoran Li, Yuan Sui et al.
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu, Tri Dao
Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling
Liliang Ren, Yang Liu, Yadong Lu et al.