Multi-Source Cybersecurity Logs: An ATT&CK-Labeled Dataset and SLM Evaluation
The authors constructed a multi-source log dataset of 870 sessions and approximately 2.3 million events labeled with ATT&CK techniques, then fine-tuned three small language models (Qwen, Llama, Phi) using LoRA, reaching up to 97% accuracy in chunk classification.
Key Findings
Methodology
This study collected synchronized system, network, and browser logs from Windows endpoints during simulated multi-stage attacks driven by real attack tools. The dataset comprises 870 sessions, each lasting 20 minutes, and approximately 2.3 million events in total; malicious events are annotated with ATT&CK technique IDs spanning 12 tactics and 53 techniques. The authors used LoRA to fine-tune three small language models (Qwen2.5-1.5B, Llama-3.2-3B, and Phi-4-Mini) on this dataset. Evaluation covered two tasks, chunk classification (normal vs. suspicious) and ATT&CK technique identification, and compared each fine-tuned model against its base variant across ten metrics. Fine-tuning improved every model on every metric: chunk classification accuracy rose from roughly 8% to between 90% and 97%, and the best exact-match technique identification accuracy reached 42%. The methodology emphasizes realistic attack simulation, precise labeling, and efficient model adaptation.
Key Results
- The dataset includes 870 sessions (70 attack, 800 benign), with about 2.3 million events from system, network, and browser sources, covering comprehensive attack tactics and techniques, generated with real attack tools to ensure authenticity.
- Fine-tuning with LoRA improved every model on every metric: chunk classification accuracy rose from approximately 8% to between 90% and 97%. Technique identification remained harder, with the best exact-match accuracy at 42%.
- High partial-match scores indicate that the models captured most of the underlying attack reasoning even when exact technique identification failed.
Significance
This work addresses a critical gap in cybersecurity research by providing the first publicly available, multi-source, ATT&CK-labeled dataset derived from real attack scenarios. It enables the development and benchmarking of models capable of reasoning across system, network, and browser logs at the technique level, which is essential for detecting sophisticated multi-stage attacks. The integration of real attack tools enhances the dataset’s realism, making it highly valuable for both academia and industry. The demonstrated effectiveness of LoRA fine-tuning on small language models paves the way for deploying lightweight, accurate detection systems in real-world environments, significantly advancing automated threat detection and response capabilities.
Technical Contribution
The paper's main technical contributions include: 1) the creation of a comprehensive, synchronized multi-source log dataset with per-event ATT&CK labels, covering broad attack tactics and techniques; 2) the use of real attack tools for data generation, ensuring high fidelity of attack scenarios; 3) the application of LoRA to efficiently fine-tune small pre-trained language models for security tasks, achieving substantial performance gains with minimal additional parameters; 4) rigorous evaluation across ten metrics on two tasks, establishing a benchmark for multi-source attack detection. These contributions combine data authenticity, detailed labeling, and efficient model adaptation.
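The phrase "minimal additional parameters" can be made concrete with a little arithmetic. The sketch below assumes the standard LoRA formulation, in which a frozen weight matrix of shape d_out x d_in is adapted by two trainable low-rank factors of shapes d_out x r and r x d_in. The layer size and rank are hypothetical values chosen for illustration; the summary does not state the ranks or target layers the authors used.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one LoRA-adapted linear layer.

    Factor A has shape (rank, d_in) and factor B has shape (d_out, rank),
    so the total is rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


def full_layer_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen dense weight matrix (bias ignored)."""
    return d_in * d_out


# Hypothetical layer size and rank, NOT taken from the paper.
D_IN, D_OUT, RANK = 2048, 2048, 8

lora_count = lora_trainable_params(D_IN, D_OUT, RANK)
full_count = full_layer_params(D_IN, D_OUT)
ratio = lora_count / full_count

print(f"LoRA: {lora_count} trainable vs {full_count} frozen ({ratio:.2%})")
```

Under these assumed values the adapter trains under 1% of the layer's weights, which is why LoRA suits small models on modest hardware.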
Novelty
This study combines real attack tooling with multi-source logging and ATT&CK-level annotation, filling a gap in publicly available datasets. Unlike prior works that either focus on single sources or lack detailed labels, this dataset captures synchronized system, network, and browser activity with fine-grained attack labels. The study also shows that lightweight LoRA adaptation of small language models can reach 90% to 97% chunk classification accuracy on a multi-source security task. The combination of real attack simulation, detailed labeling, and efficient fine-tuning distinguishes this work from existing literature.
Limitations
- The dataset is limited to Windows environments, which may restrict generalization to other operating systems like Linux or macOS. Expanding to diverse platforms is necessary for broader applicability.
- Although attack scenarios are realistic, they are still simulated within controlled environments; real-world complexities and unknown attack vectors may pose additional challenges.
- ATT&CK labels depend on attack tool traceability, risking incomplete or inaccurate annotation if attack tools evolve or are obfuscated. Further automation and validation are needed to improve label accuracy.
Future Work
Future directions include expanding the dataset to include other operating systems and device types, increasing attack scenario diversity, and integrating unsupervised or semi-supervised learning techniques to reduce manual labeling efforts. Developing more advanced multi-source fusion architectures, such as graph neural networks or transformer-based models that explicitly model cross-source relationships, could further improve detection accuracy. Additionally, exploring real-time deployment and adaptive learning methods will be crucial for operational cybersecurity applications, enabling systems to evolve with emerging threats.
AI Executive Summary
In the rapidly evolving landscape of cybersecurity, adversaries increasingly deploy multi-stage, multi-source attacks that traverse system, network, and browser environments. Traditional detection methods, often relying on signature-based or rule-based approaches, struggle to keep pace with sophisticated threats that adapt and blend across different layers of an organization’s digital infrastructure. This challenge underscores the urgent need for advanced, data-driven solutions capable of understanding complex attack behaviors across multiple data sources.
Addressing this critical gap, the present study introduces a comprehensive multi-source log dataset, meticulously constructed from real attack scenarios on Windows endpoints. The dataset encompasses 870 sessions, including 70 malicious attack sessions and 800 benign user activities, with approximately 2.3 million events spanning system, network, and browser logs. Each malicious event is annotated with fine-grained ATT&CK technique IDs, covering 12 tactics and 53 techniques, providing a detailed map of adversarial behaviors. The data collection process involved synchronized logging to ensure temporal alignment, and attack scenarios were crafted using authentic tools such as Revenge-RAT, Process Hacker, and rclone, simulating real-world multi-stage attacks.
Leveraging this rich dataset, the authors employed Low-Rank Adaptation (LoRA) to fine-tune three small language models—Qwen2.5-1.5B, Llama-3.2-3B, and Phi-4-Mini—aiming to enhance their ability to detect malicious activities. The models were evaluated on two core tasks: chunk classification, distinguishing normal from suspicious activity, and ATT&CK technique identification, pinpointing the specific adversarial method used. Results demonstrated that fine-tuning dramatically improved performance: chunk classification accuracy soared from approximately 8% to over 90%, while the best exact-match ATT&CK technique identification reached 42%. These findings highlight the models’ capacity to learn complex attack patterns from multi-source data.
The significance of this work lies in its contribution to both dataset availability and methodological advancement. By providing a publicly accessible, multi-source dataset generated with real attack tools and labeled at the technique level, it enables researchers and practitioners to develop more accurate, explainable, and robust detection models. The successful application of LoRA shows an efficient way to adapt pre-trained small language models to cybersecurity tasks, opening avenues for deploying lightweight yet effective security solutions.
Looking ahead, future research should focus on expanding the dataset to include diverse platforms and attack types, exploring more sophisticated multi-source fusion architectures, and deploying models in real-time operational environments. The integration of such intelligent systems promises to significantly elevate the automation and effectiveness of cybersecurity defenses, making organizations more resilient against increasingly complex adversarial threats. Overall, this work marks a pivotal step toward intelligent, multi-layered cyberattack detection, blending authentic attack scenarios with cutting-edge AI techniques to forge the future of cybersecurity.
Deep Analysis
Background
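As information technology has advanced, cyberattacks have evolved from simple, single-vector intrusions into complex multi-stage threats that span multiple sources. Early datasets such as KDD Cup 1999 and NSL-KDD focus mainly on network traffic and lack fine-grained monitoring of host and browser behavior. Later datasets such as CICIDS and UNSW-NB15 are richer but remain network-focused and still provide no fine-grained, technique-level labels across sources. Public datasets built from simulated attacks, such as ATLAS and ATLASv2, attempt to integrate multiple sources but lack detailed ATT&CK technique labels, which limits how well models can learn attack behavior. Using real attack tools greatly improves data realism but makes collection and labeling harder. Building a high-quality dataset that combines real attack scenarios, multi-source telemetry, and fine-grained labels has therefore become both a focus and a difficulty of current cybersecurity research.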
Core Problem
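Existing multi-source log datasets fall short in synchronized collection, label granularity, and realism. Single-source data cannot show the full picture of an attack, and techniques for time-synchronizing and fusing multi-source data are still immature, which limits cross-source correlation analysis. The complexity of attack behavior requires models to understand system, network, and browser behavior together, yet high-quality training and evaluation data are lacking. Simulating attacks with real attack tools improves realism but also makes collection and labeling harder. The core problem is how to label events at fine granularity while keeping the data realistic. This affects both model detection performance and threat identification in practice.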
Innovation
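The main innovations of this study are: 1) synchronized collection of system, network, and browser logs, which keeps the sources temporally consistent and addresses the multi-source integration problem; 2) simulation of multi-stage attacks with real attack tools, which gives the data realism and complexity and gives models a realistic training basis; 3) labeling each malicious event with a fine-grained technique ID based on the ATT&CK framework, which enables technique-level identification of attack behavior; 4) use of LoRA to fine-tune pre-trained models efficiently, which improves their performance on multi-source attack detection. Together these combine multi-source fusion, realistic attack simulation, and efficient fine-tuning into a new approach to multi-source, multi-task security detection.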
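To illustrate what temporal consistency across sources buys, here is a minimal sketch that interleaves three per-source event streams into one timeline ordered by timestamp. The `Event` record, its fields, and the sample events are hypothetical; the summary does not describe the dataset's actual schema or merge procedure. The sketch assumes each stream is already sorted by time and that all clocks are aligned.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Event(NamedTuple):
    """Hypothetical unified record; field names are assumptions."""
    timestamp: float  # seconds since session start
    source: str       # "system", "network", or "browser"
    message: str


def merge_sources(*streams: Iterable[Event]) -> Iterator[Event]:
    """Lazily merge time-sorted streams into one time-ordered stream."""
    return heapq.merge(*streams, key=lambda event: event.timestamp)


# Made-up sample events, one list per source, each sorted by timestamp.
system_log = [Event(1.0, "system", "process create"),
              Event(4.0, "system", "file write")]
network_log = [Event(2.5, "network", "dns query")]
browser_log = [Event(0.5, "browser", "tab opened"),
               Event(3.0, "browser", "download started")]

timeline = list(merge_sources(system_log, network_log, browser_log))
for event in timeline:
    print(f"{event.timestamp:>4.1f}  {event.source:<8} {event.message}")
```

`heapq.merge` streams its inputs instead of loading them all at once, which matters when a session holds thousands of events per source.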
Methodology
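- Data collection: System, network, and browser logs are collected synchronously on Windows endpoints using Sysmon, tshark, and Activity Watch, keeping the sources time-aligned.
- Attack simulation: Real attack tools (such as Revenge-RAT, Process Hacker, and rclone) drive multiple attack scenarios, including initial access, privilege escalation, lateral movement, data exfiltration, and ransomware, so that attack behavior is realistic and varied.
- Labeling: Each event is manually labeled with its ATT&CK technique ID by tracing the attack logs, producing a fine-grained label set.
- Feature engineering: Consecutive events are split into chunks of 7 events; each chunk carries a session ID, chunk index, and event count and serves as model input.
- Fine-tuning: LoRA is used to fine-tune Qwen2.5-1.5B, Llama-3.2-3B, and Phi-4-Mini, adapting them to the security tasks.
- Evaluation metrics: Both tasks, chunk classification and technique identification, are evaluated with ten metrics, including accuracy, precision, recall, and F1-score.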
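The chunking step described above can be sketched as follows. The chunk size of 7 and the three metadata fields (session ID, chunk index, event count) come from the text. The dictionary key names, the string representation of events, and the choice to keep a shorter final chunk are assumptions made for illustration.

```python
from typing import Any, Dict, List

CHUNK_SIZE = 7  # events per chunk, as stated in the methodology


def chunk_session(session_id: str,
                  events: List[str],
                  chunk_size: int = CHUNK_SIZE) -> List[Dict[str, Any]]:
    """Split one session's ordered events into fixed-size chunks.

    Assumption: a trailing chunk with fewer than `chunk_size` events
    is kept rather than dropped or padded.
    """
    chunks = []
    for index, start in enumerate(range(0, len(events), chunk_size)):
        window = events[start:start + chunk_size]
        chunks.append({
            "session_id": session_id,
            "chunk_index": index,
            "event_count": len(window),
            "events": window,
        })
    return chunks


# Hypothetical session with 16 placeholder events.
sample_events = [f"event-{i}" for i in range(16)]
chunks = chunk_session("session-001", sample_events)

for chunk in chunks:
    print(chunk["session_id"], chunk["chunk_index"], chunk["event_count"])
```

With 16 events this yields three chunks of 7, 7, and 2 events. Each chunk would then be serialized into a prompt for the model.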
Experiments
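The experimental design collects 870 sessions in a real attack environment, split into 70 attack sessions and 800 benign sessions, to ensure data diversity and realism. Attack scenarios cover 12 ATT&CK tactics and 53 techniques and simulate multi-stage, multi-technique attack flows. Training uses a random split into training, validation, and test sets, with LoRA fine-tuning of the pre-trained models. Model performance is evaluated with multiple metrics, including accuracy, recall, and F1-score, with particular attention to the exact-match rate in technique identification. Multiple rounds of experiments verify the effect of fine-tuning, and ablation experiments analyze how each component contributes to model performance.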
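A random train/validation/test split of this kind might look like the sketch below. It splits by session, so that chunks from one session never land in two partitions. That choice, the 80/10/10 proportions, and the fixed seed are all assumptions; the summary only states that the split is random.

```python
import random
from typing import List, Sequence, Tuple


def split_sessions(session_ids: Sequence[str],
                   train_pct: int = 80,
                   val_pct: int = 10,
                   seed: int = 0) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle session IDs and cut them into train/val/test lists.

    Percentages are hypothetical; whatever remains after train and
    validation goes to the test set.
    """
    ids = list(session_ids)
    random.Random(seed).shuffle(ids)  # local RNG keeps the split reproducible
    n_train = len(ids) * train_pct // 100
    n_val = len(ids) * val_pct // 100
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])


# 870 placeholder session IDs, matching the dataset's session count.
all_ids = [f"s{i:03d}" for i in range(870)]
train_ids, val_ids, test_ids = split_sessions(all_ids)

print(len(train_ids), len(val_ids), len(test_ids))
```

With only 70 attack sessions out of 870, a stratified split that preserves the attack/benign ratio in each partition may be worth considering; the sketch does not do this.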
Results
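After fine-tuning, chunk classification accuracy for the three models rose from about 8% to between 90% and 97%, which supports the quality of the dataset and the models' ability to learn from it. For ATT&CK technique identification, the best exact-match rate was 42%, and partial-match scores were high, indicating that the models captured most of the underlying attack reasoning. The models differ slightly across tasks, which suggests that model architecture and fine-tuning strategy affect detection performance. Ablation experiments indicate that using real attack tools markedly improves model generalization and that fine-grained labels improve identification. These results support the practical value of the dataset and the effectiveness of fine-tuning, and provide a technical basis for future multi-source security detection.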
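The gap between a 42% exact-match rate and high partial-match scores is easier to see with a small example. The sketch below treats a prediction as a set of ATT&CK technique IDs. Exact match requires the predicted and expected sets to be identical. Partial match is computed here as Jaccard overlap, which is an assumption; the summary does not define the paper's partial-match metric. The technique IDs are illustrative and not drawn from the dataset.

```python
from typing import Iterable


def exact_match(predicted: Iterable[str], expected: Iterable[str]) -> bool:
    """True only when both technique sets are identical."""
    return set(predicted) == set(expected)


def partial_match(predicted: Iterable[str], expected: Iterable[str]) -> float:
    """Jaccard overlap between technique sets (assumed definition)."""
    pred_set, exp_set = set(predicted), set(expected)
    if not pred_set and not exp_set:
        return 1.0  # both empty: treat as full agreement
    return len(pred_set & exp_set) / len(pred_set | exp_set)


# Illustrative labels for one chunk: the model misses one technique.
expected_ids = ["T1059", "T1105", "T1567"]
predicted_ids = ["T1059", "T1105"]

is_exact = exact_match(predicted_ids, expected_ids)
overlap = partial_match(predicted_ids, expected_ids)

print(f"exact match: {is_exact}, partial match: {overlap:.2f}")
```

In this example the model finds two of three techniques, so exact match fails while partial match credits the overlap. A pattern like this across many chunks would produce a low exact-match rate alongside high partial-match scores.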
Applications
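The dataset and fine-tuned models can be applied directly in enterprise security operations centers (SOCs) for automated monitoring and threat detection over multi-source logs. The models can analyze system, network, and browser logs in real time, automatically flag potential attack behavior, and assist security analysis and response. In the future, combining active learning and semi-supervised learning could lower labeling cost and improve model adaptability. In the long run, this work promotes multi-source fusion architectures and cross-platform, multi-device intelligent defense systems that raise the level of automation in network security. This matters for countering increasingly complex cyber threats and for automating security operations.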
Limitations & Outlook
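The dataset is currently limited in scale and scenarios and focuses on Windows endpoints; it needs to be extended to other operating systems and device types to improve generalization. The attack scenarios are varied but still simulated, so model performance in complex real-world environments remains to be verified. ATT&CK labeling depends on the traceability of the attack tools, so events may be missed or mislabeled. Fine-tuning improves performance, but robustness under extreme or previously unseen attack scenarios still needs validation. Future work should combine deep learning architectures for multi-source fusion to improve robustness to noise and generalization.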
Abstract
Multi-stage cyberattacks span system, network, and browser logs. Detecting them requires correlating events across all three sources. Machine learning methods can learn these cross-source patterns, but they need labeled multi-source data. Existing public datasets fall short. Network-only datasets such as CICIDS and UNSW-NB15 miss host and browser activity. Host-focused datasets such as LMDG and CICAPT-IIoT lack browser telemetry. ATLAS includes all three sources but labels events only as malicious or benign, without MITRE Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK) technique granularity. No public dataset combines all three sources with per-entry ATT&CK technique labels. We close the gap by building a multi-source log dataset of 870 sessions (70 attack, 800 benign) and approximately 2.3 million events. We captured system, network, and browser activity simultaneously on Windows endpoints. We labeled malicious events with ATT&CK technique IDs, covering 12 tactics and 53 techniques. We generated all attack data using real tools, including Remote Access Trojan (RAT), Command and Control (C2) tunnels, and cloud exfiltration. To demonstrate learnability, we fine-tuned three Small Language Models (SLMs) (Qwen2.5-1.5B, Llama-3.2-3B, Phi-4-Mini) using Low-Rank Adaptation (LoRA). We compared each against its base variant across ten metrics on two tasks: chunk classification and ATT&CK technique identification. Fine-tuning improved every model on every metric. Chunk classification accuracy rose from approximately 8% in the base variants to between 90% and 97% after fine-tuning. Technique identification remained challenging, with the best exact-match accuracy at 42%, although high partial-match scores show the models captured most of the underlying reasoning.