LLMSurgeon: Diagnosing Data Mixture of Large Language Models

TL;DR

LLMSurgeon formulates data mixture diagnosis as a label-shift inverse problem, achieving 94.46% overlap accuracy on the LLMScan benchmark.

cs.CL Advanced 2026-05-29
Yaxin Luo, Jiacheng Cui, Xiaohan Zhao, Xinyi Shang, Jiacheng Liu, Xinyue Bi, Zhaoyi Li, Zhiqiang Shen
Model Data Auditing, Inverse Problem, Label Shift, Data Mixture, Model Explainability

Key Findings

Methodology

The approach formalizes Data Mixture Surgery (DMS) as an inverse problem under the label-shift assumption: domain proportions may change, but the conditional distribution of text within each domain stays invariant. An external classifier is trained on known reference data to estimate a systematic bias matrix (a confusion matrix). Texts are then sampled from the target model and classified, and a calibrated linear inverse recovers the latent domain proportions. The process does not rely on sample-level membership inference; it targets macro-level distribution estimation. The pipeline comprises classifier calibration, neutral-prompt sampling, and constrained least-squares optimization, which together are intended to make the estimate robust to semantic overlap and classifier bias. The method is validated on the newly introduced LLMScan benchmark, which contains models with transparent training data and supports evaluation across multiple granularities and model scales.
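The pipeline above can be sketched on synthetic numbers. This is a minimal illustration, not the authors' implementation: the function names (`soft_confusion`, `project_to_simplex`, `recover_mixture`), the projected-gradient solver, and all toy values are assumptions made for the example. The source only specifies a calibrated confusion matrix and constrained least squares.

```python
import numpy as np

def soft_confusion(ref_probs, ref_labels, n_domains):
    """Row i = mean classifier probability vector over reference texts of domain i."""
    return np.vstack([ref_probs[ref_labels == i].mean(axis=0)
                      for i in range(n_domains)])

def project_to_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def recover_mixture(C, p_bar, steps=5000):
    """Minimise 0.5 * ||C^T pi - p_bar||^2 over the simplex (projected gradient)."""
    A = C.T
    lr = 1.0 / np.linalg.norm(A, 2) ** 2   # step size 1/L for this quadratic
    pi = np.full(C.shape[0], 1.0 / C.shape[0])
    for _ in range(steps):
        pi = project_to_simplex(pi - lr * (A.T @ (A @ pi - p_bar)))
    return pi

# Hypothetical classifier probabilities on labelled reference texts (3 domains).
ref_probs = np.array([[0.9, 0.1, 0.0], [0.7, 0.2, 0.1],
                      [0.2, 0.7, 0.1], [0.2, 0.7, 0.1],
                      [0.1, 0.2, 0.7], [0.1, 0.2, 0.7]])
ref_labels = np.array([0, 0, 1, 1, 2, 2])
C = soft_confusion(ref_probs, ref_labels, 3)

# Simulate the mean classifier output on generated text for a known mixture.
true_pi = np.array([0.6, 0.3, 0.1])
p_bar = C.T @ true_pi                  # what naive aggregation would report
pi_hat = recover_mixture(C, p_bar)
print("naive:     ", p_bar.round(3))
print("calibrated:", pi_hat.round(3))
```

In this toy setting the naive aggregate is pulled toward the classifier's confusions, while the calibrated inverse returns the mixture used to generate the data. With real, noisy samples the recovery is only approximate.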

Key Results

  • On the LLMScan benchmark, the method consistently outperforms baseline aggregation methods, with a reported average overlap accuracy of 94.46% across coarse, mid, and fine granularities. At the coarse level (6 domains) accuracy exceeds 99%, while at the fine level (87 programming languages) it remains above 30%, ahead of prior approaches such as GradNorm (27.54%). The results hold across models from 7B to 65B parameters. Ablation studies indicate that classifier quality, sample size (≥1000), and inverse bias correction are critical for high-fidelity recovery. The method also remains stable under different sampling styles and domain definitions.
  • The experiments on models such as LLaMA-1, OLMo, Amber, Pythia, StarCoder, and GPT-Neo validate the approach's generality. The use of publicly documented training data distributions as ground truth ensures the evaluation's reliability. The results show that the calibrated inverse approach significantly improves estimation accuracy over naive aggregation, especially in challenging fine-grained tasks. The method also effectively tracks data distribution dynamics during training, revealing fluctuations and convergence patterns in models like Amber and OLMo. These findings highlight the potential of LLMSurgeon as a post-hoc auditing tool for large models, capable of revealing their digital DNA without access to training data or internal parameters.
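The sensitivity to sample size noted in the ablations can be illustrated with a small simulation. This is a sketch under stated assumptions: the confusion matrix and mixture are made-up values, the classifier emits hard labels drawn from a multinomial, and the inversion is a plain linear solve followed by clipping, a cruder stand-in for the constrained solver the source describes.

```python
import numpy as np

rng = np.random.default_rng(0)
true_pi = np.array([0.6, 0.3, 0.1])          # hypothetical latent mixture
C = np.array([[0.80, 0.15, 0.05],
              [0.20, 0.70, 0.10],
              [0.10, 0.20, 0.70]])            # hypothetical confusion matrix
q = C.T @ true_pi                             # distribution of predicted labels

def mean_abs_error(n_samples, trials=200):
    """Average per-domain error of the inverted estimate over repeated draws."""
    errs = []
    for _ in range(trials):
        counts = rng.multinomial(n_samples, q)
        p_bar = counts / n_samples
        pi_hat = np.linalg.solve(C.T, p_bar)  # unconstrained inverse
        pi_hat = np.clip(pi_hat, 0.0, None)
        pi_hat /= pi_hat.sum()                # crude return to the simplex
        errs.append(np.abs(pi_hat - true_pi).mean())
    return float(np.mean(errs))

errors = {n: mean_abs_error(n) for n in (100, 1000, 10000)}
for n, e in errors.items():
    print(f"n={n:>6}: mean abs error = {e:.4f}")
```

The error shrinks roughly with the square root of the number of generated samples, which is consistent with a sample-size floor mattering in practice. The specific threshold of 1000 comes from the paper's ablations, not from this toy.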

Significance

This work addresses a fundamental challenge in AI transparency: understanding the training data composition of large language models whose training data is not disclosed. By framing data mixture estimation as an inverse problem under label shift, it offers a scalable, model-agnostic, post-hoc auditing method that can be applied to black-box models. This supports AI accountability by helping stakeholders verify data sources, detect biases, and check compliance with legal and ethical standards. LLMSurgeon and the LLMScan benchmark together provide a standardized platform for future research. The work also connects statistical theory with practical AI auditing.

Technical Contribution

The core technical innovation lies in modeling data mixture estimation as a label-shift inverse problem, where the observed distribution of generated texts is a biased linear mixing of the true domain proportions. By estimating a calibrated soft confusion matrix from reference data, the method formulates a constrained linear inverse problem, solved via least squares with probability-simplex constraints, to recover the latent domain proportions. The approach relies on the invariance of conditional distributions under label shift, combined with calibration and regularization, to achieve accuracy and stability. The framework is validated on a new benchmark, LLMScan, which provides real training data distributions for evaluation. The method's modular design allows integration with various classifiers and sampling strategies, making it adaptable to different model scales and domain granularities.
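Written out, the constrained inverse problem described above takes roughly the following form. The symbols are assumptions for illustration: $K$ domains, $C \in \mathbb{R}^{K \times K}$ the soft confusion matrix whose entry $C_{ij}$ is the mean probability assigned to domain $j$ on reference text from domain $i$, $\bar{p}$ the mean classifier output on generated text, and $\pi$ the latent mixture. Any regularization term the method adds is not specified here and is omitted.

```latex
\hat{\pi} \;=\; \arg\min_{\pi \in \Delta^{K-1}} \; \bigl\lVert C^{\top}\pi - \bar{p} \bigr\rVert_2^2,
\qquad
\Delta^{K-1} \;=\; \Bigl\{ \pi \in \mathbb{R}^{K} \;:\; \pi_k \ge 0,\ \sum_{k=1}^{K} \pi_k = 1 \Bigr\}.
```

The simplex constraint is what separates this from a plain matrix inverse: it keeps the estimate a valid probability vector even when $C$ is poorly conditioned.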

Novelty

This research is the first to explicitly formulate large model data mixture estimation as a label-shift inverse problem, moving beyond traditional membership inference techniques. Unlike prior work that relies on sample-level detection or dataset aggregation, LLMSurgeon uses a calibration-based linear inversion to estimate global domain proportions solely from generated texts. This enables macro-level auditing without access to training data or internal model parameters. The LLMScan benchmark further distinguishes the work by providing real, large-scale training data distributions for rigorous evaluation.

Limitations

  • The method assumes that domain conditional distributions are invariant, which may not hold in cases of significant domain shift during training or fine-tuning, potentially biasing the estimates.
  • The accuracy depends heavily on the classifier's performance; in scenarios with high semantic overlap or poorly calibrated classifiers, the inverse problem may become ill-conditioned, reducing reliability.
  • Fine-grained classification tasks (e.g., distinguishing 87 programming languages) face challenges due to semantic similarity, leading to unstable inverse solutions. Further regularization or domain merging strategies are needed.
  • The approach is primarily validated on static models with fixed training data; extending to models with evolving data or multi-stage training remains an open challenge.
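The ill-conditioning mentioned in the limitations above can be made concrete with a toy family of confusion matrices. The construction below is hypothetical (a fraction `eps` of each row's probability mass is spread uniformly over all domains) and is chosen only because its condition number has a simple closed form, 1 / (1 - eps).

```python
import numpy as np

def overlap_confusion(k, eps):
    """Hypothetical k-domain confusion matrix: eps is the fraction of
    probability mass spread uniformly over all domains."""
    return (1.0 - eps) * np.eye(k) + eps * np.full((k, k), 1.0 / k)

rel_noise = 0.01  # assumed relative perturbation of the observed vector p_bar
for eps in (0.1, 0.5, 0.9, 0.99):
    cond = np.linalg.cond(overlap_confusion(6, eps))
    # Standard bound: relative error in pi <= cond * relative error in p_bar.
    print(f"overlap={eps:<4}  cond={cond:7.2f}  error bound={cond * rel_noise:.2f}")
```

As domains become harder to tell apart, the same 1% noise in the observed distribution can translate into a much larger error in the recovered mixture, which is why regularization or domain merging becomes necessary at fine granularity.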

AI Executive Summary

The rapid proliferation of large language models (LLMs) has revolutionized natural language processing, enabling unprecedented capabilities in reasoning, coding, and knowledge synthesis. However, the opacity surrounding their training data sources poses significant challenges for transparency, accountability, and governance. Traditional privacy attacks like Membership Inference Attack (MIA) can identify whether specific samples were seen during training but fall short in revealing the overall data distribution or domain composition of the training corpus.

In response, Luo et al. introduce LLMSurgeon, a novel framework that approaches the problem of estimating the domain-level data mixture of a trained LLM as an inverse problem under the label-shift assumption. This assumption posits that while the proportions of different data domains may shift between the training set and the model's generated outputs, the underlying linguistic features within each domain remain statistically invariant. Leveraging this insight, LLMSurgeon employs an external classifier trained on known reference data to characterize systematic biases via a confusion matrix, which is then used to calibrate the observed distribution of generated texts.
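The label-shift assumption is what makes the problem linear. As a sketch, with notation assumed for illustration ($y$ the true domain of a text $x$, $\hat{y}$ the classifier's predicted domain, $\pi_i$ the latent proportion of domain $i$, $K$ the number of domains): if $p(x \mid y)$ is the same for reference data and generated text, then so is $P(\hat{y} = j \mid y = i)$, and the law of total probability gives

```latex
\bar{p}_j \;=\; P_{\text{gen}}(\hat{y} = j)
\;=\; \sum_{i=1}^{K} P(\hat{y} = j \mid y = i)\,\pi_i
\;=\; \sum_{i=1}^{K} C_{ij}\,\pi_i
\quad\Longleftrightarrow\quad
\bar{p} \;=\; C^{\top}\pi .
```

The observed distribution is therefore a linear function of the unknown mixture, and $C$ can be estimated once from labeled reference data.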

The core innovation lies in formulating the recovery of the latent data mixture as a constrained linear inverse problem. By solving this inverse with regularization and calibration, the method estimates the true domain proportions without requiring access to the training data or internal model parameters. To evaluate the approach, the authors develop the LLMScan benchmark, comprising models with publicly documented training data and spanning coarse, mid, and fine granularities, including 87 programming languages. Experimental results show that LLMSurgeon achieves a mean overlap accuracy of 94.46%, outperforming naive aggregation baselines and previous methods.

This work marks a significant advancement in AI transparency, providing a practical, post-hoc tool for auditing the digital DNA of foundation models. Its implications extend to model safety, bias detection, copyright compliance, and responsible AI deployment. The robustness of the method across different model scales, training dynamics, and domain definitions underscores its potential as a standard auditing framework. Despite some limitations related to domain invariance assumptions and semantic overlaps, LLMSurgeon opens new avenues for understanding and verifying large models' training data, fostering greater trust and accountability in AI systems.

Deep Analysis

Background

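In recent years, the spread of the Transformer architecture and rapid progress in pretraining have enabled large language models (such as the GPT series, LLaMA, Pythia, and StarCoder) to achieve breakthroughs in natural language processing, code generation, knowledge reasoning, and other areas. Their success depends largely on massive training corpora drawn from many sources, including web pages, books, academic papers, and code repositories. Early work such as OpenAI's GPT-3 (2020) and Meta's LLaMA (2023) emphasized the importance of data diversity, but also raised concerns about data provenance, bias, copyright, and privacy. Traditional model auditing methods rely mainly on access to training data or model parameters, which creates privacy-leakage risks and does not work for black-box models. More recently, techniques such as Membership Inference Attacks (MIA) have tried to reveal whether a model has memorized specific samples, but they struggle to provide macro-level information about training data composition. Researchers have therefore turned to macro-level estimation of data distributions, aiming to infer the domain proportions of a model's training data without access to the raw data.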

Core Problem

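The core question is: with only black-box access to a model and no access to its training data, how can one accurately estimate the domain proportions of its pretraining data? Traditional approaches such as instance-based membership inference give only sample-level presence judgments and cannot reflect the overall structure of the data. Existing statistical methods mostly rely on micro-level analysis of model outputs; they are limited by sample noise, semantic overlap, and the difficulty of bias correction, which makes macro-level distribution estimation hard. In addition, the bias in generated text is affected by the sampling strategy, model fine-tuning, and adversarial samples, so directly aggregating classifier outputs deviates from the true distribution. The difficulty lies in designing a robust inversion mechanism that calibrates these biases and accurately reflects the true proportions of the training data, so that a model's data sources can be traced and accounted for.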

Innovation

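The main innovations are: 1) formalizing data mixture estimation as an inverse problem under label shift and using a calibrated soft confusion matrix to invert the macro-level distribution, which overcomes the limits of traditional instance-level analysis; 2) proposing the LLMScan benchmark, which provides real multi-domain training data distributions as the evaluation standard, so that results can be checked against ground truth; 3) introducing multi-granularity analysis and monitoring of training dynamics to examine how changes in the data during training affect the estimates, which improves the method's adaptability. These innovations combine linear inversion from statistics, calibration techniques, and large-scale text classification, offering a new approach to understanding the data behind large models.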

Methodology

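  • Train the classifier: train a multi-class text classifier (e.g., DistilBERT) on known reference datasets (e.g., C4, The Pile, StackExchange) and compute the bias matrix $C$, which captures the classifier's systematic bias across domains.
  • Sample the target model: generate text samples from the target model with neutral prompts, so that the generated distribution reflects the latent training-time domain proportions as closely as possible.
  • Classify: feed the generated texts to the classifier to obtain biased observations (soft predictions), forming the vector $\bar{p}$, a blurred estimate of the domain distribution of the generated text.
  • Solve the inverse problem: use the linear relation $\bar{p} \approx C^{\top}\pi$ to link the biased observations to the latent true proportions $\pi$, and recover the training-data domain proportions by solving a constrained linear optimization problem (e.g., least squares with probability constraints).
  • Calibrate and regularize: to handle semantic overlap between domains and poor conditioning of the bias matrix, apply regularization strategies (e.g., domain merging, smoothing) to stabilize the inversion.
  • Evaluate: measure estimation quality with Overlap Accuracy, mean absolute error (MAE), and the coefficient of determination (R²).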
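The metrics named in the last step could be computed along these lines. The source does not define Overlap Accuracy, so the sum-of-minima form below is an assumption (it equals one minus half the L1 distance between two probability vectors). MAE and R² follow their standard definitions, with the ground-truth mixture as the reference. The example vectors are hypothetical.

```python
def overlap_accuracy(true_pi, est_pi):
    """Shared probability mass: sum of element-wise minima (assumed definition)."""
    return sum(min(t, e) for t, e in zip(true_pi, est_pi))

def mean_absolute_error(true_pi, est_pi):
    """Average absolute per-domain difference."""
    return sum(abs(t - e) for t, e in zip(true_pi, est_pi)) / len(true_pi)

def r_squared(true_pi, est_pi):
    """Coefficient of determination of the estimate against the ground truth."""
    mean_t = sum(true_pi) / len(true_pi)
    ss_res = sum((t - e) ** 2 for t, e in zip(true_pi, est_pi))
    ss_tot = sum((t - mean_t) ** 2 for t in true_pi)
    return 1.0 - ss_res / ss_tot

true_pi = [0.60, 0.30, 0.10]   # hypothetical ground-truth mixture
est_pi = [0.55, 0.32, 0.13]    # hypothetical estimate

print(round(overlap_accuracy(true_pi, est_pi), 4))     # 0.95
print(round(mean_absolute_error(true_pi, est_pi), 4))  # 0.0333
print(round(r_squared(true_pi, est_pi), 4))            # 0.97
```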

Experiments

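The experimental design uses several public models (e.g., LLaMA-1, OLMo, Amber, Pythia, StarCoder, GPT-Neo) as target models, with the domain categories defined in their official pretraining reports (from 6 domains to 87 subdomains) as ground truth. Sampling strategies include neutral sampling and diverse styles, to keep the generated text representative. For each model, a classifier is trained, the bias matrix is computed, text is sampled, LLMSurgeon is applied for inversion, and the deviation between the reconstructed domain proportions and the ground truth is evaluated. Baselines include direct aggregation of classifier outputs and inversion without calibration. The main metrics are Overlap Accuracy (above 94%), MAE (below 0.02), and R² (close to 1). Ablation studies analyze how classifier type, domain granularity, sample size, sampling strategy, and inverse bias correction affect estimation accuracy.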

Results

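LLMSurgeon performs well at every granularity: reconstruction accuracy exceeds 99% at the coarse level (6 broad domains), reaches 94.46% at the mid level (17 subdomains), and remains at 30.37% at the fine level (87 programming languages), ahead of GradNorm (27.54%). Performance stays stable across model scales from 7B to 65B, supporting the method's robustness. Ablation studies show that classifier performance, sample size (≥1000), and inverse bias correction are the key factors affecting estimation accuracy. The study also finds that sensibly merging domain definitions (e.g., C4 with Common Crawl) is crucial for stability. Overall, LLMSurgeon achieves efficient and reliable macro-level estimation of data distributions on real models and diverse data settings, providing a new tool for auditing the data behind large models.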
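The effect of merging near-indistinguishable domains can be sketched on a toy confusion matrix. Everything here is hypothetical: the helper `merge_domains`, the equal row weights (which stand in for the two domains' relative share of the reference data), and the matrix values. The source states only that merging domains such as C4 and Common Crawl helps stability.

```python
import numpy as np

def merge_domains(C, i, j, w_i=0.5, w_j=0.5):
    """Collapse domains i and j of confusion matrix C into one (placed first).

    Columns are summed, since predicted mass for either domain counts for the
    merged one. Rows are averaged with weights w_i and w_j (an assumption).
    """
    keep = [k for k in range(C.shape[0]) if k not in (i, j)]
    cols = np.column_stack([C[:, i] + C[:, j]] + [C[:, k] for k in keep])
    merged_row = (w_i * cols[i] + w_j * cols[j]) / (w_i + w_j)
    return np.vstack([merged_row] + [cols[k] for k in keep])

# Domains 0 and 1 are nearly indistinguishable to the classifier;
# domain 2 is well separated.
C = np.array([[0.50, 0.45, 0.05],
              [0.45, 0.50, 0.05],
              [0.05, 0.05, 0.90]])
merged = merge_domains(C, 0, 1)

print(merged)
print("cond before:", round(float(np.linalg.cond(C)), 2))
print("cond after: ", round(float(np.linalg.cond(merged)), 2))
```

Merging trades resolution for stability: the two confusable domains can no longer be told apart in the estimate, but the inverse problem for the remaining taxonomy becomes far better conditioned.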

Applications

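The method can be applied to model safety auditing, bias detection, copyright tracing, and accountability. Companies and research institutions can use LLMSurgeon to analyze data composition after a model is released, without access to training data or model parameters, which improves model transparency. In the future, combining it with monitoring of training dynamics and multimodal data analysis may enable real-time tracking of the training process and tracing of data sources. The technique can also help identify potential sources of bias in a model and inform training data strategy, supporting progress on fairness and accountability.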

Limitations & Outlook

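The method assumes that domain-conditional distributions remain unchanged; if training involves significant semantic shift or data introduced in multiple stages, the inverted result may deviate from reality. Classifier performance also has a large effect on the estimates, and designing more robust classifiers for fine-grained tasks with ambiguous or overlapping semantics remains a challenge. The conditioning of the inverse problem depends strongly on semantic similarity between domains, so fine-grained settings are prone to instability and call for stronger regularization. Future work also needs to account for changes during training to improve adaptability in complex scenarios.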

Plain Language Accessible to non-experts

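Imagine a large factory that turns out many different products every day: furniture, electronics, clothing, and so on. You want to know what raw materials the factory actually uses. For example, what proportion of wood goes into the furniture, and which country do the chips in the electronics come from?

Because the factory's blueprint (the counterpart of the model's architecture) is complicated, tracing every piece of raw material directly is hard. So you hire an expert (the classifier) to look at the finished products and decide which category each belongs to (furniture, electronics, clothing). The expert may be biased, though: some electronics might be misjudged as furniture. You also have some reference samples (samples of known raw materials) that you can use to calibrate the expert's judgments.

Next, you have the factory produce a batch of products and the expert classifies them, giving rough proportions (say, 70% look like furniture, 20% like electronics, 10% like clothing). These proportions are distorted by the expert's bias, so you use a mathematical method (solving an inverse problem) to correct for it and estimate the proportions of raw materials the factory really uses. Even without seeing the factory's inventory, you get a good idea of which raw materials it uses and in what amounts. In effect, math and statistics let you work out the factory's secret. Clever, and practical too!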

ELI14 Explained like you're 14

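Imagine a huge kitchen where the chef cooks every day with all kinds of ingredients: vegetables, meat, seasonings, and so on. You want to know the proportions of ingredients the kitchen uses. How much is vegetables, and how much is meat? But you cannot walk in and check the pantry; you can only guess by tasting the chef's dishes.

So you bring in a tasting expert (like the classifier in the paper) to taste every dish and tell you roughly which category it falls into (heavy on vegetables, strong on meat). The expert may be biased, though: some well-seasoned dishes might be misjudged as meat dishes. You also have some reference recipes (with known ingredient proportions) that you can use to calibrate the expert's judgments.

You have the chef cook a batch of dishes, let the tasting expert classify them, and then combine the classifications with the calibration information to work backward to the ingredient proportions the kitchen actually uses. Even without seeing the pantry, you get a good idea of which ingredients are used and in what amounts. In effect, math and statistics let you work out the kitchen's secret. Clever, and practical too!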

Glossary

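Large Language Model

A deep-learning model that understands and generates natural language; it is trained on large amounts of text and has strong language understanding capabilities.

In the paper, this refers to models such as LLaMA and GPT, whose pretraining data composition underlies their behavior and capabilities.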

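Data Mixture Surgery

The task of estimating the domain proportions of a model's pretraining data via an inverse-problem approach, aiming to reveal the model's "digital DNA".

The core task formalized in the paper, used to audit the composition of a model's training data under black-box conditions.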

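Label Shift

The assumption that class proportions differ between training data and test or generated data, while the class-conditional distributions stay the same.

This assumption is the basis on which LLMSurgeon models the inverse problem; it ensures that domain proportions can be recovered by a calibrated linear inversion.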

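Soft Confusion Matrix

A probability matrix describing a classifier's systematic bias between classes, used to calibrate the classifier's outputs.

Used in the method to correct classifier bias and improve the accuracy of the inversion.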

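Inverse Problem

A mathematical problem in which input parameters are inferred from known outputs; common in signal processing, statistical inference, and related fields.

The paper turns data mixture estimation into a linear inverse problem and inverts the latent domain proportions through the calibration matrix.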

LLMScan

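A benchmark suite proposed in the paper, containing multiple public models and real data distributions, for evaluating data mixture estimation methods.

Used to verify LLMSurgeon's performance across granularities and model scales.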

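Overlap Accuracy

A metric that measures how much the estimated proportions overlap with the true proportions, reflecting the precision of the estimate.

Used as the paper's main performance metric, with a reported headline value of 94.46%.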

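Neutral Sampling

A sampling strategy intended to reduce stylistic bias in generated text and keep the generated distribution natural.

Used in the experiments so that the distribution of generated text reflects the latent training data as closely as possible.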

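Calibration

The process of adjusting model outputs or estimates so that they better match the true distribution.

In the method, classifier bias is corrected by calibrating the confusion matrix.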

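Linear Inversion

A technique that recovers latent parameters from biased observations by solving a system of linear equations.

The core technique, used to recover the true domain proportions from the biased observations.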

Abstract

The pretraining data mixture of Large Language Models (LLMs) constitutes their "digital DNA", shaping model behaviors, capabilities, and failure modes. Yet this composition is rarely disclosed, making post-hoc auditing of data combination or provenance difficult. In this work, we formalize $\textbf{Data Mixture Surgery (DMS)}$: given only generated text from a target LLM, estimate the domain-level distribution of its pretraining corpus under a predefined taxonomy. We propose $\textbf{LLMSurgeon}$, a strong framework that casts DMS as an inverse problem under the label-shift assumption. Rather than directly aggregating classifier outputs, LLMSurgeon estimates a calibrated $\textit{soft}$ confusion matrix and solves a constrained inverse problem to correct systematic domain confusion and recover the latent mixture prior. To evaluate, we introduce $\textbf{LLMScan}$, a recipe-verifiable evaluation suite built from open-source LLMs with transparent pretraining mixtures. Across LLMScan, LLMSurgeon recovers domain mixtures with high fidelity under fixed protocols. Our work presents a practical, post-hoc approach for auditing the digital DNA of foundation models without access to their training data.

cs.CL cs.AI cs.LG
