LLMSurgeon: Diagnosing Data Mixture of Large Language Models
LLMSurgeon formulates data mixture diagnosis as a label-shift inverse problem, achieving 94.46% accuracy on the LLMSurgeon benchmark.
Yaxin Luo, Jiacheng Cui, Xiaohan Zhao et al.
Proposes COMPOSE, a dual-graph framework that combines citation and formal theorem graphs to generate plausible future theorems, trained on 108K pairs and tested on 47K future papers.
David Busbib, Michael Werman
Proposes MedCase-Structured, a pipeline combining LLMs and terminology validation to generate HL7 FHIR R4 clinical datasets for diagnostic reasoning, with an 82.5% success rate.
Valentina Bui Muti, Eugénie Dulout, Ziquan Fu
Proposes Bidirectional Evolutionary Search (BES), combining forward candidate evolution with backward goal decomposition to enhance exploration and verification in language models.
Guowei Xu, Zhenting Qi, Huangyuan Su et al.
OmniVerifier-M1 employs symbolic bounding boxes and decoupled reinforcement learning to enhance visual verification accuracy, achieving 0.68 on ViVerBench.
Xinchen Zhang, Bowei Liu, Jiale Liu et al.
Proposes CAPO, a cross-annotator preference optimization method, enabling LLMs to learn and reproduce stable individual explanation behaviors, outperforming prompting and SFT.
Beiduo Chen, Pingjun Hong, Ziyun Zhang et al.
FluxMem models memory as a dynamically evolving heterogeneous graph with three stages, achieving state-of-the-art results in complex reasoning and web navigation tasks.
Jizhan Fang, Buqiang Xu, Zhixian Wang et al.
FinHarness reduces ASR to 15% on FinVault with 4.7× fewer advanced judge calls via an inline lifecycle safety harness.
Haoxuan Jia, Yang Liu, Bin Chong et al.
ConvexTok uses convex relaxations to optimize tokenisation, coming within 1% of optimal compression at a 128k vocabulary and significantly improving bits per byte (BpB).
Jan Tempus, Philip Whittington, Craig W. Schmidt et al.
Evaluates six commercial AI chatbots on 2,100 BBC news questions across six languages; the chatbots reach up to 95.6% accuracy on emerging facts.
Mirac Suzgun, Emily Shen, Federico Bianchi et al.
6B-parameter LLMs pretrained sequentially on Common Crawl show a 15% F1 improvement on KairosQA for temporal knowledge over shuffled baselines.
Pilchen Hippolyte, Fabre Romain, Signe Talla Franck et al.
LongMemEval-V2 evaluates long-term memory in agents, with AgentRunbook-C reaching 72.5% accuracy.
Di Wu, Zixiang Ji, Asmi Kawatkar et al.
Task-Adaptive Embedding Refinement via Test-time LLM Guidance improves zero-shot search and classification by up to 25%.
Ariel Gera, Shir Ashury-Tahan, Gal Bloch et al.
Audits LLM-generated political discourse across nine crisis events using a computational social science framework, finding it more negative and structurally consistent.
Gunjan, Sidahmed Benabderrahmane, Talal Rahwan
Q-DAPS estimates question difficulty by computing the entropy of plausibility scores, excelling on four QA datasets.
Jamshid Mozafari, Bhawna Piryani, Adam Jatowt
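The entropy computation mentioned in the Q-DAPS summary can be illustrated with a minimal sketch. The summary does not say how plausibility scores are produced or normalized, so this assumes non-negative scores over candidate answers that are rescaled to sum to 1, and it assumes that higher entropy indicates a harder question. The function name `plausibility_entropy` and the example scores are hypothetical.

```python
import math


def plausibility_entropy(scores):
    """Shannon entropy (in bits) of plausibility scores.

    Assumption: scores are non-negative and are normalized here to sum to 1.
    """
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must contain at least one positive value")
    probs = [s / total for s in scores]
    # Zero-probability candidates contribute nothing to the entropy.
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical scores: one dominant candidate vs. four equally plausible ones.
peaked = plausibility_entropy([0.97, 0.01, 0.01, 0.01])
flat = plausibility_entropy([0.25, 0.25, 0.25, 0.25])
print(peaked < flat)
```

Under these assumptions, a question whose candidates are all equally plausible gets the maximum entropy for that number of candidates, while a question with one dominant candidate scores close to zero.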
MedHopQA evaluates biomedical QA via multi-hop reasoning with 1,000 expert-curated question-answer pairs.
Rezarta Islamaj, Robert Leaman, Joey Chan et al.
Classifies sentiment and emotion in Indonesian e-commerce reviews using a multi-task BiLSTM and AutoML, achieving high accuracy.
Hermawan Manurung, Ibrahim Al-Kahfi, Ahmad Rizqi et al.
SeaEvo enhances algorithm discovery via strategy space evolution, achieving a 21% improvement in system optimization tasks.
Sichun Luo, Yi Huang, Haochen Luo et al.
Improves tabular retrieval robustness through representational stability, using centroid averaging to reduce format-specific variance.
Kushal Raj Bhandari, Adarsh Singh, Jianxi Gao et al.
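As a rough illustration of centroid averaging, the sketch below averages the embeddings of one table serialized in several formats into a single vector. The summary gives no implementation details, so the choice of formats, the toy two-dimensional vectors, and the helper name `centroid` are all assumptions rather than the authors' method.

```python
def centroid(vectors):
    """Element-wise mean of equal-length embedding vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


# Hypothetical embeddings of the same table in three serialization formats.
table_as_csv = [1.0, 0.2]
table_as_markdown = [0.8, 0.4]
table_as_json = [0.9, 0.0]

# One format-agnostic representation for the table.
stable = centroid([table_as_csv, table_as_markdown, table_as_json])
print(stable)
```

The idea is that whatever each format contributes individually is averaged out, so a query is matched against one representation per table instead of one per format.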
Reveals representational harms against Global Majority nationalities in LLM narratives by applying a QA model to 500,000 stories.
Ilana Nguyen, Harini Suresh, Thema Monroe-White et al.