Paper Insights - AI Arxiv Paper Analysis

cs.CL 2604.15203

MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events

MADE is a living benchmark for multi-label text classification with uncertainty quantification, built on medical device adverse event reports.

Raunak Agarwal, Markus Wenzel, Simon Baur et al.

2026-04-17 34

cs.CL 2604.15165

Fabricator or dynamic translator?

LLMs generate excessive content in translations; detection strategies improve translation quality.

Lisa Vasileva, Karin Sim

2026-04-16 31

cs.CL 2603.24580

Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA

Study finds that retrieval improvements in a RAG system do not guarantee better QA performance in AI policy analysis.

Saahil Mathur, Ryan David Rittner, Vedant Ajit Thakur et al.

2026-03-26 47

cs.CL 2603.24579

MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination

MARCH framework significantly reduces LLM hallucination using multi-agent reinforced self-check, enhancing factual consistency in an 8B parameter model.

Zhuo Li, Yupeng Zhang, Pengyu Cheng et al.

2026-03-26 222

cs.CL 2603.24472

Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

Self-distillation can degrade LLMs' reasoning in math by suppressing uncertainty expression.

Jeonghye Kim, Xufang Luo, Minbeom Kim et al.

2026-03-26 68

cs.CL 2603.22267

TiCo: Time-Controllable Training for Spoken Dialogue Models

TiCo method significantly enhances time control in dialogue models using Spoken Time Markers, reducing MAE to 4.54 seconds.

Kai-Wei Chang, Wei-Chih Chen, En-Pei Hu et al.

2026-03-24 69

cs.CL 2603.22241

MemDLM: Memory-Enhanced DLM Training

MemDLM embeds a simulated denoising process into training via bi-level optimization, enhancing DLM training efficiency and long-context understanding.

Zehua Pei, Hui-Ling Zhen, Weizhe Lin et al.

2026-03-24 44

cs.CL 2603.20161

Semantic Token Clustering for Efficient Uncertainty Quantification in Large Language Models

Semantic Token Clustering (STC) method achieves efficient uncertainty quantification in large language models, significantly reducing computational overhead.

Qi Cao, Andrew Gambardella, Takeshi Kojima et al.

2026-03-21 51

cs.CL 2603.20100

An Empirical Study of SFT-DPO Interaction and Parameterization in Small Language Models

Study of SFT-DPO interaction in small models reveals full fine-tuning outperforms LoRA.

Yuming Feng, Christy Yang

2026-03-21 64

cs.CL 2603.19223

F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World

F2LLM-v2 offers efficient multilingual embeddings using two-stage training and Matryoshka learning, supporting over 200 languages.

Ziyin Zhang, Zihan Liao, Hang Yu et al.

2026-03-20 56

cs.CL 2603.19220

Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation

Nemotron-Cascade 2 achieves top-tier reasoning with Cascade RL and multi-domain distillation in a 30B MoE model.

Zhuolin Yang, Zihan Liu, Yang Chen et al.

2026-03-20 54

cs.CL 2603.19152

VEPO: Variable Entropy Policy Optimization for Low-Resource Language Foundation Models

VEPO enhances translation quality and tokenization efficiency for low-resource languages using reinforcement learning with verifiable rewards.

Chonghan Liu, Yimin Du, Qi An et al.

2026-03-20 45

cs.CL 2603.17942

Efficient Training-Free Multi-Token Prediction via Embedding-Space Probing

Training-free multi-token prediction via embedding-space probing improves LLaMA3 acceptance length by 12%.

Raghavv Goel, Mukul Gagrani, Mingu Lee et al.

2026-03-19 97

cs.CL 2603.15619

Mixture-of-Depths Attention

Mixture-of-Depths Attention (MoDA) improves downstream task performance by 2.11% on a 1.5B-parameter model with only a 3.7% increase in FLOPs.

Lianghui Zhu, Yuxin Fang, Bencheng Liao et al.

2026-03-17 66

cs.CL 2603.15615

Mechanistic Origin of Moral Indifference in Language Models

Sparse Autoencoders are used to correct moral indifference in language models, achieving a 75% win rate on adversarial benchmarks.

Lingyu Li, Yan Teng, Yingchun Wang

2026-03-17 49

cs.CL 2603.15611

Code-A1: Adversarial Evolving of Code LLM and Test LLM via Reinforcement Learning

Code-A1 enhances code and test generation through an adversarial co-evolution framework.

Aozhe Wang, Yuchen Yan, Nan Zhou et al.

2026-03-17 54

cs.CL 2603.13201

Neuron-Aware Data Selection In Instruction Tuning For Large Language Models

NAIT framework uses neuron activation patterns to select instruction-tuning data efficiently, enhancing LLM performance.

Xin Chen, Junchao Wu, Shu Yang et al.

2026-03-14 71

cs.CL 2603.13154

ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation

ESG-Bench benchmarks hallucination in long-context ESG report analysis; task-specific Chain-of-Thought prompting strategies significantly reduce hallucinations.

Siqi Sun, Ben Peng Wu, Mali Jin et al.

2026-03-14 113

cs.CL 2603.13045

Mending the Holes: Mitigating Reward Hacking in Reinforcement Learning for Multilingual Translation

WALAR method enhances low-resource language translation using monolingual data, surpassing the LLaMAX model.

Yifeng Liu, Siqi Ouyang, Yatish Hosmane Revanasiddappa et al.

2026-03-13 58

cs.CL 2603.13038

Interpretable Semantic Gradients in SSD: A PCA Sweep Approach and a Case Study on AI Discourse

Proposes a PCA sweep method to optimize dimension selection in SSD, enhancing interpretability and stability.

Hubert Plisiecki, Maria Leniarska, Jan Piotrowski et al.

2026-03-13 51