Abstracting Cross-Domain Action Sequences into Interpretable Workflows

TL;DR

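WorkflowView uses LLMs to abstract low-level action sequences into high-level activities, reaching μsim=0.91 on zero-shot browser task reconstruction and weighted F1=0.90 on few-shot MOOC dropout prediction, which demonstrates cross-domain generalization.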

cs.AI 🔴 Advanced 2026-06-13

Gaurav Verma, Scott Counts


behavior abstraction, large language models, cross-domain, sequence modeling, privacy

Key Findings

Methodology

WorkflowView employs a hierarchical prompting framework that transforms raw timestamped behavior sequences into detailed natural language descriptions, then infers high-level activities and categories. The architecture consists of multiple prompt layers: the first generates comprehensive descriptions of actions, leveraging models like GPT-4 and Phi-4, without fine-tuning. The second layer uses these descriptions to infer user intents or tasks via prompt-based reasoning, while optional third layers perform classification or prediction tasks. This approach supports zero-shot and few-shot learning, effectively denoising behavior data and capturing semantic intent across diverse domains. The prompts are carefully designed to encode temporal and contextual cues, enabling the model to generalize across web logs, MOOC interactions, and document workflows. Extensive experiments validate the framework’s robustness, with metrics such as μsim=0.91 for task reconstruction and F1=0.90 for dropout prediction.
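The layered flow described above can be sketched as a chain of prompt calls in which each layer consumes the previous layer's output. This is a minimal illustration, not the authors' implementation: the prompt templates, the `abstract_workflow` function, and the `fake_llm` stand-in are all hypothetical, and a real deployment would replace `fake_llm` with a call to a model such as GPT-4 or Phi-4.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical prompt templates; the paper's actual prompts are not reproduced here.
DESCRIBE_PROMPT = (
    "Describe the following timestamped user actions in plain language, "
    "noting their order and timing:\n{events}"
)
INFER_PROMPT = (
    "Given this description of a user's actions, state the high-level "
    "activity the user is performing:\n{description}"
)
CLASSIFY_PROMPT = (
    "Assign the activity below to exactly one of these categories: "
    "{categories}.\nActivity: {activity}"
)


def abstract_workflow(
    events: List[str],
    llm: Callable[[str], str],
    categories: Optional[List[str]] = None,
) -> Dict[str, str]:
    """Run the layered prompts in order; each layer consumes the previous output."""
    # Layer 1: raw events -> natural language description.
    description = llm(DESCRIBE_PROMPT.format(events="\n".join(events)))
    # Layer 2: description -> high-level activity.
    activity = llm(INFER_PROMPT.format(description=description))
    result = {"description": description, "activity": activity}
    # Layer 3 (optional): activity -> one of the predefined categories.
    if categories:
        result["category"] = llm(
            CLASSIFY_PROMPT.format(
                categories=", ".join(categories), activity=activity
            )
        )
    return result


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    if prompt.startswith("Describe"):
        return "The user searched for shoes, opened a product page, then added it to the cart."
    if prompt.startswith("Given"):
        return "Shopping for shoes online"
    return "shopping"
```

Passing the model as a plain callable keeps the layering independent of any particular API client, so the same function could be pointed at different models or at a cached stub during testing.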

Key Results

In browser task reconstruction, WorkflowView achieved a semantic similarity of μsim=0.91, surpassing traditional statistical and deep learning models, demonstrating effective zero-shot inference.
In MOOC dropout prediction, with only five in-context examples, the model reached a weighted F1 score of 0.90, outperforming baseline models trained on thousands of labeled instances.
In analyzing Word document workflows, the method successfully categorized anonymized user actions, providing privacy-preserving high-level insights that support product improvements.
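For readers unfamiliar with the weighted F1 score reported for dropout prediction, the sketch below computes it in pure Python: per-class F1 is averaged with weights equal to each class's share of the true labels. The function name and the toy labels are illustrative and are not data from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def weighted_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Per-class F1 averaged with weights equal to each class's share of y_true."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    support = Counter(y_true)  # number of true instances per class
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        total += f1 * n_true
    return total / len(y_true)


# Toy labels: 1 = dropout, 0 = retained (illustrative only).
score = weighted_f1([1, 1, 1, 0], [1, 1, 0, 0])
```

In the toy case the dropout class has F1 = 0.8 (support 3) and the retained class has F1 = 2/3 (support 1), so the weighted score is (0.8 × 3 + 2/3 × 1) / 4 ≈ 0.767.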

Significance

This work addresses the longstanding challenge of noisy, low-level behavioral data interpretation by introducing a flexible, prompt-based hierarchical abstraction framework. It significantly enhances the ability to derive meaningful, high-level insights from raw logs without extensive annotation or fine-tuning, thus facilitating scalable, cross-domain behavior understanding. The approach also emphasizes privacy-preserving analysis, aligning with industry needs for secure data utilization. Its broad applicability across web, education, and productivity tools underscores its potential to transform user behavior analytics, enabling more intelligent, personalized digital experiences.

Technical Contribution

The paper introduces a novel hierarchical prompting architecture that leverages the zero-shot and few-shot capabilities of large language models, enabling behavior sequence abstraction without task-specific fine-tuning. The layered design effectively denoises raw data, captures semantic intent, and supports multi-task inference, representing a significant departure from traditional sequence modeling techniques. The framework’s modularity allows easy adaptation to new domains by modifying prompts, reducing deployment costs. The integration of prompt engineering with hierarchical reasoning provides a new paradigm for non-language sequential data understanding, expanding the scope of LLM applications beyond NLP.

Novelty

This research is pioneering in applying large language models to non-linguistic, low-level behavior sequences for high-level activity inference. Unlike prior works limited to natural language understanding or domain-specific fine-tuning, WorkflowView exploits the generalization power of pre-trained LLMs via prompt-based hierarchical reasoning. Its cross-domain, zero-shot, and few-shot capabilities represent a significant innovation, opening new avenues for behavior analysis in diverse applications without the need for extensive labeled data.

Limitations

The approach relies heavily on prompt quality; suboptimal prompts can reduce accuracy, necessitating iterative prompt engineering for different tasks.
Handling extreme noise or anomalous behaviors remains challenging, potentially requiring additional noise filtering techniques.
Large-scale LLMs incur high computational costs, limiting real-time deployment on resource-constrained devices. Future work should focus on model compression and efficiency improvements.

Future Work

Future research will explore automated prompt optimization via reinforcement learning or self-supervised methods to reduce manual tuning. Integrating multi-modal data such as images and audio could enrich behavior understanding. Developing lightweight models or distillation techniques will facilitate deployment in edge environments. Additionally, enhancing interpretability and user control over the abstraction process will foster greater trust and usability in practical systems.

AI Executive Summary

The proliferation of digital applications has led to an explosion of user interaction logs, capturing every click, keystroke, and mouse movement. While these logs offer a treasure trove of insights into user behavior, their raw form is often too granular, noisy, and domain-specific to directly inform product improvements or user experience design. Traditional approaches, such as statistical pattern mining or domain-specific deep learning models, struggle to generalize across different applications and are sensitive to noise, limiting their scalability and robustness.

In response to these challenges, this paper introduces WorkflowView, a hierarchical framework that leverages the powerful inference capabilities of large language models (LLMs) like GPT-4. The core idea is to transform low-level, timestamped behavior sequences into high-level, interpretable activities through a multi-layered prompting architecture. The first layer converts raw actions into detailed natural language descriptions, capturing temporal and contextual nuances. The second layer infers user intents or task categories from these descriptions, while optional subsequent layers perform classification or prediction tasks.

This approach is innovative because it exploits the zero-shot and few-shot learning abilities of LLMs, eliminating the need for extensive task-specific fine-tuning. The prompts are carefully designed to encode domain knowledge and temporal cues, enabling the model to denoise noisy data and generalize across diverse scenarios. The authors validate their framework through three experiments: reconstructing browser tasks with a semantic similarity of 0.91, predicting student dropout with a weighted F1 of 0.90 using only five examples, and analyzing AI tool usage in Word documents while preserving user privacy.

The results demonstrate that WorkflowView can reliably abstract behavior sequences across domains, providing high-quality, interpretable insights that support product optimization, privacy-preserving analytics, and automated decision-making. This work addresses critical limitations of existing methods by offering a flexible, scalable, and privacy-conscious solution for behavior understanding. Its potential impact spans industries from web browsing and online education to enterprise productivity tools, paving the way for smarter, more personalized digital experiences.

Looking ahead, future work will focus on optimizing prompts, reducing computational costs, and integrating multi-modal data sources. The ultimate goal is to embed such hierarchical behavior understanding deeply into logging infrastructures, enabling real-time, privacy-aware, and cross-domain intelligent systems that adapt seamlessly to evolving user behaviors and application contexts.

Deep Analysis

Background

Over the past decade, user behavior analysis has evolved from simple statistical techniques to sophisticated deep learning models. Early methods relied on frequent pattern mining (Mannila et al., 1997; Agrawal et al., 1993), which identified common sequences but lacked semantic understanding and were sensitive to noise. The advent of recurrent neural networks (Hochreiter and Schmidhuber, 1997) and transformers (Vaswani et al., 2017) enabled modeling temporal dependencies more effectively, yet these models often required large annotated datasets and domain-specific fine-tuning. Recent advances in pre-trained language models like BERT (Devlin et al., 2019) and GPT (Radford et al., 2019; OpenAI, 2024) have demonstrated remarkable transfer learning capabilities, inspiring efforts to adapt them for behavior log interpretation (Guo et al., 2021; Zhou et al., 2024). However, these approaches still face challenges in handling noisy, low-level data across diverse domains without extensive retraining. This context motivates the development of a universal, prompt-based hierarchical framework that leverages the generalization power of LLMs to interpret behavior sequences in a domain-agnostic manner.

Core Problem

Despite the availability of rich interaction logs from web browsers, MOOCs, and productivity tools, extracting meaningful high-level activities remains difficult. The core issues include the high noise level, the granularity of raw actions, and the domain-specific semantics that traditional models cannot easily capture. Existing methods either rely on handcrafted rules, which lack scalability, or require large annotated datasets for supervised learning, which are costly and inflexible. Moreover, models trained on one domain often fail to generalize to others, limiting their utility in real-world, multi-application environments. Addressing these bottlenecks requires a flexible, low-cost approach capable of understanding diverse behavior data without extensive retraining or annotation.

Innovation

This work introduces a hierarchical prompting framework that leverages the zero-shot and few-shot inference capabilities of LLMs to abstract behavior sequences. Key innovations include:

1) Multi-layer prompts: transforming raw actions into natural language descriptions, then inferring high-level activities, and finally categorizing tasks—each layer designed to denoise data and enhance semantic understanding.

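2) Cross-domain applicability: no fine-tuning is required; prompt design alone generalizes across browser, MOOC, and Word scenarios.

3) Few-shot learning support: a handful of examples is enough for high-accuracy prediction, which greatly reduces annotation cost.

4) Privacy protection: anonymization and aggregation strategies keep user data secure and meet industry privacy requirements.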
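The privacy item in the list above mentions anonymization and aggregation without giving details. One common way to combine the two is sketched here as an assumption, not as the authors' pipeline: user identifiers are replaced with salted hashes, and an activity is reported only when it was observed for at least `min_users` distinct users. The function names, the salt, and the threshold are hypothetical. Salted hashing is strictly pseudonymization, so a production system would need further safeguards.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pseudonymize(user_id: str, salt: str) -> str:
    """One-way salted SHA-256 digest used in place of the raw identifier."""
    return hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()


def aggregate_activities(
    records: Iterable[Tuple[str, str]], salt: str, min_users: int = 2
) -> Dict[str, int]:
    """Count distinct users per activity; drop activities seen for too few users."""
    users_by_activity = defaultdict(set)
    for user_id, activity in records:
        users_by_activity[activity].add(pseudonymize(user_id, salt))
    return {
        activity: len(users)
        for activity, users in users_by_activity.items()
        if len(users) >= min_users
    }


# Illustrative records of (user id, inferred high-level activity).
records = [
    ("u1", "drafting"),
    ("u2", "drafting"),
    ("u1", "drafting"),
    ("u3", "formatting"),
]
summary = aggregate_activities(records, salt="example-salt")
# summary == {"drafting": 2}; "formatting" is suppressed (only one user)
```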

Methodology

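Data preprocessing: raw behavior logs are converted into sequences of timestamped behavior events so that the data structure is uniform.
Prompt design:
First layer: a prompt describes the behavior events as detailed natural language sentences, emphasizing temporal relations and behavioral features (e.g., "the user clicked a button and waited 2 seconds").
Second layer: from these descriptions, the model infers the user's main task or intent (e.g., "browsing products") and produces a concise summary.
Third layer (optional): the inferred result is classified into predefined categories (e.g., "shopping", "browsing") or used for task prediction.
Model inference: GPT-4 or Phi-4 performs the multi-level reasoning through prompt input, without fine-tuning, in both zero-shot and few-shot settings.
Few-shot learning: a small number of examples in the prompt guide the model toward the characteristics of the task.
Evaluation metrics: semantic similarity (μsim), F1, recall, and precision measure abstraction and prediction performance.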
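The preprocessing step can be illustrated with a small serializer that orders events by time and makes the gaps between them explicit, which is the kind of temporal cue the first-layer prompt relies on. The input format (ISO timestamp plus action string) and the output line format are assumptions made for this sketch; the summary does not specify the paper's actual schema.

```python
from datetime import datetime
from typing import List, Tuple


def serialize_events(events: List[Tuple[str, str]]) -> List[str]:
    """Turn (ISO timestamp, action) pairs into ordered lines with elapsed-time gaps."""
    parsed = sorted((datetime.fromisoformat(ts), action) for ts, action in events)
    lines = []
    previous = None
    for moment, action in parsed:
        if previous is None:
            lines.append(f"[start] {action}")
        else:
            gap = (moment - previous).total_seconds()
            lines.append(f"[+{gap:.0f}s] {action}")
        previous = moment
    return lines


# Events may arrive out of order; the serializer sorts them first.
log = [
    ("2024-01-01T10:00:02", "click 'Add to cart'"),
    ("2024-01-01T10:00:00", "open product page"),
]
lines = serialize_events(log)
# lines == ["[start] open product page", "[+2s] click 'Add to cart'"]
```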

Experiments

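The experiments cover three scenarios:

1) Browser task reconstruction: on the Mind2Web dataset, which covers 137 websites, the model reaches μsim=0.91 in the zero-shot setting, outperforming traditional methods.

2) MOOC student dropout prediction: on the data of Feng et al. (2019), with only five examples, the model reaches F1=0.90, significantly outperforming the baselines.

3) Word document behavior classification: under privacy-preserving constraints, the model successfully identifies behavior categories, which supports product optimization.

All experiments use GPT-4 alongside other models and compare prompt designs and few-shot strategies to validate cross-domain generalization.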
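The few-shot setting used for dropout prediction amounts to placing a handful of labeled demonstrations ahead of the unlabeled query in a single prompt. The sketch below shows one way to assemble such a prompt; the instruction wording, the label names, and the example summaries are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    examples: List[Tuple[str, str]],
    query: str,
    labels: Tuple[str, ...] = ("dropout", "retained"),
) -> str:
    """Assemble an instruction, labeled demonstrations, and the unlabeled query."""
    for _, label in examples:
        if label not in labels:
            raise ValueError(f"unknown label: {label!r}")
    parts = [f"Classify each learner activity summary as one of: {', '.join(labels)}."]
    for summary, label in examples:
        parts.append(f"Summary: {summary}\nLabel: {label}")
    # The query ends with an open "Label:" slot for the model to complete.
    parts.append(f"Summary: {query}\nLabel:")
    return "\n\n".join(parts)


# Five illustrative demonstrations, matching the five-example setting.
demos = [
    ("Watched two videos in week 1, then no activity.", "dropout"),
    ("Submitted every assignment and posted in the forum weekly.", "retained"),
    ("Logged in once and never opened a video.", "dropout"),
    ("Paused and replayed lectures, completed all quizzes.", "retained"),
    ("Activity declined steadily and stopped after week 3.", "dropout"),
]
prompt = build_few_shot_prompt(demos, "Completed quizzes through week 5.")
```

Validating labels before building the prompt catches typos in the demonstrations early, which matters when only five examples carry the whole task definition.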

Results

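In browser task reconstruction, the model reaches μsim=0.91, surpassing traditional statistical and deep learning models while operating zero-shot. In MOOC dropout prediction, it reaches F1=0.90 in the few-shot setting, ahead of the baselines (F1≈0.84). In the Word scenario, the model successfully classifies behavior categories and provides privacy-preserving, high-level understanding. Overall, the model adapts well across tasks and domains, which supports its cross-domain generalization and practical potential.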

Applications

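The method applies broadly to user behavior analysis, product optimization, personalized recommendation, and privacy protection. Companies can use it to understand behavior at low cost and high efficiency without large amounts of labeled data. In the future, combining it with multi-modal information and reinforcement learning could advance intelligent interaction, automated monitoring, and personalized services.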

Limitations & Outlook

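The model is sensitive to prompt design, so prompts may need adjusting for different tasks, which adds tuning cost. Extreme noise or anomalous behavior can degrade the abstraction. The computational cost of large-scale LLMs is high, which limits their use in real-time systems. Future work needs to improve model efficiency and robustness so that the approach performs better in complex environments.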

Plain Language Accessible to non-experts

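Imagine you work in a large factory with many different machines that run all day. The managers want to know the state of every machine, but the signals the machines emit are messy: some mean the machine is working, others mean a fault or maintenance. In the past, the factory recorded these signals by hand and used statistics to find problems, but that was slow and error-prone.

Now suppose there is a smart assistant that can understand these signals the way a person would. It turns each signal into a sentence, such as "the machine is heating up" or "a fault has occurred", and then uses those descriptions to infer the overall production state, such as "production is normal" or "repairs are needed". The assistant can also adjust how it interprets signals for the machine types in different factories, without being retrained for each one.

This assistant works as if it had a trained brain of its own: it can quickly make sense of complex, messy signals in different factories and turn tedious machine noise into a clear production report. Managers can spot problems faster and improve efficiency without worrying that the data is too messy or too technical. It makes complicated machine signals as easy to understand as human speech, helping the factory become smarter and more efficient.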

ELI14 Explained like you're 14

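Imagine you are playing a game on the school playground. The rules are complicated and there are many different moves, such as running, jumping, throwing, and dodging. Every time you move, a teacher writes it down in great detail, for example "you ran 10 meters in 3 seconds, jumped 1 meter high, and threw a ball". Looking only at these details, it is hard to tell which game you are playing or whether you are winning.

Now suppose you have a clever friend who can turn all these detailed moves into one simple sentence, such as "you are playing catch" or "you are in a running race". From those simple sentences, this friend can also tell you whether you are winning or what you need to practice. It is like having a very smart robot that turns all the complicated moves into a sentence and then helps you understand what the whole game means.

You do not have to teach this robot how to play or explain the rules. It just watches what you do and works out what you are doing and whether you are winning. It is like having a very smart friend who always helps you understand complicated things.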

Glossary

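Large Language Model (LLM)

A deep learning model trained on massive amounts of text that can understand and generate natural language and supports multi-task reasoning and few-shot learning.

In this paper, models such as GPT-4 perform multi-level reasoning over behavior sequences.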

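Hierarchical Reasoning

A way of processing information in layers, abstracting step by step from low-level behavior to high-level intent, which improves interpretability and robustness.

The multi-layer prompt architecture designed in this paper uses this technique.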

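Prompt Engineering

Designing specific input prompts that guide a model to complete a given task without fine-tuning.

This paper uses prompt design to achieve zero-shot and few-shot behavior abstraction.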

Zero-shot Learning

A model's ability to complete a new task without any task-specific training examples.

WorkflowView performs zero-shot inference on several tasks.

Few-shot Learning

A model's ability to adapt quickly to a new task from a small number of examples.

The model uses only five examples for MOOC dropout prediction.

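Semantic Similarity

A measure of how close two texts are in meaning, commonly computed with metrics such as cosine similarity.

Used to evaluate how well generated task descriptions match the ground-truth descriptions.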
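As a concrete reading of this entry, the sketch below averages cosine similarity over pairs of embedding vectors, one pair per (generated, reference) description. Treating μsim as a mean of pairwise cosine similarities is an assumption suggested by the notation; the summary does not state the exact formula or the embedding model, so the vectors here are toy values.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_similarity(
    pairs: List[Tuple[Sequence[float], Sequence[float]]]
) -> float:
    """Average cosine similarity over (generated, reference) embedding pairs."""
    if not pairs:
        raise ValueError("at least one pair is required")
    return sum(cosine_similarity(g, r) for g, r in pairs) / len(pairs)


# Toy embeddings: one identical pair and one orthogonal pair.
toy_pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
mu_sim = mean_similarity(toy_pairs)
# mu_sim == 0.5
```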

Behavior Sequence

The time-ordered series of behavior events produced by a user.

The core data type analyzed in this paper.

Task Reconstruction

Inferring a user's specific task or intent from behavior data.

Applied to browser logs.

Privacy-preserving Analysis

Data analysis that uses anonymization or aggregation techniques to protect user privacy.

Applied to Word document workflows.

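Prompt Tuning

A technique for improving a model's task performance by optimizing prompt content.

Future work may combine it with self-supervised prompt optimization.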

Abstract

Sequential or time-stamped interaction logs provide objective records of digital application usage, yet their granularity and noise often obscure meaningful insights into people's work. Such insights are essential for improving digital products in ways grounded in real-world user interactions. Prior research has applied deep learning models to cluster user actions into high-level activities, but these approaches are highly sensitive to noise and struggle to generalize across applications. To address this limitation, we introduce WorkflowView, a framework that uses large language models (LLMs) to abstract low-level action sequences into high-level activities. We establish the effectiveness and generality of our approach across three distinct, challenging sequential tasks and diverse domains: (a) zero-shot task description reconstruction from browser logs (achieving high semantic similarity, $μ_{sim} = 0.91$), (b) few-shot student dropout prediction using MOOC interaction logs (reaching weighted $F_1 = 0.90$ with only five few-shot examples), and (c) anonymized, privacy-preserving analysis of AI tool integration within document workflows in Microsoft Word. Our work demonstrates that LLM-based abstraction is a robust and efficient path forward for transforming low-level behavioral data into high-level, interpretable, and actionable insights. We also discuss practical considerations for deploying LLM-based inferences within logging infrastructures, including computational efficiency and user privacy.

cs.AI cs.CL cs.LG

References (20)

Deep Learning-Based Method for Predicting Student Dropouts in MOOCs

Shu Yang, YinFeng Xiao, Fei Meng

2024

Lost in the Middle: How Language Models Use Long Contexts

Nelson F. Liu, Kevin Lin, John Hewitt et al.

2023

gpt-oss-120b & gpt-oss-20b Model Card

OpenAI, Sandhini Agarwal, L. Ahmad, Jason Ai et al.

2025

Understanding Dropouts in MOOCs

Wenzheng Feng, Jie Tang, T. Liu

2019

Sequence to Sequence Learning with Neural Networks

I. Sutskever, O. Vinyals, Quoc V. Le

2014

Stuck? No worries!: Task-aware Command Recommendation and Proactive Help for Analysts

Aadhavan M. Nambhi, Bhanu Prakash Reddy Guda, Aarsh Prakash Agarwal et al.

2019

Mining sequential patterns

R. Agrawal, R. Srikant

1995

An LSTM Based System for Prediction of Human Activities with Durations

Kundan Krishna, Deepali Jain, Sanket Vaibhav Mehta et al.

2018

LSTPrompt: Large Language Models as Zero-Shot Time Series Forecasters by Long-Short-Term Prompting

Haoxin Liu, Zhiyuan Zhao, Jindong Wang et al.

2024

Vellum

2021

Cross-Modal Projection in Multimodal LLMs Doesn't Really Project Visual Attributes to Textual Space

Gaurav Verma, Minje Choi, Kartik Sharma et al.

2024

CLSA: A novel deep learning model for MOOC dropout prediction

Qian Fu, Zhanghao Gao, Junyi Zhou et al.

2021

Mining association rules between sets of items in large databases

R. Agrawal, T. Imielinski, A. Swami

1993

Leveraging Pre-trained Checkpoints for Sequence Generation Tasks

S. Rothe, Shashi Narayan, A. Severyn

2019

AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model

Seungwhan Moon, Andrea Madotto, Zhaojiang Lin et al.

2023

Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?

Sewon Min, Xinxi Lyu, Ari Holtzman et al.

2022

Efficient Estimation of Word Representations in Vector Space

Tomas Mikolov, Kai Chen, G. Corrado et al.

2013

The Hierarchical Hidden Markov Model: Analysis and Applications

Shai Fine, Y. Singer, Naftali Tishby

1998

Mining long sequential patterns in a noisy environment

Jiong Yang, Wei Wang, Philip S. Yu et al.

2002

Identifying Frequent User Tasks from Application Logs

Himel Dev, Zhicheng Liu

2017

