Bridging Multimodal Learning and Planning for Intelligent Task Assistance
dc.contributor.author | Tabassum, Afrina | en |
dc.contributor.committeechair | Eldardiry, Hoda Mohamed | en |
dc.contributor.committeechair | Lourentzou, Ismini | en |
dc.contributor.committeemember | Jin, Ran | en |
dc.contributor.committeemember | Thomas, Christopher Lee | en |
dc.contributor.committeemember | Huang, Jia-Bin | en |
dc.contributor.department | Computer Science & Applications | en |
dc.date.accessioned | 2025-05-15T08:00:49Z | en |
dc.date.available | 2025-05-15T08:00:49Z | en |
dc.date.issued | 2025-05-14 | en |
dc.description.abstract | Task-assistance systems provide adaptive, multimodal guidance for complex, step-based activities such as cooking and DIY projects. A central challenge lies in enabling these systems to interpret real-world scenarios—understanding user intent from verbal, visual, or textual cues and generating coherent, multimodal instructions enriched with relevant visuals. To tackle this, modern systems leverage advanced machine learning techniques, from representation learning that processes information from diverse modalities (e.g., text, images, audio) to procedural planning that delivers dynamic, context-driven guidance, enabling systems to offer precise, real-time assistance tailored to user needs. This work addresses core challenges in representation learning and multimodal planning through three key contributions. First, we introduce a modality-agnostic contrastive learning framework that optimizes negative sample selection by jointly balancing anchor similarity, influence, and diversity, improving generalization across vision, language, and graph tasks. Second, we propose a tuning strategy for masked audio models that leverages unsupervised audio mixtures to enhance adaptation to downstream tasks with limited labeled data, such as few-shot learning. Third, we present a zero-shot framework for generating multimodal procedural plans with explicit object-state consistency, paired with two novel evaluation metrics and an evaluation task to assess planning accuracy, cross-modal alignment, and temporal coherence. These contributions are integrated into a context-aware multimodal task assistant, empirically validated through real-world user studies. Our work establishes a foundation for more robust, adaptable, and user-centric task-assistance systems, bridging critical gaps in multimodal understanding and guidance. | en |
dc.description.abstractgeneral | Multimodal task assistants guide users through complex tasks like cooking a new recipe or assembling furniture by understanding their voice and providing clear, step-by-step instructions with helpful visuals. However, designing intelligent assistants that can accurately interpret real-world situations and provide clear, useful guidance remains a challenge. Our research focuses on improving how these systems learn from diverse modalities and generate multimodal instructions. We introduce three key innovations: (1) a new learning method that helps the system understand information across multiple formats, such as text, images, and graphs, without any label information; (2) an improved approach for training audio-based models, allowing them to perform well even with limited labeled data; and (3) a strategy for generating detailed, step-by-step instructions (both textual and visual) that ensures consistency between objects and actions. These improvements are combined into a smart, context-aware task assistant that has been tested with real users. By advancing how systems understand and guide users through complex tasks, our work brings us closer to more helpful, intelligent digital assistants for everyday activities. | en |
dc.description.degree | Doctor of Philosophy | en |
dc.format.medium | ETD | en |
dc.identifier.other | vt_gsexam:42996 | en |
dc.identifier.uri | https://hdl.handle.net/10919/132474 | en |
dc.language.iso | en | en |
dc.publisher | Virginia Tech | en |
dc.rights | In Copyright | en |
dc.rights.uri | http://rightsstatements.org/vocab/InC/1.0/ | en |
dc.subject | self-supervised learning | en |
dc.subject | multimodal procedural planning | en |
dc.title | Bridging Multimodal Learning and Planning for Intelligent Task Assistance | en |
dc.type | Dissertation | en |
thesis.degree.discipline | Computer Science & Applications | en |
thesis.degree.grantor | Virginia Polytechnic Institute and State University | en |
thesis.degree.level | doctoral | en |
thesis.degree.name | Doctor of Philosophy | en |