Bridging Multimodal Learning and Planning for Intelligent Task Assistance
dc.contributor.author | Tabassum, Afrina | en |
dc.contributor.committeechair | Eldardiry, Hoda Mohamed | en |
dc.contributor.committeechair | Lourentzou, Ismini | en |
dc.contributor.committeemember | Jin, Ran | en |
dc.contributor.committeemember | Thomas, Christopher Lee | en |
dc.contributor.committeemember | Huang, Jia-Bin | en |
dc.contributor.department | Computer Science & Applications | en |
dc.date.accessioned | 2025-05-15T08:00:49Z | en |
dc.date.available | 2025-05-15T08:00:49Z | en |
dc.date.issued | 2025-05-14 | en |
dc.description.abstract | Task-assistance systems provide adaptive, multimodal guidance for complex, step-based activities such as cooking and DIY projects. A central challenge lies in enabling these systems to interpret real-world scenarios—understanding user intent from verbal, visual, or textual cues and generating coherent, multimodal instructions enriched with relevant visuals. To tackle this, modern systems leverage advanced machine learning techniques, from representation learning that processes information from diverse modalities (e.g., text, images, audio) to procedural planning that delivers dynamic, context-driven guidance, enabling systems to offer precise, real-time assistance tailored to user needs. This work addresses core challenges in representation learning and multimodal planning through three key contributions. First, we introduce a modality-agnostic contrastive learning framework that optimizes negative sample selection by jointly balancing anchor similarity, influence, and diversity, improving generalization across vision, language, and graph tasks. Second, we propose a tuning strategy for masked audio models that leverages unsupervised audio mixtures to enhance adaptation to downstream tasks with limited labeled data, such as few-shot learning. Third, we present a zero-shot framework for generating multimodal procedural plans with explicit object-state consistency, paired with two novel evaluation metrics and an evaluation task to assess planning accuracy, cross-modal alignment, and temporal coherence. These contributions are integrated into a context-aware multimodal task assistant, empirically validated through real-world user studies. Our work establishes a foundation for more robust, adaptable, and user-centric task-assistance systems, bridging critical gaps in multimodal understanding and guidance. | en |
dc.description.abstractgeneral | Multimodal task assistants guide users through complex tasks like cooking a new recipe or assembling furniture by understanding their voice and providing clear, step-by-step instructions with helpful visuals. However, designing intelligent assistants that can accurately interpret real-world situations and provide clear, useful guidance remains a challenge. Our research focuses on improving how these systems learn from diverse modalities and generate multimodal instructions. We introduce three key innovations: (1) a new learning method that helps the system understand information across multiple formats, such as text, images, and graphs, without any label information; (2) an improved approach for training audio-based models, allowing them to perform well even with limited labeled data; and (3) a strategy for generating detailed, step-by-step instructions (both textual and visual) that ensures consistency between objects and actions. These improvements are combined into a smart, context-aware task assistant that has been tested with real users. By advancing how systems understand and guide users through complex tasks, our work brings us closer to more helpful, intelligent digital assistants for everyday activities. | en |
dc.description.degree | Doctor of Philosophy | en |
dc.format.medium | ETD | en |
dc.identifier.other | vt_gsexam:42996 | en |
dc.identifier.uri | https://hdl.handle.net/10919/132474 | en |
dc.language.iso | en | en |
dc.publisher | Virginia Tech | en |
dc.rights | In Copyright | en |
dc.rights.uri | http://rightsstatements.org/vocab/InC/1.0/ | en |
dc.subject | self-supervised learning | en |
dc.subject | multimodal procedural planning | en |
dc.title | Bridging Multimodal Learning and Planning for Intelligent Task Assistance | en |
dc.type | Dissertation | en |
thesis.degree.discipline | Computer Science & Applications | en |
thesis.degree.grantor | Virginia Polytechnic Institute and State University | en |
thesis.degree.level | doctoral | en |
thesis.degree.name | Doctor of Philosophy | en |