Bridging Multimodal Learning and Planning for Intelligent Task Assistance

dc.contributor.author: Tabassum, Afrina
dc.contributor.committeechair: Eldardiry, Hoda Mohamed
dc.contributor.committeechair: Lourentzou, Ismini
dc.contributor.committeemember: Jin, Ran
dc.contributor.committeemember: Thomas, Christopher Lee
dc.contributor.committeemember: Huang, Jia-Bin
dc.contributor.department: Computer Science & Applications
dc.date.accessioned: 2025-05-15T08:00:49Z
dc.date.available: 2025-05-15T08:00:49Z
dc.date.issued: 2025-05-14
dc.description.abstract: Task-assistance systems provide adaptive, multimodal guidance for complex, step-based activities such as cooking and DIY projects. A central challenge lies in enabling these systems to interpret real-world scenarios: understanding user intent from verbal, visual, or textual cues and generating coherent, multimodal instructions enriched with relevant visuals. To tackle this, modern systems leverage advanced machine learning techniques, from representation learning that processes information from diverse modalities (e.g., text, images, audio) to procedural planning that delivers dynamic, context-driven guidance, enabling precise, real-time assistance tailored to user needs. This work addresses core challenges in representation learning and multimodal planning through three key contributions. First, we introduce a modality-agnostic contrastive learning framework that optimizes negative sample selection by jointly balancing anchor similarity, influence, and diversity, improving generalization across vision, language, and graph tasks. Second, we propose a tuning strategy for masked audio models that leverages unsupervised audio mixtures to enhance adaptation to downstream tasks with limited labeled data, such as few-shot learning. Third, we present a zero-shot framework for generating multimodal procedural plans with explicit object-state consistency, paired with two novel evaluation metrics and an evaluation task to assess planning accuracy, cross-modal alignment, and temporal coherence. These contributions are integrated into a context-aware multimodal task assistant, empirically validated through real-world user studies. Our work establishes a foundation for more robust, adaptable, and user-centric task-assistance systems, bridging critical gaps in multimodal understanding and guidance.
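As a concrete illustration of the first contribution, the sketch below scores candidate negatives by anchor similarity, an influence proxy, and diversity, then greedily picks the top-k. This is a minimal sketch under assumed design choices, not the dissertation's actual algorithm: the InfoNCE-style influence proxy, the 0.07 temperature, and the weights alpha, beta, and gamma are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def select_negatives(anchor, candidates, k=8, alpha=1.0, beta=1.0, gamma=1.0):
    # Hypothetical negative-selection sketch balancing (i) anchor
    # similarity (hard negatives), (ii) an influence proxy, and
    # (iii) diversity among the picks. Scoring terms and weights are
    # illustrative assumptions, not the dissertation's formulation.
    anchor = F.normalize(anchor, dim=-1)        # (d,)
    cands = F.normalize(candidates, dim=-1)     # (n, d)
    sim = cands @ anchor                        # cosine similarity to anchor
    # Influence proxy: the softmax weight each candidate would receive
    # in an InfoNCE denominator (temperature 0.07 assumed).
    influence = torch.softmax(sim / 0.07, dim=0)
    selected = []
    for _ in range(k):
        score = alpha * sim + beta * influence
        if selected:
            # Penalize redundancy with already-selected negatives.
            chosen = cands[selected]                         # (s, d)
            redundancy = (cands @ chosen.T).max(dim=1).values
            score = score - gamma * redundancy
            score[selected] = float("-inf")                  # no re-picks
        selected.append(int(score.argmax()))
    return selected  # indices into `candidates`
```

Because the scoring operates only on embedding vectors, the same routine could in principle be applied to outputs of any modality encoder (vision, language, or graph), consistent with the modality-agnostic claim above.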
dc.description.abstractgeneral: Multimodal task assistants guide users through complex tasks, like cooking a new recipe or assembling furniture, by understanding their spoken requests and providing clear, step-by-step instructions with helpful visuals. However, designing intelligent assistants that can accurately interpret real-world situations and provide clear, useful guidance remains a challenge. Our research focuses on improving how these systems learn from diverse modalities and generate multimodal instructions. We introduce three key innovations: (1) a new learning method that helps the system understand information across multiple formats, such as text, images, and graphs, without requiring any labeled data; (2) an improved approach for training audio-based models, allowing them to perform well even with limited labeled data; and (3) a strategy for generating detailed, step-by-step instructions (both textual and visual) that keeps objects and their states consistent from step to step. These improvements are combined into a smart, context-aware task assistant that has been tested with real users. By advancing how systems understand and guide users through complex tasks, our work brings us closer to more helpful, intelligent digital assistants for everyday activities.
dc.description.degree: Doctor of Philosophy
dc.format.medium: ETD
dc.identifier.other: vt_gsexam:42996
dc.identifier.uri: https://hdl.handle.net/10919/132474
dc.language.iso: en
dc.publisher: Virginia Tech
dc.rights: In Copyright
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: self-supervised learning
dc.subject: multimodal procedural planning
dc.title: Bridging Multimodal Learning and Planning for Intelligent Task Assistance
dc.type: Dissertation
thesis.degree.discipline: Computer Science & Applications
thesis.degree.grantor: Virginia Polytechnic Institute and State University
thesis.degree.level: doctoral
thesis.degree.name: Doctor of Philosophy

Files

Original bundle
Name: Tabassum_A_D_2025.pdf
Size: 23.34 MB
Format: Adobe Portable Document Format