Bridging Multimodal Learning and Planning for Intelligent Task Assistance

Date

2025-05-14

Publisher

Virginia Tech

Abstract

Task-assistance systems provide adaptive, multimodal guidance for complex, step-based activities such as cooking and DIY projects. A central challenge lies in enabling these systems to interpret real-world scenarios—understanding user intent from verbal, visual, or textual cues and generating coherent, multimodal instructions enriched with relevant visuals. To tackle this, modern systems leverage advanced machine learning techniques, from representation learning that processes information from diverse modalities (e.g., text, images, audio) to procedural planning, which provides dynamic, context-driven guidance, enabling systems to deliver precise, real-time assistance tailored to user needs. This work addresses core challenges in representation learning and multimodal planning through three key contributions. First, we introduce a modality-agnostic contrastive learning framework that optimizes negative sample selection by jointly balancing anchor similarity, influence, and diversity, improving generalization across vision, language, and graph tasks. Second, we propose a tuning strategy for masked audio models that leverages unsupervised audio mixtures to enhance adaptation to downstream tasks with limited labeled data, such as few-shot learning. Third, we present a zero-shot framework for generating multimodal procedural plans with explicit object-state consistency, paired with two novel evaluation metrics and an evaluation task to assess planning accuracy, cross-modal alignment, and temporal coherence. These contributions are integrated into a context-aware multimodal task assistant, empirically validated through real-world user studies. Our work establishes a foundation for more robust, adaptable, and user-centric task-assistance systems, bridging critical gaps in multimodal understanding and guidance.
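To make the first contribution concrete, the following is a minimal, hypothetical sketch of what jointly balancing anchor similarity, influence, and diversity during negative sample selection could look like. The function name, weighting scheme, and influence proxy are illustrative assumptions for exposition, not the method proposed in this work.

```python
# Hypothetical sketch: greedily score candidate negatives for contrastive learning
# by balancing (1) similarity to the anchor, (2) a proxy for influence on the loss,
# and (3) diversity within the selected set. All weights and scoring rules are
# assumptions made for illustration.
import numpy as np

def select_negatives(anchor, candidates, k=16, w_sim=1.0, w_inf=1.0, w_div=1.0):
    """Pick k negatives that balance anchor similarity, influence, and diversity."""
    # Cosine similarity of each candidate to the anchor (higher = harder negative).
    sims = candidates @ anchor / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(anchor) + 1e-8
    )
    # Crude influence proxy: sigmoid of the similarity, mimicking the gradient
    # magnitude a hard negative would contribute to an InfoNCE-style loss.
    influence = 1.0 / (1.0 + np.exp(-sims))

    selected = []
    for _ in range(k):
        best_idx, best_score = None, -np.inf
        for i in range(len(candidates)):
            if i in selected:
                continue
            # Diversity: distance to the closest already-selected negative.
            if selected:
                div = min(np.linalg.norm(candidates[i] - candidates[j]) for j in selected)
            else:
                div = 1.0
            score = w_sim * sims[i] + w_inf * influence[i] + w_div * div
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
    return selected
```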

Keywords

self-supervised learning, multimodal procedural planning
