Nova: Privacy-Preserving Goal-Oriented Reasoning in Smart Homes with On-Device Vision-Language Models
Abstract
Smart home assistants have become increasingly capable of interpreting under-specified user commands with the introduction of Large Language Models (LLMs). However, goal-oriented commands such as "Get this space ready for a party" expose a fundamental limitation in current systems. Accomplishing such goals requires coordinating three distinct categories of tasks: user-only tasks, such as cleaning the coffee table or arranging the couch; machine-only tasks, such as adjusting the thermostat or playing music through smart speakers; and interdependent tasks, such as cleaning the floor, where a human must first pick up objects before the smart vacuum can operate. It also requires visual reasoning about the state of the space, coordinated execution across interdependent human and device actions, and sustained multi-turn interaction to progressively transition the environment from an initial state to a desired state. Current systems are ill-equipped for this class of tasks, as they lack the ability to reason visually about the space or engage users in the structured, iterative dialogue that goal accomplishment requires. A further limitation of existing smart home assistants is their reliance on API-based calls to cloud-hosted LLMs. While powerful, this approach exposes sensitive information, including device states and images of the living space, to external servers, raising significant privacy concerns. We introduce Vision Language Models (VLMs) running entirely on local NVIDIA Jetson Orin Nano hardware to this problem space. We present Nova, a privacy-preserving smart home assistant built on a six-stage agentic pipeline that performs tasks ranging from visual reasoning about the state of the space to generating a structured action plan, and a state-machine task executor that works through the plan in spoken, multi-turn collaboration with the user, all running entirely on-device without any data leaving the home.
We empirically evaluate Nova on 500 user goals spanning 34 goal types across 3 distinct spaces, comparing its performance against cloud-based API models (Gemini 2.5 Flash, GPT-4o-mini and GPT-5.1) and a larger VLM from the same model family (Qwen2.5-VL-32B-Instruct) across three automated evaluation metrics and a human evaluation study conducted on a subset of 50 user goals. Our results show that Nova achieves a task assignment accuracy of 0.82 on-device, trailing cloud-based models (0.97) and a larger VLM from the same model family (0.94), while operating entirely without any data leaving the home. A human evaluation further shows that while participants preferred the cloud baseline on output quality alone, introducing privacy as a consideration reversed that preference strongly in Nova's favor, with 82% of votes shifting to the on-device system, suggesting that users are willing to accept a measurable performance gap in exchange for keeping their home data local. Our findings demonstrate that goal-oriented, privacy-preserving smart home assistance is feasible on a constrained single-board device, opening a path towards capable smart home AI that does not require users to compromise their privacy.