General-Purpose Task Guidance from Natural Language in Augmented Reality using Vision-Language Models
dc.contributor.author | Stover, Daniel James | en |
dc.contributor.committeechair | Abbott, Amos L. | en |
dc.contributor.committeechair | Bowman, Douglas Andrew | en |
dc.contributor.committeemember | Thomas, Christopher Lee | en |
dc.contributor.committeemember | Jones, Creed Farris | en |
dc.contributor.department | Electrical and Computer Engineering | en |
dc.date.accessioned | 2024-06-13T08:01:42Z | en |
dc.date.available | 2024-06-13T08:01:42Z | en |
dc.date.issued | 2024-06-12 | en |
dc.description.abstract | Augmented reality task guidance systems provide assistance for procedural tasks, which require a sequence of physical actions, by rendering virtual guidance visuals within the real-world environment. An example of such a task would be securing two wood parts together, for which the system could display guidance visuals directing the user to pick up a drill and drive each screw. Current AR task guidance systems are limited in that they require AR system experts for use, require CAD models of real-world objects, or only function for limited types of tasks or environments. We propose a general-purpose AR task guidance approach and proof-of-concept system that generates guidance for tasks defined by natural language. Our approach allows an operator to take pictures of relevant objects and write task instructions for an end user, which the system uses to determine where to place guidance visuals. The end user can then receive and follow guidance even if objects change location or environment. Guidance includes reusable visuals that display generic actions, such as our system's 3D hand animations. Our approach utilizes current vision-language machine learning models for text and image semantic understanding and object localization. We built a proof-of-concept system using our approach and tested its accuracy and usability in a user study. We found that all operators were able to generate clear guidance for tasks in an office room, and end users were able to follow the guidance visuals to complete the expected action 85.7% of the time without prior knowledge of their tasks. Participants rated the system as easy to use for generating the guidance visuals they expected. | en |
dc.description.abstractgeneral | Augmented Reality (AR) task guidance systems provide assistance for tasks by placing virtual guidance visuals on top of the real world through displays. An example of such a task would be securing two wood parts together, for which the system could display guidance visuals directing the user to pick up a drill and drive each screw. Current AR task guidance systems are limited in that they require AR system experts for use, require detailed models of real-world objects, or only function for limited types of tasks or environments. We propose a new task guidance approach and built a system that generates guidance for tasks defined by written instructions. Our approach allows an operator to take pictures of relevant objects and write task instructions for an end user, which the system uses to determine where to place digital visuals. The end user can then receive and follow guidance even if objects change location or environment. Guidance includes visuals that display generic actions, such as our system's 3D hand animations that mimic human hand actions. Our approach utilizes AI models for text and image understanding and object detection. We built a proof-of-concept system using our approach and tested its accuracy and usability in a user study. We found that all operators were able to generate clear guidance for tasks in an office room, and end users were able to follow the guidance visuals to complete the expected action 85.7% of the time without prior knowledge of the tasks. Participants rated the system as making it easy to write instructions and take pictures to create guidance visuals. | en |
dc.description.degree | Master of Science | en |
dc.format.medium | ETD | en |
dc.identifier.other | vt_gsexam:40807 | en |
dc.identifier.uri | https://hdl.handle.net/10919/119417 | en |
dc.language.iso | en | en |
dc.publisher | Virginia Tech | en |
dc.rights | Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International | en |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-sa/4.0/ | en |
dc.subject | Augmented Reality | en |
dc.subject | Machine Learning | en |
dc.subject | Task Guidance | en |
dc.title | General-Purpose Task Guidance from Natural Language in Augmented Reality using Vision-Language Models | en |
dc.type | Thesis | en |
thesis.degree.discipline | Computer Engineering | en |
thesis.degree.grantor | Virginia Polytechnic Institute and State University | en |
thesis.degree.level | masters | en |
thesis.degree.name | Master of Science | en |
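The abstracts above credit vision-language models with object localization from natural-language instructions, but the record does not name the models used. As a purely illustrative sketch of that kind of open-vocabulary localization, assuming the OWL-ViT detector from the Hugging Face transformers library (a stand-in, not the thesis's actual pipeline), an operator-supplied phrase can be grounded to a bounding box in a camera frame:

```python
# Illustrative sketch only: the thesis record does not name its models.
# OWL-ViT is assumed here as a representative open-vocabulary detector
# that localizes objects described in natural language.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

def localize(image_path: str, phrases: list[str], threshold: float = 0.2):
    """Return (phrase, score, box) detections for each query phrase."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[phrases], images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Rescale boxes to the original image size (height, width).
    target_sizes = torch.tensor([image.size[::-1]])
    result = processor.post_process_object_detection(
        outputs=outputs, target_sizes=target_sizes, threshold=threshold
    )[0]
    return [
        (phrases[label], score.item(), box.tolist())
        for score, label, box in zip(
            result["scores"], result["labels"], result["boxes"]
        )
    ]

# Example: ground the objects mentioned in a task instruction
# ("pick up the drill, then drive each screw") to image regions
# where AR guidance visuals could be anchored.
detections = localize("workbench.jpg", ["a power drill", "a screw"])
for phrase, score, box in detections:
    print(f"{phrase}: score={score:.2f}, box={box}")
```

In a system like the one described, such localizations would presumably anchor where visuals such as the 3D hand animations are rendered; the actual models and AR stack are documented in the thesis itself.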
Files
Original bundle (1 - 5 of 8)

Name | Size | Format | Description
Stover_DJ_T_2024_support_7.mp4 | 6.87 MB | MP4 container format for video files | Supporting documents
Stover_DJ_T_2024_support_4.mp4 | 61.5 MB | MP4 container format for video files | Supporting documents
Stover_DJ_T_2024_support_5.mp4 | 6.11 MB | MP4 container format for video files | Supporting documents
Stover_DJ_T_2024_support_8.mp4 | 9.99 MB | MP4 container format for video files | Supporting documents