Conversational Multimodal LLMs for Food Nutritional Information Retrieval: A Systematic Evaluation
Abstract
Accurate dietary monitoring underpins public health initiatives, chronic disease management, and personalized nutrition, yet manual food logging remains laborious and prone to error. At the same time, vision and language models have reached new levels of capability and offer the potential to infer nutritional content directly from meal photographs with minimal user effort. This thesis systematically evaluates eight state-of-the-art multimodal large language models (including GPT-4o, Qwen 2, Qwen 2.5, DeepSeek, and LLaVA) in a zero-shot setting on two real-world datasets, Nutrition5k and MetaFood3D. To identify which types of information most improve model predictions, a structured cue ladder protocol was developed. Models first receive only the image, then incrementally gain access to a verified ingredient list, the total dish mass, and the mass or volume of each individual ingredient. Single-view and two-view inputs are compared, and two conversational prompting workflows are tested: a predefined sequence of reasoning questions and an agentic self-questioning pipeline that generates clarifying queries on the fly. Results show that GPT-4o achieves the lowest overall mean absolute percentage error (MAPE), with calorie prediction error falling from approximately 51% in the image-only setting to as low as 29% when per-ingredient mass is provided. Providing just the total mass reduces calorie prediction error across all models, making it the single most impactful cue. Adding a second image view yields smaller but consistent improvements of 3 to 7 percentage points. Finally, the agentic self-questioning workflow consistently outperforms the fixed prompt sequence, particularly in low-context scenarios, with some models improving by more than 8 percentage points. Even when granted nearly all available cues, no model attains perfect accuracy, as errors persist in estimating invisible elements such as oils and dressings.
These findings clarify the trade-off between user effort and predictive accuracy, and they establish a general evaluation framework and conversational pipeline design that can extend to other applications requiring integrated visual and contextual reasoning. The ultimate aim is to enable an end-user application that delivers reliable nutritional estimates from a simple photograph with minimal additional input.
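As an illustration of the evaluation described in the abstract, the following Python sketch shows one way a cumulative cue ladder and the MAPE metric could be expressed. It is not the thesis implementation: the level names, the `Dish` fields, the `build_prompt` helper, and the prompt wording are all hypothetical. Only the MAPE formula is the standard definition.

```python
from dataclasses import dataclass, field

# Hypothetical cue ladder levels, ordered from least to most context.
CUE_LEVELS = ("image_only", "ingredients", "total_mass", "ingredient_mass")


@dataclass
class Dish:
    """Illustrative record for one dish; the field names are assumptions."""
    ingredients: list[str]
    total_mass_g: float
    ingredient_mass_g: dict[str, float] = field(default_factory=dict)


def build_prompt(dish: Dish, level: str) -> str:
    """Assemble the text part of a prompt with every cue up to `level`."""
    depth = CUE_LEVELS.index(level)  # raises ValueError for unknown levels
    parts = ["Estimate the calories (kcal) of the dish in the image."]
    if depth >= 1:
        parts.append("Ingredients: " + ", ".join(dish.ingredients) + ".")
    if depth >= 2:
        parts.append(f"Total mass: {dish.total_mass_g:g} g.")
    if depth >= 3:
        masses = ", ".join(
            f"{name} {grams:g} g" for name, grams in dish.ingredient_mass_g.items()
        )
        parts.append("Ingredient masses: " + masses + ".")
    return " ".join(parts)


def mape(y_true: list[float], y_pred: list[float]) -> float:
    """Mean absolute percentage error, in percent.

    Assumes every true value is non-zero; a zero would divide by zero.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - t) / abs(t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)
```

Because each level includes all cues from the levels below it, comparing MAPE between adjacent levels isolates the contribution of a single added cue, which is how the abstract attributes the largest gain to total mass.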