Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding

dc.contributor.author: Zeng, Tong
dc.contributor.author: Wu, Longfeng
dc.contributor.author: Shi, Liang
dc.contributor.author: Zhou, Dawei
dc.contributor.author: Guo, Feng
dc.date.accessioned: 2025-09-10T12:23:06Z
dc.date.available: 2025-09-10T12:23:06Z
dc.date.issued: 2025-08-03
dc.date.updated: 2025-09-01T07:48:01Z
dc.description.abstract: Vision Large Language Models (VLLMs) have demonstrated impressive capabilities in general visual tasks such as image captioning and visual question answering. However, their effectiveness in specialized, safety-critical domains like autonomous driving remains largely unexplored. Autonomous driving systems require sophisticated scene understanding in complex environments, yet existing multimodal benchmarks primarily focus on normal driving conditions, failing to adequately assess VLLMs’ performance in safety-critical scenarios. To address this, we introduce DVBench, a pioneering benchmark designed to evaluate the performance of VLLMs in understanding safety-critical driving videos. Built around a hierarchical ability taxonomy that aligns with widely adopted frameworks for describing driving scenarios used in assessing highly automated driving systems, DVBench features 10,000 multiple-choice questions with human-annotated ground-truth answers, enabling a comprehensive evaluation of VLLMs’ capabilities in perception and reasoning. Experiments on 14 state-of-the-art VLLMs, ranging from 0.5B to 72B parameters, reveal significant performance gaps, with no model achieving over 40% accuracy, highlighting critical limitations in understanding complex driving scenarios. To probe adaptability, we fine-tuned selected models using domain-specific data from DVBench, achieving accuracy gains ranging from 5.24 to 10.94 percentage points, with relative improvements of up to 43.59%. This improvement underscores the necessity of targeted adaptation to bridge the gap between general-purpose vision-language models and mission-critical driving applications. DVBench establishes an essential evaluation framework and research roadmap for developing VLLMs that meet the safety and robustness requirements for real-world autonomous systems. We released the benchmark toolbox and the fine-tuned model at: https://github.com/tong-zeng/DVBench.git.
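The absolute and relative gains quoted in the abstract are linked through the pre-fine-tuning baseline accuracy. As a hedged illustration only, if the 43.59% relative improvement corresponded to the largest absolute gain of 10.94 percentage points, the implied baseline accuracy would be roughly 25%; this baseline is an assumption for illustration and is not stated in this record:

\[ \text{relative improvement} = \frac{\text{accuracy gain}}{\text{baseline accuracy}} \approx \frac{10.94}{25.1} \approx 43.6\% \]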
dc.description.version: Published version
dc.format.mimetype: application/pdf
dc.identifier.doi: https://doi.org/10.1145/3711896.3737396
dc.identifier.uri: https://hdl.handle.net/10919/137724
dc.language.iso: en
dc.publisher: ACM
dc.rights: Creative Commons Attribution-NonCommercial 4.0 International
dc.rights.holder: The author(s)
dc.rights.uri: http://creativecommons.org/licenses/by-nc/4.0/
dc.title: Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding
dc.type: Article - Refereed
dc.type.dcmitype: Text

Files

Original bundle
Name: 3711896.3737396.pdf
Size: 4.72 MB
Format: Adobe Portable Document Format
Description: Published version
License bundle
Name: license.txt
Size: 1.5 KB
Description: Item-specific license agreed upon submission