Leveraging Multimodal LLM for Efficient Understanding of VRU via Multimodal Traffic Sensing Approach

Year: 2025-2027

PIs: Yiqiao Li

Sponsor: US DOT – UTC (pending confirmation)

Description: Vulnerable roadway users (VRUs), such as pedestrians, bicyclists, and other non-occupants of vehicles, are increasingly recognized as a critical focus of transportation safety research because of their heightened exposure to traffic-related risks and their limited physical protection in crashes. Prior work has established multimodal traffic sensing testbeds and investigated deep learning approaches for collecting high-quality VRU data to better support VRU behavior studies and safety analysis. However, training deep learning models requires labor-intensive manual annotation, especially for data from sensing modalities such as LiDAR point clouds and video footage.

Building on this prior work, this project aims to move from manual labeling toward a self-learning, agentic AI system powered by Multimodal Large Language Models (MLLMs). We propose a two-phase approach. First, we will integrate Vision-Language Models (VLMs) such as CLIP2Point and Gemini into a semi-automated annotation pipeline that extracts high-quality VRU data without a laborious manual annotation process. Second, we will progressively transition these models from annotation assistants into active agents, using in-context learning, Chain-of-Thought (CoT) prompting, retrieval-augmented generation (RAG), and fine-tuning to perform zero-shot and few-shot MLLM-based VRU detection and classification. The system will continuously learn from its own annotations, allowing it to adapt over time. This approach seeks to make VRU data collection more efficient while producing high-quality data that supports a more detailed understanding of micro-level VRU travel behaviors.
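
As a rough illustration of what a semi-automated annotation step in the first phase could look like, the sketch below routes model-proposed labels by confidence: high-confidence proposals are accepted automatically, and the rest are queued for human review. This is only a minimal sketch under stated assumptions, not the project's pipeline. The `Proposal` type, the `triage` function, the label set, and the 0.9 threshold are all hypothetical, and no actual VLM is called; the proposals are hard-coded stand-ins for model output.

```python
from dataclasses import dataclass

# Hypothetical label set; the project's actual VRU taxonomy is not specified.
VRU_CLASSES = ("pedestrian", "bicyclist", "other_vru")


@dataclass(frozen=True)
class Proposal:
    """A label proposed by a model for one frame (hypothetical structure)."""
    frame_id: str
    label: str
    confidence: float  # model-reported score in [0, 1]


def triage(proposals, accept_threshold=0.9):
    """Split proposals into auto-accepted labels and a human-review queue.

    A proposal is auto-accepted only if its label is in the known class list
    and its confidence meets the (assumed) threshold; everything else is
    sent to a human annotator.
    """
    accepted, review = [], []
    for p in proposals:
        if p.label in VRU_CLASSES and p.confidence >= accept_threshold:
            accepted.append(p)
        else:
            review.append(p)
    return accepted, review


# Stand-in proposals; in practice these would come from a VLM.
proposals = [
    Proposal("frame_001", "pedestrian", 0.97),  # confident, known class
    Proposal("frame_002", "bicyclist", 0.62),   # low confidence
    Proposal("frame_003", "scooter", 0.95),     # label outside the class list
]

accepted, review = triage(proposals)
print("auto-accepted:", [p.frame_id for p in accepted])
print("needs review:", [p.frame_id for p in review])
```

In a setup like this, labels corrected during human review could later serve as few-shot examples or fine-tuning data, which is one plausible way for the system to learn from its own annotations.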