Semester of Graduation
Fall 2025
Degree Type
Thesis
Degree Name
MASTER OF SCIENCE IN INFORMATION TECHNOLOGY
Department
Department of Information Technology
Committee Chair/First Advisor
Shaoen Wu
Abstract
Recent advances in large language models (LLMs) and multimodal large language models (MLLMs) enable natural language–based querying in virtual reality (VR). However, VR environments are highly localized, personalized, and dynamic, making it challenging for general-purpose models to answer environment-specific queries or reason about subtle object state changes. To address these challenges, this thesis develops two systems for 3D question answering in VR.
First, we present RAG-VR, the first retrieval-augmented 3D question-answering system designed for VR. RAG-VR augments an LLM with external knowledge retrieved from a localized knowledge database and includes a pipeline for extracting environmental and user-related information. To improve efficiency, the system offloads retrieval to a nearby edge server and uses only essential information during inference. The retriever is further trained to distinguish among relevant, irrelevant, and hard-to-differentiate information. Experimental results show that RAG-VR improves answer accuracy by 17.9%–41.8% and reduces end-to-end latency by 34.5%–47.3% compared with two baseline systems.
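The retrieve-then-answer flow described above can be illustrated with a minimal sketch. This is not the RAG-VR implementation: the bag-of-words `embed` function stands in for the trained retriever, and `KNOWLEDGE_BASE`, `retrieve`, and `build_prompt` are hypothetical names introduced only for illustration. The sketch shows how only the top-ranked entries from a localized knowledge database would be passed to the LLM.

```python
import math
from collections import Counter

# Hypothetical localized knowledge database: one short fact per entry.
KNOWLEDGE_BASE = [
    "The red mug is on the kitchen table.",
    "The lamp in the bedroom is switched off.",
    "The user is standing near the kitchen door.",
]

def embed(text):
    """Toy bag-of-words embedding (assumption: stands in for a trained retriever)."""
    cleaned = text.lower().replace("?", "").replace(".", "")
    return Counter(cleaned.split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def retrieve(query, knowledge_base, k=2):
    """Return the k entries most similar to the query."""
    q = embed(query)
    ranked = sorted(knowledge_base, key=lambda e: cosine(q, embed(e)), reverse=True)
    return ranked[:k]

def build_prompt(query, entries):
    """Assemble an LLM prompt that contains only the retrieved entries."""
    context = "\n".join(f"- {e}" for e in entries)
    return f"Context:\n{context}\nQuestion: {query}\nAnswer:"

top = retrieve("Where is the red mug?", KNOWLEDGE_BASE, k=1)
prompt = build_prompt("Where is the red mug?", top)
```

In a deployment like the one the abstract describes, the `retrieve` step would run on the edge server, and only the resulting prompt would reach the model.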
Second, we introduce ObjChangeVR-Dataset, which targets object state change question answering in scenarios without direct user interaction. Unlike prior work that relies on explicit interaction cues, this setting requires reasoning about subtle and implicit changes. Building on this dataset, we propose ObjChangeVR, a framework that integrates viewpoint-aware and temporal retrieval with cross-view reasoning. Experiments demonstrate that ObjChangeVR consistently outperforms strong baselines across multiple state-of-the-art MLLMs.
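One way to picture viewpoint-aware and temporal retrieval is as a filter over recorded egocentric frames. The sketch below is a simplified assumption, not the ObjChangeVR method: the `Frame` fields, the 2D geometry, the 90-degree field of view, and the 5-unit distance limit are all hypothetical. It keeps frames captured before the query time whose camera faced the target location, then returns the earliest and latest of them as a pair for cross-view comparison.

```python
import math
from dataclasses import dataclass

@dataclass
class Frame:
    timestamp: float   # capture time in seconds (hypothetical field)
    cam_xy: tuple      # camera position on the floor plane
    yaw_deg: float     # viewing direction, degrees counterclockwise from +x

def sees(frame, target_xy, fov_deg=90.0, max_dist=5.0):
    """True if the target lies within the frame's field of view and range."""
    dx = target_xy[0] - frame.cam_xy[0]
    dy = target_xy[1] - frame.cam_xy[1]
    dist = math.hypot(dx, dy)
    if dist == 0 or dist > max_dist:
        return False
    bearing = math.degrees(math.atan2(dy, dx))
    # Wrap the angular difference into [-180, 180).
    diff = (bearing - frame.yaw_deg + 180.0) % 360.0 - 180.0
    return abs(diff) <= fov_deg / 2.0

def select_frames(frames, target_xy, query_time):
    """Earliest and latest frames before query_time that view the target."""
    visible = sorted(
        (f for f in frames if f.timestamp <= query_time and sees(f, target_xy)),
        key=lambda f: f.timestamp,
    )
    if len(visible) < 2:
        return visible
    return [visible[0], visible[-1]]

frames = [
    Frame(0.0, (0.0, 0.0), 0.0),     # faces the target
    Frame(1.0, (0.0, 0.0), 180.0),   # faces away from the target
    Frame(2.0, (2.0, -2.0), 90.0),   # second viewpoint facing the target
    Frame(3.0, (1.0, 0.0), 0.0),     # captured after the query time
]
pair = select_frames(frames, target_xy=(2.0, 0.0), query_time=2.5)
```

The selected pair would then be handed to an MLLM, which compares the two views to judge whether the object's state has changed.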
Together, RAG-VR and ObjChangeVR provide end-to-end solutions for VR-based question answering, enabling both localized 3D context understanding and reasoning about subtle object state changes from continuous egocentric observations.