Date of Award

Spring 5-10-2024

Degree Type


Degree Name

Master of Science in Information Technology


Department of Information Technology

Committee Chair/First Advisor

Shaoen Wu

Second Advisor

Jian Zhang

Third Advisor

Ying Wang


Rapid advances in Deep Learning (DL), especially in Reinforcement Learning (RL) and Imitation Learning (IL), have made it a promising approach for many autonomous robotic systems. However, current methods are largely confined to a single setup and require substantial data and long training periods. Moreover, they perform poorly on long-horizon tasks such as Radio Frequency Identification (RFID) inventory, where a robot needs thousands of steps to complete the task.

In this thesis, we address these challenges by presenting the Cross-modal Reasoning Model (CMRM), a novel zero-shot Imitation Learning policy for long-horizon robotic tasks. RFID inventory is a typical long-horizon robotic task that can be formulated as a Partially Observable Markov Decision Process (POMDP): the robot must recall its previous actions and reason from current environmental observations to optimize its strategy. To this end, CMRM is designed with a two-stream structure that extracts abstract information concealed in environmental observations and then generates robot actions by reasoning over structural and temporal features from historical and current observations. Extensive experiments on a virtual platform and in a mock-up of a real store are conducted to evaluate the proposed CMRM. The results demonstrate that CMRM can perform RFID inventory tasks in unstructured environments with complex layouts and achieves accuracy surpassing that of previous methods and manual inventory.

To facilitate the training and assessment of CMRM, we constructed a Unity3D-based virtual platform that can be configured into various environments, such as an apparel store. The platform offers photo-realistic objects and precise physical properties (gravity, appearance, and more) to provide close-to-real environments for training and testing robots. Once trained, the robot was deployed in an actual retail environment to perform RFID inventory tasks. This approach effectively bridges the "reality gap", enabling the robot to perform the RFID inventory task seamlessly in both virtual and real-world settings and thereby demonstrating zero-shot generalization.