Presentation Type
Article
Location
Kennesaw, Georgia
Start Date
1-4-2026 1:45 PM
End Date
1-4-2026 3:00 PM
Description
This paper presents the concept and architectural design of a voice-controlled Unmanned Aerial System (UAS) that leverages a fine-tuned Small Language Model (SLM) to convert natural language voice commands into structured MAVLink flight instructions in real time. A custom quadcopter platform has been designed and assembled with a Pixhawk flight controller, and a web-based interface has been developed to integrate browser-based speech recognition with on-device SLM inference. We describe a systematic evaluation methodology for five candidate SLMs spanning encoder-decoder and decoder-only architectures, a custom dataset of 5,450 labeled drone command samples covering 11 operational command types and an unknown rejection class, and a QLoRA-based fine-tuning pipeline targeting the best-performing candidate. A dual-layer rejection architecture is proposed to ensure that non-command inputs are reliably filtered. A key advantage of SLMs is their compact footprint: the models evaluated in this work are small enough to run inference on a standard CPU, although a consumer GPU can optionally be used to accelerate processing. The complete system is designed to operate entirely on-device without cloud connectivity, aiming to demonstrate the feasibility of deploying fine-tuned SLMs for safety-critical voice interfaces on edge hardware.
Voice-Controlled Unmanned Aerial System Using a Fine-Tuned Small Language Model for Real-Time Command Parsing
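To make the command-parsing and rejection pipeline concrete, the sketch below shows one plausible shape for the structured output a fine-tuned SLM might emit and for the second rejection layer that filters non-command inputs. The command names, JSON schema, and confidence threshold here are illustrative assumptions, not the paper's actual implementation:

```python
import json

# Illustrative subset of the 11 operational command types described in the paper.
KNOWN_COMMANDS = {"takeoff", "land", "hover", "move", "yaw", "return_to_launch"}

def parse_slm_output(raw: str, confidence: float, threshold: float = 0.85):
    """Hypothetical layer-2 rejection: validate the SLM's JSON output against
    a known-command schema and a confidence threshold; return None on reject."""
    if confidence < threshold:
        return None                      # low-confidence output -> reject
    try:
        cmd = json.loads(raw)
    except json.JSONDecodeError:
        return None                      # malformed output -> reject
    if cmd.get("command") not in KNOWN_COMMANDS:
        return None                      # 'unknown' class or anything else -> reject
    return cmd                           # accepted: ready to map to a MAVLink message

# e.g. the utterance "ascend to ten meters" might yield:
accepted = parse_slm_output('{"command": "takeoff", "altitude_m": 10}', 0.97)
rejected = parse_slm_output('{"command": "unknown"}', 0.97)
print(accepted)   # {'command': 'takeoff', 'altitude_m': 10}
print(rejected)   # None
```

In a deployed system the accepted dictionary would then be translated into the corresponding MAVLink instruction; that mapping step is omitted here.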