Presentation Type

Article

Location

Kennesaw, Georgia

Start Date

1-4-2026 1:45 PM

End Date

1-4-2026 3:00 PM

Description

This paper presents the concept and architectural design of a voice-controlled Unmanned Aerial System (UAS) that leverages a fine-tuned Small Language Model (SLM) to convert natural language voice commands into structured MAVLink flight instructions in real time. A custom quadcopter platform has been designed and assembled with a Pixhawk flight controller, and a web-based interface has been developed to integrate browser-based speech recognition with on-device SLM inference. We describe a systematic evaluation methodology for five candidate SLMs spanning encoder-decoder and decoder-only architectures, a custom dataset of 5,450 labeled drone command samples covering 11 operational command types and an unknown rejection class, and a QLoRA-based fine-tuning pipeline targeting the best-performing candidate. A dual-layer rejection architecture is proposed to ensure that non-command inputs are reliably filtered. A key advantage of SLMs is their compact footprint: the models evaluated in this work are small enough to run inference on a standard CPU, although a consumer GPU can optionally be used to accelerate processing. The complete system is designed to operate entirely on-device without cloud connectivity, aiming to demonstrate the feasibility of deploying fine-tuned SLMs for safety-critical voice interfaces on edge hardware.
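To make the command-parsing pipeline described above concrete, the following minimal sketch (not the paper's implementation; all names, thresholds, and the keyword gate are illustrative assumptions) shows how a transcript might pass through a dual-layer rejection path before being mapped to a structured MAVLink-style instruction. The `classify` callable stands in for the fine-tuned SLM, and the numeric command IDs are taken from the MAVLink common message set (`MAV_CMD_NAV_LAND` = 21, `MAV_CMD_NAV_TAKEOFF` = 22, `MAV_CMD_COMPONENT_ARM_DISARM` = 400).

```python
# Hypothetical sketch of a dual-layer rejection pipeline for voice commands.
# Layer 1: a lightweight keyword gate filters inputs that contain no
# drone-related vocabulary at all. Layer 2: the SLM itself, which can
# return an "unknown" label (the rejection class) or a low confidence.

# Illustrative keyword set; the real system's 11 command types may differ.
COMMAND_KEYWORDS = {"takeoff", "land", "arm", "disarm", "climb", "descend"}

# Illustrative label -> MAVLink command mapping (IDs from the MAVLink
# common message set; parameters are placeholders).
LABEL_TO_MAVLINK = {
    "takeoff": {"command": 22, "param7": 10.0},  # MAV_CMD_NAV_TAKEOFF, alt in param7
    "land": {"command": 21},                     # MAV_CMD_NAV_LAND
    "arm": {"command": 400, "param1": 1.0},      # MAV_CMD_COMPONENT_ARM_DISARM
}

def layer1_gate(transcript: str) -> bool:
    """Pass only if the transcript mentions at least one command keyword."""
    words = set(transcript.lower().split())
    return bool(words & COMMAND_KEYWORDS)

def parse_command(transcript: str, classify, min_conf: float = 0.9):
    """Return a structured MAVLink-style dict, or None if rejected.

    `classify` stands in for the fine-tuned SLM: it takes a transcript
    and returns a (label, confidence) pair.
    """
    if not layer1_gate(transcript):
        return None  # layer 1 rejection: no command vocabulary present
    label, conf = classify(transcript)
    if label == "unknown" or conf < min_conf:
        return None  # layer 2 rejection: model's unknown class / low confidence
    return LABEL_TO_MAVLINK.get(label)
```

In this sketch the cheap keyword gate runs before the model so that obvious non-commands (ambient speech, chit-chat) never reach SLM inference, while the model's dedicated `unknown` class catches inputs that contain command vocabulary but are not valid commands.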


Voice-Controlled Unmanned Aerial System Using a Fine-Tuned Small Language Model for Real-Time Command Parsing
