Master's Theses

Influence of Speech Disfluencies and Prompt Optimization in LLM-Based Alzheimer's Disease Classification

Muhammad Awais Arshad, Kennesaw State University

Semester of Graduation

Spring 2026

Degree Type

Thesis

Degree Name

Master of Science in Software Engineering

Department

Software Engineering and Game Development - College of Computing and Software Engineering

Committee Chair/First Advisor

Dr. Seyedamin Pouriyeh

Abstract

Large Language Models (LLMs) have been applied to Alzheimer's disease (AD) detection from spontaneous speech transcripts, but the effect of speech disfluencies on classification performance and the biases introduced by different prompting strategies have received limited study. This research examines how the inclusion or removal of disfluencies in AD transcripts affects LLM classification accuracy across prompting techniques, and whether automatic prompt optimization can reduce the resulting classification biases. We first reproduce a baseline study using the ADReSSo dataset (71 recordings, no disfluencies) transcribed via Otter.ai, validating results on DeepSeek V3 and DeepSeek R1. We then compare the ADReSSo dataset against the ADReSS dataset (108 manually transcribed samples with preserved disfluencies), evaluating four LLMs (DeepSeek V3, DeepSeek R1, GPT-5.2, and Gemini 3 Flash) under three prompting strategies: zero-shot, chain-of-thought, and self-consistency prompting. We find that chain-of-thought and self-consistency prompts induce mirror-image classification biases that are model-family specific: DeepSeek models severely over-classify AD (up to 100% AD accuracy but only 24% CN accuracy), while GPT-5.2 exhibits the opposite CN-dominant bias. To address these biases, we apply two automatic prompt optimization frameworks that represent different paradigms: DSPy MIPROv2 (Bayesian search) and TextGrad (textual gradient descent). DSPy MIPROv2 reduced bias in the simpler chain-of-thought prompt, improving balanced accuracy by up to 11 percentage points for DeepSeek models, but struggled with the multi-module self-consistency prompt. TextGrad showed the opposite behavior: it improved CN accuracy by up to 25 percentage points on the self-consistency prompt but overcorrected the simpler chain-of-thought prompt. Neither framework was universally better; each was suited to a different level of prompt structural complexity. These results indicate that LLM-based clinical classification pipelines require both bias characterization and framework-appropriate prompt optimization.
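The abstract reports per-class accuracy (AD and CN) alongside balanced accuracy. The sketch below shows one common way these quantities relate. It assumes balanced accuracy is the unweighted mean of the per-class accuracies; the record does not state the thesis's exact definition. The function names and the toy labels are hypothetical and are not taken from the thesis or its datasets.

```python
from typing import Sequence


def per_class_accuracy(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> float:
    """Fraction of samples whose true class is `label` that were predicted as `label`."""
    indices = [i for i, truth in enumerate(y_true) if truth == label]
    if not indices:
        raise ValueError(f"no samples with true label {label!r}")
    correct = sum(1 for i in indices if y_pred[i] == label)
    return correct / len(indices)


def balanced_accuracy(y_true: Sequence[str], y_pred: Sequence[str],
                      labels: Sequence[str] = ("AD", "CN")) -> float:
    """Unweighted mean of per-class accuracies (assumed definition)."""
    scores = [per_class_accuracy(y_true, y_pred, label) for label in labels]
    return sum(scores) / len(scores)


# Hypothetical toy data illustrating an AD-dominant bias:
# every AD sample is caught, but most CN samples are also labeled AD.
y_true = ["AD", "AD", "AD", "AD", "CN", "CN", "CN", "CN"]
y_pred = ["AD", "AD", "AD", "AD", "AD", "AD", "AD", "CN"]

ad_acc = per_class_accuracy(y_true, y_pred, "AD")
cn_acc = per_class_accuracy(y_true, y_pred, "CN")
bal_acc = balanced_accuracy(y_true, y_pred)

print(f"AD accuracy: {ad_acc:.2f}")
print(f"CN accuracy: {cn_acc:.2f}")
print(f"Balanced accuracy: {bal_acc:.3f}")
```

Under this assumed definition, a model that labels nearly everything as AD can reach perfect AD accuracy while its balanced accuracy stays close to chance. This is why the per-class figures are reported separately when describing the bias.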

Download

Available for download on Saturday, May 08, 2027
