Semester of Graduation
Spring 2026
Degree Type
Dissertation/Thesis
Degree Name
Masters in Software Engineering and Game Development
Department
Department of Software Engineering and Game Design and Development
Committee Chair/First Advisor
Dr. Seyedamin Pouriyeh
Abstract
Large Language Models (LLMs) have been applied to Alzheimer's disease (AD) detection from spontaneous speech transcripts, but the effect of speech disfluencies on classification performance and the biases introduced by different prompting strategies have received limited study. This research examines how the inclusion or removal of disfluencies in AD transcripts affects LLM classification accuracy across prompting techniques, and whether automatic prompt optimization can reduce the resulting classification biases. We first reproduce a baseline study using the ADReSSo dataset (71 recordings, no disfluencies) transcribed via Otter.ai, validating results on DeepSeek V3 and DeepSeek R1. We then compare the ADReSSo dataset against the ADReSS dataset (108 manually transcribed samples with preserved disfluencies), evaluating four LLMs (DeepSeek V3, DeepSeek R1, GPT-5.2, and Gemini 3 Flash) under three prompting strategies: zero-shot, chain-of-thought, and self-consistency prompting. We find that chain-of-thought and self-consistency prompts induce mirror-image classification biases that are model-family specific: DeepSeek models severely over-classify AD (up to 100% AD accuracy but only 24% CN accuracy), while GPT-5.2 exhibits the opposite CN-dominant bias. To address these biases, we apply two automatic prompt optimization frameworks that represent different paradigms: DSPy MIPROv2 (Bayesian search) and TextGrad (textual gradient descent). DSPy MIPROv2 reduced bias in the simpler chain-of-thought prompt, improving balanced accuracy by up to 11 percentage points for DeepSeek models, but struggled with the multi-module self-consistency prompt. TextGrad showed the opposite behavior: it improved CN accuracy by up to 25 percentage points on the self-consistency prompt but overcorrected the simpler chain-of-thought prompt. Neither framework was universally better; each was suited to a different level of prompt structural complexity.
These results indicate that LLM-based clinical classification pipelines require both bias characterization and framework-appropriate prompt optimization.
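The abstract reports per-class accuracy (AD and CN) alongside balanced accuracy. As a minimal sketch of how these quantities relate, assuming the standard two-class definition of balanced accuracy as the mean of the per-class accuracies, the helper names below are hypothetical and not taken from the thesis. The inputs 1.00 and 0.24 reuse the abstract's most extreme reported figures purely as illustration; the abstract does not confirm that both come from the same model and prompt configuration.

```python
def balanced_accuracy(ad_accuracy: float, cn_accuracy: float) -> float:
    """Mean of the per-class accuracies for a two-class (AD vs. CN) task.

    Assumes the standard definition; the thesis may compute it differently.
    """
    return (ad_accuracy + cn_accuracy) / 2.0


def class_bias(ad_accuracy: float, cn_accuracy: float) -> float:
    """Signed gap between per-class accuracies (hypothetical bias indicator).

    Positive values suggest an AD-dominant bias, negative values a
    CN-dominant bias, and values near zero a balanced classifier.
    """
    return ad_accuracy - cn_accuracy


if __name__ == "__main__":
    # Illustrative inputs only: the extreme per-class figures quoted in the abstract.
    ad_acc, cn_acc = 1.00, 0.24
    print(f"balanced accuracy: {balanced_accuracy(ad_acc, cn_acc):.2f}")
    print(f"class bias (AD - CN): {class_bias(ad_acc, cn_acc):+.2f}")
```

Under these assumptions, a classifier that labels nearly every transcript as AD can reach very high AD accuracy while its balanced accuracy stays close to chance, which is why the per-class breakdown matters when comparing prompts.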