Location
https://www.kennesaw.edu/ccse/events/computing-showcase/sp26-cday-program.php
Document Type
Event
Start Date
4-22-2026 4:00 PM
Description
Early detection of dementia is important for timely intervention, but traditional diagnostic procedures remain costly, time-consuming, and difficult to scale. Speech-based analysis offers a promising non-invasive alternative because cognitive decline often affects fluency, articulation, and other acoustic properties of speech. In this work, we present a multimodal dementia detection framework that combines self-supervised speech representations from wav2vec2 with demographic metadata, including age, gender, and ethnicity. We first compare the multimodal approach against a strong audio-only baseline under a controlled experimental setup. We then extend the analysis with a systematic ablation study and repeated-run statistical evaluation to measure the contribution of individual metadata features. Results show that the multimodal model consistently outperforms the audio-only baseline, with the largest gain primarily driven by age. In contrast, gender and ethnicity provide only marginal independent benefit. Across repeated experiments, the multimodal configurations also show stable performance, supporting the robustness of the observed improvements. These findings suggest that self-supervised speech embeddings capture meaningful dementia-related information, while selected demographic context, especially age, can provide complementary predictive value. Overall, this work strengthens the case for multimodal learning as a practical direction for scalable speech-based dementia screening.
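The abstract does not specify the model architecture or fusion strategy, but the described approach (wav2vec2 speech embeddings combined with demographic metadata) could resemble the following minimal sketch. The pretrained checkpoint name, pooling method, metadata dimensionality, and classifier shape below are illustrative assumptions, not details drawn from the work itself.

```python
# Illustrative sketch only: architecture, checkpoint, and fusion details are
# assumptions for demonstration, not the authors' actual implementation.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class MultimodalDementiaClassifier(nn.Module):
    def __init__(self, num_metadata_features=3, hidden_dim=128):
        super().__init__()
        # Self-supervised speech encoder (frozen here for simplicity).
        self.speech_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        for p in self.speech_encoder.parameters():
            p.requires_grad = False
        speech_dim = self.speech_encoder.config.hidden_size  # 768 for the base model
        # Small projection for demographic metadata (e.g., age, gender, ethnicity).
        self.metadata_proj = nn.Sequential(
            nn.Linear(num_metadata_features, 32),
            nn.ReLU(),
        )
        # Classifier over concatenated speech + metadata features.
        self.classifier = nn.Sequential(
            nn.Linear(speech_dim + 32, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),  # dementia vs. control
        )

    def forward(self, waveform, metadata):
        # waveform: (batch, samples) of raw 16 kHz audio
        # metadata: (batch, num_metadata_features) of encoded demographics
        hidden = self.speech_encoder(waveform).last_hidden_state  # (batch, frames, speech_dim)
        pooled = hidden.mean(dim=1)  # mean-pool over time frames
        meta = self.metadata_proj(metadata)
        return self.classifier(torch.cat([pooled, meta], dim=-1))
```

An ablation of the kind the abstract describes could then be run by zeroing out or dropping individual metadata columns (age, gender, ethnicity) and retraining, comparing each configuration against the audio-only baseline over repeated runs.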
Included in
GC-155-130 Multimodal Speech-Based Dementia Detection