Location

https://www.kennesaw.edu/ccse/events/computing-showcase/sp26-cday-program.php

Document Type

Event

Start Date

April 22, 2026, 4:00 PM

Description

Early detection of dementia is important for timely intervention, but traditional diagnostic procedures remain costly, time-consuming, and difficult to scale. Speech-based analysis offers a promising non-invasive alternative because cognitive decline often affects fluency, articulation, and other acoustic properties of speech. In this work, we present a multimodal dementia detection framework that combines self-supervised speech representations from wav2vec2 with demographic metadata, including age, gender, and ethnicity. We first compare the multimodal approach against a strong audio-only baseline under a controlled experimental setup. We then extend the analysis with a systematic ablation study and repeated-run statistical evaluation to measure the contribution of individual metadata features. Results show that the multimodal model consistently outperforms the audio-only baseline, with the largest gain primarily driven by age. In contrast, gender and ethnicity provide only marginal independent benefit. Across repeated experiments, the multimodal configurations also show stable performance, supporting the robustness of the observed improvements. These findings suggest that self-supervised speech embeddings capture meaningful dementia-related information, while selected demographic context, especially age, can provide complementary predictive value. Overall, this work strengthens the case for multimodal learning as a practical direction for scalable speech-based dementia screening.
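The fusion described above can be sketched in miniature. The abstract does not specify the fusion mechanism, so this sketch assumes simple concatenation-based late fusion of a pooled speech embedding with an encoded metadata vector; the random stand-in embedding, the 768 dimension (wav2vec2-base hidden size), and all function names are illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a wav2vec2 utterance embedding: in a real pipeline this would
# be the mean-pooled hidden states of a pretrained wav2vec2 model; here we
# just draw a random 768-dim vector for illustration.
def pooled_speech_embedding(dim=768):
    return rng.normal(size=dim)

# Demographic metadata encoded as a small numeric vector:
# normalized age plus one-hot gender and ethnicity (category counts assumed).
def encode_metadata(age, gender_idx, ethnicity_idx,
                    n_genders=2, n_ethnicities=4):
    meta = np.zeros(1 + n_genders + n_ethnicities)
    meta[0] = age / 100.0                      # crude age normalization
    meta[1 + gender_idx] = 1.0
    meta[1 + n_genders + ethnicity_idx] = 1.0
    return meta

# Late fusion: concatenate the two modalities into one feature vector,
# then score with a linear head (untrained random weights here).
def fuse_and_score(speech_emb, meta, weights=None):
    x = np.concatenate([speech_emb, meta])
    if weights is None:
        weights = rng.normal(size=x.shape[0]) * 0.01
    logit = float(weights @ x)
    return 1.0 / (1.0 + np.exp(-logit))        # sigmoid -> probability

emb = pooled_speech_embedding()
meta = encode_metadata(age=72, gender_idx=1, ethnicity_idx=0)
p = fuse_and_score(emb, meta)
print(f"fused feature dim: {emb.shape[0] + meta.shape[0]}, score: {p:.3f}")
```

Ablating a metadata feature, as in the study above, would amount to zeroing or dropping its slice of the metadata vector before fusion and re-measuring performance.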


GC-155-130 Multimodal Speech-Based Dementia Detection
