GRM-043 Performance Assessment of DeepSeek versus Bard and ChatGPT in Detecting Alzheimer’s Dementia
Location
https://www.kennesaw.edu/ccse/events/computing-showcase/sp25-cday-program.php
Document Type
Event
Start Date
15-4-2025 4:00 PM
Description
Alzheimer’s disease is a growing public health issue due to its progressive nature and increasing prevalence. Large language models (LLMs) offer promising avenues for non-invasive cognitive assessment through natural language understanding. In this study, we evaluate two DeepSeek models, the general-purpose V3 and the reasoning-enhanced R1, for distinguishing individuals with Alzheimer’s dementia (AD) from cognitively normal (CN) individuals using transcripts derived from spontaneous speech. Two baseline prompting strategies (zero-shot and chain-of-thought) were applied to both models, and an additional strategy (self-consistency prompting) was applied to test whether it improved predictions. Accuracy was the primary performance metric. For identifying AD, the general-purpose DeepSeek V3 model achieved the highest true positive rate (88%), but tended to misclassify CN individuals as AD. In contrast, the DeepSeek-R1 model achieved the highest true negative rate (90%) for CN classification. Overall, the DeepSeek models surpass chance-level classification, but further refinement is needed before they can be considered for clinical use.
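The self-consistency step and the reported metrics can be illustrated with a minimal Python sketch. It assumes that self-consistency is implemented as a majority vote over several sampled answers per transcript and that AD is treated as the positive class; the abstract does not confirm either detail. The function names and the toy labels are hypothetical and are not the study's code or data.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label among the sampled answers.

    Ties resolve to the label seen first, so an odd number of
    samples is the safer choice for a two-class problem.
    """
    return Counter(labels).most_common(1)[0][0]


def classification_rates(y_true, y_pred, positive="AD"):
    """Compute accuracy, true positive rate, and true negative rate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    pos, neg = tp + fn, tn + fp
    return {
        "accuracy": (tp + tn) / len(pairs),
        "true_positive_rate": tp / pos if pos else float("nan"),
        "true_negative_rate": tn / neg if neg else float("nan"),
    }


# Hypothetical toy data: three sampled answers for each of four transcripts.
sampled_answers = [
    ["AD", "AD", "CN"],
    ["CN", "CN", "AD"],
    ["AD", "CN", "AD"],
    ["CN", "CN", "CN"],
]
y_true = ["AD", "CN", "CN", "CN"]

y_pred = [majority_vote(answers) for answers in sampled_answers]
rates = classification_rates(y_true, y_pred)
print(y_pred)
print(rates)
```

Reporting the true positive and true negative rates alongside accuracy makes a bias such as labeling CN speakers as AD visible, because it lowers the true negative rate even when accuracy stays above chance.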