Location
https://www.kennesaw.edu/ccse/events/computing-showcase/fa24-cday-program.php
Document Type
Event
Start Date
November 19, 2024, 4:00 PM
Description
This paper presents a detailed analysis of the algorithmic complexity of the K-means clustering algorithm, a foundational method in unsupervised machine learning. Although finding the optimal solution is NP-hard, K-means is widely used to partition data efficiently into clusters by minimizing within-cluster variance. We explore four main ideas for improvement: 1) parallel point generation and processing to speed up convergence, 2) penalty scoring to avoid clusters with high internal variability, 3) alternative distance measures, such as Manhattan distance, to provide better clustering for structures of heterogeneous objects, and 4) the addition of probabilistic assignments in the form of Gaussian Mixture Models (GMM) for more adaptable, soft K-means clustering. As a practical application, K-means is applied to telecommunication data to understand purchasing behavior through customer segmentation. We observed that K-means++ provided the best method for centroid initialization, while parallel K-means minimized execution time. Penalty scoring produced more balanced clusters than the baseline, and GMM allowed more flexibility in defining cluster boundaries. Initial findings show that K-means++ achieves an average silhouette score of 0.67, while the GMM method achieves 0.65 and is particularly well suited to complex customer segmentation.
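The sketch below illustrates the kind of comparison the abstract describes: K-means with K-means++ centroid initialization against a soft-assignment GMM, both scored with the silhouette metric. It is a minimal illustration using synthetic data and scikit-learn, not the authors' implementation; the dataset, cluster count, and parameters are placeholders rather than the telecom data used in the paper.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the customer-segmentation features
# (the paper's actual telecom dataset is not reproduced here).
X, _ = make_blobs(n_samples=1000, centers=4, cluster_std=1.2, random_state=0)

# Hard clustering with K-means++ centroid initialization.
km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0)
km_labels = km.fit_predict(X)

# GMM: soft assignments via posterior probabilities; hard labels
# are taken for silhouette scoring.
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
gmm_labels = gmm.fit_predict(X)

print("K-means++ silhouette:", silhouette_score(X, km_labels))
print("GMM silhouette:      ", silhouette_score(X, gmm_labels))

On well-separated blobs like these, the two methods score similarly; the GMM's advantage, as the abstract notes, appears on data where cluster boundaries overlap and soft assignments (gmm.predict_proba(X)) capture membership uncertainty that hard K-means labels cannot.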
Included in
GMC-191 Optimizing K-means Clustering for Customer Analytics: A Multi-faceted Enhancement Approach