Location

https://www.kennesaw.edu/ccse/events/computing-showcase/fa24-cday-program.php

Document Type

Event

Start Date

19-11-2024 4:00 PM

Description

This paper presents a detailed analysis of the algorithmic complexity of the K-means clustering algorithm, a foundational method in unsupervised machine learning. Although finding the optimal solution is NP-hard, K-means is widely used to efficiently partition data into clusters by minimizing within-cluster variance. We explore four main ideas for improvement: 1) parallel point generation and processing to speed up convergence, 2) penalty scoring to avoid clusters with high internal variability, 3) alternative distance measures such as Manhattan distance to provide better clustering for objects of structurally different natures, and 4) the addition of probability in the form of Gaussian Mixture Models (GMM) for more adaptable, soft K-means clustering. As a practical application, K-means is applied to telecommunication data to understand purchasing behavior via customer segmentation. We observed that K-means++ provided the best method for centroid initialization, while parallel K-means minimized execution time. Penalty scoring produced more balanced clusters than the baseline, and GMM allowed more flexibility in defining cluster boundaries. Initial findings show that K-means++ achieves an average silhouette score of 0.67, while GMM achieves 0.65 and is particularly well suited to complex customer segmentation.
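For readers who want to see the K-means++ versus GMM comparison concretely, the following is a minimal Python sketch using scikit-learn. It is not the authors' implementation: the synthetic data from make_blobs stands in for the proprietary telecommunication dataset, and the cluster count of 4 is an illustrative assumption, so the silhouette scores it prints will differ from the 0.67 / 0.65 reported above.

# Illustrative sketch only; synthetic data stands in for the
# telecommunication customer dataset, which is not public.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for the customer-segmentation features.
X, _ = make_blobs(n_samples=1000, centers=4, random_state=42)

# Hard clustering with K-means++ centroid initialization.
km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=42)
km_labels = km.fit_predict(X)

# Soft clustering via a Gaussian Mixture Model; assignments are
# hardened (argmax of posterior) so both methods can be scored alike.
gmm = GaussianMixture(n_components=4, covariance_type="full",
                      random_state=42)
gmm_labels = gmm.fit_predict(X)

print("K-means++ silhouette:", silhouette_score(X, km_labels))
print("GMM silhouette:      ", silhouette_score(X, gmm_labels))

The silhouette score ranges from -1 to 1, with higher values indicating tighter, better-separated clusters; it is the same metric the abstract uses to compare the two methods.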

Nov 19th, 4:00 PM

GMC-191 Optimizing K-means Clustering for Customer Analytics: A Multi-faceted Enhancement Approach
