Date of Submission
Master of Science in Computer Science (MSCS)
High Performance Computing
Dr. Yong Shi
Dr. Yong Shi
Dr. Selena He
Dr. Xiaohua Xu
Clustering is an unsupervised machine learning task that seeks to partition a set of data into smaller groupings, referred to as “clusters”, where items within the same cluster are somehow alike, while differing from those in other clusters. There are many different algorithms for clustering, but many of them are overly complex and scale poorly with larger data sets. In this paper, a new algorithm for clustering is proposed to solve some of these issues. Density-based clustering algorithms use a concept called the “underlying density function”, which is a conceptual higher-dimension function that describes the possible results from the continuous data set that our input data is just a discrete sample of. The algorithm proposed in this paper seeks to use this concept by creating a piecewise approximation of the underlying density function, and then merging points towards local density maxima from this higher-dimensioned space. First, the data space is divided into a grid-based structure and the density of each grid is calculated. Second, each of these “grid-squares” determines the densest space in its local area. Finally, the grid squares are merged together in the direction of their local density maximum, ultimately merging with one of the density maxima that form the root of a cluster. The experimental results show significant time improvements over standard algorithms such as DBSCAN with no accuracy penalty. Furthermore, the algorithm is also suitable for use with parallel and distributed systems, as an implementation with Apache Spark showed proper parallel scaling with low data set sizes required to overtake the serial implementation.
Brown, Daniel, "Fast Clustering Using a Grid-Based Underlying Density Function Approximation" (2020). Master of Science in Computer Science Theses. 31.