Date of Submission

Spring 4-15-2020

Degree Type

Thesis

Degree Name

Master of Science in Computer Science (MSCS)

Department

Computer Science

Committee Chair/First Advisor

Dr. Yong Shi

Track

High Performance Computing

Chair

Dr. Yong Shi

Committee Member

Dr. Selena He

Committee Member

Dr. Xiaohua Xu

Related Publications

D. Brown, A. Japa and Y. Shi, "A Fast Density-Grid Based Clustering Method," 2019 IEEE 9th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, USA, 2019, pp. 48-54.

D. Brown, A. Japa and Y. Shi, "An Attempt at Improving Density-based Clustering Algorithms," 2019 ACM Southeast Conference, Atlanta, GA, USA, 2019, pp. 172-175.

D. Brown and Y. Shi, "A Distributed Density-Grid Clustering Algorithm for Multi-Dimensional Data," 2020 IEEE 10th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, USA, 2020.

Abstract

Clustering is an unsupervised machine learning task that partitions a data set into smaller groups, referred to as “clusters”, such that items within the same cluster are alike in some way and differ from items in other clusters. Many clustering algorithms exist, but many of them are overly complex and scale poorly to larger data sets. This work proposes a new clustering algorithm that addresses some of these issues. Density-based clustering algorithms rely on a concept called the “underlying density function”, a conceptual higher-dimensional function describing the continuous data set of which the input data is only a discrete sample. The proposed algorithm uses this concept by building a piecewise approximation of the underlying density function and then merging points towards the local density maxima of this higher-dimensional space. First, the data space is divided into a grid and the density of each grid cell is calculated. Second, each of these cells (“grid squares”) determines the densest cell in its local area. Finally, the grid squares are merged in the direction of their local density maximum, so that each one ultimately joins one of the density maxima that form the root of a cluster. The experimental results show significant time improvements over standard algorithms such as DBSCAN with no accuracy penalty. The algorithm is also suitable for parallel and distributed systems: an implementation with Apache Spark scaled well in parallel and needed only small data set sizes to overtake the serial implementation.
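The three steps described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis implementation: the function name `grid_density_cluster`, the `cell_size` parameter, the use of a raw point count as the density, the 3^d neighbourhood as the "local area", and the tie-breaking rule are all assumptions made for illustration, and noise handling is omitted entirely.

```python
from collections import Counter
from itertools import product
from math import floor


def grid_density_cluster(points, cell_size):
    """Label points by following each grid cell uphill to a local density maximum.

    Hypothetical sketch: density is a plain point count per cell, and the
    "local area" of a cell is assumed to be the cell plus its adjacent cells.
    """
    # Step 1: divide the data space into a grid and count points per cell.
    cells = [tuple(floor(x / cell_size) for x in p) for p in points]
    density = Counter(cells)

    # Step 2: each occupied cell finds the densest occupied cell among itself
    # and its neighbours. Ties are broken by cell index (an assumption), which
    # gives a strict ordering and so rules out cycles between cells.
    dims = len(cells[0]) if cells else 0
    offsets = list(product((-1, 0, 1), repeat=dims))
    parent = {}
    for cell in density:
        neighbours = (tuple(c + o for c, o in zip(cell, off)) for off in offsets)
        occupied = [n for n in neighbours if n in density]
        parent[cell] = max(occupied, key=lambda n: (density[n], n))

    # Step 3: merge each cell towards its local maximum by following the
    # parent links until reaching a cell that is its own parent (a root).
    def root(cell):
        while parent[cell] != cell:
            cell = parent[cell]
        return cell

    # Number the roots in order of first appearance and label every point.
    root_ids = {}
    return [root_ids.setdefault(root(cell), len(root_ids)) for cell in cells]


points = [(0.1, 0.1), (0.2, 0.2), (0.3, 0.1), (1.2, 0.2),
          (5.1, 5.1), (5.2, 5.3), (6.1, 5.2)]
labels = grid_density_cluster(points, cell_size=1.0)
print(labels)  # [0, 0, 0, 0, 1, 1, 1]
```

In this toy input the sparse cell containing `(1.2, 0.2)` merges into its denser neighbour, while the two groups far apart keep separate roots. Because each cell is processed independently in steps 1 and 2, a sketch of this shape is at least consistent with the abstract's remark that the approach suits parallel and distributed execution.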

Included in

Other Computer Sciences Commons, Theory and Algorithms Commons

Master of Science in Computer Science Theses

Fast Clustering Using a Grid-Based Underlying Density Function Approximation
