Date of Submission

Summer 8-9-2019

Degree Type

Thesis

Degree Name

Master of Science in Computer Science (MSCS)

Department

Computer Science

Committee Chair/First Advisor

Dr. Yong Shi

Track

Others

Thesis Chair

Dr. Coskun Cetinkaya

Committee Member

Dr. Yong Shi

Committee Member

Dr. Selena He

Committee Member

Dr. Mingon Kang

Abstract

The K-Nearest Neighbors (KNN) algorithm is a simple but powerful technique used in the field of data analytics. It uses a distance metric to identify existing samples in a dataset that are similar to a new sample. The new sample can then be classified by a class majority vote among its most similar samples, i.e., its nearest neighbors. The KNN algorithm can be applied in many fields, such as recommender systems, where it can be used to group related products or predict user preferences. In most cases, the performance of the KNN algorithm suffers as the size of the dataset increases, because the number of distance comparisons required to classify each new sample grows with the number of existing samples. In this paper, we propose a KNN optimization algorithm that leverages vector space models to enhance the nearest neighbors search for a new sample. It accomplishes this enhancement by restricting the search area, thereby reducing the number of comparisons needed to find the nearest neighbors. The experimental results demonstrate significant performance improvements without degrading the algorithm’s accuracy. The applicability of this optimization algorithm is further explored in the field of Big Data by parallelizing the work using Apache Spark. The experimental results of the Spark implementation demonstrate that it outperforms the serial, or local, implementation of this optimization algorithm once the dataset size reaches a specific threshold, further improving the performance of this optimization algorithm in the field of Big Data, where large datasets are prevalent.

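For context, the listing below is a minimal sketch of the brute-force KNN baseline the abstract describes: a distance metric (Euclidean is assumed here) identifies the most similar existing samples, and a class majority vote among the k nearest neighbors classifies the new sample. The linear scan over all samples is the cost that the proposed optimization reduces by restricting the search area; the vector-space restriction and the Apache Spark parallelization are described in the thesis itself and are not reproduced here. All function and variable names are illustrative, not taken from the thesis.

import math
from collections import Counter

def euclidean(a, b):
    # Euclidean distance between two feature vectors (assumed metric).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(samples, labels, query, k=3):
    # Brute-force scan: compare the query against every existing sample.
    # This linear scan is what the proposed optimization aims to shrink.
    distances = sorted(
        (euclidean(s, query), label) for s, label in zip(samples, labels)
    )
    # Majority vote among the k nearest neighbors.
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Example usage with a toy two-class dataset:
samples = [(1.0, 1.0), (1.2, 0.8), (8.0, 9.0), (9.0, 8.5)]
labels = ["A", "A", "B", "B"]
print(knn_classify(samples, labels, (1.1, 1.0), k=3))  # prints "A"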