Date of Submission
Master of Science in Computer Science (MSCS)
Dr. Yong Shi
Dr. Coskun Cetinkaya
Dr. Yong Shi
Dr. Selena He
Dr. Mingon Kang
The K-Nearest Neighbors (KNN) algorithm is a simple but powerful technique used in the field of data analytics. It uses a distance metric to identify existing samples in a dataset that are similar to a new sample. The new sample can then be classified by a majority vote over the classes of its most similar samples, i.e., its nearest neighbors. The KNN algorithm can be applied in many fields, such as recommender systems, where it can be used to group related products or predict user preferences. In most cases, the performance of the KNN algorithm suffers as the size of the dataset increases, because each new sample must be compared against every existing sample and the number of comparisons therefore grows with the size of the dataset. In this paper, we propose a KNN optimization algorithm that leverages vector space models to enhance the nearest neighbors search for a new sample. It accomplishes this by restricting the search area, thereby reducing the number of comparisons necessary to find the nearest neighbors. The experimental results demonstrate significant performance improvements without degrading the algorithm’s accuracy. The applicability of this optimization algorithm is further explored in the field of Big Data by parallelizing the work using Apache Spark. The experimental results of the Spark implementation demonstrate that it outperforms the serial, or local, implementation of this optimization algorithm once the dataset size reaches a specific threshold, which further improves the performance of this optimization algorithm in the field of Big Data, where large datasets are prevalent.
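For readers unfamiliar with the baseline that the abstract refers to, the following is a minimal sketch of plain brute-force KNN classification with majority voting. It is not the optimization proposed in the thesis. The function name `knn_classify`, the choice of Euclidean distance, and the toy data are assumptions made for illustration only.

```python
import math
from collections import Counter


def knn_classify(samples, labels, query, k=3):
    """Classify `query` by majority vote among its k nearest samples.

    Illustrative baseline only: Euclidean distance is an assumed metric,
    and every stored sample is compared against the query (no restriction
    of the search area).
    """
    # One distance computation per stored sample.
    distances = [
        (math.dist(sample, query), label)
        for sample, label in zip(samples, labels)
    ]
    # Keep the k closest samples.
    distances.sort(key=lambda pair: pair[0])
    nearest = distances[:k]
    # Majority vote over the classes of the nearest neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated clusters.
samples = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_classify(samples, labels, (0.5, 0.5), k=3))
print(knn_classify(samples, labels, (5.5, 5.5), k=3))
```

Because every query scans all stored samples, the cost of this baseline grows with the dataset size, which is the cost the proposed optimization aims to reduce by restricting the search area.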
Japa, Arialdis, "KNN Optimization for Multi-Dimensional Data" (2019). Master of Science in Computer Science Theses. 25.