Topic clustering of COVID-19 open research dataset (CORD-19) using graph clustering approach

Disciplines

Applied Statistics | Artificial Intelligence and Robotics | Databases and Information Systems | Discrete Mathematics and Combinatorics | Numerical Analysis and Scientific Computing | Programming Languages and Compilers | Statistical Models | Statistics and Probability

Abstract (300 words maximum)

Topic clustering is an important approach in text analytics because labeled documents are rarely available for classifying documents on a specific problem. The global COVID-19 pandemic, caused by a novel coronavirus, has raised specific problems for COVID-19 research. A large corpus of scientific research articles was released as a public dataset to help identify the articles most useful for supporting coronavirus vaccine research. This paper uses tf-idf preprocessing to create a similarity matrix, which serves as the weighted-edge adjacency matrix for graph clustering. K-Means was also run as a standalone method to compare its results with those of the graph clustering algorithms. Clustering efficiency is measured by inter- and intra-cluster distance metrics. Decision trees are trained on the clustered data to compare the clustering algorithms by classification accuracy. Finally, conclusions and future directions are provided for retrieving COVID-19-specific documents from the entire corpus.
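
The pipeline described in the abstract (tf-idf, similarity matrix, weighted adjacency matrix, graph clustering, distance-based evaluation) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not name the graph clustering algorithm, so the sketch substitutes the simplest possible one (drop edges below a similarity threshold and take connected components); the function names, the threshold value, and the four toy documents are hypothetical and are not taken from CORD-19. The K-Means and decision-tree comparisons are not covered here.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_matrix(vectors):
    """Pairwise cosine similarities, used as a weighted adjacency matrix."""
    n = len(vectors)
    return [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]


def graph_clusters(adjacency, threshold):
    """Assumed stand-in for graph clustering: connected components of the
    graph that keeps only edges with weight >= threshold."""
    n = len(adjacency)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            node = stack.pop()
            for other in range(n):
                if labels[other] == -1 and adjacency[node][other] >= threshold:
                    labels[other] = current
                    stack.append(other)
        current += 1
    return labels


def _mean(values):
    return sum(values) / len(values) if values else float("nan")


def mean_distances(adjacency, labels):
    """Mean intra- and inter-cluster distance, with distance = 1 - similarity."""
    intra, inter = [], []
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            bucket = intra if labels[i] == labels[j] else inter
            bucket.append(1.0 - adjacency[i][j])
    return _mean(intra), _mean(inter)


# Hypothetical toy corpus, invented for this sketch.
docs = [
    "coronavirus vaccine trial antibody response",
    "vaccine antibody response in coronavirus patients",
    "hospital ventilator capacity planning",
    "ventilator capacity and hospital staffing",
]
adjacency = similarity_matrix(tfidf_vectors(docs))
labels = graph_clusters(adjacency, threshold=0.1)
intra, inter = mean_distances(adjacency, labels)
print(labels, intra, inter)
```

On this toy input the two vaccine documents fall into one cluster and the two hospital documents into another, and the mean intra-cluster distance is smaller than the mean inter-cluster distance, which is the pattern a useful clustering is expected to show.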

Academic department under which the project should be listed

CCSE - Data Science and Analytics

Primary Investigator (PI) Name

Dr. Joe DeMaio

This document is currently not available here.

