Topic clustering of COVID-19 open research dataset (CORD-19) using graph clustering approach
Disciplines
Applied Statistics | Artificial Intelligence and Robotics | Databases and Information Systems | Discrete Mathematics and Combinatorics | Numerical Analysis and Scientific Computing | Programming Languages and Compilers | Statistical Models | Statistics and Probability
Abstract (300 words maximum)
Topic clustering is an important approach in text analytics because labeled documents are rarely available for classifying documents on a specific problem. The global COVID-19 pandemic, caused by a novel coronavirus, has raised specific research problems of this kind. A large corpus of scientific research articles was released as a public dataset to help find the best research articles to support coronavirus vaccine research. This paper uses the tf-idf preprocessing technique to create a similarity matrix, which serves as the weighted adjacency matrix for graph clustering. K-Means was also run as a standalone method to compare its results with those of the graph clustering algorithms. Clustering efficiency is measured by inter-cluster and intra-cluster distance metrics. Decision trees are trained on the clustered data to compare the clustering algorithms by classification accuracy. Finally, conclusions and future directions are provided for retrieving COVID-19-specific documents from the entire corpus.
Academic department under which the project should be listed
CCSE - Data Science and Analytics
Primary Investigator (PI) Name
Dr. Joe DeMaio
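The first steps named in the abstract (tf-idf weighting, a document similarity matrix, and clustering on the resulting weighted graph) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not say which tokenizer, similarity measure, or graph clustering algorithm the project used. The sketch uses whitespace tokens, cosine similarity, and connected components of a thresholded graph as a simple stand-in for a graph clustering method. The four toy documents, the function names, and the 0.2 threshold are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Relative term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_matrix(vectors):
    """Pairwise similarities, read as a weighted adjacency matrix."""
    n = len(vectors)
    return [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]


def cluster_by_components(sim, threshold):
    """Stand-in graph clustering (assumption): keep edges with weight >=
    threshold and label each connected component as one cluster."""
    n = len(sim)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first traversal of one component
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and sim[i][j] >= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Hypothetical toy corpus, not taken from CORD-19.
docs = [
    "coronavirus vaccine trial results",
    "vaccine trial for novel coronavirus",
    "graph clustering of text documents",
    "clustering text documents with graph methods",
]
sim = similarity_matrix(tfidf_vectors(docs))
labels = cluster_by_components(sim, threshold=0.2)
print(labels)
```

On a corpus the size of CORD-19, a dense pairwise matrix like this would be impractical, so sparse matrices and a dedicated graph clustering library would be the more likely choice. The sketch is only meant to show how a similarity matrix doubles as a weighted adjacency matrix.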