A Study on the Comparative Effectiveness of News Article Embedding Methods for Extractive Summarization
Disciplines
Artificial Intelligence and Robotics | Theory and Algorithms
Abstract (300 words maximum)
Our research project assesses the comparative effectiveness of several article embedding methods on the performance of a Recurrent Neural Network trained to perform extractive summarization. Extractive summarization is typically performed via the following general steps: dataset selection and preprocessing, data embedding, model training, and model evaluation. Our research focuses primarily on the “data embedding” step, asking which kind of embedding performs best for extractive news article summarization. We use the CNN/DailyMail dataset, which consists of 311,971 news articles and corresponding summaries. Because these summaries are abstractive, we derive extractive sentence labels with a greedy ROUGE-based approach that selects the five article sentences whose combination is most similar, by ROUGE score, to the expert-written summary of the article. These sentences are labeled as the “ideal” extractive summary. Each article is then transformed into a sequence of sentence embeddings via several methods: TF-IDF sentence vectors computed on a per-article basis; Word2Vec-based sentence vectors, computed by averaging the Word2Vec vectors of the words each sentence contains, using both per-article-trained and Google News 300 pretrained Word2Vec models; and SBERT sentence embeddings. These sequences of sentence embeddings are fed into a Long Short-Term Memory (LSTM) Recurrent Neural Network trained to predict which sentences belong in the final extractive summary of each article. Finally, we gather the RNN’s performance results for each embedding method and compare them to determine which type of article embedding yields the best performance.
Use of AI Disclaimer
no
Academic department under which the project should be listed
CCSE – Computer Science
Primary Investigator (PI) Name
Md Abdulla Al Hafiz Khan
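The greedy ROUGE-based labeling step described in the abstract can be sketched as follows. The unigram-overlap F1 scorer below is a simplified stand-in for a full ROUGE implementation, and the function names and k=5 cutoff reflect the abstract's description; the rest is illustrative:

```python
from collections import Counter

def rouge1_f1(candidate_tokens, reference_tokens):
    """Unigram-overlap F1, a simple stand-in for ROUGE-1."""
    if not candidate_tokens or not reference_tokens:
        return 0.0
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)

def greedy_extractive_labels(article_sentences, reference_summary, k=5):
    """Greedily pick up to k sentences whose union best matches the reference summary."""
    reference_tokens = reference_summary.lower().split()
    selected = []          # indices of chosen sentences
    selected_tokens = []   # running token pool of the chosen sentences
    best_score = 0.0
    for _ in range(min(k, len(article_sentences))):
        best_idx = None
        for i, sent in enumerate(article_sentences):
            if i in selected:
                continue
            score = rouge1_f1(selected_tokens + sent.lower().split(), reference_tokens)
            if score > best_score:
                best_score, best_idx = score, i
        if best_idx is None:  # no remaining sentence improves the score; stop early
            break
        selected.append(best_idx)
        selected_tokens += article_sentences[best_idx].lower().split()
    # binary label per sentence: 1 if it belongs to the "ideal" extractive summary
    return [1 if i in selected else 0 for i in range(len(article_sentences))]
```

Stopping early when no sentence improves the union score mirrors the usual greedy-oracle construction for extractive labels.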
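The per-article TF-IDF embedding can be sketched by treating each sentence of one article as a "document" for the IDF statistics. This is a minimal pure-Python version with a smoothed IDF; the actual study could equally use a library vectorizer, and the smoothing choice here is an assumption:

```python
import math
from collections import Counter

def tfidf_sentence_vectors(sentences):
    """TF-IDF vectors for one article, treating each sentence as a 'document'."""
    tokenized = [s.lower().split() for s in sentences]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(tokenized)
    # document frequency of each term within this single article
    df = {w: sum(1 for toks in tokenized if w in toks) for w in vocab}
    # smoothed idf so terms appearing in every sentence still get weight >= 1
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    vectors = []
    for toks in tokenized:
        if not toks:
            vectors.append([0.0] * len(vocab))
            continue
        counts = Counter(toks)
        vectors.append([(counts[w] / len(toks)) * idf[w] for w in vocab])
    return vocab, vectors
```

Because the vocabulary is built per article, vectors from different articles have different dimensions, so the downstream model sees each article's sentence sequence in its own TF-IDF space.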
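The Word2Vec-averaging step reduces a sentence to the mean of its word vectors. A minimal sketch, assuming `word_vectors` is a token-to-vector mapping (in the study this would come from a per-article-trained gensim `Word2Vec` model or the pretrained Google News 300 vectors); out-of-vocabulary-only sentences fall back to a zero vector:

```python
def average_sentence_vector(sentence, word_vectors, dim):
    """Average the word vectors of a sentence's tokens; zeros if all tokens are OOV."""
    tokens = [t for t in sentence.lower().split() if t in word_vectors]
    if not tokens:
        return [0.0] * dim
    sums = [0.0] * dim
    for t in tokens:
        for j, v in enumerate(word_vectors[t]):
            sums[j] += v
    return [s / len(tokens) for s in sums]
```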
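The LSTM sentence extractor can be sketched as a sequence tagger over sentence embeddings, emitting one inclusion logit per sentence. This PyTorch sketch is illustrative only; the hidden size, bidirectionality, and class name are assumptions not specified in the abstract:

```python
import torch
import torch.nn as nn

class SentenceExtractor(nn.Module):
    """LSTM over a sequence of sentence embeddings; scores each sentence for inclusion."""
    def __init__(self, embed_dim, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, 1)

    def forward(self, sentence_embeddings):
        # sentence_embeddings: (batch, num_sentences, embed_dim)
        hidden, _ = self.lstm(sentence_embeddings)
        # one logit per sentence: does it belong in the extractive summary?
        return self.classifier(hidden).squeeze(-1)
```

Training against the greedy binary labels with `nn.BCEWithLogitsLoss` would complete the pipeline; at inference, the top-scoring sentences form the extractive summary.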