Multi-Hop QA Comparison

Primary Investigator (PI) Name

Md Abdullah Al Hafiz Khan

Department

CCSE – Computer Science

Abstract

The goal of this project is to evaluate and compare transformer-based models (RoBERTa, Longformer, DPR, and T5) on complex question-answering corpora. Using corpora such as hotpot_qa, we will ask complex questions that require multiple hops between documents to return correct answers. Work will be done in Python with a few supporting libraries to aid implementation and visualization. The models will be evaluated using measures such as Euclidean distance to determine how far each predicted answer is from the expected output. Other comparison metrics will include accuracy, precision, recall, and F1 score. Once a baseline evaluation of these models is complete, fine-tuning will be attempted to further improve the results. Complex question answering is important when questions are incomplete or convey only the general gist of what is truly being asked. We aim to put the models' ability to infer to the test. Comparing these models and fine-tuning them on the datasets used should provide a meaningful collection of data points to display and analyze. Following this analysis, shortcomings in these models will be investigated, and potential solutions will be explored and, where feasible, implemented.
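
The comparison metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the project's evaluation code. It assumes token-level precision, recall, and F1 computed over normalized answer strings, as is common in extractive question answering, and it assumes that the Euclidean distance is taken between fixed-length vector representations of the predicted and expected answers. The abstract does not say how answers are turned into vectors, so the function below simply accepts two numeric sequences. All function names (normalize, token_prf, exact_match, euclidean_distance) are hypothetical.

```python
import math
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_prf(prediction, reference):
    """Return token-level (precision, recall, F1) for one predicted answer."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def exact_match(prediction, reference):
    """True when the two answers are identical after normalization."""
    return normalize(prediction) == normalize(reference)


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length numeric vectors.

    Assumption: u and v are vector representations of the predicted and
    expected answers, produced by some encoder outside this sketch.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    print(token_prf("the Eiffel Tower in Paris", "Eiffel Tower"))
    print(exact_match("The Eiffel Tower.", "eiffel tower"))
    print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))
```

Accuracy over a dataset could then be reported as the fraction of questions for which exact_match returns True, with precision, recall, and F1 averaged across questions.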

Disciplines

Numerical Analysis and Scientific Computing | Other Computer Sciences

