Open-Domain Question Answering with Wikipedia and SQuAD

Disciplines

Data Storage Systems | Other Computer Engineering

Abstract (300 words maximum)

Finding precise answers in a sea of online information remains a constant challenge, even with powerful search engines at our fingertips. Our project explores how natural language processing can bridge that gap by building an open-domain question answering system that understands a question and returns a clear, evidence-based answer instead of a list of documents. The system combines information retrieval and deep language understanding in a two-step process. First, a retriever model using Dense Passage Retrieval (DPR) or MiniLM embeddings with FAISS identifies the most relevant passages from large text collections such as Wikipedia. Then, a fine-tuned BERT or DistilBERT reader model processes those passages and pinpoints the exact phrase or sentence that answers the question. This structure lets the retriever quickly narrow the search space while the reader provides precise, context-aware responses. We will train and test the model on benchmark datasets such as SQuAD v1/v2 and Natural Questions, which contain thousands of annotated question–answer pairs drawn from real text. System performance will be evaluated with retrieval recall for the retriever and Exact Match (EM) and F1 score for the reader, assessing both retrieval accuracy and answer quality. The goal of this project is not only to build a functional QA prototype but also to show how transformer models and dense retrieval can work together to make information access faster, more reliable, and closer to how humans naturally seek answers. In the long run, we hope this approach can help transform traditional search into true question understanding, allowing users to interact with knowledge more directly and meaningfully.
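The retrieve-then-read structure described in the abstract can be sketched in miniature. In this sketch, toy bag-of-words vectors stand in for MiniLM/DPR embeddings, and a plain NumPy inner-product search plays the role a FAISS index (e.g., IndexFlatIP over normalized vectors) would at scale; the `embed` and `retrieve` functions and the sample passages are hypothetical illustrations, not the project's actual implementation.

```python
import numpy as np

def embed(text, vocab):
    # Toy bag-of-words embedding: a stand-in for MiniLM/DPR dense vectors.
    v = np.zeros(len(vocab))
    for tok in text.lower().split():
        if tok in vocab:
            v[vocab[tok]] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

passages = [
    "Paris is the capital of France .",
    "The Eiffel Tower was completed in 1889 .",
    "Mount Everest is the highest mountain on Earth .",
]
vocab = {t: i for i, t in enumerate(
    sorted({w for p in passages for w in p.lower().split()}))}

# Passage index: one unit-normalized vector per passage.
# At scale, a FAISS IndexFlatIP would hold these rows instead.
index = np.stack([embed(p, vocab) for p in passages])

def retrieve(question, k=2):
    # Inner product of unit vectors == cosine similarity; take top-k.
    scores = index @ embed(question, vocab)
    top = np.argsort(-scores)[:k]
    return [passages[i] for i in top]
```

The retriever narrows the corpus to the `k` best-scoring passages; in the full system these would then be handed to a BERT or DistilBERT reader to extract the answer span.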
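The Exact Match and F1 metrics mentioned above follow the standard SQuAD evaluation convention: normalize both strings (lowercase, strip punctuation and articles), then compare exact equality for EM and token overlap for F1. The functions below are a minimal re-implementation sketch of that convention, not the official SQuAD evaluation script.

```python
import re
import string
from collections import Counter

def normalize(text):
    # SQuAD-style normalization: lowercase, drop punctuation and articles.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    # 1 if the normalized strings are identical, else 0.
    return int(normalize(prediction) == normalize(gold))

def f1_score(prediction, gold):
    # Token-level F1: harmonic mean of precision and recall
    # over tokens shared between prediction and gold answer.
    pred_toks = normalize(prediction).split()
    gold_toks = normalize(gold).split()
    common = Counter(pred_toks) & Counter(gold_toks)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```

For example, exact_match("Paris.", "the Paris") scores 1 after normalization, while f1_score("the cat sat", "cat sat down") rewards the partial overlap that EM would score as 0.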

Use of AI Disclaimer

No

Academic department under which the project should be listed

CCSE – Computer Science

Primary Investigator (PI) Name

Md Abdullah Al Hafiz Khan

