Date of Submission
Master of Science in Computer Science (MSCS)
Automatic extraction of relevant knowledge to domain-specific questions from Optical Character Recognition (OCR) documents is critical for developing intelligent systems, such as document search engines, sentiment analysis, and information retrieval, since hands-on knowledge extraction by a domain expert with a large volume of documents is intensive, unscalable, and time-consuming. There have been a number of studies that have automatically extracted relevant knowledge from OCR documents, such as ABBY and Sandford Natural Language Processing (NLP). Despite the progress, there are still limitations yet-to-be solved. For instance, NLP often fails to analyze a large document. In this thesis, we propose a knowledge extraction framework, which takes domain-specific questions as input and provides the most relevant sentence/paragraph to the given questions in the document. Overall, our proposed framework has two phases. First, an OCR document is reconstructed into a semi-structured document (a document with hierarchical structure of (sub)sections and paragraphs). Then, relevant sentence/paragraph for a given question is identified from the reconstructed semi structured document. Specifically, we proposed (1) a method that converts an OCR document into a semi structured document using text attributes such as font size, font height, and boldface (in Chapter 2), (2) an image-based machine learning method that extracts Table of Contents (TOC) to provide an overall structure of the document (in Chapter 3), (3) a document texture-based deep learning method (DoT-Net) that classifies types of blocks such as text, image, and table (in Chapter 4), and (4) a Question & Answer (Q&A) system that retrieves most relevant sentence/paragraph for a domain-specific question. A large number of document intelligent systems can benefit from our proposed automatic knowledge extraction system to construct a Q&A system for OCR documents. Our Q&A system has applied to extract domain specific information from business contracts at GE Power.
Kosaraju, Sai, "Document Layout Analysis and Recognition Systems" (2019). Master of Science in Computer Science Theses. 28.