Date of Submission

11-8-2019

Degree Type

Thesis

Degree Name

Master of Science in Computer Science (MSCS)

Department

Computer Science

Committee Chair/First Advisor

Mingon Kang

Track

Big Data

Chair

Mingon Kang

Committee Member

Chih-Cheng Hung

Committee Member

Junggab Son

Related Publications

Sai Kosaraju, Nelson Zange Tsaku, Pritesh Patel, Tanju Bayramoglu, Girish Modgil, and Mingon Kang. 2019. Table of Contents Recognition in OCR Documents using Image-based Machine Learning. In Proceedings of the 2019 ACM Southeast Conference (ACM SE '19). ACM, New York, NY, USA, 186-189. DOI: https://doi.org/10.1145/3299815.3314455.

Masum Mohammad, Sai Kosaraju*, Tanju Bayramoglu, Girish Modgil, and Mingon Kang. 2018. Automatic knowledge extraction from OCR documents using hierarchical document analysis. In Proceedings of the 2018 Conference on Research in Adaptive and Convergent Systems (RACS '18). ACM, New York, NY, USA, 189-194. DOI: https://doi.org/10.1145/3264746.3264793. (* indicates the co-first authors).

S. Kosaraju, M. Masum, N. Tsaku, P. Patel, T. Bayramoglu, G. Modgil, and M. Kang. 2019. DoT-Net: Document Layout Classification Using Texture-based CNN. In The 15th International Conference on Document Analysis and Recognition (ICDAR).

Abstract

Automatic extraction of knowledge relevant to domain-specific questions from Optical Character Recognition (OCR) documents is critical for developing intelligent systems, such as document search engines, sentiment analysis, and information retrieval, since manual knowledge extraction by a domain expert from a large volume of documents is labor-intensive, unscalable, and time-consuming. A number of studies and tools, such as ABBYY and Stanford Natural Language Processing (NLP), automatically extract relevant knowledge from OCR documents. Despite this progress, some limitations have yet to be solved. For instance, NLP often fails to analyze a large document. In this thesis, we propose a knowledge extraction framework that takes domain-specific questions as input and returns the sentence or paragraph in the document that is most relevant to the given questions. Overall, our proposed framework has two phases. First, an OCR document is reconstructed into a semi-structured document (a document with a hierarchical structure of (sub)sections and paragraphs). Then, the relevant sentence or paragraph for a given question is identified from the reconstructed semi-structured document. Specifically, we propose (1) a method that converts an OCR document into a semi-structured document using text attributes such as font size, font height, and boldface (in Chapter 2), (2) an image-based machine learning method that extracts the Table of Contents (TOC) to provide an overall structure of the document (in Chapter 3), (3) a document texture-based deep learning method (DoT-Net) that classifies types of blocks such as text, image, and table (in Chapter 4), and (4) a Question & Answer (Q&A) system that retrieves the most relevant sentence or paragraph for a domain-specific question. A large number of document intelligence systems can benefit from our proposed automatic knowledge extraction system to construct a Q&A system for OCR documents.
Our Q&A system has been applied to extract domain-specific information from business contracts at GE Power.
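
The two phases described in the abstract can be illustrated with a minimal sketch. This is not the method developed in the thesis: the `Line` and `Section` types, the heading rule (bold or larger than an assumed body font size), and the word-overlap score are all hypothetical simplifications chosen only to show the shape of the pipeline.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Line:
    """One OCR text line with the text attributes the abstract mentions (hypothetical type)."""
    text: str
    font_size: float
    bold: bool = False


@dataclass
class Section:
    """A heading plus the paragraphs that follow it (hypothetical type)."""
    title: str
    paragraphs: list = field(default_factory=list)


def build_sections(lines, body_size):
    """Phase 1 (assumed rule): a bold or larger-than-body line starts a new section."""
    sections = [Section(title="(front matter)")]
    for line in lines:
        if line.bold or line.font_size > body_size:
            sections.append(Section(title=line.text))
        else:
            sections[-1].paragraphs.append(line.text)
    return sections


def tokens(text):
    """Lowercased alphanumeric word set of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_paragraph(sections, question):
    """Phase 2 (assumed score): pick the paragraph sharing the most words with the question."""
    question_words = tokens(question)
    best_score, best_title, best_text = 0, None, None
    for section in sections:
        for paragraph in section.paragraphs:
            score = len(question_words & tokens(paragraph))
            if score > best_score:
                best_score, best_title, best_text = score, section.title, paragraph
    return best_title, best_text


# Toy input standing in for OCR output of a contract.
lines = [
    Line("1 Payment Terms", 14, bold=True),
    Line("The buyer shall pay each invoice within 30 days.", 10),
    Line("2 Warranty", 14, bold=True),
    Line("The seller warrants the equipment for 12 months.", 10),
]
sections = build_sections(lines, body_size=10)
title, paragraph = best_paragraph(sections, "How long is the warranty on the equipment?")
print(title, "->", paragraph)
```

A real system would need a more robust heading detector and a relevance model that handles stop words and paraphrase; the sketch only shows how a reconstructed section hierarchy lets an answer be reported together with the section it came from.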

Included in

Computer and Systems Architecture Commons, Other Computer Engineering Commons

Master of Science in Computer Science Theses

Document Layout Analysis and Recognition Systems
