Date of Award
Summer 7-31-2024
Degree Type
Dissertation
Degree Name
Ph.D. in Data Science and Analytics (Statistics)
Department
College of Computing and Software Engineering - School of Data Science and Analytics
Committee Chair/First Advisor
Minjae Woo
Second Advisor
Xinyan Zhang
Third Advisor
Jiho Noh
Fourth Advisor
Amin Pouriyeh
Abstract
In Computer-Aided Diagnosis (CADx) of breast cancer through mammography, prevailing models have traditionally been trained and validated on older film-based mammograms. However, contemporary U.S. hospitals use Full Field Digital Mammography (FFDM), which offers more detailed images, captured at various angles, than older film-scanned mammography. Despite this shift, existing research focuses predominantly on film-based datasets, and the implications of FFDM for CADx systems are not yet understood. This dissertation addresses three issues that arise with FFDM, concentrating on the practical application of FFDM datasets to breast cancer diagnosis with deep learning models: whether data augmentation across film-based and FFDM datasets is effective, the very large size of FFDM datasets, and class imbalance within cancer image datasets. The central focus of this study is to assess the efficacy of data augmentation when film-based mammography and FFDM datasets are combined. Although both sources share the domain of mammography, an in-depth investigation of how augmentation affects a binary classifier trained on the augmented dataset from both film-based and FFDM sources is critical. The second issue is the substantial file size of FFDM datasets; the Emory Breast Mammography Dataset (EMBED), at 2.5 TB (640,000 mammograms), is a typical example. Conventional compression algorithms prove inadequate, prompting the proposal of a novel Latent Diffusion Model (LDM)-based compression and decompression framework. This synthetic image generator proves effective in reducing dataset size by half while achieving decompression results comparable to those of lossless algorithms. The third issue is the imbalance between cancerous and normal cases within the EMBED dataset, for which patchwise synthetic data generation is proposed as a solution.
Generative Adversarial Networks (GANs) are trained on cancerous image patches to generate synthetic cancerous sub-patches, yielding a balanced dataset for evaluating the classifier's performance against the original dataset. This multifaceted research aims to improve the understanding and use of FFDM datasets in CADx systems, offering insights into data augmentation, efficient compression, and balanced dataset creation for improved breast cancer diagnosis.
Included in
Biomedical Informatics Commons, Computational Engineering Commons, Risk Analysis Commons