Date of Award

Summer 7-31-2024

Degree Type

Dissertation

Degree Name

Ph.D. in Analytics and Data Science

Department

School of Data Science & Analytics

Committee Chair/First Advisor

Minjae Woo

Second Advisor

Xinyan Zhang

Third Advisor

Jiho Noh

Fourth Advisor

Amin Pouriyeh

Abstract

In Computer-Aided Diagnosis (CADx) of breast cancer through mammography, prevailing models have traditionally been trained and validated on older film-based mammograms. Contemporary U.S. hospitals, however, use Full Field Digital Mammography (FFDM), which offers more detailed images captured at various angles than scanned film mammography. Despite this shift, existing research still focuses predominantly on film-based datasets, and the implications of FFDM for CADx systems are not yet understood. This dissertation addresses three issues that arise from FFDM, concentrating on the practical application of FFDM datasets to breast cancer diagnosis with deep learning models: whether data augmentation that combines film-based and FFDM datasets is effective, the very large size of FFDM datasets, and the class imbalance within cancer image datasets. The central focus of the study is to assess the efficacy of data augmentation that combines film-based mammography with FFDM. Although both sources belong to the same domain of mammography, an in-depth investigation is needed into how such augmentation affects a binary classifier trained on the combined dataset. The second issue is the substantial file size of FFDM datasets; the Emory Breast Mammography Dataset (EMBED), at 2.5 TB (640,000 mammograms), is a typical example. Conventional compression algorithms prove inadequate, which motivates the proposal of a novel Latent Diffusion Model (LDM)-based compression and decompression framework. This synthetic image generator reduces dataset size by half while achieving decompression results comparable to those of lossless algorithms. The third issue is the imbalance between cancerous and normal cases in the EMBED dataset, for which patchwise synthetic data generation is proposed as a solution. Generative Adversarial Networks (GANs) are trained on cancerous image patches and used to generate synthetic cancerous sub-patches, producing a balanced dataset on which the classifier's performance is evaluated against the original dataset. This multifaceted research aims to enhance the understanding and utilization of FFDM datasets in CADx systems, offering insights into data augmentation, efficient compression, and balanced dataset creation for improved breast cancer diagnosis.

Available for download on Saturday, July 05, 2025
