Date of Award
Spring 5-11-2022
Degree Type
Dissertation
Degree Name
Doctor of Philosophy in Analytics and Data Science
Department
Statistics and Analytical Sciences
Committee Chair/First Advisor
Dr. Jennifer Priestley
Committee Member
Dr. Herman Ray
Committee Member
Dr. Ying Xie
Abstract
Binary classification with imbalanced datasets remains a challenge. Typically, supervised learning algorithms minimize the binary cross-entropy objective function to determine the final parameter estimates. This objective function assumes an equal class distribution between the minority (i.e., events) and majority (i.e., non-events) classes, a condition that almost never holds in real-world modeling. In the imbalanced-data setting, this equal-distribution assumption is grossly violated, and the resulting parameter estimates are biased toward the majority class. To overcome this bias and improve model generalization, we focus on modifying the original binary cross-entropy objective function by uniquely weighting each minority-class observation. We base our weighting methodology on a technique developed in a recently published manuscript, which implemented a locally weighted log-likelihood objective function within logistic regression. Building on this published method, we develop instance-level weights for each minority-class observation that are learned from the data while overcoming the challenges of the original method. Our method drastically reduces the number of decision variables that must be estimated, ensures the boundedness of the instance-level weights, and maintains the convexity of the objective function for efficient and reliable parameter estimation. This dissertation provides a comprehensive critique of the recently published base algorithm and derives an alternative formulation of the objective function. We implement this novel objective function in logistic regression and neural network models and show significant performance improvements on synthetic and real-world imbalanced datasets.
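The abstract does not state the dissertation's exact weight parameterization, so the following is only a generic sketch of a binary cross-entropy objective with instance-level weights on the minority (event) class, of the kind described above:

$$
\mathcal{L}(\boldsymbol{\beta}) \;=\; -\sum_{i\,:\,y_i=1} w_i \log \hat{p}_i \;-\; \sum_{i\,:\,y_i=0} \log\bigl(1-\hat{p}_i\bigr), \qquad \hat{p}_i = \sigma\!\left(\mathbf{x}_i^{\top}\boldsymbol{\beta}\right),
$$

where $y_i \in \{0,1\}$ labels events, $\sigma$ is the logistic function, and each $w_i > 0$ is a bounded, data-driven weight attached only to minority-class observations; ordinary binary cross-entropy is recovered when every $w_i = 1$. So long as the $w_i$ do not depend on $\boldsymbol{\beta}$, the objective remains convex in $\boldsymbol{\beta}$, consistent with the convexity property the abstract emphasizes.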