Date of Award
Spring 3-24-2020
Degree Type
Dissertation
Degree Name
Doctor of Philosophy in Analytics and Data Science
Department
Statistics and Analytical Sciences
Committee Chair/First Advisor
Herman Ray
Committee Member
Joseph DeMaio
Committee Member
Lin Li
Committee Member
Sherry Ni
Committee Member
Ying Xie
Abstract
The log-likelihood function is the optimization objective in the maximum likelihood method for estimating models (e.g., logistic regression, neural networks). However, its formulation assumes that the target classes are equally distributed and that overall accuracy is the quantity to maximize, assumptions that do not hold in class-imbalance problems (e.g., fraud detection, rare disease diagnosis, customer conversion prediction, cybersecurity, predictive maintenance). When trained on imbalanced data, the resulting models tend to be biased toward the majority class (i.e., the non-event), which can incur substantial losses in practice. One strategy for mitigating this bias is to penalize the misclassification costs of observations differently in the log-likelihood objective function during learning. Existing penalized log-likelihood functions, however, require either hyperparameters that are difficult to estimate or a high computational cost. In the present work, we propose a novel penalized log-likelihood function that includes the penalty weights of minority-class (i.e., event) observations as decision variables and learns them from the data jointly with the model coefficients/parameters. The proposed log-likelihood function is used to train logistic regression and neural network models, which are compared with models trained with existing penalized log-likelihood functions on 10 public imbalanced datasets. Model performance is measured by statistics of the area under the ROC curve (AUROC, or AUC) over repeated runs of 10-fold stratified cross-validation, including the 95% confidence interval, mean, and standard deviation, as well as by training time. A more detailed analysis examines the estimated probability distributions and additional performance measures (Type I error, Type II error, and accuracy) at the chosen probability cutoff. The results demonstrate that using the proposed log-likelihood function as the learning objective improves the models' discrimination ability while reducing or maintaining computational complexity relative to existing penalized objectives.
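To make the idea concrete, the sketch below (Python with NumPy) trains a logistic regression by gradient ascent on a penalized log-likelihood in which every minority-class observation carries its own learnable penalty weight, updated jointly with the coefficients. This is a minimal illustration under stated assumptions, not the dissertation's exact formulation: the learning rate, the quadratic term lam*(w - r)^2 that anchors the weights near the cost-sensitive ratio r = n_nonevent/n_event (to keep them from collapsing to zero), and the initialization are all choices of this sketch.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def fit_penalized_logit(X, y, lam=1.0, lr=0.1, n_iter=3000):
    """Logistic regression via gradient ascent on a penalized
    log-likelihood with learnable per-event weights (illustrative)."""
    n, d = X.shape
    event = y == 1
    # Cost-sensitive ratio n_nonevent / n_event, used here only as an
    # anchor for the learnable weights (an assumption of this sketch).
    r = (~event).sum() / max(event.sum(), 1)
    beta = np.zeros(d)
    w = np.full(event.sum(), float(r))     # one weight per event observation
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        weights = np.ones(n)
        weights[event] = w
        # Gradient of the weighted log-likelihood w.r.t. beta:
        #   sum_i weight_i * (y_i - p_i) * x_i
        beta += lr / n * (X.T @ (weights * (y - p)))
        # Gradient w.r.t. each event weight w_i of
        #   w_i * log p_i - (lam/2) * (w_i - r)^2
        w += lr * (np.log(p[event] + 1e-12) - lam * (w - r))
        w = np.maximum(w, 0.0)             # penalty weights stay non-negative
    return beta, w

# Toy usage on synthetic, imbalanced data (hypothetical example):
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X @ np.array([1.5, -1.0, 0.5]) + rng.normal(size=1000) > 2.5).astype(float)
beta, w = fit_penalized_logit(X, y)

Because the weights are ordinary decision variables in the objective, they add only one scalar parameter per event observation to a single gradient loop, which is the sense in which such a formulation can avoid both a separate hyperparameter search and a large increase in computational cost.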