Date of Award

Spring 3-24-2020

Degree Type

Dissertation

Degree Name

Doctor of Philosophy in Analytics and Data Science

Department

Statistics and Analytical Sciences

Committee Chair/First Advisor

Herman Ray

Committee Member

Joseph DeMaio

Committee Member

Lin Li

Committee Member

Sherry Ni

Committee Member

Ying Xie

Abstract

The log-likelihood function is the optimization objective in the maximum likelihood method for estimating models (e.g., logistic regression, neural networks). However, its formulation rests on the assumptions that the target classes are equally distributed and that overall accuracy is to be maximized, neither of which holds in class imbalance problems (e.g., fraud detection, rare disease diagnosis, customer conversion prediction, cybersecurity, predictive maintenance). When trained on imbalanced data, the resulting models tend to be biased towards the majority (non-event) class, which can lead to substantial losses in practice. One strategy for mitigating this bias is to penalize the misclassification costs of observations differently in the log-likelihood objective function during learning. Existing penalized log-likelihood functions, however, either require difficult hyperparameter estimation or incur high computational complexity. In the present work, we propose a novel penalized log-likelihood function that includes penalty weights as decision variables for observations in the minority (event) class and learns them from the data along with the model coefficients/parameters. The proposed log-likelihood function is applied to train logistic regression and neural network models, which are compared with models trained by existing penalized log-likelihood functions on 10 public imbalanced datasets. Model performance is measured by statistics of the area under the ROC curve (AUROC, or AUC) over repeated runs of 10-fold stratified cross-validation, including the 95% confidence interval, mean, and standard deviation, as well as by training time. A more detailed analysis examines the estimated probability distributions and additional performance measures (Type I error, Type II error, accuracy) at the chosen probability cutoff. The results demonstrate that the proposed log-likelihood function improves the discrimination ability of the models while reducing or maintaining computational complexity compared with existing functions.
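
For concreteness, the sketch below illustrates the general idea in PyTorch: a logistic regression trained with a weighted negative log-likelihood in which per-observation penalty weights on minority-class (event) observations are decision variables optimized jointly with the model coefficients. The softplus transform and the mean-one normalization over the minority weights are assumptions added here so the learned weights stay positive and cannot collapse to zero; they are illustrative choices, not the dissertation's exact formulation.

    import torch

    def weighted_nll(p, y, raw_w, minority_mask):
        # Assumed constraint (not from the dissertation): keep learned
        # weights positive and normalize them to mean 1 over the minority
        # class so the optimizer cannot trivially shrink them to zero.
        w = torch.nn.functional.softplus(raw_w)
        w = w / w[minority_mask].mean()
        # Majority (non-event) observations keep a fixed weight of 1.
        weights = torch.where(minority_mask, w, torch.ones_like(w))
        ll = y * torch.log(p) + (1 - y) * torch.log(1 - p)
        return -(weights * ll).mean()

    # Toy imbalanced data: roughly 10% of observations are events.
    torch.manual_seed(0)
    n, d = 200, 5
    X = torch.randn(n, d)
    y = (torch.rand(n) < 0.1).float()
    minority_mask = y.bool()

    # Model coefficients and one penalty weight per observation are
    # trained jointly by the same optimizer.
    beta = torch.zeros(d, requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    raw_w = torch.zeros(n, requires_grad=True)

    opt = torch.optim.Adam([beta, b, raw_w], lr=0.05)
    for _ in range(500):
        opt.zero_grad()
        p = torch.sigmoid(X @ beta + b).clamp(1e-7, 1 - 1e-7)
        loss = weighted_nll(p, y, raw_w, minority_mask)
        loss.backward()
        opt.step()

Under this kind of scheme, the optimizer reallocates emphasis among minority observations without a separately tuned class-weight hyperparameter, which is consistent with the abstract's claim of avoiding difficult hyperparameter estimation.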
