Statistics and Analytical Sciences

Document Type


Submission Date



Credit risk prediction is an important problem in the financial services domain. While machine learning techniques such as Support Vector Machines and Neural Networks have been used for improved predictive modeling, the outcomes of such models are not readily explainable and, therefore, difficult to apply within financial regulations. In contrast, Decision Trees are easy to explain, and provide an easy to interpret visualization of model decisions. The aim of this paper is to predict worst non-financial payment status among businesses, and evaluate decision tree model performance against traditional Logistic Regression model for this task. The dataset for analysis is provided by Equifax and includes over 300 potential predictors from more than 11 million unique businesses. After a data discovery phase, including imputation, cleaning, and transforming potential predictors, Decision Tree and Logistic Regression models were built on the same finalized analysis dataset. Evaluating the models based on ROC index, and Kolmogorov-Smirnov statistic, Decision Tree performed as well as the Logistic Regression model.