MITIGATING THE EFFECTS OF CLASS IMBALANCE USING SMOTE AND TOMEK LINK UNDERSAMPLING IN SAS®

Presenters

Disciplines

Applied Statistics

Abstract

Many standard learning algorithms have trouble adequately learning the underrepresented class in imbalanced datasets. Altering the training data with sampling methods can make it easier for classifiers to learn the class of interest. Two such methods are SMOTE, which generates synthetic minority-class examples, and Tomek link undersampling, which removes majority-class examples from class boundaries. Both methods were implemented in SAS®, along with a combination of SMOTE followed by Tomek link undersampling (SMOTE+Tomek). Using a dataset of credit card transactions in which each transaction was labeled either fraud (the minority class of interest) or not fraud (the majority class), the efficacy of these techniques was tested by training four classifiers – a random forest, a neural network, a support vector machine, and a rule induction classifier – on training datasets preprocessed with each method and evaluating them on a validation set. Performance on the validation set was assessed using the ROC index, precision, recall, and the ratio of false negatives to false positives (FN/FP). SMOTE and SMOTE+Tomek were the most effective preprocessing methods for improving the detection of fraudulent transactions. Both methods improved recall and lowered the FN/FP ratio for every classifier, indicating improved sensitivity to fraud, and both improved the ROC index, indicating an improved ability to distinguish fraud from non-fraud.
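The study's SAS code is not reproduced on this page, but the SMOTE step the abstract describes can be sketched in a few lines of SAS/IML. The sketch below is illustrative only: the dataset name MINORITY, the input names x1-x3, and the choice of k = 5 neighbors are placeholder assumptions, not details from the study. For each minority-class example, SMOTE picks one of its k nearest minority-class neighbors at random and interpolates a synthetic point on the segment between the two.

   proc iml;
      use minority;                          /* minority-class rows only (assumed dataset name) */
      read all var {x1 x2 x3} into X;        /* numeric inputs (placeholder names)              */
      close minority;

      k = 5;                                 /* neighbors considered per example (assumption) */
      n = nrow(X);
      call randseed(27182);
      synth = j(n, ncol(X), .);              /* one synthetic example per minority row */

      do i = 1 to n;
         /* squared Euclidean distances from row i to every minority row */
         d = (X - repeat(X[i, ], n, 1))[ , ##];
         call sortndx(idx, d, 1);            /* idx[1] is row i itself (distance 0) */
         u = randfun(1, "Uniform");
         nb = idx[1 + ceil(k * u)];          /* choose one of the k nearest neighbors */
         gap = randfun(1, "Uniform");
         /* place the synthetic point between the example and its neighbor */
         synth[i, ] = X[i, ] + gap * (X[nb, ] - X[i, ]);
      end;

      create smote_synth from synth[colname = {x1 x2 x3}];
      append from synth;
      close smote_synth;
   quit;

The synthetic rows in smote_synth would then be appended to the original training data; to oversample by more than 100%, the loop would generate several neighbors per minority row rather than one.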
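A Tomek link is a pair of examples from opposite classes that are each other's nearest neighbor; deleting the majority-class member of each pair clears the class boundary. Under the same placeholder assumptions (a dataset TRAIN with inputs x1-x3 and a 0/1 target, 1 = fraud), a minimal SAS/IML sketch of the undersampling step, using a brute-force neighbor search, might look like this:

   proc iml;
      use train;
      read all var {x1 x2 x3} into X;        /* placeholder input names    */
      read all var {target}   into y;        /* 1 = fraud, 0 = not fraud   */
      close train;

      n = nrow(X);
      nn = j(n, 1, 0);
      do i = 1 to n;
         d = (X - repeat(X[i, ], n, 1))[ , ##];
         d[i] = max(d) + 1;                  /* exclude self from the search   */
         nn[i] = d[>:<];                     /* index of the nearest neighbor  */
      end;

      drop = j(n, 1, 0);
      do i = 1 to n;
         m = nn[i];
         /* Tomek link: mutual nearest neighbors with different labels */
         if nn[m] = i & y[i] ^= y[m] then do;
            if y[i] = 0 then drop[i] = 1;    /* remove only the majority member */
            else drop[m] = 1;
         end;
      end;

      keep = loc(drop = 0);
      out = X[keep, ] || y[keep, ];
      create train_tomek from out[colname = {x1 x2 x3 target}];
      append from out;
      close train_tomek;
   quit;

In the SMOTE+Tomek combination the abstract evaluates, this step would run on the SMOTE-augmented training data, so the link search sees both the real and the synthetic minority examples.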

Academic department under which the project should be listed

CCSE - Data Science and Analytics

Primary Investigator (PI) Name

Dr. Xuelei Ni
