Date of Submission

Fall 12-16-2020

Degree Type


Degree Name

Master of Science in Computer Science (MSCS)


Computer Science

Committee Chair/First Advisor

Dr. Dan Lo


Big Data


Dr. Dan Lo

Committee Member

Dr. Yong Shi

Committee Member

Dr. Hossain Shahriar


Imbalanced datasets pose a unique challenge for machine learning, requiring specialized approaches to correctly classify the minority class. Financial fraud detection relies on highly imbalanced datasets, with class imbalances as extreme as 0.01% fraudulent to 99.99% regular transactions. In financial fraud detection it is essential to identify all frauds, even if the precision of some classifications is low. I developed a random forest assembly that separates fraudulent transactions into tiers of precision. With this approach, 96% of fraudulent transactions are identified, an 8% increase in recall compared to standard approaches. For 59% of fraud classifications, precision increases by 10%, up to 98%, by optimizing several random forests on different fitness functions. These models are then combined to act as a sieve with increasing tolerance for low-precision classifications. The effectiveness of random forest for financial fraud detection is also improved through feature extraction techniques. Random forest is weak at detecting patterns between interdependent features. This problem is addressed through unsupervised feature extraction. I demonstrate a new random forest architecture, PCA-embedded random forest, which improves random forest performance.
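
The sieve idea described above can be illustrated with a minimal sketch. It is not the thesis implementation: each tier is assumed to be a callable that flags a transaction as fraud, ordered from the strictest (highest-precision) model to the most tolerant. In practice each tier would wrap a trained random forest; here simple threshold rules on a hypothetical `score` field stand in for them, and the names `sieve_classify`, `tiers`, and `score` are illustrative assumptions.

```python
from typing import Callable, Sequence

# A "model" is any callable mapping a transaction to True (flag as fraud)
# or False. Assumption: real tiers would wrap trained random forests.
Model = Callable[[dict], bool]


def sieve_classify(transaction: dict, tiers: Sequence[Model]) -> int:
    """Return the 1-based index of the first tier that flags the
    transaction, or 0 if no tier flags it (treated as regular)."""
    for tier_index, model in enumerate(tiers, start=1):
        if model(transaction):
            return tier_index
    return 0


# Hypothetical stand-in tiers, ordered from strictest to most tolerant.
tiers = [
    lambda t: t["score"] >= 0.9,  # tier 1: only very confident flags
    lambda t: t["score"] >= 0.5,  # tier 2: moderately confident flags
    lambda t: t["score"] >= 0.1,  # tier 3: tolerant of low precision
]

# Made-up transactions used only to exercise the sketch.
transactions = [
    {"id": 1, "score": 0.95},
    {"id": 2, "score": 0.60},
    {"id": 3, "score": 0.20},
    {"id": 4, "score": 0.01},
]

assigned = [sieve_classify(t, tiers) for t in transactions]
print(assigned)
```

A transaction caught by an early tier never reaches the later, more tolerant ones, so each flagged transaction carries the precision level of the tier that caught it.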

Included in

Data Science Commons