Date of Submission

Fall 12-16-2020

Degree Type

Thesis

Degree Name

Master of Science in Computer Science (MSCS)

Department

Computer Science

Track

Big Data

Faculty Advisor

Dr. Dan Lo

Chair

Dr. Dan Lo

Committee Member

Dr. Yong Shi

Committee Member

Dr. Hossain Shahriar

Abstract

Imbalanced datasets pose a unique challenge for machine learning, requiring specialized approaches to correctly classify the minority class. Financial fraud detection relies on highly imbalanced datasets, with class imbalances as extreme as 0.01% fraudulent to 99.99% regular transactions. In financial fraud detection it is essential to identify all frauds, even if the precision of some classifications is low. I developed a random forest assembly that separates fraudulent transactions into tiers of precision. With this approach, 96% of fraudulent transactions are identified, an 8% increase in recall compared to standard approaches. By optimizing several random forests on different fitness functions, precision for 59% of fraud classifications increases by 10%, up to 98%. These models are then combined to act as a sieve with increasing tolerance for low-precision classifications. The effectiveness of random forest for financial fraud detection is also improved through feature extraction techniques. Random forest is weak at detecting patterns between interdependent features. This problem is addressed through unsupervised feature extraction. I will demonstrate a new random forest architecture, PCA-embedded random forest, which improves random forest performance.
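
The sieve idea described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis implementation: the names sieve_tier and make_threshold_model are hypothetical, and simple score thresholds stand in for the random forests that the thesis optimizes on different fitness functions. Models are ordered from strictest to most tolerant, and each transaction is assigned the tier of the first model that flags it.

```python
from typing import Callable, Optional, Sequence

# A "model" is any callable that returns True when it flags a transaction
# as fraud. The thesis uses random forests here; the threshold rules below
# are hypothetical stand-ins so the sketch stays self-contained.
Model = Callable[[dict], bool]


def sieve_tier(transaction: dict, models: Sequence[Model]) -> Optional[int]:
    """Return the index of the first model that flags the transaction,
    or None if no model in the sieve flags it."""
    for tier, model in enumerate(models):
        if model(transaction):
            return tier
    return None


def make_threshold_model(threshold: float) -> Model:
    # Hypothetical stand-in: flag when a precomputed score reaches the threshold.
    return lambda tx: tx["score"] >= threshold


# Ordered from strictest (assumed highest precision) to most tolerant
# (assumed highest recall). The thresholds are illustrative values only.
models = [make_threshold_model(t) for t in (0.9, 0.6, 0.3)]

# Made-up example transactions carrying a single illustrative score.
transactions = [{"score": 0.95}, {"score": 0.7}, {"score": 0.35}, {"score": 0.1}]
tiers = [sieve_tier(tx, models) for tx in transactions]
```

A lower tier index means the transaction was caught by a stricter model, so it can be treated as a higher-confidence fraud classification, while transactions caught only by later tiers are kept for recall at lower precision.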

Included in

Data Science Commons
