Date of Award
Spring 4-18-2025
Degree Type
Dissertation/Thesis
Degree Name
DOCTOR OF PHILOSOPHY
Department
Data Science and Analytics
Committee Chair/First Advisor
Dr. Ramazan Aygun
Second Advisor
Dr. Sherry Ni
Third Advisor
Dr. Mahmut Karakaya
Abstract
Addressing imbalanced datasets is challenging because machine learning models tend to favor the majority class. Graph construction plays a major role in determining how Graph Neural Networks (GNNs) perform on imbalanced datasets. In this research, we introduce the BalancerGNN framework to tackle highly imbalanced datasets, demonstrating its effectiveness on fraud detection as a case study. The framework is designed to work with any binary node classification dataset that has a significant class imbalance. This research addresses the following questions: i) How effective are feature engineering techniques for imbalanced datasets? ii) How do graph representation learning and graph construction techniques help with class imbalance? and iii) How effective is a GNN trained with a weighted loss function whose multiple components are based on dataset characteristics?

The framework has three major components: i) node construction with feature representations, ii) graph construction using balanced neighbor sampling, and iii) GNN training on balanced batches with a multi-component loss function. For node construction, we introduce i) Graph-based Variable Clustering (GVC), which optimizes feature selection and removes redundancies by analyzing multicollinearity, and ii) Encoder-Decoder based Dimensionality Reduction (EDDR), which uses transformer-based techniques to reduce feature dimensions while preserving the important information in textual embeddings.

Experiments on Medicare, Equifax, IEEE, and Auto insurance fraud datasets highlight the importance of node construction with feature representations. BalancerGNN, trained with balanced batches, outperforms other methods in identifying fraud cases, with sensitivity rates ranging from 72.46% to 95.53% across datasets. It also achieves accuracy rates ranging from 74.01% to 94.32%.
These outcomes underscore the importance of graph representation and neighbor sampling techniques in developing BalancerGNN for fraud detection models in real-world applications.
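To make the weighted-loss idea concrete, the sketch below shows one common way a class-weighted binary cross-entropy can be built, with inverse-frequency weights derived from the label distribution. This is a minimal, hypothetical illustration in plain Python; the function names and the specific weighting scheme are assumptions for exposition, not the dissertation's actual multi-component loss.

```python
import math

def class_weights(labels):
    """Inverse-frequency weights for binary labels (hypothetical helper).

    The minority class receives a larger weight, so misclassifying it
    costs more during training.
    """
    n = len(labels)
    pos = sum(labels)
    neg = n - pos
    return {0: n / (2 * neg), 1: n / (2 * pos)}

def weighted_bce(probs, labels, weights):
    """Binary cross-entropy with each sample scaled by its class weight."""
    eps = 1e-12
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # guard against log(0)
        total += weights[y] * -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

# Toy 9:1 imbalanced batch: nine legitimate samples, one fraud sample.
labels = [0] * 9 + [1]
probs = [0.1] * 9 + [0.6]   # hypothetical model outputs
w = class_weights(labels)    # minority class weighted 9x the majority
loss = weighted_bce(probs, labels, w)
```

In frameworks such as PyTorch, the same effect is typically obtained by passing per-class weights to the loss (e.g. the `weight` argument of `CrossEntropyLoss`); a multi-component loss would combine such a term with others chosen from dataset characteristics.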