Date of Award

Spring 4-18-2025

Degree Type

Dissertation/Thesis

Degree Name

DOCTOR OF PHILOSOPHY

Department

Data Science and Analytics

Committee Chair/First Advisor

Dr. Ramazan Aygun

Second Advisor

Dr. Sherry Ni

Third Advisor

Dr. Mahmut Karakaya

Abstract

Addressing imbalanced datasets is challenging because machine learning models tend to favor the majority class. Graph construction plays a major role in determining how Graph Neural Networks (GNNs) perform on imbalanced datasets. In this research, we introduce the BalancerGNN framework to tackle highly imbalanced datasets, demonstrating its effectiveness with fraud detection as a case study. The framework is designed to work for any binary node classification dataset with significant class imbalance. This research addresses the following questions: i) How effective are feature engineering techniques in the case of imbalanced datasets? ii) How do graph representation learning and construction techniques help with the imbalanced dataset issue? and iii) How effective is it to use a GNN with a weighted loss function whose components are based on dataset characteristics?

The framework has three major components: i) node construction with feature representations, ii) graph construction using balanced neighbor sampling, and iii) GNN training using balanced training batches leveraging a multi-component loss function. For node construction, we introduce i) Graph-based Variable Clustering (GVC), which optimizes feature selection and removes redundancies by analyzing multicollinearity, and ii) Encoder-Decoder based Dimensionality Reduction (EDDR), which uses transformer-based techniques to reduce feature dimensions while preserving the important information in textual embeddings.

Experiments on Medicare, Equifax, IEEE, and Auto insurance fraud datasets highlight the importance of node construction with feature representations. BalancerGNN, trained with balanced batches, outperforms other methods, showing strong ability to identify fraud cases, with sensitivity rates ranging from 72.46% to 95.53% across datasets. Additionally, BalancerGNN achieves accuracy rates ranging from 74.01% to 94.32%. These outcomes underscore the importance of graph representation and neighbor sampling techniques in developing BalancerGNN for fraud detection models in real-world applications.
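The balanced-batch idea described above can be illustrated with a minimal sketch: each training batch draws equally from the minority and majority classes, and a weighted loss uses inverse class frequencies. The function names, interface, and sampling details here are illustrative assumptions, not the dissertation's actual BalancerGNN implementation.

```python
import numpy as np

def balanced_batches(labels, batch_size, rng=None):
    """Yield index batches with equal minority/majority representation.

    Illustrative sketch of balanced batch sampling for binary node
    classification; not the dissertation's implementation.
    """
    rng = rng or np.random.default_rng(0)
    labels = np.asarray(labels)
    minority = np.flatnonzero(labels == 1)
    majority = np.flatnonzero(labels == 0)
    half = batch_size // 2
    maj_shuffled = rng.permutation(majority)
    for i in range(len(majority) // half):
        maj = maj_shuffled[i * half:(i + 1) * half]
        # Oversample the minority class so each batch is class-balanced.
        mino = rng.choice(minority, size=half, replace=True)
        yield np.concatenate([maj, mino])

def class_weights(labels):
    """Inverse-frequency class weights for a weighted loss (illustrative)."""
    counts = np.bincount(np.asarray(labels), minlength=2)
    return counts.sum() / (2.0 * counts)
```

With a 90/10 class split and `batch_size=10`, each batch contains five nodes from each class, so every gradient step sees the minority class as often as the majority class; the inverse-frequency weights could then serve as one component of a multi-component weighted loss.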
