Semester of Graduation

Fall 2025

Degree Type

Dissertation/Thesis

Degree Name

Masters in Computer Science

Department

Computer Science

Committee Chair/First Advisor

Dr. Bobin Deng

Second Advisor

Dr. Yixin Xie

Third Advisor

Dr. Seyedamin Pouriyeh

Abstract

This thesis investigates how computational methods that integrate machine learning can help us understand protein docking and vascular diseases. The main goal of this research is to develop a system that can predict whether variants in the SOX17 gene are associated with Pulmonary Arterial Hypertension (PAH), a significant vascular condition. To do this, a dataset of 1,063 SOX17 variants was created from publicly available sources such as ClinVar, gnomAD, LOVD, and Ensembl. The study included cleaning and structuring the dataset, encoding genomic data, and training a Random Forest classifier to classify variants as positive, negative, or unknown for disease association. The model achieved 97.9% training accuracy and 84.0% test accuracy, indicating that it can learn patterns from genomic data. Evaluation metrics such as F1-score, precision, and recall were also used to assess model behavior. The results show that mutation location and nucleotide differences have a major influence on predictions. This thesis shows that machine learning, especially tree-based models, can help identify which genetic variants are actually problematic and flag the unclear ones that need more investigation. This work also opens up possibilities for future research, such as using protein docking simulations to examine how specific mutations might affect SOX17's function at the molecular level.
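As a rough illustration of the evaluation metrics named in the abstract, the sketch below computes per-class precision, recall, and F1-score for a three-class labeling (positive, negative, unknown). It is not the thesis's code: the function name `per_class_metrics` and the toy labels are assumptions made up for this example, and none of the numbers come from the SOX17 dataset.

```python
from collections import Counter

# Class names taken from the abstract; everything else here is hypothetical.
LABELS = ("positive", "negative", "unknown")


def per_class_metrics(y_true, y_pred, labels=LABELS):
    """Return {label: (precision, recall, f1)} for multi-class predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this class, but it was wrong
            fn[truth] += 1  # the true class was missed
    metrics = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


# Toy labels invented for illustration only (not thesis data).
y_true = ["positive", "positive", "negative", "unknown", "negative", "unknown"]
y_pred = ["positive", "negative", "negative", "unknown", "negative", "positive"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
metrics = per_class_metrics(y_true, y_pred)

for label, (p, r, f1) in metrics.items():
    print(f"{label:>8}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Reporting these per class rather than accuracy alone matters when one class, such as "unknown", is much rarer or harder than the others, since overall accuracy can hide poor performance on it.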
