Semester of Graduation
Fall 2025
Degree Type
Dissertation/Thesis
Degree Name
Masters in Computer Science
Department
Computer Science
Committee Chair/First Advisor
Dr. Bobin Deng
Second Advisor
Dr. Yixin Xie
Third Advisor
Dr. Seyedamin Pouriyeh
Abstract
This thesis investigates how computational methods that integrate machine learning can improve our understanding of protein docking and vascular diseases. The main goal of this research is to develop a system that predicts whether variants in the SOX17 gene are associated with Pulmonary Arterial Hypertension (PAH), a significant vascular condition. To this end, a dataset of 1,063 SOX17 variants was assembled from publicly available sources such as ClinVar, gnomAD, LOVD, and Ensembl. The study involved cleaning and structuring the dataset, encoding the genomic data, and training a Random Forest classifier to label variants as positive, negative, or unknown for disease association. The model achieved 97.9% training accuracy and 84.0% test accuracy, indicating that it can learn patterns from genomic data. Evaluation metrics such as precision, recall, and F1-score were also used to assess model behavior. The results show that mutation location and nucleotide differences have a major influence on predictions. This thesis shows that machine learning, especially tree-based models, can help identify which genetic variants are likely to be problematic and flag the unclear ones that need further investigation. This work also opens up possibilities for future research, such as using protein docking simulations to examine how specific mutations might affect SOX17's function at the molecular level.
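As an illustration of the evaluation metrics mentioned in the abstract, the following minimal sketch computes per-class precision, recall, and F1-score for the three labels (positive, negative, unknown). It is not the thesis code: the function name per_class_metrics and the six toy labels are assumptions made up for this example, and the numbers it produces have no relation to the reported results.

```python
from collections import Counter

# The three classes named in the abstract.
LABELS = ("positive", "negative", "unknown")


def per_class_metrics(y_true, y_pred, labels=LABELS):
    """Return {label: (precision, recall, f1)} for multi-class predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this class, but it was wrong
            fn[truth] += 1  # missed the true class
    metrics = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


# Toy labels invented for illustration only (not from the SOX17 dataset).
y_true = ["positive", "positive", "negative", "negative", "unknown", "unknown"]
y_pred = ["positive", "negative", "negative", "negative", "unknown", "positive"]

metrics = per_class_metrics(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

for label, (p, r, f1) in metrics.items():
    print(f"{label:9s} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Reporting these metrics per class, rather than accuracy alone, makes it visible when a model does well overall but performs poorly on one label, such as the unknown variants.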