Location
https://ccse.kennesaw.edu/computing-showcase/cday-programs/spring2021program.php
Document Type
Event
Start Date
26-4-2021 5:00 PM
Description
Persistent malware variants are a constant threat to computing infrastructure across all regions and business sectors. Traditional detection systems focus primarily on signature-based analysis, but this approach cannot keep pace with the velocity and volume of new malware variants that are continuously deployed onto the internet. Most network traffic detection techniques analyze raw packets and have not deterred the surge of persistent malware. It is therefore important to develop new research techniques that use optimized metadata from malware network traffic to identify an ever-increasing expanse of malicious software. Recent research by Letteri et al. has produced a quality data set (MTA-KDD’19), which is used for this research project. This research proposal pursues new information in the area of malware network traffic detection. Specifically, I seek a defensible answer to the following question: Can machine learning techniques produce highly accurate classification models for malicious network traffic detection based on analysis of a statistically optimized data set? I believe that an affirmative answer to this research question would provide a beneficial contribution to the academic community. The principal tool used to analyze the optimized data set is the Waikato Environment for Knowledge Analysis (WEKA). The MTA-KDD’19 data set contains 64,550 instances and 33 features, which are analyzed using both cross-validation and percentage-split evaluation. The classification experiment performed by the authors of the MTA-KDD’19 data set is used as a baseline. The following machine learning classification models have been applied in this investigation: Multilayer Perceptron, Decision Tree, Support Vector Machine, and K-Nearest Neighbors.
The preliminary settings for these machine learning models include 10-fold cross-validation and an 80% training / 20% test data split. The Decision Tree classifier produced the best preliminary result, with 100% accuracy on the 80%/20% split and 99.9954% accuracy under 10-fold cross-validation. This preliminary result outperformed the results reported by the authors of the MTA-KDD’19 data set. Other preliminary metrics show that the selected models perform consistently and with high accuracy. The Multilayer Perceptron classifier produced a preliminary result of 99.3649% accuracy on the 80%/20% split and 99.3416% accuracy under 10-fold cross-validation. The K-Nearest Neighbors classifier (K=1) produced a preliminary result of 98.9311% accuracy on the 80%/20% split and 99.0024% accuracy under 10-fold cross-validation. The Support Vector Machine classifier produced a preliminary result of 97.8081% accuracy on the 80%/20% split and 97.7755% accuracy under 10-fold cross-validation. The final stage of this research project will implement additional machine learning methodologies, including feature selection techniques and ensemble learning models.
Advisor(s): Dr. Seyedamin Pouriyeh
Topic(s): Security
CYBR 7240
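The two evaluation settings named in the description, an 80%/20% percentage split and 10-fold cross-validation, can be illustrated with a minimal sketch. This is not the WEKA workflow used in the project: it is a standard-library Python stand-in that assumes a hand-written 1-nearest-neighbor classifier (matching only the K=1 setting) and a small synthetic two-class data set in place of MTA-KDD’19. All function names, seeds, and the toy data are hypothetical.

```python
import random

def holdout_split(n, train_fraction=0.8, seed=0):
    """Shuffle indices 0..n-1 and split them into train and test lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = round(n * train_fraction)
    return indices[:cut], indices[cut:]

def k_fold_splits(n, k=10, seed=0):
    """Return k (train, test) index pairs; each index is tested exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, folds[i]))
    return splits

def predict_1nn(train_X, train_y, x):
    """Label of the training point nearest to x (squared Euclidean distance)."""
    nearest = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return train_y[nearest]

def accuracy(X, y, train, test):
    """Percentage of test instances classified correctly by 1-NN."""
    train_X = [X[i] for i in train]
    train_y = [y[i] for i in train]
    correct = sum(predict_1nn(train_X, train_y, X[i]) == y[i] for i in test)
    return 100.0 * correct / len(test)

def make_toy_data(n=200, seed=1):
    """Synthetic two-class data (hypothetical; not the MTA-KDD'19 features)."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        center = 0.0 if label == 0 else 5.0
        X.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
        y.append(label)
    return X, y

X, y = make_toy_data()

# Setting 1: a single 80% training / 20% test split.
train, test = holdout_split(len(X))
split_acc = accuracy(X, y, train, test)

# Setting 2: 10-fold cross-validation. The folds are equal-sized here,
# so the mean of the fold accuracies equals the pooled accuracy.
splits = k_fold_splits(len(X))
fold_accs = [accuracy(X, y, tr, te) for tr, te in splits]
cv_acc = sum(fold_accs) / len(fold_accs)

print(f"80/20 split accuracy: {split_acc:.4f}%")
print(f"10-fold CV accuracy:  {cv_acc:.4f}%")
```

The percentage split evaluates the model once on a held-out fifth of the data, while cross-validation tests every instance exactly once across ten models, which is why the two settings can report slightly different accuracies for the same classifier.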
Included in
GR-23 Machine Learning Techniques for Malware Network Traffic Detection