Disciplines
Information Security | Programming Languages and Compilers
Abstract (300 words maximum)
The volume of encrypted malicious network traffic masquerading as normal traffic has increased greatly in recent years, and malicious traffic rates were reported to rise sharply during the COVID-19 pandemic. This trend threatens users’ security and privacy, so new detection methods are needed promptly. Conventional security solutions that rely on techniques such as deep packet inspection have proven less effective, while machine learning-based malware detection is becoming more popular. These solutions are believed to be less expensive, faster, and more secure because no traffic interceptor is required. However, results in the current literature are difficult to compare and of uncertain reliability, because studies train their models on different datasets and there are no well-recognized datasets or feature sets. The goal of this research is therefore to detect malware traffic flows by extracting new features from multiple popular public sources and classifying them with well-known machine learning algorithms such as KNN, XGBoost, and Random Forest. These widely used algorithms are expected to achieve high malware detection rates (as high as 95%). The system will first extract relevant features from each flow, including packet count, size, and protocol type, and then feed them into a machine learning model for detection. The model will be trained on a large dataset of mixed benign and malicious traffic so that it can accurately detect encrypted malicious traffic flows. The conclusions will discuss the detection rates achieved by different feature sets across multiple machine learning algorithms, along with the challenges that may arise in the process, including the need for high-quality training data and the possibility of false positives and false negatives. Further research in this area will focus on improving the model’s detection rates and addressing these challenges.
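The pipeline outlined in the abstract (per-flow feature extraction followed by classification) can be illustrated with a minimal sketch. It is not the project's implementation. The packet record layout `(flow_id, size, protocol)`, the function names `flow_features` and `knn_predict`, and the toy training vectors are all assumptions made for illustration. KNN is written out with the standard library only so the example stays self-contained. Real work would more likely use an established library, scale the features, and evaluate on labeled public datasets.

```python
from collections import Counter, defaultdict
from math import dist

# Assumed protocol vocabulary for one-hot encoding (illustrative only).
PROTOCOLS = ("TCP", "UDP")


def flow_features(packets):
    """Group packets by flow and build one feature vector per flow.

    Each packet is a hypothetical (flow_id, size_in_bytes, protocol) tuple.
    The vector is [packet count, mean packet size, is_tcp, is_udp].
    """
    flows = defaultdict(list)
    for flow_id, size, proto in packets:
        flows[flow_id].append((size, proto))

    features = {}
    for flow_id, pkts in flows.items():
        sizes = [size for size, _ in pkts]
        proto = pkts[0][1]  # assume one protocol per flow
        one_hot = [1.0 if proto == p else 0.0 for p in PROTOCOLS]
        features[flow_id] = [float(len(pkts)), sum(sizes) / len(sizes)] + one_hot
    return features


def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training vectors."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], x))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Made-up training vectors, NOT measured data: they only exercise the code.
TRAIN_X = [
    [10.0, 500.0, 1.0, 0.0],
    [12.0, 480.0, 1.0, 0.0],
    [200.0, 60.0, 1.0, 0.0],
    [180.0, 70.0, 0.0, 1.0],
]
TRAIN_Y = ["benign", "benign", "malicious", "malicious"]
```

The features here are left unscaled, so mean packet size dominates the distance. A real feature set would normally be standardized before being passed to a distance-based classifier such as KNN.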
Academic department under which the project should be listed
CCSE - Information Technology
Primary Investigator (PI) Name
Liang Zhao
Encrypted Malicious Network Traffic Detection Using Machine Learning