Improving the Prediction Accuracy of Text Data and Attribute Data Mining with Data Preprocessing
Date of Submission
Master of Science in Computer Science (MSCS)
DR. KAI QIAN
DR. CHIA-TIEN DAN LO
DR. YONG SHI
Data mining is the extraction of valuable information from patterns in data and its conversion into useful knowledge. Data preprocessing is an important step in this process: the quality of the data affects the accuracy of the mining results, which makes preprocessing one of the critical steps in data mining.
In text mining research, document classification is a growing field. Among the many existing classification approaches, the Naïve Bayes classifier is widely used because of its simplicity and effectiveness, and it has been suggested as the best method for identifying spam emails. The aim of this paper is to identify the impact of preprocessing the dataset on the performance of a Naïve Bayes classifier. This impact is analyzed by comparing the classifier's results on the preprocessed dataset with its results on the non-preprocessed dataset. The test results show that combining Naïve Bayes classification with proper data preprocessing can improve prediction accuracy.
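As a rough illustration of this comparison, the following minimal sketch trains the same multinomial Naïve Bayes classifier with and without a preprocessing step. It is not the experimental setup used in this work: the stop-word list, the toy messages, and the names `preprocess`, `raw_tokens`, and `NaiveBayes` are all assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "the", "is", "to", "and", "of", "for", "your", "at"}

def preprocess(text):
    """Lowercase, keep alphabetic tokens only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def raw_tokens(text):
    """Baseline without preprocessing: split on whitespace only."""
    return text.split()

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.class_docs = Counter()              # documents per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_docs[label] += 1
            for tok in self.tokenizer(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, text):
        # Tokens never seen in training are ignored.
        tokens = [t for t in self.tokenizer(text) if t in self.vocab]
        total_docs = sum(self.class_docs.values())
        scores = {}
        for label, n_docs in self.class_docs.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokens:
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)

# Toy training data (assumed, not taken from the thesis experiments).
train_texts = [
    "Win a FREE prize now!!!",
    "Free money, claim your prize",
    "Meeting at noon tomorrow",
    "Lunch with the project team",
]
train_labels = ["spam", "spam", "ham", "ham"]

with_prep = NaiveBayes(preprocess).fit(train_texts, train_labels)
without_prep = NaiveBayes(raw_tokens).fit(train_texts, train_labels)
```

With preprocessing, "FREE", "Free", and "free" collapse into one token, so the vocabulary shrinks and the evidence for each word is pooled; the baseline treats them as three unrelated features. Measuring the effect on accuracy requires a held-out test set, which this toy example does not include.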
In attribute data mining research, the decision tree is an important classification technique. Decision trees have proved to be valuable tools for the classification, description, and generalization of data. J48, an open source Java implementation of the C4.5 algorithm in the Weka data mining tool, is a decision tree algorithm used to create a classification model. In this paper, we present a method for improving the accuracy of decision tree mining with data preprocessing. We applied the supervised discretization filter to the data before running the J48 algorithm to construct a decision tree, and compared the results with those of J48 without discretization. The experimental results show that the accuracy of J48 after discretization is better than before discretization.
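To make the idea of supervised discretization concrete, the sketch below picks a single cut point for one numeric attribute by minimizing the class entropy of the resulting two intervals. It is a simplified stand-in, not Weka's filter: Weka's supervised discretization applies an entropy-based criterion recursively with an MDL stopping rule, whereas this example makes only one split. The function names and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cut_point(values, labels):
    """Return the boundary minimizing weighted class entropy, or None
    if no split improves on the unsplit attribute."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_cut, best_entropy = None, entropy(labels)
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no boundary between identical values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if weighted < best_entropy:
            best_cut, best_entropy = (lo + hi) / 2, weighted
    return best_cut

def discretize(values, cut):
    """Map numeric values to two nominal bins around the cut point."""
    if cut is None:
        return ["all" for _ in values]
    return [f"<={cut}" if v <= cut else f">{cut}" for v in values]

# Toy attribute and class labels (assumed, not from the thesis datasets).
ages = [22, 25, 30, 35, 40, 45]
classes = ["no", "no", "no", "yes", "yes", "yes"]

cut = best_cut_point(ages, classes)
bins = discretize(ages, cut)
```

Because the cut point is chosen using the class labels, the resulting nominal attribute is aligned with the class boundary before the tree learner sees it, which is the sense in which the discretization is "supervised".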