Compiler IR-Based Program Encoding Method for Software Defect Prediction

Yong Chen, Nanjing Audit University
Chao Xu, Nanjing Audit University
Jing Selena He, Kennesaw State University
Sheng Xiao, Hunan First Normal University

Abstract

As software applications continue to expand, the demands on software quality keep rising. Software defect prediction is an important technique for improving software quality. It typically encodes software into a set of features and applies machine learning to build defect prediction classifiers, which estimate whether a given software area is clean or buggy. However, current encoding methods are mainly based on traditional manual features or on the Abstract Syntax Tree (AST) of the source code. Traditional manual features struggle to reflect the deep semantics of programs, and the AST contains a large amount of noise that interferes with the expression of semantic features. To overcome these deficiencies, we propose a novel compiler Intermediate Representation (IR) based program encoding method for software defect prediction, combined with a Convolutional Neural Network (CNN), which we call CIR-CNN. First, our program encoding method is based on the compiler IR, which eliminates much of the noise present in the syntactic structure of the source code and makes it easier to obtain accurate semantic information. Second, with the help of data flow analysis, a Data Dependency Graph (DDG) is constructed on the compiler IR, which helps to capture deeper semantic information about the program. Finally, we use the widely adopted CNN model to build the software defect prediction model, which improves the adaptability of the method. To evaluate the performance of CIR-CNN, we set up comparative experiments on seven projects from the PROMISE datasets. The experimental results show that, in within-project defect prediction (WPDP), CIR-CNN improves prediction accuracy by 12% over the AST-encoded CNN-based model and by 20.9% over the traditional-features-based logistic regression (LR) model. In cross-project defect prediction (CPDP), CIR-CNN improves on the AST-encoded DBN-based model by 9.1% and on the traditional-features-based TCA+ model by 19.2%.
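To make the idea of a data dependency graph over an IR concrete, the following is a minimal sketch rather than the construction used in this work. It assumes a hypothetical three-address toy IR (tuples of destination, opcode, and operands) and straight-line code with no branches, so the most recent definition of a name is its only reaching definition. The names `Instr`, `build_ddg`, and `program` are illustrative only.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy IR (an assumption, not the paper's IR):
# each instruction is (destination, opcode, operands).
# An empty destination means the instruction defines no value.
Instr = Tuple[str, str, Tuple[str, ...]]


def build_ddg(instrs: List[Instr]) -> Set[Tuple[int, int]]:
    """Return def-use edges (defining_index, using_index).

    Assumes straight-line code, so the latest definition of a name
    is the only one that reaches a later use.
    """
    last_def: Dict[str, int] = {}       # name -> index of latest definition
    edges: Set[Tuple[int, int]] = set()
    for idx, (dest, _opcode, operands) in enumerate(instrs):
        for name in operands:
            # Names never defined in this block (inputs) produce no edge.
            if name in last_def:
                edges.add((last_def[name], idx))
        if dest:
            last_def[dest] = idx        # this instruction now defines dest
    return edges


program: List[Instr] = [
    ("t0", "load", ("a",)),         # 0
    ("t1", "load", ("b",)),         # 1
    ("t2", "add", ("t0", "t1")),    # 2: uses the values defined at 0 and 1
    ("t0", "mul", ("t2", "t2")),    # 3: redefines t0
    ("", "store", ("t0", "c")),     # 4: uses the t0 defined at 3, not at 0
]

ddg = build_ddg(program)
print(sorted(ddg))
```

A real compiler IR with control flow would need a full reaching-definitions analysis over the control flow graph; the single-pass dictionary here is only valid under the straight-line assumption.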