Classifying Variable Stars from Stellar Light Curve Data
Disciplines
Applied Statistics | Data Science | Multivariate Analysis | Numerical Analysis and Scientific Computing | Stars, Interstellar Medium and the Galaxy | Statistical Models
Abstract (300 words maximum)
Due to advances in collection techniques, variable star light curve data are being produced faster than the existing curves can be classified. Automated classification has been attempted, but most efforts use sophisticated techniques to extract high-level features, and many produce inconsistent results, often finding the greatest predictive impact from low-level features. Here, several of the more successful methods were compared using these low-level features from the OGLE4 variable star catalogue. In addition, a probability-based, multi-level classifier was developed to increase classification accuracy for the underrepresented classes and improve user confidence. Random Forest and Gradient Boosting Trees achieved the highest accuracy, and the multi-level classifier outperformed even these. Not only did these models accurately predict the classes using easier-to-calculate features, but the multi-level framework also increased accuracy further and functioned as a trustworthy system, rejecting low-confidence samples based on a user-determined confidence threshold.
Academic department under which the project should be listed
CCSE - Data Science and Analytics
Primary Investigator (PI) Name
Ramazan Aygun
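The abstract's rejection mechanism (accept a prediction only when its top class probability clears a user-chosen threshold) can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name, threshold value, and class labels are hypothetical:

```python
# Minimal sketch of confidence-threshold rejection (hypothetical names;
# the paper's multi-level classifier is more involved than this).

def classify_or_reject(class_probs, threshold=0.8):
    """Return the most probable class label, or None (reject) if the top
    probability falls below the user-determined confidence threshold."""
    label, prob = max(class_probs.items(), key=lambda kv: kv[1])
    return label if prob >= threshold else None

# A confident prediction is accepted...
print(classify_or_reject({"RR Lyrae": 0.91, "Cepheid": 0.06, "Mira": 0.03}))
# ...while an ambiguous one is rejected for human review.
print(classify_or_reject({"RR Lyrae": 0.45, "Cepheid": 0.40, "Mira": 0.15}))
```

In practice the probability dictionary would come from a fitted model (e.g. a Random Forest's per-class probability estimates), and raising the threshold trades coverage for precision: fewer samples are auto-classified, but those that are carry higher confidence.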