A Large-Scale Study of the Impact of Feature Selection Techniques on Defect Classification Models

Authors - Baljinder Ghotra, Shane McIntosh, Ahmed E. Hassan
Venue - International Conference on Mining Software Repositories, pp. 146–157, 2017

Related Tags - MSR 2017 defect prediction

Abstract - The performance of a defect classification model depends on the features that are used to train it. Feature redundancy, correlation, and irrelevance can hinder the performance of a classification model. To mitigate this risk, researchers often use feature selection techniques, which transform or select a subset of the features in order to improve the performance of a classification model. Recent studies compare the impact of different feature selection techniques on the performance of defect classification models. However, these studies compare a limited number of classification techniques and have reached contradictory results about the impact of feature selection techniques. To address this limitation, we study 30 feature selection techniques (11 filter-based ranking techniques, six filter-based subset techniques, 12 wrapper-based subset techniques, and a no feature selection configuration) and 21 classification techniques when applied to 18 datasets from the NASA and PROMISE corpora. Our results show that a Correlation-based filter-subset feature selection technique with a BestFirst search method outperforms other feature selection techniques across the studied datasets (it outperforms in 70–87% of the PROMISE–NASA datasets) and across the studied classification techniques (it outperforms for 90% of the techniques). Hence, we recommend the application of such a selection technique when building defect classification models.
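
Illustrative sketch - The abstract recommends a correlation-based filter-subset technique. To show the general idea, here is a minimal Python sketch, not the authors' implementation. It scores a feature subset with the standard CFS merit, k * r_cf / sqrt(k + k * (k - 1) * r_ff), where r_cf is the mean feature-to-class correlation and r_ff is the mean feature-to-feature correlation. Two simplifications are assumptions of this sketch: it uses absolute Pearson correlation, which may differ from the correlation measure used in the study, and it uses a plain greedy forward search in place of the BestFirst search named in the abstract. The function names, feature names, and toy data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def cfs_merit(subset, columns, labels):
    """CFS merit: rewards correlation with the class, penalizes redundancy."""
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(abs(pearson(columns[f], labels)) for f in subset) / k
    if k == 1:
        return r_cf
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def greedy_cfs(columns, labels):
    """Greedy forward search (a simplification; BestFirst can also backtrack)."""
    selected, best = [], 0.0
    remaining = sorted(columns)
    while remaining:
        candidate = max(remaining,
                        key=lambda f: cfs_merit(selected + [f], columns, labels))
        merit = cfs_merit(selected + [candidate], columns, labels)
        if merit <= best:  # stop when no single addition improves the merit
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best = merit
    return selected, best


# Hypothetical toy data: one column per feature, one label per module (1 = defective).
labels = [0, 0, 0, 1, 1, 1, 0, 1]
columns = {
    "loc": [10, 12, 11, 40, 42, 39, 13, 41],
    "statements": [8, 11, 9, 30, 35, 28, 14, 33],  # nearly redundant with loc
    "noise": [5, 1, 4, 2, 5, 1, 3, 3],             # weakly related to the label
}

selected, merit = greedy_cfs(columns, labels)
print(selected, round(merit, 3))
```

On this toy data the search keeps only the feature most correlated with the label: adding the nearly redundant feature or the weakly related one lowers the merit, which is the behaviour a filter-subset technique is meant to exhibit.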

Preprint - PDF

Bibtex

@inproceedings{ghotra2017msr,
  Author = {Baljinder Ghotra and Shane McIntosh and Ahmed E. Hassan},
  Title = {{A Large-Scale Study of the Impact of Feature Selection Techniques on Defect Classification Models}},
  Year = {2017},
  Booktitle = {Proc. of the International Conference on Mining Software Repositories (MSR)},
  Pages = {146--157}
}