Abstract - Today's agile software organizations aim to empower developers to make appropriate decisions rather than enforce adherence to a process. As a result, the data in software archives is more likely to be incomplete and noisy. Since software analytics techniques are trained using this data, automated techniques are required to recover the missing information.
In this paper, we lay the foundation for the adoption of software analytics techniques at Shopify (a large software organization that develops commerce-related products and solutions) by recovering missing issue type labels. To do so, we train classifiers to label issue reports as defect-fixing or not using textual features from 951 manually-labelled issue reports. Our classifiers show promise in intra- and inter-project experimental settings: (1) in the intra-project setting, outperforming baseline approaches like random guessing (AUC values of 0.5271–0.8070) and Zero-R (F1-scores that are 0.31–21.72 percentage points better); and (2) achieving inter-project performance scores that are on par with those of intra-project classifiers when trained using pools of data drawn from multiple other projects. Interestingly, when the importance of precision is taken into consideration, standard model construction operations like rebalancing should be omitted to produce classifiers that are more suitable for deployment.
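To make the two baselines concrete, the following is a minimal sketch in plain Python of how a Zero-R predictor and the F1 and AUC measures can be computed. It is an illustration under stated assumptions, not the study's implementation: the function names, the convention that label 1 means defect-fixing, and all toy labels and scores are hypothetical and are not drawn from the Shopify data.

```python
from collections import Counter


def zero_r(train_labels):
    """Zero-R: always predict the majority class seen in the training labels."""
    return Counter(train_labels).most_common(1)[0][0]


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores, positive=1):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy labels: 1 = defect-fixing issue, 0 = other issue.
train_labels = [1, 1, 1, 0, 0]
test_labels = [1, 0, 1, 1, 0]

# Zero-R predicts the training majority class for every test issue.
majority = zero_r(train_labels)
zero_r_f1 = f1_score(test_labels, [majority] * len(test_labels))

# A constant score carries no ranking information, so its AUC is 0.5,
# the same reference value used for random guessing.
constant_auc = auc(test_labels, [0.5] * len(test_labels))

# Hypothetical classifier scores (made up for illustration only).
toy_scores = [0.9, 0.2, 0.4, 0.7, 0.5]
toy_auc = auc(test_labels, toy_scores)

print(f"Zero-R F1: {zero_r_f1:.2f}")
print(f"Constant-score AUC: {constant_auc:.2f}")
print(f"Toy classifier AUC: {toy_auc:.4f}")
```

A classifier is only useful relative to these reference points: an AUC above 0.5 indicates ranking ability beyond chance, and an F1-score above that of Zero-R indicates that the textual features contribute more than the class distribution alone.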