Abstract - Reviewer recommendation systems suggest community members to review change requests. As with many other recommendation systems, it is customary to evaluate reviewer recommendations against held-out historical data. While history-based evaluation makes pragmatic use of available data, historical records may be: (1) overly optimistic, since past assignees may have been suboptimal choices for the task at hand; or (2) overly pessimistic, since “incorrect” recommendations may have been equally good (or even better) choices.
In this paper, we empirically evaluate the extent to which historical data is an appropriate benchmark for reviewer recommendation systems. We replicate the cHRev and WLRRec approaches and apply them to 9,679 reviews from the Gerrit open source community. We then assess the recommendations with members of the Gerrit reviewing community using quantitative methods (personalized questionnaires about their comfort level with the review tasks) and qualitative methods (semi-structured interviews).
We find that history-based evaluation is far more pessimistic than optimistic in the context of Gerrit review recommendations. Indeed, while 86% of those who had been assigned to a review in the past felt comfortable handling the review, 74% of those labelled as incorrect recommendations also felt that they would have been comfortable reviewing the changes. This indicates that, on the one hand, recommendations that match the past assignee should indeed be considered correct. On the other hand, recommendations labelled as incorrect because they do not match the past assignee may have been correct as well.
Our results suggest that current reviewer recommendation evaluations do not always model the reality of software development. Future studies may benefit from looking beyond repository data to gain a clearer understanding of the practical value of proposed recommendations.
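For concreteness, the sketch below illustrates how history-based evaluation typically labels recommendations: a suggestion counts as correct only if the historical assignee appears among the top-k candidates. This is a minimal illustration of the evaluation scheme discussed above, not the paper's exact protocol; the function and field names are assumptions.

```python
# Minimal sketch of history-based evaluation for reviewer recommendation.
# A recommendation is labelled "correct" only when the past assignee appears
# in the top-k suggestions; the paper argues this labelling can be overly
# pessimistic. Field names ("change", "past_assignee") and the `recommend`
# callable are illustrative assumptions, not the authors' implementation.

def top_k_accuracy(reviews, recommend, k=3):
    """Fraction of reviews whose historical assignee is among the top-k suggestions."""
    if not reviews:
        return 0.0
    hits = 0
    for review in reviews:
        suggestions = recommend(review["change"], k)  # e.g., cHRev or WLRRec output
        if review["past_assignee"] in suggestions:
            hits += 1
    return hits / len(reviews)
```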
Preprint - PDF
Bibtex