How Do API Selections Affect the Runtime Performance of Data Analytics Tasks?
As data volume and complexity grow at an unprecedented rate, the performance of data analytics programs is becoming a major concern for developers. We observed that developers sometimes use alternative data analytics APIs to improve program runtime performance while preserving functional equivalence. However, little is known on the characteristics and performance attributes of alternative data analytics APIs. In this paper, we propose a novel approach to extract alternative implementations that invoke different data analytics APIs to solve the same tasks. A key appeal of our approach is that it exploits the comparative structures in Stack Overflow discussions to discover programming alternatives. We show that our approach is promising, as 86% of the extracted code pairs were validated as true alternative implementations. In over 20% of these pairs, the faster implementation was reported to achieve a 10x or more speedup over its slower alternative. We hope that our study offers a new perspective of API recommendation and motivates future research on optimizing data analytics programs.