Code to Comment "Translation": Data, Metrics, Baselining & Evaluation (ASE 2020 - Research Papers)

Session: Empirical Software Engineering (1) Wed 23 Sep 2020 17:10 - 18:10 Chair(s): Jinqiu YangConcordia University, Montreal, Canada

The relationship of comments to code, and in particular, the task of generating useful comments given the code, has long been of interest. The earliest approaches have been based on strong syntactic theories of comment-structures, and relied on textual templates. More recently, researchers have applied deep-learning methods to this task—specifically, trainable generative translation models which are known to work very well for Natural Language translation (e.g., from German to English). We carefully examine the underlying assumption here: that the task of generating comments sufficiently resembles the task of translating between natural languages, and so similar models and evaluation metrics could be used. We analyze several recent code-comment datasets for this task: CodeNN, DeepCom, FunCom, and DocString. We compare them with WMT19, a standard dataset frequently used to train state-of-the-art natural language translators. We found some interesting differences between the code-comment data and the WMT19 natural language data. Next, we describe and conduct some studies to calibrate BLEU (which is commonly used as a measure of comment quality). using "affinity pairs" of methods, from different projects, in the same project, in the same class, etc; Our study suggests that the current performance on some datasets might need to be improved substantially. We also argue that fairly naive information retrieval (IR) methods do well enough at this task to be considered a reasonable baseline. Finally, we make some suggestions on how our findings might be used in future research in this area.

David Gros

University of California, Davis

Hariharan Sezhiyan

University of California, Davis

Prem Devanbu

University of California

United States

Zhou Yu

University of California, Davis

Session: Empirical Software Engineering (1) Wed 23 Sep 2020 17:10 - 18:10 Chair(s): Jinqiu YangConcordia University, Montreal, Canada

Software performance is critical to the quality of the software system. Performance bugs can cause significant performance degradation such as long response time and low system throughput that ultimately lead to poor user experiences. Many modern software projects use bug tracking systems that allow developers and users to report issues they have identified in the software. While bug reports are intended to help developers to understand and fix bugs, they are also extensively used by researchers for finding benchmarks to evaluate their testing and debugging approaches. Researchers often rely on the description of a confirmed performance bug report to reproduce the performance bug to be used in their evaluation. Although researchers spend a considerable amount of time and effort in finding usable performance bugs from bug repositories, they often get only a few usable performance bugs. Reproducing performance bugs is a difficult task even for domain experts such as developers. Compared to functional bugs, performance bugs are substantially more complicated to reproduce because they often manifest through large inputs and specific execution conditions. The amount of information disclosed in a bug report may not always be sufficient to reproduce the performance bug for researchers, and thus hinders the usability of bug repository as the resource for finding benchmarks. Our study targets reproducing performance bugs from the perspectives of non-domain experts such as software engineering researchers. One big difference compared to the prior work is that we specifically target confirmed performance bugs to report why software engineering researchers may not succeed in reproducing such bugs rather than understanding and characterizing non- reproducible bugs from the viewpoints of developers. Therefore, a failed-to-reproduce performance bug in this work is defined as a developer confirmed reproducible performance bug that cannot be reproduced by researchers due to the lack of domain knowledge or environment limitations. The goal of this study is to share our experience as software engineering researchers in reproducing performance bugs through investigating the impact of different factors identified in confirmed performance bug reports in open-source projects. We studied the characteristics of confirmed performance bugs by reproducing them using only information available from the bug report to examine the challenges of performance bug reproduction. We spent more than 800 hours over the course of six months to study and reproduce 93 confirmed performance bugs, which are randomly sampled from two large-scale open-source server applications. We 1) studied the characteristics of the reproduced performance bug reports; 2) summarized the causes of failed-to-reproduce confirmed performance bug reports; 3) shared our experience on suggesting workarounds to improve the bug reproduction success rate; 4) delivered a virtual machine image that contains a set of 17 ready-to-execute performance bug benchmarks. The findings of our study provide guidance and a set of suggestions to help researchers to understand, evaluate, and successfully reproduce performance bugs. We also provided a set of implications for both researchers and practitioners on developing techniques for testing and diagnosing performance bugs, improving the quality of bug reports, and detecting failed-to-reproduce bug reports.

Link to Publication: https://www.sciencedirect.com/science/article/pii/S0164121219301438

Xue Han

University of Kentucky

Daniel Carroll

University of Kentucky

Tingting Yu

University of Kentucky

United States

Session: Empirical Software Engineering (1) Wed 23 Sep 2020 17:10 - 18:10 Chair(s): Jinqiu YangConcordia University, Montreal, Canada

Dependencies among software entities are the basis for many software analytic research and architecture analysis tools. Dynamically typed languages, such as Python, JavaScript and Ruby, tolerate the lack of explicit type references, making certain syntactic dependencies indiscernible in source code. We call these \emph{possible dependencies}, in contrast with the \emph{explicit dependencies} that are directly referenced in source code. Type inference techniques have been widely studied and applied, but existing architecture analytic research and tools have not taken possible dependencies into consideration. The fundamental question is, \emph{to what extent will these missing possible dependencies impact the architecture analysis?} To answer this question, we conducted an empirical study with 105 Python projects, using type inference techniques to manifest possible dependencies. Our study revealed that the architectural impact of possible dependencies is substantial—higher than that of explicit dependencies: (1) file-level possible dependencies account for at least 27.93% of all file-level dependencies, and create different dependency structures than that of explicit dependencies only, with an average difference of 30.71%; (2) adding possible dependencies significantly improves the precision (0.52%$\sim$14.18%), recall(31.73%$\sim$39.12%), and F1 scores (22.13%$\sim$32.09%) of capturing co-change relations; (3) on average, a file involved in possible dependencies influences 28% more files and 42% more dependencies within architectural sub-spaces than a file involved in just explicit dependencies; % Accordingly, possible dependencies dramatically change the file and dependency sets within 23.11% and 26.39% of architectural sub-spaces respectively; (4) on average, a file involved in possible dependencies consumes 32% more maintenance effort. Consequently, maintainability scores reported by existing tools make a system written in these dynamic languages appear to be better modularized than it actually is. This evidence strongly suggests that possible dependencies have a more significant impact than explicit dependencies on architecture quality, that architecture analysis and tools should assess and even emphasize the architectural impact of possible dependencies due to dynamic typing. %Our findings benefit architecture analysis and coding practice for software developed by dynamic languages like Python.

Wuxia Jin

Xi'an Jiaotong University

Yuanfang Cai

Drexel University

United States

Rick Kazman

University of Hawai‘i at Mānoa

Gang Zhang

Emergent Design Inc

Qinghua Zheng

Xi'an Jiaotong University

Ting Liu

Xi'an Jiaotong University

China

	17:10 - 18:10: Empirical Software Engineering (1)Research Papers / Journal-first Papers at Koala Chair(s): Jinqiu YangConcordia University, Montreal, Canada

	17:10 - 17:30 Talk		Code to Comment "Translation": Data, Metrics, Baselining & Evaluation Research Papers David GrosUniversity of California, Davis, Hariharan SezhiyanUniversity of California, Davis, Prem DevanbuUniversity of California, Zhou YuUniversity of California, Davis
	17:30 - 17:50 Talk		Reproducing Performance Bug Reports in Server Applications: The Researchers' Experiences Journal-first Papers Xue HanUniversity of Kentucky, Daniel CarrollUniversity of Kentucky, Tingting YuUniversity of Kentucky Link to publication DOI
	17:50 - 18:10 Talk		Exploring the Architectural Impact of Possible Dependencies in Python software Research Papers Wuxia JinXi'an Jiaotong University, Yuanfang Cai Drexel University, Rick KazmanUniversity of Hawai‘i at Mānoa, Gang ZhangEmergent Design Inc, Qinghua ZhengXi'an Jiaotong University, Ting LiuXi'an Jiaotong University