Research Papers
Tue 11 Oct 2022 10:30 - 10:50 at Banquet A - Technical Session 2 - Debugging and Troubleshooting Chair(s): Andrew Begel, Carnegie Mellon University, Software and Societal Systems Department
A class may need to obey temporal properties in order to function correctly. For example, the correct usage protocol for an iterator is to always check whether there is a next element before asking for it; iterating over a collection when there are no items left leads to a NoSuchElementException. Automatic test case generation tools such as Randoop and EvoSuite do not have any notion of these temporal properties. Generating test cases by randomly invoking methods on a new instance of the class under test may raise run-time exceptions that do not necessarily expose software faults, but are rather a consequence of violations of temporal properties.
This paper presents Call Me Maybe, a novel technique that uses natural language processing to analyze Javadoc comments to identify temporal properties. This information can guide a test case generator towards executing sequences of method calls that respect the temporal properties. Our evaluation on 73 subjects from seven popular Java systems shows that Randoop flags over 10K fewer false negatives and enriches over 12K correctly failing test cases due to violations of temporal properties with clear explanations that can help software developers.
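To make the iterator example from the abstract concrete, the following minimal Java sketch (written for illustration only, not generated by Randoop or Call Me Maybe) contrasts a call sequence that violates the temporal property with one that respects it:

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class IteratorProtocolExample {
    public static void main(String[] args) {
        List<String> items = List.of("a", "b");

        // Violating the temporal property: calling next() without checking
        // hasNext() eventually throws NoSuchElementException, which is not
        // a fault in List but a misuse of the usage protocol.
        Iterator<String> bad = items.iterator();
        try {
            bad.next();
            bad.next();
            bad.next(); // no elements left -> NoSuchElementException
        } catch (NoSuchElementException e) {
            System.out.println("Protocol violation: " + e);
        }

        // Respecting the temporal property: always check hasNext() first.
        Iterator<String> good = items.iterator();
        while (good.hasNext()) {
            System.out.println(good.next());
        }
    }
}
```

A test generator unaware of the property would happily produce the first sequence and report a failure that exposes no real fault.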
Research Papers
Tue 11 Oct 2022 10:50 - 11:10 at Banquet A - Technical Session 2 - Debugging and Troubleshooting Chair(s): Andrew Begel, Carnegie Mellon University, Software and Societal Systems Department
Pretrained language models have been shown to be effective in many software-related generation tasks; however, they are not well-suited for editing tasks as they are not designed to reason about edits. To address this, we propose a novel pretraining objective which explicitly models edits and use it to build CoditT5, a large language model for software-related editing tasks that is pretrained on large amounts of source code and natural language comments. We fine-tune it on various downstream editing tasks, including comment updating, bug fixing, and automated code review. By outperforming standard generation-based models, we demonstrate the generalizability of our approach and its suitability for editing tasks. We also show how a standard generation model and our edit-based model can complement one another through simple reranking strategies, with which we achieve state-of-the-art performance for the three downstream editing tasks.
Industry Showcase
Tue 11 Oct 2022 11:10 - 11:30 at Banquet A - Technical Session 2 - Debugging and Troubleshooting Chair(s): Andrew Begel, Carnegie Mellon University, Software and Societal Systems Department
To secure computer infrastructure, we need to configure all security-relevant settings. We need security experts to identify security-relevant settings, but this process is time-consuming and expensive. Our proposed solution uses state-of-the-art natural language processing to classify settings as security-relevant based on their description. Our evaluation shows that our trained classifiers do not perform well enough to replace the human security experts but can help them classify the settings. By publishing our labeled data sets and the code of our trained model, we want to help security experts analyze configuration settings and enable further research in this area.
Research Papers
Tue 11 Oct 2022 11:30 - 11:50 at Banquet A - Technical Session 2 - Debugging and Troubleshooting Chair(s): Andrew Begel, Carnegie Mellon University, Software and Societal Systems Department
Patch correctness has been the focus of automated program repair (APR) in recent years due to the propensity of APR tools to generate overfitting patches. Given a generated patch, the oracle (e.g., test suites) is generally weak in establishing correctness. Therefore, the literature has proposed various approaches that leverage machine learning with engineered and deep-learned features, or that explore dynamic execution information, to further assess the correctness of APR-generated patches. In this work, we propose a novel perspective on the problem of patch correctness assessment: a correct patch implements changes that "answer" a problem posed by buggy behaviour. Concretely, we turn patch correctness assessment into a Question Answering problem. To tackle this problem, our intuition is that natural language processing can provide the necessary representations and models for assessing the semantic correlation between a bug (question) and a patch (answer). Specifically, we consider as inputs the bug reports as well as the natural language descriptions of the generated patches. Our approach first uses state-of-the-art commit message generation models to produce the relevant inputs associated with each generated patch. Then we leverage a neural network architecture to learn the semantic correlation between bug reports and commit messages. Experiments on a large dataset of 9,135 patches generated for three bug datasets (Defects4J, Bugs.jar and Bears) show that our approach can achieve an AUC of 0.873 in predicting patch correctness, recalling 92% of correct patches while filtering out 64% of incorrect patches. Our experimental results further demonstrate the influence of input quality on prediction performance. We further perform experiments to highlight that the model indeed learns the relationship between bug reports and code change descriptions for the prediction. Finally, we compare against prior work and discuss the benefits of our approach.
Late Breaking Results
Tue 11 Oct 2022 11:50 - 12:00 at Banquet A - Technical Session 2 - Debugging and Troubleshooting Chair(s): Andrew Begel, Carnegie Mellon University, Software and Societal Systems Department
In the context of software development, managing and organizing agile boards of multi-disciplinary teams distributed around the world is a great challenge, especially regarding the process of assigning tickets to the correct team roles. Incorrectly assigned tickets can result in significant resource waste in any project and directly influence delivery outcomes and project costs. This work proposes a method for ticket analysis and automatic team assignment using Natural Language Processing and explainable Machine Learning models. Results show that the models perform well on a real-world team assignment task and provide insights into their decisions.
NIER Track
Tue 11 Oct 2022 14:00 - 14:10 at Banquet A - Technical Session 6 - Source Code Manipulation Chair(s): Collin McMillan, University of Notre Dame
Source code documentation is an important artifact for efficient software development. Code documentation could greatly benefit from automation since manual documentation is often laborious, resource- and time-intensive. In this paper, we employed Codex for automatic code documentation creation. Codex is a GPT-3 based model pre-trained on both natural and programming languages. We find that Codex outperforms existing techniques even with basic settings like one-shot learning (i.e., providing only one example for training). Codex achieves an overall BLEU score of 20.6 for six different programming languages (an 11.2% improvement over earlier state-of-the-art techniques). Thus, Codex shows promise and warrants in-depth future studies for automatic code documentation generation to support diverse development tasks.
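As a rough sketch of what a one-shot setting could look like in practice (the paper's actual prompt format is not given here; the layout below, with one example code-comment pair followed by the target function, is an assumption made purely for illustration):

```java
public class OneShotPromptSketch {
    public static void main(String[] args) {
        // One worked example (code plus its documentation) ...
        String exampleCode =
            "public int max(int a, int b) { return a > b ? a : b; }";
        String exampleDoc =
            "// Summary: Returns the larger of the two given integers.";

        // ... followed by the function we want documented.
        String targetCode =
            "public boolean isEmpty(String s) { return s == null || s.length() == 0; }";

        // The model is expected to complete the trailing "// Summary:" line.
        String prompt = exampleCode + "\n" + exampleDoc + "\n\n"
                + targetCode + "\n" + "// Summary:";
        System.out.println(prompt);
    }
}
```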
Research Papers
Tue 11 Oct 2022 14:10 - 14:30 at Banquet A - Technical Session 6 - Source Code Manipulation Chair(s): Collin McMillan, University of Notre Dame
Competitive programming has become a popular way for programmers to test their skills. Large-scale online programming contests attract millions of experienced programmers to compete against each other. Competition-level programming problems are challenging in nature, and participants often fail to solve the problem on their first attempt. Some online platforms for competitive programming allow programmers to practice on competition-level problems as well, and the standard feedback for an incorrect practice submission is the first test case that the submission fails. Often, the failed test case does not provide programmers with enough information to resolve the errors in their code, and they abandon the problem after making several more unsuccessful attempts.
We present Clef, the first data-driven tool that can generate feedback on competition-level code automatically by repairing programmers’ incorrect submissions. The key development is that Clef can learn how to generate repairs for incorrect submissions by examining the repairs that other programmers made to their own submissions over time. Since the differences between an incorrect program and a correct program for the same task may be significant, we introduce a new data structure, merge trees, to capture the changes between submissions. Merge trees are versatile: they can encode both large algorithm-level redesigns and small statement-level alterations. Clef applies the patterns it learns from a database of submissions to generate repairs for new submissions outside the database. We evaluated Clef on six real-world problems from Codeforces, the world’s largest platform for competitive programming. Clef achieves 42.1% accuracy in repairing programmers’ incorrect submissions. Even when given incorrect submissions from programmers who never found the solution to a problem on their own, Clef repairs the users’ programs 34.1% of the time.
Late Breaking Results
Tue 11 Oct 2022 14:30 - 14:40 at Banquet A - Technical Session 6 - Source Code Manipulation Chair(s): Collin McMillan, University of Notre Dame
Transformer networks such as CodeBERT already achieve very good results for code clone detection in benchmark datasets, so one could assume that this task has already been solved. However, code clone detection is not a trivial task. Semantic code clones in particular are difficult to detect. We show that the generalizability of CodeBERT decreases when evaluating two different subsets of Java code clones from BigCloneBench. We observe a significant drop in F1 score when we evaluate on code snippets and functionality IDs different from those used for model building.
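To illustrate why semantic clones are hard to detect, consider the illustrative pair below (not taken from BigCloneBench): the two Java methods implement the same functionality with very different syntax and structure, so token-level similarity between them is low.

```java
public class SemanticCloneExample {
    // Iterative implementation: sums the integers 1..n with a loop.
    static int sumUpTo(int n) {
        int total = 0;
        for (int i = 1; i <= n; i++) {
            total += i;
        }
        return total;
    }

    // Semantically equivalent clone, but syntactically very different:
    // uses the closed-form formula instead of a loop.
    static int sumUpToClone(int n) {
        return n * (n + 1) / 2;
    }

    public static void main(String[] args) {
        System.out.println(sumUpTo(10));      // 55
        System.out.println(sumUpToClone(10)); // 55
    }
}
```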
NIER Track
Tue 11 Oct 2022 14:40 - 14:50 at Banquet A - Technical Session 6 - Source Code Manipulation Chair(s): Collin McMillan, University of Notre Dame
Code completion is an important feature in an IDE to improve developers’ productivity. Existing code completion approaches focus on completing the current code token, next token or statement, or code pattern. We propose AstCC, a code completion approach to suggest the next syntactic unit via an AST-based statistical language model. AstCC learns from a large code corpus to derive the next AST subtree representing a syntactic unit, and then fills in the template with the concrete variables from the current program scope. Our empirical evaluation shows that AstCC can correctly suggest the next syntactic unit in 33% of the cases, and in 62% of the cases, it correctly suggests within five candidates. We will also explain the potential applications of AstCC in automated program repair, automated test case generation, and syntactic pattern mining.
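A hypothetical example of this kind of suggestion (the concrete templates AstCC learns are not listed in the abstract, so the null-check template and the variable name below are assumptions for illustration): after a statement that may return null, the next syntactic unit could be a null-check template whose hole is filled with a variable from the current scope.

```java
import java.util.Map;

public class NextUnitSuggestionExample {
    static int readTimeout(Map<String, String> config) {
        String value = config.get("timeout");
        // A next-syntactic-unit suggestion at this point could be the
        // filled-in template below, where "value" is taken from the
        // current program scope:
        if (value == null) {
            throw new IllegalStateException("missing configuration: timeout");
        }
        return Integer.parseInt(value);
    }

    public static void main(String[] args) {
        System.out.println(readTimeout(Map.of("timeout", "30")));
    }
}
```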
Research Papers
Tue 11 Oct 2022 14:50 - 15:10 at Banquet A - Technical Session 6 - Source Code Manipulation Chair(s): Collin McMillan, University of Notre Dame
Recent years have brought a surge of work on predicting pieces of source code, e.g., for code completion, code migration, program repair, or translating natural language into code. All this work faces the challenge of evaluating the quality of a prediction w.r.t. some oracle, typically in the form of a reference solution. A common evaluation metric is the BLEU score, an n-gram-based metric originally proposed for evaluating natural language translation, but adopted in software engineering because it can be easily computed on any programming language and enables automated evaluation at scale. However, a key difference between natural and programming languages is that in the latter, completely unrelated pieces of code may have many common n-grams simply because of the syntactic verbosity and coding conventions of programming languages. We observe that these trivially shared n-grams hamper the ability of the metric to distinguish between truly similar code examples and code examples that are merely written in the same language. This paper presents CrystalBLEU, an evaluation metric based on BLEU, that allows for precisely and efficiently measuring the similarity of code. Our metric preserves the desirable properties of BLEU, such as being language-agnostic, able to handle incomplete or partially incorrect code, and efficient, while reducing the noise caused by trivially shared n-grams. We evaluate CrystalBLEU on two datasets from prior work and on a new, labeled dataset of semantically equivalent programs. Our results show that CrystalBLEU can distinguish similar from dissimilar code examples 1.9–4.5 times more effectively, when compared to the original BLEU score and a previously proposed variant of BLEU for code.
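A much-simplified sketch of the underlying idea (not the actual CrystalBLEU implementation; the toy corpus, the cut-off of three n-grams, and the whitespace tokenization are assumptions): treat the most frequent n-grams in a code corpus as "trivially shared" and ignore them when measuring overlap between a candidate and a reference.

```java
import java.util.*;
import java.util.stream.Collectors;

public class TriviallySharedNgramsSketch {
    // Collect all n-grams of the given order from a token sequence.
    static List<String> ngrams(List<String> tokens, int n) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            result.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy "corpus": the most frequent 2-grams in it are treated as
        // trivially shared, mimicking boilerplate like "public static".
        List<List<String>> corpus = List.of(
            List.of("public", "static", "void", "main", "(", ")", "{", "}"),
            List.of("public", "static", "int", "add", "(", ")", "{", "}"));
        Map<String, Long> freq = corpus.stream()
            .flatMap(tokens -> ngrams(tokens, 2).stream())
            .collect(Collectors.groupingBy(g -> g, Collectors.counting()));
        Set<String> trivial = freq.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(3)
            .map(Map.Entry::getKey)
            .collect(Collectors.toSet());

        // Overlap between candidate and reference, ignoring trivial n-grams.
        List<String> candidate = List.of("public", "static", "int", "sub", "(", ")", "{", "}");
        List<String> reference = List.of("public", "static", "int", "add", "(", ")", "{", "}");
        Set<String> shared = new HashSet<>(ngrams(candidate, 2));
        shared.retainAll(ngrams(reference, 2));
        shared.removeAll(trivial);
        System.out.println("Non-trivial shared 2-grams: " + shared);
    }
}
```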
Research Papers
Tue 11 Oct 2022 15:10 - 15:30 at Banquet A - Technical Session 6 - Source Code Manipulation Chair(s): Collin McMillan, University of Notre Dame
Code summarization generates brief natural language descriptions of source code pieces, which can assist developers in understanding code and reduce documentation workload. Recent neural models on code summarization are trained and evaluated on large-scale multi-project datasets consisting of independent code-summary pairs. Despite the technical advances, their effectiveness on a specific project is rarely explored. In practical scenarios, however, developers are more concerned with generating high-quality summaries for their working projects. These projects are usually poorly documented and hence have few historical code-summary pairs. To this end, we investigate low-resource project-specific code summarization, a novel task more consistent with the developers’ requirements. To better characterize project-specific knowledge with limited training samples, we propose a meta transfer learning method by incorporating a lightweight fine-tuning mechanism into a meta-learning framework. Experimental results on nine real-world projects verify the superiority of our method over alternative ones and reveal how the project-specific knowledge is learned.
Research Papers
Wed 12 Oct 2022 10:00 - 10:20 at Banquet A - Technical Session 10 - Testing I Chair(s): Gordon Fraser, University of Passau
Unit tests are widely used to check source code quality, but they can be too coarse-grained or ill-suited for testing individual program statements. We introduce inline tests to make it easier to check for faults in statements. We motivate inline tests through several language features and a common testing scenario in which inline tests could be beneficial. For example, inline tests can allow a developer to test a regular expression in place. We also define language-agnostic requirements for inline testing frameworks. Lastly, we implement I-Test, the first inline testing framework. I-Test works for Python and Java, and it satisfies most of the requirements. We evaluate I-Test on open-source projects by using it to test 144 statements in 31 Python programs and 37 Java programs. We also perform a user study. All nine user study participants say that inline tests are easy to write and that inline testing is beneficial. The cost of running inline tests is negligible, at 0.007x–0.014x, and our inline tests helped find two faults that have been fixed by the developers.
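The regular-expression example from the abstract can be pictured as follows. Since I-Test's actual syntax is not shown here, the snippet uses a hypothetical check helper to express an inline test placed right next to the statement it targets:

```java
import java.util.regex.Pattern;

public class InlineTestIdeaExample {
    public static void main(String[] args) {
        // Statement under test: a regular expression for simple identifiers.
        Pattern identifier = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

        // Inline checks placed immediately after the statement they test
        // (hypothetical stand-in for I-Test's real syntax).
        check(identifier.matcher("myVar_1").matches());
        check(!identifier.matcher("1stVar").matches());
    }

    static void check(boolean condition) {
        if (!condition) {
            throw new AssertionError("inline test failed");
        }
    }
}
```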
Tool Demonstrations
Wed 12 Oct 2022 09:30 - 10:00 at Ballroom A - Tool Poster Session 2
Refactoring software can be hard and time-consuming. Several refactoring tools assist developers in reaching more readable and maintainable code. However, most of them are characterized by long feedback loops that impoverish the refactoring experience. We believe that we can reduce this problem by focusing on the concept of Live Refactoring and its main principles: the live recommendation and continuous visualization of refactoring candidates, and the immediate visualization of results from applying a refactoring to the code. Therefore, we implemented a Live Refactoring Environment that identifies, suggests, and applies Extract Method refactorings. To evaluate our approach, we carried out an empirical experiment. Early results showed us that our refactoring environment improves several code quality aspects, being well received, understood, and used by the experiment participants. The source code of our tool is available at: https://github.com/saracouto1318/LiveRef. Its demonstration video can be found at: https://youtu.be/_jxx21ZiQ0o.
Research Papers
Wed 12 Oct 2022 10:30 - 10:50 at Banquet A - Technical Session 10 - Testing I Chair(s): Gordon Fraser, University of Passau
Voice-based virtual assistants are becoming increasingly popular. Such systems provide frameworks to developers on which they can build their own apps. End-users can interact with such apps through a Voice User Interface (VUI), which allows them to use natural language commands to perform actions. Testing such systems is far from trivial, especially because the same command can be expressed using several semantically equivalent utterances, to which the VUI is expected to correctly react. To support developers in testing VUIs, deep learning (DL)-based tools have been integrated into development environments to generate paraphrases for selected seed utterances. This is, for example, the case for the Alexa Developer Console (ADC). Such tools, however, generate a limited number of paraphrases and do not cover several corner cases. In this paper, we introduce VUI-UPSET, a novel approach that aims at adapting chatbot-testing approaches to VUI testing. Indeed, both systems provide a similar natural-language-based interface to users. We conducted an empirical study to understand how VUI-UPSET compares to existing approaches in terms of (i) correctness of the generated paraphrases, and (ii) capability of revealing bugs. We manually analyzed 5,872 generated paraphrases, totaling 13,310 evaluations. Our results show that the DL-based tool integrated in the ADC generates a significantly higher percentage of meaningful paraphrases compared to VUI-UPSET. However, VUI-UPSET generates a higher number of bug-revealing paraphrases, which allows developers to test their apps more thoroughly at the cost of discarding a higher number of irrelevant paraphrases.
Industry Showcase
Wed 12 Oct 2022 10:50 - 11:10 at Banquet A - Technical Session 10 - Testing I Chair(s): Gordon Fraser, University of Passau
Many big companies are providing cloud services through RESTful APIs nowadays. With the growing popularity of RESTful APIs, testing them becomes crucial. To address this issue, researchers have proposed several automatic RESTful API testing techniques. In Huawei, we design and implement an automatic RESTful API testing framework named MOREST. MOREST has been used to test ten RESTful API services and helped detect 83 previously unknown bugs, all of which were confirmed and fixed by the developers. On one hand, we find that MOREST shows great capability of detecting bugs in RESTful APIs. On the other hand, we also notice that human effort is inevitable and important when applying automatic RESTful API techniques in practice.
In this paper, we focus on discussing the industry practice of using automatic RESTful API testing techniques. Specifically, we emphasize the impact of human-in-the-loop for RESTful API testing. By analyzing the human factors, we summarize insights for improving the coordination between automated tools and human experts, increasing the level of automation for the tools as well as boosting the overall testing performance.
Research Papers
Wed 12 Oct 2022 11:10 - 11:30 at Banquet A - Technical Session 10 - Testing I Chair(s): Gordon Fraser, University of Passau
Virtual personal assistant (VPA) services, e.g., Amazon Alexa and Google Assistant, are becoming increasingly popular. Users interact with them through voice-based apps, e.g., Amazon Alexa skills and Google Assistant actions. Unlike desktop and mobile apps, which have a visible and intuitive graphical user interface (GUI) to facilitate interaction, VPA apps convey information purely verbally through the voice user interface (VUI), which is known to be limited by its invisibility, single interaction mode, and high demand on user attention. This may lead to various problems in the usability and correctness of VPA apps.
In this work, we propose a model-based framework named Vitas to handle VUI testing of VPA apps. Vitas interacts with the app VUI, and during the testing process, it retrieves semantic information from voice feedback by natural language processing. It incrementally constructs the finite state machine (FSM) model of the app with a weighted exploration strategy guided by key factors such as the coverage of app functionality. We conduct large-scale testing on 41,581 VPA apps (i.e., skills) of Amazon Alexa, the most popular VPA service, and find that 51.29% of them have weaknesses. They largely suffer from problems such as unexpected exit/start, privacy violation and so on. Our work reveals the immaturity of the VUI designs and implementations in VPA apps, and sheds light on the improvement of several crucial aspects of VPA apps.
Research Papers
Wed 12 Oct 2022 13:30 - 13:50 at Banquet A - Technical Session 14 - Bug Prediction and Localization Chair(s): David Lo, Singapore Management University
Continuous integration (CI) is a software practice by which developers frequently merge and test code under development. In CI settings, the change information is finer-grained. Prior studies have widely studied and evaluated the performance of spectrum-based fault localization (SBFL) techniques. While the continuous nature of CI requires code changes to be atomic and presents fine-grained information on what part of the system is being changed, traditional SBFL techniques do not benefit from this information. In this paper, we conduct an empirical study on the effectiveness of using and integrating code and coverage changes for fault localization in CI settings. We conduct our study on seven open source systems, with a total of 192 faults. We find that while both types of change information cover a smaller search space than code coverage, the percentages of faulty methods in the search space are 7 and 14 times higher than with code coverage for code changes and coverage changes, respectively. Then, we propose three change-based fault localization techniques and compare them with Ochiai, a commonly used SBFL technique. Our results show that all three change-based techniques outperform Ochiai, achieving an improvement that varies from 7% to 23% and 17% to 24% over Ochiai for average MAP and MRR, respectively. Moreover, we find that our change-based fault localization techniques can be integrated with Ochiai, achieving up to 53% and 52% improvement over Ochiai in average MAP and MRR respectively, and locating 41 more faults at Top-1.
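For reference, the Ochiai suspiciousness score used as the SBFL baseline above is commonly defined as follows (standard formulation, not specific to this paper):

```latex
\text{Ochiai}(s) = \frac{e_f(s)}{\sqrt{\bigl(e_f(s) + n_f(s)\bigr)\,\bigl(e_f(s) + e_p(s)\bigr)}}
```

where, for a program element s, e_f(s) and e_p(s) are the numbers of failing and passing tests that execute s, and n_f(s) is the number of failing tests that do not execute s.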
Industry Showcase
Wed 12 Oct 2022 13:50 - 14:10 at Banquet A - Technical Session 14 - Bug Prediction and Localization Chair(s): David Lo, Singapore Management University
We share our experience in developing Code Understanding Linter, an automated code review tool based on language models of code. We introduce several ideas to make the tool more practical, including combining two different language models, filtering meaningless outputs from the model, and generating developer-friendly diagnosis messages by interpreting the outputs from the model. On top of those ideas, we describe the design and implementation of an automated code review tool that detects variable-misuse defects in Python code and suggests how to fix them. We evaluated the tool with a set of code repositories in Samsung Electronics, which contains real-world Python code. Our experiments show that our tool can discover hidden defects in real-world code, but the false positive rate is far higher than we expected. After manually investigating every false positive, we discuss the limitations of the language models and possible solutions.
Journal-first Papers
Wed 12 Oct 2022 14:10 - 14:30 at Banquet A - Technical Session 14 - Bug Prediction and Localization Chair(s): David Lo, Singapore Management University
Many critical codebases are written in C, and most of them use preprocessor directives to encode variability, effectively encoding software product lines. These preprocessor directives, however, challenge any static code analysis. SPLlift, a previously presented approach for analyzing software product lines, is limited to Java programs that use a rather simple feature encoding and to analysis problems with a finite and ideally small domain. Other approaches that allow the analysis of real-world C software product lines use special-purpose analyses, preventing the reuse of existing analysis infrastructures and ignoring the progress made by the static analysis community. This work presents VARALYZER, a novel static analysis approach for software product lines. VARALYZER first transforms preprocessor constructs to plain C while preserving their variability and semantics. It then solves any given distributive analysis problem on transformed product lines in a variability-aware manner. VARALYZER’s analysis results are annotated with feature constraints that encode in which configurations each result holds. Our experiments with 95 compilation units of OpenSSL show that applying VARALYZER enables one to conduct inter-procedural, flow-, field- and context-sensitive data-flow analyses on entire product lines for the first time, outperforming the product-based approach for highly-configurable systems.
NIER Track
Wed 12 Oct 2022 14:30 - 14:40 at Banquet A - Technical Session 14 - Bug Prediction and Localization Chair(s): David Lo, Singapore Management University
Faults in spreadsheets are not uncommon and they can have significant negative consequences in practice. Various approaches for fault localization were proposed in recent years, among them techniques that transferred ideas from spectrum-based fault localization (SFL) to the spreadsheet domain. Applying SFL to spreadsheets proved to be effective, but has certain limitations. Specifically, the constrained computational structures of spreadsheets may lead to large sets of cells that cannot be distinguished using SFL and thus have to be inspected manually. In this work, we propose to combine SFL with a fault prediction method based on spreadsheet metrics in a machine learning (ML) approach. In particular, we train supervised ML models using two orthogonal types of features: (i) variables that are used to compute similarity coefficients in SFL and (ii) spreadsheet product metrics that have been shown to be good predictors of faulty formulas in previous work. Experiments with a widely-used corpus of faulty spreadsheets indicate that the combined model helps to significantly improve fault localization performance in terms of wasted effort and accuracy.
Research Papers
Wed 12 Oct 2022 14:40 - 15:00 at Banquet A - Technical Session 14 - Bug Prediction and Localization Chair(s): David Lo, Singapore Management University
Failures that are not related to a specific fault can reduce the effectiveness of fault localization in multi-fault scenarios. To tackle this challenge, researchers and practitioners typically cluster failures (e.g., failed test cases) into several disjoint groups, with those caused by the same fault grouped together. In such a fault isolation process that requires input in a mathematical form, ranking-based failure proximity (R-proximity) is widely used to model failed test cases. In R-proximity, each failed test case is represented as a suspiciousness ranking list of program statements through a fingerprinting function (i.e., a risk evaluation formula, REF). Although many off-the-shelf REFs have been integrated into R-proximity, they were originally designed for single-fault localization. To the best of our knowledge, no REF has been developed to serve as a fingerprinting function of R-proximity in multi-fault scenarios. To better cluster failures in fault isolation, in this paper, we present a genetic programming-based framework along with a sophisticated fitness function, for evolving REFs with the goal of more properly representing failures in multi-fault scenarios. By using a small set of programs for training, we obtain a collection of REFs that deliver good results in a larger and more general range of scenarios. The best of them outperforms the state-of-the-art by 50.72% and 47.41% in faults number estimation and clustering effectiveness, respectively. Our framework is highly configurable for further use, and the evolved formulas can be directly applied in future failure representation tasks without any retraining.
Journal-first Papers
Wed 12 Oct 2022 15:00 - 15:20 at Banquet A - Technical Session 14 - Bug Prediction and Localization Chair(s): David Lo, Singapore Management University
No description available.
Tool Demonstrations
Tue 11 Oct 2022 10:00 - 10:30 at Ballroom A - Tool Poster Session 1
Test-based generate-and-validate automated program repair (APR) systems generate many patches that pass the test suite without fixing the bug. The generated patches must be manually inspected by the developers, a task that tends to be time-consuming, thereby diminishing the role of APR in reducing debugging costs. We present the design and implementation of a novel tool, named Shibboleth, for automatic assessment of the patches generated by test-based generate-and-validate APR systems. Shibboleth leverages lightweight static and dynamic heuristics from both test and production code to rank and classify the patches. Shibboleth is based on the idea that the buggy program is almost correct and the bugs are small mistakes that require small changes to fix; in particular, the fix does not remove the code implementing correct functionality of the program. Thus, the tool measures the impact of patches on both production code (via syntactic and semantic similarity) and test code (via code coverage) to separate the patches that result in similar programs and that do not remove desired program elements. We have evaluated Shibboleth on 1,871 patches, generated by 29 Java-based APR systems for Defects4J programs. The technique outperforms state-of-the-art ranking and classification techniques. Specifically, in our ranking data set, in 66% of the cases, Shibboleth ranks the correct patch in top-1 or top-2 positions and, in our classification data set, it achieves an accuracy and F1-score of 0.887 and 0.852, respectively, in classification mode. A demo video of the tool is available at https://bit.ly/3NvYJN8.
Research Papers
Wed 12 Oct 2022 16:10 - 16:30 at Banquet A - Technical Session 18 - Testing II Chair(s): Darko Marinov, University of Illinois at Urbana-Champaign
Software systems powering OS kernels, basebands, bootloaders, firmware, IoT devices, or automotive platforms form the foundation of infrastructure that billions of people rely on every day. Testing these systems is crucial, especially as their complexity grows and they are often written in unsafe languages such as C/C++.
However, testing such complex systems poses significant challenges, e.g., custom hardware for which there is no emulator, or a non-trivial setup of testing and debugging on the target device. As a result, the commonly used testing techniques and tools are not always easily applicable.
Off-target (OT) testing is a promising technique which addresses these challenges: part of the code is extracted and adapted to run on a different hardware platform with better tool support, easier debugging and higher test throughput. Unfortunately, since the process of creating an OT program has been manual, the technique did not scale well and was mostly used in an ad hoc manner.
In this paper we present a novel complex systems testing approach called Auto Off-target (AoT). Based on the information extracted from the source code and from the build process, AoT can automatically generate OT programs in C. AoT goes beyond the code generation and provides mechanisms that help to recreate and discover the program state in the OT code. The generated OTs are self-contained and independent of the original build environment. As a result, pieces of complex or embedded software can be easily run, analyzed, debugged and tested on a standard x86_64 machine.
We evaluate AoT on tens of thousands of functions selected from OS kernels, a bootloader and a network stack. We demonstrate that the majority of the generated OTs can be automatically tested with fuzzing and symbolic execution. We further used AoT in a bug-finding campaign and discovered seven bugs in the Android redfin and oriole kernels powering Google Pixel 5 and 6 phones.
Tool Demonstrations
Tue 11 Oct 2022 10:00 - 10:30 at Ballroom A - Tool Poster Session 1
The rapid expansion of robotics relies on properly configuring and testing hardware and software. Due to the expense and hazard of real-world testing on hardware, robot system testing increasingly utilizes extensive simulation. Creating robot simulation tests requires specialized skills in robot programming and simulation tools. While there are many platforms and tool-kits to create these simulations, they can be cumbersome when combined with automated testing. We present Maktub: a tool for creating tests using Unity and ROS. Maktub leverages the extensive 3D manipulation capabilities of Unity to lower the barrier in creating system tests for robots. A key idea of Maktub is to make tests without needing robotic software development skills. A video demonstration of Maktub can be found here: https://youtu.be/c0Bacy3DlEE, and the source code can be found here: https://github.com/RobotCodeLab/Maktub.
Journal-first Papers
Wed 12 Oct 2022 16:40 - 17:00 at Banquet A - Technical Session 18 - Testing II Chair(s): Darko Marinov, University of Illinois at Urbana-Champaign
Mutation testing research has indicated that a major part of its application cost is due to the large number of low-utility mutants that it introduces. Although previous research has identified this issue, no previous study has proposed any effective solution to the problem. Thus, it remains unclear how to mutate and test a given piece of code in a best-effort way, i.e., achieving a good trade-off between invested effort and test effectiveness. To achieve this, we propose Cerebro, a machine learning approach that statically selects subsuming mutants, i.e., the set of mutants that resides at the top of the subsumption hierarchy, based on the mutants’ surrounding code context. We evaluate Cerebro using 48 and 10 programs written in C and Java, respectively, and demonstrate that it preserves the mutation testing benefits while limiting application cost, i.e., it reduces all cost factors such as equivalent mutants, mutant executions, and mutants requiring analysis. We demonstrate that Cerebro has strong inter-project prediction ability, significantly higher than two baseline methods, i.e., supervised learning on features proposed by the state-of-the-art, and random mutant selection. More importantly, our results show that Cerebro’s selected mutants lead to strong tests that kill 2 times more subsuming mutants than those killed by the baselines when selecting the same number of mutants. At the same time, Cerebro reduces the cost-related factors, as it selects, on average, 68% fewer equivalent mutants, while requiring 90% fewer test executions than the baselines.
Research Papers
Wed 12 Oct 2022 17:00 - 17:20 at Banquet A - Technical Session 18 - Testing II Chair(s): Darko Marinov, University of Illinois at Urbana-Champaign
Question answering (QA) software uses information retrieval and natural language processing techniques to automatically answer questions posed by humans in a natural language. Like other AI-based software, QA software may contain bugs. To automatically test QA software without human labeling, previous work extracts facts from question-answer pairs and generates new questions to detect QA software bugs. Nevertheless, the generated questions can be ambiguous, confusing, or syntactically chaotic, making them unanswerable for QA software. As a result, a large proportion of the reported bugs are false positives. In this work, we propose QATest, a sentence-level mutation-based metamorphic testing tool for QA software. To eliminate false positives and achieve precise automatic testing, QATest leverages five Metamorphic Relations (MRs) as well as semantics-guided searching and enhanced test oracles. Our evaluation on three QA datasets demonstrates that QATest outperforms the state-of-the-art in both quantity (8,133 vs. 6,601 bugs) and quality (97.67% vs. 49% true positive rate) of the reported bugs. Moreover, the test inputs generated by QATest successfully reduce the MR violation rate from 44.29% to 20.51% when adopted to fine-tune the QA software under test.
Journal-first Papers
Wed 12 Oct 2022 17:20 - 17:40 at Banquet A - Technical Session 18 - Testing II Chair(s): Darko Marinov, University of Illinois at Urbana-Champaign
Fault Localization (FL) is an important first step in software debugging and is mostly manual in the current practice. Many methods have been proposed over the years to automate the FL process, including information retrieval (IR)-based techniques. These methods localize the fault based on the similarity between the reported bug report and the source code. Newer variations of IR-based FL (IRFL) techniques also look into the history of bug reports and leverage them during the localization. However, all existing IRFL techniques limit themselves to the current project’s data (local data). In this study, we introduce an IRFL framework consisting of methods that use models pre-trained on global data (extracted from open-source benchmark projects). Within this framework, we investigate two heuristics: (a) the effect of global data on a state-of-the-art IRFL technique, and (b) the application of a Word Embedding technique (Doc2Vec) together with global data. Our large-scale experiment on 51 software projects shows that using global data improves MRR (Mean Reciprocal Rank) and MAP (Mean Average Precision) by 6.6% and 4.8% on average, with improvements of over 14% in a majority of the cases (64% and 54% of the cases in terms of MRR and MAP, respectively). This amount of improvement is significant compared to the improvement rates that five other state-of-the-art IRFL tools provide over the baseline. In addition, training the models globally is a one-time offline task with no overhead on run-time fault localization. Our study, however, shows that a Word Embedding-based global solution did not further improve the results.
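For context, the two metrics reported above are standard IR measures; for a set of queries Q (here, bug reports) they are commonly defined as:

```latex
\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i},
\qquad
\mathrm{MAP} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \mathrm{AP}_i,
\qquad
\mathrm{AP}_i = \frac{\sum_{k} P_i(k)\,\mathrm{rel}_i(k)}{\#\text{relevant files for query } i}
```

where rank_i is the rank of the first buggy file retrieved for bug report i, P_i(k) is precision at cut-off k, and rel_i(k) is 1 if the file at rank k is actually buggy and 0 otherwise.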
Research Papers
Wed 12 Oct 2022 17:40 - 18:00 at Banquet A - Technical Session 18 - Testing II Chair(s): Darko Marinov, University of Illinois at Urbana-Champaign
Unit testing is a critical part of the software development process, ensuring the correctness of basic programming units in a program (e.g., a method). Search-based software testing (SBST) is an automated approach to generating test cases. SBST generates test cases with genetic algorithms by specifying the coverage criterion (e.g., branch coverage). However, a good test suite must have different properties, which cannot be captured by using an individual coverage criterion alone. Therefore, the state-of-the-art approach combines multiple criteria to generate test cases. As combining multiple coverage criteria brings multiple objectives for optimization, it hurts the test suites’ coverage for certain criteria compared with using a single criterion. To cope with this, we propose a novel approach named smart selection. Based on the coverage correlations among criteria and the coverage goals’ subsumption relationships, smart selection selects a subset of coverage goals to reduce the number of optimization objectives and avoid missing any properties of all criteria. We conduct experiments to evaluate smart selection on 400 Java classes with three state-of-the-art genetic algorithms. On average, smart selection outperforms the original combination (i.e., combining all goals) on 77 Java classes, accounting for 65.1% of the classes having significant differences between the two approaches.
Research Papers
Thu 13 Oct 2022 10:00 - 10:20 at Banquet A - Technical Session 22 - Code Summarization and Recommendation Chair(s): Houari Sahraoui, Université de Montréal
Smart contracts are gaining popularity as a means to support transparent, traceable, and self-executing decentralized applications, which enable the exchange of value in a trustless environment. Developers of smart contracts rely on various libraries, such as OpenZeppelin for Solidity contracts, to improve application quality and reduce development costs. The API documentation of these libraries is an important source of information for developers who are unfamiliar with the APIs. Yet, maintaining high-quality documentation is non-trivial, and errors in documentation may place barriers for developers trying to learn the correct usage of APIs. In this paper, we propose a technique, DocCon, to detect inconsistencies between documentation and the corresponding code for Solidity smart contract libraries. Our fact-based approach allows inconsistencies of different severity levels to be queried from a database containing precomputed facts about the API code and documentation. DocCon successfully detected high-priority API documentation errors in popular smart contract libraries, including mismatching parameters, missing requirements, outdated descriptions, etc. Our experimental results show that DocCon achieves good precision and is applicable to different libraries: 29 and 22 out of our reported 40 errors have been confirmed and fixed by library developers so far.
NIER Track
Thu 13 Oct 2022 10:20 - 10:30 at Banquet A - Technical Session 22 - Code Summarization and Recommendation Chair(s): Houari Sahraoui, Université de Montréal
Very large language models (LLMs), such as GPT-3 and Codex, have achieved state-of-the-art performance on several natural-language tasks, and show great promise also for code. A particularly exciting aspect of LLMs is their knack for few-shot and zero-shot learning: they can learn to perform a task with very few examples. Few-shotting has particular synergies in software engineering, where there are a lot of phenomena (identifier names, APIs, terminology, coding patterns) that are known to be highly project-specific. However, project-specific data can be quite limited, especially early in the history of a project; thus the few-shot learning capacity of LLMs might be very relevant. In this paper, we investigate the use of few-shot training with the very large GPT (Generative Pre-trained Transformer) Codex model, and find evidence suggesting that one can significantly surpass state-of-the-art models for code summarization, leveraging project-specific training.
Research Papers
Thu 13 Oct 2022 10:30 - 10:50 at Banquet A - Technical Session 22 - Code Summarization and Recommendation Chair(s): Houari Sahraoui, Université de Montréal
Prior studies have demonstrated that approaches to generate an answer summary for a given technical query on Software Question and Answer (SQA) sites are desired. We find that existing approaches are assessed solely through user studies. Hence, a new user study needs to be performed every time a new approach is introduced; this is time-consuming, slows down the development of the new approach, and results from different user studies may not be comparable to each other. There is a need for a benchmark with ground truth summaries to complement assessment through user studies. Unfortunately, such a benchmark is non-existent for answer summarization for technical queries from SQA sites.
To fill the gap, we manually construct a high-quality benchmark to enable automatic evaluation of answer summarization for technical queries for SQA sites. It contains 111 query-summary pairs extracted from 382 Stack Overflow answers with 2,014 sentence candidates. Using the benchmark, we comprehensively evaluate the performance of existing approaches and find that there is still much room for improvement.
Motivated by the results, we propose a new approach, TechSumBot, with three key modules: (1) a Usefulness Ranking module, (2) a Centrality Estimation module, and (3) a Redundancy Removal module. We evaluate TechSumBot in both automatic (i.e., using our benchmark) and manual (i.e., via a user study) manners. The results from both evaluations consistently demonstrate that TechSumBot outperforms the best-performing baseline approaches from both the SE and NLP domains by a large margin, i.e., 10.83%–14.90%, 32.75%–36.59%, and 12.61%–17.54% in terms of ROUGE-1, ROUGE-2, and ROUGE-L in the automatic evaluation, and 5.79%–9.23% and 17.03%–17.68% in terms of average usefulness and diversity score in the human evaluation. This highlights that the automatic evaluation of our benchmark can uncover findings similar to the ones found through user studies. More importantly, automatic evaluation has a much lower cost, especially when it is used to assess a new approach. Additionally, we conducted an ablation study, which demonstrates that each module in TechSumBot contributes to boosting its overall performance. We release the benchmark as well as the replication package of our experiment at https://anonymous.4open.science/r/TECHSUMBOT.
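For reference, the ROUGE-N scores reported above follow the standard recall-oriented definition (not specific to this paper):

```latex
\mathrm{ROUGE\text{-}N} =
\frac{\sum_{S \in \mathrm{References}} \sum_{g_n \in S} \mathrm{Count}_{\mathrm{match}}(g_n)}
     {\sum_{S \in \mathrm{References}} \sum_{g_n \in S} \mathrm{Count}(g_n)}
```

where g_n ranges over the n-grams of the reference summaries and Count_match counts how many of them also appear in the generated summary; ROUGE-L replaces fixed-length n-grams with the longest common subsequence between generated and reference summaries.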
Journal-first Papers
Thu 13 Oct 2022 10:50 - 11:10 at Banquet A - Technical Session 22 - Code Summarization and Recommendation Chair(s): Houari Sahraoui, Université de Montréal
No description available.
NIER Track
Thu 13 Oct 2022 11:10 - 11:20 at Banquet A - Technical Session 22 - Code Summarization and Recommendation Chair(s): Houari Sahraoui, Université de Montréal
Recommender systems are a valuable tool for software engineers. For example, they can provide developers with a ranked list of files likely to contain a bug, or multiple auto-complete suggestions for a given method stub. However, the way these recommender systems interact with developers is often rudimentary: a long list of recommendations ranked only by the model’s confidence. In this vision paper, we lay out our research agenda for re-imagining how recommender systems for software engineering communicate their insights to developers. When issuing recommendations, we aim to recommend diverse rather than redundant solutions and present them in ways that highlight their differences. We also want to allow for seamless and interactive navigation of suggestions while striving for holistic end-to-end evaluations. By doing so, we believe that recommender systems can play an even more important role in helping developers write better software.
Christoph Treude is a Senior Lecturer in Software Engineering in the School of Computing and Information Systems at the University of Melbourne. The goal of his research is to improve the quality of software and the productivity of those producing it, with a particular focus on getting information to software developers when and where they need it.
His research combines empirical studies with the innovation of tools and approaches that take the wide variety of natural language artefacts in software repositories into account. He has authored more than 100 scientific articles with more than 200 co-authors, and his work has received an ARC Discovery Early Career Research Award (2018-2020), industry funding from Google, Facebook, and DST, as well as four best paper awards including two ACM SIGSOFT Distinguished Paper Awards. Prior to joining the University of Melbourne, Christoph Treude held a Senior Lecturer position at the University of Adelaide and worked as a postdoctoral researcher at McGill University, the University of São Paulo, and the Federal University of Rio Grande do Norte. He currently serves as a board member on the Editorial Board of the Empirical Software Engineering journal and was general co-chair for the 36th IEEE International Conference on Software Maintenance and Evolution.
Industry Showcase
Thu 13 Oct 2022 11:20 - 11:40 at Banquet A - Technical Session 22 - Code Summarization and Recommendation Chair(s): Houari Sahraoui, Université de Montréal
Although the automatic model updating process has been widely used in industrial recommendation systems, there are several challenges in utilizing multi-source data to improve recommendation performance, at both the model and engineering levels. In this paper, we introduce a novel Multi-View approach with Hybrid Attentive Networks (MV-HAN) for content retrieval in the matching stage of recommender systems. The proposed model enables high-order feature interaction from various input features while effectively transferring knowledge between different types. Moreover, the MV-HAN employs deep neural networks with a well-placed parameter-sharing strategy, improving retrieval performance for sparse types. The MV-HAN inherits the efficiency advantages of the two-tower model in online serving by mapping all representations, including users and contents of different types, into the same space. This enables fast retrieval of similar contents with an approximate nearest neighbor algorithm. We conduct offline experiments on several industrial datasets, showing that the proposed MV-HAN significantly outperforms baselines on the content retrieval task. Moreover, the MV-HAN has been deployed in a real-world matching system. Results of online A/B tests demonstrate that the proposed method can significantly improve the quality of recommendations.
Research Papers
Thu 13 Oct 2022 11:40 - 12:00 at Banquet A - Technical Session 22 - Code Summarization and Recommendation Chair(s): Houari Sahraoui, Université de Montréal
With the support of exception handling mechanisms, when an error occurs, its corresponding typed exception can be thrown. The thrown exception can be caught if its type matches the type of the expected exceptions. If the thrown exception is caught, the handling code will resolve the error (e.g., closing resources). Although this mechanism is critical for resolving runtime errors, bugs inside this process can have far-reaching impacts. Therefore, researchers have proposed various approaches to assist in catching and handling thrown exceptions and to detect corresponding bugs (e.g., resource leaks). If the thrown exceptions themselves are incorrect, their errors will never be correctly caught and handled. In practice, such thrown exceptions have caused real critical bugs. However, to the best of our knowledge, no approach has been proposed to recommend which exceptions shall be thrown. Exceptions are widely adopted in programs, often poorly documented, and sometimes ambiguous, making the rules of throwing correct exceptions rather complicated. A project team can leverage exceptions in a way totally different from other teams. As a result, experienced programmers can have difficulties in determining which exception shall be thrown, even if its surrounding code is already written. In this paper, we propose the first approach, ThEx, to predict which exception(s) shall be thrown under a given programming context. The basic idea is to learn a classification model from existing thrown exceptions in different contexts. Here, the learning features are extracted from various code information surrounding the thrown exceptions, such as the thrown locations and related variable names. Then, given a new context, ThEx can automatically predict its best exception(s). We have evaluated ThEx on 12,012 thrown exceptions that were collected from nine popular open-source projects. Our results show that it can achieve high F-scores and MCC values (both around 0.8). On this benchmark, we also evaluated the impacts of our underlying technical details. Furthermore, we evaluated our approach in the wild, and used ThEx to detect anomalies from the latest versions of the nine projects. In this way, we found 20 anomalies, and reported them as bugs to their issue trackers. Among them, 18 were confirmed, and 13 have already been fixed.
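To make the problem concrete, here is an illustrative sketch (unrelated to ThEx's training data or features): even with the surrounding code already written, a developer still has to decide which exception type each guard should throw, and different projects follow different conventions.

```java
import java.util.Objects;

public class ExceptionChoiceExample {
    private boolean connected;

    void send(byte[] payload) {
        // Which exception should each guard throw? Conventions differ:
        // a null argument is often rejected with IllegalArgumentException
        // (or a NullPointerException via Objects.requireNonNull), while
        // calling send() before connect() is usually IllegalStateException.
        Objects.requireNonNull(payload, "payload must not be null");
        if (!connected) {
            throw new IllegalStateException("send() called before connect()");
        }
        // ... transmit payload ...
    }

    void connect() {
        connected = true;
    }

    public static void main(String[] args) {
        ExceptionChoiceExample channel = new ExceptionChoiceExample();
        channel.connect();
        channel.send(new byte[] {1, 2, 3});
    }
}
```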
I received my Ph.D. degree from Peking University in 2009. My Ph.D. dissertation was nominated for the distinguished Ph.D. dissertation award of the China Computer Federation. After graduation, I joined the Institute of Software, Chinese Academy of Sciences as an assistant professor, and was promoted to associate professor in 2011. From 2012 to 2014, I was a visiting scholar with the University of California, Davis. In 2014, I joined Shanghai Jiao Tong University. I am a recipient of an ACM SIGSOFT Distinguished Paper Award, the best paper award of ASE, and the best paper award of APSEC.
Research Papers
Thu 13 Oct 2022 13:30 - 13:50 at Banquet A - Technical Session 27 - Dynamic and Concolic Analysis Chair(s): ThanhVu Nguyen, George Mason University
Programs that deal with heap-allocated inputs are difficult to analyze with symbolic execution (SE). Lazy Initialization (LI) is an approach to SE that deals with heap-allocated inputs by starting SE over a fully symbolic heap, and initializing the inputs’ fields on demand, as the program under analysis accesses them. However, when the program’s assumed precondition has structural constraints over the inputs, operationally captured via repOK routines, LI may produce spurious symbolic structures, making SE traverse infeasible paths and undermining SE’s performance. repOK can only decide the feasibility of fully concrete structures, and thus previous work relied on manually crafted specifications designed to decide the (in)validity of partially symbolic inputs, to avoid producing spurious symbolic structures. However, these additional specifications require significant further effort from developers.
To deal with this issue, we introduce SymSolve, a test generation based approach that, given a partially symbolic structure and a repOK, automatically decides if the structure can be extended to a fully concrete one satisfying repOK. As opposed to previous approaches, SymSolve does not require additional specifications. It works by exploring feasible concretizations of partially symbolic structures in a bounded-exhaustive manner, until it finds a fully concrete structure satisfying repOK, or it exhausts the search space, deeming the corresponding partially symbolic structure spurious. SymSolve exploits sound pruning of the search space, combined with symmetry breaking (to discard structures isomorphic to previously explored ones), to efficiently explore very large search spaces.
We incorporate SymSolve into LI in order to decide the feasibility of partially symbolic inputs, obtaining our LISSA technique. We experimentally assess LISSA against related techniques over various case studies, consisting of programs with heap-allocated inputs with complex constraints. The results show that LISSA is faster and scales better than related techniques.
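A minimal example of the kind of repOK routine discussed above (a generic illustration, not one of the paper's case studies): for a singly linked list, repOK operationally checks structural constraints such as acyclicity and a consistent size field, and it can only be evaluated on a fully concrete heap structure.

```java
import java.util.HashSet;
import java.util.Set;

public class SinglyLinkedList {
    static class Node {
        int value;
        Node next;
    }

    Node head;
    int size;

    // Class invariant: the list is acyclic and 'size' matches the number
    // of reachable nodes. This check requires a fully concrete structure,
    // which is why partially symbolic heaps need extra machinery.
    public boolean repOK() {
        Set<Node> visited = new HashSet<>();
        Node current = head;
        while (current != null) {
            if (!visited.add(current)) {
                return false; // cycle detected
            }
            current = current.next;
        }
        return visited.size() == size;
    }

    public static void main(String[] args) {
        SinglyLinkedList list = new SinglyLinkedList();
        list.head = new Node();
        list.size = 1;
        System.out.println(list.repOK()); // true
    }
}
```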
NIER Track
Thu 13 Oct 2022 13:50 - 14:00 at Banquet A - Technical Session 27 - Dynamic and Concolic Analysis Chair(s): ThanhVu Nguyen, George Mason University
Analysis of data is the foundation of multiple scientific disciplines, manifesting nowadays in complex and diverse scientific workflows that often involve exploratory analyses. Such analyses represent a particular case for traditional data engineering workflows, as results may be hard to interpret and to judge as correct or incorrect, and experimentation is a central theme. Input data, or assumptions made about it, may be incorrect and may need to be refined, an analogous problem to fault localization in software engineering. Typical techniques assume that a fault is identified, usually by an oracle in the form of a test. The workflows we target, however, are usually explorative, which makes it hard, if not impossible, to define tests specifying correct behaviour, while spotting irregularities is highly desired. To this end, we advocate data input reduction such that a specified outcome is preserved, aiding debugging. In our proposal, reductions are used as debug hypotheses for data. We outline our bold vision of building engineering support for outcome-preserving input reduction within data analysis workflows, and report on preliminary results.
Research Papers
Thu 13 Oct 2022 14:00 - 14:20 at Banquet A - Technical Session 27 - Dynamic and Concolic Analysis Chair(s): ThanhVu Nguyen George Mason University
Concolic execution is a dynamic twist of symbolic execution designed with scalability in mind. Recent concolic executors heavily rely on program instrumentation to achieve such scalability. The instrumentation code can be added at compilation time, e.g., using an LLVM pass, or directly at execution time with the help of a dynamic binary translator. With the former strategy, the resulting code is more efficient but requires recompilation. With the latter strategy, the instrumentation code is typically less efficient but does not require recompilation. Unfortunately, recompiling the entire code of a program is not always easy or practical for a user, e.g., in the presence of third-party components, such as system libraries. At the same time, efficiency may be crucial for the tool’s effectiveness.
In this paper, we investigate a different design for a concolic executor, called SymFusion, which is based on hybrid instrumentation. In particular, our approach allows the user to recompile the core components of an application, thus minimizing the analysis overhead on them, while still being able to dynamically instrument the rest of the application components at execution time. Our experimental evaluation shows that our design achieves a good balance between efficiency and effectiveness on several real-world applications.
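For readers unfamiliar with concolic instrumentation, the following sketch (class names and the string-based expression encoding are illustrative; a real executor such as SymFusion builds SMT expressions and handles far more cases) shows the shadow state that instrumentation, whether inserted at compile time or at execution time, must maintain:

```java
import java.util.List;

// Illustrative shadow value for concolic execution: every value carries both a
// concrete result and a symbolic expression describing how it was computed.
class SymInt {
    final int concrete;
    final String symbolic; // a real tool would use an SMT expression, not a String

    SymInt(int concrete, String symbolic) {
        this.concrete = concrete;
        this.symbolic = symbolic;
    }

    SymInt add(SymInt other) {
        return new SymInt(this.concrete + other.concrete,
                          "(+ " + this.symbolic + " " + other.symbolic + ")");
    }

    // Branches record a path constraint; solving the negation of a recorded
    // constraint is how a concolic executor derives inputs for new paths.
    boolean isLessThan(SymInt other, List<String> pathConstraints) {
        boolean taken = this.concrete < other.concrete;
        pathConstraints.add(taken
            ? "(< " + this.symbolic + " " + other.symbolic + ")"
            : "(>= " + this.symbolic + " " + other.symbolic + ")");
        return taken;
    }
}
```

The cost of maintaining this shadow state on every operation is exactly why instrumenting hot, recompilable components statically while handling the rest dynamically can pay off.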
Research Papers
Thu 13 Oct 2022 14:20 - 14:40 at Banquet A - Technical Session 27 - Dynamic and Concolic Analysis Chair(s): ThanhVu Nguyen George Mason University
Software systems are becoming increasingly configurable. A paradigmatic example is the Linux kernel, which can be adjusted for a tremendous variety of hardware devices, from mobile phones to supercomputers, thanks to the thousands of configurable features it supports. Many relevant problems on configurable systems, such as completing a partial configuration to get the system instance that consumes the least energy or optimizes any other quality attribute, could be solved through exhaustive analysis of all configurations. However, configuration spaces are typically colossal and cannot be entirely computed in practice. Alternatively, configuration samples can be analyzed to approximate the answers. Generating those samples is not trivial since features usually have inter-dependencies that constrain the configuration space. Therefore, getting a single valid configuration by chance is extremely unlikely. As a result, advanced samplers are being proposed to generate random samples at a reasonable computational cost. However, to date, no sampler can deal with highly configurable complex systems, such as the Linux kernel. This paper proposes a new sampler that does scale to such systems, based on an original theoretical approach called extensible logic groups.
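To illustrate why naive sampling fails (this sketch is not the paper's extensible-logic-groups approach), consider rejection sampling over a feature model: random assignments are drawn and discarded until one satisfies the constraints, which becomes hopeless when valid configurations are a vanishing fraction of the space.

```java
import java.util.Random;
import java.util.function.Predicate;

// Illustrative rejection sampler over a feature model: draw random feature
// assignments and keep only those satisfying the model's constraints. On real
// models (e.g., the Linux kernel) the hit rate is so low that this naive
// strategy is impractical, which is what motivates dedicated samplers.
class ConfigSampler {
    static boolean[] sample(int numFeatures, Predicate<boolean[]> constraints, Random rnd) {
        while (true) {
            boolean[] config = new boolean[numFeatures];
            for (int i = 0; i < numFeatures; i++) {
                config[i] = rnd.nextBoolean();
            }
            if (constraints.test(config)) {
                return config; // a valid configuration was found by chance
            }
        }
    }
}
```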
Journal-first Papers
Thu 13 Oct 2022 14:40 - 15:00 at Banquet A - Technical Session 27 - Dynamic and Concolic Analysis Chair(s): ThanhVu Nguyen George Mason University
Dynamic taint tracking, a technique that traces relationships between values as a program executes, has been used to support a variety of software engineering tasks. Some taint tracking systems only consider data flows and ignore control flows. As a result, relationships between some values are not reflected by the analysis. Many applications of taint tracking either benefit from or rely on these relationships being traced, but past works have found that tracking control flows resulted in over-tainting, dramatically reducing the precision of the taint tracking system. In this article, we introduce Conflux, alternative semantics for propagating taint tags along control flows. Conflux aims to reduce over-tainting by decreasing the scope of control flows and providing a heuristic for reducing loop-related over-tainting. We created a Java implementation of Conflux and performed a case study exploring the effect of Conflux on a concrete application of taint tracking, automated debugging. In addition to this case study, we evaluated Conflux’s accuracy using a novel benchmark consisting of popular, real-world programs. We compared Conflux against existing taint propagation policies, including a state-of-the-art approach for reducing control-flow-related over-tainting, finding that Conflux had the highest F1 score on 43 out of the 48 total tests.
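A small illustrative example (not from the article) of the control flows at stake: no assignment copies the tainted value, yet the result depends on it, so data-flow-only tracking misses the relationship, while control-flow propagation captures it at the risk of over-tainting.

```java
class ImplicitFlowExample {
    // No assignment copies 'secret' into 'leaked', yet 'leaked' reveals one bit
    // of it. A data-flow-only tracker reports no relationship; propagating taint
    // along the control flow captures it, but naive propagation over-taints,
    // which is the precision problem Conflux targets.
    static int leakOneBit(int secret /* tainted */) {
        int leaked;
        if ((secret & 1) == 1) {
            leaked = 1;   // assigned under a branch whose condition is tainted
        } else {
            leaked = 0;
        }
        return leaked;
    }
}
```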
Research Papers
Thu 13 Oct 2022 15:00 - 15:20 at Banquet A - Technical Session 27 - Dynamic and Concolic Analysis Chair(s): ThanhVu Nguyen George Mason University
Thread interleaving aggravates the difficulty of concurrent program verification since the number of traces to be explored grows rapidly as the scale of a concurrent program increases. Partial-Order Reduction (POR) techniques alleviate the trace-space explosion problem by partitioning the traces into equivalence classes. However, due to the coarse dependency approximation of transitions, a large number of redundant traces are still explored throughout the verification. In this paper, a symbolic approach, namely Prioritized Constraint-Aided Dynamic Partial-Order Reduction (PC-DPOR), is proposed to reduce the redundant traces. Specifically, a constrained dependency graph is presented to refine dependencies between transitions, and the exploration of isolated transitions in the graph is prioritized to reduce redundant equivalent traces. Further, we utilize the generated constraints to dynamically detect whether the enabled transitions at the given reachable states are dependent, and thereby overcome the inherent imprecision of the traditional dependence over-approximation. We have implemented the proposed approach as an extension of CPAchecker by utilizing BDDs as the representation of state sets. Experimental results show that our approach can effectively reduce the time and memory consumption for verifying concurrent programs. In particular, the number of explored states is reduced to 8.62% on average.
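An illustrative sketch (not from the paper) of the independence notion POR exploits: the two transitions below commute because they touch disjoint variables, so both interleavings reach the same final state and only one of them needs to be explored; if both threads wrote to the same variable, the transitions would be dependent and both orders would matter.

```java
// Illustrative only: independent transitions under partial-order reduction.
class IndependentWrites {
    static int x = 0, y = 0;

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> x = 1); // writes only x
        Thread t2 = new Thread(() -> y = 2); // writes only y
        t1.start(); t2.start();
        t1.join(); t2.join();
        // Both schedules produce the same state, so exploring one suffices.
        System.out.println(x + " " + y); // always "1 2"
    }
}
```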
Research Papers
Thu 13 Oct 2022 16:00 - 16:20 at Banquet A - Technical Session 31 - Code Similarities and Refactoring Chair(s): Hua Ming Oakland University
An Object-Relational Mapping (ORM) provides an object-oriented interface to a database and facilitates the development of database-backed applications. In an ORM, programmers do not need to write queries in a separate query language such as SQL; instead, they write ordinary method calls that are mapped by the ORM to database queries. This added layer of abstraction hides the significant performance cost of database operations, and misuse of ORMs can lead to far more queries being generated than necessary. Of particular concern is the infamous “N+1 problem”, where an initial query yields N results that are used to issue N subsequent queries. This anti-pattern is prevalent in applications that use ORMs, as it is natural to iterate over collections in object-oriented languages. However, iterating over data that originates from a database and calling an ORM method in each iteration may result in suboptimal performance. In such cases, it is often possible to reduce the number of round-trips to the database by issuing a single, larger query that fetches all desired results at once.
We propose an approach for automatically refactoring applications that use ORMs to eliminate instances of the “N+1 problem”; it relies on static analysis to detect data flow between ORM API calls. We implement this approach in a tool called Reformulator, targeting the Sequelize ORM in JavaScript, and evaluate it on 8 JavaScript projects. We found 44 N+1 query pairs in these projects, and Reformulator refactored all of them successfully, resulting in improved performance (up to 7.67x) while preserving program behavior. Further experiments demonstrate that the relative performance improvements grew as the database size was increased (up to 38.58x), and that front-end page load times were improved.
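A hedged sketch of the anti-pattern and its refactoring: the paper's tool targets the Sequelize ORM in JavaScript, whereas this Java version uses a hypothetical DAO interface purely to show the shape of the transformation.

```java
import java.util.List;

// Hypothetical data-access interface, defined here only for illustration.
interface OrderDao {
    List<Long> findAllCustomerIds();                      // 1 query
    List<String> findOrdersByCustomer(long customerId);   // 1 query per call
    List<String> findOrdersForCustomers(List<Long> ids);  // 1 batched query
}

class ReportService {
    private final OrderDao dao;

    ReportService(OrderDao dao) {
        this.dao = dao;
    }

    // N+1: the initial query yields N customers, and each iteration issues
    // another query, for N+1 round trips in total.
    int countOrdersNPlusOne() {
        int total = 0;
        for (long id : dao.findAllCustomerIds()) {
            total += dao.findOrdersByCustomer(id).size();
        }
        return total;
    }

    // Refactored: a single batched query fetches all orders in one round trip.
    int countOrdersBatched() {
        List<Long> ids = dao.findAllCustomerIds();
        return dao.findOrdersForCustomers(ids).size();
    }
}
```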
Journal-first Papers
Thu 13 Oct 2022 16:20 - 16:40 at Banquet A - Technical Session 31 - Code Similarities and Refactoring Chair(s): Hua Ming Oakland University
Refactoring aims at improving the maintainability of source code without modifying its external behavior. Previous work has proposed approaches to recommend refactoring solutions to software developers. The generation of the recommended solutions is guided by metrics acting as a proxy for maintainability (e.g., the number of code smells removed by the recommended solution). These approaches ignore the impact of the recommended refactorings on other non-functional requirements, such as performance, energy consumption, and so forth. Little is known about the impact of refactoring operations on non-functional requirements other than maintainability. We aim to fill this gap by presenting the largest study to date to investigate the impact of refactoring on software performance, in terms of execution time. We mined the change history of 20 systems that defined performance benchmarks in their repositories, with the goal of identifying commits in which developers implemented refactoring operations impacting code components that are exercised by the performance benchmarks. Through a quantitative and qualitative analysis, we show that refactoring operations can significantly impact the execution time. Indeed, none of the investigated refactoring types can be considered “safe” in ensuring no performance regression. Refactoring types aimed at decomposing complex code entities (e.g., Extract Class/Interface, Extract Method) have higher chances of triggering performance degradation, suggesting that they should be applied with care when refactoring performance-critical code.
Research Papers
Thu 13 Oct 2022 16:40 - 17:00 at Banquet A - Technical Session 31 - Code Similarities and Refactoring Chair(s): Hua Ming Oakland University
We propose a method for synthesizing invariants that can help verify relational properties over two programs or two different executions of a program. Applications of such invariants include verifying functional equivalence, non-interference security, and continuity properties. Our method generates invariant candidates using syntax-guided synthesis (SyGuS) and then filters them using an SMT-solver-based verifier, to ensure they are both inductive invariants and sufficient for verifying the property at hand. To improve performance, we propose two learning-based techniques: a logical reasoning (LR) technique to determine which part of the search space can be pruned away, and a reinforcement learning (RL) technique to determine which part of the search space to prioritize. Our experiments on a diverse set of relational verification benchmarks show that our learning-based techniques can drastically reduce the search space and, as a result, allow our method to generate invariants of higher quality in much less time than state-of-the-art invariant synthesis techniques.
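As a concrete, hand-constructed instance of relational verification (not an example from the paper): proving the two methods below functionally equivalent requires an invariant over both loops at once. When the loops are executed in lockstep, one sufficient relational invariant is i1 + i2 == n + 1 together with sum2 == sum1 + (i1 - 1) * i2; at exit (i1 = n + 1, i2 = 0) it yields sum1 == sum2. This is the kind of invariant the SyGuS-based method would be asked to synthesize automatically.

```java
// Two syntactically different but functionally equivalent summations.
class SumVariants {
    // Sums 1..n counting upwards.
    static int sumUp(int n) {
        int sum1 = 0;
        for (int i1 = 1; i1 <= n; i1++) {
            sum1 += i1;
        }
        return sum1;
    }

    // Sums 1..n counting downwards.
    static int sumDown(int n) {
        int sum2 = 0;
        for (int i2 = n; i2 >= 1; i2--) {
            sum2 += i2;
        }
        return sum2;
    }
}
```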
Tool Demonstrations
Wed 12 Oct 2022 09:30 - 10:00 at Ballroom A - Tool Poster Session 2
We developed a plugin for IntelliJ IDEA called AntiCopyPaster, which tracks the pasting of code fragments inside the IDE and suggests the appropriate Extract Method refactoring to combat the propagation of duplicates. Unlike existing approaches, our tool is integrated into the developer’s workflow and proactively recommends refactorings. Since not all code fragments need to be extracted, we developed a classification model to make this decision. When a developer copies and pastes a code fragment, the plugin searches for duplicates in the currently opened file, waits for a short period of time to allow the developer to edit the code, and finally infers the refactoring decision based on a number of features.
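A minimal sketch of the described workflow (the plugin's actual IntelliJ integration, feature extraction, and trained model are not shown; all names and the delay value here are hypothetical):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

// Illustrative paste-monitoring workflow: detect duplicates of a pasted
// fragment, wait so the developer can adapt the code, then ask a classifier
// whether an Extract Method refactoring should be recommended.
class PasteMonitor {
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    private final Predicate<String> shouldExtract; // stands in for the trained model

    PasteMonitor(Predicate<String> shouldExtract) {
        this.shouldExtract = shouldExtract;
    }

    void onPaste(String fragment, String currentFileText) {
        long duplicates = countOccurrences(currentFileText, fragment);
        if (duplicates < 2) {
            return; // no duplicate propagation, nothing to recommend
        }
        // Delay lets the developer edit the pasted code before deciding.
        scheduler.schedule(() -> {
            if (shouldExtract.test(fragment)) {
                System.out.println("Suggest Extract Method for the pasted fragment");
            }
        }, 30, TimeUnit.SECONDS);
    }

    private static long countOccurrences(String text, String fragment) {
        if (fragment.isEmpty()) {
            return 0;
        }
        long count = 0;
        for (int i = text.indexOf(fragment); i >= 0; i = text.indexOf(fragment, i + 1)) {
            count++;
        }
        return count;
    }
}
```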
Our experimental study on a large dataset of 18,942 code fragments mined from 13 Apache projects shows that AntiCopyPaster correctly recommends Extract Method refactorings with an F-score of 0.82. Furthermore, our survey of 59 developers shows that they are satisfied with the plugin’s operation. The plugin and its source code are publicly available on GitHub at https://github.com/JetBrains-Research/anti-copy-paster. The demonstration video can be found on YouTube: https://www.youtube.com/watch?v=_wwHg-qFjJY.
Research Papers
Thu 13 Oct 2022 17:10 - 17:30 at Banquet A - Technical Session 31 - Code Similarities and Refactoring Chair(s): Hua Ming Oakland University
Code clone detection is an important research problem that has attracted wide attention in software engineering. Many methods have been proposed for detecting code clones, among which text-based and token-based approaches are scalable but lack consideration of code semantics, and thus cannot detect semantic code clones. Methods based on intermediate representations of code can solve the problem of semantic code clone detection. However, graph-based methods are impractical because they require code compilation, and existing tree-based approaches are limited by the size of the trees they must match, which hinders scalable code clone detection.
In this paper, we propose TreeCen, a scalable tree-based code clone detector that achieves scalability while detecting semantic clones effectively. Given the source code of a method, we first extract its abstract syntax tree (AST) based on static analysis and transform it into a simple graph representation (i.e., a tree graph) according to the node types, rather than using traditional heavyweight tree matching. We then treat the tree graph as a social network and apply centrality analysis to each node to preserve the tree's details. In this way, the original complex tree can be converted into a 72-dimensional vector that still contains comprehensive structural information about the AST. Finally, these vectors are fed into a machine learning model to train a detector, which is then used to find code clones. We conduct comparative evaluations of effectiveness and scalability. The experimental results show that TreeCen outperforms six state-of-the-art methods (i.e., SourcererCC, RtvNN, DeepSim, SCDetector, Deckard, and ASTNN), achieving F1 scores of 0.99 and 0.95 on the BigCloneBench and Google Code Jam datasets, respectively. In terms of scalability, TreeCen is about 79 times faster than another state-of-the-art tree-based semantic code clone detector (i.e., ASTNN).
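A simplified sketch of centrality-based tree encoding (degree centrality over a toy tree, aggregated per node type): this is an assumption-laden illustration of the general idea, not TreeCen's exact pipeline or its 72-dimensional feature definition.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative centrality-based encoding of an AST-like tree into a fixed-length
// vector: compute degree centrality per node and aggregate it by node type.
class TreeEncoder {
    static class Node {
        final int type;                        // index into a fixed node-type vocabulary
        final List<Node> children = new ArrayList<>();
        Node(int type) { this.type = type; }
    }

    static double[] encode(Node root, int numTypes) {
        List<Node> nodes = new ArrayList<>();
        Map<Node, Integer> degree = new HashMap<>();
        collect(root, null, nodes, degree);

        double[] vector = new double[numTypes];    // TreeCen uses 72 dimensions
        int n = nodes.size();
        for (Node node : nodes) {
            double centrality = n > 1 ? degree.get(node) / (double) (n - 1) : 0.0;
            vector[node.type] += centrality;        // aggregate centrality per node type
        }
        return vector;                              // input to a downstream classifier
    }

    private static void collect(Node node, Node parent,
                                List<Node> nodes, Map<Node, Integer> degree) {
        nodes.add(node);
        degree.put(node, node.children.size() + (parent == null ? 0 : 1));
        for (Node child : node.children) {
            collect(child, node, nodes, degree);
        }
    }
}
```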
Tool Demonstrations
Tue 11 Oct 2022 10:00 - 10:30 at Ballroom A - Tool Poster Session 1
We present Trimmer, a state-of-the-art tool for code size reduction. Trimmer reduces code size by specializing a program for constant inputs provided by developers. These constant inputs can be provided as command-line options or configuration files, and are used to specify features that must be retained, which in turn identifies features that are unused in a specific deployment and can be removed. Trimmer includes sophisticated compiler transformations for input specialization, supports precise yet efficient context-sensitive inter-procedural constant propagation, and introduces a custom loop unroller. Trimmer is easy to use and highly configurable. In this paper, we discuss Trimmer’s configurable knobs that allow developers to explicitly control analysis precision versus analysis time. We also discuss the high-level implementation of Trimmer’s static analysis passes. The source code of Trimmer is publicly available at https://github.com/ashish-gehani/Trimmer. The video demonstration is available at https://www.youtube.com/watch?v=6pAuJ68INnI.
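Trimmer itself operates on LLVM IR for C/C++ programs; the Java snippet below only illustrates, under that caveat and with invented names, what specializing for a constant input achieves: once the flag is fixed to false for a deployment, constant propagation makes the branch dead and the unused feature's code can be removed.

```java
// Before specialization: the program supports an optional feature selected
// by a runtime input.
class BeforeSpecialization {
    static byte[] process(byte[] data, boolean compress) {
        if (compress) {
            return compressFeature(data);   // dead code once compress == false is known
        }
        return data;
    }

    static byte[] compressFeature(byte[] data) {
        return data; // stub standing in for the feature's implementation
    }
}

// After specialization for compress == false: the branch, the parameter,
// and the unused feature are gone, shrinking the program.
class AfterSpecialization {
    static byte[] process(byte[] data) {
        return data;
    }
}
```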
Research Papers
Thu 13 Oct 2022 17:40 - 18:00 at Banquet A - Technical Session 31 - Code Similarities and Refactoring Chair(s): Hua Ming Oakland University
Existing approaches for program debloating often use a usage profile, typically provided as a set of inputs, for identifying the features of a program to be preserved. Specifically, given a program and a set of inputs, these techniques produce a reduced program that behaves correctly for these inputs. Focusing only on reduction, however, would typically result in programs that are overfitted to the inputs used for debloating. For this reason, another important factor to consider in the context of debloating is generality, which measures the extent to which a debloated program behaves correctly also for inputs that were not in the initial usage profile. Unfortunately, most evaluations of existing debloating approaches only consider reduction, thus providing partial information on the effectiveness of these approaches. To address this limitation, we perform an empirical evaluation of the reduction and generality of 4 debloating techniques (3 state-of-the-art techniques and a baseline) on a set of 25 programs and different sets of inputs for these programs. Our results show that these approaches can indeed produce programs that are overfitted to the inputs used and have low generality. Based on these results, we also propose two new augmentation approaches and evaluate their effectiveness. The results of this additional evaluation show that these two approaches can help improve program generality without significantly affecting size reduction. Finally, because different approaches have different strengths and weaknesses, we also provide guidelines to help users choose the most suitable approach based on their specific needs and context.
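As a hedged formalization of the two measures discussed (paraphrased from the abstract, not the paper's exact definitions), reduction can be read as the fraction of size removed, and generality as the fraction of held-out inputs, outside the usage profile, on which the debloated program still behaves correctly:

```java
// Illustrative metrics only; the paper's precise definitions may differ.
class DebloatMetrics {
    // Fraction of the original size that was removed by debloating.
    static double reduction(long originalSize, long debloatedSize) {
        return 1.0 - (double) debloatedSize / originalSize;
    }

    // Fraction of held-out inputs (not in the usage profile) on which the
    // debloated program still behaves correctly.
    static double generality(int correctOnHeldOut, int heldOutInputs) {
        return (double) correctOnHeldOut / heldOutInputs;
    }
}
```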