Plenary
Tue 16 Nov 2021 09:00 - 10:00 at Kangaroo - MIP Talk 1 Chair(s): Myra Cohen Iowa State University
no description available
Social/Networking
Tue 16 Nov 2021 10:00 - 11:00 at Kangaroo - Virtual Reception Chair(s): Mattia Fazzini University of Minnesota
no description available
Research Papers
Tue 16 Nov 2021 11:00 - 11:20 at Kangaroo - Automation Chair(s): Eunsuk Kang Carnegie Mellon University
Smart contracts have attracted much attention and are crucial for automating financial and business transactions. As Turing-complete programs, they are first compiled into bytecode and then executed on the Blockchain platform. End-users who have never seen the source code can read the user notice shown in the end-user client to understand what a transaction of a smart contract function does. However, due to time constraints or lack of motivation, user notices are often missing during the development of smart contracts. For end-users who lack this information, there is no easy way to check the code semantics of the smart contracts. Thus, in this paper, we propose a new approach, SMARTDOC, to generate user notices for smart contract functions automatically. Our tool can help end-users better understand smart contracts and become aware of financial risks, improving users' confidence in the reliability of the smart contracts. SMARTDOC exploits the Transformer to learn a representation of source code and generates natural language descriptions from the learned representation. We also integrate the Pointer mechanism to copy words from the input source code instead of generating them during the prediction process. We extract 7,878 ⟨function, notice⟩ pairs from 54,739 smart contracts written in Solidity. Due to the limited amount of collected smart contract functions (i.e., 7,878 functions), we exploit a transfer learning technique that leverages learned knowledge to improve the performance of SMARTDOC. The knowledge is obtained by pre-training on a corpus of Java code, which has similar characteristics to Solidity code. The experimental results show that our approach can effectively generate user notices given the source code and significantly outperforms the state-of-the-art approaches. To investigate human perspectives on our generated user notices, we also conduct a human evaluation and ask participants to score user notices generated by different approaches. Results show that SMARTDOC outperforms the baselines in three aspects: naturalness, informativeness, and similarity.
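The Pointer mechanism mentioned above is commonly formulated as a mixture of generating from the vocabulary and copying from the input. As a minimal sketch (this is the standard pointer-generator formulation; SMARTDOC's exact variant may differ), the probability of emitting word w is

    P(w) = p_{\mathrm{gen}} \, P_{\mathrm{vocab}}(w) + (1 - p_{\mathrm{gen}}) \sum_{i \,:\, x_i = w} a_i

where the a_i are attention weights over the input code tokens x_i and p_gen is a learned switch between generating and copying; copying lets the model reproduce identifiers from the Solidity source verbatim.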
Research Papers
Tue 16 Nov 2021 11:20 - 11:40 at Kangaroo - Automation Chair(s): Eunsuk Kang Carnegie Mellon University
(Experience Paper.) We developed a set of tools designed to provide rapid feedback to students as they learn to write programs in assembly language (LC-3, a RISC-like educational instruction set architecture). At the heart of the system is an extended version of KLEE, KLC3, that enables us both to identify issues and to perform equivalence checking between student code and a gold (correct) version of each assignment. Feedback begins when students edit their code using a VSCode extension that leverages static analysis to perform a variety of correctness and style checks, encouraging students to improve their code quality. Each time a student commits code to their Git repository, our system is triggered. Using KLC3 (KLEE), the student code is executed along with the gold version, and issues and behavioral differences are delivered back to students through their Git repository as a human-readable report, test cases, and scripts. A queueing system allows students to monitor progress, but the response is generally available within minutes. We also extended the LC-3 simulation tools to support reverse debugging, making the process of finding complex bugs much more tractable for students, and used Emscripten to develop a browser-based interface for use in testing and debugging. Finally, our tools maintain an individual regression test suite for each student and require a submission to pass all previous tests before re-evaluation in KLC3, thus avoiding encouraging programming-by-guesswork. We deployed the system to provide feedback for the assembly programming assignments in a class of over 100 students in Fall 2020. Students wrote a median of around 700 lines of assembly for these assignments, making heavy use of our tools to understand and eliminate their bugs. Anonymous student feedback on the tools was uniformly positive. Since that semester, we have continued to refine and expand our tools' analysis capabilities and performance, and plan to deploy the system again in the near future (the class is offered every Fall).
Journal-first Papers
Tue 16 Nov 2021 11:40 - 11:50 at Kangaroo - Automation Chair(s): Eunsuk Kang Carnegie Mellon University
Context: Web applications evolve frequently to incorporate new functionality and adopt recent trends, but even minor adjustments can result in major changes to the web page's underlying structure. Capture-Replay tools are widely used for the automated testing of web applications. The scripts written using these Capture-Replay tools are strongly coupled with the web elements of web applications. Even simple changes, such as slight modifications to a web page layout, may break the existing test scripts, because the web elements that the scripts reference may not be valid in the new version. Objective: In this work, we propose a model-based automated approach to repair the broken test scripts of Capture-Replay testing tools. This approach covers the various types of changes to web elements that may result in the breakage of test scripts. No other existing work provides such a comprehensive test repair strategy, and existing approaches are testing-framework dependent. Methodology: Our model-based web test repair approach is independent of the underlying Capture-Replay tool. To provide a tool-independent methodology, we develop a UML profile that allows the capture of various concepts related to Capture-Replay test scripts for web applications. For this purpose, we extend the UML Testing Profile (UTP) and add important concepts relevant to capture-and-replay test scripts of web applications. The approach uses a DOM-based strategy that automatically detects the DOM-level differences between two versions of the web application. Based on the identified differences, the test suite is classified into reusable, obsolete, and retestable test scripts, and the broken scripts are then fixed by applying repair heuristics. We also develop an open-source tool that automates the proposed test repair approach. We evaluate the approach through a series of four evaluations on one industrial and six open-source case studies, assessing its effectiveness in terms of repairing broken test scripts, coverage of DOM elements, and fault-finding capability. We also evaluate the usefulness of the repaired test scripts according to the opinion of professional testers, and compare our approach with the only other available DOM-based test repair approach, WATER. Results: The results of the empirical evaluation indicate that the proposed approach effectively repairs 91% of broken web test scripts and achieves DOM coverage similar to that of the original test suite on the evolved versions of the subject applications. A team of professional testers found the suggested repairs, for different types of test breakages, useful for the regression testing of evolving web applications. Furthermore, the DOM-based fault-finding capability of the repaired test suite is equivalent to that of the original test suite. Our empirical evaluation on 528 Selenium WebDriver test scripts of seven web applications shows that the proposed approach can effectively repair 83% of the overall breakages, whereas the existing technique WATER repairs 58% of test breakages, covering only attribute-based locators and broken assertion values. Conclusion: We have proposed a scalable and automated strategy to migrate existing automated test scripts (e.g., Capture-Replay) towards the new and evolved version of the web application.
Our approach repairs test scripts that may be broken due to the breakages (e.g., broken locators, missing web elements) reported in the existing test breakage taxonomy. It is based on a DOM-based strategy and is independent of the underlying Capture-Replay tool. We developed a tool to demonstrate the applicability of the approach and performed an empirical study on seven subject applications. The results show that the approach successfully repairs broken test scripts while maintaining the same DOM coverage and fault-finding capability.
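As a concrete illustration of the repair-heuristic idea described above (a minimal sketch with hypothetical helper names, not the paper's actual tool), a broken attribute-based locator can often be re-targeted by fuzzy-matching the old element's attributes against candidates in the evolved DOM:

    # Illustrative sketch: re-target a broken attribute-based locator by
    # matching the old element's attributes against the evolved DOM.
    from difflib import SequenceMatcher

    def similarity(old_attrs, new_attrs):
        """Average fuzzy-match ratio over the union of attribute keys."""
        keys = set(old_attrs) | set(new_attrs)
        if not keys:
            return 0.0
        return sum(
            SequenceMatcher(None, old_attrs.get(k, ""), new_attrs.get(k, "")).ratio()
            for k in keys
        ) / len(keys)

    def repair_locator(old_attrs, evolved_elements, threshold=0.5):
        """Return the evolved element (an attribute dict) most similar to the
        one the test script referenced, or None if nothing is close enough;
        the threshold is illustrative."""
        best = max(evolved_elements, key=lambda e: similarity(old_attrs, e))
        return best if similarity(old_attrs, best) >= threshold else None

Heuristics of this family trade off attribute similarity, tag names, and DOM position; the sketch keeps only the attribute part for brevity.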
Tool Demonstrations
Tue 16 Nov 2021 11:50 - 11:55 at Kangaroo - Automation Chair(s): Eunsuk Kang Carnegie Mellon University
This paper describes BeAFix, a tool for automated repair of faulty Alloy models. The tool builds upon the Alloy Analyzer, the analysis tool for Alloy. It generates repair candidates by mutating a faulty Alloy model and employs a bounded-exhaustive approach to traverse the space of repair candidates. Since BeAFix's mutation operators make the space of repair candidates grow quickly, the tool supports sound pruning techniques that allow it to fix Alloy models with more than one faulty line or expression. Additionally, BeAFix does not require tests as a patch acceptance criterion. Although BeAFix supports tests as oracles, our tool is also able to leverage property-based oracles, which are more commonly found in Alloy models in the form of predicate satisfiability and assertion validity checks. A video demonstration of BeAFix can be found at https://youtu.be/5RG40SmlFXQ. The tool's binaries and further details about its usage can be found at http://sites.google.com/view/beafixevaluation.
Tool Demonstrations
Tue 16 Nov 2021 11:55 - 12:00 at Kangaroo - Automation Chair(s): Eunsuk Kang Carnegie Mellon University
Fault localization can help developers identify buggy statements or expressions in programs. Existing fault localization techniques are often designed for imperative programs (e.g., C and Java) and rely on tests to compare correct and incorrect execution traces to identify suspicious statements. In this demo paper, we present FLACK, a tool to automatically locate faults for models written in Alloy, a declarative language where models are not executed but instead converted into a logical formula and solved using a SAT solver. FLACK takes as input an Alloy model that violates some assertion and returns a ranked list of suspicious expressions contributing to the violation. The key idea is to analyze the differences between counterexamples, i.e., instances of the model that do not satisfy the assertion, and instances that do satisfy the assertion, to find suspicious expressions in the input model. An experiment with 157 Alloy models with various bugs shows the efficiency and accuracy of FLACK in localizing the causes of these bugs. FLACK and its evaluation benchmark and results can be downloaded from https://github.com/guolong-zheng/flack. The video demonstration is available at https://youtu.be/FKa2ohqIUms.
Research Papers
Tue 16 Nov 2021 12:00 - 12:20 at Kangaroo - Programming Chair(s): Amiangshu Bosu Wayne State University
Deep learning has been widely adopted in industry and has achieved great success in a wide range of application areas. Bugs in deep learning programs can cause catastrophic failures, in addition to a serious waste of resources and time.
This paper aims at detecting industrial TensorFlow program bugs. We report an extensive empirical study on 12,289 failed TensorFlow jobs, showing that existing static tools can effectively detect 72.55% of the top three types of Python bugs in industrial TensorFlow programs. In addition, we propose (for the first time) a constraint-based approach for detecting TensorFlow shape-related errors (one of the most common TensorFlow-specific bugs), together with an associated tool, ShapeTracer. Our evaluation on a set of 60 industrial TensorFlow programs shows that ShapeTracer is efficient and effective: it analyzes each program in at most 3 seconds and effectively detects 40 out of 60 industrial TensorFlow program bugs, with no false positives. ShapeTracer has been deployed in an industrial platform (anonymized for double-blind review) and will be released soon.
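To make the constraint-based shape checking concrete, here is an illustrative sketch (not ShapeTracer itself) of how propagating shapes through a single operation exposes a mismatch, with None standing for an unknown (symbolic) dimension:

    def matmul_shape(a, b):
        """a and b are shape tuples; None marks a symbolic dimension."""
        (m, k1), (k2, n) = a, b
        if k1 is not None and k2 is not None and k1 != k2:
            raise ValueError(f"shape error: inner dims {k1} and {k2} disagree")
        return (m, n)

    matmul_shape((32, 128), (128, 10))  # ok: returns (32, 10)
    matmul_shape((32, 128), (64, 10))   # raises: inner dims 128 and 64 disagree

A constraint-based analysis generalizes this idea by collecting equations over symbolic dimensions across the whole dataflow graph and reporting inputs for which they are unsatisfiable.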
Research Papers
Tue 16 Nov 2021 12:20 - 12:40 at Kangaroo - Programming Chair(s): Amiangshu Bosu Wayne State University
Java 8 introduced lambda expressions, a core feature of functional programming. Since their introduction, there has been an increasing trend of lambda adoption in Java projects. Developers often adopt lambda expressions to simplify code, avoid code duplication, or simulate other functional features. However, we observe that lambda expressions can incur different types of side effects (e.g., performance issues and memory leakages) or even severe bugs, and developers also frequently remove lambda expressions from their implementations. Yet there is no systematic study to characterize and understand such inappropriate usages of lambda expressions in Java. Without such knowledge, it is hard to guarantee the correct and appropriate usage of lambda expressions for Java developers. Consequently, the advantages of utilizing lambda expressions can be significantly compromised by the collateral side effects. To bridge this gap, in this study, we present the first large-scale, quantitative and qualitative empirical study to characterize and understand inappropriate usages of lambda expressions. Particularly, we first extracted 3,662 instances of lambda expressions that were removed by developers due to inappropriate usage from 103 large-scale open-source projects, and compared their characteristics with those of 31,228 correct usages of lambdas. For instance, we observe that lambdas using customized functional interfaces are more likely to be removed by developers. To obtain a comprehensive understanding, we further collected over 100 real issues caused by abusing lambdas. We manually analyzed the reasons, impacts, and migration patterns of those lambdas, and summarized 7 main reasons for the removal of lambdas. For example, performance degradation, poor readability, difficulty of debugging, and eliminating lazy evaluation are the most common reasons. Our study also reveals strong associations between the reasons to remove lambdas and the associated migration patterns. Moreover, from a complementary perspective, we performed a user study with 30 developers to seek the underlying reasons why they remove lambda expressions in practice, and confirmed 8 of the 9 reasons we summarized. Finally, based on our empirical results, we developed suggestions on scenarios where Java developers should avoid lambda usage and also pointed out future directions for researchers.
New Ideas and Emerging Results (NIER) track
Tue 16 Nov 2021 12:40 - 12:50 at Kangaroo - Programming Chair(s): Amiangshu Bosu Wayne State University
Efficient representation of source code is essential for various software engineering tasks such as code search and code clone detection. One such technique for representing source code involves extracting paths from the AST and using a learning model to capture program properties. Code2vec is a commonly used path-based approach that uses an attention-based neural network to learn code embeddings, which can then be used for various software engineering tasks. However, this approach uses only ASTs and does not leverage other graph structures such as Control Flow Graphs (CFG) and Program Dependency Graphs (PDG). Similarly, most recent approaches for representing source code still use the AST and do not leverage semantic graph structures. Even though there exists an integrated graph approach (Code Property Graph) for representing source code, it has only been explored in the domain of software security; moreover, it does not leverage the paths from the individual graphs. In our work, we extend the path-based approach code2vec to include the semantic graphs CFG and PDG, along with the AST, which is still largely unexplored in the domain of software engineering. We evaluate our approach on the task of MethodNaming using a custom C dataset of 730K methods collected from 16 C projects from GitHub. In comparison to code2vec, our approach improves the F1 score by 11% on the full dataset and by up to 100% on individual projects. We show that semantic features from the CFG and PDG paths are indeed helpful. We envision that looking at a mocktail of source code representations for various software engineering tasks can lay the foundation for a new line of research and an overhaul of existing research.
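For readers unfamiliar with path-based representations, the following toy sketch (illustrative only; the paper applies the same idea to AST, CFG, and PDG paths) extracts code2vec-style leaf-to-leaf path contexts from a tree:

    import itertools

    def leaf_paths(tree, path=()):
        """Yield (leaf_label, root_to_leaf_path) pairs; tree = (label, children)."""
        label, children = tree
        if not children:
            yield label, path + (label,)
        for child in children:
            yield from leaf_paths(child, path + (label,))

    def leaf_to_leaf_contexts(tree):
        """For each leaf pair, build the up-path + shared ancestor + down-path."""
        leaves = list(leaf_paths(tree))
        for (a, pa), (b, pb) in itertools.combinations(leaves, 2):
            # length of the common prefix of the two root-to-leaf paths
            common = next((i for i, (x, y) in enumerate(zip(pa, pb)) if x != y),
                          min(len(pa), len(pb)))
            yield a, pa[common:][::-1] + (pa[common - 1],) + pb[common:], b

    ast = ("assign", [("var:x", []), ("add", [("var:y", []), ("lit:1", [])])])
    for left, context, right in leaf_to_leaf_contexts(ast):
        print(left, context, right)

Each (leaf, path, leaf) triple becomes one context fed to the attention-based encoder; the mocktail approach draws such paths from several graphs instead of the AST alone.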
Journal-first Papers
Tue 16 Nov 2021 12:50 - 13:00 at Kangaroo - Programming Chair(s): Amiangshu Bosu Wayne State University
This presentation proposal is for our paper titled "On Tracking Java Methods with Git Mechanisms", which was published in the Journal of Systems and Software (JSS). The original paper at the journal's web site is accessible at https://doi.org/10.1016/j.jss.2020.110571.
Research Papers
Tue 16 Nov 2021 18:00 - 18:20 at Kangaroo - Testing I Chair(s): Xiaoyin Wang University of Texas at San Antonio
Question Answering (QA) is an attractive and challenging area in the NLP community. Diverse algorithms have been proposed, and various benchmark datasets with different topics and task formats have been constructed. QA software is now also widely used in daily human life. However, current QA software is mainly tested in a reference-based paradigm, in which the expected outputs (labels) of test cases need to be annotated with much human effort before testing. As a result, neither just-in-time testing during usage nor extensible testing on massive unlabeled real-life data is feasible, which keeps the current testing of QA software from being flexible and sufficient. In this paper, we propose a method, QAAskeR, with three novel Metamorphic Relations for testing QA software. QAAskeR does not require annotated labels but tests QA software by checking its behavior on multiple recursively asked questions that are related to the same knowledge. Experimental results show that QAAskeR can reveal violations on over 80% of valid cases without using any pre-annotated labels. Diverse answering issues, especially limited generalization on question types across datasets, are revealed on a state-of-the-art QA algorithm.
Research Papers
Tue 16 Nov 2021 18:20 - 18:40 at Kangaroo - Testing I Chair(s): Xiaoyin Wang University of Texas at San Antonio
With the ever-increasing use of web APIs in modern-day applications, it is becoming more important to test the system as a whole. In the last decade, tools and approaches have been proposed to automate the creation of system-level test cases for these APIs using evolutionary algorithms (EAs). One of the limiting factors of EAs is that the genetic operators (crossover and mutation) are fully randomized, potentially breaking promising patterns in the sequences of API requests discovered during the search. Breaking these patterns has a negative impact on the effectiveness of the test case generation process. To address this limitation, this paper proposes a new approach that uses agglomerative hierarchical clustering (AHC) to infer a linkage tree model, which captures, replicates, and preserves these patterns in new test cases. We evaluate our approach, called LT-MOSA, by performing an empirical study on 7 real-world benchmark applications w.r.t. branch coverage and real-fault detection capability. We also compare LT-MOSA with the two existing state-of-the-art white-box techniques (MIO, MOSA) for REST API testing. Our results show that LT-MOSA achieves a statistically significant increase in test target coverage (i.e., lines and branches) compared to MIO and MOSA in 4 and 5 out of 7 applications, respectively. Furthermore, LT-MOSA discovers 27 and 18 unique real-faults that are left undetected by MIO and MOSA, respectively.
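The linkage-tree idea can be sketched with off-the-shelf hierarchical clustering (an illustrative toy, not LT-MOSA's implementation): genes that co-occur in good test cases end up in the same subtree and are then treated as a unit during crossover.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    # rows = test cases in the population, columns = "genes"
    # (API request slots); 1 marks that the test contains the request
    population = np.array([
        [1, 1, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 1, 1],
        [0, 1, 1, 1],
    ])

    # cluster the genes (columns), so transpose; Hamming distance on 0/1 profiles
    tree = linkage(population.T, method="average", metric="hamming")
    groups = fcluster(tree, t=0.5, criterion="distance")
    print(groups)  # e.g., [1 1 2 2]: genes in one group are exchanged together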
Industry Showcase
Tue 16 Nov 2021 18:40 - 18:50 at Kangaroo - Testing I Chair(s): Xiaoyin Wang University of Texas at San Antonio
We present our work on testing access control of a large national e-health Internet portal that has millions of monthly visits. Our aim is twofold: (1) to improve testing by applying a systematic and rigorous (semi-formal) approach, and (2) to obtain a holistic view of the portal's complex access control structure. Applying a more rigorous approach helps reduce ambiguity, while the holistic picture aids easier, and often also faster, comprehension of the complex control structure by avoiding the need to read a lot of textual specifications. We use a set-theoretic approach for specifying access control. From the access control's abstract set notation, we then derive a visualized version in the form of an access control tree. The access control tree presented in this paper has 15 leaves (scopes), which results in 105 pairs of abstract test scenarios. A more complete version of the tree has 66 leaves (scopes), which results in over 2000 pairs of abstract test scenarios (although not all of them are valid). From the abstract scenarios, we implemented over 600 concrete and automated test cases. Manual execution of one concrete test takes about five minutes, while automated execution of all tests takes about one hour (thus achieving over a 40-fold speedup). These automated test cases run as part of our CI/CD pipeline.
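The scenario counts quoted above are simply the number of unordered pairs of scopes:

    \binom{15}{2} = \frac{15 \cdot 14}{2} = 105, \qquad \binom{66}{2} = \frac{66 \cdot 65}{2} = 2145

which is why the fuller tree with 66 leaves yields "over 2000" pairs of abstract test scenarios.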
New Ideas and Emerging Results (NIER) track
Tue 16 Nov 2021 18:50 - 19:00 at Kangaroo - Testing I Chair(s): Xiaoyin Wang University of Texas at San Antonio
Part-of-Speech (POS) tagging for sentences is a basic and widely used Natural Language Processing (NLP) technique. People rely heavily on it to predict POS tags that serve as the basis for many advanced NLP tasks, such as sentiment analysis, word sense disambiguation, and information retrieval. However, POS tagging tools can make wrong predictions, which propagate errors to the advanced tasks and can even cause serious threats in critical application domains. In this paper, we propose to test POS tagging tools with Metamorphic Testing against properties that they should follow. A preliminary exploration with two groups of Metamorphic Relations shows that our method can effectively reveal defects of three common POS tagging tools (i.e., spaCy, NLTK, and Flair) in handling fairly simple intra- and inter-sentence transformations involving adverbial clauses and sentence appending. This demonstrates the great potential of our method to deliver systematic testing and reveal previously unnoticed issues, which may benefit the validation, repair, and improvement of POS tagging tools.
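One plausible instantiation of the sentence-appending relation (a minimal sketch; the paper's exact relations may differ) checks that appending an unrelated sentence does not change the tags of the original tokens, here with NLTK as the tool under test:

    # Requires: pip install nltk, plus the 'punkt' and
    # 'averaged_perceptron_tagger' data packages.
    import nltk

    def tags(text):
        return nltk.pos_tag(nltk.word_tokenize(text))

    source = "The quick brown fox jumps over the lazy dog."
    follow_up = source + " It was raining."

    original = tags(source)
    appended = tags(follow_up)[: len(original)]  # tags of the original tokens
    if original != appended:
        print("Metamorphic relation violated:", original, "vs", appended)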
Research Papers
Tue 16 Nov 2021 19:00 - 19:20 at Kangaroo - Code Chair(s): Michael Pradel University of Stuttgart
Existing studies show that code summaries help developers understand and maintain source code. Unfortunately, these summaries are often mismatched, missing, or outdated in software projects. Code summarization aims to automatically generate brief and accurate natural language descriptions for source code. According to Gros et al., code summaries are highly structured and have many repetitive patterns; for example, they often begin with patterns like "return true if…" and "create a new…". The promising results obtained by previous approaches also prove the existence of these patternized words. Besides the patternized words, a code summary also contains important keywords, which are the key to reflecting the functionality of the code. However, state-of-the-art code summarization approaches perform poorly at predicting the keywords, which causes the generated summaries to suffer a loss in informativeness. To alleviate this problem, this paper proposes a novel retrieve-and-edit approach named EditSum for code summarization. Specifically, EditSum first retrieves a similar code snippet from a pre-defined corpus and treats its summary as a prototype summary from which to learn the pattern. Then, EditSum edits the prototype automatically to combine the pattern in the prototype with the semantic information of the input code. Our motivation is that the retrieved prototype provides a good starting point for generation, because the summaries of similar code snippets often have the same pattern. The editing process further reuses the patternized words in the prototype and generates keywords based on the semantic information of the code. We conduct experiments on a large-scale Java corpus, which contains about 2M samples, and experimental results demonstrate that EditSum outperforms the state-of-the-art approaches by a substantial margin. A human evaluation also proves that the summaries generated by EditSum are more informative and useful. We further verify that EditSum performs well at predicting the patternized words and keywords. The code and data will be open-sourced.
Research Papers
Tue 16 Nov 2021 19:20 - 19:40 at Kangaroo - Code Chair(s): Michael Pradel University of Stuttgart
Program translation is necessary in many real-world scenarios, such as porting codebases from an obsolete or deprecated language to a modern one or re-implementing existing projects in one's preferred programming language. One way to automate program translation is to make use of Big Code. Existing data-driven approaches either train a translation model or leverage cross-language retrieval. The former requires large amounts of training data and extra information, or neglects significant characteristics of programs. The latter struggles to find the translation when only the features of the input program are available as the query. In this paper, we present BigPT for interactive cross-language retrieval from Big Code based only on raw code, reusing the retrieved code to assist program translation. We build on existing work on cross-language code representation and propose a novel predictive transformation model based on auto-encoders. The model is trained on Big Code to generate a target-language representation, which is used as the query to retrieve the most relevant translations for a given program. Our succinct query enables the user to easily update and correct the returned results to improve the retrieval process. Our experiments show that BigPT outperforms state-of-the-art baselines in terms of program accuracy. Using our novel querying and retrieving mechanism, BigPT scales to large datasets and efficiently retrieves translations.
New Ideas and Emerging Results (NIER) track
Tue 16 Nov 2021 19:40 - 19:50 at Kangaroo - Code Chair(s): Michael Pradel University of Stuttgart
Machine Learning is a vital part of various modern-day decision-making software. At the same time, it has been shown to exhibit bias, which can cause unjust treatment of individuals and population groups. One method to achieve fairness in machine learning software is to provide individuals with the same degree of benefit, regardless of sensitive attributes (e.g., students receive the same grade, independent of their sex or race). However, there can be other attributes that one might want to discriminate against (e.g., students who did their homework should receive higher grades). We call such attributes anti-protected attributes. When reducing the bias of machine learning software, one risks losing the desirable discriminatory behaviour on anti-protected attributes. To combat this, we use grid search to show that machine learning software can be debiased (e.g., gender bias can be reduced) while also improving its ability to discriminate against anti-protected attributes.
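A minimal sketch of the idea (toy data and a hypothetical reweighing knob, not the paper's exact setup): grid-search a mitigation strength, then keep configurations that shrink the prediction gap on the protected attribute while preserving, or enlarging, the gap on the anti-protected attribute.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def parity_gap(pred, group):
        """Difference in positive-prediction rate between the two groups."""
        return abs(pred[group == 0].mean() - pred[group == 1].mean())

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))
    protected = (X[:, 0] > 0).astype(int)       # e.g., sex
    anti_protected = (X[:, 1] > 0).astype(int)  # e.g., homework done
    y = ((X[:, 1] + 0.3 * X[:, 0]
          + rng.normal(scale=0.5, size=500)) > 0).astype(int)

    for w in np.linspace(0.5, 3.0, 6):          # the grid being searched
        sw = np.where(protected == 1, w, 1.0)   # hypothetical mitigation knob
        pred = LogisticRegression().fit(X, y, sample_weight=sw).predict(X)
        print(f"w={w:.1f}  protected gap={parity_gap(pred, protected):.3f}  "
              f"anti-protected gap={parity_gap(pred, anti_protected):.3f}")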
Tool Demonstrations
Tue 16 Nov 2021 19:50 - 19:55 at Kangaroo - Code Chair(s): Michael Pradel University of Stuttgart
Automation in quantum software testing is essential to support systematic and cost-effective testing. Towards this direction, we present a quantum software testing tool called Quito that can automatically generate test suites covering three coverage criteria defined on the inputs and outputs of a quantum program coded in Qiskit: input coverage, output coverage, and input-output coverage. Quito also implements two types of test oracles based on program specifications: checking whether a quantum program produced a wrong output, and checking a probabilistic test oracle with statistical tests. We describe the architecture and methodology of the tool. We also validated the tool with one quantum program and one faulty version of it. Results indicate that Quito can generate test suites and perform test assessments that detect faults, and it produces test results with good time performance.
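The probabilistic test oracle can be illustrated with a goodness-of-fit test (a sketch assuming a chi-square test; Quito's exact statistical procedure may differ): run the quantum program many times and compare the observed output distribution against the specified one.

    from scipy.stats import chisquare

    # Specification: for this input, outputs '00' and '11' each with p = 0.5
    expected_probs = {"00": 0.5, "11": 0.5}
    observed_counts = {"00": 531, "11": 469}  # toy counts from 1000 executions

    shots = sum(observed_counts.values())
    f_obs = [observed_counts.get(o, 0) for o in expected_probs]
    f_exp = [p * shots for p in expected_probs.values()]

    stat, p_value = chisquare(f_obs, f_exp)
    print("fail" if p_value < 0.01 else "pass", p_value)

An output outside the specified set (e.g., '01' here) is flagged directly as a wrong output, without any statistics.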
Tool Demonstrations
Tue 16 Nov 2021 19:55 - 20:00 at Kangaroo - Code Chair(s): Michael Pradel University of Stuttgart
Many code changes that developers make in their projects are repeated and constitute recurrent change patterns. It is of interest to collect such patterns from the version history of open-source repositories and suggest the most useful of them as quick fixes. In this paper, we present Revizor, a tool for building custom plugins for PyCharm, a popular Python IDE. A Revizor-based plugin can take recurrent change patterns and highlight potential places for their application in the developer's code editor. If the developer accepts the quick fix, the plugin automatically performs the edit. Our approach uses a graph-based representation of code changes, which allows us to support complex distributed code patterns. We have also asked several experienced developers to rate the usability and performance of our plugin and received positive feedback.
The source code of the tool and a test plugin prototype are available on GitHub: https://github.com/JetBrains-Research/revizor. A demonstration video with a short tool description can be found on YouTube: https://youtu.be/5eLs14nco7E.
Research Papers
Tue 16 Nov 2021 21:00 - 21:20 at Kangaroo - Fuzzing Applications Chair(s): Thuan Pham The University of Melbourne
Browsers use security policies to block malicious behaviors. Cross-Origin Read Blocking (CORB) is a browser security policy for preventing side-channel attacks such as Spectre. We propose a web browser security policy fuzzer called CorbFuzz for checking CORB and similar policies. In implementing a security policy, the browser only has access to HTTP requests and responses, and takes policy actions based solely on those interactions. In checking browser security policies, CorbFuzz uses a policy oracle that tracks the web application behavior and infers the desired policy action based on the web application state. By comparing the policy oracle with the browser behavior, CorbFuzz detects weaknesses in browser security policies. CorbFuzz checks the web browser policy by fuzzing a set of web applications where the persistent-layer queries are symbolically evaluated for increased coverage and automation. CorbFuzz collects type information from database queries and branch conditions in order to prevent the generation of inconsistent data values during fuzzing. We evaluated CorbFuzz on CORB and Opaque Response Blocking (ORB) policies using web applications collected from GitHub and found three classes of weaknesses in Chromium's implementation of CORB.
Research Papers
Tue 16 Nov 2021 21:20 - 21:40 at Kangaroo - Fuzzing Applications Chair(s): Thuan Pham The University of Melbourne
Unlike traditional software, smart contracts have a unique organization in which a sequence of transactions shares internal states. Unfortunately, this characteristic makes existing fuzzing tools fail to discern critical transaction sequences. To tackle this challenge, we employ a combined static and dynamic analysis for fuzzing smart contracts. First, we statically analyze smart contract binaries to predict which transaction sequences will lead to effective testing and to figure out whether there is a certain constraint that each transaction should satisfy. This information is then passed to the traditional fuzzing phase and used to construct an initial seed corpus. Furthermore, we perform a lightweight dynamic data-flow analysis to collect data-flow-based feedback to effectively guide fuzzing. We implement our technique in a practical open-source fuzzer named SMARTIAN. SMARTIAN can discover bugs in real-world smart contracts without the need for source code. Our experimental results show that SMARTIAN is more effective than existing state-of-the-art tools at finding known CVEs in real-world contracts, and it also outperforms other tools in terms of code coverage.
Industry Showcase
Tue 16 Nov 2021 21:40 - 21:50 at Kangaroo - Fuzzing Applications Chair(s): Thuan Pham The University of Melbourne
Comprehensive testing is of critical importance to ensure the reliability of software systems, especially for mission-critical systems such as FinTech systems. In this paper we share our observations of the status quo of testing financial services at Ant Group. Specifically, we observe that during the automated fuzzing process, both external environment settings and input object properties have an important influence on the system execution path. Based on these observations, we propose FinFuzzer, an automated fuzzing test framework that detects the corresponding environment settings and transfers them into system inputs, prioritizes the input object properties, and mutates system inputs over both environment settings and important object properties. We apply FinFuzzer to 4 projects developed at Ant Group, and the results show that our approach can surpass state-of-the-art techniques in terms of test coverage in a much shorter time.
Tool Demonstrations
Tue 16 Nov 2021 21:50 - 21:55 at Kangaroo - Fuzzing Applications Chair(s): Thuan Pham The University of Melbourne
Greybox fuzzing is an effective method for software testing. Greybox fuzzers, e.g., AFL, use instrumentation to collect path coverage information to guide test generation. The instrumentation is usually inserted by a modified compiler tool-chain, meaning that the program must be recompiled in order to be compatible with greybox fuzzing. When source code is unavailable, or for projects with complex build systems, recompilation is not always feasible. In this paper we present E9AFL, a fast and scalable tool that inserts AFL instrumentation into program binaries. E9AFL is built on top of a static binary rewriting tool. To combat the overhead caused by binary instrumentation, E9AFL develops a set of optimization strategies. Evaluation results show that E9AFL outperforms existing binary instrumentation tools and achieves performance comparable to compile-time instrumentation.
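For context, the edge-coverage bookkeeping that AFL-style instrumentation inserts at every basic block is tiny; sketched here in Python (the real thing is a handful of machine instructions that the rewriter patches into the binary):

    MAP_SIZE = 1 << 16
    shared_map = bytearray(MAP_SIZE)  # coverage map shared with the fuzzer
    prev_location = 0

    def on_basic_block(cur_location):
        """cur_location is a random block id fixed at instrumentation time."""
        global prev_location
        edge = (cur_location ^ prev_location) % MAP_SIZE
        shared_map[edge] = (shared_map[edge] + 1) & 0xFF  # saturating byte counter
        prev_location = cur_location >> 1  # so A->B and B->A hit different cells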
Tool Demonstrations
Tue 16 Nov 2021 22:00 - 22:02 at Kangaroo - Tool Demo (1) Chair(s): Sridhar Chimalakonda RISHA Lab, Indian Institute of Technology, Tirupati
Managing large and fast-evolving software systems can be a challenging task. Numerous solutions have been developed to assist in this process, enhancing software quality and reducing development costs. These techniques—e.g., regression test selection and change impact analysis—are often built as standalone tools, unable to share or reuse information among them. In this paper, we introduce a software evolution management engine, EVOME, to streamline and simplify the development of such tools, allowing them to be easily prototyped using an intuitive query language and quickly deployed for different types of projects. EVOME is based on differential factbase, a uniform exchangeable representation of evolving software artifacts, and can be accessed directly through a Web interface. We demonstrate the usage and key features of EVOME on real open-source software projects. The demonstration video can be found at: http://youtu.be/6mMgu6rfnjY.
Tool Demonstrations
Tue 16 Nov 2021 22:04 - 22:06 at Kangaroo - Tool Demo (1) Chair(s): Sridhar Chimalakonda RISHA Lab, Indian Institute of Technology, Tirupati
Code merging plays an important role in collaborative software development. However, it is often tedious and error-prone for developers to manually resolve merge conflicts, especially when there are many conflicts after merging long-lived branches or parallel versions. In this paper, we present SoManyConflicts, a language-agnostic approach to help developers resolve merge conflicts systematically and interactively by utilizing the relations between merge conflicts. SoManyConflicts employs a graph representation to model the relations between merge conflicts (e.g., dependency, similarity, etc.) and provides 3 major features: 1) suggest an order of resolution to the developer by clustering conflicts into different groups based on graph connectivity; 2) suggest conflicts related to the currently focused conflict based on topological sorting; 3) suggest resolution strategies for unresolved conflicts based on the set of already resolved conflicts. We have implemented SoManyConflicts as a Visual Studio Code extension that supports multiple languages (Java, JavaScript, TypeScript, etc.), which is briefly introduced in the video: https://youtu.be/_asWh_j1KTU. The source code is publicly available at: https://github.com/Symbolk/somanyconflicts.
Tool Demonstrations
Tue 16 Nov 2021 22:06 - 22:08 at Kangaroo - Tool Demo (1) Chair(s): Sridhar Chimalakonda RISHA Lab, Indian Institute of Technology, Tirupati
Modern web applications manipulate a large amount of user data and undergo frequent data-schema changes. These changes bring up a unique refactoring task: updating application code to be consistent with the data schema. Previous studies and our own investigation show that this type of refactoring is error-prone and time-consuming for developers. This paper presents EvolutionSaver, a static code analysis and transformation tool that automates schema-related code refactoring and consistency checking. EvolutionSaver is implemented as an IDE plugin that works for both Rails and Django applications. The source code of EvolutionSaver is available on GitHub at https://github.com/jwjwyoung/EvolutionSaver, and the plugin can be downloaded from the Visual Studio Marketplace at https://marketplace.visualstudio.com/items?itemName=evolutionsaver.evolutionsaver, with its tutorial available at https://www.youtube.com/watch?v=qBiMkLFIjbE.
Tool Demonstrations
Tue 16 Nov 2021 22:10 - 22:12 at Kangaroo - Tool Demo (1) Chair(s): Sridhar Chimalakonda RISHA Lab, Indian Institute of Technology, Tirupati
Inspection of code changes is a time-consuming task that constitutes a big part of the everyday work of software engineers. Existing IDEs provide little information about the semantics of code changes within the file editor view. Therefore, developers have to track changes across multiple files, which is a hard task with large codebases.
In this paper, we present RefactorInsight, a plugin for IntelliJ IDEA that introduces a smart diff for code changes in Java and Kotlin where refactorings are auto-folded and provided with their description, thus allowing users to focus on changes that modify the code behavior, such as bug fixes and new features. RefactorInsight supports three usage scenarios: viewing smart diffs with auto-folded refactorings and hints, inspecting refactorings in pull requests and at any specific commit in the project change history, and exploring the refactoring history of methods and classes. The evaluation shows that commit processing time is acceptable: the median is less than 0.2 seconds, a delay that does not disrupt developers' IDE workflows.
RefactorInsight is available at https://github.com/JetBrains-Research/RefactorInsight. The demonstration video is available at https://youtu.be/-6L2AKQ66nA.
Artifact Evaluation
Tue 16 Nov 2021 23:00 - 23:05 at Kangaroo - Artefacts Plenary (Any Day Band 2) Chair(s): Aldeida Aleti Monash University, Tim Menzies North Carolina State University
no description available
Aldeida is a Senior Lecturer at the Faculty of Information Technology, Monash University in Australia, where she leads the Software Engineering Discipline. Aldeida's research is in the area of search-based software engineering (SBSE), with a particular focus on what makes software engineering problems (design, testing, program repair) hard to optimise and on designing approaches that make it easier to apply SBSE techniques to new problems. Aldeida has published more than 50 papers at top AI, optimisation, and software engineering venues; served as a PC member and organising committee member at both SE and optimisation/AI conferences, such as ASE, ICSE, GECCO, SSBSE, ISSTA, and IJCAI; is on the editorial board of JSS; and is a reviewer for journals such as JSS, IEEE TSE, and IEEE TEC. Aldeida has attracted more than $1,000,000 in competitive research funding and was awarded the prestigious Discovery Early Career Researcher Award (DECRA fellowship) from the Australian Research Council.
Artifact Evaluation
Tue 16 Nov 2021 23:05 - 23:12 at Kangaroo - Artefacts Plenary (Any Day Band 2) Chair(s): Aldeida Aleti Monash University, Tim Menzies North Carolina State University
Principles of Artifacts: Some Lessons Learned
This mini keynote summarizes a few properties that should be followed when publishing artifacts, explains the three levels of achievements that the artifact communities nowadays distinguish, and emphasizes the need for standardization and uniformity in the processes, such that our artifact processes and badges are understandable by others.
Dirk Beyer is Professor of Computer Science and has a Research Chair for Software Systems at LMU Munich, Germany. He was Full Professor at University of Passau (2009-2016), Assistant and Associate Professor at Simon Fraser University, B.C., Canada, and Postdoctoral Researcher at EPFL in Lausanne, Switzerland (2004-2006) and at the University of California, Berkeley, USA (2003-2004) in the group of Tom Henzinger. Dirk Beyer holds a Dipl.-Inf. degree (1998) and a Dr. rer. nat. degree (2002) in Computer Science from the Brandenburg University of Technology in Cottbus, Germany. In 1998 he was a Software Engineer with Siemens AG, SBS Dept. Major Projects in Dresden, Germany. His research focuses on models, algorithms, and tools for the construction and analysis of reliable software systems. He is the architect, designer, and implementor of several successful tools. For example, CrocoPat is the first efficient interpreter for relational programming, CCVisu is a successful tool for visual clustering, and CPAchecker and BLAST are two well-known and successful software model checkers.
Artifact Evaluation
Tue 16 Nov 2021 23:12 - 23:15 at Kangaroo - Artefacts Plenary (Any Day Band 2) Chair(s): Aldeida Aleti Monash University, Tim Menzies North Carolina State University
Reasoning about immutability is important for preventing bugs, e.g., in multi-threaded software. So far, static analysis to infer immutability properties has mostly focused on individual objects and references. Reasoning about fields and entire classes, while significantly simpler, has gained less attention. Even a consistently used terminology is missing, which makes it difficult to implement analyses that rely on immutability information. We propose a model for class and field immutability that unifies the terminology for the immutability flavors considered by previous work and covers new levels of immutability to handle lazy initialization and immutability dependent on generic type parameters. We implement CiFi, a set of modular, collaborating analyses for different flavors of immutability that infer the properties defined in our model, and propose a benchmark of representative test cases for class and field immutability. We use the benchmark to showcase CiFi's precision and recall in comparison to the state of the art, and use CiFi to study the prevalence of immutability in real-world libraries, showcasing the practical quality and relevance of our model.
Artifact Evaluation
Tue 16 Nov 2021 23:15 - 23:18 at Kangaroo - Artefacts Plenary (Any Day Band 2) Chair(s): Aldeida Aleti Monash University, Tim Menzies North Carolina State University
Question Answering (QA) is an attractive and challenging area in the NLP community. Diverse algorithms have been proposed, and various benchmark datasets with different topics and task formats have been constructed. QA software is now also widely used in daily human life. However, current QA software is mainly tested in a reference-based paradigm, in which the expected outputs (labels) of test cases need to be annotated with much human effort before testing. As a result, neither just-in-time testing during usage nor extensible testing on massive unlabeled real-life data is feasible, which keeps the current testing of QA software from being flexible and sufficient. In this paper, we propose a method, QAAskeR, with three novel Metamorphic Relations for testing QA software. QAAskeR does not require annotated labels but tests QA software by checking its behavior on multiple recursively asked questions that are related to the same knowledge. Experimental results show that QAAskeR can reveal violations on over 80% of valid cases without using any pre-annotated labels. Diverse answering issues, especially limited generalization on question types across datasets, are revealed on a state-of-the-art QA algorithm.
Artifact Evaluation
Tue 16 Nov 2021 23:18 - 23:21 at Kangaroo - Artefacts Plenary (Any Day Band 2) Chair(s): Aldeida Aleti Monash University, Tim Menzies North Carolina State University
Data scientists typically practice exploratory programming using computational notebooks to comprehend new data and extract insights. To do this, they iteratively refine their code, actively trying to re-use and re-purpose solutions created by other data scientists, in real time. However, recent studies have shown that a vast majority of publicly available notebooks cannot be executed out of the box. One of the prominent reasons is the deprecation of data science APIs used in such notebooks, due to the rapid evolution of data science libraries. In this work we propose RELANCER, an automatic technique that restores the executability of broken Jupyter Notebooks, in near real time, by upgrading deprecated APIs. RELANCER employs an iterative, runtime-error-driven approach to identify and fix one API issue at a time. This is supported by a machine-learned model which uses the runtime error message to predict the kind of API repair needed: an update to an API or package name, a parameter, or a parameter value. RELANCER then creates a search space of candidate repairs by combining knowledge from API migration examples on GitHub as well as the API documentation, and employs a second machine-learned model to rank this space of candidate mappings. An evaluation of RELANCER on a curated dataset of 255 un-executable Jupyter Notebooks from Kaggle shows that RELANCER can successfully restore the executability of 56% of the subjects, while baselines relying on just GitHub examples and just API documentation can only fix 37% and 36% of the subjects, respectively. Further, pursuant to its real-time use case, RELANCER can restore execution for 49% of subjects within a 5-minute time limit, while a baseline lacking its machine-learned models can only fix 24%.
Artifact Evaluation
Tue 16 Nov 2021 23:21 - 23:24 at Kangaroo - Artefacts Plenary (Any Day Band 2) Chair(s): Aldeida Aleti Monash University, Tim Menzies North Carolina State University
We introduce a new approach, CONCH, for de-bloating contexts for all the object-sensitive pointer analysis algorithms developed for object-oriented languages, where the calling contexts of a method are distinguished by its receiver objects. Our key insight is to approximate a recently proposed set of two necessary conditions for an object to be context-sensitive, i.e., context-dependent (whose precise verification is undecidable), with a set of three linearly verifiable conditions (in terms of the number of statements in the program) that are almost always necessary for real-world object-oriented applications, based on three key observations regarding the context-dependability of the objects they use. To create a practical implementation, we introduce a new IFDS-based algorithm for reasoning about object reachability in a program. By debloating contexts for two representative object-sensitive pointer analyses applied to a set of 12 representative Java programs, CONCH can speed up the two baselines substantially (3.1x on average with a maximum of 15.9x) and analyze 7 more programs scalably, at only a negligible loss of precision (less than 0.1%).
Artifact Evaluation
Tue 16 Nov 2021 23:24 - 23:27 at Kangaroo - Artefacts Plenary (Any Day Band 2) Chair(s): Aldeida Aleti Monash University, Tim Menzies North Carolina State University
Markdown compilers are widely used for translating plain Markdown text into formatted text, yet they suffer from performance bugs that cause performance degradation and resource exhaustion. Currently, there is little knowledge and understanding about these performance bugs in the wild. In this work, we first conduct a comprehensive study of known performance bugs in Markdown compilers. We identify that the ways Markdown compilers handle the language's context-sensitive features are the dominant root cause of performance bugs. To detect unknown performance bugs, we develop MdPerfFuzz, a fuzzing framework with a syntax-tree-based mutation strategy to efficiently generate test cases that manifest such bugs. It employs an execution-trace similarity algorithm to de-duplicate the bug reports. With MdPerfFuzz, we successfully identified 216 new performance bugs in real-world Markdown compilers and applications. Our work demonstrates that performance bugs are a common, severe, yet previously overlooked security problem.
Artifact Evaluation
Tue 16 Nov 2021 23:27 - 23:32 at Kangaroo - Artefacts Plenary (Any Day Band 2) Chair(s): Aldeida Aleti Monash University, Tim Menzies North Carolina State University
no description available
Full prof, ex-nurse, rocketman, taxi-driver, journalist (it all made sense at the time).
Artifact Evaluation
Tue 16 Nov 2021 23:32 - 23:42 at Kangaroo - Artefacts Plenary (Any Day Band 2) Chair(s): Aldeida Aleti Monash University, Tim Menzies North Carolina State University
no description available
Artifact Evaluation
Tue 16 Nov 2021 23:42 - 00:00 at Kangaroo - Artefacts Plenary (Any Day Band 2) Chair(s): Aldeida Aleti Monash University, Tim Menzies North Carolina State University
no description available
Research Papers
Wed 17 Nov 2021 08:00 - 08:20 at Kangaroo - Bugs I Chair(s): Elena Sherman Boise State University
Static bug detectors aim at helping developers automatically find and prevent bugs. In this experience paper, we study the effectiveness of static bug detectors at identifying Null Pointer Dereferences, or Null Pointer Exceptions (NPEs). NPEs pervade all programming domains, from systems to web development. Specifically, our study measures the effectiveness of five Java static bug detectors: CheckerFramework, Eradicate, Infer, NullAway, and SpotBugs. We conduct our study on 102 real-world and reproducible NPEs from 42 open-source projects found in the BugSwarm and Defects4J datasets. We apply two known methods to determine whether a bug is found by a given tool, and introduce two new methods that leverage stack trace and code coverage information. Additionally, we provide a categorization of the tools' capabilities and the bug characteristics to better understand the strengths and weaknesses of the tools. Overall, the tools under study find only 30 out of 102 bugs (29.4%), with the majority found by Eradicate. Based on our observations, we identify and discuss opportunities to make the tools more effective and useful.
Research Papers
Wed 17 Nov 2021 08:20 - 08:40 at Kangaroo - Bugs I Chair(s): Elena Sherman Boise State University
Data scientists reportedly spend 60 to 80 percent of their time in their daily routines on data wrangling, i.e., cleaning data and extracting features. However, data wrangling code is often repetitive and error-prone to write. Moreover, it is easy to introduce subtle bugs when reusing and adapting existing code, which do not result in crashes but reduce model quality. To support data scientists with data wrangling, we present a technique to generate interactive documentation for data wrangling code. We use (1) program synthesis techniques to automatically summarize data transformations and (2) test case selection techniques to purposefully select representative examples from the data based on execution information collected with tailored dynamic program analysis. We demonstrate that a JupyterLab extension with our technique can provide documentation for many cells in popular notebooks, and we find in a user study that users with our plugin are faster and more effective at finding realistic bugs in data wrangling code.
Industry Showcase
Wed 17 Nov 2021 08:40 - 08:50 at Kangaroo - Bugs I Chair(s): Elena Sherman Boise State University
At Google, fuzzing C and C++ libraries has discovered tens of thousands of security and robustness bugs. However, these bugs are often reported much after they were first introduced. In many cases, developers are provided only with fault-inducing test inputs and replication instructions that highlight a crash, but additional debugging information may be needed to localize the cause of the bug. Hence, developers need to spend substantial time debugging the code and identifying commits that introduced the bug. In this paper, we discuss our experience with automating a fuzzing-enabled bisection that pinpoints the commit in which the crash first manifests itself. This ultimately reduces the time critical bugs stay open in our code base. We report on our experience over the past 12 months, which shows that developers fix bugs on average 2.23 times faster when aided by this automated analysis.
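The automated analysis boils down to a binary search over the commit range, replaying the fault-inducing input at each midpoint. A simplified sketch (the command names are illustrative, not Google's internal tooling):

    import subprocess

    def crashes_at(commit, crash_input):
        subprocess.run(["git", "checkout", "-q", commit], check=True)
        subprocess.run(["./build_fuzz_target.sh"], check=True)  # hypothetical build step
        return subprocess.run(["./fuzz_target", crash_input]).returncode != 0

    def bisect(commits, crash_input):
        """commits[0] is known good, commits[-1] known bad;
        returns the first commit at which the crash reproduces."""
        lo, hi = 0, len(commits) - 1
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if crashes_at(commits[mid], crash_input):
                hi = mid
            else:
                lo = mid
        return commits[hi]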
Tool Demonstrations
Wed 17 Nov 2021 08:50 - 08:55 at Kangaroo - Bugs I Chair(s): Elena Sherman Boise State University
A test case that intermittently passes or fails when performed under the same version of source code and test code is said to be flaky. The presence of flaky tests wastes testing time and effort. The most popular approach in industry to detect flakiness is ReRun. The idea behind ReRun is very simple: failing test cases are re-executed many times, looking for inconsistencies in the output. Despite its simplicity, the ReRun strategy is very expensive both in terms of time and in terms of computational resources. This is particularly true in contexts where thousands of test cases are performed on a daily basis. Reducing the rerunning overhead is, thus, of utmost importance. This paper presents SHAKER, an open-source tool for detecting flakiness in time-constrained tests by adding noise to the execution environment. The main idea behind SHAKER is to add stressing tasks that compete with the test execution for the use of resources (CPU or memory). SHAKER is available as a GitHub Actions workflow that can be seamlessly integrated with any GitHub project. Alternatively, SHAKER can also be used via its provided command-line interface. In our evaluation, SHAKER was able to discover more flaky tests than ReRun, and faster (fewer re-executions); moreover, our approach revealed tens of new flaky tests that went undetected by ReRun even after 50 re-executions. Thanks to its flexibility and ease of use, we believe that SHAKER can be useful for both practitioners and researchers.
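The core trick can be sketched with standard-library pieces (illustrative only; SHAKER's stressor set and orchestration are richer): run the suspect test while CPU-burning processes compete for resources, and flag it if the verdict flips.

    import multiprocessing, subprocess

    def burn_cpu():
        while True:  # busy loop competing with the test for CPU time
            pass

    def run_under_stress(test_cmd, n_stressors=4):
        workers = [multiprocessing.Process(target=burn_cpu, daemon=True)
                   for _ in range(n_stressors)]
        for w in workers:
            w.start()
        try:
            return subprocess.run(test_cmd).returncode == 0
        finally:
            for w in workers:
                w.terminate()

    # A test that passes on a quiet machine but fails under noise is a
    # likely time-constrained flaky test:
    # passed = run_under_stress(["pytest", "tests/test_latency.py"])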
Research Papers
Wed 17 Nov 2021 09:00 - 09:20 at Kangaroo - Learning I Chair(s): Denys Poshyvanyk William and Mary
Deep Learning (DL) components are routinely integrated into software systems that need to perform complex tasks such as image or natural language processing. The adequacy of the test data used to test such systems can be assessed by their ability to expose artificially injected faults (mutations) that simulate real DL faults.
In this paper, we describe an approach to automatically generate new test inputs that can be used to augment the existing test set so that its capability to detect DL mutations increases. Our tool DeepMetis implements a search-based input generation strategy. To account for the non-determinism of the training and mutation processes, our fitness function involves multiple instances of the DL model under test. Experimental results show that DeepMetis is effective at augmenting the given test set, increasing its capability to detect mutants by 63% on average. A leave-one-out experiment shows that the augmented test set is capable of exposing unseen mutants, which simulate the occurrence of yet-undetected faults.
Research Papers
Wed 17 Nov 2021 09:20 - 09:40 at Kangaroo - Learning I Chair(s): Denys Poshyvanyk William and MaryModel-based testing is a structured method to test complex systems. Scaling up model-based testing to large systems requires improving the efficiency of the various steps involved in test-case generation and, more importantly, in test execution. One of the most costly steps of model-based testing is to bring the system to a known state, which is best achieved through synchronising sequences. A synchronising sequence is an input sequence that brings a given system to a predetermined state regardless of the system's initial state. Depending on its structure, a system might be complete, i.e., all inputs are applicable at every state of the system; however, some systems are partial, in which case not all inputs are usable at every state. Deriving synchronising sequences from complete or partial systems is a challenging task. In this paper, we introduce a novel Q-learning algorithm that can derive synchronising sequences from systems with complete or partial structures. The proposed algorithm is faster and can process larger systems than the fastest sequential algorithm that derives synchronising sequences from complete systems. Moreover, the proposed method is also faster and can process larger systems than the most recent massively parallel algorithm that derives synchronising sequences from partial systems. Furthermore, the proposed algorithm generates shorter synchronising sequences.
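To illustrate the formulation (not the paper's actual algorithm), the sketch below casts synchronisation as Q-learning over sets of still-possible automaton states; the reward shaping, episode caps, and hyperparameters are assumptions made for the example, and partial automata are handled by penalising inputs undefined in some current state:

```python
import random
from collections import defaultdict

def q_learn_sync(delta, states, inputs, episodes=5000,
                 alpha=0.5, gamma=0.95, eps=0.2):
    """Derive a synchronising sequence for an automaton given as a dict
    delta[(state, input)] -> next_state; entries may be missing for
    partial automata. The RL state is the frozenset of automaton states
    still possible; an episode ends when it collapses to a singleton."""
    Q = defaultdict(float)

    def step(cur, a):
        if any((s, a) not in delta for s in cur):
            return None  # input blocked in some current state: unusable
        return frozenset(delta[(s, a)] for s in cur)

    for _ in range(episodes):
        cur = frozenset(states)
        for _ in range(10 * len(states) ** 2):  # cap episode length
            if len(cur) == 1:
                break
            a = (random.choice(inputs) if random.random() < eps
                 else max(inputs, key=lambda i: Q[(cur, i)]))
            nxt = step(cur, a)
            if nxt is None:
                Q[(cur, a)] += alpha * (-10.0 - Q[(cur, a)])
                continue
            reward = 0.0 if len(nxt) == 1 else -1.0  # favour short sequences
            best = max(Q[(nxt, i)] for i in inputs) if len(nxt) > 1 else 0.0
            Q[(cur, a)] += alpha * (reward + gamma * best - Q[(cur, a)])
            cur = nxt

    seq, cur = [], frozenset(states)  # greedy rollout of the learned policy
    while len(cur) > 1 and len(seq) < len(states) ** 3:
        a = max(inputs, key=lambda i: Q[(cur, i)])
        nxt = step(cur, a)
        if nxt is None:
            return None
        seq.append(a)
        cur = nxt
    return seq if len(cur) == 1 else None
```

The subset space grows exponentially, so a practical algorithm needs smarter state handling; the sketch only conveys the reward structure.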
New Ideas and Emerging Results (NIER) track
Wed 17 Nov 2021 09:40 - 09:50 at Kangaroo - Learning I Chair(s): Denys Poshyvanyk William and MaryPre-trained models of code built on the transformer architecture have performed well on many software engineering (SE) tasks, including predictive code generation. However, whether the vector representations from these pre-trained models comprehensively encode characteristics of source code well enough to be applicable to a broad spectrum of downstream tasks remains an open question.
One way to investigate this is with diagnostic tasks called probes. In this paper, we construct four probing tasks (probing for surface-level, syntactic, structural, and semantic information) for pre-trained code models. We show how probes can be used to identify whether models are deficient in (understanding) certain code properties, to characterize different model layers, and to gain insight into the sample efficiency a model may need for each type of task.
We probe four models that vary in their expected knowledge of code properties: BERT (pre-trained on English), CodeBERT and CodeBERTa (pre-trained on source code and natural language documentation), and GraphCodeBERT (pre-trained on source code with dataflow). While GraphCodeBERT performs more consistently overall, we find that BERT performs surprisingly well on some code tasks, which calls for further investigation. We release all the task datasets and evaluation code publicly.
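As a hedged sketch of how such a probe can be set up with off-the-shelf tooling (the mean-pooling, the linear probe, and the probed property are illustrative assumptions, not the paper's exact protocol):

```python
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base",
                                  output_hidden_states=True)

def layer_embedding(code: str, layer: int):
    """Mean-pool the token representations of one transformer layer."""
    with torch.no_grad():
        out = model(**tok(code, return_tensors="pt",
                          truncation=True, max_length=256))
    return out.hidden_states[layer][0].mean(dim=0).numpy()

def probe_layer(snippets, labels, layer):
    """Fit a simple linear probe on frozen embeddings; high accuracy
    suggests the layer already encodes the probed property
    (e.g., an AST-depth bucket for a structural probe)."""
    X = [layer_embedding(s, layer) for s in snippets]
    return cross_val_score(LogisticRegression(max_iter=1000),
                           X, labels, cv=5).mean()
```

Comparing probe_layer across layers is what lets one characterize where in the network a given property emerges.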
Tool Demonstrations
Wed 17 Nov 2021 09:50 - 09:55 at Kangaroo - Learning I Chair(s): Denys Poshyvanyk William and MaryDeep learning (DL) training is nondeterministic, and such nondeterminism was shown to cause significant variance in model accuracy (up to 10.8%). Such variance may affect the validity of comparisons between newly proposed DL techniques and baselines. To ensure such validity, DL researchers and practitioners must replicate their experiments multiple times with identical settings to quantify the variance of the proposed approaches and baselines. Replicating and measuring DL variance reliably and efficiently is challenging and understudied. We propose a ready-to-deploy framework, DEVIATE, that (1) measures the DL training variance of a DL model with minimal manual effort, and (2) provides statistical tests of both accuracy and variance. Specifically, DEVIATE automatically analyzes the DL training code and extracts important metrics to monitor (such as accuracy and loss). In addition, DEVIATE performs popular statistical tests and provides users with a report of statistical p-values and effect sizes along with various confidence levels when comparing to selected baselines. We demonstrate the effectiveness of DEVIATE through case studies with adversarial training. Specifically, for an adversarial training process that uses the Fast Gradient Sign Method to generate adversarial examples as training data, DEVIATE measures a maximum accuracy difference of up to 5.1% among 8 identical training runs with fixed random seeds.
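The statistical core of such a report can be sketched in a few lines of Python; the specific choice of a Mann-Whitney U test with Cliff's delta as the effect size is an assumption for illustration, not necessarily the exact battery DEVIATE ships:

```python
import numpy as np
from scipy import stats

def cliffs_delta(a, b):
    """Non-parametric effect size: P(a > b) - P(a < b)."""
    a, b = np.asarray(a), np.asarray(b)
    greater = sum((x > b).sum() for x in a)
    less = sum((x < b).sum() for x in a)
    return (greater - less) / (len(a) * len(b))

def compare_runs(acc_new, acc_base, alpha=0.05):
    """Compare accuracies from repeated identical training runs of a
    proposed technique and a baseline, instead of one run of each."""
    _, p = stats.mannwhitneyu(acc_new, acc_base, alternative="two-sided")
    return {
        "variance_new": max(acc_new) - min(acc_new),  # max accuracy gap
        "p_value": p,
        "significant": p < alpha,
        "cliffs_delta": cliffs_delta(acc_new, acc_base),
    }
```

Reporting the maximum accuracy gap alongside the test makes it visible when a claimed improvement is smaller than the training variance itself.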
Tool Demonstrations
Tue 16 Nov 2021 12:55 - 13:00 at Koala - Languages Chair(s): Jean-Guy Schneider Deakin UniversityIn this paper, we demonstrate the implementation details and usage of GenTree, a dynamic analysis tool for learning a program's interactions. Configurable software systems, while providing more flexibility to their users, are harder to develop, test, and analyze. GenTree can efficiently analyze the interactions among configuration options in configurable software. These interactions compactly represent large sets of configurations and thus allow us to efficiently analyze and discover interesting properties (e.g., bugs) in configurable software. Our experiments on 17 configurable systems spanning 4 languages show that GenTree efficiently finds precise interactions using a tiny fraction of the configuration space. GenTree and its dataset are open-source and available at https://github.com/unsat/gentree and a video demo is at https://youtu.be/x3eqUflvlN8
Research Papers
Wed 17 Nov 2021 11:00 - 11:20 at Kangaroo - Finding Defects Chair(s): Xiao Liu School of Information Technology, Deakin UniversityAs online service systems continue to grow in complexity and volume, how service incidents are managed will greatly impact company revenue and user trust. Due to the cascading effect, a cloud failure often comes with an overwhelming number of incidents from dependent services and devices. To pursue efficient incident management, related incidents should be quickly aggregated to narrow down the problem scope. To this end, in this paper, we propose GRLIA, an incident aggregation framework based on graph representation learning over a graph of cascaded cloud failures. The graph representation is learned for each unique incident in an unsupervised and unified fashion to simultaneously encode the topological and temporal relationships among incidents. Therefore, it can be easily employed for online incident aggregation by measuring the distance between incidents. Furthermore, we leverage fine-grained system monitoring data, i.e., Key Performance Indicators (KPIs), to identify the complete scope of a failure's cascading impact. The proposed framework is evaluated with real-world incident data collected from a large-scale online service system of company H. The experimental results demonstrate that GRLIA is effective and outperforms existing methods. Furthermore, our framework has been successfully deployed in industrial practice.
Research Papers
Wed 17 Nov 2021 11:20 - 11:40 at Kangaroo - Finding Defects Chair(s): Xiao Liu School of Information Technology, Deakin UniversityJust-In-Time (JIT) defect prediction (i.e., an AI/ML model to predict defect-introducing commits) has been proposed to help developers prioritize their limited Software Quality Assurance (SQA) resources on the most risky commits. However, the explainability of JIT defect models remains largely unexplored (i.e., practitioners still do not know why a commit is predicted as defect-introducing). Recently, LIME has been used to generate explanations for any AI/ML model. However, the random perturbation approach used by LIME to generate synthetic neighbors is still suboptimal: it may generate synthetic neighbors that are not similar to the instance to be explained, producing local models with low accuracy and, in turn, inaccurate explanations for just-in-time defect models.
In this paper, we propose PyExplainer, a local rule-based model-agnostic technique for generating explanations (i.e., why a commit is predicted as defective) of JIT defect models. Through a case study of two open-source software projects, we find that our PyExplainer produces (1) synthetic neighbors that are 41%-45% more similar to the instance to be explained; (2) 18%-38% more accurate local models; and (3) explanations that are 69%-98% more unique and 17%-54% more consistent with the actual characteristics of defect-introducing commits in the future than LIME (a state-of-the-art model-agnostic technique). This could help practitioners focus on the most important aspects of commits to mitigate the risk of being defect-introducing. Thus, the contributions of this paper constitute an important step towards explainable AI for software engineering, making software analytics more explainable and actionable. Finally, we publish PyExplainer as a Python package to support practitioners and researchers.
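The contrast with random perturbation can be sketched in Python; crossover-based neighbor generation and a depth-limited decision tree as the local rule model are illustrative assumptions in the spirit of the approach, not PyExplainer's published API:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

def crossover_neighbors(x, X_train, n=1000, rng=None):
    """Build synthetic neighbors by crossing the commit to explain with
    real training commits (numpy arrays of features), so neighbors stay
    on the data manifold instead of being random perturbations."""
    rng = rng or np.random.default_rng(0)
    donors = X_train[rng.integers(0, len(X_train), size=n)]
    mask = rng.random((n, x.shape[0])) < 0.5  # features taken from donors
    return np.where(mask, donors, x)

def explain(x, X_train, global_model, feature_names):
    """Label the neighbors with the global JIT defect model and fit an
    interpretable surrogate whose rules serve as the explanation."""
    Z = crossover_neighbors(x, X_train)
    y = global_model.predict(Z)
    surrogate = DecisionTreeClassifier(max_depth=3).fit(Z, y)
    return export_text(surrogate, feature_names=list(feature_names))
```

The returned rules (e.g., thresholds on churn or developer experience) are what a practitioner can act on.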
New Ideas and Emerging Results (NIER) track
Wed 17 Nov 2021 11:40 - 11:50 at Kangaroo - Finding Defects Chair(s): Xiao Liu School of Information Technology, Deakin UniversityParallel coverage-guided greybox fuzzing is the most common setup for vulnerability discovery at scale. However, it has so far received little attention from the research community compared to single-mode fuzzing, leaving several problems open, particularly in its task allocation strategies. Current approaches focus on managing micro tasks, at the seed input level, and their task division algorithms are either ad hoc or static. In this paper, we leverage research on graph partitioning and search algorithms to propose a systematic and dynamic task allocation solution that works at the macro-task level. First, we design an attributed graph to capture both program structures (e.g., the program call graph) and fuzzing information (e.g., branch hit counts, bug discovery probability). Second, our graph partitioning algorithm divides the global program search space into sub-search-spaces. Finally, our search algorithm prioritizes these sub-search-spaces (i.e., tasks) and explores them to maximize code coverage and the number of bugs found. We implemented a prototype tool called AFLTeam. In our preliminary experiments on well-tested benchmarks, AFLTeam achieved higher code coverage (up to 16.4% branch coverage improvement) than the default parallel mode of AFL and discovered 2 zero-day bugs in the FFmpeg and JasPer toolkits.
Journal-first Papers
Wed 17 Nov 2021 11:50 - 12:00 at Kangaroo - Finding Defects Chair(s): Xiao Liu School of Information Technology, Deakin UniversityBug localization is an important aspect of software maintenance because it can locate modules that should be changed to fix a specific bug. Our previous study showed that the accuracy of the information retrieval (IR)-based bug localization technique improved when used in combination with code smell information. Although this technique showed promise, the study had limited generality because of the small number of (1) projects in the dataset, (2) types of smell information, and (3) baseline bug localization techniques used for assessment. This paper presents an extension of our previous experiments on Bench4BL, the largest benchmark dataset available for bug localization. In addition, we generalized the smell-aware bug localization technique to allow different configurations of smell information, which were combined with various bug localization techniques. Our results confirm that our technique can improve the performance of IR-based bug localization techniques at the class level, even when large datasets are processed. Furthermore, thanks to the optimized configuration of the smell information, our technique can enhance the performance of most state-of-the-art bug localization techniques.
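One plausible instantiation of mixing IR scores with smell information, sketched in Python (the linear score combination and the beta weight are assumptions for illustration, not the paper's exact configuration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_classes(bug_report, class_texts, smell_scores, beta=0.2):
    """Rank classes by TF-IDF similarity to the bug report, boosted by
    a normalised per-class smell score (smellier code is assumed more
    likely to harbour the bug). `class_texts` maps class name -> text."""
    vec = TfidfVectorizer(stop_words="english")
    docs = vec.fit_transform(list(class_texts.values()) + [bug_report])
    query = docs[len(class_texts)]             # the bug report vector
    sims = cosine_similarity(query, docs[:len(class_texts)]).ravel()
    max_smell = max(smell_scores.values()) or 1
    scored = {
        name: (1 - beta) * sims[i] + beta * smell_scores[name] / max_smell
        for i, name in enumerate(class_texts)
    }
    return sorted(scored, key=scored.get, reverse=True)
```

Tuning beta per smell type is one way to realize the "different configurations of smell information" the paper explores.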
Research Papers
Wed 17 Nov 2021 12:00 - 12:20 at Kangaroo - Learning II Chair(s): John Grundy Monash UniversityIn recent years, Neural Machine Translation (NMT) has shown promise in automatically editing source code. A typical NMT-based code editor considers only the code that needs to be changed as input and presents developers with a ranked list of patched code to choose from, where the correct one may not always be at the top of the list. While NMT-based code editing systems generate a broad spectrum of plausible patches, the correct one depends on the developers' requirements and often on the context in which the patch is applied. Thus, if developers provide some hints, using natural language or the patch context, NMT models can benefit from them.
As a proof of concept, in this research we leverage three modalities of information: the edit location, the edit code context, and commit messages (as a proxy for developers' hints in natural language), to automatically generate edits with NMT models. To that end, we build MODIT, a multi-modal NMT-based code editing engine. With in-depth investigation and analysis, we show that developers' hints as an input modality can narrow the search space for patches and outperform state-of-the-art models in generating correctly patched code at the top-1 position.
Research Papers
Wed 17 Nov 2021 12:20 - 12:40 at Kangaroo - Learning II Chair(s): John Grundy Monash UniversityThis paper presents Arvada, an algorithm for learning context-free grammars from a set of positive examples and a Boolean-valued oracle. Arvada learns a context-free grammar by building parse trees from the positive examples. Starting from initially flat trees, Arvada adds structure to these trees with a key operation: it bubbles sequences of sibling nodes in the trees into a new node, adding a layer of indirection to the tree. Bubbling operations enable recursive generalization in the learned grammar. We evaluate Arvada against GLADE and find that it achieves average increases of 4.98× in recall and 3.13× in F1 score, while incurring only a 1.27× slowdown and requiring only 0.87× as many calls to the oracle. Arvada shows a particularly marked improvement over GLADE on grammars with highly recursive structure, such as those of programming languages.
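The bubble operation itself is tiny; here is an illustrative Python rendering on lists-as-trees (the tree encoding and node naming are assumptions for the example, not Arvada's internals):

```python
def bubble(tree, start, end, label):
    """Wrap the sibling slice tree[start:end] in a new node, adding one
    layer of indirection; trees are nested lists [label, child, ...]."""
    return tree[:start] + [[label] + tree[start:end]] + tree[end:]

# A flat parse of "1+1": bubbling proposes a candidate nonterminal that
# is kept only if the oracle still accepts the generalized strings.
flat = ["root", "1", "+", "1"]
print(bubble(flat, 1, 4, "t0"))  # ['root', ['t0', '1', '+', '1']]
```

Accepted bubbles like t0 are what later generalize into recursive rules such as expr -> expr '+' expr.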
Industry Showcase
Wed 17 Nov 2021 12:40 - 12:50 at Kangaroo - Learning II Chair(s): John Grundy Monash UniversityGraphQL is a query language for APIs and a runtime for executing those queries, fetching the requested data from existing microservices, REST APIs, databases, or other sources. Its expressiveness and flexibility have made it an attractive candidate for API providers in many industries, especially on the web. A major drawback to blindly servicing a client's query in GraphQL is that the cost of a query can be unexpectedly large, creating computation and resource overload for the provider, and API rate-limit overages and infrastructure overload for the client. To mitigate these drawbacks, it is necessary to efficiently estimate the cost of a query before executing it. Estimating query cost is challenging because GraphQL queries have a nested structure, GraphQL APIs follow different design conventions, and the underlying data sources are hidden. Estimates based on worst-case static query analysis have had limited success because they tend to grossly overestimate cost. We propose a machine-learning approach to efficiently and accurately estimate query cost. We also demonstrate the power of this approach by testing it on query-response data from publicly available commercial APIs. Our framework is efficient and predicts query costs with high accuracy, consistently outperforming static analysis by a large margin.
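A minimal sketch of the learning setup, assuming the graphql-core Python package for parsing and only two structural features (a real model would use a much richer featurization than this):

```python
from graphql import parse, visit, Visitor
from sklearn.ensemble import GradientBoostingRegressor

class QueryStats(Visitor):
    """Collect simple structural features from a GraphQL query AST."""
    def __init__(self):
        super().__init__()
        self.fields = 0
        self.depth = 0
        self._cur = 0

    def enter_field(self, node, *args):
        self.fields += 1
        self._cur += 1
        self.depth = max(self.depth, self._cur)

    def leave_field(self, node, *args):
        self._cur -= 1

def featurize(query: str):
    stats = QueryStats()
    visit(parse(query), stats)
    return [stats.fields, stats.depth]

def train_cost_model(queries, observed_costs):
    """Regress observed response costs on query features."""
    X = [featurize(q) for q in queries]
    return GradientBoostingRegressor().fit(X, observed_costs)
```

At serving time the provider featurizes an incoming query and rejects or throttles it when the predicted cost exceeds a budget.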
Social/Networking
Wed 17 Nov 2021 13:00 - 14:00 at Kangaroo - Ask Me Anything - Tom Zimmermann Chair(s): August Shi University of Texas at Austin
Plenary
Wed 17 Nov 2021 18:00 - 19:00 at Kangaroo - Keynote - Andreas Zeller Chair(s): Dan Hao Peking UniversityAbstract:
Notebooks – rich, interactive documents that join together code, documentation, and outputs – are all the rage with data scientists. But can they be used for actual software development? In this talk, I share experiences from authoring two interactive textbooks – fuzzingbook.org and debuggingbook.org – and show how notebooks not only serve for exploring and explaining code and data, but also how they can be used as software modules, integrating self-checking documentation, tests, and tutorials all in one place. The resulting software focuses on the essential, is well-documented, highly maintainable, easily extensible, and has a much longer shelf life than the “duct tape and wire” prototypes frequently found in research and beyond.
Biography:
Andreas Zeller is faculty at the CISPA Helmholtz Center for Information Security and professor for Software Engineering at Saarland University, both in Saarbrücken, Germany. His research on automated debugging, mining software archives, specification mining, and security testing has won several awards for its impact in academia and industry. Zeller is an ACM Fellow, an IFIP Fellow, an ERC Advanced Grant Awardee, and holds an ACM SIGSOFT Outstanding Research Award.
Research Papers
Wed 17 Nov 2021 19:00 - 19:20 at Kangaroo - Detection Chair(s): Cuiyun Gao Harbin Institute of TechnologyNode.js has become a widely used event-driven architecture for server-side and desktop applications. Node.js provides an effective asynchronous event-driven programming model, and supports asynchronous tasks and multi-priority event queues. Unexpected races among events and asynchronous tasks can cause severe consequences. Existing race detection approaches for Node.js applications mainly adopt random fuzzing techniques; they can miss races due to the large testing space and suffer from high overhead.
In this paper, we propose a dynamic race detection approach, NRace, for Node.js applications. In NRace, we build precise happens-before relations among events and asynchronous tasks in Node.js applications, which also take multi-priority event queues into consideration. We further develop a predictive race detection technique based on these relations. We evaluate NRace on 10 real-world Node.js applications. The experimental results show that NRace can precisely detect 6 races, 5 of which have been confirmed by developers.
Research Papers
Wed 17 Nov 2021 19:20 - 19:40 at Kangaroo - Detection Chair(s): Cuiyun Gao Harbin Institute of TechnologySoftware systems often record important runtime information in system logs for troubleshooting purposes. There have been many studies that use log data to construct machine learning models for detecting system anomalies. Through our empirical study, we find that existing log-based anomaly detection approaches are significantly affected by log parsing errors that are introduced by 1) OOV (out-of-vocabulary) words, and 2) semantic misunderstandings. The log parsing errors could cause the loss of important information for anomaly detection. To address the limitations of existing methods, we propose NeuralLog, a novel log-based anomaly detection approach that does not require log parsing. NeuralLog extracts the semantic meaning of raw log messages and represents them as semantic vectors. These representation vectors are then used to detect anomalies through a Transformer-based classification model, which can capture the contextual information from log sequences. Our experimental results show that the proposed approach can effectively understand the semantic meaning of log messages and achieve accurate anomaly detection results. Overall, NeuralLog achieves F1-scores greater than 0.95 on four public datasets, outperforming the existing approaches.
New Ideas and Emerging Results (NIER) track
Wed 17 Nov 2021 19:40 - 19:50 at Kangaroo - Detection Chair(s): Cuiyun Gao Harbin Institute of TechnologyAccording to the 2020 SRE report, 80% of SREs work on post-mortem analysis of incidents due to a lack of provided information, and 16% of toil comes from investigating false positives/negatives. For a cloud service provider, the desire is to proactively identify signals that can help reduce outages and/or reduce the mean time to resolution. Leveraging AI for Operations (AIOps), this work proposes a novel methodology for proactively identifying log anomalies and their resolutions by sifting through the log lines. Typically, the information needed to retrieve resolutions corresponding to logs is spread across multiple heterogeneous corpora that exist in silos, namely historical ticket data, historical log data, and symptom resolutions available in product documentation. In this paper, we focus on augmented dataset preparation from heterogeneous corpora, metadata selection and prediction, and, finally, using these elements at run time to retrieve contextual resolutions for signals triggered via logs. For an early evaluation, we used logs from a production middleware application server, predicted log anomalies and their resolutions, and conducted a qualitative evaluation with subject matter experts; the metadata prediction is 78.57% accurate, and the retrieval accuracy of resolutions is 65.7%.
New Ideas and Emerging Results (NIER) track
Wed 17 Nov 2021 19:50 - 20:00 at Kangaroo - Detection Chair(s): Cuiyun Gao Harbin Institute of TechnologyMaintaining control over confidential information in software is a persistent security problem where failure means secrets can be revealed via program behaviors. Information flow control techniques have traditionally been based on static or symbolic analyses, which are limited in scalability and specialized to particular languages. When programs do leak secrets, there are no approaches to automatically repair them unless the leak causes a functional test to fail. We present our vision for HyperGI, a genetic improvement framework that detects, localizes, and repairs information leakage. Key elements of HyperGI include (1) the use of two orthogonal test suites, (2) a dynamic leak detection approach that estimates and localizes potential leaks, and (3) a repair component that produces a candidate patch using genetic improvement. We demonstrate the successful use of HyperGI on several programs that have no failing functional tests. We manually examine the resulting patches and identify trade-offs and future directions for fully realizing our vision.
Late Breaking Results
Wed 17 Nov 2021 20:00 - 20:02 at Kangaroo - LBR + DS Poster (1) (Wed 07:00 - 10:00) Chair(s): Maria Spichkova RMIT University, AustraliaModern Cyber-Physical Systems (CPSs) that need to perform complex control tasks (e.g., autonomous driving) increasingly use AI-enabled controllers, mainly based on deep neural networks (DNNs). The quality assurance of such systems is of vital importance, yet their verification can be extremely challenging due to their complexity and uninterpretable decision logic. Falsification is an established approach for CPS quality assurance which, instead of attempting to prove the system correct, aims at finding a time-variant input signal violating a formal specification describing the desired behaviour; it often employs a search-based testing approach that tries to minimize the robustness of the specification, given by its quantitative semantics. However, the guidance provided by robustness is mostly black-box and relates only to the system output; it does not allow one to understand whether the temporal internal behaviour of the neural network controller has been explored sufficiently. To bridge this gap, in this paper we make an early attempt and propose four time-aware coverage criteria specifically designed for neural network controllers in the context of CPSs, which consider different features by design: the simple temporal activation of a neuron, the continuous activation of a neuron for a given duration, and the differential neuron activation behavior over time. We further show that these criteria can be employed in the falsification process to provide more exploration in the search. Preliminary experiments have been performed on an Adaptive Cruise Control system and show that considering coverage during falsification increases the falsification rate.
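To make the criteria concrete, here is a sketch of how "ever active" versus "continuously active for a duration" coverage could be computed from recorded activation traces (the threshold and duration values are hypothetical):

```python
import numpy as np

def time_aware_coverage(traces, threshold=0.5, min_duration=5):
    """`traces` has shape (timesteps, neurons), holding the controller's
    neuron activations over one falsification trial. Returns the share
    of neurons ever active and the share continuously active for at
    least `min_duration` consecutive steps."""
    active = traces > threshold               # boolean (T, N)
    ever = active.any(axis=0)
    sustained = np.zeros(active.shape[1], dtype=bool)
    for n in range(active.shape[1]):
        run = best = 0
        for is_on in active[:, n]:
            run = run + 1 if is_on else 0
            best = max(best, run)
        sustained[n] = best >= min_duration
    return ever.mean(), sustained.mean()
```

A falsifier can then prefer input signals that raise these measures, trading pure robustness descent for more exploration of the controller's temporal behaviour.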
Late Breaking Results
Wed 17 Nov 2021 20:02 - 20:04 at Kangaroo - LBR + DS Poster (1) (Wed 07:00 - 10:00) Chair(s): Maria Spichkova RMIT University, AustraliaRecent years have seen rapid development of machine learning based multi-module unmanned aerial vehicle (UAV) systems. To address the oracle problem in autonomous systems, numerous studies have used metamorphic testing to automatically generate test scenes for various modules, e.g., those in self-driving cars. However, as most of these studies are based on unit testing, also known as end-to-end model-based testing, a similar testing approach may not be equally effective for UAV systems, where multiple modules work closely together. Therefore, in this paper, instead of unit testing, we propose a novel metamorphic system testing framework for UAVs, named MSTU, to detect defects in multi-module UAV systems. A preliminary evaluation plan to apply MSTU to an emerging autonomous multi-module UAV system is also presented to demonstrate the feasibility of the proposed testing framework.
Doctoral Symposium
Mon 15 Nov 2021 20:30 - 20:45 at Wombat - DS Session 4Smartphones and mobile apps have become an essential part of our daily lives, and it is necessary to ensure the quality of these apps. Two important aspects of code quality are maintainability and security. The goals of my PhD project are (1) to study code smells, security issues, and their evolution in iOS apps and frameworks, (2) to enhance training and teaching using visualisation support, and (3) to support developers in automatically detecting dependencies on vulnerable library elements in their apps. For each of these three goals, dedicated tool support will be provided: GraphifyEvolution, VisualiseEvolution, and DependencyEvolution, respectively. The tool GraphifyEvolution exists and has been applied to analyse code smells in iOS apps written in Swift. The tool has a modular architecture and can be extended to add support for additional languages and external analysis tools. In the remaining two years of my PhD studies, I will complete the other two tools and apply them in case studies with developers in industry as well as in university teaching.
Doctoral Symposium
Mon 15 Nov 2021 21:00 - 21:15 at Wombat - DS Session 4Despite microservices and other component-based architecture styles having been state of the art in research for many years now, issue management across the boundaries of a single component is still challenging. Components that were developed independently and can be used independently are joined together in the overall architecture, which results in dependencies between those components. Due to these dependencies, bugs can arise that propagate along call chains through the architecture. Other types of issues, such as violations of non-functional quality properties, can also impact other components. However, traditional issue management systems end at the boundaries of a component, making tracking of issues across different components time-consuming and error-prone. Therefore, a need for automation arises in cross-component issue management: automatically putting issues of the independent components in the correct mutual context, creating new cross-component issues, and syncing cross-component issues between different components. This automation could enable developers to manage issues across components as efficiently as possible and increase the system's quality. To solve this problem, we propose an initial approach for semi-automated cross-component issue management in conjunction with service-level objectives, based on our Gropius system. For example, relationships between issues of the same or different components can be predicted using classification to identify dependencies of issues across component boundaries. In addition, we are developing a system to model, monitor, and alert on service-level objectives. Based on this, the impact of such quality violations on the overall system and the business process will be analysed and explained through cross-component issues.
For a CV, please take a look at https://www.linkedin.com/in/sandro-speth/.
Doctoral Symposium
Mon 15 Nov 2021 19:15 - 19:30 at Wombat - DS Session 3Deep learning-based techniques have been widely applied to program analysis tasks in fields such as type inference, fault localization, and code summarization. To date, deep learning-based software engineering systems rely heavily on supervised learning approaches, which require laborious manual effort to collect and label a prohibitively large amount of data. However, most Turing-complete imperative languages share similar control- and data-flow structures, which makes it possible to transfer knowledge learned from one language to another. In this paper, we propose a general cross-lingual transfer learning framework, PLATO, for program analysis, using a series of techniques that generalize across downstream tasks. PLATO allows BERT-based models to leverage prior knowledge learned from the labeled dataset of one language and transfer it to others. We evaluate our approach on several downstream tasks, such as type inference and code summarization, to demonstrate its feasibility.
Late Breaking Results
Wed 17 Nov 2021 20:10 - 20:12 at Kangaroo - LBR + DS Poster (1) (Wed 07:00 - 10:00) Chair(s): Maria Spichkova RMIT University, AustraliaSimple domain-specific graphical languages and libraries can empower a variety of users to create application behaviors and logic. However, it remains challenging to produce and maintain a heterogeneous set of client applications based on these descriptions, as each client typically requires the developers to both understand and embed the domain-specific logic. This is because application logic must be encoded to some extent in both the server and client sides.
In this paper, we propose an alternative approach, which allows the specification of application logic to reside solely on the cloud. In our system, reusable application components are assembled on the cloud in different logical chains and the client is solely concerned with how data is displayed and gathered from users. In this way, the chaining of requests and responses is done by the cloud and the client side has no knowledge of the application logic. This means that the experts in the domain build modular cloud components, arrange them in logical chains, generate a simple user interface, and later leave it to client-side developers to customize the presentation.
Doctoral Symposium
Mon 15 Nov 2021 21:15 - 21:30 at Wombat - DS Session 4Non-deterministically behaving tests impede software development as they hamper regression testing, destroy trust, and waste resources. This phenomenon, also called test flakiness, has received increasing attention over the past years. The multitude of both peer-reviewed literature and online blog articles touching the issue illustrates that flaky tests are deemed both a relevant research topic and a serious problem in everyday business. A major shortcoming of existing work aiming to mitigate flaky tests is its limited applicability, since many of the proposed tools rely heavily on specific ecosystems. This issue is also reflected in various attempts to investigate flaky tests: using mostly similar sets of open-source Java projects, many studies are unable to generalize their findings to projects lying beyond this scope. On top of that, a holistic understanding of flaky tests also suffers from a lack of analyses focusing on the developers' perspective, with most existing studies taking a code-centric approach. With my work, I want to close these gaps: I plan to create an overarching and widely applicable framework that empowers developers to tackle flaky tests through existing and novel techniques and enables researchers to swiftly deploy and evaluate new approaches. As a starting point, I am studying test flakiness from previously unconsidered angles: I widen the scope of observation, investigating flakiness beyond the realm of the Java ecosystem while also capturing practitioners' opinions. By adding to the understanding of the phenomenon, I hope not only to close existing research gaps but also to arrive at a clear vision of how research on test flakiness can create value for developers working in the field.
Doctoral Symposium
Mon 15 Nov 2021 18:45 - 19:00 at Wombat - DS Session 3Software systems evolve continuously during their lifecycle. Developers incrementally introduce new features and fix bugs in the process, leading to large numbers of changes and artifacts accumulating. Driven by the rich data recorded in version control systems and issue trackers, much work has been done to analyze software histories. In this PhD work, we propose a universal representation to effectively store and query knowledge extracted from these histories, with the hope of supporting software evolution research. We have created a toolset, named DIFFBASE, to extract both relations between program entities at the same version and atomic changes between versions. Users can then compose queries using algebraic operators, Datalog, or an SQL-like language to accomplish several different evolution management tasks. Building on the existing research outcomes, possible future work includes utilizing the extracted facts in a scalable solution for discovering compatibility issues involving changes to multiple components, as well as improving the storage and query performance of DIFFBASE.
Late Breaking Results
Wed 17 Nov 2021 20:16 - 20:18 at Kangaroo - LBR + DS Poster (1) (Wed 07:00 - 10:00) Chair(s): Maria Spichkova RMIT University, AustraliaThere is a need to summarize bug reports, as they can become long due to the large number of comments from conversations between developers and various DevOps tools. Although automated approaches to bug report summarization have been developed, we believe they are aiming at the wrong target: getting as close as possible to a gold-standard summary. Instead, researchers should be looking to create automated bug report annotation approaches that allow project members to create their own summaries based on their task-specific information needs. We present such an approach.
Late Breaking Results
Wed 17 Nov 2021 20:18 - 20:20 at Kangaroo - LBR + DS Poster (1) (Wed 07:00 - 10:00) Chair(s): Maria Spichkova RMIT University, AustraliaSoftware developers sometimes use inefficient data structures or library interfaces without considering the potential impact they may have on the runtime of a program. This is due to the significant effort required to research and evaluate possibly more efficient alternatives. Consequently, there is a need for tooling to automate this design space exploration. Our proposed code optimisation solution, called Artemis++, addresses this issue with automatic exploration and transformation of data structures to optimise software performance. In preliminary testing on three mainstream C++ libraries, we observed improvements of up to 16.09%, 27.90%, and 2.74% in CPU usage, runtime, and memory, respectively.
Late Breaking Results
Wed 17 Nov 2021 20:20 - 20:22 at Kangaroo - LBR + DS Poster (1) (Wed 07:00 - 10:00) Chair(s): Maria Spichkova RMIT University, AustraliaAs software grows in size, developers depend on issue tracking systems, and researchers attempt to design automatic bug triage approaches. To simulate the process of manual triage, where managers assign bugs after reading the descriptions of bug reports, prior approaches typically use natural language processing (NLP) for text understanding. Although this technical choice is straightforward, it is not built on solid empirical evidence. In this paper, we explore the impact of textual features on bug triage. By enabling and disabling the textual features of a state-of-the-art approach, we analyze their impact on assigning thousands of bug reports from six widely used open-source projects. Our results show that, instead of improving effectiveness, textual features in fact reduce it. In particular, after we turn off its textual features, the f-scores of the baseline approach improve by 8%. After manual inspection, we find two reasons for this result: (1) classic NLP techniques are insufficient for analyzing bug reports, because bug reports are not pure natural language texts and contain other elements (e.g., code samples); and (2) some bug reports are poorly written. Our findings reveal a strong need and ample opportunity to explore more advanced techniques for handling these complicated elements in bug reports.
Doctoral Symposium
Mon 15 Nov 2021 10:45 - 11:00 at Wombat - DS Session 2Can a machine find and fix a semantic bug? A semantic bug is a deviation from the expected program behaviour that causes incorrect outputs to be produced for certain inputs. To identify this category of bugs, knowledge of the expected program behaviour is essential, because in most scenarios a program with a semantic bug does not fail (i.e., crash or hang) in the middle of execution. Thus, only a human (a user or a developer) who knows the correct program behaviour can detect this kind of bug by observing the output. However, identifying bugs solely through human effort is not practical for all software. A test oracle is any procedure used to differentiate the correct and incorrect behaviours of a program. This dissertation mainly focuses on developing learning techniques to produce automated test oracles for programs with semantic bugs. Discovering methods to effectively incorporate human knowledge into these learning techniques is another concern. Automated test oracles could make semantic bug detection more efficient and could also guide Automated Program Repair tools to generate more accurate fixes for semantic bugs.
Doctoral Symposium
Mon 15 Nov 2021 19:00 - 19:15 at Wombat - DS Session 3Unmanned aerial systems (UAS) have a large number of applications in civil and military domains. UAS rely on various avionics systems that are safety-critical and mission-critical. A major requirement of international safety standards is to perform rigorous system-level testing of avionics systems, including software systems. The current industrial practice is to manually create test scenarios, execute these scenarios manually or automatically using simulators, and manually evaluate the outcomes. A fundamental part of system-level testing of such systems is the simulation of the environmental context: test scenarios typically consist of setting certain environmental conditions and testing the system under test in these settings. The state-of-the-art approaches available for this purpose also require manual test scenario development and manual test evaluation. In this research work, we propose an approach to automate the system-level testing of UAS. The proposed approach (AITester) utilizes model-based testing and artificial intelligence (AI) techniques to automatically generate, execute, and evaluate various test scenarios. The test scenarios are generated on the fly, i.e., during test execution, based on the environmental context at runtime. We develop a toolset to support this automation. We perform a pilot experiment using a widely used open-source autopilot, ArduPilot. The preliminary results show that AITester is effective and efficient in violating environmental conditions.
Hassan Sartaj is an Assistant Professor at National University of Computer and Emerging Sciences, Islamabad, Pakistan. In 2021, he received his Ph.D. in Software Engineering. He is a member of the IEEE Computer Society. He is also Chapter Treasurer of the IEEE Islamabad Chapter (C16).
Doctoral Symposium
Mon 15 Nov 2021 20:45 - 21:00 at Wombat - DS Session 4The supervision of modern IT systems brings new opportunities and challenges by making available big data streams that, if properly analysed, can support high standards of scalability, reliability, and efficiency. Rule-based inference engines on streaming data are a key component of maintenance systems for detecting anomalies and automating their resolution, but they remain confined to simple and general rules, a lesson learned from the expert-systems era. Artificial Intelligence for Operations Systems (AIOps) proposes to take advantage of advanced analytics, such as machine learning and data mining on big data, to improve every step of supervision systems, such as incident management (detection, triage, root cause analysis, automated healing). However, the best AIOps techniques often rely on "opaque" models, strongly limiting their adoption. In this thesis, we study how Subgroup Discovery can help AIOps. This data mining technique offers possibilities to extract hypotheses from data, resp. from predictive models, helping experts understand the underlying processes generating the data, resp. the predictions. To ensure the relevancy of our propositions, this project involves both data mining researchers and practitioners from Infologic, a French software editor.
Doctoral Symposium
Mon 15 Nov 2021 20:15 - 20:30 at Wombat - DS Session 4When users deploy or invoke smart contracts on Ethereum, a fee is charged to avoid resource abuse. Metered in gas, the fee is the product of the amount of gas used and the gas price; the more gas used, the higher the transaction fee. In my doctoral research, we aim to investigate two widely studied issues regarding gas: gas estimation and gas optimization. The former is to predict the gas cost of executing a transaction to avoid out-of-gas exceptions, and the latter is to modify existing contracts to save transaction fees. We target problems that previous work did not solve: gas estimation for loop functions, and gas optimization for storage usage and arrays. We expect this research to help Ethereum users avoid economic losses from out-of-gas exceptions and pay lower transaction fees.
Doctoral Symposium
Mon 15 Nov 2021 19:30 - 19:45 at Wombat - DS Session 3Fuzzing is a technique that aims to detect vulnerabilities or exceptions through unexpected input, and it has attracted tremendous recent interest in both academia and industry. Although fuzzing methods have great advantages for vulnerability detection, each also has its own disadvantages when facing different target programs, and it is clearly impractical for one fuzzing method to suit all target programs. Therefore, we study how to select an appropriate fuzzing method for a given target program. Specifically, we first analyze the program and extract its feature vectors to capture information such as syntax and context. Next, we build a matching model that scores the similarity between the target program and each fuzzing algorithm, and we select the fuzzing algorithm with the highest matching degree. Through our matching model, we obtain a more suitable fuzzing algorithm, improving detection efficiency, precision, recall, F-measure, and other statistical measures.
Developer forums like StackOverflow have become essential resources in modern software development practice. However, many code snippets lack a well-defined method declaration, and thus they are often incomplete for immediate reuse. Developers must adapt the retrieved code snippets by parameterizing the variables involved and identifying the return value. This activity, which we call APIzation of a code snippet, can be tedious and time-consuming. In this paper, we present APIzator, which performs APIzations of Java code snippets automatically. APIzator is grounded in four common patterns that we extracted by studying real APIzations in GitHub. APIzator implements a static analysis algorithm that automatically extracts the method parameters and return statements. We evaluated APIzator with a ground truth of 200 APIzations collected from 20 developers. For 113 (56.50%) and 115 (57.50%) APIzations, APIzator and the developers extracted identical parameters and return statements, respectively. For 163 (81.50%) APIzations, either the parameters or the return statements were identical.
Pre-printCyber-Physical Systems (CPSs) are composed of computational control logic and physical processes, that intertwine with each other. CPSs are widely used in various domains of daily life, including those safety-critical systems and infrastructures, such as medical monitoring, autonomous vehicles, and water treatment systems. It is thus critical to effectively test them. However, it is not easy to obtain test cases which can fail the CPS. In this work, we propose a failure-inducing input generation approach FIGCPS for CPS, which requires no knowledge of the CPS under test or any history logs of the CPS which are usually hard to obtain. Our approach adopts deep reinforcement learning techniques, which interact with the CPS under test and effectively search for failure-inducing input guided by rewards. Our approach adaptively collects information from the CPS, which reduces the training time and is also able to explore different states. Moreover, our approach considers both continuous action space and large-dimension discrete action space, which are common for CPS systems. The evaluation results show that FIGCPS not only achieves a higher success rate than the state-of-the-art approach, but also finds two new attacks in a well-tested CPS.
In the context of Model-Driven Engineering applied to video games, software models are high-level abstractions that represent source code implementations of varied content such as the stages of the game, vehicles, or enemy entities (e.g., final bosses).
In this work, we present our Evolutionary Model Generation (EMoGen) approach to generate software models that are comparable in quality to the models created by human developers. Our approach is based on an evolution (mutation and crossover) and assessment cycle to generate the software models. We evaluated the software models generated by EMoGen in the Kromaia video game, which is a commercial video game released on Steam and PlayStation 4. Each model generated by EMoGen has more than 1000 model elements.
The results, which compare the software models generated by our approach with those created by the developers, show that our approach produces models comparable to the ones created manually by the developers for the retail and digital versions of the video game case study. However, our approach takes only five hours of unattended time, in comparison to ten months of work by the developers. We perform a statistical analysis, and we make an implementation of EMoGen readily available.
The present work is aligned with Model-Driven Engineering ideas and it considers that models represent High-Level Abstractions corresponding to a certain source code. This work deals with commercial video game development, but it is focused on Software Engineering applied to video games, or Game Software Engineering (GSE).
In our work, the term “model” refers to Software Model and should not be confused with “mesh” or “polygon mesh”, the terms used in video games and computer graphics for the visual representation of 3D geometry/shapes. Therefore, the models studied in our research represent source code implementations and are not related to 3D visual data. The following figure (included in this work) shows a representation of the architecture for Kromaia, the commercial video game case study that was used in our research, and the role of the models present in such architecture:
https://bitbucket.org/svitusj/emogen/src/master/Figures/Figure-Architecture.pdf
Past works have addressed the differences between classical Software Engineering and Game Software Engineering. Works in the latter area mainly focused on issues like requirements traceability, and some of them applied Model-Driven Engineering to video games by studying the generation of source code from software models. Our work, however, explores a different direction and addresses the generation of the software models themselves. This paper claims that it is possible to produce human-competitive software models in the context of video games, with these models specifying new video game content, which helps accelerate the development of video games. We present an approach that includes the encoding and manipulation of software models and uses an evolutionary algorithm to generate software models in less (unattended) time than video game developers. We evaluate the software models produced by our approach and by human developers in a PlayStation 4 / PC commercial video game case study.
The behavior of a cyber-physical system (CPS) is usually defined in terms of the input and output signals processed by sensors and actuators. Requirements specifications of CPSs are typically expressed using signal-based temporal properties. Expressing such requirements is challenging, because of (1) the many features that can be used to characterize a signal behavior; (2) the broad variation in expressiveness of the specification languages (i.e., temporal logics) used for defining signal-based temporal properties. Thus, system and software engineers need effective guidance on selecting appropriate signal behavior types and an adequate specification language, based on the type of requirements they have to define. In this paper, we present a taxonomy of the various types of signal-based properties and provide, for each type, a comprehensive and detailed description as well as a formalization in a temporal logic. Furthermore, we review the expressiveness of state-of-the-art signal-based temporal logics in terms of the property types identified in the taxonomy. Moreover, we report on the application of our taxonomy to classify the requirements specifications of an industrial case study in the aerospace domain, in order to assess the feasibility of using the property types included in our taxonomy and the completeness of the latter.
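As a hedged illustration (with hypothetical signal names and thresholds, not drawn from the aerospace case study), one recurring property type in such a taxonomy is bounded response, which a signal temporal logic can express as

$$\mathbf{G}_{[0,T]}\big(\mathit{altitude}(t) < h_{\min} \;\rightarrow\; \mathbf{F}_{[0,5]}\,\mathit{thrust}(t) > \tau\big)$$

i.e., throughout the horizon T, whenever the altitude signal drops below h_min, the thrust signal must exceed τ within 5 time units.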
Research Papers
Wed 17 Nov 2021 22:00 - 22:20 at Kangaroo - Analysis II Chair(s): Annibale Panichella Delft University of TechnologyJavaScript is one of the mainstream programming languages for client-side programming, server-side programming, and even embedded systems. The various JavaScript engines developed and maintained in diverse fields must conform to the syntax and semantics described in ECMAScript, the standard specification of JavaScript. Since an incorrect description in ECMAScript can lead to incorrect JavaScript engine implementations, checking the correctness of ECMAScript is critical. However, all specification updates are currently reviewed manually by Ecma Technical Committee 39 (TC39) without any automated tools. Moreover, in late 2014, the committee announced a yearly release cadence and an open development process for ECMAScript to adapt quickly to evolving development environments. Because of such frequent updates, checking the correctness of ECMAScript becomes ever more labor-intensive and error-prone.
To alleviate the problem, we propose JSTAR, a JavaScript Specification Type Analyzer using Refinement. It is the first tool that performs type analysis on JavaScript specifications and detects specification bugs. For a given specification, JSTAR first compiles each abstract algorithm, written in a structured natural language, to a corresponding function in IRES, an untyped intermediate representation for ECMAScript. Then, it performs type analysis on the compiled functions using the specification types defined in ECMAScript. Based on the result of the type analysis, JSTAR detects specification bugs using a bug detector consisting of four checkers. To increase the precision of the type analysis, we present condition-based refinement, which prunes out infeasible abstract states using the conditions of assertions and branches. We evaluated JSTAR on all 864 versions in the official ECMAScript repository over the recent three years, from 2018 to 2021. JSTAR took 137.3 seconds on average to perform type analysis for each version and detected 157 type-related specification bugs with 59.2% precision; 93 out of the 157 reported bugs are true bugs. Among them, 14 bugs were newly detected by JSTAR, and the committee confirmed them all.
Research Papers
Wed 17 Nov 2021 22:20 - 22:40 at Kangaroo - Analysis II Chair(s): Annibale Panichella Delft University of TechnologyMany recently proposed code clone detectors exploit neural networks to capture the latent semantics of source code, thus achieving impressive results for detecting semantic clones. These neural clone detectors rely on the availability of large amounts of labeled training data. We identify a key oversight in the current evaluation methodology for neural clone detection: cross-functionality generalization (i.e., detecting semantic clones whose functionalities are unseen in training). Specifically, we focus on this question: do neural clone detectors truly learn the ability to detect semantic clones, or do they just learn to model the specific functionalities in the training data and fail to generalize to realistic unseen functionalities? This paper investigates how this generalizability can be evaluated and improved.
Our contributions are three-fold: (1) We propose an evaluation methodology that can systematically measure the cross-functionality generalizability of neural clone detection. Based on this evaluation methodology, an empirical study is conducted, and the results indicate that current neural clone detectors do not generalize as well as expected. (2) We conduct an empirical analysis to understand the key factors that impact generalizability. We investigate three factors: training data diversity, vocabulary, and locality. The results show that the performance loss on unseen functionalities can be reduced by addressing the out-of-vocabulary problem and increasing training data diversity. (3) We propose a human-in-the-loop mechanism that helps adapt neural clone detectors to new code repositories containing many unseen functionalities. It improves annotation efficiency by combining transfer learning and active learning. Experimental results show that it reduces the amount of annotation by about 88%.
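A minimal sketch of the annotation-saving idea, assuming a detector that exposes a scikit-learn-style predict_proba over candidate clone pairs (uncertainty sampling alone is shown; the paper combines it with transfer learning):

```python
import numpy as np

def select_for_annotation(model, candidate_pairs, budget=100):
    """Pick the pairs the current detector is least sure about, so that
    scarce human labels go where they improve the model the most."""
    probs = model.predict_proba(candidate_pairs)[:, 1]   # P(clone)
    uncertainty = 1.0 - np.abs(probs - 0.5) * 2          # 1 at p=0.5
    return np.argsort(-uncertainty)[:budget]             # indices to label
```

Iterating label, retrain, and reselect is what drives the reported reduction in annotation effort.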
Research Papers
Wed 17 Nov 2021 22:40 - 23:00 at Kangaroo - Analysis II Chair(s): Annibale Panichella Delft University of TechnologySmart contracts are programs running on blockchains to execute transactions. When input constraints or security properties are violated at runtime, the transaction being executed by a smart contract needs to be reverted to avoid undesirable consequences. On Ethereum, the most popular blockchain that supports smart contracts, developers can choose among three transaction-reverting statements (i.e., require, if…revert, and if…throw) to handle anomalous transactions. While these transaction-reverting statements are vital for preventing smart contracts from exhibiting abnormal behaviors or suffering malicious attacks, there is limited understanding of how they are used in practice. In this work, we perform the first empirical study to characterize transaction-reverting statements in Ethereum smart contracts. We measured the prevalence of these statements in 3,866 verified smart contracts from popular dapps and built a taxonomy of their purposes by manually analyzing 557 transaction-reverting statements. We also compared template contracts and their corresponding custom contracts to understand how developers customize the use of transaction-reverting statements. Finally, we analyzed the security impact of transaction-reverting statements by removing them from smart contracts and comparing the mutated contracts against the original ones. Our study led to important findings. For example, we found that transaction-reverting statements are commonly used to perform seven types of authority verification or validity checks, and that missing such statements may compromise the security of smart contracts. We also found that current smart contract security analyzers cannot effectively handle transaction-reverting statements when detecting security vulnerabilities. Our findings can shed light on further research in the broad area of smart contract quality assurance and provide practical guidance to smart contract developers on the appropriate use of transaction-reverting statements.
Plenary
Wed 17 Nov 2021 23:00 - 00:00 at Kangaroo - MIP Talk 2 Chair(s): David Lo Singapore Management Universityno description available
Plenary
Thu 18 Nov 2021 08:00 - 09:00 at Kangaroo - Keynote - Laurie Williams Chair(s): Denys Poshyvanyk William and Mary
Abstract:
Software security lies at the intersection of software engineering and cybersecurity – building security into a product. Software security techniques focus on preventing the injection of vulnerabilities and detecting the vulnerabilities that make their way into a product or the deployment pipeline before the product is released. Increasingly, artificial intelligence is being used to power software security techniques to aid organizations in deploying secure products. This talk will present a landscape of research and practice at the intersection of software engineering, cybersecurity, and artificial intelligence to solve cybersecurity challenges. The talk will also present research projects conducted by the speaker’s own research group.
Biography:
Laurie Williams is a Distinguished University Professor in the Computer Science Department of the College of Engineering at North Carolina State University (NCSU). Laurie is a co-director of the NCSU Science of Security Lablet sponsored by the National Security Agency and of the NCSU Secure Computing Institute, and is the Principal Cybersecurity Technologist of the SecureAmerica Institute. Laurie's research focuses on software security; agile software development practices and processes, particularly continuous deployment; and software reliability, testing, and analysis. Laurie is an ACM Fellow and an IEEE Fellow.
Research Papers
Thu 18 Nov 2021 09:00 - 09:20 at Kangaroo - Development Chair(s): James C. Davis Purdue University, USA
To effectively utilize cloud computing, cloud practitioners and researchers require accurate knowledge of the performance of cloud applications. However, due to random performance fluctuations, obtaining accurate performance results in the cloud is extremely difficult. To handle this random fluctuation, prior research on cloud performance testing relied on a non-parametric statistical tool called bootstrapping to design stop criteria. However, in this paper, we show that the basic bootstrapping employed by prior work overlooks the internal dependency within cloud performance test data, which leads to inaccurate performance results.
We then present Metior, a novel automated cloud performance testing methodology designed on top of three statistical tools: block bootstrapping, the law of large numbers, and autocorrelation. These tools allow Metior to properly account for the internal dependency within cloud performance test data. They also provide better coverage of cloud performance fluctuation and reduce testing cost. Experimental evaluation on two public clouds showed that 98% of Metior's tests could provide performance results with less than 3% error. Metior also significantly outperformed existing cloud performance testing methodologies in terms of accuracy and cost – with up to a 14% increase in the accurate-test count and up to a 3.1-fold reduction in testing cost.
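To illustrate why block bootstrapping suits autocorrelated measurements, here is a minimal moving-block bootstrap of a mean in Python (an illustrative sketch only, not Metior's actual procedure; the block length of 20 is an arbitrary assumption):

```python
import numpy as np

def moving_block_bootstrap_mean(samples, block_len=20, n_resamples=1000, seed=0):
    """Resample contiguous blocks so short-range dependency is preserved."""
    rng = np.random.default_rng(seed)
    samples = np.asarray(samples)
    n = len(samples)
    blocks_per_resample = int(np.ceil(n / block_len))
    means = np.empty(n_resamples)
    for i in range(n_resamples):
        starts = rng.integers(0, n - block_len + 1, size=blocks_per_resample)
        blocks = [samples[s:s + block_len] for s in starts]
        means[i] = np.concatenate(blocks)[:n].mean()
    return means  # e.g., np.percentile(means, [2.5, 97.5]) gives a 95% CI
```

A naive bootstrap instead resamples individual points, which silently assumes the measurements are independent.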
New Ideas and Emerging Results (NIER) track
Thu 18 Nov 2021 09:20 - 09:30 at Kangaroo - Development Chair(s): James C. Davis Purdue University, USA
Privacy requirements have become increasingly important as information about us is continuously accumulated and digitally stored. However, despite the many proposed methodologies and tools for addressing these requirements, privacy is often poorly addressed in most domains of the software industry. Two of the major reasons underlying this underperformance are (1) the low expertise and understanding of privacy among the two main actors in requirements engineering, users and analysts, and (2) the fact that software developers often do not perceive privacy requirements as a priority for their companies, thus neglecting to meet these requirements even when they do have the required knowledge, skills, and supporting tools. To address these two problems, we designed PR1SED (Privacy Requirements as 1st class citizens in SoftwarE Development), an iterative, customizable, socio-technical environment. PR1SED integrates knowledge from software engineering and organizational psychology to better address privacy requirements during system design. It welds technical tools for eliciting, modeling, and designing privacy aspects, addressing the knowledge gap of both data subjects and analysts, with social mechanisms for achieving a supportive and sustainable organizational privacy climate within a company, thus reorienting organizational attention and engagement toward privacy requirements. This work-in-progress paper presents the framework we developed to build PR1SED and discusses how the different components of the environment will be developed.
Industry Showcase
Thu 18 Nov 2021 09:30 - 09:40 at Kangaroo - Development Chair(s): James C. Davis Purdue University, USA
The Java virtual machine (JVM) has well-known slow startup and warmup issues, because the JVM must dynamically create a large amount of runtime data before reaching peak performance, including class metadata, method profile data, and just-in-time (JIT) compiled native code, for each run of even the same application. Many techniques have therefore been proposed to reuse and share this runtime data across runs. For example, Class Data Sharing (CDS) and ahead-of-time (AOT) compilation save and share class metadata and compiled native code, respectively. Unfortunately, these techniques were developed independently and cannot leverage each other's abilities well. This paper presents an approach that systematically reuses JVM runtime data to accelerate application startup and warmup. We first propose and implement JWarmup, a technique that records and reuses JIT compilation data (e.g., compiled methods and their profile data). Then, we feed the JIT compilation data to the AOT compiler to perform profile-guided optimization (PGO). We also integrate existing CDS and AOT techniques to further optimize application startup. Evaluation on real-world applications shows that our approach brings a 41.35% improvement to application startup. Moreover, our approach can trigger JIT compilation in advance and reduce CPU load at peak time.
Doctoral Symposium
Mon 15 Nov 2021 11:15 - 11:30 at Wombat - DS Session 2
Android apps are developed using a Software Development Kit (SDK), whose application programming interface (API) enables app developers to harness the functionalities of Android devices by interacting with services and hardware. However, the API frequently evolves together with its associated SDK. A mismatch between the API level supported by the device on which an app is installed and the API level targeted by the app's developers can induce compatibility issues. These issues can manifest as unexpected behaviors, including runtime crashes, creating a poor user experience. Recent studies have investigated API evolution to ensure the reliability of Android apps; however, they leave room for improvement. This work aims to establish novel methodologies that improve on state-of-the-art compatibility-issue detection and testing approaches.
Doctoral Symposium
Mon 15 Nov 2021 09:35 - 09:50 at Wombat - DS Session 1
Effectively locating and fixing defects requires detailed defect reports. Unlike traditional software systems, machine learning applications are subject to defects caused by changes in the input data streams (concept drift) and by assumptions encoded into models. Without appropriate training, developers face difficulties understanding and interpreting faults in machine learning (ML). However, little research has been done on how to prepare developers to detect and investigate machine learning system defects. Software engineers often do not have sufficient knowledge to fix the issues themselves without the help of data scientists or domain experts. To investigate this issue, we analyse issue templates and examine how developers report machine-learning-related issues in open-source applied AI projects. The overall goal is to develop a tool for automatically repairing ML defects or generating defect reports if a fix cannot be made. Previous research has identified classes of faults specific to machine learning systems, such as performance degradation arising from concept drift, where the machine learning model is no longer aligned with the real-world environment. However, the issue templates that developers currently use do not seem to capture the information needed. This research seeks to systematically develop a two-way human-machine information exchange protocol to support domain experts, software engineers, and data scientists in collaboratively detecting, reporting, and responding to these new classes of faults.
Doctoral Symposium
Mon 15 Nov 2021 09:20 - 09:35 at Wombat - DS Session 1
Experimentation plays an important role in the work of data scientists: exploring unfamiliar problem domains, answering questions from data, and developing diverse machine learning applications. Good experimentation requires creativity, builds on prior results, and is informed by the literature. However, finding relevant information from relevant sources to guide experimentation causes inefficiencies in the experimentation process of data scientists. The objective of this research is to help data scientists through the presentation of context-aware ranked data science experiments, considering the problem domain, development task, and learning task. Data science experiments for this study were extracted from publicly available interactive notebooks and manually annotated based on a taxonomy of data science techniques and a meta-model of a data science experiment. Further, a ranking algorithm was developed for data science experiments given a problem domain and development task. As a result, a tool was developed to demonstrate context-aware ranked data science experiments for problem domains such as natural language processing, computer vision, and time series, and for development stages such as feature engineering and model selection. This study shows that tools and techniques can be designed to be much more data-science-context aware than existing software engineering tools. This study supports these efforts by providing knowledge that can improve the experimentation process of data scientists.
Late Breaking Results
Thu 18 Nov 2021 10:06 - 10:08 at Kangaroo - LBR + DS Poster (2) (Thursday 21:00 - 00:00) Chair(s): Xiaoyin Wang University of Texas at San Antonio
Fuzzing has been a widely used technique for discovering software vulnerabilities. Many existing fuzzers leverage coverage feedback to evolve seeds so as to maximize (optimize) program branch coverage. Recently, some techniques have proposed training deep learning models to predict the branch coverage of an arbitrary input. These techniques have demonstrated success in improving coverage and discovering bugs under different experimental settings. However, deep learning models, usually black boxes, are notoriously lacking in explainability. Moreover, their performance can be sensitive to the runtime coverage information collected for training, indicating potentially unstable performance. To understand how reliable deep learning models are for fuzzing, and why, we conduct a systematic and extensive empirical study of 4 types of deep learning models across 6 projects to reproduce the actual performance of deep-learning-based fuzzers, analyze the advantages and disadvantages of deep learning in the fuzzing process, and explore future directions for combining the two. Our empirical results reveal that deep learning models are effective only in very limited scenarios, largely restrained by training data imbalance, dependent labels, model over-generalization, and the insufficient expressiveness of state-of-the-art models. Consequently, the gradients estimated by the models to cover a branch can be less helpful in many scenarios.
Late Breaking Results
Thu 18 Nov 2021 10:08 - 10:10 at Kangaroo - LBR + DS Poster (2) (Thursday 21:00 - 00:00) Chair(s): Xiaoyin Wang University of Texas at San Antonio
Using third-party executable components to build control systems, which is quite common, poses challenges for verification because the informal behavior descriptions that typically accompany the components often fall short of the needed rigor. Consequently, there is a need to formalize a component contract that is strong enough to help establish system properties and also weak enough to account for all potential component behaviors in the system's context. In this paper, we present a novel approach that allows an analyst to hypothesize a component contract, explore whether the component meets the contract, and, if not, obtain automated support to help repair the contract. Preliminary results show that, in more than 32% of the cases, the repaired contract is logically equivalent to a developer-written one; in a further 63% of cases, it is a distinct, valid, and non-trivial property.
Late Breaking Results
Thu 18 Nov 2021 10:10 - 10:12 at Kangaroo - LBR + DS Poster (2) (Thursday 21:00 - 00:00) Chair(s): Xiaoyin Wang University of Texas at San Antonio
Code summarization aims to generate brief natural language descriptions for source code. As source code is highly structured and follows strict programming language grammars, its Abstract Syntax Tree (AST) is often leveraged to inform the encoder about the structural information. However, ASTs are usually $5\sim15$ times longer than the source code itself. Current approaches ignore this size blow-up and simply feed the whole linearized AST into the encoder. To address this problem, we propose AST-Transformer to efficiently encode tree-structured ASTs. Experiments show that AST-Transformer outperforms the state of the art by a substantial margin while reducing $90\sim95\%$ of the computational complexity in the encoding process.
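To see the size blow-up concretely, the following Python sketch compares the token count and AST node count of a small snippet (illustrative only; the paper measures linearized ASTs of Java code, and the exact ratio depends on the grammar):

```python
import ast
import io
import tokenize

src = """\
def add(a, b):
    total = a + b
    return total
"""

# Keep only "real" code tokens, dropping layout-only token types.
significant = [
    t for t in tokenize.generate_tokens(io.StringIO(src).readline)
    if t.type not in (tokenize.NL, tokenize.NEWLINE, tokenize.INDENT,
                      tokenize.DEDENT, tokenize.ENDMARKER)
]
nodes = list(ast.walk(ast.parse(src)))
print(len(significant), "code tokens vs", len(nodes), "AST nodes")
```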
Doctoral Symposium
Mon 15 Nov 2021 10:30 - 10:45 at Wombat - DS Session 2
We propose an automated pipeline for analyzing privacy leaks in Android applications. By using a combination of dynamic and static analysis, we cross-validate the results of each to improve accuracy. Compared to state-of-the-art approaches, we not only capture network traffic for analysis but also look into the data flows inside the application. We particularly focus on privacy leakage caused by third-party services and high-risk permissions. The proposed automated approach combines taint analysis, permission analysis, network traffic analysis, and dynamic function tracing at runtime to identify private information leaks. We further implement an automatic validation and complementation process to reduce false positives. A small-scale experiment has been conducted on 30 Android applications, and a large-scale experiment on more than 10,000 Android applications is in progress.
Late Breaking Results
Thu 18 Nov 2021 10:14 - 10:16 at Kangaroo - LBR + DS Poster (2) (Thursday 21:00 - 00:00) Chair(s): Xiaoyin Wang University of Texas at San Antonio
Deep Neural Networks (DNNs) are known to be vulnerable to adversarial samples, whose detection is crucial for the wide application of DNN models. Recently, a number of deep testing methods from software engineering were proposed to find vulnerabilities in DNN systems, and one of them, Model Mutation Testing (MMT), has been used to successfully detect various adversarial samples generated by different kinds of adversarial attacks. However, the mutated models in MMT are always huge in number (e.g., over 100 models) and lack diversity (e.g., they can easily be circumvented by high-confidence adversarial samples), which makes MMT less efficient in real applications and less effective at detecting high-confidence adversarial samples. In this study, we propose Graph-Guided Testing (GGT) for adversarial sample detection to overcome these challenges. GGT generates pruned models under the guidance of graph characteristics: each pruned model has only about 5% of the parameters of a mutated model in MMT, and the pruned models have higher diversity. Initial experiments on CIFAR10 validate that GGT performs much better than MMT with respect to both effectiveness and efficiency.
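As a rough illustration of the kind of model pruning GGT builds on, here is a minimal PyTorch sketch that randomly prunes each linear layer to about 5% of its weights (illustrative only; GGT's graph-guided choice of what to prune is the paper's actual contribution):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# Zero out 95% of each Linear layer's weights, keeping ~5% active.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.random_unstructured(module, name="weight", amount=0.95)
        prune.remove(module, "weight")  # make the pruning permanent

nonzero = sum((p != 0).sum().item() for p in model.parameters())
total = sum(p.numel() for p in model.parameters())
print(f"{nonzero}/{total} parameters remain non-zero")
```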
Late Breaking Results
Thu 18 Nov 2021 10:16 - 10:18 at Kangaroo - LBR + DS Poster (2) (Thursday 21:00 - 00:00) Chair(s): Xiaoyin Wang University of Texas at San Antonio
Microservice design offers many advantages for enterprise applications, including increased scalability and faster deployment times. These advantages stem from microservices' independence from one another in development and deployment. This separation, however, results in the absence of a centralized view of the application's functionality, and each microservice's data model is isolated and replicated. As a result, the system can deviate from the architectural design's original intent. To address this, we offer a method for analyzing a microservice mesh and generating a communication diagram, a context map, and microservice-specific bounded contexts using static code analysis.
Late Breaking Results
Thu 18 Nov 2021 10:18 - 10:20 at Kangaroo - LBR + DS Poster (2) (Thursday 21:00 - 00:00) Chair(s): Xiaoyin Wang University of Texas at San Antonio
When approaching a clustering problem, choosing the right clustering algorithm and parameters is essential, as each clustering algorithm is proficient at finding clusters of a particular nature. Due to the unsupervised nature of clustering algorithms, no ground-truth values are available for empirical evaluation, which makes automating the parameter-selection process through hyperparameter tuning difficult. Previous approaches to hyperparameter tuning for clustering algorithms have relied on internal metrics, which are often biased towards certain algorithms, or on having some ground-truth labels available, moving the problem into the semi-supervised space. This preliminary study proposes a framework for semi-automated hyperparameter tuning of clustering problems, using a grid search to develop a series of graphs and easy-to-interpret metrics that can then be used for more efficient domain-specific evaluation. Preliminary results show that internal metrics are unable to capture the semantic quality of the developed clusters and that approaches driven by internal metrics would reach different conclusions than those driven by manual evaluation.
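For concreteness, here is a minimal scikit-learn sketch of a grid search over DBSCAN parameters scored by an internal metric, the silhouette score (this is only the generic baseline the study builds on, not its full framework):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

results = []
for eps in np.arange(0.3, 2.1, 0.3):
    for min_samples in (3, 5, 10):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        if n_clusters >= 2:  # silhouette needs at least two clusters
            results.append((silhouette_score(X, labels), eps, min_samples))

for score, eps, min_samples in sorted(results, reverse=True)[:3]:
    print(f"silhouette={score:.3f} eps={eps:.1f} min_samples={min_samples}")
```

The study's point is precisely that the top-scoring configuration under an internal metric need not be the semantically best one, which motivates its human-in-the-loop graphs and metrics.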
Late Breaking Results
Thu 18 Nov 2021 10:20 - 10:22 at Kangaroo - LBR + DS Poster (2) (Thursday 21:00 - 00:00) Chair(s): Xiaoyin Wang University of Texas at San Antonio
Business process mining of a large-scale project has many benefits, such as finding vulnerabilities, improving processes, collecting data for data science, and generating clearer and simpler representations. The general approach to process mining is to turn event data, such as application logs, into insights and actions. However, gathering logs broad enough to depict the whole business logic of a large project can become very costly due to difficult environment setup, unavailability of users, and the presence of unreachable or hardly reachable log statements. Using static source code analysis to extract log statements and arrange them in their runtime execution order is a potential way to solve this problem and reduce the cost of business process mining.
Doctoral Symposium
Mon 15 Nov 2021 11:00 - 11:15 at Wombat - DS Session 2
Binary code similarity detection detects the similarity of code at the binary (assembly) level, without source code. Existing works have limitations when dealing with mutated binary code generated by different compilation options. In this paper, we propose a novel approach to address this problem. By inspecting binary code, we found that, within a function, some instructions generally calculate (prepare) values for other instructions. We define the latter instructions as key instructions. Currently, we define four categories of key instructions: subfunction calls, comparison instructions, return instructions, and memory-store instructions. Thus, if we symbolically execute similar binary codes, the symbolic values at these key instructions are expected to be similar. We implement a prototype tool that works in three steps: first, it symbolically executes the binary code; second, it extracts the symbolic values at the defined key instructions into a graph; last, it compares symbolic graph similarity. In our implementation, we also address several problems, including path explosion and loop handling.
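As a rough illustration of the key-instruction idea, here is a toy Python classifier over x86 mnemonics (purely illustrative; the actual approach compares symbolic values gathered at these instructions during symbolic execution, not mnemonics alone):

```python
COMPARISONS = {"cmp", "test"}

def key_instruction_category(mnemonic, writes_memory=False):
    """Map an instruction to one of the four key categories, or None."""
    if mnemonic == "call":
        return "subfunction call"
    if mnemonic in COMPARISONS:
        return "comparison"
    if mnemonic == "ret":
        return "return"
    if writes_memory:          # e.g., mov [rbp-8], rax
        return "memory store"
    return None

for insn in (("call", False), ("cmp", False), ("mov", True), ("add", False)):
    print(insn[0], "->", key_instruction_category(*insn))
```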
Late Breaking Results
Thu 18 Nov 2021 10:24 - 10:26 at Kangaroo - LBR + DS Poster (2) (Thursday 21:00 - 00:00) Chair(s): Xiaoyin Wang University of Texas at San Antonio
This paper proposes a new mutation operator that uses a neural network to generate plausible code elements, improving the performance of mutation-based fault localization (MBFL) techniques on omission faults. Unlike existing mutation operators that modify or remove existing code elements, the proposed operator inserts new code elements at a mutation site using a trained neural network model. We extended MUSE to use the proposed mutation operator and conducted a case study with 3 omission faults found in JFreeChart of Defects4J. As a result, the accuracy of the mutation-based fault localization technique with the new mutation operator increased significantly for all three faults.
Late Breaking Results
Thu 18 Nov 2021 10:26 - 10:28 at Kangaroo - LBR + DS Poster (2) (Thursday 21:00 - 00:00) Chair(s): Xiaoyin Wang University of Texas at San Antonio
Although there are numerous classifications of technical debt based on various criteria, code debt, or code smells, is a category that appears in the majority of current research. Primary causes of code debt include the urgency to deliver software quickly as well as bad coding practices. Among many approaches, static code analysis has received the most attention in studies that detect code smells/code debt. However, most of these studies examine a single programming language, although today's software companies utilize many development stacks with various languages and tools. This problem can be addressed by detecting code debt from issue/ticket cards. This paper presents a method for detecting code debt by applying natural language processing to issue tickets. It also proposes a method, implemented using git mining, for calculating the average amount of time that a code debt was present in the software.
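To make the debt-lifetime calculation concrete, here is a minimal Python sketch that averages the time between the commit introducing a debt and the commit removing it (the commit-timestamp pairs below are hypothetical inputs; mining them from a real repository is the part the paper automates):

```python
from datetime import datetime
from statistics import mean

# (introduced, removed) commit timestamps for each detected code debt;
# in practice these would be mined from `git log` for the affected lines.
debt_lifetimes = [
    (datetime(2021, 1, 4), datetime(2021, 3, 18)),
    (datetime(2021, 2, 1), datetime(2021, 2, 9)),
    (datetime(2021, 5, 20), datetime(2021, 9, 2)),
]

avg_days = mean((removed - introduced).days
                for introduced, removed in debt_lifetimes)
print(f"average code-debt lifetime: {avg_days:.1f} days")
```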
Late Breaking Results
Thu 18 Nov 2021 10:28 - 10:30 at Kangaroo - LBR + DS Poster (2) (Thursday 21:00 - 00:00) Chair(s): Xiaoyin Wang University of Texas at San Antonio
The need for cyber resilience is increasingly important in our technology-dependent society, where computing systems, devices, and data will continue to be the target of cyber-attackers. Hence, we propose a conceptual 'Human-in-the-Loop Explainable-AI-Enabled Vulnerability Detection, Investigation, and Mitigation' (HXAI-VDIM) system. Specifically, instead of resolving complex security-vulnerability scenarios purely as the output of an AI/ML model, we integrate the security analyst or forensic investigator into the man-machine loop and leverage explainable AI (XAI) to combine AI with an Intelligence Assistant (IA) that amplifies human intelligence in both proactive and reactive processes. Our goal is for HXAI-VDIM to integrate human and machine in an interactive and iterative loop with security visualization that utilizes human intelligence to guide the XAI-enabled system and generate refined solutions.
Doctoral Symposium
Mon 15 Nov 2021 09:05 - 09:20 at Wombat - DS Session 1
Software requirements Change Impact Analysis (CIA) is a pivotal process in requirements engineering (RE), since changes to requirements are inevitable. When a requirement change is requested, its impact on all software artefacts has to be investigated to accept or reject the request. Manually performed CIA in large-scale software development is time-consuming and error-prone, so automating this analysis can improve the process of requirements change management. The main goal of this research is to apply a combination of Machine Learning (ML) and Natural Language Processing (NLP) based approaches to develop a prediction model for forecasting the impact of a requirement change on the other requirements in the specification document. The proposed prediction model will be evaluated for accuracy and performance using appropriate datasets. The resulting tool will support project managers in performing automated change impact analysis and making informed decisions on the acceptance or rejection of requirement change requests.
Doctoral Symposium
Mon 15 Nov 2021 09:50 - 10:05 at Wombat - DS Session 1
Software developers embed logging statements in source code as an imperative duty of modern software development, since log files are necessary for tracking down runtime system issues and troubleshooting system management tasks. Prior research has emphasized the importance of logging statements in the operation and debugging of software systems. However, the current logging process is mostly manual and ad hoc; thus, proper placement and content of logging statements remain challenges. To overcome these challenges, methods that aim to automate log placement and log content, i.e., 'where, what, and how to log', are of high interest. Thus, in this research, we propose to accomplish the goal of this research, that is, "to predict log statements by utilizing source code clones and natural language processing (NLP)", through the following four research objectives: (RO1) investigate whether source code clones can be leveraged for log statement location prediction, (RO2) propose a clone-based approach for log statement prediction, (RO3) predict a log statement's description with code-clone and NLP models, and (RO4) examine approaches to automatically predict additional details of a log statement, such as its verbosity level and variables. For this purpose, we perform an experimental analysis on seven open-source Java projects, extract their method-level code clones, investigate their attributes, and utilize them for log location and description prediction. Our work demonstrates the effectiveness of log-aware clone detection for automated log location and description prediction and outperforms prior work.
Research Papers
Thu 18 Nov 2021 11:00 - 11:20 at Kangaroo - Vulnerability Chair(s): Xusheng Xiao Case Western Reserve University
Following the coordinated vulnerability disclosure model, a vulnerability in open source software (OSS) is suggested to be fixed "silently", without disclosing the fix until the vulnerability itself is disclosed. Yet, it is crucial for OSS users to be aware of vulnerability fixes as early as possible: once a fix is pushed to the source code repository, a malicious party could probe for the corresponding vulnerability to exploit it. In practice, OSS users often rely on vulnerability disclosure information from security advisories (e.g., the National Vulnerability Database) to sense vulnerability fixes. However, the time between the availability of a vulnerability fix and its disclosure can vary from days to months and, in some cases, even years. Due to manpower constraints and the lack of expert knowledge, it is infeasible for OSS users to manually analyze all code changes for vulnerability fix detection. Therefore, it is essential to identify vulnerability fixes automatically and promptly. In a first-of-its-kind study, we propose VulFixMiner, a Transformer-based approach capable of automatically extracting semantic meaning from commit-level code changes to identify silent vulnerability fixes. We construct our model using sampled commits from 204 projects and evaluate it using the full set of commits from 52 additional projects. The evaluation results show that VulFixMiner outperforms various state-of-the-art baselines in terms of AUC (i.e., 0.81 and 0.73 on the Java and Python datasets, respectively) and two effort-aware performance metrics (i.e., EffortCost and P$_{opt}$). In particular, with an effort of inspecting 5% of total LOC, VulFixMiner can identify 49% of total vulnerability fixes. Additionally, through manual verification of sampled commits that were identified as vulnerability fixes but not marked as such in our dataset, we observe that 35% (29 out of 82) of the commits do fix vulnerabilities, indicating that VulFixMiner is also capable of identifying unreported vulnerability fixes.
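As an illustration of how such effort-aware evaluation works, here is a minimal Python sketch computing the recall of vulnerability fixes within a LOC inspection budget (a generic sketch only; EffortCost and P$_{opt}$ have more refined definitions in the literature):

```python
def recall_at_loc_budget(commits, budget=0.05):
    """`commits`: list of (predicted_score, loc_changed, is_vuln_fix) tuples.

    Inspect commits in descending score order until `budget` of the total
    LOC is spent, then report the fraction of all vulnerability fixes found.
    """
    total_loc = sum(loc for _, loc, _ in commits)
    total_fixes = sum(1 for _, _, fix in commits if fix)
    inspected_loc, found = 0, 0
    for score, loc, fix in sorted(commits, reverse=True):
        if inspected_loc + loc > budget * total_loc:
            break
        inspected_loc += loc
        found += fix
    return found / total_fixes if total_fixes else 0.0

print(recall_at_loc_budget(
    [(0.9, 120, True), (0.7, 300, False), (0.4, 50, True)], budget=0.5))
```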
Research Papers
Thu 18 Nov 2021 11:20 - 11:40 at Kangaroo - Vulnerability Chair(s): Xusheng Xiao Case Western Reserve University
It is increasingly suggested to identify Software Vulnerabilities (SVs) in code commits to give early warnings about potential security risks. However, little effort has gone into assessing vulnerability-contributing commits right after they are detected, to provide timely information about the exploitability, impact, and severity of SVs. Such information is important for planning and prioritizing the mitigation of the identified SVs. We propose a novel deep multi-task learning model, DeepCVA, to automate seven commit-level vulnerability assessment tasks simultaneously, based on Common Vulnerability Scoring System (CVSS) metrics. We conduct large-scale experiments on 1,229 vulnerability-contributing commits containing 542 different SVs in 246 real-world software projects to evaluate the effectiveness and efficiency of our model. We show that DeepCVA is the best-performing model, with a 38% to 59.8% higher Matthews Correlation Coefficient than many supervised and unsupervised baseline models. DeepCVA also requires 6.3 times less training and validation time than seven cumulative assessment models, leading to significantly lower model maintenance cost. Overall, DeepCVA presents the first effective and efficient solution for automatically assessing SVs early in software systems.
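The core idea of multi-task assessment can be sketched as a shared encoder feeding several task-specific heads. Below is a minimal PyTorch illustration (the layer sizes and the simple bag-of-tokens encoder are arbitrary assumptions, not DeepCVA's actual architecture):

```python
import torch
import torch.nn as nn

class MultiTaskAssessor(nn.Module):
    """One shared commit encoder, seven classification heads (one per CVSS metric)."""

    def __init__(self, vocab_size=5000, dim=128, classes_per_task=(3,) * 7):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.EmbeddingBag(vocab_size, dim),  # bag-of-tokens commit encoder
            nn.Linear(dim, dim),
            nn.ReLU(),
        )
        self.heads = nn.ModuleList([nn.Linear(dim, c) for c in classes_per_task])

    def forward(self, token_ids):
        shared = self.encoder(token_ids)
        return [head(shared) for head in self.heads]  # one logit vector per task

model = MultiTaskAssessor()
logits = model(torch.randint(0, 5000, (4, 60)))  # batch of 4 tokenized commits
loss = sum(nn.functional.cross_entropy(l, torch.zeros(4, dtype=torch.long))
           for l in logits)  # joint loss over the seven tasks (dummy labels)
```

Sharing the encoder is what lets one training run replace seven separate models, which is where the reported training-time savings come from.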
Research Papers
Thu 18 Nov 2021 12:00 - 12:20 at Kangaroo - Debt and Refactoring Chair(s): Yuan Tian Queens University, Kingston, Canada
Software refactoring is the process of restructuring existing code to improve software quality while preserving its behavior. In many cases, multiple refactorings must be applied together to correct quality issues such as code smells. While such collections include refactorings that depend on other refactorings (i.e., one cannot be applied without also applying another), existing refactoring recommendation tools generate solutions that include invalid refactorings because they do not account for dependencies among refactorings. Consequently, developers prefer manually applying refactorings over using such tools. A key contributor to this problem is that search-based refactoring approaches, which are widely adopted for recommending refactorings, employ random change operators (e.g., crossover and mutation) to evolve solutions without considering the dependencies among refactorings. In this paper, we propose intelligent change operators and integrate them into a multi-objective search algorithm to recommend valid refactorings that address conflicting quality objectives. The proposed intelligent crossover and mutation operators incorporate refactoring dependencies to avoid creating invalid refactorings or invalidating existing ones. Further, the intelligent crossover operator is augmented to create offspring that improve solution quality by exchanging blocks of valid refactorings that improve a solution's weakest objectives. We used our intelligent change operators to generate refactoring recommendations for four widely used open-source projects. The results show that our intelligent change operators improve the diversity of solutions, accelerate solution convergence, reduce the number of invalid refactorings by up to 71.52% compared to existing search-based refactoring approaches, and increase the quality of the solutions. Our approach outperformed state-of-the-art search-based refactoring approaches and an existing deterministic refactoring tool based on manual validation by developers, with an average manual correctness, precision, and recall of 0.89, 0.82, and 0.87, respectively.
Research Papers
Thu 18 Nov 2021 12:20 - 12:40 at Kangaroo - Debt and Refactoring Chair(s): Yuan Tian Queens University, Kingston, Canada
Software containers, such as Docker, have recently come to be considered the mainstream technology for providing reusable software artifacts. Developers can easily build and deploy their applications based on the large number of reusable Docker images that are publicly available. Thus, a current popular trend in industry is to move towards containerizing applications. However, container-based projects comprise different components, including Docker and Docker-compose files as well as several other dependencies in the source code that combines the different containers and facilitates interactions with them. Like any other complex systems, container-based projects are prone to various quality and technical debt issues related to their different artifacts: Docker files, Docker-compose files, and regular source code. Unfortunately, there is a knowledge gap regarding how container-based projects actually evolve and are maintained.
In this paper, we address the above gap by studying refactorings, i.e., structural changes that preserve behavior, applied in open-source Docker projects, and the technical debt issues they alleviate. We analyzed 68 projects, consisting of 19.5 MLOC, along with 193 manually examined commits. The results indicate that developers refactor these Docker projects for a variety of reasons specific to the configuration, combination, and execution of containers, leading to several new technical debt categories and refactoring types compared to existing refactoring domains. For instance, refactorings for reducing the image size of Dockerfiles, improving the extensibility of Docker-compose files, and regular source code refactorings are mainly associated with the evolution of Docker and Docker-compose files. We also introduced 24 new Docker-specific refactorings and corresponding technical debt categories, and defined different best practices. The implications of this study will assist practitioners, tool builders, and educators in improving the quality of Docker projects.
Tool Demonstrations
Thu 18 Nov 2021 12:40 - 12:45 at Kangaroo - Debt and Refactoring Chair(s): Yuan Tian Queens University, Kingston, Canada
Self-Admitted Technical Debt (SATD) is a special form of technical debt in which developers intentionally record their hacks in the code by adding comments that call for attention. Here, we focus on issue-related "On-hold SATD", where developers suspend proper implementation due to issues reported inside or outside the project. When the referenced issues are resolved, the On-hold SATD also needs to be addressed, but since monitoring these issue reports takes a lot of time and effort, developers may not be aware of the resolved issues and may leave the On-hold SATD in the code. In this paper, we propose FixMe, a GitHub bot that helps developers detect and monitor On-hold SATD in their repositories and notifies them whenever an On-hold SATD is ready to be fixed (i.e., the referenced issue is resolved). The bot automatically detects On-hold SATD comments in source code using machine learning techniques and discovers the referenced issues. When the referenced issues are resolved, developers are notified by the FixMe bot. An evaluation conducted with 11 participants shows that FixMe can support them in dealing with On-hold SATD. FixMe is available at https://www.fixmebot.app/ and FixMe's video is at https://youtu.be/e9JYsYGuRCw.
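A toy version of the detection-and-monitoring loop can be sketched in a few lines of Python: a regex flags candidate On-hold SATD comments that reference an issue, and the GitHub REST API reports whether that issue is closed (FixMe's real detector is ML-based; the regex, repository, and issue number below are illustrative stand-ins):

```python
import re
import requests

ON_HOLD = re.compile(r"(?:TODO|FIXME|HACK).*?(?:issue\s*)?#(\d+)", re.IGNORECASE)

def referenced_issue(comment: str):
    """Return the referenced issue number if the comment looks like On-hold SATD."""
    match = ON_HOLD.search(comment)
    return int(match.group(1)) if match else None

def issue_resolved(owner: str, repo: str, number: int) -> bool:
    url = f"https://api.github.com/repos/{owner}/{repo}/issues/{number}"
    return requests.get(url, timeout=10).json().get("state") == "closed"

comment = "// TODO: remove this workaround once issue #42 is fixed"
n = referenced_issue(comment)
if n is not None and issue_resolved("octocat", "Hello-World", n):
    print(f"On-hold SATD ready to fix: issue #{n} is resolved")
```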
Plenary
Thu 18 Nov 2021 13:00 - 14:00 at Kangaroo - Keynote - Karen Li Chair(s): John Grundy Monash University
Abstract:
There are many complex engineering challenges for an IT organisation delivering world-class products: the cognitive load and productivity of engineers, development at scale, the quality, uniformity, and compliance of the delivered products, and sustainable continuous delivery with high-calibre velocity and stability, to name a few. This talk will elaborate on the endeavours attempted in industry, with an emphasis on how automation has been put forward to help achieve engineering excellence.
Biography:
Karen Li is a Product Architect at Xero and previously a Lead Engineer there. As an engineer, Karen focused on applying sustainably excellent engineering practice and delivering capabilities to Xero customers and internal employees. As an architect, Karen focuses on providing context and connecting work with the wider strategies of the organisation. Karen has been in industry for 10+ years and is excited to bridge academic science and industry practice. Prior to industry, Karen had an academic career (PhD in Computer Science, University of Auckland, New Zealand); her research area was software automation led by domain-specific visual languages.
Research Papers
Thu 18 Nov 2021 18:00 - 18:20 at Kangaroo - Firmware Chair(s): Ingo Mueller Monash University
The Linux kernel is widely used in embedded systems. To understand practical threats to the Linux kernel, we need to perform dynamic analysis with a full-system emulator, e.g., QEMU. However, due to hardware fragmentation, e.g., various types of peripherals, most embedded systems are not currently supported by QEMU. Though some progress has been made on rehosting firmware, it mainly focuses on user-space programs or simple real-time operating systems.
The goal of this work is to boost the capability of rehosting embedded Linux kernels in QEMU. By doing so, dynamic analysis can, for the first time, be applied to embedded Linux kernels by leveraging off-the-shelf tools built on QEMU. Accordingly, we propose a new technique called model-guided kernel execution. It combines the peripheral abstractions in the Linux kernel with kernel-peripheral interactions to semi-automatically generate peripheral models, which are then used to synthesize new QEMU virtual machines for dynamic analysis.
We have implemented a prototype called FirmGuide. It generates 9 peripheral models with full functionality and 64 with minimum functionality, covering 26 SoCs. Our evaluation with 6,188 firmware images shows that it can successfully rehost more than 95% of the Linux kernels across 2 architectures and 22 versions. None of them can be rehosted in vanilla QEMU. The results of the LTP benchmark show the reliability and robustness of the rehosted Linux kernels. We further conduct two security applications, i.e., vulnerability analysis and fuzzing, on the rehosted Linux kernels to demonstrate usage scenarios.
Research Papers
Thu 18 Nov 2021 18:20 - 18:40 at Kangaroo - Firmware Chair(s): Ingo Mueller Monash University
IoT devices are abnormally prone to diverse errors due to harsh environments and limited computational capabilities. As a result, correct error handling is critical in IoT. Implementing correct error handling is non-trivial and thus requires extensive testing, such as fuzzing. However, existing fuzzing techniques cannot effectively test IoT error-handling code. First, errors typically represent corner cases and are thus hard to trigger. Second, testing error-handling code frequently crashes the execution, which prevents fuzzing from reaching deeper error paths.
In this paper, we propose iFIZZ, a new bug-detection system specifically designed for testing error-handling code in IoT firmware. iFIZZ first employs an automated binary-based approach to identify realistic runtime errors by analyzing errors and error conditions in closed-source IoT firmware. Then, iFIZZ employs state-aware and bounded error generation to reach deep error paths effectively. We implement and evaluate iFIZZ on 10 popular IoT firmware images. The results show that iFIZZ can find many bugs hidden in deep error paths. Specifically, iFIZZ finds 109 critical bugs, 63 of which are in widely used IoT libraries. iFIZZ also features high code coverage and efficiency, covering 67.3% more error paths than normal execution. Meanwhile, the error-handling depth covered by iFIZZ is 7.3 times deeper than that covered by the state-of-the-art method. Furthermore, iFIZZ has been adopted and deployed in a world-leading IoT company. We will open-source iFIZZ to facilitate further research in this area.
Industry Showcase
Thu 18 Nov 2021 18:40 - 18:50 at Kangaroo - Firmware Chair(s): Ingo Mueller Monash University
Internet-of-Things (IoT) and mobile devices are omnipresent in our daily life, and the security issues inside them are especially crucial. Greybox fuzzing has been shown to be effective in detecting vulnerabilities. However, applications on IoT or mobile devices are usually proprietary to specific vendors, so fuzzers are required to support binary-only targets. Moreover, since these devices have heterogeneous architectures and limited resources, and many testing targets are server-like programs, applying existing fuzzing techniques faces great challenges.
This paper proposes BiFF, a general-purpose fuzzer that addresses these issues. It supports binary-only targets; is general (supporting multiple CPU architectures, including Intel, ARM, MIPS, and PowerPC); fast (with the lowest runtime overhead compared to existing fuzzers); and flexible (using a new fuzzing workflow that can fuzz any piece of code inside the target binary). Experiments demonstrate that BiFF outperforms state-of-the-art binary fuzzers and can fuzz server-like programs that cannot be fuzzed by existing fuzzers. Using BiFF, we have found 24 previously unknown vulnerabilities (including memory corruptions, infinite loops, and infinite recursions) in industrial products.
New Ideas and Emerging Results (NIER) track
Thu 18 Nov 2021 18:50 - 19:00 at Kangaroo - Firmware Chair(s): Ingo Mueller Monash University
Specification learning and controller synthesis are two methods that promise to provide control systems with assured adaptive capabilities at runtime. Specification learning can automatically update specifications in light of violation traces observed within the operational environment. Controller synthesis can then automatically generate implementations that are guaranteed to satisfy these specifications in every environment.
Specification learning is implemented using general-purpose AI systems. These systems are highly configurable, and the configuration choice heavily affects their effectiveness. Setting configuration parameters is far from obvious, as they bear no clear semantic relation to the adaptation task. The state of the art requires configurations to be set by domain experts at design time for each application domain.
In this paper, we argue that to create assured control systems that can effectively and efficiently adapt at runtime, the learning systems upon which they are built must also have adaptive learning strategies for determining configurations at runtime. We demonstrate this idea with a proof-of-concept that computes domain-dependent policies using reinforcement learning.
Research Papers
Thu 18 Nov 2021 19:00 - 19:20 at Kangaroo - Developers Chair(s): Chetan Arora Deakin University
Online chatrooms are gaining popularity as a communication channel between widely distributed developers of Open Source Software (OSS) projects. Most discussion threads in chatrooms follow a Q&A format, with some developers (askers) raising an initial question and others (respondents) joining in to provide answers. These discussion threads are embedded with rich information that can satisfy the diverse needs of various OSS stakeholders. However, retrieving information from threads is challenging, as it requires thread-level analysis to understand the context. Moreover, the chat data is transient and unstructured, consisting of entangled informal conversations. In this paper, we address this challenge by identifying the information types available in developer chats and introducing an automated mining technique. Through manual examination of chat data from three chatrooms on Gitter, using card sorting, we build a thread-level taxonomy with nine information categories and create a labeled dataset with 2,959 threads. We propose a classification approach (named F2Chat) to automatically structure the vast number of threads by information type, helping stakeholders quickly acquire their desired information. F2Chat combines handcrafted non-textual features with deep textual features extracted by neural models. Specifically, it has two stages: the first leverages a siamese architecture to pretrain the textual feature encoder, and the second facilitates an in-depth fusion of the two types of features. Evaluation results suggest that our approach achieves an average F1-score of 0.628, improving over the baseline by 57%. Experiments also verify the effectiveness of our identified non-textual features under both intra-project and cross-project validation.
Research Papers
Thu 18 Nov 2021 19:20 - 19:40 at Kangaroo - Developers Chair(s): Chetan Arora Deakin University
Neural models of code are successfully tackling various prediction tasks, complementing and sometimes even outperforming traditional program analysis. While most work focuses on end-to-end evaluations of such models, it often remains unclear what the models actually learn and to what extent their reasoning about code matches that of skilled humans. A poor understanding of model reasoning risks deploying models that are right for the wrong reason and taking decisions based on spurious correlations in the training dataset. This paper investigates to what extent the attention weights of effective neural models match the reasoning of skilled humans. To this end, we present a methodology for recording human attention and use it to gather 1,508 human attention maps from 91 participants, the largest such dataset we are aware of. Computing human-model correlations shows that the copy attention of neural models often matches the way humans reason about code (Spearman rank coefficients of 0.49 and 0.47), giving an empirical justification for the intuition behind copy attention. In contrast, the regular attention of models is mostly uncorrelated with human attention. We find that models and humans sometimes focus on different kinds of tokens; e.g., strings are important to humans but mostly ignored by models. The results also show that human-model agreement positively correlates with accurate predictions by a model, which calls for neural models that even more closely mimic human reasoning. Beyond the insights from our study, we envision the release of our dataset of human attention maps to help understand future neural models of code and to foster work on human-inspired models.
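The human-model comparison boils down to rank-correlating two attention vectors over the same tokens. A minimal SciPy sketch (the attention values below are made up for illustration):

```python
from scipy.stats import spearmanr

tokens          = ["def", "sum", "(", "xs", ")", ":", "return", "total"]
human_attention = [0.02, 0.30, 0.01, 0.25, 0.01, 0.01, 0.10, 0.30]
model_attention = [0.05, 0.25, 0.02, 0.20, 0.03, 0.02, 0.13, 0.30]

rho, p_value = spearmanr(human_attention, model_attention)
print(f"Spearman rank correlation: {rho:.2f} (p={p_value:.3f})")
```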
Industry Showcase
Thu 18 Nov 2021 19:40 - 19:50 at Kangaroo - Developers Chair(s): Chetan Arora Deakin University
The popularity of cloud technologies has led to the development of a new type of application that specifically targets cloud environments. Such applications require a lot of cloud infrastructure to run, which brought about the Infrastructure as Code approach, where the infrastructure is also coded, in a separate language, in parallel to the main application. In this paper, we propose a new concept of Infrastructure in Code, where the infrastructure is deduced from the application code itself, without the need for separate specifications. We describe this concept, discuss existing solutions that can be classified as Infrastructure in Code and their limitations, and then present our own framework, Kotless: an extendable cloud-agnostic serverless framework for Kotlin that supports two cloud providers, three DSLs, and two runtimes. Finally, we showcase the usefulness of Kotless by demonstrating its efficiency in developing serverless applications on different platforms.
New Ideas and Emerging Results (NIER) track
Thu 18 Nov 2021 19:50 - 20:00 at Kangaroo - Developers Chair(s): Chetan Arora Deakin University
Research on engineering software applications that employ artificial intelligence (AI) and machine learning (ML) is at an all-time peak. However, most of this research focuses either on the interaction between humans and AI, predominantly concerned with building immersive interfaces and user experiences that allow for increased telemetry, or on handling AI and ML applications in production (MLOps). Nonetheless, research on the fundamental architectural differences between AI-powered applications and traditional non-AI-powered ones has not received its fair share of attention. To that end, we believe that a new take on the fundamental architecture of software applications is needed. With the ever-increasing prominence of content-driven and AI-focused applications, it is our conviction that content could be served by servers without clients actually requesting it, and that servers could (and should) request data on demand from clients without waiting for their requests. Hence, in this paper, we propose the fluid architecture, which facilitates bidirectional interaction between clients and servers in AI-based systems.
Tool Demonstrations
Thu 18 Nov 2021 19:50 - 19:55 at Koala - Bugs II Chair(s): Annibale Panichella Delft University of Technology
Given that quantum software testing is a new area of research, there is a lack of benchmark programs and bug repositories for assessing the effectiveness of testing techniques. To this end, quantum mutation analysis focuses on systematically generating faulty versions of Quantum Programs (QPs), called mutants, using mutation operators. Such mutants can be used as benchmarks to assess the quality of the test cases in a test suite. Thus, we present Muskit, a quantum mutation analysis tool for QPs coded in IBM's Qiskit language. Muskit defines mutation operators on the gates of QPs, as well as selection criteria to reduce the number of mutants to generate. Moreover, it allows for the execution of test cases on mutants and the generation of results for test analyses. Muskit is provided as a command-line interface, a GUI, and a web application. We validated Muskit by using it to generate and execute mutants for four QPs.
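To illustrate what a gate-level mutant of a Qiskit program looks like, here is a minimal Python sketch of a gate-replacement operator (an illustrative operator only, not Muskit's actual implementation; it assumes a recent Qiskit where circuit.data holds CircuitInstruction objects):

```python
from qiskit import QuantumCircuit
from qiskit.circuit.library import XGate

def gate_replacement_mutant(circuit, index, replacement):
    """Copy `circuit`, swapping the gate at position `index` for `replacement`."""
    mutant = QuantumCircuit(*circuit.qregs, *circuit.cregs)
    for i, instr in enumerate(circuit.data):
        op = replacement if i == index else instr.operation
        mutant.append(op, instr.qubits, instr.clbits)
    return mutant

qc = QuantumCircuit(2)
qc.h(0)
qc.cx(0, 1)

mutant = gate_replacement_mutant(qc, 0, XGate())  # mutate H on qubit 0 into X
print(mutant)
```

Running a test suite on such mutants and checking whether any test distinguishes them from the original circuit is the essence of mutation analysis.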
Tool Demonstrations
Tue 16 Nov 2021 11:50 - 11:55 at Koala - Empirical Studies Chair(s): Felipe Fronchetti Virginia Commonwealth University
Numerous efforts have been invested in improving the effectiveness of bug localization techniques, whereas little attention has been paid to making these tools run more efficiently in continuously evolving software repositories. This paper first analyzes the information retrieval model behind a classic bug localization tool, BugLocator, and builds a mathematical foundation showing that the model can be updated incrementally when the codebase or bug reports evolve. We then present IncBL, a tool for Incremental Bug Localization in evolving software repositories. IncBL is evaluated on the Bugzbook dataset, and the results show that IncBL can significantly reduce running time by 77.79% on average compared with re-computing the model, while maintaining the same level of accuracy. We also implement IncBL as a GitHub App that can be easily integrated into open-source projects on GitHub, and users can also deploy and use IncBL locally. The demo video for IncBL can be viewed at https://youtu.be/G4gMuvlJSb0, and the source code can be found at https://github.com/soarsmu/IncBL
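The incremental-update idea can be illustrated with the corpus statistics a tf-idf model depends on: when one file changes, only its term counts and the affected document frequencies need updating, rather than re-indexing the whole corpus. A minimal Python sketch (a simplified flavor only; BugLocator's rVSM adds length normalization and similar-report scoring on top):

```python
from collections import Counter

class IncrementalVSM:
    """Maintain per-file term counts and corpus document frequencies."""

    def __init__(self):
        self.doc_terms = {}   # file path -> Counter of its terms
        self.df = Counter()   # term -> number of files containing it

    def upsert(self, path, terms):
        """Add or re-index a single file; nothing else is touched."""
        self.remove(path)
        counts = Counter(terms)
        self.doc_terms[path] = counts
        self.df.update(counts.keys())

    def remove(self, path):
        old = self.doc_terms.pop(path, None)
        if old:
            self.df.subtract(old.keys())

index = IncrementalVSM()
index.upsert("Foo.java", ["parse", "buffer", "parse"])
index.upsert("Foo.java", ["parse", "stream"])  # only this file's stats change
print(index.df["buffer"], index.df["parse"])   # -> 0 1
```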
Tool Demonstrations
Tue 16 Nov 2021 12:50 - 12:55 at Koala - Languages Chair(s): Jean-Guy Schneider Deakin University
Programmers often use Q&A sites (e.g., Stack Overflow) to understand the root causes of program bugs. Runtime exceptions are one such important class of bugs that is actively discussed on Stack Overflow. However, it may be difficult for beginner programmers to come up with appropriate search keywords. Moreover, they need to switch their attention between the IDE and the browser, which is time-consuming. To overcome these difficulties, we propose a method, "MAESTRO", to automatically find suitable Q&A posts for a Java runtime exception by utilizing the structural information of code described on a programming Q&A website. In this paper, we describe a usage scenario of the IDE plugin, the architecture and user interface of the implementation, and the results of user studies. A video is available at https://youtu.be/4X24jJrMUVw . Demo software is available at https://github.com/FujitsuLaboratories/Q-A-MAESTRO .
Tool Demonstrations
Thu 18 Nov 2021 19:55 - 20:00 at Koala - Bugs II Chair(s): Annibale Panichella Delft University of Technology
The concept of a test smell covers potential problems with the readability and maintainability of test code. Common test smells focus on static aspects of the source code, such as code length and complexity. These are easy to detect and do not cause problems in terms of test execution. Dynamic smells, on the other hand, which are based on test runtime behavior, lead to misunderstanding of test results. For example, rotten green tests give developers the false impression that a test passed without any problems, even though the test was poorly executed. Therefore, we should detect dynamic smells and take countermeasures as early as possible during development. In this paper, we introduce JTDog, a Gradle plugin for dynamic smell detection. JTDog is highly portable due to its integration into the build tool. We applied JTDog to 150 projects on GitHub and confirmed that the JTDog plugin has high portability. In addition, JTDog detected 958 dynamic smells in 55 projects. JTDog is available at https://github.com/kusumotolab/JTDog, and the demo video is available at https://youtu.be/t374HYMCavI.
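JTDog targets Java tests, but the rotten green smell itself is language-agnostic. Here is a minimal pytest-style illustration (the RUN_FULL_CHECKS environment flag is hypothetical):

```python
# A "rotten green" test: it passes, yet its assertion never runs,
# because the guarding condition is false in this environment.
import os

def compute_discount(price):
    return price * 0.9

def test_discount_rotten_green():
    if os.getenv("RUN_FULL_CHECKS") == "1":   # false by default
        assert compute_discount(100) == 90    # never executed
    # The test ends here and is reported as green either way.
```

Detecting this smell requires observing at runtime that the assertion was never reached, which is exactly the dynamic analysis a static smell detector cannot do.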
Research Papers
Thu 18 Nov 2021 21:00 - 21:20 at Kangaroo - Learning Applications Chair(s): Michael Pradel University of Stuttgart
Despite the proliferation of Android testing tools, Google Monkey has remained the de facto standard for practitioners. The popularity of Google Monkey is largely due to the fact that it is a black-box testing tool, making it widely applicable to all types of Android apps, regardless of their underlying implementation details. An important drawback of Google Monkey, however, is that it uses the most naive form of test input generation: random testing. In this work, we present Deep GUI, an approach that aims to complement the benefits of black-box testing with a more intelligent form of GUI input generation. Given only screenshots of apps, Deep GUI first employs deep learning to construct a model of valid GUI interactions. It then uses this model to generate effective inputs for an app under test without the need to probe its implementation details. Moreover, since the data collection, training, and inference processes are performed independently of the platform, the model inferred by Deep GUI is applicable to testing apps on other platforms as well. We implemented a prototype of Deep GUI in a tool called Monkey++ by extending Google Monkey and evaluated its ability to crawl Android apps. We found that Monkey++ achieves significant improvements over Google Monkey in cases where an app's UI is complex, requiring sophisticated inputs. Furthermore, our experimental results demonstrate that the model inferred using Deep GUI can be reused for effective GUI input generation across platforms without retraining.
Research Papers
Thu 18 Nov 2021 21:20 - 21:40 at Kangaroo - Learning Applications Chair(s): Michael Pradel University of Stuttgart
Deep neural networks (DNNs) are being increasingly deployed as integral parts of software systems. However, due to the complex interconnections among hidden layers and massive numbers of hyperparameters, DNNs must be trained using large numbers of labeled inputs, which calls for extensive human effort in collecting and labeling data. To alleviate this growing demand, a surge of state-of-the-art studies proposes different metrics for selecting a small yet informative dataset for model training. These works have demonstrated that DNN models can achieve competitive performance using a carefully selected small set of data. However, the literature lacks a proper investigation of the limitations of such data selection metrics, which is crucial for applying them in practice. In this paper, we fill this gap and conduct an extensive empirical study to explore the limits of selection metrics. Our study involves 15 selection metrics evaluated over 5 datasets (2 image classification tasks and 3 text classification tasks), 10 DNN architectures, and 20 labeling budgets (the ratio of training data being labeled). Our findings reveal that, while selection metrics are usually effective at producing accurate models, they may induce a loss of model robustness (against adversarial examples) and resilience to compression. Overall, we demonstrate the existence of a trade-off between labeling effort and different model qualities, paving the way for future research on devising selection metrics that consider multiple quality criteria.
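A typical selection metric of the kind studied here is predictive entropy: label the examples the current model is least certain about. A minimal NumPy sketch under that assumption (the metric choice is illustrative; the paper compares 15 metrics):

```python
import numpy as np

def select_by_entropy(probs, budget):
    """Pick the `budget` most uncertain examples to label next.

    `probs`: (n_examples, n_classes) softmax outputs of the current model.
    """
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(entropy)[::-1][:budget]  # highest-entropy indices first

probs = np.array([[0.98, 0.02],   # confident -> low priority
                  [0.55, 0.45],   # uncertain -> labeled first
                  [0.70, 0.30]])
print(select_by_entropy(probs, budget=2))  # -> [1 2]
```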
Journal-first Papers
Thu 18 Nov 2021 21:40 - 21:50 at Kangaroo - Learning Applications Chair(s): Michael Pradel University of Stuttgart
Competitive Crowdsourced Software Development (CCSD) has gained tremendous attention in the software engineering community. It explores the possibility of replacing in-house software development to obtain cost-effective, innovative, and high-quality solutions on time. CCSD relies on an open-call format, where clients (companies) crowdsource their software development projects to CCSD platforms that arrange online competitions. The crowd (developers) participate in such competitions and submit their solutions to win monetary rewards. A number of CCSD platforms, e.g., TopCoder, uTest, GetACoder, and Taskcn, arrange online software development competitions; among these, TopCoder is the largest and most widely trusted.
Research Papers
Thu 18 Nov 2021 22:00 - 22:20 at Kangaroo - Analysis III Chair(s): Jifeng Xuan Wuhan University
In this paper, we address the problem of finding a correspondence, or matching, between the functions of two programs in binary form, one of the most common tasks in binary diffing. We introduce a new formulation of this problem as a particular instance of a graph edit problem over the call graphs of the programs. In this formulation, the quality of a mapping is evaluated simultaneously with respect to both function-content and call-graph similarity. We show that this formulation is equivalent to a network alignment problem and propose a solving strategy based on max-product belief propagation. Finally, we implement a prototype of our method, called QBinDiff, and present an extensive evaluation showing that our approach outperforms state-of-the-art diffing tools.
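The shape of the objective can be sketched as follows (an illustrative re-statement, not QBinDiff's code; `content_sim`, `calls_a`, `calls_b`, and `alpha` are assumed names). A mapping scores well if matched functions look alike and call-graph edges are preserved; QBinDiff searches for a high-scoring mapping with max-product belief propagation rather than the brute-force evaluation implied here:

```python
def mapping_score(mapping, content_sim, calls_a, calls_b, alpha=0.5):
    """Score a candidate function matching between binaries A and B.
    mapping:     dict from functions of A to functions of B
    content_sim: pairwise function-content similarity, content_sim[i, j]
    calls_a/b:   sets of call-graph edges (caller, callee)"""
    node = sum(content_sim[i, j] for i, j in mapping.items())
    edge = sum(1 for (u, v) in calls_a
               if u in mapping and v in mapping
               and (mapping[u], mapping[v]) in calls_b)
    return alpha * node + (1 - alpha) * edge
```

Weighting both terms at once is what distinguishes this formulation from diffing approaches that match functions on content alone.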
Research Papers
Thu 18 Nov 2021 22:20 - 22:40 at Kangaroo - Analysis III Chair(s): Jifeng Xuan Wuhan University
Reasoning about immutability is important for preventing bugs, e.g., in multi-threaded software. So far, static analyses inferring immutability properties have mostly focused on individual objects and references; reasoning about fields and entire classes, while significantly simpler, has gained less attention. Even consistent terminology is missing, which makes it difficult to implement analyses that rely on immutability information. We propose a model for class and field immutability that unifies the terminology for immutability flavors considered by previous work and covers new levels of immutability, handling lazy initialization and immutability that depends on generic type parameters. We implement CiFi, a set of modular, collaborating analyses for different flavors of immutability that infers the properties defined in our model, and we propose a benchmark of representative test cases for class and field immutability. We use the benchmark to showcase CiFi's precision and recall in comparison to the state of the art, and we use CiFi to study the prevalence of immutability in real-world libraries, demonstrating the practical quality and relevance of our model.
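The lazy-initialization flavor is the interesting edge case: a class can be observationally immutable even though one of its fields is written after construction. A Python analogue of the pattern (CiFi itself analyzes Java classes; this sketch only illustrates the flavor the model covers):

```python
import hashlib
from functools import cached_property

class Signature:
    """Observationally immutable despite a lazily initialized field:
    `digest` is computed once on first access and never changes
    afterwards, so no caller can observe mutation."""

    def __init__(self, payload: bytes):
        self._payload = payload  # never reassigned after construction

    @cached_property
    def digest(self) -> str:
        # Deferred, write-once initialization: the only "mutation"
        # is caching a value derived purely from immutable state.
        return hashlib.sha256(self._payload).hexdigest()
```

A naive field-level analysis would flag the cached field as mutable; a model that names this flavor explicitly can classify the class as immutable.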
New Ideas and Emerging Results (NIER) track
Thu 18 Nov 2021 22:40 - 22:50 at Kangaroo - Analysis III Chair(s): Jifeng Xuan Wuhan University
Metamorphic testing is a well-established technique that has been successfully applied in various domains, including testing deep learning models to assess their robustness against data noise or malicious input. To date, metamorphic testing approaches for machine learning (ML) models have focused on image processing and object recognition tasks; hence, they cannot be applied to ML models targeting program analysis tasks. In this paper, we extend metamorphic testing to ML models that operate on software programs. We present Lampion, a novel testing framework that applies semantics-preserving metamorphic transformations to test datasets. Lampion produces new code snippets that are equivalent to the original test set but differ in their identifiers or syntactic structure. We evaluate Lampion against CodeBERT, a state-of-the-art ML model for Code-To-Text tasks that creates Javadoc summaries for given Java methods. Our results show that even simple transformations significantly affect the target model's behavior, providing information about the model's reasoning beyond classic performance metrics.
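One such semantics-preserving transformation is identifier renaming. A rough sketch (a whole-word regex standing in for Lampion's actual rewriting; the snippet and names are illustrative):

```python
import re

def rename_identifier(java_src: str, old: str, new: str) -> str:
    """Rename a local identifier in a Java snippet, leaving behavior
    untouched while changing the tokens a Code-To-Text model sees."""
    return re.sub(rf"\b{re.escape(old)}\b", new, java_src)

original = "int sum(int[] xs) { int acc = 0; for (int x : xs) acc += x; return acc; }"
variant = rename_identifier(original, "acc", "v0")
# `variant` is functionally identical to `original`, yet a summarization
# model such as CodeBERT may produce a different Javadoc for it.
```

If the model's output changes under such a transformation, the change reveals how much the model leans on surface tokens rather than program semantics.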
Pre-print
New Ideas and Emerging Results (NIER) track
Thu 18 Nov 2021 22:50 - 23:00 at Kangaroo - Analysis III Chair(s): Jifeng Xuan Wuhan University
This research explores a new anti-analysis technique, carefully designed to attack weaknesses of existing program analysis approaches. It encodes a program code snippet to be hidden, and its decoding process is implemented by a sophisticated state machine that produces multiple outputs depending on its inputs. The key idea of the proposed technique is to decode the program code ambiguously, yielding multiple decoded code snippets that are challenging to distinguish from one another. Our approach is stealthier than previous similar approaches because its execution does not exhibit different behavior between correct and incorrect decoding. This paper also analyzes the weaknesses of existing techniques and discusses potential improvements. We implement and evaluate a proof-of-concept, and our preliminary results show that the proposed technique poses new and unique challenges to program analysis, suggesting the need for hybrid analyses that complement the limitations of existing techniques.
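A toy version of the ambiguity property (a byte-wise shift standing in for the paper's far more sophisticated state machine; all names are illustrative):

```python
def decode(blob: bytes, key: int) -> bytes:
    """Toy ambiguous decoder: decoding never fails, and every key yields
    some output, so an analyzer cannot distinguish the intended payload
    from decoys by watching for errors or behavioral differences."""
    return bytes((b + key) % 256 for b in blob)

# Encode a snippet under key 7; other keys also "decode" cleanly,
# just to different byte strings.
secret = bytes((b - 7) % 256 for b in b"print('payload')")
candidates = {key: decode(secret, key) for key in (3, 7, 11)}
```

In the real scheme, wrong inputs would yield plausible-looking decoy code rather than arbitrary bytes, which is what forces an analyzer to consider every decoding path.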
no description available