The need for more autonomy in many systems (e.g., automotive, aerospace) leads to increasingly complex systems that are becoming harder to verify and certify. This problem is compounded by the rising use of AI to enable such autonomy in many critical functions. Such a situation has a significant impact on software development practice and makes it particularly difficult to assess risks and meet industry standards. This talk will focus on automated techniques, relying on various AI disciplines, that enable the testing and risk analysis of such systems in ways that are practical and scalable. The talk will report and reflect on various research projects carried out in collaboration with the automotive and satellite industries.
Research Papers
Tue 11 Oct 2022 10:30 - 10:50 at Banquet B - Technical Session 3 - Fuzzing I. Chair(s): Aravind Machiry (Purdue University)
Autonomous driving systems (ADSs) must be tested thoroughly before they can be deployed in autonomous vehicles. High-fidelity simulators allow them to be tested against diverse scenarios, including those that are difficult to recreate in real-world testing grounds. While previous approaches have shown that test cases can be generated automatically, they tend to focus on weak oracles (e.g., reaching the destination without collisions) without assessing whether the journey itself was undertaken safely and satisfied the law. In this work, we propose LawBreaker, an automated framework for testing ADSs against real-world traffic laws, which is designed to be compatible with different scenario description languages. LawBreaker provides a rich driver-oriented specification language for describing traffic laws, and a fuzzing engine that searches for different ways of violating them by maximising specification coverage. To evaluate our approach, we implemented it for Apollo+LGSVL and specified the traffic laws of China. LawBreaker was able to find 14 violations of these laws, including 173 test cases that caused accidents.
Research Papers
Tue 11 Oct 2022 10:50 - 11:10 at Banquet B - Technical Session 3 - Fuzzing I. Chair(s): Aravind Machiry (Purdue University)
With rapidly growing fuzzing technology, there has been surging demand for automatically synthesizing buggy programs. Previous approaches have focused on injecting bugs into existing programs, which makes it hard to provide an exact ground truth because the generated programs may contain unexpected bugs. In this paper, we address this challenge by casting the bug synthesis problem as a maze generation problem. Specifically, we synthesize a whole buggy program by encoding a sequence of moves in a maze as a chain of function calls. By design, our approach provides the exact ground truth of the synthesized benchmark. Furthermore, it allows the generation of benchmarks with realistic path constraints extracted from existing vulnerabilities. We implement our idea in a tool, named Fuzzle, and evaluate it with five state-of-the-art fuzzers to empirically demonstrate its value.
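As a purely illustrative sketch of the encoding idea (this is our own toy construction, not Fuzzle's implementation; the maze, solution string, and function names are made up), the snippet below turns a short maze path into a chain of function calls, so only the input that follows the path end-to-end reaches the planted "bug":

```python
# Hypothetical sketch: encode a maze path as a chain of function calls.
# Each maze cell becomes a function; a correct move descends into the next
# cell, a wrong move is a dead end. Only the solution path reaches the bug.

SOLUTION = "RRDD"  # the single path through a tiny made-up maze

def make_cell(step):
    """Build the function for the cell reached after `step` correct moves."""
    if step == len(SOLUTION):
        def exit_cell(moves):
            raise RuntimeError("bug reached")  # ground-truth crash location
        return exit_cell

    def cell(moves):
        if moves and moves[0] == SOLUTION[step]:
            make_cell(step + 1)(moves[1:])     # correct move: go to next cell
        # any other move is a dead end: return silently
    return cell

entry = make_cell(0)
try:
    entry("RRDD")   # follows the solution path: triggers the planted bug
except RuntimeError as e:
    print("crash:", e)
entry("RRDL")       # dead end: no crash
```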
Tool Demonstrations
Tue 11 Oct 2022 11:10 - 11:20 at Banquet B - Technical Session 3 - Fuzzing I. Chair(s): Aravind Machiry (Purdue University)
With the rapid development of autonomous driving systems (ADS), especially the increasing adoption of deep neural networks (DNNs) as their core components, effective quality assurance methods for ADS have attracted growing interest in both academia and industry. In this paper, we report on a new testing platform, ADEPT, which we have developed to provide practically realistic and comprehensive testing facilities for DNN-based ADS. ADEPT is based on the virtual simulator CARLA and provides numerous testing facilities such as scene construction, ADS importation, test execution and recording, etc. In particular, ADEPT features two distinctive test scenario generation strategies designed for autonomous driving. First, we make use of real-life accident reports, from which we leverage natural language processing to fabricate abundant driving scenarios. Second, we synthesize physically-robust adversarial attacks by taking the feedback of the ADS into consideration and are thus able to generate closed-loop test scenarios. The experiments confirm the efficacy of the platform. A video demonstrating the usage of ADEPT can be found at https://youtu.be/evMorf0uR_s.
Research Papers
Tue 11 Oct 2022 11:20 - 11:40 at Banquet B - Technical Session 3 - Fuzzing I. Chair(s): Aravind Machiry (Purdue University)
Heap-based temporal vulnerabilities (e.g., use-after-free and double-free) are highly sensitive to heap operation sequences (e.g., memory allocation, deallocation and access). To efficiently find such vulnerabilities, traditional code coverage-guided fuzzing solutions could be improved by integrating heap operation sequence feedback. However, current sequence-sensitive solutions have limitations in practice.
In this paper, we propose a novel fuzzing solution named HTFuzz to find heap-based temporal vulnerabilities. At its core, we utilize fuzzing to increase the coverage of runtime heap operation sequences and the diversity of pointers accessed by these operations, where the former reflects the control flow and the latter reflects the data flow of heap operation sequences. With such increases, the fuzzer can find more heap-based temporal vulnerabilities. We have developed a prototype of HTFuzz, evaluated it on 14 real-world applications, and compared it with 11 state-of-the-art fuzzers. The results showed that HTFuzz outperformed all the baselines and was statistically better in the number of heap-based temporal vulnerabilities discovered. In detail, HTFuzz found (1.82x, 2.62x, 2.66x, 2.02x, 2.21x, 2.06x, 1.47x, 2.98x, 1.98x) more heap operation sequences and (1.45x, 3.56x, 3.56x, 4.57x, 1.78x, 1.78x, 1.68x, 4.00x, 1.45x) more 0-day heap-based temporal vulnerabilities than (AFL, AFL-sensitive-ma, AFL-sensitive-mw, Memlock, PathAFL, TortoiseFuzz, MOPT, Angora, Ankou), respectively. HTFuzz discovered 37 new vulnerabilities with 37 CVEs assigned, including 32 new heap-based temporal vulnerabilities and 5 of other types.
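A minimal sketch of how heap-operation-sequence feedback could be collected (our own illustration of the idea, not HTFuzz's implementation; trace format and constants are assumptions): hash fixed-length windows of (operation, pointer) events into a coverage map, so a seed that exercises a new allocation/free/access ordering counts as interesting.

```python
# Hypothetical sketch of heap-operation-sequence feedback for a fuzzer.
# Each executed test yields a trace of (operation, pointer-id) events;
# new sub-sequences of events count as new "sequence coverage".
from itertools import islice

SEQ_LEN = 3                      # length of tracked operation sub-sequences
seen_sequences = set()           # global sequence-coverage map

def sequence_feedback(trace):
    """Return True if the trace covers a previously unseen heap-op sequence."""
    new = False
    for window in zip(*(islice(trace, i, None) for i in range(SEQ_LEN))):
        if window not in seen_sequences:
            seen_sequences.add(window)
            new = True
    return new

# Example traces: (op, pointer id) pairs as an instrumented run might report.
t1 = [("alloc", 1), ("access", 1), ("free", 1)]
t2 = [("alloc", 1), ("free", 1), ("access", 1)]   # use-after-free-like ordering
print(sequence_feedback(t1))   # True: first time this ordering is seen
print(sequence_feedback(t1))   # False: nothing new, discard the seed
print(sequence_feedback(t2))   # True: a new, suspicious ordering -> keep seed
```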
Research Papers
Tue 11 Oct 2022 11:40 - 12:00 at Banquet B - Technical Session 3 - Fuzzing I. Chair(s): Aravind Machiry (Purdue University)
Greybox fuzzing is a proven, effective testing method for the detection of security vulnerabilities and other bugs in modern software systems. Greybox fuzzing can also be used in combination with a sanitizer, such as AddressSanitizer (ASAN), to further enhance the detection of certain classes of bugs such as buffer overflow and use-after-free errors. However, sanitizers also introduce additional performance overheads, and this can degrade the performance of greybox fuzzing—measured in the order of 2.36x for fuzzing with ASAN—potentially negating the benefit of using a sanitizer in the first place. Recent research attributes this extra overhead to the additional page faults that are generated when the disjoint sanitizer metadata is accessed at runtime.
In this paper, we present a new design that can detect memory errors without a proliferation of page faults. The basic idea is to track memory validity using randomized tokens that are stored directly in the memory itself, rather than in disjoint metadata. All read/write operations are instrumented to check for the token, and if the token is present, a memory error is detected. We implement our design in the form of ReZZan, a sanitizer specifically optimized for fuzz testing. Since there is no disjoint metadata access, no additional page faults are generated, minimizing the performance overhead to around 1.14-1.27x (depending on the configuration).
Research Papers
Tue 11 Oct 2022 12:00 - 12:20 at Banquet B - Technical Session 3 - Fuzzing I. Chair(s): Aravind Machiry (Purdue University)
Fuzzing is a promising approach to testing DBMSs. One crucial component in DBMS fuzzing is the grammar: since DBMSs enforce strict validation on inputs, a grammar improves fuzzing efficiency by generating syntactically- and semantically-correct SQL statements. However, due to the vast differences in the complex grammars of various DBMSs, it is painstaking to adapt these fuzzers to them. Considering that many DBMSs are not yet well tested, there is an urgent need for an effective DBMS fuzzing approach that is free from grammar dependencies. In this paper, we propose Griffin, a grammar-free DBMS fuzzer. Rather than relying on a grammar, Griffin summarizes the DBMS's state into a metadata graph, a lightweight data structure which improves mutation correctness in fuzzing. Specifically, it first tracks the metadata of the statements in built-in SQL test cases as they are executed, and iteratively constructs the metadata graph to describe the dependencies between metadata and statements. Based on the graphs, it reshuffles statements and employs metadata-guided substitution to correct semantic errors. We evaluate Griffin on four popular DBMSs, namely MariaDB, SQLite, PostgreSQL, and DuckDB. Griffin covers 27.79%-155.71%, 96.75%-455.82%, and 32.99%-189.36% more branches, and finds 19, 19, and 15 more bugs in 12 hours than SQLancer, SQLsmith, and Squirrel, respectively. In total, Griffin finds 55 previously unknown bugs, 13 of which have been confirmed as CVEs in the National Vulnerability Database.
NIER Track
Tue 11 Oct 2022 12:20 - 12:30 at Banquet B - Technical Session 3 - Fuzzing I. Chair(s): Aravind Machiry (Purdue University)
Coverage-guided greybox fuzzing is regarded as a practical approach to detecting software vulnerabilities, and it aims to expand code coverage as much as possible. A common implementation is to assign more energy to seeds that find new edges with less execution time. However, solely considering new edges may be less effective, because hard-to-find branches often exist in the complex code of a program. Code complexity is one of the key indicators used to measure code security. Compared to code with a simple structure, code with higher complexity is more likely to contain more branches and cause security problems. To this end, we propose a novel fuzzing method which further uses code complexity to optimize the power schedule process in AFL (American Fuzzy Lop) and AFLFAST (American Fuzzy Lop Fast). The goal of our method is to generate inputs which are more biased toward the higher-complexity code of the program under test. In addition, we conduct a preliminary empirical study on three widely used real-world programs, and the experimental results show that the proposed approach can trigger more crashes as well as improve coverage discovery.
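A rough sketch of how such a power schedule could look (our reading of the idea, not the authors' code; the complexity table, seed structure, and base energy are made up): scale the baseline energy by the average complexity of the functions a seed covers.

```python
# Hypothetical sketch: bias seed energy toward higher-complexity code.
# complexity[f] is assumed to come from static analysis of the target program.
complexity = {"parse_header": 12, "read_field": 3, "checksum": 2}

def assign_energy(seed, base_energy=100):
    """Scale the baseline energy by the mean complexity of covered functions."""
    covered = seed["covered_functions"]
    if not covered:
        return base_energy
    mean_cc = sum(complexity.get(f, 1) for f in covered) / len(covered)
    return int(base_energy * mean_cc)

seed_a = {"covered_functions": ["parse_header", "checksum"]}
seed_b = {"covered_functions": ["checksum"]}
print(assign_energy(seed_a))   # favoured: touches the complex parser
print(assign_energy(seed_b))   # gets less fuzzing energy
```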
Research Papers
Tue 11 Oct 2022 14:00 - 14:20 at Banquet B - Technical Session 7 - Fuzzing II. Chair(s): Karine Even-Mendoza (Imperial College London)
Fuzz testing ("fuzzing") is a widely-used and effective dynamic technique to discover crashes and security vulnerabilities in software, supported by numerous tools, which keep improving in terms of their detection capabilities and speed of execution. In this paper, we report our findings from using state-of-the-art mutation-based and hybrid fuzzers (AFL, Angora, honggfuzz, Intriguer, MOpt-AFL, QSym, and SymCC) on a non-trivial code base, that of Contiki-NG, to expose and fix serious vulnerabilities in various layers of its network stack, during a period of more than three years. As a by-product, we provide a Git-based platform which allowed us to create and apply a new, quite challenging, open-source bug suite for evaluating fuzzers on real-world software vulnerabilities. Using this bug suite, we present an impartial and extensive evaluation of the effectiveness of these fuzzers, and measure the impact that sanitizers have on it. Finally, we offer our experiences and opinions on how fuzzing tools should be used and evaluated in the future.
Research Papers
Tue 11 Oct 2022 14:20 - 14:40 at Banquet B - Technical Session 7 - Fuzzing II. Chair(s): Karine Even-Mendoza (Imperial College London)
Fuzzing has been an important approach for finding bugs and vulnerabilities in programs. Many fuzzers deployed in industry run daily and can generate an overwhelming number of crashes. Diagnosing such crashes can be very challenging and time consuming. Existing fuzzers typically employ heuristics such as code coverage or call stack hashes to weed out duplicate reporting of bugs. While these heuristics are cheap, they are often imprecise and end up still reporting many “unique” crashes corresponding to the same bug. In this paper, we present FuzzerAid, which uses fault signatures to group crashes reported by fuzzers. A fault signature is a small executable program consisting of a selection of necessary statements from the original program that can reproduce a bug. In our approach, we first generate a fault signature using a given crash. We then execute the fault signature with other crash-inducing inputs. If the failure is reproduced, we classify the crashes into the group labeled with the fault signature; if not, we generate a new fault signature. After all the crash-inducing inputs are classified, we further merge the fault signatures of the same root cause into a group. We implemented our approach in a tool called FuzzerAid and evaluated it on 3020 crashes generated from 15 real-world bugs and 4 large open source projects. Our evaluation shows that we are able to correctly group 99.1% of the crashes and reported only 17 (+2) “unique” bugs, outperforming the state-of-the-art fuzzers.
Research Papers
Tue 11 Oct 2022 14:40 - 15:00 at Banquet B - Technical Session 7 - Fuzzing II. Chair(s): Karine Even-Mendoza (Imperial College London)
The tremendous advancements in deep learning techniques have empowered question answering (QA) systems with the capability of dealing with various tasks. Many commercial QA systems, such as Siri, Google Home, and Alexa, have been deployed to assist people in different daily activities. However, modern QA systems are often designed to deal with different topics and task formats, which makes both the test collection and labeling tasks difficult and thus threatens their quality.
To alleviate this challenge, in this paper, we design and implement a fuzzing framework for QA systems, named QATest, based on metamorphic testing theory. It provides the first uniform solution to automatically generate tests with oracle information for various QA systems, such as machine reading comprehension, open-domain QA, and QA on knowledge bases. To further improve testing efficiency and generate more tests that detect erroneous behaviors, we design N-gram coverage criteria and perplexity priority based on the features of the question data to guide the generation process. To evaluate the performance of QATest, we apply it to four QA systems that are designed for different tasks. The experiment results show that the tests generated by QATest efficiently detect hundreds of erroneous behaviors of QA systems. Also, the results confirm that the testing criteria can improve test diversity and fuzzing efficiency.
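To make the coverage idea concrete, here is a small sketch (our own illustration with made-up questions, not QATest's exact criterion) of an N-gram coverage criterion over generated test questions: a candidate is kept only if it contributes word n-grams not yet covered by the already-selected tests.

```python
# Hypothetical sketch of an N-gram coverage criterion for generated questions.
def ngrams(text, n=2):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def select_by_ngram_coverage(candidates, n=2):
    """Keep only questions that add previously uncovered n-grams."""
    covered, selected = set(), []
    for q in candidates:
        gain = ngrams(q, n) - covered
        if gain:                       # redundant questions are filtered out
            covered |= gain
            selected.append(q)
    return selected

tests = select_by_ngram_coverage([
    "who wrote the novel dracula",
    "who wrote dracula",
    "who wrote the novel dracula",     # duplicate: adds no new bigrams
])
print(tests)                           # only the first two questions are kept
```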
Research Papers
Tue 11 Oct 2022 15:00 - 15:20 at Banquet B - Technical Session 7 - Fuzzing II. Chair(s): Karine Even-Mendoza (Imperial College London)
As computer programs running on top of blockchains, smart contracts have enabled a myriad of decentralized applications while also bringing security vulnerabilities, which may cause disastrous failures and huge financial losses. Thus, it is crucial and urgent to detect the vulnerabilities of smart contracts. However, existing fuzzers for smart contracts are still inefficient at detecting sophisticated vulnerabilities that require specific vulnerable transaction sequences to trigger. To address this challenge, we propose a novel vulnerability-guided fuzzer based on reinforcement learning, named RLF, for generating vulnerable transaction sequences to detect such sophisticated vulnerabilities in smart contracts. In particular, we first model the process of fuzzing smart contracts as a Markov decision process to construct our reinforcement learning framework. We then design an appropriate reward that considers both vulnerability and code coverage so that it can effectively guide our fuzzer to generate specific transaction sequences to reveal vulnerabilities, especially vulnerabilities related to multiple functions. We conduct extensive experiments to evaluate RLF's performance. The experimental results demonstrate that RLF outperforms state-of-the-art fuzzers.
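A toy sketch of the kind of combined reward signal described (our interpretation only; RLF's actual reward, weights, and feedback format are not specified here and the numbers are invented):

```python
# Hypothetical sketch of a vulnerability- and coverage-aware fuzzing reward.
def reward(step_result, w_vuln=10.0, w_cov=1.0):
    """step_result is assumed to summarize one executed transaction."""
    r = w_cov * step_result["new_branches"]          # reward coverage progress
    if step_result["vulnerability_detected"]:        # e.g., a reentrancy oracle fired
        r += w_vuln
    return r

# One fuzzing episode: a transaction sequence and made-up per-step feedback.
episode = [
    {"new_branches": 3, "vulnerability_detected": False},
    {"new_branches": 0, "vulnerability_detected": False},
    {"new_branches": 1, "vulnerability_detected": True},
]
print(sum(reward(step) for step in episode))   # return used to update the policy
```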
MIP Awards
Wed 12 Oct 2022 08:00 - 08:15 at Banquet B - Welcome to Day 2. Chair(s): Myra Cohen (Iowa State University), Houari Sahraoui (Université de Montréal)
Software developers spend a significant portion of their resources handling user-submitted bug reports. For software that is widely deployed, the number of bug reports typically outstrips the resources available to triage them. As a result, some reports may be dealt with too slowly or not at all. We present a descriptive model of bug report quality based on a statistical analysis of surface features of over 27,000 publicly available bug reports for the Mozilla Firefox project. The model predicts whether a bug report is triaged within a given amount of time. Our analysis of this model has implications for bug reporting systems and suggests features that should be emphasized when composing bug reports. We evaluate our model empirically based on its hypothetical performance as an automatic filter of incoming bug reports. Our results show that our model performs significantly better than chance in terms of precision and recall. In addition, we show that our model can reduce the overall cost of software maintenance in a setting where the average cost of addressing a bug report is more than 2% of the cost of ignoring an important bug report.
MIP Awards
Wed 12 Oct 2022 08:15 - 08:30 at Banquet B - Welcome to Day 2. Chair(s): Myra Cohen (Iowa State University), Houari Sahraoui (Université de Montréal)
Keynotes
Wed 12 Oct 2022 08:30 - 09:30 at Banquet B - Welcome to Day 2. Chair(s): Myra Cohen (Iowa State University), Houari Sahraoui (Université de Montréal)
Machines today can write software, compose music, create art, predict events, and listen to and learn from humans. Notably, automation also plays an essential role in high-performing software development teams by automating tasks and improving developer productivity. But automation can’t (yet) replace human imagination and the intelligence that arises when multiple great minds work together to solve the complex problems that are inherent in software and systems design. In this talk, we will review how automation in modern software development has evolved and the many benefits it has brought. We will then explore how a deeper understanding of the developer experience points to untapped possibilities for innovating automation for software engineering, focusing on how they can:
We will conclude by discussing how we can measure the impact of new innovations on the developer experience, and how doing so will drive actionable change and empower developers to do their best work joyfully.
Research Papers
Wed 12 Oct 2022 10:00 - 10:20 at Banquet B - Technical Session 12 - Builds and Versions. Chair(s): Yi Li (Nanyang Technological University, Singapore)
Abstract Syntax Trees (ASTs) are widely used beyond compilers in many tools that measure and improve code quality, such as code analysis, bug detection, mining code metrics, refactoring, etc. With the advent of fast software evolution and multistage releases, the temporal analysis of an AST history is becoming useful to understand and maintain code. However, jointly analyzing thousands of versions of ASTs independently faces scalability issues, mostly combinatorial, both in terms of memory and CPU usage. In this paper, we propose a novel type of AST, called HyperAST, that enables efficient temporal code analysis on a given software history by: 1/ leveraging code redundancy through space (between code elements) and time (between versions); 2/ reusing intermediate computation results. We show how the HyperAST can be built incrementally on a set of commits to capture all multiple ASTs at once in an optimized way. We evaluated the HyperAST on a curated list of large software projects. Compared to Spoon, a state-of-the-art technique used as a baseline, we observed that the HyperAST outperforms it by an order of magnitude, from x6 up to x8076 in CPU construction time and from x12 up to x1159 in memory footprint. While the HyperAST requires up to 2 hours 22 minutes and 7.2 GB for the biggest project, Spoon requires up to 93 hours 31 minutes and 2.2 TB. The gains in construction time varied from 83.4% to 99.99% and the gains in memory footprint varied from 91.8% to 99.9%. We further compared the task of finding references of declarations with the HyperAST and Spoon. We observed on average 90% precision and 97% recall without a significant difference in search time.
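The space saving comes from sharing identical subtrees across code elements and versions. A minimal sketch of that general idea (hash-consing of tree nodes; this is our own illustration and not the paper's data structure, and the node labels are invented):

```python
# Hypothetical sketch of structural sharing across AST versions (hash-consing):
# identical subtrees are constructed once and shared, so the unchanged parts of
# the code cost nothing extra when a new commit's AST is added.
_node_pool = {}

def node(label, *children):
    """Return a shared node for (label, children); build each shape only once."""
    key = (label, children)
    return _node_pool.setdefault(key, key)

# Two versions of `return x + C;` where only the constant C changes.
v1 = node("return", node("+", node("name", "x"), node("lit", "1")))
v2 = node("return", node("+", node("name", "x"), node("lit", "2")))

print(node("name", "x") is node("name", "x"))  # True: one shared subtree object
print(v1 is v2)                                # False: the changed part differs
print(len(_node_pool))                         # 7 pooled nodes, not 2 full trees of 4
```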
Research Papers
Wed 12 Oct 2022 10:20 - 10:40 at Banquet B - Technical Session 12 - Builds and Versions. Chair(s): Yi Li (Nanyang Technological University, Singapore)
To enhance compatibility in the version control of Java third-party libraries (TPLs), Maven adopts Semantic Versioning (SemVer) to standardize the underlying meaning of versions, but users can still confront abnormal execution and crashes after upgrades even if compilation and linkage succeed. This is caused by semantic breaking (SemB) issues, in which APIs directly used by users have identical signatures but inconsistent semantics across upgrades. To strengthen compliance with SemVer rules, developers and users should be alerted to such issues. Unfortunately, it is challenging to detect them statically, because semantic changes in the internal methods of APIs are hard to capture. Dynamic testing can reliably uncover some of them, but it is limited by inadequate coverage.
To detect SemB issues over compatible upgrades (Patch and Minor, by SemVer rules), we conducted an empirical study on 180 SemB issues to understand their root causes. Inspired by the findings, we propose Sembid (Semantic Breaking Issue Detector) to statically detect such issues of TPLs for developers and users. Since APIs are directly used by users, Sembid detects and reports SemB issues based on APIs. For a pair of APIs, Sembid walks through the call chains originating from the API to locate breaking changes by measuring the semantic diff. Then, Sembid checks whether the breaking changes can affect the API's output along the call chains. The evaluation showed that Sembid achieved 90.26% recall and 81.29% precision and outperformed other API checkers on SemB API detection. We also showed that Sembid detected over 3 times more SemB APIs with better coverage than unit tests, the commonly used solution. Furthermore, we carried out an empirical study on 1,629,589 APIs from 546 version pairs of top Java libraries and found there were 2-4 times more SemB APIs than APIs with signature-based issues. Due to various version release strategies, 33.83% of Patch version pairs and 64.42% of Minor version pairs had at least one API affected by breaking changes.
Research Papers
Wed 12 Oct 2022 10:40 - 11:00 at Banquet B - Technical Session 12 - Builds and Versions. Chair(s): Yi Li (Nanyang Technological University, Singapore)
During software merge, the edits from different branches can textually overlap (i.e., textual conflicts) or cause build and test errors (i.e., build and test conflicts), jeopardizing programmer productivity and software quality. Existing tools primarily focus on textual conflicts; few tools detect higher-order conflicts (i.e., build and test conflicts). However, existing detectors of build conflicts are quite limited. Due to their heavy usage of automatic build, the build-based detectors (e.g., Crystal) can only report build errors instead of pinpointing the root causes; developers have to manually locate conflicting edits. Furthermore, these detectors only help when the branches-to-merge have no textual conflict.
We present a novel approach, Bucond ("build conflict detector"), to detect conflicts via static analysis instead of automatic build. Given the three program versions in a merging scenario (base b, left l, and right r), Bucond first models each version as a graph, and compares graphs to extract entity-related edits (e.g., renaming a class) in the l and r branches. Our insight is that build conflicts occur when certain edits are co-applied to related entities between branches. Bucond realizes this insight via pattern-based reasoning, to identify any cross-branch edit combination that can trigger build conflicts (e.g., one branch adds a reference to field F while the other branch removes F). We conducted a systematic exploration to devise 57 conflicting-edit patterns, covering 97% of the total build conflicts in our experiments. We evaluated Bucond with three datasets and it worked effectively on all of them. It complements existing build-based conflict detectors, as it (1) detects conflicts with 100% precision and 88%-100% recall, (2) pinpoints the conflicting edits, and (3) works well when those detectors are inapplicable.
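To illustrate the pattern-based reasoning, the toy sketch below encodes one pattern of the kind the abstract mentions (a reference added in one branch to a field whose declaration is removed in the other). It is our own simplification, not Bucond's graph model; the edit tuples and entity names are invented.

```python
# Hypothetical sketch of one cross-branch conflicting-edit pattern:
# branch L adds a reference to field F while branch R removes F's declaration.
def detect_field_conflicts(left_edits, right_edits):
    """Each edit is (action, entity), e.g. ('add_ref', 'Account.balance')."""
    added_refs = {e for a, e in left_edits if a == "add_ref"}
    removed    = {e for a, e in right_edits if a == "remove_decl"}
    return added_refs & removed        # entities that would break the merged build

left  = [("add_ref", "Account.balance"), ("rename", "Account.log")]
right = [("remove_decl", "Account.balance")]
print(detect_field_conflicts(left, right))   # {'Account.balance'}
# The symmetric case (R adds the reference, L removes the declaration) would be
# checked the same way; a full detector enumerates many such edit patterns.
```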
Research Papers
Wed 12 Oct 2022 11:00 - 11:20 at Banquet B - Technical Session 12 - Builds and Versions. Chair(s): Yi Li (Nanyang Technological University, Singapore)
As one of the representative software ecosystems, PyPI, together with the installation management tool pip, greatly facilitates Python developers in automatically managing the reuse of third-party libraries, thus saving development time and cost. Despite its great success in practice, a recent empirical study revealed the risks of dependency conflict (DC) issues and summarized the characteristics of DC issues. However, the dependency resolving strategy, which is the foundation of that prior study, has since evolved into a new one, namely the backtracking strategy. To understand how this evolution of the dependency resolving strategy affects the prior findings, we conducted an empirical study to revisit the characteristics of DC issues under the new strategy. Our study revealed that, of the two previously discovered DC issue manifestation patterns, one has changed significantly (Pattern A), while the other remains the same (Pattern B). We also observed that the fixing strategy for the DC issues of Pattern A suffers from an efficiency issue, while the one for the DC issues of Pattern B leads to a waste of time and space. Based on our findings, we propose a tool, smartPip, to overcome the limitations of the fixing strategies. To resolve the DC issues of Pattern A, instead of iteratively verifying each candidate dependency library, we leverage a pre-built knowledge base of library dependencies to collect version constraints for the concerned libraries, and then use constraint solving to directly find feasible solutions. To resolve the DC issues of Pattern B, we improve the existing virtual environment solution to reuse local libraries as much as possible. Finally, we evaluated smartPip on three benchmark datasets of open source projects. The results showed that smartPip achieves 1.19X-1.60X speedups in resolving the DC issues of Pattern A, compared with pip using the new strategy. Compared with the built-in Python virtual environment (venv), smartPip reduced storage space by 34.55%-80.26% and achieved 2.26X-6.53X speedups in resolving the DC issues of Pattern B.
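A small sketch of the constraint-solving step for Pattern A (our own toy version with a brute-force solver, made-up package versions and constraints, and no connection to smartPip's actual knowledge base): gather each dependent's version constraints on a shared library and directly pick a version satisfying all of them, instead of iteratively installing and verifying candidates.

```python
# Hypothetical sketch: resolve a dependency conflict by solving collected
# version constraints directly rather than by trial installation.
import operator

OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt,
       "<": operator.lt, "==": operator.eq}

def parse(v):                      # "1.4.2" -> (1, 4, 2)
    return tuple(int(x) for x in v.split("."))

def satisfies(version, constraint):
    op, bound = constraint
    return OPS[op](parse(version), parse(bound))

def solve(available, constraints):
    """Return the newest available version meeting every collected constraint."""
    ok = [v for v in available if all(satisfies(v, c) for c in constraints)]
    return max(ok, key=parse) if ok else None

# Constraints on a shared library collected from two dependents (invented data).
available = ["2.18.0", "2.22.0", "2.25.1", "2.28.0"]
constraints = [(">=", "2.20.0"), ("<", "2.26.0")]
print(solve(available, constraints))   # 2.25.1
```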
Research Papers
Wed 12 Oct 2022 11:20 - 11:40 at Banquet B - Technical Session 12 - Builds and Versions. Chair(s): Yi Li (Nanyang Technological University, Singapore)
Build scripts play an important role in transforming source code into executable artifacts. However, the development of build scripts is typically error-prone. As one of the most prevalent kinds of errors in build scripts, dependency-related errors, including missing dependencies and redundant dependencies, have drawn the attention of many researchers, and a variety of build dependency analysis techniques have been proposed to tackle them. Unfortunately, most of these techniques, even the state-of-the-art ones, suffer from efficiency issues due to the expensive cost of monitoring the complete build process to capture dynamic dependencies. Especially for large-scale projects, such a cost is not affordable.
In this work, we propose a new technique to accelerate the build dependency error detection by reducing the time cost of the build monitoring. The key idea of this acceleration is to reduce the size of a program while still preserving the same dynamic dependencies as the original one. Building the reduced program does not generate a real software artifact, but it yields the same list of dependency errors and meanwhile speeds up the process. We implement our technique as the tool VirtualBuild, and evaluate it on real-world projects. The results show that VirtualBuild can detect all the dependency errors found by existing tools at a low cost. Compared with the state-of-the-art technique, VirtualBuild accelerates the build process by 8.74 times and further improves the efficiency of error detection by 6.13 times on average. Specifically, in the large-scale project LLVM that contains 5.67 MLoC, VirtualBuild reduces the overall time from over four hours to 38.63 minutes.
Research Papers
Wed 12 Oct 2022 11:40 - 12:00 at Banquet B - Technical Session 12 - Builds and Versions. Chair(s): Yi Li (Nanyang Technological University, Singapore)
Despite the benefits, continuous integration (CI) can incur high costs. One of the well-recognized costs is long build time, which greatly affects the speed of software development and increases the cost in computational resources. While there exist configuration options in the CI infrastructure to accelerate builds, the CI infrastructure is often not optimally configured, leading to CI configuration smells. Attempts have been made to detect or repair CI configuration smells. However, none of them is specifically designed to improve build performance in CI.
In this paper, we first create a catalog of 20 performance-related CI configuration smells (PCSs) in three tools (i.e., Travis CI, Maven and Gradle) of the CI infrastructure for Java projects. Then, we propose an automated approach, named BuildSonic, to detect and repair 15 types of PCSs by analyzing configuration files. We have conducted large-scale experiments to evaluate BuildSonic. We detected 20,318 PCSs in 99.0% of the 4,140 Java projects, with a precision of 0.998 and a recall of 0.991. We submitted 1,138 pull requests for sampled PCSs of each PCS type, 246 and 11 of which have been respectively merged and accepted by developers. We successfully triggered CI builds before and after merging 288 pull requests, and observed an average build performance improvement of 12.4% after repairing a PCS.
Research Papers
Wed 12 Oct 2022 13:30 - 13:50 at Banquet B - Technical Session 15 - Compilers and Languages. Chair(s): Lingming Zhang (University of Illinois at Urbana-Champaign)
Binary analysis, or the ability to analyze binary code, is an important capability required for many security and software engineering applications. Consequently, there are many binary analysis techniques and tools with varied capabilities. However, testing these tools requires a large, varied binary dataset with corresponding source-level information. In this paper, we present Cornucopia, an architecture-agnostic automated framework that can generate a large number of semantically equivalent binaries from program source code. We exploit compiler optimizations and use feedback-guided learning to maximize the generation of unique binaries that correspond to the same program. Our evaluation shows that Cornucopia was able to generate 309K binaries across four architectures (x86, x64, ARM, MIPS) with an average of 403 binaries for each program. Our experiments also revealed a large number (~300) of issues with the LLVM optimization scheduler resulting in compiler crashes. Our evaluation of four popular binary analysis tools, angr, Ghidra, IDA, and radare, using Cornucopia-generated binaries revealed various issues with these tools. Specifically, we found 263 crashes in angr and one memory corruption issue in IDA. Our differential testing on the analysis results revealed various semantic bugs in these tools. We also tested machine learning tools, Asm2Vec, SAFE, and Debin, that claim to capture binary semantics, and show that they perform very poorly (e.g., Debin's F1 score dropped to 12.9% from the reported 63.1%) on Cornucopia-generated binaries. In summary, our exhaustive evaluation shows that Cornucopia is an effective mechanism to generate binaries that can be used to test binary analysis techniques effectively.
Journal-first Papers
Wed 12 Oct 2022 13:50 - 14:10 at Banquet B - Technical Session 15 - Compilers and Languages. Chair(s): Lingming Zhang (University of Illinois at Urbana-Champaign)
Research Papers
Wed 12 Oct 2022 14:10 - 14:30 at Banquet B - Technical Session 15 - Compilers and Languages. Chair(s): Lingming Zhang (University of Illinois at Urbana-Champaign)
We present JAttack, a framework that enables template-based testing for compilers. Using JAttack, a developer writes a template program that describes a set of programs to be generated and given as test inputs to a compiler. Such a framework enables developers to incorporate their domain knowledge on testing compilers, giving a basic program structure that allows for exploring complex programs that can trigger sophisticated compiler optimizations. A developer writes a template program in the host language (Java) that contains holes to be filled by JAttack. Each hole, written using a domain-specific language, constructs a node within an extended abstract syntax tree (eAST). An eAST node defines the search space for the hole, i.e., a set of expressions and values. JAttack generates programs by executing templates and filling each hole by randomly choosing expressions and values (available within the search space defined by the hole). Additionally, we introduce several optimizations to reduce JAttack's generation cost. While JAttack could be used to test various compiler features, we demonstrate its capabilities in helping test just-in-time (JIT) Java compilers, whose optimizations occur at runtime after a sufficient number of executions. Using JAttack, we have found six critical bugs that were confirmed by Oracle developers. Four of them were previously unknown, including two unknown CVEs (Common Vulnerabilities and Exposures). JAttack shows the power of combining developers' domain knowledge (via templates) with random testing to detect bugs in JIT compilers.
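A tiny Python analogue of the template idea (JAttack itself targets Java and uses its own hole DSL; this is only a hedged illustration of holes being filled from bounded search spaces, with invented hole names and ranges):

```python
# Hypothetical Python analogue of a template with "holes": each hole is filled
# by randomly choosing a value or operator from its own search space.
import random

def int_hole(lo, hi):
    return random.randint(lo, hi)

def op_hole(ops=("+", "-", "*")):
    return random.choice(ops)

def generate_program():
    """Instantiate the template into one concrete test program (as source)."""
    a, b, op = int_hole(0, 100), int_hole(1, 10), op_hole()
    return f"""
def generated():
    x = {a}
    for i in range({b}):
        x = x {op} i      # hole-controlled arithmetic in a hot loop
    return x
"""

random.seed(0)
for _ in range(3):
    src = generate_program()
    namespace = {}
    exec(src, namespace)             # run the generated program as a test input
    print(namespace["generated"]())
```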
Industry Showcase
Wed 12 Oct 2022 14:30 - 14:50 at Banquet B - Technical Session 15 - Compilers and Languages. Chair(s): Lingming Zhang (University of Illinois at Urbana-Champaign)
Rust is a young systems programming language, but it has gained tremendous popularity thanks to its assurance of memory safety. However, the performance of Rust has been less systematically understood, although many people claim that Rust is comparable to C/C++ in efficiency.
In this paper, we aim to understand the performance of Rust, using C as the baseline. First, we collect a set of micro-benchmarks where each program is implemented in both Rust and C. To ensure fairness, we manually validate that the Rust version and the C version implement identical functionality using the same algorithm. Our measurements based on the micro-benchmarks show that Rust is in general slower than C, but the extent of the slowdown varies across different programs. On average, Rust brings a 1.75x "performance overhead" compared to C. Second, we dissect the root causes of the overhead and unveil that it is primarily incurred by run-time checks inserted by the compiler and restrictions enforced by the language design. With the run-time checks disabled and the restrictions loosened, Rust presents performance indistinguishable from C.
Research Papers
Wed 12 Oct 2022 14:50 - 15:10 at Banquet B - Technical Session 15 - Compilers and Languages. Chair(s): Lingming Zhang (University of Illinois at Urbana-Champaign)
Automatically fixing compilation errors can greatly raise the productivity of software development by guiding novice or AI programmers to write and debug code. Recently, learning-based program repair has gained extensive attention and become the state of the art in practice. But it still leaves plenty of room for improvement. In this paper, we propose an end-to-end solution, TransRepair, to locate the error lines and create the correct substitute for a C program simultaneously. Superior to its counterparts, our approach takes into account the context of erroneous code and diagnostic compilation feedback. We then devise a Transformer-based neural network to learn the ways of repair from the erroneous code as well as its context and the diagnostic feedback. To increase the effectiveness of TransRepair, we summarize 5 types and 74 fine-grained sub-types of compilation errors from two real-world program datasets and the Internet. A program corruption technique is then developed to synthesize a large dataset with 1,821,275 erroneous C programs. Through extensive experiments, we demonstrate that TransRepair outperforms the state of the art in both single repair accuracy and full repair accuracy. Further analysis sheds light on the strengths and weaknesses of contemporary solutions for future improvement.
Research Papers
Wed 12 Oct 2022 15:10 - 15:30 at Banquet B - Technical Session 15 - Compilers and Languages. Chair(s): Lingming Zhang (University of Illinois at Urbana-Champaign)
Source code must be compiled into machine code so that machines can understand the intention of software. Compilers are important, because a bug in a compiler can hinder or even crash the compilation. As there is often more than one compiler for each programming language, differential testing has been widely used to detect bugs in compilers. Its basic idea is to compile test programs with different compilers and compare their compilation results to detect bugs. For this research line, test programs are critical, and researchers have proposed various approaches to generate them. The state-of-the-art approaches can be roughly divided into random-based and mutation-based approaches: random-based approaches generate random programs, and mutation-based approaches mutate programs to generate more test programs. Both lines of approaches mainly generate random code, but it is more beneficial to use real programs, since it is easier to learn the impacts of compiler bugs and it becomes reasonable to use both valid and invalid code. However, most real programs from code repositories are ineffective at triggering compiler bugs, partially because they are compiled before they are submitted. In this experience paper, we apply two techniques, differential testing and code snippet extraction, to the specific research domain of compiler testing. Based on our observations on the practice of testing compilers, we identify bug reports of compilers as a new source for compiler testing. To illustrate the benefits of the new source, we implement a tool, called LeRe, that extracts test programs from bug reports and uses differential testing to detect compiler bugs with the extracted programs. After enriching the test programs, we have found 156 unique bugs in the latest versions of gcc and clang. Among them, 103 bugs have been confirmed as valid, and 9 bugs have already been fixed. Our found bugs contain 59 accept-invalid bugs and 33 reject-valid bugs, in which compilers wrongly accept invalid programs or reject valid programs. The new source enables us to detect accept-invalid and reject-valid bugs that were usually missed by prior approaches, which seldom report these two types of bugs. Besides our found bugs, we also present our analysis of our invalid bug reports. The results are useful for programmers when they are switching from one compiler to another, and can provide insights when researchers apply differential testing to detect bugs in more types of software.
I received my Ph.D. degree from Peking University in 2009. My Ph.D. dissertation was nominated for the distinguished Ph.D. dissertation award of the China Computer Federation. After graduation, I joined the Institute of Software, Chinese Academy of Sciences as an assistant professor, and was promoted to associate professor in 2011. From 2012 to 2014, I was a visiting scholar at the University of California, Davis. In 2014, I joined Shanghai Jiao Tong University. I am a recipient of an ACM SIGSOFT Distinguished Paper Award, the best paper award of ASE, and the best paper award of APSEC.
NIER Track
Wed 12 Oct 2022 16:00 - 16:10 at Banquet B - Technical Session 17 - SE for AI. Chair(s): Tim Menzies (North Carolina State University)
Bytecode is used in software analysis and other approaches due to its advantages, such as high availability and simple specification. Therefore, to leverage these advantages in training language models with bytecode, it is important to clearly recognize the characteristics of the naturalness of bytecode. However, the naturalness of bytecode has not been actively explored.
In this paper, we experimentally show the naturalness of bytecode instructions and investigate their characteristics by empirically assessing 10 Java open-source projects. We demonstrate that, at the method level, bytecode instructions are more natural than source code representations and less natural than abstract syntax tree representations. Furthermore, we found that there is no correlation between the naturalness of bytecode instructions and that of source code representations at the method level. Based on these findings, the characteristics of the naturalness of bytecode instructions need further exploration. We expect that the findings of this paper will be helpful for future work on various automated software engineering tasks that use bytecode models.
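Naturalness is commonly measured as the cross-entropy of a token sequence under an n-gram language model, where lower values mean more predictable, hence more natural, code. The compact sketch below (our own illustration with a bigram model, add-alpha smoothing, and made-up instruction streams; it is not the paper's measurement pipeline) shows the idea:

```python
# Hypothetical sketch: score "naturalness" of bytecode instruction sequences
# as bigram cross-entropy (lower = more predictable = more natural).
import math
from collections import Counter

def train_bigram(sequences):
    uni, bi = Counter(), Counter()
    for seq in sequences:
        toks = ["<s>"] + seq
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi

def cross_entropy(seq, uni, bi, vocab_size, alpha=1.0):
    """Per-token cross-entropy with add-alpha smoothing."""
    toks = ["<s>"] + seq
    total = 0.0
    for prev, cur in zip(toks, toks[1:]):
        p = (bi[(prev, cur)] + alpha) / (uni[prev] + alpha * vocab_size)
        total += -math.log2(p)
    return total / (len(toks) - 1)

corpus = [["aload_0", "getfield", "areturn"],
          ["aload_0", "getfield", "ireturn"],
          ["aload_0", "aload_1", "putfield", "return"]]
uni, bi = train_bigram(corpus)
vocab = len({t for s in corpus for t in s} | {"<s>"})
print(cross_entropy(["aload_0", "getfield", "areturn"], uni, bi, vocab))
print(cross_entropy(["iconst_0", "swap", "athrow"], uni, bi, vocab))  # less natural
```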
Research Papers
Wed 12 Oct 2022 16:10 - 16:30 at Banquet B - Technical Session 17 - SE for AI. Chair(s): Tim Menzies (North Carolina State University)
Assigning appropriate developers to bugs is one of the main challenges in bug triage. Demand for automatic bug triage is increasing in industry, as manual bug triage is labor-intensive and time-consuming in large projects. The key to the bug triage task is extracting semantic information from a bug report. In recent years, large Pre-trained Language Models (PLMs), including BERT, have achieved dramatic progress in the natural language processing (NLP) domain. However, applying large PLMs to the bug triage task for extracting semantic information has several challenges. In this paper, we address the challenges and propose a novel framework for bug triage named LBT-P, standing for Light Bug Triage framework with a Pre-trained language model. It compresses a large PLM into small and fast models using knowledge distillation techniques and also prevents catastrophic forgetting of the PLM by introducing knowledge preservation fine-tuning. We also develop a new loss function exploiting representations of earlier layers as well as deeper layers in order to handle the overthinking problem. We demonstrate our proposed framework on a real-world private dataset and three public real-world datasets: Google Chromium, Mozilla Core, and Mozilla Firefox. The results of the experiments show the superiority of LBT-P.
NIER Track
Wed 12 Oct 2022 16:30 - 16:40 at Banquet B - Technical Session 17 - SE for AI. Chair(s): Tim Menzies (North Carolina State University)
To succeed with the development of modern software, organizations must have the agility to adapt faster to constantly evolving environments to deliver more reliable and optimized solutions that can be adapted to the needs and environments of their stakeholders, including users, customers, business, development, and IT. However, stakeholders do not have sufficient automated support for global decision making, considering the increasing variability of the solution space, the frequent lack of explicit representation of its associated variability and decision points, and the uncertainty of the impact of decisions on stakeholders and the solution space. This leads to an ad-hoc decision making process that is slow, error-prone, and often favors local knowledge over global, organization-wide objectives. The Multi-Plane Models and Data (MP-MODA) framework explicitly represents and manages variability, impacts, and decision points. It enables automation and tool support in aid of a multi-criteria decision making process involving different stakeholders within a feedback-driven software development process where feedback cycles aim to reduce uncertainty. We present the conceptual structure of the framework, discuss its potential benefits, and enumerate key challenges related to tool-supported automation and analysis within MP-MODA.
Research Papers
Wed 12 Oct 2022 16:40 - 17:00 at Banquet B - Technical Session 17 - SE for AI. Chair(s): Tim Menzies (North Carolina State University)
Microservices Architecture (MSA) has evolved into a de-facto standard for developing cloud-native enterprise applications during the last decade. MSA provides a variety of benefits, including quicker infrastructure deployment and service availability, elastic scalability, dependability, and increased security. To leverage these characteristics, existing (monolithic) systems must be decomposed into loosely coupled microservices. This decomposition, when undertaken manually, can be a rather laborious process rife with errors. As a result, there has been a surge of interest in applying AI-based technologies to identify decomposition strategies. However, the usefulness of these approaches is limited by the expressiveness of the program representation (as a result of their reliance on imprecise but extensive static analysis or exact but incomplete dynamic analysis) and their inability to model the application's dependency on critical external resources such as database tables. Consequently, partitioning recommendations offered by current tools result in architectures that (a) are distributed monoliths and/or (b) force the use of (often criticized) distributed transactions.
This work attempts to overcome these challenges by introducing CARGO (short for Context sensitive lAbel pRopaGatiOn), a novel un-/semi-supervised partition refinement technique that uses a comprehensive system dependence graph built using context- and flow-sensitive static analysis of the monolithic application. CARGO offers two modes of operation: (i) an unsupervised mode, where it decomposes the program with no a priori partition assignments, and (ii) a semi-supervised mode, where it is seeded by the partitions discovered using any state-of-the-art microservice partitioning algorithm, refining and thereby enriching the partitioning quality of the current state-of-the-art algorithms. CARGO was used to augment four state-of-the-art microservice partitioning techniques (comprising 1 industrial tool and 3 open-source projects). These were applied to five Java EE applications (comprising 4 open source and 1 proprietary project). Experimental results show that CARGO is capable of improving the partition quality of all modern microservice partitioning techniques (as measured by 4 architectural metrics). CARGO also substantially reduces distributed database transactions, which are a major bottleneck in most microservice applications. Furthermore, our real-world performance evaluation on a benchmark JEE application (Daytrader) deployed under varying loads shows that CARGO also lowers the overall latency of the deployed microservice application by 11% and increases throughput by 120% on average, across various sequences of simulated user actions.
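A bare-bones sketch of the label-propagation step (our own toy graph and update rule with invented class names; CARGO's context- and flow-sensitive dependence graph and refinement are far richer): seed partition labels from a base partitioner spread to unlabeled program elements along dependence edges until the assignment stabilizes.

```python
# Hypothetical sketch: propagate seed partition labels over a dependence graph.
from collections import Counter

# Undirected dependence edges between program classes (made-up example).
edges = [("OrderSvc", "OrderDAO"), ("OrderDAO", "OrdersTable"),
         ("UserSvc", "UserDAO"), ("UserDAO", "UsersTable"),
         ("OrderSvc", "UserSvc")]
neighbors = {}
for a, b in edges:
    neighbors.setdefault(a, set()).add(b)
    neighbors.setdefault(b, set()).add(a)

labels = {"OrderSvc": "orders", "UserSvc": "users"}   # seeds from a base partitioner

changed = True
while changed:
    changed = False
    for node in neighbors:
        if node in labels:
            continue                                   # keep already-assigned labels
        votes = Counter(labels[n] for n in neighbors[node] if n in labels)
        if votes:
            labels[node] = votes.most_common(1)[0][0]  # adopt the majority label
            changed = True

print(labels)   # DAOs and tables follow the services they depend on
```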
Tool Demonstrations
Wed 12 Oct 2022 09:30 - 10:00 at Ballroom A - Tool Poster Session 2
As software systems continue to grow in complexity and scale, deploying and delivering them becomes increasingly difficult. In this work, we develop DeployQA, a novel QA bot that automatically answers software deployment questions over user manuals and Stack Overflow posts. DeployQA is built upon RoBERTa. To bridge the gap between natural language and the domain of software deployment, we propose three adaptations in terms of vocabulary, pre-training, and fine-tuning, respectively. We evaluate our approach on our constructed DeQuAD dataset. The results show that DeployQA remarkably outperforms baseline methods by leveraging the three domain adaptation strategies.
Research Papers
Wed 12 Oct 2022 17:10 - 17:30 at Banquet B - Technical Session 17 - SE for AI. Chair(s): Tim Menzies (North Carolina State University)
Software requirement classification is a longstanding and important problem in requirements engineering. Previous studies have applied various machine learning techniques to this problem, including Support Vector Machines (SVM) and decision trees. With the recent popularity of NLP techniques, the state-of-the-art approach NoRBERT utilizes the pre-trained language model BERT and achieves satisfactory performance. However, the dataset PROMISE used by the existing approaches for this problem consists of only hundreds of requirements that are outdated according to today's technology and market trends. Besides, the NLP techniques applied in these approaches might be obsolete. In this paper, we propose an approach of prompt learning for requirement classification using BERT-based pre-trained language models (PRCBERT), which applies flexible prompt templates to achieve accurate requirements classification. Experiments conducted on two existing small-size requirement datasets (PROMISE and NFR-Review) and our collected large-scale requirement dataset NFR-SO prove that PRCBERT exhibits significantly better classification performance than NoRBERT and MLM-BERT (BERT with the standard prompt template). On the de-labeled NFR-Review and NFR-SO datasets, Trans_PRCBERT (the version of PRCBERT fine-tuned on PROMISE) achieves satisfactory zero-shot performance, with 50.30% and 72.87% F1-scores, when a self-learning strategy is enabled.
NIER Track
Wed 12 Oct 2022 17:30 - 17:40 at Banquet B - Technical Session 17 - SE for AI. Chair(s): Tim Menzies (North Carolina State University)
Automated code generation is a longstanding challenge in both the software engineering and artificial intelligence communities. Currently, some works have started to investigate the functional correctness of code generation, where a code snippet is considered correct if it passes a set of test cases. However, most existing works still model code generation as text generation without considering program-specific information, such as functionally equivalent code snippets and test execution feedback. To address the above limitations, this paper proposes a new idea of combining program analysis with deep learning for neural code generation, where functionally equivalent code snippets and test execution feedback are considered at the training stage. Specifically, we first design several code transformation heuristics to produce different variants of a code snippet satisfying the same functionality. In addition, we employ the test execution feedback and design a test-driven discriminative task to train a novel discriminator, aiming to let the model distinguish whether the generated code is correct or not. The preliminary results on a newly published dataset demonstrate the effectiveness of our proposed framework for code generation. In particular, in terms of the pass@1 metric, we achieve gains of 8.81 and 11.53 compared with CodeGPT and CodeT5, respectively.
Late Breaking Results
Wed 12 Oct 2022 17:40 - 17:50 at Banquet B - Technical Session 17 - SE for AI. Chair(s): Tim Menzies (North Carolina State University)
Despite the recent trend of developing and applying neural source code models to software engineering tasks, the quality of such models is insufficient for real-world use. This is because there can be noise in the source code corpora used to train such models. In this paper, we adapt data-influence methods to detect such noise. Data-influence methods are used in machine learning to evaluate the similarity of a target sample to the correct samples in order to determine whether or not the target sample is noisy. Our evaluation results show that data-influence methods can identify noisy samples for neural code models in classification-based tasks. We anticipate that this approach will contribute to the larger vision of developing better neural source code models from a data-centric perspective, which is a key driver for developing useful source code models in practice.
Testing has always been about simulation, although often implicitly. For example, when a software tester mocks out the behaviour of a code unit, they are effectively simulating its behaviour. When a test tool creates a test user account in order to log in and run a test, it is essentially simulating a potential user behaviour. Traditionally, software engineers, and the research that supports them, have tended to apply greater focus to problems and challenges associated with test mock techniques and applications. However, as software is increasingly deployed to user communities that interact through the software platform, effective techniques for simulating user behaviour are becoming increasingly important. In this talk we outline engineering trade-offs between test (simulation) realism, which we seek to maximise, and various testing costs we seek to minimise. We present an overview of and results from the deployment, at Meta, of two simulation-inspired testing technologies: Sapienz and Web Enabled Simulation, both of which aim to realistically simulate user behaviour, on the client side and server side respectively. We show how a simulation-based testing approach opens new perspectives on new and existing automated software engineering challenges and opportunities.
Research Papers
Thu 13 Oct 2022 10:00 - 10:20 at Banquet B - Technical Session 21 - SE for AI II. Chair(s): Andrea Stocco (Università della Svizzera italiana, USI)
Today, an increasing number of Adaptive Deep Neural Networks (AdNNs) are being used to make decisions on resource-constrained embedded devices. We observe that, similar to traditional software, redundant computations exist in AdNNs, resulting in considerable performance degradation. The performance degradation in AdNNs is dependent on the input workloads, and is referred to as input-dependent performance bottlenecks (IDPBs). To ensure an AdNN satisfies the performance requirements of real-time applications, it is essential to conduct performance testing to detect IDPBs in the AdNN. Existing neural network testing methods are primarily concerned with correctness testing, which does not involve performance testing. To fill this gap, we propose DeepPerform, a scalable approach to generate test samples that detect IDPBs in AdNNs. We first demonstrate how the problem of generating performance test samples detecting IDPBs can be formulated as an optimization problem. Following that, we demonstrate how DeepPerform efficiently handles the optimization problem by learning and estimating the distribution of AdNNs' computational consumption. We evaluate DeepPerform on three widely used datasets against five popular AdNN models. The results show that DeepPerform generates test samples that cause more severe performance degradation (FLOPs: increase up to 552%). Furthermore, DeepPerform is substantially more efficient than the baseline methods in terms of generating test inputs (runtime overhead: only 6-10 milliseconds).
Late Breaking Results
Thu 13 Oct 2022 10:20 - 10:30 at Banquet B - Technical Session 21 - SE for AI II. Chair(s): Andrea Stocco (Università della Svizzera italiana, USI)
Machine learning (ML) systems based on deep neural networks are more present than ever in software solutions for numerous industries. Their inner workings, relying on models that learn from data, are as helpful as they are mysterious to non-expert people. There is an increasing need to make the design and development of those solutions accessible to a more general public while at the same time making them easier to explore. In this paper, to address this need, we propose a new assisted approach, centered on the downstream task to be performed, for helping practitioners start using and applying Deep Learning (DL) techniques. This proposal, supported by an initial testbed UI prototype, uses an externalized form of knowledge, where JSON files compile different pipeline metadata with their respective related artifacts (e.g., model code, the dataset to be loaded, good hyperparameter choices), which are presented as the user interacts with a conversational agent that suggests candidate solutions for a given task.
Research Papers
Thu 13 Oct 2022 10:30 - 10:50 at Banquet B - Technical Session 21 - SE for AI II Chair(s): Andrea Stocco Università della Svizzera italiana (USI)Due to its ability to bypass the oracle problem, Metamorphic Testing (MT) has been a popular technique for testing deep learning (DL) software. However, no work has taken notice of the prioritization of Metamorphic test case Pairs (MPs), which is essential and beneficial to the effectiveness of MT in DL testing. When fault-sensitive MPs that are apt to trigger violations and expose defects are not prioritized, the revelation of some violations can be greatly delayed or even missed, concealing critical defects. In this paper, we propose the first method to prioritize MPs for DL software, so as to boost the revelation of violations in DL testing. Specifically, we devise a new type of metric to measure the execution diversity of DL software on MPs based on the distribution discrepancy of the neuron outputs. The fault-sensitive MPs are then prioritized based on the devised diversity metric. Comprehensive evaluation results show that the proposed prioritization method and diversity metric can effectively prioritize the fault-sensitive MPs, boost the revelation of violations, and even facilitate the selection and design of effective Metamorphic Relations for image classification DL software.
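The sketch below illustrates the general idea of prioritising MPs by a distribution discrepancy over neuron outputs; the concrete metric in the paper may differ, and the 1-D Wasserstein distance and stand-in model used here are assumptions for illustration.

```python
# Sketch under assumptions: prioritise metamorphic test-case pairs (MPs) by a
# distribution-discrepancy score over neuron activations. A simple 1-D
# Wasserstein distance between pooled activations of source and follow-up
# inputs stands in for the paper's metric.
import numpy as np

def discrepancy(acts_src, acts_follow):
    """1-D Wasserstein distance between two activation samples."""
    a, b = np.sort(acts_src.ravel()), np.sort(acts_follow.ravel())
    n = min(len(a), len(b))
    return float(np.mean(np.abs(a[:n] - b[:n])))

def prioritise(mps, get_activations):
    """Order MPs so that pairs with the largest activation discrepancy
    (assumed to be more fault-sensitive) are executed first."""
    scored = [(discrepancy(get_activations(src), get_activations(fol)), i)
              for i, (src, fol) in enumerate(mps)]
    return [i for _, i in sorted(scored, reverse=True)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = rng.normal(size=(8, 16))
    fake_model = lambda x: np.tanh(x @ weights)   # stand-in for a DNN's neuron outputs
    mps = [(rng.normal(size=8), rng.normal(size=8) * s) for s in (1.0, 2.0, 0.5)]
    print(prioritise(mps, fake_model))
```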
Journal-first Papers
Thu 13 Oct 2022 10:50 - 11:10 at Banquet B - Technical Session 21 - SE for AI II Chair(s): Andrea Stocco Università della Svizzera italiana (USI)A growing demand is witnessed in both industry and academia for employing Deep Learning (DL) in various domains to solve real-world problems. Deep reinforcement learning (DRL) is the application of DL in the domain of Reinforcement Learning. Like any software system, DRL applications can fail because of faults in their programs. In this paper, we present the first attempt to categorize faults occurring in DRL programs. We manually analyzed 761 artifacts of DRL programs (from Stack Overflow posts and GitHub issues) developed using well-known DRL frameworks (OpenAI Gym, Dopamine, Keras-rl, Tensorforce) and identified faults reported by developers/users. We labeled and taxonomized the identified faults through several rounds of discussions. The resulting taxonomy was validated using an online survey with 19 developers/researchers. To allow for the automatic detection of faults in DRL programs, we have defined a meta-model of DRL programs and developed DRLinter, a model-based fault detection approach that leverages static analysis and graph transformations. The execution flow of DRLinter consists of parsing a DRL program to generate a model conforming to our meta-model and applying detection rules to the model to identify fault occurrences. The effectiveness of DRLinter is evaluated using 21 synthetic and real faulty DRL programs. For the synthetic samples, we injected faults observed in the analyzed artifacts from Stack Overflow and GitHub. The results show that DRLinter can successfully detect faults in both synthesized and real-world examples with a recall of 75% and a precision of 100%.
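The toy snippet below gives a feel for the rule-based detection step: a program is first abstracted into a model, and declarative rules are then applied to it. The meta-model fields and the two example rules (missing exploration decay, unmasked terminal states) are assumptions for illustration, not DRLinter's actual rule set.

```python
# Toy illustration (not DRLinter's meta-model): apply declarative detection
# rules to an extracted model of a DRL program. Field names and rules are
# assumptions chosen to resemble common DRL faults.
from dataclasses import dataclass, field

@dataclass
class DRLProgramModel:
    uses_epsilon_greedy: bool = False
    decays_epsilon: bool = False
    terminal_state_handled: bool = True
    issues: list = field(default_factory=list)

def rule_missing_exploration_decay(m):
    if m.uses_epsilon_greedy and not m.decays_epsilon:
        m.issues.append("epsilon-greedy policy without exploration-rate decay")

def rule_missing_terminal_handling(m):
    if not m.terminal_state_handled:
        m.issues.append("target value computed without masking terminal states")

RULES = [rule_missing_exploration_decay, rule_missing_terminal_handling]

def lint(model):
    for rule in RULES:
        rule(model)
    return model.issues

if __name__ == "__main__":
    print(lint(DRLProgramModel(uses_epsilon_greedy=True, decays_epsilon=False)))
```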
Research Papers
Thu 13 Oct 2022 11:10 - 11:30 at Banquet B - Technical Session 21 - SE for AI II Chair(s): Andrea Stocco Università della Svizzera italiana (USI)Quality assurance is of great importance for deep learning (DL) systems, especially when they are applied in safety-critical applications. While quality issues of native DL applications have been extensively analyzed, the issues of JavaScript-based DL applications have never been systematically studied. Compared with native DL applications, JavaScript-based DL applications can run on major browsers, making them platform- and device-independent. Specifically, the quality of a JavaScript-based DL application depends on three parts: the application, the third-party DL library used, and the underlying DL framework (e.g., TensorFlow.js); together these are called a JavaScript-based DL system. In this paper, we conduct the first empirical study on the quality issues of JavaScript-based DL systems. Specifically, we collect and analyze 700 real-world faults from relevant GitHub repositories, including the official TensorFlow.js repository, 13 third-party DL libraries, and 58 JavaScript-based DL applications. To better understand the characteristics of these faults, we manually analyze and construct taxonomies for the fault symptoms, root causes, and fix patterns, respectively. Moreover, we also study the distributions of faults, symptoms, and root causes across the stages of the development lifecycle, the three-level architecture of the DL system, and the four major components of the TensorFlow.js framework. Based on the results, we suggest actionable implications and research avenues that can potentially facilitate the development, testing, and debugging of JavaScript-based DL systems.
NIER Track
Thu 13 Oct 2022 11:30 - 11:40 at Banquet B - Technical Session 21 - SE for AI II Chair(s): Andrea Stocco Università della Svizzera italiana (USI)The task of a deep learning (DL) program is to train a model with high precision and apply it to different scenarios. A DL program often involves massive numerical calculations, so the robustness and stability of these calculations are crucial to the quality of DL programs. Indeed, numerical bugs are common in DL programs, producing NaN (Not-a-Number) and INF (Infinity) values. A numerical bug may render a DL model inaccurate and make the DL application unusable. In this work, we conduct the first empirical study on numerical bugs in DL programs by analyzing programs implemented on top of two popular DL libraries (i.e., TensorFlow and PyTorch). Specifically, we collect a dataset of 400 numerical bugs in DL programs. We then classify these numerical bugs into 9 categories based on their root causes and summarize two findings. Finally, we provide the implications of our study for detecting numerical bugs in DL programs.
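For readers unfamiliar with this bug class, the snippet below shows a generic example (not taken from the paper's dataset): a cross-entropy loss that takes the logarithm of a probability that can reach zero, producing infinities, and the usual clipping fix.

```python
# Generic illustration of a common numerical bug in DL code: log of a value
# that can be exactly zero yields -inf/NaN in the loss. Clipping keeps the
# computation finite. Not taken from the paper's dataset.
import numpy as np

def cross_entropy_buggy(probs, labels):
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))   # log(0) -> -inf

def cross_entropy_fixed(probs, labels, eps=1e-12):
    clipped = np.clip(probs, eps, 1.0)                                # keep probs away from 0
    return -np.mean(np.log(clipped[np.arange(len(labels)), labels]))

if __name__ == "__main__":
    probs = np.array([[1.0, 0.0], [0.9, 0.1]])   # confidently wrong on the first sample
    labels = np.array([1, 0])
    print(cross_entropy_buggy(probs, labels))    # inf
    print(cross_entropy_fixed(probs, labels))    # finite
```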
Research Papers
Thu 13 Oct 2022 11:40 - 12:00 at Banquet B - Technical Session 21 - SE for AI II Chair(s): Andrea Stocco Università della Svizzera italiana (USI)Deep learning (DL) techniques have attracted much attention in recent years and have been applied to many application scenarios, including safety-critical ones. Improving the universal robustness of DL models is vital, and many approaches have been proposed over the last decades to this end. Among existing approaches, adversarial training is the most representative; it advocates a post-training model tuning process that incorporates adversarial samples. Although successful, such approaches still suffer from generalizability issues and unsatisfactory effectiveness in the face of various attacks. Targeting this problem, in this paper we propose a novel model training framework that aims to improve the universal robustness of DL models via model transformation combined with a data augmentation strategy in a delta debugging fashion. We have implemented our approach in a tool called Dare and conducted an extensive evaluation on 9 DL models. The results show that our approach significantly outperforms existing adversarial training techniques. Specifically, Dare achieved the highest Empirical Robustness in 29 of 45 testing scenarios under various attacks, while the number drops to 5 of 45 for the best baseline approach.
Research Papers
Thu 13 Oct 2022 13:30 - 13:50 at Banquet B - Technical Session 26 - Testing III Chair(s): Owolabi Legunsen Cornell UniversityAugmented Reality (AR) software development frameworks typically provide virtual reality scenes for testing purposes. To make sure AR apps meet users’ expectations, the precision of virtual object placement is an important measure of an AR application’s usability during AR software testing. Within virtual reality scenes, although a test script can automatically move the camera to view different parts of the scene and place virtual objects at different locations, the testing process is still often manual, because a human tester needs to either watch the test execution or watch videos/screenshots recorded during test execution to decide whether an object placement is noticeably imprecise. In this paper, we develop a novel technique, PredART, to explore whether it is possible to predict when human users would find the placement of an object noticeably imprecise. The prediction results can be used for automatic assertions in AR software testing, raising warnings to human testers only when a potentially imprecise placement is found. Our evaluation shows that PredART is able to predict noticeable placement errors with an F-score of 75%.
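The sketch below shows how such a predictor could back an automatic assertion in an AR test script. `predict_noticeable_error` is a hypothetical stand-in for the trained PredART model, and the distance-based surrogate inside it is purely illustrative.

```python
# Sketch: using a placement-imprecision predictor as an automatic test
# assertion. `predict_noticeable_error` is a hypothetical stand-in for a
# trained model; the distance surrogate below is an assumption.
def predict_noticeable_error(screenshot, intended_pose, observed_pose):
    """Return an (illustrative) probability that a human would notice the
    placement error in this frame."""
    dx = sum((a - b) ** 2 for a, b in zip(intended_pose, observed_pose)) ** 0.5
    return min(1.0, dx / 0.5)   # toy surrogate: 0.5 m error or more is 'noticeable'

def assert_placement_ok(screenshot, intended_pose, observed_pose, threshold=0.5):
    p = predict_noticeable_error(screenshot, intended_pose, observed_pose)
    if p >= threshold:
        raise AssertionError(f"placement likely noticeably imprecise (p={p:.2f})")

if __name__ == "__main__":
    assert_placement_ok(None, (0.0, 0.0, 1.0), (0.05, 0.0, 1.0))      # passes silently
    try:
        assert_placement_ok(None, (0.0, 0.0, 1.0), (0.9, 0.0, 1.0))   # raises a warning
    except AssertionError as e:
        print(e)
```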
Research Papers
Thu 13 Oct 2022 13:50 - 14:10 at Banquet B - Technical Session 26 - Testing III Chair(s): Owolabi Legunsen Cornell UniversityGame-like programs have become increasingly popular in many software engineering domains such as mobile apps, web applications, or programming education. However, creating tests for programs that have the purpose of challenging human players is a daunting task for automatic test generators. Even if test generation succeeds in finding a relevant sequence of events to exercise a program, the randomized nature of games means that it may neither be possible to reproduce the exact program behavior underlying this sequence, nor to create test assertions checking whether observed randomized game behavior is correct. To overcome these problems, we propose Neatest, a novel test generator based on the NeuroEvolution of Augmenting Topologies (NEAT) algorithm. Neatest systematically explores a program’s statements and creates neural networks that operate the program in order to reliably reach each statement—that is, Neatest learns to play the game in a way that reliably reaches different parts of the code. As the networks learn the actual game behavior, they can also serve as test oracles by evaluating how surprising the observed inputs of a program under test are compared to the inputs obtained on a supposedly correct version of the program. We evaluate this approach in the context of Scratch, an educational programming environment. Our empirical study on 25 non-trivial Scratch games demonstrates that our approach can successfully train neural networks that are not only far more resilient to random influences than traditional test suites consisting of static input sequences, but are also highly effective, with an average mutation score of more than 70%.
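The rough sketch below conveys the surprise-based oracle idea: statistics gathered while the trained network plays a presumed-correct version of the game serve as a baseline, and a program under test whose observations deviate strongly is flagged. The z-score-style measure is an assumption for illustration, not Neatest's actual surprise metric.

```python
# Rough sketch of a surprise-based oracle: baseline observation statistics
# come from a presumed-correct program version; large deviations on the
# program under test are flagged. The z-score measure is an assumption.
import numpy as np

class SurpriseOracle:
    def __init__(self, baseline_observations):
        self.mean = baseline_observations.mean(axis=0)
        self.std = baseline_observations.std(axis=0) + 1e-9

    def surprise(self, observation):
        return float(np.max(np.abs((observation - self.mean) / self.std)))

    def check(self, observation, threshold=4.0):
        """True if the observed behaviour looks consistent with the baseline."""
        return self.surprise(observation) < threshold

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    oracle = SurpriseOracle(rng.normal(0, 1, size=(500, 4)))   # correct version
    print(oracle.check(rng.normal(0, 1, size=4)))              # likely True
    print(oracle.check(np.array([0.0, 0.0, 9.0, 0.0])))        # likely False
```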
Industry Showcase
Thu 13 Oct 2022 14:10 - 14:30 at Banquet B - Technical Session 26 - Testing III Chair(s): Owolabi Legunsen Cornell UniversityIn today’s IT environment, with a growing number of costly outages, increasing system complexity, and the availability of massive operational data, there is a growing demand to effectively leverage Artificial Intelligence and Machine Learning (AI/ML) towards enhanced resiliency.
In this paper, we present an automatic fault injection platform to enable and optimize the generation of data needed for building AI/ML models to support modern IT operations. The merits of our platform include its ease of use and the ability to orchestrate complex fault scenarios and to optimize data generation for the modeling task at hand. Specifically, we designed a fault injection service that (i) combines fault injection with data collection in a unified framework, (ii) supports hybrid and multi-cloud environments, and (iii) does not require programming skills for its use. Our current implementation covers the most common fault types at both the application and infrastructure levels. The platform also includes some AI capabilities. In particular, we demonstrate the interventional causal learning capability currently available in our platform. We show how our system is able to learn a model of error propagation in a micro-service application in a cloud environment (when the communication graph among micro-services is unknown and only logs are available) for use in subsequent applications such as fault localization.
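Purely as an illustration of what a declarative, no-code fault scenario might look like, the sketch below expresses a two-step scenario and a driver that injects each fault and gathers the requested data. The fault types and field names are assumptions, not the platform's actual schema.

```python
# Illustrative scenario description and driver; fault types, field names, and
# the callbacks are assumptions, not the platform's real interface.
import time

scenario = {
    "name": "cart-service-latency-then-crash",
    "steps": [
        {"target": "cart-service", "fault": "network_latency", "ms": 300, "duration_s": 60},
        {"target": "cart-service", "fault": "pod_kill", "duration_s": 0},
    ],
    "collect": ["logs", "traces", "cpu", "memory"],   # data later used for AI/ML modelling
}

def run(scenario, inject, collect):
    """Drive the scenario: inject each fault in order, then gather the
    requested observability data."""
    for step in scenario["steps"]:
        inject(step)
        time.sleep(min(step["duration_s"], 0.1))   # shortened for the sketch
    return {kind: collect(kind) for kind in scenario["collect"]}

if __name__ == "__main__":
    result = run(scenario,
                 inject=lambda s: print("injecting", s["fault"], "on", s["target"]),
                 collect=lambda kind: f"<{kind} snapshot>")
    print(result)
```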
Research Papers
Thu 13 Oct 2022 14:30 - 14:50 at Banquet B - Technical Session 26 - Testing III Chair(s): Owolabi Legunsen Cornell UniversityMutation faults are the core of mutation testing and have been widely used in many other software testing and debugging tasks. Hence, constructing high-quality mutation faults is critical. Many traditional mutation techniques construct syntactic mutation faults based on a limited set of manually-defined mutation operators. To improve on them, the state-of-the-art deep-learning (DL) based technique (i.e., DeepMutation) constructs mutation faults by learning from real faults via classic sequence-to-sequence neural machine translation (NMT). However, its performance is not satisfactory: it cannot ensure the syntactic correctness of constructed mutation faults, and it suffers from effectiveness issues due to the huge search space and the limited features obtained by simply treating each targeted method as a token stream.
In this work, we propose a novel DL-based mutation technique (i.e., LEAM) to overcome the limitations of both traditional techniques and DeepMutation. LEAM adapts the syntax-guided encoder-decoder architecture by extending a set of grammar rules specific to our mutation task, to guarantee the syntactic correctness of constructed mutation faults. Instead of predicting a sequence of tokens one by one to form a whole mutated method, it predicts the statements to be mutated in the context of the targeted method to reduce the search space, and then predicts grammar rules for mutation fault construction based on both semantic and structural features in the AST. We conducted an extensive study to evaluate LEAM on the widely-used Defects4J benchmark. The results demonstrate that the mutation faults constructed by LEAM not only better represent real faults than two state-of-the-art traditional techniques (i.e., Major and PIT) and DeepMutation, but also substantially boost two important downstream applications of mutation faults, i.e., test case prioritization and fault localization.
Research Papers
Thu 13 Oct 2022 14:50 - 15:10 at Banquet B - Technical Session 26 - Testing III Chair(s): Owolabi Legunsen Cornell UniversityDatabase Management Systems (DBMSs) utilize transactions to ensure the consistency and integrity of data. Incorrect transaction implementations in DBMSs can lead to severe consequences, e.g., incorrect database states and query results. Therefore, it is critical to ensure the reliability of transaction implementations.
In this paper, we propose DT², an approach for automatically testing transaction implementations in DBMSs. We first randomly generate a database and a group of concurrent transactions operating on the database, supporting complex DBMS features such as various database schemas and cross-table queries. We then leverage differential testing to compare transaction execution results on multiple DBMSs to find discrepancies. The effectiveness of our method is heavily impaired by the non-determinism of concurrent transactions, so we propose a transaction test protocol to ensure the deterministic execution of concurrent transactions.
We evaluate DT² on three widely-used MySQL-compatible DBMSs, i.e., MySQL, MariaDB, and TiDB. In total, we have detected 10 unique transaction bugs and 88 transaction-related compatibility issues from the observed discrepancies. Our empirical study of these compatibility issues shows that DBMSs suffer from various transaction-related compatibility issues despite claiming to be compatible. These compatibility issues can also lead to serious consequences, e.g., inconsistent database states among DBMSs.
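The minimal sketch below conveys the differential-testing idea: the same fixed interleaving of two transactions' statements is executed on several systems and the final states are compared, with the fixed submission order playing the role of the deterministic test protocol. In practice the targets would be MySQL/MariaDB/TiDB connections and each transaction would run in its own session; here sqlite3 and a single session per system stand in so the sketch runs on its own.

```python
# Differential-testing sketch in the spirit of the abstract: execute a fixed,
# deterministic interleaving of two transactions on several systems and
# compare final states. sqlite3 stands in for MySQL-compatible DBMSs, and a
# single session per system is a simplification of real concurrent sessions.
import sqlite3

SETUP = ["CREATE TABLE t (id INTEGER PRIMARY KEY, v INTEGER)",
         "INSERT INTO t VALUES (1, 10)"]

# (transaction tag, SQL) pairs in a fixed submission order.
INTERLEAVING = [("tx1", "UPDATE t SET v = v + 1 WHERE id = 1"),
                ("tx2", "UPDATE t SET v = v * 2 WHERE id = 1"),
                ("tx1", "COMMIT"),
                ("tx2", "COMMIT")]

def run_on(make_conn):
    conn = make_conn()
    for stmt in SETUP:
        conn.execute(stmt)
    conn.commit()
    for tag, sql in INTERLEAVING:
        conn.commit() if sql == "COMMIT" else conn.execute(sql)
    return conn.execute("SELECT v FROM t WHERE id = 1").fetchone()[0]

if __name__ == "__main__":
    backends = {"dbms_a": lambda: sqlite3.connect(":memory:"),
                "dbms_b": lambda: sqlite3.connect(":memory:")}
    results = {name: run_on(mk) for name, mk in backends.items()}
    verdict = "discrepancy!" if len(set(results.values())) > 1 else "consistent"
    print(results, verdict)
```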
Tool Demonstrations
Wed 12 Oct 2022 09:30 - 10:00 at Ballroom A - Tool Poster Session 2Dynamic program analysis tools such as Eraser, Memcheck, or ThreadSanitizer abstract the contents of individual memory locations and store the abstraction results in a separate data structure called shadow memory. They then use this meta-information to efficiently implement the actual analyses. In this paper, we describe the implementation of an efficient symbolic shadow memory extension for the CBMC bounded model checker that can be accessed through an API, and sketch its use in the design of a new data race analyzer that is implemented by a code-to-code translation. Artifact/tool URL: shorturl.at/jzVZ0 Video URL: youtu.be/pqlbyiY5BLU
Research Papers
Thu 13 Oct 2022 16:10 - 16:30 at Banquet B - Technical Session 32 - Formal Methods and Models II Chair(s): Khouloud Gaaloul University of Michigan - DearbornFeature modeling is widely used to systematically model features of variant-rich software systems and their dependencies. By translating feature models into propositional formulas and analyzing them with solvers, a wide range of automated analyses across all phases of the software development process become possible. Most solvers only accept formulas in conjunctive normal form (CNF), so an additional transformation of feature models is often necessary. However, it is unclear whether this transformation has a noticeable impact on analyses. In this paper, we compare three transformations (i.e., distributive, Tseitin, and Plaisted-Greenbaum) for bringing feature-model formulas into CNF. We analyze which transformation can be used to correctly perform feature-model analyses and evaluate three CNF transformation tools (i.e., FeatureIDE, KConfigReader, and KClause) on a corpus of 22 real-world feature models. Our empirical evaluation illustrates that some CNF transformations do not scale to complex feature models or even lead to wrong results for certain analyses. Further, the choice of the CNF transformation can substantially influence the performance of subsequent analyses.
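To illustrate why the choice of transformation matters, the toy snippet below implements the Tseitin idea for a small formula: one auxiliary variable per compound subformula and a handful of equivalence clauses, so the CNF grows linearly rather than exploding as the distributive transformation can. This is an independent illustration, not one of the evaluated tools (FeatureIDE, KConfigReader, KClause).

```python
# Toy Tseitin transformation: introduce an auxiliary variable per compound
# subformula and emit clauses encoding the equivalence. Independent sketch,
# not one of the tools evaluated in the paper.
from itertools import count

class Var:
    def __init__(self, name): self.name = name
class Not:
    def __init__(self, f): self.f = f
class And:
    def __init__(self, *fs): self.fs = fs
class Or:
    def __init__(self, *fs): self.fs = fs

def tseitin(formula):
    clauses, fresh = [], count(1)

    def lit(f):
        """Return a literal (name, polarity) representing subformula f."""
        if isinstance(f, Var):
            return (f.name, True)
        if isinstance(f, Not):
            name, pol = lit(f.f)
            return (name, not pol)
        x = f"_t{next(fresh)}"                       # auxiliary variable for f
        kids = [lit(g) for g in f.fs]
        if isinstance(f, And):                       # x <-> (k1 & k2 & ...)
            for k in kids:
                clauses.append([(x, False), k])
            clauses.append([(x, True)] + [(n, not p) for n, p in kids])
        else:                                        # x <-> (k1 | k2 | ...)
            clauses.append([(x, False)] + kids)
            for n, p in kids:
                clauses.append([(x, True), (n, not p)])
        return (x, True)

    clauses.append([lit(formula)])                   # assert the root variable
    return clauses

if __name__ == "__main__":
    f = Or(And(Var("a"), Var("b")), And(Var("c"), Var("d")))
    for clause in tseitin(f):
        print(" | ".join(("" if pol else "!") + name for name, pol in clause))
```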
Journal-first Papers
Thu 13 Oct 2022 16:30 - 16:50 at Banquet B - Technical Session 32 - Formal Methods and Models II Chair(s): Khouloud Gaaloul University of Michigan - DearbornStochastic model checking, a formal verification technique based on system models, can automatically verify and analyse software-driven autonomous systems with stochastic behaviors. When coping with large-scale systems, it suffers severely from the state space explosion problem. Model abstraction is a potential technique for mitigating this problem. At present, only a few properties specified by PCTL (Probabilistic Computation Tree Logic), such as probabilistic safety and probabilistic reachability, can be preserved in practical model abstraction for stochastic model checking; these form a proper subset of PCTL* properties. To address this, we propose an effective and efficient three-valued model abstraction framework for full PCTL* stochastic model checking. We propose a new abstract model that preserves full PCTL* properties for nondeterministic and probabilistic systems, orthogonally integrating interval probabilities of transitions with games for nondeterminism. A game-based three-valued PCTL* stochastic model checking algorithm is developed to verify the abstract model, and a BPSO (binary particle swarm optimization) algorithm integrated with sample learning is designed to refine indefinite results of three-valued PCTL* stochastic model checking of the abstract model. We prove that full PCTL* properties are preserved when the result of three-valued stochastic model checking is definite, and we demonstrate the efficiency of this framework on several large case studies.
Research Papers
Thu 13 Oct 2022 16:50 - 17:10 at Banquet B - Technical Session 32 - Formal Methods and Models II Chair(s): Khouloud Gaaloul University of Michigan - DearbornWe propose Janus, an approach for finding incompleteness bugs in SMT solvers. The key insight is to mutate SMT formulas with local weakening and strengthening rules that preserve the satisfiability of the seed formula. The generated mutants are used to test SMT solvers for incompleteness bugs, i.e., inputs on which the SMT solver unexpectedly returns unknown. We realized Janus on top of the SMT solver fuzzing framework YinYang. From June to August 2021, we stress-tested the two state-of-the-art SMT solvers Z3 and CVC5 with Janus and reported 32 incompleteness bugs in total. Out of these, 24 have been confirmed as unique bugs and 8 have already been fixed by the developers. Our diverse bug findings uncovered functional, regression, and performance bugs—several surprising enough to trigger in-depth discussions among the developers.
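The toy example below illustrates the weakening idea on a trivially satisfiable seed: relaxing an equality to an inequality preserves every model of the seed, so a solver answering unknown on the mutant would be an incompleteness-bug candidate. The rule shown is an assumption for illustration, not Janus's actual rule set; running it requires the `z3-solver` package.

```python
# Toy satisfiability-preserving weakening (illustration only, not Janus's
# rules): every model of the seed also satisfies the weakened mutant, so an
# 'unknown' answer on the mutant would flag an incompleteness bug.
from z3 import Real, Solver, unknown

x, y = Real("x"), Real("y")
seed   = [x == y, x + y > 2]     # satisfiable seed formula
mutant = [x <= y, x + y > 2]     # weakened: x == y implies x <= y

for name, conjuncts in [("seed", seed), ("mutant", mutant)]:
    s = Solver()
    s.add(*conjuncts)
    result = s.check()
    if result == unknown:
        print(f"incompleteness bug candidate on {name}")
    else:
        print(name, "->", result)   # expected: sat for both
```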
Research Papers
Thu 13 Oct 2022 17:10 - 17:30 at Banquet B - Technical Session 32 - Formal Methods and Models II Chair(s): Khouloud Gaaloul University of Michigan - DearbornLinear temporal logic (LTL) satisfiability checking is a fundamental and hard problem (PSPACE-complete). In this paper, we explore checking LTL satisfiability via end-to-end learning, so that we can check LTL satisfiability in only polynomial time. Existing approaches have shown that it is possible to leverage end-to-end neural networks to predict Boolean satisfiability with performance considerably higher than random guessing. Inspired by their work, we study two interesting questions: can end-to-end neural networks check LTL satisfiability, and can neural networks learn the semantics of LTL? To this end, we train neural networks with different architectures to explore their potential for LTL satisfiability checking. We demonstrate that neural networks can indeed capture some effective biases for LTL satisfiability checking through training. Besides, designing a special architecture matching some logical properties of LTL, e.g., its recursive structure, permutation invariance, and sequentiality, can provide a better inductive bias. We also show competitive results of neural networks compared with the state-of-the-art approaches, i.e., nuXmv and Aalta, on large-scale datasets.
Research Papers
Thu 13 Oct 2022 17:30 - 17:50 at Banquet B - Technical Session 32 - Formal Methods and Models II Chair(s): Khouloud Gaaloul University of Michigan - DearbornDeep learning has become a promising programming paradigm in software development, owing to its surprising performance in solving many challenging tasks. Deep neural networks (DNNs) are increasingly being deployed in practice, but they are limited on resource-constrained devices owing to their demand for computational power. Quantization has emerged as a promising technique to reduce the size of DNNs with accuracy comparable to their floating-point counterparts, and the resulting quantized neural networks (QNNs) can be implemented energy-efficiently. As with their floating-point counterparts, quality assurance techniques for QNNs, such as testing and formal verification, are essential but currently less explored. In this work, we propose a novel and efficient formal verification approach for QNNs. In particular, we are the first to propose an encoding that reduces the verification problem of QNNs to solving integer linear constraints, which can be handled by off-the-shelf solvers. Our encoding is both sound and complete. We demonstrate the application of our approach to local robustness verification and maximum robustness radius computation. We implement our approach in a prototype tool, QVIP, and conduct a thorough evaluation. Experimental results on QNNs with different quantization bit-widths confirm the effectiveness and efficiency of our approach; e.g., our approach is two orders of magnitude faster and able to solve more verification tasks within the same time limit than the state-of-the-art methods.
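As a flavour of how such an encoding can work, the toy sketch below characterises a single integer-valued ReLU neuron with big-M integer-linear constraints and a 0/1 indicator, then brute-forces a small domain to confirm the constraints admit exactly the correct input-output pairs. The paper's actual encoding of quantized semantics is more involved; this example and its constants are assumptions for illustration.

```python
# Toy big-M integer-linear encoding of y = max(0, w*x + b) with a binary
# indicator d; clamping to the quantisation range would be encoded similarly.
# Illustration only -- not the paper's exact encoding.
from itertools import product

W, B, M = 2, -3, 100   # toy weight, bias, and big-M constant

def satisfies(x, y, d):
    t = W * x + B
    return (y >= t and y >= 0 and
            y <= t + M * (1 - d) and   # d == 1 forces y <= t (so y == t)
            y <= M * d)                # d == 0 forces y <= 0 (so y == 0)

if __name__ == "__main__":
    exact = True
    for x in range(-5, 6):
        expected = max(0, W * x + B)
        feasible = {y for y, d in product(range(0, 50), (0, 1)) if satisfies(x, y, d)}
        exact &= feasible == {expected}
        print(x, "->", feasible)
    print("encoding exact on this domain:", exact)
```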