Research Papers
Tue 11 Oct 2022 10:30 - 10:50 at Ballroom C East - Technical Session 1 - AI for SE I Chair(s): Andrea Stocco Università della Svizzera italiana (USI)
AI-enabled software systems (AIS) are prevalent in a wide range of applications, such as the visual tasks of autonomous systems deployed extensively in the automotive, aerial, and naval domains. Hence, it is crucial for humans to evaluate the model's intelligence before an AIS is deployed to safety-critical environments, such as public roads.
In this paper, we assess AIS visual intelligence by measuring the completeness of its perception of a domain's primary concepts and their variants. For instance, is the visual perception of an autonomous detector mature enough to recognize instances of pedestrian (an automotive-domain concept) in Halloween costumes? An AIS becomes more reliable once the model's ability to perceive a concept is expressed in a human-understandable language. For instance, is a pedestrian in a wheelchair mistakenly recognized as a pedestrian on a bike because the domain concepts bike and wheelchair both share the feature wheel?
We answer such questions by implementing a generic process within a framework, called B-AIS, which systematically evaluates AIS perception against the semantic specifications of a domain while treating the model as a black box. Semantics captures the meaning and understanding of words in a language and is therefore more comprehensible to the human brain than the AIS's pixel-level visual information. B-AIS processes the heterogeneous artifacts to make them comparable and leverages the comparison results to reveal AIS weaknesses in a human-understandable language. The evaluations of B-AIS for the vision task of pedestrian detection showed that B-AIS identified the missing variants of the pedestrian with F2 measures of 95% in the dataset and 85% in the model.
Industry Showcase
Tue 11 Oct 2022 10:50 - 11:10 at Ballroom C East - Technical Session 1 - AI for SE I Chair(s): Andrea Stocco Università della Svizzera italiana (USI)
Visualization is very important for machine learning (ML) pipelines because it can show explorations of the data to inspire data scientists and show explanations of the pipeline to improve understandability and trust. In this paper, we present a novel approach that automatically generates visualizations for ML pipelines by learning visualizations from highly-voted Kaggle pipelines. The solution extracts both code and dataset features from these high-quality human-written pipelines and their corresponding training datasets, learns mapping rules from code and dataset features to visualizations using association rule mining (ARM), and finally uses the learned rules to predict visualizations for unseen ML pipelines. The evaluation results show that the proposed solution is feasible and effective in generating visualizations for ML pipelines.
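As a rough illustration of the rule-mining step (a minimal sketch, not the paper's implementation; the feature and visualization names are made up), candidate "feature → visualization" rules can be scored by support and confidence directly over pipeline records:

    from collections import Counter

    # Each record: the set of extracted code/dataset features plus the visualizations used.
    pipelines = [
        {"groupby", "numeric_target", "viz:bar_chart"},
        {"groupby", "viz:bar_chart"},
        {"numeric_target", "correlated_columns", "viz:heatmap"},
        {"correlated_columns", "viz:heatmap"},
    ]

    def mine_rules(records, min_support=0.25, min_confidence=0.6):
        n = len(records)
        item_counts = Counter(item for r in records for item in r)
        rules = []
        for feature in (i for i in item_counts if not i.startswith("viz:")):
            for viz in (i for i in item_counts if i.startswith("viz:")):
                both = sum(1 for r in records if feature in r and viz in r)
                support = both / n
                confidence = both / item_counts[feature]
                if support >= min_support and confidence >= min_confidence:
                    rules.append((feature, viz, support, confidence))
        return rules

    for feature, viz, sup, conf in mine_rules(pipelines):
        print(f"{feature} -> {viz}  (support={sup:.2f}, confidence={conf:.2f})")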
Research Papers
Tue 11 Oct 2022 11:10 - 11:30 at Ballroom C East - Technical Session 1 - AI for SE I Chair(s): Andrea Stocco Università della Svizzera italiana (USI)
The aircraft industry is constantly striving for more efficient design optimization methods in terms of human effort, computation time, and resource consumption. Hybrid surrogate optimization maintains high result quality while providing rapid design assessments when both the surrogate model and the switch mechanism for eventually transitioning to the high-fidelity (HF) model are calibrated properly. Feedforward neural networks (FNNs) can capture highly nonlinear input-output mappings, yielding efficient surrogates for aircraft performance factors. However, FNNs often fail to generalize over out-of-distribution (OOD) samples, which hinders their adoption in critical aircraft design optimization. Through SmOOD, our smoothness-based out-of-distribution detection approach, we propose to co-design a model-dependent OOD indicator with the optimized FNN surrogate, producing a trustworthy surrogate model with selective but credible predictions. Unlike conventional uncertainty-grounded methods, SmOOD exploits the inherent smoothness properties of the HF simulations to effectively expose OODs by revealing their suspicious sensitivities, thereby avoiding over-confident uncertainty estimates on OOD samples. With SmOOD, only high-risk OOD inputs are forwarded to the HF model for re-evaluation, leading to more accurate results at a low overhead cost. Three aircraft performance models are investigated. Results show that FNN-based surrogates outperform their Gaussian process counterparts in terms of predictive performance. Moreover, SmOOD covers on average 85% of the actual OOD samples across all study cases. When SmOOD and the FNN surrogates are deployed in hybrid surrogate optimization settings, they reduce the error rate by 34.65% and speed up computation by a factor of 58.36, respectively.
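A minimal sketch of the general idea behind a smoothness-based OOD indicator (illustrative only; SmOOD's actual indicator, calibration, and HF coupling are more sophisticated, and all names below are placeholders):

    import numpy as np

    def sensitivity(surrogate, x, eps=1e-2):
        """Finite-difference estimate of the surrogate's local sensitivity (gradient norm)."""
        x = np.asarray(x, dtype=float)
        grads = []
        for i in range(x.size):
            step = np.zeros_like(x)
            step[i] = eps
            grads.append((surrogate(x + step) - surrogate(x - step)) / (2 * eps))
        return float(np.linalg.norm(grads))

    def route(surrogate, hf_model, x, threshold):
        """Send suspiciously non-smooth (likely OOD) queries back to the high-fidelity model."""
        if sensitivity(surrogate, x) > threshold:
            return hf_model(x)      # high-risk input: re-evaluate with the HF simulation
        return surrogate(x)         # smooth, in-distribution input: trust the cheap surrogate

    # Stand-in models; the threshold would be calibrated on training data, e.g. a high
    # percentile of sensitivities observed on in-distribution samples.
    surrogate = lambda x: float(np.sin(x).sum())
    hf_model = lambda x: float(np.sin(x).sum())   # placeholder for the expensive simulation
    train_inputs = np.random.default_rng(0).uniform(-1, 1, size=(100, 3))
    threshold = np.percentile([sensitivity(surrogate, xi) for xi in train_inputs], 95)
    print(route(surrogate, hf_model, np.array([0.2, -0.4, 0.9]), threshold))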
Research Papers
Tue 11 Oct 2022 11:30 - 11:50 at Ballroom C East - Technical Session 1 - AI for SE I Chair(s): Andrea Stocco Università della Svizzera italiana (USI)
Contemporary DNN testing work is frequently conducted using metamorphic testing (MT). In general, de facto MT frameworks mutate DNN input images using semantics-preserving mutations and determine whether DNNs can yield consistent predictions. Nevertheless, we find that DNNs may rely on erroneous decisions to make predictions, which may still happen to retain the outputs by chance. Such DNN defects are neglected by existing MT frameworks. Erroneous decisions, however, are likely to result in successive mispredictions over diverse images that may occur in real-life scenarios.
This research aims to unveil the pervasiveness of hidden DNN defects caused by incorrect DNN decisions (which nevertheless retain consistent DNN predictions). To do so, we tailor and optimize modern eXplainable AI (XAI) techniques to identify visual concepts that represent regions in an input image upon which the DNN makes its predictions. Then, we extend existing MT-based DNN testing frameworks to check the consistency of DNN decisions made over a test input and its mutated variants. Our evaluation shows that existing MT frameworks are oblivious to a considerable number of DNN defects caused by erroneous decisions. We conduct human evaluations to justify the validity of our findings and to elucidate their characteristics. Through the lens of DNN decision-based metamorphic relations, we re-examine the effectiveness of metamorphic transformations proposed by existing MT frameworks. We summarize lessons from this study, which can provide insights and guidelines for future DNN testing.
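One way to picture a decision-based metamorphic check (a simplified sketch; the paper's tailored XAI concepts are richer than a raw saliency overlap) is to compare the top-attended regions before and after a semantics-preserving mutation, flagging a defect even when the predicted labels agree:

    import numpy as np

    def decision_consistent(saliency_orig, saliency_mut, top_frac=0.1, min_iou=0.5):
        """Check whether the DNN attends to roughly the same region before and after a
        semantics-preserving mutation, by comparing the top-k saliency pixels (IoU)."""
        def top_mask(saliency):
            k = max(1, int(saliency.size * top_frac))
            threshold = np.partition(saliency.ravel(), -k)[-k]
            return saliency >= threshold
        a, b = top_mask(saliency_orig), top_mask(saliency_mut)
        iou = np.logical_and(a, b).sum() / max(1, np.logical_or(a, b).sum())
        return iou >= min_iou, float(iou)

    # Toy saliency maps standing in for XAI attribution outputs:
    rng = np.random.default_rng(1)
    s1 = rng.random((32, 32))
    s2 = s1 + 0.05 * rng.random((32, 32))   # the mutation barely changes the attended region
    print(decision_consistent(s1, s2))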
Research Papers
Tue 11 Oct 2022 11:50 - 12:10 at Ballroom C East - Technical Session 1 - AI for SE I Chair(s): Andrea Stocco Università della Svizzera italiana (USI)
Despite their great success in many applications, deep neural networks are not always robust in practice. For instance, a convolutional neural network (CNN) model for classification tasks often performs unsatisfactorily in classifying some particular classes of objects. In this work, we are concerned with patching the weak part of a CNN model instead of improving it through costly retraining of the entire model. Inspired by the fundamental concepts of modularization and composition in software engineering, we propose a structured modularization approach, CNNSplitter, which decomposes a strong CNN model for N-class classification into N smaller CNN modules. Each module is a sub-model containing a part of the convolution kernels of the strong model. To patch a weak CNN model that performs unsatisfactorily on a target class (TC), we compose the weak CNN model with the corresponding module obtained from a strong CNN model. The ability of the weak CNN model to recognize the TC can thus be improved through patching. Moreover, the ability to recognize non-TCs is also improved, as samples misclassified as the TC can be classified as non-TCs correctly. Experimental results with two representative CNNs on three widely-used public datasets show that the average improvements on the TC in terms of precision and recall are 12.54% and 2.14%, respectively. Moreover, patching improves the accuracy on non-TCs by 1.18%. The results demonstrate that CNNSplitter can patch a weak CNN model through modularization and composition, thus providing a new solution for developing robust CNN models.
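For intuition only, one plausible composition rule is to let the module's confidence override or boost the weak model's score for the target class before taking the arg-max; this is a hypothetical sketch, not necessarily CNNSplitter's actual composition:

    import numpy as np

    def patched_predict(weak_logits, module_tc_score, tc_index, alpha=1.0):
        """Hypothetical composition: blend the module's TC confidence into the weak
        model's TC score, then predict with arg-max as usual."""
        logits = np.array(weak_logits, dtype=float)
        logits[tc_index] = (1 - alpha) * logits[tc_index] + alpha * module_tc_score
        return int(np.argmax(logits))

    # A sample the weak model would misclassify as class 2, corrected by a confident module:
    weak_logits = [0.1, 0.2, 0.9, 0.3]   # weak 4-class model, TC is class 1
    module_tc_score = 2.5                # module extracted from the strong model
    print(patched_predict(weak_logits, module_tc_score, tc_index=1))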
Research Papers
Tue 11 Oct 2022 12:10 - 12:30 at Ballroom C East - Technical Session 1 - AI for SE I Chair(s): Andrea Stocco Università della Svizzera italiana (USI)
The size of deep learning models in artificial intelligence (AI) software is increasing rapidly, which hinders large-scale deployment on resource-restricted devices (e.g., smartphones). To mitigate this issue, AI software compression plays a crucial role, aiming to compress model size while keeping high performance. However, the intrinsic defects in the big model may be inherited by the compressed one. Such defects may be easily leveraged by attackers, since compressed models are usually deployed on a large number of devices without adequate protection. In this paper, we try to address the safe model compression problem from a safety-performance co-optimization perspective. Specifically, inspired by the test-driven development (TDD) paradigm in software engineering, we propose a test-driven sparse training framework called SafeCompress. By simulating the attack mechanism to defend against, SafeCompress can automatically compress a big model into a sparse one. Further, considering a representative attack, i.e., the membership inference attack (MIA), we develop a concrete safe model compression mechanism, called MIA-SafeCompress. Extensive experiments are conducted to evaluate MIA-SafeCompress on five datasets for both computer vision and natural language processing tasks. The results verify the effectiveness and generalizability of our method. We also discuss how to adapt SafeCompress to attacks other than MIA, demonstrating the flexibility of SafeCompress.
Research Papers
Tue 11 Oct 2022 14:00 - 14:20 at Ballroom C East - Technical Session 5 - Code Analysis Chair(s): Vahid Alizadeh DePaul University
Channel-based concurrency is a widely used alternative to shared-memory concurrency but is difficult to use correctly. Common programming errors may result in blocked threads that wait indefinitely. Recent work exposes this as a considerable problem in Go programs and shows that many such errors can be detected automatically using SMT encoding and dynamic analysis techniques.
In this paper, we present an alternative approach to detecting such errors based on abstract interpretation. To curb the large state spaces of real-world multi-threaded programs, our static program analysis leverages standard pre-analyses to divide the given program into individually analyzable fragments. Experimental results on 6 large real-world Go programs show that the abstract interpretation achieves good scalability and finds 104 blocking errors that are missed by the state-of-the-art tool GCatch.
Tool Demonstrations
Tue 11 Oct 2022 14:20 - 14:30 at Ballroom C East - Technical Session 5 - Code Analysis Chair(s): Vahid Alizadeh DePaul University
Smart contracts are self-executing computer programs deployed on blockchains to enable trustworthy exchange of value without the need for a central authority. In the absence of documentation and specifications, routine tasks such as program understanding, maintenance, verification, and validation remain challenging for smart contracts. In this paper, we propose a dynamic invariant detection tool, InvCon, for Ethereum smart contracts to mitigate this issue. The detected invariants can be used not only to support the reverse engineering of contract specifications, but also to enable standard-compliance checking for contract implementations. InvCon provides a Web-based interface, and a demonstration video is available at: https://youtu.be/Y1QBHjDSMYk.
Journal-first Papers
Tue 11 Oct 2022 14:30 - 14:50 at Ballroom C East - Technical Session 5 - Code Analysis Chair(s): Vahid Alizadeh DePaul University
Regression testing is a critical but expensive activity that ensures previously tested functionality is not broken by changes made to the code. Regression test selection (RTS) techniques aim to select and run only those test cases impacted by code changes. The techniques possess different characteristics related to their selection accuracy, test suite size reduction, time to select and run the test cases, and the fault detection ability of the selected test cases. This paper presents an empirical comparison of four Java-based RTS techniques (Ekstazi, HyRTS, OpenClover and STARTS) using multiple revisions from five open source projects.
The results show that STARTS selects more test cases than Ekstazi and HyRTS, and OpenClover selects the most. Safety violations measure the extent to which a technique misses test cases that should be selected, while precision violations measure the extent to which it selects test cases that are not impacted. Using HyRTS as the baseline, OpenClover had significantly worse safety violations compared to STARTS and Ekstazi, and significantly worse precision violations compared to Ekstazi. While STARTS and Ekstazi did not differ on safety violations, Ekstazi had significantly fewer precision violations than STARTS. The average fault detection ability of the RTS techniques was 8.75% lower than that of the original test suite.
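For readers unfamiliar with the two metrics, a simple set-based way to compute them (a sketch consistent with the description above, not necessarily the paper's exact formulas):

    def rts_violations(selected, impacted):
        """Safety violations: fraction of truly impacted tests the technique missed.
        Precision violations: fraction of selected tests that were not impacted."""
        selected, impacted = set(selected), set(impacted)
        safety = len(impacted - selected) / len(impacted) if impacted else 0.0
        precision = len(selected - impacted) / len(selected) if selected else 0.0
        return safety, precision

    # Hypothetical revision: 4 tests are truly impacted, a technique selects 5 tests.
    print(rts_violations(selected={"T1", "T2", "T3", "T7", "T9"},
                         impacted={"T1", "T2", "T3", "T4"}))
    # -> (0.25, 0.4): it missed T4 and needlessly selected T7 and T9.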
Tool Demonstrations
Tue 11 Oct 2022 14:50 - 15:00 at Ballroom C East - Technical Session 5 - Code Analysis Chair(s): Vahid Alizadeh DePaul University
Dynamic taint analysis (DTA) is a popular approach to help protect JavaScript applications against injection vulnerabilities. In 2016, the ECMAScript 7 JavaScript language standard introduced many language features that most existing DTA tools for JavaScript do not support, e.g., the async/await keywords for asynchronous programming. We present Augur, a high-performance dynamic taint analysis for ES7 JavaScript that leverages VM-supported instrumentation. Integrating directly with a public, stable instrumentation API gives Augur the ability to run with high performance inside the VM and remain resilient to language revisions. We extend the abstract-machine approach to DTA with semantics to handle asynchronous function calls. In addition to providing the classic DTA use case of injection vulnerability detection, Augur is highly configurable to support any type of taint analysis, making it useful outside of the security domain. We evaluated Augur on a set of 20 benchmarks, and observed a median runtime overhead of only 1.77×. We note a median performance improvement of 298% compared to the previous state-of-the-art, Ichnaea.
Tool demo: https://www.youtube.com/watch?v=GczQ-2A58LE
Link to open source code repository: https://github.com/nuprl/augur
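The abstract-machine view of DTA can be sketched in a few lines (a toy model in Python for illustration; Augur itself instruments JavaScript inside the VM and handles async semantics): taint labels attach to values, propagate through operations, and are checked at sinks:

    class Tainted:
        def __init__(self, value, labels=frozenset()):
            self.value, self.labels = value, frozenset(labels)

    def binop(op, a, b):
        """The taint of a result is the union of the operands' taints."""
        va = a.value if isinstance(a, Tainted) else a
        vb = b.value if isinstance(b, Tainted) else b
        la = a.labels if isinstance(a, Tainted) else frozenset()
        lb = b.labels if isinstance(b, Tainted) else frozenset()
        return Tainted(op(va, vb), la | lb)

    def sink(x, name="db.query"):
        if isinstance(x, Tainted) and x.labels:
            raise RuntimeError(f"tainted data ({set(x.labels)}) reaches sink {name}")
        return x

    user_input = Tainted("1 OR 1=1", {"http-param"})          # taint source
    query = binop(lambda a, b: a + b, "SELECT * FROM t WHERE id=", user_input)
    try:
        sink(query)                                            # flags the injection flow
    except RuntimeError as e:
        print("detected:", e)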
Tool Demonstrations
Tue 11 Oct 2022 15:00 - 15:10 at Ballroom C East - Technical Session 5 - Code Analysis Chair(s): Vahid Alizadeh DePaul University
Types in TypeScript play an important role in the correct usage of variables and APIs. Type errors such as variable or function misuse can be avoided with explicit type annotations. In this work, we introduce FlexType, an IDE extension that can be used on both JavaScript and TypeScript to infer types in an interactive or automatic fashion. We perform experiments with FlexType in JavaScript to determine how many types FlexType could resolve if it were used to migrate top JavaScript projects to TypeScript. FlexType is able to annotate 56.69% of all types with high precision and confidence, including native types and types imported from modules. In addition to the automatic inference, we believe the interactive Visual Studio Code extension is inherently useful in both TypeScript and JavaScript, especially when resolving types is taxing for the developer.
The source code is available at GitHub and a video demonstration at https://youtu.be/4dPV05BWA8A.
Research Papers
Tue 11 Oct 2022 15:10 - 15:30 at Ballroom C East - Technical Session 5 - Code Analysis Chair(s): Vahid Alizadeh DePaul University
Learning-based program repair has achieved good results in a recent series of papers. Yet, we observe that the related work fails to repair some bugs because of a lack of knowledge about 1) the application domain of the program being repaired, and 2) the fault type being repaired. In this paper, we address both problems by changing the learning paradigm from supervised training to self-supervised training in an approach called SelfAPR. First, SelfAPR generates and constructs training samples by perturbing a previous version of the program being repaired, enforcing the neural model to capture project-specific knowledge. This differs from previous work based on mined past commits. Second, SelfAPR extracts and encodes test execution diagnostics into the input representation, steering the neural model to fix the specific kind of fault. This differs from existing studies that only consider static source code as input. We implement SelfAPR and evaluate it in a systematic manner. We train SelfAPR with 850,705 training samples obtained by perturbing 17 open-source projects. We evaluate SelfAPR on 818 bugs from Defects4J; SelfAPR correctly repairs 114 of them, outperforming all the supervised learning repair approaches.
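A toy sketch of the perturbation-based sample construction (the perturbation, special tokens, and diagnostic format here are invented for illustration and are not SelfAPR's actual encoding):

    import re

    def make_training_sample(correct_line, context, diagnostic):
        """Perturb a correct line to create a synthetic bug, and pair the perturbed code
        plus an execution diagnostic (input) with the original line (target)."""
        perturbed = re.sub(r"<", "<=", correct_line, count=1)   # inject an off-by-one-style bug
        source = f"[PERTURBED] {perturbed} [CONTEXT] {context} [DIAGNOSTIC] {diagnostic}"
        return source, correct_line    # the model learns to restore the correct line

    src, tgt = make_training_sample(
        correct_line="if (i < list.size()) {",
        context="loop over list in Parser.parse",
        diagnostic="java.lang.IndexOutOfBoundsException at Parser.java:42",
    )
    print(src)
    print(tgt)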
Research Papers
Wed 12 Oct 2022 10:00 - 10:20 at Ballroom C East - Technical Session 9 - Security and Privacy Chair(s): Wei Yang University of Texas at Dallas
Information leaks in software can unintentionally reveal private data, yet they are hard to detect and fix. Although several methods have been proposed to detect leakage, such as static verification-based approaches, they require specialist knowledge and are time-consuming. Recently, HyperGI introduced a dynamic, hypertest-based approach that detects and produces potential fixes for information leakage. Its fitness function tries to balance information leakage and program correctness, but, as the authors of that work point out, there may be a tradeoff between preserving program semantics and reducing information leakage.
In this work we ask whether it is possible to automatically detect and repair information leakage in more realistic programs without requiring specialist knowledge. Our approach, called LeakReducer, explicitly encodes the tradeoff between program correctness and information leakage as a multi-objective optimisation problem. We apply LeakReducer to a set of leaky programs, including the well-known Heartbleed bug. It is comparable with HyperGI on their toy applications. In addition, we demonstrate that it can find and reduce leakage in real applications, and we observe diverse solutions on our Pareto front. Upon investigation we find that having a Pareto front helps with some types of information leakage, but not all.
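The multi-objective view can be illustrated with a tiny non-dominated filter over hypothetical patch candidates, with leakage and test failures both to be minimised; this is a generic Pareto sketch, not LeakReducer's search algorithm:

    def pareto_front(candidates):
        """Keep patches not dominated on the two minimisation objectives
        (estimated leakage, number of failing functional tests)."""
        front = []
        for name, leakage, failures in candidates:
            dominated = any(l2 <= leakage and f2 <= failures and (l2, f2) != (leakage, failures)
                            for _, l2, f2 in candidates)
            if not dominated:
                front.append((name, leakage, failures))
        return front

    # Hypothetical patch candidates: (name, leakage in bits, failing functional tests)
    patches = [("p1", 0.0, 3), ("p2", 1.2, 0), ("p3", 0.4, 1), ("p4", 1.5, 2)]
    print(pareto_front(patches))   # p4 is dominated by p2; the rest trade leakage vs. tests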
Research Papers
Wed 12 Oct 2022 10:20 - 10:40 at Ballroom C East - Technical Session 9 - Security and Privacy Chair(s): Wei Yang University of Texas at Dallas
Automated online recognition of unexpected conditions is an indispensable component of autonomous vehicles to ensure safety even in unknown and uncertain situations. In this paper we propose a runtime monitoring technique rooted in the attention maps computed by explainable artificial intelligence techniques. Our approach, implemented in a tool called ThirdEye, turns attention maps into confidence scores that are used to discriminate safe from unsafe driving behaviours. The intuition is that uncommon attention maps are associated with unexpected runtime conditions.
In our empirical study, we evaluated the effectiveness of different configurations of ThirdEye at predicting simulation-based injected failures induced by both unknown conditions (adverse weather and lighting) and unsafe/uncertain conditions created with mutation testing. Results show that, overall, ThirdEye can predict 98% of misbehaviours, up to three seconds in advance, outperforming a state-of-the-art failure predictor for autonomous vehicles.
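One simple way to turn attention maps into a confidence score (an illustrative sketch; ThirdEye's actual scoring functions may differ) is to measure each runtime map's distance to the maps seen during nominal driving and raise an alarm above a calibrated threshold:

    import numpy as np

    def fit_reference(train_maps):
        """Summarise nominal attention maps as a mean map plus a score threshold."""
        train_maps = np.asarray(train_maps, dtype=float)
        mean_map = train_maps.mean(axis=0)
        scores = [float(((m - mean_map) ** 2).mean()) for m in train_maps]
        return mean_map, float(np.percentile(scores, 99))   # alarm above the 99th percentile

    def is_unsafe(attention_map, mean_map, threshold):
        score = float(((attention_map - mean_map) ** 2).mean())
        return score > threshold, score

    rng = np.random.default_rng(0)
    nominal = rng.random((200, 20, 30))            # stand-ins for nominal attention maps
    mean_map, thr = fit_reference(nominal)
    print(is_unsafe(rng.random((20, 30)) + 0.8, mean_map, thr))   # uncommon map -> alarm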
Industry Showcase
Wed 12 Oct 2022 10:40 - 11:00 at Ballroom C East - Technical Session 9 - Security and Privacy Chair(s): Wei Yang University of Texas at Dallas
Openpilot is an open-source system that assists drivers by providing features like automated lane centering and adaptive cruise control. Like most systems for autonomous vehicles, Openpilot relies on a sophisticated deep neural network (DNN) to provide its functionality, one that is susceptible to safety property violations that can lead to crashes. To uncover such potential violations before deployment, we investigate the use of falsification, a form of directed testing that analyzes a DNN to generate an input that will cause a safety property violation. Specifically, we explore the application of a state-of-the-art falsifier to the DNN used in Openpilot, which reflects recent trends in network design. Our investigation reveals the challenges in applying such falsifiers to real-world DNNs, conveys our engineering efforts to overcome such challenges, and showcases the potential of falsifiers to detect property violations and provide meaningful counterexamples. Finally, we summarize the lessons learned as well as the pending challenges for falsifiers to realize their potential on systems like Openpilot.
Research Papers
Wed 12 Oct 2022 11:00 - 11:20 at Ballroom C East - Technical Session 9 - Security and Privacy Chair(s): Wei Yang University of Texas at Dallas
Various virtual personal assistant (VPA) services, e.g. Amazon Alexa and Google Assistant, have become increasingly popular in recent years. This can be partly attributed to a flourishing ecosystem centered around them. Third-party developers are enabled to create VPA applications (or VPA apps for short), e.g. Amazon Alexa skills and Google Assistant Actions, which are then released to app stores and become easily accessible by end users through their smart devices.
Similar to their mobile counterparts, VPA apps are accompanied by a privacy policy document that informs users of their data collection, use, retention and sharing practices. Privacy policies are legal documents, usually lengthy and complex, and hence difficult for users to comprehend. Because of this, developers may exploit the situation by intentionally or unintentionally failing to comply with them.
In this work, we conduct the first systematic study on the privacy policy compliance issue of VPA apps. We develop Skipper, which targets the VPA apps (i.e., skills) of Amazon Alexa, the most popular VPA service. Skipper automatically derives each skill's declared privacy profile by analyzing its privacy policy documents with natural language processing (NLP) and machine learning techniques. It then conducts black-box testing to generate the behavioral privacy profile of the skill and checks the consistency between the two profiles. We conduct a large-scale audit of all 61,505 skills available on the Amazon Alexa store. Skipper finds that the vast majority of skills suffer from privacy policy noncompliance issues. Our work reveals the status quo of privacy policy compliance in contemporary VPA apps. Our findings should raise an alert for app developers and users, and encourage VPA app store operators to put in place regulations on privacy policy compliance.
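The consistency check between the two profiles ultimately compares sets of data practices; a minimal sketch (the data-type names are invented, and Skipper's real profiles are richer):

    def compliance_check(declared, observed):
        """Compare a skill's declared privacy profile (data types its policy discloses)
        with the behavioural profile observed during black-box testing."""
        undeclared_collection = observed - declared   # collected but never disclosed
        unused_disclosure = declared - observed       # disclosed but not observed (often benign)
        return {"compliant": not undeclared_collection,
                "undeclared_collection": sorted(undeclared_collection),
                "unused_disclosure": sorted(unused_disclosure)}

    declared = {"device address", "first name"}                     # parsed from the policy
    observed = {"first name", "precise location", "phone number"}   # elicited by test dialogues
    print(compliance_check(declared, observed))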
Research Papers
Wed 12 Oct 2022 11:20 - 11:40 at Ballroom C East - Technical Session 9 - Security and Privacy Chair(s): Wei Yang University of Texas at Dallas
Several studies have shown that automated support for different activities of the security patch management process has great potential for reducing delays in installing security patches. However, it is also important to understand how automation is used in practice, its limitations in meeting real-world needs and what practitioners really need, an area that has not been empirically investigated in the existing software engineering literature. This paper reports an empirical study aimed at investigating different aspects of automation for security patch management using semi-structured interviews with 17 practitioners from three different organisations in the healthcare domain. The findings are focused on the role of automation in security patch management for providing insights into the as-is state of the automation in practice, the limitations of current automation, how automation support can be enhanced to effectively meet practitioners' needs, and the role of the human in an automated process. Based on the findings, we have derived a set of recommendations for directing future efforts aimed at developing automated support for security patch management.
Research Papers
Wed 12 Oct 2022 11:40 - 12:00 at Ballroom C East - Technical Session 9 - Security and Privacy Chair(s): Wei Yang University of Texas at Dallas
Browser extensions have become integral features of modern browsers, with the aim of boosting the online browsing experience. Their advantageous position between a user and the Internet grants them easy access to the user's sensitive personal data, which has raised mounting privacy concerns from both legislators and extension users. In this work, we propose an end-to-end automatic extension privacy compliance auditing approach, analyzing the compliance of privacy policies against regulatory requirements and against the extensions' actual privacy-related practices at runtime.
Our approach utilizes the state-of-the-art language processing model BERT for annotating the policy texts, and a hybrid technique to analyze the privacy-related elements (e.g., API calls and HTML objects) from the static source code and dynamically generated files during runtime. We collected a comprehensive dataset within 42 hours in April 2022, containing a total of 64,114 extensions. To facilitate model training, we construct a corpus named PrivAud-100, which contains 100 manually annotated privacy policies. Based on this dataset and the corpus, we conduct a systematic audit and identify widespread privacy compliance issues. We find that around 92% of the extensions have at least one violation in either their privacy policies or data collection practices. We further propose an index to facilitate the filtering and identification of extensions with a significant probability of privacy compliance violations. Our work should raise awareness among extension users, service providers, and platform operators, and encourage them to implement solutions for better privacy compliance. To facilitate future research in this area, we have released our dataset.
Research Papers
Wed 12 Oct 2022 13:30 - 13:50 at Ballroom C East - Technical Session 13 - Application Domains Chair(s): Andrea Stocco Università della Svizzera italiana (USI)
Automatically producing behavioral exception (BE) API documentation helps developers correctly use libraries. The state-of-the-art approaches are either rule-based, which is too restrictive in its applicability, or deep learning (DL)-based, which requires a large training dataset. To address those issues, we propose StatGen, a novel hybrid approach combining statistical machine translation (SMT) and tree-structured translation to generate BE documentation from code and vice versa. We consider an API method to possess two levels of abstraction: the source code of the API method and its documentation. StatGen is specifically designed for this two-way inference, taking advantage of the structures of source code and documentation to achieve higher accuracy. For practical use, if the code does not have BE documentation, StatGen can help users write it, and if such documentation exists, one can use StatGen to verify the consistency between the BE documentation and the implementation. Moreover, it can generate BE code from existing BE documentation.
We conducted empirical experiments to intrinsically evaluate StatGen. We show that it achieves high precision (82% and 79%) and recall (86% and 90%) in inferring BE documentation from source code and vice versa. Our results show that StatGen achieves high precision, recall, and BLEU scores, and outperforms the state-of-the-art baselines in SMT, neural machine translation, tree-based transformers, and dual-task learners. We showed StatGen's usefulness in two applications. First, we used StatGen to generate BE documentation for Apache APIs that lack documentation by learning from the documentation of the equivalent APIs in the JDK. 46% of the generated documentation was rated as useful and 41% as somewhat useful. In the second application, we used StatGen to detect inconsistencies between BE documentation and the corresponding implementations of several packages in JDK 8.
NIER Track
Wed 12 Oct 2022 13:50 - 14:00 at Ballroom C East - Technical Session 13 - Application Domains Chair(s): Andrea Stocco Università della Svizzera italiana (USI)
Programming errors enable security attacks on smart contracts, which are used to manage large sums of financial assets. Automated program repair (APR) techniques aim to reduce developers' burden of manually fixing bugs by automatically generating patches for a given issue. Existing APR tools for smart contracts focus on mitigating typical smart contract vulnerabilities rather than violations of functional specifications. However, in decentralized finance (DeFi) smart contracts, an inconsistency between intended behavior and implementation translates into a deviation from the underlying financial model, resulting in irrecoverable monetary losses for the application and its users. In this work, we propose DeFinery, a technique for automated repair of a smart contract that does not satisfy a user-defined correctness property, financial or otherwise. To explore a larger set of diverse patches while providing formal correctness guarantees w.r.t. the intended behavior, we combine search-based patch generation with semantic analysis of the original program to infer its specification. Our experiments in repairing nine real-world and benchmark smart contracts show that DeFinery efficiently navigates the search space and generates higher-quality patches that cannot be obtained by other smart contract APR tools.
Research Papers
Wed 12 Oct 2022 14:00 - 14:20 at Ballroom C East - Technical Session 13 - Application Domains Chair(s): Andrea Stocco Università della Svizzera italiana (USI)
The HTML5
NIER Track
Wed 12 Oct 2022 14:20 - 14:30 at Ballroom C East - Technical Session 13 - Application Domains Chair(s): Andrea Stocco Università della Svizzera italiana (USI)
As IoT systems are given more responsibility and autonomy, they offer greater benefits, but also carry greater risks. We believe this trend invigorates an old challenge of software engineering: how to develop high-risk software-intensive systems safely and securely under market pressures? As a first step, we conducted a systematic analysis of recent IoT failures to identify engineering challenges. We collected and analyzed 22 news reports and studied the sources, impacts, and repair strategies of failures in IoT systems. We observed failure trends both within and across application domains. We also observed that failure themes have persisted over time. To alleviate these trends, we outline a research agenda toward a Failure-Aware Software Development Life Cycle for IoT development. We propose an encyclopedia of failures and an empirical basis for system postmortems, complemented by appropriate automated tools.
Research Papers
Wed 12 Oct 2022 14:30 - 14:50 at Ballroom C East - Technical Session 13 - Application Domains Chair(s): Andrea Stocco Università della Svizzera italiana (USI)
IoT devices have been under frequent attacks in recent years, causing severe impacts. Previous research has shown the evolution and features of some specific IoT malware families or stages of IoT attacks through offline sample analysis. However, we still lack a systematic observation of the various system resources abused by active attackers and the malicious intentions behind these behaviors. This makes it difficult to design appropriate protection strategies to defend against existing attacks and possible future variants.
In this paper, we fill this gap by analyzing 117,862 valid attack sessions captured by our dedicated high-interaction IoT honeypot, HoneyAsclepius, and further uncovering the attackers' intentions through our designed workflow. HoneyAsclepius enables high capture capability as well as continuous behavior monitoring during active attack sessions in real time. Through a large-scale deployment, we collected 11,301,239 malicious behaviors originating from 50,594 different attackers. Based on this information, we further separate the behaviors in different attack sessions targeting distinct categories of system resources, estimate their temporal relations, and summarize the malicious intentions behind them. Inspired by these investigations, we present several key insights about abusive behaviors involving file, network, process, and special-capability resources, and further propose practical defense strategies to better protect IoT devices.
Journal-first Papers
Wed 12 Oct 2022 14:50 - 15:10 at Ballroom C East - Technical Session 13 - Application Domains Chair(s): Andrea Stocco Università della Svizzera italiana (USI)
Being the most popular programming language for developing Ethereum smart contracts, Solidity allows using inline assembly to gain fine-grained control. Although many empirical studies on smart contracts have been conducted, to the best of our knowledge, none has examined inline assembly in smart contracts. To fill the gap, in this paper, we conduct the first large-scale empirical study of inline assembly on more than 7.6 million open-source Ethereum smart contracts from three aspects, namely, source code, bytecode, and transactions, after designing new approaches to tackle several technical challenges. Through a thorough quantitative and qualitative analysis of the collected data, we obtain many new observations and insights. Moreover, by conducting a questionnaire survey on the use of inline assembly in smart contracts, we draw new insights from the valuable feedback. This work sheds light on the development of smart contracts as well as the evolution of Solidity and its compilers.
Research Papers
Wed 12 Oct 2022 15:10 - 15:30 at Ballroom C East - Technical Session 13 - Application Domains Chair(s): Andrea Stocco Università della Svizzera italiana (USI)
Optical character recognition (OCR) algorithms often run slowly. They may take several seconds to recognize the text on a GUI screen, which makes OCR-based widget localization in test automation unfriendly to use, especially on GPU-free computers. This paper first identifies a common type of widget text to be located in GUI testing: label text, i.e., the short text in widgets like buttons, menu items, and window titles. We then investigate the characteristics of text on a GUI screen and introduce a fast, GPU-independent Label Text Screening (LTS) technique to accelerate the OCR process for label text localization. The technique opens the black box of OCR engines and uses a combination of simple methods to avoid excessive text analysis on a screen as much as possible. Experiments show that, on the subject datasets, LTS reduces the average OCR-based label text localization time to a large extent. On 4K-resolution GUI screens, it keeps the localization time below 0.5 seconds for about 60% of cases without GPU support on a normal laptop computer. In contrast, the existing CPU-based approaches built on the popular OCR engines Tesseract, PaddleOCR, and EasyOCR usually need over 2 seconds to achieve the same goal on the same platform. Even with GPU acceleration, they can hardly keep the analysis time within 1 second. We believe the proposed approach will be helpful for implementing OCR-based test automation tools.
Research Papers
Wed 12 Oct 2022 16:00 - 16:20 at Ballroom C East - Technical Session 19 - Formal Methods and Models I Chair(s): Michalis Famelis Université de Montréal
Deliberation is a common and natural behavior in human daily life. For example, when writing papers or articles, we usually first write drafts and then iteratively polish them until satisfied. In light of such a human cognitive process, we propose DECOM, a multi-pass deliberation framework for automatic comment generation. DECOM consists of multiple deliberation models and one evaluation model. Given a code snippet, we first extract keywords from the code and retrieve a similar code fragment from a pre-defined corpus. Then, we treat the comment of the retrieved code as the initial draft and input it, together with the code and keywords, into DECOM to start the iterative deliberation process. At each deliberation, the deliberation model polishes the draft and generates a new comment. The evaluation model measures the quality of the newly generated comment to determine whether to end the iterative process. When the iterative process terminates, the best generated comment is selected as the target comment. Our approach is evaluated on two real-world datasets in Java (87K) and Python (108K), and experimental results show that it outperforms the state-of-the-art baselines. A human evaluation study also confirms that the comments generated by DECOM tend to be more readable, informative, and useful.
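The multi-pass loop can be sketched as follows (a skeleton only; the polish and score callables stand in for the deliberation and evaluation models, and the stopping rule is an assumption):

    def deliberate_comment(code, keywords, draft, polish, score, max_passes=5, good_enough=0.9):
        """Iteratively refine an initial draft comment, keeping the best-scored version."""
        best_comment, best_score = draft, score(code, draft)
        comment = draft
        for _ in range(max_passes):
            comment = polish(code, keywords, comment)   # deliberation model: refine the comment
            quality = score(code, comment)              # evaluation model: judge the refinement
            if quality > best_score:
                best_comment, best_score = comment, quality
            if quality >= good_enough:                  # evaluator says: stop deliberating
                break
        return best_comment

    # Toy stand-ins just to make the skeleton executable:
    polish = lambda code, kws, c: (c + " " + kws[0]).strip()
    score = lambda code, c: min(1.0, len(set(c.split())) / 10)
    print(deliberate_comment("def add(a,b): return a+b", ["adds", "numbers"],
                             draft="returns sum", polish=polish, score=score))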
Tool Demonstrations
Tue 11 Oct 2022 10:00 - 10:30 at Ballroom A - Tool Poster Session 1
Recommender systems (RSs) are increasingly being used to help in all sorts of software engineering tasks, including modelling. However, building an RS for a modelling notation is costly. This is especially detrimental for development paradigms that rely on domain-specific languages (DSLs), like model-driven engineering and low-code approaches. To alleviate this problem, we propose a DSL called Droid that facilitates the configuration and creation of RSs for particular modelling notations. Its tooling provides automation for all phases in the development of an RS: data preprocessing, system configuration for the modelling language, evaluation and selection of the best recommendation algorithm, and deployment of the RS into a modelling tool. A video of the tool is available at https://www.youtube.com/watch?v=VHiObfKUhS0.
Tool Demonstrations
Tue 11 Oct 2022 10:00 - 10:30 at Ballroom A - Tool Poster Session 1
We present RoboSimVer, a tool for modeling and analyzing RoboSim models. It uses a graphical modeling approach to model RoboSim, a notation for platform-independent simulation models of robotics. For model analysis, we implemented a model-transformation approach that translates RoboSim models into NTA (networks of timed automata) and their stochastic version, SHA (stochastic hybrid automata), based on a set of patterns and mapping rules. RoboSimVer is thus able to obtain a simulation model. It also provides different rigorous verification techniques to check whether the simulation models satisfy property constraints. For the experimental demonstration, we adopt the Alpha algorithm for robotics as a case study. We use a robotic platform model of swarm robots in an uncertain environment to illustrate how our tool supports the verification of stochastic and hybrid systems. The demonstration video of the tool is available at https://youtu.be/mNe4q64GkmQ.
Research Papers
Wed 12 Oct 2022 16:40 - 17:00 at Ballroom C East - Technical Session 19 - Formal Methods and Models I Chair(s): Michalis Famelis Université de Montréal
The robustness of deep neural networks is crucial to modern AI-enabled systems. Formal verification has been demonstrated to be effective in providing certified robustness guarantees. Sigmoid-like neural networks have been adopted in a wide range of applications. Due to their non-linearity, Sigmoid-like activation functions are usually over-approximated for efficient verification, which inevitably introduces imprecision. Considerable efforts have been devoted to finding so-called tighter approximations to obtain more precise verification results. However, existing tightness definitions are heuristic and lack a theoretical foundation. We conduct a thorough empirical analysis of existing neuron-wise characterizations of tightness and reveal that they are superior only on specific neural networks. We then introduce the notion of network-wise tightness as a unified tightness definition and show that computing network-wise tightness is a complex non-convex optimization problem. We bypass the complexity from different perspectives via two efficient, provably tightest approximations. The experimental results demonstrate the promising performance of our approaches over the state of the art: (i) achieving up to a 436.36% improvement to certified lower robustness bounds; and (ii) exhibiting notably more precise verification results on convolutional networks.
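As a small worked example of what "tightness" means for a single sigmoid neuron (illustrative only; the paper's network-wise definition and provably tightest approximations are more involved), two valid linear upper bounds on [0, 2] can be compared by the area each leaves above the chord lower bound, a smaller area meaning a tighter over-approximation:

    import numpy as np

    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    l, u = 0.0, 2.0                        # sigmoid is concave on [0, 2]
    x = np.linspace(l, u, 10001)

    lower = sig(l) + (sig(u) - sig(l)) / (u - l) * (x - l)             # chord: valid lower bound
    upper_tangent = sig(1.0) + sig(1.0) * (1 - sig(1.0)) * (x - 1.0)   # tangent at the midpoint
    upper_const = np.full_like(x, sig(u))                              # constant bound sig(u)

    area = lambda up: float(np.mean(up - lower) * (u - l))             # enclosed area ~ looseness
    print("tangent bound area :", area(upper_tangent))
    print("constant bound area:", area(upper_const))                   # larger, i.e. looser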
Research Papers
Wed 12 Oct 2022 17:00 - 17:20 at Ballroom C East - Technical Session 19 - Formal Methods and Models I Chair(s): Michalis Famelis Université de Montréal
Modern programs are usually heap-based, where the programs manipulate heap-based data structures to perform computations. In software engineering tasks such as test generation and bounded verification, we need to determine the existence of a reachable heap state that satisfies a given specification, or construct the heap state by a sequence of calls to the public methods. Given the huge space combined from the methods and their arguments, the existing approaches typically adopt static analysis or heuristic search to explore only a small part of search space in the hope of finding the target state and target call sequence early on. However, these approaches do not have satisfactory performance on many real-world complex methods and specifications. In this paper, we propose an efficient synthesis algorithm for method call sequences, including an offline procedure for exploring all reachable heap states within a scope, and an online procedure for generating a method call sequence from the explored heap states to satisfy the given specification. To improve the efficiency of state exploration, we introduce a notion of abstract heap state for compactly representing heap states of the same structure and propose a strategy of merging structurally-isomorphic states. The experimental results demonstrate that our approach substantially outperforms the baselines in both test generation and bounded verification.
Journal-first Papers
Wed 12 Oct 2022 17:20 - 17:40 at Ballroom C East - Technical Session 19 - Formal Methods and Models I Chair(s): Michalis Famelis Université de Montréal
Over the past few years, SMT string solvers have found their applications in an increasing number of domains, such as program analyses in mobile and Web applications, which require the ability to reason about string values. A series of research efforts has been carried out to find quality issues of string solvers in terms of their correctness and performance. Yet, none of them has considered the performance regressions happening across multiple versions of a string solver. To fill this gap, in this paper, we focus on solver performance regressions (SPRs), i.e., unintended slowdowns introduced during the evolution of string solvers. To this end, we develop SPRFinder to not only generate test cases demonstrating SPRs, but also localize their probable causes, in terms of commits. We evaluated the effectiveness of SPRFinder on three state-of-the-art string solvers, i.e., Z3Seq, Z3Str3, and CVC4. The results demonstrate that SPRFinder is effective in generating SPR-inducing test cases and also able to accurately locate the responsible commits. Specifically, the average running time on the target versions is 13.2× slower than that of the reference versions. Besides, we also conducted the first empirical study to peek into the characteristics of SPRs, including the impact of random seed configuration for SPR detection, understanding the root causes of SPRs, and characterizing the regression test cases through case studies. Finally, we highlight that 149 unique SPR-inducing commits were discovered in total by SPRFinder, and 16 of them have been confirmed by the corresponding developers. The original paper can be accessed at https://ieeexplore.ieee.org/abstract/document/9760153
Research Papers
Wed 12 Oct 2022 17:40 - 18:00 at Ballroom C East - Technical Session 19 - Formal Methods and Models I Chair(s): Michalis Famelis Université de Montréal
Code clone detection aims to find functionally similar code fragments, which is becoming more and more important in the field of software engineering. Many code clone detection methods have been proposed, among which tree-based methods are able to handle semantic code clones. However, these methods are difficult to scale to big code due to the complexity of tree structures. In this paper, we design Amain, a scalable tree-based semantic code clone detector built on Markov chain models. Specifically, we propose a novel method to transform the complex original tree into simple Markov chains and compute the similarity of all states in these chains. After obtaining all similarity scores, we feed them into a machine learning classifier to train a code clone detector. To examine the effectiveness of Amain, we evaluate it on two widely used datasets, namely Google Code Jam and BigCloneBench. Experimental results show that Amain is superior to five state-of-the-art code clone detection tools (i.e., SourcererCC, Deckard, RtvNN, ASTNN, and SCDetector). Furthermore, compared to the recent tree-based code clone detector ASTNN, Amain is more than 160 times faster at predicting semantic code clones.
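The tree-to-Markov-chain idea can be illustrated on a Python AST (illustration only; Amain targets Java ASTs and defines its own states and transitions, here simplified to consecutive node types in a traversal):

    import ast
    from collections import Counter, defaultdict

    def ast_markov_chain(source):
        """Build a Markov transition matrix over AST node-type states for a code fragment."""
        nodes = [type(n).__name__ for n in ast.walk(ast.parse(source))]
        counts = defaultdict(Counter)
        for a, b in zip(nodes, nodes[1:]):
            counts[a][b] += 1
        return {state: {nxt: c / sum(succ.values()) for nxt, c in succ.items()}
                for state, succ in counts.items()}

    chain = ast_markov_chain("def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s")
    for state, succ in chain.items():
        print(state, "->", succ)
    # Per-state similarity scores between two such chains would then feed a classifier.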
Tool Demonstrations
Tue 11 Oct 2022 10:00 - 10:30 at Ballroom A - Tool Poster Session 1
A key threat to the usage of third-party dependencies is the risk of security vulnerabilities, which may allow unwanted access to a user application. As part of an ecosystem of dependencies, users of a library are exposed to both the direct and the transitive dependencies adopted into their application. Recent work provides tool support for vulnerable dependency updates but rarely shows the complexity of the transitive updates. In this paper, we introduce our solution to support updating vulnerable dependencies. V-Achilles is a prototype that shows a visualization (i.e., using dependency graphs) of the dependencies affected by vulnerability attacks. In addition to the tool overview, we highlight three use cases to demonstrate its usefulness, and an application of our prototype to real-world npm packages. The prototype is available at https://github.com/MUICT-SERU/V-Achilles, with an accompanying video demonstration at https://www.youtube.com/watch?v=tspiZfhMNcs.
Journal-first Papers
Thu 13 Oct 2022 10:10 - 10:30 at Ballroom C East - Technical Session 23 - Security Chair(s): John-Paul Ore North Carolina State University
The Java platform provides various cryptographic APIs to facilitate secure coding. However, correctly using these APIs is challenging for developers who lack cybersecurity training. Prior work shows that many developers misused APIs and consequently introduced vulnerabilities into their software. To eliminate such vulnerabilities, people created tools to detect and/or fix cryptographic API misuses. However, it is still unknown (1) how current tools are designed to detect cryptographic API misuses, (2) how effectively the tools work to locate API misuses, and (3) how developers perceive the usefulness of the tools' outputs. For this paper, we conducted an empirical study to investigate the research questions mentioned above. Specifically, we first conducted a literature survey on existing tools and compared their approach designs from different angles. Then we applied six of the tools to three popularly used benchmarks to measure the tools' effectiveness at API-misuse detection. Next, we applied the tools to 200 Apache projects and sent 57 vulnerability reports to developers for their feedback. Our study revealed interesting phenomena. For instance, none of the six tools was found to be universally better than the others; however, CogniCrypt, CogniGuard, and Xanitizer outperformed SonarQube. More developers rejected the tools' reports than accepted them (30 vs. 9) due to their concerns about the tools' capabilities, the correctness of suggested fixes, and the exploitability of reported issues. This study reveals a significant gap between state-of-the-art tools and developers' expectations; it sheds light on future research in vulnerability detection.
Tool Demonstrations
Wed 12 Oct 2022 09:30 - 10:00 at Ballroom A - Tool Poster Session 2
Automatic vulnerability detection is of paramount importance for promoting the security of an application and should be exercised at the earliest stages of the software development life cycle (SDLC) to reduce the risk of exposure. Despite the advancements of state-of-the-art deep learning techniques in software vulnerability detection, development environments are not yet leveraging their performance. In this work, we integrate the Transformer architecture, one of the main highlights of recent advances in deep learning for natural language processing, into a developer-friendly tool for code security. We introduce VDet for Java, a transformer-based VS Code extension that enables one to discover vulnerabilities in Java files. Our preliminary model evaluation shows an accuracy of 85.8% for multi-label classification, and the model can detect up to 21 vulnerability types. A demonstration of our tool can be found at https://youtu.be/OjiUBQ6TdqE.
Tool Demonstrations
Tue 11 Oct 2022 10:00 - 10:30 at Ballroom A - Tool Poster Session 1
This paper presents Quacky, a tool for quantifying the permissiveness of access control policies in the cloud. Given a policy, Quacky translates it into an SMT formula and uses a model counting constraint solver to quantify permissiveness. When given multiple policies, Quacky not only determines which policy is more permissive, but also quantifies the relative permissiveness between the policies. With Quacky, users can automatically analyze complex policies, helping them ensure that there is no unintended access to their data. Quacky supports access control policies written in Amazon's AWS Identity and Access Management (IAM), Microsoft's Azure, and Google Cloud Platform (GCP) policy languages. Quacky is open source and has both a command-line and a web interface. Video URL: https://youtu.be/YsiGOI_SCtg. The Quacky tool and benchmarks are available at https://github.com/vlab-cs-ucsb/quacky
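For intuition, permissiveness can be thought of as counting the requests a policy allows; the sketch below brute-forces a tiny bounded request space, whereas Quacky encodes policies as SMT formulas and uses model counting (the policies and request space here are made up):

    from itertools import product

    # Two toy policies over a small, bounded request space.
    actions = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
    resources = ["bucket/public/a", "bucket/public/b", "bucket/secret/key"]

    policy_a = lambda act, res: res.startswith("bucket/public/")   # any action on public objects
    policy_b = lambda act, res: act == "s3:GetObject"              # read-only, but on anything

    def permissiveness(policy):
        """Count the (action, resource) requests the policy allows."""
        return sum(1 for act, res in product(actions, resources) if policy(act, res))

    count_a, count_b = permissiveness(policy_a), permissiveness(policy_b)
    print(count_a, count_b,
          "A is more permissive" if count_a > count_b else "B is more permissive")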
Late Breaking Results
Thu 13 Oct 2022 10:50 - 11:00 at Ballroom C East - Technical Session 23 - Security Chair(s): John-Paul Ore North Carolina State University
Existing approaches to improving the robustness of source code models concentrate on recognizing adversarial samples rather than valid samples that fall outside a given distribution, which we refer to as out-of-distribution (OOD) samples. Recognizing such OOD samples is the novel problem investigated in this paper. To this end, we propose to use an auxiliary (out-of-distribution) dataset such that, when trained together with the main dataset, it will enhance the model's robustness. We adapt the energy-bounded learning objective function to assign a higher score to in-distribution samples and a lower score to out-of-distribution samples in order to incorporate such out-of-distribution samples into the training process of source code models. In terms of OOD detection and adversarial sample detection, our evaluation results demonstrate greater robustness for existing source code models, which become more accurate at recognizing OOD data while being more resistant to adversarial attacks at the same time.
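A minimal sketch of an energy-style OOD score following the convention described in the abstract (higher score for in-distribution inputs); the logits and the threshold choice below are hypothetical:

    import numpy as np

    def energy_score(logits, temperature=1.0):
        """Negative free energy of the logits, computed with a numerically stable log-sum-exp.
        Higher values indicate in-distribution inputs, lower values indicate OOD inputs."""
        z = np.asarray(logits, dtype=float) / temperature
        m = z.max()
        return float(temperature * (m + np.log(np.exp(z - m).sum())))

    in_dist_logits = [8.1, 0.3, -1.2, 0.0]   # confident prediction on a familiar code pattern
    ood_logits = [0.4, 0.1, 0.3, 0.2]        # flat logits on an out-of-distribution snippet
    print(energy_score(in_dist_logits), energy_score(ood_logits))
    # A threshold calibrated on validation data would flag the second input as OOD.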
Tool Demonstrations
Tue 11 Oct 2022 10:00 - 10:30 at Ballroom A - Tool Poster Session 1
Cross-chain bridges have become the most popular solution to support asset interoperability between heterogeneous blockchains. However, while providing efficient and flexible cross-chain asset transfer, the complex workflow involving both on-chain smart contracts and off-chain programs causes emerging security issues. In the past year, there have been more than ten severe attacks against cross-chain bridges, causing billions of dollars in losses. With few studies focusing on the security of cross-chain bridges, the community still lacks the knowledge and tools to mitigate this significant threat. To bridge the gap, we conduct the first study on the security of cross-chain bridges. We document three new classes of security bugs and propose a set of security properties and patterns to characterize them. Based on those patterns, we design Xscope, an automatic tool that finds security violations in cross-chain bridges and detects real-world attacks. We evaluate Xscope on four popular cross-chain bridges. It successfully detects all known attacks and finds previously unreported suspicious attacks. A video of Xscope is available at https://youtu.be/vMRO_qOqtXY.
Research Papers
Thu 13 Oct 2022 11:10 - 11:30 at Ballroom C East - Technical Session 23 - Security Chair(s): John-Paul Ore North Carolina State University
Smart contracts have been widely and rapidly used to automate financial and business transactions together with blockchains, helping people make agreements while minimizing trust. With millions of smart contracts deployed on blockchains, various bugs and vulnerabilities in smart contracts have emerged. Following the rapid development of deep learning, many recent studies have used deep learning for vulnerability detection to conduct security checks before deploying smart contracts. However, these approaches are limited to deciding only whether a smart contract is vulnerable or not, without further analysis to locate the suspicious statements potentially responsible for the detected vulnerability.
To address this problem, we propose a deep learning based two-phase smart contract debugger for the reentrancy vulnerability, one of the most severe vulnerabilities, named ReVulDL: Reentrancy Vulnerability Detection and Localization. ReVulDL integrates vulnerability detection and localization into a unified debugging pipeline. For the detection phase, given a smart contract, ReVulDL uses a graph-based pre-training model to learn the complex relationships in propagation chains to detect whether the smart contract contains a reentrancy vulnerability. For the localization phase, if a reentrancy vulnerability is detected, ReVulDL utilizes interpretable machine learning to locate the suspicious statements in the smart contract and provide interpretations of the detected vulnerability. Our large-scale empirical study on 47,398 smart contracts shows that ReVulDL achieves promising results in detecting reentrancy vulnerabilities (e.g., outperforming 15 state-of-the-art vulnerability detection approaches) and locating vulnerable statements (e.g., 70.38% of the vulnerable statements are ranked within the top 10).
Research Papers
Thu 13 Oct 2022 13:30 - 13:50 at Ballroom C East - Technical Session 25 - Software Repairs Chair(s): Yannic Noller National University of Singapore
Automated program repair (APR) techniques have shown great success in automatically finding fixes for programs in programming languages such as C or Java. In this work, we focus on repairing formal specifications, in particular for the Alloy specification language. As opposed to most APR tools, our approach to repairing Alloy specifications, named ICEBAR, does not use test-based oracles for patch assessment. Instead, ICEBAR relies on property-based oracles, commonly found in Alloy specifications as predicates and assertions. These property-based oracles define stronger conditions for patch assessment, thus reducing the notorious overfitting issue caused by using test-based oracles, typically observed in APR contexts. Moreover, as assertions and predicates are inherent to Alloy, whereas test cases are not, our tool is potentially more appealing to Alloy users than test-based Alloy repair tools.
At a high level, ICEBAR is an iterative, counterexample-based process, that generates and validates repair candidates. ICEBAR receives a faulty Alloy specification with a failing property-based oracle, and uses Alloy’s counterexamples to build tests and feed ARepair, a test-based Alloy repair tool, in order to produce a repair candidate. The candidate is then checked against the property oracle for overfitting: if the candidate passes, a repair has been found; if not, further counterexamples are generated to construct tests and enhance the test suite, and the process is iterated. ICEBAR includes different mechanisms, with different degrees of reliability, to generate counterexamples from failing predicates and assertions.
Our evaluation shows that ICEBAR significantly improves over ARepair, in both reducing overfitting and improving the repair rate. Moreover, ICEBAR shows that iterative refinement allows us to significantly improve a state-of-the-art tool for automated repair of Alloy specifications without any modifications to the tool.
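To make the iterative process above concrete, here is a minimal Python sketch of an ICEBAR-style refinement loop. All helper callables (`check_oracle`, `counterexamples_to_tests`, `run_arepair`) are hypothetical stand-ins for the real Alloy and ARepair integrations, not ICEBAR's actual code.

```python
def icebar_repair(faulty_spec, oracle, check_oracle, counterexamples_to_tests,
                  run_arepair, max_iterations=10):
    """Counterexample-driven repair loop with a property-based oracle."""
    tests, candidate = [], faulty_spec
    for _ in range(max_iterations):
        # Check the current candidate against the property-based oracle.
        passed, counterexamples = check_oracle(candidate, oracle)
        if passed:
            return candidate  # the oracle holds: a non-overfitting repair
        # Turn counterexamples into tests and grow the test suite.
        tests.extend(counterexamples_to_tests(counterexamples))
        # Ask the test-based repair tool (ARepair) for a new candidate
        # with respect to the accumulated test suite.
        candidate = run_arepair(faulty_spec, tests)
        if candidate is None:
            return None  # ARepair could not produce a candidate
    return None  # iteration budget exhausted
```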
Research Papers
Thu 13 Oct 2022 13:50 - 14:10 at Ballroom C East - Technical Session 25 - Software Repairs Chair(s): Yannic Noller National University of Singapore
Given sufficiently large training and testing datasets, Deep Neural Networks (DNNs) are expected to generalize. However, inputs may deviate from the training dataset distribution in real deployments. This is a fundamental issue of using a finite dataset, and it may lead deployed DNNs to mis-predict in production.
Inspired by input-debugging techniques for traditional software systems, we propose a runtime approach to identify and fix failure-inducing inputs in deep learning systems. Specifically, our approach targets DNN mis-predictions caused by unexpected (deviating and out-of-distribution) runtime inputs. Our approach has two steps. First, it recognizes and distinguishes deviating (``unseen'' but semantically-preserving) and out-of-distribution inputs from in-distribution inputs. Second, our approach fixes the failure-inducing inputs by transforming them into inputs from the training set that have similar semantics. We call this process \emph{input reflection} and formulate it as a search problem over the embedding space of the training set.
We implemented a tool called InputReflector based on the above two-step approach and evaluated it with experiments on three DNN models trained on the CIFAR-10, MNIST, and FMNIST image datasets. The results show that InputReflector can effectively distinguish deviating inputs that retain the semantics of the distribution (e.g., zoomed images) and out-of-distribution inputs from in-distribution inputs. InputReflector repairs deviating inputs and achieves a 30.78% accuracy improvement over the original models. We also illustrate how InputReflector can be used to evaluate tests generated by deep learning testing tools.
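The reflection step can be pictured as a nearest-neighbour search in the training set's embedding space. The sketch below is one plausible realization with NumPy, assuming an embedding function and pre-computed training embeddings are available; the names are illustrative, not InputReflector's API.

```python
import numpy as np


def reflect_input(x, embed, train_embeddings, train_inputs):
    """Map a deviating input to the semantically closest training input.

    embed:            maps a raw input to a vector in the embedding space
    train_embeddings: (N, d) array of embedded training inputs
    train_inputs:     the N corresponding raw training inputs
    """
    z = embed(x)
    # Nearest neighbour in the embedding space (Euclidean distance).
    distances = np.linalg.norm(train_embeddings - z, axis=1)
    return train_inputs[int(np.argmin(distances))]
```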
Tool Demonstrations
Wed 12 Oct 2022 09:30 - 10:00 at Ballroom A - Tool Poster Session 2
With the application of deep learning (DL) in signal detection, improving the robustness of classification models has received much attention, especially in automatic modulation classification (AMC) of electromagnetic signals. To obtain robust models, a large amount of electromagnetic signal data is required in the training and testing process. However, both the high cost of manual collection and the low quality of automatically generated data samples lead to defects in AMC models. Therefore, it is important to generate electromagnetic data by data augmentation. In this paper, we propose a novel electromagnetic data augmentation tool, namely ElecDaug, which guides the metamorphic process with electromagnetic signal characteristics to achieve automatic data augmentation. Based on electromagnetic data pre-processing and characteristic metamorphosis in the transmission or time-frequency domains, ElecDaug can augment the data samples to build robust AMC models. Preliminary experiments show that ElecDaug can effectively augment available data samples for model repair. The video is at https://youtu.be/tqC0z5Sg1_k. Documentation and source code can be found here: https://github.com/ehhhhjw/tool_ElecDaug.git.
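As an illustration of signal-level augmentation of the kind described above, the snippet below applies a small carrier-frequency offset to a complex I/Q recording, a transformation that leaves the modulation class unchanged. It is a generic example of this family of transformations, not necessarily one implemented in ElecDaug.

```python
import numpy as np


def add_frequency_offset(iq_samples: np.ndarray, offset_hz: float,
                         sample_rate_hz: float) -> np.ndarray:
    """Shift a complex baseband signal by a small carrier-frequency offset.

    The modulation type is preserved, so the augmented sample keeps the
    same AMC label as the original recording.
    """
    n = np.arange(len(iq_samples))
    return iq_samples * np.exp(2j * np.pi * offset_hz * n / sample_rate_hz)
```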
Research Papers
Thu 13 Oct 2022 14:20 - 14:40 at Ballroom C East - Technical Session 25 - Software Repairs Chair(s): Yannic Noller National University of Singapore
Automated program repair (APR) holds the promise of aiding manual debugging activities. Over a decade of evolution, a broad range of APR techniques have been proposed and evaluated on real-world bug datasets. However, while more and more bugs have been correctly fixed, we observe that the growth in bugs newly fixed by APR techniques has hit a bottleneck in recent years. In this work, we explore the possibility of addressing complicated bugs by proposing TransplantFix, a novel APR technique that leverages graph-differencing-based transplantation from donor methods. The key novelty of TransplantFix lies in three aspects: 1) we use a graph-based differencing algorithm to distill semantic fix actions from the donor method; 2) we devise an inheritance-hierarchy-aware code search approach to identify donor methods with similar functionality; 3) we present a namespace transfer approach to effectively adapt donor code.
We investigate the unique contributions of TransplantFix by conducting an extensive comparison covering a total of 42 APR techniques and evaluating TransplantFix on 839 real-world bugs from Defects4J v1.2 and v2.0. TransplantFix presents superior results in three aspects. First, it achieves the best performance among state-of-the-art APR techniques proposed in the last three years in terms of the number of newly fixed bugs, a 60%-300% improvement. Furthermore, without relying on any fix actions crafted manually or learned from big data, it reaches the best generalizability among all APR techniques evaluated on Defects4J v1.2 and v2.0. In addition, it shows the potential to synthesize complicated patches consisting of up to eight-line insertions in a single hunk. TransplantFix presents fresh insights and a promising avenue for follow-up research towards addressing more complicated bugs.
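Read together, the three components amount to a search-distill-adapt loop over candidate donors. The following sketch only illustrates that control flow; every helper passed in (`search_donors`, `graph_diff`, `transfer_namespace`, `apply_actions`, `passes_tests`) is a hypothetical placeholder rather than TransplantFix's implementation.

```python
# High-level sketch of a donor-transplantation repair loop (illustrative only).

def transplant_fix(buggy_method, codebase, search_donors, graph_diff,
                   transfer_namespace, apply_actions, passes_tests):
    # 1) Search for donor methods with similar functionality, guided by the
    #    inheritance hierarchy around the buggy method's class.
    for donor in search_donors(buggy_method, codebase):
        # 2) Distill semantic fix actions by differencing the graph
        #    representations of the buggy and donor methods.
        actions = graph_diff(buggy_method, donor)
        # 3) Adapt donor identifiers to the buggy method's namespace.
        adapted = transfer_namespace(actions, donor, buggy_method)
        candidate = apply_actions(buggy_method, adapted)
        if passes_tests(candidate):
            return candidate
    return None
```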
NIER Track
Thu 13 Oct 2022 14:40 - 14:50 at Ballroom C East - Technical Session 25 - Software Repairs Chair(s): Yannic Noller National University of Singapore
Template-based automatic program repair (T-APR) techniques depend on the quality of bug-fixing templates, which are pairs of buggy-code and patch-code templates. For such templates to be of sufficient quality for T-APR techniques to succeed, they must satisfy three criteria: applicability, fixability, and efficiency. Mining appropriate bug-fixing templates for T-APR is therefore an optimization problem of finding templates that satisfy all three criteria. Existing template mining approaches select templates based only on the first criterion, and are thus suboptimal in their performance. This study proposes a multi-objective optimization-based bug-fixing template mining method for T-APR in which we estimate template quality based on nine code abstraction tasks and three objective functions. Our method determines the optimal code abstraction strategy (i.e., the optimal combination of abstraction tasks) that maximizes the values of the three objective functions, and generates a final set of bug-fixing templates by clustering template candidates to which the optimal abstraction strategy is applied. Our preliminary experiment demonstrated that our optimized strategy can improve templates' applicability and efficiency by 7% and 146%, respectively, over the existing mining technique. We therefore conclude that multi-objective optimization-based template mining effectively finds high-quality bug-fixing templates.
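One simplified way to picture the optimization is to enumerate combinations of abstraction tasks and keep those that are not dominated on the three objectives. The actual method goes further and selects a single optimal strategy, so the sketch below, with its hypothetical `evaluate` function returning a triple of objective values, is only an approximation of the idea.

```python
from itertools import combinations


def pareto_optimal_strategies(tasks, evaluate):
    """Keep the abstraction strategies (task combinations) that are not
    dominated w.r.t. a vector of objective values, e.g.
    (applicability, fixability, efficiency). `evaluate` is hypothetical."""
    strategies = [frozenset(c) for r in range(1, len(tasks) + 1)
                  for c in combinations(tasks, r)]
    scored = {s: evaluate(s) for s in strategies}

    def dominates(a, b):
        # a dominates b if it is no worse on every objective and better on one.
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    return [s for s in strategies
            if not any(dominates(scored[t], scored[s]) for t in strategies if t != s)]
```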
Research Papers
Thu 13 Oct 2022 14:50 - 15:10 at Ballroom C East - Technical Session 25 - Software Repairs Chair(s): Yannic Noller National University of Singapore
Recently, the emerging trend in automatic program repair is to apply deep neural networks to generate fixed code from buggy code, an approach called NPR (Neural Program Repair). However, existing NPR systems are trained and evaluated under very different settings (e.g., different training data, inconsistent evaluation data, widely varying candidate numbers), which makes it hard to draw fair conclusions when comparing them. Motivated by this, we first build a standard benchmark dataset and an extensive framework tool to mitigate threats to the comparison. The dataset consists of a training set, a validation set, and an evaluation set with 144,641, 13,739, and 13,706 Java bug-fix pairs, respectively. The tool supports selecting specific training and evaluation datasets and automatically running the pipeline of training and evaluating NPR models, and it allows new NPR models to be integrated easily by implementing well-defined interfaces. Then, based on the benchmark and tool, we conduct a comprehensive empirical comparison of six SOTA NPR systems w.r.t. repairability, inclination, and generalizability. The experimental results reveal deeper characteristics of the compared NPR systems and subvert some existing comparative conclusions, which further verifies the necessity of unifying the experimental setups when exploring the progress of NPR systems. Finally, we identify some promising research directions derived from our findings.
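The plug-in style of integration mentioned above is typically realized through a small abstract interface that every NPR model implements. The sketch below shows what such an interface could look like; the class and method names are hypothetical, not the tool's actual interface.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class NPRModel(ABC):
    """Hypothetical plug-in interface for integrating an NPR system."""

    @abstractmethod
    def train(self, bug_fix_pairs: List[Tuple[str, str]]) -> None:
        """Train on (buggy_code, fixed_code) pairs from the training split."""

    @abstractmethod
    def repair(self, buggy_code: str, num_candidates: int) -> List[str]:
        """Return a ranked list of candidate patches for a buggy snippet."""
```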
Research Papers
Thu 13 Oct 2022 16:00 - 16:20 at Ballroom C East - Technical Session 29 - AI for SE II Chair(s): Tim Menzies North Carolina State University
Debugging, that is, identifying and fixing bugs in software, is a central part of software development. Developers are therefore often confronted with the task of deciding whether a given code snippet contains a bug, and if so, where. Recently, data-driven methods have been employed to learn this task of bug detection, resulting (amongst others) in so-called neural bug detectors. Neural bug detectors are trained on millions of buggy and correct code snippets.
Given this “neural learning” procedure, it seems likely that neural bug detectors perform similarly to human software developers on the specific task of finding bugs. In this work, we set out to substantiate or refute this hypothesis. We report on the results of an empirical study with over 100 software developers that compares humans and neural bug detectors. As the detection task, we chose a specific class of bugs (variable misuse bugs) for which neural bug detectors have recently made significant progress.
Our study shows that, despite neural bug detectors seeing millions of such examples during training, software developers, when conducting bug detection as a majority decision, are slightly better than neural bug detectors. Altogether, we find a large overlap in performance, both for classifying code as buggy and for localizing the buggy line in the code. Compared to developers, however, one of the two evaluated neural bug detectors raises a higher number of false alarms.
Research Papers
Thu 13 Oct 2022 16:20 - 16:40 at Ballroom C East - Technical Session 29 - AI for SE II Chair(s): Tim Menzies North Carolina State University
Due to the popularity of smart contracts in the modern financial ecosystem, there has been growing interest in formally verifying their correctness and security properties. Most existing techniques in this space focus on common vulnerabilities like arithmetic overflows and perform verification by leveraging contract invariants (i.e., logical formulas that hold at transaction boundaries). In this paper, we propose a new technique, based on deep reinforcement learning, for automatically learning contract invariants that are useful for proving arithmetic safety. Our method incorporates an off-line training phase in which the verifier uses its own verification attempts to learn a policy for contract invariant generation. This learned (neural) policy is then used at verification time to predict likely invariants that are also useful for proving arithmetic safety. We implemented this idea in a tool called Cider and incorporated it into an existing verifier (based on refinement type checking) for proving arithmetic safety. Our evaluation shows that Cider improves both the quality of the inferred invariants and the inference time, leading to faster verification and lower false positive rates overall.
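The off-line phase can be read as a simple reinforcement-learning loop in which the verifier's outcome provides the reward. The sketch below is conceptual only; `propose`, `verify`, and `update_policy`, as well as the reward values, are assumptions for illustration and not Cider's implementation.

```python
# Conceptual sketch of learning an invariant-generation policy from the
# verifier's own attempts (hypothetical helpers, not Cider's code).

def train_invariant_policy(policy, contracts, propose, verify, update_policy,
                           epochs=10):
    for _ in range(epochs):
        for contract in contracts:
            # The policy proposes a candidate contract invariant.
            invariant, log_prob = propose(policy, contract)
            # The verifier's outcome serves as the reward signal:
            # positive if the invariant helps discharge the arithmetic-safety proof.
            reward = 1.0 if verify(contract, invariant) else -0.1
            update_policy(policy, log_prob, reward)
    return policy
```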
Research Papers
Thu 13 Oct 2022 16:40 - 17:00 at Ballroom C East - Technical Session 29 - AI for SE II Chair(s): Tim Menzies North Carolina State University
Although large pre-trained models of code have delivered significant advancements in various code processing tasks, there is an impediment to the wide and fluent adoption of these powerful models in the daily workflow of software developers: these large models consume hundreds of megabytes of memory and run slowly, especially on personal devices, which causes problems in model deployment and greatly degrades the user experience.
This motivates us to propose Compressor, a novel approach that can compress pre-trained models of code into extremely small models with negligible performance sacrifice. Our method formulates the design of tiny models as simplifying the pre-trained model architecture: searching for a significantly smaller model that follows an architectural design similar to the original pre-trained model. To tackle this problem, Compressor uses a genetic algorithm (GA)-based strategy to guide the simplification process. Prior studies found that a model with higher computational cost tends to be more powerful. Inspired by this insight, the GA is designed to maximize a model's giga floating-point operations (GFLOPs), an indicator of computational cost, under the constraint on the target model size. Then, we use knowledge distillation to train the small model: unlabelled data is fed into the large model, and its outputs are used as labels to train the small model. We evaluate Compressor with two state-of-the-art pre-trained models, i.e., CodeBERT and GraphCodeBERT, on two important tasks, i.e., vulnerability prediction and clone detection. We use the proposed method to compress the models to 3 MB, only 0.6% of the original model size. The results show that the compressed CodeBERT and GraphCodeBERT reduce inference latency by 70.75% and 79.21%, respectively. More importantly, they maintain 96.15% and 97.74% of the original performance on the vulnerability prediction task. They even maintain higher ratios (99.20% and 97.52%) of the original performance on the clone detection task.
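The distillation step described above follows the usual teacher-student recipe: the large model's soft outputs on unlabelled inputs become the training signal for the small model. Below is a generic PyTorch-style sketch of that step, assuming `teacher`, `student`, `unlabelled_loader`, and `optimizer` already exist; it is not Compressor's actual training code, and the temperature value is an illustrative choice.

```python
import torch
import torch.nn.functional as F


def distill(teacher, student, unlabelled_loader, optimizer, temperature=2.0):
    """Train a small student model on the soft labels of a large teacher."""
    teacher.eval()
    student.train()
    for inputs in unlabelled_loader:
        with torch.no_grad():
            teacher_logits = teacher(inputs)   # soft labels from the large model
        student_logits = student(inputs)
        # KL divergence between temperature-softened output distributions.
        loss = F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(teacher_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```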
Research Papers
Thu 13 Oct 2022 17:00 - 17:20 at Ballroom C East - Technical Session 29 - AI for SE II Chair(s): Tim Menzies North Carolina State University
Many real-world online systems require forecasts of monitored time series metrics to detect and localize anomalies, schedule resources, and assist relevant staff in decision making. Even though many time series forecasting techniques have been proposed, few of them can be directly applied in online systems due to efficiency concerns and the lack of model sharing. To address these challenges, this paper presents TTSF-transformer, a transferable time series forecasting service using a deep transformer model. TTSF-transformer normalizes multiple metric frequencies to enable model sharing across multi-source systems, employs a deep transformer model with Bayesian estimation to generate the predictive marginal distribution, and introduces transfer learning and incremental learning into the training process to ensure long-term performance. We conduct experiments on real-world time series metrics from two different types of game business in Tencent. The results show that TTSF-transformer significantly outperforms other state-of-the-art methods and is suitable for wide deployment in large online systems.
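Frequency normalization of monitored metrics can be as simple as resampling every series to a shared interval before training. The pandas sketch below shows one plausible way to do this, assuming the series carries a DatetimeIndex; it is illustrative, not TTSF-transformer's preprocessing code.

```python
import pandas as pd


def normalize_frequency(metric: pd.Series, freq: str = "1min") -> pd.Series:
    """Resample a monitored metric to a common frequency so that one model
    can be shared across systems that report at different rates.

    Assumes `metric` is indexed by a pandas DatetimeIndex.
    """
    return metric.resample(freq).mean().interpolate()
```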
Journal-first Papers
Thu 13 Oct 2022 17:20 - 17:40 at Ballroom C East - Technical Session 29 - AI for SE II Chair(s): Tim Menzies North Carolina State University
In the presence of multiple objectives to be optimized in Search-Based Software Engineering (SBSE), Pareto search has been commonly adopted. It searches for a good approximation of the problem's Pareto-optimal solutions, from which the stakeholders choose the most preferred solution according to their preferences. However, when clear preferences of the stakeholders (e.g., a set of weights that reflect the relative importance of the objectives) are available prior to the search, weighted search is believed to be the first choice, since it simplifies the search by converting the original multi-objective problem into a single-objective one and enables the search to focus on exactly what the stakeholders are interested in.
This paper questions such a “weighted search first” belief. We show that the weights can, in fact, be harmful to the search process even in the presence of clear preferences. Specifically, we conduct a large-scale empirical study which consists of 38 systems/projects from three representative SBSE problems, together with two types of search budget and nine sets of weights, leading to 604 cases of comparison. Our key finding is that weighted search reaches a certain level of solution quality by consuming relatively fewer resources at the early stage of the search; however, Pareto search is, for the majority of the time (up to 77% of the cases), significantly better than its weighted counterpart, as long as we allow a sufficient, but not unrealistic, search budget. This is a beneficial result, as it discovers a potentially new “rule of thumb” for the SBSE community: even when clear preferences are available, it is recommended to always consider Pareto search by default for multi-objective SBSE problems, provided that solution quality is more important. Weighted search, in contrast, should only be preferred when the resource/search budget is limited, especially for expensive SBSE problems. This, together with other findings and actionable suggestions in the paper, allows us to codify pragmatic and comprehensive guidance on choosing between weighted and Pareto search for SBSE when clear preferences are available. All code and data can be accessed at: https://github.com/ideas-labo/pareto-vs-weight-for-sbse.
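For reference, the conversion that weighted search performs is simply a scalarization of the objective vector with the stakeholders' weights, as in the minimal sketch below (assuming normalized objectives that are all to be minimized).

```python
def weighted_objective(objectives, weights):
    """Collapse a multi-objective vector into a single scalar fitness value."""
    return sum(w * o for w, o in zip(weights, objectives))


# Example: two normalized objectives with a 0.7/0.3 preference.
# weighted_objective([0.2, 0.9], [0.7, 0.3]) -> 0.41
```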
Research Papers
Thu 13 Oct 2022 17:40 - 18:00 at Ballroom C East - Technical Session 29 - AI for SE II Chair(s): Tim Menzies North Carolina State University
With the rapid development of deep learning, deep predictive models have been widely applied to improve Software Engineering tasks, such as defect prediction and issue classification, and have achieved remarkable success. They are mostly trained in a supervised manner, which heavily relies on high-quality datasets. Unfortunately, due to the nature and source of software engineering data, real-world datasets often suffer from sample mislabelling and class imbalance, undermining the effectiveness of deep predictive models in practice. This problem has become a major obstacle for deep learning-based Software Engineering. In this paper, we propose RobustTrainer, the first approach to learning deep predictive models on raw training datasets where mislabelled samples and imbalanced classes coexist. RobustTrainer consists of a two-stage training scheme: the first stage learns feature representations robust to sample mislabelling, and the second builds a classifier robust to class imbalance on top of the representations learned in the first stage. We apply RobustTrainer to two popular Software Engineering tasks, i.e., Bug Report Classification and Software Defect Prediction. Evaluation results show that RobustTrainer effectively tackles the mislabelling and class imbalance issues and produces significantly better deep predictive models than the six comparison approaches.
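The two-stage scheme can be summarized as the control flow below; each helper passed in is a hypothetical placeholder (e.g., for a noise-tolerant representation objective and an imbalance-aware classifier), not RobustTrainer's actual code.

```python
# Sketch of a two-stage training scheme of the kind described above
# (hypothetical helpers, illustration only).

def two_stage_training(encoder, classifier, raw_dataset,
                       train_robust_representation, select_clean_samples,
                       train_imbalance_aware_classifier):
    # Stage 1: learn feature representations that are robust to mislabelled
    # samples, then filter a trusted subset of the raw training data.
    encoder = train_robust_representation(encoder, raw_dataset)
    trusted = select_clean_samples(encoder, raw_dataset)

    # Stage 2: on top of the learned representations, train a classifier that
    # is robust to class imbalance (e.g., via re-weighting or re-sampling).
    classifier = train_imbalance_aware_classifier(classifier, encoder, trusted)
    return encoder, classifier
```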
[Workshop] HILT '22 -- Supporting a Rigorous Approach to Software Development
Fri 14 Oct 2022 08:30 - 10:00 at Ballroom C East - Session 1
No description available.