Bio: Julia Lawall received the PhD degree in 1994 from Indiana University. She has been a senior research scientist at INRIA since 2011. Previously, she was a Lektor (Associate Professor) at the University of Copenhagen. Her research interests include the use of programming language and software engineering technology to improve the development and evolution of systems code. She leads the development of the Coccinelle program matching and transformation system and contributes regularly to the Linux kernel based on the tools developed in her research. She is on the editorial board of the journal Science of Computer Programming, and has been a program chair of USENIX ATC, ASE, PEPM, GPCE, and ICFP.
Research Papers
Tue 12 Sep 2023 10:30 - 10:42 at Room C - Testing AI Systems 1 Chair(s): Leonardo Mariani University of Milano-Bicocca
Automated detection of software failures is an important but challenging software engineering task. It involves finding, in a vast search space, the failure-inducing test cases that contain an input triggering the software fault and an oracle asserting the incorrect execution. We are motivated to study how far this outstanding challenge can be solved by recent advances in large language models (LLMs) such as ChatGPT. However, our study reveals that ChatGPT has a relatively low success rate (28.8%) in finding correct failure-inducing test cases for buggy programs. A possible conjecture is that finding failure-inducing test cases requires analyzing the subtle differences (nuances) between the tokens of a program’s correct version and those of its buggy version. When these two versions have similar sets of tokens and attentions, ChatGPT is weak in distinguishing their differences.
We find that ChatGPT can successfully generate failure-inducing test cases when it is guided to focus on the nuances. Our solution is inspired by an interesting observation that ChatGPT could infer the intended functionality of buggy code if it is similar to the correct version. Driven by this observation, we develop a novel technique, called Differential Prompting, to effectively find failure-inducing test cases with the help of the compilable code synthesized by the inferred intention. Prompts are constructed based on the nuances between the given version and the synthesized code. We evaluate Differential Prompting on QuixBugs (a popular benchmark of buggy programs) and recent programs published at Codeforces (a popular programming contest portal, which is also an official benchmark of ChatGPT). We compare Differential Prompting with two baselines constructed using conventional ChatGPT prompting and PYNGUIN (the state-of-the-art unit test generation tool for Python programs). Our evaluation results show that for programs of QuixBugs, Differential Prompting can achieve a success rate of 75.0% in finding failure-inducing test cases, outperforming the best baseline by 2.6X. For programs of Codeforces, Differential Prompting’s success rate is 66.7%, outperforming the best baseline by 4.0X.
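To make the mechanism concrete, the sketch below (ours, not the authors' implementation) shows how a differential prompt might be assembled from the nuances between the given program and a reference version synthesized from the inferred intention; the diff comes from the standard library, while `query_llm` and the prompt wording are assumptions.

```python
import difflib

def build_differential_prompt(buggy_code: str, reference_code: str) -> str:
    """Build a prompt that directs the LLM's attention to the nuances
    (line-level differences) between the given program and a reference
    version synthesized from its inferred intention."""
    diff = "\n".join(difflib.unified_diff(
        reference_code.splitlines(), buggy_code.splitlines(),
        fromfile="inferred_reference.py", tofile="given_program.py", lineterm=""))
    return (
        "The two programs below are intended to implement the same functionality.\n"
        f"Reference version:\n{reference_code}\n\n"
        f"Given version:\n{buggy_code}\n\n"
        f"They differ as follows:\n{diff}\n\n"
        "Generate a test input on which the two versions produce different outputs, "
        "and state the expected output of the reference version as the oracle."
    )

# Usage (query_llm is a placeholder for whatever chat-completion client is used):
# prompt = build_differential_prompt(buggy_code, synthesized_code)
# candidate_test = query_llm(prompt)
```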
NIER Track
Tue 12 Sep 2023 10:42 - 10:54 at Room C - Testing AI Systems 1 Chair(s): Leonardo Mariani University of Milano-Bicocca
Software testing is an important part of the development cycle, yet it requires specialized expertise and substantial developer effort to adequately test software. The recent discoveries of the capabilities of large language models (LLMs) suggest that they can be used as automated testing assistants, and thus provide helpful information and even drive the testing process. To highlight the potential of this technology, we present a taxonomy of LLM-based testing agents based on their level of autonomy, and describe how a greater level of autonomy can benefit developers in practice. An example use of LLMs as a testing assistant is provided to demonstrate how a conversational framework for testing can help developers. This also highlights how the often-criticized hallucination of LLMs can be beneficial during testing. We identify other tangible benefits that LLM-driven testing agents can bestow, and also discuss some potential limitations.
NIER Track
Tue 12 Sep 2023 10:54 - 11:06 at Room C - Testing AI Systems 1 Chair(s): Leonardo Mariani University of Milano-Bicocca
The performance of state-of-the-art Deep Learning models heavily depends on the availability of well-curated training and testing datasets that sufficiently capture the operational domain. Data augmentation is an effective technique in alleviating data scarcity, reducing the time-consuming and expensive data collection and labelling processes. Despite their potential, existing data augmentation techniques primarily focus on simple geometric and colour space transformations, like noise, flipping and resizing, producing datasets with limited diversity. When the augmented dataset is used for testing the Deep Learning models, the derived results are typically uninformative about the robustness of the models. We address this gap by introducing GENFUZZER, a novel coverage-guided data augmentation fuzzing technique for Deep Learning models underpinned by generative AI. We demonstrate our approach using widely-adopted datasets and models employed for image classification, illustrating its effectiveness in generating informative datasets that lead to an increase of up to 26% in widely-used coverage criteria.
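The coverage-guided core of such a fuzzer can be pictured with a minimal loop like the one below; `generate_variant` (a call into a generative model) and `coverage_of` (a coverage metric returning the set of covered elements) are assumed callbacks, not GENFUZZER's API.

```python
import random

def coverage_guided_augmentation(seed_images, generate_variant, coverage_of, budget=1000):
    """Minimal coverage-guided augmentation loop: keep only generated variants
    that increase the (model-specific) coverage measure."""
    corpus = list(seed_images)
    covered = set()
    for img in corpus:
        covered |= coverage_of(img)
    for _ in range(budget):
        parent = random.choice(corpus)
        child = generate_variant(parent)          # generative-AI mutation
        new_cov = coverage_of(child) - covered    # coverage feedback
        if new_cov:                               # keep only informative variants
            corpus.append(child)
            covered |= new_cov
    return corpus
```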
Research Papers
Tue 12 Sep 2023 11:06 - 11:18 at Room C - Testing AI Systems 1 Chair(s): Leonardo Mariani University of Milano-Bicocca
Research Papers
Tue 12 Sep 2023 11:18 - 11:30 at Room C - Testing AI Systems 1 Chair(s): Leonardo Mariani University of Milano-Bicocca
Learning-based techniques, especially advanced Large Language Models (LLMs) for code, have gained considerable popularity in various software engineering (SE) tasks. However, most existing works focus on designing better learning-based models and pay less attention to the properties of datasets. Learning-based models, including popular LLMs for code, heavily rely on data, and the data’s properties (e.g., data distribution) could significantly affect their behavior. We conducted an exploratory study on the distribution of SE data and found that such data usually follows a skewed distribution (i.e., long-tailed distribution) where a small number of classes have an extensive collection of samples, while a large number of classes have very few samples. We investigate three distinct SE tasks and analyze the impacts of long-tailed distribution on the performance of LLMs for code. Our experimental results reveal that the long-tailed distribution has a substantial impact on the effectiveness of LLMs for code. Specifically, LLMs for code perform between 30.0% and 254.0% worse on data samples associated with infrequent labels compared to data samples of frequent labels. Our study provides a better understanding of the effects of long-tailed distributions on popular LLMs for code and insights for the future development of SE automation.
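The skew analysis the study relies on is easy to reproduce in outline: rank label classes by frequency, split them into a frequent "head" and an infrequent "tail", and compare model accuracy across the two groups. The sketch below is a generic illustration; the 20% head cut-off is an assumption, not the paper's definition.

```python
from collections import Counter

def head_tail_split(labels, head_fraction=0.2):
    """Split label classes into 'head' (frequent) and 'tail' (infrequent) classes."""
    freq = Counter(labels)
    ranked = [cls for cls, _ in freq.most_common()]
    cut = max(1, int(len(ranked) * head_fraction))
    return set(ranked[:cut]), set(ranked[cut:])

def accuracy_by_group(y_true, y_pred, group):
    """Accuracy restricted to samples whose true label falls in `group`."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t in group]
    return sum(t == p for t, p in pairs) / len(pairs) if pairs else float("nan")

# head, tail = head_tail_split(train_labels)
# gap = accuracy_by_group(y_true, y_pred, head) - accuracy_by_group(y_true, y_pred, tail)
```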
Research Papers
Tue 12 Sep 2023 11:30 - 11:42 at Room C - Testing AI Systems 1 Chair(s): Leonardo Mariani University of Milano-Bicocca
Deep neural networks (DNNs) have demonstrated outstanding performance in various software systems, but they also exhibit misbehavior and can even result in irreversible disasters. Therefore, it is crucial to identify the misbehavior of DNN-based software and improve DNNs’ quality. Test input prioritization is one of the most appealing ways to guarantee DNNs’ quality, which prioritizes test inputs so that more bug-revealing inputs can be identified earlier with limited time and manual labeling efforts. However, the existing prioritization methods are still limited from three aspects: certifiability, effectiveness, and generalizability. To overcome these challenges, we propose CertPri, a test input prioritization technique designed based on a movement cost perspective of test inputs in DNNs’ feature space. CertPri differs from previous works in three key aspects: (1) certifiable: it provides a formal robustness guarantee for the movement cost; (2) effective: it leverages formally guaranteed movement costs to identify malicious bug-revealing inputs; and (3) generic: it can be applied to various tasks, data, models, and scenarios. Extensive evaluations across 2 tasks (i.e., classification and regression), 6 data forms, 4 model structures, and 2 scenarios (i.e., white-box and black-box) demonstrate CertPri’s superior performance. For instance, it improves prioritization effectiveness by 53.97% on average compared with baselines. Its robustness and generalizability are 1.41~2.00 times and 1.33~3.39 times those of baselines on average, respectively.
Research Papers
Tue 12 Sep 2023 13:30 - 13:42 at Room C - Testing AI Systems 2 Chair(s): Lwin Khin Shar Singapore Management University
NIER Track
Tue 12 Sep 2023 13:42 - 13:54 at Room C - Testing AI Systems 2 Chair(s): Lwin Khin Shar Singapore Management University
NIER Track
Tue 12 Sep 2023 13:54 - 14:06 at Room C - Testing AI Systems 2 Chair(s): Lwin Khin Shar Singapore Management University
Research Papers
Tue 12 Sep 2023 14:06 - 14:18 at Room C - Testing AI Systems 2 Chair(s): Lwin Khin Shar Singapore Management University
The reliability of decision-making policies is urgently important today as they have established the fundamentals of many critical applications, such as autonomous driving and robotics. To ensure reliability, there have been a number of research efforts on testing decision-making policies that solve Markov decision processes (MDPs). However, due to their deep neural network (DNN)-based nature and infinite state space, developing scalable and effective testing frameworks for decision-making policies still remains open and challenging.
In this paper, we present an effective testing framework for decision-making policies. The framework adopts a generative diffusion model-based test case generator that can easily adapt to different search spaces, ensuring the practicality and validity of test cases. Then, we propose a termination state novelty-based guidance to diversify agent behaviors and improve the test effectiveness. Finally, we evaluate the framework on five widely used benchmarks, including autonomous driving, aircraft collision avoidance, and gaming scenarios. The results demonstrate that our approach identifies more diverse and influential failure-triggering test cases compared to current state-of-the-art techniques. Moreover, we employ the detected failure cases to repair the evaluated models, achieving better robustness enhancement compared to the baseline method.
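The termination-state novelty guidance can be understood through a standard novelty-search ingredient: score each generated test case by how far its termination state lies from previously observed ones, and keep the behaviourally novel cases. The sketch below is generic; the k-nearest-neighbour scoring, the threshold, and the `run_policy_and_get_termination_state` harness are assumptions, not the paper's exact mechanism.

```python
import numpy as np

def novelty_score(termination_state, archive, k=5):
    """Novelty of a test case's termination state: mean distance to its k nearest
    neighbours in an archive of previously seen termination states."""
    if not archive:
        return float("inf")
    dists = sorted(np.linalg.norm(np.asarray(termination_state) - np.asarray(s))
                   for s in archive)
    return float(np.mean(dists[:k]))

# archive = []
# for case in generated_test_cases:
#     state = run_policy_and_get_termination_state(case)   # assumed test harness
#     if novelty_score(state, archive) > THRESHOLD:
#         archive.append(state)   # keep behaviourally novel test cases
```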
Journal-first Papers
Tue 12 Sep 2023 14:18 - 14:30 at Room C - Testing AI Systems 2 Chair(s): Lwin Khin Shar Singapore Management University
When Deep Neural Networks (DNNs) are used in safety-critical systems, engineers should determine the safety risks associated with failures (i.e., erroneous outputs) observed during testing. For DNNs processing images, engineers visually inspect all failure-inducing images to determine common characteristics among them. Such characteristics correspond to hazard-triggering events (e.g., low illumination) that are essential inputs for safety analysis. Though informative, such activity is expensive and error prone.
To support such safety analysis practices, we propose Simulator-based Explanations for DNN failurEs (SEDE), a technique that generates readable descriptions for commonalities in failure-inducing, real-world images and improves the DNN through effective retraining. SEDE leverages the availability of simulators, which are commonly used for cyber-physical systems. It relies on genetic algorithms to drive simulators toward the generation of images that are similar to failure-inducing, real-world images in the test set; it then employs rule learning algorithms to derive expressions that capture commonalities in terms of simulator parameter values. The derived expressions are then used to generate additional images to retrain and improve the DNN.
With DNNs performing in-car sensing tasks, SEDE successfully characterized hazard-triggering events leading to a DNN accuracy drop. Also, SEDE enabled retraining leading to significant improvements in DNN accuracy, up to 18 percentage points.
NIER Track
Tue 12 Sep 2023 14:30 - 14:42 at Room C - Testing AI Systems 2 Chair(s): Lwin Khin Shar Singapore Management University
Research Papers
Tue 12 Sep 2023 15:30 - 15:42 at Room C - Testing AI Systems 3 Chair(s): Mike Papadakis University of Luxembourg, Luxembourg
There has been an increasing interest in enhancing the fairness of machine learning (ML). Despite the growing number of fairness-improving methods, we lack a systematic understanding of the trade-offs among factors considered in the ML pipeline when fairness-improving methods are applied. This understanding is essential for developers to make informed decisions regarding the provision of fair ML services. Nonetheless, it is extremely difficult to analyze the trade-offs when there are multiple fairness parameters and other crucial metrics involved, coupled, and even in conflict with one another.
This paper uses causality analysis as a principled method for analyzing trade-offs between fairness parameters and other crucial metrics in ML pipelines. To practically and effectively conduct causality analysis, we propose a set of domain-specific optimizations to facilitate accurate causal discovery and a unified, novel interface for trade-off analysis based on well-established causal inference methods. We conduct a comprehensive empirical study using three real-world datasets on a collection of widely used fairness-improving techniques. Our study obtains actionable suggestions for users and developers of fair ML. We further demonstrate the versatile usage of our approach in selecting the optimal fairness-improving method, paving the way for more ethical and socially responsible AI technologies.
NIER Track
Tue 12 Sep 2023 15:42 - 15:54 at Room C - Testing AI Systems 3 Chair(s): Mike Papadakis University of Luxembourg, Luxembourg
Machine Learning (ML), particularly deep learning, has seen vast advancements, leading to the rise of Machine Learning-Enabled Systems (MLS). However, numerous software engineering challenges persist in propelling these MLS into production, largely due to various run-time uncertainties that impact the overall Quality of Service (QoS). These uncertainties emanate from ML models, software components, and environmental factors. Self-adaptation techniques present potential in managing run-time uncertainties, but their application in MLS remains largely unexplored. As a solution, we propose the concept of a Machine Learning Model Balancer, focusing on managing uncertainties related to ML models by using multiple models. Subsequently, we introduce AdaMLS, a novel self-adaptation approach that leverages this concept and extends the traditional MAPE-K loop for continuous MLS adaptation. AdaMLS employs lightweight unsupervised learning for dynamic model switching, thereby ensuring consistent QoS. Through a self-adaptive object detection system prototype, we demonstrate AdaMLS’s effectiveness in balancing system and model performance. Preliminary results suggest AdaMLS surpasses naive and single state-of-the-art models in QoS guarantees, heralding the advancement towards self-adaptive MLS with optimal QoS in dynamic environments.
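The Model Balancer idea rests on a MAPE-K-style control loop around the deployed models: monitor latency and prediction quality, then switch between a lighter and a heavier model when the quality of service drifts. The sketch below only illustrates that loop; the model names, thresholds, and `predict` interface are assumptions, and AdaMLS's unsupervised switching logic is not reproduced.

```python
import time

class ModelBalancer:
    """Sketch of a MAPE-K-style loop that switches between ML models when the
    monitored quality of service degrades (illustrative, not AdaMLS internals)."""

    def __init__(self, models, latency_budget_ms=100.0, min_confidence=0.6):
        self.models = models              # e.g., {"small": fast_model, "large": accurate_model}
        self.active = "large"
        self.latency_budget_ms = latency_budget_ms
        self.min_confidence = min_confidence

    def monitor(self, frame):
        # Measure latency and prediction confidence of the active model.
        start = time.perf_counter()
        confidence = self.models[self.active].predict(frame)   # assumed interface
        latency_ms = (time.perf_counter() - start) * 1000.0
        return latency_ms, confidence

    def analyze_and_plan(self, latency_ms, confidence):
        if latency_ms > self.latency_budget_ms:
            return "small"          # too slow: fall back to the lighter model
        if confidence < self.min_confidence and self.active == "small":
            return "large"          # low quality: escalate to the heavier model
        return self.active

    def execute(self, frame):
        latency_ms, confidence = self.monitor(frame)
        self.active = self.analyze_and_plan(latency_ms, confidence)
        return confidence
```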
Industry Showcase (Papers)
Tue 12 Sep 2023 15:54 - 16:06 at Room C - Testing AI Systems 3 Chair(s): Mike Papadakis University of Luxembourg, Luxembourg
Artificial Intelligence (AI) enabled embedded devices are becoming increasingly important in the field of healthcare where such devices are utilized to assist physicians, clinicians, and surgeons in their diagnosis, therapy planning, and rehabilitation. However, it is still a challenging task to come up with an accurate and efficient machine learning model for resource-limited devices that work 24×7. It requires both intuition and experience. This dependence on human expertise and reliance on trial-and-error-based design methods create impediments to the standard processes of effort estimation, design phase planning, and generating service-level agreements for projects that involve AI-enabled MedTech devices.
In this paper, we present AutoML search from an algorithmic perspective, instead of a more prevalent optimization or black-box tool perspective. We briefly present and point to case studies that demonstrate the efficacy of the automation approach in terms of productivity improvements. We believe that our proposed method can make AutoML more amenable to the applications of software engineering principles and also accelerate biomedical device engineering, where there is a high dependence on skilled human resources.
Research Papers
Tue 12 Sep 2023 16:06 - 16:18 at Room C - Testing AI Systems 3 Chair(s): Mike Papadakis University of Luxembourg, Luxembourg
Computational notebooks have become the go-to way for solving data-science problems. While they are designed to combine code and documentation, prior work shows that documentation is largely ignored by the developers because of the manual effort. Automated documentation generation can help, but existing techniques fail to capture algorithmic details and developers often end up editing the generated text to provide more explanation and sub-steps. This paper proposes a novel machine-learning pipeline, Cell2Doc, for code cell documentation in Python data science notebooks. Our approach works by identifying different logical contexts within a code cell, generating documentation for them separately, and finally combining them to arrive at the documentation for the entire code cell. Cell2Doc takes advantage of the capabilities of existing pre-trained language models and improves their efficiency for code cell documentation. We also provide a new benchmark dataset for this task, along with a data-preprocessing pipeline that can be used to create new datasets. We also investigate an appropriate input representation for this task. Our automated evaluation suggests that our best input representation improves the pre-trained model’s performance by 2.5x on average. Further, Cell2Doc achieves 1.33x improvement during human evaluation in terms of correctness, informativeness, and readability against the corresponding standalone pre-trained model.
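The split-generate-combine idea can be pictured with a deliberately naive sketch: carve a cell into logical contexts, summarize each, and concatenate the pieces. Cell2Doc's actual context identification and models are learned; the blank-line splitter and the `summarize` wrapper below are stand-in assumptions.

```python
def split_into_logical_contexts(cell_source: str):
    """Very rough context splitter: treat blank-line-separated statement groups
    as separate logical contexts."""
    blocks, current = [], []
    for line in cell_source.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            blocks.append("\n".join(current))
            current = []
    if current:
        blocks.append("\n".join(current))
    return blocks

def document_cell(cell_source: str, summarize) -> str:
    """Generate documentation per logical context, then combine.
    `summarize` is an assumed wrapper around a pre-trained code-summarization model."""
    parts = [summarize(block) for block in split_into_logical_contexts(cell_source)]
    return " ".join(parts)
```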
Journal-first Papers
Tue 12 Sep 2023 16:18 - 16:30 at Room C - Testing AI Systems 3 Chair(s): Mike Papadakis University of Luxembourg, Luxembourg
NIER Track
Tue 12 Sep 2023 16:30 - 16:42 at Room C - Testing AI Systems 3 Chair(s): Mike Papadakis University of Luxembourg, Luxembourg
Uncertain, unpredictable, real-time, and lifelong evolution causes operational failures in intelligent software systems, leading to significant damages, safety and security hazards, and tragedies. To fully unleash such systems’ potential and facilitate their wider adoption, ensuring the trustworthiness of their decision-making under uncertainty is the prime challenge. To overcome this challenge, an intelligent software system and its operating environment should be continuously monitored, tested, and refined during its lifetime operation. Existing technologies, such as digital twins, can enable continuous synchronisation with such systems to reflect their most up-to-date states. Such representations are often in the form of prior-knowledge-based and machine-learning models, together called ‘model universe’. In this paper, we present our vision of combining techniques from software engineering, evolutionary computation, and machine learning to support the model universe evolution.
Most Influential Papers (MIP)
Wed 13 Sep 2023 08:30 - 08:42 at Room C - MIP awards Chair(s): Myra Cohen Iowa State University
Most Influential Papers (MIP)
Wed 13 Sep 2023 08:42 - 08:54 at Room C - MIP awards Chair(s): Myra Cohen Iowa State University
Software isn’t created in one dramatic step. It improves bit by bit, one little step at a time — editing, running unit tests, fixing build errors, addressing code reviews, editing some more, appeasing linters, and fixing more errors — until finally it becomes good enough to merge into a code repository. Software engineering isn’t an isolated process, but a dialogue among human developers, code reviewers, bug reporters, software architects and tools, such as compilers, unit tests, linters and static analyzers. I’ll talk about DIDACT (Dynamic Integrated Developer ACTivity), which is a methodology for training large machine learning (ML) models for software development. The novelty of DIDACT is that it uses the process of software development as the source of training data for the model, rather than just the polished end state of that process, the finished code. By exposing the model to the contexts that developers see as they work, paired with the actions they take in response, the model learns about the dynamics of software development and is more aligned with how developers spend their time. We leverage instrumentation of Google’s software development to scale up the quantity and diversity of developer-activity data beyond previous works. Results are promising along two dimensions: usefulness to professional software developers, and as a potential basis for imbuing ML models with general software development skills.
Danny is a Senior Staff Research Scientist at Google DeepMind and a lead of Code AI efforts there. He is also an Adjunct Professor in the Dept of Computer Science at McGill University and a Core Industrial Member at the Mila Quebec AI Institute. He was one of the early people working on Machine Learning for Code and has made a number of contributions to the area over the last 10 years, including DeepCoder, Terpret, generative models of code, and natural language + code modeling. In addition, he has made a number of contributions to core machine learning, including early work on the revival of graph neural networks and a Best Paper Award at NeurIPS for work on sampling. Over the last six years, he has been at Google working on making AI-powered developer tools useful in real software development processes. He holds a PhD in Machine Learning from the University of Toronto and was previously a Researcher at Microsoft Research Cambridge and a Research Fellow at the University of Cambridge.
Journal-first Papers
Wed 13 Sep 2023 10:30 - 10:42 at Room C - Program Repair 1 Chair(s): Arie van Deursen Delft University of Technology
Tool Demonstrations
Wed 13 Sep 2023 10:42 - 10:54 at Room C - Program Repair 1 Chair(s): Arie van Deursen Delft University of Technology
As software systems grow larger and more complex, debugging takes up an increasingly significant portion of developers’ time and efforts during software maintenance. To aid software engineers in debugging, many automated debugging and repair techniques have been proposed. Both the development and evaluation of these automated techniques depend on benchmarks of bugs. While many different defect benchmarks have been developed, only a few benchmarks are widely used due to the origin of the collected bugs as well as the usability of the benchmarks themselves, risking a biased research landscape. This paper presents BugsC++, a new benchmark that contains 209 real-world bugs collected from 22 open-source C/C++ projects. BugsC++ aims to provide high usability by providing a similar user interface to the widely used Defects4J. Further, BugsC++ ensures the replicability of the bugs in its collection by encapsulating each buggy program in a Docker container. By providing a highly usable real-world defect benchmark for C/C++, we hope to promote debugging research for C/C++.
NIER Track
Wed 13 Sep 2023 10:54 - 11:06 at Room C - Program Repair 1 Chair(s): Arie van Deursen Delft University of Technology
Large Language Models (LLMs) can be induced to solve non-trivial problems with “few-shot” prompts including illustrative problem-solution examples. Now if the few-shots also include “chain of thought” (CoT) explanations, which are of the form problem-explanation-solution, LLMs will generate an “explained” solution, and perform even better. Recently an exciting, substantially better technique, self-consistency, has emerged, based on the intuition that there are many plausible explanations for the right solution; when the LLM is sampled repeatedly to generate a pool of explanation-solution pairs for a given problem, the most frequently occurring solutions in the pool (ignoring the explanations) tend to be even more likely to be correct! Unfortunately, the use of this highly-performant S-C (or even CoT) approach in software engineering settings is hampered by the lack of explanations; most software datasets lack explanations. In this paper, we describe an application of the S-C approach to program repair, using the commit log on the fix as the explanation, only in the illustrative few-shots. We achieve state-of-the-art results, beating previous approaches to prompting-based program repair on the MODIT dataset; we also find evidence suggesting that the correct commit messages are helping the LLM learn to produce better patches.
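The self-consistency step itself is simple to sketch: sample several explanation-plus-patch completions and keep the patch that occurs most often, ignoring the explanations. The code below is our illustration, not the paper's implementation; `llm_sample` is an assumed sampling callback, and the "PATCH:" marker convention is an assumption about how the prompt asks the model to format its answer.

```python
from collections import Counter

def extract_patch(completion: str) -> str:
    """Assumes the model was prompted to emit the patch after a 'PATCH:' marker."""
    return completion.split("PATCH:", 1)[-1].strip()

def normalize(patch: str) -> str:
    # Collapse whitespace so trivially different renderings of the same patch agree.
    return " ".join(patch.split())

def self_consistent_patch(llm_sample, prompt, n_samples=20):
    """Self-consistency for program repair: sample several explanation+patch
    completions and return the most frequent patch, ignoring the explanations."""
    votes = Counter()
    for _ in range(n_samples):
        completion = llm_sample(prompt, temperature=0.8)   # diverse samples
        votes[normalize(extract_patch(completion))] += 1   # drop the explanation part
    best, _count = votes.most_common(1)[0]
    return best
```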
Research Papers
Wed 13 Sep 2023 11:06 - 11:18 at Room C - Program Repair 1 Chair(s): Arie van Deursen Delft University of Technology
Research Papers
Wed 13 Sep 2023 11:18 - 11:30 at Room C - Program Repair 1 Chair(s): Arie van Deursen Delft University of Technology
Tool Demonstrations
Wed 13 Sep 2023 11:30 - 11:42 at Room C - Program Repair 1 Chair(s): Arie van Deursen Delft University of Technology
Automated program repair (APR) approaches suffer from long patch validation time, which limits their practical application and receives relatively little attention. The patch validation process repeatedly executes tests to filter patches, and has been recognized as the dual of mutation analysis. We systematically investigate existing mutation testing techniques and recognize five families of acceleration techniques that are suitable for patch validation, two of which have never been adapted to a general-purpose patch validator. We implement and demonstrate ExpressAPR, the first framework that combines all five families of acceleration techniques for patch validation. In our evaluation on 30 random Defects4J bugs and four APR systems, ExpressAPR accelerates patch validation by two orders of magnitude over plain validation, or one order of magnitude over the state-of-the-art approach, benefiting APR researchers and users with a much shorter patch validation time.
Demo video available at https://youtu.be/7AB-4VvBuuM
Tool repo (source code + Docker image + evaluation dataset) available at https://github.com/ExpressAPR/ExpressAPR
Tool Demonstrations
Wed 13 Sep 2023 13:30 - 13:42 at Room C - Program Verification 1 Chair(s): Nicolás Rosner Amazon Web Services
Software verification is challenging, and auxiliary program invariants are used to improve the effectiveness of verification approaches. For instance, the k-induction implementation in CPAchecker, an award-winning framework for program analysis, uses invariants produced by a configurable data-flow analysis to strengthen induction hypotheses. This invariant generator, CPA-DF, uses arithmetic expressions over intervals as its abstract domain and is able to prove some safe verification tasks alone. After extensively evaluating CPA-DF on SV-Benchmarks, the largest publicly available suite of C safety-verification tasks, we discover that its potential as a stand-alone analysis or a sub-analysis in a parallel portfolio for combined verification approaches has been significantly underestimated: (1) As a stand-alone analysis, CPA-DF finds almost as many proofs as the plain k-induction implementation without auxiliary invariants. (2) As a sub-analysis running in parallel to the plain k-induction implementation, CPA-DF boosts the portfolio verifier to solve a comparable number of tasks as the heavily-optimized k-induction implementation with invariant injection. Our detailed analysis reveals that dynamic precision adjustment is crucial to the efficiency and effectiveness of CPA-DF. To generalize our results beyond CPAchecker, we use CoVeriTeam, a platform for cooperative verification, to compose three portfolio verifiers that execute CPA-DF and three other software verifiers in parallel, respectively. Surprisingly, running CPA-DF merely in parallel to these state-of-the-art tools further boosts the number of correct results by up to more than 20%.
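For intuition about the kind of abstract domain CPA-DF works with, here is a generic textbook sketch of an interval domain with the join and widening operations that such data-flow analyses use to reach a fixpoint; it is not CPAchecker code.

```python
from dataclasses import dataclass
import math

@dataclass(frozen=True)
class Interval:
    """Tiny interval abstract domain of the kind used by data-flow invariant generators."""
    lo: float
    hi: float

    def join(self, other: "Interval") -> "Interval":
        # Least upper bound: the smallest interval covering both operands.
        return Interval(min(self.lo, other.lo), max(self.hi, other.hi))

    def widen(self, other: "Interval") -> "Interval":
        # Classic widening: jump to infinity on unstable bounds to force convergence.
        lo = self.lo if other.lo >= self.lo else -math.inf
        hi = self.hi if other.hi <= self.hi else math.inf
        return Interval(lo, hi)

# For `x = 0; while (...) x = x + 1;`: start with Interval(0, 0); one loop
# iteration yields Interval(1, 1); the join is Interval(0, 1); and
# Interval(0, 0).widen(Interval(0, 1)) == Interval(0, math.inf),
# a sound loop invariant 0 <= x.
```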
Research Papers
Wed 13 Sep 2023 13:42 - 13:54 at Room C - Program Verification 1 Chair(s): Nicolás Rosner Amazon Web Services
NIER Track
Wed 13 Sep 2023 13:54 - 14:06 at Room C - Program Verification 1 Chair(s): Nicolás Rosner Amazon Web Services
Research Papers
Wed 13 Sep 2023 14:06 - 14:18 at Room C - Program Verification 1 Chair(s): Nicolás Rosner Amazon Web Services
Detecting non-termination is crucial for ensuring program correctness and security, such as preventing denial-of-service attacks. While termination analysis has been studied for many years, existing methods have limited scalability and are only effective on small programs. To address this issue, we propose a practical termination checking technique, called EndWatch, for detecting non-termination through testing. Specifically, we introduce two methods to generate non-termination oracles based on checking state revisits, i.e., if the program returns to a previously visited state at the same program location, it does not terminate. The non-termination oracles can be incorporated into testing tools (e.g., AFL used in this paper) to detect non-termination in large programs. For linear loops, we perform symbolic execution on individual loops to infer State Revisit Conditions (SRC) and instrument SRC into target loops. For non-linear loops, we instrument target loops for checking concrete state revisits during execution. We evaluated EndWatch on standard benchmarks with small-sized programs and real-world projects with large-sized programs. The evaluation results show that EndWatch is more effective than the state-of-the-art tools on standard benchmarks (detecting 87% of non-terminating programs while the best baseline detects only 67%), and useful in detecting non-termination in real-world projects (detecting 90% of known non-termination CVEs and 4 unknown bugs).
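The concrete state-revisit oracle is easy to illustrate: record the program state each time execution reaches a given location, and report non-termination as soon as a state repeats. The sketch below models one loop iteration as a callable over a dictionary of variable values; it conveys the oracle's idea only and is not EndWatch's instrumentation.

```python
def check_state_revisit(loop_body, initial_state, max_iters=100_000):
    """Concrete state-revisit oracle: if execution returns to a previously seen
    state at the same location (here, the loop head), the loop cannot terminate.
    `loop_body` is an assumed callable returning (new_state, loop_finished)."""
    seen = {frozenset(initial_state.items())}
    state = dict(initial_state)
    for _ in range(max_iters):
        state, done = loop_body(state)
        if done:
            return "terminated", state
        key = frozenset(state.items())
        if key in seen:                      # same location + same state => non-termination
            return "non-termination detected", state
        seen.add(key)
    return "budget exhausted", state

# Example: for `while (x != 0) x = (x + 2) % 4;` started at x = 1, the states
# cycle 1 -> 3 -> 1, so the revisit of x == 1 flags non-termination.
# body = lambda s: ({"x": (s["x"] + 2) % 4}, s["x"] == 0)
# check_state_revisit(body, {"x": 1})  ->  ("non-termination detected", {"x": 1})
```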
Research Papers
Wed 13 Sep 2023 14:18 - 14:30 at Room C - Program Verification 1 Chair(s): Nicolás Rosner Amazon Web Services
Two-player games are a fruitful way to represent and reason about several important synthesis tasks. These tasks include controller synthesis (where one asks for a controller for a given plant such that the controlled plant satisfies a given temporal specification), program repair (setting values of variables to avoid exceptions), and synchronization synthesis (adding lock/unlock statements in multi-threaded programs to satisfy safety assertions). In all these applications, a solution directly corresponds to a winning strategy for one of the players in the induced game. In turn, logically-specified games offer a powerful way to model these tasks for large or infinite-state systems. Most of the techniques proposed for solving such games rely on abstraction-refinement or template-based solutions. In this paper, we show how to apply classical fixpoint algorithms, that have hitherto been used in explicit, finite-state settings, to a symbolic logical setting. We implement our techniques in a tool called GenSys-LTL and show that they are not only effective in synthesizing valid controllers for a variety of challenging benchmarks from the literature, but often compute maximal winning regions and maximally-permissive controllers. We achieve 46.38X speed-up over the state of the art and also scale well for non-trivial LTL specifications.
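The classical fixpoint algorithm the paper lifts to the symbolic setting is the attractor computation on a game graph: iteratively add vertices from which the protagonist can force a visit to the target. Here is an explicit finite-state version for intuition (our sketch, not GenSys-LTL code).

```python
def attractor(vertices, edges, player0_vertices, target):
    """Fixpoint computation of the player-0 attractor of `target` in a finite
    two-player game graph: the set of vertices from which player 0 can force
    a visit to `target`."""
    succ = {v: set() for v in vertices}
    for u, v in edges:
        succ[u].add(v)
    attr = set(target)
    changed = True
    while changed:                        # iterate until the fixpoint is reached
        changed = False
        for v in vertices:
            if v in attr:
                continue
            if v in player0_vertices:     # player 0 needs only one good successor
                ok = bool(succ[v] & attr)
            else:                         # player 1 vertices: all successors must be good
                ok = bool(succ[v]) and succ[v] <= attr
            if ok:
                attr.add(v)
                changed = True
    return attr

# vertices = {0, 1, 2, 3}; edges = {(0, 1), (1, 2), (1, 3), (3, 3), (2, 2)}
# attractor(vertices, edges, player0_vertices={0, 1}, target={2}) == {0, 1, 2}
```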
NIER Track
Wed 13 Sep 2023 14:30 - 14:42 at Room C - Program Verification 1 Chair(s): Nicolás Rosner Amazon Web Services
Research Papers
Wed 13 Sep 2023 15:30 - 15:42 at Room C - Software Testing for Specialized Systems 1 Chair(s): Fabrizio Pastore University of Luxembourg
Research Papers
Wed 13 Sep 2023 15:42 - 15:54 at Room C - Software Testing for Specialized Systems 1 Chair(s): Fabrizio Pastore University of Luxembourg
Embedded Network Stacks (ENS) enable low-resource devices to communicate with the outside world, facilitating the development of the Internet of Things and Cyber-Physical Systems. Some defects in ENS are thus high-severity cybersecurity vulnerabilities: they are remotely triggerable and can impact the physical world. While prior research has shed light on the characteristics of defects in many classes of software systems, no study has described the properties of ENS defects nor identified a systematic technique to expose them. The most common automated approach to detecting ENS defects is feedback-driven randomized dynamic analysis (“fuzzing”), a costly and unpredictable technique.
This paper provides the first systematic characterization of cybersecurity vulnerabilities in ENS. We analyzed 61 vulnerabilities across 6 open-source ENS. Most of these ENS defects are concentrated in the transport and network layers of the network stack, require reaching different states in the network protocol, and can be triggered by only 1-2 modifications to a single packet. We therefore propose a novel systematic testing framework that focuses on the transport and network layers, uses seeds that cover a network protocol’s states, and systematically modifies packet fields. We evaluated this framework on 4 ENS and replicated 12 of the 14 reported IP/TCP/UDP vulnerabilities. On recent versions of these ENS, it discovered 7 novel defects (6 assigned CVEs) during a bounded systematic test that covered all protocol states and made up to 3 modifications per packet. We found defects in 3 of the 4 ENS we tested that had not been found by prior fuzzing research. Our results suggest that fuzzing should be deferred until after systematic testing is employed.
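The bounded systematic test described above (up to a few field modifications per state-covering seed packet) can be pictured with a small enumerator. The field names and values below are illustrative assumptions rather than the framework's actual configuration, and the `send_to_ens_under_test` harness is hypothetical.

```python
import itertools

# Header fields and example mutation values; in a real harness these would be the
# IP/TCP/UDP fields the framework targets.
FIELDS = {
    "ttl": [0, 1, 255],
    "flags": [0x00, 0x3F],
    "window": [0, 1, 65535],
    "data_offset": [0, 4, 15],
}

def systematic_modifications(base_packet: dict, max_changes: int = 3):
    """Enumerate all packets derived from a seed by modifying up to `max_changes`
    header fields, sketched generically rather than with a packet library."""
    for k in range(1, max_changes + 1):
        for field_combo in itertools.combinations(FIELDS, k):
            value_choices = [FIELDS[f] for f in field_combo]
            for values in itertools.product(*value_choices):
                mutant = dict(base_packet)
                mutant.update(zip(field_combo, values))
                yield mutant

# seed = {"ttl": 64, "flags": 0x18, "window": 29200, "data_offset": 5}
# for pkt in systematic_modifications(seed, max_changes=2):
#     send_to_ens_under_test(pkt)      # assumed test harness
```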
Research Papers
Wed 13 Sep 2023 15:54 - 16:06 at Room C - Software Testing for Specialized Systems 1 Chair(s): Fabrizio Pastore University of Luxembourg
Journal-first Papers
Wed 13 Sep 2023 16:06 - 16:18 at Room C - Software Testing for Specialized Systems 1 Chair(s): Fabrizio Pastore University of Luxembourg
Continuous integration (CI) is widely adopted in software projects to reduce the time it takes to deliver changes to the market. To ensure software quality, developers also run regression test cases in a continuous fashion. The CI practice generates commit-by-commit software evolution data that provides great opportunities for future testing research. However, such data is often unavailable due to space limitations (e.g., developers only keep the data for a certain period) and the significant effort involved in rerunning the test cases on a per-commit basis. In this paper, we present T-Evos, a dataset on test result and coverage evolution, covering 8,093 commits across 12 open-source Java projects. Our dataset includes the evolution of statement-level code coverage for every test case (whether passed or failed), test results, all build information, code changes, and the corresponding bug reports. We conduct an initial analysis to demonstrate the overall dataset. In addition, we conduct an empirical study using T-Evos to study the characteristics of test failures in CI settings. We find that test failures are frequent, and while most failures are resolved within a day, some failures require several weeks to resolve. We highlight the relationship between code changes and test failures, and provide insights for future automated testing research. Our dataset may be used for future testing research and benchmarking in CI. Our findings provide an important first step in understanding code coverage evolution and test failures in a continuous environment.
Research Papers
Wed 13 Sep 2023 16:18 - 16:30 at Room C - Software Testing for Specialized Systems 1 Chair(s): Fabrizio Pastore University of Luxembourg
NIER Track
Wed 13 Sep 2023 16:30 - 16:42 at Room C - Software Testing for Specialized Systems 1 Chair(s): Fabrizio Pastore University of Luxembourg
Most Influential Papers (MIP)
Thu 14 Sep 2023 08:30 - 08:45 at Room C - MIP awards and ASE 2024 Chair(s): Massimiliano Di Penta University of Sannio, Italy
Keynotes
Thu 14 Sep 2023 08:45 - 08:57 at Room C - MIP awards and ASE 2024 Chair(s): Massimiliano Di Penta University of Sannio, Italy
Bio: Dan Hao is a Professor in the School of Computer Science at Peking University. Her main research interests are software testing and debugging, and her work has won several awards for its impact in academia and industry. Dan Hao is an ACM Distinguished Member. She has been Program Committee Co-Chair of a number of conferences, e.g., ASE 2021, SANER 2022, and ICST 2023. She serves on the editorial boards of several international journals (IEEE-TSE, ACM-TOSEM, CSUR, EMSE, and ASEJ) and the program committees of numerous international software engineering conferences.
Research Papers
Thu 14 Sep 2023 10:30 - 10:42 at Room C - Software Testing for Specialized Systems 2 Chair(s): Zishuo Ding University of Waterloo
This experience paper describes thirteen considerations for implementing machine learning software defect prediction (ML SDP) in vivo. Specifically, we report on the most important observations and lessons learned during a large-scale research effort and the introduction of ML SDP to the system-level testing quality assurance process of one of the leading telecommunication vendors in the world, Nokia. We adhere to a holistic and logical progression based on the principles of the business analysis body of knowledge: from identifying the need and setting requirements, through designing and implementing the solution, to profitability analysis, stakeholder management, and handover. For many years, industry adoption has not kept pace with academic achievements in the field, despite the promising potential to improve quality and decrease the cost of software products for many companies worldwide. The discussed considerations will therefore hopefully help researchers and practitioners bridge the gaps between academia and industry.
Research Papers
Thu 14 Sep 2023 10:42 - 10:54 at Room C - Software Testing for Specialized Systems 2 Chair(s): Zishuo Ding University of Waterloo
Research Papers
Thu 14 Sep 2023 10:54 - 11:06 at Room C - Software Testing for Specialized Systems 2 Chair(s): Zishuo Ding University of Waterloo
Tool Demonstrations
Thu 14 Sep 2023 11:06 - 11:18 at Room C - Software Testing for Specialized Systems 2 Chair(s): Zishuo Ding University of Waterloo
We present Provengo, a comprehensive suite of tools designed to facilitate the implementation of Scenario-Driven Model-Based Testing (SDMBT), an innovative approach that utilizes scenarios to construct a model encompassing the user’s perspective and the system’s business value, while also defining the desired outcomes. With the assistance of Provengo, testers gain the ability to effortlessly create natural user stories and seamlessly integrate them into a model capable of generating effective tests. The demonstration illustrates how SDMBT effectively addresses the bootstrapping challenge commonly encountered in model-based testing (MBT) by enabling incremental development, starting from simple models and gradually augmenting them with additional stories.
Research Papers
Thu 14 Sep 2023 11:18 - 11:30 at Room C - Software Testing for Specialized Systems 2 Chair(s): Zishuo Ding University of Waterloo
Tool Demonstrations
Thu 14 Sep 2023 11:30 - 11:42 at Room C - Software Testing for Specialized Systems 2 Chair(s): Zishuo Ding University of Waterloo
With the increased developments in quantum computing, the availability of systematic and automatic testing approaches for quantum programs is becoming increasingly essential. To this end, we present the quantum software testing tool QuCAT for combinatorial testing of quantum programs. QuCAT provides two functionalities of use. With the first functionality, the tool generates a test suite of a given strength (e.g., pair-wise). With the second functionality, it generates test suites with increasing strength until a failure is triggered or a maximum strength is reached. QuCAT uses two test oracles to check the correctness of test outputs. We assess the cost and effectiveness of QuCAT with 3 faulty versions of 5 quantum programs. Results show that combinatorial test suites with a low strength can find faults with limited cost, while a higher strength performs better to trigger some difficult faults with relatively higher cost.
Repository: https://github.com/qiqihannah/QuCAT-Tool
Video: https://youtu.be/UsqgOudKLio
Research Papers
Thu 14 Sep 2023 11:42 - 11:54 at Room C - Software Testing for Specialized Systems 2 Chair(s): Zishuo Ding University of Waterloo
The widespread adoption of DNNs in NLP software has highlighted the need for robustness. Researchers have proposed various automatic testing techniques for adversarial test cases. However, existing methods suffer from two limitations: weak error-discovering capabilities, with success rates ranging from 0% to 24.6% for BERT-based NLP software, and time inefficiency, taking 177.8s to 205.28s per test case, making them challenging for time-constrained scenarios.
To address these issues, this paper proposes LEAP, an automated test method that uses LEvy flight-based Adaptive Particle swarm optimization integrated with textual features to generate adversarial test cases. Specifically, we adopt Levy flight for population initialization to increase the diversity of generated test cases. We also design an inertial weight adaptive update operator to improve the efficiency of LEAP’s global optimization of high-dimensional text examples and a mutation operator based on the greedy strategy to reduce the search time.
We conducted a series of experiments to validate LEAP’s ability to test NLP software and found that the average success rate of LEAP in generating adversarial test cases is 79.1%, which is 6.1% higher than the next best approach (PSOattack). While ensuring high success rates, LEAP significantly reduces time overhead by up to 147.6s compared to other heuristic-based methods. Additionally, the experimental results demonstrate that LEAP can generate more transferable test cases and significantly enhance the robustness of DNN-based systems.
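The Lévy-flight ingredient used for population initialization is a standard heavy-tailed step generator; the textbook sketch below (Mantegna's algorithm) illustrates the building block LEAP draws on and is not LEAP's own code.

```python
import math
import random

def levy_step(beta: float = 1.5) -> float:
    """One Lévy-flight step length via Mantegna's algorithm, commonly used to draw
    heavy-tailed steps in swarm optimizers."""
    sigma_u = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2) /
               (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = random.gauss(0.0, sigma_u)
    v = random.gauss(0.0, 1.0)
    return u / abs(v) ** (1 / beta)

def levy_initialize(base_position, scale=0.1):
    """Scatter an initial particle around a seed position with Lévy-distributed offsets,
    which occasionally produces long jumps and hence a more diverse initial population."""
    return [x + scale * levy_step() for x in base_position]
```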
Deep neural networks (DNNs) are susceptible to bugs, just like other types of software systems. The significant uptick in the use of DNNs, and their application in wide-ranging areas including safety-critical systems, warrants extensive research on software engineering tools for improving the reliability of DNN-based systems. One such tool that has gained significant attention in recent years is DNN fault localization. This paper revisits mutation-based fault localization in the context of DNN models and proposes a novel technique, named deepmufl, applicable to a wide range of DNN models. We have implemented deepmufl and have evaluated its effectiveness using 109 bugs obtained from StackOverflow. Our results show that deepmufl detects 53/109 of the bugs by ranking the buggy layer in the top-1 position, outperforming state-of-the-art static and dynamic DNN fault localization systems that are also designed to target the class of bugs supported by deepmufl. Moreover, we observed that we can halve the fault localization time for a pre-trained model using mutation selection, while losing only 7.55% of the bugs localized in the top-1 position.
When deploying Deep Neural Networks (DNNs), developers often convert models from one deep learning framework to another (e.g., TensorFlow to PyTorch). However, this process is error-prone and can impact target model accuracy. To identify the extent of such impact, we perform and briefly present a differential analysis against three DNNs widely used for image recognition (MobileNetV2, ResNet101, and InceptionV3) converted across four well-known deep learning frameworks (PyTorch, Keras, TensorFlow (TF), and TFLite), which revealed numerous model crashes and output label discrepancies of up to 72%. To mitigate such errors, we present a novel approach towards fault localization and repair of buggy deep learning framework conversions, focusing on pre-trained image recognition models. Our technique consists of four stages of analysis: 1) conversion tools, 2) model parameters, 3) model hyperparameters, and 4) graph representation. In addition, we propose various strategies towards fault repair of the faults detected. We implement our technique on top of the Apache TVM deep learning compiler, and we test it by conducting a preliminary fault localization analysis for the conversion of InceptionV3 from TF to TFLite. Our approach detected a fault in a common DNN converter tool, which introduced precision errors in weights, reducing model accuracy. After our fault localization, we repaired the issue, reducing our conversion error to zero.
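The differential signal such an analysis rests on is straightforward to compute once outputs and weights have been collected from both frameworks: compare top-1 labels and element-wise weight deviations. The helpers below are a generic sketch; the data collection from each framework is omitted and assumed to produce NumPy arrays.

```python
import numpy as np

def label_discrepancy_rate(logits_source: np.ndarray, logits_target: np.ndarray) -> float:
    """Fraction of inputs for which the converted model's top-1 label differs from the
    original model's. Inputs are (n_samples, n_classes) arrays of model outputs."""
    src_labels = logits_source.argmax(axis=1)
    tgt_labels = logits_target.argmax(axis=1)
    return float(np.mean(src_labels != tgt_labels))

def max_weight_deviation(weights_source, weights_target) -> float:
    """Largest element-wise deviation between corresponding weight tensors, useful for
    localizing precision errors introduced by a converter."""
    return max(float(np.max(np.abs(np.asarray(a) - np.asarray(b))))
               for a, b in zip(weights_source, weights_target))
```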
Efficiency is essential to support responsiveness w.r.t. ever-growing datasets, especially for Deep Learning (DL) systems. DL frameworks have traditionally embraced deferred execution-style DL code—supporting symbolic, graph-based Deep Neural Network (DNN) computation. While scalable, such development is error-prone, non-intuitive, and difficult to debug. Consequently, more natural, imperative DL frameworks encouraging eager execution have emerged at the expense of run-time performance. Though hybrid approaches aim for the “best of both worlds,” using them effectively requires subtle considerations to make code amenable to safe, accurate, and efficient graph execution. We present our ongoing work on automated refactoring that assists developers in specifying whether and how their otherwise eagerly-executed imperative DL code could be reliably and efficiently executed as graphs while preserving semantics. The approach, based on a novel imperative tensor analysis, will automatically determine when it is safe and potentially advantageous to migrate imperative DL code to graph execution and modify decorator parameters or eagerly executing code already running as graphs. The approach is being implemented as a PyDev Eclipse IDE plug-in and uses the WALA Ariadne analysis framework. We discuss our ongoing work towards optimizing imperative DL code to its full potential.
Unsupervised learning systems using clustering have gained significant attention for numerous applications due to their unique ability to discover patterns and structures in large unlabeled datasets. However, their effectiveness highly depends on their configuration, which requires domain-specific expertise and often involves numerous manual trials. Specifically, selecting appropriate algorithms and hyperparameters adds to the complexity of the configuration process. In this paper, we propose, apply, and assess an automated approach (AutoConf) for configuring unsupervised learning systems using clustering, leveraging metamorphic testing and Bayesian optimization. Metamorphic testing is utilized to verify the configurations of unsupervised learning systems by applying a series of input transformations. We use Bayesian optimization guided by metamorphic-testing output to automatically identify the optimal configuration. The approach aims to streamline the configuration process and enhance the effectiveness of unsupervised learning systems. It has been evaluated through experiments on six datasets from three domains for anomaly detection. The evaluation results show that our approach can find configurations outperforming the baseline approaches as they achieved a recall of 0.89 and a precision of 0.84 (on average).
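To give a feel for the metamorphic-testing ingredient, the sketch below checks one simple relation for a clustering configuration: permuting the rows of the dataset should not change the induced partition. The relation, the use of KMeans, and the adjusted-Rand comparison are our illustrative choices; AutoConf's actual relations and its Bayesian optimization loop are not reproduced.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def permutation_metamorphic_check(X: np.ndarray, n_clusters: int, seed: int = 0) -> float:
    """Metamorphic relation: permuting the rows of X should not change the partition.
    Returns the adjusted Rand index between the two runs (close to 1.0 when the
    configuration is stable under this transformation)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))
    labels_original = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    labels_permuted = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X[perm])
    # Undo the permutation before comparing the two partitions.
    labels_permuted_aligned = np.empty_like(labels_permuted)
    labels_permuted_aligned[perm] = labels_permuted
    return adjusted_rand_score(labels_original, labels_permuted_aligned)
```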
Deep reinforcement learning (DRL) is increasingly applied in large-scale productions like Netflix and Facebook. As with most data-driven systems, DRL systems can exhibit undesirable behaviors due to environmental drifts, which often occur in constantly-changing production settings. Continual Learning (CL) is the inherent self-healing approach for adapting the DRL agent in response to the environment’s conditions shifts. However, successive shifts of considerable magnitude may cause the production environment to drift from its original state. Recent studies have shown that these environmental drifts tend to drive CL into long, or even unsuccessful, healing cycles, which arise from inefficiencies such as catastrophic forgetting, warm-starting failure, and slow convergence. In this paper, we propose Dr. DRL, an effective self-healing approach for DRL systems that integrates a novel mechanism of intentional forgetting into vanilla CL to overcome its main issues. Dr. DRL deliberately erases the DRL system’s minor behaviors to systematically prioritize the adaptation of the key problem-solving skills. Using well-established DRL algorithms, Dr. DRL is compared with vanilla CL on various drifted environments. Dr. DRL is able to reduce, on average, the healing time and fine-tuning episodes by, respectively, 18.74% and 17.72%. Dr. DRL successfully helps agents to adapt to 19.63% of drifted environments left unsolved by vanilla CL while maintaining and even enhancing by up to 45% the obtained rewards for drifted environments that are resolved by both approaches.
Research Papers
Thu 14 Sep 2023 15:30 - 15:42 at Room C - Code Generation 3 Chair(s): David Lo Singapore Management University
Research Papers
Thu 14 Sep 2023 15:42 - 15:54 at Room C - Code Generation 3 Chair(s): David Lo Singapore Management University
Research Papers
Thu 14 Sep 2023 15:54 - 16:06 at Room C - Code Generation 3 Chair(s): David Lo Singapore Management University
Software developers often struggle to update APIs, leading to manual, time-consuming, and error-prone processes. We introduce MELT, a new approach that generates lightweight API migration rules directly from pull requests in popular library repositories. Our key insight is that pull requests merged into open-source libraries are a rich source of information sufficient to mine API migration rules. By leveraging code examples mined from the library source and automatically generated code examples based on the pull requests, we infer transformation rules in comby, a language for structural code search and replace. Since inferred rules from single code examples may be too specific, we propose a generalization procedure to make the rules more applicable to client projects. MELT rules are syntax-driven, interpretable, and easily adaptable. Moreover, unlike previous work, our approach enables rule inference to seamlessly integrate into the library workflow, removing the need to wait for client code migrations. We evaluated MELT on pull requests from four popular libraries, successfully mining 461 migration rules from code examples in pull requests and 114 rules from auto-generated code examples. Our generalization procedure increases the number of matches for mined rules by 9x. We applied these rules to client projects and ran their tests, which led to an overall decrease in the number of warnings and fixed some test cases, demonstrating MELT’s effectiveness in real-world scenarios.
Research Papers
Thu 14 Sep 2023 16:06 - 16:18 at Room C - Code Generation 3 Chair(s): David Lo Singapore Management University
Journal-first Papers
Thu 14 Sep 2023 16:18 - 16:30 at Room C - Code Generation 3 Chair(s): David Lo Singapore Management University
In recent years, researchers have created and introduced a significant number of various code generation models. As human evaluation of every new model version is unfeasible, the community adopted automatic evaluation metrics such as BLEU to approximate the results of human judgement. These metrics originate from the machine translation domain, and it is unclear whether they are applicable for code generation tasks and how well they agree with human evaluation on this task. There are also other metrics, CodeBLEU and RUBY, developed to estimate the similarity of code, that take into account the properties of source code. However, for these metrics there are hardly any studies on their agreement with human evaluation. Despite all that, minimal differences in the metric scores have been used in recent papers to claim superiority of some code generation models over others. In this paper, we present a study on the applicability of six metrics – BLEU, ROUGE-L, METEOR, ChrF, CodeBLEU, and RUBY – for evaluation of code generation models. We conduct a study on two different code generation datasets and use human annotators to assess the quality of all models run on these datasets. The results indicate that for the CoNaLa dataset of Python one-liners, none of the metrics can correctly emulate human judgement on which model is better with >95% certainty if the difference in model scores is less than 5 points. For the HearthStone dataset, which consists of classes of a particular structure, a difference in model scores of at least 2 points is enough to claim the superiority of one model over the other. Our findings suggest that the ChrF metric is a better fit for the evaluation of code generation models than the commonly used BLEU and CodeBLEU. Yet, finding a metric for code generation that closely agrees with humans requires additional work.
Research Papers
Thu 14 Sep 2023 16:30 - 16:42 at Room C - Code Generation 3 Chair(s): David Lo Singapore Management University