Journal-first Papers
Tue 12 Sep 2023 10:54 - 11:06 at Plenary Room 2 - Cloud and Distributed Systems 1
Software vulnerabilities can affect critical systems within an organization, impacting processes, workflows, privacy, and safety. When a software vulnerability becomes known, affected systems are at risk until appropriate updates become available and are eventually deployed. This period can last from a few days to several months, during which attackers can develop exploits and take advantage of the vulnerability. It is tedious and time-consuming to keep track of vulnerabilities manually and perform the necessary actions to shut down, update, or modify systems. Vulnerabilities affect system components, such as a web server, but sometimes only target specific versions or component combinations.
We propose a novel approach for automated mode switching of software systems to support system administrators in dealing with vulnerabilities and reducing the risk of exposure. We rely on model-driven techniques and use a multi-modal architecture to react to discovered vulnerabilities and provide automated contingency support. We have developed a dedicated domain-specific language to describe potential mitigations as mode switches. We have evaluated our approach with a web server case study, analyzing historical vulnerability data. Based on the sum of vulnerability scores, we demonstrated that switching to less vulnerable modes reduced the attack surface during 98.9% of the analyzed time.
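To make the mode-switch idea concrete, here is a minimal sketch in Python of how such a mitigation rule could look; the rule table, mode names, and react_to_cve helper are illustrative assumptions, not the paper's actual DSL.

    # Hypothetical mitigation rules: each rule maps an affected component and
    # version predicate to a less vulnerable mode (all names are made up).
    MODES = {
        "full": {"tls": "openssl", "http2": True},
        "hardened": {"tls": "libressl", "http2": False},  # reduced attack surface
    }

    RULES = [
        # (affected component, version predicate, target mode)
        ("openssl", lambda v: v.startswith("1.0."), "hardened"),
    ]

    def react_to_cve(component: str, version: str) -> str:
        """Return the mode to switch to when a vulnerability is reported."""
        for comp, affects, mode in RULES:
            if comp == component and affects(version):
                return mode
        return "full"  # no matching rule: stay in the default mode

    print(react_to_cve("openssl", "1.0.2k"))  # -> hardened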
A major threat to the reliability of distributed software systems is vicious cycles: an event in the system's execution causes a degradation, and the degradation, in turn, causes more such events. Vicious cycles often result in large-scale cloud outages that are hard to recover from due to their self-reinforcing nature.
This paper formally defines vicious cycles and conducts the first in-depth study of 33 real-world vicious cycles in 13 widely used open-source distributed software systems, shedding light on their root causes, triggering conditions, and fixing strategies, with over a dozen concrete implications for combating them. Our findings show that the majority of vicious cycles are caused by incorrect error handlers, where the handlers do not obtain enough information to distinguish between 1) an error induced by incoming requests and 2) an error induced by unexpected interference from another error handler.
This paper further performs a feasibility study by 1) building a monitoring tool that prevents one type of vicious cycle by collecting information to make a more informed decision in error handling, and 2) investigating the effectiveness of one commonly suggested practice – injecting exponential backoff – to prevent vicious cycles induced by unconstrained retry.
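The exponential backoff practice mentioned above is easy to sketch; this is a generic Python illustration (not the paper's tooling), where the retry delay grows geometrically and jitter de-synchronizes competing clients so retries cannot re-amplify the degradation that triggered them.

    import random
    import time

    def retry_with_backoff(operation, max_attempts=5, base=0.1, cap=10.0):
        """Run `operation`, retrying with capped exponential backoff and jitter."""
        for attempt in range(max_attempts):
            try:
                return operation()
            except Exception:
                if attempt == max_attempts - 1:
                    raise  # give up: constrained retries avoid unbounded load
                delay = min(cap, base * (2 ** attempt))
                time.sleep(delay * random.uniform(0.5, 1.5))  # jitter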
Tool Demonstrations
Tue 12 Sep 2023 13:30 - 13:42 at Plenary Room 2 - Cloud and Distributed Systems 2 Chair(s): Tim Menzies (North Carolina State University)
AIoT (Artificial Intelligence of Things), which integrates AI and IoT, has received rapidly growing interest from the software engineering community in recent years. It is crucial to design scalable, efficient, and reliable software solutions for large-scale AIoT systems in edge computing environments. However, the lack of effective service management in the edge, including support for service collaboration, AI applications, and data security, has seriously limited the development of AIoT systems. To close this gap, we propose EXPRESS 2.0, an intelligent service management framework for AIoT in the edge. Specifically, on top of the existing EXPRESS platform, EXPRESS 2.0 adds an intelligent service collaboration management module, an AI application management module, and a data security management module. To demonstrate the effectiveness of the framework, we design and implement a last-mile delivery system using both UAVs (Unmanned Aerial Vehicles) and UGVs (Unmanned Ground Vehicles). EXPRESS 2.0 is open-sourced at https://github.com/ISEC-AHU/EXPRESS2.0. A video demonstration of EXPRESS 2.0 is available at https://youtu.be/GHKD_VvJD88.
Research Papers
Tue 12 Sep 2023 13:42 - 13:54 at Plenary Room 2 - Cloud and Distributed Systems 2 Chair(s): Tim Menzies (North Carolina State University)
Journal-first Papers
Tue 12 Sep 2023 13:54 - 14:06 at Plenary Room 2 - Cloud and Distributed Systems 2 Chair(s): Tim Menzies (North Carolina State University)
Industry Showcase (Papers)
Tue 12 Sep 2023 14:06 - 14:18 at Plenary Room 2 - Cloud and Distributed Systems 2 Chair(s): Tim Menzies (North Carolina State University)
As a team from Alibaba Cloud, we have developed and open-sourced Apache RocketMQ, a cloud-native "messaging, eventing, streaming" real-time data processing platform that covers cloud-edge-device collaboration scenarios. During the development of RocketMQ, we formulated a log-based storage high availability paradigm that provides a high availability design solution for distributed log storage software used in industrial applications. The paradigm includes six essential components that enable the cluster to recover automatically. Our evaluation shows that this paradigm can achieve high availability, fast recovery, high throughput, and data loss prevention. We hope this paradigm will inspire and guide the development of high-availability solutions for all log-based storage systems.
Journal-first Papers
Tue 12 Sep 2023 14:18 - 14:30 at Plenary Room 2 - Cloud and Distributed Systems 2 Chair(s): Tim Menzies (North Carolina State University)
Research Papers
Tue 12 Sep 2023 14:30 - 14:42 at Plenary Room 2 - Cloud and Distributed Systems 2 Chair(s): Tim Menzies (North Carolina State University)
The prevalence and severity of software configuration-induced issues have driven the design and development of a number of detection and diagnosis techniques. Many of these techniques need to perform static taint analysis on configuration-related variables to analyze the data flow, control flow, and execution paths influenced by configuration options. However, existing taint analysis and static slicing tools are not suitable for configuration analysis due to the complex effects of configuration on program behaviors.
In this experience paper, we conducted an empirical study on the propagation policy of configuration options. We distilled four rules describing how configurations affect program behaviors, among which implicit data-flow and control-flow propagation are often ignored by existing tools. We report our experience designing and implementing a taint analysis infrastructure for configurations, ConfTainter. It supports various kinds of configuration analysis, e.g., explicit or implicit analysis of data and control flow. Based on this infrastructure, researchers and developers can easily implement analysis techniques for different configuration-related targets, e.g., misconfiguration detection. We evaluated the effectiveness of ConfTainter on 5 popular open-source systems. The results show that the accuracy of data- and control-flow analysis is 96.1% and 97.7%, and the recall is 94.2% and 95.5%, respectively. We also applied ConfTainter to two types of configuration-related tasks: misconfiguration detection and configuration-related bug detection. The results show that ConfTainter is highly applicable to configuration-related tasks with only a few lines of code.
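To illustrate the implicit propagation the study highlights (a hypothetical Python example, not ConfTainter's input format), note how the second option below influences behavior without its value ever being assigned onward:

    config = {"cache_size": 64, "use_compression": True}

    # Explicit data flow: the option's value is assigned onward, which classic
    # taint analysis tracks easily.
    buffer_size = config["cache_size"] * 1024

    # Implicit control flow: no assignment carries the option's value, yet the
    # branch decides which path runs; tools tracking only explicit data flow
    # miss that `codec` depends on `use_compression`.
    if config["use_compression"]:
        codec = "zlib"
    else:
        codec = "raw"
    print(buffer_size, codec)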
Research Papers
Tue 12 Sep 2023 15:30 - 15:42 at Plenary Room 2 - Code Generation 1 Chair(s): Kui Liu (Huawei)
Journal-first Papers
Tue 12 Sep 2023 15:42 - 15:54 at Plenary Room 2 - Code Generation 1 Chair(s): Kui Liu (Huawei)
Developers often perform repetitive code editing activities (up to 70%) for various reasons (e.g., code refactoring) during software development. Many deep learning (DL) models have been proposed to automate code editing by learning from the code editing history. Among DL-based models, pre-trained code editing models have achieved state-of-the-art (SOTA) results. Pre-trained models are first pre-trained with pre-training tasks and then fine-tuned on the code editing task. Existing pre-training tasks are mainly code infilling tasks (e.g., masked language modeling), which are derived from the natural language processing field and are not designed for automatic code editing.
In this paper, we propose a novel pre-training task specialized for code editing and present an effective pre-trained code editing model named CodeEditor. Compared to previous code infilling tasks, our pre-training task further improves the performance and generalization ability of code editing models. Specifically, we collect a large number of real-world code snippets as the ground truth and use a powerful generator to rewrite them into mutated versions. Then, we pre-train our CodeEditor to edit mutated versions back into the corresponding ground truth, so that it learns edit patterns. We conduct experiments on four code editing datasets and evaluate the pre-trained CodeEditor in three settings (i.e., fine-tuning, few-shot, and zero-shot). (1) In the fine-tuning setting, we train the pre-trained CodeEditor with four datasets and evaluate it on the test data. CodeEditor outperforms the SOTA baselines by 15%, 25.5%, 9.4%, and 26.6% on the four datasets. (2) In the few-shot setting, we train the pre-trained CodeEditor with limited data and evaluate it on the test data. CodeEditor substantially outperforms all baselines, even those fine-tuned with all data. (3) In the zero-shot setting, we evaluate the pre-trained CodeEditor on the test data without training. CodeEditor correctly edits 1,113 programs, while the SOTA baselines do not work in this setting. These results show the superiority of our pre-training task, and that the pre-trained CodeEditor is more effective in automatic code editing.
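A minimal sketch of how such edit-oriented pre-training pairs could be assembled (my illustration, not the authors' pipeline; mutate stands in for the paper's learned generator):

    import random

    def mutate(code: str) -> str:
        # Stand-in for the paper's generator model: a naive token swap.
        tokens = code.split()
        if len(tokens) > 1:
            i, j = random.sample(range(len(tokens)), 2)
            tokens[i], tokens[j] = tokens[j], tokens[i]
        return " ".join(tokens)

    def build_pretraining_pairs(snippets):
        # Each pair trains the model to edit a mutated version back into the
        # ground truth, i.e., to learn edit patterns rather than infilling.
        return [(mutate(s), s) for s in snippets]

    pairs = build_pretraining_pairs(["int sum = a + b ;", "return x * y ;"])
    print(pairs[0])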
Research Papers
Tue 12 Sep 2023 15:54 - 16:06 at Plenary Room 2 - Code Generation 1 Chair(s): Kui Liu (Huawei)
Automated code generation has been extensively studied in recent literature. In this work, we first survey 66 participants to motivate a more pragmatic code generation scenario, i.e., library-oriented code generation, where the generated code should implement the functionality of the natural language query with the given library. We then revisit existing learning-based code generation techniques and find that they have limited effectiveness in such a library-oriented code generation scenario. To address this limitation, we propose a novel library-oriented code generation technique, CodeGen4Libs, which incorporates two stages: import generation and code generation. The import generation stage generates import statements for the natural language query with the given third-party libraries, while the code generation stage generates concrete code based on the generated imports and the query. To evaluate the effectiveness of our approach, we conduct extensive experiments on a dataset of 403,780 data items. Our results demonstrate that CodeGen4Libs outperforms baseline models in both the import generation and code generation stages, achieving improvements of up to 97.4% on EM (Exact Match), 54.5% on BLEU, and 53.5% on Hit@All. Overall, our proposed CodeGen4Libs approach shows promising results in generating high-quality code with specific third-party libraries, which can improve the efficiency and effectiveness of software development.
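The two-stage decomposition can be sketched as follows; generate is a placeholder for any sequence-to-sequence model call and is not CodeGen4Libs' actual API:

    def generate(prompt: str) -> str:
        # Placeholder for a model invocation; canned output keeps this runnable.
        return "import pandas as pd" if "imports" in prompt else "df = pd.read_csv('data.csv')"

    def library_oriented_codegen(query: str, libraries: list) -> str:
        # Stage 1: generate import statements for the query and target libraries.
        imports = generate(f"Generate imports for: {query} using {', '.join(libraries)}")
        # Stage 2: condition concrete code generation on the generated imports.
        code = generate(f"{imports}\nTask: {query}\nGenerate code:")
        return imports + "\n" + code

    print(library_oriented_codegen("load a CSV file into a dataframe", ["pandas"]))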
Tool Demonstrations
Tue 12 Sep 2023 16:06 - 16:18 at Plenary Room 2 - Code Generation 1 Chair(s): Kui Liu (Huawei)
Writing code for Arduino poses unique challenges. A developer must 1) have hardware-specific knowledge about the interface configuration between the Arduino controller and the I/O hardware, 2) identify a suitable driver library for the I/O hardware, and 3) follow certain usage patterns of the driver library in order to use it properly. In this work, based on a study of real-world user queries posted in the Arduino forum, we propose ArduinoProg to address such challenges. ArduinoProg consists of three components, i.e., Library Retriever, Configuration Classifier, and Pattern Generator. Given a query, Library Retriever retrieves library names relevant to the I/O hardware identified from the query using vector-based similarity matching. Configuration Classifier predicts the interface configuration between the I/O hardware and the Arduino controller based on the method definitions of each library. Pattern Generator generates the usage pattern of a library using a sequence-to-sequence deep learning model. We have evaluated ArduinoProg using real-world queries, and our results show that its components can generate accurate and useful suggestions to guide developers in writing Arduino code.
Demo video: bit.ly/3Y3aeBe
Tool: https://huggingface.co/spaces/imamnurby/ArduinoProg
Code and data: https://github.com/imamnurby/ArduinoProg
Research Papers
Tue 12 Sep 2023 16:18 - 16:30 at Plenary Room 2 - Code Generation 1 Chair(s): Kui Liu (Huawei)
Research Papers
Wed 13 Sep 2023 10:30 - 10:42 at Plenary Room 2 - Code Quality and Code Smells Chair(s): Bernd Fischer (Stellenbosch University)
Research Papers
Wed 13 Sep 2023 10:42 - 10:54 at Plenary Room 2 - Code Quality and Code Smells Chair(s): Bernd Fischer (Stellenbosch University)
Deep learning has been widely adopted to tackle various code-based tasks by building deep code models based on large amounts of code snippets. While these deep code models have achieved great success, even state-of-the-art models suffer from noise present in inputs, leading to erroneous predictions. While it is possible to enhance models through retraining/fine-tuning, this is not a once-and-for-all approach and incurs significant overhead. In particular, these techniques cannot improve the performance of (deployed) models on the fly. There are input denoising techniques in other domains (such as image processing), but since code input is discrete and must strictly abide by complex syntactic and semantic constraints, denoising techniques from other fields are largely inapplicable. In this work, we propose the first input denoising technique (i.e., CodeDenoise) for deep code models. Its key idea is to localize noisy identifiers in (likely) mispredicted inputs and denoise such inputs by cleansing the located identifiers. It does not need to retrain or reconstruct the model; it only cleanses inputs on the fly to improve performance. Our experiments on 18 deep code models (i.e., three pre-trained models with six code-based datasets) demonstrate the effectiveness and efficiency of CodeDenoise. For example, on average, CodeDenoise successfully denoises 21.91% of mispredicted inputs and improves the original models by 2.04% in model accuracy across all subjects, spending an average of 0.48 seconds on each input, substantially outperforming the widely used fine-tuning strategy.
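The cleanse-and-repredict idea can be roughly sketched as follows (a hypothetical illustration, not CodeDenoise's algorithm; the model.confidence interface is an assumed wrapper):

    def denoise(code: str, identifiers, model):
        # `model` is an assumed object exposing a confidence score per input.
        best_code, best_conf = code, model.confidence(code)
        for i, ident in enumerate(identifiers):
            candidate = code.replace(ident, f"var{i}")  # naive identifier cleansing
            conf = model.confidence(candidate)
            if conf > best_conf:  # keep the variant the model trusts most
                best_code, best_conf = candidate, conf
        return best_code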
Research Papers
Wed 13 Sep 2023 10:54 - 11:06 at Plenary Room 2 - Code Quality and Code Smells Chair(s): Bernd Fischer (Stellenbosch University)
Reading source code occupies most of a developer's daily activity. Any maintenance and evolution task requires developers to read and understand the code they are going to modify. For this reason, previous research focused on defining techniques to automatically assess the readability of a given snippet. However, when many unreadable code sections are detected, developers might be required to manually modify them all to improve their readability. While existing approaches aim at solving specific readability-related issues, such as improving variable names or fixing styling issues, there is still no approach to automatically suggest which actions should be taken to improve code readability. In this paper, we define the first holistic readability-improving approach. As a first contribution, we introduce a methodology for automatically identifying readability-improving commits, and we use it to build a large dataset of 122k commits by mining the whole revision history of all the projects hosted on GitHub between 2015 and 2022. We show that such a methodology has ∼86% accuracy. As a second contribution, we train and test the T5 model to emulate what developers did to improve readability. We show that our model achieves a perfect prediction accuracy between 21% and 28%. A manual evaluation of 500 predictions shows that when the model does not change the behavior of the input and does apply changes (34% of the cases), in the large majority of those cases (79.4%) it improves code readability.
Research Papers
Wed 13 Sep 2023 11:06 - 11:18 at Plenary Room 2 - Code Quality and Code Smells Chair(s): Bernd Fischer (Stellenbosch University)
Upon evolving their software, organizations and individual developers have to spend substantial effort to pay back technical debt, i.e., the fact that software is released in a shape not as good as it should be, e.g., in terms of functionality, reliability, or maintainability. This paper empirically investigates the extent to which technical debt can be automatically paid back by neural-based generative models, and in particular models exploiting different strategies for pre-training and fine-tuning. We start by extracting a dataset of 5,039 Self-Admitted Technical Debt (SATD) removals from 595 open-source projects. SATD refers to technical debt instances documented by developers (e.g., via code comments). We use this dataset to experiment with seven different generative deep learning (DL) model configurations. Specifically, we compare transformers pre-trained and fine-tuned with different combinations of training objectives, including the fixing of generic code changes, SATD removals, and SATD-comment prompt tuning. We also investigate the applicability in this context of a recently available Large Language Model (LLM)-based chatbot. Results of our study indicate that the automated repayment of SATD is a challenging task, with the best model we experimented with able to automatically fix ~2% to 8% of test instances, depending on the number of attempts it is allowed to make. Given the limited size of the fine-tuning dataset (~5k instances), the model's pre-training plays a fundamental role in boosting performance. Also, the ability to remove SATD steadily drops if the comment documenting the SATD is not provided as input to the model. Finally, we found general-purpose LLMs not to be a competitive approach for addressing SATD.
Journal-first Papers
Wed 13 Sep 2023 11:18 - 11:30 at Plenary Room 2 - Code Quality and Code Smells Chair(s): Bernd Fischer (Stellenbosch University)
Automatically generated static code warnings suffer from a large number of false alarms, so developers act on only a small percentage of them. To better predict which static code warnings should not be ignored, we suggest that analysts need to look deeper into their algorithms to find choices that better fit the particulars of their specific problem. Specifically, we show here that effective predictors of such warnings can be created by methods that locally adjust the decision boundary (between actionable warnings and others). These methods yield a new high-water mark for recognizing actionable static code warnings. For eight open-source Java projects (cassandra, jmeter, commons, lucene-solr, maven, ant, tomcat, derby), we achieve perfect test results on 4/8 datasets and, overall, a median AUC (area under the true-negatives, true-positives curve) of 92%.
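As a rough illustration of a locally adjusted decision boundary (not the paper's exact method), a k-nearest-neighbors classifier decides each warning from its local neighborhood rather than a single global hyperplane; load_warning_features is a hypothetical loader:

    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_warning_features()  # hypothetical: warning features + actionable labels
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

    # The k-NN boundary bends locally with the data, unlike a global linear model.
    clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
    print("test accuracy:", clf.score(X_te, y_te))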
Tool Demonstrations
Wed 13 Sep 2023 11:30 - 11:42 at Plenary Room 2 - Code Quality and Code Smells Chair(s): Bernd Fischer (Stellenbosch University)
This paper presents GLITCH, a new technology-agnostic framework that enables automated polyglot code smell detection for Infrastructure as Code scripts. GLITCH uses an intermediate representation on which different code smell detectors can be defined. It currently supports the detection of nine security smells and nine design & implementation smells in scripts written in Ansible, Chef, Docker, Puppet, or Terraform. Studies conducted with GLITCH not only show that it can reduce the effort of writing code smell analyses for multiple IaC technologies, but also that it has higher precision and recall than current state-of-the-art tools. A video describing and demonstrating GLITCH is available at: https://youtu.be/E4RhCcZjWbk.
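A toy sketch of a detector over a technology-agnostic intermediate representation (not GLITCH's actual IR or API): once every IaC script is lowered to generic (key, value) attributes, a single rule covers all five technologies.

    def hardcoded_secret_smell(ir_attributes):
        """Flag attributes whose key suggests a secret and whose value is a literal."""
        suspicious = ("password", "secret", "token")
        return [(k, v) for k, v in ir_attributes
                if any(s in k.lower() for s in suspicious) and v]

    # Attributes as they might come from Ansible, Terraform, etc. (made-up sample).
    print(hardcoded_secret_smell([("db_password", "hunter2"), ("region", "eu-west-1")]))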
Journal-first Papers
Wed 13 Sep 2023 11:42 - 11:54 at Plenary Room 2 - Code Quality and Code Smells Chair(s): Bernd Fischer (Stellenbosch University)
Defect prediction can help prioritize testing tasks by, for instance, ranking a list of items (methods and classes) according to their likelihood of being defective. While many studies investigated how to predict the defectiveness of commits, methods, or classes separately, no study investigated how these predictions differ from or benefit each other. Specifically, at the end of a release, before the code is shipped to production, testing can be aided by ranking methods or classes, and we do not know which of the two approaches is more accurate. Moreover, every commit touches one or more methods in one or more classes; hence, the likelihood of a method and a class being defective can be associated with the likelihood of the touching commits being defective. Thus, it is reasonable to assume that the accuracy of method-defectiveness predictions (MDP) and class-defectiveness predictions (CDP) can be increased by leveraging commit-defectiveness predictions (aka JIT).
The contribution of this paper is fourfold: (i) we compare methods and classes in terms of defectiveness; (ii) we compare them in terms of accuracy in defectiveness prediction; (iii) we propose and evaluate a first, simple approach that leverages JIT to increase MDP accuracy; and (iv) we do the same for CDP accuracy.
We analyse accuracy using two types of metrics (threshold-independent and effort-aware). We also use feature selection metrics, nine machine learning defect prediction classifiers, and more than 2,000 defects related to 38 releases of nine open-source projects from the Apache ecosystem. Our results are based on a ground truth with a total of 285,139 data points and 46 features among commits, methods, and classes.
Our results show that leveraging JIT by using a simple median approach increases the accuracy of MDP by an average of 17% AUC and 46% PofB10 while it increases the accuracy of CDP by an average of 31% AUC and 38% PofB20.
From a practitioner’s perspective, it is better to predict and rank defective methods than defective classes. From a researcher’s perspective, there is a high potential for leveraging statement-defectiveness-prediction (SDP) to aid MDP and CDP.
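A minimal reading of the median-based leveraging, sketched in Python (the paper's exact aggregation and feature set may differ): a method's score combines its own prediction with the median JIT score of the commits touching it.

    from statistics import median

    def leverage_jit(base_scores, jit_scores, touching):
        """base_scores: method -> MDP score; jit_scores: commit -> JIT score;
        touching: method -> commits that touched it (all assumed given)."""
        combined = {}
        for method, base in base_scores.items():
            commits = touching.get(method, [])
            jit = median(jit_scores[c] for c in commits) if commits else 0.0
            combined[method] = (base + jit) / 2  # assumed combination, illustration only
        return combined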
Research Papers
Wed 13 Sep 2023 13:30 - 13:42 at Plenary Room 2 - Code Summarization Chair(s): Ray Buse (Google)
Commit message generation (CMG) is a challenging task in automated software engineering that aims to generate natural language descriptions of code changes for commits. Previous methods all start from the modified code snippets, outputting commit messages through template-based, retrieval-based, or learning-based models. While these methods can summarize what is modified from the perspective of code, they struggle to provide the reasons for the commit. The correlation between commits and issues, which could be a critical factor for generating rational commit messages, is still unexplored.
In this work, we delve into the correlation between commits and issues from the perspective of dataset and methodology. We construct the first dataset anchored on combining correlated commits and issues. The dataset consists of an unlabeled commit-issue parallel part and a labeled part in which each example is provided with human-annotated rational information in the issue. Furthermore, we propose ExGroFi (Extraction, Grounding, Fine-tuning), a novel paradigm that can introduce the correlation between commits and issues into the training phase of models. To evaluate whether it is effective, we perform comprehensive experiments with various state-of-the-art CMG models. The results show that compared with the original models, the performance of ExGroFi-enhanced models is significantly improved.
Research Papers
Wed 13 Sep 2023 13:42 - 13:54 at Plenary Room 2 - Code Summarization Chair(s): Ray Buse (Google)
Commit messages are crucial to software development, allowing developers to track changes and collaborate effectively. Despite their utility, most commit messages lack important information, since writing high-quality commit messages is tedious and time-consuming. The active research on commit message generation (CMG) has not yet led to wide adoption in practice. We argue that if we could shift the focus from commit message generation to commit message completion and use previous commit history as additional context, we could significantly improve the quality and the personal nature of the resulting commit messages.
In this paper, we propose and evaluate both of these novel ideas. Since the existing datasets lack historical data, we collect and share a novel dataset called CommitChronicle, containing 10.7M commits across 20 programming languages. We use this dataset to evaluate the completion setting and the usefulness of the historical context for state-of-the-art CMG models and GPT-3.5-turbo. Our results show that in some contexts, commit message completion shows better results than generation, and that while in general GPT-3.5-turbo performs worse, it shows potential for long and detailed messages. As for the history, the results show that historical information improves the performance of CMG models in the generation task, and the performance of GPT-3.5-turbo in both generation and completion.
Research Papers
Wed 13 Sep 2023 13:54 - 14:06 at Plenary Room 2 - Code Summarization Chair(s): Ray Buse (Google)
Research Papers
Wed 13 Sep 2023 14:06 - 14:18 at Plenary Room 2 - Code Summarization Chair(s): Ray Buse (Google)
Pre-trained models of source code have gained widespread popularity in many code intelligence tasks. Recently, with the scaling of model and corpus size, large language models have shown the ability of in-context learning (ICL). ICL employs task instructions and a few examples as demonstrations, and then inputs the demonstrations to the language model for making predictions. This new learning paradigm is training-free and has shown impressive performance in various natural language processing and code intelligence tasks. However, the performance of ICL heavily relies on the quality of demonstrations, e.g., the selected examples. It is important to systematically investigate how to construct a good demonstration for code-related tasks. In this paper, we empirically explore the impact of three key factors on the performance of ICL in code intelligence tasks: the selection, order, and number of demonstration examples. We conduct extensive experiments on three code intelligence tasks: code summarization, bug fixing, and program synthesis. Our experimental results demonstrate that all three factors dramatically impact the performance of ICL in code intelligence tasks. Additionally, we summarize our findings and provide takeaway suggestions on how to construct effective demonstrations from these three perspectives. We also show that a carefully designed demonstration based on our findings can lead to substantial improvements over widely used demonstration construction methods, e.g., improving code summarization, bug fixing, and program synthesis by at least 9.90% in BLEU-4, 175.96% in EM, and 50.81% in EM, respectively.
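A hedged sketch of demonstration construction covering the three studied factors, selection, order, and number (an illustration, not the paper's exact recipe; similarity is any caller-supplied scoring function):

    def build_icl_prompt(instruction, pool, query, similarity, k=3):
        # Selection: pick the k pool examples most similar to the query.
        ranked = sorted(pool, key=lambda ex: similarity(ex["input"], query), reverse=True)
        demos = ranked[:k]
        # Order: place the most similar demonstration closest to the query.
        demos.reverse()
        parts = [instruction]
        for ex in demos:  # Number: k controls how many demonstrations are used.
            parts.append(f"Input: {ex['input']}\nOutput: {ex['output']}")
        parts.append(f"Input: {query}\nOutput:")
        return "\n\n".join(parts)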
Research Papers
Wed 13 Sep 2023 14:18 - 14:30 at Plenary Room 2 - Code Summarization Chair(s): Ray Buse (Google)
Decompilation is a widely used process for reverse engineers to significantly enhance code readability by lifting assembly code to a higher-level C-like language, pseudo-code. Nevertheless, the process of compilation and stripping irreversibly discards high-level semantic information that is crucial to code comprehension, such as comments, identifier names, and types. Existing approaches typically recover only one type of information, making them suboptimal for semantic inference. In this paper, we treat pseudo-code as a special programming language, then present a unified pre-trained model, HexT5, that is trained on vast amounts of natural language comments, source identifiers, and pseudo-code using novel pseudo-code-based pre-training objectives. We fine-tune HexT5 on various downstream tasks, including code summarization, variable name recovery, function name recovery, and similarity detection. Comprehensive experiments show that HexT5 achieves state-of-the-art performance on four downstream tasks, and it demonstrates the robust effectiveness and generalizability of HexT5 for binary-related tasks.
Research Papers
Wed 13 Sep 2023 14:30 - 14:42 at Plenary Room 2 - Code Summarization Chair(s): Ray Buse (Google)
Tool Demonstrations
Wed 13 Sep 2023 15:30 - 15:42 at Plenary Room 2 - Code Generation 2 Chair(s): Marianne Huchard (LIRMM)
Learning effective representations of source code is critical for any Machine Learning for Software Engineering (ML4SE) system. Inspired by natural language processing, large language models (LLMs) like Codex and CodeGen treat code as generic sequences of text and are trained on huge corpora of code data, achieving state-of-the-art performance on several software engineering (SE) tasks. However, valid source code, unlike natural language, follows a strict structure and pattern governed by the underlying grammar of the programming language. Current LLMs do not exploit this property of source code, as they treat code like a sequence of tokens and overlook key structural and semantic properties that can be extracted from code-views such as the Control Flow Graph (CFG), Data Flow Graph (DFG), and Abstract Syntax Tree (AST). Unfortunately, the process of generating and integrating code-views for every programming language is cumbersome and time-consuming. To overcome this barrier, we propose our tool COMEX - a framework that allows researchers and developers to create and combine multiple code-views which can be used by machine learning (ML) models for various SE tasks. Some salient features of our tool are: (i) it works directly on source code (which need not be compilable), (ii) it currently supports Java and C#, (iii) it can analyze both method-level and program-level snippets using both intra-procedural and inter-procedural analysis, and (iv) it is easily extendable to other languages as it is built on tree-sitter - a widely used incremental parser that supports over 40 languages. We believe this easy-to-use code-view generation and customization tool will give impetus to research in source code representation learning methods and ML4SE. A demonstration of our tool can be found at https://youtu.be/GER6U87FVbU.
Research Papers
Wed 13 Sep 2023 15:42 - 15:55 at Plenary Room 2 - Code Generation 2 Chair(s): Marianne Huchard (LIRMM)
The performance of programming-by-example systems varies significantly across different tasks and even across different examples in one task. The key issue is that the search space depends on the given examples in a complex way. In particular, scalable synthesizers typically rely on a combination of machine learning to prioritize search order and deduction to prune search space, making it hard to quantitatively reason about how much an example speeds up the search. We propose a novel approach for quantifying the effectiveness of an example at reducing synthesis time. Based on this technique, we devise an algorithm that actively queries the user to obtain additional examples that significantly reduce synthesis time. We evaluate our approach on 30 challenging benchmarks across two different data science domains. Even with ineffective initial user-provided examples for pruning, our approach on average achieves a 6.0X speed-up in synthesis time compared to state-of-the-art synthesizers.
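The active-querying idea can be illustrated with a simple disagreement heuristic (a sketch, not the paper's quantitative measure): ask the user about the input on which the surviving candidate programs disagree most, since its answer prunes the most candidates.

    def pick_query(candidate_inputs, programs):
        # Programs are modeled as callables; more distinct outputs on an input
        # means answering that input eliminates more candidates.
        return max(candidate_inputs, key=lambda x: len({p(x) for p in programs}))

    def refine(programs, x, user_output):
        # Keep only programs consistent with the user's answer for input x.
        return [p for p in programs if p(x) == user_output]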
Research Papers
Wed 13 Sep 2023 15:55 - 16:08 at Plenary Room 2 - Code Generation 2 Chair(s): Marianne Huchard (LIRMM)
Research Papers
Wed 13 Sep 2023 16:08 - 16:21 at Plenary Room 2 - Code Generation 2 Chair(s): Marianne Huchard (LIRMM)
Python is a popular dynamic programming language, evidenced by its ranking as the second most commonly used language on GitHub. However, its dynamic type system can lead to type errors, which has led researchers to explore automatic type inference approaches for Python programs. Existing type inference approaches can generally be grouped into three categories, i.e., rule-based, supervised, and cloze-style approaches. Rule-based type inference approaches can ensure the accuracy of predicted variable types, but they suffer from low coverage caused by dynamic features and external calls. Supervised type inference approaches, while feature-agnostic and able to mitigate the low-coverage problem, require large, high-quality annotated datasets and are limited to pre-defined types. As zero-shot approaches, cloze-style approaches reformulate the type inference problem into a fill-in-the-blank problem by leveraging the general knowledge in powerful pre-trained code models. However, their performance is limited since they ignore the domain knowledge from static typing rules, which reflects the inference logic. What is more, their predictions are not interpretable, hindering developers' understanding and verification of the results.
This paper introduces TypeGen, a few-shot generative type inference approach that incorporates domain knowledge from static analysis. TypeGen creates chain-of-thought (COT) prompts by translating the type inference steps of static analysis into prompts based on type dependency graphs (TDGs), enabling language models to learn how static analysis infers types. By combining COT prompts with code slices and type hints, TypeGen constructs example prompts from human annotations. TypeGen only requires very few annotated examples to teach language models to generate similar COT prompts via in-context learning. Moreover, TypeGen enhances the interpretability of results through an input-explanation-output strategy, which generates both explanations and type predictions in COT prompts. Experiments show that TypeGen outperforms the best baseline, Type4Py, by 10.0% in argument type prediction and 22.5% in return value type prediction in terms of top-1 Exact Match using only five examples. Furthermore, TypeGen achieves substantial improvements of 27% to 84% over the zero-shot performance of large language models with parameter sizes ranging from 1.3B to 175B in terms of top-1 Exact Match.
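The shape of such a chain-of-thought prompt might look as follows; this is loosely modeled on the abstract's description (code slice, type hints, reasoning steps, prediction) and is not TypeGen's actual prompt format:

    example = """Code slice:
        def area(r):
            return 3.14 * r * r
    Type hints: the literal 3.14 is a float.
    Steps: r is multiplied with a float, so the return value is a float.
    Prediction: float"""

    query = """Code slice:
        def join_names(names):
            return ", ".join(names)
    Type hints: str.join returns a str.
    Steps:"""

    # The model completes the reasoning steps and emits both an explanation
    # and a type prediction (the input-explanation-output strategy).
    prompt = "Infer the return type.\n\n" + example + "\n\n" + query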
Research Papers
Wed 13 Sep 2023 16:21 - 16:34 at Plenary Room 2 - Code Generation 2 Chair(s): Marianne Huchard (LIRMM)
NIER Track
Wed 13 Sep 2023 16:34 - 16:47 at Plenary Room 2 - Code Generation 2 Chair(s): Marianne Huchard (LIRMM)
Quantum Intermediate Representation (QIR) is an LLVM-based intermediate representation developed by Microsoft for quantum program compilers. QIR is designed to offer a universal solution for quantum program compilers, decoupled from both front-end languages and back-end hardware, thereby eliminating the need for redundant development of intermediate representations and compilers. However, the lack of a formal definition and the reliance on natural language descriptions in the current state of QIR result in interpretational ambiguity and a dearth of rigor in implementing quantum functions. In this paper, we present formal definitions for QIR's data types and instruction sets to establish correctness and safety assurances for operations and intermediate code conversions within QIR. To demonstrate the effectiveness of our approach, we provide examples of unsafe QIR code where errors can be identified with our method.
Research Papers
Wed 13 Sep 2023 16:47 - 17:00 at Plenary Room 2 - Code Generation 2 Chair(s): Marianne Huchard (LIRMM)
Research Papers
Thu 14 Sep 2023 10:30 - 10:42 at Plenary Room 2 - Program Repair 2 Chair(s): Shin Yoo (KAIST)
Journal-first Papers
Thu 14 Sep 2023 10:42 - 10:54 at Plenary Room 2 - Program Repair 2 Chair(s): Shin Yoo (KAIST)
Research Papers
Thu 14 Sep 2023 10:54 - 11:06 at Plenary Room 2 - Program Repair 2 Chair(s): Shin Yoo (KAIST)
Modern web applications often resort to application development frameworks such as React, Vue.js, and Angular. While these frameworks facilitate the development of web applications with several useful components, they are inevitably vulnerable to unmanaged memory consumption, since they often produce Single Page Applications (SPAs). Web applications can stay alive for hours or days with behavior loops; in such cases, even a single memory leak in a SPA can cause performance degradation on the client side. However, recent debugging techniques for web applications still focus on memory leak detection, which requires manual effort and produces imprecise results.
We propose LeakPair, a technique to repair memory leaks in single page applications. Given the insight that memory leaks are mostly non-functional bugs and fixing them might not change the behavior of an application, the technique is designed to proactively generate patches to fix memory leaks, without leak detection, which is often heavy and tedious. To generate effective patches, LeakPair follows the idea of pattern-based program repair since the automated repair strategy shows successful results in many recent studies. We evaluate the technique on more than 20 open-source projects without using explicit leak detection. The patches generated by our technique are also submitted to the projects as pull requests. The results show that LeakPair can generate effective patches to reduce memory consumption that are acceptable to developers. In addition, we execute the test suites given by the projects after applying the patches, and it turns out that the patches do not cause any functionality breakage; this might imply that LeakPair can generate non-intrusive patches for memory leaks.
Research Papers
Thu 14 Sep 2023 11:06 - 11:18 at Plenary Room 2 - Program Repair 2 Chair(s): Shin Yoo (KAIST)
Research Papers
Thu 14 Sep 2023 11:18 - 11:30 at Plenary Room 2 - Program Repair 2 Chair(s): Shin Yoo (KAIST)
The development of correct and efficient software can be hindered by compilation errors, which must be fixed to ensure the code's syntactic correctness and conformance to programming language constraints. Neural network-based approaches have been used to tackle this problem, but they lack guarantees of output correctness and can require an unlimited number of modifications. Fixing compilation errors within a given number of modifications is a challenging task, and we demonstrate that finding the minimum number of modifications to fix a compilation error is NP-hard. To address the compilation error fixing problem, we propose OrdinalFix, a complete algorithm based on shortest-path CFL (context-free language) reachability with attribute checking that is guaranteed to output a program with the minimum number of modifications required. Specifically, OrdinalFix searches possible fixes from the smallest to the largest number of modifications. By incorporating merged attribute checking to enhance efficiency, the time complexity of OrdinalFix remains acceptable in practice. We evaluate OrdinalFix on two datasets and demonstrate its ability to fix compilation errors within a reasonable time limit. Compared with existing approaches, OrdinalFix achieves a success rate of 83.5%, surpassing all existing approaches (71.7%).
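The minimality guarantee can be illustrated with a brute-force breadth-first search (OrdinalFix itself uses shortest-path CFL reachability; this sketch only shows why exploring fixes in order of edit count returns a minimum-modification repair first):

    from collections import deque

    def minimal_fix(tokens, edits_for, compiles, max_mods=5):
        """BFS over one-edit variants: the first compiling program found is
        reachable with the minimum number of modifications.
        `edits_for` and `compiles` are caller-supplied (assumed) helpers."""
        queue = deque([(tuple(tokens), 0)])
        seen = {tuple(tokens)}
        while queue:
            prog, cost = queue.popleft()
            if compiles(prog):
                return list(prog), cost
            if cost < max_mods:
                for nxt in edits_for(prog):
                    t = tuple(nxt)
                    if t not in seen:
                        seen.add(t)
                        queue.append((t, cost + 1))
        return None, None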
NIER Track
Thu 14 Sep 2023 11:30 - 11:42 at Plenary Room 2 - Program Repair 2 Chair(s): Shin Yoo (KAIST)
With our reliance on software continuously increasing, it is of utmost importance that it be reliable. However, complete prevention of bugs in live systems is unfortunately an impossible task due to time constraints, incomplete testing, and developers not having knowledge of the full stack. As a result, mitigating risks for systems in production through hot patching and hot fixing has become an integral part of software development. In this paper, we first give an overview of the terminology used in the literature for research on this topic. Subsequently, we build upon these findings and present our vision for an automated framework for predicting and mitigating critical software issues at runtime. Our framework combines hot patching and hot fixing research from multiple fields, in particular: software defect and vulnerability prediction, automated test generation and repair, as well as runtime patching. We hope that our vision inspires research collaboration between the different communities.
Research Papers
Thu 14 Sep 2023 13:30 - 13:42 at Plenary Room 2 - Software Testing for Specialized Systems 3 Chair(s): Xiaoyin Wang (University of Texas at San Antonio)
Research Papers
Thu 14 Sep 2023 13:42 - 13:54 at Plenary Room 2 - Software Testing for Specialized Systems 3 Chair(s): Xiaoyin Wang (University of Texas at San Antonio)
Tool Demonstrations
Thu 14 Sep 2023 13:54 - 14:06 at Plenary Room 2 - Software Testing for Specialized Systems 3 Chair(s): Xiaoyin Wang (University of Texas at San Antonio)
Rigorous testing of Small Uncrewed Aerial Systems (sUAS) is crucial to ensure their safe and reliable deployment in the real world. sUAS developers aim to validate the reliability and safety of their applications through simulation testing. However, the dynamic nature of the real-world environment, including factors such as challenging weather conditions and wireless interference, causes unique software faults that may only be revealed through field testing. Considering the high cost and impracticality of conducting field testing in thousands of environmental contexts and conditions, there exists a pressing need to develop automated techniques that can generate high-fidelity, realistic environments enabling sUAS developers to deploy their applications and conduct thorough simulation testing in close-to-reality environmental conditions. To address this need, DroneWorld offers a comprehensive small Unmanned Aerial Vehicle (sUAV) simulation ecosystem that automatically generates realistic environments based on developer-specified constraints, monitors sUAV activities against predefined safety parameters, and generates detailed acceptance test reports for effective debugging and analysis of sUAV applications. Providing these capabilities, DroneWorld offers a valuable solution for enhancing the testing and development process of sUAV applications. A comprehensive demo of DroneWorld is available at https://youtu.be/RUsXYMi9rWs.
Research Papers
Thu 14 Sep 2023 14:06 - 14:18 at Plenary Room 2 - Software Testing for Specialized Systems 3 Chair(s): Xiaoyin Wang (University of Texas at San Antonio)
Research Papers
Thu 14 Sep 2023 14:18 - 14:30 at Plenary Room 2 - Software Testing for Specialized Systems 3 Chair(s): Xiaoyin Wang (University of Texas at San Antonio)
Mutation testing can help reduce the risks of releasing faulty software. For such reason, it is a desired practice for the development of embedded software running in safety-critical cyber-physical systems (CPS). Unfortunately, state-of-the-art test data generation techniques for mutation testing of C and C++ software, two typical languages for CPS software, rely on symbolic execution, whose limitations often prevent its application (e.g., it cannot test black-box components).
We propose a mutation testing approach that leverages fuzz testing, which has proved effective with C and C++ software. Fuzz testing automatically generates diverse test inputs that exercise program branches in a varied number of ways and, therefore, exercise statements in different program states, thus maximizing the likelihood of killing mutants, our objective.
We performed an empirical assessment of our approach with software components used in satellite systems currently in orbit. Our empirical evaluation shows that mutation testing based on fuzz testing kills a significantly higher proportion of live mutants than symbolic execution (i.e., up to an additional 47 percentage points). Further, when symbolic execution cannot be applied, fuzz testing provides significant benefits (i.e., up to 41% mutants killed). Our study is the first one comparing fuzz testing and symbolic execution for mutation testing; our results provide guidance towards the development of fuzz testing tools dedicated to mutation testing.
MOTIF is available at: https://github.com/SNTSVV/MOTIF
Replication package: https://figshare.com/articles/conference_contribution/Fuzzing_for_CPS_Mutation_Testing/22693525
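The kill criterion itself is simple to sketch (a generic harness, not MOTIF's implementation): a fuzzer-generated input kills a mutant when the mutant's observable behavior diverges from the original program's.

    import subprocess

    def kills(original_bin, mutant_bin, corpus_files):
        """Return True if any corpus input exposes a behavioral difference;
        binary paths and corpus files are placeholders."""
        for path in corpus_files:
            with open(path, "rb") as f:
                data = f.read()
            orig = subprocess.run([original_bin], input=data, capture_output=True)
            mut = subprocess.run([mutant_bin], input=data, capture_output=True)
            if (orig.stdout, orig.returncode) != (mut.stdout, mut.returncode):
                return True  # observable divergence: mutant killed
        return False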
Tool Demonstrations
Thu 14 Sep 2023 14:30 - 14:42 at Plenary Room 2 - Software Testing for Specialized Systems 3 Chair(s): Xiaoyin Wang (University of Texas at San Antonio)
Several experience reports illustrate that mutation testing is capable of supporting a "shift-left" testing strategy for software systems coded in textual programming languages like C++. For graphical modelling languages like Simulink, such experience reports are missing, primarily because of a lack of adequate tool support. In this paper we present a proof-of-concept (named MUT4SLX) for automatic mutant generation and test execution on Simulink models. MUT4SLX features 15 mutation operators which are modelled after realistic faults (mined from an industrial bug database) and are fast to inject (because we only replace parameter values within blocks). An experimental evaluation on a sample project (a Helicopter Control System) demonstrates that MUT4SLX is capable of injecting 70 mutants in less than a second, resulting in a total analysis time of 8.14 hours.
The tool is available at: https://github.com/haliliceylan/MUT4SLX/ and the demonstration video can be found at: https://youtu.be/inud_NRGutc.
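The parameter-value mutation style described above can be sketched as follows (an illustration in Python; MUT4SLX itself operates on actual Simulink models):

    import random

    def mutate_block_parameter(model):
        """Model a Simulink model as block -> {parameter: value}; a mutant
        perturbs one numeric parameter value inside one block."""
        mutant = {name: dict(params) for name, params in model.items()}
        block = random.choice(list(mutant))
        param = random.choice(list(mutant[block]))
        mutant[block][param] *= 10  # e.g., a gain off by a factor of ten
        return mutant

    helicopter = {"Gain1": {"Gain": 2.5}, "Saturation": {"UpperLimit": 1.0}}
    print(mutate_block_parameter(helicopter))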
Journal-first Papers
Thu 14 Sep 2023 14:42 - 14:54 at Plenary Room 2 - Software Testing for Specialized Systems 3 Chair(s): Xiaoyin Wang (University of Texas at San Antonio)
Journal-first Papers
Thu 14 Sep 2023 15:30 - 15:42 at Plenary Room 2 - Fuzzing Chair(s): Lars Grunske (Humboldt-Universität zu Berlin)
Fuzzing is a popular software testing method that discovers bugs by massively feeding target applications with automatically generated inputs. Many state-of-the-art fuzzers use branch coverage as a feedback metric to guide the fuzzing process. The fuzzer retains inputs for further mutation only if branch coverage is increased. However, branch coverage only provides a shallow sampling of program behaviours and hence may discard interesting inputs to mutate. This work aims at taking advantage of the large body of research on defining finer-grained code coverage metrics (such as control-flow, data-flow, or mutation coverage) and at evaluating how fuzzing performance is impacted when these metrics are used to select interesting inputs for mutation. We propose to make branch-coverage-based fuzzers support most fine-grained coverage metrics out of the box (i.e., without changing fuzzer internals). We achieve this by making the test objectives defined by these metrics (such as conditions to activate or mutants to kill) explicit as new branches in the target program. Fuzzing such a modified target is then equivalent to fuzzing the original target, but the fuzzer will also retain inputs covering the additional metric objectives for mutation. In addition, all the fuzzer mechanisms for penetrating hard-to-cover branches will help cover the additional metric objectives. We use this approach to evaluate the impact of supporting two fine-grained coverage metrics (multiple condition coverage and weak mutation) on the performance of two state-of-the-art fuzzers (AFL++ and QSYM) with the standard LAVA-M and MAGMA benchmarks. This evaluation suggests that our mechanism for runtime fuzzer guidance, where the fuzzed code is instrumented with additional branches, is effective and could be leveraged to encode guidance from human users or static analysers. Our results also show that the impact of fine-grained metrics on fuzzing performance is hard to predict before fuzzing and is most of the time either neutral or negative. As a consequence, we do not recommend using them to guide fuzzers, except perhaps in favorable circumstances yet to be investigated, such as for limited parts of the code or to complement classical fuzzing campaigns.
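The core instrumentation trick, making each metric objective an explicit branch, can be shown in miniature (Python for readability here; the paper instruments compiled fuzzing targets):

    def do_something():
        return "guarded action"

    def target(a: bool, b: bool):
        # --- added instrumentation: one explicit (empty) branch per condition
        # outcome, so a branch-coverage-guided fuzzer retains inputs reaching
        # each multiple-condition-coverage objective.
        if a:
            pass  # objective: a is true
        else:
            pass  # objective: a is false
        if b:
            pass  # objective: b is true
        else:
            pass  # objective: b is false
        # --- original code, unchanged ---
        if a and b:
            return do_something()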
Research Papers
Thu 14 Sep 2023 15:42 - 15:54 at Plenary Room 2 - Fuzzing Chair(s): Lars Grunske (Humboldt-Universität zu Berlin)
Research Papers
Thu 14 Sep 2023 15:54 - 16:06 at Plenary Room 2 - Fuzzing Chair(s): Lars Grunske (Humboldt-Universität zu Berlin)
Research Papers
Thu 14 Sep 2023 16:06 - 16:18 at Plenary Room 2 - Fuzzing Chair(s): Lars Grunske (Humboldt-Universität zu Berlin)
A physical simulation engine (PSE) is a software system that simulates physical environments and objects. Modern PSEs feature both forward and backward simulations, where the forward phase predicts the behavior of a simulated system and the backward phase provides gradients (guidance) for learning-based control tasks, such as a robot arm learning to fetch items. In this way, modern PSEs show promising support for learning-based control methods. To date, PSEs have been largely used in various highly profitable commercial applications, such as games, movies, virtual reality (VR), and robotics. Despite the prosperous development and usage of PSEs by academia and industrial manufacturers such as Google and NVIDIA, PSEs may produce incorrect simulations, which may lead to negative consequences, from poor user experience in entertainment to accidents in robotics-involved manufacturing and surgical operations.
This paper introduces PHYFU, a fuzzing framework designed specifically for PSEs to uncover errors in both forward and backward simulation phases. PHYFU mutates initial states and asserts if the PSE under test behaves consistently with respect to basic Physics Laws (PLs). We further use feedback-driven test input scheduling to guide and accelerate the search for errors. Our study of four PSEs covers mainstream industrial vendors (Google and NVIDIA) as well as academic products. We successfully uncover over 5K error-triggering inputs that generate incorrect simulation results spanning across the whole software stack of PSEs.
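A metamorphic check in the spirit of PHYFU can be sketched with a trivially simple "engine" (an illustration, not PHYFU's code or the tested PSEs): a force-free particle must conserve its velocity, so fuzzing initial states and asserting that law flags inconsistent simulations.

    import random

    def simulate_free_particle(x, v, steps, dt=0.01):
        for _ in range(steps):
            x += v * dt  # no force acts, so velocity must stay constant
        return x, v

    for _ in range(1000):  # fuzz the initial state
        x0, v0 = random.uniform(-10, 10), random.uniform(-5, 5)
        x1, v1 = simulate_free_particle(x0, v0, steps=100)
        assert abs(v1 - v0) < 1e-9, "physics-law violation: velocity changed"
        assert abs((x1 - x0) - v0 * 1.0) < 1e-6, "inconsistent displacement"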
Research Papers
Thu 14 Sep 2023 16:18 - 16:30 at Plenary Room 2 - Fuzzing Chair(s): Lars Grunske (Humboldt-Universität zu Berlin)
Fuzzing applies input mutations iteratively with the only goal of finding more bugs, resulting in synthetic tests that tend to lack realism. Big data analytics are expected to ingest real-world data as input. Therefore, when synthetic test data are not easily comprehensible, they are less likely to facilitate the downstream task of fixing errors. Our position is that fuzzing in this domain must achieve both high naturalness and high code coverage. We propose a new natural synthetic test generation tool for big data analytics, called NaturalFuzz. It generates unstructured, semi-structured, and structured data with corresponding semantics such as 'zipcode' and 'age'. The key insights behind NaturalFuzz are two-fold. First, though existing test data may be small and lack coverage, we can grow this data to increase code coverage. Second, we can strategically mix constituent parts across different rows and columns to construct new realistic synthetic data by leveraging fine-grained data provenance. On commercial big data application benchmarks, NaturalFuzz achieves an additional 19.9% coverage and detects 1.9× more faults than a machine learning-based synthetic data generator (SDV) when generating comparably sized inputs. This is because an ML-based synthetic data generator does not consider which code branches are exercised by which input rows from which tables, while NaturalFuzz is able to select input rows that have a high potential to increase code coverage and mutate the selected data towards unseen, new program behavior. NaturalFuzz's test data is more realistic than the test data generated by two baseline fuzzers (BigFuzz and Jazzer), while increasing code coverage and fault detection potential. NaturalFuzz is the first fuzzing methodology with three benefits: (1) it exclusively generates natural inputs, (2) it fuzzes multiple input sources simultaneously, and (3) it finds deeper semantic faults.
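The provenance-guided recombination insight can be sketched as follows (hypothetical Python; NaturalFuzz works on big data pipelines with real provenance, which this toy example only mimics):

    import random

    rows = [
        {"zipcode": "90210", "age": 34, "city": "Beverly Hills"},
        {"zipcode": "10001", "age": 71, "city": "New York"},
    ]

    def recombine(rows, promising_columns):
        """Mix real constituent values across rows: columns whose values (per
        data provenance) reached rarely-covered branches are swapped in from
        random donor rows, keeping every value natural."""
        new_row = dict(random.choice(rows))
        for column in promising_columns:
            new_row[column] = random.choice(rows)[column]
        return new_row

    print(recombine(rows, promising_columns=["age"]))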
Tool Demonstrations
Thu 14 Sep 2023 16:30 - 16:42 at Plenary Room 2 - Fuzzing Chair(s): Lars Grunske (Humboldt-Universität zu Berlin)
In object-oriented design, class specifications are primarily used to express properties describing the intended behavior of the class methods and constraints on the class' objects. Although the presence of these specifications is important for various software engineering tasks such as test generation, bug finding, and automated debugging, developers rarely write them.
In this tool demo we present SpecFuzzer, a tool that aims at alleviating the problem of writing class specifications by combining grammar-based fuzzing, dynamic invariant detection, and mutation analysis to automatically infer specifications for Java classes. Given a class under analysis, SpecFuzzer uses (i) a generator of candidate assertions derived from a grammar automatically extracted from the class; (ii) a dynamic invariant detector, Daikon, to discard the assertions invalidated by a test suite; and (iii) a mutation-based mechanism to cluster and rank assertions, so that similar constraints are grouped and the stronger ones prioritized.
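The Daikon-style filtering step, in miniature (a sketch of the idea, not SpecFuzzer's implementation): candidate assertions are evaluated against recorded executions, and any assertion falsified by some execution is discarded.

    def filter_candidates(candidates, traces):
        """Keep only candidate assertions that hold on every recorded state."""
        return [pred for pred in candidates if all(pred(t) for t in traces)]

    # Example: observed states of a counter, plus two candidate invariants.
    traces = [{"count": 0}, {"count": 3}, {"count": 10}]
    candidates = [
        lambda t: t["count"] >= 0,   # never falsified: survives
        lambda t: t["count"] < 10,   # falsified by the last state: discarded
    ]
    print(len(filter_candidates(candidates, traces)))  # -> 1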
NIER Track
Thu 14 Sep 2023 16:42 - 16:54 at Plenary Room 2 - Fuzzing Chair(s): Lars Grunske (Humboldt-Universität zu Berlin)