Research Papers
Tue 12 Sep 2023 10:30 - 10:42 at Room D - Infrastructure, Build, and Logs Chair(s): Fatemeh Hendijani Fard University of British Columbia, Arie van Deursen Delft University of Technology
no description available
Pre-print
NIER Track
Tue 12 Sep 2023 10:42 - 10:54 at Room D - Infrastructure, Build, and Logs Chair(s): Fatemeh Hendijani Fard University of British Columbia, Arie van Deursen Delft University of Technology
Journal-first Papers
Tue 12 Sep 2023 10:54 - 11:06 at Room D - Infrastructure, Build, and Logs Chair(s): Fatemeh Hendijani Fard University of British Columbia, Arie van Deursen Delft University of Technology
no description available
Industry Showcase (Papers)
Tue 12 Sep 2023 11:06 - 11:18 at Room D - Infrastructure, Build, and Logs Chair(s): Fatemeh Hendijani Fard University of British Columbia, Arie van Deursen Delft University of Technology
Development teams in large companies often maintain a huge codebase whose build time can be painfully long on a single machine. To reduce the build time, tools such as Bazel and distcc are used to build the codebase in a distributed fashion. However, during a distributed build, remote nodes commonly crash due to two types of errors: Out Of Memory (OOM) and Deadline Exceeded (DE). These crashes lead to time-consuming rebuilds, a problem also faced by WeiXin Group (WXG) of Tencent Inc., the company that created WeChat. Since existing tools cannot help avoid OOM and DE errors, we propose PCRLinear, which predicts the memory and time requirements of a C++ file, allowing the original distributed build system to schedule compilation adaptively according to the prediction. Our experiments show that PCRLinear reduces the OOM and DE errors to zero and delivers a significant average build performance improvement of 30%.
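The abstract does not detail PCRLinear's features or scheduler integration; as a rough illustration of the idea, here is a sketch with assumed per-file features (lines of code, includes, template use) and a greedy memory-aware placement:

    # Sketch: predict per-file compile resources, then schedule under node
    # memory limits. Feature names and the capacity model are illustrative.
    from sklearn.linear_model import LinearRegression

    def train_predictors(features, peak_mem_mb, compile_secs):
        """features: one row per C++ file, e.g. [loc, num_includes, num_templates]."""
        mem_model = LinearRegression().fit(features, peak_mem_mb)
        time_model = LinearRegression().fit(features, compile_secs)
        return mem_model, time_model

    def schedule(files, mem_model, nodes):
        """Place each file on the node with the most predicted headroom, so
        compilations likely to exceed a node's memory are never assigned to it."""
        deferred = []
        for f in files:
            need = mem_model.predict([f["features"]])[0]
            node = max(nodes, key=lambda n: n["free_mb"])
            if node["free_mb"] < need:
                deferred.append(f)   # retry once running jobs free memory
                continue
            node["free_mb"] -= need
            node["queue"].append(f["path"])
        return deferred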
Journal-first Papers
Tue 12 Sep 2023 11:30 - 11:42 at Room D - Infrastructure, Build, and Logs Chair(s): Fatemeh Hendijani Fard University of British Columbia, Arie van Deursen Delft University of Technology
Docker is a containerization technology that allows developers to ship software applications along with their dependencies in Docker images. Developers can extend existing images by using them as base images when writing Dockerfiles. However, many functionally equivalent alternative base images are available. While many studies define and evaluate quality features that can be extracted from Docker artifacts, it is still unclear which criteria lead developers to choose one base image over another. In this paper, we aim to fill this gap. First, we conduct a literature review through which we define a taxonomy of quality features, identifying two main groups: configuration-related features (i.e., mainly related to the Dockerfile and image build process) and externally observable features (i.e., what Docker image users can observe). Second, we conduct an empirical study of developers' preferences for 2,441 Docker images in 1,911 open-source software projects. We want to understand (i) how the externally observable features influence developers' preferences, and (ii) how they are related to the configuration-related features. Our results pave the way to the definition of a reliable quality measure for Docker artifacts, along with tools that support developers in their quality-aware development.
Link to publication
Research Papers
Tue 12 Sep 2023 11:42 - 11:54 at Room D - Infrastructure, Build, and Logs Chair(s): Fatemeh Hendijani Fard University of British Columbia, Arie van Deursen Delft University of Technology
no description available
Research Papers
Tue 12 Sep 2023 13:30 - 13:42 at Room D - Smart Contracts, Blockchain, Energy efficiency, and green software
The advent of edge devices dedicated to machine learning tasks has enabled the execution of AI-based applications that efficiently process and classify the data acquired by the resource-constrained devices populating the Internet of Things. The proliferation of such applications (e.g., critical monitoring in smart cities) demands new strategies to make these systems sustainable from an energy standpoint as well.
In this paper, we present an energy-aware approach for the design and deployment of self-adaptive AI-based applications that can balance application objectives (e.g., accuracy in object detection and frame processing rate) with energy consumption. We address the problem of determining the set of configurations that can be used to self-adapt the system with a meta-heuristic search procedure that needs only a small number of empirical samples. The final set of configurations is selected using weighted gray relational analysis (see the sketch after this abstract) and mapped to the operation modes of the self-adaptive application.
We validate our approach on an AI-based application for pedestrian detection. Results show that our self-adaptive application can outperform non-adaptive baseline configurations by saving up to 81% of energy while losing only 2% to 6% in accuracy.
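A minimal sketch of the weighted gray relational analysis selection step referenced above; the objectives, weights, and distinguishing coefficient rho are illustrative assumptions:

    import numpy as np

    def weighted_gra(candidates, weights, rho=0.5):
        """Rank configurations by weighted gray relational grade.
        candidates: (n_configs, n_objectives) matrix, larger-is-better values."""
        x = np.asarray(candidates, dtype=float)
        x = (x - x.min(axis=0)) / (np.ptp(x, axis=0) + 1e-12)  # normalize per objective
        delta = np.abs(x.max(axis=0) - x)       # deviation from the ideal reference
        coef = (delta.min() + rho * delta.max()) / (delta + rho * delta.max())
        grade = coef @ np.asarray(weights, dtype=float)
        return np.argsort(-grade)               # indices of best configurations first

    # e.g. objectives = (accuracy, frame rate, negated energy); weights sum to 1
    ranking = weighted_gra([[0.91, 24, -3.1], [0.89, 30, -2.2], [0.85, 31, -1.4]],
                           weights=[0.5, 0.25, 0.25])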
Pre-print File Attached
Journal-first Papers
Tue 12 Sep 2023 13:42 - 13:54 at Room D - Smart Contracts, Blockchain, Energy efficiency, and green software
no description available
Research Papers
Tue 12 Sep 2023 13:54 - 14:06 at Room D - Smart Contracts, Blockchain, Energy efficiency, and green software
Smart contracts are programs running on the blockchain. Comments in source code provide meaningful information for developers, facilitating code writing and understanding. Given the various token standards for smart contracts (e.g., ERC-20, ERC-721), developers often copy and paste code from other projects as templates and then implement their own logic as add-ons to those templates. In many cases, code and comments are not well aligned, leading to comment-code inconsistencies (CCIs). Such inconsistencies can mislead developers and users, and even introduce vulnerabilities into the contracts. In this paper, we present SmartCoCo, a novel framework to detect comment-code inconsistencies in smart contracts. In particular, our research focuses on comments related to roles, parameters, and events that may have security implications. To achieve this, SmartCoCo takes the original smart contract source code as input and automatically analyzes the comments and code to find potential inconsistencies. SmartCoCo associates comment constraints and code facts via a set of propagation and binding strategies, allowing it to effectively discover inconsistencies with more contextual information. We evaluated SmartCoCo on 101,780 unique smart contracts on Ethereum. The evaluation shows that SmartCoCo is both effective and efficient. In particular, SmartCoCo reports 4,732 inconsistencies in 1,745 smart contracts, with a precision of over 79% on 439 manually labeled comment-code inconsistencies. Meanwhile, it takes only 2.64 seconds on average to check a smart contract.
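To make one binding concrete, here is a deliberately simplified role-constraint check; the real tool binds comment constraints to code facts via static analysis, not regular expressions:

    import re

    # Simplified illustration of one SmartCoCo-style check (role constraints):
    # a comment claiming "only owner" should be backed by an owner check in code.
    ROLE_COMMENT = re.compile(r"only\s+(the\s+)?owner", re.IGNORECASE)
    ROLE_CODE = re.compile(r"onlyOwner|require\s*\(\s*msg\.sender\s*==\s*owner")

    def role_inconsistency(comment: str, function_body: str) -> bool:
        """True if the comment promises an owner-only action that the code
        never enforces, i.e., a potential comment-code inconsistency."""
        return bool(ROLE_COMMENT.search(comment)) and not ROLE_CODE.search(function_body)

    # role_inconsistency("/// Only owner can mint.", "function mint() public { ... }")
    # -> True (no onlyOwner modifier or msg.sender check in the body)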
Pre-print File Attached
Tool Demonstrations
Tue 12 Sep 2023 14:06 - 14:18 at Room D - Smart Contracts, Blockchain, Energy efficiency, and green software
With the exponential growth of data, the demand for effective data analysis tools has increased significantly. The R language, known for its statistical modeling and data analysis capabilities, has become one of the most popular programming languages among data scientists and researchers. As the importance of energy-aware software systems continues to rise, several studies have investigated the impact of source code and of different stages of machine learning model training on energy consumption. However, existing studies in this domain primarily focus on programming languages like Python and Java, leaving a gap for energy measurement tools in other programming languages such as R. To address this gap, we propose RJoules, a tool designed to measure the energy consumption of R code snippets. We evaluate the correctness and performance of RJoules by applying it to four machine learning algorithms on three different systems. Our aim is to support developers and practitioners in building energy-aware systems in R. A demonstration of the tool is available at https://youtu.be/yMKFuvAM-DE and related artifacts at https://rishalab.github.io/RJoules.
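One plausible metering mechanism for such a tool on Linux is to read Intel RAPL energy counters around an Rscript invocation; this sketch is an assumption about the mechanism, not RJoules' actual implementation:

    import subprocess, time
    from pathlib import Path

    # Intel RAPL package-energy counter on Linux; wraparound is ignored here.
    RAPL = Path("/sys/class/powercap/intel-rapl:0/energy_uj")

    def measure_r_snippet(r_code: str):
        """Run an R snippet via Rscript and report energy (J) and wall time (s)."""
        before = int(RAPL.read_text())
        start = time.monotonic()
        subprocess.run(["Rscript", "-e", r_code], check=True)
        elapsed = time.monotonic() - start
        joules = (int(RAPL.read_text()) - before) / 1e6  # counter is in microjoules
        return joules, elapsed

    # measure_r_snippet("m <- lm(mpg ~ wt, data = mtcars); print(summary(m))")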
Pre-print File Attached
Industry Showcase (Papers)
Tue 12 Sep 2023 14:18 - 14:30 at Room D - Smart Contracts, Blockchain, Energy efficiency, and green software
Advances in technologies like artificial intelligence and the metaverse have led to a proliferation of software systems in business and everyday life. With this widespread penetration, the carbon emissions of software are growing rapidly as well, negatively impacting the long-term sustainability of our environment. Hence, optimizing software from a sustainability standpoint becomes more crucial than ever. We believe that the adoption of automated tools that can identify energy-inefficient patterns in code and guide appropriate refactoring can significantly assist in this optimization. In this extended abstract, we present an industry case study that evaluates the sustainability impact of refactoring energy-inefficient code patterns identified by automated software sustainability assessment tools for a large application. Preliminary results highlight a positive impact on the application's sustainability post-refactoring, with a 29% decrease in energy consumption.
Industry Showcase (Papers)
Tue 12 Sep 2023 14:30 - 14:42 at Room D - Smart Contracts, Blockchain, Energy efficiency, and green software
As the world takes cognizance of AI's growing role in greenhouse gas (GHG) and carbon emissions, the focus of AI research and development is shifting towards including energy efficiency as another core metric. Sustainability, a core agenda for most organizations, is also being viewed as a core non-functional requirement in software engineering. A similar effort is being undertaken to extend sustainability principles to AI-based systems, with a focus on energy-efficient training and inference techniques. But an important question arises: do any metrics or methods exist that can quantify the adoption of "green" practices in the life cycle of AI-based systems? A huge gap exists between the growing research corpus on sustainable practices in AI research and its adoption at industry scale. The goal of this work is to introduce a methodology and a novel metric for assessing the "greenness" of any AI-based system and its development process, based on energy-efficient AI research and practices. The novel metric, termed the Green AI Quotient, would be a key step in an AI practitioner's Green AI journey. Empirical validation of our approach suggests that the Green AI Quotient is able to encourage adoption and raise awareness of sustainable practices in the AI lifecycle.
NIER Track
Tue 12 Sep 2023 14:42 - 14:54 at Room D - Smart Contracts, Blockchain, Energy efficiency, and green software
no description available
Research Papers
Tue 12 Sep 2023 15:30 - 15:42 at Room D - Web Development 1 Chair(s): Ben Hermann TU Dortmund
Modern web services increasingly rely on REST APIs. Effectively testing these APIs is challenging due to the vast search space to be explored, which involves selecting API operations for sequence creation, choosing parameters for each operation from a potentially large set of parameters, and sampling values from the virtually infinite parameter input space. Current testing tools lack efficient exploration mechanisms, treating all operations and parameters equally (i.e., not considering their importance or complexity) and lacking prioritization strategies. Furthermore, these tools struggle when response schemas are absent from the specification or exhibit variants. To address these limitations, we present an adaptive REST API testing technique that incorporates reinforcement learning to prioritize operations and parameters during exploration. Our approach dynamically analyzes request and response data to inform dependent parameters and adopts a sampling-based strategy for efficient processing of dynamic API feedback. We evaluated our technique on ten RESTful services, comparing it against state-of-the-art REST testing tools with respect to code coverage achieved, requests generated, operations covered, and service failures triggered. Additionally, we performed an ablation study on prioritization, dynamic feedback analysis, and sampling to assess their individual effects. Our findings demonstrate that our approach outperforms existing REST API testing tools in terms of effectiveness, efficiency, and fault-finding ability.
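The abstract leaves the reinforcement-learning formulation open; as a minimal illustration of prioritized exploration, an epsilon-greedy bandit over API operations might look like this:

    import random
    from collections import defaultdict

    # Minimal epsilon-greedy sketch of operation prioritization; the paper's
    # actual reinforcement-learning formulation is richer than this.
    class OperationBandit:
        def __init__(self, operations, epsilon=0.1):
            self.ops = list(operations)
            self.epsilon = epsilon
            self.value = defaultdict(float)   # running reward estimate per operation
            self.count = defaultdict(int)

        def pick(self):
            if random.random() < self.epsilon:
                return random.choice(self.ops)                 # explore
            return max(self.ops, key=self.value.__getitem__)   # exploit

        def update(self, op, reward):
            """reward: e.g., 1.0 for new coverage or a 5xx failure, else 0.0."""
            self.count[op] += 1
            self.value[op] += (reward - self.value[op]) / self.count[op]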
Pre-print File Attached
Industry Showcase (Papers)
Tue 12 Sep 2023 15:42 - 15:54 at Room D - Web Development 1 Chair(s): Ben Hermann TU Dortmund
The microservice paradigm is a popular software development pattern that breaks down a large application into smaller, independent services. While this approach offers several advantages, such as scalability, agility, and flexibility, it also introduces new security challenges. This paper presents a novel approach to securing microservice architectures using fuzz testing. Fuzz testing is known to find security vulnerabilities in software by feeding it unexpected or random inputs. In this paper, we propose a zero-config fuzz test generation technique for microservices that maximizes coverage of internal states by mutating the frontend requests and the backend responses from dependent services. We also present the results of our fuzzing, which uncovered thousands of security vulnerabilities in real-world microservice applications that were subsequently reported and fixed.
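As a toy illustration of the two mutation surfaces described above (the tool's actual mutation operators are not given in the abstract):

    import copy, random

    # Perturb a recorded frontend request and a recorded backend response
    # with a handful of classic boundary values.
    def mutate_json(obj):
        obj = copy.deepcopy(obj)
        keys = list(obj)
        if not keys:
            return obj
        k = random.choice(keys)
        obj[k] = random.choice([None, "", 0, -1, "\x00" * 64, 10**18, [], {}])
        return obj

    request = {"user_id": 42, "query": "laptops"}
    dependency_response = {"status": "ok", "items": [{"id": 1, "price": 999}]}
    fuzzed_request = mutate_json(request)
    fuzzed_response = mutate_json(dependency_response)  # replayed to the service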
File Attached
Journal-first Papers
Tue 12 Sep 2023 15:54 - 16:06 at Room D - Web Development 1 Chair(s): Ben Hermann TU Dortmund
no description available
Research Papers
Tue 12 Sep 2023 16:06 - 16:18 at Room D - Web Development 1 Chair(s): Ben Hermann TU Dortmund
no description available
Journal-first Papers
Tue 12 Sep 2023 16:18 - 16:30 at Room D - Web Development 1 Chair(s): Ben Hermann TU Dortmund
no description available
Research Papers
Tue 12 Sep 2023 16:30 - 16:42 at Room D - Web Development 1 Chair(s): Ben Hermann TU Dortmund
API recommendation methods have evolved from literal and semantic keyword matching to query expansion and query clarification. The latest query clarification methods are knowledge graph (KG)-based, but their limitations include out-of-vocabulary (OOV) failures and rigid question templates. To address these limitations, we propose a novel knowledge-guided query clarification approach for API recommendation that leverages a large language model (LLM) guided by the KG. We utilize the LLM as a neural knowledge base to overcome OOV failures, generating fluent and appropriate clarification questions and options. We also leverage the structured API knowledge and entity relationships stored in the KG to filter out noise, and transfer the optimal clarification path from the KG to the LLM, increasing the efficiency of the clarification process. Our approach is designed as an AI chain consisting of five steps, each handled by a separate LLM call, to improve the accuracy, efficiency, and fluency of query clarification in API recommendation. We verify the usefulness of each unit in our AI chain; all received high scores close to a perfect 5. Compared to the baselines, our approach shows a significant improvement in MRR, with a maximum increase of 63.9% when the query statement is covered by the KG and 37.2% when it is not. Ablation experiments reveal that the guidance of knowledge in the KG and the knowledge-guided pathfinding strategy are crucial to our approach's performance, contributing a 19.0% and a 22.2% increase in MAP, respectively. Our approach demonstrates a way to bridge the gap between KGs and LLMs, effectively combining the strengths and compensating for the weaknesses of both.
Research Papers
Wed 13 Sep 2023 10:30 - 10:42 at Room D - Program Analysis Chair(s): Domenico Bianculli University of Luxembourg
no description available
Research Papers
Wed 13 Sep 2023 10:42 - 10:54 at Room D - Program Analysis Chair(s): Domenico Bianculli University of Luxembourg
Context-free language (CFL) reachability is a fundamental framework for formulating program analyses. CFL-reachability analysis works on top of an edge-labeled graph by deriving reachability relations and adding them as labeled edges to the graph. Existing CFL-reachability algorithms typically adopt a single-reachability-relation derivation (SRD) strategy, i.e., one reachability relation is derived at a time. Unfortunately, this strategy can lead to redundancy, hindering the efficiency of the analysis.
To address this problem, this paper proposes PEARL, a multi-derivation approach that reduces derivation redundancy for the transitive relations that frequently arise when solving reachability relations, significantly improving the efficiency of CFL-reachability analysis. Our key insight is that multiple edges involving transitivity can be derived simultaneously via batch propagation of reachability relations on the transitivity-aware subgraphs induced from the original edge-labeled graph. We evaluate the performance of PEARL on two clients, i.e., context-sensitive value-flow analysis and field-sensitive alias analysis for C/C++. By eliminating a large amount of redundancy, PEARL achieves average speedups of 82.73x for value-flow analysis and 155.26x for alias analysis over the standard CFL-reachability algorithm. A comparison with POCR, a state-of-the-art CFL-reachability solver, shows that PEARL runs on average 10.1x (up to 29.2x) faster for value-flow analysis and 2.37x (up to 4.22x) faster for alias analysis, while consuming less memory.
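To make the contrast with single-reachability derivation concrete, here is a minimal sketch of batch propagation over the subgraph induced by one transitive relation; the subgraph construction and solver integration are simplified away:

    from collections import deque

    # On the subgraph induced by one transitive label (e.g., value-flow edges),
    # derive every target reachable from a source in one batch traversal,
    # instead of deriving and re-processing one edge at a time.
    def batch_transitive_edges(adj, source):
        """adj: node -> set of successors over one transitive relation.
        Returns every (source, target) edge derivable by transitivity."""
        seen, queue, derived = {source}, deque([source]), set()
        while queue:
            node = queue.popleft()
            for nxt in adj.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
                    derived.add((source, nxt))  # added in batch, no re-derivation
        return derived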
Pre-print File Attached
Tool Demonstrations
Wed 13 Sep 2023 10:54 - 11:06 at Room D - Program Analysis Chair(s): Domenico Bianculli University of Luxembourg
The satisfiability problem modulo the nonlinear real arithmetic (NRA) theory serves as the foundation for a wide range of important applications, such as model checking, program analysis, and software testing. However, due to its high computational complexity, developing efficient solving algorithms for this problem has consistently presented a substantial challenge. We present a hybrid SMT(NRA) solver, called NRAgo, which combines the efficiency of gradient-based optimization methods with the completeness of an algebraic solving algorithm. With our approach, practical performance on many satisfiable instances is substantially improved. The experimental evaluation shows that NRAgo achieves remarkable acceleration on a set of challenging SMT(NRA) benchmarks that are hard for state-of-the-art SMT solvers.
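A minimal sketch of the optimization half of such a hybrid solver: cast the constraints as a violation loss and search for a zero with a local optimizer, falling back to a complete procedure otherwise. The constraints and tolerance here are illustrative:

    import numpy as np
    from scipy.optimize import minimize

    # Constraints g_i(x) <= 0; a zero of the total violation is a model candidate.
    constraints = [
        lambda v: v[0] ** 2 + v[1] ** 2 - 4.0,   # x^2 + y^2 <= 4
        lambda v: 1.0 - v[0] * v[1],             # x * y >= 1
    ]

    def violation(v):
        return sum(max(0.0, g(v)) ** 2 for g in constraints)

    result = minimize(violation, x0=np.array([0.5, 0.5]), method="BFGS")
    if violation(result.x) < 1e-12:
        print("candidate model found:", result.x)   # verify exactly, then report SAT
    else:
        print("no model found; fall back to the complete algebraic solver")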
File Attached
NIER Track
Wed 13 Sep 2023 11:06 - 11:18 at Room D - Program Analysis Chair(s): Domenico Bianculli University of Luxembourg
Journal-first Papers
Wed 13 Sep 2023 11:18 - 11:30 at Room D - Program Analysis Chair(s): Domenico Bianculli University of Luxembourg
Deep learning (DL) has recently been widely applied to diverse source code processing tasks in the software engineering (SE) community, achieving competitive performance (e.g., accuracy). However, robustness, which requires the model to produce consistent decisions given minorly perturbed code inputs, still lacks systematic investigation as an important quality indicator. This article takes an early step, proposing CARROT, a framework for robustness detection, measurement, and enhancement of DL models for source code processing. We first propose an optimization-based attack technique, CARROTA, to generate valid adversarial source code examples effectively and efficiently. Based on this, we define robustness metrics and propose the robustness measurement toolkit CARROTM, which employs worst-case performance approximation under the allowable perturbations. We further propose to improve the robustness of the DL models by adversarial training (CARROTT) with our proposed attack techniques. Our in-depth evaluations on three source code processing tasks (i.e., functionality classification, code clone detection, defect prediction) containing more than 3 million lines of code and classic or SOTA DL models, including GRU, LSTM, ASTNN, LSCNN, TBCNN, CodeBERT, and CDLH, demonstrate the usefulness of our techniques for ❶ effective and efficient adversarial example detection, ❷ tight robustness estimation, and ❸ effective robustness enhancement.
Link to publication DOI File Attached
Research Papers
Wed 13 Sep 2023 11:30 - 11:42 at Room D - Program Analysis Chair(s): Domenico Bianculli University of Luxembourg
Program analysis techniques such as abstract interpretation and symbolic execution suffer from imprecision due to over- and underapproximation, which results in false alarms and missed violations. To alleviate this imprecision, we propose a novel data structure, program state probability (PSP), that leverages execution samples to probabilistically approximate reachable program states. The core intuition of this approximation is that the probability of reaching a given state varies greatly, and thus we can considerably increase analysis precision at the cost of a small probability of unsoundness or incompleteness, which is acceptable when the analysis targets bug-finding. Specifically, PSP enhances existing analyses by disregarding low-probability states deemed feasible by overapproximation and recognising high-probability states deemed infeasible by underapproximation. We apply PSP in three domains. First, we show that PSP enhances the precision of the Clam abstract interpreter in terms of MCC from 0.09 to 0.27 and F1 score from 0.22 to 0.34. Second, we demonstrate that a symbolic execution search strategy based on PSP that prioritises program states with a higher probability increases the number of found bugs and reduces the number of solver calls compared to state-of-the-art techniques. Third, a program repair patch prioritisation strategy based on PSP reduces the average patch rank by 26%.
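At its simplest, such a probability can be read off from sampled executions; a minimal sketch (the actual PSP data structure is necessarily richer):

    from collections import Counter

    # Approximate the probability of reaching a program point from sampled
    # runs, then let the client analysis ignore low-probability states.
    def estimate_psp(traces):
        """traces: list of executed program-point sequences from sampled runs."""
        hits = Counter(point for trace in traces for point in set(trace))
        total = len(traces)
        return {point: n / total for point, n in hits.items()}

    psp = estimate_psp([["entry", "L3", "L7"], ["entry", "L3"], ["entry", "L9"]])
    likely = {p for p, pr in psp.items() if pr >= 0.3}  # states worth analysing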
Pre-print
Research Papers
Wed 13 Sep 2023 11:42 - 11:54 at Room D - Program Analysis Chair(s): Domenico Bianculli University of Luxembourg
We present CLAA, a novel approach for API aspect detection in API reviews that utilizes transformer models trained with a supervised contrastive loss objective. We evaluate CLAA through performance and impact analysis. For the performance analysis, we used a benchmark dataset of developer discussions collected from Stack Overflow and compared the results to those obtained using state-of-the-art transformer models. Our experiments show that contrastive learning can significantly improve the performance of transformer models in detecting aspects such as Performance, Security, Usability, and Documentation. For the impact analysis, we performed an empirical study and a developer study. On 200 randomly selected and manually labeled online reviews, CLAA achieved 92% accuracy while the SOTA baseline achieved 81.5%. According to our developer study involving 10 participants, the use of 'Stack Overflow + CLAA' resulted in increased accuracy and confidence during API selection. Replication package: https://github.com/disa-lab/Contrastive-Learning-API-Aspect-ASE2023
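For reference, a minimal NumPy sketch of the supervised contrastive objective (Khosla et al.) named above; batch construction and the transformer encoder are omitted:

    import numpy as np

    # Embeddings sharing an aspect label are pulled together, others pushed apart.
    def supcon_loss(z, labels, tau=0.07):
        """z: (n, d) L2-normalized embeddings; labels: (n,) aspect ids."""
        labels = np.asarray(labels)
        sim = np.exp(z @ z.T / tau)
        np.fill_diagonal(sim, 0.0)          # never contrast a sample with itself
        denom = sim.sum(axis=1)
        idx = np.arange(len(z))
        total = 0.0
        for i in idx:
            positives = idx[(labels == labels[i]) & (idx != i)]
            if positives.size:
                total += -np.log(sim[i, positives] / denom[i]).mean()
        return total / len(z)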
Pre-print
Research Papers
Wed 13 Sep 2023 13:30 - 13:42 at Room D - Open Source and Software Ecosystems 2 Chair(s): Paul Grünbacher Johannes Kepler University Linz, Austria
Research Papers
Wed 13 Sep 2023 13:42 - 13:55 at Room D - Open Source and Software Ecosystems 2 Chair(s): Paul Grünbacher Johannes Kepler University Linz, Austria
no description available
Research Papers
Wed 13 Sep 2023 13:55 - 14:08 at Room D - Open Source and Software Ecosystems 2 Chair(s): Paul Grünbacher Johannes Kepler University Linz, Austria
Proper incentives are important for motivating developers in open-source communities, which is crucial for keeping open-source software development healthy. Providing such incentives requires an accurate and objective method for measuring developer contributions. However, existing methods rely heavily on manual peer review, lacking objectivity and transparency. Automated effort-estimation work uses metrics based only on syntax-level or even text-level information, such as changed lines of code, which lack robustness. Furthermore, work on identifying core developers provides only a qualitative understanding without a quantitative score, or depends on project-specific parameters, which makes it impractical for real-world projects. To this end, we propose CValue, a multidimensional information-fusion approach to measuring developer contributions. CValue extracts both syntactic and semantic information from source code changes along four dimensions: modification amount, understandability, and the inter-function and intra-function impact of the modification. It fuses this information to produce a contribution score for each commit in a project. Experimental results show that CValue outperforms other approaches by 19.59% on 10 real-world projects with manually labeled ground truth. We validated that the performance of CValue, at 83.39 seconds per commit, is acceptable for real-world projects. Furthermore, we performed a large-scale experiment on 174 projects and detected 2,282 developers with inflated commits. Of these, 2,050 developers did not make any syntactic contribution, and 103 were identified as bots.
Pre-print
Journal-first Papers
Wed 13 Sep 2023 14:08 - 14:21 at Room D - Open Source and Software Ecosystems 2 Chair(s): Paul Grünbacher Johannes Kepler University Linz, Austria
Software developed on public platforms is a source of data that can be used to make predictions about those projects. While individual development activity may be random and hard to predict, project-level development behavior can be predicted with good accuracy when large groups of developers work together on software projects.
To demonstrate this, we use 64,181 months of data from 1,159 GitHub projects to make various predictions about the recent status of those projects (as of April 2020). We find that traditional estimation algorithms make many mistakes: algorithms like k-nearest neighbors (KNN), support vector regression (SVR), random forest (RFT), linear regression (LNR), and regression trees (CART) have high error rates. But that error rate can be greatly reduced using hyperparameter optimization, as sketched after this abstract.
To the best of our knowledge, this is the largest study yet conducted, using recent data for predicting multiple health indicators of open-source projects.
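A minimal illustration of such tuning with scikit-learn; the paper's optimizer and search space are not reproduced here, and X, y stand in for the project-month features and a health indicator:

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    # Tune one of the listed learners by cross-validated grid search.
    search = GridSearchCV(
        RandomForestRegressor(random_state=0),
        param_grid={"n_estimators": [50, 100, 200],
                    "max_depth": [None, 10, 30],
                    "min_samples_leaf": [1, 5, 20]},
        scoring="neg_mean_absolute_error",
        cv=5,
    )
    # X: per-project monthly features; y: a health indicator (e.g., #contributors)
    # search.fit(X, y); tuned = search.best_estimator_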
Link to publication DOI Pre-print
Research Papers
Wed 13 Sep 2023 14:21 - 14:34 at Room D - Open Source and Software Ecosystems 2 Chair(s): Paul Grünbacher Johannes Kepler University Linz, Austria
Code is often reused to facilitate collaborative development, to create software variants, to experiment with new ideas, or to develop new features in isolation. Social-coding platforms, such as GitHub, enable enhanced code reuse with forking, pull requests, and cross-project traceability. With these concepts, forking has become a common strategy to reuse code by creating clones (i.e., forks) of projects. Thereby, forking establishes fork ecosystems of co-existing projects that are similar, but developed in parallel, often with rather sporadic code propagation and synchronization. Consequently, forked projects vary in quality and often involve redundant development efforts. Unfortunately, as we will show, many projects do not benefit from test cases created in other forks, even though those test cases could actually be reused to enhance the quality of other projects. We believe that reusing test cases, in addition to the implementation code, can improve software quality, software maintainability, and coding efficiency in fork ecosystems. While researchers have worked on test-case-reuse techniques, their potential to improve the quality of real fork ecosystems is unknown. To shed light on test-case reusability, we study to what extent test cases can be reused across forked projects. We mined a dataset of test cases from 305 fork ecosystems on GitHub, totaling 1,089 projects, and assessed the potential for reusing these test cases among the forked projects. By performing a manual inspection of the test cases' applicability, by transplanting the test cases, and by analyzing the causes of non-applicability, we contribute an understanding of the benefits (e.g., uncovering bugs) and of the challenges (e.g., automated code transplantation, deciding about applicability) of reusing test cases in fork ecosystems.
File Attached
Research Papers
Wed 13 Sep 2023 14:34 - 14:47 at Room D - Open Source and Software Ecosystems 2 Chair(s): Paul Grünbacher Johannes Kepler University Linz, Austria
Open source software (OSS) licenses regulate the conditions under which users can legally reuse, modify, and distribute software. However, the community uses many different OSS licenses, written in formal language, that are typically long and complicated to understand. In this paper, we conducted an online survey of 661 participants to investigate developers' perspectives and practices regarding OSS licenses. The user study revealed a genuine need for an automated tool to facilitate license understanding. Motivated by the user study and the fast growth of licenses in the community, we propose the first study of automated license summarization. Specifically, we released the first high-quality text summarization dataset and designed two tasks: license text summarization (LTS), which aims to generate a relatively short summary of an arbitrary license, and license term classification (LTC), which infers the attitude towards a predefined set of key license terms (e.g., Distribute). For these two tasks, we present LiSum, a multi-task learning method that helps developers overcome the obstacles of understanding OSS licenses. Comprehensive experiments demonstrated that the proposed joint training objective boosted performance on both tasks, surpassing state-of-the-art baselines with gains of at least 5 points w.r.t. the F1 scores of four summarization metrics, while simultaneously achieving a 95.13% micro-average F1 score for classification. We released all the datasets, the replication package, and the questionnaires for the community.
Pre-print
Industry Showcase (Papers)
Wed 13 Sep 2023 14:47 - 15:00 at Room D - Open Source and Software Ecosystems 2 Chair(s): Paul Grünbacher Johannes Kepler University Linz, Austria
Designing and optimizing deep models require managing large datasets and conducting carefully designed controlled experiments that depend on large sets of hyper-parameters and problem-dependent software/data configurations. These experiments are executed by training the model under observation with varying configurations. Since executing a typical training run can take days even on proven acceleration fabrics such as Graphics Processing Units (GPUs), avoiding human error in configuration preparation and securing the repeatability of experiments are of utmost importance. Failed training runs lead to lost time, wasted energy, and frustration. On the other hand, unrepeatable or poorly monitored/logged training runs make it exceedingly hard to track performance and lock onto a successful, well-generalizing deep model. Hence, managing large datasets and automating training are crucial for efficiently training deep models. In this paper, we present two open-source software tools that aim to achieve these goals: a Dataset Manager (DatumAid) and a Training Automation Manager (OrchesTrain). DatumAid integrates with the Computer Vision Annotation Tool (CVAT) to facilitate the management of annotated datasets. With this additional functionality, DatumAid allows users to filter labeled data, manipulate datasets, and export datasets for training purposes. The tool adopts a simple code structure while providing flexibility to users through configuration files. OrchesTrain automates the model training process by facilitating rapid preparation and training of models in the desired style for the intended tasks. Users can seamlessly integrate their models prepared in the PyTorch library into the system and leverage the full capabilities of OrchesTrain. It enables the simultaneous or separate use of the Wandb, MLflow, and TensorBoard loggers. To ensure reproducibility of the conducted experiments, all configurations and code are saved to the selected logger in an appropriate structure within a YAML file, along with the serialized model files. Both software tools are publicly available on GitHub.
Research Papers
Wed 13 Sep 2023 15:30 - 15:42 at Room D - Bug Detection Chair(s): Andreea Vescan Babes-Bolyai University
no description available
File Attached
Journal-first Papers
Wed 13 Sep 2023 15:42 - 15:54 at Room D - Bug Detection Chair(s): Andreea Vescan Babes-Bolyai University
Research Papers
Wed 13 Sep 2023 15:54 - 16:06 at Room D - Bug Detection Chair(s): Andreea Vescan Babes-Bolyai University
The SZZ algorithm has been widely used for identifying bug-inducing commits. However, it suffers from low precision, as not all deletion lines in the bug-fixing commit are related to the bug fix. Previous studies have attempted to address this issue by using static methods to filter out noise, e.g., comments and refactoring operations in the bug-fixing commit. However, these methods have two limitations. First, it is challenging to include all refactoring and non-essential change patterns in a tool, leading to the potential exclusion of relevant lines and the inclusion of irrelevant lines. Second, applying these tools might not always improve performance.
In this paper, to address the aforementioned challenges, we propose NEURALSZZ, a deep learning approach that detects the root-cause deletion lines in a bug-fixing commit and uses them as input for the SZZ algorithm. NEURALSZZ first constructs a heterogeneous graph attention network model that captures the semantic relationships between each deletion line and the other deletion and addition lines. To pinpoint the root cause of a bug, NEURALSZZ uses a learning-to-rank technique to rank all deletion lines in the commit. To evaluate the effectiveness of NEURALSZZ, we use three datasets containing high-quality bug-fixing and bug-inducing commits. The experimental results show that NEURALSZZ outperforms various baseline methods, e.g., traditional machine learning approaches and Bi-LSTM, in identifying the root cause of bugs. Moreover, by utilizing the top-ranked deletion lines and applying the SZZ algorithm, NEURALSZZ demonstrates better precision and F1-score than previous SZZ algorithms.
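The abstract does not spell out the ranking objective; a common pairwise (RankNet-style) formulation over per-line scores would look like this sketch:

    import numpy as np

    # Root-cause deletion lines should score higher than the other lines.
    def pairwise_rank_loss(scores, is_root_cause):
        """scores: (n,) model outputs for the deletion lines of one commit."""
        loss, pairs = 0.0, 0
        for i in range(len(scores)):
            for j in range(len(scores)):
                if is_root_cause[i] and not is_root_cause[j]:
                    # penalize whenever scores[i] does not exceed scores[j]
                    loss += np.log1p(np.exp(scores[j] - scores[i]))
                    pairs += 1
        return loss / max(pairs, 1)

    # at inference, np.argsort(-scores) orders the lines handed to SZZ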
Pre-print
Research Papers
Wed 13 Sep 2023 16:06 - 16:18 at Room D - Bug Detection Chair(s): Andreea Vescan Babes-Bolyai University
Real bug fixes found in open source repositories seem to be the perfect source for learning to localize and repair real bugs. Yet, the scale of existing bug fix collections is typically too small for training data-intensive neural approaches. Neural bug detectors are hence almost exclusively trained on artificial bugs, produced by mutating existing source code and thus easily obtainable at large scales. However, neural bug detectors trained on artificial bugs usually underperform when faced with real bugs. To address this shortcoming, we set out to explore the impact of training on real bug fixes at scale. Our systematic study compares neural bug detectors trained on real bug fixes, artificial bugs, and mixtures of real and artificial bugs at various dataset scales and with varying training techniques. Based on our insights gained from training on a novel dataset of 33k real bug fixes, we were able to identify a training setting capable of significantly improving the performance of existing neural bug detectors by up to 170% on simple bugs in Python. In addition, our evaluation shows that further gains can be expected by increasing the size of the real bug fix dataset or the code dataset used for generating artificial bugs. To facilitate future research on neural bug detection, we release our real bug fix dataset, trained models, and code.
Pre-print File Attached
Research Papers
Wed 13 Sep 2023 16:18 - 16:30 at Room D - Bug Detection Chair(s): Andreea Vescan Babes-Bolyai University
The fundamental asynchronous thread (java.lang.Thread) in Java is easily misused, owing to developers' limited understanding of garbage collection and the thread interruption mechanism. For example, a carelessly implemented asynchronous thread may fail to respond to interrupts in time, resulting in unexpected thread-related behaviors, especially resource leaks and waste. Currently, few works target these misuses, and related works adopt either a dynamic approach, which lacks effective inputs, or a static path-sensitive approach, whose high time consumption due to path explosion causes false negatives. We have found that the behavior of threads, and the interaction between threads and the objects they reference, can be abstracted. In this paper, we propose an event analysis approach to detect these defects in Java programs and Android apps, which focuses on the existence or the order of events to reduce false negatives. We extract the misuse-related events, comprising the thread events and the destroy events of the objects referenced by the thread. We then analyze the events with loop identification, happens-before relationship construction, and alias determination. Finally, we implement an automatic tool named Leopard and evaluate it on real-world Java programs and Android apps. Experiments show that it is efficient compared with the existing approach (misuses: 723 vs 47; time: 60s vs 30min) and also outperforms existing work in precision. A manual check indicates that Leopard is more efficient and effective than existing work. In addition, 66 issues we reported have been confirmed, and 21 of them have been fixed by developers.
File Attached
Journal-first Papers
Wed 13 Sep 2023 16:30 - 16:42 at Room D - Bug Detection Chair(s): Andreea Vescan Babes-Bolyai University
Context: Advances in defect prediction models, aka classifiers, have been validated via accuracy metrics. Effort-aware metrics (EAMs) relate to the benefits provided by a classifier in accurately ranking defective entities such as classes or methods. PofB is an EAM that relates to a user who follows a ranking of the probability that an entity is defective, as provided by the classifier. Despite the importance of EAMs, no study has investigated EAM trends and validity. Aim: The aim of this paper is twofold: 1) we reveal issues in EAM usage, and 2) we propose and evaluate a normalization of PofBs (aka NPofBs), which is based on ranking defective entities by predicted defect density. Method: We perform a systematic mapping study featuring 152 primary studies in major journals and an empirical study featuring 10 EAMs, 10 classifiers, two industrial projects, and 12 open-source projects. Results: Our systematic mapping study reveals that most studies using EAMs use only a single EAM (e.g., PofB20) and that some studies mismatched EAM names. The main result of our empirical study is that NPofBs are statistically and by orders of magnitude higher than PofBs. Conclusions: The proposed normalization of PofBs (i) increases the realism of results, as it relates to a better use of classifiers, and (ii) promotes the practical adoption of prediction models in industry, as it shows higher benefits. Finally, we provide a tool to compute EAMs to support researchers in avoiding past issues in their use.
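A worked toy computation showing why ranking by predicted defect density (NPofB) can dominate ranking by raw probability (PofB) under a fixed inspection budget; the numbers are invented for illustration:

    import numpy as np

    # PofB20: percentage of bugs found by inspecting 20% of the code following
    # the classifier's ranking; NPofB20 ranks by predicted defect density.
    def pofb(prob, loc, bugs, budget=0.20, by_density=False):
        key = prob / loc if by_density else prob
        order = np.argsort(-key)
        inspected = np.cumsum(loc[order]) <= budget * loc.sum()
        return 100.0 * bugs[order][inspected].sum() / bugs.sum()

    prob = np.array([0.9, 0.8, 0.3, 0.2])   # predicted defect probability per class
    loc = np.array([5000, 200, 300, 100])   # class size in lines of code
    bugs = np.array([1, 1, 1, 0])           # actual defects
    print(pofb(prob, loc, bugs))                     # PofB20: 0.0 (big class eats the budget)
    print(pofb(prob, loc, bugs, by_density=True))    # NPofB20: 66.7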
DOI File Attached
Research Papers
Wed 13 Sep 2023 16:42 - 16:54 at Room D - Bug Detection Chair(s): Andreea Vescan Babes-Bolyai University
Optimizing compilers are as ubiquitous as they are crucial to software development. However, bugs in compilers are not uncommon. Among the most serious are bugs in compiler optimizations, which can cause unexpected behavior in compiled binaries. Existing approaches for detecting such bugs have focused on end-to-end compiler fuzzing, which limits their ability to explore a compiler's optimizations in a targeted way.
This paper proposes FLUX (Finding bugs with LLVM IR based Unit test cross(X)overs), a fuzzer designed to generate test cases that stress compiler optimizations. Previous compiler fuzzers are overly constrained by having to construct well-formed inputs. FLUX sidesteps this constraint by using human-written unit test suites as a starting point and then selecting random combinations of them to generate new tests. We hypothesize that tests generated this way can explore new execution paths through compiler optimizations and find new bugs. Our evaluation of FLUX on LLVM indicates that it increases path coverage over the baseline LLVM unit test suite and achieves more edge coverage than previous work. Further, we demonstrate FLUX's ability to generate miscompiled and crash-producing IR through LLVM's optimizations. After a month of fuzzing, FLUX found 28 unique bugs in LLVM's active development branch. We have reported 11 of these bugs, 6 of which have been patched by LLVM developers. 22 of the bugs are crashes triggered by well-formed input programs, and 6 are miscompilation bugs that silently produce incorrect code.
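A toy sketch of the crossover idea: splice function definitions from two IR unit tests into a fresh module and feed it to the optimizer. Symbol collisions and missing declarations, which the real tool must handle, are ignored here:

    import random, re, subprocess, tempfile

    # Crude extraction of top-level function definitions from LLVM IR text.
    FUNC = re.compile(r"^define .*?^}", re.MULTILINE | re.DOTALL)

    def crossover(ir_a: str, ir_b: str) -> str:
        funcs = FUNC.findall(ir_a) + FUNC.findall(ir_b)
        return "\n\n".join(random.sample(funcs, k=min(4, len(funcs))))

    def stress_optimizer(ir_module: str):
        with tempfile.NamedTemporaryFile("w", suffix=".ll", delete=False) as f:
            f.write(ir_module)
        # a crash or verifier failure here flags a candidate optimizer bug
        return subprocess.run(["opt", "-O2", f.name, "-o", "/dev/null"],
                              capture_output=True)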
Pre-print File Attached
Journal-first Papers
Thu 14 Sep 2023 10:30 - 10:42 at Room D - Mobile Development 1 Chair(s): Jordan Samhi CISPA Helmholtz Center for Information Security
no description available
File Attached
Research Papers
Thu 14 Sep 2023 10:42 - 10:54 at Room D - Mobile Development 1 Chair(s): Jordan Samhi CISPA Helmholtz Center for Information Security
Existing Android malware detection systems primarily concentrate on detecting malware apps, leaving a research gap concerning the detection of malicious components within apps. In this work, we propose a novel approach to detect fine-granularity malicious components in Android apps and build a prototype, AMCDroid. For a given app, AMCDroid first models app behavior as a homogeneous graph based on the app's call graph and code statements. The graph is then converted to a statement tree sequence for malware detection through the AST-based Neural Network with Feature Mapping (ASTNNF) model. Finally, if the app is detected as malware, AMCDroid applies a fine-granularity malicious component detection (MCD) algorithm, based on a many-objective genetic algorithm, to the homogeneous graph to adaptively detect malicious components in the app. We evaluate AMCDroid on 95,134 samples. Compared with two other state-of-the-art methods in malware detection, AMCDroid achieves the highest performance on the test set with an F1-score of 0.9699, and shows better robustness against obfuscation. Moreover, AMCDroid is capable of detecting fine-granularity malicious components of (obfuscated) malware apps. In particular, its average F1-score exceeds that of another state-of-the-art method by 50%.
File Attached
Research Papers
Thu 14 Sep 2023 10:54 - 11:06 at Room D - Mobile Development 1 Chair(s): Jordan Samhi CISPA Helmholtz Center for Information Security
Android is the most popular operating system for mobile devices nowadays. Permissions are a very important part of the Android security architecture. Apps frequently need the users' permission, but many of them ask for it only once, when the user uses the app for the first time, and then keep and abuse the granted permissions. The desire to enhance Android permission security and the protection of users' private data drives our approach: exploring fine-grained context-sensitive permission usage analysis to identify misuses in Android apps. In this work, we propose an approach for classifying the fine-grained permission uses for each functionality of an Android app that a user interacts with. Our approach, named DROIDGEM, relies on three main technical components to provide an in-context classification of permission (mis)uses by Android apps for each functionality triggered by users: (1) static inter-procedural control-flow graphs and call graphs representing each functionality in an app that may be triggered by users' or systems' events through UI-linked event handlers, (2) graph embedding techniques converting graph structures into numerical encodings, and (3) supervised machine learning models classifying (mis)uses of permissions based on the embedding. We have implemented a prototype of DROIDGEM and evaluated it on 89 diverse apps. The results show that DROIDGEM can accurately classify whether a permission used by the functionality of an app triggered by a UI-linked event handler is a misuse, in relation to manually verified decisions, with up to 95% precision and recall. We believe that such a permission classification mechanism can be helpful in providing fine-grained permission notices in a context related to app users' actions, improving their awareness of (mis)uses of permissions and private data in Android apps.
File Attached
Research Papers
Thu 14 Sep 2023 11:06 - 11:18 at Room D - Mobile Development 1 Chair(s): Jordan Samhi CISPA Helmholtz Center for Information Security
Tool Demonstrations
Thu 14 Sep 2023 11:18 - 11:30 at Room D - Mobile Development 1 Chair(s): Jordan Samhi CISPA Helmholtz Center for Information Security
Components are the fundamental building blocks of Android applications. Different functional modules, represented by components, often rely on inter-component communication mechanisms to achieve cross-module data transfer and method invocation. Robustness testing of components is necessary to prevent component launch crashes and privacy leaks caused by unexpected input parameters. However, because of the complexity of the input parameter structure and the diversity of possible inputs, developers may overlook specific inputs that result in exceptions. At the same time, the vast input space makes efficient component testing challenging. In this paper, we present ICTDroid, an automated testing tool for Android application components that combines static parameter extraction with adaptive-strength combinatorial test generation to detect bugs with a compact test suite. Experiments show that the tool triggers 205 unique exceptions in 30 open-source applications with 1,919 test cases in 83 minutes; developers have confirmed three of the six issues we reported. The tool and a demonstration video of ICTDroid are available at https://lightningrs.github.io/tools/ICTDroid.html.
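For flavor, a strength-2 (pairwise) enumeration over hypothetical intent extras; a real combinatorial generator builds a covering array that packs many value pairs into each test case rather than listing them one by one:

    from itertools import combinations, product

    params = {                      # illustrative extras of an Android intent
        "action": ["VIEW", "EDIT", None],
        "uri": ["content://a", "file:///x", ""],
        "flag": [0, 1],
    }

    def value_pairs(params):
        """Yield every parameter-value pair a strength-2 suite must cover."""
        for n1, n2 in combinations(params, 2):
            for v1, v2 in product(params[n1], params[n2]):
                yield {n1: v1, n2: v2}

    suite = list(value_pairs(params))   # 9 + 6 + 6 = 21 pairs to cover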
File Attached
Research Papers
Thu 14 Sep 2023 11:30 - 11:42 at Room D - Mobile Development 1 Chair(s): Jordan Samhi CISPA Helmholtz Center for Information Security
Autoscaling functions provide the foundation for achieving elasticity in the modern cloud computing paradigm. They enable dynamic provisioning or de-provisioning of resources for cloud software services and applications, without human intervention, to adapt to workload fluctuations. However, autoscaling microservices is challenging due to various factors. In particular, complex, time-varying service dependencies are difficult to quantify accurately and can lead to cascading effects when allocating resources. This paper presents DeepScaler, a deep learning-based holistic autoscaling approach for microservices that focuses on coping with service dependencies to optimize service-level agreement (SLA) assurance and cost efficiency. DeepScaler employs (i) an expectation-maximization-based learning method to adaptively generate affinity matrices revealing service dependencies and (ii) an attention-based graph convolutional network to extract spatio-temporal features of microservices by aggregating neighbors' information in the graph-structured data. DeepScaler can thus capture more potential service dependencies and accurately estimate the resource requirements of all services under dynamic workloads. This allows DeepScaler to reconfigure the resources of interacting services simultaneously in one resource provisioning operation, avoiding the cascading effects caused by service dependencies. Experimental results demonstrate that our method provides a more effective autoscaling mechanism for microservices, one that not only allocates resources accurately but also adapts to changes in dependencies, significantly reducing SLA violations by an average of 41% at lower cost.
Pre-print File Attached
Journal-first Papers
Thu 14 Sep 2023 13:30 - 13:42 at Room D - Mobile Development 2 Chair(s): Jordan Samhi CISPA Helmholtz Center for Information Security
no description available
Tool Demonstrations
Thu 14 Sep 2023 13:42 - 13:54 at Room D - Mobile Development 2 Chair(s): Jordan Samhi CISPA Helmholtz Center for Information Security
Android applications are getting bigger, with an increasing number of features. However, not all features are needed by a specific user. Unnecessary features increase the attack surface and cost additional resources (e.g., storage and memory). Therefore, it is important to remove unnecessary features from Android applications. However, it is difficult for end users to fully explore apps to identify unnecessary features, and there is no off-the-shelf tool that helps users debloat apps by themselves. In this work, we propose AutoDebloater, which debloats Android applications automatically for end users. AutoDebloater is a web application that end users can access through a web browser. In particular, AutoDebloater can automatically explore an app and identify the transitions between activities. AutoDebloater then presents the Activity Transition Graph to users and asks them to select the activities they do not want to keep. Finally, AutoDebloater removes the activities selected by users from the app. We conducted a user study on five Android apps downloaded from three categories (i.e., Finance, Tools, and Navigation) in Google Play and F-Droid. The results show that users are satisfied with AutoDebloater, in terms of both the stability of the debloated apps and AutoDebloater's ability to identify features they had never noticed before. The tool is available at http://autodebloater.club, the code at https://github.com/jiakun-liu/autodebloater/, and a demonstration video at https://youtu.be/Gmz0-p2n9D4.
Research Papers
Thu 14 Sep 2023 13:54 - 14:06 at Room D - Mobile Development 2 Chair(s): Jordan Samhi CISPA Helmholtz Center for Information Security
React Native is a widely-used open-source framework that facilitates the development of cross-platform mobile apps. The framework enables JavaScript code to interact with native-side code, such as Objective-C/Swift for iOS and Java/Kotlin for Android, via a communication mechanism provided by React Native. However, previous research and tools have overlooked this mechanism, resulting in incomplete analysis of React Native app code. To address this limitation, we have developed REUNIFY, a prototype tool that integrates the JavaScript and native-side code of React Native apps into an intermediate language that can be processed by the Soot static analysis framework. By doing so, REUNIFY enables the generation of a comprehensive model of the app's behavior. Our evaluation indicates that, by leveraging REUNIFY, the Soot-based framework can improve its coverage of static analysis for the 1,007 most popular React Native Android apps, augmenting the number of lines of Jimple code by 70%. Additionally, we observed an average increase of 84% in new nodes reached in the callgraph for these apps, after integrating REUNIFY. When REUNIFY is used for taint flow analysis, an average of two additional privacy leaks were identified. Overall, our results demonstrate that REUNIFY significantly enhances the Soot-based framework's capability to analyze React Native Android apps.
Pre-print
Research Papers
Thu 14 Sep 2023 14:06 - 14:18 at Room D - Mobile Development 2 Chair(s): Jordan Samhi CISPA Helmholtz Center for Information Security
no description available
Research Papers
Thu 14 Sep 2023 14:18 - 14:30 at Room D - Mobile Development 2 Chair(s): Jordan Samhi CISPA Helmholtz Center for Information Security
no description available
Industry Showcase (Papers)
Thu 14 Sep 2023 14:30 - 14:42 at Room D - Mobile Development 2 Chair(s): Jordan Samhi CISPA Helmholtz Center for Information Security
Governments worldwide are increasingly embracing digital transformation initiatives to improve service delivery, enhance citizen engagement and participation, and achieve better outcomes. However, obtaining continuous feedback on these initiatives presents a significant challenge. This paper investigates the feasibility of leveraging mobile app reviews as a valuable source of citizen feedback on government digital services. Through an analysis of 100,146 app reviews from 129 government mobile apps in Australia, we identify several functional and usability issues, such as authentication, inaccuracy, integration, instability, and verification. Moreover, we uncover several factors influencing user satisfaction, including accuracy, convenience, dependability, efficiency, and reliability. Our research demonstrates a close alignment between user feedback and government digital transformation strategy, emphasising the potential of mobile app reviews as a cost-effective source of ongoing citizen feedback.
Research Papers
Thu 14 Sep 2023 15:30 - 15:42 at Room D - Configuration and Version Management Chair(s): Shahar Maoz Tel Aviv University
no description available
Research Papers
Thu 14 Sep 2023 15:42 - 15:55 at Room D - Configuration and Version Management Chair(s): Shahar Maoz Tel Aviv University
Software ecosystems (e.g., npm, PyPI) are the backbone of modern software development. Developers add new packages to ecosystems every day to solve new problems or provide alternative solutions, causing obsolete packages to decline in importance to the community. Packages in decline are reused less over time and may become less frequently maintained. Thus, developers usually migrate their dependencies to better alternatives. Replacing packages in decline with better alternatives requires time and effort: developers must identify the packages that need to be replaced, find the alternatives, assess the migration benefits, and finally perform the migration.
This paper proposes an approach that automatically identifies packages that need to be replaced and finds their alternatives, supported by real-world examples of open-source projects performing the suggested migrations. At its core, our approach relies on the dependency migration patterns performed in the ecosystem to suggest migrations to other developers. We evaluated our approach on the npm ecosystem and found that 96% of the suggested alternatives are accurate. Furthermore, in a survey of expert JavaScript developers, 67% indicated that they would use our suggested alternative packages in their future projects.
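A minimal sketch of mining such migration patterns from dependency change histories; the package names are well-known npm migrations used for illustration, not the paper's data:

    from collections import Counter

    # Count (removed -> added) dependency pairs across commits in the
    # ecosystem; frequent pairs become candidate migration suggestions.
    def mine_migrations(dependency_changes):
        """dependency_changes: per commit, (set of removed deps, set of added deps)."""
        patterns = Counter()
        for removed, added in dependency_changes:
            for old in removed:
                for new in added:
                    patterns[(old, new)] += 1
        return patterns

    patterns = mine_migrations([
        ({"request"}, {"axios"}),
        ({"request"}, {"node-fetch"}),
        ({"moment"}, {"dayjs"}),
    ])
    # patterns.most_common() would suggest, e.g., request -> axios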
Pre-print
Research Papers
Thu 14 Sep 2023 15:55 - 16:08 at Room D - Configuration and Version Management Chair(s): Shahar Maoz Tel Aviv University
Modern cloud services are prone to failures due to their complex architecture, making diagnosis a critical process. Site Reliability Engineers (SREs) spend hours leveraging multiple sources of data, including alerts, error logs, and domain expertise from past experience, to locate the root cause(s). These experiences are documented as natural-language text in outage reports for previous outages. However, systematically utilizing the raw yet rich semi-structured information in the reports is time-consuming. Structured information, on the other hand, such as the alerts that are often used during fault diagnosis, is voluminous and requires expert knowledge to discern. Several strategies have been proposed to use each source of data separately for root cause analysis. In this work, we build a diagnostic service called ESRO that recommends root causes and remediation for failures by systematically utilizing both structured and semi-structured sources of data. ESRO constructs a causal graph from alerts and a knowledge graph from outage reports, and merges them in a novel way to form a unified graph during training. A retrieval-based mechanism then searches the unified graph and ranks the likely root causes and remediation techniques based on the alerts fired during an outage at inference time. Not only the individual alerts, but also their respective importance in predicting an outage group, is taken into account during recommendation. We evaluated our model on several cloud service outages of a large SaaS enterprise over the course of ∼2 years and obtained an average improvement of 27% in ROUGE scores over state-of-the-art baselines when comparing the recommended root causes against the ground truth. We further establish the effectiveness of ESRO through qualitative analysis of multiple real outage examples.
File Attached
Research Papers
Thu 14 Sep 2023 16:08 - 16:21 at Room D - Configuration and Version Management Chair(s): Shahar Maoz Tel Aviv University
Research Papers
Thu 14 Sep 2023 16:21 - 16:34 at Room D - Configuration and Version Management Chair(s): Shahar Maoz Tel Aviv University
Collaborative development is critical to improving productivity. Multiple contributors work simultaneously on the same project and may change the same code locations. This can cause conflicts that require manual intervention from developers to resolve. To reduce the human effort of manual conflict resolution, researchers have proposed various automatic techniques. More recently, deep learning models have been adopted for this problem and have achieved state-of-the-art performance. However, these techniques use classification to combine existing elements of the input. Classification-based models cannot generate new tokens or produce flexible combinations, and they rest on the incorrect assumption that the fine-grained conflicts within a single coarse-grained conflict are independent.
In this work, we propose to generate the resolutions of merge conflicts from an entirely new perspective, namely generation, and we present a conflict resolution technique, MergeGen. First, we design a structural, fine-grained, conflict-aware representation of merge conflicts. Then, we leverage an encoder-decoder-based generative model to process the designed conflict representation and generate resolutions auto-regressively. We further perform a comprehensive study to evaluate the effectiveness of MergeGen. The quantitative results show that MergeGen outperforms the state-of-the-art (SOTA) techniques in both precision and accuracy. Our evaluation on multiple programming languages verifies the good generalization ability of MergeGen. In addition, the ablation study shows that the major components of our technique contribute positively to MergeGen's performance, and the granularity analysis reveals MergeGen's high tolerance for coarse-grained conflicts. Moreover, the analysis of generating new tokens further demonstrates the advantage of generative models.
Pre-print File Attached
Research Papers
Thu 14 Sep 2023 16:34 - 16:47 at Room D - Configuration and Version Management Chair(s): Shahar Maoz Tel Aviv University
Research Papers
Thu 14 Sep 2023 16:47 - 17:00 at Room D - Configuration and Version Management Chair(s): Shahar Maoz Tel Aviv University