Software should be inclusive and accessible for everyone. Are you doing enough to teach your students how to create accessible software? Are you conveying, and motivating, the principle that software should be equally available to everyone? This workshop will present the Accessibility Learning Labs (ALL), which provide instructors with complete materials to easily incorporate the imperative topic of accessibility into a wide range of existing courses, requiring no setup or preparation time from the instructor. During this workshop, we will present our educational accessibility labs, which can be included in a variety of curricula ranging from grades 9-12 to undergraduate and graduate courses. We will also discuss new, innovative, and simple ways of including accessibility in an already tight curriculum.
A precursor to many software maintenance tasks is program comprehension, where developers read the source code to understand the system’s behavior. Comprising a majority of this source code are identifier names, i.e., lexical tokens that uniquely identify entities in the code (such as classes, methods, variables, etc.). Hence, to assist with developer productivity and the quality of their work, and thereby with software maintenance costs, it is imperative that identifier names be both readable and understandable. A strong or high-quality name is one that reflects its intended behavior. This tutorial provides an overview of the importance of identifier naming in source code and past research in this field. We also examine common identifier naming structures, best practices, and semantics through examples. More specifically, we introduce the attendees to the concepts of naming evolution, grammar patterns, and linguistic anti-patterns. Additionally, we explore how readable names can be of poor quality by examining the context around the usage of terms in the name and their relationship to the surrounding code. Finally, we demonstrate tools that help developers with identifier name appraisals and recommendations.
Refactoring is a critical task in software maintenance and is usually performed to enforce better design and coding practices while coping with design defects. The Extract Method refactoring is widely used for merging duplicate code fragments into a single new method. Several studies have attempted to recommend Extract Method refactoring opportunities using different techniques, including program slicing, program dependency graph analysis, change history analysis, structural similarity, and feature extraction. However, irrespective of the method, most of the existing approaches interfere with the developer’s workflow: they require the developer to stop coding, analyze the suggested opportunities, and consider all refactoring suggestions in the entire project without focusing on the development context. To increase the adoption of the Extract Method refactoring, in this tutorial we aim to show the effectiveness of machine learning and deep learning algorithms for its recommendation while maintaining the developer's workflow. Finally, we demonstrate a case study on how the Extract Method technique can be used to address the aforementioned challenges by making Extract Method refactoring predictions more practical and actionable.
Research Papers
Tue 11 Oct 2022 10:30 - 10:50 at Gold A - Technical Session 4 - Mobile Apps I Chair(s): Jacques Klein (University of Luxembourg)

Despite being one of the largest and most popular projects, the official Android framework has only provided test cases for less than 30% of its APIs. Such a poor test case coverage rate has led to many compatibility issues that can cause apps to crash at runtime on specific Android devices, resulting in poor user experiences of both apps and the Android ecosystem. To mitigate this impact, various approaches have been proposed to automatically detect such compatibility issues. Unfortunately, these approaches have only focused on detecting signature-induced compatibility issues (i.e., a certain API does not exist in certain Android versions), leaving other equally important types of compatibility issues unresolved. In this work, we propose a novel prototype tool, JUnitTestGen, to fill this gap by mining existing Android API usage to generate unit test cases. After locating Android API usage in given real-world Android apps, JUnitTestGen performs inter-procedural backward data-flow analysis to generate a minimal executable code snippet (i.e., test case). Experimental results on thousands of real-world Android apps show that JUnitTestGen is effective in generating valid unit test cases for Android APIs. We show that these generated test cases are indeed helpful for pinpointing compatibility issues, including ones involving semantic code changes.
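The core of the test generation step above is backward data-flow slicing from an API call site. A minimal, hypothetical sketch of that idea (not JUnitTestGen's actual analysis, which is inter-procedural and works on Android bytecode) can be shown on a straight-line sequence of statements:

```python
# Toy backward slice: given statements as (defined_var, used_vars) pairs,
# keep only the statements needed to reconstruct the arguments of a
# target API call, dropping everything irrelevant to it.
def backward_slice(statements, target_index):
    needed = set(statements[target_index][1])  # vars the API call uses
    kept = [target_index]
    for i in range(target_index - 1, -1, -1):
        defined, used = statements[i]
        if defined in needed:          # this statement feeds the call
            kept.append(i)
            needed.discard(defined)
            needed.update(used)        # now its own inputs are needed
    return sorted(kept)

# e.g. statements for: a = ...; b = ...; c = f(a); api(c)
stmts = [("a", []), ("b", []), ("c", ["a"]), (None, ["c"])]
print(backward_slice(stmts, 3))  # [0, 2, 3]: 'b' is irrelevant and dropped
```

The kept statements form the minimal executable snippet that becomes the body of a generated test case.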
Journal-first Papers
Tue 11 Oct 2022 10:50 - 11:10 at Gold A - Technical Session 4 - Mobile Apps I Chair(s): Jacques Klein (University of Luxembourg)

Apps’ pervasive role in our society has led to the definition of test automation approaches to ensure their dependability. However, state-of-the-art approaches tend to generate large numbers of test inputs and are unlikely to achieve more than 50% method coverage.
In this article, we propose a strategy to achieve significantly higher coverage of the code affected by updates with a much smaller number of test inputs, thus alleviating the test oracle problem.
More specifically, we present ATUA, a model-based approach that synthesizes App models with static analysis, integrates a dynamically refined state abstraction function and combines complementary testing strategies, including (1) coverage of the model structure, (2) coverage of the App code, (3) random exploration, and (4) coverage of dependencies identified through information retrieval. Its model-based strategy enables ATUA to generate a small set of inputs that exercise only the code affected by the updates. In turn, this makes common test oracle solutions more cost-effective, as they tend to involve human effort.
A large empirical evaluation, conducted with 72 App versions belonging to nine popular Android Apps, has shown that ATUA is more effective and less effort-intensive than state-of-the-art approaches when testing App updates.
Industry Showcase
Tue 11 Oct 2022 11:10 - 11:30 at Gold A - Technical Session 4 - Mobile Apps I Chair(s): Jacques Klein (University of Luxembourg)

In the industrial setting, mobile apps undergo frequent updates to catch up with changing real-world requirements. This leads to a strong practical demand for continuous testing, i.e., obtaining quick feedback on app quality during development. However, existing automated GUI testing techniques fall short in this scenario, as they simply run an app version from scratch and do not reuse the knowledge from previous testing runs to accelerate the testing cycle. To fill this important gap, we introduce a reusable automated model-based GUI testing technique. Our key insight is that the knowledge of event-activity transitions from previous testing runs, i.e., which events can reach which activities, is valuable for guiding follow-up testing runs to quickly cover major app functionalities. To this end, we propose (1) a probabilistic model to memorize and leverage this knowledge during testing, and (2) a model-based guided testing strategy (enhanced by a reinforcement learning algorithm) to achieve faster and higher-coverage testing. We implemented our technique as an automated testing tool named Fastbot2. Our evaluation on two popular industrial apps from ByteDance, Douyin and Toutiao (with billions of user installations), shows that Fastbot2 outperforms state-of-the-art testing tools (Monkey, APE and Stoat) in both activity coverage and fault detection in the context of continuous testing. To date, Fastbot2 has been deployed in the CI pipeline at ByteDance for nearly two years, and 50.8% of the developer-fixed crash bugs were reported by Fastbot2, which significantly improves app quality. Fastbot2 has been made publicly available to benefit the community at: https://github.com/bytedance/Fastbot_Android.
To date, it has received 500+ stars on GitHub and has been used by many app vendors and individual developers to test their apps.
NIER Track
Tue 11 Oct 2022 11:30 - 11:40 at Gold A - Technical Session 4 - Mobile Apps I Chair(s): Jacques Klein (University of Luxembourg)

Users have a basic right to know how permissions are used within an Android app’s scope, and to refuse an app if granted permissions are used for activities other than their specified use, which can amount to malicious behavior. This paper proposes an approach and a vision for automatically modeling the permissions necessary for Android apps from the users’ perspective and enabling fine-grained permission controls by users, thus facilitating more well-informed and flexible permission decisions for different app functionalities, which in turn improve the security and data privacy of the app and push apps to reduce permission misuse. Our proposed approach works in two main stages. First, it looks for discrepancies between the permission uses perceivable by users and the permissions actually used by apps, via program analysis techniques. Second, it runs prediction algorithms using machine learning techniques to catch the discrepancies in permission usage and thereby alert the user to act on the data violation. We have evaluated preliminary implementations of our approach and achieved promising fine-grained permission control accuracy. In addition to the benefits for users’ privacy protection, we envision that wider adoption of the approach may also encourage better privacy-aware design by responsible bodies such as app developers, governments, and enterprises.
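The first stage described above boils down to a set difference between two permission sets. A deliberately simplified sketch (the permission names are standard Android ones; how each set is computed — UI analysis vs. program analysis — is the hard part the paper addresses):

```python
# Flag permissions the app actually exercises that a user could not
# perceive from its user-facing functionality: these are the
# discrepancies the approach reports for further prediction/alerting.
def permission_discrepancies(perceivable, actually_used):
    return sorted(set(actually_used) - set(perceivable))

perceivable = {"CAMERA", "ACCESS_FINE_LOCATION"}   # inferred from UI features
used = {"CAMERA", "ACCESS_FINE_LOCATION", "READ_CONTACTS"}  # from code analysis
print(permission_discrepancies(perceivable, used))  # ['READ_CONTACTS']
```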
Research Papers
Tue 11 Oct 2022 11:40 - 12:00 at Gold A - Technical Session 4 - Mobile Apps I Chair(s): Jacques Klein (University of Luxembourg)

Machine-learning-based Android malware detection has attracted a great deal of research work in recent years. A reliable malware dataset is critical to evaluating the effectiveness of malware detection approaches. Unfortunately, existing malware datasets used in our community are mainly labelled by taking advantage of existing anti-virus services (i.e., VirusTotal), which are prone to mislabelling. This, in turn, leads to inaccurate evaluation of malware detection techniques. Removing label noise from Android malware datasets can be quite challenging, especially at a large data scale. To address this problem, we propose an effective approach called MalWhiteout to reduce label errors in Android malware datasets. Specifically, we creatively introduce Confident Learning (CL), an advanced noise estimation approach, to the domain of Android malware detection. To combat false positives introduced by CL, we incorporate the ideas of ensemble learning and inter-app relations to achieve more robust noise detection. We evaluate MalWhiteout on a curated large-scale and reliable benchmark dataset. Experimental results show that MalWhiteout is capable of detecting label noise with over 94% accuracy even at a high noise ratio (i.e., 30%) of the dataset. MalWhiteout outperforms the state-of-the-art approach in terms of both effectiveness (8% to 218% improvement) and efficiency (70 to 249 times faster) across different settings. By reducing label noise, we further show that the performance of existing malware detection approaches can be improved.
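The core intuition of Confident Learning can be sketched in a few lines. This is an illustrative simplification, not MalWhiteout or the full CL algorithm: a sample's label is suspect when the model's (out-of-sample) probability for that label falls below the class's average self-confidence.

```python
# Toy confident-learning-style label-noise check.
# probs[i][c]: out-of-sample predicted probability that sample i is class c.
def suspected_label_errors(probs, labels):
    classes = set(labels)
    # per-class threshold: mean predicted probability over samples given that label
    threshold = {c: sum(p[c] for p, l in zip(probs, labels) if l == c)
                    / sum(1 for l in labels if l == c)
                 for c in classes}
    return [i for i, (p, l) in enumerate(zip(probs, labels))
            if p[l] < threshold[l]]

probs = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.7, 0.3]]
labels = [0, 0, 0, 1]   # sample 2 looks like class 1 but is labelled 0
print(suspected_label_errors(probs, labels))  # [2]
```

MalWhiteout then filters the false positives such a check produces using ensemble learning and inter-app relations.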
Tool Demonstrations
Tue 11 Oct 2022 10:00 - 10:30 at Ballroom A - Tool Poster Session 1

To reduce the attack surface of app source code, numerous tools focus on detecting vulnerabilities in Android apps. However, some obvious weaknesses have been highlighted in previous studies. For example, (1) most of the available tools, such as AndroBugs, MobSF, Qark, and Super, use pattern-based methods to detect vulnerabilities. Although they are effective in detecting some types, they introduce a large number of false positives, which inevitably increases the patching overhead for app developers. (2) Similarly, static taint analysis tools such as FlowDroid and IccTA present hundreds of vulnerability candidates for data leakage instead of confirmed vulnerabilities. (3) Last but not least, a relatively complete vulnerability taxonomy is missing, which would introduce many false negatives. In this paper, based on our prior knowledge in this research domain, we empirically propose a vulnerability taxonomy as the baseline and then extend AUSERA by augmenting its detection capability to 50 vulnerability types. Meanwhile, a new benchmark dataset covering all 50 vulnerability types is constructed to demonstrate the effectiveness of AUSERA. The tool and datasets are available at: https://github.com/tjusenchen/AUSERA and the demonstration video can be found at: https://youtu.be/UCiGwVaFPpY.
Research Papers
Tue 11 Oct 2022 12:10 - 12:30 at Gold A - Technical Session 4 - Mobile Apps I Chair(s): Jacques Klein (University of Luxembourg)

Inter-component communication (ICC) is a widely used mechanism in mobile apps, which enables message-based control-flow transfer and data passing between Android components. Effective ICC resolution requires precisely identifying entry points, analyzing data values of ICC fields, modeling related framework APIs, etc. Due to the various control-flow- and data-flow-related characteristics involved and the lack of oracles for real-world apps, comprehensive evaluation of ICC resolution techniques is challenging.
To fill this gap, we collect multiple-type benchmark suites with 4,104 apps, covering hand-made, open-source, and commercial ones. Considering their differences, various evaluation metrics, e.g., count-based, graph-structure-based, and reliable-oracle-based metrics, are adopted on demand. As an oracle for real-world apps is unavailable, we design a dynamic analysis approach to extract the real ICC links triggered during GUI exploration. By auditing the code implementations, we carefully check the extracted ICCs and confirm 1,680 of them to form a reliable oracle set, in which each ICC is labeled with 25 code characteristic tags. The evaluation performed on six state-of-the-art ICC resolution tools shows that 1) the completeness of static ICC resolution results on real-world apps is not satisfactory, as up to 39%-85% of ICCs are missed by the tools; 2) many wrongly reported ICCs are sent from or received by only a few components, and the graph structure information can help with their identification; 3) the efficiency of fundamental tools, like ICC resolution ones, should be optimized in both engineering and research aspects, as users may set time limits when invoking them. By investigating both the missed and wrongly reported ICCs, we discuss the strengths of the different tools for users and summarize eight common FN/FP patterns in ICC resolution for tool developers.
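Evaluating a static ICC resolver against a dynamically confirmed oracle reduces to comparing two sets of links. A minimal sketch (the component names are made up for illustration; the paper's metrics are richer than this):

```python
# Given a confirmed oracle of ICC links and a tool's statically resolved
# links, compute the fraction of real links the tool misses (false
# negatives) and list its wrongly reported links (false positives).
def icc_miss_and_fp(oracle_links, resolved_links):
    oracle, resolved = set(oracle_links), set(resolved_links)
    missed = oracle - resolved
    wrong = resolved - oracle
    return len(missed) / len(oracle), sorted(wrong)

oracle = {("MainActivity", "DetailActivity"), ("MainActivity", "Settings")}
resolved = {("MainActivity", "DetailActivity"), ("Splash", "Main")}
rate, fps = icc_miss_and_fp(oracle, resolved)
print(rate, fps)  # 0.5 [('Splash', 'Main')]
```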
Research Papers
Tue 11 Oct 2022 14:00 - 14:20 at Gold A - Technical Session 8 - Mobile Apps II Chair(s): Wei Yang (University of Texas at Dallas)

Mobile apps, an essential technology in today’s world, should provide equal access to all, including the 15% of the world’s population with disabilities. Assistive Technologies (AT), with the help of Accessibility APIs, provide alternative ways of interacting with apps for disabled users who cannot see or touch the screen. Prior studies have shown that mobile apps are prone to the "under-access" problem, i.e., a condition in which functionalities in an app are not accessible to disabled users, even with the use of ATs. We study the dual of this problem, called the "over-access" problem, defined as a condition in which an AT can be used to gain access to functionalities in an app that are inaccessible otherwise. Over-access has severe security and privacy implications, allowing one to bypass protected functionalities using ATs, e.g., using VoiceOver to read notes on a locked phone. Over-access also degrades the accessibility of apps by presenting to disabled users information that is not actually intended to be available on a screen, thereby confusing them and hindering their ability to navigate effectively. In this work, we first empirically study overly accessible elements in Android apps and define a set of conditions that can result in the over-access problem. We then present OverSight, an automated framework that leverages these conditions to detect overly accessible elements and verifies their accessibility dynamically using an AT. Our empirical evaluation of OverSight on real-world apps demonstrates OverSight’s effectiveness in detecting previously unknown security threats, workflow violations, and accessibility issues.
Research Papers
Tue 11 Oct 2022 14:20 - 14:40 at Gold A - Technical Session 8 - Mobile Apps II Chair(s): Wei Yang (University of Texas at Dallas)

Accessibility is a critical software quality affecting the more than 15% of the world’s population with some form of disability. Modern mobile platforms, i.e., iOS and Android, provide guidelines and testing tools for developers to assess the accessibility of their apps. The main focus of these testing tools is on examining a particular screen’s compliance with predefined rules derived from accessibility guidelines. Unfortunately, these tools cannot detect accessibility issues that manifest themselves in interactions with apps using assistive services, e.g., screen readers. A few recent studies have proposed assistive-service-driven testing; however, they require manually constructed inputs from developers to evaluate a specific screen or presume the availability of UI test cases. In this work, we propose an automated accessibility crawler for mobile apps, Groundhog, that explores an app with the purpose of finding accessibility issues without any manual effort from developers. Groundhog assesses the functionality of UI elements in an app with and without assistive services and pinpoints accessibility issues with an intuitive video of how to replicate them. Our experiments show Groundhog is highly effective in detecting accessibility barriers that existing techniques cannot discover. Powered by Groundhog, we conducted an empirical study on a large set of real-world apps and found new classes of critical accessibility issues that should be the focus of future work in this area.
Research Papers
Tue 11 Oct 2022 14:40 - 15:00 at Gold A - Technical Session 8 - Mobile Apps II Chair(s): Wei Yang (University of Texas at Dallas)

The proliferation of mobile applications (apps) over the past decade has imposed unprecedented challenges on end-users’ privacy. Apps constantly demand access to sensitive user information in exchange for more personalized services. These (mostly unjustified) data collection tactics have raised major concerns among mobile app users. These concerns are commonly expressed in mobile app reviews. However, privacy concerns are typically overshadowed by more generic categories of user feedback, often related to app reliability and usability. This makes extracting these concerns, whether manually or with automated methods, a challenging task. To address these challenges, in this paper, we propose an effective unsupervised approach for summarizing privacy concerns in mobile app reviews. Our analysis is conducted using a dataset of 2.6 million app reviews sampled from three different application domains. The results show that users in different application domains express their privacy concerns using different vocabulary. This domain knowledge can be leveraged to help unsupervised automated text summarization algorithms effectively generate concise summaries of privacy concerns in review collections. Our analysis in this paper is intended to help app developers quickly and accurately identify the most critical privacy concerns in their domain of operation.
Tool Demonstrations
Tue 11 Oct 2022 15:00 - 15:10 at Gold A - Technical Session 8 - Mobile Apps II Chair(s): Wei Yang (University of Texas at Dallas)

To face climate change, Android developers are urged to become green software developers. But how can carbon-efficient mobile apps be ensured at large? In this paper, we introduce ecoCode, a SonarQube plugin able to highlight code structures that are smelly from an energy perspective. It is based on a curated list of energy code smells likely to negatively impact the battery lifespan of Android-powered devices. The ecoCode plugin enables analysis of any native Android project written in Java in order to enforce green code.
Research Papers
Tue 11 Oct 2022 15:10 - 15:30 at Gold A - Technical Session 8 - Mobile Apps II Chair(s): Wei Yang (University of Texas at Dallas)

As the bridge between users and software, the Graphical User Interface (GUI) is critical to app accessibility. Scaling up the font or display size of a GUI can help improve the visual impact, readability, and usability of an app, and is frequently used by the elderly and people with vision impairment. Yet this can easily lead to scaling issues such as text truncation and component overlap, which negatively influence the acquisition of the right information and the fluent usage of the app. Previous techniques for UI display issue detection and cross-platform inconsistency detection do not work well for these scaling issues. In this paper, we propose an automated method, dVermin, for scaling issue detection, which detects the inconsistency of a view under the default and a larger display scale. The evaluation result shows that dVermin achieves 97% precision and 97% recall in issue page detection, and 84% precision and 91% recall in issue view detection, outperforming two state-of-the-art baselines by a large margin. We also evaluated dVermin with popular Android apps on F-Droid, and successfully uncovered 21 previously undetected scaling issues, with 20 of them confirmed/fixed.
Research Papers
Wed 12 Oct 2022 10:00 - 10:20 at Gold A - Technical Session 11 - Analysis and Types Chair(s): Thiago Ferreira (University of Michigan - Flint)

Unit type errors, where values with physical unit types (e.g., meters, hours) are used incorrectly in a computation, are common in today’s unmanned aerial system (UAS) firmware. Recent studies show that unit type errors represent over 10% of bugs in UAS firmware. Moreover, the consequences of unit type errors are severe, despite their simplicity: over 30% of unit type errors cause UAS crashes. This paper proposes SA4U: a practical system for detecting unit type errors in real-world UAS firmware. SA4U requires no modifications to firmware or developer annotations. It deduces the unit types of program variables by analyzing simulation traces and protocol definitions. SA4U uses the deduced unit types to identify when unit conversion errors occur. SA4U is effective: it identified 14 previously undetected errors in two popular open-source firmware projects (ArduPilot and PX4).
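Once unit types have been deduced for each variable, the checking step is conceptually simple: operations over incompatible units are flagged. A minimal sketch of that final check (not SA4U's trace-based deduction, and with made-up variable names):

```python
# Minimal unit-type check: each variable carries a physical unit, and
# adding values with different units is reported as a unit type error.
def check_add(units, a, b):
    if units[a] != units[b]:
        raise TypeError(f"unit mismatch: {a}:{units[a]} + {b}:{units[b]}")
    return units[a]

units = {"altitude": "meters", "climb_rate": "meters/second", "dt": "seconds"}
try:
    check_add(units, "altitude", "dt")   # meters + seconds: flagged
except TypeError as e:
    print(e)
```

The hard part SA4U solves is filling in the `units` table automatically, from simulation traces and protocol definitions, without annotations.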
NIER Track
Wed 12 Oct 2022 10:20 - 10:30 at Gold A - Technical Session 11 - Analysis and Types Chair(s): Thiago Ferreira (University of Michigan - Flint)

Artificial diversification of a software program can be a versatile tool in a wide range of software engineering and security scenarios. For example, randomizing implementation aspects can increase the costs for attackers, as it prevents them from benefiting from precise knowledge of the target. A promising angle for diversification is having two runs of a program on the same input yield inherently diverse instruction traces. Inspired by on-stack replacement designs for managed runtimes, in this paper we study how to transform a C program to realize continuous transfers of control and program state among function variants as they run. We discuss the technical challenges toward this goal and propose effective compiler techniques that enable the reuse of existing techniques for static diversification with no modifications. We implement our approach in LLVM and evaluate it on both synthetic and real-world subjects.
Research Papers
Wed 12 Oct 2022 10:30 - 10:50 at Gold A - Technical Session 11 - Analysis and Types Chair(s): Thiago Ferreira (University of Michigan - Flint)

The objective of pre-trained language models is to learn contextual representations of textual data. Pre-trained language models have become mainstream in natural language processing and code modeling. Using probes, a technique to study the linguistic properties of hidden vector spaces, previous work has shown that these pre-trained language models encode simple linguistic properties in their hidden representations. However, none of the previous work assessed whether these models encode the whole grammatical structure of a programming language. In this paper, we prove the existence of a "syntactic subspace", lying in the hidden representations of pre-trained language models, which contains the syntactic information of the programming language. We show that this subspace can be extracted from the model’s representations and define a novel probing method, the AST-Probe, that enables recovering the whole abstract syntax tree (AST) of an input code snippet. In our experiments, we show that this syntactic subspace exists in five state-of-the-art pre-trained language models. In addition, we highlight that the middle layers of the models are the ones that encode most of the AST information. Finally, we estimate the optimal size of this syntactic subspace and show that its dimension is substantially lower than those of the models’ representation spaces. This suggests that pre-trained language models use a small part of their representation spaces to encode syntactic information of programming languages.
Late Breaking Results
Wed 12 Oct 2022 10:50 - 11:00 at Gold A - Technical Session 11 - Analysis and Types Chair(s): Thiago Ferreira (University of Michigan - Flint)

To make concurrent programming easier, languages (e.g., Go, Rust, Clojure) have started to offer core support for message passing through channels in shared memory. However, channels also have their issues. Multiparty session types (MPST) constitute a method to make channel usage simpler. In this paper, to consolidate the best qualities of “static MPST” (early feedback, fast execution) and “dynamic MPST” (high expressiveness), we present a project that reinterprets the MPST method through the lens of gradual typing.
Research Papers
Wed 12 Oct 2022 11:00 - 11:20 at Gold A - Technical Session 11 - Analysis and Types Chair(s): Thiago Ferreira (University of Michigan - Flint)

Recently, Python has adopted gradual typing to support type checking and program documentation. However, to enjoy the benefits of gradual typing, developers have to manually write type annotations, which is recognized to be a time-consuming and error-prone task. To reduce the human effort of manual type annotation, machine-learning-based approaches have been proposed to recommend types based on code features. However, they suffer from a correctness problem: the recommended types may not pass type checking. To address the correctness problem of the machine-learning-based approaches, in this paper we present a static type recommendation approach named Stray, which recommends correct types. We evaluate the performance of Stray by comparing it against three state-of-the-art type recommendation approaches, and find that Stray outperforms these baselines by over 30% absolute improvement in both precision and recall.
Research Papers
Wed 12 Oct 2022 11:20 - 11:40 at Gold A - Technical Session 11 - Analysis and Types Chair(s): Thiago Ferreira (University of Michigan - Flint)

Partial code usually involves non-fully-qualified type names (non-FQNs) and undeclared receiving objects. Resolving the FQNs of these non-FQN types and undeclared receiving objects (referred to as type inference) is a prerequisite to effective search and reuse of partial code. Existing dictionary-lookup-based methods build a symbolic knowledge base of API names and code contexts, which involves significant compilation overhead and is sensitive to unseen API names and code context variations. In this paper, we formulate type inference as a cloze-style fill-in-the-blank language task. Building on source code naturalness, our approach trains a code masked language model (MLM) as a neural knowledge base of code elements with a novel "pre-train, prompt and predict" paradigm from raw source code. Our approach is lightweight and has minimal requirements on code compilation. Unlike existing symbolic name and context matching for type inference, our prompt-tuned code MLM packs FQN syntax and usage in its parameters and supports fuzzy neural type inference. We systematically evaluate our approach on a large amount of source code from GitHub and Stack Overflow. Our results confirm the effectiveness of our approach design and its practicality for partial code type inference. As the first of its kind, our neural type inference method opens the door to many innovative ways of using partial code.
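The cloze formulation above can be made concrete with a toy example. This is purely illustrative: the paper uses a prompt-tuned code MLM, whereas the scoring table below is a stand-in for the neural model, and the candidate FQNs are just example Java types.

```python
# Cloze-style type inference: mask the non-FQN type name in the partial
# code, then ask a (here: mocked) masked language model to fill in the
# fully qualified name for the masked slot.
def infer_fqn(partial_code, simple_name, candidate_scores):
    prompt = partial_code.replace(simple_name, "[MASK]", 1)
    # a real MLM would score candidates conditioned on the prompt;
    # candidate_scores mocks those model probabilities
    best = max(candidate_scores, key=candidate_scores.get)
    return prompt, best

code = "List<String> xs = new ArrayList<>();"
mock_scores = {"java.util.List": 0.92, "java.awt.List": 0.03}
prompt, fqn = infer_fqn(code, "List", mock_scores)
print(fqn)  # java.util.List
```

The appeal of the neural formulation is that no symbolic dictionary of API names needs to be compiled: the FQN knowledge lives in the model's parameters and matching is fuzzy rather than exact.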
Research Papers
Wed 12 Oct 2022 11:40 - 12:00 at Gold A - Technical Session 11 - Analysis and Types Chair(s): Thiago Ferreira (University of Michigan - Flint)

The Spring framework is widely used in developing enterprise web applications. Spring core technologies, such as Dependency Injection and Aspect-Oriented Programming, make development faster and easier. However, the implementation of Spring core technologies uses a lot of dynamic features. Those features impose significant challenges when using static analysis to reason about the behavior of Spring-based applications. In this paper, we propose Jasmine, a static analysis framework for Spring core technologies that extends Soot to enhance the call graph's completeness while not greatly affecting its performance. We evaluate Jasmine's completeness, precision, and performance using Spring micro-benchmarks and a suite of 18 real-world Spring programs. Our experiments show that Jasmine effectively enhances the state-of-the-art tools based on Soot and Doop to better support Spring core technologies. We also added Jasmine support to FlowDroid and discovered twelve sensitive information leakage paths in our benchmarks. Jasmine is expected to provide significant benefits for many program analysis scenarios involving Spring applications where more complete call graphs are required.
Research Papers
Wed 12 Oct 2022 13:30 - 13:50 at Gold A - Technical Session 16 - Software Vulnerabilities Chair(s): Mohamed Wiem Mkaouer (Rochester Institute of Technology)

Data science pipelines to train and evaluate models with machine learning may contain bugs just like any other code. Leakage between training and test data can lead to overestimating the model’s accuracy during offline evaluations, possibly leading to deployment of low-quality models in production. Such leakage can happen easily by mistake or by following poor practices but may be tedious and challenging to detect manually. We develop a static analysis approach to detect common forms of data leakage in data science code. Our evaluation shows that our analysis accurately detects data leakage and that such leakage is pervasive among over 100,000 analyzed public notebooks. We discuss how our static analysis approach can help both practitioners and educators, and how leakage prevention can be designed into the development process.
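A classic instance of the leakage pattern such an analysis targets is computing preprocessing statistics over the full dataset before splitting. Shown here without any ML library, as a minimal illustration:

```python
# Train/test leakage in miniature: normalizing with statistics computed
# over the full dataset lets test-set information leak into training,
# inflating offline evaluation results.
def mean_of(xs):
    return sum(xs) / len(xs)

data = [1.0, 2.0, 3.0, 100.0]
train, test = data[:3], data[3:]

leaky_center = mean_of(data)       # fitted on train AND test: leakage
safe_center = mean_of(train)       # fitted on training data only
print(leaky_center, safe_center)   # 26.5 2.0
```

The outlier in the held-out point drags the "leaky" statistic far from the training-only one, which is exactly the kind of train/test dependency a static analysis can flag by tracking which data flows into a fit step.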
Research Papers
Wed 12 Oct 2022 13:50 - 14:10 at Gold A - Technical Session 16 - Software Vulnerabilities Chair(s): Mohamed Wiem Mkaouer (Rochester Institute of Technology)

Infrastructure as Code (IaC) is the process of managing IT infrastructure via programmable configuration files (also called IaC scripts). Like other software artefacts, IaC scripts may contain security smells, which are coding patterns that can result in security weaknesses. Automated analysis tools to detect security smells in IaC scripts exist, but they focus on specific technologies such as Puppet, Ansible, or Chef. This means that when the detection of a new smell is implemented in one of the tools, it is not immediately available for the technologies supported by the other tools; the only option is to duplicate the effort.
This paper presents GLITCH, a new technology-agnostic framework that enables automated polyglot smell detection by transforming IaC scripts into an intermediate representation, on which different security smell detectors can be defined. GLITCH currently supports the detection of nine different security smells in scripts written in Puppet, Ansible, or Chef. We compare GLITCH with state-of-the-art security smell detectors. The results obtained not only show that GLITCH can reduce the effort of writing security smell analyses for multiple IaC technologies, but also that it has higher precision and recall than the current state-of-the-art tools.
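As a rough illustration of the intermediate-representation idea (the toy IR and detector below are our own sketch, not GLITCH's actual design), a smell detector written once over flat attributes applies to any technology whose scripts can be parsed into them:

```python
from dataclasses import dataclass

# Toy intermediate representation: each IaC technology is parsed into
# flat (unit, key, value) attributes, so detectors are written once.
@dataclass
class Attribute:
    unit: str   # e.g. a Puppet resource, Ansible task, or Chef block
    key: str
    value: str

SECRET_KEYS = {"password", "secret", "private_key", "token"}

def hardcoded_secret(attrs):
    """Flag attributes whose key looks secret-related and whose value
    is a literal rather than a variable reference."""
    return [a for a in attrs
            if a.key.lower() in SECRET_KEYS
            and a.value and not a.value.startswith("$")]

attrs = [
    Attribute("user{admin}", "password", "hunter2"),       # smell
    Attribute("user{admin}", "password", "$db_password"),  # variable: ok
    Attribute("file{/etc/motd}", "content", "hello"),
]
print([a.unit for a in hardcoded_secret(attrs)])  # ['user{admin}']
```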
Pre-print
Journal-first Papers
Wed 12 Oct 2022 14:10 - 14:30 at Gold A - Technical Session 16 - Software Vulnerabilities Chair(s): Mohamed Wiem Mkaouer (Rochester Institute of Technology)
no description available
Research Papers
Wed 12 Oct 2022 14:30 - 14:50 at Gold A - Technical Session 16 - Software Vulnerabilities Chair(s): Mohamed Wiem Mkaouer (Rochester Institute of Technology)
Web applications are attractive attack targets given their popularity and large number of vulnerabilities. To mitigate the threat of web vulnerabilities, an important piece of information is their affected versions. However, it is non-trivial to build accurate affected-version information, because confirming a version as affected or unaffected requires security expertise and huge effort, while there are usually hundreds of versions to examine. As a result, such information is maintained in a low-quality manner in almost every public vulnerability database. Therefore, it is extremely useful to have a tool that can automatically and precisely examine a large part (even if not all) of the software versions as affected or unaffected.
To this end, this paper proposes a vulnerability-centric approach for precise (un)affected version analysis for web vulnerabilities. The key idea is to extract the vulnerability logic from a patch and directly use that logic to check whether a version is affected. Compared with existing work, our vulnerability-centric approach helps to tolerate code changes across different software versions. We construct a high-quality dataset with 34 CVEs and 299 software versions to evaluate our approach. The results show that our approach achieves a precision of 98.15% and a recall of 85.01% in identifying (un)affected versions and significantly outperforms existing tools (e.g., V-SZZ, ReDebug, V0Finder).
Research Papers
Wed 12 Oct 2022 14:50 - 15:10 at Gold A - Technical Session 16 - Software Vulnerabilities Chair(s): Mohamed Wiem Mkaouer (Rochester Institute of Technology)
Infrastructure-as-Code (IaC) is a technology that enables the managing, provisioning, and distributing of infrastructure through code instead of manual processes. As with any piece of code, IaC scripts are not immune to defects. A recent Cloud Threat Report from Palo Alto Network's Unit 42 announced the discovery of over 199K vulnerable IaC templates. This highlights the importance of tools to prevent vulnerabilities from reaching production and to shift security left in the development pipeline. Unfortunately, we observed through a comprehensive study that security linters for IaC scripts can be very imprecise. Our approach to addressing this problem was to leverage community expertise to improve the precision of these tools. More precisely, we interviewed professional developers of Puppet scripts to collect their feedback on the root causes of imprecision in the state-of-the-art security linter for Puppet. From that feedback, we developed a new linter, adjusting 7 rules of the original linter ruleset and adding 3 new rules. We conducted a new study with 131 professional developers, showing an increase in precision from 8% to 83%. The main message of this paper is that obtaining professional feedback is feasible and highly effective, and that such feedback is key to the creation of high-precision rulesets, which is critical for the usefulness and adoption of IaC security linters.
Research Papers
Wed 12 Oct 2022 15:10 - 15:30 at Gold A - Technical Session 16 - Software Vulnerabilities Chair(s): Mohamed Wiem Mkaouer (Rochester Institute of Technology)
Cross-language vulnerability (CLV) issues are induced by cross-language invocations of vulnerable libraries. Such issues greatly increase the attack surface of Python/Java projects due to their pervasive use of C libraries. Since existing Python/Java build tools in the PyPI and Maven ecosystems fail to report vulnerable libraries written in other languages such as C, CLV issues are easily missed by developers. In this paper, we conduct the first empirical study on the status quo of CLV issues in the PyPI and Maven ecosystems. We find that 82,951 projects in these ecosystems depend, directly or indirectly, on libraries compiled from C project versions identified as vulnerable in CVE reports. Our study raises awareness of CLV issues in popular ecosystems and presents related analysis results.
The study also leads to the development of the first automated tool, Insight, which provides a turn-key solution to the identification of CLV issues in PyPI and Maven projects based on published CVE reports of vulnerable C projects. Insight automatically identifies whether a PyPI or Maven project uses a C library compiled from vulnerable C project versions in published CVE reports. It also deduces the vulnerable APIs involved by analyzing the usage of various foreign function interfaces, such as CFFI and JNI, in the concerned PyPI or Maven project. Insight achieves a high detection rate of 88.4% on a popular CLV issue benchmark. Contributing to the open-source community, we reported 226 CLV issues detected in actively maintained PyPI and Maven projects that directly depend on vulnerable C library versions. Our reports were well received and appreciated by developers, who asked about the availability of Insight. 127 reported issues (56.2%) were quickly confirmed by developers, and 74.8% of them were fixed or being fixed by popular projects such as Mongodb and Eclipse/Sumo.
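A very simplified sketch of the foreign-function-interface analysis idea (our own illustration; the library name and helper below are invented, and Insight's real implementation is far more involved): finding which shared C libraries a Python file loads through cffi.

```python
import ast

# Toy sketch: find C libraries a Python file loads via cffi's dlopen,
# the kind of FFI usage a CLV analysis has to resolve.
SRC = '''
import cffi
ffi = cffi.FFI()
lib = ffi.dlopen("libpng.so.16")
'''

def dlopened_libs(source):
    libs = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "dlopen"
                and node.args
                and isinstance(node.args[0], ast.Constant)):
            libs.append(node.args[0].value)
    return libs

print(dlopened_libs(SRC))  # ['libpng.so.16']
```

Matching the recovered library names against vulnerable versions from CVE reports would then be a separate lookup step.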
Journal-first Papers
Wed 12 Oct 2022 16:00 - 16:20 at Gold A - Technical Session 20 - Web, Cloud, Networking Chair(s): Karine Even-Mendoza (Imperial College London)
Performance models have been used in the past to understand the performance characteristics of software systems. However, the identification of performance criticalities is still an open challenge, since several system components may contribute to the overall system performance. This work combines two different areas of research to improve the process of interpreting model-based performance analysis results: (i) software performance engineering, which provides the ground for the evaluation of the system's performance; and (ii) mutation-based techniques, which nicely support experimentation with changes in performance models and contribute to a more systematic assessment of performance indices. We propose mutation operators for specific performance models, i.e., queueing networks, that resemble changes commonly made by designers when exploring the properties of a system's performance. Our mutation-based approach generates a set of mutated queueing network models, whose performance is compared to that of the original network to better understand the effect of variations in the different components of the system. A set of benchmarks is adopted to show how the technique can be used to gain a deeper understanding of the performance characteristics of software systems.
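A minimal flavor of the mutation idea on the simplest queueing model (an M/M/1 station; the operator below is an assumed illustration in the spirit of the paper, not its actual ruleset): perturb one server's service rate and compare mean response times to gauge how critical that server is.

```python
# Mean response time of an M/M/1 queue: R = 1 / (mu - lambda),
# valid only while the queue is stable (lambda < mu).
def mm1_response_time(arrival_rate, service_rate):
    assert arrival_rate < service_rate, "queue must be stable"
    return 1.0 / (service_rate - arrival_rate)

lam, mu = 8.0, 10.0
original = mm1_response_time(lam, mu)         # 1/(10-8) = 0.5
mutated = mm1_response_time(lam, mu * 0.9)    # slower server: 1/(9-8) = 1.0

# A small 10% mutation doubling the response time marks this
# server as a performance criticality.
print(mutated / original)  # 2.0
```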
Link to publication DOI
Tool Demonstrations
Tue 11 Oct 2022 10:00 - 10:30 at Ballroom A - Tool Poster Session 1
Application development for the modern Web involves sophisticated engineering workflows which include user interface aspects. These involve Web elements typically created with HTML/CSS markup and JavaScript-like languages, yielding Web documents. WebMonitor leverages requirements formally specified in a logic able to capture both the layout of visual components and how they change over time as a user interacts with them. Requirements are then verified against arbitrary web pages, providing automated support for a wide set of use cases in interaction testing and simulation. We position WebMonitor within a developer workflow, where, in case of a negative result, a visual counterexample is returned. The monitoring framework we present follows a black-box approach, and as such is independent of the underlying technologies a Web application may be developed with, as well as the browser and operating system used.
Research Papers
Wed 12 Oct 2022 16:30 - 16:50 at Gold A - Technical Session 20 - Web, Cloud, Networking Chair(s): Karine Even-Mendoza (Imperial College London)
Communication nondeterminism is one of the main reasons for the intractability of verification of message passing concurrency. In many practical message passing programs, the non-deterministic communication structure is symmetric and decomposed into epochs to obtain efficiency. Thus, symmetries and epoch structure can be exploited to reduce verification complexity. In this paper, we present a dynamic-symbolic runtime verification technique for single-path MPI programs, which (i) exploits communication symmetries by way of specifying symmetry breaking predicates (SBP) and (ii) performs compositional verification based on epochs. On the one hand, SBPs prevent the symbolic decision procedure from exploring isomorphic parts of the search space, and on the other hand, epochs restrict the size of a program needed to be analyzed at a point in time. We show that our analysis is sound and complete for single-path MPI programs on a given input. We further demonstrate that our approach leads to (i) a significant reduction in verification times and (ii) scaling up to larger benchmark sizes compared to prior trace verifiers.
Journal-first Papers
Wed 12 Oct 2022 16:50 - 17:10 at Gold A - Technical Session 20 - Web, Cloud, Networking Chair(s): Karine Even-Mendoza (Imperial College London)
In industrial environments, it is critical to find out the capacity of a system and plan for a deployment layout that meets production traffic demands. The system capacity is influenced by both the performance of the system's constituent components and the physical environment setup. In a large system, the configuration parameters of individual components give developers and load test engineers the flexibility to tune system performance without changing the source code. However, due to the large search space, estimating the capacity of the system under different configuration values is a challenging and costly process. In this paper, we propose an approach, called MLASP, that uses machine learning models to predict system key performance indicators (KPIs), such as throughput, given a set of features made of configuration parameter values, including the server cluster setup, to help engineers in capacity planning for production environments. Under the same load, we evaluate MLASP on two large-scale mission-critical enterprise systems developed by Ericsson and on one open-source system. We find that: 1) MLASP can predict the system throughput with very high accuracy: the difference between the predicted and the actual throughput is less than 1%; and 2) by using only a small subset of the training data (e.g., 3% of the entire data for the open-source system), MLASP can still predict the throughput accurately. We also document our experience of successfully integrating the approach into an industrial setting. In summary, this paper highlights the benefits and potential of using machine learning models to assist load test engineers in capacity planning.
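The core idea, learning throughput as a function of configuration values, can be sketched with synthetic data (the feature names and linear model below are our own assumptions; MLASP's actual models and features differ):

```python
import numpy as np

# Synthetic training data: each row is a configuration
# [thread_pool_size, cache_mb]; y is the measured throughput KPI.
rng = np.random.default_rng(42)
X = rng.uniform(1, 64, size=(200, 2))
y = 50 * X[:, 0] + 3 * X[:, 1] + rng.normal(0, 5, 200)

# Fit ordinary least squares with a bias term.
A = np.c_[X, np.ones(len(X))]
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predict throughput for an unseen configuration.
pred = np.array([32.0, 16.0, 1.0]) @ coef
print(round(float(pred)))  # close to 50*32 + 3*16 = 1648
```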
Link to publication DOI
Research Papers
Wed 12 Oct 2022 17:10 - 17:30 at Gold A - Technical Session 20 - Web, Cloud, Networking Chair(s): Karine Even-Mendoza (Imperial College London)
With the ever-increasing scale and complexity of online systems, incidents are gradually becoming commonplace. Without appropriate handling, they can seriously harm system availability. However, in large-scale online systems, these incidents usually drown in a slew of issues (i.e., something abnormal, though not necessarily an incident), rendering them difficult to handle. Typically, these issues result in a cascading effect across the system, and proper management of incidents depends heavily on a thorough analysis of this effect. Therefore, in this paper, we propose a method to automatically analyze the cascading effect of availability issues in online systems and extract the corresponding graph-based issue representations, incorporating both the issue symptoms and the affected service attributes. With the extracted representations, we train and utilize a graph neural network based model to perform incident detection. Then, for each detected incident, we leverage the PageRank algorithm with a flexible transition matrix design to locate its root cause. We evaluate our approach using real-world data collected from a very large instant messaging company. The results confirm the effectiveness of our approach. Moreover, our approach has been successfully deployed in the company and eases the burden on operators in the face of a flood of issues and related alert signals.
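The PageRank-based localization step can be sketched on a toy service graph (the graph, service names, and damping value are illustrative assumptions, not the paper's transition matrix design): edges point from affected services toward their suspected causes, so rank accumulates at root-cause candidates.

```python
import numpy as np

# Column-stochastic transition matrix over services [gateway, api, db]:
# column j gives where a random walk currently at service j moves next.
P = np.array([
    [0.0, 0.0, 0.0],   # -> gateway
    [0.5, 0.0, 0.0],   # -> api
    [0.5, 1.0, 1.0],   # -> db (both symptom paths point at the database)
])
d, n = 0.85, 3
rank = np.full(n, 1.0 / n)
for _ in range(100):
    rank = (1 - d) / n + d * P @ rank

services = ["gateway", "api", "db"]
print(services[int(rank.argmax())])  # db ranks highest: root-cause candidate
```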
Late Breaking Results
Wed 12 Oct 2022 17:30 - 17:40 at Gold A - Technical Session 20 - Web, Cloud, Networking Chair(s): Karine Even-Mendoza (Imperial College London)
Sustainable software engineering has received a lot of attention in recent times, as software systems account for an ever-growing slice of energy use, for example at data centers, through their utilization of the underlying infrastructure. Characterizing servers for their energy use accurately, without being intrusive, is therefore important for making sustainable software deployment choices. In this paper, we introduce ESAVE, a machine learning-based approach that leverages a small set of hardware attributes to characterize any server or virtual machine's energy use across different levels of utilization. It is based on an extensive exploration of multiple ML approaches, with a focus on a minimal set of required attributes while maintaining good accuracy. Early validations show that ESAVE has only around 12% average prediction error. The approach is non-intrusive and can therefore enable many sustainable software engineering use cases, promoting greener DevOps.
Industry Showcase
Wed 12 Oct 2022 17:40 - 18:00 at Gold A - Technical Session 20 - Web, Cloud, Networking Chair(s): Karine Even-Mendoza (Imperial College London)
Deploying applications on hybrid clouds, with computational artifacts distributed over public backends and private edges, involves several constraints. Designing such a deployment requires application architects to solve several challenges, spanning hard regulatory policy constraints as well as business policy constraints, such as enabling privacy through on-prem processing of data to the extent the business wants, backend support for privacy-enabling technologies (PET), sustainability in terms of green energy utilization, and the latency sensitivity of the application. In this paper, we propose to optimize hybrid cloud application architectures while taking all those factors into consideration, and we empirically demonstrate the effectiveness of our approach. To the best of our knowledge, this work is the first of its kind.
Research Papers
Thu 13 Oct 2022 10:00 - 10:20 at Gold A - Technical Session 24 - Human Aspects Chair(s): Silvia Abrahão (Universitat Politècnica de València)
Exploratory testing is an effective testing approach which leverages the tester's knowledge and creativity to design test cases that provoke and recognize failures at the system level from the end user's perspective. Although some principles and guidelines have been proposed to guide exploratory testing, there are no effective tools for the automatic generation of exploratory test scenarios (a.k.a. soap opera tests). Existing test generation techniques rely on specifications, program differences, and fuzzing, which are not suitable for exploratory test generation. In this paper, we propose to leverage the scenario and oracle knowledge in bug reports to generate soap opera test scenarios. We develop open information extraction methods to construct a system knowledge graph (KG) of user tasks and failures from the steps to reproduce, expected results, and observed results in bug reports. We construct a proof-of-concept KG from 25,939 bugs of the Firefox browser. Our evaluation shows the constructed KG is of high quality. Based on the KG, we create soap opera test scenarios by combining the scenarios of relevant bugs, and develop a web tool to present the created test scenarios and support exploratory testing. In our user study, 5 users found 18 bugs from 5 seed bugs in 2 hours using our tool, while the control group found only 5 bugs based on the recommended similar bugs.
Research Papers
Thu 13 Oct 2022 10:20 - 10:40 at Gold A - Technical Session 24 - Human Aspects Chair(s): Silvia Abrahão (Universitat Politècnica de València)
Emotions (e.g., Joy, Anger) are prevalent in daily software engineering (SE) activities, and are known to be significant indicators of work productivity (e.g., bug fixing efficiency). Recent studies have shown that directly applying general purpose emotion classification tools to SE corpora is not effective. Even within the SE domain, tool performance degrades significantly when trained on one communication channel and evaluated on another (e.g., Stack Overflow vs. GitHub comments). Retraining a tool with channel-specific data takes significant effort, since manually annotating large datasets of ground truth data is expensive.
In this paper, we address this data scarcity problem by automatically creating new training data using a data augmentation technique. Based on an analysis of the types of errors made by popular SE-specific emotion recognition tools, we specifically target our data augmentation strategy in order to improve the performance of emotion recognition. Our results show an average improvement of 9.3% in micro F1-Score for three existing emotion classification tools (ESEM-E, EMTk, SEntiMoji) when trained with our best augmentation strategy.
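A bare-bones flavor of text data augmentation (the synonym table and replacement rule below are our own toy illustration; the paper's strategy is targeted at specific error types):

```python
import random

# Hand-picked synonym table (illustrative only): each mapped word can be
# swapped to create an extra training sample with the same emotion label.
SYNONYMS = {"great": ["awesome", "excellent"], "bug": ["defect", "fault"]}

def augment(sentence, rng):
    """Return a copy of the sentence with known words replaced
    by randomly chosen synonyms."""
    words = [rng.choice(SYNONYMS[w]) if w in SYNONYMS else w
             for w in sentence.split()]
    return " ".join(words)

rng = random.Random(7)
print(augment("great fix for this bug", rng))
```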
Pre-print
NIER Track
Thu 13 Oct 2022 10:40 - 10:50 at Gold A - Technical Session 24 - Human Aspects Chair(s): Silvia Abrahão (Universitat Politècnica de València)
The logic behind design decisions, called design rationale, is very valuable. In the past, researchers have tried to automatically extract and exploit this information, but prior techniques are only applicable to specific contexts and there is insufficient progress on an end-to-end rationale information extraction pipeline. Here we outline a path towards such a pipeline that leverages several Machine Learning (ML) and Natural Language Processing (NLP) techniques. Our proposed context-independent approach, called Kantara, produces a knowledge graph representation of decisions and of their rationales, which considers their historical evolution and traceability. We also propose inconsistency checking mechanisms to ensure the correctness of the extracted information and the coherence of the development process. We conducted a preliminary evaluation of our proposed approach on a small example sourced from the Linux Kernel, which shows promising results.
Pre-print
Journal-first Papers
Thu 13 Oct 2022 10:50 - 11:10 at Gold A - Technical Session 24 - Human Aspects Chair(s): Silvia Abrahão (Universitat Politècnica de València)
Requirements Engineering in industry is expertise-driven, heavily manual, and centered around various types of requirement specification documents being prepared and maintained. These specification documents come in diverse formats and vary depending on whether the document is a business requirement document, functional specification, interface specification, client specification, and so on. These diverse specification documents embed crucial product knowledge, such as the functional decomposition of the domain into features, feature hierarchy, feature types and their specific characteristics, dependencies, and business context. Moreover, in a product development scenario, thousands of pages of requirement specification documentation are created over the years. Comprehending functionality and its associated context from large volumes of specification documents is a highly complex task. To address this problem, we propose to digitalize requirement specification documents into processable models. This paper discusses the salient aspects involved in the digitalization of requirements knowledge from diverse requirement specification documents. It proposes an AI engine for the automatic transformation of diverse text-based requirement specifications into machine-processable models using NLP techniques and the generation of context-sensitive user stories. The paper describes the key requirement abstractions and concepts essential in an industrial scenario, the conceptual meta-model, and the DizReq engine (an AI engine for digitalizing requirements) implementation for automatically transforming diverse requirement specifications into user stories embedding the business context.
The evaluation results from digitalizing specifications of an IT product suite are discussed: mean feature extraction efficiency is 40 features/file, mean user story extraction efficiency is 71 user stories/file, feature extraction accuracy is 94%, and requirement extraction accuracy is 98%.
Link to publication DOI
Journal-first Papers
Thu 13 Oct 2022 11:10 - 11:30 at Gold A - Technical Session 24 - Human Aspects Chair(s): Silvia Abrahão (Universitat Politècnica de València)
Neural networks are getting increasingly popular thanks to their exceptional performance in solving many real-world problems. At the same time, they have been shown to be vulnerable to attacks, difficult to debug, and subject to fairness issues. To improve people's trust in the technology, it is often necessary to provide some human-understandable explanation of a neural network's decisions, e.g., why is it that my loan application is rejected whereas hers is approved? That is, a stakeholder would be interested in minimizing the chances of not being able to explain a decision consistently, and would like to know how often and how easily the decisions of a neural network can be explained before it is deployed.
In this work, we provide two measurements on the decision explainability of neural networks. Afterwards, we develop algorithms for evaluating the measurements of user-provided neural networks automatically. We evaluate our approach on multiple neural network models trained on benchmark datasets. The results show that existing neural networks’ decisions often have low explainability according to our measurements. This is in line with the observation that adversarial samples can be easily generated through adversarial perturbation, which are often hard to explain. Our further experiments show that the decisions of the models trained with robust training are not necessarily easier to explain, whereas decisions of the models retrained with samples generated by our algorithms are easier to explain.
Link to publication DOI
Journal-first Papers
Thu 13 Oct 2022 11:30 - 11:50 at Gold A - Technical Session 24 - Human Aspects Chair(s): Silvia Abrahão (Universitat Politècnica de València)
Software engineers are crowdsourcing answers to their everyday challenges on Q&A forums (e.g., Stack Overflow) and, more recently, in public chat communities such as Slack, IRC, and Gitter. Many software-related chat conversations contain valuable expert knowledge that is useful both for mining to improve programming support tools and for readers who did not participate in the original chat conversations. However, most chat platforms and communities do not contain built-in quality indicators (e.g., accepted answers, vote counts). Therefore, it is difficult to identify conversations that contain useful information for mining or reading, i.e., conversations of post hoc quality. In this paper, we investigate automatically detecting developer conversations of post hoc quality from public chat channels. We first describe an analysis of 400 developer conversations that indicates potential characteristics of post hoc quality, followed by a machine learning-based approach for automatically identifying conversations of post hoc quality. Our evaluation on 2,000 annotated Slack conversations in four programming communities (python, clojure, elm, and racket) indicates that our approach can achieve a precision of 0.82, recall of 0.90, F-measure of 0.86, and MCC of 0.57. To our knowledge, this is the first automated technique for detecting developer conversations of post hoc quality.
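The Matthews correlation coefficient (MCC) reported above is less familiar than precision and recall; it can be computed directly from a confusion matrix (the counts below are an invented toy example, not the paper's data):

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts;
    ranges from -1 (total disagreement) to +1 (perfect prediction)."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# Toy confusion matrix: 90 true positives, 60 true negatives,
# 20 false positives, 10 false negatives.
print(round(mcc(tp=90, tn=60, fp=20, fn=10), 2))  # 0.66
```

Unlike accuracy, MCC stays meaningful on imbalanced classes, which is why it is often reported alongside F-measure.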
Link to publication
no description available
[Workshop] HCSE&CS '22
Fri 14 Oct 2022 08:30 - 10:00 at Gold A - Session 1 Chair(s): Mohan Baruwal Chhetri (CSIRO's Data61), Xiao Liu (School of Information Technology, Deakin University)
no description available