Workshop
Karen Renaud is a Scottish computing scientist working on all aspects of Human-Centred Security and Privacy. She was educated at the Universities of Pretoria (South Africa) and Glasgow. She is particularly interested in deploying behavioural science techniques to improve security behaviours, and in encouraging end-user privacy-preserving behaviours. Her research approach is multi-disciplinary, essentially learning from other, more established fields and harnessing their methods and techniques to understand and influence cyber security behaviours. She is Professor Extraordinaire at the University of South Africa and Visiting Professor at Rhodes University in South Africa.
Workshop
Workshop
Zimbabwe has introduced a number of currency reforms since the beginning of 2019. Of note are the Statutory Instruments 142 (SI142) and SI16 enacted in March 2019, which replaced the multi-currency regime with the Zimbabwe dollar as legal tender, rendering all other currencies illegal. These events constitute a unique opportunity to carry out a “natural experiment”. We used the opportunity to explore the impact of the currency changes on Zimbabweans’ protective PIN-shielding behaviours. Our study spanned both the multi-currency and Zimbabwe Dollar (ZWL) currency periods. We observed far less PIN shielding than in comparable studies in other countries, and we suggest that this is due to uncertainty and reduced currency values. In essence, this was a natural experiment into the impact of currency changes on citizen behaviours, and what we discovered revealed an unexpected real-life risk homeostasis response.
Workshop
Software applications continue to challenge user privacy when users interact with them. Privacy practices (e.g., Data Minimisation (DM), Privacy by Design (PbD) or the General Data Protection Regulation (GDPR)) and related “privacy engineering” methodologies exist and provide clear instructions for developers to build privacy-preserving software systems. However, those practices and methodologies are not yet common practice in the software development community. There has been no previous research focused on developing educational interventions such as serious games to enhance software developers’ coding behaviour. Therefore, this research proposes a game design framework as an educational tool for software developers to improve their (secure) coding behaviour, so they can develop privacy-preserving software applications that people can use. The elements of the proposed framework were incorporated into a gaming application scenario that enhances software developers’ coding behaviour through motivation. The proposed work not only enables the development of privacy-preserving software systems but also helps the software development community put privacy guidelines and engineering methodologies into practice.
Workshop
Malicious users can exploit undiscovered software vulnerabilities, i.e., undiscovered weaknesses in software, to cause serious consequences, such as large-scale data breaches. A systematic approach that synthesizes strategies used by security testers can aid practitioners in identifying latent vulnerabilities. The goal of this paper is to help practitioners identify software vulnerabilities by categorizing vulnerability discovery strategies using open source software (OSS) bug reports. We categorize vulnerability discovery strategies by applying qualitative analysis to 312 OSS bug reports. Next, we quantify the frequency and evolution of the identified strategies by analyzing 1,632 OSS bug reports collected from five software projects spanning 2009 to 2019. The five software projects are Chrome, Eclipse, Mozilla, OpenStack, and PHP.
We identify four vulnerability discovery strategies: diagnostics, malicious payload construction, misconfiguration, and pernicious execution. For Eclipse and OpenStack, the most frequently used strategy is diagnostics, where security testers inspect source code and build/debug logs. For three web-related software projects namely, Chrome, Mozilla, and PHP, the most frequently occurring strategy is malicious payload construction i.e., creating malicious files, such as malicious certificates and malicious videos.
Workshop
Smart Buildings are defined as the “buildings of the future” and use the latest Internet of Things (IoT) technologies to automate building operations and services, both to increase operational efficiency and to maximize occupant comfort while reducing environmental impact. However, these “smart devices” – typically used with default settings – also enable the capture and sharing of a variety of sensitive and personal data about the occupants. Given the non-intrusive nature of most IoT devices, individuals have little awareness of what data is being collected about them and what happens to it downstream. Even if they are aware, convenience overrides any privacy concerns, and they do not take sufficient steps to control the data collection, thereby exacerbating the privacy paradox. At the same time, IoT-based building automation systems reveal highly sensitive insights about building occupants by synthesizing data from multiple sources, and this can be exploited by device vendors and unauthorised third parties. To address the tension between privacy and convenience in an increasingly connected world, we propose a user-centric informed consent model to handle the privacy paradox in Smart Buildings. The proposed model aims to (a) inform and increase user awareness about how their data is being collected and used, (b) provide fine-grained visibility into privacy compliance and infringement by IoT devices, and (c) recommend corrective actions through nudges (or soft notifications). We illustrate how our proposed consent model works through a use case scenario of a voice-activated smart office.
Workshop
Context: Insecure coding patterns (ICPs), such as hard-coded passwords, can be inadvertently introduced into infrastructure as code (IaC) scripts, providing malicious users the opportunity to attack provisioned computing infrastructure. As performing code reviews is resource-intensive, a characterization of co-located ICPs, i.e., ICPs that occur together in a script, can help practitioners prioritize their review efforts and mitigate ICPs in IaC scripts. Objective: The goal of this paper is to help practitioners prioritize code review efforts for infrastructure as code (IaC) scripts by conducting an empirical study of co-located insecure coding patterns in IaC scripts. Methodology: We conduct an empirical study with 1,613, 2,764, and 2,845 Puppet scripts collected, respectively, from three organizations, namely Mozilla, OpenStack, and Wikimedia. We apply association rule mining to identify co-located ICPs in IaC scripts. Results: We observe 17.9%, 32.9%, and 26.7% of the scripts to include co-located ICPs, respectively, for Mozilla, OpenStack, and Wikimedia. The most frequent co-located ICP category is hard-coded secret combined with suspicious comment. Conclusion: Practitioners can prioritize code review efforts for IaC scripts by reviewing scripts that include co-located ICPs.
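The pairwise association-rule step described above can be sketched in a few lines: record the set of ICP categories per script, then compute support and confidence for each co-located pair. This is a minimal illustration with hypothetical data, not the study's actual mining pipeline or thresholds.

```python
from itertools import combinations
from collections import Counter

# Hypothetical per-script ICP category sets (illustrative only)
scripts = [
    {"hard-coded secret", "suspicious comment"},
    {"hard-coded secret", "suspicious comment", "empty password"},
    {"suspicious comment"},
    {"hard-coded secret", "use of HTTP"},
]

def co_located_rules(scripts, min_support=0.25, min_confidence=0.5):
    """Mine pairwise association rules (lhs -> rhs) over ICP sets."""
    n = len(scripts)
    item_counts = Counter()   # how many scripts contain each ICP
    pair_counts = Counter()   # how many scripts contain each ICP pair
    for icps in scripts:
        for icp in icps:
            item_counts[icp] += 1
        for a, b in combinations(sorted(icps), 2):
            pair_counts[(a, b)] += 1
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules

rules = co_located_rules(scripts)
```

With this toy data, the pair (hard-coded secret, suspicious comment) surfaces with the highest support, mirroring the paper's headline finding.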
Workshop
With the exponential growth of social media platforms like Twitter, a seemingly vast amount of data has become available for mining to draw conclusions about various topics, including awareness systems requirements. The exchange of health-related information on social media has been heralded as a new way to explore information-seeking behaviour during pandemics and design and develop awareness systems that address the public’s information need. Online datasets such as Twitter, Google Trends and Reddit have several advantages over traditional data sources, including real-time data availability, ease of access, and reduced cost.
In this paper, to explore pandemic awareness systems’ requirements, we utilize data from the large accessible database of tweets and Reddit posts to explore the contextual patterns and temporal trends in Canadians’ information-seeking behaviour during the COVID-19 pandemic. To validate our inferences and to understand how Google searches regarding COVID-19 were distributed throughout the course of the pandemic in Canada, we complement our Twitter and Reddit data with data collected through Google Trends, which tracks the popularity of specific search terms on Google. Our results show that social media content contains useful technical information and can be used as a source to explore the requirements of pandemic awareness systems.
Workshop
Many individuals find it difficult to meet their personal health and wellbeing goals. To address this, mobile health (mHealth) apps aimed at improving individuals’ diet, physical activity, sleep and mental health, are emerging at an increasing pace. These modern digital health interventions are a promising solution to promote behaviour change and help people maintain better health while controlling rising healthcare expenditures. However, the real-life effects of mHealth apps are often overshadowed by high dropout rates, with the loss of participants during the intervention seeming to be the rule rather than the exception. We designed the mHealth4U model as a sample-based study of user requirement and design preferences to enable more targeted health and wellbeing self-management. This model is aimed at understanding how life-changing digital health interventions can be designed and what software design components might increase consumers’ acceptance, adherence and continuous engagement. We put forward three hypotheses in terms of designing an mHealth app that is consumer-centred: consumers prefer (1) self-management mHealth apps that target multiple key health and wellbeing dimensions, (2) intelligent recommendations, and (3) behaviour change support delivered precisely where, when and how it is needed most. We design the mHealth4U model around the 3U cyber security design components (user, usage and usability) and validate the hypotheses through a randomised sampling test with 114 participants. The results of this research will inform the design of a next-generation of digital health interventions capable of supporting the end-users to achieve the healthy lifestyle they deserve.
Workshop
For software projects, significant delays can result in heavy penalties, which may cause project costs to exceed their budgets. As a consequence, employees, i.e., software developers, are often requested to work overtime in order to reduce or even eliminate the delays. Overtime payments may then be introduced, and excessive overtime payments can easily swallow company profits, possibly even leading to a serious overdraft. Hence, software managers need to decide who should work overtime, and how much, in order to control the cost. This means that it is important to investigate how to reduce or eliminate the overall penalties by taking multiple concurrent software projects into account. In practice, there are normally a number of available employees with the same or similar skills and domain knowledge on other similar concurrent projects, though with different levels of skill proficiency. Rescheduling those employees with appropriate overtime may therefore yield a solution that reduces or eliminates the penalties of delayed software projects. Since this kind of scheduling is a typical NP-hard problem, a novel generic strategy is proposed to help select appropriate employees and determine how much overtime should be assigned to the delayed activities. The new strategy combines the features of the ant colony optimization algorithm and a Tabu strategy, and includes four rules to reduce the search space. A set of comprehensive generic experiments is carried out to evaluate the performance of the proposed strategy in a general manner. In addition, three real-world software project instances are also utilized to evaluate our strategy. The results demonstrate that our strategy is effective and outperforms other representative strategies that have been applied successfully to software project scheduling.
Workshop
Agile software development welcomes changes throughout software development, but this implies that agile teams face several dilemmas: when to respond to a change, how to respond, and how to manage the change. Our current understanding of, and support for, agile teams during such change management is very limited.
Psychological behavioral change models can be used to better understand the behavior of agile teams. Combining our understanding of agile teams and practices with a review of behavior change models, we propose several avenues for studying behavior and behavioral changes in agile teams. Our proposed interdisciplinary approach provides a much needed avenue to acknowledge and address the psychological and behavioral aspects of the humans central to the software engineering process, ultimately assisting with their well-being and productivity.
Workshop
Developers construct bioinformatics software to automate crucial analysis and research related to biological science. However, challenges while developing bioinformatics software can prohibit advancement in biological science research. Through a human-centric systematic analysis, we can identify challenges related to bioinformatics software development and envision future research directions. From our qualitative analysis with 221 Stack Overflow questions, we identify six categories of challenges: file operations, searching genetic entities, defect resolution, configuration management, sequence alignment, and translation of genetic information. To mitigate the identified challenges we envision three research directions that require synergies between bioinformatics and automated software engineering: (i) automated configuration recommendation using optimization algorithms, (ii) automated and comprehensive defect categorization, and (iii) intelligent task assistance with active and reinforcement learning.
Workshop
I am a Professor of Computer Science at the Institute of Computer Science (IAM) of the University of Bern, where I founded the Software Composition Group in 1994. I am a co-author of over 200 publications and of the open-source books Object-Oriented Reengineering Patterns and Pharo by Example.
The advances in machine learning (ML) have stimulated the integration of its capabilities into software systems and services. However, there is a tangible gap between software engineering and machine learning practices, which is delaying the progress of intelligent service development. Software organisations are devoting effort to adjusting their software engineering processes and practices to facilitate the integration of machine learning models. Machine learning researchers, likewise, are focusing on improving the interpretability of machine learning models to support overall system robustness. Our research focuses on bridging this gap through a methodology that evaluates the robustness of machine learning-enabled software systems. In particular, this methodology will automate the evaluation of the robustness properties of software systems against dataset shift problems in ML. It will also feature a notification mechanism that facilitates the debugging of ML components.
Automated test generators, such as search based software testing (SBST) techniques, replace the tedious and expensive task of manually writing test cases. SBST techniques are effective at generating tests with high code coverage. However, is high code coverage sufficient to maximise the number of bugs found? We argue that SBST needs to be focused to search for test cases in defective areas rather than in non-defective areas of the code in order to maximise the likelihood of discovering bugs. Defect prediction algorithms give useful information about the bug-prone areas in software. Therefore, we formulate the objective of this thesis: \textit{Improve the bug detection capability of SBST by incorporating defect prediction information}. To achieve this, we devise two research objectives, i.e., 1) develop a novel approach (SBST$_{CL}$) that allocates the time budget to classes based on the likelihood of classes being defective, and 2) develop a novel strategy (SBST$_{ML}$) to guide the underlying search algorithm (i.e., genetic algorithm) towards the defective areas in a class. Through empirical evaluation on 434 real reported bugs in the Defects4J dataset, we demonstrate that our novel approach, SBST$_{CL}$, is significantly more efficient than state-of-the-art SBST when given a tight time budget in a resource-constrained environment.
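The first research objective, allocating the time budget by defect likelihood, can be sketched as a proportional split. This is a minimal illustration under assumed class names and probabilities; the thesis's actual SBST$_{CL}$ allocation scheme may differ.

```python
def allocate_budget(defect_probs, total_budget):
    """Split a total SBST time budget (e.g. seconds) across classes in
    proportion to their predicted probability of being defective.
    Illustrative sketch only."""
    total = sum(defect_probs.values())
    if total == 0:
        # fall back to a uniform split when no class looks defective
        share = total_budget / len(defect_probs)
        return {cls: share for cls in defect_probs}
    return {cls: total_budget * p / total for cls, p in defect_probs.items()}

# Hypothetical defect-prediction scores for three classes
budgets = allocate_budget({"Parser": 0.6, "Lexer": 0.3, "Utils": 0.1}, 600)
```

Under a tight budget, classes predicted as likely defective receive most of the search time, which is the intuition behind focusing SBST on bug-prone code.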
Whenever software components process personal or private data, appropriate data protection mechanisms are mandatory. An essential factor in achieving trust and transparency is not to give preference to a single party, but to make it possible to audit the data usage in an unbiased way. The scenario in mind for this contribution involves (i) users bringing in sensitive data they want to be kept safe, (ii) service developers building software-based services whose Intellectual Properties (IPs) they desire to protect, and (iii) platform providers wanting to be trusted and to be able to rely on the component developers' integrity. The authors see these interests as an insufficiently resolved field of tension, which can be relaxed by representing software components at a suitable level of transparency, giving insights without exposing every detail.
Formal specifications in \textsf{Alloy} are organized around user-defined data domains, associated with \emph{signatures}, with almost no support for built-in datatypes. This minimality in the built-in datatypes provided by the language is one of its main features, as it contributes to the automated analyzability of models. One of the few built-in datatypes available in \textsf{Alloy} specifications is integers, whose SAT-based treatment allows only for small bit-widths. In many contexts, where relational datatypes dominate, the use of integers may be auxiliary, e.g., in cardinality constraints and other features. However, as the applications of \textsf{Alloy} increase, e.g., with the use of the language and its tool support as a backend engine for different analysis tasks, the provision of efficient support for numerical datatypes becomes a necessity. In this work, we present our current preliminary approach to providing an efficient, scalable and user-friendly extension to \textsf{Alloy} with arithmetic support for numerical datatypes. Our implementation allows for arithmetic with varying precisions, and is implemented via standard \textsf{Alloy} constructions, thus resorting to SAT solving for resolving arithmetic constraints in models.
Software reliability is a primary concern in the construction of software, and thus a fundamental component in the definition of software quality. Analyzing software reliability requires a \emph{specification} of the intended behavior of the software under analysis. Unfortunately, software many times lacks such specifications. This issue seriously diminishes the analyzability of software with respect to its reliability. Thus, finding novel techniques to capture the intended software behavior in the form of specifications would allow us to exploit them for automated reliability analysis.
Our research focuses on the application of learning techniques to automatically distinguish correct from incorrect software behavior. The aim here is to decrease the developer’s effort in specifying oracles, and instead \emph{generating} them from actual software behaviors.
The design and development of production-grade microservice backends is a tedious and error-prone task. In particular, backends must be capable of handling all Functional Requirements (FRs) and all Non-Functional Requirements (NFRs) (like security), including all operational requirements (like monitoring). This becomes even more difficult when there are many clients with different roles, linked to diverse (non-)functional requirements, and many existing services are involved, which have to consider these in a consistent way. In this paper, we present a model-driven approach that automatically generates client-specific production-grade backends, incorporating previously expressed architectural knowledge, from an interpretable specification of the targeted APIs and the NFRs.
There was a time when developers wrote a document containing the steps needed to install new software and handed it over to the Operations folk to deploy. We’ve come a long way from those days as we’ve moved away from infrequent, manual deployment to frequent, high quality, automated deployments. This talk covers the evolution of automation in Continuous Integration and Continuous Deployment, the problems we have solved and the new kind of challenges we face, as we move from on-prem installations to the cloud.
Concolic execution and fuzzing are two complementary coverage-based testing techniques. How to achieve the best of both remains an open challenge. To address this research problem, we propose and evaluate Legion. Legion uses a variation of the Monte Carlo tree search (MCTS) framework from the AI literature to treat automated test generation as a problem of sequential decision-making under uncertainty. Its best-first search strategy provides a principled way to learn the most promising program states to investigate at each search iteration, based on observed rewards from previous iterations. Legion incorporates a form of directed fuzzing that we call approximate path-preserving fuzzing (APPFuzzing) to investigate program states selected by MCTS. APPFuzzing serves as the Monte Carlo simulation technique and is implemented by extending prior work on constrained sampling. We evaluate Legion against competitors in Test-Comp 2020, as well as measuring its sensitivity to hyperparameters, demonstrating its effectiveness on a wide variety of input programs.
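The MCTS-style best-first selection described above is commonly driven by a score that balances exploiting high-reward states against exploring rarely visited ones. The sketch below uses the standard UCB1 formula as a generic stand-in; Legion's actual scoring and reward definition may differ.

```python
import math

def ucb1(node_reward, node_visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score: average observed reward plus an exploration bonus
    that grows for rarely visited states (generic MCTS ingredient)."""
    if node_visits == 0:
        return float("inf")  # always try unvisited states first
    exploit = node_reward / node_visits
    explore = c * math.sqrt(math.log(parent_visits) / node_visits)
    return exploit + explore

def select_best(children, parent_visits):
    """Pick the child program state with the highest UCB1 score."""
    return max(children, key=lambda s: ucb1(s["reward"], s["visits"], parent_visits))

# Hypothetical program states with accumulated rewards and visit counts
states = [{"name": "A", "reward": 8, "visits": 10},
          {"name": "B", "reward": 1, "visits": 1},
          {"name": "C", "reward": 0, "visits": 0}]
chosen = select_best(states, parent_visits=11)
```

The unvisited state wins here, illustrating how the search systematically covers new program states before refining its reward estimates.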
In IT projects, the traditional testing method is to create test scenarios/cases according to the business flow, manually defining and testing whether the output value is correct for each input value. Massive test data is also required to perform the tests accurately. So far, traditional IT projects have gone live after manually iterating through scenario/case tests that are empirically considered sufficient. However, due to time and resource limitations, this method cannot consider all possible cases in the real world. Thus, we cannot test all scenarios and eliminate all potential defects through this traditional testing method. As a result, unexpected errors or exceptional situations often occur even after the system goes live, which can lead to severe failures. This paper demonstrates a real transaction-based automatic testing solution called ‘PerfecTwin’ with real-world examples. PerfecTwin aims to overcome the limitations of the traditional manual test by automatically verifying the TO-BE system against the AS-IS system’s actual transactions, which allows defects to be eliminated before the system goes live.
In object-oriented programming, a method is pure if calling the method does not change object states that exist in the pre-states of the method call. Pure methods are widely used in automatic techniques, including test generation, compiler optimization, and program repair. Due to source code dependencies, it is infeasible to completely and accurately identify all pure methods. Instead, existing techniques such as ReImInfer are designed to identify a subset of accurately classified pure methods and mark the other methods as unknown. In this paper, we designed and implemented MetPurity, a learning-based tool for pure method identification. Given all methods in a project, MetPurity labels a training set via automatic program analysis and builds a binary classifier (implemented with the random forest classifier) based on the training set. This classifier is used to predict the purity of all the other methods (i.e., the unknown ones) in the same project. A preliminary evaluation on four open-source Java projects shows that MetPurity can provide a list of identified pure methods with a low error rate. Applying MetPurity to EvoSuite can increase the number of killed mutants in EvoSuite's test generation. A demo video of this tool can be found at https://youtu.be/Ac3cmjn4CCs/; the prototype and evaluation data are available at http://cstar.whu.edu.cn/p/metpurity/.
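To make the labeling step concrete, the sketch below shows a deliberately naive purity check on Java-like method bodies: a method is flagged impure if its text appears to write a field or call a known mutator. This is purely illustrative; MetPurity labels its training set with real program analysis, not textual heuristics, and then trains a random forest on the labeled methods.

```python
import re

def looks_pure(method_body):
    """Conservative textual heuristic for a Java-like method body:
    impure if it seems to assign a field or call a common mutator.
    Illustrative stand-in only, not MetPurity's actual analysis."""
    field_write = re.search(r"\bthis\.\w+\s*=", method_body)
    mutator_call = re.search(r"\.(add|remove|set\w*|put|clear)\s*\(", method_body)
    return field_write is None and mutator_call is None
```

A real analysis must also handle aliasing, calls into impure callees, and writes through parameters, which is exactly why a learned classifier over analysis-derived features is attractive.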
Symbolic execution is a well established technique for software testing and analysis. However, scalability continues to be a challenge, both in terms of constraint solving cost and path explosion. In this work, we present a novel approach for symbolic execution, which can enhance its scalability by aggressively prioritizing execution paths that are already known to be feasible, and deferring all other paths. We evaluate our technique on nine applications, including SQLite3, make and tcpdump and show it can achieve higher coverage for both seeded and non-seeded exploration.
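The core idea of prioritizing known-feasible paths while deferring the rest can be sketched with a priority queue. This is a simplified illustration of feasibility-based prioritization, not the paper's exact exploration policy.

```python
import heapq

def explore(paths):
    """Best-first exploration: pop paths already known to be feasible
    before deferred ones (simplified sketch of the prioritization idea)."""
    queue = []
    for order, (path, feasible) in enumerate(paths):
        # feasible paths get priority 0; all others are deferred (priority 1)
        heapq.heappush(queue, (0 if feasible else 1, order, path))
    while queue:
        _, _, path = heapq.heappop(queue)
        yield path

# Hypothetical paths tagged with known feasibility
visited = list(explore([("p1", False), ("p2", True), ("p3", True)]))
```

Deferring paths with unknown feasibility postpones expensive constraint solving until the cheap, already-feasible frontier is exhausted.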
Most programming languages support foreign language interoperation that allows developers to integrate multiple modules implemented in different languages into a single multilingual program. While utilizing various features from multiple languages expands expressivity, differences in language semantics require developers to understand the semantics of multiple languages and their interoperation. Because current compilers do not support compile-time checking for interoperation, they do not help developers avoid interoperation bugs. Similarly, active research on static analysis and bug detection has been focusing on programs written in a single language.
In this paper, we propose a novel approach to analyze multilingual programs statically. Unlike existing approaches that extend a static analyzer for a host language to support analysis of foreign function calls, our approach extracts semantic summaries from programs written in guest languages using a modular analysis technique, and performs a whole-program analysis with the extracted semantic summaries. To show practicality of our approach, we design and implement a static analyzer for multilingual programs, which analyzes JNI interoperation between Java and C. Our empirical evaluation shows that the analyzer is scalable in that it can construct call graphs for large programs that use JNI interoperation, and useful in that it found 74 genuine interoperation bugs in real-world Android JNI applications.
Jupyter notebooks—documents that contain live code, equations, visualizations, and narrative text—are now among the most popular means to compute, present, discuss, and disseminate scientific findings. In principle, Jupyter notebooks should make it easy to reproduce and extend scientific computations and their findings; but in practice, this is not the case. The individual code cells in a Jupyter notebook can be executed in any order, with identifier usages preceding their definitions and results preceding their computations. In a sample of 936 published notebooks that would be executable in principle, we found that 73% of them would not be reproducible with straightforward approaches, requiring humans to infer (and often guess) the order in which the authors created the cells.
In this paper, we present an approach to automatically recover possible execution orders for the cells of a given notebook. Our Osiris prototype takes a notebook as input and outputs the possible execution schemes that reproduce the exact notebook results. In our sample, Osiris was able to reconstruct such schemes for 82.23% of all executable notebooks, which is more than three times better than the state of the art; the resulting reordered code is valid program code and thus available for further testing and analysis.
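The essence of recovering an execution order can be sketched as a topological sort over def-use dependencies between cells: a cell must run after the cells that define the names it uses. The cell contents and dependency extraction below are hypothetical; Osiris's actual analysis is considerably more sophisticated (e.g., it validates candidate schemes against the recorded outputs).

```python
# Toy cells as (names defined, names used); a real tool would extract
# these via AST analysis of each cell's code.
cells = {
    "c1": ({"df"}, set()),
    "c2": ({"clean"}, {"df"}),
    "c3": (set(), {"clean"}),
}

def execution_order(cells):
    """Return one def-before-use ordering of notebook cells
    (topological sort over name dependencies)."""
    defs = {name: cid for cid, (defined, _) in cells.items() for name in defined}
    order, visiting, done = [], set(), set()

    def visit(cid):
        if cid in done:
            return
        if cid in visiting:
            raise ValueError("cyclic dependency between cells")
        visiting.add(cid)
        for name in cells[cid][1]:      # every name this cell uses
            if name in defs:
                visit(defs[name])       # schedule its defining cell first
        visiting.discard(cid)
        done.add(cid)
        order.append(cid)

    for cid in cells:
        visit(cid)
    return order

order = execution_order(cells)
```

When several orderings satisfy the dependencies, each is a candidate execution scheme that must then be checked against the notebook's stored results.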
Block-based programming languages like Scratch support learners by providing high-level constructs that hide details and by preventing syntactically incorrect programs. Questions nevertheless frequently arise: Is this program satisfying the given task? Why is my program not working? To support learners and educators, automated program analysis is needed for answering such questions. While adapting existing analyses to process blocks instead of textual statements is straightforward, the domain of programs controlled by block-based languages like Scratch is very different from traditional programs:
In Scratch multiple actors, represented as highly concurrent programs, interact on a graphical stage, controlled by user inputs, and program statements mainly determine visual aspects and movement.
Analyzing such programs is further hampered by the absence of clearly defined semantics, often resulting from ad-hoc decisions made by the implementers of the programming environment.
To enable program analysis, we define the semantics of Scratch using an intermediate language. Based on this intermediate language, we implement the Bastet program analysis framework for Scratch programs, using concepts from abstract interpretation and software model checking.
Like Scratch, Bastet is based on Web technologies, written in TypeScript, and can be executed using NodeJS or even directly in a browser.
Evaluation on 272 programs written by children suggests that Bastet offers a practical solution for analysis of Scratch programs.
Recent probabilistic model checking techniques can verify reliability and performance properties of software systems affected by parametric uncertainty. This involves modelling the system behaviour using \emph{interval Markov chains}, i.e., Markov models with transition probabilities or rates specified as intervals. These intervals can be updated continually using Bayesian estimators with imprecise priors, enabling the verification of the system properties of interest at runtime. However, Bayesian estimators are slow to react to sudden changes in the actual value of the estimated parameters, yielding inaccurate intervals and leading to poor verification results after such changes. To address this limitation, we introduce an efficient interval change-point detection method, and we integrate it with a state-of-the-art Bayesian estimator with imprecise priors. Our experimental results show that the resulting end-to-end Bayesian approach to change-point detection and estimation of interval Markov chain parameters handles effectively a wide range of sudden changes in parameter values, and supports runtime probabilistic model checking under parametric uncertainty.
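The interval estimation underlying this setup can be sketched with the standard two-outcome imprecise Dirichlet (imprecise Beta) model: given a prior strength s and runtime counts, the posterior lower and upper means bound the estimated transition probability. This is a textbook estimator shown for illustration; the paper's contribution is the change-point mechanism layered on top of such an estimator, which is not shown here.

```python
def interval_estimate(successes, trials, s=2.0):
    """Imprecise-Beta posterior interval for a transition probability:
    prior strength s, counts observed at runtime. The interval narrows
    as more transitions are observed (textbook sketch only)."""
    lower = successes / (trials + s)
    upper = (successes + s) / (trials + s)
    return lower, upper

# Hypothetical runtime counts: 45 successful transitions out of 50
lo, hi = interval_estimate(successes=45, trials=50)
```

Because old counts dominate the posterior, a sudden shift in the true probability is absorbed only slowly, which is precisely the lag the paper's change-point detector is designed to remove.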
Charts are commonly used for data visualization. Generating a chart usually involves performing data transformations, including data pre-processing and aggregation. These tasks can be cumbersome and time-consuming, even for experienced data scientists. Reproducing existing charts can also be a challenging task when information about data transformations is no longer available.
In this paper, we tackle the problem of recovering data transformations from existing charts. Given an input table and a chart, our goal is to automatically recover the data transformation program underlying the chart. We divide our approach into four steps: (1) data extraction, (2) candidate generation, (3) candidate ranking, and (4) candidate disambiguation. We implemented our approach in a tool called UnchartIt and evaluated it on a set of 50 benchmarks from Kaggle. Experimental results show that UnchartIt successfully ranks the correct data transformation program in the top-10 in 92% of the instances. To disambiguate those programs, we use our new interactive disambiguation procedure, which successfully returns the correct program on 98% of the ambiguous instances by asking on average fewer than 2 questions to the user.
Panel
Diversity in the exhibited behavior of a given system is a desirable characteristic in a variety of application contexts. Synthesis of conformant implementations often proceeds by discovering witnessing Skolem functions, which are traditionally deterministic. In this paper, we present a novel Skolem extraction algorithm to enable synthesis of witnesses with random behavior and demonstrate its applicability in the context of reactive systems. The synthesized solutions are guaranteed by design to meet the given specification, while exhibiting a high degree of diversity in their responses to external stimuli. Case studies demonstrate how our proposed framework unveils a novel application of synthesis in model-based fuzz testing to generate fuzzers of competitive performance to general-purpose alternatives, as well as the practical utility of synthesized controllers in robot motion planning problems.
This paper aims to shed light on how loops are used in smart contracts. Towards this goal, we study various syntactic and semantic characteristics of loops used in over 20,000 Solidity contracts deployed on the Ethereum blockchain, with the goal of informing future research on program analysis for smart contracts. Based on our findings, we propose a small domain-specific language (DSL) that can be used to summarize common looping patterns in Solidity. To evaluate what percentage of smart contract loops can be expressed in our proposed DSL, we also design and implement a program synthesis toolchain called Solis that can synthesize loop summaries in our DSL. Our evaluation shows that at least 56% of the analyzed loops can be summarized in our DSL, and 81% of these summaries are exactly equivalent to the original loop.
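The flavor of such loop summaries can be sketched as follows (the Fold construct and the equivalence check by evaluation are illustrative inventions; Solis works on Solidity loops and establishes equivalence rather than merely testing it):

```python
import operator

# A hypothetical mini-DSL construct: Fold(op, init) summarizes the loop
# "for x in xs: acc = op(acc, x)" starting from init.
class Fold:
    def __init__(self, op, init):
        self.op, self.init = op, init

    def eval(self, xs):
        acc = self.init
        for x in xs:
            acc = self.op(acc, x)
        return acc

def original_loop(xs):
    # Stands in for a concrete looping pattern (e.g., a Solidity loop
    # summing array elements).
    total = 0
    for x in xs:
        total += x
    return total

summary = Fold(operator.add, 0)

# Check the summary against the loop on sample inputs.
for xs in ([], [1, 2, 3], [5, -2, 7, 0]):
    assert summary.eval(xs) == original_loop(xs)
print("summary agrees with the loop on all samples")
```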
Machine Learning models from other fields, like Computational Linguistics, have been transplanted to Software Engineering tasks, often quite successfully. Yet a transplanted model's initial success at a given task does not necessarily mean it is well-suited for the task. In this work, we examine a common example of this phenomenon: the conceit that "software patching is like language translation". We demonstrate empirically that there are subtle but critical distinctions between sequence-to-sequence models and translation models: while program repair benefits greatly from the former, general modeling architecture, it actually suffers from design decisions built into the latter, both in terms of translation accuracy and diversity. Given these findings, we demonstrate how a more principled approach to model design, based on our empirical findings and general knowledge of software development, can lead to better solutions. We propose several models that leverage the same machine learning tools, but whose architecture, data presentation, and metrics are specialized for the software engineering task. The resulting models perform significantly better than the studied baseline, especially on metrics more appropriate for program repair. Overall, our results demonstrate the merit of studying the intricacies of machine-learned models in software engineering: not only can this help elucidate potential issues that may be overshadowed by increases in accuracy; it can also help innovate on these models to raise the state of the art further. We will publicly release our replication data and materials at https://github.com/ARiSE-Lab/Patch-as-translation.
Dynamic code, i.e., code that is created or modified at runtime, is ubiquitous in today’s world. The behavior of dynamic code can depend on the logic of the dynamic code generator in subtle and non-obvious ways, e.g., JIT compiler bugs can lead to exploitable vulnerabilities in the resulting JIT-compiled code. Existing approaches to program analysis do not provide adequate support for reasoning about such behavioral relationships. This paper takes a first step in addressing this problem by describing a program representation and a new notion of dependency that allows us to reason about dependency and information flow relationships between the dynamic code generator and the generated dynamic code. Experimental results show that analyses based on these concepts are able to capture properties of dynamic code that cannot be identified using traditional program analyses.
The Android platform provides a number of sophisticated concurrency mechanisms for the development of apps. These concurrency mechanisms, while powerful, are quite difficult for mobile developers to master properly. In fact, prior studies have shown concurrency issues, such as event-race defects, to be prevalent among real-world Android apps. In this paper, we propose a flow-, context-, and thread-sensitive static analysis framework, called ER Catcher, for the detection of event-race defects in Android apps. ER Catcher introduces a new type of summary function aimed at modeling the concurrent behavior of methods in both Android apps and libraries. In addition, it leverages a novel, statically constructed Vector Clock for rapid analysis of happens-before relations. Altogether, these design choices enable ER Catcher not only to detect event-race defects with a substantially higher degree of accuracy, but also to do so in a fraction of the time required by the existing state-of-the-art technique.
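The happens-before reasoning that vector clocks enable can be sketched as follows (a generic textbook construction, not ER Catcher's statically constructed variant):

```python
# Minimal vector-clock sketch of happens-before reasoning between an
# event-posting thread and an event-handler thread.

class VectorClock:
    def __init__(self, n_threads, tid):
        self.clock = [0] * n_threads   # one counter per thread
        self.tid = tid

    def tick(self):                    # local event
        self.clock[self.tid] += 1

    def send(self):                    # post an event: ship our clock
        self.tick()
        return list(self.clock)

    def receive(self, other):          # handle an event: merge clocks
        self.clock = [max(a, b) for a, b in zip(self.clock, other)]
        self.tick()

def happens_before(c1, c2):
    """c1 -> c2 iff c1 <= c2 pointwise and c1 != c2."""
    return all(a <= b for a, b in zip(c1, c2)) and c1 != c2

t0, t1 = VectorClock(2, 0), VectorClock(2, 1)
t0.tick()                       # thread 0 writes a shared field
write_clock = list(t0.clock)
msg = t0.send()                 # thread 0 posts an event
t1.receive(msg)                 # thread 1's handler runs
read_clock = list(t1.clock)     # thread 1 reads the shared field

# The write is ordered before the read, so this pair is not a race.
assert happens_before(write_clock, read_clock)
```

Two accesses whose clocks are incomparable under `happens_before` are concurrent, and a write in such a pair signals a potential event race.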
In recent years, If-This-Then-That (IFTTT) services have become more and more popular. Many platforms, such as Zapier, IFTTT.com, and Workato, provide such services, which allow users to create workflows with "triggers" and "actions" by using Web Application Programming Interfaces (APIs). However, the number of IFTTT recipes on these platforms grows much more slowly than the number of Web APIs. This is because substantial human effort is still required to build and deploy IFTTT recipes on these platforms. To address this problem, in this paper, we present a tool to automatically generate the IFTTT mashup infrastructure. The proposed tool provides 5 REST APIs, which can automatically generate triggers, rules, and actions in AWS, and create a workflow XML to describe an IFTTT mashup by connecting the triggers, rules, and actions. This workflow XML is automatically sent to Fujitsu RunMyProcess (RMP) to set up and execute the IFTTT mashup. The proposed tool, together with its associated method and procedure, enables an end-to-end solution for automatically creating, deploying, and executing IFTTT mashups in a few seconds, which can greatly reduce the development cycle and cost for new IFTTT mashups.
Ask Me Anything
Miryung Kim is a Professor in the Department of Computer Science at the University of California, Los Angeles, and the Director of the Software Engineering and Analysis Laboratory. She is known for her research on code clones: code duplication detection, management, and removal solutions. Recently, she has taken a leadership role in defining the emerging area of software engineering for data science.
She received her B.S. in Computer Science from the Korea Advanced Institute of Science and Technology in 2001 and her M.S. and Ph.D. in Computer Science and Engineering from the University of Washington in 2003 and 2008, respectively. She ranked No. 1 among all engineering and science students at KAIST in 2001 and received the Korean Ministry of Education, Science, and Technology Award, the highest honor given to an undergraduate student in Korea. She has received various awards, including an NSF CAREER award, a Google Faculty Research Award, and an Okawa Foundation Research Award. She was previously an assistant professor at the University of Texas at Austin. Her research is funded by the National Science Foundation, the Air Force Research Laboratory, Google, IBM, Intel, the Okawa Foundation, and Samsung, and she is currently leading a 4.9M Office of Naval Research project on synergistic software customization. She is a Program Co-Chair of the 35th IEEE International Conference on Software Maintenance and Evolution and an Associate Editor of IEEE Transactions on Software Engineering and Empirical Software Engineering.
With the increasing application of deep learning (DL) models in many safety-critical scenarios, effective and efficient DL testing techniques are much in demand to improve the quality of DL models. One of the major challenges is the data gap between the training data used to construct the models and the testing data used to evaluate them. To bridge the gap, testers aim to collect an effective subset of inputs from the testing contexts, with limited labeling effort, for retraining DL models.
To assist this subset selection, we propose Multiple-Boundary Clustering and Prioritization (MCP), a technique that clusters test samples into the boundary areas of a DL model's multiple decision boundaries and prioritizes them so that samples are selected evenly from all boundary areas, ensuring enough useful samples for reconstructing each boundary.
To evaluate MCP, we conduct an extensive empirical study with three popular DL models and 33 simulated testing contexts. The experimental results show that, compared with state-of-the-art baseline methods, MCP is significantly more effective, as measured by the improved quality of the retrained DL models, and also more efficient in terms of time cost.
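A minimal sketch of the clustering-and-prioritization idea, assuming only a model's softmax outputs are available (the boundary assignment via top-2 classes and the priority score below are our illustrative reading of the abstract, not MCP's exact definitions):

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed model outputs: softmax probabilities of 100 unlabeled test
# samples over 3 classes (a stand-in for querying a real DL model).
probs = rng.dirichlet(np.ones(3), size=100)

# Cluster each sample into the boundary area between its top-2 classes.
sorted_probs = np.sort(probs, axis=1)
top2 = np.argsort(probs, axis=1)[:, -2:]
boundary = [tuple(sorted(t)) for t in top2]
# Priority: ratio of second-largest to largest probability; values near
# 1 mean the sample sits close to a decision boundary.
priority = sorted_probs[:, -2] / sorted_probs[:, -1]

clusters = {}
for i, b in enumerate(boundary):
    clusters.setdefault(b, []).append(i)

# Spend the labeling budget evenly across all boundary areas, taking
# the highest-priority samples within each area.
budget_per_boundary = 5
selected = []
for b, idxs in clusters.items():
    idxs.sort(key=lambda i: -priority[i])
    selected += idxs[:budget_per_boundary]
print(len(clusters), "boundary areas,", len(selected), "samples selected")
```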
Deep learning (DL) has recently started to be applied in many applications, e.g., autonomous driving, speech recognition, and natural language processing. Yet, many state-of-the-art DL systems are still vulnerable to adversarial examples, which hinders their adoption in safety- and security-critical scenarios. While some recent progress has been made in analyzing the robustness of feed-forward neural networks, the robustness analysis for stateful DL systems, such as recurrent neural networks (RNNs), still remains largely uncharted. In this paper, we propose MARBLE, a model-based approach for quantitative robustness analysis of real-world RNN-based DL systems. MARBLE first profiles RNNs using training data to collect information on how models behave under controlled perturbations. We then build a probabilistic model to compactly characterize the behavioral robustness of RNNs, through abstraction. Furthermore, we propose a refinement algorithm to iteratively derive a precise abstraction which enables accurate quantification of the robustness measures. We evaluate the effectiveness of MARBLE on both LSTM and GRU models trained separately with three popular natural language datasets. The results demonstrate that (1) our refinement algorithm is more efficient in deriving an accurate abstraction than the random strategy, and (2) MARBLE enables quantitative robustness analysis, delivering better efficiency, accuracy, and scalability than state-of-the-art techniques.
Machine learning software is being used in many applications (finance, hiring, admissions, criminal justice) with a huge social impact. But sometimes the behavior of this software is biased, and it discriminates based on sensitive attributes such as sex and race. Prior works concentrated on finding and mitigating bias in ML models. A recent trend is using instance-based, model-agnostic explanation methods such as LIME [1] to find bias in model predictions. Our work concentrates on finding the shortcomings of current bias measures and explanation methods. We show how our proposed method, based on K nearest neighbors, can overcome those shortcomings and find the underlying bias of black-box models. Our results are more trustworthy and helpful for practitioners. Finally, we describe our future framework combining explanation and planning to build fair software.
Link to Publication: https://arxiv.org/abs/2007.02893
Deep learning (DL) has been applied widely, and the quality of DL systems has become crucial, especially for safety-critical applications. Existing work mainly focuses on the quality analysis of DL models, but pays little attention to the underlying libraries and frameworks on which all DL models depend. In this work, we propose Audee, a novel approach for testing DL libraries and localizing bugs. Audee adopts a search-based approach and implements three different mutation strategies to generate diverse test cases by exploring combinations of model structures, parameters, weights, and inputs. Audee is able to detect three types of bugs: logic bugs, crashes, and Not-a-Number (NaN) bugs. In particular, for logic bugs, Audee adopts a cross-reference check to detect behavioral inconsistencies across multiple frameworks (e.g., TensorFlow and PyTorch), which indicate potential bugs in their implementations. For NaN bugs, Audee adopts a heuristic-based approach to generate DNNs that tend to output outliers (i.e., values that are too large or too small), which are likely to produce NaN values. Furthermore, Audee leverages a causal-testing-based technique to localize the layers and parameters that cause inconsistencies or bugs. To evaluate the effectiveness of our approach, we applied Audee to four DL frameworks, i.e., TensorFlow, CNTK, Theano, and PyTorch. In total, we generated 260 models covering 25 widely used APIs in the four frameworks. The results demonstrate that Audee is effective in detecting inconsistencies, crashes, and NaN bugs. In total, 26 unique unknown bugs were discovered, and seven of them have already been confirmed by the developers.
Neural networks are becoming a popular tool for solving many real-world problems such as object recognition and machine translation, thanks to their exceptional performance as end-to-end solutions. However, neural networks are complex black-box models, which hinders humans from interpreting and consequently trusting them in making critical decisions. Towards interpreting neural networks, several approaches have been proposed to extract simple deterministic models from neural networks. The results are not encouraging (e.g., low accuracy and limited scalability), fundamentally due to the limited expressiveness of such simple models.
In this work, we propose an approach to extract probabilistic automata for interpreting an important class of neural networks, i.e., recurrent neural networks. Our work distinguishes itself from existing approaches in two important ways. One is that probability is used to compensate for the loss of expressiveness. This is inspired by the observation that human reasoning is often 'probabilistic'. The other is that we adaptively identify the right level of abstraction so that a simple model is extracted in a request-specific way. We conduct experiments on several real-world datasets using state-of-the-art architectures including GRU and LSTM. The result shows that our approach significantly improves existing approaches in terms of accuracy or scalability. Lastly, we demonstrate the usefulness of the extracted models through detecting adversarial texts.
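The core extraction step, building transition probabilities from abstract state traces, can be sketched as follows (the hand-made traces stand in for clustered RNN hidden-state sequences; the adaptive abstraction refinement is omitted):

```python
from collections import Counter, defaultdict

# Abstract state traces over states {S0, S1, S2}. In the paper these
# come from abstracting RNN hidden states; here they are hand-made.
traces = [
    ["S0", "S1", "S2"],
    ["S0", "S1", "S1", "S2"],
    ["S0", "S2"],
    ["S0", "S1", "S2"],
]

# Count observed transitions between consecutive abstract states.
counts = defaultdict(Counter)
for t in traces:
    for a, b in zip(t, t[1:]):
        counts[a][b] += 1

# Normalize per source state to get a probabilistic automaton.
automaton = {
    s: {t: c / sum(nxt.values()) for t, c in nxt.items()}
    for s, nxt in counts.items()
}
print(automaton["S0"])  # S0 -> S1 with prob 0.75, S0 -> S2 with prob 0.25
```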
Data augmentation techniques, which increase the amount of training data by adding realistic transformations, are used in machine learning to improve accuracy. Recent studies have demonstrated that data augmentation techniques improve the robustness of image classification models on open datasets; however, it has yet to be investigated whether these techniques are effective on industrial datasets. In this study, we investigate the feasibility of data augmentation techniques for industrial use. We evaluate data augmentation techniques in image classification and object detection tasks using an industrial in-house graphical user interface dataset. As the results indicate, the genetic algorithm-based data augmentation technique outperforms two random-based methods in terms of the robustness of the image classification model. In addition, through this evaluation and interviews with the developers, we learned the following two lessons: data augmentation techniques should (1) maintain the training speed to avoid slowing development and (2) be extensible to a variety of tasks.
Networking Event
Computing systems are becoming ever more complex, with decisions increasingly often based on deep learning components. A wide variety of applications are being developed, many of them safety-critical, such as self-driving cars and medical diagnosis. Since deep learning is unstable with respect to adversarial perturbations, there is a need for rigorous software development methodologies that encompass machine learning components. This lecture will describe progress with developing automated verification and testing techniques for deep neural networks to ensure safety and robustness of their decisions with respect to input perturbations. The techniques exploit Lipschitz continuity of the networks and aim to approximate, for a given set of inputs, the reachable set of network outputs in terms of lower and upper bounds, in an anytime manner, with provable guarantees. We develop novel algorithms based on feature-guided search, games, global optimisation and Bayesian methods, and evaluate them on state-of-the-art networks. The lecture will conclude with an overview of the challenges in this field.
Reactive synthesis is an automated procedure to obtain a correct-by-construction reactive system from its temporal logic specification. GR(1) is an expressive assume-guarantee fragment of LTL that enables efficient synthesis and has been recently used in different contexts and application domains.
In this work we present just-in-time synthesis (JITS) for GR(1). Rather than constructing a controller at synthesis time, we compute next states during system execution, and only when they are required. We prove that JITS does not compromise the correctness of the synthesized system execution. We further show that the basic algorithm can be extended to enable several variants.
We have implemented JITS in the Spectra synthesizer. Our evaluation, comparing JITS to existing tools over known benchmark specifications, shows that JITS reduces (1) total synthesis time, (2) the size of the synthesis output, and (3) the loading time for system execution, all while having little to no effect on system execution performance.
JavaScript was initially designed for client-side programming in web browsers, but its engine is now embedded in various kinds of host software. Despite this popularity, since the JavaScript semantics is complex, especially due to its dynamic nature, understanding and reasoning about JavaScript programs are challenging tasks. Thus, researchers have made several attempts to define the formal semantics of JavaScript based on ECMAScript, the official JavaScript specification. However, the existing approaches are manual, labor-intensive, and error-prone, and all of their formal semantics target ECMAScript 5.1 (ES5.1, 2011) or its earlier versions. Therefore, they are not suitable for understanding modern JavaScript language features introduced since ECMAScript 6 (ES6, 2015). Moreover, ECMAScript has been updated annually since ES6, which has already produced five releases after ES5.1. To alleviate the problem, we propose JISET, a JavaScript IR-based Semantics Extraction Toolchain. It is the first tool that automatically synthesizes parsers and AST-IR translators directly from a given language specification, ECMAScript. For syntax, we develop a parser generation technique with lookahead parsing for BNF_ES, a variant of the extended BNF used in ECMAScript. For semantics, JISET synthesizes AST-IR translators using forward-compatible rule-based compilation. Compile rules describe how to convert each step of the abstract algorithms, written in a structured natural language, into IR_ES, an intermediate representation that we designed for ECMAScript. For the four most recent ECMAScript versions, JISET automatically synthesized parsers for all versions and compiled 95.03% of the algorithm steps on average. After we completed the missing parts manually, the extracted core semantics of the latest ECMAScript (ES10, 2019) passed all 18,064 applicable tests.
Using this first formal semantics of modern JavaScript, we found nine specification errors in ES10, which were all confirmed by the Ecma Technical Committee 39. Furthermore, we showed that JISET is forward compatible by applying it to nine feature proposals ready for inclusion in the next ECMAScript, which let us find four errors in the BigInt proposal.
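The rule-based compilation idea can be illustrated in miniature (the rules and the target IR below are invented for illustration; JISET's actual compile rules and intermediate representation are far richer):

```python
import re

# Rule-based compilation sketch: turn steps of a spec-style abstract
# algorithm (structured natural language) into a tiny IR. Each rule is
# a pattern over a step plus an emitter for the corresponding IR line.
rules = [
    (re.compile(r"Let (\w+) be the result of adding (\w+) and (\w+)\."),
     lambda m: f"{m[1]} = add {m[2]} {m[3]}"),
    (re.compile(r"Return (\w+)\."),
     lambda m: f"return {m[1]}"),
]

def compile_step(step):
    for pattern, emit in rules:
        m = pattern.fullmatch(step)
        if m:
            return emit(m)
    raise NotImplementedError(step)   # step not covered by any rule

algorithm = [
    "Let sum be the result of adding x and y.",
    "Return sum.",
]
ir = [compile_step(s) for s in algorithm]
print(ir)  # ['sum = add x y', 'return sum']
```

Steps that no rule matches are reported rather than silently dropped, mirroring how the remaining 5% of algorithm steps had to be completed manually.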
Regular expressions (regexes) are widely used in different fields of computer science such as programming languages, string processing, and databases. However, existing tools for synthesizing or repairing regexes were not designed to be resilient to Regex Denial of Service (ReDoS) attacks. Specifically, if a regex has super-linear (SL) worst-case complexity, an attacker could provide carefully crafted inputs to launch ReDoS attacks. Therefore, in this paper, we propose a programming-by-example framework, FlashRegex, for generating anti-ReDoS regexes by either synthesizing or repairing from given examples. It is the first framework that integrates regex synthesis and repair with awareness of ReDoS vulnerabilities. We present novel algorithms to deduce anti-ReDoS regexes by reducing the ambiguity of these regexes and by using Boolean Satisfiability (SAT) or Neighborhood Search (NS) techniques. We evaluate FlashRegex against five related state-of-the-art tools. The evaluation results show that our work can effectively and efficiently generate anti-ReDoS regexes from given examples, and also reveal that existing synthesis and repair tools have neglected the ReDoS vulnerabilities of regexes. Specifically, the existing synthesis and repair tools generated up to 394 ReDoS-vulnerable regexes, taking from a few seconds to more than one hour, while FlashRegex generated no SL regexes and finished within around five seconds. Furthermore, the evaluation results on ReDoS-vulnerable regex repair show that FlashRegex has better capability than existing repair tools and even human experts, repairing 4 more regexes to be ReDoS-invulnerable without trimming and resorting, highlighting the usefulness of FlashRegex in terms of generality, automation, and user-friendliness.
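A ReDoS vulnerability of the kind discussed here is easy to reproduce with a backtracking regex engine (the patterns below are standard textbook examples, not drawn from FlashRegex's benchmarks):

```python
import re
import time

# A ReDoS-vulnerable regex: the nested quantifiers make a backtracking
# engine try exponentially many ways to split a run of 'a's once the
# trailing '$' fails to match.
vulnerable = re.compile(r"^(a+)+$")
# An anti-ReDoS equivalent: same language, no ambiguity.
safe = re.compile(r"^a+$")

# Both accept exactly the same strings ...
for s in ("a", "aaaa", "", "ab", "ba"):
    assert bool(vulnerable.fullmatch(s)) == bool(safe.fullmatch(s))

# ... but on a crafted rejecting input, matching time explodes for the
# vulnerable pattern while staying negligible for the safe one.
attack = "a" * 20 + "b"
t0 = time.perf_counter(); vulnerable.match(attack); t_vuln = time.perf_counter() - t0
t0 = time.perf_counter(); safe.match(attack); t_safe = time.perf_counter() - t0
print(f"vulnerable: {t_vuln:.4f}s  safe: {t_safe:.6f}s")
```

Reducing the ambiguity of the pattern, as in the `safe` variant, is exactly the kind of transformation an anti-ReDoS generator must guarantee.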
Panel
Experience
Model counting is the problem of finding the number of solutions to a formula over a bounded universe. This is a classic problem in computer science that has seen many recent advances in the techniques and tools that tackle it. These advances have led to applications of model counting in many domains, e.g., quantitative program analysis, reliability, and security. Given the sheer complexity of the underlying problem, today's model counters employ sophisticated algorithms and heuristics, which result in complex tools that must be heavily optimized. Therefore, establishing the correctness of implementations of model counters necessitates rigorous testing. This experience paper presents an empirical study on testing industrial-strength model counters by applying the principles of differential and metamorphic testing together with bounded exhaustive input generation and input minimization. We embody these principles in the TestMC framework and apply it to test four model counters, including three state-of-the-art model counters from three different classes. Specifically, we test the exact model counters projMC and dSharp, the probabilistic exact model counter Ganak, and the probabilistic approximate model counter ApproxMC. As subjects, we use three complementary test suites of input formulas. One suite consists of larger formulas that are derived from a wide range of real-world software design problems. The second suite consists of a bounded exhaustive set of small formulas that TestMC generated. The third suite consists of formulas generated using an off-the-shelf CNF fuzzer. TestMC found bugs in three of the four subject model counters. The bugs led to crashes, segmentation faults, incorrect model counts, and resource exhaustion by the solvers. Two of the tools were corrected subsequent to the bug reports we submitted based on our study, whereas the bugs we reported in the third tool were deemed by the tool authors to not require a fix.
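One metamorphic relation for model counters can be demonstrated with a tiny brute-force counter (the CNF encoding and the relation below are generic; TestMC's actual oracles and input generation are far more elaborate):

```python
from itertools import product

# Brute-force reference counter. A formula is a list of clauses; a
# clause is a list of non-zero ints, negative meaning negated variable.
def count_models(clauses, n_vars):
    count = 0
    for assign in product([False, True], repeat=n_vars):
        satisfied = all(
            any(assign[abs(lit) - 1] == (lit > 0) for lit in clause)
            for clause in clauses
        )
        count += satisfied
    return count

# Metamorphic relation: splitting on any variable x preserves the
# count:  #F == #(F and x) + #(F and not-x).
F = [[1, -2], [2, 3], [-1, 3]]   # an arbitrary 3-variable CNF
n = 3
for x in (1, 2, 3):
    total = count_models(F, n)
    split = count_models(F + [[x]], n) + count_models(F + [[-x]], n)
    assert total == split
print("metamorphic relation holds;", count_models(F, n), "models")
```

A counter under test replaces `count_models`; any violation of the relation, or disagreement with another counter on the same formula (differential testing), flags a bug.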
As big data analytics become increasingly popular, data-intensive scalable computing (DISC) systems help address the scalability issue of handling large data. However, there exists a lack of automated testing techniques to test such data-centric applications, because data is often incomplete, continuously evolving, and hard to know a priori. Fuzz testing has been proven to be highly effective in other domains such as security; however, it is nontrivial to apply such traditional fuzzing to big data analytics directly for three reasons: (1) the long latency of DISC systems prohibits the applicability of fuzzing: naïve fuzzing would spend 98% of the time in setting up a test environment; (2) conventional branch coverage is unlikely to scale to DISC applications because most binary code comes from the framework implementation such as Apache Spark; and (3) random bit or byte-level mutations can hardly generate meaningful data, which fails to reveal real-world application bugs.
We propose a novel coverage-guided fuzz testing tool for big data analytics, called BigFuzz. The key essence of our approach is twofold: (a) we focus on exercising application logic as opposed to increasing framework code coverage by abstracting the DISC framework using specifications. BigFuzz performs automated source-to-source transformations to construct an equivalent DISC application suitable for fast test generation, and (b) we design schema-aware data mutation operators based on our in-depth study of DISC application error types. BigFuzz speeds up the fuzzing time by 78X-1477X compared to random fuzzing, improves application code coverage by 20%-271%, and achieves 33%-157% improvement in detecting application errors. When compared to the state of the art that uses symbolic execution to test big data analytics, BigFuzz is applicable to twice as many programs and can find 80.6% more bugs.
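A schema-aware mutation operator of the kind described can be sketched as follows (the schema, error categories, and operator choices are illustrative assumptions, not BigFuzz's exact operator set):

```python
import random

# Schema-aware mutation sketch: instead of flipping random bits, mutate
# a CSV record field by field according to its declared type, so the
# fuzzer produces inputs that exercise application logic.
schema = [("zipcode", int), ("city", str), ("price", float)]

def mutate(record, rng):
    fields = record.split(",")
    i = rng.randrange(len(schema))
    name, ftype = schema[i]
    choice = rng.choice(["type_error", "boundary", "empty", "extra_delim"])
    if choice == "type_error":          # wrong type in a typed column
        fields[i] = "not_a_" + ftype.__name__
    elif choice == "boundary":          # numeric boundary values
        fields[i] = rng.choice(["0", "-1", "2147483648", "NaN"])
    elif choice == "empty":             # missing field value
        fields[i] = ""
    else:                               # malformed row structure
        fields[i] += ","
    return ",".join(fields)

rng = random.Random(42)
seed = "90095,LosAngeles,999.99"
for _ in range(5):
    print(mutate(seed, rng))
```

Random byte-level mutation would almost never produce, say, an overflow-sized zipcode in exactly the zipcode column, which is why these operators surface application errors that bit flips miss.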
Client-specific equivalence checking (CSEC) is a technique proposed previously to perform impact analysis of changes to downstream components (libraries) from the perspective of an unchanged system (client). Existing analysis techniques, whether general (regression verification, equivalence checking) or special-purpose, when applied to CSEC, either require users to provide specifications, or do not scale. We propose a novel solution to the CSEC problem, called CC2, that is based on searching the control-flow of a program for impact boundaries. We evaluate a prototype implementation of CC2 on a comprehensive set of benchmarks and conclude that our prototype performs well compared to the state-of-the-art. We also show that CC2 can be applied to real software projects in a case study.
Deep learning (DL) training algorithms utilize nondeterminism to improve models' accuracy and training efficiency. Hence, multiple identical training runs (e.g., identical training data, algorithm, and network) produce different models with different accuracy and training time. In addition to these algorithmic factors, DL libraries (e.g., TensorFlow and cuDNN) introduce additional variance (referred to as implementation-level variance) due to parallelism, optimization, and floating-point computation. This work is the first to study the variance of DL systems and the awareness of this variance among researchers and practitioners. Our experiments on three datasets with six popular networks show large overall accuracy differences among identical training runs. Even after excluding weak models, the accuracy difference is still 10.8%. In addition, implementation-level factors alone cause the accuracy difference across identical training runs to be up to 2.9%, the per-class accuracy difference to be up to 52.4%, and the difference in training time to convergence to be up to 145.3%. All core (TensorFlow, CNTK, and Theano) and low-level libraries exhibit implementation-level variance across all evaluated versions. Our researcher and practitioner survey shows that 83.8% of the 901 participants are unaware of or unsure about any implementation-level variance. In addition, our literature survey shows that only 19.5±3% of papers in recent top software engineering (SE), AI, and systems conferences use multiple identical training runs to quantify the variance of their DL approaches. This paper raises awareness of DL variance and directs SE researchers to challenging tasks such as creating deterministic DL libraries for debugging and improving the reproducibility of DL software and results.
As neural networks make their way into safety-critical systems, where misbehavior can lead to catastrophes, there is a growing interest in certifying the equivalence of two structurally similar neural networks. For example, compression techniques are often used in practice for deploying trained neural networks on computationally- and energy-constrained devices, which raises the question of how faithfully the compressed network mimics the original network. Unfortunately, existing methods either focus on verifying a single network or rely on loose approximations to prove the equivalence of two networks. Due to overly conservative approximation, differential verification lacks scalability in terms of both accuracy and computational cost. To overcome these problems, we propose NeuroDiff, a symbolic and fine-grained approximation technique that drastically increases the accuracy of differential verification while achieving many orders-of-magnitude speedup. NeuroDiff has two key contributions. The first one is new convex approximations that more accurately bound the difference neurons of two networks under all possible inputs. The second one is judicious use of symbolic variables to represent neurons whose difference bounds have accumulated significant error. We also find that these two techniques are complementary, i.e., when combined, the benefit is greater than the sum of their individual benefits. We have evaluated NeuroDiff on a variety of differential verification tasks. Our results show that NeuroDiff is up to 1000X faster and 5X more accurate than the state-of-the-art tool.
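The benefit of bounding the difference of two networks directly, rather than subtracting independently computed bounds, can be seen even for a single linear layer with interval arithmetic (a toy construction; NeuroDiff's convex approximations and symbolic variables go well beyond this):

```python
import numpy as np

# Two "networks" (single affine layers), the second a slightly
# perturbed copy of the first, as with a compressed deployment.
rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 3))
W2 = W1 + 0.01 * rng.normal(size=(4, 3))
b1 = rng.normal(size=4)
b2 = b1.copy()

lo, hi = -1.0, 1.0   # input box: each x_i in [lo, hi]

def interval_affine(W, b, lo, hi):
    # Interval bound of W x + b over the input box.
    center = b + W.sum(axis=1) * (lo + hi) / 2
    radius = np.abs(W).sum(axis=1) * (hi - lo) / 2
    return center - radius, center + radius

# Naive: bound each network separately, then subtract the intervals.
l1, u1 = interval_affine(W1, b1, lo, hi)
l2, u2 = interval_affine(W2, b2, lo, hi)
naive_width = (u1 - l2) - (l1 - u2)

# Differential: bound the difference network (W1-W2) x + (b1-b2).
ld, ud = interval_affine(W1 - W2, b1 - b2, lo, hi)
diff_width = ud - ld

print("naive width:", naive_width.max(), "diff width:", diff_width.max())
assert (diff_width <= naive_width + 1e-9).all()
```

Because the weight perturbation is small, the difference network has tiny coefficients and hence tight bounds, while the naive subtraction inherits the full width of both networks' output intervals.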
On the one hand, as a GitHub profile is becoming an essential part of a developer's resume, it becomes increasingly important to enable HR departments to extract someone's expertise through automated analysis of his/her contributions to open-source projects. On the other hand, having clear insights into the technologies used in a project can be very beneficial for resource allocation and project maintainability planning. In the literature, one can identify various approaches for identifying expertise in programming languages, based on the projects a developer contributed to. In this paper, we move one step further and introduce an approach (accompanied by a tool) to identify low-level expertise in particular software frameworks and technologies, relying solely on GitHub data, using the GitHub API and Natural Language Processing (NLP), specifically the Microsoft Language Understanding Intelligent Service (LUIS). In particular, we developed an NLP model in LUIS for named-entity recognition for three .NET technologies and two front-end frameworks. Our analysis is based upon specific commit contents, in terms of the exact code chunks that the committer added or changed. We evaluate the precision, recall, and F-measure for the derived technologies/frameworks by conducting a batch test in LUIS and report the results. The proposed approach is demonstrated through a fully functional web application named RepoSkillMiner.
Tool Links: Video, Code Repo, Application, Validation Dataset
CCS CONCEPTS • Software and its engineering → Software creation and management → Software post-development issues
KEYWORDS Expertise; Frameworks; GitHub; Natural Language Processing; Software Project Management
Networking Event
Brian Randell described software engineering as “the multi-person development of multi-version programs”. David Parnas has expressed that this “pithy phrase implies everything that differentiates software engineering from other programming”. How does current software engineering research compare against this definition? Is there currently too much focus on research into problems and techniques more associated with programming than software engineering? Are there opportunities to use Randell’s description of software engineering to guide the community to new research directions? In this talk, I will explore these questions and discuss how a consideration of the development streams used by multiple individuals to produce multiple versions of software opens up new avenues for impactful software engineering research.
Path explosion and constraint solving are two challenges to symbolic execution's scalability. Symbolic execution explores the program's path space with a search strategy and invokes the underlying constraint solver in a black-box manner to check the feasibility of a path. Inside the constraint solver, another search procedure is employed to prove or disprove that feasibility. Hence, symbolic execution effectively performs two nested searches. In this paper, we propose to unify these two search procedures to improve the scalability of symbolic execution. We propose Multiplex Symbolic Execution (MuSE), which utilizes the intermediate assignments produced during constraint solving to generate new program inputs. MuSE maps the constraint solving procedure to path exploration in symbolic execution and explores multiple paths in a single solving run. We have implemented MuSE on two symbolic execution tools (based on KLEE and JPF) and three commonly used constraint solving algorithms. The results of extensive experiments on real-world benchmarks indicate that MuSE achieves orders-of-magnitude speedups in reaching the same coverage.
Coverage-guided fuzzing is one of the most popular software testing techniques for vulnerability detection. While effective, current fuzzing methods suffer from a significant performance penalty due to instrumentation overhead, which limits their practical use. Existing solutions improve fuzzing speed by decreasing instrumentation overhead but sacrifice coverage accuracy, which results in unstable vulnerability-detection performance.
In this paper, we propose Zeror, a coverage-sensitive tracing and scheduling framework that can improve the performance of existing fuzzers, especially their speed and vulnerability detection. Zeror consists of two parts: (1) a self-modifying tracing mechanism that provides zero-overhead instrumentation for more effective coverage collection, and (2) a real-time scheduling mechanism that supports adaptive switching between the zero-overhead instrumented binary and the fully instrumented binary for better vulnerability detection. In this way, Zeror is able to decrease collection overhead while preserving fine-grained coverage for guidance.
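A minimal sketch of the scheduling half of such a framework, assuming a simple invented policy (window size, threshold, and the rule "prefer full tracing while coverage is still growing" are all illustrative, not Zeror's actual algorithm):

```python
# Hypothetical sketch of adaptive switching between a fast, sparsely
# observed binary and a slow, fully instrumented one, driven by how often
# recent executions discovered new coverage.

class Scheduler:
    def __init__(self, window=10, threshold=0.2):
        self.window = window        # number of recent runs considered
        self.threshold = threshold  # new-coverage rate favouring full tracing
        self.history = []           # 1 = run found new coverage, 0 = it did not

    def record(self, found_new_coverage):
        self.history.append(1 if found_new_coverage else 0)
        self.history = self.history[-self.window:]

    def choose_binary(self):
        """Prefer the fully instrumented binary while coverage is still
        growing quickly; fall back to the zero-overhead binary otherwise."""
        if not self.history:
            return "full"
        rate = sum(self.history) / len(self.history)
        return "full" if rate >= self.threshold else "zero-overhead"

s = Scheduler(window=5, threshold=0.4)
for hit in [1, 1, 0, 0, 0]:
    s.record(hit)
```

Once the recent new-coverage rate drops below the threshold, the scheduler falls back to the cheap binary, recovering execution speed.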
For evaluation, we implemented a prototype of Zeror and evaluated it on the Google fuzzer-test-suite, which consists of 24 widely used applications. The results show that Zeror performs better than existing fuzzing speed-up frameworks such as Untracer and INSTRIM: it improves the execution speed of state-of-the-art fuzzers such as AFL and MOPT by 159.80%, helps them achieve better coverage (on average 10.14% better for AFL and 6.91% for MOPT), and detects vulnerabilities faster (on average 29.00% faster for AFL and 46.99% for MOPT).
Regression testing is widely recognized as an important but time-consuming process. In the literature, researchers have devoted dedicated effort to test selection, reduction, and prioritization to alleviate this cost. These techniques share the commonality that they improve regression testing by optimizing the execution of the whole test suite. In this paper, we attempt to accelerate regression testing from an entirely new perspective: skipping part of the execution of the new program version by reusing program states of the old version. Following this intuition, we propose SRRTA, a state-reuse based acceleration approach consisting of state storage and state loading. With the former, SRRTA collects program states during the execution of the old version through three heuristic storage strategies; with the latter, SRRTA loads the stored program states using efficiency-optimization strategies. Finally, we conduct a preliminary study on commons-math and find that SRRTA reduces regression testing time by 80.3%, indicating that it is very promising for accelerating regression testing.
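The store-and-load idea can be sketched in a few lines; the checkpoint key, the logging, and the `run_with_checkpoint` helper are invented for illustration and are not SRRTA's actual API:

```python
# Hedged sketch of state-reuse acceleration: an expensive prefix
# computation from the old version's test run is checkpointed; the new
# version loads the stored state instead of re-executing the unchanged
# prefix.

store = {}

def run_with_checkpoint(key, compute, log):
    """Load a stored program state if available; otherwise compute and store it."""
    if key in store:
        log.append(("loaded", key))
        return store[key]
    log.append(("executed", key))
    state = compute()
    store[key] = state
    return state

def expensive_setup():
    return {"matrix": [[i * j for j in range(3)] for i in range(3)]}

old_log, new_log = [], []
# Old-version test run: executes the setup and stores the resulting state.
s1 = run_with_checkpoint("setup-v1", expensive_setup, old_log)
# New-version run: the setup code is unchanged, so the state is reused.
s2 = run_with_checkpoint("setup-v1", expensive_setup, new_log)
```

The hard parts SRRTA addresses, deciding which states are safe to reuse across versions and loading them efficiently, are abstracted away by the key here.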
Software clone detection is an active research area that is very important for software maintenance, bug detection, and related tasks. Two pieces of cloned code exhibit similarity or equivalence in the syntax or structure of their code representations. There are many code representations, such as ASTs, tokens, and PDGs. The PDG (Program Dependence Graph) of source code contains both syntactic and structural information. However, most existing PDG-based tools are time-consuming and miss many clones because they detect code clones through exact graph matching using subgraph isomorphism. In this paper, we propose CCGraph, a novel PDG-based code clone detector that uses graph kernels. First, we normalize the structure of the PDGs and design a two-stage filtering strategy based on the characteristic vectors of the code. Then we detect code clones using an approximate graph matching algorithm based on a modified WL (Weisfeiler-Lehman) graph kernel. Experimental results show that CCGraph retains high accuracy, achieves better recall and F1-scores, and detects more unique clones than two related state-of-the-art tools. Moreover, CCGraph is much more efficient than existing PDG-based tools.
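One Weisfeiler-Lehman refinement step, the building block of WL graph kernels, is small enough to sketch; the similarity score below (histogram intersection over iterations) is a simplified stand-in, not CCGraph's actual kernel:

```python
# Minimal sketch of WL label refinement on labelled graphs, the kind of
# relabelling that approximate PDG matching builds on.

from collections import Counter

def wl_step(labels, adjacency):
    """One WL iteration: each node's new label combines its own label with
    the sorted multiset of its neighbours' labels."""
    new = {}
    for node, nbrs in adjacency.items():
        signature = (labels[node], tuple(sorted(labels[n] for n in nbrs)))
        new[node] = hash(signature)
    return new

def wl_similarity(adj1, lab1, adj2, lab2, iterations=2):
    """Compare label histograms across WL iterations (histogram
    intersection normalised by graph size)."""
    score, total = 0, 0
    for _ in range(iterations):
        lab1, lab2 = wl_step(lab1, adj1), wl_step(lab2, adj2)
        c1, c2 = Counter(lab1.values()), Counter(lab2.values())
        score += sum((c1 & c2).values())
        total += max(len(lab1), len(lab2))
    return score / total

# Two structurally identical dependence graphs with matching node labels:
g = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
l = {"a": "stmt", "b": "if", "c": "stmt"}
```

Identical graphs score 1.0, while graphs with no matching refined labels score 0.0, which is what makes the kernel usable as an approximate (rather than exact) matching criterion.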
The existing concurrency model for Java (or C) requires programmers to design and implement thread-safe classes by explicitly acquiring and releasing locks. Such a model is error-prone and is the cause of many concurrency bugs. While there are alternative models such as transactional memory, manually writing locks remains prevalent in practice. In this work, we propose AutoLock, which aims to solve this problem by fully automatically generating thread-safe classes. Given a class that is assumed to be correct for sequential clients, AutoLock automatically generates a thread-safe, linearizable class, and does so without requiring a specification of the class. AutoLock takes three steps: (1) infer access annotations (i.e., abstract information on how variables are accessed and aliased), (2) synthesize a locking policy based on the access annotations, and (3) consistently implement the locking policy. AutoLock has been evaluated on a set of benchmark programs, and the results show that it generates thread-safe classes effectively and could have prevented existing concurrency bugs.
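The final step, consistently implementing a locking policy around a sequentially correct class, can be illustrated with the coarsest possible policy (one lock serializing every method); AutoLock infers much finer-grained policies, and the wrapper below is only a sketch of the idea:

```python
# Deliberately coarse illustration of wrapping a sequential class so that
# every public method runs under a single per-instance lock.

import threading

def make_thread_safe(cls):
    class Wrapper:
        def __init__(self, *args, **kwargs):
            self._lock = threading.Lock()
            self._inner = cls(*args, **kwargs)

        def __getattr__(self, name):
            attr = getattr(self._inner, name)
            if not callable(attr):
                return attr
            def locked(*args, **kwargs):
                with self._lock:          # serialise every method call
                    return attr(*args, **kwargs)
            return locked
    return Wrapper

class SequentialCounter:                  # assumed correct for sequential clients
    def __init__(self):
        self.value = 0
    def increment(self):
        self.value += 1

SafeCounter = make_thread_safe(SequentialCounter)
c = SafeCounter()
threads = [threading.Thread(target=lambda: [c.increment() for _ in range(1000)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the lock, the read-modify-write in `increment` can interleave and lose updates; with it, four threads of 1000 increments reliably yield 4000.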
JavaScript is one of the most popular programming languages. WeChat Mini-Program is a large ecosystem of JavaScript applications that run on the WeChat platform. Millions of Mini-Programs are accessed by WeChat users every week, so the performance and robustness of Mini-Programs are particularly important. Unfortunately, many Mini-Programs suffer from various defects and performance problems. Dynamic analysis is a useful technique for pinpointing application defects. However, due to the dynamic features of the JavaScript language and the complexity of the runtime environment, dynamic analysis techniques have rarely been used to improve the quality of JavaScript applications running on industrial platforms such as WeChat Mini-Program. In this work, we report our experience of extending Jalangi, a dynamic analysis framework for JavaScript applications developed in academia, and applying the extended version, named WeJalangi, to diagnose defects in WeChat Mini-Programs. WeJalangi is compatible with existing dynamic analysis tools such as DLint and JITProf. We implemented a null pointer checker on WeJalangi and tested the tool's usability on 152 open-source Mini-Programs. We also conducted a case study at Tencent by applying WeJalangi to six popular commercial Mini-Programs. In the case study, WeJalangi accurately located six null pointer issues, three of which had not been discovered previously. All of the reported defects have already been confirmed by developers and testers.
Strings play many roles in programming because they often contain complex and semantically rich information. For example, programmers use strings to filter inputs via regular expression matching, to express the names of program elements accessed through some form of reflection, to embed code written in another formal language, and to assemble textual output produced by a program. The omnipresence of strings leads to a wide range of mistakes that developers may make, yet little is currently known about these mistakes. The lack of knowledge about string-related bugs leads to developers repeating the same mistakes again and again, and to poor support for finding and fixing such bugs. This paper presents the first empirical study of the root causes, consequences, and other properties of string-related bugs. We systematically study a diverse set of projects written in JavaScript, a language where strings play a particularly important role. Our findings include (i) that many string-related mistakes are caused by a recurring set of root cause patterns, such as incorrect string literals and regular expressions, (ii) that string-related bugs have a diverse set of consequences, including incorrect output or silent omission of expected behavior, (iii) that string-related bugs occur across all parts of applications, including the core components, and (iv) that almost none of these bugs are detected by existing static analyzers. Our findings not only show the importance and prevalence of string-related bugs, but also help developers to avoid common mistakes and tool builders to tackle the challenge of finding and fixing string-related bugs.
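One of the recurring root-cause patterns named above, an incorrect regular expression, is easy to illustrate; the study targets JavaScript, but the mistake is language-independent, so this invented "version check" example is shown in Python:

```python
# An unescaped dot in a regex matches any character, so the buggy
# "version check" silently accepts malformed input.

import re

def is_version_buggy(s):
    # BUG: "." matches any character, so "1x2" passes the check.
    return re.fullmatch(r"\d.\d", s) is not None

def is_version_fixed(s):
    # FIX: escape the dot so only a literal "." is accepted.
    return re.fullmatch(r"\d\.\d", s) is not None
```

The buggy variant exhibits exactly the "silent omission of expected behavior" consequence the study reports: no error is raised, the program simply accepts input it should reject.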
Test-based automated program repair (APR) has attracted huge attention from both industry and academia. Despite the significant progress made in recent studies, the overfitting problem (i.e., a generated patch being plausible but overfitting) is still a major and long-standing challenge. Therefore, plenty of automated techniques have been proposed to assess the correctness of patches, either in the patch generation phase or in the evaluation of APR techniques. However, the effectiveness of the existing techniques has not been systematically compared, and little is known about their advantages and disadvantages. To fill this gap, we performed a large-scale empirical study. Specifically, we systematically investigated the effectiveness of existing automated patch correctness assessment techniques, both static and dynamic, based on 902 patches automatically generated by 21 APR tools from 4 different categories (the largest benchmark in the literature to date). Our empirical study revealed the following major findings: (1) static code features with respect to patch syntax and semantics are generally effective in differentiating overfitting patches from correct ones; (2) dynamic techniques can generally achieve high precision, while heuristics based on static code features are more effective in terms of recall; (3) existing techniques are more effective for certain projects and certain types of APR techniques and less effective for others; (4) existing techniques are highly complementary to each other: a single technique can detect at most 53.5% of the overfitting patches, while 93.3% of them can be detected by at least one technique. Based on our findings, we designed an integration strategy that first integrates static code features via learning and then combines the result with the other techniques via majority voting. Our experiments show that this strategy can significantly enhance the performance of existing patch correctness assessment techniques.
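The combination step can be sketched as a simple majority vote over boolean assessors; the assessor names, the patch encoding, and the conservative tie-breaking rule below are invented for illustration:

```python
# Hedged sketch of integrating several patch correctness assessors by
# majority vote; the learned static-feature model is just another voter.

def majority_vote(verdicts):
    """Each verdict is True (patch judged correct) or False (overfitting);
    ties are resolved conservatively as overfitting."""
    yes = sum(1 for v in verdicts if v)
    return yes > len(verdicts) / 2

assessors = {
    "static-features": lambda patch: patch["syntax_ok"],
    "dynamic-sim":     lambda patch: patch["behaviour_ok"],
    "heuristic":       lambda patch: patch["anti_pattern_free"],
}

def assess(patch):
    return majority_vote([a(patch) for a in assessors.values()])

plausible_but_overfitting = {"syntax_ok": True, "behaviour_ok": False,
                             "anti_pattern_free": False}
```

Because the individual techniques are highly complementary, even this naive combiner can outperform any single assessor on patches where only one technique is fooled.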
A large body of the automated program repair literature develops approaches in which patches are generated and then validated against an oracle (e.g., a test suite). Because such an oracle can be imperfect, the generated patches, although validated by the oracle, may actually be incorrect. While the state of the art explores research directions that require dynamic information or rely on manually crafted heuristics, we study the benefit of learning code representations in order to learn deep features that may encode the properties of patch correctness. Our empirical work mainly investigates different representation learning approaches for code changes to derive embeddings that are amenable to similarity computations. We report findings based on embeddings produced by pre-trained and re-trained neural networks. Experimental results demonstrate the potential of embeddings to empower learning algorithms in reasoning about patch correctness: a machine learning predictor combining BERT transformer-based embeddings with logistic regression yielded an AUC value of about 0.8 in the prediction of patch correctness on a deduplicated dataset of 1,000 labeled patches. Our investigations show that learned representations can lead to reasonable performance when compared against the state of the art, PATCH-SIM, which relies on dynamic information. These representations may further be complementary to features that were carefully (manually) engineered in the literature.
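The reported AUC metric can be computed directly from classifier scores; a small self-contained sketch (the scores and labels are toy values, not the paper's data):

```python
# AUC is the probability that a randomly chosen correct patch is scored
# above a randomly chosen overfitting one, with ties counted as half.

def auc(scores, labels):
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.4, 0.3]   # classifier confidence that a patch is correct
labels = [1,   0,   1,   0]     # 1 = correct patch, 0 = overfitting
```

An AUC of about 0.8 therefore means that roughly four out of five correct/overfitting patch pairs are ranked in the right order by the embedding-based predictor.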
Reentrancy bugs, among the most severe vulnerabilities in smart contracts, have caused huge financial losses in recent years. Researchers have proposed general-purpose and rule-based approaches to detecting them. However, empirical studies have shown that these approaches usually suffer from undesirable false positives and false negatives, especially when the code under detection involves interaction between multiple smart contracts. In this paper, we propose an accurate and efficient cross-contract reentrancy detection approach for practical use. Rather than designing rule-of-thumb heuristics, we conducted a large empirical study of 11,714 real-world contracts from Etherscan against three well-known general-purpose security tools for reentrancy detection, and manually summarized the reentrancy scenarios that state-of-the-art approaches cannot address. Based on this empirical evidence, we present Clairvoyance, a cross-function and cross-contract static analysis that detects reentrancy vulnerabilities in the real world with significantly higher accuracy. To reduce false negatives, we enable, for the first time, a cross-contract call chain analysis by tracking possibly tainted paths. To reduce false positives, we systematically summarized five major path protective techniques (PPTs) to support fast yet precise path feasibility checking. We implemented our approach and compared Clairvoyance with five state-of-the-art tools on 17,770 real-world contracts. The results show that Clairvoyance yields the best detection accuracy among all the tools and also finds 101 unknown reentrancy vulnerabilities.
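The core pattern being detected, a state update that happens only after an external call along some (possibly cross-contract) path, can be sketched over a flattened path of abstract operations; the operation names and path encoding below are illustrative, not Clairvoyance's internals:

```python
# Simplified core check: along a call path already flattened across
# contract boundaries, an external call followed by a write to guarded
# state is flagged as a potential reentrancy.

def has_reentrancy(path):
    """`path` is a sequence of ("external_call",), ("state_write", var),
    or other operations."""
    seen_external_call = False
    for op in path:
        if op[0] == "external_call":
            seen_external_call = True
        elif op[0] == "state_write" and seen_external_call:
            return True                  # balance updated after the call
    return False

vulnerable = [("require", "balance>0"), ("external_call",),
              ("state_write", "balance")]
safe = [("require", "balance>0"), ("state_write", "balance"),
        ("external_call",)]
```

The difficulty the paper addresses is building such paths across contract boundaries (reducing false negatives) and pruning infeasible ones via path protective techniques (reducing false positives).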
With one of the largest available collections of reusable packages, the JavaScript runtime environment Node.js is one of the most popular programming platforms. With recent work showing evidence that known vulnerabilities are prevalent in both open-source and industry code, we propose and implement a viable code-based vulnerability detection tool for Node.js applications. Our case study lists the challenges encountered when implementing this Node.js vulnerable-code detector.
Over the last few years, there has been substantial research on automated analysis, testing, and debugging of Ethereum smart contracts. However, it is not trivial to compare and reproduce that research.
To address this, we present SmartBugs, an extensible and easy-to-use execution framework that simplifies the execution of analysis tools on smart contracts written in Solidity, the primary language used in Ethereum.
SmartBugs is currently distributed with support for 10 tools and two datasets of Solidity contracts. The first dataset can be used to evaluate the precision of analysis tools, as it contains 143 annotated vulnerable contracts with 208 tagged vulnerabilities. The second dataset contains 47,518 unique contracts collected through Etherscan.
We discuss how SmartBugs supported the largest experimental setup to date, both in the number of tools and in execution time. Moreover, we show how it enables easy integration and comparison of analysis tools by presenting a new extension to the tool SmartCheck that substantially improves the detection of vulnerabilities related to the DASP10 categories Bad Randomness, Time Manipulation, and Access Control (identified vulnerabilities increased from 11% to 24%).
Panel
Experience
Experience paper: Testing of mobile apps is time-consuming and requires a great deal of manual effort. For this reason, industry and academic researchers have proposed a number of test-input-generation techniques for automating app testing. Although useful, these techniques have weaknesses and limitations that often prevent them from achieving high coverage. We believe that one of the reasons for these limitations is that tool developers tend to focus mainly on improving the strategy the techniques employ to explore app behavior, whereas limited effort has been put into investigating other ways to improve the performance of these techniques. To address this problem, and to get a better understanding of the limitations of input-generation techniques for mobile apps, we conducted an in-depth study of the limitations of Monkey, arguably the most widely used tool for automated testing of Android apps. Specifically, in our study, we manually analyzed Monkey's performance on 68 benchmarks to identify the common limitations that prevent the tool from achieving better coverage results. We then assessed the coverage improvement that Monkey could achieve if these limitations were eliminated. In our analysis of the results, we also discuss whether other existing test-input-generation tools suffer from these common limitations and provide insights on how they could address them.
Automated testing of mobile apps has received significant attention in recent years from researchers and practitioners alike. In this paper, we report on the largest empirical study to date aimed at understanding the test automation culture prevalent among mobile app developers. We systematically examined more than 3.5 million repositories on GitHub and identified more than 12,000 non-trivial, real-world Android apps. We then analyzed these non-trivial apps to investigate (1) the trends in the adoption of test automation; (2) the working habits of mobile app developers with regard to automated testing; and (3) the correlation between the adoption of test automation and the popularity of projects. Among other findings, we found that (1) only 8% of mobile app development projects leverage automated testing practices; (2) developers tend to follow the same test automation practices across projects; and (3) popular projects, measured in terms of the number of contributors, stars, and forks on GitHub, are more likely to adopt test automation practices. To understand the rationale behind our observations, we further conducted a survey of 148 professional and experienced developers contributing to the subject apps. Our findings shed light on current practices and future research directions pertaining to test automation for mobile app development.
Most software systems interact with their environment extensively. This is especially true for mobile apps, whose behavior often depends on sensors, external services, and inter-process communication. These interactions can complicate testing activities, as test cases may need a complete environment in order to be executed. They can also cause issues such as flakiness, for example when the environment behaves in non-deterministic ways. For these reasons, it is common to create test mocks that eliminate the need for (part of) the environment to be present during testing. Manual mock creation, however, can be extremely time-consuming and error-prone. Moreover, the generated mocks can typically only be used in the context of the specific tests for which they were created. To address these issues, we propose MOKA, a general framework for collecting and generating reusable test mocks in an automated way. MOKA leverages the ability to observe a large number of interactions between an application and its environment, and uses an iterative approach to generate mocks with different reusability characteristics: advanced mocks generated through program synthesis and basic record-and-replay mocks. In this paper, we describe the new ideas behind MOKA, its main characteristics, a preliminary study, and a set of possible applications that would benefit from our framework.
This paper presents AirMochi, a tool that provides remote access and control of apps by leveraging a mobile platform's publicly exported accessibility features. While AirMochi is designed to be platform-independent, we discuss its iOS implementation. We show that AirMochi places no restrictions on apps, is able to handle a variety of scenarios, and imposes a negligible performance overhead. Video: https://youtu.be/rhPz2Hs4Ius Code: https://github.com/nkllkc/air_mochi
The use of web applications has become increasingly popular in our routine activities, such as reading the news, paying bills, and shopping online. As the availability of these services grows, we are witnessing an increase in the number and sophistication of attacks that target them. In particular, SQL injection, a class of code-injection attacks in which specially crafted input strings result in illegal queries to a database, has become one of the most serious threats to web applications. In this paper we present and evaluate a new technique for detecting and preventing SQL injection attacks. Our technique uses a model-based approach to detect illegal queries before they are executed on the database. In its static part, the technique uses program analysis to automatically build a model of the legitimate queries that could be generated by the application. In its dynamic part, the technique uses runtime monitoring to inspect the dynamically-generated queries and check them against the statically-built model. We developed a tool, AMNESIA, that implements our technique and used the tool to evaluate the technique on seven web applications. In the evaluation we targeted the subject applications with a large number of both legitimate and malicious inputs and measured how many attacks our technique detected and prevented. The results of the study show that our technique was able to stop all of the attempted attacks without generating any false positives.
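The runtime check can be sketched by reducing queries to skeletons in which literals become placeholders; the crude regex-based tokenisation and the `LIT` placeholder below are purely illustrative, not AMNESIA's actual query model:

```python
# Hedged sketch of a model-based SQL injection check: a dynamically built
# query whose literal-free skeleton is not in the statically derived model
# of legitimate query shapes is rejected.

import re

def skeleton(query):
    """Replace string and numeric literals with a LIT placeholder."""
    q = re.sub(r"'[^']*'", "LIT", query)
    return re.sub(r"\b\d+\b", "LIT", q)

# Statically built model of legitimate query shapes:
model = {skeleton("SELECT * FROM users WHERE name = 'alice' AND pin = 1234")}

def is_legal(query):
    return skeleton(query) in model

benign = "SELECT * FROM users WHERE name = 'bob' AND pin = 42"
attack = "SELECT * FROM users WHERE name = '' OR '1'='1' AND pin = 0"
```

An injected `OR '1'='1'` changes the query's structure, not just its literals, so its skeleton falls outside the model and the query is blocked before reaching the database.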
Workshop
Component-based synthesis is one of the most active research areas in automated software engineering. It aims to generate programs from a collection of components, such as the classes of a Java library. However, the program space constituted by all the components in a library is extremely large, which leads to a vast number of candidate programs and long generation times. The intractability of the program space affects both the efficiency of synthesis and the size of the generated programs. In this paper, we propose Itas, a framework for iterative program synthesis guided by API usage knowledge mined from the Internet, which can significantly improve the efficiency of program synthesis. Itas constrains the program space by combining two main ideas: first, it narrows the program space from the outside using API usage knowledge; second, it expands the program space from the inside via a knowledge-based iterative strategy. For evaluation, we collected a set of programming tasks and compared our approach with an existing program synthesis tool on these tasks. The experimental results show that Itas significantly improves the efficiency of program synthesis, reducing synthesis time by 97.1% compared with the original synthesizer.
Workshop
We study the problem of automatically generating source code from different forms of user intent. Existing methods treat this problem as a neural language generation task, known as Neural Program Synthesis (NPS). Most of these methods struggle to achieve high generation accuracy; one reason is the incompleteness and inaccuracy of user intents for a specific programming task. Inspired by Swarm Intelligence (SI) and Collective Intelligence (CI) techniques, we propose an automatic framework for merging task-specific user intents that combines a bio-inspired SI algorithm with CI gathered from multiple developers. Empirically, we show that our approach provides more accurate and adequate input for NPS, and our experiments on CI indicate that knowledge merging among isolated software developers has a significant influence on NPS.
Workshop
Software design patterns are solutions to common software problems that are proven to work adequately in particular scenarios. Deciding which design pattern to use for a given software problem often requires practical knowledge acquired through experience in a similar domain and can be highly subjective and error-prone. Furthermore, for novice programmers, an automated approach would be a tremendous help, as they usually lack the practical knowledge required to decide which design pattern to use for a particular software problem. The majority of research in software design pattern prediction uses software structure and features to determine which design pattern to implement. However, there are circumstances where software designers would prefer to know which design pattern to use by looking at the design problem during or before the implementation phase. Existing design-pattern prediction tools cannot be utilized in this scenario due to the absence of code and class structures. To address this issue, this paper proposes a new approach that analyses the context of a software problem described in text and predicts a suitable design pattern for the given problem context using feature learning, neural embedding, and classification. To evaluate our approach, we use Stack Overflow posts, where developers often discuss design problems and the consequences they should consider, which relate to two main design pattern elements. We evaluate our approach on a case study from Stack Overflow with more than 66,000 questions that discuss problems and consequences related to 23 design patterns. The experimental evaluation shows that our approach predicts design patterns from text with 82% overall accuracy, indicating that it can successfully support software designers in determining the most suitable design pattern for their software implementation.
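A much-simplified stand-in for such a text-to-pattern pipeline can be sketched with bag-of-words vectors and cosine similarity to a per-pattern centroid; the paper uses learned neural embeddings and a trained classifier, and the training snippets below are invented:

```python
# Toy text -> design-pattern predictor: nearest centroid by cosine
# similarity over bag-of-words vectors.

from collections import Counter
from math import sqrt

def vec(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

training = {
    "Observer":  "notify subscribers when state changes event listeners",
    "Singleton": "single shared instance global access point one object",
}
centroids = {p: vec(t) for p, t in training.items()}

def predict(problem_text):
    return max(centroids, key=lambda p: cosine(vec(problem_text), centroids[p]))
```

Replacing the bag-of-words vectors with learned embeddings and the centroids with a trained classifier yields the overall shape of the proposed approach.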
Workshop
In this paper, we define an NLP task for the automatic extraction of business process redesign suggestions from natural language text. In particular, we employed a systematic protocol to define the task, which is composed of three elements and three sub-tasks. The elements are: (a) a real-world process model, (b) actual feedback in natural language text, and (c) a three-level classification of the feedback. The task comprises two binary classification sub-tasks and one multi-class classification sub-task. The evaluation of the AutoEPRS-20 task is performed using six traditional supervised learning techniques. The results show that the third sub-task is more challenging than the two binary sub-tasks.
Workshop
Emotion detection plays a very important role in our lives. People express their emotions in different ways, e.g., facial expressions, gestures, speech, and text. This research focuses on detecting emotions from Roman Urdu text. Previously, a lot of work has been done on emotion detection in other languages, but only limited work exists for Roman Urdu. There is therefore a need to explore Roman Urdu, as it is one of the most widely used languages for communication on social media platforms. One major obstacle for Roman Urdu is the absence of a benchmark corpus for emotion detection from text, because language resources are essential for many natural language processing (NLP) tasks. Emotional analysis of text has many useful applications, such as improving product quality, dialogue systems, investment-trend analysis, and mental health monitoring. In this research, focusing on the emotional polarity of Roman Urdu sentences, we develop a comprehensive corpus of 18k sentences gathered from different domains and annotate it with six classes. We applied baseline algorithms such as KNN, decision trees, SVM, and random forests to our corpus. After experimentation and evaluation, the results showed that the SVM model achieves the best F-measure score.
Workshop
In this paper, we propose the novel concept of mapping natural-language customer feedback text to relevant business process model elements. Customer feedback mapped onto a business process model yields an augmented process model that incorporates customer perception. More specifically, we propose a systematic approach for mapping feedback comments to relevant process model elements, which comprises (a) process model generation, (b) preparation of a real-world customer feedback corpus, (c) mapping guidelines based on the BPRI framework, and (d) the first human-annotated dataset mapping customer feedback to process model elements. We evaluated the effectiveness of six traditional text similarity measures for automatically mapping customer feedback to process model elements. Based on the results, we conclude that automatic mapping identification is a challenging task, as all six traditional similarity measures yielded a zero recall score.
Workshop
Social media today demonstrates the rapid growth of modern society, as it has become the main platform for Internet users to communicate and express themselves. People around the world use a range of devices and resources to access the Internet, set up social networks, and conduct online business, e-commerce, e-surveys, and more. Currently, social media is not only a technology that provides information to consumers; it also encourages users to connect and share their views and perspectives. This has increased interest in Opinion Mining (OM), which is important for both customers and companies in making decisions. Individuals like to see the opinions other customers provide about a particular product or service, and companies need to analyse their customers' feedback to strengthen their business decisions. A lot of research has been performed in various languages in the field of Aspect-Based OM (ABOM). However, certain languages still need to be explored, such as Roman Urdu (RU). This paper presents a proposed review dataset (an RU dataset) of mobile-phone reviews that has been manually annotated with multi-aspect sentiment labels at the sentence level, and presents baseline results using different Machine Learning (ML) algorithms. The results demonstrate a 71% F1-score for aspect detection and 64% for aspect-based polarity.