Exception handling is an important built-in feature of many modern programming languages such as Java. It allows developers to prepare, in advance, for abnormal or unexpected conditions that may occur at runtime by writing try-catch blocks. Missing or improper exception handling can have catastrophic consequences such as system crashes. However, previous studies reveal that developers are unwilling to adopt the exception handling mechanism, or find it hard to do so, and tend to ignore it until a system failure forces them to act. To help developers with exception handling, existing work produces recommendations such as code examples and exception types, which still require developers to localize the try blocks and modify the catch-block code to fit the context. In this paper, we propose a novel neural approach to automated exception handling, which predicts the locations of try blocks and automatically generates complete catch blocks. We collect a large number of Java methods from GitHub and conduct experiments to evaluate our approach. The evaluation results, including quantitative measurements and a human evaluation, show that our approach is highly effective and outperforms all baselines. Our work makes one step further towards automated exception handling.
Long build times in continuous integration (CI) can greatly increase the cost in human and computing resources, and thus have become a common barrier for software organizations adopting CI. Build outcome prediction has been proposed as one remedy to reduce this cost. However, state-of-the-art approaches have poor prediction performance for failed builds and are not designed for practical usage scenarios. To address these problems, we first conduct an empirical study on 2,590,917 builds to characterize build times in real-world projects, and a survey with 75 developers to understand their perceptions of build outcome prediction. Then, motivated by our study and survey results, we propose a new history-aware approach, named BuildFast, to predict CI build outcomes cost-efficiently and practically. It can help to obtain fast integration feedback and reduce integration cost. In particular, we introduce multiple failure-specific features from closely related historical builds by analyzing build logs and changed files, and propose an adaptive prediction model that switches between two models based on the outcome of the previous build. We also investigate a practical online usage scenario of BuildFast, where builds are predicted in chronological order, and measure the benefit of correct predictions and the cost of incorrect ones. Our experiments on 20 projects demonstrate that BuildFast outperforms the state-of-the-art approach by 47.5% in F1-score for failed builds.
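The adaptive model described above can be pictured with a small sketch (ours, not BuildFast's implementation): two hypothetical sub-models, with the choice between them driven by the previous build's outcome. All feature names here are made-up stand-ins for the failure-specific features the paper computes from build logs and changed files.

```python
# Illustrative sketch of an adaptive build-outcome predictor: the sub-model is
# selected based on whether the previous build in the history passed or failed.

def model_after_pass(features):
    # Hypothetical sub-model used when the previous build passed: most builds
    # after a pass also pass, so require strong failure signals to flip.
    risky = features.get("risky_file_changed", 0)
    rate = features.get("recent_failure_rate", 0.0)
    return "failed" if risky and rate > 0.5 else "passed"

def model_after_fail(features):
    # Hypothetical sub-model used when the previous build failed: failures tend
    # to cluster, so predict "failed" unless the change looks like a fix.
    return "passed" if features.get("touches_last_failing_file", 0) else "failed"

def predict_build(previous_outcome, features):
    """Adaptive prediction: pick the sub-model from the previous outcome."""
    if previous_outcome == "passed":
        return model_after_pass(features)
    return model_after_fail(features)
```

The point of the switch is that the two regimes (after a pass vs. after a failure) have very different base rates, so a single model tends to drown out the failure signals.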
Using open-source libraries provides rich functionality and reduces development cost. However, it also raises critical issues such as license conflicts and vulnerability risks. In this paper, we design and implement OSLDetector, an open-source library detection tool that matches features to detect third-party libraries in the binaries of multi-platform software. To cope with the challenge of feature duplication, we apply a series of methods, such as feature filtering and a novel internal clone forest. The tool can also report license conflicts and identify possible corresponding vulnerabilities, so these potential risks can be resolved or avoided. To evaluate the efficiency of OSLDetector, we collect 5K libraries containing 9K versions and catalog their respective license types and known vulnerabilities. The experimental results, with a precision of 96% and a recall of 92.3%, show that OSLDetector is effective and outperforms similar tools.
A key challenge in automatic Web testing is the generation of syntactically and semantically valid input values that can exercise the many functionalities imposing constraints on input validity. Existing test case generation techniques either rely on manually curated catalogs of values or extract values from external data sources, such as the Web or publicly available knowledge bases. Unfortunately, relying on manual effort is generally too expensive for most practical applications, while domain-specific and application-specific data can hardly be found either on the Web or in general-purpose knowledge bases. This paper proposes DBInputs, a novel approach that reuses the data in the database of the target Web application to automatically identify domain-specific and application-specific inputs and effectively fulfil the validity constraints present in the tested Web pages. DBInputs copes well with system testing and maintenance testing, since databases are naturally and inexpensively available in those phases. To extract valid inputs from the application database, DBInputs exploits the syntactic and semantic similarity between the identifiers of the input fields and those in the tables of the database, automatically resolving the mismatch between the user interface and the database schema. Our experiments provide initial evidence that DBInputs can outperform both random input selection and LINK, a state-of-the-art approach that searches for inputs in knowledge bases.
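The identifier-matching step can be sketched as follows; this is our simplification, with a token-level Jaccard similarity standing in for the syntactic/semantic measures DBInputs actually uses, and all identifiers invented for illustration.

```python
import re

# Illustrative sketch: pair a web form field with the database column whose
# identifier is most similar, so stored values can be reused as test inputs.

def tokens(identifier):
    # Split identifiers like "userEmail" or "user_email" into lowercase tokens.
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", identifier).replace("_", " ")
    return {part.lower() for part in spaced.split()}

def similarity(a, b):
    # Jaccard similarity between the token sets of two identifiers.
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def best_column(field_id, columns):
    # Pick the database column most similar to the form field's identifier.
    return max(columns, key=lambda c: similarity(field_id, c))
```

For example, `best_column("userEmail", ["id", "user_email", "created_at"])` resolves the camelCase/snake_case mismatch between the user interface and the schema.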
Compiler bugs can be disastrous since they can affect all the software systems built on the buggy compilers. Meanwhile, diagnosing compiler bugs is extremely challenging since usually only limited debugging information is available and a large number of compiler files can be suspicious. More specifically, when compiling a given bug-triggering test program, hundreds of compiler files are usually involved and can all be treated as suspicious buggy files. To facilitate compiler debugging, in this paper we propose RecBi, the first reinforcement-learning-based compiler bug isolation approach via structural mutation. For a given bug-triggering test program, RecBi first augments traditional local mutation operators with structural ones to transform it into a set of passing test programs. Since not all passing test programs help isolate compiler bugs effectively, RecBi further leverages reinforcement learning to intelligently guide the process of passing test program generation. Then, RecBi ranks all the suspicious files by analyzing the compiler execution traces of the generated passing test programs and the given failing test program, following the practice of compiler bug isolation. The experimental results on 120 real bugs from the two most popular open-source C compilers, i.e., GCC and LLVM, show that RecBi is able to isolate about 23%/58%/78% of bugs within Top-1/Top-5/Top-10 compiler files, and significantly outperforms the state-of-the-art compiler bug isolation approach, improving isolation effectiveness by 92.86%/55.56%/25.68% in terms of Top-1/Top-5/Top-10 results.
Software testing is an important and time-consuming task that is often done manually. In the last decades, researchers have developed techniques to generate input data (e.g., fuzzing) and to automate the process of generating test cases (e.g., search-based testing). However, these techniques are known to have their own limitations: search-based testing does not generate highly structured data; grammar-based fuzzing does not generate test case structures. To address these limitations, we combine the two techniques. By applying grammar-based mutations to the input data gathered by the search-based testing algorithm, we can co-evolve both aspects of test case generation. We evaluate our approach in an empirical study on 20 Java classes from the three most popular JSON parsers across multiple search budgets. Our results show that the proposed approach on average improves branch coverage for JSON-related classes by 15% (with a maximum increase of 50%) without negatively impacting other classes.
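A minimal sketch of the combination described above: mutate a JSON string input at the grammar level (add, drop, or change a key) rather than by raw character flips, so the mutant stays syntactically valid and can be fed back into the search loop. The operators here are our own toy examples, not the paper's operator set.

```python
import json
import random

def grammar_mutate(json_text, rng):
    """Mutate a JSON object while keeping it syntactically valid JSON."""
    obj = json.loads(json_text)
    choice = rng.choice(["add_key", "drop_key", "change_value"])
    if isinstance(obj, dict) and obj and choice == "drop_key":
        obj.pop(rng.choice(sorted(obj)))          # remove one key
    elif isinstance(obj, dict) and choice == "add_key":
        obj["k%d" % rng.randrange(100)] = rng.randrange(100)  # add a fresh key
    elif isinstance(obj, dict) and obj:
        obj[rng.choice(sorted(obj))] = rng.randrange(100)     # overwrite a value
    return json.dumps(obj)

rng = random.Random(0)
mutant = grammar_mutate('{"name": "x", "age": 1}', rng)
json.loads(mutant)  # still parses: structure-level mutation preserved validity
```

In a search-based loop, such mutants would be scored by the usual coverage-based fitness, which is what lets the input data and the test-case structure co-evolve.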
Methods for randomized testing of compilers to find miscompilation bugs typically require a way to generate programs that are free from undefined behaviour (UB). Tools such as Csmith achieve UB-freedom by heavily restricting the form of generated programs. This leads to highly idiomatic programs, and we hypothesise that this limits the thoroughness with which compilers are tested. Our idea is that researchers should investigate ways to generate less restricted programs that are still UB-free: programs that get closer to the edge of undefined behaviour, but do not quite cross it. We present experiments investigating one instance of this idea via a prototype tool, CsmithEdge, which uses a simple dynamic analysis to detect where Csmith has been too conservative in its use of “safe math” wrappers that guarantee UB-freedom for arithmetic operations, and eliminates the redundant wrappers. By reducing the use of safe math wrappers, CsmithEdge was able to discover two new miscompilation bugs in GCC that could not be found via intensive testing with regular Csmith, while also achieving substantial differences in code coverage on GCC compared with regular Csmith.
Evolutionary intelligence approaches have been successfully applied to assist developers during debugging by generating test cases that reproduce reported crashes. These approaches use a single fitness function called Crash Distance to guide the search process toward reproducing a target crash. Despite the reported achievements, these approaches do not always reproduce some crashes due to a lack of test diversity (premature convergence). In this study, we introduce a new approach, called MO-HO, that addresses this issue via multi-objectivization. In particular, we introduce two new Helper-Objectives for crash reproduction, namely test length (to minimize) and method sequence diversity (to maximize), in addition to Crash Distance. We assessed MO-HO using five multi-objective evolutionary algorithms (NSGA-II, SPEA2, PESA-II, MOEA/D, FEMO) on 124 hard-to-reproduce crashes stemming from open-source projects. Our results indicate that SPEA2 is the best-performing multi-objective algorithm for MO-HO. We evaluated this best-performing configuration against the state of the art: a single-objective approach (Single-Objective Search) and a decomposition-based multi-objectivization approach (De-MO). Our results show that MO-HO reproduces five crashes that cannot be reproduced by the current state of the art. Besides, MO-HO improves effectiveness (+10% and +8% in reproduction ratio) and efficiency in 34.6% and 36% of crashes (i.e., significantly lower running time) compared to Single-Objective Search and De-MO, respectively. For some crashes, the improvements are very large, up to +93.3% in reproduction ratio and -92% in required running time.
Model transformations play an important role in the evolution of systems in various fields such as healthcare, automotive and aerospace industry. Thus, it is important to check the correctness of model transformation programs. Several approaches have been proposed to generate test cases for model transformations based on different coverage criteria (e.g., statements, rules, metamodel elements, etc.). However, the execution of a large number of test cases during the evolution of transformation programs is time-consuming and may include a lot of overlap between the test cases. In this paper, we propose a test case selection approach for model transformations based on multi-objective search. We use the non-dominated sorting genetic algorithm (NSGA-II) to find the best trade-offs between two conflicting objectives: (1) maximize the coverage of rules and (2) minimize the execution time of the selected test cases. We validated our approach on several evolution cases of medium and large ATLAS Transformation Language (ATL) programs.
Link to Publication: https://link.springer.com/article/10.1007%2Fs10515-020-00271-w
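The two-objective trade-off driving the test selection above can be illustrated with a small sketch (ours, not the paper's NSGA-II implementation): given candidate selections scored by (rule coverage, execution time), keep only the non-dominated ones, i.e., the Pareto front that NSGA-II converges toward. The candidate tuples are made up.

```python
# Pareto-front sketch for the two objectives: maximize coverage, minimize time.

def dominates(a, b):
    # a dominates b if it covers at least as many rules, runs at most as long,
    # and differs on at least one objective. Tuples are (coverage, time).
    return a[0] >= b[0] and a[1] <= b[1] and a != b

def pareto_front(candidates):
    # Keep every candidate not dominated by any other: the set of best trade-offs.
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]

suites = [(10, 5.0), (10, 9.0), (7, 2.0), (3, 2.5)]
front = pareto_front(suites)  # (10, 9.0) and (3, 2.5) are dominated
```

NSGA-II adds non-dominated sorting and crowding distance on top of this relation to evolve a diverse front rather than enumerate it exhaustively.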
Approaches for automatic crash reproduction aim to generate test cases that reproduce crashes starting from the crash stack traces. These tests help developers during their debugging practices. One of the most promising techniques in this research field leverages search-based software testing for generating crash-reproducing test cases. In this paper, we introduce Botsing, an open-source search-based crash reproduction framework for Java. Botsing implements state-of-the-art and novel approaches for crash reproduction. The well-documented architecture of Botsing makes it an easy-to-extend framework, which can hence be used to implement new approaches that improve crash reproduction. We have applied Botsing to a wide range of crashes collected from open-source systems. Furthermore, we conducted a qualitative assessment of the crash-reproducing test cases with our industrial partners. In both cases, Botsing could reproduce a notable portion of the given stack traces.
Ask Me Anything
Massimiliano Di Penta is a full professor at the University of Sannio, Italy. His research interests include software maintenance and evolution, mining software repositories, empirical software engineering, search-based software engineering, and software testing. He is an author of over 260 papers that have appeared in international journals, conferences, and workshops. He has received several awards for research and service, including four ACM SIGSOFT Distinguished Paper awards. Most importantly, he has received several distinguished reviewer awards. He serves and has served on the organizing and program committees of more than 100 conferences, including ICSE, FSE, ASE, and ICSME. He will be program co-chair of ESEC/FSE 2021 and of ICSE 2023. He is co-editor-in-chief of the Journal of Software: Evolution and Process, published by Wiley, an editorial board member of ACM Transactions on Software Engineering and Methodology and of the Empirical Software Engineering journal, published by Springer, and has served on the editorial board of IEEE Transactions on Software Engineering.
In unit testing, mocking is widely used to ease test effort, reduce test flakiness, and increase test coverage by replacing actual dependencies with simple implementations. However, there are no clear criteria to determine which dependencies in a unit test should be mocked. Inappropriate mocking can have undesirable consequences: under-mocking could result in the inability to isolate the class under test (CUT) from its dependencies, while over-mocking increases the developers’ burden of maintaining the mocked objects and may lead to spurious test failures. According to existing work, various factors can determine whether a dependency should be mocked. As a result, mocking decisions are often difficult to make in practice. Studies on the evolution of mocked objects also showed that developers tend to change their mocking decisions: 17% of the studied mocked objects were introduced some time after the test scripts were created, and another 13% of the originally mocked objects eventually became unmocked. In this work, we are motivated to develop an automated technique that makes mocking recommendations to facilitate unit testing. We studied 10,846 test scripts in four actively maintained open-source projects that use mocked objects, aiming to characterize the dependencies that are mocked in unit testing. Based on our observations of mocking practices, we designed and implemented a tool, MockSniffer, to identify and recommend mocks for unit tests. The tool is fully automated and requires only the CUT and its dependencies as input. It leverages machine learning techniques to make mocking recommendations by holistically considering multiple factors that can affect developers’ mocking decisions. Our evaluation of MockSniffer on ten open-source projects showed that it outperformed three baseline approaches and achieved good performance in two potential application scenarios.
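MockSniffer itself trains a machine-learning model over such factors; the sketch below only illustrates the flavor of a feature-based mock/no-mock decision, using a hand-written scoring rule over hypothetical dependency features of our own invention.

```python
# Illustrative rule-based stand-in for a learned mock-recommendation model.
# Each feature flag describes a dependency of the class under test (CUT).

def should_mock(dep):
    """Score a dependency; a high score suggests it should be mocked."""
    score = 0
    score += 2 if dep.get("does_io", False) else 0           # network / file system
    score += 2 if dep.get("nondeterministic", False) else 0  # time, randomness
    score += 1 if dep.get("hard_to_construct", False) else 0 # complex setup
    score -= 1 if dep.get("simple_value_object", False) else 0
    return score >= 2
```

A real model learns these weights (and their interactions) from observed mocking decisions in existing test suites instead of fixing them by hand.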
Today, most automated test generators, such as search-based software testing (SBST) techniques, focus on achieving high code coverage. However, high code coverage is not sufficient to maximise the number of bugs found, especially given a limited testing budget. In this paper, we propose an automated test generation technique that is also guided by the estimated degree of defectiveness of the source code. Parts of the code that are likely to be more defective receive a larger share of the testing budget than less defective parts. To measure the degree of defectiveness, we leverage Schwa, a notable defect prediction technique.
We implement our approach in EvoSuite, a state-of-the-art SBST tool for Java. Our experiments on the Defects4J benchmark demonstrate the improved efficiency of defect-prediction-guided test generation and confirm our hypothesis that spending more of the time budget on likely defective parts increases the number of bugs found within the same overall budget.
Crowdsourced mobile testing has been widely used due to its convenience and high efficiency. Crowdsourced workers complete testing tasks and record the results in test reports. However, the problem of duplicate reports has prevented the efficiency of crowdsourced mobile testing from improving further. Existing crowdsourced testing report analysis techniques usually leverage screenshots and text descriptions independently, but fail to recognize the link between these two types of information. In this paper, we present a crowdsourced mobile testing report selection tool, STIFA, which extracts image and text feature information from reports and establishes an image-text-fusion bug context. Based on the fused text and image analysis results, STIFA performs cluster analysis and report selection. For evaluation, we employed STIFA to analyze 150 reports from 2 apps. The results show that STIFA can extract, on average, 95.23% of text feature information and 84.15% of image feature information. Besides, STIFA reaches an accuracy of 87.64% in detecting duplicate reports. The demo can be found at https://github.com/ZhenfeiCao/STIFA.
Network partitions are inevitable in large-scale cloud systems. Despite developers’ efforts in handling network partitions throughout the design, implementation, and testing of cloud systems, bugs caused by network partitions, i.e., partition bugs, still exist and cause severe failures in production clusters. It is challenging to expose these partition bugs because they often require network partitions to start and stop at specific timings.
In this paper, we propose Consistency-Guided Fault Injection (CoFI), a novel technique that smartly injects network partitions to effectively expose partition bugs. We observe that network partitions can leave cloud systems in inconsistent states, where partition bugs are more likely to occur. Based on this observation, CoFI first infers invariants (i.e., consistent states) among different nodes in a cloud system. Once it observes a violation of the inferred invariants (i.e., an inconsistent state) while running the cloud system, CoFI injects network partitions to prevent the cloud system from recovering to consistent states, and thoroughly tests whether the cloud system still proceeds correctly in the inconsistent states. We have applied CoFI to three widely deployed cloud systems, i.e., Cassandra, HDFS, and YARN. CoFI has detected 7 previously unknown bugs, and three of them have been confirmed by developers.
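The trigger logic above can be sketched in miniature (our reconstruction, not CoFI's code): infer predicates that held in every observed snapshot of two nodes' state, then report the moment a run violates one of them, which is exactly when a partition would be injected. The state fields and predicate names are invented for illustration.

```python
# Sketch: invariant inference over observed snapshots, plus a trigger that
# fires when a later snapshot violates an inferred invariant.

def infer_invariants(histories):
    # Candidate cross-node predicates; keep only those that always held.
    candidates = {
        "leader_agreed": lambda s: s["node_a"]["leader"] == s["node_b"]["leader"],
        "epoch_agreed": lambda s: s["node_a"]["epoch"] == s["node_b"]["epoch"],
    }
    return {n: p for n, p in candidates.items() if all(p(s) for s in histories)}

def partition_trigger(invariants, snapshot):
    # Return the violated invariant's name: the moment to inject a partition.
    for name, pred in invariants.items():
        if not pred(snapshot):
            return name
    return None
```

Injecting the partition at that moment keeps the system pinned in the inconsistent state instead of letting it heal, which is what makes the dormant partition bugs observable.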
In recent years the use of non-traditional computing mechanisms has grown rapidly. One paradigm uses chemical reaction networks (CRNs) to compute via chemical interactions. CRNs are used to prototype molecular devices at the nanoscale, such as intelligent drug therapeutics. In practice, these programs are first written and simulated in environments such as MATLAB and later compiled into physical molecules such as DNA strands. However, techniques for testing the correctness of CRNs are lacking. Current methods of validating CRNs include model checking and theorem proving, but these are limited in scalability. In this paper we present an automated testing framework, ChemTest. In contrast to model checking, ChemTest evaluates test oracles on individual simulation traces and supports functional, metamorphic, internal, and hyper test cases. It also accommodates flakiness and probabilistic programs. We performed a large case study demonstrating that ChemTest can find seeded faults and scales beyond model checking. Of our tests, 21% are inherently flaky, suggesting that systematic support for this paradigm is needed. On average, functional tests find 66.5% of the faults, while metamorphic tests find 80.4%, showing the benefit of using metamorphic relationships in our test framework. In addition, we show how the time of evaluation impacts fault detection.
As a common IT infrastructure, APM (Application Performance Management) systems have been widely adopted to monitor call requests to online services. Each request may contain multi-dimensional attributes (e.g., City, ISP, Platform), which, either alone or in combination, may explain a particular anomaly such as a DSR (Declining Success Rate) of service calls. Moreover, each attribute may take multiple values (e.g., ISP could be T-Mobile, Vodafone, CMCC, etc.), rendering the root causes intricate and their identification highly challenging. In this paper, we propose a prototype tool, ImpAPTr (Impact Analysis based on Pruning Tree), to identify, in a timely manner, the combinations of dimensional attributes that serve as clues to the root causes of DSR anomalies in service calls. ImpAPTr has been evaluated at MeiTuan, one of the biggest online service providers, and its accuracy outperforms that of several previous tools in the field.
Just-In-Time (JIT) defect prediction is a classification model trained on historical data to predict bug-introducing changes. However, recent studies raised concerns about the explainability of the predictions of many software analytics applications (i.e., practitioners do not understand why commits are risky and how to improve them). In addition, the adoption of Just-In-Time defect prediction is still limited due to a lack of integration into CI/CD pipelines and modern software development platforms (e.g., GitHub). In this paper, we present an explainable Just-In-Time defect prediction framework that automatically generates feedback to developers by providing the riskiness of each commit, explaining why the commit is risky, and suggesting risk mitigation plans. The proposed framework is integrated into the GitHub CI/CD pipeline as a GitHub application to continuously monitor and analyse a stream of commits in many GitHub repositories. Finally, we discuss the usage scenarios and their implications for practitioners. The video demonstration is available at https://youtu.be/HJBzULrS6hE.
This paper presents AirMochi, a tool that provides remote access to and control of apps by leveraging a mobile platform’s publicly exported accessibility features. While AirMochi is designed to be platform-independent, we discuss its iOS implementation. We show that AirMochi places no restrictions on apps, handles a variety of scenarios, and imposes a negligible performance overhead. A demo video is available at https://youtu.be/rhPz2Hs4Ius and the source code at https://github.com/nkllkc/air_mochi.
Recognition of human behaviours, including body motions and facial expressions, plays a significant role in human-centric software engineering. However, due to the data- and computation-intensive nature of human behaviour recognition through video analytics, expensive powerful machines are often required, which could hinder research and application in human-centric software engineering. To address this issue, this paper proposes a cost-effective human behaviour recognition system named Edge4Real, which can be easily deployed in an edge computing environment with commodity machines. Compared with existing centralised solutions, Edge4Real has three major advantages: cost-effectiveness, ease of use, and real-time operation. Specifically, Edge4Real adopts a distributed architecture where components such as motion capture, human behaviour recognition, data decoding and extraction, and the application of the recognition result can be deployed on separate end devices and edge nodes in an edge computing environment. Using a virtual reality application that can capture a user’s motion and translate it into the motion of a 3D avatar in real time, we successfully validate the effectiveness of the system and demonstrate its promising value for the research and application of human-centric software engineering. The demo video can be found at https://youtu.be/tnEshD8j-kA.
More and more new technologies are being applied to test development. Among them, automatic test generation, a promising technology for improving the efficiency of unit testing, does not yet perform satisfactorily in practice. Test recommendation, like code recommendation, is another feasible technology for supporting efficient unit testing and is attracting increasing attention. In this paper, we develop a novel system, HomoTR, which implements online test recommendation by measuring the homology of two methods. If a new method under test shares homology with an existing method that has tests, HomoTR recommends those tests for the new method. Preliminary experiments show that HomoTR can quickly and effectively recommend test cases to help testers improve testing efficiency. Besides, HomoTR has been successfully integrated into the MoocTest platform, so it can also execute the recommended tests automatically and visualize the testing results (e.g., branch coverage) in a user-friendly way to help testers understand the testing process. The demo video of HomoTR can be found at https://youtu.be/_227EfcUbus.
WebAssembly is a new programming language built for better performance in web applications. It defines a binary code format and a text representation for the code. At first glance, WebAssembly files are not easily understandable to human readers, regardless of their experience level. As a result, distributed third-party WebAssembly modules need to be implicitly trusted by developers, as verifying their functionality requires significant effort. To this end, we develop WASim, an automated classification tool that identifies the purpose of WebAssembly programs by analyzing module-level features. It assigns purpose labels to a module in order to assist developers in understanding the binary module. The code for WASim is available at https://github.com/WASimilarity/WASim and a video demo is available at https://youtu.be/usfYFIeTy0U.
Background: The detection and extraction of causality from natural language sentences have shown great potential in various fields of application. The field of requirements engineering is eligible for multiple reasons: (1) requirements artifacts are primarily written in natural language, (2) causal sentences convey essential context about the subject of requirements, and (3) extracted and formalized causality relations are usable for a (semi-)automatic translation into further artifacts, such as test cases. Objective: We aim to understand the value of interactive causality extraction based on syntactic criteria in the context of requirements engineering. Method: We developed a prototype of a system for automatic causality extraction and evaluate it by applying it to a set of publicly available requirements artifacts, determining whether the automatic extraction reduces the manual effort of requirements formalization. Result: During the evaluation, we analyzed 2373 natural language sentences from 13 requirements documents, 282 of which were causal (11.88%). For the best-performing requirements document, the system automatically extracted on average 7.2 of 14 cause-effect graphs (51.42%), which demonstrates the feasibility of the approach. Limitation: The feasibility of the approach has been proven in theory, but actual human interaction with the system has been disregarded so far. Evaluating the applicability of the automatic causality extraction for a requirements engineer is left for future research. Conclusion: A syntactic approach to causality extraction is viable in the context of requirements engineering and can aid a pipeline towards the automatic generation of further artifacts, like test cases, from requirements artifacts.
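A toy version of syntactic causality detection (ours, far simpler than the paper's system) illustrates the idea: spot a causal cue word and split the sentence into a cause-effect pair. It only handles cues appearing mid-sentence; the cue list is a small sample.

```python
import re

# Cue words checked in order; each is matched as a whole word.
CUES = ["because", "if", "in case", "due to"]

def extract_causality(sentence):
    """Return {'cue', 'cause', 'effect'} for a causal sentence, else None."""
    s = sentence.lower().rstrip(".")
    for cue in CUES:
        m = re.search(r"\b" + re.escape(cue) + r"\b", s)
        if m:
            effect = s[: m.start()].strip().rstrip(",")  # text before the cue
            cause = s[m.end():].strip().lstrip(",").strip()  # text after it
            return {"cue": cue, "cause": cause, "effect": effect}
    return None
```

Formalized pairs like these are the raw material for deriving further artifacts, e.g., a test case asserting the effect whenever the cause condition is set up.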
Cross-Project Defect Prediction (CPDP), which borrows data from similar projects by combining a transfer learner with a classifier, has emerged as a promising way to predict software defects when the available data about the target project is insufficient. However, developing such a model is challenging because it is difficult to determine the right combination of transfer learner and classifier along with their optimal hyper-parameter settings. In this paper, we propose a tool, dubbed BiLO-CPDP, which is the first to formulate automated CPDP model discovery from the perspective of bi-level programming. In particular, bi-level programming conducts the optimization with two nested levels in a hierarchical manner. Specifically, the upper-level optimization routine searches for the right combination of transfer learner and classifier, while the nested lower-level optimization routine optimizes the corresponding hyper-parameter settings. To evaluate BiLO-CPDP, we conduct experiments on 20 projects to compare it with 21 existing CPDP techniques, along with its single-level optimization variant and Auto-Sklearn, a state-of-the-art automated machine learning tool. Empirical results show that BiLO-CPDP achieves better prediction performance than all 21 existing CPDP techniques on 70% of the projects, while being overwhelmingly superior to Auto-Sklearn and its single-level optimization variant in all cases. Furthermore, the unique bi-level formulation in BiLO-CPDP also permits allocating more budget to the upper level, which significantly boosts performance.
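A minimal sketch of the bi-level idea (not BiLO-CPDP's code): the upper level enumerates (transfer learner, classifier) combinations, the lower level tunes a hyper-parameter for each fixed combination, and combinations are compared at their tuned optima. The scoring function and the technique names are made-up stand-ins for validation F1.

```python
# Bi-level search sketch: outer loop over combinations, inner loop over
# hyper-parameter settings for each fixed combination.

def lower_level(combo, score_fn, grid):
    # Inner optimization: best hyper-parameter setting for this combination.
    return max(((score_fn(combo, h), h) for h in grid), key=lambda t: t[0])

def bi_level_search(combos, score_fn, grid):
    # Outer optimization: best combination, each judged at its tuned optimum.
    best = max(combos, key=lambda c: lower_level(c, score_fn, grid)[0])
    return best, lower_level(best, score_fn, grid)

# Toy score: pretend the ("TCA", "RF") combination peaks at h = 0.5.
def toy_score(combo, h):
    base = {("TCA", "RF"): 0.7, ("NNfilter", "LR"): 0.6}[combo]
    return base - abs(h - 0.5)

combos = [("TCA", "RF"), ("NNfilter", "LR")]
best, (score, h) = bi_level_search(combos, toy_score, [0.1, 0.3, 0.5, 0.7])
```

The key property the nesting buys is that a combination is never dismissed because of a badly chosen hyper-parameter, since each one is tuned before comparison.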
Code comments are valuable for program comprehension and software maintenance, and they also require maintenance as code evolves. However, when changing code, developers sometimes neglect to update the related comments, introducing inconsistent or obsolete comments (aka bad comments). Such comments are detrimental since they may mislead developers and lead to future bugs. Therefore, it is necessary to fix and avoid bad comments. In this work, we argue that bad comments can be reduced and even avoided by automatically performing comment updates along with code changes. We refer to this task as “Just-In-Time (JIT) Comment Updating” and propose an approach named CUP (Comment UPdater) to automate it. CUP can assist developers in updating comments during code changes and can consequently help avoid the introduction of bad comments. Specifically, CUP leverages a novel neural sequence-to-sequence model to learn comment update patterns from existing code-comment co-changes and can automatically generate a new comment based on its corresponding old comment and code change. We introduce several customized enhancements in CUP, such as a special tokenizer and a novel co-attention mechanism, to handle the characteristics of this task. We build a dataset with over 108K comment-code co-change samples and evaluate CUP on it. The evaluation results show that CUP outperforms an information-retrieval-based baseline and a rule-based baseline by substantial margins, and can reduce the edits developers need for JIT comment updating. In addition, the comments generated by our approach are identical to those updated by developers in 1612 (16.7%) test samples, 7 times more than the best-performing baseline.
In object-oriented programming, a method is pure if calling it does not change object states that exist in the pre-states of the method call. Pure methods are widely used in automatic techniques, including test generation, compiler optimization, and program repair. Due to source code dependencies, it is infeasible to completely and accurately identify all pure methods. Instead, existing techniques such as ReImInfer are designed to identify a subset of accurately classified pure methods and mark the other methods as unknown. In this paper, we designed and implemented MetPurity, a learning-based tool for pure method identification. Given all methods in a project, MetPurity labels a training set via automatic program analysis and builds a binary classifier (implemented with a random forest classifier) on the training set. This classifier is used to predict the purity of all the other methods (i.e., the unknown ones) in the same project. A preliminary evaluation on four open-source Java projects shows that MetPurity can provide a list of identified pure methods with a low error rate. Applying MetPurity to EvoSuite can increase the number of killed mutants in EvoSuite's test generation. A demo video of this tool can be found at https://youtu.be/Ac3cmjn4CCs; the prototype and evaluation data are available at http://cstar.whu.edu.cn/p/metpurity/.
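The purity definition above can be made concrete with a tiny dynamic check (an illustration in Python; MetPurity itself works by program analysis and learning): snapshot the receiver's state, call the method, and report it impure if any pre-existing state changed.

```python
import copy

# Dynamic purity check: a call is pure if the receiver's observable state
# (its attribute dictionary) is unchanged by the call.

def is_pure_call(obj, method_name, *args):
    before = copy.deepcopy(vars(obj))   # snapshot the pre-state
    getattr(obj, method_name)(*args)    # perform the call
    return vars(obj) == before          # pure iff the pre-state survived

class Counter:
    def __init__(self):
        self.n = 0
    def peek(self):   # pure: only reads state
        return self.n
    def bump(self):   # impure: mutates self.n
        self.n += 1
```

A dynamic check like this only observes one execution; tools such as ReImInfer and MetPurity aim to classify methods over all possible calls, which is what makes the problem hard.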
Software performance testing is an essential quality assurance mechanism that can identify optimization opportunities. Automating this process requires strong tool support, especially in the case of Continuous Integration (CI), where tests need to run completely automatically and it is desirable to provide developers with actionable feedback. Due to a lack of existing tools, performance testing is normally left out of the scope of CI. In this paper, we propose a toolchain, PerfCI, to pave the way for developers to easily set up and carry out automated performance testing under CI. Our toolchain allows users to (1) specify performance testing tasks, (2) analyze unit tests on a variety of Python projects, ranging from scripts to full-blown Flask-based web services, by extending a performance analysis framework (VyPR), and (3) evaluate performance data to get feedback on the code. We demonstrate the feasibility of our toolchain by using it on a web service running at the Compact Muon Solenoid (CMS) experiment at CERN, the world's largest particle physics laboratory.
When generating GUI tests for Android apps, a separate test computer typically generates interactions, which are then executed on an actual Android device. While this approach is efficient in the sense that apps and interactions execute quickly, the communication overhead between the test computer and the device slows down testing considerably. In this work, we present DD-2, a test generator for Android that tests other apps on the device itself using Android accessibility services. In our experiments, DD-2 has been shown to be 3.2 times faster than its computer-device counterpart, while sharing the same source code.
A software developer works on many tasks per day, frequently switching back and forth between them. This constant churn of tasks makes it difficult for a developer to know the specifics of when they worked on what task, complicating task resumption, planning, retrospection, and reporting activities. We introduce a new approach to help identify the topic of work for a given time interval, based on capturing the contents of the developer's active window at regular intervals and creating a vector representation of the key information the developer viewed. To evaluate our approach, we created a data set in which multiple developers worked on the same set of six information-seeking tasks, which we also make available for other researchers to investigate similar approaches. Our analysis shows that our approach enables: 1) segments of a developer's work to be automatically associated with a task from a known set of tasks with an average accuracy of 70.6%, and 2) a visual representation describing a segment of work that a developer can use to recognize a task with an average accuracy of 67.9%.
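As a rough sketch of the idea, the captured interval contents can be turned into bag-of-words vectors and matched to known tasks by cosine similarity. The representation below is an assumption for illustration; the actual approach captures richer key information from the active window:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Known tasks, represented by words previously seen while working on them
tasks = {
    "fix-login-bug": Counter("login session token auth".split()),
    "update-docs": Counter("readme markdown tutorial docs".split()),
}

# Words captured from the developer's active window in one interval
segment = Counter("auth token login failure".split())

# Associate the segment with the most similar known task
best = max(tasks, key=lambda t: cosine(segment, tasks[t]))
print(best)  # → fix-login-bug
```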
Code context models consist of source code elements and their relations that are relevant to a developer's task at hand. Prior research showed that making code context models explicit in software tools can benefit software development practices, e.g., code navigation and searching. However, little focus has been put on how to proactively form code context models. In this paper, we explore the proactive formation of code context models based on the topological patterns of code elements in the interaction histories of a project. Specifically, we first learn abstract topological patterns based on the stereotype roles of code elements, rather than on specific code elements; we then leverage the learned patterns to predict the code context model for a given task via graph pattern matching. To determine the effectiveness of this approach, we applied it to the interaction histories stored for the Eclipse Mylyn open-source project. We found that our approach achieves maximum F-measures of 0.67, 0.33 and 0.21 for 1-step, 2-step and 3-step predictions, respectively. The most similar approach to ours is Suade, which supports 1-step prediction only. In comparison to this existing work, our approach predicts code context models with a significantly higher F-measure (0.57 over 0.23 on average). The results demonstrate the value of integrating historical and structural approaches to form more accurate code context models.
Self-Admitted Technical Debt (SATD) is a sub-type of technical debt: it refers to technical debt that is intentionally introduced by developers in the process of software development. While SATDs can bring short-term benefits, they often must be paid back later at a higher cost, e.g., by introducing bugs or increasing the complexity of the software. To cope with these issues, our community has proposed various machine learning-based approaches to detect SATDs. These approaches, however, are either not generic, usually requiring manual feature engineering, or do not provide promising means to explain the predicted outcomes. To that end, we propose a novel approach, named HATD, to detect and explain SATDs using attention-based neural networks. Through extensive experiments on 445,365 comments in 20 projects, we show that HATD is effective in detecting SATDs on both in-the-lab and in-the-wild datasets under both within-project and cross-project settings. HATD also outperforms the state-of-the-art approaches in detecting and explaining SATDs.
Code retrieval helps developers reuse code snippets from open-source projects. Given a natural language description, code retrieval aims to search for the most relevant code among a set of candidates. Existing state-of-the-art approaches apply neural networks to code retrieval. However, these approaches still fail to capture an important feature: overlaps. The overlaps between different names used by different people indicate that two different names may be potentially related (e.g., “message” and “msg”), and the overlaps between identifiers in code and words in natural language descriptions indicate that the code snippet and the description may be potentially related. To address these problems, we propose a novel neural architecture named OCoR, in which we introduce two specifically designed components to capture overlaps: the first embeds identifiers character by character to capture the overlaps between identifiers, and the second introduces a novel overlap matrix to represent the degree of overlap between each natural language word and each identifier.
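One plausible way to realize such an overlap matrix is to score the character-level similarity between every description word and every identifier. The scoring function below (difflib's sequence matching) is our illustrative choice, not necessarily OCoR's exact formulation:

```python
from difflib import SequenceMatcher

def overlap(word: str, identifier: str) -> float:
    """Degree of character-level overlap in [0, 1]."""
    return SequenceMatcher(None, word.lower(), identifier.lower()).ratio()

words = ["message", "send"]
identifiers = ["msg", "sendMsg"]

# Rows: description words; columns: code identifiers
matrix = [[overlap(w, ident) for ident in identifiers] for w in words]
for w, row in zip(words, matrix):
    print(w, [round(v, 2) for v in row])
```

Note how "message" scores high against "msg" even though the strings share no full token, which is exactly the signal a purely word-level model would miss.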
The evaluation was conducted on two established datasets. The experimental results show that OCoR significantly outperforms the existing state-of-the-art approaches, achieving improvements of 13.1% to 22.3%. Moreover, we conducted several in-depth experiments to help understand the performance of the different components of OCoR.
Given a bug report for a project, the task of locating the faults responsible for the bug report is called fault localization. To help programmers in the fault localization process, many approaches have been proposed and have achieved promising results in locating faulty files. However, it is still challenging to locate faulty methods, because many methods are short and do not carry sufficient details to determine whether they are faulty. In this paper, we present BugPecker, a novel approach that locates faulty methods through deep learning on revision graphs. Its key ideas are (1) building revision graphs that capture the details of past fixes as fully as possible, and (2) discovering relations inside our revision graphs to expand the details available for methods and to calculate various features that assist our ranking. We have implemented BugPecker and evaluated it on three open-source projects. The early results show that BugPecker achieves a mean average precision (MAP) of 0.263 and a mean reciprocal rank (MRR) of 0.291, improving on prior approaches significantly. For example, BugPecker improves the MAP values of all three projects by five times compared with two recent approaches, DNNLoc-m and BLIA 1.5.
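The two reported metrics can be made concrete with a short sketch. The queries and rankings below are made up for illustration; only the metric definitions follow the standard MAP/MRR formulas:

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant items."""
    hits, score = 0, 0.0
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            score += hits / i
    return score / len(relevant) if relevant else 0.0

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if none is found."""
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / i
    return 0.0

# Two hypothetical bug reports with ranked method suggestions
queries = [
    (["m1", "m2", "m3"], {"m2"}),        # faulty method ranked 2nd
    (["m4", "m5", "m6"], {"m4", "m6"}),  # two faulty methods, ranks 1 and 3
]
map_score = sum(average_precision(r, rel) for r, rel in queries) / len(queries)
mrr_score = sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
print(round(map_score, 3), round(mrr_score, 3))  # → 0.667 0.75
```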
Systematic model-driven design and early validation enable engineers to verify that a reactive system does not violate its requirements before actually implementing it. Requirements may come from multiple stakeholders, who are often concerned with different facets – design typically involves different experts having different concerns and views of the system. Engineers start from a specification which may be sourced from some domain model, while validation is often done on state-transition structures that support model checking. Two computationally expensive steps may work against scalability: transformation from specification to state-transition structures, and model checking. We propose a technique that makes the former efficient and also makes the resulting transition systems small enough to be efficiently verified. The technique automatically projects the specification into submodels depending on a property sought to be evaluated, which captures some stakeholder’s viewpoint. The resulting reactive system submodel is then transformed into a state-transition structure and verified. The technique achieves cone-of-influence reduction, by slicing at the specification model level. Submodels are analysis-equivalent to the corresponding full model. If stakeholders propose a change to a submodel based on their own view, changes are automatically propagated to the specification model and other views affected. Automated reflection is achieved thanks to bidirectional model transformations, ensuring correctness. We cast our proposal in the context of graph-based reactive models whose dynamics is described by rewriting rules. We demonstrate our view-based framework in practice on a case study within cyber-physical systems.
Signal-based temporal properties (SBTPs) characterize the behavior of a system when its inputs and outputs are signals over time; they are very common for the requirements specification of cyber-physical systems. Although there exist several specification languages for expressing SBTPs, such languages either do not easily allow the specification of important types of properties (such as spike or oscillatory behaviors), or are not supported by (efficient) trace-checking procedures.
In this paper, we propose SB-TemPsy, a novel model-driven trace-checking approach for SBTPs. SB-TemPsy provides (i) SB-TemPsy-DSL, a domain-specific language that allows the specification of SBTPs covering the most frequent requirement types in cyber-physical systems, and (ii) SB-TemPsy-Check, an efficient, model-driven trace-checking procedure. This procedure reduces the problem of checking an SB-TemPsy-DSL property over an execution trace to the problem of evaluating an Object Constraint Language constraint on a model of the execution trace.
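To give a flavor of the signal-based properties involved, the sketch below checks a toy spike property directly over a sampled trace of (timestamp, value) pairs. This is a simplified illustration of the property class, not of SB-TemPsy itself, which specifies such properties in SB-TemPsy-DSL and evaluates them as OCL constraints over a trace model:

```python
def has_spike(trace, amplitude, width):
    """True if the signal rises and falls by >= amplitude within `width` samples."""
    for i, (_, start_value) in enumerate(trace):
        window = [value for _, value in trace[i:i + width]]
        peak = max(window)
        # A spike: the peak stands at least `amplitude` above both the
        # start and the end of the window.
        if peak - start_value >= amplitude and peak - window[-1] >= amplitude:
            return True
    return False

# A sampled signal with a brief spike at t = 0.2
trace = [(0.0, 1.0), (0.1, 1.1), (0.2, 5.0), (0.3, 1.2), (0.4, 1.0)]
print(has_spike(trace, amplitude=3.0, width=3))  # → True
```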
We evaluated our contributions by assessing the expressiveness of SB-TemPsy-DSL and the applicability of SB-TemPsy-Check using a representative industrial case study in the satellite domain. SB-TemPsy-DSL could express 97% of the requirements of our case study and SB-TemPsy-Check yielded a trace-checking verdict in 87% of the cases, with an average checking time of 48.7 s. From a practical standpoint and compared to state-of-the-art alternatives, our approach strikes a better trade-off between expressiveness and performance as it supports a large set of property types that can be checked, in most cases, within practical time limits.
High-fidelity Graphical User Interface (GUI) prototyping is a well-established and suitable method for enabling fruitful discussion, clarification and refinement of requirements formulated by customers. GUI prototypes can help reduce misunderstandings between customers and developers, which may occur due to the ambiguity inherent in informal natural language (NL). However, a disadvantage of employing high-fidelity GUI prototypes is their time-consuming and expensive development. Common GUI prototyping tools are based on combining individual GUI components or manually crafted templates. In this work, we present GUI2WiRe, a tool that enables users to retrieve GUI prototypes matching natural language requirements (NLR) from a semi-automatically created large-scale GUI repository for mobile applications. We extract multiple text segments from the GUI hierarchy data and employ various Information Retrieval (IR) models and Automatic Query Expansion (AQE) techniques to achieve ad-hoc GUI retrieval from NLR. Retrieved GUI prototypes mined from applications can be inserted into the graphical editor of GUI2WiRe to rapidly create wireframes. GUI components are extracted automatically from the GUI screenshots, and basic editing functionality is provided to the user. Finally, a preview of the application is created from the wireframe to allow interactive exploration of the current design. We evaluated the applied IR and AQE approaches for their effectiveness in terms of GUI retrieval relevance on a manually annotated collection of NLR, and we discuss our planned user studies.
Mobile operating systems evolve quickly, frequently updating the APIs that app developers use to build their apps. Unfortunately, API updates do not always guarantee backward compatibility, causing apps to no longer work properly or even crash when running on an updated system. This paper presents FILO, a tool that assists Android developers in resolving backward compatibility issues introduced by API upgrades. FILO both suggests the method that needs to be modified in the app to adapt it to an upgraded API, and reports key symptoms observed in the failed execution to facilitate the fixing activity. Results obtained from the analysis of 12 actual upgrade problems, and feedback from early tool adopters, show that FILO can practically support Android developers. FILO can be downloaded from https://gitlab.com/learnERC/filo, and its video demonstration is available at https://youtu.be/WDvkKj-wnlQ.
As most smart systems, such as smart logistics and smart manufacturing, are delay-sensitive, the current mainstream cloud-computing-based system architecture faces the critical issue of high latency over the Internet. Meanwhile, as a huge amount of data is generated by smart devices with limited battery and computing power, the increasing demand for energy-efficient machine learning and secure data communication at the network edge has become a hurdle to the success of smart systems. To address these challenges, using a smart UAV (Unmanned Aerial Vehicle) delivery system as an example, we propose EXPRESS, a novel energy-efficient and secure framework based on mobile edge computing and blockchain technologies. We focus on computation and data (resource) management, two of the most prominent components of this framework. The effectiveness of the EXPRESS framework is demonstrated through the implementation of a real-world UAV delivery system. As an open-source framework, EXPRESS can help researchers implement their own prototypes and test their computation and data management strategies in different smart systems. The demo video can be found at https://youtu.be/r3U1iU8tSmk.
Over the last few years, there has been substantial research on automated analysis, testing, and debugging of Ethereum smart contracts. However, it is not trivial to compare and reproduce that research.
To address this, we present SmartBugs, an extendable and easy-to-use execution framework that simplifies the execution of analysis tools on smart contracts written in Solidity, the primary language used in Ethereum.
SmartBugs is currently distributed with support for 10 tools and two datasets of Solidity contracts. The first dataset can be used to evaluate the precision of analysis tools, as it contains 143 annotated vulnerable contracts with 208 tagged vulnerabilities. The second dataset contains 47,518 unique contracts collected through Etherscan.
We discuss how SmartBugs supported the largest experimental setup to date, both in the number of tools and in execution time. Moreover, we show how it enables easy integration and comparison of analysis tools by presenting a new extension to the tool Smartcheck that substantially improves the detection of vulnerabilities related to the DASP10 categories Bad Randomness, Time Manipulation, and Access Control (identified vulnerabilities increased from 11% to 24%).
On the one hand, as a GitHub profile is becoming an essential part of a developer's resume, it becomes increasingly important to enable HR departments to extract someone's expertise through automated analysis of their contributions to open-source projects. On the other hand, having clear insights into the technologies used in a project can be very beneficial for resource allocation and project maintainability planning. In the literature, one can identify various approaches for identifying expertise in programming languages based on the projects a developer contributed to. In this paper, we move one step further and introduce an approach (accompanied by a tool) to identify low-level expertise in particular software frameworks and technologies, relying solely on GitHub data, using the GitHub API and Natural Language Processing (NLP) via the Microsoft Language Understanding Intelligent Service (LUIS). In particular, we developed an NLP model in LUIS for named-entity recognition of three .NET technologies and two front-end frameworks. Our analysis is based on the specific commit contents, in terms of the exact code chunks, that the committer added or changed. We evaluate the precision, recall and F-measure for the derived technologies/frameworks by conducting a batch test in LUIS and report the results. The proposed approach is demonstrated through a fully functional web application named RepoSkillMiner.
In this paper, we present Sosed, a tool for discovering similar software projects. We use fastText to compute the embeddings of sub-tokens into a dense space for 120,000 GitHub repositories in 200 languages. Then, we cluster the embeddings to identify groups of semantically similar sub-tokens that reflect topics in source code. We use a dataset of 9 million GitHub projects as a reference search base. To identify similar projects, we compare the distributions of clusters among their sub-tokens. The tool receives an arbitrary project as input, extracts sub-tokens in the 16 most popular programming languages, computes the cluster distribution, and finds the projects with the closest distributions in the search base. We labeled the sub-token clusters with short descriptions to enable Sosed to produce interpretable output.
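The similarity step can be sketched as follows, assuming each project is reduced to a distribution over sub-token clusters. The cosine measure and the toy data here are our illustrative choices and may differ from Sosed's actual distance:

```python
import math

def normalize(counts):
    """Turn raw per-cluster sub-token counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def cosine(p, q):
    """Cosine similarity between two distributions of equal length."""
    dot = sum(a * b for a, b in zip(p, q))
    np_ = math.sqrt(sum(a * a for a in p))
    nq = math.sqrt(sum(b * b for b in q))
    return dot / (np_ * nq)

# Counts of sub-tokens per hypothetical semantic cluster, e.g. (web, ml, parsing)
query_project = normalize([10, 1, 4])
search_base = {
    "web-framework": normalize([20, 0, 5]),
    "ml-toolkit": normalize([1, 30, 2]),
}

# Rank search-base projects by distributional similarity to the query
best = max(search_base, key=lambda p: cosine(query_project, search_base[p]))
print(best)  # → web-framework
```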
Sosed is available at https://github.com/JetBrains-Research/sosed/. The tool demo is available at https://www.youtube.com/watch?v=LYLkztCGRt8. The multi-language extractor of sub-tokens is available separately at https://github.com/JetBrains-Research/identifiers-extractor/.
The “feature interaction problem” arises when two or more independent features interact with each other in an undesirable manner. Feature interactions remain a challenging and important problem in emerging domains of cyber-physical systems (CPS), such as intelligent vehicles, unmanned aerial vehicles (UAVs) and the Internet of Things (IoT), where the outcome of an unexpected interaction may result in a safety failure. Existing approaches to resolving feature interactions rely on priority lists or fixed strategies, but may not be effective in scenarios where none of the competing feature actions are satisfactory with respect to system requirements. This paper proposes a novel synthesis-based approach to resolution, where a conflict among features is resolved by synthesizing an action that best satisfies the specification of desirable system behaviors in the given environmental context. Unlike existing resolution methods, our approach is capable of producing a desirable system outcome even when none of the conflicting actions are satisfactory. The effectiveness of the proposed approach is demonstrated using a case study involving interactions among safety-critical features in an autonomous drone.
Fuzzing, or fuzz testing, is an established technique that aims to discover unexpected program behavior (e.g., bugs, security vulnerabilities, or crashes) by feeding automatically generated data into a program under test. However, the application of fuzzing to test Model-Driven Software Engineering (MDSE) tools is still limited because existing fuzzers struggle to provide structured, well-typed inputs, namely models that conform to the typing and consistency constraints induced by a given meta-model and the underlying modeling framework. Drawing on recent advances in both fuzz testing and automated model generation, we present three different approaches for fuzzing MDSE tools: a graph-grammar-based fuzzer and two variants of a coverage-guided mutation-based fuzzer working with different sets of model mutation operators. We have evaluated our fuzzing approaches on a set of real-world MDSE tools. Our experimental results show that all approaches can outperform both standard fuzzers and model generators w.r.t. their fuzzing capabilities. Moreover, we found that each of our approaches comes with its own strengths and weaknesses in terms of fault-finding capability and the ability to cover different aspects of the system under test. Thus, the approaches complement each other, forming a fuzzer suite for testing MDSE tools.
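A coverage-guided mutation-based fuzzer follows a simple keep-what-covers-more loop, sketched below on toy graph-shaped "models". The mutation operators and the coverage signal are invented stand-ins for illustration, not the paper's meta-model-aware operators:

```python
import random

random.seed(0)  # deterministic for the sketch

def mutate(model):
    # Toy mutation operators: add a fresh node, or rewire a random edge
    m = {"nodes": list(model["nodes"]), "edges": list(model["edges"])}
    if random.random() < 0.5 or not m["edges"]:
        m["nodes"].append(f"n{len(m['nodes'])}")
    else:
        i = random.randrange(len(m["edges"]))
        src, _ = m["edges"][i]
        m["edges"][i] = (src, random.choice(m["nodes"]))
    return m

def run_target(model):
    # Stand-in for the MDSE tool under test: report which behaviors were hit
    coverage = {f"nodes_{len(model['nodes'])}"}
    if any(src == dst for src, dst in model["edges"]):
        coverage.add("self_loop")
    return coverage

seed = {"nodes": ["n0", "n1"], "edges": [("n0", "n1")]}
corpus, seen = [seed], run_target(seed)
for _ in range(200):
    candidate = mutate(random.choice(corpus))
    cov = run_target(candidate)
    if cov - seen:  # keep only inputs that reach new coverage
        corpus.append(candidate)
        seen |= cov
print(sorted(seen))
```

The key design choice is the feedback loop: mutants are retained only when they exercise behavior not yet seen, so the corpus gradually accumulates structurally diverse models.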
Context: Emergent behaviors are behaviors not included in a system specification but that can still happen at runtime. When using scenario-based modeling to design a concurrent system, we can detect such behaviors as implied scenarios (ISs). Analogously to emergent behaviors, an IS is not included in the system model but can arise at runtime. If left untreated, ISs can cause damage if they lead to unwanted behaviors, which can, in turn, affect the reliability of the system. Several approaches to detect ISs have been devised. However, existing approaches stop after the detection process and do not go further into analysis and treatment. Additionally, they can output several implied scenarios, which can be cumbersome for the user, as the scenarios are detected and dealt with one by one. Furthermore, since these approaches do not investigate the relationships between different ISs, they could misguide the user on how to deal with such scenarios.
Objective: In this work, we propose a methodology to fill this gap in the literature by finding common behaviors (CBs) among detected ISs that lead the system to unexpected behavior. We enable the user to analyze ISs as groups, which allows fixing multiple emergent behaviors at the same time.
Methodology: The methodology consists of the characterization of ISs as families of CBs, comprising three main steps: (i) collect multiple ISs; (ii) detect CBs among them; and (iii) characterize such CBs as families. First, our approach iteratively collects multiple ISs without the need for user interaction. Next, from these collected ISs, we extract the underlying CBs. The CBs are groups of ISs that have common traces, and are defined as shared sequences of messages among various ISs. Then, a characterization process is performed to define families of CBs. For this purpose, we use the Smith-Waterman algorithm, which finds the parts of two sequences that have the most in common. In our work, the benefits of using this algorithm are two-fold: (i) it can be used to find shared traces of messages among different CBs, and (ii) the calculated score can be used as a clustering metric, which assists the user in defining the families of CBs. By these means, we limit the problem space of IS detection by treating the ISs as a group, instead of individually.
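The alignment step can be sketched as follows, assuming each CB is a sequence of message names. This is a plain Smith-Waterman scoring implementation (the message sequences are invented examples), with the returned score serving as the kind of clustering metric described above:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Score of the best local alignment between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(0, h[i - 1][j - 1] + s, h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
    return best  # higher score = more shared behavior

cb1 = ["login", "query", "update", "logout"]
cb2 = ["init", "login", "query", "logout"]
cb3 = ["boot", "ping", "shutdown"]

# cb1 and cb2 share a long trace of messages; cb1 and cb3 share none
print(smith_waterman(cb1, cb2), smith_waterman(cb1, cb3))  # → 5 0
```

Pairs of CBs with high alignment scores would then be clustered into the same family, so a single fix can address them together.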
Results: We performed seven case studies to validate our methodology, in which a total of 1,798 ISs were collected. From these ISs, our methodology was able to identify only 14 families of CBs, where each system specification in those studies had at most three families, and each family required a single fix. Overall, our methodology also managed to significantly reduce the timespan of the detection process from nearly 37 hours to under 24 minutes, with a mere 3.2 s to run our clustering process in those case studies. Additionally, we provide quantitative evidence that the unwanted behaviors have been effectively removed, as all seven case studies had their reliability increase to 100%.
Conclusion: We have proposed a methodology to deal with multiple ISs in bulk. This is achieved by detecting CBs among these scenarios. Furthermore, we introduce a method to group similar CBs into families, further reducing the number of elements the user needs to analyze. Thus, our approach allows the user to investigate and treat multiple ISs at once.
Journal Paper: The full paper was published at the Journal of Systems and Software, and is available at https://doi.org/10.1016/j.jss.2019.110425. All tools and data used for the experiments are available at https://git.io/JfCJp.
Link to Publication: https://www.sciencedirect.com/science/article/pii/S0164121219301992