Many Natural Language Processing (NLP) tasks, such as sentiment analysis or syntactic parsing, have benefited from the development of word embedding models. In particular, regardless of the training algorithms, the learned embeddings have often been shown to be generalizable to different NLP tasks. In contrast, despite recent momentum on word embeddings for source code, the literature lacks evidence of their generalizability beyond the example task they have been trained for.
In this experience paper, we identify three potential downstream tasks, namely code comment generation, code authorship identification, and code clone detection, that source code token embedding models can be applied to. We empirically assess a recently proposed code token embedding model, namely code2vec’s token embeddings. Code2vec was trained on the task of predicting method names, and while there is potential for using the vectors it learns on other tasks, this has not been explored in the literature. Therefore, we fill this gap by focusing on its generalizability for the tasks we have identified. Ultimately, we show that source code token embeddings cannot be readily leveraged for the downstream tasks. Our experiments even show that our attempts to use them do not result in any improvements over less sophisticated methods. We call for more research into the effective and general use of code embeddings.
Code retrieval techniques and tools play a key role in helping software developers retrieve existing code fragments from available open-source repositories given a user query (e.g., a short natural language text describing the functionality for retrieving a particular code snippet). Despite existing efforts to improve the effectiveness of code retrieval, two main issues still hinder accurate retrieval of satisfactory code fragments from large-scale repositories when answering complicated queries. First, the existing approaches only consider shallow features of source code such as method names and code tokens, while ignoring structured features such as abstract syntax trees (ASTs) and control-flow graphs (CFGs), which contain rich and well-defined semantics of source code. Second, although deep-learning-based approaches perform well at representing source code, they lack explainability, making it hard to interpret the retrieval results and almost impossible to understand which features of source code contribute more to the final results. To tackle these two issues, this paper proposes MMAN, a novel Multi-Modal Attention Network for semantic source code retrieval. A comprehensive multi-modal representation is developed for representing the unstructured and structured features of source code, with one LSTM for the sequential tokens of the code, a Tree-LSTM for the ASTs of the code, and a GGNN (Gated Graph Neural Network) for the CFG of the code. Furthermore, a multi-modal attention fusion layer is applied to assign weights to different parts of each modality of source code and then integrate them into a single hybrid representation. Comprehensive experiments and analysis on a large-scale real-world dataset show that our proposed model can accurately retrieve code snippets and outperforms the state-of-the-art methods.
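To make the cross-modality fusion step concrete, the following is a minimal sketch of attention-based fusion over three modality embeddings, assuming each encoder (token LSTM, Tree-LSTM, GGNN) has already produced a fixed-size vector; the projection `W` and attention query `q` are illustrative parameters, not MMAN's actual ones.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_modalities(token_vec, ast_vec, cfg_vec, W, q):
    """Attention-weighted fusion of three modality embeddings.

    token_vec/ast_vec/cfg_vec: (d,) vectors from the token, AST and CFG
    encoders (placeholders here). W: (d, d) projection, q: (d,) learned
    attention query -- both hypothetical parameters.
    """
    modalities = np.stack([token_vec, ast_vec, cfg_vec])  # (3, d)
    scores = np.tanh(modalities @ W) @ q                  # one scalar per modality
    weights = softmax(scores)                             # attention distribution
    return weights @ modalities                           # (d,) hybrid representation

d = 8
rng = np.random.default_rng(0)
hybrid = fuse_modalities(rng.normal(size=d), rng.normal(size=d),
                         rng.normal(size=d), rng.normal(size=(d, d)),
                         rng.normal(size=d))
print(hybrid.shape)  # (8,)
```

In the full model, attention would also weight parts within each modality before this cross-modality fusion.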
ACM SIGSOFT Distinguished Paper Award
Automated test generation and evaluation in simulation environments is a key technology for verification of automated driving (AD) applications.
Search-based testing (SBT) is an approach for automated test generation that leverages optimization to efficiently generate interesting concrete tests from abstract test descriptions. In this experience paper, we report on our observations after successfully applying SBT to AD control applications in several use cases with different characteristics. Based on our experiences, we derive a number of lessons learned that we consider important for the adoption of SBT methods and tools in industrial settings. The key lesson is that SBT finds relevant errors and provides valuable feedback to the developers, but requires tool support for writing specifications.
Context: The challenge of locating bugs in large-scale software systems has led to the development of bug localization techniques. However, the lexical mismatch between bug reports and source code degrades the performance of existing information retrieval or machine learning-based approaches. Objective: To bridge the lexical gap and improve the effectiveness of localizing buggy files by leveraging the semantic information extracted from bug reports and source code. Method: We present BugTranslator, a novel deep learning-based machine translation technique composed of an attention-based recurrent neural network (RNN) Encoder-Decoder with long short-term memory cells. One RNN encodes bug reports into several context vectors that are decoded by another RNN into code tokens of buggy files. The technique studies and adopts the relevance between the semantic information extracted from bug reports and source files. Results: The experimental results show that BugTranslator outperforms a current state-of-the-art word embedding technique on three open-source projects with higher MAP and MRR. The results show that BugTranslator can rank actual buggy files at the second or third place on average.
Conclusion: BugTranslator classifies bug reports and source code into different symbolic classes and then extracts deep semantic similarity and relevance between bug reports and the corresponding buggy files to bridge the lexical gap at its source, thereby further improving the performance of bug localization.
Link to Publication: https://www.sciencedirect.com/science/article/abs/pii/S0950584918300399
Despite being adopted in software engineering tasks, deep neural networks are treated mostly as a black box due to the difficulty of interpreting how the networks infer outputs from inputs. To address this problem, we propose AutoFocus, an automated approach for rating and visualizing the importance of input elements based on their effects on the outputs of the networks. The approach is built on our hypotheses that (1) attention mechanisms incorporated into neural networks can generate discriminative scores for various input elements and (2) the discriminative scores reflect the effects of input elements on the outputs of the networks. This paper verifies the hypotheses by applying AutoFocus to the task of algorithm classification (i.e., given a program’s source code as input, determine the algorithm implemented by the program). AutoFocus identifies and perturbs code elements in a program systematically, and quantifies the effects of the perturbed elements on the network’s classification results. Based on an evaluation of more than 1000 programs for 10 different sorting algorithms, we observe that the attention scores are highly correlated with the effects of the perturbed code elements. Such a correlation provides a strong basis for the use of attention scores to interpret the relations between code elements and the algorithm classification results of a neural network, and we believe that visualizing code elements in an input program ranked according to their attention scores can facilitate faster program comprehension with reduced code.
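The perturbation side of this experiment can be sketched as follows: mask one code token at a time and record how the classifier's confidence in the predicted class changes, yielding per-token effect scores that can then be correlated with the attention scores (e.g., via np.corrcoef). The classifier below is a toy stand-in, not AutoFocus's trained network.

```python
import numpy as np

def perturbation_effect(classify, tokens, label, mask_token="<unk>"):
    """For each code token, measure how much masking it changes the
    classifier's confidence in the predicted label. `classify` is a
    stub standing in for the trained network; it must return a
    probability distribution over algorithm classes."""
    base = classify(tokens)[label]
    effects = []
    for i in range(len(tokens)):
        perturbed = tokens[:i] + [mask_token] + tokens[i + 1:]
        effects.append(base - classify(perturbed)[label])
    return np.array(effects)  # to be correlated with attention scores

# Toy stand-in classifier: confidence grows with occurrences of "swap".
def toy_classify(tokens):
    p_bubble = min(0.9, 0.1 + 0.2 * tokens.count("swap"))
    return {"bubble_sort": p_bubble, "other": 1 - p_bubble}

effects = perturbation_effect(toy_classify, ["for", "i", "swap", "a", "b"], "bubble_sort")
print(effects)  # the "swap" position shows the largest effect
```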
Recurrent neural networks (RNNs) have achieved great success in processing sequential inputs for applications such as automatic speech recognition, natural language processing and machine translation. However, quality and reliability issues of RNNs make them vulnerable to adversarial attacks and hinder their deployment in real-world applications. In this paper, we propose a quantitative analysis framework, DeepStellar, to pave the way for effective security and quality analysis of software systems powered by RNNs. DeepStellar is generic, handling various RNN architectures including LSTM and GRU; scalable, working on industrial-grade RNN models; and extensible, supporting the development of customized analyzers and tools. We demonstrate that, with DeepStellar, users are able to design efficient test generation tools and develop effective adversarial sample detectors. We evaluated the developed applications on three real-world RNN models, covering speech recognition and image classification.
DeepStellar outperforms existing approaches by three hundred times in generating defect-triggering tests and achieves 97% accuracy in detecting adversarial attacks. A video demonstration that shows the main features of DeepStellar is available at https://sites.google.com/view/deepstellar/home/tool-demo.
Developers often look for solutions to programming problems in community Q&A sites like Stack Overflow. Due to the crowdsourcing nature of these Q&A sites, many user-provided answers are wrong, suboptimal or out-of-date. Relying on community-curated quality indicators (e.g., accepted answer, answer vote) cannot reliably identify these answer problems. Such problematic answers are often criticized by other users. However, these critiques are not readily discoverable when reading the posts. In this paper, we consider the answers being criticized and their critique posts as controversial discussions in community Q&A sites. To help developers notice such controversial discussions and make more informed choices of appropriate solutions, we design an automatic open information extraction approach for systematically discovering and summarizing the controversies in Stack Overflow, and we exploit official API documentation to assist the understanding of the discovered controversies. We apply our approach to millions of Java/Android-tagged Stack Overflow questions and answers and discover a large number of controversial discussions in Stack Overflow. Our manual evaluation confirms that the extracted controversy information is of high accuracy. A user study with 18 developers demonstrates the usefulness of our generated controversy summaries in helping developers avoid the controversial answers and choose more appropriate solutions to programming questions.
Previous studies showed that replying to a user review usually has a positive effect on the rating that is given by the user to the app. For example, Hassan et al. found that responding to a review increases the chances of a user updating their given rating by up to six times compared to not responding. However, reading large numbers of user reviews every day is not an easy task for developers. To alleviate the labor burden of replying to the bulk of user reviews, developers usually adopt a template-based strategy, where the templates can express appreciation for using the app or contain the company email for users to react to. The available review-response pairs provide us with an opportunity to learn the knowledge relations between reviews and responses.
Although there exists research studying the popular review patterns (e.g., reviews with longer content and lower ratings) that developers tend to respond to, approaches to automate the review response process have never been proposed. Inspired by the RNN encoder-decoder model from the natural language processing field, we propose a response generation framework named RRGen. RRGen explicitly incorporates review attributes, such as user rating and review length, and learns the relations between reviews and corresponding responses in a supervised way from the available training data. Experiments on 58 apps and 309,246 review-response pairs highlight that RRGen outperforms several baselines by 67.4% to 4.5 times in terms of BLEU (an accuracy measure that is widely used to evaluate generation systems). Qualitative analysis also confirms the effectiveness of RRGen in generating relevant and accurate responses.
ACM SIGSOFT Distinguished Paper Award
Enabled by the pull-based development model, developers can easily contribute to a project through pull requests (PRs). When creating a PR, developers can add a free-form description to describe what changes are made in the PR and/or why. Such a description is helpful for reviewers and other developers to gain a quick understanding of the PR without touching the details, and may reduce the possibility of the PR being ignored or rejected. However, developers sometimes neglect to write descriptions for PRs. For example, in our collected dataset with over 333K PRs, more than 34% of the PR descriptions are empty. To alleviate this problem, we propose an approach to automatically generate PR descriptions based on the commit messages and the source code comments in the PRs. We regard this problem as a text summarization problem and solve it using a novel sequence-to-sequence model. To cope with out-of-vocabulary words in software artifacts and to bridge the gap between the training loss function of the sequence-to-sequence model and the evaluation metric ROUGE, which has been shown to correspond to human evaluation, we integrate the pointer generator and directly optimize for ROUGE using reinforcement learning and a special loss function. We build a dataset with over 41K PRs and evaluate our approach on this dataset through ROUGE and a human evaluation. Our evaluation results show that our approach outperforms two baselines by significant margins.
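Since the approach directly optimizes for ROUGE, a minimal ROUGE-L sketch helps make the target metric concrete: it scores a generated description against a reference by their longest common subsequence. The beta value follows the common ROUGE-L convention; the example tokens are illustrative.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(candidate, reference, beta=1.2):
    """ROUGE-L F-score between a generated and a reference description."""
    lcs = lcs_len(candidate, reference)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(candidate), lcs / len(reference)
    return (1 + beta ** 2) * p * r / (r + beta ** 2 * p)

gen = "fix null pointer in parser".split()
ref = "fix a null pointer dereference in the parser".split()
print(round(rouge_l(gen, ref), 3))
```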
With the advent of social media, developers increasingly use it in their software development activities. Twitter is one of the popular social media platforms used by developers. A recent study by Singer et al. found that software developers use Twitter to “keep up with the fast-paced development landscape”. Unfortunately, due to the general-purpose nature of Twitter, it is challenging for developers to use Twitter for their development activities. Our survey with 36 developers who use Twitter in their development activities highlights that developers are interested in following specialized software gurus who share relevant technical tweets. To help developers perform this task, in this work we propose a recommendation system to identify specialized software gurus. Our approach first extracts different kinds of features that characterize a Twitter user, and then employs a two-stage classification approach to generate a discriminative model which can differentiate specialized software gurus in a particular domain from other Twitter users that generate domain-related tweets (aka domain-related Twitter users). We have investigated the effectiveness of our approach in finding specialized software gurus for four different domains (JavaScript, Android, Python, and Linux) on a dataset of 86,824 Twitter users who generated 5,517,878 tweets over one month. Our approach can differentiate specialized software experts from other domain-related Twitter users with an F-measure of up to 0.820. Compared with existing Twitter domain expert recommendation approaches, our proposed approach outperforms their F-measure by at least 7.63%.
Evidence shows that developer reputation is extremely important when accepting pull requests or resolving reported issues. It is particularly salient in Free/Libre Open Source Software since the developers are distributed around the world, do not work for the same organization and, in most cases, never meet face to face. The existing solutions to expose developer reputation tend to be forge-specific (e.g., GitHub), focus on activity instead of impact, do not leverage social or technical networks, and do not correct often-misspelled developer identities. We aim to remedy this by amalgamating data from all public Git repositories, measuring the impact of developer work, exposing developers’ collaborators, and correcting notoriously problematic developer identity data. We leverage World of Code (WoC), a collection of an almost complete (and continuously updated) set of Git repositories, by first allowing developers to select which of the 34 million (M) Git commit author IDs belong to them and then generating their profiles by treating the selected collection of IDs as that single developer. As a side effect, these selections serve as a training set for a supervised learning algorithm that merges multiple identity strings belonging to a single individual. As we evaluate the tool and the proposed impact measure, we expect to build on these findings to develop reputation badges that could be associated with pull requests and commits so developers could more easily trust and prioritize them.
Coding conventions play an important role in guaranteeing software quality. However, coding conventions are usually informally presented and inconvenient for programmers to use. In this paper, we present CocoQa, a system that answers programmers’ questions about coding conventions. CocoQa answers questions by querying a knowledge graph of coding conventions. It employs 1) a subgraph matching algorithm that parses the question into a SPARQL query, and 2) a machine comprehension algorithm that uses an end-to-end neural network to detect answers in the retrieved paragraphs. We have implemented CocoQa and evaluated it on a coding convention QA dataset. The results show that CocoQa can answer questions about coding conventions precisely. In particular, CocoQa achieves a precision of 82.92% and a recall of 91.10%. Repository: https://github.com/14dtj/CocoQa/ Video: https://youtu.be/VQaXi1WydAU
Emotion awareness research in the SE context has been growing in recent years. Currently, researchers often rely on textual communication records to extract emotion states using natural language processing techniques. However, how well these extracted emotion states reflect people’s real emotions has not been thoroughly investigated. In this paper, we report a multi-level, longitudinal empirical study with 82 individual members in 27 project teams. We collected their self-reported retrospective emotion states on a weekly basis during their year-long projects and also extracted the corresponding emotions from the textual communication records. We then model and compare the dynamics of these two types of emotions using multiple statistical and time series analysis methods. Our analyses yield a rich set of findings. The most important one is that the dynamics of emotions extracted using text-based algorithms often do not reflect well the dynamics of self-reported retrospective emotions. In addition, the extracted emotions match self-reported retrospective emotions better at the team level. Our results also suggest that individual personalities and a team’s emotion display norms significantly impact the match/mismatch. Our results should warn the research community about the limitations and challenges of applying text-based emotion recognition tools in SE research.
Collaboration with other people is a major theme in the information-seeking process. However, most existing works that address the location of features during the maintenance or evolution of software do not support collaboration, or they are focused on code as the main software artifact. Hence, collaborative feature location in models has not enjoyed much attention to date. In this work, we address this concern by proposing an approach, CoFLiM, that enables the collaboration of several domain experts in order to locate the model fragment of a target feature. CoFLiM uses the feature descriptions of the domain experts and their self-rated confidence level to automatically reformulate the relevant feature descriptions in a single query. This query guides the evolutionary algorithm of our approach that finds the model fragment of the feature being located. We evaluate CoFLiM in a real-world case study from our industrial partner. We analyze the impact of CoFLiM in terms of recall, precision, and the F-measure. Moreover, we compare the reformulation of CoFLiM with four baselines. We also perform a statistical analysis to show that the impact of the results is significant. Our results show that collaboration pays off in the location of features in models. The results also show that the self-rated confidence level can be used to locate features in models. Finally, the results show that there are no significant improvements when more than three domain experts are involved, which is relevant in those industrial contexts where the availability of domain experts is scarce.
Link to Publication: https://link.springer.com/article/10.1007%2Fs10515-019-00251-9
Developers often reuse code snippets from online forums, such as Stack Overflow and GitHub Gists, to learn API usages of software frameworks or libraries. Those code snippets often contain ambiguous undeclared external references, which makes it difficult to learn and use those APIs correctly. Reusing those code snippets to solve development tasks also requires resolving the external references of those APIs. However, manually resolving fully qualified names (FQNs) of API elements is a non-trivial task. In this paper, we propose a novel context-sensitive technique, COSTER, to resolve FQNs of API elements in such code snippets. The technique collects locally specific source code elements as well as globally related tokens as the context of an FQN, calculates association scores, and builds an occurrence likelihood dictionary. When inferring an API element, it collects the code context and ranks candidate FQNs from the dictionary by considering the association scores of the tokens in the context, the similarity between contexts, and the similarity between API elements. Evaluation with code examples collected from GitHub and Stack Overflow posts shows that our proposed technique improves precision and recall by 3-18% compared to existing state-of-the-art techniques. The proposed technique also significantly reduces training time compared to StatType, a state-of-the-art technique, without sacrificing accuracy. Extensive analyses of the results establish the robustness of the proposed technique.
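A hedged sketch of the ranking idea: combine a corpus-mined occurrence likelihood with a context similarity score for each candidate FQN. The dictionaries and the Jaccard similarity below are illustrative stand-ins for COSTER's mined dictionary and similarity measures, and `alpha` is a hypothetical mixing weight.

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def rank_fqns(simple_name, context_tokens, likelihood, contexts, alpha=0.5):
    """Rank candidate fully qualified names for `simple_name`.

    likelihood: dict FQN -> occurrence likelihood mined from a corpus.
    contexts:   dict FQN -> typical co-occurring tokens.
    Both are stand-ins for the dictionaries the paper describes building.
    """
    candidates = [f for f in likelihood if f.endswith("." + simple_name)]
    scored = [(alpha * likelihood[f] + (1 - alpha) * jaccard(context_tokens, contexts[f]), f)
              for f in candidates]
    return sorted(scored, reverse=True)

likelihood = {"java.util.List": 0.9, "java.awt.List": 0.2}
contexts = {"java.util.List": ["ArrayList", "add", "size"],
            "java.awt.List": ["Frame", "add", "Panel"]}
print(rank_fqns("List", ["ArrayList", "add"], likelihood, contexts))
```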
Inferring program transformations from concrete program changes has many potential uses, such as applying systematic program edits, refactoring, and automated program repair. Existing work on inferring program transformations usually relies on statistical information over a potentially large set of program-change examples. However, in many practical scenarios we do not have such a large set of program-change examples.
In this paper, we address the challenge of inferring a program transformation from one single example. Our core insight is that “big code” can provide effective guidance for generalizing a concrete change into a program transformation, i.e., code elements appearing in many files are general and should not be abstracted away. We first propose a framework for transformation inference, where programs are represented as hypergraphs to enable fine-grained generalization of transformations. We then design a transformation inference approach, GENPAT, that infers a program transformation based on code context and statistics from a big code corpus.
We have evaluated GENPAT under two distinct application scenarios, systematic editing and program repair. The evaluation on systematic editing shows that GENPAT significantly outperforms a state-of-the-art approach, SYDIT, correctly transforming up to 5.5x as many cases. The evaluation on program repair suggests that GENPAT has the potential to be integrated into advanced program repair tools: GENPAT successfully repaired 19 real-world bugs in the Defects4J benchmark by simply applying transformations inferred from existing patches, 4 of which have never been repaired by any existing technique. Overall, the evaluation results suggest that GENPAT is effective for transformation inference and can potentially be adopted for many different applications.
Link to Publication: https://github.com/xgdsmileboy/GenPat
Execution logs, which are generated by logging code, are widely used in modern software projects for tasks like monitoring, debugging, and remote issue resolution. Ineffective logging can cause confusion, a lack of information during problem diagnosis, or even system crashes. However, it is challenging to develop and maintain logging code, as it intermixes with the feature code. Furthermore, unlike for feature code, it is very challenging to verify the correctness of logging code. Currently, developers usually rely on their intuition when performing their logging activities; there are no well-established logging guidelines in research or practice. In this paper, we intend to derive such guidelines by mining historical logging code changes. In particular, we have extracted and studied the Logging-Code-Issue-Introducing (LCII) changes in six popular large-scale Java-based open source software systems. Preliminary studies on this dataset show that: (1) both co-changed and independently changed logging code changes can contain fixes to LCII changes; (2) the complexity of fixes to LCII changes is similar to that of regular logging code updates; (3) it takes longer for developers to fix logging code issues than regular bugs; and (4) the state-of-the-art logging code issue detection tools can only detect a small fraction (3%) of the LCII changes. This highlights the urgent need for research in this area and the importance of such a dataset.
Link to Publication: https://www.eecs.yorku.ca/~chenfsd/resources/emse2019_chen.pdf
A deep learning (DL) model is inherently imprecise. To fix this problem, existing techniques retrain a given DL model over a larger training dataset, with the help of fault-injected models, or using insights from the failing test cases of the given DL model. In this paper, we present Apricot, a novel weight-adaptation approach to fixing DL models iteratively. Our key observation is that if the deep learning architecture of a DL model is trained over a subset of the original training dataset, the weights in the resulting reduced DL model (rDLM) can provide insights into the adjustment direction and magnitude of the weights in the original DL model to handle the test cases that the original DL model misclassifies. Apricot generates a set of such reduced DL models from the original DL model. In each iteration, for each weight of the input DL model (iDLM) of that iteration, Apricot adjusts the weight of this iDLM toward the average weight of the rDLMs that correctly classify the failing test cases experienced by this iDLM and/or away from the rDLMs that misclassify the same failing test cases, followed by training the weight-adjusted iDLM over the original training dataset to generate a new iDLM for the next iteration. An experiment using five state-of-the-art DL models shows that Apricot can increase the accuracy of these DL models by 0.35%-2.81%, with an average of 1.45%. The experiment also reveals the complementary nature of these rDLMs in Apricot.
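The weight-adjustment idea can be sketched as follows for a single weight tensor; the update rule and learning rate here are illustrative, not Apricot's exact formula.

```python
import numpy as np

def adjust_weight(w, rdlm_weights, correct_mask, lr=0.1):
    """One Apricot-style adjustment step for a single weight tensor.

    w:            weight of the current input DL model (iDLM).
    rdlm_weights: list of corresponding weights from reduced models.
    correct_mask: which reduced models classify the failing case correctly.
    Moves w toward the mean weight of the correct rDLMs and away from
    the mean of the incorrect ones (a sketch of the paper's strategy,
    not the exact update rule).
    """
    rdlm_weights = np.stack(rdlm_weights)
    correct = rdlm_weights[correct_mask]
    wrong = rdlm_weights[~correct_mask]
    if len(correct):
        w = w + lr * (correct.mean(axis=0) - w)
    if len(wrong):
        w = w - lr * (wrong.mean(axis=0) - w)
    return w

w = np.zeros(4)
rdlms = [np.ones(4), 2 * np.ones(4), -np.ones(4)]
mask = np.array([True, True, False])
print(adjust_weight(w, rdlms, mask))
```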
Automated program repair has been used to provide feedback for incorrect student programming assignments, since program repair captures the code modification needed to make a given buggy program pass a given test suite. Existing student feedback generation techniques are limited because they either require manual effort in the form of providing an error model, require a large number of correct student submissions to learn from, or suffer from a lack of scalability and accuracy.
In this work, we propose a fully automated approach for generating student program repairs in real-time. This is achieved by first refactoring all available correct solutions into semantically equivalent solutions. Given an incorrect program, we match the program with the closest matching refactored program based on its control flow structure. Subsequently, we infer the input-output specifications of the incorrect program’s basic blocks from the executions of the correct program’s aligned basic blocks. Finally, these specifications are used to modify the blocks of the incorrect program via search-based synthesis.
Our dataset consists of almost 1,800 real-life incorrect Python program submissions from 361 students for an introductory programming course at a large public university. Our experimental results suggest that our method is more effective and efficient than recently proposed feedback generation approaches. About 30% of the patches produced by our tool Refactory are smaller than those produced by the state-of-the-art tool Clara, and can be produced given fewer correct solutions (often a single correct solution) and in a shorter time. We opine that our method is applicable not only to programming assignments but could also be seen as a general-purpose program repair method that can achieve good results with just a single correct reference solution.
This paper presents InFix, a technique for automatically fixing erroneous program inputs for novice programmers. Unlike comparable existing approaches for automatic debugging and maintenance tasks, InFix repairs input data rather than source code, does not require test cases, and does not require special annotations. Instead, we take advantage of patterns commonly used by novice programmers to automatically create helpful, high-quality input repairs. InFix iteratively applies error-message-based templates and random mutations based on insights about the debugging behavior of novices. This paper presents an implementation of InFix for Python. We evaluate InFix on 29,995 unique scenarios with input-related errors collected from four years of data from Python Tutor, a free online programming tutoring environment. Our results generalize and scale; compared to previous work, we consider an order of magnitude more unique programs. Overall, InFix is able to repair 94.5% of deterministic input errors. We also present the results of a human study with 97 participants. Surprisingly, this simple approach produces high-quality repairs; humans judged the output of InFix to be equally helpful and within 4% of the quality of human-generated repairs.
This article contributes to defining the design space of program repair. Repair approaches can be loosely characterized according to their main design philosophy, in particular “generate-and-validate” and synthesis-based approaches. Each of those repair approaches is a point in the design space of program repair. Our goal is to facilitate the design, development and evaluation of repair approaches by providing a framework that: a) contains components commonly present in most approaches, and b) provides built-in implementations of existing repair approaches. This paper presents a Java framework named Astor that focuses on the design space of generate-and-validate repair approaches. The key novelty of Astor is to provide explicit extension points to explore the design space of program repair. Thanks to those extension points, researchers can both reuse existing program repair components and implement new ones. Astor includes 6 unique implementations of repair approaches in Java, including GenProg for Java, called jGenProg. Researchers have already defined new approaches over Astor. The implementations of program repair approaches already available in Astor are capable of repairing, in total, 98 real bugs from 5 large Java programs. Astor’s code is publicly available on GitHub: https://github.com/SpoonLabs/astor.
Automated program repair (APR) is one of the recent advances in automated software engineering, aiming to reduce the burden of debugging by suggesting high-quality patches that either directly fix the bugs or help the programmers in the course of manual debugging. We believe scalability, applicability, and accurate patch validation are the main design objectives for a practical APR technique. In this paper, we present PraPR, our implementation of a practical APR technique that operates at the level of JVM bytecode. We discuss the design decisions made in the development of PraPR, and argue that the technique is a viable baseline toward attaining the aforementioned objectives. PraPR is publicly available at https://github.com/prapr/prapr.
Developer trust is a major barrier to the deployment of automatically generated patches. Understanding the effect of a patch is a key element of that trust. We find that differences in sets of formal invariants characterize patch differences and that implication-based distances in invariant space characterize patch similarities. When one patch is similar to another, it often contains the same changes as well as additional behavior; this pattern is well-captured by logical implication. We can measure differences using a theorem prover to verify implications between invariants implied by separate programs. Although effective, theorem provers are computationally intensive; we find that string distance is an efficient heuristic for our implication-based distance measurements. We propose to use distances between patches to construct a hierarchy highlighting patch similarities. We evaluate our approach on over 300 patches and find that it correctly categorizes programs into semantically similar clusters. Clustering programs reduces human effort by reducing the number of semantically distinct patches that must be considered by over 50%, thus reducing the time required to establish trust in automatically generated repairs.
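A minimal sketch of the string-distance heuristic plus a greedy grouping step, standing in for the paper's implication-based hierarchy; the threshold is illustrative.

```python
def levenshtein(a, b):
    """Edit distance between two patch texts (the cheap heuristic used
    in place of full theorem-prover implication checks)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cluster_patches(patches, threshold):
    """Greedy single-link grouping: an assumption-laden stand-in for the
    paper's hierarchy construction."""
    clusters = []
    for p in patches:
        for c in clusters:
            if any(levenshtein(p, q) <= threshold for q in c):
                c.append(p)
                break
        else:
            clusters.append([p])
    return clusters

patches = ["if (x != null) return x;", "if (x != null) { return x; }", "return y;"]
print(cluster_patches(patches, threshold=6))
```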
Program debugging is a time-consuming task, and researchers have proposed different kinds of automatic fault localization techniques to mitigate the burden of manual debugging. Among these techniques, two popular families are spectrum-based fault localization (SBFL) and statistical debugging (SD), both localizing faults by collecting statistical information at runtime. Though the ideas are similar, the two families have been developed independently and their combinations have not been systematically explored.
In this paper, we perform a systematic empirical study on the combination of SBFL and SD. We first build a unified model of the two techniques, and then systematically explore four types of variations: different predicates, different risk evaluation formulas, different granularities of data collection, and different methods of combining suspiciousness scores.
Our study leads to several findings. First, most of the effectiveness of the combined approach is contributed by a simple type of predicate: branch conditions. Second, the risk evaluation formulas of SBFL significantly outperform those of SD. Third, fine-grained data collection significantly outperforms coarse-grained data collection with only a little extra execution overhead. Fourth, a linear combination of SBFL and SD predicates outperforms both individual approaches.
Based on our empirical study, we propose a new fault localization approach, PREDFL (Predicate-based Fault Localization), with the best configuration for each dimension under the unified model. We then explore its complementarity to existing techniques by integrating PREDFL with a state-of-the-art fault localization framework. The experimental results show that PREDFL can further improve the effectiveness of state-of-the-art fault localization techniques. More concretely, integrating PREDFL results in up to a 20.8% improvement w.r.t. the faults successfully located at Top-1, which shows that PREDFL complements existing techniques.
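To make the combination concrete, here is a sketch using the classic Ochiai SBFL formula and a simple statistical-debugging style predicate score, mixed linearly; PREDFL's actual best configuration is the one selected in the study, so the formulas and weight below are only illustrative.

```python
import math

def ochiai(ef, ep, nf, np_):
    """Classic SBFL risk formula: ef/ep = failing/passing runs covering
    the element, nf/np_ = failing/passing runs not covering it (np_ is
    unused by Ochiai but kept to document the full spectrum tuple)."""
    denom = math.sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0

def sd_score(true_in_fail, obs_in_fail, true_in_pass, obs_in_pass):
    """A statistical-debugging style predicate score: how much more
    often the predicate holds in failing runs than in passing runs."""
    fail = true_in_fail / obs_in_fail if obs_in_fail else 0.0
    pas = true_in_pass / obs_in_pass if obs_in_pass else 0.0
    return max(0.0, fail - pas)

def combined(sbfl, sd, lam=0.5):
    """Linear combination of the two suspiciousness scores."""
    return lam * sbfl + (1 - lam) * sd

s = ochiai(ef=8, ep=2, nf=2, np_=20)
p = sd_score(true_in_fail=7, obs_in_fail=10, true_in_pass=1, obs_in_pass=22)
print(round(combined(s, p), 3))
```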
Localizing concurrency faults that occur in production is hard because (1) detailed field data, such as user input, file content and interleaving schedule, may not be available to developers to reproduce the failure; (2) it is often impractical to assume the availability of multiple failing executions to localize the faults using existing techniques; (3) it is challenging to search for buggy locations in an application given limited runtime data; and (4) concurrency failures at the system level often involve multiple processes or event handlers (e.g., software signals), which cannot be handled by existing tools for diagnosing intra-process (thread-level) failures. To address these problems, we present SCMiner, a practical online bug diagnosis tool to help developers understand how a system-level concurrency fault happens based on the logs collected by the default system audit tools. SCMiner achieves online bug diagnosis to obviate the need for offline bug reproduction. SCMiner does not require code instrumentation on the production system or rely on the assumption of the availability of multiple failing executions. Specifically, after the system call traces are collected, SCMiner uses data mining and statistical anomaly detection techniques to identify the failure-inducing system call sequences. It then maps each abnormal sequence to specific application functions. We have conducted an empirical study on 19 real-world benchmarks. The results show that SCMiner is both effective and efficient at localizing system-level concurrency faults.
Localizing the root causes of unreproducible builds is an important yet challenging task during software maintenance. The major challenges lie in the limited runtime traces from build processes and the high diversity of build environments. To address these challenges, in this paper we propose RepTrace, a framework that identifies the root causes of unreproducible builds based on collected system call traces of the executed build commands. Our framework leverages system call tracing’s uniform interfaces for monitoring executed build commands in diverse build environments. From the collected system call traces, the causality analysis in our framework builds a dependency graph starting from an inconsistent build artifact (across two builds) via two types of dependencies: read/write dependencies among processes and parent/child process dependencies, and searches the graph to find the processes that cause the inconsistencies. To handle massive noisy dependencies and uncertain parent/child dependencies, RepTrace includes two novel techniques: (1) using difference analysis on multiple builds to reduce the search space of read/write dependencies, and (2) computing the similarity of runtime values to filter out noisy parent/child process dependencies. The evaluation results of RepTrace over a set of real-world software packages show that RepTrace effectively finds not only the root cause commands responsible for the unreproducible builds, but also the files to patch for addressing the unreproducibility issues. Among its Top-10 identified commands and files, RepTrace achieves a high accuracy of 90.00% and 90.56% in identifying the root causes, respectively.
PatchNet demonstrates the promising prospect of using machine learning to classify Linux kernel patches. It identifies bug-fixing patches, which should be backported to long-term stable versions, from all patches in the mainline Linux kernel. This tool may greatly lighten the burden of picking patches for maintainers and reduce omissions. However, there are still some obstacles to engineering applications. We present PTracer, a Linux kernel patch trace bot based on an improved PatchNet. PTracer continuously monitors new patches in the git repository of the mainline Linux kernel, filters out irrelevant ones, classifies the rest as bug-fixing or non-bug-fixing patches, and reports bug-fixing patches to kernel experts of commercial operating systems. As a part of PTracer, kernel experts’ feedback is collected and used to periodically retrain the neural network to improve performance. We used the patches from February 2019 of the mainline Linux kernel to perform a test. As a result, PTracer recommended 151 patches out of 5,142 to CGEL kernel experts, 102 of which were accepted. PTracer is the first patch trace bot successfully applied to a commercial operating system and has the advantage of improving software quality and saving labor costs.
Pinpointing the location where a given unit of functionality, or feature, was implemented is a demanding and time-consuming task, yet prevalent in most software maintenance or evolution efforts. To that end, we present PANGOLIN, an Eclipse plugin that helps developers identify features in the source code. It borrows Spectrum-based Fault Localization techniques from the software diagnosis research field by framing feature localization as a diagnostic problem. PANGOLIN prompts users to label system executions based on feature involvement, and subsequently presents its spectrum-based feature localization analysis to users with the aid of a color-coded, hierarchical, and navigable visualization which has been shown to be effective at conveying diagnostic information to users. Our evaluation shows that PANGOLIN accurately pinpoints feature implementations and is resilient to misclassifications by users.
Modern software often exists in many different, yet similar versions and/or variants, usually derived from a common code base (e.g., via clone-and-own). In the context of product-line engineering, family-based analysis has shown very promising potential for improving efficiency in applying quality-assurance techniques to variant-rich software, as compared to a variant-by-variant approach. Unfortunately, these strategies rely on a product-line representation superimposing all program variants in a syntactically well-formed, semantically sound and variant-preserving manner, which is hard to obtain manually in practice. We demonstrate the SiMPOSE methodology for automatically generating superimpositions of N given program versions and/or variants, facilitating family-based analysis of variant-rich software. SiMPOSE is based on a novel N-way model-merging technique operating at the level of control-flow automata (CFA) representations of C programs; CFAs constitute a unified program abstraction utilized by many recent software-analysis tools. We illustrate different merging strategies supported by SiMPOSE, namely variant-by-variant, N-way merging, incremental 2-way merging, and partition-based N/2-way merging, and demonstrate how SiMPOSE can be used to systematically compare their impact on the efficiency and effectiveness of family-based unit-test generation. The SiMPOSE tool, a demonstration of its usage, and related artifacts and documentation can be found at http://pi.informatik.uni-siegen.de/projects/variance/simpose.
Developers often want to find out how to use a certain API (e.g., FileReader.read in JDK library). API usage examples are very helpful in this regard. Over the years, many automated methods have been proposed to generate code examples by clustering and summarizing relevant code snippets extracted from a code corpus. These approaches simplify source code as method invocation sequences or feature vectors. Such simplifications only model partial aspects of the code and tend to produce inaccurate examples.
We propose CodeKernel, a graph kernel based approach to the selection of API usage examples. Instead of approximating source code as method invocation sequences or feature vectors, CodeKernel represents source code as object usage graphs. Then, it clusters graphs by embedding them into a continuous space using a graph kernel. Finally, it outputs code examples by selecting a representative graph from each cluster using designed ranking metrics. Our empirical evaluation shows that CodeKernel selects more accurate code examples than MUSE and eXoaDocs. A user study involving 25 developers in a multinational company also confirms the usefulness of CodeKernel in selecting API usage examples.
High-quality method names are critical for the readability and maintainability of programs. However, constructing concise and consistent method names is often challenging, especially for inexperienced developers. To this end, advanced machine learning techniques have recently been leveraged to recommend method names automatically for given method bodies/implementations. Recent large-scale evaluations also suggest that such approaches are accurate. However, little is known about where and why such approaches work or don’t work. To figure out the state of the art as well as the rationale for the success/failure, in this paper we conduct an empirical study on the state-of-the-art approach code2vec. We assess code2vec on a new dataset with more realistic settings. Our evaluation results suggest that although switching to the new dataset does not significantly influence the performance, more realistic settings do significantly reduce the performance of code2vec. Further analysis of the successfully recommended method names also reveals the following findings: 1) around half (48.3%) of the accepted recommendations are made on getter/setter methods; 2) a large portion (19.2%) of the successfully recommended method names could be copied from the given bodies. To further validate its usefulness, we asked developers to manually score the difficulty of naming the methods they developed. Code2vec was then applied to these manually scored methods to evaluate how often it works when needed. Our evaluation results suggest that code2vec rarely works when it is really needed. Finally, to intuitively reveal the state of the art and to investigate the possibility of designing simple and straightforward alternative approaches, we propose a heuristics-based approach to recommending method names. Evaluation results on a large-scale dataset suggest that this simple heuristics-based approach significantly outperforms the state-of-the-art machine learning based approach, improving precision and recall by 65.25% and 22.45%, respectively. The comparison suggests that machine learning based recommendation of method names still has a long way to go.
Link to Publication: https://liuhuigmail.github.io/2019ASE.pdf
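To illustrate what a heuristics-based recommender can look like, here is a toy sketch exploiting the two findings above (getter/setter shapes and names copyable from the body); the rules and regexes are hypothetical, not the paper's exact heuristics.

```python
import re

def camel(tokens):
    return tokens[0] + "".join(t.capitalize() for t in tokens[1:])

def suggest_name(body):
    """Toy heuristic name recommender (not the paper's exact rules):
    detect getter/setter shapes, otherwise reuse a salient identifier
    that already appears in the body."""
    m = re.search(r"return\s+this\.(\w+)\s*;", body)
    if m:
        return camel(["get", m.group(1)])
    m = re.search(r"this\.(\w+)\s*=\s*\w+\s*;", body)
    if m:
        return camel(["set", m.group(1)])
    idents = re.findall(r"[a-zA-Z_]\w*", body)
    return max(idents, key=idents.count) if idents else "method"

print(suggest_name("return this.name;"))    # getName
print(suggest_name("this.count = value;"))  # setCount
```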
Designing usable APIs is critical to developers’ productivity and software quality, but is quite difficult. Understanding how the API is used by analyzing client code at scale to discover usability issues with the APIs is even harder. In this paper, we focus on “boilerplate” code, which a number of experts in API design have said can be an indicator of API usability problems. We investigate what properties make code count as boilerplate, the reasons for boilerplate, and how programmers can reduce the need for it. We propose MARBLE, a novel approach to automatically mine boilerplate code from a large set of client code. MARBLE adapts existing techniques, including an API usage mining algorithm, an AST comparison algorithm, and a graph partitioning algorithm. We evaluate MARBLE with 13 Java APIs, and show that our algorithm successfully identifies both already-identified and new boilerplate code instances.
The decompiler is one of the most common tools for examining binaries without corresponding source code. It transforms binaries into high-level code, reversing the compilation process. However, compilation loses information contained within the original source code (e.g., structure, type information, and variable names). Semantically meaningful variable names are known to increase code understandability, but they generally cannot be recovered by decompilers. We propose the Decompiled Identifier Renaming Engine (DIRE), a novel probabilistic technique for variable name recovery that uses both lexical and structural information. We also present a technique for generating corpora suitable for training and evaluating models of decompiled code renaming, which we use to create a corpus of 164,632 unique x86-64 binaries generated from C projects mined from GitHub. Our results show that on this corpus DIRE can predict variable names identical to the names in the original source code up to 74.3% of the time.
Application programming interfaces (APIs) continually evolve to meet ever-changing user needs, and documentation provides an authoritative reference for their usage. However, API documentation is commonly outdated because nearly all of the associated updates are performed manually. Such outdated documentation, especially with regard to API names, causes major software development issues. In this paper, we propose a method for automatically updating outdated API names in API documentation. Our insight is that API updates in documentation can be derived from API implementation changes between code revisions. To evaluate the proposed method, we applied it to four open source projects. Our evaluation results show that our method, FreshDoc, detects outdated API names in API documentation with 48% higher accuracy than the existing state-of-the-art methods do. Moreover, when we checked the updates suggested by FreshDoc against the developers’ manual updates in the revised documentation, FreshDoc addressed 82% of the outdated names. When we reported 40 outdated API names found by FreshDoc via issue tracking systems, developers accepted 75% of the suggestions. These evaluation results indicate that FreshDoc can be used as a practical method for the detection and updating of API names in the associated documentation.
Link to Publication: https://ieeexplore.ieee.org/document/8651318
ACM SIGSOFT Distinguished Paper Award
Game testing has long been recognized as a notoriously challenging task; in the game industry, it mainly relies on manual play and script-based testing. Even until recently, automated game testing has remained a largely untouched niche. A key challenge is that game testing often requires playing the game as a sequential decision process. A bug may only be triggered after completing certain difficult intermediate tasks, which requires a certain level of intelligence. The recent success of deep reinforcement learning (DRL) sheds light on advancing automated game testing without the need for human intelligent support. However, existing DRL work mostly focuses on winning the game rather than testing it. To bridge the gap, in this paper we first perform an in-depth analysis of 1349 real bugs from four real-world commercial game products. Based on this, we propose four oracles to support automated game testing, and further propose Wuji, an on-the-fly game testing framework which leverages evolutionary algorithms, DRL and multi-objective optimization to perform automatic game testing. Wuji balances winning the game against exploring the space of the game. Winning the game allows the agent to make progress in the game, while space exploration increases the possibility of discovering bugs. We conduct a large-scale evaluation on a simple game and two popular commercial games. The results demonstrate the effectiveness of Wuji in exploring space and detecting bugs. Moreover, Wuji found 3 previously unknown bugs in the commercial games, which have been confirmed by the developers.
Due to the popularity of deep learning (DL) applications, testing DL libraries is becoming more and more important. Different from traditional testing, in which an output is asserted definitely (e.g., an output is compared with an oracle for equality), testing deep learning libraries often requires performing oracle approximations, i.e., the output is allowed to be within a restricted range of the oracle. However, oracle approximations have not been studied in prior empirical work, which focuses on traditional testing practices. The prevalence, common practices, evolution, and maintenance challenges of oracle approximations remain unknown. In this work, we studied the oracle approximation assertions implemented in four popular deep learning libraries. Our study shows that oracle approximation assertions account for a significant portion of all the assertions in the test suites of deep learning libraries. Through a comprehensive manual study, we identify the oracle types commonly used when approximations are performed on oracles. In addition, we find that developers frequently modify code related to oracle approximations, e.g., using a different approximation API, modifying the oracle or the output from the code under test, or using a different threshold value. Finally, we performed in-depth studies to understand the reasons behind the evolution of oracle approximation assertions, and our findings reveal maintenance challenges.
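A concrete example of the assertion pattern being studied, using NumPy's approximate assertion, which is common in DL library test suites; the tolerance values are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def test_softmax_matches_oracle():
    out = softmax(np.array([1.0, 2.0, 3.0]))
    oracle = np.array([0.09003057, 0.24472847, 0.66524096])
    # Oracle approximation: equality only up to a tolerance, because
    # floating-point results vary across hardware and library versions.
    np.testing.assert_allclose(out, oracle, rtol=1e-6, atol=1e-8)

test_softmax_matches_oracle()
```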
We present techniques for automatically inferring formal properties of feed-forward neural networks. We observe that a significant part (if not all) of the logic of feed-forward networks is captured in the activation status (on or off) of their neurons. We propose to extract patterns based on neuron decisions as preconditions that imply a certain desirable output property, e.g., the prediction being a certain class. Together, the inferred preconditions and the output property form a contract for the network. We present techniques to extract input properties, encoding convex predicates on the input space that imply given output properties, and layer properties, representing network properties captured in the hidden layers that imply the desired output behavior. We apply our techniques to networks for the MNIST and ACASXU applications. Our experiments highlight the use of the inferred properties in a variety of tasks, such as explaining predictions, providing robustness guarantees, simplifying proofs, and network distillation.
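A minimal sketch of mining such preconditions for a one-hidden-layer network: the on/off pattern of the hidden layer is the candidate precondition, and patterns whose observed inputs are all predicted as the target class are kept. The validation and convex input-property extraction steps the paper describes are omitted.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def activation_pattern(W1, b1, x):
    """On/off decision signature of the hidden layer for input x."""
    return tuple((W1 @ x + b1 > 0).astype(int))

def mine_layer_precondition(W1, b1, W2, b2, inputs, target_class):
    """Collect patterns whose observed inputs were all predicted as
    target_class: candidate preconditions 'pattern => class' (a sketch;
    the paper additionally checks such implications)."""
    support = {}
    for x in inputs:
        pred = int(np.argmax(W2 @ relu(W1 @ x + b1) + b2))
        support.setdefault(activation_pattern(W1, b1, x), set()).add(pred)
    return [p for p, preds in support.items() if preds == {target_class}]

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
xs = rng.normal(size=(50, 3))
print(len(mine_layer_precondition(W1, b1, W2, b2, xs, target_class=0)))
```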
Deep Learning (DL) has recently achieved tremendous success in various application domains. A variety of DL frameworks and platforms play a key role in catalyzing such progress. However, the differences in the architecture designs and implementations of existing frameworks and platforms bring new challenges for DL software development and deployment. To date, there has been no study on how various mainstream frameworks and platforms influence DL software development and deployment in practice.
To fill this gap, we take the first step towards understanding how the most widely used DL frameworks and platforms support DL software development and deployment. We conduct a systematic study on these frameworks and platforms using two types of DNN architectures and three popular datasets. (1) For the development process, we investigate the prediction accuracy under the same runtime training configuration or the same model weights/biases. We also study the adversarial robustness of trained models by leveraging existing adversarial attack techniques. The experimental results show that the computing differences across frameworks could result in an obvious prediction accuracy decline, which should draw the attention of DL developers. (2) For the deployment process, we investigate the prediction accuracy and performance (in terms of time cost and memory consumption) when the trained models are migrated/quantized from PC to real mobile devices and web browsers. The DL platform study unveils that migration and quantization still suffer from compatibility and reliability issues. Meanwhile, we find several DL software bugs by using the results as a benchmark. We further validate the results through bug confirmation from stakeholders and positive industrial feedback to highlight the implications of our study. Through our study, we summarize practical guidelines, identify challenges and pinpoint new research directions, such as understanding the characteristics of DL frameworks and platforms, avoiding compatibility/reliability issues, detecting DL software bugs, and reducing time cost and memory consumption towards developing and deploying high-quality DL systems effectively.
Deep neural networks (DNNs) are increasingly expanding their real-world applications across domains, e.g., image processing, speech recognition and natural language processing. However, there is still limited tool support for DNN testing in terms of test data quality and model robustness. In this paper, we introduce a mutation testing-based tool for DNNs, DeepMutation++, which facilitates DNN quality evaluation, supporting both feed-forward neural networks (FNNs) and stateful recurrent neural networks (RNNs). It not only enables static analysis of the robustness of a DNN model against the input as a whole, but also allows identifying the vulnerable segments of a sequential input (e.g., audio input) through runtime analysis. It is worth noting that DeepMutation++ specially features support for RNN mutation testing. The tool demo video can be found on the project website https://sites.google.com/view/deepmutationpp.
Deep neural networks (DNNs) have been widely applied in safety-critical scenarios such as autonomous vehicles, security surveillance, and cyber-physical control systems. Yet, the incorrect behaviors of DNNs can lead to severe accidents and tremendous losses due to hidden defects. In this paper, we present DeepHunter, a general-purpose fuzzing framework for detecting defects in DNNs. DeepHunter is inspired by traditional grey-box fuzzing and aims to increase the overall test coverage by applying adaptive heuristics according to runtime feedback. Specifically, DeepHunter provides a series of seed selection strategies, metamorphic mutation strategies, and testing criteria customized to DNN testing; all these components support multiple built-in configurations which are easy to extend. We evaluated DeepHunter on two popular datasets, and the results demonstrate the effectiveness of DeepHunter in achieving coverage increases and detecting real defects. A video demonstration which showcases the main features of DeepHunter can be found at https://youtu.be/s5DfLErcgrc.
Precisely modeling complex systems such as cyber-physical systems is challenging, which often renders model-based verification techniques like model checking infeasible. To overcome this challenge, we propose a method called LAR to automatically ‘verify’ such complex systems through a combination of learning, abstraction, and refinement from a set of system log traces. We assume that the log traces and sampling frequency are adequate to capture ‘enough’ behaviour of the system. Given a safety property and concrete system log traces as input, LAR automatically learns and refines system models, and produces two kinds of outputs. One is a counterexample with a bounded probability of being spurious. The other is a probabilistic model based on which the given property is ‘verified’. The model can be viewed as a proof obligation, i.e., the property is verified if the model is correct. It can also be used for subsequent system analysis activities such as runtime monitoring or model-based testing. Our method has been implemented as a self-contained software toolkit. The evaluation on multiple benchmark systems as well as a real-world water treatment system shows promising results.
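As a hedged illustration of the learning step, the sketch below estimates a discrete-time Markov chain from abstracted log traces, the kind of probabilistic model that is then refined; the abstraction and refinement steps LAR performs are omitted, and the traces are toy data.

```python
# Estimate transition probabilities of a Markov chain from log traces;
# a toy stand-in for the learned probabilistic model.
from collections import Counter, defaultdict

def learn_markov_chain(traces):
    """traces: list of state sequences (already abstracted)."""
    counts = defaultdict(Counter)
    for trace in traces:
        for s, t in zip(trace, trace[1:]):
            counts[s][t] += 1
    return {s: {t: n / sum(c.values()) for t, n in c.items()}
            for s, c in counts.items()}

model = learn_markov_chain([["init", "run", "ok"], ["init", "run", "fail"]])
# model["run"] == {"ok": 0.5, "fail": 0.5}
```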
Link to Publication: https://ieeexplore.ieee.org/abstract/document/8576657
Context: Specification mining techniques are typically used to extract the specification of a software system in the absence of (up-to-date) specification documents. This is useful for program comprehension, testing, and anomaly detection. However, specification mining can also potentially be used for debugging, where a faulty behavior is abstracted to give developers context about the bug and help them locate it.
Objective: In this project, we investigate this idea in an industrial setting. We propose a very basic semi-automated specification mining approach for debugging and apply it to real reported issues from an AutoPilot software system of our industry partner, MicroPilot Inc. The objective is to assess the feasibility and usefulness of the approach in a real-world setting.
Method: The approach is developed as a prototype tool, working on C code, which accepts, per issue, a set of relevant state fields and functions, and generates an extended finite state machine that represents the faulty behavior, abstracted with respect to the relevant context (the selected fields and functions).
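For concreteness, the sketch below shows the core abstraction step in a toy form: traces are projected onto the developer-selected fields, and transitions between the resulting abstract states form the machine. The real tool works on C code and produces an extended FSM; this only illustrates the idea.

```python
# Toy sketch: abstract execution traces to the selected fields, then
# collect the observed abstract states and transitions.
def build_fsm(traces, relevant_fields):
    """traces: list of per-step dicts mapping field name -> value."""
    states, transitions = set(), set()
    for trace in traces:
        prev = None
        for snapshot in trace:
            state = tuple(sorted((f, snapshot[f]) for f in relevant_fields))
            states.add(state)
            if prev is not None:
                transitions.add((prev, state))
            prev = state
    return states, transitions
```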
Results: We qualitatively evaluate the approach through a set of interviews (including observational studies) with the company’s developers on their real-world reported bugs. The results show that (a) our approach is feasible, (b) it can be automated to some extent, and (c) it brings advantages over using only their code-level debugging tools. We also compared this approach with traditional fully automated state-merging algorithms and report several issues that arise when applying those techniques in a real-world debugging context.
Conclusion: The main conclusion of this study is that the idea of “interactive” specification mining, as opposed to a fully automated mining tool, is not impractical and is indeed useful for the debugging use case.
Link to Publication: https://www.sciencedirect.com/science/article/pii/S0950584919301004
Modern software systems are increasingly dependent on third-party libraries. It is widely recognized that reusing functionality provided by mature and well-tested third-party libraries can improve developers’ productivity, reduce time-to-market, and produce more reliable software. Today’s open-source repositories provide a wide range of libraries that can be freely downloaded and used. However, as software libraries are documented separately but intended to be used together, developers are unlikely to fully take advantage of these reuse opportunities. In this paper, we present a novel approach to automatically identify third-party library usage patterns, i.e., collections of libraries that are commonly used together by developers. Our approach employs a hierarchical clustering technique to group software libraries based on external client usage, analyzing the joint versus separate use of the libraries. The libraries of each pattern are distributed across different usage-cohesion levels (layers): each layer reflects the co-usage frequency within a set of libraries, while the distribution across levels captures the gradation in co-usage frequency. To evaluate our approach, we mined a large set of over 6,000 popular libraries from the Maven Central Repository and investigated their usage by over 38,000 client systems from GitHub. Our experiments show that our technique detects the majority (77%) of highly consistent and cohesive library usage patterns across a considerable number of client systems.
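A hedged sketch of the clustering idea follows: libraries are clustered by a distance derived from how often clients use them together, and cutting the dendrogram at different heights yields the cohesion layers. The co-usage matrix, library names, and thresholds below are toy values, not the paper's data.

```python
# Illustrative hierarchical clustering of libraries by client co-usage.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

libs = ["junit", "mockito", "slf4j", "logback"]
# co_usage[i][j]: fraction of clients using lib i that also use lib j.
co_usage = np.array([[1.0, 0.8, 0.2, 0.1],
                     [0.8, 1.0, 0.2, 0.1],
                     [0.2, 0.2, 1.0, 0.9],
                     [0.1, 0.1, 0.9, 1.0]])
distance = 1.0 - co_usage                    # frequently co-used => close
Z = linkage(squareform(distance), method="average")
# Cutting the dendrogram at different heights yields the usage-cohesion
# layers: tighter cuts give more frequently co-used library groups.
for t in (0.3, 0.85):
    print(t, fcluster(Z, t=t, criterion="distance"))
```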
Link to Publication: https://www.sciencedirect.com/science/article/pii/S0164121218301699
Software systems are often released without formal specifications. To deal with missing or outdated specifications, rule-based specification mining approaches have been proposed. These approaches analyze execution traces of a system to infer the rules that characterize the protocols, typically of a library, that its clients must obey. Rule-based specification mining approaches work by exploring the search space of all possible rules and using interestingness measures to differentiate specifications from false positives. Previous approaches often rely on one or two interestingness measures, and the potential benefit of combining multiple available measures has not yet been investigated. In this work, we propose a learning-to-rank-based approach that automatically learns a good combination of 38 interestingness measures. Our experiments show that it outperforms the best-performing approach leveraging a single interestingness measure by up to 66%.
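The sketch below illustrates the combination idea with a pointwise scikit-learn stand-in: each candidate rule is described by its 38 measure values, a model is fit on labeled rules, and candidates are ranked by the learned score. The paper's actual formulation is learning-to-rank; the data here is synthetic.

```python
# Pointwise approximation of ranking candidate rules by a learned
# combination of 38 interestingness measures (toy data).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.random((200, 38))      # 38 measure values per candidate rule
y_train = rng.integers(0, 2, 200)    # 1 = true specification, 0 = false positive

ranker = RandomForestRegressor(n_estimators=100, random_state=0)
ranker.fit(X_train, y_train)

X_candidates = rng.random((10, 38))
scores = ranker.predict(X_candidates)
ranking = np.argsort(-scores)        # rank candidate rules by learned score
```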
Link to Publication: https://ink.library.smu.edu.sg/cgi/viewcontent.cgi?article=4990&context=sis_research
Precise pointer analysis is desirable since it is a core technique for finding memory defects. Pointer analysis precision has several dimensions: flow sensitivity, context sensitivity, field sensitivity, and path sensitivity. For static analysis tools built on pointer analysis, considering all dimensions is difficult because the trade-off between precision and efficiency must be balanced. This paper presents TsmartGP, a static analysis tool for finding memory defects in C programs through a precise and efficient pointer analysis. The pointer analysis algorithm is flow, context, field, and quasi-path sensitive. Control-flow automata are the key structures that make our analysis flow sensitive. Function summaries are applied to obtain context information, and elements of aggregate structures are handled to improve precision. Path conditions are used to filter out unreachable paths. For efficiency, a multi-entry mechanism is proposed. Using this pointer analysis algorithm, we implement a checker in TsmartGP to find uninitialized pointer errors in 13 real-world applications, choosing Cppcheck and the Clang Static Analyzer for comparison. The experimental results show that TsmartGP finds more errors with an accuracy rate of 100%, far higher than the 3.1% of Cppcheck and the 16.7% of the Clang Static Analyzer. The demo video is available at https://youtu.be/IQlshemk6OA and TsmartGP can be downloaded from https://github.com/laoyaolandq/TsmartGP.
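As a much-simplified illustration of the checker's core idea, the toy analysis below tracks which pointers have been assigned along a straight-line statement sequence and flags dereferences that precede any assignment; the real analysis is flow, context, field, and quasi-path sensitive over C code, far beyond this sketch.

```python
# Toy flow-sensitive check for use-before-initialization of pointers
# over a straight-line sequence of (kind, variable) statements.
def find_uninitialized_uses(stmts):
    initialized, errors = set(), []
    for i, (kind, var) in enumerate(stmts):
        if kind == "decl":                 # e.g., "int *p;"
            initialized.discard(var)
        elif kind == "assign":             # e.g., "p = &x;"
            initialized.add(var)
        elif kind == "deref":              # e.g., "*p"
            if var not in initialized:
                errors.append((i, var))    # dereference before initialization
    return errors

# Example: p is dereferenced before any assignment.
print(find_uninitialized_uses([("decl", "p"), ("deref", "p"),
                               ("assign", "p"), ("deref", "p")]))
# -> [(1, 'p')]
```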
Misuse of APIs happens frequently due to misunderstanding of API semantics and lack of documentation. An important category of API-related defects is error handling defects, which may result in security and reliability flaws. These defects can be detected with the help of static program analysis, provided that error specifications are known. The error specification of an API function indicates how the function can fail. Writing error specifications manually is time-consuming and tedious; automatically inferring them from API usage code is therefore preferable. In this paper, we present Ares, a tool that automatically infers error specifications for C code through static analysis. We employ multiple heuristics to identify error handling blocks and infer error specifications by analyzing the corresponding condition logic. Ares is evaluated on 19 real-world projects, and the results reveal that it outperforms the state-of-the-art tool APEx by 37% in precision while also identifying more error specifications. Moreover, the specifications inferred by Ares helped find dozens of API-related bugs in well-known projects such as OpenSSL, 10 of which have been confirmed by developers. Video: https://youtu.be/nf1QnFAmu8Q. Repository: https://github.com/lc3412/Ares.
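A hedged sketch of the inference idea follows: scan usage code for return-value checks that guard error handling, and record the observed failure condition per API. The regular expressions and patterns are toy stand-ins for Ares's static analysis, not its implementation.

```python
# Toy error-specification inference: link 'ret = f(...);' assignments
# to the 'if (ret ...)' checks that guard error handling.
import re
from collections import defaultdict

CALL = re.compile(r"(\w+)\s*=\s*(\w+)\s*\(")
CHECK = re.compile(r"if\s*\(\s*(\w+)\s*(==\s*NULL|<\s*0|!=\s*0)\s*\)")

def infer_error_specs(c_source):
    assigned_by = {}                     # variable -> API that produced it
    specs = defaultdict(set)             # API -> observed error conditions
    for line in c_source.splitlines():
        m = CALL.search(line)
        if m:
            assigned_by[m.group(1)] = m.group(2)
        m = CHECK.search(line)
        if m and m.group(1) in assigned_by:
            specs[assigned_by[m.group(1)]].add(m.group(2).replace(" ", ""))
    return dict(specs)

code = """
    fp = fopen(path, "r");
    if (fp == NULL) { return -1; }
    n = read(fd, buf, len);
    if (n < 0) { perror("read"); }
"""
print(infer_error_specs(code))  # {'fopen': {'==NULL'}, 'read': {'<0'}}
```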
Merge conflicts often occur when developers concurrently change the same code artifacts. While state-of-practice unstructured merge tools (e.g., git merge) try to automatically resolve merge conflicts based on textual similarity, semistructured and structured merge tools go further by exploiting the syntactic structure and semantics of the involved artifacts. Although there is evidence that semistructured merge has significant advantages over unstructured merge, and that structured merge reports significantly fewer conflicts than unstructured merge, it is unknown how semistructured merge compares with structured merge. To help developers decide which kind of tool to use, we compare semistructured and structured merge in an empirical study by reproducing more than 40,000 merge scenarios from more than 500 projects. In particular, we assess how often the two merge strategies report different results, identifying conflicts incorrectly reported by one but not by the other (false positives) and conflicts correctly reported by one but missed by the other (false negatives). Our results show that semistructured and structured merge differ in 24% of the scenarios with conflicts. Semistructured merge reports more false positives, whereas structured merge has more false negatives. Finally, we observe that adapting a semistructured merge tool to resolve a particular kind of conflict brings semistructured and structured merge even closer.
Industry widely uses unstructured merge tools that rely on textual analysis to detect and resolve conflicts between code contributions. Semistructured merge tools go further by partially exploring the syntactic structure of code artifacts and, as a consequence, obtain significant merge accuracy gains for Java-like languages. To understand whether semistructured merge and the observed gains generalize to other kinds of languages, we implement two semistructured merge tools for JavaScript and compare them to an unstructured tool. We find that current semistructured merge algorithms and frameworks are not directly applicable to scripting languages like JavaScript. By adapting the algorithms, and studying 10,345 merge scenarios from 50 JavaScript projects on GitHub, we find evidence that our JavaScript tools report fewer spurious conflicts than unstructured merge, without compromising the correctness of the merging process. The gains, however, are much smaller than those observed for Java-like languages, suggesting that the advantages of semistructured merge might be limited for languages that allow both commutative and non-commutative declarations at the same syntactic level.
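The toy model below illustrates why semistructured merge avoids spurious conflicts for commutative declarations: each version of a class is treated as a map from declaration name to body, and a three-way merge is done per declaration rather than per text line. It is an illustration of the strategy, not either tool's implementation.

```python
# Three-way merge at declaration granularity (toy semistructured merge).
def semistructured_merge(base, left, right):
    """Each version maps declaration name -> body text."""
    merged, conflicts = {}, []
    for name in set(base) | set(left) | set(right):
        b, l, r = base.get(name), left.get(name), right.get(name)
        if l == r:                       # same on both sides (or both deleted)
            if l is not None:
                merged[name] = l
        elif l == b:                     # only the right side changed it
            if r is not None:
                merged[name] = r
        elif r == b:                     # only the left side changed it
            if l is not None:
                merged[name] = l
        else:                            # both changed it differently
            conflicts.append(name)
    return merged, conflicts

# Two developers adding different methods merge cleanly here, whereas a
# line-based merge could report a conflict if the additions overlap.
```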
Code clones are known to harm the maintenance and evolution of software systems, e.g., by complicating the modification of source code and propagating potential bugs throughout a system. Nowadays, a single software system is often developed in multiple programming languages for greater adaptability. In such systems, the same functionality is replicated across variants implemented in different programming languages (cross-language clones, CLCs). The consequences of these clones are more severe because they are hard to detect and their modifications are hard to track. However, while there are many studies on finding clones within a single programming language, studies on detecting CLCs are markedly lacking. To fill this gap, this paper proposes CLCDSA, a detector that finds CLCs without requiring any prerequisites or intermediate representations. The model analyzes syntactic features of source code across different programming languages to detect CLCs. To scale to large clone datasets, the model includes an action filter, based on cross-language API-call similarity, that discards unlikely clone pairs before they reach the main model. The design of CLCDSA is twofold: (a) it can detect CLCs on the fly by computing feature-value similarity, and (b) it employs a deep neural network-based feature-vector learning model to learn the features and detect CLCs. An early evaluation observed average precision, recall, and F-measure scores of 0.55, 0.86, and 0.64, respectively, for the on-the-fly phase, and 0.61, 0.93, and 0.71 for the neural-network phase, indicating that CLCDSA outperforms the available models for detecting cross-language clones.
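To illustrate the on-the-fly phase's feature-value similarity in a toy form: count language-agnostic syntactic features from each fragment and compare the resulting vectors. The feature set and threshold below are assumptions for illustration, not CLCDSA's actual features.

```python
# Compare syntactic feature vectors of two code fragments, regardless
# of their source languages (toy feature counts and threshold).
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Illustrative features: (#loops, #conditionals, #function calls,
# #arithmetic ops, #array accesses) counted from each fragment's AST.
java_fragment   = [2, 3, 5, 4, 1]
python_fragment = [2, 3, 4, 4, 1]
if cosine(java_fragment, python_fragment) > 0.9:   # assumed threshold
    print("candidate cross-language clone pair")
```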
COTS software products are widely developed on top of one or more OSS projects, which can introduce OSS reuse vulnerabilities. Discovering such vulnerabilities requires detecting OSS reuse in COTS software. Existing binary-to-source matching approaches scale to tens of thousands of OSS projects; however, when applied to COTS software products, they suffer severely from precision problems due to their limited code features, imprecise matching-score computation, and neglect of the code structure of OSS projects. In this paper, we propose B2SFinder, a novel binary-to-source matching approach that addresses these issues. It analyzes and selects seven kinds of code features that are present in both binary files and source code and are not distorted by compilation. To calculate matching scores precisely, it employs a weighted feature matching algorithm that combines three matching methods with two importance-weight computing algorithms: the matching methods are applied to different features according to each feature's representation form, while the weighting algorithms compute the weight of a feature instance based on its specificity and occurrence frequency. B2SFinder further identifies the reuse type based on the matching scores and the code structure of the OSS projects. We have implemented a prototype of B2SFinder with optimized data structures and evaluated it on 21,991 binaries from 1,000 popular COTS software products. The results show that it is not only precise but also scalable: it identified up to 2.15 times as many reuse cases as the state-of-the-art approach while taking only 53.85 seconds on average per binary file. It also plays a major role in discovering OSS reuse vulnerabilities.
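A hedged sketch of the specificity-based weighting follows: matched features contribute scores inversely related to how common they are across the corpus, so rare strings are stronger evidence of reuse than ubiquitous ones. The strings, frequencies, and corpus size are toy values, not B2SFinder's data or exact formula.

```python
# Weight matched binary/source features by specificity (rarer features
# across the corpus contribute more to the matching score).
import math

def match_score(binary_features, source_features, global_freq, corpus_size):
    score = 0.0
    for feat in binary_features & source_features:
        # Inverse-frequency weighting, in the spirit of tf-idf.
        score += math.log(corpus_size / (1 + global_freq.get(feat, 0)))
    return score

bin_strings = {"libfoo: invalid frame", "usage: foo [-v]", "error"}
src_strings = {"libfoo: invalid frame", "error"}
freq = {"libfoo: invalid frame": 2, "error": 9500}
print(match_score(bin_strings, src_strings, freq, corpus_size=10000))
```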
Code review is an important mechanism for code quality assurance in both open-source and industrial software. Reviewers often struggle when numerous, tangled, and loosely related code changes are bundled into a single commit, which makes code review very difficult. In this paper, we propose CoRA (Code Review Assistant), an automatic approach that decomposes a tangled commit into different parts and generates concise descriptions for reviewers. More specifically, CoRA decomposes a commit into independent parts (e.g., bug fixing, new feature addition, or refactoring) through code dependency analysis and tree-based similar-code detection, then identifies the most important code changes in each part based on the PageRank algorithm and heuristic rules. As a result, CoRA can generate a concise description for each part of the commit. We evaluate our approach on 7 open-source projects and 50 code commits. The results indicate that CoRA improves the accuracy of decomposing code changes by 6.3% over the state-of-the-art practice, and identifies the important parts among fine-grained code changes with a mean average precision (MAP) of 87.7%. We also conducted a human study with 8 participants to evaluate the performance and usefulness of CoRA; the feedback indicates that CoRA can effectively help reviewers.
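As an illustration of the ranking step, the sketch below runs PageRank over a toy change-dependency graph using networkx; changes that many other changes depend on rank highest. The graph structure and node names are hypothetical, not taken from CoRA.

```python
# Rank fine-grained code changes by importance with PageRank over a
# toy change-dependency graph.
import networkx as nx

g = nx.DiGraph()
# Edge u -> v: change u depends on (references code touched by) change v.
g.add_edges_from([
    ("rename_helper", "fix_null_check"),
    ("update_caller_a", "fix_null_check"),
    ("update_caller_b", "fix_null_check"),
    ("fix_null_check", "add_guard_clause"),
])
ranks = nx.pagerank(g, alpha=0.85)
for change, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{score:.3f}  {change}")   # the central fix ranks near the top
```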