Code coverage, as primitive dynamic information about program behavior, is widely adopted to facilitate a rich spectrum of software engineering tasks, such as testing, fuzzing, debugging, fault detection, reverse engineering, and program understanding. Given these widespread applications, it is crucial to ensure the reliability of code coverage profilers.
Unfortunately, due to limited research attention and the test oracle problem, coverage profilers are far from sufficiently tested. Bugs are still regularly found in widely deployed profilers such as gcov and llvm-cov, shipped with GCC and LLVM, respectively.
This paper proposes Cod, a fully automated self-validator for effectively uncovering bugs in coverage profilers. Cod takes a single profiler and a program (either from a compiler's test suite or generated randomly) as input and uncovers bugs by identifying inconsistencies between the coverage results of the input program and those of equivalent mutated variants whose coverage statistics are expected to be identical.
We evaluated Cod on two of the most well-known code coverage profilers, gcov and llvm-cov. Within a four-month testing period, a total of 196 potential bugs (123 for gcov, 73 for llvm-cov) were found, among which 23 were confirmed by the developers.
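To make the self-validation idea concrete, here is a minimal Python sketch of the kind of differential check Cod performs; the variant generation is omitted, and the gcov invocation and simplified .gcov parsing are our assumptions, not Cod's actual implementation:

```python
# Hypothetical sketch: profile a program and a coverage-equivalent variant
# with gcov, then flag per-line count divergences as potential profiler bugs.
import subprocess

def gcov_line_counts(src_path):
    """Compile src_path with coverage instrumentation, run it, and return
    {line_no: execution_count} parsed (simplistically) from gcov output."""
    subprocess.run(["gcc", "--coverage", "-o", "a.out", src_path], check=True)
    subprocess.run(["./a.out"], check=True)
    subprocess.run(["gcov", src_path], check=True)
    counts = {}
    with open(src_path + ".gcov") as f:
        for line in f:
            count, line_no, _ = line.split(":", 2)
            count, line_no = count.strip().rstrip("*"), int(line_no.strip())
            if line_no > 0 and count != "-":          # skip non-executable lines
                counts[line_no] = 0 if count == "#####" else int(count)
    return counts

def check_consistency(original, variant, shared_lines):
    """shared_lines: lines the equivalence-preserving mutation left untouched,
    so their execution counts are expected to be identical."""
    a, b = gcov_line_counts(original), gcov_line_counts(variant)
    for ln in sorted(shared_lines):
        if a.get(ln) != b.get(ln):
            print(f"possible profiler bug at line {ln}: {a.get(ln)} vs {b.get(ln)}")
```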
Field-exhaustive testing is a testing criterion suitable for object-oriented code over complex, heap-allocated data structures. It requires test suites to contain enough test inputs to cover all feasible values for objects' fields within a certain scope (input-size bound). While previous work shows that field-exhaustive suites can be generated automatically, the generation technique requires a formal specification of the inputs that can be subjected to SAT-based analysis. Moreover, the requirement to produce all feasible values for inputs' fields makes test generation costly.
In this paper, we deal with field coverage as a testing criterion that measures the quality of a test suite by examining to what extent the values of inputs' fields are covered. In particular, we consider field coverage in combination with test generation based on symbolic execution to produce underapproximations of field-exhaustive suites, using the Symbolic PathFinder tool. To underapproximate these suites we use tranScoping, a technique that estimates characteristics of yet-to-be-run analyses at large scopes, based on data obtained from analyses performed at small scopes. This provides us with a suitable condition to stop the symbolic execution early.
As we show, tranScoping different metrics regarding field coverage allows us to produce significantly smaller suites in a fraction of the generation time, while retaining the effectiveness of field-exhaustive suites in terms of test suite quality.
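The stopping condition can be illustrated with a small Python sketch of the tranScoping extrapolation; the exponential growth model, the sample numbers, and the 95% cutoff below are illustrative assumptions rather than the paper's actual fitting procedure:

```python
# A minimal sketch, assuming roughly exponential metric growth across scopes:
# fit a curve to a metric measured at small scopes, then use the extrapolated
# value at a large scope as an early-stopping target for test generation.
import numpy as np

small_scopes = np.array([3, 4, 5, 6])        # scopes already analyzed
field_pairs  = np.array([14, 30, 55, 91])    # e.g., covered field-value pairs

# Fit log(metric) ~ a * scope + b.
a, b = np.polyfit(small_scopes, np.log(field_pairs), 1)

def predicted_target(scope):
    return np.exp(a * scope + b)

# Stop symbolic execution at scope 9 once ~95% of the prediction is reached.
target = 0.95 * predicted_target(9)
print(f"stop once covered pairs >= {target:.0f}")
```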
In software testing, different testers focus on different aspects of the software, such as functionality, performance, design, and other attributes. While many tools and coverage metrics exist to support testers at the code level, little support targets testers who want to inspect the output of a program such as a dynamic web application. To support this category of testers, we propose a family of output-coverage metrics (analogous to statement, branch, and path coverage metrics on code) that measure how much of the possible output has been produced by a test suite and what parts of the output remain uncovered. To do that, we first approximate the output universe using our existing symbolic execution technique. Then, given a set of test cases, we map the produced outputs onto the output universe to identify the covered and uncovered parts and compute output-coverage metrics. In our empirical evaluation on seven real-world PHP web applications, we show that selecting test cases by output coverage is more effective at identifying presentation faults, such as HTML validation errors and spelling errors, than selecting test cases by traditional code coverage. In addition, to help testers understand output coverage and augment test cases, we also developed a tool called WebTest that displays the output universe in a single web page and allows testers to visually explore covered and uncovered parts of the output.
Link to Publication: https://www.cs.cmu.edu/~ckaestne/pdf/jase18.pdf
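As a rough illustration of this metric family (our simplification, not the paper's exact definition), output coverage can be phrased as the fraction of an approximated output universe that the concrete test outputs reach:

```python
# Sketch: the output universe is approximated as a set of output segments,
# and coverage is the fraction of segments produced by at least one test.
def output_coverage(universe_segments, test_outputs):
    """universe_segments: set of strings approximating all producible output
    fragments; test_outputs: concrete outputs produced by the test suite."""
    covered = {seg for seg in universe_segments
               if any(seg in out for out in test_outputs)}
    uncovered = universe_segments - covered
    return len(covered) / len(universe_segments), uncovered

universe = {"<h1>Welcome</h1>", "<p>Error: bad login</p>", "<p>Logged out</p>"}
outputs = ["<html><h1>Welcome</h1></html>"]
ratio, missing = output_coverage(universe, outputs)
print(f"output coverage: {ratio:.0%}; uncovered: {missing}")
```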
Test code is becoming more essential to the modern software development process. However, practitioners often pay inadequate attention to key aspects of test code quality, such as bug detectability, maintainability, and speed. Existing tools also typically report a single test code quality measure, such as code coverage, rather than a diversified set of metrics. To measure and visualize the quality of test code in a comprehensive fashion, we developed an integrated test code analysis tool called Phanta. In this showcase, we posit that enhancing test code quality is key to modernizing software development, and show how Phanta's techniques measure that quality using mutation analysis, test code clone detection, and so on. Further, we present an industrial case study where Phanta was applied to analyze test code in a real Fujitsu project, and share lessons learned from the case study.
We present TestCov, a tool for robust test-suite execution and test-coverage measurement on C programs. TestCov executes program tests in isolated containers to ensure system integrity and reliable resource control. The tool provides coverage statistics per test and for the whole test suite. TestCov uses the simple, XML-based exchange format for test-suite specifications that was established as standard by Test-Comp. TestCov has been successfully used in Test-Comp ’19 to execute almost 9 million tests on 1720 different programs. The source code of TestCov is released under the open-source license Apache 2.0 and available at https://gitlab.com/sosy-lab/software/test-suite-validator. A full artifact, including a demonstration video, is available at https://doi.org/10.5281/zenodo.3418726.
Fuzzing is widely used for vulnerability detection. One challenge for efficient fuzzing is covering code guarded by constraints such as magic numbers and nested conditions. Recently, academia has partially addressed this challenge via whitebox methods. However, high-level constraints such as array sorting, virtual function invocations, and tree set queries have yet to be handled.
To this end, we present VisFuzz, an interactive tool for better understanding and intervening in the fuzzing process via real-time visualization. It extracts the call graph and control flow graph from source code, maps each function and basic block to its lines of source code, and tracks real-time execution statistics with detailed constraint contexts. With VisFuzz, test engineers first locate blocking constraints and then learn their semantic context, which helps to craft targeted inputs or update test drivers. Preliminary evaluations were conducted on four real-world programs from the Google fuzzer-test-suite. Given an additional 15 minutes to understand and intervene in the state of fuzzing, the intervened fuzzing outperforms pure AFL fuzzing, with path coverage improvements ranging from 10.84% to 150.58% when both are fuzzed for 12 hours.
Grammar-based testing uses a given grammar to produce syntactically valid inputs. To cover program features, it is necessary to also cover input features—say, all URL variants for a URL parser. Our k-path algorithm for grammar production systematically covers syntactic elements as well as their combinations. In our evaluation, we show that this results in significantly higher code coverage than the state of the art.
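A small Python sketch of the coverage measure behind the algorithm (the tuple-based tree encoding is our assumption): a k-path is a length-k sequence of symbols along a root-to-leaf path of a derivation tree, and a suite's k-path coverage is the set of such sequences it exercises:

```python
# Sketch: enumerate the k-symbol path segments of a derivation tree.
def k_paths(tree, k, prefix=()):
    """tree = (symbol, [children]); yields every k-symbol path segment
    along root-to-leaf paths."""
    symbol, children = tree
    path = prefix + (symbol,)
    if len(path) >= k:
        yield path[-k:]
    for child in children:
        yield from k_paths(child, k, path)

# Toy derivation tree for a URL grammar: <url> ::= <scheme> "://" <host>
tree = ("<url>", [("<scheme>", [("http", [])]),
                  ("<host>", [("example.com", [])])])
suite_paths = set(k_paths(tree, 2))
print(suite_paths)  # {('<url>', '<scheme>'), ('<scheme>', 'http'), ...}
```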
Automated testing techniques are often assessed on coverage-based metrics. However, despite achieving good coverage, test cases may miss the gap between the functional specification and the code implementation. This gap may be subtle in nature, arising from the absence of logical checks, either in the implementation or in the specification, resulting in inconsistencies in the input definition. The inconsistencies may be especially prevalent for structured inputs, commonly specified using string-based data types. Our study of defects reported in popular libraries reveals that such gaps may not be limited to input validation checks. We propose a test generation technique for structured string inputs in which we infer inconsistencies in the input definition to expose semantic gaps between the method under test and the method specification. We assess this technique using our tool SEGATE, the Semantic Gap Tester. SEGATE uses static analysis and automaton modeling to infer the gap and generate test cases. On our benchmark dataset, comprising defects reported in 15 popular open-source Java libraries, SEGATE was able to generate tests exposing 80% of the defects.
Most software systems frequently encounter errors when interacting with their environments. When errors occur, error-handling code must execute flawlessly to facilitate system recovery. Implementing correct error handling is repetitive but non-trivial, and developers often inadvertently introduce bugs into error-handling code. Existing tools require correct error specifications to detect error-handling bugs, but manually writing error specifications is error-prone and tedious, while automatically mined specifications struggle to reach satisfactory accuracy. In this paper, we propose EH-Miner, a novel and practical tool that automatically detects error-handling bugs without the need for error specifications. Given a function, EH-Miner mines its error-handling rule when the function is frequently checked by an equivalent condition and handled by the same action. We applied EH-Miner to 117 mature applications across 15 software domains. EH-Miner mined error-handling rules with a precision of 91.1% and a recall of 46.9%. We reported 142 bugs to developers, and 106 had been confirmed and fixed at the time of writing. We further applied EH-Miner to the Linux kernel and reported 68 bugs for kernel 4.17, of which 42 had been confirmed.
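The mining step can be sketched as follows; the call-site representation and the 80% support threshold are assumptions for illustration, not EH-Miner's actual parameters:

```python
# Sketch: if most call sites of a function check its result with an
# equivalent condition and handle failure with the same action, mine that
# (check, action) pair as the rule; deviating sites become bug candidates.
from collections import Counter

def mine_rule(call_sites, min_support=0.8):
    """call_sites: list of (check, action) pairs observed for one function,
    e.g. ('ret < 0', 'return -1'). Returns the dominant rule or None."""
    counts = Counter(call_sites)
    (rule, n), = counts.most_common(1)
    return rule if n / len(call_sites) >= min_support else None

sites = [("ret < 0", "return -1")] * 9 + [("ret < 0", "continue")]
rule = mine_rule(sites)
violations = [s for s in sites if rule and s != rule]
print(f"rule: {rule}; candidate bugs: {violations}")
```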
Modern embedded systems are increasingly complex and contain multiple software layers, from BSPs (Board Support Packages) to the OS to middleware to AI (Artificial Intelligence) algorithms such as perception and voice recognition. Inter-layer and intra-layer integrations in embedded systems provide dedicated services such as taking a picture or streaming a movie. Accordingly, it becomes more complicated to find the root cause of a system failure. This industrial proposal describes the difficulties of testing embedded systems and presents a case study on integration testing.
As introductory programming classes continue to grow, it becomes ever more important to ensure that automatic grading is both efficient and highly accurate. Specifically, the use of automated testing for this type of grading has become standard, but it is not without fault: instructor test suites can often misclassify correct student submissions as incorrect or vice versa. To address this discrepancy, we propose a novel approach that adapts the technique of test suite augmentation to an educational setting in order to improve instructor test suite quality. Our approach first partitions student programs into clusters based on behavioral equivalence. We then use these clusters to better guide the Pex test generator in generating a minimal set of tests that can augment an instructor's suite and improve grading accuracy. We evaluated our approach on an extensive set of student submissions from a CS1 course and were able to find flaws in existing instructor test suites, and then augment these suites to substantially improve their quality.
The number of Android applications is growing tremendously, putting pressure on practitioners and researchers around application quality, frequent releases, and quick bug fixing. This pressure leads practitioners to use automated approaches. However, most of these approaches take source code as input, which third-party services cannot access due to privacy concerns. In this paper we present MutAPK, an open-source mutation testing tool that enables the use of APKs as input for this task. MutAPK generates mutants without requiring access to source code, because the mutations are applied to an intermediate representation of the code (i.e., SMALI) that does not require compilation. MutAPK is publicly available on GitHub: https://bit.ly/2KYvgP9 VIDEO: https://bit.ly/2WOjiyy
ACM SIGSOFT Distinguished Paper Award
Compilers, like other software systems, contain bugs, and compiler testing is the most widely-used way to assure compiler quality. A critical task of compiler testing is to generate test programs that could effectively and efficiently discover bugs. Though we can configure test generators such as Csmith to control the features of the generated programs, it is not clear what test configuration is effective. In particular, an effective test configuration needs to generate test programs that are bug-revealing, i.e., likely to trigger bugs, and diverse, i.e., able to discover different types of bugs. It is not easy to satisfy both properties. In this paper, we propose a novel test-program generation approach, called HiCOND, which utilizes historical data for configuration diversification to solve this challenge. HiCOND first infers the range for each option in a test configuration where bug-revealing test programs are more likely to be generated based on historical data. Then, it identifies a set of test configurations that can lead to diverse test programs through a search method (particle swarm optimization). Finally, based on the set of test configurations for compiler testing, HiCOND generates test programs, which are likely to be bug-revealing and diverse. We have conducted experiments on two popular compilers GCC and LLVM, and the results confirm the effectiveness of our approach. For example, HiCOND detects 75.00%, 133.33%, and 145.00% more bugs than the three existing approaches, respectively. Moreover, HiCOND has been successfully applied to actual compiler testing in a global IT company and detected 11 bugs during the practical evaluation.
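For illustration, here is a minimal particle swarm optimization loop of the kind HiCOND uses for configuration diversification; the toy diversity fitness and the inertia/acceleration constants are our assumptions, and the real system searches over Csmith option ranges inferred from historical bug data:

```python
# Sketch: PSO over an option vector constrained to bug-revealing ranges,
# with a fitness that rewards distance from already-selected configurations.
import random

def pso(dim, lo, hi, fitness, particles=20, iters=50):
    xs = [[random.uniform(lo[d], hi[d]) for d in range(dim)] for _ in range(particles)]
    vs = [[0.0] * dim for _ in range(particles)]
    best = [x[:] for x in xs]                      # per-particle best
    gbest = max(best, key=fitness)                 # global best
    for _ in range(iters):
        for i, x in enumerate(xs):
            for d in range(dim):
                vs[i][d] = (0.7 * vs[i][d]
                            + 1.4 * random.random() * (best[i][d] - x[d])
                            + 1.4 * random.random() * (gbest[d] - x[d]))
                x[d] = min(hi[d], max(lo[d], x[d] + vs[i][d]))  # clamp to range
            if fitness(x) > fitness(best[i]):
                best[i] = x[:]
        gbest = max(best, key=fitness)
    return gbest

# Toy fitness: prefer configurations far from the one already selected.
selected = [[0.5, 0.2, 0.8]]
diversity = lambda x: min(sum((a - b) ** 2 for a, b in zip(x, s)) for s in selected)
print(pso(3, [0.0] * 3, [1.0] * 3, diversity))
```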
We present TEGCER, an automated feedback tool for novice programmers. TEGCER uses supervised classification to match compilation errors in new code submissions with relevant pre-existing errors, submitted by other students before. The dense neural network used to perform this classification task is trained on 15,000+ error-repair code examples. The proposed model yields a test set classification Pred@3 accuracy of 97.7% across 212 error category labels. Using this model as its base, TEGCER presents students with the closest relevant examples of solutions for their specific error on demand. A large scale (N>230) usability study shows that students who use TEGCER are able to resolve errors more than 25% faster on average than students being assisted by human tutors.
Link to Publication: https://github.com/umairzahmed/tegcer
Test cases are crucial to help developers prevent the introduction of software faults. Unfortunately, not all tests are properly designed or can effectively capture faults in production code. Some measures have been defined to assess test-case effectiveness: the most relevant one is the mutation score, which highlights the quality of a test by generating so-called mutants, i.e., variations of the production code that make it faulty and that the test is supposed to identify. However, previous studies revealed that mutation analysis is extremely costly and hard to use in practice. The approaches proposed by researchers so far have not been able to provide practical gains in mutation testing efficiency. Indeed, to apply mutation testing in practice a developer needs to (i) generate mutants, (ii) compile the source code, and (iii) run test cases. For systems whose test suites count hundreds of test cases, this is infeasible. As a consequence, the problem of automatically assessing test-case effectiveness in a timely and efficient manner is still far from solved. In this paper, we present a novel methodology orthogonal to existing approaches. In particular, we investigate the feasibility of estimating test-case effectiveness, as indicated by mutation score, by exploiting production- and test-code-quality indicators. We first select a set of 67 factors, including test and code metrics as well as test and code smells, and study their relation to test-case effectiveness. We discover that 41 of the 67 investigated factors differ statistically between effective and non-effective tests. Then, we devise a mutation score estimation model exploiting such factors and investigate its performance as well as its most relevant features. The key results of the study reveal that our estimation model, based only on statically computable features, achieves 86% in both F-measure and AUC-ROC. This means that we can estimate test-case effectiveness, using source-code-quality indicators, with high accuracy and without executing the tests. In fact, adding line coverage as an additional feature increases the performance of the model by only about 9%. As a consequence, we can provide a practical approach that goes beyond the typical limitations of current mutation testing techniques.
Link to Publication: https://ieeexplore.ieee.org/document/8658120
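A minimal sketch of such an estimation model, assuming a handful of invented static features and toy labels (the paper's model uses 67 production- and test-code factors):

```python
# Sketch: a classifier trained on statically computable metrics predicts
# whether a test is effective (i.e., its mutation score clears a threshold)
# without ever generating or running mutants.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Rows: one per test case; columns: assumed static factors, e.g. test LOC,
# assertion count, McCabe complexity of tested code, number of test smells.
X = [[120, 8, 14, 0], [30, 1, 22, 3], [80, 5, 9, 1], [15, 0, 30, 4]]
y = [1, 0, 1, 0]   # 1 = effective (high mutation score), 0 = not

model = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(model, X, y, cv=2, scoring="f1").mean())
```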
Analyzing executions of concurrent software is very difficult. Even when a trace is available, such traces are hard to read and interpret. A textual trace contains a lot of data, most of which is not relevant to the issue at hand. Past visualization attempts either do not show concurrent behavior or result in a view that overwhelms the user.
We provide a visual analytics tool, VA4JVM, for error traces produced either by the Java Virtual Machine or by Java Pathfinder. Its key features are a layout that spatially associates events with threads, a zoom function, and the ability to filter event data in various ways. We show through examples how filtering and zooming can highlight a problem without the need to read lengthy textual data.
Deep Neural Network (DNN) techniques have become prevalent in software engineering. They are employed to facilitate various software engineering tasks and are embedded into many software applications. However, because DNNs are built upon a data-driven programming paradigm that uses plenty of labeled data to train a set of neurons constructing the internal system logic, analyzing and understanding their behaviors becomes difficult for software engineers. In this paper, we present an instance-based visualization tool for DNNs, NeuralVis, to support software engineers in visualizing and interpreting deep learning models. NeuralVis is designed for: (1) visualizing the structure of DNN models, i.e., neurons, layers, and connections; (2) visualizing the data transformation process; (3) integrating existing adversarial attack algorithms for test input generation; and (4) comparing intermediate outputs of different instances. To demonstrate the effectiveness of NeuralVis, we designed a task-based user study involving ten participants on two classic DNN models, i.e., LeNet and VGG-12. The results show that NeuralVis can assist engineers in identifying the critical features that determine prediction results. The tool is available at http://118.178.18.181:18080. Video: https://youtu.be/hRxCovrOZFI.
ACM SIGSOFT Distinguished Paper Award
Regular expressions (regexes) are a powerful mechanism for solving string-matching problems. They are supported by all modern programming languages and have been estimated to appear in more than a third of Python and JavaScript projects. Yet existing studies have focused mostly on one aspect of regex programming: readability. We know little about how developers perceive and program regexes, or about the difficulties they face.
In this paper, we provide the first study of the regex development cycle, with a focus on (1) how developers make decisions throughout the process, (2) what difficulties they face, and (3) how aware they are about serious risks involved in programming regexes. We took a mixed-methods approach, surveying 279 professional developers from a diversity of backgrounds (including top tech firms) for a high-level perspective, and interviewing 17 developers to learn the details about the difficulties that they face and the solutions that they prefer.
In brief, regexes are hard. Not only are they hard to read; our participants said that they are hard to search for, hard to validate, and hard to document. They are also hard to master: the majority of the studied developers were unaware of critical security risks that can arise when using regexes, and those who knew of the risks did not deal with them effectively. Our findings provide multiple implications for future work, including semantic regex search engines for regex reuse and improved input generators for regex validation.
The regular expression (regex) practices of software engineers affect the maintainability, correctness, and security of their software applications. Empirical research has described characteristics like the distribution of regex feature usage, the structural complexity of regexes, and worst-case regex match behaviors. But researchers have not critically examined the methodology they follow to extract regexes, and findings to date are typically generalized from regexes written in only 1–2 programming languages. This is an incomplete foundation.
Generalizing existing research depends on validating two hypotheses: (1) various regex extraction methodologies yield similar results, and (2) regex characteristics are similar across programming languages. To test these hypotheses, we defined eight regex metrics capturing the dimensions of regex representation, string language diversity, and worst-case match complexity. We report that the two competing regex extraction methodologies yield comparable corpora, suggesting that simpler regex extraction techniques still yield sound corpora. But in comparing regexes across programming languages, we found significant differences in some characteristics by programming language. Our findings have bearing on future empirical methodology, as the programming language should be considered and generalizability cannot be assured. Our measurements on a corpus of 537,806 regexes can guide data-driven designs of a new generation of regex tools and regex engines.
As an important extension of symbolic execution (SE), probabilistic symbolic execution (PSE) computes execution probabilities of program paths. Using this information, PSE can prioritize path exploration strategies. To calculate the probability of a path, PSE relies on solution counting approaches for the path constraint. The correctness of a solution counting approach depends on the methodology used to count solutions and on whether a path constraint maintains a one-to-one relation with program input values. This work focuses on the latter aspect of solution counting correctness for string constraints.
In general, maintaining a one-to-one relation is not always possible, especially in the presence of non-linear constraints. To deal with this issue, researchers working on PSE for numerical domains either analyze programs with linear constraints or develop novel techniques to handle solution counting of non-linear constraints. For the string domain, however, previous work on PSE mainly focuses on efficient and accurate solution counting for automata-based string models and has not investigated whether a one-to-one relationship between the strings encoded by automata and input string values is preserved. In this work we demonstrate that traditional automata-based string models fail to maintain one-to-one relations, and we propose the weighted automata model, which preserves the one-to-one relation between the path constraint it encodes and the input string values. We use this model to implement a string constraint solver and show its correctness on a set of non-trivial synthetic benchmarks. We also present an empirical evaluation of traditional and proposed automata solvers on real-world string constraints. The evaluations show that, while less efficient than traditional automata models, the weighted automata model maintains correct solution counts.
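A toy Python illustration of why multiplicities matter for counting (the hand-built automata are our own, not the paper's solver): a weighted automaton attaches a multiplicity to each transition, so collapsing several input values onto one encoded string no longer skews the solution count:

```python
# Sketch: dynamic-programming count of accepted paths in a weighted automaton.
def count_paths(transitions, start, accepting, length):
    """transitions: {state: [(next_state, weight), ...]}. Returns the
    weighted number of accepted paths of exactly `length` steps."""
    frontier = {start: 1}
    for _ in range(length):
        nxt = {}
        for state, ways in frontier.items():
            for succ, w in transitions.get(state, []):
                nxt[succ] = nxt.get(succ, 0) + ways * w
        frontier = nxt
    return sum(ways for s, ways in frontier.items() if s in accepting)

# Two distinct 1-char inputs collapse onto one automaton edge: an ordinary
# automaton (all weights 1) would count 1, the weighted version counts 2.
weighted = {"q0": [("q1", 2)]}
print(count_paths(weighted, "q0", {"q1"}, 1))   # -> 2
```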
Quantitative program analysis is an emerging area with applications to software reliability, quantitative information flow, side-channel detection and attack synthesis. Most quantitative program analysis techniques rely on model counting constraint solvers, which are typically the bottleneck for scalability. Although the effectiveness of formula caching in expediting expensive model-counting queries has been demonstrated in prior work, our key insight is that many subformulas are shared across non-identical constraints generated during program analyses. This has not been utilized by prior formula caching approaches. In this paper we present a subformula caching framework and integrate it into a model counting constraint solver. We experimentally evaluate its effectiveness under three quantitative program analysis scenarios: 1) model counting constraints generated by symbolic execution, 2) reliability analysis using probabilistic symbolic execution, 3) adaptive attack synthesis for side-channels. Our experimental results demonstrate that our subformula caching approach significantly improves the performance of quantitative program analysis.
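The core caching idea can be sketched in a few lines; the canonical string keys, the independence assumption that lets counts multiply, and the toy counter below are all our simplifications:

```python
# Sketch: serve repeated subformulas from a cache across non-identical
# constraints; count_fn stands in for an expensive model-counting call.
count_cache = {}

def model_count(subformulas, count_fn):
    """subformulas: canonicalized strings over disjoint variable sets, so the
    count of their conjunction is the product of per-subformula counts."""
    total = 1
    for f in subformulas:
        if f not in count_cache:
            count_cache[f] = count_fn(f)       # expensive solver call
        total *= count_cache[f]
    return total

calls = 0
def toy_counter(f):
    global calls
    calls += 1
    return 5                                   # pretend: 5 models per subformula

model_count(["x > 0", "y < 3"], toy_counter)
model_count(["x > 0", "z != 7"], toy_counter)  # "x > 0" is served from the cache
print(calls)                                   # -> 3 solver calls instead of 4
```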
Smart pointers are widely used to prevent memory errors in modern C++ code. However, improper usage of smart pointers may also lead to common memory errors, making the code less safe than expected. To catch smart pointer errors as early as possible, we present a coding style checker that detects possible bad smart pointer usages at compile time and notifies programmers about bug-prone behaviors. The evaluation indicates that currently available state-of-the-art static code checkers can detect only 25 out of 116 manually inserted errors, while our tool detects all of them. We also found 521 bugs among 8 open source projects, with only 4 false positives.
Floating-point arithmetic is widely used in applications from several fields, including scientific computing, machine learning, graphics, and finance. Many of these applications are rapidly adopting GPUs to speed up computations. GPUs, however, have limited support for detecting floating-point exceptions, making it difficult for software developers to detect unexpected exceptional cases in such applications. We present FPChecker, the first tool to automatically detect floating-point exceptions in GPU applications. FPChecker uses the clang/LLVM compiler to instrument GPU kernels and detect floating-point exceptions in GPUs at runtime. Once an exception is detected, it reports to the programmer the code location of the exception as well as other useful information. The programmer can then use this information to avoid the exception, e.g., by modifying the application algorithm or changing its input. We present the design of FPChecker, an evaluation of the tool's overhead, and a real-world case scenario in which the tool is used to identify a hidden exception. The slowdown of FPChecker is moderate (1.5× on average) and the code is publicly available as open source.
Mutation analysis, which introduces artificial defects into software systems, is the basis of mutation testing, a technique widely applied to evaluate and enhance the quality of test suites. However, despite the deep analogy between tests and formal proofs, mutation analysis has seldom been considered in the context of deductive verification. We propose mutation proving, a technique for analyzing verification projects that use proof assistants. We implemented our technique for the Coq proof assistant in a tool dubbed mCoq. mCoq applies a set of mutation operators to Coq definitions of functions and datatypes, inspired by operators previously proposed for functional programming languages. mCoq then checks proofs of lemmas affected by operator application. To make our technique feasible in practice, we implemented several optimizations in mCoq such as parallel proof checking. We applied mCoq to several medium and large scale Coq projects, and recorded whether proofs passed or failed when applying different mutation operators. We then qualitatively analyzed the mutants, and found several examples of weak and incomplete specifications. For our evaluation, we made many improvements to serialization of Coq code and even discovered a notable bug in Coq itself, all acknowledged by developers. We believe mCoq can be useful both to proof engineers for improving the quality of their verification projects and to researchers for evaluating proof engineering techniques.
Cryptographic primitives are ubiquitous in modern security. The correctness of their implementations is crucial for resisting malicious attacks. Typical arithmetic computations in these C programs contain large numbers of non-linear operations and hence challenge existing automatic C verification tools. We present an automated approach to verify cryptographic C programs. Our approach successfully verifies C implementations of various arithmetic operations used in NIST P-224, P-256, P-521, and Curve25519 in OpenSSL. During verification, we exposed a bug and a few anomalies that had existed for a long time; they have been reported to and confirmed by the OpenSSL community. Our results establish the functional correctness of these C implementations for the first time.
Verification of multitasking embedded software requires taking its underlying operating system into account, with respect to its scheduling policy and handling of task priorities, in order to achieve a higher degree of accuracy. However, such comprehensive verification of multitasking embedded software together with its underlying operating system is very costly and impractical. To reduce the verification cost while achieving the desired accuracy, we propose a variant of CEGAR, named OiL-CEGAR (OS-in-the-Loop Counterexample-Guided Abstraction Refinement), in which a composition of a formal OS model and an abstracted application program is used for comprehensive verification and is successively refined using the counterexamples generated from the composition model. The refinement process utilizes the scheduling information in the counterexample, which acts as a mini-OS to check the executability of the counterexample trace on the concrete program. Our experiments using a prototype implementation of OiL-CEGAR show that it greatly improves the accuracy and efficiency of property checking in this domain. It automatically removed all false alarms and completed property checking within an average of 476 seconds over a set of multitasking programs, whereas model checking using existing approaches on the same set of programs either showed an accuracy of under 11.1% or was unable to finish due to timeout.
Formal methods for software development have made great strides in the last two decades, to the point that their application to safety-critical embedded software is an undeniable success. Their extension to non-critical software is one of the notable forthcoming challenges. For example, C programmers regularly use inline assembly for low-level optimizations and system primitives, which usually renders state-of-the-art formal analyzers developed for C ineffective. We thus propose TInA, the first automated, generic, verification-friendly and trustworthy lifting technique that turns inline assembly into semantically equivalent C code, in order to take advantage of existing C analyzers. Extensive experiments on real-world C code with inline assembly (including GMP and ffmpeg) show the feasibility and benefits of TInA.
Verification of programs continues to be a challenge, and no single technique succeeds on all programs. In this paper we present VeriAbs, a reachability verifier for C programs that incorporates a portfolio of techniques implemented as four strategies, where a strategy is a set of techniques applied in a specific sequence. It selects a strategy based on the kinds of loops in the program. We analysed the effectiveness of the implemented strategies on the 3831 verification tasks from the ReachSafety category of the 8th International Competition on Software Verification (SV-COMP) 2019 and found that although simple classic techniques (explicit-state model checking and bounded model checking) succeed on a majority of the programs, a wide range of further techniques is required to analyse the rest. A screencast of the tool is available at https://youtu.be/YHxB2V4U0gQ.
Spreadsheets are widely used but subject to various defects. In this paper, we present SGUARD to effectively detect spreadsheet defects. SGUARD learns spreadsheet features to cluster cells with similar computational semantics, and then refines these clusters to recognize anomalous cells as defects. SGUARD balances the trade-off between precision (87.8%) and recall (71.9%) in defect detection, achieving an F-measure of 0.79 and exceeding existing spreadsheet defect detection techniques. We introduce the SGUARD implementation and its usage in a video presentation (https://youtu.be/gNPmMvQVf5Q), and provide its download repository for public access (https://github.com/sheetguard/sguard).
Open source software licenses regulate the circumstances under which software can be redistributed, reused, and modified. Ensuring license compatibility and preventing license restriction conflicts in source code during software changes is key to protecting commercial use. However, selecting appropriate licenses for software changes requires much experience and manual effort to examine, assimilate, and compare various licenses as well as understand their relationships with software changes. Worse still, no state-of-the-art methodology provides this capability. Motivated by this observation, we propose Automatic License Prediction (ALP), a novel learning-based method and tool for predicting licenses as software changes. An extensive evaluation of ALP on predicting licenses in 700 open source projects demonstrates its effectiveness: ALP achieves not only high overall prediction accuracy (92.5% in micro F1-score) but also high accuracies across all license types.
Software defect prediction (SDP) uses learning models to detect defective modules in a project, and model performance depends on the quality of the training data. Previous research mainly focuses on the quality problems of class imbalance and feature redundancy. However, training data often contain instances that belong to different classes yet have similar feature values, which leads to class overlap and degrades the quality of the training data. Our goal is to investigate the impact of class overlap on software defect prediction. We also propose an improved K-Means clustering cleaning approach (IKMCCA) to solve both the class overlap and class imbalance problems. Specifically, we check whether the K-Means clustering cleaning approach (KMCCA), neighborhood cleaning learning (NCL), or IKMCCA improves defect detection performance in two cases: (i) within-project defect prediction (WPDP) and (ii) cross-project defect prediction (CPDP). To obtain an objective estimate of class overlap, we carry out our investigations on 28 open source projects and compare the performance of state-of-the-art learning models for the above cases using IKMCCA, KMCCA, or NCL versus no data cleaning. The experimental results make clear that learning models obtain significantly better performance in terms of balance, recall, and AUC for both WPDP and CPDP when the overlapping instances are removed. Moreover, it is better to consider both class overlap and class imbalance.
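A sketch of a K-Means-based cleaning step in the spirit of IKMCCA (the cluster count and the remove-the-minority rule are our assumptions; the paper's exact criterion may differ):

```python
# Sketch: cluster instances in feature space; inside clusters that mix both
# classes, treat minority-label instances as overlapping and remove them.
import numpy as np
from sklearn.cluster import KMeans

def clean_overlap(X, y, n_clusters=5):
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    keep = np.ones(len(y), dtype=bool)
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        if len(idx) == 0:
            continue
        majority = int(round(np.mean(y[idx])))   # dominant class (binary labels)
        if np.mean(y[idx] == majority) < 1.0:    # mixed cluster -> overlap
            keep[idx[y[idx] != majority]] = False
    return X[keep], y[keep]

X = np.random.rand(200, 4)
y = np.random.randint(0, 2, 200)
Xc, yc = clean_overlap(X, y)
print(len(y) - len(yc), "overlapping instances removed")
```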
Automatic code completion helps improve developers' productivity in their programming tasks. A program contains instructions expressed via code statements, which are considered the basic units of program execution. In this paper, we introduce AutoSC, which combines program analysis and the principle of software naturalness to fill in partially completed statements. AutoSC benefits from the strengths of both directions, so that the completed code statement is both frequent and valid. AutoSC is first trained on a large code corpus to learn the templates of candidate statements. Then, it uses program analysis to validate and concretize the templates into syntactically and type-valid candidate statements. Finally, these candidates are ranked using a language model trained on the lexical form of the source code in the code corpus. Our empirical evaluation shows that AutoSC achieves 38.9–41.3% top-1 and 48.2–50.1% top-5 accuracy in statement completion and outperforms the state-of-the-art approach by 9x to 69x in top-1 accuracy.
Context: Classification techniques of supervised machine learning have been successfully applied to various domains of practice. When building a predictive model, there are two important criteria: predictive accuracy and interpretability, which generally have a trade-off relationship. In particular, interpretability should be accorded greater emphasis in the domains where the incorporation of expert knowledge into a predictive model is required. Objective: The aim of this research is to propose a new classification model, called superposed naive Bayes (SNB), which transforms a naive Bayes ensemble into a simple naive Bayes model by linear approximation. Method: In order to evaluate the predictive accuracy and interpretability of the proposed method, we conducted a comparative study using well-known classification techniques such as rule-based learners, decision trees, regression models, support vector machines, neural networks, Bayesian learners, and ensemble learners, over 13 real-world public datasets. Results: A trade-off analysis between the accuracy and interpretability of different classification techniques was performed with a scatter plot comparing relative ranks of accuracy with those of interpretability. The experiment results show that the proposed method (SNB) can produce a balanced output that satisfies both accuracy and interpretability criteria. Conclusions: SNB offers a comprehensible predictive model based on a simple and transparent model structure, which can provide an effective way for balancing the trade-off between accuracy and interpretability.
Link to Publication: https://link.springer.com/article/10.1007/s10664-018-9638-1
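The superposition idea can be sketched with scikit-learn; averaging the members' Gaussian parameters below is a simplified stand-in for the paper's linear approximation of a naive Bayes ensemble:

```python
# Sketch: because naive Bayes scores classes with sums of log-probabilities,
# averaging the members' per-class parameters yields one naive-Bayes-shaped
# model that approximates the ensemble while staying interpretable.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.random.rand(300, 5)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# Train an ensemble of naive Bayes members on bootstrap samples.
members = []
for seed in range(10):
    rng = np.random.RandomState(seed)
    idx = rng.randint(0, len(X), len(X))
    members.append(GaussianNB().fit(X[idx], y[idx]))

# Superpose: average the per-class means/variances into a single model.
snb = GaussianNB().fit(X, y)                  # template for parameter shapes
snb.theta_ = np.mean([m.theta_ for m in members], axis=0)
snb.var_ = np.mean([m.var_ for m in members], axis=0)
print("superposed model accuracy:", snb.score(X, y))
```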
Defect prediction models focus on identifying defect-prone code elements, for example, to allow practitioners to allocate testing resources to specific subsystems and to provide assistance during code reviews. While the research community has been highly active in proposing metrics and methods to predict defects over long-term periods (i.e., at release time), a recent trend is the so-called short-term defect prediction (i.e., at commit level). Indeed, this strategy represents an effective alternative in terms of the effort required to inspect files likely affected by defects. Nevertheless, the granularity considered by such models might still be too coarse: existing commit-level models flag an entire commit as defective even when only specific files in it actually contain defects.
In this paper, we first investigate to what extent commits are partially defective; then, we propose a novel fine-grained just-in-time defect prediction model to predict the specific files, contained in a commit, that are defective. Finally, we evaluate our model in terms of (i) performance and (ii) the extent to which it decreases the effort required to diagnose a defect. Our study highlights that: (1) defective commits are frequently composed of a mixture of defective and non-defective files; (2) our fine-grained model can accurately predict defective files with an AUC-ROC of up to 82%; and (3) our model would allow practitioners to save inspection effort with respect to standard just-in-time techniques.
Link to Publication: https://www.sciencedirect.com/science/article/pii/S0164121218302656?via%3Dihub
Concurrent programs must be thoroughly tested, as concurrency bugs are notoriously hard to detect. Code coverage criteria can be used to quantify the richness of a test suite (e.g., whether a program has been tested sufficiently) or to provide practical guidelines on test case generation (e.g., as objective functions in program fuzzing engines). Traditional code coverage criteria are, however, designed for sequential programs and thus ineffective for concurrent programs. In this work, we introduce a novel code coverage criterion for testing thread-safe classes called MAP (short for memory-access patterns) coverage. The motivation is that concurrency bugs are often correlated with certain memory-access patterns, so it is desirable to comprehensively cover all memory-access patterns. Furthermore, we propose a testing method for maximizing MAP coverage. Our method has been implemented as a self-contained toolkit, and experimental results on 20 benchmark programs show that our toolkit outperforms existing testing methods. Lastly, we show empirically that there is a positive correlation between MAP coverage and the effectiveness of a set of test executions.
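For intuition, here is a small sketch of a MAP-coverage computation; the trace format and the fixed pattern length are our assumptions based on the abstract:

```python
# Sketch: a pattern is a short sequence of accesses to one memory location
# by more than one thread; coverage counts the distinct patterns exhibited.
from collections import defaultdict

def map_patterns(trace, k=2):
    """trace: list of (thread_id, location, 'R'|'W') events in execution
    order; yields per-location windows of k accesses involving >1 thread."""
    per_loc = defaultdict(list)
    for tid, loc, op in trace:
        per_loc[loc].append((tid, op))
    for loc, events in per_loc.items():
        for i in range(len(events) - k + 1):
            window = tuple(events[i:i + k])
            if len({tid for tid, _ in window}) > 1:   # inter-thread only
                yield (loc, window)

trace = [(1, "x", "R"), (2, "x", "W"), (1, "x", "W"), (1, "y", "R")]
covered = set(map_patterns(trace))
print(f"{len(covered)} memory-access patterns covered")
```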
This paper investigates using compiler technology to automatically convert sequential C++ data abstractions, e.g., queues, stacks, maps, and trees, to concurrent lock-free implementations. By automatically tailoring a number of state-of-the-practice synchronization methods to the underlying sequential implementations of different data structures, our automatically synchronized code can attain performance competitive with that of concurrent data structures manually written by experts, and much better performance than heavier-weight support by software transactional memory (STM).
While CUDA has been the dominant parallel computing platform and programming model for general-purpose GPU computing, CUDA synchronization poses significant challenges for GPU programmers due to its intricate parallel computing mechanism and coding practices. In this paper, we propose AuCS, the first general framework to automate synchronization for CUDA kernel functions. AuCS transforms the LLVM-level CUDA program control flow graph in a semantics-preserving manner to explore possible barrier function locations. Accordingly, AuCS develops mechanisms to correctly place barrier functions for automating synchronization in multiple erroneous (and challenging-to-detect) synchronization scenarios, including data races, barrier divergence, and redundant barrier functions. To evaluate the effectiveness and efficiency of AuCS, we conducted an extensive set of experiments; the results suggest that AuCS can automate 20 out of 24 erroneous synchronization scenarios.
Existing replay techniques record inter-thread access-event orders on shared memory locations to facilitate replay. Some techniques propose strategies to reduce the recorded events, often at the expense of deterministic trace reproduction. We propose a technique based on dividing a thread's execution trace into sequential code blocks called transactions. Our insight is that few to no atomicity violations are usually reported during a program execution. Based on this insight, we present TPLAY, a novel deterministic replay technique that records thread access interleavings on shared memory locations at the transaction level. TPLAY also generates an artificial pair of interleavings when an atomicity violation is reported on a transaction. We evaluate TPLAY using the Splash2x extension of the PARSEC benchmark suite. The experimental results indicate that, on average, TPLAY achieves a 13-fold improvement in record log sizes and a 98% replay probability on subject programs in comparison to existing work.
Swarm-based verification methods split a verification problem into a large number of independent simpler tasks and so exploit the availability of large numbers of cores to speed up verification. Lazy-CSeq is a BMC-based bug-finding tool for C programs using POSIX threads that is based on sequentialization. Here we present the tool VeriSmart 2.0, which extends Lazy-CSeq with a swarm-based bug-finding method. The key idea of this approach is to constrain the interleavings such that context switches can only happen within selected tiles (more specifically, contiguous code segments within the individual threads). This under-approximates the program’s behaviors, with the number and size of tiles as additional parameters, which allows us to vary the complexity of the tasks. Overall, this significantly improves peak memory consumption and (wall-clock) analysis time.
Sources: http://www.cs.sun.ac.za/~bfischer/verismart.tgz
Concurrency vulnerabilities are extremely harmful and can frequently be exploited to launch severe attacks. Due to the non-determinism of multithreaded executions, they are very difficult to detect. Recently, data race detectors and techniques based on the maximal causal model have been applied to detect concurrency vulnerabilities. However, the former are ineffective and the latter report many false negatives. In this paper, we present CONVUL, an effective tool for concurrency vulnerability detection. CONVUL is based on exchangeable events, and adopts novel algorithms to detect three major kinds of concurrency vulnerabilities. To illustrate the competitiveness of CONVUL, we performed a comparison with three widely-used data race detectors and one recent tool based on the maximal causal model. In our experiments, CONVUL detected 9 of 10 known vulnerabilities and found 6 zero-day vulnerabilities in MySQL, while the other tools detected at most 3 of these 16 vulnerabilities. Our tool and data are available at the CONVUL web page: https://sites.google.com/site/convultool. A demonstration video is available at https://youtu.be/-26C6ULxtbk
Compression is an emerging source of exploitable side-channel leakage that threatens data security, particularly in web applications where compression is indispensable for performance reasons. Current approaches to mitigating compression side channels have drawbacks: they either degrade the compression ratio drastically or require too much developer effort to be widely adopted. To bridge the gap, we develop Debreach, an approach to mitigating compression side channels based on static analysis and program transformation. Debreach consists of two steps. First, it uses taint analysis to soundly identify flows of sensitive data in the program and uses code instrumentation to annotate the data before feeding it to the compressor. Second, it enhances the compressor to exploit the freedom not to compress granted by standard compression protocols, thus removing the dependency between sensitive data and the size of the compressor's output. Since Debreach automatically instruments applications and does not change the compression protocols, it has the advantage of being non-disruptive and compatible with existing systems. We have evaluated Debreach on a set of web server applications written in PHP. Our experiments show that, while ensuring leakage-freedom, Debreach achieves significantly higher compression performance than state-of-the-art approaches.
Dynamic Symbolic Execution (DSE) has seen a rise in popularity, as it allows applications to be checked automatically for specific behaviours such as error patterns. One of its biggest challenges is the state space explosion problem: DSE tries to evaluate all possible execution paths of an application, and for every path it needs to represent the allocated memory and its accesses. Even though different approaches have been proposed to mitigate the state space explosion problem itself, DSE still needs to represent a multitude of states in parallel to analyse them. If too many states are present, they cannot fit into memory, and DSE needs to terminate them prematurely. With a more efficient representation of allocated memory, DSE can handle more states in parallel, improving its performance. In this work, we introduce an enhanced, fine-grained and performant memory representation of states. Our implementation on top of the symbolic execution engine KLEE shows a significant reduction of the memory consumption of states, by up to \reducedmem{}, allowing more states to be represented in memory more efficiently; in addition, execution time is reduced by \reducedexectime{}, a speedup of \speedup{}.
Reverse execution and coredump analysis have long been used to diagnose the root cause of software crashes. Each of these techniques, however, faces inherent challenges, such as insufficient capability when handling memory aliases. Recent works have used hypothesis testing to address this drawback, albeit with high computational complexity that makes them impractical for real-world applications. To address this issue, we propose a new deep neural architecture that significantly improves memory alias resolution. At a high level, our approach employs a recurrent neural network (RNN) to learn the binary code patterns pertaining to memory accesses, and then infers the memory region accessed by each memory reference. Since memory references to different regions naturally indicate a non-alias relationship, our neural architecture can greatly reduce the burden of hypothesis testing when tracking down non-alias relations in binary code.
Unlike previous research that has utilized deep learning for other binary analysis tasks, the neural network proposed in this work is fundamentally novel. Instead of simply using off-the-shelf neural networks, we designed a new recurrent neural architecture that captures the data dependency between machine code segments.
To demonstrate the utility of our deep neural architecture, we implement it as RENN, a neural-network-assisted reverse execution system. We use this tool to analyze software crashes corresponding to 40 memory corruption vulnerabilities from the real world. Our experiments show that RENN can significantly improve the efficiency of locating the root cause of these crashes. Compared to a state-of-the-art technique, RENN is 36.25% faster in execution time on average, detects an average of 21.35% more non-alias pairs, and successfully identifies the root cause in 12.5% more cases.
Many program-analysis based tools require precise points-to/alias information only for some program variables. To meet this requirement efficiently, there have been many works on demand-driven analyses that perform only the work necessary to compute the points-to or alias information on the requested variables (queries). However, these demand-driven analyses can be very expensive when applied on large systems where the number of queries can be significant. Such a blow-up in analysis time is unacceptable in cases where scalability with real-time constraints is crucial; for example, when program analysis tools are plugged into an IDE (Integrated Development Environment). In this paper, we propose schemes to improve the scalability of demand-driven analyses without compromising on precision.
Our work is based on novel ideas for eliminating irrelevant and redundant data-flow paths for the given queries. We introduce the idea of batch analysis, which can answer multiple given queries in batch mode. Batch analysis suits environments with strict time constraints, where queries arrive in batches. We present a batch alias analysis framework that can be used to speed up a given demand-driven alias analysis. To show the effectiveness of this framework, we use two demand-driven alias analyses: (1) the existing best-performing demand-driven alias analysis tool for race-detection clients, and (2) an optimized version thereof that avoids irrelevant computation. Our evaluations on a simulated data-race client and on a recent program-understanding tool show that batch analysis leads to significant performance gains, along with minor gains in precision.
An effective way to maximize code coverage in software tests is through dynamic symbolic execution, a technique that uses constraint solving to systematically explore a program’s state space. We introduce an open-source dynamic symbolic execution framework called Manticore for analyzing binaries and Ethereum smart contracts. Manticore’s flexible architecture allows it to support both traditional and exotic execution environments, and its API allows users to customize their analysis. Here, we discuss Manticore’s architecture and demonstrate the capabilities we have used to find bugs and verify the correctness of code for our commercial clients.
An enterprise system operates a business by providing various services guided by certain business rules (BRs) and constraints. These BRs are usually written in plain natural language in operating procedures, terms and conditions, and other documents, or are embedded in the source code of legacy enterprise systems. When implementing BRs in a software system, expressing them as UML use-case specifications, or preparing for Merger & Acquisition (M&A) activity, analysts manually interpret the documents or try to identify constraints from the source code, leading to potential discrepancies and ambiguities. These issues in the software system can be resolved only after testing, which is a tedious and expensive activity. To minimize such errors and effort, we propose the BuRRiTo framework, which automatically extracts BRs by mining documents and source code, cleans them of anomalies such as inconsistencies, redundancies, and conflicts, analyzes the functional gaps present, and performs semantic querying and searching.
Tactics are the actions performed by self-adaptive systems that enable them to adapt to changes in their environments. For a self-adaptive cloud-based system, one tactic may be activating additional computing resources when response-time thresholds are surpassed. In real-world environments, tactics frequently experience volatility. Unfortunately, current self-adaptive approaches do not account for tactic volatility in their decision-making processes and merely assume that tactics have static attributes. This limitation creates uncertainty in the decision-making process and may adversely impact the system's ability to perform the most optimal action. Additionally, many self-adaptive processes do not properly anticipate or account for future occurrences and volatility with respect to the Service Level Agreement (SLA). This can limit the system's ability to act proactively, especially when utilizing tactics that involve latency.
To address the limitation of insufficiently accounting for tactic volatility, we propose a Tactic Volatility Aware (TVA) solution. Using Multiple Regression Analysis (MRA), TVA enables self-adaptive systems to accurately estimate the time required to execute tactics and their associated costs. TVA also utilizes an Autoregressive Integrated Moving Average (ARIMA) model for time series forecasting, allowing the system to proactively maintain requirements.
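The two ingredients can be sketched with standard libraries; the feature set, ARIMA order, and SLA threshold below are invented for illustration:

```python
# Sketch: regression estimates a tactic's execution time from system
# features; ARIMA forecasts a monitored metric far enough ahead that a
# latent tactic can be fired before the SLA threshold is crossed.
import numpy as np
from sklearn.linear_model import LinearRegression
from statsmodels.tsa.arima.model import ARIMA

# MRA: tactic latency as a function of, e.g., load and queue depth.
features = np.array([[0.3, 10], [0.6, 25], [0.8, 40], [0.5, 18]])
latency_s = np.array([12.0, 21.0, 33.0, 17.5])
mra = LinearRegression().fit(features, latency_s)
predicted_latency = mra.predict([[0.7, 30]])[0]

# ARIMA: forecast response time over the next few periods.
response_ms = [210, 220, 230, 260, 280, 300, 330, 350, 380, 410]
forecast = ARIMA(response_ms, order=(1, 1, 1)).fit().forecast(steps=5)
if max(forecast) > 400:          # SLA threshold (assumed)
    print(f"trigger tactic now; it needs ~{predicted_latency:.0f}s to complete")
```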
The threat of attack faced by cyber-physical systems (CPSs), especially when they play a critical role in automating public infrastructure, has motivated research into a wide variety of attack defence mechanisms. Assessing their effectiveness is challenging, however, as realistic sets of attacks to test them against are not always available. In this paper, we propose smart fuzzing, an automated, machine learning guided technique for systematically finding ‘test suites’ of CPS network attacks, without requiring any expertise in the system’s control programs or physical processes. Our approach uses predictive machine learning models and metaheuristic search algorithms to guide the fuzzing of actuators so as to drive the CPS into different unsafe physical states. We demonstrate the efficacy of smart fuzzing by implementing it for two real-world CPS testbeds—a water purification plant and a water distribution system—finding attacks that drive them into 27 different unsafe states involving water flow, pressure, and tank levels, including six that were not covered by an established attack benchmark. Finally, we use our approach to test the effectiveness of an invariant-based defence system for the water treatment plant, finding two attacks that were not detected by its physical invariant checks, highlighting a potential weakness that could be exploited in certain conditions.
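An illustrative skeleton of the smart-fuzzing loop (the predictive model and safety predicate are stubs; the actual approach trains models on testbed data and uses stronger metaheuristics):

```python
# Sketch: a search proposes actuator configurations, a learned model predicts
# the resulting physical state, and candidates predicted to approach an
# unsafe region are kept and mutated further.
import random

def predict_state(actuators):
    """Stub for the learned model: maps actuator settings to a predicted
    tank level. A real model would be trained on logged plant data."""
    return 40 + 60 * actuators["pump"] - 30 * actuators["valve"]

def unsafe_distance(state, threshold=95.0):
    return max(0.0, threshold - state)         # 0 means predicted-unsafe

def smart_fuzz(iters=200):
    best = {"pump": random.random(), "valve": random.random()}
    for _ in range(iters):
        cand = {k: min(1.0, max(0.0, v + random.gauss(0, 0.1)))
                for k, v in best.items()}
        if unsafe_distance(predict_state(cand)) <= unsafe_distance(predict_state(best)):
            best = cand                        # greedy metaheuristic step
        if unsafe_distance(predict_state(best)) == 0:
            return best                        # candidate attack found
    return None

print(smart_fuzz())
```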
Modern software systems, such as Cyber-Physical Systems (CPSs), often require interacting with the environment or humans. As environments and humans are unpredictable or non-deterministic, uncertainty is inevitable in such systems. In this paper, we propose UncerTest to test such systems (i.e., CPSs in our context) with an explicit consideration of uncertainty, through test design, test generation, test optimization, and test reporting. UncerTest is a model-based and search-based approach that relies on belief models, i.e., test-ready models with explicit subjective uncertainties. In UncerTest, we extended an uncertainty-wise test modeling framework (UncerTum) to design and enable the introduction of indeterminacy sources in the environment during test execution. To obtain tests, we first propose uncertainty-wise test generation with two strategies, designed for different model coverage criteria, to enable automated generation of tests with uncertainties. In addition, we developed an uncertainty-wise test minimization to optimize the generated tests based on uncertainty-related properties using multi-objective search. We conducted an extensive two-phase empirical study to evaluate UncerTest with five use cases from two industrial CPS case studies. In the first phase, eight commonly used multi-objective search algorithms were compared to identify the best algorithm for each of the four minimization strategies. Next, the test cases obtained with the UncerTest strategies (i.e., the two generation strategies combined with the four minimization strategies under the best algorithm) were executed on the two real CPSs to study the performance of each UncerTest strategy. With the best strategy, we observed 51% more uncertainties due to unknown indeterminate behaviors of the physical environments of the CPSs as compared to the other test strategies. In addition, with the same strategy, we observed 118% more unknown uncertainties that are not specified in the test-ready models.
Link to Publication: https://www.sciencedirect.com/science/article/pii/S0164121219300561
This paper explores the structure of research papers in software engineering. Using text mining, we study 35,391 software engineering (SE) papers from 34 leading SE venues over the last 25 years. These venues are divided nearly evenly between conferences and journals. An important aspect of this analysis is that it is fully automated and repeatable. To achieve that automation, we used topic modeling (with LDA) to mine 10 topics that represent much of the structure of contemporary SE. The 10 topics presented here should not be “set in stone” as the only topics worthy of study in SE. Rather, our goal is to report that (a) text mining methods can detect large-scale trends within our community; (b) those topics change with time; so (c) it is important to have automatic agents that can update our understanding of our community whenever new data arrives.
Link to Publication: https://ieeexplore.ieee.org/abstract/document/8465996
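The pipeline is straightforward to reproduce in miniature with scikit-learn; the three toy documents and three topics below stand in for the study's 35,391 abstracts and 10 topics:

```python
# Sketch: vectorize paper text, fit LDA, print the top words per topic.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["test suite coverage mutation fault",
        "defect prediction model metrics learning",
        "symbolic execution constraint solver path"]
vec = CountVectorizer()
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(X)
words = vec.get_feature_names_out()
for t, dist in enumerate(lda.components_):
    top = [words[i] for i in dist.argsort()[-3:][::-1]]
    print(f"topic {t}: {top}")
```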
Software engineering has seen much progress in the recent past, including the introduction of new methodologies, new paradigms for software teams, and a shift from smaller monolithic applications to complex, intricate, and distributed software applications. However, the way we represent, discuss, and collaborate on software applications throughout the software development life cycle is still primarily through source code, textual representations, or charts on 2D computer screens, the confines of which have long limited how we visualize and comprehend software systems. In this paper, we present XRaSE, a novel prototype that leverages augmented reality to visualize a software application as a virtually tangible entity. This immersive approach aims to make activities like application comprehension, architecture analysis, knowledge communication, and analysis of a software's dynamic aspects more intuitive, richer, and collaborative. A video demonstration is available at https://youtu.be/_BYYnayA1FE.
A smart contract cannot be modified once it has been deployed on a blockchain. Therefore, it must be tested thoroughly before deployment. Mutation testing is considered a practical methodology for evaluating the adequacy of software testing. In this paper, we introduce MuSC, a mutation testing tool for Ethereum Smart Contracts (ESC). It can generate numerous mutants at high speed and supports automated operations such as creating test nets and deploying and executing tests. Notably, MuSC implements a set of novel mutation operators for the ESC programming language, Solidity, and can thus expose defects in smart contracts to a certain degree. The demonstration video of MuSC is available at https://youtu.be/3KBKXJPVjbQ, and the source code can be downloaded at https://github.com/belikout/MuSC-Tool-Demo-repo.
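To illustrate the flavor of such operators, here is a toy Python mutant generator over Solidity source text; the operator list is our guess at plausible Solidity-specific mutations, not MuSC's actual operator set:

```python
# Hypothetical Solidity mutation operators applied textually: each match of
# an operator's pattern yields one mutant with only that occurrence rewritten.
import re

OPERATORS = [
    (r"\bmsg\.sender\b", "tx.origin"),   # swap caller-identity semantics
    (r"\brequire\(", "assert("),         # change failure/refund semantics
    (r"\bpayable\b", ""),                # drop a payability modifier
]

def mutants(source):
    for pattern, repl in OPERATORS:
        for m in re.finditer(pattern, source):
            yield source[:m.start()] + repl + source[m.end():]

contract = """contract Wallet {
    function deposit() public payable { require(msg.value > 0); }
    function withdraw() public { require(msg.sender == owner); }
}"""
for i, mut in enumerate(mutants(contract)):
    print(f"--- mutant {i} ---\n{mut}")
```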