Pinpointing the location where a given unit of functionality—or feature—was implemented is a demanding and time-consuming task, yet prevalent in most software maintenance or evolution efforts. To that extent, we present PANGOLIN, an Eclipse plugin that helps developers identifying features among the source code. It borrows Spectrum-based Fault Localization techniques from the software diagnosis research field by framing feature localization as a diagnostic problem. PANGOLIN prompts users to label system executions based on feature involvement, and subsequently presents its spectrum-based feature localization analysis to users with the aid of a color-coded, hierarchic, and navigable visualization which was shown to be effective at conveying diagnostic information to users. Our evaluation shows that PANGOLIN accurately pinpoints feature implementations and is resilient to misclassifications by users.
Recurrent neural network (RNN) has achieved great success in processing sequential inputs for applications such as automatic speech recognition, natural language processing and machine translation. However, quality and reliability issues of RNNs make them vulnerable to adversarial attacks and hinder their deployment in real-world applications. In this paper, we propose a quantitative analysis framework — DeepStellar — to pave the way for effective security and quality analysis of software systems powered by RNNs. DeepStellar is generic to handle various RNN architectures, including LSTM and GRU, scalable to work on industrial-grade RNN models, and extensible to develop customized analyzers and tools. We demonstrated that, with DeepStellar, users are able to design efficient test generation tools, and develop effective adversarial sample detectors. We evaluated the developed applications on three real RNN models, including speech recognition and image classification.
DeepStellar outperforms existing approaches three hundred times in generating defect-triggering tests and achieves 97% accuracy in detecting adversarial attacks. A video demonstration which shows the main features of DeepStellar is available at https://sites.google.com/view/deepstellar/home/tool-demo.
Misuse of APIs happens frequently due to misunderstanding of API semantics and lack of documentation. An important category of API-related defects is the error handling defects, which may result in security and reliability flaws. These defects can be detected with the help of static program analysis, provided that error specifications are known. The error specification of an API function indicates how the function can fail. Writing error specifications manually is time-consuming and tedious. Therefore, automatic inferring the error specification from API usage code is preferred. In this paper, we present Ares, a tool for automatically inferring error specifications for C code through static analysis. we employ multiple heuristics to identify error handling blocks and infer error specifications by analyzing the corresponding condition logic. Ares is evaluated on 19 real world projects, and the results reveal that Ares outperforms the state-of-the-art tool APEx by 37% in precision. Ares can also identify more error specifications than APEx. Moreover, the specifications inferred from Ares help find dozens of API-related bugs in well-known projects such as OpenSSL, among them 10 bugs are confirmed by developers. Video: https://youtu.be/nf1QnFAmu8Q. Repository: https://github.com/lc3412/Ares.
Modern software often exists in many different, yet similar versions and/or variants, usually derived from a common code base (e.g., via clone-and-own). In the context of product- line engineering, family-based-analysis has shown very promising potential for improving efficiency in applying quality-assurance techniques to variant-rich software, as compared to a variant- by-variant approach. Unfortunately, these strategies rely on an product-line representation superimposing all program variants in a syntactically well-formed, semantically sound and variant- preserving manner, which is manually hard to obtain in practice. We demonstrate the SiMPOSE methodology for automatically generating superimpositions of N given program versions and/or variants facilitating family-based analysis of variant-rich soft- ware. SiMPOSE is based on a novel N-way model-merging tech- nique operating at the level of control-flow automata (CFA) rep- resentations of C programs, CFAs constitute a unified program abstraction utilized by many recent software-analysis tools. We illustrate different merging strategies supported by SiMPOSE, namely variant-by-variant, N-way merging, incremental 2-way merging, and partition-based N/2-way merging, and demonstrate how SiMPOSE can be used to systematically compare their impact on efficiency and effectiveness of family-based unit-test generation. The SiMPOSE tool, the demonstration of its usage as well as related artifacts and documentation can be found at http://pi.informatik.uni-siegen.de/projects/variance/simpose.
Verification of programs continues to be a challenge and no single technique succeeds on all programs. In this paper we present VeriAbs, a reachability C program verifier that incorporates a portfolio of techniques implemented as four strategies, where a strategy is a set of techniques applied in a specific sequence. It selects a strategy based on the kind of loops in the program. We analysed the effectiveness of implemented strategies on the 3831 verification tasks from the ReachSafety category of the 8th International Competition on Software Verification (SV-COMP) 2019 and found that although simple classic techniques - explicit state model checking and bounded model checking, succeed on a majority of the programs, a wide range of further techniques are required to analyse the rest. A screencast of the tool is available at https://youtu.be/YHxB2V4U0gQ.
Deep neural network (DNN) has been widely applied to safety-critical scenarios such as autonomous vehicle, security surveillance, and cyber-physical control systems. Yet, the incorrect behaviors of DNNs can lead to severe accidents and tremendous losses due to hidden defects. In this paper, we present DeepHunter, a general-purpose fuzzing framework for detecting defects of DNNs. DeepHunter is inspired by traditional grey-box fuzzing and aims to increase the overall test coverage by applying adaptive heuristics according to runtime feedback. Specifically, DeepHunter provides a series of seed selection strategies, metamorphic mutation strategies, and testing criteria customized to DNN testing; all these components support multiple built-in configurations which are easy to extend. We evaluated DeepHunter on two popular datasets and the results demonstrate the effectiveness of DeepHunter in achieving coverage increase and detecting real defects. A video demonstration which showcases the main features of DeepHunter can be found at https://youtu.be/s5DfLErcgrc.
Smart pointers are widely used to prevent memory errors in modern C++ code. However, improper usage of smart pointers may also lead to common memory errors, which makes the code not as safe as expected. To avoid smart pointer errors as early as possible, we present a coding style checker to detect possible bad smart pointer usages during compile time, and notify programmers about bug-prone behaviors. The evaluation indicates that the currently available state-of-the-art static code checkers can only detect 25 out of 116 manually inserted errors, while our tool can detect all these errors. And we also found 521 bugs among 8 open source projects with only 4 false positives.
The fragmentation issue spreads over multiple mobile platforms such as Android, iOS, mobile web and even WeChat, which hinders test scripts from running across platforms. To reduce the cost of adapting scripts for various platforms, some existing tools apply conventional computer vision techniques to replay the same script on multiple platforms automatically. However, because these solutions can hardly identify dynamic or similar widgets, it becomes difficult for engineers to apply them in practice. In this paper, we present an image-driven tool, namely LIRAT, to record and replay test scripts cross platforms, solving the problem of test script cross-platform replay for the first time. LIRAT records screenshots and layouts of the widgets, and leverages image understanding techniques to locate them in the replay process. Based on accurate widget localization, LIRAT supports replaying test scripts across devices and platforms. We employed LIRAT to replay 25 scripts from 5 application across 8 Android devices and 2 iOS devices. The results show that LIRAT can replay 88% scripts on Android platforms and 60% on iOS platforms.
Workflow underlies most process automation software, such as those for product lines, business processes, and scientific computing. However, current Cloud Computing based workflow systems cannot support real-time applications due to network latency, which limits their application in many IoT systems such as smart healthcare and smart traffic. Fog Computing extends the Cloud by providing virtualized computing resources close to the End Devices so that the response time of accessing computing resources can be reduced significantly. However, how to most effectively manage heterogeneous resources and different computing tasks in the Fog is a big challenge. In this paper, we introduce “FogWorkflowSim” an efficient and extensible toolkit for automatically evaluating resource and task management strategies in Fog Computing with simulated user-defined workflow applications. Specifically, FogWorkflowSim is able to: 1) automatically set up a simulated Fog Computing environment for workflow applications; 2) automatically execute user submitted workflow applications; 3) automatically evaluate and compare the performance of different computation offloading and task scheduling strategies with three basic performance metrics, including time, energy and cost. FogWorkflowSim can serve as an effective experimental platform for researchers in Fog based workflow systems as well as practitioners interested in adopting Fog Computing and workflow systems for their new software projects.
Spreadsheets are widely used but subject to various defects. In this paper, we present SGUARD to effectively detect spreadsheet defects. SGUARD learns spreadsheet features to cluster cells with similar computational semantics, and then refine these clusters to recognize anomalous cells as defects. SGUARD well balances the trade-off between the precision (87.8%) and recall rate (71.9%) in the defect detection, and achieves an F-measure of 0.79, exceeding existing spreadsheet defect detection techniques. We introduce the SGUARD implementation and its usage by a video presentation (https://youtu.be/gNPmMvQVf5Q), and provide its download repository for public access (https://github.com/sheetguard/sguard).
Floating-point arithmetic is widely used in applications from several fields including scientific computing, machine learning, graphics, and finance. Many of these applications are rapidly adopting the use of GPUs to speedup computations. GPUs, however, have limited support to detect floating-point exceptions, making it difficult to software developers to detect unexpected exceptional cases in such applications. We present FPChecker, the first tool to automatically detect floating-point exceptions in GPU applications. FPChecker uses the clang/LLVM compiler to instrument GPU kernels and to detect floating-point exceptions in GPUs at runtime. Once an exception is detected, it reports to the programmer the code location of the exception as well as other useful information. The programmer can then use this information to avoid the exception, e.g., by modifying the application algorithm or changing its input. We present the design of FPChecker, an evaluation of the overhead of the tool, and a real-world case scenario on which the tool is used to identify a hidden exception. The slowdown of FPChecker is moderate ($1.5\times$ on average) and the code is publicly available as open source.
This paper presents PMExec, a tool that supports the execution of partial UML-RT models. To this end, the tool implements the following steps: static analysis, automatic refinement, and input-driven execution. The static analysis respects the execution semantics of UML-RT models is used to detect problematic model elements, i.e., elements that cause problems during execution due to the partiality. Then, the models are refined automatically using model transformation techniques, which mostly add decision points where missing information can be supplied. Third, the refined models are executed, and when the execution reaches the decision points, input required to continue the execution is obtained either interactively or from a script that captures how to deal with partial elements. We have evaluated PMExec using several use-cases that show that the static analysis, refinement, and application of user input can be carried out with reasonable performance, and that the overhead of approach, due to the refinement step, is manageable. (https://youtu.be/BRKsselcMnc)
Bugs in commercial software and third-party components are an undesirable and expensive phenomenon. Such software is usually released to users only in binary form. The lack of source code renders users of such software dependent on their software vendors for repairs of bugs. Such dependence is even more harmful if the bugs introduce new vulnerabilities in the software. Automatically repairing security and functionality bugs in binary code increases software robustness without any developer effort. In this research, we propose development of a binary program repair tool that uses existing bug-free fragments of code to repair buggy code.
The error repair process in software systems is,historically, a resource-consuming task that relies heavily in developer manual effort. Automatic program repair approaches enable the repair of software with minimum human interaction,therefore, mitigating the burden from developers. However, a problem automatically generated patches commonly suffer is generating low-quality patches (which overfit to one program specification, thus not generalizing to an independent oracle evaluation). This work proposes a set of mechanisms to increase the quality of plausible patches including an analysis of test suite behavior and their key characteristics for automatic program repair, analyzing developer behavior to inform the mutation operator selection distribution, and a study of patch diversity as a means to create consolidated higher quality fixes.
Fork-based development is a lightweight mechanism that allows developers to collaborate with or without explicit coordination. Although it is easy to use and popular, when developers each create their own fork and develop independently, their contributions are usually not easily visible to others. When the number of forks grows, it becomes very difficult to maintain an overview of what happens in individual forks, which would lead to additional problems and inefficient practices: lost contributions, redundant development, fragmented communities, and so on. Facing the problems mentioned above, we developed two complementary strategies: (1) Identifying existing best practices and suggesting evidence-based interventions for projects that are inefficient; (2) designing new interventions that could improve the awareness of a community using fork-based development, and help developers to detect redundant development to reduce unnecessary effort.
High-fidelity GUI prototyping provides a meaningful manner for illustrating the developers’ understanding of the requirements formulated by the customer and can be used for productive discussions and clarification of requirements and expectations. However, high-fidelity prototypes are time-consuming and expensive to develop. Furthermore, the interpretation of requirements expressed in informal natural language is often error-prone due to ambiguities and misunderstandings. In this dissertation project, we will develop a methodology based on Natural Language Processing (NLP) for supporting GUI prototyping by automatically translating Natural Language Requirements (NLR) into a formal Domain-Specific Language (DSL) describing the GUI and its navigational schema. The generated DSL can be further translated into corresponding target platform prototypes and directly provided to the user for inspection. Most related systems stop after generating artifacts, however, we introduce an intelligent and automatic interaction mechanism that allows users to provide natural language feedback on generated prototypes in an iterative fashion, which accordingly will be translated into respective prototype changes.
In popular continuous integration (CI) practice, coding is followed by building, integration and system testing, pre-release inspection and deploying artifacts. This can reduce integration risk and speed up development process. But large number of CI build failures may interrupt normal software development process. So, the failures need to be analyzed and fixed quickly. Although various automated program repair techniques have great potential to resolve software failures, the existing techniques mostly focus on repairing source code. So, those techniques cannot directly help resolve software build failures. Apart from that, a special challenge to fix build failures in CI environment is that the failures are often involved with both source code and build scripts. This paper outlines promising preliminary work towards automatic build repair in CI environment that involves both source code and build script. As first step we conducted an empirical study on software build failures and build fix patterns. Based on the findings of the empirical study, we developed an approach that can automatically fix build errors involving build scripts. We plan to extend this repair approach considering both source code and build script. Moreover, we plan to quantify our automatic fixes by user study and comparison between fixes generated by our approach and actual fixes.
Continuous Integration (CI) is a widely-adopted software engineering practice. Despite its undisputed benefits, like higher software quality and improved developer productivity, mastering CI is not easy. Among the several barriers when transitioning to CI, developers need to face a new type of software failures (i.e., build failures). Even when a team has successfully adopted a CI culture, living up to its principles and improving the CI practice are also challenging. In my research, I want to provide developers with support for establishing CI and proper recommendation for continuously improving their CI process.
Over the past decades, various techniques for the application of formal program analysis of software systems have been proposed. However, the application of formal methods for software verification is still limited in practise. It is acknowledged that the task of formally specifying requirements by creating property specifications and the need of the interaction by the engineer with the prover during the execution of the proof are valid hindrances. The objectives of the presented PhD project are the automated inference of declarative property specifications from example scenarios during the phase of Requirements Engineering and their automated verification on abstract model level and on code level. Applying Inductive Logic Programming, we introduce the necessary modifications and extensions of the methodology in order to enable property specification mining from example data. Example data are to be collected with the guidance by the Scenario Modeling Strategies to ensure the necessary coverage of the scenario that is stated with the application in order to correct use of the provided information in the property specification mining algorithm. However, the specification mining algorithm is expected to produce too many property specifications. To turn the weakness into strength, our approach proposes to use the properties to automate the proof and by this, reduce the necessary interaction with the prover.
It has been long suggested that commit messages can greatly facilitate code comprehension. However, developers may not write good commit messages in practice. Neural machine translation (NMT) has been suggested to automatically generate commit messages. Despite the efforts in improve NMT algorithms, the quality of the generated commit messages is not yet satisfactory. This paper, instead of improving NMT algorithms, suggests that proper preprocessing of code changes into concise inputs is quite critical to train NMT. We approach it with semantic analysis of code changes. We collect a real-world dataset with 50k+ commits of popular Java projects, and verify our idea with comprehensive experiments. The results show that preprocessing inputs with code semantic analysis can improve NMT significantly. This work sheds light to how to apply existing DNNs designed by the machine learning community, e.g., NMT models, to complete software engineering tasks.
Automatic program repair (APR) reduces the burden of debugging by directly suggesting likely fixes for the bugs. We believe scalability, applicability, and accurate patch validation are among the main challenges for building practical APR techniques that the researchers in this area are dealing with. In this paper, we describe the steps that we are taking toward addressing these challenges.
Until 2017, Android smartphones occupied approximately 87% of the smartphone market. The vast market also promotes the development of Android malware. Nowadays, the number of malware targeting Android devices found daily is more than 38,000. With the rapid progress of mobile application programming and anti-reverse-engineering techniques, it is harder to detect all kinds of malware. To address challenges in existing detection techniques, such as data obfuscation and limited codes coverage, we propose a detection approach that directly learns features of malware from Dalvik bytecode based on deep learning technique (CNN). The average detection time of our model is 0.22 seconds, which is much lower than other existing detection approaches. In the meantime, the overall accuracy of our model achieves over 89%.
Emotion awareness is critical to interpersonal communication, including that in software development. The SE community has studied emotion in software development using isolated emotion states but it has not considered the dynamic nature of emotion. To investigate the emotion dynamics, SE community needs an effective approach. In this paper, we propose such an approach which is able to automatically collect project teams’ communication records, identify the emotions and their intensities in them, model the emotion dynamics into time series, and provide efficient data management. We demonstrate that this approach can provide end-to-end support for various emotion awareness research and practices through automated data collection, modeling, storage, analysis, and presentation using the IPython’s project data on GitHub.
This paper presents a machine learning classifier designed to identify SQL injection vulnerabilities in PHP code. Both classical and deep learning based machine learning algorithms were used to train and evaluate classifier models using input validation and sanitization features extracted from source code files. On ten-fold cross validations a model trained using Convolutional Neural Network(CNN) achieved the highest precision (95.4%), while a model based on Multilayer Perceptron (MLP) achieved the highest recall (63.7%) and the highest f-measure (0.746).
Code comment generation is a crucial task in the field of automatic software development. Most previous neural comment generation systems used an encoder-decoder neural network and encoded only information from source code as input. Software reuse is common in software development. However, this feature has not been introduced to existing systems. Inspired by the traditional IR-based approaches, we propose to use the existing comments of similar source code as exemplars to guide the comment generation process. Based on an open source search engine, we first retrieve a similar code and treat its comment as an exemplar. Then we applied a seq2seq neural network to conduct an exemplar-based comment generation. We evaluate our approach on a large-scale Java corpus, and experimental results demonstrate that our model significantly outperforms the state-of-the-art methods.
When a program is nondeterministic, it is difficult to test and debug. Nondeterminism occurs even in sequential programs: for example, as a result of iterating over the elements of a hash table. This seemingly innocuous and frequently used operation can result in diverging test results.
We have created a type system that can express determinism specifications in a program. The key ideas in the type system are type qualifiers for nondeterminism, order-nondeterminism, and determinism. While state-of-the-art nondeterminism detection tools unsoundly rely on observing runtime output, our approach verifies determinism at compile time, thereby providing stronger soundness guarantees.
We implemented our type system for Java. Our type checker, the Determinism Checker, warns if a program is nondeterministic or verifies that the program is deterministic. In a case study of a 24,000-line software project, it found previously-unknown nondeterminism errors in a program that had been heavily vetted by its developers, who were greatly concerned about nondeterminism errors.
Providing user satisfaction is a major concern for on-demand multimedia service providers and Internet carriers in Wireless Communications. Traditionally, user satisfaction was measured objectively in terms of throughput and latency. Nowadays the user satisfaction is measured using subjective metrices such as Quality of Experience (QoE). Recently, Smart Media Pricing (SMP) was conceptualized to price the QoE rather than the binary data traffic in multimedia services. In this research, we have leveraged the SMP concept to chalk up a QoE-sensitive multimedia pricing framework to allot price, based on the user preference and multimedia quality achieved by the customer. We begin by defining the utility equations for the provider-carrier and the customer. Then we translate the profit maximizing interplay between the parties into a two-stage Stackelberg game. We model the user personal preference using Prelec weighting function which follows the postulates Prospect Theory (PT). An algorithm has been developed to implement the proposed pricing scheme and determine the Nash Equilibrium. Finally, the proposed smart pricing scheme was tested against the traditional pricing method and simulation results indicate a significant boost in the utility achieved by the mobile customers.
In recent years, the extensive application of the Python language in the field of artificial intelligence has made its analysis work more and more valuable. Many static analysis algorithms need to rely on the construction of call graphs. Therefore, the call graph construction for Python language deserves more attention. In this paper, we did a comparative empirical analysis of several widely used Python static call graph tools both quantitatively and qualitatively. Experiments show that the existing Python static call graph tools have a large difference in the construction effectiveness, and there is still room for improvement.
From the point of view of a software engineer, having safe and optimal controllers for real life systems like cyber physical systems is a crucial requirement before deployment. Given the mathematical model of these systems along with their specifications, model checkers can be used to synthesize controllers for them. The given work proposes novel approaches for making controller analysis easier by using machine learning to represent the controllers synthesized by model checkers in a succinct manner, while also incorporating the domain knowledge of the system. It also proposes the implementation of a visualization tool which will be integrated into existing model checkers. A lucid controller representation along with a tool to visualize it will help the software engineer debug and monitor the system much more efficiently.
Designing usable APIs is critical to developers’ productivity and software quality but is quite difficult. In this paper, I focus on “boilerplate” code, sections of code that have to be included in many places with little or no alteration, which many experts in API design have said can be an indicator of API usability problems. I investigate what properties make code count as boilerplate, and present a novel approach to automatically mine boilerplate code from a large set of client code. The technique combines an existing API usage mining algorithm, with novel filters using AST comparison and graph partitioning. With boilerplate candidates identified by the technique, I discuss how this technique could help API designers in reviewing their design decisions and identifying usability issues.
Machine image sniping is a difficult-to-detect security vulnerability in cloud computing code. When programmatically initializing a machine, a developer must specify which machine image (operating system and file system) to use as the basis for the new machine. The developer should restrict the search to only those machine images which their organization controls: otherwise, an attacker can insert a similar but malicious image into the public database, where it might be selected instead of the image intended by the developer when initializing a new machine. We present a lightweight type and effect system that detects requests to a cloud provider that are vulnerable to an image sniping attack, or proves that no vulnerable request exists in a codebase. We prototyped our type system for Java programs that request Amazon Web Services machines, and evaluated it on more than 500 codebases, detecting 12 vulnerable requests with only 3 false positives.
Quality control is a challenge of crowdsourcing, especially in software testing. As having different levels of skills, some crowdworkers may not complete assigned tasks well. Therefore, it is necessary to propose some auxiliary methods to assist crowdworkers, raising bug report quality. This paper proposes a novel method, namely CroReG, to generate crowdsourcing reports based on bug screenshots. CroReG leverages image understanding techniques to analyze bug screenshots. Image understanding techniques including deep learning and optical character recognition (OCR) techniques are used comprehensively to analyze the screenshots and then CroReG can automatically generate corresponding bug reports. The preliminary experiment results show that CroReG can effectively generate bug reports containing accurate screenshot captions and providing guidance for crowdworkers.
Precise pointer analysis is desired since it is a core technique to find memory defects. There are several dimensions of pointer analysis precision, flow sensitivity, context sensitivity, field sensitivity and path sensitivity. For static analysis tools utilizing pointer analysis, considering all dimensions is difficult because the trade-off between precision and efficiency should be balanced. This paper presents TsmartGP, a static analysis tool for finding memory defects in C programs with a precise and efficient pointer analysis. The pointer analysis algorithm is flow, context, field, and quasi path sensitive. Control flow automatons are the key structures for our analysis to be flow sensitive. Function summaries are applied to get context information and elements of aggregate structures are handled to improve precision. Path conditions are used to filter unreachable paths. For efficiency, a multi-entry mechanism is proposed. Utilizing the pointer analysis algorithm, we implement a checker in TsmartGP to find uninitialized pointer errors in 13 real-world applications. Cppcheck and Clang Static Analyzer are chosen for comparison. The experimental results show that TsmartGP can find more errors while its accuracy rate is 100%, much higher than 3.1% for Cppcheck and 16.7% for Clang Static Analyzer. The demo video is available at https://youtu.be/IQlshemk6OA and TsmartGP can be downloaded from https://github.com/laoyaolandq/TsmartGP.
An enterprise system operates business by providing various services that are guided by set of certain business rules (BR) and constraints. These BR are usually written using plain Natural Language in operating procedures, terms and conditions, and other documents or in source code of legacy enterprise systems. For implementing the BR in a software system, expressing them as UML use-case specifications, or preparing for Merger & Acquisition (M&A) activity, analysts manually interpret the documents or try to identify constraints from the source code, leading to potential discrepancies and ambiguities. These issues in the software system can be resolved only after testing, which is a very tedious and expensive activity. To minimize such errors and efforts, we propose BuRRiTo framework consisting of automatic extraction of BR by mining documents and source code, ability to clean them of various anomalies like inconsistency, redundancies, conflicts, etc. and able to analyze the functional gaps present and performing semantic querying and searching.
Programming is typically a difficult and repetitive task. Programmers encounter endless problems during programming, and they often need to write similar code over and over again. To prevent programmers from reinventing wheels thus increase their productivity, we propose a context-aware code-to-code recommendation tool named Lancer. With the support of a Library-Sensitive Language Model (LSLM) and the BERT model, Lancer is able to automatically analyze the intention of the incomplete code and recommend relevant and reusable code samples in real-time. code, and datasets can be found. A video demonstration of Lancer can be found at https://youtu.be/tO9nhqZY35g. Lancer is open source and the code is available at https://github.com/sfzhou5678/Lancer.
We present TestCov, a tool for robust test-suite execution and test-coverage measurement on C programs. TestCov executes program tests in isolated containers to ensure system integrity and reliable resource control. The tool provides coverage statistics per test and for the whole test suite. TestCov uses the simple, XML-based exchange format for test-suite specifications that was established as standard by Test-Comp. TestCov has been successfully used in Test-Comp ’19 to execute almost 9 million tests on 1720 different programs. The source code of TestCov is released under the open-source license Apache 2.0 and available at https://gitlab.com/sosy-lab/software/test-suite-validator. A full artifact, including a demonstration video, is available at https://doi.org/10.5281/zenodo.3418726.
We present Prema, a tool for Precise Requirement Editing, Modeling and Analysis. It can be used in various fields for describing precise requirements using formal notations and performing rigorous analysis. By parsing the requirements written in formal modeling language, Prema is able to get a model which aptly depicts the requirements. It also provides different rigorous verification and validation techniques to check whether the requirements meet users’ expectation and find potential errors. We show that our tool can provide a unified environment for writing and verifying requirements without using tools that are not well inter-related. For experimental demonstration, we use the requirements of the automatic train protection (ATP) system of CASCO signal co. LTD., the largest railway signal control system manufacturer of China. The code of the tool cannot be released here because the project is commercially confidential. However, a demonstration video of the tool is available at https://youtu.be/BX0yv8pRMWs.
Software engineering has seen much progress in recent past including introduction of new methodologies, new paradigms for software teams, and from smaller monolithic applications to complex, intricate, and distributed software applications. However, the way we represent, discuss, and collaborate on software applications throughout the software development life cycle is still primarily using the source code, textual representations, or charts on 2D computer screens - the confines of which have long limited how we visualize and comprehend software systems. In this paper, we present XRaSE, a novel prototype implementation that leverages augmented reality to visualize a software application as a virtually tangible entity. This immersive approach is aimed at making activities like application comprehension, architecture analysis, knowledge communication, and analysis of a software’s dynamic aspects, more intuitive, richer and collaborative. A video demonstration is available at https://youtu.be/_BYYnayA1FE.
The smart contract cannot be modified when it has been deployed on a blockchain. Therefore, it must be given thoroughly test before its being deployed. Mutation testing is considered as a practical test methodology to evaluate the adequacy of software testing. In this paper, we introduce MuSC, a mutation testing tool for Ethereum Smart Contract (ESC). It can generate numerous mutants at a fast speed and supports the automatic operations such as creating test nets, deploying and executing tests. Specially, MuSC implements a set of novel mutation operators w.r.t ESC programming language, Solidity. Therefore, it can expose the defects of smart contracts to a certain degree. The demonstration video of MuSC is available at https://youtu.be/3KBKXJPVjbQ, and the source code can be downloaded at https://github.com/belikout/MuSC-Tool-Demo-repo.
Swarm-based verification methods split a verification problem into a large number of independent simpler tasks and so exploit the availability of large numbers of cores to speed up verification. Lazy-CSeq is a BMC-based bug-finding tool for C programs using POSIX threads that is based on sequentialization. Here we present the tool VeriSmart 2.0, which extends Lazy-CSeq with a swarm-based bug-finding method. The key idea of this approach is to constrain the interleavings such that context switches can only happen within selected tiles (more specifically, contiguous code segments within the individual threads). This under-approximates the program’s behaviors, with the number and size of tiles as additional parameters, which allows us to vary the complexity of the tasks. Overall, this significantly improves peak memory consumption and (wall-clock) analysis time.
Deep neural networks~(DNNs) are increasingly expanding their real-world applications across domains, e.g., image processing, speech recognition and natural language processing. However, there is still limited tool support for DNN testing in terms of test data quality and model robustness. In this paper, we introduce a mutation testing-based tool for DNNs, DeepMutation++, which facilitates the DNN quality evaluation, supporting both feed-forward neural networks~(FNNs) and stateful recurrent neural networks~(RNNs). It not only enables to statically analyze the robustness of a DNN model against the input as a whole, but also allows to identify the vulnerable segments of a sequential input (e.g. audio input) by runtime analysis. It is worth noting that DeepMutation++ specially features the support of RNNs mutation testing. The tool demo video can be found on the project website https://sites.google.com/view/deepmutationpp.
An effective way to maximize code coverage in software tests is through dynamic symbolic execution, a technique that uses constraint solving to systematically explore a program’s state space. We introduce an open-source dynamic symbolic execution framework called Manticore for analyzing binaries and Ethereum smart contracts. Manticore’s flexible architecture allows it to support both traditional and exotic execution environments, and its API allows users to customize their analysis. Here, we discuss Manticore’s architecture and demonstrate the capabilities we have used to find bugs and verify the correctness of code for our commercial clients.
Concurrency vulnerabilities are extremely harmful and can be frequently exploited to launch severe attacks. Due to the non-determinism of multithreaded executions, it is very difficult to detect them. Recently, data race detectors and techniques based on maximal casual model have been applied to detect concurrency vulnerabilities. However, the former are ineffective and the latter report many false negatives. In this paper, we present CONVUL, an effective tool for concurrency vulnerability detection. CONVUL is based on exchangeable events, and adopts novel algorithms to detect three major kinds of concurrency vulnerabilities. To illustrate the competitiveness of CONVUL, we performed a comparison with three widely-used data race detectors and one recent tool based on maximal casual model. In our experiments, CONVUL detected 9 of 10 known vulnerabilities and found 6 zero-day vulnerabilities on MySQL, while other tools only detected at most 3 out of these 16 vulnerabilities. Our tool and data are available at CONVUL web page: https://sites.google.com/site/convultool. A demonstration video is available at https://youtu.be/-26C6ULxtbk
Model Driven Engineering (MDE) techniques raise the level of abstraction at which developers construct software. However, modern cyber-physical systems are becoming more prevalent and complex and hence software models that represent the structure and behavior of such systems still tend to be large and complex. These models may have numerous if not infinite possible behaviors, with complex communications between their components. Appropriate software testing techniques to generate test cases with high coverage rate to put these systems to test at the model-level (without the need to understand the underlying code generator or refer to the generated code) are therefore important. Concolic testing, a hybrid testing technique that benefits from both concrete and symbolic execution, gains a high execution coverage and is used extensively in the industry for program testing but not for software models. In this paper, we present a novel technique and its tool mCUTE1 , an open source 2 model-level concolic testing engine. We describe the implementation of our tool in the context of Papyrus-RT, an open source Model Driven Engineering (MDE) tool based on UML-RT, and report the results of validating our tool using a set of benchmark models.
In modern programming languages, exception handling is an effective mechanism to avoid unexpected runtime errors. Thus, failing to catch and handle exceptions could lead to serious issues like system crashing, resource leaking, or negative end-user experiences. However, writing correct exception handling code is often challenging in mobile app development due to the fast-changing nature of API libraries for mobile apps and the insufficiency of their documentation and source code examples. Our prior study shows that in practice mobile app developers cause many exception-related bugs and still use bad exception handling practices (e.g. catch an exception and do nothing). To address such problems, in this paper, we introduce two novel techniques for recommending correct exception handling code. One technique, XRank, recommends code to catch an exception likely occurring in a code snippet. The other, XHand, recommends correction code for such an occurring exception. We have developed ExAssist, a code recommendation tool for exception handling using XRank and XHand. The empirical evaluation shows that our techniques are highly effective. For example, XRank has top-1 accuracy of 70% and top-3 accuracy of 87%. XHand’s results are 89% and 96%, respectively.
To detect large-variance code clones (i.e. clones with relatively more differences) in large-scale code repositories is difficult because most current tools can only detect almost identical or very similar clones. It will make promotion and changes to some software applications such as bug detection, code completion, software analysis, etc. Recently, CCAligner made an attempt to detect clones with relatively concentrated modifications called large-gap clones. Our contribution is to develop a novel and effective detection approach of large-variance clones to more general cases for not only the concentrated code modifications but also the scattered code modifications. A detector named LVMapper is proposed, borrowing and changing the approach of sequencing alignment in bioinformatics which can find two similar sequences with more differences. The ability of LVMapper was tested on both self-synthetic datasets and real cases, and the results show substantial improvement in detecting large-variance clones compared with other state-of-the-art tools including CCAligner. Furthermore, our new tool also presents good recall and precision for general Type-1, Type-2 and Type-3 clones on the widely used benchmarking dataset, BigCloneBench.
The correctness of compilers is instrumental in the safety and reliability of other software systems, as bugs in compilers can produce programs that do not reflect the intents of programmers. Compilers are complex software systems due to the complexity of optimization. GCC is an optimizing C compiler that has been used in building operating systems and many other system software. In this paper, we describe K-CONFIG, an approach that uses the bugs reported in the GCC repository to generate new test inputs. Our main insight is that the features appearing in the bug reports are likely to reappear in the future bugs, as the bugfixes can be incomplete or those features may be inherently challenging to implement hence more prone to errors. Our approach first clusters the failing test input extracted from the bug reports into clusters of similar test inputs. It then uses these clusters to create configurations for Csmith, the most popular test generator for C compilers. In our experiments on two versions of GCC, our approach could trigger up to 36 miscompilation failures, and 179 crashes, while Csmith with the default configuration did not trigger any failures. This work signifies the benefits of analyzing and using the reported bugs in the generation of new test inputs.
Developer forums are one of the most popular and useful Q&A websites on API usages. The analysis of API forums can be a critical resource for the automated question and answer approaches. In this paper, we empirically study three API forums including Twitter, eBay, and AdWords, to investigate the characteristics of question-answering process. We observe that +60% of the posts on all three forums were answered by providing API method names or documentation. +85% of the questions were answered by API development teams and the answers from API development teams drew fewer follow-up questions. Our results provide empirical evidences for us in a future work to build automated solutions to answer developer questions on API forums.
Deep neural networks have been increasingly used in software engineering and program analysis tasks. They usually take a program and make some prediction about it, e.g., bug prediction. We call these models neural programs. The reliability of neural programs can impact the reliability of the encompassing analyses. In this paper, we describe our ongoing efforts in developing effective techniques to test neural programs. We discuss the challenges in developing such tools and our future plans. In our preliminary experiment on a recent neural model proposed in the literature, we found that the model is very brittle and simple perturbations in the input can cause the model to make a mistake in its prediction.
Automatically generating code from a textual description of method invocation confronts challenges. There were two current research directions for this problem. One direction focuses on considering a textual description of method invocations as a separate Natural Language query and do not consider the surrounding context of the code. Another direction takes advantage of a practical large scale code corpus for providing a Machine Translation model to generate code. However, this direction got very low accuracy. In this work, we tried to improve these drawbacks by proposing MethodInfoToCode, an approach that embeds context information and optimizes the ability of learning of original Phrase-based Statistical Machine Translation (PBMT) in NLP to infer implementation of method invocation given method name and other context information. We conduct an expression prediction models learned from 2.86 million method invocations from the practical data of high qualities corpus on Github that used 6 popular libraries: JDK, Android, GWT, Joda-Time, Hibernate, and Xstream. By the evaluation, we show that if the developers only write the method name of a method invocation in a body of a method, MethodInfoToCode can predict the generated expression correctly at 73% in F1 score.
Deep learning (DL) techniques have demonstrated satisfactory performance in many tasks, even in safety-critical applications. Reliability is hence a critical consideration to DL-based systems. However, the statistical nature of DL makes it quite vulnerable to invalid inputs, i.e., those cases that are not considered in the training phase of a DL model. This paper proposes to perform data sanity check to identify invalid inputs, so as to enhance the reliability of DL-based systems. To this end, we design and implement a tool to detect behavior deviation of a DL model when processing an input case, and considers it the symptom of invalid input cases. Via a light, automatic instrumentation to the target DL model, this tool extracts the data flow footprints and conducts an assertion-based validation mechanism.
We assert that it is the ethical duty of software engineers to strive to reduce software discrimination. This paper discusses how that might be done. This is an important topic since machine learning software is increasingly being used to make decisions that affect people’s lives. Potentially, the application of that software will result in fairer decisions because (unlike humans) machine learning software is not biased. However, recent results show that the software within many data mining packages exhibits “group discrimination”; i.e. their decisions are inappropriately affected by “protected attributes” (e.g., race, gender, age, etc.). There has been much prior work on validating the fairness of machine-learning models (by recognizing when such software discrimination exists). But after detection, comes mitigation. What steps can ethical software engineers take to reduce discrimination in the software they produce? This paper shows that making fairness as a goal during hyperparameter optimization can (a) preserve the predictive power of a model learned from a data miner while also (b) generates fairer results. To the best of our knowledge, this is the first application of hyperparameter optimization as a tool for software engineers to generate fairer software.
Modern software development relies heavily on Application Programming Interface (API) libraries. However, there are often certain constraints on using API elements in such libraries. Failing to follow such constraints (API misuse) could lead to serious programming errors. Many approaches have been proposed to detect API misuses, but they still have low accuracy and cannot repair the detected misuses. In this paper, we propose SAM, a novel approach to detect and repair API misuses automatically. SAM uses statistical models to describe five factors involving in any API method call: related method calls, exceptions, pre-conditions, post-conditions, and values of arguments. These statistical models are trained from a large repository of high-quality production code. Then, given a piece of code, SAM verifies each of its method calls with the trained statistical models. If a factor has a sufficiently low probability, the corresponding call is considered as an API misuse. SAM performs an optimal search for editing operations to apply on the code until it has no API issue.
The goal of automated program repair is to automate patch generation for buggy programs to reduce the manual effort by developers. A generate-and-validate method, such as GenProg, is a kind of typical repair methods that continuously generate potential patches and then validate the patches with a given test suite. A generate-and-validate method can accumulate patches when the execution time of repair methods increases. However, how many buggy programs can be newly patched when the time increase? In this paper, we conducted an exploratory study of repairing 2980 small-scale buggy programs from the CODEFLAWS benchmark with three repair methods GENPROG, SPR, and PROPHET. The aim of this study is to understand the execution time of repair methods via investigating four research questions. Experimental results show that the time of patch generation correlates with the number of executable lines of code and the Cyclomatic complexity. That is, a complex program is difficult to be repaired. This motivates us to explore a new repair method that can weaken such correlation with the lines of code and the complexity. We designed VANFIX, a simple and effective repair method for small-scale C programs. VANFIX leverages the probability of exploring the search space to conduct a variable search neighborhood for potential patches, rather than patching suspicious statements one by one. The comparison among repair methods shows that VANFIX can generate patches for 653 buggy programs, which contains 408 correctly patched buggy programs. This makes VANFIX achieve 24% to 30% better precision than GENPROG, SPR, and PROPHET.
Modern software projects include automated tests written to check the programs’ functionality. The set of functions invoked by a test is called the trace of the test, and the action of obtaining a trace is called tracing. There are many tracing tools since traces are useful for a variety of software engineering tasks such as test generation, fault localization, and test execution planning. A major drawback in using test traces is that obtaining them, i.e., tracing, can be costly in terms of computational resources and runtime. Prior work attempted to address this in various ways, e.g., by selectively tracing only some of the software components or compressing the trace on-the-fly. However, all these approaches still require building the project and executing the test in order to get its (partial, possibly compressed) trace. This is still very costly in many cases. In this work, we propose a method to predict the trace of each test without executing it, based only on static properties of the test and the tested program, as well as past experience on different tests. This prediction is done by applying supervised learning to learn the relation between various static features of test and function and the likelihood that one will include the other in its trace. Then, we show how to use the predicted traces in a recent automated troubleshooting paradigm called Learn Diagnose and plan (LDP), instead of the actual, costly-to-obtain, test traces. In a preliminary evaluation on real-world open-source projects, we observe that our prediction quality is reasonable. In addition, using our trace predictions in LDP yields almost the same results comparing to when using real traces, while requiring less overhead.
Developers today use significant amounts of open source code, surfacing the need for ways to automatically audit and upgrade library dependencies and leading to the emergence of Software Composition Analysis (SCA). SCA products are concerned with three tasks: discovering dependencies, checking the reachability of vulnerable code for false positive elimination, and automated remediation. The latter two tasks rely on call graphs of library and application code to check whether vulnerable methods found in the open source components are called by applications. However, statically-constructed call graphs introduce both false positives and false negatives on real-world projects. In this paper, we develop a novel, modular means of combining statically- and dynamically-constructed call graphs via instrumentation to improve the performance of false positive elimination. Our experiments indicate significant performance improvements, but that instrumentation-based call graphs are less readily applicable in practice.
The development of complex and dependable systems like autonomous vehicles relies increasingly on the use of systems modeling language (SysML). In fact, SysML has become a de facto standard for systems engineering. With model-driven engineering, a SysML model serves as a reference for the early defect detection of the system under design: the earlier the errors are detected, the less is the cost of handling the errors. Mutation testing is a fault-based technique that has recently seen its applications to SysML behavioral models (e.g., state machine diagrams). Specifically, a system’s state-transition design can be fed to a model checker where mutants are automatically generated and then killed against the desired design specifications (e.g., safety properties). In this paper, we present a novel approach based on process mining to improve the effectiveness and efficiency of the SysML mutation testing based on model checking. In our approach, the mutation operators are applied directly to the state machine diagram. These mutants are then fed as traces into a process mining tool and checked according to the event logs. Our initial results indicates that the process mining approach kills more mutants faster than the model checking method.
For teams using distributed version control systems, the right collaborative development workflows can help maintaining the long-term quality of project repositories and improving work efficiency. Despite the fact that the workflows are important, empirical evidence on how they are used and what impact they make on the project repositories is scarce. Most suggestions on the use of workflows are merely expert opinions. Often, there is only some anecdotal evidence that underpins those advices. As a result, we do not know what workflows are used, how the developers are using them, and what impact do they have on the project repositories. This work investigates the various collaborative development workflows in eight major open-source projects, identifies and analyses the workflows together with the structures of their project repositories, discusses their similarities and differences, and lays out the future works.
Deep neural networks (DNNs) are shown to be promising solutions in many challenging artificial intelligence tasks, including object recognition, natural language processing, and even unmanned driving. A DNN model, generally based on statistical summarization of in-house training data, aims to predict correct output given an input encountered in the wild. In general, 100% precision is therefore impossible due to its probabilistic nature. For DNN practitioners, it is very hard, if not impossible, to figure out whether the low precision of a DNN model is an inevitable result, or caused by defects such as bad network design or improper training process. This paper aims at addressing this challenging problem. We approach with a careful categorization of the root causes of low precision. We find that the internal data flow footprints of a DNN model can provide insights to locate the root cause effectively. We then develop a tool, namely, DeepMorph (DNN Tomography) to analyze the root cause, which can instantly guide a DNN developer to improve the model. Case studies on four popular datasets show the effectiveness of DeepMorph.
Recent studies showed that the dialogs between app developers and app users on app stores are important to increase user satisfaction and app’s overall ratings. However, the large volume of reviews and the limitation of resources discourage app developers from engaging with customers through this channel. One solution to this problem is to develop an Automated Responding System for developers to respond to app reviews in a manner that is most similar to a human response. Toward designing such system, we have conducted an empirical study of the characteristics of mobile apps’ reviews and their human-written responses. We found that an app reviews can have multiple fragments at sentence level with different topics and intentions. Similarly, a response also can be divided into multiple fragments with unique intentions to answer certain parts of their review (e.g., complaints, requests, or information seeking). We have also identified several characteristics of review (rating, topics, intentions, quantitative text feature) that can be used to rank review by their priority of need for response. In addition, we identified the degree of re-usability of past responses is based on their context (single app, apps of the same category, and their common features). Last but not least, a responses can be reused in another review if some parts of it can be replaced by a placeholder that is either a named-entity or a hyperlink. Based on those findings, we discuss the implications of developing an Automated Responding System to help mobile apps’ developers write the responses for users reviews more effectively.
Automated program repair (APR) is one of the recent advances in automated software engineering aiming for reducing the burden of debugging by suggesting high-quality patches that either directly fix the bugs, or help the programmers in the course of manual debugging. We believe scalability, applicability, and accurate patch validation are the main design objectives for a practical APR technique. In this paper, we present PraPR, our implementation of a practical APR technique that operates at the level of JVM bytecode. We discuss design decisions made in the development of PraPR, and argue that the technique is a viable baseline toward attaining aforementioned objectives. PraPR is publicly available at https://github.com/prapr/prapr.
Recent trends in Web development demonstrate an increased interest in serverless applications, i.e. applications that utilize computational resources provided by cloud services on demand instead of requiring traditional server management. This approach enables better resource management while being scalable, reliable, and cost-effective. However, it comes with a number of organizational and technical difficulties which stem from the interaction between the application and the cloud infrastructure, for example, having to set up a recurring task of reuploading updated files. In this paper, we present Kotless – a Kotlin Serverless Framework. Kotless is a cloud-agnostic toolkit that solves these problems by interweaving the deployed application into the cloud infrastructure and automatically generating the necessary deployment code. This relieves developers from having to spend their time integrating and managing their applications instead of developing them. Kotless has proven its capabilities and has been used to develop several serverless applications already in production. Its source code is available at https://github.com/JetBrains/kotless, a tool demo can be found at https://www.youtube.com/watch?v=IMSakPNl3TY.
We present PeASS (Performance Analysis of Software System versions), a tool for detecting performance changes at source code level that occur between different code versions. By using PeASS, it is possible to identify performance regressions that happened in the past to fix them.
PeASS measures the performance of unit tests in different source code versions. To achieve statistic rigor, measurements are repeated and analyzed using an agnostic t-test. To execute a minimal amount of tests, PeASS uses a regression test selection.
We evaluate PeASS on a selection of Apache Commons projects and show that 81% of all unit test covered performance changes can be found by PeASS. A video presentation is available at https://www.youtube.com/watch?v=RORFEGSCh6Y and PeASS can be downloaded from https://github.com/DaGeRe/peass.
The amount of android applications is having a tremendous increasing trend, exerting pressure over practitioners and researchers around application quality, frequent releases, and quick fixing of bugs. This pressure leads practitioners to make use of automated approaches. However, most of those use source-code as input. Nevertheless, third-party services are not able to use this approaches due to privacy factors. In this paper we present MutAPK, an open source mutation testing tool that enables the usage of APKs as input for this task. MutAPK generates mutants without the need of having access to source code, because the mutations are done in an intermediate representation of the code (i.e., SMALI) that does not require compilation. MutAPK is publicly available at GitHub https://bit.ly/2KYvgP9 VIDEO: https://bit.ly/2WOjiyy
Coding convention plays an important role in guaranteeing software quality. However, coding conventions are usually informally presented and inconvenient for programmers to use. In this paper, we present CocoQa, a system that answers programmer’s questions about coding conventions. CocoQa answers questions by querying a knowledge graph for coding conventions. It employs 1) a subgraph matching algorithm that parses the question into a SPARQL query, and 2) a machine comprehension algorithm that uses an end-to-end neural network to detect answers from searched paragraphs. We have imple- mented CocoQa, and evaluated it on a coding convention QA dataset. The results show that CocoQa can answer questions about coding conventions precisely. In particular, CocoQa can achieve a precision of 82.92% and a recall of 91.10%. Repository: https://github.com/14dtj/CocoQa/ Video: https://youtu.be/VQaXi1WydAU
Automated input generators must constantly choose which UI element to interact with and how to interact with it, in order to achieve high coverage with a limited time budget. Currently, most black-box input generators adopt pseudo-random or brute-force searching strategies, which may take very long to find the correct combination of inputs that can drive the app into new and important states. We propose Humanoid, a deep learning-based approach to automated black-box Android app testing, which can explore the app more efficiently. The key technique behind Humanoid is a deep neural network model that can learn how human users choose actions based on an app’s GUI from human interaction traces . The learned model can be used to guide test input generation to achieve higher coverage. Experiments on both open-source apps and market apps demonstrate that Humanoid is able to reach higher coverage, and faster as well, than the state-of-the-art test input generators. Humanoid is open-sourced at https://github.com/yzygitzh/Humanoid and a demo video can be found at https://youtu.be/PDRxDrkyORs.
Evidence shows that developer reputation is extremely important when accepting pull requests or resolving reported issues. It is particularly salient in Free/Libre Open Source Software since the developers are distributed around the world, do not work for the same organization and, in most cases, never meet face to face. The existing solutions to expose developer reputation tend to be forge specific (GitHub), focus on activity instead of impact, do not leverage social or technical networks, and do not correct often misspelled developer identities. We aim to remedy this by amalgamating data from all public Git repositories, measuring the impact of developer work, expose developer’s collaborators, and correct notoriously problematic developer identity data. We leverage World of Code (WoC), a collection of an almost complete (and continuously updated) set of Git repositories by first allowing developers to select which of the 34 million(M) Git commit author IDs belong to them and then generating their profiles by treating the selected collection of IDs as that single developer. As a side-effect, these selections serve as a training set for a supervised learning algorithm that merges multiple identity strings belonging to a single individual. As we evaluate the tool and the proposed impact measure, we expect to build on these findings to develop reputation badges that could be associated with pull requests and commits so developers could easier trust and prioritize them.
Deep Neural Network(DNN) techniques have been prevalent in software engineering. They are employed to facilitate various software engineering tasks and embedded into many software applications. However, because DNNs are built upon a rich data-driven programming paradigm that employs plenty of labeled data to train a set of neurons to construct the internal system logic, analyzing and understanding their behaviors becomes a difficult task for software engineers. In this paper, we present an instance-based visualization tool for DNN, namely NeuralVis, to support software engineers in visualizing and interpreting deep learning models. NeuralVis is designed for: 1). visualizing the structure of DNN models, i.e., neurons, layers, as well as connections; 2). visualizing the data transformation process; 3). integrating existing adversarial attack algorithms for test input generation; 4). comparing intermediate outputs of different instances. To demonstrate the effectiveness of NeuralVis, we design a task-based user study involving ten participants on two classic DNN models, i.e., LeNet and VGG-12. The result shows NeuralVis can assist engineers in identifying critical features that determine the prediction results. Tool is available at http://220.127.116.11:18080. Video: https://youtu.be/hRxCovrOZFI.
Analyzing executions of concurrent software is very difficult. Even if a trace is available, such traces are very hard to read and interpret. A textual trace contains a lot of data, most of which is not relevant to the issue at hand. Past visualization attempts either do not show concurrent behavior, or result in a view that is overwhelming for the user.
We provide a visual analytics tool, VA4JVM, for error traces produced by either the Java Virtual Machine, or by Java Pathfinder. Its key features are a layout that spatially associates events with threads, a zoom function, and the ability to filter event data in various ways. We show in examples how filtering and zooming in can highlight a problem without having to read lengthy textual data.
This paper presents Sip4J, a fully automated, scalable and effective tool to automatically generate access permission contracts for a sequential Java program. The access permission contracts, which represent the dependency of code blocks, have been frequently used to enable concurrent execution of sequential programs. Those permission contracts, unfortunately, need to be manually created by programmers, which is known to be time- consuming, laborious and error-prone. To mitigate those manual efforts, Sip4J performs inter-procedural static analysis of Java source code to automatically extract the implicit dependencies in the program and subsequently leverages them to automatically generate access permission contracts, following the Design by Contract principle. The inferred specifications are then used to identify the concurrent (immutable) methods in the program. Experimental results further show that Sip4J is useful and effective towards generating access permission contracts for sequential Java programs.
To detect specific types of bugs and vulnerabilities, static analysis tools must be correctly configured with security-relevant methods (SRM), e.g., sources, sinks, sanitizers and authentication methods—usually a very labour-intensive and error-prone process. This work presents the semi-automated tool SWAN_ASSIST, which aids the configuration with an IntelliJ plugin based on active machine learning. It integrates our novel automated machine-learning approach SWAN, which identifies and classifies Java SRM. SWAN_ASSIST further integrates user feedback through iterative learning. SWAN_ASSIST aids developers by asking them to classify at each point in time exactly those methods whose classification best impact the classification result. Our experiments show that SWAN_ASSIST classifies SRM with a high precision, and requires a relatively low effort from the user. A video demo of SWAN_ASSIST can be found at https://youtu.be/fSyD3V6EQOY. The source code is available at https://github.com/secure-software-engineering/swan.
Fuzzing is widely used for vulnerability detection. One of the challenges for an efficient fuzzing is covering code guarded by constraints such as the magic number and nested conditions. Recently, academia has partially addressed the challenge via whitebox methods. However, high-level constraints such as array sorts, virtual function invocations, and tree set queries are yet to be handled.
To meet this end, we present VisFuzz, an interactive tool for better understanding and intervening fuzzing process via real-time visualization. It extracts call graph and control flow graph from source code, maps each function and basic block to the line of source code and tracks real-time execution statistics with detail constraint contexts. With VisFuzz, test engineers first locate blocking constraints and then learn its semantic context, which helps to craft targeted inputs or update test drivers. Preliminary evaluations are conducted on four real-world programs in Google fuzzer-test-suite. Given additional 15 minutes to understand and intervene the state of fuzzing, the intervened fuzzing outperform the original pure AFL fuzzing, and the path coverage improvements range from 10.84% to 150.58%, equally fuzzed by for 12 hours.