GUI-based testing has been primarily used to examine the functionality and usability of mobile apps. Despite the numerous GUI-based test input generation techniques proposed in the literature, these techniques are still limited by (1) lack of context-aware text inputs; (2) failing to generate expressive tests; and (3) absence of test oracles. To address these limitations, we propose CraftDroid, a framework that leverages information retrieval, along with static and dynamic analysis techniques, to extract human knowledge from an existing test suite for one app and transfer the test cases and oracles to be used for testing other apps with similar functionalities. Evaluation of CraftDroid on real-world commercial Android apps corroborates its effectiveness by achieving 73% precision and 90% recall on average for transferring both the GUI events and oracles. In addition, 75% of the attempted transfers successfully generated valid and feature-based tests for popular features among apps in the same category.
The use of mobile apps is increasingly widespread, and much effort is put into testing these apps to make sure they behave as intended. To reduce this effort, and thus the overall cost of mobile app testing, we propose AppTestMigrator, a technique for migrating test cases between apps in the same category (e.g., banking apps). The intuition behind AppTestMigrator is that many apps share similarities in their functionality, and these similarities often result in conceptually similar user interfaces (through which that functionality is accessed). AppTestMigrator leverages these commonalities between user interfaces to migrate existing tests written for an app to another similar app. Specifically, given (1) a test case for an app (source app) and (2) a second app (target app), AppTestMigrator attempts to automatically transform the sequence of events and oracles in the test for the source app to events and oracles for the target app. We implemented AppTestMigrator for Android mobile apps and evaluated it on a set of randomly selected apps from the Google Play Store in four different categories. Our initial results are promising, support our intuition that test migration is possible, and motivate further research in this direction.
Mobile push notifications are widely used on mobile platforms to deliver all sorts of information to app users. Although this feature offers great convenience for both app developers and mobile users, it has recurrently been reported to serve malicious and aggressive purposes, such as delivering annoying push-notification advertisements. However, to the best of our knowledge, our research community has not touched the problem yet, neither providing techniques to detect/prevent them nor characterizing this issue in the mobile app ecosystem at large scale. This paper presents the first study to detect aggressive push notifications and further characterize them at large scale. To this end, we first provide a taxonomy of mobile push notifications and pick out the aggressive ones using a crowdsourcing-based method. Then we propose DaPANDA, a novel hybrid approach, aiming at automatically detecting aggressive push notifications in Android apps. DaPANDA leverages a guided testing approach to systematically trigger and consume push notifications. By instrumenting the Android framework, DaPANDA further collects all the notification-relevant runtime information for flagging aggressive ones. Our experimental results show that DaPANDA is capable of detecting aggressive push notifications across the spectrum of aggressive types. Applying DaPANDA to 20,000 Android apps yields over 1,000 aggressive notifications that are further confirmed to be true positives and are shared with our community to promote advanced approaches for detecting aggressive mobile push notifications.
To ensure security and privacy, Android employs a permission mechanism which requires developers to explicitly declare the permissions needed by their applications (apps). Users must grant those permissions at install time or during runtime. This mechanism protects users' private data, but also imposes additional requirements on developers. For permission declaration, developers need knowledge about what permissions are necessary to implement various features of their apps, which is difficult to acquire due to the incompleteness of the Android documentation. To address this problem, we present a novel permission recommendation system named PerRec for Android apps. PerRec leverages mining-based techniques and data fusion methods to recommend permissions for given apps according to their used APIs and API descriptions. The recommendation scores of potential permissions are calculated by a combination of two techniques, implemented as two components of PerRec: a collaborative filtering component which measures similarities between apps based on semantic similarities between APIs, and a content-based recommendation component which automatically constructs profiles for potential permissions from existing apps. The two components are combined in PerRec for better performance. We have evaluated PerRec on 730 apps collected from Google Play and F-Droid, a repository of free and open source Android apps. Experimental results show that our approach significantly improves the state-of-the-art approaches APRec_CF_correlation, APRec_TEXT, and Axplorer.
Link to Publication: https://xin-xia.github.io/publication/AUSE19.pdf
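A minimal sketch of the score-fusion idea described above, with toy data and an illustrative equal weighting between the two components; the similarity measure, weighting, and profile format are our own stand-ins, not PerRec's actual formulas:

```python
# Illustrative sketch (not PerRec's algorithm): fuse a collaborative-filtering
# score with a content-based score to rank candidate permissions for an app
# described by its API usage.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-token vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cf_score(target_apis, neighbors):
    """Collaborative filtering: permissions used by apps with similar API usage."""
    scores = Counter()
    for apis, permissions in neighbors:
        sim = cosine(target_apis, apis)
        for perm in permissions:
            scores[perm] += sim
    return scores

def content_score(target_apis, perm_profiles):
    """Content-based: match the app's API tokens against per-permission profiles."""
    return Counter({perm: cosine(target_apis, profile)
                    for perm, profile in perm_profiles.items()})

def recommend(target_apis, neighbors, perm_profiles, alpha=0.5, k=5):
    cf, cb = cf_score(target_apis, neighbors), content_score(target_apis, perm_profiles)
    fused = Counter({p: alpha * cf[p] + (1 - alpha) * cb[p] for p in set(cf) | set(cb)})
    return fused.most_common(k)

# Toy example with hypothetical data.
target = Counter({"LocationManager.getLastKnownLocation": 2, "Camera.open": 1})
neighbors = [(Counter({"LocationManager.getLastKnownLocation": 3}), {"ACCESS_FINE_LOCATION"}),
             (Counter({"Camera.open": 2}), {"CAMERA"})]
profiles = {"ACCESS_FINE_LOCATION": Counter({"LocationManager.getLastKnownLocation": 5}),
            "CAMERA": Counter({"Camera.open": 4})}
print(recommend(target, neighbors, profiles))
```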
The fragmentation issue spreads over multiple mobile platforms such as Android, iOS, mobile web, and even WeChat, which hinders test scripts from running across platforms. To reduce the cost of adapting scripts for various platforms, some existing tools apply conventional computer vision techniques to replay the same script on multiple platforms automatically. However, because these solutions can hardly identify dynamic or similar widgets, it becomes difficult for engineers to apply them in practice. In this paper, we present an image-driven tool, namely LIRAT, to record and replay test scripts across platforms, solving the problem of cross-platform test script replay for the first time. LIRAT records screenshots and layouts of the widgets, and leverages image understanding techniques to locate them in the replay process. Based on accurate widget localization, LIRAT supports replaying test scripts across devices and platforms. We employed LIRAT to replay 25 scripts from 5 applications across 8 Android devices and 2 iOS devices. The results show that LIRAT can replay 88% of the scripts on Android platforms and 60% on iOS platforms.
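To make the image-driven idea concrete, here is a minimal sketch of locating a recorded widget on the current screen with plain template matching; it uses OpenCV's generic matchTemplate on synthetic images and is not LIRAT's actual pipeline (the threshold and image data are illustrative):

```python
# Illustrative sketch of image-based widget localization (not LIRAT itself):
# find a recorded widget crop inside the current screenshot via template matching.
import numpy as np
import cv2

rng = np.random.default_rng(0)
screen = rng.integers(0, 30, (200, 200), dtype=np.uint8)   # synthetic "screenshot"
screen[50:70, 80:120] = 200                                 # button background
screen[55:65, 90:110] = 60                                  # "label" inside the button
widget = screen[50:70, 80:120].copy()                       # "recorded" widget crop

result = cv2.matchTemplate(screen, widget, cv2.TM_CCOEFF_NORMED)
_, score, _, top_left = cv2.minMaxLoc(result)
if score > 0.8:                                             # threshold is illustrative
    x = top_left[0] + widget.shape[1] // 2
    y = top_left[1] + widget.shape[0] // 2
    print(f"tap at ({x}, {y}), match confidence {score:.2f}")
```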
Automated input generators must constantly choose which UI element to interact with and how to interact with it, in order to achieve high coverage with a limited time budget. Currently, most black-box input generators adopt pseudo-random or brute-force searching strategies, which may take very long to find the correct combination of inputs that can drive the app into new and important states. We propose Humanoid, a deep learning-based approach to automated black-box Android app testing, which can explore the app more efficiently. The key technique behind Humanoid is a deep neural network model that learns, from human interaction traces, how human users choose actions based on an app's GUI. The learned model can be used to guide test input generation to achieve higher coverage. Experiments on both open-source apps and market apps demonstrate that Humanoid reaches higher coverage, and does so faster, than state-of-the-art test input generators. Humanoid is open-sourced at https://github.com/yzygitzh/Humanoid and a demo video can be found at https://youtu.be/PDRxDrkyORs.
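As a rough illustration of how a learned model can bias action selection during exploration (this is not Humanoid's actual network or policy), consider a sketch where a scoring function stands in for the trained model and the next GUI action is sampled from a softmax over its scores instead of uniformly at random:

```python
# Illustrative sketch: model-guided action selection for GUI exploration.
import math
import random

def pick_action(state, candidate_actions, score_fn, temperature=1.0):
    """Sample the next GUI action from a softmax over learned scores;
    score_fn(state, action) is a stand-in for a trained interaction model."""
    scores = [score_fn(state, a) / temperature for a in candidate_actions]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]          # numerically stable softmax weights
    return random.choices(candidate_actions, weights=weights, k=1)[0]

# Toy stand-in for a trained model: prefer tapping the login button over scrolling.
def toy_score(state, action):
    return {"tap_login_button": 2.0, "scroll": 0.3, "tap_ad_banner": 0.1}[action]

print(pick_action("login_screen", ["tap_login_button", "scroll", "tap_ad_banner"], toy_score))
```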
In the past, researchers have developed a number of popular taint-analysis approaches, particularly in the context of Android applications. Numerous studies have shown that automated code analyses are adopted by developers only if they yield a good “signal to noise ratio”, i.e., high precision. Many previous studies have reported analysis precision quantitatively, but this gives little insight into what can and should be done to increase precision further.
To guide future research on increasing precision, we present a comprehensive study that evaluates static Android taint-analysis results on a qualitative level. To unravel the exact nature of taint flows, we have designed COVA, an analysis tool to compute partial path constraints that inform about the circumstances under which taint flows may actually occur in practice.
We have conducted a qualitative study on the taint flows in 1,022 real-world Android applications. Our results reveal several key findings: Many taint flows occur only under specific conditions, e.g., environment settings, user interaction, I/O. Taint analyses should consider the application context to discern such situations. COVA shows that few taint flows are guarded by multiple different kinds of conditions simultaneously, so tools that seek to confirm true positives dynamically can concentrate on one kind at a time, e.g., only simulating user interactions. Lastly, many false positives arise due to an overly liberal source/sink configuration. Taint analyses must be configured more carefully, and their configuration could benefit from better tool assistance.
This paper proposes a solution for automated goal-driven exploration of Android applications – a scenario in which a user, e.g., security auditor, needs to dynamically trigger the functionality of interest in an application, e.g., to check whether user-sensitive info is only sent to recognized third-party servers. As the auditor might need to check hundreds or even thousands of apps, manually exploring each app to trigger the desired behavior is too time-consuming to be feasible. Existing automated application exploration and testing techniques are of limited help in this scenario as well, as their goal is mostly to identify faults by systematically exploring different app paths, rather than swiftly navigating to the target functionality.
The goal-driven application exploration approach proposed in this paper, called GoalExplorer, automatically generates an executable test script that directly triggers the functionality of interest. The core idea behind GoalExplorer is to first statically model the application UI screens and transitions between these screens, producing a Screen Transition Graph (STG). Then, GoalExplorer uses the STG to guide the dynamic exploration of the application to the particular target of interest: an Android activity, API call, or a program statement. The results of our empirical evaluation on 93 benchmark applications and the 95 most popular Google Play applications show that the STG is substantially more accurate than other Android app UI models and that GoalExplorer is able to trigger a target functionality much faster than existing application exploration and testing techniques.
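A minimal sketch of what STG-guided navigation might look like, assuming a toy screen transition graph (this is not GoalExplorer's implementation; the graph and event names are hypothetical):

```python
# Illustrative sketch: shortest event sequence from the start screen to the
# screen that hosts the target functionality, via BFS over a transition graph.
from collections import deque

def path_to_target(stg, start, target):
    """stg maps screen -> list of (event, next_screen); returns a list of events."""
    queue, seen = deque([(start, [])]), {start}
    while queue:
        screen, events = queue.popleft()
        if screen == target:
            return events
        for event, nxt in stg.get(screen, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, events + [event]))
    return None

stg = {"Main": [("tap_menu", "Menu")],
       "Menu": [("tap_settings", "Settings"), ("tap_share", "ShareDialog")],
       "Settings": [("tap_privacy", "PrivacyScreen")]}
print(path_to_target(stg, "Main", "PrivacyScreen"))   # ['tap_menu', 'tap_settings', 'tap_privacy']
```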
The ability to repeat the execution of a program is a fundamental requirement in many areas of computing, from computer system evaluation to software engineering. Reproducing executions of mobile apps, in particular, has proven difficult under real-life scenarios due to multiple sources of external inputs and the interactive nature of the apps. Previous works that provide record/replay functionality for mobile apps are restricted to particular input sources (e.g., touchscreen events) and present deployment challenges due to intrusive modifications to the underlying software stack. Moreover, due to their reliance on record and replay of device-specific events, the recorded executions cannot be reliably reproduced across different platforms.
In this paper, we present a new practical approach, RandR, for record and replay of Android applications. RandR captures and replays multiple sources of input (i.e., UI and network) without requiring source code (OS or app), administrative device privileges, or any special platform support. RandR achieves these qualities by instrumenting a select set of methods at runtime within an application’s own sandbox. In addition, to enable portability of recorded executions across different platforms for replay, RandR contextualizes UI events as interactions with particular UI components (e.g., a button) as opposed to relying on platform specific features (e.g., screen coordinates). We demonstrate RandR’s accurate cross-platform record and replay capabilities using over 30 real-world Android apps across a variety of platforms including emulators as well as commercial off-the-shelf mobile devices deployed in real life.
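The following sketch illustrates the general idea of contextualizing UI events as component interactions rather than raw coordinates; the event format, driver abstraction, and UI-tree shape are our own assumptions for illustration, not RandR's implementation:

```python
# Illustrative sketch: record a UI event against a named widget, then resolve the
# widget in whatever UI tree the replay device shows, instead of replaying raw
# screen coordinates.
from dataclasses import dataclass

@dataclass
class RecordedEvent:
    action: str          # e.g. "click", "set_text"
    widget_id: str       # resource id / accessibility label of the target widget
    payload: str = ""    # text to type, if any

def resolve(widget_id, ui_tree):
    """Find the node with the given id in the replay device's current UI tree."""
    for node in ui_tree:
        if node["id"] == widget_id:
            return node
    raise LookupError(f"widget {widget_id!r} not on screen")

def replay(event, ui_tree, driver):
    node = resolve(event.widget_id, ui_tree)
    if event.action == "click":
        driver.click(node["center"])                 # tap point derived at replay time
    elif event.action == "set_text":
        driver.type_into(node["center"], event.payload)

class PrintDriver:                                   # hypothetical device abstraction
    def click(self, center): print("click at", center)
    def type_into(self, center, text): print("type", repr(text), "at", center)

ui_tree = [{"id": "btn_login", "center": (540, 1200)}]
replay(RecordedEvent("click", "btn_login"), ui_tree, PrintDriver())
```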
Given the event-driven and framework-based architecture of Android apps, finding the ordering of callbacks executed by the framework remains a problem that affects every tool that requires inter-callback reasoning. Previous work has focused on the ordering of callbacks related to the Android components and GUI events. But the execution of callbacks can also come from direct calls of the framework (API calls). This paper defines a novel program representation, called Callback Control Flow Automata (CCFA), that specifies the control flow of callbacks invoked via a variety of sources. We present an analysis to automatically construct CCFAs by combining two callback control flow representations developed in previous research, namely, Window Transition Graphs (WTGs) and Predicate Callback Summaries (PCSs). To demonstrate the usefulness of our representation, we integrated CCFAs into two client analyses: a taint analysis using FLOWDROID, and a value-flow analysis that computes source and sink pairs of a program. Our evaluation shows that we can compute CCFAs efficiently and that CCFAs improved the callback coverage over WTGs. As a result of using CCFAs, we obtained 33 more true positive security leaks than FLOWDROID over a total of 55 apps we have run. With a low false positive rate, we found that 22.76% of the source-sink pairs we computed are located in different callbacks and that 31 out of 55 apps contain source-sink pairs spreading across components. Thus, callback control flow graphs and inter-callback analysis are indeed important. Although this paper mainly uses Android, we believe that CCFAs can be useful for modeling the control flow of callbacks for other event-driven, framework-based systems.
Link to Publication: https://ieeexplore.ieee.org/document/8613913
Malware scanning of an app market is expected to be scalable and effective. However, existing approaches use either syntax-based features, which can be evaded by transformation attacks, or semantic-based features, which are usually extracted by performing expensive program analysis. Therefore, in this paper, we propose a lightweight graph-based approach to perform Android malware detection. Instead of traditional heavyweight static analysis, we treat function call graphs of apps as social networks and perform social-network-based centrality analysis to represent the semantic features of the graphs. Our key insight is that centrality provides a succinct and fault-tolerant representation of graph semantics, especially for graphs with a certain amount of inaccurate information (e.g., inaccurate call graphs). We implement a prototype system, MalScan, and evaluate it on datasets of 15,285 benign samples and 15,430 malware samples. Experimental results show that MalScan is capable of detecting Android malware with up to 98% accuracy in under one second, which is more than 100 times faster than two state-of-the-art approaches, namely MaMaDroid and Drebin. We also demonstrate the feasibility of MalScan for market-wide malware scanning by performing a statistical study on over 3 million apps. Finally, in a corpus collected from the Google Play app market, MalScan is able to identify 18 zero-day malware samples, including samples that can evade detection by existing tools.
ACM SIGSOFT Distinguished Paper Award
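As a rough illustration of the centrality-as-features idea behind MalScan (not the tool's actual feature set), the sketch below turns a call graph into a fixed-length vector of centrality values over a hypothetical list of sensitive APIs; the graph, API list, and downstream classifier choice are all assumptions:

```python
# Illustrative sketch: call-graph centrality values over "sensitive" API nodes
# as a fixed-length feature vector for a standard classifier.
import networkx as nx

SENSITIVE_APIS = ["sendTextMessage", "getDeviceId", "openConnection"]

def centrality_features(call_graph: nx.DiGraph):
    degree = nx.degree_centrality(call_graph)
    closeness = nx.closeness_centrality(call_graph)
    feats = []
    for api in SENSITIVE_APIS:            # one slot per sensitive API, per measure
        feats.append(degree.get(api, 0.0))
        feats.append(closeness.get(api, 0.0))
    return feats

g = nx.DiGraph()
g.add_edges_from([("onCreate", "getDeviceId"),
                  ("onCreate", "buildPayload"),
                  ("buildPayload", "sendTextMessage")])
print(centrality_features(g))
# One such vector per app could then be fed to an off-the-shelf classifier
# (e.g. a random forest) trained on benign/malware labels.
```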
The IFDS algorithm can be compute- and memory-intensive for some large programs, often running for a long time (longer than expected) or terminating prematurely after some time and/or memory budgets have been exhausted. In the latter case, the corresponding IFDS data-flow analyses may suffer from false negatives and/or false positives. To improve this, we introduce a sparse alternative to the traditional IFDS algorithm. Instead of propagating the data-flow facts across all the program points along the program's (interprocedural) control flow graph, we propagate every data-flow fact directly to its next possible use points along its own sparse control flow graph constructed on the fly, thus significantly reducing both the time and memory requirements incurred by the traditional IFDS algorithm. In our evaluation, we compare FLOWDROID, a taint analysis performed by using the traditional IFDS algorithm, with our sparse incarnation, SPARSEDROID, on a set of 40 selected Android apps. For the time budget (5 hours) and memory budget (220 GB) allocated per app, SPARSEDROID can run every app to completion while FLOWDROID terminates prematurely for 9 apps, resulting in an average speedup of 22.0x. This implies that when used as a market-level vetting tool, SPARSEDROID can finish analyzing these 40 apps in 2.13 hours (by issuing 228 leak warnings) while FLOWDROID manages to analyze only 30 apps in the same time period (by issuing only 147 leak warnings).
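To illustrate the sparse idea in miniature (this is far simpler than SPARSEDROID's IFDS implementation and uses hypothetical statement names), the sketch below jumps a taint fact directly to the next statements that use the tainted variable instead of pushing it through every intervening statement:

```python
# Illustrative sketch: worklist propagation of taint facts along def-use chains.
from collections import deque

def sparse_taint(defs_to_uses, sources, sinks):
    """defs_to_uses maps (stmt, var) -> next statements that use var."""
    reached_sinks, worklist, seen = set(), deque(sources), set(sources)
    while worklist:
        stmt, var = worklist.popleft()
        if (stmt, var) in sinks:
            reached_sinks.add((stmt, var))
        for nxt in defs_to_uses.get((stmt, var), []):
            if (nxt, var) not in seen:
                seen.add((nxt, var))
                worklist.append((nxt, var))
    return reached_sinks

# Toy def-use chains: s1 taints x, x is next used at s4 (skipping s2 and s3), then at s7.
chains = {("s1", "x"): ["s4"], ("s4", "x"): ["s7"]}
print(sparse_taint(chains, sources={("s1", "x")}, sinks={("s7", "x")}))
```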
In the app releasing process, Android requires all apps to be digitally signed with a certificate before distribution. Android uses this certificate to identify the author and ensure the integrity of an app. However, a number of signature issues have been reported recently, threatening the security and privacy of Android apps. In this paper, we present the first large-scale systematic measurement study on issues related to Android app signatures. We first create a taxonomy covering four types of app signing issues (21 anti-patterns in total), including vulnerabilities, potential attacks, release bugs, and compatibility issues. Then we develop an automated tool to characterize signature-related issues in over 5 million app items (3 million distinct apks) crawled from Google Play and 24 alternative Android app markets. Our empirical findings suggest that although Google has introduced apk-level signing schemes (V2 and V3) to overcome some of the known security issues, more than 93% of the apps still use only the JAR signing scheme (V1), which poses great security threats. We also reveal that 7% to 45% of the apps in the 25 studied markets contain at least one signing issue, while a large number of apps have been exposed to security vulnerabilities, attacks, and compatibility issues. Among them, a considerable number are popular apps with millions of downloads. Finally, our evolution analysis suggests that most of the issues were not mitigated after considerable amounts of time across markets. These results demonstrate the urgency of detecting and repairing app signing issues.
Mobile developers use OAuth APIs to implement Single Sign-On services. However, the OAuth protocol was originally designed for authorization on third-party websites, not for authenticating users in third-party mobile apps. As a result, it is challenging for developers to implement mobile OAuth securely. Vulnerabilities due to misunderstanding of OAuth and developer inexperience can lead to data leakage and account breaches. In this paper, we perform an empirical study on the usage of OAuth APIs in Android applications and their security implications. In particular, we develop OAUTHLINT, which incorporates a query-driven static analysis to automatically check programs on the Google Play marketplace. OAUTHLINT takes as input an anti-protocol that encodes a vulnerable pattern extracted from the OAuth specifications and a program P. Our tool then generates a counter-example if the anti-protocol can match a trace of P's possible executions. To evaluate the effectiveness of our approach, we perform a systematic study on 600+ popular apps which have more than 10 million downloads. The evaluation shows that 101 (32%) out of 316 applications that use OAuth APIs make at least one security mistake.
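A minimal sketch of the general anti-protocol-matching idea, assuming an abstract event trace and a well-known misuse pattern (authenticating a backend login on a bare access token); the query language, events, and matching logic are illustrative stand-ins, not OAUTHLINT's:

```python
# Illustrative sketch: check whether an abstract execution trace contains the
# ordered events of a simple "anti-protocol".
ANTI_PROTOCOL = ["receive_access_token", "call_backend_login"]   # ordered pattern

def matches(trace, pattern):
    """True if the pattern occurs in order (not necessarily contiguously)."""
    it = iter(trace)
    return all(any(ev == step for ev in it) for step in pattern)

trace = ["start_oauth_flow", "receive_access_token",
         "fetch_profile", "call_backend_login"]
if matches(trace, ANTI_PROTOCOL):
    print("potential misuse: backend login keyed on the bare access token")
```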
Increasing interest in securing the Android ecosystem has spawned numerous efforts to assist app developers in building secure apps. These efforts have resulted in tools and techniques capable of detecting vulnerabilities and malicious behaviors in apps. However, there has been no evaluation of the effectiveness of these tools and techniques in detecting known vulnerabilities. The absence of such evaluations puts app developers at a disadvantage when choosing security analysis tools to secure their apps.
In this regard, we evaluated the effectiveness of vulnerability detection tools for Android apps. We reviewed 64 tools and empirically evaluated 14 vulnerability detection tools against 42 known unique vulnerabilities captured by the Ghera benchmarks, which are composed of both vulnerable and secure apps. Of the 20 observations from the evaluation, the main observation is that existing vulnerability detection tools for Android apps are very limited in their ability to detect known vulnerabilities — all of the evaluated tools together could detect only 30 of the 42 known unique vulnerabilities.
More effort is required if security analysis tools are to help developers build secure apps. We hope the observations from this evaluation will help app developers choose appropriate security analysis tools and persuade tool developers and researchers to identify and address limitations in their tools and techniques. We also hope this evaluation will spark a conversation in the software engineering and security communities about requiring a more rigorous and explicit evaluation of security analysis tools and techniques.
Link to Publication: https://rdcu.be/bM7na
To detect specific types of bugs and vulnerabilities, static analysis tools must be correctly configured with security-relevant methods (SRM), e.g., sources, sinks, sanitizers and authentication methods—usually a very labour-intensive and error-prone process. This work presents the semi-automated tool SWAN_ASSIST, which aids the configuration with an IntelliJ plugin based on active machine learning. It integrates our novel automated machine-learning approach SWAN, which identifies and classifies Java SRM. SWAN_ASSIST further integrates user feedback through iterative learning. SWAN_ASSIST aids developers by asking them to classify, at each point in time, exactly those methods whose classification best impacts the classification result. Our experiments show that SWAN_ASSIST classifies SRM with high precision and requires relatively low effort from the user. A video demo of SWAN_ASSIST can be found at https://youtu.be/fSyD3V6EQOY. The source code is available at https://github.com/secure-software-engineering/swan.
This paper presents Sip4J, a fully automated, scalable and effective tool to automatically generate access permission contracts for a sequential Java program. Access permission contracts, which represent the dependencies of code blocks, have been frequently used to enable concurrent execution of sequential programs. Those permission contracts, unfortunately, need to be manually created by programmers, which is known to be time-consuming, laborious and error-prone. To mitigate these manual efforts, Sip4J performs inter-procedural static analysis of Java source code to automatically extract the implicit dependencies in the program and subsequently leverages them to automatically generate access permission contracts, following the Design by Contract principle. The inferred specifications are then used to identify the concurrent (immutable) methods in the program. Experimental results further show that Sip4J is useful and effective in generating access permission contracts for sequential Java programs.
The exception mechanism is widely used in cloud systems. This is mainly because it separates error-handling code from the main business logic. However, the huge space of potential error conditions and the sophisticated logic of cloud systems present a big hurdle to the correct use of the exception mechanism. As a result, mistakes in exception use may lead to severe consequences, such as system downtime and data loss. To address this issue, the community direly needs a better understanding of exception-related bugs, i.e., eBugs, which are caused by the incorrect use of the exception mechanism in cloud systems.
In this paper, we present a comprehensive study on 210 eBugs from six widely-deployed cloud systems, including Cassandra, HBase, HDFS, Hadoop MapReduce, YARN, and ZooKeeper. For all the studied eBugs, we analyze their triggering conditions, root causes, bug impacts, and their relations. To the best of our knowledge, this is the first study on eBugs in cloud systems, and the first eBug study that focuses on triggering conditions. We find that eBugs are severe in cloud systems: 74% of eBugs affect system availability or integrity. Luckily, exposing eBugs through testing is possible: 54% of eBugs are triggered by non-semantic conditions such as network errors; 40% of eBugs can be triggered by simulating the conditions at simple system states. Interestingly, we find that exception-triggering conditions are useful for detecting eBugs. Based on these findings, we build a static analysis tool, called DIET, which reports 31 bugs and bad practices from the latest versions of the studied systems. So far developers have confirmed that 23 of them are “previously-unknown” bugs or bad practices.
Large-scale online systems are complex, fast-evolving, and hardly bug-free despite the testing efforts. Backend system monitoring cannot detect many types of issues, such as UI-related bugs, bugs with small impact on backend system indicators, or errors from third-party co-operating systems. However, users are good informers of such issues: they will provide feedback for any type of issue. This experience paper discusses our design of iFeedback, a tool to perform real-time issue detection based on user feedback texts. Unlike traditional approaches that analyze user feedback with computation-intensive natural language processing algorithms, iFeedback focuses on fast issue detection, which can serve as a system life-condition monitor. In particular, iFeedback extracts word-combination-based indicators from feedback texts. This allows iFeedback to perform fast system anomaly detection with sophisticated machine learning algorithms. iFeedback then further summarizes the texts with an aim to effectively present the anomaly to the developers for root cause analysis. We present our representative experiences in successfully applying iFeedback to tens of large-scale production online service systems over ten months.
Rigorous performance engineering traditionally assumes measuring on bare-metal environments to control for as many confounding factors as possible. Unfortunately, some researchers and practitioners might not have access, knowledge, or funds to operate dedicated performance-testing hardware, making public clouds an attractive alternative. However, shared public cloud environments are inherently unpredictable in terms of the system performance they provide. In this study, we explore the effects of cloud environments on the variability of performance test results and to what extent slowdowns can still be reliably detected even in a public cloud. We focus on software microbenchmarks as an example of performance tests and execute extensive experiments on three different well-known public cloud services (AWS, GCE, and Azure) using three different cloud instance types per service. We also compare the results to a hosted bare-metal offering from IBM Bluemix. In total, we gathered more than 4.5 million unique microbenchmarking data points from benchmarks written in Java and Go. We find that the variability of results differs substantially between benchmarks and instance types (by a coefficient of variation from 0.03% to >100%). This variability originates from three sources (variability inherent to a benchmark, between trials on the same instance, and between different instances), and different benchmark-environment configurations suffer to very different degrees from any of these sources. The bare-metal instance expectedly produces very stable results; however, AWS is typically not substantially less stable. We further study falsely-reported performance changes and minimal-detectable slowdowns along two dimensions in all environments: (1) two deployment strategies, executing test and control group on the same (trial-based sampling) and on different (instance-based sampling) instances, and (2) two state-of-the-art statistical tests, i.e., Wilcoxon rank-sum with Cliff's Delta effect sizes and bootstrapped overlapping confidence intervals. We show that identical measurements (e.g., same benchmark executed in the same environment without any code changes) suffer from falsely-reported changes when they are taken from a small number of instances and trials. Nonetheless, an increase in samples yields a low number of false positives (i.e., <5%) for all studied benchmarks and environments, both sampling strategies, and both statistical tests. Regarding minimal-detectable slowdowns, our experiments confirm that testing in a trial-based fashion leads to substantially better results than executing on different instances. With trial-based sampling, slowdowns of 10% or less are detectable with high confidence using both statistical tests. Finally, our results indicate that Wilcoxon rank-sum manages to detect smaller slowdowns with fewer instances than overlapping confidence intervals: as few as 5 instances are sufficient to find slowdowns in the range of 5% to 10%.
Link to Publication: https://link.springer.com/article/10.1007/s10664-019-09681-1
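For readers unfamiliar with the statistical machinery mentioned above, here is a minimal, self-contained sketch of one of the two tests (Wilcoxon rank-sum via SciPy's Mann-Whitney U, plus a hand-rolled Cliff's Delta) on synthetic latency data; the thresholds and data are illustrative, not the study's actual analysis scripts:

```python
# Illustrative sketch: detect a slowdown between control and test measurements.
import random
from scipy.stats import mannwhitneyu   # Wilcoxon rank-sum for independent samples

def cliffs_delta(xs, ys):
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

random.seed(0)
control = [random.gauss(100.0, 3.0) for _ in range(50)]   # baseline latencies (ms)
test    = [random.gauss(107.0, 3.0) for _ in range(50)]   # ~7% slowdown injected

stat, p = mannwhitneyu(test, control, alternative="greater")
delta = cliffs_delta(test, control)
if p < 0.05 and abs(delta) >= 0.33:                        # thresholds are illustrative
    print(f"slowdown detected: p={p:.3g}, Cliff's delta={delta:.2f}")
```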
[Experience Paper] In recent years, online service systems have become increasingly popular. Incidents in these systems can cause significant economic loss and customer dissatisfaction. Incident triage, which is the process of assigning a new incident to the responsible team, is vitally important for quick recovery of the affected service. Our industry experience shows that in practice, incident triage is not conducted only once in the beginning, but is a continuous process, in which engineers from different teams have to discuss an incident intensively among themselves and continuously refine the incident-triage result until the correct assignment is reached. In particular, our empirical study on 8 real online service systems shows that the percentage of incidents that were reassigned ranges from 5.43% to 68.26%, and the number of discussion items before achieving the correct assignment is up to 11.32 on average. To improve the existing incident triage process, in this paper, we propose DeepCT, a Deep learning based approach to automated Continuous incident Triage. DeepCT incorporates a novel GRU-based model with an attention mechanism and a revised loss function, which can incrementally learn knowledge from discussions and update incident-triage results. Using DeepCT, the correct incident assignment can be achieved with fewer discussions. We conducted an extensive evaluation of DeepCT on 14 large-scale online service systems at a multinational technology company M. The results show that DeepCT is able to achieve more accurate and efficient incident triage, e.g., the average accuracy of identifying the responsible team is 0.641 to 0.729 as the number of discussion items increases from 1 to 5. Also, DeepCT statistically significantly outperforms the state-of-the-art bug triage approach.
Recent trends in Web development demonstrate an increased interest in serverless applications, i.e., applications that utilize computational resources provided by cloud services on demand instead of requiring traditional server management. This approach enables better resource management while being scalable, reliable, and cost-effective. However, it comes with a number of organizational and technical difficulties which stem from the interaction between the application and the cloud infrastructure, for example, having to set up a recurring task of re-uploading updated files. In this paper, we present Kotless – a Kotlin Serverless Framework. Kotless is a cloud-agnostic toolkit that solves these problems by interweaving the deployed application into the cloud infrastructure and automatically generating the necessary deployment code. This frees developers from spending their time integrating and managing their applications, letting them focus on developing them instead. Kotless has proven its capabilities and has already been used to develop several serverless applications now in production. Its source code is available at https://github.com/JetBrains/kotless, and a tool demo can be found at https://www.youtube.com/watch?v=IMSakPNl3TY.
Workflow underlies most process automation software, such as that for product lines, business processes, and scientific computing. However, current Cloud Computing based workflow systems cannot support real-time applications due to network latency, which limits their application in many IoT systems such as smart healthcare and smart traffic. Fog Computing extends the Cloud by providing virtualized computing resources close to the End Devices so that the response time of accessing computing resources can be reduced significantly. However, how to most effectively manage heterogeneous resources and different computing tasks in the Fog is a big challenge. In this paper, we introduce FogWorkflowSim, an efficient and extensible toolkit for automatically evaluating resource and task management strategies in Fog Computing with simulated user-defined workflow applications. Specifically, FogWorkflowSim is able to: 1) automatically set up a simulated Fog Computing environment for workflow applications; 2) automatically execute user-submitted workflow applications; 3) automatically evaluate and compare the performance of different computation offloading and task scheduling strategies with three basic performance metrics, namely time, energy, and cost. FogWorkflowSim can serve as an effective experimental platform for researchers in Fog-based workflow systems as well as practitioners interested in adopting Fog Computing and workflow systems for their new software projects.
Complex software systems often provide a large number of parameters so that users can configure them for their specific application scenarios. However, configuration tuning requires a deep understanding of the software system, far beyond the abilities of typical system users. To address this issue, many existing approaches focus on exploring and learning good performance estimation models. The accuracy of such models often suffers when the number of available samples is small, a thorny challenge under a given tuning-time constraint. By contrast, we hypothesize that good configurations often share certain hidden structures. Therefore, instead of trying to improve the performance estimation of a given configuration, we focus on capturing the hidden structures of good configurations and utilizing the learned structure to generate potentially better configurations. We propose ACTGAN to achieve this goal. We have implemented and evaluated ACTGAN using 17 workloads with eight different software systems. Experimental results show that ACTGAN outperforms default configurations by 76.22% on average, and six state-of-the-art configuration tuning algorithms by 6.58% to 64.56%. Furthermore, the ACTGAN-generated configurations are often better than those used in training and exhibit certain features consistent with domain knowledge, both of which support our hypothesis.
Nowadays software tends to come in many different, yet similar variants, often derived from a common codebase via clone-and-own. Family-based analysis strategies have recently shown very promising potential for improving efficiency in applying quality-assurance techniques to such variant-rich programs, as compared to variant-by-variant approaches. Unfortunately, these strategies require a single program representation superimposing all program variants in a syntactically well-formed, semantically sound and variant-preserving manner, which is usually not available and hard to obtain manually in practice. In this paper, we present a novel methodology, called SiMPOSE, for automatically generating superimpositions of existing program variants to facilitate family-based analyses of variant-rich software. To this end, we propose a novel N-way model-merging methodology to integrate the control-flow automaton (CFA) representations of N given variants of a C program into one unified CFA representation. CFAs constitute a unified program abstraction used by many recent software-analysis tools for automated quality assurance. To cope with the inherent complexity of N-way model-merging, our approach (1) utilizes principles of similarity-propagation to reduce the number of potential N-way matches, and (2) enables us to decompose a set of N variants into arbitrary subsets and to incrementally derive an N-way superimposition from partial superimpositions. We apply our tool implementation of SiMPOSE to a selection of realistic C programs, frequently considered for experimental evaluation of program-analysis techniques. In particular, we investigate applicability and efficiency/effectiveness trade-offs of our approach by applying SiMPOSE in the context of family-based unit-test generation as well as model-checking as sample program-analysis techniques. Our experimental results reveal very impressive efficiency improvements by an average factor of up to 2.6 for test-generation and up to 2.4 for model-checking under stable effectiveness, as compared to variant-by-variant approaches, thus amortizing the additional effort required for merging. In addition, our results show that merging all N variants at once produces, in almost all cases, clearly more precise results than incremental step-wise 2-way merging. Finally, our comparison with major existing N-way merging techniques shows that SiMPOSE constitutes, in most cases, the best efficiency/effectiveness trade-off.
Link to Publication: https://dl.acm.org/citation.cfm?doid=3343019.3313789
Code snippets are prevalent, but are hard to reuse because they often lack an accompanying environment configuration. Most are not actively maintained, allowing for drift between the most recent possible configuration and the code snippet as the snippet becomes out-of-date over time. Recent work has identified the problem of validating and detecting out-of-date code snippets as the most important consideration for code reuse. However, determining if a snippet is correct, but simply out-of-date, is a non-trivial task. In the best case, breaking changes are well documented, allowing developers to manually determine when a code snippet contains an out-of-date API usage. In the worst case, determining if and when a breaking change was made requires an exhaustive search through previous dependency versions.
We present V2, a strategy for determining if a code snippet is out-of-date by detecting discrete instances of configuration drift, where the snippet uses an API which has since undergone a breaking change. Each instance of configuration drift is classified by a failure encountered during validation and a configuration patch, consisting of dependency version changes, which fixes the underlying fault. V2 uses feedback-directed search to explore the possible configuration space for a code snippet, reducing the number of potential environment configurations that need to be validated. When run on a corpus of public Python snippets from prior research, V2 identifies 248 instances of configuration drift.
Unexpected interactions among features induce most bugs in a configurable software system. Exhaustively analyzing all of the exponentially many possible configurations is prohibitively costly. Thus, various sampling techniques have been proposed to systematically narrow down the exponential number of legal configurations to be analyzed. Since analyzing all selected configurations can require a huge amount of effort, fault-based configuration prioritization, which helps detect faults earlier, can yield practical benefits in quality assurance. In this paper, we propose CoPro, a novel formulation of feature-interaction bugs via common program entities enabled/disabled by the features. Leveraging that formulation, we develop an efficient feature-interaction-aware configuration prioritization technique for a configurable system by ranking the configurations according to their total number of potential bugs. We conducted several experiments to evaluate CoPro's ability to detect configuration-related bugs in a public benchmark. We found that CoPro outperforms the state-of-the-art configuration prioritization techniques when applied on top of advanced sampling algorithms. In 78% of the cases, CoPro ranks the buggy configurations in the top 3 positions. Interestingly, CoPro is able to detect 17 not-yet-discovered feature-interaction bugs.
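The sketch below conveys the flavor of ranking configurations by entities shared across features, on the intuition that such entities are where feature-interaction bugs tend to hide; the scoring, data structures, and feature/entity names are our own simplifications, not CoPro's exact formulation:

```python
# Illustrative sketch: prioritize configurations that enable more "shared"
# program entities, i.e., entities controlled by more than one feature.
def shared_entities(feature_to_entities):
    first_owner, shared = {}, set()
    for feature, entities in feature_to_entities.items():
        for e in entities:
            if e in first_owner and first_owner[e] != feature:
                shared.add(e)
            first_owner.setdefault(e, feature)
    return shared

def rank_configurations(configs, feature_to_entities):
    shared = shared_entities(feature_to_entities)
    def score(config):   # shared entities enabled by this configuration
        enabled = set().union(*(feature_to_entities[f] for f in config))
        return len(enabled & shared)
    return sorted(configs, key=score, reverse=True)

# Hypothetical features and the program entities they enable/disable.
feature_to_entities = {"ENCRYPT": {"buf_init", "cipher"},
                       "COMPRESS": {"buf_init", "deflate"},
                       "LOGGING": {"log_write"}}
configs = [{"LOGGING"}, {"ENCRYPT", "COMPRESS"}, {"ENCRYPT", "LOGGING"}]
print(rank_configurations(configs, feature_to_entities))
```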
The significant rise in the complexity and configurations of software systems potentially requires a large number of test cases to test the systems' configurations. To deal with this challenge, we present a novel automated approach (termed SBI) to implant existing test cases of software systems with the functionality to cover untested configurations, thus significantly reducing the effort needed to write new test cases for those configurations. Our automated approach consists of two key steps, implemented as software components termed Test Case Analyzer and Test Case Implanter. The Test Case Analyzer automatically analyzes the code of test cases and generates a program dependence graph required by the second component. The Test Case Implanter uses a multi-objective search approach that searches for the most suitable test cases for implantation using search operators (i.e., selection, crossover, and mutation operators), followed by implanting the selected test cases by applying mutation operations that add, modify, or delete test case source code. To guide the search, we defined various objectives: the number of configuration variable values covered, pairwise coverage of parameter values of test API commands, the number of implanted test cases, the number of changed statements, and the estimated execution time.
To assess the cost-effectiveness of our approach, we applied it to one industrial case study of a video conferencing system and one open source case study. We experimented with our approach (i.e., SBI) using three search algorithms: Non-dominated Sorting Genetic Algorithm II (NSGA-II), weight-based genetic algorithm (WGA), and random search (RS). In general, we found that the test cases implanted by SBI with the three algorithms had significantly higher configuration coverage than the original test cases for both case studies. When comparing the three algorithms within our approach, we found that NSGA-II performed the best. Specifically, SBI with NSGA-II managed to achieve, on average, 19.3% higher coverage of configuration variable values and 57.0% higher pairwise coverage of parameter values of test API commands.
Link to Publication: https://www.sciencedirect.com/science/article/abs/pii/S0950584919300540
The performance of a software system plays a crucial role in user perception. Learning from the history of a software system's performance behavior not only helps discover and locate performance bugs, but also helps identify evolutionary performance patterns and general trends, such as when technical debt accumulates in a slow but steady performance degradation. Exhaustive regression testing is usually impractical, because rigorous performance benchmarking requires executing a realistic workload per commit, which results in large execution times. In this paper, we propose a novel active revision sampling approach, which aims at tracking and understanding a system's performance history by approximating the performance behavior of a software system across all of its revisions. In a nutshell, we iteratively sample and measure the performance of specific revisions that help us build an exact performance-evolution model, and we use Gaussian Process models to assess in which revision ranges our model is most uncertain, with the goal of sampling further revisions for measurement. We have conducted an empirical analysis of the evolutionary performance behavior, modeled as a time series, of the history of 6 real-world software systems. Our evaluation demonstrates that Gaussian Process models are able to accurately estimate the performance-evolution history of real-world software systems with only a few measurements and to reveal interesting behaviors and trends.
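A minimal sketch of such an uncertainty-driven sampling loop, using scikit-learn's Gaussian Process regressor on a synthetic performance history (the kernel, budget, and injected regression are illustrative assumptions, not the paper's setup):

```python
# Illustrative sketch: measure next wherever the Gaussian Process is most uncertain.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def true_performance(rev):             # hidden "ground truth" with a regression at revision 60
    return 100.0 + (8.0 if rev >= 60 else 0.0) + np.random.normal(0, 0.5)

revisions = np.arange(100)
measured = {0: true_performance(0), 99: true_performance(99)}   # start with the endpoints

for _ in range(10):                    # measurement budget
    X = np.array(sorted(measured)).reshape(-1, 1)
    y = np.array([measured[r] for r in sorted(measured)])
    gp = GaussianProcessRegressor(kernel=RBF(10.0) + WhiteKernel(1.0)).fit(X, y)
    _, std = gp.predict(revisions.reshape(-1, 1), return_std=True)
    candidates = [r for r in revisions if r not in measured]
    next_rev = max(candidates, key=lambda r: std[r])            # highest predictive uncertainty
    measured[next_rev] = true_performance(next_rev)

print("sampled revisions:", sorted(measured))
```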
Modern web applications rely heavily on databases to query and update information. To ease development efforts, Object Relational Mapping (ORM) frameworks provide an abstraction that lets developers manage databases in the same object-oriented programming languages they already use. Prior studies have shown that there are various types of performance issues caused by inefficient accesses to databases via different ORM frameworks (e.g., Hibernate and ActiveRecord). However, it is not clear whether the reported performance anti-patterns (common performance issues) generalize across frameworks. In particular, there is no study focusing on detecting performance issues for applications written in PHP, which is the programming language of choice for the majority (79%) of web applications. In this experience paper, we detail our process of conducting performance-aware refactoring of an industrial web application written in Laravel, the most popular web framework in PHP. We have derived a complete catalog of 17 performance anti-patterns based on prior research and our experimentation. We have found that some of the reported anti-patterns and refactoring techniques are framework- or programming-language-specific, whereas others are general. The performance impact of the anti-pattern instances is highly dependent on the actual usage context (workload and database settings). When communicating the performance differences before and after refactoring, the results of complex statistical analyses can sometimes be confusing; instead, developers usually prefer more intuitive measures such as percentage improvement. Experiments show that our refactoring techniques can achieve speedups of up to 1319% and 1417% for the industrial and the open source applications, respectively, under various scenarios.
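For readers unfamiliar with this class of issues, below is a generic illustration of one well-known database-access anti-pattern shape (the "N+1 query" pattern) and its join-based refactoring, written with plain sqlite3 in Python rather than Laravel/Eloquent; it is a textbook example, not an entry quoted from the paper's catalog:

```python
# Generic illustration of an ORM-style N+1 query anti-pattern and its refactoring.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE authors(id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts(id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'ada'), (2, 'bob');
    INSERT INTO posts VALUES (1, 1, 'p1'), (2, 1, 'p2'), (3, 2, 'p3');
""")

# Anti-pattern: one query for the authors, then one query per author (N+1 round trips).
for author_id, name in db.execute("SELECT id, name FROM authors"):
    posts = db.execute("SELECT title FROM posts WHERE author_id = ?", (author_id,)).fetchall()
    print(name, [t for (t,) in posts])

# Refactored: a single joined query (what ORM "eager loading" typically generates).
rows = db.execute("""SELECT a.name, p.title FROM authors a
                     JOIN posts p ON p.author_id = a.id ORDER BY a.id""").fetchall()
print(rows)
```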
Designing field-representative load tests is an essential step for the quality assurance of large-scale systems. Practitioners may capture user behaviour at different levels of granularity. A coarse-grained load test may miss detailed user behaviour, leading to a non-representative load test; while an extremely fine-grained load test would simply replay user actions step by step, leading to load tests that are costly to develop, execute and maintain. Workload recovery is at the core of these load tests. Prior research often captures the workload as the frequency of user actions. However, there is much valuable information in the context and sequences of user actions. Such richer information would ensure that load tests that leverage such workloads are more field-representative. In this experience paper, we study the use of different granularities of user behaviour, i.e., basic user actions, basic user actions with contextual information and user action sequences with contextual information, when recovering workloads for use in the load testing of large-scale systems. We propose three approaches that are based on the three granularities of user behaviour and evaluate our approaches on four subject systems, namely Apache James, OpenMRS, Google Borg, and an ultra-large-scale industrial system (SA) from CompanyA. Our results show that our approach that is based on user action sequences with contextual information outperforms the other two approaches and can generate more representative load tests with similar throughput and CPU usage to the original field workload (i.e., mostly statistically insignificant or with small/trivial effect sizes). Such representative load tests are generated based on only a small number of clusters of users, leading to a low cost of conducting/maintaining such tests. Finally, we demonstrate that our approaches can detect injected users in the original field workloads with high precision and recall. Our paper demonstrates the importance of user action sequences with contextual information in the workload recovery of large-scale systems.
As data volume and complexity grow at an unprecedented rate, the performance of data analytics programs is becoming a major concern for developers. We observed that developers sometimes use alternative data analytics APIs to improve program runtime performance while preserving functional equivalence. However, little is known on the characteristics and performance attributes of alternative data analytics APIs. In this paper, we propose a novel approach to extract alternative implementations that invoke different data analytics APIs to solve the same tasks. A key appeal of our approach is that it exploits the comparative structures in Stack Overflow discussions to discover programming alternatives. We show that our approach is promising, as 86% of the extracted code pairs were validated as true alternative implementations. In over 20% of these pairs, the faster implementation was reported to achieve a 10x or more speedup over its slower alternative. We hope that our study offers a new perspective of API recommendation and motivates future research on optimizing data analytics programs.
The performance issues of apps can influence user satisfaction. Therefore, developers exploit application performance management (APM) tools to locate potential performance bottlenecks in their apps. Unfortunately, most developers do not understand how APMs monitor their apps at runtime and whether these APMs have security risks (e.g., confidential data leakage). We demystify APMs by inspecting 25 widely-used APMs that target Android apps. Currently, there is no systematic analysis of APMs in Android apps. To bridge this gap, we build a prototype tool, APMHunter, that can automatically detect the usages of APMs in apps. We conduct a large-scale empirical study on 500,000 Android apps from Google Play to explore the usage patterns of APMs and discover potential misuses of APMs. This study reveals our findings from two perspectives: 1) some APMs still employ deprecated permissions and approaches, which means they cannot work as expected; 2) inappropriate APM utilization can lead to privacy leakages. Thus, based on our research, we suggest that both APM vendors and developers should design and use APMs scrupulously.
We present PeASS (Performance Analysis of Software System versions), a tool for detecting performance changes at source code level that occur between different code versions. By using PeASS, it is possible to identify performance regressions that happened in the past to fix them.
PeASS measures the performance of unit tests in different source code versions. To achieve statistical rigor, measurements are repeated and analyzed using an agnostic t-test. To execute a minimal number of tests, PeASS uses regression test selection.
We evaluate PeASS on a selection of Apache Commons projects and show that 81% of all performance changes covered by unit tests can be found by PeASS. A video presentation is available at https://www.youtube.com/watch?v=RORFEGSCh6Y and PeASS can be downloaded from https://github.com/DaGeRe/peass.
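For illustration, a minimal sketch of the version-to-version comparison step described above, with synthetic measurements and a plain Welch's t-test standing in for the agnostic t-test mentioned in the abstract (this is not PeASS's actual implementation):

```python
# Illustrative sketch: repeated unit-test measurements compared across two versions.
import random
from scipy.stats import ttest_ind

random.seed(1)
old_version = [random.gauss(12.0, 0.4) for _ in range(30)]   # test duration (ms), version N-1
new_version = [random.gauss(13.1, 0.4) for _ in range(30)]   # same test, version N

stat, p = ttest_ind(new_version, old_version, equal_var=False)   # Welch's t-test
if p < 0.01:
    change = (sum(new_version) / len(new_version)) / (sum(old_version) / len(old_version)) - 1
    print(f"performance change detected: {change:+.1%} (p={p:.2g})")
```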
Bug localization is well-known to be a difficult problem in software engineering, and specifically in compiler development, where it is beneficial to reduce the input program to a minimal reproducing example; this technique is more commonly known as delta debugging. What additionally contributes to the problem is that every new programming language has its own unique quirks and foibles, making it nearly impossible to reuse existing tools and approaches with full efficiency. In this experience paper we tackle the delta debugging problem w.r.t. Kotlin, a relatively new programming language from JetBrains. Our approach is based on a novel combination of program slicing, hierarchical delta debugging and Kotlin-specific transformations, which are synergistic with each other. We implemented it in a prototype called ReduKtor and performed an extensive evaluation on both synthetic and real Kotlin programs; we also compared its performance with classic delta debugging techniques. The evaluation results support the practical usability of our approach to Kotlin delta debugging and also show the importance of using both language-agnostic and language-specific techniques to achieve the best reduction efficiency and performance.
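To give a feel for input reduction in general, here is a deliberately simplified, chunk-dropping flavor of delta debugging on a toy "program" and oracle; it is neither ReduKtor's algorithm nor classic ddmin, just a sketch of the reduce-while-it-still-fails loop:

```python
# Simplified reduction loop: repeatedly drop chunks of the input while the bug reproduces.
def reduce_input(lines, still_fails):
    chunk = len(lines) // 2 or 1
    while chunk >= 1:
        i, shrunk = 0, False
        while i < len(lines):
            candidate = lines[:i] + lines[i + chunk:]
            if candidate and still_fails(candidate):
                lines, shrunk = candidate, True       # keep the smaller failing input
            else:
                i += chunk
        if not shrunk:
            chunk //= 2
    return lines

# Toy oracle: the "compiler bug" reproduces whenever both of these lines are present.
def still_fails(lines):
    return "val x = foo()" in lines and "x.bar()" in lines

program = ["import a", "val x = foo()", "println(1)", "x.bar()", "println(2)"]
print(reduce_input(program, still_fails))   # -> ['val x = foo()', 'x.bar()']
```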
The adoption of refactoring techniques for continuous integration has received much less attention from the research community compared to root-canal refactoring to fix quality issues in the whole system. Several recent empirical studies show that developers, in practice, apply refactoring incrementally when they are fixing bugs or adding new features. There is an urgent need for refactoring tools that can support continuous integration and recent development processes such as DevOps that are based on rapid releases.
In this paper, we propose for the first time an intelligent software refactoring bot, called RefBot. Integrated into the version control system (e.g., GitHub), our bot continuously monitors the software repository and is triggered by any open or merge actions on pull requests. The bot analyzes the files changed during that pull request to identify refactoring opportunities using a set of quality attributes, and then finds the best sequence of refactorings to fix the quality issues, if any. The bot recommends all these refactorings through an automatically generated pull request. The developer is able to review the recommendations and their impacts in a detailed report and select the code changes they want to keep or ignore. After this review, the developer can close and approve the merge of the bot's pull request. We quantitatively and qualitatively evaluated the performance and effectiveness of RefBot via a survey conducted with experienced developers who used the bot.
Reactive programming languages and libraries, such as ReactiveX, have been shown to significantly improve software design and have seen substantial industrial adoption in recent years. Asynchronous applications – which are notoriously error-prone to implement and to maintain – greatly benefit from reactive programming because they can be defined in a declarative style, which improves code clarity and extensibility.
In this paper, we tackle the problem of refactoring existing code bases that are designed using traditional abstractions for asynchronous programming. We propose 2Rx, a refactoring tool to automatically convert asynchronous code to reactive programming. Our evaluation on top-starred GitHub projects shows that 2Rx is effective with the most common asynchronous constructs, covering ~94.7% of the projects with asynchronous computations, and it can provide a refactoring for ~91.7% of their occurrences.
Software delivery has seen many shifts, first from single sourcing to multi-vendor global delivery, and onward to crowd-sourced development. Development also increasingly incorporates reusable software components, including open source components, to support an “assemble more, code less” philosophy. As centralized control disperses out to autonomous delivery organizations, the variety of processes and tools used turns transparency into opacity as autonomous teams use different software processes, tools, and metrics, leading to issues like ineffective compliance monitoring, friction prone coordination, and lack of provenance. At Accenture Labs, we conceptualized delivery governance tools that use a notion of ‘software telemetry’ and distributed ledgers to record data from disparate delivery partners in a trusted manner. The tools use smart contracts for automatic enforcement of delivery policies. Furthermore, we use the concept of intelligence augmentation and “smart advisors” to provide role-specific alerts, contextual awareness and remediation actions. In this talk, we will present the conceptual architecture for a trusted software supply chain, illustrative use-cases and demonstration of a proof of concept on automated compliance monitoring, software integrity, and provenance.
For delivering high-quality artifacts within budget and on schedule, software delivery teams ideally should have a holistic, in-process view of the current health and future trajectory of the project. However, such insights need to be at the right level of granularity and typically must be derived from a heterogeneous project environment, in a way that helps development team members with their tasks at hand. Due to client mandates, software delivery project environments employ many disparate tools, and teams tend to be distributed, which makes relevant information retrieval, insight generation, and developer intelligence augmentation fairly complex. In this talk we discuss our journey in this area, spanning facets such as software project modelling and new development metrics, studying developer priorities, adoption of new metrics, and different approaches to developer intelligence augmentation. We end the talk by presenting our exploration of new immersive technologies for human-centered software engineering.
We present Prema, a tool for Precise Requirement Editing, Modeling and Analysis. It can be used in various fields for describing precise requirements using formal notations and performing rigorous analysis. By parsing requirements written in a formal modeling language, Prema obtains a model that aptly depicts the requirements. It also provides different rigorous verification and validation techniques to check whether the requirements meet users’ expectations and to find potential errors. We show that our tool provides a unified environment for writing and verifying requirements without resorting to tools that are not well inter-related. For experimental demonstration, we use the requirements of the automatic train protection (ATP) system of CASCO Signal Co., Ltd., the largest railway signal control system manufacturer in China. The code of the tool cannot be released here because the project is commercially confidential. However, a demonstration video of the tool is available at https://youtu.be/BX0yv8pRMWs.
A popular recommendation to programmers in object-oriented software is to “program to an interface, not an implementation” (PTI). Expected benefits include increased simplicity from abstraction, decreased dependency on implementations, and higher flexibility. Yet, interfaces must be immutable, excessive class hierarchies can be a form of complexity, and “speculative generality” is a known code smell. To advance the empirical knowledge of PTI, we conducted an empirical investigation of 126 Java projects on GitHub, aiming to measure the decreased-dependency benefits (in terms of co-change).
Recent works have considered the problem of log differencing: given two or more execution logs of a system, output a model of their differences. Log differencing has potential applications in software evolution, testing, and security. In this paper we present statistical log differencing, which accounts for the frequencies of behaviors found in the logs. We present two algorithms, s2KDiff for differencing two logs and snKDiff for differencing many logs at once, both presenting their results over a single inferred model. A unique aspect of our algorithms is their use of statistical hypothesis testing: we let the engineer control the sensitivity of the analysis by setting the target distance between probabilities and the statistical significance value, and we report only (and all) the statistically significant differences. Our evaluation shows the effectiveness of our work in terms of soundness, completeness, and performance. It also demonstrates its usefulness via a user study and its potential applications via a case study using real-world logs.
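For intuition only (this is our illustration, not the s2KDiff/snKDiff implementation), the sketch below applies a standard two-proportion z-test to decide whether the probability of one transition differs significantly between two logs; the event pair, counts, and thresholds are made up.

```python
# Two-proportion z-test on the frequency of a transition (e.g., "login -> error") in two logs.
import math
from scipy.stats import norm

def transition_prob_diff_test(k1, n1, k2, n2, alpha=0.05, min_delta=0.1):
    """k_i = occurrences of the transition, n_i = occurrences of the source event in log i.
    Report a difference only if it is statistically significant and at least min_delta apart."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * norm.sf(abs(z))                    # two-sided test
    significant = p_value < alpha and abs(p1 - p2) >= min_delta
    return p1, p2, p_value, significant

print(transition_prob_diff_test(k1=40, n1=200, k2=15, n2=180))
```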
Execution logs record detailed runtime information of software systems and are used as the main data source for many software engineering tasks. As modern software systems evolve into large-scale and complex structures, logs have become a type of fast-growing big data in industry. In particular, such logs often need to be stored for a long time in practice (e.g., a year) in order to analyze recurrent problems or track security issues. However, archiving logs consumes a large amount of storage space and computing resources, which in turn incurs high operational cost. Data compression is essential to reduce the cost of log storage. Traditional compression tools (e.g., gzip) work well for general text but are not tailored to execution logs. In this paper, we propose a novel and effective log compression method, namely logzip. Logzip is capable of extracting hidden structures from logs via fast iterative clustering and further generating coherent intermediate representations that enable more effective compression. We evaluate logzip on five large log datasets of different types, with a total size of 63.6 GB. The results show that, on average, logzip can save about half of the storage space compared with traditional compression tools. Meanwhile, the design of logzip is highly parallel and incurs only negligible overhead. In addition, we share the industrial experience of applying logzip in a global company.
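The core idea of separating the constant template of a log line from its variable parameters before compressing can be pictured with the toy sketch below; it is a heavy simplification of logzip's iterative clustering, not its implementation.

```python
# Toy structure-aware log compression: split lines into templates and parameters, then gzip each stream.
import gzip, re

lines = [
    "2019-08-12 10:01:03 INFO  Connected to 10.0.0.5 in 42 ms",
    "2019-08-12 10:01:07 INFO  Connected to 10.0.0.9 in 17 ms",
    "2019-08-12 10:02:11 WARN  Retrying request id=7831",
]

templates, params = [], []
for line in lines:
    # Replace numeric tokens (timestamps, IPs, ids, durations) with a placeholder.
    templates.append(re.sub(r"\d[\d.:\-]*", "<*>", line))
    params.append(",".join(re.findall(r"\d[\d.:\-]*", line)))

template_ids = {t: i for i, t in enumerate(dict.fromkeys(templates))}
encoded = "\n".join(f"{template_ids[t]}|{p}" for t, p in zip(templates, params))
dictionary = "\n".join(template_ids)

compressed = gzip.compress(dictionary.encode()) + gzip.compress(encoded.encode())
baseline = gzip.compress("\n".join(lines).encode())
print(len(compressed), "vs baseline", len(baseline))   # gains only show on realistic log volumes
```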
Domain models are the most important asset in widely accepted software development approaches, like Domain-Driven Design (DDD), yet those models are still implicitly represented in programs. Model-Driven Engineering (MDE) regards those models as first-order citizens that are amenable to automated analysis and processing, facilitating quality assurance while increasing productivity in software development processes. Although this connection is not new, very few approaches facilitate adoption of MDE tooling without compromising existing value, namely the application's data. Moreover, switching to MDE tooling usually involves re-engineering core parts of an application, hindering backward compatibility and, thereby, continuous integration, while requiring an up-front investment in training in specialized modeling frameworks. In those approaches that overcome this problem, there is no clear indication, from a quantitative point of view, of the extent to which adopting state-of-the-art MDE practices and tooling is feasible or advantageous.
In this work, we advocate a code-first approach to modeling that applies MDE techniques and tools to existing object-oriented software applications while fully preserving the semantics of the original application, which need not be modified. Our approach consists of a semi-automated method for specifying explicit view models over existing object-oriented applications and a conservative extension mechanism that enables the use of such view models at run time, where view model queries are resolved on demand and view model updates are propagated incrementally to the original application. This mechanism enables an iterative, flexible application of MDE tooling to software applications in which metamodels and models do not exist explicitly. An evaluation of this extension mechanism, implemented for Java applications and for view models atop the Eclipse Modeling Framework (EMF), has been conducted with an industry benchmark for decision support systems, analyzing the performance and scalability of the synchronization mechanism. Backward propagation of large updates over very large views is instant.
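In spirit, and greatly simplified (in Python rather than the paper's Java/EMF setting), an on-demand view model over an unmodified application object might look like the following sketch:

```python
# Toy on-demand view model over an unmodified application class (illustration only, not the paper's mechanism).
class Customer:                      # existing application code, left untouched
    def __init__(self, first, last):
        self.first, self.last = first, last

class CustomerView:                  # explicit view model layered on top
    def __init__(self, customer):
        self._c = customer

    @property
    def full_name(self):             # view query, resolved on demand from the original object
        return f"{self._c.first} {self._c.last}"

    @full_name.setter
    def full_name(self, value):      # view update, propagated back to the original object
        self._c.first, self._c.last = value.split(" ", 1)

c = Customer("Ada", "Lovelace")
view = CustomerView(c)
print(view.full_name)                # "Ada Lovelace"
view.full_name = "Grace Hopper"
print(c.first, c.last)               # update propagated to the original application object
```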
Many works infer finite-state models from execution logs. Large models are more accurate but also more difficult to present and understand. Small models are easier to present and understand but are less accurate. In this work we investigate the tradeoff between model size and accuracy in the context of the classic k-Tails model inference algorithm. First, we define mk-Tails, a generalization of k-Tails from one to many parameters, which enables fine-grained control over the tradeoff. Second, we extend mk-Tails with a reduction based on past-equivalence, which effectively reduces the size of the model without decreasing its accuracy. We implemented our work and evaluated its performance and effectiveness on models and generated logs from the literature.
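As a rough, simplified illustration of the underlying k-Tails idea (not the mk-Tails algorithm itself), the sketch below abstracts each position in a trace by its next k events and merges positions that share the same abstraction into one state:

```python
# Much-simplified k-Tails-style model inference: states are k-futures of trace positions.
from collections import defaultdict

def ktails(traces, k=2):
    transitions = defaultdict(set)
    for trace in traces:
        padded = list(trace) + ["<END>"] * k
        for i in range(len(trace)):
            src = tuple(padded[i:i + k])          # abstraction of the current position
            dst = tuple(padded[i + 1:i + 1 + k])  # abstraction of the next position
            transitions[src].add((trace[i], dst))
    return transitions

traces = [["login", "browse", "logout"],
          ["login", "browse", "browse", "logout"]]
for state, outgoing in ktails(traces, k=2).items():
    for event, nxt in sorted(outgoing):
        print(state, f"--{event}-->", nxt)
```

Larger k keeps more context per state and hence yields a larger, more accurate model; smaller k merges more aggressively, which is exactly the size/accuracy tradeoff discussed above.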
This paper presents PMExec, a tool that supports the execution of partial UML-RT models. To this end, the tool implements the following steps: static analysis, automatic refinement, and input-driven execution. First, a static analysis that respects the execution semantics of UML-RT models is used to detect problematic model elements, i.e., elements that cause problems during execution due to the partiality. Second, the models are refined automatically using model transformation techniques, which mostly add decision points where missing information can be supplied. Third, the refined models are executed, and when the execution reaches a decision point, the input required to continue the execution is obtained either interactively or from a script that captures how to deal with the partial elements. We have evaluated PMExec on several use cases, showing that the static analysis, refinement, and application of user input can be carried out with reasonable performance, and that the overhead of the approach, due to the refinement step, is manageable. (https://youtu.be/BRKsselcMnc)
Model-Driven Engineering (MDE) techniques raise the level of abstraction at which developers construct software. However, modern cyber-physical systems are becoming more prevalent and complex, and hence software models that represent the structure and behavior of such systems still tend to be large and complex. These models may have numerous, if not infinite, possible behaviors, with complex communications between their components. Appropriate software testing techniques that generate test cases with a high coverage rate to put these systems to the test at the model level (without the need to understand the underlying code generator or refer to the generated code) are therefore important. Concolic testing, a hybrid testing technique that benefits from both concrete and symbolic execution, achieves high execution coverage and is used extensively in industry for program testing, but not for software models. In this paper, we present a novel technique and its tool mCUTE, an open-source model-level concolic testing engine. We describe the implementation of our tool in the context of Papyrus-RT, an open-source MDE tool based on UML-RT, and report the results of validating our tool using a set of benchmark models.
Context: Topic modeling finds human-readable structures in unstructured textual data. A widely used topic modeling technique is Latent Dirichlet Allocation (LDA). When run on the same training data in different orders, LDA suffers from “order effects”, i.e., different topics are generated if the order of the training data is shuffled. Such order effects introduce a systematic error into any study; this error can lead to misleading results, specifically inaccurate topic descriptions and a reduction in the efficacy of text mining classification results.
Objective: To provide a method in which distributions generated by LDA are more stable and can be used for further analysis.
Method: We use LDADE, a search-based software engineering tool that uses Differential Evolution (DE) to tune LDA’s parameters (a minimal illustrative tuning sketch follows this abstract). LDADE is evaluated on data from a programmer information exchange site (Stack Overflow), on the titles and abstracts of thousands of Software Engineering (SE) papers, and on software defect reports from NASA. Results were collected across different implementations of LDA (Python+Scikit-Learn, Scala+Spark) on the Linux platform and for different kinds of LDA (VEM, Gibbs sampling). Results were scored via topic stability and text mining classification accuracy.
Results: In all treatments: (i) standard LDA exhibits very large topic instability; (ii) LDADE’s tunings dramatically reduce cluster instability; (iii) LDADE also leads to improved performance for supervised as well as unsupervised learning.
Conclusion: Due to topic instability, using standard LDA with its “off-the-shelf” settings should now be deprecated. Also, in the future, SE papers that use LDA should be required to test and (if needed) mitigate LDA topic instability. Finally, LDADE is a candidate technology for effectively and efficiently reducing that instability.
Link to Publication: https://www.sciencedirect.com/science/article/pii/S0950584917300861
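The following sketch (our own illustration, assuming scikit-learn's LatentDirichletAllocation and SciPy's differential_evolution; it is not the LDADE implementation) shows the general shape of tuning LDA hyper-parameters so that topics stay stable when the document order is shuffled:

```python
# Illustrative DE-based tuning of LDA for topic stability under shuffled document order (not LDADE itself).
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["null pointer exception in android activity",
        "how to parse json in java",
        "android activity lifecycle question",
        "java json serialization library",
        "unit test for android activity"] * 4
X = CountVectorizer().fit_transform(docs)

def instability(params, runs=3):
    n_topics, alpha = int(round(params[0])), params[1]
    topic_sets = []
    for seed in range(runs):
        order = np.random.RandomState(seed).permutation(X.shape[0])   # shuffle document order
        lda = LatentDirichletAllocation(n_components=n_topics,
                                        doc_topic_prior=alpha,
                                        random_state=0).fit(X[order])
        top_words = {tuple(np.argsort(t)[-3:]) for t in lda.components_}
        topic_sets.append(top_words)
    common = set.intersection(*topic_sets)
    return 1.0 - len(common) / max(len(topic_sets[0]), 1)             # lower is more stable

best = differential_evolution(instability, bounds=[(2, 6), (0.01, 1.0)], maxiter=3, seed=1)
print("tuned n_topics, alpha:", best.x, "instability:", best.fun)
```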
Systems-of-systems are formed by the composition of independently created software components. These components are designed to satisfy their individual requirements, rather than the global requirements of the system-of-systems. We refer to components that cannot be adapted to meet both individual and global requirements as defiant components. In this paper, we propose a cautious adaptation approach that supports changing the behaviour of such defiant components under exceptional conditions to satisfy global requirements, while continuing to guarantee the satisfaction of the components’ individual requirements. The approach represents both normal and exceptional conditions as scenarios; models the behaviour under exceptional conditions as wrappers implemented using an aspect-oriented technique; and deals with both single and multiple instances of defiant components with different precedence orders at runtime. We evaluated an implementation of the approach using drones and boats in an organ delivery application conceived by our industrial partners, in which we assessed how the proposed approach helps achieve the system-of-systems’ global requirements while accommodating the increased complexity of hybrid aspects such as multiplicity, precedence ordering, openness, and heterogeneity.
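As a loose analogy for the wrapper mechanism (sketched in Python rather than the paper's aspect-oriented setting, with hypothetical names), a decorator can intercept a defiant component's behaviour only when an exceptional condition holds, leaving its normal behaviour untouched:

```python
# Toy wrapper: override a component's decision only under an exceptional (emergency) condition.
def cautious_adaptation(exceptional, adapted_behaviour):
    def wrap(method):
        def wrapper(self, *args, **kwargs):
            if exceptional():                      # e.g., an organ delivery emergency is active
                return adapted_behaviour(self, *args, **kwargs)
            return method(self, *args, **kwargs)   # individual requirements still guaranteed otherwise
        return wrapper
    return wrap

emergency = {"active": False}

class Drone:                                       # a "defiant component" with its own requirements
    @cautious_adaptation(lambda: emergency["active"],
                         lambda self, zone: f"override: fly through restricted {zone}")
    def plan_route(self, zone):
        return f"normal: avoid restricted {zone}"

d = Drone()
print(d.plan_route("zone A"))                      # normal behaviour
emergency["active"] = True
print(d.plan_route("zone A"))                      # adapted behaviour under the exceptional condition
```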
Ensuring the correctness of safety-critical systems plays a key role in software engineering worldwide. Over the past years, we have been helping CASCO Signal Ltd., China's biggest high-speed railway signaling company, to develop safety-critical software for high-speed railways. We have also contributed specific methods for developing better safety-critical software, including a search-based, model-driven software development approach that uses the VIATRA Solver and a UML diagram refinement method to construct SysML models, and uses a SAT solver to check the models. This talk aims to share the challenges of developing high-speed railway safety-critical systems and what we learned from developing safety-critical software with a Chinese high-speed railway company; we use the ZC subsystem as a case study to illustrate the systematic model-driven safety-critical software development method.
Architecture degradation has a strong negative impact on software quality and can result in significant losses. Severe software degradation does not happen overnight. Software evolves continuously through numerous issues, fixing bugs and adding new features, and architecture flaws emerge quietly and largely unnoticed until they grow in scope and significance, at which point the system becomes difficult to maintain. Developers are largely unaware of these flaws or the accumulating debt, as they are focused on their immediate tasks of addressing individual issues. As a consequence, the cumulative impact of their activities on the architecture goes unnoticed. To detect these problems early and prevent them from accumulating into severe ones, we propose to monitor software evolution by tracking the interactions among files revised to address issues. In particular, we propose and show how to automatically detect active hotspots that reveal architecture problems. We have empirically studied hundreds of hotspots along the evolution timelines of 21 open source projects and show that only a few dominating active hotspots exist per project at any given time. Moreover, these dominating active hotspots persist over long time periods and thus deserve special attention. Compared with state-of-the-art design and code smell detection tools, we report that by using active hotspots it is possible to detect signs of software degradation both earlier and more precisely.
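As a rough sketch of the kind of signal involved (our own illustration, not the paper's detection algorithm), one can track which files are repeatedly revised together across issue-fixing commits and flag the most connected group:

```python
# Toy co-change tracking: count file pairs revised together in issue-fixing commits.
from collections import Counter
from itertools import combinations

# Hypothetical history: each entry is the set of files touched while addressing one issue.
issue_commits = [
    {"core/Session.java", "net/Socket.java"},
    {"core/Session.java", "net/Socket.java", "ui/Login.java"},
    {"core/Session.java", "net/Socket.java"},
    {"ui/Login.java", "ui/Theme.java"},
]

pair_counts = Counter()
for files in issue_commits:
    for a, b in combinations(sorted(files), 2):
        pair_counts[(a, b)] += 1

# Pairs that keep co-changing across many issues hint at an active hotspot.
for pair, count in pair_counts.most_common(3):
    print(count, pair)
```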
More and more software-intensive systems employ machine learning and runtime optimization to improve their functionality by providing advanced features (e.g., personal driving assistants or recommendation engines). Such systems incorporate a number of smart software functions (SSFs) which gradually learn and adapt to users’ preferences. A key property of SSFs is their ability to learn from data resulting from the interaction with the user (implicit and explicit feedback), which we call trainability. Newly developed and enhanced features in an SSF must be evaluated based on their effect on the trainability of the system. Despite recent approaches for continuous deployment of machine learning systems, trainability evaluation is not yet part of continuous integration and deployment (CID) pipelines. In this paper, we describe the different facets of trainability for the development of SSFs. We also present our approach for automated trainability evaluation within an automotive CID framework, which uses automated quality gates for the continuous evaluation of machine learning models. The results from our indicative evaluation, based on real data from eight BMW cars, highlight the importance of continuous and rigorous trainability evaluation in the development of SSFs.
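A trainability quality gate can be pictured as a pipeline step that blocks deployment when a newly trained model does not improve on recent feedback data; the sketch below only illustrates that idea (the metric, threshold, and names are hypothetical, not the automotive framework described above).

```python
# Hypothetical CID quality gate: fail the pipeline if the retrained model does not beat the deployed one.
import sys

def evaluate(model_id, feedback):
    """Hypothetical placeholder: score the given model on recent user-feedback data."""
    scores = {"candidate": 0.87, "deployed": 0.85}   # stand-in numbers for illustration
    return scores[model_id]

def trainability_gate(feedback="recent_feedback", min_improvement=0.01):
    cand = evaluate("candidate", feedback)
    base = evaluate("deployed", feedback)
    if cand < base + min_improvement:
        print(f"Quality gate FAILED: {cand:.3f} vs baseline {base:.3f}")
        sys.exit(1)                                  # non-zero exit blocks the CID pipeline stage
    print(f"Quality gate passed: {cand:.3f} vs baseline {base:.3f}")

if __name__ == "__main__":
    trainability_gate()
```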
Programming is typically a difficult and repetitive task. Programmers encounter endless problems during programming and often need to write similar code over and over again. To prevent programmers from reinventing the wheel and thus increase their productivity, we propose a context-aware code-to-code recommendation tool named Lancer. With the support of a Library-Sensitive Language Model (LSLM) and the BERT model, Lancer is able to automatically analyze the intention of incomplete code and recommend relevant and reusable code samples in real time. A video demonstration of Lancer can be found at https://youtu.be/tO9nhqZY35g. Lancer is open source and the code is available at https://github.com/sfzhou5678/Lancer.
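A drastically simplified retrieval step can be sketched with TF-IDF similarity between a query describing the developer's intent and a snippet corpus; this stands in for Lancer's LSLM and BERT models, which are far more sophisticated, and the corpus and query are made up.

```python
# Toy code-to-code retrieval: rank corpus snippets by lexical similarity to the developer's intent.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "BufferedReader reader = new BufferedReader(new FileReader(path)); String line = reader.readLine();",
    "HttpURLConnection conn = (HttpURLConnection) url.openConnection(); conn.setRequestMethod(\"GET\");",
    "List<String> lines = Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8);",
]
query = "read all lines from a file path"

vectorizer = TfidfVectorizer(token_pattern=r"[A-Za-z_]+")
matrix = vectorizer.fit_transform(corpus + [query])
query_vec, corpus_vecs = matrix[len(corpus)], matrix[:len(corpus)]
scores = cosine_similarity(query_vec, corpus_vecs).ravel()

# Recommend the snippets most similar to the incomplete code / stated intent.
for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.2f}", corpus[idx][:60])
```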