The logic behind design decisions, called design rationale, is very valuable. In the past, researchers have tried to automatically extract and exploit this information, but prior techniques are only applicable to specific contexts and there is insufficient progress on an end-to-end rationale information extraction pipeline. Here we outline a path towards such a pipeline that leverages several Machine Learning (ML) and Natural Language Processing (NLP) techniques. Our proposed context-independent approach, called Kantara, produces a knowledge graph representation of decisions and of their rationales, which considers their historical evolution and traceability. We also propose inconsistency checking mechanisms to ensure the correctness of the extracted information and the coherence of the development process. We conducted a preliminary evaluation of our proposed approach on a small example sourced from the Linux Kernel, which shows promising results.
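To make the intended output concrete, here is a minimal sketch of a decision-rationale knowledge graph with a naive inconsistency check; the node kinds, edge labels, and the `superseded_by` field used to track historical evolution are illustrative assumptions, not Kantara's actual schema.

```python
from __future__ import annotations

# Minimal sketch of a decision/rationale knowledge graph with a naive
# consistency check. Node kinds, edge labels, and "superseded_by" are
# illustrative assumptions, not Kantara's actual schema.
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                          # "decision" or "rationale"
    text: str
    superseded_by: str | None = None   # tracks historical evolution

@dataclass
class Graph:
    nodes: dict[str, Node] = field(default_factory=dict)
    edges: list[tuple[str, str, str]] = field(default_factory=list)  # (src, label, dst)

    def add(self, node_id: str, node: Node) -> None:
        self.nodes[node_id] = node

    def link(self, src: str, label: str, dst: str) -> None:
        self.edges.append((src, label, dst))

    def inconsistent_decisions(self) -> list[str]:
        """Flag decisions whose supporting rationale has been superseded."""
        return [src for src, label, dst in self.edges
                if label == "justified_by" and self.nodes[dst].superseded_by]

g = Graph()
g.add("d1", Node("decision", "Guard the ring buffer with a spinlock"))
g.add("r1", Node("rationale", "The critical section is short; sleeping locks add overhead",
                 superseded_by="r2"))
g.add("r2", Node("rationale", "The critical section now performs I/O; a mutex is required"))
g.link("d1", "justified_by", "r1")

print(g.inconsistent_decisions())  # ['d1'] -> this decision should be revisited
```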
Constructing a System Knowledge Graph of User Tasks and Failures from Bug Reports to Support Soap Opera Testing
Exploratory testing is an effective testing approach that leverages the tester’s knowledge and creativity to design test cases to provoke and recognize failures at the system level from the end user’s perspective. Although some principles and guidelines have been proposed to guide exploratory testing, there are no effective tools for the automatic generation of exploratory test scenarios (a.k.a. soap opera tests). Existing test generation techniques rely on specifications, program differences, and fuzzing, which are not suitable for exploratory test generation. In this paper, we propose to leverage the scenario and oracle knowledge in bug reports to generate soap opera test scenarios. We develop open information extraction methods to construct a system knowledge graph (KG) of user tasks and failures from the steps to reproduce, expected results, and observed results in bug reports. We construct a proof-of-concept KG from 25,939 bugs of the Firefox browser. Our evaluation shows the constructed KG is of high quality. Based on the KG, we create soap opera test scenarios by combining the scenarios of relevant bugs, and we develop a web tool to present the created test scenarios and support exploratory testing. In our user study, 5 users found 18 bugs from 5 seed bugs in 2 hours using our tool, while the control group found only 5 bugs based on recommended similar bugs.
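As a rough illustration of the KG construction step, the toy sketch below turns one pre-structured bug report into (subject, relation, object) triples linking user actions to an expected result and an observed failure; the field names and the naive verb/object splitting are simple stand-ins for the open information extraction methods actually used.

```python
# Toy construction of a knowledge graph from one pre-structured bug report.
# The field names and the naive verb/object splitting are stand-ins for the
# open information extraction applied to real bug reports.
bug_report = {
    "steps_to_reproduce": [
        "Open a new private browsing window",
        "Download a large file",
    ],
    "expected_results": "Download completes and appears in the downloads panel",
    "observed_results": "Browser crashes once the download reaches 100%",
}

def extract_action(step):
    """Split a step into a (verb, object) pair - a crude stand-in for open IE."""
    verb, _, obj = step.partition(" ")
    return verb.lower(), obj.lower()

triples = set()  # (subject, relation, object)
previous_action = None
for step in bug_report["steps_to_reproduce"]:
    verb, obj = extract_action(step)
    action = f"{verb} {obj}"
    triples.add((action, "acts_on", obj))
    if previous_action:
        triples.add((previous_action, "followed_by", action))
    previous_action = action

# Link the final step of the user task to its oracle and its failure.
triples.add((previous_action, "expected", bug_report["expected_results"].lower()))
triples.add((previous_action, "observed_failure", bug_report["observed_results"].lower()))

for triple in sorted(triples):
    print(triple)
```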
Data Augmentation for Improving Emotion Recognition in Software Engineering Communication
Emotions (e.g., Joy, Anger) are prevalent in daily software engineering (SE) activities, and are known to be significant indicators of work productivity (e.g., bug fixing efficiency). Recent studies have shown that directly applying general-purpose emotion classification tools to SE corpora is not effective. Even within the SE domain, tool performance degrades significantly when a tool is trained on one communication channel and evaluated on another (e.g., Stack Overflow vs. GitHub comments). Retraining a tool with channel-specific data takes significant effort, since manually annotating large datasets of ground-truth data is expensive.
In this paper, we address this data scarcity problem by automatically creating new training data using a data augmentation technique. Based on an analysis of the types of errors made by popular SE-specific emotion recognition tools, we specifically target our data augmentation strategy in order to improve the performance of emotion recognition. Our results show an average improvement of 9.3% in micro F1-Score for three existing emotion classification tools (ESEM-E, EMTk, SEntiMoji) when trained with our best augmentation strategy.
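For readers unfamiliar with text data augmentation, the sketch below shows the general flavour of generating extra labeled examples by synonym substitution; the tiny synonym table and single-word substitution rule are placeholders for illustration, not the error-targeted strategy evaluated in the paper.

```python
import random

# Minimal sketch of text data augmentation for emotion classification.
# The hand-written synonym table and single-word substitution rule are
# placeholders; the paper targets its augmentation at the error patterns
# of specific SE emotion tools (ESEM-E, EMTk, SEntiMoji).
SYNONYMS = {
    "great": ["excellent", "fantastic"],
    "bug": ["defect", "issue"],
    "annoying": ["irritating", "frustrating"],
}

def augment(sentence, rng):
    """Return a new training example by swapping one word for a synonym."""
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in SYNONYMS]
    if not candidates:
        return sentence
    i = rng.choice(candidates)
    words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)

rng = random.Random(0)
labeled = [
    ("This bug is really annoying", "anger"),
    ("The new release looks great", "joy"),
]
augmented = [(augment(text, rng), label) for text, label in labeled]
print(augmented)  # augmented copies keep the original emotion label
```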
Towards digitalization of requirements: Generating context-sensitive user stories from diverse specifications
Requirements Engineering in industry is expertise-driven, heavily manual, and centered on the preparation and maintenance of various types of requirement specification documents. These documents come in diverse formats and vary depending on whether they are business requirement documents, functional specifications, interface specifications, client specifications, and so on. These diverse specification documents embed crucial product knowledge such as the functional decomposition of the domain into features, the feature hierarchy, feature types and their specific characteristics, dependencies, business context, etc. Moreover, in a product development scenario, thousands of pages of requirement specification documentation are created over the years. Comprehending functionality and its associated context from such large volumes of specification documents is a highly complex task. To address this problem, we propose to digitalize the requirement specification documents into processable models. This paper discusses the salient aspects involved in digitalizing requirements knowledge from diverse requirement specification documents. It proposes an AI engine that automatically transforms diverse text-based requirement specifications into machine-processable models using NLP techniques and generates context-sensitive user stories. The paper describes the key requirement abstractions and concepts essential in an industrial scenario, the conceptual meta-model, and the implementation of the DizReq engine (an AI engine for digitalizing requirements) for automatically transforming diverse requirement specifications into user stories that embed the business context. The evaluation results from digitalizing the specifications of an IT product suite are discussed: the mean feature extraction efficiency is 40 features/file, the mean user story extraction efficiency is 71 user stories/file, feature extraction accuracy is 94%, and requirement extraction accuracy is 98%.
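A minimal sketch of the kind of transformation involved, assuming requirements written in a simple "shall be able to" style; the regular expression and story template below are illustrative and far simpler than the NLP pipeline and conceptual meta-model of the DizReq engine.

```python
import re

# Toy transformation of requirement sentences into context-tagged user stories.
# The "shall be able to" pattern and the story template are illustrative
# assumptions; the DizReq engine relies on richer NLP and a conceptual meta-model.
PATTERN = re.compile(
    r"The (?P<role>[\w ]+?) shall be able to (?P<capability>.+?)\.",
    re.IGNORECASE,
)

def to_user_story(requirement, feature):
    """Return a user story string tagged with its feature (business context)."""
    m = PATTERN.match(requirement)
    if not m:
        return None
    return (f"[{feature}] As a {m['role'].strip().lower()}, "
            f"I want to {m['capability'].strip()}")

spec = [
    ("The product owner shall be able to export quarterly statements as PDF.",
     "Reporting"),
    ("The system administrator shall be able to revoke API tokens.",
     "Access Management"),
]
for requirement, feature in spec:
    story = to_user_story(requirement, feature)
    if story:
        print(story)
```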
Which neural network makes more explainable decisions? An approach towards measuring explainability
Neural networks are getting increasingly popular thanks to their exceptional performance in solving many real-world problems. At the same time, they have been shown to be vulnerable to attacks, difficult to debug, and subject to fairness issues. To improve people’s trust in the technology, it is often necessary to provide some human-understandable explanation of a neural network’s decisions, e.g., why is it that my loan application is rejected whereas hers is approved? That is, the stakeholder would want to minimize the chances of being unable to explain a decision consistently, and would like to know, before a neural network is deployed, how often and how easily its decisions can be explained.
In this work, we provide two measurements of the decision explainability of neural networks. We then develop algorithms for automatically evaluating these measurements on user-provided neural networks. We evaluate our approach on multiple neural network models trained on benchmark datasets. The results show that existing neural networks’ decisions often have low explainability according to our measurements. This is in line with the observation that adversarial samples, which are often hard to explain, can be easily generated through adversarial perturbation. Our further experiments show that the decisions of models trained with robust training are not necessarily easier to explain, whereas the decisions of models retrained with samples generated by our algorithms are easier to explain.
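As a loose illustration of measuring decision explainability, the sketch below scores how often a toy model's decision survives small random input perturbations; this proxy and the toy linear "network" are assumptions for illustration only, not the two measurements defined in the work.

```python
import numpy as np

# A heavily simplified proxy for decision explainability: how often a model's
# decision stays the same under small random perturbations of the input.
# This proxy and the toy linear "network" are assumptions for illustration,
# not the two measurements defined in the work.
rng = np.random.default_rng(0)
W = rng.normal(size=(10, 2))  # 10 input features -> 2 classes

def predict(x):
    return int(np.argmax(x @ W))

def consistency_score(x, eps=0.05, trials=200):
    """Fraction of eps-bounded perturbations that preserve the original decision."""
    base = predict(x)
    kept = sum(predict(x + rng.uniform(-eps, eps, size=x.shape)) == base
               for _ in range(trials))
    return kept / trials

samples = rng.normal(size=(5, 10))
for i, x in enumerate(samples):
    print(f"sample {i}: class {predict(x)}, consistency {consistency_score(x):.2f}")
```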
Automatically Identifying the Quality of Developer Chats for Post Hoc Use
Software engineers are crowdsourcing answers to their everyday challenges on Q&A forums (e.g., Stack Overflow) and, more recently, in public chat communities such as Slack, IRC, and Gitter. Many software-related chat conversations contain valuable expert knowledge that is useful both for mining to improve programming support tools and for readers who did not participate in the original conversations. However, most chat platforms and communities do not contain built-in quality indicators (e.g., accepted answers, vote counts). Therefore, it is difficult to identify conversations that contain useful information for mining or reading, i.e., conversations of post hoc quality. In this paper, we investigate automatically detecting developer conversations of post hoc quality from public chat channels. We first describe an analysis of 400 developer conversations that indicates potential characteristics of post hoc quality, followed by a machine learning-based approach for automatically identifying conversations of post hoc quality. Our evaluation of 2,000 annotated Slack conversations in four programming communities (python, clojure, elm, and racket) indicates that our approach can achieve a precision of 0.82, recall of 0.90, F-measure of 0.86, and MCC of 0.57. To our knowledge, this is the first automated technique for detecting developer conversations of post hoc quality.
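To illustrate the machine learning step, here is a minimal sketch that trains a classifier on toy conversations and reports F1 and MCC; the example conversations, TF-IDF features, and logistic regression model are assumptions for illustration, not the features derived from the authors' analysis of real Slack data.

```python
# Minimal sketch of training a classifier to flag conversations of post hoc
# quality. The toy conversations, TF-IDF features, and logistic regression
# model are assumptions for illustration; the paper derives its features
# from an analysis of 400 real Slack conversations.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, matthews_corrcoef
from sklearn.model_selection import train_test_split

conversations = [
    "how do I profile memory in clojure? try clj-async-profiler, see its README",
    "anyone around? hi! hey",
    "elm decoder fails on null, wrap the field decoder in Decode.nullable",
    "lol same here",
    "racket macro expansion error, you need syntax-case with a literal list",
    "good morning everyone",
    "python asyncio task never awaited, you forgot to await gather()",
    "thanks! no problem",
    "pip install fails behind a proxy, set HTTPS_PROXY before running pip",
    "brb lunch",
    "clojure spec not catching bad args, call stest/instrument first",
    "nice weekend all",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # 1 = useful to later readers/miners

X_train, X_test, y_train, y_test = train_test_split(
    conversations, labels, test_size=4, stratify=labels, random_state=0)

vectorizer = TfidfVectorizer()
classifier = LogisticRegression()
classifier.fit(vectorizer.fit_transform(X_train), y_train)

predictions = classifier.predict(vectorizer.transform(X_test))
print("F1:", f1_score(y_test, predictions, zero_division=0))
print("MCC:", matthews_corrcoef(y_test, predictions))
```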