Plenary
Wed 14 Jun 2023 08:30 - 09:00 at Aurora Hall - Opening. Chair(s): Paul Ralph (Dalhousie University), Burak Turhan (University of Oulu), Sira Vegas (Universidad Politecnica de Madrid)
Plenary
Wed 14 Jun 2023 09:00 - 10:00 at Aurora Hall - Keynote. Chair(s): Burak Turhan (University of Oulu), Sira Vegas (Universidad Politecnica de Madrid)
It is the time of trust and transformation in software. We want explainable AI to assist us in dialogue, write our programs, test our software, and improve how we communicate. Digitalization, automation, and the transformation required to adopt new technologies are all much too slow, even though things change at lightning speed. Change is the only thing we can be sure will happen. Evaluating and assessing software quality sounds easy, but the result is only as good as you design it to be. A multi-faceted perspective is important when analyzing complex contexts. In software, listening skills and asking the right questions of the right people are often invaluable complements to blunt data. On the other hand, much information is probably missing, since you too easily get “only” what you asked for. So, we cannot judge what we cannot observe, and analyzing this data is another issue altogether. It is therefore easy to lose perspective in a fast-changing world. Despite drowning in tools, we still lack many of the tools we need. The threshold for using a tool is high, as we cannot trust tools blindly, and we cannot be sure that the data they collect represents what we want to investigate. Therefore, the role of the scientist is more important than ever. Trusting the scientific process, utilizing multiple methods, and combining them is the recipe! Another goal is doing our best in selecting topics and collaborators, so as to build better software (quality) for humanity. It starts with you and me. In this context, I hope to touch upon areas like security, testing, automation, AI/ML, ethics and the “human in the loop”, analysis, tools, and technical debt, with a focus on evaluations and assessments.
Sigrid Eldh is a senior specialist and researcher at Ericsson, leading Ericsson’s research in software engineering, testing, and product quality, including fault-related activities. She is also an adjunct professor at Carleton University in Ottawa, and a senior lecturer at Mälardalen University, where she completed her PhD thesis “On Test Design”. With 35+ years of experience in the software industry, she is an innovator with patents and publications, leading large research collaborations on test automation, for which she was nominated as best woman in leadership by Eurekanetworks.org across 48 countries. Her research focuses on improving software quality and ways of working.
Catering
Wed 14 Jun 2023 10:00 - 10:30 at Aurora Hall - Coffee Break
Short Papers and Posters
Wed 14 Jun 2023 10:00 - 10:30 at Aurora Hall - Poster
We propose a solution combining source code static analysis with searchable symmetric encryption to detect input validation vulnerabilities of web applications in encrypted PHP code, allowing developers to protect their codebase from malicious third parties while simultaneously discovering vulnerabilities in it. Results show that our solution identifies vulnerabilities with precision similar to that of non-confidential tools, while exhibiting a moderate overhead increase of around 16.55%.
Short Papers and Posters
Wed 14 Jun 2023 10:00 - 10:30 at Aurora Hall - Poster
The exponential increase in smartphone usage has fueled the rapid growth of Android applications (apps). Unfortunately, this growth has also resulted in an alarming rise in security vulnerabilities, posing a significant challenge for developers of smartphone apps. In this paper, we conducted a quantitative and qualitative study to analyze security-related issues in open-source Android apps available on GitHub. Our study covered a total of 689 security-related commits identified from 111,224 commits distributed over 2,187 apps. We proposed a taxonomy of ten distinct categories of security issues, which we identified using the card-sorting technique. Our findings showed that Permission issues were the most prevalent in our dataset (370, 53.7%), followed by Login issues (160, 23.22%). Issues such as Privacy (5, 0.72%) and Framework (3, 0.43%) were rare in our dataset. Other security issues were related to Encryption, Authentication, Generic Security, Decryption, Network, and Database. Our taxonomy also included 91 sub-categories/sub-themes, with Permission issues having the highest number of sub-categories (37). The permission sub-categories developers discussed in their code commits include camera permission, WiFi permission, storage permission, WRITE/READ_PHONE_STATE permission, and location permission, among others. These preliminary findings serve as an initial step towards comprehending the primary security concerns from the perspective of both developers and researchers. Furthermore, our long-term objective is to investigate how developers address these security issues in their apps and determine whether they effectively resolve them. This research could provide valuable insights into improving the security of Android apps and preventing potential security breaches.
Short Papers and Posters
Wed 14 Jun 2023 10:00 - 10:30 at Aurora Hall - Poster
Fault-proneness is an indication of programming errors that decreases software quality and maintainability. Code smells, in turn, are symptoms of potential design problems that have an impact on fault-proneness. The negative impact of code smells on fault-proneness has been investigated in the literature. However, it is still unclear how the frequency of each code smell type impacts fault-proneness. To mitigate this research gap, we present an empirical study to identify whether the frequency of individual code smell types has a relationship with fault-proneness. More specifically, we identify 13 code smell types and the fault-proneness of the corresponding smelly classes in well-known open-source systems from the Apache and Eclipse ecosystems. We then analyse the relationship between their frequency of occurrence and fault-proneness using correlation analysis. The results show that the Anti Singleton, Blob, and Class Data Should Be Private smell types have a strong relationship with fault-proneness even though their frequencies are not very high. On the other hand, comparatively high-frequency code smell types such as Complex Class, Large Class, and Long Parameter List have a moderate relationship with fault-proneness. These findings will assist developers in prioritizing code smells when performing refactoring activities in order to improve software quality.
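To make the correlation analysis concrete, here is a minimal sketch in Python, assuming hypothetical per-class smell and fault counts and using Spearman rank correlation as one plausible choice; the authors' exact statistical setup may differ.

```python
# Sketch: correlate the per-class frequency of one smell type with fault
# counts. Column names and Spearman correlation are illustrative assumptions.
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical input: one row per smelly class.
df = pd.DataFrame({
    "blob_count": [3, 0, 1, 5, 2, 0, 4],
    "faults":     [7, 1, 2, 9, 3, 0, 6],
})

rho, p_value = spearmanr(df["blob_count"], df["faults"])
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```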
Short Papers and Posters
Wed 14 Jun 2023 10:00 - 10:30 at Aurora Hall - Poster
One of the objectives of software engineering education is to help students learn essential teamwork skills. This is done by having the students work in groups on course assignments. Student team composition plays a vital role in this, as it significantly affects learning outcomes, what is learned, and how. The study presented in this paper aims to better understand student team composition in software engineering education and to investigate the factors affecting it in the international software engineering education context. These factors should be taken into consideration by software engineering teachers when they design group work assignments in their courses. In this paper, the initial findings of an ongoing Action Research study are presented. The results identify principles that should be considered when designing student team composition in software engineering courses.
Research (Full Papers)
Wed 14 Jun 2023 10:30 - 10:50 at Aurora Hall - AI and Software Engineering. Chair(s): Valentina Lenarduzzi (University of Oulu)
Data quality assessment has become a prominent component in the successful execution of complex data-driven artificial intelligence (AI) software systems. In practice, real-world applications generate huge volumes of data at high speed. These data streams require analysis and preprocessing before being permanently stored or used in a learning task. Therefore, significant attention has been paid to the systematic management and construction of high-quality datasets. Nevertheless, managing voluminous, high-velocity data streams is usually performed manually (i.e., offline), making it an impractical strategy in production environments. To address this challenge, DataOps has emerged to achieve life-cycle automation of data processes using DevOps principles. However, determining data quality on a fitness scale constitutes a complex task within the framework of DataOps. This paper presents a novel Data Quality Scoring Operations (DQSOps) framework that yields a quality score for production data in DataOps workflows. The framework incorporates two scoring approaches: an ML prediction-based approach that predicts the data quality score, and a standard-based approach that periodically produces the ground-truth scores based on assessing several data quality dimensions. We deploy the DQSOps framework in a real-world industrial use case. The results show that DQSOps achieves significant computational speedups compared to the conventional approach of data quality scoring while maintaining high prediction performance.
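The two scoring paths DQSOps combines can be illustrated with a toy sketch: a standard-based function periodically computes ground-truth scores from quality dimensions, and a cheap regressor learns to predict them. The dimension names, weights, and model choice below are assumptions for illustration, not the paper's implementation.

```python
# Sketch: ML-predicted quality scores validated against a periodic
# standard-based ground truth, in the spirit of DQSOps.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
# Features: per-batch quality dimensions (e.g., completeness, uniqueness).
X = rng.random((500, 3))

def standard_based_score(dims):
    # Hypothetical weighted aggregation of quality dimensions.
    return dims @ np.array([0.5, 0.3, 0.2])

y = standard_based_score(X)                      # periodic ground truth
model = RandomForestRegressor(random_state=0).fit(X[:400], y[:400])

# Fast ML-based scoring of new batches, checked against the standard path.
pred = model.predict(X[400:])
print(f"mean abs. error vs ground truth: {np.abs(pred - y[400:]).mean():.3f}")
```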
Journal First
Wed 14 Jun 2023 10:50 - 11:00 at Aurora Hall - AI and Software Engineering. Chair(s): Valentina Lenarduzzi (University of Oulu)
Context: If deep learning models in safety-critical systems misbehave, serious accidents may occur. Previous studies have proposed approaches to overcome such misbehavior by detecting and modifying the responsible faulty parts in deep learning models. For example, fault localization has been applied to deep neural networks to detect neurons that cause misbehavior.
Objective: However, such approaches are not applicable to deep learning models that have internal states, which change dynamically based on the input data samples (e.g., recurrent neural networks (RNNs)). Hence, we propose a new fault localization approach to be applied to RNNs.
Methods: We propose probabilistic automaton-based fault localization (PAFL). PAFL enables developers to detect faulty parts even in RNNs by computing suspiciousness scores with fault localization using n-grams. We convert RNNs into probabilistic finite automata (PFAs) and localize faulty sequences of state transitions on the PFAs. To consider various sequences and to detect faulty ones more precisely, we use n-grams inspired by natural language processing. Additionally, we distinguish data samples related to the misbehavior to evaluate PAFL. We also propose a novel suspiciousness score, the average n-gram suspiciousness (ANS) score, based on n-grams to distinguish data samples. We evaluate PAFL and ANS scores on eight publicly available datasets and three RNN variants: simple recurrent neural networks, gated recurrent units, and long short-term memory.
Results: The experiment demonstrates that ANS scores identify faulty parts of RNNs when n is greater than one. Moreover, PAFL is statistically significantly better, with large effect sizes, compared to state-of-the-art fault localization in terms of distinguishing data samples related to the misbehavior. Specifically, PAFL is better in 66.74% of the experimental settings.
Conclusion: The results demonstrate that PAFL can be used to detect faulty parts in RNNs. Hence, in future studies, PAFL can be used as a baseline for fault localization in RNNs.
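The n-gram idea can be illustrated with a generic spectrum-based suspiciousness computation over state-transition n-grams. Note that the Ochiai formula below is a common stand-in, not PAFL's own ANS score, and the PFA traces are toy assumptions.

```python
# Sketch: suspiciousness of state-transition n-grams from passing vs.
# failing traces of a probabilistic finite automaton (toy data).
from collections import Counter
from math import sqrt

def ngrams(trace, n=2):
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]

# Toy PFA state traces from passing and failing runs.
passing = [["q0", "q1", "q2"], ["q0", "q1", "q3"]]
failing = [["q0", "q2", "q2"], ["q0", "q2", "q3"]]

fail_hits = Counter(g for t in failing for g in set(ngrams(t)))
pass_hits = Counter(g for t in passing for g in set(ngrams(t)))

for g in set(fail_hits) | set(pass_hits):
    ef, ep = fail_hits[g], pass_hits[g]
    # Ochiai: ef / sqrt(total_failing_runs * (ef + ep)); 0 if never failing.
    score = ef / sqrt(len(failing) * (ef + ep)) if ef else 0.0
    print(g, round(score, 2))   # ('q0', 'q2') scores highest here
```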
Research (Full Papers)
Wed 14 Jun 2023 11:00 - 11:20 at Aurora Hall - AI and Software Engineering. Chair(s): Valentina Lenarduzzi (University of Oulu)
Society’s increasing dependence on Artificial Intelligence (AI) and AI-enabled systems requires a more practical approach from software engineering (SE) executives in middle and higher-level management to improve their involvement in implementing AI ethics by making ethical requirements part of their management practices. However, research indicates that most work on implementing ethical requirements in SE management primarily focuses on technical development, with scarce findings for middle and higher-level management. We investigate this by interviewing ten Finnish SE executives in middle and higher-level management to examine how they consider and implement ethical requirements. We use the ethical requirements of the European Union (EU) Ethics Guidelines for Trustworthy AI as our reference and an Agile portfolio management framework to analyze implementation. Our findings reveal that the Privacy and data governance ethical requirements are generally considered legal requirements, with no other consideration of ethical requirements identified. The findings also show that it is practicable to consider ethical requirements such as Technical robustness and safety for implementation as risk requirements, and Societal and environmental well-being for implementation as sustainability requirements. We examine a practical approach to implementing ethical requirements using an ethical risk requirements stack employing the Agile portfolio management framework.
Short Papers and Posters
Wed 14 Jun 2023 11:20 - 11:30 at Aurora Hall - AI and Software Engineering. Chair(s): Valentina Lenarduzzi (University of Oulu)
Code smell is the term used to signal certain patterns or structures in software code that may contain a potential design or architecture problem, leading to maintainability or other software quality issues. Detecting code smells early in the software development process helps prevent these problems and improves overall software quality. Existing research concentrates on collecting and handling datasets and then exploring the potential of deep learning models to detect smells, while avoiding extensive feature engineering. Though these approaches have obtained promising results, the following issues need to be tackled: (i) extracting both structural and semantic features from the software units; (ii) mitigating the effects of imbalanced data distribution on the performance of learning models. In this paper, we propose DeepSmells, a novel approach to code smell detection. To learn the complex hierarchical representations of the code fragment, we apply a deep convolutional neural network (CNN). Then, in order to improve the quality of the context encoding and preserve semantic information, a long short-term memory (LSTM) network is placed immediately after the CNN. The final classification is conducted by a deep neural network with a weighted loss function to reduce the impact of skewed data distribution. We performed an empirical study using existing code smell benchmark datasets to assess the performance of our proposed approach and compare it with state-of-the-art baselines. The results demonstrate the effectiveness of our proposed method for all kinds of code smells, outperforming the baselines in terms of F1 score and MCC.
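A minimal sketch of the CNN-to-LSTM pipeline with a weighted loss, in the spirit of DeepSmells; the layer sizes, vocabulary, and class weights are illustrative assumptions rather than the authors' exact architecture.

```python
# Sketch: CNN -> LSTM -> classifier with weighted cross-entropy to counter
# the skewed smelly/clean class distribution (all hyperparameters assumed).
import torch
import torch.nn as nn

class SmellClassifier(nn.Module):
    def __init__(self, vocab_size=5000, emb_dim=64, conv_ch=128, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, conv_ch, kernel_size=5, padding=2)
        self.lstm = nn.LSTM(conv_ch, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)      # smelly vs. clean

    def forward(self, tokens):                # tokens: (batch, seq_len)
        x = self.emb(tokens)                  # (batch, seq_len, emb_dim)
        x = x.transpose(1, 2)                 # Conv1d wants (batch, ch, seq)
        x = torch.relu(self.conv(x))
        x = x.transpose(1, 2)                 # back to (batch, seq, ch)
        _, (h, _) = self.lstm(x)              # h: (1, batch, hidden)
        return self.head(h[-1])

# Up-weight the rare "smelly" class (weights assumed for illustration).
loss_fn = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 8.0]))

model = SmellClassifier()
tokens = torch.randint(0, 5000, (4, 200))     # toy batch of token ids
labels = torch.tensor([0, 1, 0, 0])
loss = loss_fn(model(tokens), labels)
loss.backward()
```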
Research (Full Papers)
Wed 14 Jun 2023 11:30 - 11:50 at Aurora Hall - AI and Software Engineering. Chair(s): Valentina Lenarduzzi (University of Oulu)
Carefully selecting the right collection data structure can significantly improve the performance of a Java program. Unfortunately, the performance impact of a particular collection selection can be hard to estimate. To assist developers, there are tools that recommend collections to use based on static and/or dynamic information about a program. The majority of existing collection selection tools for Java (e.g., CoCo, CollectionSwitch) pick their selections dynamically, which means that they must trade off sophistication in their selection algorithm against its run-time overhead. For static collection selection, the Brainy tool has demonstrated that complex, machine-dependent models can produce substantial performance improvements, albeit only for C++ so far.
In this paper, we port Brainy from C++ to Java and evaluate its effectiveness on five benchmarks from the DaCapo benchmark suite. We compare it against the original program, and also against a variant of a brute-force approach to collection selection, which serves as our ground truth for optimal performance. Our results show that in four benchmarks out of five, our ground truth and the original program are similar. In one case, the ground truth shows that an optimization yielding a 15% speedup was available, but our port did not find this substantial optimization. We find that the port is more efficient but less effective than the ground truth, can easily adapt to new hardware architectures, and can incorporate new data structures with at most a few hours of human effort. We detail the challenges that we encountered porting the Brainy approach to Java, and list a number of insights and directions for future research.
Vision and Emerging Results
Wed 14 Jun 2023 11:50 - 12:00 at Aurora Hall - AI and Software Engineering. Chair(s): Valentina Lenarduzzi (University of Oulu)
The ability to let developers share their source code and collaborate on software projects has made GitHub a widely used open-source platform. Each repository on GitHub is generally equipped with a README.MD file to exhibit an overview of its main functionalities. Nevertheless, while offering useful information, README.MD is usually lengthy, requiring time and effort to read and comprehend. Thus, besides README.MD, GitHub also allows its users to add a short description called “About,” giving a brief but informative summary of the repository. This allows visitors to quickly grasp the main content and decide whether to continue reading. Unfortunately, for various reasons (not excluding laziness), this field is often left blank by developers. This paper proposes GitSum, a novel approach to the summarization of README.MD that automatically fills the blank “About” field for repositories. GitSum is built on top of BART and T5, two cutting-edge deep learning techniques, learning from existing data to perform recommendations for repositories with a missing description. We test its performance using two datasets collected from GitHub. The evaluation shows that GitSum can generate relevant predictions, outperforming a well-established baseline.
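The core of such a pipeline can be sketched with an off-the-shelf summarization checkpoint; GitSum fine-tunes BART/T5 on GitHub data, so the generic model below is only a stand-in for illustration.

```python
# Sketch: generating a candidate "About" line from README text with a
# pretrained BART summarizer (model choice is an assumption, not GitSum's).
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

readme_text = (
    "This repository provides a lightweight library for parsing, validating "
    "and transforming configuration files. It supports YAML and JSON inputs, "
    "offers a plugin system for custom validators, and ships with a CLI."
)
about = summarizer(readme_text, max_length=30, min_length=5, do_sample=False)
print(about[0]["summary_text"])   # candidate one-line description
```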
Research (Full Papers)
Wed 14 Jun 2023 13:30 - 13:50 at Aurora Hall - Repository Mining. Chair(s): César França (Universidade Federal Rural de Pernambuco)
During the last few years, Continuous Integration (CI) has become a common practice in open-source and industrial environments, reducing the scope for errors and increasing speed to market through automated build and test processes. However, despite this wide adoption, little is known about the challenges developers discuss. Analyzing developers' discussions is required to understand what researchers, educators, and practitioners should focus on, and how discussion communities can help shed light on CI challenges. In this study, we examine Stack Overflow (SO), the most popular crowd-sourced forum, to understand the challenges developers face in the CI context. We collect a corpus of 27,728 CI-related developer posts from SO and analyze those posts through a mixed-methods approach with quantitative and qualitative analyses. To study the trends of CI discussions, we investigate the metadata of CI questions, users, and tags. Then, we extract the main CI topics using Latent Dirichlet Allocation (LDA) tuned with a Genetic Algorithm (GA). Finally, we investigate the most popular and difficult topics faced by developers and perform a qualitative analysis based on a statistical sample of unanswered questions to get further insights into CI challenges. The LDA clustering reveals that developers face challenges with six main topics, namely Build, Testing, Version Control, Configuration, Deployment, and CI Culture. In particular, we found that the Build topic is the most popular among the studied topics and that the Version Control and Testing topics are the most difficult for the SO community. Our study uncovers insights about CI challenges and adds evidence to existing knowledge about CI issues, especially those related to software builds. Based on the results of our study, we draw several implications for researchers (e.g., the need for more effort to investigate the reasons behind the reported issues), educators (e.g., teach CI principles and philosophy), and practitioners (e.g., take the difficult topics into consideration when distributing tasks).
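The topic-extraction step can be sketched as follows; the paper tunes LDA with a genetic algorithm, for which the simple coherence-scored search below is a simplified stand-in, run on a toy corpus.

```python
# Sketch: LDA topic extraction over tokenized CI posts, picking the number
# of topics by coherence (a stand-in for the paper's GA tuning).
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

posts = [
    ["jenkins", "build", "fails", "maven"],
    ["docker", "deploy", "pipeline", "kubernetes"],
    ["unit", "test", "flaky", "ci"],
]  # toy, pre-tokenized corpus

dictionary = Dictionary(posts)
corpus = [dictionary.doc2bow(p) for p in posts]

def coherence(model):
    # u_mass coherence works without an external reference corpus.
    return CoherenceModel(model=model, corpus=corpus,
                          coherence="u_mass").get_coherence()

best = max(
    (LdaModel(corpus, num_topics=k, id2word=dictionary,
              passes=10, random_state=0) for k in (2, 3)),
    key=coherence,
)
for topic_id, words in best.show_topics(formatted=False):
    print(topic_id, [w for w, _ in words])
```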
Industry
Wed 14 Jun 2023 13:50 - 14:00 at Aurora Hall - Repository Mining. Chair(s): César França (Universidade Federal Rural de Pernambuco)
The fast distribution and deployment of security patches is important to protect users against cyberattacks. These fixes can be detected automatically by patch management triage systems. However, previous work has shown that automating the task is not easy, in some cases because of poor documentation or lack of information in security fixes. For many years, standard practices in the security community have steered engineers to provide cryptic commit messages (i.e., to patch software vulnerabilities silently) to avoid potential attacks and reputation damage. However, not providing enough documentation on vulnerability fixes is known to damage trust between vendors and users. Current efforts in the security community aim to increase the level of transparency during patch and disclosure times to help build trust in the development community and make patch management processes faster. In this paper, we evaluate how informative security commit messages (i.e., messages attached to security fixes) are and how different levels of information can affect the different tasks in automated patch triage systems. We observed that security engineers provide levels of detail in security commit messages that can be leveraged to improve or enable one or two of the automated triage tasks, but not all of them. In addition, the results show that security commit messages need to be more informative: 56.6% of the messages analyzed were documented poorly. Best practices for writing informative and well-structured security commit messages (such as SECOM) should become standard practice in the security community.
Research (Full Papers)
Wed 14 Jun 2023 14:00 - 14:20 at Aurora Hall - Repository Mining. Chair(s): César França (Universidade Federal Rural de Pernambuco)
Mobile app development frameworks lower the effort needed to write and deploy apps across different execution platforms, e.g., mobile, web, and stand-alone PCs. At the same time, their use may limit native optimizations and impose overhead, increasing resource usage. In mobile devices, higher resource usage results in faster battery depletion, a significant disadvantage. In this paper, we analyze the resource usage of Android benchmarks and apps based on three mobile app development frameworks, Flutter, React Native, and Ionic, comparing them to functionally equivalent, native variants written in Java. These frameworks, besides being in widespread use, represent three different approaches to developing multiplatform apps: Flutter supports deployment of apps that are compiled and run fully natively, React Native runs interpreted JavaScript code combined with native views for different platforms, and Ionic is based on web apps, which means that it does not depend on platform-specific details. We measure the energy consumption, execution time, and memory usage of ten optimized, CPU-intensive benchmarks, to gauge overhead in a controlled manner, and of two applications, to measure their impact when running common mobile app functionality. Our results show that cross-platform and hybrid frameworks can be competitive in CPU-intensive applications. In five of the ten benchmarks, at least one framework-based version exhibits lower energy consumption and execution time than its native counterpart, with reductions of up to 81% in energy and 83% in execution time. Furthermore, in three other benchmarks, framework-based and native versions achieved similar results. Overall, Flutter usually imposes the least overhead in execution time and energy, while React Native imposes the highest in all the benchmarks. However, in an app that continuously animates multiple images on the screen, without interaction, the React Native version uses the least CPU and energy, with a reduction of up to 96% in energy compared to the second-best framework-based version. These findings highlight the importance of analyzing expected application behavior before committing to a specific framework.
Short Papers and Posters
Wed 14 Jun 2023 14:20 - 14:30 at Aurora Hall - Repository Mining. Chair(s): César França (Universidade Federal Rural de Pernambuco)
Most client software employs a bug-tracking system, which utilizes user-submitted reports (bug reports) that contain the information software developers need to fix bugs. The quality of bug reports differs drastically. Bug reports can include severity, priority, and associated issues determined by researching the addressed bug. Herein we investigate the influence of bug report qualities on successfully fixing a bug and on estimating the fixing time. We also examine the claim in previous studies that bias and differences in the treatment of bug reports exist due to the broad range of expertise among reporters. Our approach examines the relationships between the qualities within the bug-fixing cycle by modeling graphical causal dependencies through a Bayesian network. Bug reports with attachments, dependencies on another bug, and frequent discussions have a higher probability of being fixed. In addition, bug reports with a high severity tend to be fixed faster. Moreover, the difficulty of the bug itself may influence the fixing rate, such that a straightforward bug will be fixed more easily and faster regardless of the bug report quality.
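The Bayesian-network idea can be sketched with pgmpy; the graph structure, variable names, and toy data below are assumptions, not the paper's model, and pgmpy's class names have shifted across versions, so treat the import as indicative.

```python
# Sketch: causal dependencies between bug-report qualities and fix outcome.
# (In the newest pgmpy releases, BayesianNetwork is named
# DiscreteBayesianNetwork; adjust the import to your installed version.)
import pandas as pd
from pgmpy.models import BayesianNetwork
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.inference import VariableElimination

data = pd.DataFrame({
    "has_attachment": [1, 0, 1, 1, 0, 0, 1, 0],
    "severity_high":  [1, 0, 0, 1, 1, 0, 1, 0],
    "fixed":          [1, 0, 1, 1, 1, 0, 1, 0],
})

# Hypothetical structure: both qualities influence the fix outcome.
model = BayesianNetwork([("has_attachment", "fixed"),
                         ("severity_high", "fixed")])
model.fit(data, estimator=MaximumLikelihoodEstimator)

infer = VariableElimination(model)
print(infer.query(["fixed"], evidence={"has_attachment": 1,
                                       "severity_high": 1}))
```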
Short Papers and Posters
Wed 14 Jun 2023 14:30 - 14:40 at Aurora Hall - Repository Mining. Chair(s): César França (Universidade Federal Rural de Pernambuco)
Programming languages often demarcate the internal sandbox, consisting of entities such as objects and variables, from the outside world, e.g., files or the network. Although communication with the external world poses fundamental challenges for live programming, reversible debugging, testing, and program analysis in general, studies about this phenomenon are rare. In this paper, we present a preliminary empirical study of the prevalence of input/output (I/O) method usage in Java. We manually categorized 1,435 native methods in a Java Standard Edition distribution into non-I/O and I/O-related methods; the latter were further classified into areas such as desktop- or file-related ones. According to the static analysis of a call graph for 798 projects, about 57% of methods potentially call I/O natives. The results of dynamic analysis on 16 benchmarks showed that 21% of the executed methods directly or indirectly called an I/O native. We conclude that neglecting I/O is not a viable option for tool designers and suggest integrating I/O-related metadata with source code to facilitate querying.
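The static-analysis step ("potentially calls an I/O native") amounts to reachability in a call graph. The following self-contained sketch over-approximates it on a toy graph; the graph and native set are assumptions, as the paper analyzes real Java call graphs.

```python
# Sketch: which methods can transitively reach an I/O native method?
from collections import deque

call_graph = {            # caller -> callees (toy data)
    "App.main": ["Svc.run", "Util.fmt"],
    "Svc.run": ["java.io.FileInputStream.read0"],
    "Util.fmt": [],
}
io_natives = {"java.io.FileInputStream.read0"}

def reaches_io(method):
    seen, work = set(), deque([method])
    while work:
        m = work.popleft()
        if m in io_natives:
            return True
        if m in seen:
            continue
        seen.add(m)
        work.extend(call_graph.get(m, []))
    return False

print([m for m in call_graph if reaches_io(m)])  # ['App.main', 'Svc.run']
```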
Industry
Wed 14 Jun 2023 14:40 - 14:50 at Aurora Hall - Repository Mining. Chair(s): César França (Universidade Federal Rural de Pernambuco)
Automated unit test generation has been extensively studied, with prior research mostly focusing on dynamically compiled or dynamically typed programming languages like Java and Python. However, Go, a popular statically compiled and typed programming language used extensively in server application development, has received limited support from existing tools. To address this gap, we present NxtUnit, an automatic unit test generation tool for Go that uses random testing and is well suited to microservice architectures. NxtUnit employs a random approach to generate unit tests quickly, making it ideal for smoke testing and quick quality feedback. It comes with three types of interfaces: an integrated development environment (IDE) plugin, a command-line interface (CLI) tool, and a browser-based platform. The plugin and CLI tool allow engineers to write unit tests more efficiently, while the platform provides unit test visualization and asynchronous unit test generation. We evaluated NxtUnit by generating unit tests for 13 public open-source repositories and 500 ByteDance in-house repositories, resulting in a code coverage of 20.74% for the in-house repositories. We conducted a survey among ByteDance engineers and found that NxtUnit can save them 48% of the time they previously spent writing unit tests. We have made the tool available at https://github.com/bytedance/nxt_unit.
Short Papers and Posters
Wed 14 Jun 2023 14:50 - 15:00 at Aurora Hall - Repository Mining. Chair(s): César França (Universidade Federal Rural de Pernambuco)
The purpose of this study is to identify the characteristics of agile development processes that impact user satisfaction. We used user reviews of OSS smartphone apps and various data from version control systems to examine the relationships, especially time-series correlations, between user satisfaction and development metrics that are expected to be related to it. Although no metric conclusively indicates improved user satisfaction, the motivation of the development team, the ability to set appropriate work units, the appropriateness of work rules, and improvements in code maintainability should be considered, as they are correlated with improved user satisfaction. In contrast, changes in release frequency and workload are not correlated.
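Time-series correlation between a development metric and review-based satisfaction can be sketched as a lagged correlation; the series, metric, and lag range below are toy assumptions, not the study's data.

```python
# Sketch: does user satisfaction follow a development metric with a lag?
import pandas as pd

months = pd.date_range("2022-01", periods=8, freq="MS")
maintainability = pd.Series([60, 62, 61, 65, 68, 70, 71, 74], index=months)
satisfaction    = pd.Series([3.1, 3.0, 3.2, 3.2, 3.5, 3.6, 3.8, 3.9],
                            index=months)

for lag in range(0, 3):   # satisfaction `lag` months after the metric
    r = maintainability.corr(satisfaction.shift(-lag))
    print(f"lag={lag} months: r={r:.2f}")
```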
Research (Full Papers)
Wed 14 Jun 2023 15:30 - 15:50 at Aurora Hall - Methodology and Secondary Studies. Chair(s): Thomas Fehlmann (Euro Project Office)
Binary classifiers are commonly used in software engineering research to estimate several software qualities, e.g., defectiveness or vulnerability. Thus, it is important to adequately evaluate how well binary classifiers perform before they are used in practice. The Area Under the Curve (AUC) of Receiver Operating Characteristic curves has often been used to this end. However, AUC has been widely criticized, so it is necessary to evaluate under what conditions and to what extent AUC can be a reliable performance metric.
We analyze AUC in relation to φ (also known as the Matthews Correlation Coefficient), often considered a more reliable performance metric, by building the lines in ROC space with a constant value of φ, for several values of φ, and computing the corresponding values of AUC.
By their very definitions, AUC and φ depend on the prevalence ρ of a dataset, which is the proportion of its positive instances (e.g., the defective software modules). Hence, so does the relationship between AUC and φ. It turns out that AUC and φ are very well correlated, and therefore provide concordant indications, for balanced datasets (those with ρ around 0.5). Instead, AUC tends to become quite large, and hence provide over-optimistic indications, for very imbalanced datasets (those with ρ close to 0 or 1).
We use examples from the software engineering literature to illustrate the analytical relationship linking AUC, φ and ρ. We show that, for some values of ρ, the evaluation of performance based exclusively on AUC can be deceiving. In conclusion, this paper provides some guidelines for an informed usage and interpretation of AUC.
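A small synthetic illustration of this point shows AUC looking healthy while φ stays low on a heavily imbalanced dataset; the paper derives the relationship analytically, whereas the data below is simulated.

```python
# Sketch: AUC vs. phi (Matthews Correlation Coefficient) under a
# prevalence of 2% positives, with a weakly separating classifier.
import numpy as np
from sklearn.metrics import roc_auc_score, matthews_corrcoef

rng = np.random.default_rng(0)
n, rho = 10_000, 0.02                       # prevalence: 2% positives
y = (rng.random(n) < rho).astype(int)
scores = rng.normal(loc=y * 0.8, scale=1.0) # weak class separation
y_pred = (scores > 0.5).astype(int)

print("AUC:", round(roc_auc_score(y, scores), 3))   # looks reasonable
print("phi:", round(matthews_corrcoef(y, y_pred), 3))  # stays low
```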
Short Papers and Posters
Wed 14 Jun 2023 15:50 - 16:00 at Aurora Hall - Methodology and Secondary Studies. Chair(s): Thomas Fehlmann (Euro Project Office)
Background: Construct validity concerns the use of indicators to measure a concept that is not directly measurable.
Aim: This study intends to identify, categorize, assess and quantify discussions of threats to construct validity in empirical software engineering literature and use the findings to suggest ways to improve the reporting of construct validity issues.
Method: We analyzed 83 articles that report human-centric experiments published in five top-tier software engineering journals from 2015 to 2019. The articles’ text concerning threats to construct validity was divided into segments (the unit of analysis) based on predefined categories. The segments were then evaluated regarding whether they clearly discussed a threat and a construct.
Results: Three-fifths of the segments were associated with topics not related to construct validity. Two-thirds of the articles discussed construct validity without using the definition of construct validity given in the article. The threats were clearly described in more than four-fifths of the segments, but the construct in question was clearly described in only two-thirds of the cases. The construct was unclear when the discussion was not related to construct validity but to other types of validity.
Conclusions: The results show potential for improving the understanding of construct validity in software engineering. Recommendations addressing the identified weaknesses are given to improve the awareness and reporting of construct validity.
Research (Full Papers)
Wed 14 Jun 2023 16:00 - 16:20 at Aurora Hall - Methodology and Secondary Studies. Chair(s): Thomas Fehlmann (Euro Project Office)
With privacy concerns growing within the field of machine learning, federated learning (FL) was introduced in 2017; in FL, clients such as mobile devices compute or train a model and send the update to a centralized server. Choosing clients randomly for FL can harm learning performance for several reasons. Many studies have proposed approaches to address the challenges of FL client selection. However, there was no systematic literature review (SLR) on this topic. This SLR investigates the state of the art of client selection in FL and answers the challenges, solutions, and metrics to evaluate the solutions. We systematically reviewed 43 primary studies. The main challenges found in client selection are heterogeneity, resource allocation, communication costs, and fairness. The client selection schemes aim to improve the original random selection algorithm by focusing on one or more of the aforementioned challenges. The most commonly used metric is testing accuracy versus communication rounds: testing accuracy measures the success of the learning, preferably in as few communication rounds as possible, as these are quite expensive. Although several improvements can be made to the current state of client selection, the most beneficial ones are evaluating the impact of unsuccessful clients and gaining a more theoretical understanding of the impact of fairness in federated learning.
EASIER
Wed 14 Jun 2023 16:20 - 16:30 at Aurora Hall - Methodology and Secondary Studies. Chair(s): Thomas Fehlmann (Euro Project Office)
Hannah Deters, Jakob Droste and Kurt Schneider
Explainability is an emerging quality aspect of software systems. Explanations offer a solution approach for achieving a variety of quality goals, such as transparency and user satisfaction. Therefore, explainability should be considered a means to an end. The evaluation of quality aspects is essential for successful software development. Evaluating explainability allows an initial assessment of the quality of explanations and enables the comparison of different explanation variants. As the evaluation depends on what quality goals the explanations are supposed to achieve, evaluating explainability is non-trivial. To address this problem, we combine the already well-established method of expert evaluation with goal-oriented heuristics, i.e., heuristics grouped with respect to the goals that the explanations are meant to achieve. Appropriate goal-oriented heuristics enable software engineers to evaluate explanations and identify problems with affordable resources. To show that this way of evaluating explainability is suitable, we conducted an interactive user study using a high-fidelity software prototype. The results suggest that aligning heuristics with specific goals can enable an effective assessment of explainability.
Journal First
Wed 14 Jun 2023 16:30 - 16:40 at Aurora Hall - Methodology and Secondary Studies. Chair(s): Thomas Fehlmann (Euro Project Office)
A key part of software evolution and maintenance is the continuous integration of collaborative efforts, which often results in complex traceability challenges between software artifacts: features and modules remain scattered in the source code, and traceability links become harder to recover. In this paper, we perform a systematic mapping study of recent research on recovering these links through information retrieval, with a particular focus on natural language processing (NLP).
Our search strategy gathered a total of 96 papers within the focus of our study, covering the period from 2013 to 2021. We conducted a trend analysis of the NLP techniques and tools involved, and of traceability efforts (applying NLP) across the software development life cycle (SDLC). Based on our study, we have identified the following key issues, barriers, and setbacks: syntax convention, configuration, translation, explainability, properties representation, tacit knowledge dependency, scalability, and data availability.
Based on these, we consolidated the following open challenges: representation similarity across artifacts, the effectiveness of NLP for traceability, and achieving scalable, adaptive, and explainable models. To address these challenges, we recommend a holistic framework for NLP solutions to achieve effective traceability, along with efforts toward interoperability and explainability in NLP models for traceability.
Journal First
Wed 14 Jun 2023 16:40 - 16:50 at Aurora Hall - Methodology and Secondary Studies. Chair(s): Thomas Fehlmann (Euro Project Office)
Context: Burnout is a work-related syndrome that, as in many occupations, affects most software developers. For decades, studies in software engineering (SE) have explored the causes of burnout and its consequences among IT professionals.
Objective: This paper is a systematic mapping study (SMS) of the studies on burnout in SE, exploring its causes and consequences, and how it is studied (e.g., choice of data).
Method: We conducted a systematic mapping study and identified 92 relevant research articles, dating back to the early 1990s, focusing on various aspects of and approaches to detecting burnout in software developers and IT professionals.
Results: Our study shows that early research on burnout was primarily qualitative and has steadily moved toward more quantitative, data-driven work in the last decade.
Machine learning (ML) approaches have emerged and become a de facto standard for detecting burnout in developers.
Conclusion: Our study summarises what we now know about burnout, how software artifacts indicate burnout, and how machine learning can help its early detection. As a comprehensive analysis of past and present research works in the field, we believe this paper can help future research and practice focus on the grand challenges ahead and offer necessary tools.
EASIER
Wed 14 Jun 2023 16:50 - 17:00 at Aurora Hall - Methodology and Secondary Studies. Chair(s): Thomas Fehlmann (Euro Project Office)
Vita Santa Barletta, Danilo Caivano, Domenico Gigante and Azzurra Ragone
In recent years, the rise of Artificial Intelligence (AI), and its pervasiveness in our lives, has sparked a flourishing debate about the ethical principles that should guide its implementation and use in society. Driven by these concerns, we conducted a rapid review of several frameworks providing principles, guidelines, and/or tools to help practitioners in the development and deployment of Responsible AI (RAI) applications. We map each framework with respect to the different Software Development Life Cycle (SDLC) phases, discovering that most of these frameworks fall solely within the Requirements Elicitation phase, leaving the other phases uncovered. Very few of these frameworks offer supporting tools for practitioners, and they are mainly provided by private companies. Our results reveal that there is no “catch-all” framework supporting both technical and non-technical stakeholders in the implementation of real-world projects. Our findings highlight the lack of a comprehensive framework encompassing all RAI principles and all SDLC phases that could be navigated by users with different skill sets and different goals.
Research (Full Papers)
Thu 15 Jun 2023 08:30 - 08:50 at Aurora Hall - Human Factors. Chair(s): Rahul Mohanani (University of Jyväskylä)
CONTEXT: Self-efficacy is a concept researched in various areas of knowledge that impacts factors such as performance, satisfaction, and motivation. In Software Engineering, it has mainly been studied in the academic context, with results similar to those of other areas. However, it is also important to understand its impact in the industrial context.
OBJECTIVE: This study aims to understand the impact of self-efficacy in the software development context, focusing on the behavioral signs of self-efficacy in software engineers and on how self-efficacy can affect their workday.
METHOD: A qualitative study was conducted using semi-structured questionnaires with 32 interviewees from a software development company located in Brazil. The interviewees participated in a bootcamp and were later assigned to software development teams. Thematic analysis was used to analyze the data.
RESULTS: In the perception of the interviewees, 27 signs were found that relate to people with high and low self-efficacy. These signs were divided into three dimensions: social, cognitive, and performance. In addition, 30 situations were found that can increase or decrease the self-efficacy of software engineers, and 14 factors were mentioned that can impact software development teams.
CONCLUSION: This work underlines the importance of self-efficacy in the industrial context. It presents a set of signs that can help team leaders better perceive the self-efficacy of their members, a set of situations that both leaders and individuals can use to improve self-efficacy in the development context, and the factors that self-efficacy can impact in the software development context.
Short Papers and Posters
Thu 15 Jun 2023 08:50 - 09:00 at Aurora Hall - Human Factors. Chair(s): Rahul Mohanani (University of Jyväskylä)
Social inclusion is a fundamental feature of thriving societies. This paper first investigates barriers to social inclusion in online Software Engineering (SE) communities by identifying a set of attributes and organising them as a taxonomy. Second, by applying the taxonomy and analysing the language used in the comments posted by members of 189 Gitter projects (with more than 3 million comments), it presents evidence for the presence of the social exclusion problem. Third, it presents a framework for improving social inclusion in SE communities.
Research (Full Papers)
Thu 15 Jun 2023 09:00 - 09:20 at Aurora Hall - Human Factors. Chair(s): Rahul Mohanani (University of Jyväskylä)
GitHub Actions were introduced in 2019 to increase workflow velocity and add customized automation to repositories. In addition, GitHub introduced its own marketplace for promoting, commercializing, and sharing these automation tools, which currently hosts 16,730 Actions. Furthermore, numerous Actions are developed and distributed in local repositories and outside the marketplace. So far, the research community has conducted mining studies to understand GitHub Actions, with a significant focus on CI/CD. We performed a survey study with 90 Action developers and users of GitHub Actions to understand the motivations and best practices in using, developing, and debugging Actions, and the challenges associated with these tasks. We found that developers prefer Actions with verified creators and more stars when choosing between similar Actions, and often switch to a different Action when an Action has bugs, is not properly documented, or another Action of better quality is available. Developers choose to develop new Actions rather than use existing ones mainly when there is no existing Action for their task or the existing Actions are limited in functionality. 60.87% of the developers consider the composition of YAML files challenging and error-prone and mostly check Q&A forums to fix issues with these YAML files. Finally, developers tend to avoid using Actions to reduce complexity and security risk, or when the benefits of Actions are not worth the cost/effort of setting them up for automation.
Vision and Emerging Results
Thu 15 Jun 2023 09:20 - 09:30 at Aurora Hall - Human Factors. Chair(s): Rahul Mohanani (University of Jyväskylä)
Background: The UK cyber skills gap/shortage amplifies the broader impact of cyber-attacks, which inflict harms such as privacy and economic loss on wider society. The demand is greatest (and growing fastest) in cyber-enabled disciplines, such as software engineering.
Objectives: In this paper, we create a term frequency-inverse document frequency representation of the Cyber Security Body of Knowledge (CyBOK). We then evaluate the potential of this representation by using it to automatically map job descriptions to the different areas of the CyBOK.
Method: We generate two representations of the CyBOK. The representations are mapped to a corpus of 454 job descriptions using TF-IDF. Comparing the similarity scores across these mappings allows us to identify relevant knowledge areas/groups.
Results: The results are preliminary, but suggest that the approach warrants further investigation. Certain job descriptions are mapped to certain knowledge areas/groups in a way that makes intuitive sense to the authors. However, there is a degree of homogeneity in the scores returned for certain knowledge areas/groups. There are several threats to validity, most notably the low number of job descriptions studied.
Conclusions: Our work shows that it is possible to automatically map job descriptions to the CyBOK in a meaningful way. Further research is required to address threats and to explore alternative mapping approaches. The authors intend to undertake this research culminating with a Grey Literature Informed Model of Practice in Secure Software Engineering.
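The TF-IDF mapping can be sketched as follows; the knowledge-area texts and the job advert are toy stand-ins for the real CyBOK chapters and the 454-description corpus.

```python
# Sketch: rank CyBOK knowledge areas for a job description by cosine
# similarity over TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge_areas = {
    "Software Security": "secure coding vulnerabilities input validation",
    "Network Security": "firewalls intrusion detection network protocols",
}
job_ad = "We need a developer experienced in secure coding and input validation."

vec = TfidfVectorizer()
ka_matrix = vec.fit_transform(knowledge_areas.values())
sims = cosine_similarity(vec.transform([job_ad]), ka_matrix)[0]

for name, score in sorted(zip(knowledge_areas, sims), key=lambda t: -t[1]):
    print(f"{name}: {score:.2f}")
```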
Research (Full Papers)
Thu 15 Jun 2023 09:30 - 09:50 at Aurora Hall - Human Factors. Chair(s): Rahul Mohanani (University of Jyväskylä)
Background: Adaptive user interfaces have the advantage of being able to dynamically change their aspect and/or behaviour depending on the characteristics of the context of use, i.e., to improve the user experience. User experience is an important quality factor that has been primarily evaluated with classical measures (e.g., effectiveness, efficiency, satisfaction), and to a lesser extent with physiological measures such as emotion recognition, skin response, or brain activity. Aim: In a previous exploratory experiment involving users with different profiles and a wide range of ages, we analysed user experience in terms of cognitive load, engagement, attraction, and memorisation when employing twenty graphical adaptive menus, through the use of an electroencephalogram (EEG) device. The results indicated that there were statistically significant differences for these four variables. However, we considered it necessary to confirm or reject these findings using a more homogeneous group of users. Method: We conducted a strict internal replication study with 40 participants. We also investigated the potential correlation between EEG signals and the participants’ user experience ratings, such as their preferences. Results: The results of this experiment confirm that there are statistically significant differences between the EEG variables when the participants interact with the different adaptive menus. Moreover, there is a high correlation between the participants’ user experience ratings and the EEG signals, and a trend regarding performance has emerged from our analysis. Conclusions: These findings suggest that EEG signals could be used to evaluate user experience. With regard to the menus studied, our results suggest that graphical menus with different structures and font types produce more differences in users’ brain responses, while menus which use colours produce more similarities in users’ brain responses. Several insights with which to improve users’ experience of graphical adaptive menus are outlined.
Short Papers and Posters
Thu 15 Jun 2023 09:50 - 10:00 at Aurora Hall - Human Factors. Chair(s): Rahul Mohanani (University of Jyväskylä)
As part of a research project on software maintainability assessment conducted in collaboration with the development team, we replicated a study by Schnappinger et al. on human-level ordinal maintainability prediction. Our goal was to validate that we could obtain the same results with the open-source dataset and the open-source tool Javanalyser. Moreover, we extended the setup to predict continuous maintainability and evaluated the overall influence of class size on the predictions. Our approach consisted of nearly 20,000 experimental runs to replicate and extend the original study. All our datasets, code, and results are publicly available to allow further analysis by the community. In the end, we successfully replicated the original study. Moreover, we showed that continuous maintainability labels yield better predictions than an ordinal scale. Finally, we showed that metrics other than size contain information that is essential for fine-grained maintainability prediction. This study shows that it is necessary to explore the nature of what code metrics measure, and it is also a first step in the construction of a maintainability model.
Research (Full Papers)
Thu 15 Jun 2023 10:30 - 10:50 at Aurora Hall - Software Architecture. Chair(s): Andrea Janes (FHV Vorarlberg University of Applied Sciences)
Architectural smells have been studied in the literature from several angles, such as their impact on maintainability as a source of architectural debt, their correlations with code smells, and their evolution in the history of complex projects. The goal of this paper is to extend the study of architectural smells from a different perspective. We focus our attention on software performance, and we aim to quantify the impact of architectural smells to help explain the root causes of system performance hindrances. Our method consists of a study design matching the occurrence of architectural smells with performance metrics. We exploit state-of-the-art tools for architectural smell detection, software performance profiling, and testing of the systems under analysis. The removal of architectural smells generates new versions of the systems, from which we derive observations on design changes that improve or worsen performance metrics. Our experimentation considers two complex open-source projects, and the results show that the detection and removal of two common types of architectural smells yield lower response times (up to 47%) with a large effect size (for 50%-90% of the hot-spot methods). The median memory consumption is also lower (up to 40%) with a large effect size (for all the services).
Industry
Thu 15 Jun 2023 10:50 - 11:00 at Aurora Hall - Software Architecture. Chair(s): Andrea Janes (FHV Vorarlberg University of Applied Sciences)
Technical debt is frequently the result of short-run decisions made during code development, which can lead to long-term maintenance costs and risks. It is therefore essential to evaluate the project’s progression and understand the different influencing factors. Fortunately, the prioritization process for addressing technical debt can be expedited with static code analysis tools like the established SonarQube. Unfortunately, we experienced some limitations with SonarQube, among other tools, and have perceived requirements from industry that were not yet addressed. By means of this experience report and an analysis of scientific papers, this work contributes: (1) a comprehensive reassessment of code debt within industry, (2) a discussion of the benefits of employing SonarQube as well as its limitations when evaluating and prioritizing code debt, (3) a novel tool named SoHist, which addresses some of these limitations and offers additional features for the assessment and prioritization of technical debt, and (4) an exemplification of the usage of this tool in two industrial settings as part of the ITEA3 SmartDelta project.
Research (Full Papers)
Thu 15 Jun 2023 11:00 - 11:20 at Aurora Hall - Software Architecture. Chair(s): Andrea Janes (FHV Vorarlberg University of Applied Sciences)
Code review is a common practice in software development and is often conducted before code changes are merged into the code repository. A number of approaches for automatically recommending appropriate reviewers have been proposed to match such code changes to pertinent reviewers. However, such approaches are generic, i.e., they do not focus on specific types of issues during code reviews. In this paper, we propose an approach that focuses on architecture violations, one of the most critical types of issues identified during code review. Specifically, we aim at automating the recommendation of code reviewers who are potentially qualified to review architecture violations, based on reviews of code changes. To this end, we selected three common similarity detection methods to measure the file path similarity of code commits and the semantic similarity of review comments. We conducted a series of experiments on finding appropriate reviewers by evaluating and comparing these similarity detection methods, separately and in combination, against the baseline reviewer recommendation approach, RevFinder. The results show that the common similarity detection methods can produce acceptable performance scores and achieve better performance than RevFinder. The sampling techniques used in recommending code reviewers can impact the performance of reviewer recommendation approaches. We also discuss the potential implications of our findings for both researchers and practitioners.
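One of the ingredients, file-path similarity between a new change and a reviewer's past reviews, can be sketched as below; the overlap measure and review history are illustrative assumptions (RevFinder-style string comparison is one of several methods the paper evaluates).

```python
# Sketch: rank candidate reviewers by how closely their previously reviewed
# file paths match the path of a new change (toy history).
def path_similarity(p1, p2):
    a, b = p1.split("/"), p2.split("/")
    common = sum(1 for x, y in zip(a, b) if x == y)  # shared leading parts
    return common / max(len(a), len(b))

history = {  # reviewer -> file paths of previously reviewed changes
    "alice": ["core/arch/layer.py", "core/arch/rules.py"],
    "bob":   ["ui/views/menu.py"],
}
new_change = "core/arch/checker.py"

scores = {r: max(path_similarity(new_change, p) for p in paths)
          for r, paths in history.items()}
print(sorted(scores.items(), key=lambda t: -t[1]))  # alice ranked first
```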
Vision and Emerging Results
Thu 15 Jun 2023 11:20 - 11:30 at Aurora Hall - Software Architecture. Chair(s): Andrea Janes (FHV Vorarlberg University of Applied Sciences)
Architecting software-intensive systems can be a complex process. It deals with the daunting tasks of unifying stakeholders’ perspectives, designers’ intellect, tool-based automation, pattern-driven reuse, and so on, to sketch a blueprint that guides software implementation and evaluation. Despite its benefits, architecture-centric software engineering (ACSE) inherits a multitude of challenges. ACSE challenges could stem from a lack of standardized processes, socio-technical limitations, and a scarcity of human expertise, all of which can impede the development of existing and emergent classes of software (e.g., IoT, blockchain, and quantum systems). Software Development Bots (DevBots) trained on large language models can help synergise architects’ knowledge with artificially intelligent decision support to enable rapid architecting in a human-bot collaborative ACSE. An emerging solution to enable this collaboration is ChatGPT, a disruptive technology not primarily introduced for software engineering, but one capable of articulating and refining architectural artifacts based on natural language processing. We detail a case study that involves collaboration between a novice software architect and ChatGPT for the architectural analysis, synthesis, and evaluation of a services-driven software application. Preliminary results indicate that ChatGPT can mimic an architect’s role to support, and often lead, ACSE; however, it requires human oversight and decision support for collaborative architecting. Future research will focus on harnessing empirical evidence about architects’ productivity and exploring socio-technical aspects of architecting with ChatGPT to tackle emerging and futuristic challenges of ACSE.
Link to publication Pre-print
Research (Full Papers)
Thu 15 Jun 2023 11:30 - 11:50 at Aurora Hall - Software Architecture Chair(s): Andrea Janes FHV Vorarlberg University of Applied Sciences
Repairing design models is a laborious task that requires a considerable amount of time and effort from developers. Repair recommendation (RR) approaches aim to reduce this effort and improve the quality of the repairs performed. Such approaches have been evaluated in terms of scalability, correctness, and minimalism. These evaluations, however, have not investigated how developers benefit from using RRs or how they perceive the difficulty of applying them. Investigating and discussing the use of RRs from the developers' perspective is important to demonstrate the benefits of applying such approaches in practice. We explore this opportunity through a controlled experiment with 24 developers who repaired UML design models in eight different tasks, with and without RRs. The findings indicate that developers can benefit from RRs in complex tasks through improved effectiveness and efficiency. The results also show that the use of RRs does not affect developers' perceived difficulty or confidence when repairing models. Furthermore, our findings show that developers do not all choose the same RR but rather have varied preferences. Thus, the provision of RRs leads developers to consider additional alternatives for repairing an inconsistency.
Link to publication DOI Pre-print File Attached
Short Papers and Posters
Thu 15 Jun 2023 11:50 - 12:00 at Aurora Hall - Software Architecture Chair(s): Andrea Janes FHV Vorarlberg University of Applied Sciences
Making sub-optimal design decisions during software development leads to the accumulation of Technical Debt (TD) in software projects. Tools exist to identify TD items in software code through static code analysis. However, quantifying TD to support the decision of whether to keep taking on TD or to refactor it is a difficult task, and proposed approaches still lack consensus. Prior work observed that TD Interest can be decomposed into the constituents 'New Code Cost' and 'Rework Cost,' which suggests an interesting research direction: quantifying TD in terms of these costs. Through our experiments, we therefore plan to explore the relationship between TD, New Code Cost, and Rework Cost in open-source software projects. This paper reports on an initial motivating experiment, our plan for future work, and implications for researchers.
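As a concrete reading of the decomposition, the sketch below sums the two constituents into a TD Interest figure. The hour values and the simple additive model are illustrative assumptions, not the paper's measurement method.

```python
# Illustrative decomposition of TD Interest into the two constituents
# named above; the hour values and additive model are assumptions made
# for this example only.
new_code_cost = 12.0  # extra hours writing new code against the debt-laden design
rework_cost = 8.0     # hours reworking existing code affected by the debt
td_interest = new_code_cost + rework_cost
print(f"TD Interest this period: {td_interest:.1f} hours")
```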
Link to publication DOI Pre-print File Attached
Research (Full Papers)
Thu 15 Jun 2023 13:30 - 13:50 at Aurora Hall - Software Testing and Analysis Chair(s): Davide Taibi University of Oulu
Bug tracking systems define bug life cycles that outline their bug tracking process. In this study, we assess bug life cycles to identify bottlenecks in bug tracking processes and examine the effectiveness of bug tracking system usage practices linked to bug states and state transitions. To achieve this, we examined the bug life cycles of three open-source software projects that use Bugzilla as their bug tracking system, analyzing a total of 106,196 bugs gathered from these projects. We started by looking at the temporal and quantitative aspects of these projects' bug life cycles. We then collected data on how the bug life cycles differ over time. Finally, we inspected the frequency of reopened and state-looping bugs in these projects. From our analysis, we conclude that the presented temporal and quantitative analysis of bug life cycles is useful for finding bottlenecks and undesired behaviors in bug tracking processes. We also found that examining changes in bug life cycles over time provides insights into how bug tracking practices evolved throughout a project's lifetime and can serve as a parameter for assessing whether bug tracking system usage has improved. Lastly, we deduced that analyzing the frequency of undesired state trails provides insights into the performance of bug tracking processes. Based on the insights gained from analyzing bug life cycles with the presented methods, we believe that decision makers can improve their workflow by adding states to or removing states from the bug life cycle and by adding new rules and restrictions to their bug tracking process.
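The following sketch illustrates the kind of state-trail analysis described here: given each bug's sequence of states, it flags reopened and state-looping bugs. The Bugzilla-style state names and sample histories are invented for illustration; the study's actual analysis is far more extensive.

```python
# Minimal sketch: flag reopened and state-looping bugs from each bug's
# sequence of states. State names and sample histories are invented.
from collections import Counter

def is_reopened(states):
    return "REOPENED" in states

def is_state_looping(states):
    """A bug loops if it enters the same state more than once
    (ignoring the REOPENED marker itself)."""
    counts = Counter(s for s in states if s != "REOPENED")
    return any(n > 1 for n in counts.values())

histories = {
    1: ["NEW", "ASSIGNED", "RESOLVED", "VERIFIED"],
    2: ["NEW", "ASSIGNED", "RESOLVED", "REOPENED", "ASSIGNED", "RESOLVED"],
}
for bug_id, states in histories.items():
    print(bug_id, "reopened:", is_reopened(states),
          "looping:", is_state_looping(states))
```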
DOI Pre-print File Attached
Industry
Thu 15 Jun 2023 13:50 - 14:00 at Aurora Hall - Software Testing and Analysis Chair(s): Davide Taibi University of Oulu
Kubernetes is a free, open-source container orchestration system for deploying and managing Docker containers that host microservices. Kubernetes cluster logs help determine the reason for a failure. However, as systems become more complex, identifying failure reasons manually becomes more difficult and time-consuming. This study aims to identify effective and efficient classification algorithms for automatically determining the failure reason. We compare five classification algorithms: Support Vector Machines, K-Nearest Neighbors, Random Forest, Gradient Boosting Classifier, and Multilayer Perceptron. Our results indicate that Random Forest achieves good accuracy while requiring fewer computational resources than the other algorithms.
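A minimal scikit-learn sketch of such a comparison is shown below. The log lines and labels are invented, and TF-IDF over raw log text is an assumed feature representation; the study's dataset and evaluation are more thorough.

```python
# Minimal sketch of the classifier comparison with scikit-learn. The
# log lines and labels are invented examples.
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

logs = [
    "OOMKilled container exceeded memory limit",
    "container killed due to OOM",
    "memory cgroup out of memory",
    "pod evicted low memory on node",
    "oom_kill triggered for process",
    "process killed exceeded memory quota",
    "Failed to pull image registry timeout",
    "ErrImagePull image not found",
    "ImagePullBackOff back-off pulling image",
    "pull access denied for image",
    "manifest unknown failed to pull",
    "image pull failed connection refused",
]
labels = ["oom"] * 6 + ["image-pull"] * 6

classifiers = {
    "SVM": SVC(),
    "kNN": KNeighborsClassifier(n_neighbors=3),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "Gradient Boosting": GradientBoostingClassifier(),
    "MLP": MLPClassifier(max_iter=1000),
}
for name, clf in classifiers.items():
    pipe = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(pipe, logs, labels, cv=3)
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```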
Pre-print File Attached
Research (Full Papers)
Thu 15 Jun 2023 14:00 - 14:20 at Aurora Hall - Software Testing and Analysis Chair(s): Davide Taibi University of Oulu
Many advanced program analysis and verification methods are based on solving systems of Constrained Horn Clauses (CHC). Testing CHC solvers is very important, as the correctness of their results determines whether bugs in the analyzed programs are detected or missed. One well-established and efficient method of automated software testing is fuzzing: analyzing the reactions of programs to random input data. Currently, there are no fuzzers for CHC solvers, and fuzzers for SMT solvers are not effective for testing CHC solvers because they do not consider CHC specifics. In this paper, we present HornFuzz, a mutation-based gray-box fuzzing technique for detecting bugs in CHC solvers based on the idea of metamorphic testing. We evaluated our fuzzer on one of the highest performing CHC solvers, Spacer, and found a handful of bugs in it. In particular, some of the discovered problems are serious enough to require fixes involving significant changes to the solver.
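The metamorphic idea can be sketched as follows: apply a satisfiability-preserving mutation to a CHC benchmark and flag the solver if its verdict changes. Shuffling top-level assertions, the `benchmark.smt2` placeholder, and invoking a `z3` binary on the file are all illustrative assumptions; HornFuzz's mutations and instrumentation are considerably richer.

```python
# Minimal sketch of metamorphic testing for CHC solvers: shuffle the
# order of top-level assertions (a satisfiability-preserving mutation)
# and check that the verdict is unchanged. "benchmark.smt2" is a
# placeholder for a CHC benchmark in SMT-LIB format with one assertion
# per line, and a `z3` binary is assumed to be on PATH.
import random
import subprocess
import tempfile

def solve(smt2_text: str) -> str:
    """Run the solver on the given SMT-LIB text; return its verdict."""
    with tempfile.NamedTemporaryFile("w", suffix=".smt2", delete=False) as f:
        f.write(smt2_text)
        path = f.name
    out = subprocess.run(["z3", path], capture_output=True, text=True)
    return out.stdout.strip()  # "sat", "unsat", or "unknown"

def mutate(lines):
    """Shuffle the (assert ...) lines, keeping everything else in place."""
    asserts = [l for l in lines if l.lstrip().startswith("(assert")]
    random.shuffle(asserts)
    it = iter(asserts)
    return [next(it) if l.lstrip().startswith("(assert") else l for l in lines]

lines = open("benchmark.smt2").read().splitlines()
reference = solve("\n".join(lines))
for _ in range(10):
    if solve("\n".join(mutate(lines))) != reference:
        print("Potential solver bug: verdict changed after mutation")
```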
Link to publication DOI Pre-print File Attached
Journal First
Thu 15 Jun 2023 14:20 - 14:30 at Aurora Hall - Software Testing and Analysis Chair(s): Davide Taibi University of Oulu
State model inference of software applications through the Graphical User Interface (GUI) is a technique that identifies GUI states and transitions and maps them into a model. Scriptless GUI testing tools can benefit substantially from the availability of these state models, for example, to improve exploration or to enable sophisticated test oracles. However, inferring models for large systems requires a long execution time. Our goal is to improve the speed of the state model inference process. To achieve this goal, this paper presents a distributed state model inference approach built on an open-source scriptless GUI testing tool. Moreover, to infer a suitable model, we design a set of strategies to deal with abstraction challenges and to distinguish GUI states and transitions in the model. To validate the approach, we conduct an experiment with two open-source web applications tested with the distributed architecture using one to six Docker containers sharing the same state model. The obtained results show that it is feasible to infer a model with a distributed approach and that doing so reduces the time required for inferring a state model.
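The following sketch illustrates the shared-state-model idea in miniature: multiple explorers (separate Docker containers in the paper's setup) record abstracted GUI states and transitions into one common model. The widget-tree hashing used for abstraction is an illustrative assumption, not the paper's abstraction strategy.

```python
# Minimal sketch of a shared state model fed by several explorers.
# The widget-tree hashing used to abstract GUI states is an
# illustrative assumption.
import hashlib

def abstract_state(widget_tree):
    """Map a concrete GUI state, given as (widget type, title) pairs,
    to a short abstract state identifier."""
    return hashlib.sha1(repr(sorted(widget_tree)).encode()).hexdigest()[:12]

model = {"states": set(), "transitions": set()}

def record(prev_tree, action, next_tree):
    s, t = abstract_state(prev_tree), abstract_state(next_tree)
    model["states"].update({s, t})
    model["transitions"].add((s, action, t))

# Two explorers discovering overlapping parts of the same GUI:
record([("button", "Login")], "click Login", [("form", "Credentials")])
record([("button", "Login")], "click Help", [("page", "Help")])
print(len(model["states"]), "states,", len(model["transitions"]), "transitions")
```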
Acceptance date: 8 February 2023. Journal: Journal of Systems and Software (JSS). DOI: https://doi.org/10.1016/j.jss.2023.111645
Link to publication DOI File Attached
EASIER
Thu 15 Jun 2023 14:30 - 14:40 at Aurora Hall - Software Testing and Analysis Chair(s): Davide Taibi University of Oulu
Marian Daun and Jennifer Brings
Requirements validation is an important aspect of ensuring high-quality software. During requirements validation, it is not the software product that is validated but the requirements themselves, and much effort is therefore spent on manual validation activities. Commonly used are requirements inspections, in which the specification is read by different persons who assume different roles or apply different reading techniques, sometimes accompanied by checklists. Defect detection with requirements inspections is costly, and defect detection rates must be considered low. Therefore, validation is repeated or performed with multiple inspection groups, known as N-fold inspections. However, this yields not only more detected defects but also more false positives. In this paper, we investigate how defect aggregation can be used to improve the overall quality of validation. To this end, we conducted an experiment with 22 N-fold inspection groups consisting of four to five reviewers each. Results show that simple aggregation of all results leads to a number of false positives that can actually harm the validation task, while more tailored aggregation strategies can considerably improve the validation of requirements with N-fold inspections.
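One example of a tailored aggregation strategy is a simple voting threshold: keep a reported defect only if at least k of the n inspection groups report it. The sketch below illustrates this; the threshold and sample reports are invented, and the paper compares several aggregation strategies.

```python
# Minimal sketch of a voting-threshold aggregation strategy: keep a
# defect only if at least k of the n groups report it. Threshold and
# sample reports are invented for illustration.
from collections import Counter

def aggregate(group_reports, k=2):
    """group_reports: one set of reported defect ids per inspection group."""
    counts = Counter(d for report in group_reports for d in report)
    return {d for d, n in counts.items() if n >= k}

reports = [{"D1", "D3"}, {"D1", "D2"}, {"D1", "FP7"}]
print(aggregate(reports, k=2))  # {'D1'}; singleton reports, the likely false positives, drop out
```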
Link to publication
EASIER
Thu 15 Jun 2023 14:40 - 14:50 at Aurora Hall - Software Testing and Analysis Chair(s): Davide Taibi University of Oulu
Markus Borg, Adam Tornhill and Enys Mones
[Context] Accurate time estimation is a critical aspect of predictable software engineering. Previous work shows that low source code quality increases the uncertainty in issue resolution times. [Objective] Our goal is to evaluate how developers’ project experience and file ownership are related to issue resolution times. [Method] We mine 40 proprietary software repositories and conduct an observational study. Using CodeScene, we measure source code quality and active development time connected to Jira issues. [Results] Most source code changes are made by either a marginal or dominant code owner. Also, most changes to low-quality source code are made by developers with low levels of ownership. In low-quality source code, marginal owners need 45% more time for small changes, and 93% more time for large changes. [Conclusions] Collective code ownership is a popular target, but industry practice results in many dominant and marginal owners. Marginal owners are particularly hampered when working with low-quality source code, which leads to productivity losses. In codebases plagued by technical debt, newly onboarded developers will require more time to complete tasks.
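To illustrate the ownership notion discussed here, the sketch below derives per-file ownership shares from commit authorship and labels owners as dominant or marginal. The thresholds are illustrative assumptions; the study's exact ownership definition may differ.

```python
# Minimal sketch: derive per-file ownership shares from commit
# authorship and label owners. The dominant/marginal thresholds are
# illustrative assumptions.
from collections import Counter

def ownership(commit_authors):
    """commit_authors: authors of the commits touching one file."""
    counts = Counter(commit_authors)
    total = sum(counts.values())
    return {author: n / total for author, n in counts.items()}

shares = ownership(["ann", "ann", "ann", "ben", "ann", "cam"])
for author, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    kind = "dominant" if share >= 0.5 else "marginal" if share <= 0.2 else "balanced"
    print(f"{author}: {share:.0%} ({kind})")
```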
Pre-print File Attached
Industry
Thu 15 Jun 2023 14:50 - 15:00 at Aurora Hall - Software Testing and Analysis Chair(s): Davide Taibi University of Oulu
Popular modern code review tools (e.g., Gerrit and GitHub) sort the files in a code review in alphabetical order. A prior study on open-source projects shows that a changed file's position in the code review affects the review process: files placed lower in the order receive less reviewing effort than the others, so defects in these files are more likely to be missed. This paper explores the impact of file order in the code review of the well-known industrial project IntelliJ IDEA. First, we verify the results of the prior study on a large proprietary software project. Then, we explore an alternative to the default alphabetical order: ordering changed files according to their code diff. Our results confirm the observations of the previous study; reviewers leave more comments on the files shown higher in the code review. Moreover, even with the data skewed toward alphabetical order, ordering changed files according to their code diff places problematic files, which need more reviewing effort, better than the standard alphabetical order does. These results indicate that ordering strategies for code review merit further exploration.
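The alternative ordering can be sketched in a few lines: sort changed files by diff size, descending, instead of alphabetically. Treating the "code diff" as added-plus-deleted lines is an illustrative simplification of whatever metric the study uses.

```python
# Minimal sketch of the alternative ordering: show changed files in
# descending order of diff size instead of alphabetically. Diff size
# as added-plus-deleted lines is an illustrative simplification.
changes = {
    "README.md": 2,            # added + deleted lines
    "core/engine.py": 240,
    "tests/test_engine.py": 38,
}
print("alphabetical:", sorted(changes))
print("by diff size:", sorted(changes, key=changes.get, reverse=True))
# core/engine.py, the file most in need of review, now comes first
```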
Pre-print File Attached
Research (Full Papers)
Thu 15 Jun 2023 15:30 - 15:50 at Aurora Hall - Software Development Processes Chair(s): Eray Tüzün Bilkent University
Context. WebAssembly (WASM) is a low-level bytecode format that is gaining traction among Internet of Things (IoT) devices. Because of IoT devices' resource limitations, WASM is becoming a popular technique for virtualization on them. However, it is unclear whether WASM's promises of efficient energy use and performance gains hold true. Goal. This study aims to determine how different source programming languages and runtime environments affect the energy consumption and performance of WASM binaries. Method. We perform a controlled experiment in which we compile three benchmarking algorithms from four programming languages (C, Rust, Go, and JavaScript) to WASM and run them using two different WASM runtimes on a Raspberry Pi 3B. Results. The source programming language significantly influences the performance and energy consumption of WASM binaries. In contrast, we did not find evidence of an impact of the runtime environment. However, certain combinations of source programming language and runtime environment lead to significant improvements in energy consumption and performance. Conclusions. IoT developers should choose the source programming language wisely to benefit from increased performance and a significant reduction in energy consumption. Specifically, JavaScript should generally be avoided, while C and Rust are better options.
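A minimal harness for the performance part of such an experiment might look like the sketch below, timing the same algorithm compiled to WASM from different languages under one runtime. The file names are placeholders, `wasmtime` stands in for the runtimes compared, and energy measurement (done on instrumented hardware in the study) is out of scope here.

```python
# Minimal timing harness: run the same algorithm compiled to WASM from
# different source languages under one runtime. File names are
# placeholders and `wasmtime` is an assumed runtime on PATH.
import subprocess
import time

binaries = {"C": "fib_c.wasm", "Rust": "fib_rust.wasm", "Go": "fib_go.wasm"}
for lang, wasm in binaries.items():
    start = time.perf_counter()
    subprocess.run(["wasmtime", wasm], check=True, capture_output=True)
    print(f"{lang}: {time.perf_counter() - start:.3f}s")
```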
Link to publication DOI Pre-print File Attached
Industry
Thu 15 Jun 2023 15:50 - 16:00 at Aurora Hall - Software Development Processes Chair(s): Eray Tüzün Bilkent University
Agile teams measure their performance as velocity, based on Story Points. However, such velocity does not allow predicting when the product will be finished: Story Points measure effort only and do not discriminate between creating functionality and other tasks. Non-functional work, such as agreeing with stakeholders, designing, testing, or documenting, consumes effort but does not add functionality. Thus, it remains unclear whether the product is making any progress or the team is just looping around technical debt and unclear requirements. Euro Project Office has therefore developed a method to complement a product backlog with functional size, indicating progress and completeness in unambiguous terms. The method is based on the international standards ISO/IEC 14143 and ISO/IEC 19761. Tools are available as open source and can be used by development teams with minimal investment in training.
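Under COSMIC (ISO/IEC 19761), each functional process is sized by counting its data movements (Entry, Exit, Read, Write), one COSMIC Function Point (CFP) per movement. The sketch below applies this counting to a toy backlog; the items and movement counts are invented, and the method's actual measurement rules are richer.

```python
# Minimal sketch of COSMIC sizing (ISO/IEC 19761): one CFP per data
# movement (Entry, Exit, Read, Write) of each functional process. The
# backlog items and movement counts are invented for illustration.
backlog = {
    "US-12 register user": {"Entry": 1, "Exit": 1, "Read": 1, "Write": 1},
    "US-13 list orders":   {"Entry": 1, "Exit": 1, "Read": 2, "Write": 0},
}
for item, movements in backlog.items():
    print(f"{item}: {sum(movements.values())} CFP")
total = sum(sum(m.values()) for m in backlog.values())
print("Backlog functional size:", total, "CFP")
```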
File Attached
Research (Full Papers)
Thu 15 Jun 2023 16:00 - 16:20 at Aurora Hall - Software Development Processes Chair(s): Eray Tüzün Bilkent University
The use of software engineering artifacts in DevOps is central to enabling collaboration between involved teams when integrating the development and operations domains. At the same time, collaboration around DevOps artifacts has yet to receive detailed research attention. To address this research gap, we explore the specific software engineering artifacts that act as a means of translation between DevOps stakeholders. We apply the sociological concept of Boundary Objects, which has been used to describe, analyze, and evaluate artifacts that enable a cross-disciplinary understanding. While Boundary Objects have not been explicitly studied in DevOps contexts, they appear promising to investigate how different teams can collaborate efficiently using common artifacts. We performed a multiple case study and conducted twelve semi-structured interviews with DevOps practitioners in nine companies. We elicited participants' collaboration practices, focusing on the coordination of stakeholders and the use of engineering artifacts as Boundary Objects. This paper presents a consolidated overview of four categories of DevOps Boundary Objects and eleven stakeholder groups relevant to DevOps. To help practitioners assess cross-disciplinary knowledge management strategies, we detail how DevOps Boundary Objects contribute to four areas of DevOps knowledge and propose dimensions to evaluate their use.
Link to publication DOI
Industry
Thu 15 Jun 2023 16:20 - 16:30 at Aurora Hall - Software Development Processes Chair(s): Eray Tüzün Bilkent University
The complexity of delivering enterprise-grade software, especially as-a-service, keeps increasing even with the large set of open-source and commercial helper tools. Every single commit by developers must go through a large group of checks to ensure that it will not break or regress reliability, resiliency, security, compliance, privacy, performance, accessibility, operability, etc. Being a developer in such an environment is not a fulfilling role at all. Full stack, as a notion, is not applicable to large-scale systems and enterprise software. We introduce a new, horizontal approach called “full-spec software,” in which each layer of the system is architected, designed, and built with the long list of enterprise readiness attributes listed above. Making full-spec software a reality requires a new organizational construct called “platform engineering.”
DOI File Attached
Short Papers and Posters
Thu 15 Jun 2023 16:30 - 16:40 at Aurora Hall - Software Development Processes Chair(s): Eray Tüzün Bilkent University
[Context] With the recent advent of artificially intelligent pairing partners in software engineering, it is interesting to renew the study of the psychology of pairing. Pair programming provides an attractive way of teaching software engineering to university students. Its study can also lead to a better understanding of the needs of professional software engineers in various programming roles and to improvements in concurrent pairing software. [Objective] This preliminary study aimed to gain quantitative and qualitative insights into pair programming, especially students' attitudes towards its specific roles and what they require from their pairing partners. The goal is to use the findings to design further studies on pairing with artificial intelligence. [Method] Using a mixed-methods, experimental approach, we distinguished the effects of the pilot, navigator, and solo roles on students' (N = 35) intrinsic motivation. Four experimental sessions produced a rich data corpus in two software engineering university classrooms. It was quantitatively investigated using the Shapiro-Wilk normality test and one-way analysis of variance (ANOVA) to confirm the relations and significance of variations in mean intrinsic motivation across roles. Subsequently, seven semi-structured interviews were conducted with the experiment's participants, and the qualitative data excerpts were subjected to thematic analysis in an essentialist way. [Results] The systematic coding of the interview transcripts produced seven themes for understanding the psychological aspects of pair programming and for improving it in university classrooms. Statistical analysis of 612 self-reported intrinsic motivation inventories confirmed that students find programming in the pilot-navigator roles more interesting and enjoyable than programming solo. [Conclusion] The experimental setting is viable for inspecting the associations between students' attitudes and this distributed cognition practice. The preliminary results illuminate the psychological aspects of the pilot-navigator roles and reveal many areas for improvement. They also provide a strong basis for conducting further studies with the same design, involving the Big Five personality traits and intrinsic motivation, on pairing with artificial intelligence, allowing comparison with results of pairing with human partners.
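The statistical pipeline described here (per-role Shapiro-Wilk normality checks followed by one-way ANOVA) can be sketched with SciPy as follows. The motivation scores are invented; the study analyzed 612 self-reported inventories.

```python
# Minimal sketch of the analysis: Shapiro-Wilk normality checks per
# role, then one-way ANOVA across roles. The scores are invented.
from scipy import stats

pilot     = [5.1, 4.8, 5.5, 4.9, 5.3, 5.0, 4.7, 5.2]
navigator = [4.9, 5.0, 4.6, 5.1, 4.8, 4.7, 5.2, 4.5]
solo      = [4.1, 3.9, 4.4, 4.0, 3.8, 4.2, 4.3, 3.7]

for name, sample in [("pilot", pilot), ("navigator", navigator), ("solo", solo)]:
    w, p = stats.shapiro(sample)
    print(f"Shapiro-Wilk {name}: W={w:.3f}, p={p:.3f}")

f, p = stats.f_oneway(pilot, navigator, solo)
print(f"One-way ANOVA: F={f:.2f}, p={p:.4f}")
```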
DOI Pre-print File Attached
EASIER
Thu 15 Jun 2023 16:40 - 16:50 at Aurora Hall - Software Development Processes Chair(s): Eray Tüzün Bilkent University
Xingru Chen, Muhammad Usman and Deepika Badampudi
Background: InnerSource helps improve software reuse through increased transparency and inter-team collaboration. Companies need to understand their context and specific needs before deciding to adopt any particular InnerSource practices, since they cannot apply all InnerSource practices at once. Aim: This study aims to support the case company in assessing its readiness for adopting InnerSource practices to improve its internal reuse, identifying and prioritizing the improvement areas, and identifying suitable solutions. Method: We performed a case study using a questionnaire and a workshop to check the current and desired status of adopting InnerSource practices and to collect potential solutions. Results: The study participants identified that the company needs to prioritize improvements related to the discoverability, communication channels, and ownership of reusable assets. In addition, they identified certain InnerSource practices as solutions for the prioritized improvement areas, such as better-structured repositories for storing and searching reusable assets and standardized documentation of these assets. Conclusion: The questionnaire instrument aids the case company in identifying the improvement areas related to InnerSource and reuse practices. InnerSource practices could improve the development and maintenance of reusable assets.
DOI Pre-print File Attached
Industry
Thu 15 Jun 2023 16:50 - 17:00 at Aurora Hall - Software Development Processes Chair(s): Eray Tüzün Bilkent University
In DevOps, the traceability of software artifacts is critical to the successful development and delivery of a project to stakeholders. This paper reports on the experience of developers using DevOps to develop and productionise a JavaScript React web application, focusing on the traceability management of artifacts produced throughout the life cycle. The report also highlights key opportunities and challenges in traceability management from the development stage to production.
File Attached