Understanding Exception-Related Bugs in Large-Scale Cloud Systems (ASE 2019 Research Papers)

Session: Cloud and Online Services Chair(s): Dan Hao, Peking University

The exception mechanism is widely used in cloud systems, mainly because it separates error-handling code from the main business logic. However, the huge space of potential error conditions and the sophisticated logic of cloud systems make it hard to use the exception mechanism correctly. As a result, mistakes in exception use may lead to severe consequences, such as system downtime and data loss. To address this issue, the community needs a better understanding of exception-related bugs (eBugs), i.e., bugs caused by incorrect use of the exception mechanism, in cloud systems.

In this paper, we present a comprehensive study on 210 eBugs from six widely deployed cloud systems: Cassandra, HBase, HDFS, Hadoop MapReduce, YARN, and ZooKeeper. For all the studied eBugs, we analyze their triggering conditions, root causes, bug impacts, and the relations among them. To the best of our knowledge, this is the first study on eBugs in cloud systems, and the first eBug study that focuses on triggering conditions. We find that eBugs are severe in cloud systems: 74% of eBugs affect system availability or integrity. Fortunately, exposing eBugs through testing is possible: 54% of eBugs are triggered by non-semantic conditions such as network errors, and 40% of eBugs can be triggered by simulating the conditions at simple system states. Interestingly, we find that exception triggering conditions are useful for detecting eBugs. Based on these findings, we build a static analysis tool, called DIET, which reports 31 bugs and bad practices in the latest versions of the studied systems. So far, developers have confirmed that 23 of them are “previously-unknown” bugs or bad practices.
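To make the idea of triggering an eBug by simulating a non-semantic condition more concrete, the following is a minimal, purely hypothetical sketch in Python. It is not taken from the paper or from any of the studied systems: the `FlakyStore` class, the `buggy_save` and `fixed_save` functions, and the choice of a swallowed exception as the bug are all assumptions made for illustration.

```python
class FlakyStore:
    """Hypothetical remote store; `fail=True` simulates a network error on write."""

    def __init__(self, fail=False):
        self.fail = fail
        self.data = {}

    def write(self, key, value):
        if self.fail:
            raise ConnectionError("simulated network error")
        self.data[key] = value


def buggy_save(store, key, value):
    """Hypothetical eBug: the exception is swallowed and success is reported."""
    try:
        store.write(key, value)
    except ConnectionError:
        pass  # error silently ignored, so the caller believes the write happened
    return True


def fixed_save(store, key, value):
    """Reports the failure to the caller instead of hiding it."""
    try:
        store.write(key, value)
    except ConnectionError:
        return False
    return True


if __name__ == "__main__":
    broken = FlakyStore(fail=True)
    # The buggy version claims success although nothing was stored.
    print(buggy_save(broken, "k", "v"), broken.data)
    # The fixed version surfaces the failure.
    print(fixed_save(broken, "k", "v"), broken.data)
```

In this sketch the simulated condition needs no particular system state, which loosely mirrors the abstract's point that many eBugs can be exposed by injecting an error at a simple state.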

Haicheng Chen

The Ohio State University

United States

Wensheng Dou

Institute of Software, Chinese Academy of Sciences

China

Yanyan Jiang

Nanjing University

China

Feng Qin

The Ohio State University

United States

Session: Cloud and Online Services Chair(s): Dan Hao, Peking University

Rigorous performance engineering traditionally assumes measuring on bare-metal environments to control for as many confounding factors as possible. Unfortunately, some researchers and practitioners might not have the access, knowledge, or funds to operate dedicated performance-testing hardware, making public clouds an attractive alternative. However, shared public cloud environments are inherently unpredictable in terms of the system performance they provide. In this study, we explore the effects of cloud environments on the variability of performance test results and to what extent slowdowns can still be reliably detected even in a public cloud. We focus on software microbenchmarks as an example of performance tests and execute extensive experiments on three well-known public cloud services (AWS, GCE, and Azure) using three different cloud instance types per service. We also compare the results to a hosted bare-metal offering from IBM Bluemix. In total, we gathered more than 4.5 million unique microbenchmarking data points from benchmarks written in Java and Go. We find that the variability of results differs substantially between benchmarks and instance types (with coefficients of variation ranging from 0.03% to >100%). This variability originates from three sources (variability inherent to a benchmark, between trials on the same instance, and between different instances), and different benchmark-environment configurations suffer to very different degrees from each of these sources. As expected, the bare-metal instance produces very stable results; however, AWS is typically not substantially less stable.
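As a rough illustration of the variability metric mentioned above, the coefficient of variation (CV) is the standard deviation divided by the mean, usually expressed as a percentage. The sketch below assumes the sample standard deviation is used; the function name and the measurement values are hypothetical and are not data from the study.

```python
import statistics


def coefficient_of_variation(samples):
    """Return the CV (sample standard deviation / mean) as a percentage."""
    mean = statistics.mean(samples)
    if mean == 0:
        raise ValueError("CV is undefined for a zero mean")
    return statistics.stdev(samples) / mean * 100.0


if __name__ == "__main__":
    # Hypothetical execution times in nanoseconds, not measurements from the study.
    stable = [100.0, 100.1, 99.9, 100.0, 100.0]
    noisy = [100.0, 180.0, 60.0, 240.0, 20.0]
    print(f"stable: {coefficient_of_variation(stable):.2f}%")
    print(f"noisy:  {coefficient_of_variation(noisy):.2f}%")
```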
We further study falsely reported performance changes and minimal detectable slowdowns along two dimensions in all environments: (1) two deployment strategies, executing test and control group on the same (trial-based sampling) and on different (instance-based sampling) instances, and (2) two state-of-the-art statistical tests, i.e., Wilcoxon rank-sum with Cliff’s Delta effect sizes and bootstrapped overlapping confidence intervals. We show that identical measurements (e.g., the same benchmark executed in the same environment without any code changes) suffer from falsely reported changes when they are taken from a small number of instances and trials. Nonetheless, an increase in samples yields a low number of false positives (i.e., <5%) for all studied benchmarks and environments, both sampling strategies, and both statistical tests. Regarding minimal detectable slowdowns, our experiments confirm that testing in a trial-based fashion leads to substantially better results than executing on different instances. With trial-based sampling, slowdowns of 10% or less are detectable with high confidence using both statistical tests. Finally, our results indicate that Wilcoxon rank-sum manages to detect smaller slowdowns with fewer instances than overlapping confidence intervals: as few as 5 instances are sufficient to find slowdowns in the range of 5% to 10%.
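The two detection approaches named above can be sketched with the Python standard library. This is a simplified illustration under several assumptions, not the study's actual procedure: the confidence interval is a plain percentile bootstrap of the mean, only the Cliff's Delta effect size is computed (the rank-sum p-value is omitted; a library routine such as SciPy's `scipy.stats.mannwhitneyu` could supply it), and all function names and sample values are hypothetical.

```python
import random
import statistics


def cliffs_delta(control, test):
    """Cliff's Delta: P(test > control) - P(test < control), in [-1, 1]."""
    greater = sum(1 for t in test for c in control if t > c)
    less = sum(1 for t in test for c in control if t < c)
    return (greater - less) / (len(test) * len(control))


def bootstrap_ci(samples, rng, resamples=1000, level=0.95):
    """Percentile bootstrap confidence interval for the mean."""
    means = sorted(
        statistics.mean(rng.choices(samples, k=len(samples)))
        for _ in range(resamples)
    )
    lo_idx = int((1 - level) / 2 * resamples)
    hi_idx = int((1 + level) / 2 * resamples) - 1
    return means[lo_idx], means[hi_idx]


def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic execution times: the test group is about 10% slower.
    control = [rng.gauss(100.0, 2.0) for _ in range(200)]
    slowed = [rng.gauss(110.0, 2.0) for _ in range(200)]

    print("Cliff's Delta:", cliffs_delta(control, slowed))
    ci_control = bootstrap_ci(control, rng)
    ci_slowed = bootstrap_ci(slowed, rng)
    print("CIs overlap:", intervals_overlap(ci_control, ci_slowed))
```

Under the overlapping-interval approach, a change is reported when the two intervals do not overlap; with the effect-size approach, a Cliff's Delta near 0 suggests no change, while values near 1 or -1 suggest that one group is consistently slower than the other.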

Link to Publication: https://link.springer.com/article/10.1007/s10664-019-09681-1