ASE 2019
Sun 10 - Fri 15 November 2019 San Diego, California, United States
Wed 13 Nov 2019 11:20 - 11:40 at Hillcrest - Cloud and Online Services Chair(s): Dan Hao

Rigorous performance engineering traditionally assumes measuring on bare-metal environments to control for as many confounding factors as possible. Unfortunately, some researchers and practitioners might not have the access, knowledge, or funds to operate dedicated performance-testing hardware, making public clouds an attractive alternative. However, shared public cloud environments are inherently unpredictable in terms of the system performance they provide. In this study, we explore the effects of cloud environments on the variability of performance test results and to what extent slowdowns can still be reliably detected even in a public cloud. We focus on software microbenchmarks as an example of performance tests and execute extensive experiments on three well-known public cloud services (AWS, GCE, and Azure), using three different cloud instance types per service. We also compare the results to a hosted bare-metal offering from IBM Bluemix. In total, we gathered more than 4.5 million unique microbenchmarking data points from benchmarks written in Java and Go. We find that the variability of results differs substantially between benchmarks and instance types (with coefficients of variation ranging from 0.03% to more than 100%). This variability originates from three sources (variability inherent to a benchmark, between trials on the same instance, and between different instances), and different benchmark-environment configurations suffer from these sources to very different degrees. The bare-metal instance expectedly produces very stable results; however, AWS is typically not substantially less stable. We further study falsely reported performance changes and minimal detectable slowdowns along two dimensions in all environments: (1) two deployment strategies, executing the test and control groups on the same instances (trial-based sampling) or on different instances (instance-based sampling), and (2) two state-of-the-art statistical tests, namely Wilcoxon rank-sum with Cliff's Delta effect sizes and bootstrapped overlapping confidence intervals. We show that identical measurements (e.g., the same benchmark executed in the same environment without any code changes) suffer from falsely reported changes when they are taken from a small number of instances and trials. Nonetheless, an increase in samples yields a low number of false positives (i.e., below 5%) for all studied benchmarks and environments, both sampling strategies, and both statistical tests. Regarding minimal detectable slowdowns, our experiments confirm that testing in a trial-based fashion leads to substantially better results than executing on different instances. With trial-based sampling, slowdowns of 10% or less are detectable with high confidence using both statistical tests. Finally, our results indicate that Wilcoxon rank-sum detects smaller slowdowns with fewer instances than overlapping confidence intervals: as few as 5 instances are sufficient to find slowdowns in the range of 5% to 10%.
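To illustrate the first of the two statistical tests mentioned in the abstract, the sketch below shows how a slowdown between a control group and a test group of microbenchmark measurements could be flagged with a Wilcoxon rank-sum (Mann-Whitney U) test combined with Cliff's Delta. This is a minimal Python sketch assuming SciPy is available; the sample data, significance level, and effect-size threshold (0.147, a commonly used "negligible" boundary) are illustrative assumptions and not values or code from the paper.

# Illustrative sketch: flag a slowdown between control- and test-group
# microbenchmark measurements with Wilcoxon rank-sum plus Cliff's Delta.
# Data and thresholds are hypothetical, not taken from the paper.
from scipy.stats import mannwhitneyu

def cliffs_delta(a, b):
    """Cliff's Delta effect size: P(a > b) - P(a < b) over all pairs."""
    greater = sum(1 for x in a for y in b if x > y)
    lesser = sum(1 for x in a for y in b if x < y)
    return (greater - lesser) / (len(a) * len(b))

def detect_slowdown(control, test, alpha=0.05, min_delta=0.147):
    """Report a slowdown if the test-group execution times are statistically
    larger than the control group and the effect size is non-negligible."""
    # One-sided test: are the 'test' measurements stochastically greater?
    _, p_value = mannwhitneyu(test, control, alternative="greater")
    delta = cliffs_delta(test, control)
    return p_value < alpha and delta >= min_delta

# Hypothetical execution times (ms) from control- and test-group trials.
control = [10.1, 10.3, 9.9, 10.2, 10.0, 10.4, 10.1, 10.2]
test = [10.9, 11.1, 10.8, 11.0, 11.2, 10.7, 11.0, 10.9]
print(detect_slowdown(control, test))  # True for this synthetic ~8% slowdown

In a trial-based setup, the control and test samples would come from trials on the same cloud instance; in an instance-based setup, they would come from different instances, which the abstract reports as the less sensitive of the two strategies.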

Wed 13 Nov

10:40 - 12:20: Papers - Cloud and Online Services at Hillcrest
Chair(s): Dan Hao (Peking University)
10:40 - 11:00
Talk
Understanding Exception-Related Bugs in Large-Scale Cloud Systems
Haicheng Chen (The Ohio State University), Wensheng Dou (Institute of Software, Chinese Academy of Sciences), Yanyan Jiang (Nanjing University), Feng Qin (Ohio State University, USA)
Pre-print
11:00 - 11:20
Talk
iFeedback: Exploiting User Feedback for Real-time Issue Detection in Large-Scale Online Service Systems
Wujie Zheng (Tencent, Inc.), Haochuan Lu (Fudan University), Yangfan Zhou (Fudan University), Jianming Liang (Tencent), Haibing Zheng (Tencent), Yuetang Deng (Tencent, Inc.)
11:20 - 11:40 (Journal-First)
Talk
Software Microbenchmarking in the Cloud. How Bad is it Really?
Christoph Laaber (University of Zurich), Joel Scheuner (Chalmers | University of Gothenburg), Philipp Leitner (Chalmers University of Technology & University of Gothenburg)
Link to publication · Pre-print
11:40 - 12:00
Talk
Continuous Incident Triage for Large-Scale Online Service Systems
Junjie Chen (Tianjin University), Xiaoting He (Microsoft), Qingwei Lin (Microsoft Research, China), Hongyu Zhang (The University of Newcastle), Dan Hao (Peking University), Feng Gao (Microsoft), Zhangwei Xu (Microsoft), Yingnong Dang (Microsoft Azure), Dongmei Zhang (Microsoft Research, China)
12:00 - 12:10
Demonstration
Kotless: a Serverless Framework for Kotlin
Vladislav Tankov (JetBrains, ITMO University), Yaroslav Golubev (JetBrains Research, ITMO University), Timofey Bryksin (JetBrains Research, Saint-Petersburg State University)
12:10 - 12:20
Demonstration
FogWorkflowSim: An Automated Simulation Toolkit for Workflow Performance Evaluation in Fog Computing
Xiao Liu (School of Information Technology, Deakin University), Lingmin Fan (School of Computer Science and Technology, Anhui University), Jia Xu (School of Computer Science and Technology, Anhui University), Xuejun Li (School of Computer Science and Technology, Anhui University), Lina Gong (School of Computer Science and Technology, Anhui University), John Grundy (Monash University), Yun Yang (Swinburne University of Technology)