Understanding Exception-Related Bugs in Large-Scale Cloud Systems
The exception mechanism is widely used in cloud systems, mainly because it separates error-handling code from the main business logic. However, the huge space of potential error conditions and the sophisticated logic of cloud systems present a big hurdle to the correct use of the exception mechanism. As a result, mistakes in exception use may lead to severe consequences, such as system downtime and data loss. To address this issue, the community direly needs a better understanding of exception-related bugs (eBugs) in cloud systems, i.e., bugs caused by incorrect use of the exception mechanism.
In this paper, we present a comprehensive study of 210 eBugs from six widely deployed cloud systems: Cassandra, HBase, HDFS, Hadoop MapReduce, YARN, and ZooKeeper. For all the studied eBugs, we analyze their triggering conditions, root causes, and impacts, as well as the relations among them. To the best of our knowledge, this is the first study of eBugs in cloud systems, and the first eBug study that focuses on triggering conditions. We find that eBugs are severe in cloud systems: 74% of eBugs affect system availability or integrity. Luckily, exposing eBugs through testing is possible: 54% of eBugs are triggered by non-semantic conditions such as network errors, and 40% of eBugs can be triggered by simulating the conditions at simple system states. Interestingly, we find that exception triggering conditions are useful for detecting eBugs. Based on these findings, we build a static analysis tool, called DIET, which reports 31 bugs and bad practices in the latest versions of the studied systems. So far, developers have confirmed that 23 of them are “previously-unknown” bugs or bad practices.
Wed 13 Nov
10:40 - 11:00 Talk | Understanding Exception-Related Bugs in Large-Scale Cloud Systems | Haicheng Chen (The Ohio State University), Wensheng Dou (Institute of Software, Chinese Academy of Sciences), Yanyan Jiang (Nanjing University), Feng Qin (Ohio State University, USA)
11:00 - 11:20 Talk | iFeedback: Exploiting User Feedback for Real-time Issue Detection in Large-Scale Online Service Systems | Wujie Zheng (Tencent, Inc.), Haochuan Lu (Fudan University), Yangfan Zhou (Fudan University), Jianming Liang (Tencent), Haibing Zheng (Tencent), Yuetang Deng (Tencent, Inc.)
11:20 - 11:40 Talk | Software Microbenchmarking in the Cloud. How Bad is it Really? | Christoph Laaber (University of Zurich), Joel Scheuner (Chalmers | University of Gothenburg), Philipp Leitner (Chalmers University of Technology & University of Gothenburg)
11:40 - 12:00 Talk | Continuous Incident Triage for Large-Scale Online Service Systems | Junjie Chen (Tianjin University), Xiaoting He (Microsoft), Qingwei Lin (Microsoft Research, China), Hongyu Zhang (The University of Newcastle), Dan Hao (Peking University), Feng Gao (Microsoft), Zhangwei Xu (Microsoft), Yingnong Dang (Microsoft Azure), Dongmei Zhang (Microsoft Research, China)
12:00 - 12:10 Demonstration | Kotless: a Serverless Framework for Kotlin | Vladislav Tankov (JetBrains, ITMO University), Yaroslav Golubev (JetBrains Research, ITMO University), Timofey Bryksin (JetBrains Research, Saint-Petersburg State University)
12:10 - 12:20 Demonstration | FogWorkflowSim: An Automated Simulation Toolkit for Workflow Performance Evaluation in Fog Computing | Xiao Liu (School of Information Technology, Deakin University), Lingmin Fan (School of Computer Science and Technology, Anhui University), Jia Xu (School of Computer Science and Technology, Anhui University), Xuejun Li (School of Computer Science and Technology, Anhui University), Lina Gong (School of Computer Science and Technology, Anhui University), John Grundy (Monash University), Yun Yang (Swinburne University of Technology)