Continuous Incident Triage for Large-Scale Online Service Systems
[Experience Paper] In recent years, online service systems have become increasingly popular. Incidents in these systems can cause significant economic loss and customer dissatisfaction. Incident triage, the process of assigning a new incident to the responsible team, is vitally important for quick recovery of the affected service. Our industry experience shows that in practice, incident triage is not conducted only once at the beginning, but is a continuous process in which engineers from different teams discuss an incident intensively among themselves and continuously refine the incident-triage result until the correct assignment is reached. In particular, our empirical study on 8 real online service systems shows that the percentage of incidents that were reassigned ranges from 5.43% to 68.26%, and the number of discussion items before achieving the correct assignment is up to 11.32 on average. To improve the existing incident triage process, in this paper we propose DeepCT, a Deep learning based approach to automated Continuous incident Triage. DeepCT incorporates a novel GRU-based model with an attention mechanism and a revised loss function, which can incrementally learn knowledge from discussions and update incident-triage results. Using DeepCT, the correct incident assignment can be achieved with fewer discussions. We conducted an extensive evaluation of DeepCT on 14 large-scale online service systems in a multinational technology company M. The results show that DeepCT achieves more accurate and efficient incident triage, e.g., the average accuracy of identifying the responsible team precisely ranges from 0.641 to 0.729 as the number of discussion items increases from 1 to 5. DeepCT also statistically significantly outperforms the state-of-the-art bug triage approach.
Wed 13 Nov
10:40 - 11:00 Talk | Understanding Exception-Related Bugs in Large-Scale Cloud Systems | Haicheng Chen (The Ohio State University), Wensheng Dou (Institute of Software, Chinese Academy of Sciences), Yanyan Jiang (Nanjing University), Feng Qin (Ohio State University, USA)
11:00 - 11:20 Talk | iFeedback: Exploiting User Feedback for Real-time Issue Detection in Large-Scale Online Service Systems | Wujie Zheng (Tencent, Inc.), Haochuan Lu (Fudan University), Yangfan Zhou (Fudan University), Jianming Liang (Tencent), Haibing Zheng (Tencent), Yuetang Deng (Tencent, Inc.)
11:20 - 11:40 Talk | Software Microbenchmarking in the Cloud. How Bad is it Really? | Christoph Laaber (University of Zurich), Joel Scheuner (Chalmers | University of Gothenburg), Philipp Leitner (Chalmers University of Technology & University of Gothenburg)
11:40 - 12:00 Talk | Continuous Incident Triage for Large-Scale Online Service Systems | Junjie Chen (Tianjin University), Xiaoting He (Microsoft), Qingwei Lin (Microsoft Research, China), Hongyu Zhang (The University of Newcastle), Dan Hao (Peking University), Feng Gao (Microsoft), Zhangwei Xu (Microsoft), Yingnong Dang (Microsoft Azure), Dongmei Zhang (Microsoft Research, China)
12:00 - 12:10 Demonstration | Kotless: a Serverless Framework for Kotlin | Vladislav Tankov (JetBrains, ITMO University), Yaroslav Golubev (JetBrains Research, ITMO University), Timofey Bryksin (JetBrains Research, Saint-Petersburg State University)
12:10 - 12:20 Demonstration | FogWorkflowSim: An Automated Simulation Toolkit for Workflow Performance Evaluation in Fog Computing | Xiao Liu (School of Information Technology, Deakin University), Lingmin Fan (School of Computer Science and Technology, Anhui University), Jia Xu (School of Computer Science and Technology, Anhui University), Xuejun Li (School of Computer Science and Technology, Anhui University), Lina Gong (School of Computer Science and Technology, Anhui University), John Grundy (Monash University), Yun Yang (Swinburne University of Technology)