ASE 2020
Mon 21 - Fri 25 September 2020 Melbourne, Australia
Thu 24 Sep 2020 09:30 - 09:50 at Koala - Testing and AI. Chair(s): Xiaoyuan Xie

As Deep Learning (DL) is continuously adopted in many industrial applications, its quality and reliability have started to raise concerns. Similar to the traditional software development process, testing DL software to uncover its defects at an early stage is an effective way to reduce risks after deployment. According to the fundamental assumption of deep learning, DL software does not provide statistical guarantees and has limited capability in handling data that go beyond its learned distribution, i.e., out-of-distribution (OOD) data. Recent progress has been made in designing novel testing techniques for DL software, which can detect thousands of errors. However, the current state-of-the-art DL testing techniques do not take the distribution of generated test data into consideration. It is therefore hard to judge whether the “identified errors” are indeed meaningful errors to the DL application (i.e., due to a quality issue of the model) or outliers that cannot be handled by the current model (i.e., due to a lack of training data).
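The abstract does not name the OOD detection techniques studied; purely as an illustration of what "flagging data beyond the learned distribution" means in practice, the sketch below uses maximum softmax probability thresholding, a common OOD-scoring baseline. The function names and the threshold value are illustrative assumptions, not the paper's method.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def max_softmax_score(logits):
    """OOD score: the model's maximum softmax probability per input.
    Low values suggest the input may lie outside the learned distribution."""
    return softmax(logits).max(axis=-1)

def flag_ood(logits, threshold=0.5):
    """Flag inputs whose confidence falls below the threshold as likely OOD.
    The threshold here is a hypothetical value; in practice it would be
    calibrated on held-out in-distribution data."""
    return max_softmax_score(logits) < threshold

# Toy usage: two confident (likely ID) and one uncertain (possibly OOD) prediction.
logits = np.array([[9.0, 0.5, 0.1],
                   [0.2, 8.5, 0.3],
                   [1.1, 1.0, 0.9]])
print(flag_ood(logits))  # -> [False False  True]
```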

To fill this gap, we take the first step and conduct a large-scale empirical study, with a total of 451 experiment configurations, 42 DNNs, and over 1.2 million test data instances, to investigate and characterize the capability of DL software from the data distribution perspective and to understand its impact on DL testing techniques. We first perform a large-scale empirical study on five state-of-the-art OOD detection techniques to investigate their performance in distinguishing in-distribution (ID) data from OOD data. Based on the results, we select the best OOD detection technique and investigate the characteristics of the test data generated by different DL testing techniques, i.e., 8 mutation operators and 6 testing criteria. The results demonstrate that some mutation operators and testing criteria tend to guide the generation of OOD test data, while others show the opposite tendency. After identifying the ID and OOD errors, we further investigate their effectiveness in DL model robustness enhancement. The results confirm the importance of data distribution awareness in both the testing and enhancement phases, with distribution-aware retraining outperforming distribution-unaware retraining by up to 21.5%. As deep learning follows the data-driven development paradigm, whose behavior highly depends on the training data, the results of this paper confirm the importance of, and call for, the inclusion of data awareness when designing new testing and analysis techniques for DL software.
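The abstract does not detail the retraining procedure itself; the following is a minimal sketch of what distribution-aware error selection before retraining could look like, under the assumption that a fitted OOD detector exposes an `is_ood` predicate and the model a `predict` function. Both interfaces, and the helper `select_id_errors`, are hypothetical names used only for illustration.

```python
import numpy as np

def select_id_errors(test_inputs, test_labels, predict, is_ood):
    """Split erroneous test inputs into in-distribution (ID) and OOD errors.

    predict: callable mapping a batch of inputs to predicted labels.
    is_ood:  callable mapping a batch of inputs to a boolean mask (True = OOD).
    Both are assumed interfaces for illustration, not the paper's API.
    """
    preds = predict(test_inputs)
    errors = preds != test_labels          # inputs the model gets wrong
    ood = is_ood(test_inputs)
    id_errors = errors & ~ood              # meaningful errors: in-distribution
    ood_errors = errors & ood              # outliers beyond the learned distribution
    return test_inputs[id_errors], test_labels[id_errors], test_inputs[ood_errors]

# Distribution-aware retraining would then augment the training set with the
# ID errors only (plus their correct labels) before fine-tuning the model,
# rather than retraining on every identified error indiscriminately.
```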

Thu 24 Sep
Times are displayed in time zone: (UTC) Coordinated Universal Time

09:10 - 10:10: Testing and AI (Research Papers / Journal-first Papers) at Koala
Chair(s): Xiaoyuan Xie (School of Computer Science, Wuhan University, China)
09:10 - 09:30
Talk
Predicting failures in multi-tier distributed systems
Journal-first Papers
Leonardo Mariani (University of Milano-Bicocca), Mauro Pezze (USI Lugano, Switzerland), Oliviero Riganelli (University of Milano-Bicocca, Italy), Rui Xin (USI Università della Svizzera italiana)
09:30 - 09:50
Talk
Cats Are Not Fish: Deep Learning Testing Calls for Out-Of-Distribution Awareness
Research Papers
David Berend (Nanyang Technological University, Singapore), Xiaofei Xie (Nanyang Technological University), Lei Ma (Kyushu University), Lingjun Zhou (College of Intelligence and Computing, Tianjin University), Yang Liu (Nanyang Technological University, Singapore), Chi Xu (Singapore Institute of Manufacturing Technology, A*Star), Jianjun Zhao (Kyushu University)
09:50 - 10:10
Talk
Metamorphic Object Insertion for Testing Object Detection Systems
Research Papers
Shuai Wang (Hong Kong University of Science and Technology), Zhendong Su (ETH Zurich, Switzerland)