Registered user since Mon 19 Jun 2023
Contributions
View general profile
Registered user since Mon 19 Jun 2023
Contributions
Fuzzing applies input mutations iteratively with the only goal of finding more bugs, resulting in synthetic tests that tend to lack realism. Big data analytics are expected to ingest real-world data as input. Therefore, when synthetic test data are not easily comprehensible, they are less likely to facilitate the downstream task of fixing errors. Our position is that fuzzing in this domain must achieve both high naturalness and high code coverage. We propose a new natural synthetic test generation tool for big data analytics, called NaturalFuzz. It generates both unstructured, semi-structured, and structured data with corresponding semantics such as ‘zipcode’ and ‘age.’ The key insights behind NaturalFuzz are two-fold. First, though existing test data may be small and lack coverage, we can grow this data to increase code coverage. Second, we can strategically mix constituent parts across different rows and columns to construct new realistic synthetic data by leveraging fine-grained data provenance. On commercial big data application benchmarks, NaturalFuzz achieves an additional 19.9% coverage and detects 1.9× more faults than a machine learning-based synthetic data generator (SDV) when generating comparably sized inputs. This is because an ML-based synthetic data generator does not consider which code branches are exercised by which input rows from which tables, while NaturalFuzz is able to select input rows that have a high potential to increase code coverage and mutate the selected data towards unseen, new program behavior. NaturalFuzz’s test data is more realistic than the test data generated by two baseline fuzzers (BigFuzz and Jazzer), while increasing code coverage and fault detection potential. NaturalFuzz is the first fuzzing methodology with three benefits: (1) exclusively generate natural inputs, (2) fuzz multiple input sources simultaneously, and (3) find deeper semantics faults.
File Attached