Co-dependence Aware Fuzzing for Dataflow-Based Big Data Analytics

Humayun, Ahmad; Kim, Miryung; Gulzar, Muhammad Ali

dc.contributor.author	Humayun, Ahmad	en
dc.contributor.author	Kim, Miryung	en
dc.contributor.author	Gulzar, Muhammad Ali	en
dc.date.accessioned	2024-03-01T13:18:31Z	en
dc.date.available	2024-03-01T13:18:31Z	en
dc.date.issued	2023-11-30	en
dc.date.updated	2024-01-01T08:55:43Z	en
dc.description.abstract	Data-intensive scalable computing (DISC) has become popular due to the growing demand for analyzing big data. For example, Apache Spark and Hadoop allow developers to write dataflow-based applications with user-defined functions (UDFs) to process data with custom logic. Testing such applications is difficult for three reasons: (1) these applications often take multiple datasets as input; (2) unlike in SQL, there is no explicit schema for these datasets, and each unstructured (or semi-structured) dataset is segmented and parsed at runtime; and (3) dataflow operators (e.g., join) create implicit co-dependence constraints between the fields of multiple datasets. An efficient and effective testing technique must analyze co-dependence among different regions of multiple datasets at the level of rows and columns and orchestrate input mutations jointly on co-dependent regions. We propose DepFuzz to increase the effectiveness and efficiency of fuzz testing dataflow-based big data applications. The key insight behind DepFuzz is twofold. First, it keeps track of which code segments operate on which datasets, which rows, and which columns. Second, by analyzing the use of dataflow operators (e.g., join and groupByKey) in tandem with the semantics of UDFs, DepFuzz generates test data that reaches hard-to-reach regions of the application code. In real-world big data applications, compared to Jazzer, a state-of-the-art commercial fuzzer for Java bytecode, DepFuzz finds 3.4× more faults and achieves 29% more statement coverage in half the time. It outperforms prior DISC testing techniques by exposing deeper semantic faults beyond simpler input formatting errors, especially when multiple datasets have complex interactions through dataflow operators.	en
dc.description.version	Published version	en
dc.format.mimetype	application/pdf	en
dc.identifier.doi	https://doi.org/10.1145/3611643.3616298	en
dc.identifier.uri	https://hdl.handle.net/10919/118230	en
dc.language.iso	en	en
dc.publisher	ACM	en
dc.rights	Creative Commons Attribution 4.0 International	en
dc.rights.holder	The author(s)	en
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/	en
dc.title	Co-dependence Aware Fuzzing for Dataflow-Based Big Data Analytics	en
dc.type	Article - Refereed	en
dc.type.dcmitype	Text	en
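
The co-dependence constraint described in the abstract can be illustrated with a small sketch. This is not DepFuzz's implementation; it is a plain-Python toy, assuming two comma-separated datasets joined on a key column, and every name in it (`join_on_key`, `mutate_independently`, `mutate_codependent`, the sample rows) is hypothetical. It shows why mutating a join key in only one dataset tends to produce inputs that never get past the join, while mutating the co-dependent fields together keeps the rows flowing into downstream code.

```python
def join_on_key(left_rows, right_rows, left_col, right_col):
    """Inner-join two lists of comma-separated rows on the given columns."""
    index = {}
    for row in right_rows:
        fields = row.split(",")
        index.setdefault(fields[right_col], []).append(fields)
    joined = []
    for row in left_rows:
        fields = row.split(",")
        for match in index.get(fields[left_col], []):
            joined.append((fields, match))
    return joined


def mutate_independently(rows, col, new_value):
    """Overwrite one column in every row of a single dataset."""
    out = []
    for row in rows:
        fields = row.split(",")
        fields[col] = new_value
        out.append(",".join(fields))
    return out


def mutate_codependent(left_rows, right_rows, left_col, right_col,
                       old_key, new_key):
    """Rewrite a join key in both datasets together so matching rows still join."""
    def rewrite(rows, col):
        out = []
        for row in rows:
            fields = row.split(",")
            if fields[col] == old_key:
                fields[col] = new_key
            out.append(",".join(fields))
        return out
    return rewrite(left_rows, left_col), rewrite(right_rows, right_col)


# Hypothetical sample inputs: orders are keyed by column 0, users by column 1.
orders = ["u1,book,12", "u2,pen,3"]
users = ["alice,u1", "bob,u2"]

baseline = join_on_key(orders, users, 0, 1)

# Mutating only one side breaks the implicit constraint: nothing joins.
broken = join_on_key(mutate_independently(orders, 0, "zz9"), users, 0, 1)

# Mutating both sides jointly preserves the constraint: rows still join.
new_orders, new_users = mutate_codependent(orders, users, 0, 1, "u1", "zz9")
preserved = join_on_key(new_orders, new_users, 0, 1)
```

In this toy setup the independently mutated input yields an empty join, so any UDF after the join would never execute, whereas the jointly mutated input still produces joined rows with a fresh key value.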

Files

Original bundle

Name: 3611643.3616298.pdf
Size: 622.55 KB
Format: Adobe Portable Document Format
Description: Published version

Collections

Journal Articles, Association for Computing Machinery (ACM)
Scholarly Works, Computer Science