Uncertainty Quantification and Data Provenance for Data Pipeline Security Analysis

Date

2025-05-06

Publisher

ACM

Abstract

Ensuring data integrity and reliability is essential for real-world applications, especially in automated decision-making and anomaly detection systems. In this study, we introduce a data pipeline augmentation tool that combines Uncertainty Quantification (UQ) techniques with data provenance tracking to detect anomalies and distributional shifts. By using a task runner for pipeline orchestration, our approach supports scalable, fault-tolerant execution while maintaining traceability and monitoring at each processing stage.

To validate our framework, we conduct two experiments using the Lawrence Berkeley National Laboratory (LBNL) Fault Detection and Diagnostics (FDD) datasets, focusing on Fan Coil Unit (FCU) operations in HVAC systems. The experiments assess the pipeline's ability to detect anomalies under two corruption scenarios: (1) corruption introduced in a single pipeline stage, and (2) inline data corruption.

We integrate statistical tests, such as the Kolmogorov-Smirnov (KS) test, to identify distributional shifts between sequential data batches. Additionally, we apply UQ techniques to quantify uncertainty, which increases confidence in the detected anomalies. The results demonstrate that our framework effectively identifies computational corruption, providing a robust and scalable approach to anomaly detection in real-world data pipelines.
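
As an illustration of the batch-to-batch KS check described above, the following is a minimal sketch and not the authors' implementation. It assumes each batch is a plain list of numeric sensor readings; the function names (`ks_statistic`, `ks_pvalue`, `detect_shifts`), the significance level `alpha`, and the asymptotic p-value approximation are all assumptions made for this example. In practice a library routine such as `scipy.stats.ks_2samp` would typically be used instead of the hand-written statistic.

```python
import math
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    # The gap between two step functions can only change at a data point,
    # so it is enough to evaluate both empirical CDFs at every observation.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Approximate p-value from the asymptotic Kolmogorov distribution.

    Assumption: the large-sample series with a small-sample correction
    factor is used here; an exact method may be preferable for tiny batches.
    """
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.2:
        # The alternating series converges poorly here; the p-value is ~1.
        return 1.0
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * total))


def detect_shifts(batches, alpha=0.01):
    """Return indices i where batch i differs significantly from batch i-1."""
    flagged = []
    for i in range(1, len(batches)):
        prev, curr = batches[i - 1], batches[i]
        d = ks_statistic(prev, curr)
        if ks_pvalue(d, len(prev), len(curr)) < alpha:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    # Hypothetical readings: two similar batches, then a shifted one.
    clean = [20.0 + 0.01 * i for i in range(100)]
    shifted = [25.0 + 0.01 * i for i in range(100)]
    print(detect_shifts([clean, clean, shifted]))
```

Comparing each batch only with its predecessor keeps the check cheap, but a corrupted batch followed by a clean one is flagged twice (on entry and on recovery); a fixed reference batch is an alternative design if that behavior is undesirable.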
