A Framework for Data Quality for Synthetic Information
Files
TR Number
Date
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
Abstract
Data quality has been an area of increasing interest for researchers in recent years due to the rapid emergence of 'big data' processes and applications. In this work, the data quality problem is viewed from the standpoint of synthetic information. Based on the structure and complexity of synthetic data, a need to have a data quality framework specific to it was realized. This thesis presents this framework along with implementation details and results of a large synthetic dataset to which the developed testing framework is applied. A formal conceptual framework was designed for assessing data quality of synthetic information. This framework involves developing analytical methods and software for assessing data quality for synthetic information. It includes dimensions of data quality that check the inherent properties of the data as well as evaluate it in the context of its use. The framework developed here is a software framework which is designed considering software design techniques like scalability, generality, integrability and modularity. A data abstraction layer has been introduced between the synthetic data and the tests. This abstraction layer has multiple benefits over direct access of the data by the tests. It decouples the tests from the data so that the details of storage and implementation are kept hidden from the user. We have implemented data quality measures for several quality dimensions: accuracy and precision, reliability, completeness, consistency, and validity. The particular tests and quality measures implemented span a range from low-level syntactic checks to high-level semantic quality measures. In each case, in addition to the results of the quality measure itself, we also present results on the computational performance (scalability) of the measure.