Ebola RDF Database Validator
In 2014, the Ebola virus, a deadly disease with a fatality rate of about 50 percent, spread throughout several countries. This marked the largest outbreak of the Ebola virus ever recorded. Our client is gathering data on the Ebola virus to create the largest database of information ever made for this disease. The purpose of our project is to verify our client’s data, to ensure the integrity and accuracy of the database.
The main requirements for this project are to locate multiple sources of data to be used for verification, to parse and standardize multiple types of sources to verify the data in our client’s database, and to deliver the results to our client in an easy-to-interpret manner. Additionally, a key requirement is to provide the client with a generic script that can be run to validate any data in the database, given a CSV file. This will allow our client to continue validating data in the future.
The design for this project revolves around two major elements: the structure of the existing database of Ebola information and the structure of the incoming validation files. The existing database is structured in RDF format with Turtle7 syntax, meaning it uses relational syntax to connect various data values. The incoming data format is in CSV format, which is what most Ebola data is stored as. Our design revolves around normalizing the incoming validation source data with the database content, so that the two datasets can be properly compared.
After standardizing the datasets, data can be compared directly. The project encountered several challenges in this domain, ranging from data incompatibility to inconsistent formatting on the side of the database. Data incompatibility can be seen clearly when the validation data matches the date range of the database data, but the exact days of data collection vary slightly. Inconsistent formatting is often seen in naming conventions for the data and the way that dates are stored in the database (i.e., 9/8/2014 vs. 2014-09-08). These issues were the main hindrance in our project. Each was addressed before the project could be considered complete.
After all data was converted, standardized, and compared, the results were produced and formatted in a CSV file to be given to our client. The results are given individually, for each time the script is run, so if the user runs the script for 4 different datasets over 4 different sessions, there will be 4 different result files.
The second main goal of our project, to produce a generic script that allows the user to validate data on his own, uses all previously mentioned design elements such as parsing RDF and CSV files, standardization of data, and printing results to a CSV file. This script builds a GUI interface on top of these design elements, providing a validation tool that the users can employ on their own.