Cholera Database

Abstract

This project worked toward a database of cholera records from 2010 to 2020. Data from the WHO repository was extracted and normalized into CSV files: each year with available data has one CSV file listing each location and its total number of cases. Data for the same timeframe was collected from the ProMED repository, then extracted, condensed, and tagged for easier manual viewing; the ProMED data for all available years is given in a single CSV file. The WHO data can be viewed as maps colored on a logarithmic scale by the number of cases in each location, with one map produced for each year in the study. The ProMED data can be viewed as bar graphs showing, for each country, the number of articles written in each week. These visualizations can be viewed or downloaded at choleradb.cs.vt.edu, and all of the CSV files produced are also available for download on our website. Due to the complexity of NLP and the inconsistencies in the ProMED articles, our data is not completely normalized and requires some manual work. Unforeseen circumstances, including the COVID-19 crisis, slowed the project’s progress. As a result, the ProMED data extraction did not proceed further, other data repositories were not explored, and interactive visualizations were not built. The results of this project are compiled datasets and data visualizations from the WHO and ProMED repositories. These are useful to our client for future analysis, as well as to anyone else interested in trends in cholera outbreaks. The collected data is formatted for easy reading and analysis, and the graphics provide a simple visual for those more interested in higher-level analysis. This project may also be useful to developers working on data extraction and representation in epidemiology or other case-based global studies.
In the future, more repositories can be explored for more extensive results. Further work can also be done on the ProMED dataset to condense it further and eliminate the need for manual analysis after our program is run. All results of this project are publicly available for viewing and download at choleradb.cs.vt.edu. All code is open source and available on GitLab.
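
As an illustration of the logarithmic map coloring mentioned above, the following is a minimal sketch, not the project's actual code. The function name `log_color_level`, the choice of five color levels, and the sample counts are all assumptions made for this example.

```python
import math


def log_color_level(cases, max_cases, levels=5):
    """Map a case count to a color level in 0..levels-1 on a log10 scale.

    Hypothetical helper for illustration; not taken from the project code.
    """
    if cases <= 0:
        return 0
    # Adding 1 keeps the logarithm defined and non-negative for small counts.
    scale = math.log10(cases + 1) / math.log10(max_cases + 1)
    # The largest count gives scale == 1.0, so clamp to the top level.
    return min(levels - 1, int(scale * levels))


# Made-up counts for illustration only; these are not WHO figures.
sample = {"Country A": 20, "Country B": 950, "Country C": 340000}
max_cases = max(sample.values())
levels = {name: log_color_level(n, max_cases) for name, n in sample.items()}
print(levels)
```

With a linear scale, the two smaller counts would fall into the same lowest color level as each other; the logarithmic scale keeps them distinguishable when one location has far more cases than the rest.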

Description
Keywords
WHO, World Health Organization, ProMED, Cholera, Cholera Database, Database, Python, spaCy, NLP, BeautifulSoup, Website
Citation