A Machine-Learning Based Approach to Predicting Waterborne Disease Outbreaks Caused by Hurricanes

Virginia Tech


Climate change is increasing the frequency and intensity of (extra-)tropical cyclones, including hurricanes and winter storms, worldwide. Waterborne diseases resulting from flood-related impacts affect public health and are a major concern for society. Previous studies have highlighted a statistically significant linear correlation between waterborne diseases and climate variables, especially precipitation and temperature. However, to the best of our knowledge, no studies have explored nonlinear models (e.g., machine learning) to predict waterborne disease outbreaks in the aftermath of hurricanes and winter storms. Here, we aim to predict waterborne disease counts as well as disease outbreaks using historical climate, demographic, and public health data for Florida, U.S., dating back to 1992. To this end, we first predicted disease counts in aggregated coastal counties using multiple linear regression (MLR) and random forest regression (RFR) models. Then, we developed a binary random forest classifier (RFC) model to predict waterborne disease outbreaks (0: no outbreak; 1: outbreak). The highest coefficient of determination (R²) for the MLR model was 0.65, obtained for two aggregated county groups, namely St. Johns-Duval-Nassau and Sarasota-Charlotte-Lee. The RFR model showed its highest R² of 0.69 for the county group Sarasota-Charlotte-Lee. The highest root mean square error (RMSE) was found for the county group Miami-Dade-Broward-Palm Beach, with values of 15 and 16 people for the MLR and RFR models. The St. Johns-Duval-Nassau and Sarasota-Charlotte-Lee groups achieved the highest Kling-Gupta Efficiency (KGE) of 0.76 for the MLR model. Sarasota-Charlotte-Lee also performed best in terms of KGE for the RFR model, with a score of 0.69. The binary RFC model for Pinellas-Hillsborough-Manatee achieved an accuracy of 0.93 and an F1-score of 0.48.
We anticipate that the models' performance can be substantially improved with access to climate data of higher spatial resolution as well as longer demographic and public health records. Nevertheless, we provide here a solid methodology that can inform local authorities about imminent public health impacts and help mitigate negative effects on society, the economy, and the environment.
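
The evaluation metrics named in the abstract (R², RMSE, KGE, accuracy, and F1-score) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the study's implementation: the function names and all input values are hypothetical and synthetic, and KGE follows the commonly used formulation based on correlation, variability ratio, and bias ratio. The last example shows how a classifier can reach a high accuracy alongside a much lower F1-score when outbreaks are rare, because the many correctly predicted "no outbreak" cases count toward accuracy but not toward F1.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def rmse(obs, sim):
    """Root mean square error, in the same units as the observations."""
    return math.sqrt(_mean([(o - s) ** 2 for o, s in zip(obs, sim)]))


def r_squared(obs, sim):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mu_obs = _mean(obs)
    ss_res = sum((o - s) ** 2 for o, s in zip(obs, sim))
    ss_tot = sum((o - mu_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


def kge(obs, sim):
    """Kling-Gupta Efficiency: 1 - sqrt((r-1)^2 + (alpha-1)^2 + (beta-1)^2)."""
    mu_obs, mu_sim = _mean(obs), _mean(sim)
    sd_obs = math.sqrt(_mean([(o - mu_obs) ** 2 for o in obs]))
    sd_sim = math.sqrt(_mean([(s - mu_sim) ** 2 for s in sim]))
    cov = _mean([(o - mu_obs) * (s - mu_sim) for o, s in zip(obs, sim)])
    r = cov / (sd_obs * sd_sim)   # linear correlation
    alpha = sd_sim / sd_obs       # variability ratio
    beta = mu_sim / mu_obs        # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1-score for binary labels (1: outbreak, 0: no outbreak)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Synthetic case counts (hypothetical, for illustration only).
observed = [4, 7, 12, 9, 5, 20]
predicted = [5, 6, 10, 9, 7, 16]
print("RMSE:", rmse(observed, predicted))
print("R2:  ", r_squared(observed, predicted))
print("KGE: ", kge(observed, predicted))

# Synthetic imbalanced labels: 4 outbreaks among 50 periods (hypothetical).
y_true = [1] * 4 + [0] * 46
y_pred = [1, 1, 0, 0] + [1] + [0] * 45
acc, f1 = accuracy_and_f1(y_true, y_pred)
print("accuracy:", acc, "F1:", f1)
```

A perfect prediction yields R² = 1, RMSE = 0, and KGE = 1, which gives a quick sanity check when adapting the sketch to real data.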



Machine Learning, Hurricanes, Prediction, Waterborne Diseases, Civil Engineering