New Opportunities in Crowd-Sourced Monitoring and Non-government Data Mining for Developing Urban Air Quality Models in the US

Date

2020-05-15

Publisher

Virginia Tech

Abstract

Ambient air pollution is among the top 10 health risk factors in the US. With increasing concerns about the adverse health effects of ambient air pollution among stakeholders, including environmental scientists, health professionals, urban planners, and community residents, improving air quality is a crucial goal for developing healthy communities. The US Environmental Protection Agency (EPA) aims to reduce air pollution by regulating emissions and continuously monitoring air pollution levels. Local communities also benefit from crowd-sourced monitoring to measure air pollution, particularly with the help of rapidly developing low-cost sampling technologies. The shift from relying only on government-based regulatory monitoring to crowd-sourced efforts has provided new opportunities for collecting air quality data. In addition, the fast-growing field of data science (e.g., data mining) allows for leveraging open data from different sources to improve air pollution exposure assessment. My dissertation investigates how new sources of air quality data (e.g., community-based monitoring, low-cost sensor platforms) and of model predictor variables (e.g., non-government open data), combined with emerging modeling approaches (e.g., machine learning [ML]), could be used to improve air quality models (i.e., land use regression [LUR]) at local, regional, and national levels for refined exposure assessment.

LUR models are commonly used to predict air pollution concentrations at locations without monitoring data, based on neighboring land use and geographic variables. I explore the use of crowd-sourced low-cost monitoring data, new/open datasets from government and non-government sponsored platforms, and emerging modeling techniques to develop LUR models in the US. I focus on testing whether: (1) air quality data from community-based monitoring are suitable for developing LUR models, (2) air quality data from non-government crowd-sourced low-cost sensor platforms could supplement regulatory monitors for LUR development, and (3) new/open data extracted from non-government sponsored platforms could serve as alternative datasets to traditional predictor variable sources (e.g., land use and geographic features) in LUR models.

In Chapter 3, I developed LUR models using community-based sampling (n = 50) for 60 volatile organic compounds (VOC) in the city of Minneapolis, US. I assessed whether adding area source-related features improves LUR model performance and compared model performance using variables featuring area sources from government vs. non-government sponsored platforms. I developed three sets of models: (1) base-case models with land use and transportation variables, (2) base-case models adding area source variables from local business permit data (government sponsored platform), and (3) base-case models adding Google point of interest (POI) data for area sources. Models with Google POI data performed the best; for example, the total VOC (TVOC) model had better goodness-of-fit (adj-R2: 0.56; Root Mean Square Error [RMSE]: 0.32 µg/m3) than the permit data model (adj-R2: 0.42; RMSE: 0.37 µg/m3) and the base-case model (adj-R2: 0.26; RMSE: 0.41 µg/m3). This work suggests that VOC LUR models can be developed using community-based samples and that adding Google POI data could improve model performance compared with using local business permit data.
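For readers unfamiliar with the two goodness-of-fit metrics quoted above, the following is a minimal illustrative sketch of how adjusted R2 and RMSE are conventionally computed. It is not the dissertation's code. The function names are hypothetical, the predictor counts in the example are assumptions chosen only for illustration, and RMSE here divides by the number of samples (some studies divide by the residual degrees of freedom instead).

```python
import math


def rmse(observed, predicted):
    """Root mean square error, in the same units as the observations."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_predictors):
    """Penalize R2 for the number of predictors used in the model."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


if __name__ == "__main__":
    # Hypothetical illustration: the same raw R2 with 50 samples is worth
    # less after adjustment when more predictors were needed to reach it.
    print(adjusted_r_squared(0.60, n_samples=50, n_predictors=5))
    print(adjusted_r_squared(0.60, n_samples=50, n_predictors=10))
```

Because adjusted R2 discounts each additional predictor, it allows a fairer comparison between a base-case model and variants that add extra area source variables.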

In Chapter 4, I evaluated a national LUR model using annual average PM2.5 concentrations from low-cost sensors (i.e., the PurpleAir platform) in 6 US urban areas (n = 149) and tested the feasibility of using low-cost sensor data for developing LUR models. I compared LUR models using only the PurpleAir sensors vs. hybrid LUR models (combining both the EPA regulatory monitors and the PurpleAir sensors). I found that the low-cost sensor network could serve as a promising alternative to fill gaps in existing regulatory networks. For example, the national regulatory monitor-based LUR (i.e., the CACES LUR, developed as part of the Center for Air, Climate, and Energy Solutions) may fail to capture locations with high PM2.5 concentrations and within-city spatial variability. Developing LUR models using the PurpleAir sensors was reasonable (PurpleAir sensors only: 10-fold CV R2 = 0.66, MAE = 2.01 µg/m3; PurpleAir and regulatory monitors: R2 = 0.85, MAE = 1.02 µg/m3). I also observed that incorporating PurpleAir sensor data into LUR models could help capture within-city variability, and that areas of disagreement with the regulatory monitors merit further investigation. This work suggests that the use of crowd-sourced low-cost sensor networks for LUR models could potentially help exposure assessment and inform environmental and health policies, particularly for places (e.g., developing countries) where regulatory monitoring networks are limited.
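The 10-fold CV R2 and MAE reported above come from holding out each tenth of the monitoring sites in turn and predicting them from a model fitted to the rest. A minimal sketch of that procedure follows, under strong simplifying assumptions: a single hypothetical predictor, an ordinary least squares line instead of a full multi-variable LUR, and synthetic data. The function names and the data are illustrative only and are not taken from the dissertation.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x with one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def k_fold_cv(x, y, k=10, seed=0):
    """Return (CV R2, CV MAE) from pooled held-out predictions."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    predictions = [0.0] * len(x)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        intercept, slope = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in held_out:
            # Each site is predicted by a model that never saw it.
            predictions[i] = intercept + slope * x[i]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    mae = sum(abs(yi - pi) for yi, pi in zip(y, predictions)) / len(y)
    return 1.0 - ss_res / ss_tot, mae


if __name__ == "__main__":
    # Synthetic stand-in for 149 sensor sites: one made-up predictor
    # (e.g., a road density index) and a noisy linear PM2.5 response.
    rng = random.Random(42)
    road_density = [rng.uniform(0.0, 10.0) for _ in range(149)]
    pm25 = [5.0 + 0.8 * d + rng.gauss(0.0, 1.5) for d in road_density]
    cv_r2, cv_mae = k_fold_cv(road_density, pm25, k=10)
    print(f"10-fold CV R2 = {cv_r2:.2f}, MAE = {cv_mae:.2f} ug/m3")
```

Pooling the held-out predictions before computing R2 and MAE is one common convention; averaging per-fold metrics is another, and the two can differ slightly.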

In Chapter 5, I developed national LUR models to predict annual average concentrations of 6 criteria pollutants (NO2, PM2.5, O3, CO, SO2 and PM10) in the US and compared models using new data (Google POI, Google Street View [GSV] and Local Climate Zone [LCZ]) vs. traditional geographic variables (e.g., road lengths, area of built land) across different modeling approaches (partial least squares [PLS], stepwise regression, and machine learning [ML], with and without kriging). Model performance was similar for both variable scenarios (e.g., random 10-fold CV R2 of ML-kriging models for NO2, new vs. traditional: 0.89 vs. 0.91), and adding the new variables to the traditional LUR models did not necessarily improve model performance. Models with kriging outperformed those without (e.g., CV R2 for PM2.5 using the new variables, ML-kriging vs. ML: 0.83 vs. 0.67). The importance of the new variables to LUR models highlights their potential to substitute for traditional variables, thus enabling LUR models for areas with limited or no data (e.g., developing countries) and across cities.

The dissertation presents the integration of new/open data from non-government sponsored platforms and crowd-sourced low-cost sensor networks into LUR models, based on different modeling approaches, for predicting ambient air pollution. The analyses provide evidence that using new data sources for both air quality and predictor variables could be a promising strategy to improve LUR models and track exposures more accurately. The results could inform environmental scientists, health policy makers, and urban planners interested in promoting healthy communities.

Keywords

Hazardous air pollutants, volunteer-based monitoring, local emissions, exposure assessment, crowdsourcing, low-cost monitoring, LUR validation, hybrid models, open data, urban morphology, enhanced models
