Machine Learning for Structure-Agnostic Chemical Analysis from Chromatographic Data
Abstract
Environmental monitoring relies heavily on gas chromatography (GC) to measure airborne contaminants such as volatile organic compounds (VOCs), yet many detected compounds lack structural or spectral references, limiting identification, property estimation, and quantitative analysis. This thesis investigates how machine learning (ML) can extract chemically meaningful information directly from chromatographic data to overcome these limitations. First, ML models are developed to establish a bidirectional relationship between chromatographic retention behavior on orthogonal GC phases and key physicochemical properties (vapor pressure, Henry's law constant, and solubility). Using XGBoost regression models trained on the NIST retention index database, a structure-agnostic "Index-to-Property" model predicts physicochemical properties from paired retention indices, while a complementary "Property-to-Index" model predicts retention behavior from known properties, achieving predictive performance up to R² = 0.98. Second, this work demonstrates that compound identity and concentration can be inferred directly from chromatographic peak shape, bypassing manual peak integration. ML classification and regression models trained on peaks from ambient atmospheric samples achieve 89% identification accuracy and a mean absolute error of 0.085 ppbv in concentration prediction. Together, these results show that machine learning can address key identification and data reduction challenges in environmental GC, enabling faster, structure-independent interpretation of complex mixtures.
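For readers less familiar with the two regression metrics quoted in the abstract, the following is a minimal sketch of how a coefficient of determination (R²) and a mean absolute error are conventionally computed. It is not the thesis code: the function names `r_squared` and `mean_absolute_error` and the concentration values are hypothetical, chosen only for illustration, and the thesis may compute its metrics with a library routine instead.

```python
from typing import Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: squared error of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: spread of the true values around their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are identical")
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference, in the same units as the inputs."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up illustrative concentrations in ppbv, NOT data from the thesis.
true_ppbv = [0.10, 0.50, 1.00, 2.00]
pred_ppbv = [0.15, 0.45, 1.10, 1.90]

r2 = r_squared(true_ppbv, pred_ppbv)
mae = mean_absolute_error(true_ppbv, pred_ppbv)
print(f"R^2 = {r2:.3f}, MAE = {mae:.3f} ppbv")
```

Because the mean absolute error carries the units of the predicted quantity, a value reported in ppbv can be compared directly against the concentration range of the samples, whereas R² is unitless.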