Science Guided Machine Learning: Incorporating Scientific Domain Knowledge for Learning Under Data Paucity and Noisy Contexts
Files
TR Number
Date
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
Abstract
In recent years, the large amount of labeled data available has helped tend machine learning (ML) research toward using purely data driven end-to-end pipelines, e.g., in deep neural network research. However, in many situations, data is limited and of poor quality. Traditional ML pipelines are known to be susceptible to various issues when trained on low volumes of non-representative, noisy datasets. We investigate the question of whether prior domain knowledge about the problem being modeled can be employed within the ML pipeline to improve model performance under data paucity and in noisy contexts? This report presents recent developments as well as details, novel contributions in the context of incorporating prior domain knowledge in various data-driven modeling (i.e., machine learning - ML) pipelines particularly geared towards scientific applications. Such domain knowledge exists in various forms and can be incorporated into the machine learning pipeline using different implicit and explicit methods (termed: science-guided machine learning (SGML)). All the novel techniques proposed in this report have been presented in the context of developing SGML to model fluid dynamics applications, but can be easily generalized to other applications. Specifically, we present SGML pipelines to (i) incorporate prior domain knowledge into the ML model architecture (ii) incorporate knowledge about the distribution of the target process as statistical priors for improved prediction performance (iii) leverage prior knowledge to quantify consistency of ML decisions with scientific principles (iv) explicitly incorporate known mathematical relationships of scientific phenomena to influence the ML pipeline (v) develop science-guided transfer learning to improve performance under data paucity. Each technique that is presented, has been designed with a focus on simplicity and minimal cost of implementation with a goal of yielding significant improvements in model performance especially under low data volumes or under noisy data conditions. In each application, we demonstrate through rigorous qualitative and quantitative experiments that our SGML pipelines achieve significant improvements in performance and interpretability over corresponding models that are purely data-driven and agnostic to scientific knowledge.