Systematic auditing is essential to debiasing machine learning in biology

Eid, Fatma-Elzahraa; Elmarakeby, Haitham A.; Chan, Yujia Alina; Fornelos, Nadine; ElHefnawi, Mahmoud; Van Allen, Eliezer M.; Heath, Lenwood S.; Lage, Kasper

dc.contributor.author	Eid, Fatma-Elzahraa	en
dc.contributor.author	Elmarakeby, Haitham A.	en
dc.contributor.author	Chan, Yujia Alina	en
dc.contributor.author	Fornelos, Nadine	en
dc.contributor.author	ElHefnawi, Mahmoud	en
dc.contributor.author	Van Allen, Eliezer M.	en
dc.contributor.author	Heath, Lenwood S.	en
dc.contributor.author	Lage, Kasper	en
dc.contributor.department	Computer Science	en
dc.date.accessioned	2021-05-19T13:02:58Z	en
dc.date.available	2021-05-19T13:02:58Z	en
dc.date.issued	2021-02-10	en
dc.description.abstract	Biases in data used to train machine learning (ML) models can inflate their prediction performance and confound our understanding of how and what they learn. Although biases are common in biological data, systematic auditing of ML models to identify and eliminate these biases is not a common practice when applying ML in the life sciences. Here we devise a systematic, principled, and general approach to audit ML models in the life sciences. We use this auditing framework to examine biases in three ML applications of therapeutic interest and identify unrecognized biases that hinder the ML process and result in substantially reduced model performance on new datasets. Ultimately, we show that ML models tend to learn primarily from data biases when there is insufficient signal in the data to learn from. We provide detailed protocols, guidelines, and examples of code to enable tailoring of the auditing framework to other biomedical applications. Fatma-Elzahraa Eid et al. illustrate a principled approach for identifying biases that can inflate the performance of biological machine learning models. When applied to three biomedical prediction problems, they identify previously unrecognized biases and ultimately show that models are likely to learn primarily from data biases when there is insufficient learnable signal in the data.	en
dc.description.notes	We thank Yu Xia (McGill University), Paul A. Clemons (Broad Institute of MIT and Harvard), and Lucas Janson (Harvard University) for helpful discussions and Shuyu Wang (UCSF) for help in dataset preparation. This work was supported by grants from The Stanley Center for Psychiatric Research, the National Institute of Mental Health (R01 MH109903), the Simons Foundation Autism Research Initiative (award 515064), the Lundbeck Foundation (R223-2016-721), a Broad Next10 grant, and a Broad Shark Tank grant. Y.A.C. was funded by a Human Frontier Science Program Postdoctoral Fellowship [LT000168/2015-L].	en
dc.description.sponsorship	Stanley Center for Psychiatric Research; National Institute of Mental Health, United States Department of Health & Human Services, National Institutes of Health (NIH) - USA, NIH National Institute of Mental Health (NIMH) [R01 MH109903]; Simons Foundation Autism Research Initiative [515064]; Lundbeck Foundation (Lundbeckfonden) [R223-2016-721]; Broad Next10 grant; Broad Shark Tank grant; Human Frontier Science Program Postdoctoral Fellowship, Human Frontier Science Program [LT000168/2015-L]	en
dc.format.mimetype	application/pdf	en
dc.identifier.doi	https://doi.org/10.1038/s42003-021-01674-5	en
dc.identifier.eissn	2399-3642	en
dc.identifier.issue	1	en
dc.identifier.other	183	en
dc.identifier.pmid	33568741	en
dc.identifier.uri	http://hdl.handle.net/10919/103378	en
dc.identifier.volume	4	en
dc.language.iso	en	en
dc.rights	Creative Commons Attribution 4.0 International	en
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/	en
dc.title	Systematic auditing is essential to debiasing machine learning in biology	en
dc.title.serial	Communications Biology	en
dc.type	Article - Refereed	en
dc.type.dcmitype	Text	en
dc.type.dcmitype	StillImage	en

Files

Original bundle

Name: s42003-021-01674-5.pdf
Size: 826.74 KB
Format: Adobe Portable Document Format
Description: Published version

Collections

Scholarly Works, Computer Science