SPICE, A Dataset of Drug-like Molecules and Peptides for Training Machine Learning Potentials

Eastman, Peter; Behara, Pavan Kumar; Dotson, David L.; Galvelis, Raimondas; Herr, John E.; Horton, Josh T.; Mao, Yuezhi; Chodera, John D.; Pritchard, Benjamin P.; Wang, Yuanqing; De Fabritiis, Gianni; Markland, Thomas E.

dc.contributor.author	Eastman, Peter	en
dc.contributor.author	Behara, Pavan Kumar	en
dc.contributor.author	Dotson, David L.	en
dc.contributor.author	Galvelis, Raimondas	en
dc.contributor.author	Herr, John E.	en
dc.contributor.author	Horton, Josh T.	en
dc.contributor.author	Mao, Yuezhi	en
dc.contributor.author	Chodera, John D.	en
dc.contributor.author	Pritchard, Benjamin P.	en
dc.contributor.author	Wang, Yuanqing	en
dc.contributor.author	De Fabritiis, Gianni	en
dc.contributor.author	Markland, Thomas E.	en
dc.date.accessioned	2023-04-04T15:06:25Z	en
dc.date.available	2023-04-04T15:06:25Z	en
dc.date.issued	2023-01-04	en
dc.description.abstract	Machine learning potentials are an important tool for molecular simulation, but their development is held back by a shortage of high-quality datasets to train them on. We describe the SPICE dataset, a new quantum chemistry dataset for training potentials relevant to simulating drug-like small molecules interacting with proteins. It contains over 1.1 million conformations for a diverse set of small molecules, dimers, dipeptides, and solvated amino acids. It includes 15 elements, charged and uncharged molecules, and a wide range of covalent and non-covalent interactions. It provides both forces and energies calculated at the ωB97M-D3(BJ)/def2-TZVPPD level of theory, along with other useful quantities such as multipole moments and bond orders. We train a set of machine learning potentials on it and demonstrate that they can achieve chemical accuracy across a broad region of chemical space. It can serve as a valuable resource for the creation of transferable, ready-to-use potential functions for use in molecular simulations.	en
dc.description.notes	Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award numbers R01GM140090 (JDC, TEM, PE, GdF) and R01GM132386 (JDC, PKB, YW). BPP acknowledges support from the National Science Foundation under award number CHE-2136142.	en
dc.description.sponsorship	National Institute of General Medical Sciences of the National Institutes of Health [R01GM140090, R01GM132386]; National Science Foundation [CHE-2136142]	en
dc.description.version	Published version	en
dc.format.mimetype	application/pdf	en
dc.identifier.doi	https://doi.org/10.1038/s41597-022-01882-6	en
dc.identifier.eissn	2052-4463	en
dc.identifier.issue	1	en
dc.identifier.other	11	en
dc.identifier.pmid	36599873	en
dc.identifier.uri	http://hdl.handle.net/10919/114246	en
dc.identifier.volume	10	en
dc.language.iso	en	en
dc.publisher	Nature Portfolio	en
dc.rights	Creative Commons Attribution 4.0 International	en
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/	en
dc.subject	Protein-ligand binding	en
dc.subject	accuracy	en
dc.subject	database	en
dc.title	SPICE, A Dataset of Drug-like Molecules and Peptides for Training Machine Learning Potentials	en
dc.title.serial	Scientific Data	en
dc.type	Article - Refereed	en
dc.type.dcmitype	Text	en

Files

Original bundle

Name: s41597-022-01882-6.pdf
Size: 1.14 MB
Format: Adobe Portable Document Format
Description: Published version

Collections

Scholarly Works, Chemistry