Automated Synthesis Procedure Generation in Heterogeneous Catalysis via Fine-Tuned Language Models
Abstract
Exploring catalytic materials and their synthesis routes traditionally demands extensive iterative experimentation and significant time. To address this, we developed an extraction workflow that integrates language models with multimodal processing. First, text from over 9,000 scientific articles was analyzed to extract detailed catalyst attributes: chemical composition, structural motifs, morphology, crystal structure, size, shape, and support materials. Images and their associated captions were also captured from these publications and processed with vision-language methods to enrich the dataset. This structured information was then refined through classification, synthesis query generation, and feasibility validation, yielding a curated dataset of 1,632 high-quality catalyst synthesis procedures. Using this dataset, we fine-tuned a large language model with parameter-efficient adaptation to improve its ability to predict detailed catalyst synthesis methods. The fine-tuned model converged stably and improved substantially over baseline models, achieving a ROUGE-1 score of 0.522, a ROUGE-L score of 0.290, and a BERTScore of 0.863. These results underscore the value of integrating multimodal data with validation methods and offer a pathway to accelerate catalyst discovery while reducing research timelines and resource demands.
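To make the reported ROUGE-1 and ROUGE-L figures easier to interpret, the sketch below computes both metrics as F1 scores for a single generated procedure against a reference. It is a simplified illustration only: the abstract does not say which ROUGE implementation or variant was used, so the lowercase whitespace tokenization, the absence of stemming, and the function names (`rouge_1`, `rouge_l`, `lcs_length`) are all assumptions, and the example sentences are invented.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokens, no stemming or punctuation handling.
    return text.lower().split()


def f1(matches, n_candidate, n_reference):
    # Harmonic mean of precision (matches / candidate length)
    # and recall (matches / reference length).
    if matches == 0 or n_candidate == 0 or n_reference == 0:
        return 0.0
    precision = matches / n_candidate
    recall = matches / n_reference
    return 2 * precision * recall / (precision + recall)


def rouge_1(candidate, reference):
    # Unigram overlap, with each token counted at most as often
    # as it appears in both texts (clipped counts).
    cand, ref = tokenize(candidate), tokenize(reference)
    overlap = sum((Counter(cand) & Counter(ref)).values())
    return f1(overlap, len(cand), len(ref))


def lcs_length(a, b):
    # Length of the longest common subsequence, via row-by-row dynamic programming.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    # Same F1 form as ROUGE-1, but matches must respect word order.
    cand, ref = tokenize(candidate), tokenize(reference)
    return f1(lcs_length(cand, ref), len(cand), len(ref))


# Hypothetical example texts, not taken from the dataset.
reference = "dissolve nickel nitrate in water then calcine at 500 C"
generated = "calcine at 500 C after dissolving nickel nitrate in water"
print(f"ROUGE-1: {rouge_1(generated, reference):.3f}")
print(f"ROUGE-L: {rouge_l(generated, reference):.3f}")
```

Because the longest common subsequence can never contain more tokens than the clipped unigram overlap, ROUGE-L under this definition is never higher than ROUGE-1. A gap between the two, as in the reported 0.522 versus 0.290, is consistent with generated procedures that share much of the reference vocabulary but order or phrase the steps differently.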