Suitability of large language models for extraction of high-quality chemical reaction dataset from patent literature
Journal of Cheminformatics volume 16, Article number: 131 (2024)
Abstract
With the advent of artificial intelligence (AI), it is now possible to design diverse and novel molecules from previously unexplored chemical space. However, a challenge for chemists is the synthesis of such molecules. Recently, there have been attempts to develop AI models for retrosynthesis prediction, which rely on the availability of a high-quality training dataset. In this work, we explore the suitability of large language models (LLMs) for extraction of high-quality chemical reaction data from patent documents. A comparative study on the same set of patents used in an earlier study showed that the proposed automated approach can enhance the current datasets by the addition of 26% new reactions. Several challenges were identified during reaction mining, and for some of them alternative solutions were proposed. A detailed analysis was also performed, wherein several wrong entries were identified in the previously curated dataset. Reactions extracted using the proposed pipeline over a larger patent dataset can improve the accuracy and efficiency of synthesis prediction models in the future.
Scientific contribution
In this work we evaluated the suitability of large language models for mining a high-quality chemical reaction dataset from patent literature. We showed that the proposed approach can significantly improve both the quantity of the reaction database, by identifying more chemical reactions, and its quality, by correcting previous errors/false positives.
Introduction
Artificial intelligence (AI) can discover potential new medicines faster than ever before [1, 2]. The efficiency of AI-driven models relies on the extensive and high-quality datasets used for training [3]. A major task in the drug discovery pipeline is the prediction of synthesis routes for novel drug molecules. Recent advancements in computer-aided synthesis planning (CASP) offer efficient tools for reaction prediction and synthesis planning [4]. Knowledge of existing chemical reaction data aids in synthesizing a molecule in the lab and in predicting the reaction conditions and yield [5]. Utilizing deep learning techniques, researchers have developed predictive models capable of forecasting reaction outcomes, facilitating the design of efficient synthetic routes [5, 6]. However, efforts to improve synthesizability prediction are constrained by the lack of high-quality open access data on synthesis routes and reaction conditions [7]. Such chemical reaction data is scattered across the scientific literature, including journal articles and patents. Large-scale chemical reaction datasets can be instrumental in the development of more accurate and versatile synthesis planning tools [8] or lab automation [4]. This necessity has led to the development of commercial structured reaction databases such as Reaxys [9] and SciFinder [10]. Access to these databases is limited by subscription fees or institutional access, posing challenges for researchers with limited resources. Reaxys [9] is probably the most widely used database for drug discovery tasks, curated through manual effort. However, a notable drawback of Reaxys is its relatively slow pace of updates, which may cause researchers to miss the latest information on chemical research [9].
The largest open-source dataset of organic reactions is curated from patent literature (USPTO) and is widely used for various applications [11]. Patents often disclose new chemical reactions and synthetic methodologies that may not be published in academic literature. Extracting chemical reactions from patents can therefore help to uncover previously unknown reactions, reagents, and reaction conditions. However, reactions in patents can also be biased toward certain reaction types that are often used in industry for commercial purposes [12]. A grammar-based chemical reaction extraction pipeline [11] was developed, which combines information parsing with grammatical rules to extract chemical reactions from USPTO patent documents. The work utilized the ChemicalTagger [13] method to efficiently match against extensive grammars defining entity types (such as reactant, solvent, catalyst, workup, and product) as well as chemical and physical quantities. This approach was supplemented with manually crafted regular expressions to identify other entity types. The resulting data was formatted in Extensible Markup Language (XML) and is popularly known as the USPTO dataset. Later, the USPTO dataset was integrated into a structured format (web interface and JSON) as part of the Open Reaction Database (ORD) [14] by a consortium of academic and industry partners. However, the dataset contains reactions from patents only up to September 2016, and several questions have since been raised about the quality of the data [15]. While the open-source USPTO chemical reaction data covers patents from 1976 to 2016, an extended dataset is commercially available as Pistachio [16].
Most other chemical reaction extraction studies are centered around the Cheminformatics Elsevier Melbourne University (ChEMU) competition [17], where the task was to extract chemical reaction information from patents. Various versions of bidirectional encoder representations from transformers (BERT) models, e.g. ChemBERTa [18], Patent_BioBERT [19], and BioLinkBERT [20], were developed as named entity recognition (NER) systems for chemical entity identification from patents. One study [21] explored Conditional Random Fields (CRFs) and multilayer perceptrons (MLPs), incorporating word-level, grammatical, and functional terms/features to identify entities present in patent documents. There have also been attempts to develop methods for extracting chemical reactions specifically from journal articles. Wilary et al. [22] developed ReactionDataExtractor, which uses a combination of neural networks and symbolic artificial intelligence methods, while Qian et al. [23] developed RxnScribe, a transformer-based sequence generation model for extracting reaction schemes from reaction diagrams in the chemistry literature, trained on a dataset curated from four major organic chemistry journals. Patiny et al. [24] and Ai et al. [25] have proposed LLM-based approaches for automatic extraction of experimental data of molecules from the literature. According to a recent review by Schilling-Wilhelmi et al. [26], structured data is important and LLMs might play a significant role in curating such scientific datasets.
Due to the exponential growth of the organic chemistry literature, extraction of chemical reactions through manual approaches requires intensive human effort and is often time-consuming. Any approach (automated or rule-based) should ensure the accuracy, completeness, and consistency of the extracted reaction data. Any error or inconsistency in the data can affect the reliability of search results, analyses, and models developed based on the data. Errors can occur during extraction of metadata such as reaction conditions, yields, and participating molecules from patent documents, which can be challenging to generalize due to variations in how this information is presented, as pointed out by previous studies on reaction data extraction [27, 28]. Recent works have demonstrated how pretrained large language models (LLMs) can be used for extraction of structured information from complex scientific knowledge [29]. Pre-trained LLMs are trained on vast amounts of text data, including scientific literature, which increases their ability to understand complex language patterns, including variations, synonyms, and context-dependent meanings. Large language models can also capture and handle ambiguity, uncertainty, and variation in language usage more effectively than rule-based systems. Unlike rule-based methods, there is no need to retrain the LLM to adapt to new writing styles, nor does the model need to be updated [30]. Additionally, an LLM can automatically map the chemical entities to their quantities without the need for a separate mapping algorithm such as ChemicalTagger [13].
This study explores the utility of LLMs in extracting chemical reactions from USPTO patents. A complete pipeline for chemical reaction extraction from patent documents was developed using LLMs. Multiple challenges were encountered in the process, and for some of them alternative solutions were proposed. The performance of popular LLMs such as GPT-3.5 [31], Gemini 1.0 Pro [32], Claude 2.1 [33], and Llama 2-13B [34] was compared. The extracted data was carefully compared with randomly chosen reaction data extracted from one month of patent literature by the existing non-LLM based method [11]. While the proposed method could extract 26% additional new reaction data from the same set of patents, it could also identify multiple wrong entries in the previously extracted dataset. These results highlight the potential of the method in enhancing reaction data quality.
Materials and methods
The proposed pipeline (shown in Fig. 1) consists of (1) creation of a dataset of organic chemistry patents, (2) identification of reaction-containing paragraphs in the patents, (3) extraction of chemical entities from the reaction-containing paragraphs using LLMs, (4) conversion of the identified chemical entities from IUPAC format to SMILES format, and (5) atom mapping between reactant(s) and product(s) in SMILES format, which verifies whether the extracted reactions are valid. In the next few sub-sections, all components of the proposed pipeline are discussed.
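For illustration, a minimal Python skeleton of how these five stages could be chained is sketched below. The function names are hypothetical placeholders and do not correspond to the authors' implementation; each stage is expanded in the sub-sections that follow.

```python
# Hypothetical skeleton of the five-stage pipeline; every helper below is a
# placeholder for the corresponding stage and is sketched in later sub-sections.

def find_reaction_paragraphs(patent_text):       # stage 2: Naive Bayes classifier
    raise NotImplementedError

def llm_extract_entities(paragraph):             # stage 3: LLM-based NER
    raise NotImplementedError

def convert_iupac_to_smiles(entities):           # stage 4: OPSIN / PubChemPy
    raise NotImplementedError

def atom_mapping_is_valid(entities):             # stage 5: RXNMapper validation
    raise NotImplementedError

def extract_reactions_from_patent(patent_text):
    """Chain the stages for a single patent document and return valid reactions."""
    reactions = []
    for paragraph in find_reaction_paragraphs(patent_text):
        entities = llm_extract_entities(paragraph)
        entities = convert_iupac_to_smiles(entities)
        if atom_mapping_is_valid(entities):
            reactions.append(entities)
    return reactions
```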
Dataset curation
In the first step, US patents associated with organic chemistry were curated from USPTO for the month of February 2014. Google Patents service was used to download USPTO patents corresponding to the International Patent Classification (IPC) code ‘C07’, where ‘C’ designates the section related to chemistry and metallurgy, and ‘07’ denotes the sub-category specific to organic chemistry. A total of 618 patents were obtained from the Google Patents website.
Validation dataset
The Open Reaction Database (ORD) was created based on an open-access schema and infrastructure for structuring and sharing organic reaction data, including a centralized data repository [14]. It also integrates the data extracted from the USPTO dataset [11]. The refined version of the reaction data from February 2014 in the ORD database [14] was chosen to assess the performance of the proposed pipeline. This selection allows for a comparative analysis, utilizing the previous extractions as a benchmark or ground truth to gauge the efficacy of the automated pipeline.
Identification of reaction containing paragraphs
Passing the whole patent document to the LLM model for extraction of chemical reactions would be a resource- and time-intensive process. An alternative approach is to first identify the reaction-containing paragraphs in the patent, which can then be passed through the LLM model for chemical entity recognition and reaction condition extraction. A BioBERT classifier [35] and a Naïve-Bayes classifier [36] were trained on a manually labelled corpus of reaction paragraphs [37], and their performance was compared on the task of distinguishing reaction-containing paragraphs from those that do not contain reactions. While the training performance of both models was comparable, the Naïve-Bayes classifier performed better during tenfold cross validation. The cross-validation performance of the Naïve-Bayes model (precision = 96.4%, recall = 96.6%) exceeded that of the BioBERT model (precision = 86.9%, recall = 90.2%). Therefore, the Naïve-Bayes classifier was chosen for identification of reaction-containing paragraphs in patent documents. Examples of reaction and non-reaction paragraphs predicted by the Naïve-Bayes model are shown in supporting information Fig. S1.
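A minimal sketch of such a reaction-paragraph classifier, assuming scikit-learn and a labelled corpus in the style of [37], is shown below; the TF-IDF features and the toy training data are illustrative choices, not the exact setup used in this study.

```python
# Illustrative Naive Bayes reaction-paragraph classifier with 10-fold CV.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate

# Toy labelled data (1 = reaction-containing, 0 = non-reaction); in practice
# these would come from the manually labelled corpus [37].
paragraphs = (["To a stirred solution of aniline in DMF was added NaH."] * 20
              + ["The present invention relates to novel compounds."] * 20)
labels = [1] * 20 + [0] * 20

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), lowercase=True),
                    MultinomialNB())

# Ten-fold cross validation, reporting precision and recall as in the paper.
scores = cross_validate(clf, paragraphs, labels, cv=10,
                        scoring=("precision", "recall"))
print(scores["test_precision"].mean(), scores["test_recall"].mean())
```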
Identification of chemical reaction entities using LLM
The identified reaction paragraphs were next passed through the LLM models for Named Entity Recognition (NER). LLM’s zero-shot Named Entity Recognition (NER) capability [31], was used to extract chemical reaction entities including reactants, solvents, workup, reaction conditions, catalysts, and product(s) along with their quantities from reaction paragraphs (shown in Fig. 2). Pretrained LLMs will be efficient in this task since they are trained on vast amounts of text data including scientific literature, which increase their ability to understand complex language patterns, such as variations, synonyms, and context-dependent meanings. A set of prompts were designed for this task and experimented to identify the final prompt to be provided to the LLM (shown in Fig. 2). The temperature value of 0 was used while inferencing the LLMs.
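The snippet below sketches how such a zero-shot extraction call could look, assuming the google-generativeai Python client for Gemini 1.0 Pro; the prompt is an abridged illustration rather than the exact prompt shown in Fig. 2.

```python
# Sketch of the zero-shot entity extraction step (illustrative prompt only).
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.0-pro")

PROMPT = (
    "Extract the reactants, solvents, catalysts, reaction conditions, workup "
    "steps and product(s) with their quantities from the following reaction "
    "paragraph. Report every chemical by its IUPAC name and return the result "
    "as JSON.\n\nParagraph:\n{paragraph}"
)

def extract_entities(paragraph: str) -> str:
    response = model.generate_content(
        PROMPT.format(paragraph=paragraph),
        generation_config={"temperature": 0},  # temperature 0, as in the study
    )
    return response.text
```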
IUPAC to SMILES conversion
The chemical entities are extracted by the LLM as the IUPAC (International Union of Pure and Applied Chemistry) names present in the reaction paragraphs. These chemical entities need to be converted to SMILES (Simplified Molecular Input Line Entry System) or InChI (International Chemical Identifier) format for storage and the subsequent atom mapping process. It is important to note that, initially, the capability of the LLM to directly extract chemical entities in SMILES format was tested. However, the accuracy of the extracted SMILES was low (73.5%), which has also been reported by other studies [38]. This might be due to the known hallucination problem of LLMs [38, 39]. It was also observed that the IUPAC to SMILES conversion rate for common chemicals (e.g. solvents) and for the remaining chemical entities (e.g. reactants, products) was 87.63% and 61.06%, respectively. Therefore, chemical entities were extracted in IUPAC format and later post-processed to obtain the SMILES format. The Open Parser for Systematic IUPAC Nomenclature (OPSIN) library [40] was used to convert chemical entities from IUPAC format to SMILES format. In a few cases, chemicals are mentioned by formula (e.g. Na2SO4 solution) or by abbreviation (e.g. DMF, DMSO); the OPSIN library is not designed to handle such names, as it only processes IUPAC nomenclature. In these cases, the PubChemPy library was used for conversion, which enhanced the overall recall of the conversion process.
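A minimal sketch of this two-step name-to-structure conversion, assuming the py2opsin wrapper around OPSIN and the PubChemPy library, is given below; error handling is simplified for illustration.

```python
# OPSIN first, PubChem lookup as a fallback for abbreviations/common names.
from py2opsin import py2opsin
import pubchempy as pcp

def name_to_smiles(name: str) -> str | None:
    smiles = py2opsin(name)  # returns an empty value when OPSIN cannot parse the name
    if smiles:
        return smiles
    try:
        hits = pcp.get_compounds(name, "name")  # PubChem name lookup (network call)
        if hits:
            return hits[0].canonical_smiles
    except pcp.PubChemHTTPError:
        pass
    return None

print(name_to_smiles("2-methylpropan-2-ol"))  # resolved by OPSIN
print(name_to_smiles("DMF"))                  # resolved via PubChem
```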
It is important to note that the LLMs may identify prefix/suffix words such as “crude/dry/aqueous/solid/anhydrous/concentrated/solution”, associated with the chemical entities, and extract them together with the chemical entity (e.g. aqueous NaOH solution). Custom post-processing techniques were utilized to remove such descriptors from the IUPAC name of the chemical entity before conversion to the SMILES format. An example of such post-processing is shown in supporting Fig. S2. The rules for IUPAC extraction were developed manually and fine-tuned on the February 2014 patent dataset through an iterative process.
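The snippet below illustrates this kind of descriptor stripping; the word list is only an example and not the full rule set developed in the study.

```python
# Illustrative cleanup of descriptor words attached to an extracted chemical name.
import re

DESCRIPTORS = r"\b(crude|dry|aqueous|solid|anhydrous|concentrated|solution|saturated)\b"

def clean_chemical_name(name: str) -> str:
    cleaned = re.sub(DESCRIPTORS, "", name, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", cleaned).strip()

print(clean_chemical_name("aqueous NaOH solution"))  # -> "NaOH"
```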
Atom mapping
In a well-formed chemical reaction, all atoms in the product(s) must originate from the reactants (shown in Fig. 3), and any reaction that does not adhere to this principle should not be considered valid [11]. Atom–atom mapping (AAM) serves to establish these relationships by tracing the atoms in the product(s) to those in the reactants. Typically, a maximum common subgraph algorithm is utilized to identify the maximum number of atoms in the product that can be attributed to a specific reactant. However, it is important to note that the resulting mapping may lack uniqueness in terms of the selected atoms within a reactant or the reactants providing atoms [41].
RXNMapper [27], a transformer-based neural network technique, was used to perform atom mapping between the product(s) and the reactants. RXNMapper does not require supervision or human labelling and exhibited much better performance, especially for unbalanced and chemically complex reactions, compared to previous methods such as the Indigo Toolkit [41] and Mappet [42]. It is important to note that, in some instances, the solvent may function as a reactant, complicating the atom mapping process. To mitigate this complication, the solvent is reclassified as a reactant and the atom mapping process is repeated to ensure accurate mapping of atoms (see SI Fig. S3).
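A minimal sketch of the atom-mapping step with RXNMapper is shown below; the reaction SMILES is a toy esterification used only for illustration.

```python
# Attention-guided atom mapping of a reaction SMILES with RXNMapper.
from rxnmapper import RXNMapper

rxn_mapper = RXNMapper()
rxn = "CC(=O)O.OCC>>CC(=O)OCC.O"  # acetic acid + ethanol -> ethyl acetate + water
result = rxn_mapper.get_attention_guided_atom_maps([rxn])[0]
print(result["mapped_rxn"], result["confidence"])
```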
Results and discussion
Performance of the proposed pipeline
The final set of 618 patents with ‘C07’ classification was analyzed using the Naïve-Bayes classifier to identify reaction-containing paragraphs. 20,016 reaction paragraphs were identified, which were then processed by the large language model (LLM) to extract reaction entities such as reactants, solvents, and products along with their quantities using the zero-shot prompts. Four LLM models were tested, and their performances are compared in the “Comparison with previously extracted dataset, the open reaction database (ORD)” section. The results described here are from the best performing LLM model, Gemini 1.0 Pro. The reaction entities identified in IUPAC format were subsequently converted to SMILES strings using the OPSIN and PubChemPy libraries as described in the “Materials and methods” section. Finally, the reactants, solvents, and products in SMILES format were passed through RXNMapper [27] to map the product atoms to the reactant atoms. A total of 10,726 valid reactions were obtained from the 20,016 reaction paragraphs. The remaining 9290 paragraphs failed to progress through the pipeline due to various issues: (1) OPSIN conversion errors, (2) unresolved references of reactants/solvents/products, and (3) missing product atoms in the reactants, even though all the reactants were identified. In the next subsections, the challenges associated with chemical reaction extraction using LLM models are described.
Challenges associated with chemical reaction extraction from patent documents
In the following sub-sections, the challenges faced by the proposed automated reaction extraction pipeline are discussed, and for most of them possible solutions are presented.
Presence of product information in the reaction paragraph sub-heading
While it was assumed that all the reaction paragraphs contained reactants, solvents, products, and their respective yield information within them, in 64.4% of the reaction paragraphs the product information was actually available only in the sub-heading preceding the reaction-containing paragraph and was referred to as the title compound/product later within the paragraph (see Figs. 4 & 5a). To resolve this issue, the IUPAC name of a chemical present in the sub-heading/title of the reaction paragraph (shown in Fig. 5a), along with the corresponding references, if any, were identified. The LLM's few-shot NER method (shown in SI Fig. S6) was applied to store the identified title compound from the sub-heading in a dictionary, where the key represented the reference and the value represented the IUPAC name. When such references were used in the reaction paragraphs, the stored sub-heading/title compound was substituted in place of the product reference.
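The substitution step can be illustrated with the short sketch below; the reference label and IUPAC name are invented for the example and do not come from a real patent.

```python
# Illustrative substitution of a stored title compound for its reference.
title_compounds = {
    "(Intermediate 5a)": "4-(4-fluorophenyl)piperidin-4-ol",  # reference -> IUPAC name
}

def resolve_title_compound(paragraph: str, compounds: dict[str, str]) -> str:
    """Replace stored references with the corresponding IUPAC names."""
    for reference, iupac_name in compounds.items():
        paragraph = paragraph.replace(reference, iupac_name)
    return paragraph

print(resolve_title_compound(
    "... concentrated to give the title compound (Intermediate 5a) as a white solid.",
    title_compounds))
```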
Fig. 5 a An example of a reaction paragraph where the product information in the reaction paragraph was recalled from the header. b OPSIN failed to convert the highlighted IUPAC name (in yellow) to a SMILES string. c Use of compound references in the reaction paragraph. d Missing brackets in the IUPAC name, leading to an OPSIN error when converting IUPAC to SMILES format. e Example of a reaction paragraph where some of the chemical entity information needs to be extracted from previous reaction paragraphs
Failed IUPAC to SMILES conversion
In 1088 instances, both the OPSIN and PubChemPy libraries failed to convert the chemical name from IUPAC format to SMILES format. One such example is shown in Fig. 5b, where all parentheses are correctly balanced; here the error arises from a typo. In other instances, errors occurred because specific IUPAC substrings could not be resolved by either the OPSIN or PubChemPy libraries. In some other cases, the conversion of IUPAC names failed (see Fig. 5d) due to missing parentheses in the IUPAC name, possibly resulting from typos by the patent authors. Such parentheses can only be removed/added manually, which can be a tedious job. All the IUPAC names that failed OPSIN conversion are provided in the supplementary information.
Failures due to unresolved compound references
In a few patents, it was observed that previously defined compounds are directly referenced, such as ‘Compound X is added to water’, when describing the reactants or products within reaction paragraphs (Fig. 5c). Unless these references are identified and replaced, a reactant/product may be omitted from the reaction. Two different approaches were employed to resolve these references: (1) The LLM's few-shot NER method (shown in SI Fig. S6) was used to identify the compounds along with their references. Each identified compound was stored in a dictionary where the key represented the reference and the value represented the IUPAC name. (2) The previous 10 paragraphs, each with its heading, were provided as context to the zero-shot prompt given to the LLM, as sketched below. With this approach, the LLM successfully resolved complex references that were not identified using approach (1) (shown in SI Tables S1 & S2).
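The following sketch illustrates how the context-augmented prompt of approach (2) could be constructed; the window of 10 preceding paragraphs follows the description above, while the prompt wording is illustrative.

```python
# Illustrative construction of a context-augmented zero-shot prompt.
def build_context_prompt(paragraphs: list[tuple[str, str]], index: int,
                         window: int = 10) -> str:
    """paragraphs: list of (heading, text) pairs; index: current reaction paragraph."""
    start = max(0, index - window)
    context = "\n\n".join(f"{heading}\n{text}"
                          for heading, text in paragraphs[start:index])
    heading, current = paragraphs[index]
    return (
        "Using the context paragraphs, resolve any compound references "
        "(e.g. 'Compound 7') to their full IUPAC names, then extract the "
        "reactants, solvents, catalysts, workup steps and product(s) from "
        "the final reaction paragraph.\n\n"
        f"Context:\n{context}\n\nReaction paragraph ({heading}):\n{current}"
    )
```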
The identified reaction paragraph does not contain the full list of reaction entities
Around 9% of the reaction-containing paragraphs were found to have product/reactant/solvent information available only in a previous paragraph, which may not itself have been identified as a reaction paragraph. Most such paragraphs also lack the quantities of the chemical entities used in the workup process (see Fig. 5e). Extracting reaction information from such paragraphs was challenging with the current approach, as it may require passing the whole patent document through the LLM for each reaction paragraph, which can be a time- and resource-intensive process. To resolve this issue, the previous reaction-containing paragraphs identified by the Naïve-Bayes model in the same patent, together with the corresponding subheadings and the current reaction paragraph, were supplied to the LLM. A successfully extracted reaction is shown in supporting information Table S2.
Reference to compounds present in images
In some of the experimental paragraphs, the reactant/product compound references are present within an image (see Fig. 6). Since the current approach is a text-based approach, we were unable to resolve the image references using the LLM.
Missing product atoms error
The atom mapping process acted as an additional layer of validation for the extracted reaction entities. Even if all chemical entities present in the reaction paragraphs are successfully identified and converted into SMILES format, issues may arise during the subsequent atom mapping process. This can be due to discrepancies in stoichiometry, where the relationship between the number of reactant and product molecules involved is not accurately accounted for. One such example is shown in Supplementary Fig. S5. Failures in the IUPAC to SMILES conversion for a reactant can also result in missing atoms in the products, particularly if the missing atoms belong to the reactant that encountered the conversion error.
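One possible sanity check of this kind, assuming RDKit, is sketched below: it counts product atoms that received no atom-map number and therefore cannot be traced back to any reactant. The mapped reaction SMILES is a toy example.

```python
# Count heavy atoms in the product(s) without an atom-map number.
from rdkit import Chem

def count_unmapped_product_atoms(mapped_rxn_smiles: str) -> int:
    _, _, products = mapped_rxn_smiles.split(">")
    unmapped = 0
    for smi in products.split("."):
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue
        unmapped += sum(1 for atom in mol.GetAtoms() if atom.GetAtomMapNum() == 0)
    return unmapped

print(count_unmapped_product_atoms(
    "[CH3:1][C:2](=[O:3])[OH:4].[OH:5][CH2:6][CH3:7]>>"
    "[CH3:1][C:2](=[O:3])[O:5][CH2:6][CH3:7].[OH2:4]"))  # -> 0
```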
Comparison with previously extracted dataset, the open reaction database (ORD)
The refined February 2014 reaction dataset from ORD (validation dataset) consisted of 8979 reaction-containing paragraphs from 282 patents. Among these 8979 reaction-containing paragraphs, 8857 were part of the 20,016 reaction-containing paragraphs identified from the same set of patents in this work (Fig. 7). When checked manually, 22 paragraphs identified as reaction-containing by the previous study were found to be non-reaction paragraphs (SI Fig. S4).
A close comparison of these 8979 reaction paragraphs showed that 8497 (94.63%) successfully passed through the proposed pipeline. The remaining 482 (5.37%) reaction paragraphs exhibited various issues (Fig. 7). These issues include: (1) incorrect entries in the ORD dataset for which the proposed LLM-based method was able to obtain the correct information from the corresponding reaction paragraph (174 entries); (2) LLM errors (53 entries); and (3) other errors, such as (a) OPSIN conversion errors for the IUPAC names present in the reaction paragraph even though those entries have SMILES in the ORD dataset, (b) unresolved references of reactants/solvents/products using the proposed method, and (c) missing product atoms in the reactants even though all the reactants were identified, despite a reaction entry being present in the ORD database. An in-depth analysis of these issues is given in the following sections. It is important to note that the pipeline proposed in this work could add an extra 2229 reactions, which amounts to 26% more reactions than the February 2014 ORD dataset. It is also important to note that the ORD dataset used Indigo for atom mapping, whereas this study used RXNMapper, which also contributes to the difference in overall performance. In the next section, various errors in the ORD dataset are described with examples.
There were 53 reactions which failed due to LLM errors. These can be broadly classified into three categories, with representative examples shown in supporting Fig. S7: (1) missing entities, where the LLM failed to identify all the entities (Fig. S7a); (2) multiple entities in a single list, where in a few cases the LLM combined multiple entities into a single list, making the post-processing step challenging and leaving the corresponding entity unresolved (Fig. S7b); and (3) inability to replace a compound, where a compound from a previously mentioned protocol is replaced with a new compound and both compounds were identified as reactants by the LLM (Fig. S7c).
Entries with incorrect reaction information in ORD
The ORD dataset contains 174 (2.1%) incorrect reaction entries (see SI Section S2). Examples of such cases include: (1) removal of certain symbols (e.g. a slash) from the heading of the paragraph during postprocessing, resulting in an inaccurate conversion of the IUPAC name to a SMILES string (shown in Fig. 8a); (2) in some cases, incorrect reactants/products were identified and reported in ORD. For example, as shown in Fig. 9, “2-methyl-2,3,5,6,7,8-hexahydro-cyclopenta[b]naphthalen-1-one” was identified as a reactant, whereas the proposed LLM-based approach identified the reactants correctly.
Fig. 8 Representative examples of incorrect ORD entries where a the reactants and product have been mislabelled, b an incomplete product entry in ORD, introduced to resolve an OPSIN unmatched opening bracket error, leads to a different product, c an incomplete product entry in ORD, introduced to resolve an OPSIN unmatched closing bracket error, leads to a different product, d reactants and products were identified from a non-reaction paragraph and e OPSIN was unable to resolve the IUPAC name but the corresponding SMILES of the product is mentioned in ORD
In other cases in the ORD dataset, such parentheses were removed along with part of the IUPAC name preceding the parenthesis, leading to a successful conversion (shown in Fig. 8b & c). However, there is a mismatch between the 2D structure given in the patent and the SMILES obtained after IUPAC conversion in ORD, leading to an error in the extracted chemical reaction.
Comparison of various state-of-the-art LLMs on chemical reaction data extraction
The performance of four state-of-the-art LLMs was compared for extraction of chemical reactions from patent documents. Apart from GPT-3.5, API access to Gemini 1.0 Pro and Claude 2.1 was freely available at the time of this study.
Gemini 1.0 Pro's performance was comparable to that of GPT-3.5 in terms of both accuracy and time (Table 1). Claude 2.1's accuracy was approximately 4–5% lower than that of the Gemini 1.0 Pro and GPT-3.5 models, and it required roughly twice as much time. The lower performance of the Llama 2-13B model can be attributed to its smaller number of parameters. Unfortunately, due to hardware constraints (see SI Section S1.2 for LLM implementation details), we were unable to run the larger Llama 2 models and used only the 13B model with 4-bit quantization. In this study, the Gemini 1.0 Pro API was used for identifying the reaction entities in patent documents, as API access was free at the time of this work. Although the LLMs considered in Table 1 are not comparable in terms of the number of parameters, the results conclusively indicate the applicability of each LLM in the context of reaction mining. The improvement in chemical reaction data extraction achieved by the proposed approach provides a lower bound on the capability of LLMs in the advancement of text mining.
In order to assess generalizability, the proposed method was applied to the January 2017 USPTO data, from which it extracted 11,009 reactions from 841 patent documents.
Conclusion
In this work, an automated LLM-based approach was proposed for reaction data extraction, which can help to develop a high-quality reaction dataset. The suitability of the proposed approach was assessed by comparing it with a randomly chosen one-month reaction dataset extracted by an earlier rule-based algorithm. In comparison to the existing dataset available in ORD, the proposed method added several new reactions from the same set of patent documents. It also identified several wrong reactions in the previous dataset. The main aim of this work was to determine whether the LLM-based approach can significantly improve the quality and quantity of the chemical reaction dataset. Several challenges were identified, and for some of them alternative solutions were suggested. Most importantly, the proposed method can add 26% new reactions from the same set of patents compared to the previous methods. In future work, we aim to apply the proposed method to multiple years of patent documents and prepare a high-quality reaction dataset for the development of synthesis prediction models. The improvement in the quantity and quality of the dataset can significantly improve retrosynthesis and forward synthesis predictions.
It should be noted that for chemical reaction extraction, the previous grammar-based approach [11] was faster and less resource-intensive than the LLM-based approach. We hope that the time and cost will be reduced with future advancements in LLM models. There is ongoing discussion regarding the usefulness of small language models versus large language models. We believe that an in-house, chemical reaction-specific small language model will significantly reduce the time and cost of data extraction in the future. Such small language models will be more accessible and easier to use with limited resources, and they can be more easily fine-tuned to meet specific needs [43].
Availability of data and materials
Data is provided in supplementary information 2.
References
Ren F, Aliper A, Chen J, Zhao H, Rao S, Kuppe C et al (2024) A small-molecule TNIK inhibitor targets fibrosis in preclinical and clinical models. Nat Biotechnol. https://doi.org/10.1038/s41587-024-02143-0
Blanco-Gonzalez A, Cabezon A, Seco-Gonzalez A, Conde-Torres D, Antelo-Riveiro P, Pineiro A et al (2023) The role of AI in drug discovery: challenges, opportunities, and strategies. Pharmaceuticals 16(6):891
Bender A, Cortes-Ciriano I (2021) Artificial intelligence in drug discovery: what is realistic, what are illusions? Part 2: a discussion of chemical and biological data. Drug Discov Today 26(4):1040–1052
Coley CW, Thomas DA III, Lummiss JAM, Jaworski JN, Breen CP, Schultz V et al (2019) A robotic platform for flow synthesis of organic compounds informed by AI planning. Science 365(6453):eaax1566
Schwaller P, Vaucher AC, Laino T, Reymond JL (2021) Prediction of chemical reaction yields using deep learning. Mach Learn Sci Technol 2(1):15016
Krishnan SR, Bung N, Srinivasan R, Roy A (2024) Target-specific novel molecules with their recipe: incorporating synthesizability in the design process. J Mol Graph Model 129:108734
Chen H, Engkvist O, Wang Y, Olivecrona M, Blaschke T (2018) The rise of deep learning in drug discovery. Drug Discov Today 23(6):1241–1250
Fooshee D, Mood A, Gutman E, Tavakoli M, Urban G, Liu F et al (2018) Deep learning for chemical reaction prediction. Mol Syst Des Eng 3(3):442–452
Lawson AJ, Swienty-Busch J, Géoui T, Evans D (2014) The making of reaxys—towards unobstructed access to relevant chemistry information. the future of the history of chemical information. American Chemical Society, pp 127–148
Gabrielson SW (2018) SciFinder. J Med Libr Assoc 106(4):588
Lowe, D. M. Extraction of Chemical Structures and Reactions from the Literature. Doctoral dissertation. Cambridge: University of Cambridge; 2012.
Guo J, Ibanez-Lopez AS, Gao H, Quach V, Coley CW, Jensen KF, Barzilay R (2022) Automated chemical reaction extraction from scientific literature. J Chem Inf Model 62(9):2035–2045. https://doi.org/10.1021/acs.jcim.1c00284
Hawizy L, Jessop DM, Adams N, Murray-Rust P (2011) ChemicalTagger: a tool for semantic text-mining in chemistry. J Cheminform 3:1–13
Kearnes SM, Maser MR, Wleklinski M, Kast A, Doyle AG, Dreher SD et al (2021) The open reaction database. J Am Chem Soc 143(45):18820–18826
Gimadiev TR, Lin A, Afonina VA, Batyrshin D, Nugmanov RI, Akhmetshin T et al (2021) Reaction data curation I: chemical structures and transformations standardization. Mol Inform 40(12):2100119
Mayfield J, Lowe D, Sayle R. Pistachio: search and faceting of large reaction databases. In: Abstracts of papers of the American Chemical Society, vol. 254; 2017.
He J, Nguyen DQ, Akhondi SA, Druckenbrodt C, Thorne C, Hoessel R et al (2021) Chemu 2020: natural language processing methods are effective for information extraction from chemical patents. Front Res Metr Anal 6:654438
Chithrananda S, Grand G, Ramsundar B. ChemBERTa: large-scale self-supervised pretraining for molecular property prediction. arXiv preprint arXiv:2010.09885; 2020.
Zhang J, Zhang Y. Melaxtech: a report for CLEF 2020–ChEMU task of chemical reaction extraction from patent. Work Notes CLEF. 2020.
Yasunaga M, Leskovec J, Liang P. Linkbert: pretraining language models with document links. arXiv preprint arXiv:2203.15827; 2022.
Malarkodi CS, Rao PR, Devi SL. CLRG ChemNER: a chemical named entity recognizer@ ChEMU CLEF 2020. In: CLEF (Working Notes); 2020.
Wilary DM, Cole JM (2023) ReactionDataExtractor 2.0: a deep learning approach for data extraction from chemical reaction schemes. J Chem Inf Model 63(19):6053–6067
Qian Y, Guo J, Tu Z, Coley CW, Barzilay R (2023) RxnScribe: a sequence generation model for reaction diagram parsing. J Chem Inf Model 63(13):4030–4041
Patiny L, Godin G (2023) Automatic extraction of FAIR data from publications using LLM. ChemRxiv. https://doi.org/10.26434/chemrxiv-2023-05v1b-v2
Ai Q, Meng F, Shi J, Pelkie B, Coley CW (2024) Extracting structured data from organic synthesis procedures using a fine-tuned large language model. ChemRxiv. https://doi.org/10.26434/chemrxiv-2024-979fz
Schilling-Wilhelmi M, Ríos-García M, Shabih S, Gil MV, Miret S, Koch CT, Jablonka KM. From text to insight: large language models for materials science data extraction. arXiv preprint arXiv:2407.16867; 2024
Schwaller P, Hoover B, Reymond J-L, Strobelt H, Laino T (2021) Extraction of organic chemistry grammar from unsupervised learning of chemical reactions. Sci Adv 7(15):eabe4166
Voinarovska V, Kabeshov M, Dudenko D, Genheden S, Tetko IV (2023) When yield prediction does not yield prediction: an overview of the current challenges. J Chem Inf Model 64(1):42–56
Dagdelen J, Dunn A, Lee S, Walker N, Rosen AS, Ceder G et al (2024) Structured information extraction from scientific text with large language models. Nat Commun 15(1):1418
Nori H, Lee YT, Zhang S, Carignan D, Edgar R, Fusi N et al. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. arXiv preprint arXiv:2311.16452; 2023.
Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P et al (2020) Language models are few-shot learners. Adv Neural Inf Process Syst 33:1877–1901
Team G, Anil R, Borgeaud S, Wu Y, Alayrac JB, Yu J et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805; 2023.
Claude 2. https://www.anthropic.com/news/claude-2. Accessed 15 Oct 2023.
Touvron H, Martin L, Stone K, Albert P, Almahairi A, Babaei Y et al. Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288; 2023.
Lee J, Yoon W, Kim S, Kim D, Kim S, So CH et al (2020) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4):1234–1240
Rish I. An empirical study of the Naive Bayes classifier. In: IJCAI 2001 workshop on empirical methods in artificial intelligence, vol. 3; 2001. P. 41–46.
Jessop DM, Sam EA, Peter M (2011) Mining chemical information from open patents. J Cheminform 3(1):40
Yu B, Baker FN, Chen Z, Ning X, Sun H. LlaSMol: advancing large language models for chemistry with a large-scale, comprehensive, high-quality instruction tuning dataset. arXiv preprint arXiv:2402.09391; 2024.
Rajan K, Zielesny A, Steinbeck C (2021) STOUT: SMILES to IUPAC names using neural machine translation. J Cheminform 13(1):34
Lowe DM, Corbett PT, Murray-Rust P, Glen RC (2011) Chemical name to structure: OPSIN, an open source solution. J Chem Inf Model 51(3):739–753
Pavlov D, Rybalkin M, Karulin B, Kozhevnikov M, Savelyev A, Churinov A (2011) Indigo: universal cheminformatics API. J Cheminform 3(Suppl 1):P4
Jaworski W, Szymkuć S, Mikulak-Klucznik B, Piecuch K, Klucznik T, Kaźmierowski M et al (2019) Automatic mapping of atoms across both simple and complex chemical reactions. Nat Commun 10(1):1434
Tiny but mighty: The Phi-3 small language models with big potential. https://news.microsoft.com/source/features/ai/the-phi-3-small-language-models-with-big-potential/. Accessed 20 Sept 2024.
Acknowledgements
The authors thank their colleagues G Bulusu, D Das, N Pandey, B Chakrabarty, R Padmasini, and A Jain for their valuable suggestions.
Funding
Not applicable.
Author information
Contributions
AR, SRV, NB and SRK conceived the research methodology and performed the research. AR, SRV, NB, SRK, DN, GR, SK, SS and RS evaluated the results; SRV, SRK, NB and AR wrote the manuscript.
Ethics declarations
Competing interests
All the authors are employed by Tata Consultancy Services Limited.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
13321_2024_928_MOESM1_ESM.docx
Supplementary Material 1. Details of the various large language models with the respective prompts used in this study, examples of reaction and non-reaction paragraphs, LLM errors, and an analysis of incorrect entries in the ORD dataset and their comparison with the LLM extraction are presented in supporting Figures S1–S7, Tables S1 and S2, and Sections S1 and S2.
13321_2024_928_MOESM2_ESM.xlsx
Supplementary Material 2. Dataset of the extracted chemical reactions for February 2014 USPTO patents, where each row consists of the reaction data in the format of <Patent ID, Reaction Paragraph, Reactants, Solvents, Workup, Catalysts, Reaction Conditions, Product(s), Reactant SMILES, Product SMILES, Reaction SMILES and number of missing atoms in the product> and list of failed IUPAC names which got OPSIN error when converting them to SMILES.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Vangala, S.R., Krishnan, S.R., Bung, N. et al. Suitability of large language models for extraction of high-quality chemical reaction dataset from patent literature. J Cheminform 16, 131 (2024). https://doi.org/10.1186/s13321-024-00928-8
DOI: https://doi.org/10.1186/s13321-024-00928-8