INFORMATION ABOUT A PROJECT
SUPPORTED BY THE RUSSIAN SCIENCE FOUNDATION

This information was prepared from data in the RSF information-analytical system; the descriptive part is presented in the authors' own wording. All rights belong to the authors; the materials may be used or reprinted only with the authors' prior consent.

 

COMMON PART


Project Number: 19-73-10137

Project title: iSynthesis - artificial intelligence approach to chemical compound synthesis design

Project Lead: Madzhidov Timur

Affiliation: Kazan (Volga region) Federal University, Kazan University, KFU

Implementation period: 07.2019 - 06.2022

Research area: 03 - CHEMISTRY AND MATERIAL SCIENCES, 03-705 - Chemical informatics

Keywords: synthesis design, de novo design, synthesis planning, synthesis strategy prediction, reaction condition prediction, chemoinformatics, chemical reaction, artificial intelligence, machine learning


 

PROJECT CONTENT


Annotation
A key stage in the development of any substance, especially a new drug, is its synthesis. Computational experiments and the technologies of modern medicinal chemistry, together with the huge amount of information available in databases, open wide opportunities for rational drug design. Modern virtual screening methods and de novo molecular design tools, including those based on deep learning, can generate hypothetical molecular structures that potentially have the required spectrum of biological activity or other useful properties. At the same time, a large share of theoretically designed molecules never reach biological testing because of problems with their synthesis. Although the task of planning organic synthesis on a computer was set half a century ago, and many approaches have been proposed and implemented in a large number of programs, the set of tools in actual use is currently limited to several commercially available retrosynthetic analysis programs. Moreover, none of the existing tools can predict the conditions for the reactions in a synthetic plan; at best, data on typical conditions are provided. In addition, the underlying retrosynthetic analysis has two major drawbacks. First, the proposed synthesis strategy always needs further refinement. In particular, the need to introduce and remove protecting groups cannot be specified in a general retrosynthetic plan. Second, this approach considers the synthesis of the target compound only, whereas the synthesis of an analogue may be simpler. In medicinal chemistry, where many structurally similar compounds must be synthesized in the shortest possible time, this information can be extremely useful. The purpose of this project is to create the first methodology of computer-aided synthesis planning that proposes not only a synthesis plan but also the conditions for all its stages. 
Another feature of the proposed technology is an unusual strategy for constructing a synthesis scheme, based on the approach “from the reagents to the products”. This approach, rarely used previously because of its computational complexity, is free from the above-mentioned disadvantages of retrosynthetic analysis. Its main advantage is the ability to search for syntheses of compounds that are structurally similar to the target, by optimizing a function that depends on similarity to the target compound. The approach can also easily be turned into a de novo design tool for creating compounds with specified properties, if the objective function depends either on predicted property values or on similarity to active molecules. We plan to overcome the computational complexity of this approach by applying effective heuristics; the Monte Carlo tree search approach, which has demonstrated its applicability to problems of this type; partial enumeration and database storage of the products of the first transformation of all starting materials; and specially elaborated stochastic search algorithms. A key feature of this project is the development of software tools for predicting the conditions of all reactions involved in the planned synthesis. The problem of predicting optimal reaction conditions has not yet been solved, and there are only a few publications in this field. We will develop two methodologies to solve this problem. The first is based on the search for conditions that make reactions fast and selective. For this purpose, models for predicting the kinetic characteristics of some common reactions will be used. 
This methodology allows not only predicting reaction conditions but also assessing the product ratio, the yield and the reaction time. However, kinetic data are available only for a limited number of reaction types, so this approach is not universal. The second strategy is based on direct prediction using models built on large sets of data on reaction conditions. In this case, the possible conditions (catalyst, ranges of temperature and pressure) will be ranked according to their applicability to a given reaction. The models will be built using modern machine learning methods, including deep learning. Although this methodology is more universal, the models obtained with it are less informative and less interpretable. To build the models, the data on kinetic characteristics and reaction conditions available in the laboratory will be used, and kinetic characteristics for new types of reactions (addition, substitution in the aromatic ring) will be collected. Data will also be collected on reaction conditions for other types of reactions for which kinetic characteristics are unknown. The developed tools and data sets will be made publicly available, and web services will be developed to access them. The developed technology will be applied in a project to develop a strategy for the synthesis of new types of antidepressants. The Scientific and Educational Center of Pharmaceutics of KFU has repeatedly attempted to synthesize analogues of the antidepressants doxepin and dosulepin in which one of the benzene rings is replaced with a pyridoxine fragment. However, the attempts to synthesize them with classical approaches failed. The developed iSynthesis technology will be tested on this example: it will be used to propose alternative synthesis strategies, which will then be tested and used in the medicinal chemistry group of the Scientific and Educational Center of Pharmaceutics of KFU. 
The development of tools for synthesis planning, including prediction of reaction conditions, will significantly advance the development of fully automated synthesis systems.

Expected results
The goal of the project is the development of a novel methodology for computer-aided synthesis planning. It addresses two key problems that synthetic and medicinal chemists usually face: (1) identifying the sequence of chemical reactions leading to the desired compound or its analogs (synthesis strategy), and (2) determining the conditions for carrying out each reaction. When the development of a compound becomes too costly, the synthesis must be abandoned at an early stage in favor of a more available and cheaper analogue. The developed methodology will therefore also produce synthesis plans for analogues of the target compound whose synthesis would be simpler or cheaper. This would also allow the developed software to be used to construct synthetically accessible compounds with desired properties (that is, for de novo design). The project is aimed at developing a particular branch of chemoinformatics: the informatics of chemical reactions. In recent years, interest in it has grown extremely rapidly worldwide, owing to the accumulation of large data sets and the development of methods for their in-depth analysis based on the principles of artificial intelligence. Previously, we began systematic studies of methods and technologies for predicting the characteristics of reactions and for processing information about them using a universal methodology based on the concept of Condensed Reaction Graphs. On this basis, the project will develop a methodology for automated planning of the synthesis of organic compounds using the strategy "from reagents to products" and a methodology for predicting optimal reaction conditions. 
For this purpose, the methodology for predicting the characteristics of chemical reactions will be developed further. New methods for estimating the quality of statistical models that predict the characteristics of chemical reactions will be developed, since the standard approaches used in chemoinformatics are oriented to modeling the properties of molecules and are therefore not suitable for modeling chemical reactions. Two new methodologies for predicting reaction conditions, based on direct and indirect prediction, will be developed. The following results are planned:

1. New methodologies
- a methodology for the automatic construction of a plan for the synthesis of organic compounds using the strategy “from reagents to products” (direct synthesis),
- methodologies for direct and indirect prediction of reaction conditions,
- a methodology for assessing the applicability domain of models that predict the characteristics (rates, conditions) of chemical reactions,
- a methodology for unbiased estimation of the predictive ability of models that predict the characteristics of chemical reactions, using group-wise cross-validation,
- a methodology for extracting rules for the transformation of chemical compounds from a database of reactions,
- a methodology for applying transformation rules to one or more chemical compounds and predicting the possible major products,
- a methodology for assessing the feasibility of a given chemical reaction.

2. Software
- software for synthesis planning and prediction of reaction conditions,
- software for creating and assessing the quality of models designed to predict the kinetic characteristics of reactions,
- software for creating and assessing the quality of models for direct and indirect prediction of optimal reaction conditions,
- software for storing chemical compounds in the form of a database.

3. Modeling
- models for indirect prediction of optimal conditions on the basis of kinetic characteristics for the following reactions: nucleophilic substitution (including Williamson, Menshutkin, substitution in the aromatic ring), elimination, cycloaddition, hydrolysis, electrophilic addition,
- models for direct prediction of the conditions of the Michael reaction and of the hydrogenation of various functional groups using a multilayer perceptron, deep neural networks, and autoencoders,
- a model for predicting the feasibility of a given chemical reaction.

4. Data
- data on the kinetic characteristics of hydrolysis (5,000 data points), substitution (2,000 data points), addition (2,000 data points) and, if necessary, other reactions will be collected,
- data on the conditions of reduction reactions of basic functional groups (nitro, double and triple bonds, carbonyl, carboxyl, hydrogenolysis of single bonds) will be collected and prepared,
- a database of synthetic rules will be created by analyzing data in the literature and in databases; about 70 synthetic rules are planned,
- a reagent database will be collected from suppliers' catalogs; about 150,000 reagents (synthetic building blocks) are planned.

5. Validation
The developed synthesis planning strategy will be tested in the project of the Scientific and Educational Center of Pharmaceutics of KFU on the development of new tricyclic antidepressants, which was suspended because of the difficulty of synthesizing the compounds of interest using classical approaches. The model for predicting the conditions of hydrogenation reactions will be tested for its applicability to reactions carried out in a flow hydrogenation reactor.

The members of the research group participate in educational activities within the master's program in chemoinformatics and molecular modeling, established at Kazan Federal University in 2012. Students of the master's program will be actively involved in the research conducted within the project. The study has a pronounced applied nature. All developed tools will be provided to chemists as ready-to-use standalone and web-based software. iSynthesis will become the first synthesis planning tool accessible for non-commercial use. Unlike previous approaches, the developed software will accompany the construction of a synthesis strategy with a complete description of all stages of the synthesis, including the conditions for carrying out all the reactions, which will allow more reliable and easily realizable synthesis routes to be selected. We believe that the results of this project will be interesting and important for the entire chemical community. Almost any project to create drugs and materials inevitably involves a synthetic stage, which is an extremely long process if the traditional trial-and-error method is used. We believe that the created tools will save human and material resources in developing methods for the synthesis of chemical compounds.


 

REPORTS


Annotation of the results obtained in 2021
The key step in the development of any substance, especially a new drug, is its synthesis. The goal of this project is to create a computer-based synthesis planning technology (iSynthesis) that proposes not only a synthesis plan but also the reaction conditions at all its stages. There are three aspects to this project: the development of new techniques, their software implementation, and the use of the software to create models. The third stage of the project had three key tasks: (i) completing the development of the iSynthesis system, (ii) developing a model for predicting reaction conditions, (iii) developing a way to evaluate the feasibility of a particular reaction. After the system itself was optimized and the approaches implementing its individual aspects were developed and refined, two versions of the iSynthesis system were created: a downloadable public version at https://cimm.site/projects/isynthesis.html (packaged as a Docker container) that uses information from the publicly available USPTO database, and a proprietary version, available for use in the laboratory, based on data from the Reaxys database (it cannot be published because of the data owner's requirements). The final version of the tool was tested on 15 drug compounds with known synthesis pathways (extracted from USPTO); for 5 of them the system suggested synthesis pathways shorter than those presented in the patent literature. When the tool did not find the literature pathway within a given number of steps, it suggested syntheses of analogues. In most cases (>85%), the average number of steps in the synthesis pathways of the target molecule and/or its analogues obtained with the tool was smaller than in the literature pathways. On average, the calculation for a single molecule required 10 hours. 
Two new methodologies for direct condition prediction were proposed and explored: an approach based on the transformer neural network architecture and an approach based on rapid assessment of the similarity of reactions. Both showed higher predictive performance than the k-nearest-neighbours approach, which was the best method for predicting conditions in the second stage of the project. The approach based on rapid assessment of reaction similarity was integrated into the iSynthesis synthesis planning system because of its much higher prediction speed (0.0043-0.008 sec per reaction). A model based on this approach and trained on USPTO was published on the laboratory page at rcconditions.cimm.site. We attempted to optimize the reagent-to-product synthesis strategy search by replacing the Monte Carlo tree search method with one based on a genetic optimization algorithm. Despite a significant reduction in memory consumption and a simpler tuning procedure, the new method was not suitable for integration into iSynthesis because of its slow convergence. Training and testing most of the modules that make up the iSynthesis reagent-to-product synthesis planning system, and the methodologies and approaches that underlie them, required carefully curated samples containing minimal errors. To this end, a protocol for the curation of chemical reaction information, including standardization of individual molecules, was developed and implemented. Applying it to the original Reaxys and USPTO datasets yielded carefully curated datasets, which were subsequently used to extract reaction transformation rules and to model reaction conditions and yields. A comparative analysis of more than four popular atom-to-atom mapping (AAM) tools was performed, and RXNMapper was selected for use in the project as the tool with the best correctness-to-cost ratio. 
The AAM errors remaining after that were corrected using the AAM Fixer tool, which is based on special rules implemented using the meta-CGR approach. Developing the approaches and testing their integration into the condition prediction and reaction feasibility modules required data sets that include as many reaction conditions as possible. For this purpose, a protocol for identifying conditions from the data in the USPTO database was implemented and subsequently tested on a final sample of 1,237,813 reactions. To develop a module for filtering out unrealistic reactions, an approach based on reaction yield prediction was tested. The yield prediction model was trained on a data set from the Reaxys database; the balanced accuracy was around 0.76 for classification, while the accuracy of regression modelling was extremely low. A multi-instance learning approach to enantioselectivity prediction was proposed. Separate work was carried out to improve the synthetic transformation rules, on which the correctness of the synthetic path depends to a large extent. To improve the quality of the synthetic transformation database, it was first expanded with supplementary sets of synthetic transformations: "pseudo-transformation" rules and manually assembled methods for the synthesis of N-heterocycles with one or more rings. Second, the previously included synthetic transformations obtained by automatic extraction from the USPTO database were replaced by transformations additionally purified and verified using a special reaction curation protocol and additional filters for the correctness of synthetic transformations. The iSynthesis tool was validated at the Scientific and Educational Center of Pharmaceutics, KFU, where it was used to find routes for synthesizing new antimicrobial drugs. For 3 of the 4 compounds of interest the system found a synthesis route; for the remaining one, only a route to the closest analog was found. 
One synthesis was reproduced completely with good yields; another was reproduced almost completely, with one step that could not be carried out. It was decided not to reproduce one synthesis because the necessary reagents were unavailable. In only one case were there doubts about whether the compound could be obtained by the proposed procedure, because of an unsuccessful choice of the protecting group system. As a result of the current stage of the project, 4 articles were published, including one Q1-level article and one review. The project participants gave 10 presentations at 6 scientific events.

 

Publications

1. V.A. Afonina, D.A. Mazitov, A. Nurmukhametova, M.D. Shevelev, D.A. Khasanova, R.I. Nugmanov, V.A. Burilov, T.I. Madzhidov, A. Varnek Prediction of Optimal Conditions of Hydrogenation Reaction Using the Likelihood Ranking Approach International Journal of Molecular Sciences, V. 23, Is. 1, P. 248 (year - 2022) https://doi.org/10.3390/ijms23010248

2. Zankov D.V., Matveieva M., Nikonenko A.V., Nugmanov R.I., Baskin I.I., Varnek A. QSAR Modeling Based on Conformation Ensembles Using a Multi-Instance Learning Approach Journal of Chemical Information and Modeling, V. 61, Is. 10, P. 4913-4923 (year - 2021) https://doi.org/10.1021/acs.jcim.1c00692

3. Zankova D., Polishchuk P., Madzhidov T., Varnek A. Multi-Instance Learning Approach to Predictive Modeling of Catalysts Enantioselectivity SynLett, V. 32, P. 1833-1836 (year - 2021) https://doi.org/10.1055/a-1553-0427

4. Madzhidov T.I., Rakhimbekova A., Afonina V.A., Gimadiev T.R., Mukhametgaleev R.N., Nugmanov R.I., Baskin I.I., Varnek A. Machine learning modelling of chemical reaction characteristics: yesterday, today, tomorrow Mendeleev Communications, V. 31, P. 769-780 (year - 2021) https://doi.org/10.1016/j.mencom.2021.11.003

5. - Artificial chemist: how artificial intelligence helps to discover drugs and synthesize molecules На острие науки, lecture of the thematic month "Artificial Intelligence" (year - )


Annotation of the results obtained in 2019
In the development of any substance, especially a new drug, the key stage is its synthesis. The purpose of this project is to create a methodology for computer-aided synthesis planning (iSynthesis) that proposes not only a synthesis plan but also the conditions for carrying out the reactions at all its stages. The first year of the project addressed three main objectives: development of a prototype synthesis planning system, development of tools for predicting optimal reaction conditions, and improvement of the Quantitative Structure-Reactivity modelling methodology for subsequent use in the indirect prediction of reaction conditions. The latter implies the selection of conditions that optimize the kinetics or regioselectivity of reactions. This project has three aspects: the development of new methodologies, their software implementation, and the use of the latter to create models. The QSRR DB database of kinetic and thermodynamic properties of chemical reactions, which was collected earlier and updated within the framework of the project, is used for modelling. The main goal of this project is to develop a synthesis planning tool. A feature of the tool is the use of the “reagent-to-product” search strategy for the synthesis of the compound of interest, which has some advantages over the standard retrosynthetic approach. In particular, it can return not only the synthesis route of the compound of interest but also syntheses of similar compounds. The search is to be accelerated through the use of Monte Carlo tree search, as well as special heuristics and artificial intelligence methods. Within the framework of this project, the basic design of the software for chemical synthesis planning based on the Monte Carlo tree search algorithm was developed, and work on the software implementation of the tool has begun. 
The interfaces (API) through which the modules interact have been specified in the form of Python documentation, which allows fast parallel development of the application. Work on the components of the product has started: a module that describes the architecture of the search tree and the available methods of working with it, a module for Monte Carlo tree search, and auxiliary modules for working with data. The next important component of the synthesis planning tool is an instrument for predicting optimal reaction conditions. Within the project, two main technologies for predicting reaction conditions are to be tried: direct and indirect prediction. In the first, conditions are predicted directly, without using surrogate models that predict the characteristics of reactions under certain conditions. However, direct condition prediction faces several difficulties. First, the classical QSPR approach cannot be used because the same reaction can be carried out under different conditions. Second, modelling is complicated by the absence of negative examples, that is, there are no data on the conditions under which the reaction does not proceed. Third, it is impossible to assert that the conditions predicted by the model are unsuitable for a given reaction, since for most reactions there is no exhaustive study of the possible conditions. As part of the first stage of the project, 3 technologies for direct prediction of reaction conditions have been developed: (1) a classification neural network with ranking by a “likelihood function”, (2) recurrent prediction of individual condition characteristics (temperature, pressure, catalysts and additives) using a deep neural network, and (3) ranking of combinations of conditions using the “nearest neighbour” method. 
The first two approaches build on recent advances in deep learning; the third was created for comparison and uses classical machine learning methods. The proposed approaches have been implemented as a software tool using the TensorFlow and Scikit-learn libraries. A special model validation approach has also been developed, based on the ranking quality metrics used in information retrieval. The proposed approaches have been compared using a specially prepared data set on the reduction of many functional groups extracted from the Reaxys database. The data set contained more than 30,000 reactions, part of which was used as a training set and about 3,000 as a test set. Modelling on this set showed that the most accurate results were achieved with the "likelihood function" ranking approach. The worst approach was the recurrent prediction of conditions. The "nearest neighbour" method showed intermediate results, although it is the easiest to implement. Models that indirectly predict the optimal reaction conditions require additional models that predict the characteristics of reactions occurring under given conditions. A search over possible conditions then selects those that are optimal for a given reaction. During this stage of the project, this approach was tested using data sets on the kinetic characteristics of reactions. The predicted optimal conditions were shown to be consistent with current understanding of the reaction mechanisms, and the approach can be used when data are available to build surrogate models. An additional advantage of indirect condition prediction is that it predicts the reaction rate constant, one of the most important characteristics, which makes it possible to estimate a reaction’s yield and selectivity. 
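The indirect scheme described above (a surrogate model scores each candidate set of conditions, and a search keeps the best-scoring ones) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the project's code: `toy_log_rate`, the solvent polarity values and the temperature grid are hypothetical stand-ins for a trained structure-reactivity model and a real condition space.

```python
from itertools import product

# Hypothetical solvent descriptors (illustrative numbers, not measured data).
SOLVENT_POLARITY = {"hexane": 0.0, "acetone": 0.5, "water": 1.0}


def toy_log_rate(solvent: str, temperature_k: float) -> float:
    """Stand-in for a trained surrogate model that predicts log k.

    Assumption: the rate grows with solvent polarity and temperature.
    A real model would be fitted to kinetic data.
    """
    return 2.0 * SOLVENT_POLARITY[solvent] + 0.01 * (temperature_k - 298.0)


def rank_conditions(model, solvents, temperatures):
    """Score every (solvent, temperature) pair and sort best-first."""
    scored = [((s, t), model(s, t)) for s, t in product(solvents, temperatures)]
    return sorted(scored, key=lambda item: item[1], reverse=True)


ranking = rank_conditions(toy_log_rate, SOLVENT_POLARITY, [298.0, 323.0, 348.0])
best_conditions, best_log_k = ranking[0]
```

Exhaustive enumeration is used here only because the toy grid is tiny; a larger condition space would presumably need a smarter search, and the objective could equally be a predicted selectivity instead of a rate.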
At the same time, proficient indirect condition prediction depends critically on the quality of the surrogate models that predict the reactions' kinetic characteristics. For this reason, work has been done to improve the methodology of reaction characteristics prediction. A large number of applicability domain estimation approaches have been benchmarked and the best ones identified. Apart from the widely used approaches, we also studied applicability domain estimation approaches that we developed specifically for modelling reaction characteristics. It has been shown that the classical validation method used in QSAR modelling (k-fold cross-validation) overestimates models’ performance in the case of Quantitative Structure-Reactivity models. Therefore, within the framework of the project, two new validation strategies ("solvent-out" and "transformation-out") have been developed. The "solvent-out" validation strategy evaluates the model’s ability to predict the properties of reactions occurring in new solvents. The "transformation-out" validation strategy assesses the model’s ability to predict the rate of a reaction involving new reagents and products. The models for predicting the kinetic characteristics of reactions have been updated using the proposed methodological innovations. Models are available at https://models2019.cimm.site/. Finally, along with the development of a methodology for modelling the characteristics of chemical reactions, we have proposed the concept of conjugated QSPR models, which makes it possible to take fundamental chemical equations into account when building "structure-property" models and to incorporate them into machine learning methods. For linear conjugated models, an analytical expression extending the popular ridge regression method has been developed. For nonlinear conjugated models, a special neural network architecture has been proposed. 
The developed approach has been tested by predicting tautomeric equilibrium constants, which are related to the acidity of the tautomeric forms. It is shown that both characteristics can be predicted together without loss of prediction quality, while the quality of acidity prediction for the minor tautomeric forms improves.
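The idea behind the "solvent-out" and "transformation-out" strategies mentioned in this report is that all records sharing a solvent (or a transformation) fall on the same side of a train/test split, so the test fold always contains unseen groups. A minimal sketch of such group-wise splitting is given below; the function name and the toy solvent labels are hypothetical, and the project's actual fold construction may differ (for example, several groups per fold).

```python
from collections import defaultdict


def group_out_splits(groups):
    """Yield (train_idx, test_idx) pairs, holding out one group at a time.

    `groups[i]` is the group label of record i, e.g. its solvent for
    "solvent-out" or its transformation for "transformation-out".
    """
    by_group = defaultdict(list)
    for index, label in enumerate(groups):
        by_group[label].append(index)
    for label, test_idx in by_group.items():
        train_idx = [i for i in range(len(groups)) if groups[i] != label]
        yield train_idx, test_idx


# Hypothetical solvent labels for six kinetic records.
solvents = ["water", "water", "ethanol", "DMSO", "ethanol", "DMSO"]
splits = list(group_out_splits(solvents))
```

In ordinary k-fold cross-validation the same solvent would usually appear in both the training and the test fold, which is one plausible reason for the overestimated performance noted above. scikit-learn offers the same kind of splitting through `LeaveOneGroupOut` and `GroupKFold`.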

 

Publications

1. Zankov D.V., Madzhidov T.I., Rakhimbekova A., Gimadiev T.R., Nugmanov R.I., Kazymova M.A., Baskin I.I., Varnek A. Conjugated Quantitative Structure-Property Relationship Models: Application to Simultaneous Prediction of Tautomeric Equilibrium Constants and Acidity of Molecules Journal of Chemical Information and Modeling, 59, 11, 4569-4576 (year - 2019) https://doi.org/10.1021/acs.jcim.9b00722

2. - Kazan University chemists teach neural networks to predict properties of compounds EurekAlert, Publication date: 21.01.2020 (year - )

3. - KFU researchers taught a neural network to use the laws of chemistry Медиапортал КФУ, Publication date: 20.01.2020 (year - )

4. - Researchers taught a neural network to take chemical equations into account to create new drugs and materials Indicator.ru, Publication date: 19.01.2020 (year - )

5. - A neural network learned to use chemical equations to create new drugs Газета.Ru, Publication date: 15.01.2020 (year - )


Annotation of the results obtained in 2020
In the development of any substance, especially a new drug, a key stage is the synthesis of the compound. The aim of this project is to create a computer-aided synthesis planning methodology (iSynthesis) that proposes not only a synthesis plan but also reaction conditions for all its stages. Three aspects are included in this project: the development of new methodologies, their software implementation and the use of the latter to create models. The second phase of the project addressed three key tasks: (i) designing a working prototype of a synthesis planning system "from reagents to products", (ii) developing technologies for predicting reaction conditions, (iii) collecting a set of starting molecules (building blocks) for synthesis and a set of reaction transformation rules. The crucial task was to refine the synthesis strategy prediction tool. The strategy is refined by searching for reagent combinations that lead to structures of maximum similarity to the target (in the limit, the target structure itself). At this stage all software components required for the tool's operation were developed: a database of molecular building blocks, a virtual reactor, and a search algorithm based on Monte Carlo tree search. We collected molecular building blocks from databases of commercially available compounds; 501K molecules remained after cleaning. To collect reactive molecular transformation rules (analogues of retrosynthesis rules), a special approach was implemented that includes in each rule the chemical structure elements that determine reactivity and selectivity. Using it, 170,000 rules were extracted from the USPTO database. It was found that rules are often incorrect if they correspond to a very small number of reactions. For this reason, we retained about 2,300 reaction patterns, each corresponding to 50 or more reactions. 
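The support-based filtering of rules described above (keep only rules backed by 50 or more reactions) amounts to counting how many reactions each extracted rule covers. A minimal sketch follows; the rule identifiers and the toy data are hypothetical, and in practice each rule would be a structural pattern extracted from a reaction rather than a plain string.

```python
from collections import Counter


def filter_rules_by_support(rule_per_reaction, min_support=50):
    """Keep rules observed in at least `min_support` reactions.

    `rule_per_reaction` holds one rule identifier per reaction record.
    Returns a dict mapping each retained rule to its reaction count.
    """
    support = Counter(rule_per_reaction)
    return {rule: n for rule, n in support.items() if n >= min_support}


# Hypothetical rule identifiers, one per reaction in a toy dataset.
toy_rules = ["amide_coupling"] * 60 + ["rare_rearrangement"] * 3 + ["suzuki"] * 50
kept = filter_rules_by_support(toy_rules)
```

The threshold trades coverage for reliability: a higher `min_support` removes more spurious rules but also discards rare, genuinely useful transformations.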
The success and speed of the synthesis planning algorithm depend on two components: an approach for assessing how promising a search tree node is (in other words, a numerical estimate of the possibility of synthesizing the target molecule from a given molecule), and an approach for rapidly selecting the reagents to add to the current molecule in order to obtain a product more similar to the target molecule. For each component, two groups of methods were tested: (i) heuristics driven by molecular similarity, and (ii) a neural network with a special architecture. Approaches based on chemical reaction network analysis were implemented to collect training data for the neural networks; as a result, each network was trained on several million data points. A heuristic based on the Tversky similarity index proved best suited for evaluating how promising a search tree node is. For rapid reagent selection, the most suitable method is a neural network that ranks building blocks according to their applicability to a given reaction and a given target molecule. In addition, programming interfaces (API functions) were implemented in the synthesis planning tool. These functions allow models for predicting synthesis conditions and for assessing the feasibility of a generated reaction to be plugged into the tool. The tool was tested on the task of finding synthesis strategies for 10 drugs with known synthesis pathways. For 5 drugs a synthesis pathway was found; in the remaining cases synthesis pathways to similar molecules were proposed, of which 7 to 76% (depending on the molecule) were very similar to the target (Tanimoto index > 0.8). The second important component of the developed technology is the prediction of reaction conditions. Two approaches were proposed: direct prediction of conditions and indirect prediction based on QSAR models.
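As an illustration of the similarity heuristic mentioned above, the following sketch computes the Tversky index on sets of structural features. The feature sets and the weights `alpha` and `beta` are assumptions chosen for the example; the text does not state which fingerprints or weights were used in the project. With `alpha = beta = 1` the index reduces to the Tanimoto index that is used above as the similarity threshold.

```python
def tversky(a, b, alpha=1.0, beta=1.0):
    """Tversky index: |A & B| / (|A & B| + alpha * |A - B| + beta * |B - A|)."""
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    return common / denom if denom else 0.0


# Hypothetical feature sets standing in for molecular fingerprints.
target = {"amide", "aryl", "piperidine", "fluoro"}
node = {"amide", "aryl", "nitro"}

# Symmetric case (alpha = beta = 1) is the Tanimoto index.
tanimoto_score = tversky(node, target)

# Asymmetric weights (assumed values) penalise features that are still
# missing from the intermediate less than features absent from the target.
node_score = tversky(node, target, alpha=1.0, beta=0.2)

print(tanimoto_score, node_score)
```

The asymmetric form is one plausible reason to prefer Tversky over Tanimoto for scoring intermediates, since an intermediate is expected to lack part of the target; this motivation is an interpretation, not a statement from the report.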
For direct prediction of conditions, conditional variational autoencoders were used, which can sample a list of possible conditions for each reaction. This approach gave very good results on a data set of closely related reactions (catalytic hydrogenation). However, on a large set of diverse reactions (9.5 million Reaxys reactions), the autoencoder significantly underperformed the nearest-neighbour ranking method that we proposed in the previous phase of the project. For this reason, the nearest-neighbour ranking model was implemented in the condition prediction software; because of the large amount of data involved, it required additional optimisation. The indirect approach relies on trained models that predict reaction rates as a function of conditions. Candidate reaction conditions are screened, and those that maximize the predicted rate or selectivity for a given reaction are selected. To implement this approach, 6 models for predicting reaction rate constants (SN1, SN2, SNAr, E2, Diels-Alder, hydrolysis) and a model for predicting tautomeric equilibrium constants were built using all the data accumulated in the project and the refined modelling techniques listed below. Two of the data sets (SN1, SNAr) were collected at this stage of the project. Furthermore, a number of methodological innovations were proposed for building structure-property (QSPR) models. One of them is a new modelling workflow that uses the Condensed Graph of Reaction directly as input to graph convolutional neural networks. This approach outperformed other modelling methods for the majority of the reactions used in testing. A more rigorous and unbiased procedure for assessing the predictive ability of models was also proposed, which evaluates the quality of predictions for reactions with new reagents and products or for reactions proceeding in new solvents.
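The general idea of ranking conditions by nearest neighbours can be sketched as follows. This is a generic similarity-weighted vote written for illustration only: the details of the method developed in the project are not given in this text, and the fingerprints, condition labels and the value of `k` below are hypothetical.

```python
from collections import defaultdict


def tanimoto(a, b):
    """Tanimoto similarity between two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def rank_conditions(query_fp, known_reactions, k=3):
    """Rank conditions of the k most similar known reactions by summed similarity."""
    neighbours = sorted(
        known_reactions, key=lambda r: tanimoto(query_fp, r[0]), reverse=True
    )[:k]
    scores = defaultdict(float)
    for fingerprint, conditions in neighbours:
        scores[conditions] += tanimoto(query_fp, fingerprint)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical reference data: (reaction fingerprint, recorded conditions).
known = [
    ({1, 2, 3, 4}, "Pd/C, H2, MeOH"),
    ({1, 2, 3, 5}, "Pd/C, H2, MeOH"),
    ({1, 2, 6, 7}, "Raney Ni, H2, EtOH"),
    ({8, 9}, "PtO2, H2, AcOH"),
]

ranked = rank_conditions({1, 2, 3}, known, k=3)
print(ranked)
```

A linear scan like this one would not scale to millions of reactions, which is consistent with the remark above that the implemented model needed additional optimisation.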
Finally, a conjugated learning methodology was developed that allows known chemical regularities to be incorporated into structure-property models. Models were proposed that predict reaction rates by embedding the Arrhenius equation in the relationship between reaction structure and rate. For this purpose, an equation for the optimal coefficients of a linear model (ridge regression) was derived and a special neural network architecture for conjugated modelling was developed. This approach was shown to offer a number of advantages over classical modelling. In addition, an approach to modelling the product ratio of competing reactions was developed using its dependence on the rate constants. Applying the resulting structure-property models to the prediction of optimal conditions showed that the proposed optimal solvents and temperatures are essentially the same for different reactions. This suggests that model-mediated prediction of reaction conditions in the iSynthesis tool can be replaced by simple rules. The models built for indirect prediction of conditions are available on the laboratory server (http://models2021.cimm.site).
As part of this phase of the project, an approach for predicting new types of chemical reactions was also proposed. It is based on representing a reaction as a Condensed Graph of Reaction, which is used to train an autoencoder architecture. Sampling from the trained model revealed new reactions similar to the Suzuki reaction. An approach was also developed to estimate whether a specified chemical reaction can occur. For this purpose, a data set of known reactions was assembled and a set of reactions that do not occur under the given conditions was generated using a dedicated procedure. A neural network trained on these reactions reached a balanced accuracy of 84%.
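The way the Arrhenius equation can link structure-derived parameters to a rate constant, and rate constants to a product ratio, can be sketched as follows. The sketch assumes that a structure-based model supplies log10 A and the activation energy Ea for each reaction (the numbers below are hypothetical), and that the competing reactions start from the same reactants, have the same kinetic order and are under kinetic control, so that the product ratio equals the ratio of rate constants. It is not the project's model architecture.

```python
import math

R = 8.314462618  # gas constant, J/(mol*K)


def log10_rate_constant(log10_a, ea, temperature):
    """Arrhenius equation in log form: log10 k = log10 A - Ea / (ln(10) * R * T)."""
    return log10_a - ea / (math.log(10) * R * temperature)


def product_ratio(log10_k1, log10_k2):
    """Ratio of products of two competing reactions, taken as k1 / k2."""
    return 10 ** (log10_k1 - log10_k2)


# Hypothetical parameters, as if predicted from structure for two competing pathways.
T = 298.15  # K
logk_main = log10_rate_constant(log10_a=11.0, ea=80_000.0, temperature=T)
logk_side = log10_rate_constant(log10_a=11.0, ea=90_000.0, temperature=T)

print(logk_main, logk_side, product_ratio(logk_main, logk_side))
```

Because temperature enters only through the fixed Arrhenius term, a model of this form learns temperature-independent parameters from structure, which is the kind of built-in chemical regularity the paragraph above refers to.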
Four papers were published within the project, three of them in Q1 journals.

 

Publications

1. Bort W., Baskin I., Gimadiev T., Mukanov A., Nugmanov R., Sidorov P., Marcou G., Horvath D., Klimchuk O., Madzhidov T., Varnek A. Discovery of novel chemical reactions by deep generative recurrent neural network Scientific Reports, V. 11, №3178 (year - 2021) https://doi.org/10.1038/s41598-021-81889-y

2. Gimadiev T., Nugmanov R., Batyrshin D., Madzhidov T., Maeda S., Sidorov P., Varnek A. Combined Graph/Relational Database Management System for Calculated Chemical Reaction Pathway Data Journal of Chemical Information and Modeling, V. 61, Is. 2, P. 554-559 (year - 2021) https://doi.org/10.1021/acs.jcim.0c01280

3. Rakhimbekova A., Akhmetshin T.N., Minibaeva G.I., Nugmanov R.I., Gimadiev T.R., Madzhidov T.I., Baskin I.I., Varnek A. Cross-validation strategies in QSPR modelling of chemical reactions SAR and QSAR in Environmental Research, V. 32, Is. 3, P. 207-219 (year - 2021) https://doi.org/10.1080/1062936X.2021.1883107

4. Rakhimbekova A., Madzhidov T.I., Nugmanov R.I., Gimadiev T.R., Baskin I.I., Varnek A. Comprehensive Analysis of Applicability Domains of QSPR Models for Chemical Reactions International Journal of Molecular Sciences, V. 21, Is. 15, P. 5542 (year - 2020) https://doi.org/10.3390/ijms21155542

5. - Artificial intelligence will help in creating drugs Indicator, release date: 26.03.21 (year - )

6. - Artificial intelligence has been taught to predict new chemical reactions Nauchnaya Rossiya ("Scientific Russia") portal, release date: 25.02.21 (year - )

7. - Artificial intelligence has learned to predict new chemical reactions TASS, release date: 25.02.21 (year - )

8. - Artificial intelligence has been taught to predict new chemical reactions Gazeta.Ru, release date: 25.02.2021 (year - )

9. - Kazan chemists have found 40 new types of reactions using artificial intelligence Tatar-inform, release date: 20.02.2021 (year - )