The EPSRC has awarded an £89.5m ‘discipline hopping’ grant for new research that uses natural language processing (NLP) and deep learning to navigate chemical space.
Dr. Jiayun Pang, a chemist at the University of Greenwich, will collaborate with Dr. Ivan Vulić, an expert in NLP and machine learning at the University of Cambridge, to study the latest developments in NLP and explore further applications of those techniques in chemistry.
NLP lies at the intersection of linguistics and computer science and aims to process and analyze human language, most often in the form of written text.
Current NLP research focuses on using machine learning and innovative algorithms to tackle challenging language tasks. These algorithms now power a wide range of real-world applications, including ChatGPT, virtual assistants, and automatic text completion.
This study specifically investigates how the Transformer model, a deep learning algorithm developed by Google in 2017, can be adapted to solve research questions in chemistry.
According to the researchers, although chemical structures are usually three-dimensional, they can be written as text strings using the simplified molecular-input line-entry system (SMILES).
The researchers said that SMILES allows them to analyze chemical structures in much the same way that NLP algorithms analyze text.
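To illustrate the parallel, the sketch below tokenizes a SMILES string with a regular expression of the kind widely used in chemical language models; the pattern and the example molecule are illustrative choices on our part, not details of the project itself.

```python
import re

# A regex-based SMILES tokenizer of the kind widely used in chemical
# language models: bracketed atoms, two-letter elements such as Cl and Br,
# ring-closure digits, and bond symbols each become one token.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into tokens, as NLP splits a sentence into words."""
    return SMILES_TOKEN.findall(smiles)

# Aspirin written as a SMILES string.
print(tokenize_smiles("CC(=O)OC1=CC=CC=C1C(=O)O"))
# -> ['C', 'C', '(', '=', 'O', ')', 'O', 'C', '1', '=', 'C', 'C', '=', ...]
```

Once tokenized this way, a molecule looks to a Transformer much like a short sentence.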
“We are committed to harnessing the power of state-of-the-art NLP algorithms for a wide range of tasks, including molecular similarity search, chemical reaction prediction, and chemical space exploration,” said Dr. Pang.
“We believe in the power of interdisciplinary research to find collaborative AI solutions to science and engineering challenges.”
The study also investigates a concept called transfer learning, currently prevalent in machine learning and NLP, in which a model developed for one task is reused as the starting point for another.
This approach allows researchers to reuse large general-purpose models and specialize them for specific applications, reducing the data annotation, expense, and expertise that developing models from scratch would require.
The Transformer model is first trained to learn a latent representation of chemical space defined by millions of SMILES strings. During fine-tuning, this learned latent representation is used to predict the molecular properties of a given chemical structure.
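A minimal sketch of this pre-train-then-fine-tune pattern, assuming PyTorch, is shown below; the toy encoder, the frozen weights, and the regression head are illustrative stand-ins, not the project's actual models.

```python
import torch
import torch.nn as nn

# Illustrative stand-in for an encoder pre-trained on millions of SMILES
# strings; in practice this would be a Transformer loaded from a checkpoint.
class PretrainedSmilesEncoder(nn.Module):
    def __init__(self, vocab_size=64, d_model=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids):
        # Mean-pool token states into one latent vector per molecule.
        return self.encoder(self.embed(token_ids)).mean(dim=1)

# Fine-tuning: freeze the pre-trained weights and train only a small
# property-prediction head on the scarce labeled molecules.
encoder = PretrainedSmilesEncoder()          # imagine pre-trained weights loaded here
for p in encoder.parameters():
    p.requires_grad = False

head = nn.Linear(128, 1)                     # predicts one molecular property
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

token_ids = torch.randint(0, 64, (8, 20))    # 8 tokenized molecules (dummy data)
labels = torch.randn(8, 1)                   # experimental property values (dummy data)

optimizer.zero_grad()
loss = loss_fn(head(encoder(token_ids)), labels)
loss.backward()
optimizer.step()
```

Because only the small head is trained, far fewer labeled molecules are needed than training the whole model from scratch would demand.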
The researchers said the advantage of this approach is that the resulting machine learning models rely less on labeled data (molecules with experimentally determined properties), which would be time-consuming or impossible to generate in chemistry given the associated costs and experimental challenges.
This research aims to increase the computational efficiency and accuracy of Transformer models using two modern machine learning techniques called sentence encoding and contrastive learning.
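Contrastive learning, in broad strokes, trains an encoder to pull embeddings of related inputs together and push unrelated ones apart. Below is a minimal sketch of one common formulation, an InfoNCE-style loss over paired molecule embeddings; the pairing strategy and the temperature value are our illustrative assumptions, not the project's method.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """InfoNCE-style contrastive loss over a batch of embedding pairs.

    z1[i] and z2[i] are two views of the same molecule (e.g. two valid
    SMILES spellings); every other molecule in the batch is a negative.
    """
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature       # cosine-similarity matrix
    targets = torch.arange(z1.size(0))     # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Dummy batch: 8 molecules, two encoder embeddings each.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(info_nce_loss(z1, z2).item())
```

Sentence encoding plays the complementary role: it compresses each SMILES “sentence” into the single vectors (z1, z2) that the contrastive loss compares.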
The study, which will begin in February 2024, aims to provide an alternative approach to evaluating molecular structures in the context of their properties, a task that underpins many research and development activities in the chemical and pharmaceutical industries.