Successful use of data in drug discovery

At our upcoming virtual Discovery Science meeting, we’ll hear from speakers on the theme: High-performance data meets high-performance computing. We’re increasingly seeing from the literature and our user community that combining quality data with computing power is changing drug discovery approaches. Here I want to share some examples of this theme.

High-performance data

Artificial Intelligence (AI) and Machine Learning (ML) approaches are widely used in drug discovery, from big pharma to agile start-ups. Increasingly, the field recognizes that these approaches are only as good as the data they are given. You are what you eat.

So what is “high-performance” data? What features should data have for successful implementation in drug discovery?

Andreas Bender (University of Cambridge / AstraZeneca) argues that, in addition to the quality of data, its quantity and relevance are key. He observes that AI has advanced further in image and speech recognition than in drug discovery because far larger volumes of data are available in those fields. He also states that the types and formats of data available must be relevant, so that scientists can effectively connect structural features to the effects observed in their investigations. See Andreas' paper: “Artificial intelligence in drug discovery: what is realistic, what are illusions? Part 2: a discussion of chemical and biological data”.


When the right combination of data quantity, quality and relevance is achieved, advances in drug discovery follow. To share a few:

  • Exscientia - using data-driven approaches to drive chemical transformations. (Watch presentation)
  • Bristol Myers Squibb - virtual screening and subsequent validation of conformations. (Read case study)
  • University of Groningen - identification of alternate scaffolds by data mining. (Read case study)


High-performance computing

But data is not the only requirement: the computing power to process it effectively is vital.

Over the past year, rapid, massive-scale studies to identify drugs for use against COVID-19 have put this combination of data and computing to the test, none more so than the COVID Moonshot.

This massive project saw crowd-submitted molecules assessed on a high-performance computing resource to select candidates for synthesis and testing. Robert Glen (Imperial College London) led the charge on docking, using GOLD and other programs to prioritize hits for further analysis.

The project is ongoing and still welcomes computing power donations to support the effort.


I hope you can join us at the Discovery Science meeting to hear Andreas Bender and Robert Glen present as part of our fantastic agenda of speakers exploring these concepts further. Register here.

Alternatively, you can explore more in our industry report, which examines whether it’s time for a data revolution and why scaling data has not scaled science.