Tech giants take on the big problems in science - but are they ready for crystal structures?

Registration opened today for the 7th CSP Blind Test - the leading challenge in crystal structure prediction. In 1988 Nature editor John Maddox called it "one of the continuing scandals in the physical sciences" that we could not take a 2D molecular structure and predict its 3D form. This particular puzzle in computational science continues to challenge some of the best practitioners and developers of computational chemistry. Although tech giants have been turning their hand to scientific problems like protein folding, CSP is a very different kind of problem, and might just be a tougher nut to crack.

[Image: Amazon, IBM and Google take on science challenges]
Technology companies have embraced scientific challenges in the past few years - will they take on Crystal Structure Prediction next?

When tech tackles science

The giants of tech underpin almost every facet of modern life. Some are household names with well-known products, such as Google, Facebook, Amazon, Apple and Microsoft. Others are quiet titans such as ARM, IBM, SAP and Intel, which support devices and advances in a myriad of industries.

These companies have pushed computational boundaries in developing their products and services, including expertise in AI and ML. Increasingly they are creating scientific research arms to apply this knowledge to biological and chemical problems - for example Amazon Science and Google Accelerated Science.


So what happens when a tech company looks at a scientific problem?

In 2018 Google's DeepMind took on protein folding - with impressive results. The team at DeepMind developed a system called AlphaFold, which applied their knowledge of AI to predicting protein structures in the Critical Assessment of Structure Prediction (CASP) challenge.

IBM's research arm is more widely known for Watson, the Jeopardy-playing AI, but they too are exploring health and drug research; for example, their PaccMann framework integrates biomolecular information and uses reinforcement learning to design possible new chemistries for therapeutic purposes.

The coronavirus pandemic has further spurred technology companies' interest in biological and chemical research - with Amazon identifying drug molecules to repurpose against COVID-19. The project, known as the Drug Repurposing Knowledge Graph (DRKG), used machine learning methods to link datasets and identify compounds of interest. It identified 41 drugs, of which 11 were or are in clinical trials. The emergence of new AI-focussed companies such as BenevolentAI and Exscientia in recent years, and the adoption of such methods within big pharma, show that there is clear interest in using them in the current round of pharma innovation.

In a world where tech giants can fold proteins, suggest possible repurposed drugs and design new molecules - can they predict crystal structures?


Why do we want to be able to predict Crystal Structures?

There are many reasons why we'd like to be able to reliably predict the set of possibly observable crystal structures for a given molecular compound. Firstly, there's risk assessment in drug development. If we know that a more stable crystal form of a compound could exist, we need to worry about the stability of the form we are using for distribution in tablets; it's possible that it could, over time, transform into the more stable form. Secondly, we may have a situation where no crystal forms have been observed, and it may be desirable to have a given compound in a crystalline state. By predicting up front the forms that could be made, we could provide experimentalists with information that would in turn allow them to tune experiments to increase the likelihood of crystallisation. Another use case is to interpret experimental information.

The Nirvana of crystal structure prediction, though, would be to be able to predict the underlying crystalline properties of stable forms. Many effects in crystalline solids are influenced by the overall packing. For example, second harmonic generation, the process of frequency doubling of light, requires a polar crystal, so you'd like to design stable non-centrosymmetric crystal structures with a relatively high dipole. If we wish to design materials rather than discover them, we need to be able to predict crystal structures very accurately.


Learn more: Introduction to CSP video


Crystal Structure Prediction - the challenge

The prediction of crystal structures is in many ways more complex than many other scientific challenges. First we must consider the resolution. Generally, for a prediction to be useful, its atomic coordinates must lie within a Root Mean Square Deviation (RMSD) of 0.7 Å of the observed structure. This, of course, varies depending on what the end user wishes to do with the information: for accurate property prediction one may need a model that is even closer to the observed form, whereas for a qualitative understanding of a form (e.g. does it have large pores?) a more approximate structure may be enough. That said, 0.7 Å is a considerably higher bar than has been used in protein fold prediction.
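
To make the 0.7 Å criterion concrete, here is a minimal Python sketch of a Root Mean Square Deviation check. It is an illustration only, not the procedure used in the Blind Test: it assumes the atoms of the predicted and observed structures have already been matched one-to-one and overlaid in the same frame, and it skips the alignment and symmetry handling a real packing comparison would need. The function names and the coordinates are made up for the example.

```python
import math

# Threshold quoted in the text, in angstroms.
RMSD_THRESHOLD = 0.7


def rmsd(predicted, observed):
    """RMSD between two equally ordered lists of (x, y, z) positions in angstroms.

    Assumes atom i in `predicted` corresponds to atom i in `observed`
    and that both sets are already overlaid (no alignment is done here).
    """
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("coordinate lists must be non-empty and the same length")
    total = 0.0
    for (px, py, pz), (ox, oy, oz) in zip(predicted, observed):
        total += (px - ox) ** 2 + (py - oy) ** 2 + (pz - oz) ** 2
    return math.sqrt(total / len(predicted))


def is_match(predicted, observed, threshold=RMSD_THRESHOLD):
    """True if the prediction lies within the RMSD threshold of the observed form."""
    return rmsd(predicted, observed) <= threshold


# Hypothetical coordinates, invented purely for illustration.
observed = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
predicted = [(0.1, 0.0, 0.0), (1.5, 0.2, 0.0), (1.4, 1.5, 0.1)]

print(f"RMSD = {rmsd(predicted, observed):.3f} A, match = {is_match(predicted, observed)}")
```

Because the deviations are squared before averaging, a single badly placed atom can push an otherwise good prediction over the threshold, which is part of why the bar is hard to clear.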

Then we come to the many and varied parts that make up a crystal structure; as well as the molecule's 3D shape, there is the packing, with many space groups and unit cells possible. Add to this the fact that the lowest energy structure is not always the observed form, and that there can be multiple polymorphs which are all real observed forms, and we soon see that there is more to the story. Indeed, even the very best CSP methods still find it a challenge to say which form(s) in the "landscape" of predictions are observable and which are not.

Past CSP Blind Tests run by the Cambridge Crystallographic Data Centre from 1999 to now have seen great advances in the methods, with techniques such as dispersion-corrected density functional theory (DFT-D) becoming more reliable, and more complex forms like hydrates and salts being successfully predicted. However, the work is far from done.

And remember, CSP isn't just an academic pursuit - the ability to solve this puzzle could have very real implications in spotting unstable drug molecules earlier, or designing new molecular entities in silico before embarking on lengthy and expensive laboratory work. The often-cited example is Ritonavir, a drug product which reached market before an alternative structure was observed - ultimately costing the manufacturer around $250m in lost revenues.


What will the 7th CSP Blind Test, starting in October 2020, bring?

And will we see tech companies taking on CSP as their next scientific challenge?

Follow the progress of the test here.