Protein–Ligand Interaction Screening: a Bridge Between the Protein and Chemical Space
Recursion, a clinical-stage biotechnology company industrializing drug discovery by decoding biology, has predicted the protein target(s) for approximately 36 billion compounds in the Enamine REAL Space chemical library (see the Recursion Pharmaceuticals, Inc. release “Recursion Bridges the Protein and Chemical Space with Massive Protein-Ligand Interaction Predictions Spanning 36 Billion Compounds”).
Using its MatchMaker technology, the company bridged the protein and chemical space by predicting protein–ligand interactions at massive scale, digitally evaluating more than 2.8 quadrillion small molecule–target pairs in total.
Recursion’s mission is to decode biology by integrating technological innovations across biology, chemistry, automation, data science, and engineering to industrialize drug discovery. This work represents a significant achievement for the company, which leveraged machine learning, computational expertise, and NVIDIA’s technology to predict a vast number of interactions between molecules and proteins.
This work contributes to the digital data explosion, which needs high-quality ‘smart’ data to succeed.
Protein–Ligand Interactions using the Cambridge Structural Database
Here at the CCDC we specialize in the collation, preservation, and application of scientific structural data for use in pharmaceutical discovery (including protein–ligand interactions), materials development, research, and education. We compile and distribute the Cambridge Structural Database (CSD), a certified trusted database of fully curated and enhanced organic and metal-organic structures, used by researchers across the globe.
Despite recently hitting the 1.25M milestone and being the world’s largest database of small-molecule organic and metal-organic crystal structure data, the focus of the CSD is quality not quantity (if you are looking for a needle in a haystack, why add more hay?).
The data is extensively curated post-deposition, with human editing and enhancement by our Scientific Editorial Team. The chemical connectivity is checked, validated compound names and 2D chemical diagrams are added, and the quality of the entry is assessed using the R-factor (a measure of the agreement between the crystallographic model and the experimental X-ray diffraction data).
This curation maintains the standards of the database:
Quality – the CSD is a trusted resource relied upon by industry and academia. It is vital that we check the accuracy and quality of the data deposited.
Consistent and readable – with over 1.2M structures it is important to maintain consistency, readability, and understandability of the data.
Accessible and discoverable – we annotate and enhance the data with metadata such as names, diagrams, and properties to make the data Findable, Accessible, Interoperable, and Reusable (FAIR).
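To make the R-factor mentioned above more concrete, here is a minimal sketch of the conventional unweighted definition, R = Σ||F_obs| − |F_calc|| / Σ|F_obs|, where F_obs and F_calc are the observed and model-calculated structure-factor amplitudes. The function name `r_factor` and the amplitude values are illustrative assumptions; this is not CCDC code, and it does not represent the thresholds or checks applied during CSD curation.

```python
def r_factor(f_obs, f_calc):
    """Conventional unweighted R-factor: sum(||Fo| - |Fc||) / sum(|Fo|).

    f_obs  -- observed structure-factor amplitudes, one per reflection
    f_calc -- amplitudes calculated from the crystallographic model
    """
    if len(f_obs) != len(f_calc):
        raise ValueError("f_obs and f_calc must list the same reflections")
    denominator = sum(abs(fo) for fo in f_obs)
    if denominator == 0:
        raise ValueError("observed amplitudes sum to zero")
    numerator = sum(abs(abs(fo) - abs(fc)) for fo, fc in zip(f_obs, f_calc))
    return numerator / denominator


# Made-up amplitudes for four reflections (hypothetical, for illustration only).
f_obs = [100.0, 50.0, 25.0, 25.0]
f_calc = [96.0, 52.0, 24.0, 27.0]

r = r_factor(f_obs, f_calc)
print(f"R = {r:.3f}")  # prints "R = 0.045", i.e. 4.5%
```

A lower value indicates closer agreement between the model and the diffraction data; a perfect match would give R = 0.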
“It’s always exciting to see ‘big’ and this effort is certainly big. A lot of people like ‘big’ and think ‘big’ is better, but we do need to stop and think sometimes. It will be interesting to see whether these methods can reliably lead to better models for a given target more quickly, or genuinely probe biological space more quickly. After all, prediction of biological behaviour requires way more understanding of a system than just approximate binding of a particular molecular entity to a given target.” Dr Jason Cole, Senior Research Fellow at the CCDC.