Drugs, pesticides and COVID-19 drug subsets - new in 2020.2 release

The 2020.2 CSD Release adds more than 12,000 CSD entries since the last data update in June, taking the total size of the CSD to over 1.08 million entries. Whilst this impressive rate of growth is testament to the ongoing efforts of scientists around the world, who publish and share the outcomes of their research with the community, the sheer size and diversity of structures in the CSD can be somewhat daunting.

To help researchers gain insights into particular categories of compounds, the 2020.2 CSD Release has extended the range of available subsets, with new lists of structures focused on drugs, pesticides and those structures that have been highlighted as being of interest in the fight against COVID-19. These new CSD subsets add to the existing subsets, which include ‘best representative’ lists for statistical analysis of the CSD, subsets of metal-organic frameworks (MOFs) and CSD entries containing information on atomic anisotropic displacement parameters (ADPs; otherwise known as thermal ellipsoids).


The new CSD subsets are available directly through our ConQuest software, with users able to either browse through the structures directly (as shown in the screenshot below) or to limit the results of any ConQuest query to return results from only the subset of choice. The subsets can also be used within Mercury or the CSD Python API, allowing users to create bespoke searches and workflows.



Screenshot of ConQuest showing the subsets available to view in the 2020.2 CSD Release
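
For readers building their own workflows, the idea of limiting results to a subset can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it assumes the subset has been exported as a plain-text refcode list with one refcode per line (for example a .gcd file saved from ConQuest), and the function names, the file name and the placeholder refcodes other than BUVBEC are invented for this example. It does not use or represent the CSD Python API itself.

```python
import os
import tempfile


def read_refcode_list(path):
    """Read a refcode list (assumed format: one refcode per line).

    Blank lines are skipped and refcodes are upper-cased so that
    comparisons are case-insensitive.
    """
    with open(path, encoding="utf-8") as handle:
        return {line.strip().upper() for line in handle if line.strip()}


def restrict_to_subset(hit_refcodes, subset_refcodes):
    """Keep only the hits that belong to the subset, preserving hit order."""
    return [ref for ref in hit_refcodes if ref.upper() in subset_refcodes]


if __name__ == "__main__":
    # Hypothetical stand-in for an exported subset file; apart from BUVBEC,
    # the refcodes below are made-up placeholders, not real subset members.
    with tempfile.TemporaryDirectory() as folder:
        subset_path = os.path.join(folder, "drug_subset_example.gcd")
        with open(subset_path, "w", encoding="utf-8") as handle:
            handle.write("BUVBEC\nABCDEF\n")
        subset = read_refcode_list(subset_path)

    # Hypothetical hit list from some earlier search.
    hits = ["XYZABC01", "BUVBEC", "ABCDEF"]
    print(restrict_to_subset(hits, subset))
```

Holding the subset as a set keeps each membership check cheap, which matters when a hit list is filtered against a subset containing thousands of entries.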


There are two new pharmaceutical-based subsets, based on the methodology presented in the 2019 publication “The CSD Drug Subset: The changing chemistry and crystallography of small molecule pharmaceuticals”. The first of these, called the CSD Drug Subset, provides users with a single list of all CSD entries containing a molecule that features in the Approved Drug list provided by DrugBank. This subset has a wide scope, including any solvates, co-crystals or hydrated forms, and currently provides a set of 12,277 entries to help users gather insights into drug-like compounds. In a past blog, I showed examples of the trends that can be observed when comparing CSD entries with drug-like properties to more general ‘organic’ structures. For a more precise match with approved drug molecules, the release also contains the Single-Component CSD Drug Subset, which, as the name suggests, includes 1,989 CSD entries where a drug molecule is the only modelled component in the crystal structure. These subsets will be regenerated with every future CSD data update, so any new structures in the CSD or changes to the DrugBank approved drug list will be reflected in the subsets.

An ethanol solvate of the xanthine oxidase inhibitor febuxostat (CSD refcode BUVBEC, https://dx.doi.org/10.5517/ccdc.csd.cc2555j2), published in the journal Organic & Biomolecular Chemistry in May 2020 and included in both the CSD Drug Subset and the 2020.2 CSD Release


A similar approach has been taken to produce a CSD Pesticide Subset. The CCDC has collaborated with the Pesticide Properties DataBase (PPDB), produced by the Agriculture & Environment Research Unit (AERU) at the University of Hertfordshire, to provide links between the two databases. Links directly to the PPDB are available via our Access Structures and WebCSD services, with reciprocal links to the CSD in the PPDB. The new subset allows users to browse all 972 matches, or restrict searches to them, directly through our ConQuest desktop software, and it is also accessible through Mercury and the CSD Python API.



Screenshots of the PPDB and WebCSD showing reciprocal links between the two webpages (highlighted in red)



The last new subset included with the 2020.2 CSD Release is the CSD COVID-19 Subset, a complete set of the structures we’ve highlighted in recent blogs by Suzanna Ward and Ian Bruno on molecules of interest in the fight against COVID-19.

We hope these new subsets will give users a starting point for using the data in the CSD quickly and effectively within the specialist areas of pharmaceutical and pesticide research, where it would otherwise be challenging to define a search query to find structures of interest. The subsets also help to demonstrate the benefits of links between databases and other information sources, and are examples of how the CCDC’s values of collaboration and community help progress structural science.