New Year, New Data Resolutions!

It’s a new year and we are planning our CSD improvements for the year ahead. Some of us may have already broken our resolutions (oops), but here at CCDC we have resolved to keep improving the database in 2021.

Every year, as well as curating new entries, we undertake several projects to improve existing data in the CSD. Last year over 35,000 updates were made to existing entries, as mentioned in a blog about the most recent CSD data release. In this blog we give some additional detail on how the CSD was improved in 2020, along with insights into our plans for 2021. We always try to make the improvements that will most benefit you, our users, so we hope this blog will also inspire you to get in touch with your suggestions and priorities!

 

Integrity checks of data 

The CSD stores a variety of information about the properties of compounds, taken from the deposited CIF or the associated paper. Despite our curation efforts, some incorrect information does slip through from the originally deposited CIF files. In 2020 we spent time identifying and rectifying such issues, for example cases where the crystal colour and habit had been swapped around.
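To give a flavour of what a check for swapped colour and habit fields might look like, here is a minimal Python sketch. It is an illustration only, not the CCDC's actual curation tooling: the word lists and the `looks_swapped` function are hypothetical, and a real check would need much fuller vocabularies and a manual review of anything it flags.

```python
# Hypothetical vocabularies for illustration; real lists would be far longer.
COLOUR_TERMS = {"colourless", "colorless", "white", "yellow", "orange",
                "red", "green", "blue", "brown", "black", "purple"}
HABIT_TERMS = {"block", "plate", "needle", "prism", "rod", "lath", "cube"}


def looks_swapped(colour: str, habit: str) -> bool:
    """Return True when the colour field holds only habit terms
    and the habit field holds only colour terms."""
    colour_words = set(colour.lower().split())
    habit_words = set(habit.lower().split())
    # The colour field mentions a habit word and no colour word.
    colour_has_habit = bool(colour_words & HABIT_TERMS) and not (colour_words & COLOUR_TERMS)
    # The habit field mentions a colour word and no habit word.
    habit_has_colour = bool(habit_words & COLOUR_TERMS) and not (habit_words & HABIT_TERMS)
    return colour_has_habit and habit_has_colour


if __name__ == "__main__":
    print(looks_swapped("needle", "pale yellow"))
    print(looks_swapped("yellow", "needle"))
```

Requiring both fields to look wrong keeps the check conservative, so an entry with just one unusual value is not flagged as a swap.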

A number of investigations into potentially incorrect structures also took place throughout the year. The CCDC is an associate member of COPE (Committee on Publication Ethics) and aims to follow its protocols if an issue is suspected. When such issues arise, the CCDC contacts the journals concerned to request that they investigate anything related to their publications, and we update or retract data as advised by the editorial committee. The COPE guidelines have been particularly helpful when dealing with issues related to CSD Communications, which do not have an accompanying scientific article.

 

Atomic displacement parameters 

Historically, atomic displacement parameters (ADPs) were not included in the CSD (50 years ago, storage space and the size of the CSD were an issue!). Since these values were added to the CSD, we know some of you have encountered unusually large ADPs, particularly in some of the older entries, and we have now assessed these large values. In some cases they were the result of the ADP values having been scaled from standard units, which made the ellipsoids look very large when viewed in Mercury (see the image below of XUTLAZ). Over 150 of the structures with the largest ADP values were investigated and verified against the corresponding scientific article, and the values were amended where necessary.

 

 

Figure: CSD refcode XUTLAZ viewed in Mercury (hydrogens omitted), before and after amendment
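As a rough illustration of how unusually large ADPs could be screened, the Python sketch below computes an equivalent isotropic value and estimates whether the values look as though they were scaled. Everything here is an assumption made for the example rather than the method used by the CCDC: the function names are hypothetical, the 0.5 Å² cut-off is arbitrary, the scaling is assumed to be a power of ten, and taking one third of the trace is only valid when the tensor is expressed on Cartesian (orthogonal) axes.

```python
def u_equivalent(u11: float, u22: float, u33: float) -> float:
    """Equivalent isotropic displacement (in square angstroms) for a
    tensor given on Cartesian axes: one third of the trace."""
    return (u11 + u22 + u33) / 3.0


def suspect_scale(u_eq: float, plausible_max: float = 0.5) -> int:
    """Return the smallest power of ten that brings u_eq down to
    plausible_max or below. 0 means the value already looks plausible.
    plausible_max is an arbitrary, illustrative threshold."""
    power = 0
    # Cap the search at 10**6 so bad input cannot loop for long.
    while u_eq > plausible_max and power < 6:
        u_eq /= 10.0
        power += 1
    return power


if __name__ == "__main__":
    typical = u_equivalent(0.03, 0.04, 0.05)
    scaled = u_equivalent(30.0, 40.0, 50.0)
    print(typical, suspect_scale(typical))
    print(scaled, suspect_scale(scaled))
```

A non-zero result only marks an entry as worth a closer look. As described above, the values were verified against the corresponding scientific article before anything was amended.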

 

Property data and subsets 

Our project to identify structures measured using neutron, electron or synchrotron radiation continued in 2020, with over 700 additional structures identified after curation and flagged in the database. We have also worked to identify and add bioactivity and compound source information to drug structures, to enhance the information available about them.

Additionally, new subsets of CSD data have been created, including a subset of structures of interest in the fight against COVID-19.

 

New year’s resolutions  

The data team at the CCDC are hard at work planning the improvement projects we will undertake for 2021. 

One area we are investigating is the standardisation of information in some property fields. This is to improve the usability of the CSD for machine learning, with the aim of further reducing the amount of pre-processing some researchers need to do before using the data in the CSD.

We are also planning how the CSD can evolve and be enhanced, following on from an internal project, currently underway, to create a new flexible, expandable database format. This project will enable us to add new data fields to the CSD more easily. Work has already started to investigate what additional information could be added to the database and any curation that this information would require. This includes assessing additional metadata already captured in CIFs but not stored in the CSD, as well as other information requested by users, such as enhanced property information and more quality metrics.

Many of you have told us how much you value the information on different polymorphs in the CSD. That's why this year we will also be analysing the polymorphic data we store to see how we can enhance the information we hold and increase the consistency of these entries. 

We would like to hear your input on which future projects the CCDC should undertake to help you get the most out of the CSD. So, if you have a wishlist of enhancements you would like to see made to the CSD, drop us an email at hello@ccdc.cam.ac.uk or let us know on one of our social media channels.

Wishing you a Happy New Year from the CCDC Database team!