Increasing the Value and Wealth of Data in the Cambridge Structural Database – Thank You 2022 Summer Students

This summer, four young scientists, Zena Younes, Hallam Greene, Maximillian Stanzione and Jaya Kumar-Mehay, joined the CCDC team to help find, digitalise, and deposit structures in the Cambridge Structural Database (CSD), improve existing entries, and create educational resources for users of the Database. We hear about how the CSD has benefitted from the students' efforts and the impact the summer placements have had on their future scientific careers.

The way data is currently consumed is beyond anything Olga Kennard could have imagined when she established the CSD over 50 years ago, but as a visionary who understood the importance of collecting and sharing data she made the first step – and the CSD is a result of this vision.  

But how is this data collected and how do our annual paid summer students contribute to this global effort?  

The CSD contains all the experimental small-molecule structures that have been published, including structures published in:

  • Scientific literature including ASAP & early view articles
  • The CSD directly as CSD Communications
  • Patents 
  • University repositories 
  • Thesis publications

While the majority of structures are associated with articles published in the scientific literature, for which automated workflows exist, some structures take a lot more effort to deposit in the CSD and that’s where our summer students come in. 

 

Archiving Legacy Data 

A key part of our summer project was to digitalise historic crystallographic data into CIFs and then deposit them in the CSD. The data came from a variety of sources. Any data, whether old printed records sitting in a filing cabinet or a file on a hard drive, could be invaluable to someone’s research, so making it available to the scientific community remains a priority. We encourage anyone holding such data to send it to us, either as a scanned copy or by depositing directly. Find out more about how to send in data.

Amongst the CIFs created were interesting structures such as paddlewheel complexes, charge transfer salts and cryptates. These structures are now easily accessible within the CSD.

Paddlewheel complex, added as part of the hardcopy project. ECOFIP: Shuji Emori, Michio Nakashima, Takurou Eguchi, Fumio Takenouchi, ITE Letters on Batteries, New Technologies and Medicine, 2007, 8, 563.

The projects this year also involved reviewing new patents as a possible source of crystal structures to add to the existing collection of patent data in the CSD. All relevant, recent patents were reviewed; nearly 100 new patent publications were added to the CSD, and over 100 further requests for data were made. Although scrutinising each patent for crystal structures took a long time, the amount of industrially relevant data added and requested shows how valuable this source is: access to any one of these new structures could save a lot of time in the lab or help with research.

 

A zirconium MOF, added as a result of the patent project. ECODOT: Bu Xianhe, Li Na, Chang Ze, U.S. Patents, 2020.

 

Theses are another potential source of crystal structures. If a thesis includes X-ray crystallography, the structures can sometimes go unpublished elsewhere. Published theses were reviewed to assess whether their crystal structures had been deposited in the database. Although we only scratched the surface of what is out there, it was clear that many theses have been published without the structures being submitted to the CSD. As well as making the data easier to locate, depositing benefits the author, as CCDC numbers can be included in the thesis so that readers can find the relevant structures with ease. Once again, we encourage everyone to deposit all their solved structures via https://www.ccdc.cam.ac.uk/deposit.

"By being involved in the project, we were able to contribute to the larger crystallographic and scientific community, making data more accessible to scientists across the world. We were also able to bring older structures to life by creating CIFs from scanned documents. Digitalising a structure solution which was almost 100 years old (ZEKVUK: E. Gordon Cox, Nature (London), 1928, 122, 401, DOI: 10.1038/122401b0) was a very gratifying experience! Increasing the availability and accessibility of structures which would have otherwise been lost was incredibly satisfying. We all made valuable additions to the existing collection, and this has given us a deeper appreciation of the work that goes on within the CCDC. Observing every point of the journey a structure goes through on its path to validation has led to further inspiration, as we have seen first-hand the meticulous nature of the work at the CCDC, and we all plan to take this attribute into our future careers."

 

Data Integrity and CSD Improvements

Another part of our summer placement was with the Editorial Team. This involved making improvements to existing entries within the database. The students standardised pressure units to facilitate both searching and machine-reading, identified and labelled radical species, labelled data that was recorded at synchrotron facilities, and assigned oxidation states to entries. The completion of these tasks makes it easier for users to locate subsets of data.

One of the students, Zena Younes, wrote her thesis on high-pressure structures in the CSD. The thesis work was carried out before the pressure subset was introduced, and thus required a meticulous API search with many filters. Moreover, the lack of a standardised unit for pressure in the CSD meant spending a few days converting over 2000 entries into GPa.
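
To illustrate why a single pressure unit matters for machine-reading, here is a minimal Python sketch of converting free-text pressure values into GPa. It is not the procedure used at the CCDC and does not use the CSD Python API. The function name `to_gpa`, the unit table, and the assumed "number followed by unit" text format are all hypothetical, chosen only for illustration.

```python
import re

# Conversion factors to gigapascals (GPa), from the standard unit definitions.
# The set of units is an assumption; real entries may use other spellings.
TO_GPA = {
    "gpa": 1.0,
    "mpa": 1e-3,
    "kpa": 1e-6,
    "pa": 1e-9,
    "kbar": 0.1,
    "bar": 1e-4,
    "atm": 1.01325e-4,
}

# Assumed format: a number, optional whitespace, then a unit, e.g. "10 kbar".
_PRESSURE = re.compile(
    r"^\s*([0-9]*\.?[0-9]+(?:[eE][+-]?[0-9]+)?)\s*([A-Za-z]+)\s*$"
)


def to_gpa(text):
    """Convert a pressure string such as '500 MPa' to a float in GPa.

    Units are matched case-insensitively, so 'mPa' would be read as MPa;
    that simplification is part of this sketch, not a general rule.
    """
    match = _PRESSURE.match(text)
    if match is None:
        raise ValueError(f"cannot parse pressure: {text!r}")
    value, unit = match.groups()
    factor = TO_GPA.get(unit.lower())
    if factor is None:
        raise ValueError(f"unknown pressure unit: {unit!r}")
    return float(value) * factor


if __name__ == "__main__":
    for raw in ["2.5 GPa", "10 kbar", "500 MPa", "1 atm"]:
        print(f"{raw:>8} -> {to_gpa(raw):.6g} GPa")
```

Once every value is stored in one unit, a search such as "all structures above 1 GPa" becomes a simple numeric comparison rather than a text-matching exercise.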

"It has been a highly rewarding and personal experience to carry out these editorial projects, knowing that it will help future researchers maximise efficiency."

 

Educational Resources

Finally, the students also got a chance to get creative, working on educational videos for the CCDC YouTube channel with the Education & Outreach Team.

"Creating the videos gave us a chance to get more familiar with a wider range of CSD software functionality that can be very helpful in education to illustrate effectively chemical and crystallography concepts to produce clear, informative videos. Using the wealth of crystal structures in the CSD and powerful CSD visualisation tools included in Mercury, we were able to explain key concepts, creating educational resources for our fellow students, the current and next generation of crystallographers."

 

What Have the Students Learnt?

"Throughout the past nine weeks, we have experienced the many steps involved in creating CIFs and depositing, validating, and improving entries in the CSD. We have learnt the importance of accurate, easily accessible data, and this is something we will take with us through our scientific careers. We have used our crystallographic knowledge to build on the existing CCDC educational material, creating videos to help teach and inspire the next generation of crystallographers. Our chemistry skills have also been put to work, improving entries in the CSD to make the database more searchable and machine-readable for its many users. On top of this we have been welcomed and supported by a fantastic team who have been willing to help us out at every stage. The two months have flown by, but all of us will fondly remember our time at the CCDC."

The 2022 Summer Students

Zena Younes is a recent graduate of the University of Edinburgh having completed an MChem in 2022. She wrote her thesis under the supervision of high-pressure extraordinaire Prof. Simon Parsons, where she used the CSD daily to study the effects of high pressure on molecular interactions. After her time at the CCDC, she will start a PhD at the University of Edinburgh continuing her high-pressure crystallography.


Jaya Kumar-Mehay is about to embark on her third year studying Chemistry at the University of Birmingham, where she has used the CSD to help with her studies. She was first introduced to crystallography during her EPQ project on antibiotic resistance. She hopes to continue with crystallography in the future in the drug discovery field.


Maximillian Stanzione is a recent graduate of the University of Birmingham, having completed a Chemistry MSci in 2022. After his time with the CCDC he will start a PhD at the University of St Andrews, where he plans to use the new skills and knowledge he has gained over the summer.


Hallam Greene recently completed his 3rd year of Chemistry at the University of Sheffield. He developed an interest in crystallography in his Level 3 project where he solved and described the crystal structure of an indium MOF. After working for the CCDC, he will start a placement year at Diamond Light Source.


Next Steps

  • Want to be a 2023 summer student? Check out our careers page.
  • Learn more about how the 1.1M+ structures in the CSD can help your research.