As you can imagine, we have all been excitedly anticipating the one millionth structure for quite some time. With the majority of structures added by automated workflows, trying to predict exactly when we were going to reach this milestone has been tricky to say the least, and I have to admit it has involved quite a few sleepless nights!
A graph showing the growth of the CSD to one million structures (only data from 1972 onwards are shown for clarity, with the annual growth in red) and the increasing complexity of the CSD, as demonstrated by the blue line showing the increase in formula weight
I am therefore delighted to report that the waiting is over, and we now have our millionth structure in the database. I know many crystallographers worldwide were also waiting with bated breath to see if they could time the publication of their latest structure so that they could become the author of the millionth structure. That accolade goes to Yao Wang and co-workers from Shandong University in China for the structure of an N-heterocycle produced by a chalcogen bonding catalyst. Our attention can now turn to producing a fitting prize for the crystallographer.
CSD Refcode XOPCAJ (DOI 10.5517/ccdc.csd.cc20vdhs), the millionth structure added to the CSD
The sharing of the millionth structure comes fifty-four years after work on the database first began. Its beginnings can be traced back to 1965 and to J.D. Bernal and Olga Kennard, who had the vision and foresight that the collective use of data would lead to the discovery of new knowledge. That vision has certainly come to fruition today. Both the structures and the database itself have evolved significantly since then, as has its value to scientists worldwide.
The real value of the CSD
With the value of the CSD directly linked to the contributions of dedicated scientists who publish and deposit crystallographic data, this seems a perfect opportunity to thank these scientists and crystallographers. So, I want to ask the nearly 400,000 authors in the CSD to take a bow while we thank you for your contribution to this amazing resource. Each individual structure is important and has a story to tell, but I would particularly like to salute the 750+ authors with over 500 structures in the CSD¹. These authors are collectively responsible for nearly half of the structures!
A wordle showing the most prolific authors in the CSD
Alongside this army of scientists determining new crystal structures, we have a dedicated team here at the CCDC responsible for collecting, curating and enhancing the data. It is thanks to all the Deposition Coordinators, Scientific Editors and supporting CCDC teams past and present that the database is the trusted, high-quality resource it is today.
Thankfully, due to advances in technology and the work of our software development team, the process and the tools used to create the database have evolved considerably over the years. The early days saw the team deploy punch cards and knitting needles, but today we have automated workflows and state-of-the-art software, allowing us to focus our expertise where it has most impact. An editor is now able to scientifically curate 100 new structures a day. To put this into perspective, there were only 655 structures published during the whole of 1965; using modern techniques, that would take a single editor just over a week to process!
Of course, that is only part of the process. With the advent of one million structures, we estimate that over 400 person-years’ worth of effort has been invested in the curation of the CSD. Additionally, if we were to create the CSD from scratch, it would still take one person about 110 years to produce, and they would need to use today’s tools that exploit the data built up in the CSD to do so. If we then add in the time the crystallographers have spent running the diffraction experiments and refining the structures, then we could envisage that the CSD easily encapsulates well over 1,500 person-years’ worth of effort (assuming at least 4 hours have been spent on each structure)! This is quite remarkable and is one of the many reasons that it is such a special and valuable resource.
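For anyone who would like to play with these numbers, the estimate can be sketched in a few lines of Python. This is only a rough illustration, not the calculation behind the figures above: the number of working hours in a person-year (`HOURS_PER_PERSON_YEAR`) is an assumption that does not come from this post, and the function names are purely illustrative.

```python
# Back-of-envelope check of the effort figures quoted in the post.
# Values marked "assumption" are NOT from the post; adjust them freely.

STRUCTURES = 1_000_000          # size of the CSD at the milestone (from the post)
EDITOR_RATE_PER_DAY = 100       # structures one editor can curate per day (from the post)
STRUCTURES_1965 = 655           # structures published in 1965 (from the post)
CURATION_PERSON_YEARS = 400     # estimated curation effort (from the post)
HOURS_PER_STRUCTURE = 4         # lower bound per structure (from the post)
HOURS_PER_PERSON_YEAR = 1_800   # assumption: about 225 working days of 8 hours


def days_to_curate(n_structures, rate_per_day=EDITOR_RATE_PER_DAY):
    """Editor-days needed to curate n_structures at a fixed daily rate."""
    return n_structures / rate_per_day


def experimental_person_years(n_structures=STRUCTURES,
                              hours_each=HOURS_PER_STRUCTURE,
                              hours_per_year=HOURS_PER_PERSON_YEAR):
    """Person-years spent by crystallographers collecting and refining data."""
    return n_structures * hours_each / hours_per_year


def total_person_years(curation=CURATION_PERSON_YEARS):
    """Curation effort plus the crystallographers' experimental effort."""
    return curation + experimental_person_years()


print(f"1965 output at today's rate: {days_to_curate(STRUCTURES_1965):.2f} editor-days")
print(f"Experimental effort: {experimental_person_years():.0f} person-years")
print(f"Total effort: {total_person_years():.0f} person-years")
```

Under the assumed 1,800-hour person-year, the 1965 output works out at 6.55 editor-days (just over a five-day working week) and the combined total lands comfortably above the 1,500 person-years quoted. A different choice of working hours shifts the total, so treat the result as indicative only.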
A photo taken at Downing College during the CSD50 Event in 2015 showing some of the many crystallographers, scientists and CCDC staff past and present that have contributed to the CSD over the years
Drawing insights from the data
It isn’t just the number of structures that has grown over the decades: the diversity, complexity and breadth of information contained in the CSD have also grown. This, coupled with the continued development of our software tools, means that the ability to draw knowledge and insights from the CSD has steadily increased.
Our attention will of course now turn to adding the next one million structures, and fittingly the one millionth structure comes at a significant time for the CCDC. We are continuing to evolve our underlying database format to enable us to expand the breadth of data available and better meet the needs of our ever-expanding user base. So, watch this space to see what the next million will bring. One thing is for certain: we know you will continue to delight us with the insights you are able to derive from it!
So a million thanks to the thousands of scientists who have contributed to the CSD and have helped make it the wonderful resource that it is today.
We would love to hear more about the structures you have authored in the CSD, who your CSD greats are, or how you have used the million structures in the CSD in your research or in education. If you want to get involved, use #MyCSD1inaMillion and #CSD1Million on Instagram, Twitter, Facebook and LinkedIn, and make sure you share some images of your beautiful structures too!
- CSD Annual Statistics
- CCDC on Twitter: @CCDC_cambridge
- CCDC on Facebook: @ccdc.cambridge
- CCDC on LinkedIn: www.linkedin.com/company/cambridge-crystallographic-data-centre