Does publication source affect structure quality?
In recent years the world of scientific publishing has seen increased interest in the data behind scientific articles and in publishing that data, sometimes without an accompanying article at all. This can be seen in the rise of data journals and databases. For many years the Cambridge Crystallographic Data Centre (CCDC) has offered a way to share crystal structure data without an associated scientific paper, the so-called CSD Communication (CSD Comms). This allows authors to get credit for work that would otherwise gain them nothing sitting in a drawer or on a hard drive. However, one of the most common concerns we hear about CSD Communications relates to the quality of these structures, which have not undergone peer review, and how we ensure the continued integrity of the Cambridge Structural Database (CSD). With this in mind, we have used three methods to investigate potential differences in data quality between structures from different journals and structures without traditional peer review. We selected a range of high-impact journals covering science, general chemistry and crystallography, as well as a data journal, alongside the CCDC’s own CSD Communications.
We started with the most common metric for assessing the quality of a crystal structure, the R‑factor. This is the standard metric quoted in papers, and it is generally accepted that the lower the value, the better. The bar chart below illustrates the average R‑factor values over five years for the selected journals, with the dashed red line showing the average R‑factor across the whole of the CSD. This suggests that CSD Communications are actually pretty average compared to the other journals and doesn’t indicate a general problem with these structures.
A bar chart of the average R-factors for the selected journals between 2013 and 2017, with a horizontal dashed line (red) for the average R-factor over the whole CSD
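For anyone who wants to reproduce this kind of comparison themselves, here is a minimal sketch using the CSD Python API (which requires a licensed installation). The `EntryReader` and `r_factor` attributes are documented API features; the exact publication metadata fields used for grouping by journal and year are our assumption and should be checked against the API documentation. Iterating the whole CSD this way is slow and purely illustrative.

```python
# Sketch: average R-factor per journal, 2013-2017, using the CSD Python API.
from collections import defaultdict
from statistics import mean

from ccdc.io import EntryReader

r_factors_by_journal = defaultdict(list)

reader = EntryReader("CSD")
for entry in reader:
    if entry.r_factor is None:
        continue
    citation = entry.publication  # assumed: citation metadata with journal and year fields
    if citation is None or citation.year is None:
        continue
    if 2013 <= citation.year <= 2017:
        r_factors_by_journal[str(citation.journal)].append(entry.r_factor)

# CSD Communications appear under their own "journal" name in this grouping.
for journal, values in sorted(r_factors_by_journal.items()):
    print(f"{journal}: mean R-factor {mean(values):.2f} over {len(values)} structures")
```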
The second method we considered was the IUCr’s checkCIF report. This is a widely used validation tool that checks the consistency and integrity of the data and produces a list of alerts for anything anomalous. The level of an alert indicates the severity of the potential issue; the chart below summarises the level A to C alerts for each of the selected journals between 2014 and 2017. The bar for CSD Communications demonstrates that these structures don’t give any more alerts than structures from some of the selected peer-reviewed journals. Whilst there is some variation in the number of alerts, it is clear that the least serious level C alerts occur more frequently than the higher‑level alerts. checkCIF is continually evolving as new checks are added, meaning that some of the earlier structures may now generate alerts that weren’t raised when they were published.
Chart showing the average number of checkCIF level A (red), B (orange) and C (yellow) alerts per structure for the selected journals
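If you have a checkCIF report saved as text, tallying the alerts by level is straightforward. The sketch below assumes the usual validation-alert naming convention seen in checkCIF output (names of the form `PLAT041_ALERT_1_C`); the file name is hypothetical.

```python
# Sketch: count checkCIF alerts by level (A, B, C, G) from a saved report.
import re
from collections import Counter

ALERT_PATTERN = re.compile(r"_ALERT_\d+_([ABCG])\b")

def count_alerts(report_text: str) -> Counter:
    """Return a Counter of alert levels found in a checkCIF report."""
    return Counter(ALERT_PATTERN.findall(report_text))

with open("checkcif_report.txt") as handle:  # hypothetical saved report file
    levels = count_alerts(handle.read())

for level in "ABC":
    print(f"Level {level} alerts: {levels.get(level, 0)}")
```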
Thirdly, we used the CCDC’s Mogul program, which evaluates the geometry of a structure. Mogul can compare each individual bond length and angle in the selected structure to chemically similar fragments in the CSD and, where there is enough data, determine whether the selected fragment is unusual compared to the rest of the CSD. An abnormally long bond or unexpected angle could indicate an issue with the structural model or disorder, or it could be an interesting feature to write a paper about! Using bond angles as an illustration, the chart below shows the average percentage of unusual angles per structure for the selected journals between 2013 and 2017. This demonstrates that structures shared as CSD Communications don’t have significantly more unusual geometries than structures in peer‑reviewed papers. Looking a little closer, we can see an interesting “mirror effect” for some of the journals, where a higher, or lower, average percentage of unusual angles corresponds to a higher, or lower, average number of angles per structure. If we take this as an indication of the size and complexity of the structures published in each journal, it is perhaps no surprise that many of the journals with a higher average percentage of unusual angles also publish the larger, more complex structures.
Plot showing the average percentage of unusual bond angles (blue) and average number of angles per structure (yellow) for selected journals
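A Mogul-style geometry check is also available through the CSD Python API’s `GeometryAnalyser`. The sketch below shows how the percentage of unusual angles for a single structure might be computed; the attribute names (`analysed_angles`, `enough_hits`, `unusual`) reflect our reading of the API documentation and should be verified, and the refcode is just an example.

```python
# Sketch: percentage of "unusual" bond angles in one CSD structure.
from ccdc.io import EntryReader
from ccdc.conformer import GeometryAnalyser

reader = EntryReader("CSD")
molecule = reader.molecule("AABHTZ")  # example refcode

engine = GeometryAnalyser()
checked = engine.analyse_molecule(molecule)

# Only count angles where the CSD contains enough similar fragments to judge.
assessable = [a for a in checked.analysed_angles if a.enough_hits]
unusual = [a for a in assessable if a.unusual]

if assessable:
    pct = 100.0 * len(unusual) / len(assessable)
    print(f"{len(unusual)} of {len(assessable)} assessable angles are unusual ({pct:.1f}%)")
```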
Finally, to take the data quality investigation further, we compared the different methods and asked whether each one highlights the same poor structures. If that were the case, we would expect a linear correlation between structures with a high R-factor, lots of checkCIF alerts and a high percentage of unusual geometries. However, as the scatter plot below illustrates, the expected linear correlation doesn’t appear, so we can conclude that each method is picking up different aspects of the data. The same holds if we compare each of the methods to the others. The lack of correlation between the methods raises the questions: should we be considering multiple methods to get a complete picture of structure quality, and, indeed, are there new methods that could tell us more about the reliability of a structure?
Plot of R-factor vs percentage unusual bond lengths for selected structures
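Checking for this kind of correlation yourself is simple once the per-structure metrics are collected. The sketch below computes pairwise Pearson coefficients with NumPy, assuming the three metrics have already been gathered into parallel lists; the values shown are placeholders, not real data.

```python
# Sketch: pairwise Pearson correlation between the three quality metrics.
import numpy as np

r_factors = np.array([3.2, 4.5, 6.1, 8.0, 5.3])           # placeholder R-factors (%)
alert_counts = np.array([1, 4, 2, 7, 3])                   # placeholder checkCIF alert counts
pct_unusual_angles = np.array([0.5, 2.0, 1.1, 3.4, 0.9])   # placeholder % unusual angles

metrics = {
    "R-factor": r_factors,
    "checkCIF alerts": alert_counts,
    "% unusual angles": pct_unusual_angles,
}

names = list(metrics)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        r = np.corrcoef(metrics[first], metrics[second])[0, 1]
        print(f"Pearson r ({first} vs {second}): {r:+.2f}")
```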
Reflecting on the trends from all three methods, most of the peer‑reviewed structures don’t exhibit better quality than structures added to the CSD as CSD Communications. This is a great credit to the hard work of the authors who have deposited these structures, and to the value of CSD Communications. Of course, a structure with high values for these metrics may not be a poor structure at all. It could be a seminal structure that is pushing the bounds of crystallography, a compound that doesn’t crystallise well, or a symptom of other challenging conditions. Another observation is that journal Data1 consistently shows lower values for all three methods, as well as generally smaller structures, which may reflect the stricter peer-review guidelines for this journal. All structures, irrespective of whether they have been published in a peer‑reviewed article or not, are validated by our expert team of editors at the CCDC. This manual curation means that the data and chemistry of every CSD Communication is carefully checked before it is added to the CSD.
We know that crystal structure quality and reliability are important to you, and we will be continuing our investigations in this area. We hope to keep helping you share all your data, while also adding more filtering options and additional validation and integrity checks to the CSD. This will enable you to select the data you want to use in your future research. We would love to hear your thoughts and any ideas you might have!