The HBond in CATKIT is not found because the default path length range for detecting hydrogen bonds is (4, 999), which excludes contacts between separate components of the molecule. You can include such contacts by setting path_length_range to (-1, 999), i.e.:
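from ccdc import io
# Open the CSD and fetch the molecule of the CATKIT entry
csd = io.MoleculeReader('csd')
catkit = csd.molecule('CATKIT')
# A lower bound of -1 admits contacts between separate components
print(catkit.hbonds(path_length_range=(-1, 999)))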
The value -1 is used to cope with both options to the 'require_hydrogens' parameter of the hbonds() method. I appreciate that this is not clear from the documentation, and this will be rectified in a forthcoming release.
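To make the effect of the range concrete, here is a minimal sketch in plain Python of how a path length filter of this kind can behave. It is not the API's implementation: the function names, the toy atom labels, the inclusive bounds and the use of -1 for "no bonded path" are all assumptions made for illustration, inferred from the behaviour described above.

```python
from collections import deque


def shortest_path_length(bonds, start, end):
    """Count the bonds on the shortest path between two atoms.

    Returns -1 when no bonded path exists, i.e. the atoms sit in
    separate components (an assumed convention for this sketch).
    """
    adjacency = {}
    for a, b in bonds:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    distance = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        if atom == end:
            return distance[atom]
        for neighbour in adjacency.get(atom, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[atom] + 1
                queue.append(neighbour)
    return -1


def in_range(path_length, path_length_range):
    """Inclusive range test (inclusiveness is an assumption)."""
    low, high = path_length_range
    return low <= path_length <= high


# Hypothetical toy structure with two separate components: O1-H1 and N1-C1
bonds = [("O1", "H1"), ("N1", "C1")]
separation = shortest_path_length(bonds, "H1", "N1")
print(separation)                        # -1
print(in_range(separation, (4, 999)))    # False
print(in_range(separation, (-1, 999)))   # True
```

With the default-style range the inter-component contact is rejected, while lowering the bound to -1 lets it through.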
I think the default behaviour is somewhat counterintuitive; I shall discuss with colleagues whether the default should be made more permissive.
Hope this is helpful.
I agree - or rather a friendly chemist agrees - that the structure is a bit rubbish. The first kekulize misassigns the double bonds at the carbon you mentioned, so the second aromatic assignment does not regard these bonds as aromatic, and the second kekulize therefore does not operate on the same structure as the first. I agree that this is not ideal behaviour, but it is comprehensible.
The only solution I can think of is to assign all bond types:
where the double bond to the phosphorus is detected, the five-membered ring is no longer aromatic, and the kekulisation works as expected.
I have mailed the database group to see if they want to fix the bonds in the structure, but this will be too late for the forthcoming November release.
I'm afraid you have unearthed a genuine bug. Discussions are underway here to see if it can be fixed in the forthcoming API version 1.3 release.
In the meantime you can work around the problem by using the internal API:
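from ccdc.molecule import Molecule
mol = Molecule.from_string(...)
# Rebuild the molecule around an editable copy of the internal object
mol = Molecule(mol.identifier, _molecule=mol._molecule.create_editable_molecule())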
Sorry about this.
Please carry on raising any difficulties you have, and making suggestions for ways in which we may improve the API.
Here's the slightly modified script, testing for 3D coordinates.
I've attached a table of spacegroup, average void space (as a percentage of the unit cell volume), and number of observations, taken from the 673,606 structures of CSD V536 with 3D coordinates. I'll leave it to the crystallographically adept to extract any meaning there is in the table.
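For anyone wanting to build a similar table from their own results, the aggregation step can be sketched in plain Python as below. This is only an illustration of the bookkeeping, not the attached script: the function name is hypothetical, the input is assumed to be (spacegroup, void percentage) pairs that have already been computed, and the numbers in the example are made up.

```python
from collections import defaultdict


def summarise_voids(records):
    """Return (spacegroup, mean void percentage, count) rows, sorted by spacegroup.

    records is an iterable of (spacegroup, void_percent) pairs.
    """
    totals = defaultdict(lambda: [0.0, 0])  # spacegroup -> [sum, count]
    for spacegroup, void_percent in records:
        totals[spacegroup][0] += void_percent
        totals[spacegroup][1] += 1
    return [(spacegroup, total / count, count)
            for spacegroup, (total, count) in sorted(totals.items())]


# Made-up values purely to exercise the function; not CSD results
records = [("P21/c", 2.0), ("P-1", 1.0), ("P21/c", 4.0)]
for spacegroup, mean_void, count in summarise_voids(records):
    print(spacegroup, mean_void, count)
```

Accumulating a running sum and count per spacegroup keeps memory use small, which matters when looping over several hundred thousand structures.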
I've attached a script which will do this over the whole CSD. It's running on my desktop at the moment, but I don't expect the results until tomorrow - the void calculation is computationally fairly heavy.
I'll let you know the results when I get them.
I've looked a little further into the discrepancy between ConQuest searching and API searching. Once the difference between ConQuest's and the API's notions of an error is straightened out, ConQuest finds nine extra structures:
'BUDFEN', 'DENLOY', 'HOHKIA', 'HOHLEX', 'HOHLIB', 'IYEXUF', 'MURCIL', 'MURCOR', 'YAFRAY01'
These structures all contain 'Unknown' bond types between a Cu and an O, and so do not get selected by the API search. The structures will be found by the ConQuest search which, in the absence of a 3D parameter, will perform a 2D search.
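If you want to reproduce this kind of comparison yourself, a set difference over the two refcode lists is enough. The sketch below is plain Python with a hypothetical helper name; the nine refcodes are the ones listed above, while the 'SHARED1' and 'SHARED2' entries are placeholder strings standing in for hits common to both searches, not real results.

```python
def only_in_first(first_hits, second_hits):
    """Return the sorted identifiers present in first_hits but not in second_hits."""
    return sorted(set(first_hits) - set(second_hits))


conquest_only = ['BUDFEN', 'DENLOY', 'HOHKIA', 'HOHLEX', 'HOHLIB',
                 'IYEXUF', 'MURCIL', 'MURCOR', 'YAFRAY01']
shared = ['SHARED1', 'SHARED2']  # placeholders, not real refcodes

conquest_hits = shared + conquest_only  # stand-in for an exported ConQuest hit list
api_hits = list(shared)                 # stand-in for the API search identifiers

missing = only_in_first(conquest_hits, api_hits)
print(len(missing))  # 9
print(missing)
```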
There is a good case to be made that a 2D search mode would be useful in the API. I shall raise this for consideration for the next API release.
In the meantime, if you want these structures, use ConQuest; if you can live without them, continue to use the API.
You are certainly going about things the right way.
The API currently uses different criteria from ConQuest for identifying error-flagged structures - this is something we intend to review and fix in the next release. The error flag used by ConQuest is much more appropriate for regular use.
The max_r_factor value should be expressed in percent, i.e.
settings.max_r_factor = 5.0
I'll change the documentation to reflect this.
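If your R-factor thresholds are held as fractions (0.05 rather than 5.0), a small conversion helper avoids mixing the two conventions. This is a plain Python sketch with a hypothetical function name; the only thing taken from the API is that max_r_factor expects a percentage.

```python
def r_factor_as_percent(fractional_r_factor):
    """Convert a fractional R-factor such as 0.05 into a percentage such as 5.0."""
    if not 0.0 <= fractional_r_factor <= 1.0:
        raise ValueError("expected a fraction between 0 and 1, got %r"
                         % (fractional_r_factor,))
    return fractional_r_factor * 100.0


# The value to assign to settings.max_r_factor for a 5% cut-off
max_r_factor = r_factor_as_percent(0.05)
print(max_r_factor)
```

The range check is there to catch the opposite mistake, where a value that is already a percentage gets converted a second time.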
There do seem to be some discrepancies between the ConQuest and the API results, even when errors are ignored: I shall look further into this.