Dear Forum,


I have what is probably a rather simple question.  I would like to search the CSD using a SMILES string, but match only compounds in which the whole string is matched (rather than doing a substructure search).  How would one do this?







It's not possible to do an exact SMILES search at the moment. The best option is likely to do the substructure search and then check each resulting hit for the expected atom count. You do have to decide whether to include hydrogens in the count and how you want to handle matches in multi-component systems. The hit.molecule object will include all components. There is also a hit.matched_components list, which contains only those molecular components that match the search, so these are the ones to check. If you want to match complete systems, you can just check hit.molecule. To quickly count the atoms, use the following:

len(mol.atoms) # all atoms
len([a for a in mol.atoms if a.atomic_number > 1]) # non-hydrogens
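
The atom-count check described above can be sketched as a small filter. This is only an illustration, not part of the API: the Atom and Molecule tuples are hypothetical stand-ins so that the snippet runs without a CSD installation, and heavy_atom_count and is_whole_molecule_match are made-up helper names. The helpers rely only on the .atoms and .atomic_number attributes used in the one-liners above, so in practice you would pass in hit.molecule or the matched components instead.

```python
from collections import namedtuple

# Hypothetical stand-ins for illustration only; real molecule objects
# would come from the search hits (e.g. hit.molecule).
Atom = namedtuple("Atom", "atomic_number")
Molecule = namedtuple("Molecule", "atoms")


def heavy_atom_count(mol):
    """Count non-hydrogen atoms on any object exposing .atoms / .atomic_number."""
    return sum(1 for a in mol.atoms if a.atomic_number > 1)


def is_whole_molecule_match(mol, expected_heavy_atoms):
    """True if the molecule has exactly the expected number of non-hydrogen atoms."""
    return heavy_atom_count(mol) == expected_heavy_atoms


# THF (C4H8O): 5 non-hydrogen atoms.
thf = Molecule([Atom(6)] * 4 + [Atom(8)] + [Atom(1)] * 8)
# 2-methyl-THF (C5H10O): contains THF as a substructure, but has 6 non-hydrogen atoms.
methyl_thf = Molecule([Atom(6)] * 5 + [Atom(8)] + [Atom(1)] * 10)

candidates = [thf, methyl_thf]
exact = [m for m in candidates if is_whole_molecule_match(m, 5)]
```

Counting only non-hydrogen atoms avoids false rejections for structures deposited without hydrogen positions, at the cost of not distinguishing species that differ only in hydrogen count.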


Hi Paul,

Thanks for the clarification.  Perhaps this facility could be considered for a future version? To illustrate why it is important (at least to me): I have a long list of common chemical entities and I want to pull their crystal structures (if they exist) and then do additional calculations (such as lattice energies) using interfaces with other program packages (such as Materials Studio).  I want to do this in an automated fashion, so the search could be done on name, SMILES string, or any other type of identifier (except, of course, the CSD identifier, since I do not have prior knowledge of it). The problem with a substructure search or a name search is that, for simple chemical entities, there are far too many hits to go through to find the correct structure.  For example, checking whether a crystal structure of THF exists gives >20,000 hits from a name search alone, and benzene is similar. With a list of 1000+ chemicals, this quickly becomes difficult to handle.

Thanks, Geoff 



I agree it would be a worthwhile addition. I have added it to our system as an enhancement request. However, it won't be included in the next release in November as that is currently undergoing final bug fixing. Hopefully next year.


There is a workaround to this problem along the lines of:

from ccdc import search

smiles = 'something interesting'

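searcher = search.SubstructureSearch()
# Use the SMILES as the substructure query, then keep only single-component
# hits whose SMILES is identical to the query string. This assumes the query
# is written in the same form as the SMILES the API generates.
searcher.add_substructure(search.SMARTSSubstructure(smiles))
hits = [h for h in searcher.search() if len(h.molecule.components) == 1 and h.molecule.smiles == smiles]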

Best wishes

