Saturday, January 13, 2018

Number of possible fragments for the connectivity-based hierarchy scheme



Faithful readers of this blog (hi, mom!) will know that we have been working with the connectivity-based hierarchy (CBH) approach for a while (paper almost done). The method works by breaking molecules up into fragments and capping the cut bonds with hydrogens. In the CBH-1 scheme you fragment into bonds (so propane would be fragmented into 2 ethane molecules), and in the CBH-2 scheme each fragment is an atom that has 2 or more bonds together with all of those bonds (so butane would be fragmented into 2 propane molecules).
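To make the two examples concrete, here is a minimal sketch of the fragmentation step in plain Python. It is only an illustration: the helper names (`capped_formula`, `cbh1_fragments`, `cbh2_fragments`) and the small valence table are made up for this post, it assumes acyclic molecules with single bonds only, and it just lists the hydrogen-capped fragments without the overlap bookkeeping a full CBH scheme needs.

```python
from collections import Counter

# Standard valences used for hydrogen capping (an assumption of this sketch).
VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}


def capped_formula(atoms, bonds, members):
    """Return the formula of the fragment `members`, capped with hydrogens.

    atoms   -- list of element symbols for the heavy atoms
    bonds   -- list of (i, j) index pairs, single bonds only
    members -- indices of the heavy atoms kept in the fragment
    """
    members = set(members)
    counts = Counter(atoms[i] for i in members)
    n_h = 0
    for i in members:
        # Bonds of atom i that stay inside the fragment.
        inside = sum(
            1 for a, b in bonds
            if (a == i and b in members) or (b == i and a in members)
        )
        # Every remaining valence is filled with a hydrogen.
        n_h += VALENCE[atoms[i]] - inside
    counts["H"] = n_h
    # Carbon first, then hydrogen, then everything else alphabetically.
    keys = [k for k in ("C", "H") if counts[k]]
    keys += sorted(k for k in counts if k not in ("C", "H") and counts[k])
    return "".join(k + (str(counts[k]) if counts[k] > 1 else "") for k in keys)


def cbh1_fragments(atoms, bonds):
    """One fragment per heavy-atom bond."""
    return [capped_formula(atoms, bonds, bond) for bond in bonds]


def cbh2_fragments(atoms, bonds):
    """One fragment per atom with 2 or more bonds, including its neighbours."""
    fragments = []
    for i in range(len(atoms)):
        neighbours = [b if a == i else a for a, b in bonds if i in (a, b)]
        if len(neighbours) >= 2:
            fragments.append(capped_formula(atoms, bonds, [i] + neighbours))
    return fragments


propane = (["C", "C", "C"], [(0, 1), (1, 2)])
butane = (["C", "C", "C", "C"], [(0, 1), (1, 2), (2, 3)])

print(cbh1_fragments(*propane))  # two ethanes: ['C2H6', 'C2H6']
print(cbh2_fragments(*butane))   # two propanes: ['C3H8', 'C3H8']
```

Using bare formulas keeps the sketch short, but it cannot tell isomers apart, so anything serious would need a proper molecular representation such as SMILES.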

I started wondering how many different fragments we would need to cover most organic molecules with the CBH-2 scheme, so I wrote some code (shown below) to find out. The number turns out to be 15,670 neutral molecules built from the elements ["C","N","O","F","Si","P","S","Cl","Br","I"].

This number also includes CBH-1 fragments because you need them in the CBH-2 scheme. There are a few special cases missing, such as isocyanide, and there aren't any rings, such as cyclopropane, since these are not made until you get higher up in the CBH hierarchy. Also, there are some very weird molecules that you'll probably never see as a functional group in an organic molecule.

The code considers all possible combinations (so it runs for a long time) and then uses RDKit to figure out if it's a reasonable molecule.
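The "try everything, then filter" idea can be sketched in a few lines. This is not the code referred to above: it only enumerates two-atom (CBH-1 style) heavy-atom skeletons, and a crude valence check stands in for RDKit's judgement of what counts as a reasonable molecule. The valence table uses the lowest common valences, which is an assumption (P and S can go higher), and the function names are made up for illustration.

```python
from itertools import combinations_with_replacement

ELEMENTS = ["C", "N", "O", "F", "Si", "P", "S", "Cl", "Br", "I"]

# Lowest common valences; an assumption of this sketch (P and S can exceed them).
VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1, "Si": 4,
           "P": 3, "S": 2, "Cl": 1, "Br": 1, "I": 1}

BOND_SYMBOL = {1: "-", 2: "=", 3: "#"}


def is_reasonable(a, b, order):
    """Crude stand-in for a real sanity check: the bond order must not
    exceed the valence of either atom."""
    return order <= VALENCE[a] and order <= VALENCE[b]


def enumerate_two_atom_fragments():
    """Try every unordered element pair with every bond order, keep the sane ones."""
    fragments = []
    for a, b in combinations_with_replacement(ELEMENTS, 2):
        for order in (1, 2, 3):
            if is_reasonable(a, b, order):
                # Hydrogens are implicit: they fill whatever valence is left.
                fragments.append(f"{a}{BOND_SYMBOL[order]}{b}")
    return fragments


fragments = enumerate_two_atom_fragments()
print(len(fragments))
print(fragments[:5])
```

The CBH-2 case follows the same pattern, except that each central atom gets every combination of neighbours and bond orders, which is where the long run time and the much larger count come from.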

As mentioned, the code only generates neutral molecules, so the actual number of fragments needed will be higher.


This work is licensed under a Creative Commons Attribution 3.0 Unported License.
