SureChEMBL iPython Notebook Tutorial

An introduction to patent chemoinformatics using SureChEMBL data and the RDKit toolkit

George Papadatos, ChEMBL group, EMBL-EBI

In this tutorial:

  1. Read a file that contains all chemistry extracted from the Levitra US patent (US6566360) along with all the other members of the same patent family.
  2. Filter by different text-mining and chemoinformatics properties to remove noise and enrich the genuinely novel structures claimed in the patent documents.
  3. Visualise the chemical space using MDS and dimensionality reduction.
  1. Identify outliers in the chemistry space.
  2. Fix wrongly extracted structures.
  1. Find the Murcko and MCS scaffolds of the claimed compounds in the patent family.
  1. Compare the derived core with the actual Markush structure as reported in the original patent document.
  1. Identify key compounds using structural information only. This is based on the publications by Hattori et al. and Tyrchan et al..
  1. Count the number of NNs per compound.
  2. Perform R-Group decomposition.
In [80]:
import matplotlib.pyplot as plt
import pandas as pd

from IPython.display import display
from IPython.display import display_pretty, display_html, HTML, Javascript, Image

from rdkit.Chem import PandasTools
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import AllChem as Chem
from rdkit.Chem import DataStructs
from rdkit.Chem import MCS
from rdkit.Chem import Draw
Draw.DrawingOptions.elemDict[0]=(0.,0.,0.)  # draw dummy atoms in black

from itertools import cycle
from sklearn import manifold
from scipy.spatial.distance import *

import mpld3
mpld3.enable_notebook()

import warnings
warnings.filterwarnings('ignore')
In [81]:
from rdkit import rdBase
rdBase.rdkitVersion
Out[81]:
'2014.03.1'

This is the base URL of the publicly accessible ChEMBL Beaker server.

In [82]:
base_url = 'www.ebi.ac.uk/chembl/api'
In [83]:
pd.options.display.mpl_style = 'default'
pd.options.display.float_format = '{:.2f}'.format
In [84]:
rcParams['figure.figsize'] = 16,10

1. Read and filter the input file

The file was generated by extracting all chemistry from a list of patent documents in SureChEMBL. The Lucene query used to retrieve the list of relevant patents was: pn:"US6566360".

In [85]:
df = pd.read_csv('document chemistry_20141011_114421_271.csv',sep=',')

Let's check the contents:

In [86]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8497 entries, 0 to 8496
Data columns (total 28 columns):
Patent ID                    8497 non-null object
Annotation Reference         8497 non-null object
Chemical ID                  8497 non-null int64
SMILES                       8497 non-null object
Type                         8497 non-null object
Chemical Document Count      8497 non-null int64
Annotation Document Count    3372 non-null float64
Title Count                  3372 non-null float64
Abstract Count               3372 non-null float64
Claims Count                 3372 non-null float64
Description Count            3372 non-null float64
Chemical Corpus Count        8497 non-null int64
Annotation Corpus Count      3372 non-null float64
Molecular Weight             8497 non-null float64
Med Chem Alert               8497 non-null int64
Log P                        8497 non-null float64
Donor Count                  8497 non-null int64
Acceptor Count               8497 non-null int64
Ring Count                   8497 non-null int64
Rotatable Bound Count        8497 non-null int64
Radical                      8497 non-null int64
Fragment                     8497 non-null int64
Connected                    8497 non-null int64
Singleton                    8497 non-null int64
Simple                       8497 non-null int64
Lipinski                     8497 non-null int64
Lead Likeness                8497 non-null int64
Bio Availability             8497 non-null int64
dtypes: float64(8), int64(16), object(4)
In [87]:
df.shape
Out[87]:
(8497, 28)

Add a SCHEMBL ID column and sort:

In [88]:
df['SCHEMBL ID'] = df['Chemical ID'].map(lambda x: 'SCHEMBL{0}'.format(x))
df = df.sort('Chemical ID')
In [89]:
#df.head()

First round of filtering: Novel compounds appear in the description or claims section of the document. Alternatively, they are extracted from images or mol files

In [90]:
dff = df[(df['Claims Count'] > 0) | (df['Description Count'] > 0) | (df['Type'] != "TEXT")]
In [91]:
dff.shape
Out[91]:
(8484, 29)

Second round of filtering: Simple physicochemical properties and counts

In [92]:
dff = dff[(dff['Rotatable Bound Count'] < 15) & (dff['Ring Count'] > 0) & (df['Radical'] == 0) & (df['Singleton'] == 0) & (df['Simple'] == 0)]
In [93]:
dff.shape
Out[93]:
(8218, 29)
In [94]:
dff = dff[(dff['Molecular Weight'] >= 300) & (dff['Molecular Weight'] <= 800) & (dff['Log P'] > 0) & (dff['Log P'] < 7)]
In [95]:
dff.shape
Out[95]:
(6711, 29)

Third round of filtering: Using Global Corpus Count

In [96]:
dff = dff[(dff['Chemical Corpus Count'] < 400) & ((dff['Annotation Corpus Count'] < 400) | (dff['Annotation Corpus Count'].isnull()))]
In [97]:
dff.shape
Out[97]:
(6459, 29)

Keep largest fragment and convert SMILES to RDKit molecules

In [98]:
dff['SMILES'] = dff['SMILES'].map(lambda x: max(x.split('.'), key=len))
In [99]:
PandasTools.AddMoleculeColumnToFrame(dff, smilesCol = 'SMILES')
In [100]:
#dff.head(10)

Fourth round of filtering: Remove duplicates based on Chemical IDs and InChI keys

In [101]:
dff['InChI'] = dff['ROMol'].map(Chem.MolToInchi)
dff['InChIKey'] = dff['InChI'].map(Chem.InchiToInchiKey)
In [102]:
#dff.head()
In [103]:
dff = dff.drop_duplicates('Chemical ID')
In [104]:
dff.shape
Out[104]:
(401, 32)
In [105]:
dff = dff.drop_duplicates('InChIKey')
In [106]:
dff.shape
Out[106]:
(394, 32)

Wow, that was a lot of duplicates. This is because the same compounds come from different patents in the same patent family. In addition, in US patents a compound may come from 3 different sources: text, image and mol file.

Fifth round of filtering: Remove Boron-containing compounds as they are likely to be reactants.

In [107]:
dff = dff.ix[~(dff['ROMol'] >= Chem.MolFromSmarts('[#5]'))]
In [108]:
dff.shape
Out[108]:
(394, 32)
In [109]:
dff = dff.set_index('Chemical ID', drop=False)

That was the end of the filters - let's see a summary of the remaining compounds and their patent document source:

In [110]:
dff_counts = dff[['Patent ID','ROMol']].groupby('Patent ID').count()
In [111]:
dff_counts['Link'] = dff_counts.index.map(lambda x: '<a href="https://www.surechembl.org/document/{0}/" target="_blank">{0}</a>'.format(x))
In [112]:
dff_counts = dff_counts.rename(columns={'ROMol':'# Compounds'})
In [113]:
dff_counts
Out[113]:
# Compounds Link
Patent ID
EP-1049695-A1 2 EP-1049695-A1
EP-1049695-B1 7 EP-1049695-B1
EP-1174431-A2 1 EP-1174431-A2
EP-2295436-A1 68 EP-2295436-A1
US-20040067945-A1 4 US-20040067945-A1
US-20050070541-A1 2 US-20050070541-A1
US-20060189615-A1 2 US-20060189615-A1
US-20080113972-A1 34 US-20080113972-A1
US-20100016323-A1 29 US-20100016323-A1
US-20110009367-A1 43 US-20110009367-A1
US-20130059844-A1 46 US-20130059844-A1
US-6362178-B1 6 US-6362178-B1
US-6566360-B1 4 US-6566360-B1
US-6890922-B2 2 US-6890922-B2
US-7122540-B2 7 US-7122540-B2
US-7314871-B2 43 US-7314871-B2
US-7696206-B2 32 US-7696206-B2
US-7704999-B2 59 US-7704999-B2
WO-1999024433-A1 3 WO-1999024433-A1

Right, is Levitra included in this set? ChEMBL tells me that its InChiKey is 'SECKRCOLJRRGGV-UHFFFAOYSA-N'.

In [114]:
dff.ix[dff.InChIKey == 'SECKRCOLJRRGGV-UHFFFAOYSA-N',['SCHEMBL ID','ROMol']]
Out[114]:
SCHEMBL ID ROMol
Chemical ID
5441 SCHEMBL5441 Mol

Good - it is...

2. Visualise the patent chemical space - MDS

Preparation of the pairwise distance matrix.

In [115]:
fps = [Chem.GetMorganFingerprintAsBitVect(m,2,nBits=2048) for m in dff['ROMol']]
In [116]:
dist_mat = squareform(pdist(fps,'jaccard'))
In [117]:
mds = manifold.MDS(n_components=2, dissimilarity="precomputed", random_state=3, n_jobs = 2)
In [118]:
results = mds.fit(dist_mat)
In [119]:
coords = results.embedding_
In [120]:
dff['X'] = [c[0] for c in coords]
In [121]:
dff['Y'] = [c[1] for c in coords]
In [122]:
csss = """
table
{
  border-collapse: collapse;
}
th
{
  color: #ffffff;
  background-color: #848482;
}
td
{
  background-color: #f2f3f4;
}
table, th, td
{
  font-family:Arial, Helvetica, sans-serif;
  border: 1px solid black;
  text-align: right;
}
"""
In [123]:
import base64
In [124]:
dff['base64smi'] = dff['SMILES'].map(base64.b64encode)

Let's produce a scatter plot of the chemical space - points represent compounds, color-coded by the patent document they were found in. Thanks to mpld3, the scatter plot is interactive with live structure rendering calls to the local myChEMBL Beaker server.

In [125]:
fig, ax = plt.subplots()
fig.set_size_inches(14.0, 12.0)
#colors = cycle('bgrcmykwbgrcmykbgrcmykwbgrcmykw')
colors = cycle(cm.Dark2(np.linspace(0,1,len(dff_counts.index))))
for name, group in dff.groupby('Patent ID'):
    labels = []
    for i in group.index:
        zz = group.ix[[i],['Patent ID','SCHEMBL ID','base64smi']]
        zz['mol'] = zz['base64smi'].map(lambda x: '<img src="https://{0}/utils/smiles2image/{1}/300">'.format(base_url,x))
        del zz['base64smi']
        label = zz.T
        del zz
        label.columns = ['Row: {}'.format(i)]
        labels.append(str(label.to_html()))
    points = ax.scatter(group['X'], group['Y'],c=colors.next(), s=80, alpha=0.8)
    tooltip = mpld3.plugins.PointHTMLTooltip(points, labels, voffset=10, hoffset=10, css=csss)
    mpld3.plugins.connect(fig,tooltip)                                          

Nice, we get Beaker rendering on the fly. However, there is a bunch of compounds on the left that are distinctively different. This is because of an image extraction error.
At least the error is systematic: extraction breaks the imidazole ring. Let's try to 'repair' the ring and rescue these compounds - we'll use an appropriate intramolecular reaction SMARTS.

In [126]:
resmarts = "([R:5][C;!R:1]=[C;H2].[C;H0,H1;!R:4]-[N;H2:2])>>[R:5][*:1]=[*:2]-[*:4]"
In [127]:
Draw.ReactionToImage(Chem.ReactionFromSmarts(resmarts))
Out[127]:
In [128]:
def fixRing(mol,resmarts=resmarts):
    rxn = Chem.ReactionFromSmarts(resmarts)
    ps = rxn.RunReactants((mol,))
    if ps:
        m = ps[0][0]
        Chem.SanitizeMol(m)
        return m
    else:
        return mol

So this...

In [129]:
mm = dff.ix[12823029]['ROMol']
mm
Out[129]:

...will become this:

In [130]:
fixRing(mm)
Out[130]:

Let's apply the repair to all affected structures:

In [131]:
for i, p, m in zip(dff.index, dff['Patent ID'], dff['ROMol']):
    if p == 'EP-2295436-A1':
        dff.ix[i,'ROMol'] = fixRing(m)
In [132]:
dff['InChI'] = dff['ROMol'].map(Chem.MolToInchi)
dff['InChIKey'] = dff['InChI'].map(Chem.InchiToInchiKey)
dff = dff.drop_duplicates('InChIKey')

So after the fix, we have even less structures because some of the fixed ones already existed. Hopefully, however, we have less false positives too:

In [133]:
dff.shape
Out[133]:
(363, 35)
In [134]:
dff['RD_SMILES'] = dff['ROMol'].map(Chem.MolToSmiles)
dff['base64rdsmi'] = dff['RD_SMILES'].map(base64.b64encode)

We're going to re-calculate the distance matrix and plot the MCS just to double-check they were no side-effects from the applications of the reaction.

In [135]:
fps = [Chem.GetMorganFingerprintAsBitVect(m,2,nBits=2048) for m in dff['ROMol']]
dist_mat = squareform(pdist(fps,'jaccard'))
mds = manifold.MDS(n_components=2, dissimilarity="precomputed", random_state=3, n_jobs = 2)
results = mds.fit(dist_mat)
coords = results.embedding_
dff['X'] = [c[0] for c in coords]
dff['Y'] = [c[1] for c in coords]
In [136]:
fig, ax = plt.subplots()
fig.set_size_inches(14.0, 12.0)
#colors = cycle('bgrcmykwbgrcmykbgrcmykwbgrcmykw')
colors = cycle(cm.Dark2(np.linspace(0,1,len(dff_counts.index))))
for name, group in dff.groupby('Patent ID'):
    labels = []
    for i in group.index:
        zz = group.ix[[i],['Patent ID','SCHEMBL ID','base64rdsmi']]
        zz['mol'] = zz['base64rdsmi'].map(lambda x: '<img src="https://{0}/utils/smiles2image/{1}/300">'.format(base_url,x))
        del zz['base64rdsmi']
        label = zz.T
        del zz
        label.columns = ['Row: {}'.format(i)]
        labels.append(str(label.to_html()))
    points = ax.scatter(group['X'], group['Y'],c=colors.next(), s=80, alpha=0.8)
    tooltip = mpld3.plugins.PointHTMLTooltip(points, labels, voffset=10, hoffset=10, css=csss)
    mpld3.plugins.connect(fig,tooltip)  

So the "island" of outliers has disappeared. There's only one obvious outlier on the far left. This might be a reactant which wasn't filtered out.

3. Define the MCS for this set of claimed molecules

A helper function, inspired by Greg Landrum's post here.

In [137]:
def MCS_Report(ms,atomCompare='any',**kwargs):
    mcs = MCS.FindMCS(ms,atomCompare=atomCompare,timeout=120,**kwargs)
    nAts = np.array([x.GetNumAtoms() for x in ms])
    print 'Mean nAts: {0}, mcs nAts: {1}'.format(nAts.mean(),mcs.numAtoms)
    print 'MCS SMARTS: {0}'.format(mcs.smarts)
    mcsM = Chem.MolFromSmarts(mcs.smarts)
    # find the common atoms
    if atomCompare == 'any':
        mcsM2 = Chem.MolFromSmiles(mcs.smarts,sanitize=False)
        atNums=[-1]*mcsM.GetNumAtoms()
        for m in ms:
            match = m.GetSubstructMatch(mcsM) 
            if not match:
                continue
            for qidx,midx in enumerate(match):
                anum = m.GetAtomWithIdx(midx).GetAtomicNum()
                if atNums[qidx]==-1:
                    atNums[qidx]=anum
                elif anum!=atNums[qidx]:
                    atNums[qidx]=0
        for idx,atnum in enumerate(atNums):
            if atnum>0:
                mcsM.GetAtomWithIdx(idx).SetAtomicNum(atnum)
            
    mcsM.UpdatePropertyCache(False)
    Chem.SetHybridization(mcsM)
    smi = Chem.MolToSmiles(mcsM,isomericSmiles=True)
    print "MCS SMILES: {0}".format(smi)
    img=Draw.MolToImage(mcsM,kekulize=False)
    return mcs.smarts,smi,img,mcsM

Let's add the Murcko framework to our data frame:

In [138]:
PandasTools.AddMurckoToFrame(dff)
In [139]:
PandasTools.AddMoleculeColumnToFrame(dff,smilesCol='Murcko_SMILES', molCol='MurMol')
In [140]:
PandasTools.AlignToScaffold(dff)
In [141]:
dff[['ROMol','MurMol']].head()
Out[141]:
ROMol MurMol
Chemical ID
5441 Mol Mol
5517 Mol Mol
5530 Mol Mol
5537 Mol Mol
5606 Mol Mol

And now we'll extract the MCS core:

In [142]:
mols = list(dff.MurMol)
smarts,smi,img,mcsM = MCS_Report(mols,threshold=0.8,ringMatchesRingOnly=True)
img
Mean nAts: 24.6914600551, mcs nAts: 16
MCS SMARTS: [*]=!@[*]:1:[*]:[*](:[*]:[*]:2:[*]:[*]:[*]:[*]:1:2)-!@[*]:1:[*]:[*]:[*]:[*]:[*]:1
MCS SMILES: O=C1:N:C(C2:C:C:C:C:C:2):N:N2:C:[*]:C:C:1:2

Out[142]:

4. Compare the extracted MCS core with the actual claimed Markush structure.

Given the MCS core, we'll do an R-Group decomposition first to define the R-Groups and their attachement points.

In [143]:
core = Chem.MolFromSmarts(smarts)    
In [144]:
pind = []
rgroups = []
for i, m in zip(dff.index, dff.ROMol):
    rgrps = Chem.ReplaceCore(m,core,labelByIndex=True)
    if rgrps:
        s = Chem.MolToSmiles(rgrps, True)
        for e in s.split("."):
            pind.append(e[1:4].replace('*','').replace(']',''))
            rgroups.append(e)
        #print "R-Groups: ", i, s
In [145]:
pind = sorted(list(set(pind)),key=int)
In [146]:
pind = map(int,pind)
In [147]:
Draw.MolToImage(core,includeAtomNumbers=True,highlightAtoms=pind)
Out[147]:

Is the MCS similar to the claimed Markush structure? Let's fetch the original document in PDF and check in page 4.

Pretty close!

5. Identify key compounds - in this case, vardenafil itself

This is based on the publication by Hattori et al. and Tyrchan et al.. First we'll try the Hattori et al. version of finding the compounds with the highest number of nearest neighbours (NNs) for a given similarity threshold. We will need to calculate the similarity matrix.

In [149]:
sim_mat = []
CUTOFF = 0.8
for i,fp in enumerate(fps):
    tt = DataStructs.BulkTanimotoSimilarity(fps[i],fps)
    for i,t in enumerate(tt):
        if tt[i] < CUTOFF:
            tt[i] = None
    sim_mat.append(tt)
sim_mat=numpy.array(sim_mat) 
In [150]:
sim_df = pd.DataFrame(sim_mat, columns = dff['Chemical ID'], index=dff['Chemical ID'])
In [151]:
sim_df.head()
Out[151]:
Chemical ID 5441 5517 5530 5537 5606 5613 5673 12070 13527 13579 13598 13607 13631 13694 13812 13825 13848 14279 972506 972565 972730 973032 973183 973241 973332 973348 973349 973382 973386 973387 973603 973661 973671 973678 973718 973727 973746 973867 973955 974006 974073 974080 974182 974189 974215 974259 974265 974295 974365 974554 974562 974563 974600 974749 974752 974824 974886 975008 975154 975172 975193 975207 975233 975270 975295 975296 975385 975388 975390 975490 975627 1456170 1456512 1457300 3173334 3182150 4338924 4580977 5931932 5949554 6400765 6793077 7126552 7130126 7419685 7601429 8251760 8255604 8256987 8259092 12822592 12822595 12822605 12822651 12822661 12822664 12822668 12822851 12822865 12822870 12822871 12822873 12822881 12822901 12822909 12822916 12823010 12823018 12823023 12823025 12823027 12823028 12823029 12823050 12823052 12823062 12823071 12823075 12823076 12823078 12823080 12823083 12823089 12823092 12823099 12823105 12823108 12823115 12823120 12823122 12823128 12823129 12823132 12823133 12823134 12823135 12823139 12823141 12823147 12823154 12823156 12823172 12823185 12823190 12823225 12823228 12823229 12823233 12823240 12823262 12823273 12823289 12823297 12823314 12823320 12823321 12823325 12823326 12823329 12823330 12823342 12823354 12823372 12823405 12823411 12823425 12823426 12823511 12823517 12823522 12823528 12823541 12823546 12823547 12823550 12823557 12823559 12823560 12823564 12823567 12823579 12823582 12823596 12823601 12823604 12823607 12823609 12823610 12823611 12823614 12823621 12823625 12823626 12823629 12823632 12823633 12823634 12823635 12823636 12823637 12823642 12823643 12823644 12823645 12823648 12823650 12823652 12823654 12823739 12823745 12823752 12823756 12823757 12823758 12823762 12823763 12823764 12823765 12823767 12823768 12823769 12823770 12823771 12823772 12823773 12823774 12823775 12823777 12823778 12823779 12823781 12823786 12823790 12823791 12823792 12823793 12823795 12823796 12823797 12823799 12823800 12823801 12823802 12823805 12823808 12823810 12823815 12823816 12823818 12823819 12823821 12823823 12823967 12823974 12823977 12823978 12823982 12823983 12823984 12823985 12823986 12823987 12823988 12823992 12823998 12823999 12824000 12824002 12824003 12824004 12824005 12824006 12824007 12824008 12824009 12824010 12824011 12824012 12824013 12824016 12824017 12824018 12824022 12824024 12824026 12824028 12824029 12824030 12824031 12824034 12824035 12824036 12824037 12824038 12824040 12824041 12824042 12824044 12824047 12824049 12824053 12824055 12824057 12824059 12824060 12824062 12824063 12824064 12824065 12824066 12824067 12824068 12824069 12824070 12824071 12824072 12824073 12824074 12824080 12824081 12824085 12824086 12824087 12824089 12922516 12922517 12922519 12922520 12922521 12922522 12922527 12922529 12922530 12922536 12922538 12922546 12922547 12922549 12922550 12922555 12922558 12922560 12922561 12922563 12922566 12922567 12922569 12922570 12922574 12922577 12922578 12922579 13401084 13401207 13401208 13435578 13533556 14199034 14278252 14278326 14301695 14752321 14752362
Chemical ID
5441 1.00 0.86 None 0.87 0.80 None None None None None None None None 0.87 None None None None None None 0.83 None None None None None None None None None None None None 0.83 None None None None None None None None None None None None None None None None None None None None None None None None None None None None None 0.85 None None None None None None None None None None None None None None 0.84 None None None None None None None None None 0.81 0.83 None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None 0.81 None None None None None None None None None None None None None None None None None None None None None None None None None None None None None 0.82 None None None None None None None None None None None None None None None None None None None None None None None 0.83 None None None None None None None None None None None None None None None None None None None None None None 0.83 None None None None None None None None None None None None None None None None None None None None 0.81 None None None None None None None None None None None None None None None None 0.83 None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None 0.85 None None None None None None None None None None 0.83 None None None None None None None 0.83 None 0.83 None None None None None None None None None None None None None None None None None None None None None None None None None None None None 0.84 None None None None None
5517 0.86 1.00 None 0.84 None None None None None None None None None None None None None None None None 0.92 None None None None None None None None None None None None None None None None 0.85 None None None None None None None None None None None None None None None None None None None None None None 0.90 None None 0.85 None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None 0.80 None None None None None None None None None None 0.81 None None None None None None None None None 0.84 None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None
5530 None None 1.00 0.84 None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None 0.85 None None None None None None None None None 0.82 None None None None None None None None None None None None None None None None None None None None None 0.81 None None None None None None None None None None None None None None None None None None None None None None None None None None 0.94 None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None 0.84 None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None 0.81 None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None 0.87 None None None None None
5537 0.87 0.84 0.84 1.00 0.81 None None None None None None 0.81 None None None None None None None None 0.82 None None None None None None None None None None None None None None None None None None None None None None None None 0.89 None None None None None None None None None None None None None None None None None 0.83 None None None None None None None None None None None None None None 0.85 None None None None None None None None None 0.82 0.84 None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None 0.80 None None None None None None None None None 0.82 None None None None None None None None None None None None None None None None None None None None None None None None None None None None None 0.94 None None None None None None None None None None None None None None None None None None None None None None None 0.84 None None None None None None None None None None None None None None None None None None None None None None 0.84 None None None None None None None None None None None None None None None None None None None None 0.83 None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None 0.86 None None None None None None None None None None 0.84 None None None None None None None None None 0.82 None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None
5606 0.80 None None 0.81 1.00 None None None None None None None None None None None None None 0.90 None None None None None None None None None None None None None None None None None None None None None None None None None None None None 0.85 None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None 0.81 None None None None None None None None None None 0.80 None None 0.91 None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None 0.87 None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None 0.82 None None None None None None None None None None None 0.80 None None None None None None None None None None None None None None None None None None None None None None None None None 0.84 None None None None None None None None None None None None None None None None None 0.87 None None None None None None None None None None None None None None None None None None 0.83 None None None None None None None None None None None None None None None None None None None None None None None None None 0.81 None None None None None None None None None None None None None None None None None 0.83 None None None None None None None None None None 0.81 None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None None
In [152]:
d = sim_df.count().to_dict()
nn_df = pd.DataFrame(d.items(),columns=['Chemical ID', '# NNs'])
nn_df.set_index('Chemical ID',inplace=True)
nn_df.head()
Out[152]:
# NNs
Chemical ID
975193 5
974886 3
1456170 1
12822592 2
12822595 1
In [153]:
pd.merge(dff[['Chemical ID','ROMol']],nn_df,left_index=True, right_index=True, sort=False).sort('# NNs',ascending=False).head()
Out[153]:
Chemical ID ROMol # NNs
Chemical ID
12823654 12823654 Mol 23
5441 5441 Mol 22
12824064 12824064 Mol 21
5537 5537 Mol 21
12822873 12822873 Mol 18

So, the method seems to be working in this case. Vardenafil has the second largest number of NNs with a cutoff of 0.8. Next, we'll try one of the Tyrchan et al. methods (Frequency of Groups, FOG) and count the frequency of the derived R-Groups:

In [154]:
from collections import Counter
rgroups = [r.replace('[15*]','[11*]').replace('[14*]','[12*]') for r in rgroups]
c = Counter(rgroups)
In [155]:
rgroup_counts = pd.DataFrame(c.most_common(20),columns=['RGroups','Frequency'])
rgroup_counts.head(10)
Out[155]:
RGroups Frequency
0 [6*]CCC 349
1 [11*]OCC 306
2 [8*]C 294
3 [8*]CC 63
4 [11*]OCCC 28
5 [11*]O 16
6 [6*]C 10
7 [12*]S(=O)(=O)N1CCN(C)CC1 8
8 [13*]OC 8
9 [12*]S(=O)(=O)N(CC)CCO 7
In [156]:
PandasTools.AddMoleculeColumnToFrame(rgroup_counts, smilesCol = 'RGroups')
rgroup_counts.head(10)
Out[156]:
RGroups Frequency ROMol
0 [6*]CCC 349 Mol
1 [11*]OCC 306 Mol
2 [8*]C 294 Mol
3 [8*]CC 63 Mol
4 [11*]OCCC 28 Mol
5 [11*]O 16 Mol
6 [6*]C 10 Mol
7 [12*]S(=O)(=O)N1CCN(C)CC1 8 Mol
8 [13*]OC 8 Mol
9 [12*]S(=O)(=O)N(CC)CCO 7 Mol
In [157]:
rgroup_counts[[r.startswith('[12*]') for r in rgroup_counts.RGroups]].head(10)
Out[157]:
RGroups Frequency ROMol
7 [12*]S(=O)(=O)N1CCN(C)CC1 8 Mol
9 [12*]S(=O)(=O)N(CC)CCO 7 Mol
10 [12*]S(=O)(=O)Cl 6 Mol
11 [12*]S(=O)(=O)NCCCN1CCOCC1 4 Mol
12 [12*]S(=O)(=O)N1CCN(CCO)CC1 4 Mol
13 [12*]S(=O)(=O)N1CCC(O)CC1 4 Mol
14 [12*]S(=O)(=O)N(CC)CC 4 Mol
15 [12*]S(=O)(=O)N1CCN(CC)CC1 4 Mol
16 [12*]S(=O)(=O)N1CCC(CO)CC1 3 Mol
17 [12*]S(=O)(=O)NCCOC 3 Mol

Although the most frequent R-groups in positions 6, 8 and 11 correspond to those in Vardenafil (see below), the exact piperazine R-group in position 12 is the 7th most frequent one in that position.

In [158]:
dff.ix[dff.InChIKey == 'SECKRCOLJRRGGV-UHFFFAOYSA-N',['SCHEMBL ID','ROMol']]
Out[158]:
SCHEMBL ID ROMol
Chemical ID
5441 SCHEMBL5441 Mol