2/
General methods that predict whether a drug binds to a chosen protein need information from both ligand and protein to model their physicochemical interactions.
But to model interactions we typically need to know how the two are arranged relative to each other - the pose.
3/
In docking, this pose has to be sampled by repeatedly evaluating a function that assesses the likelihood of binding or interaction - the scoring function.
Thus the computational cost of getting even a rough binding energy prediction for many molecules quickly gets quite high.
4/
So if we want to quickly get a good guess for not just a few hundred or thousand compounds but millions, we might want to simplify that.
If we can make an educated guess about the binding location in the protein, we could ignore poses and hope to still have enough info to predict!
5/
In our method RASPD+ we take just the most important features that contribute to interactions (e.g. H-bond donors/acceptors, logP, molar refractivity) from the ligand and from a sphere around the binding site of the protein (thus pose-invariant) and train ML models on them.
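To make the "sphere around the binding site" idea concrete, here is a minimal sketch, not the actual RASPD+ code. The atom-type labels, the `(atom_type, (x, y, z))` input format, the function name `site_features` and the 5 Å radius are all assumptions for illustration. The point is that the ligand's coordinates never enter the calculation, so the result cannot depend on the pose.

```python
import math
from collections import Counter


def site_features(protein_atoms, center, radius):
    """Count protein atom types within `radius` of `center`.

    protein_atoms: iterable of (atom_type, (x, y, z)) tuples.
    This input format and the type labels are hypothetical.
    """
    counts = Counter()
    for atom_type, xyz in protein_atoms:
        # Keep only atoms inside the sphere around the assumed binding site.
        if math.dist(xyz, center) <= radius:
            counts[atom_type] += 1
    return counts


# Toy binding site with made-up coordinates (in angstrom).
atoms = [
    ("donor", (1.0, 0.0, 0.0)),
    ("acceptor", (0.0, 2.0, 0.0)),
    ("hydrophobic", (0.0, 0.0, 3.0)),
    ("donor", (10.0, 0.0, 0.0)),  # outside the sphere, ignored
]
features = site_features(atoms, center=(0.0, 0.0, 0.0), radius=5.0)
print(dict(features))  # {'donor': 1, 'acceptor': 1, 'hydrophobic': 1}
```

Ligand descriptors such as logP are properties of the whole molecule, so concatenating them with counts like these would still give a pose-free feature vector.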
6/
This was conceptually already demonstrated with simple linear regression by our coauthors Goutam Mukherjee and B. Jayaram in their initial RASPD approach:
https://doi.org/10.1039/C3CP44697B
While linear models capture general trends, they lose a lot of the information needed for accurate prediction.
7/
As the PDBbind dataset w/ bound protein structures and associated binding data is quite small, we tried several different machine learning methods on our 6 ligand and 14 protein features. We found that random forests performed best for regression and evaluated feature importance.
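A rough sketch of what "random forest regression plus feature importance" looks like with scikit-learn. Everything here is assumed for illustration: the data are random numbers, not PDBbind, the target is made up, and the hyperparameters are not the ones used for RASPD+. Only the 6 + 14 feature split comes from the thread.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_complexes, n_ligand, n_protein = 200, 6, 14

# Synthetic stand-in for the descriptors (NOT real PDBbind data).
X = rng.normal(size=(n_complexes, n_ligand + n_protein))
# Made-up target that depends on one "ligand" and one "protein" column plus noise.
y = 2.0 * X[:, 0] - 1.5 * X[:, 7] + rng.normal(scale=0.1, size=n_complexes)

# Hypothetical hyperparameters, chosen only to keep the example small.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# Impurity-based importances, one per feature, normalised to sum to 1.
importances = model.feature_importances_
ranking = np.argsort(importances)[::-1]
print("most important feature indices:", ranking[:3].tolist())
```

Impurity-based importances are only one option; permutation importance on held-out data is a common alternative when features are correlated.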
8/
The random forests outperformed the simple linear models when the goal was to predict binding free energy for known bound structures.
Yet when the goal was to distinguish binders from computationally generated non-binders, this trend curiously changed.
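Separating binders from non-binders is a ranking task, so it is scored differently from regression. One common choice is the ROC AUC, sketched below in plain Python. Whether this is the metric used in the RASPD+ evaluation is an assumption here, and the scores are invented.

```python
def roc_auc(scores_binders, scores_decoys):
    """Probability that a random binder outranks a random decoy.

    Higher score means "more likely a binder"; ties count as half a win.
    For predicted binding free energies (lower is better), negate them first.
    """
    wins = 0.0
    for b in scores_binders:
        for d in scores_decoys:
            if b > d:
                wins += 1.0
            elif b == d:
                wins += 0.5
    return wins / (len(scores_binders) * len(scores_decoys))


# Invented scores for three binders and four decoys.
binders = [0.9, 0.7, 0.4]
decoys = [0.8, 0.3, 0.2, 0.1]
auc = roc_auc(binders, decoys)
print(round(auc, 3))  # 0.833 (10 of 12 binder/decoy pairs ranked correctly)
```

A model can have a low regression error on known complexes and still rank decoys poorly, which is why the two evaluations can disagree.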
9/
While the accuracy of RASPD+ doesn't generally surpass more elaborate methods (and in some cases might not meet all requirements for your protein), it is faster than docking by a factor of >100, quickly generates guesses for further evaluation and provides a strong baseline.
11/
RASPD+ wouldn't have been possible without Goutam Mukherjee and B. Jayaram, who started with RASPD at @iitdelhi, and Lukas Adam and @Rebecca_Wade_C from the MCM group @HITStudies where I had the pleasure to work on it.
You can follow @SHolderbach.