Protein-ligand binding process analysis with MSM#

In this tutorial, we demonstrate how to use the HTMD code for analysing a protein-ligand binding process.

To analyze, one needs MD trajectories first, which can be generated with HTMD. Here, we already provide the trajectories (data) to analyze. You can download the data from here. (Warning: 3GB filesize)

Alternatively, you can download the dataset using wget.

import os
assert os.system('wget -rcN -np -nH -q --cut-dirs=2 -R index.html* http://pub.htmd.org/tutorials/ligand-binding-analysis/datasets/') == 0

Getting started#

First we import the modules we are going to need for the tutorial:

%pylab inline
from htmd.ui import *
config(viewer='webgl')
Populating the interactive namespace from numpy and matplotlib
/shared/alejandro/c03_slim_home/docs_sources/htmd/htmd/versionwarnings.py:30: UserWarning: As of HTMD 1.16 the default number of threads HTMD spawns for calculations is set to 1. You can enable parallelism at your own risk using config(njobs=-2) in the beginning of your scripts. To disable this warning run once: from htmd import _disableWarnings; _disableWarnings('1.16');
  UserWarning,
/shared/alejandro/c03_slim_home/docs_sources/htmd/htmd/versionwarnings.py:38: UserWarning: As of HTMD 1.21 support for ACEMD v2 has stopped. Please use ACEMD3 instead as well as the corresponding equilibration and production protocols. To disable this warning run once: from htmd import _disableWarnings; _disableWarnings('1.21');
  UserWarning,
Please cite HTMD: Doerr et al.(2016)JCTC,12,1845. https://dx.doi.org/10.1021/acs.jctc.6b00049

HTMD Documentation at: https://www.htmd.org/docs/latest/
2021-01-11 09:58:54,079 - binstar - INFO - Using Anaconda API: https://api.anaconda.org
You are on the latest HTMD version (unpackaged : /shared/alejandro/c03_slim_home/docs_sources/htmd/htmd).

Creating a simulation list#

For the purposes of an analysis, full atom information is required by HTMD. Therefore each simulation trajectory has to be associated with a corresponding PDB file. Additionally, if you plan to run adaptive, each trajectory needs to be associated with an input directory which contains the files used to run the simulations. The simlist function allows us to make these associations

sims = simlist(glob('data/*/'), glob('input/*/structure.pdb'), glob('input/*/'))

Filtering trajectories#

Typically waters are removed from the trajectories as they are not used in the analysis and it helps speed up further calculations. These filtered trajectories are written into a new directory. The advantage of filtering is that if your input structures only differ in the number of water molecules, you can produce a single unique pdb file for all simulations once you remove the waters, which will speed up calculations. This step is not necessary for the analysis and you can skip it if you don’t mind the slowdown in calculations. In that case replace in the following commands the fsims variable with sims.

fsims = simfilter(sims, './filtered/', filtersel='not water')

In this tutorial, in order for the trajectories to be downloaded in a timely manner, we provide only the filtered trajectories, which are stored in three separate dataset folders. Therefore we will skip the above two commands and construct the simulation list directly from the filtered trajectories:

sets = glob('datasets/*/')
sims = []
for s in sets:
    fsims = simlist(glob(s + '/filtered/*/'), 'datasets/1/filtered/filtered.pdb')
    sims = simmerge(sims, fsims)
Creating simlist: 100%|██████████| 314/314 [00:00<00:00, 787.40it/s]
Creating simlist: 100%|██████████| 309/309 [00:00<00:00, 837.84it/s]
Creating simlist: 100%|██████████| 230/230 [00:00<00:00, 809.48it/s]

Calculating metrics#

To build a Markov state model we need to project the atom coordinates onto a lower dimensional space which can be used for clustering the conformations into a set of states. For protein systems we typically use the binary contact map between the carbon alpha atoms of the protein. This will calculate contacts between all carbon-alpha atoms.

metr = Metric(sims)
metr.set(MetricDistance('protein and name CA', 'resname MOL and noh', periodic='selections', metric='contacts'))
data = metr.project()
Projecting trajectories: 100%|██████████| 852/852 [02:04<00:00,  6.84it/s]
2021-01-11 10:01:20,229 - htmd.projections.metric - INFO - Frame step 0.001ns was read from the trajectories. If it looks wrong, redefine it by manually setting the MetricData.fstep property.

Here, we also provide the frame-step in nanoseconds i.e. the time that passes between two consecutive frames in a trajectory. This should be automatically read from the trajectories, but some simulation software does not contain the correct fstep in the trajectories, so it can be useful to manually define it:

data.fstep = 0.1

Removing trajectories#

Sometimes the set of trajectories can contain trajectories of incorrect length. These are typically corrupted trajectories and are removed.

The plotTrajSizes method plots all trajectory lengths, sorted:

data.plotTrajSizes()
../../_images/ligand-binding-analysis_19_0.png

The dropTraj method has multiple options for removing simulations from the dataset. Here, we use it to remove all trajectories whose length is not equal to the mode length:

data.dropTraj()
2021-01-11 10:01:49,740 - htmd.metricdata - INFO - Dropped 2 trajectories from 852 resulting in 850
array([496, 562])

TICA#

TICA is a method that can be used to improve the clustering of the conformations. This is done by projecting the data onto a lower-dimensional space which separates well the metastable minima and places clusters on the transition regions.

tica = TICA(data, 2, units='ns')
dataTica = tica.project(3)
HBox(children=(FloatProgress(value=0.0, description='calculate covariances', layout=Layout(flex='2'), max=850.…
HBox(children=(FloatProgress(value=0.0, description='getting output of TICA', layout=Layout(flex='2'), max=850…

Bootstrapping#

If we want to bootstrap our calculations we can at this point drop a random 20% of the trajectories and do the rest of the analysis multiple times to see if the results are consistent. Alternatively we can keep on using dataTica in the following commands.

dataBoot = dataTica.bootstrap(0.8)

Clustering conformations#

Once we have cleaned the dataset we proceed to cluster the conformations.

Here we use the mini-batch kmeans clustering algorithm to produce 1000 clusters.

dataBoot.cluster(MiniBatchKMeans(n_clusters=1000))

Building the Markov model#

After clustering it is time to build the Markov model.

model = Model(dataBoot)

Before constructing the Markov model, we need to choose the lag-time at which it will be built. The lag-time is typically chosen by looking at the implied timescale (ITS) plot and selecting a lag-time at which the top timescales start converging. By constructing Markov models at various lag times HTMD creates a plot which shows the slowest implied timescales of each Markov model at various lag times. If a model is Markovian at a specific lag time, the implied timescales should stay unchanged for any higher lag times. Therefore, given an implied timescales plot, the user can monitor the convergence and choose the lag time at which to construct his Markov model, typically the Markov time which is the shortest lag time at which the timescales converge. Too large lag times can reduce the temporal resolution of the Markov model and can create more statistical uncertainty due to fewer transition counts and thus instability in the implied timescales.

model.plotTimescales()
HBox(children=(FloatProgress(value=0.0, description='estimating MaximumLikelihoodMSM', layout=Layout(flex='2')…
11-01-21 10:07:07 pyemma.msm.estimators.implied_timescales.ImpliedTimescales[29] WARNING  Some timescales could not be computed. Timescales array is smaller than expected or contains NaNs
2021-01-11 10:07:07,710 - pyemma.msm.estimators.implied_timescales.ImpliedTimescales[29] - WARNING - Some timescales could not be computed. Timescales array is smaller than expected or contains NaNs
../../_images/ligand-binding-analysis_35_3.png

After seeing the ITS plot we decided on a lag-time of 50 frames (5ns). Additionally the ITS plot showed us that there is a separation between 4 slow timescales and the rest of the timescales which are fast. Therefore we choose to lump our microstates together into 5 macrostates.

model.markovModel(5, 5, units='ns')
2021-01-11 10:07:39,166 - htmd.model - INFO - 100.0% of the data was used
2021-01-11 10:07:39,188 - htmd.model - INFO - Number of trajectories that visited each macrostate:
2021-01-11 10:07:39,188 - htmd.model - INFO - [152  96 356 128 248]

Once we have a Markov model we can plot the free energy surface by projecting it on any of our projected coordinates. For example to plot it on the first two TICA coordinates we call it like this.

model.plotFES(0, 1, temperature=298)
100%|██████████| 987/987 [00:00<00:00, 1074.55it/s]
../../_images/ligand-binding-analysis_39_1.png

We can also plot the micro and macrostates on top of the FES by setting states=True

model.plotFES(0, 1, temperature=298, states=True)
100%|██████████| 987/987 [00:00<00:00, 1112.00it/s]
../../_images/ligand-binding-analysis_41_1.png

Visualizing the states#

To see what the states look like we can use the viewStates method. We load the macrostates and add a ligand representation using the ligand atomselection.

model.viewStates(ligand='resname MOL and noh')
Getting state Molecules:   0%|          | 0/5 [00:00<?, ?it/s]2021-01-11 10:10:00,547 - moleculekit.molecule - WARNING - Wrapping detected 0 bonds and 4507 atoms. Ignore this message if you believe this is accurate, otherwise make sure you have loaded a topology containing all the bonds of the system before wrapping. The results may be inaccurate. If you want to use guessed bonds use the guessBonds argument.
/shared/alejandro/c03_slim_home/docs_sources/moleculekit/moleculekit/align.py:16: NumbaPerformanceWarning: np.dot() is faster on contiguous arrays, called on (array(float32, 2d, A), array(float32, 2d, A))
  covariance = np.dot(P.T, Q)
/shared/alejandro/c03_slim_home/docs_sources/moleculekit/moleculekit/align.py:54: NumbaPerformanceWarning: np.dot() is faster on contiguous arrays, called on (array(float32, 2d, A), array(float32, 2d, A))
  all1 = np.dot(all1, rot.T)
Getting state Molecules:  20%|██        | 1/5 [00:04<00:16,  4.20s/it]2021-01-11 10:10:03,118 - moleculekit.molecule - WARNING - Wrapping detected 0 bonds and 4507 atoms. Ignore this message if you believe this is accurate, otherwise make sure you have loaded a topology containing all the bonds of the system before wrapping. The results may be inaccurate. If you want to use guessed bonds use the guessBonds argument.
Getting state Molecules:  40%|████      | 2/5 [00:04<00:09,  3.07s/it]2021-01-11 10:10:03,471 - moleculekit.molecule - WARNING - Wrapping detected 0 bonds and 4507 atoms. Ignore this message if you believe this is accurate, otherwise make sure you have loaded a topology containing all the bonds of the system before wrapping. The results may be inaccurate. If you want to use guessed bonds use the guessBonds argument.
Getting state Molecules:  60%|██████    | 3/5 [00:04<00:04,  2.25s/it]2021-01-11 10:10:03,937 - moleculekit.molecule - WARNING - Wrapping detected 0 bonds and 4507 atoms. Ignore this message if you believe this is accurate, otherwise make sure you have loaded a topology containing all the bonds of the system before wrapping. The results may be inaccurate. If you want to use guessed bonds use the guessBonds argument.
Getting state Molecules:  80%|████████  | 4/5 [00:05<00:01,  1.72s/it]2021-01-11 10:10:04,319 - moleculekit.molecule - WARNING - Wrapping detected 0 bonds and 4507 atoms. Ignore this message if you believe this is accurate, otherwise make sure you have loaded a topology containing all the bonds of the system before wrapping. The results may be inaccurate. If you want to use guessed bonds use the guessBonds argument.
Getting state Molecules: 100%|██████████| 5/5 [00:05<00:00,  1.17s/it]
_ColormakerRegistry()
HBox(children=(Checkbox(value=True, description='macro 0'), Checkbox(value=True, description='macro 1'), Check…
NGLWidget()

Calculating the kinetics#

One of the major advantages of Markov state models is that they can provide quantitative results about the kinetics between states.

Provide the Kinetics class with the system temperature and ligand concentration. It automatically then calculates the source and sink states.

kin = Kinetics(model, temperature=298, concentration=0.0037)
2021-01-11 10:10:20,913 - htmd.kinetics - INFO - Detecting source state...
2021-01-11 10:10:21,278 - htmd.kinetics - INFO - Guessing the source state as the state with minimum contacts.
2021-01-11 10:10:21,278 - htmd.kinetics - INFO - Source macro = 2
2021-01-11 10:10:21,279 - htmd.kinetics - INFO - Detecting sink state...
2021-01-11 10:10:21,280 - htmd.kinetics - INFO - Sink macro = 4

To see the rates between the source and sink states we use the getRates method.

r = kin.getRates()
print(r)
2021-01-11 10:10:23,521 - htmd.kinetics - INFO - Calculating rates between source: [2] and sink: [4] states.
2021-01-11 10:10:23,628 - htmd.kinetics - INFO - Concentration correction of -3.32 kcal/mol.
mfpton = 7.14E+02 (ns)
mfptoff = 4.97E+03 (ns)
kon = 3.78E+08 (1/M 1/s)
koff = 2.01E+05 (1/s)
koff/kon = 5.32E-04 (M)
kdeq = 4.31E-04 (M)
g0eq = -4.59 (kcal/M)

To plot the free energies and mean first passage times of all state relative to the source state, use the plotRates method:

kin.plotRates()
../../_images/ligand-binding-analysis_51_0.png ../../_images/ligand-binding-analysis_51_1.png ../../_images/ligand-binding-analysis_51_2.png

To visualize the kinetic flux pathways between the source and sink states, use the plotFluxPathways method:

kin.plotFluxPathways()
Path flux           %path   %of total       path
1.2033666249205421e-05      63.3%   63.3%           [2 4]
6.566536177795934e-06       34.5%   97.8%           [2 3 4]
4.195034357516102e-07       2.2%    100.0%          [2 1 4]
<Figure size 432x288 with 0 Axes>
../../_images/ligand-binding-analysis_53_2.png

And this concludes the ligand binding tutorial.