Protein folding process analysis with MSM

In this tutorial, we demonstrate how to use the HTMD code for analysing a protein folding process, in this case, of the protein Villin.

To analyze, one needs MD trajectories first, which can be generated with HTMD. Here, we already provide the trajectories (data) to analyze. You can download the data from here (Warning: 3GB filesize)

Alternatively, you can download the dataset using wget.

import os
assert os.system('wget -rcN -np -nH -q --cut-dirs=2 -R index.html* http://pub.htmd.org/tutorials/protein-folding-analysis/datasets/') == 0

Getting started

First we import the modules we are going to need for the tutorial

%pylab inline
from htmd.ui import *
config(viewer='webgl')
Populating the interactive namespace from numpy and matplotlib

Please cite HTMD: Doerr et al.(2016)JCTC,12,1845.
https://dx.doi.org/10.1021/acs.jctc.6b00049
Documentation: http://software.acellera.com/
To update: conda update htmd -c acellera -c psi4

You are on the latest HTMD version (unpackaged : /home/joao/maindisk/software/repos/Acellera/htmd/htmd).

Creating a simulation list

For the purposes of an analysis, full atom information is required by HTMD. Therefore each simulation trajectory has to be associated with a corresponding PDB file. Additionally, if you plan to run adaptive, each trajectory needs to be associated with an input directory which contains the files used to run the simulations. The simList function allows us to make these associations

sims = simlist(glob('data/*/'), glob('input/*/structure.pdb'), glob('input/*/'))

Filtering trajectories

Typically waters are removed from the trajectories as they are not used in the analysis and it helps speed up further calculations. These filtered trajectories are written into a new directory. The advantage of filtering is that if your input structures only differ in the number of water molecules, you can produce a single unique pdb file for all simulations once you remove the waters, which will speed up calculations. This step is not necessary for the analysis and you can skip it if you don’t mind the slowdown in calculations. In that case replace in the following commands the fsims variable with sims.

fsims = simfilter(sims, './filtered/', filtersel='not water')

In this tutorial, in order for the trajectories to be downloaded in a timely manner, we provide only the filtered trajectories which are stored in three separate dataset folders. Therefore we will skip the above two commands and construct the simulation list directly from the filtered trajectories.

sets = glob('datasets/*/')
sims = []
for s in sets:
    fsims = simlist(glob(s + '/filtered/*/'), 'datasets/1/filtered/filtered.pdb')
    sims = simmerge(sims, fsims)
Creating simlist: 100%|██████████| 708/708 [00:00<00:00, 3765.24it/s]
Creating simlist: 100%|██████████| 710/710 [00:00<00:00, 3582.33it/s]
Creating simlist: 100%|██████████| 719/719 [00:00<00:00, 3835.12it/s]

Calculating metrics

To build a Markov state model we need to project the atom coordinates onto a lower dimensional space which can be used for clustering the conformations into a set of states. Here we have selected to use the carbon alpha atoms of the protein. This will calculate contacts between all carbon alpha atoms.

metr = Metric(sims)
metr.set(MetricSelfDistance('protein and name CA', metric='contacts'))
data = metr.project()
Projecting trajectories: 100%|██████████| 2137/2137 [00:18<00:00, 117.61it/s]
2018-03-20 00:19:18,232 - htmd.projections.metric - INFO - Frame step 0.10000000149011612ns was read from the trajectories. If it looks wrong, redefine it by manually setting the MetricData.fstep property.

Here we provide the frame-step in nanoseconds i.e. the time that passes between two consecutive frames in a trajectory. This is automatically read from the trajectories, however not all trajectories contain the correct fstep so it can be useful to manually define it like here.

data.fstep = 0.1

Removing trajectories

Sometimes the set of trajectories can contain trajectories of incorrect length. These are typically corrupted trajectories and are removed.

The plotTrajSizes method plots all trajectory lengths sorted

data.plotTrajSizes()
../../_images/protein-folding-analysis_19_0.png

dropTraj has multiple options for removing simulations from the dataset. Here we use it to remove all trajectories whose length is not equal to the mode length.

data.dropTraj()
2018-03-20 00:19:26,084 - htmd.metricdata - INFO - Dropped 7 trajectories from 2137 resulting in 2130
array([ 168,  472,  709, 1507, 1601, 2100, 2111])

TICA

TICA is a method that can be used to improve the clustering of the conformations. This is done by projecting the data onto a lower-dimensional space which separates well the metastable minima and places clusters on the transition regions.

tica = TICA(data, 2, units='ns')
dataTica = tica.project(3)
A Jupyter Widget
/home/joao/maindisk/SANDBOX/miniconda3/miniconda3/lib/python3.6/site-packages/pyemma/__init__.py:91: UserWarning: You are not using the latest release of PyEMMA. Latest is 2.5.1, you have 2.4.
  .format(latest=latest, current=current), category=UserWarning)
A Jupyter Widget

Bootstrapping

If we want to bootstrap our calculations we can at this point drop a random 20% of the trajectories and do the rest of the analysis multiple times to see if the results are consistent. Alternatively we can keep on using dataTica in the following commands.

dataBoot = dataTica.bootstrap(0.8)

Clustering conformations

Once we have cleaned the dataset we proceed to cluster the conformations.

Here we use the mini-batch kmeans clustering algorithm to produce 1000 clusters.

dataBoot.cluster(MiniBatchKMeans(n_clusters=1000))

Building the Markov model

After clustering it is time to build the Markov model.

model = Model(dataBoot)

Before constructing the Markov model we need to choose the lag-time at which it will be built. The lag-time is typically chosen by looking at the implied timescale (ITS) plot and selecting a lag-time at which the top timescales start converging. By constructing Markov models at various lag times HTMD creates a plot which shows the slowest implied timescales of each Markov model at various lag times. If a model is Markovian at a specific lag time, the implied timescales should stay unchanged for any higher lag times. Therefore, given an implied timescales plot, the user can monitor the convergence and choose the lag time at which to construct his Markov model, typically the Markov time which is the shortest lag time at which the timescales converge. Too large lag times can reduce the temporal resolution of the Markov model and can create more statistical uncertainty due to fewer transition counts and thus instability in the implied timescales.

model.plotTimescales()
A Jupyter Widget
20-03-18 00:30:28 pyemma.msm.estimators.implied_timescales.ImpliedTimescales[2] WARNING  Some timescales could not be computed. Timescales array is smaller than expected or contains NaNs
2018-03-20 00:30:28,228 - pyemma.msm.estimators.implied_timescales.ImpliedTimescales[2] - WARNING - Some timescales could not be computed. Timescales array is smaller than expected or contains NaNs
/home/joao/maindisk/SANDBOX/miniconda3/miniconda3/lib/python3.6/site-packages/matplotlib/scale.py:114: RuntimeWarning: invalid value encountered in less_equal
  out[a <= 0] = -1000
../../_images/protein-folding-analysis_35_3.png

After seeing the ITS plot we decide on a lag-time of 25ns. Additionally the ITS plot showed us that there is a separation between 3 slow timescales and the rest of the timescales which are fast. Therefore we choose to lump our microstates together into 4 macrostates.

model.markovModel(25, 4, units='ns')
2018-03-20 00:35:11,623 - htmd.model - INFO - 100.0% of the data was used
2018-03-20 00:35:11,744 - htmd.model - INFO - Number of trajectories that visited each macrostate:
2018-03-20 00:35:11,745 - htmd.model - INFO - [ 133 1650  559   97]

We can also visualize the equilibrium populations of each macrostate using the following command

model.eqDistribution()
../../_images/protein-folding-analysis_39_0.png
array([ 0.00373013,  0.66425711,  0.11840471,  0.21360806])

Once we have a Markov model we can plot the free energy surface by projecting it on any of our projected coordinates. For example to plot it on the first two TICA coordinates we call it like this.

model.plotFES(0, 1, temperature=360)
/home/joao/maindisk/software/repos/Acellera/htmd/htmd/metricdata.py:786: MatplotlibDeprecationWarning: The griddata function was deprecated in version 2.2.
  zi = griddata(x, y, z, xi, yi, interp='linear')
../../_images/protein-folding-analysis_41_1.png

We can also plot the micro and macrostates on top of the FES by setting states=True

model.plotFES(0, 1, temperature=360, states=True)
/home/joao/maindisk/software/repos/Acellera/htmd/htmd/metricdata.py:786: MatplotlibDeprecationWarning: The griddata function was deprecated in version 2.2.
  zi = griddata(x, y, z, xi, yi, interp='linear')
../../_images/protein-folding-analysis_43_1.png

Visualizing the states

To see what the states look like we can use the viewStates method. We load the macrostates and add a protein representation.

model.viewStates(protein=True)
Getting state Molecules: 100%|██████████| 4/4 [00:08<00:00,  2.03s/it]
A Jupyter Widget
A Jupyter Widget

Calculating the kinetics

One of the major advantages of Markov state models is that they can provide quantitative results about the kinetics between states.

Provide the Kinetics constructor with the system temperature. It automatically then calculates the source and sink states.

kin = Kinetics(model, temperature=360)
print(kin.source)
print(kin.sink)
2018-03-20 00:36:47,599 - htmd.kinetics - INFO - Detecting source state...
2018-03-20 00:36:49,155 - htmd.kinetics - INFO - Guessing the source state as the state with minimum contacts.
2018-03-20 00:36:49,157 - htmd.kinetics - INFO - Source macro = 1
2018-03-20 00:36:49,158 - htmd.kinetics - INFO - Detecting sink state...
2018-03-20 00:36:49,189 - htmd.kinetics - INFO - Sink macro = 3
1
3

To see the rates between the source and sink states we use the getRates method.

r = kin.getRates()
print(r)
2018-03-20 00:37:08,689 - htmd.kinetics - INFO - Calculating rates between source: [1] and sink: [3] states.
mfpton = 1.81E+03 (ns)
mfptoff = 2.28E+02 (ns)
kon = 5.54E+05 (1/M 1/s)
koff = 4.40E+06 (1/s)
koff/kon = 7.94E+00 (M)
kdeq = 3.11E+00 (M)
g0eq = 0.81 (kcal/M)

To plot the free energies and mean first passage times of all state use the plotRates() method.

kin.plotRates()
../../_images/protein-folding-analysis_53_0.png ../../_images/protein-folding-analysis_53_1.png ../../_images/protein-folding-analysis_53_2.png

To visualize the kinetic flux pathways between the source and sink states, use the plotFluxPathways method:

kin.plotFluxPathways()
/home/joao/maindisk/SANDBOX/miniconda3/miniconda3/lib/python3.6/site-packages/pyemma/plots/networks.py:544: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if state_labels == 'auto':
/home/joao/maindisk/SANDBOX/miniconda3/miniconda3/lib/python3.6/site-packages/matplotlib/figure.py:459: UserWarning: matplotlib is currently using a non-GUI backend, so cannot show the figure
  "matplotlib is currently using a non-GUI backend, "
Path flux           %path   %of total       path
5.8777605420020455e-05      100.0%  100.0%          [1 3]
../../_images/protein-folding-analysis_55_2.png

And this concludes the protein folding tutorial.