In this tutorial, we demonstrate how to use the HTMD code for analysing a protein folding process, in this case, of the protein Villin.

To analyze, one needs MD trajectories first, which can be generated with
HTMD. Here, we already provide the trajectories (data) to analyze. You
can download the data from
here
(**Warning: 3GB filesize**)

Alternatively, you can download the dataset using `wget`

.

```
import os
assert os.system('wget -rcN -np -nH -q --cut-dirs=2 -R index.html* http://pub.htmd.org/tutorials/protein-folding-analysis/datasets/') == 0
```

First we import the modules we are going to need for the tutorial

```
%pylab inline
from htmd.ui import *
config(viewer='webgl')
```

```
Populating the interactive namespace from numpy and matplotlib
Please cite HTMD: Doerr et al.(2016)JCTC,12,1845.
https://dx.doi.org/10.1021/acs.jctc.6b00049
Documentation: http://software.acellera.com/
To update: conda update htmd -c acellera -c psi4
You are on the latest HTMD version (unpackaged : /home/joao/maindisk/software/repos/Acellera/htmd/htmd).
```

For the purposes of an analysis, full atom information is required by HTMD. Therefore each simulation trajectory has to be associated with a corresponding PDB file. Additionally, if you plan to run adaptive, each trajectory needs to be associated with an input directory which contains the files used to run the simulations. The simList function allows us to make these associations

```
sims = simlist(glob('data/*/'), glob('input/*/structure.pdb'), glob('input/*/'))
```

Typically waters are removed from the trajectories as they are not used in the analysis and it helps speed up further calculations. These filtered trajectories are written into a new directory. The advantage of filtering is that if your input structures only differ in the number of water molecules, you can produce a single unique pdb file for all simulations once you remove the waters, which will speed up calculations. This step is not necessary for the analysis and you can skip it if you don’t mind the slowdown in calculations. In that case replace in the following commands the fsims variable with sims.

```
fsims = simfilter(sims, './filtered/', filtersel='not water')
```

In this tutorial, in order for the trajectories to be downloaded in a timely manner, we provide only the filtered trajectories which are stored in three separate dataset folders. Therefore we will skip the above two commands and construct the simulation list directly from the filtered trajectories.

```
sets = glob('datasets/*/')
sims = []
for s in sets:
fsims = simlist(glob(s + '/filtered/*/'), 'datasets/1/filtered/filtered.pdb')
sims = simmerge(sims, fsims)
```

```
Creating simlist: 100%|██████████| 708/708 [00:00<00:00, 3765.24it/s]
Creating simlist: 100%|██████████| 710/710 [00:00<00:00, 3582.33it/s]
Creating simlist: 100%|██████████| 719/719 [00:00<00:00, 3835.12it/s]
```

To build a Markov state model we need to project the atom coordinates onto a lower dimensional space which can be used for clustering the conformations into a set of states. Here we have selected to use the carbon alpha atoms of the protein. This will calculate contacts between all carbon alpha atoms.

```
metr = Metric(sims)
metr.set(MetricSelfDistance('protein and name CA', metric='contacts'))
data = metr.project()
```

```
Projecting trajectories: 100%|██████████| 2137/2137 [00:18<00:00, 117.61it/s]
2018-03-20 00:19:18,232 - htmd.projections.metric - INFO - Frame step 0.10000000149011612ns was read from the trajectories. If it looks wrong, redefine it by manually setting the MetricData.fstep property.
```

Here we provide the frame-step in nanoseconds i.e. the time that passes between two consecutive frames in a trajectory. This is automatically read from the trajectories, however not all trajectories contain the correct fstep so it can be useful to manually define it like here.

```
data.fstep = 0.1
```

Sometimes the set of trajectories can contain trajectories of incorrect length. These are typically corrupted trajectories and are removed.

The `plotTrajSizes`

method plots all trajectory lengths sorted

```
data.plotTrajSizes()
```

`dropTraj`

has multiple options for removing simulations from the
dataset. Here we use it to remove all trajectories whose length is not
equal to the mode length.

```
data.dropTraj()
```

```
2018-03-20 00:19:26,084 - htmd.metricdata - INFO - Dropped 7 trajectories from 2137 resulting in 2130
```

```
array([ 168, 472, 709, 1507, 1601, 2100, 2111])
```

TICA is a method that can be used to improve the clustering of the conformations. This is done by projecting the data onto a lower-dimensional space which separates well the metastable minima and places clusters on the transition regions.

```
tica = TICA(data, 2, units='ns')
dataTica = tica.project(3)
```

```
A Jupyter Widget
```

```
/home/joao/maindisk/SANDBOX/miniconda3/miniconda3/lib/python3.6/site-packages/pyemma/__init__.py:91: UserWarning: You are not using the latest release of PyEMMA. Latest is 2.5.1, you have 2.4.
.format(latest=latest, current=current), category=UserWarning)
```

```
A Jupyter Widget
```

If we want to bootstrap our calculations we can at this point drop a random 20% of the trajectories and do the rest of the analysis multiple times to see if the results are consistent. Alternatively we can keep on using dataTica in the following commands.

```
dataBoot = dataTica.bootstrap(0.8)
```

Once we have cleaned the dataset we proceed to cluster the conformations.

Here we use the mini-batch kmeans clustering algorithm to produce 1000 clusters.

```
dataBoot.cluster(MiniBatchKMeans(n_clusters=1000))
```

After clustering it is time to build the Markov model.

```
model = Model(dataBoot)
```

Before constructing the Markov model we need to choose the lag-time at which it will be built. The lag-time is typically chosen by looking at the implied timescale (ITS) plot and selecting a lag-time at which the top timescales start converging. By constructing Markov models at various lag times HTMD creates a plot which shows the slowest implied timescales of each Markov model at various lag times. If a model is Markovian at a specific lag time, the implied timescales should stay unchanged for any higher lag times. Therefore, given an implied timescales plot, the user can monitor the convergence and choose the lag time at which to construct his Markov model, typically the Markov time which is the shortest lag time at which the timescales converge. Too large lag times can reduce the temporal resolution of the Markov model and can create more statistical uncertainty due to fewer transition counts and thus instability in the implied timescales.

```
model.plotTimescales()
```

```
A Jupyter Widget
```

```
20-03-18 00:30:28 pyemma.msm.estimators.implied_timescales.ImpliedTimescales[2] WARNING Some timescales could not be computed. Timescales array is smaller than expected or contains NaNs
```

```
2018-03-20 00:30:28,228 - pyemma.msm.estimators.implied_timescales.ImpliedTimescales[2] - WARNING - Some timescales could not be computed. Timescales array is smaller than expected or contains NaNs
/home/joao/maindisk/SANDBOX/miniconda3/miniconda3/lib/python3.6/site-packages/matplotlib/scale.py:114: RuntimeWarning: invalid value encountered in less_equal
out[a <= 0] = -1000
```

After seeing the ITS plot we decide on a lag-time of 25ns. Additionally the ITS plot showed us that there is a separation between 3 slow timescales and the rest of the timescales which are fast. Therefore we choose to lump our microstates together into 4 macrostates.

```
model.markovModel(25, 4, units='ns')
```

```
2018-03-20 00:35:11,623 - htmd.model - INFO - 100.0% of the data was used
2018-03-20 00:35:11,744 - htmd.model - INFO - Number of trajectories that visited each macrostate:
2018-03-20 00:35:11,745 - htmd.model - INFO - [ 133 1650 559 97]
```

We can also visualize the equilibrium populations of each macrostate using the following command

```
model.eqDistribution()
```

```
array([ 0.00373013, 0.66425711, 0.11840471, 0.21360806])
```

Once we have a Markov model we can plot the free energy surface by projecting it on any of our projected coordinates. For example to plot it on the first two TICA coordinates we call it like this.

```
model.plotFES(0, 1, temperature=360)
```

```
/home/joao/maindisk/software/repos/Acellera/htmd/htmd/metricdata.py:786: MatplotlibDeprecationWarning: The griddata function was deprecated in version 2.2.
zi = griddata(x, y, z, xi, yi, interp='linear')
```

We can also plot the micro and macrostates on top of the FES by setting
`states=True`

```
model.plotFES(0, 1, temperature=360, states=True)
```

```
/home/joao/maindisk/software/repos/Acellera/htmd/htmd/metricdata.py:786: MatplotlibDeprecationWarning: The griddata function was deprecated in version 2.2.
zi = griddata(x, y, z, xi, yi, interp='linear')
```

To see what the states look like we can use the `viewStates`

method.
We load the macrostates and add a protein representation.

```
model.viewStates(protein=True)
```

```
Getting state Molecules: 100%|██████████| 4/4 [00:08<00:00, 2.03s/it]
```

```
A Jupyter Widget
```

```
A Jupyter Widget
```

One of the major advantages of Markov state models is that they can provide quantitative results about the kinetics between states.

Provide the Kinetics constructor with the system temperature. It automatically then calculates the source and sink states.

```
kin = Kinetics(model, temperature=360)
print(kin.source)
print(kin.sink)
```

```
2018-03-20 00:36:47,599 - htmd.kinetics - INFO - Detecting source state...
2018-03-20 00:36:49,155 - htmd.kinetics - INFO - Guessing the source state as the state with minimum contacts.
2018-03-20 00:36:49,157 - htmd.kinetics - INFO - Source macro = 1
2018-03-20 00:36:49,158 - htmd.kinetics - INFO - Detecting sink state...
2018-03-20 00:36:49,189 - htmd.kinetics - INFO - Sink macro = 3
```

```
1
3
```

To see the rates between the source and sink states we use the
`getRates`

method.

```
r = kin.getRates()
print(r)
```

```
2018-03-20 00:37:08,689 - htmd.kinetics - INFO - Calculating rates between source: [1] and sink: [3] states.
```

```
mfpton = 1.81E+03 (ns)
mfptoff = 2.28E+02 (ns)
kon = 5.54E+05 (1/M 1/s)
koff = 4.40E+06 (1/s)
koff/kon = 7.94E+00 (M)
kdeq = 3.11E+00 (M)
g0eq = 0.81 (kcal/M)
```

To plot the free energies and mean first passage times of all state use
the `plotRates()`

method.

```
kin.plotRates()
```

To visualize the kinetic flux pathways between the source and sink
states, use the `plotFluxPathways`

method:

```
kin.plotFluxPathways()
```

```
/home/joao/maindisk/SANDBOX/miniconda3/miniconda3/lib/python3.6/site-packages/pyemma/plots/networks.py:544: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
if state_labels == 'auto':
/home/joao/maindisk/SANDBOX/miniconda3/miniconda3/lib/python3.6/site-packages/matplotlib/figure.py:459: UserWarning: matplotlib is currently using a non-GUI backend, so cannot show the figure
"matplotlib is currently using a non-GUI backend, "
```

```
Path flux %path %of total path
5.8777605420020455e-05 100.0% 100.0% [1 3]
```

And this concludes the protein folding tutorial.