Protein folding MSM: villin headpiece#

You will learn: how to take a set of unbiased MD trajectories of a small protein and build a Markov-state model that gives you the folded / unfolded macrostates, the folding timescale, and the folding rate.

Prerequisites:

HTMD installed.
wget available on $PATH (or substitute your favourite downloader).
Patience for a one-time ~1 GB dataset download.

The system#

The villin headpiece is a 35-residue protein with a single hydrophobic core - one of the smallest naturally-occurring autonomously folding domains. We use a set of unbiased explicit-solvent trajectories run at 360 K (above the melting point, so both folded and unfolded states are populated). Total aggregate ≈ 107 µs across 2137 short trajectories.

The flow#

The MSM pipeline you’ll follow:

Download + index the filtered trajectories with simlist().
Project the geometry onto a low-dimensional distance space with MetricSelfDistance.
Reduce dimensions with TICA - find the slow collective coordinates.
Cluster the TICA space into ~1000 microstates.
Build a Markov model: choose a lag time from the implied-timescales plot, then lump microstates into macrostates with PCCA.
Read kinetics: equilibrium populations, free-energy surface, mean first-passage times.

Setup#

import os
from glob import glob
from pathlib import Path
from htmd.ui import (
    simlist, simmerge,
    Metric, MetricSelfDistance,
    TICA, Model, Kinetics,
)
from sklearn.cluster import MiniBatchKMeans

Please cite HTMD: Doerr et al.(2016)JCTC,12,1845. https://dx.doi.org/10.1021/acs.jctc.6b00049
HTMD Documentation at: https://software.acellera.com/htmd/

You are on the latest HTMD version (2.8.5.dev17+gf26491d44.d20260602).

2026-06-02 18:43:57,627 - rdkit - INFO - Enabling RDKit 2026.03.2 jupyter extensions

Step 1 - Build a simulation list#

The filtered trajectories (already stripped of waters and ions, aligned for analysis) ship as a zip on Figshare (HTMD tutorial data, DOI 10.6084/m9.figshare.32541291) - download protein_folding_datasets.zip and extract it once. ~2.5 GB. The tutorial reads the absolute path from the HTMD_TUTORIAL_DATASETS environment variable so the same notebook works regardless of where you extracted to.

The dataset is organised as several “epochs” (independent adaptive-sampling rounds). simlist() builds one list per epoch — duplicate folder basenames within a single call would raise — and simmerge() stitches the per-epoch lists into one combined list, renumbering the simid indices to be sequential across the merged whole:

DATASETS = Path(os.environ["HTMD_TUTORIAL_DATASETS"]) / "protein_folding_datasets"
topology = str(DATASETS / "1" / "filtered")  # any epoch works - all share the same topology
sims = []
for epoch in sorted(DATASETS.glob("*/")):
    trajs = glob(os.path.join(epoch, "filtered", "*", ""))
    sims = simmerge(sims, simlist(trajs, topology))
len(sims)

A Sim is the unit of analysis: trajectory frames + the matching topology / structure file (any moleculekit-supported topology format works — PDB, PSF, prmtop, …).

Step 2 - Project geometry to Cα distances#

Folding is encoded in the network of Cα-Cα contacts: in the unfolded state the helices are apart, in the folded state they pack against each other. MetricSelfDistance computes the upper-triangular distance matrix between every Cα pair, frame by frame. metric="contacts" thresholds those distances into 0/1 contacts (1 if below the threshold, 0 otherwise - default threshold=8 Å). Contact features are cheaper and more robust than raw distances for MSMs.

metr = Metric(sims)
metr.set(MetricSelfDistance("protein and name CA", metric="contacts", periodic="chains"))
data = metr.project()
data.fstep = 0.1  # ns per frame - tells downstream code the time axis

htmd.projections.metric - INFO - Frame step 0.1ns was read from the trajectories. If it looks wrong, redefine it by manually setting the MetricData.fstep property.

data is a MetricData object: per-trajectory projected coordinates plus the metric metadata.

data.plotTrajSizes()

../../_images/ddff1301c87cdaf4717b32064c0b779ea92edd02e56e9ea9ea11de9fe9ea5d1b.png

Drop any trajectories whose length isn’t the statistical mode (the most common frame count). By default dropTraj() removes anything off that mode — both crashed (too short) and over-run (too long) — so the rest of the pipeline sees a uniform-length set:

data.dropTraj()

htmd.metricdata - INFO - Dropped 7 trajectories from 2137 resulting in 2130

array([ 168,  472,  709, 1507, 1601, 2100, 2111])

Step 3 - TICA: find the slow coordinates#

TICA (Time-lagged Independent Component Analysis) re-projects the high-dimensional contact space onto the directions of slowest decorrelation. For an MSM, those slow directions are exactly what you want to discretise.

tica = TICA(data, 2, units="ns")        # lag time for the TICA fit
dataTica = tica.project(3)              # keep the 3 slowest TIC coordinates

The lag time you pick here (2 ns) should be a few times the fastest non-trivial relaxation; for villin folding 2 ns is plenty.

Step 4 - Cluster the TICA space#

Discretise the continuous TICA space into ~1000 microstates with mini-batch k-means. Bootstrap the data first (sample 80% of trajs) - re-running with different bootstrap draws gives you deviations / error bars on the timescales and is the standard check that the model is sample-stable:

dataBoot = dataTica.bootstrap(0.8)
dataBoot.cluster(MiniBatchKMeans(n_clusters=1000))

Step 5 - Build the MSM#

model = Model(dataBoot)
model.plotTimescales(maxlag=40, units="ns")

../../_images/27fafc5128b2e0605e15195c78c288fc15d1de3adbc95182ba490b585c035dd9.png

The implied-timescales (ITS) plot tells you when the model becomes Markovian: pick the shortest lag time at which the top timescales have plateaued. For villin folding that’s ~25 ns - the slow folding timescale is clearly separated from the bulk continuum.

model.markovModel(25, 4, units="ns")

htmd.model - INFO - 100.0% of the data was used
htmd.model - INFO - Number of trajectories that visited each macrostate:
htmd.model - INFO - [ 116  479   74 1655]

25 is the MSM lag time in ns; 4 is the number of macrostates PCCA lumps the microstates into. We pick 4 because the ITS plot shows the top 3 timescales clearly separated from the rest of the spectrum - 3 slow processes partition the microstate space into 3 + 1 = 4 metastable basins. The result captures the folded basin + a couple of unfolded sub-states + the transition region.

Step 6 - Equilibrium populations and FES#

model.eqDistribution()

../../_images/d9a7514963e5f8d133308b8c545557082cbbf580ab567c49594260b86308b8b4.png

array([0.00353847, 0.09011645, 0.16917454, 0.73717054])

model.plotFES(0, 1, temperature=360)

../../_images/588bb0c271ce4f3ad64eb97ffc90133f4f4ecf1fe00340a4d08bf4440388fcfd.png

The FES on TIC1 / TIC2 shows the macrostate basins. Overlay the state assignments:

model.plotFES(0, 1, temperature=360, states=True)

../../_images/a15bbdd12589127ca974b6ae6254e8ee7836f35e63735f97a3d4b2fae5a4ccc5.png

To eyeball the macrostates as structures, run model.viewStates(protein=True) from an interactive session - it launches VMD with one representative frame per macrostate. (Omitted here because it needs a display.)

Step 7 - Kinetics#

kin = Kinetics(model, temperature=360)
r = kin.getRates()
print(r)

mfpton = 2.09E+03 (ns)
mfptoff = 2.35E+02 (ns)
kon = 4.79E+05 (1/M 1/s)
koff = 4.25E+06 (1/s)
koff/kon = 8.87E+00 (M)
kdeq = 4.36E+00 (M)
g0eq = 1.05 (kcal/M)

htmd.kinetics - INFO - Detecting source state...
htmd.kinetics - INFO - Guessing the source state as the state with minimum contacts.
htmd.kinetics - INFO - Source macro = 3
htmd.kinetics - INFO - Detecting sink state...
htmd.kinetics - INFO - Sink macro = 2
htmd.kinetics - INFO - Calculating rates between source: [np.int64(3)] and sink: [np.int64(2)] states.

Kinetics reads the macrostate model and computes mean-first-passage times and rates. getRates() returns a single Rates object for one auto-detected source → sink pair (folded → unfolded by default). To inspect all pairs, pass source= / sink= explicitly per call, or use kin.plotRates() (below) which iterates internally.

kin.plotRates()

../../_images/2f69b7a45c8461df1c62c04c6e3942dced66392921e19513d28a82ac3581ffa0.png

../../_images/da185690c4ceac092e757c334e64d5fe8e19a8a1ce40937c405960971d5b87fd.png

../../_images/c775502b5cfd9d50f11614de26875bf52eff14ad4f7db6b930bf55c95a3dae9d.png

kin.plotFluxPathways()

../../_images/9628ca6245ce93350fd6afaeb36b1b45d7d5a853009df8a240c3d1adc3b3068a.png

Path flux		%path	%of total	path
0.014825165238110399	100.0%	100.0%		[3 2]

Flux pathways visualise the dominant transition routes in TPT (transition-path theory) form - which intermediates the protein actually visits on its way from unfolded to folded.

Parameters that matter#

Knob	Effect
`MetricSelfDistance(... metric="contacts")`	Contact features are coarser but more robust than raw distances - the binary thresholding makes clusters insensitive to small Cα-Cα fluctuations within a folded basin.
`TICA(data, 2, units="ns")` lag	Set a few times the fastest non-trivial relaxation. Too small → noise; too large → loses fast slow modes.
`tica.project(3)`	Number of TIC coordinates kept. 3-5 is typical; check the eigenvalue spectrum to decide.
`MiniBatchKMeans(n_clusters=1000)`	More clusters = finer microstate resolution but slower. 500-2000 is the usual range.
`model.markovModel(25, 4, units="ns")` lag	Read off the ITS plot - shortest lag at which slow timescales plateau.
`model.markovModel(..., macronum=4)`	PCCA macrostates. Look at the timescale gaps in the ITS plot to pick the number.

Gotchas#

ITS plateau, not minimum. Pick the lag where slow timescales first level off, not where they’re highest. Too-long lags throw away kinetic resolution for no gain.
bootstrap(0.8) lets you get deviations / error bars and see whether the model is sample-stable. Re-run with several bootstrap draws and compare; if different bootstraps give wildly different ITS / states, your dataset is undersampled.
Macrostate count is a judgement call. Look at the timescale gap pattern - if you see a clear gap after k slow timescales, take k+1 macrostates.