The MSM analysis workflow#

HTMD’s analysis side is built around a fixed five-step pipeline. Each step has a dedicated class, and the output of one step is the input of the next. Knowing the shape of the pipeline makes the API obvious in retrospect.

        flowchart LR
    A[SimList] --> B["Projection<br/>(MetricDistance, MetricRmsd, ...)"]
    B --> C["Dimensionality reduction<br/>(TICA / GWPCA)"]
    C --> D["Clustering<br/>(MiniBatchKMeans, KCenters, RegCluster)"]
    D --> E["Markov state model<br/>(Model)"]
    E --> F["Kinetics<br/>(rates, MFPTs)"]

1. SimList - enumerate trajectories#

simlist() takes an explicit list of trajectory directories and a matching list of topology files (or one shared topology), and returns a list of Sim objects, each pointing at one trajectory together with its topology. It’s the moral equivalent of glob plus topology matching - you provide the globs - and it’s the only object you carry through the rest of the pipeline.

from htmd.ui import *
sims = simlist(glob("data/*/"), glob("input/*/structure.pdb"))

2. Projection - per-frame features#

A projection maps each frame of each simulation to a feature vector. Projections come from moleculekit’s metric* family - MetricDistance, MetricRmsd, MetricDihedral, MetricSecondaryStructure, … - configured once and applied across the whole SimList:

metr = Metric(sims)
metr.set(MetricDistance("name CA", "resname BEN", periodic="selections"))
data = metr.project()

The result is a MetricData object: a ragged list of per-trajectory (n_frames_i, n_features) arrays (different sims can have different lengths) plus the mapping back to individual frames.

3. Dimensionality reduction - keep what matters#

For all but the simplest systems, the raw projection has too many features for clustering to behave well. HTMD provides:

TICA - time-lagged independent component analysis, the default choice for MD.
GWPCA - globally-weighted PCA.

Both classes take a MetricData and a lag time (in frames or absolute time units) and return a new MetricData in the reduced space. TICA can also wrap a Metric directly for on-the-fly projection without materialising the full feature array first.

4. Clustering - discretise the state space#

Clustering assigns each frame to one of K microstates. HTMD wraps several clustering schemes:

MiniBatchKMeans from scikit-learn (the usual default).
KCenter - good for low-population states.
RegCluster - radius-based assignment around precomputed centres.

The result is a per-frame integer label stored on the same MetricData object.

5. Markov state model and kinetics#

Model builds the Markov state model from the clustered data: counts transitions at a chosen lag time, computes the transition matrix, identifies macrostates by PCCA+, and exposes equilibrium populations and timescales.

Kinetics consumes the Model and computes mean first-passage times, rate constants, and free energies between user-specified source and sink states.

data_clust.fstep = 0.1                          # ns per frame - Kinetics needs this
model = Model(data_clust)
model.markovModel(lag=20, macronum=4)
kin = Kinetics(model, temperature=300, concentration=1e-3)
kin.getRates()

Kinetics raises if model.data.fstep is None - it needs the per-frame timestep to convert rates into wall-clock units.

What this means in practice#

Each tutorial in Analysing trajectories is one trip through this pipeline on a different system. If a tutorial mentions a class you don’t recognise, find the box it sits in above and you’ll know what its inputs and outputs look like.