How to choose a featurization with a VAMP-2 score#
Goal#
Decide which features to build your MSM on - and how many TICA dimensions and what TICA lag to keep - objectively rather than by intuition. The cross-validated VAMP-2 score (vampScore()) rates how much of the slow dynamics a featurization resolves; you score your candidates and keep the best one. This is the variational feature-selection idea of McGibbon and Pande.
Minimal example#
Starting from your sims (a simlist()), score two candidate featurizations of the same trajectories and compare. Higher is better:
from htmd.ui import Metric, MetricSelfDistance, MetricDihedral, vampScore
# Candidate A: Ca-Ca contact map
metrA = Metric(sims)
metrA.set(MetricSelfDistance("protein and name CA", metric="contacts", periodic="chains"))
dataA = metrA.project()
dataA.fstep = 0.1
# Candidate B: backbone dihedrals
metrB = Metric(sims)
metrB.set(MetricDihedral())
dataB = metrB.project()
dataB.fstep = 0.1
for label, data in [("contacts", dataA), ("dihedrals", dataB)]:
mean, std = vampScore(data, lag=2, dim=4, units="ns")
print(f"{label}: VAMP-2 = {mean:.2f} +/- {std:.2f}")
Keep the featurization with the highest cross-validated score.
How to read the score#
Higher is better - a higher VAMP-2 score means the features resolve more of the slow dynamics.
About 1.0 is the floor - a featurization that captures no slow process (pure noise) scores around 1.0, the trivial stationary process.
Compare like with like - scores are only comparable when computed at the same
laganddim. Changing either changes the absolute score.It is cross-validated - the score is averaged over
nfoldsfolds, so a featurization that looks good only by overfitting the training data is penalised on the held-out folds. Use the mean and watch the spread (std).
Common variations#
Choose how many dimensions to keep. The score keeps rising as you add dimensions (it sums the squared singular values), so look for diminishing returns rather than a maximum:
for d in (2, 3, 4, 6, 8):
mean, std = vampScore(data, lag=2, dim=d, units="ns")
print(f"dim={d}: VAMP-2 = {mean:.2f} +/- {std:.2f}")
Check robustness to the TICA lag:
for lag in (0.5, 1, 2, 5):
mean, std = vampScore(data, lag=lag, dim=4, units="ns")
print(f"lag={lag} ns: VAMP-2 = {mean:.2f} +/- {std:.2f}")
Get the per-fold scores instead of the mean and standard deviation (for example to plot your own error bars):
scores = vampScore(data, lag=2, dim=4, units="ns", return_scores=True)
If you have only a few long trajectories, cross-validating over whole trajectories is not possible; split each trajectory into blocks of blocksize frames instead:
mean, std = vampScore(data, lag=2, dim=4, units="ns", blocksize=1000)
Gotchas#
Only compare at matched
laganddim. A featurization scored atdim=6will out-score one atdim=3purely because of the extra dimensions, not because it is better.Do not simply maximise over
dim. Because VAMP-2 sums squared singular values, it rises monotonically withdim. Read the shape of the curve (where the gains flatten), not the peak.A tie is a real and useful answer. Featurizations that both capture the slow process well score about the same - the score is telling you the choice does not matter for the kinetics, so decide on other grounds such as clustering robustness or cost. For example, on the villin headpiece a Cα contact map and raw Cα distances score about equally; the contact map is preferred there for its clustering robustness, not because it resolves more dynamics.
Needs at least 2 trajectories for the default whole-trajectory cross-validation. With a single long trajectory, pass
blocksizeso it can be split into independent blocks.Scoring is not validation. A high VAMP-2 score says the features are promising; it does not replace building the MSM and checking the implied timescales and the Chapman-Kolmogorov test.
See also#
htmd.projections.vamp.vampScore()- the API reference.How to read an implied-timescales plot - the validation step after picking features.
Protein folding MSM and Ligand binding MSM - the full pipelines these features feed into.