# How to choose a featurization with a VAMP-2 score ## Goal Decide *which* features to build your MSM on - and how many TICA dimensions and what TICA lag to keep - objectively rather than by intuition. The cross-validated VAMP-2 score ({py:func}`~htmd.projections.vamp.vampScore`) rates how much of the slow dynamics a featurization resolves; you score your candidates and keep the best one. This is the variational feature-selection idea of McGibbon and Pande. ## Minimal example Starting from your `sims` (a {py:func}`~htmd.simlist.simlist`), score two candidate featurizations of the same trajectories and compare. Higher is better: ```python from htmd.ui import Metric, MetricSelfDistance, MetricDihedral, vampScore # Candidate A: Ca-Ca contact map metrA = Metric(sims) metrA.set(MetricSelfDistance("protein and name CA", metric="contacts", periodic="chains")) dataA = metrA.project() dataA.fstep = 0.1 # Candidate B: backbone dihedrals metrB = Metric(sims) metrB.set(MetricDihedral()) dataB = metrB.project() dataB.fstep = 0.1 for label, data in [("contacts", dataA), ("dihedrals", dataB)]: mean, std = vampScore(data, lag=2, dim=4, units="ns") print(f"{label}: VAMP-2 = {mean:.2f} +/- {std:.2f}") ``` Keep the featurization with the highest cross-validated score. ## How to read the score - **Higher is better** - a higher VAMP-2 score means the features resolve more of the slow dynamics. - **About 1.0 is the floor** - a featurization that captures no slow process (pure noise) scores around 1.0, the trivial stationary process. - **Compare like with like** - scores are only comparable when computed at the **same `lag` and `dim`**. Changing either changes the absolute score. - **It is cross-validated** - the score is averaged over `nfolds` folds, so a featurization that looks good only by overfitting the training data is penalised on the held-out folds. Use the mean and watch the spread (`std`). ## Common variations Choose how many dimensions to keep. The score keeps rising as you add dimensions (it sums the squared singular values), so look for **diminishing returns** rather than a maximum: ```python for d in (2, 3, 4, 6, 8): mean, std = vampScore(data, lag=2, dim=d, units="ns") print(f"dim={d}: VAMP-2 = {mean:.2f} +/- {std:.2f}") ``` Check robustness to the TICA lag: ```python for lag in (0.5, 1, 2, 5): mean, std = vampScore(data, lag=lag, dim=4, units="ns") print(f"lag={lag} ns: VAMP-2 = {mean:.2f} +/- {std:.2f}") ``` Get the per-fold scores instead of the mean and standard deviation (for example to plot your own error bars): ```python scores = vampScore(data, lag=2, dim=4, units="ns", return_scores=True) ``` If you have only a few long trajectories, cross-validating over whole trajectories is not possible; split each trajectory into blocks of `blocksize` frames instead: ```python mean, std = vampScore(data, lag=2, dim=4, units="ns", blocksize=1000) ``` ## Gotchas - **Only compare at matched `lag` and `dim`.** A featurization scored at `dim=6` will out-score one at `dim=3` purely because of the extra dimensions, not because it is better. - **Do not simply maximise over `dim`.** Because VAMP-2 sums squared singular values, it rises monotonically with `dim`. Read the *shape* of the curve (where the gains flatten), not the peak. - **A tie is a real and useful answer.** Featurizations that both capture the slow process well score about the same - the score is telling you the choice does not matter for the kinetics, so decide on other grounds such as clustering robustness or cost. For example, on the villin headpiece a Cα contact map and raw Cα distances score about equally; the contact map is preferred there for its clustering robustness, not because it resolves more dynamics. - **Needs at least 2 trajectories** for the default whole-trajectory cross-validation. With a single long trajectory, pass `blocksize` so it can be split into independent blocks. - **Scoring is not validation.** A high VAMP-2 score says the features are promising; it does not replace building the MSM and checking the implied timescales and the Chapman-Kolmogorov test. ## See also - {py:func}`htmd.projections.vamp.vampScore` - the API reference. - {doc}`How to read an implied-timescales plot ` - the validation step after picking features. - {doc}`Protein folding MSM <../tutorials/analysis/villin-folding>` and {doc}`Ligand binding MSM <../tutorials/analysis/trypsin-benzamidine-binding>` - the full pipelines these features feed into.