How to filter trajectories with simfilter#

Goal#

Strip waters (or any other atom selection) out of a simlist’s trajectories before they hit the Metric / TICA / clustering pipeline. Filtered trajectories are written to disk as a new XTC set, dramatically reducing per-frame I/O and memory use for downstream analysis - and reproducing the same step the adaptive-sampling pipeline performs automatically on completed sims.

Minimal example#

from glob import glob
from htmd.simlist import simlist, simfilter

sims  = simlist(glob("./data/*/"), "./structure.pdb")
fsims = simfilter(sims, "./filtered/", filtersel="not water")
# fsims now points at water-stripped XTCs under ./filtered/

simfilter() writes a stripped XTC per input Sim plus a single shared topology pair (filtered.psf + filtered.pdb) in outfolder, then returns a new simlist where each Sim points at its filtered trajectory with molfile=[filtered.psf, filtered.pdb]. The original trajectories are untouched.

Parameters that matter#

Parameter

What it does

sims

Input simlist from simlist().

outfolder

Directory where filtered XTCs and the shared topology PDB are written. Created if missing.

filtersel

Atom-selection string passed to moleculekit’s selector. Atoms matching the selection are kept. "not water" strips waters; "protein or resname LIG" keeps only protein + ligand.

njobs

Number of parallel workers. Defaults to htmd.config['njobs'], which is 1 out of the box. Pass njobs=N explicitly (or set htmd.config['njobs']) to enable parallelism.

Common variations#

Strip waters and ions#

fsims = simfilter(sims, "./filtered/", filtersel="not (water or ion)")

The right default for MSM analysis - water and counter-ions are noise for any conformational / binding projection, and removing them speeds projection 5-20× depending on the box.

Keep just the protein backbone#

fsims = simfilter(sims, "./filtered/", filtersel="protein and backbone")

Useful when projection only needs backbone Cα / N / C / O atoms (e.g. RMSD-to-reference, secondary structure). Smaller filtered XTCs, faster every step downstream.

Reuse a pre-existing filtered directory#

# If ./filtered/ already has the filtered trajectories from a prior run
fsims = simlist(glob("./filtered/*/"), "./filtered/filtered.pdb")

simfilter doesn’t keep state - it just writes new files to disk. If you’ve already filtered once and want to re-analyse, build the simlist directly from ./filtered/ and skip the simfilter call.

Gotchas#

  • simfilter writes filtered XTCs to disk and is the slow path (linear in total trajectory size). Run it once and reuse the filtered directory across analyses; don’t call it inside an inner loop.

  • The filtered topology pair (filtered.psf + filtered.pdb) is written from sims[0]’s topology only - simfilter does not check that the other sims have matching topologies, so atom-count mismatches downstream can surface silently if you mix incompatible sims. Pre-validate your simlist before filtering.

  • filtersel applies the same selection to every sim. There’s no per-sim filter customisation - if some sims need a different selection, split into separate simlists and simfilter each.

  • Adaptive-sampling auto-filters completed sims as part of its _algorithm loop (controlled by AdaptiveMD.filter default True, filtersel default "not water", filteredpath default "filtered"). You usually don’t need to re-run simfilter on an adaptive dataset - just build the simlist directly from filteredpath/.

See also#