How to filter trajectories with simfilter
Goal
Strip waters (or any other atom selection) out of a simlist’s trajectories before they hit the Metric / TICA / clustering pipeline. Filtered trajectories are written to disk as a new XTC set, dramatically reducing per-frame I/O and memory use for downstream analysis - and reproducing the same step the adaptive-sampling pipeline performs automatically on completed sims.
Minimal example
```python
from glob import glob
from htmd.simlist import simlist, simfilter

sims = simlist(glob("./data/*/"), "./structure.pdb")
fsims = simfilter(sims, "./filtered/", filtersel="not water")
# fsims now points at water-stripped XTCs under ./filtered/
```
simfilter() writes a stripped XTC per input Sim plus a single shared topology pair (filtered.psf + filtered.pdb) in outfolder, then returns a new simlist where each Sim points at its filtered trajectory with molfile=[filtered.psf, filtered.pdb]. The original trajectories are untouched.
Parameters that matter
| Parameter | What it does |
|---|---|
| sims | Input simlist from |
| outfolder | Directory where filtered XTCs and the shared topology files (filtered.psf + filtered.pdb) are written. Created if missing. |
| filtersel | Atom-selection string passed to moleculekit’s selector. Atoms matching the selection are kept. |
| njobs | Number of parallel workers. Defaults to |
Common variations
Strip waters and ions
```python
fsims = simfilter(sims, "./filtered/", filtersel="not (water or ion)")
```
The right default for MSM analysis - water and counter-ions are noise for any conformational / binding projection, and removing them speeds projection 5-20× depending on the box.
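To get a rough feel for the saving, the in-memory size of the coordinates alone can be estimated from the atom count. The sketch below assumes single-precision coordinates (3 values of 4 bytes per atom per frame); the atom and frame counts are hypothetical, and the estimate says nothing about compressed XTC size on disk or about the actual projection speed-up.

```python
def coords_bytes(n_atoms: int, n_frames: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to hold x, y, z for every atom in every frame."""
    return n_atoms * 3 * bytes_per_value * n_frames


# Hypothetical system (not taken from any real dataset): 50,000 atoms in the
# solvated box, 5,000 atoms left after stripping water and ions, and
# 10,000 frames per trajectory.
full = coords_bytes(50_000, 10_000)
stripped = coords_bytes(5_000, 10_000)

print(f"full box: {full / 1e9:.1f} GB per trajectory")
print(f"stripped: {stripped / 1e9:.1f} GB per trajectory")
print(f"reduction: {full / stripped:.0f}x")
```

The reduction factor is simply the ratio of atom counts, so it depends on how much of the box is solvent.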
Keep just the protein backbone
```python
fsims = simfilter(sims, "./filtered/", filtersel="protein and backbone")
```
Useful when projection only needs backbone Cα / N / C / O atoms (e.g. RMSD-to-reference, secondary structure). Smaller filtered XTCs, faster every step downstream.
Reuse a pre-existing filtered directory
```python
# If ./filtered/ already has the filtered trajectories from a prior run
fsims = simlist(glob("./filtered/*/"), "./filtered/filtered.pdb")
```
simfilter doesn’t keep state - it just writes new files to disk. If you’ve already filtered once and want to re-analyse, build the simlist directly from ./filtered/ and skip the simfilter call.
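One way to make the "filter once, then reuse" pattern explicit is a small guard around the call. This is a minimal sketch, not part of HTMD: has_filtered_output and load_or_filter are hypothetical helper names, and the check assumes the layout described above (filtered.pdb directly in the output folder, one subfolder per sim containing an XTC). It only tests that some output exists, not that every trajectory was filtered.

```python
import os
from pathlib import Path


def has_filtered_output(outfolder: str, topology: str = "filtered.pdb") -> bool:
    """Return True if outfolder holds the shared topology and at least one XTC.

    Assumed layout: the topology file sits directly in outfolder and each
    filtered trajectory lives in a per-sim subfolder.
    """
    root = Path(outfolder)
    if not (root / topology).is_file():
        return False
    return any(root.glob("*/*.xtc"))


def load_or_filter(raw_glob: str, structure: str, outfolder: str, filtersel: str):
    """Reuse an existing filtered directory, otherwise run simfilter once."""
    from glob import glob

    # Imported lazily so the directory check above works without htmd installed.
    from htmd.simlist import simlist, simfilter

    if has_filtered_output(outfolder):
        # Build the simlist straight from the filtered files.
        return simlist(
            glob(os.path.join(outfolder, "*", "")),
            os.path.join(outfolder, "filtered.pdb"),
        )
    sims = simlist(glob(raw_glob), structure)
    return simfilter(sims, outfolder, filtersel=filtersel)
```

A call such as load_or_filter("./data/*/", "./structure.pdb", "./filtered/", "not water") would then filter on the first run and reuse ./filtered/ afterwards.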
Gotchas
- simfilter writes filtered XTCs to disk and is the slow path (linear in total trajectory size). Run it once and reuse the filtered directory across analyses; don’t call it inside an inner loop.
- The filtered topology pair (filtered.psf + filtered.pdb) is written from sims[0]’s topology only - simfilter does not check that the other sims have matching topologies, so atom-count mismatches downstream can surface silently if you mix incompatible sims. Pre-validate your simlist before filtering.
- filtersel applies the same selection to every sim. There’s no per-sim filter customisation - if some sims need a different selection, split into separate simlists and simfilter each.
- Adaptive-sampling auto-filters completed sims as part of its _algorithm loop (controlled by AdaptiveMD.filter default True, filtersel default "not water", filteredpath default "filtered"). You usually don’t need to re-run simfilter on an adaptive dataset - just build the simlist directly from filteredpath/.
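The pre-validation suggested above could look something like the following. It is a rough sketch under stated assumptions: group_by_topology and topology_key are hypothetical helpers, each sim is assumed to expose its topology path(s) as a molfile attribute (a string or a list of strings), and "matching" is taken to mean byte-identical topology files. That test is stricter than an atom-count comparison, so it may split sims that would in fact be compatible.

```python
import hashlib
from collections import defaultdict


def topology_key(molfile) -> str:
    """Hash the contents of a sim's topology file(s)."""
    paths = [molfile] if isinstance(molfile, str) else list(molfile)
    digest = hashlib.sha1()
    for p in paths:
        with open(p, "rb") as fh:
            digest.update(fh.read())
    return digest.hexdigest()


def group_by_topology(sims):
    """Map topology hash -> list of sims that share that topology."""
    groups = defaultdict(list)
    for sim in sims:
        # Assumption: every sim object carries a `molfile` attribute.
        groups[topology_key(sim.molfile)].append(sim)
    return dict(groups)
```

If group_by_topology(sims) returns more than one group, each group can be passed to simfilter separately with its own output folder, which also covers the case where different sims need different selections.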
See also
How to build a simlist from a non-standard layout - the producer of the input simlist.
How to drop bad trajectories - the downstream cleanup once you have filtered XTCs.
htmd.simlist.simfilter() - API reference.