How to build a simlist from a non-standard directory layout#
Goal#
Construct a list of htmd.simlist.Sim objects for trajectories whose directory structure doesn’t fit the default data/<simname>/{topology, traj.xtc} pattern: per-trajectory topologies, multiple trajectories per simulation, manually curated subsets, or trajectories spread across several roots.
Minimal example#
from glob import glob
from htmd.simlist import simlist
sims = simlist(
    glob("./data/*/"),   # one subdir per simulation
    "./structure.pdb",   # one shared topology for all sims
)
The simplest case: every sim folder uses the same topology PDB. simlist discovers a moleculekit-supported trajectory inside each folder (xtc, dcd, nc/netcdf, trr, binpos, h5, …) and emits one Sim per folder.
Tip
When all sims share the same topology, pass a single path. A single shared topology lets every Sim reference the same file, and the downstream projection / Metric pipeline only loads it once for the whole simlist. The per-sim list form is needed only when the topologies actually differ.
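One way to decide between the two forms is to check whether the per-sim topology files are actually identical. The sketch below is a hypothetical helper (file_digest and collapse_topologies are illustrative names, not part of htmd) that compares file contents by SHA-256 and returns a single path when they all match. It assumes that byte-identical files are what counts as "the same topology".

```python
import hashlib


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def collapse_topologies(topologies):
    """Return one path if all files have identical content, else the full list.

    Hypothetical helper, not part of htmd.
    """
    if not topologies:
        raise ValueError("no topology files given")
    digests = {file_digest(t) for t in topologies}
    return topologies[0] if len(digests) == 1 else list(topologies)
```

Either return value fits the second argument of simlist, which accepts a single path or a list of paths.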
Parameters that matter#
| Parameter | What it does |
|---|---|
| datafolders | List of directories, each containing one or more trajectory files for that sim. |
| topologies | Either a single path (shared topology) or a list with one path per entry in datafolders. |
| inputfolders | Optional - list of input directories matching datafolders. |
Common variations#
Per-simulation topology#
sims = simlist(
    glob("./data/*/"),
    glob("./input/*/structure.pdb"),   # one PDB per sim - matched by folder name
)
simlist matches by directory basename - data/sim_42/ is paired with the topology under input/sim_42/structure.pdb. Missing topologies raise FileNotFoundError; duplicate folder basenames within a single call raise RuntimeError.
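Both failure modes can be checked before simlist is called. The following is a minimal sketch of such a pre-flight check, assuming topologies are laid out as <root>/<simname>/<file> as in the example above. The names sim_name and preflight are hypothetical and not part of htmd; the helper only mirrors the basename-matching rule described here.

```python
import os
from collections import Counter


def sim_name(folder):
    """Return the last path component, ignoring any trailing slash."""
    return os.path.basename(os.path.normpath(folder))


def preflight(datafolders, topologies):
    """Return (duplicate folder names, folder names with no matching topology).

    Hypothetical helper, not part of htmd.
    """
    counts = Counter(sim_name(d) for d in datafolders)
    duplicates = sorted(name for name, n in counts.items() if n > 1)

    # input/sim_42/structure.pdb is keyed by its parent folder name, sim_42.
    topo_names = {sim_name(os.path.dirname(t)) for t in topologies}
    missing = sorted(name for name in counts if name not in topo_names)
    return duplicates, missing
```

Two empty lists mean every data folder has a uniquely named counterpart among the topologies.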
Trajectories spread across multiple roots#
Since duplicate folder basenames raise inside a single simlist call, multi-root datasets must be split per root and stitched together with simmerge():
from htmd.simlist import simmerge

sims = []
for root in ["./run1/data", "./run2/data", "./run3/data"]:
    fsims = simlist(glob(f"{root}/*/"), f"{root}/../input/0/structure.pdb")
    sims = simmerge(sims, fsims)
This is the pattern the villin folding and trypsin-benzamidine tutorials use to merge multiple adaptive epochs.
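If the folders arrive as one flat list (for example from a single glob over ./run*/data/*/), they first have to be grouped per root. A minimal sketch, assuming the parent directory of each simulation folder identifies its root; split_by_root is a hypothetical helper, not part of htmd:

```python
import os
from collections import defaultdict


def split_by_root(datafolders):
    """Group simulation folders by parent directory, preserving input order.

    Hypothetical helper, not part of htmd.
    """
    groups = defaultdict(list)
    for folder in datafolders:
        root = os.path.dirname(os.path.normpath(folder))
        groups[root].append(folder)
    return dict(groups)
```

Each value of the returned dict can then go through its own simlist call, with the results combined by simmerge() as in the loop above.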
Custom file extension or naming#
simlist walks moleculekit’s trajectory-reader list (xtc, dcd, nc/netcdf, trr, binpos, h5, lh5) and stops at the first extension that matches inside each folder. If a folder happens to contain both .xtc and .dcd files, only the files with the first matching extension are picked up. If your trajectories live one level deeper (e.g. data/<sim>/output/traj.xtc), point simlist at the inner folder:
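sims = simlist(
    glob("./data/*/output/"),   # point at the inner folder
    "./structure.pdb",
)
Note that every folder matched this way has the basename output. Given the uniqueness rule for folder basenames described above, this is expected to work as written only when a single simulation matches; for several, call simlist once per simulation and combine the results with simmerge().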
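Because the first matching extension wins, a folder with mixed formats can silently lose part of its data. The sketch below lists such folders up front. It uses only the standard library; TRAJ_EXTENSIONS is copied from the list above and its order is illustrative, not necessarily the order moleculekit uses. Both function names are hypothetical and not part of htmd.

```python
import os

# Extensions named in the text above; the order here is an assumption.
TRAJ_EXTENSIONS = ("xtc", "dcd", "nc", "netcdf", "trr", "binpos", "h5", "lh5")


def trajectory_extensions(folder, extensions=TRAJ_EXTENSIONS):
    """Return the sorted trajectory extensions found directly inside folder."""
    found = set()
    for name in os.listdir(folder):
        ext = os.path.splitext(name)[1].lstrip(".").lower()
        if ext in extensions:
            found.add(ext)
    return sorted(found)


def mixed_format_folders(datafolders):
    """Map each folder holding more than one trajectory format to its formats."""
    report = {}
    for folder in datafolders:
        exts = trajectory_extensions(folder)
        if len(exts) > 1:
            report[folder] = exts
    return report
```

An empty report means no folder mixes formats, so the first-match rule cannot drop anything.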
A hand-curated subset#
import os
# Only keep sims whose folder name starts with "binding_"
keep = [d for d in glob("./data/*/")
        if os.path.basename(os.path.normpath(d)).startswith("binding_")]
sims = simlist(keep, "./structure.pdb")
Or pass simfilter() post-hoc to subset by content (see How to drop bad trajectories).
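When the curated subset is an explicit list of simulation names rather than a naming pattern, it helps to know which requested names matched nothing on disk. A minimal sketch, with select_by_name as a hypothetical helper that is not part of htmd:

```python
import os


def select_by_name(datafolders, wanted):
    """Keep folders whose basename is in wanted; also return unmatched names.

    Hypothetical helper, not part of htmd.
    """
    wanted = set(wanted)
    keep = [d for d in datafolders
            if os.path.basename(os.path.normpath(d)) in wanted]
    found = {os.path.basename(os.path.normpath(d)) for d in keep}
    return keep, sorted(wanted - found)
```

The first return value keeps the order of the input folders and can be passed to simlist in place of the full glob.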
Gotchas#
Trajectory folder names must be unique within a single simlist call - the dedup logic raises on duplicates. If you have epochs that reuse names, split per-epoch and simmerge.
The topology must contain the same number of atoms as the trajectory’s first frame. Stripping waters in one but not the other breaks the projection step downstream.
inputfolders only matters for adaptive-sampling traceback (see how-to); skip it for static analysis.
simlist is cheap: it doesn’t open the trajectory files for coordinate data. Frame counts on each Sim may show up as None until projection actually reads each trajectory.
See also#
How to filter trajectories with simfilter - strip waters / ions before projection.
How to drop bad trajectories - prune the resulting simlist.
Villin folding MSM - real example with multi-epoch merge.
htmd.simlist - API reference.