How to build a simlist from a non-standard directory layout#

Goal#

Construct an htmd.simlist.Sim list for trajectories whose directory structure doesn’t fit the default data/<simname>/{topology, traj.xtc} pattern - per-trajectory topology, multiple trajectories per simulation, manually-curated subsets, or trajectories spread across several roots.

Minimal example#

from glob import glob
from htmd.simlist import simlist

sims = simlist(
    glob("./data/*/"),                  # one subdir per simulation
    "./structure.pdb",                  # one shared topology for all sims
)

The simplest case: every sim folder uses the same topology PDB. simlist discovers a moleculekit-supported trajectory inside each folder (xtc, dcd, nc/netcdf, trr, binpos, h5, …) and emits one Sim per folder.

Tip

When all sims share the same topology, pass a single path. A single shared topology lets every Sim reference the same file, and the downstream projection / Metric pipeline only loads it once for the whole simlist. The per-sim list form is needed only when the topologies actually differ.

Parameters that matter#

Parameter	What it does
`datafolders`	List of directories, each containing one or more trajectory files for that sim.
`topologies`	Either a single path (shared topology) or a list with one path per `datafolder` (per-sim topology). Folder names must match for the list form.
`inputfolders`	Optional - list of input directories matching `datafolders`. Required for adaptive-sampling traceback.

Common variations#

Per-simulation topology#

sims = simlist(
    glob("./data/*/"),
    glob("./input/*/structure.pdb"),    # one PDB per sim - matched by folder name
)

simlist matches by directory basename - data/sim_42/ is paired with the topology under input/sim_42/structure.pdb. Missing topologies raise FileNotFoundError; duplicate folder basenames within a single call raise RuntimeError.

Trajectories spread across multiple roots#

Since duplicate folder basenames raise inside a single simlist call, multi-root datasets must be split per root and stitched together with simmerge():

from htmd.simlist import simmerge

sims = []
for root in ["./run1/data", "./run2/data", "./run3/data"]:
    fsims = simlist(glob(f"{root}/*/"), f"{root}/../input/0/structure.pdb")
    sims = simmerge(sims, fsims)

This is the pattern the villin folding and trypsin-benzamidine tutorials use to merge multiple adaptive epochs.

Custom file extension or naming#

simlist walks moleculekit’s trajectory-reader list (xtc, dcd, nc/netcdf, trr, binpos, h5, lh5) and stops at the first extension that matches inside each folder - if a folder happens to contain both .xtc and .dcd, only one set is picked up. If your trajectories live one level deeper (e.g. data/<sim>/output/traj.xtc):

sims = simlist(
    glob("./data/*/output/"),           # point at the inner folder
    "./structure.pdb",
)

A hand-curated subset#

# Only keep sims whose name starts with "binding_"
keep = [d for d in glob("./data/*/") if "binding_" in d]
sims = simlist(keep, "./structure.pdb")

Or pass simfilter() post-hoc to subset by content (see How to drop bad trajectories).

Gotchas#

Trajectory folder names must be unique within a single simlist call - the dedup logic raises on duplicates. If you have epochs that reuse names, split per-epoch and simmerge.
The topology must contain the same number of atoms as the trajectory’s first frame. Stripping waters in one but not the other breaks the projection step downstream.
inputfolders only matters for adaptive-sampling traceback (see how-to); skip it for static analysis.
simlist is cheap: it doesn’t open the trajectory files for coordinate data. Frame counts on each Sim may show up as None until projection actually reads each trajectory.