Run many jobs on one GPU
Goal
Pack several PlayMolecule jobs onto a single GPU on a SLURM cluster using NVIDIA MPS, so a small workload doesn’t waste a whole device.
Minimal example
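from playmolecule import slurm_mps
from playmolecule.apps import deepsite

# Set up one ExecutableDirectory per input structure.
eds = [
    deepsite(outdir=f"./run-{i}", pdbfile=f"protein_{i}.pdb")
    for i in range(8)
]

# Submit all eight as one SLURM job that shares a single GPU via MPS.
slurm_mps(eds, partition="normalGPU", ncpu=1, ngpu=1)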
All the ExecutableDirectory objects in eds are submitted together as one SLURM job via slurm_mps(). That job holds one GPU and shares it across the batched jobs through NVIDIA’s Multi-Process Service (MPS).
When to use it
Small GPU jobs (a few seconds to a few minutes) where individual SLURM submissions would burn more time queueing than running.
Workloads where one GPU has plenty of memory for several processes — e.g., parameter sweeps over the same model.
Don’t use MPS for jobs that are individually GPU-saturating: they’ll just serialise without benefit.
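A rough back-of-envelope check can help decide how many processes to pack onto one device. The sketch below is plain Python and not part of PlayMolecule: jobs_that_fit and chunk are hypothetical helper names, and the memory figures are placeholders to be replaced with measurements from your own jobs.

```python
def jobs_that_fit(gpu_mem_gib, per_job_gib, reserve_gib=1.0):
    """Estimate how many processes fit in GPU memory at once.

    reserve_gib is an assumed safety margin, not a value MPS requires.
    """
    if per_job_gib <= 0:
        raise ValueError("per_job_gib must be positive")
    usable = gpu_mem_gib - reserve_gib
    return max(int(usable // per_job_gib), 0)


def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


# Placeholder numbers: a 24 GiB GPU and jobs that peak around 5 GiB each.
per_batch = jobs_that_fit(gpu_mem_gib=24, per_job_gib=5)
batches = chunk(list(range(8)), per_batch)
```

Each batch could then be passed to its own slurm_mps() call. If the estimate comes out at 0 or 1, the jobs are probably too large to benefit from sharing a GPU.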
Parameters
slurm_mps(exec_dirs, **kwargs) accepts the same SLURM kwargs as ed.run(queue="slurm", ...) — partition, ncpu, ngpu, memory, walltime, nodelist, exclude, envvars, prerun, the mail options, and the stream options. See Run an app on SLURM for the table.
The resource defaults come from the first ExecutableDirectory in the list (its execution_resources, which were copied from the app manifest at setup time). If you mix apps with different defaults, pass ncpu / ngpu explicitly to be safe.
Gotchas
All ExecutableDirectory objects must live on a shared filesystem (same rule as plain SLURM).
MPS needs to be enabled on the chosen partition’s nodes. Talk to your cluster admin if slurm_mps jobs fail with “Failed to start MPS” in their logs.
The single SLURM job completes only when all batched jobs have finished, so one slow job holds the GPU for the others. Group jobs by expected runtime.
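One way to group by expected runtime is to bucket jobs before submitting them. This is a minimal sketch in plain Python, assuming you can supply your own runtime estimate for each job; group_by_runtime and the bucket edges are hypothetical and not part of PlayMolecule.

```python
from bisect import bisect_right


def group_by_runtime(jobs, edges=(60, 600)):
    """Bucket (job, expected_seconds) pairs by runtime.

    `edges` must be sorted ascending. With the default edges the buckets
    are: under 60 s, 60 s up to (but not including) 600 s, and 600 s or more.
    Empty buckets are dropped from the result.
    """
    buckets = [[] for _ in range(len(edges) + 1)]
    for job, seconds in jobs:
        buckets[bisect_right(edges, seconds)].append(job)
    return [bucket for bucket in buckets if bucket]


# Strings stand in for ExecutableDirectory objects in this illustration.
estimates = [("run-0", 10), ("run-1", 45), ("run-2", 300), ("run-3", 3600)]
groups = group_by_runtime(estimates)
```

Each group would then go to a separate slurm_mps() call, so a long job only delays jobs of similar length.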
See also
slurm_mps()