Run many jobs on one GPU

Goal

Pack several PlayMolecule jobs onto a single GPU on a SLURM cluster using NVIDIA MPS, so a small workload doesn’t waste a whole device.

Minimal example

from playmolecule import slurm_mps
from playmolecule.apps import deepsite

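# Set up eight deepsite runs, one ExecutableDirectory per input structure.
eds = [
    deepsite(outdir=f"./run-{i}", pdbfile=f"protein_{i}.pdb")
    for i in range(8)
]

# Submit all eight as one SLURM job that shares a single GPU through MPS.
slurm_mps(eds, partition="normalGPU", ncpu=1, ngpu=1)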

slurm_mps() submits all the ExecutableDirectory objects in eds together as one SLURM job. That job holds one GPU and shares it across the individual runs through NVIDIA’s Multi-Process Service (MPS).

When to use it

  • Small GPU jobs (a few seconds to a few minutes) where individual SLURM submissions would burn more time queueing than running.

  • Workloads where one GPU has plenty of memory for several processes — e.g., parameter sweeps over the same model.

Don’t use MPS for jobs that are individually GPU-saturating: they’ll just serialise without benefit.
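The memory criterion above can be turned into a rough upper bound on how many jobs to batch. The sketch below is plain Python and not part of PlayMolecule; the function name, the headroom value, and the memory figures are all hypothetical, and the per-job footprint is something you would have to measure yourself.

```python
def jobs_that_fit(gpu_mem_gib: float, per_job_gib: float, headroom_gib: float = 1.0) -> int:
    """Rough count of processes whose memory fits on one GPU at the same time.

    headroom_gib is an assumed safety margin, not a documented MPS requirement.
    """
    if per_job_gib <= 0:
        raise ValueError("per_job_gib must be positive")
    usable = gpu_mem_gib - headroom_gib
    # Floor division: a job that only partly fits does not count.
    return max(0, int(usable // per_job_gib))


# Hypothetical numbers: a 24 GiB card and jobs that peak at about 2.5 GiB each.
n_jobs = jobs_that_fit(gpu_mem_gib=24.0, per_job_gib=2.5)
print(n_jobs)
```

This only bounds memory. It says nothing about compute contention, so jobs that each saturate the GPU still gain nothing from being packed together.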

Parameters

slurm_mps(exec_dirs, **kwargs) accepts the same SLURM kwargs as ed.run(queue="slurm", ...): partition, ncpu, ngpu, memory, walltime, nodelist, exclude, envvars, prerun, the mail options, and the stream options. See Run an app on SLURM for the table.

The resource defaults come from the first ExecutableDirectory in the list (its execution_resources, which were copied from the app manifest at setup time). If you mix apps with different defaults, pass ncpu / ngpu explicitly to be safe.
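Because only the first entry's defaults are used, it can help to check a mixed list before submitting. The following is a minimal sketch, not part of the PlayMolecule API: the helper name is hypothetical, and it assumes execution_resources behaves like a dict with "ncpu" and "ngpu" keys, which the text above does not confirm. Stand-in objects replace real ExecutableDirectory instances so the snippet runs on its own.

```python
from types import SimpleNamespace


def mixed_resource_defaults(exec_dirs, keys=("ncpu", "ngpu")):
    """Return the keys whose default differs from the first entry's default.

    Assumes each item exposes a dict-like `execution_resources` attribute.
    """
    if not exec_dirs:
        return []
    first = exec_dirs[0].execution_resources
    differing = set()
    for ed in exec_dirs[1:]:
        for key in keys:
            if ed.execution_resources.get(key) != first.get(key):
                differing.add(key)
    return sorted(differing)


# Stand-ins for ExecutableDirectory objects created by two different apps.
fake_eds = [
    SimpleNamespace(execution_resources={"ncpu": 1, "ngpu": 1}),
    SimpleNamespace(execution_resources={"ncpu": 4, "ngpu": 1}),
]
conflicts = mixed_resource_defaults(fake_eds)
print(conflicts)
```

A non-empty result is a hint to pass those arguments explicitly to slurm_mps() instead of relying on the first entry.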

Gotchas

  • All ExecutableDirectory objects must live on a shared filesystem (same rule as plain SLURM).

  • MPS needs to be enabled on the chosen partition’s nodes. Talk to your cluster admin if slurm_mps jobs fail with “Failed to start MPS” in their logs.

  • The single SLURM job finishes only when every batched job has finished, so one slow job keeps the GPU allocated after the others are done. Group jobs by expected runtime.
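Grouping by expected runtime can be sketched as a sort followed by chunking. This is plain Python, not a PlayMolecule feature: the helper name, the batch size, and the runtime estimates are hypothetical and must come from your own knowledge of the workload.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def batch_by_runtime(
    jobs: Sequence[T],
    est_seconds: Callable[[T], float],
    batch_size: int,
) -> List[List[T]]:
    """Sort jobs by estimated runtime, then split them into consecutive batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    ordered = sorted(jobs, key=est_seconds)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


# Stand-in jobs as (name, estimated seconds); real code would hold
# ExecutableDirectory objects paired with user-supplied estimates.
jobs = [("a", 30), ("b", 600), ("c", 45), ("d", 540), ("e", 20), ("f", 480)]
batches = batch_by_runtime(jobs, est_seconds=lambda job: job[1], batch_size=3)
print([[name for name, _ in batch] for batch in batches])
# prints [['e', 'a', 'c'], ['f', 'd', 'b']]
```

Each batch would then go to its own slurm_mps() call, so the short jobs release their GPU without waiting for the long ones.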

See also