Job lifecycle#
A PlayMolecule job moves through four states from submission to completion. This page describes the states, how they’re detected, and the heartbeat mechanism the local backend uses to spot dead workers.
The four states#
JobStatus is an IntEnum:
WAITING_INFO = 0 # Submitted, not yet running. No heartbeat seen.
RUNNING = 1 # Container is alive and the heartbeat is fresh.
COMPLETED = 2 # Exit code 0 (local) or backend reported success (HTTP).
ERROR = 3 # Non-zero exit, missing outputs, or stale heartbeat.
str(status) returns a human-readable label via describe(). Because JobStatus is an IntEnum, direct comparisons such as status == JobStatus.RUNNING are supported.
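As a minimal sketch of how such an enum could be put together, the block below defines a stand-in class with the same four members. The label strings returned by describe() are assumptions made for illustration; the real wording may differ.

```python
from enum import IntEnum


class JobStatus(IntEnum):
    """Illustrative stand-in for PlayMolecule's JobStatus."""

    WAITING_INFO = 0
    RUNNING = 1
    COMPLETED = 2
    ERROR = 3

    def describe(self) -> str:
        # Assumed labels; the real describe() may word these differently.
        labels = {
            JobStatus.WAITING_INFO: "Waiting info",
            JobStatus.RUNNING: "Running",
            JobStatus.COMPLETED: "Completed",
            JobStatus.ERROR: "Error",
        }
        return labels[self]

    def __str__(self) -> str:
        # str(status) delegates to describe().
        return self.describe()


status = JobStatus.RUNNING
print(str(status))                  # human-readable label
print(status == JobStatus.RUNNING)  # comparison against another member
print(status == 1)                  # IntEnum members also equal plain ints
print(status in (JobStatus.COMPLETED, JobStatus.ERROR))  # terminal-state check
```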
State diagram#
stateDiagram-v2
[*] --> WAITING_INFO: ed.run()
WAITING_INFO --> RUNNING: container starts, .pm.alive appears
RUNNING --> COMPLETED: .pm.done written
RUNNING --> ERROR: .pm.err written / heartbeat stale > 60s
WAITING_INFO --> ERROR: backend reports failure before run starts
COMPLETED --> [*]
ERROR --> [*]
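The edges of the diagram can also be written down as a lookup table, which is handy when a script needs to know whether a state is terminal. This is only a sketch: the TRANSITIONS dict and both helper functions are hypothetical names, not part of the PlayMolecule API.

```python
from enum import IntEnum


class JobStatus(IntEnum):
    WAITING_INFO = 0
    RUNNING = 1
    COMPLETED = 2
    ERROR = 3


# Edges copied from the state diagram; the dict itself is illustrative.
TRANSITIONS = {
    JobStatus.WAITING_INFO: {JobStatus.RUNNING, JobStatus.ERROR},
    JobStatus.RUNNING: {JobStatus.COMPLETED, JobStatus.ERROR},
    JobStatus.COMPLETED: set(),
    JobStatus.ERROR: set(),
}


def is_terminal(status: JobStatus) -> bool:
    """A state is terminal when the diagram has no edge leaving it."""
    return not TRANSITIONS[status]


def can_transition(current: JobStatus, new: JobStatus) -> bool:
    """True if the diagram has an edge from current to new."""
    return new in TRANSITIONS[current]
```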
Local backend: how each state is detected#
The local execution backend uses three sentinel files inside outdir/run_<id>/:
| File | Set by | What it means |
|---|---|---|
| .pm.alive | The container | Refreshed periodically with an ISO-format timestamp while the job is running. |
| .pm.done | The container | Written on clean exit. Primary COMPLETED signal. |
| .pm.err | The container | Written on a controlled failure (non-zero exit, exception). |
ed.status checks these in order:
1. .pm.done exists → COMPLETED.
2. .pm.err exists → ERROR.
3. .pm.alive exists:
   - timestamp within the last 60 seconds → RUNNING.
   - timestamp older than 60 seconds → ERROR (worker died without writing .pm.err).
4. (SLURM-submitted job) the SLURM queue state is consulted; see below.
5. Fallback: read the app’s expected_outputs.json. If it lists files and they’re all present on disk → COMPLETED; if some are missing → RUNNING.
6. None of the above → WAITING_INFO.
The 60-second timeout is hard-coded. A SLURM worker that crashes silently will leave .pm.alive stale; after 60 seconds the status flips to ERROR even without an explicit failure signal.
The fallback path (5) was tightened recently: an empty expected_outputs.json list now falls through to WAITING_INFO rather than reporting a false COMPLETED for a job that hadn’t started.
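The check order can be sketched as a single function. This is not the backend's actual code. It assumes that .pm.alive holds one timestamp parseable by datetime.fromisoformat (naive values are treated as UTC), that expected_outputs.json is a JSON list of paths relative to the run directory, and that the caller passes the manifest location and an already-mapped SLURM state. The names detect_status, queue_status and manifest_path are hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone
from enum import IntEnum
from pathlib import Path
from typing import Optional


class JobStatus(IntEnum):
    WAITING_INFO = 0
    RUNNING = 1
    COMPLETED = 2
    ERROR = 3


HEARTBEAT_TIMEOUT = timedelta(seconds=60)


def detect_status(
    run_dir: Path,
    manifest_path: Optional[Path] = None,
    queue_status: Optional[JobStatus] = None,
    now: Optional[datetime] = None,
) -> JobStatus:
    """Walk the six checks in order. `now` must be timezone-aware."""
    now = now or datetime.now(timezone.utc)

    # 1 and 2: terminal sentinels win over everything else.
    if (run_dir / ".pm.done").exists():
        return JobStatus.COMPLETED
    if (run_dir / ".pm.err").exists():
        return JobStatus.ERROR

    # 3: heartbeat file holding an ISO-format timestamp.
    alive = run_dir / ".pm.alive"
    if alive.exists():
        stamp = datetime.fromisoformat(alive.read_text().strip())
        if stamp.tzinfo is None:
            # Assumption: naive timestamps are UTC.
            stamp = stamp.replace(tzinfo=timezone.utc)
        fresh = now - stamp <= HEARTBEAT_TIMEOUT
        return JobStatus.RUNNING if fresh else JobStatus.ERROR

    # 4: SLURM queue state, already mapped to a JobStatus by the caller.
    if queue_status is not None:
        return queue_status

    # 5: expected-outputs manifest; an empty list falls through to step 6.
    if manifest_path is not None and manifest_path.exists():
        expected = json.loads(manifest_path.read_text())
        if expected:
            done = all((run_dir / name).exists() for name in expected)
            return JobStatus.COMPLETED if done else JobStatus.RUNNING

    # 6: nothing has been seen yet.
    return JobStatus.WAITING_INFO
```

Passing `now` explicitly makes the 60-second boundary easy to exercise in a test without sleeping.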
SLURM: the same, plus the queue#
When a job is submitted via ed.run(queue="slurm", ...), two things drive the status:
- The same .pm.alive / .pm.done / .pm.err files (the worker still uses them).
- The SLURM queue state from jobInfo(), consulted when no sentinel and no fresh heartbeat are present.
The SLURM-state-to-JobStatus mapping:
| SLURM state | PlayMolecule state |
|---|---|
|  |  |
|  |  |
|  |  |
|  |  |
The heartbeat catches the case where SLURM thinks the job is running but the container died silently on the worker (common with GPU driver issues).
HTTP backend: server-side truth#
For HTTP-backend jobs there’s no shared filesystem. ed.status does a single HTTP GET against the backend’s status endpoint, keyed by a job ID derived from the outdir path at submission time. The four states are reported directly by the server.
If you move or rename outdir between submission and status queries, the derived job ID won’t match what the server stored and you’ll get a 404. The fix is to never rename outdir.
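To see why a rename breaks the lookup, consider a toy derivation in which the job ID is a hash of the absolute outdir path. The actual derivation scheme is not described here, so derive_job_id below is purely hypothetical. It only illustrates that any ID computed from the path changes when the path does.

```python
import hashlib
import os


def derive_job_id(outdir: str) -> str:
    """Hypothetical derivation: short hash of the absolute outdir path."""
    absolute = os.path.abspath(outdir)
    return hashlib.sha256(absolute.encode("utf-8")).hexdigest()[:16]


submitted = derive_job_id("results/run_1")
same_path = derive_job_id("results/run_1")
renamed = derive_job_id("results/run_1_final")

print(submitted == same_path)  # True: same path, same ID
print(submitted == renamed)    # False: the server has no record under the new ID
```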
Polling guidance#
- Local interactive runs: ed.run() is blocking. You don’t poll; you wait.
- SLURM jobs: poll once every 30–60 seconds. Anything faster is wasted; the controller node sees no benefit.
- HTTP-backend jobs: match the polling cadence to job length; once a minute is reasonable for jobs that take 10+ minutes.
Alternatively, set PM_BLOCKING=1 and the app call itself waits until the job reaches a terminal state. This is useful in scripts where you’d otherwise write the polling loop anyway.
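If you do write the loop yourself, it might look like the sketch below. wait_for_terminal is a hypothetical helper, not part of PlayMolecule. It takes any zero-argument callable that returns the current JobStatus, so how you read the status from your job object is up to you.

```python
import time
from enum import IntEnum
from typing import Callable, Optional


class JobStatus(IntEnum):
    WAITING_INFO = 0
    RUNNING = 1
    COMPLETED = 2
    ERROR = 3


TERMINAL = (JobStatus.COMPLETED, JobStatus.ERROR)


def wait_for_terminal(
    get_status: Callable[[], JobStatus],
    interval: float = 30.0,
    timeout: Optional[float] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> JobStatus:
    """Poll get_status() every `interval` seconds until a terminal state.

    Raises TimeoutError once at least `timeout` seconds have been slept.
    """
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        if timeout is not None and waited >= timeout:
            raise TimeoutError(f"job still {status.name} after {waited:.0f}s")
        sleep(interval)
        waited += interval
```

Assuming the status is exposed as an attribute, a call could look like wait_for_terminal(lambda: ed.status, interval=60). The injectable sleep argument exists only so the loop can be tested without real delays.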
Failure modes worth knowing#
- Stale heartbeat: the worker died with no .pm.err. Cause: usually OOM-kill or hardware fault. Check SLURM accounting (sacct).
- Missing expected outputs: the manifest’s expected_outputs list and the actual run disagree. Either the app code regressed or the inputs were unusable.
- WAITING_INFO forever: the container never started. Check the container runtime: docker pull of the image, Apptainer’s SIF cache, gcloud auth.
- HTTP 404 on status: outdir was moved/renamed after submission. There’s no recovery; resubmit.
See also#
- JobStatus
- ExecutableDirectory