# Job lifecycle

A PlayMolecule job moves through four states from submission to completion. This page describes the states, how they're detected, and the heartbeat mechanism the local backend uses to spot dead workers.

## The four states

{py:class}`~playmolecule.JobStatus` is an `IntEnum`:

```text
WAITING_INFO = 0   # Submitted, not yet running. No heartbeat seen.
RUNNING      = 1   # Container is alive and the heartbeat is fresh.
COMPLETED    = 2   # Exit code 0 (local) or backend reported success (HTTP).
ERROR        = 3   # Non-zero exit, missing outputs, or stale heartbeat.
```

`str(status)` returns a human label via {py:meth}`~playmolecule.JobStatus.describe`. Numerically comparing (`status == JobStatus.RUNNING`) is supported.

## State diagram

```{mermaid}
stateDiagram-v2
    [*] --> WAITING_INFO: ed.run()
    WAITING_INFO --> RUNNING: container starts, .pm.alive appears
    RUNNING --> COMPLETED: .pm.done written
    RUNNING --> ERROR: .pm.err written / heartbeat stale > 60s
    WAITING_INFO --> ERROR: backend reports failure before run starts
    COMPLETED --> [*]
    ERROR --> [*]
```

## Local backend: how each state is detected

The local execution backend uses three sentinel files inside `outdir/run_/`:

| File        | Set by        | What it means                                                                 |
|-------------|---------------|-------------------------------------------------------------------------------|
| `.pm.alive` | The container | Refreshed periodically with an ISO-format timestamp while the job is running. |
| `.pm.done`  | The container | Written on clean exit. Primary `COMPLETED` signal.                            |
| `.pm.err`   | The container | Written on a controlled failure (non-zero exit, exception).                   |

`ed.status` checks these in order:

1. `.pm.done` exists → **`COMPLETED`**.
2. `.pm.err` exists → **`ERROR`**.
3. `.pm.alive` exists:
   - timestamp within the last 60 seconds → **`RUNNING`**.
   - timestamp older than 60 seconds → **`ERROR`** (worker died without writing `.pm.err`).
4. *(SLURM-submitted job)* the SLURM queue state is consulted; see below.
5. Fallback: read the app's `expected_outputs.json`. If it lists files and they're all present on disk → **`COMPLETED`**; if some are missing → **`RUNNING`**.
6. None of the above → **`WAITING_INFO`**.

The 60-second timeout is hard-coded. A SLURM worker that crashes silently will leave `.pm.alive` stale; after 60 seconds the status flips to `ERROR` even without an explicit failure signal.

The fallback path (5) was tightened recently: an empty `expected_outputs.json` list now falls through to `WAITING_INFO` rather than reporting a false `COMPLETED` for a job that hadn't started.

## SLURM: the same, plus the queue

When a job is submitted via `ed.run(queue="slurm", ...)`, two things drive the status:

- The same `.pm.alive` / `.pm.done` / `.pm.err` files (the worker still uses them).
- The SLURM queue state from `jobInfo()`, consulted when no sentinel and no fresh heartbeat are present.

The SLURM-state-to-{py:class}`~playmolecule.JobStatus` mapping:

| SLURM state                                       | PlayMolecule state |
|---------------------------------------------------|--------------------|
| `RUNNING`                                         | `RUNNING`          |
| `COMPLETED`                                       | `COMPLETED`        |
| `PENDING`, `None`                                 | `WAITING_INFO`     |
| `FAILED`, `CANCELLED`, `OUT_OF_MEMORY`, `TIMEOUT` | `ERROR`            |

The heartbeat catches the case where SLURM thinks the job is running but the container died silently on the worker (common with GPU driver issues).

## HTTP backend: server-side truth

For HTTP-backend jobs there's no shared filesystem. `ed.status` does a single HTTP GET against the backend's status endpoint, keyed by a job ID derived from the `outdir` path at submission time. The four states are reported directly by the server.

If you move or rename `outdir` between submission and status queries, the derived job ID won't match what the server stored and you'll get a 404. The fix is to never rename `outdir`.

## Polling guidance

- **Local interactive runs**: `ed.run()` is blocking. You don't poll; you wait.
- **SLURM jobs**: poll once every 30–60 seconds. Anything faster is wasted; the controller node sees no benefit.
- **HTTP-backend jobs**: match the polling cadence to job length; once a minute is reasonable for jobs that take 10+ minutes.

Or set `PM_BLOCKING=1` and the app call itself waits until terminal state, which is useful in scripts where you'd otherwise write the polling loop anyway.

## Failure modes worth knowing

- **Stale heartbeat**: worker died with no `.pm.err`. Cause: usually OOM-kill or hardware fault. Check SLURM accounting (`sacct`).
- **Missing expected outputs**: manifest's `expected_outputs` list and the actual run disagree. Either the app code regressed or the inputs were unusable.
- **`WAITING_INFO` forever**: the container never started. Check the container runtime: `docker pull` of the image, Apptainer's SIF cache, gcloud auth.
- **HTTP 404 on status**: `outdir` was moved/renamed after submission. There's no recovery; resubmit.

## See also

- {py:class}`~playmolecule.JobStatus`
- {py:class}`~playmolecule.ExecutableDirectory`
- [Check job status](../howto/check-job-status.md)
- [Architecture](architecture.md)
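
## Sketch: the local check order in code

The check order described under "Local backend: how each state is detected" can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the library's actual implementation: `detect_local_status` is a hypothetical helper, the `JobStatus` class here is a local stand-in that only mirrors the values listed on this page, `.pm.alive` is assumed to hold a single timezone-naive ISO-format timestamp, and the expected outputs are passed in as a list rather than read from `expected_outputs.json`. The SLURM queue lookup (step 4) is left out.

```python
from datetime import datetime, timedelta
from enum import IntEnum
from pathlib import Path


class JobStatus(IntEnum):
    # Local stand-in mirroring the values on this page; not the real class.
    WAITING_INFO = 0
    RUNNING = 1
    COMPLETED = 2
    ERROR = 3


# The page states that the heartbeat timeout is 60 seconds.
HEARTBEAT_TIMEOUT = timedelta(seconds=60)


def detect_local_status(run_dir, expected_outputs=None, now=None):
    """Hypothetical helper: apply the documented check order to one run directory."""
    run_dir = Path(run_dir)
    now = now or datetime.now()

    # Steps 1 and 2: explicit sentinels win over everything else.
    if (run_dir / ".pm.done").exists():
        return JobStatus.COMPLETED
    if (run_dir / ".pm.err").exists():
        return JobStatus.ERROR

    # Step 3: a heartbeat is only trusted while it is fresh.
    alive = run_dir / ".pm.alive"
    if alive.exists():
        # Assumption: one timezone-naive ISO timestamp; an offset-aware
        # value would need an offset-aware `now` to be comparable.
        beat = datetime.fromisoformat(alive.read_text().strip())
        if now - beat <= HEARTBEAT_TIMEOUT:
            return JobStatus.RUNNING
        return JobStatus.ERROR

    # Step 4 (SLURM queue state) is omitted in this sketch.

    # Step 5: fallback on expected outputs. An empty or missing list
    # falls through, so an unstarted job is not reported as COMPLETED.
    if expected_outputs:
        all_present = all((run_dir / name).exists() for name in expected_outputs)
        return JobStatus.COMPLETED if all_present else JobStatus.RUNNING

    # Step 6: nothing observed yet.
    return JobStatus.WAITING_INFO
```

Passing `now` explicitly keeps the heartbeat check deterministic, which makes the 60-second boundary easy to exercise without sleeping.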