# Job lifecycle

A PlayMolecule job moves through four states from submission to completion. This page describes the states, how they're detected, and the heartbeat mechanism the local backend uses to spot dead workers.

## The four states

{py:class}`~playmolecule.JobStatus` is an `IntEnum`:

```text
WAITING_INFO = 0   # Submitted, not yet running. No heartbeat seen.
RUNNING      = 1   # Container is alive and the heartbeat is fresh.
COMPLETED    = 2   # Exit code 0 (local) or backend reported success (HTTP).
ERROR        = 3   # Non-zero exit, missing outputs, or stale heartbeat.
```

`str(status)` returns a human-readable label via {py:meth}`~playmolecule.JobStatus.describe`. Because `JobStatus` is an `IntEnum`, its members compare equal to their integer values, so both `status == JobStatus.RUNNING` and `status == 1` work.
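The `IntEnum` behaviour can be tried without the package installed by using a stand-in. In this sketch, `JobStatusSketch` and the label strings produced by `describe()` are illustrative assumptions; only the member names and values are taken from the listing above, and the real labels may be worded differently.

```python
from enum import IntEnum


class JobStatusSketch(IntEnum):
    """Stand-in with the same member names and values as JobStatus."""

    WAITING_INFO = 0
    RUNNING = 1
    COMPLETED = 2
    ERROR = 3

    def describe(self) -> str:
        # Hypothetical label format; the real describe() may differ.
        return self.name.replace("_", " ").capitalize()

    def __str__(self) -> str:
        # Route str() through describe(), as the page says JobStatus does.
        return self.describe()


status = JobStatusSketch.RUNNING
# IntEnum members compare equal to other members and to plain integers.
same_member = status == JobStatusSketch.RUNNING
same_int = status == 1
```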

## State diagram

```{mermaid}
stateDiagram-v2
    [*] --> WAITING_INFO: ed.run()
    WAITING_INFO --> RUNNING: container starts, .pm.alive appears
    RUNNING --> COMPLETED: .pm.done written
    RUNNING --> ERROR: .pm.err written / heartbeat stale > 60s
    WAITING_INFO --> ERROR: backend reports failure before run starts
    COMPLETED --> [*]
    ERROR --> [*]
```

## Local backend: how each state is detected

The local execution backend uses three sentinel files inside `outdir/run_<id>/`:

| File         | Set by        | What it means                                                                |
|--------------|---------------|------------------------------------------------------------------------------|
| `.pm.alive`  | The container | Refreshed periodically with an ISO-format timestamp while the job is running. |
| `.pm.done`   | The container | Written on clean exit. Primary `COMPLETED` signal.                            |
| `.pm.err`    | The container | Written on a controlled failure (non-zero exit, exception).                   |

`ed.status` checks these in order:

1. `.pm.done` exists → **`COMPLETED`**.
2. `.pm.err` exists → **`ERROR`**.
3. `.pm.alive` exists:
   - timestamp within the last 60 seconds → **`RUNNING`**.
   - timestamp older than 60 seconds → **`ERROR`** (worker died without writing `.pm.err`).
4. *(SLURM-submitted job)* the SLURM queue state is consulted — see below.
5. Fallback: read the app's `expected_outputs.json`. If it lists files and they're all present on disk → **`COMPLETED`**; if some are missing → **`RUNNING`**.
6. None of the above → **`WAITING_INFO`**.

The 60-second timeout is hard-coded. A SLURM worker that crashes silently will leave `.pm.alive` stale; after 60 seconds the status flips to `ERROR` even without an explicit failure signal.

The fallback path (5) was tightened recently: an empty `expected_outputs.json` list now falls through to `WAITING_INFO` rather than reporting a false `COMPLETED` for a job that hadn't started.
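The check order for a non-SLURM local run can be sketched as a single function. This is an approximation under stated assumptions, not the library's code: `detect_status` and the stand-in `JobStatus` are hypothetical names, the expected outputs are passed in as an already-loaded list of paths relative to the run directory, and `.pm.alive` is assumed to contain a single ISO-format timestamp. Step 4 (the SLURM queue lookup) is left out.

```python
from datetime import datetime, timedelta
from enum import IntEnum
from pathlib import Path
from typing import Optional, Sequence


class JobStatus(IntEnum):  # stand-in mirroring the documented values
    WAITING_INFO = 0
    RUNNING = 1
    COMPLETED = 2
    ERROR = 3


HEARTBEAT_TIMEOUT = timedelta(seconds=60)


def detect_status(
    run_dir: Path,
    expected_outputs: Sequence[str] = (),
    now: Optional[datetime] = None,
) -> JobStatus:
    """Approximate the documented check order, without the SLURM step."""
    # Steps 1 and 2: explicit sentinels win over everything else.
    if (run_dir / ".pm.done").exists():
        return JobStatus.COMPLETED
    if (run_dir / ".pm.err").exists():
        return JobStatus.ERROR

    # Step 3: heartbeat freshness.
    alive = run_dir / ".pm.alive"
    if alive.exists():
        # Assumption: the file holds one ISO-format timestamp.
        beat = datetime.fromisoformat(alive.read_text().strip())
        # Match the timestamp's timezone awareness when reading the clock.
        current = now if now is not None else datetime.now(beat.tzinfo)
        fresh = current - beat <= HEARTBEAT_TIMEOUT
        return JobStatus.RUNNING if fresh else JobStatus.ERROR

    # Step 5: a non-empty expected-outputs list decides between
    # COMPLETED and RUNNING; an empty list falls through.
    if expected_outputs:
        all_present = all((run_dir / name).exists() for name in expected_outputs)
        return JobStatus.COMPLETED if all_present else JobStatus.RUNNING

    # Step 6: nothing to go on yet.
    return JobStatus.WAITING_INFO
```

Passing `now` explicitly keeps the function deterministic, which makes the 60-second boundary easy to exercise in a test.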

## SLURM: the same, plus the queue

When a job is submitted via `ed.run(queue="slurm", ...)`, two things drive the status:

- The same `.pm.alive` / `.pm.done` / `.pm.err` files (the worker still uses them).
- The SLURM queue state from `jobInfo()`, consulted only when none of the three files is present (step 4 of the check order above).

The SLURM-state-to-{py:class}`~playmolecule.JobStatus` mapping:

| SLURM state                          | PlayMolecule state |
|--------------------------------------|--------------------|
| `RUNNING`                            | `RUNNING`          |
| `COMPLETED`                          | `COMPLETED`        |
| `PENDING`, `None`                    | `WAITING_INFO`     |
| `FAILED`, `CANCELLED`, `OUT_OF_MEMORY`, `TIMEOUT` | `ERROR` |

The heartbeat catches the case where SLURM thinks the job is running but the container died silently on the worker (common with GPU driver issues).
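The mapping table translates directly into a lookup. In this sketch, `map_slurm_state` and the stand-in `JobStatus` are hypothetical names. The table does not say how states outside it are handled, so raising `ValueError` for them is a choice made here, not documented behaviour.

```python
from enum import IntEnum
from typing import Optional


class JobStatus(IntEnum):  # stand-in mirroring the documented values
    WAITING_INFO = 0
    RUNNING = 1
    COMPLETED = 2
    ERROR = 3


# One entry per SLURM state listed in the table above.
_SLURM_TO_STATUS = {
    "RUNNING": JobStatus.RUNNING,
    "COMPLETED": JobStatus.COMPLETED,
    "PENDING": JobStatus.WAITING_INFO,
    "FAILED": JobStatus.ERROR,
    "CANCELLED": JobStatus.ERROR,
    "OUT_OF_MEMORY": JobStatus.ERROR,
    "TIMEOUT": JobStatus.ERROR,
}


def map_slurm_state(state: Optional[str]) -> JobStatus:
    """Translate a SLURM queue state into a PlayMolecule status."""
    if state is None:
        # No queue information is treated like a pending job.
        return JobStatus.WAITING_INFO
    try:
        return _SLURM_TO_STATUS[state]
    except KeyError:
        # Assumption of this sketch: refuse to guess for unlisted states.
        raise ValueError(f"unmapped SLURM state: {state!r}") from None
```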

## HTTP backend: server-side truth

For HTTP-backend jobs there's no shared filesystem. `ed.status` does a single HTTP GET against the backend's status endpoint, keyed by a job ID derived from the `outdir` path at submission time. The four states are reported directly by the server.

If you move or rename `outdir` between submission and status queries, the derived job ID won't match what the server stored and you'll get a 404. Avoid this by leaving `outdir` at its original path for as long as you need to query the job.

## Polling guidance

- **Local interactive runs** — `ed.run()` is blocking. You don't poll; you wait.
- **SLURM jobs** — poll once every 30–60 seconds. Polling faster gains nothing and only adds load on the SLURM controller.
- **HTTP-backend jobs** — match the polling cadence to job length; once a minute is reasonable for jobs that take 10+ minutes.

Alternatively, set `PM_BLOCKING=1` and the app call itself waits until the job reaches a terminal state (`COMPLETED` or `ERROR`). This is useful in scripts where you would otherwise write the polling loop yourself.
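When a hand-written loop is needed, it might look like the following sketch. `wait_for_terminal` and the stand-in `JobStatus` are hypothetical names. The status source is passed in as a callable so the helper makes no assumption about how `ed.status` is exposed, and the `sleep` parameter exists only to make the loop testable.

```python
import time
from enum import IntEnum
from typing import Callable, Optional


class JobStatus(IntEnum):  # stand-in mirroring the documented values
    WAITING_INFO = 0
    RUNNING = 1
    COMPLETED = 2
    ERROR = 3


TERMINAL = {JobStatus.COMPLETED, JobStatus.ERROR}


def wait_for_terminal(
    get_status: Callable[[], JobStatus],
    interval: float = 30.0,
    timeout: Optional[float] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> JobStatus:
    """Poll get_status() until it reports COMPLETED or ERROR."""
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status.name} after {timeout} s")
        # Wait one polling interval before asking again.
        sleep(interval)
```

Assuming `ed` is an executable directory whose `status` can be read as an attribute, a call could look like `wait_for_terminal(lambda: ed.status, interval=60)`.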

## Failure modes worth knowing

- **Stale heartbeat** — worker died with no `.pm.err`. Cause: usually OOM-kill or hardware fault. Check SLURM accounting (`sacct`).
- **Missing expected outputs** — manifest's `expected_outputs` list and the actual run disagree. Either the app code regressed or the inputs were unusable.
- **`WAITING_INFO` forever** — the container never started. Check the container runtime: `docker pull` of the image, Apptainer's SIF cache, gcloud auth.
- **HTTP 404 on status** — `outdir` was moved/renamed after submission. There's no recovery; resubmit.

## See also

- {py:class}`~playmolecule.JobStatus`
- {py:class}`~playmolecule.ExecutableDirectory`
- [Check job status](../howto/check-job-status.md)
- [Architecture](architecture.md)